Anji Jain

# GAN Inversion: A brief walkthrough

Recent researches have leveraged Generative Adversarial Networks (GANs) for generating photo-realistic images and performing image manipulation. However, to perform such manipulation on real images requires its corresponding latent code. GAN inversion aims to do just that by inverting an image into the latent space to obtain its latent code. This blog aims to give the reader an overview of GAN inversion approaches, their extensions and applications, and future work. We also briefly explore the latent space navigation in GANs as well as comparisons with VAE-GANs. This blog is inspired mainly by the GAN inversion survey by Bau et al. __[1]__, while we also explore some of the newer extensions in the field.
1. Background
1.1 Generative Adversarial Networks
Generative adversarial networks (GANs) are deep learning-based generative models, referring to models aiming to generate new, synthetic data resembling the training data distribution. These differ from discriminative models intending to learn boundaries between classes in the data distribution. GANs were first proposed by Goodfellow et al. [__2__], and various models have been developed since then to synthesize high-quality image generation.
GANs comprise of two main components:

Generator (G): aims to learn the real data distribution to generate data closer to the distribution and fool its adversary, the discriminator.

Discriminator (D): aims to discriminate between real and generated images.

Goodfellow et al. [__2__] propose to train the generator and discriminator components together with opposing objective functions striving to defeat each other. The model converges when the generator successfully defeats the discriminator by generating data indistinguishable from real data. The objective function proposed is as follows:

Eq. 1: GAN objective function proposed by [__2__]

where 𝔁 represents the data and 𝒛 represents a random noise vector. As can be seen, the discriminator maximizes the objective function while learning to perform the distinction, while the generator minimizes it by generating realistic data. The network architecture can be seen in Fig. 1.

Fig. 1: Generative Adversarial Network Architecture

1.2 GAN latent space During the training procedure, randomly selected points from the distribution are passed through the generator to generate synthesized images. The discriminator learns to differentiate between the real training dataset images and the synthesized images, labelled as real and fake respectively, as a classification approach. The generator learns patterns in the distribution space, mapping a location in the space to specific characteristics in the generated image. This N-dimensional space with a GAN’s learned patterns is the GAN’s latent space, generally referred to as the Z space. The vectors in the latent space, including the random noise vectors mentioned earlier, are termed latent vectors. The latent space differs each time a GAN model is trained. While we did not find much reasoning on why nearby points in the GAN distribution space have similar characteristics, the understanding is that it is to do with the internal mechanisms of GANs. Recent researches have shown that well-trained GANs encode disentangled semantic information in their latent space. These researches do so by introducing spaces in addition to the Z latent vector space to better incorporate the style and semantic information within the space.

Fig. 2: The W space proposed by [__4__]

StyleGAN [__4__] introduced the W space in order to incorporate semantic information into the latent space. Fig. 2 illustrates this. For example, we have a dataset of human faces, which lacks images for some combination of features, e.g., long-haired males. In this case, (a) represents the distribution space for the combination of the features, e.g., masculinity and hair length. (b) represents the Z space, where the mapping becomes curved/entangled to ensure that the entire space incorporates valid feature combinations. This leads to these feature entanglement in the Z space. (c) represents the mapping of Z space to W space, introduced by Xia et al. [__4__] to obtain more disentangled features in the W space.
Other researches have proposed the W+ space [__3__], S space [__5__] for spatial disentanglement, and P space [__6__] for regularizing StyleGAN embeddings.
Many studies report disentanglement from a qualitative point of view. However, there still lacks a good evaluation metric for the perceptual quality of a generated image and its comparison with the expected outcome.
A well-trained GAN must be capable of generating realistic images as well as enabling image manipulation. It must also enable interpolation between the latent codes of two images to generate images with a smooth transition.
For instance, [__3__] defines the interpolation in the W+ space as follows:
where the step factor, λ, for interpolation is defined to be:

Eq. 2: Linear interpolation defined by __[3]__

w₁ and w₂ are latent vectors for the two images to interpolate between, and w is the interpolated latent vector.

Fig. 3 shows linear latent space interpolation performed by Image2StyleGAN [__3__], trained on face images, in their W+ latent space. It can be seen that the GAN performs interpolation well on face images while failing to interpolate for non-face images realistically. This is due to the non-face images lying outside of the training data distribution.

Fig. 3: Latent space interpolation results from [__3__]

It is also possible to apply specific changes to the image by modifying the latent codes in semantically meaningful directions. However, such manipulation requires a known latent vector for the corresponding image, hence is only directly applicable to images generated using the GAN. Performing such operations on real images requires their latent vector to be obtained, leading to research in the inversion of GANs.
This review aims to provide an overview of [__1__] and a detailed analysis of some GAN inversion researches. Hence, we won’t be going into a detailed overview of GANs. The remainder of this post focuses on GAN inversion and approaches. More information on GANs and their latent space can be found in [__2__].
2. GAN inversion and approaches
2.1 GAN inversion
GAN inversion aims to obtain the latent vector for any given image, such that when passed through the generator, it generates an image close to the real image. Obtaining the latent vector provides more flexibility to perform image manipulation on any image instead of being constrained to GAN-generated images obtained from random sampling and generation. Fig. 4 shows an illustration of GAN inversion.

Fig. 4: GAN inversion from [__1__]

The inversion problem can be defined as:

Eq. 3: Inversion problem objective function such that z refers to a latent vector, G refers to the generator, l refers to the distance metric in the image space such as l1, l2, perceptual loss, etc., applied at pixel-level.

2.2 GAN inversion approaches Previous researches have explored three main inversion techniques. The generator has been trained beforehand.

Optimization-based: optimizes the latent vector z to reconstruct an image xʳᵉᶜ close to the real image x. The objective function is defined as follows:

Eq. 4: Optimization-based GAN inversion

where θ refers to the trained parameters of G. Various approaches perform optimization using gradient descent. The optimization problem is highly non-convex and requires a good initialization, else risks being stuck in local minima. It is important to note that the optimization procedure is performed during inference and can be computationally expensive, requiring many passes through the generator each time a new image is to be mapped to its latent vector.

Fig. 5: Optimization-based GAN inversion from [__1__]

Learning-based: incorporates an encoder module, E, trained using multiple images X generated by G with their corresponding known latent vectors Z. The encoder module aims to generate a latent vector zn for the image xn , such that when zn is passed through G, it reconstructs an image closer to xn.

Eq. 5: Learning-based GAN inversion

The encoder architecture generally resembles the discriminator D architecture, only differing on the final layers. It generally performs better than a basic optimization approach and does not fall into a local minima. It is worth noting that the latent vector can be obtained directly by passing the image through the encoder during inference.

Fig. 6: Learning-based GAN inversion from [__1__]

Hybrid-based: incorporates both the above-mentioned approaches. Similar to learning-based approaches, they train the encoder using images generated by G. During inference, the real image x is passed through E to obtain z, which serves as an initialization for the latent vector for optimization to further reduce the distance between x and xʳᵉᶜ.

Fig. 7: Hybrid GAN inversion from [__1__]

3. Differences from VAE-GANs Variational Autoencoders (VAEs) are another variant of generative models, proposed by Kingma et al. [7]. VAEs aim to train an encoder-decoder architecture to reconstruct images while also using KL-divergence to ensure that the latent distribution is close to an expected Gaussian distribution, with a defined mean and standard deviation. Variational Autoencoders Generative Adversarial Networks (VAE-GANs) are a combination of VAEs and GANs, proposed by Larsen et al. [8]. Their architecture can be seen in Fig. 8. On incorporating GANs into the architecture, the VAE decoder serves as the generator, and a discriminator tries to differentiate between a generated and real image. While VAEs tend to produce blurry images, combining them with GANs helped them obtain comparable performance as the baseline GAN architectures on image generation tasks.

Fig. 8: VAE-GAN architecture by [__8__]

The learning-based GAN inversion architecture constitutes similar modules as VAE-GANs, and this might become a point of confusion. However, these architectures differ in their internal dynamics, their training approach, and their purpose itself.
Learning-based GAN inversion approaches aim to understand the latent space of an already trained GAN as well as obtain a corresponding latent code for an image by training the encoder independently. VAE-GAN, on the other hand, seeks to train the encoder together with the GAN modules to generate images.
4. Characteristics of GAN inversion
**Semantic Aware**
The GAN’s latent space encodes rich semantic information. Some GAN inversion approaches aim to obtain semantic-aware latent codes to improve image manipulation. Zhu et al. [__9__] proposed reconstructing images at both pixel- and semantic-level for better semantic awareness. Leng et al. __[15]__ proposed improving this further by incorporating the semantic information in both the real image and latent space domains.
**Layer-wise**
With larger GAN generator architectures, it is not always feasible to perform full GAN inversion. Some approaches propose to decompose the generator into layers and perform layer-wise inversion instead. Bau et al. __[13]__ proposed using layer-wise inversion to understand what GANs learn, Bau et al. __[14]__ proposed to visualize what GANs don’t learn, while Gu et al. [__10__] used this idea to perform GAN inversion with multiple latent codes.
**Out of distribution**
Some approaches support out-of-distribution generalization for inverting images out of the training data distribution. Gu et al. [__10__] aims to support out-of-distribution GAN inversion by optimizing for multiple latent codes.
We will be exploring some of these researches in more detail in the following parts of this series. But for now, let’s explore latent space navigation together with GAN inversion.
5. Latent space navigation
The survey on GAN inversion, [__1__], mentions that well-trained GANs have the ability to encode disentangled semantics (decoupling entangled semantics, e.g., older people tend to wear glasses more than younger people) in their latent space. With more disentangled semantic information encoded, it would be possible to find disentangled directions in the latent space referring to age and glasses. These are very useful for image manipulation tasks.
Interpretable directions
Many GAN inversion approaches aim to determine interpretable directions in the latent space. Some learning-based inversion approaches incorporate a pre-trained classifier to identify boundaries in the latent space for different classes based on synthesized images. Such supervised approaches could be restrictive. Hence, recently, an unsupervised approach has been explored, focusing on identifying interpretable directions without pre-labelled data (e.g., using PCA to identify important directions or closed-form approaches).

Fig. 9: Latent space

Numerous researches enforce disentanglement to obtain better interpretable directions in the latent space for attributes (e.g., gender, age, hair color in face images). Jahanian et al. __[12]__ suggested an approach to steer in interpretable directions, linear and non-linear, in the GAN latent space for meaningful geometric transformation (e.g., zoom, shift, color manipulation) without enforcing disentanglement during GAN training. They observed that these interpretable directions have similar effects across all object classes generated by the GAN.
Nurit et al. __[11]__ noticed that the output of the first layer of the generator is a coarse representation of the generated image. Hence, applying a meaningful geometric transformation on the first layer output is similar to applying it to the original image itself. They show that containing this problem to a single layer can help obtain interpretable directions in a closed-form, referring to computing interpretable directions without any training or optimization using just the generator’s weights. They investigate the unsupervised exploration of the transformations in the latent space using PCA. However, it still remains an open problem for extracting attributes such as gender and age in the latent space, requiring training or optimization.
**Non-interference**
When moving in a particular semantic direction in the latent space of GANs, due to entanglement between different semantics in the space, it is challenging to perform manipulation without interference. Some approaches aim to incorporate non-interference by finding orthogonal directions through projection in the latent space or incorporating a semantic space into their architecture to obtain more linearly separable attributes. These are very useful for performing multi-attribute image manipulation without interference.
Shen et al. __[18]__ performs conditional manipulation in the sub-space to find orthogonal semantic directions to support non-interference. Conditional manipulation refers to obtaining directions in the latent space to support decoupling between coupled attributes. Their approach can be seen in Fig. 10. With hyper-planes defining separation boundaries for semantic attributes, n₁ and n₂ represent the unit normal vectors from two hyper-planes, and n₁ — (n₁ᵀn₂)n₂ represents the new orthogonal semantic direction.

Fig. 10: Conditional manipulation in subspace for non-interference by __[18]__

The results from their conditional manipulation can be seen in Fig. 11.

Fig. 11: Results from conditional manipulation proposed by __[18]__

Voynov et al. __[16]__ and Ramesh et al. __[17]__ use Jacobian decomposition to identify more disentangled directions within the latent space.
6. Extensions to GAN inversion approaches
1.1 In-domain GAN inversion for Real Image Editing
Previous approaches attempted to reconstruct input images at a pixel level to obtain their corresponding latent vector. They, however, fail to obtain semantically meaningful latent codes within the original latent space. This means that the obtained latent code may not lie in the original latent space of the GAN or wouldn’t hold the semantic meaning encoded in the GAN latent space. Hence, making it difficult to perform semantic editing on the reconstructed image. For example, performing semantic editing on the latent space of Image2StyleGAN __[3]__ resulted in unrealistic images for age, expression, and eyeglasses as can be seen in the image below.

Fig. 12: Facial image manipulation on Image2StyleGAN __[3]__

In __[9]__, the authors proposed an in-domain GAN inversion approach to obtain a latent code with the ability to reconstruct an image both at a pixel and a semantic level; suggesting that the reconstructed image not only matches at a pixel level but its corresponding latent vector lies within the latent space of the GAN holding enough semantic meaning to perform image manipulation. Their approach comprises of two phases:

Firstly, they train a domain-guided encoder on real-image datasets instead of GAN-generated images. While reconstructing input images at the pixel level, they also regularize the training by incorporating features extracted from VGG on the images. The encoder also competes with the discriminator to ensure the obtained code generates realistic images.

Fig. 13: Domain-guided encoder objective function proposed by __[9]__

Eq. 6: Domain-guided encoder objective function proposed by __[9]__

They then perform domain-regularized optimization to improve the initial latent vector obtained from the encoder by further regularizing the training with features obtained from VGG and the latent code obtained from the encoder.

Fig. 14: Domain-regularized optimization proposed by __[9]__

Eq. 7: Domain-regularized optimization objective function proposed by __[9]__

Zhu et al. [9] performed semantic analysis on the latent space. Their approach was able to encode more semantically meaningful information in the latent space as compared to the state-of-the-art model GAN inversion approach, Image2StyleGAN [3], on various evaluation metrics, including Frchet Inception Distance (FID) [20] and Sliced Wasserstein Discrepancy (SWD) [19]. Their approach was also able to achieve leading performance on image reconstruction quality metrics while being ~35x faster during optimization due to better initialization. Their image reconstruction results can be seen in Fig. 15 below.