GAN Inversion: A brief walkthrough



Recent research has leveraged Generative Adversarial Networks (GANs) to generate photo-realistic images and perform image manipulation. However, manipulating a real image in this way requires its corresponding latent code. GAN inversion aims to do just that by inverting an image into the latent space to obtain its latent code. This blog aims to give the reader an overview of GAN inversion approaches, their extensions and applications, and future work. We also briefly explore latent space navigation in GANs as well as comparisons with VAE-GANs. This blog is inspired mainly by the GAN inversion survey by Xia et al. [1], while we also explore some of the newer extensions in the field.

1. Background

1.1 Generative Adversarial Networks

Generative adversarial networks (GANs) are deep learning-based generative models, that is, models that aim to generate new, synthetic data resembling the training data distribution. These differ from discriminative models, which learn boundaries between classes in the data distribution. GANs were first proposed by Goodfellow et al. [2], and various models have been developed since then to synthesize high-quality images. GANs comprise two main components:

  • Generator (G): aims to learn the real data distribution to generate data closer to the distribution and fool its adversary, the discriminator.

  • Discriminator (D): aims to discriminate between real and generated images.

Goodfellow et al. [2] propose to train the generator and discriminator components together with opposing objective functions striving to defeat each other. The model converges when the generator successfully defeats the discriminator by generating data indistinguishable from real data. The objective function proposed is as follows:

Eq. 1: GAN objective function proposed by [2]


where x represents the data and z represents a random noise vector. The discriminator maximizes the objective function as it learns to tell real data from generated data, while the generator minimizes it by generating realistic data. The network architecture can be seen in Fig. 1.
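For reference, the objective described here is usually written in the following minimax form. This is the standard formulation from [2]; p_data and p_z denote the data distribution and the noise prior.

```latex
\min_{G} \max_{D} V(D, G)
  = \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_{z}(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

The first term rewards D for assigning high probability to real data, and the second rewards D for rejecting generated samples G(z), which is exactly the term G tries to drive down.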


Fig. 1: Generative Adversarial Network Architecture


1.2 GAN latent space

During training, points randomly sampled from the noise distribution are passed through the generator to produce synthesized images. The discriminator learns to differentiate between real images from the training dataset and synthesized images, labelled as real and fake respectively, as a classification task. The generator learns patterns in the distribution space, mapping a location in the space to specific characteristics in the generated image. This N-dimensional space with a GAN’s learned patterns is the GAN’s latent space, generally referred to as the Z space. The vectors in the latent space, including the random noise vectors mentioned earlier, are termed latent vectors. The latent space differs each time a GAN model is trained.

We did not find much explanation of why nearby points in the latent space share similar characteristics; the common understanding is that it stems from the internal mechanisms of GANs. Recent work has shown that well-trained GANs encode disentangled semantic information in their latent space. These works do so by introducing spaces in addition to the Z latent vector space to better incorporate style and semantic information.



Fig. 2: The W space proposed by [4]


StyleGAN [4] introduced the W space in order to incorporate semantic information into the latent space. Fig. 2 illustrates this. Suppose we have a dataset of human faces that lacks images for some combination of features, e.g., long-haired males. In this case, (a) represents the distribution space for the combination of features, e.g., masculinity and hair length. (b) represents the Z space, where the mapping becomes curved/entangled to ensure that the entire space covers only valid feature combinations. This leads to feature entanglement in the Z space. (c) represents the mapping of the Z space to the W space, introduced by Karras et al. [4] to obtain more disentangled features in the W space. Other works have proposed the W+ space [3], the S space [5] for spatial disentanglement, and the P space [6] for regularizing StyleGAN embeddings.

Many studies report disentanglement from a qualitative point of view. However, the field still lacks a good evaluation metric for the perceptual quality of a generated image and its comparison with the expected outcome. A well-trained GAN must be capable of generating realistic images as well as enabling image manipulation. It must also enable interpolation between the latent codes of two images to generate images with a smooth transition. For instance, [3] defines linear interpolation in the W+ space as w = λw₁ + (1 − λ)w₂, where the step factor λ lies in (0, 1):

Eq. 2: Linear interpolation defined by [3]

w₁ and w₂ are latent vectors for the two images to interpolate between, and w is the interpolated latent vector.
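The interpolation can be sketched in a few lines of NumPy. This is a minimal illustration, assuming the linear form w = λw₁ + (1 − λ)w₂. The function name `interpolate_latents`, the random codes, and the 18 x 512 shape (a StyleGAN-like W+ layout) are assumptions made for this example, and the endpoints λ = 0 and λ = 1 are included for convenience.

```python
import numpy as np

def interpolate_latents(w1, w2, num_steps=5):
    """Return latent codes on the straight line between w1 and w2.

    Each code is lam * w1 + (1 - lam) * w2 for evenly spaced lam in [0, 1].
    """
    lams = np.linspace(0.0, 1.0, num_steps)
    return np.stack([lam * w1 + (1.0 - lam) * w2 for lam in lams])

# Hypothetical W+ codes: 18 style layers x 512 dimensions (assumed shape).
rng = np.random.default_rng(0)
w1 = rng.standard_normal((18, 512))
w2 = rng.standard_normal((18, 512))

path = interpolate_latents(w1, w2, num_steps=5)
print(path.shape)
```

Passing each code in `path` through the generator would produce the sequence of intermediate images; with λ = 0 the code equals w₂ and with λ = 1 it equals w₁.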

Fig. 3 shows linear latent space interpolation performed by Image2StyleGAN [3], trained on face images, in their W+ latent space. It can be seen that the GAN performs interpolation well on face images while failing to interpolate for non-face images realistically. This is due to the non-face images lying outside of the training data distribution.

Fig. 3: Latent space interpolation results from [3]


It is also possible to apply specific changes to the image by modifying the latent codes in semantically meaningful directions. However, such manipulation requires a known latent vector for the corresponding image, and hence is only directly applicable to images generated by the GAN. Performing such operations on real images requires their latent vectors to be obtained, which has led to research on the inversion of GANs.

This review aims to provide an overview of [1] and a detailed analysis of some GAN inversion works. Hence, we won’t be going into a detailed overview of GANs. The remainder of this post focuses on GAN inversion and its approaches. More information on GANs and their latent space can be found in [2].

2. GAN inversion and approaches

2.1 GAN inversion

GAN inversion aims to obtain the latent vector for any given image, such that when passed through the generator, it generates an image close to the real image. Obtaining the latent vector provides more flexibility to perform image manipulation on any image instead of being constrained to GAN-generated images obtained from random sampling and generation. Fig. 4 shows an illustration of GAN inversion.

Fig. 4: GAN inversion from [1]


The inversion problem can be defined as:

Eq. 3: Inversion problem objective function

where z refers to a latent vector, G refers to the generator, and l refers to a distance metric in the image space (such as l1, l2, or perceptual loss) applied at the pixel level.
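Written out with these symbols, the inversion problem takes roughly the following form. This is a sketch reconstructed from the definitions in this section; x denotes the real image and z* the recovered latent vector, and the exact notation in [1] may differ.

```latex
z^{*} = \arg\min_{z} \; \ell\big(G(z),\, x\big)
```

Here ℓ is the distance metric l from the text. The approaches in the next section differ mainly in how they search for z*.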


2.2 GAN inversion approaches

Previous research has explored three main inversion techniques, all of which assume a generator that has already been trained.

  • Optimization-based: optimizes the latent vector z to reconstruct an image xʳᵉᶜ close to the real image x. The objective function is defined as follows:

Eq. 4: Optimization-based GAN inversion


where θ refers to the trained parameters of G. Various approaches perform the optimization using gradient descent. The optimization problem is highly non-convex and requires a good initialization; otherwise it risks getting stuck in a local minimum. It is important to note that the optimization is performed during inference and can be computationally expensive, requiring many passes through the generator each time a new image is to be mapped to its latent vector.

Fig. 5: Optimization-based GAN inversion from [1]


  • Learning-based: incorporates an encoder module, E, trained using multiple images X generated by G with their corresponding known latent vectors Z. The encoder aims to generate a latent vector zₙ for the image xₙ, such that when zₙ is passed through G, it reconstructs an image close to xₙ.

Eq. 5: Learning-based GAN inversion


The encoder architecture generally resembles the discriminator D architecture, differing only in the final layers. It generally performs better than a basic optimization approach and does not get stuck in local minima. It is worth noting that, during inference, the latent vector is obtained directly with a single pass of the image through the encoder.

Fig. 6: Learning-based GAN inversion from [1]


  • Hybrid-based: incorporates both the above-mentioned approaches. Similar to learning-based approaches, they train the encoder using images generated by G. During inference, the real image x is passed through E to obtain z, which serves as an initialization for the latent vector for optimization to further reduce the distance between x and xʳᵉᶜ.

Fig. 7: Hybrid GAN inversion from [1]
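The three approaches above can be compared on a toy problem. The sketch below is only an illustration under strong assumptions: the "generator" is a fixed random tanh layer rather than a real GAN, the encoder is a linear least-squares fit, and all names (`generator`, `optimize_latent`, `train_encoder`) are hypothetical. It shows the mechanics: gradient descent on z, an encoder fitted on (G(z), z) pairs, and the encoder output used as the starting point for optimization.

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM, IMAGE_DIM = 4, 32

# Toy stand-in for a pre-trained, frozen generator G(z; theta).
A = 0.3 * rng.standard_normal((IMAGE_DIM, LATENT_DIM))

def generator(z):
    return np.tanh(A @ z)

def recon_loss(z, x):
    """Pixel-level l2 distance between G(z) and the target image x."""
    return 0.5 * np.sum((generator(z) - x) ** 2)

def optimize_latent(x, z_init, steps=500, lr=0.05):
    """Optimization-based inversion: gradient descent on z with G frozen."""
    z = z_init.copy()
    for _ in range(steps):
        g = generator(z)
        grad = A.T @ ((g - x) * (1.0 - g ** 2))  # chain rule through tanh
        z -= lr * grad
    return z

def train_encoder(num_samples=2000):
    """Learning-based inversion: fit E on (G(z_n), z_n) pairs sampled from G."""
    Z = rng.standard_normal((num_samples, LATENT_DIM))
    X = np.tanh(Z @ A.T)
    M, *_ = np.linalg.lstsq(X, Z, rcond=None)  # linear least-squares encoder
    return lambda x: x @ M

# Target image to invert (generated from a hidden latent for simplicity).
x = generator(rng.standard_normal(LATENT_DIM))

z_random = rng.standard_normal(LATENT_DIM)
encoder = train_encoder()

loss_init = recon_loss(z_random, x)
loss_opt = recon_loss(optimize_latent(x, z_random), x)   # optimization-based
z_enc = encoder(x)                                       # learning-based
loss_enc = recon_loss(z_enc, x)
loss_hybrid = recon_loss(optimize_latent(x, z_enc), x)   # hybrid

print(f"random init:        {loss_init:.4f}")
print(f"optimization-based: {loss_opt:.4f}")
print(f"learning-based:     {loss_enc:.4f}")
print(f"hybrid:             {loss_hybrid:.4f}")
```

Note the cost profile this makes visible: the encoder needs a single forward pass per image at inference, whereas the optimization-based and hybrid variants call the generator once per gradient step.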


3. Differences from VAE-GANs

Variational Autoencoders (VAEs) are another family of generative models, proposed by Kingma et al. [7]. VAEs train an encoder-decoder architecture to reconstruct images while also using a KL-divergence term to keep the latent distribution close to a target Gaussian distribution with a defined mean and standard deviation. Variational Autoencoder Generative Adversarial Networks (VAE-GANs) are a combination of VAEs and GANs, proposed by Larsen et al. [8]. Their architecture can be seen in Fig. 8. When GANs are incorporated into the architecture, the VAE decoder serves as the generator, and a discriminator tries to differentiate between generated and real images. While VAEs tend to produce blurry images, combining them with GANs helped them achieve performance comparable to baseline GAN architectures on image generation tasks.

Fig. 8: VAE-GAN architecture by [8]


The learning-based GAN inversion architecture consists of modules similar to those in VAE-GANs, which can become a point of confusion. However, these architectures differ in their internal dynamics, their training approach, and their purpose. Learning-based GAN inversion approaches aim to understand the latent space of an already trained GAN and to obtain the latent code corresponding to an image by training the encoder independently. VAE-GANs, on the other hand, train the encoder together with the GAN modules to generate images.

4. Characteristics of GAN inversion

Semantic Aware

The GAN’s latent space encodes rich semantic information. Some GAN inversion approaches aim to obtain semantic-aware latent codes to improve image manipulation. Zhu et al. [9] proposed reconstructing images at both the pixel and semantic levels for better semantic awareness. Leng et al. [15] proposed improving this further by incorporating semantic information in both the real image and latent space domains.

Layer-wise

With larger GAN generator architectures, it is not always feasible to perform full GAN inversion. Some approaches propose to decompose the generator into layers and perform layer-wise inversion instead. Bau et al. [13] proposed using layer-wise inversion to understand what GANs learn, Bau et al. [14] proposed to visualize what GANs don’t learn, while Gu et al. [10] used this idea to perform GAN inversion with multiple latent codes.

Out of distribution

Some approaches support out-of-distribution generalization, that is, inverting images that lie outside the training data distribution. Gu et al. [10] aim to support out-of-distribution GAN inversion by optimizing for multiple latent codes.

We will be exploring some of these works in more detail in the following parts of this series. But for now, let’s explore latent space navigation together with GAN inversion.

5. Latent space navigation

The survey on GAN inversion [1] mentions that well-trained GANs can encode disentangled semantics in their latent space, decoupling attributes that are entangled in the data (e.g., older people tend to wear glasses more often than younger people). With more disentangled semantic information encoded, it becomes possible to find separate directions in the latent space for age and for glasses. These are very useful for image manipulation tasks.

Interpretable directions

Many GAN inversion approaches aim to determine interpretable directions in the latent space. Some learning-based inversion approaches incorporate a pre-trained classifier to identify boundaries in the latent space for different classes based on synthesized images. Such supervised approaches can be restrictive. Hence, unsupervised approaches have recently been explored, focusing on identifying interpretable directions without pre-labelled data (e.g., using PCA to identify important directions, or closed-form approaches).

Fig. 9: Latent space
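As a rough illustration of the PCA idea mentioned above, the sketch below samples latent codes, maps them through a stand-in mapping network, and takes the principal components of the mapped codes as candidate edit directions. The mapping network here is a single random leaky-ReLU layer and is purely hypothetical; with a real GAN one would use the trained network, and the meaning of each component would have to be inspected visually.

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM = 16

# Hypothetical stand-in for a mapping network from Z space to W space.
M = rng.standard_normal((LATENT_DIM, LATENT_DIM)) / np.sqrt(LATENT_DIM)

def mapping(z):
    h = z @ M
    return np.maximum(0.2 * h, h)  # leaky ReLU with slope 0.2

# 1. Sample many latent codes and map them to W space.
W = mapping(rng.standard_normal((5000, LATENT_DIM)))

# 2. PCA via SVD of the centered codes; rows of `components` are directions.
W_centered = W - W.mean(axis=0)
_, singular_values, components = np.linalg.svd(W_centered, full_matrices=False)
explained = singular_values ** 2 / np.sum(singular_values ** 2)

# 3. Move a code along a candidate direction.
def edit(w, direction, strength):
    return w + strength * direction

w_edited = edit(W[0], components[0], strength=2.0)
print(np.round(explained[:3], 3))
```

No labels or classifiers are involved: the directions come only from the statistics of sampled codes, which is what makes the approach unsupervised.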


Numerous works enforce disentanglement to obtain more interpretable directions in the latent space for attributes (e.g., gender, age, hair color in face images). Jahanian et al. [12] suggested an approach to steer along interpretable directions, both linear and non-linear, in the GAN latent space for meaningful transformations (e.g., zoom, shift, color manipulation) without enforcing disentanglement during GAN training. They observed that these interpretable directions have similar effects across all object classes generated by the GAN. Spingarn-Eliezer et al. [11] noticed that the output of the first layer of the generator is a coarse representation of the generated image. Hence, applying a meaningful geometric transformation to the first layer output is similar to applying it to the original image itself. They show that constraining the problem to a single layer makes it possible to obtain interpretable directions in closed form, that is, to compute them from the generator’s weights alone without any training or optimization. They also investigate unsupervised exploration of transformations in the latent space using PCA. However, extracting directions for attributes such as gender and age remains an open problem in this setting and still requires training or optimization.

Non-interference

When moving in a particular semantic direction in the latent space of a GAN, entanglement between different semantics makes it challenging to perform manipulation without interference. Some approaches aim for non-interference by finding orthogonal directions through projection in the latent space, or by incorporating a semantic space into their architecture to obtain more linearly separable attributes. These are very useful for performing multi-attribute image manipulation without interference. Shen et al. [18] perform conditional manipulation in a subspace to find orthogonal semantic directions that support non-interference.

Conditional manipulation refers to obtaining directions in the latent space that decouple coupled attributes. Their approach can be seen in Fig. 10. With hyper-planes defining separation boundaries for semantic attributes, n₁ and n₂ represent the unit normal vectors of two hyper-planes, and n₁ − (n₁ᵀn₂)n₂ represents the new direction, which is orthogonal to n₂.

Fig. 10: Conditional manipulation in subspace for non-interference by [18]
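The projection n₁ − (n₁ᵀn₂)n₂ is easy to check numerically. In the sketch below, the two "attribute normals" are random vectors made deliberately correlated; the names `n_age` and `n_glasses` and the 512-dimensional latent size are assumptions for illustration, not values taken from [18].

```python
import numpy as np

def unit(v):
    return v / np.linalg.norm(v)

def conditional_direction(n1, n2):
    """Project n1 onto the subspace orthogonal to n2: n1 - (n1^T n2) n2."""
    n1, n2 = unit(n1), unit(n2)
    return n1 - (n1 @ n2) * n2

rng = np.random.default_rng(0)
n_age = rng.standard_normal(512)
n_glasses = 0.6 * n_age + rng.standard_normal(512)  # deliberately correlated

direction = conditional_direction(n_age, n_glasses)

# Moving along `direction` leaves the component along n_glasses unchanged.
z = rng.standard_normal(512)
z_edited = z + 3.0 * direction
overlap_before = unit(n_age) @ unit(n_glasses)
overlap_after = direction @ unit(n_glasses)
print(f"overlap before: {overlap_before:.3f}, after: {overlap_after:.2e}")
```

Because the new direction has no component along n₂, stepping along it changes the first attribute while keeping the signed distance to the second attribute's hyper-plane fixed.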

The results from their conditional manipulation can be seen in Fig. 11.

Fig. 11: Results from conditional manipulation proposed by [18]


Voynov et al. [16] and Ramesh et al. [17] use Jacobian decomposition to identify more disentangled directions within the latent space.

6. Extensions to GAN inversion approaches

6.1 In-domain GAN inversion for Real Image Editing

Previous approaches attempted to reconstruct input images at the pixel level to obtain their corresponding latent vectors. They, however, fail to obtain semantically meaningful latent codes within the original latent space: the obtained latent code may not lie in the original latent space of the GAN, or may not carry the semantic meaning encoded there. This makes it difficult to perform semantic editing on the reconstructed image. For example, performing semantic editing in the latent space of Image2StyleGAN [3] resulted in unrealistic images for age, expression, and eyeglasses, as can be seen in Fig. 12.

Fig. 12: Facial image manipulation on Image2StyleGAN [3]


In [9], the authors proposed an in-domain GAN inversion approach to obtain a latent code that reconstructs an image at both the pixel and the semantic level. That is, the reconstructed image should match the input at the pixel level, and its latent vector should lie within the latent space of the GAN, holding enough semantic meaning to support image manipulation. Their approach comprises two phases:

  • Firstly, they train a domain-guided encoder on real-image datasets instead of GAN-generated images. While reconstructing input images at the pixel level, they also regularize the training by incorporating features extracted from VGG on the images. The encoder also competes with the discriminator to ensure the obtained code generates realistic images.

Fig. 13: Domain-guided encoder objective function proposed by [9]

Eq. 6: Domain-guided encoder objective function proposed by [9]


  • They then perform domain-regularized optimization to improve the initial latent vector obtained from the encoder by further regularizing the training with features obtained from VGG and the latent code obtained from the encoder.

Fig. 14: Domain-regularized optimization proposed by [9]

Eq. 7: Domain-regularized optimization objective function proposed by [9]
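Based on the description of the two phases, the objectives can be sketched as follows. This is a reconstruction for orientation only, not a verbatim copy of [9]: F denotes the VGG feature extractor, and the weights λ_vgg, λ_adv and λ_dom are notation assumed here for the perceptual, adversarial and domain-regularization terms.

```latex
% Phase 1: domain-guided encoder (trained on real images, G fixed)
\min_{\Theta_E} \mathcal{L}_E
  = \big\| x - G(E(x)) \big\|_2
  + \lambda_{vgg} \big\| F(x) - F(G(E(x))) \big\|_2
  - \lambda_{adv}\, \mathbb{E}_{x}\big[ D(G(E(x))) \big]

% Phase 2: domain-regularized optimization (initialized at z = E(x))
z^{inv} = \arg\min_{z}
    \big\| x - G(z) \big\|_2
  + \lambda_{vgg} \big\| F(x) - F(G(z)) \big\|_2
  + \lambda_{dom} \big\| z - E(G(z)) \big\|_2
```

The last term of the second objective is what keeps the optimized code close to what the encoder would predict for the reconstruction, which is how the encoder acts as a regularizer during optimization.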


Zhu et al. [9] performed semantic analysis on the latent space. Their approach was able to encode more semantically meaningful information in the latent space than the state-of-the-art GAN inversion approach, Image2StyleGAN [3], on various evaluation metrics, including Fréchet Inception Distance (FID) [20] and Sliced Wasserstein Distance (SWD) [19]. Their approach was also able to achieve leading performance on image reconstruction quality metrics while being ~35x faster during optimization due to better initialization. Their image reconstruction results can be seen in Fig. 15 below.

Fig. 15: Image reconstruction results by [9]. The reconstructions from the domain-guided encoder (b) do not capture all attributes of the input face image (a), at times including identity. With in-domain inversion (c), it is possible to achieve a better reconstruction.


The results for facial image inversion and manipulation can be seen in Fig. 16. Inversion using Image2StyleGAN gives slightly better reconstructions than the in-domain approach, both visually and in the MSE reported by [9]. However, navigating along interpretable directions works much better with in-domain inversion. Our understanding is that reconstructing images purely at the pixel level might result in an out-of-domain latent vector, which then fails to generate realistic images when moved along interpretable directions.


Fig. 16: Facial image manipulation using In-domain inversion [9].


6.2 Force-in-domain GAN inversion

While in-domain GAN inversion incorporates semantic information in the real image space, analysis by [15] shows that its inverted latent vectors do not overlap with the latent space distribution of the GAN. Additionally, the authors of [9] found some irregular image reconstruction results with their approach and later proposed in-domain inversion with batch normalization to fix the issue. However, that still did not fix the initial problem of deviation from the latent space distribution of the GAN. To visualize the latent code distributions, the authors of [15] perform dimensionality reduction to obtain a 2-dimensional vector for each latent code. Fig. 17 illustrates the resulting visualization.

Fig. 17: A visualization of the latent space distribution for real and in-domain latent codes


To tackle the aforementioned problem, [15] propose force-in-domain GAN inversion to incorporate semantic information in both the real image and latent code domains. They do so by including an additional discriminator that forces the latent code to stay within the latent space domain. Their proposed approach can be seen in Fig. 18. They adopt the generator G and FC block from a pre-trained StyleGAN [4] model, while the encoder E and the discriminators D and Dʷ are trainable modules during inversion. The FC block maps from the Z space to the W space, providing a more disentangled latent space. During training, E aims to reconstruct the image at both the pixel and semantic levels, fooling the discriminator D, which tries to differentiate between real and reconstructed images. Dʷ tries to distinguish the latent codes generated by E from those sampled in the Z space and mapped to the W space, forcing E to generate latent codes that fool Dʷ and overlap with the latent space domain.

Fig. 18: Force-in-domain GAN inversion architecture proposed by [15]


The visualization of the latent space for force-in-domain GAN inversion can be seen to overlap with the original distribution in Fig. 19.

Fig. 19: A visualization of the latent space distribution for real and force-in-domain latent codes


They also report lower MSE and FID scores compared to [9], indicating better reconstruction and closer distributions. They also perform linear interpolation on inverted codes of the two input images to generate a smooth transition between the two. Fig. 20 illustrates their results from inversion and interpolation.

Fig. 20: Image inversion and interpolation results reported by [15]


6.3 Image processing using Multi-code GAN prior

Previous approaches perform optimization or train an encoder to invert an image, but the reconstruction is far from ideal. Given the highly non-convex nature of the optimization problem, Gu et al. [10] suggest that a single latent code does not suffice to fully recover every detail of an image. They also point to the finite dimensionality and limited expressiveness of a single latent code, and suggest over-parameterizing the latent space to improve reconstruction, with the expectation that each latent code will reconstruct certain sub-regions of the image. They propose mGANprior, also called multi-code GAN prior, which performs GAN inversion by incorporating multiple (N) latent codes.

Linearly blending images generated by passing multiple latent codes through the generator G will not ensure a meaningful reconstructed image due to non-linearity in the image space. It was also shown by Bau et al. [13] that different units (channels) in the intermediate layers of the generator are responsible for different semantics in the image. Please refer to the GAN dissection section below for more details. Hence, [10] suggest extracting feature maps from an intermediate generator layer and introducing adaptive channel importance parameters. Similarly to [13], this is done by incorporating trainable parameters that determine the importance of each channel in the feature map. Fig. 21 shows their proposed approach.

Fig. 21: mGANprior proposed approach by [10]
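The feature composition step can be sketched as below. This is a toy illustration of the idea only: the generator is split into two hypothetical random layers `g1` and `g2`, the shapes are tiny, and the channel weights are set by hand, whereas in [10] the latent codes and channel importances are optimized against the target image.

```python
import numpy as np

rng = np.random.default_rng(0)
N, LATENT_DIM, C, H, W = 3, 8, 4, 2, 2

# Hypothetical split of a frozen generator at an intermediate layer.
A1 = rng.standard_normal((C * H * W, LATENT_DIM))
A2 = rng.standard_normal((3 * H * W, C * H * W))

def g1(z):
    """First part: latent code -> intermediate feature map of shape (C, H, W)."""
    return np.maximum(A1 @ z, 0.0).reshape(C, H, W)

def g2(features):
    """Second part: intermediate feature map -> image of shape (3, H, W)."""
    return np.tanh(A2 @ features.reshape(-1)).reshape(3, H, W)

def compose(latents, alphas):
    """Weight each code's feature map per channel, sum, then finish generation."""
    feats = np.stack([g1(z) for z in latents])               # (N, C, H, W)
    blended = np.sum(alphas[:, :, None, None] * feats, axis=0)
    return g2(blended)

latents = rng.standard_normal((N, LATENT_DIM))
alphas = np.full((N, C), 1.0 / N)  # channel importance, one vector per code
image = compose(latents, alphas)
print(image.shape)

# Sanity check: putting all weight on code 0 recovers the single-code output.
one_hot = np.zeros((N, C))
one_hot[0] = 1.0
single = compose(latents, one_hot)
```

Blending happens in feature space rather than image space, so each code can contribute through different channels before the remaining layers turn the combined features into one image.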


GAN inversion proves to be easier in the intermediate feature space than in the latent space. They formulate the optimization as:

Eq. 8: Optimization objective for [10]

where

Eq. 9: Loss definition


and φ represents the perceptual feature extractor. Fig. 22 shows the effect of the number of latent codes and selection of intermediate layer on the GAN inversion task.

Fig. 22: Effects of number of latent codes and intermediate layer selection on GAN inversion
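For orientation, the objective and loss referred to as Eq. 8 and Eq. 9 can be sketched as follows. This is reconstructed from the description in this section and should be checked against [10]: G₁ and G₂ denote the generator split at the chosen intermediate layer, αₙ the channel importance vector of the n-th code, ⊙ channel-wise multiplication, and φ the perceptual feature extractor.

```latex
x^{inv} = G_2\Big( \sum_{n=1}^{N} G_1(z_n) \odot \alpha_n \Big)

\{z_n^{*}\}_{n=1}^{N},\ \{\alpha_n^{*}\}_{n=1}^{N}
  = \arg\min_{\{z_n\},\,\{\alpha_n\}} \ \mathcal{L}\big(x^{inv},\, x\big)

\mathcal{L}(x_1, x_2)
  = \big\| x_1 - x_2 \big\|_2^2
  + \big\| \phi(x_1) - \phi(x_2) \big\|_1
```

Both the N latent codes and their channel weights are free variables, which is the over-parameterization the approach relies on.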


They experiment on various image processing applications, including colourization, super-resolution, image in-painting, etc. Fig. 23 illustrates their results.

Fig. 23: Results of mGANprior on various image processing applications


It is also notable that their model can generalize to other datasets. For example, they optimized the inversion parameters on a face dataset but could perform inversion and reconstruct bedroom images successfully.

Fig. 24: Results for bedroom image reconstruction with model optimized for face images.