GAN Inversion: A brief walkthrough

DEC 24, 2021

Recent research has leveraged Generative Adversarial Networks (GANs) to generate photo-realistic images and to perform image manipulation. However, manipulating a real image in this way requires its corresponding latent code. GAN inversion aims to do just that by inverting an image into the latent space to obtain its latent code. This blog gives the reader an overview of GAN inversion approaches, their extensions and applications, and future work. We also briefly explore latent space navigation in GANs and compare GAN inversion with VAE-GANs. This blog is inspired mainly by the GAN inversion survey by Xia et al. [1], and it also explores some of the newer extensions in the field.

1. Background

   1.1 Generative Adversarial Networks

Generative adversarial networks (GANs) are deep learning-based generative models, that is, models that aim to generate new, synthetic data resembling the training data distribution. They differ from discriminative models, which learn boundaries between classes in the data distribution. GANs were first proposed by Goodfellow et al. [2], and various models have been developed since then to synthesize high-quality images.

GANs comprise two main components:

 

  • Generator (G): aims to learn the real data distribution to generate data closer to the distribution and fool its adversary, the discriminator.

  • Discriminator (D): aims to discriminate between real and generated images.

 

Goodfellow et al. [2] propose to train the generator and discriminator components together with opposing objective functions striving to defeat each other. The model converges when the generator successfully defeats the discriminator by generating data indistinguishable from real data. The objective function proposed is as follows:

Eq 1.png

Eq. 1: GAN objective function proposed by [2]

where 𝔁 represents the data and 𝒛 represents a random noise vector. As can be seen, the discriminator maximizes the objective function while learning to perform the distinction, while the generator minimizes it by generating realistic data.
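For readers who cannot view the equation image, the minimax objective from [2] is usually written in the following form. The names p_data and p_z for the data and noise distributions are our notation here and may differ from the figure.

```latex
\min_{G}\,\max_{D}\; V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_{z}(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

The first term rewards D for scoring real data highly, and the second rewards D for scoring generated data low, which is exactly what G tries to prevent.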

 

The network architecture can be seen in Fig. 1.

Fig 1.png

Fig. 1: Generative Adversarial Network Architecture


   1.2 GAN latent space

 

During the training procedure, randomly selected points from the distribution are passed through the generator to generate synthesized images. The discriminator learns to differentiate between the real training images and the synthesized images, labelled as real and fake respectively, which makes its task a classification problem.

 

The generator learns patterns in the distribution space, mapping a location in the space to specific characteristics in the generated image. This N-dimensional space with a GAN’s learned patterns is the GAN’s latent space, generally referred to as the Z space. The vectors in the latent space, including the random noise vectors mentioned earlier, are termed latent vectors. The latent space differs each time a GAN model is trained.

 

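We did not find much explicit reasoning on why nearby points in the GAN latent space produce images with similar characteristics. Our understanding is that it follows from the internal mechanisms of GANs: the generator is a continuous mapping, so a small change in the latent vector results in a small change in the generated image.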

 

Recent studies have shown that well-trained GANs encode disentangled semantic information in their latent space. Several of these works introduce spaces in addition to the Z latent space to better capture style and semantic information.

Fig 2.png

Fig. 2: The W space proposed by [4]

StyleGAN [4] introduced the W space in order to incorporate semantic information into the latent space. Fig. 2 illustrates this. Suppose, for example, that we have a dataset of human faces that lacks images for some combination of features, e.g., long-haired males. In this case, (a) represents the distribution space for the combination of the features, e.g., masculinity and hair length. (b) represents the Z space, where the mapping becomes curved/entangled to ensure that the entire space maps to valid feature combinations. This leads to feature entanglement in the Z space. (c) represents the mapping from the Z space to the W space, introduced by Karras et al. [4] to obtain more disentangled features in the W space.

 

Other works have proposed the W+ space [3], the S space [5] for spatial disentanglement, and the P space [6] for regularizing StyleGAN embeddings.

 

Many studies report disentanglement from a qualitative point of view. However, there is still no good evaluation metric for the perceptual quality of a generated image and its comparison with the expected outcome.

 

A well-trained GAN must be capable of generating realistic images as well as enabling image manipulation. It must also enable interpolation between the latent codes of two images to generate images with a smooth transition.

 

For instance, [3] defines the interpolation in the W+ space as follows:

Eq 2.png

Eq. 2: Linear interpolation defined by [3]

where the step factor, λ, for interpolation is defined to be:

Eq 2-1.png

w₁ and w₂ are latent vectors for the two images to interpolate between, and w is the interpolated latent vector.
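As a rough illustration of linear interpolation between two latent codes, here is a minimal Python sketch. It assumes the common convention w = λw₁ + (1 − λ)w₂ with λ between 0 and 1; the function names and the three-dimensional vectors are made up for the example, and a real W+ code would be far larger.

```python
def lerp_latents(w1, w2, lam):
    """Return lam * w1 + (1 - lam) * w2, element by element."""
    if len(w1) != len(w2):
        raise ValueError("latent vectors must have the same length")
    return [lam * a + (1.0 - lam) * b for a, b in zip(w1, w2)]


def interpolation_path(w1, w2, steps):
    """Latent vectors for `steps` evenly spaced values of lam in [0, 1]."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    return [lerp_latents(w1, w2, i / (steps - 1)) for i in range(steps)]


w1 = [0.0, 2.0, -1.0]  # hypothetical latent code of the first image
w2 = [1.0, 0.0, 3.0]   # hypothetical latent code of the second image
path = interpolation_path(w1, w2, steps=5)  # starts at w2 (lam=0), ends at w1 (lam=1)
```

Each vector in `path` would then be passed through the generator to render one frame of the transition.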

 

Fig. 3 shows linear latent space interpolation performed by Image2StyleGAN [3], trained on face images, in their W+ latent space. It can be seen that the GAN performs interpolation well on face images while failing to interpolate for non-face images realistically. This is due to the non-face images lying outside of the training data distribution.

Fig 3.png

Fig. 3: Latent space interpolation results from [3]

It is also possible to apply specific changes to the image by modifying the latent codes in semantically meaningful directions. However, such manipulation requires a known latent vector for the corresponding image, hence is only directly applicable to images generated using the GAN. Performing such operations on real images requires their latent vector to be obtained, leading to research in the inversion of GANs.


This review aims to provide an overview of [1] and a detailed analysis of some GAN inversion works. Hence, we won’t be going into a detailed overview of GANs. The remainder of this post focuses on GAN inversion and its approaches. More information on GANs and their latent space can be found in [2].

2. GAN inversion and approaches

   2.1 GAN inversion

GAN inversion aims to obtain the latent vector for any given image, such that when passed through the generator, it generates an image close to the real image. Obtaining the latent vector provides more flexibility to perform image manipulation on any image instead of being constrained to GAN-generated images obtained from random sampling and generation. Fig. 4 shows an illustration of GAN inversion.

Fig 4.png

Fig. 4: GAN inversion from [1]

The inversion problem can be defined as:

Eq 3.png

Eq. 3: Inversion problem objective function

where z refers to a latent vector, G refers to the generator, and l refers to the distance metric in the image space, such as l1, l2, or perceptual loss, applied at the pixel level.

   2.2 GAN inversion approaches

 

Previous work has explored three main inversion techniques, all of which assume a generator that has already been trained.

 

  • Optimization-based: optimizes the latent vector z to reconstruct an image xʳᵉᶜ close to the real image x. The objective function is defined as follows:

Eq 4.png

Eq. 4: Optimization-based GAN inversion

where θ refers to the trained parameters of G.

 

Various approaches perform this optimization using gradient descent. The optimization problem is highly non-convex and requires a good initialization; otherwise, it risks getting stuck in local minima. It is important to note that the optimization is performed during inference and can be computationally expensive, requiring many passes through the generator each time a new image is to be mapped to its latent vector.

Fig 5.png

Fig. 5: Optimization-based GAN inversion from [1]

  • Learning-based: incorporates an encoder module, E, trained using multiple images X generated by G with their corresponding known latent vectors Z. The encoder module aims to generate a latent vector zₙ for the image xₙ, such that when zₙ is passed through G, it reconstructs an image close to xₙ.

Eq 5.png

Eq. 5: Learning-based GAN inversion

The encoder architecture generally resembles the discriminator D architecture, differing only in the final layers. It generally performs better than a basic optimization approach and does not fall into a local minimum. It is worth noting that, during inference, the latent vector can be obtained directly by passing the image through the encoder.

Fig 6.png

Fig. 6: Learning-based GAN inversion from [1]

  • Hybrid-based: incorporates both the above-mentioned approaches. Similar to learning-based approaches, they train the encoder using images generated by G. During inference, the real image x is passed through E to obtain z, which serves as an initialization for the latent vector for optimization to further reduce the distance between x and xʳᵉᶜ.

Fig 7.png

Fig. 7: Hybrid GAN inversion from [1]
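To make the optimization-based and hybrid approaches more concrete, here is a toy Python sketch. It is not the method of any cited paper: the "generator" is a hypothetical fixed mapping G(z) = tanh(Wz), the loss is a plain squared l2 distance, and the gradient is written out by hand. In a hybrid setup, `z_init` would come from a trained encoder rather than being set to zeros.

```python
import math

# Toy stand-in for a pre-trained generator: G(z) = tanh(W z), with W kept fixed.
W = [[0.9, -0.4],
     [0.3, 0.8],
     [-0.5, 0.6]]


def generator(z):
    return [math.tanh(sum(w_ij * z_j for w_ij, z_j in zip(row, z))) for row in W]


def reconstruction_loss(z, x):
    """Squared l2 distance between G(z) and the target image x."""
    return sum((g - t) ** 2 for g, t in zip(generator(z), x))


def loss_gradient(z, x):
    """Analytic gradient of the squared l2 loss with respect to z."""
    g = generator(z)
    grad = [0.0] * len(z)
    for i, row in enumerate(W):
        # d/dz_j of (tanh(a_i) - x_i)^2 = 2 (g_i - x_i) (1 - g_i^2) W_ij
        common = 2.0 * (g[i] - x[i]) * (1.0 - g[i] ** 2)
        for j, w_ij in enumerate(row):
            grad[j] += common * w_ij
    return grad


def invert(x, z_init, lr=0.1, steps=2000):
    """Gradient descent on z only; the generator weights are never updated."""
    z = list(z_init)
    for _ in range(steps):
        grad = loss_gradient(z, x)
        z = [z_j - lr * g_j for z_j, g_j in zip(z, grad)]
    return z


z_true = [0.7, -0.3]
x = generator(z_true)   # the "real" image we want to invert
z_init = [0.0, 0.0]     # a trained encoder's prediction E(x) would go here (hybrid)
z_rec = invert(x, z_init)
final_loss = reconstruction_loss(z_rec, x)
```

Even in this tiny example, every step needs a pass through the generator, which is why optimization-based inversion becomes expensive for large models and why a good encoder initialization helps.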

3. Differences from VAE-GANs

Variational Autoencoders (VAEs) are another variant of generative models, proposed by Kingma et al. [7]. VAEs aim to train an encoder-decoder architecture to reconstruct images while also using KL-divergence to ensure that the latent distribution is close to an expected Gaussian distribution, with a defined mean and standard deviation.


Variational Autoencoder Generative Adversarial Networks (VAE-GANs) are a combination of VAEs and GANs, proposed by Larsen et al. [8]. Their architecture can be seen in Fig. 8. When GANs are incorporated into the architecture, the VAE decoder serves as the generator, and a discriminator tries to differentiate between generated and real images. While VAEs tend to produce blurry images, combining them with GANs helped them obtain performance comparable to the baseline GAN architectures on image generation tasks.

Fig 8.png

Fig. 8: VAE-GAN architecture by [8]

The learning-based GAN inversion architecture comprises modules similar to those of VAE-GANs, which might become a point of confusion. However, these architectures differ in their internal dynamics, their training approach, and their purpose.

 

Learning-based GAN inversion approaches aim to understand the latent space of an already trained GAN as well as obtain a corresponding latent code for an image by training the encoder independently. VAE-GAN, on the other hand, seeks to train the encoder together with the GAN modules to generate images.

4. Characteristics of GAN inversion

 

Semantic Aware

The GAN’s latent space encodes rich semantic information. Some GAN inversion approaches aim to obtain semantic-aware latent codes to improve image manipulation. Zhu et al. [9] proposed reconstructing images at both pixel- and semantic-level for better semantic awareness. Leng et al. [15] proposed improving this further by incorporating the semantic information in both the real image and latent space domains.

 

Layer-wise

With larger GAN generator architectures, it is not always feasible to perform full GAN inversion. Some approaches propose to decompose the generator into layers and perform layer-wise inversion instead. Bau et al. [13] proposed using layer-wise inversion to understand what GANs learn, Bau et al. [14] proposed to visualize what GANs don’t learn, while Gu et al. [10] used this idea to perform GAN inversion with multiple latent codes.

 

Out of distribution

Some approaches support out-of-distribution generalization for inverting images outside the training data distribution. Gu et al. [10] aim to support out-of-distribution GAN inversion by optimizing for multiple latent codes.

We will be exploring some of these works in more detail in the following parts of this series. But for now, let’s explore latent space navigation together with GAN inversion.

5. Latent space navigation

The survey on GAN inversion [1] mentions that well-trained GANs are able to encode disentangled semantics in their latent space, that is, to decouple semantics that are entangled in the data (e.g., older people tend to wear glasses more often than younger people). With more disentangled semantic information encoded, it becomes possible to find separate directions in the latent space for age and for glasses. These are very useful for image manipulation tasks.

Interpretable directions

 

Many GAN inversion approaches aim to determine interpretable directions in the latent space. Some learning-based inversion approaches incorporate a pre-trained classifier to identify boundaries in the latent space for different classes based on synthesized images. Such supervised approaches could be restrictive. Hence, unsupervised approaches have recently been explored, focusing on identifying interpretable directions without pre-labelled data (e.g., using PCA to identify important directions or closed-form approaches).

Fig 9.png

Fig. 9: Latent space

Numerous works enforce disentanglement to obtain better interpretable directions in the latent space for attributes (e.g., gender, age, hair color in face images). Jahanian et al. [12] suggested an approach to steer in interpretable directions, linear and non-linear, in the GAN latent space for meaningful geometric transformations (e.g., zoom, shift, color manipulation) without enforcing disentanglement during GAN training. They observed that these interpretable directions have similar effects across all object classes generated by the GAN.

 

Spingarn-Eliezer et al. [11] noticed that the output of the first layer of the generator is a coarse representation of the generated image. Hence, applying a meaningful geometric transformation to the first-layer output is similar to applying it to the generated image itself. They show that confining the problem to a single layer makes it possible to obtain interpretable directions in closed form, that is, to compute them from the generator’s weights alone, without any training or optimization. They also investigate the unsupervised exploration of transformations in the latent space using PCA. However, extracting attributes such as gender and age in the latent space remains an open problem for this approach and still requires training or optimization.

 

Non-interference

When moving in a particular semantic direction in the latent space of GANs, due to entanglement between different semantics in the space, it is challenging to perform manipulation without interference. Some approaches aim to incorporate non-interference by finding orthogonal directions through projection in the latent space or incorporating a semantic space into their architecture to obtain more linearly separable attributes. These are very useful for performing multi-attribute image manipulation without interference.


Shen et al. [18] perform conditional manipulation in the subspace to find orthogonal semantic directions that support non-interference. Conditional manipulation refers to obtaining directions in the latent space that decouple coupled attributes. Their approach can be seen in Fig. 10. With hyper-planes defining separation boundaries for semantic attributes, n₁ and n₂ represent the unit normal vectors of two hyper-planes, and n₁ − (n₁ᵀn₂)n₂ represents the new semantic direction, which is orthogonal to n₂.

Fig 10.png

Fig. 10: Conditional manipulation in subspace for non-interference by [18]
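The projection n₁ − (n₁ᵀn₂)n₂ is simple to compute once the two normal vectors are known. The sketch below uses made-up three-dimensional normals for an "age" and a "glasses" boundary; the names and values are purely illustrative and are not taken from [18].

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def conditional_direction(n1, n2):
    """Remove from n1 its component along n2: n1 - (n1^T n2) n2."""
    n1 = normalize(n1)
    n2 = normalize(n2)
    scale = dot(n1, n2)
    return [a - scale * b for a, b in zip(n1, n2)]


n_age = [1.0, 1.0, 0.0]      # hypothetical normal of the "age" hyper-plane
n_glasses = [0.0, 1.0, 0.0]  # hypothetical normal of the "glasses" hyper-plane
direction = conditional_direction(n_age, n_glasses)
```

Moving a latent code along `direction` changes the first attribute while, to first order, leaving its signed distance to the second hyper-plane unchanged. Note that the result is not re-normalized here.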

The results from their conditional manipulation can be seen in Fig. 11.

Fig 11.png

Fig. 11: Results from conditional manipulation proposed by [18]

Voynov et al. [16] and Ramesh et al. [17] use Jacobian decomposition to identify more disentangled directions within the latent space.

 

6. Extensions to GAN inversion approaches

   6.1 In-domain GAN inversion for Real Image Editing

Previous approaches attempted to reconstruct input images at a pixel level to obtain their corresponding latent vector. They, however, fail to obtain semantically meaningful latent codes within the original latent space. This means that the obtained latent code may not lie in the original latent space of the GAN, or may not hold the semantic meaning encoded in the GAN latent space, which makes it difficult to perform semantic editing on the reconstructed image. For example, performing semantic editing on the latent space of Image2StyleGAN [3] resulted in unrealistic images for age, expression, and eyeglasses, as can be seen in the image below.

Fig 12.png

Fig. 12: Facial image manipulation on Image2StyleGAN [3]

In [9], the authors proposed an in-domain GAN inversion approach to obtain a latent code that can reconstruct an image at both the pixel and the semantic level. In other words, the reconstructed image not only matches at a pixel level, but its corresponding latent vector also lies within the latent space of the GAN and holds enough semantic meaning to perform image manipulation. Their approach comprises two phases:

 

  • Firstly, they train a domain-guided encoder on real-image datasets instead of GAN-generated images. While reconstructing input images at the pixel level, they also regularize the training by incorporating features extracted from VGG on the images. The encoder also competes with the discriminator to ensure the obtained code generates realistic images.

Fig 13.png

Fig. 13: Domain-guided encoder proposed by [9]

Eq 6.png

Eq. 6: Domain-guided encoder objective function proposed by [9]

  • They then perform domain-regularized optimization to improve the initial latent vector obtained from the encoder by further regularizing the training with features obtained from VGG and the latent code obtained from the encoder.

Fig 14.png

Fig. 14: Domain-regularized optimization proposed by [9]

Eq 7.png

Eq. 7: Domain-regularized optimization objective function proposed by [9]

Zhu et al. [9] performed semantic analysis on the latent space. Their approach was able to encode more semantically meaningful information in the latent space than the state-of-the-art GAN inversion approach, Image2StyleGAN [3], on various evaluation metrics, including Fréchet Inception Distance (FID) [20] and Sliced Wasserstein Discrepancy (SWD) [19]. Their approach also achieved leading performance on image reconstruction quality metrics while being ~35x faster during optimization due to better initialization. Their image reconstruction results can be seen in Fig. 15 below.

Fig 15.png

Fig. 15: Image reconstruction results by [9]. The reconstructions from the domain-guided encoder (b) do not capture all the attributes of the input face image (a), at times including identity. With in-domain inversion (c), it is possible to achieve a better reconstruction.

The results for facial image inversion and manipulation can be seen in Fig. 16.


While inversion using Image2StyleGAN displays slightly better results than the in-domain approach, both visually and on the MSE metric reported by [9], navigation along interpretable directions achieves much better results with in-domain inversion. Our understanding is that reconstructing images purely at a pixel level might result in an out-of-domain latent vector, which then fails to generate realistic images when moved in interpretable directions.

Fig 16.png

Fig. 16: Facial image manipulation using In-domain inversion [9].

   6.2 Force-in-domain GAN inversion


While In-domain GAN inversion incorporates the semantic information in the real image space, analysis by [15] shows that their inverted latent vectors do not overlap with the latent space distribution of the GAN. Additionally, the authors of [9] found some irregular image reconstruction results with their approach and later proposed In-domain inversion with batch normalization to fix the issue. However, that still did not fix the initial problem regarding the deviation from the latent space distribution of the GAN. In order to visualize the latent code distributions, the authors perform dimensionality reduction to obtain a 2-dimensional vector for the latent codes. Fig. 17 illustrates the result for the visualization.

Fig 17.png

Fig. 17: A visualization of the latent space distribution for real and in-domain latent codes

To tackle the aforementioned problem, [15] propose force-in-domain GAN inversion to incorporate semantic information in both the real image and latent code domains. They do so by including an additional discriminator to force the latent code within the latent space domain. Their proposed approach can be seen in Fig. 18. They adopt the generator G and FC block from a pre-trained StyleGAN [4] model, while the encoder E and discriminators D and Dʷ are trainable modules during inversion. The FC block maps from Z to W space, incorporating a more disentangled latent space. While training the trainable modules, E aims to reconstruct the image at both pixel and semantic levels, fooling the discriminator D, which intends to differentiate between real and reconstructed images. Dʷ tries to distinguish between the latent codes generated by E from the ones sampled in the Z space and mapped to the W space, forcing E to generate latent codes that will fool Dʷ and overlap with the latent space domain.

Fig 18.png

Fig. 18: Force-in-domain GAN inversion architecture proposed by [15]

The visualization of the latent space for force-in-domain GAN inversion can be seen to overlap with the original distribution in Fig. 19.

Fig 19.png

Fig. 19: A visualization of the latent space distribution for real and force-in-domain latent codes

They also report lower MSE and FID scores compared to [9], indicating better reconstruction and closer distributions. They also perform linear interpolation on inverted codes of the two input images to generate a smooth transition between the two. Fig. 20 illustrates their results from inversion and interpolation.

Fig 20.png

Fig. 20: Image inversion and interpolation results reported by [15]

   6.3 Image processing using Multi-code GAN prior

 

Previous approaches perform optimization or train an encoder to invert an image, but the reconstruction is far from ideal. Due to the highly non-convex nature of the optimization, Gu et al. [10] suggest that a single latent code does not suffice to fully recover every detail of an image.

 

They also mention the challenges posed by the finite dimensionality and limited expressiveness of a single latent code, and suggest over-parameterizing the latent space to improve reconstruction, with the expectation that each latent code will reconstruct certain sub-regions of the image.

 

They propose mGANprior, also called multi-code GAN prior, that performs GAN inversion by incorporating multiple (N) latent codes. Linearly blending images generated by passing multiple latent codes through the generator G will not ensure a meaningful reconstructed image due to non-linearity in the image space. It was also shown by Bau et al. [13] that different units (channels) in the intermediate layers of the generator are responsible for different semantics in the image. Please refer to the GAN dissection section below for more details.

 

Hence, [10] suggests extracting feature maps from intermediate generator layers and introducing adaptive channel importance parameters. Similar to [13], this is done by incorporating trainable parameters that determine the importance of each channel in the feature map. Fig. 21 shows their proposed approach.

Fig 21.png

Fig. 21: mGANprior proposed approach by [10]

GAN inversion proves to be easier on the intermediate space rather than the latent space. They formulate the optimization as:

Eq 8.png

Eq. 8: Optimization objective for [10]

where

Eq 9.png

Eq. 9: Loss definition

and φ represents the perceptual feature extractor.
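Our reading of the feature composition step is that each latent code produces its own intermediate feature map, and these are summed with a per-channel weight before being passed through the remaining generator layers. The sketch below shows only that weighted sum, on tiny made-up feature maps; the function name, the data layout, and the numbers are our assumptions, and the generator itself is left out.

```python
def compose_features(feature_maps, channel_importance):
    """Channel-wise weighted sum of feature maps over all latent codes.

    feature_maps[n][c] is a flat list of spatial activations for latent
    code n and channel c; channel_importance[n][c] is the scalar weight
    applied to that whole channel.
    """
    num_channels = len(feature_maps[0])
    num_pixels = len(feature_maps[0][0])
    composed = [[0.0] * num_pixels for _ in range(num_channels)]
    for code_maps, code_weights in zip(feature_maps, channel_importance):
        for c in range(num_channels):
            for p in range(num_pixels):
                composed[c][p] += code_weights[c] * code_maps[c][p]
    return composed


# Two latent codes, two channels, three spatial positions (made-up numbers).
F = [
    [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]],
    [[4.0, 0.0, 2.0], [2.0, 2.0, 2.0]],
]
alpha = [
    [1.0, 0.0],  # code 0 contributes only to channel 0
    [0.5, 1.0],  # code 1 contributes to both channels
]
composed = compose_features(F, alpha)
```

During inversion, both the latent codes and the channel weights would be optimized so that the image generated from `composed` matches the target under the loss defined above.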

 

Fig. 22 shows the effect of the number of latent codes and selection of intermediate layer on the GAN inversion task.

Fig 22.png

Fig. 22: Effects of number of latent codes and intermediate layer selection on GAN inversion

They experiment on various image processing applications, including colourization, super-resolution, image in-painting, etc. Fig. 23 illustrates their results.

Fig 23.png

Fig. 23: Results of mGANprior on various image processing applications

It is also notable that their model can generalize to other datasets. For example, they optimized the inversion parameters on a face dataset but could perform inversion and reconstruct bedroom images successfully.

Fig 24.png

Fig. 24: Results for bedroom image reconstruction with model optimized for face images.

7. Applications of GAN inversion

Recent approaches have explored various GAN inversion applications, including image manipulation, restoration, interpolation, style transfer, compressive sensing, and interactive generation. Some researchers looked into the application of GAN inversion for dissecting GANs to understand their internal representations as well as for understanding what GANs do not learn.

   7.1 Image interpolation

Image interpolation refers to the process of morphing between two images by interpolating between their corresponding latent vectors in the latent space, usually with the expectation of a smooth transition between the images. The basic strategy is linear interpolation in the latent space, which can be formulated as follows:

Eq 10.png

Eq. 10: Image interpolation

where the step factor, λ, is defined as:

Eq 10-1.png

z₁ and z₂ represent the corresponding latent vectors of the two images, and z represents the latent vector for the interpolated image.

 

   7.2 Image manipulation

With the knowledge that GANs encode rich semantic information in their latent spaces, image manipulation moves the image’s latent code along specific semantic directions. The basic approach is to perform a linear transformation in the latent space, which can be defined as follows:

Eq 11.png
Eq 11-1.png

Eq. 11: Image manipulation

where z’ and x’ refer to the manipulated latent code and image, α refers to the step factor, and n refers to a specific semantic direction in the latent space.
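A minimal sketch of this linear edit is shown below. The latent code and the direction are made-up values, and the step that renders x’ by passing z’ through the generator is omitted.

```python
def edit_latent(z, direction, alpha):
    """Move the latent code z by alpha along a semantic direction n."""
    return [z_i + alpha * n_i for z_i, n_i in zip(z, direction)]


z = [0.2, -1.0, 0.5]  # hypothetical inverted latent code
n = [0.0, 1.0, 0.0]   # hypothetical unit direction for one attribute
edits = {alpha: edit_latent(z, n, alpha) for alpha in (-2.0, 0.0, 2.0)}
```

Sweeping α over a range of values, as done here, gives a sequence of latent codes whose images show the attribute gradually weakening or strengthening.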

 

   7.3 Semantic diffusion

This task aims to diffuse a specific part of a target image into a context image, expecting the output to retain the characteristics of the target image (e.g., face identity) while conforming to the context image. Zhu et al. [9] and Leng et al. [15] investigated the model performance for this application. Fig. 25 and Fig. 26 show their respective results.

Fig 25.png

Fig. 25: Semantic diffusion results by [9]

Fig 26.png

Fig. 26: Semantic diffusion results by [15]

   7.4 GAN dissection

While there has been significant progress in GAN training in the past few years to generate realistic images, there hasn’t been much research on understanding what the GAN learns internally and how the architectural choices could affect the learning. Bau et al. [13] aim to dissect GANs to visualize and understand the encoded internal representations at the unit, object, and scene levels.

 

The idea is to see whether a particular class c is present in the image.

 

They suggest that the feature maps r extracted from layers of GAN generators encode some information about the existence of a class c at particular locations in an image, where r = h(z) and x = f(r) = f(h(z)) = G(z).

 

They propose this as a two-step approach:

 

  • GAN dissection: For each object class c, the aim is to identify whether the uᵗʰ unit of the feature map, r_{u,P}, extracted at a particular layer encodes some internal representation of the class at location P. This is done by upsampling the feature map to the image size and thresholding it to mask out the non-relevant pixels. The image generated by the generator is then segmented for class c, and an IoU check measures the agreement between the segmentation and the thresholded feature map.

Fig 27.png

Fig. 27: GAN dissection by [13]

  • GAN intervention: It is important to note that not all units highly correlated with an object class are necessarily responsible for rendering the object into the image. Hence, it is essential to find the set of units that cause an object to occur in the image. They decompose feature maps into unforced and causal (causing object rendering) units and force insertion and ablation of these units to identify their effect. They also incorporate a learnable continuous per-channel factor that indicates the effect each unit of the feature map has on the rendering.

Fig 28.png

Fig. 28: GAN intervention by [13]

  • They also found that dissecting later layers of the generator covered more low-level objects with intricate details like edges and contours, while earlier layers covered broader objects such as a ceiling. Fig. 29 illustrates these findings.

Fig 29.png

Fig. 29: Layer based comparison by [13]
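The dissection step described above boils down to comparing a thresholded, upsampled feature map with a segmentation mask. The following sketch is a simplification under our own assumptions: it uses nearest-neighbour upsampling and a fixed threshold, whereas [13] may choose these differently, and the 2x2 activation map and 4x4 mask are invented for the example.

```python
def upsample_nearest(feature_map, factor):
    """Nearest-neighbour upsampling of a 2-D list by an integer factor."""
    out = []
    for row in feature_map:
        wide = [value for value in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def unit_class_iou(feature_map, class_mask, threshold, factor):
    """IoU between the thresholded unit activation and a class segmentation."""
    upsampled = upsample_nearest(feature_map, factor)
    intersection = 0
    union = 0
    for act_row, mask_row in zip(upsampled, class_mask):
        for activation, is_class in zip(act_row, mask_row):
            active = activation > threshold
            intersection += int(active and is_class)
            union += int(active or is_class)
    return intersection / union if union else 0.0


unit = [[0.9, 0.1],
        [0.8, 0.2]]    # hypothetical activations of one unit
mask = [[1, 1, 0, 0],
        [1, 1, 0, 0],
        [1, 1, 1, 0],
        [0, 0, 0, 0]]  # hypothetical segmentation of class c
iou = unit_class_iou(unit, mask, threshold=0.5, factor=2)
```

A unit with a high IoU for some class is a candidate "detector" for that class, which is what the intervention step then tests causally.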

   7.5 Interactive generation

Interactive generation refers to generating or modifying images through user interaction, wherein the user can fill regions to add or remove objects in an image. One interesting application of this is GANPaint by [21]. Their demo can be found here: GAN Paint

   7.6 Seeing what a GAN does not learn

Mode dropping, where a GAN entirely omits certain portions of the target distribution, is a challenge in GAN training that has not been explored much. Fig. 30 illustrates the results from image reconstruction using the Progressive GAN church model. As can be seen, the model drops people or fences during the reconstruction.

Fig 30.png

Fig. 30: Image reconstruction using Progressive GAN church model

Some previous approaches investigated this by measuring the distance between real and generated distributions. Bau et al. [14] proposed an approach to gain a semantically meaningful understanding of what GANs cannot generate, both at the distribution and instance levels. They do this as a two-step approach:

 

Generated Image Segmentation Statistics: Firstly, they obtain the mean and covariance statistics for different object classes by performing pixel-level segmentation on a set of real and generated images and identifying the total area in pixels covered by each object. If certain object class statistics for generated images depart from real images, it would be something to examine further.

 

Based on our understanding, objects that occur more frequently in the training set and cover wider pixel areas of the images seem less likely to be omitted by the GAN. For objects that don’t occur as frequently in the training set, on the other hand, the generator can get away with not generating them at all, as the discriminator does not treat them as an important feature for classifying an image as real or fake.

Fig 31.png

Fig. 31: Image segmentation statistics on StyleGAN, WGAN-GP, and Progressive GAN.

This approach provides information about classes that GANs omit at a distribution level but does not provide an insight into when a GAN fails to generate an object in a specific image.
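As a simplified illustration of the segmentation statistics step, the sketch below compares only the mean pixel area per class between real and generated images; the covariance statistics mentioned above are left out. The label maps are tiny invented examples, and in practice they would come from a segmentation model.

```python
def mean_class_area(segmentations, classes):
    """Average fraction of pixels per class over a list of label maps."""
    totals = {c: 0.0 for c in classes}
    for seg in segmentations:
        pixels = [label for row in seg for label in row]
        for c in classes:
            totals[c] += pixels.count(c) / len(pixels)
    return {c: totals[c] / len(segmentations) for c in classes}


classes = ["building", "sky", "person"]
real = [
    [["sky", "sky"], ["building", "person"]],
    [["sky", "building"], ["building", "person"]],
]
generated = [
    [["sky", "sky"], ["building", "building"]],
    [["sky", "building"], ["building", "building"]],
]
real_stats = mean_class_area(real, classes)
gen_stats = mean_class_area(generated, classes)
# Negative values flag classes the generator under-produces.
gap = {c: gen_stats[c] - real_stats[c] for c in classes}
```

In this made-up example the "person" class has a clearly negative gap, which is the kind of signal that would prompt a closer, per-image look.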

Layer Inversion: This is where GAN inversion comes in. The most basic idea would be to perform GAN inversion to reconstruct an image as close to the real image as possible. However, due to the large size of recent GANs, full GAN inversion is not always practical. Hence, they perform layer inversion instead. To do this, they decompose the generator into layers, such that,

Eq 12.png

Eq. 12: Decomposing the generator

In this case, the initial layers of the generator are described as g₁,…,gₙ, while the later layers form G_f. It can hence be concluded that any image that can be generated by G can be generated by G_f. They define r* such that G(z) = G_f(r*), and perform inversion at the layer level. To get a good initialization for r*, they require a good initialization for z, which is obtained by training an encoder E using the layer-based approach. z is then passed through g₁,…,gₙ to obtain a good initialization for r*, and r* is then further optimized to reduce the image reconstruction loss. In order to avoid local minima during optimization, small learnable perturbations (delta) are also added at each layer up to the generator’s nth layer, gₙ, to obtain better reconstructions.

Eq 13.png

Eq. 13: Optimization objective by [14]
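As the objective is shown only as an image, the following is a reconstruction from the surrounding description rather than a verbatim copy. It assumes the perturbations are written δ1,…,δn and are penalized by their squared norm:

```latex
\begin{aligned}
z_0 &= E(x), \\
r   &= \delta_n + g_n\bigl(\cdots(\delta_2 + g_2(\delta_1 + g_1(z_0)))\bigr), \\
r^* &= \arg\min_{r}\; \ell\bigl(x, G_f(r)\bigr) + \lambda_{\mathrm{reg}} \sum_{i=1}^{n} \lVert \delta_i \rVert^2
\end{aligned}
```

Here r is parameterized by the perturbations, so the minimization is carried out over δ1,…,δn with z0 held fixed.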

In the loss function, ℓ represents the combination of pixel-level loss and perceptual loss. The loss function also accounts for the perturbations through a regularization term weighted by λreg.
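The layer inversion procedure can be sketched on a toy generator. Everything here is an assumption made for illustration: the generator is three small random layers (`g1`, `g2`, `G_f`), the encoder output E(x) is replaced by a noisy copy of the true latent code, ℓ is reduced to a squared pixel error (the perceptual term is omitted), and the perturbations are optimized with plain gradient descent plus backtracking rather than the optimizer used in [14].

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy generator: two early layers (g1, g2) and the later layers G_f.
W1 = rng.normal(scale=0.5, size=(8, 4))    # g1: latent z (4) -> features (8)
W2 = rng.normal(scale=0.5, size=(8, 8))    # g2: features (8) -> r (8)
W3 = rng.normal(scale=0.5, size=(16, 8))   # G_f: r (8) -> "image" x (16 values)

def g1(z): return np.tanh(W1 @ z)
def g2(h): return np.tanh(W2 @ h)
def G_f(r): return W3 @ r

def objective(x, h1, d1, d2, lam):
    """Squared reconstruction error plus lam * (squared norms of the perturbations)."""
    r = d2 + g2(d1 + h1)
    err = G_f(r) - x
    return err @ err + lam * (d1 @ d1 + d2 @ d2)

def gradients(x, h1, d1, d2, lam):
    """Manual backpropagation of the objective with respect to d1 and d2."""
    h2 = g2(d1 + h1)
    r = d2 + h2
    grad_r = 2.0 * W3.T @ (G_f(r) - x)
    grad_d2 = grad_r + 2.0 * lam * d2
    grad_d1 = W2.T @ (grad_r * (1.0 - h2 ** 2)) + 2.0 * lam * d1
    return grad_d1, grad_d2

def invert_layer(x, z0, lam=0.01, steps=200, lr=0.1):
    """Start from z0, then learn small perturbations d1, d2 that give r* = d2 + g2(d1 + g1(z0))."""
    h1 = g1(z0)
    d1, d2 = np.zeros_like(h1), np.zeros(W2.shape[0])
    loss = objective(x, h1, d1, d2, lam)
    for _ in range(steps):
        gd1, gd2 = gradients(x, h1, d1, d2, lam)
        step = lr
        while step > 1e-8:                 # backtracking: accept only steps that lower the loss
            n1, n2 = d1 - step * gd1, d2 - step * gd2
            new_loss = objective(x, h1, n1, n2, lam)
            if new_loss < loss:
                d1, d2, loss = n1, n2, new_loss
                break
            step *= 0.5
        else:
            break                          # no improving step found: treat as converged
    return d2 + g2(d1 + h1), loss

z_true = rng.normal(size=4)
x = G_f(g2(g1(z_true)))                    # target image produced by the toy generator
z0 = z_true + 0.3 * rng.normal(size=4)     # stand-in for the encoder guess E(x)
lam = 0.01
initial_loss = objective(x, g1(z0), np.zeros(8), np.zeros(8), lam)
r_star, final_loss = invert_layer(x, z0, lam=lam)
print("loss before:", initial_loss, "after:", final_loss)
```

The point of the sketch is the structure: z0 is never updated, only the per-layer perturbations are, and the regularization weight controls how far r* may drift from the encoder's initial guess.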

This proposed architecture is illustrated in Fig. 32.

Fig 32.png

Fig. 32: Layer inversion by [14]

The results on the LSUN outdoor church dataset and on unrelated (non-church) images can be seen in Fig. 33. The generated images tend to miss particular objects such as humans, furniture, and signboards.

Fig 33.png

Fig. 33: Layer-wise inversion results for model trained on LSUN church

It is quite interesting that we rarely see such images being reported by researchers, possibly because this aspect of generated images has not been investigated much. Hence, this study by Bau et al. [14] gives great insight into what GANs tend to miss out on.

 

8. Further improvements

 

Evaluation metrics

Most current evaluation metrics concentrate on the photo-realism of the generated images and on comparing the training and generated distributions. While some approaches evaluate the generated images by passing them through a classification [14] or segmentation [16] model trained on real images, there is a lack of metrics that directly assess the quality of the inverted latent code and the directions of predicted and expected results.

 

Precise Control

Image manipulation by navigating through the GAN’s latent space using the inverted latent code has been quite successful. However, it is vital to have fine-grained control over the attributes manipulated during latent space navigation, and current approaches do not provide it. For example, it is possible to find latent space directions that change pose in GAN models, but changing the pose by precisely 1° is a task that requires more work. Hence, more research is needed to obtain disentangled latent spaces and to identify interpretable directions.

 

Domain generalization

While many studies, including [15], have demonstrated effective cross-domain applications, there is still scope for improvement in developing unified models that support multiple applications.

9. Summary

This series on GAN Inversion aimed to summarize the key technical details of GAN Inversion and its approaches. It also discusses the extensions and applications of GAN Inversion in recent research and explores the mechanism of latent space navigation in GANs.

While we covered the main idea behind GAN inversion as well as some of its applications and extensions in this series, there has been much research in the area recently. Please refer to the survey paper [1] on GAN inversion on which this post is based and [22] which provides a great compilation of GAN inversion resources.

 

References:

[1] Xia, W., Zhang, Y., Yang, Y., Xue, J., Zhou, B., & Yang, M. (2021). GAN Inversion: A Survey. ArXiv, abs/2101.05278.

[2] Goodfellow, I.J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A.C., & Bengio, Y. (2014). Generative Adversarial Nets. NIPS.

[3] Abdal, R., Qin, Y., & Wonka, P. (2019). Image2StyleGAN: How to Embed Images Into the StyleGAN Latent Space? 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 4431–4440.

[4] Karras, T., Laine, S., & Aila, T. (2019). A Style-Based Generator Architecture for Generative Adversarial Networks. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 4396–4405.

[5] Xu, J., Xu, H., Ni, B., Yang, X., Wang, X., & Darrell, T. (2020). Hierarchical Style-based Networks for Motion Synthesis. ECCV.

[6] Zhu, P., Abdal, R., Qin, Y., & Wonka, P. (2020). Improved StyleGAN Embedding: Where are the Good Latents? ArXiv, abs/2012.09036.

[7] Kingma, D.P., & Welling, M. (2014). Auto-Encoding Variational Bayes. CoRR, abs/1312.6114.

[8] Larsen, A.B., Sønderby, S.K., Larochelle, H., & Winther, O. (2016). Autoencoding beyond pixels using a learned similarity metric. ArXiv, abs/1512.09300.

[9] Zhu, J., Shen, Y., Zhao, D., & Zhou, B. (2020). In-Domain GAN Inversion for Real Image Editing. ArXiv, abs/2004.00049.

[10] Gu, J., Shen, Y., & Zhou, B. (2020). Image Processing Using Multi-Code GAN Prior. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 3009–3018.

[11] Spingarn-Eliezer, N., Banner, R., & Michaeli, T. (2021). GAN Steerability without optimization. ArXiv, abs/2012.05328.

[12] Jahanian, A., Chai, L., & Isola, P. (2020). On the “steerability” of generative adversarial networks. ArXiv, abs/1907.07171.

[13] Bau, D., Zhu, J., Strobelt, H., Zhou, B., Tenenbaum, J.B., Freeman, W.T., & Torralba, A. (2019). GAN Dissection: Visualizing and Understanding Generative Adversarial Networks. ArXiv, abs/1811.10597.

[14] Bau, D., Zhu, J., Wulff, J., Peebles, W.S., Strobelt, H., Zhou, B., & Torralba, A. (2019). Seeing What a GAN Cannot Generate. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 4501–4510.

[15] Leng, G., Zhu, Y., & Xu, Z.J. (2021). Force-in-domain GAN inversion. ArXiv, abs/2107.06050.

[16] Voynov, A., & Babenko, A. (2020). Unsupervised Discovery of Interpretable Directions in the GAN Latent Space. ArXiv, abs/2002.03754.

[17] Ramesh, A., Choi, Y., & LeCun, Y. (2018). A Spectral Regularizer for Unsupervised Disentanglement. ArXiv, abs/1812.01161.

[18] Shen, Y., Gu, J., Tang, X., & Zhou, B. (2020). Interpreting the Latent Space of GANs for Semantic Face Editing. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 9240–9249.

[19] Lee, C., Batra, T., Baig, M.H., & Ulbricht, D. (2019). Sliced Wasserstein Discrepancy for Unsupervised Domain Adaptation. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 10277–10287.

 

[20] Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., & Hochreiter, S. (2017). GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. NIPS.

[21] Painting with GANs from MIT-IBM Watson AI Lab. (2019). http://gandissect.res.ibm.com/ganpaint.html

[22] GitHub — weihaox/awesome-gan-inversion: A collection of resources on GAN inversion.

Written By
Anji Jain
AI Researcher
