How Generative Adversarial Networks and Their Variants Work: An Overview

Electrical and Computer Engineering, Seoul National University, Republic of Korea

ACM Comput. Surv., Vol. 52, No. 1, Article 10, Publication date: January 2019.

Generative Adversarial Networks (GANs) have received wide attention in the machine learning field for their potential to learn high-dimensional, complex real data distributions. Specifically, they do not rely on any assumptions about the distribution and can generate real-like samples from latent space in a simple manner. This powerful property allows GANs to be applied to various applications such as image synthesis, image attribute editing, image translation, domain adaptation, and other academic fields. In this article, we discuss the details of GANs for those readers who are familiar with GANs but do not comprehend them deeply, or who wish to view GANs from various perspectives. In addition, we explain how GANs operate and the fundamental meaning of various objective functions that have been suggested recently. We then focus on how the GAN can be combined with an autoencoder framework. Finally, we enumerate the GAN variants that are applied to various tasks and other fields for those who are interested in exploiting GANs for their research.

CCS Concepts: • Computing methodologies → Artificial intelligence; Machine learning; Computer vision representations;

Additional Key Words and Phrases: Generative adversarial networks, integral probability metric, mode collapse, variational auto-encoder, semi-supervised learning, domain adaptation

ACM Reference format:
Yongjun Hong, Uiwon Hwang, Jaeyoon Yoo, and Sungroh Yoon. 2019. How Generative Adversarial Networks and Their Variants Work: An Overview. ACM Comput. Surv. 52, 1, Article 10 (January 2019), 43 pages.


Recently, in the machine learning field, generative models have become more important and popular because of their applicability in various fields. Their capability to represent complex and high-dimensional data can be utilized in treating images [2, 12, 57, 65, 127, 133, 145], videos [122, 125, 126], music generation [41, 66, 141], natural languages [48, 73], and other academic domains such as medical images [16, 77, 136] and security [109, 124]. Specifically, generative models are highly useful for image-to-image translation (see Figure 1) [9, 57, 137, 145], which transfers images to another specific domain; image super-resolution [65]; changing some features of an object in an image [3, 37, 75, 94, 144]; and predicting the next frames of a video [122, 125, 126]. In addition, generative models can be the solution for various problems in the machine learning field such as semi-supervised learning [21, 67, 104, 115], which tries to address the lack of labeled data, and domain adaptation [2, 12, 47, 108, 111, 140], which leverages known knowledge for some tasks in other domains where little information is given.

Fig. 1. Examples of unpaired image-to-image translation from CycleGAN [145]. CycleGAN uses a GAN concept, converting the contents of the input image to the desired output image. There are many creative applications using a GAN, including image-to-image translation, and these will be introduced in Section 4. Images from CycleGAN [145].

Formally, a generative model learns to model a real data probability distribution $p_{\mathrm{data}}(x)$ where the data $x$ exists in the $d$-dimensional real space $R^d$, and most generative models, including autoregressive models [92, 105], are based on the maximum likelihood principle with a model parametrized by parameters $\theta$. With independent and identically distributed (i.i.d.) training samples $x^i$ where $i \in \lbrace 1,2,\ldots ,n\rbrace$, the likelihood is defined as the product of the probabilities that the model assigns to each training sample, $\prod _{i=1}^n p_{\theta }(x^i)$, where $p_{\theta }(x)$ is the probability that the model assigns to $x$. The maximum likelihood principle trains the model by choosing $\theta$ to maximize this likelihood, so that the model distribution follows the real data distribution as closely as possible.
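
As a small illustration of the maximum likelihood principle, the following sketch evaluates the log of the likelihood $\prod _{i=1}^n p_{\theta }(x^i)$ for a one-dimensional Gaussian model whose only parameter is its mean. The model family, the toy samples, and the function name `gaussian_log_likelihood` are assumptions made for this example and do not come from the article.

```python
import math

def gaussian_log_likelihood(samples, mu, sigma=1.0):
    """Sum of log p_theta(x^i) for a 1-D Gaussian model with theta = mu."""
    const = -0.5 * math.log(2.0 * math.pi * sigma ** 2)
    return sum(const - (x - mu) ** 2 / (2.0 * sigma ** 2) for x in samples)

# Toy i.i.d. training samples (illustrative values only).
samples = [1.8, 2.1, 2.4, 1.9, 2.3]

# For this model family the maximizer has a closed form: the sample mean.
mle_mu = sum(samples) / len(samples)

# Compare the maximizer against a few other candidate parameters.
candidates = [0.0, 1.0, mle_mu, 3.0]
best = max(candidates, key=lambda mu: gaussian_log_likelihood(samples, mu))
```

Here the explicit form of $p_{\theta }(x)$ is what makes the likelihood computable, which is exactly the assumption that the following paragraphs identify as a limitation.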

From this point of view, we need to assume a certain form of $p_{\mathrm{\theta }}(x)$ explicitly to estimate the likelihood of the given data and retrieve the samples from the learned model after the training. In this way, some approaches [91, 92, 105] successfully learned the generative model in various fields, including speech synthesis. However, while the explicitly defined probability density function brings about computational tractability, it may fail to represent the complexity of real data distribution and learn the high-dimensional data distributions [87].

Generative Adversarial Networks (GANs) [36] were proposed to solve the disadvantages of other generative models. Instead of maximizing the likelihood, GAN introduces the concept of adversarial learning between the generator and the discriminator. The generator and the discriminator act as adversaries with respect to each other to produce real-like samples. The generator is a continuous, differentiable transformation function mapping a prior distribution $p_z$ from the latent space $\mathcal {Z}$ into the data space $\mathcal {X}$ that tries to fool the discriminator. The discriminator distinguishes its input by whether it comes from the real data distribution or the generator. The basic intuition behind adversarial learning is that as the generator tries to deceive the discriminator, which also evolves against the generator, the generator improves. This adversarial process gives GAN notable advantages over the other generative models.

GAN avoids defining $p_{\mathrm{\theta }}(x)$ explicitly, and instead trains the generator using a binary classification of the discriminator. Thus, the generator does not need to follow a certain form of $p_{\mathrm{\theta }}(x)$. In addition, since the generator is a simple, usually deterministic feedforward network from $\mathcal {Z}$ to $\mathcal {X}$, GAN can sample the generated data in a simple manner, unlike other models using the Markov chain [113] in which the sampling is computationally slow and inaccurate. Furthermore, GAN can parallelize the generation, which is not possible with other models, such as PixelCNN [105], PixelRNN [92], and WaveNet [91], due to their autoregressive nature.

Because of these advantages, GAN has been gaining considerable attention, and the desire to use GAN in many fields is growing. In this study, we explain in detail how GAN [35, 36] generates sharper and more realistic samples than other generative models by adopting two components, the generator and the discriminator. We look into how GAN works theoretically and how GAN has been applied to various applications.

1.1 Article Organization

Table 1 shows GAN and GAN variants which will be discussed in Sections 2 and 3. In Section 2, we first present a standard objective function of a GAN and describe how its components work. After that, we present various objective functions proposed recently, focusing on their similarities in terms of the feature-matching problem. We then explain the architecture of GAN, extending the discussion to dominant obstacles caused by optimizing a minimax problem, especially a mode collapse, and how to address those issues.

Table 1. An Overview of GANs Discussed in Sections 2 and 3

| Subject | Topic | Reference |
| --- | --- | --- |
| Objective functions | f-divergence | GAN [36], f-GAN [89], LSGAN [76] |
| | IPM | WGAN [5], WGAN-GP [42], FISHER GAN [84], McGAN [85], MMDGAN [68] |
| Architecture | Hierarchy | StackedGAN [49], GoGAN [54], Progressive GAN [56] |
| | Auto encoder | BEGAN [10], EBGAN [143], MAGAN [128] |
| Issues | Theoretical analysis | Toward principled methods for training GANs [4], Generalization and equilibrium in GAN [6] |
| | Mode collapse | MRGAN [13], DRAGAN [61], MAD-GAN [33], Unrolled GAN [79] |
| Latent space | Decomposition | CGAN [80], ACGAN [90], InfoGAN [15], ss-InfoGAN [116] |
| | Encoder | ALI [25], BiGAN [24], Adversarial Generator-Encoder Networks [123] |
| | VAE | VAEGAN [64], $\alpha$-GAN [102] |

In Section 3, we discuss how GAN can be exploited to learn the latent space where a compressed and low-dimensional representation of data lies. In particular, we emphasize how the GAN extracts the latent space from the data space using autoencoder frameworks. Section 4 provides several extensions of the GAN applied to other domains and various topics, as shown in Table 2. In Section 5, we observe a macroscopic view of GAN, especially why GAN is advantageous over other generative models. Finally, Section 6 concludes the article.

Table 2. Categorization of GANs Applied for Various Topics

| Domain | Topic | Reference |
| --- | --- | --- |
| Image | Image translation | Pix2pix [52], PAN [127], CycleGAN [145], DiscoGAN [57] |
| | Super resolution | SRGAN [65] |
| | Object detection | SeGAN [28], Perceptual GAN for small object detection [69] |
| | Object transfiguration | GeneGAN [144], GP-GAN [132] |
| | Joint image generation | Coupled GAN [74] |
| | Video generation | VGAN [125], Pose-GAN [126], MoCoGAN [122] |
| | Text to image | Stack GAN [49], TAC-GAN [18] |
| | Change facial attributes | SD-GAN [23], SL-GAN [138], DR-GAN [121], AGEGAN [3] |
| Sequential data | Music generation | C-RNN-GAN [83], SeqGAN [141], ORGAN [41] |
| | Text generation | RankGAN [73] |
| | Speech conversion | VAW-GAN [48] |
| Others | Semi-supervised learning | SSL-GAN [104], CatGAN [115], Triple-GAN [67] |
| | Domain adaptation | DANN [2], CyCADA [47], Unsupervised pixel-level domain adaptation [12] |
| | Continual learning | Deep generative replay [110] |
| | Medical image segmentation | DI2IN [136], SCAN [16], SegAN [134] |
| | Steganography | Steganography GAN [124], Secure steganography GAN [109] |


As its name implies, GAN is a generative model that learns to make real-like data adversarially [36]. It contains two components, the generator $G$ and the discriminator $D$. $G$ takes the role of producing real-like fake samples from the latent variable $z$, whereas $D$ determines whether its input comes from $G$ or from the real data space. $D$ outputs a higher value the more likely it judges its input to be real. $G$ and $D$ compete with each other to achieve their individual goals, hence the term “adversarial.” This adversarial learning situation can be formulated as Equation (1), with parametrized networks $G$ and $D$. $p_{\mathrm{data}}(x)$ and $p_{z}(z)$ in Equation (1) denote the real data probability distribution defined in the data space $\mathcal {X}$ and the probability distribution of $z$ defined on the latent space $\mathcal {Z}$, respectively.

\begin{equation} \min _{G}\max _{D}V(G,D)=\mathbb {E}_{x\sim p_{\mathrm{data}}(x)}[\log D(x)]+\mathbb {E}_{z\sim p_{z}(z)}[\log (1-D(G(z)))] . \end{equation}

$V(G,D)$ is a binary cross-entropy function that is commonly used in binary classification problems [76]. Note that $G$ maps $z$ from $\mathcal {Z}$ into an element of $\mathcal {X}$, whereas $D$ takes an input $x$ and distinguishes whether $x$ is a real sample or a fake sample generated by $G$.

As $D$ wants to classify real or fake samples, $V(G,D)$ is a natural choice for an objective function as an aspect of the classification problem. From $D$’s perspective, if a sample comes from real data, $D$ will maximize its output; whereas, if a sample comes from $G$, $D$ will minimize its output; thus, the $\log (1-D(G(z)))$ term appears in Equation (1). Simultaneously, $G$ wants to deceive $D$, so it tries to maximize $D$’s output when a fake sample is presented to $D$. Consequently, $D$ tries to maximize $V(G,D)$ while $G$ tries to minimize $V(G,D)$, thus forming the minimax relationship in Equation (1). Figure 2 shows an illustration of the GAN.
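
The minimax relationship can be made concrete with a Monte Carlo estimate of $V(G,D)=\mathbb {E}[\log D(x)]+\mathbb {E}[\log (1-D(G(z)))]$. The sketch below is only illustrative: the one-dimensional shift generator, the logistic discriminator, and all numeric values are hypothetical stand-ins for the neural networks used in practice.

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def generator(z, shift):
    """Hypothetical toy G: moves a latent sample z into data space."""
    return z + shift

def discriminator(x, w, b):
    """Hypothetical toy D: logistic score in (0, 1); higher means 'more real'."""
    return sigmoid(w * x + b)

def value_function(real_samples, latent_samples, shift, w, b):
    """Monte Carlo estimate of V(G, D) = E[log D(x)] + E[log(1 - D(G(z)))]."""
    real_term = sum(math.log(discriminator(x, w, b)) for x in real_samples)
    fake_term = sum(math.log(1.0 - discriminator(generator(z, shift), w, b))
                    for z in latent_samples)
    return real_term / len(real_samples) + fake_term / len(latent_samples)

real = [2.8, 3.1, 3.3, 2.9]        # toy real samples clustered near 3
latent = [-0.2, 0.1, 0.3, -0.1]    # toy latent samples clustered near 0

# D that outputs 1/2 everywhere: V equals -log 4.
v_blind_d = value_function(real, latent, shift=0.0, w=0.0, b=0.0)
# D that separates real from fake samples: V increases (D's goal).
v_sharp_d = value_function(real, latent, shift=0.0, w=4.0, b=-6.0)
# G that moves its samples onto the real data: V decreases (G's goal).
v_better_g = value_function(real, latent, shift=3.0, w=4.0, b=-6.0)
```

With $D$ fixed, moving the generated samples toward the real ones lowers $V$, whereas improving $D$ against a fixed $G$ raises it, which is the opposing pull described above.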

Fig. 2. Generative adversarial network.

Theoretically, assuming that the two models $G$ and $D$ both have sufficient capacity, the equilibrium between $G$ and $D$ occurs when $p_\mathrm{data}(x)=p_\mathrm{g}(x)$ and $D$ always produces $\frac{1}{2}$, where $p_\mathrm{g}(x)$ denotes the probability distribution of the data provided by the generator [36]. Formally, for fixed $G$, the optimal discriminator $D^\star$ is $D^\star (x) = \frac{p_{\mathrm{data}}(x)}{p_{\mathrm{data}}(x)+p_\mathrm{g}(x)}$, which can be shown by differentiating Equation (1) with respect to $D(x)$. If we plug the optimal $D^\star$ into Equation (1), the objective becomes $2\,\mathrm{JSD}(p_{\mathrm{data}}||p_{\mathrm{g}})-2\log 2$, that is, the Jensen-Shannon Divergence (JSD) between $p_{\mathrm{data}}(x)$ and $p_{\mathrm{g}}(x)$ up to a scale factor and an additive constant. Thus, the optimal generator minimizing JSD$(p_{\mathrm{data}}||p_{\mathrm{g}})$ is the data distribution $p_\mathrm{data}(x)$, and $D^\star$ becomes $\frac{1}{2}$ by substituting the optimal generator into the optimal $D^\star$ term.
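
The relationship between the optimal discriminator and the JSD can be checked numerically on a small discrete example. In the sketch below, the three-point distributions are arbitrary illustrative values, and the helper names are hypothetical; the check is that the value function at $D^\star (x)=p_{\mathrm{data}}(x)/(p_{\mathrm{data}}(x)+p_{\mathrm{g}}(x))$ equals $2\,\mathrm{JSD}-2\log 2$.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence: average KL to the mixture (p + q) / 2."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def value_at_optimal_d(p_data, p_g):
    """V(G, D*) with D*(x) = p_data(x) / (p_data(x) + p_g(x)).

    Assumes every probability is strictly positive so the logs are finite.
    """
    total = 0.0
    for pd, pg in zip(p_data, p_g):
        d_star = pd / (pd + pg)
        total += pd * math.log(d_star) + pg * math.log(1.0 - d_star)
    return total

p_data = [0.5, 0.3, 0.2]   # illustrative discrete data distribution
p_g = [0.2, 0.3, 0.5]      # illustrative generator distribution

lhs = value_at_optimal_d(p_data, p_g)
rhs = 2.0 * jsd(p_data, p_g) - 2.0 * math.log(2.0)

# At equilibrium (p_g = p_data), D* is 1/2 everywhere and V = -log 4.
v_equilibrium = value_at_optimal_d(p_data, p_data)
```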

Beyond the theoretical support of GAN, the preceding paragraph leads us to infer two points. First, the optimal discriminator connects GAN to the density ratio trick [102]. That is, the density ratio between the data distribution and the generated data distribution is as follows:

\begin{equation} Dr(x)=\frac{p_{\mathrm{data}}(x)}{p_{\mathrm{g}}(x)}=\frac{p(x|y=1)}{p(x|y=0)}=\frac{p(y=1|x)}{p(y=0|x)}=\frac{D^\star (x)}{1-D^\star (x)} , \end{equation}
where $y=0$ and $y=1$ indicate the generated data and the real data, respectively, and $p(y=1)=p(y=0)$ is assumed. This means that GAN addresses the intractability of the likelihood by just using the relative behavior of the two distributions [102] and transferring this information to the generator to produce real-like samples. Second, GAN can be interpreted to measure the discrepancy between the generated data distribution and the real data distribution and then learn to reduce it. The discriminator is used to implicitly measure the discrepancy.
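
The density ratio identity can be illustrated with two densities that are known in closed form. The sketch below assumes, purely for illustration, that $p_{\mathrm{data}}$ and $p_{\mathrm{g}}$ are unit-variance Gaussians with means 1 and 0; in a real GAN neither density is available, and only the discriminator output is.

```python
import math

def normal_pdf(x, mean, std=1.0):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def optimal_discriminator(x):
    """D*(x) for the assumed toy densities p_data = N(1, 1), p_g = N(0, 1)."""
    p_data = normal_pdf(x, 1.0)
    p_g = normal_pdf(x, 0.0)
    return p_data / (p_data + p_g)

def density_ratio_from_d(x):
    """Recover p_data(x) / p_g(x) using only the discriminator output."""
    d = optimal_discriminator(x)
    return d / (1.0 - d)

points = [-1.0, 0.0, 0.5, 2.0]
recovered = [density_ratio_from_d(x) for x in points]
exact = [normal_pdf(x, 1.0) / normal_pdf(x, 0.0) for x in points]
```

For these two Gaussians the ratio simplifies to $\exp (x-\frac{1}{2})$, so the two densities are equal at $x=\frac{1}{2}$, where $D^\star$ outputs $\frac{1}{2}$.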

Despite the advantages and theoretical support of GAN, many shortcomings have been found, owing to practical issues and to theoretical assumptions that cannot be realized in practice, such as the infinite capacity of the discriminator. There have been many attempts to solve these issues by changing the objective function, the architecture, and more. Retaining the fundamental framework of GAN, we assess variants of the objective function and the architectures proposed for the development of GAN. We then focus on the crucial failures of GAN and how to address those issues.

2.1 Objective Functions

The goal of generative models is to match the generated distribution $p_{\mathrm{g}}(x)$ to the real data distribution $p_{\mathrm{data}}(x)$. Thus, minimizing differences between two distributions is a crucial point for training generative models. As mentioned earlier, standard GAN [36] minimizes JSD$(p_{\mathrm{data}}||p_{\mathrm{g}})$ estimated by using the discriminator. Recently, researchers have found that various distances or divergence measures can be adopted instead of JSD and can improve the performance of the GAN. In this section, we discuss how to measure the discrepancy between $p_{\mathrm{data}}(x)$ and $p_{\mathrm{g}}(x)$ using various distances and the objective functions derived from these distances.

2.1.1 f-Divergence. The f-divergence $D_f(p_{\mathrm{data}}||p_{\mathrm{g}})$ is one of the means used to measure differences between two distributions with a specific convex function $f$. Using the ratio of the two distributions, the f-divergence for $p_{\mathrm{data}}$ and $p_{\mathrm{g}}$ with a function $f$ is defined as follows:

\begin{equation} D_f(p_{\mathrm{data}}||p_{\mathrm{g}}) = \int _{\mathcal {X}}p_{\mathrm{g}}(x)f\left(\frac{p_{\mathrm{data}}(x)}{p_{\mathrm{g}}(x)}\right)dx . \end{equation}
It should be noted that $D_f(p_{\mathrm{data}}||p_{\mathrm{g}})$ can act as a divergence between two distributions under the conditions that $f$ is a convex function and $f(1)=0$ is satisfied. Because of the condition $f(1)=0$, if two distributions are equivalent, their ratio becomes 1 and their divergence goes to 0. Though $f$ is termed a generator function [89] in general, we call $f$ an f-divergence function to avoid confusion with the generator $G$.
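
For discrete distributions the integral in the f-divergence becomes a sum, which makes the role of the condition $f(1)=0$ easy to verify. The sketch below is a minimal illustration with arbitrary three-point distributions; the choices of $f$ follow the entries listed in Table 3, and the function names are hypothetical.

```python
import math

def f_divergence(p_data, p_g, f):
    """Discrete analogue of D_f(p_data || p_g) = sum_x p_g(x) f(p_data(x) / p_g(x)).

    Assumes strictly positive probabilities so every ratio is well defined.
    """
    return sum(pg * f(pd / pg) for pd, pg in zip(p_data, p_g))

# f-divergence functions as listed in Table 3; each satisfies f(1) = 0.
def f_kld(t):
    return t * math.log(t)

def f_pearson(t):
    return (t - 1.0) ** 2

def f_total_variation(t):
    return abs(t - 1.0)

p_data = [0.5, 0.3, 0.2]   # illustrative values
p_g = [0.2, 0.3, 0.5]

kld_via_f = f_divergence(p_data, p_g, f_kld)
kld_direct = sum(pd * math.log(pd / pg) for pd, pg in zip(p_data, p_g))
same_distribution = [f_divergence(p_data, p_data, f)
                     for f in (f_kld, f_pearson, f_total_variation)]
```

When the two distributions coincide, every ratio equals 1 and each divergence evaluates to 0, as stated above.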

f-GAN [89] generalizes the GAN objective function in terms of f-divergence under an arbitrary convex function $f$. As we do not know the distributions exactly, Equation (3) should be estimated through a tractable form, such as an expectation form. By using the convex conjugate $f(u)=\sup _{t\in dom f^{\star }}(tu - f^{\star }(t))$, Equation (3) can be reformulated as follows:

\begin{equation} D_f(p_{\mathrm{data}}||p_{\mathrm{g}}) = \int _{\mathcal {X}}p_{\mathrm{g}}(x) \sup _{t\in dom f^{\star }}\left(t\frac{p_{\mathrm{data}}(x)}{p_{\mathrm{g}}(x)} - f^{\star }(t)\right)dx \end{equation}
\begin{equation} \ge \sup _{T \in \mathcal {T}} \int _{\mathcal {X}}\left(T(x)p_{\mathrm{data}}(x) - f^{\star }\left(T(x)\right)p_{\mathrm{g}}(x)\right)dx \end{equation}
\begin{equation} = \sup _{T \in \mathcal {T}} \left(\mathbb {E}_{x\sim p_{\mathrm{data}}}\left[T(x)\right] - \mathbb {E}_{x\sim p_{\mathrm{g}}}\left[f^{\star }\left(T(x)\right)\right]\right), \end{equation}
where $f^{\star }$ is the Fenchel conjugate [29] of the convex function $f$, and $dom f^{\star }$ indicates the domain of $f^{\star }$.

Equation (5) follows from the fact that the summation of the maximum is at least as large as the maximum of the summation, and $\mathcal {T}$ is an arbitrary function class that satisfies $\mathcal {X} \rightarrow R$. Note that we replace $t$ in Equation (4) by $T(x): \mathcal {X} \rightarrow dom f^{\star }$ in Equation (5) to make $t$ involved in $\int _{\mathcal {X}}$. If we express $T(x)$ in the form of $T(x)=a(D_\omega (x))$ with $a(\cdot): R \rightarrow dom f^{\star }$ and $D_\omega (x): \mathcal {X} \rightarrow R$, we can interpret $T(x)$ as the parameterized discriminator with a specific activation function $a(\cdot)$.
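
The variational lower bound $\mathbb {E}_{x\sim p_{\mathrm{data}}}[T(x)]-\mathbb {E}_{x\sim p_{\mathrm{g}}}[f^{\star }(T(x))]$ can be checked on a discrete example for the KLD case, $f(u)=u\log u$, whose Fenchel conjugate is $f^{\star }(t)=\exp (t-1)$. The sketch below is illustrative only: the distributions are arbitrary, and $T$ is represented as a plain list of values rather than a neural network. The bound is tight at $T(x)=f^{\prime }(p_{\mathrm{data}}(x)/p_{\mathrm{g}}(x))=1+\log (p_{\mathrm{data}}(x)/p_{\mathrm{g}}(x))$.

```python
import math

def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def f_star_kld(t):
    """Fenchel conjugate of f(u) = u log u."""
    return math.exp(t - 1.0)

def lower_bound(p_data, p_g, t_values):
    """E_{x ~ p_data}[T(x)] - E_{x ~ p_g}[f*(T(x))] for a tabulated T."""
    data_term = sum(pd * t for pd, t in zip(p_data, t_values))
    gen_term = sum(pg * f_star_kld(t) for pg, t in zip(p_g, t_values))
    return data_term - gen_term

p_data = [0.5, 0.3, 0.2]   # illustrative values
p_g = [0.2, 0.3, 0.5]

# Optimal T for the KLD case: f'(ratio) = 1 + log(ratio).
optimal_t = [1.0 + math.log(pd / pg) for pd, pg in zip(p_data, p_g)]
# Two arbitrary suboptimal choices of T.
zero_t = [0.0, 0.0, 0.0]
noisy_t = [t + 0.5 for t in optimal_t]

true_kld = kl_divergence(p_data, p_g)
tight = lower_bound(p_data, p_g, optimal_t)
loose_a = lower_bound(p_data, p_g, zero_t)
loose_b = lower_bound(p_data, p_g, noisy_t)
```

Any other choice of $T$ yields a smaller value, which is why the discriminator is trained to maximize the bound before the generator minimizes it.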

We can then create various GAN frameworks with the specified f-divergence function $f$ and activation function $a$ using the parameterized generator $G_\theta$ and discriminator $T_\omega$. Similar to the standard GAN [36], f-GAN first maximizes the lower bound in Equation (6) with respect to $T_\omega$ to make the lower bound tight to $D_f(p_{\mathrm{data}}||p_{\mathrm{g}})$ and then minimizes the approximated divergence with respect to $G_\theta$ to make $p_{\mathrm{g}}(x)$ similar to $p_{\mathrm{data}}(x)$. In this manner, f-GAN tries to generalize various GAN objectives by estimating some types of divergences given $f$, as shown in Table 3.

Table 3. GANs Using F-Divergence

| GAN | Divergence | Generator $f(t)$ |
| --- | --- | --- |
| | KLD | $t\log t$ |
| GAN [36] | JSD $- 2\log 2$ | $t\log t-(t+1)\log (t+1)$ |
| LSGAN [76] | Pearson $\chi ^2$ | $(t-1)^2$ |
| EBGAN [143] | Total variation | $\lvert t-1\rvert$ |

Table reproduced from [89].

Kullback-Leibler Divergence (KLD), reverse KLD, JSD, and other divergences can be derived using the f-GAN framework with the specific f-divergence function $f$, though they are not all represented in Table 3. Among f-GAN-based GANs, Least-Square GAN (LSGAN) is one of the most widely used GANs due to its simplicity and high performance; we briefly explain LSGAN in the next paragraph. To summarize, the f-divergence $D_f(p_{\mathrm{data}}||p_{\mathrm{g}})$ in Equation (3) can be indirectly estimated by calculating expectations of its lower bound, which deals with the intractable form that involves the unknown probability distributions. f-GAN generalizes various divergences under an f-divergence framework and thus can derive the corresponding GAN objective with respect to a specific divergence.

Least Square GAN. The standard GAN uses a sigmoid cross-entropy loss for the discriminator to classify whether its input is real or fake. However, if a generated sample is well-classified as real by the discriminator, there would be no reason for the generator to be updated even though the generated sample is located far from the real data distribution. A sigmoid cross-entropy loss can barely push such generated samples toward real data distribution since its classification role has been achieved.

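Motivated by this phenomenon, LSGAN replaces a sigmoid cross-entropy loss with a least-square loss, which directly penalizes fake samples by moving them close to the real data distribution. Compared to Equation (1), LSGAN solves the following problems:

\begin{equation} \min _{D}\ \frac{1}{2}\mathbb {E}_{x\sim p_{\mathrm{data}}(x)}\left[(D(x)-b)^2\right]+\frac{1}{2}\mathbb {E}_{z\sim p_{z}(z)}\left[(D(G(z))-a)^2\right] \end{equation}
\begin{equation} \min _{G}\ \frac{1}{2}\mathbb {E}_{z\sim p_{z}(z)}\left[(D(G(z))-c)^2\right] , \end{equation}
where $a,b,$ and $c$ refer to the baseline values for the discriminator: $a$ and $b$ are the target outputs for generated and real samples, respectively, and $c$ is the output that $G$ wants $D$ to produce for generated samples.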

Equations (7) and (8) use a least-square loss, under which the discriminator is forced to have designated values ($a,b,$ and $c$) for the real samples and the generated samples, respectively, rather than a probability for the real or fake samples. Thus, contrary to a sigmoid cross-entropy loss, a least-square loss not only classifies the real samples and the generated samples but also pushes generated samples closer to the real data distribution. In addition, LSGAN can be connected to an f-divergence framework, as shown in Table 3.
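
The two least-square objectives can be written directly on batches of discriminator outputs. The sketch below is a minimal illustration: the default baseline values $a=0$, $b=1$, $c=1$ are one common coding scheme and are an assumption here, and the input lists stand for discriminator outputs that would normally come from a network.

```python
def mean_squared_gap(values, target):
    return sum((v - target) ** 2 for v in values) / len(values)

def lsgan_discriminator_loss(d_real, d_fake, a=0.0, b=1.0):
    """Push D(x) toward b on real samples and D(G(z)) toward a on fakes."""
    return 0.5 * mean_squared_gap(d_real, b) + 0.5 * mean_squared_gap(d_fake, a)

def lsgan_generator_loss(d_fake, c=1.0):
    """Push D(G(z)) toward c, the value G wants D to output for fakes."""
    return 0.5 * mean_squared_gap(d_fake, c)

# A fake sample scored exactly at the target incurs no generator loss.
loss_on_target = lsgan_generator_loss([1.0])
# A fake scored far beyond the target is still penalized, even though a
# classifier would already count it as "real".
loss_far_side = lsgan_generator_loss([3.0])
# A discriminator that hits both baselines exactly has zero loss.
loss_perfect_d = lsgan_discriminator_loss([1.0, 1.0], [0.0, 0.0])
```

The second case is the behavior the least-square loss is designed for: the penalty depends on the distance from the designated value, not only on which side of the decision boundary the sample falls.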

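2.1.2 Integral Probability Metric. The Integral Probability Metric (IPM) defines a critic function $f$, which belongs to a specific function class $\mathcal {F}$, and the IPM is defined as a maximal measure between two arbitrary distributions under the frame of $f$. In a compact space $\mathcal {X} \subset R^d$, let $\mathcal {P}(\mathcal {X})$ denote the probability measures defined on $\mathcal {X}$. The IPM between two distributions $p_{\mathrm{data}}$, $p_{\mathrm{g}} \in \mathcal {P}(\mathcal {X})$ is defined as follows:

\begin{equation} d_\mathcal {F}(p_{\mathrm{data}}, p_{\mathrm{g}}) = \sup _{f\in \mathcal {F}}\ \mathbb {E}_{x\sim p_{\mathrm{data}}}\left[f(x)\right]-\mathbb {E}_{x\sim p_{\mathrm{g}}}\left[f(x)\right] . \end{equation}

As shown in Equation (9), the IPM metric $d_\mathcal {F}(p_{\mathrm{data}}, p_{\mathrm{g}})$ defined on $\mathcal {X}$ determines a maximal distance between $p_{\mathrm{data}}(x)$ and $p_{\mathrm{g}}(x)$ with functions belonging to $\mathcal {F}$, which is a set of measurable, bounded, real-valued functions. It should be noted that $\mathcal {F}$ determines various distances and their properties. Here, we consider the function class $\mathcal {F}_{v,w}$ whose elements are the critic $f$, which can be represented as a standard inner product of parameterized neural networks $\Phi _w$ and a linear output activation function $v$ in real space, as described in Equation (10). $w$ belongs to a parameter space $\Omega$ that forces the function space to be bounded. Under the function class in Equation (10), we can reformulate Equation (9) as the following equations:

\begin{equation} \mathcal {F}_{v,w} = \lbrace f(x)=\langle v, \Phi _w(x)\rangle \mid v\in R^m, \Phi _w(x):\mathcal {X}\rightarrow R^m\rbrace \end{equation}
\begin{equation} d_{\mathcal {F}_{v,w}}(p_{\mathrm{data}}, p_{\mathrm{g}}) = \sup _{f\in \mathcal {F}_{v,w}}\ \mathbb {E}_{x\sim p_{\mathrm{data}}}f(x)-\mathbb {E}_{x\sim p_{\mathrm{g}}}f(x) \end{equation}
\begin{equation} = \max _{w\in \Omega , v}\ \langle v, \mathbb {E}_{x\sim p_{\mathrm{data}}}\Phi _w(x)-\mathbb {E}_{x\sim p_{\mathrm{g}}}\Phi _w(x)\rangle \end{equation}
\begin{equation} = \max _{w\in \Omega }\max _{v}\ \langle v, \mathbb {E}_{x\sim p_{\mathrm{data}}}\Phi _w(x)-\mathbb {E}_{x\sim p_{\mathrm{g}}}\Phi _w(x)\rangle . \end{equation}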

In Equation (13), the range of $v$ determines the semantic meanings of the corresponding IPM metrics. From now on, we discuss IPM metric variants such as the Wasserstein metric, Maximum Mean Discrepancy (MMD), and the Fisher metric based on Equation (13).

Wasserstein GAN. Wasserstein GAN (WGAN) [5] presents significant studies regarding the distance between $p_{\mathrm{data}}(x)$ and $p_{\mathrm{g}}(x)$. GAN learns the generator function $g_\theta$ that transforms a latent variable $z$ into $p_{\mathrm{g}}(x)$ rather than directly learning the probability distribution $p_{\mathrm{data}}(x)$ itself. A measure between $p_{\mathrm{g}}(x)$ and $p_{\mathrm{data}}(x)$ thus is required to train $g_\theta$. WGAN suggests the Earth-Mover (EM) distance, which is also called the Wasserstein distance, as a measure of the discrepancy between the two distributions. The Wasserstein distance is defined as follows:

\begin{equation} W(p_{\mathrm{data}}, p_{\mathrm{g}}) = \inf _{\gamma \in \Pi (p_{\mathrm{data}}, p_{\mathrm{g}})}\mathbb {E}_{(x,y)\sim \gamma }\left[\Vert x-y\Vert \right], \end{equation}
where $\Pi (p_{\mathrm{data}}, p_{\mathrm{g}})$ denotes the set of all joint distributions $\gamma (x,y)$ whose marginals are $p_{\mathrm{data}}(x)$ and $p_{\mathrm{g}}(y)$, respectively.

Probability distributions can be interpreted as the amount of mass they place at each point, and the EM distance is the minimum total amount of work required to transform $p_{\mathrm{data}}(x)$ into $p_{\mathrm{g}}(x)$. From this view, calculating the EM distance is equivalent to finding a transport plan $\gamma (x, y)$, which defines how we distribute the amount of mass from $p_{\mathrm{data}}(x)$ over $p_{\mathrm{g}}(y)$. Therefore, the marginal conditions can be interpreted as follows: $p_{\mathrm{data}}(x)=\int _{y}\gamma (x,y)dy$ is the amount of mass to move from point $x$, and $p_{\mathrm{g}}(y)=\int _{x}\gamma (x,y)dx$ is the amount of mass to be stacked at point $y$. Because work is defined as the amount of mass times the distance it moves, we multiply the Euclidean distance $\Vert x-y\Vert$ by $\gamma (x,y)$ at each pair of points $x$, $y$, and the minimum amount of work is given by Equation (14).
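
The transport-plan view can be illustrated in one dimension, where the EM distance between two equal-size empirical distributions is obtained by matching sorted samples. The sketch below compares the work of that sorted plan with the work of an independent (product) plan that also satisfies the marginal conditions; the sample values and function names are hypothetical.

```python
def transport_cost(plan, xs, ys):
    """Total work of a plan: mass moved times distance moved, summed over pairs."""
    return sum(plan[i][j] * abs(xs[i] - ys[j])
               for i in range(len(xs)) for j in range(len(ys)))

def wasserstein_1d(samples_a, samples_b):
    """EM distance between two equal-size 1-D empirical distributions.

    In one dimension the optimal plan matches the i-th smallest sample of
    one set to the i-th smallest sample of the other.
    """
    assert len(samples_a) == len(samples_b)
    pairs = zip(sorted(samples_a), sorted(samples_b))
    return sum(abs(a - b) for a, b in pairs) / len(samples_a)

xs = [0.0, 2.0, 4.0]   # toy samples from p_data, each with mass 1/3
ys = [1.0, 3.0, 5.0]   # toy samples from p_g, each with mass 1/3
n = len(xs)

# Both plans have rows and columns summing to 1/3 (the marginal conditions).
independent_plan = [[1.0 / (n * n)] * n for _ in range(n)]
sorted_plan = [[1.0 / n if i == j else 0.0 for j in range(n)] for i in range(n)]

work_independent = transport_cost(independent_plan, xs, ys)
work_sorted = transport_cost(sorted_plan, xs, ys)
em_distance = wasserstein_1d(xs, ys)
```

Both plans are feasible, but only the sorted plan attains the infimum; the independent plan moves mass farther than necessary.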

The benefit of the EM distance over other metrics is that it is a more sensible objective function when learning distributions supported on a low-dimensional manifold. The article on WGAN shows that the EM distance is the weakest convergent metric, in that a sequence that converges under the EM distance does not necessarily converge under other metrics, and that it is continuous and differentiable almost everywhere under the Lipschitz condition, which standard feedforward neural networks satisfy. Thus, the EM distance results in a more tolerable measure than do other distances such as KLD and total variation distance regarding convergence of the distance.

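As the $\inf$ term in Equation (14) is intractable, it is converted into a tractable equation via Kantorovich-Rubinstein duality with the Lipschitz function class [43, 99]; that is, $f:X\rightarrow R$, satisfying $d_R(f(x_1), f(x_2)) \le 1\times d_X(x_1, x_2)$, $\forall x_1,x_2 \in X,$ where $d_X$ denotes the distance metric in the domain $X$. A duality of Equation (14), with the supremum taken over all 1-Lipschitz functions $f$, is as follows:

\begin{equation} W(p_{\mathrm{data}}, p_{\mathrm{g}}) = \sup _{\Vert f\Vert _L\le 1}\ \mathbb {E}_{x\sim p_{\mathrm{data}}}\left[f(x)\right]-\mathbb {E}_{x\sim p_{\mathrm{g}}}\left[f(x)\right] . \end{equation}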

Consequently, if we parametrize the critic $f$ with $w$ to be a 1-Lipschitz function, the formulation becomes a minimax problem in that we train $f_w$ first to approximate $W(p_{\mathrm{data}}, p_{\mathrm{g}})$ by searching for the maximum, as in Equation (15), and then minimize this approximated distance by optimizing the generator $g_\theta$. To guarantee that $f_w$ is a Lipschitz function, weight clipping is conducted for every update of $w$ to ensure that the parameter space of $w$ lies in a compact space. It should be noted that $f(x)$ is called the critic because it does not explicitly classify its inputs as the discriminator does, but rather scores its input.
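
The critic update with weight clipping can be sketched as follows. This is a simplified illustration under stated assumptions: the critic is a hypothetical linear function $f_w(x)=\langle w,x\rangle$ so that its gradient is available in closed form, the update is plain gradient ascent, and the clipping threshold `c=0.01` and learning rate are illustrative values.

```python
def clip_weights(weights, c=0.01):
    """Project every weight back into the compact interval [-c, c]."""
    return [max(-c, min(c, w)) for w in weights]

def critic(x, weights):
    """Hypothetical linear critic f_w(x) = <w, x>; it scores, not classifies."""
    return sum(w * xi for w, xi in zip(weights, x))

def critic_objective(real_batch, fake_batch, weights):
    """Sample estimate of E_{p_data}[f_w(x)] - E_{p_g}[f_w(x)]."""
    real_mean = sum(critic(x, weights) for x in real_batch) / len(real_batch)
    fake_mean = sum(critic(x, weights) for x in fake_batch) / len(fake_batch)
    return real_mean - fake_mean

def critic_ascent_step(real_batch, fake_batch, weights, lr=0.05, c=0.01):
    """One gradient-ascent step on the objective, followed by clipping."""
    dim = len(weights)
    # For a linear critic the gradient w.r.t. w is mean(real) - mean(fake).
    grad = [sum(x[k] for x in real_batch) / len(real_batch)
            - sum(x[k] for x in fake_batch) / len(fake_batch)
            for k in range(dim)]
    updated = [w + lr * g for w, g in zip(weights, grad)]
    return clip_weights(updated, c)

real_batch = [[1.0, 2.0], [3.0, 2.0]]   # toy real samples
fake_batch = [[0.0, 2.0], [0.0, 2.0]]   # toy generated samples

w0 = [0.0, 0.0]
w1 = critic_ascent_step(real_batch, fake_batch, w0)
gap_before = critic_objective(real_batch, fake_batch, w0)
gap_after = critic_objective(real_batch, fake_batch, w1)
```

The unclipped step would move the first weight to 0.1; clipping pulls it back to 0.01, which keeps the critic in a bounded parameter space at the cost of restricting the functions it can represent.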

Variants of WGAN. WGAN with gradient penalty (WGAN-GP) [42] points out that the weight clipping for the critic while training WGAN incurs a pathological behavior of the discriminator and suggests adding a penalizing term of the gradient's norm instead of the weight clipping. It shows that guaranteeing the Lipschitz condition for the critic via weight clipping constrains the critic to a very limited subset of all Lipschitz functions; this biases the critic toward a simple function. The weight clipping also creates a gradient problem as it pushes weights to the extremes of the clipping range. Instead of weight clipping, adding a gradient penalty term to Equation (15) for the purpose of implementing the Lipschitz condition by directly constraining the gradient of the critic has been suggested [42].
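
A gradient penalty term can be sketched as below. Two details are taken from the WGAN-GP paper [42] rather than from the text above: the penalty has the form $\lambda (\Vert \nabla _{\hat{x}}f(\hat{x})\Vert _2-1)^2$, and $\hat{x}$ is sampled on straight lines between real and generated samples. The toy critic $f(x)=\tanh (\langle w,x\rangle )$ is a hypothetical choice that keeps the input gradient analytic; a practical implementation would obtain it by automatic differentiation.

```python
import math
import random

def critic_input_gradient(x, weights):
    """Gradient of the toy critic f(x) = tanh(<w, x>) with respect to x."""
    s = sum(w * xi for w, xi in zip(weights, x))
    scale = 1.0 - math.tanh(s) ** 2
    return [scale * w for w in weights]

def gradient_penalty(real_batch, fake_batch, weights, lam=10.0, rng=random):
    """Average of lam * (||grad_x f(x_hat)||_2 - 1)^2 over interpolated points."""
    total = 0.0
    for x_real, x_fake in zip(real_batch, fake_batch):
        eps = rng.random()  # uniform in [0, 1)
        x_hat = [eps * r + (1.0 - eps) * f for r, f in zip(x_real, x_fake)]
        grad = critic_input_gradient(x_hat, weights)
        norm = math.sqrt(sum(g * g for g in grad))
        total += (norm - 1.0) ** 2
    return lam * total / len(real_batch)

real_batch = [[1.0, 2.0], [3.0, 2.0]]   # toy real samples
fake_batch = [[0.0, 2.0], [0.0, 1.0]]   # toy generated samples

# A constant critic (all-zero weights) has zero gradient everywhere,
# so every interpolated point contributes (0 - 1)^2 = 1.
penalty_flat = gradient_penalty(real_batch, fake_batch, [0.0, 0.0])
penalty_other = gradient_penalty(real_batch, fake_batch, [0.6, 0.8])
```

Unlike weight clipping, the penalty acts on the gradient of the critic directly and leaves the weights themselves unconstrained.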

Loss-Sensitive GAN (LS-GAN) [98] also uses a Lipschitz constraint but with a different method. It learns a loss function $L_\theta$ instead of the critic, such that the loss of a real sample should be smaller than that of a generated sample by a data-dependent margin, leading to more focus on fake samples whose margin is high. Moreover, LS-GAN assumes that the density of real samples $p_{\mathrm{data}}(x)$ is Lipschitz continuous so that nearby data do not abruptly change. The reason for adopting the Lipschitz condition is independent of WGAN's Lipschitz condition. The article on LS-GAN argues that the nonparametric assumption that the model should have infinite capacity, proposed by Goodfellow et al. [36], is too harsh a condition to satisfy even for deep neural networks and causes various problems in training; hence, it constrains the model to lie in a Lipschitz continuous function space. In contrast, WGAN's Lipschitz condition comes from the Kantorovich-Rubinstein duality, and only the critic is constrained. In addition, LS-GAN uses a weight-decay regularization technique to keep the weights of the model in a bounded region and thereby ensure the Lipschitz condition.

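Mean Feature Matching. From Equation (10), we can generalize several IPM metrics under the measure of the inner product. If we constrain $v$ with the $p$ norm, where $p$ is a nonnegative integer, we can derive Equation (18) as a feature matching problem by adding the $\Vert v\Vert _p\le 1$ condition, where $\Vert v\Vert _p=\lbrace \sum _{i=1}^{m}|v_i|^p\rbrace ^{1/p}$. It should be noted that, with the conjugate exponent $q$ of $p$ such that $\frac{1}{p}+\frac{1}{q}=1$, the dual norm of the $p$ norm satisfies $\Vert x\Vert _q=\sup \lbrace \langle v,x\rangle :\Vert v\Vert _p\le 1\rbrace$ by Hölder's inequality [120]. Motivated by this dual norm property [85], we can derive an $l_q$ mean matching problem as follows:

\begin{equation} d_\mathcal {F}(p_{\mathrm{data}}, p_{\mathrm{g}}) = \max _{w\in \Omega }\max _{v, \Vert v\Vert _p\le 1}\ \langle v, \mathbb {E}_{x\sim p_{\mathrm{data}}}\Phi _w(x)-\mathbb {E}_{x\sim p_{\mathrm{g}}}\Phi _w(x)\rangle \end{equation}
\begin{equation} = \max _{w\in \Omega }\Vert \mathbb {E}_{x\sim p_{\mathrm{data}}}\Phi _w(x)-\mathbb {E}_{x\sim p_{\mathrm{g}}}\Phi _w(x)\Vert _q \end{equation}
\begin{equation} = \max _{w\in \Omega }\Vert \mu _w(p_{\mathrm{data}})-\mu _w(p_{\mathrm{g}})\Vert _q, \end{equation}
where $\mu _w(\mathcal {P})=\mathbb {E}_{x\sim \mathcal {P}}[\Phi _w(x)]$ denotes an embedding mean from distribution $\mathcal {P}$ represented by a neural network $\Phi _w$.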
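
The dual norm step can be verified numerically for a fixed feature map. In the sketch below, $\Phi _w$ is replaced by a hypothetical fixed map $x\mapsto (x,x^2)$ and the samples are arbitrary; the check is that the best $v$ in the unit $l_\infty$ ball attains the $l_1$ norm of the embedding mean difference, and the best $v$ in the unit $l_2$ ball attains its $l_2$ norm.

```python
import math

def feature_map(x):
    """Hypothetical fixed stand-in for Phi_w: maps a scalar to (x, x^2)."""
    return [x, x * x]

def embedding_mean(samples):
    """mu_w(P): average feature vector over samples drawn from P."""
    feats = [feature_map(x) for x in samples]
    return [sum(col) / len(samples) for col in zip(*feats)]

def inner(u, v):
    return sum(a * b for a, b in zip(u, v))

real = [1.0, 2.0, 3.0]   # toy samples from p_data
fake = [0.0, 1.0, 2.0]   # toy samples from p_g

delta = [a - b for a, b in zip(embedding_mean(real), embedding_mean(fake))]

# p = infinity, q = 1: the maximizing v takes the sign of each coordinate.
v_linf = [1.0 if d >= 0 else -1.0 for d in delta]
l1_norm = sum(abs(d) for d in delta)

# p = 2, q = 2: the maximizing v is delta rescaled to unit length.
l2_norm = math.sqrt(inner(delta, delta))
v_l2 = [d / l2_norm for d in delta]

# Any other feasible v gives a smaller inner product.
v_other = [1.0, 0.0]
```

Once the maximization over $v$ is resolved in closed form like this, only the maximization over the feature network parameters $w$ remains, which is the feature matching view used in the remainder of this section.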

WGAN enforces the 1-Lipschitz function class condition by weight clipping. Weight clipping means that the infinity norm $\Vert v\Vert _\infty =\max _i |v_i|$ is constrained. Thus, WGAN can be interpreted as an $l_1$ mean feature matching problem. Mean and Covariance feature matching GAN (McGAN) [85] extended this concept to match not only the $l_q$ mean feature but also the second-order moment feature by using the singular value decomposition concept; it aims to also maximize an embedding covariance discrepancy between $p_{\mathrm{data}}(x)$ and $p_{\mathrm{g}}(x)$. Geometric GAN [72] shows that the McGAN framework is equivalent to a Support Vector Machine (SVM) [106], which separates the two distributions with a hyperplane that maximizes the margin. It encourages the discriminator to move away from the separating hyperplane and the generator to move toward the separating hyperplane. However, such high-order moment matching requires complex matrix computations. MMD addresses this problem with a kernel technique that induces an approximation of high-order moments and can be analyzed in a feature matching framework.

Maximum Mean Discrepancy (MMD). Before describing the MMD methodology, we must first list some mathematical facts. A Hilbert space $\mathcal {H}$ is a complete vector space with the metric endowed by the inner product in the space. A kernel $k$ is defined as $k:\mathcal {X}\times \mathcal {X}\rightarrow R$ such that $k(y,x)=k(x,y)$. Then, for any given positive definite kernel $k(\cdot ,\cdot)$, there exists a unique space of functions called the Reproducing Kernel Hilbert Space (RKHS), $\mathcal {H}_k,$ which is a Hilbert space and satisfies $\langle f,k(\cdot ,x)\rangle _{\mathcal {H}_k}=f(x),\,\forall x \in \mathcal {X}, \forall f \in \mathcal {H}_k$ (the so-called reproducing property).

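MMD methodology can be seen as feature matching under the RKHS space for some given kernel with an additional function class restriction such that $\mathcal {F}=\lbrace f \mid \Vert f\Vert _{\mathcal {H}_k}\le 1\rbrace$. It can be related to Equation (10) in that the RKHS space is a completion of $\lbrace f \mid f(x)=\sum _{i=1}^{n}a_i k(x,x_i)=\sum _{i=1}^{n}a_i\Phi _{x_i}(x),\ a_i\in R,\ x_i\in \mathcal {X}\rbrace$ where $\Phi _{x_i}(x)=k(x,x_i)$. Therefore, we can formulate MMD methodology as another mean feature matching from Equation (11) as follows:

\begin{equation} d_\mathcal {F}(p_{\mathrm{data}}, p_{\mathrm{g}}) = \sup _{\Vert f\Vert _{\mathcal {H}_k}\le 1}\ \mathbb {E}_{x\sim p_{\mathrm{data}}}f(x)-\mathbb {E}_{x\sim p_{\mathrm{g}}}f(x) \end{equation}
\begin{equation} = \sup _{\Vert f\Vert _{\mathcal {H}_k}\le 1}\ \langle f, \mathbb {E}_{x\sim p_{\mathrm{data}}}k(x,\cdot)-\mathbb {E}_{x\sim p_{\mathrm{g}}}k(x,\cdot)\rangle _{\mathcal {H}_k} \end{equation}
\begin{equation} = \Vert \mu (p_{\mathrm{data}})-\mu (p_{\mathrm{g}})\Vert _{\mathcal {H}_k}, \end{equation}
where $\mu (\mathcal {P})=\mathbb {E}_{x\sim \mathcal {P}}[k(x,\cdot)]$ denotes a kernel embedding mean, and Equation (20) comes from the reproducing property.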

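From Equation (21), $d_\mathcal {F}$ is defined as the maximum kernel mean discrepancy in RKHS. It is widely used in statistical tasks, such as a two-sample test to measure a dissimilarity. Given $p_{\mathrm{data}}(x)$ with mean $\mu _{p_{\mathrm{data}}}$, $p_{\mathrm{g}}(x)$ with mean $\mu _{p_{\mathrm{g}}}$, and kernel $k$, the square of the MMD distance $M_k(p_{\mathrm{data}}, p_{\mathrm{g}})$ can be reformulated as follows:

\begin{equation} M_k(p_{\mathrm{data}}, p_{\mathrm{g}}) = \Vert \mu _{p_{\mathrm{data}}}-\mu _{p_{\mathrm{g}}}\Vert ^2_{\mathcal {H}_k} = \mathbb {E}_{x,x^{\prime }\sim p_{\mathrm{data}}}\left[k(x,x^{\prime })\right]-2\,\mathbb {E}_{x\sim p_{\mathrm{data}},\,y\sim p_{\mathrm{g}}}\left[k(x,y)\right]+\mathbb {E}_{y,y^{\prime }\sim p_{\mathrm{g}}}\left[k(y,y^{\prime })\right] . \end{equation}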

Generative Moment Matching Networks (GMMN) [70] suggest directly minimizing the MMD distance with a fixed Gaussian kernel $k(x,x^{\prime })=\exp (-{\Vert x-x^{\prime }\Vert }^2)$ by optimizing $\min _\theta M_k(p_{\mathrm{data}}, p_{\mathrm{g}})$. It appears quite dissimilar to the standard GAN [36] because there is no discriminator that estimates the discrepancy between two distributions. Unlike GMMN, MMDGAN [68] suggests adversarial kernel learning, in which the kernel is learned rather than fixed. They replace the fixed Gaussian kernel with a composition of a Gaussian kernel and an injective function $f_\phi$ as follows: $\tilde{k}(x,x^{\prime }) = \exp (-{\Vert f_\phi (x)-f_\phi (x^{\prime })\Vert }^2),$ and learn a kernel to maximize the mean discrepancy. The objective function with the learned kernel $\tilde{k}$ then becomes $\min _\theta \max _\phi M_{\tilde{k}}(p_{\mathrm{data}}, p_{\mathrm{g}})$ and is now similar to the standard GAN objective, as in Equation (1). To enforce that $f_\phi$, modeled with a neural network, is injective, an autoencoder satisfying $f_{\textit{decoder}} \approx f_\phi ^{-1}$ is adopted for $f_\phi$.
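
The squared MMD with the fixed Gaussian kernel can be estimated from samples by replacing each expectation with an average over sample pairs. The sketch below uses the simple biased estimator (pairs of identical samples are included in the averages), one-dimensional toy samples, and hypothetical function names; it is not the training procedure of GMMN or MMDGAN.

```python
import math

def gaussian_kernel(x, y):
    """Fixed Gaussian kernel k(x, x') = exp(-||x - x'||^2) for scalars."""
    return math.exp(-(x - y) ** 2)

def mean_kernel(xs, ys, kernel):
    """Average kernel value over all pairs; cost is len(xs) * len(ys)."""
    return sum(kernel(x, y) for x in xs for y in ys) / (len(xs) * len(ys))

def mmd_squared(xs, ys, kernel=gaussian_kernel):
    """Biased sample estimate of E[k(x,x')] - 2 E[k(x,y)] + E[k(y,y')]."""
    return (mean_kernel(xs, xs, kernel)
            - 2.0 * mean_kernel(xs, ys, kernel)
            + mean_kernel(ys, ys, kernel))

data = [0.0, 0.5, 1.0]   # toy samples from p_data
near = [0.1, 0.6, 1.1]   # toy generated samples close to the data
far = [3.0, 3.5, 4.0]    # toy generated samples far from the data

mmd_same = mmd_squared(data, data)
mmd_near = mmd_squared(data, near)
mmd_far = mmd_squared(data, far)
```

The nested pairwise sums make the quadratic growth of the computation with the number of samples directly visible.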

Similar to other IPM metrics, the MMD distance is continuous and differentiable almost everywhere in $\theta$. It can also be understood under the IPM framework with function class $\mathcal {F} = \mathcal {H}_k$, as discussed earlier. By introducing an RKHS with kernel $k$, the MMD distance has an advantage over other feature matching metrics in that kernel $k$ can represent various feature spaces by mapping input data $x$ into another feature space. In particular, MMDGAN can also be connected with WGAN when $f_\phi$ is composed with a linear kernel with an output dimension of 1 instead of a Gaussian kernel. The moment matching technique using the Gaussian kernel also has an advantage over WGAN in that it can match moments of infinitely high order, since the exponential can be expanded into an infinite Taylor series, whereas WGAN can be treated as a first-order moment matching problem, as discussed earlier. However, a great disadvantage of measuring the MMD distance is that the computational cost grows quadratically with the number of samples [5].
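To make the quantity minimized by GMMN concrete, the following minimal Python sketch estimates the squared MMD from two finite samples with the fixed Gaussian kernel quoted above. It uses the standard expansion of the squared distance between mean embeddings into three kernel averages, in its simple biased form. The function names, sample sizes, and toy Gaussian data are illustrative assumptions and are not taken from GMMN [70] or MMDGAN [68].

```python
import math
import random


def gaussian_kernel(x, y):
    """k(x, y) = exp(-||x - y||^2), the fixed kernel quoted in the text."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)))


def mmd_squared(xs, ys, kernel=gaussian_kernel):
    """Biased estimate of the squared MMD between two samples.

    Evaluates E[k(x, x')] - 2 E[k(x, y)] + E[k(y, y')] with plain averages
    over all pairs, so the cost is quadratic in the number of samples.
    """
    def mean_k(a, b):
        return sum(kernel(u, v) for u in a for v in b) / (len(a) * len(b))

    return mean_k(xs, xs) - 2.0 * mean_k(xs, ys) + mean_k(ys, ys)


# Toy data (assumption): 2-D Gaussians standing in for p_data and p_g.
random.seed(0)
real = [(random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)) for _ in range(200)]
fake_close = [(random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)) for _ in range(200)]
fake_far = [(random.gauss(3.0, 1.0), random.gauss(3.0, 1.0)) for _ in range(200)]

print("MMD^2, same distribution:   ", mmd_squared(real, fake_close))
print("MMD^2, shifted distribution:", mmd_squared(real, fake_far))
```

The nested sum over all sample pairs makes the quadratic cost mentioned above explicit. In MMDGAN, the inputs would first be passed through the learned function $f_\phi$ before the kernel is evaluated.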

Meanwhile, CramerGAN [8] argues that the Wasserstein distance incurs biased gradients and suggests using the energy distance between two distributions instead. In fact, it does not measure the energy distance directly in the data manifold but after applying a transformation function $h$. CramerGAN can therefore be thought of as measuring the distance in the kernel embedded space of MMDGAN, which forces $h$ to be injective through the additional autoencoder reconstruction loss, as discussed earlier.

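Fisher GAN. In addition to the standard IPM in Equation (9), Fisher GAN [84] incorporates a data-dependent constraint through the following equations:

\begin{align} d_{\mathcal {F}}(p_{\mathrm{data}}, p_{\mathrm{g}}) &= \sup _{f\in \mathcal {F}}\frac{\mathbb {E}_{x\sim p_{\mathrm{data}}}[f(x)]-\mathbb {E}_{x\sim p_{\mathrm{g}}}[f(x)]}{\sqrt {\frac{1}{2}\mathbb {E}_{x\sim p_{\mathrm{data}}}[f^2(x)]+\frac{1}{2}\mathbb {E}_{x\sim p_{\mathrm{g}}}[f^2(x)]}} \\ &= \sup _{f\in \mathcal {F},\ \frac{1}{2}\mathbb {E}_{x\sim p_{\mathrm{data}}}[f^2(x)]+\frac{1}{2}\mathbb {E}_{x\sim p_{\mathrm{g}}}[f^2(x)]=1}\mathbb {E}_{x\sim p_{\mathrm{data}}}[f(x)]-\mathbb {E}_{x\sim p_{\mathrm{g}}}[f(x)]. \end{align}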

Equation (23) is motivated by Fisher Linear Discriminant Analysis (FLDA) [130], which not only maximizes the mean difference but also reduces the total within-class variance of two distributions. Equation (24) follows from constraining the denominator of Equation (23) to be 1. Like other IPM metrics, it can be interpreted as a mean feature matching problem, although under somewhat different constraints. Under the definition of Equation (10), Fisher GAN can be converted into another mean feature matching problem with a second-order moment constraint. A mean feature matching problem derived from the FLDA concept is as follows:

\begin{align} d_{\mathcal {F}}(p_{\mathrm{data}}, p_{\mathrm{g}}) &= \max _{w\in \Omega }\max _v\frac{\langle v, \mu _w(p_{\mathrm{data}})-\mu _w(p_{\mathrm{g}})\rangle }{\sqrt {v^T(\frac{1}{2}\Sigma _w(p_{\mathrm{data}})+\frac{1}{2}\Sigma _w(p_{\mathrm{g}})+\gamma I_m)v}} \\ &= \max _{w\in \Omega }\max _{v,\ v^T(\frac{1}{2}\Sigma _w(p_{\mathrm{data}})+\frac{1}{2}\Sigma _w(p_{\mathrm{g}})+\gamma I_m)v=1}\langle v, \mu _w(p_{\mathrm{data}})-\mu _w(p_{\mathrm{g}})\rangle , \end{align}
where $\mu _w(\mathcal {P})=\mathbb {E}_{x\sim \mathcal {P}}[\Phi _w(x)]$ denotes an embedding mean and $\Sigma _w(\mathcal {P})=\mathbb {E}_{x\sim \mathcal {P}}[\Phi _w(x)\Phi _w(x)^T]$ denotes an embedding covariance for the probability $\mathcal {P}$.

Equation (26) can be induced by using the inner product form of $f$, defined as in Equation (10). In the term $\gamma I_m$ of Equation (27), $I_m$ is an $m$-by-$m$ identity matrix, and the term guarantees that the denominator of the preceding equations will not be zero. In Equation (28), Fisher GAN aims to find the embedding direction $v$ that maximizes the mean discrepancy while constraining it to lie on a hyperellipsoid, as $v^T(\frac{1}{2}\Sigma _w(p_{\mathrm{data}})+\frac{1}{2}\Sigma _w(p_{\mathrm{g}})+\gamma I_m)v=1$ represents. It naturally derives the Mahalanobis distance [20], which is defined as a distance between two distributions given a positive definite matrix such as a covariance matrix of each class. More importantly, Fisher GAN has advantages over WGAN. It does not impose a data-independent constraint such as weight clipping, which makes training too sensitive to the clipping value, and it has a computational benefit over the gradient penalty method in WGAN-GP [42], as the latter must compute gradients of the critic while Fisher GAN only computes covariances.
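As a rough illustration of the data-dependent constraint, the sketch below evaluates an empirical version of the Fisher ratio from critic outputs on real and generated samples: the mean difference divided by the square root of the averaged second moments. The helper name `fisher_ipm`, the scalar regularizer `gamma` (a simplified stand-in for the $\gamma I_m$ term), and the toy linear critic are assumptions made for this example only.

```python
import math
import random


def fisher_ipm(f_real, f_fake, gamma=1e-6):
    """Empirical Fisher ratio for critic outputs on real and generated samples.

    numerator:   mean f(x_real) - mean f(x_fake)
    denominator: sqrt(0.5 * mean f(x_real)^2 + 0.5 * mean f(x_fake)^2 + gamma)
    gamma is a small regularizer that keeps the denominator away from zero.
    """
    def mean(values):
        return sum(values) / len(values)

    numerator = mean(f_real) - mean(f_fake)
    second_moment = (0.5 * mean([v * v for v in f_real])
                     + 0.5 * mean([v * v for v in f_fake]))
    return numerator / math.sqrt(second_moment + gamma)


# Toy 1-D data (assumption) and a linear critic f(x) = scale * x.
random.seed(0)
real = [random.gauss(1.0, 1.0) for _ in range(1000)]
fake = [random.gauss(-1.0, 1.0) for _ in range(1000)]

ratio_scale_1 = fisher_ipm(real, fake, gamma=0.0)
ratio_scale_10 = fisher_ipm([10.0 * x for x in real],
                            [10.0 * x for x in fake], gamma=0.0)

print("Fisher ratio, critic scale 1: ", ratio_scale_1)
print("Fisher ratio, critic scale 10:", ratio_scale_10)
```

With `gamma=0.0`, rescaling the critic leaves the ratio unchanged, which gives some intuition for why the second-moment constraint can bound the critic without a data-independent device such as weight clipping.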

Comparison to f-divergence. The f-divergence family, which can be defined as in Equation (3) with a convex function $f$, has restrictions in that, as the dimension $d$ of the data space $x\in \mathcal {X}=R^d$ increases, the f-divergence becomes highly difficult to estimate, and the supports of two distributions tend to be unaligned, which drives the divergence value to infinity [117]. Even though Equation (6) derives a variational lower bound of Equation (3) that looks very similar to Equation (9), the tightness of the lower bound to the true divergence is not guaranteed in practice and can incur an incorrect, biased estimation.

Sriperumbudur et al. [117] showed that the only nontrivial intersection between the f-divergence family and the IPM family is total variation distance; therefore, the IPM family does not inherit the disadvantages of f-divergence. They also proved that IPM estimators using finite i.i.d. samples are more consistent in convergence, whereas the convergence of f-divergence is highly dependent on data distributions.

Consequently, employing an IPM family to measure the distance between two distributions is advantageous over using an f-divergence family because IPM families are not affected by the data dimension and consistently converge to the true distance between two distributions. Moreover, they do not diverge even when the supports of the two distributions are disjoint. In addition, Fisher GAN [84] is also equivalent to the chi-squared distance [131], which can be covered by an f-divergence framework. However, with a data-dependent constraint, the chi-squared distance can use IPM family characteristics, so it is more robust to the unstable training of f-divergence estimation.

2.1.3 Auxiliary Objective Functions. In Section 2.1.1 and Section 2.1.2, we presented various objective functions for adversarial learning. Concretely, through a minimax iterative algorithm, the discriminator estimates a specific kind of distance between $p_{\mathrm{g}}(x)$ and $p_{\mathrm{data}}(x)$. The generator reduces the estimated distance to make the two distributions closer. On top of this adversarial objective function, a GAN can be combined with other types of objective functions to help the generator and the discriminator stabilize during training or to perform additional tasks such as classification. In this section, we introduce auxiliary objective functions attached to the adversarial objective function, mainly a reconstruction objective function and a classification objective function.

Reconstruction Objective Function. Reconstruction trains a neural network so that its output image is the same as its original input image. The purpose of the reconstruction is to encourage the generator to preserve the contents of the original input image [13, 57, 102, 123, 145] or to adopt an autoencoder architecture for the discriminator [10, 128, 143]. For a reconstruction objective function, the L1 norm of the difference between the original input image and the output image is mostly used.

When the reconstruction objective term is used for the generator, the generator is trained to maintain the contents of the original input image. In particular, this operation is crucial for tasks where semantics and several modes of the image should be maintained, such as image translation [57, 145] (detailed in Section 4.1.1) and autoencoder reconstruction [13, 102, 123] (detailed in Section 3.2). The intuition behind using a reconstruction loss for the generator is that it guides the generator to restore the original input in a supervised learning manner. Without a reconstruction loss, the generator only needs to fool the discriminator, so there is no reason for it to maintain crucial parts of the input. As a supervised learning approach, the generator treats the original input image as label information, which reduces the space of possible mappings of the generator in the desired way.

There are some GAN variants using a reconstruction objective function for the discriminator, which is naturally derived from autoencoder architecture for the discriminator [10, 128, 143]. These GANs are based on the aspect that views the discriminator as an energy function and will be detailed in Section 2.2.3.

Classification Objective Functions. A cross-entropy loss for classification is widely added in many GAN applications where labeled data exist, especially semi-supervised learning and domain adaptation. The cross-entropy loss can be directly applied to the discriminator, which gives the discriminator an additional role of classification [90, 115]. Other approaches [2, 12, 67, 140] adopt a classifier explicitly, training the classifier jointly with the generator and the discriminator through a cross-entropy loss (detailed in Sections 4.3 and 4.4).

2.2 Architecture

The architecture of the generator and the discriminator is important as it highly influences the training stability and performance of the GAN. Various papers adopt several techniques, such as batch normalization, stacked architecture, and multiple generators and discriminators, to promote adversarial learning. We start with Deep Convolutional GAN (DCGAN) [100], which provides a remarkable benchmark architecture for other GAN variants.

2.2.1 DCGAN. DCGAN provides significant contributions to GAN in that its suggested Convolutional Neural Network (CNN) [62] architecture greatly stabilizes GAN training. DCGAN suggests an architecture guideline in which the generator is modeled with a transposed CNN [26], and the discriminator is modeled with a CNN with an output dimension of 1. It also proposes other techniques, such as batch normalization and specific types of activation functions for the generator and the discriminator, to help stabilize GAN training. As it addresses the instability of GAN training through architecture alone, it has become a baseline for modeling various GANs proposed later. For example, Im et al. [51] use a Recurrent Neural Network (RNN) [38] to generate images, motivated by DCGAN. By accumulating the DCGAN output images of each time step and combining them, it produces images of higher visual quality.

2.2.2 Hierarchical Architecture. In this section, we describe GAN variants that stack multiple generator-discriminator pairs. Commonly, these GANs generate samples in multiple stages to generate large-scale and high-quality samples. The generator of each stage is utilized or conditioned to help the next stage generator to better produce samples, as shown in Figure 3.

Fig. 3. Illustrations of (a) StackedGAN [49] and (b) Progressive GAN [56].

Hierarchy Using Multiple GAN Pairs. StackedGAN [49] attempts to learn a hierarchical representation by stacking several generator-discriminator pairs. For each layer of a generator stack, there exists the generator which produces level-specific representation, the corresponding discriminator training the generator adversarially at each level, and an encoder that generates the semantic features of real samples. Figure 3(a) shows a flowchart of StackedGAN. Each generator tries to produce a plausible feature representation that can deceive the corresponding discriminator, given previously generated features and the corresponding hierarchically encoded features.

In addition, Gang of GAN (GoGAN) [54] proposes to improve WGAN [5] by adopting multiple WGAN pairs. For each stage, it changes the Wasserstein distance to a margin-based Wasserstein distance, $[D(G(z))-D(x)+m]^{+}$, so the discriminator focuses on generated samples whose gap $D(x)-D(G(z))$ is less than $m$. In addition, GoGAN adopts a ranking loss for adjacent stages, which induces the generator in later stages to produce better results than the former generator by using a smaller margin at the next stage of the generation process. By progressively moving through stages, GoGAN aims to gradually reduce the gap between $p_{\mathrm{data}}(x)$ and $p_{\mathrm{g}}(x)$.

Hierarchy Using a Single GAN. Generating high-resolution images is highly challenging since a large-scale generated image is easily distinguished by the discriminator, so the generator often fails to be trained. Moreover, there is a memory issue in that we are forced to set a low mini-batch size due to the large size of neural networks. Therefore, some studies adopt hierarchical stacks of multiple generators and discriminators [27, 33, 50]. This strategy divides a large complex generator's mapping space step by step for each GAN pair, making it easier to learn to generate high-resolution images. However, Progressive GAN [56] succeeds in generating high-resolution images in a single GAN, making training faster and more stable.

Progressive GAN generates high-resolution images by stacking each layer of the generator and the discriminator incrementally, as shown in Figure 3(b). It starts training to generate a very low spatial resolution (e.g., $4\times 4$) image, and progressively doubles the resolution of generated images by adding layers to the generator and the discriminator incrementally. In addition, it proposes various training techniques, such as pixel normalization, equalized learning rate, and mini-batch standard deviation, all of which help GAN training to become more stable.

2.2.3 Autoencoder Architecture. An autoencoder is a neural network for unsupervised learning. It assigns its input as a target value, so it is trained in a self-supervised manner. The reason for self-reconstruction is to encode a compressed representation or features of the input, which is widely utilized with a decoder. Its usage will be detailed in Section 3.2.

In this section, we describe GAN variants that adopt an autoencoder as the discriminator. These GANs view the discriminator as an energy function, not a probabilistic model that distinguishes its input as real or fake. An energy model assigns a low energy for a sample lying near the data manifold (a high data density region), while assigning a high energy for a contrastive sample lying far away from the data manifold (a low data density region). These variants are mainly Energy-Based GAN (EBGAN), Boundary Equilibrium GAN (BEGAN), and Margin Adaptation GAN (MAGAN), all of which frame GAN as an energy model.

Since an autoencoder is utilized for the discriminator, a pixelwise reconstruction loss between an input and an output of the discriminator is naturally adopted for the discriminator's energy function and is defined as follows:

\begin{align} D(v)&=\Vert v-AE(v)\Vert , \end{align}
where $AE:R^{N_x}\rightarrow R^{N_x}$ denotes an autoencoder and $N_x$ represents the dimension of an input and an output of the autoencoder. Note that $D(v)$ in Equation (29) is the pixelwise L1 loss of the autoencoder, which maps an input $v\in R^{N_x}$ to a nonnegative real number.

A discriminator with an energy $D(v)$ is trained to give a low energy for a real $v$ and a high energy for a generated $v$. From this point of view, the generator produces a contrastive sample for the discriminator so that the discriminator is forced to be regularized near the data manifold. Simultaneously, the generator is trained to generate samples near the data manifold since the discriminator is encouraged to reconstruct only real samples. Table 4 presents the summarized details of BEGAN, EBGAN, and MAGAN. $L_G$ and $L_D$ indicate the generator loss and the discriminator loss, respectively, and $[t]^+=\max (0,t)$ represents a maximum value between $t$ and 0, which acts as a hinge.

Table 4. Autoencoder-Based GAN Variants (BEGAN, EBGAN, and MAGAN)

| Model | Objective function | Details |
|---|---|---|
| BEGAN [10] | $L_D=D(x)-k_tD(G(z))$; $L_G=D(G(z))$; $k_{t+1}=k_t+\alpha (\gamma D(x)-D(G(z)))$ | Wasserstein distance between loss distributions |
| EBGAN [143] | $L_D=D(x)+[m-D(G(z))]^{+}$; $L_G=D(G(z))$ | Total variation distance between $p_{\mathrm{data}}$ and $p_{\mathrm{g}}$ |
| MAGAN [128] | $L_D=D(x)+[m-D(G(z))]^{+}$; $L_G=D(G(z))$ | Margin $m$ is adjusted in EBGAN's training |

Boundary Equilibrium GAN (BEGAN) [10] relies on the observation that the pixelwise loss distribution approximately follows a normal distribution, by the Central Limit Theorem (CLT). It focuses on matching loss distributions through the Wasserstein distance and not on directly matching data distributions. In BEGAN, the discriminator has two roles: one is to reconstruct real samples sufficiently, and the other is to balance the generator and the discriminator via an equilibrium hyperparameter $\gamma$. As the update of $k_t$ in Table 4 indicates, $\gamma$ sets the target ratio between the loss of generated samples $D(G(z))$ and the loss of real samples $D(x)$. It is fed into the objective function to prevent the discriminator from easily winning over the generator; therefore, it balances the power of the two components. Figure 4 shows face images at varying $\gamma$ of BEGAN.

Fig. 4. Random images sampled from the generator at varying $\gamma \in \lbrace 0.3, 0.5, 0.7\rbrace$ of BEGAN [10]. Samples at lower $\gamma$ show similar images. At high $\gamma$ values, image diversity seems to increase, but the images contain some artifacts. Images from BEGAN [10].

Energy-Based GAN (EBGAN) [143] interprets the discriminator as an energy agent, which assigns low energy to real samples and high energy to generated samples. Through the $[m-D(G(z))]^{+}$ term in the objective function, the discriminator ignores generated samples with higher energy than $m$, so the generator attempts to synthesize samples that have lower energy than $m$ to fool the discriminator; this mechanism helps stabilize training. Margin Adaptation GAN (MAGAN) [128] takes a similar approach to EBGAN; the only difference is that MAGAN does not fix the margin $m$. MAGAN shows empirically that the energy of generated samples fluctuates near the margin $m$ and that, with a fixed margin, this phenomenon makes it difficult to adapt to the changing dynamics of the discriminator and the generator. MAGAN suggests that the margin $m$ should be adapted to the expected energy of real data; thus, $m$ is monotonically reduced so the discriminator reconstructs real samples more efficiently.
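The objectives summarized in Table 4 can be written down in a few lines. The sketch below is only an illustration under simplifying assumptions: energies are passed in as plain floating-point numbers (in practice they would be batch averages of $D(v)=\Vert v-AE(v)\Vert$), and the function names and numeric values are hypothetical.

```python
def hinge(t):
    """[t]^+ = max(0, t)."""
    return max(0.0, t)


def l1_energy(v, reconstruction):
    """D(v) = ||v - AE(v)||_1 for one sample, given the autoencoder output."""
    return sum(abs(a - b) for a, b in zip(v, reconstruction))


def ebgan_losses(d_real, d_fake, m):
    """EBGAN losses (L_D, L_G); MAGAN uses the same form with an adapted m."""
    return d_real + hinge(m - d_fake), d_fake


def began_step(d_real, d_fake, k_t, gamma, alpha):
    """BEGAN losses and balance update (L_D, L_G, k_{t+1}) as in Table 4."""
    loss_d = d_real - k_t * d_fake
    loss_g = d_fake
    k_next = k_t + alpha * (gamma * d_real - d_fake)
    return loss_d, loss_g, k_next


# A generated sample whose energy already exceeds the margin adds nothing
# to the EBGAN discriminator loss; one below the margin is penalized.
print(ebgan_losses(d_real=0.2, d_fake=1.5, m=1.0))
print(ebgan_losses(d_real=0.2, d_fake=0.4, m=1.0))

# The balance term k grows while gamma * D(x) > D(G(z)) and stays
# constant at the equilibrium gamma * D(x) = D(G(z)).
print(began_step(d_real=0.2, d_fake=0.05, k_t=0.0, gamma=0.5, alpha=0.001))
```

Writing the update this way makes the role of $\gamma$ visible: $k_t$ stops changing exactly when $D(G(z))=\gamma D(x)$.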

In addition, because the total variation distance belongs to an IPM family with the function class $\mathcal {F}=\lbrace f:\Vert f\Vert _\infty =\sup _x|f(x)|\le 1\rbrace$, it can be shown that EBGAN is equivalent to optimizing the total variation distance by using the fact that the discriminator's output for generated samples is only available for $0\le D\le m$ [5]. Because the total variation distance is the only intersection between IPM and f-divergence [117], it inherits some disadvantages of estimating f-divergence, as discussed by Arjovsky et al. [5] and Sriperumbudur et al. [117].

2.3 Obstacles in Training GAN

In this section, we discuss theoretical and practical issues related to the training dynamics of GAN. In Section 2.3.1, we first examine a theoretical problem of the standard GAN, which arises from the fact that the discriminator of GAN aims to approximate the JSD [36] between $p_{\mathrm{data}}(x)$ and $p_{\mathrm{\theta }}(x)$, and the generator of GAN tries to minimize the approximated JSD. In Section 2.3.2, we then discuss practical issues, especially the mode collapse problem, where the generator fails to capture the diversity of real samples and generates only specific types of samples.

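2.3.1 Theoretical Issues. As mentioned in Section 1, the traditional generative model maximizes the likelihood of $p_{\mathrm{g}}(x)$. It can be shown that maximizing the log likelihood is equivalent to minimizing the Kullback-Leibler Divergence (KLD) between $p_{\mathrm{data}}(x)$ and $p_{\mathrm{\theta }}(x)$ as the number $m$ of samples $x_i\sim p_{\mathrm{data}}(x)$ increases:

\begin{align} \theta ^{\star } &= \mathop {\text{argmax}}_\theta \lim _{m\rightarrow \infty }\frac{1}{m}\sum _{i=1}^{m}\log p_{\mathrm{\theta }}(x_i) \\ &= \mathop {\text{argmax}}_\theta \int _x p_{\mathrm{data}}(x)\log p_{\mathrm{\theta }}(x)\,dx \\ &= \mathop {\text{argmin}}_\theta \int _x \left(-p_{\mathrm{data}}(x)\log p_{\mathrm{\theta }}(x)+p_{\mathrm{data}}(x)\log p_{\mathrm{data}}(x)\right)dx \\ &= \mathop {\text{argmin}}_\theta \int _x p_{\mathrm{data}}(x)\log \frac{p_{\mathrm{data}}(x)}{p_{\mathrm{\theta }}(x)}\,dx \\ &= \mathop {\text{argmin}}_\theta \text{KLD}(p_{\mathrm{data}}||p_{\mathrm{\theta }}). \end{align}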

We note that we need to find an optimal parameter $\theta ^{\star }$ for maximizing likelihood; therefore, $\text{argmax}$ is used instead of $\max$. In addition, we replace our model's probability distribution $p_{\mathrm{\theta }}(x)$ with $p_{\mathrm{g}}(x)$ for consistency of notation.

Equation (31) holds because, as $m$ increases, the sample average converges to its expectation under $p_{\mathrm{data}}(x)$; the variance of the sample mean decreases, as described by the CLT [103]. Equation (32) can be induced because $p_{\mathrm{data}}(x)$ is not dependent on $\theta$, and Equation (34) follows from the definition of KLD. Intuitively, minimizing the KLD between these two distributions can be interpreted as approximating $p_{\mathrm{data}}(x)$ from a large number of real training samples because the minimum KLD is achieved when $p_{\mathrm{data}}(x) = p_{\mathrm{g}}(x)$.

Thus, the result of maximizing likelihood is equivalent to minimizing KLD$(p_{\mathrm{data}}||p_{\mathrm{g}})$ given infinite training samples. Because KLD is not symmetric, minimizing KLD$(p_{\mathrm{g}}||p_{\mathrm{data}})$ gives a different result. Figure 5 from Goodfellow [35] shows the different behaviors of the asymmetric KLD, where Figure 5(a) shows minimizing KLD$(p_{\mathrm{data}}||p_{\mathrm{g}})$ and Figure 5(b) shows minimizing KLD$(p_{\mathrm{g}}||p_{\mathrm{data}})$, given a mixture of two Gaussian distributions $p_{\mathrm{data}}(x)$ and a single Gaussian distribution $p_{\mathrm{g}}(x)$. $\theta ^{\star }$ in each figure denotes the minimizer of each asymmetric KLD. For Figure 5(a), the points where $p_{\textit{data}}\ne 0$ contribute to the value of KLD, and the other points, at which $p_{\textit{data}}$ is small, rarely affect the KLD. Thus, $p_{\mathrm{g}}$ becomes nonzero on the points where $p_{\textit{data}}$ is nonzero. Therefore, $p_{\theta ^{\star }}(x)$ in Figure 5(a) is averaged over all modes of $p_{\mathrm{data}}(x)$, as KLD$(p_{\mathrm{data}}||p_{\mathrm{g}})$ is more focused on covering all parts of $p_{\mathrm{data}}(x)$. In contrast, for KLD$(p_{\mathrm{g}}||p_{\mathrm{data}})$, the points at which $p_{\textit{data}}=0$ but $p_{\mathrm{g}}\ne 0$ contribute a high cost. This is why $p_{\theta ^{\star }}(x)$ in Figure 5(b) seeks an $x$ that is highly likely under $p_{\mathrm{data}}(x)$.
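The asymmetry described above can be checked numerically. The following sketch, written under assumed toy parameters (mixture modes at $\pm 3$ with standard deviation 0.5, densities evaluated on a grid over $[-10, 10]$), compares two candidate single Gaussians: one that is moment-matched to the mixture and covers both modes, and one that sits on a single mode. It does not reproduce Figure 5; it only evaluates both directions of the KLD for these two candidates.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of a univariate Gaussian N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def kl(p, q, dx):
    """Riemann-sum approximation of KLD(p || q) for densities on a grid."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0) * dx


dx = 0.01
grid = [-10.0 + i * dx for i in range(2001)]

# p_data: mixture of two Gaussians (toy assumption).
p_data = [0.5 * normal_pdf(x, -3.0, 0.5) + 0.5 * normal_pdf(x, 3.0, 0.5)
          for x in grid]
# Candidate 1: one broad Gaussian with the mixture's mean and variance.
q_cover = [normal_pdf(x, 0.0, math.sqrt(9.25)) for x in grid]
# Candidate 2: one narrow Gaussian placed on a single mode.
q_seek = [normal_pdf(x, 3.0, 0.5) for x in grid]

print("KLD(p_data || q_cover):", kl(p_data, q_cover, dx))
print("KLD(p_data || q_seek): ", kl(p_data, q_seek, dx))
print("KLD(q_cover || p_data):", kl(q_cover, p_data, dx))
print("KLD(q_seek || p_data): ", kl(q_seek, p_data, dx))
```

In this toy setting, KLD$(p_{\mathrm{data}}||p_{\mathrm{g}})$ is smaller for the covering candidate, whereas KLD$(p_{\mathrm{g}}||p_{\mathrm{data}})$ is smaller for the single-mode candidate, in line with the two behaviors discussed above.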

Fig. 5. Different behavior of asymmetric KLD. Images reproduced from Goodfellow [35].

JSD has an advantage over the two asymmetric KLDs in that it accounts for both mode dropping and sharpness. Unlike KLD, it never explodes to infinity, even if there exists a point $x$ that lies outside the support of $p_{\mathrm{g}}(x)$, where $p_{\mathrm{g}}(x)$ equals 0. Goodfellow et al. [36] showed that the discriminator $D$ aims to approximate $V(G, D^{\star })=2JSD(p_{\mathrm{data}}||p_{\mathrm{g}}) - 2\log 2$ for the fixed generator $G$ between $p_{\mathrm{g}}(x)$ and $p_{\mathrm{data}}(x)$, where $D^{\star }$ is an optimal discriminator and $V(G,D)$ is defined in Equation (1). Concretely, if $D$ is trained well so that it approximates $2JSD(p_{\mathrm{data}}||p_{\mathrm{g}}) - 2\log 2$ sufficiently, training $G$ minimizes the approximated distance. However, Arjovsky and Bottou [4] reveal mathematically why approximating $V(G, D^{\star })$ does not work well in practice.

Arjovsky and Bottou [4] proved why training GAN is fundamentally unstable. When the supports of two distributions are disjoint or lie on low-dimensional manifolds, there exists a perfect discriminator that classifies real and fake samples perfectly; thus, the gradient of the discriminator is 0 on the supports of the two distributions. It has been shown empirically and mathematically that $p_{\mathrm{data}}(x)$ and $p_{\mathrm{g}}(x)$ derived from $z$ are supported on low-dimensional manifolds in practice [86], and this fact causes $D$’s gradient transferred to $G$ to vanish as $D$ comes to perfectly classify real and fake samples. Because, in practice, $G$ is optimized with a gradient-based optimization method, $D$’s vanishing (or exploding) gradient hinders $G$ from learning enough through $D$’s gradient feedback. Moreover, even with the alternate $-\log D(G(z))$ objective proposed in Goodfellow et al. [36], minimizing the objective function is equivalent to simultaneously trying to minimize KLD$(p_{\mathrm{g}} || p_{\mathrm{data}})$ and maximize JSD$(p_{\mathrm{g}}||p_{\mathrm{data}})$. As these two objectives are opposites, the magnitude and variance of $D$’s gradients increase as training progresses, causing unstable training and making it difficult to converge to equilibrium. To summarize, training the GAN is theoretically guaranteed to converge if we use an optimal discriminator $D^{\star }$ that approximates JSD, but this theoretical result is not realized in practice when using gradient-based optimization. In addition to the $D$’s improper gradient problem discussed in this paragraph, there are two practical issues as to why GAN training suffers from nonconvergence.
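A small numerical check illustrates why disjoint supports are problematic for a JSD-based objective. In the sketch below, two uniform histograms on a one-dimensional grid stand in for $p_{\mathrm{data}}$ and $p_{\mathrm{g}}$; the bin layout and shift values are assumptions chosen only for illustration.

```python
import math


def kl(p, q):
    """Discrete KLD(p || q); terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def jsd(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def block(start, width=5, size=60):
    """Uniform histogram supported on `width` adjacent bins from `start`."""
    return [1.0 / width if start <= i < start + width else 0.0
            for i in range(size)]


p_data = block(0)
for shift in (0, 2, 5, 10, 40):
    print("shift", shift, "JSD", jsd(p_data, block(shift)))

print("log 2 =", math.log(2.0))
```

Once the two supports no longer overlap, the JSD stays at $\log 2$ however far the generated histogram is moved, so $2JSD - 2\log 2$ is constant at 0 and carries no information about the direction in which $p_{\mathrm{g}}$ should move.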

2.3.2 Practical Issues. First, we represent $G$ and $D$ as deep neural networks and learn their parameters rather than directly learning $p_{\mathrm{g}}(x)$ itself. Modeling with deep neural networks, such as the Multilayer Perceptron (MLP) or CNN, is advantageous in that the parameters of distributions can be easily learned through gradient descent using back-propagation. This does not require further distribution assumptions to produce an inference; rather, it can generate samples following $p_{\mathrm{g}}(x)$ through a simple feedforward pass. However, this practical implementation causes a gap with theory. Goodfellow et al. [36] provide a theoretical convergence proof based on the convexity of $V(G,D)$ in the probability density function. However, as we model $G$ and $D$ with deep neural networks, the convexity does not hold because we now optimize in the parameter space rather than in the function space (where the theoretical analysis is assumed to take place). Therefore, theoretical guarantees no longer hold in practice. For a further issue related to the parameterized neural network space, Arora et al. [6] discussed the existence of the equilibrium of GAN and showed that a large capacity of $D$ does not guarantee that $G$ generates all real samples perfectly, meaning that an equilibrium may not exist under a certain finite capacity of $D$.

A second problem is related to the iterative update algorithm suggested in Goodfellow et al. [36]. We wish to train $D$ until optimal for fixed $G$, but optimizing $D$ in such a manner is computationally expensive. Naturally, we must train $D$ for only a certain number of steps $k$, and that scheme causes confusion as to whether it is solving a minimax problem or a maximin problem because $D$ and $G$ are updated alternately by gradient descent in the iterative procedure. Unfortunately, the solutions of the minimax and maximin problems are not generally equal, as follows:

\begin{equation} \min _G \max _D V(G,D) \ne \max _D \min _G V(G,D). \end{equation}

With a maximin problem, the minimization over $G$ lies in the inner loop on the right side of Equation (35). $G$ is now forced to place its probability mass on the most likely point where the fixed nonoptimal $D$ believes it likely to be real rather than fake. After $D$ is updated to reject the generated fake one, $G$ attempts to move the probability mass to the other most likely point for fixed $D$. In practice, the real data distribution is normally multimodal, but, in such a maximin training procedure, $G$ does not cover all modes of the real data distribution because $G$ considers that picking only one mode is enough to fool $D$. Empirically, $G$ tends to cover only a single mode or a few modes of the real data distribution. This undesirable nonconvergent situation is called a mode collapse. A mode collapse occurs when many modes in the real data distribution are not represented in the generated samples, resulting in a lack of diversity in the generated samples. It can be simply considered as $G$ being trained to be a function that is not one-to-one, which produces a single output value for several input values.
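The inequality in Equation (35) can be illustrated with a finite game. In the sketch below, the value function is replaced by a small payoff table whose rows are generator choices and whose columns are discriminator choices; the table entries and function names are hypothetical and serve only to show that the order of the two optimizations matters.

```python
def minimax(V):
    """min over generator rows of the max over discriminator columns."""
    return min(max(row) for row in V)


def maximin(V):
    """max over discriminator columns of the min over generator rows."""
    return max(min(column) for column in zip(*V))


# Toy payoff table V[g][d] (assumption): each player has two pure strategies.
V = [[1.0, -1.0],
     [-1.0, 1.0]]

print("min_G max_D V =", minimax(V))
print("max_D min_G V =", maximin(V))
```

When the generator moves last (the maximin order), it can always pick the row that is best against the current fixed column, which mirrors the behavior described above where $G$ concentrates on whichever point the current $D$ accepts.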

Furthermore, the problem of the existence of the perfect discriminator we discussed in the preceding paragraph can be connected to a mode collapse. First, assume $D$ comes to output almost 1 for all real samples and 0 for all fake samples. Then, because $D$ produces values near 1 for all possible modes, there is no need for $G$ to represent all modes of real data probability. The theoretical and practical issues discussed in this section can be summarized as follows.

  • Because the supports of distributions lie on low-dimensional manifolds, there exists the perfect discriminator whose gradients vanish on every data point. Optimizing the generator may be difficult because it is not provided with any information from the discriminator.
  • GAN training optimizes the discriminator for the fixed generator and the generator for fixed discriminator simultaneously in one loop, but it sometimes behaves as if solving a maximin problem, not a minimax problem. It critically causes a mode collapse. In addition, the generator and the discriminator optimize the same objective function $V(G,D)$ in opposite directions, which is not usual in classical machine learning, and it often suffers from oscillations causing excessive training time.
  • The theoretical convergence proof does not apply in practice because the generator and the discriminator are modeled with deep neural networks, so optimization has to occur in the parameter space rather than in learning the probability density function itself.

2.3.3 Training Techniques to Improve GAN Training. As demonstrated in Sections 2.3.1 and 2.3.2, GAN training is highly unstable and difficult because the GAN is required to find a Nash equilibrium of a nonconvex minimax game with high dimensional parameters, but the GAN is typically trained with gradient descent [104]. In this section, we introduce some techniques to improve the training of GAN, to make it more stable and produce better results.

  • Feature matching [104]:
    This technique substitutes the discriminator's output in the objective function (Equation (1)) with the activations of an intermediate layer of the discriminator to prevent the generator from overfitting to the current discriminator. Feature matching does not aim at the discriminator's output; rather, it guides the generator to match the statistics or features of real training data in an effort to stabilize training.
  • Label smoothing [104]:
    As mentioned previously, $V(G,D)$ is a binary cross-entropy loss whose real data label is 1 and its generated data label is 0. However, since a deep neural network classifier tends to output a class probability with extremely high confidence [35], label smoothing encourages a deep neural network classifier to produce a softer estimation by assigning label values lower than 1. Importantly, for GAN, label smoothing has to be made for labels of real data, not for labels of fake data, since, if not, the discriminator can act incorrectly [35].
  • Spectral normalization [81]:
    As we see in Section 2.1.2, WGAN and Improved WGAN impose Lipschitz continuity on the discriminator, which bounds the magnitude of the function's gradient. Spectral normalization aims to impose a Lipschitz condition on the discriminator in a different manner. Instead of adding a regularizing term or weight clipping, spectral normalization constrains the spectral norm of each layer of the discriminator, where the spectral norm is the largest singular value of a given matrix. Since a neural network is a composition of multiple layers, spectral normalization normalizes the weight matrices of each layer to make the whole network Lipschitz continuous. In addition, compared to the gradient penalty method proposed in Improved WGAN, spectral normalization is computationally beneficial since gradient penalty regularization must compute the gradient of the discriminator with respect to its input, whereas spectral normalization only rescales the weight matrices.
  • PatchGAN [52]:
    PatchGAN is not a technique for stabilizing GAN training. However, PatchGAN greatly helps to generate sharper results in various applications such as image translation [52, 145]. Rather than producing a single output from the discriminator, which is a probability for its input's authenticity, PatchGAN [52] makes the discriminator produce a grid output. For one element of the discriminator's output, its receptive field in the input image should be one small local patch in the input image, so the discriminator aims to distinguish each patch in the input image. To achieve this, one can remove the fully connected layer in the last part of the discriminator in the standard GAN. As a matter of fact, PatchGAN is equivalent to adopting multiple discriminators for every patch of the image, making the discriminator help the generator to represent sharper images locally.
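For the spectral normalization item in the list above, the core operation is dividing a weight matrix by its largest singular value. The following sketch estimates that value by power iteration in plain Python; the function names, the number of iterations, and the example matrix are assumptions for illustration, and the sketch is not meant to reproduce the exact procedure of [81].

```python
import math
import random


def matvec(W, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def transpose(W):
    return [list(column) for column in zip(*W)]


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def spectral_norm(W, iters=100, seed=0):
    """Estimate the largest singular value of W by power iteration."""
    rng = random.Random(seed)
    v = normalize([rng.gauss(0.0, 1.0) for _ in W[0]])
    Wt = transpose(W)
    for _ in range(iters):
        u = normalize(matvec(W, v))   # left singular direction estimate
        v = normalize(matvec(Wt, u))  # right singular direction estimate
    return math.sqrt(sum(x * x for x in matvec(W, v)))


def spectrally_normalize(W, iters=100, seed=0):
    """Return W / sigma(W), whose largest singular value is about 1."""
    sigma = spectral_norm(W, iters=iters, seed=seed)
    return [[w / sigma for w in row] for row in W]


W = [[1.0, 2.0],
     [3.0, 4.0]]
print("estimated spectral norm:", spectral_norm(W))
print("after normalization:    ", spectral_norm(spectrally_normalize(W)))
```

Since a linear layer with spectral norm at most 1 is 1-Lipschitz, applying this rescaling to every layer (together with 1-Lipschitz activations) bounds the Lipschitz constant of the whole discriminator without evaluating any gradient with respect to the input.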

2.4 Methods to Address Mode Collapse in GAN

Mode collapse, which indicates the failure of GAN to represent various types of real samples, is the main catastrophic problem of a GAN. From a perspective of the generative model, mode collapse is a critical obstacle for a GAN to be utilized in many applications since the diversity of generated data needs to be guaranteed to represent the data manifold concretely. Unless multiple modes of real data distribution are represented by the generative model, such a model would be meaningless to use.

Figure 6 shows a mode collapse problem for a toy example. A target distribution $p_{\mathrm{data}}$ has multiple modes, which is a Gaussian mixture in two-dimensional space [79]. Figures in the lower row represent learned distribution $p_{\mathrm{g}}$ as the training progresses. As we see in Figure 6, the generator does not cover all possible modes of the target distribution. Rather, the generator covers only a single mode, switching between different modes as the training goes on. The generator learns to produce a single mode, believing that it can fool the discriminator. The discriminator counteracts the generator by rejecting the chosen mode. Then, the generator switches to another mode which is believed to be real. This training behavior proceeds and thus the convergence to a distribution covering all the modes is highly difficult.

Fig. 6. An illustration of the mode collapse problem. Images from Unrolled GAN [79].

Fig. 7. Illustrations of (a) MAD-GAN [33] and (b) MRGAN [13].

In this section, we present several studies that suggest methods to overcome the mode collapse problem. In Section 2.4.1, we demonstrate studies that exploit new objective functions to tackle a mode collapse, and, in Section 2.4.2, we introduce studies which propose architecture modifications. Last, in Section 2.4.3, we describe mini-batch discrimination, which is a notable and practically effective technique for the mode collapse problem.

2.4.1 Objective Function Methods. Unrolled GAN [79] manages mode collapse with a surrogate objective function for the generator, which helps the generator predict the discriminator's response by unrolling the discriminator update $k$ steps for the current generator update. The standard GAN [36] updates the discriminator first for the fixed generator and then updates the generator for the updated discriminator. Unrolled GAN differs in that it updates the generator based on a discriminator that has been updated $k$ steps ahead, given the current generator update, which aims to capture how the discriminator responds to that update. In other words, when the generator is updated, it unrolls the discriminator's update steps to consider the discriminator's response $k$ steps into the future, while the discriminator itself is updated in the same manner as in the standard GAN. Since the generator is given more information about the discriminator's response, the generator spreads its probability mass to make it more difficult for the discriminator to react to the generator's behavior. This can be seen as empowering the generator because only the generator's update is unrolled, but it seems fair in that the discriminator cannot be trained to be optimal in practice due to an infeasible computational cost, while the generator is theoretically assumed to obtain enough information from the optimal discriminator.

Deep Regret Analytic GAN (DRAGAN) [61] suggests that a mode collapse occurs due to the existence of a spurious local Nash equilibrium in the nonconvex problem. DRAGAN addresses this issue by proposing constraining gradients of the discriminator around the real data manifold. It adds a gradient penalizing term which biases the discriminator to have a gradient norm of 1 around the real data manifold. This method attempts to create linear functions by making gradients have a norm of 1. Linear functions near the real data manifolds form a convex function space, which imposes a global unique optimum. Note that this gradient penalty method is also applied to WGAN-GP [42]. They differ in that DRAGAN imposes gradient penalty constraints only to local regions around the real data manifold while Improved WGAN imposes gradient penalty constraints almost everywhere around the generated data manifold and real data manifold, which leads to higher constraints than those in DRAGAN.

In addition, EBGAN proposes a repelling regularizer loss term to the generator, which encourages feature vectors in a mini-batch to be orthogonalized. This term is utilized with cosine similarities at a representation level of an encoder and forces the generator not to produce samples falling in a few modes.
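The repelling regularizer described above can be sketched as the mean squared cosine similarity over all distinct pairs of feature vectors in a mini-batch. The function name `pulling_away_term` and the toy inputs are assumptions made for this example; the rows of `S` stand for the encoder representations of generated samples.

```python
import numpy as np

def pulling_away_term(S, eps=1e-12):
    """Mean squared cosine similarity over all distinct pairs of rows of S."""
    n = S.shape[0]
    S_unit = S / (np.linalg.norm(S, axis=1, keepdims=True) + eps)
    cos_sq = (S_unit @ S_unit.T) ** 2
    # Exclude the diagonal, which compares each sample with itself.
    off_diag = cos_sq - np.diag(np.diag(cos_sq))
    return off_diag.sum() / (n * (n - 1))

# Mutually orthogonal features: the regularizer is at its minimum.
pt_orthogonal = pulling_away_term(np.eye(4))
# Identical features (a collapsed batch): the regularizer is at its maximum.
pt_collapsed = pulling_away_term(np.ones((4, 8)))
```

Minimizing this term pushes the representations of a mini-batch toward orthogonality, which penalizes a generator whose samples all fall into the same few modes.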

2.4.2 Architecture Methods. Multiagent Diverse GAN (MAD-GAN) [33] adopts multiple generators for one discriminator to capture the diversity of generated samples, as shown in Figure 7(a). To induce each generator to move toward different modes, it adopts a cosine similarity value as an additional objective term to make each generator produce dissimilar samples. This technique is inspired by the fact that, as images from two different generators become similar, a higher similarity value is produced; thus, optimizing this objective term may make each generator move toward a different mode. In addition, because each generator produces different fake samples, the discriminator's objective function adopts a soft-max cross-entropy loss term to distinguish real samples from fake samples generated by multiple generators.

Mode Regularized GAN (MRGAN) [13] assumes that mode collapse occurs because the generator is not penalized for missing modes. To address mode collapse, MRGAN adds an encoder which maps the data space $\mathcal {X}$ into the latent space $\mathcal {Z}$. Motivated by the disjoint manifolds mentioned in Section 2.3.1, MRGAN first tries to match the generated manifold and the real data manifold using an encoder. For manifold matching, the discriminator $D_M$ distinguishes real samples $x$ from their reconstructions $G\circ E(x)$, and the generator is trained with $D_M(G\circ E(x))$ and a geometric regularizer $d(x, G\circ E(x)),$ where $d$ can be any metric in the data space. The geometric regularizer is used to reduce the geometric distance in the data space, helping the generated manifold move toward the real data manifold and allowing the generator and the encoder to learn how to reconstruct real samples. For penalizing missing modes, MRGAN adopts another discriminator $D_D$ which distinguishes $G(z)$ as fake and $G\circ E(x)$ as real. Since MRGAN matches manifolds in advance with a geometric regularizer, this mode diffusion step can distribute probability mass even to minor modes of the data space with the help of $G\circ E(x)$. An outline of MRGAN is illustrated in Figure 7(b), where $R$ denotes a geometric regularizing term (reconstruction).

2.4.3 Mini-Batch Discrimination. Mini-batch discrimination [104] allows the discriminator to look at multiple examples in a mini-batch to avoid a mode collapse of the generator. The basic idea is that it encourages the discriminator to allow diversity directly in a mini-batch, not considering independent samples in isolation. To make the discriminator deal with not only each example, but also with the correlation between other examples in a mini-batch simultaneously, it models a mini-batch layer in an intermediate layer of the discriminator, which calculates L1-distance-based statistics of samples in a mini-batch. By adding such statistics to the discriminator, each example in a mini-batch can be estimated by how far or close to other examples in a mini-batch it is, and this information can be internally utilized by the discriminator, which helps the discriminator reflect samples’ diversity to the output. As for the generator, it tries to create statistics similar to those of real samples in the discriminator by the adversarial learning procedure.
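A minimal sketch of such a mini-batch layer is given below. It assumes intermediate features `f` of shape (N, A) and a projection tensor `T` of shape (A, B, C); in [104] the tensor is learned, whereas here it is random and all sizes are hypothetical. For each sample, the layer sums the closeness $\exp(-\Vert \cdot \Vert _{1})$ to every sample in the batch (including itself) and appends the result to the features.

```python
import numpy as np

def minibatch_discrimination(f, T):
    """f: (N, A) features; T: (A, B, C) tensor. Returns an (N, A + B) array."""
    M = np.einsum("na,abc->nbc", f, T)                 # (N, B, C)
    l1 = np.abs(M[:, None] - M[None, :]).sum(axis=-1)  # (N, N, B) pairwise L1
    closeness = np.exp(-l1)                            # 1 when rows coincide
    o = closeness.sum(axis=1)                          # (N, B) batch statistics
    return np.concatenate([f, o], axis=1)

rng = np.random.default_rng(0)
T = rng.standard_normal((16, 3, 4))                    # random stand-in for a learned tensor
f_diverse = rng.standard_normal((5, 16))
f_collapsed = np.tile(f_diverse[:1], (5, 1))           # five copies of one sample

out_diverse = minibatch_discrimination(f_diverse, T)
out_collapsed = minibatch_discrimination(f_collapsed, T)
```

For the collapsed batch, every appended statistic equals the batch size, while a diverse batch yields smaller values; this difference is the signal the discriminator can use to detect a lack of diversity.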

In addition, Progressive GAN [56] proposed a simplified version of mini-batch discrimination which uses the mean of the standard deviation for each feature (channel) in each spatial location over the mini-batch. Unlike the original method, it adds no trainable parameters for projecting the mini-batch statistics, yet it remains effective against mode collapse. To summarize, mini-batch discrimination reflects samples’ diversity to the discriminator, helping the discriminator determine whether its input batch is real or fake.
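The simplified statistic can be sketched as follows, assuming feature maps of shape (N, C, H, W). The function name `minibatch_stddev` and the tensor sizes are assumptions for this example: the standard deviation is taken over the mini-batch for each feature and location, averaged into a single value, and appended as one constant feature map.

```python
import numpy as np

def minibatch_stddev(x):
    """x: (N, C, H, W). Appends one constant map holding the mean batch std."""
    std = x.std(axis=0)            # (C, H, W): std over the mini-batch
    mean_std = std.mean()          # single scalar summary
    n, _, h, w = x.shape
    extra = np.full((n, 1, h, w), mean_std)
    return np.concatenate([x, extra], axis=1)

rng = np.random.default_rng(0)
diverse = minibatch_stddev(rng.standard_normal((8, 3, 4, 4)))
collapsed = minibatch_stddev(np.ones((8, 3, 4, 4)))   # identical samples
```

The appended map is zero for a batch of identical samples and positive otherwise, and no trainable parameter is involved.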


Latent space, also called an embedding space, is the space in which a compressed representation of data lies. If we wish to change or reflect some attributes of an image (e.g., a pose, an age, an expression, or even an object of an image), modifying images directly in the image space would be highly difficult because the manifolds where the image distributions lie are high-dimensional and complex. Rather, manipulating in the latent space is more tractable because the latent representation expresses specific features of the input image in a compressed manner. In this section, we investigate how GAN handles latent space to represent target attributes and how a variational approach can be combined with the GAN framework.

3.1 Latent Space Decomposition

The input latent vector $z$ of the generator is so highly entangled and unstructured that we do not know which vector point contains the specific representations we want. From this point of view, several papers suggest decomposing the input latent space into an input vector $c$, which contains the meaningful information, and a standard input latent vector $z$. These approaches can be categorized into supervised and unsupervised methods.

3.1.1 Supervised Methods. Supervised methods require a pair of data and corresponding attributes such as the data's class label. The attributes are generally used as an additional input vector, as explained here.

Conditional GAN (CGAN) [80] imposes a condition of additional information, such as a class label, to control the data generation process in a supervised manner by adding an information vector $c$ to the generator and discriminator. The generator takes not only a latent vector $z$ but also an additional information vector $c,$ and the discriminator takes samples and the information vector $c$ so that it distinguishes fake samples given $c$. By doing so, CGAN can control which digit is generated, which is impossible for standard GAN.
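One simple way to feed the information vector to both networks is to concatenate a one-hot label to their inputs, as sketched below. The helper names, the latent size of 100, the ten classes, and the flattened 784-dimensional samples are all assumptions chosen for illustration.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Encode integer class labels as one-hot information vectors c."""
    return np.eye(num_classes)[labels]

def generator_input(z, labels, num_classes):
    """Concatenate the latent vector z with the condition c."""
    return np.concatenate([z, one_hot(labels, num_classes)], axis=1)

def discriminator_input(x_flat, labels, num_classes):
    """Concatenate a flattened sample x with the same condition c."""
    return np.concatenate([x_flat, one_hot(labels, num_classes)], axis=1)

rng = np.random.default_rng(0)
labels = np.array([3, 3, 7, 7])             # desired classes for four samples
z = rng.standard_normal((4, 100))
x_flat = rng.standard_normal((4, 784))      # stand-in for flattened images

g_in = generator_input(z, labels, 10)
d_in = discriminator_input(x_flat, labels, 10)
```

At generation time, changing only `labels` while keeping `z` fixed selects the class of the output.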

Auxiliary Classifier GAN (AC-GAN) [90] takes a somewhat different approach than CGAN. It is trained by maximizing the log-likelihood of class labels (equivalently, minimizing a cross-entropy classification loss) together with the adversarial loss. The discriminator produces not only the probability that the input samples are from the real dataset but also the probability over the class labels. Figure 8 outlines CGAN, CGAN with a projection discriminator, and AC-GAN, where $CE$ denotes the cross-entropy loss for the classification.

Fig. 8. Illustrations of (a) CGAN [80], (b) CGAN with a projection discriminator [82], and (c) AC-GAN [90].

In addition, Plug-and-Play Generative Networks (PPGN) [88] are another type of generative model that produces data under a given condition. Unlike the other methods described earlier, PPGN does not use the labeled attributes while training the generator. Instead, PPGN learns the auxiliary classifier and the generator producing real-like data via adversarial learning independently. Then, PPGN produces the data for the given condition using the classifier and the generator through an MCMC-based sampler. An important characteristic of PPGNs is that they can work as plug and play. When a classifier pretrained with the same data but different labels is given to the generator, the generator can synthesize samples under that condition without further training.

3.1.2 Unsupervised Methods. Different from the supervised methods just discussed, unsupervised methods do not exploit any labeled information. Thus, they require an additional algorithm to disentangle the meaningful features from the latent space.

InfoGAN [15] decomposes an input noise vector into a standard incompressible latent vector $z$ and another latent variable $c$ to capture salient semantic features of real samples. Then, InfoGAN maximizes the amount of mutual information between $c$ and a generated sample $G(z,c)$ to allow $c$ to capture some noticeable features of real data. In other words, the generator takes the concatenated input $(z,c)$ and maximizes the mutual information, $I(c;G(z,c))$ between a given latent code $c$ and the generated samples $G(z,c)$ to learn meaningful feature representations. However, evaluating mutual information $I(c;G(z,c))$ needs to directly estimate the posterior probability $p(c|x)$, which is intractable. InfoGAN, thus, takes a variational approach which replaces a target value $I(c;G(z,c))$ by maximizing a lower bound.
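For reference, the lower bound can be sketched as follows, where $Q(c|x)$ is an auxiliary distribution that approximates the posterior $p(c|x)$ and $H(c)$ is the entropy of the latent code (the symbols $Q$ and $H$ are notation assumed here for illustration):

\begin{equation} I(c;G(z,c)) \ge \mathbb {E}_{c\sim p(c),\, x\sim G(z,c)}\left[\log Q(c|x)\right] + H(c). \end{equation}

Since $H(c)$ is constant for a fixed prior $p(c)$, maximizing the bound amounts to training $Q$ to recover the code $c$ from the generated sample, and the bound becomes tight as $Q(c|x)$ approaches the true posterior.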

Both CGAN and InfoGAN learn conditional probability $p(x|c)$ given a certain condition vector $c$; however, they are dissimilar regarding how they handle condition vector $c$. In CGAN, additional information $c$ is assumed to be semantically known (such as class labels), so we have to provide $c$ to the generator and the discriminator during the training phase. On the other hand, $c$ is assumed to be unknown in InfoGAN, so we take $c$ by sampling from prior distribution $p(c)$ and control the generating process based on $I(c;G(z,c))$. As a result, the automatically inferred $c$ in InfoGAN has much more freedom to capture certain features of real data than $c$ in CGAN, which is restricted to known information.

Semi-supervised InfoGAN (ss-InfoGAN) [116] takes advantage of both supervised and unsupervised methods. It introduces some label information in a semi-supervised manner by decomposing latent code $c$ into two parts, $c=c_{ss}\dot{\bigcup } c_{us}$. Similar to InfoGAN, ss-InfoGAN attempts to learn the semantic representations from the unlabeled data by maximizing the mutual information between the generated data and the unsupervised latent code $c_{us}$. In addition, the semi-supervised latent code $c_{ss}$ is trained to contain features that we want by using labeled data. For InfoGAN, we cannot predict which feature will be learned from the training, while ss-InfoGAN uses labeled data to control the learned feature by maximizing two mutual information terms: one between $c_{ss}$ and the labeled real data to guide $c_{ss}$ to encode label information $y$, and the other between the generated data and $c_{ss}$. By combining the supervised and unsupervised methods, ss-InfoGAN learns the latent code representation more easily with a small subset of labeled data than the fully unsupervised InfoGAN.

3.1.3 Examples. Decomposing the latent space into meaningful attributes within the GAN framework has been exploited for various tasks. StackGAN [142] was proposed for text-to-image generation that synthesizes corresponding images given text descriptions, as shown in Figure 9. StackGAN synthesizes images conditioned on text descriptions in a two-stage process: a low-level feature generation (stage 1) and painting details from a given generated image at stage 1 (stage 2). The generators of each stage are trained adversarially given text embedding information $\varphi _t$ from the text $t$. Notably, rather than directly concatenating $\varphi _t$ to $z,$ as CGAN does, StackGAN proposes a conditional augmentation technique which samples conditioned text latent vectors $c$ from the Gaussian distribution $\mathcal {N}(\mu (\varphi _t), \Sigma (\varphi _t))$. By sampling the text embedding vector $c$ from $\mathcal {N}(\mu (\varphi _t), \Sigma (\varphi _t))$, this technique attempts to augment more training pairs given limited amounts of image-text paired data.
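The sampling step of the conditional augmentation technique can be sketched as below, assuming a diagonal covariance so that $c = \mu + \sigma \odot \epsilon$ with $\epsilon \sim \mathcal {N}(0, I)$. The arrays `mu` and `log_var` stand for the outputs of hypothetical layers applied to $\varphi _t$, and the dimensions are chosen only for illustration.

```python
import numpy as np

def conditioning_augmentation(mu, log_var, rng):
    """Sample c ~ N(mu, diag(exp(log_var))) as c = mu + sigma * eps."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

rng = np.random.default_rng(0)
mu = np.linspace(-1.0, 1.0, 8)              # stand-in for mu(phi_t)
log_var = np.full(8, np.log(0.25))          # stand-in for a diagonal Sigma(phi_t)

# Many different conditioning vectors c for the same text embedding.
samples = conditioning_augmentation(
    np.tile(mu, (20000, 1)), np.tile(log_var, (20000, 1)), rng
)
```

Each call yields a slightly different $c$ for the same text, which is how the technique produces additional training pairs from one image-text pair.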

Fig. 9. A text-to-image synthesis of StackGAN [142]. StackGAN shows higher output diversity than other text-to-image models. Images from Goodfellow [35].

Fig. 10. Illustrations of (a) ALI [25], BiGAN [24], and (b) AGE [123].

Fig. 11. Illustrations of (a) VAEGAN [64] and (b) $\alpha$-GAN [102].

Semantically Decomposing GAN (SD-GAN) [23] tries to generate a face having different poses by directly decomposing $z$ into the identity and pose parts of a face image and then sampling each latent variable separately from the independent latent distributions. Notably, we do not need to restrict the attributes of an image to facial characteristics. Attributes can be not only characteristics of a face but also scenery features such as the weather. Karacan et al. [55] synthesized outdoor images having specific scenery attributes using the CGAN framework, and they also concatenated an attribute latent vector to $z$ for the generator.

3.2 With an Autoencoder

In this section, we explore efforts combining an autoencoder structure into the GAN framework. An autoencoder structure consists of two parts: an encoder that compresses data $x$ into latent variable $z$, and a decoder that reconstructs encoded data into the original data $x$. This structure is suitable for stabilizing GAN because it learns the posterior distribution $p(z|x)$ to reconstruct data $x$, which reduces mode collapse caused by the lack of GAN's inference ability to map data $x$ to $z$. An autoencoder can also help manipulations at the abstract level become possible by learning a latent representation of a complex, high-dimensional data space with an encoder $\mathcal {X}\rightarrow \mathcal {Z}$ where $\mathcal {X}$ and $\mathcal {Z}$ denote the data space and the latent space, respectively. Learning a latent representation may make it easier to perform complex modifications in the data space through interpolation or conditional concatenation in the latent space. We demonstrate how GAN variants learn in the latent space in Section 3.2.1 and extend our discussion to proposed ideas which combine a Variational Autoencoder (VAE) framework, another generative model with an autoencoder, with GAN in Section 3.2.2.

3.2.1 Learning the Latent Space. Adversarially Learned Inference (ALI) [25] and Bidirectional GAN (BiGAN) [24] learn latent representations within the GAN framework combined with an encoder. As seen in Figure 10(a), they learn the joint probability distribution of data $x$ and latent $z,$ while GAN learns only the data distribution directly. The discriminator receives samples from the joint space of the data $x$ and the latent variable $z$ and discriminates joint pairs $(G(z), z)$ and $(x, E(x))$ where $G$ and $E$ represent a decoder and an encoder, respectively. By training an encoder and a decoder together, they can learn an inference $\mathcal {X}\rightarrow \mathcal {Z}$ while still being able to generate sharp, high-quality samples.

Ulyanov et al. [123] proposed a slightly peculiar method in which adversarial learning is run between the generator $G$ and the encoder $E$ instead of the discriminator. They showed that adversarial learning in the latent space using the generator and the encoder theoretically results in the perfect generator. Based on their theorems, the generator minimizes the divergence between $E(G(z))$ and a prior $z$ in the latent space, while the encoder maximizes the divergence. By doing so, they implemented adversarial learning together with tractable inference and disentangled the latent space without additional computational cost.

In addition, they added reconstruction loss terms in each space to guarantee each component to be reciprocal. These loss terms can be interpreted as requiring each function to be a one-to-one mapping function so as not to fall into mode collapse, as in MRGAN. Recall that a geometric regularizing term of MRGAN [13] also aims to incorporate a supervised training signal which guides the reconstruction process to the correct location. Figure 10(b) shows an outline of Ulyanov et al. [123], where $R$ in the red rectangle denotes the reconstruction loss term between the original and reconstructed samples.

3.2.2 Variational Autoencoder. VAE [22] is a popular generative model using an autoencoder framework. Assuming some unobserved latent variable $z$ affects a real sample $x$ in an unknown manner, VAE essentially finds the maximum of the marginal likelihood $p_\theta (x)$ for the model parameter $\theta$. VAE addresses the intractability of $p_\theta (x)$ by introducing a variational lower bound, learning the mapping of $\mathcal {X} \rightarrow \mathcal {Z}$ with an encoder and $\mathcal {Z}\rightarrow \mathcal {X}$ with a decoder. Specifically, VAE assumes a prior knowledge $p(z)$ and approximated posterior probability modeled by $Q_\phi (z|x)$ to be a standard normal distribution and a normal distribution with diagonal covariance, respectively, for the tractability. More explicitly, VAE learns to maximize $p_\theta (x)$ where a variational lower bound of the marginal log-likelihood $\log p_\theta (x)$ can be derived as follows:

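\begin{equation} \log p_\theta (x) = \int _z Q_\phi (z|x)\log p_\theta (x)dz = \int _z Q_\phi (z|x) \log \left(\frac{p_\theta (x,z)}{p_\theta (z|x)}\frac{Q_\phi (z|x)}{Q_\phi (z|x)}\right)dz \end{equation}
\begin{equation} = \int _z Q_\phi (z|x)\left(\log \left(\frac{Q_\phi (z|x)}{p_\theta (z|x)}\right)+\log \left(\frac{p_\theta (x,z)}{Q_\phi (z|x)}\right)\right)dz \end{equation}
\begin{equation} = KL(Q_\phi (z|x)||p_\theta (z|x)) + \mathbb {E}_{Q_\phi (z|x)}\left[\log \frac{p_\theta (x,z)}{Q_\phi (z|x)}\right]. \end{equation}
As $KL(Q_\phi (z|x)||p_\theta (z|x))$ is always nonnegative, a variational lower bound $L(\theta , \phi ; x)$ of Equation (42) can be derived as follows:
\begin{equation} \log p_\theta (x) \ge \mathbb {E}_{Q_\phi (z|x)}\left[\log p_\theta (x,z) - \log Q_\phi (z|x)\right] \end{equation}
\begin{equation} = -KL(Q_\phi (z|x)||p_\theta (z)) + \mathbb {E}_{Q_\phi (z|x)}\left[\log p_\theta (x|z)\right] \end{equation}
\begin{equation} = L(\theta , \phi ;x), \end{equation}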
where $p_\theta (x|z)$ is a decoder that generates sample $x$ given the latent $z,$ and $Q_\phi (z|x)$ is an encoder that generates the latent code $z$ given sample $x$.

Maximizing $L(\theta , \phi ;x)$ increases the marginal likelihood $p_\theta (x)$. The first term can be interpreted as leading an encoder $Q_\phi (z|x)$ to be close to a prior probability $p_\theta (z)$. It can be calculated analytically because $Q_\phi (z|x)$ and the prior probability are assumed to follow a Gaussian distribution. The second term can be estimated from the sample using a reparameterization method. To sum up, VAE learns by tuning the parameter of the encoder and the decoder to maximize the lower bound $L(\theta , \phi ;x)$.
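The two terms can be sketched numerically as follows. The KL term uses the closed form for a diagonal Gaussian encoder against a standard normal prior, and the reconstruction term is a Monte Carlo estimate obtained through the reparameterization $z = \mu + \sigma \odot \epsilon$. The linear decoder, the unit-variance Gaussian likelihood, and all sizes are assumptions made only for this example.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * np.sum(mu ** 2 + np.exp(log_var) - log_var - 1.0)

def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def lower_bound(x, mu, log_var, decode, rng, n_samples=64):
    """Estimate L = -KL(Q || prior) + E_Q[log p(x|z)] for p(x|z) = N(decode(z), I)."""
    log_lik = 0.0
    for _ in range(n_samples):
        z = reparameterize(mu, log_var, rng)
        residual = x - decode(z)
        log_lik += -0.5 * np.sum(residual ** 2) - 0.5 * x.size * np.log(2.0 * np.pi)
    return -kl_to_standard_normal(mu, log_var) + log_lik / n_samples

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 2))             # hypothetical linear decoder weights
decode = lambda z: W @ z
x = rng.standard_normal(4)

kl_zero = kl_to_standard_normal(np.zeros(2), np.zeros(2))
kl_shift = kl_to_standard_normal(np.array([1.0, 0.0]), np.zeros(2))
bound = lower_bound(x, np.zeros(2), np.zeros(2), decode, rng)
```

The KL term vanishes when the encoder output equals the prior and grows as the encoder moves away from it, which is the regularizing role of the first term.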

Hybrid with GAN. Recently, several approaches to incorporate each advantage of VAE and GAN have been proposed. Although VAE generates blurry images, it suffers less from the mode collapse problem because an autoencoder encourages all real samples to be reconstructed. GAN generates sharper images than VAE and does not need further constraints on the model, but GAN suffers from mode collapse, as mentioned in Section 2.4. In this section, we address two studies which attempt to combine VAE and GAN into one framework.

VAEGAN [64] combined VAE with GAN by assigning GAN's generator to a decoder. Its objective function combined VAE's objective function with an adversarial loss term to produce sharp images while maintaining encoding ability for the latent space. Notably, it replaced the reconstruction of $x$ in Equation (42) with the intermediate features of the discriminator to capture more perceptual similarity of real samples. Figure 11(a) shows an outline of VAEGAN where the discriminator $D$ takes one real sample and two fake samples, where one is sampled from an encoded latent space ($z_{VAE}$) and the other from a prior distribution ($z$).

Variational approaches for autoencoding GAN ($\alpha$-GAN) [102] proposes adopting discriminators for the variational inference and transforms Equation (41) into a more GAN-like formulation. The most negative aspect of VAE is that we have to constrain the distributional form of $Q_\phi (z|x)$ to analytically calculate $L(\theta , \phi ;x)$. $\alpha$-GAN treats the variational posterior distribution implicitly by using the density ratio technique, which can be derived as follows:

\begin{equation} KL(Q_\phi (z|x)||p_\theta (z)) = \mathbb {E}_{Q_\phi (z|x)}\left[\log \frac{Q_\phi (z|x)}{p_\theta (z)}\right] \approx \mathbb {E}_{Q_\phi (z|x)}\left[\log \frac{C_\phi (z)}{1-C_\phi (z)}\right], \end{equation}
where $C_\phi (z) = \frac{Q_\phi (z|x)}{Q_\phi (z|x)+p_\theta (z)}$, which is an optimal solution of the discriminator for a fixed generator in standard GAN [36]. $\alpha$-GAN estimates the KLD term using a learned discriminator from the encoder $Q_\phi (z|x)$ and the prior distribution $p_\theta (z)$ in the latent space. A KLD regularization term of a variational distribution thus no longer needs to be calculated analytically.
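The density ratio idea can be checked on a toy case in which both densities are known. The sketch below assumes a one-dimensional encoder distribution $\mathcal {N}(1, 1)$ and prior $\mathcal {N}(0, 1)$, and uses the closed-form optimal $C_\phi (z)$ in place of a learned discriminator; it then compares the resulting estimate with the analytic KLD. All names and numbers are illustrative.

```python
import numpy as np

def gaussian_pdf(z, mean, std):
    """Density of a one-dimensional Gaussian."""
    return np.exp(-0.5 * ((z - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

rng = np.random.default_rng(0)
q_mean = 1.0
z = rng.normal(q_mean, 1.0, size=200_000)   # samples from the encoder distribution

q = gaussian_pdf(z, q_mean, 1.0)            # encoder density Q(z|x)
p = gaussian_pdf(z, 0.0, 1.0)               # prior density p(z)
C = q / (q + p)                             # optimal latent discriminator

# E_Q[ log C / (1 - C) ] recovers the KL divergence from samples alone.
kl_estimate = np.mean(np.log(C) - np.log1p(-C))
kl_exact = 0.5 * q_mean ** 2                # analytic KL for equal unit variances
```

With a learned discriminator the same expression is evaluated on its output, so only samples from the encoder and the prior are needed rather than an explicit density for $Q_\phi (z|x)$.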

$\alpha$-GAN also modifies a reconstruction term in Equation (42) by adopting two techniques. One is using another discriminator which distinguishes real and synthetic samples, and the other adds a normal $l_1$ pixelwise reconstruction loss term in the data space. Specifically, $\alpha$-GAN changes a variational lower bound $L(\theta , \phi ;x)$ into a more GAN-like formulation by introducing two discriminators using a density ratio estimation and reconstruction loss to prevent a mode collapse problem. Figure 11(b) shows an outline of $\alpha$-GAN where $D_L$ is the discriminator of the latent space, $D_D$ represents the discriminator which acts on the data space, and $R$ is the reconstruction loss term.

3.2.3 Examples. Antipov et al. [3] proposed an encoder-decoder combined method for the face aging of a person. It produces facial aging of the target image under the given age vector $y$ while maintaining the identity of the target image. The encoder outputs a latent vector $z,$ which represents a personal identity to be preserved. In practice, however, this CGAN-based approach has a problem in that the generator tends to ignore the latent variable $z$, considering only conditional information $y$.

There are some approaches to training $z$ in an unsupervised manner with an autoencoder. Mainly, Semi-Latent GAN (SL-GAN) [138] proposed a method for changing a facial image from high-level semantic facial attributes (i.e., male/female, skin/hair color) by decomposing the latent space from an encoder into annotated attributes (ground-truth attribute of an image) and data-driven attributes (as $c$ of InfoGAN). Similarly to InfoGAN, SL-GAN tries to maximize the mutual information between the data-driven attribute and the generated image while using unsupervised training for the data-driven attributes.

In addition, Disentangled Representation GAN (DR-GAN) [121] addresses pose-invariant face recognition, which is a difficult problem because of the drastic changes in an image for each different pose, by adopting an encoder-decoder structure for the generator. As the purpose of DR-GAN is to generate a face of the same identity given a target pose, it has to learn the identity feature that should be invariant regardless of facial pose. To achieve this, DR-GAN designs an encoder of the generator to represent an identity feature, while a decoder of the generator produces an image under the pose representing vectors and the encoded identity.


As discussed in earlier sections, GAN is a very powerful generative model in that it can generate real-like samples with an arbitrary latent vector $z$. We do not need to know an explicit real data distribution nor assume further mathematical conditions. These advantages allow GAN to be applied in various academic and engineering fields. In this section, we discuss applications of GANs in several domains.

4.1 Image

4.1.1 Image Translation. Image translation involves translating images in one domain $X$ to images in another domain $Y$. Translated images take on the dominant characteristics of domain $Y$ while maintaining the attributes of the original images. As in classical machine learning, image translation can be categorized into supervised and unsupervised techniques.

Paired Two-Domain Data. Image translation with paired images can be regarded as supervised image translation in that an input image $x \in X$ to be translated always has the target image $y \in Y,$ where $X$ and $Y$ are two distinctive domains. Pix2pix [52] suggests an image translation method with paired images using a CGAN framework in which a generator produces a corresponding target image conditioned on an input image, as seen in Figure 12. In contrast, Perceptual Adversarial Networks (PAN) [127] add the perceptual loss between paired data $(x,y)$ to the generative adversarial loss to transform input image $x$ into ground-truth image $y$. Instead of using the pixelwise loss to push the generated image toward the target image, it uses hidden layer discrepancies of the discriminator between an input image $x$ and ground-truth image $y$. It tries to transform $x$ to $y$ to be perceptually similar by minimizing perceptual information discrepancies from the discriminator.

Fig. 12. Paired image translation results proposed by pix2pix [52]: converting Cityscapes labels into a real photo compared to ground truth. Images from pix2pix [52].

Unpaired Two-Domain Data. Image translation in an unsupervised manner learns a mapping between two domains given unpaired data from two domains. CycleGAN [145] and Discover Cross-Domain Relations with GAN (DiscoGAN) [57] aim to conduct unpaired image-to-image translations using a cyclic consistent loss term in addition to an adversarial loss term. With a sole translator $G: X \rightarrow Y$, GAN may learn a meaningless translation or suffer mode collapse, resulting in an undesired translation. To reduce the space of mapping of the generator, they adopt another inverse translator $T: Y \rightarrow X$ and introduce the cyclic consistency loss which encourages $T(G(x)) \approx x$ and $G(T(y)) \approx y$ so that each translation finds a plausible mapping between the two domains, as mentioned in Section 2.1.3. Their methods can be interpreted in a similar manner as that described in Section 3.2.1 in that they add a supervised signal for reconstruction.
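The cyclic consistency loss can be sketched as below with toy linear translators standing in for the two networks. An $l_1$ distance is assumed here (as used by CycleGAN); the matrix `A`, the helper names, and the data are hypothetical.

```python
import numpy as np

def cycle_consistency_loss(x, y, G, T):
    """Mean L1 cycle loss: |T(G(x)) - x| + |G(T(y)) - y|."""
    forward = np.mean(np.abs(T(G(x)) - x))     # X -> Y -> X
    backward = np.mean(np.abs(G(T(y)) - y))    # Y -> X -> Y
    return forward + backward

# Toy translators: G is a linear map, and T is either its exact inverse or not.
A = np.array([[2.0, 1.0], [0.0, 1.0]])
G = lambda x: x @ A.T
T_inverse = lambda y: y @ np.linalg.inv(A).T
T_identity = lambda y: y

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 2))               # unpaired samples from domain X
y = rng.standard_normal((16, 2))               # unpaired samples from domain Y

loss_consistent = cycle_consistency_loss(x, y, G, T_inverse)
loss_inconsistent = cycle_consistency_loss(x, y, G, T_identity)
```

The loss is zero only when the two translators undo each other on both domains, which is what narrows the space of admissible mappings.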

Attribute-guided image translation was also considered to transfer the visual characteristic of an image. Conditional CycleGAN [75] utilizes CGAN with a cyclic consistency framework. Kim et al. [58] attempted to transfer visual attributes. In addition to the cyclic consistency of an image, they also added an attribute consistency loss which forces the transferred image to have a target attribute of the reference image.

4.1.2 Super-Resolution. Acquiring super-resolution images from low-resolution images has the fundamental problem that the recovered high-resolution image misses high-level texture details during the upscaling of the image. Ledig et al. [65] adopted a perceptual similarity loss in addition to an adversarial loss, instead of a pixelwise mean-squared error loss. Their method focuses on feature differences from the intermediate layer of the discriminator, not pixelwise differences, because optimizing the pixelwise mean-squared error yields the pixelwise average of plausible solutions, leading to perceptually poor, overly smooth details that are not robust to drastic pixel value changes.

4.1.3 Object Detection. Detecting small objects in an image typically suffers from low resolution, and thus it is necessary to train models with images of various scales, as in the You Only Look Once (YOLO) [101] and Single Shot Detection (SSD) [96] methods. Notably, Li et al. [69] tried to transform a small object with low resolution into a super-resolved large object to make the object more discriminative. They utilized a GAN framework, except they decomposed the discriminator into two branches, namely, an adversarial branch and a perceptual branch. The generator produces a real-like large-scale object by the typical adversarial branch, while the perceptual branch guarantees that the generated large-scale object is useful for the detection.

Ehsani et al. [28] proposed another framework to detect objects occluded by other objects in an image. It uses a segmentor, a generator, and a discriminator to extract the entire occluded-object mask and to paint it as a real-like image. The segmentor takes an image and a visible region mask of an occluded object and produces a mask of the entire occluded object. The generator and the discriminator are trained adversarially to produce an object image in which the invisible regions of the object are reconstructed.

4.1.4 Object Transfiguration. Object transfiguration is a conditional image generation technique that replaces an object in an image with a particular condition while the background does not change. Zhou et al. [144] adopted an encoder-decoder structure to transplant an object, where the encoder decomposes an image into the background feature and the object feature, and the decoder reconstructs the image from the background feature and the object feature we want to transfigure. Importantly, to disentangle the encoded feature space, two separate training sets are required, where one is the set of images having the object and the other is the set of images not having the object.

In addition, the GAN can be applied to an image blending task which implants an object into another image's background and makes the composited copy-paste images look more realistic. Gaussian-Poisson GAN (GP-GAN) [132] suggests a high-resolution image blending framework using GAN and a classic image blending gradient-based approach [31]. It decomposes images into low-resolution but well-blended images using a GAN and detailed textures and edges using a gradient constraint. Then, GP-GAN attempts to combine the information by optimizing a Gaussian-Poisson equation [95] to generate high-resolution well-blended images while maintaining captured high-resolution details.

4.1.5 Joint Image Generation. The GAN can be utilized to generate multiple domain images at once. Coupled GAN [74] suggests a method of generating multidomain images jointly by weight-sharing techniques among GAN pairs. It first adopts GAN pairs to match the number of domains we want to produce. Then, it shares the weights of some layers of each GAN pair that represents high-level semantics. Therefore, it attempts to learn joint distributions of a multidomain from samples drawn from a marginal domain distribution. It should be noted that because it aims to generate multidomain images which share high-level abstract representations, images from each domain have to be very similar in a broad view.
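The weight-sharing idea can be sketched for a pair of generators. This is a minimal illustration rather than the Coupled GAN architecture itself: the layer sizes, the two-layer structure, and the choice of sharing the layer nearest the latent input are assumptions made here, and the names `shared_weights`, `domain_weights`, and `generate_pair` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim, hidden_dim, image_dim = 8, 16, 32   # hypothetical sizes

# Shared first layer: maps z to high-level features used by both domains.
shared_weights = rng.normal(size=(latent_dim, hidden_dim))
# Domain-specific last layers: render the shared features in each domain.
domain_weights = [rng.normal(size=(hidden_dim, image_dim)) for _ in range(2)]

def generate_pair(z):
    """Map one latent batch to a pair of outputs, one per domain."""
    hidden = np.tanh(z @ shared_weights)              # shared high-level features
    return [np.tanh(hidden @ w) for w in domain_weights]

z = rng.normal(size=(4, latent_dim))
image_a, image_b = generate_pair(z)
print(image_a.shape, image_b.shape)
```

Because both outputs are rendered from the same hidden features, a single latent sample yields a corresponding pair, while the separate last layers allow the two domains to differ in low-level appearance.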

4.1.6 Video Generation. In this paragraph, we discuss GANs generating video. Generally, the video is composed of relatively stationary background scenery and dynamic object motions. Video GAN (VGAN) [125] considers a two-stream generator. A moving foreground generator using 3D CNN predicts plausible future frames while a static background generator using 2D CNN makes the background stationary. Pose-GAN [126] takes a VAE and GAN combining approach. It uses a VAE approach to estimate future object movements conditioned on a current object pose and hidden representations of past poses. With a rendered future pose video and clip image, it uses a GAN framework to generate future frames using a 3D CNN. Recently, Motion and Content GAN (MoCoGAN) [122] proposed to decompose the content part and motion part of the latent space, especially modeling the motion part with RNN to capture the time dependency.

4.2 Sequential Data Generation

GAN variants that generate discrete values mostly borrow a policy gradient algorithm of Reinforcement Learning (RL) to circumvent direct back-propagation of discrete values. To output discrete values, the generator, as a function, needs to map the latent variable into the domain where elements are not continuous. However, if we do the back-propagation as in a continuous value-generating process, the generator is steadily guided to generate real-like data by the discriminator, rather than suddenly jumping to the target discrete values. Thus, such slight changes in the generator's output cannot easily land in the limited domain of real discrete data [141].

In addition, when generating a sequence such as music or language, we need to evaluate a partially generated sequence step by step, measuring the performance of the generator. However, the conventional GAN framework can only evaluate whole generated sequences unless there is a discriminator for each time-step. This, too, can be solved by the policy gradient algorithm, in that RL naturally addresses the sequential decision process of the agent.

4.2.1 Music. When we want to generate music, we need to generate the notes and tone of the music step by step, and these elements are not continuous values. A simple and direct approach is Continuous RNN-GAN (C-RNN-GAN) [83], which models both the generator and discriminator as an RNN with Long Short-Term Memory (LSTM) [46], directly extracting whole sequences of music. However, as just mentioned, we can only evaluate whole sequences and not a partially generated sequence. Furthermore, its results are not highly satisfactory since it does not consider the discrete properties of the music elements.

In contrast, Sequence GAN (SeqGAN) [141], Objective-Reinforced GAN (ORGAN) [41], and Lee et al. [66] employed a policy gradient algorithm that does not generate whole sequences at once. The result of SeqGAN is shown in Figure 13. They treat a generator's output as a policy of an agent and take the discriminator's output as a reward. Selecting a reward with the discriminator is a natural choice as the generator acts to obtain a large output (reward) from the discriminator, similar to the agent learning to acquire a large reward in RL. ORGAN differs slightly from SeqGAN in that it adds a hard-coded objective to the reward function to achieve the specified goal.
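The idea of treating the generator as a policy and the discriminator's output as a reward can be sketched with a REINFORCE-style update. This is a toy illustration under strong assumptions, not the SeqGAN or ORGAN training procedure: the generator emits a single token instead of a sequence, and `discriminator_rewards` is a hypothetical fixed function standing in for a trained discriminator.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits):
    shifted = logits - logits.max()
    exp = np.exp(shifted)
    return exp / exp.sum()

# Hypothetical stand-in for a discriminator: token 2 looks "real" (reward 0.9),
# every other token looks "fake" (reward 0.1).
def discriminator_rewards(tokens):
    return np.where(tokens == 2, 0.9, 0.1)

vocab_size, batch_size, learning_rate = 4, 32, 0.5
logits = np.zeros(vocab_size)            # generator parameters (the policy)
initial_probs = softmax(logits)

for step in range(200):
    probs = softmax(logits)
    tokens = rng.choice(vocab_size, size=batch_size, p=probs)  # discrete samples
    rewards = discriminator_rewards(tokens)
    # For a softmax policy, d log pi(token) / d logits = one_hot(token) - probs.
    grad_log_prob = np.eye(vocab_size)[tokens] - probs
    policy_gradient = (rewards[:, None] * grad_log_prob).mean(axis=0)
    logits += learning_rate * policy_gradient                  # ascend the reward

final_probs = softmax(logits)
print(initial_probs, final_probs)
```

No gradient flows through the sampled tokens; only the log-probability of each sample is differentiated, which is what allows the discrete output to be trained.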

Fig. 13. Polyphonic music sequences generated from SeqGAN [66].

4.2.2 Language and Speech. RankGAN [73] suggests language (sentence) generation methods and a ranker instead of a conventional discriminator. In natural language processing, the expression power of a natural language needs to be considered in addition to its authenticity. Thus, RankGAN adopts a relative ranking concept between generated sentences and reference sentences that are human-written. The generator tries to make its generated language sample ranked highly, while the ranker evaluates the rank score of the human-written sentences higher than the machine-written sentences. As the generator outputs discrete symbols, it adopts a policy gradient algorithm similar to SeqGAN and ORGAN. In RankGAN, the generator can be interpreted as a policy predicting the next-step symbol, and the rank score can be thought of as a value function given a past generated sequence.

Variational Autoencoding Wasserstein GAN (VAW-GAN) [48] is a voice conversion system combining GAN and VAE frameworks. The encoder infers a phonetic content $z$ of the source voice, and the decoder synthesizes the converted target voice given a target speaker's information $y$, similar to conditional VAE [135]. As discussed in Section 3.2.2, VAE struggles to generate sharp results due to the oversimplified assumption of the Gaussian distribution. To address this issue, VAW-GAN incorporates WGAN [5], similarly to VAEGAN [64]. By assigning the decoder to the generator, it aims to reconstruct the target voice given the speaker representation.

4.3 Semi-Supervised Learning

Semi-supervised learning is a learning method that improves the classification performance by using unlabeled data in a situation where there are both labeled and unlabeled data. In the big data era, there exists a common situation where the size of the data is too large to label all the data or the cost of labeling is expensive. Thus, it is often necessary to train the model with a dataset in which only a small portion of the total data has labels.

4.3.1 Semi-Supervised Learning with a Discriminator. The GAN-based semi-supervised learning method [104] demonstrates how unlabeled and generated data can be used in the GAN framework. The generated data are allocated to an additional class $K+1$, beyond the classes $1 \dots K$ of the labeled data. For labeled real data, the discriminator predicts their correct label (1 to $K$). For unlabeled real data and generated data, the discriminator is trained with a GAN minimax game. The training objective can be expressed as follows:

\begin{equation} L = L_{s} + L_{us}, \end{equation}
where $L_{s}$ and $L_{us}$ stand for the loss functions of the labeled data and the unlabeled data, respectively. Note that because only generated data are classified as the $K+1$ class, we could think of $L_{us}$ as a GAN standard minimax game. The unlabeled data and the generated data serve to inform the model of the space where the real data reside. In other words, the unsupervised cost serves to guide the location of the optimum supervised cost of the real labeled data.
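The two terms can be sketched for a discriminator with $K+1$ outputs. The sketch assumes a common form of this objective: cross-entropy over the $K$ real classes for $L_{s}$, and a standard GAN loss with $D(x) = 1 - p(y=K+1|x)$ for $L_{us}$. The function names and the toy logits are hypothetical, and the last output index plays the role of class $K+1$.

```python
import numpy as np

def log_softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def semi_supervised_loss(logits_labeled, labels, logits_unlabeled, logits_generated):
    """L = L_s + L_us for (N, K+1) logits; the last column is the 'generated' class."""
    # L_s: cross-entropy over the K real classes only.
    log_p_real_classes = log_softmax(logits_labeled[:, :-1])
    loss_s = -np.mean(log_p_real_classes[np.arange(len(labels)), labels])

    # L_us: GAN loss with D(x) = 1 - p(y = K+1 | x).
    log_p_fake_unlabeled = log_softmax(logits_unlabeled)[:, -1]
    log_p_fake_generated = log_softmax(logits_generated)[:, -1]
    log_d_real = np.log1p(-np.exp(log_p_fake_unlabeled))   # log(1 - p(K+1 | x))
    loss_us = -np.mean(log_d_real) - np.mean(log_p_fake_generated)
    return float(loss_s + loss_us)

# Toy logits for K = 3 classes (hypothetical values).
labels = np.array([0, 2])
logits_labeled = np.array([[5.0, 0.0, 0.0, -5.0], [0.0, 0.0, 5.0, -5.0]])
logits_unlabeled = np.array([[2.0, 1.0, 0.0, -5.0]])   # looks real
logits_generated = np.array([[0.0, 0.0, 0.0, 5.0]])    # looks generated

loss_good = semi_supervised_loss(logits_labeled, labels, logits_unlabeled, logits_generated)
# Swapping the roles of real and generated inputs should be penalized heavily.
loss_bad = semi_supervised_loss(logits_labeled, labels, logits_generated, logits_unlabeled)
print(loss_good, loss_bad)
```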

Categorical GAN (CatGAN) [115] proposes an algorithm for robust classification in which the generator regularizes the classifier. The discriminator has no classification head for distinguishing real and fake and is trained with three requirements: a small conditional entropy of $H(y|x)$ to make the correct class assignment for the real data, a large conditional entropy of $H(y|G(z))$ to make the class assignment for the generated data diverse, and a large entropy of $H(y)$ to make a uniform marginal distribution with an assumption of a uniform prior $p(y)$ over classes where $x$, $y,$ and $G(z)$ are real data, labels, and generated data, respectively. The generator, meanwhile, is trained with two requirements: a small conditional entropy of $H(y|G(z))$ to make the class assignment for the generated data certain and a large entropy of $H(y)$ to generate equally distributed samples over classes. The unlabeled data and the generated data help the classification by balancing classes through the adversarial act of the generator, so it helps semi-supervised learning.
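The three entropy quantities can be estimated from a batch of class probabilities predicted by the discriminator. The sketch below assumes the usual batch estimates (mean per-sample entropy for the conditional entropy, entropy of the batch-averaged prediction for $H(y)$); the helper names and the toy probability arrays are hypothetical.

```python
import numpy as np

def entropy(p, axis=-1):
    p = np.clip(p, 1e-12, 1.0)
    return -np.sum(p * np.log(p), axis=axis)

def conditional_entropy(class_probs):
    """Batch estimate of H(y|x): mean entropy of each row p(y|x)."""
    return float(np.mean(entropy(class_probs, axis=1)))

def marginal_entropy(class_probs):
    """Batch estimate of H(y): entropy of the batch-averaged class distribution."""
    return float(entropy(class_probs.mean(axis=0)))

# Confident and class-balanced predictions, as desired for real data.
confident = np.eye(3) * 0.97 + 0.01
# Maximally uncertain predictions, as the discriminator wants for generated data.
uncertain = np.full((3, 3), 1.0 / 3.0)

h_cond_confident = conditional_entropy(confident)
h_cond_uncertain = conditional_entropy(uncertain)
h_marginal_confident = marginal_entropy(confident)
print(h_cond_confident, h_cond_uncertain, h_marginal_confident)
```

In these terms, the discriminator would push `conditional_entropy` down on real data and up on generated data, while the generator would push it down on its own samples; both sides favor a large `marginal_entropy`.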

4.3.2 Semi-supervised Learning with an Auxiliary Classifier. The preceding GAN variants in semi-supervised learning have two problems: the first is that the discriminator has two incompatible convergence points, one for discriminating real and fake data and the other for predicting the class label; the second problem is that the generator cannot generate data in a specific class. Triple-GAN [67] addresses the two problems by a three-player formulation: a generator $G$, a discriminator $D,$ and a classifier $C$. The model is illustrated in Figure 14, where $(X_g, Y_g)\sim p_g(X, Y)$, $(X_l, Y_l) \sim p(X, Y)$, and $(X_c, Y_c) \sim p_c(X, Y)$ refer to the generated data, the labeled data, and the unlabeled data with a predicted label, respectively, and $CE$ is the cross-entropy loss. To summarize, Triple-GAN adopts an auxiliary classifier which classifies real labeled data and label-conditioned generated data, relieving the discriminator of classifying the labeled data. In addition, Triple-GAN generates data conditioned on $Y_g$, which means that it can generate label-specific data.

Fig. 14. An illustration of Triple-GAN [67].

4.4 Domain Adaptation

Domain adaptation is a type of transfer learning which tries to adapt data from one domain (i.e., the source domain) into another domain (i.e., the target domain), while the classification task performance is preserved in the target domain [93]. Formally, the unsupervised domain adaptation solves the following problem: With input data $x$ and its label $y$, let a source domain distribution and a target domain distribution be defined over $X\times Y$ where $X$ and $Y$ are sets of a data space and a label space, respectively. Given labeled source domain data $(x_s, y_s)$ and unlabeled target domain data $x_t$ (drawn from the marginal distribution of the target domain distribution over $X$), the unsupervised domain adaptation aims to learn a function ${h} : {X} \rightarrow {Y}$ which classifies well in the target domain without the label information of target domain data $x_t$.

4.4.1 Domain Adaptation by Feature Space Alignment via GAN. The main difficulty in domain adaptation is the difference between the source distribution and the target distribution, called domain shift. This domain shift causes a classifier trained with only the source data to fail in the target domain. One method to address domain shift is to project the data of each domain into a common feature space where the distributions of the projected data are similar. There have been a few studies that obtain such a common feature space via GAN for the domain adaptation task.

The Domain Adversarial Neural Network (DANN) [2] first used GAN to obtain domain-invariant features by making them indistinguishable as to whether they came from the source domain or the target domain while still discriminative for the classifying task. There are two components sharing a feature extractor. One is a classifier which classifies the labels of the data. The other is a domain discriminator which discerns where the data comes from. The feature generator acts like the generator in GAN, producing source-like features from the target domain. To sum up, DANN takes CNN classification networks in addition to the GAN framework to classify data from the target domain without label information by learning the domain-invariant features. Figure 15 shows the overall outline of DANN, ARDA, and unsupervised pixel-level domain adaptation where $I_S, I_T$, and $I_f$ are the source domain image, target domain image, and fake image, respectively. $F_S$ and $F_T$ are the extracted source features and the target features, respectively. It should be noted that $Y_s$, which is the label information of $I_s$, is fed into the classifier, training the classifier with cross-entropy loss.

Fig. 15. Illustrations of (a) DANN [2] and (b) unsupervised pixel-level domain adaptation [12].

Although DANN may achieve domain-invariant marginal feature distributions, if the labels of the source features and the target features are mismatched, the learned classifier might not work well in the target domain. Cycle-Consistent Adversarial Domain Adaptation (CyCADA) [47] thus adds cycle consistency [57, 145] to preserve the content of the data, which is the crucial characteristic in determining the label, while inheriting most of the architecture of DANN.

4.4.2 Examples. Bousmalis et al. [12] used domain adaptation via GAN for the grasping task. They used simulation data as the source domain data and ran the learned model in a real environment. Their method is slightly different from DANN in that it only adapts source images to be seen as if they were drawn from the target domain, while DANN tries to obtain features similarly from both domains. The feature space of Bousmalis et al. [12] is the target domain space, and the feature generator in the target domain is the identity function. In addition, they added a content similarity loss, defined as the pixelwise mean-squared error between the source image and the adapted image to preserve the content of the source image. By doing so, they achieved better performance than the supervised method in the grasping task. It should also be noted that their method [12] can check whether the domain adaptation process is working well during the training phase because the transformed data are visible, while other works [2, 108] cannot visually check the domain adaptation process because the common representation space cannot be easily visualized.

Yoo et al. [140] achieved autonomous navigation without any real labeled data using domain adaptation. As in Bousmalis et al. [12], they also exploited the simulation data as the source data and showed autonomous navigation in a real outdoor environment. They used cycle consistency to preserve the content and a style loss term motivated by the style transfer task [32] to reduce the domain shift dramatically for the outdoor environment, as seen in Figure 16. By doing so, they showed the applicability of simulation via domain adaptation into autonomous navigation tasks where collecting labels for various environments is difficult and expensive.

Fig. 16. The source images and transferred images generated by five models proposed by Yoo et al. [140]. The leftmost images are source images made by a simulator, and figures in columns 2-6 are transferred images by five different models. Images from Yoo et al. [140].

4.5 Other Tasks

Several variants of GAN have also been developed in academic or practical fields other than machine learning.

4.5.1 Medical Image Segmentation. Xue et al. [134] proposed a segmentor-critic structure to segment a medical image. A segmentor generates a predicted segmented image, and a critic maximizes the hierarchical feature differences between the ground-truth and the generated segmentation. This structure leads a segmentor to learn the features of the ground-truth segmentation adversarially, similar to the GAN approach. There are also other medical image segmentation algorithms, such as the Deep Image-to-Image Network (DI2IN) [136] and the Structure Correcting Adversarial Network (SCAN) [16]. DI2IN conducts liver segmentation of 3D computed tomography (CT) scan images through adversarial learning. SCAN tries to segment the lung and the heart from chest X-ray images through an adversarial approach with a ground-truth segmentation mask.

4.5.2 Steganography. It is also feasible to use GAN for steganography. Steganography is a technique that conceals secret messages in nonsecret containers, such as an image. A steganalyzer determines if a container contains a secret message or not. Some studies, such as Volkhonskiy et al. [124] and Shi et al. [109], propose a steganography model with three components: a generator producing real-looking images that are used as containers and two discriminators, one of which classifies whether an image is real or fake and the other of which determines whether an image contains a secret message.

4.5.3 Continual Learning. Deep generative replay [110] extends a GAN framework to continual learning. Continual learning solves multiple tasks and accumulates new knowledge continually. Continual learning in deep neural networks suffers from catastrophic forgetting, which refers to forgetting a previously learned task while learning a new task. Deep generative replay, which is inspired by brain mechanisms, addresses catastrophic forgetting with a GAN framework. It trains a scholar model in each task, where a scholar model is composed of a generator and a solver. The generator produces samples of an old task, and the solver gives a target answer to an old task's sample. By sequentially training scholar models with old task values generated by old scholars, it attempts to overcome catastrophic forgetting while learning new tasks.


We have discussed how GAN and its variants work and how they are applied to various applications. Table 5 compares some famous variants of GAN with respect to model architectures and additional constraints. As we have viewed GAN from a microscopic perspective until now, we are going to discuss the macroscopic view of GAN in this section.

Table 5. Comparison of GAN Variants from Some Aspects
Model | Generator | Discriminator | Additional Loss & Constraint
WGAN-GP | ReLU MLP | ReLU MLP | Gradient penalty
BEGAN | Discriminator decoder | Autoencoder (ELU CNN) | Equilibrium measure
ACGAN | Transposed ReLU CNN | LeakyReLU CNN | Classification loss
SeqGAN | LSTM | ReLU CNN | Policy gradient
DANN | ReLU CNN | ReLU MLP | Classification loss, gradient reversal layer

5.1 Evaluation

Measuring the performance of GAN is related to capturing the diversity and quality of generated data. As explained in Section 1, the generative models mostly model the likelihood and learn by maximizing it. Thus, it is natural to evaluate the generator using log likelihood. However, GAN generates real-like data without estimating likelihood directly, so evaluation with likelihood is not quite proper. The most widely accepted evaluation metric for GAN is the Inception score, which was proposed by Salimans et al. [104] and is defined as follows:

\begin{equation} \exp \left(\mathbb {E}_{x \sim p_{\mathrm{g}}(x)} \left[KL(p(y|x) \Vert p(y))\right]\right). \end{equation}

As seen in Equation (47), it computes the average KLD between the conditional label distribution $p(y|x)$ and the marginal distribution of the generated data's label $p(y)$ with a pretrained classifier such as VGG [112] and ImageNet data [63]. Since the average KLD between $p(y|x)$ and $p(y)$ in the exponent function is equivalent to their mutual information $I(y;x) = H(y) - H(y|x)$ [7], a low entropy of $p(y|x)$ and a high entropy of $p(y)$ lead to a high Inception score. The entropy of $p(y|x)$ measures how sharp and clear the generated data are in order to be well-classified. The other term, $H(y),$ represents whether the generated data are diverse with respect to the generated class. In this way, the Inception score is believed to measure the diversity and visual quality of the generated data.
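The computation can be sketched directly from an array of class probabilities. In practice the rows would come from a pretrained classifier applied to generated images; here they are hypothetical hand-written arrays, and `inception_score` is an illustrative helper rather than a reference implementation.

```python
import numpy as np

def inception_score(class_probs, eps=1e-12):
    """exp(mean_x KL(p(y|x) || p(y))) from an (N, K) array whose rows are p(y|x)."""
    p_yx = np.clip(class_probs, eps, 1.0)
    p_y = p_yx.mean(axis=0, keepdims=True)          # marginal label distribution
    kl_per_sample = np.sum(p_yx * (np.log(p_yx) - np.log(p_y)), axis=1)
    return float(np.exp(kl_per_sample.mean()))

num_classes = 4
# Sharp and diverse: each sample is confidently assigned to a different class.
score_sharp_diverse = inception_score(np.eye(num_classes))
# Sharp but not diverse: every sample is assigned to the same class.
score_one_class = inception_score(np.tile(np.eye(num_classes)[0], (num_classes, 1)))
# Diverse on average but blurry: every prediction is uniform.
score_blurry = inception_score(np.full((num_classes, num_classes), 1.0 / num_classes))
print(score_sharp_diverse, score_one_class, score_blurry)
```

The score ranges from 1 (when $p(y|x)$ equals $p(y)$ for every sample) up to the number of classes. Note that the first case reaches the maximum with a single sample per class, so the score by itself says nothing about diversity within a class.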

However, the Inception score has a limitation in that it is unable to detect the mode collapse problem. Even when the trained generator produces only one plausible sample for each class, $p(y|x)$ can have low entropy, leading to a high Inception score. To address this issue, AC-GAN [90] adopts Multiscale Structural Similarity (MS-SSIM) [129], which evaluates perceptual image similarity and thus identifies a mode collapse more reliably. In addition, MRGAN [13] proposed a MODE score based on the Inception score, and Danihelka et al. [17] suggested an independent Wasserstein critic which is trained independently for the validation dataset to measure overfitting and mode collapse. Moreover, various objective functions, such as MMD, total variation, and Wasserstein distance, can be used as the approximated distance between $p_{\mathrm{g}}(x)$ and $p_{\textit{data}}(x)$. Theoretically, all metrics should produce the same result under the assumption of the full capacity of the model and an infinite number of training samples. However, Theis et al. [119] showed that minimizing MMD, JSD, or KLD results in different optimal points, even for the mixture of Gaussian distributions as a real data distribution, and the convergence in one type of distance does not guarantee the convergence for another type of distance because their objective functions do not share a unique global optimum.

The method for measuring the performance of a GAN is still a disputed subject. Since GAN is naturally unsupervised learning, we cannot measure the accuracy or error rate as in the supervised learning approaches. Evaluation metrics or different distances discussed in the preceding paragraph still do not measure the performance of GAN exactly, and there are many cases in which images that do not look natural obtain a high score [7]. Thus, there is room to improve the evaluation of GAN.

5.2 Discrete Structured Data

Unlike other generative models such as VAE, GAN has an issue handling discrete data such as text sequences or discretized images. Since discrete data are nondifferentiable, gradient descent update via back-propagation cannot directly be applied for a discrete output. The content of this section may overlap with Section 4.2, but we will briefly cover this issue in the discussion because it is one of the more troublesome issues in GAN.

To address this issue, some methods adopt a policy gradient algorithm [118] in RL, in which the objective is to maximize total rewards. By rolling out whole sequences, this method circumvents direct back-propagation for a discrete output. Intuitively, the generator, which generates fake data that maximize the discriminator's output, can be thought of as a policy agent in RL, whose policy is a probability distribution over actions that maximize a reward in a given state.

Maximum-Likelihood Augmented Discrete GAN (MLADGAN) [14] and Boundary-Seeking GAN (BSGAN) [44] both take a policy gradient algorithm to generate discrete structured data. They both treat the generator as a policy which outputs a probability for a discrete output. Importantly, they estimate the true distribution as $\tilde{p}_{data}(x)=\frac{1}{Z}p_{\mathrm{g}}(x)\frac{D(x)}{1-D(x)},$ where $Z$ is a normalization factor. $\tilde{p}_{data}(x)$ is motivated from $p_{\mathrm{data}}(x)=\frac{D^*(x)}{1-D^*(x)}p_{\mathrm{g}}(x),$ where $D^*$ is the optimal discriminator, and MLADGAN and BSGAN interpret $\tilde{p}_{data}(x)$ as a reward for a discrete output $x$. Because larger $D(x)$ means greater $\tilde{p}_{data}(x)$, taking the $\tilde{p}_{data}(x)$ value as a reward brings about the same consequence as when $D(x)$ is taken as a reward. They showed some successful performance, but the learning process is highly unstable due to MCMC sampling in the training phase.
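The relation between $p_{\mathrm{data}}(x)$ and $D^*(x)$ used above follows in one step from the optimal discriminator of the standard GAN, $D^*(x) = p_{\mathrm{data}}(x) / (p_{\mathrm{data}}(x) + p_{\mathrm{g}}(x))$. A short derivation, assuming $p_{\mathrm{g}}(x) > 0$:

\begin{align} \frac{D^*(x)}{1-D^*(x)} = \frac{p_{\mathrm{data}}(x) / (p_{\mathrm{data}}(x)+p_{\mathrm{g}}(x))}{p_{\mathrm{g}}(x) / (p_{\mathrm{data}}(x)+p_{\mathrm{g}}(x))} = \frac{p_{\mathrm{data}}(x)}{p_{\mathrm{g}}(x)} \quad \Longrightarrow \quad p_{\mathrm{data}}(x) = \frac{D^*(x)}{1-D^*(x)}\, p_{\mathrm{g}}(x). \end{align}

When the unknown $D^*$ is replaced with the current discriminator $D$, the right-hand side is in general no longer a normalized distribution, which is presumably why the estimate $\tilde{p}_{data}(x)$ carries the normalization factor $Z$.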

Adversarially Regularized Autoencoders (ARAE) [59] address this issue by combining a discrete autoencoder which encodes a discrete input into the continuous code $c$. ARAE also adopts WGAN [5] acting on the latent space where encoded code lies. To avoid direct access to the discrete structure, the generator produces fake code $\tilde{c}$ from the sampled latent $z$, and the critic evaluates such fake code $\tilde{c}$ and real code $c$. By jointly training an autoencoder with WGAN on the code space, it avoids back-propagation on the discrete space while learning latent representations of discrete structured data.

5.3 Relationship to Reinforcement Learning

RL is a type of learning theory that focuses on teaching an agent to choose the best action given a current state. A policy $\pi (a|s)$, which is a probability for choosing an action $a$ at state $s$, is learned via an on-policy or off-policy algorithm. RL is conceptually very similar to GAN with regard to the policy gradient, where it is very important to estimate the value function correctly given state $s$ and action $a$.

The policy gradient algorithm [118] presented an unbiased estimation of a policy gradient under the correct action-state value $Q(s,a)$. Because $Q(s,a)$ is the discounted total reward from state $s$ and action $a$, $Q(s,a)$ must be estimated via an estimator called a critic. A wide variety of policy gradient methods, such as the Deep Deterministic Policy Gradient (DDPG) [71], Trust Region Policy Optimization (TRPO) [107], and Q-prop [40], can be thought of as estimating $Q(s,a)$ with low bias and low variance to correctly estimate the policy gradient. In that regard, they are similar to a GAN framework in that GAN's discriminator estimates the distance between two distributions, and the approximated distance needs to be highly unbiased to make $p_{\mathrm{g}}(x)$ closely approximate $p_{data}(x)$. Pfau and Vinyals [97] detailed the connection between GAN and the actor-critic methods.

Inverse RL (IRL) [1] is similar to RL in that its objective is to find the optimal policy. However, in the IRL framework, experts’ demonstrations are provided instead of a reward. It finds the appropriate reward function that makes the given demonstration as optimal as possible and then produces the optimal policy for the identified reward function. There are many variants in IRL. Maximal entropy IRL [146] is one which finds the policy distribution that satisfies the constraints so that the feature expectations of the policy distribution and the given demonstration are the same. To solve such an ill-posed problem, maximal entropy IRL finds the policy distribution with the largest entropy according to the maximal entropy principle [53]. Intuitively, the maximal entropy IRL finds the policy distribution which maximizes the likelihood of a demonstration and its entropy. Its constraint and convexity induce the dual minimax problem. The dual variable can be seen as a reward. The minimax formulation and the fact that it finds the policy which has the largest likelihood of demonstrations give it a deep connection with the GAN framework. The primal variable is a policy distribution in IRL, whereas it can be considered a data distribution from the generator in GAN. The dual variable is a reward/cost in IRL, while it can be seen as the discriminator in GAN.

Finn et al. [30], Yoo et al. [139], and Ho and Ermon [45] showed a mathematical connection between IRL and GAN. Ho and Ermon [45] converted IRL to the original GAN by constraining the space of dual variables, and Yoo et al. [139] showed the relationship between EBGAN [143] and IRL using approximate inference.

5.4 Pros and Cons of GAN

5.4.1 Pros. As briefly mentioned in Section 1, the major advantage of GAN is that it does not need to define the shape of the probability distribution of the generator model. Thus, GAN naturally avoids the need for tractable density forms that are able to represent complex and high-dimensional distributions. Compared to other models using explicitly defined probability density [22, 91, 92, 105], GAN has the following advantages:

  • GAN can parallelize the sampling of the generated data. In the case of PixelCNN [105], PixelRNN [92], and WaveNet [91], their speed of generation is very slow due to their autoregressive fashion, wherein $p_{\mathrm{g}}(x)$ is decomposed into a product of conditional distributions given previously generated values:
    \begin{align} p_{\mathrm{g}}(x) = \prod _{i=1}^d p_{\mathrm{g}}(x_i | x_{1}, \dots, x_{i-1}), \end{align}
    where $x$ is a $d$-dimensional vector. For example, in image generation, autoregressive models generate an image pixel by pixel, where the probability distribution of a future pixel cannot be inherently computed until the value of the previous pixel is computed. Thus, the generation process is naturally slow, which becomes more severe for high-dimensional data generation such as speech synthesis [91].
    On the other hand, the generator of GAN is a simple feedforward network mapping from $\mathcal {Z}$ to $\mathcal {X}$. The generator produces data all at once, not pixel by pixel, as in autoregressive models. Therefore, GAN can generate samples in parallel, which results in a considerable speed up for sampling, and this property gives more opportunity for GAN to be used in various real applications.
  • GAN does not need to approximate a likelihood by introducing a lower bound, as in VAE. As we mentioned in Section 3.2.2, VAE tries to maximize a likelihood by introducing a variational lower bound. The strategy of VAE is to maximize a tractable variational lower bound, which guarantees that the likelihood is at least as high as the lower bound even when the likelihood itself is intractable. However, VAE still needs assumptions on prior and posterior distributions which do not guarantee the tight bound of Equation (42). This strong assumption on distributions makes the approximation to the maximum likelihood biased.
    In contrast, GAN does not approximate the likelihood and does not need any probability distribution assumptions. Instead, GAN is designed to solve an adversarial game between the generator and the discriminator, and a Nash equilibrium of the GAN game corresponds to finding the real data distribution [35].
  • GAN is empirically known to produce better and sharper results than other generative models, especially VAE. In VAE, a distribution of pixel values in the reconstructed image is modeled as a conditional Gaussian distribution. This causes the maximization of $\log p_{\mathrm{g}}(x|z)$ to be equivalent to minimizing the Euclidean term $\Vert x-Decoder(z)\Vert ^2$, which can be interpreted as a regression problem fitting the mean.
    GAN is highly capable of capturing the high-frequency parts of an image. Since the generator tries to fool the discriminator to recover the real data distribution, the generator evolves to lead even the high-frequency parts to deceive the discriminator. In addition, some techniques, such as PatchGAN in Section 2.3.3, help GAN produce and capture sharper results more effectively.
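The link between the conditional Gaussian assumption and the Euclidean term in the last item can be written out explicitly. As a sketch, assume an isotropic Gaussian with a fixed variance $\sigma^2$ (a simplifying assumption made here, not stated in the text) over a $d$-dimensional $x$:

\begin{align} \log p_{\mathrm{g}}(x|z) = \log \mathcal{N}(x; Decoder(z), \sigma^2 I) = -\frac{1}{2\sigma^2}\Vert x - Decoder(z)\Vert^2 - \frac{d}{2}\log (2\pi \sigma^2). \end{align}

Only the first term depends on the decoder, so maximizing the log-likelihood amounts to a least-squares regression onto $x$, whose optimum is the conditional mean. Averaging over plausible outputs in this way is one common explanation for the blurriness of VAE samples.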

5.4.2 Cons. GAN was developed to solve the minimax game between the generator and the discriminator. Though several studies discuss convergence and the existence of the Nash equilibrium in the GAN game, GAN training is highly unstable and often fails to converge, as mentioned in Sections 2.3.1 and 2.3.2. GAN solves the minimax game through the gradient descent method, iteratively for the generator and the discriminator. In the cost function $V(G,D)$ in Equation (1), a solution for the GAN game is a Nash equilibrium, which is a point in parameter space where the discriminator's cost and the generator's cost are both at a minimum with respect to their own parameters. However, a decrease in the discriminator's cost function can cause an increase in the generator's cost function and vice versa. Thus, the GAN game often fails to converge and is prone to instability.
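The difficulty can be illustrated with a toy two-player game that is commonly used for this purpose and is not taken from this article. Assume the value function $V(x,y) = xy$, where one player controls $x$ and minimizes $V$ while the other controls $y$ and maximizes it; the only equilibrium is $(0, 0)$. The function name and step sizes below are hypothetical.

```python
import numpy as np

def simultaneous_gradient_steps(x, y, learning_rate, num_steps):
    """Toy game V(x, y) = x * y: x descends on V while y ascends on V."""
    for _ in range(num_steps):
        grad_x, grad_y = y, x            # dV/dx = y, dV/dy = x
        x, y = x - learning_rate * grad_x, y + learning_rate * grad_y
    return x, y

start = (1.0, 1.0)
end = simultaneous_gradient_steps(*start, learning_rate=0.1, num_steps=100)

dist_start = float(np.hypot(*start))     # distance from the equilibrium (0, 0)
dist_end = float(np.hypot(*end))
print(dist_start, dist_end)
```

Each simultaneous step multiplies the distance from the equilibrium by $\sqrt{1 + \eta^2}$ for learning rate $\eta$, so the iterates spiral outward instead of converging, even though each player follows its own gradient exactly.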

Another important issue for GAN is the mode collapse problem. This problem is very detrimental for GANs applied in real-world applications since mode collapse restricts the diversity of the samples that GAN can generate. In Equation (1), the generator is only forced to deceive the discriminator, not to represent the multimodality of the real data distribution. Mode collapse can thus happen even in a simple experiment [13], and the resulting low diversity discourages the use of GAN in applications. As mentioned in Section 2.4, various studies have tried to address mode collapse by using a new objective function [13, 61] or adding new components [13, 33]. However, for a highly complex and multimodal real data distribution, mode collapse remains a problem that GAN has yet to solve.
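
On synthetic data with known modes, the loss of diversity can be quantified by counting how many modes receive at least one generated sample. The sketch below is a hypothetical diagnostic rather than a method taken from the cited studies. It assumes a toy target with eight modes on a unit circle and stands in for a generator with hand-made sample sets. The names `covered_modes`, `diverse`, and `collapsed`, as well as the radius and noise scale, are illustrative.

```python
import numpy as np


def covered_modes(samples, mode_centers, radius):
    """Count the modes that have at least one sample within `radius`."""
    # Pairwise distances with shape (num_samples, num_modes).
    dists = np.linalg.norm(
        samples[:, None, :] - mode_centers[None, :, :], axis=-1
    )
    nearest = dists.argmin(axis=1)
    close_enough = dists.min(axis=1) <= radius
    return len(np.unique(nearest[close_enough]))


# Eight modes evenly spaced on a unit circle (toy target, assumed).
angles = 2.0 * np.pi * np.arange(8) / 8
centers = np.stack([np.cos(angles), np.sin(angles)], axis=1)

rng = np.random.default_rng(0)
noise_scale = 0.02

# Stand-in for a healthy generator: samples spread over all modes.
diverse = centers[rng.integers(0, 8, size=2000)]
diverse = diverse + noise_scale * rng.normal(size=(2000, 2))

# Stand-in for a collapsed generator: every sample near the first mode.
collapsed = centers[np.zeros(2000, dtype=int)]
collapsed = collapsed + noise_scale * rng.normal(size=(2000, 2))

n_diverse = covered_modes(diverse, centers, radius=0.1)
n_collapsed = covered_modes(collapsed, centers, radius=0.1)
print("modes covered (diverse):  ", n_diverse)
print("modes covered (collapsed):", n_collapsed)
```

For real image data the modes are not known in advance, so a count like this is only meaningful in controlled experiments of this kind.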

5.4.3 Future Research Areas. Because GAN has become popular throughout the deep learning arena, many of its limitations have recently been addressed [11, 19, 81]. With further developments, new tasks are steadily being conquered using GAN. For instance, CausalGAN [60] combines a causal implicit generative model with the conditional GAN to replace conditioning by intervention, which enables the model to generate data with desired combinations of characteristics that do not exist in the dataset. In addition, new types of GANs have been proposed for new applications, such as cipher cracking [34] and object tracking [114]. When you design a GAN for a new task, you must first identify the nature of the task and then use Tables 1, 2, and 5 to determine which model to use as a baseline. A new loss function can then be designed using the characteristics of the task. In the future, we anticipate that the limitations of existing GANs will be solved in novel ways, and that GAN will remain an important generative model by conquering areas that existing deep learning-based models cannot effectively solve.


We discussed how various objective functions and architectures affect the behavior of GAN, as well as the applications of GAN such as image translation, image attribute editing, domain adaptation, and other fields. GAN originated from a theoretical minimax game perspective. In addition to the standard GAN [36], practical trials as well as mathematical approaches have been adopted, resulting in many variants of GAN. Furthermore, the relationship between GAN and other concepts such as imitation learning and other generative models has been discussed and combined in various studies, resulting in rich theory and numerous application techniques. GAN has the potential to be applied in many domains, including those we have discussed. Despite GAN's significant success, there remain unsolved problems in its theoretical aspects (e.g., whether GAN actually converges and whether it can perfectly overcome mode collapse), as Arora et al. [6], Grnarova et al. [39], and Mescheder et al. [78] discussed. However, with the power of deep neural networks and the utility of learning a highly nonlinear mapping from latent space into data space, there remain enormous opportunities to develop GAN further and to apply it to various applications and fields.



This research was supported in part by Projects for Research and Development of Police science and Technology under Center for Research and Development of Police science and Technology and Korean National Police Agency funded by the Ministry of Science, ICT and Future Planning (PA-C000001), in part by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT & Future Planning (grant number: 2016M3A7B4911115), in part by the National Research Foundation of Korea (NRF) grant funded by the Korea government (Ministry of Science and ICT) [2018R1A2B3001628], and in part by the Brain Korea 21 Plus Project (Electrical and Computer Engineering, Seoul National University) in 2018.

Authors’ addresses: Y. Hong, U. Hwang, J. Yoo, and S. Yoon (corresponding author), Electrical and Computer Engineering, Seoul National University, 1 Gwanak-ro, Gwanak-gu, Seoul, 08826, Republic of Korea.

©2019 Association for Computing Machinery.
0360-0300/2019/01-ART10 $15.00

Publication History: Received January 2018; revised October 2018; accepted October 2018