
Generative Adversarial Networks for Image and Video Synthesis: Algorithms and Applications

Ming-Yu Liu*, Xun Huang*, Jiahui Yu*, Ting-Chun Wang*, Arun Mallya*

Abstract—The generative adversarial network (GAN) framework has emerged as a powerful tool for various image and video
synthesis tasks, allowing the synthesis of visual content in an unconditional or input-conditional manner. It has enabled the generation
of high-resolution photorealistic images and videos, a task that was challenging or impossible with prior methods. It has also led to the
creation of many new applications in content creation. In this paper, we provide an overview of GANs with a special focus on algorithms
and applications for visual synthesis. We cover several important techniques to stabilize GAN training, which has a reputation for being
notoriously difficult. We also discuss its applications to image translation, image processing, video synthesis, and neural rendering.
arXiv:2008.02793v2 [cs.CV] 30 Nov 2020

Index Terms—Generative Adversarial Networks, Computer Vision, Image Processing, Image and Video Synthesis, Neural Rendering

1 I NTRODUCTION

T HE generative adversarial network (GAN) framework


is a deep learning architecture [59], [100] introduced by
Goodfellow et al. [60]. It consists of two interacting neural
Generator Discriminator False

Discriminator True
networks—a generator network G and a discriminator net-
work D, which are trained jointly by playing a zero-sum (a) Unconditional GAN
game where the objective of the generator is to synthesize
Generator Discriminator False
fake data that resembles real data, and the objective of the
discriminator is to distinguish between real and fake data.
When the training is successful, the generator is an approx- Discriminator True
imator of the underlying data generation mechanism in the (b) Conditional GAN
sense that the distribution of the fake data converges to the
real one. Due to the distribution matching capability, GANs Fig. 1. Unconditional vs. Conditional GANs. (a) In unconditional
have become a popular tool for various data synthesis and GANs, the generator converts a noise input z to a fake image G(z)
manipulation problems, especially in the visual domain. where z ∼ Z and Z is usually a Gaussian random variable. The discrim-
GAN’s rise also marks another major success of deep inator tells apart real images x from the training dataset D and fake im-
ages from G. (b) In conditional GANs, the generator takes an additional
learning in replacing hand-designed components with input y as the control signal, which could be another image (image-to-
machine-learned components in modern computer vision image translation), text (text-to-image synthesis), or a categorical label
pipelines. As deep learning has directed the community to (label-to-image synthesis). The discriminator tells apart real from fake
abandon hand-designed features, such as the histogram of by leveraging the information in y . In both settings, the combination of
the discriminator and real training data defines an objective function
oriented gradients (HOG) [36], for deep features computed for image synthesis. This data-driven objective function definition is a
by deep neural networks, the objective function used to train powerful tool for many computer vision problems.
the networks remains largely hand-designed. While this is
not a major issue for a classification task since effective and
descriptive objective functions such as the cross-entropy discriminator is to produce images similar to the real images
loss exist, this is a serious hurdle for a generation task. used for training. Since all the training images contain cats,
After all, how can one hand-design a function to guide a the generator output must contain cats to win the game.
generator to produce a better cat image? How can we even Moreover, when we replace the cat images with dog images,
mathematically describe “felineness” in an image? we can use the same method to train a dog image generator.
GANs address the issue through deriving a functional The objective function for the generator is defined by the
form of the objective using training data. As the discrim- training dataset and the discriminator architecture. It is thus
inator is trained to tell whether an input image is a cat a very flexible framework to define the objective function
image from the training dataset or one synthesized by the for a generation task as illustrated in Figure 1.
generator, it defines an objective function that can guide the However, despite its excellent modeling power, GANs
generator in improving its generation based on its current are notoriously difficult to train because it involves chasing
network weights. The generator can keep improving as a moving target. Not only do we need to make sure the
long as the discriminator can differentiate real and fake generator can reach the target, but also that the target can
cat images. The only way that a generator can beat the reach a desirable level of goodness. Recall that the goal of
the discriminator is to differentiate real and fake data. As
• * Equal contribution. Ming-Yu Liu, Xun Huang, Ting-Chun Wang, and the generator changes, the fake data distribution changes
Arun Mallya are with NVIDIA. Jiahui Yu is with Google.
as well. This poses a new classification problem to the dis-
2
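To make the two roles in Figure 1(a) concrete, the following minimal sketch implements a toy unconditional generator and discriminator as small PyTorch MLPs. The layer sizes, the 2-D data dimensionality, and the 16-D noise vector are illustrative assumptions, not settings from any surveyed work.

```python
# Minimal sketch of the two players in Figure 1(a), using small MLPs (illustrative sizes).
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Maps a Gaussian noise vector z to a fake sample G(z)."""
    def __init__(self, noise_dim=16, data_dim=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(noise_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, data_dim),
        )

    def forward(self, z):
        return self.net(z)

class Discriminator(nn.Module):
    """Outputs an unnormalized real/fake score for an input sample."""
    def __init__(self, data_dim=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(data_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):
        return self.net(x)

G, D = Generator(), Discriminator()
z = torch.randn(8, 16)          # z ~ Z, a batch of Gaussian noise vectors
fake = G(z)                     # fake samples G(z)
score = D(fake)                 # discriminator scores for the fake batch
print(fake.shape, score.shape)  # torch.Size([8, 2]) torch.Size([8, 1])
```

A conditional GAN as in Figure 1(b) would additionally feed the control signal y to both networks, for example by concatenating an embedding of y to their inputs.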

However, despite its excellent modeling power, a GAN is notoriously difficult to train because training involves chasing a moving target. Not only do we need to make sure the generator can reach the target, but also that the target can reach a desirable level of goodness. Recall that the goal of the discriminator is to differentiate real and fake data. As the generator changes, the fake data distribution changes as well. This poses a new classification problem to the discriminator: distinguishing the same real data from a new kind of fake data distribution, one that is presumably more similar to the real data distribution. As the discriminator is updated according to the new classification problem, it induces a new objective for the generator. Without careful control of the dynamics, a learning algorithm tends to experience failures in GAN training. Often, the discriminator becomes too strong and provides strong gradients that push the generator to a numerically unstable region. This is a well-recognized issue. Fortunately, over the years, various approaches, including better training algorithms, network architectures, and regularization techniques, have been proposed to stabilize GAN training. We will review several representative approaches in Section 3. In Figure 2, we illustrate the progress of GANs over the past few years.

Fig. 2. GAN progress on face synthesis. The figure shows the progress of GANs on face synthesis over the years. From left to right, we have face synthesis results by (a) the original GAN [60], (b) DCGAN [163], (c) CoGAN [119], (d) PgGAN [84], and (e) StyleGAN [85]. This image was originally created and shared by Ian Goodfellow on Twitter.

In the original GAN formulation [60], the generator is formulated as a mapping function that converts a simple, unconditional distribution, such as a uniform distribution or a Gaussian distribution, to a complex data distribution, such as a natural image distribution. We now generally refer to this formulation as the unconditional GAN framework. While the unconditional framework has several important applications on its own, the lack of controllability over the generation outputs makes it unfit for many applications. This has motivated the development of the conditional GAN framework. In the conditional framework, the generator additionally takes a control signal as input. The signal can take many different forms, including category labels, texts, images, layouts, sounds, and even graphs. The goal of the generator is to produce outputs corresponding to the signal. In Figure 1, we compare these two frameworks. The conditional GAN framework has led to many exciting applications. We will cover several representative ones through Sections 4 to 7.

GANs have led to the creation of many exciting new applications. For example, they have been the core building block of semantic image synthesis algorithms, which concern converting human-editable semantic representations, such as segmentation masks or sketches, to photorealistic images. GANs have also led to the development of many image-to-image translation methods, which aim to translate an image in one domain to a corresponding image in a different domain. These methods find a wide range of applicability, ranging from image editing to domain adaptation. We will review some algorithms in this space in Section 4.

We can now find GAN's footprint in many visual processing systems. For example, for image restoration, super-resolution, and inpainting, where the goal is to transform an input image distribution to a target image distribution, GANs have been shown to generate results with much better visual quality than those produced with traditional methods. We will provide an overview of GAN methods for these image processing tasks in Section 5.

Video synthesis is another exciting area in which GANs have shown promising results. Many research works have utilized GANs to synthesize realistic human videos or transfer motions from one person to another for various entertainment applications, which we will review in Section 6. Finally, thanks to their great capability in generating photorealistic images, GANs have played an important role in the development of neural rendering—using neural networks to boost the performance of the graphics rendering pipeline. We will cover GAN works in this space in Section 7.

2 RELATED WORKS

Several GAN review articles exist, including the introductory article by Goodfellow [58]. The articles by Creswell et al. [35] and Pan et al. [154] summarize GAN methods prior to 2018. Wang et al. [222] provide a taxonomy of GANs. Our work differs from the prior works in that we provide a more contemporary summary of GAN works with a focus on image and video synthesis.

There are many different deep generative models, or deep neural networks that model the generation process of some data. Besides GANs, other popular deep generative models include deep Boltzmann machines, variational autoencoders, deep autoregressive models, and normalizing flow models. We compare these models in Figure 3 and briefly review them below.

Deep Boltzmann Machines (DBMs). DBMs [45], [48], [68], [175] are energy-based models [101], which can be represented by undirected graphs. Let x denote the array of image pixels, often called the visible nodes, and let h denote the hidden nodes. DBMs model the probability density function of the data based on the Boltzmann (or Gibbs) distribution as

p(x; θ) = (1 / N(θ)) Σ_h exp(−E(x, h; θ)),   (1)

where E is an energy function modeling interactions of nodes in the graph, N is the partition function, and θ denotes the network parameters to be learned. Once a DBM is trained, a new image can be generated by applying Markov Chain Monte Carlo (MCMC) sampling, ascending from a random configuration to one with high probability. While highly expressive, the reliance on MCMC sampling in both training and generation makes DBMs scale poorly compared to other deep generative models, since efficient MCMC sampling is itself a challenging problem, especially for large networks.
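As a toy illustration of the Boltzmann distribution in (1), the snippet below evaluates p(x; θ) for a tiny fully visible binary model whose partition function can still be enumerated by brute force; the quadratic energy function is an assumption chosen for brevity, and realistic DBMs with hidden nodes require MCMC rather than this exhaustive sum.

```python
# Toy Boltzmann distribution p(x) = exp(-E(x)) / N for a tiny binary model (Eq. (1)),
# with the partition function N computed by brute-force enumeration.
import itertools
import numpy as np

rng = np.random.default_rng(0)
n = 4                                   # number of binary visible units
W = rng.normal(scale=0.5, size=(n, n))  # illustrative pairwise parameters (part of theta)
W = (W + W.T) / 2
b = rng.normal(scale=0.5, size=n)       # illustrative bias parameters (part of theta)

def energy(x):
    # Assumed toy quadratic energy E(x; theta) = -x^T W x - b^T x.
    return -x @ W @ x - b @ x

states = [np.array(s) for s in itertools.product([0, 1], repeat=n)]
Z = sum(np.exp(-energy(s)) for s in states)   # partition function N(theta)

def p(x):
    return np.exp(-energy(x)) / Z             # normalized probability of a state

print(sum(p(s) for s in states))  # ~1.0, confirming normalization
```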
Fig. 3. Structure comparison of different deep generative models: (a) Boltzmann machine, (b) variational autoencoder, (c) autoregressive model, (d) normalizing flow model, (e) generative adversarial network. Except for the deep Boltzmann machine, which is based on undirected graphs, the models are all based on directed graphs, which enjoy a faster inference speed.

Variational AutoEncoders (VAEs). VAEs [93], [94], [168] are directed probabilistic graphical models, inspired by the Helmholtz machine [37]. They are also descendants of latent variable models, such as principal component analysis and autoencoders [18], which concern representing high-dimensional data x using lower-dimensional latent variables z. In terms of structure, a VAE employs an inference model q(z|x; φ) and a generation model p(x|z; θ)p(z), where p(z) is usually a Gaussian distribution, which we can easily sample from, and q(z|x; φ) approximates the posterior p(z|x; θ). Both the inference and generation models are implemented using feed-forward neural networks. VAE training maximizes the evidence lower bound (ELBO) of log p(x; θ), and the non-differentiability of the stochastic sampling is elegantly handled by the reparametrization trick [94]. One can also show that maximizing the ELBO is equivalent to minimizing the Kullback–Leibler (KL) divergence

KL( q(x) q(z|x; φ) ‖ p(z) p(x|z; θ) ),   (2)

where q(x) is the empirical distribution of the data [94]. Once a VAE is trained, an image can be efficiently generated by first sampling z from the Gaussian prior p(z) and then passing it through the feed-forward deep neural network p(x|z; θ). VAEs are effective in learning useful latent representations [188]. However, they tend to generate blurry output images.

Deep AutoRegressive Models (DARs). DARs [30], [153], [177], [207] are deep learning implementations of classical autoregressive models, which assume an ordering of the random variables to be modeled and generate the variables sequentially based on the ordering. This induces a factorization of the data distribution given by

p(x; θ) = Π_i p(x_i | x_<i; θ),   (3)

where the x_i are the variables in x, and x_<i is the union of the variables that precede x_i in the assumed ordering. DARs are conditional generative models in that they generate a new portion of the signal based on what has been generated or observed so far. The learning is based on maximum likelihood,

max_θ E_{x∼D}[ log p(x_i | x_<i; θ) ].   (4)

DAR training is more stable compared to the other generative models. But, due to their recurrent nature, they are slow at inference. Also, while for audio or text a natural ordering of the variables can be determined based on the time dimension, such an ordering does not exist for images. One hence has to enforce an order prior that is an unnatural fit to the image grid.

Normalizing Flow Models (NFMs). NFMs [40], [41], [92], [167] are based on the normalizing flow—a transformation of a simple probability distribution into a more complex distribution by a sequence of invertible and differentiable mappings. Each mapping corresponds to a layer in a deep neural network. With a layer design that guarantees invertibility and differentiability for all possible weights, one can stack many such layers to construct a powerful mapping, because the composition of invertible and differentiable functions is itself invertible and differentiable. Let F = f^(1) ∘ f^(2) ∘ ... ∘ f^(K) be such a K-layer mapping that maps the simple probability distribution Z to the data distribution X. The probability density of a sample x ∼ X can be computed by transforming it back to the corresponding z. Hence, we can apply maximum likelihood learning to train NFMs, because the log-likelihood of the complex data distribution can be converted to the log-likelihood of the simple prior distribution minus the Jacobian terms. This gives

log p(x; θ) = log p(z; θ) − Σ_{i=1}^{K} log | det( df^(i) / dz_{i−1} ) |,   (5)

where z_i = f^(i)(z_{i−1}). One key strength of NFMs is in supporting direct evaluation of the probability density. However, NFMs require an invertible mapping, which greatly limits the choice of applicable architectures.
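The change-of-variables computation in (5) can be sketched with a stack of invertible element-wise affine layers, as below; this simple parameterization is an illustrative assumption rather than a flow architecture from the cited works.

```python
# Change-of-variables log-likelihood of Eq. (5) for a stack of invertible
# element-wise affine layers z_i = exp(s_i) * z_{i-1} + t_i (illustrative choice).
import math
import torch

torch.manual_seed(0)
dim, n_layers = 3, 4
scales = [torch.randn(dim) * 0.1 for _ in range(n_layers)]   # log-scales s_i
shifts = [torch.randn(dim) for _ in range(n_layers)]          # shifts t_i

def forward_and_logdet(z):
    """Map a base sample z through the flow, accumulating sum_i log|det df^(i)/dz_{i-1}|."""
    logdet = torch.zeros(z.shape[0])
    for s, t in zip(scales, shifts):
        z = torch.exp(s) * z + t
        logdet = logdet + s.sum()   # Jacobian of an element-wise affine map is diag(exp(s))
    return z, logdet

def log_prob(x):
    """log p(x) = log p(z) - sum_i log|det df^(i)/dz_{i-1}|, with a standard Gaussian prior."""
    z = x
    for s, t in zip(reversed(scales), reversed(shifts)):   # invert the flow to recover z
        z = (z - t) * torch.exp(-s)
    _, logdet = forward_and_logdet(z)
    log_pz = -0.5 * (z ** 2).sum(dim=1) - 0.5 * dim * math.log(2 * math.pi)
    return log_pz - logdet

x = torch.randn(5, dim)
print(log_prob(x))   # log-densities of the samples under the toy flow
```

Maximizing the returned log-probabilities over a dataset is exactly the maximum likelihood training described above.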
3 LEARNING

Let θ and φ be the learnable parameters of G and D, respectively. GAN training is formulated as a minimax problem

min_θ max_φ V(θ, φ),   (6)

where V is the utility function.

GAN training is challenging. Famous failure cases include mode collapse and mode dropping. In mode collapse, the generator is trapped in a certain local minimum where it only captures a small portion of the distribution. In mode dropping, the generator does not faithfully model the target distribution and misses some portion of it. Other common failure cases include checkerboard and waterdrop artifacts. In this paper, we cover the basics of GAN training and some techniques invented to improve training stability.

3.1 Learning Objective

The core idea in GAN training is to minimize the discrepancy between the true data distribution p(x) and the fake data distribution p(G(z; θ)). As there are a variety of ways to measure the distance between two distributions, such as the Jensen–Shannon divergence, the Kullback–Leibler divergence, and the integral probability metric, there are also a variety of GAN losses, including the saturated GAN loss [60], the non-saturated GAN loss [60], the Wasserstein GAN loss [6], [64], the least-square GAN loss [134], the hinge GAN loss [112], [241], the f-divergence GAN loss [81], [150], and the relativistic GAN loss [80]. Empirically, the performance of a GAN loss depends on the application as well as the network architecture. As of the time of writing this survey paper, there is no clear consensus on which one is absolutely better.

Here, we give a generic GAN learning objective formulation that subsumes several popular ones. For the discriminator update step, the learning objective is

max_φ E_{x∼D}[ f_D(D(x; φ)) ] + E_{z∼Z}[ f_G(D(G(z; θ); φ)) ],   (7)

where f_D and f_G are the output layers that transform the results computed by the discriminator D into the classification scores for the real and fake images, respectively. For the generator update step, the learning objective is

min_θ E_{z∼Z}[ g_G(D(G(z; θ); φ)) ],   (8)

where g_G is the output layer that transforms the result computed by the discriminator into a classification score for the fake image. In Table 1, we compare f_D, f_G, and g_G for several popular GAN losses.

TABLE 1
Comparison of different GAN losses, including saturated [60], non-saturated [60], Wasserstein [6], least-square [134], and hinge [112], [241], in terms of the discriminator output layer type in (7) and (8). We maximize f_D and f_G for training the discriminator. As shown in (8), we minimize g_G for training the generator. Note that σ(x) = 1/(1 + e^(−x)) is the sigmoid function.

Loss           | f_D(x)          | f_G(x)            | g_G(x)
Saturated      | log σ(x)        | log(1 − σ(x))     | log(1 − σ(x))
Non-Saturated  | log σ(x)        | log(1 − σ(x))     | − log σ(x)
Wasserstein    | x               | −x                | −x
Least-Square   | −(x − 1)²       | −x²               | (x − 1)²
Hinge          | min(0, x − 1)   | min(0, −x − 1)    | −x
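To connect Table 1 with (7) and (8), the snippet below writes two of the listed losses—the non-saturated and the hinge loss—as the output-layer functions f_D, f_G, and g_G applied to raw discriminator scores. It is a sketch of the table entries, not code from any particular GAN implementation.

```python
# f_D, f_G, g_G from Table 1 for the non-saturated and hinge GAN losses,
# applied to raw (pre-sigmoid) discriminator scores.
import torch
import torch.nn.functional as F

# Non-saturated loss: maximize log sigma(x) on real scores and log(1 - sigma(x)) on
# fake scores; the generator minimizes -log sigma(x) on fake scores.
def ns_f_D(real_scores):  return F.logsigmoid(real_scores)     # log sigma(x)
def ns_f_G(fake_scores):  return F.logsigmoid(-fake_scores)    # log(1 - sigma(x))
def ns_g_G(fake_scores):  return -F.logsigmoid(fake_scores)    # -log sigma(x)

# Hinge loss: maximize min(0, x - 1) on real scores and min(0, -x - 1) on fake
# scores; the generator minimizes -x on fake scores.
def hinge_f_D(real_scores): return torch.clamp(real_scores - 1, max=0.0)
def hinge_f_G(fake_scores): return torch.clamp(-fake_scores - 1, max=0.0)
def hinge_g_G(fake_scores): return -fake_scores

real, fake = torch.randn(4), torch.randn(4)
d_objective = ns_f_D(real).mean() + ns_f_G(fake).mean()   # maximized in (7)
g_objective = ns_g_G(fake).mean()                          # minimized in (8)
print(d_objective.item(), g_objective.item())
```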
3.2 Training

Two variants of stochastic gradient descent/ascent (SGD) schemes are commonly used for GAN training: the simultaneous update scheme and the alternating update scheme. Let V_D(θ, φ) and V_G(θ, φ) be the objective functions in (7) and (8), respectively. In the simultaneous update, each training iteration contains a discriminator update step and a generator update step given by

φ^(t+1) = φ^(t) + α_D ∂V_D(θ^(t), φ^(t)) / ∂φ,   (9)
θ^(t+1) = θ^(t) − α_G ∂V_G(θ^(t), φ^(t)) / ∂θ,   (10)

where α_D and α_G are the learning rates for the discriminator and generator, respectively. In the alternating update, each training iteration consists of one discriminator update step followed by a generator update step, given by

φ^(t+1) = φ^(t) + α_D ∂V_D(θ^(t), φ^(t)) / ∂φ,   (11)
θ^(t+1) = θ^(t) − α_G ∂V_G(θ^(t), φ^(t+1)) / ∂θ.   (12)

Note that in the alternating update scheme, the generator update (12) utilizes the newly updated discriminator parameters φ^(t+1), while in the simultaneous update (10) it does not. These two schemes have their pros and cons. The simultaneous update scheme can be computed more efficiently, as a major part of the computation in the two steps can be shared. On the other hand, the alternating update scheme tends to be more stable, as the generator update is computed based on the latest discriminator. Recent GAN works [24], [64], [70], [118], [156] mostly use the alternating update scheme. Sometimes, the discriminator update (11) is performed several times before computing (12) [24], [64].

Among various SGD algorithms, ADAM [91], which is based on adaptive estimates of the first- and second-order moments, is very popular for training GANs. ADAM has several user-defined parameters. Typically, the first momentum is set to 0, while the second momentum is set to 0.999. The learning rate for the discriminator update is often set 2 to 4 times larger than the learning rate for the generator update (usually set to 0.0001), which is called the two time-scale update rule (TTUR) [67]. We also note that RMSProp [201] is popular for GAN training [64], [84], [85], [118].

3.3 Regularization

We review several popular regularization techniques available for countering instability in GAN training.

Gradient Penalty (GP) is an auxiliary loss term that penalizes the deviation of the gradient norm from a desired value [64], [138], [169]. To use GP, one adds it to the objective function for the discriminator update, i.e., (7). There are several variants of GP. Generally, they can be expressed as

GP-δ = E_x̂[ ( ‖∇D(x̂)‖₂ − δ )² ].   (13)

The two most common forms are GP-1 [64] and GP-0 [138]. GP-1 was first introduced by Gulrajani et al. [64]. It uses an imaginary data distribution

x̂ = u x + (1 − u) G(z),  u ∼ U(0, 1),   (14)

where u is a uniform random variable between 0 and 1. Basically, x̂ is neither real nor fake; it is a convex combination of a real sample and a fake sample. The design of GP-1 is motivated by the property of an optimal D that solves the Wasserstein GAN loss. However, GP-1 is also useful when using other GAN losses. In practice, it has the effect of countering the vanishing and exploding gradients that occur during GAN training.
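The sketch below combines the alternating update scheme of (11)-(12) with the GP-1 penalty of (13)-(14) in a single training iteration. The toy MLPs, the batch size, the non-saturated loss, and the penalty weight of 10 are illustrative assumptions rather than settings taken from the surveyed methods; the ADAM first momentum of 0 and the larger discriminator learning rate loosely follow the practices mentioned above.

```python
# One alternating GAN training iteration (Eqs. (11)-(12)) with a GP-1 penalty
# on interpolated samples (Eqs. (13)-(14)). Toy networks and hyper-parameters.
import torch
import torch.nn as nn
import torch.nn.functional as F

noise_dim, data_dim = 16, 2
G = nn.Sequential(nn.Linear(noise_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1))
opt_D = torch.optim.Adam(D.parameters(), lr=4e-4, betas=(0.0, 0.999))  # larger D learning rate
opt_G = torch.optim.Adam(G.parameters(), lr=1e-4, betas=(0.0, 0.999))

def gradient_penalty(real, fake, weight=10.0):
    """GP-1: penalize (||grad D(x_hat)||_2 - 1)^2 at convex combinations of real and fake."""
    u = torch.rand(real.size(0), 1)
    x_hat = (u * real + (1 - u) * fake).requires_grad_(True)
    grad = torch.autograd.grad(D(x_hat).sum(), x_hat, create_graph=True)[0]
    return weight * ((grad.norm(2, dim=1) - 1) ** 2).mean()

def train_step(real):
    # Discriminator update (Eq. (11)): ascend the objective, i.e., minimize its negative.
    fake = G(torch.randn(real.size(0), noise_dim)).detach()
    d_loss = -(F.logsigmoid(D(real)).mean() + F.logsigmoid(-D(fake)).mean())
    d_loss = d_loss + gradient_penalty(real, fake)
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # Generator update (Eq. (12)): uses the freshly updated discriminator.
    fake = G(torch.randn(real.size(0), noise_dim))
    g_loss = -F.logsigmoid(D(fake)).mean()      # non-saturated generator loss
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()
    return d_loss.item(), g_loss.item()

real_batch = torch.randn(32, data_dim)          # stand-in for a batch of real training data
print(train_step(real_batch))
```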

Fig. 4. Generator evolution: (a) multi-layer perceptron, (b) deep ConvNet, (c) residual network, (d) residual network with conditional activation norm, (e) conditional ConvNet. Since the debut of GANs [60], the generator architecture has continuously evolved. From (a)-(c), one can observe the change from simple MLPs to deep convolutional and residual networks. Recently, conditional architectures, including conditional activation norms (d) and conditional convolutions (e), have gained popularity as they allow users to have more control over the generation outputs.

On the other hand, the design of GP-0 is based on the idea of penalizing the discriminator for deviating away from the Nash equilibrium. GP-0 takes a simpler form: it does not use an imaginary sample distribution but uses the real data distribution, i.e., it sets x̂ ≡ x. We find the use of GP-0 in several state-of-the-art GAN algorithms [85], [86].

Spectral Normalization (SN) [140] is an effective regularization technique used in many recent GAN algorithms [24], [156], [170], [241]. SN regularizes the spectral norm of the projection operation at each layer of the discriminator by simply dividing the weight matrix by its largest singular value. Let W be the weight matrix of a layer of the discriminator network. With SN, the true weight that is applied is

W / √( λ_max(Wᵀ W) ),   (15)

where λ_max(A) extracts the largest eigenvalue from the square matrix A. In other words, each projection layer has a projection matrix with spectral norm equal to one.

Feature Matching (FM) provides a way to encourage the generator to generate images similar to real ones in some sense. Similar to GP, FM is an auxiliary loss. There are two popular implementations: one is batch-based [176] and the other is instance-based [99], [218]. Let D_i be the i-th layer of a discriminator D, i.e., D = D_d ∘ ... ∘ D_2 ∘ D_1. The batch-based FM loss matches the moments of the activations extracted from the real and fake images, respectively. For the i-th layer, the loss is

‖ E_{x∼D}[ D_i ∘ ... ∘ D_1(x) ] − E_{z∼Z}[ D_i ∘ ... ∘ D_1(G(z)) ] ‖.   (16)

One can apply the FM loss to a subset of layers in the discriminator and use the weighted sum as the final FM loss. The instance-based FM loss is only applicable to conditional generation models, where we have the corresponding real image for a fake image. For the i-th layer, the instance-based FM loss is given by

‖ [ D_i ∘ ... ∘ D_1(x_i) ] − [ D_i ∘ ... ∘ D_1(G(z, y_i)) ] ‖,   (17)

where y_i is the control signal for x_i.

Perceptual Loss [79]. Often, when the instance-based FM loss is applicable, one can additionally match features extracted from real and fake images using a pretrained network. Such a variant of the FM loss is called the perceptual loss [79].

Model Average (MA) can improve the quality of images generated by a GAN. To use MA, we keep two copies of the generator network during training, where one is the original generator with weights θ and the other is the model-average generator with weights θ_MA. At iteration t, we update θ_MA based on

θ_MA^(t) = β θ^(t) + (1 − β) θ_MA^(t−1),   (18)

where β is a scalar controlling the contribution from the current model weights.
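Eq. (15) can be applied to a weight matrix directly, as in the brute-force sketch below, which divides W by its largest singular value (the square root of λ_max(Wᵀ W)) and verifies that the resulting spectral norm is one. Practical implementations, including the PyTorch utility torch.nn.utils.spectral_norm, approximate the largest singular value with power iteration instead of a full decomposition.

```python
# Spectral normalization of a weight matrix as in Eq. (15):
# divide W by sqrt(lambda_max(W^T W)), i.e., by its largest singular value.
import torch

torch.manual_seed(0)
W = torch.randn(64, 128)

sigma_max = torch.linalg.svdvals(W)[0]        # singular values come in descending order
W_sn = W / sigma_max

print(float(sigma_max))                       # spectral norm of the original W
print(float(torch.linalg.svdvals(W_sn)[0]))   # ~1.0 after normalization
```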

3.4 Network Architecture

Network architectures provide a convenient way to inject inductive biases. Certain network designs often work better than others for a given task. Since the introduction of GANs, we have observed an evolution of the network architectures of both the generator and the discriminator.

Generator Evolution. In Figure 4, we visualize the evolution of the GAN generator architecture. In the original GAN paper [60], both the generator and the discriminator are based on the multilayer perceptron (MLP) (Figure 4(a)). As an MLP fails to model the translational invariance property of natural images, its output images are of limited quality. In the DCGAN work [163], a deep convolutional architecture (Figure 4(b)) is used for the GAN generator. As the convolutional architecture is a better fit for modeling image signals, the outputs produced by the DCGAN often have better quality. Researchers also borrow architecture designs from discriminative modeling tasks. As the residual architecture [66] has proven effective for training deep networks, several GAN works use the residual architecture in their generator design (Figure 4(c)) [6], [140].

A residual block used in modern GAN generators typically consists of a skip connection paired with a series of batch normalization (BN) [74], nonlinearity, and convolution operations. BN is one type of activation norm (AN), a technique that normalizes the activation values to facilitate training. Other AN variants have also been exploited for the GAN generator, including instance normalization [206], layer normalization [8], and group normalization [227]. Generally, an activation normalization scheme consists of a whitening step followed by an affine transformation step. Let h_c be the output of the whitening step for h. The final output of the normalization layer is

γ_c h_c + β_c,   (19)

where γ_c and β_c are scalars used to scale and shift the post-normalization activation values. They are constants learned during training.

For many applications, it is required to have some way to control the output produced by a generator. This desire has motivated various conditional generator architectures (Figure 4(d)) for the GAN generator [24], [70], [156]. The most common approach is to use a conditional AN, in which both γ_c and β_c are data dependent. Often, one employs a separate network to map the input control signals to the target γ_c and β_c values. Another way to achieve such controllability is to use hyper-networks—basically, using an auxiliary network to produce the weights of the main network. For example, we can have a convolutional layer whose filter weights are generated by a separate network. We often call such a scheme conditional convolution (Figure 4(e)), and it has been used in several state-of-the-art GAN generators [86], [216].
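Below is a minimal sketch of the conditional activation normalization described above: a small auxiliary network maps a control signal y to per-channel γ_c and β_c, which then modulate the whitened activations as in (19). The choice of instance normalization for the whitening step and the layer sizes are illustrative assumptions.

```python
# Conditional activation normalization: gamma_c and beta_c in Eq. (19) are predicted
# from a control signal y by a small auxiliary network (illustrative sizes).
import torch
import torch.nn as nn

class ConditionalNorm(nn.Module):
    def __init__(self, num_channels, cond_dim):
        super().__init__()
        # Whitening step: normalize activations without learned affine parameters.
        self.norm = nn.InstanceNorm2d(num_channels, affine=False)
        # Auxiliary mapping from the control signal to per-channel gamma and beta.
        self.to_gamma = nn.Linear(cond_dim, num_channels)
        self.to_beta = nn.Linear(cond_dim, num_channels)

    def forward(self, h, y):
        h_c = self.norm(h)                                    # whitened activations h_c
        gamma = self.to_gamma(y).unsqueeze(-1).unsqueeze(-1)  # shape [B, C, 1, 1]
        beta = self.to_beta(y).unsqueeze(-1).unsqueeze(-1)
        return gamma * h_c + beta                             # Eq. (19), data dependent

h = torch.randn(4, 32, 8, 8)    # a batch of feature maps
y = torch.randn(4, 10)          # control signal (e.g., a class or style embedding)
out = ConditionalNorm(32, 10)(h, y)
print(out.shape)                # torch.Size([4, 32, 8, 8])
```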
Discriminator Evolution. GAN discriminators have also undergone an evolution. However, the change has mostly been a move from MLPs to deep convolutional and residual architectures. As the discriminator solves a classification task, new breakthroughs in architecture design for image classification could influence future GAN discriminator designs.

Conditional Discriminator Architecture. There are several effective architectures for utilizing control signals (conditional inputs y) in the GAN discriminator to achieve better image generation quality, as visualized in Figure 5. These include the auxiliary classifier (AC) [151], input concatenation (IC) [75], and the projection discriminator (PD) [141]. The AC and PD are mostly used for category-conditional image generation tasks, while the IC is common for image-to-image translation tasks.

Fig. 5. Conditional discriminator architectures. There are several ways to leverage the user input signal y in the GAN discriminator. (a) Auxiliary classifier [151]. In this design, the discriminator is asked to predict the ground-truth label for the real image. (b) Input concatenation [75]. In this design, the discriminator learns to reason whether the input is real by learning a joint feature embedding of image and label. (c) Projection discriminator [141]. In this design, the discriminator computes an image embedding and correlates it with the label embedding (through the dot product) to determine whether the input is real or fake.

Neural Architecture Search. As neural architecture search has become a popular topic for various recognition tasks, efforts have been made to automatically find performant architectures for GANs [56].

While the current and previous sections have focused on introducing the GAN mechanism and various algorithms used to train it, the following sections focus on various applications of GANs to generating images and videos.

4 IMAGE TRANSLATION

This section discusses the application of GANs to image-to-image translation, which aims to map an image from one domain to a corresponding image in a different domain, e.g., sketches to shoes, label maps to photos, summer to winter. The problem can be studied in a supervised setting, where example pairs of corresponding images are available, or an unsupervised setting, where such training data is unavailable and we only have two independent sets of images. In the following subsections, we discuss recent progress in both settings.

4.1 Supervised Image Translation

Isola et al. [75] proposed the pix2pix framework as a general-purpose solution to image-to-image translation in the supervised setting. The training objective of pix2pix combines a conditional GAN loss with a pixel-wise ℓ1 loss between the generated image and the ground truth. One notable design choice of pix2pix is the use of patch-wise discriminators (PatchGAN), which attempt to discriminate each local image patch rather than the whole image. This design incorporates the prior knowledge that the underlying image translation function we want to learn is local, assuming independence between pixels that are far away. In other words, the translation mostly involves style or texture changes. It significantly alleviates the burden on the discriminator because it requires much less model capacity to discriminate local patches than whole images.
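The supervised objective described above—a conditional GAN term plus a pixel-wise ℓ1 term, judged by a patch-wise discriminator—can be sketched as follows. The tiny convolutional discriminator, the non-saturated adversarial term, and the ℓ1 weight are illustrative assumptions and not the exact pix2pix configuration.

```python
# Supervised translation objective in the spirit of pix2pix: conditional GAN loss
# plus a pixel-wise L1 loss, with a patch-wise discriminator score map.
import torch
import torch.nn as nn
import torch.nn.functional as F

# A tiny patch discriminator: takes the (input, output) pair concatenated on channels
# and returns one real/fake score per local patch (illustrative depth and widths).
patch_D = nn.Sequential(
    nn.Conv2d(6, 32, kernel_size=4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(32, 64, kernel_size=4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(64, 1, kernel_size=3, padding=1),   # a score map, not a single scalar
)

def generator_loss(x, fake, target, l1_weight=100.0):
    """Adversarial term on patch scores + weighted L1 to the paired ground truth."""
    scores = patch_D(torch.cat([x, fake], dim=1))
    adv = -F.logsigmoid(scores).mean()            # non-saturated GAN term
    return adv + l1_weight * F.l1_loss(fake, target)

x = torch.randn(2, 3, 64, 64)        # input-domain image (e.g., a rendered label map)
target = torch.randn(2, 3, 64, 64)   # paired ground-truth output
fake = torch.tanh(torch.randn(2, 3, 64, 64, requires_grad=True))  # stand-in for G(x)
loss = generator_loss(x, fake, target)
loss.backward()
print(loss.item(), patch_D(torch.cat([x, fake], dim=1)).shape)  # score map is [2, 1, 16, 16]
```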

One important limitation of pix2pix is that its translation function is restricted to be one-to-one. However, many of the mappings we aim to learn are one-to-many in nature. In other words, the distribution of possible outputs is multimodal. For example, one can imagine many shoes in different colors and styles that correspond to the same sketch of a shoe. Naively injecting a Gaussian noise latent code into the generator does not lead to much variation, since the generator is free to ignore that latent code. BicycleGAN [254] explores approaches to encourage the generator to make use of the latent code to represent output variations, including applying a KL divergence loss to the encoded latent code and reconstructing the sampled latent code from the generated image. Other strategies to encourage diversity include using different generators to capture different output modes [54], replacing the reconstruction loss with a maximum likelihood objective [106], [107], and directly encouraging the distance between output images generated from different latent codes to be large [120], [133], [233].

Besides, the quality of image-to-image translation has been significantly improved by some recent works [104], [122], [156], [194], [218], [249]. In particular, pix2pixHD [218] is able to generate high-resolution images with a coarse-to-fine generator and a multi-scale discriminator. SPADE [156] further improves the image quality with a spatially-adaptive normalization layer. SPADE, in addition, allows a style image input for better control of the desired look of the output image. Some examples of SPADE are shown in Figure 6.

Fig. 6. Image translation examples of SPADE [156], which converts semantic label maps (with regions such as sky, cloud, tree, mountain, sea, and grass) into photorealistic natural scenes. The style of the output image can also be controlled by a reference image (the leftmost column). Images are from Park et al. [156].

4.2 Unsupervised Image Translation

For many tasks, paired training images are very difficult to obtain [16], [32], [70], [90], [105], [117], [234], [253]. Unsupervised learning of mappings between corresponding images in two domains is a much harder problem but has wider applications than the supervised setting. CycleGAN [253] simultaneously learns mappings in both directions and employs a cycle-consistency loss to enforce that if an image is translated to the other domain and translated back to the original domain, the output should be close to the original image. UNIT [117] makes a shared latent space assumption [119]: a pair of corresponding images can be mapped to the same latent code in a shared latent space. It is shown that a shared latent space implies cycle consistency and imposes a stronger regularization. DistanceGAN [16] encourages the mapping to preserve the distance between any pair of images before and after translation. While the methods above need to train a different model for each pair of image domains, StarGAN [32] is able to translate images across multiple domains using only a single model.

In many unsupervised image translation tasks (e.g., horses to zebras, dogs to cats), the two image domains mainly differ in the foreground objects, and the background distribution is very similar. Ideally, the model should only modify the foreground objects and leave the background region untouched. Some works [31], [137], [231] employ spatial attention to detect and change the foreground region without influencing the background. InstaGAN [142] further allows the shape of the foreground objects to be changed.

The early work mentioned above focuses on unimodal translation. On the other hand, recent advances [5], [57], [70], [105], [128], [133] have made it possible to perform multimodal translation, generating diverse output images given the same input. For example, MUNIT [70] assumes that images can be encoded into two disentangled latent spaces: a domain-invariant content space that captures the information that should be preserved during translation, and a domain-specific style space that represents the variations that are not specified by the input image. To generate diverse translation results, we can recombine the content code of the input image with different style codes sampled from the style space of the target domain. Figure 7 compares MUNIT with existing unimodal translation methods, including CycleGAN and UNIT. The disentangled latent space not only enables multimodal translation but also allows example-guided translation, in which the generator recombines the domain-invariant content of an image from the source domain with the domain-specific style of an image from the target domain. The idea of using a guiding style image has also been applied to the supervised setting [156], [214], [243].
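As a concrete form of the cycle constraint discussed above, the snippet below computes a two-direction cycle-consistency loss for mappings G: X1 → X2 and F: X2 → X1. The 1×1-convolution "translators" are placeholders so the example runs, and the ℓ1 form of the penalty is a common convention rather than any single paper's exact formulation.

```python
# Cycle-consistency loss for mappings G: X1 -> X2 and F: X2 -> X1:
# translate to the other domain and back, then penalize deviation from the input.
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder 1x1-conv "translators" so the example is runnable end to end.
G = nn.Conv2d(3, 3, kernel_size=1)    # X1 -> X2
Fm = nn.Conv2d(3, 3, kernel_size=1)   # X2 -> X1 (named Fm to avoid clashing with torch.nn.functional)

def cycle_loss(x1, x2):
    forward_cycle = F.l1_loss(Fm(G(x1)), x1)   # x1 -> G(x1) -> F(G(x1)) should return to x1
    backward_cycle = F.l1_loss(G(Fm(x2)), x2)  # x2 -> F(x2) -> G(F(x2)) should return to x2
    return forward_cycle + backward_cycle

x1 = torch.rand(2, 3, 32, 32)   # images from domain X1
x2 = torch.rand(2, 3, 32, 32)   # images from domain X2
print(cycle_loss(x1, x2).item())
```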

Fig. 7. Comparison of unsupervised image translation methods (CycleGAN [253], UNIT [117], and MUNIT [70]). X1 and X2 are two different image domains (dogs and cats in this example). (a) CycleGAN enforces the learned mappings to be inverses of each other. (b) UNIT auto-encodes images in both domains to a common latent space Z. Both CycleGAN and UNIT can only perform unimodal translation. (c) MUNIT decomposes the latent space into a shared content space C and unshared style spaces S1, S2. Diverse outputs can be obtained by sampling different style codes from the target style space. (d) The style of the translation output can also be controlled by a guiding image in the target domain.

Although paired example images are not needed in the unsupervised setting, most existing methods still require access to a large number of unpaired example images in both the source and target domains. Some works seek to reduce the number of training examples without much loss of performance. Benaim and Wolf [17] focus on the situation where there are many images in the target domain but only a single image in the source domain. The work of Cohen and Wolf [34] enables translation in the opposite direction, where the source domain has many images but the target domain has only one. The above setting assumes the source and target domain images, whether there are many or few, are available during training. Liu et al. [118] proposed FUNIT to address a different situation where many source domain images are available during training, but only a few target domain images are available, and only at test time. The target domain images are used to guide translation, similar to the example-guided translation procedure in MUNIT. Saito et al. [170] proposed a content-conditioned style encoder to better preserve the domain-invariant content of the input image. However, the above scenario [118], [170] still assumes access to the domain labels of the training images. Some recent work aims to reduce the need for such supervision by using few [220] or even no [9] domain labels. Very recently, some works [15], [113], [155] are able to achieve image translation even when each domain only has a single image, inspired by recent advances that can train GANs on a single image [179].

Despite the empirical successes, the problem of unsupervised image-to-image translation is inherently ill-posed, even with constraints such as cycle consistency or a shared latent space. Specifically, there exist infinitely many mappings that satisfy those constraints [38], [51], [230], yet most of them are not semantically meaningful. How do current methods successfully find the meaningful mapping in practice? Galanti et al. [51] assume that the meaningful mapping is of minimal complexity and that the popular generator architectures are not expressive enough to represent mappings that are highly complex. Bezenac et al. [38] further argue that the popular architectures are implicitly biased towards mappings that produce minimal changes to the input, which are usually semantically meaningful. In summary, the training objectives of unsupervised image translation alone cannot guarantee that the model finds semantically meaningful mappings, and the inductive bias of the generator architecture plays an important role.

5 IMAGE PROCESSING

GAN's strength in generating realistic images makes it ideal for solving various image processing problems, especially those where the perceptual quality of the image outputs is the primary evaluation criterion. This section discusses some prominent GAN-based methods for several key image processing problems, including image restoration and enhancement (Section 5.1) and image inpainting (Section 5.2).

5.1 Image Restoration and Enhancement

The traditional way of evaluating algorithms for image restoration and enhancement tasks is to measure the distortion, the difference between the ground truth images and the restored images, using metrics like the mean square error (MSE), the peak signal-to-noise ratio (PSNR), and the structural similarity index (SSIM). Recently, metrics for measuring perceptual quality, such as the no-reference (NR) metric [127], have been proposed, as the visual quality is arguably the most important factor for the usability of an algorithm. Blau et al. [22] proposed the perception-distortion tradeoff, which states that an image restoration algorithm can potentially improve only in terms of its distortion or in terms of its perceptual quality, as shown in Figure 8. Blau et al. [22] further demonstrate that GANs provide a principled way to approach the perception-distortion bound.
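For reference, the distortion side of the trade-off is typically quantified with the metrics named above; the snippet below computes the MSE and the derived PSNR for a restored image against its ground truth, assuming intensities in [0, 1].

```python
# MSE and PSNR between a restored image and its ground truth (intensities in [0, 1]).
import torch

def psnr(restored, reference, max_val=1.0):
    mse = torch.mean((restored - reference) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)

reference = torch.rand(1, 3, 64, 64)
restored = (reference + 0.05 * torch.randn_like(reference)).clamp(0, 1)
print(float(psnr(restored, reference)))   # higher PSNR means lower distortion
```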

Fig. 8. Perception-distortion tradeoff [22]. Distortion metrics, including the MSE, PSNR, and SSIM, measure the similarity between the ground truth image and the restored image. Perceptual quality metrics, including NR [127], measure the distribution distance between the recovered image distribution and the target image distribution. Blau et al. [22] show that an image restoration algorithm can be characterized by the distortion and perceptual quality tradeoff curve. The plot is from Blau et al. [22].

Fig. 9. The perception-distortion curve of the ESRGAN [219] on the PIRM self-validation dataset [21]. The curve also compares the ESRGAN with the EnhanceNet [174], the RCAN [245], and the EDSR [111]. The curve is from Wang et al. [219].

Image super-resolution (SR) aims at estimating a high-resolution (HR) image from its low-resolution (LR) counterpart. Deep learning has enabled faster and more accurate super-resolution methods, including SRCNN [42], FSRCNN [43], ESPCN [182], VDSR [88], SRResNet [102], EDSR [111], SRDenseNet [203], MemNet [193], RDN [246], WDSR [235], and many others. However, the above super-resolution approaches focus on improving the distortion metrics and pay little to no attention to the perceptual quality metrics. As a result, they tend to predict over-smoothed outputs and fail to synthesize finer high-frequency details.

Recent image super-resolution algorithms improve the perceptual quality of outputs by leveraging GANs. SRGAN [102] is the first of its kind and can generate photorealistic images with 4× or higher upscaling factors. The quality of the SRGAN [102] outputs is mainly measured by the mean opinion score (MOS) over 26 raters. To enhance the visual quality further, Wang et al. [219] revisit the design of the three key components of the SRGAN: the network architecture, the GAN loss, and the perceptual loss. They propose the Enhanced SRGAN (ESRGAN), which achieves consistently better visual quality, with more realistic and natural textures than the competing methods, as shown in Figure 9 and Figure 10. The ESRGAN is the winner of the 2018 Perceptual Image Restoration and Manipulation challenge (PIRM) [21] (region 3 in Figure 9). Other GAN-based image super-resolution methods and practices can be found in the 2018 PIRM challenge report [21].

Fig. 10. Visual comparison between the ESRGAN [219] and the SRGAN [102] against the ground truth. Images are from Wang et al. [219].

The above image super-resolution algorithms all operate in the supervised setting, where they assume corresponding low-resolution and high-resolution pairs in the training dataset. Typically, they create such a training dataset by downsampling the ground truth high-resolution images. However, the downsampled high-resolution images are very different from the low-resolution images captured by a real sensor, which often contain noise and other distortions. As a result, these super-resolution algorithms are not directly applicable to upsampling low-resolution images captured in the wild. Several methods have addressed this issue by studying image super-resolution in the unsupervised setting, where they only assume a dataset of low-resolution images captured by a sensor and a dataset of high-resolution images. Recently, Maeda [131] proposed a GAN-based image super-resolution algorithm that operates in the unsupervised setting to bridge this gap.

Image denoising aims at removing noise from noisy images. The task is challenging since the noise distribution is usually unknown. This setting is also referred to as blind image denoising. DnCNN [242] is one of the first approaches using feed-forward convolutional neural networks for image denoising. However, DnCNN [242] requires knowing the noise distribution of the noisy image and hence has limited applicability. To tackle blind image denoising, Chen et al. [27] proposed the GAN-CNN-based Blind Denoiser (GCBD), which consists of 1) a GAN trained to estimate the noise distribution of the input noisy images and to generate noise samples, and 2) a deep CNN that learns to denoise on the generated noisy images. The GAN training criterion of GCBD [27] is based on the Wasserstein GAN [6], and the generator network is based on DCGAN [163].
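The supervised setting described in the super-resolution paragraphs above pairs each high-resolution image with a synthetically downsampled copy. The snippet below builds such LR/HR training pairs with bicubic downsampling; the 4× factor is only an example, and this synthetic degradation is exactly what the unsupervised methods mentioned above try to move beyond, since real sensor degradations differ from it.

```python
# Building paired LR/HR training data for supervised super-resolution by
# bicubically downsampling ground-truth high-resolution images (illustrative 4x factor).
import torch
import torch.nn.functional as F

def make_lr_hr_pair(hr, scale=4):
    lr = F.interpolate(hr, scale_factor=1.0 / scale, mode="bicubic", align_corners=False)
    return lr.clamp(0, 1), hr

hr_batch = torch.rand(2, 3, 128, 128)           # stand-in for ground-truth HR crops
lr_batch, hr_batch = make_lr_hr_pair(hr_batch)
print(lr_batch.shape, hr_batch.shape)            # [2, 3, 32, 32] and [2, 3, 128, 128]
```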
Image deblurring sharpens blurry images, which result from motion blur, defocus, and possibly other causes. DeblurGAN [95] trains an image motion-deblurring network using the Wasserstein GAN [6] with the GP-1 loss and the perceptual loss (see Section 3). Shen et al. [181] use a similar approach to deblur face images by using a GAN and a perceptual loss and incrementally training the deblurring network. Visual examples are shown in Figure 11.

Fig. 11. Face deblurring results with GANs [181]: ground truth, blurry inputs, and deblurred results. Images are from Shen et al. [181].

Lossy image compression algorithms (e.g., JPEG, JPEG2000, BPG, and WebP) can efficiently reduce image sizes but introduce visual artifacts in compressed images when the compression ratio is high. Deep neural networks have been widely explored for removing the introduced artifacts [4], [52], [204]. Galteri et al. [52] show that a residual network trained with a GAN loss is able to produce images with more photorealistic details than MSE- or SSIM-based objectives for the removal of image compression artifacts. Tschannen et al. [204] further proposed distribution-preserving lossy compression by using a new combination of the Wasserstein GAN and the Wasserstein autoencoder [202]. More recently, Agustsson et al. [4] built an extreme image compression system by using unconditional and conditional GANs, outperforming all other codecs in the low bit-rate setting. Some compression visual examples [4] are shown in Figure 12.

Fig. 12. Image compression with GANs [4], comparing a GAN-based approach [4] (1567 B, 1×) to off-the-shelf codecs: JP2K (3138 B, 2×), BPG (3573 B, 1.2×), WebP (9437 B, 5×), and JPEG (13959 B, 7.9×). Even with fewer than half the number of bytes, GAN-based compression [4] produces more realistic visual results. Images are from Agustsson et al. [4].

5.2 Image Inpainting

Image inpainting aims at filling in missing pixels of an image such that the result is visually realistic and semantically correct. Image inpainting algorithms can be used to remove distracting objects or retouch undesired regions in photos, and can be further extended to other tasks, including image uncropping, rotation, stitching, re-targeting, re-composition, compression, super-resolution, harmonization, and more.

Traditionally, patch-based approaches, such as PatchMatch [12], copy background patches according to low-level feature matching (e.g., Euclidean distance on pixel RGB values) and paste them into the missing regions. These approaches can synthesize plausible stationary textures but fail at non-stationary image regions such as faces, objects, and complicated scenes. Recently, deep learning and GAN-based approaches [73], [78], [114], [145], [221], [228], [236] have opened a new direction for image inpainting, using deep neural networks learned on large-scale data in an end-to-end fashion. Compared to PatchMatch, these methods are more scalable and can leverage large-scale data.

The context encoder approach (CE) [157] is one of the first to use a GAN generator to predict the missing regions; it is trained with an ℓ2 pixel-wise reconstruction loss and a GAN loss. Iizuka et al. [73] further improve the GAN-based inpainting framework by using both global and local GAN discriminators, with the global one operating on the entire image and the local one operating only on the patch in the hole. We note that post-processing techniques such as image blending are still required in these GAN-based approaches [73], [157] to reduce visual artifacts near hole boundaries.

Yu et al. [236] proposed DeepFill, a GAN framework for end-to-end image inpainting without any post-processing step, which leverages a stacked network, consisting of a coarse network and a refinement network, to ensure color and texture consistency between the in-filled regions and their surroundings. Moreover, as convolutions are local operators and less effective at capturing long-range spatial dependencies, the contextual attention layer [236] is introduced and integrated into DeepFill to explicitly borrow information from distant spatial locations. Visual examples of DeepFill [236] are shown in Figure 13.

Fig. 13. Image inpainting results using DeepFill [236]. Missing regions are shown in white. In each pair, the left is the input image, and the right is the direct output of the trained GAN without any post-processing. Images are from Yu et al. [236].

as there are many plausible solutions for filling a hole in an


image. User-guided inpainting methods [145], [228], [237]
have been proposed to provide an option to take additional
user inputs, for example, sketches, as guidance for image
inpainting networks. An example of user-guided image
inpainting is shown in Figure 15.
Finally, we note that the image out-painting or extrapola-
tion tasks are closely related to image inpainting [89], [195].
They can be also benefited from a GAN formulation.

Fig. 14. Free-form image inpainting results using the Deep-


FillV2 [237]. From left to right, we have the ground truth image, the 6 V IDEO S YNTHESIS
free-form mask, and the DeepFillV2 inpainting result. Original images
are from Yu et al. [237]. Video synthesis focuses on generating video content instead
of static images. Compared with image synthesis, video
synthesis needs to ensure the temporal consistency of the
output videos. This is usually achieved by using a tempo-
ral discriminator [205], flow-warping loss on neighboring
frames [217], smoothing the inputs before processing [26], or
a post-processing step [98]. Each of them might be suitable
for a particular task.
Similar to image synthesis, video synthesis can be clas-
sified into unconditional and conditional video synthesis.
Unconditional video synthesis generates sequences using
random noise inputs [33], [171], [205], [210]. Because such a
Fig. 15. User-guided image inpainting results using the Deep- method needs to model all the spatial and temporal content
FillV2 [237]. From left to right, we have the ground truth image, the in a video, the generated results are often short or with very
mask with user-provided edge guidance, and the DeepFillV2 inpainting constrained motion patterns. For example, MoCoGAN [205]
result. Images are from Yu et al. [237].
decomposes the motion and content parts of the sequence
and uses a fixed latent code for the content and a series of
latent codes to generate the motion. The synthesized videos
is introduced and integrated into the DeepFill to borrow are usually up to a few seconds on simple video content,
information from distant spatial locations explicitly. Visual such as facial motion.
examples of the DeepFill [236] are shown in Figure 13. On the other hand, conditional video synthesis generates
One common issue with the earlier GAN-based inpaint- videos conditioning on input content. A common category
ing approaches [73], [157], [236] is that the training is per- is future frame prediction [39], [47], [83], [103], [110], [125],
formed with randomly sampled rectangular masks. While [136], [191], [208], [212], [213], [229], which attempts to
allowing easy processing during training, these approaches predict the next frame of a sequence based on the past
do not generalize well to free-form masks, irregular masks frames. Another common category of conditional video
with arbitrary shapes. To address the issue, Liu et al. [114] synthesis is conditioning on an input video that shares
proposed the partial convolution layer where the convo- the same high-level representation. Such a setting is often
lution is masked and re-normalized to utilize valid pixels referred to as the video-to-video synthesis [217]. This line of
only. Yu et al. [237] further proposed the gated convolution works has shown promising results on various tasks, such
layer, generalizing the partial convolution by providing a as transforming high-level representations to photorealistic
learnable dynamic feature selection mechanism for each videos [217], animating characters with new expressions, or
channel at each spatial location across all layers. In addi- motions [26], [199], or innovating a new rendering pipeline
tion, as free-form masks may appear anywhere in images for graphics engines [50]. Due to its broader impact, we will
with any shape, global and local GANs [73] designed for mainly focus on conditional video synthesis. Particularly,
a single rectangular mask are not applicable. To address we will focus on its two major domains: face reenactment and
this issue, Yu et al. [237] introduced a patch-based GAN pose transfer.
loss, SNPatchGAN [237], by applying spectral-normalized
discriminator on the dense image patches. Visual examples
of the DeepFillV2 [237] with free-form masks are shown in 6.1 Face Reenactment
Figure 14. Conditional face video synthesis exists in many forms. The
Although capable of handling free-form masks, these most common forms include face swapping and face reenact-
inpainting methods perform poorly in reconstructing fore- ment. Face swapping focuses on pasting the face region from
ground details. This motivated the design of edge-guided one subject to another, while face reenactment concerns
image inpainting methods [145], [228]. These methods de- transferring the subject’s expressions and head poses. Fig-
compose inpainting into two stages. The first stage predicts ure 16 illustrates the difference. Here, we only focus on face
edges or contours of foregrounds, and the second stage reenactment. It has many applications in fields like gaming
takes predicted edges to predict the final output. Moreover, or film industry, where the characters can be animated by
for image inpainting, enabling user interactivity is essential human actors. Based on whether the trained model can only
12

Fig. 16. Face swapping vs. reenactment [149]. Face swapping focuses on pasting the face region from one subject to another, while face
reenactment concerns transferring the expressions and head poses from the target subject to the source image. Images are from Nirkin et al. [149].

work for a specific person or is universal to all persons, face sions and head movements using subject-agnostic frame-
reenactment can be classified as subject-specific or subject- works [7], [62], [149], [184], [216], [225], [238]. These frame-
agnostic as described below. works only need a single 2D image of the target person
and can synthesize talking videos of this person given ar-
Subject-specific. Traditional methods usually build a
bitrary motions. These motions are represented using either
subject-specific model, which can only synthesize one pre-
facial landmarks [7] or keypoints learned without supervi-
determined subject by focusing on transferring the expres-
sion [184]. Since the input is only a 2D image, many methods
sions without transferring the head movement [192], [197],
rely on warping the input or its extracted features and then
[198], [199], [209]. This line of works usually starts by col-
fill in the unoccluded areas to refine the results. For example,
lecting footage of the target person to be synthesized, either
Averbuch et al. [7] first warp the image and directly copy
using an RGBD sensor [198] or an RGB sensor [199]. Then a
the teeth region from the driving image to fill in the holes in
3D model of the target person is built for the face region [20].
case of an open mouth. Siarohin et al. [184] warp extracted
At test time, given the new expressions, they can be used
features from the input image, using motion fields estimated
to drive the 3D model to generate the desired motions,
from sparse keypoints. On the other hand, Zakharov et
as shown in Figure 17. Instead of extracting the driving
al. [238] demonstrate that it is possible to achieve promising
expressions from someone else, they can also be directly
results using direct synthesis methods without any warp-
synthesized from speech inputs [192]. Since 3D models are
ing. To synthesize the target identity, they extract features
involved, this line of works typically does not use GANs.
from the source images and inject the information into the
Some follow-up works take transferring head motions generator through the AdaIN [69] parameters. Similarly, the
into account and can model both expressions and different few-shot vid2vid [216] injects the information into their
head poses at the same time [11], [87], [226]. For example, generator by dynamically determining the SPADE [156]
RecycleGAN [11] extends CycleGAN [253] to incorporate parameters. Since these methods require only an image as
temporal constraints so it can transform videos of a par- input, they become particularly powerful and can be used in
ticular person to another fixed person. On the other hand, even more cases. For instance, several works [7], [216], [238]
ReenactGAN [226] can transfer movements and expressions demonstrate successes in animating paintings or graffiti
from an arbitrary person to a fixed person. Still, the subject- instead of real humans, as shown in Figure 18, which is not
dependent nature of these works greatly limits their usabil- possible with the previous subject-dependent approaches.
ity. One model can only work for one person, and gener- However, while these methods have achieved great results
alizing to another person requires training a new model. in synthesizing people talking under natural motions, they
Moreover, collecting training data for the target person may usually struggle to generate satisfying outputs under ex-
not be feasible at all times, which motivates the emergence treme poses or uncommon expressions, especially when the
of subject-agnostic models. target pose is very different from the original one. Moreover,
Subject-agnostic. Several recent works propose subject- synthesizing complex regions such as hair or background is
agnostic frameworks, which focus on transferring the facial still hard. This is indeed a very challenging task that is still
expressions without head movements [28], [29], [49], [53], open to further research. A summary of different categories
[77], [144], [152], [159], [160], [190], [211], [250]. In particular, of face reenactment methods can be found in Table 2.
many works only focus on the mouth region, since it is the
most expressive part during talking. For example, given an 6.2 Pose Transfer
audio speech and one lip image of the target identity, Chen Pose transfer techniques aim at transferring the body pose
et al. [28] synthesize a video of the desired lip movements. of one person to another person. It can be seen as the
Fried et al. [49] edit the lower face region of an existing whole body counterpart of face reenactment. In contrast to
video, so they can edit the video script and synthesize the talking head generation, which usually shares similar
a new video corresponding to the change. While these motions, body poses have more varieties and are thus much
works have better generalization capability than the previ- harder to synthesize. Early works focus on simple pose
ous subject-specific methods, they usually cannot synthesize transfers that generate low resolution and lower quality
spontaneous head motions. The head movements cannot be images. They only work on single images instead of videos.
transferred from the driving sequence to the target person. Recent works have shown their capability to generate high
Some works can very recently handle both expres- quality and high-resolution videos for challenging poses
13

Fig. 17. Face reenactment using 3D face models [87]. These methods first construct a 3D model for the person to be synthesized, so they can
easily animate the model with new expressions. Images are from Kim et al. [87].

Fig. 18. Few-shot face reenactment methods which require only a 2D image as input [238]. The driving expressions are usually represented by
facial landmarks or keypoints. Images are from Zakharov et al. [238].

TABLE 2 TABLE 3
Categorization of face reenactment methods. Subject-specific models Categories of pose transfer methods. Again, they can be classified
can only work on one subject per model, while subject-agnostic models depending on whether one model can work for only one person or any
can work on general targets. Among each of them, some frameworks persons. Some of the frameworks only focus on generating single
only focus on the inner face region, so they can only transfer images, while others also demonstrate their effectiveness on videos.
expressions, while others can also transfer head movements. Works Works with * do not use GANs in their framework.
with * do not use GANs in their framework.

Target Output
Methods
Target Transferred subject type
Methods
subject region
[200]*, [217], [26], [3], [183]*, [251],
Specific Videos
Face only [209]*, [198]*, [199]*, [192]*, [197]* [116], [115]
Specific
Entire head [87], [11], [226]
[129], [130], [186], [46]*, [10], [161],
[152], [28], [53], [144], [159], [250], Images [239]*, [82], [247], [146], [164], [44],
Face only General [61], [108], [124], [189], [256]
General [29], [190], [77]*, [211], [49], [160]
[7]*, [225]*, [149], [238], [162]*, [232], [185], [223]*, [121], [184]*, [216],
Entire head Videos
[216], [184]*, [65] [166]

but can only work on a particular person per model. Very codes to provide more flexibility and controllability. Later,
recently, several works attempt to perform subject-agnostic Siarohin et al. [186] introduce deformable skip connections
video synthesis. A summary of the categories is shown in to move local features to the target pose position in a U-
Table 3. Below we introduce each category in more detail. Net generator. Similarly, Balakrishnan et al. [10] decompose
Subject-agnostic image generation. Although we focus on different parts of the body into different layer masks and
video synthesis in this section, since most of the existing mo- apply spatial transforms to each of them. The transformed
tion transfer approaches only focus on synthesizing images, segments are then fused together to form the final output.
we still briefly introduce them here ( [10], [44], [46], [61], [82], The above methods work in a supervised setting where
[108], [124], [129], [130], [146], [161], [164], [186], [189], [239], images of different poses of the same person are avail-
[247], [256]). Ma et al. [129] adopt a two-stage coarse-to-fine able during training. To work in the unsupervised setting,
approach using GANs to synthesize a person in a different Pumarola et al. [161] render the synthesized image back to
pose, represented by a set of keypoints. In their follow- the original pose, and apply cycle-consistency constraint on
up work [130], the foreground, background, and poses the back-rendered image. Lorenz et al. [124] decouple the
in the image are further disentangled into different latent shape and appearance from images without supervision by
14

Fig. 20. Subject-specific pose transfer examples for video genera-


tion [217]. For each image triplet, left: the driving sequence, middle:
the intermediate pose representation, right: the synthesized output. By
using a model specifically trained on the target person, it can synthesize
realistic output videos faithfully reflecting the driving motions. Images
are from Wang et al. [217].

Fig. 19. Subject-agnostic pose transfer examples [256]. Using only


a 2D image and the target pose to be synthesized, these methods engines. For example, instead of predicting RGB values
can realistically generate the desired outputs. Images are from Zhu et directly, Shysheya et al. [183] predict DensePose-like part
al. [256].
maps and texture maps from input 3D keypoints, and adopt
a neural renderer to render the outputs. Liu et al. [116] first
construct a 3D character model of the target by capturing
adopting a two-stream auto-encoding architecture, so they multi-view static images and then train a character-to-image
can re-synthesize images in a different shape with the same translation network using a monocular video of the target.
appearance. The authors later combine the constructed 3D model with
Recently, instead of relying on 2D keypoints solely, some the monocular video to estimate dynamic textures, so they
frameworks choose to utilize 3D or 2.5D information. For can use different texture maps when synthesizing different
example, Zanfir et al. [239] incorporate estimating 3D para- motions to increase the realism [115].
metric models into their framework to aid the synthesis
Subject-agnostic video generation. Finally, the most gen-
process. Similarly, Li et al. [108] predict 3D dense flows to
eral framework would be to have one model that can work
warp the source image by estimating 3D models from the
universally regardless of the target identity. Early works
input images. Neverova et al. [146] adopt the DensePose [63]
in this category synthesize videos unconditionally and do
to help warp the input textures according to their UV-
not have full control over the synthesized sequence (e.g.,
coordinates and inpaint the holes to generate the final result.
MoCoGAN [205]). Some other works such as as [232] have
Grigorev et al. [61] also map the input to a texture space and
control over the appearance and the starting pose of the
inpaint the textures before warping them back to the target
person, but the motion generation is still unconditional.
pose. Huang et al. [71] combine the SMPL models [123]
Due to these factors, the synthesized videos are usually
with the implicit field estimation framework [172] to rig
shorter and of lower quality. Very recently, a few works
the reconstructed meshes with desired motions. While these
have shown the ability to render higher quality videos for
methods work reasonably well in transferring poses, as
pose transfer results [121], [166], [184], [185], [216], [223].
shown in Figure 19, directly applying them to videos will
Weng et al. [223] reconstruct the SMPL model [123] from
usually result in unsatisfactory artifacts such as flickering
the input image and animate it with some simple motions
or inconsistent results. Below we introduce methods specif-
like running. Liu et al. [121] propose a unified framework
ically targeting video generation, which work on a one-
for pose transfer, novel view synthesis, and appearance
person-per-model basis.
transfer all at once. Siarohin et al. [184], [185] estimate
Subject-specific video generation. For high-quality video unsupervised keypoints from the input images and predict
synthesis, most methods employ a subject-specific model, a dense motion field to warp the source features to the target
which can only synthesize a particular person. These ap- pose. Want et al. [216] extend vid2vid [217] to the few-shot
proaches start with collecting training data of the target setting by predicting kernels in the SPADE [156] modules.
person to be synthesized (e.g. a few minutes of a subject per- Similarly, Ren et al. [166] also predict kernels in their local
forming various motions) and then train a neural network or attention modules using the input images to adaptively
infer a 3D model from it to synthesize the output. For exam- select features and warp them to the target pose. While
ple, Thies et al. [200] extend their previous face reenactment these approaches have achieved better results than previous
work [199] to include shoulders and part of the upper body works (Figure 21), their qualities are still not comparable to
to increase realism and fidelity. To extend to whole-body state-of-the-art subject-specific models. Moreover, most of
motion transfer, Wang et al. [217] extend their image synthe- them still synthesize lower resolution outputs (256 or 512).
sis framework [218] to videos and successfully demonstrate How to further increase the quality and resolution to the
the transfer results on several dancing sequences, opening photorealistic level is still an open question.
the era for a new application (Figure 20). Chan et al. [26] also
adopt a similar approach to generate many dancing exam-
ples, but using a simple temporal smoothing on the inputs 7 N EURAL R ENDERING
instead of explicitly modeling temporal consistency by the Neural rendering is a recent and upcoming topic in the
network. Following these works, many subsequent works area of neural networks, which combines classical rendering
improve upon them [3], [115], [116], [183], [251], usually by and generative models. Classical rendering can produce
combining the neural network with 3D models or graphics photorealistic images given the complete specification of the
15

into the framework of image-to-image translation, possibly


unimodal, multimodal, or conditional, depending on the
exact use-case. Using given camera parameters, the source
3D world is first projected to a 2D feature map containing
per-pixel information such as color, depth, surface normals,
segmentation, etc. This feature map is then fed as input to a
generator, which tries to produce desired outputs, usually
a realistic-looking RGB image. The deep neural network
application happens in the 2D space after the 3D world is
projected to the camera view, and no features or gradients
are backpropagated to the 3D source world or through the
camera projection. A key advantage of this approach is that
the traditional graphics rendering pipeline can be easily
augmented to immediately take advantage of proven and
mature techniques from 2D image-to-image translation (as
discussed in Section 4), without the need for designing and
implementing differentiable projection layers or transforma-
tions that are part of the deep network during training. This
type of framework is illustrated in Figure 22 (a).
Martin-Brualla et al. [135] introduce the notion of re-
Fig. 21. Subject-agnostic pose transfer videos [216]. Given an example rendering, where a deep neural network takes as input
image and a driving pose sequence, the methods can output a se- a rendered 2D image and enhances it (improving colors,
quence of the person performing the motions. Images are from Wang
et al. [216]. boundaries, resolution, etc.) to produce a re-rendered image.
The full pipeline consists of two steps—a traditional 3D
to 2D rendering step and a trainable deep network that
world. This includes all the objects in it, their geometry, enhances the rendered 2D image. The 3D to 2D rendering
material properties, the lighting, the cameras, etc. Creat- technique can be differentiable or non-differentiable, but
ing such a world from scratch is a laborious process that no gradients are backpropagated through this step. This
often requires expert manual input. Moreover, faithfully allows one to use more complex rendering techniques. By
reproducing such data directly from images of the world using this two-step process, the output of a performance
can often be hard or impossible. On the other hand, as capture system, which might suffer from noise, poor color
described in the previous sections, GANs have had great reproduction, and other issues, can be improved. In this
success in producing photorealistic images given minimal particular work, they did not see an improvement from
semantic inputs. The ability to synthesize and learn material using a GAN loss, perhaps because they trained their system
properties, textures, and other intangibles from training data on the limited domain of people and faces, using carefully
can help overcome the drawbacks of classical rendering. captured footage.
Neural rendering aims to combine the strengths of the Meshry et al. [139] and Li et al. [109] extend this approach
two areas to create a more powerful and flexible framework. to the more challenging domain of unstructured photo col-
Neural networks can either be applied as a postprocessing lections. They produce multiple plausible views of famous
step after classical rendering or as part of the rendering landmarks from noisy point clouds generated from internet
pipeline with the design of 3D-aware and differentiable photo collections by utilizing Structure from Motion (SfM).
layers. The following sections discuss such approaches and Meshry et al. [139] generate a 2D feature map containing per-
how they use GAN losses to improve the quality of out- pixel albedo and depth by splatting points of the 3D point
puts. In this paper, we focus on works that use GANs to cloud onto a given viewpoint. The segmentation map of the
train neural networks and augment the classical rendering expected output image is also concatenated to this feature
pipeline to generate images. For a general survey on the use representation. The problem is then framed as a multimodal
of neural networks in rendering, please refer to the survey image translation problem. A noisy and incomplete input
paper on neural rendering by Tewari et al. [196]. has to be translated to a realistic image conditioned on a
We divide the works on GAN-based neural rendering style code to produce desired environmental effects such as
into two parts: 1) works that treat 3D to 2D projection lighting. Li et al. [109] use a similar approach, but with multi-
as a preprocessing step and apply neural networks purely plane images and achieve better photo-realism. Pittaluga et
in the 2D domain, and 2) works that incorporate layers al. [158] tackle the task of producing 2D color images of the
that perform differentiable operations to transform features underlying scene given as input a sparse SfM point cloud
from 3D to 2D or vice versa (3D ↔ 2D) and learn some with associated point attributes such as color, depth, and
implicit form of geometry to provide 3D understanding to SIFT descriptors. The input to their network is a 2D feature
the network. map obtained by projecting the 3D points to the image plane
given the camera parameters. The attributes of the 3D point
are copied to the 2D pixel location to which it is projected.
7.1 3D to 2D projection as a preprocessing step Mallya et al. [132] precompute the mapping of the 3D world
A number of works [109], [132], [135], [139], [158] improve point cloud to the pixel locations in the images produced by
upon traditional techniques by casting the task of rendering cameras with known parameters and use this to obtain an
16

3D to 2D
projection

2D color, depth, Generator 2D image


3D input
segmentation, etc.
Trainable
deep network

(a) 3D to 2D projection as a preprocessing step

differentiable differentiable
2D to 3D 3D to 2D
feature lifting projection

3D features Differentiable 2D features Generator 2D image


Image(s) or
3D transform
noise vector
Trainable
deep network

(b) 3D ↔ 2D transform as a part of network training

Fig. 22. The two common frameworks for neural rendering. (a) In the first set of works [109], [132], [135], [139], [158], a neural network that purely
operates in the 2D domain is trained to enhance an input image, possibly supplemented with other information such as depth, or segmentation
maps. (b) The second set of works [147], [148], [178], [187], [224] introduces native 3D operations that produce and transform 3D features. This
allows the network to reason in 3D and produce view-consistent outputs.

estimate of the next frame, referred to as a ‘guidance image’.


They learn to output video frames consistent over time and
viewpoints by conditioning the generator on these noisy
estimates.
In these works, the use of a generator coupled with
an adversarial loss helps produce better-looking outputs
conditioned on the input feature maps. Similar to appli-
cations of pix2pixHD [218], such as manipulating output
images by editing input segmentation maps, Meshry et
al. [139] are able to remove people and transient objects from
images of landmarks and generate plausible inpainting. A
key motivation of the work of Pittaluga et al. [158] was to
explore if a user’s privacy can be protected by techniques
such as discarding the color of the 3D points. A very in-
teresting observation was that discarding color information
helps prevent accurate reproduction. However, the use of
a GAN loss recovers plausible colors and greatly improves
the output results, as shown in Figure. 23. GAN losses might
also be helpful in cases where it is hard to manually define Fig. 23. Inverting images from 3D point clouds and their associated
a good loss function, either due to the inherent ambiguity depth and SIFT attributes [158]. The top row of images are produced
in determining the desired behavior or the difficulty in fully by a generator trained without an adversarial loss, while the bottom row
uses adversarial loss. Using an adversarial loss helps generates better
labeling the data. details and more plausible colors. Images are from Pittaluga et al. [158].

7.2 3D ↔ 2D transform as a part of network training


In the previous set of works, the geometry of the world network have several advantages: the ability to reason in
or object is explicitly provided, and neural rendering is 3D, control the pose, and produce a series of consistent
purely used to enhance the appearance or add details to the views of a scene. Contrast this to the neural network shown
traditionally rendered image or feature maps. The works in in Figure 22 (a), which purely operates in the 2D domain.
this section [147], [148], [178], [187], [224] introduce native DeepVoxels [187] learns a persistent 3D voxel feature
3D operations in the neural network used to learn from and representation of a scene given a set of multi-view images
produce images. These operations enable them to model the and their associated camera intrinsic and extrinsic parame-
geometry and appearance of the scene in the feature space. ters. Features are first extracted from the 2D views and then
The general pipeline of this line of works is illustrated in lifted to a 3D volume. This 3D volume is then integrated into
Figure 22 (b). Learning a 3D representation and modeling the persistent DeepVoxels representation. These 3D features
the process of image projection and formation into the are then projected to 2D using a projection layer, and a new
17

view of the object is synthesized using a U-Net generator. TABLE 4


This generator network is trained with an `1 loss and a GAN Key differences amongst 3D-aware methods. Adversarial losses are
used by a range of methods that differ in the type of 3D feature
loss. The authors found that using a GAN loss accelerates representation and training supervision.
the generation of high-frequency details, especially at earlier
stages of training. Similar to DeepVoxels [187], Visual Object
Networks (VONs) [255] generate a voxel grid from a sample 3D feature
Supervision Methods
representation
noise vector and use a differentiable projection layer to map
the voxel grid to a 2.5D sketch. Inspired by classical graphics Radiance field GRAF [178]
rendering pipelines, this work decomposes image formation None HoloGAN [147]
into three conditionally independent factors of shape, view- BlockGAN [148]
point, and texture. Trained with a GAN loss, their model Voxel 3D supervision VONs [255]
synthesizes more photorealistic images, and the use of the Input-Output DeepVoxels [187]
disentangled representation allows for 3D manipulations, Point cloud
pose
SynSin [224]
which are not feasible with purely 2D methods. transformation
HoloGAN [147] proposes a system to learn 3D voxel fea-
ture representations of the world and to render it to realistic-
looking images. Unlike VONs [255], HoloGAN does not As summarized in Table 4, the works discussed in this
require explicit 3D data or supervision and can do so using section use a variety of 3D feature representations, and
unlabeled images (no pose, explicit 3D shape, or multiple train their networks using paired input-output with known
views). By incorporating a 3D rigid-body transformation transformations or unlabeled and unpaired data. The use
module and a 3D-to-2D projection module in the network, of a GAN loss is common to all these approaches. This is
HoloGAN provides the ability to control the pose of the perhaps because traditional hand-designed losses such as
generated objects. HoloGAN employs a multi-scale feature the `1 loss or even perceptual loss are unable fully to capture
GAN discriminator, and the authors empirically observed what makes a synthesized image look unrealistic. Further,
that this helps prevent mode collapse. BlockGAN [148] ex- in the case where explicit task supervision is unavailable,
tends the unsupervised approach of the HoloGAN [147] to BlockGAN [148] shows that a GAN loss can help in learning
also consider object disentanglement. BlockGAN learns 3D disentangled features by ensuring that the outputs after
features per object and the background. These are combined projection and rendering look realistic. The learnability and
into 3D scene features after applying appropriate trans- flexibility of the GAN loss to the task at hand helps provide
formations before projecting them into the 2D space. One feedback, guiding how to change the generated image, and
issue with learning scene compositionality without explicit thus the upstream features, so that it looks as if it were
supervision is the conflation of features of the foreground sampled from the distribution of real images. This makes
object and the background, which results in visual artifacts the GAN framework a powerful asset in the toolbox of any
when objects or the camera moves. By adding more power- neural rendering practitioner.
ful ‘style’ discriminators (feature discriminators introduced
in [147]) to their training scheme, the authors observed that 8 L IMITATIONS AND O PEN P ROBLEMS
the disentangling of features improved, resulting in cleaner
Despite the successful applications introduced above, there
outputs.
are still limitations of GANs needed to be addressed by
SynSin [224] learns an end-to-end model for view syn-
future work.
thesis from a single image, without any ground-truth 3D
supervision. Unlike the above works which internally use a Evaluation metrics. Evaluate and comparing different GAN
feature voxel representation, SynSin predicts a point cloud models is difficult. The most popular evaluation metrics are
of features from the input image and then projects it to perhaps Inception Score (IS) [176] and Fréchet Inception
new views using a differentiable point cloud renderer. 2D Distance (FID) [67], which both have many shortcomings.
image features and a depth map are first predicted from the The Inception Score, for example, is not able to detect
input image. Based on the depth map, the 2D features are intra-class mode collapse [23]. In other words, a model
projected to 3D to obtain the 3D feature point cloud. The that generates only a single image per class can obtain
network is trained adversarially with a discriminator based a high IS. FID can better measure such diversity, but it
on the one proposed by Wang et al. [218]. does not have an unbiased estimator [19]. Kernel Inception
One of the drawbacks of voxel-based feature repre- Distance (KID) [19] can capture higher-order statistics and
sentations is the cubic growth in the memory required has an unbiased estimator but has been empirically found
to store them. To keep requirements manageable, voxel- to suffer from high variance [165]. In addition to the above
based approaches are typically restricted to low resolutions. measures that summarize the performance with a single
GRAF [178] proposes to use conditional radiance fields, number, there are metrics that separately evaluate fidelity
which are a continuous mapping from a 3D location and and diversity of the generator distribution [97], [143], [173].
a 2D viewing direction to an RGB color value, as the Instability. Although the regularization techniques intro-
intermediate feature representation. They also use a single duced in section 3.3 have greatly improved the stability of
discriminator similar to PatchGAN [75], with weights that GAN training, GANs are still much more unstable to train
are shared across patches with different receptive fields. This than supervised discriminative models or likelihood-based
allows them to capture the global context as well as refine generative models. For example, even the state-of-the-art
local details. BigGAN model would eventually collapse in the late stage
18

of training on ImageNet [24]. Also, the final performance is [13] D. Bau, J.-Y. Zhu, H. Strobelt, B. Zhou, J. B. Tenenbaum, W. T.
generally very sensitive to hyper-parameters [96], [126]. Freeman, and A. Torralba. GAN dissection: Visualizing and
understanding generative adversarial networks. In ICLR, 2019.
Interpretability. Despite the impressive quality of the gen- [14] D. Bau, J.-Y. Zhu, J. Wulff, W. Peebles, H. Strobelt, B. Zhou, and
erate images, there has been a lack of understanding of A. Torralba. Seeing what a GAN cannot generate. In ICCV, 2019.
[15] S. Benaim, R. Mokady, A. Bermano, D. Cohen-Or, and L. Wolf.
how GANs represent the image structure internally in the Structural-analogy from a single image pair. arXiv preprint
generator. Bau et al. visualize the causal effect of different arXiv:2004.02222, 2020.
neurons on the output image [13]. After finding the semantic [16] S. Benaim and L. Wolf. One-sided unsupervised domain map-
ping. In NeurIPS, 2017.
meaning of individual neurons or directions in the latent [17] S. Benaim and L. Wolf. One-shot unsupervised cross domain
space [55], [76], [180], one can edit a real image by inverting translation. In NeurIPS, 2018.
[18] Y. Bengio, A. Courville, and P. Vincent. Representation learning:
it to the latent space, edit the latent code according to
A review and new perspectives. TPAMI, 2013.
the desired semantic change, and regenerate it with the [19] M. Bińkowski, D. J. Sutherland, M. Arbel, and A. Gretton. De-
generator. Finding the best way to encode an image to mystifying mmd gans. In ICLR, 2018.
[20] V. Blanz and T. Vetter. A morphable model for the synthesis of
the latent space is, therefore, another interesting research 3D faces. In Proceedings of the 26th annual conference on Computer
direction [1], [2], [14], [72], [86], [252]. graphics and interactive techniques, 1999.
[21] Y. Blau, R. Mechrez, R. Timofte, T. Michaeli, and L. Zelnik-Manor.
Forensics. The success of GANs has enabled many new The 2018 PIRM challenge on perceptual image super-resolution.
applications but also raised ethical and social concerns In ECCV Workshop, 2018.
such as fraud and fake news. The ability to detect GAN- [22] Y. Blau and T. Michaeli. The perception-distortion tradeoff. In
CVPR, 2018.
generated images is essential to prevent malicious usage [23] A. Borji. Pros and cons of gan evaluation measures. Computer
of GANs. Recent studies have found it possible to train Vision and Image Understanding, 2019.
a classifier to detect generated images and generalize to [24] A. Brock, J. Donahue, and K. Simonyan. Large scale GAN
training for high fidelity natural image synthesis. In ICLR, 2019.
unseen generator architectures [25], [215], [244]. This cat- [25] L. Chai, D. Bau, S.-N. Lim, and P. Isola. What makes fake images
and-mouse game may continue, as generated images may detectable? understanding properties that generalize. In ECCV,
become increasingly harder to detect in the future. 2020.
[26] C. Chan, S. Ginosar, T. Zhou, and A. A. Efros. Everybody dance
now. In ICCV, 2019.
[27] J. Chen, J. Chen, H. Chao, and M. Yang. Image blind denoising
9 C ONCLUSION with generative adversarial network based noise modeling. In
CVPR, 2018.
In this paper, we present a comprehensive overview of [28] L. Chen, Z. Li, R. K Maddox, Z. Duan, and C. Xu. Lip movements
GANs with an emphasis on algorithms and applications generation at a glance. In ECCV, 2018.
[29] L. Chen, R. K. Maddox, Z. Duan, and C. Xu. Hierarchical cross-
to visual synthesis. We summarize the evolution of the modal talking face generation with dynamic pixel-wise loss. In
network architectures in GANs and the strategies to sta- CVPR, 2019.
[30] X. Chen, N. Mishra, M. Rohaninejad, and P. Abbeel. PixelSNAIL:
bilize GAN training. We then introduce several fascinating An improved autoregressive generative model. In ICML, 2018.
applications of GANs, including image translation, image [31] X. Chen, C. Xu, X. Yang, and D. Tao. Attention-GAN for object
processing, video synthesis, and neural rendering. In the transfiguration in wild images. In ECCV, 2018.
[32] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo. Star-
end, we point out some open problems for GANs, and we GAN: Unified generative adversarial networks for multi-domain
hope this paper would inspire future research to solve them. image-to-image translation. In CVPR, 2018.
[33] A. Clark, J. Donahue, and K. Simonyan. Efficient video genera-
tion on complex datasets. arXiv preprint arXiv:1907.06571, 2019.
[34] T. Cohen and L. Wolf. Bidirectional one-shot unsupervised
R EFERENCES domain mapping. In ICCV, 2019.
[35] A. Creswell, T. White, V. Dumoulin, K. Arulkumaran, B. Sen-
[1] R. Abdal, Y. Qin, and P. Wonka. Image2StyleGAN: How to embed gupta, and A. A. Bharath. Generative adversarial networks: An
images into the StyleGAN latent space? In ICCV, 2019. overview. IEEE Signal Processing Magazine, 2018.
[2] R. Abdal, Y. Qin, and P. Wonka. Image2StyleGAN++: How to [36] N. Dalal and B. Triggs. Histograms of oriented gradients for
edit the embedded images? In CVPR, 2020. human detection. In CVPR, 2005.
[3] K. Aberman, M. Shi, J. Liao, D. Lischinski, B. Chen, and D. Cohen- [37] P. Dayan, G. E. Hinton, R. M. Neal, and R. S. Zemel. The
Or. Deep video-based performance cloning. Computer Graphics helmholtz machine. Neural computation, 1995.
Forum, 2019. [38] E. de Bézenac, I. Ayed, and P. Gallinari. Optimal unsupervised
[4] E. Agustsson, M. Tschannen, F. Mentzer, R. Timofte, and L. V. domain translation. arXiv preprint arXiv:1906.01292, 2019.
Gool. Generative adversarial networks for extreme learned image [39] E. L. Denton and V. Birodkar. Unsupervised learning of disen-
compression. In ICCV, 2019. tangled representations from video. In NeurIPS, 2017.
[5] A. Almahairi, S. Rajeshwar, A. Sordoni, P. Bachman, and [40] L. Dinh, D. Krueger, and Y. Bengio. NICE: Non-linear indepen-
A. Courville. Augmented CycleGAN: Learning many-to-many dent components estimation. In ICLR, 2015.
mappings from unpaired data. In ICML, 2018. [41] L. Dinh, J. Sohl-Dickstein, and S. Bengio. Density estimation
[6] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN. In using real NVP. In ICLR, 2017.
ICML, 2017. [42] C. Dong, C. C. Loy, K. He, and X. Tang. Image super-resolution
[7] H. Averbuch-Elor, D. Cohen-Or, J. Kopf, and M. F. Cohen. Bring- using deep convolutional networks. TPAMI, 2015.
ing portraits to life. TOG, 2017. [43] C. Dong, C. C. Loy, and X. Tang. Accelerating the super-
[8] J. L. Ba, J. R. Kiros, and G. E. Hinton. Layer normalization. arXiv resolution convolutional neural network. In ECCV, 2016.
preprint arXiv:1607.06450, 2016. [44] H. Dong, X. Liang, K. Gong, H. Lai, J. Zhu, and J. Yin. Soft-
[9] K. Baek, Y. Choi, Y. Uh, J. Yoo, and H. Shim. Rethinking the gated warping-GAN for pose-guided person image synthesis. In
truly unsupervised image-to-image translation. arXiv preprint NeurIPS, 2018.
arXiv:2006.06500, 2020. [45] Y. Du and I. Mordatch. Implicit generation and generalization in
[10] G. Balakrishnan, A. Zhao, A. V. Dalca, F. Durand, and J. Guttag. energy-based models. In NeurIPS, 2019.
Synthesizing images of humans in unseen poses. In CVPR, 2018. [46] P. Esser, E. Sutter, and B. Ommer. A variational u-net for
[11] A. Bansal, S. Ma, D. Ramanan, and Y. Sheikh. Recycle-GAN: conditional appearance and shape generation. In CVPR, 2018.
Unsupervised video retargeting. In ECCV, 2018. [47] C. Finn, I. Goodfellow, and S. Levine. Unsupervised learning for
[12] C. Barnes, E. Shechtman, A. Finkelstein, and D. B. Goldman. physical interaction through video prediction. In NeurIPS, 2016.
PatchMatch: A randomized correspondence algorithm for struc- [48] A. Fischer and C. Igel. An introduction to restricted boltzmann
tural image editing. TOG, 2009.
19

machines. In Iberoamerican congress on pattern recognition, 2012. [81] A. Jolicoeur-Martineau. On relativistic f-divergences. In ICML,
[49] O. Fried, A. Tewari, M. Zollhöfer, A. Finkelstein, E. Shechtman, 2020.
D. B. Goldman, K. Genova, Z. Jin, C. Theobalt, and M. Agrawala. [82] D. Joo, D. Kim, and J. Kim. Generating a fusion image: One’s
Text-based editing of talking-head video. TOG, 2019. identity and another’s shape. In CVPR, 2018.
[50] O. Gafni, L. Wolf, and Y. Taigman. Vid2Game: Controllable [83] N. Kalchbrenner, A. v. d. Oord, K. Simonyan, I. Danihelka,
characters extracted from real-world videos. In ICLR, 2020. O. Vinyals, A. Graves, and K. Kavukcuoglu. Video pixel net-
[51] T. Galanti, L. Wolf, and S. Benaim. The role of minimal complex- works. ICML, 2017.
ity functions in unsupervised learning of semantic mappings. In [84] T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive growing
ICLR, 2018. of GANs for improved quality, stability, and variation. In ICLR,
[52] L. Galteri, L. Seidenari, M. Bertini, and A. Del Bimbo. Deep 2018.
generative adversarial compression artifact removal. In ICCV, [85] T. Karras, S. Laine, and T. Aila. A style-based generator architec-
2017. ture for generative adversarial networks. In CVPR, 2019.
[53] J. Geng, T. Shao, Y. Zheng, Y. Weng, and K. Zhou. Warp-guided [86] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila.
GANs for single-photo facial animation. TOG, 2018. Analyzing and improving the image quality of StyleGAN. In
[54] A. Ghosh, V. Kulharia, V. P. Namboodiri, P. H. Torr, and P. K. CVPR, 2020.
Dokania. Multi-agent diverse generative adversarial networks. [87] H. Kim, P. Garrido, A. Tewari, W. Xu, J. Thies, M. Niessner,
In CVPR, 2018. P. Pérez, C. Richardt, M. Zollhöfer, and C. Theobalt. Deep video
[55] L. Goetschalckx, A. Andonian, A. Oliva, and P. Isola. Ganalyze: portraits. TOG, 2018.
Toward visual definitions of cognitive image properties. In ICCV, [88] J. Kim, J. Kwon Lee, and K. Mu Lee. Accurate image super-
2019. resolution using very deep convolutional networks. In CVPR,
[56] X. Gong, S. Chang, Y. Jiang, and Z. Wang. Autogan: Neural 2016.
architecture search for generative adversarial networks. In ICCV, [89] K. Kim, Y. Yun, K.-W. Kang, K. Gong, S. Lee, and S.-J. Kang.
2019. Painting outside as inside: Edge guided image outpainting via
[57] A. Gonzalez-Garcia, J. Van De Weijer, and Y. Bengio. Image-to- bidirectional rearrangement with step-by-step learning. arXiv
image translation for cross-domain disentanglement. In NeurIPS, preprint arXiv:2010.01810, 2020.
2018. [90] T. Kim, M. Cha, H. Kim, J. K. Lee, and J. Kim. Learning
[58] I. Goodfellow. NIPS 2016 tutorial: Generative adversarial net- to discover cross-domain relations with generative adversarial
works. arXiv preprint arXiv:1701.00160, 2016. networks. In ICML, 2017.
[59] I. Goodfellow, Y. Bengio, and A. Courville. Deep learning. MIT [91] D. Kingma and J. Ba. Adam: A method for stochastic optimiza-
press, 2016. tion. In ICLR, 2015.
[60] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde- [92] D. P. Kingma and P. Dhariwal. Glow: Generative flow with
Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adver- invertible 1x1 convolutions. In NeurIPS, 2018.
sarial networks. In NeurIPS, 2014. [93] D. P. Kingma and M. Welling. Auto-encoding variational Bayes.
[61] A. Grigorev, A. Sevastopolsky, A. Vakhitov, and V. Lempitsky. In ICLR, 2013.
Coordinate-based texture inpainting for pose-guided human im- [94] D. P. Kingma and M. Welling. An introduction to variational
age generation. In CVPR, 2019. autoencoders. arXiv preprint arXiv:1906.02691, 2019.
[62] K. Gu, Y. Zhou, and T. S. Huang. FLNet: Landmark driven fetch- [95] O. Kupyn, V. Budzan, M. Mykhailych, D. Mishkin, and J. Matas.
ing and learning network for faithful talking facial animation DeblurGAN: Blind motion deblurring using conditional adver-
synthesis. In AAAI, 2020. sarial networks. In CVPR, 2018.
[63] R. A. Güler, N. Neverova, and I. Kokkinos. DensePose: Dense [96] K. Kurach, M. Lučić, X. Zhai, M. Michalski, and S. Gelly. A
human pose estimation in the wild. In CVPR, 2018. large-scale study on regularization and normalization in gans.
[64] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. In International Conference on Machine Learning, 2019.
Courville. Improved training of wasserstein GANs. In NeurIPS, [97] T. Kynkäänniemi, T. Karras, S. Laine, J. Lehtinen, and T. Aila.
2017. Improved precision and recall metric for assessing generative
[65] S. Ha, M. Kersner, B. Kim, S. Seo, and D. Kim. MarioNETte: Few- models. In NeurIPS, 2019.
shot face reenactment preserving identity of unseen targets. In [98] W.-S. Lai, J.-B. Huang, O. Wang, E. Shechtman, E. Yumer, and M.-
AAAI, 2020. H. Yang. Learning blind video temporal consistency. In ECCV,
[66] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for 2018.
image recognition. In CVPR, 2016. [99] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther.
[67] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and Autoencoding beyond pixels using a learned similarity metric.
S. Hochreiter. GANs trained by a two time-scale update rule In ICML, 2016.
converge to a local Nash equilibrium. In NeurIPS, 2017. [100] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 2015.
[68] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimension- [101] Y. LeCun, S. Chopra, R. Hadsell, M. Ranzato, and F. Huang. A
ality of data with neural networks. Science, 2006. tutorial on energy-based learning. Predicting structured data, 2006.
[69] X. Huang and S. Belongie. Arbitrary style transfer in real-time [102] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham,
with adaptive instance normalization. In ICCV, 2017. A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. Photo-
[70] X. Huang, M.-Y. Liu, S. Belongie, and J. Kautz. Multimodal realistic single image super-resolution using a generative adver-
unsupervised image-to-image translation. ECCV, 2018. sarial network. In CVPR, 2017.
[71] Z. Huang, Y. Xu, C. Lassner, H. Li, and T. Tung. ARCH: [103] A. X. Lee, R. Zhang, F. Ebert, P. Abbeel, C. Finn, and
Animatable reconstruction of clothed humans. In CVPR, 2020. S. Levine. Stochastic adversarial video prediction. arXiv preprint
[72] M. Huh, R. Zhang, J.-Y. Zhu, S. Paris, and A. Hertzmann. Trans- arXiv:1804.01523, 2018.
forming and projecting images into class-conditional generative [104] C.-H. Lee, Z. Liu, L. Wu, and P. Luo. MaskGAN: Towards diverse
networks. In ECCV, 2020. and interactive facial image manipulation. In CVPR, 2020.
[73] S. Iizuka, E. Simo-Serra, and H. Ishikawa. Globally and locally [105] H.-Y. Lee, H.-Y. Tseng, J.-B. Huang, M. Singh, and M.-H. Yang.
consistent image completion. TOG, 2017. Diverse image-to-image translation via disentangled representa-
[74] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep tions. In ECCV, 2018.
network training by reducing internal covariate shift. In ICML, [106] S. Lee, J. Ha, and G. Kim. Harmonizing maximum likelihood
2015. with GANs for multimodal conditional generation. In ICLR, 2019.
[75] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image [107] K. Li, T. Zhang, and J. Malik. Diverse image synthesis from
translation with conditional adversarial networks. In CVPR, semantic layouts via conditional imle. In ICCV, 2019.
2017. [108] Y. Li, C. Huang, and C. C. Loy. Dense intrinsic appearance flow
[76] A. Jahanian, L. Chai, and P. Isola. On the ”steerability” of for human pose transfer. In CVPR, 2019.
generative adversarial networks. In ICLR, 2020. [109] Z. Li, W. Xian, A. Davis, and N. Snavely. Crowdsampling the
[77] A. Jamaludin, J. S. Chung, and A. Zisserman. You said that?: plenoptic function. In ECCV, 2020.
Synthesising talking faces from audio. IJCV, 2019. [110] X. Liang, L. Lee, W. Dai, and E. P. Xing. Dual motion GAN for
[78] Y. Jo and J. Park. SC-FEGAN: Face editing generative adversarial future-flow embedded video prediction. In NeurIPS, 2017.
network with user’s sketch and color. In ICCV, 2019. [111] B. Lim, S. Son, H. Kim, S. Nah, and K. Mu Lee. Enhanced deep
[79] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time residual networks for single image super-resolution. In CVPR
style transfer and super-resolution. In ECCV, 2016. Workshop, 2017.
Ming-Yu Liu is a Distinguished Research Scientist and Manager at NVIDIA Research. Before joining NVIDIA in 2016, he was a Principal Research Scientist at Mitsubishi Electric Research Labs (MERL). He received his Ph.D. from the Department of Electrical and Computer Engineering at the University of Maryland, College Park, in 2012. He has won several prestigious awards in his field, including the R&D 100 Award from R&D Magazine in 2014 for his robotic bin-picking system. At SIGGRAPH 2019, he won the Best in Show Award and the Audience Choice Award in the Real-Time Live track for his GauGAN work, which also won the Best of What's New Award from Popular Science in 2019. His research interest is generative image modeling, with the goal of endowing machines with human-like imagination.

Xun Huang is a Research Scientist at NVIDIA Research. He obtained his Ph.D. from Cornell University under the supervision of Professor Serge Belongie. His research interests include developing new architectures and training algorithms for generative adversarial networks, as well as applications such as image editing and synthesis. He is a recipient of the NVIDIA Graduate Fellowship, the Adobe Research Fellowship, and the Snap Research Fellowship.

Jiahui Yu is a Research Scientist at Google Brain. He received his Ph.D. from the University of Illinois at Urbana-Champaign in 2020 and his Bachelor's degree with distinction from the School of the Gifted Young in Computer Science, University of Science and Technology of China, in 2016. His research interests are in sequence modeling (language, speech, video, financial data), machine perception (vision), generative models (GANs), and high-performance computing. He is a member of IEEE, ACM, and AAAI, and a recipient of the Baidu Scholarship, the Thomas and Margaret Huang Research Award, and the Microsoft-IEEE Young Fellowship.

Ting-Chun Wang is a Senior Research Scientist at NVIDIA Research. He obtained his Ph.D. in EECS from UC Berkeley, advised by Professors Ravi Ramamoorthi and Alexei A. Efros. He won first place in the Domain Adaptation for Semantic Segmentation Competition at CVPR 2018. His semantic image synthesis paper was a best paper finalist at CVPR 2019, and the corresponding GauGAN app won the Best in Show Award and the Audience Choice Award at SIGGRAPH Real-Time Live 2019. He served as an area chair for WACV 2020. His research interests include computer vision, machine learning, and computer graphics, particularly the intersections of all three. His recent research focuses on using generative adversarial models to synthesize realistic images and videos, with applications to rendering, visual manipulation, and beyond.

Arun Mallya is a Senior Research Scientist at NVIDIA Research. He obtained his Ph.D. from the University of Illinois at Urbana-Champaign in 2018, with a focus on performing multiple tasks efficiently with a single deep network. He holds a B.Tech. in Computer Science and Engineering from the Indian Institute of Technology Kharagpur (2012) and an M.S. in Computer Science from the University of Illinois at Urbana-Champaign (2014). He was selected as a Siebel Scholar in 2014. He is interested in generative modeling and enabling new applications of deep neural networks.