Generative Adversarial Networks For Image and Video Synthesis: Algorithms and Applications
Abstract—The generative adversarial network (GAN) framework has emerged as a powerful tool for various image and video
synthesis tasks, allowing the synthesis of visual content in an unconditional or input-conditional manner. It has enabled the generation
of high-resolution photorealistic images and videos, a task that was challenging or impossible with prior methods. It has also
enabled many new applications in content creation. In this paper, we provide an overview of GANs with a special focus on algorithms
and applications for visual synthesis. We cover several important techniques to stabilize GAN training, which has a reputation for being
notoriously difficult. We also discuss its applications to image translation, image processing, video synthesis, and neural rendering.
Index Terms—Generative Adversarial Networks, Computer Vision, Image Processing, Image and Video Synthesis, Neural Rendering
1 INTRODUCTION

The generative adversarial network (GAN) framework consists of two networks—a generator network G and a discriminator network D—which are trained jointly by playing a zero-sum game, where the objective of the generator is to synthesize fake data that resembles real data, and the objective of the discriminator is to distinguish between real and fake data. When the training is successful, the generator is an approximator of the underlying data generation mechanism, in the sense that the distribution of the fake data converges to the real one. Due to this distribution matching capability, GANs have become a popular tool for various data synthesis and manipulation problems, especially in the visual domain.

The rise of GANs also marks another major success of deep learning in replacing hand-designed components with machine-learned components in modern computer vision pipelines. While deep learning has led the community to abandon hand-designed features, such as the histogram of oriented gradients (HOG) [36], in favor of deep features computed by deep neural networks, the objective function used to train the networks remains largely hand-designed. This is not a major issue for a classification task, since effective and descriptive objective functions such as the cross-entropy loss exist, but it is a serious hurdle for a generation task. After all, how can one hand-design a function to guide a generator to produce a better cat image? How can we even mathematically describe "felineness" in an image?

GANs address the issue by deriving a functional form of the objective from training data. As the discriminator is trained to tell whether an input image is a cat image from the training dataset or one synthesized by the generator, it defines an objective function that can guide the generator in improving its generation based on its current network weights. The generator can keep improving as long as the discriminator can differentiate real and fake cat images. The only way that a generator can beat the discriminator is to produce images similar to the real images used for training. Since all the training images contain cats, the generator output must contain cats to win the game. Moreover, when we replace the cat images with dog images, we can use the same method to train a dog image generator. The objective function for the generator is defined by the training dataset and the discriminator architecture. It is thus a very flexible framework for defining the objective function of a generation task, as illustrated in Figure 1.

Fig. 1. Unconditional vs. conditional GANs. (a) In unconditional GANs, the generator converts a noise input z to a fake image G(z), where z ∼ Z and Z is usually a Gaussian random variable. The discriminator tells apart real images x from the training dataset D and fake images from G. (b) In conditional GANs, the generator takes an additional input y as the control signal, which could be another image (image-to-image translation), text (text-to-image synthesis), or a categorical label (label-to-image synthesis). The discriminator tells apart real from fake by leveraging the information in y. In both settings, the combination of the discriminator and the real training data defines an objective function for image synthesis. This data-driven objective function definition is a powerful tool for many computer vision problems.

However, despite its excellent modeling power, GANs are notoriously difficult to train because training involves chasing a moving target. Not only do we need to make sure the generator can reach the target, but also that the target can reach a desirable level of goodness. Recall that the goal of the discriminator is to differentiate real and fake data. As the generator changes, the fake data distribution changes as well. This poses a new classification problem to the discriminator.

* Equal contribution. Ming-Yu Liu, Xun Huang, Ting-Chun Wang, and Arun Mallya are with NVIDIA. Jiahui Yu is with Google.
(a) Boltzmann Machine (b) Variational Autoencoder (c) Autoregressive Model (d) Normalizing Flow Model (e) Generative Adversarial Network
Fig. 3. Structure comparison of different deep generative models. Except for the deep Boltzmann machine, which is based on an undirected graph, the models are all based on directed graphs, which enjoy a faster inference speed.
and autoencoders [18], which concern representing high-dimensional data x using lower-dimensional latent variables z. In terms of structure, a VAE employs an inference model q(z|x; φ) and a generation model p(x|z; θ)p(z), where p(z) is usually a Gaussian distribution, which we can easily sample from, and q(z|x; φ) approximates the posterior p(z|x; θ). Both the inference and generation models are implemented using feed-forward neural networks. VAE training proceeds by maximizing the evidence lower bound (ELBO) of log p(x; θ), and the non-differentiability of the stochastic sampling is elegantly handled by the reparametrization trick [94]. One can also show that maximizing the ELBO is equivalent to minimizing the Kullback–Leibler (KL) divergence

KL( q(x) q(z|x; φ) || p(z) p(x|z; θ) ),   (2)

where q(x) is the empirical distribution of the data [94]. Once a VAE is trained, an image can be efficiently generated by first sampling z from the Gaussian prior p(z) and then passing it through the feed-forward deep neural network p(x|z; θ). VAEs are effective in learning useful latent representations [188]. However, they tend to generate blurry output images.

Deep AutoRegressive Models (DARs). DARs [30], [153], [177], [207] are deep learning implementations of classical autoregressive models, which assume an ordering of the random variables to be modeled and generate the variables sequentially based on the ordering. This induces a factorization of the data distribution given by

p(x; θ) = ∏_i p(x_i | x_{<i}; θ),   (3)

where the x_i's are the variables in x, and x_{<i} is the union of the variables that precede x_i in the assumed ordering. DARs are conditional generative models in that they generate a new portion of the signal based on what has been generated or observed so far. The learning is based on maximum likelihood:

max_θ  E_{x∼D} [ log p(x_i | x_{<i}; θ) ].   (4)

DAR training is more stable compared to the other generative models. But, due to their recurrent nature, they are slow at inference. Also, while for audio or text a natural ordering of the variables can be determined based on the time dimension, such an ordering does not exist for images. One hence has to enforce an order prior that is an unnatural fit to the image grid.

Normalizing Flow Models (NFMs). NFMs [40], [41], [92], [167] are based on the normalizing flow—a transformation of a simple probability distribution into a more complex distribution by a sequence of invertible and differentiable mappings. Each mapping corresponds to a layer in a deep neural network. With a layer design that guarantees invertibility and differentiability for all possible weights, one can stack many such layers to construct a powerful mapping, because the composition of invertible and differentiable functions is itself invertible and differentiable. Let F = f^(1) ◦ f^(2) ◦ ... ◦ f^(K) be such a K-layer mapping that maps the simple probability distribution Z to the data distribution X. The probability density of a sample x ∼ X can be computed by transforming it back to the corresponding z. Hence, we can apply maximum likelihood learning to train NFMs, because the log-likelihood of the complex data distribution can be converted to the log-likelihood of the simple prior distribution minus the log-determinant Jacobian terms. This gives

log p(x; θ) = log p(z; θ) − Σ_{i=1}^{K} log | det( df^(i) / dz_{i−1} ) |,   (5)

where z_i = f^(i)(z_{i−1}). One key strength of NFMs is in supporting direct evaluation of the probability density. However, NFMs require an invertible mapping, which greatly limits the choice of applicable architectures.

3 LEARNING

Let θ and φ be the learnable parameters in G and D, respectively. GAN training is formulated as a minimax problem

min_θ max_φ V(θ, φ),   (6)

where V is the utility function.

GAN training is challenging. Famous failure cases include mode collapse and mode dropping. In mode collapse, the generator is trapped in a local minimum where it only captures a small portion of the distribution. In mode dropping, the generator does not faithfully model the target distribution.
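To make the minimax game of Eq. (6) concrete, the sketch below shows one gradient-based training iteration that alternates a discriminator (maximization) step and a generator (minimization) step with a cross-entropy-style objective and the common non-saturating generator loss. It is a minimal illustration under stated assumptions (D returns a single real/fake logit per sample; the networks, optimizers, latent dimension, and learning rates are placeholders), not the exact procedure of any particular paper.

import torch
import torch.nn.functional as F

def gan_training_step(G, D, opt_G, opt_D, real, z_dim=128):
    # --- Discriminator step: push D(real) toward "real" and D(fake) toward "fake". ---
    z = torch.randn(real.size(0), z_dim, device=real.device)
    fake = G(z).detach()                      # block gradients into G during the D step
    d_real, d_fake = D(real), D(fake)
    d_loss = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) + \
             F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # --- Generator step (non-saturating variant): make D classify G(z) as real. ---
    z = torch.randn(real.size(0), z_dim, device=real.device)
    d_out = D(G(z))
    g_loss = F.binary_cross_entropy_with_logits(d_out, torch.ones_like(d_out))
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()
    return d_loss.item(), g_loss.item()

Repeating this step over minibatches approximates solving Eq. (6); the mode collapse and mode dropping failures described above arise when these alternating updates settle into poor equilibria.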
(a) Multi-layer Perceptron (b) Deep ConvNet (c) Residual Net. (d) Residual Net. with Cond. Act. Norm. (e) Cond. ConvNet
Fig. 4. Generator evolution. Since the debut of GANs [60], the generator architecture has continuously evolved. From (a)-(c), one can observe the change from simple MLPs to deep convolutional and residual networks. Recently, conditional architectures, including conditional activation norms (d) and conditional convolutions (e), have gained popularity as they allow users to have more control over the generation outputs.

when using other GAN losses. In practice, it has the effect of countering the vanishing and exploding gradients that occur during GAN training.

On the other hand, the design of GP-0 is based on the idea of penalizing the discriminator for deviating from the Nash equilibrium. GP-0 takes a simpler form in that it does not use an imaginary sample distribution but uses the real data distribution, i.e., setting x̂ ≡ x. We find the use of GP-0 in several state-of-the-art GAN algorithms [85], [86].

Spectral Normalization (SN) [140] is an effective regularization technique used in many recent GAN algorithms [24], [156], [170], [241]. SN is based on regularizing the spectral norm of the projection operation at each layer of the discriminator, by simply dividing the weight matrix by its largest singular value. Let W be the weight matrix of a layer of the discriminator network. With SN, the true weight that is applied is

W / sqrt( λ_max(W^T W) ),   (15)

image for a fake image. For the i-th layer, the instance-based FM loss is given by

|| D_i ◦ ... ◦ D_1(x_i) − D_i ◦ ... ◦ D_1(G(z, y_i)) ||,   (17)

where y_i is the control signal for x_i.

Perceptual Loss [79]. Often, when the instance-based FM loss is applicable, one can additionally match features extracted from real and fake images using a pretrained network. Such a variant of FM losses is called the perceptual loss [79].

Model Average (MA) can improve the quality of images generated by a GAN. To use MA, we keep two copies of the generator network during training, where one is the original generator with weight θ and the other is the model-average generator with weight θ_MA. At iteration t, we update θ_MA based on

θ_MA^(t) = β θ^(t) + (1 − β) θ_MA^(t−1),   (18)

where β is a scalar controlling the contribution from the current model weight.
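As a small illustration of the model-average update in Eq. (18), the snippet below keeps an exponential moving average of a generator's parameters. The decay value and copy-on-init behavior are illustrative assumptions rather than the recipe of any specific paper.

import copy
import torch

class GeneratorEMA:
    """Maintains theta_MA as a running average of the generator weights (Eq. 18)."""
    def __init__(self, generator, beta=0.001):
        self.beta = beta                        # weight on the *current* parameters
        self.avg = copy.deepcopy(generator)     # model-average copy, theta_MA
        for p in self.avg.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, generator):
        # theta_MA^(t) = beta * theta^(t) + (1 - beta) * theta_MA^(t-1)
        for p_avg, p in zip(self.avg.parameters(), generator.parameters()):
            p_avg.mul_(1.0 - self.beta).add_(p, alpha=self.beta)

One would call update(G) after every generator step and sample from the averaged copy at test time; with a small beta, the averaged weights change slowly and tend to yield more stable outputs.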
As the residual architecture [66] is proven to be effective for training deep networks, several GAN works have started to use the residual architecture in their generator design (Figure 4(c)) [6], [140]. A residual block used in modern GAN generators typically consists of a skip connection paired with a series of batch normalization (BN) [74], nonlinearity, and convolution operations. BN is one type of activation norm (AN), a technique that normalizes the activation values to facilitate training. Other AN variants have also been exploited for the GAN generator, including instance normalization [206], layer normalization [8], and group normalization [227]. Generally, an activation normalization scheme consists of a whitening step followed by an affine transformation step. Let h_c be the output of the whitening step for h. The final output of the normalization layer is

γ_c h_c + β_c,   (19)

where γ_c and β_c are scalars used to scale and shift the post-normalization activation values. They are constants learned during training.
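The sketch below illustrates the affine transform of Eq. (19) together with the conditional activation norm discussed just below, in which γ_c and β_c are predicted from a control signal by a small auxiliary network instead of being learned constants. The module layout, the choice of instance normalization for whitening, and the layer sizes are illustrative assumptions, not the design of any specific paper.

import torch
import torch.nn as nn

class ConditionalActNorm(nn.Module):
    """Whitening (instance norm without affine) followed by a data-dependent
    affine transform: y = gamma(cond) * h + beta(cond), cf. Eq. (19)."""
    def __init__(self, num_channels, cond_dim):
        super().__init__()
        self.whiten = nn.InstanceNorm2d(num_channels, affine=False)
        # Auxiliary network mapping the control signal to per-channel gamma and beta.
        self.to_gamma_beta = nn.Linear(cond_dim, 2 * num_channels)

    def forward(self, h, cond):
        gamma, beta = self.to_gamma_beta(cond).chunk(2, dim=1)
        gamma = gamma.unsqueeze(-1).unsqueeze(-1)   # shape (N, C, 1, 1)
        beta = beta.unsqueeze(-1).unsqueeze(-1)
        return gamma * self.whiten(h) + beta

With an unconditional activation norm, gamma and beta would instead be plain learned parameters, as in standard batch or instance normalization.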
For many applications, it is required to have some way to control the output produced by a generator. This desire has motivated various conditional generator architectures (Figure 4(d)) for the GAN generator [24], [70], [156]. The most common approach is to use the conditional AN. In a conditional AN, both γ_c and β_c are data dependent. Often, one employs a separate network to map input control signals to the target γ_c and β_c values. Another way to achieve such controllability is to use hyper-networks, i.e., an auxiliary network that produces the weights of the main network. For example, we can have a convolutional layer whose filter weights are generated by a separate network. We often call such a scheme conditional convolutions (Figure 4(e)), and it has been used in several state-of-the-art GAN generators [86], [216].

Discriminator Evolution. GAN discriminators have also undergone an evolution. However, the change has mostly been on moving from MLPs to deep convolutional and residual architectures. As the discriminator is solving a classification task, new breakthroughs in architecture design for image classification tasks could influence future GAN discriminator designs.

Conditional Discriminator Architecture. There are several effective architectures for utilizing control signals (conditional inputs y) in the GAN discriminator to achieve better image generation quality, as visualized in Figure 5. These include the auxiliary classifier (AC) [151], input concatenation (IC) [75], and the projection discriminator (PD) [141]. The AC and PD are mostly used for category-conditional image generation tasks, while the IC is common for image-to-image translation tasks.

Fig. 5. Conditional discriminator architectures. There are several ways to leverage the user input signal y in the GAN discriminator. (a) Auxiliary classifier [151]. In this design, the discriminator is asked to predict the ground-truth label for the real image. (b) Concatenation [75]. In this design, the discriminator learns to reason whether the input is real by learning a joint feature embedding of image and label. (c) Projection discriminator [141]. In this design, the discriminator computes an image embedding and correlates it with the label embedding (through the dot product) to determine whether the input is real or fake.

Neural Architecture Search. As neural architecture search has become a popular topic for various recognition tasks, efforts have been made to automatically find a performant architecture for GANs [56].

While the current and previous sections have focused on introducing the GAN mechanism and various algorithms used to train them, the following sections focus on various applications of GANs in generating images and videos.

4 IMAGE TRANSLATION

This section discusses the application of GANs to image-to-image translation, which aims to map an image from one domain to a corresponding image in a different domain, e.g., sketches to shoes, label maps to photos, summer to winter. The problem can be studied in a supervised setting, where example pairs of corresponding images are available, or an unsupervised setting, where such training data is unavailable and we only have two independent sets of images. In the following subsections, we discuss recent progress in both settings.

4.1 Supervised Image Translation

Isola et al. [75] proposed the pix2pix framework as a general-purpose solution to image-to-image translation in the supervised setting. The training objective of pix2pix combines conditional GANs with a pixel-wise ℓ1 loss between the generated image and the ground truth. One notable design choice of pix2pix is the use of patch-wise discriminators (PatchGAN), which attempt to discriminate each local image patch rather than the whole image. This design incorporates the prior knowledge that the underlying image translation function we want to learn is local, assuming independence between pixels that are far away. In other words, the translation mostly involves style or texture changes. It significantly alleviates the burden of the discriminator because it requires much less model capacity to discriminate local patches than whole images.

One important limitation of pix2pix is that its translation function is restricted to be one-to-one. However, many of the mappings we aim to learn are one-to-many in nature. In other words, the distribution of possible outputs is multimodal. For example, one can imagine many shoes in different colors and styles that correspond to the same sketch of a shoe. Naively injecting a Gaussian noise latent code into the generator does not lead to many variations, since the generator is free to ignore that latent code. BicycleGAN [254] explores approaches to encourage the generator to make use of the latent code to represent output variations, including applying a KL divergence loss to the encoded
latent code, and reconstructing the sampled latent code from the generated image. Other strategies to encourage diversity include using different generators to capture different output modes [54], replacing the reconstruction loss with a maximum likelihood objective [106], [107], and directly encouraging the distance between output images generated from different latent codes to be large [120], [133], [233].

Besides, the quality of image-to-image translation has been significantly improved by some recent works [104], [122], [156], [194], [218], [249]. In particular, pix2pixHD [218] is able to generate high-resolution images with a coarse-to-fine generator and a multi-scale discriminator. SPADE [156] further improves the image quality with a spatially-adaptive normalization layer. SPADE, in addition, allows a style image input for better control over the desired look of the output image. Some examples of SPADE are shown in Figure 6.

Fig. 6. Image translation examples of SPADE [156], which converts semantic label maps into photorealistic natural scenes. The style of the output image can also be controlled by a reference image (the leftmost column). Images are from Park et al. [156].

4.2 Unsupervised Image Translation

For many tasks, paired training images are very difficult to obtain [16], [32], [70], [90], [105], [117], [234], [253]. Unsupervised learning of mappings between corresponding images in two domains is a much harder problem but has wider applications than the supervised setting. CycleGAN [253] simultaneously learns mappings in both directions and employs a cycle consistency loss to enforce that if an image is translated to the other domain and translated back to the original domain, the output should be close to the original image. UNIT [117] makes a shared latent space assumption [119] that a pair of corresponding images can be mapped to the same latent code in a shared latent space. It is shown that the shared latent space implies cycle consistency and imposes a stronger regularization. DistanceGAN [16] encourages the mapping to preserve the distance between any pair of images before and after translation. While the methods above need to train a different model for each pair of image domains, StarGAN [32] is able to translate images across multiple domains using only a single model.

In many unsupervised image translation tasks (e.g., horses to zebras, dogs to cats), the two image domains mainly differ in the foreground objects, and the background distribution is very similar. Ideally, the model should only modify the foreground objects and leave the background region untouched. Some works [31], [137], [231] employ spatial attention to detect and change the foreground region without influencing the background. InstaGAN [142] further allows the shape of the foreground objects to be changed.

The early work mentioned above focuses on unimodal translation. On the other hand, recent advances [5], [57], [70], [105], [128], [133] have made it possible to perform multimodal translation, generating diverse output images given the same input. For example, MUNIT [70] assumes that images can be encoded into two disentangled latent spaces: a domain-invariant content space that captures the information that should be preserved during translation, and a domain-specific style space that represents the variations that are not specified by the input image. To generate diverse translation results, we can recombine the content code of the input image with different style codes sampled from the style space of the target domain. Figure 7 compares MUNIT with existing unimodal translation methods including CycleGAN and UNIT. The disentangled latent space not only enables multimodal translation, but also allows example-guided translation, in which the generator recombines the domain-invariant content of an image from the source domain and the domain-specific style of an image from the target domain. The idea of using a guiding style image has also been applied to the supervised setting [156].
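As a concrete illustration of the cycle consistency constraint used by CycleGAN and related unsupervised methods discussed above, the snippet below computes the two reconstruction terms for a pair of mappings G: X→Y and F: Y→X. The ℓ1 penalty and the weighting factor are common choices shown here as assumptions, not the exact formulation of any one paper.

import torch.nn.functional as F_nn

def cycle_consistency_loss(G, F, real_x, real_y, weight=10.0):
    """L_cyc = ||F(G(x)) - x||_1 + ||G(F(y)) - y||_1, scaled by a weight."""
    recon_x = F(G(real_x))      # X -> Y -> X
    recon_y = G(F(real_y))      # Y -> X -> Y
    loss = F_nn.l1_loss(recon_x, real_x) + F_nn.l1_loss(recon_y, real_y)
    return weight * loss

In practice this term is added to the adversarial losses of both directions, so that each translated image stays faithful to its source while fooling the discriminator of the target domain.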
Fig. 8. Perception-distortion tradeoff [22]. Distortion metrics, including the MSE, PSNR, and SSIM, measure the similarity between the ground-truth image and the restored images. Perceptual quality metrics, including NR [127], measure the distribution distance between the recovered image distribution and the target image distribution. Blau et al. [22] show that an image restoration algorithm can be characterized by the distortion and perceptual quality tradeoff curve. The plot is from Blau et al. [22].

Fig. 9. The perception-distortion curve of the ESRGAN [219] on the PIRM self-validation dataset [21]. The curve also compares the ESRGAN with the EnhanceNet [174], the RCAN [245], and the EDSR [111]. The curve is from Wang et al. [219].
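To ground the distinction drawn in Fig. 8, the small helpers below compute two of the distortion metrics mentioned there (MSE and PSNR) for a pair of images; the peak value of 1.0 assumes images scaled to [0, 1] and is an illustrative convention.

import torch

def mse(a, b):
    return torch.mean((a - b) ** 2)

def psnr(a, b, peak=1.0):
    """Peak signal-to-noise ratio in dB; higher means lower distortion."""
    return 10.0 * torch.log10(peak ** 2 / mse(a, b))

Perceptual quality metrics, in contrast, compare distributions of images rather than pixel-wise differences, which is why optimizing one family of metrics does not necessarily improve the other.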
Image super-resolution (SR) aims at estimating a high-resolution (HR) image from its low-resolution (LR) counterpart. Deep learning has enabled faster and more accurate super-resolution methods, including SRCNN [42], FSRCNN [43], ESPCN [182], VDSR [88], SRResNet [102], EDSR [111], SRDenseNet [203], MemNet [193], RDN [246], WDSR [235], and many others. However, the above super-resolution approaches focus on improving the distortion metrics and pay little to no attention to the perceptual quality metrics. As a result, they tend to predict over-smoothed outputs and fail to synthesize finer high-frequency details.

Fig. 10. Visual comparison between the ESRGAN [219] and the SRGAN [102]. Images are from Wang et al. [219].

Recent image super-resolution algorithms improve the perceptual quality of outputs by leveraging GANs. The SRGAN [102] is the first of its kind and can generate photorealistic images with 4× or higher upscaling factors. The quality of the SRGAN [102] outputs is mainly measured by the mean opinion score (MOS) over 26 raters. To enhance the visual quality further, Wang et al. [219] revisit the design of the three key components in the SRGAN: the network architecture, the GAN loss, and the perceptual loss. They propose the Enhanced SRGAN (ESRGAN), which achieves consistently better visual quality with more realistic and natural textures than the competing methods, as shown in Figure 9 and Figure 10. The ESRGAN is the winner of the 2018 Perceptual Image Restoration and Manipulation challenge (PIRM) [21] (region 3 in Figure 9). Other GAN-based image super-resolution methods and practices can be found in the 2018 PIRM challenge report [21].

The above image super-resolution algorithms all operate in the supervised setting, where they assume corresponding low-resolution and high-resolution pairs in the training dataset. Typically, they create such a training dataset by downsampling the ground-truth high-resolution images. However, the downsampled high-resolution images are very different from the low-resolution images captured by a real sensor, which often contain noise and other distortion. As a result, these super-resolution algorithms are not directly applicable to upsampling low-resolution images captured in the wild. Several methods have addressed the issue by studying image super-resolution in the unsupervised setting, where they only assume a dataset of low-resolution images captured by a sensor and a dataset of high-resolution images. Recently, Maeda [131] proposed a GAN-based image super-resolution algorithm that operates in the unsupervised setting to bridge this gap.

Image denoising aims at removing noise from noisy images. The task is challenging since the noise distribution is usually unknown. This setting is also referred to as blind image denoising. DnCNN [242] is one of the first approaches using feed-forward convolutional neural networks for image denoising. However, DnCNN [242] requires knowing the noise distribution in the noisy image and hence has limited applicability. To tackle blind image denoising, Chen et al. [27] proposed the GAN-CNN-based Blind Denoiser (GCBD), which consists of 1) a GAN trained to estimate the noise distribution over the input noisy images and to generate noise samples, and 2) a deep CNN that learns to denoise on the generated noisy images. The GAN training criterion of GCBD [27] is based on the Wasserstein GAN [6], and the generator network is based on DCGAN [163].

Image deblurring sharpens blurry images, which result from motion blur, out-of-focus capture, and possibly other causes.
Fig. 16. Face swapping vs. reenactment [149]. Face swapping focuses on pasting the face region from one subject to another, while face reenactment concerns transferring the expressions and head poses from the target subject to the source image. Images are from Nirkin et al. [149].

Depending on whether one model can only work for a specific person or is universal to all persons, face reenactment methods can be classified as subject-specific or subject-agnostic, as described below.

Subject-specific. Traditional methods usually build a subject-specific model, which can only synthesize one pre-determined subject, by focusing on transferring the expressions without transferring the head movement [192], [197], [198], [199], [209]. This line of work usually starts by collecting footage of the target person to be synthesized, either using an RGBD sensor [198] or an RGB sensor [199]. Then a 3D model of the target person is built for the face region [20]. At test time, given new expressions, they can be used to drive the 3D model to generate the desired motions, as shown in Figure 17. Instead of extracting the driving expressions from someone else, they can also be directly synthesized from speech inputs [192]. Since 3D models are involved, this line of work typically does not use GANs.

Some follow-up works take transferring head motions into account and can model both expressions and different head poses at the same time [11], [87], [226]. For example, RecycleGAN [11] extends CycleGAN [253] to incorporate temporal constraints so it can transform videos of a particular person into those of another fixed person. On the other hand, ReenactGAN [226] can transfer movements and expressions from an arbitrary person to a fixed person. Still, the subject-dependent nature of these works greatly limits their usability. One model can only work for one person, and generalizing to another person requires training a new model. Moreover, collecting training data for the target person may not be feasible at all times, which motivates the emergence of subject-agnostic models.

Subject-agnostic. Several recent works propose subject-agnostic frameworks, which focus on transferring the facial expressions without head movements [28], [29], [49], [53], [77], [144], [152], [159], [160], [190], [211], [250]. In particular, many works only focus on the mouth region, since it is the most expressive part during talking. For example, given an audio speech and one lip image of the target identity, Chen et al. [28] synthesize a video of the desired lip movements. Fried et al. [49] edit the lower face region of an existing video, so they can edit the video script and synthesize a new video corresponding to the change. While these works have better generalization capability than the previous subject-specific methods, they usually cannot synthesize spontaneous head motions. The head movements cannot be transferred from the driving sequence to the target person.

Very recently, some works can handle both expressions and head movements using subject-agnostic frameworks [7], [62], [149], [184], [216], [225], [238]. These frameworks only need a single 2D image of the target person and can synthesize talking videos of this person given arbitrary motions. These motions are represented using either facial landmarks [7] or keypoints learned without supervision [184]. Since the input is only a 2D image, many methods rely on warping the input or its extracted features and then filling in the disoccluded areas to refine the results. For example, Averbuch-Elor et al. [7] first warp the image and directly copy the teeth region from the driving image to fill in the holes in case of an open mouth. Siarohin et al. [184] warp features extracted from the input image, using motion fields estimated from sparse keypoints. On the other hand, Zakharov et al. [238] demonstrate that it is possible to achieve promising results using direct synthesis methods without any warping. To synthesize the target identity, they extract features from the source images and inject the information into the generator through the AdaIN [69] parameters. Similarly, the few-shot vid2vid [216] injects the information into its generator by dynamically determining the SPADE [156] parameters. Since these methods require only an image as input, they become particularly powerful and can be used in even more cases. For instance, several works [7], [216], [238] demonstrate successes in animating paintings or graffiti instead of real humans, as shown in Figure 18, which is not possible with the previous subject-dependent approaches. However, while these methods have achieved great results in synthesizing people talking under natural motions, they usually struggle to generate satisfying outputs under extreme poses or uncommon expressions, especially when the target pose is very different from the original one. Moreover, synthesizing complex regions such as hair or background is still hard. This is indeed a very challenging task that is still open to further research. A summary of the different categories of face reenactment methods can be found in Table 2.

6.2 Pose Transfer

Pose transfer techniques aim at transferring the body pose of one person to another person. It can be seen as the whole-body counterpart of face reenactment. In contrast to talking-head generation, where the motions are usually similar, body poses have more variety and are thus much harder to synthesize. Early works focus on simple pose transfers that generate low-resolution and lower-quality images. They only work on single images instead of videos. Recent works have shown their capability to generate high-quality and high-resolution videos for challenging poses, but they can only work on a particular person per model.
Fig. 17. Face reenactment using 3D face models [87]. These methods first construct a 3D model for the person to be synthesized, so they can easily animate the model with new expressions. Images are from Kim et al. [87].

Fig. 18. Few-shot face reenactment methods, which require only a 2D image as input [238]. The driving expressions are usually represented by facial landmarks or keypoints. Images are from Zakharov et al. [238].

TABLE 2. Categorization of face reenactment methods. Subject-specific models can only work on one subject per model, while subject-agnostic models can work on general targets. Among each of them, some frameworks only focus on the inner face region, so they can only transfer expressions, while others can also transfer head movements. Works with * do not use GANs in their framework.

Target subject | Transferred region | Methods
Specific | Face only | [209]*, [198]*, [199]*, [192]*, [197]*
Specific | Entire head | [87], [11], [226]
General | Face only | [152], [28], [53], [144], [159], [250], [29], [190], [77]*, [211], [49], [160]
General | Entire head | [7]*, [225]*, [149], [238], [162]*, [216], [184]*, [65]

TABLE 3. Categories of pose transfer methods. Again, they can be classified depending on whether one model can work for only one person or any person. Some of the frameworks only focus on generating single images, while others also demonstrate their effectiveness on videos. Works with * do not use GANs in their framework.

Target subject type | Output | Methods
Specific | Videos | [200]*, [217], [26], [3], [183]*, [251], [116], [115]
General | Images | [129], [130], [186], [46]*, [10], [161], [239]*, [82], [247], [146], [164], [44], [61], [108], [124], [189], [256]
General | Videos | [232], [185], [223]*, [121], [184]*, [216], [166]
Very recently, several works have attempted to perform subject-agnostic video synthesis. A summary of the categories is shown in Table 3. Below we introduce each category in more detail.

Subject-agnostic image generation. Although we focus on video synthesis in this section, since most of the existing motion transfer approaches only focus on synthesizing images, we still briefly introduce them here ([10], [44], [46], [61], [82], [108], [124], [129], [130], [146], [161], [164], [186], [189], [239], [247], [256]). Ma et al. [129] adopt a two-stage coarse-to-fine approach using GANs to synthesize a person in a different pose, represented by a set of keypoints. In their follow-up work [130], the foreground, background, and poses in the image are further disentangled into different latent codes to provide more flexibility and controllability. Later, Siarohin et al. [186] introduce deformable skip connections to move local features to the target pose position in a U-Net generator. Similarly, Balakrishnan et al. [10] decompose different parts of the body into different layer masks and apply spatial transforms to each of them. The transformed segments are then fused together to form the final output.

The above methods work in a supervised setting where images of different poses of the same person are available during training. To work in the unsupervised setting, Pumarola et al. [161] render the synthesized image back to the original pose and apply a cycle-consistency constraint on the back-rendered image. Lorenz et al. [124] decouple the shape and appearance from images without supervision.
Fig. 22. The two common frameworks for neural rendering. (a) In the first set of works [109], [132], [135], [139], [158], a neural network that purely operates in the 2D domain is trained to enhance an input image, possibly supplemented with other information such as depth or segmentation maps. (b) The second set of works [147], [148], [178], [187], [224] introduces native 3D operations (differentiable 2D-to-3D feature lifting and differentiable 3D-to-2D projection) that produce and transform 3D features. This allows the network to reason in 3D and produce view-consistent outputs.
of training on ImageNet [24]. Also, the final performance is generally very sensitive to hyper-parameters [96], [126].

Interpretability. Despite the impressive quality of the generated images, there has been a lack of understanding of how GANs represent the image structure internally in the generator. Bau et al. visualize the causal effect of different neurons on the output image [13]. After finding the semantic meaning of individual neurons or directions in the latent space [55], [76], [180], one can edit a real image by inverting it to the latent space, editing the latent code according to the desired semantic change, and regenerating it with the generator. Finding the best way to encode an image into the latent space is, therefore, another interesting research direction [1], [2], [14], [72], [86], [252].

Forensics. The success of GANs has enabled many new applications but has also raised ethical and social concerns such as fraud and fake news. The ability to detect GAN-generated images is essential to prevent malicious usage of GANs. Recent studies have found it possible to train a classifier to detect generated images and generalize to unseen generator architectures [25], [215], [244]. This cat-and-mouse game may continue, as generated images may become increasingly harder to detect in the future.

9 CONCLUSION

In this paper, we present a comprehensive overview of GANs with an emphasis on algorithms and applications to visual synthesis. We summarize the evolution of the network architectures in GANs and the strategies to stabilize GAN training. We then introduce several fascinating applications of GANs, including image translation, image processing, video synthesis, and neural rendering. In the end, we point out some open problems for GANs, and we hope this paper will inspire future research to solve them.

REFERENCES

[1] R. Abdal, Y. Qin, and P. Wonka. Image2StyleGAN: How to embed images into the StyleGAN latent space? In ICCV, 2019.
[2] R. Abdal, Y. Qin, and P. Wonka. Image2StyleGAN++: How to edit the embedded images? In CVPR, 2020.
[3] K. Aberman, M. Shi, J. Liao, D. Lischinski, B. Chen, and D. Cohen-Or. Deep video-based performance cloning. Computer Graphics Forum, 2019.
[4] E. Agustsson, M. Tschannen, F. Mentzer, R. Timofte, and L. V. Gool. Generative adversarial networks for extreme learned image compression. In ICCV, 2019.
[5] A. Almahairi, S. Rajeshwar, A. Sordoni, P. Bachman, and A. Courville. Augmented CycleGAN: Learning many-to-many mappings from unpaired data. In ICML, 2018.
[6] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN. In ICML, 2017.
[7] H. Averbuch-Elor, D. Cohen-Or, J. Kopf, and M. F. Cohen. Bringing portraits to life. TOG, 2017.
[8] J. L. Ba, J. R. Kiros, and G. E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
[9] K. Baek, Y. Choi, Y. Uh, J. Yoo, and H. Shim. Rethinking the truly unsupervised image-to-image translation. arXiv preprint arXiv:2006.06500, 2020.
[10] G. Balakrishnan, A. Zhao, A. V. Dalca, F. Durand, and J. Guttag. Synthesizing images of humans in unseen poses. In CVPR, 2018.
[11] A. Bansal, S. Ma, D. Ramanan, and Y. Sheikh. Recycle-GAN: Unsupervised video retargeting. In ECCV, 2018.
[12] C. Barnes, E. Shechtman, A. Finkelstein, and D. B. Goldman. PatchMatch: A randomized correspondence algorithm for structural image editing. TOG, 2009.
[13] D. Bau, J.-Y. Zhu, H. Strobelt, B. Zhou, J. B. Tenenbaum, W. T. Freeman, and A. Torralba. GAN dissection: Visualizing and understanding generative adversarial networks. In ICLR, 2019.
[14] D. Bau, J.-Y. Zhu, J. Wulff, W. Peebles, H. Strobelt, B. Zhou, and A. Torralba. Seeing what a GAN cannot generate. In ICCV, 2019.
[15] S. Benaim, R. Mokady, A. Bermano, D. Cohen-Or, and L. Wolf. Structural-analogy from a single image pair. arXiv preprint arXiv:2004.02222, 2020.
[16] S. Benaim and L. Wolf. One-sided unsupervised domain mapping. In NeurIPS, 2017.
[17] S. Benaim and L. Wolf. One-shot unsupervised cross domain translation. In NeurIPS, 2018.
[18] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. TPAMI, 2013.
[19] M. Bińkowski, D. J. Sutherland, M. Arbel, and A. Gretton. Demystifying MMD GANs. In ICLR, 2018.
[20] V. Blanz and T. Vetter. A morphable model for the synthesis of 3D faces. In Proceedings of the 26th annual conference on Computer graphics and interactive techniques, 1999.
[21] Y. Blau, R. Mechrez, R. Timofte, T. Michaeli, and L. Zelnik-Manor. The 2018 PIRM challenge on perceptual image super-resolution. In ECCV Workshop, 2018.
[22] Y. Blau and T. Michaeli. The perception-distortion tradeoff. In CVPR, 2018.
[23] A. Borji. Pros and cons of GAN evaluation measures. Computer Vision and Image Understanding, 2019.
[24] A. Brock, J. Donahue, and K. Simonyan. Large scale GAN training for high fidelity natural image synthesis. In ICLR, 2019.
[25] L. Chai, D. Bau, S.-N. Lim, and P. Isola. What makes fake images detectable? Understanding properties that generalize. In ECCV, 2020.
[26] C. Chan, S. Ginosar, T. Zhou, and A. A. Efros. Everybody dance now. In ICCV, 2019.
[27] J. Chen, J. Chen, H. Chao, and M. Yang. Image blind denoising with generative adversarial network based noise modeling. In CVPR, 2018.
[28] L. Chen, Z. Li, R. K. Maddox, Z. Duan, and C. Xu. Lip movements generation at a glance. In ECCV, 2018.
[29] L. Chen, R. K. Maddox, Z. Duan, and C. Xu. Hierarchical cross-modal talking face generation with dynamic pixel-wise loss. In CVPR, 2019.
[30] X. Chen, N. Mishra, M. Rohaninejad, and P. Abbeel. PixelSNAIL: An improved autoregressive generative model. In ICML, 2018.
[31] X. Chen, C. Xu, X. Yang, and D. Tao. Attention-GAN for object transfiguration in wild images. In ECCV, 2018.
[32] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo. StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation. In CVPR, 2018.
[33] A. Clark, J. Donahue, and K. Simonyan. Efficient video generation on complex datasets. arXiv preprint arXiv:1907.06571, 2019.
[34] T. Cohen and L. Wolf. Bidirectional one-shot unsupervised domain mapping. In ICCV, 2019.
[35] A. Creswell, T. White, V. Dumoulin, K. Arulkumaran, B. Sengupta, and A. A. Bharath. Generative adversarial networks: An overview. IEEE Signal Processing Magazine, 2018.
[36] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
[37] P. Dayan, G. E. Hinton, R. M. Neal, and R. S. Zemel. The Helmholtz machine. Neural Computation, 1995.
[38] E. de Bézenac, I. Ayed, and P. Gallinari. Optimal unsupervised domain translation. arXiv preprint arXiv:1906.01292, 2019.
[39] E. L. Denton and V. Birodkar. Unsupervised learning of disentangled representations from video. In NeurIPS, 2017.
[40] L. Dinh, D. Krueger, and Y. Bengio. NICE: Non-linear independent components estimation. In ICLR, 2015.
[41] L. Dinh, J. Sohl-Dickstein, and S. Bengio. Density estimation using real NVP. In ICLR, 2017.
[42] C. Dong, C. C. Loy, K. He, and X. Tang. Image super-resolution using deep convolutional networks. TPAMI, 2015.
[43] C. Dong, C. C. Loy, and X. Tang. Accelerating the super-resolution convolutional neural network. In ECCV, 2016.
[44] H. Dong, X. Liang, K. Gong, H. Lai, J. Zhu, and J. Yin. Soft-gated warping-GAN for pose-guided person image synthesis. In NeurIPS, 2018.
[45] Y. Du and I. Mordatch. Implicit generation and generalization in energy-based models. In NeurIPS, 2019.
[46] P. Esser, E. Sutter, and B. Ommer. A variational U-Net for conditional appearance and shape generation. In CVPR, 2018.
[47] C. Finn, I. Goodfellow, and S. Levine. Unsupervised learning for physical interaction through video prediction. In NeurIPS, 2016.
[48] A. Fischer and C. Igel. An introduction to restricted Boltzmann
machines. In Iberoamerican congress on pattern recognition, 2012. [81] A. Jolicoeur-Martineau. On relativistic f-divergences. In ICML,
[49] O. Fried, A. Tewari, M. Zollhöfer, A. Finkelstein, E. Shechtman, 2020.
D. B. Goldman, K. Genova, Z. Jin, C. Theobalt, and M. Agrawala. [82] D. Joo, D. Kim, and J. Kim. Generating a fusion image: One’s
Text-based editing of talking-head video. TOG, 2019. identity and another’s shape. In CVPR, 2018.
[50] O. Gafni, L. Wolf, and Y. Taigman. Vid2Game: Controllable [83] N. Kalchbrenner, A. v. d. Oord, K. Simonyan, I. Danihelka,
characters extracted from real-world videos. In ICLR, 2020. O. Vinyals, A. Graves, and K. Kavukcuoglu. Video pixel net-
[51] T. Galanti, L. Wolf, and S. Benaim. The role of minimal complex- works. ICML, 2017.
ity functions in unsupervised learning of semantic mappings. In [84] T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive growing
ICLR, 2018. of GANs for improved quality, stability, and variation. In ICLR,
[52] L. Galteri, L. Seidenari, M. Bertini, and A. Del Bimbo. Deep 2018.
generative adversarial compression artifact removal. In ICCV, [85] T. Karras, S. Laine, and T. Aila. A style-based generator architec-
2017. ture for generative adversarial networks. In CVPR, 2019.
[53] J. Geng, T. Shao, Y. Zheng, Y. Weng, and K. Zhou. Warp-guided [86] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila.
GANs for single-photo facial animation. TOG, 2018. Analyzing and improving the image quality of StyleGAN. In
[54] A. Ghosh, V. Kulharia, V. P. Namboodiri, P. H. Torr, and P. K. CVPR, 2020.
Dokania. Multi-agent diverse generative adversarial networks. [87] H. Kim, P. Garrido, A. Tewari, W. Xu, J. Thies, M. Niessner,
In CVPR, 2018. P. Pérez, C. Richardt, M. Zollhöfer, and C. Theobalt. Deep video
[55] L. Goetschalckx, A. Andonian, A. Oliva, and P. Isola. Ganalyze: portraits. TOG, 2018.
Toward visual definitions of cognitive image properties. In ICCV, [88] J. Kim, J. Kwon Lee, and K. Mu Lee. Accurate image super-
2019. resolution using very deep convolutional networks. In CVPR,
[56] X. Gong, S. Chang, Y. Jiang, and Z. Wang. Autogan: Neural 2016.
architecture search for generative adversarial networks. In ICCV, [89] K. Kim, Y. Yun, K.-W. Kang, K. Gong, S. Lee, and S.-J. Kang.
2019. Painting outside as inside: Edge guided image outpainting via
[57] A. Gonzalez-Garcia, J. Van De Weijer, and Y. Bengio. Image-to- bidirectional rearrangement with step-by-step learning. arXiv
image translation for cross-domain disentanglement. In NeurIPS, preprint arXiv:2010.01810, 2020.
2018. [90] T. Kim, M. Cha, H. Kim, J. K. Lee, and J. Kim. Learning
[58] I. Goodfellow. NIPS 2016 tutorial: Generative adversarial net- to discover cross-domain relations with generative adversarial
works. arXiv preprint arXiv:1701.00160, 2016. networks. In ICML, 2017.
[59] I. Goodfellow, Y. Bengio, and A. Courville. Deep learning. MIT [91] D. Kingma and J. Ba. Adam: A method for stochastic optimiza-
press, 2016. tion. In ICLR, 2015.
[60] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde- [92] D. P. Kingma and P. Dhariwal. Glow: Generative flow with
Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adver- invertible 1x1 convolutions. In NeurIPS, 2018.
sarial networks. In NeurIPS, 2014. [93] D. P. Kingma and M. Welling. Auto-encoding variational Bayes.
[61] A. Grigorev, A. Sevastopolsky, A. Vakhitov, and V. Lempitsky. In ICLR, 2013.
Coordinate-based texture inpainting for pose-guided human im- [94] D. P. Kingma and M. Welling. An introduction to variational
age generation. In CVPR, 2019. autoencoders. arXiv preprint arXiv:1906.02691, 2019.
[62] K. Gu, Y. Zhou, and T. S. Huang. FLNet: Landmark driven fetch- [95] O. Kupyn, V. Budzan, M. Mykhailych, D. Mishkin, and J. Matas.
ing and learning network for faithful talking facial animation DeblurGAN: Blind motion deblurring using conditional adver-
synthesis. In AAAI, 2020. sarial networks. In CVPR, 2018.
[63] R. A. Güler, N. Neverova, and I. Kokkinos. DensePose: Dense [96] K. Kurach, M. Lučić, X. Zhai, M. Michalski, and S. Gelly. A
human pose estimation in the wild. In CVPR, 2018. large-scale study on regularization and normalization in gans.
[64] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. In International Conference on Machine Learning, 2019.
Courville. Improved training of wasserstein GANs. In NeurIPS, [97] T. Kynkäänniemi, T. Karras, S. Laine, J. Lehtinen, and T. Aila.
2017. Improved precision and recall metric for assessing generative
[65] S. Ha, M. Kersner, B. Kim, S. Seo, and D. Kim. MarioNETte: Few- models. In NeurIPS, 2019.
shot face reenactment preserving identity of unseen targets. In [98] W.-S. Lai, J.-B. Huang, O. Wang, E. Shechtman, E. Yumer, and M.-
AAAI, 2020. H. Yang. Learning blind video temporal consistency. In ECCV,
[66] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for 2018.
image recognition. In CVPR, 2016. [99] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther.
[67] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and Autoencoding beyond pixels using a learned similarity metric.
S. Hochreiter. GANs trained by a two time-scale update rule In ICML, 2016.
converge to a local Nash equilibrium. In NeurIPS, 2017. [100] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 2015.
[68] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimension- [101] Y. LeCun, S. Chopra, R. Hadsell, M. Ranzato, and F. Huang. A
ality of data with neural networks. Science, 2006. tutorial on energy-based learning. Predicting structured data, 2006.
[69] X. Huang and S. Belongie. Arbitrary style transfer in real-time [102] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham,
with adaptive instance normalization. In ICCV, 2017. A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. Photo-
[70] X. Huang, M.-Y. Liu, S. Belongie, and J. Kautz. Multimodal realistic single image super-resolution using a generative adver-
unsupervised image-to-image translation. ECCV, 2018. sarial network. In CVPR, 2017.
[71] Z. Huang, Y. Xu, C. Lassner, H. Li, and T. Tung. ARCH: [103] A. X. Lee, R. Zhang, F. Ebert, P. Abbeel, C. Finn, and
Animatable reconstruction of clothed humans. In CVPR, 2020. S. Levine. Stochastic adversarial video prediction. arXiv preprint
[72] M. Huh, R. Zhang, J.-Y. Zhu, S. Paris, and A. Hertzmann. Trans- arXiv:1804.01523, 2018.
forming and projecting images into class-conditional generative [104] C.-H. Lee, Z. Liu, L. Wu, and P. Luo. MaskGAN: Towards diverse
networks. In ECCV, 2020. and interactive facial image manipulation. In CVPR, 2020.
[73] S. Iizuka, E. Simo-Serra, and H. Ishikawa. Globally and locally [105] H.-Y. Lee, H.-Y. Tseng, J.-B. Huang, M. Singh, and M.-H. Yang.
consistent image completion. TOG, 2017. Diverse image-to-image translation via disentangled representa-
[74] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep tions. In ECCV, 2018.
network training by reducing internal covariate shift. In ICML, [106] S. Lee, J. Ha, and G. Kim. Harmonizing maximum likelihood
2015. with GANs for multimodal conditional generation. In ICLR, 2019.
[75] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image [107] K. Li, T. Zhang, and J. Malik. Diverse image synthesis from
translation with conditional adversarial networks. In CVPR, semantic layouts via conditional imle. In ICCV, 2019.
2017. [108] Y. Li, C. Huang, and C. C. Loy. Dense intrinsic appearance flow
[76] A. Jahanian, L. Chai, and P. Isola. On the ”steerability” of for human pose transfer. In CVPR, 2019.
generative adversarial networks. In ICLR, 2020. [109] Z. Li, W. Xian, A. Davis, and N. Snavely. Crowdsampling the
[77] A. Jamaludin, J. S. Chung, and A. Zisserman. You said that?: plenoptic function. In ECCV, 2020.
Synthesising talking faces from audio. IJCV, 2019. [110] X. Liang, L. Lee, W. Dai, and E. P. Xing. Dual motion GAN for
[78] Y. Jo and J. Park. SC-FEGAN: Face editing generative adversarial future-flow embedded video prediction. In NeurIPS, 2017.
network with user’s sketch and color. In ICCV, 2019. [111] B. Lim, S. Son, H. Kim, S. Nah, and K. Mu Lee. Enhanced deep
[79] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time residual networks for single image super-resolution. In CVPR
style transfer and super-resolution. In ECCV, 2016. Workshop, 2017.
[80] A. Jolicoeur-Martineau. The relativistic discriminator: a key [112] J. H. Lim and J. C. Ye. Geometric GAN. arXiv preprint
element missing from standard GAN. In ICLR, 2019. arXiv:1705.02894, 2017.
[113] J. Lin, Y. Pang, Y. Xia, Z. Chen, and J. Luo. TuiGAN: Learning [144] K. Nagano, J. Seo, J. Xing, L. Wei, Z. Li, S. Saito, A. Agarwal,
versatile image-to-image translation with two unpaired images. J. Fursund, and H. Li. paGAN: real-time avatars using dynamic
In ECCV, 2020. textures. TOG, 2018.
[114] G. Liu, F. A. Reda, K. J. Shih, T.-C. Wang, A. Tao, and B. Catan- [145] K. Nazeri, E. Ng, T. Joseph, F. Z. Qureshi, and M. Ebrahimi.
zaro. Image inpainting for irregular holes using partial convolutions. In ECCV, 2018.
[115] L. Liu, W. Xu, M. Habermann, M. Zollhoefer, F. Bernard, H. Kim, W. Wang, and C. Theobalt. Neural human video rendering by learning dynamic textures and rendering-to-video translation. TVCG, 2020.
[116] L. Liu, W. Xu, M. Zollhoefer, H. Kim, F. Bernard, M. Habermann, W. Wang, and C. Theobalt. Neural rendering and reenactment of human actor videos. TOG, 2019.
[117] M.-Y. Liu, T. Breuel, and J. Kautz. Unsupervised image-to-image translation networks. In NeurIPS, 2017.
[118] M.-Y. Liu, X. Huang, A. Mallya, T. Karras, T. Aila, J. Lehtinen, and J. Kautz. Few-shot unsupervised image-to-image translation. In ICCV, 2019.
[119] M.-Y. Liu and O. Tuzel. Coupled generative adversarial networks. In NeurIPS, 2016.
[120] S. Liu, X. Zhang, J. Wangni, and J. Shi. Normalized diversification. In CVPR, 2019.
[121] W. Liu, Z. Piao, J. Min, W. Luo, L. Ma, and S. Gao. Liquid Warping GAN: A unified framework for human motion imitation, appearance transfer and novel view synthesis. In ICCV, 2019.
[122] X. Liu, G. Yin, J. Shao, X. Wang, et al. Learning to predict layout-to-image conditional convolutions for semantic image synthesis. In NeurIPS, 2019.
[123] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black. SMPL: A skinned multi-person linear model. TOG, 2015.
[124] D. Lorenz, L. Bereska, T. Milbich, and B. Ommer. Unsupervised part-based disentangling of object shape and appearance. In CVPR, 2019.
[125] W. Lotter, G. Kreiman, and D. Cox. Deep predictive coding networks for video prediction and unsupervised learning. In ICLR, 2017.
[126] M. Lucic, K. Kurach, M. Michalski, S. Gelly, and O. Bousquet. Are GANs created equal? A large-scale study. In NeurIPS, 2018.
[127] C. Ma, C.-Y. Yang, X. Yang, and M.-H. Yang. Learning a no-reference quality metric for single-image super-resolution. CVIU, 2017.
[128] L. Ma, X. Jia, S. Georgoulis, T. Tuytelaars, and L. Van Gool. Exemplar guided unsupervised image-to-image translation with semantic consistency. In ICLR, 2019.
[129] L. Ma, X. Jia, Q. Sun, B. Schiele, T. Tuytelaars, and L. Van Gool. Pose guided person image generation. In NeurIPS, 2017.
[130] L. Ma, Q. Sun, S. Georgoulis, L. Van Gool, B. Schiele, and M. Fritz. Disentangled person image generation. In CVPR, 2018.
[131] S. Maeda. Unpaired image super-resolution using pseudo-supervision. In CVPR, 2020.
[132] A. Mallya, T.-C. Wang, K. Sapra, and M.-Y. Liu. World-consistent video-to-video synthesis. In ECCV, 2020.
[133] Q. Mao, H.-Y. Lee, H.-Y. Tseng, S. Ma, and M.-H. Yang. Mode seeking generative adversarial networks for diverse image synthesis. In CVPR, 2019.
[134] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. P. Smolley. Least squares generative adversarial networks. In ICCV, 2017.
[135] R. Martin-Brualla, R. Pandey, S. Yang, P. Pidlypenskyi, J. Taylor, J. Valentin, S. Khamis, P. Davidson, A. Tkach, P. Lincoln, A. Kowdle, C. Rhemann, D. B. Goldman, C. Keskin, S. Seitz, S. Izadi, and S. Fanello. LookinGood: Enhancing performance capture with real-time neural re-rendering. TOG, 2018.
[136] M. Mathieu, C. Couprie, and Y. LeCun. Deep multi-scale video prediction beyond mean square error. In ICLR, 2016.
[137] Y. A. Mejjati, C. Richardt, J. Tompkin, D. Cosker, and K. I. Kim. Unsupervised attention-guided image-to-image translation. In NeurIPS, 2018.
[138] L. Mescheder, A. Geiger, and S. Nowozin. Which training methods for GANs do actually converge? In ICML, 2018.
[139] M. Meshry, D. B. Goldman, S. Khamis, H. Hoppe, R. Pandey, N. Snavely, and R. Martin-Brualla. Neural rerendering in the wild. In CVPR, 2019.
[140] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida. Spectral normalization for generative adversarial networks. In ICLR, 2018.
[141] T. Miyato and M. Koyama. cGANs with projection discriminator. In ICLR, 2018.
[142] S. Mo, M. Cho, and J. Shin. InstaGAN: Instance-aware image-to-image translation. In ICLR, 2019.
[143] M. F. Naeem, S. J. Oh, Y. Uh, Y. Choi, and J. Yoo. Reliable fidelity and diversity metrics for generative models. In ICML, 2020.
EdgeConnect: Generative image inpainting with adversarial edge learning. arXiv preprint arXiv:1901.00212, 2019.
[146] N. Neverova, R. Alp Guler, and I. Kokkinos. Dense pose transfer. In ECCV, 2018.
[147] T. Nguyen-Phuoc, C. Li, L. Theis, C. Richardt, and Y.-L. Yang. HoloGAN: Unsupervised learning of 3d representations from natural images. In CVPR, 2019.
[148] T. Nguyen-Phuoc, C. Richardt, L. Mai, Y.-L. Yang, and N. Mitra. BlockGAN: Learning 3d object-aware scene representations from unlabelled images. arXiv preprint arXiv:2002.08988, 2020.
[149] Y. Nirkin, Y. Keller, and T. Hassner. FSGAN: Subject agnostic face swapping and reenactment. In ICCV, 2019.
[150] S. Nowozin, B. Cseke, and R. Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In NeurIPS, 2016.
[151] A. Odena, C. Olah, and J. Shlens. Conditional image synthesis with auxiliary classifier GANs. In ICML, 2017.
[152] K. Olszewski, Z. Li, C. Yang, Y. Zhou, R. Yu, Z. Huang, S. Xiang, S. Saito, P. Kohli, and H. Li. Realistic dynamic facial textures from a single image using GANs. In ICCV, 2017.
[153] A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
[154] Z. Pan, W. Yu, X. Yi, A. Khan, F. Yuan, and Y. Zheng. Recent progress on generative adversarial networks (GANs): A survey. IEEE Access, 2019.
[155] T. Park, A. A. Efros, R. Zhang, and J.-Y. Zhu. Contrastive learning for unpaired image-to-image translation. In ECCV, 2020.
[156] T. Park, M.-Y. Liu, T.-C. Wang, and J.-Y. Zhu. Semantic image synthesis with spatially-adaptive normalization. In CVPR, 2019.
[157] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros. Context encoders: Feature learning by inpainting. In CVPR, 2016.
[158] F. Pittaluga, S. J. Koppal, S. B. Kang, and S. N. Sinha. Revealing scenes by inverting structure from motion reconstructions. In CVPR, 2019.
[159] A. Pumarola, A. Agudo, A. M. Martinez, A. Sanfeliu, and F. Moreno-Noguer. GANimation: Anatomically-aware facial animation from a single image. In ECCV, 2018.
[160] A. Pumarola, A. Agudo, A. M. Martinez, A. Sanfeliu, and F. Moreno-Noguer. GANimation: One-shot anatomically consistent facial animation. IJCV, 2020.
[161] A. Pumarola, A. Agudo, A. Sanfeliu, and F. Moreno-Noguer. Unsupervised person image synthesis in arbitrary poses. In CVPR, 2018.
[162] S. Qian, K.-Y. Lin, W. Wu, Y. Liu, Q. Wang, F. Shen, C. Qian, and R. He. Make a face: Towards arbitrary high fidelity face manipulation. In ICCV, 2019.
[163] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2015.
[164] A. Raj, P. Sangkloy, H. Chang, J. Hays, D. Ceylan, and J. Lu. SwapNet: Image based garment transfer. In ECCV, 2018.
[165] S. Ravuri and O. Vinyals. Seeing is not necessarily believing: Limitations of BigGANs for data augmentation. 2019.
[166] Y. Ren, X. Yu, J. Chen, T. H. Li, and G. Li. Deep image spatial transformation for person image generation. In CVPR, 2020.
[167] D. J. Rezende and S. Mohamed. Variational inference with normalizing flows. In ICML, 2015.
[168] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In ICML, 2014.
[169] K. Roth, A. Lucchi, S. Nowozin, and T. Hofmann. Stabilizing training of generative adversarial networks through regularization. In NeurIPS, 2017.
[170] K. Saito, K. Saenko, and M.-Y. Liu. COCO-FUNIT: Few-shot unsupervised image translation with a content conditioned style encoder. In ECCV, 2020.
[171] M. Saito, E. Matsumoto, and S. Saito. Temporal generative adversarial nets with singular value clipping. In ICCV, 2017.
[172] S. Saito, Z. Huang, R. Natsume, S. Morishima, A. Kanazawa, and H. Li. PIFu: Pixel-aligned implicit function for high-resolution clothed human digitization. In ICCV, 2019.
[173] M. S. Sajjadi, O. Bachem, M. Lucic, O. Bousquet, and S. Gelly. Assessing generative models via precision and recall. In NeurIPS, 2018.
[236] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang. Generative image inpainting with contextual attention. In CVPR, 2018.
[237] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang. Free-form image inpainting with gated convolution. In ICCV, 2019.
[238] E. Zakharov, A. Shysheya, E. Burkov, and V. Lempitsky. Few-shot adversarial learning of realistic neural talking head models. In ICCV, 2019.
[239] M. Zanfir, A.-I. Popa, A. Zanfir, and C. Sminchisescu. Human appearance transfer. In CVPR, 2018.
[240] Y. Zeng, J. Fu, H. Chao, and B. Guo. Learning pyramid-context encoder network for high-quality image inpainting. In CVPR, 2019.
[241] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena. Self-attention generative adversarial networks. In ICML, 2019.
[242] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang. Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising. TIP, 2017.
[243] P. Zhang, B. Zhang, D. Chen, L. Yuan, and F. Wen. Cross-domain correspondence learning for exemplar-based image translation. In CVPR, 2020.
[244] X. Zhang, S. Karaman, and S.-F. Chang. Detecting and simulating artifacts in GAN fake images. In WIFD, 2019.
[245] Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, and Y. Fu. Image super-resolution using very deep residual channel attention networks. In ECCV, 2018.
[246] Y. Zhang, Y. Tian, Y. Kong, B. Zhong, and Y. Fu. Residual dense network for image super-resolution. In CVPR, 2018.
[247] B. Zhao, X. Wu, Z.-Q. Cheng, H. Liu, Z. Jie, and J. Feng. Multi-view image generation from a single-view. In MM, 2018.
[248] C. Zheng, T.-J. Cham, and J. Cai. Pluralistic image completion. In CVPR, 2019.
[249] H. Zheng, H. Liao, L. Chen, W. Xiong, T. Chen, and J. Luo. Example-guided image synthesis across arbitrary scenes using masked spatial-channel attention and self-supervision. arXiv preprint arXiv:2004.10024, 2020.
[250] H. Zhou, Y. Liu, Z. Liu, P. Luo, and X. Wang. Talking face generation by adversarially disentangled audio-visual representation. In AAAI, 2019.
[251] Y. Zhou, Z. Wang, C. Fang, T. Bui, and T. L. Berg. Dance dance generation: Motion transfer for internet videos. arXiv preprint arXiv:1904.00129, 2019.
[252] J.-Y. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros. Generative visual manipulation on the natural image manifold. In ECCV, 2016.
[253] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017.
[254] J.-Y. Zhu, R. Zhang, D. Pathak, T. Darrell, A. A. Efros, O. Wang, and E. Shechtman. Toward multimodal image-to-image translation. In NeurIPS, 2017.
[255] J.-Y. Zhu, Z. Zhang, C. Zhang, J. Wu, A. Torralba, J. Tenenbaum, and B. Freeman. Visual Object Networks: Image generation with disentangled 3d representations. In NeurIPS, 2018.
[256] Z. Zhu, T. Huang, B. Shi, M. Yu, B. Wang, and X. Bai. Progressive pose attention transfer for person image generation. In CVPR, 2019.

Xun Huang is a Research Scientist at NVIDIA Research. He obtained his Ph.D. from Cornell University under the supervision of Professor Serge Belongie. His research interests include developing new architectures and training algorithms for generative adversarial networks, as well as applications such as image editing and synthesis. He is a recipient of the NVIDIA Graduate Fellowship, the Adobe Research Fellowship, and the Snap Research Fellowship.

Jiahui Yu is a Research Scientist at Google Brain. He received his Ph.D. from the University of Illinois at Urbana-Champaign in 2020 and his Bachelor's degree with distinction from the School of the Gifted Young in Computer Science, University of Science and Technology of China, in 2016. His research interests are in sequence modeling (language, speech, video, financial data), machine perception (vision), generative models (GANs), and high-performance computing. He is a member of IEEE, ACM, and AAAI. He is a recipient of the Baidu Scholarship, the Thomas and Margaret Huang Research Award, and the Microsoft-IEEE Young Fellowship.

Ting-Chun Wang is a senior research scientist at NVIDIA Research. He obtained his Ph.D. in EECS from UC Berkeley, advised by Professors Ravi Ramamoorthi and Alexei A. Efros. He won 1st place in the Domain Adaptation for Semantic Segmentation Competition at CVPR 2018. His semantic image synthesis paper was a best paper finalist at CVPR 2019, and the corresponding GauGAN app won the Best in Show Award and the Audience Choice Award at SIGGRAPH Real-Time Live, 2019. He served as an area chair for WACV 2020. His research interests include computer vision, machine learning, and computer graphics, particularly the intersections of the three. His recent research focus is on using generative adversarial models to synthesize realistic images and videos, with applications to rendering, visual manipulation, and beyond.