Ten Years of Generative Adversarial Nets (GANs): A Survey of the State-of-the-Art
Abstract—Since their inception in 2014, Generative Adversarial Networks (GANs) have rapidly emerged as powerful tools for generating realistic and diverse data across various domains,

the development of numerous specialized variants that excel in creating data across diverse fields. Conditional GAN [2] enables the generation of data based on specific conditions or
arXiv:2308.16316v1 [cs.LG] 30 Aug 2023
training approaches and hybridization with popular deep learning architectures such as Transformers [19], Physics-Informed Neural Networks (PINNs) [20], Large Language Models (LLMs) [21], and Diffusion models [22] have been proposed in the literature. These modified methodologies have shown promise in enhancing the synthetic data generation capabilities of GANs.

Finally, GANs have emerged as an effective tool for producing high-quality and varied data in several disciplines. Notwithstanding the difficulties connected with their use, GANs have shown outstanding results and have the potential to drive innovation in disciplines such as computer vision, machine learning, and virtual reality. This survey covers the accomplishments and limitations of GANs, as well as the promise of these approaches for future research and applications. The contributions of the article can be summarized as follows:

• Exploration of Vanilla GAN and its applications: We offer an elaborate description of the GAN model, encompassing its architectural particulars and the mathematical optimization functions it employs. We summarize the areas where GANs have emerged as a promising tool for efficiently solving real-world problems with their generative capabilities.
• Evolution of state-of-the-art GAN models across the decade: Our comprehensive analysis encompasses a wide range of cutting-edge GAN adaptations crafted to address practical challenges across various domains. We delve into their structural designs, practical uses, execution methods, and constraints. To facilitate a lucid understanding of the field’s progress, we present a detailed chronological breakdown of GAN model advancements. Furthermore, we evaluate recent field surveys, outlining their pros and cons, while also tackling these aspects within our own survey.
• Theoretical advancements of GANs: We give a technical overview of the theoretical developments of GANs by exploring the connections between adversarial training and Jensen-Shannon divergence and discussing their optimality features.
• Assessment of GAN Models: We provide a comprehensive breakdown of the essential performance measures used to assess both the quality and diversity of samples produced by GANs. These metrics vary notably depending on the specific domains of application.
• Limitations of GANs: We critically examine the constraints associated with GANs, primarily stemming from issues of learning instability, and discuss various enhancement strategies aimed at alleviating these challenges.
• Anticipating future trajectories: In addition to evaluating the pros and cons of current GAN-centric approaches, we illuminate the hybridization of emerging deep learning models such as Transformers, PINNs, LLMs, and Diffusion models with GANs. We outline potential avenues for research within this domain by summarizing several open scientific problems.

This survey is structured as follows. Section II reviews related works and recent surveys, giving background information and emphasizing the most significant developments in GANs over the decade. Section III is a concise overview of GAN describing the fundamental components and intricate details of its architecture. In Section IV, we examine the wide range of fields that GANs have influenced, such as computer vision, natural language processing, time series, and audio, among many others. Subsequently, Section V reviews the innovations and applications of popular GAN-based frameworks from various domains along with their implementation software and discusses their limitations. This section also provides a timeline of GAN models to give a clear view of the development of this field. Section VI summarizes the recent theoretical developments of GAN and its variants. Section VII reviews the metrics used for evaluating GAN-based models. Section VIII analyzes the limitations of GANs and presents remedial measures. Section IX discusses the potential and usability of GAN with the development of new deep learning technologies such as Transformers, PINNs, LLMs, and Diffusion models. Section X proposes potential directions for further research in this field. Finally, Section XI concludes the survey by indicating prospective directions for future research projects while also offering a closing assessment of the successes and limits of GANs.

II. RELATED WORKS AND RECENT SURVEYS

GANs are a promising deep learning framework for generating artificial data that closely resembles real-world data [1]. Early GAN-related research focused on creating realistic visuals. Radford et al. proposed a deep convolutional GAN (DCGAN) in 2015 [23], which utilized convolutional layers, batch normalization, and a specific loss function to generate high-quality images. DCGAN introduced important innovations in image generation. In 2017, Karras et al. [5] introduced progressive growing GAN (ProGAN), which generates higher quality and resolution images compared to vanilla GAN. ProGAN trains multiple generators and discriminators in a stepwise manner, gradually increasing the resolution of the generated images. The results demonstrated the ability of ProGAN to produce images closely resembling genuine photos for various datasets, including the CelebA dataset [24].

GANs have found applications beyond image generation, including video production and text generation. Vondrick et al. proposed a video generation GAN (VGAN) in 2018 [38], capable of producing realistic and diverse videos by learning to track and anticipate object motion. The VGAN architecture consisted of a motion estimation network and a video-generating network, jointly trained to generate high-quality videos. The results showcased VGAN’s ability to produce realistic and varied videos, enabling applications like video prediction and synthesis. Text generation is another domain where GAN has been utilized. In 2017, Yu et al. introduced SeqGAN, a GAN-based text generation model [39]. SeqGAN achieved realistic and diverse text generation capabilities by maximizing a reinforcement learning objective. The model included a generator responsible for text creation and a discriminator
TABLE I
COMPARISON OF OUR SURVEY AND OTHER RELATED GAN SURVEYS (GREEN CIRCLE SIGNIFIES “FULLY COVERED”, BLUE CIRCLE SIGNIFIES “PARTIALLY COVERED”, AND RED CIRCLE SIGNIFIES “NOT COVERED”).
Columns: Year; Survey; Theoretical Insights; Evaluation Metrics; Domain (Computer Vision, Natural Language Processing, Time Series, Urban Planning, Imbalanced Classification, Music, Medical)
2019 Kulkarni et al. [25]
2021 Jabbar et al. [26]
2021 Durgadevi et al. [27]
2021 Nandhini et al. [28]
2021 Wang et al. [29]
2021 Sampath et al. [30]
2021 Gui et al. [31]
2021 Li et al. [32]
2022 Xia et al. [33]
2022 Xun et al. [34]
2023 Ji et al. [35]
2023 Iglesias et al. [36]
2023 Brophy et al. [37]
2023+ Our survey
assessing the quality of the generated text. Through reinforcement learning, the generator was trained to maximize the predicted reward based on the discriminator’s evaluation. The findings demonstrated that SeqGAN outperformed previous text generation algorithms, producing text that was more varied and lifelike. These advancements in GAN applications for video and text generation highlight the versatility and potential of GAN frameworks in diverse domains.

Another popular area of research focuses on addressing medical questions using GANs, as highlighted in the recent paper by Tan et al., where a GAN-based scale-invariant post-processing approach is proposed for lung segmentation in CT scans [40]. A similar framework called RescueNet, developed by Nema et al., combines domain-specific segmentation methods and general-purpose adversarial learning for segmenting brain tumors [41]. Their study not only suggests a promising technique for brain tumor segmentation but also advances the development of systems capable of answering complex medical inquiries. Despite the significant breakthroughs, there are still unresolved issues in GAN architectures and applications. One prominent challenge is the instability of GAN training, which can be influenced by various factors such as architecture, loss function, and optimization technique. In 2017, Arjovsky et al. proposed a solution called Wasserstein GAN (WGAN) [15], introducing a novel loss function and optimization algorithm to address stability issues in GAN training. Their approach showed improved stability and performance on datasets like CIFAR-10 [42] and ImageNet [43].

Related survey. The existing body of research exploring various analytic tasks with GAN is accompanied by numerous surveys, which predominantly concentrate on specific perspectives within constrained domains, particularly computer vision and natural language processing. For instance, the survey by Jabbar et al. [26] explores applications of GANs in various industries, including computer vision, natural language processing, music, and medicine. To demonstrate the influence and promise of GANs in certain application domains, they also highlight noteworthy academic publications and real-world instances. The study tackles the difficulties and problems related to GAN training in addition to discussing their variations. The authors [26] investigate several training strategies, including minimax optimization, training stability, and assessment measures. They examine the typical challenges that arise during GAN training, such as mode collapse and training instability, and they present numerous solutions that researchers have suggested to address these problems. However, the survey does not specifically concentrate on GAN-based methods for imbalanced, time series, geoscience, and other data types and fails to reflect the most recent advancements in the field. The survey by Xia et al. [33] focuses on two primary categories of techniques for GAN inversion: optimization-based methods and reconstruction-based methods. To locate the latent code that optimally reconstructs the supplied output, optimization-based approaches formulate an optimization problem. Reconstruction-based approaches, on the other hand, use different methods, such as feature matching or autoencoders, to directly estimate the latent code. An in-depth discussion of these strategies’ advantages, disadvantages, and trade-offs is provided in the article. The non-convexity of the optimization problem and the lack of ground truth data for assessment are only two of the difficulties faced in GAN inversion that are highlighted in this article. The authors [33] additionally go through specific evaluation standards and measures designed for computer vision tasks. In addition, the study discusses current developments and variants in GAN inversion, such as techniques for managing conditional GAN, disentangling latent variables, and dealing with different modalities. Aspect modification, domain adaptability, and unsupervised learning are a few of the applications and potential future directions of GAN inversion that are covered. A recent study by Durgadevi et al. [27] presents a comprehensive overview of numerous GAN variants that have been proposed until 2020. Since their inception, GANs have undergone significant evolution, leading researchers to propose various enhancements and modifications aimed at addressing the prevalent challenges. These alterations encompass diverse aspects such as architectural design, training methods, or a combination of both. In this survey [27] the authors delve into the application and impact of GANs in different domains including image processing, medicine, face detection, and text transferring. The survey by Alom et al. [44] covers various aspects of the deep learning paradigm, such as fundamental ideas, algorithms, architectures, and contemporary developments including convolutional neural networks (CNNs), recurrent neural networks (RNNs), deep belief networks (DBNs), generative models, transfer learning, and reinforcement learning. The survey of Nandhini et al. [28] offers a thorough investigation of the application of deep CNNs and deep GANs in computational image analysis driven by visual perception. The designs and methodology used, the outcomes of the experiments, and possible uses for these approaches are covered in the paper. Overall, this study provides a retrospective review of the development of GANs for the deep learning-based image analysis community. The survey by Kulkarni et al. [25] presents an overview of various strategies, techniques, and developments used in GAN-based music generation. The survey of Sampath et al. [30] summarizes the current advances in the GAN landscape for computer vision tasks including classification, object detection, and segmentation in the presence of an imbalanced dataset. Another survey by Brophy et al. [37] reviews various discrete and continuous GAN models designed for time series-related applications. The study by Xun et al. [34] reviews more than 120 GAN-based models designed for region-specific medical image segmentation that were published until 2021. Another recent survey by Ji et al. [35] summarizes the task-oriented GAN architectures developed for symbolic music generation, but other application domains are overlooked. The survey by Wang et al. [29] reviews various architecture-variant and loss-variant GAN frameworks designed for addressing practical challenges relevant to computer vision tasks. Another survey by Gui et al. [31] provides a comprehensive review of task-oriented GAN applications and showcases the theoretical properties of GAN and its variants. The study by Iglesias et al. [36] summarizes the architecture of the latest GAN variants, optimization of the loss functions, and validation metrics in some promising application domains including computer vision, language generation, and data augmentation. The survey by Li et al. [32] reviews the theoretical advancements in GAN and also provides an overview of the mathematical and statistical properties of GAN variants. A detailed comparison between our survey and others is presented in Table I.

Although there are several papers reviewing GAN architecture and its domain-specific applications, none of them concurrently emphasize applications of GAN in geoscience, urban planning, data privacy, imbalanced learning, and time series problems in a comprehensive manner. Methods developed to deal with these practical problems are underrepresented in past surveys. Moreover, the stability of GAN training, assessment of the produced data, and ethical issues with GAN are some of the issues that still need to be resolved. To fully exploit the future potential of GANs, more study in these areas is required. To fill the gap, this survey offers a comprehensive and up-to-date review of GANs, encompassing mainstream tasks ranging from audio, video, and image analysis, to natural language processing, privacy, geophysics, and many more. Specifically, we first present several applied areas of GAN and discuss existing works from task and methodology-oriented perspectives. Then, we delve into multiple popular application sectors within the existing research of GAN with their limitations and propose several potential future research directions. Our survey is intended for general machine learning practitioners interested in exploring and keeping abreast of the latest advancements in GAN for multi-purpose use. It is also suitable for domain experts applying GANs to new applications or exploring novel possibilities building on recent advancements.

III. OVERVIEW OF GENERATIVE ADVERSARIAL NETWORK

Generative Adversarial Networks (GANs) signify a pivotal advancement in artificial intelligence, offering a robust framework to craft synthetic data that closely resembles real-world information [45]. Consisting of two interconnected neural networks, the Generator and Discriminator, GANs engage in a dynamic adversarial process that is redefining the landscape of deep generative modeling [1], [46]. By orchestrating this interplay, GANs transcend data generation frontiers across various domains, from crafting images to generating language, demonstrating a profound influence on reshaping the way machines comprehend and replicate intricate data distributions. This dynamic is facilitated through the Generator (G) network, entrusted with producing new data samples based on the input data distribution, while the Discriminator (D) network is devoted to discerning genuine data from their synthetic counterparts.

From a mathematical viewpoint, the G network takes a latent vector z drawn from the noise distribution $p_z$ as input and generates synthetic samples G(z). Its goal is to generate data that is indistinguishable from real data samples x originating from the probability distribution $p_{data}$. On the other hand, D takes both real data samples x from the actual dataset and fake data samples G(z) generated by G as input and classifies whether the input data is real or fake. It essentially acts as a “critic” that evaluates the quality of the generated data. The training process consists of both networks working in a two-player zero-sum game [36]. While G aims to produce more realistic outcomes, D enhances its ability to distinguish between real and fake samples. This dynamic prompts both players to evolve in tandem: if G generates superior outputs, it becomes tougher for D to discern them. Conversely, if D becomes more accurate, G faces greater difficulty in deceiving D. This process resembles a minimax game, where D strives to maximize accuracy while G seeks to minimize it [47]. The goal is to find a balance where G produces increasingly convincing data while D becomes better at classifying real data from fake ones. The mathematical expression of this minimax loss function can be represented as:

\[
\min_{G} \max_{D} \mathcal{L} = \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))], \tag{1}
\]

where the probability values D(x) and D(G(z)) represent the discriminator’s outputs for real and fake samples, respectively. The first term in Eq. (1) encourages D to correctly classify real data by maximizing log D(x), whereas the second term encourages G to produce realistic data that D classifies as real by minimizing log(1 − D(G(z))). In essence, G aims to minimize the loss while D aims to maximize it, leading to a continual back-and-forth training process. Throughout the
Fig. 1. Architecture of GANs and its primary functions. In this example, different analytical tasks of GANs are categorized into synthetic data generation,
style transfer, data augmentation, and anomaly detection.
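As a minimal sketch of how the minimax objective in Eq. (1) can be evaluated on a batch, the Python snippet below estimates the value function from discriminator outputs. The function names (`value_function`, `discriminator_loss`, `generator_loss`) and the `eps` clamp are illustrative assumptions, not part of the original formulation; the inputs are plain lists of probabilities D(x) and D(G(z)).

```python
import math

def value_function(d_real, d_fake, eps=1e-12):
    """Batch estimate of Eq. (1): mean log D(x) + mean log(1 - D(G(z))).

    d_real: discriminator outputs D(x) on real samples (probabilities).
    d_fake: discriminator outputs D(G(z)) on generated samples.
    eps is an assumed numerical guard against log(0).
    """
    real_term = sum(math.log(max(p, eps)) for p in d_real) / len(d_real)
    fake_term = sum(math.log(max(1.0 - p, eps)) for p in d_fake) / len(d_fake)
    return real_term + fake_term

def discriminator_loss(d_real, d_fake):
    # D maximizes the value function, which is the same as minimizing its negative.
    return -value_function(d_real, d_fake)

def generator_loss(d_fake, eps=1e-12):
    # G minimizes only the second term; the first term does not depend on G.
    return sum(math.log(max(1.0 - p, eps)) for p in d_fake) / len(d_fake)

if __name__ == "__main__":
    # A confident discriminator versus one that outputs 0.5 everywhere.
    print(value_function([0.9, 0.8], [0.1, 0.2]))
    print(value_function([0.5, 0.5], [0.5, 0.5]))
```

When D outputs 0.5 for every real and generated sample, the estimate equals -log 4, the value at which the discriminator can no longer separate the two distributions.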
in this context offers a swift, cost-effective, and efficient alternative to traditional manual design and modeling approaches, enabling the production of high-quality graphics.

b) Video Synthesis: In addition to generating high-quality images, GANs offer the potential to create synthetic videos, a more complex task due to coherence requirements [56]. GANs, combining generators and discriminators, excel in this challenge [57]. The discriminator learns to differentiate real from synthetic frames, while the generator produces visually authentic video frames. GANs find widespread use in replicating real-world actions, enhancing surveillance and animations [58]. One of the most popular and controversial applications of GAN is the evolution of Deepfake [59]. Deepfakes are AI-generated media that blend a person’s likeness with another’s context using GANs. While they offer creative potential, deepfakes raise ethical concerns, requiring a holistic approach to detect them [60], [61].

c) Augmenting data: GANs possess the capability to generate synthetic data, which can be harnessed to bolster actual data and enhance the performance of deep learning models. This approach is instrumental in mitigating concerns related to data scarcity and refining model accuracy [62]. GANs provide an effective avenue for fortifying machine learning and deep learning frameworks with authentic data. Addressing the challenge of limited data availability, GANs enable the creation of larger, more diverse datasets by generating artificial samples that closely emulate real data [63]. GAN-based data augmentation strategies have showcased promising outcomes across various domains, offering the potential to enhance model precision and transcend the constraints posed by insufficient data [64].

d) Style Transfer: GANs are capable of transferring the style of one image to another, resulting in the creation of an entirely new image [65]. This method can be applied to develop novel artistic features or enhance the visual attractiveness of pictures. By facilitating the development of fresh artistic trends and boosting the aesthetic appeal of pictures, GAN-based style transfer approaches have transformed the area of computer vision [3], [66]. These methods have been used in a variety of fields, such as digital art, photography, and graphic design, and they continue to be an inspiration for new developments and studies in the area.

e) Natural Language Processing: Over the past few years, GANs have been adapted to process text data, resulting in groundbreaking advancements within the realm of Natural Language Processing (NLP). One notable application involves text generation, where GANs can create coherent and contextually relevant textual content. For instance, the Text GAN framework utilizes Long Short-Term Memory (LSTM) networks [67] as the generator and a CNN as the discriminator to synthesize novel text using adversarial training [68]. Furthermore, GANs play a role in text style transfer, allowing alterations in writing styles while preserving content, and enhancing the adaptability of generated material [69]. In the domain of sentiment analysis, GANs contribute by generating text with specific emotional tones, thereby aiding model training and dataset augmentation for sentiment classification tasks. Additionally, GANs are instrumental in text-to-image synthesis, translating textual descriptions into visual representations, proving valuable in fields like accessibility and multimedia content creation [4]. GANs have also been harnessed to enhance machine translation software, refining translation precision and fluency [39], [70].

f) Music Generation: GANs are revolutionizing music creation by tapping into existing compositions’ patterns and structures [71]. This technology not only fosters original music
DCGANs exhibit elevated computational demands, sensitivity to hyperparameters, and susceptibility to challenges such as restricted diversity of generated images and mode collapse [116]. Despite these limitations, DCGANs find successful applications across domains encompassing image synthesis, style transfer, and image super-resolution. Their far-reaching impact on the field of generative modeling continues to inspire advancements and innovation.

AAEs. The Adversarial Autoencoder (AAE) framework, proposed by Makhzani et al. in 2015, is a hybridization of autoencoders with adversarial training [117]. This model has garnered significant attention due to its potential for variational inference by aligning the aggregated posterior of the hidden code vector with a chosen prior distribution. This approach ensures that meaningful outcomes emerge from various regions of the prior space. Consequently, the AAE’s decoder acquires the capability to learn a sophisticated generative model, effectively mapping the imposed prior to the data distribution. AAEs excel in producing disentangled representations, showcasing noise resistance, and generating high-quality images. The components within the AAE framework offer notable advantages over alternative generative models. Through adversarial training, AAEs excel in capturing complex data distributions and generating detailed, high-quality images. Their ability to learn disentangled representations in separate latent dimensions empowers precise image control, encompassing alterations to object properties. AAEs exhibit resilience to input variations, making them valuable for noisy data scenarios. Their encoder-decoder design supports denoising and surpasses other models in semi-supervised classification [117]. However, like other generative models, AAEs can encounter mode collapse, demand substantial computational resources, and necessitate cautious hyperparameter tuning. Striking the right balance between adversarial training and autoencoder loss poses a challenge. AAEs lack explicit control over generated samples, which hinders generating data with targeted traits when fine-grained control is needed [118]. Yet, the application scope of AAEs is notably expanded by the enhanced encoder, decoder, and discriminator networks, even surpassing traditional autoencoders.

InfoGAN. Information Maximizing Generative Adversarial Network (InfoGAN), a modification of GAN, is designed to learn disentangled representations of data by maximizing the mutual information between a subset of the generator’s input and the generated output. It was introduced by Chen et al. in 2016 [14]. The loss function formulation for the Generator in InfoGAN is as follows:

\[
\mathcal{L} = \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))] - \lambda I(c; G(z)),
\]

where I(c; G(z)) is the mutual information between the generator’s output G(z) and the learned latent code c, and λ is a hyperparameter that regulates the trade-off between the adversarial loss and the mutual information term. The information-theoretic approach employed in the InfoGAN framework enhances its ability to learn representations that facilitate data exploration, interpretation, and manipulation tasks. Unlike supervised methods, InfoGAN does not rely on explicit supervision or labeling, making it a flexible and scalable option for unsupervised learning tasks like image generation and data augmentation. However, the InfoGAN framework may struggle to learn meaningful and interpretable representations for high-dimensional complex datasets, and its benefits may not always justify the additional complexity and computational cost. Overall, InfoGAN shows promising results in learning disentangled representations, but its effectiveness depends on specific goals, data characteristics, and available resources [119]. Ongoing research and advancements hold the potential to address limitations and further improve this approach in the future.

SAD-GAN. The Synthetic Autonomous Driving using GANs (SAD-GAN) model, introduced by Ghosh et al. in 2016, is designed to generate synthetic driving scenes using the GAN approach [120]. This model’s core concept involves training a controller trainer network using images and keypress data to replicate human learning. To create synthetic driving scenes, the SAD-GAN is trained on labeled data from a racing game, consisting of images portraying a driver’s bike and its surroundings. Keypress logger software is employed to capture keypress data during bike rides. The framework’s architecture is inspired by DCGAN [23]. The generator takes a current-time input image and produces the subsequent-time synthetic image. Meanwhile, the discriminator receives the real latest-time image, generates its feature map via convolution, and compares real and synthetic scenes to train the generator through a minimax game. The SAD-GAN framework offers an autonomous driving prediction algorithm suitable for manual driving as a recommendation system. Nevertheless, like DCGAN, it requires substantial computation and is susceptible to mode collapse, limiting its real-time applications.

LSGAN. Traditional GAN models typically utilize a discriminator modeled as a classifier with the sigmoid cross entropy loss function. However, this choice of loss function can result in vanishing gradients during training, which impairs learning of the deep representations. To address this concern, Mao et al. introduced a novel approach called Least Squares GAN (LSGAN) in 2017, which employs the least squares loss function for the discriminator instead [121]. Mathematically, the Generator loss function ($\mathcal{L}_G$) and the Discriminator loss function ($\mathcal{L}_D$) of the LSGAN model are expressed as follows:

\[
\mathcal{L}_G = \frac{1}{2} \mathbb{E}_{z \sim p_z}\left[(D(G(z)) - c)^2\right],
\]
\[
\mathcal{L}_D = \frac{1}{2} \mathbb{E}_{x \sim p_{data}}\left[(D(x) - b)^2\right] + \frac{1}{2} \mathbb{E}_{z \sim p_z}\left[(D(G(z)) - a)^2\right],
\]

where a and b are the labels that D assigns to fake data and real data, respectively, and c denotes the value that G wants D to believe for fake data. The LSGAN framework represents a notable advancement over traditional GANs, offering
improved stability and convergence during training while thesis, style transfer, and data generation. The formulation of
generating higher-quality synthetic data. It has outperformed the WGAN framework utilizes the Wasserstein-1 distance or
regular GANs in generating realistic images, as measured by the Earth Mover distance to measure the distance between
Inception score, across various datasets such as CIFAR-10 real and generated data distributions. Mathematically, the
[121]. However, LSGANs often produce fuzzy images due Wasserstein distance for transforming the distribution P to
to the use of squared loss in the objective function. The distribution Q can be expressed as:
generated images often lack sharpness and fine details, as h i
the loss function penalizes large discrepancies between fake W (P, Q) = inf E(X̃,Ỹ )∼θ ∥X̃ − Ỹ ∥ .
and real images but neglects smaller variations. Researchers θ∈π(P,Q)
have addressed this issue by modifying the loss function
in subsequent studies, aiming to enhance the sharpness of synthetic images [122], [123]. While LSGANs show promise in generating high-quality images, ongoing research and development are focused on overcoming their limitations in producing crisp and detailed results.

SRGAN. Super Resolution GAN (SRGAN), introduced by Ledig et al. in 2017, is a GAN-based framework for image super-resolution [124]. It generates high-resolution images from low-resolution inputs with an upscaling factor of 4 using a generator network and a discriminator network. To achieve super-resolution, SRGAN incorporates a perceptual loss function, combining content and adversarial losses. Mathematically, the perceptual loss is expressed as:

$$l^{SR} = l_x^{SR} + 10^{-3}\, l_{Gen}^{SR},$$

where $l_x^{SR}$ represents the content loss and $l_{Gen}^{SR}$ is the adversarial loss. The content loss used in the SRGAN framework relies on a pre-trained VGG-19 model and it provides the network information regarding the quality and content of the generated image. On the other hand, the adversarial loss is responsible for ensuring the generation of realistic images from the generator network. SRGANs offer the ability to generate high-quality images with enhanced details and textures, resulting in improved overall image quality. They excel in producing visually appealing and realistic images, as confirmed by studies on perceptual quality [65]. SRGANs exhibit noise resistance, enabling them to handle low-quality or noisy input images while still delivering high-quality outputs [125]. Moreover, this model demonstrates flexibility and applicability across various domains, including video processing, medical imaging, and satellite imaging [124]. However, training SRGANs can be computationally expensive, especially for complex models or large datasets. Additionally, like other GANs, the interpretability of SRGANs can be challenging, making it difficult to understand the underlying learning process of the generator. Furthermore, while SRGANs excel in image synthesis, they may not perform as effectively with text or audio inputs, limiting their range of applications.

WGAN. The Wasserstein GAN (WGAN), introduced by Arjovsky et al. in 2017, is a loss function optimization variant of GAN that improves training stability and mitigates mode collapse [15]. It employs the Wasserstein distance to enhance realistic sample generation and ensure meaningful gradients. By introducing a critic network and weight clipping, WGAN achieves training stability. It finds applications in image syn-

In the WGAN model, the discriminator function D is designed as a critic network that estimates the Wasserstein distance between the real data distribution and the generated data distribution instead of probability values as in conventional GAN. These scores reflect the degree of similarity or dissimilarity between the input sample and the real data distribution. The training of the critic in WGAN involves optimizing its parameters to maximize the difference in critic values between real and generated samples. By clipping the discriminator weights, the discriminator loss function in WGAN is adjusted to enforce the Lipschitz continuity requirement, but the fundamental structure of the loss functions is maintained. In general, WGANs have demonstrated improved training stability compared to traditional GANs. They are less sensitive to hyperparameters and more resistant to mode collapse [122]. The use of the Wasserstein distance facilitates smoother optimization and better gradient flow, resulting in faster training and higher-quality samples. However, calculating the Wasserstein distance can be computationally expensive [126]. Although WGANs offer enhanced stability, careful tuning of hyperparameters and network designs is still necessary for satisfactory results. Furthermore, WGANs are primarily suited for generating images and may have limited applicability to other types of data. In summary, WGANs represent a promising advancement in the field of GANs, addressing their limitations and providing insights into distribution distances, but the applicability of WGANs to real-world problems requires careful consideration of its challenges.

CycleGAN. Cycle-Consistent GAN (CycleGAN), introduced by Zhu et al. in 2017, is an unsupervised image-to-image translation framework that eliminates the need for paired training data unlike traditional GANs [3]. It relies on cycle consistency, allowing images to be translated between two domains using two generators and two discriminators while preserving coherence. One generator $G_{XY}$ translates images from the source domain X to the target domain Y, and the other $G_{YX}$ performs the reverse. In other words the function $G_{YX}$ is such that $G_{YX}(G_{XY}(x)) = x$. The discriminators, on the other hand, distinguish between real and translated images generated by the generators. To train this architecture the cycle consistency loss of CycleGAN plays a crucial role by enforcing consistency between the original and round-trip translated images, the so-called forward and backward consistency. This ensures generators produce meaningful translations, preserving important content and characteristics across domains. Mathematically, the cycle consistency loss
function can be expressed as:

$$\mathcal{L}_{cycle}(G_{XY}, G_{YX}) = \mathbb{E}_{x \sim p_{data}}\left[\lVert G_{YX}(G_{XY}(x)) - x \rVert_1\right] + \mathbb{E}_{y \sim p_{data}}\left[\lVert G_{XY}(G_{YX}(y)) - y \rVert_1\right].$$

The main advantage of CycleGAN lies in its ability to produce high-quality images with remarkable visual fidelity. It excels in various image-to-image translation tasks, including style transfer, colorization, and object transformation. Moreover, its computational efficiency allows training on large datasets. However, CycleGAN often suffers from mode collapse and the increasing number of parameters reduces its efficiency [127]. Despite its limitations, CycleGAN remains a valuable tool for image translation, and ongoing research for any data translation task aims to address its shortcomings [128]. For example, it shows promising results in medical imaging domain adaptation [129].

ProGAN. In 2017, Karras et al. introduced the Progressive Growing of GAN (ProGAN), addressing the limitations of traditional GANs such as training instability and low-resolution output [5]. ProGAN utilizes a progressive growth technique, gradually increasing the size and complexity of the generator and discriminator networks during training. This incremental approach enables the model to learn coarse characteristics first and subsequently refine them, ultimately producing high-resolution images. By starting with low-resolution image generation and progressively adding layers and details, ProGAN achieves training stability and generates visually realistic images of superior quality. This technique has found successful applications in various domains, including image synthesis, super-resolution, and style transfer. During training, the resolution of the generated images is increased progressively from a low resolution (e.g., 4×4) to a high resolution (e.g., 1024×1024). At each resolution level, the generator and discriminator networks are updated using a combination of loss functions. Progressive updates at increasing resolutions ensure high-quality image synthesis with fine features and textures throughout training, unlike the conventional GAN framework. ProGAN offers better scalability, enabling the generation of images at any resolution. It exhibits improved stability during training, overcoming issues like mode collapse. The flexibility of ProGAN makes it suitable for various image synthesis applications, including satellite imaging, video processing, and medical imaging [5]. However, training ProGAN can be computationally expensive, especially for large datasets or complex models. Interpretability may pose challenges, as with other GANs, making it difficult to discern the learned representations. Additionally, ProGAN’s generalization to new or unexplored data may be limited, requiring further fine-tuning or training on fresh datasets [130].

MidiNet. MidiNet, proposed by Yang et al. in 2017, attempts to generate melodies or a series of MIDI notes in the symbolic domain [8]. Unlike other music generation frameworks, such as WaveNet [131] and Song from PI [132], the MidiNet model can generate melodies either from scratch or by combining the melodies of previous bars. The architectural configuration of the MidiNet framework is motivated by the DCGAN model [23]. The MidiNet model combines a CNN generator with a conditioner CNN in the first phase of training. While the former CNN is employed to generate synthetic melodies based on the random noise vector, the latter provides the available prior knowledge about other melodies in the form of an encoded vector as an optional input to the generator. Once the melody is generated, it is processed with a CNN-based discriminator which consists of a few convolutional layers and a fully connected network. The discriminator is optimized using a cross-entropy loss function to efficiently detect whether the input is a real or a generated one. For training the overall network in MidiNet, the minimax loss function is combined with feature matching and one-sided label smoothing to ensure learning stability and versatility in the generated content. The MidiNet framework proposes a unique CNN-GAN structure for the generation of symbolic melodies. Its ability to synthesize artificial music in the presence or absence of prior knowledge is very useful in the audio domain. However, due to the use of a CNN-based structure, its computational complexity significantly increases in comparison to the standard GAN model. Further research in this domain is required to understand the capabilities of MidiNet in multi-track music generation while simultaneously reducing its running time.

SN-GAN. Spectral Normalization GAN (SN-GAN) is a GAN variant that utilizes spectral normalization to stabilize the training of the generator and discriminator networks [133]. In conventional GANs, training can be unstable due to a powerful discriminator or poor-quality generator samples. SN-GAN addresses this by constraining the Lipschitz constant of the discriminator, preventing it from dominating the training process. Spectral normalization normalizes the discriminator’s weight matrices, ensuring a stable maximum value and preventing the amplification of minor input perturbations. SN-GAN produces high-quality samples with improved stability and convergence compared to traditional GANs. The adversarial training process used in the SN-GAN framework, similar to the conventional GAN (as in Eq. 1), encourages G to produce more realistic samples that can fool D, while D learns to accurately distinguish between real and generated samples. Several benefits of the SN-GAN model over the standard GAN include increased stability in training the generator and discriminator by constraining the Lipschitz constant of the discriminator. This mitigates issues like gradient explosion and mode collapse, resulting in high-quality examples with fine features and edges. SN-GAN is relatively simple to implement and can be integrated into existing GAN systems. However, the computation of singular values during the normalization process adds to the computational burden, potentially extending training time and requiring more memory. SN-GAN’s reliance on the spectral norm assumption of discriminator weights may limit its applicability to specific GAN architectures. While SN-GANs may exhibit slower convergence and reduced sample diversity compared to conventional GANs, they excel in stability and
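
As a rough illustration of the spectral normalization idea described for SN-GAN, the sketch below estimates the largest singular value of a weight matrix by power iteration and divides the matrix by it, so that the normalized matrix has spectral norm close to 1. It is a minimal pure-Python sketch under stated assumptions, not the reference implementation: the function names (`spectral_norm`, `spectrally_normalize`), the iteration count, and the toy matrix are all chosen here for illustration.

```python
import math
import random


def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in W]


def transpose(W):
    """Return the transpose of W as a list of rows."""
    return [list(col) for col in zip(*W)]


def l2_normalize(v, eps=1e-12):
    """Scale v to unit Euclidean length (eps avoids division by zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / (norm + eps) for x in v]


def spectral_norm(W, n_iters=50, seed=0):
    """Estimate the largest singular value of W by power iteration."""
    rng = random.Random(seed)
    Wt = transpose(W)
    # Random starting vector with one entry per row of W.
    u = l2_normalize([rng.gauss(0.0, 1.0) for _ in W])
    for _ in range(n_iters):
        v = l2_normalize(matvec(Wt, u))  # right singular vector estimate
        u = l2_normalize(matvec(W, v))   # left singular vector estimate
    # sigma is approximated by u^T W v.
    return sum(u_i * wv_i for u_i, wv_i in zip(u, matvec(W, v)))


def spectrally_normalize(W, n_iters=50):
    """Divide W by its estimated spectral norm."""
    sigma = spectral_norm(W, n_iters)
    return [[w / sigma for w in row] for row in W]


# Toy example (hypothetical weights): singular values are 3 and 1.
W = [[3.0, 0.0], [0.0, 1.0]]
sigma = spectral_norm(W)
W_sn = spectrally_normalize(W)
```

In practice only one or a few power-iteration steps are typically run per training update, reusing the vector `u` between updates, which is one way the extra cost mentioned above is kept manageable.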
orthogonal regularization and truncation tricks stabilize and control the generator’s output. Data augmentation methods, such as progressive resizing and interpolation, are employed to handle high-resolution images effectively. The modified training approach in the BigGAN architecture enables the generation of high-quality images with detailed features and textures, surpassing the capabilities of regular GANs. This enhanced model offers scalability, addresses mode collapse issues, and has broad applications in fields such as video processing, satellite imaging, and medical imaging. However, it is computationally demanding, especially when dealing with large datasets or complex models [143], [144]. Additionally, the generalization of the framework to new, unseen data is limited, requiring further fine-tuning or training on fresh datasets [145].

MI-GAN. In the field of deep learning, constrained data sizes within the medical domain pose a significant challenge for supervised learning tasks, elevating concerns about overfitting. To address this, Iqbal et al. introduced Medical Imaging GAN (MI-GAN) in 2018, an innovative GAN framework tailored for Medical Imaging [146]. MI-GAN is specialized in generating synthetic retinal vessel images along with segmented masks based on limited input data. The architecture of the MI-GAN framework’s generator network adopts an encoder-decoder structure. Given a random noise vector, the encoder functions as a feature extractor, capturing local and global data representations through its fully connected neural network design. These learned representations are then channeled into the decoder using skip connections, facilitating the generation of segmented images. The generator’s enhancements encompass the integration of global standard segmented images and style transfer mechanisms, refining the segmented image generation process. Consequently, the modified MI-GAN generator is trained using a blend of adversarial, segmentation, and style transfer loss functions. In contrast, the discriminator network within the MI-GAN model consists of multiple convolutional layers, and it is trained using adversarial loss functions to effectively distinguish between real and generated images. MI-GAN refines the conditional GAN model for retinal image synthesis and segmentation. Remarkably, despite being trained with a mere ten real examples, this model holds tremendous potential in medical image generation. Nonetheless, this approach relies on spatial alignment to achieve superior outcomes, which can often be scarce [147].

AttGAN. AttGAN, also known as Attribute GAN, is a variation of the GAN framework that focuses on generating images with customizable properties such as age, gender, and expression. It was introduced by He et al. in 2019 in their work “AttGAN: Facial Attribute Editing by Only Changing What You Want” [148]. AttGAN aims to allow users to modify specific facial attributes while preserving the overall identity and appearance of the face. By manipulating attribute vectors, users can control the desired changes in the facial attributes, resulting in realistic and visually appealing image transformations. The AttGAN framework combines two sub-networks, an encoder $G_{Enc}$ and a decoder $G_{Dec}$, in place of G of conventional GAN and it utilizes an attribute classifier C with the discriminator network. During the training phase, given an input image $x^{\tilde{a}}$ with a set of n-dimensional binary attributes $\tilde{a}$, $G_{Enc}$ encodes $x^{\tilde{a}}$ into a latent vector representation, i.e., $s = G_{Enc}(x^{\tilde{a}})$. Simultaneously, $G_{Dec}$ is employed for editing the attributes of $x^{\tilde{a}}$ to another set of n-dimensional attributes $\tilde{b}$, i.e., the edited image $x^{\hat{b}}$ is constructed as $x^{\hat{b}} = G_{Dec}(s, \tilde{b})$. To perform this unsupervised learning task C is used with the encoder-decoder pair to constrain $x^{\hat{b}}$ to possess the desired qualities. Moreover, the adversarial loss used in the training process ensures realistic image generation. On the other hand, to allow for satisfactory preservation of attribute-excluding details in the network a reconstruction loss is utilized in the framework. This loss ensures that the interaction between the latent vector s with attribute $\tilde{b}$ will always produce $x^{\hat{b}}$ and the interaction between s with attribute $\tilde{a}$ will always produce $x^{\hat{a}}$, approximating the input image $x^{\tilde{a}}$. Thus the overall loss function for the encoder-decoder-based generator of AttGAN can be expressed as:

$$\mathcal{L}_{Enc,Dec} = \lambda_{Rec}\, \mathbb{E}_{x^{\tilde{a}}} \lVert x^{\tilde{a}} - x^{\hat{a}} \rVert_1 + \lambda_{ClsG}\, \mathbb{E}_{x^{\tilde{a}},\tilde{b}} \left[ H\left(\tilde{b}, C(x^{\hat{b}})\right) \right] - \mathbb{E}_{x^{\tilde{a}},\tilde{b}} \left[ D(x^{\hat{b}}) \right]$$

and the loss for the classifier and the discriminator is formulated as:

$$\mathcal{L}_{D,Cls} = \lambda_{ClsD}\, \mathbb{E}_{x^{\tilde{a}}} \left[ H\left(\tilde{a}, C(x^{\tilde{a}})\right) \right] - \mathbb{E}_{x^{\tilde{a}}} \left[ D(x^{\tilde{a}}) \right] + \mathbb{E}_{x^{\tilde{a}},\tilde{b}} \left[ D(x^{\hat{b}}) \right],$$

where H is the cross entropy loss, and $\lambda_{Rec}$, $\lambda_{ClsG}$, $\lambda_{ClsD}$ are the hyperparameters for balancing the losses. AttGAN offers several benefits in the image generation domain including precise control over the attributes of generated images, allowing users to modify age, gender, expression, and other qualities. It provides flexibility by adapting to multiple domains and tasks, enabling customization and flexibility in image synthesis applications. The model produces realistic images that approximate the desired attributes while maintaining the visual aspects of the original image. However, ethical considerations regarding representation, identity, and privacy must be addressed when using AttGAN or similar models [17], [149]. The computational complexity of AttGAN requires significant resources and may pose challenges for deployment in production settings or on resource-limited devices. Additionally, AttGAN relies on labeled data with attribute annotations, which may not always be readily available, and the performance and generalizability of the model can be influenced by the quantity and quality of the attribute annotations [150]. The distribution and diversity of the training data can also impact the model’s performance and ability to handle uncommon or out-of-distribution features [151]. In conclusion, AttGAN provides precise attribute control, flexibility, and realistic image generation capabilities, but careful ethical considerations, resource requirements, and data dependencies should be taken into account when utilizing the model in practical applications.
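
To make the structure of the two AttGAN objectives concrete, the following sketch evaluates them for a single sample, replacing each expectation with a one-sample estimate. It is only an illustration under stated assumptions: the function names, the toy inputs, and the weights passed for the three balancing hyperparameters are hypothetical, and the classifier and discriminator outputs are supplied as plain numbers rather than computed by networks.

```python
import math


def l1_distance(x, y):
    """L1 norm of the difference between two flattened images."""
    return sum(abs(a - b) for a, b in zip(x, y))


def binary_cross_entropy(targets, probs, eps=1e-12):
    """H(targets, probs): summed binary cross-entropy over attributes."""
    return -sum(
        t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
        for t, p in zip(targets, probs)
    )


def enc_dec_loss(x_a, x_a_rec, b, c_on_edited, d_on_edited,
                 lambda_rec, lambda_cls_g):
    """One-sample estimate of the encoder-decoder (generator) loss.

    x_a         : input image (flattened)
    x_a_rec     : reconstruction G_Dec(s, a)
    b           : target binary attributes
    c_on_edited : classifier probabilities C(x_b_hat)
    d_on_edited : discriminator score D(x_b_hat)
    """
    rec = l1_distance(x_a, x_a_rec)              # reconstruction term
    cls = binary_cross_entropy(b, c_on_edited)   # attribute term
    adv = -d_on_edited                           # adversarial term
    return lambda_rec * rec + lambda_cls_g * cls + adv


def disc_cls_loss(a, c_on_real, d_on_real, d_on_edited, lambda_cls_d):
    """One-sample estimate of the discriminator/classifier loss."""
    cls = binary_cross_entropy(a, c_on_real)
    return lambda_cls_d * cls - d_on_real + d_on_edited


# Hypothetical toy values, chosen only to exercise the formulas.
g_loss = enc_dec_loss(
    x_a=[0.0, 1.0], x_a_rec=[0.0, 0.5],
    b=[1, 0], c_on_edited=[0.5, 0.5], d_on_edited=0.25,
    lambda_rec=2.0, lambda_cls_g=1.0,
)
d_loss = disc_cls_loss(
    a=[1, 0], c_on_real=[0.5, 0.5],
    d_on_real=1.0, d_on_edited=0.25, lambda_cls_d=1.0,
)
```

Note that the discriminator score on the edited image enters the two losses with opposite signs, which is what makes the training adversarial, while the reconstruction and classification terms act as the constraints described in the text.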
DM-GAN. The Dynamic Memory GAN (DM-GAN) introduced by Zhu et al. in 2019 combines the power of GANs with a memory-augmented neural network design to overcome the limitations of conventional GANs [152], [153]. By addressing issues like mode collapse and lack of fine-grained control, DM-GAN aims to improve the image synthesis process. This deep learning model focuses on generating realistic images from text descriptions, tackling two main challenges in existing methods. Firstly, it addresses the impact of initial image quality on the refinement process, ensuring satisfactory results. Secondly, DM-GAN considers the importance of each word in conveying image content by incorporating a dynamic memory module. The two-stage training of the DM-GAN framework initially transforms the textual description into an internal representation using a text encoder, and a deep generator model is then utilized to generate an initial image based on the encoded text and random noise. In the subsequent dynamic memory-based image refinement step, the generated fuzzy image is processed using a memory writing gate to select relevant text information based on the initial image content and a response gate to fuse information from memories and image features. These advancements enable DM-GAN to generate high-quality images from text descriptions accurately. The dynamic memory module of DM-GAN enhances image generation by capturing long-range relationships and maintaining global context, resulting in persuasive and visually appealing images. It provides fine-grained control over attribute-guided synthesis and increases diversity by addressing mode collapse. However, DM-GAN’s computational complexity and memory management pose challenges, and it relies on labeled data [154], [155]. The model’s interpretability is limited due to the complexity of the memory module [156], [157]. In conclusion, DM-GAN offers enhanced image generation capabilities with control, diversity, and robustness, while considerations such as computational resources, data availability, and interpretability should be considered.

SinGAN. Single-Image GAN (SinGAN) is an unconditional generative model introduced by Shaham et al. in 2019 for learning the internal statistics from a single image without the need for additional training data [158]. SinGAN allows for a wide range of image synthesis and manipulation tasks, including animation, editing, harmonization, and super-resolution, among many others. The key innovation of SinGAN is the use of a multi-scale pyramid of GANs, where each GAN is responsible for generating images at a different scale. This hierarchical structure enables SinGAN to capture both the global and local characteristics of the input image, resulting in high-quality and coherent output images. By training on a single image, SinGAN eliminates the need for a large dataset, making it a versatile and practical tool for image generation tasks.

During the training phase of SinGAN, a hierarchical structure called the multi-scale pyramid is utilized. This pyramid consists of a series of generators denoted as $\{G_0, G_1, \ldots, G_N\}$. The generators take input patches of the image at different downsampled levels, represented as $\{x_0, x_1, \ldots, x_N\}$, where each level is downsampled by a factor of $r^n$ ($r > 1$). The generators, along with their corresponding discriminators $D_n$, are trained using adversarial training. The goal is to generate realistic samples that cannot be distinguished from the downsampled image $x_n$. The SinGAN architecture consists of 5 convolutional blocks in both $G_n$ and $D_n$ networks. Each block consists of a 3×3 convolutional layer with 32 kernels, followed by batch normalization and LeakyReLU activation. The patch size for the discriminator remains fixed at 11×11 across all pyramid levels. During training, the generator and discriminator networks are iteratively updated to optimize a combination of adversarial loss and reconstruction loss. As the training progresses to higher pyramid levels, the generator incorporates the output from the previous level, enabling it to capture finer details and generate more realistic images. To enhance the model’s ability to handle diverse variations, noise injection is introduced during training, where random noise patterns are added to the input image at each scale. This helps in generating diverse outputs. The training process continues until convergence, where the generator is capable of synthesizing images that closely resemble the training image at all scales of the pyramid.

SinGAN offers numerous advantages in image manipulation tasks, requiring minimal data. It enables controlled alteration, synthesis, and modification of images, allowing users to adjust lighting, colors, textures, and objects. The model produces aesthetically realistic and visually consistent results that align with the input image. Its multi-stage training process captures global and local characteristics, resulting in high-quality outputs. However, SinGAN lacks explicit control over specific image traits and quality depends on input image quality and quantity [159]. Ethical considerations should be addressed, and the model is computationally complex with limited interpretability [160]. Nevertheless, SinGAN’s multi-stage training has gained popularity due to its versatility and the powerful image generation capabilities it offers.

PATE-GAN. In our data-centric world, safeguarding data privacy holds paramount importance, ensuring the protection of individual rights, ethical data handling, and the establishment of a reliable digital environment. It ensures a harmonious blend of leveraging the benefits of data-driven technologies while respecting individuals’ autonomy and rights. To uphold these concerns and to enable the ethical usage of real-world data in various machine-learning frameworks, Jordon et al. in 2019 proposed the Private Aggregation of Teacher Ensembles Generative Adversarial Network (PATE-GAN) framework [161]. Combining the differential privacy principles of Private Aggregation of Teacher Ensembles (PATE) with the generative prowess of GANs, PATE-GAN generates synthetic data for training algorithms while aiming for a positive societal impact. Similar to the conventional GAN model, PATE-GAN comprises a generator network that receives a latent vector as input and provides generated data as an output. However, in the discriminator aspect, PATE-GAN innovatively integrates the PATE mechanism involving multiple teacher discriminators and a single student discriminator. The teacher discriminators classify real and generated samples within their dataset
segments, while the student discriminator employs the labels aggregated from the teacher discriminators to classify generated samples. The framework’s training employs an asymmetric adversarial process, where the teachers aim to minimize their loss with respect to the generator, the generator targets the student’s loss, and the student seeks to optimize its loss against the teachers. This arrangement with the student discriminator ensures differential privacy concerning the original dataset.

Poly-GAN. Introduced by Pandey et al. in 2020, Poly-GAN is a novel conditional GAN architecture aimed at fashion synthesis [95]. This architecture is designed to automatically dress human model images in diverse poses with different clothing items. Poly-GAN employs an encoder-decoder structure with skip connections for tasks like image alignment, stitching, and inpainting. The training procedure of the Poly-GAN framework consists of four steps. This model takes input images, including a reference garment and a model image for clothing placement. Initially, pre-processing involves using a pre-trained LCR-Net++ pose estimator [162] to extract the model’s pose skeleton and a U-Net++ segmentation network [125], [163] to obtain the segmented mask of the old garment from the model image. The Poly-GAN pipeline begins by passing the reference garment and generated RGB pose skeleton through the generator to create a garment image that aligns with the skeleton’s shape. The architecture of G follows an encoder-decoder structure. The encoder incorporates three components: a Conv module for propagating pose skeleton information at each layer, a ResNet module for generating a feature vector [164], and a Conv-norm module with two convolutional layers to process the other two modules’ outputs. On the other hand, the decoder learns to produce the desired garment image based on pose condition embedding sent by the encoder using skip connections. The transformed garment image and segmented pose skeleton are sent as inputs to the second stage of the network for image stitching, yielding an image of the pose skeleton with the reference attire. In the third stage, the model performs inpainting to eliminate any irregularities in the generated model image. The discriminator, similar in structure to SRGAN [124], is employed during these stages to differentiate real from fake images. Finally, in the fourth stage, post-processing is applied, stitching the model’s head to the image to produce the final output. The Poly-GAN framework utilizes adversarial, GAN, and identity losses for training, ensuring high image quality and minimizing texture and color discrepancies from real images. Poly-GAN presents an advancement in fashion synthesis compared to other models [165], as it operates with multiple conditional inputs and achieves satisfactory fitting results without requiring 3D model information [166]. However, the generated images can exhibit texture deformation and body part loss, affecting the fitting outcomes [167]. Further research is needed to address these issues in this domain.

MIEGAN. Mobile Image Enhancement GAN (MIEGAN), introduced by Pan et al. in 2021, is a novel approach within the realm of GAN-based architectures, with the primary objective of elevating the visual caliber of images taken via mobile devices [168]. This endeavor involves several modifications to the conventional GAN architecture. In the MIEGAN model, a multi-module cascade generative network is utilized which combines an autoencoder and a feature transformer. The encoder of this modified generator comprises two streams, with the second stream being responsible for enhancing the regions with low luminance, a common issue in mobile photography leading to reduced clarity. In the feature transformative module, the local and global information of the image is further captured using a dual network structure. Furthermore, to enhance the generative network’s ability to produce images of superior visual quality, an adaptive multi-scale discriminator is employed in lieu of a standard single discriminator in the MIEGAN model. This multi-scale discriminator serves to differentiate between real and fake images on both global and local scales. To harmonize the evaluations from the global and local discriminators, an adaptable weight allocation strategy is utilized in the discriminator. Additionally, this model is trained based on a contrast loss mechanism and a mixed loss function, which further enhances the visual quality of the generated images. Despite the image quality enhancement capabilities of the MIEGAN framework, its high computational complexity poses a significant challenge for its real-time application in mobile photography.

VQGAN. Vector Quantized GAN (VQGAN) introduces a novel methodology that merges the capabilities of GAN with vector quantization techniques to generate high-quality images [169]. This approach effectively leverages the synergies between the localized interactions of CNN and the extended interactions of Transformers [19] in tasks involving the conditional synthesis of data. The distinctive architecture of VQGAN not only yields images of exceptional quality but also empowers a degree of creative influence, enabling the manipulation of various attributes within the generated content. The training process of the VQGAN architecture unfolds in two pivotal phases. Initially, a variational autoencoder and decoder are trained, as opposed to the conventional GAN generator network. This training aims to reconstruct the image by utilizing a discrete latent vector representation derived from the input image. This intermediate representation is subsequently linked to a codebook, efficiently capturing the underlying semantic information. To augment the fidelity of the reconstructed image, a discriminator is incorporated into the autoencoder structure. The training of the autoencoder model, the codebook, and the discriminator involves optimizing a fusion of adversarial loss and perceptual loss functions. In the subsequent phase, the codebook indices, constituting the intermediate image representations, are fed into Transformers. These Transformers are trained through a transformer loss mechanism, guiding them to predict the succeeding indices within the encoded sequence, resulting in an improved codebook representation. Finally, the information from the codebook is utilized by the decoder to generate images of
higher resolutions. The unique aspect of VQGAN lies in its ability to allow users to manipulate generated images in creative ways. By modifying the quantized codes, users can control specific features of the generated content, thereby unlocking a spectrum of artistic potentials. Nonetheless, the caliber of the images generated by VQGAN depends largely on its input data, necessitating expansive datasets and substantial computational resources to produce images of exceptional excellence [170]. Consequently, this restricts its immediate applicability in real-time case studies. Moreover, the codebook representation used in the vector quantization process can significantly reduce the variation in the generated images [171].

DALL-E. DALL-E is an advanced text-to-image generative framework created by OpenAI that utilizes a two-stage process to generate images from textual prompts [172], [173]. It combines the concepts of GANs and Transformers to generate highly realistic and coherent images from textual descriptions. What sets DALL-E apart is its ability to generate realistic art and images from textual descriptions that may describe completely novel concepts or objects. The working principle of the pre-trained DALL-E model comprises two phases. The first stage involves a prior model that generates a Contrastive Language-Image Pretraining (CLIP) [174] image embedding, capturing the essential gist of the image based on the provided caption. In the second stage, a decoder model known as GLIDE takes the image embedding and reconstructs the image itself, gradually removing noise and generating a realistic and visually coherent image. The CLIP model, consisting of a text encoder and an image encoder, is trained using contrastive training to learn the relationship between images and their corresponding captions. This allows the model to generate the CLIP text embedding from the input caption. Further, the prior model of DALL-E processes this text representation to generate the CLIP image embedding. In case of the decoder, DALL-E utilizes a Diffusion model [22] which generates the image by using CLIP image embedding and the CLIP text embedding as an additional input.

DALL-E’s two-stage process offers advantages in prioritizing high-level semantics and enabling intuitive transformations.

bias favoring the majority class. Previous studies have suggested oversampling approaches, involving the artificial generation of samples from the minority class, as an efficient mechanism to mitigate this issue. The Classification Enhancement GAN (CEGAN) model introduces a solution to address the class imbalance issue through the utilization of a GAN-based framework, as outlined in the work by Suh et al. [99]. This model particularly focuses on enhancing the quality of data generated from the minority class, thereby mitigating the classifier’s bias towards the distribution of the majority class. Differing from the conventional GAN model, the CEGAN framework combines three distinct networks: a generator, a discriminator, and a classifier. The training process of the CEGAN model involves a two-step sequence. In the initial phase, the generator generates synthetic data using input noise and real class labels. Simultaneously, the discriminator distinguishes between real and synthetic data, while the classifier assigns class labels to input samples. The subsequent stage involves the integration of the generated samples with the original training data, creating an augmented dataset for training the classifier. The CEGAN framework serves as an efficient methodology that incorporates techniques such as data augmentation, noise reduction, and ambiguity reduction to effectively tackle class imbalance problems. Notably, this approach overcomes the limitations associated with traditional resampling techniques, as it avoids the need to modify the original dataset.

SeismoGen. SeismoGen is a seismic waveform synthesis technique that utilizes GAN for seismic data augmentation [87]. The motivation behind SeismoGen arises from the need for abundant labeled data for accurate earthquake detection models. To overcome the scarcity of seismic waveform datasets, Wang et al. introduced the SeismoGen framework, employing GAN to generate realistic multi-labeled waveform data based on limited real seismic datasets. Incorporating this additional dataset enhances the training of machine learning-based seismic analysis models, leading to more robust predictions for out-of-sample datasets. The mathematical formulation of the SeismoGen framework follows the Wasserstein GAN [109] framework and can be expressed as:
It excels in generating creative and imaginative images LG = − E D(G(z)),
based on textual descriptions, making it valuable for z∼N(0,1)
creative tasks. However, training DALL-E requires substantial LD = E D(G(z)) − E D(x)
z∼N(0,1) x∼pdata
computational resources and presents challenges in fine-tuning h i
and attribute control. Ethical concerns and biases surrounding 2
+λ E (∥D(G(z))∥2 − 1) ,
AI-generated content also arise [175], [176]. Moreover, the z∼N(0,1)
lack of interpretability and explainability of this framework where the noise z is a standard normal variable and λ is
restricts its applications in legal, medical, or safety-sensitive a hyperparameter. The primary objective is to minimize
domains [177]. Nevertheless, DALL-E represents a significant the difference between the true seismic waveforms and the
advancement in image synthesis and has garnered attention synthetic waveforms generated by the Seismogen. This is
for its creative potential. Ongoing research, such as DALL-E achieved by iteratively optimizing LG and LD to find an
2 [178], continues to push the boundaries of this field and equilibrium between the generator and discriminator networks.
attempts to mitigate the explainability concerns [179]. SeismoGen has demonstrated its ability to generate highly
realistic seismic waveforms, making it valuable for seismic
CEGAN. Class imbalance is a prevalent challenge across waveform analysis and data augmentation. Its conditional
many real-world datasets. In the context of classification generation feature allows users to produce waveforms labeled
tasks, this skewed distribution of classes leads to a significant with specific categories, enhancing its versatility for various
applications. SeismoGen is scalable and capable of generating large databases of artificial waveforms, which is beneficial for tasks requiring extensive training data. However, SeismoGen’s effectiveness is influenced by the quality and distribution of the training data. It does not model the expected waveform move-out, which is relevant to various areas of seismic research. Additionally, due to imbalanced real seismic waveform datasets, SeismoGen struggles to generate data with rare characteristics. Moreover, the computational cost of training and using SeismoGen may be a limiting factor, especially for real-time seismic hazard assessment applications. As a relatively new technology, there might be some potential for unexpected behavior when using SeismoGen, as its full capabilities and limitations are yet to be fully explored.

MetroGAN. Zhang et al. introduced Metropolitan GAN (MetroGAN) as a geographically informed generative deep learning model for urban morphology simulation [84]. MetroGAN incorporates a progressive growing structure to learn urban features at various scales and leverages physical geography constraints through a geographical loss to ensure that urban areas are not generated on water bodies. The generation of cities with MetroGAN involves a global city dataset comprising three layers: terrain (digital elevation model), water, and nighttime lights, effectively capturing the physical geography characteristics and socioeconomic development of cities. The model detects and represents over 10,000 cities worldwide as 100km × 100km images. The mathematical formulation of the MetroGAN framework is a modified version of the LSGAN model [121], which can be expressed as follows:

$$L^* = \arg\min_G \max_D \; \frac{1}{2}\, \mathbb{E}_{x,y}\left[ (D(x, y) - 1)^2 \right] + \frac{1}{2}\, \mathbb{E}_{x,z}\left[ (D(x, G(x, z)))^2 \right] + \lambda_{L1} L_{L1}(G) - \lambda_{Geo}\, \mathbb{E}_{x,z}\left[ x_{water} \odot G(x, z) \right],$$

where images $x$ with corresponding labels $y$ and a random vector $z$ in the latent space are fed into $G$ to produce simulated images $G(x, z)$. Both real input pairs $(x, y)$ and simulated pairs $(x, G(x, z))$ are then presented to $D$ to distinguish real images from fake ones and also to assess if the input pairs match. The objective loss function comprises different terms, including the least square adversarial loss (the first two expectation terms), the L1 loss denoted as $L_{L1}$, and a geographical loss, with hyperparameters $\lambda_{L1}$ and $\lambda_{Geo}$, respectively. The geographical loss (last term) utilizes the Hadamard product $\odot$ to filter out pixels that generate urban areas on the water area $x_{water}$. MetroGAN, a robust urban morphology simulation model, has several notable advantages and limitations. On the positive side, it incorporates geographical knowledge, resulting in enhanced performance. Its progressive growing structure allows for stable learning at different scales, while multi-layer input ensures precise city layout generation. The model’s evaluation framework covers various aspects, ensuring the quality of its output. Furthermore, MetroGAN finds wide applications in urban science and data augmentation. However, these strengths come with challenges, including high computational costs due to extensive data requirements and dependence on data quality, which may hinder its performance with noisy or missing data. Additionally, the model lacks interpretability, making it difficult to understand the reasoning behind its predictions, and it may struggle to represent all intricate features of complex urban systems effectively.

M3GAN. Anomaly detection in multi-dimensional time series data has received tremendous attention in the fields of medicine, fault diagnosis, network intrusion, and climate change. In this work, the authors have proposed the M2GAN (a GAN framework based on a masking strategy for multi-dimensional anomaly detection) and M3GAN (M2GAN for mutable filter) for improving the robustness and accuracy of GAN-based anomaly detection methods. M2GAN generates fake samples by directly reconstructing real samples, which are sufficiently realistic [102]. This is done by extracting various information from the original data by the mask method, which improves the robustness of the model. M3GAN fuses the fast Fourier transform (FFT) [180] and wavelet decomposition [181] to obtain a mutable filter to process the raw data so that the model can learn various types of anomalies. The architecture of the M2GAN framework utilizes the AAE [117] in place of the generator of the conventional GAN model for generating realistic fake data. A masking strategy of the AAE enhances the variability within the original time series and overcomes the mode collapse problem. For the discriminator network, this framework employs an AnoGAN [182] architecture that distinguishes between normal data and anomalous data using DCGAN [23]. The M3GAN model combines a dynamic switch-based adaptive filter selection mechanism with the multidimensional anomaly detection capabilities of the M2GAN model. This approach allows one to select the most suitable filter for the given data that better exploits the complex characteristics of the series, leading to improved accuracy in anomaly detection. Both M2GAN and M3GAN architectures excel in spotting anomalies in multi-dimensional time series data, offering adaptability for dynamic settings. Their capacity to generate synthetic data aids tasks like diverse model training. However, their high computational complexity leads to extended processing times. Moreover, their limited interpretability also poses a significant challenge in understanding the marked anomalies. Further research is needed in this domain to address these issues and provide support for adaptive filter parameters in M3GAN.

CNTS. Cooperative Network for Time Series (CNTS), introduced by Yang et al. in 2023, is a reconstruction-based unsupervised anomaly detection technique for time series data [103]. This model aims to overcome the limitations of the previous generative methods that were sensitive to outliers and showed sub-optimal anomaly detection performance due to their emphasis on time series reconstruction. The CNTS framework consists of two FEDformer [183] networks, namely a reconstructor (R) and a detector (D). The reconstructor aims to regenerate the series that closely matches the known data distribution (without anomalies), i.e., data reconstruction. On the other hand, the detector focuses on identifying the
values that deviate from the fitted data distribution, effectively detecting anomalies. Despite having different purposes, these two networks are trained using a cooperative mode, enabling them to leverage mutual information. During the training phase, the reconstruction error of R serves as a labeling mechanism for D, while D provides crucial information to R regarding the presence of anomalies, enhancing the robustness to outliers. Thus the multi-objective function of the CNTS model can be expressed as:

$$\begin{cases} \min_{\theta_D} \sum_{i=1}^{n} L_D\big(D(x_i, \theta_D),\, L_R(x_i, R(x_i, \theta_R))\big) \\ \min_{\theta_R} \sum_{i=1}^{n} \big(1 - \hat{y}_i(x_i, \theta_D)\big)\, L_R\big(x_i, R(x_i, \theta_R)\big) \end{cases},$$

where $x_i$ is the value at the $i$-th time stamp ($i = 1, 2, \ldots, n$) of the input series, $\theta_D$ and $\theta_R$ denote the parameters of D and R, while $L_D$ and $L_R$ represent their corresponding loss functions, respectively. The categorical label $\hat{y}_i$ indicates the presence of anomalies as identified by D and helps to remove data with high anomaly scores, thereby reducing their impact on the training of R. The cooperative training approach employed by CNTS allows it to model complex temporal patterns present in real-world time series data, thus significantly enhancing its performance in various anomaly detection tasks. The flexibility and adaptability of the CNTS model make it robust to the presence of outliers in the series. However, the dual-network architecture of the CNTS model increases its computational complexity, hindering its real-time applicability. Moreover, the lack of interpretability of the model poses a significant challenge to its potential use cases. Furthermore, the success of the CNTS model is contingent on the availability of representative and diverse time series datasets and the choice of sub-networks. Further research in this domain is required to comment on the performance of the model for diverse datasets and appropriate sub-network choices.

RidgeGAN. RidgeGAN, introduced by Thottolil et al. in 2023, is a hybridization of the nonlinear kernel ridge regression (KRR) [184], [185] and the generative CityGAN model [10]. This framework aims to predict the transportation network of the future small and medium-sized cities of India by analyzing the spatial indicators of human settlement patterns. This prediction is crucial for facilitating sustainable urban planning and traffic management systems. The RidgeGAN framework operates in three steps. Firstly, it generates an urban universe for India based on spatial patterns by learning urban morphology using the CityGAN model [82]. Secondly, it utilizes KRR to study the relationship between the human settlement indices (HSI) and the transportation indices (TI) of 503 real small and medium-sized cities in India. Finally, the KRR model’s regression framework is applied to the synthetic hyper-realistic samples of future cities and their TI are predicted. The RidgeGAN framework has applications in diverse areas, such as analyzing urban land patterns, forecasting essential urban infrastructure, and assisting policymakers in achieving a more inclusive and effective planning process. Moreover, this model is especially valuable when designing the transportation network of developing nations with limited or partial real data, as the model can produce data that closely resembles actual urban morphology and helps in data augmentation. However, the framework fails to showcase its performance for the generated human settlements, which is crucial in the urban planning procedure. Further studies in this domain are indeed required to understand the suitability of the framework for large cities as well.

VI. RECENT THEORETICAL ADVANCEMENTS OF GAN

Empirical studies have shown great success of GANs and their variants in producing state-of-the-art results in diverse domains ranging from image, video, and text generation to automatic vehicles, time series, and drug discovery, among many others. The mathematical reasoning of GANs is to approximate the unknown distribution of a given data by optimizing an objective function through an adversarial game between a family of generators and a family of discriminators. Biau et al. [192] analyzed the mathematical and statistical properties of GANs by establishing connections between adversarial principles and Jensen-Shannon (JS) divergence. Their work provides the large sample properties for the parameters of the estimated distribution and a result towards the central limit theorem. A closely related approach, WGAN, has more stable training dynamics than typical GANs. Biau et al. [193] studied the convergence of empirical WGANs when the sample size approaches infinity. More recently, the rate of convergence for density estimation with GANs has been studied in [194]. In particular, they studied the non-asymptotic properties of the vanilla GAN and derived a theoretical guarantee of the density estimation with GANs under a proper choice of deep neural network classes representing generators and discriminators. It suggests that the resulting estimates converge to the true density ($p^*$) in terms of the JS divergence at the rate of $(\log n / n)^{2\beta/(2\beta+d)}$, where $n$ is the sample size, $\beta$ determines the smoothness of $p^*$, and $d$ is the data dimension. Theorem 2 of [194] shows that if G and D are chosen to be classes of neural networks with rectified quadratic unit (ReQU) activation functions, then the estimate $p_{\hat{g}}$ of the true density $p^*$ satisfies the following inequality in terms of the JS divergence with probability at least $1 - \delta$:

$$\mathrm{JS}(p_{\hat{g}}, p^*) \lesssim \left( \frac{\log n}{n} \right)^{\frac{2\beta}{2\beta + d}} + \frac{\log(1/\delta)}{n}.$$

The above mathematical result suggests that the convergence rate of vanilla GAN’s density estimate in the JS divergence is faster than $n^{-1/2}$ when $\beta > d/2$; therefore, the obtained rate is minimax optimal for the considered class of densities. Meitz et al. [195] studied statistical inference for GAN by addressing two critical issues for the generator and discriminator’s parameters, namely consistent estimation and confidence sets. Mbacke et al. [196] studied PAC-Bayesian generalization bounds for WGANs based on the Wasserstein distance and the total variation distance. The generalization properties of GANs try to answer the following question: How to certify that the learned distribution $p_{\hat{g}}$ is “close” to the true one $p^*$? This question is pivotal since the true distribution $p^*$ is unknown in real problems and generative models can only access its
empirical counterpart. Liu et al. [197] studied how well GAN can approximate the target distribution under various notions of distributional convergence. Lin et al. [198] showed that under certain conditions GAN-generated samples inherently satisfy some (weak) privacy guarantees. Another study offers a theoretical perspective on why GANs sometimes fail for certain generation tasks, in particular, sequential tasks such as natural language generation [199]. Further research on the comparative theoretical aspects, both pros and cons, of different generative approaches will enhance support for the wide applications of GANs and address their limitations.

TABLE II
SOFTWARE LINKS FOR THE GANS

VII. EVALUATION MEASURES

In contrast to conventional deep learning architectures that employ convergence-based optimization of the objective function, generative models like GANs utilize a minimax loss function, trained iteratively to establish equilibrium between the generator and discriminator networks [1]. The absence of an objective loss function for GAN training restricts the ability of loss measurements to assess training progress or model performance. To address this challenge, a mix of qualitative and quantitative GAN evaluation approaches has been developed [200]. These evaluation measures particularly vary based on the quality and diversity of the generated synthetic data, as well as the potential applications of the generated data [201]. Owing to the lack of consensus amongst the researchers on the use of a universal metric to gauge the performance of the deep generative models, different metrics have been developed in the last decade with their unique strengths and particular applicability [47]. In this section, we will briefly overview the popular evaluation measures used in different applications.

A. Inception Score

The Inception Score (IS) is a widely used metric to assess the quality and diversity of GAN-generated samples [202]. It leverages a pre-trained neural network classifier called Inception v3 [203], which was initially trained on the ImageNet [204] dataset containing a diverse range of real-world images categorized into 1,000 classes. The IS measures the quality of generated samples based on their classification probabilities predicted by Inception v3. Essentially, higher-quality samples are expected to be strongly classified into specific classes, implying low entropy. In general, the IS value ranges between 1 and the number of classes in the classifier, reflecting the diversity of the generated samples, with higher scores indicating better performance. Nevertheless, the Inception Score does come with a number of limitations. It encounters challenges when dealing with instances of mode collapse, wherein the samples generated by GANs are extremely similar, causing artificially inflated IS values that don’t accurately represent diversity. Additionally, it relies on the performance of the Inception v3 model, which might not always align with human perception of image quality. To mitigate these drawbacks of IS, several modified versions have been proposed in the literature. For example, the modified Inception Score (m-IS) attempts to address the mode collapse problem in GAN by evaluating the diversity of images with the same category [205]. Another modification of IS is the Mode Score (MS), which
evaluates the quality and diversity of the generated data by considering the prior data distribution of the labels [206].

B. Fréchet Inception Distance

The Fréchet Inception Distance (FID) is a widely used evaluation metric that measures the quality and diversity of GAN-generated images [49]. It calculates the similarities and differences between the distributions of real and generated images using the Fréchet distance, which is a form of the Wasserstein-2 distance. The FID metric calculates the mean and covariance of both the real and generated images and then computes the distance between their distributions. Mathematically the FID is expressed as:

$$\mathrm{FID} = \lVert \mu - \mu_w \rVert^2 + \mathrm{tr}\left( \Sigma + \Sigma_w - 2\, (\Sigma \Sigma_w)^{1/2} \right),$$

where $(\mu, \Sigma)$ and $(\mu_w, \Sigma_w)$ represent the mean and covariance pair for the real images and the generated images respectively. The strength of FID lies in its ability to account for various forms of contamination, such as Gaussian noise, Gaussian blur, black rectangles, and swirls, among others. FID’s incorporation of these factors contributes to a more robust evaluation of GAN-generated images. As a widely accepted and utilized metric, FID offers a common ground for comparing results across different GAN architectures, promoting a standardized approach for assessing image quality [5], [6], [207].

E. Music Evaluation Metric

Evaluating the quality of music generated by GANs presents unique challenges due to the subjective nature of musical perception. Traditional quantitative metrics like those used for image evaluation may not fully capture the richness and complexity of musical content. However, several methods have been developed to assess the quality and coherence of GAN-generated music. Certain objective evaluation metrics encompass factors such as musical characteristics, structure, style, uniqueness, and tonality, drawing from statistical representations [35]. Amid these, subjective listening is the most reliable metric for evaluating GAN-generated music. This approach encompasses dimensions like melody, harmony, rhythm, and emotional resonance, thereby furnishing insightful glimpses into the musical caliber.

F. Maximum Mean Discrepancy

Maximum Mean Discrepancy (MMD) is a statistical measure that quantifies the dissimilarity between two probability distributions. In the context of GAN evaluation, MMD is employed to assess the quality of generated samples by comparing them with real data distributions based on their mean values in a high-dimensional space [211]. A lower MMD score indicates that the difference between the two data distributions is relatively smaller, hence the synthetic data is similar to the original data.
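To make two of the measures in this section concrete, the following is a minimal sketch, using only the Python standard library, of the Inception Score and of a kernel-based MMD estimate. It is an illustration under stated assumptions rather than a reference implementation: the function names, the Gaussian (RBF) kernel, the bandwidth of 1.0, and the toy Gaussian data are all choices made here for illustration, and the class probabilities passed to `inception_score` stand in for the output of a pre-trained classifier such as Inception v3.

```python
import math
import random


def inception_score(probs):
    """IS = exp(mean_x KL(p(y|x) || p(y))) from per-sample class probabilities."""
    n, k = len(probs), len(probs[0])
    # Marginal class distribution p(y), averaged over all samples.
    marginal = [sum(p[c] for p in probs) / n for c in range(k)]
    kl_sum = 0.0
    for p in probs:
        # Terms with p(y|x) = 0 contribute nothing to the KL divergence.
        kl_sum += sum(pc * math.log(pc / marginal[c])
                      for c, pc in enumerate(p) if pc > 0.0)
    return math.exp(kl_sum / n)


def rbf_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel between two vectors (the kernel is an assumed choice)."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mean_kernel(xs, ys, bandwidth):
    """Average kernel value over all pairs drawn from xs and ys."""
    total = sum(rbf_kernel(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(real, fake, bandwidth=1.0):
    """Biased estimate of squared MMD between two sample sets."""
    return (mean_kernel(real, real, bandwidth)
            + mean_kernel(fake, fake, bandwidth)
            - 2.0 * mean_kernel(real, fake, bandwidth))


def gaussian_samples(rng, n, dim, mean):
    """Toy data: n points with independent N(mean, 1) coordinates."""
    return [[rng.gauss(mean, 1.0) for _ in range(dim)] for _ in range(n)]


rng = random.Random(0)
real = gaussian_samples(rng, 100, 2, 0.0)
close = gaussian_samples(rng, 100, 2, 0.1)  # stands in for a good generator
far = gaussian_samples(rng, 100, 2, 2.0)    # stands in for a poor generator

mmd_close = mmd_squared(real, close)
mmd_far = mmd_squared(real, far)
print("MMD^2, close generator:", mmd_close)
print("MMD^2, far generator:  ", mmd_far)

# Confident predictions spread evenly over 4 classes reach the upper bound.
one_hot = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
print("IS, confident and diverse:", inception_score(one_hot))
```

In this toy setting, the samples whose distribution is closer to the real data obtain the smaller MMD value, in line with the interpretation given above, while the Inception Score of confident predictions spread evenly over four classes equals the number of classes. In practice both quantities would be computed on classifier outputs or learned feature embeddings rather than on raw low-dimensional samples.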
neural networks based on two types of uncertainty sources, namely data uncertainty and model uncertainty [214]. They highlighted that GAN-based models can capture the structure of data uncertainty; however, they are hard to train. Another survey [215] highlighted various measures to quantify uncertainties in deep neural networks. However, it still remains difficult to validate existing methods due to the lack of uncertain ground truths.

VIII. LIMITATIONS AND SCOPE FOR IMPROVEMENT

Although GANs have brought a transformative shift in generative modeling, it’s crucial to address the substantial challenges embedded within their training process that demand careful consideration [202]. Various architectural modifications of GAN (as discussed in Section V) aim to address specific GAN-related issues and optimize their overall performance. In this section, we summarize the different obstacles in GAN and discuss their potential remedies.

A. Mode Collapse

The foremost challenge during GAN training is mode collapse (MC), a phenomenon where the generator’s output becomes constrained, yielding repetitive samples that lack the comprehensive range of the target data distribution [173]. MC arises when the generator doesn’t explore the full spectrum of potential outputs and instead generates identical outputs for distinct inputs from the latent space. This issue can manifest due to an overpowering discriminator or insufficient feedback for the generator to diversify its outputs [216]. Partial and complete mode collapse are its two variants, with the former leading to a limited diversity in generated data and the latter resulting in entirely uniform patterns across generated samples. While partial mode collapse is common, complete mode collapse is relatively rare [47].

Many efforts have been made to tackle the mode collapse problem [217], [218]. Some of these approaches include the application of Unrolled GAN [219], where the generator network is updated by unrolling the discriminator’s update steps, unlike the conventional GAN, where D is first updated while G is kept fixed and G is then updated based on the updated D. Moreover, mini-batch discrimination is often used to mitigate the MC problem [202]. In this approach, instead of modeling each data example independently, D processes multiple data examples in mini-batches. The use of modified loss functions, for example, Least-Square GAN [121], Wasserstein GAN [109], and Cycle consistency GAN [3], also reduces the mode collapse problem.

B. Vanishing Gradients

The vanishing gradients problem is another significant challenge encountered during the training phase of GANs. This issue emerges due to the complex architecture of GANs, where both G and D need to maintain a balance and learn collaboratively [220]. During the training process, as gradients are backpropagated through the layers of the network, they can diminish drastically, leading to stagnancy in learning. This circumstance can occur when the discriminator becomes very accurate, such as when $D(G(z)) = 0$ and $D(x) = 1$, or when D is inadequately trained and fails to differentiate between real and generated data. Consequently, the loss function might approach zero, hindering constructive feedback to the generator and restricting the generation of high-quality data. Several strategies have been proposed to address vanishing gradients in GANs. One approach is to use a modified loss function, such as the Least-Square GAN [121], which mitigates the vanishing gradient problem to a considerable extent. Furthermore, advanced optimization algorithms, alternative activation functions, and batch normalization strategies are often adopted to reduce the effect of vanishing gradients during GAN training.

C. Learning Instability and Nash Equilibrium

The architectural characteristics of GAN involve a complex interplay between the two deep neural networks in an adversarial manner. Their training happens in a cooperative yet competitive way using a zero-sum game strategy where both G and D aim to optimize their respective objective functions to achieve the Nash equilibrium, i.e., a state beyond which they cannot improve their performance unilaterally [48]. While this cooperative architecture aims to optimize a global loss function, the optimization problems faced by the individual networks are fundamentally opposing. Due to this complexity in the loss function, there can be situations where some minor adjustments in one network can trigger substantial modifications in the other. Moreover, when both the networks aim to independently optimize their loss functions without coordination, attaining the Nash equilibrium can be hard. Such instances of desynchronization between the networks can lead to instability in the overall learning process and substantially increase the computation time [221]. To counter this challenge, recent advancements in GAN architectures have been focusing on enhancing training stability. The feature matching technique improves the stability of the GAN framework by introducing an alternative cost function for G combining the output of the discriminator [202]. Additionally, historical averaging of the parameters [202], unrolled GAN [219], and gradient penalty [122] strategies mitigate learning instability and promote convergence of the model.

D. Stopping Problem

During GAN training, determining the appropriate time at which the networks are fully optimized is crucial for addressing the problems related to overfitting and underfitting. However, in GANs, due to the minimax objective function, determining the state of the networks based on their respective loss functions is impossible. To address this issue related to the GAN stopping criterion, researchers often employ an early stopping approach where the training halts based on a predefined threshold or the lack of improvement in evaluation metrics.

E. Internal Distributional Shift

The internal distributional shift, often called internal covariate shift, refers to the changing distribution in the network
activations of the current layer w.r.t. the previous layer. In the context of GAN, when the generator’s parameters are updated, the distribution of its output may change, leading to internal distributional shifts in subsequent layers and causing the discriminator’s learning to lag behind. This phenomenon affects the convergence of the GAN training process, and the computational complexity of the network significantly increases to counter the shifts. To address this issue, the batch normalization technique is widely adopted in various applications of GAN [222].

IX. DISCUSSION

Over the past decade, GANs have emerged as the foremost and pivotal generative architecture within the areas of computer vision, natural language processing, and related fields. To enhance the performance of GAN architecture, numerous studies have focused on the following: (i) the generation of high-quality samples, (ii) diversity in the simulated samples, and (iii) stabilizing the training algorithm. Constant efforts and improvements of the GAN model have resulted in plausible sample generation, text/image-to-image translations, data augmentation, style transfer, anomaly detection, and other applied domains.

Recent advancements in machine learning with the help of Diffusion models [22], [223], [224], also known as score-based generative models, have made a strong impression on a variety of tasks including image denoising, image inpainting, image super-resolution, and image generation. The primary goal of Diffusion models is to learn the latent structure of the dataset by modeling the way in which data points diffuse through the latent space. [225] has shown that Diffusion models outperform GANs on image synthesis due to their better stability and non-existence of mode collapse. However, the cost of synthesizing new samples and the computational time for making realistic images lead to shortcomings when they are applied to real-time applications [226], [227]. Due to the fact that GANs need fine-tuning in their hyperparameters, Transformers [19] have been used to enhance the results of GANs that can adopt self-attention layers. This helps in designing larger models and replacing the neural network models of G and D within the GAN structure. TransGAN [228] introduces a GAN architecture without convolutions by using Transformers in both G and D of the GAN, resulting in improved high-resolution image generation. [229] presented an intersection of GANs and Transformers to predict pedestrian paths. Although Transformers and their variants have several advantages, they suffer from high computational (time and resource) complexity [230]. More recently, physics-informed neural networks (PINNs) [20] were introduced as universal function approximators that can incorporate knowledge of physical laws to govern the data in the learning process. PINNs overcome the low data availability issue [231] in which GANs

gradient from multiple losses. Another architecture, namely Physics-informed GAN (PI-GAN) [233], tackles the problem of sequence generation with limited data. It integrates a transition module in the generator part that can iteratively construct the sequence with only one initial point as input. Solving differential equations using GANs to learn the loss function was presented in the Differential Equation GAN (DEQ-GAN) model [234]. Combining GANs with PINNs achieved solution accuracies that are competitive with popularly used numerical methods.

Large language models (LLMs) [21] became a very popular choice for their ability to understand and generate human language. LLMs are neural networks that are trained on massive text datasets to understand the relationship between words and phrases. This enables LLMs to generate text that is both coherent and grammatically correct. Recently, LLMs and their cousin ChatGPT revolutionized the field of natural language processing, question-answering, and creative writing. Additionally, LLMs and their variants are used to create creative content such as poems, scripts, and code. GANs and LLMs are two powerful co-existing models where the former is used to generate realistic images. Mega-TTS [235] adopts a VQGAN [169] based acoustic model and a latent-code language model called Prosody-LLM (P-LLM) [236] to solve zero-shot text-to-speech at scale with intrinsic inductive bias. The hybridization of GANs with several other architectures is a promising direction for future research.

X. FUTURE RESEARCH DIRECTION

Despite the substantial advancements achieved by GAN-based frameworks over the past decade, there remain a number of challenges spanning both theoretical and practical aspects that require further exploration in future research. In this section, we identify these gaps that necessitate deeper investigation to enhance our comprehension of GANs. The summary is presented below:

a) Fundamental questions on the theory of GANs: Recent advancements in the theory of GAN by [192], [193], [197] explored the role of the discriminator family in terms of JS divergence and some large sample properties (convergence and asymptotic normality) of the parameter describing the empirically selected generator. However, the fundamental question of how well GANs can approximate the target distribution $p^*$ remains largely unanswered. From the theoretical perspective, there is still a mystery about the role and impact of the discriminator on the quality of the approximation. The universal consistency and the rate of convergence of GANs and their variants still remain an open problem.

b) Improvement of training stability and diversity: Achieving the Nash equilibrium in GAN frameworks, which is essential for the generator to learn the actual sample distribution, requires stable training mechanisms [237], [238].
and Transformers lack robustness, rendering them ineffective However, attaining this optimal balance between the generator
scenarios. A GAN framework based on a physics-informed and discriminator remains challenging. Various approaches
(PI) discriminator for uncertainty quantification is used to have been explored, such as WGAN [109], SN-GAN [133],
inform the knowledge of physics during the learning of both One-sided Label Smoothing [203], and WGAN with gradient
G and D models. Physics-informed Discriminator GAN (PID- penalty (WGAN-GP) [122], to enhance training stability. Ad-
GAN) [232] doesn’t suffer from an imbalance of generator ditionally, addressing mode collapse, a common GAN issue
that leads to limited sample diversity, has prompted strategies like WGAN [109], U-GAN [219], generator regulating GAN (GRGAN) [239], and Adaptive GAN [240]. Future research could focus on devising techniques to stabilize GAN training and alleviate problems like mode collapse through regularization methods, alternative loss functions, and optimized hyperparameters. Incorporating methods like multi-modal GANs, designed to generate diverse outputs from a single input, might contribute to enhancing sample diversity [239].

c) Data scarcity in GAN: Addressing the issue of data scarcity in GANs stands as a crucial research trajectory. To expand GAN applications, forthcoming investigations could focus on devising training strategies for scenarios with limited data. Approaches such as few-shot GANs, transfer learning, and domain adaptation offer the potential to enhance GAN performance when data is scarce [241], [242]. This challenge becomes especially pertinent when acquiring substantial datasets poses difficulties. Additionally, refining training algorithms for maximal data utility could be pursued. Bolstering GAN effectiveness in low-data situations holds pivotal significance for broader adoption across various industries and domains.

d) Ethics and privacy: Since its inception in 2014, GAN development has yielded substantial benefits in research and real-world applications. However, the inappropriate utilization of GANs can give rise to latent societal issues such as producing deceptive content, malicious images, fabricated news, deepfakes, prejudiced portrayals, and compromising individual safety [243]. To tackle these issues, the establishment of ethical guidelines and regulations is imperative [244]. Future research avenues might center on developing robust techniques to detect and alleviate ethical concerns associated with GANs, while also advocating their ethical and responsible deployment in diverse fields. Essential to this effort is the creation of forgery detection methods capable of effectively identifying AI-generated content, including images produced through GANs. Furthermore, GANs can be susceptible to adversarial attacks, wherein minor modifications to input data result in visually convincing yet incorrect outputs [116], [245]. Future investigations could prioritize the development of robust GANs that can withstand such attacks, alongside methods for identifying and countering them. Ensuring the integrity and reliability of GANs is of utmost importance, particularly in contexts like authentication, content verification, and cybersecurity [216], [246].

e) Real-time implementation and scalability: While GANs have shown immense potential, their resource-intensive nature hinders real-time usage and scalability. Recent GAN variants like ProGAN [5] and Att-GAN [148] aim to address this complexity. Future efforts might focus on crafting efficient GAN architectures capable of generating high-quality samples in real time, vital for constrained platforms like mobile devices and edge computing. Integrating GANs with reinforcement learning, transfer learning, and supervised learning, as seen in RidgeGAN [10], opens opportunities for hybrid models with expanded capabilities. Research should delve into hybrid approaches, leveraging GANs alongside other techniques for enhanced generative potential. Additionally, exploring multimodal GANs that produce diverse outputs from multiple modalities can unlock novel avenues for creating complex data [247].

f) Human-centric GANs: GANs have the potential to enable human-machine creative cooperation [248]. Future research could emphasize human-centric GANs, integrating human feedback, preferences, and creativity into the generative process. This direction might pave the way for interactive and co-creative GANs, enabling the production of outputs aligned with human preferences and needs, while also involving users in active participation during the generation process.

g) Other innovative applications and industry usage: Initially designed for generating realistic images, GANs have exhibited impressive performance in computer vision. While their application has extended to domains like time series generation [102], [103], audio synthesis [8], and autonomous vehicles [120], their use outside computer vision remains somewhat constrained. The divergent nature of image and non-image data introduces challenges, particularly in non-image contexts like NLP, where discrete values such as words and characters predominate [199]. Future research can aim to overcome these challenges and enhance GANs’ capabilities in discrete data scenarios. Furthermore, exploring unique applications of GANs in fields like finance, education, and entertainment offers the potential to introduce new possibilities and positively impact various industries [249]. Collaborative efforts across disciplines could also harness diverse expertise, fostering synergies to enhance GANs’ adaptability across a broad spectrum of applications [250].

XI. CONCLUSION

In this article, we presented a GAN survey, GAN variants, and a detailed analysis of the wide range of GAN applications in several applied domains. In addition, we reviewed the recent theoretical developments in the GAN literature and the most common evaluation metrics. Beyond these, one of the core contributions of this survey is to discuss several obstacles of various GAN architectures and their potential solutions for future research. Overall, we discuss GANs’ potential to facilitate practical applications not only in image, audio, and text but also in relatively uncommon areas such as time series analysis, geospatial data analysis, and imbalanced learning. In the discussion section, apart from GANs’ significant success, we detail the failures of GANs due to their time complexity and unstable training. Although GANs have been phenomenal for the generation of hyper-realistic data, current progress in deep learning depicts an alternative narrative. Recently developed architectures such as Diffusion models have demonstrated significant success and outperformed GANs on image synthesis. On the other hand, Transformers, a deep learning architecture based on a multi-head attention mechanism, have been used within GAN architecture to enhance its performance. Furthermore, Large Language Models, a widely utilized deep learning structure designed for comprehending and producing natural language, have been incorporated into GAN architecture to bolster its effectiveness. The hybridization of PINN and GAN, namely PI-GAN, can solve inverse and mixed stochastic problems
based on a limited number of scattered measurements. Whereas GANs rely on large amounts of data for training, using physical laws inside GANs in the form of stochastic differential equations can mitigate the limited data problem. Several hybrid approaches combining GAN with other powerful deep learners are showing great merit and success, as discussed in the discussion section. Finally, several applications of GANs over the last decade are summarized and critiqued throughout the article.

REFERENCES

[1] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets (advances in neural information processing systems)(pp. 2672–2680),” Red Hook, NY Curran, 2014.
[2] M. Mirza and S. Osindero, “Conditional generative adversarial nets,” arXiv preprint arXiv:1411.1784, 2014.
[3] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2223–2232.
[4] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N. Metaxas, “Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 5907–5915.
[5] T. Karras, T. Aila, S. Laine, and J. Lehtinen, “Progressive growing of gans for improved quality, stability, and variation,” arXiv preprint arXiv:1710.10196, 2017.
[6] T. Karras, S. Laine, and T. Aila, “A style-based generator architecture for generative adversarial networks,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 4401–4410.
[7] X. Liu, M. Cheng, H. Zhang, and C.-J. Hsieh, “Towards robust neural networks via random self-ensemble,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 369–385.
[8] L.-C. Yang, S.-Y. Chou, and Y.-H. Yang, “Midinet: A convolutional generative adversarial network for symbolic-domain music generation,” arXiv preprint arXiv:1703.10847, 2017.
[9] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al., “Google’s neural machine translation system: Bridging the gap between human and machine translation,” arXiv preprint arXiv:1609.08144, 2016.
[10] R. Thottolil, U. Kumar, and T. Chakraborty, “Prediction of transportation index for urban patterns in small and medium-sized indian cities using hybrid ridgegan model,” arXiv preprint arXiv:2306.05951, 2023.
[11] K. E. Smith and A. O. Smith, “Conditional gan for timeseries generation,” arXiv preprint arXiv:2006.16477, 2020.
[12] H.-C. Shin, H. R. Roth, M. Gao, L. Lu, Z. Xu, I. Nogues, J. Yao, D. Mollura, and R. M. Summers, “Deep convolutional neural networks for computer-aided detection: Cnn architectures, dataset characteristics and transfer learning,” IEEE transactions on medical imaging, vol. 35, no. 5, pp. 1285–1298, 2016.
[13] J. Togelius, N. Shaker, and M. J. Nelson, “Procedural content generation in games: A textbook and an overview of current research,” Togelius N. Shaker M. Nelson Berlin: Springer, 2014.
[14] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel, “Infogan: Interpretable representation learning by information maximizing generative adversarial nets,” Advances in neural information processing systems, vol. 29, 2016.
[15] M. Arjovsky and L. Bottou, “Towards principled methods for training generative adversarial networks,” arXiv preprint arXiv:1701.04862, 2017.
[16] D. Wilby, T. Aarts, P. Tichit, A. Bodey, C. Rau, G. Taylor, and E. Baird, “Using micro-ct techniques to explore the role of sex and hair in the functional morphology of bumblebee (bombus terrestris) ocelli,” Vision Research, vol. 158, pp. 100–108, 2019.
[17] J. Buolamwini and T. Gebru, “Gender shades: Intersectional accuracy disparities in commercial gender classification,” in Conference on fairness, accountability and transparency. PMLR, 2018, pp. 77–91.
[18] J. Zhao, T. Wang, M. Yatskar, V. Ordonez, and K.-W. Chang, “Gender bias in coreference resolution: Evaluation and debiasing methods,” arXiv preprint arXiv:1804.06876, 2018.
[19] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
[20] M. Raissi, P. Perdikaris, and G. E. Karniadakis, “Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations,” Journal of Computational physics, vol. 378, pp. 686–707, 2019.
[21] A. Radford, J. Wu, D. Amodei, D. Amodei, J. Clark, M. Brundage, and I. Sutskever, “Better language models and their implications,” OpenAI blog, vol. 1, no. 2, 2019.
[22] J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli, “Deep unsupervised learning using nonequilibrium thermodynamics,” in International conference on machine learning. PMLR, 2015, pp. 2256–2265.
[23] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” arXiv preprint arXiv:1511.06434, 2015.
[24] Y. Zhang, Z. Yin, Y. Li, G. Yin, J. Yan, J. Shao, and Z. Liu, “Celeba-spoof: Large-scale face anti-spoofing dataset with rich annotations,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XII 16. Springer, 2020, pp. 70–85.
[25] R. Kulkarni, R. Gaikwad, R. Sugandhi, P. Kulkarni, and S. Kone, “Survey on deep learning in music using gan,” Int. J. Eng. Res. Technol, vol. 8, no. 9, pp. 646–648, 2019.
[26] A. Jabbar, X. Li, and B. Omar, “A survey on generative adversarial networks: Variants, applications, and training,” ACM Computing Surveys (CSUR), vol. 54, no. 8, pp. 1–49, 2021.
[27] M. Durgadevi et al., “Generative adversarial network (gan): a general review on different variants of gan and applications,” in 2021 6th International Conference on Communication and Electronics Systems (ICCES). IEEE, 2021, pp. 1–8.
[28] R. Nandhini Abirami, P. Durai Raj Vincent, K. Srinivasan, U. Tariq, and C.-Y. Chang, “Deep cnn and deep gan in computational visual perception-driven image analysis,” Complexity, vol. 2021, pp. 1–30, 2021.
[29] Z. Wang, Q. She, and T. E. Ward, “Generative adversarial networks in computer vision: A survey and taxonomy,” ACM Computing Surveys (CSUR), vol. 54, no. 2, pp. 1–38, 2021.
[30] V. Sampath, I. Maurtua, J. J. Aguilar Martin, and A. Gutierrez, “A survey on generative adversarial networks for imbalance problems in computer vision tasks,” Journal of big Data, vol. 8, pp. 1–59, 2021.
[31] J. Gui, Z. Sun, Y. Wen, D. Tao, and J. Ye, “A review on generative adversarial networks: Algorithms, theory, and applications,” IEEE transactions on knowledge and data engineering, 2021.
[32] Y. Li, Q. Wang, J. Zhang, L. Hu, and W. Ouyang, “The theoretical research of generative adversarial networks: an overview,” Neurocomputing, vol. 435, pp. 26–41, 2021.
[33] W. Xia, Y. Zhang, Y. Yang, J.-H. Xue, B. Zhou, and M.-H. Yang, “Gan inversion: A survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
[34] S. Xun, D. Li, H. Zhu, M. Chen, J. Wang, J. Li, M. Chen, B. Wu, H. Zhang, X. Chai, et al., “Generative adversarial networks in medical image segmentation: A review,” Computers in biology and medicine, vol. 140, p. 105063, 2022.
[35] S. Ji, X. Yang, and J. Luo, “A survey on deep learning for symbolic music generation: Representations, algorithms, evaluations, and challenges,” ACM Computing Surveys, 2023.
[36] G. Iglesias, E. Talavera, and A. Díaz-Álvarez, “A survey on gans for computer vision: Recent research, analysis and taxonomy,” Computer Science Review, vol. 48, p. 100553, 2023.
[37] E. Brophy, Z. Wang, Q. She, and T. Ward, “Generative adversarial networks in time series: A systematic literature review,” ACM Computing Surveys, vol. 55, no. 10, pp. 1–31, 2023.
[38] C. Vondrick, A. Shrivastava, A. Fathi, S. Guadarrama, and K. Murphy, “Tracking emerges by colorizing videos,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 391–408.
[39] L. Yu, W. Zhang, J. Wang, and Y. Yu, “Seqgan: Sequence generative adversarial nets with policy gradient,” in Proceedings of the AAAI conference on artificial intelligence, vol. 31, no. 1, 2017.
[40] J. Tan, L. Jing, Y. Huo, L. Li, O. Akin, and Y. Tian, “Lgan: Lung segmentation in ct scans using generative adversarial network,” Computerized Medical Imaging and Graphics, vol. 87, p. 101817, 2021.
[41] S. Nema, A. Dudhane, S. Murala, and S. Naidu, “Rescuenet: An unpaired gan for brain tumor segmentation,” Biomedical Signal Processing and Control, vol. 55, p. 101641, 2020.
[42] Y. Abouelnaga, O. S. Ali, H. Rady, and M. Moustafa, “Cifar-10: Knn-based ensemble of classifiers,” in 2016 International Conference on Computational Science and Computational Intelligence (CSCI). IEEE, 2016, pp. 1192–1195.
[43] B. Recht, R. Roelofs, L. Schmidt, and V. Shankar, “Do imagenet classifiers generalize to imagenet?” in International conference on machine learning. PMLR, 2019, pp. 5389–5400.
[44] M. Z. Alom, T. M. Taha, C. Yakopcic, S. Westberg, P. Sidike, M. S. Nasrin, M. Hasan, B. C. Van Essen, A. A. Awwal, and V. K. Asari, “A state-of-the-art survey on deep learning theory and architectures,” electronics, vol. 8, no. 3, p. 292, 2019.
[45] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial networks,” Communications of the ACM, vol. 63, no. 11, pp. 139–144, 2020.
[46] I. Goodfellow, Y. Bengio, and A. Courville, Deep learning. MIT press, 2016.
[47] I. Goodfellow, “Nips 2016 tutorial: Generative adversarial networks,” arXiv preprint arXiv:1701.00160, 2016.
[48] J. Nash, “Non-cooperative games,” Annals of mathematics, pp. 286–295, 1951.
[49] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “Gans trained by a two time-scale update rule converge to a local nash equilibrium,” Advances in neural information processing systems, vol. 30, 2017.
[50] F. Farnia and A. Ozdaglar, “Do gans always have nash equilibria?” in International Conference on Machine Learning. PMLR, 2020, pp. 3029–3039.
[51] M.-Y. Liu, X. Huang, J. Yu, T.-C. Wang, and A. Mallya, “Generative adversarial networks for image and video synthesis: Algorithms and applications,” Proceedings of the IEEE, vol. 109, no. 5, pp. 839–862, 2021.
[52] S. W. Kim, Y. Zhou, J. Philion, A. Torralba, and S. Fidler, “Learning to simulate dynamic environments with gamegan,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 1231–1240.
[53] Y.-J. Cao, L.-L. Jia, Y.-X. Chen, N. Lin, C. Yang, B. Zhang, Z. Liu, X.-X. Li, and H.-H. Dai, “Recent advances of generative adversarial networks in computer vision,” IEEE Access, vol. 7, pp. 14 985–15 006, 2018.
[54] L. Ma, X. Jia, Q. Sun, B. Schiele, T. Tuytelaars, and L. Van Gool, “Pose guided person image generation,” Advances in neural information processing systems, vol. 30, 2017.
[55] Y. Yu, Z. Gong, P. Zhong, and J. Shan, “Unsupervised representation learning with deep convolutional neural network for remote sensing images,” in Image and Graphics: 9th International Conference, ICIG 2017, Shanghai, China, September 13-15, 2017, Revised Selected Papers, Part II 9. Springer, 2017, pp. 97–108.
[56] Y. Wang, P. Bilinski, F. Bremond, and A. Dantcheva, “Imaginator: Conditional spatio-temporal gan for video generation,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2020, pp. 1160–1169.
[57] S. Tulyakov, M.-Y. Liu, X. Yang, and J. Kautz, “Mocogan: Decomposing motion and content for video generation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 1526–1535.
[58] W. Wang, H. Yang, Z. Tuo, H. He, J. Zhu, J. Fu, and J. Liu, “Videofactory: Swap attention in spatiotemporal diffusions for text-to-video generation,” arXiv preprint arXiv:2305.10874, 2023.
[59] M. Westerlund, “The emergence of deepfake technology: A review,” Technology innovation management review, vol. 9, no. 11, 2019.
[60] P. Korshunov and S. Marcel, “Vulnerability assessment and detection of deepfake videos,” in 2019 International Conference on Biometrics (ICB). IEEE, 2019, pp. 1–6.
[61] P. Yu, Z. Xia, J. Fei, and Y. Lu, “A survey on deepfake video detection,” Iet Biometrics, vol. 10, no. 6, pp. 607–624, 2021.
[62] Q. Xie, Z. Dai, E. Hovy, T. Luong, and Q. Le, “Unsupervised data augmentation for consistency training,” Advances in neural information processing systems, vol. 33, pp. 6256–6268, 2020.
[63] S. R. Bowman, L. Vilnis, O. Vinyals, A. M. Dai, R. Jozefowicz, and S. Bengio, “Generating sentences from a continuous space,” arXiv preprint arXiv:1511.06349, 2015.
[64] M. Frid-Adar, E. Klang, M. Amitai, J. Goldberger, and H. Greenspan, “Synthetic data augmentation using gan for improved liver lesion classification,” in 2018 IEEE 15th international symposium on biomedical imaging (ISBI 2018). IEEE, 2018, pp. 289–293.
[65] J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” in Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II 14. Springer, 2016, pp. 694–711.
[66] L. A. Gatys, A. S. Ecker, and M. Bethge, “A neural algorithm of artistic style,” arXiv preprint arXiv:1508.06576, 2015.
[67] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[68] Y. Zhang, Z. Gan, and L. Carin, “Generating text via adversarial training,” in NIPS workshop on Adversarial Training, vol. 21. academia.edu, 2016, pp. 21–32.
[69] M. Toshevska and S. Gievska, “A review of text style transfer using deep learning,” IEEE Transactions on Artificial Intelligence, 2021.
[70] J. Guo, S. Lu, H. Cai, W. Zhang, Y. Yu, and J. Wang, “Long text generation via adversarial training with leaked information,” in Proceedings of the AAAI conference on artificial intelligence, vol. 32, no. 1, 2018.
[71] Z. Mu, X. Yang, and Y. Dong, “Review of end-to-end speech synthesis technology based on deep learning,” arXiv preprint arXiv:2104.09995, 2021.
[72] H.-W. Dong, W.-Y. Hsiao, L.-C. Yang, and Y.-H. Yang, “Musegan: Multi-track sequential generative adversarial networks for symbolic music generation and accompaniment,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32, no. 1, 2018.
[73] M. Civit, J. Civit-Masot, F. Cuadrado, and M. J. Escalona, “A systematic review of artificial intelligence-based music generation: Scope, applications, and future trends,” Expert Systems with Applications, p. 118190, 2022.
[74] X. Mao, S. Wang, L. Zheng, and Q. Huang, “Semantic invariant cross-domain image generation with generative adversarial networks,” Neurocomputing, vol. 293, pp. 55–63, 2018.
[75] J. T. Guibas, T. S. Virdi, and P. S. Li, “Synthetic medical images from dual generative adversarial networks,” arXiv preprint arXiv:1709.01872, 2017.
[76] N. K. Singh and K. Raza, “Medical image generation using generative adversarial networks: A review,” Health informatics: A computational perspective in healthcare, pp. 77–96, 2021.
[77] C. Wang, G. Yang, G. Papanastasiou, S. A. Tsaftaris, D. E. Newby, C. Gray, G. Macnaught, and T. J. MacGillivray, “Dicyc: Gan-based deformation invariant cross-domain information fusion for medical image synthesis,” Information Fusion, vol. 67, pp. 147–160, 2021.
[78] A. Kadurin, A. Aliper, A. Kazennov, P. Mamoshina, Q. Vanhaelen, K. Khrabrov, and A. Zhavoronkov, “The cornucopia of meaningful leads: Applying deep adversarial autoencoders for new molecule development in oncology,” Oncotarget, vol. 8, no. 7, p. 10883, 2017.
[79] A. Kadurin, S. Nikolenko, K. Khrabrov, A. Aliper, and A. Zhavoronkov, “drugan: an advanced generative adversarial autoencoder model for de novo generation of new molecules with desired molecular properties in silico,” Molecular pharmaceutics, vol. 14, no. 9, pp. 3098–3104, 2017.
[80] Y. Zhao, Y. Wang, J. Zhang, X. Liu, Y. Li, S. Guo, X. Yang, and S. Hong, “Surgical gan: Towards real-time path planning for passive flexible tools in endovascular surgeries,” Neurocomputing, vol. 500, pp. 567–580, 2022.
[81] S. Ma, Z. Hu, K. Ye, X. Zhang, Y. Wang, and H. Peng, “Feasibility study of patient-specific dose verification in proton therapy utilizing positron emission tomography (pet) and generative adversarial network (gan),” Medical Physics, vol. 47, no. 10, pp. 5194–5208, 2020.
[82] A. Albert, E. Strano, J. Kaur, and M. C. González, “Modeling urbanization patterns with generative adversarial networks,” IGARSS 2018 - 2018 IEEE International Geoscience and Remote Sensing Symposium, pp. 2095–2098, 2018.
[83] A. Albert, J. Kaur, E. Strano, and M. Gonzalez, “Spatial sensitivity analysis for urban land use prediction with physics-constrained conditional generative adversarial networks,” arXiv preprint arXiv:1907.09543, 2019.
[84] W. Zhang, Y. Ma, D. Zhu, L. Dong, and Y. Liu, “Metrogan: Simulating urban morphology with generative adversarial network,” in Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2022, pp. 2482–2492.
[85] L. Mosser, O. Dubrule, and M. J. Blunt, “Reconstruction of three-dimensional porous media using generative adversarial neural networks,” Physical Review E, vol. 96, no. 4, p. 043309, 2017.
[86] T.-F. Zhang, P. Tilke, E. Dupont, L.-C. Zhu, L. Liang, and W. Bailey, “Generating geologically realistic 3d reservoir facies models using deep learning of sedimentary architecture with generative adversarial networks,” Petroleum Science, vol. 16, pp. 541–549, 2019.
[87] T. Wang, D. Trugman, and Y. Lin, “Seismogen: Seismic waveform synthesis using gan with application to seismic data augmentation,” Journal of Geophysical Research: Solid Earth, vol. 126, no. 4, p. e2020JB020077, 2021.
[88] B. Gecer, B. Bhattarai, J. Kittler, and T.-K. Kim, “Semi-supervised adversarial learning to generate photorealistic face images of new identities from 3d morphable model,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 217–234.
[89] X. Pan, Y. You, Z. Wang, and C. Lu, “Virtual to real reinforcement learning for autonomous driving,” arXiv preprint arXiv:1704.03952, 2017.
[90] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb, “Learning from simulated and unsupervised images through adversarial training,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2107–2116.
[91] M. Zhang, Y. Zhang, L. Zhang, C. Liu, and S. Khurshid, “Deeproad: Gan-based metamorphic testing and input validation framework for autonomous driving systems,” in Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, 2018, pp. 132–142.
[92] S. Jiang and Y. Fu, “Fashion style generator.” in IJCAI, 2017, pp. 3721–3727.
[93] X. Han, Z. Wu, Z. Wu, R. Yu, and L. S. Davis, “Viton: An image-based virtual try-on network,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 7543–7552.
[94] L. Liu, H. Zhang, Y. Ji, and Q. J. Wu, “Toward ai fashion design: An attribute-gan model for clothing match,” Neurocomputing, vol. 341, pp. 156–167, 2019.
[95] N. Pandey and A. Savakis, “Poly-gan: Multi-conditioned gan for fashion synthesis,” Neurocomputing, vol. 414, pp. 356–364, 2020.
[96] T. Chakraborty and A. K. Chakraborty, “Hellinger net: A hybrid imbalance learning model to improve software defect prediction,” IEEE Transactions on Reliability, vol. 70, no. 2, pp. 481–494, 2020.
[97] T. Dam, M. M. Ferdaus, M. Pratama, S. G. Anavatti, S. Jayavelu, and H. Abbass, “Latent preserving generative adversarial network for imbalance classification,” in 2022 IEEE International Conference on Image Processing (ICIP). IEEE, 2022, pp. 3712–3716.
[98] G. Mariani, F. Scheidegger, R. Istrate, C. Bekas, and C. Malossi, “Bagan: Data augmentation with balancing gan,” arXiv preprint arXiv:1803.09655, 2018.
[99] S. Suh, H. Lee, P. Lukowicz, and Y. O. Lee, “Cegan: Classification enhancement generative adversarial networks for unraveling data imbalance problems,” Neural Networks, vol. 133, pp. 69–86, 2021.
[100] M. Panja, T. Chakraborty, U. Kumar, and N. Liu, “Epicasting: An ensemble wavelet neural network for forecasting epidemics,” Neural Networks, 2023.
[101] Y. Li, X. Peng, J. Zhang, Z. Li, and M. Wen, “Dct-gan: dilated convolutional transformer-based gan for time series anomaly detection,” IEEE Transactions on Knowledge and Data Engineering, 2021.
[102] Y. Li, X. Peng, Z. Wu, F. Yang, X. He, and Z. Li, “M3gan: A masking strategy with a mutable filter for multidimensional anomaly detection,” Knowledge-Based Systems, vol. 271, p. 110585, 2023.
[103] J. Yang, Y. Shao, and C.-N. Li, “Cnts: Cooperative network for time series,” IEEE Access, vol. 11, pp. 31 941–31 950, 2023.
[111] A. Odena, C. Olah, and J. Shlens, “Conditional image synthesis with auxiliary classifier gans,” in International conference on machine learning. PMLR, 2017, pp. 2642–2651.
[112] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, “Intriguing properties of neural networks,” arXiv preprint arXiv:1312.6199, 2013.
[113] J. Xiao, S. Zhang, Y. Yao, Z. Wang, Y. Zhang, and Y.-F. Wang, “Generative adversarial network with hybrid attention and compromised normalization for multi-scene image conversion,” Neural Computing and Applications, vol. 34, no. 9, pp. 7209–7225, 2022.
[114] E. L. Denton, S. Chintala, R. Fergus, et al., “Deep generative image models using a laplacian pyramid of adversarial networks,” Advances in neural information processing systems, vol. 28, 2015.
[115] A. Krizhevsky, G. Hinton, et al., Learning multiple layers of features from tiny images. Toronto, ON, Canada, 2009.
[116] M. Lucic, K. Kurach, M. Michalski, S. Gelly, and O. Bousquet, “Are gans created equal? a large-scale study,” Advances in neural information processing systems, vol. 31, 2018.
[117] A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey, “Adversarial autoencoders,” arXiv preprint arXiv:1511.05644, 2015.
[118] K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan, “Unsupervised pixel-level domain adaptation with generative adversarial networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 3722–3731.
[119] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner, “beta-vae: Learning basic visual concepts with a constrained variational framework,” in International conference on learning representations, 2017.
[120] A. Ghosh, B. Bhattacharya, and S. B. R. Chowdhury, “Sad-gan: Synthetic autonomous driving using generative adversarial networks,” arXiv preprint arXiv:1611.08788, 2016.
[121] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. Paul Smolley, “Least squares generative adversarial networks,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2794–2802.
[122] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville, “Improved training of wasserstein gans,” Advances in neural information processing systems, vol. 30, 2017.
[123] Z. Huang, X. Wang, L. Huang, C. Huang, Y. Wei, and W. Liu, “Ccnet: Criss-cross attention for semantic segmentation,” in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 603–612.
[124] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al., “Photo-realistic single image super-resolution using a generative adversarial network,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 4681–4690.
[125] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE transactions on image processing, vol. 13, no. 4, pp. 600–612, 2004.
[126] L. Mescheder, S. Nowozin, and A. Geiger, “The numerics of gans,” Advances in neural information processing systems, vol. 30, 2017.
[127] Y. C. M. W. H. Sergio and G. Colmenarejo, “Learning to learn for global optimization of black box functions,” stat, vol. 1050, p. 18, 2016.
[104] A. Geiger, D. Liu, S. Alnegheimish, A. Cuesta-Infante, and K. Veera-
machaneni, “Tadgan: Time series anomaly detection using generative [128] Z. Yi, H. Zhang, P. Tan, and M. Gong, “Dualgan: Unsupervised dual
adversarial networks,” in 2020 IEEE International Conference on Big learning for image-to-image translation,” in Proceedings of the IEEE
Data (Big Data). IEEE, 2020, pp. 33–43. international conference on computer vision, 2017, pp. 2849–2857.
[105] Y. Liu, J. Peng, J. James, and Y. Wu, “Ppgan: Privacy-preserving [129] S. R. Hashemi, S. S. M. Salehi, D. Erdogmus, S. P. Prabhu, S. K.
generative adversarial network,” in 2019 IEEE 25Th international Warfield, and A. Gholipour, “Asymmetric loss functions and deep
conference on parallel and distributed systems (ICPADS). IEEE, 2019, densely-connected networks for highly-imbalanced medical image seg-
pp. 985–989. mentation: Application to multiple sclerosis lesion detection,” IEEE
[106] A. Torfi and E. A. Fox, “Corgan: correlation-capturing convolutional Access, vol. 7, pp. 1721–1735, 2018.
generative adversarial networks for generating synthetic healthcare [130] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The
records,” arXiv preprint arXiv:2001.09346, 2020. unreasonable effectiveness of deep features as a perceptual metric,” in
[107] R. Shokri, M. Stronati, C. Song, and V. Shmatikov, “Membership Proceedings of the IEEE conference on computer vision and pattern
inference attacks against machine learning models,” in 2017 IEEE recognition, 2018, pp. 586–595.
symposium on security and privacy (SP). IEEE, 2017, pp. 3–18. [131] A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals,
[108] L. A. Gatys, A. S. Ecker, and M. Bethge, “Image style transfer using A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “Wavenet:
convolutional neural networks,” in Proceedings of the IEEE conference A generative model for raw audio,” arXiv preprint arXiv:1609.03499,
on computer vision and pattern recognition, 2016, pp. 2414–2423. 2016.
[109] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein generative [132] H. Chu, R. Urtasun, and S. Fidler, “Song from pi: A musically plausible
adversarial networks,” in International conference on machine learning. network for pop music generation,” arXiv preprint arXiv:1611.03477,
PMLR, 2017, pp. 214–223. 2016.
[110] A. Brock, J. Donahue, and K. Simonyan, “Large scale gan train- [133] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida, “Spectral
ing for high fidelity natural image synthesis,” arXiv preprint normalization for generative adversarial networks,” arXiv preprint
arXiv:1809.11096, 2018. arXiv:1802.05957, 2018.
27
[134] A. Jolicoeur-Martineau, “The relativistic discriminator: a key element IEEE/CVF International Conference on Computer Vision, 2019, pp.
missing from standard gan,” arXiv preprint arXiv:1807.00734, 2018. 4570–4580.
[135] G. Gómez-de Segura and R. Garcı́a-Mayoral, “Turbulent drag reduction [159] D. Berthelot, C. Raffel, A. Roy, and I. Goodfellow, “Understanding and
by anisotropic permeable substrates–analysis and direct numerical improving interpolation in autoencoders via an adversarial regularizer,”
simulations,” Journal of Fluid Mechanics, vol. 875, pp. 124–172, 2019. arXiv preprint arXiv:1807.07543, 2018.
[136] A. Nguyen, J. Yosinski, and J. Clune, “Multifaceted feature visualiza- [160] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal,
tion: Uncovering the different types of features learned by each neuron A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., “Language mod-
in deep neural networks,” arXiv preprint arXiv:1602.03616, 2016. els are few-shot learners,” Advances in neural information processing
[137] F. Tramèr, A. Kurakin, N. Papernot, I. Goodfellow, D. Boneh, and systems, vol. 33, pp. 1877–1901, 2020.
P. McDaniel, “Ensemble adversarial training: Attacks and defenses,” [161] J. Jordon, J. Yoon, and M. Van Der Schaar, “Pate-gan: Generating
arXiv preprint arXiv:1705.07204, 2017. synthetic data with differential privacy guarantees,” in International
[138] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo, “Stargan: conference on learning representations, 2018.
Unified generative adversarial networks for multi-domain image-to- [162] G. Rogez, P. Weinzaepfel, and C. Schmid, “Lcr-net++: Multi-person 2d
image translation,” in Proceedings of the IEEE conference on computer and 3d pose detection in natural images,” IEEE transactions on pattern
vision and pattern recognition, 2018, pp. 8789–8797. analysis and machine intelligence, vol. 42, no. 5, pp. 1146–1161, 2019.
[139] Y. Li, C. Fang, J. Yang, Z. Wang, X. Lu, and M.-H. Yang, “Universal [163] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional net-
style transfer via feature transforms,” Advances in neural information works for biomedical image segmentation,” in Medical Image Comput-
processing systems, vol. 30, 2017. ing and Computer-Assisted Intervention–MICCAI 2015: 18th Interna-
[140] X. Huang and S. Belongie, “Arbitrary style transfer in real-time tional Conference, Munich, Germany, October 5-9, 2015, Proceedings,
with adaptive instance normalization,” in Proceedings of the IEEE Part III 18. Springer, 2015, pp. 234–241.
international conference on computer vision, 2017, pp. 1501–1510. [164] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
[141] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image recognition,” in Proceedings of the IEEE conference on computer vision
translation with conditional adversarial networks,” in Proceedings of and pattern recognition, 2016, pp. 770–778.
the IEEE conference on computer vision and pattern recognition, 2017, [165] S. Zhu, R. Urtasun, S. Fidler, D. Lin, and C. Change Loy, “Be your own
pp. 1125–1134. prada: Fashion synthesis with structural coherence,” in Proceedings of
[142] J. Thies, M. Zollhofer, M. Stamminger, C. Theobalt, and M. Nießner, the IEEE international conference on computer vision, 2017, pp. 1680–
“Face2face: Real-time face capture and reenactment of rgb videos,” in 1688.
Proceedings of the IEEE conference on computer vision and pattern [166] M. Mameli, M. Paolanti, R. Pietrini, G. Pazzaglia, E. Frontoni, and
recognition, 2016, pp. 2387–2395. P. Zingaretti, “Deep learning approaches for fashion knowledge extrac-
[143] T. Karras, M. Aittala, J. Hellsten, S. Laine, J. Lehtinen, and T. Aila, tion from social media: a review,” Ieee Access, vol. 10, pp. 1545–1576,
“Training generative adversarial networks with limited data,” Advances 2021.
in neural information processing systems, vol. 33, pp. 12 104–12 114, [167] Y. Wu, H. Liu, P. Lu, L. Zhang, and F. Yuan, “Design and implementa-
2020. tion of virtual fitting system based on gesture recognition and clothing
[144] G. Franceschelli and M. Musolesi, “Creativity and machine learning: transfer algorithm,” Scientific Reports, vol. 12, no. 1, p. 18356, 2022.
A survey,” arXiv preprint arXiv:2104.02726, 2021. [168] Z. Pan, F. Yuan, J. Lei, W. Li, N. Ling, and S. Kwong, “Miegan: Mobile
[145] V. Dumoulin, I. Belghazi, B. Poole, O. Mastropietro, A. Lamb, image enhancement via a multi-module cascade neural network,” IEEE
M. Arjovsky, and A. Courville, “Adversarially learned inference,” arXiv Transactions on Multimedia, vol. 24, pp. 519–533, 2021.
preprint arXiv:1606.00704, 2016. [169] P. Esser, R. Rombach, and B. Ommer, “Taming transformers for
[146] T. Iqbal and H. Ali, “Generative adversarial network for medical images high-resolution image synthesis,” in Proceedings of the IEEE/CVF
(mi-gan),” Journal of medical systems, vol. 42, pp. 1–11, 2018. conference on computer vision and pattern recognition, 2021, pp.
[147] M. Mahmud, M. S. Kaiser, T. M. McGinnity, and A. Hussain, “Deep 12 873–12 883.
learning in mining biological data,” Cognitive computation, vol. 13, pp. [170] K. Chaitanya, E. Erdil, N. Karani, and E. Konukoglu, “Local contrastive
1–33, 2021. loss with pseudo-label based self-training for semi-supervised medical
[148] Z. He, W. Zuo, M. Kan, S. Shan, and X. Chen, “Attgan: Facial attribute image segmentation,” Medical Image Analysis, vol. 87, p. 102792,
editing by only changing what you want,” IEEE transactions on image 2023.
processing, vol. 28, no. 11, pp. 5464–5478, 2019. [171] N. Kalchbrenner, A. Oord, K. Simonyan, I. Danihelka, O. Vinyals,
[149] T. Dai, Y. Feng, B. Chen, J. Lu, and S.-T. Xia, “Deep image prior A. Graves, and K. Kavukcuoglu, “Video pixel networks,” in Interna-
based defense against adversarial examples,” Pattern Recognition, vol. tional Conference on Machine Learning. PMLR, 2017, pp. 1771–
122, p. 108249, 2022. 1779.
[150] X. Hou, L. Shen, K. Sun, and G. Qiu, “Deep feature consistent vari- [172] A. Radford, J. Wu, R. Child, D. Amodei, and I. Sutskever,
ational autoencoder,” in 2017 IEEE winter conference on applications “Dall-e: Distributed, automated, and learning to generate adversarial
of computer vision (WACV). IEEE, 2017, pp. 1133–1141. networks,” OpenAI Blog, 2021. [Online]. Available: https://openai.
[151] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee, com/blog/dall-e/
“Generative adversarial text to image synthesis,” in International con- [173] A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, M. Chen,
ference on machine learning. PMLR, 2016, pp. 1060–1069. and I. Sutskever, “Zero-shot text-to-image generation,” in International
[152] M. Zhu, P. Pan, W. Chen, and Y. Yang, “Dm-gan: Dynamic memory Conference on Machine Learning. PMLR, 2021, pp. 8821–8831.
generative adversarial networks for text-to-image synthesis,” in Pro- [174] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal,
ceedings of the IEEE/CVF conference on computer vision and pattern G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., “Learning transferable
recognition, 2019, pp. 5802–5810. visual models from natural language supervision,” in International
[153] K. Li, T. Zhang, and J. Malik, “Diverse image synthesis from semantic conference on machine learning. PMLR, 2021, pp. 8748–8763.
layouts via conditional imle,” in Proceedings of the IEEE/CVF Inter- [175] G. Singh, F. Deng, and S. Ahn, “Illiterate dall-e learns to compose,”
national Conference on Computer Vision, 2019, pp. 4220–4229. arXiv preprint arXiv:2110.11405, 2021.
[154] V. Nair and G. E. Hinton, “Rectified linear units improve restricted [176] G. Marcus, E. Davis, and S. Aaronson, “A very preliminary analysis
boltzmann machines,” in Proceedings of the 27th international confer- of dall-e 2,” arXiv preprint arXiv:2204.13807, 2022.
ence on machine learning (ICML-10), 2010, pp. 807–814. [177] C. Rudin, “Stop explaining black box machine learning models for high
[155] Y. Bengio, P. Simard, and P. Frasconi, “Learning long-term dependen- stakes decisions and use interpretable models instead,” Nature machine
cies with gradient descent is difficult,” IEEE transactions on neural intelligence, vol. 1, no. 5, pp. 206–215, 2019.
networks, vol. 5, no. 2, pp. 157–166, 1994. [178] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, “Hierarchi-
[156] A. Graves, G. Wayne, and I. Danihelka, “Neural turing machines,” cal text-conditional image generation with clip latents,” arXiv preprint
arXiv preprint arXiv:1410.5401, 2014. arXiv:2204.06125, 2022.
[157] M. D. Zeiler and R. Fergus, “Visualizing and understanding convo- [179] F. Doshi-Velez and B. Kim, “Towards a rigorous science of inter-
lutional networks,” in Computer Vision–ECCV 2014: 13th European pretable machine learning,” arXiv preprint arXiv:1702.08608, 2017.
Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, [180] E. O. Brigham, The fast Fourier transform and its applications.
Part I 13. Springer, 2014, pp. 818–833. Prentice-Hall, Inc., 1988.
[158] T. R. Shaham, T. Dekel, and T. Michaeli, “Singan: Learning a gen- [181] D. B. Percival and A. T. Walden, Wavelet methods for time series
erative model from a single natural image,” in Proceedings of the analysis. Cambridge university press, 2000, vol. 4.
28
[182] T. Schlegl, P. Seeböck, S. M. Waldstein, U. Schmidt-Erfurth, and [206] S. Nowozin, B. Cseke, and R. Tomioka, “f-gan: Training generative
G. Langs, “Unsupervised anomaly detection with generative adversarial neural samplers using variational divergence minimization,” Advances
networks to guide marker discovery,” in International conference on in neural information processing systems, vol. 29, 2016.
information processing in medical imaging. Springer, 2017, pp. 146– [207] G. Daras, A. Odena, H. Zhang, and A. G. Dimakis, “Your local gan:
157. Designing two dimensional local attention mechanisms for generative
[183] T. Zhou, Z. Ma, Q. Wen, X. Wang, L. Sun, and R. Jin, “Fedformer: models,” in Proceedings of the IEEE/CVF conference on computer
Frequency enhanced decomposed transformer for long-term series fore- vision and pattern recognition, 2020, pp. 14 531–14 539.
casting,” in International Conference on Machine Learning. PMLR, [208] Z. Wang, E. P. Simoncelli, and A. C. Bovik, “Multiscale structural sim-
2022, pp. 27 268–27 286. ilarity for image quality assessment,” in The Thrity-Seventh Asilomar
[184] V. Vovk, “Kernel ridge regression,” in Empirical Inference: Festschrift Conference on Signals, Systems & Computers, 2003, vol. 2. Ieee,
in Honor of Vladimir N. Vapnik. Springer, 2013, pp. 105–116. 2003, pp. 1398–1402.
[185] K. P. Murphy, Machine learning: a probabilistic perspective. MIT [209] E. L. Lehmann, J. P. Romano, and G. Casella, Testing statistical
press, 2012. hypotheses. Springer, 1986, vol. 3.
[186] H. Dong, A. Supratak, L. Mai, F. Liu, A. Oehmichen, S. Yu, [210] P. Cunningham and S. J. Delany, “k-nearest neighbour classifiers-a
and Y. Guo, “TensorLayer: A Versatile Library for Efficient Deep tutorial,” ACM computing surveys (CSUR), vol. 54, no. 6, pp. 1–25,
Learning Development,” ACM Multimedia, 2017. [Online]. Available: 2021.
http://tensorlayer.org [211] W. Bounliphone, E. Belilovsky, M. B. Blaschko, I. Antonoglou, and
[187] C. Lai, J. Han, and H. Dong, “Tensorlayer 3.0: A deep learning A. Gretton, “A test of relative similarity for model selection in
library compatible with multiple backends,” in 2021 IEEE International generative models,” arXiv preprint arXiv:1511.04581, 2015.
Conference on Multimedia & Expo Workshops (ICMEW). IEEE, 2021, [212] V. Volodina and P. Challenor, “The importance of uncertainty quan-
pp. 1–3. tification in model reproducibility,” Philosophical Transactions of the
[188] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image Royal Society A, vol. 379, no. 2197, p. 20200071, 2021.
translation using cycle-consistent adversarial networkss,” in Computer [213] P. Oberdiek, G. Fink, and M. Rottmann, “Uqgan: A unified model
Vision (ICCV), 2017 IEEE International Conference on, 2017. for uncertainty quantification of deep classifiers trained via conditional
[189] C. Esteban, S. L. Hyland, and G. Rätsch, “Real-valued (medical) gans,” Advances in Neural Information Processing Systems, vol. 35,
time series generation with recurrent conditional gans,” arXiv preprint pp. 21 371–21 385, 2022.
arXiv:1706.02633, 2017. [214] W. He and Z. Jiang, “A survey on uncertainty quantification methods
[190] G. Zhang, M. Kan, S. Shan, and X. Chen, “Generative adversarial for deep neural networks: An uncertainty source perspective,” arXiv
network with spatial attention for face attribute editing,” in Proceedings preprint arXiv:2302.13425, 2023.
of the European conference on computer vision (ECCV), 2018, pp. [215] J. Gawlikowski, C. R. N. Tassi, M. Ali, J. Lee, M. Humt, J. Feng,
417–432. A. Kruspe, R. Triebel, P. Jung, R. Roscher, et al., “A survey of
[191] A. Razavi, A. Van den Oord, and O. Vinyals, “Generating diverse uncertainty in deep neural networks,” Artificial Intelligence Review,
high-fidelity images with vq-vae-2,” Advances in neural information pp. 1–77, 2023.
processing systems, vol. 32, 2019. [216] P. Samangouei, M. Kabkab, and R. Chellappa, “Defense-gan: Pro-
[192] G. Biau, B. Cadre, M. Sangnier, and U. Tanielian, “Some theoretical tecting classifiers against adversarial attacks using generative models,”
properties of gans,” Ann. Statist., vol. 48 (3), pp. 1539 – 1566, 2020. arXiv preprint arXiv:1805.06605, 2018.
[193] G. Biau, M. Sangnier, and U. Tanielian, “Some theoretical insights into [217] H. De Meulemeester, J. Schreurs, M. Fanuel, B. De Moor, and J. A.
wasserstein gans,” The Journal of Machine Learning Research, vol. 22, Suykens, “The bures metric for generative adversarial networks,” in
no. 1, pp. 5287–5331, 2021. Joint European Conference on Machine Learning and Knowledge
[194] D. Belomestny, E. Moulines, A. Naumov, N. Puchkin, and S. Sam- Discovery in Databases. Springer, 2021, pp. 52–66.
sonov, “Rates of convergence for density estimation with gans,” arXiv [218] W. Li, L. Fan, Z. Wang, C. Ma, and X. Cui, “Tackling mode collapse
preprint arXiv:2102.00199, 2021. in multi-generator gans with orthogonal vectors,” Pattern Recognition,
[195] M. Meitz, “Statistical inference for generative adversarial networks,” vol. 110, p. 107646, 2021.
arXiv preprint arXiv:2104.10601, 2021. [219] L. Metz, B. Poole, D. Pfau, and J. Sohl-Dickstein, “Unrolled generative
[196] S. D. Mbacke, F. Clerc, and P. Germain, “Pac-bayesian general- adversarial networks,” arXiv preprint arXiv:1611.02163, 2016.
ization bounds for adversarial generative models,” arXiv preprint [220] Z. Zhang, C. Luo, and J. Yu, “Towards the gradient vanishing,
arXiv:2302.08942, 2023. divergence mismatching and mode collapse of generative adversarial
[197] S. Liu, O. Bousquet, and K. Chaudhuri, “Approximation and con- nets,” in Proceedings of the 28th ACM International Conference on
vergence properties of generative adversarial learning,” Advances in Information and Knowledge Management, 2019, pp. 2377–2380.
Neural Information Processing Systems, vol. 30, 2017. [221] B. Luo, Y. Liu, L. Wei, and Q. Xu, “Towards imperceptible and robust
[198] Z. Lin, V. Sekar, and G. Fanti, “On the privacy properties of gan- adversarial example attacks against neural networks,” in Proceedings
generated samples,” in International Conference on Artificial Intelli- of the AAAI Conference on Artificial Intelligence, vol. 32, no. 1, 2018.
gence and Statistics. PMLR, 2021, pp. 1522–1530. [222] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep
[199] D. Alvarez-Melis, V. Garg, and A. Kalai, “Are gans overkill for nlp?” network training by reducing internal covariate shift,” in International
Advances in Neural Information Processing Systems, vol. 35, pp. 9072– conference on machine learning. pmlr, 2015, pp. 448–456.
9084, 2022. [223] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic
[200] A. Borji, “Pros and cons of gan evaluation measures,” Computer vision models,” Advances in neural information processing systems, vol. 33,
and image understanding, vol. 179, pp. 41–65, 2019. pp. 6840–6851, 2020.
[201] J. Xu, X. Ren, J. Lin, and X. Sun, “Diversity-promoting gan: A [224] Y. Song and S. Ermon, “Generative modeling by estimating gradients
cross-entropy based generative adversarial network for diversified text of the data distribution,” Advances in neural information processing
generation,” in Proceedings of the 2018 conference on empirical systems, vol. 32, 2019.
methods in natural language processing, 2018, pp. 3940–3949. [225] P. Dhariwal and A. Nichol, “Diffusion models beat gans on image
[202] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and synthesis,” Advances in neural information processing systems, vol. 34,
X. Chen, “Improved techniques for training gans,” Advances in neural pp. 8780–8794, 2021.
information processing systems, vol. 29, 2016. [226] F.-A. Croitoru, V. Hondru, R. T. Ionescu, and M. Shah, “Diffusion
[203] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethink- models in vision: A survey,” IEEE Transactions on Pattern Analysis
ing the inception architecture for computer vision,” in Proceedings of and Machine Intelligence, 2023.
the IEEE conference on computer vision and pattern recognition, 2016, [227] C. Saharia, W. Chan, H. Chang, C. Lee, J. Ho, T. Salimans, D. Fleet,
pp. 2818–2826. and M. Norouzi, “Palette: Image-to-image diffusion models,” in ACM
[204] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: SIGGRAPH 2022 Conference Proceedings, 2022, pp. 1–10.
A large-scale hierarchical image database,” in 2009 IEEE conference [228] Y. Jiang, S. Chang, and Z. Wang, “Transgan: Two transformers can
on computer vision and pattern recognition. Ieee, 2009, pp. 248–255. make one strong gan,” arXiv preprint arXiv:2102.07074, vol. 1, no. 3,
[205] S. Gurumurthy, R. Kiran Sarvadevabhatla, and R. Venkatesh Babu, 2021.
“Deligan: Generative adversarial networks for diverse and limited data,” [229] Z. Lv, X. Huang, and W. Cao, “An improved gan with transformers
in Proceedings of the IEEE conference on computer vision and pattern for pedestrian trajectory prediction models,” International Journal of
recognition, 2017, pp. 166–174. Intelligent Systems, vol. 37, no. 8, pp. 4417–4436, 2022.
29