Overview
Generative models are a prominent area of research in machine learning due to their ability to effectively handle complex and unstructured data. Recent advancements in this field have shown potential for generating high-quality synthetic data in larger quantities. Synthetic data, which mimics the statistical characteristics of real data, is highly valuable in various applications. To produce this "nearly-real" data, several generative models have been developed, including Variational Autoencoders (VAE), deep auto-regressive networks, and normalizing flows. Among these techniques, Generative Adversarial Networks (GANs) are gaining significant popularity for their innovative approach and impressive results.
Inspired by a two-player game, Generative Adversarial Networks (GANs) consist of an art forger, who creates imitations of real artworks, and an art inspector, who evaluates a mix of authentic and forged pieces. The inspector's task is to identify the fakes, providing feedback that helps the forger improve the quality of generated images over time. Remarkably, GANs can produce highly realistic images that can deceive even human observers. This exceptional data generation capability enables GANs to achieve state-of-the-art performance across various fields, including computer vision, natural language processing, time series analysis, and reinforcement learning.
Despite the impressive outcomes achieved by Generative Adversarial Networks (GANs), these models face significant challenges. The primary difficulties include the complexities involved in training GANs and the challenges associated with evaluating their results. Additionally, issues such as imbalance between the two competing networks, underfitting, and limited diversity in image generation further complicate their effectiveness.
Although many GAN variants have been proposed [125], solving GANs' weaknesses is still an open research direction.
This research aims to enhance the performance of Generative Adversarial Networks (GANs) specifically for the super-resolution problem, which involves generating high-resolution images from low-resolution inputs. Rather than addressing GANs broadly, we focus on a targeted approach that simplifies the process and minimizes the need for complex mathematical knowledge.
Super-resolution, exemplified by the photo-realistic images generated by GANs, is an essential technique in computer vision that can significantly enhance various applications, including video enhancement, medical diagnosis, and astronomical observation.
Our goal is to enhance the GAN-based approach for super-resolution, aiming to develop a versatile model applicable to various related fields. We address two key challenges in GAN models for image super-resolution: improving output image quality and increasing diversity. Additionally, we explore techniques to facilitate the training process, although this is not our primary focus.
Goal
This research aims to enhance Generative Adversarial Networks (GANs) for super-resolution and to apply our findings to additional vision tasks, including image denoising. To accomplish this objective, we will undertake a series of targeted tasks:
• Study GANs and their variants which are suitable for the super-resolution problem.
• Train and evaluate existing GAN models.
• Based on the experiment results, devise and implement some development directions to improve the model performance.
• Apply the model to other vision problems.
Scope
In our thesis, we focus specifically on super-resolution techniques aimed at reconstructing photo-realistic images of natural scenes, as this area has significant real-world applications and is increasingly recognized in current research. Our study is limited to a scale of 4x, allowing us to concentrate on enhancing image quality effectively while contributing to the growing body of literature in this field.
Super-resolution techniques are primarily categorized into two types: single-image and multiple-image methods. This thesis focuses exclusively on single-image methods, as obtaining multiple images of the same high-resolution subject is not always feasible.
Contributions
This thesis presents two innovative approaches to enhance the image generation capabilities of an existing GAN-based model, aiming to improve the overall quality and realism of the generated images.
• In the first approach, we devise a novel learning strategy that consistently achieves better super-resolution image quality.
• Through comprehensive experiments, we show that using our loss enhances the learning ability not only in the spatial domain but also in the spectral domain.
• In the second approach, we design a diversify module which can be an add-on for any previous 1-to-1 super-resolution model to generate a distribution of fine-grained outputs.
• Despite being trained for super-resolution only, our diversify module shows potential for other vision tasks such as image denoising.
This thesis explores state-of-the-art (SOTA) research on GAN-based solutions for super-resolution, highlighting our target paper. Chapter 3 provides the necessary theoretical background for this study. We then identify weaknesses in the baseline model and propose strategies to improve result quality. The effectiveness of our method is validated through a series of experiments presented in Chapter 5. Finally, we summarize our findings and suggest potential future improvements.
GAN-based approach for SISR
Despite numerous surveys on Generative Adversarial Networks (GANs) and Single Image Super Resolution (SISR), there is a notable lack of in-depth analysis specifically focusing on GAN-based methods for SISR. Most existing surveys primarily compare different variants of GANs based on aspects such as structure, loss functions, and applications.
The majority of Single Image Super-Resolution (SISR) surveys examine not only GAN-based methods but also a variety of other network types, including linear, residual, and recursive networks. To better understand these approaches, we have structured a taxonomy, as illustrated in Figure 2.1, categorizing recent findings into three main aspects: target, approach, and output diversity. Target-based classification reveals that most GAN-based SISR methods are designed to address the super-resolution problem in an end-to-end manner, focusing on enhancing images generally rather than targeting specific image types.
Models can handle a wide range of image types, but their effectiveness often hinges on the training data utilized. Conversely, some studies focus exclusively on specific image categories. For instance, DeepSEE and Super-FAN are tailored specifically for super-resolving portrait images.
Recent advancements in GAN-based approaches for Single Image Super-Resolution (SISR), particularly in face hallucination, have introduced innovative techniques. For instance, DeepSEE incorporates semantic segmentation to enhance realism in low-resolution (LR) inputs, while Super-FAN integrates a face alignment network with the standard GAN framework, building on SRGAN. Additionally, Bulat et al. focused their experiments solely on face datasets, although they suggest their method could be applicable to other image types. Moreover, Demiray et al. have recently investigated the application of SRGAN to digital elevation models (DEM), expanding the scope of GAN techniques in image enhancement.
Approach-based classification shows that most single image super-resolution (SISR) methods address the problem from a low-to-high perspective. While some studies have explored the high-to-low approach, critics argue that these methods yield subpar results on real-world low-resolution images due to their reliance on simplistic downsampling techniques, such as bicubic or bilinear interpolation, which are seldom encountered in practical scenarios. To overcome these limitations, recent research has proposed hybrid models that consider both low-to-high and high-to-low directions, demonstrating improved performance on complex or unknown downsampling operations and yielding favorable outcomes even with unpaired data. However, these advanced approaches require greater computational resources, as they often utilize multiple generator-discriminator pairs to facilitate learning in both directions.
Output diversity classification highlights the inherent ill-posed nature of Single Image Super Resolution (SISR), where multiple high-resolution images can correspond to a single low-resolution image. While most GAN-based methods focus on learning a 1-to-1 mapping from low to high resolution, some research explores 1-to-many super-resolution. Notably, DeepSEE and ExplorableSR incorporate an output-controllable module, enabling users to generate various high-resolution outputs from one low-resolution input. Conversely, PULSE adopts a unique strategy by navigating the latent space of a pre-trained GAN to find the high-resolution image that aligns best with the low-resolution reference. However, DeepSEE and ExplorableSR are considered "work-around" solutions, as they do not establish a high-resolution distribution conditioned on the low-resolution input, and PULSE's reliance on an external generative model presents a significant limitation.
Accuracy-driven models and perceptual-driven models
This section expands beyond GAN-based models to encompass a broader "general model" framework. We will examine recent advancements in single image super-resolution, categorizing these models based on quantitative metrics into two main types: accuracy-driven models and perceptual-driven models.
The evaluation metrics for Single Image Super Resolution (SISR) can be categorized into two main groups: accuracy metrics and perceptual metrics. Accuracy metrics, such as Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index (SSIM), focus on measuring pixel-wise dissimilarity. While these accuracy metrics are sensitive to distortion, they do not necessarily align with human perception.
Perceptual metrics, which have been shown to correlate more effectively with human opinions, utilize deep neural networks for scoring evaluations. Further details and examples of each metric type will be discussed in section 3.2.4.
Based on the above classification, in our thesis, we further classify super-resolution models into two categories:
Accuracy-driven models focus primarily on accuracy metrics, with many earlier approaches categorized as such. A notable example is SRCNN, introduced by Dong et al. in 2014, which employs three convolutional neural network layers to convert low-resolution images into high-resolution ones. Subsequent advancements in super-resolution (SR) performance have been achieved through powerful architectures like residual networks, recursive networks, and residual dense networks. Recently, Zhang et al. have pioneered the integration of attention mechanisms with existing models, yielding impressive results. This innovation has inspired further developments in attention techniques, including holistic attention, second-order attention, and two-stage attentive networks. Additional intriguing methods in the field include learnable image downscaling and feedback frameworks.
Perceptual-driven models are defined as those incorporating at least one perceptual metric in their evaluation. Historically, these models received limited attention due to the absence of robust perceptual image quality assessments. Currently, the predominant approach within this category is GAN-based methods, which have seen numerous enhancements through various techniques, such as integrating segmentation maps, utilizing U-Net based discriminators, and employing pre-trained models. Additionally, non-GAN methods, like the novel architecture designed by Lugmayr that leverages normalizing flow, have achieved results comparable to GAN-based models. Perceptual-driven models are particularly effective in generating visually appealing images, especially in extreme super-resolution scenarios (greater than 8x), although they often produce images with unnatural noise artifacts.
Some recent notable results
Recent IQA model selection
In our experiments, we consider two notable image-quality assessment models: LPIPS [141] and DISTS [26]. Summarized information on other models can be found in [60].
Learned Perceptual Image Patch Similarity (LPIPS) evaluates the similarity between deep embeddings of two images, demonstrating that deep features extracted through neural networks align closely with human perception. Unlike traditional methods that rely on shallow features capturing only the overall image, LPIPS effectively captures spatial and temporal dependencies within images. For a detailed overview of the LPIPS pipeline, refer to section 3.2.4.5.
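As a usage illustration, the sketch below computes an LPIPS distance with the `lpips` PyPI package released by the LPIPS authors; the package name and call signature are as published, while the random tensors stand in for real image patches.

```python
import torch
import lpips  # pip install lpips

# Build the LPIPS metric on AlexNet features ('vgg' is also available).
loss_fn = lpips.LPIPS(net='alex')

# LPIPS expects RGB tensors of shape (N, 3, H, W) scaled to [-1, 1];
# the random tensors here are placeholders for real image patches.
img0 = torch.rand(1, 3, 64, 64) * 2 - 1
img1 = torch.rand(1, 3, 64, 64) * 2 - 1

with torch.no_grad():
    distance = loss_fn(img0, img1)  # lower = more perceptually similar
print(distance.item())
```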
In a comparison of the image similarity metrics LPIPS and DISTS, two distorted versions of a grass image are analyzed. The original image is shown alongside one version affected by JPEG compression and another that has undergone resampling. LPIPS identifies the JPEG-compressed image as being closer to the original, while DISTS favors the resampled version. This highlights the differing perspectives of these two metrics in assessing image quality.
DISTS is an advanced version of LPIPS that enhances texture resampling tolerance by not only averaging feature maps but also comparing their structural components. It replaces the Euclidean distance used in LPIPS with SSIM-like structural similarity measurements. Additionally, DISTS introduces a novel loss function during the training process to further boost performance.
To illustrate the difference between LPIPS and DISTS, please see Figure 2.2. LPIPS predicts that image (b) is closer to the reference image (a), while DISTS chooses image (c).
Recent GANs
2.3.2.1 Relativistic GAN and its variants
In 2018, Martineau proposed a new type of discriminator: the relativistic discriminator.
The standard Generative Adversarial Network (GAN) involves a generator that aims to increase the likelihood of its generated data being perceived as real, while the discriminator assesses whether the input data is real or fake. However, Martineau identifies a crucial limitation in the standard GAN: the generator only gains from the generated data. In contrast, the relativistic GAN improves this process by allowing both real and fake data to contribute equally to the learning of both the generator and discriminator. Furthermore, it is mathematically demonstrated that a standard GAN is merely a specific instance of a relativistic GAN.
In 2020, Martineau published a paper on Relativistic GAN, introducing new mathematical foundations and variants of the model. The author emphasizes the use of the critic score instead of the discriminator, defining the critic as the discriminator without the final activation layer ($D(x) = a(C(x))$). This approach aligns closely with previous research related to Wasserstein GAN, allowing the critic to score the realism of the input rather than providing a probability.
Martineau presents four variants of relativistic GAN in their later research, but the experiments indicate that the most effective models are the Relativistic Average GAN (RaGAN) and the Relativistic Centered GAN (RcGAN). Consequently, our experiments will concentrate solely on RaGAN and RcGAN. Additionally, we will introduce a key definition and a theorem from [56], as those are crucial to build the formulas of RaGAN and RcGAN.
Definition 2.1 [56] Let $P$ and $Q$ be probability distributions and $S$ be the set of all probability distributions with common support. A function $D : (S, S) \to \mathbb{R}_{\geq 0}$ is a divergence if it respects the following two conditions: $D(P, Q) \geq 0$ for all $P, Q$, and $D(P, Q) = 0 \iff P = Q$.
Divergences quantify the difference between two probability distributions. Throughout the training process, the distribution of real data remains constant, and an efficient critic or discriminator allows this divergence to be progressively minimized as the generated distribution approaches the real one.
Theorem 2.1 [56] Let $f : \mathbb{R} \to \mathbb{R}$ be a concave function such that $f(0) = 0$, $f$ is differentiable at $0$, $f'(0) \neq 0$, $\sup_x f(x) = M > 0$ and $\arg\sup_x f(x) > 0$. Let $P$ and $Q$ be probability distributions with support $\chi$. Let $M = \frac{1}{2}P + \frac{1}{2}Q$. Then, we have:
In Theorem 2.1, $D_f^{Ra}(P,Q)$ and $D_f^{Rc}(P,Q)$ correspond to RaGAN and RcGAN, respectively. Also, sup stands for supremum or least upper bound. Further details and proofs can be found in [56].
Martineau presents examples of functions that meet the criteria outlined in Theorem 2.1. Specifically, all concave functions illustrated in Figure 2.3 serve as suitable options for relativistic divergences. These functions are also utilized in the original GAN, LSGAN [80], and HingeGAN [88] (note that the SGAN mentioned in paper [57] is the original GAN [38]). The mathematical formulas for the three functions above are, respectively:

$f_S(z) = \log(\mathrm{sigmoid}(z)) + \log(2)$,  (2.1)

$f_{LS}(z) = -(z-1)^2 + 1$,  (2.2)

$f_{Hinge}(z) = -\max(0, 1-z) + 1$  (2.3)
By combining Theorem 2.1 with equations (2.1), (2.2), (2.3) and making small modifications to suit the super-resolution problem, we can obtain many variants of Relativistic GAN for our task.
In the equations presented, $I^{LR}$ represents the low-resolution image, while $I^{HR}$ signifies the reference high-resolution image. The model generates a high-resolution image denoted as $G(I^{LR})$, with $B$ indicating the batch size. Additionally, $\sigma$ refers to the sigmoid function, and $C(x)$ represents the non-transformed output of the discriminator.
Combining $D_f^{Ra}(P,Q)$ with equation (2.1) and following the instructions from [57], we obtain the generator and discriminator losses for RaGAN:
$L_D^{RaGAN} = -\frac{1}{B}\sum_{b=1}^{B}\Big[\log D_{RaGAN}(I^{HR}, G(I^{LR})) + \log\big(1 - D_{RaGAN}(G(I^{LR}), I^{HR})\big)\Big]$

$L_G^{RaGAN} = -\frac{1}{B}\sum_{b=1}^{B}\Big[\log\big(1 - D_{RaGAN}(I^{HR}, G(I^{LR}))\big) + \log D_{RaGAN}(G(I^{LR}), I^{HR})\Big]$  (2.4)
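As a concrete reading of these losses, the following minimal PyTorch sketch computes the RaGAN objectives from raw critic scores; the function and tensor names are ours, and the batch mean approximates the expectations in (2.4).

```python
import torch

def ragan_losses(c_real, c_fake, eps=1e-8):
    """RaGAN losses of (2.4) from raw critic scores.

    c_real: critic scores C(I^HR) for real images, shape (B,)
    c_fake: critic scores C(G(I^LR)) for generated images, shape (B,)
    """
    # D_RaGAN(x, y) = sigmoid(C(x) - mean over the batch of C(y))
    d_real = torch.sigmoid(c_real - c_fake.mean())
    d_fake = torch.sigmoid(c_fake - c_real.mean())

    # Discriminator: real should look "more realistic than fake on average".
    loss_d = -(torch.log(d_real + eps) + torch.log(1 - d_fake + eps)).mean()
    # Generator: symmetric objective, so gradients flow from both terms.
    loss_g = -(torch.log(1 - d_real + eps) + torch.log(d_fake + eps)).mean()
    return loss_d, loss_g
```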
Figure 2.3 illustrates the subtypes of divergences by plotting the function f in relation to the critic's difference (CD), utilizing three suitable selections for relativistic divergences. The lower gray line indicates f(0) = 0, signifying that the divergence is zero when all CDs are at zero. Conversely, the upper gray line denotes the maximum value of f, indicating that the divergence reaches its peak when all CDs contribute to that maximum. This figure is sourced from [56].
Combining $D_f^{Ra}(P,Q)$ with equation (2.2) and following the instructions from [57], we obtain the generator and discriminator losses for RaLS:
Combining $D_f^{Ra}(P,Q)$ with equation (2.3) and following the instructions from [57], we obtain the generator and discriminator losses for RaHinge:
For relativistic centered GAN, we define $C_m(x, y) = \frac{1}{2B}\sum_{b=1}^{B}(C(x) + C(y))$. Combining $D_f^{Rc}(P,Q)$ with equation (2.1) and following the instructions from [57], we obtain the generator and discriminator losses for RcGAN:
$L_D^{RcGAN} = -\frac{1}{B}\sum_{b=1}^{B}\Big[\log D_{RcGAN}(I^{HR}, G(I^{LR})) + \log\big(1 - D_{RcGAN}(G(I^{LR}), I^{HR})\big)\Big]$

$L_G^{RcGAN} = -\frac{1}{B}\sum_{b=1}^{B}\Big[\log\big(1 - D_{RcGAN}(I^{HR}, G(I^{LR}))\big) + \log D_{RcGAN}(G(I^{LR}), I^{HR})\Big]$

where $D_{RcGAN}(x, y) = \sigma(C(x) - C_m(x, y))$.
Combining $D_f^{Rc}(P,Q)$ with equation (2.2) and following the instructions from [57], we obtain the generator and discriminator losses for RcLS:
Combining $D_f^{Rc}(P,Q)$ with equation (2.3) and following the instructions from [57], we obtain the generator and discriminator losses for RcHinge:
The two equations in (2.4) are the main functions in the adversarial loss of ESRGAN [120]. We also note that equations (2.1), (2.2) and (2.3) contain constants such as $\log(2)$ or $1$. However, as we use the minus operator, those constants are eliminated.
In summary, we have highlighted key aspects of the Relativistic GAN and its application to the super-resolution problem. Our experiments will encompass all variants of GANs discussed in this section.
2.3.2.2 Wasserstein GAN and its variants
Our earlier experiments indicated that the ESRGAN discriminator is sensitive to hyperparameters and challenging to train, despite employing the innovative RaGAN technique. To address this issue, we explored alternative discriminators. Farnia and Ozdaglar demonstrated that GANs' minimax games may lack a Nash equilibrium, while the Wasserstein GAN (WGAN) offers stable learning curves and improved performance by effectively achieving a proximal equilibrium. However, approximating the K-Lipschitz constraint required by the Wasserstein-1 metric poses a significant challenge. To overcome this, Gulrajani et al. introduced WGAN-GP, a new GAN loss that mitigates this issue. Both WGAN and WGAN-GP show promise in enhancing the performance of ESRGAN. In the following section, we will highlight key aspects of WGAN and WGAN-GP.
Although we mainly use WGAN-GP [41], some concepts from WGAN [7] are necessary to fully understand our work. In Wasserstein GAN, one of the core concepts is the
Figure 2.4: Wasserstein-1 distance illustrated by the moving-boxes example. Left image: the positions of the boxes before and after moving. Right image: the moving-plan table. Figure from [50].
Earth-Mover distance or Wasserstein-1 metric:
$W(P_r, P_g) = \inf_{\gamma \in \Pi(P_r, P_g)} \sum_{x,y} \|x - y\|\,\gamma(x, y) = \inf_{\gamma \in \Pi(P_r, P_g)} \mathbb{E}_{(x,y) \sim \gamma}\,\|x - y\|$  (2.10)
In equation (2.10), $P_r$ and $P_g$ are two probability distributions, and $\Pi(P_r, P_g)$ denotes the set of all joint distributions $\gamma(x, y)$ whose marginals are respectively $P_r$ and $P_g$. When analyzing multiple random variables, such as $X$ and $Y$, it is essential to distinguish between their joint distribution and the individual distributions of each variable: the marginal distribution of a random variable is its own probability distribution. To demonstrate the Wasserstein-1 distance, we refer to a visual representation from [50] (Figure 2.4).
The probability distribution can be visualized as a stack of boxes, where the original distribution is represented in columns 1, 2, and 3, and the goal is to transfer these boxes to a new distribution in columns 7, 8, 9, and 10. Two moving plans illustrate this process: the left image shows the initial and final positions of the boxes, while the right image details the movement of each box. For instance, in plan $\gamma_1$, moving 2 boxes from column 1 to column 10 incurs a transport cost calculated as the absolute difference between the columns; for example, moving box 1 from column 1 to column 7 costs 6. Although both plans $\gamma_1$ and $\gamma_2$ may share the same transport cost, discrepancies can arise in other cases. Ultimately, the Wasserstein-1 distance represents the minimum transport cost required to transform one data distribution, $p$, into another, $q$, as expressed by the infimum operator in equation (2.10).
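To make the moving-boxes intuition concrete, the sketch below evaluates the Wasserstein-1 distance between two discrete "box stack" distributions with SciPy's `wasserstein_distance`; the column positions and counts are illustrative rather than the exact values of Figure 2.4, and SciPy normalizes the weights, so the result is the minimal cost per unit of mass.

```python
from scipy.stats import wasserstein_distance

# Box positions (column indices) and box counts acting as weights.
p_columns, p_counts = [1, 2, 3], [2, 2, 2]         # original stack
q_columns, q_counts = [7, 8, 9, 10], [1, 2, 2, 1]  # target stack

# SciPy minimizes over transport plans, i.e., the infimum in (2.10).
cost = wasserstein_distance(p_columns, q_columns, p_counts, q_counts)
print(cost)  # minimal transport cost per unit of mass
```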
To apply the Wasserstein distance in GANs, Arjovsky et al. [7] use the Kantorovich-Rubinstein duality to obtain a simpler equation:
$W(P_r, P_g) = \sup_{\|f\|_L \leq 1} \big( \mathbb{E}_{x \sim P_r}[f(x)] - \mathbb{E}_{x \sim P_g}[f(x)] \big)$  (2.11)

where sup is the supremum or least upper bound. To understand the function $f$, we consider another definition.
Definition 2.2 [7] A real function $f : \mathbb{R} \to \mathbb{R}$ is Lipschitz continuous if there exists a constant $K \geq 0$ such that $|f(x_1) - f(x_2)| \leq K|x_1 - x_2|$ for all $x_1, x_2 \in \mathbb{R}$.

In this case, with constant $K$, we denote $\|f\|_L \leq K$.

If we rewrite the inequality in the above definition as $\frac{|f(x_1) - f(x_2)|}{|x_1 - x_2|} \leq K$, we can see that the Lipschitz constraint limits the rate of change of the function. Accordingly, the function $f$ in equation (2.11) must be a 1-Lipschitz function. To approximate this function, Arjovsky et al. eliminated the final activation layer of the discriminator, renaming it the critic, which led to the development of a new type of GAN called WGAN. Compared to the original GAN, WGAN offers two key advantages: it reduces mode collapse and mitigates vanishing gradients, allowing the generator to learn effectively even when the critic is highly powerful. Further analysis and experimental results can be found in [7].
Enforcing the Lipschitz constraint during training is a challenging task. In their original study, Arjovsky et al. [7] implemented a weight clipping technique, which they later criticized as ineffective. Their experiments revealed that the Wasserstein GAN (WGAN) still generated poor-quality images and struggled to converge in certain instances. To address these issues, Gulrajani et al. [41] enhanced WGAN training by introducing a gradient penalty technique, supported by a proven theorem.
Theorem 2.2 [41] Let $P_r$ and $P_g$ be two probability distributions. Denote by $f^*$ the 1-Lipschitz function that is the optimal solution of $\max_{\|f\|_L \leq 1} \big( \mathbb{E}_{y \sim P_r}[f(y)] - \mathbb{E}_{x \sim P_g}[f(x)] \big)$. Then $f^*$ has gradient norm 1 almost everywhere under $P_r$ and $P_g$.
Theorem 2.2 implies that any point interpolated between a real image and a fake image should have a gradient norm of 1. We also note that the above theorem is just a corollary of the original theorem in [41]. By applying this theorem to WGAN, Gulrajani et al. proposed a new algorithm: WGAN-GP (see Algorithm 1).
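A minimal PyTorch sketch of the gradient penalty step from Algorithm 1 follows; the interpolation and penalty term follow Gulrajani et al. [41], while the function and variable names are ours.

```python
import torch

def gradient_penalty(critic, real, fake, lambda_gp=10.0):
    """WGAN-GP term: push the critic's gradient norm toward 1 at points
    interpolated between real and fake images (Theorem 2.2)."""
    b = real.size(0)
    eps = torch.rand(b, 1, 1, 1, device=real.device)  # per-sample mix ratio
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)

    scores = critic(x_hat)
    grads = torch.autograd.grad(outputs=scores.sum(), inputs=x_hat,
                                create_graph=True)[0]
    grad_norm = grads.view(b, -1).norm(2, dim=1)
    return lambda_gp * ((grad_norm - 1) ** 2).mean()
```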
WGAN-GP demonstrates enhanced training stability compared to traditional WGAN, as evidenced by various experiments. While the image quality from WGAN-GP may be slightly inferior to that of the original GAN, it consistently produces more stable results during training. In scenarios where the network architecture lacks power, WGAN-GP can still generate acceptable outputs, whereas the original GAN may struggle to converge. A comparison of inception scores among different models, including DCGAN, WGAN-GP with RMSProp, WGAN-GP with Adam, and WGAN with weight clipping, illustrates this advantage. By leveraging WGAN-GP's easier training process, we can further improve performance by incorporating deeper architectures, such as residual or dense networks.
Applying WGAN-GP to super-resolution, with the notation established above, we obtain a new pair of losses:

$L_D = \frac{1}{B}\sum_{b=1}^{B}\Big[C(G(I^{LR})) - C(I^{HR}) + \lambda\big(\|\nabla_{\hat{x}} C(\hat{x})\|_2 - 1\big)^2\Big]$

$L_G = -\frac{1}{B}\sum_{b=1}^{B} C(G(I^{LR}))$

where $\hat{x}$ is sampled uniformly along straight lines between $I^{HR}$ and $G(I^{LR})$, and $\lambda$ is the penalty coefficient.
Frequency artifacts problem
Frequency artifacts
In the initial phases of development, distinguishing between real images and those generated by GAN models was relatively straightforward. However, recent advancements in GAN technology have led to models that can deceive even human observers. Despite this progress, research indicates that both the generator and discriminator struggle in the frequency domain. For instance, while StyleGAN is recognized as a state-of-the-art model in image generation, it produces high-quality images that still exhibit artifacts in the frequency domain. Moreover, the discriminator's performance remains consistent across various scenarios, failing to effectively differentiate high-frequency components, except when significant changes are made to the spectrum.
To understand why Generative Adversarial Networks (GANs) struggle to learn high-frequency features, it is essential to examine the behavior of deep neural networks. In 2018, Xu demonstrated through both theoretical and experimental approaches that the F-principle plays a critical role in this phenomenon.
Theorem 2.3 [128] (F-principle): Deep neural networks often fit target functions from low to high frequencies during training.
StyleGAN demonstrates limitations in the frequency domain, as illustrated in Figure 2.7. The top section features the generator experiment, showcasing the mean discrete cosine transform spectrum of a real image, alongside the real image and its generated counterpart, including their corresponding spectra. The bottom section presents the discriminator experiment, displaying images produced by the generator along with the discriminator's scores, highlighting differences in high-frequency components. The top images are sourced from [32], while the bottom images are from [19].
Theorem 2.3 highlights that deep neural networks, while capable of approximating any function, tend to favor low-frequency functions. This phenomenon, known as spectral bias, was further analyzed through the Fourier transform by Rahaman et al. Additionally, Ma et al. in 2021 expanded this concept to non-gradient-descent training methods, such as conjugate gradient and particle swarm optimization. These findings underscore the behavior of generative models, which, including GANs, often avoid high-frequency components and may converge to less optimal solutions.
A significant frequency gap often exists between generated images and real images, highlighting the need for models to learn high-frequency details. This gap indicates that models struggle to accurately reproduce the spectral distribution, which is crucial for creating realistic images across both the spatial and frequency domains. Additionally, GAN models tend to underperform when faced with inputs rich in high-frequency details. Prior research has demonstrated that guiding models to focus on high-frequency components can lead to improved outcomes in various applications, including deep fake image recognition, image generation, inverse rendering, and image regression.
Related methods
Numerous authors have suggested various methods to bridge the gap between generated images and real images. Our survey categorizes these approaches into three primary strategies, as illustrated in Figure 2.8.
To enhance the loss function in generative models, some researchers adopt a strategy similar to WGAN-GP by introducing constraints that involve transforming both generated and real images into the frequency domain. They then assess the dissimilarity between these outputs, imposing a heavier penalty in the loss function for greater differences. Jiang et al. [54] propose a novel focal frequency loss influenced by Euclidean projection and focal loss, while Durall et al. [27] create a spectral loss that incorporates azimuthal integration over the spectrum, integrating this into the generator loss.
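As an illustration of this first strategy, the sketch below adds a simple spectral-magnitude L1 regularizer; it is a generic stand-in in the spirit of [54] and [27], not the exact focal frequency or azimuthal-integration loss.

```python
import torch

def spectral_l1_loss(generated, real):
    """Penalize the gap between the Fourier magnitudes of generated and
    real images -- a generic frequency-domain regularizer."""
    f_gen = torch.fft.fft2(generated)   # 2-D FFT over the last two dims
    f_real = torch.fft.fft2(real)
    return (f_gen.abs() - f_real.abs()).abs().mean()

# Typical use: total = pixel_loss + lambda_freq * spectral_l1_loss(sr, hr),
# where lambda_freq is the extra hyperparameter discussed below.
```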
To enhance the quality of data generation, modifying the input through preprocessing techniques can be effective. In 2021, Li et al. introduced innovative preprocessing methods, including high-frequency filters and high-frequency confusion, which involve exchanging high-frequency components of real data with low-frequency components of fake data. Additionally, Tancik et al. utilized Fourier feature mapping to transform the input before model processing. Another noteworthy approach is frequency separation, employing a linear filter to distinguish between low- and high-frequency details, allowing the discriminator to focus solely on high-frequency components while low-frequency elements are learned through a pixel-wise loss.
To avoid overwhelming the discriminator in the learning process, SSD-GAN employs an additional spectral classifier, enabling the generator to effectively learn in the spectral domain. Additionally, Zhou et al. enhance the frequency separation technique by implementing two discriminators, one for high frequencies and another for low frequencies, rather than relying on a simple filter.
Each strategy has distinct advantages and disadvantages. The first two strategies are advantageous as they do not necessitate changes to the network architecture, making them easily applicable across various models. However, their simplicity often leads to unpredictable performance. Regularizing the loss function introduces a new hyperparameter, which poses challenges in selecting the right balance with the original loss to create an effective regularized loss function. Additionally, the effectiveness of the second strategy is heavily influenced by the properties of the images involved. For instance, while images with similar low-frequency textures can yield plausible results, those with differing high- and low-frequency details often produce unrealistic outputs. Ultimately, increasing the number of parameters is a fundamental approach to enhance model performance, with additional modules typically delivering the best results. The key challenge lies in designing a suitable module, as no single structure is universally effective, particularly in complex models like GANs, where inappropriate components can lead to model divergence.
Figure 2.8: Three strategies to alleviate the frequency artifacts problem: (a) regularize the loss function, e.g., measure the difference of frequencies in focal frequency loss [54]; (b) change the inputs, e.g., swap frequency components as in [71]; (c) add modules, e.g., the discriminator and the spectral classifier in SSD-GAN's architecture [19].
Figure 2.9: High-frequency confusion (HFC) [71] experiment with different images: (a) lemon image, (b) squirrel image, (c) HFC(a,b), (d) HFC(b,a), (e) butterfly image, (f) baby image, (g) HFC(e,f), (h) HFC(f,e). HFC(x,y) denotes the image generated by combining the low-frequency components of image x with the high-frequency components of image y.
Diversity-aware image generation
Architecture design
The most notable example of the architecture design approach is arguably BicycleGAN [148], which trains the generator and encoder in two cycles:
• The conditional Variational Autoencoder GAN (cVAE-GAN) employs an image reconstruction loss to ensure that the encoder learns meaningful encoded latents, enabling the generator to produce realistic outputs using these latents.
• On the other hand, the conditional Latent Regressor GAN (cLR-GAN) constrains the generator to output an image that is consistent with the latent code via the latent reconstruction loss.
Built upon BicycleGAN, DMIT [133] utilizes not one but three additional encoders $E_c$, $E_s$, $E_d$ to better disentangle the features of the images. Although their architectures are identical to each other, these three encoders have different objectives by nature:
• $E_c$ encodes the content and domain-specific features of the images.
• $E_s$ extracts class-agnostic attributes of the images or, as the paper describes, the style of the output.
• $E_d$ is used to output the class label, i.e., the domain that the generator is trained with.
One significant drawback of these frameworks is the absence of explicit regularization for diversity, which often leads to mode collapse. As a result, they tend to learn an overly simplified distribution, where a given input consistently corresponds to a single output, ignoring the variations present in the latent code.
Additional loss
To address the limitations of previous methods, recent studies have proposed explicit regularizers to prevent mode collapse in image-to-image translation models. These include techniques such as mode-seeking loss, diversity-sensitive loss, and contrastive loss.
As for the former two losses, they aim to maximize the ratio between the distance of two outputs and that of the corresponding latent codes, as follows:

$\max_\theta \dfrac{d_I(G_\theta(z_1, c), G_\theta(z_2, c))}{d_z(z_1, z_2)}$  (2.13)
In this context, $d_I(\cdot,\cdot)$ represents various distance metrics for images, including L1 loss, perceptual loss, or feature matching loss. Likewise, $d_z(\cdot,\cdot)$ can denote different distance measures for latent codes, such as L1 loss or cosine similarity loss.
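A minimal sketch of a mode-seeking regularizer built on (2.13) follows, using L1 distances for both images and latent codes; as in common implementations, the inverse ratio is minimized, with a small epsilon for numerical stability.

```python
import torch

def mode_seeking_loss(img1, img2, z1, z2, eps=1e-5):
    """Encourage outputs generated from different latent codes to differ:
    maximizing (2.13) is implemented by minimizing the inverse ratio."""
    d_img = (img1 - img2).abs().mean()  # d_I: L1 distance between outputs
    d_z = (z1 - z2).abs().mean()        # d_z: L1 distance between codes
    return d_z / (d_img + eps)          # minimize this to maximize (2.13)
```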
In image translation tasks, these ratio-based losses can force drastically different outputs even for minimally different latent codes. To address this challenge, the author of DivCo proposes a contrastive loss that encourages similarity between images generated from adjacent latent codes while pushing apart those from distant latent codes. This method aims to enhance the consistency of generated images and improve overall performance in image translation applications.
In this context, $f$ represents the feature vector extracted from the generator's output by a feature extractor. The notation $\langle \cdot, \cdot \rangle$ denotes the inner product, while $f^+$ signifies the feature vector derived from a latent code adjacent to that of $f$. Conversely, the $f_i^-$'s indicate the feature vectors from distant latent codes.
Although these losses may appear distinct at first, they fundamentally align in their objective: to maximize the separation between multiple outputs while minimizing the distance between their corresponding latent codes.
While explicit constraints can enhance diversity in synthesized images, they may also compromise image quality It is essential to exercise caution when applying these constraints to ensure a proper balance between variety and the overall quality of the images produced.
Baseline model selection
Overview
This section focuses on four notable models in single image super-resolution (SISR): SRGAN, EnhanceNet, ESRGAN, and SRFeat, which are highlighted in various SISR surveys. These models are deemed promising and relevant to our thesis scope, leading us to analyze them further. Additionally, numerous subsequent models have built upon these foundational models, enhancing specific aspects of their performance. Therefore, a detailed examination of these four models is essential for selecting our target model.
As for the dataset, many models use DIV2K [3] for their training and validation process. This dataset comes from the NTIRE challenge, contains 2K-resolution images, and includes 800 images for training and 100 images each for validation and testing. Normally, this dataset is used for training and validation only, as the test set is private.
SRGAN
While earlier super-resolution models achieved competitive pixel-based metrics like PSNR, they often produced images that lacked high-frequency details and appeared overly smooth. To address this issue, Ledig et al. introduced SRGAN, the first GAN-based approach for super-resolution. Although SRGAN performs poorly on traditional metrics like PSNR and SSIM, it excels in generating visually appealing high-resolution images. A key feature of SRGAN is its innovative multi-task loss formulation, which incorporates three distinct sub-losses to enhance image quality.
• MSE (L2) loss (2.15) that learns pixel-wise similarity between the ground truth image and the output:

$L_{MSE}^{SR} = \frac{1}{r^2 W H} \sum_{x=1}^{rW} \sum_{y=1}^{rH} \big( I_{x,y}^{HR} - G(I^{LR})_{x,y} \big)^2$  (2.15)

where $W$, $H$ and $r$ are the width and height of the low-resolution input and the scaling factor, respectively.
• Perceptual loss (2.16) in terms of the Euclidean distance between the ReLU activation features of VGG [110] for the reconstructed image and the reference image:

$L_{VGG}^{SR} = \frac{1}{W_{i,j} H_{i,j}} \sum_{x=1}^{W_{i,j}} \sum_{y=1}^{H_{i,j}} \big( \phi_{i,j}(I^{HR})_{x,y} - \phi_{i,j}(G(I^{LR}))_{x,y} \big)^2$  (2.16)

where $W_{i,j}$ and $H_{i,j}$ describe the dimensions of the respective feature maps within the VGG network, while $\phi_{i,j}$ is the respective ReLU activation layer in VGG.
• Adversarial loss [85] that encourages super-resolved outputs to lie close to the manifold of natural images. Here the generator (2.17) and discriminator (2.18) losses are defined as

$L_{Gen}^{SR} = -\frac{1}{B} \sum_{b=1}^{B} \log D(G(I^{LR}))$  (2.17)

$L_{D}^{SR} = -\frac{1}{B} \sum_{b=1}^{B} \Big[ \log D(I^{HR}) + \log\big(1 - D(G(I^{LR}))\big) \Big]$  (2.18)

where $G$, $D$ and $B$ are the generator, discriminator and batch size, respectively.
• To sum up, the total loss (2.19) that the generator uses is:
$L_{total}^{SR} = \lambda_{MSE} \times L_{MSE}^{SR} + \lambda_{VGG} \times L_{VGG}^{SR} + \lambda_{Gen} \times L_{Gen}^{SR}$  (2.19)

where $\lambda_{MSE}$, $\lambda_{VGG}$ and $\lambda_{Gen}$ are hyperparameters to balance the three losses.
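A compact PyTorch sketch of the weighted total loss (2.19) follows; `vgg_features` plays the role of the $\phi_{i,j}$ extractor, and the $\lambda$ defaults are placeholders rather than SRGAN's published values.

```python
import torch
import torch.nn.functional as F

def srgan_generator_loss(sr, hr, d_sr, vgg_features,
                         lam_mse=1.0, lam_vgg=1.0, lam_gen=1e-3):
    """Total SRGAN generator loss (2.19): MSE + perceptual + adversarial.

    sr, hr:       super-resolved output G(I^LR) and ground truth I^HR
    d_sr:         discriminator probabilities D(G(I^LR)) for the batch
    vgg_features: frozen feature extractor playing the role of phi_{i,j}
    """
    l_mse = F.mse_loss(sr, hr)                                  # (2.15)
    l_vgg = F.mse_loss(vgg_features(sr), vgg_features(hr))      # (2.16)
    l_gen = -torch.log(d_sr + 1e-8).mean()                      # (2.17)
    return lam_mse * l_mse + lam_vgg * l_vgg + lam_gen * l_gen  # (2.19)
```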
Architecturally, SRGAN employs a deep residual network [39] with 16 residual blocks for the generator; the discriminator design, on the other hand, is inspired by Radford et al. [96] (Figure 2.12).
SRGAN significantly improves perceptual image quality through the application of adversarial loss, establishing a new benchmark for 4x super-resolution. This advancement demonstrates that Generative Adversarial Networks (GANs) are a powerful and promising approach for single image super-resolution.
SRGAN exhibits a peculiar phenomenon where it achieves high perceptual quality despite receiving lower scores in pixel-wise metrics such as PSNR and SSIM. For instance, as illustrated in Figure 2.13, both bicubic interpolation and SRResNet, a non-GAN method for single image super-resolution (SISR), outperform SRGAN in these quantitative metrics. However, the images produced by SRGAN are perceived as more visually appealing to the human eye compared to the other methods.
The SRGAN model incorporates numerous batch normalization (BN) layers, which contribute to the observed issues. These layers normalize features based on the mean and variance of the training batch, often diminishing the high-frequency details in input images. When there is a significant discrepancy between the training and testing datasets, BN layers can introduce undesirable artifacts and hinder the super-resolution module's ability to generalize effectively.
One significant drawback of SRGAN is its reliance on the feature-after-ReLU activation layer for perceptual loss, which results in sparsity and hinders the generator's ability to learn meaningful features effectively. To tackle these challenges, a different approach is discussed in section 2.6.4.
EnhanceNet
Sajjadi et al. introduce EnhanceNet, a novel super-resolution model that prioritizes generating realistic textures over achieving pixel-perfect accuracy in reproducing ground truth images. This approach utilizes a combination of four distinct loss functions during the training process, similar to the methodology employed by SRGAN.
• MSE loss, which is exactly the same one used in SRGAN.
• Perceptual loss to encourage the network to produce images that have similar feature representations, which is also similar to the loss used in SRGAN.
• Adversarial loss, which is also the same loss used in SRGAN.
EnhanceNet employs a unique texture loss mechanism, distinguishing it from SRGAN, which relies solely on perceptual loss for visually appealing super-resolution. This texture loss, inspired by Neural Style Transfer frameworks, aims to align the textures of low- and high-resolution images. It is calculated as the L1 loss between Gram matrices derived from deep features using a pre-trained VGG model:
$L_{texture}^{SR} = \big\| G\big(\phi_{i,j}(G(I^{LR}))\big) - G\big(\phi_{i,j}(I^{HR})\big) \big\|_1$  (2.20)

where $H_{i,j}$ describes the height of the respective feature maps within the VGG network, $\phi_{i,j}$ is the respective ReLU activation layer in VGG, and the Gram matrix $G(F) = F F^T \in \mathbb{R}^{n \times n}$.
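The Gram-matrix computation behind (2.20) can be sketched as follows; the per-channel flattening is standard, while the $1/(HW)$ normalization is our assumption.

```python
import torch

def gram_matrix(feat):
    """feat: feature maps of shape (B, C, H, W); returns G(F) = F F^T
    per sample, shape (B, C, C), normalized by the number of positions."""
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return torch.bmm(f, f.transpose(1, 2)) / (h * w)  # normalization assumed

def texture_loss(feat_sr, feat_hr):
    """L1 distance between Gram matrices of deep features, as in (2.20)."""
    return (gram_matrix(feat_sr) - gram_matrix(feat_hr)).abs().mean()
```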
EnhanceNet employs an architecture that integrates a fully convolutional network with ResNet, drawing inspiration from Long et al. However, to minimize checkerboard artifacts, Sajjadi et al. have substituted the learnable upsampling module used by Long et al. with nearest-neighbor upsampling.
Figure 2.13: The two numbers in each picture of each method are PSNR and SSIM, respectively [69].
As with SRGAN, incorporating additional loss functions alongside the MSE loss enhances the sharpness and fidelity of generated images, albeit at the expense of pixel accuracy. While EnhanceNet achieves the highest PSNR performance when solely using MSE loss, its outputs tend to be excessively smooth and blurred. In contrast, the complete EnhanceNet model generates images with significantly improved perceptual quality.
Despite the advancements made by EnhanceNet, qualitative results indicate that the images produced still exhibit unwanted artifacts, as seen in Figure 2.15. Unlike SRGAN, the presence of these artifacts cannot be attributed to batch normalization layers, which are absent in EnhanceNet's architecture. We hypothesize that the model's heavy reliance on perceptual and texture loss, derived from a pre-trained classification model rather than a super-resolution model, may confuse optimization, ultimately introducing strange artifacts in the output images.
One significant drawback of utilizing texture loss is the high computational cost associated with the Gram matrix. This inefficiency is particularly evident when dealing with feature maps that have a large height but a small width, as highlighted in equation (2.20).
ESRGAN
ESRGAN, an advanced model derived from SRGAN, excels in generating realistic high-resolution images. Its primary objective is to enhance the perceptual quality of super-resolution while addressing various issues encountered in SRGAN's performance.
Regarding the network architecture, ESRGAN employs the basic architecture of SRResNet [69] (Figure 2.16), which learns a 1-to-1 mapping from low-resolution images to high-resolution ones. Here, "basic blocks" could be selected or designed (e.g., residual block [42], dense block [46]) for better performance.
Compared to SRGAN, ESRGAN made two modifications to the structure of the generator (Figure 2.17):
• Remove all batch norm (BN) layers.
• Replace the original basic block with the proposed Residual-in-Residual Dense Block (RRDB), which combines a multi-level residual network and dense connections.
Alongside restructuring the generator G, ESRGAN also deploys some modifications to the loss:
• ESRGAN replaces the L2-norm loss used in SRGAN with an L1-norm loss to produce less smooth images [120] and preserve high-frequency features.
• In contrast to earlier methods like SRGAN, which minimize differences in post-activation features between the restored image and the ground truth, ESRGAN innovatively utilizes pre-activation features. This approach results in a richer representation of features, enhancing the quality of image restoration.
• Remarkably, ESRGAN favors a recent innovative GAN model, the relativistic average GAN [57], rather than the standard GAN used in SRGAN, as shown in Figure 2.18.
Specifically, the generator loss and discriminator loss used in adversarial training are:
$L_D = -\frac{1}{B}\sum_{b=1}^{B}\Big[\log D_{RaGAN}(I^{HR}, G(I^{LR})) + \log\big(1 - D_{RaGAN}(G(I^{LR}), I^{HR})\big)\Big]$

$L_G = -\frac{1}{B}\sum_{b=1}^{B}\Big[\log\big(1 - D_{RaGAN}(I^{HR}, G(I^{LR}))\big) + \log D_{RaGAN}(G(I^{LR}), I^{HR})\Big]$
Figure 2.15: Left: generated from the full model of EnhanceNet. Right: the ground truth [104].
The notation follows Figure 2.16: $I^{LR}$ is the low-resolution image, $I^{HR}$ the referenced high-resolution image, $G(I^{LR})$ the high-resolution image generated by the model, and $B$ the batch size. The $D_{RaGAN}$ function is defined as $D_{RaGAN}(x, y) = \sigma\big(C(x) - \frac{1}{B}\sum_{b=1}^{B} C(y)\big)$, where $\sigma$ represents the sigmoid function and $C(x)$ indicates the non-transformed discriminator output.
ESRGAN utilizes a distinct training approach compared to SRGAN and EnhanceNet, employing a larger dataset of over 3,450 images along with data augmentation techniques to prevent overfitting. Unlike SRGAN, which trains the GAN from the ground up, ESRGAN incorporates a pre-training phase focused on L1 loss before fully training the model with a combination of all three loss functions. These enhancements enable ESRGAN to learn more effectively, resulting in the ability to capture richer textures and higher-frequency details in high-resolution images.
By using features before the activation layers for perceptual loss, ESRGAN overcomes two drawbacks of SRGAN's design [120]:
• The features after the activation layers become sparse, particularly in very deep networks that use sparse activation functions like ReLU. This sparsity leads to weak supervision and ultimately diminished performance.
• By using features before activation, ESRGAN empirically produces more photo- realistic images with less inconsistent brightness compared with the ground-truth image.
The BN layer was empirically proven to be inefficient and computation-consuming in different PSNR-oriented tasks, as mentioned in 2.6.2.2. ESRGAN therefore removes BN layers for stable training and consistent performance. Furthermore, removing BN layers helps to improve the generalization ability and to reduce computational complexity and memory usage.
Figure 2.17: Left: ESRGAN removes the BN layers in the residual block of SRGAN. Right: the RRDB block used in ESRGAN's model, where β is the residual scaling parameter [120].
Figure 2.18: Difference between the standard discriminator and the relativistic discriminator [120].
Figure 2.19: Representative feature maps before (first row) and after (second row) activation for a sampled image [120].
The adversarial loss for the generator in ESRGAN incorporates both the high-resolution images $I^{HR}$ and the generated images $G(I^{LR})$, allowing the generator to leverage gradients from both real and generated data during adversarial training. This contrasts with SRGAN, where only the generated data influences the training process. As a result, ESRGAN is not only more stable than SRGAN [120] but is also reported to achieve both high pixel accuracy and perceptual quality [6].
On the downside, ESRGAN's model is comparatively large, with the number of learnable parameters up to nearly 40 million. Without decent computational resources, one should not try ESRGAN but opt for other alternatives.
Figure 2.20: The comparison between SRGAN, ESRGAN and EnhanceNet [120]
Figure 2.21: SRFeat architecture [93].
SRFeat
While earlier GAN-based single image super-resolution methods achieved remarkable outcomes with realistic textures and sharp visuals, they often introduced meaningless high-frequency noise unrelated to the input images. To address these limitations and preserve the perceptual fidelity of reference images, Park et al. proposed SRFeat, which incorporates an additional discriminator operating in the feature domain. This enhancement guides the generator to produce high-frequency structural details instead of undesirable artifacts.
SRFeat enhances SISR performance by employing a generator network that includes 16 residual blocks and multiple long-range skip connections, drawing inspiration from previous studies. These longer skip connections and the deeper architecture facilitate more effective information propagation across distant layers. Additionally, before the upsampling layers, SRFeat incorporates extra long-range skip connections to aggregate features from all prior residual blocks, which promotes gradient back-propagation and enables the reuse of intermediate features, ultimately improving the final output.
In SRFeat, the model is optimized with the three common losses used in almost all the works mentioned in this chapter:
• MSE loss to encourage pixel-accuracy
• Perceptual loss to enforce perceptual quality
• Adversarial loss to make the model produce realistic images
In addition, this work introduces an innovative feature adversarial loss, which differs from the traditional adversarial loss operating in image space. Instead, the feature adversarial loss focuses on determining the authenticity of the features extracted from generated images. Specifically, the loss functions for the generator, denoted as $L_G^f$, and for the feature discriminator, denoted as $L_D^f$, are defined accordingly.
Figure 2.22: The qualitative comparison between SRGAN, EnhanceNet and SRFeat [93]
Here $D^f$ is the feature discriminator and $\phi$ denotes the feature maps of a pre-trained model, which is VGG in SRFeat.
The training procedure for SRFeat involves a unique approach compared to ESRGAN and SRGAN. Initially, SRFeat pre-trains the model using Mean Squared Error (MSE) loss on the ImageNet dataset. Following this, the model is trained on the DIV2K dataset, but without employing the MSE loss, which distinguishes it from other methods.
The primary benefits of SRFeat stem from its additional discriminator, which effectively differentiates between features of synthetic and real images. This capability enables the model to produce outputs that emphasize high-frequency structural features, minimizing the presence of random noisy artifacts, as illustrated in Figure 2.22.
Incorporating extra discriminators can complicate training and increase computational demands beyond those of ESRGAN. To address this challenge, the authors opted to pre-train the generator rather than starting its training from the ground up.
Summary
ESRGAN and SRFeat demonstrate superior performance over SRGAN and EnhanceNet, generating more detailed structures, as illustrated in Figures 2.22 and 2.20. In contrast, SRGAN struggles to produce sufficient details, while EnhanceNet tends to introduce unwanted textures.
The authors do not provide a direct comparison between SRFeat and ESRGAN. Anwar et al. [6] report that ESRGAN outperforms many other models in terms of PSNR and SSIM at a scaling factor of 4, but they do not include SRFeat in their experiments. Due to limited computing resources, we focus on ESRGAN, primarily because it employs a single generator and a single discriminator, while SRFeat utilizes one generator and two discriminators, making ESRGAN more efficient in terms of resource usage during training.
Deep Learning
Artificial Neural Network
The nervous system in humans and animals consists of neurons, which are processing units that transmit information through electrical signals. The brain comprises billions of interconnected neurons. Similarly, an Artificial Neural Network (ANN) mimics this structure, where each neuron acts as a computational unit with scalar input and output. In an ANN, incoming values are multiplied by specific weights, and the sum of these products is processed through a non-linear function to produce the final output.
Artificial neural networks consist of interconnected neurons, where the output of one neuron serves as the input for the next. By fine-tuning the weights of these connections, the network can effectively approximate complex mathematical functions. Each neuron, depicted as a circle, receives inputs and produces outputs, while the arrangement of neurons in layers illustrates the flow of information from the input layer to the output layer.
Activation function
Activation functions play a crucial role in ANNs. Normally, an ANN consists of at least one non-linear activation layer.
Figure 3.1: Inside a neuron in ANN [4]
Artificial Neural Networks (ANNs) are limited in their ability to model non-linear relationships when composed solely of linear functions, as the combination of linear functions remains linear. To effectively capture non-linear dynamics, it is essential to incorporate non-linear activation functions. While numerous non-linear functions are available for use in ANNs, there is currently no established theory guiding their application in specific scenarios, making the selection process largely empirical. The most commonly employed non-linear activation functions include the sigmoid function, Rectified Linear Unit (ReLU), and Leaky ReLU.
The sigmoid activation function produces outputs within the range of 0 to 1, making it ideal for models that estimate probabilities, since probabilities of events also fall within this range. Furthermore, this activation function is continuously differentiable, a property that makes gradient-based optimization methods easier to converge.
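For reference, the sigmoid function and its derivative are

$\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad \sigma'(x) = \sigma(x)\big(1 - \sigma(x)\big),$

so $\sigma'(x) \leq 1/4$ everywhere and approaches 0 as $|x|$ grows, which is the source of the gradient-vanishing issue noted below.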
However, there are some disadvantages:
• Gradient vanishing: If the output value is near 0 or 1 (i.e., the input has a large magnitude), the derivative of this function goes to 0. Therefore, this function kills gradients and the network cannot learn.
• Limited output range: The output value only ranges from 0 to 1, which is restrictive in many complicated cases.
ReLU, or Rectified Linear Unit, is the most widely used activation function in deep learning models, particularly in convolutional neural networks. This function consistently produces non-negative values, returning 0 for any input less than 0 and passing positive inputs through unchanged. In many cases, ReLU is the activation function that should be tried first because of its simplicity (see Figure 3.4).
However, there are some drawbacks:
• Information loss: If the input value is negative, ReLU discards it entirely.
• Linear function: If all input values are non-negative, ReLU becomes linear. However, this case rarely occurs in practice, as the number of inputs is very large.
To overcome the limitations of the standard ReLU activation function, the Leaky ReLU variant introduces a small slope for negative input values while maintaining the original output for non-negative inputs. This makes Leaky ReLU particularly effective in scenarios prone to sparse gradients, such as training generative adversarial networks. However, it also introduces an additional hyperparameter, the slope, typically set to a value less than 1, as illustrated in Figure 3.5, where the slope is 0.1.
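The three activation functions discussed above can be written in a few lines of NumPy; this sketch uses the 0.1 slope from Figure 3.5 for Leaky ReLU.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))        # output in (0, 1)

def relu(x):
    return np.maximum(0.0, x)              # zeroes out negative inputs

def leaky_relu(x, slope=0.1):
    return np.where(x >= 0, x, slope * x)  # small slope keeps gradients alive

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(sigmoid(x), relu(x), leaky_relu(x))
```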
Convolutional Neural Network
Convolutional Neural Networks (CNNs) are a specialized type of neural network predominantly used in computer vision, as well as in natural language processing and time series analysis. Like ANNs, CNNs utilize weights and biases that are optimized through the back-propagation algorithm to minimize error functions. Many optimization techniques effective for ANNs can also be successfully applied to CNNs. However, a key difference is that in CNNs, neurons from one layer connect to only a subset of neurons in the subsequent layer, enhancing the network's efficiency and performance.
Figure 3.5: The Leaky ReLU function with a slope of 0.1.

CNNs leverage parameter sharing, which also acts as a form of regularization. Digital images exhibit invariant properties, allowing them to maintain their identity even if a single pixel is removed; for example, a picture of a dog remains recognizable as a dog despite such alterations. To leverage this characteristic, CNNs share parameters across local regions of an image, enabling consistent feature detection across different areas. This design ensures that the same dog detector can identify a dog regardless of its position within the image, whether in column i or i+1.
Parameter sharing in Convolutional Neural Networks (CNNs) significantly reduces the number of model parameters, allowing for larger network architectures without requiring additional training data resources. In addition to parameter sharing, CNNs also utilize sparse interaction and equivariant representations, which are essential concepts that will be explored further.
The CNN architecture, like that of an ANN, consists of multiple layers where the output from one layer serves as the input for the subsequent layer. Typically, several foundational layers are utilized within CNNs:
• Convolution layer
• ReLU layer
• Pooling layer
• Dense layer (fully connected layer)
Each layer can appear many times in the network, making the network very "deep". Usually, a batch of images is the input of a CNN, where each image is treated as a pixel matrix and each pixel has a dimension for its color (RGB) (see Figure 3.6).
The convolution operation is fundamental to CNNs, which use it in at least one layer. At its core, convolution computes the sum of the element-wise product of two matrices: the input matrix and the filter, or kernel. The kernel is typically smaller than the input and holds the learned weights used to detect a particular feature. As the kernel strides across the input matrix, the convolution is evaluated at each position to produce the output.
Figure 3.6 shows the architecture of a classic Convolutional Neural Network, while Figure 3.7 demonstrates 2-D convolution without kernel flipping; Goodfellow et al. [37] note that flipping is useful for theoretical proofs but often unnecessary in practice.
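As an illustration, a minimal NumPy sketch of the operation in Figure 3.7 (2-D convolution without kernel flipping, assuming stride 1 and no padding):

```python
import numpy as np

def conv2d(image, kernel):
    # Slide the kernel over the image and take the sum of the
    # element-wise product at each position (no kernel flipping).
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out
```

Note that the output is smaller than the input by the kernel size minus one in each dimension, which is why padding is often added in practice.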
Convolution helps improve CNN performance by leveraging two important ideas (besides parameter sharing):
• Sparse interactions: in contrast to traditional ANNs, which have full connections between layers, CNNs keep the kernel smaller than the input. This reduces both the storage needed for parameters and the number of operations required to compute outputs.
• Equivariant representations: this property follows from parameter sharing. The relationship between input and output is direct: a change in the input produces a corresponding change in the output, so shifting an object in the input image shifts its representation in the output by the same amount. Note that convolution is inherently equivariant only to translation; equivariance to other transformations, such as scaling or rotation, requires additional mechanisms.
In the ReLU layer, all negative values are set to 0 by the ReLU function while the remaining values pass through unchanged; the input and output of this layer have the same size (see Figure 3.8). In many cases, ReLU is preferred because it speeds up training without dramatically decreasing accuracy.
Figure 3.7: An example of 2-D convolution without kernel flipping [37]
Pooling layers downsample feature maps by summarizing the presence of features within local patches. The two most common variants are average pooling, which reports the average activation within a patch, and max pooling, which reports the strongest activation.
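One way to implement both variants, assuming a single-channel feature map and non-overlapping patches:

```python
import numpy as np

def pool2d(feature_map, size=2, mode="max"):
    # Summarize each non-overlapping size x size patch by its
    # maximum ("max") or mean ("average") activation.
    H, W = feature_map.shape
    out = np.empty((H // size, W // size))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = feature_map[i*size:(i+1)*size, j*size:(j+1)*size]
            out[i, j] = patch.max() if mode == "max" else patch.mean()
    return out
```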
The dense (fully connected) layer handles recognition and classification in CNNs: every node in this layer interacts with every output of the previous layer. In the final layer, the number of nodes equals the number of classes to be predicted. Dense layers are typically placed at the end of the CNN, where the earlier layers have already reduced the data size, keeping the required computational resources manageable.
Figure 3.10: Dense layer (fully-connected layer) [109]
Generative Adversarial Networks
Generative Adversarial Networks (GANs) are built on a min-max game between two competing players: the generator, which creates data, and the discriminator, which distinguishes real data from fake. The generator tries to maximize the discriminator's error, while the discriminator strives to identify genuine data accurately. This iterative process continues until the discriminator can no longer differentiate between real and synthetic data. GANs can be likened to a "cat and mouse game" in which the generator (a counterfeiter) tries to deceive the discriminator (a cop), producing a continuous cycle of improvement driven by competition.
Here, the loss for the standard GAN [38] is:
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_d}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z}\Big[\log\big(1 - D(G(z))\big)\Big] \tag{3.1}$$
where:
• $p_d$ is the distribution of the real data.
• $p_z$ is the distribution of the noise (usually a Gaussian) from which a fake image can be generated.
• $x$ and $z$ are the real data and the input noise, respectively.
• $D$ outputs a real number between 0 and 1, the probability that its input is real (1) rather than fake (0), while $G$ outputs a generated sample.
Figure 3.11: Overall architecture of GAN [99]
GANs constitute a zero-sum, non-cooperative game: one player's gain is the other's loss. In game-theoretic terms, the GAN model converges when the generator and discriminator reach a state known as a Nash equilibrium.
The Nash equilibrium, a fundamental concept in game theory, is reached when no player can gain any additional benefit by unilaterally changing their strategy while the other players keep theirs fixed. Goodfellow et al. [38] have proven that the solution of equation (3.1), if it exists, must satisfy the following relation:
$$V(D, G^*) \le V(D^*, G^*) \le V(D^*, G) \tag{3.2}$$
where $D^*$ and $G^*$ are the respective optimal points for $D$ and $G$ when the corresponding opponent is held fixed, and $V(D, G)$ is the loss in (3.1).
Based on relation (3.2) and some mathematical transformations [38], we can derive that the Nash equilibrium occurs when
$$D^*(x) = \frac{1}{2} \quad \text{and} \quad p_d(x) = p_g(x) \quad \forall x \tag{3.3}$$
where $p_g$ is the distribution of the generator's outputs.
In other words, at the solution of the mini-max loss the discriminator can do no better than a 50% guess on any sample, which means the distribution of the synthetic data matches that of the real data exactly.
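For completeness, the intermediate step from [38] behind (3.3): for a fixed generator, $V(D, G)$ is maximized pointwise by
$$D^*(x) = \frac{p_d(x)}{p_d(x) + p_g(x)}$$
which equals $\frac{1}{2}$ for every $x$ exactly when $p_d(x) = p_g(x)$.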
In Section 3.1.4.2, we showed that equation (3.1) has an analytical solution that depends on $p_d$ (see (3.3)). However, $p_d$ is not known beforehand, and $G$ and $D$ are commonly parameterized as deep neural networks, so no closed-form solution for GAN can be provided. GANs must therefore be trained with iterative methods such as Stochastic Gradient Descent. In the original GAN paper [38], Goodfellow et al. devise a simple learning algorithm.
Algorithm 2: Algorithm for training GAN
Require: discriminator weights $\theta_D$, generator weights $\theta_G$, input noise distribution $p_z$, learning rate $\alpha$, hyperparameter $k$, batch size $B$, number of epochs $E$, and dataset $x$
for epoch := 0 to E do
    for k steps do
        • Sample a minibatch of $B$ noise samples $\{z^{(1)}, \dots, z^{(B)}\}$ from the noise prior $p_z$.
        • Sample a minibatch of $B$ examples $\{x^{(1)}, \dots, x^{(B)}\}$ from the dataset.
        • Update the discriminator by ascending its stochastic gradient:
        $$\theta_D \leftarrow \theta_D + \alpha \nabla_{\theta_D} \frac{1}{B} \sum_{i=1}^{B} \Big[\log D\big(x^{(i)}\big) + \log\Big(1 - D\big(G(z^{(i)})\big)\Big)\Big]$$
    end for
    • Sample a minibatch of $B$ noise samples $\{z^{(1)}, \dots, z^{(B)}\}$ from the noise prior $p_z$.
    • Update the generator by descending its stochastic gradient:
    $$\theta_G \leftarrow \theta_G - \alpha \nabla_{\theta_G} \frac{1}{B} \sum_{i=1}^{B} \log\Big(1 - D\big(G(z^{(i)})\big)\Big)$$
end for
In each iteration, we first freeze the weights of the generator and update the weights of the discriminator over several sub-steps using gradient ascent. We then freeze the weights of the discriminator and update the weights of the generator using gradient descent.
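As a rough illustration only, a PyTorch-style sketch of one iteration of Algorithm 2; the names `G`, `D`, `opt_G`, `opt_D`, `real_batch`, and `noise_dim` are assumed to be defined elsewhere, and numerical-stability details are omitted:

```python
import torch

def gan_train_step(G, D, opt_G, opt_D, real_batch, noise_dim, k=1):
    B = real_batch.size(0)

    # Discriminator phase: generator frozen, ascend the discriminator
    # objective (implemented as descent on its negation) for k sub-steps.
    for _ in range(k):
        z = torch.randn(B, noise_dim)
        fake = G(z).detach()                     # freeze G's weights
        loss_D = -(torch.log(D(real_batch)).mean()
                   + torch.log(1.0 - D(fake)).mean())
        opt_D.zero_grad()
        loss_D.backward()
        opt_D.step()

    # Generator phase: discriminator effectively frozen, since opt_G
    # only updates G's parameters; descend the generator objective.
    z = torch.randn(B, noise_dim)
    loss_G = torch.log(1.0 - D(G(z))).mean()
    opt_G.zero_grad()
    loss_G.backward()
    opt_G.step()
    return loss_D.item(), loss_G.item()
```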
Mini-max optimization is extremely difficult [101] and training GANs is no different.
GANs face several challenges, particularly in optimizing their mini-max loss, which can be highly unstable and prone to divergence. For instance, consider the min-max problem
$$\min_y \max_x V(x, y) = xy \tag{3.4}$$
The Nash equilibrium of equation (3.4) is $x = y = 0$; however, applying the learning algorithm from Section 3.1.4.3 with a high learning rate makes the model unstable, as illustrated in Figure 3.12. Furthermore, recent research from MIT indicates that GANs may lack a Nash equilibrium altogether in certain scenarios. In the absence of a Nash equilibrium, the loss function must be modified or other mechanisms introduced to help the networks converge.
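The instability is easy to reproduce: simultaneous gradient steps on $V(x, y) = xy$ (ascent in $x$, descent in $y$) spiral away from the equilibrium at the origin rather than toward it, since each update multiplies the distance to the origin by $\sqrt{1 + \alpha^2} > 1$. A tiny sketch:

```python
# Simultaneous gradient ascent-descent on V(x, y) = x * y:
# dV/dx = y (ascend in x), dV/dy = x (descend in y).
x, y, lr = 1.0, 1.0, 0.5
for step in range(10):
    x, y = x + lr * y, y - lr * x
    print(f"step {step}: x={x:+.3f}, y={y:+.3f}")
# The iterates orbit outward: every step scales the distance to the
# Nash equilibrium (0, 0) by sqrt(1 + lr**2) > 1, so the model diverges,
# mirroring Figure 3.12.
```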
Another significant challenge is mode collapse, which occurs when the model fails to capture the full diversity of a multi-modal data distribution. For instance, the MNIST dataset has 10 distinct modes, the digits '0' through '9', yet a collapsed GAN generates only a subset of them: in Figure 3.13, the model in the second row cannot produce images of all 10 digits, unlike the successful model in the top row.
Besides the above two issues, GANs suffer from several other problems, and addressing them remains an active research direction in AI:
Figure 3.12: The divergence of the example function [49]
Figure 3.13: An example of mode collapsing [49]
• Diminished gradient: the discriminator becomes so successful that the generator's gradient vanishes and the generator learns nothing.
• Imbalance between the generator and the discriminator, causing overfitting.
• High sensitivity to hyperparameter selection.