VIETNAM NATIONAL UNIVERSITY - HO CHI MINH CITY
UNIVERSITY OF INFORMATION TECHNOLOGY
FACULTY OF COMPUTER SCIENCE
BACHELOR THESIS
SPATIO-TEMPORAL RE-RENDERING FOR
FACIAL VIDEO RESTORATION
Bachelor of Computer Science (Honors degree)
NGO HUU MANH KHANH - 19520125
NGO QUANG VINH - 19520354
Supervised by
Dr. Nguyen Vinh Tiep
HO CHI MINH CITY, 2023
LIST OF THE THESIS DEFENSE COMMITTEE
The thesis defense committee was established under Decision No. ........, dated ........, by the Rector of the University of Information Technology.
1. ........................ - Chairperson.
2. ........................ - Secretary.
3. ........................ - Member.
research process and successfully complete this thesis.
We would like to express our sincere thanks to the Dean of the Faculty and all the teachers
in the Faculty of Computer Science, University of Information Technology, for their support and for helping us prepare enough knowledge to complete this thesis.
We are also grateful to the Multimedia Laboratory (MMLab-UIT) for providing us with a conducive research environment and state-of-the-art equipment for this research. Furthermore, we would like to extend our appreciation to the researchers of MMLab for their valuable feedback and critical questions, which greatly contributed to our research and helped us identify and correct mistakes and improve the quality of this thesis.
Old facial films are a great source of historical value, giving us a vivid picture of significant figures from the past. However, because they were captured with old camera technology, these films are low-quality and exhibit visual artifacts like pepper noise and stripes. Besides, old films can be damaged by poor storage conditions. As a result, they are difficult or impossible to watch. There is a demand to restore and preserve these old films so future generations can enjoy them. Not limited to restoring old films, facial restoration can also be used for security purposes. More specifically, surveillance cameras are installed in many public places to prevent crime, but their recordings are often low-quality due to camera resolution and poor lighting, making it difficult to identify people. Facial video restoration is a solution to this problem: it upgrades the quality of faces in the video and makes it easier to identify offenders. Although the similar problem of facial image restoration has been researched for a long time, work on facial video restoration remains less explored.
Current facial image restoration models have impressive performance, and we can directly use them for video restoration by restoring each frame individually. Nonetheless, this approach struggles with flickering, since these models are designed for image restoration and do not take temporal information into account. In this thesis, we propose Spatio-temporal Re-rendering for Antique Facial Video Restoration (STERR-GAN), a facial video restoration model that employs both temporal and spatial information for restoration. Our experiments show that our model can address the flickering problem and yield better results.
In addition, to the best of our knowledge, while datasets for facial image restoration and general video restoration are available, a dataset for the facial video restoration domain is still unavailable. As such, we introduce the VAR dataset (Video dataset for Antique Restoration), a new video restoration dataset for the facial domain. We expect that this dataset will become a valuable resource for measuring the performance of future models and advancing research in this area.
Table of contents
2.2.1 Recurrent Neural Networks (RNN)
2.2.2 Bidirectional Recurrent Neural Networks
3.3.1 Overview
3.3.2 3D convolutional-based
4.2.1 Overview
4.2.2 Temporal Recurrent Network
4.2.3 Facial Re-rendering
4.2.4 Objective function
5 Experiment
5.1 Metrics
5.1.1 Peak Signal-to-Noise Ratio (PSNR)
5.1.2 Structural Similarity Index Measure (SSIM)
5.1.3 Temporal Stability (E_warp)
5.1.4 Frechet Video Distance (FVD)
List of figures
1.1 Example of Facial Video Restoration and some old film deterioration
2.1 Story of counterfeit money
2.2 Approximating the data distribution with GAN
2.3 Backpropagation in Generator training
2.4 Backpropagation in Discriminator training
2.5 The progress in generating face images using GAN models
2.6 The architecture of the StyleGAN generator
2.7 Illustrative example with two factors of variation
2.8 Example of water droplet-like artifacts in StyleGAN images
2.9 The architecture of StyleGAN2
2.10 Example of "phase" artifacts
2.11 Some alternative network architectures of StyleGAN2
2.12 Illustration of Recurrent Neural Networks
2.13 Illustration of Bidirectional Recurrent Neural Networks
2.14 The architecture of RAFT
2.15 Example of optical flow estimation
3.1 Illustration of the GFP-GAN framework
3.2 Overview of DeepRemaster
3.3 Illustration of the source-reference attention layer
3.4 Illustration of the framework proposed by Wan et al. [51]
3.5 Overview of the Wan et al. pipeline
4.1 The process of collecting the Video dataset for Antique Restoration (VAR)
4.2 Some samples from VAR
4.3 Visualization of the STERR-GAN framework
5.1 Qualitative comparison
5.2 User study result
List of tables
5.1 Quantitative result of STERR-GAN, GFP-GAN and DeepRemaster
5.2 Ablation study of STERR-GAN
Chapter 1
Introduction
1.1 Overview
1.1.1 Practical Context
In the late 19th century, motion pictures were first introduced to mankind. From that time, a surprising number of films were recorded and released. However, due to the technology of that era, films were low-quality and exhibited visual artifacts like pepper noise and stripes. In addition, old films suffered from degradation due to poor storage conditions. With all of these factors, the significant historical value of old videos can be lost. Although film restoration techniques have been created to bring these antique films back to life, the process is laborious. Nowadays, video restoration is typically conducted digitally, with artists manually retouching each frame to remove blemishes, fix flickering, and perform colorization. However, this process is extremely time-consuming and expensive, as it requires examining and repairing every single frame of the old film. As a result, there is a desire for an algorithm that can automate these tedious tasks, allowing old films to be restored and given a more modern appearance at a lower cost. Old film restoration, or more generally Video Restoration, has many applications in real life.
Preserving historical video footage
Preserving historical video footage is an essential application of video restoration technology. Historical video footage refers to videos that capture important events, people, or cultural artifacts from the past. These recordings can be a valuable source of information and cultural heritage, and it is essential to preserve them for future generations.
However, video recordings are often subject to degradation over time due to factors such as wear and tear, exposure to heat and moisture, and the passage of time. This can make it difficult to view or use these recordings, as they may be blurry, distorted, or otherwise of poor quality. In addition, many historical video recordings are stored in formats that are no longer widely used, such as VHS tapes or film reels, making it difficult to access or view the footage.
Video restoration techniques can be used to preserve and restore historical video footage, improving the quality of the video and making it possible to view and study these recordings in greater detail. This can involve various techniques, such as noise reduction, color correction, and image enhancement. By using video restoration techniques to improve the quality of these recordings, it is possible to preserve and share these essential pieces of history for future generations.
Enhancing the clarity of surveillance footage
Surveillance footage is typically captured by cameras placed in strategic locations to monitor and record activity in a particular area. This footage is often used for a variety of purposes, such as security, crime prevention, and investigation.
However, surveillance footage can often be of low quality due to factors such as poor lighting, camera movement, and noise. This can make it difficult to identify people and objects in the footage, which makes it less useful for its intended purpose. Video restoration techniques can be used to improve the clarity of surveillance footage by applying a variety of techniques such as noise reduction, color correction, and image enhancement.
For example, some video restoration models are proposed to remove noise or blur from the footage, making it easier to see details such as facial features or license plate numbers. These techniques can help to improve the effectiveness of surveillance footage by making it easier to identify people and objects in the video, which can be useful for security, crime prevention, and investigation purposes.
1.1.2 Problem Definition
Facial Video Restoration is a subfield of video restoration that aims to restore high-quality faces from low-quality counterparts with various kinds of deterioration, such as low resolution, noise, blur, compression artifacts, etc. Figure 1.1 illustrates an example of facial video restoration.
• Input: a sequence of old film frames containing a complex mixture of degradation, such as film grain noise (blue box) or scratches (red arrow).
• Output: the corresponding colored high-quality video.
Fig. 1.1 Example of Facial Video Restoration and some kinds of old film deterioration, including film grain noise (b), which make old films challenging to restore to their original quality.
1.1.3 Challenges
Besides the common challenges of computer vision tasks, facial video restoration has its own difficulties.
• Lack of dataset. The training dataset is one of the primary difficulties that we face in this work. A paired dataset is unavailable for our problem, and the previous work [51] uses a synthesized dataset. Moreover, to the best of our knowledge, the available data for facial video restoration is insufficient.
• Keeping facial detail. The face contains a lot of subtle details that are important for conveying emotions and expressions. It can be challenging to restore a video in a way that preserves these details while still improving the overall quality of the image. Besides, the appearance of the face can be affected by complex lighting conditions, such as shadows, highlights, and reflections. This can make it challenging to correct color and exposure issues in the facial region.
• Flickering problem. The flickering problem refers to unwanted changes in brightness or color in restored video sequences. It can be particularly noticeable in high-motion or low-light scenes and can be distracting and unpleasant for viewers.
• Requires high computational resources. Since we apply complex image processing techniques to a large number of frames in a video, these techniques can be computationally intensive, especially when applied to high-resolution videos.
• Old films contain a complex mixture of degradation. Due to poor storage conditions and old capture techniques, antique videos often contain many distortions. Therefore, comprehensively mitigating these issues in a single deep neural network is difficult.
1.2 Motivation
From our survey, there is much research on video and old film restoration, such as Video Restoration [56, 37, 5] and Facial Image Restoration [27, 51]. On the other hand, although facial video restoration has many practical applications in preserving old films, security, and crime prevention, work on this topic is less explored. Therefore, we choose Facial Video Restoration as our research topic in this thesis. General Video Restoration and Facial Image Restoration are similar to our topic, so we can use them indirectly, but doing so has some disadvantages.
Facial image restoration works [56, 37, 5] point out that facial image restoration is more challenging than general image restoration; as a result, they propose novel architectures and employ prior knowledge of the face to recover the details of facial components. Their results indicate that these specific designs improve the quality of restored images. However, if we adapt these approaches to facial video restoration in a frame-wise manner, the result will exhibit illuminance flickering since they ignore the temporal correlation between frames.
General Video Restoration works [27, 51] aim at automatically restoring spatio-temporal deterioration. Unlike damaged images, which have only spatial degradation, the deterioration in consecutive frames of a video is correlated. Hence, it is necessary to leverage long-term temporal clues to remove the degradation and render stable content. To do so, [27, 51] utilize techniques that employ temporal information to avoid illuminance flickering in the resulting videos. However, we cannot directly apply General Video Restoration to the facial domain, because the facial domain is a special case and can be more challenging than general video restoration for several reasons. First, the face is a rich source of information, with many subtle details that can be important for conveying emotions and expressions. As a
result, it can be challenging to restore a video in a way that preserves these details while still improving the overall quality of the image. Second, the face is often the primary focus of attention in a video, which means that any defects or artifacts in the facial region are more noticeable to the viewer than defects in other parts of the image. This can make it more difficult to achieve an acceptable level of quality when restoring facial video.
From the above observations, we hypothesize that we can alleviate the issues of the two mentioned approaches by leveraging long-term temporal clues in combination with frame features (spatio-temporal information) and specific designs for facial restoration (facial re-rendering).
1.3 Objectives
In this thesis, we aim to propose a novel video restoration framework that achieves restoration results comparable with state-of-the-art facial image restoration while producing illuminance-consistent results. We observe that current facial image restoration approaches have achieved acceptable results, but when we apply them to restore the frames of a video independently, the result exhibits an inconsistency in illuminance between adjacent frames (i.e., flickering). To address this issue, we devised the idea of leveraging both long-term temporal clues and spatial information from each frame.
1.4 Contributions
Our main contributions in this work are listed in the following points:
• Investigating an overview of related research. We conduct a survey of current research on Blind Face Restoration and Video Restoration, then identify the key components we can leverage to design a framework for Facial Video Restoration.
• Introducing the VAR dataset. To our knowledge, the data available for facial video restoration is inadequate. Hence, we introduce the Video dataset for Antique Restoration (VAR).
• Proposing STERR-GAN. We propose Spatio-temporal Re-rendering for Antique Facial Video Restoration, namely STERR-GAN, for facial video restoration. STERR-GAN
adopts a generative prior for facial restoration (the "Re-rendering" term) and utilizes both temporal and spatial information (the "Spatio-temporal" term) to produce robust result videos.
1.5 Dissertation Structure
Our thesis consists of 6 chapters:
• Chapter 1: Introduction. This chapter presents an overview of the facial video restoration problem, including the research motivation, definition, challenges, and our main contributions.
• Chapter 2: Fundamentals. This chapter presents fundamental knowledge that is important to our thesis.
• Chapter 3: Related work. This chapter presents prior research that is related to our thesis.
• Chapter 4: Methodology. This chapter presents our proposed STERR-GAN framework for facial video restoration in detail. We also introduce the VAR dataset and its collection protocol.
• Chapter 5: Experiments. This chapter reports our experimental results and the process of how we improved the performance of our framework.
• Chapter 6: Conclusion. This chapter summarizes our thesis and our main contributions; we also mention some future work.
Chapter 2
Background
In this chapter, we present some essential concepts that are crucial to our research. To start, we cover the Generative Adversarial Network (GAN) [15], providing a brief overview of its underlying concepts, architecture, and training objectives. Next, we delve into the StyleGAN family [31, 32], a GAN-based generative model that is powerful at image generation. Afterward, we introduce several sequence models, including recurrent neural networks (RNNs), long short-term memory (LSTM) networks, and Bidirectional Recurrent Neural Networks (BiRNNs), all of which are crucial when working with sequence data such as videos. Finally, we briefly touch upon optical flow, an important concept in video processing, which models the movement of objects in a video.
2.1 GAN
The Generative Adversarial Network (GAN) [15] was first introduced by Ian Goodfellow et al. in 2014 at NeurIPS. GAN brought an innovation horizon to deep learning, specifically in computer vision, and has become a promising direction for generative models. Specifically, GAN's approach is based on adversarial training, which trains a generator to fool the Discriminator with deceptive data. GAN has become the core of various applications in recent years.
2.1.1 Intuition of GAN
To understand the intuition of GAN, think about the story of counterfeit money making illustrated in figure 2.1. In city A, a criminal group wanted to make counterfeit money, and they had already prepared all the materials and tools. However, they did not know how to make fake money. They came up with the idea that they would first make a sample of counterfeit money
Fig. 2.1 Story of counterfeit money¹
and use it to fool the police. If the police can distinguish the fake money, the criminals will adjust and make new counterfeit money to fool the police again. This process repeats several times until the police can no longer recognize the fake money. At that point, the criminal group has succeeded in making counterfeit money.
The idea behind GAN is similar to the counterfeit money making in the above story. Theoretically, GAN is based on zero-sum game theory and consists of 2 players, a Generator and a Discriminator, each of which has an opposing objective. The Generator tries to generate fake data indistinguishable from the real data (it plays the same role as the criminal group in the previous story). Meanwhile, the Discriminator tries to distinguish fake data from real data (similar to the police). The objective of training the Generator and Discriminator is to find a balanced point (Nash equilibrium) at which the generated fake data is identical to the real data and can thus fool the Discriminator. The two main points of GAN are approximating the data distribution and adversarial training.
• Approximating the data distribution
GAN assumes that the data distribution is p_d(x). The Generator is considered to map from a simple input distribution p_z(z) (usually a Gaussian distribution) to the generative distribution p_g(z). The objective of Generator training is to bridge the gap between p_g(z) and p_d(x), as illustrated in figure 2.2.
• Adversarial Training
¹ https://towardsdatascience.com/the-math-behind-gans-generative-adversarial-networks-3828f3469d9c
Fig. 2.2 Approximating the data distribution with GAN
In GAN training, the generative distribution p_g(z) is implicitly compared with the target data distribution p_d(x) by the Discriminator. When p_g(z) is equal to p_d(x), the Discriminator cannot distinguish fake data from real data, i.e., GAN reaches the Nash equilibrium.
We first randomly sample a data point z from the input distribution p_z(z) and get fake data G(z) by passing z through the Generator. We also choose a real data point x from the dataset. Then the Discriminator is used to determine the realness of G(z) and x; D(G(z)) and D(x) are the probabilities that G(z) and x are real data, respectively. The objective function of the Generator is equation 2.1. We need to minimize this objective function when training the Generator, since we want it to be able to fool the Discriminator.

\min_G L(G) = \mathbb{E}_{x \sim p_d}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))] \quad (2.1)

In contrast, the Discriminator is trained to clearly identify whether G(z) and x are fake or real data. Thus, we use an objective function similar to 2.1, but maximize it instead of minimizing it as in Generator training. The objective function of the Discriminator is equation 2.2.

\max_D L(D) = \mathbb{E}_{x \sim p_d}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))] \quad (2.2)

In summary, the objective of GAN is:

\min_G \max_D L(G, D) = \mathbb{E}_{x \sim p_d}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))] \quad (2.3)
² https://developers.google.com/machine-learning/gan
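To make the alternating optimization of equations 2.1-2.3 concrete, the following is a minimal, illustrative PyTorch sketch of one adversarial training step. It is not the implementation used in this thesis; the `generator` and `discriminator` modules, the optimizers, and the batch of real images are assumed to be defined elsewhere, and the discriminator is assumed to output a probability (i.e., to end with a sigmoid).

```python
import torch
import torch.nn.functional as F

def gan_training_step(generator, discriminator, g_opt, d_opt, real_batch, latent_dim):
    """One alternating update of Discriminator and Generator (a sketch)."""
    batch_size = real_batch.size(0)
    real_labels = torch.ones(batch_size, 1)
    fake_labels = torch.zeros(batch_size, 1)

    # --- Discriminator step: maximize log D(x) + log(1 - D(G(z))) ---
    z = torch.randn(batch_size, latent_dim)
    fake_batch = generator(z).detach()          # stop gradients from flowing into G
    d_loss = F.binary_cross_entropy(discriminator(real_batch), real_labels) + \
             F.binary_cross_entropy(discriminator(fake_batch), fake_labels)
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # --- Generator step: try to fool D ---
    # (the non-saturating form -log D(G(z)) is used here, as is common in practice)
    z = torch.randn(batch_size, latent_dim)
    g_loss = F.binary_cross_entropy(discriminator(generator(z)), real_labels)
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()
```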
Fig. 2.5 The progress in generating face images using GAN models and their extensions over a nearly 5-year period (2014-2018). The images from left to right are respectively reprinted from GAN, DCGAN, CoGAN, ProgressiveGAN, and StyleGAN.
2.1.2 StyleGAN
Figure 2.5 illustrates the evolution of the quality and resolution of images generated by Generative Adversarial Networks (GANs). However, one of the most important components of a GAN, the generator, has been neglected and is still used as a black box. Despite recent attempts, a comprehensive understanding of the synthesis process is still lacking. For example, the poor understanding of the latent space's properties makes the image synthesis process more complicated.
Motivated by this, Karras et al. proposed StyleGAN, a type of generative adversarial network inspired by the style transfer literature. Instead of sampling a vector from the latent space as input, StyleGAN takes a fixed learned input and uses ADAptive Instance Normalization (AdaIN) to change the style of the image according to the information embedded in the latent code estimated by the mapping network. This design allows users to control the synthesis intuitively and at specific scales. Although the style-based GAN (StyleGAN) has achieved state-of-the-art results in image generation, some characteristic artifacts exist in the images it generates. To deal with this issue, Karras et al. expose and analyze several characteristic artifacts and propose StyleGAN2, an extended version of StyleGAN with some modifications to the architecture and a novel training scheme.
2.1.2.1 StyleGAN 1
Style-based generator
Fig. 2.6 The architecture of the StyleGAN generator, which gives control over the style of the generated image.
In the vanilla generator, the latent code is sampled from a normally distributed latent space Z, as shown in figure 2.6-a, and then provided to the generator at the first layer. After a series of convolution blocks, the latent code is transformed into an image. The authors of StyleGAN point out that this principle forces the latent space Z to follow a fixed probability density and results in unavoidable entanglement. To avoid this, StyleGAN instead takes a learnable constant tensor as input, while the latent code is transformed into an intermediate latent space W via a non-linear mapping network f : Z → W. The mapping network is composed of several fully connected layers (StyleGAN uses an 8-layer MLP). A learned affine transformation A is employed to transform the latent code w ∈ W into styles y = (y_s, y_b) for controlling the style of generated images through Adaptive Instance Normalization (AdaIN), as illustrated in figure 2.6-b. The AdaIN operation is defined as:
\mathrm{AdaIN}(x_i, y) = y_{s,i} \frac{x_i - \mu(x_i)}{\sigma(x_i)} + y_{b,i} \quad (2.4)

In this design, the latent code is disentangled, which means the latent space consists of linear subspaces, each of which controls one factor of variation. The input latent code
w ∈ W is produced by the non-linear mapping network f. In other words, w is a combination
of factors of variation from subspaces. Therefore, the images generated by StyleGAN are more realistic.
Fig. 2.7 Illustrative example with two factors of variation (image features, e.g., masculinity and hair length). (a) An example training set where some combination (e.g., long-haired males) is missing. (b) This forces the mapping from Z to image features to become curved so that the forbidden combination disappears in Z, preventing the sampling of invalid combinations. (c) The learned mapping from Z to W is able to "undo" much of the warping.
For example, consider figure 2.7. Sub-figure 2.7-a illustrates a distribution of features in training data with 2 factors of variation (masculinity and hair length) but with the long-haired male combination missing. In a traditional generator, the latent code is sampled from a fixed distribution (e.g., a normal distribution). Therefore, the feature distribution according to Z is forced to follow the same distribution as Z (shown in sub-figure 2.7-b) to prevent the sampling of invalid combinations. This precludes the factors from being fully disentangled. In contrast, in sub-figure 2.7-c, W is a learned latent space, so it better models the feature distribution of the training data.
Stochastic variation
Some aspects of human portraits, such as the exact placement of hairs, stubble, freckles, or skin pores, can be considered stochastic. If these aspects change following a distribution, the perception of the image does not change.
In traditional generators, the input is fed to the network only through the first layer. To produce the stochastic variation of the mentioned human portrait aspects, the generator needs to find a way to generate spatially-varying pseudorandom numbers from earlier activations whenever needed. However, doing so is challenging and not always successful. As a result, repetitive patterns are commonly seen in generated images. StyleGAN addresses this problem by adding per-pixel noise after each convolution. The implementation is illustrated in figure 2.6-b: a single-channel noise input is scaled and added to the output features of each convolution layer of the generator. Although StyleGAN is under the same pressure to generate new
Fig. 2.8 Example of water droplet-like artifacts in StyleGAN images (shown at 256×256 and 512×512 resolution). The artifact appears in all feature maps starting from the 64×64 resolution up to the final generated image.
content as the traditional generator, it can easily create stochastic variation by relying on the provided noise.
StyleGAN uses a consistent set of scaling and biasing factors for all its feature maps, allowing it to effectively manipulate global image properties such as pose, lighting, or background style. At the same time, it adds noise to each pixel independently, making it effective at controlling stochastic variation. If the network tried to use noise to control aspects like pose, the resulting spatially inconsistent decisions would be penalized by the discriminator. This allows the network to learn how to appropriately use global and local channels without explicit guidance.
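As an illustration of the two mechanisms described above, the following is a minimal sketch (not the official StyleGAN implementation) of the AdaIN operation from equation 2.4 and of per-pixel noise injection. Tensor shapes and the learned `noise_scale` parameter are assumptions made for the example.

```python
import torch

def adain(x, y_s, y_b, eps=1e-8):
    """Adaptive Instance Normalization (equation 2.4), a sketch.

    x   : feature maps of shape (N, C, H, W)
    y_s : per-channel style scale, shape (N, C)
    y_b : per-channel style bias,  shape (N, C)
    """
    mu = x.mean(dim=(2, 3), keepdim=True)
    sigma = x.std(dim=(2, 3), keepdim=True) + eps
    x_norm = (x - mu) / sigma
    return y_s[:, :, None, None] * x_norm + y_b[:, :, None, None]

def add_stochastic_noise(x, noise_scale):
    """Per-pixel noise injection: a single-channel noise map is scaled by a
    learned per-channel factor and added to the feature maps.

    noise_scale : learned tensor of shape (C,), assumed for this example
    """
    n, c, h, w = x.shape
    noise = torch.randn(n, 1, h, w)                     # single-channel noise input
    return x + noise_scale[None, :, None, None] * noise
```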
2.1.2.2 StyleGAN 2
Despite StyleGAN's success at generating high-quality images on various datasets, it has some issues. One is the presence of blob-like artifacts that resemble water droplets in many generated images. Another issue is "phase" artifacts, where features like teeth or eyes appear stuck in place when they ought to move smoothly over the image as we walk along the latent space. In their paper, Karras et al. [32] identify and analyze these characteristic artifacts of StyleGAN and suggest changes to the model architecture and training methods to address them.
Removing normalization artifacts
Many observers have noticed that most images generated by StyleGAN exhibit characteristic blob-shaped artifacts that resemble water droplets. An example of this artifact is shown in
figure 2.8. It is obvious that the artifact is present in all intermediate feature maps of the generator from around 64×64 resolution and becomes progressively stronger at higher resolutions.
Karras et al. argue that the issue arises from the use of the AdaIN operation. According to equation 2.4, the mean and variance of each feature map are normalized individually, which may destroy the information present in the relative magnitudes of the features. The authors posit that AdaIN discards information in the relative activation intensities of the feature maps; hence, the generator sneaks information past these layers, resulting in the water-droplet artifacts.
To address this problem, the authors of StyleGAN2 reconstruct Adaptive Instance Normalization as Weight Demodulation. This change is depicted step by step in figure 2.9. Figures 2.9-a and 2.9-b respectively illustrate StyleGAN's generator architecture and its expanded version, which breaks AdaIN into two constituent parts: normalization and modulation.
Karras et al. observe that if we move the bias and noise outside the style block and operate on normalized data, the result is more predictable. This modification allows normalization and modulation to operate on the standard deviation alone. Therefore, it is safe to remove the bias, noise, and normalization applied to the constant input.
In practice, the modulation part of AdaIN may amplify certain feature maps, and the normalization part takes responsibility for counteracting this amplification. On the other hand, when the normalization step is removed from the generator, the droplet artifacts disappear completely. As a result, Weight Demodulation is proposed as an alternative that removes the artifacts by eliminating normalization while still counteracting feature map amplification. In figure 2.9-c, each style block consists of modulation, convolution, and normalization. The modulation can alternatively be implemented as a scaling of the convolution weights:
w'_{ijk} = s_i \cdot w_{ijk}
Instance normalization is used to make the statistics of the feature maps free from the effect of s. The authors of StyleGAN2 argue that another way to achieve this goal is to base the normalization on the statistics of the incoming feature maps. The final standard deviation of the output activations is
\sigma_j = \sqrt{\sum_{i,k} \left( w'_{ijk} \right)^2}
The objective of the subsequent normalization is to restore unit standard deviation to the outputs; this is dubbed the demodulation process. Alternatively, we can fold it into the convolution weights, as in equation 2.5.
Trang 29$@<——B| = by © B Nonn ad Demod}>[_ Conv 3x3
|—>| Adan 2 Norm mean/std ,
L
Upsample U le Bees [A}>Dited Upsample
Conv 3x3 3 W3 Conv 3x3 W3 Conv 3x3 B=si>[ coma]
Bị 5i h————>9<——h] Nom sid
[AK] AdalN é ‘Norm mean/std bs > @ ‹ {B] bị a B
——_Y —, A Mod mean/std [AL———> L Mod std Ws
Conv 3x3 _ "4 Conv 3x3 Wy > Conv 3x3 lam Vv
_ œ—— BỊ l =— Demod>|_ Conv 3x3
- [AF >| Adal là; Senge >> $8] "——>‡—B
v (a) StyleGAN (b) StyleGAN (detailed) (c) Revised architecture (d) Weight demodulation
&
Style block
Fig 2.9 The architecture of StyleGAN2
w''_{ijk} = \frac{w'_{ijk}}{\sqrt{\sum_{i,k} \left( w'_{ijk} \right)^2 + \epsilon}} \quad (2.5)
The demodulation method is weaker than instance normalization because it is based on statistical assumptions about the signal rather than the actual contents of the feature maps.
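The following sketch illustrates how modulation and demodulation (equation 2.5) can be folded into the convolution weights. It is a simplified illustration rather than the official StyleGAN2 code; the weight and style shapes are assumptions made for the example.

```python
import torch

def modulate_demodulate(weight, style, eps=1e-8):
    """Weight (de)modulation used by StyleGAN2 in place of AdaIN (a sketch).

    weight : convolution weights of shape (out_ch, in_ch, k, k)
    style  : per-sample, per-input-channel scales s_i of shape (N, in_ch)

    Returns per-sample weights of shape (N, out_ch, in_ch, k, k).
    """
    # Modulation: w'_ijk = s_i * w_ijk
    w = weight[None, :, :, :, :] * style[:, None, :, None, None]

    # Demodulation (equation 2.5): rescale so each output feature map
    # has approximately unit standard deviation.
    sigma = torch.sqrt((w ** 2).sum(dim=(2, 3, 4), keepdim=True) + eps)
    return w / sigma
```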
Removing Progressive Growing
Progressive growing is the key mechanism that enables StyleGAN to successfully generate high-resolution images, but it also causes characteristic (or "phase") artifacts, because the generator seems to develop a preference for the positions of facial components (i.e., mouth, teeth, etc.). The official StyleGAN2 video demonstrates that features like teeth or eyes seem stuck in one place wherever the face of the person moves; figure 2.10 shows a related artifact. Karras et al. believe that the progressive growing technique is the root of the issue. In progressive growing, each resolution temporarily serves as the output resolution, compelling it to produce the highest-frequency details possible. This results in excessively high frequencies in the intermediate layers of the trained network, which compromises its shift-invariance. This motivated the StyleGAN2 authors' search for an alternative mechanism that retains the high-resolution generation ability without phase artifacts.
Both the generator and discriminator of StyleGAN employ a basic feedforward design. However, various recent studies aim at developing more effective network architectures, such as skip connections [44, 30], residual networks [17, 20], and hierarchical methods [60, 61]. It has been demonstrated that these techniques can improve the model's overall performance in general and in generative scenarios. As such, Karras et
al. reevaluate the design of the StyleGAN network and aim to find an architecture that can generate high-quality images without the use of progressive growing.
The new network architecture of StyleGAN2 is inspired by MSG-GAN [30]. Figure 2.11-a illustrates MSG-GAN, which connects the intermediate outputs of the generator and discriminator at the same resolutions by multiple skip connections. Adapting this idea, StyleGAN2 uses bilinear up-sampling layers to upsample the RGB outputs of different resolutions and then sums them together. In the discriminator, the generated image is downsampled and fed to each resolution block; this design is depicted in figure 2.11-b. Finally, Karras et al. further employ residual connections, as shown in figure 2.11-c. As proven in the StyleGAN2 publication, the skip generator and residual discriminator significantly improve FID and PPL without progressive growing.
Path length regularization and Lazy regularization
In the topic of image generation, the quantitative evaluation of synthetic images continues to be a tough problem. Recent works explicitly quantify the similarity between generated images and training data by using the Frechet Inception Distance (FID) [21]: it first employs a pre-trained classifier to extract image features for estimating the density distributions of generated images and training data, and finally calculates the difference between the two distributions with the Frechet distance. Precision and Recall (P&R) are used to quantify how many of the data classes in the training data the generator can produce. Both FID and P&R are based on classifier networks, but [14] shows that classifier networks concentrate on textures rather than shapes, while humans
Trang 31(a) MSG-GAN (b) Input/output skips (c) Residual nets
Fig 2.11 Some alternative network architectures of StyleGAN2 that replace progressive
growing
focus more on shapes. As such, these metrics are insufficient to assess image quality. In [32], Karras et al. conducted a small experiment to demonstrate that although two sets of generated images can have similar FID and P&R scores, their overall quality can be notably distinct.
StyleGAN assumes that if the interpolation of latent-space vectors results in an unexpectedly non-linear change in the generated image, this indicates that the latent space is entangled and the factors of variation are not sufficiently separated. Thus, it introduced a novel metric known as Perceptual Path Length (PPL) to measure the magnitude of the image's transformations during latent-space interpolation. PPL is calculated as the sum of perceptually-based pairwise image distances, measured using a weighted difference between two VGG16 embeddings, over each sub-linear segment of the interpolation path in the latent space.
In StyleGAN2, the authors observe that perceptual path length (PPL) correlates with image quality: the higher the PPL, the lower the image quality. They argue that the discriminator penalizes broken images during training; as such, when we perform latent-space interpolation in a low-quality region of the latent space, the image changes dramatically, resulting in a high perceptual path length. However, simply minimizing PPL would lead to a degenerate solution with a lack of diversity. To address this, Karras et al. proposed a new
regularizer that encourages a smoother mapping from the latent space to the generated images without this drawback.
The main idea behind this regularizer is that if we walk along the latent space with fixed-size steps, this should yield a non-zero, constant-magnitude change in the image. We can determine the deviation from this ideal by moving in random directions in the image space and monitoring the corresponding w gradients. The objective is to find a mapping from the latent space to images whose gradients have close to equal length regardless of the direction of movement in the latent space or the latent code w. The Jacobian matrix J_w = ∂g(w)/∂w is used to capture the local metric scaling properties of the generator mapping g(w) : W → Y at a single latent point w ∈ W. The regularizer is formulated as:
\mathbb{E}_{w,\, y \sim \mathcal{N}(0, \mathbf{I})} \left( \left\lVert \mathbf{J}_w^{T} y \right\rVert_2 - a \right)^2 \quad (2.6)
where the random images are denoted as y and w ~ f(z), with z normally distributed.
Practical experiments show that path length regularization results in more trustworthy and consistent results.
2.2 Sequence model
Sequence models are a type of deep learning model specifically designed to process sequential data, such as natural language, time series, or audio. These models typically use recurrent neural networks (RNNs) or their variants, such as long short-term memory (LSTM) networks or Bidirectional Recurrent Neural Networks (BiRNNs), to process the input sequence and capture the temporal dependencies between its elements.
2.2.1 Recurrent Neural Networks (RNN)
Artificial Neural Networks (ANNs), also referred to as Feed-Forward Neural Networks, are a type of machine learning model that processes inputs in a linear, unidirectional manner. Due to this design, ANNs are unable to capture the sequential information present in input data, which is necessary for tasks involving sequence data. When data elements are related to one another, such as in speech recognition, text generation, and semantic recognition of text or voice, treating each element independently is inadequate. To address this limitation, Recurrent Neural Networks (RNNs) were developed. RNNs possess a "memory" component, which allows them to recall previous information and thus effectively handle sequential data.
A basic Recurrent Neural Network (RNN) comprises a feedback loop, as illustrated in figure 2.12, using both current and past inputs to make decisions. RNNs allow for a
Fig. 2.12 A Recurrent Neural Network (RNN) has a recurrent relation on the hidden state; this looping constraint ensures the capture of sequential information in the input data.
degree of flexibility in terms of the activation function employed, with commonly used options including the Sigmoid, Tanh, and ReLU functions. The process of training an RNN includes three principal steps. First, the network performs a forward pass and makes a prediction. Second, this prediction is compared to the true value using a loss function, which outputs a value representing the error or discrepancy between the prediction and the true value. Lastly, the error value is used to perform backpropagation, which calculates the gradients for each node in the network. There are various architectures of RNN available; some examples are One-to-One, One-to-Many (where a single input can produce multiple outputs, such as in music generation), and Many-to-One (where multiple inputs
from different time steps produce a single output, such as in sentiment analysis or emotion recognition).
One of the main disadvantages of RNNs is the so-called "vanishing gradient" problem. This occurs when the gradient of the loss function with respect to the weights of the network becomes very small, which makes it difficult for the network to learn. This can be a particular problem for traditional RNNs, which only process input sequences in one direction and can struggle to capture long-term dependencies in the data. Additionally, some RNNs can be difficult to parallelize and train on GPUs, which can further limit their computational efficiency. This is because the sequential nature of RNNs makes it difficult to distribute the computations across multiple GPUs, which can limit their ability to take advantage of parallel processing.
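For concreteness, the following is a minimal sketch of a vanilla RNN cell and of unrolling it over a sequence, showing how the hidden state carries information from previous time steps; the layer sizes and interfaces are assumptions made for the example.

```python
import torch

class VanillaRNNCell(torch.nn.Module):
    """Minimal recurrent cell: the hidden state feeds back at every step."""

    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.in2hid = torch.nn.Linear(input_size, hidden_size)
        self.hid2hid = torch.nn.Linear(hidden_size, hidden_size)

    def forward(self, x_t, h_prev):
        # h_t = tanh(W_x x_t + W_h h_{t-1} + b): current input plus memory of the past
        return torch.tanh(self.in2hid(x_t) + self.hid2hid(h_prev))

def run_rnn(cell, inputs, hidden_size):
    """Unroll the cell over a sequence of shape (T, N, input_size)."""
    h = torch.zeros(inputs.size(1), hidden_size)
    outputs = []
    for x_t in inputs:              # process the sequence step by step
        h = cell(x_t, h)
        outputs.append(h)
    return torch.stack(outputs)     # (T, N, hidden_size)
```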
2.2.2 Bidirectional Recurrent Neural Networks
Bidirectional RNNs (BRNNs) [46] are a type of recurrent neural network that can process input sequences in both forward and backward directions. This allows them to capture both past and future context, giving them a potential advantage over traditional RNNs, which only process input sequences in one direction (typically forward, from left to right).
One potential advantage of BRNNs over traditional RNNs is that they can better capture long-term dependencies in the input data. Traditional RNNs can struggle to capture long-term dependencies because they only have access to past context, which may not be sufficient to fully understand the underlying structure of the data. By processing the input in both directions, BRNNs can better capture the full context of the input, which can help improve their performance on tasks such as language modeling or machine translation.
Another potential advantage of BRNNs is that they can be more efficient to train. Because traditional RNNs only process input sequences in one direction, they must be unrolled through time in order to compute gradients for backpropagation. This can be computationally expensive for long input sequences. BRNNs, on the other hand, can be trained using a technique called "backward-forward training", which avoids the need to unroll the network through time. In the backward phase, the gradient of the loss function is computed using the standard backpropagation algorithm, but only for the backward RNN. This computes the gradient of the loss with respect to the weights of the backward RNN, but not the forward RNN. This can be written mathematically as:
Fig. 2.13 Illustration of Bidirectional Recurrent Neural Networks.
\frac{\partial L}{\partial W_{bck}} = \frac{\partial L}{\partial Y_{bck}} \cdot \frac{\partial Y_{bck}}{\partial S_{bck}} \cdot \frac{\partial S_{bck}}{\partial W_{bck}}
where L is the loss function, W_{bck} are the weights of the backward RNN, Y_{bck} are the outputs of the backward RNN, and S_{bck} are the hidden states of the backward RNN.
In the forward phase, the gradient of the loss is computed using a variant of the backpropagation algorithm that is tailored to BRNNs. This computes the gradient of the loss with respect to the weights of the forward RNN, but not the backward RNN. This can be written mathematically as:
\frac{\partial L}{\partial W_{fwd}} = \frac{\partial L}{\partial Y_{fwd}} \cdot \frac{\partial Y_{fwd}}{\partial S_{fwd}} \cdot \frac{\partial S_{fwd}}{\partial W_{fwd}}
where W_{fwd} are the weights of the forward RNN, Y_{fwd} are the outputs of the forward RNN, and S_{fwd} are the hidden states of the forward RNN.
By decomposing the computation of the gradient into these two phases, backward-forward
training allows BRNNs to be trained more efficiently than traditional RNNs, especially for long input sequences.
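The sketch below illustrates the bidirectional idea: the same sequence is processed once in the forward direction and once in the backward direction (for example with cells like the VanillaRNNCell sketched earlier), and the two hidden states are concatenated at every time step. The cell interfaces are assumptions made for the example.

```python
import torch

def bidirectional_rnn(fwd_cell, bck_cell, inputs, hidden_size):
    """Minimal bidirectional RNN sketch.

    inputs : tensor of shape (T, N, input_size)
    Returns a tensor of shape (T, N, 2 * hidden_size) where every time step
    sees both past (forward pass) and future (backward pass) context.
    """
    T, N, _ = inputs.shape
    h_fwd = torch.zeros(N, hidden_size)
    h_bck = torch.zeros(N, hidden_size)
    fwd_states, bck_states = [], [None] * T

    for t in range(T):                       # forward direction
        h_fwd = fwd_cell(inputs[t], h_fwd)
        fwd_states.append(h_fwd)
    for t in reversed(range(T)):             # backward direction
        h_bck = bck_cell(inputs[t], h_bck)
        bck_states[t] = h_bck

    # Concatenate the forward and backward hidden states per time step.
    return torch.stack([torch.cat([f, b], dim=-1)
                        for f, b in zip(fwd_states, bck_states)])
```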
2.3 Optical Flow
Optical flow is the pattern of apparent motion of objects, surfaces, and edges in a visual scene caused by the relative motion between an observer and the scene. It is typically represented as a vector field, where each vector represents the motion of a point in the image from one frame to the next.
Optical flow can be computed over time by tracking the motion of points or features in a sequence of frames from a video. This can be done using a variety of techniques, such as feature-based methods, which track the motion of individual points or edges in the image, or region-based methods, which track the motion of larger image regions.
Once the optical flow has been computed for a sequence of frames, it can be used for a
wide range of applications, such as video compression, where accurate estimates of motion can be used to reduce the amount of data needed to represent the video, or action recognition, where the optical flow can be used to identify and classify different types of motion in the
video.
2.3.1 Handcrafted Features for Optical Flow Estimation
Using handcrafted features for optical flow refers to the process of manually designing and extracting features from an image, such as edges and corners, and using them to estimate the motion of objects in the scene. This approach is in contrast to deep learning-based methods, which use convolutional neural networks (CNNs) to automatically learn features from the data.
There are several algorithms that use handcrafted features for optical flow estimation. Some examples include:
Lucas-Kanade [38]: This algorithm uses the brightness constancy assumption, which states that the intensity of a pixel in an image remains constant over time. It uses this assumption to calculate the optical flow between two images by minimizing the sum of squared differences between the intensity values of corresponding pixels. This is formulated as an optimization problem, where the objective function is the sum of squared differences and the optimization variables are the optical flow vectors. The objective function can be written as:
E(p) = \sum_{x} \left[ I_1(x + p) - I_2(x) \right]^2
where E(p) is the objective function, I_1 and I_2 are the two images, x is a pixel in the first image, and p is the optical flow vector for that pixel.
The goal of the Lucas-Kanade algorithm is to find the optimal optical flow vectors p that minimize the objective function. This can be done using gradient descent, where the gradient of the objective function is calculated and used to update the optical flow vectors in the direction that reduces the objective function. This process is repeated until the optical flow vectors converge to a minimum of the objective function, at which point the algorithm terminates and the final optical flow vectors are returned as the result.
Horn-Schunck [23]: This algorithm is similar to Lucas-Kanade, but it also takes into account the smoothness of the optical flow field. It uses an iterative approach to minimize the sum of squared differences between the intensity values of corresponding pixels, as well as a regularization term that encourages smoothness in the flow field. The objective function can be written as:
E(p) = E_{Lucas\text{-}Kanade}(p) + \lambda\, E_{smooth}(p)
where the smoothness term E_{smooth}(p) penalizes large spatial variations of the flow field and \lambda balances the data and smoothness terms.
Farneback [12]: This algorithm uses a pyramid structure to calculate the optical flow between two images. It first calculates the flow at the lowest resolution and then uses this flow to initialize the flow at the next higher resolution. This process is repeated until the flow at the highest resolution is calculated.
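In practice, a dense Farneback flow field can be computed with OpenCV, as in the sketch below; the file names and parameter values are illustrative assumptions, not tuned settings.

```python
import cv2

# Dense optical flow with OpenCV's Farneback implementation (a sketch).
prev_gray = cv2.cvtColor(cv2.imread("frame_0.png"), cv2.COLOR_BGR2GRAY)
next_gray = cv2.cvtColor(cv2.imread("frame_1.png"), cv2.COLOR_BGR2GRAY)

flow = cv2.calcOpticalFlowFarneback(
    prev_gray, next_gray, None,
    0.5,   # pyr_scale: each pyramid level is half the previous resolution
    3,     # levels: coarse-to-fine pyramid, as described in the text
    15,    # winsize: averaging window size
    3,     # iterations per pyramid level
    5,     # poly_n: neighborhood size for polynomial expansion
    1.2,   # poly_sigma: Gaussian std used for the expansion
    0)     # flags

# flow has shape (H, W, 2): per-pixel displacement (dx, dy) between the frames.
```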
Using handcrafted features for optical flow estimation has both advantages and disadvantages. One advantage of using handcrafted features is that they are interpretable and can provide insights into the motion of objects in the scene. Since the features are designed by humans, they can be easily understood and explained, which can be useful for understanding the behavior of the algorithm and debugging problems. Another advantage of using handcrafted features is that they can be tailored to specific types of scenes and motion patterns. By carefully designing the features, it is possible to improve the accuracy of the optical flow estimates in certain scenarios.
However, there are also disadvantages to using handcrafted features for optical flow estimation. One disadvantage is that they can be susceptible to noise and other factors that can affect their performance. Since the features are designed by humans, they may not be
able to handle the complex patterns of motion that can occur in real-world scenes. Another disadvantage of using handcrafted features is that they require significant effort to design and implement. This can be time-consuming and labor-intensive, and it may not be feasible for large-scale applications.
2.3.2 RAFT: Recurrent All-Pairs Field Transforms for Optical Flow
Deep learning methods have become popular for optical flow estimation in recent years because they can often produce more accurate results than traditional methods. This is because deep learning methods are able to learn and model the complex patterns and relationships in the data that are often present in optical flow estimation tasks. The paper "RAFT: Recurrent All-Pairs Field Transforms for Optical Flow" by Teed and Deng [48], published in 2020, introduces RAFT, a deep architecture that builds all-pairs correlation volumes and recurrently refines the flow field. It combines a recurrent all-pairs field transform with a convolutional neural network (CNN) to produce accurate and smooth optical flow estimates even for challenging scenes. At the time of publication, RAFT achieved state-of-the-art performance on well-known datasets such as KITTI and Sintel, and the paper was selected as the Best Paper at ECCV 2020, a leading computer vision conference.
Fig. 2.14 The RAFT architecture comprises three key components: a feature encoder that extracts per-pixel features from both input images, a context encoder that extracts features from only the first input image, and a correlation layer. The correlation layer generates a 4-dimensional volume by taking the inner product of all pairs of feature vectors. It then applies pooling at multiple scales over the last 2 dimensions to produce a set of multi-scale volumes. Finally, an update operator uses the current estimate to recurrently update the optical flow by looking up values from the set of correlation volumes.
RAFT consists of three main components:
Feature extraction is similar to what is done in general deep learning architectures, using convolutional networks to emphasize significant features.
Computing visual similarity aims at calculating the similarity between a specific part of the previous frame and each part of the subsequent frame in a brute-force fashion.
Iterative updates is an approach that increases accuracy by performing inference iteratively. In other words, if the number of iterations is small, the calculation time is short but the accuracy is relatively low; if the number of iterations is large, the calculation time is long but the accuracy tends to be relatively high.
The RAFT architecture is an innovative approach that combines traditional optimization techniques with modern deep learning methods. It features a feature encoder that extracts per-pixel features, a correlation layer that calculates the similarity among pixels, and an update operator that simulates the steps of an iterative optimization algorithm. Unlike traditional methods, the RAFT architecture utilizes deep learning techniques to automatically learn and adapt the features and motion priors, rather than relying on manual crafting.
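The following is a highly simplified sketch of the iterative refinement idea (it is not the actual RAFT implementation): a recurrent update operator repeatedly looks up correlation features around the current flow estimate and predicts a residual flow. The `corr_lookup` and `update_cell` callables, and the shapes involved, are hypothetical placeholders for this example.

```python
import torch

def iterative_flow_refinement(corr_lookup, update_cell, context, flow_init, n_iters=12):
    """Simplified sketch of RAFT-style iterative flow refinement.

    corr_lookup : function mapping the current flow estimate to correlation
                  features sampled from the multi-scale correlation volumes
    update_cell : recurrent update operator (e.g. a conv-GRU) that predicts a
                  residual flow from correlation + context features
    context     : features from the context encoder (first frame only)
    flow_init   : initial flow estimate, typically zeros at 1/8 resolution
    """
    flow = flow_init
    hidden = torch.zeros_like(context)             # recurrent hidden state (assumed shape)
    for _ in range(n_iters):                       # more iterations -> higher accuracy
        corr_feats = corr_lookup(flow)             # look up visual similarity around the flow
        hidden, delta_flow = update_cell(hidden, corr_feats, context, flow)
        flow = flow + delta_flow                   # refine the estimate
    return flow
```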
Fig. 2.15 Optical flow estimates after each iteration (iterations 0, 1, 2, 3, 5, 8, 11, 15, 19) for two specific frames of a racing car; the higher the number of iterations, the better the accuracy.
In conclusion, this chapter provides an overview of essential concepts that are crucial to our research. By introducing Generative Adversarial Networks (GANs) and the StyleGAN family, we provide a foundation for understanding image generation with deep learning. Additionally, we delve into the world of sequence data by discussing recurrent neural networks (RNNs), long short-term memory (LSTM) networks, and Bidirectional Recurrent
Neural Networks (BiRNNs), and touch upon the important concept of optical flow in video processing. The knowledge gained from this chapter will serve as a foundation for the work presented in subsequent chapters.