VIETNAM NATIONAL UNIVERSITY - HO CHI MINH CITY
UNIVERSITY OF INFORMATION TECHNOLOGY
FACULTY OF COMPUTER SCIENCE
BACHELOR THESIS
SPATIO-TEMPORAL RE-RENDERING FOR
FACIAL VIDEO RESTORATION
Bachelor of Computer Science (Honors degree)
NGO HUU MANH KHANH - 19520125
NGO QUANG VINH - 19520354
Supervised by
Dr. Nguyen Vinh Tiep
HO CHI MINH CITY, 2023
LIST OF THE THESIS DEFENSE COMMITTEE
The thesis defense committee was established under Decision No. ........, dated ........, by the Rector of the University of Information Technology.
1. ........................ - Chairperson.
2. ........................ - Secretary.
3. ........................ - Member.
research process and successfully complete this thesis.
We would like to express our sincere thanks to the Dean of the Faculty and all the teachers
in the Faculty of Computer Science, University of Information Technology, for their support and for helping us prepare enough knowledge to complete this thesis.
We are also grateful to the Multimedia Laboratory (MMLab-UIT) for providing us with a conducive research environment and state-of-the-art equipment for this research. Furthermore, we would like to extend our appreciation to the researchers of MMLab for their valuable feedback and critical questions, which greatly contributed to our research and helped us identify and correct mistakes and improve the quality of this thesis.
Old facial films are a great source of historical value, giving us a vivid picture of significant figures from the past. However, because they were captured with old camera technology, these films are low-quality and exhibit visual artifacts like pepper noise and stripes. Besides, old films can be damaged by poor storage conditions. As a result, they are difficult or impossible to watch. There is a demand to restore and preserve these old films so future generations can enjoy them. Not limited to restoring old films, facial restoration can also be used for security purposes. More specifically, surveillance cameras are installed in many public places to prevent crime, but their recordings are often low-quality due to camera resolution and poor lighting, making it difficult to identify people. Facial video restoration is a solution to this problem: it upgrades the quality of faces in the video and makes it easier to identify offenders. Although the similar problem of facial image restoration has been researched for a long time, work on facial video restoration remains less explored.
Current facial image restoration models have impressive performance, and we can directly use them for video restoration by restoring each frame individually. Nonetheless, this approach struggles with flickering, since these models are designed for image restoration and do not take temporal information into account. In this thesis, we propose Spatio-temporal Re-rendering for Antique Facial Video Restoration (STERR-GAN), a facial video restoration model that employs both temporal and spatial information for restoration. Our experiments show that our model can address the flickering problem and yield better results.
In addition, to the best of our knowledge, while datasets for facial image restoration and general video restoration are available, a dataset for the facial video restoration domain is still unavailable. As such, we introduce the VAR dataset (Video dataset for Antique Restoration), a new video restoration dataset for the facial domain. We expect that this dataset will become a valuable resource for measuring the performance of future models and advancing research in this area.
Table of contents
2.2.1 Recurrent Neural Networks (RNN)
2.2.2 Bidirectional Recurrent Neural Networks
3.3.1 Overview
3.3.2 3D convolutional-based
4.2.1 Overview
4.2.2 Temporal Recurrent Network
4.2.3 Facial Re-rendering
4.2.4 Objective function
5 Experiment
5.1 Metrics
5.1.1 Peak Signal-to-Noise Ratio (PSNR)
5.1.2 Structural Similarity Index Measure (SSIM)
5.1.3 Temporal Stability (E_warp)
5.1.4 Frechet Video Distance (FVD)
List of figures
1.1 Example of Facial Video Restoration and some old film deterioration
2.1 Story of counterfeit money
2.2 Approximating the data distribution with GAN
2.3 Backpropagation in Generator training
2.4 Backpropagation in Discriminator training
2.5 The progress in generating face images using GAN models
2.6 The architecture of the StyleGAN generator
2.7 Illustrative example with two factors of variation
2.8 Example of water droplet-like artifacts in StyleGAN images
2.9 The architecture of StyleGAN2
2.10 Example of "phase" artifacts
2.11 Some alternative network architectures of StyleGAN2
2.12 Illustration of Recurrent Neural Networks
2.13 Illustration of Bidirectional Recurrent Neural Networks
2.14 The architecture of RAFT
2.15 Example of optical flow estimation
3.1 Illustration of the GFP-GAN framework
3.2 Overview of DeepRemaster
3.3 Illustration of the source-reference attention layer
3.4 Illustration of the framework proposed by Wan et al. [51]
3.5 Overview of the Wan et al. pipeline
4.1 The process of collecting the Video dataset for Antique Restoration (VAR)
4.2 Some samples from VAR
4.3 Visualization of the STERR-GAN framework
5.1 Qualitative comparison
5.2 User study result
List of tables
5.1 Quantitative result of STERR-GAN, GFP-GAN and DeepRemaster
5.2 Ablation study of STERR-GAN
Chapter 1
Introduction
1.1 Overview
1.1.1 Practical Context
In the late 19th century, motion pictures were first introduced to mankind. From that time, a surprising number of films were recorded and released. However, due to the technology of that era, films were low-quality and exhibited visual artifacts like pepper noise and stripes. In addition, old films suffered from degradation due to poor storage conditions. With all of these factors, the significant historical value of old videos can be lost. Although film restoration techniques have been created to bring these antique films back to life, the process is laborious. Nowadays, video restoration is typically conducted digitally, with artists manually retouching each frame to remove blemishes, fix flickering, and perform colorization. However, this process is extremely time-consuming and expensive, as it requires examining and repairing every single frame of the old film. As a result, there is a desire for an algorithm that can automate these tedious tasks, allowing old films to be restored and given a more modern appearance at a lower cost. Old film restoration, or more generally Video Restoration, has many applications in real life.
Preserving historical video footage
Preserving historical video footage is an essential application of video restoration technology. Historical video footage refers to videos that capture important events, people, or cultural artifacts from the past. These recordings can be a valuable source of information and cultural heritage, and it is essential to preserve them for future generations.
However, video recordings are often subject to degradation over time due to factors such as wear and tear, exposure to heat and moisture, and the passage of time. This can make it difficult to view or use these recordings, as they may be blurry, distorted, or otherwise of poor quality. In addition, many historical video recordings are stored in formats that are no longer widely used, such as VHS tapes or film reels, making it difficult to access or view the footage.
Video restoration techniques can be used to preserve and restore historical video footage, improving the quality of the video and making it possible to view and study these recordings in greater detail. This can involve various techniques, such as noise reduction, color correction, and image enhancement. By using video restoration techniques to improve the quality of these recordings, it is possible to preserve and share these essential pieces of history for future generations.
Enhancing the clarity of surveillance footage
Surveillance footage is typically captured by cameras placed in strategic locations to monitor and record activity in a particular area. This footage is often used for a variety of purposes, such as security, crime prevention, and investigation.
However, surveillance footage can often be of low quality due to factors such as poor lighting, camera movement, and noise. This can make it difficult to identify people and objects in the footage, which makes it less useful for its intended purpose. Video restoration techniques can be used to improve the clarity of surveillance footage by applying a variety of techniques such as noise reduction, color correction, and image enhancement.
For example, some video restoration models are proposed to remove noise or blur from the footage, making it easier to see details such as facial features or license plate numbers. These techniques can help to improve the effectiveness of surveillance footage by making it easier to identify people and objects in the video, which can be useful for security, crime prevention, and investigation purposes.
1.1.2 Problem Definition
Facial Video Restoration is a subfield of video restoration that aims to restore high-quality faces from low-quality counterparts with various kinds of deterioration, such as low resolution, noise, blur, compression artifacts, etc. Figure 1.1 illustrates an example of facial video restoration.
• Input: a sequence of old film frames containing a complex mixture of degradation, such as film grain noise (blue box) or scratches (red arrow).
• Output: the corresponding colored high-quality video.
Fig. 1.1 Example of Facial Video Restoration and some kinds of old film deterioration, including film grain noise (b), which make old films challenging to restore to their original quality.
1.1.3 Challenges
Besides the common challenges of computer vision tasks, facial video restoration has its own difficulties.
• Lack of dataset. The training dataset is one of the primary difficulties that we face in this work. A paired dataset is unavailable for our problem, and the previous work [51] uses a synthesized dataset. Moreover, to the best of our knowledge, the available data for facial video restoration is insufficient.
• Keeping facial detail. The face contains a lot of subtle details that are important for conveying emotions and expressions. It can be challenging to restore a video in a way that preserves these details while still improving the overall quality of the image. Besides, the appearance of the face can be affected by complex lighting conditions, such as shadows, highlights, and reflections. This can make it challenging to correct color and exposure issues in the facial region.
• Flickering problem. The flickering problem refers to unwanted changes in brightness or color in restored video sequences. It can be particularly noticeable in high-motion or low-light scenes and can be distracting and unpleasant for viewers.
• Requires high computational resources. Since we apply complex image processing techniques to a large number of frames in a video, these techniques can be computationally intensive, especially when applied to high-resolution videos.
• Old films contain a complex mixture of degradation. Due to poor storage conditions and old capture techniques, antique videos often contain many distortions. Therefore, comprehensively mitigating these issues in a single deep neural network is difficult.
1.2 Motivation
From our survey, there is much research on video and old film restoration, such as Video Restoration [56, 37, 5] and Facial Image Restoration [27, 51]. On the other hand, although facial video restoration has many practical applications in preserving old films, security, and crime prevention, work on this topic is less explored. Therefore, we choose Facial Video Restoration as our research topic in this thesis. General Video Restoration and Facial Image Restoration are similar to our topic, so we can use them indirectly, but doing so has some disadvantages.
Facial image restoration works [56, 37, 5] point out that facial image restoration is more challenging than general image restoration; as a result, they propose novel architectures and employ prior knowledge of the face to recover the details of facial components. Their results indicate that these specific designs improve the quality of restored images. However, if we adapt these approaches to facial video restoration in a frame-wise manner, the result will exhibit illuminance flickering since they ignore the temporal correlation between frames.
General Video Restoration works [27, 51] aim at automatically restoring spatio-temporal deterioration. Unlike damaged images, which have only spatial degradation, the deterioration in consecutive frames of a video is correlated. Hence, it is necessary to leverage long-term temporal clues to remove the degradation and render stable content. To do so, [27, 51] utilize techniques that employ temporal information to avoid illuminance flickering in the resulting videos. However, we cannot directly apply General Video Restoration to the facial domain, because the facial domain is a special case and can be more challenging than general video restoration for several reasons. First, the face is a rich source of information, with many subtle details that can be important for conveying emotions and expressions. As a
result, it can be challenging to restore a video in a way that preserves these details while still improving the overall quality of the image. Second, the face is often the primary focus of attention in a video, which means that any defects or artifacts in the facial region are more noticeable to the viewer than defects in other parts of the image. This can make it more difficult to achieve an acceptable level of quality when restoring facial video.
From the above observations, we hypothesize that we can alleviate the issues of the two mentioned approaches by leveraging long-term temporal clues in combination with frame features (spatio-temporal information) and specific designs for facial restoration (facial re-rendering).
1.3 Objectives
In this thesis, we aim to propose a novel video restoration framework that achieves restoration results comparable with state-of-the-art facial image restoration while producing illuminance-consistent results. We observe that current facial image restoration approaches have achieved acceptable results, but when we apply them to restore the frames of a video independently, the result exhibits an inconsistency in illuminance between adjacent frames (i.e., flickering). To address this issue, we devised the idea of leveraging both long-term temporal clues and spatial information from each frame.
1.4 Contributions
Our main contributions in this work are listed in the following points:
• Investigating an overview of related research. We conduct a survey of current research on Blind Face Restoration and Video Restoration, then identify the key components we can leverage to design a framework for Facial Video Restoration.
• Introducing the VAR dataset. To our knowledge, the data available for facial video restoration is inadequate. Hence, we introduce the Video dataset for Antique Restoration (VAR).
• Proposing STERR-GAN. We propose Spatio-temporal Re-rendering for Antique Facial Video Restoration, namely STERR-GAN, for facial video restoration. STERR-GAN
adopts a generative prior for facial restoration (the "Re-rendering" term) and utilizes both temporal and spatial information (the "Spatio-temporal" term) to produce robust result videos.
1.5 Dissertation Structure
Our thesis consists of 6 chapters:
• Chapter 1: Introduction. This chapter presents an overview of the facial video restoration problem, including the research motivation, definition, challenges, and our main contributions.
• Chapter 2: Fundamentals. This chapter presents fundamental knowledge that is important to our thesis.
• Chapter 3: Related work. This chapter presents prior research that is related to our thesis.
• Chapter 4: Methodology. This chapter presents our proposed STERR-GAN framework for facial video restoration in detail. We also introduce the VAR dataset and its collection protocol.
• Chapter 5: Experiments. This chapter reports our experimental results and the process of how we improved the performance of our framework.
• Chapter 6: Conclusion. This chapter summarizes our thesis and our main contributions; we also mention some future work.
Chapter 2
Background
In this chapter, we present some essential concepts that are crucial to our research. To start, we cover the Generative Adversarial Network (GAN) [15], providing a brief overview of its underlying concepts, architecture, and training objectives. Next, we delve into the StyleGAN family [31, 32], a GAN-based generative model that is powerful at image generation. Afterward, we introduce several sequence models, including recurrent neural networks (RNNs), long short-term memory (LSTM) networks, and Bidirectional Recurrent Neural Networks (BiRNNs), all of which are crucial when working with sequence data such as videos. Finally, we briefly touch upon optical flow, an important concept in video processing, which models the movement of objects in a video.
2.1 GAN
The Generative Adversarial Network (GAN) [15] was first introduced by Ian Goodfellow et al. in 2014 at NeurIPS. GAN brought an innovation horizon to deep learning, specifically in computer vision, and has become a promising direction for generative models. Specifically, GAN's approach is based on adversarial training, which trains a generator to fool the Discriminator with deceptive data. GAN has become the core of various applications in recent years.
2.1.1 Intuition of GAN
To understand the intuition of GAN, think about the story of counterfeit money making illustrated in figure 2.1. In city A, a criminal group wanted to make counterfeit money, and they had already prepared all the materials and tools. However, they did not know how to make fake money. They came up with the idea that they would first make a sample of counterfeit money
Fig. 2.1 Story of counterfeit money¹
and use it to fool the police. If the police can distinguish the fake money, the criminals will adjust and make new counterfeit money to fool the police again. This process repeats several times until the police can no longer recognize the fake money. At that point, the criminal group has succeeded in making counterfeit money.
The idea behind GAN is similar to the counterfeit money making in the above story. Theoretically, GAN is based on zero-sum game theory and consists of 2 players, a Generator and a Discriminator, each of which has an opposing objective. The Generator tries to generate fake data indistinguishable from the real data (it plays the same role as the criminal group in the previous story). Meanwhile, the Discriminator tries to distinguish fake data from real data (similar to the police). The objective of training the Generator and Discriminator is to find a balanced point (Nash equilibrium) at which the generated fake data is identical to the real data and can thus fool the Discriminator. The two main points of GAN are approximating the data distribution and adversarial training.
• Approximating the data distribution
GAN assumes that the data distribution is p_d(x). The Generator is considered to map from a simple input distribution p_z(z) (usually a Gaussian distribution) to the generative distribution p_g(z). The objective of Generator training is to bridge the gap between p_g(z) and p_d(x), as illustrated in figure 2.2.
• Adversarial Training
¹ https://towardsdatascience.com/the-math-behind-gans-generative-adversarial-networks-3828f3469d9c
Fig. 2.2 Approximating the data distribution with GAN
In GAN training, the generative distribution p_g(z) is implicitly compared with the target data distribution p_d(x) by the Discriminator. When p_g(z) is equal to p_d(x), the Discriminator cannot distinguish fake data from real data, i.e., GAN reaches the Nash equilibrium.
We first randomly sample a data point z from the input distribution p_z(z) and get fake data G(z) by passing z through the Generator. We also choose a real data point x from the dataset. Then the Discriminator is used to determine the realness of G(z) and x; D(G(z)) and D(x) are the probabilities that G(z) and x are real data, respectively. The objective function of the Generator is equation 2.1. We need to minimize this objective function when training the Generator, since we want it to be able to fool the Discriminator.

\min_G L(G) = \mathbb{E}_{x \sim p_d}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))] \quad (2.1)

In contrast, the Discriminator is trained to clearly identify whether G(z) and x are fake or real data. Thus, we use an objective function similar to 2.1, but maximize it instead of minimizing it as in Generator training. The objective function of the Discriminator is equation 2.2.

\max_D L(D) = \mathbb{E}_{x \sim p_d}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))] \quad (2.2)

In summary, the objective of GAN is:

\min_G \max_D L(G, D) = \mathbb{E}_{x \sim p_d}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))] \quad (2.3)
² https://developers.google.com/machine-learning/gan
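To make the alternating optimization of equations 2.1-2.3 concrete, the following is a minimal, illustrative PyTorch sketch of one adversarial training step. It is not the implementation used in this thesis; the `generator` and `discriminator` modules, the optimizers, and the batch of real images are assumed to be defined elsewhere, and the discriminator is assumed to output a probability (i.e., to end with a sigmoid).

```python
import torch
import torch.nn.functional as F

def gan_training_step(generator, discriminator, g_opt, d_opt, real_batch, latent_dim):
    """One alternating update of Discriminator and Generator (a sketch)."""
    batch_size = real_batch.size(0)
    real_labels = torch.ones(batch_size, 1)
    fake_labels = torch.zeros(batch_size, 1)

    # --- Discriminator step: maximize log D(x) + log(1 - D(G(z))) ---
    z = torch.randn(batch_size, latent_dim)
    fake_batch = generator(z).detach()          # stop gradients from flowing into G
    d_loss = F.binary_cross_entropy(discriminator(real_batch), real_labels) + \
             F.binary_cross_entropy(discriminator(fake_batch), fake_labels)
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # --- Generator step: try to fool D ---
    # (the non-saturating form -log D(G(z)) is used here, as is common in practice)
    z = torch.randn(batch_size, latent_dim)
    g_loss = F.binary_cross_entropy(discriminator(generator(z)), real_labels)
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()
```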
Fig. 2.5 The progress in generating face images using GAN models and their extensions over a nearly 5-year period (2014-2018). The images from left to right are respectively reprinted from GAN, DCGAN, CoGAN, ProgressiveGAN, and StyleGAN.
2.1.2 StyleGAN
Figure 2.5 illustrates the evolution of the quality and resolution of images generated by Generative Adversarial Networks (GANs). However, one of the most important components of a GAN, the generator, has been neglected and is still used as a black box. Despite recent attempts, a comprehensive understanding of the synthesis process is still lacking. For example, the poor understanding of the latent space's properties makes the image synthesis process more complicated.
Motivated by this, Karras et al. proposed StyleGAN, a type of generative adversarial network inspired by the style transfer literature. Instead of sampling a vector from the latent space as input, StyleGAN takes a fixed learned input and uses ADAptive Instance Normalization (AdaIN) to change the style of the image according to the information embedded in the latent code estimated by the mapping network. This design allows users to control the synthesis intuitively and at specific scales. Although the style-based GAN (StyleGAN) has achieved state-of-the-art results in image generation, some characteristic artifacts exist in the images it generates. To deal with this issue, Karras et al. expose and analyze several characteristic artifacts and propose StyleGAN2, an extended version of StyleGAN with some modifications to the architecture and a novel training scheme.
2.1.2.1 StyleGAN 1
Style-based generator
Fig. 2.6 The architecture of the StyleGAN generator, which gives control over the style of the generated image.
In the vanilla generator, the latent code is sampled from a normally distributed latent space Z, as shown in figure 2.6-a, and then provided to the generator at the first layer. After a series of convolution blocks, the latent code is transformed into an image. The authors of StyleGAN point out that this principle forces the latent space Z to follow a fixed probability density and results in unavoidable entanglement. To avoid this, StyleGAN instead takes a learnable constant tensor as input, while the latent code is transformed into an intermediate latent space W via a non-linear mapping network f : Z → W. The mapping network is composed of several fully connected layers (StyleGAN uses an 8-layer MLP). A learned affine transformation A is employed to transform the latent code w ∈ W into styles y = (y_s, y_b) for controlling the style of generated images through Adaptive Instance Normalization (AdaIN), as illustrated in figure 2.6-b. The AdaIN operation is defined as:
\mathrm{AdaIN}(x_i, y) = y_{s,i} \frac{x_i - \mu(x_i)}{\sigma(x_i)} + y_{b,i} \quad (2.4)

In this design, the latent code is disentangled, which means the latent space consists of linear subspaces, each of which controls one factor of variation. The input latent code
w ∈ W is produced by the non-linear mapping network f. In other words, w is a combination
of factors of variation from subspaces. Therefore, the images generated by StyleGAN are more realistic.
Fig. 2.7 Illustrative example with two factors of variation (image features, e.g., masculinity and hair length). (a) An example training set where some combination (e.g., long-haired males) is missing. (b) This forces the mapping from Z to image features to become curved so that the forbidden combination disappears in Z, preventing the sampling of invalid combinations. (c) The learned mapping from Z to W is able to "undo" much of the warping.
For example, consider figure 2.7. Sub-figure 2.7-a illustrates a distribution of features in training data with 2 factors of variation (masculinity and hair length) but with the long-haired male combination missing. In a traditional generator, the latent code is sampled from a fixed distribution (e.g., a normal distribution). Therefore, the feature distribution according to Z is forced to follow the same distribution as Z (shown in sub-figure 2.7-b) to prevent the sampling of invalid combinations. This precludes the factors from being fully disentangled. In contrast, in sub-figure 2.7-c, W is a learned latent space, so it better models the feature distribution of the training data.
Stochastic variation
Some aspects of human portraits, such as the exact placement of hairs, stubble, freckles, or skin pores, can be considered stochastic. If these aspects change following a distribution, the perception of the image does not change.
In traditional generators, the input is fed to the network only through the first layer. To produce the stochastic variation of the mentioned human portrait aspects, the generator needs to find a way to generate spatially-varying pseudorandom numbers from earlier activations whenever needed. However, doing so is challenging and not always successful. As a result, repetitive patterns are commonly seen in generated images. StyleGAN addresses this problem by adding per-pixel noise after each convolution. The implementation is illustrated in figure 2.6-b: a single-channel noise input is scaled and added to the output features of each convolution layer of the generator. Although StyleGAN is under the same pressure to generate new
Fig. 2.8 Example of water droplet-like artifacts in StyleGAN images (shown at 256×256 and 512×512 resolution). The artifact appears in all feature maps starting from the 64×64 resolution up to the final generated image.
content as the traditional generator, it can easily create stochastic variation by relying on the provided noise.
StyleGAN uses a consistent set of scaling and biasing factors for all its feature maps, allowing it to effectively manipulate global image properties such as pose, lighting, or background style. At the same time, it adds noise to each pixel independently, making it effective at controlling stochastic variation. If the network tried to use noise to control aspects like pose, the resulting spatially inconsistent decisions would be penalized by the discriminator. This allows the network to learn how to appropriately use global and local channels without explicit guidance.
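As an illustration of the two mechanisms described above, the following is a minimal sketch (not the official StyleGAN implementation) of the AdaIN operation from equation 2.4 and of per-pixel noise injection. Tensor shapes and the learned `noise_scale` parameter are assumptions made for the example.

```python
import torch

def adain(x, y_s, y_b, eps=1e-8):
    """Adaptive Instance Normalization (equation 2.4), a sketch.

    x   : feature maps of shape (N, C, H, W)
    y_s : per-channel style scale, shape (N, C)
    y_b : per-channel style bias,  shape (N, C)
    """
    mu = x.mean(dim=(2, 3), keepdim=True)
    sigma = x.std(dim=(2, 3), keepdim=True) + eps
    x_norm = (x - mu) / sigma
    return y_s[:, :, None, None] * x_norm + y_b[:, :, None, None]

def add_stochastic_noise(x, noise_scale):
    """Per-pixel noise injection: a single-channel noise map is scaled by a
    learned per-channel factor and added to the feature maps.

    noise_scale : learned tensor of shape (C,), assumed for this example
    """
    n, c, h, w = x.shape
    noise = torch.randn(n, 1, h, w)                     # single-channel noise input
    return x + noise_scale[None, :, None, None] * noise
```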
2.1.2.2 StyleGAN 2
Despite StyleGAN's success at generating high-quality images on various datasets, it has some issues. One is the presence of blob-like artifacts that resemble water droplets in many generated images. Another issue is "phase" artifacts, where features like teeth or eyes appear stuck in place when they ought to move smoothly over the image as we walk along the latent space. In their paper, Karras et al. [32] identify and analyze these characteristic artifacts of StyleGAN and suggest changes to the model architecture and training methods to address them.
Removing normalization artifacts
Many observers have noticed that most images generated by StyleGAN exhibit characteristic blob-shaped artifacts that resemble water droplets. An example of this artifact is shown in
figure 2.8. It is obvious that the artifact is present in all intermediate feature maps of the generator from around 64×64 resolution and becomes progressively stronger at higher resolutions.
Karras et al. argue that the issue arises from the use of the AdaIN operation. According to equation 2.4, the mean and variance of each feature map are normalized individually, which may destroy the information present in the relative magnitudes of the features. The authors posit that AdaIN discards information in the relative activation intensities of the feature maps; hence, the generator sneaks information past these layers, resulting in the water-droplet artifacts.
To address this problem, the authors of StyleGAN2 reconstruct Adaptive Instance Normalization as Weight Demodulation. This change is depicted step by step in figure 2.9. Figures 2.9-a and 2.9-b respectively illustrate StyleGAN's generator architecture and its expanded version, which breaks AdaIN into two constituent parts: normalization and modulation.
Karras et al. observe that if we move the bias and noise outside the style block and operate on normalized data, the result is more predictable. This modification allows normalization and modulation to operate on the standard deviation alone. Therefore, it is safe to remove the bias, noise, and normalization applied to the constant input.
In practice, the modulation part of AdaIN may amplify certain feature maps, and the normalization part takes responsibility for counteracting this amplification. On the other hand, when the normalization step is removed from the generator, the droplet artifacts disappear completely. As a result, Weight Demodulation is proposed as an alternative that removes the artifacts by eliminating normalization while still counteracting feature map amplification. In figure 2.9-c, each style block consists of modulation, convolution, and normalization. The modulation can alternatively be implemented as a scaling of the convolution weights:
w'_{ijk} = s_i \cdot w_{ijk}
Instance normalization is used to make the statistics of the feature maps free from the effect of s. The authors of StyleGAN2 argue that another way to achieve this goal is to base the normalization on the statistics of the incoming feature maps. The final standard deviation of the output activations is
\sigma_j = \sqrt{\sum_{i,k} \left( w'_{ijk} \right)^2}
The objective of the subsequent normalization is to restore unit standard deviation to the outputs; this is dubbed the demodulation process. Alternatively, we can fold it into the convolution weights, as in equation 2.5.
Trang 29$@<——B| = by © B Nonn ad Demod}>[_ Conv 3x3
|—>| Adan 2 Norm mean/std ,
L
Upsample U le Bees [A}>Dited Upsample
Conv 3x3 3 W3 Conv 3x3 W3 Conv 3x3 B=si>[ coma]
Bị 5i h————>9<——h] Nom sid
[AK] AdalN é ‘Norm mean/std bs > @ ‹ {B] bị a B
——_Y —, A Mod mean/std [AL———> L Mod std Ws
Conv 3x3 _ "4 Conv 3x3 Wy > Conv 3x3 lam Vv
_ œ—— BỊ l =— Demod>|_ Conv 3x3
- [AF >| Adal là; Senge >> $8] "——>‡—B
v (a) StyleGAN (b) StyleGAN (detailed) (c) Revised architecture (d) Weight demodulation
&
Style block
Fig 2.9 The architecture of StyleGAN2
w''_{ijk} = \frac{w'_{ijk}}{\sqrt{\sum_{i,k} \left( w'_{ijk} \right)^2 + \epsilon}} \quad (2.5)
The demodulation method is weaker than instance normalization because it is based on statistical assumptions about the signal rather than the actual contents of the feature maps.
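The following sketch illustrates how modulation and demodulation (equation 2.5) can be folded into the convolution weights. It is a simplified illustration rather than the official StyleGAN2 code; the weight and style shapes are assumptions made for the example.

```python
import torch

def modulate_demodulate(weight, style, eps=1e-8):
    """Weight (de)modulation used by StyleGAN2 in place of AdaIN (a sketch).

    weight : convolution weights of shape (out_ch, in_ch, k, k)
    style  : per-sample, per-input-channel scales s_i of shape (N, in_ch)

    Returns per-sample weights of shape (N, out_ch, in_ch, k, k).
    """
    # Modulation: w'_ijk = s_i * w_ijk
    w = weight[None, :, :, :, :] * style[:, None, :, None, None]

    # Demodulation (equation 2.5): rescale so each output feature map
    # has approximately unit standard deviation.
    sigma = torch.sqrt((w ** 2).sum(dim=(2, 3, 4), keepdim=True) + eps)
    return w / sigma
```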
Removing Progressive Growing
Progressive growing is the key mechanism that enables StyleGAN to successfully generate high-resolution images, but it also causes characteristic (or "phase") artifacts, because the generator seems to develop a preference for the positions of facial components (i.e., mouth, teeth, etc.). The official StyleGAN2 video demonstrates that features like teeth or eyes seem stuck in one place wherever the face of the person moves; figure 2.10 shows a related artifact. Karras et al. believe that the progressive growing technique is the root of the issue. In progressive growing, each resolution temporarily serves as the output resolution, compelling it to produce the highest-frequency details possible. This results in excessively high frequencies in the intermediate layers of the trained network, which compromises its shift-invariance. This motivated the StyleGAN2 authors' search for an alternative mechanism that retains the high-resolution generation ability without phase artifacts.
Both the generator and discriminator of StyleGAN employ a basic feedforward design. However, various recent studies aim at developing more effective network architectures, such as skip connections [44, 30], residual networks [17, 20], and hierarchical methods [60, 61]. It has been demonstrated that these techniques can improve the model's overall performance in general and in generative scenarios. As such, Karras et
al. reevaluate the design of the StyleGAN network and aim to find an architecture that can generate high-quality images without the use of progressive growing.
The new network architecture of StyleGAN2 is inspired by MSG-GAN [30]. Figure 2.11-a illustrates MSG-GAN, which connects the intermediate outputs of the generator and discriminator at the same resolutions by multiple skip connections. Adapting this idea, StyleGAN2 uses bilinear up-sampling layers to upsample the RGB outputs of different resolutions and then sums them together. In the discriminator, the generated image is downsampled and fed to each resolution block; this design is depicted in figure 2.11-b. Finally, Karras et al. further employ residual connections, as shown in figure 2.11-c. As proven in the StyleGAN2 publication, the skip generator and residual discriminator significantly improve FID and PPL without progressive growing.
Path length regularization and Lazy regularization
In the topic of image generation, the quantitative evaluation of synthetic images continues to be a tough problem. Recent works explicitly quantify the similarity between generated images and training data by using the Frechet Inception Distance (FID) [21]: it first employs a pre-trained classifier to extract image features for estimating the density distributions of generated images and training data, and finally calculates the difference between the two distributions with the Frechet distance. Precision and Recall (P&R) are used to quantify how many of the data classes in the training data the generator can produce. Both FID and P&R are based on classifier networks, but [14] shows that classifier networks concentrate on textures rather than shapes, while humans
Trang 31(a) MSG-GAN (b) Input/output skips (c) Residual nets
Fig 2.11 Some alternative network architectures of StyleGAN2 that replace progressive
growing
focus more on shapes. As such, these metrics are insufficient to assess image quality. In [32], Karras et al. conducted a small experiment to demonstrate that although two sets of generated images can have similar FID and P&R scores, their overall quality can be notably distinct.
StyleGAN assumes that if the interpolation of latent-space vectors results in an unexpectedly non-linear change in the generated image, this indicates that the latent space is entangled and the factors of variation are not sufficiently separated. Thus, it introduced a novel metric known as Perceptual Path Length (PPL) to measure the magnitude of the image's transformations during latent-space interpolation. PPL is calculated as the sum of perceptually-based pairwise image distances, measured using a weighted difference between two VGG16 embeddings, over each sub-linear segment of the interpolation path in the latent space.
In StyleGAN2, the authors observe that perceptual path length (PPL) correlates with image quality: the higher the PPL, the lower the image quality. They argue that the discriminator penalizes broken images during training; as such, when we perform latent-space interpolation in a low-quality region of the latent space, the image changes dramatically, resulting in a high perceptual path length. However, simply minimizing PPL would lead to a degenerate solution with a lack of diversity. To address this, Karras et al. proposed a new
regularizer that encourages a smoother mapping from the latent space to the generated images without this drawback.
The main idea behind this regularizer is that if we walk along the latent space with fixed-size steps, this should yield a non-zero, constant-magnitude change in the image. We can determine the deviation from this ideal by moving in random directions in the image space and monitoring the corresponding w gradients. The objective is to find a mapping from the latent space to images whose gradients have close to equal length regardless of the direction of movement in the latent space or the latent code w. The Jacobian matrix J_w = ∂g(w)/∂w is used to capture the local metric scaling properties of the generator mapping g(w) : W → Y at a single latent point w ∈ W. The regularizer is formulated as:
\mathbb{E}_{w,\, y \sim \mathcal{N}(0, \mathbf{I})} \left( \left\lVert \mathbf{J}_w^{T} y \right\rVert_2 - a \right)^2 \quad (2.6)
where the random images are denoted as y and w ~ f(z), with z normally distributed.
Practical experiments show that path length regularization results in more trustworthy and consistent results.
2.2 Sequence model
Sequence models are a type of deep learning model specifically designed to process sequential data, such as natural language, time series, or audio. These models typically use recurrent neural networks (RNNs) or their variants, such as long short-term memory (LSTM) networks or Bidirectional Recurrent Neural Networks (BiRNNs), to process the input sequence and capture the temporal dependencies between its elements.
2.2.1 Recurrent Neural Networks (RNN)
Artificial Neural Networks (ANNs), also referred to as Feed-Forward Neural Networks, are a type of machine learning model that processes inputs in a linear, unidirectional manner. Due to this design, ANNs are unable to capture the sequential information present in input data, which is necessary for tasks involving sequence data. When data elements are related to one another, such as in speech recognition, text generation, and semantic recognition of text or voice, treating each element independently is inadequate. To address this limitation, Recurrent Neural Networks (RNNs) were developed. RNNs possess a "memory" component, which allows them to recall previous information and thus effectively handle sequential data.
A basic Recurrent Neural Network (RNN) comprises a feedback loop, as illustrated in figure 2.12, using both current and past inputs to make decisions. RNNs allow for a
Fig. 2.12 A Recurrent Neural Network (RNN) has a recurrent relation on the hidden state; this looping constraint ensures the capture of sequential information in the input data.
degree of flexibility in terms of the activation function employed, with commonly used options including the Sigmoid, Tanh, and ReLU functions. The process of training an RNN includes three principal steps. First, the network performs a forward pass and makes a prediction. Second, this prediction is compared to the true value using a loss function, which outputs a value representing the error or discrepancy between the prediction and the true value. Lastly, the error value is used to perform backpropagation, which calculates the gradients for each node in the network. There are various architectures of RNN available; some examples are One-to-One, One-to-Many (where a single input can produce multiple outputs, such as in music generation), and Many-to-One (where multiple inputs
from different time steps produce a single output, such as in sentiment analysis or emotion recognition).
One of the main disadvantages of RNNs is the so-called "vanishing gradient" problem. This occurs when the gradient of the loss function with respect to the weights of the network becomes very small, which makes it difficult for the network to learn. This can be a particular problem for traditional RNNs, which only process input sequences in one direction and can struggle to capture long-term dependencies in the data. Additionally, some RNNs can be difficult to parallelize and train on GPUs, which can further limit their computational efficiency. This is because the sequential nature of RNNs makes it difficult to distribute the computations across multiple GPUs, which can limit their ability to take advantage of parallel processing.
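For concreteness, the following is a minimal sketch of a vanilla RNN cell and of unrolling it over a sequence, showing how the hidden state carries information from previous time steps; the layer sizes and interfaces are assumptions made for the example.

```python
import torch

class VanillaRNNCell(torch.nn.Module):
    """Minimal recurrent cell: the hidden state feeds back at every step."""

    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.in2hid = torch.nn.Linear(input_size, hidden_size)
        self.hid2hid = torch.nn.Linear(hidden_size, hidden_size)

    def forward(self, x_t, h_prev):
        # h_t = tanh(W_x x_t + W_h h_{t-1} + b): current input plus memory of the past
        return torch.tanh(self.in2hid(x_t) + self.hid2hid(h_prev))

def run_rnn(cell, inputs, hidden_size):
    """Unroll the cell over a sequence of shape (T, N, input_size)."""
    h = torch.zeros(inputs.size(1), hidden_size)
    outputs = []
    for x_t in inputs:              # process the sequence step by step
        h = cell(x_t, h)
        outputs.append(h)
    return torch.stack(outputs)     # (T, N, hidden_size)
```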
2.2.2 Bidirectional Recurrent Neural Networks
Bidirectional RNNs (BRNNs) [46] are a type of recurrent neural network that can process input sequences in both forward and backward directions. This allows them to capture both past and future context, giving them a potential advantage over traditional RNNs, which only process input sequences in one direction (typically forward, from left to right).
One potential advantage of BRNNs over traditional RNNs is that they can better capture long-term dependencies in the input data. Traditional RNNs can struggle to capture long-term dependencies because they only have access to past context, which may not be sufficient to fully understand the underlying structure of the data. By processing the input in both directions, BRNNs can better capture the full context of the input, which can help improve their performance on tasks such as language modeling or machine translation.
Another potential advantage of BRNNs is that they can be more efficient to train. Because traditional RNNs only process input sequences in one direction, they must be unrolled through time in order to compute gradients for backpropagation. This can be computationally expensive for long input sequences. BRNNs, on the other hand, can be trained using a technique called "backward-forward training", which avoids the need to unroll the network through time. In the backward phase, the gradient of the loss function is computed using the standard backpropagation algorithm, but only for the backward RNN. This computes the gradient of the loss with respect to the weights of the backward RNN, but not the forward RNN. This can be written mathematically as:
Fig. 2.13 Illustration of Bidirectional Recurrent Neural Networks.
\frac{\partial L}{\partial W_{bck}} = \frac{\partial L}{\partial Y_{bck}} \cdot \frac{\partial Y_{bck}}{\partial S_{bck}} \cdot \frac{\partial S_{bck}}{\partial W_{bck}}
where L is the loss function, W_{bck} are the weights of the backward RNN, Y_{bck} are the outputs of the backward RNN, and S_{bck} are the hidden states of the backward RNN.
In the forward phase, the gradient of the loss is computed using a variant of the backpropagation algorithm that is tailored to BRNNs. This computes the gradient of the loss with respect to the weights of the forward RNN, but not the backward RNN. This can be written mathematically as:
\frac{\partial L}{\partial W_{fwd}} = \frac{\partial L}{\partial Y_{fwd}} \cdot \frac{\partial Y_{fwd}}{\partial S_{fwd}} \cdot \frac{\partial S_{fwd}}{\partial W_{fwd}}
where W_{fwd} are the weights of the forward RNN, Y_{fwd} are the outputs of the forward RNN, and S_{fwd} are the hidden states of the forward RNN.
By decomposing the computation of the gradient into these two phases, backward-forward
training allows BRNNs to be trained more efficiently than traditional RNNs, especially for long input sequences.
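The sketch below illustrates the bidirectional idea: the same sequence is processed once in the forward direction and once in the backward direction (for example with cells like the VanillaRNNCell sketched earlier), and the two hidden states are concatenated at every time step. The cell interfaces are assumptions made for the example.

```python
import torch

def bidirectional_rnn(fwd_cell, bck_cell, inputs, hidden_size):
    """Minimal bidirectional RNN sketch.

    inputs : tensor of shape (T, N, input_size)
    Returns a tensor of shape (T, N, 2 * hidden_size) where every time step
    sees both past (forward pass) and future (backward pass) context.
    """
    T, N, _ = inputs.shape
    h_fwd = torch.zeros(N, hidden_size)
    h_bck = torch.zeros(N, hidden_size)
    fwd_states, bck_states = [], [None] * T

    for t in range(T):                       # forward direction
        h_fwd = fwd_cell(inputs[t], h_fwd)
        fwd_states.append(h_fwd)
    for t in reversed(range(T)):             # backward direction
        h_bck = bck_cell(inputs[t], h_bck)
        bck_states[t] = h_bck

    # Concatenate the forward and backward hidden states per time step.
    return torch.stack([torch.cat([f, b], dim=-1)
                        for f, b in zip(fwd_states, bck_states)])
```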
2.3 Optical Flow
Optical flow is the pattern of apparent motion of objects, surfaces, and edges in a visual scene caused by the relative motion between an observer and the scene. It is typically represented as a vector field, where each vector represents the motion of a point in the image from one frame to the next.
Optical flow can be computed over time by tracking the motion of points or features in a sequence of frames from a video. This can be done using a variety of techniques, such as feature-based methods, which track the motion of individual points or edges in the image, or region-based methods, which track the motion of larger image regions.
Once the optical flow has been computed for a sequence of frames, it can be used for a
wide range of applications, such as video compression, where accurate estimates of motion can be used to reduce the amount of data needed to represent the video, or action recognition, where the optical flow can be used to identify and classify different types of motion in the
video.
2.3.1 Handcrafted Features for Optical Flow Estimation
Using handcrafted features for optical flow refers to the process of manually designing and extracting features from an image, such as edges and corners, and using them to estimate the motion of objects in the scene. This approach is in contrast to deep learning-based methods, which use convolutional neural networks (CNNs) to automatically learn features from the data.
There are several algorithms that use handcrafted features for optical flow estimation. Some examples include:
Lucas-Kanade [38]: This algorithm uses the brightness constancy assumption, which states that the intensity of a pixel in an image remains constant over time. It uses this assumption to calculate the optical flow between two images by minimizing the sum of squared differences between the intensity values of corresponding pixels. This is formulated as an optimization problem, where the objective function is the sum of squared differences and the optimization variables are the optical flow vectors. The objective function can be written as:
E(p) = \sum_{x} \left[ I_1(x + p) - I_2(x) \right]^2
where E(p) is the objective function, I_1 and I_2 are the two images, x is a pixel in the first image, and p is the optical flow vector for that pixel.
The goal of the Lucas-Kanade algorithm is to find the optimal optical flow vectors p that minimize the objective function. This can be done using gradient descent, where the gradient of the objective function is calculated and used to update the optical flow vectors in the direction that reduces the objective function. This process is repeated until the optical flow vectors converge to a minimum of the objective function, at which point the algorithm terminates and the final optical flow vectors are returned as the result.
Horn-Schunck [23]: This algorithm is similar to Lucas-Kanade, but it also takes into account the smoothness of the optical flow field. It uses an iterative approach to minimize the sum of squared differences between the intensity values of corresponding pixels, as well as a regularization term that encourages smoothness in the flow field. The objective function can be written as:
E(p) = E_{Lucas\text{-}Kanade}(p) + \lambda\, E_{smooth}(p)
where the smoothness term E_{smooth}(p) penalizes large spatial variations of the flow field and \lambda balances the data and smoothness terms.
Farneback [12]: This algorithm uses a pyramid structure to calculate the optical flow between two images. It first calculates the flow at the lowest resolution and then uses this flow to initialize the flow at the next higher resolution. This process is repeated until the flow at the highest resolution is calculated.
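In practice, a dense Farneback flow field can be computed with OpenCV, as in the sketch below; the file names and parameter values are illustrative assumptions, not tuned settings.

```python
import cv2

# Dense optical flow with OpenCV's Farneback implementation (a sketch).
prev_gray = cv2.cvtColor(cv2.imread("frame_0.png"), cv2.COLOR_BGR2GRAY)
next_gray = cv2.cvtColor(cv2.imread("frame_1.png"), cv2.COLOR_BGR2GRAY)

flow = cv2.calcOpticalFlowFarneback(
    prev_gray, next_gray, None,
    0.5,   # pyr_scale: each pyramid level is half the previous resolution
    3,     # levels: coarse-to-fine pyramid, as described in the text
    15,    # winsize: averaging window size
    3,     # iterations per pyramid level
    5,     # poly_n: neighborhood size for polynomial expansion
    1.2,   # poly_sigma: Gaussian std used for the expansion
    0)     # flags

# flow has shape (H, W, 2): per-pixel displacement (dx, dy) between the frames.
```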
Using handcrafted features for optical flow estimation has both advantages and disadvantages. One advantage of using handcrafted features is that they are interpretable and can provide insights into the motion of objects in the scene. Since the features are designed by humans, they can be easily understood and explained, which can be useful for understanding the behavior of the algorithm and debugging problems. Another advantage of using handcrafted features is that they can be tailored to specific types of scenes and motion patterns. By carefully designing the features, it is possible to improve the accuracy of the optical flow estimates in certain scenarios.
However, there are also disadvantages to using handcrafted features for optical flow estimation. One disadvantage is that they can be susceptible to noise and other factors that can affect their performance. Since the features are designed by humans, they may not be
able to handle the complex patterns of motion that can occur in real-world scenes. Another disadvantage of using handcrafted features is that they require significant effort to design and implement. This can be time-consuming and labor-intensive, and it may not be feasible for large-scale applications.
2.3.2 RAFT: Recurrent All-Pairs Field Transforms for Optical Flow
Deep learning methods have become popular for optical flow estimation in recent years because they can often produce more accurate results than traditional methods. This is because deep learning methods are able to learn and model the complex patterns and relationships in the data that are often present in optical flow estimation tasks. The paper "RAFT: Recurrent All-Pairs Field Transforms for Optical Flow" by Teed and Deng [48], published in 2020, introduces RAFT, a deep architecture that builds all-pairs correlation volumes and recurrently refines the flow field. It combines a recurrent all-pairs field transform with a convolutional neural network (CNN) to produce accurate and smooth optical flow estimates even for challenging scenes. At the time of publication, RAFT achieved state-of-the-art performance on well-known datasets such as KITTI and Sintel, and the paper was selected as the Best Paper at ECCV 2020, a leading computer vision conference.
Fig. 2.14 The RAFT architecture comprises three key components: a feature encoder that extracts per-pixel features from both input images, a context encoder that extracts features from only the first input image, and a correlation layer. The correlation layer generates a 4-dimensional volume by taking the inner product of all pairs of feature vectors. It then applies pooling at multiple scales over the last 2 dimensions to produce a set of multi-scale volumes. Finally, an update operator uses the current estimate to recurrently update the optical flow by looking up values from the set of correlation volumes.
RAFT consists of three main components:
Feature extraction is similar to what is done in general deep learning architectures, using convolutional networks to emphasize significant features.
Computing visual similarity aims at calculating the similarity between a specific part of the previous frame and each part of the subsequent frame in a brute-force fashion.
Iterative updates is an approach that increases accuracy by performing inference iteratively. In other words, if the number of iterations is small, the calculation time is short but the accuracy is relatively low; if the number of iterations is large, the calculation time is long but the accuracy tends to be relatively high.
The RAFT architecture is an innovative approach that combines traditional optimization techniques with modern deep learning methods. It features a feature encoder that extracts per-pixel features, a correlation layer that calculates the similarity among pixels, and an update operator that simulates the steps of an iterative optimization algorithm. Unlike traditional methods, the RAFT architecture utilizes deep learning techniques to automatically learn and adapt the features and motion priors, rather than relying on manual crafting.
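The following is a highly simplified sketch of the iterative refinement idea (it is not the actual RAFT implementation): a recurrent update operator repeatedly looks up correlation features around the current flow estimate and predicts a residual flow. The `corr_lookup` and `update_cell` callables, and the shapes involved, are hypothetical placeholders for this example.

```python
import torch

def iterative_flow_refinement(corr_lookup, update_cell, context, flow_init, n_iters=12):
    """Simplified sketch of RAFT-style iterative flow refinement.

    corr_lookup : function mapping the current flow estimate to correlation
                  features sampled from the multi-scale correlation volumes
    update_cell : recurrent update operator (e.g. a conv-GRU) that predicts a
                  residual flow from correlation + context features
    context     : features from the context encoder (first frame only)
    flow_init   : initial flow estimate, typically zeros at 1/8 resolution
    """
    flow = flow_init
    hidden = torch.zeros_like(context)             # recurrent hidden state (assumed shape)
    for _ in range(n_iters):                       # more iterations -> higher accuracy
        corr_feats = corr_lookup(flow)             # look up visual similarity around the flow
        hidden, delta_flow = update_cell(hidden, corr_feats, context, flow)
        flow = flow + delta_flow                   # refine the estimate
    return flow
```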
Fig. 2.15 Optical flow estimates after each iteration (iterations 0, 1, 2, 3, 5, 8, 11, 15, 19) for two specific frames of a racing car; the higher the number of iterations, the better the accuracy.
In conclusion, this chapter provides an overview of essential concepts that are crucial to our research. By introducing Generative Adversarial Networks (GANs) and the StyleGAN family, we provide a foundation for understanding image generation with deep learning. Additionally, we delve into the world of sequence data by discussing recurrent neural networks (RNNs), long short-term memory (LSTM) networks, and Bidirectional Recurrent
Neural Networks (BiRNNs), and touch upon the important concept of optical flow in video processing. The knowledge gained from this chapter will serve as a foundation for the work presented in subsequent chapters.