
SCHOOL OF INFORMATION AND COMMUNICATION TECHNOLOGY

DINH ANH DUNG

ON SOLVING MODE COLLAPSE PROBLEM

IN SEVERAL GENERATIVE MODELS

MASTER THESIS IN COMPUTER SCIENCE

Hanoi - 2021


SCHOOL OF INFORMATION AND COMMUNICATION TECHNOLOGY

Dinh Anh Dung

ON SOLVING MODE COLLAPSE PROBLEM

IN SEVERAL GENERATIVE MODELS

Subject: Computer Science

Title ID: 19BKHMT-KH04

MASTER THESIS IN COMPUTER SCIENCE

Supervisor: Assoc. Prof. Huynh Thi Thanh Binh

Signature of supervisor

Hanoi - 2021


REQUIREMENTS OF THE THESIS

1 Student’s information

Name: Dinh Anh Dung

Class: Computer Science

Affiliation: Hanoi University of Science and Technology

Phone: +6590947468 Email: dinhanhdung1996@gmail.com

Hanoi, 12th December 2021

Author

Dinh Anh Dung

4 Attestation of thesis advisor

Supervisor

Huynh Thi Thanh Binh


Table of Contents

1.1 Neural Networks 21

1.1.1 Feed forward 22

1.1.2 Stochastic Gradient Descent 25

1.2 Generative Models 26

1.2.1 Maximum Likelihood 26

1.2.2 Bayesian Networks 28

1.2.3 Generative Adversarial Networks 32

1.3 Mode Collapse 34

1.3.1 Reasons 36

1.3.2 Measurement 37

2 Mode Collapse on tabular data modelling 38

2.1 Tabular data and challenges in tabular data modelling 38

2.2 Related Work 40

2.3 High dimensions learning problem 42


2.4 Structure learning problem 43

2.5 Proposed method 45

2.5.1 Input data 45

2.5.2 Discriminative Embedding GAN 47

2.5.3 Graphical Conditional Vector 48

2.6 Experimental Results 51

2.6.1 Experimental setup 51

2.6.2 Experiments 56

2.7 Discussion 62

3 Mode Collapse on special case of image data modelling 63

3.1 Background 64

3.1.1 Inversion-based Learning 64

3.1.2 Deep Inversion 65

3.2 Data Free Generative Adversarial Networks 66

3.2.1 Discriminator Learning 68

3.2.2 Generator Learning 69

3.2.3 Improving Generator Learning with Class Activation Max-imization 70

3.2.4 Improving Generator Learning via Mutual Information and Diversity constraints 70

3.3 Experimental Results 71

3.3.1 Experiments on 2D dataset 72

3.3.2 Experiments on CIFAR datasets 75

3.4 Discussion and Conclusion 84


List of Figures

Figure 1 [1] illustrates the difference when using Mean Square Error (MSE) loss and adversarial loss on predicting the next frame for video. 14

Figure 2 [2] illustrates the case of using the maximum likelihood framework to estimate parameters for a Gaussian Mixture model. 15

Figure 3 The Figure illustrates two sets of samples generated by two GANs. The first-row GAN does not face the problem of mode collapse, and the second row faces the full mode collapse problem [3] 18

Figure 1.1 Neural Network [2] 22

Figure 1.2 Taxonomy of generative models based on Maximum Likelihood [4] 27

Figure 1.3 Attributes in a medical record 28

Figure 1.4 An example of Bayesian Network for modeling real data 29

Figure 1.5 Bayesian Network example from [2] 29

Figure 1.6 Generative Adversarial Neural Networks (GAN) framework 33

Figure 1.7 Mode collapse paradox 35

Figure 2.1 alarm dataset: performance distance and wasserstein distance of Conditional Tabular GAN (CTGAN) and Discriminative Embedding GAN (DEGAN) during training 44

Figure 2.2 insurance dataset: performance distance and wasserstein distance of CTGAN and DEGAN during training 44

Figure 2.3 Discrete attribute represented by one-hot vectors 46

Figure 2.4 Highlights of the proposed model GDEGAN. 48

Figure 2.5 MNIST28: Images generated by different methods 57

Figure 3.1 Inversion model and results taken from [5] 65

Figure 3.2 Deep Inversion model from [6] 65

Figure 3.3 Main difference between standard GAN and proposed Data-Free GAN 66

Figure 3.4 Detailed diagram of the proposed Data Free GAN 68

Figure 3.5 Generated samples of (a) our method, (b) DeepInversion [6] and (c) KEGNET [7] on 2D toy dataset. 75

Figure 3.6 We plot the IS (first figure) and FID (second figure) scores and computational time of DeepInversion (DI) [6] trained on the CIFAR-10 dataset at different iterations (blue color). 77

Figure 3.7 The synthesized images of two existing inversion-based image synthesis approaches using pre-trained classifiers and without real training data: DeepInversion (DI) [6] (top row), DAFL [8] (second row) and our proposed method (third row) on CIFAR-10 dataset. 80

Figure 3.8 Random synthetic samples of DCGAN (left), WGAN-GP (middle), and ours (right) on the CIFAR-100 dataset. DCGAN and WGAN-GP are trained with real data. 81

Figure 3.9 The synthetic samples of KEGNET [7] (top row), DI [6] (middle row), and our method (bottom row) on the SVHN dataset. KEGNET and DI perform poorly on this dataset. 83

Figure 3.10 Synthetic ImageNet images generated by our method. 84


List of Tables

Table 2.1 The appearance rate in synthetic datasets of 1000 randomly picked pairs of values from three real datasets: adult, census and covertype 45

Table 2.2 Tabular data benchmark description 52

Table 2.3 The performance according to the L test of current state-of-the-art tabular generative models on 6 simulated datasets: alarm, asia, child, insurance, grid and gridr. 56

Table 2.4 The performance (in %) of current state-of-the-art tabular generative models on the real datasets: adult, census, covtype, intrusion, mnist12 and mnist28. 58

Table 2.5 FID scores of different methods. DEGAN overcomes mode-collapse problems with significant gaps compared to other methods 59

Table 2.6 Marginal statistics score on 4 real datasets: adult, census, covtype and credit. The result indicates the out-performance of the proposed methods over others. 60

Table 2.7 Marginal statistics score on 3 real datasets: intrusion, mnist12 and mnist28. The mnist28 dataset has a high-dimensional input. 61

Table 2.8 Marginal statistics score on 2 real datasets: Colorado and Fire department. The two datasets have sparse and high-dimensional input. 61

Table 2.9 Comparison with other methods in the Differential Privacy Synthetic Data Challenge 2018 [9]. 61

Table 2.10 Comparison between different graph construction schemes on machine learning efficiency score. 62

Table 2.11 Comparison between different graph construction schemes on marginal statistics score. 62

Table 3.1 Architecture for 2D dataset 73

Table 3.2 The hyper-parameters of our method on the 2D dataset. In this dataset, we first pre-train the classifier C before training our model with G and D. None: no parameter. 74

Table 3.3 Network architectures of the generator, discriminator, and encoder on the CIFAR-10, CIFAR-100, and SVHN datasets. 76

Table 3.4 The hyper-parameters of our method on the CIFAR-10/100 datasets. n/a: no parameters. 77

Table 3.5 The ablation study on three losses used in our model 79

Table 3.6 IS and FID scores of our method, existing works of data-free models (DAFL and DeepInversion), and two well-known GANs (DCGAN and WGAN-GP) 81

Table 3.7 IS and FID scores of our method, existing works of data-free models (DAFL and DeepInversion), and two well-known GANs (DCGAN and WGAN-GP) trained with real data on the CIFAR-100 dataset. Data-free (✓): model is trained via model inversion or noise, and no real data. 81

Table 3.8 IS and FID of KEGNET [7], DeepInversion [6] and our method on SVHN dataset. For a fair comparison, we use the same pre-trained classifier (ResNet10) published by KEGNET for all compared methods. 83

Table 3.9 The hyper-parameters of our method on the SVHN dataset. n/a: no parameter. 83


Mode collapse is one of the hardest problems in data distribution modelling. It is often caused by the complexity of the dataset in terms of dimensions and distributions. As a result, this thesis aims to solve the problem. Our approaches tackle mode collapse on each type of dataset separately, due to their differences in characteristics. Two genres of data are considered: image data and tabular data. For tabular data, by observing the characteristics of the datasets and the performance of the current state of the art, the author has identified that mode collapse is caused by the high-dimensional one-hot vector learning problem as well as the structural learning problem of the current state-of-the-art models. From that, a model named Discriminative Embedding Generative Adversarial Network is proposed to solve these issues. For image data, the author only considers a specific case where generative models are trained without using real datasets. Due to the lack of a training dataset, the model falls into mode collapse more easily. In order to make the model work effectively, a model named Data-Free Generative Adversarial Network (DF-GAN) is introduced. To the best of the author's knowledge, these proposed methods are novel and have not been investigated before. The author also conducted extensive experiments to investigate every aspect of the proposed methods. The results show that both proposed methods outperform other state-of-the-art methods significantly.

Data distribution modelling is one of the most important problems in modern computer and data science. By modelling data distributions, we could manipulate the data to support many other learning tasks. This chapter summarizes the key points on generative models, especially GAN, to provide a general view on what a generative model is, why generative models are worth researching, and why this thesis focuses on the mode collapse problem. At the end, there is a summary of contributions and the structure of the thesis.

A generative model is a type of modelling technique based on probabilistic approaches. In machine learning and learning theory, the probabilistic approach has recently prevailed in modern ideas on modelling data. Starting from linear discriminant functions for linear classification, with their drawback in geometric interpretation, scientists have successfully deployed the intuition of using three elements, p(Y|X), p(Y) and p(X), to model any type of data. The p(Y|X) represents the relationship between observable data instances X and the associated label outcomes Y, p(Y) is the prior knowledge about the label distribution, and p(X) is the prior knowledge about the distribution of the dataset. Due to the fruitful outcomes of the probabilistic approach and geometric interpretation, p(Y|X) has been used to model most of our recent and modern classification and regression tasks, and this method is often referred to as a discriminative model. On the other hand, p(X) is often utilized to model the data distribution of the dataset X. Based on its ability to generate data instances, this kind of model is named a generative model. Hence, the term "generative model" is not only associated with data generation applications; it further delivers the best of our understanding of data through modelling the data distribution p(X).

Due to the characteristics of generative models in data distribution modelling, research on generative models is extremely important. First, the success of generative models reflects our knowledge of data distributions, especially high-dimensional data distributions. In fact, data understanding is one of the trending research topics recently in physics, mathematics, economics, computer science and data science [10, 11, 12, 13]. Generative models offer a statistical approach that provides knowledge about data distributions explicitly and implicitly. Second, generative models benefit the storage and processing of data in the days of the data boom. The aim of generative models is to offer a set of statistics or statistical parameters that could represent the dataset (sufficient statistics). Instead of storing the whole dataset, only certain parameters need to be stored so that we could recover the whole dataset at any time. For example, with billions of univariate Gaussian data points, instead of storing all billions of points, we only need to store two values: the mean µ and the variance σ². Based on these two values, we could sample as much as we want to recover the original dataset. Third, generative models allow learning on multi-modal outputs. For many scenarios, one input might correspond to several different valid outputs [4]. While using the MSE loss will only help to repeat similar information from the training data, training in a generative manner will provide an output as a probability, hence the output will be more flexible (less dependent on the training data) and more realistic with respect to the constraints of the input. For example, Fig. 1 shows the problem of generated images trained by the MSE loss lacking information such as ears, while all information is kept by training with the adversarial loss. The leftmost image is the ground truth, the middle one is the image generated by training with the MSE loss, and the last image is the output generated by the adversarial approach.
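As a minimal illustration of the sufficient-statistics point above (the data below are simulated and purely hypothetical), storing just the mean and variance is enough to resample arbitrarily many points:

```python
import numpy as np

# Hypothetical univariate Gaussian dataset (stands in for "billions of points").
data = np.random.normal(loc=5.0, scale=2.0, size=1_000_000)

# Sufficient statistics: only these two numbers need to be stored.
mu, sigma = data.mean(), data.std()

# Later, recover as many samples as desired from the stored statistics alone.
recovered = np.random.normal(loc=mu, scale=sigma, size=10_000)
print(f"stored mu={mu:.3f}, sigma={sigma:.3f}; recovered mean={recovered.mean():.3f}")
```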

Figure 1: [1] illustrates the difference when using MSE loss and adversarial loss on predicting the next frame for video.

Many works have exploited the strength of generative models to model data and provide many great applications with high practical demands. For example:

• Image-to-image translation: the task aims to translate images from a source domain so that they carry the style of a target domain [14, 15].

• Image recovery from damage.

• Tasks related to creating art.

• Converting low-resolution images to high resolution (super-resolution).

Most generative models follow a certain number of steps to model the data:

1. Observe the dataset.

2. Estimate the parameters θ of the data distribution.

3. Sample based on the estimated parameters θ.

As a statistical model, a generative model also faces the problems of over-fitting, lack of training data, singularities, or biased learning [16, 2]. The Maximum Likelihood (ML) and Expectation Maximization (EM) frameworks, which have been widely used to estimate parameters for such models, show a number of limitations. ML and EM both require prerequisite assumptions about the density function of the data before the estimation steps can be performed [16, 2]. While there are a lot of density functions that could be used to model a distribution, an assumption biased toward one particular function would lead to wrongly learned parameters (later, in Chapter 2, we will see that this is a common problem for explicit generative models). The second problem is about the density functions themselves. Several density functions do not work well with ML or approximate inference frameworks. For example, consider the case of estimating parameters for a Gaussian Mixture model via ML as in Fig. 2. When one mode has its mean value µ_j collapse onto a single data point x_n, the density function of that mode turns into N(x_n | x_n, σ_j²), whose contribution to the likelihood grows without bound as σ_j → 0, producing a singularity.

Figure 2: [2] illustrates the case of using the maximum likelihood framework to estimate parameters for a Gaussian Mixture model.

In order to overcome most of the problems of traditional machine learning methods for modelling data mentioned above, GAN was proposed in [17] by Ian Goodfellow. This method changes the second step of a normal generative model to employ a minimax game between two neural networks, a Generator and a Discriminator, which compete with each other to estimate the distribution parameters. By this, GAN offers a method for data modelling that is asymptotically consistent, supports parallel computing, and requires no density function assumptions.

Since the advent of GAN, this research direction has become one of the most active research problems in generative models and computer vision. Although GAN provides a good solution to generate high-quality outputs, it faces a number of other difficulties. The first one is finding the Nash equilibrium [18], which has been proved to be more difficult than optimizing an objective function. The second problem is non-convergence, which is also a common phenomenon in a minimax game. To have a clearer picture of this problem, we could consider the equation for training GAN as in Eq. 1.


Another difficulty, and the central one for this thesis, is the mode collapse problem, where the Generator G only models a set of repeated feature modes without diversity.

In fact, mode collapse is the largest problem that makes GAN infeasible to apply to real-world scenarios [17], even though the instances generated by a trained GAN are realistic. By generating only a few distinct samples, GAN becomes useless, since it does not provide enough information to support other tasks and does not solve the problem of inflexibility mentioned before in Fig. 1. In terms of modelling the data distribution, the parameters trained by GAN are of little use if they only model the peak points of the distribution. As a result, in this thesis, I provide a number of methods to solve the mode collapse problem, bridging the gap between GAN research and real-life problems.

Scope of research

The scope of this thesis focuses on solving the mode collapse problem in generative models. In detail, the author focuses more on GAN than on other methods because GAN currently achieves state-of-the-art performance on most datasets. The author only considers two types of datasets: tabular data and image data. Several other types of data, such as text and audio, are not covered in the thesis due to time limitations and the performance of current generative models on these types of data.

Methodology

In order to approach the mode collapse problem, I consider the problem on each type of data instead of devising a common method for all data types. This is because each type of data exhibits a different kind of mode collapse. For example, on image datasets, we could observe mode collapse by looking at a batch of generated data, while on tabular data, mode collapse happens inside the marginal distributions. Fig. 3 shows examples of mode collapse in generated images, and Tab. 2.1 shows the mode collapse phenomenon in tabular data.

Figure 3: The Figure illustrates two sets of samples generated by two GANs. The first-row GAN does not face the problem of mode collapse, and the second row faces the full mode collapse problem [3]

Contributions

• For tabular data:

– Propose an efficient way to cope with learning sparse, high-dimensional one-hot encodings via an embedding module. GAN is then trained on the embedded features instead of the raw data. The feature embedding module is trained at the same time as GAN. As a result, we name the proposed method Discriminative Embedding GAN (DEGAN).

– Propose an improved version of the conditional vector to solve the structural learning problem via information inferred from a graphical model (the so-called Graphical Conditional Vector (GCV)).

• For image data, we only consider a special case where the model is trained without utilizing a real dataset:

– Propose a novel method, Data-Free GAN (DFGAN), a GAN model which is trained to generate images without observing real datasets.

– The model is improved by stimulating activations of the pretrained classification network.

– The mode collapse problem is considered and solved by mutual information and diversity constraints.

– The model outperforms the existing works significantly over most of the baselines.

– The work provides an intuition for the research community that the knowledge from a classification model can be transferred directly to generative models with high fidelity. To the best of the author's knowledge, this is the first work able to produce high-quality synthetic images on popular benchmark datasets with a single feed-forward generator and without real data.

Structure of the thesis

In Chapter 1, several techniques in generative models are detailed to support the later chapters on solving the mode collapse problem. Chapter 1 does not cover the related works of Chapters 2 and 3. Instead, Chapters 2 and 3 each have their own related-work sections, separate from each other and from Chapter 1. The author believes that separating the background knowledge from the related works makes the flow of the thesis easier to follow. Chapter 2 analyses the mode collapse problem of GAN on tabular data and proposes methods to solve it. In Chapter 3, I discuss a special case of modelling image data in which the model is trained without observing real datasets. The author does not consider the vanilla case of generating image data, as recent works such as BigGAN or TACGAN have mostly solved the mode collapse problem on common datasets such as CIFAR [19] and ImageNet [20]. In the author's view, solving the mode collapse problem on other image datasets of similar resolutions would not answer any scientific question but would merely improve performance in an industrial sense.


This chapter provides background knowledge to support the later work in Chapters 2 and 3. The author first notes several main characteristics of Neural Networks and Deep Neural Networks, which will later be utilized by GAN, in Section 1.1. Bayesian networks are also described to support their later use for modelling tabular data. Finally, several generative models and a taxonomy of these models are presented to give readers an overview of the development of this research direction in Section 1.2.

1.1 Neural Networks

A neural network is a computing system which imitates the behaviour of biological neural systems. Due to its ability to generalize data through its parameters, the neural network has become the most powerful tool in machine learning for learning on high-dimensional data. Although the topic receives extremely high attention from researchers around the world, understanding neural networks or deep neural networks is still a challenging problem. As a result, research on this topic will remain very fertile until there is a proof of their convergence and of the boundaries of learnable information.

Basically, a neural network includes several layers, and between two layers there is a set of bipartite nodes. These nodes are connected to each other, with a weight value on each connection. The paradigm of a simple neural network is illustrated in Fig. 1.1. There are two main operations on a neural network: feed forward and Stochastic Gradient Descent. The first operation is used to compute the output value of the model, and the second is utilized to optimize the parameters of the model correspondingly. In the next subsections, I describe these two operations in more detail, as well as the reasons for their attraction.

Figure 1.1: Neural Network [2]

1.1.1 Feed forward

Consider the linear model y(x, w) = w^T φ(x) of Eq. 1.1, with weights w = {w_1, w_2, ..., w_M}. In Eq. 1.1, there are M fixed basis functions φ_j which are utilized for modelling the regression problem. These basis functions are the main architectural difference between traditional linear regression models and modern neural networks; we will revisit them later in this subsection. The problem aims to achieve the optimal w so that y(x, w) closely matches the target t, as in Eq. 1.3.

Eq. 1.3 can be written as t = y(x, w) + ε, where ε is the distance between y(x, w) and the target t. By assuming that ε is a random Gaussian variable with density function N(ε | 0, β), with mean 0 and variance β, Eq. 1.3 leads to modelling each optimal value t as a random Gaussian variable with density function N(t | y(x_j, w), β). As a result, for t = {t_1, t_2, ..., t_N} corresponding to the set of data instances X = {x_1, x_2, ..., x_N}, we have the density function of Eq. 1.4. Eq. 1.4 rests on one basic assumption: the data points x are independent and identically distributed (i.i.d.). From that assumption, we can take the product of the density functions of the t_j to obtain t:

P(t | w, X) = ∏_{j=1}^{N} N(t_j | y(x_j, w), β)        (1.4)
            = ∏_{j=1}^{N} N(t_j | w^T φ(x_j), β)        (1.5)

Taking the derivative of Eq. 1.4, following the maximum log-likelihood framework, we obtain the optimal solution as Eq. 1.6, in which we denote Φ(X) by Φ.
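For this linear-in-parameters model, the maximum-likelihood solution of Eq. 1.6 is commonly written as w_ML = (Φ^T Φ)^{-1} Φ^T t. The sketch below (an illustration with an assumed Gaussian basis and toy data, not code from the thesis) computes it directly:

```python
import numpy as np

def gaussian_basis(x, centers, width=0.5):
    # M fixed Gaussian basis functions phi_j(x) (an assumed choice of basis).
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * width ** 2))

# Toy 1-D regression data (illustrative only).
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=50)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, size=50)

Phi = gaussian_basis(x, centers=np.linspace(0, 1, 9))   # design matrix Phi(X)
w_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)          # w_ML = (Phi^T Phi)^{-1} Phi^T t
print("fitted weights:", np.round(w_ml, 2))
```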

Coming back to Eq. 1.1, M fixed basis functions are utilized to model the regression problem. This is an extremely large disadvantage when the model is applied to datasets of large dimensionality. The problem is caused by the curse of dimensionality [2] and by the need to adapt the choice of M, as well as the types of basis functions, to each type of dataset [21]. In order to solve this problem, neural networks were born to parameterize the basis functions [2]. From the basis function φ_j(x_i), a new, parameterized form of the function is obtained (Eq. 1.8). The network is then trained to minimize the error

E(w) = Σ_{i=1}^{N} ||y(x_i, w) − t_i||²        (1.9)

Combining this equation with Eq. 1.8, we can clearly see that there is no hope of finding an analytical solution for optimizing the network. This is due to the non-linear activation functions h(·) of the network, which prevent us from taking the derivative with respect to the parameters inside the activation function. Instead, gradient information is utilized to optimize the neural network. Subsection 1.1.2 describes the Stochastic Gradient Descent algorithm, which is the most common algorithm for fine-tuning neural network parameters.

1.1.2 Stochastic Gradient Descent

The Stochastic Gradient Descent (SGD) algorithm is a very common algorithm for optimizing a continuous function. The idea originates from Gradient Descent, which utilizes gradient information to search for solutions close to minimum points. The algorithm is used when an optimal closed-form solution cannot be found. The term "stochastic" literally means "random": instead of conducting Gradient Descent on the whole dataset, we take a small random set of data to perform each step of the algorithm. In order to perform SGD, there must be a basic assumption that the dataset is independent and identically distributed (i.i.d.). Without this assumption, the information learned at each iteration of SGD would provide wrong directions for the optimization. The detailed steps of Stochastic Gradient Descent are presented in Algorithm 1.

Algorithm 1: Stochastic Gradient Descent

input : Training data D
        learning rate λ
        random parameters θ
output: Model parameters θ*

1 while stopping criterion is not met do
2     Sample a random mini-batch B from D
3     Compute the gradient of the loss over B with respect to θ
4     Update θ ← θ − λ ∇_θ L(B; θ)
5 end

SGD is widely used in both research and industry.
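A minimal NumPy sketch of Algorithm 1 for a linear model with squared error (the data, batch size, and learning rate below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))                                            # training data D (features)
t = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(0, 0.1, 1000)   # targets

theta = rng.normal(size=5)                     # random initial parameters
lam, batch_size = 0.05, 32                     # learning rate and mini-batch size

for step in range(2000):                       # stopping criterion: fixed number of steps
    idx = rng.integers(0, len(X), batch_size)  # sample a random mini-batch (i.i.d. assumption)
    xb, tb = X[idx], t[idx]
    grad = 2 * xb.T @ (xb @ theta - tb) / batch_size  # gradient of the mean squared error
    theta -= lam * grad                        # SGD update
print("estimated parameters:", np.round(theta, 2))
```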


1.2 Generative Models

A generative model is a type of machine learning model in which the distribution P(X) is modelled and then utilized for sampling data points. From the development of generative models, we can categorize them into two types: implicit models and explicit models (Fig. 1.2). Early machine learning methods take a transparent approach to modelling the data distribution by assuming that the data follow some certain form; for example, a dataset is often assumed to be Gaussian. This approach is termed explicit modelling. Most explicit models, however, have a number of limitations in modelling complex distributions. The main problems come from wrong assumptions about the form of the distribution and about the number of modes needed to cover the dataset. Later, the Variational Autoencoder (VAE) removed the need for a tractable density function assumption to estimate the density of datasets, yet it faces another issue related to its lower-bound assumption. Readers are encouraged to go through the tutorial at NeurIPS 2016 [4] for further understanding of the problem. Shedding light on these issues, implicit modelling was proposed, where no prior assumption about the dataset is made before learning. This turns out to be very useful for generative models, where current implicit models can even generate realistic images. We start the section with Maximum Likelihood to get an overview of density estimation. After that, we move to Bayesian Networks to learn about discrete data distribution modelling. Finally, we come to the introduction of Generative Adversarial Networks, a method to recover a dataset implicitly.

1.2.1 Maximum Likelihood

Figure 1.2: Taxonomy of generative models based on Maximum Likelihood [4]

The essence of Maximum Likelihood (ML) is to define a model that outputs a probability distribution over data points, governed by a set of parameters θ. The likelihood of a dataset X = {x_i}_{i=1}^{m} is then given by P(X) = ∏_{i=1}^{m} p_θ(x_i). The optimization objective is to choose the parameters of the model that maximize P(X). However, instead of optimizing the product of instance likelihoods, ML is usually optimized via the log-likelihood Σ_{i=1}^{m} log p_θ(x_i). The reason is that optimizing the product directly leads to underflow problems caused by multiplying together many extremely small values.
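A small sketch of why the log-likelihood is preferred (simulated data, assuming SciPy is available): the raw product of instance likelihoods underflows to zero, while the sum of logs stays well-behaved:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=100_000)        # dataset X = {x_i}

p = norm.pdf(x, loc=0.0, scale=1.0)           # per-instance likelihoods p_theta(x_i)
print(np.prod(p))                             # 0.0 -- the product underflows
print(np.sum(np.log(p)))                      # finite log-likelihood, safe to maximize
```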


1.2.2 Bayesian Networks

Suppose that we have a dataset X with a set of attributes as in Fig. 1.3. How do we represent P(X), the joint distribution of the dataset X? Here the attributes are {Unhealthy food, Typhoid, Flu, Fever, Body ache}. We could count the co-occurrence of every possible combination of attribute values to build up P(X). However, this approach has a scalability problem: it becomes impossible to compute when the number of attributes grows very large. As a result, a Bayesian Network (BN) is utilized to model this distribution. The idea of a BN is to construct a graph that represents the conditional probabilities of the attributes. Fig. 1.4 represents the BN of the dataset X.

Figure 1.3: Attributes in a medical record

We can calculate the joint probability of the dataset through the product rule P(a, b) = P(a | b) · P(b). In order to scale up to many attributes, we have Eq. 1.13 for a dataset X with attributes D = {X_1, X_2, ..., X_k}:

P(X_1, X_2, ..., X_k) = P(X_k | X_1, X_2, ..., X_{k−1}) ··· P(X_2 | X_1) P(X_1)        (1.13)

The example in Fig. 1.5 illustrates a BN. The joint probability of the dataset is calculated as:


Figure 1.4: An example of Bayesian Network for modeling real data

Figure 1.5: Bayesian Network example from [2]


P(X) = P(X_1) P(X_2) P(X_3) P(X_4 | X_1, X_2, X_3) P(X_5 | X_1, X_3) P(X_6 | X_4) P(X_7 | X_4, X_5)        (1.14)

As a result, if we have a BN and the conditional probability of each attribute given its parents, we can obtain the distribution of the dataset. The questions are: "How do we obtain the BN?" and "How do we assign the conditional probabilities between attributes?" The next two subsections answer these questions.
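As an illustration of the factorization in Eq. 1.14 (the conditional probability tables below are made up for the example), the joint probability of a full assignment is the product of each node's probability given its parents:

```python
# Joint probability of a full assignment under a Bayesian Network, following Eq. 1.14.
# The CPT values below are hypothetical and purely for illustration.
def joint_probability(x, cpts, parents):
    p = 1.0
    for node, table in cpts.items():
        key = tuple(x[pa] for pa in parents[node])  # parent configuration of this node
        p *= table[key][x[node]]                    # P(node = x[node] | parents)
    return p

parents = {"X1": (), "X2": ("X1",)}
cpts = {
    "X1": {(): {0: 0.7, 1: 0.3}},                  # P(X1)
    "X2": {(0,): {0: 0.9, 1: 0.1},                 # P(X2 | X1 = 0)
           (1,): {0: 0.4, 1: 0.6}},                # P(X2 | X1 = 1)
}
print(joint_probability({"X1": 1, "X2": 1}, cpts, parents))  # 0.3 * 0.6 = 0.18
```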

Bayesian Network construction

There are many ways to construct a Bayesian Network. The most straightforward way is to construct the network based on high-level semantic relations between attributes or on prior knowledge about the attributes. For example, the "age" attribute and the "salary" attribute are likely to be highly correlated, so they could be connected to form one edge of the graph. Based on human knowledge, we could build a graph that carries good information for inference.

However, this approach is not always easy, as people might not have prior knowledge about the network. Thus, we have to rely on some other scalar criteria. There are several common criteria, such as Mutual Information [22], the Bayesian Information Criterion (BIC), the Bayesian score (BD) and the Minimum Description Length (MDL) criterion [23]. Given the research scope of the thesis, the author only reports the Mutual Information [23] approach for constructing a BN in tree form. Readers are encouraged to refer to the full text of [23, 2] to find out more about the other methods.

Chow and Liu [24] provided an approach for approximating a multi-variable probability distribution as in Eq. 1.13, which is the basis of tree-structure learning of BNs. A skeleton for building such a tree is given in Algorithm 2. The algorithm is similar to Kruskal's algorithm [25] for constructing a maximum-weight spanning tree, where the weight of each edge is the Mutual Information between two attributes.

Algorithm 2: Graph Construction

input : Set of attributes D = (X_1, X_2, ..., X_n)
output: Graph G = {V, E}

1 Compute the Mutual Information I(X_i; X_j) for every pair of attributes
2 Sort the candidate edges (X_i, X_j) by I(X_i; X_j) in descending order
3 Add edges in that order, skipping any edge that would create a cycle, until the tree has n − 1 edges
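A minimal Python sketch of Algorithm 2 (the mutual-information estimator and the toy data are assumptions, not part of the thesis), using Kruskal-style edge selection with a union-find structure to avoid cycles:

```python
import itertools
import numpy as np
from sklearn.metrics import mutual_info_score

def chow_liu_tree(data, columns):
    """Return the edges of a maximum-weight spanning tree over the attributes,
    with pairwise mutual information as edge weights (Chow-Liu skeleton)."""
    scores = {(a, b): mutual_info_score(data[:, a], data[:, b])
              for a, b in itertools.combinations(range(len(columns)), 2)}
    parent = list(range(len(columns)))            # union-find for cycle detection

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = []
    for (a, b), _ in sorted(scores.items(), key=lambda kv: -kv[1]):
        ra, rb = find(a), find(b)
        if ra != rb:                              # skip edges that would create a cycle
            parent[ra] = rb
            edges.append((columns[a], columns[b]))
    return edges

# Tiny categorical dataset (made up): columns are discrete attributes.
rng = np.random.default_rng(0)
x1 = rng.integers(0, 2, 500)
data = np.stack([x1, x1 ^ rng.integers(0, 2, 500), rng.integers(0, 3, 500)], axis=1)
print(chow_liu_tree(data, ["A", "B", "C"]))
```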

Inference on Bayesian Network

This subsection assumes that we already have a Bayesian Network, and the objective is to perform inference on this network. As in the previous subsection, the author only briefly describes simple Belief Propagation for singly connected networks, given the scope of the thesis.

Let E be the subset of instantiated variables. The posterior probability that a variable X_j takes the value i can be obtained via Bayes' rule:

P(X_j^i | E) = P(X_j^i) P(E | X_j^i) / P(E)        (1.15)

As this subsection focuses on the simple case of a BN that is a tree, any node divides the network into two separate parts, which we call E+ and E−. E− is the subtree of nodes rooted at X_j, and E+ is the set of all other nodes. Eq. 1.15 then becomes:

P(X_j^i | E) = P(X_j^i) P(E−, E+ | X_j^i) / P(E)        (1.16)

As E+ and E− are two independent sets given X_j, we can rewrite Eq. 1.16 as:

P(X_j^i | E) = α P(X_j^i | E+) P(E− | X_j^i)        (1.17)

where α is a normalization constant that makes the probabilities sum to one over the values i. We define the terms:

λ(X_j^i) = P(E− | X_j^i)        (1.18)
π(X_j^i) = P(X_j^i | E+)        (1.19)

Several books [2] factorize the graph and refer to λ and π as belief and message; in terms of meaning, they are the same. As a result, Eq. 1.17 can be written as:

P(X_j^i | E) = α λ(X_j^i) π(X_j^i)

A message is then sent from node B to its descendant S for each specific value S_k.
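A tiny numerical sketch of Eqs. 1.17-1.19 (the λ and π values are made up for illustration): the posterior over the values of X_j is the normalized elementwise product of λ and π:

```python
import numpy as np

# Hypothetical lambda and pi vectors for a node X_j with three possible values.
lam = np.array([0.20, 0.90, 0.05])   # lambda(X_j^i) = P(E- | X_j^i), evidence from below
pi  = np.array([0.70, 0.20, 0.10])   # pi(X_j^i)     = P(X_j^i | E+), prediction from above

posterior = lam * pi
posterior /= posterior.sum()          # alpha is exactly this normalization
print(posterior)                      # P(X_j^i | E) for each value i
```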

1.2.3 Generative Adversarial Networks

From Fig. 1.2, we can see that Generative Adversarial Networks belong to the implicit model family. The mechanism of GAN is quite straightforward. GAN employs two neural networks. The first neural network is named the Generator, and the second one is the Discriminator. As their names suggest, the Generator is responsible for generating data, while the Discriminator aims to distinguish the generated samples from the samples of the original dataset. These two networks compete with each other to perform the generative learning task: the Generator learns to fool the Discriminator, and the Discriminator tries not to be fooled by the Generator. The paradigm of GAN is illustrated in Fig. 1.6.

Figure 1.6: GAN framework

The Generator is a differentiable continuous function G_θ (θ denotes the parameters of G) whose input is a random noise vector z. The synthetic samples are x̃ = G_θ(z). The Discriminator is also a differentiable continuous function D_φ (φ denotes the parameters of D). It takes as input a set of samples, D_φ(x) or D_φ(x̃). The output of D_φ is normalized into [0, 1] and represents the probability that the sample comes from the real data distribution.

In order to optimize θ and φ, we model the optimization problem as a minimax game. The cost function used for optimizing D_φ is:

J_D(θ, φ) = −(1/2) E_{x∼p_data} [log D_φ(x)] − (1/2) E_z [log(1 − D_φ(G_θ(z)))]        (1.23)

Eq. 1.23 is the binary cross-entropy for a binary classification problem with two classes: the first class is the real class, and the second one is the fake class. The cost function means that D_φ tries to learn to classify samples into these two categories. The optimal φ is:

φ* = argmin_φ J_D(θ, φ)

The Generator G_θ, on the other hand, tries to fool D_φ by maximizing J_D, which means making D unable to differentiate real from fake samples:

J_G(θ, φ) = −J_D(θ, φ) = (1/2) E_{x∼p_data} [log D_φ(x)] + (1/2) E_z [log(1 − D_φ(G_θ(z)))]
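An illustrative PyTorch sketch of this minimax game (a simplified training loop with assumed MLP architectures and toy 2-D data, not the thesis's exact models); the generator step uses the common non-saturating variant of the objective:

```python
import torch
import torch.nn as nn

# Toy real data: 2-D Gaussian samples (a stand-in for p_data).
def sample_real(n):
    return torch.randn(n, 2) * 0.5 + torch.tensor([2.0, -1.0])

G = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 2))                 # Generator G_theta
D = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())   # Discriminator D_phi
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

for step in range(2000):
    x_real = sample_real(64)
    z = torch.randn(64, 8)
    x_fake = G(z)

    # Discriminator step: minimize J_D (binary cross-entropy of Eq. 1.23).
    d_loss = bce(D(x_real), torch.ones(64, 1)) + bce(D(x_fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator step: fool D (non-saturating form of maximizing J_D with respect to theta).
    g_loss = bce(D(G(z)), torch.ones(64, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
```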

Intuitively, a data distribution can be viewed as a mixture of modes, each of which includes one mean and one variance. Mode collapse happens when the full set of modes is not recovered [3].

This is the most difficult problem in modelling data distributions. The reasons vary, but intuitively, we can see the mode collapse problem as a chicken-and-egg problem. If we want to solve mode collapse, we need to detect mode collapse. Detecting mode collapse requires us to know how many modes of the real dataset the synthetic data has already matched. However, the number of modes in the real dataset is exactly what the generative model is trying to figure out. This chicken-and-egg problem is illustrated in Fig. 1.7.

Figure 1.7: Mode collapse paradox

Mode collapse has two types: complete collapse and partial collapse. Partial collapse happens more frequently than complete collapse. This phenomenon can be observed when the synthetic data are natural enough but their diversity is not guaranteed. Fig. 3 illustrates complete mode collapse, which is very easy to detect. Partial mode collapse is very hard to detect due to the unknown number of modes and distributions. For GAN, the problem is more severe due to the use of adversarial networks. Given its relevance to the content of the thesis, the author will only focus on mode collapse in GAN. In the next sections, we explore the reasons for mode collapse and how to measure it.


– When there is an imbalance in the dataset. For example, take the MNIST dataset with only the two classes "0" and "1", in which there are 1000 images containing the digit "1" and only one image containing the digit "0". The loss incurred by never producing the digit "0" is very low. As a result, the model will neglect the "0" images.

• Optimization perspective:

– The objective of the generator G is to fool the Discriminator D, so the objective of G is:

∇_{θ_g} (1/m) Σ_{i=1}^{m} log(1 − D(G(z_i)))        (1.29)

When the Discriminator D is weakly trained and not updated for some time, the optimal generated sample x* will no longer depend on z.


1.3.2 Measurement

While there is no way to measure the exact number of modes that disappear, the research community has provided methods to quantitatively observe mode collapse, namely the Inception Score (IS) [26] and the Fréchet Inception Distance (FID) [27].

Inception Score: The Inception Score (IS) is one of the most popular methods to measure the quality of generated images. The main idea of the IS is to compare the predicted output distribution of the synthetic data with the marginal label distribution, given a pretrained Inception classification model, through the KL divergence. The score is calculated from:

KL(p(ŷ | G(z)) || p(y)) = Σ_ŷ p(ŷ | G(z)) (log p(ŷ | G(z)) − log p(y))        (1.31)
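A small NumPy sketch of Eq. 1.31 and the resulting score (illustrative only; probs stands in for the Inception model's predicted class probabilities p(ŷ|G(z)) for each synthetic image):

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    # probs: array of shape (num_images, num_classes); rows are p(y_hat | G(z)).
    p_y = probs.mean(axis=0, keepdims=True)                                  # marginal p(y)
    kl = np.sum(probs * (np.log(probs + eps) - np.log(p_y + eps)), axis=1)   # Eq. 1.31 per image
    return float(np.exp(kl.mean()))                                          # IS = exp(E[KL])

probs = np.random.dirichlet(np.ones(10), size=1000)   # fake predictions, just for the sketch
print(inception_score(probs))
```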

Fréchet Inception Distance: The Fréchet Inception Distance (FID) is currently the most popular method to measure both the quality and the diversity of generated images. Unlike the IS, which measures the distance between the conditional predicted probability and a set of given labels, the FID measures the difference between the distributions of the real and the synthetic datasets.

For the Fréchet distance between two multivariate normal distributions, we have:

d(X, Y) = ||µ_X − µ_Y||² + Tr(Σ_X + Σ_Y − 2(Σ_X Σ_Y)^{1/2})        (1.32)

where X and Y are two sets of samples from the two distributions, µ_X and µ_Y are their means, and Σ_X and Σ_Y are their covariance matrices.

In order to calculate the Fréchet Inception Distance, we take the embedding vectors of the datasets obtained by passing them through a pretrained Inception classification model. The embedding vectors of the real dataset form X, and those of the synthetic dataset form Y.
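A compact sketch of Eq. 1.32 using SciPy's matrix square root (illustrative; real_feats and fake_feats stand in for Inception embedding vectors of the real and synthetic datasets):

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(real_feats, fake_feats):
    mu_x, mu_y = real_feats.mean(axis=0), fake_feats.mean(axis=0)
    sigma_x = np.cov(real_feats, rowvar=False)
    sigma_y = np.cov(fake_feats, rowvar=False)
    covmean = sqrtm(sigma_x @ sigma_y)
    if np.iscomplexobj(covmean):          # numerical noise can introduce tiny imaginary parts
        covmean = covmean.real
    return float(np.sum((mu_x - mu_y) ** 2) + np.trace(sigma_x + sigma_y - 2 * covmean))

real_feats = np.random.randn(500, 64)     # stand-ins for Inception embeddings
fake_feats = np.random.randn(500, 64) + 0.3
print(fid(real_feats, fake_feats))
```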


Chapter 2

MODE COLLAPSE ON TABULAR DATA MODELLING

Tabular data is one of the most important types of information in the world. However, learning with tabular data is a very challenging task. This chapter concerns the distribution modelling of tabular data and tabular data generation.

2.1 Tabular data and challenges in tabular data modelling

Tabular data has long been one of the most common forms of structured data, where each column represents one characteristic or attribute of the data. Due to its flexibility, which does not depend on any constrained continuous range, tabular data has been utilized to model most of the data we have, from medical records and bank statements to research data. This is the paramount reason why tabular data research is important. Before trying to understand tabular data, we first go through several of its characteristics:

• Scarcity: Although it is the most popular scheme for modelling data, people find it very difficult to access tabular datasets. The main reason is that instances of tabular data often contain confidential information that might directly expose privacy problems if made public.

• Categorical and continuous attributes mix-up: In tabular data, not all attributes are continuous, and this poses many challenges to current machine learning approaches.

• Transfer learning unavailability: On image datasets, the scarcity of data can be addressed by transfer learning. For example, a model pre-trained on ImageNet [20] can be fine-tuned on other datasets for different tasks, such as MS-COCO and Pascal VOC for object detection. However, for tabular data, due to differences in data structure and content, transferring from one dataset to another is a challenging problem.

• High cardinality: The categorical columns of tabular data are represented as one-hot vectors, which is the main cause of high cardinality, i.e. high-dimensional vectors, when the number of categories in a column is extremely large. For example, in order to encode 1 million house addresses in a city, a categorical one-hot vector is used, resulting in a 1-million-dimensional one-hot vector. This is quite problematic due to over-fitting and high-dimensional learning (a minimal sketch of this dimensionality gap is shown below).

Due to these characteristics of tabular data, modelling this type of information is very important. It would help to offer an alternative dataset that conceals private information in place of the original dataset. The synthetic data could then be used for other research without concerns about leaking sensitive information. Of course, in order to completely protect confidential information, we need an additional mechanism called Differential Privacy [28], yet modelling the data distribution is the very first step before adding noise as in that work. Hence, in this chapter, the author only focuses on the modelling of tabular data. By modelling the data distribution correctly, the noise-adding process can be carried out more easily while preserving as much information as possible.
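The following toy sketch (the sizes are assumptions, not values from the thesis) contrasts the dimensionality of a one-hot encoding for a high-cardinality column with the kind of low-dimensional learned embedding that the proposed DEGAN trains jointly with the GAN:

```python
import torch
import torch.nn as nn

num_categories = 1_000_000        # e.g. one-hot encoding of house addresses in a city
embedding_dim = 64                # assumed embedding size

# One-hot representation: each row is a 1,000,000-dimensional, mostly-zero vector.
idx = torch.tensor([42, 7, 999_999])
one_hot = nn.functional.one_hot(idx, num_classes=num_categories).float()
print(one_hot.shape)              # torch.Size([3, 1000000])

# Learned embedding: the same categories become dense 64-dimensional vectors.
embed = nn.Embedding(num_categories, embedding_dim)
print(embed(idx).shape)           # torch.Size([3, 64])
```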

Similar to other learning tasks, tabular modelling can be done via two approaches: traditional machine learning and deep learning algorithms. The most common traditional method for modelling tabular data is PrivBayes [29], which shows many limitations in scaling up. The emerging deep learning methods seem to be the solution: CTGAN [30] appears to achieve the state of the art in modelling, yet its ability to learn the dependency between discrete features is limited. Furthermore, CTGAN also faces several problems related to high-dimensional learning when the number of categories inside one column increases drastically. We show our justifications in the next sections. This issue differs from the PrivBayes method [29], where it is the increase in the number of columns that causes the scalability problem. In this chapter, I point out the advantages and disadvantages of these methods and propose solutions that combine the strengths of each to model tabular data. The rest of the chapter is organized as follows: Section 2.2 details the techniques related to PrivBayes [29] and CTGAN [30]; Section 2.3 considers the problem of high-dimensional learning in CTGAN; Section 2.4 shows the justification for the structure learning problem; Section 2.5 describes the proposed solutions to each of the problems; the obtained results of the proposed methods are presented in Section 2.6; and the last section, 2.7, discusses several future directions for this problem.

2.2 Related Work

From the previous section, we know that the Bayesian network is often used to model tabular data as a traditional machine learning method. Although Bayesian networks work very well for small datasets due to the small scale of the approximate graph, they face serious problems when the number of columns increases. Junction tree creation becomes the main problem, due to the approximation from the original graph to a triangulated graph and the elimination of variables. This process, which adds and removes information from the data, causes a loss of original information. The most noticeable work is PrivBayes [29], in which the authors proposed a method to add noise into the network to protect privacy, following Differential Privacy [31]. The later work [32] improves PrivBayes
