SCHOOL OF INFORMATION AND COMMUNICATION TECHNOLOGY
DINH ANH DUNG
ON SOLVING MODE COLLAPSE PROBLEM
IN SEVERAL GENERATIVE MODELS
MASTER THESIS IN COMPUTER SCIENCE
Hanoi - 2021
SCHOOL OF INFORMATION AND COMMUNICATION TECHNOLOGY
Dinh Anh Dung
ON SOLVING MODE COLLAPSE PROBLEM
IN SEVERAL GENERATIVE MODELS
Subject: Computer Science
Title ID: 19BKHMT-KH04
MASTER THESIS IN COMPUTER SCIENCE
Supervisor: Assoc. Prof. Huynh Thi Thanh Binh
Signature of supervisor
Hanoi - 2021
REQUIREMENTS OF THE THESIS
1 Student’s information
Name: Dinh Anh Dung
Class: Computer Science
Affiliation: Hanoi University of Science and Technology
Phone: +6590947468 Email: dinhanhdung1996@gmail.com
Hanoi, 12th December 2021
Author
Dinh Anh Dung
4 Attestation of thesis advisor
Supervisor
Huynh Thi Thanh Binh
Table of Contents
1.1 Neural Networks 21
1.1.1 Feed forward 22
1.1.2 Stochastic Gradient Descent 25
1.2 Generative Models 26
1.2.1 Maximum Likelihood 26
1.2.2 Bayesian Networks 28
1.2.3 Generative Adversarial Networks 32
1.3 Mode Collapse 34
1.3.1 Reasons 36
1.3.2 Measurement 37
2 Mode Collapse on tabular data modelling 38
2.1 Tabular data and challenges in tabular data modelling 38
2.2 Related Work 40
2.3 High dimensions learning problem 42
2.4 Structure learning problem 43
2.5 Proposed method 45
2.5.1 Input data 45
2.5.2 Discriminative Embedding GAN 47
2.5.3 Graphical Conditional Vector 48
2.6 Experimental Results 51
2.6.1 Experimental setup 51
2.6.2 Experiments 56
2.7 Discussion 62
3 Mode Collapse on special case of image data modelling 63
3.1 Background 64
3.1.1 Inversion-based Learning 64
3.1.2 Deep Inversion 65
3.2 Data Free Generative Adversarial Networks 66
3.2.1 Discriminator Learning 68
3.2.2 Generator Learning 69
3.2.3 Improving Generator Learning with Class Activation Maximization 70
3.2.4 Improving Generator Learning via Mutual Information and Diversity constraints 70
3.3 Experimental Results 71
3.3.1 Experiments on 2D dataset 72
3.3.2 Experiments on CIFAR datasets 75
3.4 Discussion and Conclusion 84
List of Figures
Figure 1 [1] illustrates the difference when using Mean Square Error (MSE)
loss and adversarial loss on predicting next frame for video. 14
Figure 2 [2] illustrates the case of using maximum likelihood framework to
estimate parameters for Mixture Gaussian model. 15
Figure 3 The figure illustrates two sets of samples generated by two GANs.
The GAN in the first row does not face the problem of mode collapse, while the one in the second row suffers full mode collapse [3] 18
Figure 1.1 Neural Network [2] 22
Figure 1.2 Taxonomy of generative models based on Maximum Likelihood [4] 27
Figure 1.3 Attributes in a medical record 28
Figure 1.4 An example of Bayesian Network for modeling real data 29
Figure 1.5 Bayesian Network example from [2] 29
Figure 1.6 Generative Adversarial Neural Networks (GAN) framework 33
Figure 1.7 Mode collapse paradox 35
Figure 2.1 alarm dataset: performance distance and Wasserstein distance of
Conditional Tabular GAN (CTGAN) and Discriminative Embedding GAN (DEGAN) during training 44
Figure 2.2 insurance dataset: performance distance and Wasserstein distance
of CTGAN and DEGAN during training 44
Figure 2.3 Discrete attribute represented by one-hot vectors 46
Figure 2.4 Highlights of the proposed model GDEGAN. 48
Figure 2.5 MNIST28: Images generated by different methods 57
Figure 3.1 Inversion model and results taken from [5] 65
Figure 3.2 Deep Inversion model from [6] 65
Figure 3.3 Main difference between standard GAN and proposed Data-Free GAN 66
Figure 3.4 Detailed diagram of the proposed Data Free GAN 68
Figure 3.5 Generated samples of (a) our method, (b) DeepInversion [6] and
(c) KEGNET [7] on 2D toy dataset. 75
Figure 3.6 We plot the IS (first figure) and FID (second figure) scores and
computational time of DeepInversion (DI) [6] trained on the
CIFAR-10 dataset at different iterations (blue color). 77
Figure 3.7 The synthesized images of two existing inversion-based image
synthesis approaches using pre-trained classifiers and without real training data: DeepInversion (DI) [6] (top row), DAFL [8] (second row) and our proposed method (third row) on the CIFAR-10 dataset. 80
Figure 3.8 Random synthetic samples of DCGAN (left), WGAN-GP
(middle), and ours (right) on the CIFAR-100 dataset. DCGAN and WGAN-GP are trained with real data. 81
Figure 3.9 The synthetic samples of KEGNET [7] (top row), DI [6] (middle
row), and our method (bottom row) on the SVHN dataset. KEGNET and DI perform poorly on this dataset. 83
Figure 3.10 Synthetic ImageNet images generated by our method. 84
List of Tables
Table 2.1 The appearance rate in synthetic datasets of 1000 randomly picked
pairs of values from three real datasets: adult, census and covertype 45
Table 2.2 Tabular data benchmark description 52
Table 2.3 The performance according to L_test of current
state-of-the-art tabular generative models on 6 simulated datasets: alarm, asia, child, insurance, grid and gridr. 56
Table 2.4 The performance (in %) of current state-of-the-art tabular
generative models on six real datasets: adult, census, covtype, intrusion, mnist12 and mnist28. 58
Table 2.5 FID scores of different methods. DEGAN overcomes mode-collapse
problems with significant gaps compared to other methods. 59
Table 2.6 Marginal statistics score on 4 real datasets: adult, census, covtype
and credit. The result indicates the out-performance of the proposed methods over others. 60
Table 2.7 Marginal statistics score on 3 real datasets: intrusion, mnist12 and
mnist28. The mnist28 dataset has high dimensionality of input. 61
Table 2.8 Marginal statistics score on 2 real datasets: Colorado and Fire
department. The two datasets have sparse and high-dimensional input. 61
Table 2.9 Compared with other methods in Differential Privacy Synthetic
Data Challenge 2018 [9]. 61
Table 2.10 Comparison between different graph construction schemes on
machine learning efficiency score. 62
Table 2.11 Comparison between different graph construction schemes on marginal
statistics score. 62
Table 3.1 Architecture for 2D dataset 73
Table 3.2 The hyper-parameters of our method on the 2D dataset. In this
dataset, we first pre-train the classifier C before training our model with G and D. None: no parameter. 74
Table 3.3 Network architectures of the generator, discriminator, and encoder
on the CIFAR-10, CIFAR-100, and SVHN datasets. 76
Table 3.4 The hyper-parameters of our method on the CIFAR-10/100 datasets.
n/a: no parameters. 77
Table 3.5 The ablation study on three losses used in our model. 79
Table 3.6 IS and FID scores of our method, existing works of data-free
models (DAFL and DeepInversion), and two well-known GANs (DCGAN and WGAN-GP). 81
Table 3.7 IS and FID scores of our method, existing works of data-free
models (DAFL and DeepInversion), and two well-known GANs (DCGAN and WGAN-GP) trained with real data on the CIFAR-100 dataset. Data-free (✓): the model is trained via model inversion or noise, with no real data. 81
(DC-Table 3.8 IS and FID of KEGNET [7], DeepInversion [6] and our method
on SVHN dataset For a fair comparison, we use the same trained classifier (ResNet10) published by KEGNET for all com- pared methods. 83
pre-Table 3.9 The hyper-parameters of our method on the SVHN dataset n/a:
no parameter. 83
ABSTRACT

Mode collapse is one of the hardest problems of data distribution modelling. It is often caused by the complexity of the dataset in terms of dimensions and distributions. The thesis therefore aims to solve this problem. Our approaches tackle mode collapse separately on each type of dataset because of their differences in characteristics. Two genres of data are considered: image data and tabular data. For tabular data, through observing the characteristics of the datasets and the performance of the current state of the art, the author has identified mode collapse caused by learning on high-dimensional one-hot vectors, as well as a structural learning problem of the current state-of-the-art models. From that, a model named Discriminative Embedding Generative Adversarial Network is proposed to solve these issues. For image data, the author only considers a specific case in which generative models are trained without using real datasets. Due to the lack of a training dataset, the model falls into mode collapse more easily. In order to make the model work effectively, a model named Data-Free Generative Adversarial Network (DF-GAN) is introduced. As far as the author is concerned, these proposed methods are novel and have not been investigated before. The author also conducted extensive experiments to investigate every aspect of the proposed methods. The results show that both of the proposed methods outperform other state-of-the-art approaches significantly.
INTRODUCTION

Data distribution modelling is one of the most important problems in modern computer and data science. By modelling data distributions, we can manipulate the data to support many other learning tasks. This chapter summarizes the key points on generative models, especially GAN, to provide a general view of what a generative model is, why generative models are worth researching, and why this thesis focuses on the mode collapse problem. At the end, a summary of contributions and the structure of the thesis are given.

A generative model is a type of modeling technique based on probabilistic approaches. In machine learning and learning theory, the probabilistic approach has recently prevailed among modern ideas on modeling data. Starting from linear discriminant functions for linear classification, with their drawback of relying on a geometric interpretation, scientists have successfully deployed the intuition of using three elements, p(Y|X), p(Y) and p(X), to model any type of data. Here p(Y|X) represents the relationship between observable data instances X and the associated label outcomes Y, p(Y) is the prior knowledge about the label distribution, and p(X) is the prior knowledge about the distribution of the dataset. Due to the fruitful outcomes of the probabilistic approach and the geometric interpretation, p(Y|X) has been used to model most of our recent classification and regression tasks, and this method is often referred to as a discriminative model. On the other hand, p(X) is often utilized to model the data distribution of a dataset X. Based on its ability to generate data instances, this model is named a generative model. Hence, the term "generative model" is not only associated with data-generating applications; it further delivers the best of our understanding of data through modeling the data distribution p(X).
Due to the characteristics of generative models in data distribution modeling, research on generative models is extremely important. First, the success of generative models reflects our knowledge of data distributions, especially high-dimensional data distributions. In fact, data understanding is one of the trending research topics in physics, mathematics, economics, computer science and data science [10, 11, 12, 13]. The generative model offers a statistical approach to provide knowledge on data distributions explicitly and implicitly. Second, generative models benefit the storing and processing of data in the days of the data boom. The aim of generative models is to offer a set of statistics or statistical parameters that could represent the dataset (sufficient statistics). Instead of storing the whole dataset, only a certain set of parameters needs storing so that we could recover the whole dataset at any time. For example, with billions of univariate Gaussian data points, instead of storing all billions of points, we only need to store two values, the mean µ and the variance σ². Based on these two values, we could sample as much as we want to recover the original dataset. Third, generative models allow learning on multi-modal outputs. In many scenarios, one input might lead to different outputs [4]. While using the MSE loss only helps to repeat similar information from the training data, training in a generative manner provides an output as a probability; hence the output will be more flexible (less dependent on the training data) and more realistic with respect to the constraints of the input. For example, Fig. 1 shows the problem of generated images trained with MSE loss lacking information such as ears, while all information is kept when training with the adversarial loss. The leftmost image is the ground truth, the middle one is the generated image trained with MSE loss, and the last image is the generated output trained with the adversarial approach.
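To make the sufficient-statistics argument above concrete, the following minimal NumPy sketch (an editor-added illustration; the dataset size and parameter values are arbitrary assumptions) stores only the mean and variance of a univariate Gaussian dataset and then resamples from them:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=1_000_000)  # large univariate Gaussian dataset

# The two sufficient statistics we actually need to keep.
mu, var = data.mean(), data.var()

# "Recover" the dataset later by sampling from N(mu, var) as many points as we want.
recovered = rng.normal(loc=mu, scale=np.sqrt(var), size=1_000_000)

print(mu, var)                              # roughly 5.0 and 4.0
print(recovered.mean(), recovered.var())    # statistically indistinguishable from the original
```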
Figure 1: [1] illustrates the difference between using MSE loss and adversarial loss for predicting the next frame of a video.

Many works have exploited the strength of generative models to model data and provide many great applications with high practical demand. For example:
• Image-to-image translation: translating images from a source domain so that they carry the style of a target domain [14, 15].

• Recovering images from damage.

• Tasks related to creating art.

• Super-resolution: converting low-resolution images to high-resolution ones.
Most generative models follow a certain number of steps to model the data:

1. Observe the dataset.

2. Estimate the parameters θ of the data distribution.

3. Sample based on the estimated parameters θ.
As a statistical model, a generative model also faces problems of over-fitting, lack of training data, singularity, or biased learning [16, 2]. The Maximum Likelihood (ML) and Expectation Maximization (EM) frameworks have been widely used to estimate parameters for such models, and both show a number of limitations. ML and EM require prerequisite assumptions about the density function of the data before the estimation steps can be performed [16, 2]. While there are a lot of density functions that could be used to model a distribution, an assumption biased toward one particular function would lead to wrongly learned parameters (later, in Chapter 2, we will see that this is a common problem for explicit generative models). The second problem is about the density functions themselves: several density functions do not work well with ML or approximate inference frameworks. For example, consider the case of estimating parameters for Gaussian Mixture models via ML, as in Fig. 2. When one mode has its mean value µ_j collapse onto one data point x_n, the density function of that mode turns into N(x_n | x_n, σ_j²), whose contribution to the likelihood grows without bound as σ_j approaches zero; this is the well-known singularity problem.

Figure 2: [2] illustrates the case of using the maximum likelihood framework to estimate parameters for a Gaussian Mixture model.
In order to overcome most of the problems of the traditional machine learning methods for modeling data mentioned above, GAN has been proposed in [17] by Ian Goodfellow. This method changes the second step of a normal generative model to employ a minimax game between two neural networks, a Generator and a Discriminator, competing with each other to estimate the distribution parameters. By this, GAN offers a method to do data modeling that is asymptotically consistent, supports parallel computing, and requires no density function assumptions.

Since the introduction of GAN, this research direction has become one of the most active research problems in generative models and computer vision. Although GAN provides a good solution to generate high-quality outputs, it faces a number of other difficulties. The first one is finding the Nash equilibrium [18], which has been proved to be more difficult than optimizing an objective function. The second problem is non-convergence, which is also a common phenomenon in a minimax game. To have a clearer picture of this problem, we could consider the equation for training GAN as in Eq. 1:

min_G max_D V(D, G) = E_{x∼p_data}[log D(x)] + E_{z∼p_z}[log(1 − D(G(z)))]  (1)

The third problem is the mode collapse problem, where the Generator G only models a set of repeated feature modes without diversity.
In fact, mode collapse is the largest problem making GAN infeasible to apply to real-world scenarios [17], even though the instances generated by training GAN are realistic. By generating only a few distinct samples, GAN becomes useless, since it does not provide enough information to support other tasks and does not solve the inflexibility problem mentioned before in Fig. 1. In terms of modelling a data distribution, the parameters trained by GAN are totally useless if they only model the peak points of the distribution. As a result, in this thesis, I provide a number of methods to solve the problem of mode collapse, bridging the gap between GAN research and real-life problems.
Scope of research
The scope of research of the thesis focuses on solving the problem of mode collapse in generative models. In detail, the author focuses more on GAN than on other methods because, currently, GAN achieves the state-of-the-art performance on most datasets. The author only considers two types of datasets, which are tabular data and image data. Several other types of data, such as text or audio, are not covered in the thesis due to limitations in time and the limited performance of current generative models on these types of data.
Methodology
In order to approach the mode collapse problem, I consider the problem on each type of data instead of designing a common method for all data types. This is because each type of data exhibits a different type of mode collapse. For example, on image datasets, we could observe the problem of mode collapse by looking at a batch of data, while on tabular data, the mode collapse happens inside the marginal sets. Fig. 3 shows examples of mode collapse on generated images, and Tab. 2.1 shows the mode collapse phenomenon in tabular data.

Figure 3: The figure illustrates two sets of samples generated by two GANs. The GAN in the first row does not face the problem of mode collapse, while the one in the second row suffers full mode collapse [3]
Contributions

• For tabular data:

– Propose an efficient way to cope with the learning of sparse, high-dimensional one-hot encodings via an embedding module. GAN is then trained on the embedded features instead of the raw data. The feature embedding module is trained at the same time as the GAN. As a result, we name the proposed method Discriminative Embedding GAN (DEGAN).

– Propose an improved version of the conditional vector to solve the structural learning problem via information inferred from a graphical model (the so-called Graphical Conditional Vector (GCV)).
• For image data, we only consider a special case where the model is trained without utilizing a real dataset:

– Propose a novel method, Data Free GAN (DFGAN), a GAN model which is trained to generate images without observing real datasets.

– The model is improved by stimulating activations of the pretrained classification network.

– The mode-collapse problem is considered and solved via mutual information and diversity constraints.

– Over most of the baselines, the model outperforms the existing works significantly.

– The work provides an intuition for the research community that the knowledge from a classification model can be transferred directly to generative models with high fidelity. As far as the author is concerned, this is the first work able to produce high-quality synthetic images on popular benchmark datasets with a single forward-pass generator and without real data.
Structure of the thesis
In Chapter 1, several techniques in generative models are detailed to support the later chapters on solving the mode collapse problem. Chapter 1 does not cover the related works of Chapters 2 or 3; instead, Chapters 2 and 3 each present their own related work, separately from each other and from Chapter 1. The author believes that separating the background knowledge from the related works makes the flow of the thesis easier to follow. Chapter 2 analyses the problem of mode collapse of GAN on tabular data and proposes methods to solve the problem. In Chapter 3, I discuss a special case of modelling image data in which the model is trained without observing real datasets. The author does not consider the vanilla case of generating image data, as recent works such as BigGAN or TACGAN have mostly solved the mode collapse problem on common datasets such as CIFAR [19] and ImageNet [20]. Solving the mode-collapse problem on other image datasets with similar resolutions would, in the belief of the author, not answer any scientific question but merely improve performance in an industrial sense.
Chapter 1

This chapter provides several details of background knowledge in order to support the later works in Chapters 2 and 3. The author first notes several main characteristics of Neural Networks and Deep Neural Networks, which will later be utilized by GAN, in Section 1.1. The Bayesian network is also described to support its later usage in modelling tabular data. Finally, several generative models and a taxonomy of these models are presented to give readers an overview of the development of this research direction in Section 1.2.
1.1 Neural Networks
A neural network is a computing system that imitates the behaviour of biological neural systems. Due to its ability to generalize data through its parameters, the neural network has become the most powerful means in machine learning for learning on high-dimensional data. Although the topic receives extremely high attention from researchers around the world, understanding neural networks or deep neural networks is still a challenging problem. As a result, research on this topic will remain very fertile until there is a proof of its convergence and of the boundaries of its learnable information.
Basically, a neural network includes several layers, and between two layers there is a set of bipartite nodes. These nodes are connected to each other with a weight value on each connection. The paradigm of a simple neural network is illustrated in Fig. 1.1. There are two main operations on a neural network, which are the feed-forward pass and Stochastic Gradient Descent. The first operation is used to compute the output value of the model, and the second is utilized to optimize the parameters of the model correspondingly. In the next subsections, I describe these two operations more clearly, as well as the reasons for their attractiveness.
Figure 1.1: Neural Network [2]
1.1.1 Feed forward

Consider a linear regression model with parameters w = {w₁, w₂, ..., w_M}:

y(x, w) = ∑_{j=1}^{M} w_j φ_j(x) = wᵀφ(x)  (1.1)

In Eq. 1.1, there are M fixed basis functions φ_j(·) which are utilized for modeling the regression problem. These basis functions are the main architectural difference between traditional linear regression models and modern neural networks; we will revisit this point later in this subsection. The problem aims to achieve the optimal w so that y(x, w) closely matches the target value t, as in Eq. 1.3:

t = y(x, w) + ε  (1.3)
where ε is the distance between y(x, w) and t. By assuming that ε is a random Gaussian variable with density function N(ε | 0, β), i.e., with mean 0 and variance β, Eq. 1.3 leads to modeling each target value t as a random Gaussian variable with density function N(t | y(x_j, w), β). As a result, for t = {t₁, t₂, ..., t_N} corresponding to the set of data instances X = {x₁, x₂, ..., x_N}, we have the density function as in Eq. 1.4. Eq. 1.4 rests on one basic assumption, namely that the x are independent and identically distributed (i.i.d.). From that assumption, we can take the product of the density functions of the t_j to obtain t:

P(t | w, X) = ∏_{j=1}^{N} N(t_j | y(x_j, w), β)  (1.4)
            = ∏_{j=1}^{N} N(t_j | wᵀφ(x_j), β)  (1.5)

Taking the derivative of Eq. 1.4 following the maximum log-likelihood framework, we achieve the optimal solution as in Eq. 1.6, where we denote Φ = Φ(X), the design matrix whose rows are φ(x_j)ᵀ:

w* = (ΦᵀΦ)⁻¹ Φᵀ t  (1.6)
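To make Eqs. 1.1-1.6 concrete, the following minimal NumPy sketch (editor-added; the Gaussian basis functions and the synthetic data are assumptions chosen purely for demonstration) fits the linear-basis-function model in closed form:

```python
import numpy as np

# Synthetic 1-D regression data (assumed for illustration only).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
t = np.sin(2.0 * np.pi * x) + 0.1 * rng.standard_normal(x.shape)

# M fixed Gaussian basis functions phi_j(x) (Eq. 1.1).
M = 9
centers = np.linspace(0.0, 1.0, M)
width = 0.1
Phi = np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2.0 * width ** 2))  # design matrix Phi(X)

# Closed-form maximum-likelihood solution w* = (Phi^T Phi)^-1 Phi^T t (Eq. 1.6).
w_star = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

# Predictions y(x, w*) = Phi w* (Eq. 1.1).
y_pred = Phi @ w_star
print("training MSE:", np.mean((y_pred - t) ** 2))
```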
Coming back to Eq. 1.1, there are M fixed basis functions utilized to model the regression problem. This is an extremely large disadvantage when the model is applied to large datasets in terms of dimensions. The problem is caused by the curse of dimensionality [2] and the need to adapt the choice of M, as well as the types of basis functions, to each type of dataset [21]. In order to solve this problem, the neural network was born to parameterize the basis functions [2]. From the basis function φ_j(x_i), we obtain a new form of parameterized function:

φ_j(x) = h(∑_i w_{ji} x_i + b_j)  (1.8)

The error to be minimized then becomes:

E(w) = ‖y(x, w) − t‖  (1.9)

Combining this equation with Eq. 1.8, we can clearly see that there is no hope of finding an analytical solution for optimizing the network. This is due to the non-linear activation function h(·) of the network, which prevents us from taking the derivative of the parameters inside the activation function. Instead, gradient information is utilized to optimize the neural network. The later Subsection 1.1.2 describes the Stochastic Gradient Descent algorithm, which is the most common algorithm for fine-tuning neural network parameters.
1.1.2 Stochastic Gradient Descent
The Stochastic Gradient Descent (SGD) algorithm is a very common algorithm for optimizing a continuous function. The idea originates from Gradient Descent, which utilizes gradient information to search for solutions close to minimum points. The algorithm is utilized when an optimal closed-form solution cannot be found. The term "stochastic" literally means "random": instead of conducting Gradient Descent on the whole dataset, we take a random small set of data to perform the algorithm at each step. In order to perform SGD, there must be a basic assumption that the dataset is independent and identically distributed (i.i.d.). Without this assumption, the information learned at each iteration of SGD would provide wrong directions for optimization. The detailed steps of Stochastic Gradient Descent are presented in Algorithm 1.
Algorithm 1: Stochastic Gradient Descent
input : Training data D
        learning rate λ
        random parameters θ
output: Model parameters θ*
1 while stopping criterion is not met do
2     sample a random mini-batch B ⊂ D
3     compute the gradient g = ∇_θ L(θ; B)
4     update θ ← θ − λ · g
5 end
6 return θ* = θ

SGD and its variants are widely utilized to train neural networks in both research and industry.
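As a minimal illustration of Algorithm 1 (editor-added; the quadratic loss, toy data, and hyper-parameters are assumptions for demonstration, not from the thesis), the following NumPy sketch runs SGD on a linear regression objective:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy data: t = 3*x + 1 plus noise.
X = rng.standard_normal((1000, 1))
t = 3.0 * X[:, 0] + 1.0 + 0.1 * rng.standard_normal(1000)

theta = rng.standard_normal(2)            # [weight, bias], random initialization
lam, batch_size, n_steps = 0.05, 32, 2000

for _ in range(n_steps):                  # stopping criterion: fixed number of steps
    idx = rng.choice(len(X), size=batch_size, replace=False)  # random mini-batch B
    xb, tb = X[idx, 0], t[idx]
    pred = theta[0] * xb + theta[1]
    err = pred - tb
    grad = np.array([np.mean(err * xb), np.mean(err)])        # gradient of 0.5 * MSE
    theta -= lam * grad                    # SGD update: theta <- theta - lambda * grad

print("estimated [weight, bias]:", theta)  # close to [3.0, 1.0]
```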
1.2 Generative Models
A generative model refers to a type of machine learning model in which the distribution P(X) is modeled and utilized for sampling data points. From the development of generative models, we can categorize them into two types: implicit models and explicit models (Fig. 1.2). The early machine learning methods take a transparent approach to modeling the data distribution by assuming that the data follow certain forms; for example, a dataset is often assumed to be Gaussian. This approach is termed explicit modeling. Most explicit models have, however, a number of limitations in modeling complex distributions; the main problems come from wrong assumptions about the form of the distribution and the number of modes needed to cover the dataset. Later, the Variational Autoencoder (VAE) removed the need for a tractable density function assumption to estimate the density of datasets, yet it faces another issue related to its lower-bound assumption. Readers are encouraged to go through the tutorial at NeurIPS 2016 [4] for further understanding of the problem. Shedding light on these issues, implicit modeling was proposed, where no prior assumption about the dataset is made before learning. This turns out to be very useful in generative modeling, where current implicit models can even generate realistic images. We start the section with Maximum Likelihood to have an overview of density estimation. After that, we move to the Bayesian Network to learn about discrete data distribution modeling. Finally, we come to the introduction of Generative Adversarial Networks, a method to recover datasets implicitly.
1.2.1 Maximum Likelihood
Figure 1.2: Taxonomy of generative models based on Maximum Likelihood [4]

The essence of Maximum Likelihood (ML) is to define a model that outputs a probability distribution of a data point given a set of parameters θ. The dataset likelihood is then given by the estimation on the training data, P(X) = ∏_{i=1}^{m} p_θ(x_i), for a dataset X = {x_i}_{i=1}^{m}. The optimization objective is to choose the parameters of the model that maximize P(X). However, instead of optimizing the product of instance likelihoods, ML is often optimized via ∑_{i=1}^{m} log p_θ(x_i). The reason is that optimizing the product would lead to underflow problems caused by multiplying together a number of extremely small values.
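As a small numeric illustration of why the log is used (editor-added; the Gaussian model and sample size are assumptions), compare the raw product of likelihoods with the sum of log-likelihoods:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=10_000)   # dataset X = {x_i}

theta_mu, theta_sigma = 0.0, 1.0                  # parameters theta of p_theta

# Product of instance likelihoods: prod_i p_theta(x_i) -> underflows to 0.0.
product_likelihood = np.prod(norm.pdf(x, theta_mu, theta_sigma))

# Sum of log-likelihoods: sum_i log p_theta(x_i) -> a stable, finite number.
log_likelihood = np.sum(norm.logpdf(x, theta_mu, theta_sigma))

print(product_likelihood)   # 0.0 due to floating-point underflow
print(log_likelihood)       # a finite negative number, usable for optimization
```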
1.2.2 Bayesian Networks
Suppose that we have a dataset X with a set of attributes as in Fig. 1.3. How do we represent P(X), the joint distribution of the dataset X? The attributes are {Unhealthy food, Typhoid, Flu, Fever, Body ache}. We could count the co-existence of every possible combination of attribute values to build up P(X). However, this approach has a scalability problem: it is impossible to compute when the number of attributes becomes very large. As a result, the Bayesian Network (BN) is utilized to model this distribution. The idea of a BN is to construct a graph that represents the conditional probabilities of the attributes. Fig. 1.4 represents the BN of the dataset X.
Figure 1.3: Attributes in a medical record
We calculate the joint probability of the dataset through the equation P(a, b) = P(a|b) · P(b). In order to scale up to many attributes, we have Eq. 1.13 for a dataset X with attributes D = {X₁, X₂, ..., X_k}:

P(X₁, X₂, ..., X_k) = P(x_k | x₁, x₂, ..., x_{k−1}) ⋯ P(x₂ | x₁) P(x₁)  (1.13)

The example in Fig. 1.5 is used to illustrate a BN. The joint probability of the dataset is calculated as:
Figure 1.4: An example of a Bayesian Network for modeling real data

Figure 1.5: Bayesian Network example from [2]

P(X) = P(X₁) P(X₂) P(X₃) P(X₄|X₁,X₂,X₃) P(X₅|X₁,X₃) P(X₆|X₄) P(X₇|X₄,X₅)  (1.14)
As a result, if we have a BN and the conditional probabilities between pairs of attributes, we can obtain the distribution of the dataset. The questions are: "How do we obtain the BN?" and "How do we assign the conditional probability between pairs of attributes?" The next two subsections answer these questions.
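To make the factorization in Eqs. 1.13-1.14 concrete, the following small Python sketch (editor-added; the network structure and probability tables are invented for illustration) evaluates the joint probability of one configuration from its conditional probability tables:

```python
# Toy Bayesian network over three binary attributes:
# Flu -> Fever and Typhoid -> Fever (structure assumed for illustration).
p_flu = {True: 0.1, False: 0.9}
p_typhoid = {True: 0.05, False: 0.95}
# P(Fever | Flu, Typhoid), indexed by (flu, typhoid).
p_fever = {
    (True, True): 0.95, (True, False): 0.8,
    (False, True): 0.7, (False, False): 0.05,
}

def joint(flu: bool, typhoid: bool, fever: bool) -> float:
    """P(Flu, Typhoid, Fever) = P(Flu) * P(Typhoid) * P(Fever | Flu, Typhoid)."""
    p_f = p_fever[(flu, typhoid)] if fever else 1.0 - p_fever[(flu, typhoid)]
    return p_flu[flu] * p_typhoid[typhoid] * p_f

print(joint(flu=True, typhoid=False, fever=True))   # 0.1 * 0.95 * 0.8 = 0.076
```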
Bayesian Network construction
There are many ways to construct a Bayesian Network. The most straightforward way is to construct the network based on high-level semantic relations between attributes, or on prior knowledge about the attributes. For example, the "age" attribute and the "salary" attribute must be highly correlated with each other, so they could be connected to form one edge of the graph. Based on human knowledge, we could build a graph that carries good information for inference.

However, the above method is not always easy, as people might not have prior knowledge about the network. Thus, we have to rely on some other scalar criteria. There are several common ones, such as Mutual Information [22], the Bayesian information criterion (BIC), the Bayesian score (BD), and the minimum description length (MDL) criterion [23]. Given the research scope of the thesis, the author only reports the Mutual Information [23] approach for constructing a BN in tree form. Readers are encouraged to refer to the full texts of [23, 2] to find out more about other methods.
Chow and Liu [24] provided an approach for approximating a multivariate probability distribution as in Eq. 1.13, which is the basis of the tree-structure learning of BNs. We can follow the skeleton in Algorithm 2 to build such a tree. The algorithm is similar to Kruskal's algorithm [25] for constructing a maximum-weight spanning tree, where the weight of each edge is the Mutual Information between two attributes.

Algorithm 2: Graph Construction
input : Set of attributes D = (X₁, X₂, ..., X_n)
output: Graph G = {V, E}
1 compute the Mutual Information I(X_i; X_j) for every pair of attributes
2 sort all candidate edges (X_i, X_j) by I(X_i; X_j) in descending order
3 for each candidate edge, add it to E if it does not create a cycle
4 stop when E contains n − 1 edges; return G = {V, E}
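A minimal Python sketch of Algorithm 2 (editor-added; the discrete toy data are invented, and the helper relies on scikit-learn and NetworkX rather than anything specific to the thesis):

```python
import networkx as nx
import numpy as np
from sklearn.metrics import mutual_info_score

def chow_liu_tree(data: np.ndarray) -> nx.Graph:
    """Maximum-weight spanning tree whose edge weights are pairwise Mutual Information."""
    n_attrs = data.shape[1]
    g = nx.Graph()
    for i in range(n_attrs):
        for j in range(i + 1, n_attrs):
            w = mutual_info_score(data[:, i], data[:, j])
            g.add_edge(i, j, weight=w)
    # Maximum-weight spanning tree (Kruskal-style construction under the hood).
    return nx.maximum_spanning_tree(g, weight="weight")

# Assumed toy discrete dataset: 4 categorical attributes, 500 rows.
rng = np.random.default_rng(0)
a = rng.integers(0, 3, 500)
data = np.stack([a, (a + rng.integers(0, 2, 500)) % 3,
                 rng.integers(0, 3, 500), rng.integers(0, 2, 500)], axis=1)
tree = chow_liu_tree(data)
print(sorted(tree.edges(data="weight")))
```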
Inference on Bayesian Network
This subsection assumes that we already have a Bayesian Network, and the objective is to do inference on this network. Similar to the previous subsection, the author only briefly describes a simple Belief Propagation scheme for singly connected networks, due to the scope of the thesis.
Given that E is the subset of instantiated variables, the posterior probability of value i of a variable X_j can be obtained via Bayes' rule:

P(X_j^i | E) = P(X_j^i) P(E | X_j^i) / P(E)  (1.15)

As this subsection focuses on the simple case of a BN as a tree, any node divides the network into two separate parts. We call them E⁺ and E⁻: E⁻ is the tree of nodes rooted at X_j^i, and E⁺ is the set of all other nodes. Eq. 1.15 then turns into:

P(X_j^i | E) = P(X_j^i) P(E⁻, E⁺ | X_j^i) / P(E)  (1.16)

As E⁺ and E⁻ are two independent sets, we can rewrite Eq. 1.16 as:

P(X_j^i | E) = α P(X_j^i | E⁺) P(E⁻ | X_j^i)  (1.17)

where α is a normalization constant, α = P(E⁺)/P(E). We define the terms:

λ(X_j^i) = P(E⁻ | X_j^i)  (1.18)
π(X_j^i) = P(X_j^i | E⁺)  (1.19)

Several books [2] might factorize the graph and define λ and π as belief and message, yet in terms of meaning they are the same. As a result, Eq. 1.17 can be written as:

P(X_j^i | E) = α π(X_j^i) λ(X_j^i)  (1.20)

A message sent from node B to its descendant S for a specific value S_k can then be computed from these λ and π terms.
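As a tiny numeric illustration of Eq. 1.20 (editor-added; the three-state variable and the message values are invented), the posterior is just the normalized element-wise product of π and λ:

```python
import numpy as np

# Assumed messages for one three-state variable X_j.
pi_msg = np.array([0.5, 0.3, 0.2])      # pi(X_j) = P(X_j | E+), prior-side evidence
lam_msg = np.array([0.9, 0.05, 0.05])   # lambda(X_j) = P(E- | X_j), likelihood-side evidence

unnormalized = pi_msg * lam_msg                  # P(X_j | E) proportional to pi * lambda (Eq. 1.20)
posterior = unnormalized / unnormalized.sum()    # alpha is simply the normalizer

print(posterior)   # heavily favors the first state
```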
1.2.3 Generative Adversarial Networks
From Fig. 1.2, we can see that the Generative Adversarial Network belongs to the implicit models. The mechanism of GAN is quite straightforward. GAN employs two neural networks: the first is named the Generator, and the second is the Discriminator. As their names suggest, the Generator is responsible for generating data, while the Discriminator aims to distinguish the generated samples from the samples of the original dataset. These two networks compete with each other to perform the generative learning task, where the Generator learns to fool the Discriminator, and the Discriminator tries not to be fooled by the Generator. The paradigm of GAN is illustrated in Fig. 1.6.
Figure 1.6: GAN framework
The Generator is a differentiable continuous function G_θ (θ is the set of parameters of G) whose input is a random noise vector z; the synthetic samples are x̃ = G_θ(z). The Discriminator is also a differentiable continuous function D_φ (φ is the set of parameters of D_φ). It takes as input a set of samples, D_φ(x) or D_φ(x̃). The output of D_φ is normalized into [0, 1] and represents the probability that the sample falls into the real dataset distribution.

In order to optimize θ and φ, we model the optimization problem as a minimax game. The cost function used for optimizing D_φ is:

J_D(θ, φ) = −(1/2) E_{x∼p_data}[log D_φ(x)] − (1/2) E_z[log(1 − D_φ(G_θ(z)))]  (1.23)

Eq. 1.23 is the binary cross entropy for a binary classification problem with two classes: the first class is the real class, and the second is the fake class. The cost function means that D_φ tries to learn to classify samples into these two categories. The optimal φ is:

φ* = argmin_φ J_D(θ, φ)  (1.24)

The Generator G_θ, on the other hand, tries to fool D_φ by maximizing J_D, which means making D unable to differentiate the real and the fake samples:

J_G(θ, φ) = −J_D(θ, φ)
          = (1/2) E_{x∼p_data}[log D_φ(x)] + (1/2) E_z[log(1 − D_φ(G_θ(z)))]  (1.25)
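As a minimal sketch of how Eqs. 1.23-1.25 translate into an alternating training loop (editor-added; the tiny MLP architectures, optimizers, and data are assumptions, not the thesis's models, and the generator uses the common non-saturating variant of Eq. 1.25):

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 16, 2
G = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

def train_step(real_batch: torch.Tensor) -> None:
    b = real_batch.size(0)
    ones, zeros = torch.ones(b, 1), torch.zeros(b, 1)

    # --- Discriminator step: minimize J_D (Eq. 1.23) ---
    z = torch.randn(b, latent_dim)
    fake = G(z).detach()
    loss_d = bce(D(real_batch), ones) + bce(D(fake), zeros)
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # --- Generator step: fool D (non-saturating form of Eq. 1.25) ---
    z = torch.randn(b, latent_dim)
    loss_g = bce(D(G(z)), ones)
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

# Example usage with an assumed 2-D Gaussian "real" batch:
train_step(torch.randn(128, data_dim))
```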
1.3 Mode Collapse

Intuitively, a data distribution can be viewed as a set of modes, where each mode includes one mean and one variance. Mode collapse happens when the number of such distributions is not fully recovered [3].
This is the most difficult problem in modeling data distributions. The reasons for it vary, but intuitively, we can think of the mode-collapse problem as a chicken-and-egg problem: if we want to solve mode collapse, we need to detect mode collapse; detecting mode collapse requires us to know how many modes in the real dataset the synthetic data has already matched; however, the number of modes in the real dataset is exactly what the generative model is trying to figure out. This chicken-and-egg problem is illustrated in Fig. 1.7.
Figure 1.7: Mode collapse paradox
Mode collapse has two types: complete collapse and partial collapse. Partial collapse happens more frequently than complete collapse. This phenomenon can be observed when the synthetic data are natural enough but the diversity is not guaranteed. Fig. 3 illustrates complete mode collapse, which is very easy to detect. Partial mode collapse is very hard to detect due to the unknown number of modes and distributions. For GAN, the problem is more severe due to the utilization of adversarial networks. Because of its relevance to the content of the thesis, the author only focuses on mode collapse in GAN. In the next sections, we explore the reasons for mode collapse and how to measure it.
1.3.1 Reasons

• Data perspective:

– When there is an imbalance in the dataset. For example, take the MNIST dataset with only two classes, "0" and "1", in which there are 1000 images containing the digit "1" and only one image containing the digit "0". The loss incurred by not producing the digit "0" is very low. As a result, the model will neglect the "0" images.

• Optimization perspective:

– The objective of the generator G is to fool the Discriminator D; the gradient-based objective of G is:

∇_{θ_g} (1/m) ∑_{i=1}^{m} log(1 − D(G(z_i)))  (1.29)

When the Discriminator D is weakly trained and mostly not updated for some time, the optimal generated point x* will not depend on z:

x* = G(z) = argmax_x D(x)  (1.30)
1.3.2 Measurement
While there is no way to measure the exact number of modes that disappear, the research community has provided a number of methods to quantitatively observe mode collapse, namely the IS [26] and the FID [27].
Inception Score: The Inception Score (IS) is one of the most popular methods to measure the quality of images. The main idea of the IS is to compare the label distribution with the predicted output distribution of the synthetic dataset, given a pretrained Inception classification model, through the KL divergence. The score is calculated as:

KL(p(ŷ|G(z)) ‖ p(y)) = ∑_ŷ p(ŷ|G(z)) (log p(ŷ|G(z)) − log p(y))  (1.31)

Frechet Inception Distance: The Frechet Inception Distance (FID) is currently the most popular method to measure the quality and the diversity of images. Unlike the IS, which measures the distribution distance between the conditional predicted probability and a set of given labels, the FID measures the difference in distributions between the real and synthetic datasets.
For the Frechet distance between two multivariate normal distributions, we have:

d²(X, Y) = ‖µ_X − µ_Y‖² + Tr(Σ_X + Σ_Y − 2(Σ_X Σ_Y)^{1/2})  (1.32)

where X and Y are two sets of samples from the two distributions, µ_X and µ_Y are the means of X and Y respectively, and Σ_X, Σ_Y are the covariance matrices of X and Y.

In order to calculate the Frechet Inception Distance, we take the embedding vectors obtained by passing the datasets through a pretrained Inception classification model. The embedding vectors of the real dataset form X, and the embedding vectors of the synthetic dataset form Y.
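A minimal NumPy/SciPy sketch of Eq. 1.32 (editor-added; in practice X and Y would be Inception embeddings, while here they are random matrices used only to exercise the formula):

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(x: np.ndarray, y: np.ndarray) -> float:
    """Eq. 1.32 on two sets of embedding vectors (rows = samples)."""
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    cov_x = np.cov(x, rowvar=False)
    cov_y = np.cov(y, rowvar=False)
    covmean = sqrtm(cov_x @ cov_y)        # matrix square root of Sigma_X Sigma_Y
    covmean = covmean.real                # discard tiny imaginary parts from numerics
    diff = mu_x - mu_y
    return float(diff @ diff + np.trace(cov_x + cov_y - 2.0 * covmean))

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(500, 64))  # stand-in for real-image embeddings
y = rng.normal(0.5, 1.2, size=(500, 64))  # stand-in for synthetic-image embeddings
print(frechet_distance(x, y))
```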
Chapter 2

MODE COLLAPSE ON TABULAR DATA MODELLING
Tabular data is one of the most important types of information in the world. However, learning with tabular data is a very challenging task. This chapter is concerned with distribution modeling of tabular data and tabular data generation.
2.1 Tabular data and challenges in tabular data modelling
The table has long been one of the most common forms of structured data, where each column represents one characteristic or one attribute of the data. Due to its flexibility, which does not depend on any constrained continuous range, tabular data has been utilized to model most of the data we have on earth, from medical records and bank statements to research data. This is why tabular data research is of paramount importance. Before trying to understand tabular data, we first go through several of its characteristics:
• Scarcity: Although it is the most popular scheme for modeling data, people find it very difficult to get access to tabular datasets. The main reason is that every instance of tabular data contains confidential information, which might directly expose privacy problems if made public.

• Categorical and continuous attributes mix-up: In tabular data, not all attributes are continuous, and this causes many challenges for current Machine Learning approaches.

• Transfer learning unavailability: On image datasets, the scarcity of data can be mitigated by transfer learning. For example, a model pre-trained on IMAGENET [20] can be utilized to train on other datasets for different tasks, such as MS-COCO and Pascal VOC for object detection. However, in tabular form, due to differences in data structure and contents, transferring from one dataset to another is a challenging problem.

• High cardinality: The categorical columns of tabular data are represented in the form of one-hot vectors, which is the main cause of high cardinality, i.e., high-dimensional vectors when the number of categories in a column is extremely large. For example, in order to encode one million house addresses in a city, a categorical one-hot vector is utilized, resulting in a one-million-dimensional one-hot vector. This is quite problematic due to over-fitting and high-dimension learning (a brief illustration is sketched below).

Due to these characteristics of tabular data, modeling this type of information is very important. It would help to offer an alternative dataset that hides private information instead of the original dataset. The synthetic data could be used for other research without concern about leaking sensitive information. Of course, in order to completely protect confidential information, we need another line of work called Differential Privacy [28], yet modelling the data distribution is the very first step before looking into noise-adding mechanisms as in that work. Hence, in this chapter of the thesis, the author only focuses on the modeling of tabular data. By modeling the data distribution correctly, the noise-adding process can be done more easily while preserving as much information as possible.
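As a brief, editor-added illustration of the high-cardinality issue mentioned in the list above (the column values are invented), one-hot encoding a single categorical column creates one dimension per distinct value:

```python
import pandas as pd

# Assumed toy column: a categorical attribute with 1000 distinct values.
df = pd.DataFrame({"address": [f"street_{i % 1000}" for i in range(5000)]})

one_hot = pd.get_dummies(df["address"])   # one binary column per distinct category
print(one_hot.shape)                      # (5000, 1000): dimensionality equals the cardinality
```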
Similar to other learning tasks, tabular modeling can be done via two approaches: traditional machine learning and deep learning algorithms. The most common method for modeling tabular data is PrivBayes [29], which shows many limitations in scaling up. The emerging deep learning methods seem to be the solution. CTGAN [30] appears to achieve the state of the art in modeling, yet its ability to learn the dependency between discrete features is limited. Furthermore, CTGAN also faces several problems related to high-dimensional learning when the number of categories inside one column increases drastically; we present our justifications in the next sections. This issue differs from the PrivBayes method [29], where the increase in the number of columns causes the scalability problem. In this chapter, I point out the advantages and disadvantages of these methods and propose solutions that combine the strengths of each method to model tabular data. The rest of the chapter is presented as follows: Section 2.2 details the technical background related to PrivBayes [29] and CTGAN [30]; Section 2.3 considers the problem of high-dimension learning of CTGAN, and Section 2.4 shows the justification for the structure learning problem; in Section 2.5, I describe the proposed solutions to each of the problems; the results obtained with these proposed methods are presented in Section 2.6, and the last section, 2.7, discusses several future works on this problem.
2.2 Related Work
From the previous chapter, we know that the Bayesian network is often used to model tabular data as a traditional machine learning method. Although for small datasets the Bayesian network works very well, thanks to the small scale of the approximate graph, it faces serious problems when the number of columns increases. The junction tree creation is the main problem, due to the approximation from the original graph to a triangulated graph or to variable elimination. This process, with its actions of adding and removing information in and out of the data, causes a loss of original information. The most noticeable work is PrivBayes [29], where the authors proposed a method to add noise into the network to protect privacy following Differential Privacy [31]. The later work [32] improves PrivBayes