Lecture Introduction to Machine Learning and Data Mining: Lesson 9.2

Lecture Introduction to Machine Learning and Data Mining: Lesson 9.2. This lesson provides students with content about: probabilistic modeling; expectation maximization; intractable inference; the Baum-Welch algorithm; and more. Please refer to the detailed content of the lecture.

Introduction to Machine Learning and Data Mining (Học máy và Khai phá dữ liệu)
Khoat Than
School of Information and Communication Technology
Hanoi University of Science and Technology, 2022

Content
- Introduction to Machine Learning & Data Mining
- Unsupervised learning
- Supervised learning
- Probabilistic modeling
  - Expectation maximization
- Practical advice

Difficult situations
- No closed-form solution for the learning/inference problem?
  - The examples seen before are easy cases, as we can find solutions in closed form or by using gradients.
  - Many models (e.g., GMM) do not admit a closed-form solution.
- No explicit expression of the density/mass function?
- Intractable inference
  - Inference in many probabilistic models is NP-hard [Sontag & Roy, 2011; Tosh & Dasgupta, 2019].

Expectation maximization: the EM algorithm

GMM revisited
- Consider learning a GMM, with $K$ Gaussian distributions, from the training data $D = \{\boldsymbol{x}_1, \boldsymbol{x}_2, \dots, \boldsymbol{x}_M\}$. The density function is
  $p(\boldsymbol{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\phi}) = \sum_{k=1}^{K} \phi_k \, \mathcal{N}(\boldsymbol{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$
  - $\boldsymbol{\phi} = (\phi_1, \dots, \phi_K)$ represents the weights of the Gaussians: $P(z = k \mid \boldsymbol{\phi}) = \phi_k$.
  - Each multivariate Gaussian has density
    $\mathcal{N}(\boldsymbol{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) = \frac{1}{\sqrt{\det(2\pi\boldsymbol{\Sigma}_k)}} \exp\left(-\tfrac{1}{2}(\boldsymbol{x}-\boldsymbol{\mu}_k)^\top \boldsymbol{\Sigma}_k^{-1}(\boldsymbol{x}-\boldsymbol{\mu}_k)\right)$
- MLE tries to maximize the following log-likelihood function:
  $L(\boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\phi}) = \sum_{i=1}^{M} \log \sum_{k=1}^{K} \phi_k \, \mathcal{N}(\boldsymbol{x}_i \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$
- We cannot find a closed-form solution!
- Naïve gradient descent: repeat until convergence
  - Optimize $L(\boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\phi})$ w.r.t. $\boldsymbol{\phi}$ while fixing $(\boldsymbol{\mu}, \boldsymbol{\Sigma})$.
  - Optimize $L(\boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\phi})$ w.r.t. $(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ while fixing $\boldsymbol{\phi}$.
  - Still hard.

GMM revisited: K-means
- GMM: we need to know
  - Among the $K$ Gaussian components, which one generates an instance $\boldsymbol{x}$? That is, the index $z$ of the Gaussian component, $P(z \mid \boldsymbol{x}, \boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\phi})$, with $\sum_{k=1}^{K} P(z = k \mid \boldsymbol{x}, \boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\phi}) = 1$ (soft assignment).
  - The parameters of the individual Gaussian components: $\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k, \phi_k$.
- K-means:
  - Among the $K$ clusters, to which one does an instance $\boldsymbol{x}$ belong? That is, the cluster index $z$ (hard assignment).
  - The parameters of the individual clusters: the means.
- K-means training:
  - Step 1: assign each instance $\boldsymbol{x}$ to the nearest cluster (the cluster index $z$ for each $\boldsymbol{x}$).
  - Step 2: recompute the means of the clusters.
- Idea for GMM?
  - Step 1: compute the assignment probabilities $P(z \mid \boldsymbol{x}, \boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\phi})$.
  - Step 2: update the parameters of the individual Gaussians: $\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k, \phi_k$.

GMM: lower bound
- Idea for GMM?
  - Step 1: compute $P(z \mid \boldsymbol{x}, \boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\phi})$ (note that $\sum_{k=1}^{K} P(z = k \mid \boldsymbol{x}, \boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\phi}) = 1$).
  - Step 2: update the parameters of the Gaussian components, $\boldsymbol{\theta} = (\boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\phi})$.
- Consider the log-likelihood function
  $L(\boldsymbol{\theta}) = \log P(D \mid \boldsymbol{\theta}) = \sum_{i=1}^{M} \log \sum_{k=1}^{K} \phi_k \, \mathcal{N}(\boldsymbol{x}_i \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$
  - Too complex if directly using gradients.
  - Note that $\log P(\boldsymbol{x} \mid \boldsymbol{\theta}) = \log P(\boldsymbol{x}, z \mid \boldsymbol{\theta}) - \log P(z \mid \boldsymbol{x}, \boldsymbol{\theta})$. Therefore
    $\log P(\boldsymbol{x} \mid \boldsymbol{\theta}) = \mathbb{E}_{z \mid \boldsymbol{x}, \boldsymbol{\theta}} \log P(\boldsymbol{x}, z \mid \boldsymbol{\theta}) - \mathbb{E}_{z \mid \boldsymbol{x}, \boldsymbol{\theta}} \log P(z \mid \boldsymbol{x}, \boldsymbol{\theta}) \ge \mathbb{E}_{z \mid \boldsymbol{x}, \boldsymbol{\theta}} \log P(\boldsymbol{x}, z \mid \boldsymbol{\theta})$,
    since the dropped term $-\mathbb{E}_{z \mid \boldsymbol{x}, \boldsymbol{\theta}} \log P(z \mid \boldsymbol{x}, \boldsymbol{\theta})$ is the entropy of the posterior and hence non-negative.
- Maximizing $L(\boldsymbol{\theta})$ can be done by maximizing the lower bound
  $LB(\boldsymbol{\theta}) = \sum_{\boldsymbol{x} \in D} \mathbb{E}_{z \mid \boldsymbol{x}, \boldsymbol{\theta}} \log P(\boldsymbol{x}, z \mid \boldsymbol{\theta}) = \sum_{\boldsymbol{x} \in D} \sum_{z} P(z \mid \boldsymbol{x}, \boldsymbol{\theta}) \log P(\boldsymbol{x}, z \mid \boldsymbol{\theta})$

GMM: maximize the lower bound
- Step 1: compute $P(z \mid \boldsymbol{x}, \boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\phi})$; Step 2: update the parameters of the Gaussian components, $\boldsymbol{\theta} = (\boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\phi})$.
- Bayes' rule: $P(z \mid \boldsymbol{x}, \boldsymbol{\theta}) = P(\boldsymbol{x} \mid z, \boldsymbol{\theta}) P(z \mid \boldsymbol{\phi}) / P(\boldsymbol{x}) = \phi_z \, \mathcal{N}(\boldsymbol{x} \mid \boldsymbol{\mu}_z, \boldsymbol{\Sigma}_z) / C$, where $C = \sum_k \phi_k \, \mathcal{N}(\boldsymbol{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$ is the normalizing constant.
  - Meaning that one can compute $P(z \mid \boldsymbol{x}, \boldsymbol{\theta})$ if $\boldsymbol{\theta}$ is known.
  - Denote $T_{ik} = P(z = k \mid \boldsymbol{x}_i, \boldsymbol{\theta})$ for any indices $k = 1, \dots, K$ and $i = 1, \dots, M$.
- How about $\boldsymbol{\phi}$?
  - $\phi_k = P(z = k \mid \boldsymbol{\phi}) = P(z = k \mid \boldsymbol{\theta}) = \int P(z = k, \boldsymbol{x} \mid \boldsymbol{\theta}) \, d\boldsymbol{x} = \int P(z = k \mid \boldsymbol{x}, \boldsymbol{\theta}) P(\boldsymbol{x} \mid \boldsymbol{\theta}) \, d\boldsymbol{x} = \mathbb{E}_{\boldsymbol{x}} P(z = k \mid \boldsymbol{x}, \boldsymbol{\theta}) \approx \frac{1}{M} \sum_{\boldsymbol{x} \in D} P(z = k \mid \boldsymbol{x}, \boldsymbol{\theta}) = \frac{1}{M} \sum_{i=1}^{M} T_{ik}$
- Then the lower bound can be maximized w.r.t. the individual $(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$:
  $LB(\boldsymbol{\theta}) = \sum_{\boldsymbol{x} \in D} \sum_{z} P(z \mid \boldsymbol{x}, \boldsymbol{\theta}) \log \left[ P(\boldsymbol{x} \mid z, \boldsymbol{\theta}) P(z \mid \boldsymbol{\theta}) \right] = \sum_{i=1}^{M} \sum_{k=1}^{K} T_{ik} \left[ -\tfrac{1}{2} (\boldsymbol{x}_i - \boldsymbol{\mu}_k)^\top \boldsymbol{\Sigma}_k^{-1} (\boldsymbol{x}_i - \boldsymbol{\mu}_k) - \tfrac{1}{2} \log \det(2\pi\boldsymbol{\Sigma}_k) \right] + \text{constant}$

GMM: EM algorithm
- Input: training data $D = \{\boldsymbol{x}_1, \boldsymbol{x}_2, \dots, \boldsymbol{x}_M\}$ and the number of components $K$.
- Output: model parameters $\boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\phi}$.
- Initialize $\boldsymbol{\mu}^{(0)}, \boldsymbol{\Sigma}^{(0)}, \boldsymbol{\phi}^{(0)}$ randomly.
  - $\boldsymbol{\phi}^{(0)}$ must be non-negative and sum to 1.
- At iteration $t$:
  - E step: compute $T_{ik}^{(t)} = P(z = k \mid \boldsymbol{x}_i, \boldsymbol{\theta}^{(t)}) = \phi_k^{(t)} \mathcal{N}(\boldsymbol{x}_i \mid \boldsymbol{\mu}_k^{(t)}, \boldsymbol{\Sigma}_k^{(t)}) / C$ for any indices $k = 1, \dots, K$ and $i = 1, \dots, M$.
  - M step: update, for every $k$,
    $\phi_k^{(t+1)} = \frac{a_k}{M}$, where $a_k = \sum_{i=1}^{M} T_{ik}$;
    $\boldsymbol{\mu}_k^{(t+1)} = \frac{1}{a_k} \sum_{i=1}^{M} T_{ik} \boldsymbol{x}_i$;
    $\boldsymbol{\Sigma}_k^{(t+1)} = \frac{1}{a_k} \sum_{i=1}^{M} T_{ik} (\boldsymbol{x}_i - \boldsymbol{\mu}_k^{(t+1)}) (\boldsymbol{x}_i - \boldsymbol{\mu}_k^{(t+1)})^\top$
- If not converged, go to iteration $t + 1$.
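The E and M updates above translate almost line by line into code. Below is a minimal NumPy sketch for illustration; the names T, phi, mu, Sigma, a mirror the notation on this slide, while the uniform weight initialization, the log-likelihood stopping test, and the small ridge added to each covariance for numerical stability are extra choices of this sketch, not part of the lecture.

    import numpy as np
    from scipy.stats import multivariate_normal

    def em_gmm(X, K, n_iters=100, tol=1e-6, seed=0):
        """EM for a Gaussian mixture, following the E/M updates above.
        X: (M, d) data matrix; K: number of components."""
        rng = np.random.default_rng(seed)
        M, d = X.shape
        phi = np.full(K, 1.0 / K)                               # uniform weights (slides: random)
        mu = X[rng.choice(M, size=K, replace=False)]            # K random data points as initial means
        Sigma = np.stack([np.cov(X.T) + 1e-6 * np.eye(d)] * K)  # shared initial covariance
        prev_ll = -np.inf
        for _ in range(n_iters):
            # E step: responsibilities T[i, k] = P(z = k | x_i, theta)
            dens = np.column_stack([multivariate_normal.pdf(X, mu[k], Sigma[k]) for k in range(K)])
            weighted = dens * phi                   # phi_k * N(x_i | mu_k, Sigma_k)
            C = weighted.sum(axis=1, keepdims=True)
            T = weighted / C
            # M step: update phi, mu, Sigma for each component
            a = T.sum(axis=0)                       # a_k = sum_i T[i, k]
            phi = a / M
            mu = (T.T @ X) / a[:, None]
            for k in range(K):
                diff = X - mu[k]
                Sigma[k] = (T[:, k, None] * diff).T @ diff / a[k] + 1e-6 * np.eye(d)
            # Stop when the log-likelihood barely improves (an extra detail, not on the slide)
            ll = np.log(C).sum()
            if ll - prev_ll < tol:
                break
            prev_ll = ll
        return phi, mu, Sigma, T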
GMM: example
- We wish to model the height of a person.
  - We collected a dataset from 10 people in Hanoi plus 10 people in Sydney:
    D = {1.6, 1.7, 1.65, 1.63, 1.75, 1.71, 1.68, 1.72, 1.77, 1.62, 1.75, 1.80, 1.85, 1.65, 1.91, 1.78, 1.88, 1.79, 1.82, 1.81}
- (Figures: the densities fitted by GMMs with different numbers of components.)

GMM: example
- A GMM is fitted to a 2-dimensional dataset for clustering.
- (Figure: the fitted mixture, from initialization to convergence; source: https://en.wikipedia.org/wiki/Expectation-maximization_algorithm)

GMM: comparison with K-means
- K-means:
  - Step 1: hard assignment
  - Step 2: the means
  - → similar shape for the clusters
- GMM clustering:
  - Soft assignment of data to the clusters
  - Parameters $\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k, \phi_k$
  - → different shapes for the clusters
- (Figure source: https://en.wikipedia.org/wiki/Expectation-maximization_algorithm)
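As an aside, the one-dimensional height example above can be reproduced with an off-the-shelf EM implementation. The sketch below uses scikit-learn's GaussianMixture purely as an illustration; the lecture itself does not prescribe any particular library, and the choice of two components simply reflects the two populations in the example.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    # Height data from the example above (10 people in Hanoi + 10 in Sydney)
    D = np.array([1.60, 1.70, 1.65, 1.63, 1.75, 1.71, 1.68, 1.72, 1.77, 1.62,
                  1.75, 1.80, 1.85, 1.65, 1.91, 1.78, 1.88, 1.79, 1.82, 1.81]).reshape(-1, 1)

    # Fit a 2-component GMM; GaussianMixture uses EM internally
    gmm = GaussianMixture(n_components=2, random_state=0).fit(D)
    print(gmm.weights_)              # mixing weights phi_k
    print(gmm.means_.ravel())        # component means mu_k
    print(gmm.predict_proba(D[:3]))  # soft assignments P(z = k | x)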
General models
- We can apply the EM algorithm to more general cases.
- Consider a model $B(\boldsymbol{x}, \boldsymbol{z}; \boldsymbol{\theta})$ with observed variable $\boldsymbol{x}$, hidden variable $\boldsymbol{z}$, parameterized by $\boldsymbol{\theta}$.
  - $\boldsymbol{x}$ depends on $\boldsymbol{z}$ and $\boldsymbol{\theta}$, while $\boldsymbol{z}$ may depend on $\boldsymbol{\theta}$.
  - Mixture models: each observed data point has a corresponding latent variable, specifying the mixture component which generated that data point.
- The learning task is to find a specific model, from the model family parameterized by $\boldsymbol{\theta}$, that maximizes the log-likelihood of the training data $D$:
  $\boldsymbol{\theta}^* = \arg\max_{\boldsymbol{\theta}} \log P(D \mid \boldsymbol{\theta})$
- We assume $D$ consists of i.i.d. samples of $\boldsymbol{x}$, so that the log-likelihood function can be expressed analytically and $LB(\boldsymbol{\theta}) = \sum_{\boldsymbol{x} \in D} \mathbb{E}_{z \mid \boldsymbol{x}, \boldsymbol{\theta}} \log P(\boldsymbol{x}, z \mid \boldsymbol{\theta})$ can be computed easily.
  - Since there is a latent variable, MLE may not have a closed-form solution.

The Expectation Maximization algorithm
- The Expectation Maximization (EM) algorithm was introduced in 1977 by Arthur Dempster, Nan Laird, and Donald Rubin.
- The EM algorithm maximizes the lower bound of the log-likelihood:
  $L(\boldsymbol{\theta}; D) = \log P(D \mid \boldsymbol{\theta}) \ge LB(\boldsymbol{\theta}) = \sum_{\boldsymbol{x} \in D} \mathbb{E}_{z \mid \boldsymbol{x}, \boldsymbol{\theta}} \log P(\boldsymbol{x}, z \mid \boldsymbol{\theta})$
- Initialization: $\boldsymbol{\theta}^{(0)}$, $t = 0$.
- At iteration $t$:
  - E step: compute the expectation $Q(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)}) = \sum_{\boldsymbol{x} \in D} \mathbb{E}_{z \mid \boldsymbol{x}, \boldsymbol{\theta}^{(t)}} \log P(\boldsymbol{x}, z \mid \boldsymbol{\theta})$, i.e., the lower bound with the posterior of $z$ computed at the value $\boldsymbol{\theta}^{(t)}$ known from the previous step.
  - M step: find $\boldsymbol{\theta}^{(t+1)} = \arg\max_{\boldsymbol{\theta}} Q(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)})$, the point at which $Q$ attains its maximum.
- If not converged, go to iteration $t + 1$.

EM: convergence condition
- Different conditions can be used to check convergence:
  - The parameters $\boldsymbol{\theta}$ do not change much between two consecutive iterations.
  - The log-likelihood (or its lower bound) does not change much between two consecutive iterations.
- In practice, we sometimes need to limit the maximum number of iterations.
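A schematic of this general EM loop, including the convergence tests just mentioned, might look as follows. The e_step, m_step, and log_likelihood callables are hypothetical placeholders that a concrete model would have to supply, not an interface defined in the lecture, and the sketch assumes the parameters are packed into a flat vector.

    import numpy as np

    def em(theta0, e_step, m_step, log_likelihood, max_iters=200, tol=1e-6):
        """Generic EM loop (a sketch under stated assumptions).
        e_step(theta)          -> posterior statistics defining Q(. | theta)
        m_step(stats)          -> argmax of Q given those statistics
        log_likelihood(theta)  -> log P(D | theta), used for the convergence test"""
        theta = np.asarray(theta0, dtype=float)
        prev_ll = -np.inf
        for t in range(max_iters):
            stats = e_step(theta)                               # E step: expectations w.r.t. P(z | x, theta^(t))
            new_theta = np.asarray(m_step(stats), dtype=float)  # M step: theta^(t+1) = argmax_theta Q(theta | theta^(t))
            ll = log_likelihood(new_theta)
            # Stop when the parameters or the log-likelihood barely change
            if np.max(np.abs(new_theta - theta)) < tol or abs(ll - prev_ll) < tol:
                return new_theta
            theta, prev_ll = new_theta, ll
        return theta  # the iteration cap also bounds the runtime in practice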
EM: some properties
- The EM algorithm is guaranteed to converge to a stationary point of the lower bound $LB(\boldsymbol{\theta})$.
  - That point may only be a local maximum.
- Because it maximizes a lower bound, EM does not necessarily return a maximizer of the log-likelihood function.
  - No such guarantee exists.
  - This can be seen in multimodal cases, where the log-likelihood function is non-concave.
- The Baum-Welch algorithm is a special case of EM for hidden Markov models.
- (Figure: a multimodal distribution.)

EM, mixture models, and clustering
- Mixture model: we assume the data population is composed of $K$ different components (distributions), and each data point is generated from one of those components.
  - E.g., Gaussian mixture model, categorical mixture model, Bernoulli mixture model, ...
  - The mixture density function can be written as
    $f(\boldsymbol{x}; \boldsymbol{\theta}, \boldsymbol{\phi}) = \sum_{k=1}^{K} \phi_k f_k(\boldsymbol{x} \mid \boldsymbol{\theta}_k)$,
    where $f_k(\boldsymbol{x} \mid \boldsymbol{\theta}_k)$ is the density of the k-th component.
- We can interpret a mixture distribution as partitioning the data space into different regions, each associated with one component of the mixture.
- Hence, mixture models provide solutions for clustering.
- The EM algorithm provides a natural way to learn mixture models.

EM: limitations
- The lower bound $LB(\boldsymbol{\theta})$ may not admit easy computation of the expectation or maximization steps, e.g., for:
  - Admixture models, Bayesian mixture models
  - Hierarchical probabilistic models
  - Nonparametric models
- EM finds a point estimate, hence it easily gets stuck at a local maximum.
- In practice, EM is sensitive to initialization.
  - Is it a good idea to use K-means++-style initialization?
- Sometimes EM converges slowly in practice.

Further?
- Variational inference
  - Inference for more general models
- Deep generative models
  - Neural networks + probability theory
- Bayesian neural networks
  - Neural networks + Bayesian inference
- Amortized inference
  - Neural networks for doing Bayesian inference
  - Learning to infer

References
- Blei, David M., Alp Kucukelbir, and Jon D. McAuliffe. "Variational inference: A review for statisticians." Journal of the American Statistical Association 112, no. 518 (2017): 859-877.
- Blundell, Charles, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. "Weight Uncertainty in Neural Network." In International Conference on Machine Learning (ICML), pp. 1613-1622, 2015.
- Dempster, A.P., Laird, N.M., and Rubin, D.B. "Maximum Likelihood from Incomplete Data via the EM Algorithm." Journal of the Royal Statistical Society, Series B 39, no. 1 (1977): 1-38.
- Gal, Yarin, and Zoubin Ghahramani. "Dropout as a Bayesian approximation: Representing model uncertainty in deep learning." In ICML, pp. 1050-1059, 2016.
- Ghahramani, Zoubin. "Probabilistic machine learning and artificial intelligence." Nature 521, no. 7553 (2015): 452-459.
- Kingma, Diederik P., and Max Welling. "Auto-encoding variational Bayes." In International Conference on Learning Representations (ICLR), 2014.
- Jordan, Michael I., and Tom M. Mitchell. "Machine learning: Trends, perspectives, and prospects." Science 349, no. 6245 (2015): 255-260.
- Tosh, Christopher, and Sanjoy Dasgupta. "The Relative Complexity of Maximum Likelihood Estimation, MAP Estimation, and Sampling." In COLT, PMLR 99:2993-3035, 2019.
- Sontag, David, and Daniel Roy. "Complexity of inference in Latent Dirichlet Allocation." In Advances in Neural Information Processing Systems, 2011.
