Lecture Introduction to Machine Learning and Data Mining: Lesson 6





Lecture Introduction to Machine Learning and Data Mining: Lesson 6. This lesson provides students with content about: supervised learning; support vector machines; the linear separability assumption; the separating hyperplane; the max-margin hyperplane;... Please refer to the detailed content of the lecture!

Introduction to Machine Learning and Data Mining (Học máy và Khai phá dữ liệu)
Khoat Than
School of Information and Communication Technology
Hanoi University of Science and Technology, 2021

Contents
- Introduction to Machine Learning & Data Mining
- Unsupervised learning
- Supervised learning
  - Support Vector Machines
- Practical advice

Support Vector Machines (1)
- Support Vector Machines (SVM) were proposed by Vapnik and his colleagues in the 1970s, and became famous and popular in the 1990s.
- Originally, SVM is a method for linear classification: it finds a hyperplane (also called a linear classifier) that separates the two classes of data.
- For non-linear classification, where no hyperplane separates the data well, kernel functions are used.
  - Kernel functions transform the data into another space, in which the data is linearly separable.
- We sometimes say "linear SVM" when no kernel function is used (in fact, linear SVM uses a linear kernel).

Support Vector Machines (2)
- SVM has a strong theory that supports its performance.
- It can work well with very high-dimensional problems.
- It is now one of the most popular and strongest methods.
- For text categorization, linear SVM performs very well.

SVM: the linearly separable case
- Problem representation:
  - Training data D = {(x1, y1), (x2, y2), ..., (xr, yr)} with r instances.
  - Each xi is a vector in an n-dimensional space, e.g., xi = (xi1, xi2, ..., xin)^T. Each dimension represents an attribute.
  - Bold characters denote vectors.
  - yi is a class label in {-1, 1}: '1' is the positive class, '-1' is the negative class.
- Linear separability assumption: there exists a hyperplane (of linear form) that separates the two classes well.

Linear SVM
- SVM finds a hyperplane of the form:
  f(x) = ⟨w · x⟩ + b   [Eq.1]
  - w is the weight vector; b is a real number (bias).
  - ⟨w · x⟩ and ⟨w, x⟩ denote the inner product of two vectors.
- Such that for each xi:
  yi = 1 if ⟨w · xi⟩ + b ≥ 0;  yi = -1 if ⟨w · xi⟩ + b < 0   [Eq.2]

Separating hyperplane
- The hyperplane (H0) which separates the positive class from the negative class has the form ⟨w · x⟩ + b = 0.
- It is also known as the decision boundary/surface.
- But there might be infinitely many separating hyperplanes. Which one should we choose? [Liu, 2006]

Hyperplane with max margin
- SVM selects the hyperplane with the maximum margin.
- It is proven that the max-margin hyperplane has minimal errors among all possible hyperplanes. [Liu, 2006]

Marginal hyperplanes
- Assume that the two classes in our data can be separated clearly by a hyperplane.
- Denote by (x+, 1) the instance in the positive class and by (x-, -1) the instance in the negative class that are closest to the separating hyperplane H0 (⟨w · x⟩ + b = 0).
- We define two parallel marginal hyperplanes as follows:
  - H+ crosses x+ and is parallel to H0: ⟨w · x+⟩ + b = 1.
  - H- crosses x- and is parallel to H0: ⟨w · x-⟩ + b = -1.
  - No data point lies between these two marginal hyperplanes, i.e., every xi satisfies:
    ⟨w · xi⟩ + b ≥ 1, if yi = 1
    ⟨w · xi⟩ + b ≤ -1, if yi = -1   [Eq.3]

The margin (1)
- The margin is defined as the distance between the two marginal hyperplanes.
  - Denote by d+ the distance from H0 to H+.
  - Denote by d- the distance from H0 to H-.
  - (d+ + d-) is the margin.
- Remember that the distance from a point xi to the hyperplane H0 (⟨w · x⟩ + b = 0) is computed as
  |⟨w · xi⟩ + b| / ‖w‖   [Eq.4]
  where ‖w‖ = √⟨w · w⟩ = √(w1² + w2² + ... + wn²)   [Eq.5]
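As a quick numerical illustration of [Eq.4] and [Eq.5], here is a minimal NumPy sketch; the weight vector w, the bias b, and the two points are hypothetical values chosen only for illustration. It also prints 2/‖w‖, which is the margin obtained when x+ and x- lie exactly on H+ and H-.

```python
import numpy as np

# Hypothetical weight vector and bias, chosen only for illustration.
w = np.array([2.0, 1.0])
b = -3.0

points = np.array([
    [3.0, 1.0],   # lies on the positive side of H0
    [0.0, 1.0],   # lies on the negative side of H0
])

# Distance from each point to H0, following [Eq.4]: |<w . x> + b| / ||w||
norm_w = np.linalg.norm(w)          # ||w|| = sqrt(w1^2 + ... + wn^2), as in [Eq.5]
print(np.abs(points @ w + b) / norm_w)

# Points lying exactly on H+ or H- satisfy |<w . x> + b| = 1,
# so d+ = d- = 1/||w|| and the margin is 2/||w||.
print("margin width:", 2.0 / norm_w)
```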
Soft-margin SVM: classifying new instances
- The decision boundary is
  f(x) = ⟨w* · x⟩ + b* = Σ_{xi ∈ SV} αi yi ⟨xi · x⟩ + b*   [Eq.19]
- For a new instance z, we compute
  sign(⟨w* · z⟩ + b*) = sign(Σ_{xi ∈ SV} αi yi ⟨xi · z⟩ + b*)   [Eq.20]
  - If the result is 1, z is assigned to the positive class; otherwise z is assigned to the negative class.
- Note: it is important to choose a good value of C, since it significantly affects the performance of SVM.
  - We often use a validation set to choose a value for C.

Linear SVM: summary
- Classification is based on a separating hyperplane.
- Such a hyperplane is represented as a combination of some support vectors.
- Determining the support vectors reduces to solving a quadratic programming problem.
- In the dual problem and in the separating hyperplane, dot products can be used in place of the original training data.
  - This is the door for us to learn a non-linear classifier.

Non-linear SVM
- Consider the case in which our data are not linearly separable.
  - This may often happen in practice.
- How about using a non-linear function?
- Idea of non-linear SVM:
  - Step 1: transform the input into another space, which often has higher dimensions, so that the projection of the data is linearly separable.
  - Step 2: use linear SVM in the new space.
- Input space: the initial representation of the data.
- Feature space: the new space after the transformation φ(x).

Non-linear SVM: transformation
- The idea is to map the input x to a new representation, using a non-linear mapping φ: X → F, x ↦ φ(x).
- In the feature space, the original training data {(x1, y1), (x2, y2), ..., (xr, yr)} are represented by {(φ(x1), y1), (φ(x2), y2), ..., (φ(xr), yr)}.
- Example: consider a 2-dimensional input space and choose the map φ: X → F, (x1, x2) ↦ (x1², x2², √2·x1·x2). Then the instance x = (2, 3) is represented in the feature space as φ(x) = (4, 9, 8.49).

Non-linear SVM: learning & prediction
- Training problem:
  Minimize  LP = ⟨w · w⟩ / 2 + C Σ_{i=1..r} ξi
  Such that  yi (⟨w · φ(xi)⟩ + b) ≥ 1 − ξi, ∀i = 1..r;  ξi ≥ 0, ∀i = 1..r   [Eq.34]
- The dual problem:
  Maximize  LD = Σ_{i=1..r} αi − (1/2) Σ_{i,j=1..r} αi αj yi yj ⟨φ(xi) · φ(xj)⟩
  Such that  Σ_{i=1..r} αi yi = 0;  0 ≤ αi ≤ C, ∀i = 1..r   [Eq.35]
- Classifier:
  f(z) = ⟨w*, φ(z)⟩ + b* = Σ_{xi ∈ SV} αi yi ⟨φ(xi), φ(z)⟩ + b*   [Eq.36]

Non-linear SVM: difficulties
- How to find the mapping?
  - An intractable problem.
- The curse of dimensionality:
  - As the dimensionality increases, the volume of the space increases so fast that the available data become sparse. This sparsity is problematic.
  - Increasing the dimensionality will require significantly more training data. (However much data we collect, it is still small compared to the space that contains it.)

Non-linear SVM: kernel functions
- An explicit form of the transformation is not necessary.
- The dual problem [Eq.35] and the classifier [Eq.36] require only the inner product ⟨φ(x), φ(z)⟩.
- Kernel trick: non-linear SVM can be trained and used by replacing those inner products with evaluations of a kernel function
  K(x, z) = ⟨φ(x), φ(z)⟩   [Eq.37]
  (see the sketch below).

Kernel functions: example
- Polynomial kernel: K(x, z) = ⟨x, z⟩^d
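To make the dual-form classifier [Eq.36] and the kernel trick [Eq.37] concrete, here is a minimal sketch using scikit-learn (assumed to be installed): it fits an RBF-kernel SVC on toy two-moons data and then recomputes the decision function by hand from the support vectors, the dual coefficients αi·yi, and the bias. The dataset and the settings C = 1.0, gamma = 1.0 are arbitrary illustrative choices, not values prescribed by the lecture.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Toy data that is not linearly separable in the input space.
X, y = make_moons(n_samples=200, noise=0.15, random_state=0)

# RBF-kernel SVM: K(x, z) = exp(-gamma * ||x - z||^2),
# i.e. the Gaussian kernel with gamma = 1 / (2 * sigma^2).
clf = SVC(kernel="rbf", C=1.0, gamma=1.0).fit(X, y)

# Recompute the decision function by hand from the dual form [Eq.36]:
# f(z) = sum over support vectors of (alpha_i * y_i) * K(x_i, z) + b*
Z = X[:5]
sv = clf.support_vectors_        # the support vectors x_i
coef = clf.dual_coef_[0]         # alpha_i * y_i for each support vector
b = clf.intercept_[0]            # the bias b*
K = np.exp(-clf.gamma * ((Z[:, None, :] - sv[None, :, :]) ** 2).sum(axis=-1))
f_manual = K @ coef + b

# Should match scikit-learn's own decision_function on the same points.
print(np.allclose(f_manual, clf.decision_function(Z)))
```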
- Consider the polynomial kernel with degree d = 2. For any vectors x = (x1, x2) and z = (z1, z2):
  ⟨x, z⟩² = (x1·z1 + x2·z2)²
          = x1²·z1² + 2·x1·z1·x2·z2 + x2²·z2²
          = ⟨(x1², x2², √2·x1·x2), (z1², z2², √2·z1·z2)⟩
          = ⟨φ(x), φ(z)⟩ = K(x, z)
  where φ(x) = (x1², x2², √2·x1·x2).
- Therefore the polynomial kernel is the inner product of the two vectors φ(x) and φ(z).

Kernel functions: popular choices
- Polynomial: K(x, z) = (⟨x · z⟩ + θ)^d, with θ ∈ R, d ∈ N
- Gaussian radial basis function (RBF): K(x, z) = exp(−‖x − z‖² / (2σ²)), with σ > 0
- Sigmoid: K(x, z) = tanh(γ⟨x · z⟩ + r), with γ, r ∈ R
- What conditions ensure a valid kernel function? Mercer's theorem.

SVM: summary
- SVM works with real-valued attributes.
  - Any nominal attribute needs to be transformed into a real-valued one.
- The learning formulation of SVM focuses on 2 classes.
  - How about a classification problem with more than 2 classes?
  - One-vs-the-rest, one-vs-one: a multiclass problem can be solved by reducing it to many different 2-class problems.
- The decision function is simple, but may be hard to interpret.
  - This is more serious if we use kernel functions.

SVM: some packages
- LibSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm/
- Linear SVM for large datasets:
  - http://www.csie.ntu.edu.tw/~cjlin/liblinear/
  - http://www.cs.cornell.edu/people/tj/svm_light/svm_perf.html
- Scikit-learn in Python (see the short example below): http://scikit-learn.org/stable/modules/svm.html
- SVMlight: http://www.cs.cornell.edu/people/tj/svm_light/index.html
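As a usage illustration for the scikit-learn package listed above, and for the earlier remark that linear SVM works very well for text categorization, here is a minimal sketch; the tiny corpus and its labels are made up solely for demonstration and are not from the lecture.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# A tiny made-up corpus with hypothetical labels (1 = sports, 0 = tech),
# purely to show the API.
docs = [
    "the team won the football match",
    "a great goal in the last minute of the game",
    "new laptop with a faster processor released",
    "the software update improves battery life",
]
labels = [1, 1, 0, 0]

# LinearSVC is scikit-learn's interface to LIBLINEAR: a linear SVM, no kernel.
model = make_pipeline(TfidfVectorizer(), LinearSVC(C=1.0))
model.fit(docs, labels)

print(model.predict(["the player scored a goal", "a faster processor"]))
```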


