Learning Kernels - Tutorial
Part I: Introduction to Kernel Methods

Outline
• Part I: Introduction to kernel methods.
• Part II: Learning kernel algorithms.
• Part III: Theoretical guarantees.
• Part IV: Software tools.

Binary Classification Problem
• Training data: sample drawn i.i.d. from a set $X \subseteq \mathbb{R}^N$ according to some distribution $D$:
  $S = ((x_1, y_1), \ldots, (x_m, y_m)) \in X \times \{-1, +1\}$.
• Problem: find a hypothesis $h : X \to \{-1, +1\}$ in $H$ (classifier) with small generalization error $R_D(h)$.
• Linear classification:
  • Hypotheses based on hyperplanes.
  • Linear separation in a high-dimensional space.

Linear Separation
• Classifiers: $H = \{x \mapsto \operatorname{sgn}(w \cdot x + b) : w \in \mathbb{R}^N,\, b \in \mathbb{R}\}$.
• [Figure: several candidate separating hyperplanes $w \cdot x + b = 0$ for the same data.]

Optimal Hyperplane: Maximum Margin (Vapnik and Chervonenkis, 1964)
• Canonical hyperplane: $w$ and $b$ chosen such that $|w \cdot x + b| = 1$ for the closest points.
• Margin: $\rho = \min_{x \in S} \frac{|w \cdot x + b|}{\|w\|} = \frac{1}{\|w\|}$.
• [Figure: margin between the hyperplanes $w \cdot x + b = +1$, $w \cdot x + b = 0$, and $w \cdot x + b = -1$.]

Optimization Problem
• Constrained optimization:
  $\min_{w, b} \ \frac{1}{2}\|w\|^2$ subject to $y_i (w \cdot x_i + b) \ge 1,\ i \in [1, m]$.
• Properties:
  • Convex optimization (strictly convex).
  • Unique solution for a linearly separable sample.

Support Vector Machines (Cortes and Vapnik, 1995)
• Problem: data are often not linearly separable in practice; for any hyperplane there exists $x_i$ such that $y_i (w \cdot x_i + b) \not\ge 1$.
• Idea: relax the constraints using slack variables $\xi_i \ge 0$:
  $y_i (w \cdot x_i + b) \ge 1 - \xi_i$.

Soft-Margin Hyperplanes
• Support vectors: points along the margin or outliers.
• Soft margin: $\rho = 1/\|w\|$.
• [Figure: slack variables $\xi_i$, $\xi_j$ for points violating the margin defined by $w \cdot x + b = \pm 1$.]

Optimization Problem (Cortes and Vapnik, 1995)
• Constrained optimization:
  $\min_{w, b, \xi} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i$ subject to $y_i (w \cdot x_i + b) \ge 1 - \xi_i \wedge \xi_i \ge 0,\ i \in [1, m]$.
• Properties:
  • $C \ge 0$: trade-off parameter.
  • Convex optimization (strictly convex).
  • Unique solution.

Dual Optimization Problem
• Constrained optimization:
  $\max_{\alpha} \ \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$
  subject to $\alpha_i \ge 0 \wedge \sum_{i=1}^{m} \alpha_i y_i = 0,\ i \in [1, m]$.
• Solution:
  $h(x) = \operatorname{sgn}\big(\sum_{i=1}^{m} \alpha_i y_i (x_i \cdot x) + b\big)$, with $b = y_i - \sum_{j=1}^{m} \alpha_j y_j (x_j \cdot x_i)$ for any support vector $x_i$.

Kernel Methods
• Idea: define $K : X \times X \to \mathbb{R}$, called a kernel, such that $\Phi(x) \cdot \Phi(y) = K(x, y)$; $K$ is often interpreted as a similarity measure.
• Benefits:
  • Efficiency: $K$ is often more efficient to compute than $\Phi$ and the dot product.
  • Flexibility: $K$ can be chosen arbitrarily so long as the existence of $\Phi$ is guaranteed (Mercer's condition).

Example - Polynomial Kernels
• Definition: for all $x, y \in \mathbb{R}^N$, $K(x, y) = (x \cdot y + c)^d$, $c > 0$.
• Example: with the second-degree polynomial kernel and $c = 1$, the four points $(\pm 1, \pm 1)$, which are linearly non-separable in the input space, are mapped to (a numerical check is sketched below, after the SVM slides):
  • $(-1, -1) \mapsto (1, 1, +\sqrt{2}, -\sqrt{2}, -\sqrt{2}, 1)$
  • $(-1, +1) \mapsto (1, 1, -\sqrt{2}, -\sqrt{2}, +\sqrt{2}, 1)$
  • $(+1, +1) \mapsto (1, 1, +\sqrt{2}, +\sqrt{2}, +\sqrt{2}, 1)$
  • $(+1, -1) \mapsto (1, 1, -\sqrt{2}, +\sqrt{2}, -\sqrt{2}, 1)$
• These images are linearly separable by the hyperplane $x_1 x_2 = 0$ (the $\sqrt{2}\, x_1 x_2$ coordinate).

Other Standard PDS Kernels
• Gaussian kernels: $K(x, y) = \exp\big(-\frac{\|x - y\|^2}{2\sigma^2}\big)$, $\sigma \ne 0$.
• Sigmoid kernels: $K(x, y) = \tanh(a (x \cdot y) + b)$, $a, b \ge 0$.

Consequence: SVMs with PDS Kernels (Boser, Guyon, and Vapnik, 1992)
• Constrained optimization:
  $\max_{\alpha} \ \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$
  subject to $0 \le \alpha_i \le C \wedge \sum_{i=1}^{m} \alpha_i y_i = 0,\ i \in [1, m]$.
• Solution:
  $h(x) = \operatorname{sgn}\big(\sum_{i=1}^{m} \alpha_i y_i K(x_i, x) + b\big)$, with $b = y_i - \sum_{j=1}^{m} \alpha_j y_j K(x_j, x_i)$ for any $x_i$ with $0 < \alpha_i < C$.

SVMs with PDS Kernels (matrix form)
• Constrained optimization, with $\mathbf{K} = (K(x_i, x_j))_{ij}$ and $\mathbf{Y} = \operatorname{diag}(y_1, \ldots, y_m)$:
  $\max_{\alpha} \ 2\, \mathbf{1}^\top \alpha - \alpha^\top \mathbf{Y} \mathbf{K} \mathbf{Y} \alpha$
  subject to $0 \le \alpha \le C \wedge \alpha^\top \mathbf{y} = 0$.
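To make the polynomial-kernel example above concrete, here is a minimal sketch, assuming NumPy is available; the explicit feature map written out below and the labels given by the sign of $x_1 x_2$ are illustrative choices consistent with the feature vectors listed on the slide, not text from the tutorial.

```python
# Minimal numerical check of the kernel trick for the degree-2 polynomial kernel
# with c = 1 (assumes NumPy; the labels follow the sign of x1*x2, an illustrative
# choice consistent with the "separable by x1 x2 = 0" remark on the slide).
import numpy as np

def phi(x):
    """Explicit feature map Phi(x) = (x1^2, x2^2, sqrt(2) x1 x2, sqrt(2) x1, sqrt(2) x2, 1)."""
    x1, x2 = x
    return np.array([x1**2, x2**2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2) * x1,
                     np.sqrt(2) * x2,
                     1.0])

def poly_kernel(x, y, c=1.0, d=2):
    """Polynomial kernel K(x, y) = (x . y + c)^d."""
    return (np.dot(x, y) + c) ** d

X = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]], dtype=float)
y = np.sign(X[:, 0] * X[:, 1])  # labels +1, +1, -1, -1

# K(x, y) coincides with the dot product in feature space: Phi(x) . Phi(y).
for a in X:
    for b in X:
        assert np.isclose(np.dot(phi(a), phi(b)), poly_kernel(a, b))

# In feature space the two classes are separated by the sqrt(2) x1 x2 coordinate,
# i.e. by the hyperplane x1 x2 = 0 from the slide.
print(all(np.sign(phi(x)[2]) == label for x, label in zip(X, y)))  # True
```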
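As a complementary sketch of how the kernelized dual above is used in practice, the snippet below passes a precomputed Gaussian Gram matrix to a standard SVM solver. It assumes scikit-learn and NumPy are installed; the ring-shaped toy data and the values of C and sigma are arbitrary illustrative choices, not part of the tutorial.

```python
# Illustrative sketch: kernel SVM trained from a precomputed Gaussian Gram matrix.
# Assumes NumPy and scikit-learn; data and hyperparameters are arbitrary choices.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def make_ring(n, radius, noise=0.1):
    """Toy data: n noisy points on a circle of the given radius."""
    angles = rng.uniform(0.0, 2.0 * np.pi, n)
    pts = np.c_[radius * np.cos(angles), radius * np.sin(angles)]
    return pts + noise * rng.standard_normal((n, 2))

# Two concentric rings: not linearly separable in R^2.
X = np.vstack([make_ring(50, 1.0), make_ring(50, 3.0)])
y = np.r_[np.ones(50), -np.ones(50)]

sigma = 1.0
def gaussian_gram(A, B):
    """Gram matrix of the Gaussian kernel K(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

# The solver optimizes the dual above using the supplied kernel values K(x_i, x_j).
clf = SVC(C=10.0, kernel="precomputed")
clf.fit(gaussian_gram(X, X), y)

# Prediction needs K(x_i, x) between training and test points, matching h(x) above.
X_test = np.array([[0.0, 0.9], [2.8, 0.2]])
print(clf.predict(gaussian_gram(X_test, X)))      # one point from each ring
print("support vectors:", len(clf.support_))
```

Passing the Gram matrix directly (rather than using a built-in kernel option) mirrors the slide's formulation, where the learner only ever sees kernel values.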
Regression Problem
• Training data: sample drawn i.i.d. from a set $X$ according to some distribution $D$:
  $S = ((x_1, y_1), \ldots, (x_m, y_m)) \in X \times Y$, where $Y \subseteq \mathbb{R}$ is a measurable subset.
• Loss function: $L : Y \times Y \to \mathbb{R}_+$, a measure of closeness, typically $L(y, y') = (y' - y)^2$ or $L(y, y') = |y' - y|^p$ for some $p \ge 1$.
• Problem: find a hypothesis $h : X \to \mathbb{R}$ in $H$ with small generalization error with respect to the target $f$:
  $R_D(h) = \mathbb{E}_{x \sim D}\big[L(h(x), f(x))\big]$.

Kernel Ridge Regression
• [...]
• Solution: $\alpha = (\mathbf{K} + \lambda \mathbf{I})^{-1} \mathbf{y}$ (a numerical sketch is given at the end of this section).

Questions
• How should the user choose the kernel?
  • The problem is similar to that of selecting features for other learning algorithms.
  • With a poor choice, learning is made very difficult.
  • With a good choice, even poor learners could succeed.
• The requirement placed on the user is thus critical:
  • Can this requirement be lessened?
  • Is a more automatic selection of features possible?
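To complement the kernel ridge regression solution quoted above, here is a minimal numerical sketch. It assumes only NumPy, uses a Gaussian kernel, and the sine toy data together with the values of sigma and lambda are illustrative choices rather than material from the slides.

```python
# Minimal sketch of kernel ridge regression: alpha = (K + lambda I)^{-1} y,
# prediction h(x) = sum_i alpha_i K(x_i, x). Assumes NumPy; the data and the
# hyperparameters (sigma, lambda) are illustrative choices.
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-3.0, 3.0, size=(40, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(40)

sigma, lam = 1.0, 0.1

def gaussian_kernel(A, B):
    """Gaussian kernel matrix between the rows of A and B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

K = gaussian_kernel(X, X)
# Solve the linear system instead of forming the matrix inverse explicitly.
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

X_test = np.linspace(-3.0, 3.0, 5).reshape(-1, 1)
predictions = gaussian_kernel(X_test, X) @ alpha
print(predictions)
print(np.sin(X_test[:, 0]))  # noiseless target values, for comparison
```

Solving the symmetric positive definite linear system, rather than computing the inverse explicitly, is the standard and numerically safer way to evaluate $(\mathbf{K} + \lambda \mathbf{I})^{-1} \mathbf{y}$.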