Lecture 2 Slides: Linear Regression

Regression
Given:
– Data X = {x^{(1)}, \dots, x^{(n)}}, where x^{(i)} \in \mathbb{R}^d
– Corresponding labels y = {y^{(1)}, \dots, y^{(n)}}, where y^{(i)} \in \mathbb{R}

[Figure: scatter plot of the outcome variable against year, 1975–2005.]
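The data/label setup above can be made concrete with a few lines of NumPy. This sketch is not part of the original slides: the yearly values are invented purely for illustration, and it previews the least-squares fit developed in the following slides.

```python
import numpy as np

# Hypothetical data in the shape described above: n = 7 samples, d = 1 feature
# (the year), and a real-valued label y^(i) for each sample.
years = np.array([1975, 1980, 1985, 1990, 1995, 2000, 2005], dtype=float)
X = years.reshape(-1, 1)                           # design matrix, shape (n, d)
y = np.array([1.2, 2.0, 3.1, 4.5, 5.2, 6.8, 8.0])  # labels, shape (n,)

# A least-squares line fit (the method developed in the slides that follow).
slope, intercept = np.polyfit(years, y, deg=1)
print(f"y ≈ {intercept:.2f} + {slope:.3f} * year")
```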
Regression
Given:
– Data X = {x^{(1)}, \dots, x^{(n)}}, where x^{(i)} \in \mathbb{R}^d
– Corresponding labels y = {y^{(1)}, \dots, y^{(n)}}, where y^{(i)} \in \mathbb{R}

[Figure: the same data shown with a linear regression fit and a quadratic regression fit; x-axis: Year, 1975–2015.]

Prostate Cancer Dataset
• 97 samples, partitioned into 67 train / 30 test
• Eight predictors (features): continuous (4 log transforms), binary, ordinal
• Continuous outcome variable:
  – lpsa: log(prostate specific antigen level)
Based on slide by Jeff Howbert

Linear Regression
• Hypothesis:

    y = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_d x_d = \sum_{j=0}^{d} \theta_j x_j

  Assume x_0 = 1
• Fit the model by minimizing the sum of squared errors
Figures are courtesy of Greg Shakhnarovich

Least Squares Linear Regression
• Cost function:

    J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2

• Fit by solving \min_\theta J(\theta)

Intuition Behind Cost Function

    J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2

• For insight on J(\theta), assume x \in \mathbb{R}, so \theta = [\theta_0, \theta_1]
  – h_\theta(x): for a fixed \theta, this is a function of x
  – J(\theta): a function of the parameter \theta
• Example with training points (1, 1), (2, 2), (3, 3):

    J([0, 0.5]) = \frac{1}{2 \cdot 3} \left[ (0.5 - 1)^2 + (1 - 2)^2 + (1.5 - 3)^2 \right] \approx 0.58

    J([0, 0]) = \frac{1}{2 \cdot 3} \left[ 1^2 + 2^2 + 3^2 \right] \approx 2.33

• J(\theta) is convex
[Figures: the training points with the line h_\theta(x) for each fixed \theta, and the corresponding values of J plotted against \theta_1; contour-plot version: slide by Andrew Ng.]
Based on example by Andrew Ng

Logistic Regression Objective Function
• Can't just use squared loss as in linear regression:

    J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2

  – Using the logistic regression model

    h_\theta(x) = \frac{1}{1 + e^{-\theta^{\mathsf T} x}}

  results in a non-convex optimization

Deriving the Cost Function via Maximum Likelihood Estimation
• The likelihood of the data is given by:

    l(\theta) = \prod_{i=1}^{n} p(y^{(i)} \mid x^{(i)}; \theta)

• So we are looking for the \theta that maximizes the likelihood:

    \theta_{\mathrm{MLE}} = \arg\max_\theta \, l(\theta) = \arg\max_\theta \prod_{i=1}^{n} p(y^{(i)} \mid x^{(i)}; \theta)

• We can take the log without changing the solution:

    \theta_{\mathrm{MLE}} = \arg\max_\theta \, \log \prod_{i=1}^{n} p(y^{(i)} \mid x^{(i)}; \theta)
                          = \arg\max_\theta \sum_{i=1}^{n} \log p(y^{(i)} \mid x^{(i)}; \theta)

• Expand as follows:

    \theta_{\mathrm{MLE}} = \arg\max_\theta \sum_{i=1}^{n} \left[ y^{(i)} \log p(y^{(i)} = 1 \mid x^{(i)}; \theta) + \left( 1 - y^{(i)} \right) \log \left( 1 - p(y^{(i)} = 1 \mid x^{(i)}; \theta) \right) \right]

• Substitute in the model and take the negative to yield the logistic regression objective, \min_\theta J(\theta), with

    J(\theta) = -\sum_{i=1}^{n} \left[ y^{(i)} \log h_\theta(x^{(i)}) + \left( 1 - y^{(i)} \right) \log \left( 1 - h_\theta(x^{(i)}) \right) \right]

Intuition Behind the Objective
• Cost of a single instance:

    \mathrm{cost}(h_\theta(x), y) =
      \begin{cases}
        -\log(h_\theta(x))     & \text{if } y = 1 \\
        -\log(1 - h_\theta(x)) & \text{if } y = 0
      \end{cases}

• We can re-write the objective function as

    J(\theta) = \sum_{i=1}^{n} \mathrm{cost}\left( h_\theta(x^{(i)}), y^{(i)} \right)

• Compare to linear regression:

    J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2

• Aside: recall the plot of -\log(z)
[Figure: plot of -\log(z) for z \in (0, 1].]
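As a quick check on the objective just derived, here is a minimal NumPy sketch of the negative log-likelihood J(\theta) for logistic regression. It is not from the slides: the function names, the eps guard, and the toy data are my own, and the cost is left unscaled and unregularized to match the formula above.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: h_theta(x) = 1 / (1 + exp(-theta^T x))."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(theta, X, y, eps=1e-12):
    """Negative log-likelihood J(theta) as on the slide (no 1/n, no regularizer).

    X is n x (d+1) with a leading column of ones (x_0 = 1); y holds 0/1 labels.
    eps keeps the logs finite if h hits exactly 0 or 1.
    """
    h = sigmoid(X @ theta)
    return -np.sum(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))

# Toy check on made-up data: 4 examples, one feature plus the intercept column.
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5]])
y = np.array([0, 0, 1, 1])
theta = np.array([-4.0, 2.0])      # arbitrary parameter vector
print(logistic_cost(theta, X, y))  # smaller is better
```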
Intuition Behind the Objective (continued)
If y = 1:
• Cost = 0 if the prediction is correct
• As h_\theta(x) \to 0, cost \to \infty
• Captures the intuition that larger mistakes should get larger penalties
  – e.g., predict h_\theta(x) = 0, but y = 1
[Figure: -\log(h_\theta(x)) plotted against h_\theta(x).]

If y = 0:
• Cost = 0 if the prediction is correct
• As 1 - h_\theta(x) \to 0, cost \to \infty
• Captures the intuition that larger mistakes should get larger penalties
[Figure: -\log(1 - h_\theta(x)) plotted against h_\theta(x).]
Based on example by Andrew Ng

Regularized Logistic Regression

    J(\theta) = -\sum_{i=1}^{n} \left[ y^{(i)} \log h_\theta(x^{(i)}) + \left( 1 - y^{(i)} \right) \log \left( 1 - h_\theta(x^{(i)}) \right) \right]

• We can regularize logistic regression exactly as before:

    J_{\mathrm{regularized}}(\theta) = J(\theta) + \lambda \sum_{j=1}^{d} \theta_j^2 = J(\theta) + \lambda \lVert \theta_{[1:d]} \rVert_2^2

Gradient Descent for Logistic Regression

    J_{\mathrm{reg}}(\theta) = -\sum_{i=1}^{n} \left[ y^{(i)} \log h_\theta(x^{(i)}) + \left( 1 - y^{(i)} \right) \log \left( 1 - h_\theta(x^{(i)}) \right) \right] + \lambda \lVert \theta_{[1:d]} \rVert_2^2

• Want \min_\theta J(\theta)
• Initialize \theta
• Repeat until convergence (simultaneous update for j = 0, \dots, d):

    \theta_j \leftarrow \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)

• Use the natural logarithm (ln = log_e) to cancel with the exp() in h_\theta(x)
• Writing out the partial derivatives, the updates become:

    \theta_0 \leftarrow \theta_0 - \alpha \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)

    \theta_j \leftarrow \theta_j - \alpha \left[ \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} + \lambda \theta_j \right]

• This looks IDENTICAL to linear regression!
  – Ignoring the 1/n constant
  – However, the form of the model is very different:

    h_\theta(x) = \frac{1}{1 + e^{-\theta^{\mathsf T} x}}

  (A vectorized sketch of these updates appears at the end of this section.)

Multi-Class Classification
• Binary classification vs. multi-class classification
[Figures: binary-class and multi-class data plotted in the (x_1, x_2) plane.]
• Disease diagnosis: healthy / cold / flu / pneumonia
• Object classification: desk / chair / monitor / bookcase

Multi-Class Logistic Regression
• For 2 classes:

    h_\theta(x) = \frac{1}{1 + \exp(-\theta^{\mathsf T} x)} = \frac{\exp(\theta^{\mathsf T} x)}{1 + \exp(\theta^{\mathsf T} x)}

  – \exp(\theta^{\mathsf T} x) is the weight assigned to y = 1; the 1 is the weight assigned to y = 0
• For C classes \{1, \dots, C\}:

    p(y = c \mid x; \theta_1, \dots, \theta_C) = \frac{\exp(\theta_c^{\mathsf T} x)}{\sum_{c'=1}^{C} \exp(\theta_{c'}^{\mathsf T} x)}

  – Called the softmax function

Multi-Class Logistic Regression
• Split into One vs Rest:
[Figure: the multi-class data split into one-vs-rest binary problems in the (x_1, x_2) plane.]
• Train a logistic regression classifier for each class c to predict the probability that y = c, with

    h_c(x) = \frac{\exp(\theta_c^{\mathsf T} x)}{\sum_{c'=1}^{C} \exp(\theta_{c'}^{\mathsf T} x)}

Implementing Multi-Class Logistic Regression
• Use the h_c(x) above as the model for class c
• Gradient descent simultaneously updates all parameters for all models
  – Same derivative as before, just with the above h_c(x)
• Predict the class label as the most probable label: \arg\max_c h_c(x)

…

Linear Algebra Concepts
• Vector products:
  – Dot product:

    u \cdot v = u^{\mathsf T} v = (u_1 \;\; u_2) \begin{pmatrix} v_1 \\ v_2 \end{pmatrix} = u_1 v_1 + u_2 v_2

    Note: the dot product of u with itself equals \mathrm{length}(u)^2
  – Outer product: …
• Matrix product:

    A = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}, \quad
    B = \begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{pmatrix}

    AB = \begin{pmatrix}
      a_{11} b_{11} + a_{12} b_{21} & a_{11} b_{12} + a_{12} b_{22} \\
      a_{21} b_{11} + a_{22} b_{21} & a_{21} b_{12} + a_{22} b_{22}
    \end{pmatrix}

Based on slides by Joseph Bradley

Vectorization
• For the linear regression cost function:

    J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2
              = \frac{1}{2n} \sum_{i=1}^{n} \left( \theta^{\mathsf T} x^{(i)} - y^{(i)} \right)^2

• Let:

    X = \begin{pmatrix} (x^{(1)})^{\mathsf T} \\ (x^{(2)})^{\mathsf T} \\ \vdots \\ (x^{(n)})^{\mathsf T} \end{pmatrix} \in \mathbb{R}^{n \times (d+1)}, \qquad
    y = \begin{pmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(n)} \end{pmatrix}

• Then:

    J(\theta) = \frac{1}{2n} (X\theta - y)^{\mathsf T} (X\theta - y)
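The sketch below ties the last few slides together: the gradient descent updates for regularized logistic regression, written in the vectorized style just introduced. It is my own minimal sketch, not the slides' code: the function names, step size, λ, iteration count, and toy data are assumptions, and the regularization term follows the slide's update rule (+ λ θ_j, with any constant factor absorbed into λ; θ_0 is not regularized).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent_logreg(X, y, alpha=0.01, lam=0.1, iters=1000):
    """Vectorized gradient descent for regularized logistic regression.

    X is n x (d+1) with a leading column of ones (x_0 = 1); y holds 0/1 labels.
    The gradient is summed over all n examples, as on the slides (no 1/n),
    so a small step size alpha is used.
    """
    n, d_plus_1 = X.shape
    theta = np.zeros(d_plus_1)
    for _ in range(iters):
        h = sigmoid(X @ theta)                 # h_theta(x^(i)) for all i at once
        grad = X.T @ (h - y)                   # sum_i (h_theta(x^(i)) - y^(i)) x^(i)
        reg = lam * theta                      # + lambda * theta_j, per the slide
        reg[0] = 0.0                           # do not regularize the intercept
        theta = theta - alpha * (grad + reg)   # simultaneous update of all theta_j
    return theta

# Toy usage on made-up, roughly separable data.
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(100, 2))
y = (X_raw[:, 0] + X_raw[:, 1] > 0).astype(float)
X = np.hstack([np.ones((100, 1)), X_raw])      # add the x_0 = 1 column
print(gradient_descent_logreg(X, y))
```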