Theoretical Machine Learning COS 511

Lecture #14, March 24, 2016

Lecturer: Elad Hazan. Scribe: Lydia Liu

1 Regularization

1.1 RFTL

In the last lecture we discussed RFTL, an algorithm that arose naturally in the online learning community. There is clear intuition to motivate it too: we may obtain more stable solutions across iterations of the online learning algorithm by adding a regularization function. To recap, the RFTL update is

$$x_{t+1} := \arg\min_{x \in K} \left\{ \eta \sum_{i=1}^{t} \nabla_i^\top x + R(x) \right\}$$

and the regret bound is

$$\mathrm{Regret}(RFTL) \le \frac{1}{\eta}\left[ R(x_1) - R(x^*) \right] + 2\eta \sum_{t=1}^{T} \|\nabla_t\|_{\nabla^2 R(z_t)}^{*2} = O(\sqrt{T}).$$
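To make the update concrete, here is a minimal Python sketch, added to these notes as an illustration rather than taken from the lecture. The choices of K and R are assumptions on my part: K is the probability simplex and R is negative entropy, in which case the arg min has the closed form $x_{t+1} \propto \exp(-\eta \sum_{i \le t} \nabla_i)$.

```python
# Minimal RFTL sketch under assumed choices: K = probability simplex,
# R(x) = sum_i x_i log x_i (negative entropy). With these choices the
# update x_{t+1} = argmin_{x in K} { eta * <sum of grads, x> + R(x) }
# has a closed form: exponentiate the negative scaled gradient sum.
import numpy as np

def rftl_entropy_update(grad_sum: np.ndarray, eta: float) -> np.ndarray:
    w = np.exp(-eta * grad_sum)
    return w / w.sum()                      # normalize back onto the simplex

# Toy run with linear losses f_t(x) = <g_t, x>.
rng = np.random.default_rng(0)
d, T, eta = 3, 100, 0.1
x = np.full(d, 1.0 / d)                     # x_1: uniform point of K
grad_sum = np.zeros(d)
for t in range(T):
    g = rng.uniform(-1.0, 1.0, size=d)      # stand-in for the adversary's gradient
    grad_sum += g
    x = rftl_entropy_update(grad_sum, eta)
```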
1.2 Mirrored Descent

Given R such that $\nabla R : \mathbb{R}^d \to \mathbb{R}^d$ is a vector field, the updates for the Mirrored Descent algorithm are:

$$\nabla R(y_{t+1}) = \nabla R(x_t) - \eta \nabla_t$$

$$x_{t+1} = \Pi_K^{B_R}(y_{t+1}) = \arg\min_{x \in K} B_R(x, y_{t+1}),$$

where $B_R(x, y) := R(x) - R(y) - \nabla R(y)^\top (x - y)$ is the Bregman divergence of R.

Unlike RFTL, the intuition behind Mirrored Descent (MD) seems less clear, but one can show under fairly general conditions that $x_t^{MD} = x_t^{RFTL}$ with the same regularization function R.

Also, note that if R is the squared Euclidean norm $\|\cdot\|^2$, then MD is the gradient descent algorithm. If R is negative entropy, $\sum_i x_i \log x_i$, then MD is the multiplicative weights algorithm. If we also optimize the parameter $\eta$, we get the same regret bound:

$$\mathrm{Regret}(MD) = \mathrm{Regret}(RFTL) \le 2\sqrt{2 D_R \sum_{t=1}^{T} \|\nabla_t\|_{\nabla^2 R(z_t)}^{*2}}.$$
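The first correspondence can be seen in a few lines of code. The sketch below, my own illustration, takes $R(x) = \frac{1}{2}\|x\|_2^2$, so $\nabla R$ is the identity map and $B_R$ is half the squared Euclidean distance; the decision set K is assumed (my choice, so the Bregman projection has a closed form) to be a Euclidean ball.

```python
# One Mirrored Descent step with R(x) = (1/2)*||x||_2^2: the dual update
# grad R(y_{t+1}) = grad R(x_t) - eta * grad_t is a plain gradient step, and
# the Bregman projection onto an assumed Euclidean-ball K is ordinary
# Euclidean projection, recovering projected gradient descent as noted above.
import numpy as np

def md_step_euclidean(x_t: np.ndarray, grad_t: np.ndarray,
                      eta: float, radius: float = 1.0) -> np.ndarray:
    y_next = x_t - eta * grad_t          # dual-space update (grad R = identity)
    norm = np.linalg.norm(y_next)
    if norm <= radius:                   # already inside K
        return y_next
    return y_next * (radius / norm)      # Bregman (= Euclidean) projection onto K
```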
1.3 Motivating adaptive regularization

This leads us to the question: what is the best R to choose to minimize regret? Clearly, it is more important to optimize the term $\sum_{t=1}^{T} \|\nabla_t\|_{\nabla^2 R(z_t)}^{*2}$ than the term $D_R$, since the former is a sum that increases with T.

We saw in SGD that $\bar{x}_T = \frac{1}{T} \sum_t x_t$ satisfies

$$\mathbb{E}[f(\bar{x}_T)] \le f(x^*) + \frac{\mathrm{regret}_T}{T}, \quad \text{where} \quad \frac{\mathrm{regret}_T}{T} \approx \frac{1}{\sqrt{T}},$$

which is state-of-the-art. If we apply matrix-norm regularization, i.e. $R(x) = \frac{1}{2} x^\top A x$, in RFTL, then the average regret term will be

$$\sqrt{D_R \sum_t \|\nabla_t\|_A^{*2}}.$$

Thus it is reasonable for us to try to optimize regret over the choice of matrix A.

Here we sketch the idea for a simplified example. Suppose $\nabla_t \in \left\{ (\pm 1, \pm 1, 0, \dots, 0)^\top \right\} \subseteq \mathbb{R}^d$. We introduce an important definition for the set of matrices that we want to restrict ourselves to consider.

Definition 1.1. The spectahedron is the set of matrices $S_n := \{X \in \mathbb{R}^{n \times n} : X \succeq 0,\ \mathrm{Tr}(X) \le 1\}$.

What is the best $A \in S_n$ minimizing $\sqrt{\sum_t \|\nabla_t\|_A^{*2}}$ in this case? Since $\nabla_t$ is only non-zero in its first two coordinates, it makes sense to have non-negative weights only in the top left $2 \times 2$ submatrix of A. When restricted to the set $S_n$, we can in fact learn the best A for regularization and get approximately the same asymptotic performance as gradient descent.
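A small numeric illustration of this example, my own sketch rather than part of the notes: using the closed form $A^* = S_T^{1/2} / \mathrm{Tr}(S_T^{1/2})$ derived later in the proof of Theorem 2.1, the optimal A for gradients supported on the first two coordinates indeed concentrates on the top-left $2 \times 2$ block.

```python
# For gradients in {(+-1, +-1, 0, ..., 0)}, build S_T = sum_t grad_t grad_t^T
# and compute the minimizer A* = S_T^{1/2} / Tr(S_T^{1/2}) (closed form from
# the proof of Theorem 2.1 below); all of A*'s mass lands in the 2x2 corner.
import numpy as np
from scipy.linalg import sqrtm

rng = np.random.default_rng(1)
d, T = 5, 200
S = np.zeros((d, d))
for _ in range(T):
    g = np.zeros(d)
    g[:2] = rng.choice([-1.0, 1.0], size=2)   # gradient in {+-1, +-1, 0, ..., 0}
    S += np.outer(g, g)
A_star = np.real(sqrtm(S))                    # matrix square root of S_T
A_star /= np.trace(A_star)                    # normalize into the spectahedron
print(np.round(A_star, 3))                    # nonzero only in top-left 2x2 block
```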
For the rest of this lecture, we will not concern ourselves with the question of whether a matrix is invertible, since we can perturb a singular matrix with $\delta I$ where $\delta$ is vanishing, or just take the pseudoinverse.

2 AdaGrad

We introduce the AdaGrad algorithm:

• Initialize $S_0 = G_0 = \delta I$, $x_1 \in K$.
• For $t = 1$ to $T$, do:
  – Predict $x_t$, suffer loss $f_t(x_t)$.
  – Update:
    $$S_t = S_{t-1} + \nabla_t \nabla_t^\top, \qquad G_t = S_t^{1/2}$$
    $$y_{t+1} = x_t - G_t^{-1} \nabla_t$$
    $$x_{t+1} = \Pi_K^{G_t}(y_{t+1})$$

The projection step is 'optional' because in reality we never step outside of K.

A note on computational efficiency: another version of AdaGrad deals with the time-consuming matrix square root and inversion steps by defining $\hat{S}_t = \mathrm{diag}(S_t)$, $G_t = \hat{S}_t^{1/2}$, so everything can be accomplished in linear time. The regret bound for this version is asymptotically the same as the usual AdaGrad (only slightly worse theoretical guarantees), and it is popular in real world applications. (Note: $\|\cdot\|_A^* = \|\cdot\|_{A^{-1}}$.)
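The following Python sketch, my own illustration of the updates above, implements both the full-matrix algorithm and the linear-time diagonal variant; the projection step is skipped (as the notes remark, it is 'optional'), and the initialization scale delta is an assumed value.

```python
# Sketch of the AdaGrad updates above: full-matrix and diagonal variants.
# No projection is performed; delta is an assumed small initialization scale.
import numpy as np
from scipy.linalg import sqrtm

def adagrad_full(grads, x1, delta=1e-3):
    """Full-matrix AdaGrad: G_t = S_t^{1/2}, step x - G_t^{-1} grad."""
    d = x1.shape[0]
    S = delta * np.eye(d)                     # S_0 = delta * I
    x = x1.astype(float).copy()
    xs = [x.copy()]
    for g in grads:
        S += np.outer(g, g)                   # S_t = S_{t-1} + grad grad^T
        G = np.real(sqrtm(S))                 # G_t = S_t^{1/2}
        x = x - np.linalg.solve(G, g)         # y_{t+1} = x_t - G_t^{-1} grad_t
        xs.append(x.copy())
    return xs

def adagrad_diag(grads, x1, delta=1e-3):
    """Diagonal variant: linear time per step, per the efficiency note."""
    s = np.full(x1.shape[0], delta)           # diagonal of S_t
    x = x1.astype(float).copy()
    xs = [x.copy()]
    for g in grads:
        s += g * g                            # diag(S_t) update
        x = x - g / np.sqrt(s)                # G_t^{-1} grad with G_t = diag(s)^{1/2}
        xs.append(x.copy())
    return xs

# Toy usage:
rng = np.random.default_rng(0)
grads = [rng.uniform(-1.0, 1.0, size=4) for _ in range(50)]
trajectory = adagrad_full(grads, x1=np.zeros(4))
```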
We state and prove the regret bound for the usual AdaGrad as follows.

Theorem 2.1. $\mathrm{Regret}(AG)_T = O\left( \sqrt{ \min_{A \in S_n} \sum_{t=1}^{T} \|\nabla_t\|_A^{*2} } \right)$.

Proof. We use the following fact: let $B \succeq 0$. Then

$$\arg\min_{A \in S_n} A^{-1} \circ B = \frac{B^{1/2}}{\mathrm{Tr}(B^{1/2})},$$

where for symmetric matrices M, N we write $M \circ N := \mathrm{Tr}(MN)$. This fact leads to the following observation: there is a closed form expression for the 'best' $A^{-1}$ norm,

$$\arg\min_{A \in S_n} \sum_{t=1}^{T} \|\nabla_t\|_A^{*2} = \arg\min_{A \in S_n} A^{-1} \circ S_T = \frac{S_T^{1/2}}{\mathrm{Tr}(S_T^{1/2})} = \frac{G_T}{\mathrm{Tr}(G_T)}.$$

Thus,

$$\sqrt{\min_{A \in S_n} \sum_{t=1}^{T} \|\nabla_t\|_A^{*2}} = \sqrt{\mathrm{Tr}(G_T) \cdot G_T^{-1} \circ G_T^2} = \mathrm{Tr}(G_T).$$

It now suffices to prove $\mathrm{Regret}(AG)_T = O(\mathrm{Tr}(G_T))$. Define $D = \max_{u \in K} \|u - x_1\|_2$. Then

$$\|x_{t+1} - x^*\|_{G_t}^2 \le \|y_{t+1} - x^*\|_{G_t}^2 \quad \text{(because of the projection)}$$
$$= \|x_t - G_t^{-1} \nabla_t - x^*\|_{G_t}^2$$
$$= \|x_t - x^*\|_{G_t}^2 - 2 \nabla_t^\top (x_t - x^*) + \nabla_t^\top G_t^{-1} \nabla_t,$$

so that

$$2\left( f_t(x_t) - f_t(x^*) \right) \le 2 \nabla_t^\top (x_t - x^*) \le \|x_t - x^*\|_{G_t}^2 - \|x_{t+1} - x^*\|_{G_t}^2 + \nabla_t^\top G_t^{-1} \nabla_t.$$

Summing over $t = 1$ to $T$,

$$2 \cdot \mathrm{Regret}(AG) \le \sum_{t=1}^{T} \left( \|x_t - x^*\|_{G_t}^2 - \|x_{t+1} - x^*\|_{G_t}^2 + \nabla_t^\top G_t^{-1} \nabla_t \right)$$
$$= \sum_{t=1}^{T} (x_t - x^*)^\top (G_t - G_{t-1}) (x_t - x^*) + \sum_{t=1}^{T} \|\nabla_t\|_{G_t^{-1}}^2 + O(1).$$

Looking at the first term,

$$\sum_{t=1}^{T} (x_t - x^*)^\top (G_t - G_{t-1}) (x_t - x^*) = \sum_{t=1}^{T} \mathrm{Tr}\left( (G_t - G_{t-1})(x_t - x^*)(x_t - x^*)^\top \right)$$
$$\le \sum_{t=1}^{T} \mathrm{Tr}(G_t - G_{t-1}) \cdot \left\| (x_t - x^*)(x_t - x^*)^\top \right\|_2 \quad \text{by Hölder's inequality}$$
$$\le D^2 \sum_{t=1}^{T} \left( \mathrm{Tr}(G_t) - \mathrm{Tr}(G_{t-1}) \right)$$
$$\le D^2 \, \mathrm{Tr}(G_T).$$

Looking at the second term, we can prove by induction that

$$\sum_{t=1}^{T} \|\nabla_t\|_{G_t^{-1}}^2 \le 2 \sum_{t=1}^{T} \|\nabla_t\|_{G_T^{-1}}^2 = 2\,\mathrm{Tr}(G_T).$$

Check that the result holds for $t = 1$. Now, assume it holds for $t = T$, and check the induction hypothesis for $t = T + 1$:

$$\sum_{t=1}^{T+1} \nabla_t^\top G_t^{-1} \nabla_t \le 2\,\mathrm{Tr}(G_T) + \nabla_{T+1}^\top G_{T+1}^{-1} \nabla_{T+1}$$
$$\le 2\,\mathrm{Tr}\left( (G_{T+1}^2 - \nabla_{T+1} \nabla_{T+1}^\top)^{1/2} \right) + \mathrm{Tr}\left( G_{T+1}^{-1} \nabla_{T+1} \nabla_{T+1}^\top \right)$$
$$\le 2\,\mathrm{Tr}(G_{T+1}),$$

where the last inequality is due to the following matrix inequality for $A \succeq B \succeq 0$:

$$2\,\mathrm{Tr}\left( (A - B)^{1/2} \right) + \mathrm{Tr}(A^{-1/2} B) \le 2\,\mathrm{Tr}(A^{1/2}).$$

Together, this proves that $\mathrm{Regret}(AG)_T \le O(1) \cdot \mathrm{Tr}(G_T)$. $\blacksquare$
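As a quick sanity check on the two matrix facts used in the proof, the following sketch (my own verification, not from the notes) confirms them numerically on randomly generated positive semidefinite matrices.

```python
# Numeric sanity checks for the two matrix facts used in the proof.
import numpy as np
from scipy.linalg import sqrtm

rng = np.random.default_rng(2)
n = 4
M = rng.standard_normal((n, n))
B = M @ M.T + 0.1 * np.eye(n)                # B is positive definite

# Fact 1: A* = B^{1/2} / Tr(B^{1/2}) minimizes A^{-1} o B over the spectahedron.
A_star = np.real(sqrtm(B))
A_star /= np.trace(A_star)
obj = lambda A: np.trace(np.linalg.inv(A) @ B)   # A^{-1} o B = Tr(A^{-1} B)
for _ in range(1000):                        # compare against random A in S_n
    N = rng.standard_normal((n, n))
    A = N @ N.T
    A /= np.trace(A)                         # PSD with trace 1
    assert obj(A_star) <= obj(A) + 1e-9

# Fact 2: 2 Tr((A - B)^{1/2}) + Tr(A^{-1/2} B) <= 2 Tr(A^{1/2}) for A >= B >= 0.
g = rng.standard_normal(n)
B1 = np.outer(g, g)                          # rank-one B, as in the induction step
A1 = B1 + M @ M.T                            # guarantees A1 >= B1
lhs = 2 * np.trace(np.real(sqrtm(A1 - B1))) \
      + np.trace(np.linalg.inv(np.real(sqrtm(A1))) @ B1)
rhs = 2 * np.trace(np.real(sqrtm(A1)))
assert lhs <= rhs + 1e-9
```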