Lecture 7: Word Embeddings
Kai-Wei Chang
CS @ University of Virginia
kw@kwchang.net
Course webpage: http://kwchang.net/teaching/NLP16
CS 6501 Natural Language Processing

This lecture
- Learning word vectors (cont.)
- Representation learning in NLP

Recap: Latent Semantic Analysis
- Data representation
  - Encode single-relational data in a matrix
    - Co-occurrence (e.g., from a general corpus)
    - Synonyms (e.g., from a thesaurus)
- Factorization
  - Apply SVD to the matrix to find latent components
- Measuring degree of relation
  - Cosine of latent vectors

Recap: Mapping to Latent Space via SVD
- $C \approx U \Sigma V^\top$, where $C$ is $d \times n$, $U$ is $d \times k$, $\Sigma$ is $k \times k$, and $V^\top$ is $k \times n$
- SVD generalizes the original data
  - Uncovers relationships not explicit in the thesaurus
  - Term vectors are projected to the $k$-dimensional latent space
- Word similarity: cosine of two column vectors of $\Sigma V^\top$

Low rank approximation
- Frobenius norm: for an $m \times n$ matrix $C$,
  $\|C\|_F = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} |c_{ij}|^2}$
- Rank of a matrix: how many vectors (rows or columns) in the matrix are linearly independent of each other

Low rank approximation
- Low rank approximation problem:
  $\min_{X} \|C - X\|_F \quad \text{s.t. } \operatorname{rank}(X) = k$
- If we can only use $k$ independent vectors to describe the points in the space, what are the best choices? Essentially, we minimize the "reconstruction loss" under a low rank constraint.

Low rank approximation
- Assume the rank of $C$ is $r$
- SVD: $C = U \Sigma V^\top$, with $\Sigma = \operatorname{diag}(\sigma_1, \sigma_2, \ldots, \sigma_r, 0, \ldots, 0)$ ($r$ non-zero singular values)
- Zero out the $r - k$ trailing values: $\Sigma' = \operatorname{diag}(\sigma_1, \sigma_2, \ldots, \sigma_k, 0, \ldots, 0)$
- $C' = U \Sigma' V^\top$ is the best rank-$k$ approximation:
  $C' = \arg\min_{X} \|C - X\|_F \quad \text{s.t. } \operatorname{rank}(X) = k$
  (see the first sketch after these notes)

Word2Vec
- LSA: a compact representation of the co-occurrence matrix
- Word2Vec: predict surrounding words (skip-gram)
  - Similar to using co-occurrence counts (Levy & Goldberg 2014; Pennington et al. 2014)
  - Easy to incorporate new words or sentences

Word2Vec
- Similar to a language model, but predicting the next word is not the goal
- Idea: words that are semantically similar often occur near each other in text
- Embeddings that are good at predicting neighboring words are also good at representing similarity

Example
- Assume the vocabulary set is $W$. We have one center word $c$ and one context word $o$.
- What is the conditional probability $p(o \mid c)$?
  $p(o \mid c) = \dfrac{\exp(u_o \cdot v_c)}{\sum_{w \in W} \exp(u_w \cdot v_c)}$
- What is the gradient of the log likelihood w.r.t. $v_c$?
  $\dfrac{\partial \log p(o \mid c)}{\partial v_c} = u_o - E_{w \sim p(w \mid c)}[u_w]$
  (see the second sketch after these notes)
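The following is a minimal NumPy sketch of the low-rank approximation slides above: it computes the SVD of a small matrix, zeroes out the trailing singular values to obtain the best rank-$k$ approximation $C' = U \Sigma' V^\top$, and measures similarity as the cosine of two column vectors of $\Sigma V^\top$. The toy matrix, the value of $k$, and the columns being compared are illustrative assumptions, not data from the lecture.

```python
# Truncated SVD as the best rank-k approximation (sketch; toy values assumed).
import numpy as np

# Toy term-context matrix: rows = terms, columns = contexts.
C = np.array([[2., 0., 1., 0.],
              [1., 3., 0., 1.],
              [0., 1., 4., 2.],
              [1., 0., 2., 3.]])

k = 2  # number of latent dimensions to keep

# Full SVD: C = U @ diag(s) @ Vt, singular values s in decreasing order.
U, s, Vt = np.linalg.svd(C, full_matrices=False)

# Zero out the r - k trailing singular values and rebuild the rank-k approximation.
s_k = np.copy(s)
s_k[k:] = 0.0
C_k = U @ np.diag(s_k) @ Vt

# ||C - C_k||_F is the smallest Frobenius error achievable by any rank-k matrix.
print("Frobenius reconstruction error:", np.linalg.norm(C - C_k, ord="fro"))

# LSA-style similarity: cosine between two column vectors of Sigma @ V^T.
latent = np.diag(s[:k]) @ Vt[:k, :]          # k x n matrix of latent context vectors
a, b = latent[:, 0], latent[:, 1]
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print("cosine similarity of contexts 0 and 1:", cosine)
```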
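Below is a second minimal sketch, for the "Example" slide: the softmax probability $p(o \mid c)$ and the gradient of $\log p(o \mid c)$ with respect to $v_c$. The vocabulary size, embedding dimension, random vectors, and word indices are toy assumptions for illustration.

```python
# Skip-gram conditional probability and gradient (sketch; toy values assumed).
import numpy as np

rng = np.random.default_rng(0)
V, d = 8, 5                       # vocabulary size |W| and embedding dimension
u = rng.normal(size=(V, d))       # context ("output") vectors u_w, one per word
v = rng.normal(size=(V, d))       # center ("input") vectors v_c, one per word

c, o = 3, 6                       # indices of the center word c and context word o

# p(o | c) = exp(u_o . v_c) / sum_{w in W} exp(u_w . v_c)  (softmax over the vocabulary)
scores = u @ v[c]                 # u_w . v_c for every w in W
probs = np.exp(scores - scores.max())
probs /= probs.sum()
print("p(o | c) =", probs[o])

# Gradient of log p(o | c) w.r.t. v_c:  u_o - E_{w ~ p(w|c)}[u_w]
grad_vc = u[o] - probs @ u        # observed context vector minus expected context vector
print("gradient w.r.t. v_c:", grad_vc)
```

The gradient has the usual "observed minus expected" form: the context vector of the word actually seen, minus the model's expected context vector under $p(w \mid c)$.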