Lecture 6: Vector Space Model
Kai-Wei Chang
CS @ University of Virginia
kw@kwchang.net
Course webpage: http://kwchang.net/teaching/NLP16
6501 Natural Language Processing

This lecture
- How to represent a word, a sentence, or a document?
- How to infer the relationship among words?
- We focus on “semantics”: distributional semantics
- What is the meaning of “life”?

How to represent a word
- Naïve way: represent words as atomic symbols: student, talk, university
- N-gram language model, logical analysis
- Represent a word as a “one-hot” vector: one dimension per vocabulary word (egg, student, talk, university, happy, buy, …), with a single 1 at the word's own position and 0 everywhere else
- How large is this vector?
- PTB data: ~50k, Google 1T data: 13M
- v ⋅ u = ?

Issues?
- Dimensionality is large; vector is sparse
- No similarity:
    v_happy = [0 0 0 1 … 0]
    v_sad   = [0 0 1 … 0]
    v_milk  = [1 0 0 … 0]
    v_happy ⋅ v_sad = v_happy ⋅ v_milk = 0
- Cannot represent new words
- Any idea?

Idea 1: Taxonomy (Word category)

What is “car”?
>>> from nltk.corpus import wordnet as wn
>>> wn.synsets('motorcar')
[Synset('car.n.01')]
>>> motorcar = wn.synset('car.n.01')  # bind the synset so the calls below work
>>> motorcar.hypernyms()
[Synset('motor_vehicle.n.01')]
>>> paths = motorcar.hypernym_paths()  # every path from the root down to this synset
>>> [synset.name() for synset in paths[0]]
['entity.n.01', 'physical_entity.n.01', 'object.n.01', 'whole.n.02',
 'artifact.n.01', 'instrumentality.n.03', 'container.n.01',
 'wheeled_vehicle.n.01', 'self-propelled_vehicle.n.01',
 'motor_vehicle.n.01', 'car.n.01']
>>> [synset.name() for synset in paths[1]]
['entity.n.01', 'physical_entity.n.01', 'object.n.01', 'whole.n.02',
 'artifact.n.01', 'instrumentality.n.03', 'conveyance.n.03', 'vehicle.n.01',
 'wheeled_vehicle.n.01', 'self-propelled_vehicle.n.01',
 'motor_vehicle.n.01', 'car.n.01']

Word similarity?
>>> right = wn.synset('right_whale.n.01')
>>> minke = wn.synset('minke_whale.n.01')
>>> orca = wn.synset('orca.n.01')
>>> tortoise = wn.synset('tortoise.n.01')
>>> novel = wn.synset('novel.n.01')
>>> right.lowest_common_hypernyms(minke)
[Synset('baleen_whale.n.01')]
>>> right.lowest_common_hypernyms(orca)
[Synset('whale.n.02')]
>>> right.lowest_common_hypernyms(tortoise)
[Synset('vertebrate.n.01')]
>>> right.lowest_common_hypernyms(novel)
[Synset('entity.n.01')]
Requires human labor

Taxonomy (Word category)
- Synonym, hypernym (Is-A), hyponym

Idea 2: Similarity = Clustering
Cluster n-gram model
- Can be generated from unlabeled corpora
- Based on statistics, e.g., mutual information
Implementation of the Brown hierarchical word clustering algorithm...

... representations
- Word meanings are vectors of “basic concepts”
- What are the “basic concepts”?
- How to assign weights?
- How to define the similarity/distance?
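
One possible way to make the last three questions concrete is sketched below. The concept names (emotion, food, drink), the weights, and the helper names are all hypothetical, picked by hand for illustration and not taken from the lecture. The sketch contrasts one-hot vectors, whose dot product is 0 for any two distinct words, with concept-weighted vectors compared by cosine similarity and Euclidean distance.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either is all zeros."""
    norm_u = math.sqrt(dot(u, u))
    norm_v = math.sqrt(dot(v, v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot(u, v) / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Straight-line distance between u and v."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def one_hot(word, vocab):
    """One-hot vector: 1 at the word's position in vocab, 0 elsewhere."""
    return [1 if w == word else 0 for w in vocab]


# Hypothetical toy vocabulary (assumption, for illustration only).
VOCAB = ["milk", "egg", "sad", "happy", "student"]

# Hypothetical "basic concepts" and hand-assigned weights (assumption).
CONCEPTS = ["emotion", "food", "drink"]
CONCEPT_VECTORS = {
    "happy": [0.9, 0.1, 0.0],
    "sad":   [0.8, 0.0, 0.1],
    "milk":  [0.0, 0.7, 0.9],
}

if __name__ == "__main__":
    # One-hot: every pair of distinct words looks equally unrelated.
    print(dot(one_hot("happy", VOCAB), one_hot("sad", VOCAB)))
    print(dot(one_hot("happy", VOCAB), one_hot("milk", VOCAB)))

    # Concept vectors: similarity is graded instead of all-or-nothing.
    happy, sad, milk = (CONCEPT_VECTORS[w] for w in ("happy", "sad", "milk"))
    print(cosine_similarity(happy, sad), cosine_similarity(happy, milk))
    print(euclidean_distance(happy, sad), euclidean_distance(happy, milk))
```

With these made-up weights, "happy" ends up closer to "sad" than to "milk" under both measures, whereas the one-hot dot products cannot tell the two pairs apart. Cosine similarity ignores vector length and Euclidean distance does not, so the choice between them depends on whether the magnitude of the weights should matter.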