
NLP in Scala with Breeze and Epic


DOCUMENT INFORMATION

Basic information

Format
Pages: 66
Size: 2.31 MB

Content

NLP in Scala with Breeze and Epic
David Hall, UC Berkeley

ScalaNLP Ecosystem
Breeze (≈ Numpy/Scipy)
• Linear Algebra
• Scientific Computing
• Optimization
Epic (≈ PyStruct/NLTK)
• Natural Language Processing
• Structured Prediction
Puck (≈ { })
• Super-fast GPU parser for English

[...]

... score(x, y) > score(x, y’), forall y’

Machine Learning Primer
score(x, y) = w^T f(x, y)
score(x, y) = w.t * f(x, y)
score(x, y) = w dot f(x, y)
score(x, y) >= score(x, y’)
w dot f(x, y) >= w dot f(x, y’)
[Figure: vectors w, f(x,y), f(x,y’), and w + f(x,y)]

...

... WordFeaturizer.DSL[L](counts) with SurfaceFeaturizer.DSL
import dsl._
word(begin) // word at the beginning of the span
+ word(end - 1) // end of the span
+ word(begin - 1) // before (gets things like Mr.)
+ word(end) // one past the end
+ prefixes(begin) // prefixes up to some length
+ suffixes(begin)
+ length(begin, end) // names tend to be 1-3 words
+ gazetteer(begin, end)

Using your own featurizer
> val data: IndexedSeq[Segmentation[Label,...

... someday tumble the Red Delicious from the top of America's apple heap

Multilingual Parser
[Figure labels: Berkeley, Epic]
“Berkeley:” [Petrov & Klein, 2007]; Epic [Hall, Durrett, and Klein, 2014]

Epic Pre-built Models
• Parsing
  – English, Basque, French, German, Swedish, Polish, Korean
  – (working on Arabic, Chinese, Spanish)
• Part-of-Speech Tagging
  – English, Basque, French, German, Swedish, Polish
• Named Entity Recognition...

> val featureIndex : Index[Feature] = ???
> val labelIndex: Index[Label] = ???
> val weights = DenseVector.rand[Double](featureIndex.size)
> for ( epoch ...
> val featureIndex...
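The perceptron fragments above are cut off in this preview, so here is a minimal sketch of the idea they point at: score each candidate label with w dot f(x, y), predict the highest-scoring one, and move the weights toward the correct features when the prediction is wrong. This is not the code from the slides. It uses plain Scala arrays rather than Breeze so it runs without dependencies, it handles a simple multiclass problem rather than full structured output, and every name in it (PerceptronSketch, features, predict, train) is hypothetical. The update shown (add the gold features, subtract the predicted ones) is the usual perceptron rule and is an assumption about what the elided slides do.

```scala
// Minimal perceptron sketch in plain Scala (no Breeze dependency).
// All names are hypothetical; this is not Epic's or Breeze's API.
object PerceptronSketch {
  type Label = Int

  // Joint feature map f(x, y): copy the input features into the block for label y.
  def features(x: Array[Double], y: Label, numLabels: Int): Array[Double] = {
    val f = new Array[Double](x.length * numLabels)
    Array.copy(x, 0, f, y * x.length, x.length)
    f
  }

  // Plain dot product, standing in for `w dot f(x, y)`.
  def dot(a: Array[Double], b: Array[Double]): Double = {
    var s = 0.0
    var i = 0
    while (i < a.length) { s += a(i) * b(i); i += 1 }
    s
  }

  // Predict the label y that maximizes score(x, y) = w dot f(x, y).
  def predict(w: Array[Double], x: Array[Double], numLabels: Int): Label =
    (0 until numLabels).maxBy(y => dot(w, features(x, y, numLabels)))

  // On each mistake: w += f(x, gold) - f(x, guess).
  def train(data: Seq[(Array[Double], Label)], numLabels: Int, epochs: Int): Array[Double] = {
    val dim = data.head._1.length * numLabels
    val w = new Array[Double](dim) // starts at zero (the slides use random init)
    for (_ <- 0 until epochs; (x, y) <- data) {
      val guess = predict(w, x, numLabels)
      if (guess != y) {
        val good = features(x, y, numLabels)
        val bad = features(x, guess, numLabels)
        var i = 0
        while (i < dim) { w(i) += good(i) - bad(i); i += 1 }
      }
    }
    w
  }
}
```

With Breeze, the arrays would presumably become DenseVector[Double] values and the hand-written dot would be replaced by the library's dot operator, as the slide fragments suggest.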
Semi-Markov Conditional Random Field
• Don’t worry about the name

Semi-CRFs
score(Chez Panisse) + score(Berkeley, CA) + score(- A bowl of ) + score(Churchill-Brenneis Orchards) + score(Page mandarins and medjool dates)

Features
score(Chez Panisse) = w(starts-with-Chez) + w(starts-with-C…) + w(ends-with-P…) + w(starts-sentence) + w(shape:Xxx Xxx) + w(two-words) + w(in-gazetteer)

Building your own features
val dsl = new WordFeaturizer.DSL[L](counts) with SurfaceFeaturizer.DSL
import dsl._
word(begin) // word at the beginning of the span
+ word(end...

... own featurizer
> val data: IndexedSeq[Segmentation[Label, String]] = ???
> val myFeaturizer = ???
> val system = SemiCRF.buildSimple(data, startLabel, outsideLabel, featurizer = myFeaturizer)

Features
• So far, we’ve been able to do everything with (nearly) no math
• To understand more, need to do some math

Machine Learning Primer
• Training example (x, y)
  – x: sentence of some sort
  – y: labeled version...

... form

Structured Perceptron
> val featureIndex : Index[Feature] = ???
> val labelIndex: Index[Label] = ???
> val weights = DenseVector.rand[Double](featureIndex.size)
> for ( epoch ...

...
> val myGazetteer = ???
> val system = SemiCRF.buildSimple(data, startLabel, outsideLabel, gaz = myGazetteer)

Gazetteer
• Careful with gazetteers!
• If built from training data, system will use it and only it to make predictions!
• So, only known forms will be...

... featuresFor(x, yy)
weights.t * new FeatureVector(indexed) // or weights dot new FeatureVector(indexed)
}
…
}

The Perceptron (cont’d)
> val featureIndex : Index[Feature] = ???
> val labelIndex: Index[Label] = ???
> val weights = DenseVector.rand[Double](featureIndex.size)
> for ( epoch ...
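The Semi-CRF slides above describe the score of a segmentation as a sum of per-span scores, where each span's score is the sum of the weights of the features that fire on it. The following is a rough sketch of that arithmetic only, in plain Scala. It ignores labels and label transitions, which a real Semi-CRF would include, and it does not use Epic's featurizer DSL. The feature names mirror the ones on the Features slide, while Span, spanFeatures, spanScore, and segmentationScore are hypothetical names introduced for illustration.

```scala
// Sketch of "segmentation score = sum of span scores" (assumed simplification:
// no labels, no transitions). Names are hypothetical, not Epic's API.
object SegmentScoreSketch {
  // A span covers words(begin) up to, but not including, words(end).
  final case class Span(begin: Int, end: Int)

  // A few surface features in the spirit of the "Features" slide.
  def spanFeatures(words: IndexedSeq[String], span: Span, gazetteer: Set[String]): Seq[String] = {
    val first = words(span.begin)
    val last = words(span.end - 1)
    val text = words.slice(span.begin, span.end).mkString(" ")
    val base = Seq(
      s"starts-with-$first",
      s"starts-with-${first.take(1)}",
      s"ends-with-${last.take(1)}",
      s"length-${span.end - span.begin}"
    )
    val atStart = if (span.begin == 0) Seq("starts-sentence") else Seq.empty[String]
    val inGaz = if (gazetteer.contains(text)) Seq("in-gazetteer") else Seq.empty[String]
    base ++ atStart ++ inGaz
  }

  // score(span) = sum of the weights of its active features (unknown features weigh 0).
  def spanScore(weights: Map[String, Double], feats: Seq[String]): Double =
    feats.map(f => weights.getOrElse(f, 0.0)).sum

  // score(segmentation) = sum over spans of score(span).
  def segmentationScore(
      words: IndexedSeq[String],
      spans: Seq[Span],
      weights: Map[String, Double],
      gazetteer: Set[String]): Double =
    spans.map(sp => spanScore(weights, spanFeatures(words, sp, gazetteer))).sum
}
```

Because unknown features contribute zero, a gazetteer feature with a large learned weight can dominate the span score, which is the risk the Gazetteer slide warns about.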

Posted: 24/10/2014, 13:47
