GATE Annie Lib Lucene
Course Project
Presenter: Bui Dac Thinh
For: IR Students TH2010

Objective
Information Retrieval: search engine on crawler text datasets and open-source system

Objective
“What is the largest city of VietNam?” “……….”
[Figure labels: question, answer, document set, HIT / NOT HIT]

Search Engine Architecture
[Figure: search engine architecture, divided into an Online Part and an Offline Part. Labels: User Interface, Caching, Indexing and Ranking, Index Builder, Web Page Parser, Crawler, Web Graph Builder, Link Analysis, Inverted index, Cached Pages, Page & Site Statistics, Page Ranks, Web Graph, Pages, Links, Anchors, Link Map]
7/2/14

What to do
Crawler: crawler4j (Java), Sphider (PHP), Scrapy (Python), HTMLAgilityPack (.NET), GeckoFx (.NET)
Link set, textual data
Preprocessing: GATE (Java), UIMA (Java)
Data
NLP: ANNIE, OpenNLP

Survey of Tools & Resources
General frameworks: UIMA, GATE
NLP components, pipelines, and tools: Stanford Named Entity Recognizer (NER), Stanford CoreNLP (CoreNLP), NegEx (NegEx), ENJU (ENJU), OpenNLP

Java framework
Apache OpenNLP: OpenNLP tools (sentence detector, POS tagger, tokenizer, shallow and full syntactic parser, named-entity detector)
Emdros: text database engine for analyzed and annotated text
Mallet: machine learning for language toolkit in Java
NLTK
Weka
WordNet::Similarity: measures of semantic relatedness using WordNet

ANNIE: a Nearly-New Information Extraction System
Document Reset
Tokeniser
Gazetteer
Sentence Splitter
RegEx Sentence Splitter
POS Tagger
Semantic Tagger

GATE
Open source software
Community of text engineering
Defined and repeatable process
The Eclipse of NLP
The Lucene of Information Extraction

[...]...
... given directory
Calculates a score for each of the documents that match a given query

Search with Lucene
Search Result
Primary classes:
ScoreDoc: a document that hits (position, score)
TopDocs: total documents that hit [number]

Codeline Index
//state the file location of the index
string indexFileLocation = @"C:\Index";
Lucene.Net.Store.Directory dir = Lucene.Net.Store.FSDirectory.GetDirectory(indexFileLocation...

...
//state the file location of the index
string indexFileLocation = @"C:\Index";
//false: open the existing index (passing true would create a new index and erase any existing contents)
Lucene.Net.Store.Directory dir = Lucene.Net.Store.FSDirectory.GetDirectory(indexFileLocation, false);
//create an index searcher that will perform the search
Lucene.Net.Search.IndexSearcher searcher = new Lucene.Net.Search.IndexSearcher(dir);

Codeline Search
//build a query object
Lucene.Net.Index.Term searchTerm = new Lucene.Net.Index.Term("content",...

... to process the text
Lucene.Net.Analysis.Analyzer analyzer = new Lucene.Net.Analysis.Standard.StandardAnalyzer();
//create the index writer with the directory and analyzer defined
Lucene.Net.Index.IndexWriter indexWriter = new Lucene.Net.Index.IndexWriter(dir, analyzer, true); /*true to create a new index*/

Codeline
//create a document, add in a single field
Lucene.Net.Documents.Document doc = new Lucene.Net.Documents.Document();
...

... Product in Apache Jakarta
Popular: Xerox, Apple, Wikipedia, IBM, CNN, Nutch…
Open source in Java
The most efficient framework for IR
• Index
• Search

Lucene4c / CLucene
NLucene / Lucene.NET
PyLucene
Ferret / RubyLucene
Zend Framework

What uses Lucene
Wikipedia
Nutch
CNET
Red-Piranha
………

Lucene Sketch In 5 Mins
http://www.ibm.com/developerworks/library/os-apache-lucenesearch/

Analysis with Lucene
Analysis...
... searcher.Search(query);
//iterate over the results
for (int i = 0; i < hits.Length(); i++)
{
    Document doc = hits.Doc(i);
    string contentValue = doc.Get("content");
    Console.WriteLine(contentValue);
}

Codeline Display
/* First parameter is the query to be executed and second parameter indicates the no. of search results to fetch */
TopDocs topDocs = indexSearcher.search(query, 20);
System.out.println("Total hits " + topDocs.totalHits);
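
The slides describe Lucene as indexing documents and then calculating a score for each document that matches a query, returning the top results. The following is a minimal, standard-library-only Java sketch of that idea, not Lucene itself: the class TinyIndex, its Hit type, the naive tokenizer, and the simple term-frequency times inverse-document-frequency weighting are all assumptions made for illustration, and the weighting is not Lucene's actual scoring formula.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration class; it is not part of Lucene.
class TinyIndex {
    // One search result: a document id and its score (loosely analogous to a ScoreDoc).
    static final class Hit {
        final String docId;
        final double score;
        Hit(String docId, double score) { this.docId = docId; this.score = score; }
    }

    // Inverted index: term -> (document id -> term frequency in that document).
    private final Map<String, Map<String, Integer>> postings = new HashMap<>();
    private int docCount = 0;

    // Naive analysis step: lowercase, then split on anything that is not a-z or 0-9.
    private static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Index one document under a caller-supplied id (ids are assumed to be unique).
    void add(String docId, String content) {
        docCount++;
        for (String term : tokenize(content)) {
            postings.computeIfAbsent(term, k -> new HashMap<>()).merge(docId, 1, Integer::sum);
        }
    }

    // Score every document that matches at least one query term; return the best n.
    List<Hit> search(String query, int n) {
        Map<String, Double> scores = new HashMap<>();
        for (String term : tokenize(query)) {
            Map<String, Integer> docs = postings.get(term);
            if (docs == null) continue;
            // Smoothed inverse document frequency: rarer terms weigh more.
            double idf = Math.log(1.0 + (double) docCount / docs.size());
            for (Map.Entry<String, Integer> e : docs.entrySet()) {
                scores.merge(e.getKey(), e.getValue() * idf, Double::sum);
            }
        }
        List<Hit> hits = new ArrayList<>();
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            hits.add(new Hit(e.getKey(), e.getValue()));
        }
        hits.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return new ArrayList<>(hits.subList(0, Math.min(n, hits.size())));
    }

    public static void main(String[] args) {
        TinyIndex index = new TinyIndex();
        index.add("d1", "Lucene is a search library");
        index.add("d2", "GATE and ANNIE extract information from text");
        index.add("d3", "Lucene index and Lucene search");
        List<Hit> hits = index.search("lucene search", 20);
        System.out.println("Total hits " + hits.size());
        for (Hit h : hits) {
            System.out.println(h.docId + " " + h.score);
        }
    }
}
```

In this sketch, d3 ranks above d1 for the query "lucene search" because it contains "lucene" twice, and d2 is not returned because it matches no query term. A real Lucene index also stores fields, applies a configurable analyzer, and persists to a directory, none of which is modelled here.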