
Taming Text: How to Find, Organize, and Manipulate It


DOCUMENT INFORMATION

Basic information

Pages: 322
File size: 10.05 MB

Content

www.it-ebooks.info

Taming Text
HOW TO FIND, ORGANIZE, AND MANIPULATE IT

GRANT S. INGERSOLL
THOMAS S. MORTON
ANDREW L. FARRIS

MANNING
SHELTER ISLAND

For online information and ordering of this and other Manning books, please visit www.manning.com. The publisher offers discounts on this book when ordered in quantity. For more information, please contact

Special Sales Department
Manning Publications Co.
20 Baldwin Road
PO Box 261
Shelter Island, NY 11964
Email: orders@manning.com

©2013 by Manning Publications Co. All rights reserved.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in the book, and Manning Publications was aware of a trademark claim, the designations have been printed in initial caps or all caps.

Recognizing the importance of preserving what has been written, it is Manning’s policy to have the books we publish printed on acid-free paper, and we exert our best efforts to that end. Recognizing also our responsibility to conserve the resources of our planet, Manning books are printed on paper that is at least 15 percent recycled and processed without the use of elemental chlorine.

Manning Publications Co.
20 Baldwin Road
PO Box 261
Shelter Island, NY 11964

Development editor: Jeff Bleiel
Technical proofreader: Steven Rowe
Copyeditor: Benjamin Berg
Proofreader: Katie Tennant
Typesetter: Dottie Marsico
Cover designer: Marija Tudor

ISBN 9781933988382
Printed in the United States of America
1 2 3 4 5 6 7 8 9 10 – MAL – 18 17 16 15 14 13

brief contents

1 ■ Getting started taming text
2 ■ Foundations of taming text
3 ■ Searching
4 ■ Fuzzy string matching
5 ■ Identifying people, places, and things
6 ■ Clustering text
7 ■ Classification, categorization, and tagging
8 ■ Building an example question answering system
9 ■ Untamed text: exploring the next frontier

contents

foreword
preface
acknowledgments
about this book
about the cover illustration

1 Getting started taming text
  1.1 Why taming text is important
  1.2 Preview: A fact-based question answering system
      Hello, Dr. Frankenstein
  1.3 Understanding text is hard
  1.4 Text, tamed
  1.5 Text and the intelligent app: search and beyond
      Searching and matching
      Extracting information
      Grouping information
      An intelligent application
  1.6 Summary
  1.7 Resources

2 Foundations of taming text
  2.1 Foundations of language
      Words and their categories
      Phrases and clauses
      Morphology
  2.2 Common tools for text processing
      String manipulation tools
      Tokens and tokenization
      Part of speech assignment
      Stemming
      Sentence detection
      Parsing and grammar
      Sequence modeling
  2.3 Preprocessing and extracting content from common file formats
      The importance of preprocessing
      Extracting content using Apache Tika
  2.4 Summary
  2.5 Resources

3 Searching
  3.1 Search and faceting example: Amazon.com
  3.2 Introduction to search concepts
      Indexing content
      User input
      Ranking documents with the vector space model
      Results display
  3.3 Introducing the Apache Solr search server
      Running Solr for the first time
      Understanding Solr concepts
  3.4 Indexing content with Apache Solr
      Indexing using XML
      Extracting and indexing content using Solr and Apache Tika
  3.5 Searching content with Apache Solr
      Solr query input parameters
      Faceting on extracted content
  3.6 Understanding search performance factors
      Judging quality
      Judging quantity
  3.7 Improving search performance
      Hardware improvements
      Analysis improvements
      Query performance improvements
      Alternative scoring models
      Techniques for improving Solr performance
  3.8 Search alternatives
  3.9 Summary
  3.10 Resources

4 Fuzzy string matching
  4.1 Approaches to fuzzy string matching
      Character overlap measures
      Edit distance measures
      N-gram edit distance
  4.2 Finding fuzzy string matches
      Using prefixes for matching with Solr
      Using a trie for prefix matching
      Using n-grams for matching
  4.3 Building fuzzy string matching applications
      Adding type-ahead to search
      Query spell-checking for search
      Record matching
  4.4 Summary
  4.5 Resources

5 Identifying people, places, and things
  5.1 Approaches to named-entity recognition
      Using rules to identify names
      Using statistical classifiers to identify names
  5.2 Basic entity identification with OpenNLP
      Finding names with OpenNLP
      Interpreting names identified by OpenNLP
      Filtering names based on probability
  5.3 In-depth entity identification with OpenNLP
      Identifying multiple entity types with OpenNLP
      Under the hood: how OpenNLP identifies names
  5.4 Performance of OpenNLP
      Quality of results
      Runtime performance
      Memory usage in OpenNLP
  5.5 Customizing OpenNLP entity identification for a new domain
      The whys and hows of training a model
      Training an OpenNLP model
      Altering modeling inputs
      A new way to model names
  5.6 Summary
  5.7 Further reading

[...]

... with it. You’ll learn how to evaluate the search performance factors of quantity and quality. Chapter 4 examines fuzzy string matching with prefixes and n-grams. We look at two character overlap measures, the Jaccard measure and the Jaro-Winkler distance, and explain how to find candidate matches with Solr and rank them. Chapter 5 presents the basic concepts behind named-entity recognition. We show how to use...

... and manipulate text with little-to-no user intervention
■ The ability to do both of these things with ever-increasing amounts of input

This leads us to the primary goal of this book: to give you, the programmer, the tools and hands-on advice to build applications that help people better manage the tidal wave of communication that swamps their lives. The secondary goal of Taming Text is to show how to...

... through it all!

Tom Morton

Thanks to my coauthors for their hard work and partnership; to my wife, Thuy, and daughter, Chloe, for their patience, support, and time freely given; to my family, Mortons and Trans, for all your encouragement; to my colleagues from the University of Pennsylvania and Comcast for their support and collaboration, especially Na-Rae Han, Jason Baldridge, Gann Bierner, and Martha...

... “Identifying people, places, and things,” we’ll look at how to identify proper names and numeric phrases and put them into semantic categories such as person, location, and date, irrespective of their linguistic usage. This ability will be fundamental to your ability to build a QA system in chapter 8. For both of these tasks we’ll use the capabilities of OpenNLP and explore how to use its existing models as...
... hundreds of thousands of dollars it takes to educate a person through college. Next, step back another level and think about how much content you do read in a day. To get started, take a moment to consider the following questions:

■ How many email messages did you get today (both work and personal, including spam)?
■ How many of those did you read?
■ How many did you respond to right away? Within the...

... commercially and in the open source community (see http://www.opensource.org) to tackle these topics and many more. One of the great things about the journey you’re embarking on is its ever-changing and ever-improving nature. Problems that were intractable 10 years ago due to resource limits are now yielding positive results thanks to better algorithms, faster CPUs, cheaper memory, cheaper disk space, and tools...

... likely to have the biggest impact in taming your text. By focusing on topics like search, entity identification (finding people, places, and things), grouping and labeling, clustering, and summarization, we can build practical applications that help users find and understand the important parts of their text quickly and easily. Though we hate to be a buzzkill on all the excitement of taming text, it’s...

... Apache Mahout, and how to cluster search results using Carrot2. Chapter 7 discusses the basic concepts behind classification, categorization, and tagging. We show how categorization is used in text applications, and how to build, train, and evaluate classifiers using open source tools. We also use the Mahout implementation of the naive Bayes algorithm to build a document categorizer...
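To make the chapter 7 summary above more concrete, here is a minimal sketch of a naive Bayes document categorizer. It is written in plain Python and is not the Mahout implementation the book uses; the function names `train` and `classify`, the whitespace tokenization, the add-one smoothing, and the toy training data are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Collect per-category counts from (category, text) pairs."""
    doc_counts = Counter()               # number of documents per category
    word_counts = defaultdict(Counter)   # word frequencies per category
    vocabulary = set()                   # every word seen during training
    for category, text in labeled_docs:
        words = text.lower().split()     # naive whitespace tokenization
        doc_counts[category] += 1
        word_counts[category].update(words)
        vocabulary.update(words)
    return doc_counts, word_counts, vocabulary


def classify(model, text):
    """Return the category with the highest log-probability for text."""
    doc_counts, word_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, float("-inf")
    for category in doc_counts:
        # Log prior: share of training documents in this category.
        score = math.log(doc_counts[category] / total_docs)
        total_words = sum(word_counts[category].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # skip words never seen in training
            # Add-one (Laplace) smoothing keeps unseen pairs from giving log(0).
            count = word_counts[category][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Hypothetical toy data, invented for this sketch.
training_data = [
    ("sports", "the team won the match"),
    ("sports", "a late goal decided the game"),
    ("tech", "the new search server indexes text"),
    ("tech", "open source tools for text search"),
]
model = train(training_data)
print(classify(model, "which team won the game"))
print(classify(model, "text search tools"))
```

With this toy data the first query is assigned to `sports` and the second to `tech`. A real categorizer would need far more training data, a proper tokenizer, and an evaluation step on held-out documents.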
... CNLP, and as part of Lucene

■ Dr. Liz Liddy, for introducing Drew and Grant to the world of text analytics and all the fun and opportunity therein, and for contributing the foreword
■ All of our MEAP readers, for their patience and feedback
■ Most of all, our family, friends, and coworkers, for their encouragement, moral support, and understanding as we took...

... book, ask technical questions, and receive help from the authors and from other users. To access the forum and subscribe to it, point your web browser at www.manning.com/TamingText. This page provides information on how to get on the forum once you are registered, what kind of help is available, and the rules of conduct on the forum. Manning’s commitment to our readers is to provide a venue where a meaningful ...

... continue to grow a vibrant ecosystem dedicated to the development of open source software and the people, process, and community that support it. The tools and

Date posted: 06/03/2014, 23:21
