www.it-ebooks.info
Taming Text
HOW TO FIND, ORGANIZE, AND MANIPULATE IT
GRANT S. INGERSOLL
THOMAS S. MORTON
ANDREW L. FARRIS
MANNING
SHELTER ISLAND
For online information and ordering of this and other Manning books, please visit
www.manning.com. The publisher offers discounts on this book when ordered in quantity.
For more information, please contact
Special Sales Department
Manning Publications Co.
20 Baldwin Road
PO Box 261
Shelter Island, NY 11964
Email: orders@manning.com
©2013 by Manning Publications Co. All rights reserved.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in
any form or by means electronic, mechanical, photocopying, or otherwise, without prior written
permission of the publisher.
Many of the designations used by manufacturers and sellers to distinguish their products are
claimed as trademarks. Where those designations appear in the book, and Manning
Publications was aware of a trademark claim, the designations have been printed in initial caps
or all caps.
Recognizing the importance of preserving what has been written, it is Manning’s policy to have
the books we publish printed on acid-free paper, and we exert our best efforts to that end.
Recognizing also our responsibility to conserve the resources of our planet, Manning books
are printed on paper that is at least 15 percent recycled and processed without the use of
elemental chlorine.
Manning Publications Co.
20 Baldwin Road
PO Box 261
Shelter Island, NY 11964
Development editor: Jeff Bleiel
Technical proofreader: Steven Rowe
Copyeditor: Benjamin Berg
Proofreader: Katie Tennant
Typesetter: Dottie Marsico
Cover designer: Marija Tudor
ISBN 9781933988382
Printed in the United States of America
1 2 3 4 5 6 7 8 9 10 – MAL – 18 17 16 15 14 13
brief contents
1 ■ Getting started taming text
2 ■ Foundations of taming text
3 ■ Searching
4 ■ Fuzzy string matching
5 ■ Identifying people, places, and things
6 ■ Clustering text
7 ■ Classification, categorization, and tagging
8 ■ Building an example question answering system
9 ■ Untamed text: exploring the next frontier
contents
foreword
preface
acknowledgments
about this book
about the cover illustration
1 Getting started taming text
1.1 Why taming text is important
1.2 Preview: A fact-based question answering system
Hello, Dr. Frankenstein
1.3 Understanding text is hard
1.4 Text, tamed
1.5 Text and the intelligent app: search and beyond
Searching and matching ■ Extracting information ■ Grouping information ■ An intelligent application
1.6 Summary
1.7 Resources
2 Foundations of taming text
2.1 Foundations of language
Words and their categories ■ Phrases and clauses ■ Morphology
2.2 Common tools for text processing
String manipulation tools ■ Tokens and tokenization ■ Part of speech assignment ■ Stemming ■ Sentence detection ■ Parsing and grammar ■ Sequence modeling
2.3 Preprocessing and extracting content from common file formats
The importance of preprocessing ■ Extracting content using Apache Tika
2.4 Summary
2.5 Resources
3 Searching
3.1 Search and faceting example: Amazon.com
3.2 Introduction to search concepts
Indexing content ■ User input ■ Ranking documents with the vector space model ■ Results display
3.3 Introducing the Apache Solr search server
Running Solr for the first time ■ Understanding Solr concepts
3.4 Indexing content with Apache Solr
Indexing using XML ■ Extracting and indexing content using Solr and Apache Tika
3.5 Searching content with Apache Solr
Solr query input parameters ■ Faceting on extracted content
3.6 Understanding search performance factors
Judging quality ■ Judging quantity
3.7 Improving search performance
Hardware improvements ■ Analysis improvements ■ Query performance improvements ■ Alternative scoring models ■ Techniques for improving Solr performance
3.8 Search alternatives
3.9 Summary
3.10 Resources
4 Fuzzy string matching
4.1 Approaches to fuzzy string matching
Character overlap measures ■ Edit distance measures ■ N-gram edit distance
4.2 Finding fuzzy string matches
Using prefixes for matching with Solr ■ Using a trie for prefix matching ■ Using n-grams for matching
4.3 Building fuzzy string matching applications
Adding type-ahead to search ■ Query spell-checking for search ■ Record matching
4.4 Summary
4.5 Resources
5 Identifying people, places, and things
5.1 Approaches to named-entity recognition
Using rules to identify names ■ Using statistical classifiers to identify names
5.2 Basic entity identification with OpenNLP
Finding names with OpenNLP ■ Interpreting names identified by OpenNLP ■ Filtering names based on probability
5.3 In-depth entity identification with OpenNLP
Identifying multiple entity types with OpenNLP ■ Under the hood: how OpenNLP identifies names
5.4 Performance of OpenNLP
Quality of results ■ Runtime performance ■ Memory usage in OpenNLP
5.5 Customizing OpenNLP entity identification for a new domain
The whys and hows of training a model ■ Training an OpenNLP model ■ Altering modeling inputs ■ A new way to model names
5.6 Summary
5.7 Further reading
... life of two centuries ago, brought back to life by Maréchal’s pictures.
Getting started taming text
In this chapter
■ Understanding why processing text is important
■ Learning what makes taming text hard
■ Setting the stage for leveraging open source libraries to tame text
If you’re reading this book, chances are you’re a programmer, or at least in the information technology field...
... users want tools that enable them to focus on their lives and their work, not just their technology. They want to control, or tame, the uncontrolled beast that is text. But what does it mean to tame text? We’ll talk more about it later in this chapter, but for now taming text involves three primary things: The ability to find relevant answers...
... with the book and also available on GitHub at http://www.github.com/tamingtext/book. See https://github.com/tamingtext/book/blob/master/README for instructions on building the source. The high-level code that drives this process can be seen in the following listing.
Listing 1.1 Frankenstein driver program
Frankenstein frankenstein = new Frankenstein();...
... time and time again in text applications:
■ Search the text based on user input and return the relevant passage (a paragraph in this example)
■ Split the passage into sentences
■ Extract “interesting” things from the text, like the names of people
To accomplish these tasks, we’ll use two Java libraries, Apache Lucene and Apache OpenNLP, along with the code in the com.tamingtext.frankenstein.Frankenstein...
... online. But for brevity’s sake, we may have removed material such as comments from the code to fit it well within the text. The source code for the examples in the book is available for download from the publisher’s website at www.manning.com/TamingText.
Author Online
The purchase of Taming Text includes free access to a private web forum run by Manning Publications, where you can make comments about the...
... odd formatting in the text. For now, we’ll wave our hands as to why these failed. If you explore further with other queries, you’ll likely find plenty of the good, bad, and even the ugly in processing text. This example makes for a nice segue into our next section, which will touch on some of these difficulties in processing text as well as serve...
... volume of text available and the variety with which it occurs. There’s a reason the saying goes “the numbers don’t lie” and not “the text doesn’t lie”; text comes in all shapes and meanings and trips up even the smartest people on a regular basis. Writing applications to process text can mean facing a number of technical and nontechnical challenges. Table 1.2 outlines some of the challenges text applications...
... have the biggest impact in taming your text. By focusing on topics like search, entity identification (finding people, places, and things), grouping and labeling, clustering, and summarization, we can build practical applications that help users find and understand the important parts of their text quickly and easily. Though we hate to be a buzzkill on all the excitement of taming text, it’s important to...
... let’s take a look at an example of processing a chunk of text to find passages and identify interesting things like names.
1.2.1 Hello, Dr. Frankenstein
In light of our discussion of a question answering system as well as our three primary tasks for working with text, let’s take a look at some basic text processing. Naturally, we need some sample text to process in this simple system. For that, we chose...
... when we first started down our career paths in programming text-based applications.
Roadmap
Chapter 1 explains why processing text is important, and what makes it so challenging. We preview a fact-based question answering (QA) system, setting the stage for utilizing open source libraries to tame text.
Chapter 2 introduces the building blocks of text processing: tokenizing, chunking, parsing, and part of ...
Date posted: 17/02/2014, 23:20