MANNING
Michael McCandless
Erik Hatcher
Otis Gospodnetic
FOREWORD BY DOUG CUTTING
Covers Apache Lucene 3.0
SECOND EDITION
LUCENE IN ACTION
www.it-ebooks.info
Praise for the First Edition
This is definitely the book to have if you’re planning on using Lucene in your application, or are
interested in what Lucene can do for you.
—JavaLobby
Search powers the information age. This book is a gateway to this invaluable resource…It
succeeds admirably in elucidating the application programming interface (API), with many code
examples and cogent explanations, opening the door to a fine tool.
—Computing Reviews
A must-read for anyone who wants to learn about Lucene or is even considering embedding
search into their applications or just wants to learn about information retrieval in general.
Highly recommended!
—TheServerSide.com
Well thought-out…thoroughly edited…stands out clearly from the crowd…I enjoyed reading this
book. If you have any text-searching needs, this book will be more than sufficient equipment to
guide you to successful completion. Even, if you are just looking to download a pre-written search
engine, then this book will provide a good background to the nature of information retrieval in
general and text indexing and searching specifically.
—Slashdot.org
The book is more like a crystal ball than ink on paper…I run into solutions to my most pressing
problems as I read through it.
—Arman Anwar, Arman@Web
Provides a detailed blueprint for using and customizing Lucene…a thorough introduction to the
inner workings of what’s arguably the most popular open source search engine…loaded with code
examples and emphasizes a hands-on approach to learning.
—SearchEngineWatch.com
Hatcher and Gospodnetić bring their experience as two of Lucene’s core committers to author this
excellently written book. This book helps any developer not familiar with Lucene or development
of a search engine to get up to speed within minutes on the project and domain…I would
recommend this book to anyone who is new to Lucene, anyone who needs powerful indexing and
searching capabilities in their application, or anyone who needs a great reference for Lucene.
—Fort Worth Java Users Group
Licensed to theresa smith <anhvienls@gmail.com>
More Praise for the First Edition
Outstanding…comprehensive and up-to-date…grab this book and learn how to leverage
Lucene’s potential.
—Val’s blog
…the code examples are useful and reusable.
—Scott Ganyo, Lucene Java Committer
…packed with examples and advice on how to effectively use this incredibly powerful tool.
—Brian Goetz, Quiotix Corporation
…it unlocked for me the amazing power of Lucene.
—Reece Wilton, Walt Disney Internet Group
…code samples as JUnit test cases are incredibly helpful.
—Norman Richards, co-author XDoclet in Action
A quick and easy guide to making Lucene work.
—Books-On-Line
A comprehensive guide…The authors of this book are experts in this field…they have unleashed
the power of Lucene…the best guide to Lucene available so far.
—JavaReference.com
Lucene in Action
Second Edition
MICHAEL MCCANDLESS
ERIK HATCHER
OTIS GOSPODNETIĆ
MANNING
Greenwich
(74° w. long.)
For online information and ordering of this and other Manning books, please visit
www.manning.com. The publisher offers discounts on this book when ordered in quantity.
For more information, please contact
Special Sales Department
Manning Publications Co.
180 Broad St.
Suite 1323
Stamford, CT 06901
Email: orders@manning.com
©2010 by Manning Publications Co. All rights reserved.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in
any form or by means electronic, mechanical, photocopying, or otherwise, without prior written
permission of the publisher.
Many of the designations used by manufacturers and sellers to distinguish their products are
claimed as trademarks. Where those designations appear in the book, and Manning
Publications was aware of a trademark claim, the designations have been printed in initial caps
or all caps.
Recognizing the importance of preserving what has been written, it is Manning’s policy to have
the books we publish printed on acid-free paper, and we exert our best efforts to that end.
Recognizing also our responsibility to conserve the resources of our planet, Manning books
are printed on paper that is at least 15 percent recycled and processed without the use of
elemental chlorine.
Manning Publications Co.
180 Broad St.
Suite 1323
Stamford, CT 06901

Development editor: Sebastian Stirling
Copyeditor: Liz Welch
Typesetter: Dottie Marsico
Cover designer: Marija Tudor
ISBN 978-1-933988-17-7
Printed in the United States of America
1 2 3 4 5 6 7 8 9 10 – MAL – 15 14 13 12 11 10
brief contents

PART 1 CORE LUCENE
1 ■ Meet Lucene
2 ■ Building a search index
3 ■ Adding search to your application
4 ■ Lucene’s analysis process
5 ■ Advanced search techniques
6 ■ Extending search

PART 2 APPLIED LUCENE
7 ■ Extracting text with Tika
8 ■ Essential Lucene extensions
9 ■ Further Lucene extensions
10 ■ Using Lucene from other programming languages
11 ■ Lucene administration and performance tuning

PART 3 CASE STUDIES
12 ■ Case study 1: Krugle
13 ■ Case study 2: SIREn
14 ■ Case study 3: LinkedIn
contents
foreword
preface
preface to the first edition
acknowledgments
about this book
JUnit primer
about the authors

PART 1 CORE LUCENE

1 Meet Lucene
1.1 Dealing with information explosion
1.2 What is Lucene?
    What Lucene can do ■ History of Lucene
1.3 Lucene and the components of a search application
    Components for indexing ■ Components for searching ■ The rest of the search application ■ Where Lucene fits into your application
1.4 Lucene in action: a sample application
    Creating an index ■ Searching an index
1.5 Understanding the core indexing classes
    IndexWriter ■ Directory ■ Analyzer ■ Document ■ Field
1.6 Understanding the core searching classes
    IndexSearcher ■ Term ■ Query ■ TermQuery ■ TopDocs
1.7 Summary

2 Building a search index
2.1 How Lucene models content
    Documents and fields ■ Flexible schema ■ Denormalization
2.2 Understanding the indexing process
    Extracting text and creating the document ■ Analysis ■ Adding to the index
2.3 Basic index operations
    Adding documents to an index ■ Deleting documents from an index ■ Updating documents in the index
2.4 Field options
    Field options for indexing ■ Field options for storing fields ■ Field options for term vectors ■ Reader, TokenStream, and byte[] field values ■ Field option combinations ■ Field options for sorting ■ Multivalued fields
2.5 Boosting documents and fields
    Boosting documents ■ Boosting fields ■ Norms
2.6 Indexing numbers, dates, and times
    Indexing numbers ■ Indexing dates and times
2.7 Field truncation
2.8 Near-real-time search
2.9 Optimizing an index
2.10 Other directory implementations
2.11 Concurrency, thread safety, and locking issues
    Thread and multi-JVM safety ■ Accessing an index over a remote file system ■ Index locking
2.12 Debugging indexing
2.13 Advanced indexing concepts
    Deleting documents with IndexReader ■ Reclaiming disk space used by deleted documents ■ Buffering and flushing ■ Index commits ■ ACID transactions and index consistency ■ Merging
2.14 Summary
3 Adding search to your application
3.1 Implementing a simple search feature
    Searching for a specific term ■ Parsing a user-entered query expression: QueryParser
3.2 Using IndexSearcher
    Creating an IndexSearcher ■ Performing searches ■ Working with TopDocs ■ Paging through results ■ Near-real-time search
3.3 Understanding Lucene scoring
    How Lucene scores ■ Using explain() to understand hit scoring
3.4 Lucene’s diverse queries
    Searching by term: TermQuery ■ Searching within a term range: TermRangeQuery ■ Searching within a numeric range: NumericRangeQuery ■ Searching on a string: PrefixQuery ■ Combining queries: BooleanQuery ■ Searching by phrase: PhraseQuery ■ Searching by wildcard: WildcardQuery ■ Searching for similar terms: FuzzyQuery ■ Matching all documents: MatchAllDocsQuery
3.5 Parsing query expressions: QueryParser
    Query.toString ■ TermQuery ■ Term range searches ■ Numeric and date range searches ■ Prefix and wildcard queries ■ Boolean operators ■ Phrase queries ■ Fuzzy queries ■ MatchAllDocsQuery ■ Grouping ■ Field selection ■ Setting the boost for a subquery ■ To QueryParse or not to QueryParse?
3.6 Summary

4 Lucene’s analysis process
4.1 Using analyzers
    Indexing analysis ■ QueryParser analysis ■ Parsing vs. analysis: when an analyzer isn’t appropriate
4.2 What’s inside an analyzer?
    What’s in a token? ■ TokenStream uncensored ■ Visualizing analyzers ■ TokenFilter order can be significant
4.3 Using the built-in analyzers
    StopAnalyzer ■ StandardAnalyzer ■ Which core analyzer should you use?
[...] ... documents
■ Indexing with NumericField and performing fast numeric range querying with NumericRangeQuery
■ Updating and deleting documents using IndexWriter
■ Working with IndexWriter’s new transactional semantics (commit, rollback)
■ Improving search concurrency with read-only IndexReaders and NIOFSDirectory
■ Enabling pure Boolean searching
■ Adding payloads to your index and using them with BoostingTermQuery
■ Using IndexReader.reopen ... existing one
■ Understanding resource usage, like memory, disk, and file descriptors
■ Using Function queries
■ Tuning for performance metrics like indexing and searching throughput
■ Making a hot backup of your index without pausing indexing
■ Using new ports of Lucene to other programming languages
■ Measuring performance using the “benchmark” contrib package
■ Understanding the new reusable TokenStream API
■ Using ...

... the door for exploring the rest of Lucene’s capabilities.

Chapter 2 familiarizes you with Lucene’s indexing operations. We describe the various field types and techniques for indexing numbers and dates. Tuning the indexing process, optimizing an index, using near real-time search, and handling thread-safety are covered.

Chapter 3 takes you through basic searching, including details of how Lucene ranks documents ...

... Extracting text programmatically
    Indexing a Lucene document ■ The Tika utility class ■ Customizing parser selection
7.6 Tika’s limitations
7.7 Indexing custom XML
    Parsing using SAX ■ Parsing and indexing using Apache Commons Digester
7.8 Alternatives
7.9 Summary

8 Essential Lucene extensions
8.1 Luke, the Lucene Index Toolbox
    Overview: seeing the big ...
... ever, in part thanks to the first edition of Lucene in Action making it easier for more people to get started with Lucene. With every new release Lucene is getting better, more mature, more feature-rich, and faster. Since the first edition of Lucene in Action was published in 2004, Lucene internals and its API have gone through radical changes that called for more than just minor book updates. In this ...

... to open-source Lucene made a huge impact in my life, and to Michael McCandless for the amazing effort he has been putting into both Lucene in Action, Second Edition and Lucene. I think Mike actually has a few clones of him working 24/7 in his basement. No wonder I haven’t met him in person yet!

about this book

Lucene in Action, Second ...

... edition of Lucene in Action was published. I already had experience building search engines, but didn’t know much about Lucene in particular. So, I picked up a copy of Lucene in Action by Erik and Otis and read it, cover to cover, and I was hooked! As I used Lucene, I found small improvements here and there, so I started contributing small patches, updating javadocs, discussing topics on Lucene’s mailing lists, ...

... resulted in an offer from Manning to co-author Lucene in Action with Erik Hatcher. Lucene in Action is the most comprehensive source of information about Lucene. The information contained in the chapters encompasses all the knowledge you need to create sophisticated applications built on top of Lucene. It’s the result of a very smooth and agile collaboration process, much like that within the Lucene community ...
    ... Sorting by a field ■ Reversing sort order ■ Sorting by multiple fields ■ Selecting a sorting field type ■ Using a nondefault locale for sorting
5.3 Using MultiPhraseQuery
5.4 Querying on multiple fields at once
5.5 Span queries
    Building block of spanning, SpanTermQuery ■ Finding spans at the beginning of a field ■ Spans near one another ■ Excluding span ...

... it, point your web browser to http://www.manning.com/LuceneinActionSecondEdition. This page provides information on how to get on the forum once you are registered, what kind of help is available, and the rules of conduct on the forum.

About the title

By combining introductions, overviews, and how-to examples, the In Action books are designed to help learning and remembering. According to research in cognitive ...