
Lucene in Action pptx


DOCUMENT INFORMATION

Basic information

Format
Pages: 528
File size: 17.67 MB

Contents

MANNING
Lucene in Action, Second Edition
Michael McCandless, Erik Hatcher, Otis Gospodnetić
Foreword by Doug Cutting
Covers Apache Lucene 3.0

www.it-ebooks.info

Praise for the First Edition

This is definitely the book to have if you’re planning on using Lucene in your application, or are interested in what Lucene can do for you.
-- JavaLobby

Search powers the information age. This book is a gateway to this invaluable resource... It succeeds admirably in elucidating the application programming interface (API), with many code examples and cogent explanations, opening the door to a fine tool.
-- Computing Reviews

A must-read for anyone who wants to learn about Lucene or is even considering embedding search into their applications or just wants to learn about information retrieval in general. Highly recommended!
-- TheServerSide.com

Well thought-out... thoroughly edited... stands out clearly from the crowd... I enjoyed reading this book. If you have any text-searching needs, this book will be more than sufficient equipment to guide you to successful completion. Even, if you are just looking to download a pre-written search engine, then this book will provide a good background to the nature of information retrieval in general and text indexing and searching specifically.
-- Slashdot.org

The book is more like a crystal ball than ink on paper. I run into solutions to my most pressing problems as I read through it.
-- Arman Anwar, Arman@Web

Provides a detailed blueprint for using and customizing Lucene... a thorough introduction to the inner workings of what’s arguably the most popular open source search engine... loaded with code examples and emphasizes a hands-on approach to learning.
-- SearchEngineWatch.com

Hatcher and Gospodnetić bring their experience as two of Lucene’s core committers to author this excellently written book.
This book helps any developer not familiar with Lucene or development of a search engine to get up to speed within minutes on the project and domain... I would recommend this book to anyone who is new to Lucene, anyone who needs powerful indexing and searching capabilities in their application, or anyone who needs a great reference for Lucene.
-- Fort Worth Java Users Group

Licensed to theresa smith <anhvienls@gmail.com> www.it-ebooks.info

More Praise for the First Edition

Outstanding... comprehensive and up-to-date... grab this book and learn how to leverage Lucene’s potential.
-- Val’s blog

...the code examples are useful and reusable.
-- Scott Ganyo, Lucene Java Committer

...packed with examples and advice on how to effectively use this incredibly powerful tool.
-- Brian Goetz, Quiotix Corporation

...it unlocked for me the amazing power of Lucene.
-- Reece Wilton, Walt Disney Internet Group

...code samples as JUnit test cases are incredibly helpful.
-- Norman Richards, co-author XDoclet in Action

A quick and easy guide to making Lucene work.
-- Books-On-Line

A comprehensive guide... The authors of this book are experts in this field... they have unleashed the power of Lucene... the best guide to Lucene available so far.
-- JavaReference.com

Lucene in Action
Second Edition

MICHAEL MCCANDLESS
ERIK HATCHER
OTIS GOSPODNETIĆ

MANNING
Greenwich (74° w. long.)

For online information and ordering of this and other Manning books, please visit www.manning.com. The publisher offers discounts on this book when ordered in quantity. For more information, please contact

Special Sales Department
Manning Publications Co.
180 Broad St.
Suite 1323
Stamford, CT 06901
Email: orders@manning.com

©2010 by Manning Publications Co. All rights reserved.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in the book, and Manning Publications was aware of a trademark claim, the designations have been printed in initial caps or all caps.

Recognizing the importance of preserving what has been written, it is Manning’s policy to have the books we publish printed on acid-free paper, and we exert our best efforts to that end. Recognizing also our responsibility to conserve the resources of our planet, Manning books are printed on paper that is at least 15 percent recycled and processed without the use of elemental chlorine.

Manning Publications Co.
180 Broad St.
Suite 1323
Stamford, CT 06901

Development editor: Sebastian Stirling
Copyeditor: Liz Welch
Typesetter: Dottie Marsico
Cover designer: Marija Tudor

ISBN 978-1-933988-17-7
Printed in the United States of America
1 2 3 4 5 6 7 8 9 10 - MAL - 15 14 13 12 11 10

brief contents

PART 1 CORE LUCENE
1 ■ Meet Lucene
2 ■ Building a search index
3 ■ Adding search to your application
4 ■ Lucene’s analysis process
5 ■ Advanced search techniques
6 ■ Extending search

PART 2 APPLIED LUCENE
7 ■ Extracting text with Tika
8 ■ Essential Lucene extensions
9 ■ Further Lucene extensions
10 ■ Using Lucene from other programming languages
11 ■ Lucene administration and performance tuning

PART 3 CASE STUDIES
12 ■ Case study 1: Krugle
13 ■ Case study 2: SIREn
14 ■ Case study 3: LinkedIn

contents

foreword
preface
preface to the first edition
acknowledgments
about this book
JUnit primer
about the authors

PART 1 CORE LUCENE

1 Meet Lucene
1.1 Dealing with information explosion
1.2 What is Lucene?
    What Lucene can do ■ History of Lucene
1.3 Lucene and the components of a search application
    Components for indexing ■ Components for searching ■ The rest of the search application ■ Where Lucene fits into your application
1.4 Lucene in action: a sample application
    Creating an index ■ Searching an index
1.5 Understanding the core indexing classes
    IndexWriter ■ Directory ■ Analyzer ■ Document ■ Field
1.6 Understanding the core searching classes
    IndexSearcher ■ Term ■ Query ■ TermQuery ■ TopDocs
1.7 Summary

2 Building a search index
2.1 How Lucene models content
    Documents and fields ■ Flexible schema ■ Denormalization
2.2 Understanding the indexing process
    Extracting text and creating the document ■ Analysis ■ Adding to the index
2.3 Basic index operations
    Adding documents to an index ■ Deleting documents from an index ■ Updating documents in the index
2.4 Field options
    Field options for indexing ■ Field options for storing fields ■ Field options for term vectors ■ Reader, TokenStream, and byte[] field values ■ Field option combinations ■ Field options for sorting ■ Multivalued fields
2.5 Boosting documents and fields
    Boosting documents ■ Boosting fields ■ Norms
2.6 Indexing numbers, dates, and times
    Indexing numbers ■ Indexing dates and times
2.7 Field truncation
2.8 Near-real-time search
2.9 Optimizing an index
2.10 Other directory implementations
2.11 Concurrency, thread safety, and locking issues
    Thread and multi-JVM safety ■ Accessing an index over a remote file system ■ Index locking
2.12 Debugging indexing
2.13 Advanced indexing concepts
    Deleting documents with IndexReader ■ Reclaiming disk space used by deleted documents ■ Buffering and flushing ■ Index commits ■ ACID transactions and index consistency ■ Merging
2.14 Summary

3 Adding search to your application
3.1 Implementing a simple search feature
    Searching for a specific term ■ Parsing a user-entered query expression: QueryParser
3.2 Using IndexSearcher
    Creating an IndexSearcher ■ Performing searches ■ Working with TopDocs ■ Paging through results ■ Near-real-time search
3.3 Understanding Lucene scoring
    How Lucene scores ■ Using explain() to understand hit scoring
3.4 Lucene’s diverse queries
    Searching by term: TermQuery ■ Searching within a term range: TermRangeQuery ■ Searching within a numeric range: NumericRangeQuery ■ Searching on a string: PrefixQuery ■ Combining queries: BooleanQuery ■ Searching by phrase: PhraseQuery ■ Searching by wildcard: WildcardQuery ■ Searching for similar terms: FuzzyQuery ■ Matching all documents: MatchAllDocsQuery
3.5 Parsing query expressions: QueryParser
    Query.toString ■ TermQuery ■ Term range searches ■ Numeric and date range searches ■ Prefix and wildcard queries ■ Boolean operators ■ Phrase queries ■ Fuzzy queries ■ MatchAllDocsQuery ■ Grouping ■ Field selection ■ Setting the boost for a subquery ■ To QueryParse or not to QueryParse?
3.6 Summary

4 Lucene’s analysis process
4.1 Using analyzers
    Indexing analysis ■ QueryParser analysis ■ Parsing vs. analysis: when an analyzer isn’t appropriate
4.2 What’s inside an analyzer?
    What’s in a token? ■ TokenStream uncensored ■ Visualizing analyzers ■ TokenFilter order can be significant
4.3 Using the built-in analyzers
    StopAnalyzer ■ StandardAnalyzer ■ Which core analyzer should you use?

[...]

■ ... documents
■ Indexing with NumericField and performing fast numeric range querying with NumericRangeQuery
■ Updating and deleting documents using IndexWriter
■ Working with IndexWriter’s new transactional semantics (commit, rollback)
■ Improving search concurrency with read-only IndexReaders and NIOFSDirectory
■ Enabling pure Boolean searching
■ Adding payloads to your index and using them with BoostingTermQuery
■ Using IndexReader.reopen... existing one
■ Understanding resource usage, like memory, disk, and file descriptors
■ Using Function queries
■ Tuning for performance metrics like indexing and searching throughput
■ Making a hot backup of your index without pausing indexing
■ Using new ports of Lucene to other programming languages
■ Measuring performance using the “benchmark” contrib package
■ Understanding the new reusable TokenStream API
■ Using...

... the door for exploring the rest of Lucene’s capabilities. Chapter 2 familiarizes you with Lucene’s indexing operations. We describe the various field types and techniques for indexing numbers and dates. Tuning the indexing process, optimizing an index, using near real-time search and handling thread-safety are covered. Chapter 3 takes you through basic searching, including details of how Lucene ranks documents...

... Extracting text programmatically
    Indexing a Lucene document ■ The Tika utility class ■ Customizing parser selection
7.6 Tika’s limitations
7.7 Indexing custom XML
    Parsing using SAX ■ Parsing and indexing using Apache Commons Digester
7.8 Alternatives
7.9 Summary

8 Essential Lucene extensions
8.1 Luke, the Lucene Index Toolbox
    Overview: seeing the big...
... ever, in part thanks to the first edition of Lucene in Action making it easier for more people to get started with Lucene. With every new release Lucene is getting better, more mature, more feature-rich, and faster. Since the first edition of Lucene in Action was published in 2004, Lucene internals and its API have gone through radical changes that called for more than just minor book updates. In this...

... to open-source Lucene made a huge impact in my life, and to Michael McCandless for the amazing effort he has been putting into both Lucene in Action, Second Edition and Lucene. I think Mike actually has a few clones of him working 24/7 in his basement. No wonder I haven’t met him in person yet!

about this book

Lucene in Action, Second...

... edition of Lucene in Action was published. I already had experience building search engines, but didn’t know much about Lucene in particular. So, I picked up a copy of Lucene in Action by Erik and Otis and read it, cover to cover, and I was hooked! As I used Lucene, I found small improvements here and there, so I started contributing small patches, updating javadocs, discussing topics on Lucene’s mailing lists,...

... resulted in an offer from Manning to co-author Lucene in Action with Erik Hatcher. Lucene in Action is the most comprehensive source of information about Lucene. The information contained in the chapters encompasses all the knowledge you need to create sophisticated applications built on top of Lucene. It’s the result of a very smooth and agile collaboration process, much like that within the Lucene community...
...
    Sorting by a field ■ Reversing sort order ■ Sorting by multiple fields ■ Selecting a sorting field type ■ Using a nondefault locale for sorting
5.3 Using MultiPhraseQuery
5.4 Querying on multiple fields at once
5.5 Span queries
    Building block of spanning, SpanTermQuery ■ Finding spans at the beginning of a field ■ Spans near one another ■ Excluding span...

... it, point your web browser to http://www.manning.com/LuceneinActionSecondEdition. This page provides information on how to get on the forum once you are registered, what kind of help is available, and the rules of conduct on the forum.

About the title

By combining introductions, overviews, and how-to examples, the In Action books are designed to help learning and remembering. According to research in cognitive...

Date posted: 08/03/2014, 19:20
