Solr in Action

DOCUMENT INFORMATION

Trey Grainger
Timothy Potter
FOREWORD BY Yonik Seeley
MANNING

Solr in Action
TREY GRAINGER
TIMOTHY POTTER
MANNING
SHELTER ISLAND

For online information and ordering of this and other Manning books, please visit www.manning.com. The publisher offers discounts on this book when ordered in quantity. For more information, please contact:

Special Sales Department
Manning Publications Co.
20 Baldwin Road
PO Box 261
Shelter Island, NY 11964
Email: orders@manning.com

©2014 by Manning Publications Co. All rights reserved.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.

Photographs in this book were created by Martin Evans and Jordan Hochenbaum, unless otherwise noted. Illustrations were created by Martin Evans, Joshua Noble, and Jordan Hochenbaum. Fritzing (fritzing.org) was used to create some of the circuit diagrams.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in the book, and Manning Publications was aware of a trademark claim, the designations have been printed in initial caps or all caps.

Recognizing the importance of preserving what has been written, it is Manning’s policy to have the books we publish printed on acid-free paper, and we exert our best efforts to that end. Recognizing also our responsibility to conserve the resources of our planet, Manning books are printed on paper that is at least 15 percent recycled and processed without the use of elemental chlorine.

Manning Publications Co.
20 Baldwin Road
PO Box 261
Shelter Island, NY 11964

Development editors: Elizabeth Lexleigh, Susan Conant
Copyeditor: Melinda Rankin
Proofreader: Elizabeth Martin
Typesetter: Dennis Dalinnik
Cover designer: Marija Tudor

ISBN: 9781617291029
Printed in the United States of America
10 – MAL – 19 18 17 16 15 14

brief contents

PART 1 MEET SOLR
1 ■ Introduction to Solr
2 ■ Getting to know Solr
3 ■ Key Solr concepts
4 ■ Configuring Solr
5 ■ Indexing
6 ■ Text analysis

PART 2 CORE SOLR CAPABILITIES
7 ■ Performing queries and handling results
8 ■ Faceted search
9 ■ Hit highlighting
10 ■ Query suggestions
11 ■ Result grouping/field collapsing
12 ■ Taking Solr to production

PART 3 TAKING SOLR TO THE NEXT LEVEL
13 ■ SolrCloud
14 ■ Multilingual search
15 ■ Complex query operations
16 ■ Mastering relevancy

contents

foreword
preface
acknowledgments
about this book

PART 1 MEET SOLR

1 Introduction to Solr
1.1 Why do I need a search engine?
    Managing text-centric data ■ Common search-engine use cases
1.2 What is Solr?
    Information retrieval engine ■ Flexible schema management ■ Java web application ■ Multiple indexes in one server ■ Extendable (plugins) ■ Scalable ■ Fault-tolerant
1.3 Why Solr?
    Solr for the software architect ■ Solr for the system administrator ■ Solr for the CEO
1.4 Features overview
    User-experience features ■ Data-modeling features ■ New features in Solr
1.5 Summary

2 Getting to know Solr
2.1 Getting started
    Installing Solr ■ Starting the Solr example server ■ Understanding Solr home ■ Indexing the example documents
2.2 Searching is what it’s all about
    Exploring Solr’s query form ■ What comes back from Solr when you search ■ Ranked retrieval ■ Paging and sorting ■ Expanded search features
2.3 Tour of the Solr administration console
2.4 Adapting the example to your needs
2.5 Summary

3 Key Solr concepts
3.1 Searching, matching, and finding content
    What is a document? ■ The fundamental search problem ■ The inverted index ■ Terms, phrases, and Boolean logic ■ Finding sets of documents ■ Phrase queries and term positions ■ Fuzzy matching ■ Quick recap
3.2 Relevancy
    Default similarity ■ Term frequency ■ Inverse document frequency ■ Boosting ■ Normalization factors
3.3 Precision and Recall
    Precision ■ Recall ■ Striking the right balance
3.4 Searching at scale
    The denormalized document ■ Distributed searching ■ Clusters vs servers ■ The limits of Solr
3.5 Summary

4 Configuring Solr
4.1 Overview of solrconfig.xml
    Common XML data-structure and type elements ■ Applying configuration changes ■ Miscellaneous settings
4.2 Query request handling
    Request-handling overview ■ Search handler ■ Browse request handler for Solritas: an example ■ Extending query processing with search components
4.3 Managing searchers
    New searcher overview ■ Warming a new searcher
4.4 Cache management
    Cache fundamentals ■ Filter cache ■ Query result cache ■ Document cache ■ Field value cache
4.5 Remaining configuration options
4.6 Summary

5 Indexing
5.1 Example microblog search application
    Representing content for searching ■ Overview of the Solr indexing process
5.2 Designing your schema
    Document granularity ■ Unique key ■ Indexed fields ■ Stored fields ■ Preview of schema.xml
5.3 Defining fields in schema.xml
    Required field attributes ■ Multivalued fields ■ Dynamic fields ■ Copy fields ■ Unique key field
5.4 Field types for structured nontext fields
    String fields ■ Date fields ■ Numeric fields ■ Advanced field type attributes
5.5 Sending documents to Solr for indexing
    Indexing documents using XML or JSON ■ Using the SolrJ client library to add documents from Java ■ Other tools for importing documents into Solr
5.6 Update handler
    Committing documents to the index ■ Transaction log ■ Atomic updates
5.7 Index management
    Index storage ■ Segment merging
5.8 Summary

6 Text analysis
6.1 Analyzing microblog text
6.2 Basic text analysis
    Analyzer ■ Tokenizer ■ Token filter ■ StandardTokenizer ■ Removing stop words with StopFilterFactory ■ LowerCaseFilterFactory: lowercase letters in terms ■ Testing your analysis with Solr’s analysis form
6.3 Defining a custom field type for microblog text
    Collapsing repeated letters with PatternReplaceCharFilterFactory ■ Preserving hashtags, mentions, and hyphenated terms ■ Removing diacritical marks using ASCIIFoldingFilterFactory ■ Stemming with KStemFilterFactory ■ Injecting synonyms at query time with SynonymFilterFactory ■ Putting it all together
6.4 Advanced text analysis
    Advanced field attributes ■ Per-language text analysis ■ Extending text analysis using a Solr plug-in
6.5 Summary

PART 2 CORE SOLR CAPABILITIES

7 Performing queries and handling results
7.1 The anatomy of a Solr request
    Request handlers ■ Search components ■ Query parsers
7.2 Working with query parsers
    Specifying a query parser ■ Local params
7.3 Queries and filters
    The fq and q parameters ■ Handling expensive filters
7.4 The default query parser (Lucene query parser)
    Lucene query parser syntax
7.5 Handling user queries (eDisMax query parser)
    eDisMax query parser overview ■ eDisMax query parameters ■ Searching across multiple fields ■ Boosting queries and phrases ■ Field aliasing ■ User-accessible fields ■ Minimum match ■ eDisMax benefits and drawbacks
7.6 Other useful query parsers
    Field query parser ■ Term and Raw query parsers ■ Function and Function Range query parsers ■ Nested queries and the Nested query parser ■ Boost query parser ■ Prefix query parser ■ Spatial query parsers ■ Join query parser ■ Switch query parser ■ Surround query parser ■ Max Score query parser ■ Collapsing query parser
7.7 Returning results
    Choosing a response format ■ Choosing fields to return ■ Paging through results
7.8 Sorting results
    Sorting by fields ■ Sorting by functions ■ Fuzzy sorting
7.9 Debugging query results
    Returning debug information
7.10 Summary

8 Faceted search
8.1 Navigating your content at a glance
8.2 Setting up test data
8.3 Field faceting
8.4 Query faceting
8.5 Range faceting
8.6 Filtering upon faceted values
    Applying filters to your facets ■ Safely filtering on faceted values
8.7 Multiselect faceting, keys, and tags
    Keys ■ Tags, excludes, and multiselect faceting
8.8 Beyond the basics
8.9 Summary

9 Hit highlighting
9.1 Overview of hit highlighting
9.2 How highlighting works
    Set up a new Solr core for UFO sightings ■ Preprocess UFO sightings before indexing ■ Exploring the UFO sightings dataset ■ Hit highlighting out of the box ■ Nuts and bolts ■ Refining highlighter results

INDEX

indexes (continued) indexed fields 123 stored fields 123–124 unique key 122–123 schema.xml section copy fields 131–133 dynamic fields 128–131 multivalued fields 127–128 required attributes 126–127 unique key field 133 segment merging defined 158–159 deletions and 160–161 elements in solrconfig.xml 159–160 overview 364–365 pros and cons 159 sending documents to Solr DIH 146 ExtractingRequestHandler 146–147 Nutch 147 using JSON 141–144 using SolrJ library 144–146 using XML 141–144 separating from searching 376 separating per language 470–473 splitting 382–383 storage choosing directory 156–158 default configuration 156 throughput for sharding 373 update handler atomic updates 152–155 autocommit 149–150 normal commit 149 overview 147–148
soft commit 149 transaction log 151–152 indexlog utility 410 IndicNormalizationFilterFactory 607 Indonesian language 608 IndonesianStemFilterFactory 608 information discovery use case 8–9 information retrieval See IR installing Solr 27–28 instanceDir parameter 380 element 87 Integrated Development Environment See IDE IntelliJ IDEA 599 interacting with Solr REST API 388 Solr client libraries 388–389 SolrJ adding documents 144–146 adding to project 389 connecting to server with 389–390 embedding Solr within application 391–392 interacting with Solr 390–391 versioning and serialization 392 internationalization See multilingual search Intersects operation 533 invalidating cached objects 108 invariants section 205 inverse document frequency See idf inverted index 11 ordering of terms 54 overview 53–54 IR (information retrieval) 11 Irish language 608 IrishLowerCaseFilterFactory 462, 608 IsDisjointTo operation 533 IsWithin operation 533 Italian language 608 ItalianLightStemFilterFactory 608

J

J2EE (Java Platform, Enterprise Edition) 13 Japanese language 461, 608 JapaneseBaseFormFilterFactory 608 JapaneseKatakanaStemFilterFactory 608 JapaneseTokenizerFactory 608 JAR files 89 Java requirements 27 Solr as web application 13–15 SolrJ library 144–146 Java Platform, Enterprise Edition See J2EE Java Management Extensions See JMX Java Topology Suite See JTS Java Virtual Machine See JVM javabin 146, 148 Javascript 388 JavaScript Object Notation See JSON JBoss 13, 18, 358 JDBC 146 Jetty 13, 18, 600 advantages of using 31 deploying with 358 JIRA page for Solr 357, 596 JMX (Java Management Extensions) 89–90, 399 element 89 Join query parser 236 joins cross-core joins 544–545 cross-document joins 543–544 data-modeling features 22 as subqueries 127 JSON (JavaScript Object Notation) conforming data to schema.xml 284 importing documents using 141–144 response formats 39 update handler support 148 JSONResponseWriter class 239 JTS (Java Topology Suite) 531
JVM (Java Virtual Machine) 360–361

K

keys, multiselect faceting 275–277 keyword search use cases 7–8 KeywordMarkerFilterFactory 459 Korean analyzer chain 461 KStem algorithm 457 KStemFilter 458 KStemFilterFactory 165, 177, 182–183, 454–456

L

LangDetectLanguageIdentifierUpdateProcessor 486, 494 LangDetectLanguageIdentifierUpdateProcessorFactory 487 langid parameter 490 Language Detection Library 486 language identification dynamically assigning language analyzers 494–500 dynamically mapping content 489–494 overview 485 text analysis per 189–190 update processors for 488–489 See also multilingual search language-specific field type configurations 605 last component 313 element 98, 101 Latvian language 608 LatvianStemFilterFactory 608 LBHttpSolrServer class 390 leader vote wait period 426–427 leaderVoteWait parameter 427 leading wildcards 61 Least Recently Used See LRU lemmatization vs stemming 452–454 Levenshtein 320 LFU (Least Frequently Used) 107 LGPL (lesser general public license) 531 element 89, 610 licensing 531 LightStem filters 462 LIKE queries 51–52 limitations core autodiscovery mode 379 on custom hashing 447 on distributed queries 436 on scaling search 79–81 linear function 510 linearly scalable 412 LingoClusteringAlgorithm.desiredClusterCountBase parameter 582 linguistic analysis 451–452 element 420 LMDirichletSimilarity class 569 LMJelinekMercerSimilarity class 569 ln function 511 load balancers 384–385 load testing 400–401 loadOnStartup parameter 84, 380 local params parameter dereferencing 209–210 parameter value 209 purpose of 207–208 syntax 208–209 localization See multilingual search location, searching near bounding box filter 522–525 defining location fields 521–522 returning calculated distances 525–526 reusing parameters 527 sorting on distance 526 log function 511 LoggingHandler class 200 logs 44, 400 long queries 80 element 87 LowerCaseFilter 316 LowerCaseFilterFactory 165, 171–172, 462 LRU (Least Recently Used) 107 element 87, 96 Lucene 11,
88–89 lucene folder 598 Lucene in Action 53 Lucene query parser boosting expressions 221 character proximity 219 escaping special characters 221–222 excluding terms 219 fielded term searches 216 grouped expressions 218 optional terms 217 overview 215–216 phrase searches 217 range searches 219–220 required terms 216–217 term proximity 218 term-proximity boosts 563–564 wildcard searches 220–221 element 88 LuceneQParserPlugin class 215 lucene-solr/ folder 598 LukeRequestHandler class 198, 201

M

map function 509 MappingCharFilterFactory 178 MapReduce 12 master.replication.enabled parameter 387 masterUrl parameter 377 math functions 510–511 element 159 maxdoc function 512 maxMergeAtOnce parameter 364 maxShardsPerNode parameter 439 maxWarmingSearchers parameter 395–396 element 106–107 MBeans 89, 398–399 mean reciprocal rank metric 592 memcached 192 memory RAM 156 sorting and 246–247 mentions, preserving in text 178–182 mergeFactor parameter 364 element 159 MERGEINDEXES action 383 element 159 element 159 metadata 252 microblog search application example 117, 163–167 MinimalStem filter 462 minimum match 228–230 missing values, and sorting 246 misspelled terms 309–311 mm parameter 228 MMapDirectory 157–158 monitoring, external 399–400 More Like This feature 99, 188, 574–579, 590 MoreLikeThisHandler class 201, 574 ms function 509 MS Office documents 147 MS SQL Server 146 multicore configuration 32 multilingual search data-modeling features 23 language identification dynamically assigning language analyzers 494–500 dynamically mapping content 489–494 overview 485 update processors for 488–489 language-specific field type configurations 605 linguistic analysis 451–452 scenarios field type for multiple languages 474–485 multiple languages in one field 473–474 separate fields per language 464–470 separate indexes per language 470–473 stemming dictionary-based (Hunspell) 463–464 example 454–458 KeywordMarkerFilterFactory 459 language-specific
analyzer chains 460–463 vs lemmatization 452–454 StemmerOverrideFilterFactory 459–460 multiselect faceting defined 275 excludes 277–279 keys 275–277 multitenant search 446 MultiTextField 476, 482–483 MultiTextFieldAnalyzer 476 MultiTextFieldLanguageIdentifierUpdateProcessor 496 MultiTextFieldLanguageIdentifierUpdateProcessorFactory 496 MultiTextFieldTokenizer 477, 481–482 multiValued attribute 131 multivalued fields highlighting 298–299 result grouping on 352–353 schema.xml file 127–128 murmur hash algorithm 429 MySQL 146

N

Nagios 18, 89 Natural Language Processing See NLP natural language, search using 163 near real-time search See NRT search negated terms 55 Nested query parser 233–234 nesting function queries 502 .NET 28 Netflix 570 newSearcher event 105 n-grams 321–323 NIOFSDirectory 157 NLP (Natural Language Processing) 451 node recovery process 433–434 norm function 512 normal commit 149 normalization factors coord factor (coord) 71 field norms (norm) 69–71 query norms (queryNorm) 71 Norwegian language 608 NorwegianLightStemFilterFactory 608 NoSQL (Not only SQL) 3, 75, 546 not function 515 NOT operator 55, 219 NRT (near real-time) search distributed indexing 432–433 soft commit 149 Solr features 23 NRTCachingDirectory 157 NRTCachingDirectoryFactory class 157 numdocs function 512 numeric fields overview 137–138 precisionStep attribute 138–141 numShards parameter 407, 411, 439 Nutch 147

O

offsite backup for SolrCloud 445–446 omitNorms attribute 135, 187–188, 556 op parameter 56 OpenOffice documents 147 element 150 Optimize request, update handler 148 optional terms 55, 217 optimistic concurrency control 154–155 OR operator 55, 217 Oracle AS 13 ord function 509 outage types 413 OutOfMemoryError 422

P

paging default size 40 overview 40 result grouping 347–348 results 243–245 user experience 20 parameters dereferencing 209–210 local params 209 parameter substitutions 502 element 97 parseArg() method 519 parseFloat() method 519
parseValueSource() method 519 patches contributing 602–604 downloading and applying 601–602 PatternReplaceCharFilterFactory 166, 177–178 payload boosting 559–560 PDF documents importing common formats 22 indexing 147 peer sync 433 perception of relevancy 550 performance cache performance 397–398 external monitoring 399–400 hit highlighting 300–302 load testing 400–401 pulling stats from request handlers and MBeans 398–399 query and update request statistics 394–395 searchers 102–104 Solr Core statistics 395–397 Solr logs 400 warming new searcher first searcher 106 element 106–107 element 106 warming queries 105–106 permissions, document Persian language 461, 608 persist parameter 380 pf (phrase fields) parameters 224 PHP 28, 388 PHPResponseWriter class 239 PHPSerializedResponseWriter class 239 phrase searches 56, 217 phrase slop parameters See ps parameters phrases highlighting 298 Lucene query parser 217 as search terms 56 pi function 511 Ping Request Handler 385 PingRequestHandler class 201 pivot faceting defined 279–280 future of 540–541 limitations 540 overview 538–540 PluginInfoHandler class 201 plugins directory 194 pollInterval parameter 377 polygons 531 popular queries boosting recent popularity 326–329 finding most popular 325–326 popularity field 565 port for SolrCloud 426 PorterStemFilter 458 PorterStemFilterFactory 454, 456 ports, changing 30 Portuguese language 609 PortugueseLightStemFilterFactory 609 PositionFilterFactory 606 positionIncrementGap attribute 138, 294, 299 post filtering 214–215 POST method 90 PostFilter interface 215 Postgres 146 PostingsHighlighter 302–305 post.jar file 34 pow function 511 Precision balancing with Recall 73–74 example of calculations 593 graphing versus Recall 592–595 overview 72 precisionStep attribute 138–141 Prefix query parser 235 preserveOriginal attribute 181 product function 511 production systems cluster management generic vs customized configuration 385–388 load balancers 384–385 cores
creating through Core Admin API 379–380 defining 378–379 getting status of 383–384 reloading 380 renaming and swapping 381 splitting and merging indexes 382–383 unloading and deleting 381–382 creating distribution 357 data acquisition strategies batching documents 367–368 Data Import Handler 368–370 extracting text from files with Solr Cell 370–371 deploying building solr.war file 358 deploying with Jetty 358 embedded Solr 359 overview 357–358 interacting with Solr REST API 388 Solr client libraries 388–389 SolrJ 389–392 performance cache performance 397–398 external monitoring 399–400 load testing 400–401 pulling stats from request handlers and MBeans 398–399 query and update request statistics 394–395 Solr Core statistics 395–397 Solr logs 400 replication combining with sharding 377–378 fault tolerance 376 overview 374 separating indexing from searching 376 setting up 376–377 simple scenario 375 server configuration autowarming OS filesystem cache 365–366 garbage collection for Solr 361 increasing available file descriptors 366 incremental indexing 361–362 index flipping and cache warming 362–363 JVM settings 360–361 RAM 359–360 segment merging 364–365 SSDs 359–360 sharding document size 373 expected growth 374 overview 371–372 query complexity 374 required indexing throughput 373 testing in development 374 total number of documents 372–373 upgrading version 401–402 PropertiesRequestHandler class 201 proximity searches 63–65, 218–219 ps (phrase slop) parameters 224–225 pseudo-field 507, 525 pt parameter 523 Python 28, 39, 388 PythonResponseWriter class 239

Q

q parameter boosting 327 caching 211 execution speed 211 order of execution 211–213 overview 210–211 query form 36 relevancy impact 211 specifying multiple 211 QParserPlugin classes 206 qs (query phrase slop) parameter 225 queries big data analytics 546–547 component overview 99 distributed queries client sends query 434 get fields stage 435–436 limitations on 436 overview 410–411
process overview 434 query controller receives request 435 query stage 435 external data ExternalFileField 542–543 overview 541 functions Boolean functions 515 custom 515–521 data transformation functions 509 distance functions 513–515 frange filter 506 math functions 510–511 overview 502 relevancy functions 511–513 returning as field 507–508 searching on 504–506 sorting on 508–509 syntax for function queries 502–504 geospatial search circle 531 faceting on distance 535–538 grid-based location searching 529–530 overview 521 point 530 querying for shapes 532–534 rectangle 531 searching near point 521–527 shapes 531–532 sorting on distance with geofilt 534–535 sorting on distance with SpatialRecursivePrefixTreeFieldType 534 handling expensive filters caching 213–214 order of execution 214–215 overview 213 post filtering 214–215 joins cross-core joins 544–545 cross-document joins 543–544 pivot faceting future of 540–541 limitations 540 overview 538–540 q/fq parameters caching 211 execution speed 211 order of execution 211–213 overview 210–211 relevancy impact 211 specifying multiple 211 result grouping by 345–346 sharding, and complexity of 374 statistics 394–395 suggestions based on user activity boosting recent popularity 326–329 find most popular query 325–326 overview 324 schema design for 324–325 warming 105–106 Query Elevation Component 566–567 query faceting 264–266 query form extended features 41–42 overview 34–38 purpose of 37 query function 512 query norms (queryNorm) 71 query parsers Boost query parser 234–235 eDisMax query parser bf parameter 225 bq parameter 225 field aliasing 226–227 minimum match 228–230 overview 222–223 pf parameters 224 pros and cons 230–232 ps parameters 224–225 qs parameter 225 query parameters 223 searching across multiple fields 223–224 tie parameter 225 user accessible fields 227–228 Field query parser 232 Function query parser 233 Function Range query parser 233 Join query parser 236 local params
parameter dereferencing 209–210 parameter value 209 purpose of 207–208 syntax 208–209 Lucene query parser boosting expressions 221 character proximity 219 escaping special characters 221–222 excluding terms 219 fielded term searches 216 grouped expressions 218 optional terms 217 overview 215–216 phrase searches 217 range searches 219–220 required terms 216–217 term proximity 218 wildcard searches 220–221 Nested query parser 233–234 overview 206–207 Prefix query parser 235 Raw query parser 232–233 spatial query parsers 235–236 specifying 207 Surround query parser 236–238 Switch query parser 236 Term query parser 232–233 query phrase slop parameter See qs parameter element 102 queryAnalyzerFieldType setting 315 QueryResponseWriter class 240 query-result cache element 113 element 113 element 112–113 QueryScorer 295 query-time boosting 69, 221 quorum 419

R

rad function 511 RAID (redundant array of independent disks) 414 RAM (random access memory) 156, 359–360 element 159 range faceting 266–269 range searches 62, 219–220 ranked retrieval 8, 39 ranking, influencing rare search terms 68 See also idf Raw query parser 232–233 read dominant 5–6 real-time get 23–24 Recall balancing with Precision 73–74 graphing versus Precision 592–595 overview 73 recency, boosting by 328 recip function 328, 511, 561 recommendations vs search 570–571 Recovering state 422 Recovery Failed state 422 rectangle (geospatial) 531 redundancy 413 redundant array of independent disks See RAID RegexFragmenter 295 relational databases, importing data from 22 relationships between documents relevancy boosting documents 564–567 field boosting 556–558 function boosting 505, 560–562 payload boosting 559–560 term boosting 69, 559 term-proximity boosting 562–564 using t.getBoost 69 debugging relevancy calculation 550–556 default similarity 65–66 functions for 511–513 impact of tuning 549–550 inverse document frequency (idf) 68–69 normalization factors coord
factor (coord) 71 field norms (norm) 69–71 query norms (queryNorm) 71 overview 65 personalizing search attribute-based matching 571–573 collaborative filtering 586–590 concept-based matching 579–585 geographical matching 585–586 hierarchical matching 573–574 hybrid recommendation scenarios 590–591 More Like This 574–579 overview 591–592 search vs recommendations 570–571 q/fq parameters 211 ranking by running experiments 592–595 Similarity classes 567–569 term frequency (tf) 67–68 Reload button 87 ReloadCacheRequestHandler class 200 reloading cores 380 remote debugging 601 renaming cores 381 repeated letters, collapsing 177–178 replicas adding 445 choosing number of 421–422 defined 411 replicateAfter directive 377 replication combining with sharding 377–378 fault tolerance 376 overview 374 separating indexing from searching 376 setting up 376–377 simple scenario 375 viewing replication of index 44 replicationFactor parameter 439 ReplicationHandler class 198, 201 representational state transfer See REST request handlers autosuggest 318–319 example of 94–98 overview 198–203 statistics from 398–399 suggesting field values 323–324 element 93 RequestHandlerBase class 199 requests, query debug component 100–101 facet component 99 highlight component 99 More Like This component 99 overview 90–93 query component 99 request handler example 94–98 search handler 93–94 spellcheck component 101–102 stats component 99–100 required terms 55, 216–217 Resin 31, 358 response format 38–39 REST (representational state transfer) 15, 388 restarting failed node 444 rolling restart 443 result grouping distributed result grouping 352 documents multiple per group 339–343 skipping duplicates 332–339 faceting on result groups 349–351 field collapsing vs 331 by function 343–344 on multivalued fields 352–353 paging 347–348 performance 353–355 by query 345–346 returning flat list 352 sorting 347–348 on tokenized fields 352–353 results debugging 248–249 fields to
return document transformers 241–242 dynamic values 241 return field aliases 243 stored fields 241 format for 238–240 hit highlighting field-level overrides 300 highlighting multivalued fields 298–299 highlighting phrases 298 parameters for highlighting 299–300 using facets 296–298 paging 243–245 sorting by fields 245–246 by functions 247 fuzzy sorting 247 memory footprint for 246–247 missing values 246 See also result grouping results formatting available options 238–240 changing default 257 highlight search component 295–296 ReversedWildcardFilterFactory 61 rint function 511 rolling restart 443–444 Romanian language 461, 609 rord function 509 rows parameter 37 RSolr 388 Ruby 28, 388 RubyResponseWriter class 239 Russian language 609 RussianLightStemFilterFactory 609

S

scalability of Solr 15–16 Solr features 24–25 SolrCloud 411–412 using virtualized commodity hardware 16 ScalaLikeSolr 388 scale function 510 scaling search clusters vs servers 78–79 denormalized document 75–77 distributed searching 77–78 limitations 79–81 Schema Browser 44 schema management 6–7, 13 schema parameter 84 schema.xml file 13, 32 copy fields applying analyzers to field 132–133 catch-all field from many fields 131–132 designing document granularity 121 example of 124–125 indexed fields 123 stored fields 123–124 unique key 122–123 dynamic fields adding document sources 130–131 documents from diverse sources 129–130 documents with many fields 129 multivalued fields 127–128 purpose of 83 required attributes 126–127 unique key field 133 viewing from Solr admin page 44 scoring, highlight search component 295 search components 203–206 search handler 93–94 search-engine optimization See SEO searcherName property 103 searchers performance and 102–104 warming new first searcher 106 element 106–107 element 106 warming queries 105–106 SearchHandler class 198, 202 searching documents and how documents are found 56–59 overview 49–50 fundamental search
problem 50–53 fuzzy matching defined 60 distances 63 proximity searching 63–65 range searching 62 wildcard searching 60–62 inverted index and 53–54 paging 40 personalizing search attribute-based matching 571–573 collaborative filtering 586–590 concept-based matching 579–585 geographical matching 585–586 hierarchical matching 573–574 hybrid recommendation scenarios 590–591 More Like This 574–579 overview 591–592 search vs recommendations 570–571 Precision balancing with Recall 73–74 overview 72 query form extended features 41–42 overview 34–38 ranked retrieval 39 Recall balancing with Precision 73–74 overview 73 relevancy boosting terms 69, 221, 223–228 default similarity 65–66 inverse document frequency (idf) 68–69 normalization factors 69–71 overview 65 term frequency (tf) 67–68 response format 38–39 scaling clusters vs servers 78–79 denormalized document 75–77 distributed searching 77–78 limitations 79–81 separating from indexing 376 sorting 40–41 terms grouped expressions 56 negated 55 optional 55 phrases 56 position of 59–60 required 54–55 segment merging defined 158–159 deletions and 160–161 elements in solrconfig.xml 159–160 indexes 364–365 optimizing 159, 364–365 segmentsPerTier parameter 364 sending documents to Solr See importing documents SEO (search-engine optimization) servers autowarming OS filesystem cache 365–366 vs clusters 78–79 garbage collection for Solr 361 increasing available file descriptors 366 incremental indexing 361–362 index flipping and cache warming 362–363 JVM settings 360–361 RAM 359–360 segment merging 364–365 SSDs 359–360 service-level agreements See SLA sfield parameter (geospatial) 523 shapes (geospatial) 531–532 [shard] field 242 shard parameter 84 sharding choosing number of shards 421–422 combining replication with 377–378 defined 16, 411 distributed indexing 431 document assignment 428–429 document size 373 expected growth 374 overview 371–372 query complexity 374 required
indexing throughput 373 shard-leader election 423–424 shard-leader, defined 408 splitting shard 447–449 targeting specific shards 447 testing in development 374 total number of documents 372–373 shard.key parameter 447 shards parameter 385 shards.tolerant parameter 436 shortening URLs 192 ShowFileRequestHandler class 201 Similarity classes 65, 567–569 SimpleFSDirectory 157 simplicity, and SolrCloud 415–416 sin function 511 single point of failure See SPoF sinh function 511 SLA (service-level agreements) 414 slave.replication.enabled parameter 387 SmartChineseSentenceTokenizerFactory 606 SmartChineseWordTokenFilterFactory 606 snapshot replication 433 snippets, highlighting multiple per result 289–290 SnowballPorterFilter 458 SnowballPorterFilterFactory 462, 605–607, 609 soft commit 149 software architect 17–18 Solarium 388 solid-state drives See SSDs SolJSON 388 SolPerl 388 SolPython 388 Solr Cell (Content Extraction Library) 147, 370–371 Solr introduction administration console 43–45 client libraries 388–389 data-modeling features document clustering 22 field collapsing/grouping 21–22 importing data 22 joins 22 multilingual support 23 query features 22 document oriented extensibility 15 fault tolerant 16–17 general discussion 9–11 home directory 32 indexing example documents 33–34 information retrieval engine 11–13 installing 27–28 Java web application 13–15 JIRA page 596 multiple indexes in one server 15 patches contributing 602–604 downloading and applying 601–602 read dominant 5–6 scalability 15–16 schema management 6–7, 13 software architect 17–18 Solr features atomic updates 23 durable writes 24 NRT feature 23 real-time get 23–24 scaling with ZooKeeper 24–25 vs SolrCloud 27 system administrator 18 text-centric 4–5 use cases information discovery 8–9 keyword search 7–8 ranked retrieval what Solr cannot user experience autosuggest 20 faceting 20 geospatial 21 hit highlighting 21 pagination and sorting 20 spell-checker 20–21 versions 596–597 solr/ folder 598
SolrCloud administration adding replica 445 checking node activity 444 offsite backup 445–446 reloading collection 443 restarting failed node 444 rolling restart 443–444 uploading changes to ZooKeeper 443 cluster-state management 422–423 collections collection aliasing 440–442 vs cores 416–417 creating 437–440 configuring client timeout 426 core node names 426 host 425–426 hostContext parameter 426 leader vote wait period 426–427 port 426 consistency 414–415 custom hashing composite document ID 446–447 limitations on 447 overview 446 targeting specific shard 447 distributed indexing batches 432 commit 432 document shard assignment 428–429 forwarding request to replicas 431 node recovery process 433–434 NRT search 432–433 routing document to correct shard 430 sending update request using CloudSolrServer 429–430 shard leader indexes document 431 version ID 431 distributed queries client sends query 434 get fields stage 435–436 limitations on 436 overview 410–411 process overview 434 query controller receives request 435 query stage 435 elasticity 416 high availability 413–414 monitoring 18 replicas, choosing number of 421–422 scalability 24, 411–412 shards choosing number of 421–422 shard-leader election 423–424 splitting 447–449 simplicity 415–416 vs Solr 27 starting Solr in 406–410 ZooKeeper centralized configuration management 420–421 client timeout 419–420 data model 417–419 defined 417 production configuration 419 znode watcher 418 element 425 solrconfig.xml file applying changes 87 common XML elements 87 enabling JMX 89–90 loading dependency JAR files 89 Lucene version 88–89 overview 85–87 segment merging elements in 159–160 viewing active 44 See also configuration SolrInfoMBeanHandler class 201 SolrJ library adding to project 389 CloudSolrServer 429 connecting to server with 389–390 embedding Solr within application 391–392 importing documents using 144–146 interacting with Solr 390–391 versioning and 
serialization 392 SolrMeter 400 SolrNet 388 SolrRequestHandler interface 199, 203 SolrServer class 389 solr.solr.home property 32, 83 solr.war file 358 solr.xml file 32, 83 sort parameter 37, 327, 347 sorting by distance overview 526 with geofilt 534–535 with SpatialRecursivePrefixTreeFieldType 534 by fields 245–246 by functions 247, 508–509 fuzzy sorting 247 memory footprint for 246–247 missing values 246 overview 40–41 result grouping 347–348 user experience 20 sortMissingFirst attribute 138, 246 sortMissingLast attribute 135, 138, 246 Spanish language 609 SpanishLightStemFilterFactory 609 spatial parameter 41 spatial query parsers 235–236 Spatial4J library 531 SpatialRecursivePrefixTreeFieldType 528, 530, 532, 534 spellcheck parameter 41–42, 312–313 spell-check search component chaining multiple spell-check implementations 316–317 DirectSolrSpellChecker 314–315 example using misspelled term 309–311 indexing Wikipedia articles 307–308 overview 101–102, 311 text analysis 315–316 user experience 20–21 splitOnCaseChange attribute 181 splitOnNumerics attribute 181 SPoF (single point of failure) 17 sqedist function 513–514 sqrt function 511 Squid 18 srcCore parameter 383 SSDs (solid-state drives) 16, 156, 359–360, 411 Stack Exchange, indexing posts 612, 615 StandardTokenizer 169–170, 178 StandardTokenizerFactory 462 start parameter 245, 347 paging 40 query form 37 starting/running server example server 28 stopping server 31–32 troubleshooting 30–31 statistics component for 99–100 queries 394–395 Solr Core 395–397 update requests 394–395 stemEnglishPossessive attribute 181 StemmerOverrideFilterFactory 459–460, 606 stemming vs lemmatization 452–454 multilingual search dictionary-based (Hunspell) 463–464 example 454–458 KeywordMarkerFilterFactory 459 language-specific analyzer chains 460–463 StemmerOverrideFilterFactory 459–460 overview 182–183 stop words Google approach 171 More Like This and 579 removing 171 StopFilter 316 
StopFilterFactory 165, 171, 462 stopping server 31–32 storage for indexes 156–158 stored attribute 131 stored fields hit highlighting 293 returning 241 schema 123–124 storeOffsetsWithPositions parameter 302 element 87 strdist function 513–514 stream.body parameter 364, 578 string fields 134–135 structured data, field types for attributes for 138–141 class diagram 133–134 date fields granularity of 136–137 overview 135–136 numeric fields overview 137–138 precisionStep attribute 138–141 string fields 134–135 sttf function 512 sub function 511 subshards, defined 448 Subversion See SVN suggestions autosuggest overview 318 request handler 318–319 search component 320 for document field values overview 320–321 request handler 323–324 using n-grams 321–323 for queries, based on user activity boosting recent popularity 326–329 find most popular query 325–326 overview 324 schema design for 324–325 spell-check search component chaining multiple spell-check implementations 316–317 DirectSolrSpellChecker 314–315 example using misspelled term 309–311 indexing Wikipedia articles 307–308 overview 311 text analysis 315–316 sum function 510–511 sumtotaltermfreq function See sttf function Surround query parser 236–238 SVN (Subversion) 596–597 swapping cores 381 Swedish language 609 SwedishLightStemFilterFactory 609 SweetSpotSimilarityFactory class 569 Switch query parser 236 SynonymFilterFactory 177–184 synonyms system administrator 18 SystemInfoHandler class 202 SystemInfoRequestHandler class 199 T tag local param 277 tan function 511 function 511 targetCore parameter 383 tdate field 136–137 term frequency See tf Term query parser 232–233, 273–275 termfreq function 512 termOffset attribute 188–189 termPositions attribute 188–189 term-proximity boosting eDisMax query parser 562–563 Lucene query parser 218, 563–564 overview 562 TermQParserPlugin 274 terms, search boosting 559 grouped expressions 56 negated 55 optional 55 phrases 56 position of 59–60 required 
54–55 termVectors attribute 187–189 testing sharding, in development 374 text analysis with analysis form 172–174 text analysis analyzers 168 custom field type collapsing repeated letters 177–178 hyphenated terms 178–182 injecting synonyms 183–184 microblog example 174–177 preserving hashtags and mentions 178–182 removing diacritical marks 182 stemming words 182–183 WhitespaceTokenizerFactory 179 WordDelimiterFilterFactory 179–182 extending using Solr plug-in custom TokenFilter class 191–192 custom TokenFilterFactory class 193–194 overview 190–191 highlight search component 293 LowerCaseFilterFactory 171–172 microblog example 163–167 omitNorms attribute 188 per-language 189–190 removing stop words 171 spell-check search component 315–316 StandardTokenizer 169–170 termVectors attribute 188–189 testing with analysis form 172–174 token filters 169 tokenizers 168–169 text-centric, defined 4–5 tf (term frequency) 67–68, 512 tf function 512 Thai language 609 ThaiWordFilterFactory 609 Thread Dump 44 ThreadDumpHandler class 202 Throughput Garbage Collector 361 tie (tie breaker) parameter 225 TikaLanguageIdentifierUpdateProcessor 486 TikaLanguageIdentifierUpdateProcessorFactory 487 token filters 169 TokenFilter class, custom 191–192 TokenFilterFactory class, custom 193–194 tokenizers overview 168–169 result grouping on tokenized fields 352–353 Tomcat 13, 18, 358 top function 510 totaltermfreq function See ttf function transaction log 151–152 TransformerFactory class 242 transient parameter 84, 380 trie 136 troubleshooting, starting server 30–31 ttf function 512 Turkish language 609 TurkishLowerCaseFilterFactory 462, 609 U UAX29URLEmailTokenizer 316 ulimit command 366 ulogDir parameter 84 Unified Information Management Architecture See Apache UIMA unimportant words 53 unique key field designing schema 122–123 schema.xml file 133 element 133 unloading cores 381–382 unstructured data upconfig command 438, 443 update handler atomic updates field-level updates 153 optimistic 
concurrency control 154–155 autocommit 149–150 normal commit 149 overview 147–148 soft commit 149 statistics 394 transaction log 151–152 update processors 488–489 update request statistics 394–395 updateHandler 394 UpdateRequestHandler class 198, 200 upgrading 401–402 URL shortening services 192 well-known text See WKT WhitespaceTokenizer 166, 169, 193 WhitespaceTokenizerFactory 177–179 Wikipedia, indexing articles 307–308, 610–612 wildcard searches autosuggest and 320 deciding to use 61 Lucene query parser 220–221 searching using 60–62 WKT (well-known text) 531–532 Word documents 22 WordBreakSolrSpellchecker 316–317 WordDelimiterFilter 193 WordDelimiterFilterFactory 166, 177, 179–182 wt parameter 37, 258 use cases for Solr information discovery 8–9 keyword search 7–8 ranked retrieval what Solr cannot element 106 user accessible fields 227–228 user experience autosuggest 20 faceting 20 geospatial 21 hit highlighting 21 pagination and sorting 20 spell-checker 20–21 V X ValueSource 516–518 ValuesourceParser 518–519 Varnish 18 _version_ field 23, 154, 423 versions of Solr 392, 596–597 vertical scaling 411 virtualized commodity hardware 16 XInclude 106 XML (Extensible Markup Language) importing documents using 141–144 response format 38–39 update handler support 148 XMLResponseWriter class 239 xor function 515 XSLTResponseWriter class 239 W waitSearcher attribute 150 war (web application archive) files 357 warming caches 108–109 filter cache 110–112 OS filesystem cache 365–366 queries 105–106, 247 searchers first searcher 106 element 106–107 element 106 web application archive files See war files Weblogic 358 web-scale inverted index 12 Websphere 358 Z zkcli utility 443 zkClientTimeout parameter 420, 426 zkHost parameter 419, 443 znode watcher 418 ZooKeeper centralized configuration management 420–421 client timeout 419–420 data model 417–419 defined 417 production configuration 419 Solr features 24–25 znode watcher 418 
SEARCH

Solr IN ACTION
Grainger ● Potter

Whether you're handling big (or small) data, managing documents, or building a website, it is important to be able to quickly search through your content and discover meaning in it. Apache Solr is your tool: a ready-to-deploy, Lucene-based, open source, full-text search engine. Solr can scale across many servers to enable real-time queries and data analytics across billions of documents.

Solr in Action teaches you to implement scalable search using Apache Solr. This easy-to-read guide balances conceptual discussions with practical examples to show you how to implement all of Solr's core capabilities. You'll master topics like text analysis, faceted search, hit highlighting, result grouping, query suggestions, multilingual search, advanced geospatial and data operations, and relevancy tuning.

What's Inside
● How to scale Solr for big data
● Rich real-world examples
● Solr as a NoSQL data store
● Advanced multilingual, data, and relevancy tricks
● Coverage of versions through Solr 4.7

This book assumes basic knowledge of Java and standard database technology. No prior knowledge of Solr or Lucene is required.

Trey Grainger is a director of engineering at CareerBuilder. Timothy Potter is a senior member of the engineering team at LucidWorks. The authors work on the scalability and reliability of Solr, as well as on recommendation engine and big data analytics technologies.

To download their free eBook in PDF, ePub, and Kindle formats, owners of this book should visit manning.com/SolrinAction

MANNING $49.99 / Can $52.99 [INCLUDING eBOOK]

"The knowledge and techniques you need."
—From the Foreword by Yonik Seeley, Creator of Solr

"Readable and immediately applicable ... an excellent book."
—John Viviano, InterCorp, Inc.

"The go-to guide for Solr ... a definitive resource for both beginners and experts."
—Scott Anthony, Business Instruments

"A well-dosed combination of deep technical knowledge and real-world experience."
—Alexandre Madurell, Piksel, Inc.

Date posted: 27/03/2019, 13:47
