www.it-ebooks.info

Apache Solr Cookbook

Over 100 recipes to make Apache Solr faster, more reliable, and return better results

Rafał Kuć

BIRMINGHAM - MUMBAI

Apache Solr Cookbook

Copyright © 2013 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: July 2011
Second edition: January 2013
Production Reference: 1150113

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK

ISBN 978-1-78216-132-5
www.packtpub.com
Cover Image by J. Blaminsky (milak6@wp.pl)

Credits

Author: Rafał Kuć
Reviewers: Ravindra Bharathi, Marcelo Ochoa, Vijayakumar Ramdoss
Acquisition Editor: Andrew Duckworth
Lead Technical Editor: Arun Nadar
Technical Editors: Jalasha D'costa, Charmaine Pereira, Lubna Shaikh
Project Coordinator: Anurag Banerjee
Proofreaders: Maria Gould, Aaron Nash
Indexer: Tejal Soni
Production Coordinators: Manu Joseph, Nitesh Thakur
Cover Work: Nitesh Thakur

About the Author

Rafał Kuć is a born team leader and software developer. He currently works as a Consultant and a Software Engineer at Sematext Inc., where he concentrates on open source technologies such as Apache Lucene and Solr, ElasticSearch, and the Hadoop stack. He has more than 10 years of experience in various software branches, from banking software to e-commerce products. He is mainly focused on Java, but open to every tool and programming language that will make the achievement of his goal easier and faster. Rafał is also one of the founders of the solr.pl site, where he tries to share his knowledge and help people with their problems with Solr and Lucene. He is also a speaker at various conferences around the world, such as Lucene Eurocon, Berlin Buzzwords, and ApacheCon.

Rafał began his journey with Lucene in 2002, and it wasn't love at first sight. When he came back to Lucene later in 2003, he revised his thoughts about the framework and saw the potential in search technologies. Then Solr came, and that was it. From then on, Rafał has concentrated on search technologies and data analysis. Right now Lucene, Solr, and ElasticSearch are his main points of interest.

Acknowledgement

This book is an update to the first cookbook for Solr that was released almost two years ago. What was at the beginning an update turned out to be a rewrite of almost all the recipes in the book, because we wanted not only to bring you an update to the already existing recipes, but also to give you whole new recipes that will help you with common situations when using Apache Solr 4.0. I hope that the book you are holding in your hands (or reading on a computer or reader screen) will be useful to you.

Although I would go the same way if I could go back in time, the time of writing this book was not easy for my family. Among the ones who suffered the most were my wife Agnes and our two great kids, our son Philip and daughter Susanna. Without their patience and understanding, the writing of this book wouldn't have been possible. I would also like to thank my parents and Agnes' parents for their support and help.

I would like to thank all the people involved in creating, developing, and maintaining the Lucene and Solr projects for their work and passion. Without them this book wouldn't have been written.

Once again, thank you.

About the Reviewers

Ravindra Bharathi has worked in the software industry for over a decade in various domains such as education, digital media marketing/advertising, enterprise search, and energy management systems. He has a keen interest in search-based applications that involve data visualization, mashups, and dashboards. He blogs at http://ravindrabharathi.blogspot.com.

Marcelo Ochoa works at the System Laboratory of Facultad de Ciencias Exactas of the Universidad Nacional del Centro de la Provincia de Buenos Aires, and is the CTO at Scotas.com, a company specialized in near real time search solutions using Apache Solr and Oracle. He divides his time between University jobs and external projects related to Oracle and big data technologies. He has worked in several Oracle related projects such as translation of Oracle manuals and multimedia CBTs. His background is in database, network, web, and Java technologies. In the XML world, he is known as the developer of the DB Generator for the Apache Cocoon project, the open source projects DBPrism and DBPrism CMS, the Lucene-Oracle integration by using Oracle JVM Directory implementation, and, in the Restlet.org project, the Oracle XDB Restlet Adapter, an alternative to writing native REST web services inside the database resident JVM.

Since 2006, he has been a part of the Oracle ACE program. Oracle ACEs are known for their strong credentials as Oracle community enthusiasts and advocates, with candidates nominated by ACEs in the Oracle Technology and Applications communities.

He is the author of Chapter 17 of the book Oracle Database Programming using Java and Web Services, Kuassi Mensah, Digital Press, and Chapter 21 of the book Professional XML Databases, Kevin Williams, Wrox Press.

www.PacktPub.com

Support files, eBooks, discount offers and more

You might want to visit www.PacktPub.com for support files and downloads related to your book.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and, as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at service@packtpub.com for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.

http://PacktLib.PacktPub.com

Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can access, read and search across Packt's entire library of books.

Why Subscribe?
  Fully searchable across every book published by Packt
  Copy and paste, print and bookmark content
  On demand and accessible via web browser

Free Access for Packt account holders

If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view nine entirely free books. Simply use your login credentials for immediate access.

Table of Contents

Preface
Chapter 1: Apache Solr Configuration
  Introduction
  Running Solr on Jetty
  Running Solr on Apache Tomcat
  Installing a standalone ZooKeeper
  Clustering your data
  Choosing the right directory implementation
  Configuring spellchecker to not use its own index
  Solr cache configuration
  How to fetch and index web pages
  How to set up the extracting request handler
  Changing the default similarity implementation
Chapter 2: Indexing Your Data
  Introduction
  Indexing PDF files
  Generating unique fields automatically
  Extracting metadata from binary files
  How to properly configure Data Import Handler with JDBC
  Indexing data from a database using Data Import Handler
  How to import data using Data Import Handler and delta query
  How to use Data Import Handler with the URL data source
  How to modify data while importing with Data Import Handler
  Updating a single field of your document
  Handling multiple currencies
  Detecting the document's language
  Optimizing your primary key field indexing

Appendix: Real-life Situations

How to do it

This recipe will show how we can boost documents based on their publishing date.

Let's begin with the following index structure (add it to the field section in your schema.xml file):

  id         the identifier of the document
  name       the name of the document
  published  the publishing date, defined with default="NOW"

Now, let's index the following sample data:

  id  name               published
  1   Solr 3.1 CookBook  2011-02-02T12:00:00Z
  2   Solr 4.0 CookBook  2012-10-01T12:00:00Z

Now, let's run a simple query:

  curl 'http://localhost:8983/solr/select?q=solr+cookbook&qf=name&defType=edismax'

For the preceding query, Solr will return the following results:

  responseHeader: status 0, QTime 1
  params: qf=name, q=solr cookbook, defType=edismax

  id  name               published
  1   Solr 3.1 CookBook  2011-02-02T12:00:00Z
  2   Solr 4.0 CookBook  2012-10-01T12:00:00Z

As you can see, the newest document is the second one, which we want to avoid. So, we need to change our query to the following one:

  curl 'http://localhost:8983/solr/select?q=solr+cookbook&qf=name&bf=recip(ms(NOW/HOUR,published),3.16e-11,1,1)&defType=edismax'

Now, the response will be as follows:

  responseHeader: status 0, QTime 2
  params: qf=name, bf=recip(ms(NOW/HOUR,published),3.16e-11,1,1), q=solr cookbook, defType=edismax

  id  name               published
  2   Solr 4.0 CookBook  2012-10-01T12:00:00Z
  1   Solr 3.1 CookBook  2011-02-02T12:00:00Z

So, we have achieved what we wanted. Now, let's see how it works.

How it works

Our index structure consists of three fields: one holds the identifier of the document, one holds its name, and the last one, which we are most interested in, holds the publishing date. The published field has one nice feature: if we don't define it in the document and send it for indexing, it will get the value of the date and time when it is processed (the default="NOW" attribute).

As you can see, the first query that we sent to Solr did not return the results in the order we would like. The most recent document is the second one. Of course, we could have sorted them by date, but we don't want to do that, because we would like to have the most recent and the most relevant documents at the top, not only the newest ones.

In order to achieve this, we use the bf (boost function) parameter to specify the boosting function. At first, it can look very complicated, but it's not. In order to boost our documents, we use the recip(ms(NOW/HOUR,published),3.16e-11,1,1) function query. The ms function returns the difference between its two date arguments in milliseconds, and recip(x,m,a,b) computes a/(m*x+b). 3.16e10 is approximately the number of milliseconds in a single year, so we use 3.16e-11 to invert that, and we use the reciprocal function (recip) to calculate the scaling value, which will return values near 1 for recent documents, 1/2 for documents that are about a year old, 1/3 for documents that are about two years old, 1/4 for documents that are about three years old, and so on. We've also used NOW/HOUR, which rounds the current time down to the hour, to reduce the precision of the time calculation, in order for our function query to consume less memory and because we don't need that granularity; our results will be just fine. As you can see, our query with the bf parameter and the time-based function query works as intended.

There's more

If you want to read more about function queries, please refer to the http://wiki.apache.org/solr/FunctionQuery Solr wiki page.
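For the recipe on boosting documents by publishing date, the decay of the boost can be made concrete outside Solr. The following is a minimal Python sketch, not Solr code: it evaluates recip(ms(NOW/HOUR,published),3.16e-11,1,1) by hand, using the definition recip(x,m,a,b) = a/(m*x+b). The helper names recip and date_boost and the "current time" used below are assumptions chosen only for illustration; the two publishing dates are the ones from the recipe's sample data.

```python
from datetime import datetime, timezone

# Constants taken from the recipe's function query.
M = 3.16e-11  # roughly 1 / (number of milliseconds in a year)
A = 1.0
B = 1.0


def recip(x: float, m: float = M, a: float = A, b: float = B) -> float:
    """Reciprocal function with the same shape as recip(x,m,a,b): a / (m*x + b)."""
    return a / (m * x + b)


def date_boost(published: datetime, now: datetime) -> float:
    """Mimic recip(ms(NOW/HOUR,published),3.16e-11,1,1) for one document."""
    # NOW/HOUR: round the current time down to the start of the hour.
    now_hour = now.replace(minute=0, second=0, microsecond=0)
    # ms(a, b): the difference a - b expressed in milliseconds.
    age_ms = (now_hour - published).total_seconds() * 1000.0
    return recip(age_ms)


if __name__ == "__main__":
    # Hypothetical "current time", picked only to have something to compute with.
    now = datetime(2013, 1, 15, 10, 30, tzinfo=timezone.utc)
    docs = {
        "Solr 3.1 CookBook": datetime(2011, 2, 2, 12, 0, tzinfo=timezone.utc),
        "Solr 4.0 CookBook": datetime(2012, 10, 1, 12, 0, tzinfo=timezone.utc),
    }
    for name, published in docs.items():
        print(f"{name}: boost = {date_boost(published, now):.3f}")
```

With the edismax parser, the value of the bf function is added to the relevance score, so the newer document receives the larger addition. When both documents match the query text about equally well, as in the recipe, that difference is enough to move the newer one to the top.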