Chapter 3

When you search for all documents, you should see indexed metadata for Angel Eyes, prefixed with metadata_:

    <str name="metadata_Content-Type">audio/midi</str>
    <str name="metadata_divisionType">PPQ</str>
    <str name="metadata_patches">0</str>
    <str name="metadata_stream_content_type">application/octet-stream</str>
    <str name="metadata_stream_name">angeleyes.kar</str>
    <str name="metadata_stream_size">55677</str>
    <str name="metadata_stream_source_info">file</str>
    <str name="metadata_tracks">16</str>

Obviously, in most use cases, you don't want to get a new document every time you index the same file. If your schema has a uniqueKey field defined such as id, then you can provide a specific ID by passing a literal value using literal.id=34. Each time you index the file using the same ID, it will delete and insert that document. However, that implies that you have the ability to manage IDs through some third-party system like a database. If you want to use the metadata, such as the stream_name provided by Tika, to provide the key, then you just need to map that field using map.stream_name=id. To make the example work, update ./examples/cores/karaoke/schema.xml to specify <uniqueKey>id</uniqueKey>.

    >> curl 'http://localhost:8983/solr/karaoke/update/extract?map.content=text&map.stream_name=id' -F "file=@angeleyes.kar"

This of course assumes that you've defined <uniqueKey>id</uniqueKey> to be of type string, not a number.

Indexing richer documents

Indexing karaoke lyrics from MIDI files is also a fairly trivial example. We basically just strip out all of the contents, and store them in the Solr text field. However, indexing other types of documents, such as PDFs, can be a bit more complicated. Let's look at Take a Chance on Me, a complex PDF file that explains what a Monte Carlo simulation is, while making lots of puns about the lyrics and titles of songs from ABBA.
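The extract URLs used for Solr Cell differ only in their query parameters, so they can be assembled programmatically. The following is a minimal sketch in Python, assuming the same host, core name, and parameter names as the curl commands in this section. The function name build_extract_url is hypothetical, and the sketch only builds the URL; it does not upload a file.

```python
from urllib.parse import urlencode


def build_extract_url(base_url, core, params):
    """Return a Solr Cell extract URL with URL-encoded query parameters.

    base_url and core are assumptions taken from the curl examples;
    params is a dict such as {"map.stream_name": "id"}.
    """
    return f"{base_url}/{core}/update/extract?{urlencode(params)}"


# Use the file name reported by Tika (stream_name) as the unique key.
by_stream_name = build_extract_url(
    "http://localhost:8983/solr",
    "karaoke",
    {"map.content": "text", "map.stream_name": "id"},
)

# Alternatively, supply an ID managed elsewhere as a literal value.
by_literal_id = build_extract_url(
    "http://localhost:8983/solr",
    "karaoke",
    {"map.content": "text", "literal.id": "34"},
)

print(by_stream_name)
print(by_literal_id)
```

Either URL can then be passed to curl or to an HTTP library together with the file to upload.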
View ./examples/appendix/karaoke/mccm.pdf, and you will see a complex PDF document with multiple fonts, background images, complex mathematical equations, Greek symbols, and charts. However, indexing that content is as simple as the prior example:

    >> curl 'http://localhost:8983/solr/karaoke/update/extract?map.content=text&map.stream_name=id&commit=true' -F "file=@mccm.pdf"

This material is copyright and is licensed for the sole use by William Anderson on 26th August 2009, 4310 E Conway Dr. NW, Atlanta, 30327.

If you do a search for the document using the filename as the id via http://localhost:8983/solr/karaoke/select/?q=id:mccm.pdf, then you'll also see that the last_modified field that we mapped in solrconfig.xml is being populated. Tika provides a Last-Modified field for PDFs, but not for MIDI files:

    <doc>
      <arr name="id">
        <str>mccm.pdf</str>
      </arr>
      <arr name="last_modified">
        <str>Sun Mar 03 15:55:09 EST 2002</str>
      </arr>
      <arr name="text">
        <str> Take A Chance On Me

So with these richer documents, how can we get a handle on the metadata and content that is available? Passing extractOnly=true on the URL will output what Solr Cell has extracted, including metadata fields, without actually indexing them:

    <response>
    ...
    <str name="mccm.pdf">&lt;?xml version="1.0" encoding="UTF-8"?&gt;
    &lt;html xmlns="http://www.w3.org/1999/xhtml"&gt;
    &lt;head&gt;
    &lt;title&gt;Take A Chance On Me&lt;/title&gt;
    &lt;/head&gt;
    &lt;body&gt;
    &lt;div&gt;
    &lt;p&gt;
    Take A Chance On Me
    Monte Carlo Condensed Matter
    A very brief guide to Monte Carlo simulation.
    ...
    </str>
    <lst name="mccm.pdf_metadata">
      <arr name="stream_source_info"><str>file</str></arr>
      <arr name="subject"><str>Monte Carlo Condensed Matter</str></arr>
      <arr name="Last-Modified"><str>Sun Mar 03 15:55:09 EST 2002</str></arr>
      ...
      <arr name="creator"><str>PostScript PDriver module 4.49</str></arr>
      <arr name="title"><str>Take A Chance On Me</str></arr>
      <arr name="stream_content_type"><str>application/octet-stream</str></arr>
      <arr name="created"><str>Sun Mar 03 15:53:14 EST 2002</str></arr>
      <arr name="stream_size"><str>378454</str></arr>
      <arr name="stream_name"><str>mccm.pdf</str></arr>
    </lst>
    </response>

At the top, in an XML node called <str name="mccm.pdf"/>, is the content extracted from the PDF as an XHTML document. As it is XHTML wrapped in another separate XML document, the < and > characters of its tags have been escaped, as in &lt;div&gt;. If you cut and paste the contents of the <str/> node into a text editor and convert each &lt; to < and each &gt; to >, then you can see the structure of the XHTML document that is indexed.

Below the contents of the PDF, you can also see a wide variety of PDF document-specific metadata fields, including subject, title, and creator, as well as metadata fields added by Solr Cell for all imported formats, including stream_source_info, stream_content_type, stream_size, and the already-seen stream_name.

So why would we want to see the XHTML structure of the content? The answer is in order to narrow down our results. We can use XPath queries through the xpath parameter to select a subset of the data to be indexed. To make up an arbitrary example, let's say that after looking at mccm.html we know we only want the second paragraph of content to be indexed:

    >> curl 'http://localhost:8983/solr/karaoke/update/extract?map.content=text&map.div=divs_s&capture=div&captureAttr=true&xpath=\/\/xhtml:p[1]' -F "file=@mccm.pdf"

We now have only the second paragraph, which is the summary of what the document Take a Chance on Me is about.
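The manual unescape-and-inspect step can also be done in a few lines of code. The sketch below uses only the Python standard library; the sample string is a shortened, hypothetical stand-in for the escaped XHTML returned by extractOnly=true, and the function name xhtml_paragraph is made up for illustration. It shows standard XPath position semantics, in which positions are 1-based, so p[1] is the first p element and p[2] the second. It does not reproduce Solr Cell's own XPath handling.

```python
import html
import xml.etree.ElementTree as ET

# Namespace used by the XHTML that Solr Cell returns.
XHTML_NS = {"xhtml": "http://www.w3.org/1999/xhtml"}

# Hypothetical, shortened stand-in for the escaped content of <str name="mccm.pdf">.
ESCAPED = (
    '&lt;html xmlns="http://www.w3.org/1999/xhtml"&gt;'
    "&lt;body&gt;&lt;div&gt;"
    "&lt;p&gt;Take A Chance On Me&lt;/p&gt;"
    "&lt;p&gt;A very brief guide to Monte Carlo simulation.&lt;/p&gt;"
    "&lt;/div&gt;&lt;/body&gt;&lt;/html&gt;"
)


def xhtml_paragraph(escaped_xhtml, position):
    """Unescape the XHTML and return the text of the p element at a 1-based position."""
    root = ET.fromstring(html.unescape(escaped_xhtml))
    node = root.find(f".//xhtml:p[{position}]", XHTML_NS)
    return node.text if node is not None else None


print(xhtml_paragraph(ESCAPED, 1))  # first paragraph
print(xhtml_paragraph(ESCAPED, 2))  # second paragraph
```

Checking an expression this way against the extractOnly output before indexing helps confirm that it selects the paragraph you intend.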
Binary file size

Take a Chance on Me is a 372 KB file stored at ./examples/appendix/karaoke/mccm.pdf, and it highlights one of the challenges of using Solr Cell. If you are indexing a thousand PDF documents that each average 372 KB, then you are shipping 372 megabytes over the wire, assuming the data is not already on Solr's file system. However, if you extract the contents of the PDF on the client side and only send that over the web, then what is sent to the Solr text field is just 5.1 KB. Look at ./examples/appendix/karaoke/mccm.txt to see the actual text extracted from mccm.pdf. Generously assuming that the metadata adds an extra 1 KB of information, then you have a total content sent over the wire of 6.1 megabytes ((5.1 KB + 1.0 KB) * 1000).

Solr Cell offers a quick way to start indexing that vast amount of information stored in previously inaccessible binary formats without resorting to custom code per binary format. However, depending on the files, you may be needlessly transmitting a lot of data, only to extract a small portion of text. Moreover, you may find that the logic provided by Solr Cell for parsing and selecting just the data you want may not be rich enough. For these cases you may be better off building a dedicated client-side tool that does all of the parsing and munging you require.

Summary

At this point, you should have a schema that you believe will suit your needs, and you should know how to get your data into it. From Solr's native XML to CSV to databases to rich documents, Solr offers a variety of possibilities to ingest data into the index. Chapter 8 will discuss some additional choices for importing data. In the end, usually one or two mechanisms will be used.
In addition, you can usually expect the need to write some code, perhaps just a simple bash or ant script, to automate getting data from your source system into Solr.

Now that we've got data in Solr, we can finally get to querying it. The next chapter will describe Solr/Lucene's query syntax in detail, which includes phrase queries, range queries, wildcards, and boosting, as well as Solr's DateMath syntax. Finally, you'll learn the basics of scoring and how to debug it. The chapters after that will get to more interesting querying topics that of course depend on having data to search with.

Basic Searching

At this point, you have Solr running and some data indexed, and you're finally ready to put Solr to the test. Searching with Solr is arguably the most fun aspect of working with it, because it's quick and easy to do. While searching your data, you will learn more about its nature than before. It is also a source of interesting puzzles to solve when you troubleshoot why a search didn't find a document or conversely why it did, or similarly why a document wasn't scored sufficiently high.

In this chapter, you are going to learn about:

• The Full Interface for querying Solr
• Solr's query response XML
• Using query parameters to configure the search
• Solr/Lucene's query syntax
• The factors influencing scoring

Your first search, a walk-through

We've got a lot of data indexed, and now it's time to actually use Solr for what it is intended: searching (aka querying). When you hook up Solr to your application, you will use HTTP to interact with Solr, either by using an HTTP software library or indirectly through one of Solr's client APIs. However, as we demonstrate Solr's capabilities in this chapter, we'll use Solr's web-based admin interface.
Surely you've noticed the search box on the first screen of Solr's admin interface. It's a bit too basic, so instead click on the [FULL INTERFACE] link to take you to a query form with more options.

The following screenshot is seen after clicking on the [FULL INTERFACE] link:

Contrary to what the label FULL INTERFACE might suggest, this form only has a fraction of the options you might possibly specify to run a search. Let's jump ahead for a second, and do a quick search. In the Solr/Lucene Statement box, type *:* (an asterisk, colon, and then another asterisk). That is admittedly cryptic if you've never seen it before, but it basically means match anything in any field, which is to say, it matches all documents. Much more about the query syntax will be discussed soon enough. At this point, it is tempting to quickly hit return or enter, but that inserts a newline instead of submitting the form (this will hopefully be fixed in the future). Click on the Search button, and you'll get output like this:

    <?xml version="1.0" encoding="UTF-8"?>
    <response>
      <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">392</int>
        <lst name="params">
          <str name="explainOther"/>
          <str name="fl">*,score</str>
          <str name="indent">on</str>
          <str name="start">0</str>
          <str name="q">*:*</str>
          <str name="hl.fl"/>
          <str name="qt">standard</str>
          <str name="wt">standard</str>
          <str name="version">2.2</str>
          <str name="rows">10</str>
        </lst>
      </lst>
      <result name="response" numFound="1002272" start="0" maxScore="1.0">
        <doc>
          <float name="score">1.0</float>
          <str name="id">Release:449119</str>
          <str name="r_a_id">56063</str>
          <str name="r_a_name">The Spotnicks</str>
          <arr name="r_attributes"><int>0</int><int>1</int><int>100</int></arr>
          <arr name="r_event_country"><str>JP</str></arr>
          <arr name="r_event_date"><date>1965-11-30T05:00:00Z</date></arr>
          <str name="r_lang">English</str>
          <str name="r_name">The Spotnicks in Tokyo</str>
          <int name="r_tracks">16</int>
          <str name="type">Release</str>
        </doc>
        <doc>
          <float name="score">1.0</float>
          <str name="id">Release:186779</str>
          <str name="r_a_id">56011</str>
          <str name="r_a_name">Metro Area</str>
          <arr name="r_attributes"><int>0</int><int>1</int><int>100</int></arr>
          <arr name="r_event_country"><str>US</str></arr>
          <arr name="r_event_date"><date>2001-11-30T05:00:00Z</date></arr>
          <str name="r_name">Metro Area</str>
          <int name="r_tracks">11</int>
          <str name="type">Release</str>
        </doc>
        <!-- ** 7 other docs omitted for brevity ** -->
      </result>
    </response>

Browser note

Use Firefox for best results when searching Solr. Solr's search results return XML, and Firefox renders XML color coded and pretty-printed. For other browsers (notably Safari), you may find yourself having to use the View Source feature to interpret the results.
Even in Firefox, however, there are cases where you will use View Source in order to look at the XML with the original indentation, which is relevant when diagnosing the scoring debug output.

Solr's generic XML structured data representation

Solr has its own generic XML representation of typed and named data structures. This XML is used for most of the response XML, and it is also used in parts of solrconfig.xml. The XML elements involved in this partial schema are:

• lst: A named list. Each of its child nodes should have a name attribute. This generic XML is often stored within an element that is not part of this schema, like doc, but is in effect equivalent to lst.
• arr: An array of values. Each of its child nodes is a member of this array.

The following elements represent simple values, with the text of the element storing the value. The numeric ranges match those of the Java language. They will have a name attribute if they are underneath lst (or an equivalent element like doc), but not otherwise.

• str: A string of text
• int: An integer in the range -2^31 to 2^31-1
• long: An integer in the range -2^63 to 2^63-1
• float: A floating point number in the range 1.4e-45 to about 3.4e38
• double: A floating point number in the range 4.9e-324 to about 1.8e308
• bool: A boolean value represented as true or false
• date: A date in the ISO-8601 format like so: 1965-11-30T05:00:00Z, which is always in the GMT time zone represented by Z

Solr's XML response format

The <response/> element wraps the entire response. The first child element is <lst name="responseHeader">, which is intuitively the response header that captures some basic metadata about the response.

• status: Always zero unless something went very wrong.
• QTime: The number of milliseconds Solr takes to process the entire request on the server. Due to internal caching, you should see this number drop to a couple of milliseconds or so for subsequent requests of the same query. If subsequent identical searches are much faster, yet you see the same QTime, then your web browser (or intermediate HTTP Proxy) cached the response. Solr's HTTP caching configuration is discussed in Chapter 9.

Other data may be present depending on query parameters.

The main body of the response is the search result listing enclosed by this: <result name="response" numFound="1002272" start="0" maxScore="1.0">, and it contains a <doc> child node for each returned document. Some of its attributes are explained below:

• numFound: The total number of documents matched by the query. This is not impacted by the rows parameter and as such may be larger (but not smaller) than the number of child <doc> elements.
• start: The same as the start parameter, which is the offset of the returned results into the query's result set.
• maxScore: Of all documents matched by the query (numFound), this is the highest score. If you didn't explicitly ask for the score in the field list using the fl parameter, then this won't be here. Scoring is described later in this chapter.

The contents of the result element are a list of doc elements. Each of these elements represents a document in the index. The child elements of a doc element represent fields in the index and are named correspondingly. The types of these elements are in the generic data structure partial schema, which was described earlier. They are simple values if they are not multi-valued in the schema. For multi-valued values, the field would be represented by an ordered array of simple values.

There was no data following the result element in our demonstration query. However, there can be, depending on the query parameters using features such as faceting and highlighting.
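The response structure described above maps naturally onto native data types. The following is a minimal sketch in Python using only the standard library, assuming a response shaped like the demonstration query's output. The names to_python and parse_response are hypothetical, the sample is a trimmed copy of that output, and dates are deliberately left as ISO-8601 strings rather than parsed.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the demonstration response, with one document.
SAMPLE = """<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">392</int>
  </lst>
  <result name="response" numFound="1002272" start="0" maxScore="1.0">
    <doc>
      <float name="score">1.0</float>
      <str name="id">Release:449119</str>
      <arr name="r_event_country"><str>JP</str></arr>
      <arr name="r_event_date"><date>1965-11-30T05:00:00Z</date></arr>
      <int name="r_tracks">16</int>
    </doc>
  </result>
</response>"""

# Converters for the simple value elements of the generic XML.
SIMPLE = {
    "str": lambda text: text,
    "int": int,
    "long": int,
    "float": float,
    "double": float,
    "bool": lambda text: text == "true",
    "date": lambda text: text,  # kept as the ISO-8601 string
}


def to_python(node):
    """Convert a generic Solr XML element into a dict, list, or simple value."""
    if node.tag in ("lst", "doc"):  # doc is in effect equivalent to lst
        return {child.get("name"): to_python(child) for child in node}
    if node.tag == "arr":
        return [to_python(child) for child in node]
    return SIMPLE[node.tag](node.text or "")


def parse_response(xml_text):
    """Extract the header values, result attributes, and documents."""
    root = ET.fromstring(xml_text)
    header = to_python(root.find("lst[@name='responseHeader']"))
    result = root.find("result[@name='response']")
    return {
        "status": header["status"],
        "QTime": header["QTime"],
        "numFound": int(result.get("numFound")),
        "start": int(result.get("start")),
        "docs": [to_python(doc) for doc in result.findall("doc")],
    }


parsed = parse_response(SAMPLE)
print(parsed["numFound"], parsed["docs"][0]["id"])
```

Multi-valued fields such as r_event_country come back as lists, while single-valued fields such as r_tracks come back as plain values, mirroring the arr and simple-value distinction.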
When those features are described, the corresponding XML will be explained.

Parsing the URL

The search form is a very simple thing, no more complicated than a basic one you might see in a tutorial if you are learning HTML for the first time. All that it does is submit the form using HTTP GET, essentially resulting in the browser loading a new URL with the form elements becoming part of the URL's query string. Take a good look at the URL in the browser page showing the XML response. Understanding the URL's structure is very important for grasping how search works:

    http://localhost:8983/solr/select?indent=on&version=2.2&q=*%3A*&start=0&rows=10&fl=*%2Cscore&qt=standard&wt=standard&explainOther=&hl.fl=

• The /solr/ is the web application context where Solr is installed on the Java servlet engine. If you have a dedicated server for Solr, then you might opt to install it at the root. This would make it just /. How to do this is out of scope of this book, but letting it remain at /solr/ is fine.
• After the web application context is a reference to the Solr core (we don't have one for this configuration). We'll configure Solr Multicore in Chapter 7, at which point the URL to search Solr would look something like /solr/corename/select?.
• The /select in combination with the qt=standard parameter is a reference to the Solr request handler. More on this is covered later under the Request Handler section. As the standard request handler is the default handler, the qt parameter can be omitted in this example.
• Following the ?, is a set of unordered URL parameters (aka query parameters in the context of searching). The format of this part of the URL is an & separated set of unordered name=value pairs.
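The URL structure described above can be taken apart with Python's standard urllib.parse module, which also decodes percent-encoded values such as %3A and %2C. This is a small sketch rather than anything Solr-specific, and the function name split_solr_url is hypothetical.

```python
from urllib.parse import parse_qs, urlsplit

# The search URL produced by the query form in the walk-through.
URL = (
    "http://localhost:8983/solr/select?indent=on&version=2.2&q=*%3A*&start=0"
    "&rows=10&fl=*%2Cscore&qt=standard&wt=standard&explainOther=&hl.fl="
)


def split_solr_url(url):
    """Return the path and a dict of decoded query parameters.

    keep_blank_values=True preserves empty parameters such as explainOther.
    Only the first value of each parameter is kept, which is enough here
    because no parameter is repeated in this URL.
    """
    parts = urlsplit(url)
    params = parse_qs(parts.query, keep_blank_values=True)
    return parts.path, {name: values[0] for name, values in params.items()}


path, params = split_solr_url(URL)
print(path)          # web application context plus request handler path
print(params["q"])   # decoded query
print(params["fl"])  # decoded field list
```

Because the parameters are unordered name=value pairs, a dictionary is a natural representation for them.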
As the form doesn't have an option for all query parameters, you will manually modify the URL in your browser to add query parameters as needed.

Remember that the data in the URL must be URL-Encoded so that the URL complies with its specification. Therefore, the %3A in our example is interpreted by Solr as :, and %2C is interpreted as ,. Although not in our example, the most common escaped character in URLs is a space, which is escaped as either + or %20. For more information on URL encoding see http://en.wikipedia.org/wiki/Percent-encoding.

[...]

Request handlers

Querying Solr, and most other interactions with Solr including indexing for that matter, is processed by what Solr calls a request handler. Request handlers are configured in the solrconfig.xml file and are clearly labeled as such. Most of them exist for special purposes like handling a CSV import, for example. Our searches in this chapter have...

... problem, consider the search for: SMASH~ There is an artist named S.M.A.S.H., and our analysis configuration emits smash as a term. So SMASH would be a perfect match, but adding the tilde results in a search term in which every character is different due to the upper/lower case difference, and so this search returns nothing. As with wildcard searches, if you intend on using fuzzy searches then you might...

Summary

At this point, you've learned the basics of searching in Solr, from query parameters to interpreting the search results to nearly the full gamut of Solr's query syntax to the essential factors of scoring. We have spent a lot of time on the query syntax because you'll see the syntax pop up in several places across Solr, not just the user's query. Such places include filtering the...
Enhanced Searching

So we've got the searching basics down, and we even know a thing or two about more advanced topics like scoring. In this chapter, we'll extend the searching topic into more advanced features of Solr's searching capabilities, such as:

• function queries
• the Dismax query handler
• faceting

Function...

... you won't need a custom output format since you'll be writing the client and can use a Solr integration library like SolrJ or just talk to Solr directly with an existing response format. If you do need to support a special format, then you have three choices. The most flexible is to write the mediation code to talk to Solr that exposes the special format/protocol. The simplest, if it will suffice, is to use...

... if all of the search terms or just one of the search terms, respectively, need to match. If this isn't present, then the default is specified near the bottom of the schema file (an admittedly strange place to put the default).

• df: The default field that will be searched by the user query. If this isn't specified, then the default is specified in the schema near the bottom in the defaultSearchField element...

Basic Searching

Observe that the date format is the full ISO-8601 date-time in GMT, which Solr mandates (the same format used by Solr to index dates and that which is emitted in search results). The fractional seconds part (milliseconds) is actually optional. The [ and ] brackets signify an inclusive...
... though there is only one term: id:"Artist:11650"

Filtering

Filtering in Solr is really quite simple. Let's say you are dispatching a user's query to Solr, but you want to limit the scope of that query further than what the query might be doing. As an example, let's say we wanted to make a search form for MusicBrainz that lets the user search for bands, not individual artists. Let's also say that the user's...

... defaultSearchField element. If that isn't specified, then an unqualified query clause will be an error.

Searching more than one field

In order to have Solr search more than one field, it is a common technique to combine multiple fields into one field (indexed, multi-valued, not stored) through the schema's copyField directive, and search that by default instead. Alternatively, you can use the dismax query type through...

... unchangeable
• Registering Solr search components such as faceting and highlighting

Create a request handler configuration for your application

Instead of using the standard request handler for the application you are building, it is a good idea to create a request handler just for your application, perhaps even several to satisfy multiple search forms. In doing so, you can change various search aspects...