MarkLogic Cookbook
Documents, Triples, and Values: Powering Search

Dave Cassel

Beijing Boston Farnham Sebastopol Tokyo

MarkLogic Cookbook
by Dave Cassel

Copyright © 2017 O’Reilly Media. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editor: Shannon Cutt
Production Editor: Kristen Brown
Copyeditor: Sonia Saruba
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest

August 2017: First Edition

Revision History for the First Edition
2017-06-09: Part
2017-08-16: Part

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. MarkLogic Cookbook, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-491-99458-0
[LSI]

Table of Contents

Introduction
1. Document Searches
   Search by Root Element
   Find Documents That Are Missing an Element
2. Scoring Search Results
   Sort Results to Promote Recent Documents
   Weigh Matches Based on Document Parts
3. Understanding Your Data and How It Gets Used
   Logging Search Requests
   Count Documents in Directories
4. Searching with the Optic API
   Paging Over Results
   Group By
   Extract Content from Retrieved Documents
   Select Documents Based on Criteria in Joined Documents

Introduction

MarkLogic is a database capable of storing many types of data, but it also includes a search engine built into the core, complete with an integrated suite of indexes working across multiple data models. This combination allows for a simpler architecture (one software system to deploy, configure, and maintain rather than two), simpler application-level code (application code goes to one resource for query and search, rather than two), and better security (because the search engine has the same security configuration as the database and is updated transactionally whenever data changes).

The recipes in this book, the second of a three-part series, provide guidance on how to solve common search-related problems. Some of the recipes work with older versions of MarkLogic, while others take advantage of newer features in MarkLogic.

MarkLogic supports both XQuery and JavaScript as internal languages. Most of the recipes in this book are written in JavaScript, but have corresponding XQuery versions at http://developer.marklogic.com/recipes. JavaScript is very well suited for JSON content, while XQuery is great for XML; both are natively managed inside of MarkLogic.

Recipes are a useful way to distill simple solutions to common problems: copy and paste these into MarkLogic’s Query Console or your source code, and you’ve solved the problem. In choosing recipes for this book, I looked for a couple of factors. First, I wanted problems that occur with some frequency. Some problems in this book are more common than others, but all occur often enough in real-world situations that one of my colleagues wrote down a solution. Second, I selected four recipes that illustrate how to use the new Optic API, to help
developers get used to that feature. Finally, some recipes require explanations that provide insight into how to approach programming with MarkLogic.

Developers will get the most value from these recipes and the accompanying discussions after they’ve worked with MarkLogic for at least a few months and built an application or two. If you’re just getting started, I suggest spending some time on MarkLogic University classes first, then coming back to this material.

If you would like to suggest or request a recipe, please write to recipes@marklogic.com.

Acknowledgments

Many people contributed to this book. Erik Hennum provided code for Optic recipes and helped me understand what I needed to know in order to write up the discussions. Tom Ternquist provided the original version of the “Search By Root Element” recipe. Jason Hunter suggested the “Weigh Matches” recipe and provided ideas for “Logging Search Requests.” Puneet Rawal proposed “Count Documents in Directories.” Bob Starbird, Gabo Manuel, and Mae Isabelle Turiana reviewed the content. Diane Burley gave feedback on the content and made sure I actually got this done. Thank you to all!
CHAPTER 1
Document Searches

Finding documents is a core feature for searching in MarkLogic. Searches often begin with looking for simple words or phrases. Facets in the user interface, in the form of lists, graphs, or maps, allow users to drill into results. But MarkLogic’s Universal Index also captures the structure of documents.

The recipes in this chapter take advantage of the Universal Index to find documents with a specific root element and to look for documents that are missing some type of structure.

Search by Root Element

Problem

You want to look for documents that have a particular root XML element or JSON property and combine that with other search criteria.

Solution

Applies to MarkLogic versions and higher

    (:
     : Return a query that finds documents with
     : the specified root element
     :)
    declare function local:query-root($qname as xs:QName)
    {
      let $ns := fn:namespace-uri-from-QName($qname)
      let $prefix := if ($ns eq "") then "" else "pre:"
      return
        xdmp:with-namespaces(
          map:new(
            map:entry("qry", "http://marklogic.com/cts/query")),
          cts:term-query(
            xdmp:value(
              "xdmp:plan(/" || $prefix || fn:local-name-from-QName($qname) || ")",
              map:entry("pre", $ns)
            )/qry:final-plan//qry:term-query/qry:key
          )
        )
    };

You can then call it like this:

    declare namespace ml = "http://marklogic.com";

    cts:search(
      fn:doc(),
      cts:and-query((
        local:query-root(xs:QName("ml:base")),
        cts:collection-query("published")
      ))
    )

Discussion

It’s easy to find all the documents that have a particular root element or property: use XPath (/ml:base). However, that limits the other search criteria you can use. For instance, you can’t combine a cts:collection-query with XPath. What we need is a way to express /ml:base as a cts:query.

The local:query-root function in the solution returns a cts:term-query that finds the target element as a root. We’re using a bit of trickery to get there (including the fact that cts:term-query is an undocumented function). Let’s dig in a bit deeper to see what’s happening. We
can use xdmp:plan to ask MarkLogic how it will evaluate an XPath expression like this:

    declare namespace ml = "http://marklogic.com";
    xdmp:plan(/ml:base)

The result looks like this (note that if you run this, the identifiers will be different):

... capability, as opposed to working with the REST API, then you can combine logging the search parameters with executing the search request.

    import module namespace search = "http://marklogic.com/appservices/search"
      at "/MarkLogic/appservices/search/search.xqy";

    let $query := xdmp:get-request-field("query", "")
    let $parsed := search:parse($query)
    let $user-info :=
    let $results := search:resolve($parsed)
    let $log :=
      xdmp:invoke(
        "/service/log-search.xqy",
        map:new((
          map:entry("query", $parsed),
          map:entry("total", $results/@total),
          map:entry("user", $user-info)
        )),
        <options xmlns="xdmp:eval">
          <isolation>different-transaction</isolation>
        </options>
      )
    return $results

Discussion

The code above gets the query string with xdmp:get-request-field. The Search API parses the query, which will be used both for logging and for running the actual search. The $query is passed to search:parse, which interprets the query and generates a serialized (XML) version of it. The serialized version can then be passed into the logging process, as well as sent to search:resolve for execution.

The details of what gets logged and where it is stored (the implementation of /service/log-search.xqy) are beyond the scope of this recipe; your requirements will determine what user information you need to capture and how you want to store it. You probably have user profile documents that you could add to. Alternatively, you might record the information as managed triples, eliminating the worry of individual profile documents getting too large.

The key point to notice is that logging is done in a separate transaction with xdmp:invoke. This is important to minimize the locking required. A search can run as a query, while logging the search
has to run as an update. This transaction type has an impact on what type of locks are required.

MarkLogic handles transactions using Multi-Version Concurrency Control (MVCC). Each version of a document has a creation timestamp and a deletion timestamp. To change a document, a new version is created; when it’s time to commit, the deletion timestamp of the earlier version and the creation timestamp of the new version are both set to the current system timestamp. An update statement is assigned a timestamp when it commits. Conversely, a query statement runs at the latest committed timestamp available at the time the statement begins work.

With that in mind, we can take a closer look at the locks required to do a search and log it. The search is a query statement, running at a particular timestamp. Because of MVCC, we know that nothing will change in the database at that timestamp, so no locks are needed. This is especially useful for a search, which could potentially touch a large number of documents.

The logging portion needs to make a transactional update to the database, which requires write locks for any documents that will be updated, as well as read locks. By running the update in a separate transaction and passing in the required data, the documents touched by the search don’t need to be locked, only the documents that will record the information about the search.

Having the logging done in the same request (though a different transaction) from the search means that there is more work for the database to do than if the logging is done separately from the search request. In practice, this has very little impact (thanks to eliminating the need for most locks); however, it’s worth considering why this is better than other strategies.

The simplest approach would be to skip the separate transaction and do all the work in one. Hopefully the locking discussion above shows why that’s not ideal.

You might decide to not wait for the logging to finish before moving on to the next
request, spawning the log process instead of invoking it. This puts the logging request on the Task Server, which seems like a win: the logging will be done asynchronously, without making the search results wait. However, there are a couple of risks. On a busy system, the Task Server queue could potentially fill up, losing requests. Also, the queue is not persisted, so if the server goes down, the logging information will be lost.

See Also
• Documentation: “Search API: Understanding and Using” (Search Developer’s Guide)
• Documentation: “Understanding Transactions in MarkLogic Server” (Application Developer’s Guide)

Count Documents in Directories

Problem

Get a count of how many documents are in each directory so that you or your users can understand your data set better.

Solution

Applies to MarkLogic versions and higher

    declare function local:map-uris($uris as xs:string*)
    {
      let $map := map:map()
      let $_ :=
        for $uri in $uris
        let $toks := fn:tokenize($uri, "/")
        for $t at $i in fn:subsequence($toks, 1, fn:count($toks) - 1)
        let $key := fn:string-join($toks[1 to $i], "/") || "/"
        let $count := (map:get($map, $key), 0)[1]
        return map:put($map, $key, ($count + 1))
      return $map
    };

    local:map-uris(cts:uris())

Required Index
• URI lexicon

Discussion

MarkLogic allows you to segment content by collections and by directories. If you’re using directories, it can be helpful to know how many documents are in a directory. This information might be used just by you, as a content manager, or presented to your end users.

The local:map-uris function is given a sequence of URIs and returns a map. The keys of the map are the directories, starting with the root (“/”). The results are deep counts, showing the number of documents somewhere under a directory. A URI of “/a/b/c/1.xml” will contribute one count to “/”, “/a/”, “/a/b/”, and “/a/b/c/”. The count for “/” will match the total number of URIs passed in, assuming
that all URIs are in the root directory. Thus, for any particular directory, the count will be the same number as found by cts:directory-query($dir, "infinity").

Rather than running on all URIs, you can pass a query to cts:uris, which will run unfiltered. This could be useful in building a multitier facet to give users information about available content.

If you run this on a large database, there’s a good chance that it will time out. In that case, you might want to make separate calls for each of your top-level directories using cts:uri-match().

See Also
• Documentation: “Collections Versus Directories” (Search Developer’s Guide)
• Documentation: “Understanding Unfiltered Searches” (Query Performance and Tuning Guide)

CHAPTER 4
Searching with the Optic API

The Optic API, introduced in MarkLogic 9, implements common relational operations over data extracted from documents. This chapter illustrates how to accomplish some common tasks, which should help in transitioning to MarkLogic from a relational background.

Paging Over Results

Problem

An Optic API query returns a large result set. Get a stable set of results a page at a time.

Solution

Applies to MarkLogic versions 9 and higher

In its simplest form:

    const op = require('/MarkLogic/optic');

    const pageSize = ;
    const pageNumber = ;

    op.fromTriples([ ])
      .offsetLimit(op.param('offset'), pageSize)
      .result(null, {offset: pageSize * (pageNumber - 1)})

Expanded out to a main module, which allows the caller to specify the page and a timestamp, the recipe looks like this:

    const op = require('/MarkLogic/optic');

    const pageNumber = parseInt(xdmp.getRequestField('page', '1'), 10);
    const pageSize = 1000;
    let timestamp = xdmp.getRequestField('timestamp');
    if (timestamp === null) {
      timestamp = xdmp.requestTimestamp();
    }

    const response = {
      timestamp: timestamp,
      page: pageNumber,
      results: xdmp.invokeFunction(
        function() {
          return op.fromTriples([ ])
            .offsetLimit(op.param('offset'), pageSize)
            .result(null, {offset: pageSize * (pageNumber - 1)});
        },
        { timestamp: timestamp }
      )
    };

    response

Required Privilege
• xdmp:timestamp

Discussion

Sometimes your result set will be bigger than you want to return in a single request. Paging solves this problem by having the caller request successive pages until all results have been returned. This means that no individual response is too big, but all results are returned.

One of the challenges with paging is the risk that the underlying data set may change, with the result that a row might be skipped or repeated. In this recipe, we’re working through a large set of triples by calling op.fromTriples, but the same principles apply if calling op.fromLexicons, op.fromLiterals, or op.fromView.

This recipe prevents the changing data set problem using point-in-time queries. If you aren’t familiar with how timestamps are managed in MarkLogic, read “Understanding Point-In-Time Queries” in the Application Developer’s Guide.

By using point-in-time queries, we can ask for a batch of results in one request, process them, then ask for the next batch, knowing that the list will not change in between. Using the main module version of the recipe, the caller is able to specify the page and the timestamp. The timestamp would not be sent with the first request, but the response will indicate at what timestamp the query was run. Subsequent calls can include this to ensure stable results.

Point-in-time queries work the same for Optic API queries as they do for others; the differences are in how the data set is gathered and paged.

Note that the REST API provides its own ways to manage timestamps. For example, take a look at the POST /v1/rows endpoint, paying attention to the timestamp parameter and the ML-Effective-Timestamp header.

As with any point-in-time query, one caveat is that the caller should finish before MarkLogic’s merge timestamp catches up to the request timestamp. In practice, this is unlikely to be a problem;
if it becomes one, you may need to take control of the merge timestamp to ensure the results remain available.

The offsetLimit call has a reference to op.param('offset'). This could have been written with the offset value in place; however, writing it with a variable allows MarkLogic to cache and reuse the query. MarkLogic analyzes the query and builds up a plan. By parameterizing it, this plan can be reused, enabling faster execution.

The caller will need to determine when all results have been provided by watching for an empty result set. While some MarkLogic searches provide an estimate of the total number of results, estimating rows is harder than estimating search results because the pipeline of operations can produce more or fewer output rows than input rows. Even with an estimate, that would not be an exact count, so iterating until empty would be necessary regardless.

See Also
• Tutorial: “Optic API”
• Documentation: “Optic API for Relational Operations” (Application Developer’s Guide)
• Documentation: “Invoking an Optic API Query Plan” (REST Application Developer’s Guide)

Group By

Problem

Your data has properties for names and counts; you want the sums of the counts grouped by name.

Solution

Applies to MarkLogic versions and higher

Assuming you want sums of a property called amount grouped by a property called name:

    const op = require('/MarkLogic/optic');

    op.fromLexicons({
        name: cts.jsonPropertyReference('name'),
        amount: cts.jsonPropertyReference('amount', ["type=int"])
      })
      .groupBy('name', [
        op.sum('totalamount', 'amount')
      ])
      .result();

To see this run, you can create some sample docs to play with:

    declareUpdate();

    for (let index=0; index < 1000; index++) {
      xdmp.documentInsert(
        '/values' + index + '.json',
        {
          "name": "val" + (xdmp.random(2) + 1),
          "amount": xdmp.random(2) +
        }
      )
    }

The response I got is below. Yours might be different due to the xdmp.random calls.

    { "name": "val1", "totalamount": 712 },
    { "name": "val2", "totalamount": 659 },
    { "name": "val3", "totalamount": 649 }

Required Indexes
• range index on name
• range index on amount

Discussion

MarkLogic provides group-by with the Optic API, which performs relational operations over data extracted from documents. Optic can work with range indexes, using op.fromLexicons as shown above, but it can also calculate group-bys using op.fromTriples or op.fromView, to work with triples or rows, respectively.

Optic’s groupBy takes a field that you want to group by, along with instructions on how to aggregate the related values. For the sum aggregator, we provide a new name for the aggregated values, along with the source field.

The groupBy function documentation lists the aggregation functions you can use. If you need an aggregation function other than the ones provided, op.uda allows you to call a user-defined function.

See Also
• XQuery version: Group by sum
• Blog Post: “SQL’s ‘Group By’ the MarkLogic way” (for MarkLogic 8)
• Documentation: “Optic API for Relational Operations” (Application Developer’s Guide)
• Documentation: “Invoking an Optic API Query Plan” (REST Application Developer’s Guide)

Extract Content from Retrieved Documents

Problem

You have used Template Driven Extraction (TDE) to extract data from documents into the row index, but there are additional pieces of information you want to return with a particular Optic query.

Solution

Applies to MarkLogic versions and higher

Consider a schema called content with a view called books. The view has columns title, author, and price. The documents also have an element called summary that you decided not to put in the view, as it won’t be used very often. For this query, however, you want to include it in the results.

    const op = require('/MarkLogic/optic');

    const docId = op.fragmentIdCol('docId');

    op.fromView('content', 'books', null, docId)
      .where(op.gt(op.col('price'), 20))
      .joinDoc('doc', docId)
      .select([
        'title', 'author', 'price',
        op.as('summary', op.xpath(op.col('doc'), '/doc/summary/fn:string()'))
      ])
      .result();

Required Index
• TDE-extracted rows

Discussion

The Optic API is useful for executing relational operations on values; however, sometimes we need to pull in additional data from the source documents. It makes sense to pull scalar values into the row index, where we can do calculations and aggregations on them, but content with substructure, such as XML with markup, doesn’t really benefit. The example presented here is relatively simple, but supplementing the results with content from documents can be done with much more complex queries, involving joins, aggregates, or other operations.

The op.fromView call identifies the schema and view that we’ll draw data from. We also use op.fragmentIdCol to let Optic know that we want to work with the source documents. In this case, I’ve used the name “docId,” but the name isn’t meaningful as we won’t be using it once we get our results.

Before we actually join the documents to our rows, we should filter down to just those rows we need, in order to avoid reading more documents from disk than is necessary. In the recipe above, only books with prices greater than 20 are included.

Having told Optic what data to work with, joinDoc tells Optic to use the docId to include the document content in a column called doc.

Finally, select specifies what columns to return in the result set. The as clause tells Optic to make a summary column based on an XPath statement, run against the document associated with the current row.

When using this technique, we’ll see columns from the row index, as well as the specified additional data from the documents, in this case the summary. It’s important to remember that to get the summary, MarkLogic had to load the entire document, then use XPath to select just a part of it. This is similar to any other search where we return just part of a search result. To ensure this performs well, use a where clause to
reduce the number of documents that need to be retrieved. Consider using paging if you still get a lot of results. If the documents are large, consider adding the additional data to the row index instead of retrieving it separately.

See Also
• XQuery version: “Extract Content from Retrieved Documents”
• Recipe: “Page over Optic API results”
• Documentation: “Optic API for Relational Operations” (Application Developer’s Guide)
• Documentation: “Invoking an Optic API Query Plan” (REST Application Developer’s Guide)

Select Documents Based on Criteria in Joined Documents

Problem

Select documents based on criteria that are located in different documents. You have Template Driven Extraction (TDE) templates that extract data from both document types into views.

Solution

Applies to MarkLogic versions and higher

This recipe assumes that you have two types of documents: one describing factories, and one describing events (problems) that happen at those factories. A TDE template has extracted a factory ID and the state where the factory is located into the factories view of the manufacturing schema. Another template has extracted columns from the events documents: factoryFKey (factory foreign key), severity (of the event), and status (Active or Resolved). These columns are in the events view.

    const op = require('/MarkLogic/optic');

    const docId = op.fragmentIdCol('docId');

    op.fromView('manufacturing', 'events')
      .where(
        op.and(
          op.eq(op.col('severity'), 1),
          op.eq(op.col('status'), 'Active')
        ))
      .groupBy('factoryFKey')
      .joinInner(
        op.fromView('manufacturing', 'factories', null, docId),
        op.on('factoryFKey', 'factoryID')
      )
      .joinDoc('doc', docId)
      .select('doc')
      .result();

Required Indexes
• row index with data from TDE extracted into two related views (in this case, factories and events)

Discussion

The desired result from this query is the set of factory documents that have a severity 1, active
event going on. The severity and status are in the events view, so we need to join the views to get to the information we want.

The query starts with the events view, as that is where the search criteria are found. The where clause limits the rows that we need to work with prior to doing the join, which is based on the factory ID columns. This early scoping is important for performance.

Once the selection is made and the join between the views is done, we join with the documents. Notice that as part of the joinInner, the op.fromView that connects to the factories view refers to docId. This is how the joinDoc call knows which document to retrieve.

The final step is to select the doc column and return just that. Optic will return a Sequence, which your code can then iterate over.

See Also
• XQuery version: “Select Documents Based on Criteria in Joined Documents”
• Tutorial: “Template Driven Extraction”
• Documentation: “Optic API for Relational Operations” (Application Developer’s Guide)
• Documentation: “Invoking an Optic API Query Plan” (REST Application Developer’s Guide)

About the Author

David Cassel is the Technical Community Manager for MarkLogic, where he educates developers, architects, and DBAs about how to use MarkLogic to implement data integration solutions. David has more than 20 years’ experience as a developer, building applications ranging from quick proof-of-concepts to production systems for customers across such verticals as public sector, financial services, medical, and telecommunications. He has also built a number of developer productivity tools, some of which can be found on GitHub.

Besides building applications, Dave creates and delivers educational material in a variety of formats, including blog posts, YouTube videos, and Meetup presentations.

... which book you want to see first?
Probably the one with the title match, since the title will likely have key terms in it. The summary is a bit bigger, but describes the general purpose of the...

... xdmp:value? We can run xdmp:plan with an explicit XPath expression, but if we want to work with a dynamic path (provided at runtime), then we can’t build a string and pass it to xdmp:plan. However, we