Search-Driven Business Analytics
Designing a New Search Engine for Data

Andy Oram

Copyright © 2015 O’Reilly Media, Inc. All rights reserved.

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editor: Shannon Cutt
Interior Designer: David Futato
Cover Designer: Randy Comer
Illustrator: Rebecca Demarest

August 2015: First Edition

Revision History for the First Edition
2015-09-02: First Release
2015-10-20: Second Release

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Search-Driven Business Analytics, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the
intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-491-93813-3

[LSI]

Table of Contents

Search-Driven Business Analytics
  A New Generation of Vendors Offering Interactive Visualizations
  Data Access Methods Are Being Transformed by Search
  Getting Insights from Diverse Data
  Interpreting User Input
  Translating Queries into Answers
  Validating Answers
  Creating the Simplicity of a Search-Like Query
  Creating Instant Visualizations
  Sharing Answers and Visualizations
  Bringing Search-Driven Analytics to the Masses

Search-Driven Business Analytics

We are all accustomed to instant results with the use of major web search engines. However, when we pull up a business intelligence (BI) product at work, the situation is quite different. In comparison to Internet services that we use every day, these products seem stiff and unresponsive. Business leaders are served with pre-built reports and dashboards put together by their BI teams, and they wait days or weeks to get reports on new inquiries about customers, products, or markets. Thus, when a business manager moves from Facebook, Amazon.com, or Google to her BI tool, it feels like time travel back to a different century.

This report examines what it takes to make business intelligence as simple and responsive as today’s consumer search engines, where the user gets answers and visualizations as quickly as questions come to mind. We’ll look at:

• The convergence of BI and search
• What a search-driven user experience looks like
• The intelligence required for analytical search
• Data sources and their associated data modeling requirements
• Turning on-the-fly calculations into visualizations
• Applying enterprise scale and security to search

The techniques described here are general and draw on well-established practices in the field. The main reference platform for this report is the ThoughtSpot Analytical Search Appliance. The
author will also incorporate information gleaned from discussions with technical staff from Microsoft’s Power BI service and from Adatao, a firm that offers collaborative and predictive analytics.

A New Generation of Vendors Offering Interactive Visualizations

ThoughtSpot’s Analytical Search engine allows users to ask ad hoc questions of their data through a search interface. The engine computes results on the fly based on the search query and offers visualizations of interest to the user. Its interactive interface lets you search through billions of rows from any data source.

Figure 1. Data display in ThoughtSpot

Microsoft’s Power BI service lets you quickly create dashboards, share reports, and directly connect to (and incorporate) all the data available within the organization, through partners, or publicly posted to the Internet. Power BI Desktop enables you to transform data and create reports and visualizations. Figure 2 shows a typical dashboard created in the Desktop.

Figure 2. Dashboard produced by Microsoft Power BI

Adatao takes a problem-solving approach to all data, big and small, where the user starts with a hypothesis and pulls answers out of data sources to validate or invalidate the hypothesis. Figure 3 shows typical output from Adatao, known as a narrative, which enables data discovery and presentation in the form of attractive visualizations.

Figure 3. Narrative produced by Adatao

Data Access Methods Are Being Transformed by Search

So how have these new-generation technologies transformed data interaction for the business user?
An enlightening analogy can be drawn between the way managers use BI today and how information access on the Internet has evolved.

Typically, a manager at a data-rich company has access to certain canned business reports. The managers have generated a list of business questions such as “a chart showing the product revenue from each store, to compare same-store sales year-by-year” and a programmer has dutifully coded up an analytics application to provide those answers. If the business managers want a different report containing metrics and relationships not provided ahead of time, a recoding effort is involved. This severely limits the data analysis systems, leaving them unresponsive to intuitive questioning by the business managers. The systems and humans are operating at very different paces in this world of old-generation BI software.

In terms of the Internet’s evolution, this situation resembles the sites that curated content for users more than a decade ago. Users would subscribe to forums to find out what was new. Hot products like Encarta (introduced by Microsoft in the early 1990s when the Web was quite young) provided predetermined sets of information in an encyclopedia format. Getting access to these resources was much easier than pacing through the card catalog of one’s local library, but they opened access only to a limited set of information chosen by the site. Existing BI reports are similar to these offerings in their inelasticity and their lack of the real-time interactivity that business users need.

The advent of the AltaVista search engine, and subsequently Google, transformed information access. The search engines didn’t add a jot to the information already available. But they radically broadened the sites to which we had access, and put us only a few seconds and a few clicks away from the wealth of information and opinions on the Web. Immediate options are now taken for granted as we search an online bookseller for books, a travel site
for hotels and airline tickets, etc. Within minutes we sample a mind-boggling range of opinions from around the world, whether the subject is the best data store for fast-moving input or the latest sports news.

[...] relationship diagram created by Power BI to represent an incoming schema.

Figure 4. Schema in Power BI

In Power BI, the user can also attach to a stream of incoming data and see a dashboard updated in real time as new data comes in. The user can then provide this dashboard to colleagues, by sending an email with a URL or through SharePoint, and they too can see real-time changes.

Power BI takes integration further by supporting single sign-on. For instance, a user would log into Power BI and enter her Salesforce credentials. After this, the user just needs to log into Power BI for future sessions and would be able to search Salesforce without reauthorizing the connection.

Interpreting User Input

Let’s see how the solutions in this report handle user questions.

Power BI and Adatao estimate a user’s intent using natural language processing (NLP) techniques. They accept a range of relatively free text and resolve ambiguities by examining the context of the words used.

ThoughtSpot, on the other hand, chose not to use NLP in order to remove any chance of ambiguity. ThoughtSpot’s search engine guides the user as they type with intelligent search suggestions, making sure that the user’s intent and the search engine are always in sync. As such, ThoughtSpot is always able to provide a single, accurate result, rather than a list of probabilistic answers.

All of these tools are fault-tolerant at the user-input level, allowing users to get to answers even with misspellings, changed word orders, or incorrect grammar. The tools can execute the kind of type-ahead autocomplete that Google has made familiar (see Figure 5).

Figure 5. Autocomplete in ThoughtSpot

Figure 6 shows the output of an NLP query in Power BI.

Figure 6.
Natural-language query in Power BI

Figure 7 shows a typical set of relevant business questions suggested by Adatao.

Figure 7. Adatao search suggestions

To recognize natural-language phrases such as “What is the average cost per trip by region of travel?”, Power BI incorporated advanced technology from other Microsoft tools, notably Bing. Corrections to spelling and alternative columns can be presented to the user.

As we have seen, the ThoughtSpot Analytical Search Appliance can handle a wide range of user requests and help the user structure her queries. Let’s focus on a simple request such as “Revenue California 2015 county.” If the user types “Cal,” the engine fills in “California” as a suggested completion. The algorithms that calculate and rank the completions take into account many factors, including how often a word shows up in the data (its cardinality) and how often people have searched for it. As the product gets used, the suggestions get more relevant and personalized to each user, as with search engines like Google.

To facilitate this type of personalization, index matching has to support exact matches as well as prefix, suffix, and substring matches; it also looks for synonyms. If there are no matches (for example, if the user makes a typographical error), the engine offers suggestions based on spellchecking and on phonetic matching algorithms such as Metaphone. While performing these matches over potentially billions of rows of data, the engine also needs to apply sophisticated row-level, column-level, and object-level security rules so that only the entities the user is allowed to see are visible, even in the search suggestions.

Within the ThoughtSpot Analytical Search Appliance, when a user types “2015,” the engine knows that the text refers to a year, not a product part number or some other arbitrary number. The engine can predict this with high accuracy because 2015 appears frequently
in a Year column in a database it indexed.

A crucial prerequisite for joining data to respond to user queries is to recognize relationships. The “California” and “2015” in the user’s query lead the engine to filter the data so it uses only the rows that are related to California and are from the year 2015. In our “California 2015” example, ThoughtSpot can determine that a relationship exists if a foreign key connects two tables.

The interface offers suggestions in a dropdown box as the user types, and the user can immediately choose the one she intends. For instance, as the user types “Revenue California,” the interface suggests several completions such as “Revenue California by county” and “Revenue California by customer,” drawing on its knowledge of the columns in the database. The suggestions include those generated through the analytical search algorithm already described, as well as those generated by typical document search techniques, such as those implemented in Apache Lucene. This instant responsiveness keeps the user and engine in lockstep. It allows the user to focus on her thought process instead of her interactions with the engine. It also allows her to create a new answer or consume a saved answer based on what she’s looking for, without having to limit herself to saved charts and dashboards.

The ThoughtSpot engine produces only suggestions that adhere to the security restrictions in force. If a user has not been granted access to a column, it is not used to generate search suggestions, let alone produce results.

In our search for “Revenue California 2015 county,” the engine computes that the data in the joined State and Year columns should be grouped by county in the display. The ThoughtSpot user interface also recognizes common aggregate functions, such as “sum” or “standard deviation,” and computations such as “growth of” that are complex to express in SQL. Figure 8 shows the results of a complex calculation in ThoughtSpot.

Figure 8. Monthly Sales
Growth chart example

Translating Queries into Answers

Programmers reading this account will quickly see that the services in this report manipulate SQL behind the scenes, generating relational database queries of considerable complexity and sophistication. But the user doesn’t have to think in terms of relational data or SQL at all. The inputs are mildly structured but close to everyday language, which was the original premise of SQL in the 1970s.

In the 1980s and 1990s, a number of products promised a “natural-language-on-SQL” approach, but failed to meet the market’s need. The query suggestion/completion interfaces implemented by the services in this report are along the lines of popular search engines, and add a crucial missing piece to those older approaches. It turns out that the query suggestion/completion interface is a significant factor in helping users effectively go from thought to question to answer. Figure 9 shows how ThoughtSpot extends the user’s query with suggestions.

Figure 9. Search suggestions are refined as you type your query

We have assumed so far that a column named “county” is in some input table. However, if the column has some other name (say, “region”), the services in this report allow a user or administrator to define synonyms, so that they can map an oddly named column (such as “cust_reg”) to words in everyday language (such as “region”). Power BI, for instance, lets users do this through PowerPivot in Excel. ThoughtSpot allows administrators to relate column names to synonyms. In this hypothetical case, the administrator could indicate that when a user requests a “county,” the engine should map that to the “region” column from the input. ThoughtSpot also uses synonym sets and other matching algorithms to offer meaningful suggestions based on what the user meant, and lets the user pick the correct choice to move forward with the query computation.

All three tools in this report, working with the original data sources,
let the user “slice and dice” data through filtering and drill-down operations (by region, product segment, etc.). With ThoughtSpot, users can also slice and dice directly in the search bar by adding or removing search terms.

In short, a responsive user interface should, in real time, compare user inputs to both column names and values in the input databases. It should be able to make a savvy guess as to what column the user wants and offer that as a higher-ranked suggestion, based both on exact matches and on considerations such as which columns contain the most rows containing “California” or “2015” as a value. The user can pick the suggestion that matches what she is looking for and disambiguate the request.

Validating Answers

The final piece of input interpretation is helping users verify the intermediate steps that the product used to arrive at a result. This helps adoption because users can trust the results they see once they have verified the data sources used to compute the answer.

When a search result is shown, the user is also given an option to hover over each search term and understand the lineage of the data (which source table and column it came from). For example, the user could see whether she has chosen revenue data from an official data source such as the data warehouse, or from a spreadsheet shared with her by a coworker. Each source and object in the system can also be “tagged” to show its associations (e.g., marketing, sales), and these tags can help the user understand which data sources she picked to arrive at the answer in front of her.

In Figure 10, a ThoughtSpot user has selected the “store region” part of the query for deeper investigation.

Figure 10. User selects parts of a ThoughtSpot query to delve into

A button next to the search box lets the user translate the search string to an almost plain-English form that explains how the different tables were joined, what filters were
applied, and what final result was computed. Figure 11 shows the internal information that ThoughtSpot displays about the “store region” part of the query in Figure 10.

Figure 11. Delving into a ThoughtSpot query

This helps business users gain confidence that the product is indeed performing the computations the way they expect it to. Users can share their query answers with BI analysts to reconcile any differences. For example, the business user might have wanted to see results by order date, while the BI analyst built her report using the ship date; the key is that the two are different. By looking at the output provided by ThoughtSpot, the user can see which date column was used (here, the order date) and can switch to the ship date column if that gives the answer she wants.

Creating the Simplicity of a Search-Like Query

To show how a search interface can form and execute a query while totally hiding the complexity of the schema and SQL, we’ll track the ThoughtSpot Analytical Search engine through its underlying processes when handling a user query.

Say we have two fact tables called Contacts and Sales Details, along with a dimension table called the Phone table that connects the other two. Assume that for each category and product, we want to find the following:

• Number of unique phone numbers contacted
• Number of contacts made
• How many clicks were counted
• How many sales were made
• Total revenue

With ThoughtSpot, the user just needs to type these terms into the interface and all the complex joins happen in the backend. The query would be: “count phone count contact count sale clicks revenue category product”

If you were to write the full SQL for something like this, it would look like:

[SQL listing not recovered]

This search brings together data from the following tables:

[Table list not recovered]

With that data, ThoughtSpot does all the complex joins and
produces the result in Figure 12.

Figure 12. ThoughtSpot produced this result after hiding all the complex logic under the hood

The user can ask ThoughtSpot to explain how it put together and interpreted the data, and receive the display in Figure 13.

Figure 13. Explanation of how ThoughtSpot performed the query

Creating Instant Visualizations

We all like aptly chosen charts and figures that show us trends, and business intelligence solutions excel at creating these. A search-driven business intelligence engine should therefore choose appropriate visualizations instantly, making a best guess at what relationships the user wants to see and how they should be compared.

With all the tools discussed in this report, the types of data that a user has entered into the search bar automatically determine the chart type that gets plotted by default. For example, if a user looks for revenue over a particular time period, a line chart might be picked automatically, with time shown along the X axis and revenue along the Y axis. If the user were to look at revenue by store location, a bar chart might be chosen instead, with stacked bars if the user wanted to further subdivide results by product category. When there are two measures, such as when charting GDP of countries and life expectancies, a scatterplot is the first choice. Three continuously changing variables can be visualized through a bubble chart. Any geographical information, such as ZIP codes or latitudes and longitudes, will automatically get displayed on a map.

Figure 14. Sample visualization in ThoughtSpot

As with the query, the visualization can be changed by the user in real time. The user can simply click on a chart and pull up a menu of possible charts and display options. In addition to the types of data included in a search, the algorithm also looks at factors such as cardinality of the different attributes to determine the
best visualization to represent the data.

Having found an appropriate query and visualization, the user can save it for future use by “pinning” it to a dashboard, a feature similar to “pinning” photos on sites such as Pinterest. In ThoughtSpot, charts, tables, and summary statistics can all be pinned to the “pinboard,” and organized in the best sequence to support whatever story the user has in mind. Each time a user views a chart, ThoughtSpot refreshes it with current data.

Sharing Answers and Visualizations

For a search experience to feel complete, it needs to provide you with the ability to express your thoughts, create answers, and share those with others.

Today’s business intelligence world has a fragmented approach to the full workflow that starts when a business user thinks of a question and continues through to when she shares her findings with others. The user expresses her question in some documented form for a BI analyst, and then waits for a dashboard to be built. Then she looks at the dashboard, slices and dices based on the limited options she has, and if she finds what she was looking for, she saves the charts to PowerPoint for presentation.

Once a user has created a visualization in ThoughtSpot, she can share the answer or dashboard with anyone in the enterprise that she’s allowed to share it with, using share options in the product. She can also provide the URL for the shared answer to anyone who has group-level privileges to see that answer.

ThoughtSpot also offers a full-screen presentation mode so that charts can be presented directly from the product, cutting down the time between search and discussion in a meeting. This provides the added advantage of being able to edit search queries live in a meeting in case questions come up regarding any particular charts or tables.

Bringing Search-Driven Analytics to the Masses

Intelligent search, instant visualization, and the ability to validate and
verify answers provide the foundations for a search-driven data analytics product. The goal is an easy entry point to doing research with the organization’s data without requiring training.

Today, search has also become synonymous with speed at scale. A business user should be able to access all of her enterprise data, even if it is billions of rows, with the same speed as looking up the weather on Google. Therefore, to get mass adoption, search-driven analytics products need to be architected from the ground up for scale (e.g., terabytes of data being accessed by all the users in a company).

Given the recent industry trends, a significant shift in direction for business intelligence is in the works. This shift is being led by search. Within the next decade, search will transform the world of business data as it did the world of public data over the last two decades.

About the Author

Andy Oram is an editor at O’Reilly Media. An employee of the company since 1992, Andy currently specializes in open source technologies and software engineering. His work for O’Reilly includes the first books ever released by a US publisher on Linux, the 2001 title Peer-to-Peer, and the 2007 bestseller Beautiful Code.