Search-Driven Business Analytics
Designing a New Search Engine for Data
Andy Oram

Search-Driven Business Analytics
by Andy Oram

Copyright © 2015 O’Reilly Media, Inc. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editor: Shannon Cutt
Interior Designer: David Futato
Cover Designer: Randy Comer
Illustrator: Rebecca Demarest

August 2015: First Edition

Revision History for the First Edition
2015-09-02: First Release
2015-10-20: Second Release

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Search-Driven Business Analytics, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-491-93813-3
[LSI]

Search-Driven Business Analytics

We are all accustomed to instant results with the use of major web search engines. However, when we pull up a business intelligence (BI) product at work, the situation is quite different. In comparison to Internet services that we use every day, these
products seem stiff and unresponsive. Business leaders are served with pre-built reports and dashboards put together by their BI teams, and they wait days or weeks to get reports on new inquiries about customers, products, or markets. Thus, when a business manager moves from Facebook, Amazon.com, or Google to her BI tool, it feels like time travel back to a different century.

This report examines what it takes to make business intelligence as simple and responsive as today’s consumer search engines, where the user gets answers and visualizations as quickly as questions come to mind. We’ll look at:

- The convergence of BI and search
- What a search-driven user experience looks like
- The intelligence required for analytical search
- Data sources and their associated data modeling requirements
- Turning on-the-fly calculations into visualizations
- Applying enterprise scale and security to search

The techniques described here are general and draw on well-established practices in the field. The main reference platform for this report is the ThoughtSpot Analytical Search Appliance. The author will also incorporate information gleaned from discussions with technical staff from Microsoft’s Power BI service and from Adatao, a firm that offers collaborative and predictive analytics.

A New Generation of Vendors Offering Interactive Visualizations

ThoughtSpot’s Analytical Search engine allows the user to ask ad hoc questions of their data through a search interface. The engine computes results on the fly based on the search query, and offers visualizations of interest to the user. It features an interactive interface that allows you to search through billions of rows and compute results on the fly from any data source.

Figure: Data display in ThoughtSpot

Microsoft’s Power BI service lets you quickly create dashboards, share reports, and directly connect to (and incorporate) all the data available within the organization, through partners, or publicly posted to the Internet. Power BI Desktop enables
you to transform data and create reports and visualizations. The accompanying figure shows a typical dashboard created in the Desktop.

Figure: Dashboard produced by Microsoft Power BI

Adatao takes a problem-solving approach to all data, big and small, where the user starts with a hypothesis and pulls answers out of data sources to validate or invalidate the hypothesis. The accompanying figure shows typical output from Adatao, known as a narrative, which enables data discovery and presentation in the form of attractive visualizations.

Figure: Narrative produced by Adatao

Translating Queries into Answers

Programmers reading this account will quickly see that the services in this report manipulate SQL behind the scenes, generating relational database queries of considerable complexity and sophistication. But the user doesn’t have to think in terms of relational data or SQL at all. The inputs are mildly structured but close to everyday language, which was the original premise of SQL in the 1970s. In the 1980s and 1990s, a number of products promised a “natural-language-on-SQL” approach, but failed to meet the market’s need. The query suggestion/completion interfaces implemented by the services in this report are along the lines of popular search engines, and add a crucial missing piece to those older approaches. It turns out that the query suggestion/completion interface is a significant factor in helping users effectively go from thought to question to answer. The accompanying figure shows how ThoughtSpot extends the user’s query with suggestions.

Figure: Search suggestions are refined as you type your query

We have assumed so far that a column named “county” is in some input table. However, if the column has some other name (say, “region”), the services in this report allow a user or administrator to define synonyms, so that they can map an oddly named column (such as “cust_reg”) to words in everyday language (such as “region”). Power BI, for instance, lets users do this through PowerPivot in Excel. ThoughtSpot allows administrators to relate
column names to synonyms. In this hypothetical case, the administrator could indicate that when a user requests a “county,” the engine should map that to the “region” column from the input. ThoughtSpot also uses synonym sets and other matching algorithms to offer meaningful suggestions based on what the user meant, and lets the user pick the correct choice to move forward with the query computation.

All three tools in this report, working with the original data sources, let the user “slice and dice” data through filtering and drill-down operations (by region, product segment, etc.). With ThoughtSpot, users can also slice and dice directly in the search bar by adding or removing search terms.

In short, a responsive user interface should, in real time, compare user inputs to both column names and values in the input databases. It should be able to make a savvy guess as to what column the user wants and offer that as a higher-ranked suggestion, based both on exact matches and on considerations such as which columns contain the most rows containing “California” or “2015” as a value. The user can pick the suggestion that matches what she is looking for and disambiguate the request.

Validating Answers

The final piece of input interpretation is helping users verify the intermediate steps that the product used to arrive at a result. This helps adoption because users can trust the results they see by verifying the data sources used to compute the answer. When a search result is shown, the user is also given an option to hover over each search term and understand the lineage of the data (which source table and column it came from). For example, the user could see whether she has chosen revenue data from an official data source such as the data warehouse, or from a spreadsheet shared with her by a coworker. Each source and object in the system can also be “tagged” to show its associations (e.g., marketing, sales), and these could serve as useful inputs to help the user
understand what data sources she picked to arrive at the answer in front of her. In Figure 10, a ThoughtSpot user has selected the “store region” part of the query for deeper investigation.

Figure 10: User selects parts of a ThoughtSpot query to delve into

A button next to the search box lets the user translate the search string to an almost plain-English form that explains how the different tables were joined, what filters were applied, and what final result was computed. Figure 11 shows the internal information that ThoughtSpot displays about the “store region” part of the query in Figure 10.

Figure 11: Delving into a ThoughtSpot query

This helps business users gain confidence that the product is indeed performing the computations the way they expect it to. Users can share their query answers with BI analysts to reconcile any differences. For example, the business user might have wanted to see the order date, and the BI analyst could have made her report using the ship date; the key is that the two are different. By looking at the output provided by ThoughtSpot, she is able to see that the date used was the order date and could change it by picking the date from the ship date column to get her desired answer.

Creating the Simplicity of a Search-Like Query

To show how a search interface can form and execute a query while totally hiding the complexity of the schema and SQL, we’ll track the ThoughtSpot Analytical Search engine through its underlying processes when handling a user query.

Say we have two fact tables called Contacts and Sales Details, along with a dimension table called the Phone table that connects the other two. Assume that for each category and product, we want to find the following:

- Number of unique phone numbers contacted
- Number of contacts made
- How many clicks were counted
- How many sales were made
- Total revenue

With ThoughtSpot, the user just needs to type these terms into the interface and all the complex joins happen in the backend. The query would be: “count
phone count contact count sale clicks revenue category product”

If you were to write the full SQL for something like this, it would look like:

This search brings together data from the following tables:

With that data, ThoughtSpot does all the complex joins and produces the result in Figure 12.

Figure 12: ThoughtSpot produced this result after hiding all the complex logic under the hood

The user can ask ThoughtSpot to explain how it put together and interpreted the data, and receive the display in Figure 13.

Figure 13: Explanation of how ThoughtSpot performed query

Creating Instant Visualizations

We all like aptly chosen charts and figures that show us trends, and business intelligence solutions excel at creating these. A search-driven business intelligence engine should therefore choose appropriate visualizations instantly, making a best guess at what relationships the user wants to see and how they should be compared.

With all the tools discussed in this report, the types of data that a user has entered into the search bar automatically determine the chart type that gets plotted by default. For example, if a user looks for revenue over a particular time period, a line chart might be picked automatically, with time shown along the X axis and revenue along the Y axis. If the user were to look at revenue by store location, a bar chart might be chosen instead, with stacked bars if the user wanted to further subdivide results by product category. When there are two measures, such as when charting GDP of countries and life expectancies, a scatterplot is the first choice. Three continuously changing variables can be visualized through a bubble chart. Any geographical information, such as ZIP codes or latitudes and longitudes, will automatically get displayed on a map.

Figure 14: Sample visualization in ThoughtSpot

As with the query, the visualization can be changed by the user in real time. The user can simply click on a chart and pull up a menu of possible charts and display
options. In addition to the types of data included in a search, the algorithm also looks at factors such as the cardinality of the different attributes to determine the best visualization to represent the data.

Having found an appropriate query and visualization, the user can save it for future use by “pinning” it to a dashboard, a feature similar to “pinning” photos on sites such as Pinterest. In ThoughtSpot, charts, tables, and summary statistics can all be pinned to the “pinboard,” and organized in the best sequence to support whatever story the user has in mind. Each time a user views a chart, ThoughtSpot refreshes it with current data.

Sharing Answers and Visualizations

For a search experience to feel complete, it needs to provide you with the ability to express your thoughts, create answers, and share those with others. Today’s business intelligence world has a fragmented approach to the full workflow that starts when a business user thinks of a question and continues through to when she shares her findings with others. The user expresses her question in some documented form for a BI analyst, and then waits for a dashboard to be built. Then she looks at the dashboard, slices and dices based on the limited options she has, and if she finds what she was looking for, she saves the charts to PowerPoint for presentation.

Once a user has created a visualization in ThoughtSpot, she can share the answer or dashboard with anyone in the enterprise that she’s allowed to share it with, using share options in the product. She can also provide the URL for the shared answer to anyone who has group-level privileges to see that answer. ThoughtSpot also offers a full-screen presentation mode, so charts can be presented directly from the product in meetings and the user can cut down the time between search and discussion. This provides the added advantage of being able to edit search queries live in a meeting in case questions come up regarding any particular charts or tables.
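The default chart choices described under Creating Instant Visualizations can be read as a small set of rules keyed on the kinds of data in the search. The Python sketch below is only an illustration of that idea, not ThoughtSpot's or any other vendor's actual algorithm. The names `Kind`, `Column`, `pick_chart`, and `MAX_BARS` are hypothetical, and the cardinality cutoff (falling back to a table when an attribute has too many distinct values) is an assumption; the report says only that cardinality is one of the factors considered.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    MEASURE = "measure"    # numeric value such as revenue or GDP
    TIME = "time"          # date or timestamp
    GEO = "geo"            # ZIP code, latitude/longitude
    CATEGORY = "category"  # attribute such as store location


@dataclass(frozen=True)
class Column:
    name: str
    kind: Kind
    cardinality: int = 0   # distinct values; used only for categories


# Hypothetical cutoff: beyond this many distinct values a bar chart
# is assumed to be unreadable, so the sketch falls back to a table.
MAX_BARS = 50


def pick_chart(columns: list[Column]) -> str:
    """Return a default chart type for the columns named in a search."""
    kinds = [c.kind for c in columns]
    measures = kinds.count(Kind.MEASURE)
    categories = [c for c in columns if c.kind is Kind.CATEGORY]

    if Kind.GEO in kinds:
        return "map"            # geographic data goes on a map
    if Kind.TIME in kinds and measures >= 1:
        return "line"           # a measure over time
    if measures >= 3:
        return "bubble"         # three continuously changing variables
    if measures == 2:
        return "scatter"        # two measures compared
    if measures == 1 and categories:
        if any(c.cardinality > MAX_BARS for c in categories):
            return "table"      # assumed fallback for high cardinality
        # A second attribute subdivides each bar into a stack.
        return "stacked bar" if len(categories) > 1 else "bar"
    return "table"
```

For instance, `pick_chart([Column("revenue", Kind.MEASURE), Column("date", Kind.TIME)])` selects a line chart, matching the revenue-over-time example in the text. In a real product the user would still be able to override this default from the chart menu.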

Bringing Search-Driven Analytics to the Masses

Intelligent search, instant visualization, and the ability to validate and verify answers provide the foundations for a search-driven data analytics product. The goal is an easy entry point to doing research with the organization’s data without requiring training.

Today, search has also become synonymous with speed at scale. A business user should be able to access all of her enterprise data, even if it is billions of rows, with the same speed as looking up the weather on Google. Therefore, to get mass adoption, search-driven analytics products need to be architected from the ground up for scale (e.g., terabytes of data being accessed by all the users in a company).

Given the recent industry trends, a significant shift in direction for business intelligence is in the works. This shift is being led by search. Within the next decade, search will transform the world of business data as it did the world of public data over the last two decades.

About the Author

Andy Oram is an editor at O’Reilly Media. An employee of the company since 1992, Andy currently specializes in open source technologies and software engineering. His work for O’Reilly includes the first books ever released by a US publisher on Linux, the 2001 title Peer-to-Peer, and the 2007 bestseller Beautiful Code.