Strata+Hadoop World

A Guide to Improving Data Integrity and Adoption: A Case Study in Verifying Usage Data

Jessica Roper

A Guide to Improving Data Integrity and Adoption, by Jessica Roper. Copyright © 2017 O'Reilly Media, Inc. All rights reserved. Printed in the United States of America. Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editor: Nicole Tache. Production Editor: Colleen Lobner. Copyeditor: Octal Publishing Services. Interior Designer: David Futato. Cover Designer: Randy Comer. Illustrator: Rebecca Demarest.

December 2016: First Edition. Revision History for the First Edition: 2016-12-12: First Release.

The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. A Guide to Improving Data Integrity and Adoption, the cover image, and related trade dress are trademarks of O'Reilly Media, Inc. While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-491-97052-2 [LSI]

A Guide to Improving Data Integrity and Adoption

In most companies, quality data is crucial to measuring success and planning for business goals. Unlike sample datasets in classes and examples, real data is messy and requires processing and effort to be utilized, maintained, and trusted. How do we know whether the data is accurate, and whether we can trust our final conclusions? What steps can we take to not only ensure that all of the data is transformed correctly, but also to verify that the source data itself can be trusted as accurate? How can we motivate others to treat data and its accuracy as a priority? And what can we do to expand the adoption of data?
Validating Data Integrity as an Integral Part of Business

Data can be messy for many reasons. Unstructured data such as log files can be complicated to understand and to parse for information. A lot of data, even when structured, is still not standardized. For example, parsing text from online forums can be complicated and might need to include logic to accommodate slang such as "bad ass," which is a positive phrase built from negative words. The system creating the data can also make it messy, because different languages have different expectations for design; Ruby on Rails, for example, requires a separate table to represent many-to-many relationships.

Implementation or design can also lead to messy data. For example, the process or code that creates data and the database storing that data might use incompatible formats. Or, the code might store a set of values as one column instead of many columns. Some languages parse and store values in a format that is not compatible with the databases used to store and process them, such as YAML (YAML Ain't Markup Language), which is not a valid data type in some databases and is stored instead as a string. Because this format is intended to work much like a hash with key-and-value pairs, searching it with the database language can be difficult.

Also, code design can inadvertently produce a table that holds data for many different, unrelated models (such as categories, address, name, and other profile information) that is also self-referential. For example, the dataset in Table 1-1 is self-referential: each row has a parent ID representing the type or category of the row, and the value of the parent ID refers to the ID column of the same table. In Table 1-1, all information around a "User Profile" is stored in the same table, including labels for profile values, so some rows represent labels while others represent final values for those labels. The data shows that "Mexico" is a "Country," part of the "User Profile," because the parent ID of "Mexico" is 11, the ID for "Country," and so on. I've seen this kind of example in the real world, and this format can be difficult to query (the sketch after the table shows one way to do it). I believe this relationship was mostly the result of poor design. My guess is that, at the time, the idea was to keep all "profile-like" things in one table and, as a result, relationships between different parts of the profile also needed to be stored in the same place.

Table 1-1. Self-referential data example (source: Jessica Roper and Brian Johnson)

ID   Parent ID   Value
16   11          Mexico
11   …           Country
…    NULL        User Profile
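To make the difficulty concrete, here is a minimal sketch of one way to flatten such a structure, using a recursive common table expression to walk the parent-ID links. The layout mirrors Table 1-1, but the table name, column names, and the ID chosen for the "User Profile" row (missing from the table above) are hypothetical.

import sqlite3

# Minimal sketch: resolve a self-referential "profiles" table like Table 1-1.
# The table name, column names, and the ID 1 for "User Profile" are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE profiles (id INTEGER PRIMARY KEY, parent_id INTEGER, value TEXT);
    INSERT INTO profiles VALUES (1, NULL, 'User Profile');
    INSERT INTO profiles VALUES (11, 1, 'Country');
    INSERT INTO profiles VALUES (16, 11, 'Mexico');
""")

# A recursive CTE walks the parent_id chain so that each value is reported
# together with the labels above it.
rows = conn.execute("""
    WITH RECURSIVE lineage(id, value, path) AS (
        SELECT id, value, value FROM profiles WHERE parent_id IS NULL
        UNION ALL
        SELECT p.id, p.value, lineage.path || ' > ' || p.value
        FROM profiles AS p JOIN lineage ON p.parent_id = lineage.id
    )
    SELECT id, path FROM lineage ORDER BY id;
""").fetchall()

for row in rows:
    print(row)  # e.g. (16, 'User Profile > Country > Mexico')

Even with a recursive query available, every consumer of the table has to know to perform this walk and to distinguish label rows from value rows, which is a large part of what makes the design painful.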
Data quality is important for a lot of reasons, chiefly that it's difficult to draw valid conclusions from partial or inaccurate data. With a dataset that is too small, skewed, inaccurate, or incomplete, it's easy to draw invalid conclusions. Organizations that make data quality a priority are said to be data driven; to be a data-driven company means priorities, features, products used, staffing, and areas of focus are all determined by data rather than intuition or personal experience. The company's success is also measured by data. Other things that might be measured include ad impression inventory, user engagement with different products and features, user-base size and predictions, revenue predictions, and the most successful marketing campaigns. Affecting data priority and quality will likely require some work to make the data more usable and reportable, and it will almost certainly require working with others within the organization.

Using the Case Study as a Guide

In this report, I will follow a case study from a large and critical data project at Spiceworks, where I've worked for the past seven years as part of the data team, validating, processing, and creating reports. Spiceworks is a software company that aims to be "everything IT for everyone IT," bringing together vendors and IT pros in one place. Spiceworks offers many products, including an online community for IT pros to research and collaborate with colleagues and vendors, a help desk with a user portal, network monitoring tools, network inventory tools, user management, and much more.

Throughout much of the case study project, I worked with other teams at Spiceworks to understand and improve our datasets. We have many teams and applications that either produce or consume data, from the network-monitoring tool and online community that create data, to the business analysts and managers who consume data to create internal reports and prove return on investment to customers. My team helps to analyze and process the data to provide value and enable further utilization by other teams and products via standardizing, filtering, and classifying the data. (Later in this report, I will talk about how this collaboration with other teams is a critical component of achieving confidence in the accuracy and usage of data.)

This case study demonstrates Spiceworks' process for checking each part of the system for internal and external consistency. Throughout the discussion of the usage data case study, I'll provide some quick tips to keep in mind when testing data, and then I'll walk through strategies and test cases to verify raw data sources (such as parsing logs) and work with transformations (such as appending and summarizing data). I will also use the case study to talk about vetting data for trustworthiness and explain how to use data monitors to identify anomalies and system issues in the future. Finally, I will discuss automation and how you can automate different tests at different levels and in different ways. This report should serve as a guide for how to think about data verification and analysis, and it introduces some of the tools that you can use to determine whether data is reliable and accurate and to increase the usage of data.

An Overview of the Usage Data Project

The case study, which I'll refer to as the usage data project, or UDP, began with a high-level goal: to determine usage across all of Spiceworks' products and to identify page views and trends by our users. The need for this new processing and data collection came after a long road of hodge-podge reporting wherein individual teams and products were all measured in different ways. Each team and department collected and assessed data in its own way—how data was measured in each team could be unique. Metrics became increasingly important for us to measure success and determine which features and products brought the most value to the company and, therefore, should have more resources devoted to them.

The impetus for this project was partially due to company growth—Spiceworks had reached a size at which not everyone knew exactly what was being worked on and how the data from each place correlated to their own. Another determining factor was inventory—to improve and increase our inventory, we needed to accurately determine feature priority and value. We also needed to utilize and understand our users and audience more effectively to know what to show, to whom, and when (such as when to display ads or send emails). When access to this data occurred at an executive level, it was even more necessary to be able to easily compare products and understand the data as a whole to answer questions like: "How many total active users do we have across all of our products?" and "How many users are in each product?" It wasn't necessary to understand how each product's data worked. We also needed to be able to do analysis on cross-product adoption and usage.

The product-focused reporting and methods of measuring performance that were already in place made comparison and analysis of products impossible. The different data pieces did not share the same mappings, and some were missing critical statistics, such as which specific user was active on a feature. We thus needed to find a new source for data (discussed in a moment).

When our new metrics proved to be stable, individual teams began to focus more on the quality of their data. After all, the product bugs and features that should be focused on are all determined by the data they collect to record usage and performance. After our experience with the UDP and wider shared data access, teams have learned to ensure that their data is being collected correctly during beta testing of the product launch instead of long after. This guarantees them easy access to data reports dynamically created on the data collected. After we made the switch to this new way of collecting and managing data from the start—which was automatic and easy—more people in the organization were motivated to focus on data quality, consistency, and completeness. These efforts moved us to being a more truly data-driven company and, ultimately, a stronger company because of it.

Getting Started with Data

Where to begin? After we determined the goals of the project, we were ready to get started. As I previously remarked, the first task was to find new data. After some research, we identified that much of the data we needed was available in logs from Spiceworks' advertising service (see Figure 1-1), which is used to identify the target audience that a user qualifies to be in and, therefore, what set of ads should be displayed to them. On each page of our applications, the advertising service is loaded, usually even when no ads are displayed. Each new page and even context changes, such as switching to a new tab, create a log entry. We parsed these logs into tables to analyze usage across all products; then, we identified places where tracking was missing or broken to show what parts of the advertising-service data source could be trusted.
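To illustrate this first step, here is a rough sketch of parsing advertising-service log lines into structured rows. The report does not document the real log format, so the timestamp, user, product, and URL fields assumed here are hypothetical.

import re
from datetime import datetime

# Hypothetical log line format: an ISO timestamp followed by key=value pairs.
LINE_RE = re.compile(
    r'(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}) '
    r'user=(?P<user_id>\d+|-) product=(?P<product>\S+) url=(?P<url>\S+)'
)

def parse_line(line):
    """Return a structured row for one log entry, or None if it is malformed."""
    match = LINE_RE.match(line)
    if match is None:
        return None  # tallied separately; a spike suggests the format changed
    row = match.groupdict()
    row["ts"] = datetime.fromisoformat(row["ts"])
    row["user_id"] = None if row["user_id"] == "-" else int(row["user_id"])  # "-" marks a visitor
    return row

sample = "2016-11-02T09:15:04 user=42 product=community url=/topics/networking"
print(parse_line(sample))

Returning None for malformed lines, rather than raising, makes it easy to count them; a sudden rise in that count is itself a useful signal that tracking broke or the log format changed.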
As Figure 1-1 demonstrates, each log […]

[…] results so that everything we wanted to filter and report on was included. Data transformations like appended information, aggregation, and transformation require more validation. Anything converted or categorized needs to be individually checked […]

Monitors are used to test new data being appended to and created by the system going forward; one-time or single historical reports will not require monitoring. One thing we had to account for when creating monitors was traffic changes throughout the week, such as significant drops on the weekends. A couple of trend complications we had to deal with were weeks that contain holidays and general annual trends, such as drops in traffic in December and during the summer. It is not enough to verify that a month looks similar to the month before it or that a week has similar data to the week before; we also had to determine a list of known holidays, add indicators to those dates when the monitors are triggered, and compare averages over a reasonable amount of time. It is important to note that we did not allow holidays to mute errors; instead, we added indicators and high-level data trend summaries to the monitor errors that allowed us to easily determine whether an alert could be ignored.
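A minimal sketch of this kind of monitor is shown below: it compares a day's total page views against the same weekday over recent weeks, flags large deviations, and annotates known holidays rather than silencing alerts on them. The holiday list, lookback window, and threshold are illustrative assumptions, not the values actually used in the UDP.

from statistics import mean, stdev

HOLIDAYS = {"2016-12-25", "2016-07-04"}  # dates where a dip is expected

def check_daily_total(date, total, same_weekday_history, k=2.0):
    """Compare one day's total to the same weekday in prior weeks.

    same_weekday_history holds totals for the same weekday over previous
    weeks. Returns (ok, message); holiday dates are annotated, not muted."""
    avg = mean(same_weekday_history)
    spread = stdev(same_weekday_history)
    note = " (known holiday: review before escalating)" if date in HOLIDAYS else ""
    if abs(total - avg) > k * spread:
        return False, (f"{date}: total {total} is more than {k} standard "
                       f"deviations from the weekday average {avg:.0f}{note}")
    return True, f"{date}: within the expected range{note}"

# Mondays over the previous six weeks, then the Monday under test.
history = [105_000, 98_000, 101_000, 99_500, 103_000, 100_200]
print(check_daily_total("2016-12-05", 62_000, history))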
Some specific monitors we added included looking at total page views over time and ensuring that the total was close to the average total over the previous three months. We also added the same monitors for the total page views of each product and category, which tracked that all categories collect data consistently. This also ensured that issues in the system creating the data were monitored and that changes such as accidental removal of tracking code would not go unnoticed. Other tests included looking at these same trends for totals and by category for registered users and visitors to ensure that tracking around users remained consistent. We added many tests around users because knowing active users and their demographics was critical to our reporting.

The main function of monitors is to ensure that critical data continues to have the integrity required. A large change in a trend is an indicator that something might not be working as expected somewhere in the system. A good rule of thumb for what defines a "large" change is when the data in question is outside one to two standard deviations from the average. For example, we found one application that collected the expected data for three months while in beta, but when the final product was deployed, the tracking was removed. Our monitors discovered this issue by detecting a drop in total page views for that product category, allowing us to dig in and correct the issue before it had a large impact.

There were other monitors we added that do not focus heavily on trends over time. Rather, they ensured that we would see the expected number of total categories and that the directory containing all the files being processed had the minimum number of expected files, each with the minimum expected size. This was determined to be critical because we found one issue in which some log files were not properly copied for parsing and, therefore, significant portions of data were missing for a day. Missing even only a few hours of data can have large effects on different product results, depending on what part of the day is missing from our data. These monitors helped us to ensure that data copy processes and sources were updated correctly, and they provided high-level trackers to make sure the system is maintained. As with other testing, the monitors can change over time. In fact, we did not start out with a monitor to ensure that all the files being processed were present and the correct sizes; that monitor was added when we discovered data missing after running a very long process.
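A sketch of that file-completeness check might look like the following; the directory path, minimum file count, and minimum size are placeholders, since sensible thresholds depend on how the logs are rotated and copied.

from pathlib import Path

def check_log_directory(log_dir, min_files=24, min_bytes=1_000_000):
    """Return a list of problems; an empty list means the directory looks complete."""
    directory = Path(log_dir)
    if not directory.is_dir():
        return [f"{log_dir} is missing entirely"]
    files = [p for p in directory.iterdir() if p.is_file()]
    problems = []
    if len(files) < min_files:
        problems.append(f"only {len(files)} files; expected at least {min_files}")
    for path in files:
        size = path.stat().st_size
        if size < min_bytes:
            problems.append(f"{path.name} is {size} bytes; expected at least {min_bytes}")
    return problems

# Run before the long parsing job so that missing copies fail fast.
for issue in check_log_directory("/var/logs/ad-service/2016-12-01"):  # hypothetical path
    print("MONITOR:", issue)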
When new data or data processes are created, it is important to use them skeptically until no new issues or questions have been found for a reasonable amount of time. What counts as reasonable is usually related to how the processed data is consumed and used. Much of the data I work with at Spiceworks is produced and analyzed monthly, so we closely and heavily manually monitored the system until the process had run fully and successfully for several months. This included working closely with our analysts as they worked with the data to find any potential issues or remaining edge cases. Anytime we found a new issue or unexpected change, a new monitor was added. Monitors were also updated over time to be more tolerant of acceptable changes. Many of these monitors were less about the system (there are different kinds of tests for that) and more about data integrity and ensuring reliability.

Finally, another way to monitor the system is to "provide end users with a dead-easy way to raise an issue the moment an inaccuracy is discovered," and, even better, let them fix it. If you can provide a tool that allows a user both to report on data and to make corrections, the data will be able to mature and be maintained more effectively. One tool we created at Spiceworks helped maintain how different products are categorized. We provided a user interface with a database backend that allowed interested parties to update classifications of URLs. This created a way to dynamically update and maintain the data without requiring code changes and manual updates. Yet another way we did this was to incorporate regular communications and meetings with all of the users of our data, including our financial planning teams, business analysts, and product managers. We spent time understanding the way the data would be used and what the end goals were for those using it. In every application, we included a way to give feedback on each page, usually through a form that includes all the page's details. Anytime the reporting tool did not have enough data results for the user, we gave them an easy way to connect with us directly to help obtain the necessary data.

Implementing Automation

At each layer of testing, automation can help ensure long-term reliability of the data and quickly identify problems during development and process updates. This can include unit tests, trend alerts, or anything in between. These are valuable for products that are being changed frequently or require heavy monitoring. In the UDP, we automated almost all of the tests around transformations and aggregations, which allowed for shorter test cycles while iterating through the process and provided long-term stability monitoring of the parsing process in case anything changes in the future or a new system needs to be tested. Not all tests need to be automated or created as monitors. To determine which tests should be automated, I try to focus on three areas:

  • Overall totals that indicate system health and accuracy
  • Edge cases that have a large effect on the data
  • How much effect code changes can have on the data

There are four general levels of testing, and each level generally describes how the tests are implemented:

  • Unit: Unit tests focus on single complete components in isolation.
  • Integration: Integration tests focus on two components working together to build a new or combined dataset.
  • System: System tests verify the infrastructure and the overall process itself as a whole.
  • Acceptance: Acceptance tests validate data as reasonable before publishing or appending datasets.

In the UDP, because having complete sets of logs was critical, a separate system-level test was created to run before the rest of the process to ensure that data for each day and hour could be identified in the log files. This approach further ensured that critical and difficult-to-find errors would not go unnoticed. Other tests we focused on were between transformations of the data, such as comparing initial parsed logs as well as aggregate counts of users and total page views. Some tests, such as categorization verification, were only done manually, because most changes to the process should not affect this data and any change in categorization would require more manual testing either way.

Different tests require different kinds of automation. For example, we created an automated test to validate the final reporting tables, which included a column for total impressions as well as a breakdown by type of impression, based on whether that impression was caused by a new page view, an ad refresh, and so on. This test was implemented as a unit test to ensure that, at a low level, the total was equal to the sum of the page view types.
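A minimal sketch of that unit test follows; the column names and sample rows are hypothetical, and the point is only the low-level invariant that the total equals the sum of its breakdown.

import unittest

# Hypothetical breakdown columns for the final reporting table.
BREAKDOWN_COLUMNS = ["new_page_views", "ad_refreshes", "tab_switches"]

def row_is_consistent(row):
    """True if total_impressions equals the sum of the breakdown columns."""
    return row["total_impressions"] == sum(row[c] for c in BREAKDOWN_COLUMNS)

class TestReportingTotals(unittest.TestCase):
    def test_total_equals_sum_of_breakdown(self):
        rows = [  # in the real test these would come from the reporting table
            {"total_impressions": 10, "new_page_views": 6, "ad_refreshes": 3, "tab_switches": 1},
            {"total_impressions": 4, "new_page_views": 2, "ad_refreshes": 2, "tab_switches": 0},
        ]
        for row in rows:
            self.assertTrue(row_is_consistent(row), f"inconsistent row: {row}")

if __name__ == "__main__":
    unittest.main()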
Another unit test involved creating samples for the log-parsing logic, including edge cases as well as both common and invalid examples. These were fed through the parsing logic after each change to it, as we discovered new elements of the data. One integration test included in the automation suite ensured that country data from the third-party geographical dataset was valid and present.
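The report does not show that test, but an integration check along these lines could look like the sketch below, assuming the geo join produces rows carrying a country code; the country set and the tolerance for missing rows are illustrative.

# Hypothetical check on rows produced by joining parsed logs to the geo dataset.
KNOWN_COUNTRIES = {"US", "MX", "GB", "DE", "IN"}  # in practice, the full ISO list

def check_geo_join(joined_rows, max_missing_ratio=0.01):
    """Fail if any country code is unknown or too many rows lack one."""
    missing = sum(1 for r in joined_rows if not r.get("country"))
    unknown = {r["country"] for r in joined_rows
               if r.get("country") and r["country"] not in KNOWN_COUNTRIES}
    assert not unknown, f"unknown country codes from geo join: {unknown}"
    assert missing <= max_missing_ratio * len(joined_rows), (
        f"{missing} of {len(joined_rows)} rows are missing country data")

check_geo_join([
    {"user_id": 42, "country": "MX"},
    {"user_id": 7, "country": "US"},
])
print("geo join looks consistent")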
The automated tests for data integrity and reliability using monitors and trends were done at the acceptance level, after processing, to ensure valid data that followed the expected patterns before publishing it. Usually, when automated tests are needed, there will be some at every level. It is helpful to document test suites and coverage, even if they are not automated immediately or at all. This makes it easy to review tests and coverage, and it allows new or inexperienced testers, developers, and so on to assist in automation and manual testing. Usually, I just record tests as they are manually created and executed. This helps to document edge cases and other expectations and attributes of the data. As needed, when critical tests were identified, we worked to automate those tests to allow for faster iterations when working with the data. Because almost all code changes required some regression testing, covering critical and high-level tests automatically provided easy smoke testing for the system and gave some confidence in the continued integrity of the data when changes were made.

Conclusion

Having confidence in data accuracy and integrity can be a daunting task, but it can be accomplished without having a Ph.D. or a background in data analysis. Although you cannot use some of these strategies in every scenario or project, they should provide a guide for how you think about data verification, analysis, and automation, as well as give you the tools and ways of thinking about data that let you establish confidence that the data you're using is trustworthy. It is important that you become familiar with the data at each layer and create tests between each transformation to ensure consistency in the data. Becoming familiar with the data will allow you to understand which edge cases to look for, as well as which trends and outliers to expect. It will usually be necessary to work with other teams and groups to improve and validate data accuracy (a quick drink never hurts to build rapport). Some ways to make this collaboration easier are to understand the focus of those you are collaborating with and to show how the data can be valuable for those teams to use themselves. Finally, you can ensure and monitor reliability through automation of process tests and acceptance tests that verify trends and boundaries, and that also allow the data collection processes to be converted and iterated on easily.

Further Reading

Peters, M. (2013). "How Do You Know If Your Data is Accurate?" Retrieved December 12, 2016, from http://bit.ly/2gJz84p.

Polovets, L. (2011). "Data Testing Challenge." Retrieved December 12, 2016, from http://bit.ly/2hfakCF.

Chen, W. (2010). "How to Measure Data Accuracy?" Retrieved December 12, 2016, from http://bit.ly/2gj2wxp.

Chen, W. (2010). "What's the Root Cause of Bad Data?" Retrieved December 12, 2016, from http://bit.ly/2hnkm7x.

Jain, K. (2013). "Being paranoid about data accuracy!" Retrieved December 12, 2016, from http://bit.ly/2hbS0Kh.

About the Author

Since graduating from the University of Texas at Austin with a BS in computer science, Jessica Roper has worked as a software developer working with data to maintain, process, scrub, warehouse, test, report on, and create products from it. She is an avid mentor and teacher, taking any opportunity available to share knowledge. Jessica is currently a senior developer in the data analytics division of Spiceworks, Inc., a network used by IT professionals to stay connected and monitor their systems. Outside of her technical work, she enjoys biking, swimming, cooking, and traveling.

Table of Contents

  • A Guide to Improving Data Integrity and Adoption

    • Validating Data Integrity as an Integral Part of Business

    • Using the Case Study as a Guide

    • An Overview of the Usage Data Project

    • Getting Started with Data

    • Managing Layers of Data

    • Performing Additional Transformation and Formatting

    • Starting with Smaller Datasets

    • Determining Acceptable Error Rates

    • Creating Work Groups

    • Reassessing the Value of Data Over Time

    • Checking the System for Internal Consistency

      • Validating the Initial Parsing Process

      • Checking the Validity of Each Field

      • Verifying Accuracy of Transformations and Aggregation Reports

      • Allowing for Tests to Evolve

    • Checking for External Consistency: Analyzing and Monitoring Trends

      • Performing Time-Series Analyses

      • Putting the Right Monitors in Place

    • Implementing Automation

    • Conclusion

    • Further Reading
