Streaming Data: Concepts That Drive Innovative Analytics
Andy Oram
Beijing • Boston • Farnham • Sebastopol • Tokyo

Streaming Data
by Andy Oram
Copyright © 2019 O'Reilly Media. All rights reserved.
Printed in the United States of America.
Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Acquisitions Editor: Rachel Roumeliotis
Developmental Editor: Jeff Bleiel
Production Editor: Christopher Faucher
Copyeditor: Octal Publishing, LLC
Proofreader: Nan Barber
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest

First Edition: March 2019
Revision History for the First Edition: 2019-03-15: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781492038092 for release details.

The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. Streaming Data, the cover image, and related trade dress are trademarks of O'Reilly Media, Inc.

The views expressed in this work are those of the author, and do not represent the publisher's views. While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

This work is part of a collaboration between O'Reilly and Mesosphere. See our statement of editorial independence.

978-1-492-03806-1 [LSI]

Table of Contents
Streaming Data: Concepts That Drive Innovative Analytics
  From Data to Insight
  Example of a Complete Project
  Tracing the Movements of Big Data Processing
  Machine Learning
  New Architectures Lead to New Development Patterns
  Conclusion

CHAPTER 1
Streaming Data: Concepts That Drive Innovative Analytics

Managers and staff who are responsible for planning, hiring, and resource allocation throughout an organization need to consider the fast-growing impact of data and analytics. If this topic doesn't fill you with excitement, you can at least take on its study out of a well-justified concern that your industry is undergoing profound change. Organizations everywhere are disrupting business, government, and society through the use of these analytics. You need to understand the field to help your organization survive—and hopefully to grow and contribute to the common good.

This report deals in particular with streaming data, a set of tools and practices for quick decision making in response to fast-changing events. Here are a few examples of how such analytics are changing businesses:

• Online services for movies and other content, such as Netflix and Amazon.com, analyze customer behavior to make better predictions, as made famous by the Netflix Prize.

• Walmart, a long-term champion of efficiency, is using these technologies to improve the customer experience in multiple ways.

• Banks are detecting fraud in the use of credit cards and mobile payments.
• John Deere classifies plants in the fields in order to reduce pesticide use, while robots record their collection of fruit to predict yields.

It is difficult to quantify the growth of streaming analytics, for many reasons. Observers don't track it as a distinct discipline, and it involves not a simple purchase, but a continuous process of changing educational and organizational roles. Furthermore, many businesses would like to move much faster into streaming analytics but are hampered by the shortage of qualified staff. As an indication of the attention analytics in general are getting, it's worth noting an assessment from mid-2018 by the highly respected International Data Corporation (IDC), estimating an annual growth of nearly 12% in "big data and business analytics (BDA) solutions."

Trade publications highlight many benefits—and sometimes risks—of streaming data but offer the general reader little insight into what these things actually do and what companies need to do to make them work. On the other hand, people who try to delve into the technology quickly find themselves in a maze of overlapping programs and tools ("You should use Kubernetes to manage Docker.").

This report tries to bring streaming data into focus for the average professional reader. It uses everyday situations and terms to describe what these technologies are meant to achieve and how they achieve those ends. Along the way, the report reveals the challenges raised by trying to organize and process data. You will learn the following:

• The essential steps in turning raw data into insights

• Major flaws in the data available to most organizations and the processing required to make the data usable

• A history of the goals adopted by data-driven organizations and how tools evolved along with these goals

• Major categories of streaming data tools, with examples in each category and how they contribute to the essential goals

As a high-level introduction for people without deep knowledge of programming or digital technology, this report doesn't delve into mathematics or programming details. I don't recommend particular technologies, although I mention some of the popular or historically significant ones. Most of all, I don't suggest how you can use streaming data in your organization, because that is highly dependent on your industry, the expertise of your employees, your needs, and the direction in which you want to take the organization.

From Data to Insight

We will now begin our journey toward extracting insights from data. More and more of it is being collected each year, presenting institutions with the question of which data they will find useful. Here are a few common sources of data:

Logs from computer services
Because servers usually write information about the activities they perform to a log file, these files become a valuable source of information for assessing popularity and improving content as well as for performance and security (a small parsing sketch appears below).

Postings to social media and other popular websites
For instance, you can easily ask Twitter to send you all postings that contain certain words.

Data generated by devices
This includes data generated in settings such as medical and industrial.

Streaming data also often operates on batches of data sent as files. Sales information and other transactions often come through this way.
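To make the first of these sources concrete, here is a minimal Python sketch, not taken from the report, of how log lines might be parsed and summarized; the log format, field names, and the choice to count only successful requests are illustrative assumptions.

    import re
    from collections import Counter

    # Hypothetical access-log format: '203.0.113.7 - - [12/Mar/2019:10:01:55] "GET /pricing HTTP/1.1" 200'
    LOG_PATTERN = re.compile(
        r'^(?P<ip>\S+) .*?"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3})'
    )

    def parse_line(line):
        """Pull the fields we care about out of one raw log line, or return None if it doesn't match."""
        match = LOG_PATTERN.match(line)
        return match.groupdict() if match else None

    def popular_paths(log_file, top_n=5):
        """Stream through a log file line by line and count successful requests per path."""
        counts = Counter()
        with open(log_file) as handle:
            for line in handle:
                record = parse_line(line)
                if record and record["status"] == "200":
                    counts[record["path"]] += 1
        return counts.most_common(top_n)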
Techniques for streaming data have been developed to address the vast acceleration that has taken place over the past couple of decades, both in the accumulation of data and in the speed with which business decisions are made. Information from online purchases can be incorporated into marketing campaigns within milliseconds, and walk-in retail stores are adding data from their point-of-sale systems to speed up their decisions, too. Just-in-time (JIT) manufacturing was invented in the 1970s, but has recently become far more sophisticated thanks to new data-processing techniques. Therefore, although some decisions are made in traditional ways by saving up data and processing it as a batch, this process must be supplemented by methods that incorporate recent data that has just streamed in.

The same data can be handled in a streaming and a nonstreaming manner. For instance, online sales can be used in a streaming manner to alter what appears on the home page of the retailer ("Hottest items in our catalog!") and then be saved for use in more long-term projects such as planning the next product to make. So when I use the term "streaming data," it applies to data that is used in a certain way—and also to the technical and organizational methods that uncover new value in that data.

Sometimes, regulations in your industry as well as general laws such as the European General Data Protection Regulation (GDPR) might place limits on how you can use data. These will have an effect on what data you exclude from consideration or the data that you deidentify or aggregate over large groups of people.

Having chosen data sources, two challenges you face at the very start are fixing quality problems with the data and extracting the useful items from the raw data. We examine those challenges in the following subsections.

Issues of Data Quality

Data scientists routinely report that 80% or even 90% of their time is spent getting data into a state where it can be used to inform business decisions. We can't understand the reality of data analytics without grasping what a messy world we live in and how many quality problems are presented by the data that comes in for analysis. This is true whether we're dealing with data entered by humans into databases, data collected from devices in the Internet of Things (IoT), or data left as a trail when people visit websites and make purchases. Consider a few sources of error in order to see why errors are so pervasive:

• People entering data will routinely make mistakes. For instance, someone unfamiliar with the Spanish name "Aguilar" can easily enter it as "Agular" or "Agiular." People can also put data in the wrong fields, such as typing a name into a field meant for the social security number. Because addresses vary so much, they might choose different fields for some parts of an address, such as an apartment number or rural route. This inevitable inaccuracy is a source of suffering for people whose governments or businesses insist that their ID exactly matches some official record. Computer processing is no panacea, because the programmer rarely anticipates all variations in input. Several sys‐

streaming data. Look back at the examples of streaming applications that began this report: bank fraud detection, IoT data, and so on. The tools we have looked at up to this point strained to handle the data quickly enough. One of the most pressing cases for streaming data is ad auctions that occur when someone visits a web page that has advertisements. The article "How browsers get to know you in milliseconds" describes in detail the background data processing required to make modern advertising work on the web.
When you have a tight time limit for getting results from your analytics, such as when you need to respond to a visitor's click and show an ad, the industry talks of real-time requirements. Some device data is also used in real-time situations, such as the brakes on an automobile. Real-time processing is one important use for streaming data, but large streams of data can also be processed for other reasons that lack such stringent requirements. (On the flip side, some real-time applications just wait to be triggered by an event; these don't need the tools used for streaming data.)

Streaming Analytics: Flink, Kafka, and More

Many modern applications contain data processing at many levels. Data is preprocessed to determine insights, such as to find common characteristics of fraud. This processing can use a mixture of traditional relational database queries and the big data tools described so far. And those analytics are fed into a new generation of streaming data tools to complete the application. We turn next to these tools.

The defining characteristic of streaming data is the manner in which it arrives: fast, continuous, and often endless. Some applications need to process it quickly (such as the real-time applications mentioned in the previous section), whereas others just need to handle continuous streams of data for purposes such as updating an advertising model or delivery schedule. Unlike the old Hadoop jobs, both of these types of applications for streaming data don't just write out results and terminate: they often run continuously as long as the business needs them.

Each item of data—for example, a visitor landing on a web page or a person running a credit card through a machine—is a single incident that might require a response. But most streaming data applications also do some processing on small batches of data. Thus, one of the main design tasks is to determine how to divide the input streams into chunks for processing. This division of data is usually called a window. It might simply be all inputs received during a period of time, such as five minutes. It can also be something logically significant, such as a visitor's session on a website consisting of a series of clicks that might result in a purchase. In that case, you can define some event as a trigger marking the end of a window.

Timing can become complicated if the source of data has interruptions or if you are taking data from multiple sources. The time that you receive notice of an event is different from the time the event took place. Therefore, you need to account for data arriving out of order and for wide differences in the gap between the event and the notification. There are ways to deal with these uncertainties if the source of the data adds a field to mark the time when the event was generated. Spark can then add an event to an earlier window if the notification of that event comes in late, but there is usually some maximum time after which old events cannot be added. If you didn't impose a maximum, you would theoretically need to keep all data in memory forever.
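As a rough illustration of these ideas, not taken from the report, here is a small Python sketch of five-minute tumbling windows with a cap on how late an event may arrive; the window length and the lateness cutoff are arbitrary assumptions.

    from collections import defaultdict

    WINDOW_SECONDS = 300         # five-minute tumbling windows
    MAX_LATENESS_SECONDS = 600   # events later than this are dropped rather than added

    def window_start(event_time):
        """Map an event timestamp (in seconds) to the start of its five-minute window."""
        return event_time - (event_time % WINDOW_SECONDS)

    def count_by_window(events):
        """events: iterable of (event_time, arrival_time) pairs, in order of arrival."""
        counts = defaultdict(int)
        latest_arrival = 0
        dropped = 0
        for event_time, arrival_time in events:
            latest_arrival = max(latest_arrival, arrival_time)
            if latest_arrival - event_time > MAX_LATENESS_SECONDS:
                dropped += 1   # too late: its window has already been finalized
                continue
            counts[window_start(event_time)] += 1   # late but within bounds: counted in its original window
        return dict(counts), dropped

Engines such as Spark or Flink do the same bookkeeping of windows, event times, and lateness for you, at much larger scale.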
You also need to be careful to view the incoming data in a consistent state, even as new data is continuously being added. For example, if you ask for the number of records in a particular set and then retrieve those records, you don't want to be told that it contains 10 records and then get 11 records because one was added between your queries.

Just as Spark arose as a kind of next generation of Hadoop, some new tools arose as a streaming improvement to Spark. Storm and Flink in particular came under the umbrella of the Apache Foundation, which also hosts Spark, and Akka (maintained outside Apache) serves similar streaming needs. And Spark itself has evolved to support streaming.

Individual streams can become complicated enough on their own, but another level of complexity developed in large data-processing organizations: the feeding of many streams to many different pipelines. You might have one pipeline that takes in sales information and weather information in order to make predictions about future sales. The same sales information might go to a completely different pipeline, along with traffic information, to plan how to restock retail outlets. Along with this new complexity came a new set of tools called message queues. The one in greatest use currently for streaming applications is Kafka, but there are many other systems with less ominous names, such as RabbitMQ, ActiveMQ, ZeroMQ, and Flume.

A message queue usually follows a publish/subscribe (or pub/sub) model. It ingests data and "publishes" it by emitting a data stream, often after some transformations to make the data more useful. Transformations might include putting messy data into a standard format and assigning tags (called "topics" in Kafka lingo) to different parts (such as sales data from different stores). Consumers of the data—the streaming data applications—then "subscribe" by accepting input from the message queue. Thanks to the tags or topics, subscribers can filter out data and get only the small subset they want. A message queue can also buffer data so that if a large burst is too much for the streaming application to handle, the data is saved and the application can read it when things settle down. A queue can also wait for an acknowledgment that the recipient has received the data, resend it after a time-out, and finally delete the data when an acknowledgment is received.
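To give a flavor of the pub/sub pattern, here is a minimal sketch using the third-party kafka-python package; the broker address, topic name, consumer group, and record fields are assumptions made for illustration.

    import json
    from kafka import KafkaProducer, KafkaConsumer   # requires kafka-python and a running broker

    # Publisher side: a point-of-sale system emits each sale as a message on the "sales" topic.
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda record: json.dumps(record).encode("utf-8"),
    )
    producer.send("sales", {"store": "boston-12", "sku": "A-1017", "amount": 24.99})
    producer.flush()

    # Subscriber side: a restocking pipeline reads only the topic it cares about.
    consumer = KafkaConsumer(
        "sales",
        bootstrap_servers="localhost:9092",
        group_id="restock-pipeline",
        value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    )
    for message in consumer:
        print(message.topic, message.value)   # hand each sale to the restocking logic here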
We have now covered the main tools currently used for streaming data. But other tasks require yet other solutions. After all, streaming data applications are made up of many pieces and can be spread across a number of computers. Each computer has memory, CPU, and input/output requirements. Ever resourceful, the users of big data have created special tools to deal with such computer resources.

Orchestrating Computing Requirements: Docker, Kubernetes, and DC/OS

To be robust, any distributed processing must deal with certain realities of computing. The proper amount of CPU, memory, and storage must be allocated for each process. The pipeline also must recover from systems that fail, network connections that disappear, and applications that go out of control or hang. An administrative node, usually called a master, is responsible for such scheduling tasks.

Some of this is built into the tools mentioned in the previous section. They can restart jobs that fail and even provide checkpoints so that the new job can pick up from a known state and make use of the work done by the failed job. But there are a huge number of tasks involved in providing resources and keeping a distributed system chugging along. Free and open source software teems with tools to do these tasks, called orchestration.

Orchestration needs units of work to manipulate. Docker is a very popular tool for creating individual units of work, called containers, that can share physical systems efficiently. Containers are great for tasks that are self-contained but that communicate with other parts of the system. This kind of processing has come to be called microservices. Containers are also easy and cheap to start up, and easy to replace if they fail. Thus, Docker, along with some of the tools that follow, has become central to cloud computing. You might use one container running a microservice to log in a visitor, another to handle a sale, a third to handle the payment, and so on. If any one of those services becomes a bottleneck (because it has a high load or takes more time than others), the orchestration can add containers for that service, without having to take up more memory or processor time from the other services.

Kubernetes, Mesos, and YARN carry out certain types of orchestration: for instance, they launch tasks, apportion resources such as CPU time, and restart tasks that fail. To varying degrees, one or more of these are integrated with the streaming data pipeline tools. Because you need to tell these tools how you want resources and processes apportioned, they require administration in themselves, and higher-level tools such as the Distributed Cloud Operating System, or DC/OS, are available for some of those tasks. ZooKeeper is also often found as a fellow traveler of streaming data tools because it can help with deeply buried administrative tasks such as promoting a new system to be master after an existing master fails.

Machine Learning

Streaming data can work well with traditional analytics. But the success of AI, usually using a set of data processing tools called machine learning, recommends its use with streaming data. As we've seen, streaming data possesses properties that machine intelligence can handle impressively well: the datasets are huge, they contain many parameters, and hidden relationships need to be uncovered.

The sudden success of AI in the past decade (after some 50 years of underperforming) has prompted everyone to wonder what its effect will be, both on our businesses and on our daily lives. The general public senses a justified fear that powerful forces will use artificial intelligence to control us and cut off choices—or even that machines will wrest power away from humans entirely. (I wonder what purpose might drive a machine to read this report sometime in the future.)
Businesses have similar fears of being pushed into the dust‐ bin of history by AI, but are also intrigued by its potentially positive impacts The use of data and machine learning in your organization will depend closely on your goals and capabilities But finding the right application that offers you results to act on can utterly transform the organization The next few sections of this report therefore turn to the ideas that make up machine learning; in particular, a form called deep learning because of its architecture, which involves many layers of processors feeding data from one to the next Deep learning can be divided into many subsets, the most signifi‐ cant division being supervised versus unsupervised learning Both types of learning involve large sets of input data We have seen in previous sections what that data might be and how it is prepared for processing The difference between these types of learning depends on whether the data comes pretagged Image recognition is a common example of supervised learning, because the data is pretagged It will contain a number of pictures of dogs with the tag “dog” (or subsets of “dog” such as “Labrador” or “poodle”), pictures of cars tagged “car,” and so forth The goal of deep learning is to learn from this as a baby might learn You take your baby out in the stroller and proclaim “Look at the dog!” as you pass each dog; supervised learning works the same way Unsupervised learning is a bit less intuitive It usually works on data that contains many fields, such as records of customers, their demo‐ graphic data, and their purchases The goal of unsupervised learning is to create categories (a process called clustering) so that you can find relationships you might not have considered For instance, although television producers usually think just of children as the audiences for children’s shows, it turns out that many adults enjoy some of those shows, too Unsupervised learning can turn up such unanticipated results, which can then inform marketing efforts Supervised learning offers many intuitive concepts, so we will con‐ centrate on supervised learning in the following sections 18 | Chapter 1: Streaming Data: Concepts That Drive Innovative Analytics Ideas Behind Supervised Learning Supervised learning is based on two principles: inductive reasoning and cyclical feedback Inductive reasoning, in the estimation of behavioral psychologists, defines how we learn much of what we know about the world Inductive reasoning is the basis behind behavior modification, in which we give a dog a treat each time she obeys us and thus instill in her the desire to obey us in the future Organic brains work differ‐ ently from computer chips, but tremendous successes in natural lan‐ guage processing (NLP) and image recognition show that algorithms can also learn this way However, there are cracks in the glaze of this miracle because image recognition can be fooled in disastrous ways just by changing a few pixels in an image This shows how different supervised machine learning is from human and animal learning Cyclical feedback lies behind the architecture and behavior of deep learning networks The original input passes through a number of layers, undergoing some transformation The collection of layers is called a model At the end, the result is compared to the correct result, and the differences—reflecting the errors in the algorithm— lead to changes in the importance assigned to various processing steps The importance of a processing step is called its weight, just as 
in everyday life one person might assign more weight to the cost of a product, whereas another person assigns more weight to its quality Through a process known as backpropagation, new weights are assigned to the various processing units, and the entire process is repeated Basically, you find out on each pass what works and what doesn’t and then focus more on what works If everything lines up properly—truly representative training data, correct functions for the layers, correct interpretations for assigning weights—the network becomes better and better with repetition and finally achieves enough accuracy to be put into production And then you need to regularly start over because data changes over time For instance, decisions that the algorithm might have based on people’s income will become outdated when the economy takes an upturn or downturn If an algorithm makes assumptions based on people watching cable television, you’ll need to reevaluate them as people “cut the cord” and stream their TV shows Machine Learning | 19 So, what happens in each layer of the deep learning network? Some‐ times the operations are pretty simple Let’s look at a couple com‐ mon examples: • If you are identifying images, you might want to emphasize edges and deemphasize the minor variations in texture found in clothing, walls, and so on So, you might run a function called a convolution that looks at every pixel and compares it to the pix‐ els around it By exaggerating large differences and diminishing small differences, you help the next layer in the series identify an edge In another layer, another convolution could compare an edge to some stored image to identify a face or an automo‐ bile • Other operations reduce the many pieces of input data to a sin‐ gle, useful value We saw such functions earlier in the Map‐ Reduce algorithm In deep learning, such operations are called pooling Because tagged data lies behind supervised learning, it is normally divided into two sets: one to train the algorithm, the other to test the results of the algorithm The training set is much larger (often an order of magnitude greater than the test set) but the test set is ran‐ domly chosen from the original data in the hope of ensuring that it accurately represents the entire data The training set is input for each iteration of the layers, and the final layer submits its results to the test set The idea behind the test set is this: if you train an NLP system with a training set of English sentences and it successfully recognizes a test set of other English sentences, you have a good algorithm Each pass through the layers should produce an increas‐ ingly accurate result If it doesn’t, you have chosen bad parameters to start with and must redesign the deep learning model Some‐ times, another test set (also drawn from the original data) is applied after all the iterations of the model You might by now see a potential problem with these sets Because they are randomly drawn from a chosen set of input data, the pro‐ cess should robustly demonstrate that the training worked well for the input But what if you get incorrect or skewed input, which doesn’t accurately reflect real life? 
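Setting that question aside for a moment, here is a minimal sketch of the split itself, using the third-party scikit-learn package; the tiny made-up dataset and the simple classifier stand in for the much larger data and deep networks discussed above.

    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression

    # Pretagged records: each row is a set of features, each label the known outcome.
    features = [[23, 1], [31, 0], [45, 1], [52, 0], [29, 1], [61, 0], [38, 1], [47, 0]]
    labels   = [0, 1, 1, 1, 0, 1, 0, 1]

    # Hold out a randomly chosen 20% of the tagged data as the test set.
    X_train, X_test, y_train, y_test = train_test_split(
        features, labels, test_size=0.2, random_state=42)

    model = LogisticRegression().fit(X_train, y_train)
    print("accuracy on held-out test data:", model.score(X_test, y_test))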
We will look at that problem of skewed input shortly.

The magic of deep learning is that the same simple operations work well on many different types of decision making: what's shown in an image, what someone is saying into a phone, what movie is likely to appeal to different people, and even how likely someone is to do well in their job.

Deep learning, essentially, chooses the features that best predict what you're looking for: a propensity to buy a car, whether someone is worthy of getting credit, and so on. The decision can be based on various characteristics of the individuals or items that you are evaluating: age, gender, income, location, and more. In deep learning, these characteristics are called features or dimensions and can run into the millions. Thus, a major task of deep learning is to select the most valuable features—the ones that most correctly predict outcomes.

One type of dimensionality reduction ranks dimensions, testing how much better or worse a prediction is when a single dimension is removed. If predictions improve when the dimension is removed, it clearly should go.

We can also look for dimensions that have similar effects on predictions. For instance, income might have similar effects as educational level, perhaps because they're correlated in actual life. If you uncover such a relationship, you can keep the dimension that is the better predictor and eliminate the other.
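A minimal sketch of this drop-one-dimension-and-retest idea, using scikit-learn and its bundled wine dataset as a stand-in for real business data (the model and dataset are illustrative assumptions):

    import numpy as np
    from sklearn.datasets import load_wine
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    X, y = load_wine(return_X_y=True)
    feature_names = load_wine().feature_names

    baseline = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5).mean()

    for i, name in enumerate(feature_names):
        X_without = np.delete(X, i, axis=1)   # remove a single dimension
        score = cross_val_score(RandomForestClassifier(random_state=0), X_without, y, cv=5).mean()
        # A positive change means predictions improved without the dimension, so it can probably go.
        print(f"{name}: {score - baseline:+.3f}")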
Prerequisites for Successful Deep Learning

Although there is no end of things to say about deep learning, the discussion in this report hopefully demystifies it and helps you understand how your organization can deploy it. In the next section, I will give a high-level overview of programming libraries and tools for deep learning. To strip away more of the magic, let's first look at what is needed for it to be successful:

Accurate data
Deep learning reproduces the view of reality that we give it, and that view is often flawed. We can see many examples of bias in everyday life. People who make decisions about hiring, college acceptance, loans, and even whom to suspend from school tend to favor people who look like them—and those people are usually white, well educated, and cis-male. Health studies are usually based on white subjects, and often on people who are healthier and younger on average than those who will ultimately receive the treatment. Similar biases slip all too easily into machine learning, as in the embarrassing scandal when Google labeled pictures of black people as gorillas. Luckily, machine learning differs from everyday life in that one can consciously anticipate bias and compensate for it by making sure the training and test data is inclusive and well balanced. I'm not using "bias" here in a political sense, but as mathematicians and scientists use the term, because it goes far beyond such considerations as race and gender. If your dataset involves houses but you skip a certain neighborhood, or miss triple-deckers, or insert some other blind spot, that is also a bias and you will encounter problems applying your final model to real life.

Accurate tuning
A host of different operations can be used as the layers of a deep learning model, and each operation offers numerous knobs you can tweak. There are also different options for other parts of deep learning, such as calculating the error between the desired and actual results. All these things are called hyperparameters to distinguish them from the features or dimensions that the model is trying to discover, which could be called parameters. One slightly incorrect hyperparameter could drive the model into a sand trap. Choosing hyperparameters is partly a matter of experimentation and partly a matter of experience (which means gathering the results of previous experiments). An experienced data scientist knows from reading or experimentation that certain operations work well for certain types of analysis. Some modern deep learning services promise that people without much background can get good results by choosing generic operations and hyperparameters, and then just launching the training over and over until it converges on a good result. But this is by no means guaranteed—you could go in circles.

Each step in deep learning is well supported by rigorous research, but the entire process has an experimental aspect. Uncertainty about what is really controlling the final decision leads to problems with "explainability." We can't point to one simple reason such as a credit score or a purchase history to explain why you are granting or denying a loan. Thus, some human review must always be part of the process; someone possessing consciousness should take responsibility.

Finally, even if you come out in triumph with an accurate model, it will have an error rate just like every classification or decision-making process. If a human is right 75% of the time when granting a bank loan, and an algorithm is right 80% of the time, the algorithm is worth using. But there should still be due process and someone to review the decision. Otherwise, you won't be fair to the other 20% who are granted loans that they can't repay, or are denied loans that they deserve.

Special Algorithms for Streaming Data

Large quantities of data make it difficult to answer common questions such as "How many distinct countries do our prospective customers live in?" or "What is the average amount of money spent by a customer?" When the stream is potentially infinite, you can't hope to answer such a question through simple counting. The programming community has developed a plethora of clever algorithms for finding approximate answers. A brief description of one well-known algorithm with an impressive range of uses will illustrate the diversity of these research efforts.

Suppose that you want to offer a service to every customer who owns a home. You have a database of millions of customers, and a large proportion own homes, but you don't want to take the time to pull up a specific customer record and check the customer's status as a homeowner every time that customer visits your website. You can store information about homeowners in a very compressed data structure (typically fewer than 10 bits per homeowner) and run an algorithm called a Bloom filter on this data structure each time a customer visits your website.

If the Bloom filter tells you the customer is not in the set of homeowners, you have just saved a lot of time and processing power. The Bloom filter is never wrong when it returns a negative result. The weakness of the Bloom filter is that sometimes it returns a false positive: in our example, it might indicate that the customer is a homeowner when she actually isn't. A typical false-positive rate is 1%, which you can reduce by devoting more bits to each customer. The false positives shouldn't produce errors because you probably want to call up the customer record every time you get a positive match. At that point, you can perform a second check and then find out whether the Bloom filter was right or wrong. You need to spend a little more time on positive results and a little space on the extra data structure, but they're worth the expenditure for huge datasets.
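To show how such a compressed structure behaves, here is a toy Bloom filter in pure Python; the bit-array size and number of hashes are illustrative, and a production system would use a tuned library instead.

    import hashlib

    class BloomFilter:
        def __init__(self, size_bits=80_000, num_hashes=7):
            self.size = size_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(size_bits // 8 + 1)

        def _positions(self, item):
            for seed in range(self.num_hashes):
                digest = hashlib.sha256(f"{seed}:{item}".encode()).hexdigest()
                yield int(digest, 16) % self.size

        def add(self, item):
            for pos in self._positions(item):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def might_contain(self, item):
            # False means definitely absent; True means "probably present" (false positives possible).
            return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

    homeowners = BloomFilter()
    homeowners.add("customer-10398")

    if homeowners.might_contain("customer-20550"):
        pass  # only now pay for the full database lookup to confirm the match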
Programming Aids for Machine Learning

Advances in programming historically follow a traditional sequence. First, academics and researchers code up algorithms that are published only as mathematics in research papers. Then, as researchers learn the best way to represent the algorithms in code, the algorithms are released as programming libraries. Most programming languages offer convenient ways to distribute such libraries through packages or modules.

Libraries and modules are important ways to deliver functions because they are arranged hierarchically. For instance, complex string manipulation calls on simpler functions for handling characters. Matrix math uses functions that manipulate vectors. Packaging all the functions together ensures that each one finds its dependencies. Each package also has dependencies on other packages. The distribution of these packages under free and open source licenses makes it easy to combine them.

Python, currently the most popular language for machine learning, has been enhanced with hundreds of packages. Python likely became dominant thanks to two robust and stable mathematical packages that have been around a long time, NumPy and SciPy. Many other packages have dependencies on these. NumPy offers more fundamental math functions such as manipulation of arrays, vectors, matrices, and other data structures. SciPy offers sophisticated tools for science, such as signal processing, and many of SciPy's functions (for instance, convolution) fit right into machine learning.

The R language is also commonly used for machine learning because it was designed for statistics, a discipline that underlies data science. R also has been enhanced by numerous machine learning libraries. Scala and Java also seem to be used fairly often for machine learning. However, most of the computationally intensive functions in machine learning (including NumPy and large parts of SciPy) are coded in the C language for speed and portability.

In the past few years, packaging has been taken to a new level. TensorFlow, MXNet, and a few other packages let you set up machine learning in a couple of commands, specifying only a few essential hyperparameters such as the number of layers and the function to run on each. The libraries also include options for tweaking hyperparameters so that you have more control. Finally, the packages make it easy to view the deep learning networks as graphs so that you can debug them.
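As an example of setting up a model "in a couple of commands," here is a minimal sketch using TensorFlow's Keras API; the layer sizes, synthetic data, and training settings are arbitrary choices for illustration, not recommendations.

    import numpy as np
    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(20,)),                       # 20 input features per record
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),    # probability of a yes/no outcome
    ])

    # The optimizer, loss, and number of epochs are hyperparameters in the sense used earlier.
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

    features = np.random.rand(1000, 20).astype("float32")              # synthetic stand-in data
    labels = (features[:, 0] + features[:, 1] > 1.0).astype("float32")
    model.fit(features, labels, epochs=3, validation_split=0.1, verbose=0)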
The similarity of the functions and algorithms offered by various libraries indicates that the field of machine learning is maturing. The major cloud providers (AWS, Azure, GCP) offer these libraries, or proprietary equivalents, as services. But there is little discussion about how much expertise in statistics or computation you need to use these convenient libraries. Some bad AI might take place until people recognize their own educational limitations.

We are going to turn briefly from the applications that use streaming data to talk about another important change in computing: how programmers work on these applications.

New Architectures Lead to New Development Patterns

The field of software engineering has always concentrated on the process of programming. Its goals include the following:

Speedy and reliable delivery of services
Companies are less and less tolerant of six-month deadline delays. And when an engineer or marketing person thinks of a cool feature, they want it right away. Leaving a known bug in the product for several months is also unacceptable.

Accuracy
Companies want to catch programming errors as quickly as possible during development. Costs of fixing a bug go up rapidly the later in the product cycle it appears, and design flaws are even more crippling.

Ease of maintenance
Programming languages, tools, communication patterns, and organizational structure are all examined to make updates to the software go smoothly.

To achieve these benefits more successfully than ever before, programmers are making major changes in the ways they work. Trends in their work tend to parallel the structure of the applications. Technical decisions around programming languages and tools change, as do organizations, job descriptions, and schedules.

It has always been common to divide different parts of a large application among different developers or teams, but traditionally they still had to combine their efforts, test them, and deploy them as a monolithic application. This naturally slowed everything down, and could lead to puzzling bugs if one part of the system had an unanticipated effect on other parts. The age of microservices and containers has brought test and deployment into line with the efficient practices of modular development. Continuous integration and DevOps are the key elements of the new deployment systems.

Continuous integration automates the process of merging different contributions from different programmers and running the tests created by the programmers. We can configure a system such as Jenkins or Travis CI to run these steps every time new code is uploaded. At the center of the continuous integration process is regression testing, so called because it alerts the programmers whether something that used to work is now failing, or "regressing." Regression testing predated continuous integration but made it possible. In this model, programmers write numerous tests for different things that can go wrong, and continuous integration simply runs them all. Of course, programmers have blind spots and can't always anticipate what might go wrong, particularly with respect to security, because security involves malicious actors thinking of ways to use the system in perversely creative ways. So organizations need to do testing under realistic conditions, too. Nevertheless, enormous problems are often caught very early through continuous integration and regression testing.
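A minimal example of what such tests look like, written for the pytest framework; the discount function is hypothetical and is shown inline so the sketch is self-contained.

    import pytest

    def apply_discount(price, percent):
        """Function under test (hypothetical)."""
        if not 0 <= percent <= 100:
            raise ValueError("percent must be between 0 and 100")
        return round(price * (1 - percent / 100), 2)

    def test_discount_basic():
        assert apply_discount(100.0, 20) == 80.0

    def test_zero_percent_leaves_price_unchanged():
        assert apply_discount(59.99, 0) == 59.99

    def test_bad_percentages_are_rejected():
        with pytest.raises(ValueError):
            apply_discount(10.0, 150)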
DevOps blurs or even eliminates traditional organizational boundaries. Most organizations have long distinguished between developers who write code and operators who run the computer systems. These people had somewhat different skill sets, particularly for companies running programs on their own hardware. The developers would give the finished code to the operators, who would install it and make sure the systems stayed up.

Nowadays, automation has exposed operations through configuration files and application programming interfaces (APIs), especially in the cloud, making operations look more and more like programming. The term DevOps conveys the resulting combination of the two job skills. Programmers now control all aspects of deployment and can get new features up and running faster using automation.

Continuous integration and DevOps fit very comfortably with containers, microservices, and cloud computing, leading to new demands on programmers to fix reported bugs quickly and add features incrementally. Some organizations change the running software several times a day, cranking up innovation to high Mach speeds. The occasional bugs that still turn up because of the rate of change are probably the motivation for the famous slogan "move fast and break things."

Conclusion

Streaming data, whether generated by people or machines, contains hidden gems of information that can transform a business. Processing these streams is a little like a detective novel, in which scattered facts that seem unrelated turn out to explain the mystery.

But effective use of streaming data requires many prerequisites: finding the right data, cleaning and converting it into a useful form, storing it, and choosing the right analytics, which could take iterative experimentation. The purpose of this report has been to lay out the principles and activities behind the various parts of streaming data. As you peruse the trade press or pick up whitepapers and books on the various topics in the big data space, we hope the background in this report will help you make better sense of the possibilities they discuss.

About the Author

As an editor at O'Reilly Media, Andy Oram brought to publication O'Reilly's Linux series, the ground-breaking book Peer-to-Peer, and the best-seller Beautiful Code. Andy has also authored many reports on technical topics such as data lakes, web performance, and open source software. His articles have appeared in The Economist, Communications of the ACM, Copyright World, the Journal of Information Technology & Politics, Vanguardia Dossier, and Internet Law and Business. Conferences where he has presented talks include O'Reilly's Open Source Convention, FISL (Brazil), FOSDEM, DebConf, and LibrePlanet. Andy participates in the Association for Computing Machinery's policy organization, USTPC. He also writes for various websites about health IT and about issues in computing and policy.

Acknowledgments

The author thanks Nicole Schwartz, Michael Hausenblas, Jari Koister, and Simeon Schwarz for their review comments.
