Strata

Effective Multi-Tenant Distributed Systems
Challenges and Solutions when Running Complex Environments

Chad Carson and Sean Suchter

Effective Multi-Tenant Distributed Systems
by Chad Carson and Sean Suchter

Copyright © 2017 Pepperdata, Inc. All rights reserved.

Printed in the United States of America.

Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editors: Nicole Taché and Debbie Hardin
Production Editor: Nicholas Adams
Copyeditor: Octal Publishing, Inc.
Interior Designer: David Futato
Cover Designer: Randy Comer
Illustrator: Rebecca Demarest

October 2016: First Edition

Revision History for the First Edition
2016-10-10: First Release

The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. Effective Multi-Tenant Distributed Systems, the cover image, and related trade dress are trademarks of O'Reilly Media, Inc.

While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-491-96183-4
[LSI]

Chapter 1. Introduction to Multi-Tenant Distributed Systems

The Benefits of Distributed Systems

The past few decades have seen an explosion of computing power. Search engines, social networks, cloud-based storage and computing, and similar services now make seemingly infinite amounts of information and computation available to users across the globe. The tremendous scale of these services would not be possible without distributed systems.

Distributed systems make it possible for many hundreds or thousands of relatively inexpensive computers to communicate with one another and work together, creating the outward appearance of a single, high-powered computer. The primary benefit of a distributed system is clear: the ability to massively scale computing power relatively inexpensively, enabling organizations to scale up their businesses to a global level in a way that was not possible even a decade ago.

Performance Problems in Distributed Systems

As more and more nodes are added to the distributed system and interact with one another, and as more and more developers write and run applications on the system, complications arise. Operators of distributed systems must address an array of challenges that affect the performance of the system as a whole as well as individual applications' performance.

These performance challenges are different from those faced when operating a data center of computers that are running more or less independently, such as a web server farm. In a true distributed system, applications are split into smaller units of work, which are spread across many nodes and communicate with one another either directly or via shared input/output data.

Additional performance challenges arise with multi-tenant distributed systems, in which different users, groups, and possibly business units run different applications on the same cluster. (This is in contrast to a single, large distributed application, such as a search engine, which is quite complex and has intertask dependencies but is still just one overall application.) These challenges that come with multi-tenancy result from the diversity of applications running together on any node as well as the fact that the applications are written by many different developers instead of one engineering team focused on ensuring that everything in a single distributed application works well together.

Scheduling

One of the primary challenges in a distributed system is scheduling jobs and their component processes. Computing power might be quite large, but it is always finite, and the distributed system must decide which jobs should be scheduled to run where and when, and the relative priority of those jobs. Even sophisticated distributed-system schedulers have limitations that can lead to underutilization of cluster hardware, unpredictable job run times, or both. Examples include assuming the worst-case resource usage to avoid overcommitting, failing to plan for different resource types across different applications, and overlooking one or more dependencies, thus causing deadlock or starvation.

The scheduling challenges become more severe on multi-tenant clusters, which add fairness of resource access among users as a scheduling goal, in addition to (and often in conflict with) the goals of high overall hardware utilization and predictable run times for high-priority applications. Aside from the challenge of balancing utilization and fairness, in some extreme cases the scheduler might go too far in trying to ensure fairness, scheduling just a few tasks from many jobs for many users at once. This can result in latency for every job on the cluster and cause the cluster to use resources inefficiently because the system is trying to do too many disparate things at the same time.
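To make the first of those limitations concrete, the following is a minimal Python sketch of a toy scheduler that must reserve each task's declared worst-case memory before placing it. It is not any real scheduler (YARN's or otherwise), and the node count, memory sizes, and task figures are invented for illustration; the point is simply the gap between what is reserved and what the tasks actually use.

# Toy illustration (not any real scheduler): place tasks by reserving each
# task's declared worst-case memory, then compare reservations with what the
# tasks actually use. The gap is capacity the cluster cannot schedule even
# though the hardware is mostly idle.

NODE_MEMORY_GB = 64
NODES = 10

# (declared worst-case GB, typical actual GB) for each waiting task
tasks = [(8, 3)] * 100

free = [NODE_MEMORY_GB] * NODES
placed = []
for declared, actual in tasks:
    # Reserve the worst case, because the scheduler cannot risk overcommitting.
    node = next((i for i, f in enumerate(free) if f >= declared), None)
    if node is None:
        break  # no node has enough reserved headroom; remaining tasks wait
    free[node] -= declared
    placed.append((node, declared, actual))

reserved = sum(d for _, d, _ in placed)
used = sum(a for _, _, a in placed)
total = NODE_MEMORY_GB * NODES
print(f"tasks placed:    {len(placed)} of {len(tasks)}")
print(f"memory reserved: {reserved} GB ({reserved / total:.0%} of cluster)")
print(f"memory used:     {used} GB ({used / total:.0%} of cluster)")

In this toy case the cluster is 100 percent reserved while only about 38 percent of its memory is actually in use, and a fifth of the submitted tasks are still waiting: the scheduler is simultaneously "full" and underutilized.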
Challenges Specific to Multi-Tenant Distributed Systems

Just as monitoring multi-tenant distributed systems requires scaling to handle the huge number of unique time series from a constant stream of new processes, there are challenges due to the behavior of such time series, especially near the beginning and end of a series. In a traditional system, time series are being added or removed only occasionally, and the beginning or end of a time series is a relatively unimportant corner case. In contrast, for multi-tenant distributed systems, time series start and end points are the normal case, and the math used for aggregation, time downsampling, and presentation must correctly handle periods when a time series existed and periods when it did not.

Similarly, some metrics that operators commonly look at for a single machine can be counterintuitive across a cluster. For example, does a 100-node cluster have 10,000 percent of CPU available? Or instead, if the metrics system normalizes percentages across the cluster, and an idle node drops out, should the reported CPU usage percentage increase? Even more complicated is the case of a heterogeneous cluster, where some nodes are more powerful than others, in which case 100 percent of CPU or RAM on one node is not the same as 100 percent of CPU or RAM on another node. A single correct approach to aggregating data across nodes unfortunately does not exist. Whatever the approach, the operator must easily understand the meaning of each aggregate metric.
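As one illustration of how much the chosen convention matters, the following Python sketch compares two plausible ways of aggregating CPU usage across a small, heterogeneous cluster in which one node stops reporting partway through. The node names, core counts, and samples are invented, and neither convention is "the" correct one; the sketch only shows how the answers diverge.

# samples[interval][node] = (cores_in_node, cores_busy); a missing node means
# its time series did not exist in that interval (node down, or just added).
samples = [
    {"big1": (32, 8), "big2": (32, 24), "small1": (4, 4)},
    {"big1": (32, 8), "big2": (32, 24)},            # small1 stopped reporting
]

for t, nodes in enumerate(samples):
    # Convention A: capacity-weighted, over only the nodes that reported.
    capacity = sum(cores for cores, _ in nodes.values())
    busy = sum(used for _, used in nodes.values())
    # Convention B: naive mean of each node's own utilization percentage.
    naive = sum(used / cores for cores, used in nodes.values()) / len(nodes)
    print(f"interval {t}: capacity-weighted {busy / capacity:.0%}, "
          f"naive mean of per-node percentages {naive:.0%}")

Note how the fully busy 4-core node pulls the naive per-node average well above the capacity-weighted figure, and how both denominators silently change when that node stops reporting.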
Some kinds of metrics make sense only on distributed systems and can give valuable insights into performance. For example, a distributed file system might suffer from hotspots; that is, one node or one file is being accessed by many jobs at one time. Gathering metrics about CPU, disk, and network access for each node (or file or rack) and then comparing across nodes can help identify hotspots; in the case of a hotspot, an operator could change the replication for specific files or take other steps to rebalance the distributed file system usage.
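One common way to do that cross-node comparison is the interquartile-range heuristic mentioned in the notes at the end of this excerpt: compute the distribution of a metric across nodes and flag the nodes that sit far above the bulk of the distribution. A small Python sketch follows; the per-node numbers and the 1.5x multiplier are illustrative only.

import statistics

# Invented per-node metric (say, disk reads per second over the last minute).
reads_per_sec = {
    "node01": 120, "node02": 135, "node03": 110, "node04": 128,
    "node05": 650,  # suspiciously hot
    "node06": 140, "node07": 125, "node08": 118,
}

values = sorted(reads_per_sec.values())
q1, _, q3 = statistics.quantiles(values, n=4)   # first quartile, median, third quartile
iqr = q3 - q1
threshold = q3 + 1.5 * iqr                      # flag anything far above the bulk

hotspots = [node for node, v in reads_per_sec.items() if v > threshold]
print(f"Q1={q1:.0f}  Q3={q3:.0f}  threshold={threshold:.0f}  hotspots={hotspots}")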
Capturing and presenting metrics at various levels of detail — for the overall distributed system, specific nodes, and specific applications — can be critical in detecting and diagnosing problems specific to multi-tenant systems, such as the following examples:

- Low-priority work in one queue causing the entire system to back up. We recently encountered a situation in which a developer launched a Spark cluster on top of a YARN cluster and ran an interactive shell in Spark. He then went to lunch and left the Spark cluster running, holding cluster resources without doing any useful work.
- Multiple users submitting jobs through a common framework (or using copy-and-paste code) that was accidentally misconfigured to run all jobs on a single node, causing a hotspot.

Measuring the Effect of Attempted Improvements

When adjusting any "knob" on a system (whether directly changing configurations or using software like Pepperdata to adjust hardware usage dynamically), the effect of the adjustment must be measured. This can be a challenge even for a single machine running a few applications, but it is immensely more difficult for a multi-tenant distributed system that might be running hundreds of applications at once, with many thousands of processes across thousands of nodes. Several factors combine to create this complexity:

- Even a single application's performance can be difficult to measure, because modern distributed applications are composed of workflows (for example, a job consists of a series of map-shuffle-reduce stages, and many such jobs might be chained together in a directed acyclic graph [DAG], with complex dependencies among the stages). Reducing the run time of one particular process might not improve overall performance.
- The environment itself can be unfathomably complex because of the wide variety of applications, hardware usage profiles, and types of work involved.
- The environment changes constantly.
- Experimentation with an application might not yield useful results, because the background noise from the other applications running at the same time can drown out any signal from the experiment itself.

When operators are making changes to improve performance, it's important to have a monitoring system that presents both system-level and application-level metrics at the same time and allows for comparison.

Allocating Cluster Costs Across Tenants

Related to monitoring is reporting on the cost of a cluster and allocating that cost across the various tenants, such as business units. This type of reporting is often called chargeback (or showback, in the common case of reporting on usage but not actually charging departments for it). Traditional chargeback reports for distributed systems are generally based on disk storage used or queue capacity planned/reserved, both of which have limitations. Reports that incorporate all aspects of actual system usage are much more valuable.

Using disk storage as a metric for chargeback reports captures only one aspect of hardware usage, which is generally a small fraction of the total cost of the cluster. Compute-intensive jobs with low data usage would get a free ride, even though they can significantly interfere with other jobs and thus be expensive to perform. Likewise, departments with jobs requiring heavy I/O might pay less than departments that have stored a large amount of data but do not access it often, and that I/O (and the related network usage) can slow down other tenants on the cluster.

Charging by storage also provides a subtle disincentive to making the cluster more useful for everyone. A modern distributed system becomes increasingly useful as more data is added to the cluster and shared across users. Charging the data owner for storage penalizes groups that share their data and rewards those that do not add much data to the cluster but instead use data others have brought.

Systems that attempt to capture other hardware resources beyond storage but are based on planned usage (for example, queue capacity) also suffer from both inaccuracy and perverse incentives:

- Actual usage can differ from planned usage, so the cost allocation might not reflect reality.
- Charging by theoretical utilization encourages groups to use up all of their allocated capacity because it costs them nothing extra to do so. (In fact, the group might be better off using up their allocation, thereby keeping their queue "full" and ensuring that other groups cannot use more than their share.)
- Developers have no incentive to reduce overall cluster load, for example by removing obsolete jobs or optimizing their code.
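By contrast, a chargeback report based on measured usage of several resources at once avoids most of these problems. The following Python sketch is one minimal way to express that idea; the tenants, resource weights, and usage figures are entirely hypothetical, and a real report would draw these numbers from the cluster's metrics system rather than hardcoding them.

MONTHLY_CLUSTER_COST = 100_000  # dollars; hypothetical figure

# Fraction of cluster cost attributed to each resource (must sum to 1.0);
# the split itself is a policy choice, not a measurement.
weights = {"cpu": 0.40, "memory": 0.25, "disk_io": 0.20, "storage": 0.15}

# Measured usage per tenant, in whatever units the metrics system reports
# (for example core-hours, GB-hours, GB read/written, GB stored).
usage = {
    "search":    {"cpu": 70_000, "memory": 220_000, "disk_io": 40_000, "storage": 50},
    "analytics": {"cpu": 20_000, "memory": 150_000, "disk_io": 90_000, "storage": 400},
    "reporting": {"cpu": 10_000, "memory":  30_000, "disk_io": 10_000, "storage": 150},
}

totals = {r: sum(u[r] for u in usage.values()) for r in weights}
for tenant, u in usage.items():
    # Each tenant's share is its weighted fraction of actual consumption.
    share = sum(weights[r] * (u[r] / totals[r]) for r in weights)
    print(f"{tenant:10s} {share:6.1%}  ${share * MONTHLY_CLUSTER_COST:,.0f}")

Because the bill reflects actual consumption of CPU, memory, disk I/O, and storage together, compute-heavy or I/O-heavy work no longer rides for free, and hoarding planned capacity earns nothing.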
Summary

Multi-tenant distributed systems come with unique challenges in monitoring and troubleshooting, due both to their large scale and to the diversity of simultaneously running applications. Traditional monitoring systems fall short in both dimensions; open source tools such as Ganglia do not scale effectively to thousands of nodes, and most tools do not provide the process-level hardware usage information needed for operators to diagnose and fix problems. Gathering, processing, and presenting such granular hardware usage data for large clusters is a significant challenge, in terms of both systems engineering (to process and store the data efficiently and in a scalable manner) and presentation-level logic and math (to present it usefully and accurately).

One problem specific to multi-tenant distributed systems is the combinatorial explosion of unique time series arising from many users running many jobs on many nodes, often resulting in millions of new time series each day on a reasonably sized cluster. Efficiently and correctly handling such a huge number of short-lived time series is a major challenge.

In addition to identifying and solving performance problems on a cluster, gathering detailed process-level hardware usage metrics allows operators to allocate the cost of the cluster across the different users, groups, or business units using the cluster. Basing cost allocation on actual usage of all resources is much more accurate than considering just a single resource (like disk storage space) or planned usage (like quota size).

Notes

1. See https://en.wikipedia.org/wiki/Load_(computing) and http://www.howtogeek.com/194642/understanding-the-load-average-on-linux-and-other-unix-like-systems/.
2. See, for example, http://latencytipoftheday.blogspot.com/2014/06/latencytipoftheday-you-cant-average.html for more discussion.
3. For example, a 1,000-node cluster with each node running 20 tasks at a time and each task lasting an average of 15 minutes would have nearly two million tasks per day; even gathering just ten metrics per task would result in 20 million unique time series per day, or more than seven billion unique time series per year.
4. See http://opentsdb.net/docs/build/html/user_guide/guis/index.html.
5. For some details of the optimizations in the Pepperdata Dashboard, see http://www.slideshare.net/Beeks06/a-billion-points-of-data-pressure and http://pepperdata.com/2014/06/a-billion-points-of-data-pressure/.
6. One standard heuristic is to compute the distribution of values for a metric across nodes and identify cases when metrics for one or a few nodes are more than a fixed multiple of the interquartile range. For details on that approach and others, see https://en.wikipedia.org/wiki/Interquartile_range and https://en.wikipedia.org/wiki/Outlier.
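For readers who want to reproduce the arithmetic in note 3, here is a short Python calculation using the same assumptions stated in that note (1,000 nodes, 20 concurrent tasks per node, 15-minute average task length, ten metrics per task):

# Reproducing the arithmetic in note 3.
nodes, tasks_per_node, task_minutes, metrics_per_task = 1_000, 20, 15, 10

tasks_per_day = nodes * tasks_per_node * (24 * 60 // task_minutes)   # ~2 million
series_per_day = tasks_per_day * metrics_per_task                    # ~20 million
print(f"{tasks_per_day:,} tasks/day -> {series_per_day:,} time series/day "
      f"-> {series_per_day * 365:,} per year")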
Chapter 9. Conclusion: Performance Challenges and Solutions for Effective Multi-Tenant Distributed Systems

Organizations now have access to more data than ever before from an increasing number of sources, and big data has fundamentally changed the way all of that information is managed. The promise of big data is the ability to make sense of the many data sources by using real-time and ad hoc analysis to derive time-critical business insights, enabling organizations to become smarter about their customers, operations, and overall business. As volumes of business data increase, organizations are rapidly adopting distributed systems to store, manage, process, and serve big data for use in analytics, business intelligence, and decision support. Beyond the world of big data, the use of distributed systems for other kinds of applications has also grown dramatically over the past several decades. Just a few examples include physics simulations for automobile and aircraft design, computer graphics rendering, and climate simulation.

Unfortunately, fundamental performance limitations of distributed systems can prevent organizations from achieving the predictability and reliability needed to realize the promise of large-scale distributed applications in production, especially in multi-tenant and mixed-workload environments. The need to have predictable performance is more critical than ever before, because for most businesses, information is the competitive edge needed to survive in today's data-driven economy.

Computing resources even in the largest organizations are finite, and, as computational demands increase, bottlenecks can result in many areas of the distributed system environment. The timely completion of a job requires a sufficient allocation of CPU, memory, disk, and network to every component of that job. As the number and complexity of jobs grow, so does the contention for these limited computing resources. Furthermore, the availability of individual computing resources can vary wildly over time, radically increasing the complexity of scheduling jobs. Business-visible symptoms of performance problems resulting from this complexity include underutilized or overutilized hardware (sometimes both at the same time), jobs completing late or unpredictably, and even cluster crashes.

In this book, we have explored the most critical areas of computing resource contention and some of the solutions that organizations are currently using to overcome these issues. Although manual performance tuning will always have a role, many challenges encountered in today's distributed system environments are far too complex to be solved by people power alone: a more sophisticated, software-based solution is necessary to achieve predictability and reliability. The availability of real-time, intelligent software that dynamically reacts to and reshapes access to hardware resources allows organizations to regain control of their distributed systems. One example is the Pepperdata software solution, which provides organizations with the ability to increase utilization for distributed systems such as Hadoop by closely monitoring actual hardware usage and dynamically allowing more or fewer processes to be scheduled on a given node, based on the current and projected future hardware usage on that node. This type of solution also ensures that business-critical applications run reliably and predictably so that they meet their SLAs. Such intelligent software enables faster, better decision making and ultimately faster, better business results.

About the Authors

At Microsoft, Yahoo, and Inktomi, Chad Carson led teams using huge amounts of data to build web-scale products, including social search at Bing and sponsored search ranking and optimization at Yahoo. Before getting into web search, Chad worked on computer vision and image retrieval, earning a Ph.D. in EECS from UC Berkeley. Chad also holds Bachelor's degrees in History and Electrical Engineering from Rice University.

Sean Suchter was the founding General Manager of Microsoft's Silicon Valley Search Technology Center, where he led the integration of Facebook and Twitter content into Bing search. Prior to Microsoft, Sean managed the Yahoo Search Technology team, the first production user of Hadoop. Sean joined Yahoo through the acquisition of Inktomi, and holds a B.S. in Engineering and Applied Science from Caltech.
Table of Contents

1. Introduction to Multi-Tenant Distributed Systems
   The Benefits of Distributed Systems
   Performance Problems in Distributed Systems
   Scheduling
   Hardware Bottlenecks
   Lack of Visibility Within Multi-Tenant Distributed Systems
   The Impact on Business from Performance Problems
   Scope of This Book
   Hadoop: An Example Distributed System
   Terminology
2. Scheduling in Distributed Systems
   Introduction
   Dominant Resource Fairness Scheduling
   Aggressive Scheduling for Busy Queues
   Special Scheduling Treatment for Small Jobs
   Workload-Specific Scheduling Considerations
   Inefficiencies in Scheduling
   The Need to be Conservative with Memory
   Inability to Effectively Schedule the Use of Other Resources
   Deadlock and Starvation
   Waste Due to Speculative Execution
   Summary
3. CPU Performance Considerations
   Introduction
   Algorithm Efficiency
   Kernel Scheduling
   Intentional or Accidental Bad Actors
   Applying the Control Mechanisms in Multi-Tenant Distributed Systems
   I/O Waiting and CPU Cache Impacts
   Summary
4. Memory Usage in Distributed Systems
   Introduction
   Physical Versus Virtual Memory
   Node Thrashing
   Detecting and Avoiding Thrashing
   Kernel Out-Of-Memory Killer
   Implications of Memory-Intensive Workloads for Multi-Tenant Distributed Systems
   Solutions
   Summary
5. Disk Performance: Identifying and Eliminating Bottlenecks
   Introduction
   Overview of Disk Performance Limits
   Disk Behavior When Using Multiple Disks
   Disk Performance in Multi-Tenant Distributed Systems
   Controlling Disk I/O Usage to Improve Performance for High-Priority Applications
   Basic Disk I/O Prioritization Tools and Their Limitations
   Effective Control of Disk I/O Usage
   Solid-State Drives and Distributed Systems
   Measuring Performance and Diagnosing Problems
   Summary
6. Network Performance Limits: Causes and Solutions
   Introduction
   Bandwidth Problems in Distributed Systems
   Hadoop's Solution to Network Bottlenecks: Move Computation to the Data
   Why Network Quality of Service Does Not Solve the Problem of Network Bottlenecks
   Controlling Network Usage on a Per-Application Basis
   Other Network-Related Bottlenecks and Problems
   Measuring Network Performance and Debugging Problems
   ping and mtr
   Retransmissions
   Summary
7. Other Bottlenecks in Distributed Systems
   Introduction
   NameNode Contention
   ResourceManager Contention
   ZooKeeper Locks
   External Databases and Related Systems
   DNS Servers
   Summary
8. Monitoring Performance: Challenges and Solutions
   Introduction
   Why Monitor?
   What to Monitor
   Systems and Performance Aspects of Monitoring
   Handling Huge Amounts of Metrics Data
   Reliability of the Monitoring System
   Some Commonly Used Monitoring Systems
   Algorithmic and Logical Aspects of Monitoring
   Challenges Specific to Multi-Tenant Distributed Systems
   Measuring the Effect of Attempted Improvements
   Allocating Cluster Costs Across Tenants
   Summary
9. Conclusion: Performance Challenges and Solutions for Effective Multi-Tenant Distributed Systems