Effective Multi-Tenant Distributed Systems
Challenges and Solutions when Running Complex Environments

Chad Carson and Sean Suchter

Beijing Boston Farnham Sebastopol Tokyo

Effective Multi-Tenant Distributed Systems
by Chad Carson and Sean Suchter

Copyright © 2017 Pepperdata, Inc. All rights reserved.
Printed in the United States of America.
Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editors: Nicole Taché and Debbie Hardin
Production Editor: Nicholas Adams
Copyeditor: Octal Publishing, Inc.
Interior Designer: David Futato
Cover Designer: Randy Comer
Illustrator: Rebecca Demarest

October 2016: First Edition

Revision History for the First Edition
2016-10-10: First Release

The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. Effective Multi-Tenant Distributed Systems, the cover image, and related trade dress are trademarks of O'Reilly Media, Inc.

While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-491-96183-4
[LSI]

Table of Contents

1. Introduction to Multi-Tenant Distributed Systems
   The Benefits of Distributed Systems; Performance Problems in Distributed Systems; Lack of Visibility Within Multi-Tenant Distributed Systems; The Impact on Business from Performance Problems; Scope of This Book

2. Scheduling in Distributed Systems
   Introduction; Dominant Resource Fairness Scheduling; Aggressive Scheduling for Busy Queues; Special Scheduling Treatment for Small Jobs; Workload-Specific Scheduling Considerations; Inefficiencies in Scheduling; Summary

3. CPU Performance Considerations
   Introduction; Algorithm Efficiency; Kernel Scheduling; I/O Waiting and CPU Cache Impacts; Summary

4. Memory Usage in Distributed Systems
   Introduction; Physical Versus Virtual Memory; Node Thrashing; Kernel Out-Of-Memory Killer; Implications of Memory-Intensive Workloads for Multi-Tenant Distributed Systems; Summary

5. Disk Performance: Identifying and Eliminating Bottlenecks
   Introduction; Overview of Disk Performance Limits; Disk Behavior When Using Multiple Disks; Disk Performance in Multi-Tenant Distributed Systems; Controlling Disk I/O Usage to Improve Performance for High-Priority Applications; Solid-State Drives and Distributed Systems; Measuring Performance and Diagnosing Problems; Summary

6. Network Performance Limits: Causes and Solutions
   Introduction; Bandwidth Problems in Distributed Systems; Other Network-Related Bottlenecks and Problems; Measuring Network Performance and Debugging Problems; Summary

7. Other Bottlenecks in Distributed Systems
   Introduction; NameNode Contention; ResourceManager Contention;
ZooKeeper; Locks; External Databases and Related Systems; DNS Servers; Summary

8. Monitoring Performance: Challenges and Solutions
   Introduction; Why Monitor?; What to Monitor; Systems and Performance Aspects of Monitoring; Algorithmic and Logical Aspects of Monitoring; Measuring the Effect of Attempted Improvements; Allocating Cluster Costs Across Tenants; Summary

9. Conclusion: Performance Challenges and Solutions for Effective Multi-Tenant Distributed Systems

Chapter 1. Introduction to Multi-Tenant Distributed Systems

The Benefits of Distributed Systems

The past few decades have seen an explosion of computing power. Search engines, social networks, cloud-based storage and computing, and similar services now make seemingly infinite amounts of information and computation available to users across the globe. The tremendous scale of these services would not be possible without distributed systems. Distributed systems make it possible for many hundreds or thousands of relatively inexpensive computers to communicate with one another and work together, creating the outward appearance of a single, high-powered computer. The primary benefit of a distributed system is clear: the ability to massively scale computing power relatively inexpensively, enabling organizations to scale up their businesses to a global level in a way that was not possible even a decade ago.

Performance Problems in Distributed Systems

As more and more nodes are added to the distributed system and interact with one another, and as more and more developers write and run applications on the system, complications arise. Operators of distributed systems must address an array of challenges that affect the performance of the system as a whole as well as individual applications' performance.

Figure 8-1. Sample screenshot from the Pepperdata Dashboard displaying both node-level and process-level metrics. Note that the various jobs' contribution to hardware usage is different for different hardware resources. Source: Pepperdata.

Process-level hardware metrics and semantic metrics for distributed system framework components
Daemons running on every node affect the performance of applications running on that node and others, both because they require system resources to run and because they provide common services to user applications. In the case of Hadoop, for example, the datanode daemon on each worker node can use significant disk and network resources; it can also act as a bottleneck for applications because they access HDFS data via the datanode process.

Performance and semantic metrics for centralized services running on other nodes
As described in Chapter 7, centralized services can act as bottlenecks for the distributed system as a whole. For example, Hadoop performance can be significantly degraded by poor performance of the ResourceManager or NameNode. These centralized services also generate metrics that provide visibility into requests made by applications running on individual worker nodes.

System and application log files are occasionally used for monitoring, but they are much more commonly used for troubleshooting application problems, such as coding bugs, broken system or data dependencies, and related issues. (Log files can be used for alerting for cases in which the user who needs to set up alerts cannot change the application itself to add metrics to output but can access the log files.)
Log-based analysis can be expensive, partly due to the volumes of data involved and partly because such analysis requires detecting patterns in large, free-form log files.

Developers and operators must understand the nuances of the specific metrics that are being recorded. This point might seem obvious, but there can be surprising differences between similarly named metrics on different operating systems. For example, "load average" on Linux systems includes processes waiting for resources such as disk reads/writes, whereas on some other Unix systems, it includes only processes waiting for the CPU.[1]

[1] See https://en.wikipedia.org/wiki/Load_(computing) and http://www.howtogeek.com/194642/understanding-the-load-average-on-linux-and-other-unix-like-systems/

Systems and Performance Aspects of Monitoring

The large number of metrics to capture, the large number of nodes and processes to track, and the frequency of gathering data all mean that for a distributed system of interesting size, metrics storage itself is a distributed systems problem. When designing a system to gather, store, and present monitoring data, it's important to consider what kinds of queries must be supported (both for visualization and alerting) and what performance levels are required for those queries. For example, a high-level dashboard needs to respond in just a second or two; drilling down to diagnose issues could take a few seconds; generating long-term trend reports could be done offline and take many seconds. The data structures and system architecture must reflect those requirements. In addition, building the monitoring system will involve tradeoffs among the level of metrics granularity supported, the amount of history to store, the performance of interactive queries, and the size and hardware cost of the system.

Most metrics used for monitoring are inherently time-series data, so they must be gathered and stored as time series. They will often be presented to an operator as time-series data, but they can also be aggregated over time; for example, reporting the total disk IOPS by a process for the entire lifetime of the process. They can also be aggregated across processes or nodes; for example, reporting the total memory used by all of an application's tasks at a particular time. (In some cases, both kinds of aggregation will be done; for example, reporting the total network bytes written by all tasks of an application for the entire runtime of that application.)

Handling Huge Amounts of Metrics Data

In an ideal world, these metrics would be captured at extremely high frequency, for example every metric for every process every few hundred milliseconds. The real world, of course, has tradeoffs. The act of collecting the data can itself affect the performance of the node. (For example, running top and iotop on every node continuously and storing the result every second would capture useful data but would also add significant CPU load on the nodes.)
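To make the collection-cost tradeoff concrete, here is a minimal sketch of a lightweight per-process collector that reads raw counters from Linux /proc at a modest interval instead of continuously running tools like top and iotop. It is an illustration only: the choice of counters, the 30-second interval, and the assumption of a Linux node are ours, not a recommendation from the text.

    import time

    def read_process_metrics(pid):
        """Return a few raw hardware-usage counters for one process (Linux /proc)."""
        metrics = {}
        with open(f"/proc/{pid}/stat") as f:
            stat = f.read()
        # The command name is in parentheses and can contain spaces, so split
        # on the closing parenthesis; utime and stime then sit at indexes 11 and 12.
        after_comm = stat.rsplit(")", 1)[1].split()
        metrics["cpu_ticks"] = int(after_comm[11]) + int(after_comm[12])
        with open(f"/proc/{pid}/io") as f:
            for line in f:
                key, _, value = line.partition(":")
                if key in ("read_bytes", "write_bytes"):
                    metrics[key] = int(value)
        return metrics

    def sample(pids, interval_seconds=30):
        """Sample each process at a modest interval to keep collector overhead low."""
        while True:
            timestamp = int(time.time())
            for pid in pids:
                try:
                    print(timestamp, pid, read_process_metrics(pid))
                except (FileNotFoundError, PermissionError):
                    pass  # The process exited, or /proc/<pid>/io is restricted.
            time.sleep(interval_seconds)

Lengthening the interval or trimming the set of counters directly trades measurement detail against the CPU the collector itself consumes on every node.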
Storing, aggregating, and presenting data also has a cost, which increases with the number of data points collected. In practice, metrics systems can use different time granularities for different types of metrics—for example, operating system versions change very rarely; disk space usage changes significantly only over the course of many minutes, hours, or days; CPU usage changes many times a second.

To reduce the cost of storage, monitoring systems often aggregate older data and discard the original raw data after some time period. The reduction can be done in the time domain (by downsampling), in the level of detail (by aggregating across nodes or across processes), or both. Offline aggregation across processes and nodes also reduces the computational cost and latency at presentation time when users query the monitoring system; in fact, it might be required to enable interactive use.

Aggregation only helps at query time; it does not mitigate any performance impact on the distributed system nodes when data is collected or the initial cost to the monitoring system of processing new data.

When downsampling in time or aggregating across processes and nodes, it is important to consider some subtle side effects of data reduction—in particular, spikes in time or unusually high usage on one host are lost when data is smoothed by downsampling or aggregation, and it is often exactly these spikes that indicate both the symptoms and causes of performance bottlenecks. In many cases, keeping both averages/sums (which accurately reflect total usage) and min/max/percentiles (e.g., 95th percentile usage, which reflect spikes) is a way to reduce the metrics blindness that can result from naive downsampling and aggregation.

After a percentile is calculated, further downsampling is dangerous; downsampling a series of percentiles gives a wildly incorrect result. To compute percentiles requires information about the full distribution of values. Summary statistics like mean/min/max/total/count/percentiles do not provide sufficient information to compute a new percentile value when multiple time periods are combined.[2]

[2] See, for example, http://latencytipoftheday.blogspot.com/2014/06/latencytipoftheday-you-cant-average.html for more discussion.

One challenge specific to metrics data for multi-tenant distributed systems is the large number of unique time series generated. Traditional data centers, even very large ones, can have thousands of nodes and just one or a few long-lived applications on each node, with tens of metrics per node, resulting in tens of thousands of time series. The resulting time series generally last for days or weeks, because nodes are added or removed, and long-running applications are moved across nodes, only rarely. In contrast, a distributed system like Hadoop has many orders of magnitude more unique time series—easily over a billion per year.[3] (The number of time series can grow even larger if developers are allowed to generate their own application-specific metrics, which increases the data volume.)
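The caution about percentiles can be made concrete with a small downsampling sketch: fine-grained samples are rolled up into coarser buckets while keeping only statistics that remain combinable: sum, count, minimum, and maximum. The five-minute bucket size and the disk-write example are arbitrary choices for illustration.

    from collections import defaultdict

    def downsample(samples, bucket_seconds=300):
        """Roll (timestamp, value) samples into coarser buckets, keeping sum,
        count, min, and max. A percentile is deliberately not stored, because
        percentiles of buckets cannot be combined into a percentile of the
        original data."""
        buckets = defaultdict(lambda: {"sum": 0.0, "count": 0,
                                       "min": float("inf"), "max": float("-inf")})
        for timestamp, value in samples:
            b = buckets[timestamp - timestamp % bucket_seconds]
            b["sum"] += value
            b["count"] += 1
            b["min"] = min(b["min"], value)
            b["max"] = max(b["max"], value)
        return dict(buckets)

    # One hour of per-second disk-write rates, mostly quiet with a single spike.
    samples = [(t, 5.0) for t in range(3600)]
    samples[1800] = (1800, 900.0)
    stats = downsample(samples)[1800]
    print(round(stats["sum"] / stats["count"], 1), stats["min"], stats["max"])
    # -> 8.0 5.0 900.0: the average smooths the spike away, but the retained
    #    maximum still shows it.

The same rollup could not be applied to a column of stored 95th-percentile values; that is exactly the trap described above.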
Most systems designed to store large numbers of time series cannot scale anywhere near this number of unique series.

[3] For example, a 1,000-node cluster with each node running 20 tasks at a time and each task lasting an average of 15 minutes would have nearly two million tasks per day; even gathering just ten metrics per task would result in 20 million unique time series per day, or more than seven billion unique time series per year.

Reliability of the Monitoring System

Operators must also ensure that the monitoring system itself is robust and reliable. Of course, operators should not inject metrics back into the system they are tracking—but more than that, they should ensure that the monitoring system is as resilient as possible to a problem with the distributed system being monitored. For example, if a network outage occurs in the distributed system, what happens to the monitoring system? While real-time data cannot be gathered and ingested into the monitoring system during a network outage, the operator should still be able to access the monitoring system, at least to see that a problem occurred with the primary system. In addition, metrics should continue to be collected on the distributed system nodes so that they can later be gathered and ingested, rather than lost completely.

On a related note, the monitoring system itself should have some basic monitoring—at least enough to detect if monitoring is down or data is no longer being collected.

Some Commonly Used Monitoring Systems

One system used by many companies for cluster monitoring is Ganglia, which collects metrics data from individual nodes. Ganglia uses a tree structure among the nodes in the distributed system to gather and centralize data from all of the nodes. It then stores that data in a storage system called RRDtool. RRDtool, which is also used by several other monitoring systems, has been optimized for efficient data writes. It uses a predefined amount of storage for all data going back in time, but it stays within this limit by keeping an increasingly downsampled amount of history for older data; as a result, it rapidly loses granularity going back in time. Both Ganglia and RRDtool suffer from scalability challenges on reasonably sized clusters and when storing large numbers of metrics. For example, because RRDtool performs data aggregation primarily at query time, viewing data across nodes or across processes can be prohibitively slow.

Another time-series data storage system is OpenTSDB, which is designed to store large numbers of time series. OpenTSDB is generally used as just one component of a broader monitoring system, in part because its user interface is limited,[4] and it does not natively support important features like offline aggregation of time series; it performs all aggregations at query time, which limits performance and scalability. (An example of a system that uses OpenTSDB as a component is the Pepperdata Dashboard, which uses several components both upstream and downstream of OpenTSDB to perform a range of optimizations at data ingestion and query time, along with providing a user interface designed for exploration and troubleshooting of very large multi-tenant clusters.[5] See Figure 8-2.)
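The fixed-size, increasingly downsampled retention scheme described for RRDtool can be sketched roughly as follows. This is a simplified illustration of the general idea only, not RRDtool's actual storage format or API; the three retention tiers are arbitrary choices, and samples are assumed to arrive in time order.

    from collections import deque

    class TieredRetention:
        """Keep recent data at fine resolution and older data at coarser
        resolution, within a fixed, predefined amount of memory."""
        def __init__(self):
            # (bucket size in seconds, number of buckets kept per tier)
            self.tiers = [(60, 24 * 60),      # 1-minute averages for a day
                          (3600, 24 * 7),     # hourly averages for a week
                          (86400, 365)]       # daily averages for a year
            self.archives = [deque(maxlen=n) for _, n in self.tiers]
            self.pending = [{} for _ in self.tiers]  # partial buckets being filled

        def add(self, timestamp, value):
            for i, (step, _) in enumerate(self.tiers):
                bucket = timestamp - timestamp % step
                acc = self.pending[i]
                if acc and acc["bucket"] != bucket:
                    # Bucket complete: store its average; once the archive is
                    # full, the oldest bucket is evicted and detail is lost.
                    self.archives[i].append((acc["bucket"],
                                             acc["sum"] / acc["count"]))
                    acc.clear()
                if not acc:
                    acc.update(bucket=bucket, sum=0.0, count=0)
                acc["sum"] += value
                acc["count"] += 1

    retention = TieredRetention()
    for t in range(0, 3 * 86400, 15):        # three days of 15-second samples
        retention.add(t, float(t % 100))
    print(len(retention.archives[0]))        # capped at 1440 one-minute buckets

The total memory is bounded by the tier sizes no matter how long the system runs, which is what makes the approach attractive; the eviction of old fine-grained buckets is also why granularity is rapidly lost going back in time.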
[4] See http://opentsdb.net/docs/build/html/user_guide/guis/index.html

[5] For some details of the optimizations in the Pepperdata Dashboard, see http://www.slideshare.net/Beeks06/a-billion-points-of-datapressure and http://pepperdata.com/2014/06/a-billion-points-of-data-pressure/

Figure 8-2. Architecture of the Pepperdata Dashboard, which utilizes OpenTSDB along with other components (for example, servers performing ingestion-time and offline aggregation) to improve performance.

Algorithmic and Logical Aspects of Monitoring

Operators want to identify and be notified about anomalies—cases when the distributed system is behaving outside the expected performance range. Some common examples include spikes in disk or network usage, thrashing, and high application latency (which could indicate a problem with that particular application or the system as a whole). These anomalies generally fall into one of two categories:

• Outliers at one point in time, when one node or application is behaving very differently from others at that time, which often indicates a hotspot or hardware problem. It can be straightforward to automatically detect such outliers for a single metric across similar nodes, other than requiring some tuning.[6] In interactive use, operators might examine tables showing many similar items (for example, all nodes or all jobs) and quickly sort the items with the highest or lowest values for a particular metric.

• Changes over time, such that the entire system is behaving differently from the way it normally does. For this kind of anomaly, it's important to understand the normal variation over time, including periodic effects. For example, if jobs are taking longer to complete than usual, is that because cluster usage always spikes on Mondays, or it's now the end of the quarter and groups are generating extra reports, or a similar "organic" cause—or is it a true anomaly?

When using a monitoring system for alerting (active notification to operators about a problem, for example by sending an email or text message), users should consider the following two common types of alerts:

• If a metric's current value exceeds a threshold or returns an error right now, or is automatically detected as an anomaly, send an alert.

• If a metric has been in one of the above states for some time (not just momentarily), send an alert. This type of alert is useful if a metric tends to spike transiently during the normal course of operation, or if operators only want to be notified of problems that are ongoing and likely require manual intervention. (Note that this kind of alert is much harder to support from a systems perspective because the alerting system needs to store history about the recent states of the metric rather than just running one query about a single point in time.)
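As a rough illustration of the difference between these two kinds of alerts, the following sketch (with invented metric values and thresholds, not tied to any particular monitoring product) evaluates an instantaneous threshold alert alongside a sustained alert that fires only after several consecutive bad samples:

    from collections import deque

    class ThresholdAlerter:
        """Fires as soon as a single sample crosses the threshold."""
        def __init__(self, threshold):
            self.threshold = threshold

        def check(self, value):
            return value > self.threshold

    class SustainedAlerter:
        """Fires only if the last `window` samples all crossed the threshold,
        which requires keeping recent history for every metric tracked."""
        def __init__(self, threshold, window):
            self.threshold = threshold
            self.history = deque(maxlen=window)

        def check(self, value):
            self.history.append(value > self.threshold)
            return len(self.history) == self.history.maxlen and all(self.history)

    # Node CPU utilization (percent), one sample per minute.
    instant = ThresholdAlerter(threshold=95.0)
    sustained = SustainedAlerter(threshold=95.0, window=3)
    for value in [40.0, 97.0, 99.0, 98.0, 97.0, 50.0]:
        print(value, instant.check(value), sustained.check(value))

The instantaneous alerter fires on every transient spike, while the sustained alerter stays quiet until the condition has held for three consecutive samples; holding that per-metric history is the extra systems burden noted above.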
[6] One standard heuristic is to compute the distribution of values for a metric across nodes and identify cases when metrics for one or a few nodes are more than a fixed multiple of the interquartile range. For details on that approach and others, see https://en.wikipedia.org/wiki/Interquartile_range and https://en.wikipedia.org/wiki/Outlier

Challenges Specific to Multi-Tenant Distributed Systems

Just as monitoring multi-tenant distributed systems requires scaling to handle the huge number of unique time series from a constant stream of new processes, there are challenges due to the behavior of such time series, especially near the beginning and end of a series. In a traditional system, time series are being added or removed only occasionally, and the beginning or end of a time series is a relatively unimportant corner case. In contrast, for multi-tenant distributed systems, time series start and end points are the normal case, and the math used for aggregation, time downsampling, and presentation must correctly handle periods when a time series existed and periods when it did not.

Similarly, some metrics that operators commonly look at for a single machine can be counterintuitive across a cluster. For example, does a 100-node cluster have 10,000 percent of CPU available? Or instead, if the metrics system normalizes percentages across the cluster, and an idle node drops out, should the reported CPU usage percentage increase? Even more complicated is the case of a heterogeneous cluster, where some nodes are more powerful than others, in which case 100 percent of CPU or RAM on one node is not the same as 100 percent of CPU or RAM on another node. A single correct approach to aggregating data across nodes unfortunately does not exist. Whatever the approach, the operator must easily understand the meaning of each aggregate metric.

Some kinds of metrics make sense only on distributed systems and can give valuable insights into performance. For example, a distributed file system might suffer from hotspots; that is, one node or one file is being accessed by many jobs at one time. Gathering metrics about CPU, disk, and network access for each node (or file or rack) and then comparing across nodes can help identify hotspots; in the case of a hotspot, an operator could change the replication for specific files or take other steps to rebalance the distributed file system usage.

Capturing and presenting metrics at various levels of detail—for the overall distributed system, specific nodes, and specific applications—can be critical in detecting and diagnosing problems specific to multi-tenant systems, such as the following examples:

• Low-priority work in one queue causing the entire system to back up. We recently encountered a situation in which a developer launched a Spark cluster on top of a YARN cluster and ran an interactive shell in Spark. He then went to lunch and left the Spark cluster running, holding cluster resources without doing any useful work.

• Multiple users submitting jobs through a common framework (or using copy-and-paste code) that was accidentally misconfigured to run all jobs on a single node, causing a hotspot.

Measuring the Effect of Attempted Improvements

When adjusting any "knob" on a system (whether directly changing configurations or using software like Pepperdata to adjust hardware usage dynamically), the effect of the adjustment must be measured.
This can be a challenge even for a single machine running a few applications, but it is immensely more difficult for a multi-tenant distributed system that might be running hundreds of applications at once, with many thousands of processes across thousands of nodes. Several factors combine to create this complexity:

• Even a single application's performance can be difficult to measure, because modern distributed applications are composed of workflows (for example, a job consists of a series of map-shuffle-reduce stages, and many such jobs might be chained together in a directed acyclic graph [DAG], with complex dependencies among the stages). Reducing the run time of one particular process might not improve overall performance.

• The environment itself can be unfathomably complex because of the wide variety of applications, hardware usage profiles, and types of work involved.

• The environment changes constantly. Experimentation with an application might not yield useful results, because the background noise from the other applications running at the same time can drown out any signal from the experiment itself.

When operators are making changes to improve performance, it's important to have a monitoring system that presents both system-level and application-level metrics at the same time and allows for comparison.

Allocating Cluster Costs Across Tenants

Related to monitoring is reporting on the cost of a cluster and allocating that cost across the various tenants, such as business units. This type of reporting is often called chargeback (or showback, in the common case of reporting on usage but not actually charging departments for it). Traditional chargeback reports for distributed systems are generally based on disk storage used or queue capacity planned/reserved, both of which have limitations. Reports that incorporate all aspects of actual system usage are much more valuable.

Using disk storage as a metric for chargeback reports captures only one aspect of hardware usage, which is generally a small fraction of the total cost of the cluster. Compute-intensive jobs with low data usage would get a free ride, even though they can significantly interfere with other jobs and thus be expensive to perform. Likewise, departments with jobs requiring heavy I/O might pay less than departments that have stored a large amount of data but do not access it often, and that I/O (and the related network usage) can slow down other tenants on the cluster. Charging by storage also provides a subtle disincentive to making the cluster more useful for everyone. A modern distributed system becomes increasingly useful as more data is added to the cluster and shared across users. Charging the data owner for storage penalizes groups that share their data and rewards those that do not add much data to the cluster but instead use data others have brought.

Systems that attempt to capture other hardware resources beyond storage but are based on planned usage (for example, queue capacity) also suffer from both inaccuracy and perverse incentives. Actual usage can differ from planned usage, so the cost allocation might not reflect reality. Charging by theoretical utilization encourages groups to use up all of their allocated capacity because it costs them nothing extra to do so. (In fact, the group might be better off using up their allocation, thereby keeping their queue "full" and ensuring that other groups cannot use more than their share.) Developers have no incentive to reduce overall cluster load, for example by removing obsolete jobs or optimizing their code.
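To make the contrast with usage-based chargeback concrete, here is a minimal sketch that splits a cluster's cost across tenants in proportion to their measured consumption of several resources rather than a single one. The resource weights, usage figures, and tenant names are invented for illustration; real weights would be derived from actual hardware and operating costs.

    # Hypothetical relative cost weights per unit of each measured resource.
    WEIGHTS = {"cpu_core_hours": 0.05, "mem_gb_hours": 0.01,
               "disk_gb_days": 0.002, "network_gb": 0.01}

    def allocate_costs(usage_by_tenant, total_cluster_cost):
        """Split total_cluster_cost across tenants in proportion to a weighted
        sum of each tenant's measured (not planned) resource usage."""
        scores = {tenant: sum(WEIGHTS[r] * amount for r, amount in usage.items())
                  for tenant, usage in usage_by_tenant.items()}
        total_score = sum(scores.values()) or 1.0
        return {tenant: total_cluster_cost * score / total_score
                for tenant, score in scores.items()}

    usage = {
        "marketing": {"cpu_core_hours": 12000, "mem_gb_hours": 48000,
                      "disk_gb_days": 2000, "network_gb": 500},
        "research":  {"cpu_core_hours": 3000, "mem_gb_hours": 9000,
                      "disk_gb_days": 90000, "network_gb": 12000},
    }
    print(allocate_costs(usage, total_cluster_cost=100000.0))

Because the inputs are the same per-process hardware metrics the monitoring system already collects, compute-heavy and I/O-heavy tenants each pay in proportion to the load they actually place on the cluster, rather than for capacity they merely reserved.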
Summary

Multi-tenant distributed systems come with unique challenges in monitoring and troubleshooting, due both to their large scale and to the diversity of simultaneously running applications. Traditional monitoring systems fall short in both dimensions; open source tools such as Ganglia do not scale effectively to thousands of nodes, and most tools do not provide the process-level hardware usage information needed for operators to diagnose and fix problems.

Gathering, processing, and presenting such granular hardware usage data for large clusters is a significant challenge, in terms of both systems engineering (to process and store the data efficiently and in a scalable manner) and presentation-level logic and math (to present it usefully and accurately). One problem specific to multi-tenant distributed systems is the combinatorial explosion of unique time series arising from many users running many jobs on many nodes, often resulting in millions of new time series each day on a reasonably-sized cluster. Efficiently and correctly handling such a huge number of short-lived time series is a major challenge.

In addition to identifying and solving performance problems on a cluster, gathering detailed process-level hardware usage metrics allows operators to allocate the cost of the cluster across the different users, groups, or business units using the cluster. Basing cost allocation on actual usage of all resources is much more accurate than considering just a single resource (like disk storage space) or planned usage (like quota size).

Chapter 9. Conclusion: Performance Challenges and Solutions for Effective Multi-Tenant Distributed Systems

Organizations now have access to more data than ever before from an increasing number of sources, and big data has fundamentally changed the way all of that information is managed. The promise of big data is the ability to make sense of the many data sources by using real-time and ad hoc analysis to derive time-critical business insights, enabling organizations to become smarter about their customers, operations, and overall business. As volumes of business data increase, organizations are rapidly adopting distributed systems to store, manage, process, and serve big data for use in analytics, business intelligence, and decision support. Beyond the world of big data, the use of distributed systems for other kinds of applications has also grown dramatically over the past several decades. Just a few examples include physics simulations for automobile and aircraft design, computer graphics rendering, and climate simulation.

Unfortunately, fundamental performance limitations of distributed systems can prevent organizations from achieving the predictability and reliability needed to realize the promise of large-scale distributed applications in production, especially in multi-tenant and mixed-workload environments. The need to have predictable performance is more critical than ever before, because for most businesses, information is the competitive edge needed to survive in today's data-driven economy.

Computing resources even in the largest organizations are finite, and, as computational demands increase, bottlenecks can result in many areas of the distributed system environment.
The timely completion of a job requires a sufficient allocation of CPU, memory, disk, and network to every component of that job. As the number and complexity of jobs grow, so does the contention for these limited computing resources. Furthermore, the availability of individual computing resources can vary wildly over time, radically increasing the complexity of scheduling jobs. Business-visible symptoms of performance problems resulting from this complexity include underutilized or overutilized hardware (sometimes both at the same time), jobs completing late or unpredictably, and even cluster crashes.

In this book, we have explored the most critical areas of computing resource contention and some of the solutions that organizations are currently using to overcome these issues. Although manual performance tuning will always have a role, many challenges encountered in today's distributed system environments are far too complex to be solved by people power alone: a more sophisticated, software-based solution is necessary to achieve predictability and reliability.

The availability of real-time, intelligent software that dynamically reacts to and reshapes access to hardware resources allows organizations to regain control of their distributed systems. One example is the Pepperdata software solution, which provides organizations with the ability to increase utilization for distributed systems such as Hadoop by closely monitoring actual hardware usage and dynamically allowing more or fewer processes to be scheduled on a given node, based on the current and projected future hardware usage on that node. This type of solution also ensures that business-critical applications run reliably and predictably so that they meet their SLAs. Such intelligent software enables faster, better decision making and ultimately faster, better business results.

About the Authors

At Microsoft, Yahoo, and Inktomi, Chad Carson led teams using huge amounts of data to build web-scale products, including social search at Bing and sponsored search ranking and optimization at Yahoo. Before getting into web search, Chad worked on computer vision and image retrieval, earning a Ph.D. in EECS from UC Berkeley. Chad also holds Bachelor's degrees in History and Electrical Engineering from Rice University.

Sean Suchter was the founding General Manager of Microsoft's Silicon Valley Search Technology Center, where he led the integration of Facebook and Twitter content into Bing search. Prior to Microsoft, Sean managed the Yahoo Search Technology team, the first production user of Hadoop. Sean joined Yahoo through the acquisition of Inktomi, and holds a B.S. in Engineering and Applied Science from Caltech.