
Strata

Effective Multi-Tenant Distributed Systems
Challenges and Solutions when Running Complex Environments

Chad Carson and Sean Suchter

Effective Multi-Tenant Distributed Systems
by Chad Carson and Sean Suchter

Copyright © 2017 Pepperdata, Inc. All rights reserved.

Printed in the United States of America.

Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editor: Nicole Taché and Debbie Hardin
Production Editor: Nicholas Adams
Copyeditor: Octal Publishing, Inc.
Interior Designer: David Futato
Cover Designer: Randy Comer
Illustrator: Rebecca Demarest

October 2016: First Edition

Revision History for the First Edition
2016-10-10: First Release

The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. Effective Multi-Tenant Distributed Systems, the cover image, and related trade dress are trademarks of O'Reilly Media, Inc.

While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-491-96183-4

[LSI]

Chapter 1. Introduction to Multi-Tenant Distributed Systems

The Benefits of Distributed Systems

The past few decades have seen an explosion of computing power. Search engines, social networks, cloud-based storage and computing, and similar services now make seemingly infinite amounts of information and computation available to users across the globe. The tremendous scale of these services would not be possible without distributed systems.

Distributed systems make it possible for many hundreds or thousands of relatively inexpensive computers to communicate with one another and work together, creating the outward appearance of a single, high-powered computer. The primary benefit of a distributed system is clear: the ability to massively scale computing power relatively inexpensively, enabling organizations to scale up their businesses to a global level in a way that was not possible even a decade ago.

Performance Problems in Distributed Systems

As more and more nodes are added to the distributed system and interact with one another, and as more and more developers write and run applications on the system, complications arise. Operators of distributed systems must address an array of challenges that affect the performance of the system as a whole as well as individual applications' performance.

These performance challenges are different from those faced when operating a data center of computers that are running more or less independently, such as a web server farm. In a true distributed system, applications are split into smaller units of work, which are spread across many nodes and communicate with one another either directly or via shared input/output data.
Additional performance challenges arise with multi-tenant distributed systems, in which different users, groups, and possibly business units run different applications on the same cluster. (This is in contrast to a single, large distributed application, such as a search engine, which is quite complex and has intertask dependencies but is still just one overall application.) These challenges that come with multitenancy result from the diversity of applications running together on any node, as well as the fact that the applications are written by many different developers instead of one engineering team focused on ensuring that everything in a single distributed application works well together.

Scheduling

One of the primary challenges in a distributed system is in scheduling jobs and their component processes. Computing power might be quite large, but it is always finite, and the distributed system must decide which jobs should be scheduled to run where and when, and the relative priority of those jobs. Even sophisticated distributed-system schedulers have limitations that can lead to underutilization of cluster hardware, unpredictable job run times, or both. Examples include assuming the worst-case resource usage to avoid overcommitting, failing to plan for different resource types across different applications, and overlooking one or more dependencies, thus causing deadlock or starvation.

The scheduling challenges become more severe on multi-tenant clusters, which add fairness of resource access among users as a scheduling goal, in addition to (and often in conflict with) the goals of high overall hardware utilization and predictable run times for high-priority applications. Aside from the challenge of balancing utilization and fairness, in some extreme cases the scheduler might go too far in trying to ensure fairness, scheduling just a few tasks from many jobs for many users at once. This can result in latency for every job on the cluster and cause the cluster to use resources inefficiently because the system is trying to do too many disparate things at the same time.

Hardware Bottlenecks

Beyond scheduling challenges, there are many ways a distributed system can suffer from hardware bottlenecks and other inefficiencies. For example, a single job can saturate the network or disk I/O, slowing down every other job. These potential problems are only exacerbated in a multi-tenant environment—usage of a given hardware resource such as CPU or disk is often less efficient when a node has many different processes running on it. In addition, operators cannot tune the cluster for a particular access pattern, because the access patterns are both diverse and constantly changing. (Again, contrast this situation with a farm of servers, each of which is independently running a single application, or a large cluster running a single coherently designed and tuned application like a search engine.)
Distributed systems are also subject to performance problems due to bottlenecks from centralized services used by every node in the system. One common example is the master node performing job admission and scheduling; others include the master node for a distributed file system storing data for the cluster, as well as common services like domain name system (DNS) servers.

These potential performance challenges are exacerbated by the fact that a primary design goal for many modern distributed systems is to enable large numbers of developers, data scientists, and analysts to use the system simultaneously. This is in stark contrast to earlier distributed systems such as high-performance computing (HPC) systems, in which the only people who could write programs to run on the cluster had a systems programming background. Today, distributed systems are opening up enormous computing power to people without a systems background, so they often don't understand or even think about system performance. Such a user might easily write a job that accidentally brings a cluster to its knees, affecting every other job and user.

Lack of Visibility Within Multi-Tenant Distributed Systems

Because multi-tenant distributed systems simultaneously run many applications, each with different performance characteristics and written by different developers, it can be difficult to determine what's going on with the system, whether (and why) there's a problem, which users and applications are the cause of any problem, and what to do about such problems.

Traditional cluster monitoring systems are generally limited to tracking metrics at the node level; they lack visibility into detailed hardware usage by each process. Major blind spots can result—when there's a performance problem, operators are unable to pinpoint exactly which application caused it, or what to do about it. Similarly, application-level monitoring systems tend to focus on overall application semantics (overall run times, data volumes, etc.) and do not drill down to performance-level metrics for the actual hardware resources on each node that is running a part of the application.
Truly useful monitoring for multi-tenant distributed systems must track hardware usage metrics at a sufficient level of granularity for each interesting process on each node. Gathering, processing, and presenting this data for large clusters is a significant challenge, in terms of both systems engineering (to process and store the data efficiently and in a scalable fashion) and the presentation-level logic and math (to present it usefully and accurately). Even for limited, node-level metrics, traditional monitoring systems do not scale well on large clusters of hundreds to thousands of nodes.

The Impact on Business from Performance Problems

The performance challenges described in this book can easily lead to business impacts such as the following:

Inconsistent, unpredictable application run times
Batch jobs might run late, interactive applications might respond slowly, and the ingestion and processing of new incoming data for use by other applications might be delayed.

Underutilized hardware
Job queues can appear full even when the cluster hardware is not running at full capacity. This inefficiency can result in higher capital and operating expenses; it can also result in significant delays for new projects due to insufficient hardware, or even the need to build out new datacenter space to add new machines for additional processing power.

Cluster instability
In extreme cases, nodes can become unresponsive or a distributed file system (DFS) might become overloaded, so applications cannot run or are significantly delayed in accessing data.

Aside from these obvious effects, performance problems also cause businesses to suffer in subtler but ultimately more significant ways. Organizations might informally "learn" that a multi-tenant cluster is unpredictable and build implicit or explicit processes to work around the unpredictability, such as the following:

• Limit cluster access to a subset of developers or analysts, out of a concern that poorly written jobs will slow down or even crash the cluster for everyone.

• Build separate clusters for different groups or different workloads so that the most important applications are insulated from others. Doing so increases overall cost due to inefficiency in resource usage, adds operational overhead and cost, and reduces the ability to share data across groups.

• Set up "development" and "production" clusters, with a committee or other cumbersome process to approve jobs before they can be run on a production cluster. Adding these hurdles can dramatically hinder innovation, because they significantly slow the feedback loop of learning from production data, building and testing a new model or new feature, deploying it to production, and learning again.[1]

These responses to unpredictable performance can limit a business's ability to fully benefit from the potential of distributed systems. Eliminating performance problems on the cluster can improve performance of the business overall.

Scope of This Book

In this book, we consider the performance challenges that arise from scheduling inefficiencies, hardware bottlenecks, and lack of visibility. We examine each problem in detail and present solutions that organizations use today to overcome these challenges and benefit from the tremendous scale and efficiency of distributed systems.

Hadoop: An Example Distributed System

This book uses Hadoop as an example of a multi-tenant distributed system.
Hadoop serves as an ideal example of such a system because of its broad adoption across a variety of industries, from healthcare to finance to transportation. Due to its open source availability and a robust ecosystem of supporting applications, Hadoop's adoption is increasing among small and large organizations alike. Hadoop is also an ideal example because it is used in highly multi-tenant production deployments (running jobs from many hundreds of developers) and is often used to simultaneously run large batch jobs, real-time stream processing, interactive analysis, and customer-facing databases. As a result, it suffers from all of the performance challenges described herein.

Of course, Hadoop is not the only important distributed system; a few other examples include the following:[2]

• Classic HPC clusters using MPI, TORQUE, and Moab
• Distributed databases such as Oracle RAC, Teradata, Cassandra, and MongoDB
• Render farms used for animation
• Simulation systems used for physics and manufacturing

Terminology

Throughout the book, we use the following sets of terms interchangeably:

Application or job
A program submitted by a particular user to be run on a distributed system. (In some systems, this might be termed a query.)

Container or task
An atomic unit of work that is part of a job. This work is done on a single node, generally running as a single (sometimes multithreaded) process on the node.

Host, machine, or node
A single computing node, which can be an actual physical computer or a virtual machine.

[1] We saw an example of the benefits of having an extremely short feedback loop at Yahoo in 2006–2007, when the sponsored search R&D team was an early user of the very first production Hadoop cluster anywhere. By moving to Hadoop and being able to deploy new click prediction models directly into production, we increased the number of simultaneous experiments by five times or more and reduced the feedback loop time by a similar factor. As a result, our models could improve an order of magnitude faster, and the revenue gains from those improvements similarly compounded that much faster.

[2] Various distributed systems are designed to make different tradeoffs among Consistency, Availability, and Partition tolerance. For more information, see Gilbert, Seth, and Nancy Ann Lynch, "Perspectives on the CAP Theorem," Institute of Electrical and Electronics Engineers, 2012 (http://hdl.handle.net/1721.1/79112) and https://www.infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed.

Chapter 8. Monitoring Performance: Challenges and Solutions

Introduction

System design and tuning aren't the only aspects of multi-tenant distributed systems that require different treatment than traditional single-node systems or data centers of machines working independently. Monitoring (detection and diagnosis of problems) is also fundamentally different for distributed systems, especially multi-tenant systems for which the nature of the workload can change dramatically over time.

Traditional system administration makes use of a variety of tools for understanding the performance of and debugging problems on a single node, such as the following in Linux:

top
Displays a regularly updated page of current information about hardware use, both for the node as a whole and per-process, focusing on CPU and memory usage.

iotop
Similar to top but reports on disk I/O.

iostat
Generates a report on CPU statistics and input/output statistics for devices, partitions, and network file systems.

ss and ip
Report on network information for a node, such as sockets, connections, routing, and devices.
sar
Regularly collects and reports on a wide variety of system metrics for the node overall.

The /proc file system
A virtual file system that provides a convenient and structured way to access process data stored in the kernel's internal data structures. (A short sketch of reading these per-process counters appears at the end of this section.)

These tools, along with log files from a machine, are generally used after an operator has identified a particular machine as having slow performance or instability and wants to dig deeper into the processes currently running on the machine. The operator will often be notified about the problematic machine by a monitoring system that either sends test queries to each machine in a data center or gathers node-level system metrics.

True distributed systems are different, in part because their jobs and applications are composed of many processes running on many nodes and working together or sharing data. In addition, the individual nodes can appear to be homogeneous but actually have different performance characteristics; this is common in virtualized environments. Finally, large multi-tenant distributed systems have many millions of tasks starting and finishing each day; this massive increase in the number of unique time series to track makes the monitoring problem significantly more challenging from both a systems point of view and a logical point of view.

Why Monitor?

Distributed system monitoring is intended to identify problems and performance issues at various levels of the system, such as the following:

Computing hardware
Is a specific node performing as expected? Are there broken disks, network cards, memory chips, and so on?

Network
Are any switches or other network components having problems? Are firewalls and network address translation ([NAT], which is generally implemented by routers) performing as expected?

Distributed system fabric
Is the software that manages the system and runs workloads (for example, Hadoop, HDFS, or a distributed database) behaving as expected? Are applications receiving adequate resources? If not, are there problems with the underlying software (for example, in the scheduler configuration), or is the system working at capacity but overloaded? What is the resource usage breakdown across applications, users, business units, and so on?

Applications
Is the application software running as expected? Are there crashes due to coding bugs, data unavailability, configuration problems, or other sources? Are there performance problems in the way the application is written or configured?
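To make the per-process side of this visibility concrete, here is a minimal sketch that reads a few basic hardware counters for one process directly from /proc on Linux. It is an illustration only, not part of any monitoring product described in this book; the choice of metrics, the sampling interval, and the function names are our own, while the /proc file layouts follow the standard proc(5) documentation.

```python
import os
import time

CLK_TCK = os.sysconf("SC_CLK_TCK")       # kernel clock ticks per second
PAGE_SIZE = os.sysconf("SC_PAGE_SIZE")   # bytes per memory page


def sample_process(pid):
    """Return CPU seconds, resident memory, and disk I/O bytes for one process."""
    # /proc/<pid>/stat: fields 14 and 15 are utime and stime (in clock ticks).
    # The command name (field 2) may contain spaces, so split after the ')'.
    with open(f"/proc/{pid}/stat") as f:
        fields = f.read().rsplit(")", 1)[1].split()
    cpu_seconds = (int(fields[11]) + int(fields[12])) / CLK_TCK

    # /proc/<pid>/statm: the second field is the resident set size in pages.
    with open(f"/proc/{pid}/statm") as f:
        rss_bytes = int(f.read().split()[1]) * PAGE_SIZE

    # /proc/<pid>/io: bytes actually read from / written to storage.
    # Readable only for your own processes unless running as root.
    io = {}
    with open(f"/proc/{pid}/io") as f:
        for line in f:
            key, _, value = line.partition(":")
            io[key] = int(value)

    return {
        "cpu_seconds": cpu_seconds,
        "rss_bytes": rss_bytes,
        "read_bytes": io.get("read_bytes", 0),
        "write_bytes": io.get("write_bytes", 0),
    }


if __name__ == "__main__":
    pid = os.getpid()                 # sample ourselves as a demo
    before = sample_process(pid)
    time.sleep(5)                     # sampling interval (arbitrary choice)
    after = sample_process(pid)
    print({k: after[k] - before[k] for k in before})  # deltas over the interval
```

A real collection agent would sample every user task the distributed system has launched on the node, tag each sample with the job and user it belongs to, and ship the results to a central store; these are the same kernel counters that top and iotop read.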
Similarly, a metrics system should support monitoring and diagnosis at various points in time. Examples of such monitoring are as follows:

Real-time
At the moment the cluster is having a problem, help the operator quickly detect, diagnose, and fix it.

Retroactively
Diagnose why a problem happened in the past, and identify performance bottlenecks or suboptimal use of resources that should be improved.

Proactively
Identify and fix problems (automatically or with human intervention) before they affect the user.

Long-term
Track historical trends in terms of usage to help with workload scheduling and capacity planning.

What to Monitor

For a monitoring system to provide sufficient information about past, current, and potential future performance problems, it must capture multiple types of metrics for a distributed system, and then store and track them over time. Examples of the types of metrics to capture include the following:

Node-level hardware metrics for every node in the system
As described in the preceding chapters, each of the basic hardware resources (CPU, RAM, disk storage and access, and network) has its own set of metrics that indicate performance; for example, disk access metrics should be captured for each device, including metrics such as read and write bytes per second, number of I/O operations per second (IOPS), and disk service time, among others.

Process-level hardware metrics for every user process on every node
Whereas capturing these metrics for every system process on the node would be both unnecessary and infeasible, capturing them for every user task/application (i.e., every workload the distributed system has launched on that node on behalf of users) is important to understand the performance of each application and the interactions with other applications and bottlenecks that might be affecting it. (Figure 8-1 shows an example of process-level hardware metrics alongside node-level metrics.)
Application-specific metrics for every user process on every node
Some user processes generate their own metrics related to the semantics of the application, and those should be captured as well. They can be interesting and useful to developers and operators in their own right, for example in tracking progress for a workflow or measuring changes over subsequent runs in statistics about the data processed or generated. In addition, correlating them in time with hardware metrics can help inform developers and operators about the performance impact of the application on the distributed system as a whole, or performance impacts on the application due to other applications.

Figure 8-1. Sample screenshot from the Pepperdata Dashboard displaying both node-level and process-level metrics. Note that the various jobs' contribution to hardware usage is different for different hardware resources. (Source: Pepperdata)

Process-level hardware metrics and semantic metrics for distributed system framework components
Daemons running on every node affect the performance of applications running on that node and others, both because they require system resources to run and because they provide common services to user applications. In the case of Hadoop, for example, the DataNode daemon on each worker node can use significant disk and network resources; it can also act as a bottleneck for applications because they access HDFS data via the DataNode process.

Performance and semantic metrics for centralized services running on other nodes
As described in Chapter 7, centralized services can act as bottlenecks for the distributed system as a whole. For example, Hadoop performance can be significantly degraded by poor performance of the ResourceManager or NameNode. These centralized services also generate metrics that provide visibility into requests made by applications running on individual worker nodes.

System and application log files are occasionally used for monitoring, but they are much more commonly used for troubleshooting application problems, such as coding bugs, broken system or data dependencies, and related issues. (Log files can be used for alerting in cases where the user who needs to set up alerts cannot change the application itself to add metrics to its output but can access the log files; a minimal sketch of this approach follows.)
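As a small sketch of that kind of log-based alerting, the fragment below tails a log file and fires an alert when a pattern matches. The log path, the pattern, and the notify_operator function are invented for the example; a production system would also handle log rotation, batching, and alert routing.

```python
import re
import time

LOG_PATH = "/var/log/myapp/application.log"   # hypothetical log file
PATTERN = re.compile(r"ERROR|Exception|Too many open files")


def notify_operator(line):
    # Placeholder: a real system would send email, page someone, or open a ticket.
    print(f"ALERT: {line.rstrip()}")


def follow(path):
    """Yield new lines appended to the file, like `tail -f` (ignores rotation)."""
    with open(path) as f:
        f.seek(0, 2)                  # start at the end of the file
        while True:
            line = f.readline()
            if not line:
                time.sleep(1.0)       # poll interval
                continue
            yield line


if __name__ == "__main__":
    for line in follow(LOG_PATH):
        if PATTERN.search(line):
            notify_operator(line)
```

Even this toy version hints at why log-based analysis can be costly: every line of every log must be read and matched against patterns.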
Log-based analysis can be expensive, partly due to the volumes of data involved and partly because such analysis requires detecting patterns in large, free-form log files.

Developers and operators must understand the nuances of the specific metrics that are being recorded. This point might seem obvious, but there can be surprising differences between similarly named metrics on different operating systems. For example, "load average" on Linux systems includes processes waiting for resources such as disk reads/writes, whereas on some other Unix systems, it includes only processes waiting for the CPU.[1]

Systems and Performance Aspects of Monitoring

The large number of metrics to capture, the large number of nodes and processes to track, and the frequency of gathering data all mean that for a distributed system of interesting size, metrics storage itself is a distributed systems problem.

When designing a system to gather, store, and present monitoring data, it's important to consider what kinds of queries must be supported (both for visualization and alerting) and what performance levels are required for those queries. For example, a high-level dashboard needs to respond in just a second or two; drilling down to diagnose issues could take a few seconds; generating long-term trend reports could be done offline and take many seconds. The data structures and system architecture must reflect those requirements. In addition, building the monitoring system will involve tradeoffs among the level of metrics granularity supported, the amount of history to store, the performance of interactive queries, and the size and hardware cost of the system.

Most metrics used for monitoring are inherently time-series data, so they must be gathered and stored as time series. They will often be presented to an operator as time-series data, but they can also be aggregated over time; for example, reporting the total disk IOPS by a process for the entire lifetime of the process. They can also be aggregated across processes or nodes; for example, reporting the total memory used by all of an application's tasks at a particular time. (In some cases, both kinds of aggregation will be done; for example, reporting the total network bytes written by all tasks of an application for the entire runtime of that application.)

Handling Huge Amounts of Metrics Data

In an ideal world, these metrics would be captured at extremely high frequency, for example every metric for every process every few hundred milliseconds. The real world, of course, has tradeoffs. The act of collecting the data can itself affect the performance of the node. (For example, running top and iotop on every node continuously and storing the result every second would capture useful data but would also add significant CPU load on the nodes.)
Storing, aggregating, and presenting data also has a cost, which increases with the number of data points collected. In practice, metrics systems can use different time granularities for different types of metrics—for example, operating system versions change very rarely; disk space usage changes significantly only over the course of many minutes, hours, or days; CPU usage changes many times a second.

To reduce the cost of storage, monitoring systems often aggregate older data and discard the original raw data after some time period. The reduction can be done in the time domain (by downsampling), in the level of detail (by aggregating across nodes or across processes), or both. Offline aggregation across processes and nodes also reduces the computational cost and latency at presentation time when users query the monitoring system; in fact, it might be required to enable interactive use.

NOTE
Aggregation only helps at query time; it does not mitigate any performance impact on the distributed system nodes when data is collected, or the initial cost to the monitoring system of processing new data.

When downsampling in time or aggregating across processes and nodes, it is important to consider some subtle side effects of data reduction—in particular, spikes in time or unusually high usage on one host are lost when data is smoothed by downsampling or aggregation, and it is often exactly these spikes that indicate both the symptoms and causes of performance bottlenecks. In many cases, keeping both averages/sums (which accurately reflect total usage) and min/max/percentiles (e.g., 95th percentile usage, which reflect spikes) is a way to reduce the metrics blindness that can result from naive downsampling and aggregation.

WARNING
After a percentile is calculated, further downsampling is dangerous; downsampling a series of percentiles gives a wildly incorrect result. Computing percentiles requires information about the full distribution of values, and summary statistics like mean/min/max/total/count/percentiles do not provide sufficient information to compute a new percentile value when multiple time periods are combined.[2]

One challenge specific to metrics data for multi-tenant distributed systems is the large number of unique time series generated. Traditional data centers, even very large ones, can have thousands of nodes and just one or a few long-lived applications on each node, with tens of metrics per node, resulting in tens of thousands of time series. The resulting time series generally last for days or weeks, because nodes are added or removed, and long-running applications are moved across nodes, only rarely. In contrast, a distributed system like Hadoop has many orders of magnitude more unique time series—easily over a billion per year.[3] (The number of time series can grow even larger if developers are allowed to generate their own application-specific metrics, which increases the data volume.) Most systems designed to store large numbers of time series cannot scale anywhere near this number of unique series.

Reliability of the Monitoring System

Operators must also ensure that the monitoring system itself is robust and reliable. Of course, operators should not inject metrics back into the system they are tracking—but more than that, they should ensure that the monitoring system is as resilient as possible to a problem with the distributed system being monitored. For example, if a network outage occurs in the distributed system, what happens to the monitoring system?
While real-time data cannot be gathered and ingested into the monitoring system during a network outage, the operator should still be able to access the monitoring system, at least to see that a problem occurred with the primary system. In addition, metrics should continue to be collected on the distributed system nodes so that they can later be gathered and ingested, rather than lost completely. On a related note, the monitoring system itself should have some basic monitoring—at least enough to detect if monitoring is down or data is no longer being collected.

Some Commonly Used Monitoring Systems

One system used by many companies for cluster monitoring is Ganglia, which collects metrics data from individual nodes. Ganglia uses a tree structure among the nodes in the distributed system to gather and centralize data from all of the nodes. It then stores that data in a storage system called RRDtool. RRDtool, which is also used by several other monitoring systems, has been optimized for efficient data writes. It uses a predefined amount of storage for all data going back in time, but it stays within this limit by keeping an increasingly downsampled amount of history for older data; as a result, it rapidly loses granularity going back in time. Both Ganglia and RRDtool suffer from scalability challenges on reasonably sized clusters and when storing large numbers of metrics. For example, because RRDtool performs data aggregation primarily at query time, viewing data across nodes or across processes can be prohibitively slow.

Another time-series data storage system is OpenTSDB, which is designed to store large numbers of time series. OpenTSDB is generally used as just one component of a broader monitoring system, in part because its user interface is limited,[4] and it does not natively support important features like offline aggregation of time series; it performs all aggregations at query time, which limits performance and scalability. (An example of a system that uses OpenTSDB as a component is the Pepperdata Dashboard, which uses several components both upstream and downstream of OpenTSDB to perform a range of optimizations at data ingestion and query time, along with providing a user interface designed for exploration and troubleshooting of very large multi-tenant clusters.[5] See Figure 8-2.)
Figure 8-2. Architecture of the Pepperdata Dashboard, which utilizes OpenTSDB along with other components (for example, servers performing ingestion-time and offline aggregation) to improve performance

Algorithmic and Logical Aspects of Monitoring

Operators want to identify and be notified about anomalies—cases when the distributed system is behaving outside the expected performance range. Some common examples include spikes in disk or network usage, thrashing, and high application latency (which could indicate a problem with that particular application or the system as a whole). These anomalies generally fall into one of two categories:

• Outliers at one point in time, when one node or application is behaving very differently from others at that time, which often indicates a hotspot or hardware problem. It can be straightforward to automatically detect such outliers for a single metric across similar nodes, other than requiring some tuning.[6] In interactive use, operators might examine tables showing many similar items (for example, all nodes or all jobs) and quickly sort them to find the items with the highest or lowest values for a particular metric.

• Changes over time, such that the entire system is behaving differently from the way it normally does. For this kind of anomaly, it's important to understand the normal variation over time, including periodic effects. For example, if jobs are taking longer to complete than usual, is that because cluster usage always spikes on Mondays, or it's now the end of the quarter and groups are generating extra reports, or a similar "organic" cause—or is it a true anomaly?

When using a monitoring system for alerting (active notification to operators about a problem, for example by sending an email or text message), users should consider the following two common types of alerts:

• If a metric's current value exceeds a threshold or returns an error right now, or is automatically detected as an anomaly, send an alert.

• If a metric has been in one of the above states for some time (not just momentarily), send an alert. This type of alert is useful if a metric tends to spike transiently during the normal course of operation, or if operators only want to be notified of problems that are ongoing and likely require manual intervention. (Note that this kind of alert is much harder to support from a systems perspective, because the alerting system needs to store history about the recent states of the metric rather than just running one query about a single point in time.)

Challenges Specific to Multi-Tenant Distributed Systems

Just as monitoring multi-tenant distributed systems requires scaling to handle the huge number of unique time series from a constant stream of new processes, there are challenges due to the behavior of such time series, especially near the beginning and end of a series. In a traditional system, time series are added or removed only occasionally, and the beginning or end of a time series is a relatively unimportant corner case. In contrast, for multi-tenant distributed systems, time series start and end points are the normal case, and the math used for aggregation, time downsampling, and presentation must correctly handle periods when a time series existed and periods when it did not.

Similarly, some metrics that operators commonly look at for a single machine can be counterintuitive across a cluster. For example, does a 100-node cluster have 10,000 percent of CPU available?
Or instead, if the metrics system normalizes percentages across the cluster, and an idle node drops out, should the reported CPU usage percentage increase? Even more complicated is the case of a heterogeneous cluster, where some nodes are more powerful than others, in which case 100 percent of CPU or RAM on one node is not the same as 100 percent of CPU or RAM on another node. A single correct approach to aggregating data across nodes unfortunately does not exist. Whatever the approach, the operator must easily understand the meaning of each aggregate metric.

Some kinds of metrics make sense only on distributed systems and can give valuable insights into performance. For example, a distributed file system might suffer from hotspots; that is, one node or one file is being accessed by many jobs at one time. Gathering metrics about CPU, disk, and network access for each node (or file or rack) and then comparing across nodes can help identify hotspots; in the case of a hotspot, an operator could change the replication for specific files or take other steps to rebalance the distributed file system usage.

Capturing and presenting metrics at various levels of detail—for the overall distributed system, specific nodes, and specific applications—can be critical in detecting and diagnosing problems specific to multi-tenant systems, such as the following examples:

• Low-priority work in one queue causing the entire system to back up. We recently encountered a situation in which a developer launched a Spark cluster on top of a YARN cluster and ran an interactive shell in Spark. He then went to lunch and left the Spark cluster running, holding cluster resources without doing any useful work.

• Multiple users submitting jobs through a common framework (or using copy-and-paste code) that was accidentally misconfigured to run all jobs on a single node, causing a hotspot.

Measuring the Effect of Attempted Improvements

When adjusting any "knob" on a system (whether directly changing configurations or using software like Pepperdata to adjust hardware usage dynamically), the effect of the adjustment must be measured. This can be a challenge even for a single machine running a few applications, but it is immensely more difficult for a multi-tenant distributed system that might be running hundreds of applications at once, with many thousands of processes across thousands of nodes. Several factors combine to create this complexity:

• Even a single application's performance can be difficult to measure, because modern distributed applications are composed of workflows (for example, a job consists of a series of map-shuffle-reduce stages, and many such jobs might be chained together in a directed acyclic graph [DAG], with complex dependencies among the stages). Reducing the run time of one particular process might not improve overall performance.

• The environment itself can be unfathomably complex because of the wide variety of applications, hardware usage profiles, and types of work involved.

• The environment changes constantly.

• Experimentation with an application might not yield useful results, because the background noise from the other applications running at the same time can drown out any signal from the experiment itself.

When operators are making changes to improve performance, it's important to have a monitoring system that presents both system-level and application-level metrics at the same time and allows for comparison.

Allocating Cluster Costs Across Tenants

Related to monitoring is reporting on the cost of a cluster and allocating that cost across the various tenants, such as business units.
This type of reporting is often called chargeback (or showback, in the common case of reporting on usage but not actually charging departments for it). Traditional chargeback reports for distributed systems are generally based on disk storage used or on queue capacity planned/reserved, both of which have limitations. Reports that incorporate all aspects of actual system usage are much more valuable.

Using disk storage as a metric for chargeback reports captures only one aspect of hardware usage, which is generally a small fraction of the total cost of the cluster. Compute-intensive jobs with low data usage would get a free ride, even though they can significantly interfere with other jobs and thus be expensive to perform. Likewise, departments with jobs requiring heavy I/O might pay less than departments that have stored a large amount of data but do not access it often, and that I/O (and the related network usage) can slow down other tenants on the cluster.

Charging by storage also provides a subtle disincentive to making the cluster more useful for everyone. A modern distributed system becomes increasingly useful as more data is added to the cluster and shared across users. Charging the data owner for storage penalizes groups that share their data and rewards those that do not add much data to the cluster but instead use data others have brought.

Systems that attempt to capture other hardware resources beyond storage but are based on planned usage (for example, queue capacity) also suffer from both inaccuracy and perverse incentives. Actual usage can differ from planned usage, so the cost allocation might not reflect reality. Charging by theoretical utilization encourages groups to use up all of their allocated capacity, because it costs them nothing extra to do so. (In fact, the group might be better off using up their allocation, thereby keeping their queue "full" and ensuring that other groups cannot use more than their share.)
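To illustrate what allocation based on measured usage of several resources can look like, here is a small sketch. The resource categories, cost weights, tenant names, and usage numbers are all invented for the example; in practice the weights would come from the actual cost structure of the cluster, and the usage figures from the process-level metrics described earlier in this chapter.

```python
# Hypothetical monthly cluster cost, split across resource categories.
# The split (weights) is an assumption for illustration, not a recommendation.
MONTHLY_COST = 100_000.0  # dollars
RESOURCE_WEIGHTS = {
    "cpu_core_hours": 0.40,
    "ram_gb_hours": 0.30,
    "disk_io_tb": 0.20,
    "network_tb": 0.10,
}

# Measured usage per tenant over the month (invented numbers).
usage = {
    "marketing": {"cpu_core_hours": 50_000, "ram_gb_hours": 200_000, "disk_io_tb": 40, "network_tb": 10},
    "analytics": {"cpu_core_hours": 150_000, "ram_gb_hours": 500_000, "disk_io_tb": 300, "network_tb": 60},
    "data_eng":  {"cpu_core_hours": 100_000, "ram_gb_hours": 300_000, "disk_io_tb": 160, "network_tb": 30},
}


def allocate(monthly_cost, weights, usage_by_tenant):
    """Split each resource's share of the cost in proportion to measured usage."""
    totals = {r: sum(t[r] for t in usage_by_tenant.values()) for r in weights}
    bill = {tenant: 0.0 for tenant in usage_by_tenant}
    for resource, weight in weights.items():
        pool = monthly_cost * weight
        for tenant, used in usage_by_tenant.items():
            if totals[resource] > 0:
                bill[tenant] += pool * used[resource] / totals[resource]
    return bill


if __name__ == "__main__":
    for tenant, cost in allocate(MONTHLY_COST, RESOURCE_WEIGHTS, usage).items():
        print(f"{tenant:10s} ${cost:,.2f}")
```

Unlike storage-only or quota-based schemes, a tenant's bill in this sketch rises only when it actually consumes CPU, memory, I/O, or network, so idle reservations cost nothing extra and heavy I/O no longer rides for free.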
A further problem with planned-usage charging is that developers have no incentive to reduce overall cluster load, for example by removing obsolete jobs or optimizing their code.

Summary

Multi-tenant distributed systems come with unique challenges in monitoring and troubleshooting, due both to their large scale and to the diversity of simultaneously running applications. Traditional monitoring systems fall short in both dimensions; open source tools such as Ganglia do not scale effectively to thousands of nodes, and most tools do not provide the process-level hardware usage information needed for operators to diagnose and fix problems. Gathering, processing, and presenting such granular hardware usage data for large clusters is a significant challenge, in terms of both systems engineering (to process and store the data efficiently and in a scalable manner) and presentation-level logic and math (to present it usefully and accurately).

One problem specific to multi-tenant distributed systems is the combinatorial explosion of unique time series arising from many users running many jobs on many nodes, often resulting in millions of new time series each day on a reasonably sized cluster. Efficiently and correctly handling such a huge number of short-lived time series is a major challenge.

In addition to identifying and solving performance problems on a cluster, gathering detailed process-level hardware usage metrics allows operators to allocate the cost of the cluster across the different users, groups, or business units using the cluster. Basing cost allocation on actual usage of all resources is much more accurate than considering just a single resource (like disk storage space) or planned usage (like quota size).

[1] See https://en.wikipedia.org/wiki/Load_(computing) and http://www.howtogeek.com/194642/understanding-the-load-average-on-linux-and-other-unix-like-systems/.

[2] See, for example, http://latencytipoftheday.blogspot.com/2014/06/latencytipoftheday-you-cant-average.html for more discussion.

[3] For example, a 1,000-node cluster with each node running 20 tasks at a time and each task lasting an average of 15 minutes would have nearly two million tasks per day; even gathering just ten metrics per task would result in 20 million unique time series per day, or more than seven billion unique time series per year.

[4] See http://opentsdb.net/docs/build/html/user_guide/guis/index.html.

[5] For some details of the optimizations in the Pepperdata Dashboard, see http://www.slideshare.net/Beeks06/a-billion-points-of-data-pressure and http://pepperdata.com/2014/06/a-billion-points-of-data-pressure/.

[6] One standard heuristic is to compute the distribution of values for a metric across nodes and identify cases when metrics for one or a few nodes are more than a fixed multiple of the interquartile range. For details on that approach and others, see https://en.wikipedia.org/wiki/Interquartile_range and https://en.wikipedia.org/wiki/Outlier.

Chapter 9. Conclusion: Performance Challenges and Solutions for Effective Multi-Tenant Distributed Systems

Organizations now have access to more data than ever before from an increasing number of sources, and big data has fundamentally changed the way all of that information is managed. The promise of big data is the ability to make sense of the many data sources by using real-time and ad hoc analysis to derive time-critical business insights, enabling organizations to become smarter about their customers, operations, and overall business. As volumes of business data increase, organizations are rapidly adopting distributed systems to store, manage, process, and serve big data for use in analytics, business intelligence, and decision support.
Beyond the world of big data, the use of distributed systems for other kinds of applications has also grown dramatically over the past several decades. Just a few examples include physics simulations for automobile and aircraft design, computer graphics rendering, and climate simulation.

Unfortunately, fundamental performance limitations of distributed systems can prevent organizations from achieving the predictability and reliability needed to realize the promise of large-scale distributed applications in production, especially in multi-tenant and mixed-workload environments. The need to have predictable performance is more critical than ever before, because for most businesses, information is the competitive edge needed to survive in today's data-driven economy.

Computing resources even in the largest organizations are finite, and, as computational demands increase, bottlenecks can result in many areas of the distributed system environment. The timely completion of a job requires a sufficient allocation of CPU, memory, disk, and network to every component of that job. As the number and complexity of jobs grow, so does the contention for these limited computing resources. Furthermore, the availability of individual computing resources can vary wildly over time, radically increasing the complexity of scheduling jobs. Business-visible symptoms of performance problems resulting from this complexity include underutilized or overutilized hardware (sometimes both at the same time), jobs completing late or unpredictably, and even cluster crashes.

In this book, we have explored the most critical areas of computing resource contention and some of the solutions that organizations are currently using to overcome these issues. Although manual performance tuning will always have a role, many challenges encountered in today's distributed system environments are far too complex to be solved by people power alone: a more sophisticated, software-based solution is necessary to achieve predictability and reliability.

The availability of real-time, intelligent software that dynamically reacts to and reshapes access to hardware resources allows organizations to regain control of their distributed systems. One example is the Pepperdata software solution, which provides organizations with the ability to increase utilization for distributed systems such as Hadoop by closely monitoring actual hardware usage and dynamically allowing more or fewer processes to be scheduled on a given node, based on the current and projected future hardware usage on that node. This type of solution also ensures that business-critical applications run reliably and predictably so that they meet their SLAs. Such intelligent software enables faster, better decision making and ultimately faster, better business results.

About the Authors

At Microsoft, Yahoo, and Inktomi, Chad Carson led teams using huge amounts of data to build web-scale products, including social search at Bing and sponsored search ranking and optimization at Yahoo. Before getting into web search, Chad worked on computer vision and image retrieval, earning a Ph.D. in EECS from UC Berkeley. Chad also holds Bachelor's degrees in History and Electrical Engineering from Rice University.

Sean Suchter was the founding General Manager of Microsoft's Silicon Valley Search Technology Center, where he led the integration of Facebook and Twitter content into Bing search. Prior to Microsoft, Sean managed the Yahoo Search Technology team, the first production user of Hadoop.
Sean joined Yahoo through the acquisition of Inktomi, and holds a B.S. in Engineering and Applied Science from Caltech.

Contents

• 1. Introduction to Multi-Tenant Distributed Systems
  • The Benefits of Distributed Systems
  • Performance Problems in Distributed Systems
    • Scheduling
    • Hardware Bottlenecks
    • Lack of Visibility Within Multi-Tenant Distributed Systems
    • The Impact on Business from Performance Problems
    • Scope of This Book
      • Hadoop: An Example Distributed System
      • Terminology
• 2. Scheduling in Distributed Systems
  • Introduction
  • Dominant Resource Fairness Scheduling
  • Aggressive Scheduling for Busy Queues
  • Special Scheduling Treatment for Small Jobs
  • Workload-Specific Scheduling Considerations
  • Inefficiencies in Scheduling
    • The Need to be Conservative with Memory
    • Inability to Effectively Schedule the Use of Other Resources
    • Deadlock and Starvation
    • Waste Due to Speculative Execution
  • Summary
• 3. CPU Performance Considerations
  • Introduction
  • Algorithm Efficiency
  • Kernel Scheduling
    • Intentional or Accidental Bad Actors