NATIONAL UNIVERSITY OF SINGAPORE
DEPARTMENT OF COMPUTER SCIENCE
A THESIS SUBMITTED
FOR THE DEGREE OF MASTER OF SCIENCE
Performance Analysis of MapReduce Computing Framework
Supervisor:
Professor TAY Yong Chiang
Author:
Hou Song
HT090459N
September 2011
Performance Analysis of MapReduce Computing Framework
Hou Song
songhou@comp.nus.edu.sg
Abstract
MapReduce is an increasingly popular distributed computing framework, especially for large scale data analysis. Although it has been adopted in many places, a theoretical analysis of its behavior is lacking. This thesis introduces an analytical model for MapReduce with three parts: the average task performance, the random behavior, and the waiting time. The model is then validated using measured data from three categories of workloads. Its usefulness is demonstrated by three optimization exercises, which yield reasonable conclusions that nevertheless differ from current understanding.
Acknowledgments
I would like to express my gratitude to my supervisor, Professor Tay Yong Chiang, for his tireless guidance and advice. Without his help I could not have completed this thesis in time. I would also like to thank my University for supporting me financially, without which I could not have carried out this work.
My parents and brother always trust me and give me continuous support. I owe them so much.
Finally, I would like to thank Shi Lei, Vo Hoang Tam and many other lab mates for their generous reviews and suggestions.
Table of Contents
Acknowledgments
List of Figures
List of Tables
Summary
1 Introduction
  1.1 Background
  1.2 Motivation for an Analytical Model
  1.3 Overview
2 Related Work
  2.1 Data Storage
    2.1.1 The Google File System
    2.1.2 Bigtable
  2.2 Data Manipulation
    2.2.1 MapReduce
    2.2.2 Dryad
3 System Description
  3.1 Architecture of Hadoop MapReduce
  3.2 Representative Workload
  3.3 Experimental Setup and Measured Results
4 Model Description
  4.1 Assumptions
  4.2 Related Theory Overview
  4.3 Notation Table
  4.4 Disassembled Sub-models
    4.4.1 The Average Task Performance
    4.4.2 The Random Behavior
    4.4.3 The Waiting Time
  4.5 The Global Model
5 Model Validation
  5.1 Database Query
  5.2 Random Number Generation
  5.3 Sorting
6 Model Application
  6.1 Procedures of Optimization using Gradient Descent
  6.2 Optimal Number of Reducers Per Job
  6.3 Optimal Block Size
  6.4 Optimal Cluster Size
  6.5 Summary
7 Conclusion and Future Work
Bibliography
List of Figures
1  Basic infrastructure of GFS
2  An example table that stores Web pages
3  Bigtable tablet representation
4  Basic structure of MapReduce framework
5  A Dryad job DAG
6  Example figure of throughput
7  Example figure of times of each phase
8  Example figure of numbers of tasks in each phase
9  Queueing model of a slave node
10 Histograms of tasks’ response time
11 Example figure of randomness T∆
12 Regular pattern for T∆
13 Transformation procedure to get equation for T∆
14 Measured time vs. calculated time for database query
15 Measured ∆ vs. calculated ∆ for database query
16 Measured response time vs. calculated response time for database query
17 Measured time vs. calculated time for random number generation
18 Measured ∆ vs. calculated ∆ for random number generation
19 Measured response time vs. calculated response time for random number generation
20 Measured time vs. calculated time for sorting
21 Measured ∆ vs. calculated ∆ for sorting
22 Measured response time vs. calculated response time for sorting
23 An example of gradient descent usage
24 The process of gradient descent for the optimization of Hr for sorting
25 The gradients for the optimization of Hr for sorting
26 The trend of gradients for the optimization of Hr for sorting
27 The process of gradient descent for the optimization of Hr for database query
28 The gradients for the optimization of Hr for database query
29 The process of gradient descent for the optimization of block size for sorting
30 The gradients for the optimization of block size for sorting
31 The trend of gradients for the optimization of block size for sorting
32 The process of gradient descent for the optimization of block size for database query
33 The gradients for the optimization of block size for database query
34 The process of gradient descent for cluster size for database query
35 The gradients for the optimization of cluster size for database query
36 The process of gradient descent for cluster size for sorting
37 The gradients for the optimization of cluster size for sorting
List of Tables
1 Symbols and notations
2 System default values
3 System optimization conclusions
Summary
The problems people are trying to solve are growing far beyond a single computer's capability, and distributed computing is an inevitable direction. For example, Internet companies use tens of thousands of machines to process an enormous number of concurrent user requests. MapReduce is a useful and popular distributed computing framework that is widely adopted in both industry and academia, because it is simple to use yet provides good scalability and high performance. However, its performance has not been fully studied. Several papers try to improve MapReduce's design and implementation through experiments or simulations, but no one has yet proposed an analytical model, which could overcome the weaknesses of the first two methods.
This thesis investigates the details of MapReduce and proposes an analytical model based on a categorization of typical workloads. The model consists of three parts: the average task performance, modelled as a modified multiclass processor sharing queue; the random behavior, modelled as a curve fitted to common observations; and the waiting time, modelled with a deterministic waiting equation. The model is then validated using measured data from all three workload categories. Finally the model is applied in three optimizations, demonstrating its usefulness in configuring MapReduce computations. The conclusions provide new insight into MapReduce's performance behavior.
1 Introduction
1.1 Background
The size of the problems people are trying to solve has been increasing for a long time, and is now well beyond what a single powerful computer can handle. For example, the Internet now has trillions of web pages and countless multimedia files, and making good use of them is a nontrivial problem. Distributed algorithms are the unavoidable direction, but due to their inherent complexity it is hard to develop useful applications efficiently [3]. Issues in distributed computing range from communication, synchronization and concurrency control to failure management and recovery, and each of these is by itself an area that needs thorough study. Therefore, working in a distributed fashion is tedious and error-prone, but still necessary.
Various frameworks and tools have been proposed to help developers. The Message Passing Interface (MPI) [12] is a successful communication protocol with a wide range of adoption. It provides convenient ways to send point-to-point or multicast messages. MPI is scalable and portable, and remains dominant for high performance computing systems that focus on raw computation, such as traditional climate simulation. However, it does not remove the difficulty that developers must resort to low level primitives to accomplish complex logic such as synchronization, failure detection and recovery, comparable to writing a sequential program in assembly language. These aspects make a distributed system hard to design and implement, and its correctness tricky to ensure. Yet many of them share common operations, which could be provided by the underlying system and thus relieve the burden on programmers.
From a broader point of view, there are two types of high performance computer systems: one for raw computation power and the other for data processing. The first type has a longer history, with its concentration on the number of calculation operations per second; the systems in the TOP500 list [28] are good examples. As people collect and generate more and more data, such as Internet web pages, photos and videos on social network sites, health records, telescope imagery and transaction logs, automatic processing of these data using large computer systems is in high demand. For example, successful data mining of transaction records from a supermarket can give the manager a better understanding of the business and its customers, and therefore lift the business to a new level. Fast and accurate processing of telescope images can lead to breakthrough scientific discoveries. Database systems are designed to manage large data, but up to 70% to 80% of online data are unstructured and may be used only a few times, and processing them with existing commercial databases is inefficient, or even infeasible [7]. New systems are being designed [4, 11], and companies such as Google, Microsoft and Amazon have turned these designs into commercial systems that operate tens of thousands of computers. The accumulated power of these low end computers makes it possible to analyse the whole Internet in a timely fashion, support large transaction systems, and much more.
This new trend is also attracting attention from smaller companies and researchers, who do not have access to large computing infrastructure like Google's and Microsoft's. In the era of cloud computing, however, people can rent machines in remote clouds and start their own cluster at a very low price. Then the immediate question is: given the workload and service level objectives, how many machines are needed? Other challenges include cluster parameter optimization, system upgrades, scheduler design decisions, and cluster sharing. To answer these questions, engineers and researchers need to understand the relationship between system performance, system parameter settings and the characteristics of the workload, which is the major research direction of this thesis.
1.2 Motivation for an Analytical Model
Although there are many such successful systems, their performance has not been studied very deeply, especially for the newly established massive scale data processing systems. MapReduce [9], the distributed computing framework originally from Google, is an example. It is very popular thanks to its expressive power and simplicity of use. There is some work [24, 36, 29] that tries to study and improve the performance of MapReduce, but to the best of our knowledge, no one has proposed an analytic model for it. This thesis focuses mainly on MapReduce, and tries to design an analytical model that characterizes its performance.
There are usually three ways to study a system: experiment, simulation and analytical modelling. Experiments are accurate, but sometimes too slow to be feasible. Simulation extracts only the necessary details and runs faster, but is still not practical when the parameter space is too large; for example, a distributed system could have hundreds of parameters. An analytical model describes a system abstractly with mathematical equations, taking system parameters as input and system performance metrics as output. A well developed analytical model hides unnecessary details, explores the whole parameter space conveniently, and may unveil obscure truths that are impossible to obtain otherwise.
In this thesis we first study the details of Hadoop, an open source implementation of MapReduce, and categorize ordinary workloads into three groups according to the size of their input and output. We divide MapReduce execution into several pieces, and develop models for each of them. Then we validate the accuracy of the model using measured data. Finally we show its applications with three examples, which yield useful conclusions. For example, the common practice of using a larger block size to obtain better sorting benchmark results may actually lead to longer response times; and MapReduce's scalability can be influenced by the design of its jobs, so improvements could be made according to this finding.
1.3 Overview
This thesis is organized as follows. Section 2 presents the related work in distributed data processing, with the first part on data storage and the second on data manipulation. Section 3 describes the technical details of MapReduce and the experimental setup. Section 4 states the assumptions, reviews the fundamental theory, and proceeds to build the model. The model is validated in Section 5 and applied in Section 6, where conclusions that differ from current understanding are drawn. Finally, Section 7 concludes the thesis and sets out the plan for future work.
2 Related Work
Modern data processing systems are growing in both size and complexity, but their theory and guidelines do not change very frequently. The prediction from [10] remains valid today: large scale database systems should be built on a shared-nothing architecture made of conventional hardware. Over the last few decades the shared-nothing architecture has developed fast, attracting popularity from both academia and industry. A number of systems have been implemented in this area, and they are introduced later in this section.
Intuitively, a data processing system has two major parts: how data are stored, and how they are manipulated. This section on related work is therefore organized into two corresponding subsections.
2.1 Data Storage
At the lowest level, data storage is a sequence of bytes kept on disks or other stable storage devices. However, bytes can be interpreted in different forms, ranging from simple bit streams to groups of records to nested objects. In order to control and share data more conveniently, database technologies were invented and are widely deployed.
In recent years the requirements for data storage have become more demanding. These requirements include, but are not limited to, volumes that scale up to petabytes, concurrency control among billions of objects, fault tolerance and failover, and high throughput for both reads and writes. Traditional storage and database systems can become awkward when handling all of these, so new approaches have been proposed.
2.1.1 The Google File System
The Google File System (GFS) [13] was originally designed and implemented by Google to meet its needs, and many decisions and optimizations were made to fit the environment Google was in. This environment is not unique to Google, and GFS's open source counterpart, the Hadoop Distributed File System [16], is now widely used in many companies and institutes [15].
GFS is based on several assumptions. First, failures are the norm rather than the exception. Because of the large number of machines gathered together, expensive, highly reliable hardware does not help much; instead, Google's system consists of thousands of machines built from commodity hardware, so fault tolerance is necessary in such systems. Second, the system should manage very large files gracefully. It is able to store small files, but the priority is files whose size is measured in gigabytes. Third, the workload is primarily one type of write and two types of reads: large sequential writes that append to the end of files, and large sequential reads or small random reads. Finally, the objective is sustained performance for a large number of concurrent clients, with more emphasis on high overall throughput than on the response time of a single job [13].
Under these assumptions, GFS uses a single-master, multiple-slave architecture as its basic form. Files are divided into chunks, 64 MB by default, which are the unit of file distribution. The master maintains all the meta-data, such as the file system structure, the chunk identifiers of files, and the locations of chunks; it has all the information and controls all the actions. The slave nodes, called chunkservers, store the actual data in chunks assigned by the master, and serve them directly to clients. The master and chunkservers periodically exchange all kinds of information, such as primary copy leases, meta-data updates and server registrations, through heartbeats. The work on the master node is minimized to avoid overloading it.
When a client tries to read a file, it retrieves the chunk handles and locations from the master using the file name and the offset within the file. The client caches this piece of meta-data locally to limit communication with the master. It then chooses the closest location among all possibilities (local or within the same rack), and initiates a connection with that chunkserver to stream the actual data.
Writing is a bit more complex. Because GFS uses replication to improve reliability and read performance, consistency has to be taken into account. Each chunk has a primary copy selected by the master, and every file modification goes through the primary copy so that the operations on the chunk are ordered properly. Ordinary writes that update a region of an existing file are supported, but the system is more optimized for appends, which are used when an application wants to add a record at the end of a file without caring about the record's exact location. The primary copy of the last chunk of the file decides where the record is written. Record appends are therefore atomic operations, and the system can serve a large number of concurrent operations, because the order is decided at the writing site and no further synchronization is required. Before an operation successfully returns to the application, the primary copy pushes all the operations on the chunk to all the secondary copies, ensuring that all copies are the same.
Fault tolerance is one of the key features of GFS. No hardware is trusted, so software must deal with all kinds of failures. File chunks are replicated across multiple nodes and racks. Checksums are used extensively to rule out data corruption. The master replicates its state and logs so that, in case of failure, it can restore itself locally or on another node. Shadow masters provide read-only access during the master's downtime. Servers are designed for fast recovery, and downtime can be reduced to a few seconds.
[Figure 1. Basic infrastructure of GFS: a GFS client sends file read/write requests to the GFS master (control messages) and receives chunk locations; chunkservers exchange heartbeats with the master and serve the actual read/write data streams to the client.]
In addition, there are other useful functionalities, such as snapshots, garbage collection, integrity tests, re-replication and re-balancing. The structure of GFS is shown in Figure 1.
2.1.2 Bigtable
GFS is a reliable, high performance distributed file system that serves raw files, but many applications need structured data resembling a table in a relational database. Bigtable [6], also from Google, fulfills this need; HBase [14] from Hadoop is its open source version.
[Figure 2. An example table that stores Web pages: the row with key "com.cnn.www" has a "contents:" column family holding three versions of the page, and an "anchor:" family with columns "anchor:cnnsi.com" = "CNN" and "anchor:my.look.ca" = "CNN.com".]
In Bigtable, tables are not organized in a strict relational data model; instead, each table has one row key and an unfixed, unlimited number of columns, and each field has multiple versions indexed by time stamp. This data model supports dynamic data control and gives applications more choice in expressing their data. Internally, Bigtable is a distributed, efficient map from row key, column name and time stamp to the actual value in that cell. Columns are grouped into column families, which are the basic units of access control. Bigtable is sorted in lexicographic order by row key, and dynamically partitioned into tablets that each hold several adjacent rows. This design exploits data locality and improves overall performance. An example is shown in Figure 2: a small part of a Web page table. The row has row key “com.cnn.www” and two column families, “contents” and “anchor”, and the “contents” value has three versions.
Bigtable is built on top of many other pieces of Google infrastructure. It uses GFS as persistent storage for data files and log files, depends on the cluster management system to schedule resources, relies on Chubby [5] as its lock service provider and meta-data manager, and runs in clusters of machine pools shared with other applications.
In the implementation, there are one master and many tablet servers. The master takes charge of the whole system, and tablet servers manage the tablets assigned to them by the master. Tablet information is stored in meta-data indexed by a specific Chubby file. Data in a tablet are stored in two parts: the old values, which are immutable and stored in the special SSTable file format sorted by row key, and a commit log of mutations on the tablet. Both are stored in GFS. When a tablet server starts, it reads the SSTables and the commit log for the tablet, and constructs a sorted buffer named the “memtable”, filled with the most recent view of the values. When a read operation is received, the tablet server searches the merged view of the latest tablet state, built from both the SSTable files and the commit log, and returns the value. When a write operation is received, the operation is written to the log file, and the memtable is modified accordingly. The whole process is shown in Figure 3.
[Figure 3. Bigtable tablet representation: read operations are served from the merged view of the in-memory memtable and the SSTable files stored in GFS; write operations are appended to the tablet log and applied to the memtable.]
Many refinements are used to achieve high performance and reliability. Compression is applied to save storage space and speed up transport. Caching is used heavily on both the server side and the client side to relieve the load on the network and disks. Because most tables are sparse, Bloom filters are used to speed up searches for non-existent fields. Logs on a tablet server are commingled into one file. Tablet recovery is likewise designed to be rapid, to minimize downtime.
2.2 Data Manipulation
Making good use of data takes more than just data storage. People could write dedicated distributed programs for a particular kind of processing, but such programs are hard to write and maintain, and each has to deal with data distribution, scheduling, failure detection and recovery, and machine communication. A central framework can provide these common features, so that users can rely on it and concentrate only on the logic unique to their jobs. Here two such systems are analyzed in detail.
2.2.1 MapReduce
MapReduce [9] is a powerful programming model for distributed, massive scale data processing. Originally designed and implemented at Google, MapReduce is now a hot topic and is being applied to fields it was not originally intended for. Hadoop also has an open source MapReduce implementation. MapReduce is built on top of GFS and Bigtable, using them as input sources and output destinations.
MapReduce's programming model comes from functional languages and consists of two functions: map and reduce. The map function takes input in the form of <key, value> pairs, does some computation on a single pair and produces a set of intermediate <key, value> pairs. All the intermediate pairs are then grouped and sorted. The reduce function takes an intermediate key and the list of values for that key as input, does some computation and writes out the final <key, value> pairs as the result. Many practical jobs can be expressed in this model, including grep, inverted lists of web pages and PageRank computation.
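As a concrete illustration of this model, consider the standard word-count example. The following is a minimal sketch against the Hadoop MapReduce Java API (a conventional illustration, not code from this thesis; the class names and the combiner choice are ours):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map: emit an intermediate <word, 1> pair for every token in the input value.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: the framework groups and sorts by key; sum the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // optional local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Reusing the reducer as a combiner pre-aggregates map output locally, a common way to shrink the intermediate data shipped between phases.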
Google's MapReduce is implemented in a master-slave architecture. When a new job starts, it generates a master, a number of workers controlled by the master, and some number of mapper and reducer tasks. The master assigns mapper and reducer tasks to free workers. Mapper tasks write their intermediate results to their local disks and notify the master of their completion, and the master informs reducer tasks to fetch the map outputs. When tasks finish and unassigned tasks remain, the master continues its scheduling. Reducer tasks start only after all mapper tasks have finished; after all reducers have finished, the job is complete and its result is returned to the client. During execution, if any worker dies, all the tasks on that worker are marked as failed and re-scheduled until they finish successfully. Figure 4 shows an example with an input file of 5 splits, 4 mappers and 2 reducers. Note that Hadoop's MapReduce differs slightly from Google's, as we will show later.
[Figure 4. Basic structure of MapReduce framework: a client submits a job to the master, which assigns input splits 0 to 4 to four mappers; two reducers fetch the intermediate data and produce outputs 0 and 1. Control messages flow between the master and the workers, while data stream from the splits through the mappers and reducers to the outputs.]
Other researchers have proposed enhancements to MapReduce. Traditional MapReduce focuses on long running, data intensive batch jobs that aim at high throughput rather than short response time; the outputs of both the map phase and the reduce phase are written to disk, either the local file system or the distributed file system, to simplify fault tolerance. MapReduce Online [8] was proposed to allow data to be pipelined between phases and between consecutive jobs. Intermediate <key, value> pairs are sent to the next operator soon after they are generated: from mappers to reducers in the same job, or from the reducers of one job to the mappers of the next. Because reducers execute with only a portion of the intermediate results, the final results are not always accurate; MapReduce Online therefore takes snapshots as the mappers proceed, runs reducers on these snapshots and approximates the real answer. Many refinements are used to improve performance and fault tolerance, and to support online aggregation and continuous queries.
Although MapReduce can express many algorithms in areas such as information retrieval and machine learning, it is hard to use for some database operations, especially table joins. Map-Reduce-Merge [33] extends the original model by adding a third phase, the merge phase, at the end of the reduce phase. A merger gets the resulting <key, value> pairs from the reducers of two MapReduce jobs, runs default or user-defined functions, and generates the final output files. With the merge phase and some operators, Map-Reduce-Merge can express many relational operators such as projection, aggregation, selection and set operations. More importantly, it can run most join algorithms, such as sort-merge join, hash join and nested-loop join.
Job scheduling is an important factor in MapReduce's performance. Hadoop assumes all nodes are the same, which results in bad performance in heterogeneous environments, such as virtual machines in Amazon's EC2 [2], where machine performance can differ significantly. Speculative tasks are one of the reasons. [36] proposes a new scheduling algorithm, Longest Approximate Time to End (LATE), that fits heterogeneous environments well and executes speculative tasks more accurately. The key idea is to estimate which task will finish farthest into the future. LATE uses a simple heuristic that assumes a task's progress rate is constant: the progress rate is calculated from the completed fraction of the work and the elapsed time, and the remaining time is computed by dividing the remaining fraction of the work by that rate. The task that would finish last determines the job's response time, and is therefore the first to re-execute if speculative tasks are needed; a sketch of this estimate appears below. Some tuning parameters further enhance this strategy.
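The heuristic can be stated in a few lines. The sketch below is our own rendering of the estimate described above, not code from [36], and the names are made up:

// LATE's remaining-time estimate under the constant-progress-rate assumption.
final class TaskEstimate {
    final double progress;   // completed fraction of the work, in [0, 1]
    final double elapsedSec; // seconds since the task started

    TaskEstimate(double progress, double elapsedSec) {
        this.progress = progress;
        this.elapsedSec = elapsedSec;
    }

    // Progress rate, assumed constant over the task's lifetime.
    double progressRate() {
        return progress / elapsedSec;
    }

    // Estimated time until completion; the task with the largest value finishes
    // farthest into the future, and is the first candidate for speculation.
    double estimatedTimeToEnd() {
        return (1.0 - progress) / progressRate();
    }
}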
To make MapReduce more convenient to use, other applications are built on top of it to provide simple interfaces. Hive [26] is a data warehousing solution built on Hadoop. It organizes data in relational tables with partitions, and uses the SQL-like declarative language HiveQL as its query language. Many database operations are supported in Hive, such as equi-join, selection, group by, sort by and some aggregation functions. Users can define and plug in their own functions, further increasing Hive's functionality. Its ability to translate SQL into directed acyclic graphs of MapReduce jobs empowers ordinary database users to analyze extremely large datasets, so it is becoming more and more popular.
Pig Latin [20] is a data processing language for large datasets that compiles into a sequence of MapReduce jobs. Unlike SQL, Pig Latin is a hybrid procedural query language especially suitable for programmers, and users can easily control its data flow. It employs a flexible nested data model, and natively supports many operators such as co-group and equi-join. Like Hive, it is easily extensible with user-defined functions. Furthermore, Pig Latin is easy to learn and easy to debug.
Sawzall [22] was designed and implemented at Google and runs on top of Google infrastructure such as protocol buffers, GFS and MapReduce. Like MapReduce, Sawzall has two phases: a filtering phase and an aggregation phase. In the filtering phase records are processed one by one and emitted to the aggregation phase, which performs aggregation operations such as sum, maximum and histogram. In practice Sawzall acts as a wrapper around MapReduce, presenting a virtual view of pure data operations.
Additionally, MapReduce has been introduced into areas other than large scale data analysis. As multi-core systems become popular, how to easily harness the power of multiple cores is a hot topic, and MapReduce is useful here as well. [23] describes the Phoenix system, a MapReduce runtime for multi-core and multi-processor systems using shared memory. Like MapReduce, Phoenix consists of many map and reduce workers, each of which runs in a thread on a core. Shared memory is used as storage for intermediate results, so data do not need to be copied, which saves a lot of time. At the end of the reduce phase, the outputs of the different reducers are merged into one output file in a bushy tree fashion. Phoenix provides a small API that is flexible and easy to use.
Phoenix shows a new way to apply MapReduce in a shared memory system, but it may perform badly in a distributed environment. Phoenix Rebirth [34] revises the original Phoenix for NUMA systems. It uses locality information when making scheduling decisions to minimize remote memory traffic: mappers are scheduled on machines that have the data or are near it, and combiners reduce the size of the mappers' outputs, and therefore the remote memory demand. In the merge phase, a merge sort is performed first within a locality group, and then among different groups.
Another area is general purpose computation on graphics processors. GPUs are being used in high-performance computing because of their high internal parallelism, but GPU programs are hard to write and not portable. Mars [17] implements a MapReduce framework on GPUs that is easy to use, flexible and portable. During execution, inputs are prepared in main memory and copied to the GPU's device memory, and mappers are started on the GPU's hundreds of cores. After the mappers complete, reducers are scheduled. Finally, the outputs are merged into one and copied back to main memory. Mars exposes a small API yet achieves performance that is sometimes 16 times faster than its CPU based counterpart.
2.2.2 Dryad
Although MapReduce is now widely adopted in many areas, it is awkward for expressing some algorithms, such as the large graph computations needed in many cases. Dryad [18] can be seen as an extended MapReduce that lets users control the topology of the computation. Dryad is a general purpose, data parallel, low level distributed execution engine. It organizes its jobs as directed acyclic graphs, in which nodes are simple, usually sequential, computation programs, and edges are data transmissions between nodes.
Creating a Dryad job is easy. Dryad uses a simple graph description language to define a graph. This language has 8 basic topology building blocks, which are sufficient to represent all DAGs when combined. Users define computation nodes by inheriting a C++ base node class and integrating them into the graph. Edges in the graph have three forms: files, TCP connections and shared memory. An example job DAG is shown in Figure 5. Other tedious work, such as job scheduling, resource management, synchronization, failure detection and recovery, and data transport, is performed internally by the Dryad framework itself.
The system architecture is again a single-master, multiple-slave style, to ensure efficiency. The execution of a Dryad job is coordinated by the job manager, which also monitors the running states of all slaves. When a new job arrives, the job manager starts the computation nodes that take direct input from files, according to the job's graph description. When a node finishes, its output is fed to its child nodes. The job manager keeps looking for nodes whose inputs are all ready, and starts them as soon as they are. If a node fails, it is restarted on another machine. When all computation nodes finish, the whole job finishes, and the output is returned to the user.
[Figure 5. A Dryad job DAG: operator vertices (Op), some replicated n times, read input files 1 and 2 and feed their results through intermediate operators into a single output file.]
Many optimizations are needed to make Dryad useful in practice. First of all, the user-provided execution plan may not be optimal, and needs refinement. For example, if a large number of vertices in the execution graph aggregate to a single vertex, that vertex may become a bottleneck; at run time, the source vertices may be grouped into subsets, with corresponding intermediate vertices added in the middle. Second, the number of vertices may be much larger than the number of available machines, and the vertices are not always independent, so mapping these logical vertices onto physical resources is of great importance. Some vertices are so closely related that it is better to schedule them on the same machine, or even in the same process; vertices can run one after another, or at the same time with data pipelined in between. Furthermore, there are three kinds of data transmission channels (shared memory, TCP network connections and temporary files), each with different characteristics, and using an inappropriate channel can cause large overhead. Last but not least, how failure recovery is accomplished affects the whole system. Note that these optimization decisions are correlated. For example, if some vertices execute in the same process, the channels between them should be shared memory to minimize overhead, but failure recovery becomes more complex: once a vertex fails, the other vertices in the same process must be re-executed as well, because the intermediate data are stored in memory and lost on failure.
Although Dryad is powerful enough to express many algorithms, it is too low level for daily work: Dryad users still need to consider details of the computation topology and data manipulation. Therefore some higher level systems have been designed and implemented on top of Dryad. The Nebula language [18] is one example. Nebula exposes Dryad as a generalization of a simple pipelining mechanism, providing developers with a clean interface that hides Dryad's unnecessary details. Nebula also has a front-end that integrates Nebula scripts with simple SQL queries.
DryadLINQ [35] is an extension on top of Dryad, aiming to give programmers the illusion of a single powerful virtual computer so that they can focus on the primary logic of their applications. It automatically translates high level sequential LINQ programs, which have an SQL-style syntax and many extra enhancements, into Dryad plans, and applies both static and dynamic optimizations to speed up their execution. A strongly typed data model is inherited from the original LINQ language, and old LINQ programs that were meant to run on traditional relational databases can now deal with extremely large volumes of data on Dryad clusters without any change. Debugging environments are also provided to help developers.
3 System Description
In this section we first investigate the fundamental behavior of Hadoop MapReduce, followed by a categorization of usual workloads. We then describe the experimental setup, and plot a set of experimental results as a preliminary impression of MapReduce's performance.
3.1 Architecture of Hadoop MapReduce
Hadoop [16] is a suite of distributed processing software whose components closely mimic their counterparts in Google's system. In this study we choose Hadoop MapReduce along with the Hadoop Distributed File System (HDFS); their architectures are analyzed here to gain enough insight to set up the environment for the model.
HDFS is organized in a single-master, multiple-slave style. The master, called the Namenode, maintains the file system structure and controls all read/write operations. The slaves, called Datanodes, hold the actual data and carry out the read/write operations. As stated in Section 2.1.1, file data are stored in blocks of fixed size, which improves scalability and fault tolerance. However, the available functionality is deliberately confined to keep the system simple and efficient.
When a read operation arrives, the Namenode first checks its validity, then redirects the operation to a list of corresponding Datanodes according to the file name and the offset inside the file. The sender of the operation then contacts one of those Datanodes for the data it requires. The Datanode closer to the sender is chosen first, and in special cases the sender itself, so as to save network bandwidth. When the current block is exhausted, the next block is chosen by the Namenode, and the operation resumes from the beginning of the new block.
When a write operation arrives, the Namenode again checks its validity, and chooses a list of Datanodes to store the replicas of the written data. The sender streams the data to the first Datanode, which simultaneously streams the same data to the next Datanode, and so on until no data is left. If the current block is full, a new list of Datanodes is chosen to store the remaining data, and the operation restarts.
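From an application's point of view, both paths are hidden behind the ordinary HDFS client API. A minimal sketch using standard Hadoop calls (the path is made up for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRoundTrip {
    public static void main(String[] args) throws Exception {
        // Picks up the cluster configuration (core-site.xml, hdfs-site.xml).
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/tmp/example.txt"); // illustrative path

        // Write: the Namenode chooses a Datanode pipeline; the client streams
        // bytes to the first Datanode, which forwards them down the replica chain.
        try (FSDataOutputStream out = fs.create(path)) {
            out.writeUTF("hello hdfs");
        }

        // Read: the Namenode returns block locations; the client then streams
        // data directly from the closest Datanode holding each block.
        try (FSDataInputStream in = fs.open(path)) {
            System.out.println(in.readUTF());
        }
    }
}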
Hadoop MapReduce is built on top of HDFS, and similarly has a master-slave architecture. Its master, called the Jobtracker, controls the lifecycle of a job, including submission, scheduling, cleanup and so on. The slaves, called Tasktrackers, run the tasks assigned to them by the Jobtracker. To keep the description concise, only the major actions are shown here.
When a new job arrives, the Jobtracker sets up the data structures needed to keep track of the job's progress, and then initializes the right number of mappers and reducers, which are put into the pool of available tasks. The scheduler monitors that pool and allocates new tasks to free Tasktrackers. Many strategies can help the scheduling, such as exploiting the locality of input files, and rearranging tasks into a better order. If no Tasktracker is available, new tasks queue up until some Tasktracker finishes an old task and is ready for a new one.
When a Tasktracker receives a task from the Jobtracker, it spawns a new process to run the actual code for that task, collects the running information and sends it back to the Jobtracker. Depending on the specific configuration of the task, the process may read data from the local disk or remote nodes, compute the output and write it to the local disk or HDFS. Three important resources are therefore involved: the CPU, the local disk and the network interface card.
According to MapReduce's topology, a job ideally has two phases: map and reduce. But after careful study, we find two synchronization points, one after the mappers and one after the reducers: all reducers start only after all mappers finish, and the job result is returned only after all reducers finish. In this work we focus on average performance, and the randomness in the response times of individual map and reduce tasks makes the average task time insufficient for calculating the total job time. We measure this randomness by the difference ∆ between the response time of a job and the average time of map and reduce. Furthermore, a job has to wait at the master node if there are currently no free slots for new jobs; we call this waiting the fourth part. In total we have four phases: map, reduce, ∆ and waiting.
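If these four parts simply add up (our reading here; the precise combination is developed in the global model of Section 4.5), a job's response time is approximately T ≈ Tm + Tr + T∆ + Tw, in the notation of Table 1.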
3.2 Representative Workload
MapReduce is powerful in its expressive ability, especially for large scale problems. Typical workloads include sorting, log processing, index building and database operations such as selections, projections and joins. In order to make our analysis applicable to a variety of scenarios, we divide these workloads into three types according to the relation between input and output, because most MapReduce applications are I/O bound programs, meaning they require more disk or network I/O time than CPU time.
The first type is large-in-small-out: a large amount of data is read and processed, but only a small amount of output is generated. Examples include log item aggregation and table selections in a database. The second type is large-in-large-out: a large amount of data is read, and a comparable amount of output is generated. Examples include sorting, index building and table projection in a database. The last type is small-in-large-out: only a small amount of input is needed, but a large amount of output can be generated. Examples include random data generation and some table joins in a database.
To reach an accurate model step by step, we first build the model for the first workload type, and then revise it to incorporate the other two types. The workload we choose to set up the basic model is a random SQL query, an equi-join of two tables with projection and selection, shown in Listing 1. The two input tables are large, and the size of the output is negligible. In this query, u_name is chosen randomly to make the queries differ from one another, the way a real workload would look.
Listing 1. Workload query
select p_id
from user, photo
where u_id = p_uid and
      u_name = "<different names>"
Sorting and random number generation are used for the second and third workload types respectively. These two very basic data operations are used heavily in many systems, and are therefore suitable workloads for our study.
3.3 Experimental Setup and Measured Results
All experiments are run on an in-house cluster with at most 72 working nodes, although we do not always use all of them. Each node has a 4-core CPU, 8 GB of main memory, a 400 GB disk and a Gigabit Ethernet network card.
In order to get a complete overview of MapReduce's performance, we arrange the experiments systematically. First, the maximum number of concurrent tasks is raised high enough to support more concurrent jobs. Then, in each experiment, the number of concurrently running jobs is fixed, and as the jobs run we measure the number of maps, reduces and shuffles, as well as the time spent in each phase. Each experiment may be run multiple times, averaging the results to improve accuracy. We then vary the job concurrency to obtain the whole curve, which shows how performance changes with the workload. After getting the curve for one setting, we change the number of nodes used, or the system parameters, to observe the effects of different settings.
The usual patterns for throughput, the time in each phase, and the number of tasks in each phase are pictured in Figures 6, 7 and 8 respectively. For different settings the specific numbers in these curves may differ, but the shapes are similar. In the throughput plot (Figure 6), throughput first increases almost linearly, then gradually decelerates to reach a maximum, after which it decreases a little and remains steady afterwards; the last changing point is at concurrency 40 in this example. Dissecting the running time into the 4 parts mentioned earlier gives Figure 7. The first two parts (mapper and reducer) share a similar pattern, a linear increase followed by a steady constant. The ∆ part is different, stabilizing after an exponential-like increase. The waiting part remains 0, then increases linearly after a turning point. Notice that the turning points of all 4 parts coincide, at concurrency 40 in this example, matching the throughput plot. Finally, in Figure 8, where the numbers of concurrent tasks are shown, the two phases again share a similar pattern, a linear increase followed by a constant, with the same turning points as in Figures 7 and 6. However, not all workloads produce the same curves; the length of the performance drop depends on the system parameters and the characteristics of the workload, and may disappear in some scenarios. In the next chapter we explain why the curves have this shape.
[Figure 6. Example figure of throughput: throughput (jobs/sec) versus concurrency.]
[Figure 7. Example figure of times of each phase: time (s) versus concurrency, with curves for mapper, reducer, ∆ and waiting.]
[Figure 8. Example figure of numbers of tasks in each phase: number of tasks versus concurrency, with curves for mapper and reducer.]
4 Model Description
This section starts with the assumptions and reviews the related theory in analytical performance modelling. The model is then discussed in the form of open systems, in three major parts. Finally the whole model is assembled, and a corresponding formula is derived for closed systems.
4.1 Assumptions
We first state several key assumptions that make this problem solvable. Other minor assumptions will be introduced where they are needed.
Assumption 1. The master node is not overloaded. In most cases the work on the master node is relatively small, so the master is unlikely to be the bottleneck.
Assumption 2. The whole system stays in a steady state. Our primary interest is steady state performance. However, as we will see later, the system can be steady but not balanced.
Assumption 3. No failure is present. Although MapReduce is capable of handling many kinds of failures, their influence on performance is inevitable. Moreover, failures are usually transient, while our current focus is steady state performance; we therefore postpone the study of failures to future work.
Assumption 4. Individual jobs are not too large, and their tasks can run at the same time. MapReduce jobs can be very large, taking hours to finish, but it is not realistic to treat such long jobs as failure free. Because our model does not consider failures, our primary focus is on the throughput of short jobs which take only several minutes to finish.
4.2 Related Theory Overview
Before delving into the model, we first review some related theory that will be used later. One of the most important theorems in performance analysis is Little's Law, stated as Theorem 1.
Theorem 1. In the steady state of a system,
\[ \bar{n} = \lambda T, \tag{1} \]
in which n̄ is the average number of jobs in the system, λ is the average arrival rate, and T is the average time in the system.
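For instance (a numeric illustration of ours): if jobs arrive at λ = 0.3 jobs/s and each spends T = 100 s in the system, then on average n̄ = 0.3 × 100 = 30 jobs are in the system.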
In queueing theory, a queue with a single server, Poisson arrivals and exponentially distributed service times is usually denoted an M/M/1 queue. Classic queueing theory gives the following conclusion:
Theorem 2. For an M/M/1 queue,
\[ \bar{n} = \frac{\rho}{1 - \rho}, \tag{2} \]
and
\[ W = \frac{\rho}{1 - \rho} \cdot \frac{1}{\mu}, \tag{3} \]
in which
\[ \rho = \frac{\lambda}{\mu} \tag{4} \]
is the server utilization, W is the average waiting time, λ is the arrival rate and µ is the service rate.
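As a quick numeric check (ours): with λ = 8 jobs/s and µ = 10 jobs/s, the utilization is ρ = 0.8, so n̄ = 0.8/0.2 = 4 jobs are in the system and the average waiting time is W = 4 × (1/10) = 0.4 s.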
Queueing models have two basic forms, open and closed. In open systems, the customer arrival rate does not depend on the state of the system; in closed systems, the total number of customers is fixed. Many systems can be studied as either, and the decision depends on the purpose of the system: if a system is designed to minimize the processing time of incoming requests whose number is not fixed, an open model may be better; if it is designed to support a fixed, large number of concurrent users, a closed model may be more suitable.
The classic M/M/1 queue uses the First Come First Served (FCFS) discipline, which is simple and easy to study. Processor sharing is another useful discipline, in the sense that many systems behave in a processor sharing pattern, such as modern time sharing operating systems; in such systems, the server divides its computing power evenly among all the customers present. Furthermore, a queueing system with multiple classes of customers is much more complicated to analyze. In [1] the authors summarize several models for processor sharing queues; here we use their equations relating the response time to the arrival rate, unconditioned on the number of jobs in the queue, as described in Theorem 3.
Theorem 3. In a queueing system with K classes of customers in total and the processor sharing discipline, let Tk be the expected response time of a class-k customer, λk its arrival rate and µk its service rate. If the service requirement is exponentially distributed, then the Tk satisfy the following linear equations:
\[ T_k = B_{k0} + \sum_{j=1}^{K} T_j \lambda_j B_{kj}, \qquad k = 1, \ldots, K, \tag{5} \]
where Bkj, for j = 0, 1, . . . , K, is given by
\[ B_{kj} = \begin{cases} \dfrac{1}{\mu_k (1 - \sigma_k)} & \text{if } j = 0, \\ \dfrac{1}{(\mu_k + \mu_j)(1 - \sigma_k)} & \text{otherwise,} \end{cases} \tag{6} \]
and σi is given by
\[ \sigma_i = \sum_{k=1}^{K} \frac{\lambda_k}{\mu_i + \mu_k}. \tag{7} \]
Large and complex systems, such as the Internet with its countless routers, switches and end computers, are usually hard to model with the aforementioned techniques because of the sheer number of sub-systems they contain. Bottleneck Analysis [25] is helpful in this scenario. Among all sub-systems, such as all links in the Internet, the one with the highest utilization is called the bottleneck, and the bottleneck defines a performance bound for the whole system. For example, when an end user A on the Internet sends data to another user B, the transfer speed is limited by the speed of the bottleneck link between A and B. The model of the bottleneck sub-system is a good approximation of the whole system, accurate enough in many different cases.
4.3 Notation Table
Table 1 shows the symbols and their descriptions that will be used throughout the thesis. Several
less important notation symbols will be introduced where they are needed.
Symbol   Description
C        Number of concurrent jobs (either running or waiting)
K        Maximum number of concurrent running jobs
X        Job arrival rate (in open systems) or throughput (in closed systems)
T        Job response time
N        Number of slave nodes in the system
Hm       Number of mappers per job
Hr       Number of reducers per job
b        Block size of the distributed file system
Di       Size of input data per job
Do       Size of output data per job
λm       Average arrival rate of mappers per node
µm       Average service rate of mappers per node
Sm       Average service time of mappers per node
λr       Average arrival rate of reducers per node
µr       Average service rate of reducers per node
Sr       Average service time of reducers per node
Tm       Average response time of a mapper
Tr       Average response time of a reducer
T∆       Maximum task response time minus average task response time
Tw       Average waiting time in the master node
p        Percentage of total work on the slowest node

Table 1. Symbols and notations
4.4 Disassembled Sub-models
In this subsection we first model the average performance of individual tasks, and then use their random behavior to calculate the response time of an ordinary job. Finally we consider the waiting time incurred when the existing jobs occupy all free slots and newly arriving jobs have to wait.
4.4.1 The Average Task Performance
Returning to the MapReduce framework, a job is decomposed into many tasks which run on slave nodes, and the tasks read from and write to HDFS files spread across all nodes. As a result, all jobs are potentially related to all nodes, which means the busiest slave node is potentially the slowest point for all jobs. Intuitively, from Section 4.2, the tasks on the busiest node are the slowest tasks, so the busiest node is the bottleneck of the whole MapReduce framework; if we model the busiest node accurately, we have a model for the slave nodes. Therefore, we first focus on the model of a single node, then use the parameters of the bottleneck node to directly calculate the performance of that node, and indirectly the performance of the slave nodes as a whole. We introduce a parameter p to represent this imbalance, defined in Equation 8:
\[ p = \frac{\text{amount of work on the slowest node}}{\text{total amount of work}}, \tag{8} \]
where the amount of work is the number of running operating system processes, including MapReduce mapper and reducer tasks, their management processes, and the distributed file system processes. The more running processes a machine has, the slower each of these processes gets. This cluster imbalance factor p is affected by the type of work and the cluster size, as will be discussed later.
When a slave node receives a new task, it sets up the environment and initializes the task, and then launches a new process to run it. Although all parts of a computer system, such as CPU, memory, disk and network interface, are involved in the execution of tasks, we consider the node as a black box to simplify the problem. We will later validate this simplification using measured data. Usually slave nodes are managed by a modern time-sharing operating system, Linux in our case. There are two types of tasks running on slaves, as mentioned before, and therefore a multiclass processor sharing model is a reasonable fit. Theorem 3 gives us the precise equations, which will be used later. Although Equations 5 do not necessarily imply the performance curve is linear as in Figure 7, we will later show that in our system a small modification validates this model.
[Figure 9. Queueing model of a slave node: task initialization as a delay center, followed by task execution as a 3-class PS queue (Nm maps, Ns shuffles, Nr reduces)]
The basic model for a slave node is shown in Figure 9. After a new task enters a slave node, it is first initialized in the delay center at the left of the figure, and then moves to the major part of the model, a two-class processor sharing queue at the right of the figure. Theorem 3 considers only the ideal open multiclass queue, but we can modify Equation 5 using the following observation: Bk0 + Σ_{j=1}^{K} Tj λj Bkj is the time spent at the processor sharing queue, and if we add another constant Bck to represent the initialization time spent at the delay center in Figure 9, we get the real response time in our case. To sum up, the equations for this model are:

Tm = Bm0 + Tm λm Bmm + Tr λr Bmr + Bcm
Tr = Br0 + Tm λm Brm + Tr λr Brr + Bcr    (9)

where Bij is defined as in Theorem 3, Tm and Tr are the average execution times for mappers and reducers respectively, and Bcm and Bcr are the initialization constants for mappers and reducers respectively.
Because the cluster is not balanced, we are primarily interested in the performance of the
bottleneck node. We measure the load on each node and calculate p, the percentage of work on the
bottleneck node. Using Little’s Law, we have the equation for arrival rates at the bottleneck node:
λm = pXHm
λr = pXHr    (10)
According to the definitions of service rate and service time, we have the following equations:

µm = 1/Sm
µr = 1/Sr    (11)
If we substitute these two into the definitions of σi and Bkj in Theorem 3, we have nicer formulas:

σm = Σ_{k} λk/(µm + µk) = pX( Hm/(1/Sm + 1/Sm) + Hr/(1/Sm + 1/Sr) ) = pX( SmHm/2 + SmSrHr/(Sm + Sr) )
σr = Σ_{k} λk/(µr + µk) = pX( Hm/(1/Sr + 1/Sm) + Hr/(1/Sr + 1/Sr) ) = pX( SrSmHm/(Sr + Sm) + SrHr/2 )    (12)

Bm0 = Sm/(1 − σm)
Bmm = (Sm/2) · 1/(1 − σm)
Bmr = (SmSr/(Sm + Sr)) · 1/(1 − σm)
Br0 = Sr/(1 − σr)
Brm = (SrSm/(Sr + Sm)) · 1/(1 − σr)
Brr = (Sr/2) · 1/(1 − σr)    (13)
And if we substitute these equations into Equation 9, we have the improved equations:

Tm = Bm0 + pX(Bmm Hm Tm + Bmr Hr Tr) + Bcm
Tr = Br0 + pX(Brm Hm Tm + Brr Hr Tr) + Bcr    (14)

Parameters such as the throughput X, the numbers of tasks per job Hm and Hr, the system imbalance factor p, and the task response times Tm and Tr are available through system monitoring, leaving four unknown variables in Equation 14: Sm, Sr, Bcm and Bcr.
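For concreteness, once Sm, Sr, Bcm, Bcr, p, X, Hm and Hr are fixed, Equation 14 is a 2×2 linear system in Tm and Tr. The sketch below is illustrative only; all numeric values are assumptions rather than measurements from our cluster:

```python
import numpy as np

def solve_task_times(Sm, Sr, Bcm, Bcr, p, X, Hm, Hr):
    """Solve Equation 14 for the average mapper and reducer response
    times (Tm, Tr) on the bottleneck node. Names follow Table 1."""
    lam_m, lam_r = p * X * Hm, p * X * Hr                      # Equation 10
    if lam_m * Sm + lam_r * Sr >= 1:
        raise ValueError("bottleneck node is saturated")
    sig_m = p * X * (Sm * Hm / 2 + Sm * Sr * Hr / (Sm + Sr))   # Equation 12
    sig_r = p * X * (Sr * Sm * Hm / (Sr + Sm) + Sr * Hr / 2)
    Bm0, Bmm = Sm / (1 - sig_m), Sm / 2 / (1 - sig_m)          # Equation 13
    Bmr = Sm * Sr / (Sm + Sr) / (1 - sig_m)
    Br0, Brr = Sr / (1 - sig_r), Sr / 2 / (1 - sig_r)
    Brm = Sr * Sm / (Sr + Sm) / (1 - sig_r)
    # Equation 14 rearranged into a 2x2 linear system A t = b
    A = np.array([[1 - p * X * Bmm * Hm, -p * X * Bmr * Hr],
                  [-p * X * Brm * Hm, 1 - p * X * Brr * Hr]])
    b = np.array([Bm0 + Bcm, Br0 + Bcr])
    Tm, Tr = np.linalg.solve(A, b)
    return Tm, Tr

# Assumed values: 10s/30s service times, light load on the bottleneck node
print(solve_task_times(Sm=10, Sr=30, Bcm=2, Bcr=3,
                       p=0.1, X=0.01, Hm=20, Hr=5))
```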
4.4.2 The Random Behavior

Average performance of individual tasks does not necessarily imply the average performance of their jobs, because of the synchronization steps we mentioned earlier. An example of histograms of response times is plotted in Figure 10.
[Figure 10. Histograms of tasks' response times: (a) mappers, (b) reducers; x-axis is time (s), y-axis is the number of tasks]
The response time of a task is a random variable, so the average response time of a job is the expectation of the maximum of two sets of random variables, one set for the mappers and the other for the reducers. In statistics there is no closed-form formula for this expectation. As a result, we measure the difference between the total time of a job and the average time of its tasks, which we call T∆. T∆ can be seen as an indicator of the system's randomness, and an example is plotted in Figure 11.
[Figure 11. Example of the randomness T∆: T∆ (s) plotted against concurrency]
We do not need an exact equation for the randomness factor for it to be useful; an approximate curve is enough for many analyses. Through simple data analysis we found its regular pattern, which is plotted in Figure 12. The T∆ curve starts from a horizontal line y = c (c is a constant between 10 and 15 in our system), and approaches an asymptote y = x/Xmax − Tmax, where Xmax and Tmax are the throughput and job response time when the system is saturated (i.e., C = K).
[Figure 12. Regular pattern for T∆: a horizontal line y = c at small concurrency, approaching the asymptote y = x/Xmax − Tmax]
We obtain this curve by applying a shear mapping [32] to an inversely proportional function y = a/x, using the transformation matrix

[ 1  −Xmax ]
[ 0    1   ]

followed by a translation by the vector (XmaxTmax, c). This procedure is illustrated in Figure 13.
[Figure 13. Transformation procedure to obtain the equation for T∆: the curve y = a/x is sheared and then translated]
The final equation for the randomness is Equation 15,

T∆ = (1/(2Xmax))(C − XmaxTmax) + (1/2)√( (C − XmaxTmax)²/Xmax² + 4a/Xmax ) + c    (15)
where C is the number of concurrent jobs in the system, Xmax and Tmax can be calculated using the model introduced in the previous subsection, and c is a constant depending on the specific system configuration, typically 10 to 15.

According to Little's Law, K = XmaxTmax, where K is the maximum number of concurrently running jobs from Table 1, and C = XT, where X is the throughput and T is the response time. Therefore Equation 15 can be transformed into Equation 16, which is the final equation for T∆:
T∆ = (1/(2Xmax))(XT − K) + (1/2)√( (XT − K)²/Xmax² + 4a/Xmax ) + c    (16)

4.4.3 The Waiting Time
The maximum number of concurrent jobs K is a configurable system parameter, and as long as the workload has not hit this maximum, the system behaves like the aforementioned models for the average task performance and the random behavior. The waiting time is then 0, because new incoming jobs can run immediately. When there are more than K jobs in the system, new incoming jobs have to wait for free slots. According to Little's Law, the throughput is now K/(Tm + Tr + T∆), so a job slot becomes free every (Tm + Tr + T∆)/K seconds on average. If there are in total C jobs in the system ahead of a new job, then C − K jobs are waiting, and the waiting time for that new job is therefore (C − K)(Tm + Tr + T∆)/K. In total, the waiting time Tw is defined by the following equation:
Tw = 0                            if C < K
Tw = (C − K)(Tm + Tr + T∆)/K      if C ≥ K    (17)

4.5 The Global Model
Now that we have models for each part, it is easy to calculate the response time of a job: the sum of the times for the map and reduce phases, their random behavior, and the waiting time. Mathematically, the following equation shows this relation:

T = Tm + Tr + T∆ + Tw.    (18)

This equation, together with Equation 14, Equation 16 and Equation 17, is the performance model for a closed-system MapReduce. Finding an explicit closed formula for T using parameters such as Sm, Sr, etc. is possible, but the formula is too long and too complex to be useful. Later on we will present methods for using this implicit formula for optimization.
If Little's Law X = C/T is substituted into Equation 18, we get Equation 19 for the throughput X of a closed system with C concurrent jobs:

X = C/(Tm + Tr + T∆ + Tw)    (19)
If the system is not overloaded, which means Tw = 0, the equation can be transformed into

X = C/(Tm + Tr + T∆),    (20)

and it is now possible to substitute this equation into Equation 14 and Equation 16 to get Equation 21 and Equation 22, where Tmax is the response time at the maximum concurrency K, which is shown in Equation 23 and can also be computed with Equation 21.
Tm = Bm0 + p · [C/(Tm + Tr + T∆)] · (Bmm Hm Tm + Bmr Hr Tr) + Bcm
Tr = Br0 + p · [C/(Tm + Tr + T∆)] · (Brm Hm Tm + Brr Hr Tr) + Bcr    (21)

T∆ = (Tmax/(2C))(C − K) + (1/2)√( (C − K)²Tmax²/C² + 4aTmax/C ) + c    (22)

Tmax = Tm + Tr + T∆, when C = K    (23)
These equations, together with Equation 17 and Equation 19, constitute the performance model for a closed-system MapReduce.

The maximum number of supported concurrent jobs is a configurable parameter, which appears as the changing point in Figure 7 and Figure 8. Intuitively, if the total number of jobs in the system increases, the average number of concurrent tasks on each node also increases, and because of the processor sharing property, the time for each task also increases. After the total number of jobs hits the maximum, the number of concurrent tasks on each node remains constant, and so does the execution time of each task. This is how mappers and reducers behave in Figure 7 and Figure 8.
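The closed-system model can be evaluated numerically by fixed-point iteration on Equations 21–23. The sketch below is a minimal illustration rather than the solver used for our experiments: the parameter values are assumptions, the iteration is damped, and we simply assume it converges in the unsaturated region for the chosen values:

```python
import math

def ps_coeffs(Sm, Sr, Hm, Hr, p, X):
    """Equations 12 and 13: utilizations and B coefficients."""
    sig_m = p * X * (Sm * Hm / 2 + Sm * Sr * Hr / (Sm + Sr))
    sig_r = p * X * (Sr * Sm * Hm / (Sr + Sm) + Sr * Hr / 2)
    if sig_m >= 1 or sig_r >= 1:
        raise ValueError("bottleneck node is saturated")
    return (Sm / (1 - sig_m), Sm / 2 / (1 - sig_m),
            Sm * Sr / (Sm + Sr) / (1 - sig_m),
            Sr / (1 - sig_r), Sr * Sm / (Sr + Sm) / (1 - sig_r),
            Sr / 2 / (1 - sig_r))

def solve_level(C, K, Tmax, prm, iters=5000, damp=0.5):
    """Damped fixed-point iteration on Equations 21 and 22; when Tmax
    is None this is the C = K case that defines Tmax (Equation 23)."""
    Sm, Sr, p, Hm, Hr = prm["Sm"], prm["Sr"], prm["p"], prm["Hm"], prm["Hr"]
    # deliberately pessimistic start: keeps X small, the node unsaturated
    Tm, Tr, Td = 10 * Sm, 10 * Sr, prm["c"]
    for _ in range(iters):
        X = C / (Tm + Tr + Td)                       # Little's Law, Eq. 20
        Bm0, Bmm, Bmr, Br0, Brm, Brr = ps_coeffs(Sm, Sr, Hm, Hr, p, X)
        Tm_n = Bm0 + p * X * (Bmm * Hm * Tm + Bmr * Hr * Tr) + prm["Bcm"]
        Tr_n = Br0 + p * X * (Brm * Hm * Tm + Brr * Hr * Tr) + prm["Bcr"]
        if Tmax is None:                             # Equation 22 at C = K
            Td_n = 0.5 * math.sqrt(4 * prm["a"] * (Tm_n + Tr_n + Td) / K) + prm["c"]
        else:                                        # Equation 22
            Td_n = (Tmax / (2 * C)) * (C - K) + 0.5 * math.sqrt(
                (C - K) ** 2 * Tmax ** 2 / C ** 2 + 4 * prm["a"] * Tmax / C) + prm["c"]
        Tm = damp * Tm + (1 - damp) * Tm_n
        Tr = damp * Tr + (1 - damp) * Tr_n
        Td = damp * Td + (1 - damp) * Td_n
    return Tm + Tr + Td

def response_time(C, K, prm):
    Tmax = solve_level(K, K, None, prm)              # Equation 23
    return Tmax if C == K else solve_level(C, K, Tmax, prm)

# All parameter values below are assumptions for illustration only
base_params = dict(Sm=10, Sr=30, Bcm=2, Bcr=3, p=0.03, Hm=5, Hr=2, a=50, c=12)
print(response_time(10, 20, base_params))
```

The printed value is only meaningful relative to the assumed parameters; with fitted parameters the same routine would produce the model curves compared against measurements in Section 5.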
Our model needs parameters such as Sm, Sr, Bcm and Bcr to be complete. The exact values and equations for these parameters depend on the characteristics of the specific workload and on the software and hardware specifications, meaning it is unnecessary, and perhaps impossible, to find a universal formula that works for all cases. Users of this performance model can plug in their own sub-model for their own workload to get the final detailed model. For this thesis, we demonstrate the correctness of this model framework by finding the right numerical values for these unknown variables using a Genetic Algorithm [19], a heuristic search technique. Different kinds of data, including the times for the different mapper and reducer tasks, the response times of jobs, the cluster imbalance factor p, etc., are measured, and the genetic algorithm tries many combinations of unknown parameter values using built-in heuristics, for example changing a parameter randomly by a small increment and then examining how close the result is to the measured data, until the distance between measured data and calculated results is small enough. Because the ranges of possible parameter values are usually limited, for example, Sm is usually larger than 5 seconds and smaller than 5 minutes, this heuristic search is quite fast. Specific results are given in the next section.
5 Model Validation

In order to validate the proposed model for MapReduce, we ran a large number of experiments covering the representative workloads: database query, random number generation and sorting. We organize this section accordingly into three subsections, each of which presents the measured data and the results calculated with our model.
In all these experiments, the maximum number of tasks per node is increased from the default value of 4 to 20 so that the queueing behavior is more pronounced. All other parameters are kept at their default values, and the parameters most closely related to system performance are shown in Table 2. Although there are 72 nodes in our cluster, it is shared among many users and our experiments can easily be affected by theirs, so we have to be careful about which nodes to choose. In the end we were able to use 16 physical nodes to complete all these experiments. More nodes will be used in parts of Section 6, along with the impact of several key parameters. The sub-model for the waiting time in Equation 17 is easy to understand and is confirmed by the measured data. Therefore this section focuses on the average task performance and its randomness, and the experiments' concurrency C is kept smaller than the maximum job parameter K. The average values of multiple runs are used for better accuracy. Although only three data sets and their corresponding model calculations are presented, other data sets show similar results. These results suggest that the black-box simplification in the previous section is reasonable.
Parameter name                                            Default value
Block size                                                64MB
HDFS replication number                                   3
Heartbeat interval                                        3 seconds
The number of internal concurrent sortings                10
Buffer size for sorting                                   1MB for each merge sorting
The maximum number of attempts for failed tasks           4
The limit on the input size of reduce                     No limit set
Task execution heap limit                                 200MB
The size of the in-memory buffer for shuffle              140MB
The number of tasks per Java Virtual Machine              1
The number of worker threads for map's output fetching    40
Compression used for map's output                         Not used

Table 2. System default values
5.1 Database Query

The first set of experiments is database queries using the SQL query in Listing 1 in Section 3.2. Each SQL query generates a MapReduce job, in which mappers scan the input tables to find database rows that satisfy the query, and reducers combine these rows into the final results. In detail, there are two kinds of mappers, one for each table. Mappers for table user search for the rows that have the specified user name, and generate intermediate pairs that use the user id as key. Mappers for table photo transform all rows into intermediate pairs that have the user id as key and the photo id as value. After the shuffle phase, reducers scan these pairs, and if one key carries items from both tables, its photo ids are part of the results and are written into the result file. Therefore, the MapReduce job from this database query has large inputs (two tables) but small output (potentially hundreds of lines).
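As an illustration of this flow — not the actual Hadoop job used in the experiments — the join can be sketched as plain Python generator functions; the field names u_name, u_id, p_uid and p_id are our own labels for the columns of Listing 1:

```python
def map_user(row, target_name):
    # table `user`: emit (u_id, marker) only for rows with the queried name
    if row["u_name"] == target_name:
        yield row["u_id"], ("user", None)

def map_photo(row):
    # table `photo`: emit (p_uid, p_id) for every row
    yield row["p_uid"], ("photo", row["p_id"])

def reduce_join(user_id, values):
    # a photo id is a result only if the same key also carries a user marker
    values = list(values)
    if any(tag == "user" for tag, _ in values):
        for tag, p_id in values:
            if tag == "photo":
                yield p_id
```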
When running database queries, data such as the throughput X, the total response time T, the mapper time Tm, the reducer time Tr and the cluster imbalance factor p are all measured. These data can be used to compute the parameters for the model in Equation 19, Equation 21 and Equation 22, which can then be used to compute the times for each part of the model. If the computed results are close to the measured times, then the model is able to express the performance of MapReduce for this database query workload.
The average times for mappers and reducers are shown in Figure 14, including measured data
from experiments and calculated results from the model. The measured randomness factor ∆ and
calculated ∆ are shown in Figure 15. The measured response time T and calculated response time
are plotted in Figure 16. Small circles in these figures are the average values from multiple experiments, and error bars show the maximum and minimum measured data from these experiments.
These figures confirm that our model is able to accurately describe the performance of database
query.
The horizontal axis of these figures ranges from 10 to 60, because this is the range we are interested in. When concurrency is smaller than 10, meaning fewer than 10 jobs run concurrently, the queueing in the system is not obvious and the performance increases almost linearly. When concurrency is larger than 60, the system is full and the extra jobs have to wait outside of the system. The most difficult part of our model is this middle range, so only data in this range are plotted.
[Figure 14. Measured time vs. calculated time for database query: (a) mappers, (b) reducers]
[Figure 15. Measured ∆ vs. calculated ∆ for database query]
[Figure 16. Measured response time vs. calculated response time for database query]
In Figure 14 we can see that the running times for mappers are comparable to the times for reducers: although the work for an individual reducer is larger than for an individual mapper, they are not far apart, because theoretically most operations of a database query job are located inside the mappers. In Figure 15 we can see that the randomness increases as the concurrency increases, which is reasonable: more concurrent jobs mean more concurrently running tasks competing with each other for resources, which increases the randomness.
5.2 Random Number Generation

The second type of experiment is random number generation. In each MapReduce job of this experiment, mappers are in charge of generating numbers and reducers are in charge of writing them into files. In detail, each map operation generates two random numbers. One of them is a very small integer acting as the key of the intermediate pair, and the other is a 100-byte integer acting as the value. After the shuffle phase, reducers abandon the keys of the intermediate pairs and write their values into the output file. As a result, random number generation is a kind of workload that has small input (at most a few integer random seeds) and large output (all the random numbers).
The procedure of validation using the random number generation workload is similar to the validation using the database query workload. Different times and values are measured during the execution of the experiments, and then used to compute the parameters for the model in Equation 19, Equation 21 and Equation 22. Finally, results calculated using these measured parameters are compared with the measured times, which gives the accuracy of our model for the random number generation workload.
The results for random number generation look like the results for database query in the previous subsection. The comparison of measured time and calculated time is shown in Figure 17, and
the comparison of measured ∆ and calculated ∆ is shown in Figure 18. Figure 19 shows the measured and calculated response time. Again, the measured data and calculated results are very close,
indicating our model is able to capture the performance of random number generation workload.
The concurrency range in these figures is from 5 to 30, which is different from the previous subsection on database query. The reason for choosing this range is the same as for database query; the specific values differ because the work for one random number job is larger than for a database query while the system's capacity remains the same, so the supported concurrency becomes smaller.
[Figure 17. Measured time vs. calculated time for random number generation: (a) mappers, (b) reducers]
Different from the previous subsection on database query, Figure 17 shows that the running times of reducers are more than ten times larger than the mappers' running times. In random number generation the work for reducers is much larger than the mappers' work, because reducers have to read intermediate results from remote mappers and write the output file to remote machines in the underlying distributed file system. The randomness in Figure 18 increases faster than linearly as concurrency increases, meaning more concurrent jobs may bring extra waste in the random behavior.
[Figure 18. Measured ∆ vs. calculated ∆ for random number generation]
[Figure 19. Measured response time vs. calculated response time for random number generation]
5.3 Sorting

The last kind of experiment is the sorting of the random numbers generated by the second workload. In its MapReduce job, mappers read the input random number file and forward the random numbers as intermediate pairs. These pairs are sorted by the MapReduce framework, and reducers write what they get from the framework directly into the output file. It is clear that the input and output of sorting are the same size.

A similar validation procedure is carried out for the sorting experiments, and produces similar results. The figures for mappers and reducers are plotted in Figure 20, the figure for ∆ is plotted in Figure 21, and the measured and calculated response times are plotted in Figure 22.

Again the range of concurrency in these figures is different from the previous two, because the work for each sorting job is even larger than for random number generation. The reducers' running times in Figure 20 are still larger than the mappers' times, but not as different as in Figure 17, because here mappers have to read large input files, which decreases the gap between mappers and reducers.
[Figure 20. Measured time vs. calculated time for sorting: (a) mappers, (b) reducers]
The results from these three experiments show that our model works for these three workloads. Because most MapReduce programs are I/O bound and these workloads are categorized according to their I/O demands, our model should be able to describe the performance characteristics of a variety of different MapReduce applications.
[Figure 21. Measured ∆ vs. calculated ∆ for sorting]
[Figure 22. Measured response time vs. calculated response time for sorting]
6 Model Application

After analyzing MapReduce through experiments and the proposed model, several opportunities can be identified to improve parameter settings and system design. We will show three key areas where optimizations are possible. We use the proposed model with the Gradient Descent [31] technique to optimize some important parameters, namely the number of reducers per job, the block size of the underlying distributed file system, and the size of the cluster.

Note that the methods of applying our model are more important than the quantitative results. The conclusions of this section depend on the type of workload, and may change if the workload changes. Users need to apply this methodology to their own systems and workloads.
6.1 Procedures of Optimization using Gradient Descent

If a function F(x) is differentiable near a point x = a, then the value of F(x) decreases fastest when x moves in the opposite direction of the gradient of F(x) at a. This first-order optimization algorithm is called gradient descent. It works in spaces with a large number of dimensions, and does not require an explicit closed formula for F.
Here is an example of gradient descent. The function under consideration is

F(x) = x⁴ − 3x³ + 2,    (24)

and the minimum value of F can be located by first calculating its derivative:

F′(x) = 4x³ − 9x².    (25)
If the optimization is started from x = 5, we first calculate the derivative using Equation 25, which is positive. In the next step we therefore decrease x to a smaller value, x = 3, where the derivative is still positive. The value of x is then further decreased to 2, and now the derivative becomes negative, so x is increased in the next step. This iteration continues until the derivative's change is smaller than a certain limit, at which point a satisfactory minimum is located. If the increment is large, the search quickly covers a broader space but may lose precision; if the increment is small, the precision improves but more steps may be needed when starting far from the optimal point. This procedure is shown in Figure 23.
[Figure 23. An example of gradient descent on F(x) = x⁴ − 3x³ + 2]
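This iteration is straightforward to implement; in the sketch below the step size and stopping tolerance are our own illustrative choices:

```python
def gradient_descent(f_prime, x0, step=0.005, tol=1e-9, max_iter=100_000):
    # move x against the derivative until the iterates barely change
    x = x0
    for _ in range(max_iter):
        x_new = x - step * f_prime(x)
        if abs(x_new - x) < tol:
            break
        x = x_new
    return x

f_prime = lambda x: 4 * x ** 3 - 9 * x ** 2    # Equation 25
print(gradient_descent(f_prime, x0=5.0))       # converges to 9/4 = 2.25
```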
Back to our model, the optimization objective is to minimize the response time T. Maximizing the throughput X follows a similar process if Little's Law from Theorem 1 is used. There is a solution for the global model in Equation 18, but it is too complex to analyze symbolically; mathematical software such as Matlab, however, can quickly calculate numerical solutions once the unknown parameters are fixed. Therefore gradient descent can be used to search for the optimal values of some parameters, given that other parameters are fixed, for example using measured data from experiments on that particular workload.

In order to use gradient descent, we first derive the equation for T from Equations 14, 16 and 18 using parameters such as Sm, Hm, Sr, Hr, Bcm, Bcr and p; it is too complex and therefore omitted from this thesis. For example, if the optimization is for Hr, we set the other parameters to measured data, calculate the derivative of T with respect to Hr, and increase or decrease the value of Hr according to the sign of that derivative, as shown in the example in Figure 23. We are only interested in systems without deterministic waiting, meaning the system is not overloaded and Tw is 0.
We first consider the equation for T∆ in Equation 16. Tmax is the response time when the concurrency C equals K, and taking these conditions into Equation 14 and Equation 16, we have the following Equation 26:

Tm max = Bm0 + p · [K/(Tm max + Tr max + T∆ max)] · (Bmm Hm Tm max + Bmr Hr Tr max) + Bcm
Tr max = Br0 + p · [K/(Tm max + Tr max + T∆ max)] · (Brm Hm Tm max + Brr Hr Tr max) + Bcr
T∆ max = (1/2)√( 4a(Tm max + Tr max + T∆ max)/K ) + c    (26)
where Tm max, Tr max and T∆ max are the times for the mapper, the reducer and ∆ respectively when the concurrency is at the maximum K. This equation can be solved for Tm max, Tr max and T∆ max, and after substituting these into Equation 16, we have an equation for T∆ that consists of parameters such as Sm, Hm, Sr, Hr, Bcm, Bcr and p, and can be differentiated.

Now we can substitute this equation for T∆ into Equation 14 and solve that system of equations to get the formulas for Tm and Tr. We therefore have the complete formula for T in Equation 18. To sum up, T can now be expressed using Sm, Hm, Sr, Hr, Bcm, Bcr and p. Then we can proceed with the optimizations using gradient descent against this expression for T.
There are some extra relationships between these parameters. By the nature of MapReduce, all jobs are split into records, for example strings of English words. If the input and output are large enough, we can assume that the amount of computing work for each input and output block is approximately the same. Therefore the total work for all mappers and all reducers is approximately proportional to the size of the work, which means

SmHm = gmDi
SrHr = grDo    (27)

where gm and gr are constants, and SmHm and SrHr are usually called service demands.
The input size of each mapper is, by default, the size of a block, therefore we have the following relation:

Hm = Di/b    (28)

Substituting this into Equation 27 gives

Sm = gmb    (29)

where b is the block size. Equation 29 is intuitive: the larger the block size b, the larger the input for each mapper, and the longer it takes a mapper to finish.
However, these relations are not always true. For example, the number of mappers can be explicitly set to an arbitrary number, regardless of the block size. And if the numbers of mappers and reducers are very large, the chance of failure may become non-negligible, which our model does not capture. Still, these relationships are used here as a demonstration of the most common cases.
6.2 Optimal Number of Reducers Per Job

The first optimization procedure we would like to demonstrate is for the number of reducers per job. Some guideline articles such as [30] suggest setting the number of reducers to as large a value as possible, while others argue that the number should equal the number of machines used. Here we study this issue from a theoretical perspective.
First of all, 32 nodes are used in the sorting experiments, and each sorting job has 5 reducers at the beginning. We use this as the baseline to find the optimal number of reducers per job so that the response time T is minimized. In each step we calculate the parameter values using the method from the validation Section 5, and then compute the gradient dT/dHr. If this gradient is negative, the number of reducers is increased by a small value: in our case 5 if the gradient is large, to search more carefully, and 10 if the gradient is small, to search more effectively. A larger gradient means the response time T is more sensitive to a change in the number of reducers Hr. If the gradient is positive, the number of reducers is decreased by a small value. The experiment is rerun with each new reducer setting until the response time T stabilizes, meaning that T approaches a limit and does not change much as further optimization steps are taken.
According to Equation 27, Sr can be expressed as

Sr = grDo/Hr    (30)

and after substituting this into Equations 12, 13 and 14, we get the equations that can be directly used in the optimization procedure.
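Each optimization step can be sketched in code as a central finite difference on the model's predicted T, with Equation 30 tying Sr to Hr. The sketch reuses response_time and base_params from the sketch in Section 4.5; gr·Do and all thresholds are assumed values, and in the real procedure each step is an experiment rerun rather than a model evaluation:

```python
def grad_T_wrt_Hr(Hr, gr_Do, base, C, K, eps=0.5):
    """Central finite difference for dT/dHr on the model's prediction."""
    def T(h):
        prm = dict(base, Hr=h, Sr=gr_Do / h)   # Equation 30: Sr = gr*Do/Hr
        return response_time(C, K, prm)
    return (T(Hr + eps) - T(Hr - eps)) / (2 * eps)

Hr, gr_Do = 5.0, 150.0                         # assumed starting point
for step in range(5):
    try:
        g = grad_T_wrt_Hr(Hr, gr_Do, base_params, C=10, K=20)
    except ValueError:
        print("model saturated; stop searching")
        break
    # rule of thumb from the text: small steps when |g| is large, big when small
    delta = 5 if abs(g) > 1 else 10
    Hr = max(1.0, Hr - delta if g > 0 else Hr + delta)
    print(step, Hr, g)
```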
The process of the optimization is shown below. The decrease in response time is plotted in Figure 24, and the gradient at each corresponding step is plotted in Figure 25. As we can see, the response time approaches a minimum value as the number of reducers increases, while the gradients approach 0, meaning that the response time stabilizes.

It is too compute-intensive to run experiments for arbitrary Hr, but if we assume that parameters other than Hr remain constant, we can numerically calculate the gradients for very large Hr. The trend of the gradient is shown in Figure 26.
[Figure 24. The process of gradient descent for the optimization of Hr for sorting: the response time falls from about 175s at Hr = 5 to about 150s at Hr = 30]
It should be pointed out that, although the calculation shows that the gradient for Hr is always negative, too many reducers should be avoided, because overheads that could previously be neglected become more and more significant, making our model inaccurate. Furthermore, more reducers may consume more energy and increase the chance of failures. As a result, a threshold ε can be introduced, and the number of reducers increased only when the magnitude of the gradient exceeds this threshold. This threshold depends on the specific system and its workload, and should be decided case by case.
[Figure 25. The gradients for the optimization of Hr for sorting]
[Figure 26. The trend of gradients for the optimization of Hr for sorting]
Similarly, we run this optimization of the number of reducers Hr for the database query workload, but the results are different from sorting. The process of gradient descent is shown in Figure 27 and Figure 28, from which we can see that the optimal Hr value is 1, unlike sorting. The difference is caused by the difference in their workloads: the reducer service time Sr in database query is much smaller than in sorting, so the reducer overhead in database query is relatively more significant. The large ratio of overhead to service time Sr limits the parallelism, and makes database query suffer from a large number of reducers Hr.
[Figure 27. The process of gradient descent for the optimization of Hr for database query: the response time falls from about 130s at Hr = 7 to about 70s at Hr = 1]
These results show that there is no universal rule for setting the number of reducers Hr. Neither of the previously mentioned guidelines is correct for all MapReduce jobs. If the service demand for reducers SrHr is small, a larger Hr incurs more overhead and therefore drags down system performance; if the service demand is large, the system can hardly benefit from parallelism if Hr is too small. Decisions need to be made case by case.
[Figure 28. The gradients for the optimization of Hr for database query]
6.3 Optimal Block Size

The block size is an important parameter in HDFS and MapReduce. The experiment with which Yahoo won the Terasort benchmark set the block size to 512MB instead of the default value of 64MB [21], but no explanation was provided. This default value originally came from paper [13], where no reason was given either. One reason we can think of is that a larger block size generates smaller total overhead. But a larger block size also decreases parallelism and therefore prolongs execution time. What is more, the execution of the Terasort benchmark is usually isolated from other workloads to ensure faster speed, and is possibly not an ideal indicator for a real system shared among many users. As a result, we try to find the optimal block size to minimize the average response time for several concurrent sortings.
The relationship between the mapper service time Sm and the block size b comes from Equation 29:

Sm = gmb.    (31)

Substituting this equation into Equation 18, Equation 21 and Equation 22 of our model, we can use the methods introduced in the previous subsections to get the derivative of T with respect to b, and start gradient descent.
The procedure of this optimization is similar to the previous subsection. We use 32 nodes to run concurrent sortings that have a fixed number of reducers. The first experiment, which uses a 512MB block size, serves as the baseline. Then gradient descent is carried out to find the optimal block size, which is shown in Figure 29 and Figure 30. At step 6 the block size b is 16MB and the gradient becomes a small negative number, so at step 7 the block size is increased a little to 24MB, and the gradient becomes a small positive number.
[Figure 29. The process of gradient descent for the optimization of block size for sorting: the response time falls from about 140s at b = 512MB to a minimum around b = 24MB]
Distributed systems like MapReduce are nondeterministic, and their randomness implies that one cannot determine the optimum precisely by iterating indefinitely. The performance difference between block size b = 16MB and b = 32MB is already very small (approximately 1%), so iterating for more steps would not give much higher precision. Just like the previous subsection on the optimal number of reducers, we numerically calculate the gradient for block sizes between 16MB and 32MB, which is shown in Figure 31.
The same optimization is also carried out for the database query workload, but the results are different. The response time and gradient at each step of the optimization are plotted in Figure 32 and Figure 33 respectively, which show that the optimal block size for database query is around 192MB. The reason for this difference is that the service requirement of sorting's mappers is larger than database query's. Even though a smaller block size generates more mappers and incurs larger overhead, the impact of this overhead on sorting is not as obvious as on database query. Therefore, sorting is able to benefit from the larger parallelism of a smaller block size, while database query would instead suffer from it.
[Figure 30. The gradients for the optimization of block size for sorting]
[Figure 31. The trend of gradients for the optimization of block size for sorting]
[Figure 32. The process of gradient descent for the optimization of block size for database query: the response time falls from about 200s at b = 16MB to a minimum around b = 192MB]
The results here suggest that the block size engineers have been using is not always optimal; a smaller block size is good for concurrent sorting, and a relatively larger block size is good for the database query in Listing 1 in Section 3. However, the conclusion is not meant to be universally applicable. Real-world systems are not designed just for concurrent sortings or database queries, and other user applications may require different block sizes. Furthermore, a smaller block size generates more blocks for the same file, and therefore increases the chance of failure as well as the management overhead on the master node. A larger block size increases the response time of individual mapper and reducer tasks. Because the block size b is a parameter of the underlying distributed file system shared by all MapReduce jobs and other applications, trade-offs need to be made to get the overall optimal block size.
6.4 Optimal Cluster Size

One of MapReduce's pronounced advantages is its scalability, which means adding more machines gives better overall performance. It is natural that more machines provide more computing power, but this does not necessarily increase the overall system speed linearly.
[Figure 33. The gradients for the optimization of block size for database query]
Moreover, more machines mean a larger procurement budget and higher operating costs, especially in a cloud computing environment. For example, it is not economical to double the cluster just to get a few percent of performance enhancement. In this subsection we try to find the point at which MapReduce's scalability starts to deteriorate seriously, which is important information when setting up a system.
The model in Equation 14 contains the imbalance factor p. Ideally, in a balanced system all nodes are the same and p = 1/N. If we assume our system is const_p times worse than the ideal case, then p = const_p/N. After substituting this relation into the overall model in Equation 18, we have the equation for T, and can start the optimization as in the previous subsections.
The database query workload is used for this optimization, because its performance degradation is clearer than the other two. The results are shown in Figure 34 and Figure 35. These two figures show that beyond a certain range, a larger cluster does not provide a desirable performance increase. This fast performance degradation is probably caused by the database's constraints. In databases, data are supposed to be stored only once (in our case, only one file replicated 3 times) to ensure strong consistency, but this also decreases the possible parallelism. When more machines are added, they need to fight for the only input file, which limits the performance gain. One solution is to store multiple copies of the database, or at least of its popular portion, if the database itself or its client applications can tolerate inconsistencies.
[Figure 34. The process of gradient descent for cluster size for database query: the response time falls from about 120s at N = 4 to about 50s at N = 48]
[Figure 35. The gradients for the optimization of cluster size for database query]
However, the scalability of the sorting workload is different from the scalability of database queries. Its results are shown in Figure 36 and Figure 37. Although the performance increase is not linear, it is still better than for the database query workload. From the gradient figure we know that the gradient is still away from 0, which means its scalability remains effective if the cluster size increases even further. The likely reason is that there are multiple input files for the sortings, thus avoiding the inefficiency mentioned in the last paragraph.
[Figure 36. The process of gradient descent for cluster size for sorting: the response time falls from about 220s at N = 8 to under 100s at N = 64]
6.5 Summary

The conclusions from this section are summarized in Table 3, which shows that the default parameter values are not always good choices. Different applications require different settings. For example, sorting's requirements on the number of reducers and block size are totally different from those of database queries. Similar methods can generate optimizations for other parameters.

MapReduce job writers will benefit from these conclusions. They are familiar with their own jobs, such as the requirements on CPU, disk or network. Using the methods in this section they are able to speed up their jobs by setting better parameters, such as the number of reducers.
56
0
ï1
N = 64
ï2
N = 48
N = 32
Gradient
ï3
ï4
N = 16
ï5
ï6
N =8
ï7
1
2
3
Step
4
5
Figure 37. The gradients for the optimization of cluster size for sorting
Optimization area     Workload          Conclusion
Number of reducers    Sorting           Optimal value is 30 or more
                      Database query    Optimal value is 1
Block size            Sorting           Optimal size is 24MB
                      Database query    Optimal size is 192MB
Cluster size          Sorting           Scalability remains good for a large size
                      Database query    Scalability starts to deteriorate from 16 nodes

Table 3. System optimization conclusions
Before setting up a new cluster or changing an existing one, these conclusions can be used as follows. In systems such as [27], workloads usually include both regularly scheduled jobs and ad hoc jobs. System designers should first decide the proportions of these two and which to focus on. Then their service demands, such as SmHm and SrHr, should be measured. Finally, these data are put into our model to calculate the optimal values of the block size and the cluster size. Because the block size and cluster size need to be fixed before the system starts, these values need to be set statically.
7 Conclusion and Future Work

This thesis studies the performance of MapReduce. Because of its simplicity, scalability, expressive power and outstanding failure tolerance, MapReduce is becoming a more and more popular distributed computing framework in both industry and the academic world, yet its performance issues are not fully studied. We tackled this problem by first inspecting its system design and categorizing its typical workloads, followed by experiments to get a preliminary impression of its behavior in Section 3. With this basic knowledge we proposed our model, comprising three major parts, in Section 4. The first sub-model is for average task performance, Equation 14. We focus on the bottleneck node, and use a modified multiclass processor sharing queue to capture the average response times of mappers and reducers. It has two structured equations, which can be easily solved. The second part is the random behavior of map and reduce tasks, Equation 16. Their common pattern is pointed out, and a customized equation is designed to describe this random time. The waiting time in an overloaded system was investigated as the third part of MapReduce's performance model, Equation 17, which is an intuitive equation.
This model is then validated in Section 5 against measured data from all three kinds of experiments, with measured and calculated times from each sub-model shown in Figures 14 to 22. Finally, possible application methods are demonstrated using gradient descent for three optimizations in Section 6, which provide insights on system configuration and MapReduce job design. For example, the guidelines for setting the number of reducers are not effective for all kinds of jobs, and the default block size of 64MB is not optimal for the sorting and database query workloads. Both system architects and end client programmers will benefit from our model, as shown at the end of Section 6.
Following this direction, we propose the following tasks as future work. A model of the impact of failures on performance is important and necessary. Long-running jobs are common in real systems, but failures are also commonly present. Therefore, our model will be more accurate and more widely applicable after a failure model is added.

More kinds of workloads should be identified to test the accuracy of our model, or to locate weak areas and improve it. Special scenarios may falsify some of our observations, but because of the weak assumptions used, we believe the major structure of the model is universally applicable.
The model could be made even more precise by modeling the random behavior with analytical statistical equations, instead of curve fitting through measured data. The random behavior Equation 16 is based on points common to the 3 workloads we ran, but other workloads may have different characteristics. A mathematically proven expectation equation would be an ideal solution, but is also hard to reach.

As the cluster size increases, the chance that some parts of the system fail also increases. If a MapReduce job is large and has a large number of mappers and reducers, the chance that failures occur is larger than for smaller jobs. In these cases the influence of failures on MapReduce's performance cannot be ignored. The performance model will become more complete after failure is included.

More optimization areas can be recognized. As demonstrated in the model application section, our model is able to explore many optimization opportunities effectively. More parameters can be optimized using similar methods. As our model becomes more accurate and generalized, we will have more confidence in using it to guide system design and implementation.
Bibliography
[1] E. Altman, K. Avrachenkov, and U. Ayesta. A survey on discriminatory processor sharing.
Queueing Systems, 53(1):53–63, June 2006.
[2] Amazon. Amazon Elastic Compute Cloud, 2010. http://aws.amazon.com/ec2/.
[3] Ö. Babaoğlu, L. Alvisi, A. Amoroso, R. Davoli, and L. Giachini. Paralex: An environment for parallel programming in distributed systems. In Proceedings of the 6th International Conference on Supercomputing, page 187. ACM, 1992.
[4] E. A. Brewer. Delivering high availability for Inktomi search engines. ACM SIGMOD Record,
27(2):538, June 1998.
[5] M. Burrows. The Chubby lock service for loosely-coupled distributed systems. In Proceedings of the 7th Symposium on Operating Systems Design and Implementation, pages 335–350.
USENIX Association, 2006.
[6] F. Chang, J. Dean, S. Ghemawat, W. Hsieh, D. Wallach, M. Burrows, T. Chandra, A. Fikes,
and R. Gruber. Bigtable: A distributed storage system for structured data. ACM Transactions
on Computer Systems (TOCS), 26(2):1–26, 2008.
[7] L. Cherkasova. Performance modeling in mapreduce environments: challenges and opportunities. In Proceedings of the Second Joint WOSP/SIPEW International Conference on Performance Engineering, pages 5–6, Karlsruhe, Germany, 2011. ACM.
[8] T. Condie, N. Conway, P. Alvaro, J. Hellerstein, K. Elmeleegy, and R. Sears. Mapreduce
online. In Proceedings of the 7th USENIX Conference on Networked Systems Design and
Implementation, pages 21–21. USENIX Association, 2010.
[9] J. Dean and S. Ghemawat. Mapreduce: Simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.
[10] D. DeWitt and J. Gray. Parallel database systems: the future of high performance database
systems. Communications of the ACM, 35(6):85–98, June 1992.
[11] A. Fox, S. Gribble, Y. Chawathe, E. Brewer, and P. Gauthier. Cluster-based scalable network
services. ACM SIGOPS Operating Systems Review, 31(5):78–91, 1997.
[12] E. Gabriel, G. Fagg, G. Bosilca, T. Angskun, J. Dongarra, J. Squyres, V. Sahay, P. Kambadur,
B. Barrett, and A. e. a. Lumsdaine. Open MPI: Goals, concept, and design of a next generation
MPI implementation. Recent Advances in Parallel Virtual Machine and Message Passing
Interface, pages 353–377, 2004.
[13] S. Ghemawat, H. Gobioff, and S. Leung. The Google file system. ACM SIGOPS Operating
Systems Review, 37(5):29–43, Dec. 2003.
[14] Hadoop. HBase Homepage. http://hbase.apache.org/.
[15] Hadoop. Companies PoweredBy Hadoop, 2010. http://wiki.apache.org/hadoop/PoweredBy.
[16] Hadoop. Hadoop Distributed File System, 2010. http://hadoop.apache.org/hdfs/.
[17] B. He, W. Fang, Q. Luo, N. Govindaraju, and T. Wang. Mars: a MapReduce framework
on graphics processors. In Proceedings of the 17th International Conference on Parallel
Architectures and Compilation Techniques, pages 260–269. ACM, ACM, 2008.
[18] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: distributed data-parallel
programs from sequential building blocks. In Proceedings of the 2nd ACM SIGOPS/EuroSys
European Conference on Computer Systems 2007, pages 59–72. ACM, 2007.
[19] Mathworks. Genetic Algorithm Solver - Global Optimization Toolbox for MATLAB, 2011. http://www.mathworks.com/products/global-optimization/description4.html.
[20] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig latin: a not-so-foreign
language for data processing. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pages 1099–1110, New York, New York, USA, 2008. ACM.
[21] O. O'Malley. Terabyte sort on Apache Hadoop. Yahoo, available online at http://sortbenchmark.org/Yahoo-Hadoop.pdf, pages 1–3, May 2008.
[22] R. Pike, S. Dorward, R. Griesemer, and S. Quinlan. Interpreting the data: Parallel analysis
with Sawzall. Scientific Programming, 13(4):277–298, 2005.
[23] C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, and C. Kozyrakis. Evaluating MapReduce for Multi-core and Multiprocessor Systems. 2007 IEEE 13th International Symposium
on High Performance Computer Architecture, pages 13–24, 2007.
[24] T. Sandholm and K. Lai. MapReduce optimization using regulated dynamic prioritization.
ACM Press, New York, New York, USA, June 2009.
[25] Y. C. Tay. Analytical Performance Modeling for Computer Systems, volume 2. Apr. 2010.
[26] A. Thusoo, J. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and
R. Murthy. Hive: A warehousing solution over a Map-Reduce framework. Proceedings of
the VLDB Endowment, 2(2):1626–1629, 2009.
[27] A. Thusoo, Z. Shao, S. Anthony, D. Borthakur, N. Jain, J. Sen Sarma, R. Murthy, and H. Liu.
Data warehousing and analytics infrastructure at facebook. In Proceedings of the 2010 International Conference on Management of Data, pages 1013–1020. ACM, 2010.
[28] Top500. TOP500 Supercomputing Sites, 2010. http://www.top500.org/.
[29] G. Wang, A. Butt, P. Pandey, and K. Gupta. A simulation approach to evaluating design
decisions in mapreduce setups. In IEEE International Symposium on Modeling, Analysis &
Simulation of Computer and Telecommunication Systems, 2009., pages 1–11. IEEE, 2009.
[30] T. White. 10 MapReduce Tips, 2009. http://www.cloudera.com/blog/2009/
05/10-mapreduce-tips/.
[31] Wikipedia. Gradient descent — Wikipedia, The Free Encyclopedia, 2011. http://en.wikipedia.org/w/index.php?title=Gradient_descent&oldid=432411072.
[32] Wikipedia. Shear mapping — Wikipedia, The Free Encyclopedia, 2011. http://en.wikipedia.org/w/index.php?title=Shear_mapping&oldid=425596906.
[33] H. Yang, A. Dasdan, R. Hsiao, and D. Parker. Map-reduce-merge: simplified relational
data processing on large clusters. In Proceedings of the 2007 ACM SIGMOD international
conference on Management of data, pages 1029–1040. ACM, 2007.
[34] R. Yoo, A. Romano, and C. Kozyrakis. Phoenix Rebirth: Scalable MapReduce on a NUMA
System. In Proceedings of the International Symposium on Workload Characterization
(IISWC), pages 198–207, 2009.
[35] Y. Yu, M. Isard, D. Fetterly, M. Budiu, U. Erlingsson, P. Gunda, and J. Currey. DryadLINQ: A
system for general-purpose distributed data-parallel computing using a high-level language.
In Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation, 2008.
[36] M. Zaharia, A. Konwinski, A. D. Joseph, R. Katz, and I. Stoica. Improving MapReduce
performance in heterogeneous environments. In Proceedings of the 8th USENIX Conference
on Operating systems design and implementation, OSDI’08, pages 29–42, Berkeley, 2008.
USENIX Association.