NATIONAL UNIVERSITY OF SINGAPORE
DEPARTMENT OF COMPUTER SCIENCE
A THESIS SUBMITTED
FOR THE DEGREE OF MASTER OF SCIENCE
Performance Analysis of MapReduce Computing Framework
Supervisor:
Professor TAY Yong Chiang
Author:
Hou Song
HT090459N
September 2011
Performance Analysis of MapReduce Computing Framework
Hou Song
songhou@comp.nus.edu.sg
Abstract
MapReduce is an increasingly popular distributed computing framework, especially for large scale data analysis. Although it has been adopted in many places, a theoretical analysis of its behavior is lacking. This thesis introduces an analytical model for MapReduce with three parts: the average task performance, the random behavior, and the waiting time. The model is then validated using measured data from three categories of workloads. Its usefulness is demonstrated by three optimization exercises, which yield reasonable conclusions that nevertheless differ from current understanding.
Acknowledgments
I would like to express my gratitude to my supervisor, Professor Tay Yong Chiang, for his tireless guidance and advice. Without his help I could not have completed this thesis in time. I would also like to thank my University for supporting me financially, without which I could not have carried out this work.
My parents and brother always trust me and give me continuous support. I owe them so much.
Finally, I would like to thank Shi Lei, Vo Hoang Tam and many other lab mates for their generous reviews and suggestions.
Table of Contents
Acknowledgments
List of Figures
List of Tables
Summary
1 Introduction
  1.1 Background
  1.2 Motivation for an Analytical Model
  1.3 Overview
2 Related Work
  2.1 Data Storage
    2.1.1 The Google File System
    2.1.2 Bigtable
  2.2 Data Manipulation
    2.2.1 MapReduce
    2.2.2 Dryad
3 System Description
  3.1 Architecture of Hadoop MapReduce
  3.2 Representative Workload
  3.3 Experimental Setup and Measured Results
4 Model Description
  4.1 Assumptions
  4.2 Related Theory Overview
  4.3 Notation Table
  4.4 Disassembled Sub-models
    4.4.1 The Average Task Performance
    4.4.2 The Random Behavior
    4.4.3 The Waiting Time
  4.5 The Global Model
5 Model Validation
  5.1 Database Query
  5.2 Random Number Generation
  5.3 Sorting
6 Model Application
  6.1 Procedures of Optimization using Gradient Descent
  6.2 Optimal Number of Reducers Per Job
  6.3 Optimal Block Size
  6.4 Optimal Cluster Size
  6.5 Summary
7 Conclusion and Future Work
Bibliography
List of Figures
1  Basic infrastructure of GFS
2  An example table that stores Web pages
3  Bigtable tablet representation
4  Basic structure of MapReduce framework
5  A Dryad job DAG
6  Example figure of throughput
7  Example figure of times of each phase
8  Example figure of numbers of tasks in each phase
9  Queueing model of a slave node
10 Histograms of tasks’ response time
11 Example figure of randomness T∆
12 Regular pattern for T∆
13 Transformation procedure to get equation for T∆
14 Measured time vs. calculated time for database query
15 Measured ∆ vs. calculated ∆ for database query
16 Measured response time vs. calculated response time for database query
17 Measured time vs. calculated time for random number generation
18 Measured ∆ vs. calculated ∆ for random number generation
19 Measured response time vs. calculated response time for random number generation
20 Measured time vs. calculated time for sorting
21 Measured ∆ vs. calculated ∆ for sorting
22 Measured response time vs. calculated response time for sorting
23 An example of gradient descent usage
24 The process of gradient descent for the optimization of Hr for sorting
25 The gradients for the optimization of Hr for sorting
26 The trend of gradients for the optimization of Hr for sorting
27 The process of gradient descent for the optimization of Hr for database query
28 The gradients for the optimization of Hr for database query
29 The process of gradient descent for the optimization of block size for sorting
30 The gradients for the optimization of block size for sorting
31 The trend of gradients for the optimization of block size for sorting
32 The process of gradient descent for the optimization of block size for database query
33 The gradients for the optimization of block size for database query
34 The process of gradient descent for cluster size for database query
35 The gradients for the optimization of cluster size for database query
36 The process of gradient descent for cluster size for sorting
37 The gradients for the optimization of cluster size for sorting
List of Tables
1 Symbols and notations
2 System default values
3 System optimization conclusions
Summary
The problems people are trying to solve are growing far beyond a single computer's capability, and distributed computing is an inevitable direction. For example, Internet companies use tens of thousands of machines to process an enormous number of concurrent user requests. MapReduce is a useful and popular distributed computing framework that is widely adopted in both industry and academia, because it is simple to use yet provides good scalability and high performance. However, its performance has not been fully studied. Several papers try to improve MapReduce's design and implementation through experiments or simulations, but no one has yet proposed an analytical model, which could overcome the weaknesses of the first two methods.
This thesis investigates the details of MapReduce and proposes an analytical model based on a categorization of typical workloads. The model consists of three parts: the average task performance, modelled as a modified multiclass processor sharing queue; the random behavior, modelled as a curve fitted to common observations; and the waiting time, modelled with a deterministic waiting equation. The model is then validated using measured data from all three workload categories. Finally the model is applied in three optimizations, demonstrating its usefulness in configuring MapReduce computations. The conclusions provide new insight into MapReduce's performance behavior.
1 Introduction
1.1 Background
The size of the problems people are trying to solve has been increasing for a long time, and is now well beyond what a single powerful computer can handle. For example, the Internet now has trillions of web pages and countless multimedia files, and making good use of them is a nontrivial problem. Distributed algorithms are the unavoidable direction, but due to their inherent complexity it is hard to develop useful applications efficiently [3]. Issues in distributed computing range from communication, synchronization and concurrency control to failure management and recovery, and each of these is by itself an area that needs thorough study. Therefore, working in a distributed fashion is tedious and error-prone, but still necessary.
Various frameworks and tools have been proposed to help developers. The Message Passing Interface (MPI) [12] is a successful communication protocol with a wide range of adoption. It provides convenient ways to send point-to-point or multicast messages. MPI is scalable and portable, and remains dominant for high performance computing systems that focus on raw computation, such as traditional climate simulation. However, it does not remove the difficulty that developers must resort to low level primitives to accomplish complex logic such as synchronization, failure detection and recovery, comparable to writing a sequential program in assembly language. These aspects make a distributed system hard to design and implement, and its correctness tricky to ensure. Yet many of them share common operations, which could be provided by the underlying system and thus relieve the burden on programmers.
From a broader point of view, there are two types of high performance computer systems: one for raw computation power and the other for data processing. The first type has a longer history, with its concentration on the number of calculation operations per second; the systems in the TOP500 list [28] are good examples. As people collect and generate more and more data, such as Internet web pages, photos and videos on social network sites, health records, telescope imagery and transaction logs, automatic processing of these data using large computer systems is in high demand. For example, successful data mining of transaction records from a supermarket can give the manager a better understanding of the business and its customers, and therefore lift the business to a new level. Fast and accurate processing of telescope images can lead to breakthrough scientific discoveries. Database systems are designed to manage large data, but up to 70% to 80% of online data are unstructured and may be used only a few times, and processing them with existing commercial databases is inefficient, or even infeasible [7]. New systems are being designed [4, 11], and companies such as Google, Microsoft and Amazon have turned these designs into commercial systems that operate tens of thousands of computers. The accumulated power of these low end computers makes it possible to analyse the whole Internet in a timely fashion, support large transaction systems, and much more.
This new trend is also attracting attention from smaller companies and researchers, who do not have access to large computing infrastructure like Google's and Microsoft's. In the era of cloud computing, however, people can rent machines in remote clouds and start their own cluster at a very low price. Then the immediate question is: given the workload and service level objectives, how many machines are needed? Other challenges include cluster parameter optimization, system upgrades, scheduler design decisions, and cluster sharing. To answer these questions, engineers and researchers need to understand the relationship between system performance, system parameter settings and the characteristics of the workload, which is the major research direction of this thesis.
1.2 Motivation for an Analytical Model
Although there are many such successful systems, their performance has not been studied very deeply, especially for the newly established massive scale data processing systems. MapReduce [9], the distributed computing framework originally from Google, is an example. It is very popular thanks to its expressive power and simplicity of use. There is some work [24, 36, 29] that tries to study and improve the performance of MapReduce, but to the best of our knowledge, no one has proposed an analytic model for it. This thesis focuses mainly on MapReduce, and tries to design an analytical model that characterizes its performance.
There are usually three ways to study a system: experiment, simulation and analytical modelling. Experiments are accurate, but sometimes too slow to be feasible. Simulation extracts only the necessary details and runs faster, but is still not practical when the parameter space is too large; for example, a distributed system could have hundreds of parameters. An analytical model describes a system abstractly with mathematical equations, taking system parameters as input and system performance metrics as output. A well developed analytical model hides unnecessary details, explores the whole parameter space conveniently, and may unveil obscure truths that are impossible to obtain otherwise.
In this thesis we first study the details of Hadoop, an open source implementation of MapReduce, and categorize ordinary workloads into three groups according to the size of their input and output. We divide MapReduce execution into several pieces, and develop models for each of them. Then we validate the accuracy of the model using measured data. Finally we show its applications with three examples, which yield useful conclusions. For example, the common practice of using a larger block size to obtain better sorting benchmark results may actually lead to longer response times; and MapReduce's scalability can be influenced by the design of its jobs, so improvements could be made according to this finding.
1.3 Overview
This thesis is organized as follows. Section 2 presents the related work in distributed data processing, with the first part on data storage and the second on data manipulation. Section 3 describes the technical details of MapReduce and the experimental setup. Section 4 states the assumptions, reviews the fundamental theory, and proceeds to build the model. The model is validated in Section 5 and applied in Section 6, where conclusions that differ from current understanding are drawn. Finally, Section 7 concludes the thesis and sets out the plan for future work.
2 Related Work
Modern data processing systems are growing in both size and complexity, but their theory and guidelines do not change very frequently. The prediction from [10] remains valid today: large scale database systems should be built on a shared-nothing architecture made of conventional hardware. Over the last few decades the shared-nothing architecture has developed fast, attracting popularity from both academia and industry. A number of systems have been implemented in this area, and they are introduced later in this section.
Intuitively, a data processing system has two major parts: how data are stored, and how they are manipulated. This section on related work is therefore organized into two corresponding subsections.
2.1 Data Storage
At the lowest level, data storage is a sequence of bytes kept on disks or other stable storage devices. However, bytes can be interpreted in different forms, ranging from simple bit streams to groups of records to nested objects. In order to control and share data more conveniently, database technologies were invented and are widely deployed.
In recent years the requirements for data storage have become more demanding. These requirements include, but are not limited to, volumes that scale up to petabytes, concurrency control among billions of objects, fault tolerance and failover, and high throughput for both reads and writes. Traditional storage and database systems can become awkward when handling all of these, so new approaches have been proposed.
2.1.1 The Google File System
The Google File System (GFS) [13] was originally designed and implemented by Google to meet its needs, and many decisions and optimizations were made to fit the environment Google was in. This environment is not unique to Google, and GFS's open source counterpart, the Hadoop Distributed File System [16], is now widely used in many companies and institutes [15].
GFS is based on several assumptions. First, failures are the norm rather than the exception. Because of the large number of machines gathered together, expensive, highly reliable hardware does not help much; instead, Google's system consists of thousands of machines built from commodity hardware, so fault tolerance is necessary in such systems. Second, the system should manage very large files gracefully. It is able to store small files, but the priority is files whose size is measured in gigabytes. Third, the workload is primarily one type of write and two types of reads: large sequential writes that append to the end of files, and large sequential reads or small random reads. Finally, the objective is sustained performance for a large number of concurrent clients, with more emphasis on high overall throughput than on the response time of a single job [13].
Under these assumptions, GFS uses a single-master, multiple-slave architecture as its basic form. Files are divided into chunks, 64 MB by default, which are the unit of file distribution. The master maintains all the meta-data, such as the file system structure, the chunk identifiers of files, and the locations of chunks; it has all the information and controls all the actions. The slave nodes, called chunkservers, store the actual data in chunks assigned by the master, and serve them directly to clients. The master and chunkservers periodically exchange all kinds of information, such as primary copy leases, meta-data updates and server registrations, through heartbeats. The work on the master node is minimized to avoid overloading it.
When a client tries to read a file, it retrieves the chunk handles and locations from the master using the file name and the offset within the file. The client caches this piece of meta-data locally to limit communication with the master. It then chooses the closest location among all possibilities (local or within the same rack), and initiates a connection with that chunkserver to stream the actual data.
Writing is a bit more complex. Because GFS uses replication to improve reliability and read performance, consistency has to be taken into account. Each chunk has a primary copy selected by the master, and every file modification goes through the primary copy so that the operations on the chunk are ordered properly. Ordinary writes that update a region of an existing file are supported, but the system is more optimized for appends, which are used when an application wants to add a record at the end of a file without caring about the record's exact location. The primary copy of the last chunk of the file decides where the record is written. Record appends are therefore atomic operations, and the system can serve a large number of concurrent operations, because the order is decided at the writing site and no further synchronization is required. Before an operation successfully returns to the application, the primary copy pushes all the operations on the chunk to all the secondary copies, ensuring that all copies are the same.
Fault tolerance is one of the key features of GFS. No hardware is trusted, so software must deal with all kinds of failures. File chunks are replicated across multiple nodes and racks. Checksums are used extensively to rule out data corruption. The master replicates its state and logs so that, in case of failure, it can restore itself locally or on another node. Shadow masters provide read-only access during the master's downtime. Servers are designed for fast recovery, and downtime can be reduced to a few seconds.
[Figure 1. Basic infrastructure of GFS: a GFS client sends file read/write requests to the GFS master (control messages) and receives chunk locations; chunkservers exchange heartbeats with the master and serve the actual read/write data streams to the client.]
In addition, there are other useful functionalities, such as snapshots, garbage collection, integrity tests, re-replication and re-balancing. The structure of GFS is shown in Figure 1.
2.1.2 Bigtable
GFS is a reliable, high performance distributed file system that serves raw files, but many applications need structured data resembling a table in a relational database. Bigtable [6], also from Google, fulfills this need; HBase [14] from Hadoop is its open source version.
[Figure 2. An example table that stores Web pages: the row with key "com.cnn.www" has a "contents:" column family holding three versions of the page, and an "anchor:" family with columns "anchor:cnnsi.com" = "CNN" and "anchor:my.look.ca" = "CNN.com".]
In Bigtable, tables are not organized in a strict relational data model; instead, each table has one row key and an unfixed, unlimited number of columns, and each field has multiple versions indexed by time stamp. This data model supports dynamic data control and gives applications more choice in expressing their data. Internally, Bigtable is a distributed, efficient map from row key, column name and time stamp to the actual value in that cell. Columns are grouped into column families, which are the basic units of access control. Bigtable is sorted in lexicographic order by row key, and dynamically partitioned into tablets that each hold several adjacent rows. This design exploits data locality and improves overall performance. An example is shown in Figure 2: a small part of a Web page table. The row has row key “com.cnn.www” and two column families, “contents” and “anchor”, and the “contents” value has three versions.
Bigtable is built on top of many other pieces of Google infrastructure. It uses GFS as persistent storage for data files and log files, depends on the cluster management system to schedule resources, relies on Chubby [5] as its lock service provider and meta-data manager, and runs in clusters of machine pools shared with other applications.
In the implementation, there are one master and many tablet servers. The master takes charge of the whole system, and tablet servers manage the tablets assigned to them by the master. Tablet information is stored in meta-data indexed by a specific Chubby file. Data in a tablet are stored in two parts: the old values, which are immutable and stored in the special SSTable file format sorted by row key, and a commit log of mutations on the tablet. Both are stored in GFS. When a tablet server starts, it reads the SSTables and the commit log for the tablet, and constructs a sorted buffer named the “memtable”, filled with the most recent view of the values. When a read operation is received, the tablet server searches the merged view of the latest tablet state, built from both the SSTable files and the commit log, and returns the value. When a write operation is received, the operation is written to the log file, and the memtable is modified accordingly. The whole process is shown in Figure 3.
[Figure 3. Bigtable tablet representation: read operations are served from the merged view of the in-memory memtable and the SSTable files stored in GFS; write operations are appended to the tablet log and applied to the memtable.]
Many refinements are used to achieve high performance and reliability. Compression is applied to save storage space and speed up transport. Caching is used heavily on both the server side and the client side to relieve the load on the network and disks. Because most tables are sparse, Bloom filters are used to speed up searches for non-existent fields. Logs on a tablet server are commingled into one file. Tablet recovery is likewise designed to be rapid, to minimize downtime.
2.2 Data Manipulation
Making good use of data takes more than just data storage. People could write dedicated distributed programs for a particular kind of processing, but such programs are hard to write and maintain, and each has to deal with data distribution, scheduling, failure detection and recovery, and machine communication. A central framework can provide these common features, so that users can rely on it and concentrate only on the logic unique to their jobs. Here two such systems are analyzed in detail.
2.2.1 MapReduce
MapReduce [9] is a powerful programming model for distributed, massive scale data processing. Originally designed and implemented at Google, MapReduce is now a hot topic and is being applied to fields it was not originally intended for. Hadoop also has an open source MapReduce implementation. MapReduce is built on top of GFS and Bigtable, using them as input sources and output destinations.
MapReduce's programming model comes from functional languages and consists of two functions: map and reduce. The map function takes input in the form of <key, value> pairs, does some computation on a single pair and produces a set of intermediate <key, value> pairs. All the intermediate pairs are then grouped and sorted. The reduce function takes an intermediate key and the list of values for that key as input, does some computation and writes out the final <key, value> pairs as the result. Many practical jobs can be expressed in this model, including grep, inverted lists of web pages and PageRank computation.
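As a concrete illustration of this model, consider the standard word-count example. The following is a minimal sketch against the Hadoop MapReduce Java API (a conventional illustration, not code from this thesis; the class names and the combiner choice are ours):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map: emit an intermediate <word, 1> pair for every token in the input value.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: the framework groups and sorts by key; sum the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // optional local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Reusing the reducer as a combiner pre-aggregates map output locally, a common way to shrink the intermediate data shipped between phases.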
Google's MapReduce is implemented in a master-slave architecture. When a new job starts, it generates a master, a number of workers controlled by the master, and some number of mapper and reducer tasks. The master assigns mapper and reducer tasks to free workers. Mapper tasks write their intermediate results to their local disks and notify the master of their completion, and the master informs reducer tasks to fetch the map outputs. When tasks finish and unassigned tasks remain, the master continues its scheduling. Reducer tasks start only after all mapper tasks have finished; after all reducers have finished, the job is complete and its result is returned to the client. During execution, if any worker dies, all the tasks on that worker are marked as failed and re-scheduled until they finish successfully. Figure 4 shows an example with an input file of 5 splits, 4 mappers and 2 reducers. Note that Hadoop's MapReduce differs slightly from Google's, as we will show later.
[Figure 4. Basic structure of MapReduce framework: a client submits a job to the master, which assigns input splits 0 to 4 to four mappers; two reducers fetch the intermediate data and produce outputs 0 and 1. Control messages flow between the master and the workers, while data stream from the splits through the mappers and reducers to the outputs.]
Other researchers have proposed enhancements to MapReduce. Traditional MapReduce focuses on long running, data intensive batch jobs that aim at high throughput rather than short response time; the outputs of both the map phase and the reduce phase are written to disk, either the local file system or the distributed file system, to simplify fault tolerance. MapReduce Online [8] was proposed to allow data to be pipelined between phases and between consecutive jobs. Intermediate <key, value> pairs are sent to the next operator soon after they are generated: from mappers to reducers in the same job, or from the reducers of one job to the mappers of the next. Because reducers execute with only a portion of the intermediate results, the final results are not always accurate; MapReduce Online therefore takes snapshots as the mappers proceed, runs reducers on these snapshots and approximates the real answer. Many refinements are used to improve performance and fault tolerance, and to support online aggregation and continuous queries.
Although MapReduce can express many algorithms in areas such as information retrieval and machine learning, it is hard to use for some database operations, especially table joins. Map-Reduce-Merge [33] extends the original model by adding a third phase, the merge phase, at the end of the reduce phase. A merger gets the resulting <key, value> pairs from the reducers of two MapReduce jobs, runs default or user-defined functions, and generates the final output files. With the merge phase and some operators, Map-Reduce-Merge can express many relational operators such as projection, aggregation, selection and set operations. More importantly, it can run most join algorithms, such as sort-merge join, hash join and nested-loop join.
Job scheduling is an important factor in MapReduce's performance. Hadoop assumes all nodes are the same, which results in bad performance in heterogeneous environments, such as virtual machines in Amazon's EC2 [2], where machine performance can differ significantly. Speculative tasks are one of the reasons. [36] proposes a new scheduling algorithm, Longest Approximate Time to End (LATE), that fits heterogeneous environments well and executes speculative tasks more accurately. The key idea is to estimate which task will finish farthest into the future. LATE uses a simple heuristic that assumes a task's progress rate is constant: the progress rate is calculated from the completed fraction of the work and the elapsed time, and the remaining time is computed by dividing the remaining fraction of the work by that rate. The task that would finish last determines the job's response time, and is therefore the first to re-execute if speculative tasks are needed; a sketch of this estimate appears below. Some tuning parameters further enhance this strategy.
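The heuristic can be stated in a few lines. The sketch below is our own rendering of the estimate described above, not code from [36], and the names are made up:

// LATE's remaining-time estimate under the constant-progress-rate assumption.
final class TaskEstimate {
    final double progress;   // completed fraction of the work, in [0, 1]
    final double elapsedSec; // seconds since the task started

    TaskEstimate(double progress, double elapsedSec) {
        this.progress = progress;
        this.elapsedSec = elapsedSec;
    }

    // Progress rate, assumed constant over the task's lifetime.
    double progressRate() {
        return progress / elapsedSec;
    }

    // Estimated time until completion; the task with the largest value finishes
    // farthest into the future, and is the first candidate for speculation.
    double estimatedTimeToEnd() {
        return (1.0 - progress) / progressRate();
    }
}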
To make MapReduce more convenient to use, other applications are built on top of it to provide simple interfaces. Hive [26] is a data warehousing solution built on Hadoop. It organizes data in relational tables with partitions, and uses the SQL-like declarative language HiveQL as its query language. Many database operations are supported in Hive, such as equi-join, selection, group by, sort by and some aggregation functions. Users can define and plug in their own functions, further increasing Hive's functionality. Its ability to translate SQL into directed acyclic graphs of MapReduce jobs empowers ordinary database users to analyze extremely large datasets, so it is becoming more and more popular.
Pig Latin [20] is a data processing language for large datasets that compiles into a sequence of MapReduce jobs. Unlike SQL, Pig Latin is a hybrid procedural query language especially suitable for programmers, and users can easily control its data flow. It employs a flexible nested data model, and natively supports many operators such as co-group and equi-join. Like Hive, it is easily extensible with user-defined functions. Furthermore, Pig Latin is easy to learn and easy to debug.
Sawzall [22] was designed and implemented at Google and runs on top of Google infrastructure such as protocol buffers, GFS and MapReduce. Like MapReduce, Sawzall has two phases: a filtering phase and an aggregation phase. In the filtering phase records are processed one by one and emitted to the aggregation phase, which performs aggregation operations such as sum, maximum and histogram. In practice Sawzall acts as a wrapper around MapReduce, presenting a virtual view of pure data operations.
Additionally, MapReduce has been introduced into areas other than large scale data analysis. As multi-core systems become popular, how to easily harness the power of multiple cores is a hot topic, and MapReduce is useful here as well. [23] describes the Phoenix system, a MapReduce runtime for multi-core and multi-processor systems using shared memory. Like MapReduce, Phoenix consists of many map and reduce workers, each of which runs in a thread on a core. Shared memory is used as storage for intermediate results, so data do not need to be copied, which saves a lot of time. At the end of the reduce phase, the outputs of the different reducers are merged into one output file in a bushy tree fashion. Phoenix provides a small API that is flexible and easy to use.
Phoenix shows a new way to apply MapReduce in a shared memory system, but it may perform badly in a distributed environment. Phoenix Rebirth [34] revises the original Phoenix for NUMA systems. It uses locality information when making scheduling decisions to minimize remote memory traffic: mappers are scheduled on machines that have the data or are near it, and combiners reduce the size of the mappers' outputs, and therefore the remote memory demand. In the merge phase, a merge sort is performed first within a locality group, and then among different groups.
Another area is general purpose computation on graphics processors. GPUs are being used in high-performance computing because of their high internal parallelism, but GPU programs are hard to write and not portable. Mars [17] implements a MapReduce framework on GPUs that is easy to use, flexible and portable. During execution, inputs are prepared in main memory and copied to the GPU's device memory, and mappers are started on the GPU's hundreds of cores. After the mappers complete, reducers are scheduled. Finally, the outputs are merged into one and copied back to main memory. Mars exposes a small API yet achieves performance that is sometimes 16 times faster than its CPU based counterpart.
2.2.2 Dryad
Although MapReduce is now widely adopted in many areas, it is awkward for expressing some algorithms, such as the large graph computations needed in many cases. Dryad [18] can be seen as an extended MapReduce that lets users control the topology of the computation. Dryad is a general purpose, data parallel, low level distributed execution engine. It organizes its jobs as directed acyclic graphs, in which nodes are simple, usually sequential, computation programs, and edges are data transmissions between nodes.
Creating a Dryad job is easy. Dryad uses a simple graph description language to define a graph. This language has 8 basic topology building blocks, which are sufficient to represent all DAGs when combined. Users define computation nodes by inheriting a C++ base node class and integrating them into the graph. Edges in the graph have three forms: files, TCP connections and shared memory. An example job DAG is shown in Figure 5. Other tedious work, such as job scheduling, resource management, synchronization, failure detection and recovery, and data transport, is performed internally by the Dryad framework itself.
The system architecture is again a single-master, multiple-slave style, to ensure efficiency. The execution of a Dryad job is coordinated by the job manager, which also monitors the running states of all slaves. When a new job arrives, the job manager starts the computation nodes that take direct input from files, according to the job's graph description. When a node finishes, its output is fed to its child nodes. The job manager keeps looking for nodes whose inputs are all ready, and starts them as soon as they are. If a node fails, it is restarted on another machine. When all computation nodes finish, the whole job finishes, and the output is returned to the user.
[Figure 5. A Dryad job DAG: operator vertices (Op), some replicated n times, read input files 1 and 2 and feed their results through intermediate operators into a single output file.]
Many optimizations are needed to make Dryad useful in practice. First of all, the user-provided execution plan may not be optimal, and needs refinement. For example, if a large number of vertices in the execution graph aggregate to a single vertex, that vertex may become a bottleneck; at run time, the source vertices may be grouped into subsets, with corresponding intermediate vertices added in the middle. Second, the number of vertices may be much larger than the number of available machines, and the vertices are not always independent, so mapping these logical vertices onto physical resources is of great importance. Some vertices are so closely related that it is better to schedule them on the same machine, or even in the same process; vertices can run one after another, or at the same time with data pipelined in between. Furthermore, there are three kinds of data transmission channels (shared memory, TCP network connections and temporary files), each with different characteristics, and using an inappropriate channel can cause large overhead. Last but not least, how failure recovery is accomplished affects the whole system. Note that these optimization decisions are correlated. For example, if some vertices execute in the same process, the channels between them should be shared memory to minimize overhead, but failure recovery becomes more complex: once a vertex fails, the other vertices in the same process must be re-executed as well, because the intermediate data are stored in memory and lost on failure.
Although Dryad is powerful enough to express many algorithms, it is too low level for daily work: Dryad users still need to consider details of the computation topology and data manipulation. Therefore some higher level systems have been designed and implemented on top of Dryad. The Nebula language [18] is one example. Nebula exposes Dryad as a generalization of a simple pipelining mechanism, providing developers with a clean interface that hides Dryad's unnecessary details. Nebula also has a front-end that integrates Nebula scripts with simple SQL queries.
DryadLINQ [35] is an extension on top of Dryad, aiming to give programmers the illusion of a single powerful virtual computer so that they can focus on the primary logic of their applications. It automatically translates high level sequential LINQ programs, which have an SQL-style syntax and many extra enhancements, into Dryad plans, and applies both static and dynamic optimizations to speed up their execution. A strongly typed data model is inherited from the original LINQ language, and old LINQ programs that were meant to run on traditional relational databases can now deal with extremely large volumes of data on Dryad clusters without any change. Debugging environments are also provided to help developers.
3 System Description
In this section we first investigate the fundamental behavior of Hadoop MapReduce, followed by a categorization of usual workloads. We then describe the experimental setup, and plot a set of experimental results as a preliminary impression of MapReduce's performance.
3.1 Architecture of Hadoop MapReduce
Hadoop [16] is a suite of distributed processing software whose components closely mimic their counterparts in Google's system. In this study we choose Hadoop MapReduce along with the Hadoop Distributed File System (HDFS); their architectures are analyzed here to gain enough insight to set up the environment for the model.
HDFS is organized in a single-master, multiple-slave style. The master, called the Namenode, maintains the file system structure and controls all read/write operations. The slaves, called Datanodes, hold the actual data and carry out the read/write operations. As stated in Section 2.1.1, file data are stored in blocks of fixed size, which improves scalability and fault tolerance. However, the available functionality is deliberately confined to keep the system simple and efficient.
When a read operation arrives, the Namenode first checks its validity, then redirects the operation to a list of corresponding Datanodes according to the file name and the offset inside the file. The sender of the operation then contacts one of those Datanodes for the data it requires. The Datanode closer to the sender is chosen first, and in special cases the sender itself, so as to save network bandwidth. When the current block is exhausted, the next block is chosen by the Namenode, and the operation resumes from the beginning of the new block.
When a write operation arrives, the Namenode again checks its validity, and chooses a list of Datanodes to store the replicas of the written data. The sender streams the data to the first Datanode, which simultaneously streams the same data to the next Datanode, and so on until no data is left. If the current block is full, a new list of Datanodes is chosen to store the remaining data, and the operation restarts.
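From an application's point of view, both paths are hidden behind the ordinary HDFS client API. A minimal sketch using standard Hadoop calls (the path is made up for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRoundTrip {
    public static void main(String[] args) throws Exception {
        // Picks up the cluster configuration (core-site.xml, hdfs-site.xml).
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/tmp/example.txt"); // illustrative path

        // Write: the Namenode chooses a Datanode pipeline; the client streams
        // bytes to the first Datanode, which forwards them down the replica chain.
        try (FSDataOutputStream out = fs.create(path)) {
            out.writeUTF("hello hdfs");
        }

        // Read: the Namenode returns block locations; the client then streams
        // data directly from the closest Datanode holding each block.
        try (FSDataInputStream in = fs.open(path)) {
            System.out.println(in.readUTF());
        }
    }
}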
Hadoop MapReduce is built on top of HDFS, and similarly has a master-slave architecture. Its master, called the Jobtracker, controls the lifecycle of a job, including submission, scheduling, cleanup and so on. The slaves, called Tasktrackers, run the tasks assigned to them by the Jobtracker. To keep the description concise, only the major actions are shown here.
When a new job arrives, the Jobtracker sets up the data structures needed to keep track of the job's progress, and then initializes the right number of mappers and reducers, which are put into the pool of available tasks. The scheduler monitors that pool and allocates new tasks to free Tasktrackers. Many strategies can help the scheduling, such as exploiting the locality of input files, and rearranging tasks into a better order. If no Tasktracker is available, new tasks queue up until some Tasktracker finishes an old task and is ready for a new one.
When a Tasktracker receives a task from the Jobtracker, it spawns a new process to run the actual code for that task, collects the running information and sends it back to the Jobtracker. Depending on the specific configuration of the task, the process may read data from the local disk or remote nodes, compute the output and write it to the local disk or HDFS. Three important resources are therefore involved: the CPU, the local disk and the network interface card.
According to MapReduce's topology, a job ideally has two phases: map and reduce. But after careful study, we find two synchronization points, one after the mappers and one after the reducers: all reducers start only after all mappers finish, and the job result is returned only after all reducers finish. In this work we focus on average performance, and the randomness in the response times of individual map and reduce tasks makes the average task time insufficient for calculating the total job time. We measure this randomness by the difference ∆ between the response time of a job and the average time of map and reduce. Furthermore, a job has to wait at the master node if there are currently no free slots for new jobs; we call this waiting the fourth part. In total we have four phases: map, reduce, ∆ and waiting.
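If these four parts simply add up (our reading here; the precise combination is developed in the global model of Section 4.5), a job's response time is approximately T ≈ Tm + Tr + T∆ + Tw, in the notation of Table 1.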
3.2 Representative Workload
MapReduce is powerful in its expressive ability, especially for large scale problems. Typical workloads include sorting, log processing, index building and database operations such as selections, projections and joins. In order to make our analysis applicable to a variety of scenarios, we divide these workloads into three types according to the relation between input and output, because most MapReduce applications are I/O bound programs, meaning they require more disk or network I/O time than CPU time.
The first type is large-in-small-out: a large amount of data is read and processed, but only a small amount of output is generated. Examples include log item aggregation and table selections in a database. The second type is large-in-large-out: a large amount of data is read, and a comparable amount of output is generated. Examples include sorting, index building and table projection in a database. The last type is small-in-large-out: only a small amount of input is needed, but a large amount of output can be generated. Examples include random data generation and some table joins in a database.
To reach an accurate model step by step, we first build the model for the first workload type, and then revise it to incorporate the other two types. The workload we choose to set up the basic model is a random SQL query, an equi-join of two tables with projection and selection, shown in Listing 1. The two input tables are large, and the size of the output is negligible. In this query, u_name is chosen randomly to make the queries differ from one another, the way a real workload would look.
Listing 1. Workload query
select p_id
from user, photo
where u_id = p_uid and
      u_name = "<different names>"
Sorting and random number generation are used for the second and third workload types respectively. These two very basic data operations are used heavily in many systems, and are therefore suitable workloads for our study.
3.3 Experimental Setup and Measured Results
All experiments are run on an in-house cluster with at most 72 working nodes, although we do not always use all of them. Each node has a 4-core CPU, 8 GB of main memory, a 400 GB disk and a Gigabit Ethernet network card.
In order to get a complete overview of MapReduce's performance, we arrange the experiments systematically. First, the maximum number of concurrent tasks is raised high enough to support more concurrent jobs. Then, in each experiment, the number of concurrently running jobs is fixed, and as the jobs run we measure the number of maps, reduces and shuffles, as well as the time spent in each phase. Each experiment may be run multiple times, averaging the results to improve accuracy. We then vary the job concurrency to obtain the whole curve, which shows how performance changes with the workload. After getting the curve for one setting, we change the number of nodes used, or the system parameters, to observe the effects of different settings.
The usual patterns for throughput, the time in each phase, and the number of tasks in each phase are pictured in Figures 6, 7 and 8 respectively. For different settings the specific numbers in these curves may differ, but the shapes are similar. In the throughput plot (Figure 6), throughput first increases almost linearly, then gradually decelerates to reach a maximum, after which it decreases a little and remains steady afterwards; the last changing point is at concurrency 40 in this example. Dissecting the running time into the 4 parts mentioned earlier gives Figure 7. The first two parts (mapper and reducer) share a similar pattern, a linear increase followed by a steady constant. The ∆ part is different, stabilizing after an exponential-like increase. The waiting part remains 0, then increases linearly after a turning point. Notice that the turning points of all 4 parts coincide, at concurrency 40 in this example, matching the throughput plot. Finally, in Figure 8, where the numbers of concurrent tasks are shown, the two phases again share a similar pattern, a linear increase followed by a constant, with the same turning points as in Figures 7 and 6. However, not all workloads produce the same curves; the length of the performance drop depends on the system parameters and the characteristics of the workload, and may disappear in some scenarios. In the next chapter we explain why the curves have this shape.
[Figure 6. Example figure of throughput: throughput (jobs/sec) versus concurrency.]
[Figure 7. Example figure of times of each phase: time (s) versus concurrency, with curves for mapper, reducer, ∆ and waiting.]
[Figure 8. Example figure of numbers of tasks in each phase: number of tasks versus concurrency, with curves for mapper and reducer.]
4 Model Description
This section starts with the assumptions and reviews the related theory in analytical performance modelling. The model is then discussed in the form of open systems, in three major parts. Finally the whole model is assembled, and a corresponding formula is derived for closed systems.
4.1 Assumptions
We first state several key assumptions that make this problem solvable. Other minor assumptions will be introduced where they are needed.
Assumption 1. The master node is not overloaded. In most cases the work on the master node is relatively small, so the master is unlikely to be the bottleneck.
Assumption 2. The whole system stays in a steady state. Our primary interest is steady state performance. However, as we will see later, the system can be steady but not balanced.
Assumption 3. No failure is present. Although MapReduce is capable of handling many kinds of failures, their influence on performance is inevitable. Moreover, failures are usually transient, while our current focus is steady state performance; we therefore postpone the study of failures to future work.
Assumption 4. Individual jobs are not too large, and their tasks can run at the same time. MapReduce jobs can be very large, taking hours to finish, but it is not realistic to treat such long jobs as failure free. Because our model does not consider failures, our primary focus is on the throughput of short jobs which take only several minutes to finish.
4.2 Related Theory Overview
Before delving into the model, we first review some related theory that will be used later. One of the most important theorems in performance analysis is Little's Law, stated as Theorem 1.
Theorem 1. In the steady state of a system,
\[ \bar{n} = \lambda T, \tag{1} \]
in which n̄ is the average number of jobs in the system, λ is the average arrival rate, and T is the average time in the system.
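For instance (a numeric illustration of ours): if jobs arrive at λ = 0.3 jobs/s and each spends T = 100 s in the system, then on average n̄ = 0.3 × 100 = 30 jobs are in the system.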
In queueing theory, a queue with a single server, Poisson arrivals and exponentially distributed service times is usually denoted an M/M/1 queue. Classic queueing theory gives the following conclusion:
Theorem 2. For an M/M/1 queue,
\[ \bar{n} = \frac{\rho}{1 - \rho}, \tag{2} \]
and
\[ W = \frac{\rho}{1 - \rho} \cdot \frac{1}{\mu}, \tag{3} \]
in which
\[ \rho = \frac{\lambda}{\mu} \tag{4} \]
is the server utilization, W is the average waiting time, λ is the arrival rate and µ is the service rate.
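As a quick numeric check (ours): with λ = 8 jobs/s and µ = 10 jobs/s, the utilization is ρ = 0.8, so n̄ = 0.8/0.2 = 4 jobs are in the system and the average waiting time is W = 4 × (1/10) = 0.4 s.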
Queueing models have two basic forms, open and closed. In open systems, the customer arrival rate does not depend on the state of the system; in closed systems, the total number of customers is fixed. Many systems can be studied as either, and the decision depends on the purpose of the system: if a system is designed to minimize the processing time of incoming requests whose number is not fixed, an open model may be better; if it is designed to support a fixed, large number of concurrent users, a closed model may be more suitable.
The classic M/M/1 queue uses the First Come First Served (FCFS) discipline, which is simple and easy to study. Processor sharing is another useful discipline, in the sense that many systems behave in a processor sharing pattern, such as modern time sharing operating systems; in such systems, the server divides its computing power evenly among all the customers present. Furthermore, a queueing system with multiple classes of customers is much more complicated to analyze. In [1] the authors summarize several models for processor sharing queues; here we use their equations relating the response time to the arrival rate, unconditioned on the number of jobs in the queue, as described in Theorem 3.
Theorem 3. In a queueing system with K classes of customers in total and the processor sharing discipline, let Tk be the expected response time of a class-k customer, λk its arrival rate and µk its service rate. If the service requirement is exponentially distributed, then the Tk satisfy the following linear equations:
\[ T_k = B_{k0} + \sum_{j=1}^{K} T_j \lambda_j B_{kj}, \qquad k = 1, \ldots, K, \tag{5} \]
where Bkj, for j = 0, 1, . . . , K, is given by
\[ B_{kj} = \begin{cases} \dfrac{1}{\mu_k (1 - \sigma_k)} & \text{if } j = 0, \\ \dfrac{1}{(\mu_k + \mu_j)(1 - \sigma_k)} & \text{otherwise,} \end{cases} \tag{6} \]
and σi is given by
\[ \sigma_i = \sum_{k=1}^{K} \frac{\lambda_k}{\mu_i + \mu_k}. \tag{7} \]
Large and complex systems, such as the Internet with its countless routers, switches and end computers, are usually hard to model with the aforementioned techniques because of the sheer number of sub-systems they contain. Bottleneck Analysis [25] is helpful in this scenario. Among all sub-systems, such as all links in the Internet, the one with the highest utilization is called the bottleneck, and the bottleneck defines a performance bound for the whole system. For example, when an end user A on the Internet sends data to another user B, the transfer speed is limited by the speed of the bottleneck link between A and B. The model of the bottleneck sub-system is a good approximation of the whole system, accurate enough in many different cases.
4.3 Notation Table
Table 1 shows the symbols and their descriptions that will be used throughout the thesis. Several
less important notation symbols will be introduced where they are needed.
Symbol   Description
C        Number of concurrent jobs (either running or waiting)
K        Maximum number of concurrent running jobs
X        Job arrival rate (in open systems) or throughput (in closed systems)
T        Job response time
N        Number of slave nodes in the system
Hm       Number of mappers per job
Hr       Number of reducers per job
b        Block size of the distributed file system
Di       Size of input data per job
Do       Size of output data per job
λm       Average arrival rate of mappers per node
µm       Average service rate of mappers per node
Sm       Average service time of mappers per node
λr       Average arrival rate of reducers per node
µr       Average service rate of reducers per node
Sr       Average service time of reducers per node
Tm       Average response time of a mapper
Tr       Average response time of a reducer
T∆       Maximum task response time minus average task response time
Tw       Average waiting time in the master node
p        Percentage of total work on the slowest node

Table 1. Symbols and notations
4.4 Disassembled Sub-models
In this subsection we first model the average performance of individual tasks, and then use their random behavior to calculate the response time of an ordinary job. Finally we consider the waiting time incurred when the existing jobs occupy all free slots and newly arriving jobs have to wait.
4.4.1 The Average Task Performance
Returning to the MapReduce framework, a job is decomposed into many tasks which run on slave nodes, and the tasks read from and write to HDFS files spread across all nodes. As a result, all jobs are potentially related to all nodes, which means the busiest slave node is potentially the slowest point for all jobs. Intuitively, from Section 4.2, the tasks on the busiest node are the slowest tasks, so the busiest node is the bottleneck of the whole MapReduce framework; if we model the busiest node accurately, we have a model for the slave nodes. Therefore, we first focus on the model of a single node, then use the parameters of the bottleneck node to directly calculate the performance of that node, and indirectly the performance of the slave nodes as a whole. We introduce a parameter p to represent this imbalance, defined in Equation 8:
\[ p = \frac{\text{amount of work on the slowest node}}{\text{total amount of work}}, \tag{8} \]
where the amount of work is the number of running operating system processes, including MapReduce mapper and reducer tasks, their management processes, and the distributed file system processes. The more running processes a machine has, the slower each of these processes gets. This cluster imbalance factor p is affected by the type of work and the cluster size, as will be discussed later.
When a slave node receives a new task, it sets up the environment and initializes the task, and then launches a new process to run it. Although all parts of a computer system, such as CPU, memory, disk and network interface, are involved in the execution of tasks, we consider the node as a black box to simplify the problem. We will later validate this simplification using measured data. Usually slave nodes are managed by a modern time-sharing operating system, Linux in our case. There are two types of tasks running on slaves, as mentioned before, and therefore a multiclass processor sharing model is a reasonable fit. Theorem 3 gives us the precise equations, which will be used later. Although Equations 5 do not necessarily imply the performance curve is linear as in Figure 7, we will later show that in our system a small modification validates this model.
[Figure 9. Queueing model of a slave node: task initialization as a delay center, followed by task execution as a 3-class PS queue (Nm maps, Ns shuffles, Nr reduces)]
The basic model for a slave node is shown in Figure 9. After a new task enters a slave node, it is first initialized in the delay center at the left of the figure, and then moves to the major part of the model, a two-class processor sharing queue at the right of the figure. Theorem 3 considers only the ideal open multiclass queue, but we can modify Equation 5 using the following observation: Bk0 + Σ_{j=1}^{K} Tj λj Bkj is the time spent at the processor sharing queue, and if we add another constant Bck to represent the initialization time spent at the delay center in Figure 9, we get the real response time in our case. To sum up, the equations for this model are:

Tm = Bm0 + Tm λm Bmm + Tr λr Bmr + Bcm
Tr = Br0 + Tm λm Brm + Tr λr Brr + Bcr    (9)

where Bij is defined as in Theorem 3, Tm and Tr are the average execution times for mappers and reducers respectively, and Bcm and Bcr are the initialization constants for mappers and reducers respectively.
Because the cluster is not balanced, we are primarily interested in the performance of the
bottleneck node. We measure the load on each node and calculate p, the percentage of work on the
bottleneck node. Using Little’s Law, we have the equation for arrival rates at the bottleneck node:
λm = pXHm
λr = pXHr    (10)
According to the definitions of service rate and service time, we have the following equations:

µm = 1/Sm
µr = 1/Sr    (11)
If we substitute these two into the definitions of σi and Bkj in Theorem 3, we have nicer formulas:

σm = Σ_{k} λk/(µm + µk) = pX( Hm/(1/Sm + 1/Sm) + Hr/(1/Sm + 1/Sr) ) = pX( SmHm/2 + SmSrHr/(Sm + Sr) )
σr = Σ_{k} λk/(µr + µk) = pX( Hm/(1/Sr + 1/Sm) + Hr/(1/Sr + 1/Sr) ) = pX( SrSmHm/(Sr + Sm) + SrHr/2 )    (12)

Bm0 = Sm/(1 − σm)
Bmm = (Sm/2) · 1/(1 − σm)
Bmr = (SmSr/(Sm + Sr)) · 1/(1 − σm)
Br0 = Sr/(1 − σr)
Brm = (SrSm/(Sr + Sm)) · 1/(1 − σr)
Brr = (Sr/2) · 1/(1 − σr)    (13)
And if we substitute these equations into Equation 9, we have the improved equations:

Tm = Bm0 + pX(Bmm Hm Tm + Bmr Hr Tr) + Bcm
Tr = Br0 + pX(Brm Hm Tm + Brr Hr Tr) + Bcr    (14)

Parameters such as the throughput X, the numbers of tasks per job Hm and Hr, the system imbalance factor p, and the task response times Tm and Tr are available through system monitoring, leaving four unknown variables in Equation 14: Sm, Sr, Bcm and Bcr.
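For concreteness, once Sm, Sr, Bcm, Bcr, p, X, Hm and Hr are fixed, Equation 14 is a 2×2 linear system in Tm and Tr. The sketch below is illustrative only; all numeric values are assumptions rather than measurements from our cluster:

```python
import numpy as np

def solve_task_times(Sm, Sr, Bcm, Bcr, p, X, Hm, Hr):
    """Solve Equation 14 for the average mapper and reducer response
    times (Tm, Tr) on the bottleneck node. Names follow Table 1."""
    lam_m, lam_r = p * X * Hm, p * X * Hr                      # Equation 10
    if lam_m * Sm + lam_r * Sr >= 1:
        raise ValueError("bottleneck node is saturated")
    sig_m = p * X * (Sm * Hm / 2 + Sm * Sr * Hr / (Sm + Sr))   # Equation 12
    sig_r = p * X * (Sr * Sm * Hm / (Sr + Sm) + Sr * Hr / 2)
    Bm0, Bmm = Sm / (1 - sig_m), Sm / 2 / (1 - sig_m)          # Equation 13
    Bmr = Sm * Sr / (Sm + Sr) / (1 - sig_m)
    Br0, Brr = Sr / (1 - sig_r), Sr / 2 / (1 - sig_r)
    Brm = Sr * Sm / (Sr + Sm) / (1 - sig_r)
    # Equation 14 rearranged into a 2x2 linear system A t = b
    A = np.array([[1 - p * X * Bmm * Hm, -p * X * Bmr * Hr],
                  [-p * X * Brm * Hm, 1 - p * X * Brr * Hr]])
    b = np.array([Bm0 + Bcm, Br0 + Bcr])
    Tm, Tr = np.linalg.solve(A, b)
    return Tm, Tr

# Assumed values: 10s/30s service times, light load on the bottleneck node
print(solve_task_times(Sm=10, Sr=30, Bcm=2, Bcr=3,
                       p=0.1, X=0.01, Hm=20, Hr=5))
```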
4.4.2 The Random Behavior

Average performance of individual tasks does not necessarily imply the average performance of their jobs, because of the synchronization steps we mentioned earlier. An example of histograms of response times is plotted in Figure 10.
[Figure 10. Histograms of tasks' response times: (a) mappers, (b) reducers; x-axis is time (s), y-axis is the number of tasks]
The response time of a task is a random variable, so the average response time of a job is the expectation of the maximum of two sets of random variables, one set for the mappers and the other for the reducers. In statistics there is no closed-form formula for this expectation. As a result, we measure the difference between the total time of a job and the average time of its tasks, which we call T∆. T∆ can be seen as an indicator of the system's randomness, and an example is plotted in Figure 11.
[Figure 11. Example of the randomness T∆: T∆ (s) plotted against concurrency]
We do not need an exact equation for the randomness factor for it to be useful; an approximate curve is enough for many analyses. Through simple data analysis we found its regular pattern, which is plotted in Figure 12. The T∆ curve starts from a horizontal line y = c (c is a constant between 10 and 15 in our system), and approaches an asymptote y = x/Xmax − Tmax, where Xmax and Tmax are the throughput and job response time when the system is saturated (i.e., C = K).
[Figure 12. Regular pattern for T∆: a horizontal line y = c at small concurrency, approaching the asymptote y = x/Xmax − Tmax]
We obtain this curve by applying a shear mapping [32] to an inversely proportional function y = a/x, using the transformation matrix

[ 1  −Xmax ]
[ 0    1   ]

followed by a translation by the vector (XmaxTmax, c). This procedure is illustrated in Figure 13.
[Figure 13. Transformation procedure to obtain the equation for T∆: the curve y = a/x is sheared and then translated]
The final equation for the randomness is Equation 15,

T∆ = (1/(2Xmax))(C − XmaxTmax) + (1/2)√( (C − XmaxTmax)²/Xmax² + 4a/Xmax ) + c    (15)
where C is the number of concurrent jobs in the system, Xmax and Tmax can be calculated using the model introduced in the previous subsection, and c is a constant depending on the specific system configuration, typically 10 to 15.

According to Little's Law, K = XmaxTmax, where K is the maximum number of concurrently running jobs from Table 1, and C = XT, where X is the throughput and T is the response time. Therefore Equation 15 can be transformed into Equation 16, which is the final equation for T∆:
T∆ = (1/(2Xmax))(XT − K) + (1/2)√( (XT − K)²/Xmax² + 4a/Xmax ) + c    (16)

4.4.3 The Waiting Time
The maximum number of concurrent jobs K is a configurable system parameter, and as long as the workload has not hit this maximum, the system behaves like the aforementioned models for the average task performance and the random behavior. The waiting time is then 0, because new incoming jobs can run immediately. When there are more than K jobs in the system, new incoming jobs have to wait for free slots. According to Little's Law, the throughput is now K/(Tm + Tr + T∆), so a job slot becomes free every (Tm + Tr + T∆)/K seconds on average. If there are in total C jobs in the system ahead of a new job, then C − K jobs are waiting, and the waiting time for that new job is therefore (C − K)(Tm + Tr + T∆)/K. In total, the waiting time Tw is defined by the following equation:
Tw = 0                            if C < K
Tw = (C − K)(Tm + Tr + T∆)/K      if C ≥ K    (17)

4.5 The Global Model
Now that we have models for each part, it is easy to calculate the response time of a job: the sum of the times for the map and reduce phases, their random behavior, and the waiting time. Mathematically, the following equation shows this relation:

T = Tm + Tr + T∆ + Tw.    (18)

This equation, together with Equation 14, Equation 16 and Equation 17, is the performance model for a closed-system MapReduce. Finding an explicit closed formula for T using parameters such as Sm, Sr, etc. is possible, but the formula is too long and too complex to be useful. Later on we will present methods for using this implicit formula for optimization.
If Little's Law X = C/T is substituted into Equation 18, we get Equation 19 for the throughput X of a closed system with C concurrent jobs:

X = C/(Tm + Tr + T∆ + Tw)    (19)
If the system is not overloaded, which means Tw = 0, the equation can be transformed into

X = C/(Tm + Tr + T∆),    (20)

and it is now possible to substitute this equation into Equation 14 and Equation 16 to get Equation 21 and Equation 22, where Tmax is the response time at the maximum concurrency K, which is shown in Equation 23 and can also be computed with Equation 21.
Tm = Bm0 + p · [C/(Tm + Tr + T∆)] · (Bmm Hm Tm + Bmr Hr Tr) + Bcm
Tr = Br0 + p · [C/(Tm + Tr + T∆)] · (Brm Hm Tm + Brr Hr Tr) + Bcr    (21)

T∆ = (Tmax/(2C))(C − K) + (1/2)√( (C − K)²Tmax²/C² + 4aTmax/C ) + c    (22)

Tmax = Tm + Tr + T∆, when C = K    (23)
These equations, together with Equation 17 and Equation 19, constitute the performance model for a closed-system MapReduce.

The maximum number of supported concurrent jobs is a configurable parameter, which appears as the changing point in Figure 7 and Figure 8. Intuitively, if the total number of jobs in the system increases, the average number of concurrent tasks on each node also increases, and because of the processor sharing property, the time for each task also increases. After the total number of jobs hits the maximum, the number of concurrent tasks on each node remains constant, and so does the execution time of each task. This is how mappers and reducers behave in Figure 7 and Figure 8.
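The closed-system model can be evaluated numerically by fixed-point iteration on Equations 21–23. The sketch below is a minimal illustration rather than the solver used for our experiments: the parameter values are assumptions, the iteration is damped, and we simply assume it converges in the unsaturated region for the chosen values:

```python
import math

def ps_coeffs(Sm, Sr, Hm, Hr, p, X):
    """Equations 12 and 13: utilizations and B coefficients."""
    sig_m = p * X * (Sm * Hm / 2 + Sm * Sr * Hr / (Sm + Sr))
    sig_r = p * X * (Sr * Sm * Hm / (Sr + Sm) + Sr * Hr / 2)
    if sig_m >= 1 or sig_r >= 1:
        raise ValueError("bottleneck node is saturated")
    return (Sm / (1 - sig_m), Sm / 2 / (1 - sig_m),
            Sm * Sr / (Sm + Sr) / (1 - sig_m),
            Sr / (1 - sig_r), Sr * Sm / (Sr + Sm) / (1 - sig_r),
            Sr / 2 / (1 - sig_r))

def solve_level(C, K, Tmax, prm, iters=5000, damp=0.5):
    """Damped fixed-point iteration on Equations 21 and 22; when Tmax
    is None this is the C = K case that defines Tmax (Equation 23)."""
    Sm, Sr, p, Hm, Hr = prm["Sm"], prm["Sr"], prm["p"], prm["Hm"], prm["Hr"]
    # deliberately pessimistic start: keeps X small, the node unsaturated
    Tm, Tr, Td = 10 * Sm, 10 * Sr, prm["c"]
    for _ in range(iters):
        X = C / (Tm + Tr + Td)                       # Little's Law, Eq. 20
        Bm0, Bmm, Bmr, Br0, Brm, Brr = ps_coeffs(Sm, Sr, Hm, Hr, p, X)
        Tm_n = Bm0 + p * X * (Bmm * Hm * Tm + Bmr * Hr * Tr) + prm["Bcm"]
        Tr_n = Br0 + p * X * (Brm * Hm * Tm + Brr * Hr * Tr) + prm["Bcr"]
        if Tmax is None:                             # Equation 22 at C = K
            Td_n = 0.5 * math.sqrt(4 * prm["a"] * (Tm_n + Tr_n + Td) / K) + prm["c"]
        else:                                        # Equation 22
            Td_n = (Tmax / (2 * C)) * (C - K) + 0.5 * math.sqrt(
                (C - K) ** 2 * Tmax ** 2 / C ** 2 + 4 * prm["a"] * Tmax / C) + prm["c"]
        Tm = damp * Tm + (1 - damp) * Tm_n
        Tr = damp * Tr + (1 - damp) * Tr_n
        Td = damp * Td + (1 - damp) * Td_n
    return Tm + Tr + Td

def response_time(C, K, prm):
    Tmax = solve_level(K, K, None, prm)              # Equation 23
    return Tmax if C == K else solve_level(C, K, Tmax, prm)

# All parameter values below are assumptions for illustration only
base_params = dict(Sm=10, Sr=30, Bcm=2, Bcr=3, p=0.03, Hm=5, Hr=2, a=50, c=12)
print(response_time(10, 20, base_params))
```

The printed value is only meaningful relative to the assumed parameters; with fitted parameters the same routine would produce the model curves compared against measurements in Section 5.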
Our model needs parameters such as Sm, Sr, Bcm and Bcr to be complete. The exact values and equations for these parameters depend on the characteristics of the specific workload and on the software and hardware specifications, meaning it is unnecessary, and perhaps impossible, to find a universal formula that works for all cases. Users of this performance model can plug in their own sub-model for their own workload to get the final detailed model. For this thesis, we demonstrate the correctness of this model framework by finding the right numerical values for these unknown variables using a Genetic Algorithm [19], a heuristic search technique. Different kinds of data, including the times for the different mapper and reducer tasks, the response times of jobs, the cluster imbalance factor p, etc., are measured, and the genetic algorithm tries many combinations of unknown parameter values using built-in heuristics, for example changing a parameter randomly by a small increment and then examining how close the result is to the measured data, until the distance between measured data and calculated results is small enough. Because the ranges of possible parameter values are usually limited, for example, Sm is usually larger than 5 seconds and smaller than 5 minutes, this heuristic search is quite fast. Specific results are given in the next section.
5 Model Validation

In order to validate the proposed model for MapReduce, we ran a large number of experiments covering the representative workloads: database query, random number generation and sorting. We organize this section accordingly into three subsections, each of which presents the measured data and the results calculated with our model.
In all these experiments, the maximum number of tasks per node is increased from the default value of 4 to 20 so that the queueing behavior is more pronounced. All other parameters are kept at their default values, and the parameters most closely related to system performance are shown in Table 2. Although there are 72 nodes in our cluster, it is shared among many users and our experiments can easily be affected by theirs, so we have to be careful about which nodes to choose. In the end we were able to use 16 physical nodes to complete all these experiments. More nodes will be used in parts of Section 6, along with the impact of several key parameters. The sub-model for the waiting time in Equation 17 is easy to understand and is confirmed by the measured data. Therefore this section focuses on the average task performance and its randomness, and the experiments' concurrency C is kept smaller than the maximum job parameter K. The average values of multiple runs are used for better accuracy. Although only three data sets and their corresponding model calculations are presented, other data sets show similar results. These results suggest that the black-box simplification in the previous section is reasonable.
Parameter name                                            Default value
Block size                                                64MB
HDFS replication number                                   3
Heartbeat interval                                        3 seconds
The number of internal concurrent sortings                10
Buffer size for sorting                                   1MB for each merge sorting
The maximum number of attempts for failed tasks           4
The limit on the input size of reduce                     No limit set
Task execution heap limit                                 200MB
The size of the in-memory buffer for shuffle              140MB
The number of tasks per Java Virtual Machine              1
The number of worker threads for map's output fetching    40
Compression used for map's output                         Not used

Table 2. System default values
5.1 Database Query

The first set of experiments is database queries using the SQL query in Listing 1 in Section 3.2. Each SQL query generates a MapReduce job, in which mappers scan the input tables to find database rows that satisfy the query, and reducers combine these rows into the final results. In detail, there are two kinds of mappers, one for each table. Mappers for table user search for the rows that have the specified user name, and generate intermediate pairs that use the user id as key. Mappers for table photo transform all rows into intermediate pairs that have the user id as key and the photo id as value. After the shuffle phase, reducers scan these pairs, and if one key carries items from both tables, its photo ids are part of the results and are written into the result file. Therefore, the MapReduce job from this database query has large inputs (two tables) but small output (potentially hundreds of lines).
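As an illustration of this flow — not the actual Hadoop job used in the experiments — the join can be sketched as plain Python generator functions; the field names u_name, u_id, p_uid and p_id are our own labels for the columns of Listing 1:

```python
def map_user(row, target_name):
    # table `user`: emit (u_id, marker) only for rows with the queried name
    if row["u_name"] == target_name:
        yield row["u_id"], ("user", None)

def map_photo(row):
    # table `photo`: emit (p_uid, p_id) for every row
    yield row["p_uid"], ("photo", row["p_id"])

def reduce_join(user_id, values):
    # a photo id is a result only if the same key also carries a user marker
    values = list(values)
    if any(tag == "user" for tag, _ in values):
        for tag, p_id in values:
            if tag == "photo":
                yield p_id
```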
When running database queries, data such as the throughput X, the total response time T, the mapper time Tm, the reducer time Tr and the cluster imbalance factor p are all measured. These data can be used to compute the parameters for the model in Equation 19, Equation 21 and Equation 22, which can then be used to compute the times for each part of the model. If the computed results are close to the measured times, then the model is able to express the performance of MapReduce for this database query workload.
The average times for mappers and reducers are shown in Figure 14, including measured data
from experiments and calculated results from the model. The measured randomness factor ∆ and
calculated ∆ are shown in Figure 15. The measured response time T and calculated response time
are plotted in Figure 16. Small circles in these figures are the average values from multiple experiments, and error bars show the maximum and minimum measured data from these experiments.
These figures confirm that our model is able to accurately describe the performance of database
query.
The horizontal axis of these figures ranges from 10 to 60, because this is the range we are interested in. When concurrency is smaller than 10, meaning fewer than 10 jobs run concurrently, the queueing in the system is not obvious and the performance increases almost linearly. When concurrency is larger than 60, the system is full and the extra jobs have to wait outside of the system. The most difficult part of our model is this middle range, so only data in this range are plotted.
[Figure 14. Measured time vs. calculated time for database query: (a) mappers, (b) reducers]
[Figure 15. Measured ∆ vs. calculated ∆ for database query]
[Figure 16. Measured response time vs. calculated response time for database query]
In Figure 14 we can see that the running times for mappers are comparable to the times for reducers: although the work for an individual reducer is larger than for an individual mapper, they are not far apart, because theoretically most operations of a database query job are located inside the mappers. In Figure 15 we can see that the randomness increases as the concurrency increases, which is reasonable: more concurrent jobs mean more concurrently running tasks competing with each other for resources, which increases the randomness.
5.2 Random Number Generation

The second type of experiment is random number generation. In each MapReduce job of this experiment, mappers are in charge of generating numbers and reducers are in charge of writing them into files. In detail, each map operation generates two random numbers. One of them is a very small integer acting as the key of the intermediate pair, and the other is a 100-byte integer acting as the value. After the shuffle phase, reducers abandon the keys of the intermediate pairs and write their values into the output file. As a result, random number generation is a kind of workload that has small input (at most a few integer random seeds) and large output (all the random numbers).
The procedure of validation using the random number generation workload is similar to the validation using the database query workload. Different times and values are measured during the execution of the experiments, and then used to compute the parameters for the model in Equation 19, Equation 21 and Equation 22. Finally, results calculated using these measured parameters are compared with the measured times, which gives the accuracy of our model for the random number generation workload.
The results for random number generation look like the results for database query in the previous subsection. The comparison of measured time and calculated time is shown in Figure 17, and
the comparison of measured ∆ and calculated ∆ is shown in Figure 18. Figure 19 shows the measured and calculated response time. Again, the measured data and calculated results are very close,
indicating our model is able to capture the performance of random number generation workload.
The concurrency range in these figures is from 5 to 30, which is different from the previous subsection on database query. The reason for choosing this range is the same as for database query; the specific values differ because the work for one random number job is larger than for a database query while the system's capacity remains the same, so the supported concurrency becomes smaller.
[Figure 17. Measured time vs. calculated time for random number generation: (a) mappers, (b) reducers]
Different from the previous subsection on database query, Figure 17 shows that the running times of reducers are more than ten times larger than the mappers' running times. In random number generation the work for reducers is much larger than the mappers' work, because reducers have to read intermediate results from remote mappers and write the output file to remote machines in the underlying distributed file system. The randomness in Figure 18 increases faster than linearly as concurrency increases, meaning more concurrent jobs may bring extra waste in the random behavior.
[Figure 18. Measured ∆ vs. calculated ∆ for random number generation]
[Figure 19. Measured response time vs. calculated response time for random number generation]
5.3 Sorting

The last kind of experiment is the sorting of the random numbers generated by the second workload. In its MapReduce job, mappers read the input random number file and forward the random numbers as intermediate pairs. These pairs are sorted by the MapReduce framework, and reducers write what they get from the framework directly into the output file. It is clear that the input and output of sorting are the same size.

A similar validation procedure is carried out for the sorting experiments, and produces similar results. The figures for mappers and reducers are plotted in Figure 20, the figure for ∆ is plotted in Figure 21, and the measured and calculated response times are plotted in Figure 22.

Again the range of concurrency in these figures is different from the previous two, because the work for each sorting job is even larger than for random number generation. The reducers' running times in Figure 20 are still larger than the mappers' times, but not as different as in Figure 17, because here mappers have to read large input files, which decreases the gap between mappers and reducers.
[Figure 20. Measured time vs. calculated time for sorting: (a) mappers, (b) reducers]
The results from these three experiments show that our model works for these three workloads. Because most MapReduce programs are I/O bound and these workloads are categorized according to their I/O demands, our model should be able to describe the performance characteristics of a variety of different MapReduce applications.
[Figure 21. Measured ∆ vs. calculated ∆ for sorting]
[Figure 22. Measured response time vs. calculated response time for sorting]
6 Model Application

After analyzing MapReduce through experiments and the proposed model, several opportunities can be identified to improve parameter settings and system design. We will show three key areas where optimizations are possible. We use the proposed model with the Gradient Descent [31] technique to optimize some important parameters, namely the number of reducers per job, the block size of the underlying distributed file system, and the size of the cluster.

Note that the methods of applying our model are more important than the quantitative results. The conclusions of this section depend on the type of workload, and may change if the workload changes. Users need to apply this methodology to their own systems and workloads.
6.1 Procedures of Optimization using Gradient Descent

If a function F(x) is differentiable near a point x = a, then the value of F(x) decreases fastest when x moves in the opposite direction of the gradient of F(x) at a. This first-order optimization algorithm is called gradient descent. It works in spaces with a large number of dimensions, and does not require an explicit closed formula for F.
Here is an example of gradient descent. The function under consideration is

F(x) = x⁴ − 3x³ + 2,    (24)

and the minimum value of F can be located by first calculating its derivative:

F′(x) = 4x³ − 9x².    (25)
If the optimization is started from x = 5, we first calculate the derivative using Equation 25, which is positive. In the next step we therefore decrease x to a smaller value, x = 3, where the derivative is still positive. The value of x is then further decreased to 2, and now the derivative becomes negative, so x is increased in the next step. This iteration continues until the derivative's change is smaller than a certain limit, at which point a satisfactory minimum is located. If the increment is large, the search quickly covers a broader space but may lose precision; if the increment is small, the precision improves but more steps may be needed when starting far from the optimal point. This procedure is shown in Figure 23.
[Figure 23. An example of gradient descent on F(x) = x⁴ − 3x³ + 2]
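This iteration is straightforward to implement; in the sketch below the step size and stopping tolerance are our own illustrative choices:

```python
def gradient_descent(f_prime, x0, step=0.005, tol=1e-9, max_iter=100_000):
    # move x against the derivative until the iterates barely change
    x = x0
    for _ in range(max_iter):
        x_new = x - step * f_prime(x)
        if abs(x_new - x) < tol:
            break
        x = x_new
    return x

f_prime = lambda x: 4 * x ** 3 - 9 * x ** 2    # Equation 25
print(gradient_descent(f_prime, x0=5.0))       # converges to 9/4 = 2.25
```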
Back to our model, the optimization objective is to minimize the response time T. Maximizing the throughput X follows a similar process if Little's Law from Theorem 1 is used. There is a solution for the global model in Equation 18, but it is too complex to analyze symbolically; mathematical software such as Matlab, however, can quickly calculate numerical solutions once the unknown parameters are fixed. Therefore gradient descent can be used to search for the optimal values of some parameters, given that other parameters are fixed, for example using measured data from experiments on that particular workload.

In order to use gradient descent, we first derive the equation for T from Equations 14, 16 and 18 using parameters such as Sm, Hm, Sr, Hr, Bcm, Bcr and p; it is too complex and therefore omitted from this thesis. For example, if the optimization is for Hr, we set the other parameters to measured data, calculate the derivative of T with respect to Hr, and increase or decrease the value of Hr according to the sign of that derivative, as shown in the example in Figure 23. We are only interested in systems without deterministic waiting, meaning the system is not overloaded and Tw is 0.
We first consider the equation for T∆ in Equation 16. Tmax is the response time when the concurrency C equals K, and taking these conditions into Equation 14 and Equation 16, we have the following Equation 26:

Tm max = Bm0 + p · [K/(Tm max + Tr max + T∆ max)] · (Bmm Hm Tm max + Bmr Hr Tr max) + Bcm
Tr max = Br0 + p · [K/(Tm max + Tr max + T∆ max)] · (Brm Hm Tm max + Brr Hr Tr max) + Bcr
T∆ max = (1/2)√( 4a(Tm max + Tr max + T∆ max)/K ) + c    (26)
where Tm max, Tr max and T∆ max are the times for the mapper, the reducer and ∆ respectively when the concurrency is at the maximum K. This equation can be solved for Tm max, Tr max and T∆ max, and after substituting these into Equation 16, we have an equation for T∆ that consists of parameters such as Sm, Hm, Sr, Hr, Bcm, Bcr and p, and can be differentiated.

Now we can substitute this equation for T∆ into Equation 14 and solve that system of equations to get the formulas for Tm and Tr. We therefore have the complete formula for T in Equation 18. To sum up, T can now be expressed using Sm, Hm, Sr, Hr, Bcm, Bcr and p. Then we can proceed with the optimizations using gradient descent against this expression for T.
There are some extra relationships between these parameters. By the nature of MapReduce, all jobs are split into records, for example strings of English words. If the input and output are large enough, we can assume that the amount of computing work for each input and output block is approximately the same. Therefore the total work for all mappers and all reducers is approximately proportional to the size of the work, which means

SmHm = gmDi
SrHr = grDo    (27)

where gm and gr are constants, and SmHm and SrHr are usually called service demands.
The input size of each mapper is, by default, the size of a block, therefore we have the following relation:

Hm = Di/b    (28)

Substituting this into Equation 27 gives

Sm = gmb    (29)

where b is the block size. Equation 29 is intuitive: the larger the block size b, the larger the input for each mapper, and the longer it takes a mapper to finish.
However, these relations are not always true. For example, the number of mappers can be explicitly set to an arbitrary number, regardless of the block size. And if the numbers of mappers and reducers are very large, the chance of failure may become non-negligible, which our model does not capture. Still, these relationships are used here as a demonstration of the most common cases.
6.2 Optimal Number of Reducers Per Job

The first optimization procedure we would like to demonstrate is for the number of reducers per job. Some guideline articles such as [30] suggest setting the number of reducers to as large a value as possible, while others argue that the number should equal the number of machines used. Here we study this issue from a theoretical perspective.
First of all, 32 nodes are used in the sorting experiments, and each sorting job has 5 reducers at the beginning. We use this as the baseline to find the optimal number of reducers per job so that the response time T is minimized. In each step we calculate the parameter values using the method from the validation Section 5, and then compute the gradient dT/dHr. If this gradient is negative, the number of reducers is increased by a small value: in our case 5 if the gradient is large, to search more carefully, and 10 if the gradient is small, to search more effectively. A larger gradient means the response time T is more sensitive to a change in the number of reducers Hr. If the gradient is positive, the number of reducers is decreased by a small value. The experiment is rerun with each new reducer setting until the response time T stabilizes, meaning that T approaches a limit and does not change much as further optimization steps are taken.
According to Equation 27, Sr can be expressed as

Sr = grDo/Hr    (30)

and after substituting this into Equations 12, 13 and 14, we get the equations that can be directly used in the optimization procedure.
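Each optimization step can be sketched in code as a central finite difference on the model's predicted T, with Equation 30 tying Sr to Hr. The sketch reuses response_time and base_params from the sketch in Section 4.5; gr·Do and all thresholds are assumed values, and in the real procedure each step is an experiment rerun rather than a model evaluation:

```python
def grad_T_wrt_Hr(Hr, gr_Do, base, C, K, eps=0.5):
    """Central finite difference for dT/dHr on the model's prediction."""
    def T(h):
        prm = dict(base, Hr=h, Sr=gr_Do / h)   # Equation 30: Sr = gr*Do/Hr
        return response_time(C, K, prm)
    return (T(Hr + eps) - T(Hr - eps)) / (2 * eps)

Hr, gr_Do = 5.0, 150.0                         # assumed starting point
for step in range(5):
    try:
        g = grad_T_wrt_Hr(Hr, gr_Do, base_params, C=10, K=20)
    except ValueError:
        print("model saturated; stop searching")
        break
    # rule of thumb from the text: small steps when |g| is large, big when small
    delta = 5 if abs(g) > 1 else 10
    Hr = max(1.0, Hr - delta if g > 0 else Hr + delta)
    print(step, Hr, g)
```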
The process of the optimization is shown below. The decrease in response time is plotted in Figure 24, and the gradient at each corresponding step is plotted in Figure 25. As we can see, the response time approaches a minimum value as the number of reducers increases, while the gradients approach 0, meaning that the response time stabilizes.

It is too compute-intensive to run experiments for arbitrary Hr, but if we assume that parameters other than Hr remain constant, we can numerically calculate the gradients for very large Hr. The trend of the gradient is shown in Figure 26.
[Figure 24. The process of gradient descent for the optimization of Hr for sorting: the response time falls from about 175s at Hr = 5 to about 150s at Hr = 30]
It should be pointed out that, although the calculation shows that the gradient for Hr is always negative, too many reducers should be avoided, because overheads that could previously be neglected become more and more significant, making our model inaccurate. Furthermore, more reducers may consume more energy and increase the chance of failures. As a result, a threshold ε can be introduced, and the number of reducers increased only when the magnitude of the gradient exceeds this threshold. This threshold depends on the specific system and its workload, and should be decided case by case.
[Figure 25. The gradients for the optimization of Hr for sorting]
[Figure 26. The trend of gradients for the optimization of Hr for sorting]
Similarly, we run this optimization of the number of reducers Hr for the database query workload, but the results are different from sorting. The process of gradient descent is shown in Figure 27 and Figure 28, from which we can see that the optimal Hr value is 1, unlike sorting. The difference is caused by the difference in their workloads: the reducer service time Sr in database query is much smaller than in sorting, so the reducer overhead in database query is relatively more significant. The large ratio of overhead to service time Sr limits the parallelism, and makes database query suffer from a large number of reducers Hr.
[Figure 27. The process of gradient descent for the optimization of Hr for database query: the response time falls from about 130s at Hr = 7 to about 70s at Hr = 1]
These results show that there is no universal rule for setting the number of reducers Hr. Neither of the previously mentioned guidelines is correct for all MapReduce jobs. If the service demand for reducers SrHr is small, a larger Hr incurs more overhead and therefore drags down system performance; if the service demand is large, the system can hardly benefit from parallelism if Hr is too small. Decisions need to be made case by case.
[Figure 28. The gradients for the optimization of Hr for database query]
6.3 Optimal Block Size

The block size is an important parameter in HDFS and MapReduce. The experiment with which Yahoo won the Terasort benchmark set the block size to 512MB instead of the default value of 64MB [21], but no explanation was provided. This default value originally came from paper [13], where no reason was given either. One reason we can think of is that a larger block size generates smaller total overhead. But a larger block size also decreases parallelism and therefore prolongs execution time. What is more, the execution of the Terasort benchmark is usually isolated from other workloads to ensure faster speed, and is possibly not an ideal indicator for a real system shared among many users. As a result, we try to find the optimal block size to minimize the average response time for several concurrent sortings.
The relationship between the mapper service time Sm and the block size b comes from Equation 29:

Sm = gmb.    (31)

Substituting this equation into Equation 18, Equation 21 and Equation 22 of our model, we can use the methods introduced in the previous subsections to get the derivative of T with respect to b, and start gradient descent.
The procedure of this optimization is similar to the previous subsection. We use 32 nodes to run concurrent sortings that have a fixed number of reducers. The first experiment, which uses a 512MB block size, serves as the baseline. Then gradient descent is carried out to find the optimal block size, which is shown in Figure 29 and Figure 30. At step 6 the block size b is 16MB and the gradient becomes a small negative number, so at step 7 the block size is increased a little to 24MB, and the gradient becomes a small positive number.
[Figure 29. The process of gradient descent for the optimization of block size for sorting: the response time falls from about 140s at b = 512MB to a minimum around b = 24MB]
Distributed systems like MapReduce are nondeterministic, and their randomness implies that one cannot determine the optimum precisely by iterating indefinitely. The performance difference between block size b = 16MB and b = 32MB is already very small (approximately 1%), so iterating for more steps would not give much higher precision. Just like the previous subsection on the optimal number of reducers, we numerically calculate the gradient for block sizes between 16MB and 32MB, which is shown in Figure 31.
The same optimization is also carried out for the database query workload, but the results are different. The response time and gradient at each step of the optimization are plotted in Figure 32 and Figure 33 respectively, which show that the optimal block size for database query is around 192MB. The reason for this difference is that the service requirement of sorting's mappers is larger than database query's. Even though a smaller block size generates more mappers and incurs larger overhead, the impact of this overhead on sorting is not as obvious as on database query. Therefore, sorting is able to benefit from the larger parallelism of a smaller block size, while database query would instead suffer from it.
[Figure 30. The gradients for the optimization of block size for sorting]
[Figure 31. The trend of gradients for the optimization of block size for sorting]
[Figure 32. The process of gradient descent for the optimization of block size for database query: the response time falls from about 200s at b = 16MB to a minimum around b = 192MB]
The results here suggest that the block size engineers have been using is not always optimal; a smaller block size is good for concurrent sorting, and a relatively larger block size is good for the database query in Listing 1 in Section 3. However, the conclusion is not meant to be universally applicable. Real-world systems are not designed just for concurrent sortings or database queries, and other user applications may require different block sizes. Furthermore, a smaller block size generates more blocks for the same file, and therefore increases the chance of failure as well as the management overhead on the master node. A larger block size increases the response time of individual mapper and reducer tasks. Because the block size b is a parameter of the underlying distributed file system shared by all MapReduce jobs and other applications, trade-offs need to be made to get the overall optimal block size.
6.4 Optimal Cluster Size

One of MapReduce's pronounced advantages is its scalability, which means adding more machines gives better overall performance. It is natural that more machines provide more computing power, but this does not necessarily increase the overall system speed linearly.
[Figure 33. The gradients for the optimization of block size for database query]
Moreover, more machines mean a larger procurement budget and higher operating costs, especially in a cloud computing environment. For example, it is not economical to double the cluster just to get a few percent of performance enhancement. In this subsection we try to find the point at which MapReduce's scalability starts to deteriorate seriously, which is important information when setting up a system.
The model in Equation 14 contains the imbalance factor p. Ideally, in a balanced system all nodes are the same and p = 1/N. If we assume our system is const_p times worse than the ideal case, then p = const_p/N. After substituting this relation into the overall model in Equation 18, we have the equation for T, and can start the optimization as in the previous subsections.
The database query workload is used for this optimization, because its performance degradation is clearer than the other two. The results are shown in Figure 34 and Figure 35. These two figures show that beyond a certain range, a larger cluster does not provide a desirable performance increase. This fast performance degradation is probably caused by the database's constraints. In databases, data are supposed to be stored only once (in our case, only one file replicated 3 times) to ensure strong consistency, but this also decreases the possible parallelism. When more machines are added, they need to fight for the only input file, which limits the performance gain. One solution is to store multiple copies of the database, or at least of its popular portion, if the database itself or its client applications can tolerate inconsistencies.
[Figure 34. The process of gradient descent for cluster size for database query: the response time falls from about 120s at N = 4 to about 50s at N = 48]
[Figure 35. The gradients for the optimization of cluster size for database query]
However, the scalability of the sorting workload is different from the scalability of database queries. Its results are shown in Figure 36 and Figure 37. Although the performance increase is not linear, it is still better than for the database query workload. From the gradient figure we know that the gradient is still away from 0, which means its scalability remains effective if the cluster size increases even further. The likely reason is that there are multiple input files for the sortings, thus avoiding the inefficiency mentioned in the last paragraph.
[Figure 36. The process of gradient descent for cluster size for sorting: the response time falls from about 220s at N = 8 to under 100s at N = 64]
6.5 Summary

The conclusions from this section are summarized in Table 3, which shows that the default parameter values are not always good choices. Different applications require different settings. For example, sorting's requirements on the number of reducers and block size are totally different from those of database queries. Similar methods can generate optimizations for other parameters.

MapReduce job writers will benefit from these conclusions. They are familiar with their own jobs, such as the requirements on CPU, disk or network. Using the methods in this section they are able to speed up their jobs by setting better parameters, such as the number of reducers.
56
0
ï1
N = 64
ï2
N = 48
N = 32
Gradient
ï3
ï4
N = 16
ï5
ï6
N =8
ï7
1
2
3
Step
4
5
Figure 37. The gradients for the optimization of cluster size for sorting
Optimization area     Workload          Conclusion
Number of reducers    Sorting           Optimal value is 30 or more
                      Database query    Optimal value is 1
Block size            Sorting           Optimal size is 24MB
                      Database query    Optimal size is 192MB
Cluster size          Sorting           Scalability remains good for a large size
                      Database query    Scalability starts to deteriorate from 16 nodes

Table 3. System optimization conclusions
Before setting up a new cluster or changing an existing one, these conclusions can be used as follows. In systems such as [27], workloads usually include both regularly scheduled jobs and ad hoc jobs. System designers should first decide the proportions of these two and which to focus on. Then their service demands, such as SmHm and SrHr, should be measured. Finally, these data are put into our model to calculate the optimal values of the block size and the cluster size. Because the block size and cluster size need to be fixed before the system starts, these values need to be set statically.
7 Conclusion and Future Work

This thesis studies the performance of MapReduce. Because of its simplicity, scalability, expressive power and outstanding failure tolerance, MapReduce is becoming a more and more popular distributed computing framework in both industry and the academic world, yet its performance issues are not fully studied. We tackled this problem by first inspecting its system design and categorizing its typical workloads, followed by experiments to get a preliminary impression of its behavior in Section 3. With this basic knowledge we proposed our model, comprising three major parts, in Section 4. The first sub-model is for average task performance, Equation 14. We focus on the bottleneck node, and use a modified multiclass processor sharing queue to capture the average response times of mappers and reducers. It has two structured equations, which can be easily solved. The second part is the random behavior of map and reduce tasks, Equation 16. Their common pattern is pointed out, and a customized equation is designed to describe this random time. The waiting time in an overloaded system was investigated as the third part of MapReduce's performance model, Equation 17, which is an intuitive equation.
This model is then validated in Section 5 against measured data from all three kinds of experiments, with measured and calculated times from each sub-model shown in Figures 14 to 22. Finally, possible application methods are demonstrated using gradient descent for three optimizations in Section 6, which provide insights on system configuration and MapReduce job design. For example, the guidelines for setting the number of reducers are not effective for all kinds of jobs, and the default block size of 64MB is not optimal for the sorting and database query workloads. Both system architects and end client programmers will benefit from our model, as shown at the end of Section 6.
Following this direction, we propose the following tasks as future work. A model of the impact of failures on performance is important and necessary. Long-running jobs are common in real systems, but failures are also commonly present. Therefore, our model will be more accurate and more widely applicable after a failure model is added.

More kinds of workloads should be identified to test the accuracy of our model, or to locate weak areas and improve it. Special scenarios may falsify some of our observations, but because of the weak assumptions used, we believe the major structure of the model is universally applicable.
The model could be made even more precise by modeling the random behavior with analytical statistical equations, instead of curve fitting through measured data. The random behavior Equation 16 is based on points common to the 3 workloads we ran, but other workloads may have different characteristics. A mathematically proven expectation equation would be an ideal solution, but is also hard to reach.

As the cluster size increases, the chance that some parts of the system fail also increases. If a MapReduce job is large and has a large number of mappers and reducers, the chance that failures occur is larger than for smaller jobs. In these cases the influence of failures on MapReduce's performance cannot be ignored. The performance model will become more complete after failure is included.

More optimization areas can be recognized. As demonstrated in the model application section, our model is able to explore many optimization opportunities effectively. More parameters can be optimized using similar methods. As our model becomes more accurate and generalized, we will have more confidence in using it to guide system design and implementation.
Bibliography
[1] E. Altman, K. Avrachenkov, and U. Ayesta. A survey on discriminatory processor sharing.
Queueing Systems, 53(1):53–63, June 2006.
[2] Amazon. Amazon Elastic Compute Cloud, 2010. http://aws.amazon.com/ec2/.
[3] Ö. Babaoğlu, L. Alvisi, A. Amoroso, R. Davoli, and L. Giachini. Paralex: An environment for parallel programming in distributed systems. In Proceedings of the 6th International Conference on Supercomputing, page 187. ACM, 1992.
[4] E. A. Brewer. Delivering high availability for Inktomi search engines. ACM SIGMOD Record,
27(2):538, June 1998.
[5] M. Burrows. The Chubby lock service for loosely-coupled distributed systems. In Proceedings of the 7th Symposium on Operating Systems Design and Implementation, pages 335–350.
USENIX Association, 2006.
[6] F. Chang, J. Dean, S. Ghemawat, W. Hsieh, D. Wallach, M. Burrows, T. Chandra, A. Fikes,
and R. Gruber. Bigtable: A distributed storage system for structured data. ACM Transactions
on Computer Systems (TOCS), 26(2):1–26, 2008.
[7] L. Cherkasova. Performance modeling in mapreduce environments: challenges and opportunities. In Proceedings of the Second Joint WOSP/SIPEW International Conference on Performance Engineering, pages 5–6, Karlsruhe, Germany, 2011. ACM.
[8] T. Condie, N. Conway, P. Alvaro, J. Hellerstein, K. Elmeleegy, and R. Sears. Mapreduce
online. In Proceedings of the 7th USENIX Conference on Networked Systems Design and
Implementation, pages 21–21. USENIX Association, 2010.
[9] J. Dean and S. Ghemawat. Mapreduce: Simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.
[10] D. DeWitt and J. Gray. Parallel database systems: the future of high performance database
systems. Communications of the ACM, 35(6):85–98, June 1992.
[11] A. Fox, S. Gribble, Y. Chawathe, E. Brewer, and P. Gauthier. Cluster-based scalable network
services. ACM SIGOPS Operating Systems Review, 31(5):78–91, 1997.
[12] E. Gabriel, G. Fagg, G. Bosilca, T. Angskun, J. Dongarra, J. Squyres, V. Sahay, P. Kambadur,
B. Barrett, and A. e. a. Lumsdaine. Open MPI: Goals, concept, and design of a next generation
MPI implementation. Recent Advances in Parallel Virtual Machine and Message Passing
Interface, pages 353–377, 2004.
[13] S. Ghemawat, H. Gobioff, and S. Leung. The Google file system. ACM SIGOPS Operating
Systems Review, 37(5):29–43, Dec. 2003.
[14] Hadoop. HBase Homepage. http://hbase.apache.org/.
[15] Hadoop. Companies PoweredBy Hadoop, 2010. http://wiki.apache.org/hadoop/PoweredBy.
[16] Hadoop. Hadoop Distributed File System, 2010. http://hadoop.apache.org/hdfs/.
[17] B. He, W. Fang, Q. Luo, N. Govindaraju, and T. Wang. Mars: a MapReduce framework
on graphics processors. In Proceedings of the 17th International Conference on Parallel
Architectures and Compilation Techniques, pages 260–269. ACM, ACM, 2008.
[18] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: distributed data-parallel
programs from sequential building blocks. In Proceedings of the 2nd ACM SIGOPS/EuroSys
European Conference on Computer Systems 2007, pages 59–72. ACM, 2007.
[19] Mathworks. Genetic Algorithm Solver - Global Optimization Toolbox for MATLAB, 2011. http://www.mathworks.com/products/global-optimization/description4.html.
[20] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig latin: a not-so-foreign
language for data processing. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pages 1099–1110, New York, New York, USA, 2008. ACM.
[21] O. O'Malley. Terabyte sort on Apache Hadoop. Yahoo, available online at http://sortbenchmark.org/Yahoo-Hadoop.pdf, pages 1–3, May 2008.
[22] R. Pike, S. Dorward, R. Griesemer, and S. Quinlan. Interpreting the data: Parallel analysis
with Sawzall. Scientific Programming, 13(4):277–298, 2005.
[23] C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, and C. Kozyrakis. Evaluating MapReduce for Multi-core and Multiprocessor Systems. 2007 IEEE 13th International Symposium
on High Performance Computer Architecture, pages 13–24, 2007.
[24] T. Sandholm and K. Lai. MapReduce optimization using regulated dynamic prioritization.
ACM Press, New York, New York, USA, June 2009.
[25] Y. C. Tay. Analytical Performance Modeling for Computer Systems, volume 2. Apr. 2010.
[26] A. Thusoo, J. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and
R. Murthy. Hive: A warehousing solution over a Map-Reduce framework. Proceedings of
the VLDB Endowment, 2(2):1626–1629, 2009.
[27] A. Thusoo, Z. Shao, S. Anthony, D. Borthakur, N. Jain, J. Sen Sarma, R. Murthy, and H. Liu.
Data warehousing and analytics infrastructure at facebook. In Proceedings of the 2010 International Conference on Management of Data, pages 1013–1020. ACM, 2010.
[28] Top500. TOP500 Supercomputing Sites, 2010. http://www.top500.org/.
[29] G. Wang, A. Butt, P. Pandey, and K. Gupta. A simulation approach to evaluating design
decisions in mapreduce setups. In IEEE International Symposium on Modeling, Analysis &
Simulation of Computer and Telecommunication Systems, 2009., pages 1–11. IEEE, 2009.
[30] T. White. 10 MapReduce Tips, 2009. http://www.cloudera.com/blog/2009/
05/10-mapreduce-tips/.
[31] Wikipedia. Gradient descent — Wikipedia, The Free Encyclopedia, 2011. http://en.wikipedia.org/w/index.php?title=Gradient_descent&oldid=432411072.
[32] Wikipedia. Shear mapping — Wikipedia, The Free Encyclopedia, 2011. http://en.wikipedia.org/w/index.php?title=Shear_mapping&oldid=425596906.
[33] H. Yang, A. Dasdan, R. Hsiao, and D. Parker. Map-reduce-merge: simplified relational
data processing on large clusters. In Proceedings of the 2007 ACM SIGMOD international
conference on Management of data, pages 1029–1040. ACM, 2007.
[34] R. Yoo, A. Romano, and C. Kozyrakis. Phoenix Rebirth: Scalable MapReduce on a NUMA
System. In Proceedings of the International Symposium on Workload Characterization
(IISWC), pages 198–207, 2009.
[35] Y. Yu, M. Isard, D. Fetterly, M. Budiu, U. Erlingsson, P. Gunda, and J. Currey. DryadLINQ: A
system for general-purpose distributed data-parallel computing using a high-level language.
In Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation, 2008.
[36] M. Zaharia, A. Konwinski, A. D. Joseph, R. Katz, and I. Stoica. Improving MapReduce
performance in heterogeneous environments. In Proceedings of the 8th USENIX Conference
on Operating systems design and implementation, OSDI’08, pages 29–42, Berkeley, 2008.
USENIX Association.