
Chapter 2

MapReduce Basics

The only feasible approach to tackling large-data problems today is to divide and conquer, a fundamental concept in computer science that is introduced very early in typical undergraduate curricula. The basic idea is to partition a large problem into smaller sub-problems. To the extent that the sub-problems are independent [5], they can be tackled in parallel by different workers—threads in a processor core, cores in a multi-core processor, multiple processors in a machine, or many machines in a cluster. Intermediate results from each individual worker are then combined to yield the final output.1
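
To make this concrete, consider the following minimal sketch (in Java; an illustration, not from the original text) of divide and conquer for a trivially parallelizable problem: summing a large array. The input is partitioned into independent slices, each worker thread sums its own slice, and the partial sums are combined to yield the final output.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelSum {
    public static void main(String[] args) throws Exception {
        long[] data = new long[1_000_000];
        Arrays.fill(data, 1L);

        int workers = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        List<Future<Long>> partials = new ArrayList<>();

        // Partition the large problem into independent sub-problems.
        int chunk = (data.length + workers - 1) / workers;
        for (int w = 0; w < workers; w++) {
            final int lo = w * chunk;
            final int hi = Math.min(lo + chunk, data.length);
            // Each worker sums its own slice; the slices share no state.
            partials.add(pool.submit(() -> {
                long sum = 0;
                for (int i = lo; i < hi; i++) sum += data[i];
                return sum;
            }));
        }

        // Combine the intermediate results to yield the final output.
        long total = 0;
        for (Future<Long> f : partials) total += f.get();
        pool.shutdown();

        System.out.println("total = " + total); // prints: total = 1000000
    }
}

Even in this toy setting the programmer must decide how to partition the input and how to combine partial results; the questions below generalize exactly these concerns to distributed environments.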
The general principles behind divide-and-conquer algorithms are broadly
applicable to a wide range of problems in many different application domains.
However, the details of their implementations are varied and complex. For
example, the following are just some of the issues that need to be addressed:

• How do we break up a large problem into smaller tasks? More specifically, how do we decompose the problem so that the smaller tasks can be executed in parallel?
• How do we assign tasks to workers distributed across a potentially large
number of machines (while keeping in mind that some workers are better
suited to running some tasks than others, e.g., due to available resources,
locality constraints, etc.)?
• How do we ensure that the workers get the data they need?
• How do we coordinate synchronization among the different workers?
• How do we share partial results from one worker that are needed by another?
• How do we accomplish all of the above in the face of software errors and
hardware faults?
1 We note that promising technologies such as quantum or biological computing could potentially induce a paradigm shift, but they are far from being sufficiently mature to solve real-world problems.

In traditional parallel or distributed programming environments, the developer needs to explicitly address many (and sometimes, all) of the above issues.
In shared memory programming, the developer needs to explicitly coordinate
access to shared data structures through synchronization primitives such as
mutexes, to explicitly handle process synchronization through devices such as
barriers, and to remain ever vigilant for common problems such as deadlocks
and race conditions. Language extensions, like OpenMP for shared memory
parallelism,2 or libraries implementing the Message Passing Interface (MPI)
for cluster-level parallelism,3 provide logical abstractions that hide details of
operating system synchronization and communications primitives. However,
even with these extensions, developers are still burdened with keeping track of how resources are made available to workers. Additionally, these frameworks are
mostly designed to tackle processor-intensive problems and have only rudimen-
tary support for dealing with very large amounts of input data. When using
existing parallel computing approaches for large-data computation, the pro-
grammer must devote a significant amount of attention to low-level system
details, which detracts from higher-level problem solving.
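
For instance, the following Java sketch (illustrative only) shows the kind of explicit coordination this paragraph describes: a mutex guarding a shared counter and a barrier separating two phases of computation. Omitting the lock introduces a race condition; acquiring multiple locks in inconsistent orders elsewhere invites deadlock.

import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.locks.ReentrantLock;

public class ExplicitCoordination {
    private static long sharedCount = 0;
    private static final ReentrantLock lock = new ReentrantLock(); // mutex

    public static void main(String[] args) throws InterruptedException {
        final int workers = 4;
        // Barrier: no thread enters phase two until all finish phase one.
        final CyclicBarrier barrier = new CyclicBarrier(workers);

        Thread[] threads = new Thread[workers];
        for (int w = 0; w < workers; w++) {
            threads[w] = new Thread(() -> {
                // Phase one: update a shared data structure. Every access
                // must be guarded explicitly by the programmer.
                lock.lock();
                try {
                    sharedCount++;
                } finally {
                    lock.unlock();
                }
                try {
                    barrier.await(); // explicit worker synchronization
                } catch (InterruptedException | BrokenBarrierException e) {
                    Thread.currentThread().interrupt();
                }
                // Phase two: all updates from phase one are now visible.
            });
            threads[w].start();
        }
        for (Thread t : threads) t.join();
        System.out.println("count = " + sharedCount); // prints: count = 4
    }
}

None of this machinery advances the actual computation; it exists solely to keep the workers from corrupting each other's state.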
One of the most significant advantages of MapReduce is that it provides an
abstraction that hides many system-level details from the programmer. There-
fore, a developer can focus on what computations need to be performed, as
opposed to how those computations are actually carried out or how to get
the data to the processes that depend on them. Like OpenMP and MPI,
MapReduce provides a means to distribute computation without burdening
the programmer with the details of distributed computing (but at a different
level of granularity). However, organizing and coordinating large amounts of
computation is only part of the challenge. Large-data processing by definition
requires bringing data and code together for computation to occur—no small
feat for datasets that are terabytes and perhaps petabytes in size! MapReduce
addresses this challenge by providing a simple abstraction for the developer,
transparently handling most of the details behind the scenes in a scalable, ro-
bust, and efficient manner. As we mentioned in Chapter 1, instead of moving
large amounts of data around, it is far more efficient, if possible, to move the
code to the data. This is operationally realized by spreading data across the
local disks of nodes in a cluster and running processes on nodes that hold
the data. The complex task of managing storage in such a processing envi-
ronment is typically handled by a distributed file system that sits underneath
MapReduce.
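
As a preview of the programming model developed in Section 2.2, the following toy, single-machine emulation of word count in plain Java (a sketch of the idea, not the actual MapReduce API) shows the division of labor: the programmer supplies only a mapper and a reducer, while the grouping of intermediate key-value pairs (an in-memory map here) stands in for the shuffling that a real framework performs across the network.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ToyWordCount {
    // Mapper: emit an intermediate (word, 1) pair for every word in a record.
    static List<Map.Entry<String, Integer>> map(String record) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : record.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) out.add(Map.entry(word, 1));
        }
        return out;
    }

    // Reducer: sum all counts observed for a single key.
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) sum += c;
        return sum;
    }

    public static void main(String[] args) {
        String[] records = { "the quick brown fox", "the lazy dog", "the fox" };

        // "Shuffle": group intermediate values by key. In a real framework
        // this step moves data between machines; here it is a local map.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String record : records) {
            for (Map.Entry<String, Integer> kv : map(record)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                       .add(kv.getValue());
            }
        }

        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            System.out.println(e.getKey() + "\t" + reduce(e.getKey(), e.getValue()));
        }
    }
}

Everything outside map() and reduce() is precisely the machinery that MapReduce hides from the developer.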
This chapter introduces the MapReduce programming model and the un-
derlying distributed file system. We start in Section 2.1 with an overview of
functional programming, from which MapReduce draws its inspiration. Sec-
tion 2.2 introduces the basic programming model, focusing on mappers and
reducers. Section 2.3 discusses the role of the execution framework in actually
running MapReduce programs (called jobs). Section 2.4 fills in additional details by introducing partitioners and combiners, which provide greater control over data flow.

2 http://www.openmp.org/
3 http://www.mcs.anl.gov/mpi/
