Ditp ch2
MapReduce Basics
In traditional parallel or distributed programming environments, the devel-
oper needs to explicitly address many (and sometimes, all) of the above issues.
In shared memory programming, the developer needs to explicitly coordinate
access to shared data structures through synchronization primitives such as
mutexes, to explicitly handle process synchronization through devices such as
barriers, and to remain ever vigilant for common problems such as deadlocks
and race conditions. Language extensions, like OpenMP for shared memory
parallelism,2 or libraries implementing the Message Passing Interface (MPI)
for cluster-level parallelism,3 provide logical abstractions that hide details of
operating system synchronization and communications primitives. However,
even with these extensions, developers must still keep track of how
resources are made available to workers. Additionally, these frameworks are
mostly designed to tackle processor-intensive problems and have only rudimen-
tary support for dealing with very large amounts of input data. When using
existing parallel computing approaches for large-data computation, the pro-
grammer must devote a significant amount of attention to low-level system
details, which detracts from higher-level problem solving.
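To make this burden concrete, consider a minimal shared-memory sketch (in Python, for illustration only; the function and variable names are ours, not any framework's). Even a trivially parallel task, having several threads increment a shared counter, requires the programmer to supply an explicit mutex; without it, interleaved updates are silently lost, precisely the kind of race condition described above.

```python
import threading

# A shared counter updated by several threads. The explicit lock is
# supplied by the programmer; omitting it produces lost updates
# (a classic race condition).
counter = 0
lock = threading.Lock()

def increment(n):
    """Add 1 to the shared counter n times, guarding each update."""
    global counter
    for _ in range(n):
        with lock:  # explicit coordination: acquire, update, release
            counter += 1

threads = [threading.Thread(target=increment, args=(100_000,))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 400000, but only because every update was guarded
```

Note that the lock, the thread bookkeeping, and the join logic are all the programmer's responsibility; none of it has anything to do with the computation itself. This is the low-level detail that MapReduce, discussed next, hides.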
One of the most significant advantages of MapReduce is that it provides an
abstraction that hides many system-level details from the programmer. There-
fore, a developer can focus on what computations need to be performed, as
opposed to how those computations are actually carried out or how to get
the data to the processes that depend on them. Like OpenMP and MPI,
MapReduce provides a means to distribute computation without burdening
the programmer with the details of distributed computing (but at a different
level of granularity). However, organizing and coordinating large amounts of
computation is only part of the challenge. Large-data processing by definition
requires bringing data and code together for computation to occur—no small
feat for datasets that are terabytes and perhaps petabytes in size! MapReduce
addresses this challenge by providing a simple abstraction for the developer,
transparently handling most of the details behind the scenes in a scalable, ro-
bust, and efficient manner. As we mentioned in Chapter 1, instead of moving
large amounts of data around, it is far more efficient, if possible, to move the
code to the data. This is operationally realized by spreading data across the
local disks of nodes in a cluster and running processes on nodes that hold
the data. The complex task of managing storage in such a processing envi-
ronment is typically handled by a distributed file system that sits underneath
MapReduce.
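The "what, not how" division of labor can be sketched in a few lines. The following is a toy, single-process illustration (in Python; the names `mapper`, `reducer`, and `run_job` are ours, and this is not the API of Hadoop or any other framework): the developer writes only the two small functions, while `run_job` stands in for everything the execution framework does behind the scenes, including grouping intermediate key-value pairs by key.

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # The "what": emit (word, 1) for every word in an input line.
    for word in line.split():
        yield (word, 1)

def reducer(word, counts):
    # The "what": sum the partial counts observed for one word.
    yield (word, sum(counts))

def run_job(lines):
    # The "how", normally hidden by the framework, simulated here
    # in a single process.
    pairs = [kv for line in lines for kv in mapper(line)]   # map phase
    pairs.sort(key=itemgetter(0))                           # stand-in for the shuffle
    out = {}
    for word, group in groupby(pairs, key=itemgetter(0)):   # reduce phase
        for k, v in reducer(word, (c for _, c in group)):
            out[k] = v
    return out

print(run_job(["the quick brown fox", "the lazy dog"]))
# → {'brown': 1, 'dog': 1, 'fox': 1, 'lazy': 1, 'quick': 1, 'the': 2}
```

In a real deployment, the sorting and grouping step would be carried out across many machines, with the data read from and written to the distributed file system; the point of the sketch is that none of that machinery appears in the mapper or reducer the developer writes.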
This chapter introduces the MapReduce programming model and the un-
derlying distributed file system. We start in Section 2.1 with an overview of
functional programming, from which MapReduce draws its inspiration. Sec-
tion 2.2 introduces the basic programming model, focusing on mappers and
reducers. Section 2.3 discusses the role of the execution framework in actually
running MapReduce programs (called jobs). Section 2.4 fills in additional de-
2 http://www.openmp.org/
3 http://www.mcs.anl.gov/mpi/