
Big Data Introduction

Data is largely classified as Structured, Semi-Structured and Un-Structured.

If we know the fields as well as their datatype, then we call it structured. The data in relational
databases such as MySQL, Oracle or Microsoft SQL is an example of structured data.

If we know the fields or columns but not their datatypes, we call it semi-structured data. For example, data in a CSV (comma-separated values) file is semi-structured data.

If our data doesn't contain columns or fields, we call it unstructured data. Plain text files or logs generated on a server are examples of unstructured data.

The process of translating unstructured data into structured data is known as ETL: Extract, Transform and Load.

Distributed Systems
What is Distributed System?

When networked computers are utilized to achieve a common goal, it is known as a distributed
system. The work gets distributed amongst many computers. The branch of computing that
studies distributed systems is known as distributed computing. The purpose of distributed
computing is to get the work done faster by utilizing many computers.

Most but not all the tasks can be performed using distributed computing.

What is Big Data?


In very simple words, Big Data is data of such a big size that it cannot be processed with the usual tools. To process such data we need a distributed architecture. This data could be structured or unstructured.

Generally, we classify the problems related to handling data into three buckets:

Volume: When the problem we are solving is related to how we would store such huge data, we call it Volume. For example, Facebook handles more than 500 TB of data per day and has around 300 PB of data in storage.

Velocity: When we are trying to handle many requests per second, we call this characteristic Velocity. The number of requests received by Facebook or Google per second is an example of Big Data due to Velocity.

Variety: If the problem at hand is complex, or the data that we are processing is complex, we call such problems related to Variety.

Imagine you have to find the fastest route on a map. Since the problem involves enumerating many possibilities, it is a complex problem even though the map's size would not be too huge.

Data could be termed as Big Data if either Volume, Velocity or Variety becomes impossible to
handle using traditional tools.

Why do we need big data ?


Paper, Tapes etc are Analog storage while CDs, DVDs, hard disk drives are considered digital
storage.

This graph shows that digital storage started increasing exponentially after 2002, while analog storage remained practically the same.

The year 2002 is called the beginning of the digital age.

Why so? The answer is twofold: devices and connectivity. On one hand, devices became cheaper, faster and smaller; the smartphone is a great example. On the other hand, connectivity improved: we have WiFi, 4G, Bluetooth, NFC etc.

This led to a lot of very useful applications such as a very vibrant world wide web, social networks, and the Internet of Things, leading to huge data generation.

Roughly, a computer is made of 4 components. 1. CPU - which executes instructions. A CPU is characterized by its speed: the more instructions it can execute per second, the faster it is considered.

2. Then comes RAM - random access memory. While processing, we load data into RAM. If we can load more data into RAM, the CPU can perform better. So, RAM has two main attributes which matter: its size and its speed of reading and writing.
3. To permanently store data, we need a hard disk drive or a solid state drive. The SSD is faster but smaller and costlier. The faster and bigger the disk, the faster we can process data.

4. Another component that we frequently forget while thinking about the speed of computation is the network. Why? Often our data is stored on different machines and we need to read it over a network to process it.

While processing Big Data, at least one of these four components becomes the bottleneck. That's where we need to move to multiple computers, or a distributed computing architecture.

Big Data applications - recommendations


So far we have tried to establish that while handling humongous data we would need a new set of tools which can operate in a distributed fashion.

But who would be generating such data or who would need to process such humongous data?
Quick answer is everyone.

Now, let us try to take few examples.

In e-commerce industry, the recommendation is a great example of Big Data processing.


Recommendation, also known as collaborative filtering, is the process of suggesting a product to someone based on their preferences or behavior.

The e-commerce website would gather a lot of data about the customer's behavior. A very simplistic algorithm would basically try to find similar users and then cross-suggest them products. So, the more the users, the better the results.

As per Amazon, a major chunk of their sales happens via recommendations on the website and email. The other big example of Big Data processing was the Netflix one million dollar competition to generate movie recommendations.

As of today, generating recommendations has become pretty simple. Engines such as MLlib or Mahout have made it very simple to generate recommendations on humongous data. All you have to do is format the data in a three-column format: user id, movie id, and rating.
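
For illustration, here is a minimal sketch of generating recommendations from such three-column data with Spark's MLlib in Python (the file name, column names and the SparkSession setup are assumptions, not part of the course material):

from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("recommendations").getOrCreate()
# ratings.csv is a hypothetical file with the three columns: user id, movie id, rating
ratings = spark.read.csv("ratings.csv", schema="userId INT, movieId INT, rating FLOAT")
als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating")
model = als.fit(ratings)
model.recommendForAllUsers(5).show()   # top 5 movie suggestions for every user
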
Big Data applications - A/B Testing
Another application is A/B testing. The idea of A/B testing is to compare two or more versions of
the same thing, and analyze which one works better. It not only helps us understand which
version works better, but also provides evidence to understand if the difference between two
versions is statistically significant or not.

Let us go back to the history of A/B testing. Earlier, in the 1950s, scientists and medical researchers started conducting A/B testing to run clinical trials to test drug efficacy. In the 1960s and 1970s, marketers adopted this strategy to evaluate direct response campaigns. For instance, would a
postcard or a letter to target customers result in more sales? Later with the advent of the world
wide web, the concept of A/B testing has also been adopted by the software industry to
understand the user preferences. So basically, the concept which was applied offline, is now
being applied online.

Let us now understand why A/B testing is so important. In real-world scenarios, companies might think that they know their customers well. For example, a company anticipates that variation B of the website would be more effective in making more sales compared to variation A. But in reality, users rarely behave as expected, and variation A might lead to more sales. So, to understand their users better, companies rely on data-driven approaches. The companies analyse and understand user behaviour based on the user data they have. Thus, the more the data, the fewer the errors. This would in turn contribute to making reliable business decisions, which could lead to increased user engagement, improved user experience, a boost in company revenue, staying ahead of the competitors, and many more.

Now let us see more clearly how it is done. Say we have an e-commerce website, and we want to see the impact of the color of the Buy Now button of a product. We can randomly select 50% of the users to display the button with the new color (say blue), while the remaining users are shown the old (here green) colored button. We see that 23% of the people who were shown the new version of the button bought the product, whereas only 11% of the people who were shown the older button ended up buying the product. Thus the newer version seems to be more effective.
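
To check whether such a difference is statistically significant, one common approach is a two-proportion z-test. Here is a rough sketch in plain Python; the sample size of 1000 users per group is an assumption, and only the 23% and 11% conversion rates come from the example above:

from math import sqrt, erf

n_a, conv_a = 1000, 110   # old green button: 11% bought (assumed group size)
n_b, conv_b = 1000, 230   # new blue button: 23% bought (assumed group size)
p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)
z = (p_b - p_a) / sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))   # two-sided test
print(z, p_value)   # a very small p-value suggests the difference is significant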

Let us have a look at how some A/B testing cases led to more effective decisions in the real world. In 2009, a team at Google couldn't decide between two shades. So they tested 41 different shades of blue to decide which color to use for advertisement links in Gmail. The company showed each shade of blue to 1% of users. A slightly purple shade of blue got the maximum clicks, resulting in a $200 million boost in ad revenue. Likewise, Netflix, Amazon, Microsoft and many others use A/B testing. At Amazon, we have many such experiments running all the time. Every feature is launched via A/B testing. It is first shown to, say, 1 percent of users, and if it is performing well, we increase the percentage.
Big Data Customers

== Government ==

Since governments have huge data about their citizens, any analysis would be Big Data analysis. The applications are many.

First is fraud detection. Be it anti-money laundering or user identification, the amount of data processing required is really high.

In cyber security, welfare and justice, Big Data is being generated and Big Data tools are getting adopted.

== Telecom ==

The telecom companies can use Big Data in order to understand why their customers are leaving and how they can prevent the customers from leaving. This is known as customer churn prevention. The data that could help in customer churn prevention includes: 1. How many calls did the customer make to the call center? 2. For how long were they out of the coverage area? 3. What was the usage pattern?

The other use-case is network performance optimization. Based on the past history of traffic, the
telecoms can forecast the network traffic and accordingly optimize the performance.

The third most common use-case of Big Data in the telecommunication industry is Calling Data Record analysis. Since a telecom company has millions of users and each user makes hundreds of calls per day, analysing the calling data records is a Big Data problem.

It is very much possible to predict the failure of hardware based on all the data points recorded when previous failures occurred. A seemingly impossible task becomes possible because of the sheer volume of data.

== Healthcare ==

Healthcare inherently has humongous data and complex problems to solve. Such problems can now be solved with the new Big Data technologies, as even supercomputers could not solve most of these problems.

A few examples of such problems are health information exchange, gene sequencing, healthcare improvements and drug safety.
Big Data Solutions
There are many Big Data Solution stacks.

The first and most powerful stack is Apache Hadoop and Spark together. While Hadoop provides storage for structured and unstructured data, Spark provides the computational capability on top of Hadoop.

The second way could be to use Cassandra or MongoDB.

Third could be to use Google Compute Engine or Microsoft Azure. In such cases, you would
have to upload your data to Google or Microsoft which may not be acceptable to your
organization sometimes.

What is Apache Hadoop?


Hadoop was created by Doug Cutting in order to build his search engine called Nutch. He was
joined by Mike Cafarella. Hadoop was based on the three papers published by Google: Google
File System, Google MapReduce, and Google Big Table.

It is named after the toy elephant of Doug Cutting's son.

Hadoop is under Apache license which means you can use it anywhere without having to worry
about licensing.

It is quite powerful, popular and well supported.

It is a framework to handle Big Data.

Started as a single project, Hadoop is now an umbrella of projects. All of the projects under the Apache Hadoop umbrella should have the following three characteristics: 1. Distributed - they should be able to utilize multiple machines in order to solve a problem. 2. Scalable - if needed, it should be very easy to add more machines. 3. Reliable - if some of the machines fail, it should still work fine. These are the three criteria for all the projects or components to be under Apache Hadoop.

Hadoop is written in Java so that it can run on all kinds of devices.


Overview of apache Hadoop ecosystem
The Apache Hadoop is a suite of components. Let us take a look at each of these components
briefly. We will cover the details in depth during the full course.

HDFS
HDFS or Hadoop Distributed File System is the most important component because the entire
eco-system depends upon it. It is based on Google File System.

It is basically a file system which runs on many computers to provide a humongous storage. If
you want to store your petabytes of data in the form of files, you can use HDFS.

YARN
YARN or Yet Another Resource Negotiator keeps track of all the resources (CPU, memory) of the machines in the network and runs applications on them. Any application which wants to run in a distributed fashion would interact with YARN.

HBase
HBase provides humongous storage in the form of a database table. So, to manage humongous
records, you would like to use HBase.

HBase is a kind of NoSQL datastore.

MapReduce
MapReduce is a framework for distributed computing. It utilizes YARN to execute programs and
has a very good sorting engine.

You write your program in two parts: Map and Reduce. The Map part transforms the raw data into key-value pairs, and the Reduce part groups and combines data based on the key. We will learn MapReduce in detail later.
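
As a toy illustration of the idea, here is a word count written in plain Python that mimics the map and reduce phases locally (the dictionary below stands in for Hadoop's shuffle-and-sort step, which in reality happens across machines):

from collections import defaultdict

def mapper(line):
    # map phase: emit a (key, value) pair for every word in the record
    for word in line.split():
        yield (word, 1)

def reducer(word, counts):
    # reduce phase: combine all values grouped under the same key
    return (word, sum(counts))

lines = ["big data is big", "data is everywhere"]
groups = defaultdict(list)
for line in lines:
    for key, value in mapper(line):
        groups[key].append(value)   # stand-in for shuffle and sort
print([reducer(k, v) for k, v in sorted(groups.items())])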

Spark
Spark is another computational framework similar to MapReduce but faster and more recent. It
uses similar constructs as MapReduce to solve big data problems.

Spark has its own huge stack. We will cover it in detail soon.

Hive
Writing code in MapReduce is very time-consuming. So, Apache Hive makes it possible to write
your logic in SQL which internally converts it into MapReduce. So, you can process humongous
structured or semi-structured data with simple SQL using Hive.

Pig (Latin)
Pig Latin is a simplified SQL like language to express your ETL needs in stepwise fashion. Pig
is the engine that translates Pig Latin into Map Reduce and executes it on Hadoop.
Mahout
Mahout is a library of machine learning algorithms that run in a distributed fashion. Since
machine learning algorithms are complex and time-consuming, mahout breaks down work such
that it gets executed on MapReduce running on many machines.

ZooKeeper
Apache Zookeeper is an independent component which is used by various distributed
frameworks such as HDFS, HBase, Kafka, YARN. It is used for the coordination between
various components. It provides a distributed configuration service, synchronization service, and
naming registry for large distributed systems.

Flume
Flume makes it possible to continuously pump the unstructured data from many sources to a
central source such as HDFS.

If you have many machines continuously generating data such as Webserver Logs, you can use
flume to aggregate data at a central place such as HDFS.

SQOOP
Sqoop is used to transport data between Hadoop and SQL Databases. Sqoop utilizes
MapReduce to efficiently transport data using many machines in a network.

Oozie
Since a project might involve many components, there is a need of a workflow engine to
execute work in sequence.

For example, a typical project might involve importing data from SQL Server, running some Hive
Queries, doing predictions with Mahout, Saving data back to an SQL Server.

This kind of workflow can be easily accomplished with Oozie.

User Interaction
A user can either talk to the various components of Hadoop using Command Line Interface,
Web interface, API or using Oozie.

We will cover each of these components in details later.

Apache Spark is a fast and general engine for large-scale data processing.

It is around 100 times faster than MapReduce when using only RAM, and around 10 times faster when using the disk.
It builds upon similar paradigms as MapReduce.

It is well integrated with Hadoop as it can run on top of YARN and can access HDFS.

Resource Managers
A cluster resource manager or resource manager is a software component which manages the
various resources such as memory, disk, CPU of the machines connected in the cluster.

Apache Spark can run on top of many cluster resource managers such as YARN, Amazon EC2 or Mesos. If you don't have any resource manager yet, you can use Apache Spark in standalone mode.

Sources
Instead of building own file or data storages, Apache spark made it possible to read from all
kinds of data sources: Hadoop Distributed File System, HBase, Hive, Tachyon, Cassandra.

Libraries
Apache Spark comes with a great set of libraries. DataFrames provide a generic way to represent data in a tabular structure. DataFrames make it possible to query data using R or SQL instead of writing tonnes of code.
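
For instance, a small PySpark sketch of this idea (assuming a running SparkSession named spark and a hypothetical people.json file with name and age fields):

df = spark.read.json("people.json")
df.filter(df.age > 30).select("name", "age").show()        # DataFrame API
df.createOrReplaceTempView("people")
spark.sql("SELECT name, AVG(age) AS avg_age FROM people GROUP BY name").show()   # plain SQL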

Streaming Library makes it possible to process fast incoming streaming of huge data using
Spark.

MLLib is a very rich machine learning library. It provides very sophisticated algorithms which run
in distributed fashion.

GraphX makes it very simple to represent huge data as a graph. It provides a library of algorithms to process graphs using multiple computers.

Spark and its libraries can be used with Scala, Java, Python, R, and SQL. The only exception is
GraphX which can only be used with Scala and Java.

With this set of libraries, it is possible to do ETL, machine learning, real-time data processing and graph processing on Big Data.

We will cover each component in details as we go forward.

Zookeeper Introduction
What is a Race Condition?
A race condition occurs when two processes compete with each other, causing data corruption.

As shown in the diagram, two persons are trying to deposit 1 dollar online into the same bank
account. The initial amount is 17 dollar. Due to race conditions, the final amount in the bank is
18 dollar instead of 19.

What is a Dead Lock? When two processes are waiting for each other directly or indirectly, it is
called dead lock.

As you can see in second diagram, process 1 is waiting for process 2 and process 2 is waiting
for process 3 to finish and process 3 is waiting for process 1 to finish. All these three processes
would keep waiting and will never end. This is called dead lock.

Coordination
Say, there is an inbox from which we need to index emails. Indexing is a heavy process and
might take a lot of time. So, you have multiple machines which are indexing the emails. Every
email has an id. You can not delete any email. You can only read an email and mark it read or
unread. Now how would you handle the coordination between multiple indexer processes so
that every email is indexed?

If the indexers were running as multiple threads of a single process, it would be easier, by way of using the synchronization constructs of the programming language.

But since there are multiple processes running on multiple machines which need to coordinate,
we need a central storage. This central storage should be safe from all concurrency related
problems. This central storage is exactly the role of Zookeeper.

So what is ZooKeeper? In very simple words, it is a central key-value store using which distributed systems can coordinate. Since it needs to be able to handle the load, ZooKeeper itself runs on many machines.

ZooKeeper provides a simple set of primitives and it is very easy to program to. It uses a data model like a directory tree.

It is used for synchronization, locking, maintaining configuration and failover management.

It does not suffer from Race Conditions and Dead Locks.


Data Model

The way you store data in any store is called its data model. In the case of ZooKeeper, think of the data model as a highly available file system with a few differences.

We store data in an entity called a znode. The data that we store is typically in JSON format, which is JavaScript Object Notation.

A znode can only be updated; it does not support append operations. A read or write is an atomic operation, meaning it either completes fully or throws an error if it fails. There is no intermediate state like half-written data.

A znode can have children. So, znodes inside znodes make a tree-like hierarchy. The top-level znode is "/".

The znode "/zoo" is a child of "/", which is the top-level znode. duck is a child znode of zoo. It is denoted as /zoo/duck.

Though "." or ".." are invalid characters as opposed to the file system.

Data Model - Znode


Types of Znodes == There are three types of znodes or nodes: Persistent, Ephemeral and
Sequential.

== Types of Znodes - Persistent == Such znodes remain in ZooKeeper until deleted. This is the default type of znode. To create such a node you can use the command: create /name_of_myznode "mydata"

== Types of Znodes - Ephemeral == An ephemeral node gets deleted if the session in which the node was created disconnects. Though it is tied to the client's session, it is visible to other users.

An ephemeral node cannot have children, not even ephemeral children.

== Types of Znodes - Sequential == Quite often, we need to create sequential numbers such as ids. In such situations we use sequential nodes.

Sequential znodes are created with a number appended to the provided name.
You can create such a znode by using create -s. The following command would create a node with the name zoo followed by a number: create -s /zoo v

This number keeps increasing monotonically on every node creation inside a particular node.
The first sequential child node gets a suffix of 0000000000 for any node.

Getting Started
To get started, log in to the web console or ssh into CloudxLab. Type zookeeper-client. This would bring up the ZooKeeper prompt. This is where we are going to type ZooKeeper commands.

Let's take a look at all the znodes at the top level by typing ls / You can see there are some znodes. Let's see the children znodes under a node called brokers using the following command: ls /brokers Do you see three children znodes: ids, topics, seqid?

Now, let's see the data inside the znode brokers using get /brokers

You will see all the details about the znode.

Frequently Asked Questions:

Q: I am getting an error while logging in; the error is "invalid login or timed out".

A: Make sure you're not trying to use SSH on the web console; SSH is for accessing the CloudxLab web console from your local machine's terminal. You can simply go to the web console, enter your credentials and start using it. Also, you have to log in within 60 seconds.

Q: I am getting an error while issuing the command zookeeper -client; the error is -bash: zookeeper: command not found

A: The command to launch ZooKeeper is: zookeeper-client Please notice there is no space after "zookeeper". In your command, you have put a space before "-client".

Q: Can you please provide the details regarding the znodes, brokers and all the terms used here?

A: Earlier, this video came before the theory. The intention was to get you started with the commands before learning the theory, but many people complained about it. So, we have now moved it after the theory of znodes and brokers. I hope you will be comfortable with this sequence.
In case you are facing problems establishing the connection, follow these steps:

● Go to My Lab and log in to the web console.


● Login to Ambari simultaneously.
● On Ambari, click on ZooKeeper; you will be able to see the list of servers currently running. The first two servers are the namenode and the secondary namenode.
● Click on any server except the first two and copy the server name.
● Now go to the web console and run the following command zookeeper-client
-server <<server_name>>

== Hands On == Let's switch to the zookeeper console as opened earlier.

We will try to create a persistent znode. We are going to create a znode with our name using the command: create /cloudxlab mydata Let's check if it is created by going through the list of znodes: ls / Also, check the data inside the znode using: get /cloudxlab

To delete a znode cloudxlab, we can simply use rmr /cloudxlab

You could try creating a znode with your own login name using: create /<your_login> "mydata"

Now, let's try to create an ephemeral node. Switch to the zookeeper console as opened earlier. Create an ephemeral node myeph using create -e /myeph somerandomdata

Now, exit the zookeeper-client by typing quit and pressing enter. Open the zookeeper-client again. Try to check if the node exists or not using: ls /myeph It should throw an error "Node does not exist"

Now, let's try to create sequential nodes. Instead of polluting the top-level znode "/", let's create our persistent node first with the name cloudxlab using the command: create /cloudxlab "mydata"

Let's create the first sequential child with a name starting with x under /cloudxlab using: create -s /cloudxlab/x "somedata"

It says it created a znode under cloudxlab with the name x followed by zeros.

If we create another sequential node in it, it would be suffixed with 1. Let's take a look: create -s /cloudxlab/y "someotherdata"
It should print: Created /cloudxlab/y0000000001

Now, even if we delete the previously created node using the rmr command [execute rmr /cloudxlab/x0000000000]

and try to create another sequential node, the deletion would have no impact on the sequence number. Let's take a look: create -s /cloudxlab/x data

As you can see, the new number is 2.

Note: In the real world, the data we want to store in znodes would be way more complex than simple strings. JSON is often used to store such data. We could create such a znode as follows:

create /cxl/myznode "{'a':1,'b':2}"

Architecture

Zookeeper can run in two modes: Standalone and Replicated.

In standalone mode, it just runs on one machine, and for practical purposes we do not use standalone mode. It is only for testing purposes as it doesn't have high availability.

In production environments and in all practical use cases, the replicated mode is used. In replicated mode, ZooKeeper runs on a cluster of machines, which is called an ensemble. Basically, ZooKeeper servers are installed on all of the machines in the cluster. Each ZooKeeper server is informed about all of the machines in the ensemble.

As soon as the ZooKeeper servers on all of the machines in the ensemble are turned on, phase 1, the leader election phase, starts. This election is based on a Paxos-like algorithm.

The machines in the ensemble vote for other machines based on the ping response and freshness of data. This way a distinguished member called the leader is elected. The rest of the servers are termed followers. Once all of the followers have synchronized their state with the newly elected leader, the election phase finishes.

The election does not succeed if a majority is not available to vote. Majority means more than 50% of the machines. Out of 20 machines, a majority means 11 or more machines.
If at any point the leader fails, the rest of the machines in the ensemble hold an election within 200 milliseconds.

If the majority of the machines aren't available at any point of time, the leader automatically
steps down.

The second phase is called Atomic Broadcast. Any request from a user for writing, modifying or deleting data is redirected to the leader by the followers. So, there is always a single machine on which modifications are being accepted. Requests to read data, such as ls or get, are served by all of the machines.

Once the leader has accepted a change from a user, the leader broadcasts the update to the followers - the other machines. This broadcast and synchronization might take time, and hence for some time some of the followers might be serving slightly older data. That is why ZooKeeper provides eventual consistency, not strict consistency.

When a majority have saved or persisted the change to disk, the leader commits the update and the client or user is sent a confirmation.

The protocol for achieving consensus is atomic, similar to two-phase commit. Also, to ensure the durability of a change, the machines write it to disk before memory.

Election and Majority


If you have three nodes A, B, C with A as Leader. And A dies. Will someone become leader?

Either B or C will become the leader.

If you have three nodes A, B, C with C being the leader. And A and B die. Will C remain Leader?

C will step down. No one will be the Leader because majority is not available.

As we discussed, if 50% or fewer machines are available, there will be no leader and hence ZooKeeper will be read-only. Don't you think ZooKeeper is wasting so many resources?

The question is why does zookeeper need majority for election?

Say we have an ensemble spread over two data centers: three machines A, B, C in data center 1 and three other machines D, E, F in data center 2. Say A is the leader of the ensemble. And say the network between the data centers got disconnected while the internal network of each of the centers is still intact.

If we did not need majority for electing Leader, what will happen?

Each data center will have its own leader and there will be two independent nodes accepting modifications from the users. This would lead to irreconcilable changes and hence inconsistency. This is why we need a majority for election in Paxos-like algorithms.

Sessions

Let's try to understand how ZooKeeper decides to delete ephemeral nodes and takes care of session management.

A client has a list of servers in the ensemble. The client enumerates over the list and tries to connect to each until it is successful. The server creates a new session for the client. A session has a timeout period, decided by the client. If the server hasn't received a request within the timeout period, it may expire the session. On session expiry, ephemeral nodes are deleted. To keep sessions alive, the client sends pings, also known as heartbeats. The client library takes care of heartbeats and session management.

The session remains valid even on switching to another server.

Though the failover is handled automatically by the client library, the application cannot remain completely agnostic of server reconnections, because an operation might fail while switching to another server.

Use case - 1

Let us say there are many servers which can respond to your request and there are many clients which might want the service. From time to time some of the servers will keep going down. How can all of the clients keep track of the available servers?
It is very easy using ZooKeeper as a central agency. Each server will create its own ephemeral znode under a particular znode, say "/servers". The clients would simply query ZooKeeper for the most recent list of servers.

Let's take the case of two servers and a client. The two servers, duck and cow, created their ephemeral nodes under the "/servers" znode. The client would simply discover the alive servers cow and duck using the command ls /servers.

Say a server called "duck" goes down; its ephemeral node will disappear from the /servers znode, and hence the next time the client comes and queries, it would only get "cow".

So, coordination has been heavily simplified and made efficient because of ZooKeeper.
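
A minimal sketch of this pattern using the kazoo Python client (assumptions: kazoo is installed, ZooKeeper is reachable at localhost:2181, and the znode names follow the example above):

from kazoo.client import KazooClient

zk = KazooClient(hosts="localhost:2181")
zk.start()
zk.ensure_path("/servers")
# a server registers itself; the ephemeral znode disappears when its session ends
zk.create("/servers/duck", b"duck.example.com:8080", ephemeral=True)
# a client discovers the servers that are currently alive
print(zk.get_children("/servers"))
zk.stop()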

Guarantees
What kind of guarantees does ZooKeeper provide?

Sequential consistency: Updates from any particular client are applied in the order in which they were sent.
Atomicity: Updates either succeed or fail.
Single system image: A client will see the same view of the system; a new server will not accept connections until it has caught up.
Durability: Once an update has succeeded, it will persist and will not be undone.
Timeliness: Rather than allowing a client to see very stale data, a server would prefer to shut down.

Ops

ZooKeeper provides the following operations. We have already gone through some of these.

create: Creates a znode (the parent znode must exist)
delete: Deletes a znode (mustn't have children)
exists / ls: Tests whether a znode exists and gets its metadata
getACL, setACL: Gets/sets the ACL for a znode
getChildren / ls: Gets a list of the children of a znode
getData / get, setData: Gets/sets the data associated with a znode
sync: Synchronizes a client's view of a znode with ZooKeeper

== Multiupdate == ZooKeeper provides multiupdate functionality. It batches multiple operations together: either all fail or all succeed in their entirety, and others never observe any inconsistent state. It is possible to implement transactions using multiupdate.
API
You can use the ZooKeeper from within your application via APIs - application programming
interface.

Though ZooKeeper provides the core APIs in Java and C, there are contributed libraries in Perl,
Python, REST.

sync
For each API function, both synchronous and asynchronous variants are available.

While using synchronous APIs, the caller or client waits till ZooKeeper finishes an operation. If you are using the asynchronous API, the client provides a function that will be called once ZooKeeper finishes the operation.

Notes
Whenever we execute a usual function or method, it does not go to the next step until the
function has finished. Right? Such methods are called synchronous.

But when you need to do some work in the background you would require async or
asynchronous functions. When you call an async function the control goes to the next step or
you can say the function returns immediately. The function starts to work in the background.

Say, you want to create a Znode in zookeeper, if you call an async API function of Zookeeper, it
will create Znode in the background while the control goes immediately to the next step. The
async functions are very useful for doing work in parallel.

Now, suppose something is running in the background and you want to be notified when the work is done. In such cases, you define a function which you want to be called once the work is done. This function is passed as an argument to the async API call and is called a callback.
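
As a hedged sketch of this idea using the kazoo Python client (an assumption; ZooKeeper's core APIs are in Java and C), the *_async variants return a result handle on which a callback can be registered:

from kazoo.client import KazooClient

zk = KazooClient(hosts="localhost:2181")
zk.start()
zk.ensure_path("/cxl")

def on_created(async_result):
    # the callback: invoked once ZooKeeper has finished the operation
    print("created:", async_result.get())

result = zk.create_async("/cxl/async_demo", b"somedata")
result.rawlink(on_created)   # register the callback
result.get()                 # wait here only so this small script exits cleanly
zk.stop()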

Watches

Similar to triggers in databases, ZooKeeper provides watches. The objective of watches is to get notified when a znode changes in some way. Watches are triggered only once. If you want recurring notifications, you will have to re-register the watch.
The read operations such as exists, getChildren, getData may create watches. Watches are
triggered by write operations: create, delete, setData. Access control operations do not
participate in watches.

A watch set by exists is triggered when the znode is created, deleted, or has its data updated. A watch set by getData is triggered when the znode is deleted or has its data updated. A watch set by getChildren is triggered when the znode is deleted, or when any of its children is created or deleted.
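
A short sketch of setting a one-time watch with the kazoo Python client (assumptions: kazoo is installed, ZooKeeper is at localhost:2181, and the /cloudxlab znode from the hands-on exists):

from kazoo.client import KazooClient

zk = KazooClient(hosts="localhost:2181")
zk.start()

def my_watch(event):
    # fired only once, when /cloudxlab changes or is deleted
    print("watch triggered:", event.type, event.path)

zk.get("/cloudxlab", watch=my_watch)   # read the data and register the watch
zk.set("/cloudxlab", b"newdata")       # this write triggers the watch above
zk.stop()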

ACL

ACL - Access Control Lists - determine who can perform which operations.

ACL is a combination of authentication scheme, an identity for that scheme, and a set of
permissions.

ZooKeeper supports the following authentication schemes:
digest: The client is authenticated by a username and password.
sasl: The client is authenticated using Kerberos.
ip: The client is authenticated by its IP address.

Use case

Though there are many use cases of ZooKeeper, the most common ones are: building a reliable configuration service, and a distributed lock service in which only a single process may hold the lock at a time.

== Not to Use == It is important to know when not to use ZooKeeper. You should not use it to store big data, because the number of copies equals the number of nodes, all of the data is loaded in RAM too, and there is network load for transferring all data to all nodes.

Use ZooKeeper when you require extremely strong consistency.


Why HDFS?
In this video, we will learn about Hadoop Distributed File System (HDFS), which is one of the
main components of Hadoop ecosystem.

Before going into depth of HDFS, let us discuss a problem statement.

If we have 100 TB of data, how will we design a system to store it? Let's take 2 minutes to think of possible solutions and then we will discuss them.

One possible solution is to build network-attached storage (NAS) or a storage area network (SAN). We can buy a hundred 1 TB hard disks and mount them to a hundred subfolders as shown in the image. What will be the challenges in this approach? Let us take 2 minutes to find out the challenges and then we will discuss them.

Let us discuss the challenges.

How will we handle failover and backups?

Failover means switching to a redundant or standby hard disk upon the failure of any hard disk.
For backup, we can put extra hard disks or build a RAID i.e. redundant array of independent
disks for every hard disk in the system but still it will not solve the problem of failover which is
really important for real-time applications.

How will we distribute the data uniformly?

Distributing the data uniformly across the hard disks is really important so that no single disk will
be overloaded at any point in time.

Is it the best use of available resources?

There may be other, smaller hard disks available with us, but we may not be able to add them to the NAS or SAN because huge files cannot be stored on these smaller hard disks. Therefore we will need to buy new, bigger hard disks.

How will we handle frequent access to files? What if most of the users want to access the files stored on one of the hard disks? File access speed will be really slow in that case, and possibly no user will be able to access the file due to congestion.

How will we scale out?

Scaling out means adding new hard disks when we need more storage. When we will add more
hard disks, data will not be uniformly distributed as old hard disks will have more data and newly
added hard disks will have less or no data.
To solve the above problems, Hadoop comes with a distributed filesystem called HDFS. We may sometimes see references to "DFS" informally or in older documentation or configurations.

HDFS - NameNode and DataNodes


An HDFS cluster has two types of nodes: one namenode, also known as the master, and multiple datanodes.

An HDFS cluster consists of many machines. One of these machines is designated as the namenode and the other machines act as datanodes. Please note that we can also have a datanode on the machine where the namenode service is running. By default, the namenode metadata service runs on port 8020.

The namenode manages the filesystem namespace. It maintains the filesystem tree and the
metadata for all the files and directories in the tree. This information is stored in RAM and
persisted on the local disk in the form of two files: the namespace image and the edit log. The
namenode also knows the datanodes on which all the blocks for a given file are located.

Datanodes are the workhorses of the filesystem. While namenode keeps the index of which
block is stored in which datanode, datanodes store the actual data. In short, datanodes do not
know the name of a file and namenode does not know what is inside a file.

As a rule of thumb, the namenode should be installed on a machine with a large amount of RAM, as all the metadata for files and directories is stored in RAM. We will not be able to store many files in HDFS if the RAM is not big enough for the metadata to fit. Since datanodes store the actual data, they should run on machines with bigger disks.

HDFS - Design and Limitations


HDFS is designed for storing very large files with streaming data access patterns, running on
clusters of commodity hardware. Let’s understand the design of HDFS

It is designed for very large files. “Very large” in this context means files that are hundreds of
megabytes, gigabytes, or terabytes in size.

It is designed for streaming data access. It is built around the idea that the most efficient data
processing pattern is a write-once, read-many-times pattern. A dataset is typically generated or
copied from the source, and then various analyses are performed on that dataset over time.
Each analysis will involve a large proportion, if not all, of the dataset, so the time to read the
whole dataset is more important than the latency in reading the first record
It is designed for commodity hardware. Hadoop doesn’t require expensive, highly reliable
hardware. It’s designed to run on the commonly available hardware that can be obtained from
multiple vendors. HDFS is designed to carry on working without a noticeable interruption to the
user in case of hardware failure.

It is also worth knowing the applications for which HDFS does not work so well.

HDFS does not work well for Low-latency data access. Applications that require low-latency
access to data, in the tens of milliseconds range, will not work well with HDFS. HDFS is
optimized for delivering high throughput and this may be at the expense of latency.

HDFS is not a good fit if we have a lot of small files. Because the namenode holds filesystem
metadata in memory, the limit to the number of files in a filesystem is governed by the amount of
memory on the namenode

If we have multiple writers and arbitrary file modifications, HDFS will not be a good fit. Files in HDFS are modified by a single writer at any time.

Writes are always made at the end of the file, in the append-only fashion. There is no support
for modifications at arbitrary offsets in the file. So, HDFS is not a good fit if modifications have to
be made at arbitrary offsets in the file.

Questions & Answers


Q: Can you briefly explain about low latency data access with example?

A: The low latency here means the ability to access data instantaneously. In case of HDFS,
since the request first goes to namenode and then goes to datanodes, there is a delay in getting
the first byte of data. Therefore, there is high latency in accessing data from HDFS.

<pre>
Latency Comparison Numbers

L1 cache reference                       0.5 ns
Branch mispredict                          5 ns
L2 cache reference                         7 ns                  14x L1 cache
Mutex lock/unlock                         25 ns
Main memory reference                    100 ns                  20x L2 cache, 200x L1 cache
Compress 1K bytes with Zippy           3,000 ns       3 us
Send 1K bytes over 1 Gbps network     10,000 ns      10 us
Read 4K randomly from SSD*           150,000 ns     150 us       ~1GB/sec SSD
Read 1 MB sequentially from memory   250,000 ns     250 us
Round trip within same datacenter    500,000 ns     500 us
Read 1 MB sequentially from SSD*   1,000,000 ns   1,000 us  1 ms ~1GB/sec SSD, 4X memory
Disk seek                         10,000,000 ns  10,000 us 10 ms 20x datacenter roundtrip
Read 1 MB sequentially from disk  20,000,000 ns  20,000 us 20 ms 80x memory, 20X SSD
Send packet CA->Netherlands->CA  150,000,000 ns 150,000 us 150 ms
</pre>

Source: https://gist.github.com/jboner/2841832
Replication
Let’s understand the concept of blocks in HDFS. When we store a file in HDFS, the file gets split
into the chunks of 128MB block size. Except for the last block all other blocks will have 128 MB
in size. The last block may be less than or equal to 128MB depending on file size. This default
block size is configurable.

Let’s say we want to store a 560MB file in HDFS. This file will get split into 4 blocks of 128 MB
and one block of 48 MB
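
The split can be verified with a quick calculation (plain Python, just to check the numbers above):

file_size_mb, block_size_mb = 560, 128
full_blocks, last_block_mb = divmod(file_size_mb, block_size_mb)
print(full_blocks, "blocks of", block_size_mb, "MB and one block of", last_block_mb, "MB")
# prints: 4 blocks of 128 MB and one block of 48 MB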

What are the advantages of splitting the file into blocks? It helps fit a big file onto smaller disks. It leaves less unused space on every datanode, as many 128 MB blocks can be stored on each datanode. It optimizes the file transfer. Also, it distributes the load to multiple machines. Let's say a file is stored on 10 datanodes; whenever a user accesses the file, the load gets distributed to 10 machines instead of one machine.

It is similar to downloading a movie using torrent. The movie file gets broken down into multiple pieces and these pieces get downloaded from multiple machines in parallel. It helps in downloading the file faster.

Let’s understand the HDFS replication. Each block has multiple copies in HDFS. A big file gets
split into multiple blocks and each block gets stored to 3 different data nodes. The default
replication factor is 3. Please note that no two copies will be on the same data node. Generally,
first two copies will be on the same rack and the third copy will be off the rack (A rack is an
almirah where we stack the machines in the same local area network). It is advised to set
replication factor to at least 3 so that even if something happens to the rack, one copy is always
safe.

We can set the default replication factor of the file system as well as of each file and directory individually. For files which are not important, we can decrease the replication factor, while files which are very important should have a high replication factor.

Whenever a datanode goes down or fails, the namenode instructs the datanodes which have
copies of lost blocks to start replicating the blocks to the other data nodes so that each file and
directory again reaches the replication factor assigned to it.

HDFS - File Reading and Writing


When a user wants to read a file, the client will talk to namenode and namenode will return the
metadata of the file. The metadata has information about the blocks and their locations.
When the client receives metadata of the file, it communicates with the datanodes and accesses
the data sequentially or parallelly. This way there is no bottleneck in namenode as client talks to
namenode only once to get the metadata of the files.

HDFS by design makes sure that no two writers write to the same file at the same time by having a single namenode.

If there are multiple namenodes, and clients make requests to these different namenodes, the
entire filesystem can get corrupted. This is because these multiple requests can write to a file at
the same time.

Let’s understand how files are written to HDFS. When a user uploads a file to HDFS, the client
on behalf of the user tells the namenode that it wants to create the file. The namenode replies
back with the locations of datanodes where the file can be written. Also, namenode creates a
temporary entry in the metadata.

The client then opens the output stream and writes the file to the first datanode. The first
datanode is the one which is closest to the client machine. If the client is on a machine which is
also a datanode, the first copy will be written on this machine.

Once the file is stored on one datanode, the data gets copied to the other datanodes
simultaneously. Also, once the first copy is completely written, the datanode informs the client
that the file is created.

The client then confirms to the namenode that the file has been created. The namenode
crosschecks this with the datanodes and updates the entry in the metadata successfully.

Now, lets try to understand what happens while reading a file from HDFS.

When a user wants to read a file, the HDFS client, on behalf of the user, talks to the namenode.

The Namenode provides the locations of various blocks of this file and their replicas instead of
giving back the actual data.

Out of these locations, the client chooses the datanodes closer to it. The client talks to these
datanodes directly and reads the data from these blocks.

The client can read blocks of the file either sequentially or simultaneously.
NameNode Backup & Failover
The metadata is maintained in the memory as well as on the disk. On the disk, it is kept in two
parts: namespace image and edit logs.

The namespace image is created on demand while edit logs are created whenever there is a
change in the metadata. So, at any point, to get the current state of the metadata, edit logs need
to be applied on the image.

Since the metadata is huge, writing it to the disk on every change may be time consuming. Therefore, saving just the change makes it extremely fast.

Without the namenode, the HDFS cannot be used at all. This is because we do not know which
files are stored in which datanodes. Therefore it is very important to make the namenode
resilient to failures. Hadoop provides various approaches to safeguard the namenode.

The first approach is to maintain a copy of the metadata on NFS - Network File System. Hadoop can be configured to do this. Modifications to the metadata then happen either both on NFS and locally, or not at all.

In the second approach to making the namenode resilient, we run a secondary namenode on a
different machine.

The main role of the secondary namenode is to periodically merge the namespace image with
edit logs to prevent the edit logs from becoming too large.

When a namenode fails, we have to first prepare the latest namespace image and then bring up
the secondary namenode.

This approach is not good for production applications as there will be a downtime until the
secondary namenode is brought online. With this method, the namenode is not highly available.

To make the namenode resilient, Hadoop 2.0 added support for high availability.

This is done using multiple namenodes and zookeeper. Of these namenodes, one is active and
the rest are standby namenodes. The standby namenodes are exact replicas of the active
namenode.

The datanodes send block reports to both the active and the standby namenodes to keep all
namenodes updated at any point-in-time.

If the active namenode fails, a standby can take over very quickly because it has the latest state
of metadata.
zookeeper helps in switching between the active and the standby namenodes.

The namenode maintains the reference to every file and block in the memory. If we have too
many files or folders in HDFS, the metadata could be huge. Therefore the memory of namenode
may become insufficient.

To solve this problem, HDFS federation was introduced in Hadoop 2.0.

In HDFS Federation, we have multiple namenodes containing parts of metadata.

The metadata - which is the information about files and folders - gets distributed manually across different namenodes. This distribution is done by maintaining a mapping between folders and namenodes.

This mapping is known as mount tables.

In this diagram, the mount table is defining /mydata1 folder is in namenode1 and /mydata2 and
/mydata3 are in namenode2

Mount table is not a service. It is a file kept along with and referred from the HDFS configuration
file.

The client reads mount table to find out which folders belong to which namenode.

It routes the request to read or write a file to the namenode corresponding to the file's parent
folder.

The same pool of datanodes is used for storing data from all namenodes.

Let us discuss what metadata is. The following attributes get stored in the metadata:

List of files
List of Blocks for each file
List of DataNode for each block
File attributes, e.g. access time, replication factor, file size, file name, directory name
Transaction logs or edit logs store file creation and file deletion timestamps.

YARN - Why?
In this session, we are going to discuss YARN - Yet Another Resource Negotiator. YARN is a
resource manager which keeps track of various resources such as memory and CPU of
machines in the network. It also runs applications on the machines and keeps track of what is
running where.

Before jumping into the YARN architecture, let us try to understand with an example why we need distributed computing.

Let us say we have a computer with a 1 GHz processor and 1 GB RAM. It takes 20 milliseconds to read a profile pic from disk and then 5 more milliseconds to resize it. How much time would this computer take to resize a million profile pics?

Can we do two things in parallel when dealing with so many pics? Yes because reading from
disk involves mainly the disk and resizing mainly involves CPU and RAM. So, reading and
resizing can be done in parallel as shown in the diagram. In the diagram, time is increasing from
left to right.

You can see that while pic1 is being resized, pic2 is being read from the disk. For three pics, it takes 20 times 3 plus 5 milliseconds, not 25 times 3. So, it takes 65 ms, not 75 ms.

So, it is only the disk read time that matters; we can completely ignore the last 5 ms at large scale. For one million pics it would be 1 million times 20 milliseconds, which is approximately 5.5 hours.
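
The arithmetic can be checked with a few lines of Python (times in milliseconds; the 20 ms and 5 ms figures come from the example above):

read_ms, resize_ms, pics = 20, 5, 1_000_000
sequential = pics * (read_ms + resize_ms)   # no overlap: read, then resize, one pic at a time
pipelined = pics * read_ms + resize_ms      # each resize overlaps the next disk read
print(sequential / 3_600_000, "hours")      # about 6.9 hours
print(pipelined / 3_600_000, "hours")       # about 5.6 hours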

What if 5.5 hours is not good enough? The next question is: how can we make it faster?

If we use a computer which has four cores or processors, can this process finish in less than 5.5
hours?

No, because it is not the CPU which is causing the delay. The main time is being consumed in
disk reads. If we make disk reads faster, the process will become faster. Disk reads can be
made faster by using Solid State Drives and by using many disk drives.

YARN - Evolution from MR 1.0

Before YARN, it was MapReduce 1.0 that was responsible for distributing the work. This is how
MapReduce 1.0 works. It is made up of a Job Tracker and many Task Trackers. The Job Tracker is like the manager of a shop floor who is responsible for interacting with customers and getting the work done via Task Trackers. The Job Tracker breaks down the work and distributes parts to various task
trackers. The task trackers keep Job Tracker updated with the latest status. If a task tracker fails
to provide the status back, Job Tracker assumes that the task tracker is dead. Thereafter Job
Tracker assigns work to some other task tracker.

To ensure equal load on all task trackers, Job Tracker keeps track of the resources and tasks.

MapReduce 1.0 also performed the sorting or ordering required as part of Map-Reduce
framework.

But MapReduce framework was very restrictive - the only way you could get your work done
was by using MapReduce framework. Not all problems were suitable for MapReduce kind of
model. Some of the problems can be solved better using other frameworks. That's why YARN
came into play.

Basically, MapReduce 1.0 was split into two big components - YARN and MapReduce 2.0. YARN is only responsible for managing and negotiating resources on the cluster, and MapReduce 2.0 has only the computation framework, also called the workload, which runs the logic in two parts - map and reduce. MapReduce 2.0 also does the sorting of the data.

This refactoring or splitting made way for many other frameworks for solving different kind of
problems such as Tez, HBase, Storm, Giraph, Spark, OpenMPI etcetera

The advantages of YARN are: 1. It supports many workloads including MapReduce. 2. With YARN, it became easier to scale up. 3. MapReduce 2.0 was compatible with MapReduce 1.0: programs written for MapReduce 1.0 need not be modified, just recompiled. 4. It improved cluster utilization, as different kinds of workloads became possible on the same cluster. 5. It improved agility. 6. Since MapReduce was batch-oriented, it was not possible to run tasks that needed to run forever, such as stream processing jobs; with YARN this became possible.

The role of Job Tracker in MapReduce 1 is now split into multiple components in yarn -
Resource Manager, Application Master, Timeline Server. The task tracker is now Node
Manager. The role of a slot in MapReduce 1 is now played by Container in YARN.

YARN - Architecture

Let us try to understand the architecture of YARN. The objective of YARN is to be able to execute any kind of workload in a distributed fashion.

All of the components that you see in the diagram are software programs. It has one resource
manager. Each machine has a node manager. The node managers create containers to execute
the programs. Application Masters are the programs per application that are executed inside the
containers. Examples of Application Masters are Mapreduce AM and Spark AM.

The flow goes like this: A client submits an application to the resource manager. The resource
manager consults with the node managers and creates an Application Master for the client
inside a container.

Now, the application master registers itself with resource manager so that the client can monitor
the progress and interact with the application master.

Application Master then requests Resource Manager for containers. The resource manager
consults and negotiates with Node Managers for the containers. Node Managers create
containers and launch app in these containers. The node managers talk back to the resource
manager about the resource usage. To check the status, the clients can talk to Application
Master. On completion, Application Master deregisters itself from the resource manager and
shuts down.

It can be better understood if we think of YARN as a country where Resource Manager is the
president. And clients are organizations like Microsoft or Shell who want their work to be done
by the way of establishing offices. Each node is a state such as California and the node
manager is governor of the state. The container is the place for the office. The application
master is the head office.

Though it looks like an oversimplification, it helps us understand how YARN works. Microsoft wants to get its offices set up, so it contacts the president; the president talks to the governors of the states and then the head office is established. Afterwards, to get more offices established in various states, the head office would request the president, the president would discuss with the governors, and offices would be established.

YARN - Advanced

What does the resource request made by the client or the Application Master contain? The request
may specify how many containers, how much memory, and how many CPU cores are required. The
request may also contain a constraint such as asking for nodes nearer to certain data; this is
called a data locality constraint. It is similar to a mining company requesting an office near
the area which has the mines. The request may either demand all of the resources up front, or
ask for them as and when required.
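
To make this concrete, the attributes of such a request can be pictured roughly as follows. This is only an illustrative sketch in Python; the field names are hypothetical and are not the actual YARN API, which is exposed through the Resource Manager's protocols.

# Hypothetical picture of a resource request (illustrative field names only).
resource_request = {
    "num_containers": 4,                   # how many containers
    "memory_mb_per_container": 2048,       # how much memory
    "vcores_per_container": 2,             # how many CPU cores
    "preferred_nodes": ["node-near-data"], # data locality constraint (made-up node name)
    "allocate_up_front": False,            # or ask as and when required
}
print(resource_request)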

Another question that comes to mind is: for how long does an application run? The application
lifespan can vary dramatically, and it falls into three categories: 1. One application per job.
This is the simplest case; MapReduce is an example. As soon as the job is over, the application
ends. 2. One application per workflow or user session. Spark operates in this mode when you
launch the interactive shell, so it remains active during the user's session until the user
terminates it. 3. A long-running application, such as a server which runs forever. Examples are
Apache Slider and Impala.

If you plan to build your own application, my suggestion would be to first try to use existing
frameworks such as MapReduce or Spark.

Second, try to utilize existing tools to build jobs, such as Apache Slider and Apache Twill. With
Apache Slider you can run an existing distributed application such as HBase, and Twill can
execute any Java runnable.

If you still need to build your own YARN application, please keep in mind that building one from
scratch is complex. To start with, you can use the YARN project's bundled distributed shell
application as an example.

MapReduce - Understanding Sorting


Say you have a computer with a 1 GHz processor and 2 GB RAM. How much time will it take to
sort or order 1 TB of data? Say this data consists of 10 billion names of 100 characters each.

It would take around 6-10 hours.

What's wrong with getting it done in 6-10 hours? Sorting is a very common operation; we would
need it to be done faster, on bigger data, and more often. On 8 September 2011, Google was able
to sort 10 petabytes of data in 6.5 hours using 8,000 computers with their MapReduce framework.

Why are we talking about Sorting? Why is it a big deal? When we talk about data processing,
we often think about SQL because the majority of data processing tasks can be performed with
SQL.

If you take a look at SQL queries, most of the operations use or are impacted by the sorting
algorithm of the database. For example, if we create an index on a table, queries with a "where"
clause become enormously faster, and indexing is basically "sorting" under the hood. Similarly,
the "group by" construct of SQL involves first sorting the data and then finding the unique keys.

Joins are easier if the tables are already indexed. And "order by" is obviously just sorting of
the data. Beyond SQL, many complex algorithms also benefit partly from sorting.
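
To make the "sort, then find unique" idea concrete, here is a small Python sketch (purely illustrative, not how a database engine is implemented) showing a group-by with counts done by sorting first and then scanning adjacent equal keys:

from itertools import groupby

words = ["sa", "re", "sa", "ga"]

# After sorting, equal keys sit next to each other, so one pass is enough
# to count each group - this is the essence of "group by".
for word, group in groupby(sorted(words)):
    print(word, sum(1 for _ in group))
# ga 1
# re 1
# sa 2
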
So, What is MapReduce?

MapReduce, in very simple words, is a programming paradigm to help us solve Big Data
problems. Hadoop MapReduce is the framework that works on this paradigm. It is especially
great for tasks which are sorting- or disk-read-intensive.

Ideally, you would write two functions or pieces of logic - a mapper and a reducer. The mapper
converts every record from the input into key-value pairs. The reducer aggregates the values for
each key as given out by the mapper, or the map phase.

MapReduce is also supported by many other systems such as Apache Spark, MongoDB,
CouchDB, and Cassandra. MapReduce programs in Hadoop can be written in Java, shell scripts,
Python, or any executable binary.

Let's take a quick look at how map-reduce gets executed. In this diagram, we have three
machines containing data on which map functions are getting executed. Mapper is a logic that
you have defined. This logic takes a record as input and converts it into key-value pairs. Please
note that map logic is provided by you. This logic can be very complex or very simple based on
your need. And these key-value pairs are sorted and then grouped together by Hadoop based
on the key. All of the values for each key are aggregated using your reducer logic.

So, if you want to group data based on some criteria, that criteria would be expressed in the
mapper logic, and how to combine all the values for each key is governed by your logic of
Reducer. The result of reducer is saved into the HDFS.

Let's imagine for a moment that we would like to prepare a veg burger on a very large scale. As
you can see in the diagram, the function cut_Into_Pieces() will be executed on each vegetable,
chopping the vegetables into pieces, and the results will be reduced to form a burger.

Thinking in MR - Programmatic & SQL


Converting a problem into map-reduce can be a little tricky and unintuitive. So, we will take a
stepwise approach in order to understand MapReduce.

How would you find the frequencies of unique words in a text file? In other words, find the count
of each unique word in the text file?

Consider it to be a practical assignment; you can suggest any tool or technique to accomplish
this. There are a few approaches. We will take you through these approaches and try to weigh the
pros and cons of each.
Let's first discuss the approach that involves programming. In this approach, we use an
in-memory data structure called a hash map, or a dictionary. Initially, we create an empty
dictionary in which the key will be the word and the value will be how many times the word has
occurred so far. We read the data one word at a time. For each word, we increase its count if it
already exists in the dictionary; otherwise, we add the word to the dictionary with a count of 1.
When there are no more words left, we print the key-value pairs from the dictionary.

The same has been demonstrated using Python code and a flow chart.
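
A minimal Python sketch of this in-memory approach could look like the following; the file name myfile.txt is just an example.

# In-memory word count using a dictionary (hash map).
counts = {}

with open("myfile.txt") as f:                        # example file name
    for line in f:
        for word in line.split():                    # read one word at a time
            counts[word] = counts.get(word, 0) + 1   # increment, or add with count 1

# When there are no more words left, print the key-value pairs.
for word, count in counts.items():
    print(word, count)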

What do you think are limitations of this approach?

It cannot process more data than fits into memory, because the entire dictionary is maintained
in memory. So, if your text is really huge, you cannot use this approach.

The second approach is to use SQL. Basically, we break down the text into one word per line and
then insert each word as a row in a table with a single column. Once such a single-column table
has been created, we execute an SQL query with a group-by clause, like: "select word, count(*)
from table group by word".

This is a more practical approach because it doesn't involve writing too much code. It also
does not suffer from memory overflow, because databases manage memory properly while grouping
the data.

Thinking in MR - Unix Pipeline

The third approach is to use Unix commands in a pipeline, or chain. Let us first try to
understand what is meant by a pipeline.

As we discussed earlier, when we run a program it may take input from you; in other words, you
may provide input to a program by typing. A program or command may also print some output on
the screen.

In Unix, you can provide the output of one program as input to another. This is known as piping.
A pipe is denoted by the vertical bar symbol: command1 | command2 means the output of command1
becomes the input of command2.

Let us take an example.

The echo command prints to standard output whatever argument is passed to it. For example,
echo "Hi" prints "Hi" to the screen.
The wc command prints the number of lines, words, and characters of whatever you type on
standard input. Let me show you: start the wc command, type some text, say "hi", a newline, and
"how are you", and then press Ctrl+D to end the input.

It would print the number of lines, words, and characters, which are 2, 4, and 15 respectively.

If we want to count the number of words or characters in the output of echo command, we could
use a command like: echo "Hello, World" | wc

Let us try to understand this pipeline of commands for word counting in parts.

The first command cat myfile prints the contents of the file "myfile".

The second command in the chain is sed, which stands for stream editor. It is used to replace
text in the input with something else, very similar to the search-and-replace feature of text
editors. You can use extended regular expressions with sed by providing the -E option.
sed -E 's/[\t ]+/\n/g' replaces runs of spaces and tabs with a newline; essentially, it converts
the text into one word per line. So, when you chain cat and sed, it prints one word per line
from the file.

This one-word-per-line text can be piped further to a command called sort, which orders the
lines in its input. The sort command takes various options; the -S option limits the amount of
main memory it uses. In our case, we use -S 1g to sort the data using only 1 gigabyte of memory.

The last command is uniq, which finds unique lines in the input. It expects the data to be
ordered already; if the input to uniq is not sorted, the result is not correct. uniq has a -c
option which prints the count of each unique line. So uniq -c would print the count of each
unique word in the sorted input.

So, the entire pipeline - cat, then sed, then sort, followed by uniq - prints the count of each
unique word in the text file. Put together, it reads:
cat myfile | sed -E 's/[\t ]+/\n/g' | sort -S 1g | uniq -c

External Sort
We discussed two more approaches - one using SQL, the other using Unix commands.
What problems do you see in these last two approaches?

The moment the data grows beyond the RAM, the time taken starts increasing. CPU, disk speed,
and disk space start becoming the bottlenecks.

So, what do we do in order to alleviate these bottlenecks caused by big data?
We may try to use an external sort. In an external sort, we utilize many computers to do the
word count and then merge the results together. Let us say we have three machines: a Launcher,
Machine 1, and Machine 2. The Launcher breaks the file into two parts and sends the first part
to Machine 1 and the second part to Machine 2.

Both machines (1 and 2) count the unique words in their part of the text and send the result
back to the launcher. The launcher merges the results. Please note that the results the launcher
gets from both of these machines are ordered, which makes the merging easy: merging ordered data
is really fast - the time it takes is linearly proportional to the total number of elements to
be merged.

Let us see how the merging of sorted data happens. Say Machine 1 returned a:1 (where a is the
key and 1 is the count), b:3, f:1, z:10, and Machine 2 returned a:10, d:3, f:23, y:9.

To merge the two results, we simply compare the first words from both. If both words are equal,
we sum up the counts; otherwise, we pick the smaller one.

So, in one machine's output the head is a:1 and in the other it is a:10. The word in both is the
same, therefore we take a out of both and add 1 and 10; the merged result is a:11.

Now that a is removed from both sides, we compare the new heads, which are b and d. b is smaller,
so we pick b:3 and put it in the output. Next, we compare f and d and pick d:3, as d is smaller
than f lexically, or English-dictionary wise. Afterwards, we merge f from both sides into f:24.
Then we pick y:9 and at the last we pick z:10.

So, you can see that we got the result very quickly. This approach of merging is used to merge
data which is sorted already. In case there are more than two lists to merge, we generally use a
min-heap to pick the smallest of the heads from all the lists. Alternatively, we can merge two
lists at a time and then merge the results again.
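
Here is a small Python sketch of this two-way merge of sorted (word, count) lists, using the numbers from the example above; it is illustrative only, not the launcher's actual code.

# Merge two sorted lists of (word, count) pairs, as the launcher does.
def merge_counts(left, right):
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        lw, lc = left[i]
        rw, rc = right[j]
        if lw == rw:                        # same word on both sides: sum the counts
            merged.append((lw, lc + rc))
            i += 1
            j += 1
        elif lw < rw:                       # otherwise pick the lexically smaller head
            merged.append((lw, lc))
            i += 1
        else:
            merged.append((rw, rc))
            j += 1
    merged.extend(left[i:])                 # leftovers are already sorted
    merged.extend(right[j:])
    return merged

machine1 = [("a", 1), ("b", 3), ("f", 1), ("z", 10)]
machine2 = [("a", 10), ("d", 3), ("f", 23), ("y", 9)]
print(merge_counts(machine1, machine2))
# [('a', 11), ('b', 3), ('d', 3), ('f', 24), ('y', 9), ('z', 10)]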

This is how our launcher also works. In our diagram, the launcher has merged re:2, sa:1 with
ga:2, re:1 to produce ga:2, re:3, sa:1.

With external sorting, we have been able to divide the work amongst many machines. What do you
think is the problem with approach 4, which is external sorting?

Here are some of the problems:

● A lot of time is consumed in transporting the data from the launcher to the machines and back.

● For each requirement we would need to build a special-purpose, network-oriented program to
perform all of these operations. In our case, we were trying to count unique words; if we need
to do some other operation, such as counting verbs or nouns in the text, we would have to code
the entire mechanics again.

● The external sort requires a lot of engineering. We have to handle the cases where the network
fails or any of the machines fail.

Understanding The Paradigm


This is where MapReduce comes into play. As we go forward, we will see that MapReduce is nothing
but a highly customizable version of our external sort in which the data doesn't need to be
transferred to the machines, because it is already distributed on the data nodes of HDFS.

Again, Hadoop MapReduce is a framework to help us solve Big Data problems. It is especially
great for tasks which are sorting- or disk-read-intensive.

Ideally, you would write two functions or pieces of logic called the mapper and the reducer. The
mapper converts every input record into key-value pairs. The reducer aggregates the values for
each key as given out by the map() phase.

map() is executed as many times as there are records, and reduce() is executed once for each
unique key in the output of map(). You could also have a MapReduce job without a reduce phase.

Let us take a look at an example where we have many profile pictures of users and we would like
to resize these images to 100 by 100 pixels. To solve such a problem, first we copy the pictures
to HDFS, the Hadoop distributed file system, and then we write a function which takes an image
as input and resizes it. This function is what we are going to use as the mapper.

Since there is no reducer in this case, the result of the mapper is saved directly to HDFS. This
mapper is executed for all the images on the machines.

So, in our diagram, we have 9 images distributed across three machines, and our mapper function
will be executed 9 times on these three machines.
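
A minimal sketch of such a map-only function is shown below. It assumes the Pillow imaging library and assumes each input record is a path to an image file; the wiring into Hadoop (for example through Hadoop Streaming or a custom InputFormat for binary files) is left out.

from PIL import Image   # Pillow is an assumption; any imaging library would do

def resize_mapper(image_path):
    # Map-only logic: one input record (an image path) in, one resized image out.
    img = Image.open(image_path)
    img.resize((100, 100)).save(image_path + ".100x100.png")
    # No key-value pairs are emitted; with no reducer, the resized files
    # themselves are the job's output.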

Now, let's try to understand the scenario with both a mapper and a reducer.

The HDFS block is first converted into something called an InputSplit, which is a logical
representation of the raw HDFS block. An InputSplit is a collection of records and is created by
an InputFormat. By default, the input format is TextInputFormat, which creates an InputSplit out
of a plain text file such that each line is a record.

On the machines having the InputSplits, your mappers or map functions are executed. The logic
of map() is written by you as per your requirements. This map function converts input records
into key-value pairs. These key-value pairs are sorted locally on each node and then transported
to a machine which is designated as the reducer. The reducer machine merges the results from all
the nodes and groups the data.

On each group, your reducer function gets executed. The reducer function receives a key and a
list of values; it generally aggregates all of the values, and the final value is saved to HDFS.

Since the results from all the map functions are grouped by the framework based on the key, you
always choose as the key the field on which you want your data to be grouped. So, you always try
to break down your problem into group-by kind of logic.

Examples
Let's look at an example of the word count problem. Say you have a text file having two lines,
the first line being "sa re" and the second line "sa ga". This plain text will be converted into
an InputSplit with two records: the first record is the line "sa re" and the second record is
"sa ga".

Here we have written a very simple mapper() function which basically gives out each word as the
key and the numeric 1 as the value, thus converting the input line "sa re" into "sa 1" and
"re 1", and the input line "sa ga" into "sa 1" and "ga 1". Your mapper function has been
executed twice.

The results of the mapper are sorted by the Hadoop MapReduce framework based on the key and then
grouped. The dashed line in the diagram represents the work done by the Hadoop framework. So, we
have three unique keys, ga, re, and sa, each with its values grouped together.

For each of these groups, the reduce function is executed. The reduce function gets the key and
the list of values as arguments; here, it simply sums up the values. So, the outcome of the
reduce function is ga 1, re 1, and sa 2.
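
In Python-like form, a sketch of this word count (the logic only; a real Hadoop job would use the Java API or Hadoop Streaming) could be:

from itertools import groupby

def mapper(record):
    # record is one line of text, e.g. "sa re"
    for word in record.split():
        yield (word, 1)                     # each word becomes (key, 1)

def reducer(key, values):
    # values is the list of 1s grouped under this key by the framework
    yield (key, sum(values))

# Simulating the framework's sort-and-group step on the two input lines:
pairs = [kv for line in ["sa re", "sa ga"] for kv in mapper(line)]
pairs.sort()
for key, group in groupby(pairs, key=lambda kv: kv[0]):
    for result in reducer(key, [v for _, v in group]):
        print(result)                       # ('ga', 1), ('re', 1), ('sa', 2)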

Let's take a more practical example where we have to find the maximum temperature of each city
based on a temperature log of various cities on various dates. This temperature log is a
comma-separated file containing temperature, city, and date.

To find the maximum temperature per city, we have to group the data by city and then find the
maximum temperature. So, our mapper would give out the city as the key and the temperature as
the value; it will not give out the date in the output. These key-value pairs are then ordered
and grouped by the Hadoop MapReduce framework. In our example, we got four groups, one for each
of the cities BLR, Chicago, NYC, and Seattle.
Now, in the reduce function we give out the maximum of the values for each key. This reduce
function will be called for each of the groups, and hence we get the maximum temperature for
each city.
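
A sketch of the mapper and reducer logic for this problem, assuming each record looks like "temperature,city,date" as described above (the sample value in the comment is made up):

def mapper(record):
    # record example (hypothetical): "32,BLR,2017-06-01"
    temperature, city, _date = record.split(",")
    yield (city, int(temperature))          # the date is dropped from the output

def reducer(city, temperatures):
    # temperatures is the list of values grouped under this city
    yield (city, max(temperatures))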

If you had to solve the above problem of finding the maximum temperature using SQL, you would
simply group the data by city and compute the maximum for each group. The query would look
like: "select city, max(temp) from table group by city".

The map part corresponds to the selection of the column in the group-by, and the reduce part is
analogous to the aggregation in SQL.

Similarly, for word count, the map part corresponds to the selection of the column in the
group-by of SQL, and the reduce part is the equivalent of the count aggregation of SQL.

Multiple reducers
If there are a lot of key-values to merge, a single reducer might take too much time. To avoid
the reducer machine becoming the bottleneck, we use multiple reducers.

When you have multiple reducers, each node that is running a mapper puts the key-value pairs
into multiple buckets just after sorting. Each of these buckets goes to a designated reducer. On
every reducer, the buckets coming from all the mapper nodes get merged.

On the mapper node, which key goes to which reducer node is decided by the partitioner. By
default, the partitioner computes the hashcode of the key, which means generating a number
corresponding to the string, and then divides the hashcode by the total number of reducers.
Whatever the remainder is, the key is sent to the bucket for that reducer. If the remainder is
0, the key is kept in the bucket for the 0th reducer on the same mapper node. Once the mapper
node finishes processing the input data, each bucket is taken to the corresponding reducer node.
This model ensures that every key goes to some reducer (no key is dropped) and also that the
same key from different mapper nodes goes to the same reducer node. A case like this can never
happen: key x on mapper node 1 went to the first reducer while the same key x from mapper node 2
went to the second reducer.
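
The default behaviour can be sketched roughly as below. Note this is Python only for illustration; Hadoop's default HashPartitioner is written in Java and uses the key's hashCode(), which, unlike Python's built-in hash for strings, is deterministic across machines.

# Hash-based partitioning: the remainder decides the bucket, i.e. the reducer.
def partition(key, num_reducers):
    return hash(key) % num_reducers

num_reducers = 2
for key in ["sa", "re", "ga", "sa"]:
    print(key, "-> reducer", partition(key, num_reducers))
# Within a run, "sa" always lands on the same reducer, no matter which
# mapper produced it.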
