Apache Hadoop Training
-Navneet Sharma
Hadoop, HPCC
Experienced in NoSQL DBs (HBase, Cassandra, Mongo, Couch)
Data Aggregator Engine: Apache Flume
CEP: Esper
Worked on analytics, retail, e-commerce, meter data management systems, etc.
Cloudera Certified Apache Hadoop Developer
New Requirements
Introducing Hadoop
Brief History of Hadoop
Features of Hadoop
Overview of Hadoop Ecosystem
Overview of MapReduce
Traditional Applications
Application Server <-- data transfer --> Network <-- data transfer --> Database
Statistics Part1
Assuming N/W bandwidth is 10 MBps

Bandwidth (MBps)   Data Size          Transfer time (to and fro)
10                 10 MB              1 + 1 = 2 sec
10                 100 MB             10 + 10 = 20 sec
10                 1000 MB = 1 GB     100 + 100 = 200 sec
10                 1000 GB = 1 TB     100000 + 100000 = ~55 hours
Observation
Data is moved back and forth over the network, whose limited bandwidth becomes the bottleneck
Conclusion
Achieving Data Localization
Moving the application to the place where the data is
residing OR
Making data local to the application
Statistics Part 2
How data is read ?
Line by Line reading
Observation
A large amount of data takes a lot of time to read
Summary
Storage is a problem
Cannot store large amounts of data
Upgrading the hard disk will also not solve the problem (hardware limitation)
Performance degradation
Upgrading RAM will not solve the problem (hardware limitation)
Reading
Larger data takes longer to read
Solution Approach
Distributed Framework
Storing the data across several machines
Performing computation in parallel across several machines
Data transfer between the application servers and the database becomes a bottleneck as the number of users increases
Recoverability
Data Availability
Consistency
Data Reliability
Upgrading

Recoverability
If machines/components are down, it should result in graceful degradation of performance
If a machine/component fails, its task should be taken up by another machine
Data Availability
Failure of machines/components should not result in data becoming unavailable
Consistency
If machines/components fail, the outcome of the job should remain consistent
Data Reliability
Data integrity and correctness should be in place
Upgrading
Adding more machines should not require a full restart of the system
Should be able to add machines to the existing system gracefully
Introducing Hadoop
Distributed framework that provides scaling in :
Storage
Performance
IO Bandwidth
Features of Hadoop
Partitions and distributes the data
Performs computation closer to the data (Data Localization)
Performs computation across several hosts (MapReduce framework)
Hadoop Components
Hadoop is bundled with two independent components
HDFS (Hadoop Distributed File System)
MR framework (MapReduce)
Daemons used by HDFS: NameNode, DataNode, Secondary NameNode
Daemons used by the MapReduce framework: JobTracker, TaskTracker
If the JobTracker is down you cannot run MapReduce jobs, but you can still access HDFS
Overview Of HDFS
NameNode is the single point of contact
Consists of the meta information of the files stored in HDFS
If it fails, HDFS is inaccessible
DataNodes consist of the actual data
They store the data in blocks
Blocks are stored on the local file system
Overview of MapReduce
A MapReduce job consists of two tasks
Map Task
Reduce Task
Blocks of data distributed across several machines are processed in parallel by the map tasks
"... optimize my queries, but still not able to do so. Moreover, I am not a Java geek. Will this solve my problem?"
Answer: Hive
"Hey, Hadoop is written in Java, and I am purely from a C++ background ..."
"Well, how about Python, Scala, Ruby, etc. programmers? Does Hadoop support them?"
Answer: Hadoop Streaming
Downside of RDBMS
Rigid Schema
Once the schema is defined it cannot be changed easily
Adding a new column requires the schema to be recreated
Leads to a lot of nulls being stored in the database
Cannot add columns at run time
NoSQL DataBases
Turns the RDBMS model upside down
Column family concept (No Rows)
One column family can consist of various columns
New columns can be added at run time
Nulls can be avoided
Schema can be changed
Meta Information helps to locate the data
No table scanning
Challenges in Hadoop-Security
Poor security mechanism
Uses the whoami command for user identification
The cluster should be behind a firewall
Kerberos integration exists but is tricky to configure
conf/
  mapred-site.xml
  core-site.xml
  hdfs-site.xml
  masters
  slaves
  hadoop-env.sh
bin/
  start-all.sh
  stop-all.sh
  start-dfs.sh, etc.
logs/
lib/
conf Directory
Place for all the configuration files
All the Hadoop-related properties need to go into one of these files
mapred-site.xml
core-site.xml
hdfs-site.xml
bin directory
Place for all the executable files
You will be running the following executable files very often
start-all.sh ( For starting the Hadoop cluster)
stop-all.sh ( For stopping the Hadoop cluster)
logs Directory
Place for all the logs
Log files will be created for every process running on
Hadoop cluster
NameNode logs
DataNode logs
Installation Steps
Pre-requisites
Java Installation
Creating dedicated user for hadoop
Note
The installation steps are provided for CentOS 5.5 or
greater
Installation steps are for 32 bit OS
All the commands are marked in blue and are in italics
Assuming a user named training is present and is used throughout the installation
Note
Hadoop follows a master-slave model
There can be only 1 master and several slaves
With HDFS HA, more than 1 master can be present
The master machine is also referred to as the NameNode machine
Slave machines are also referred to as DataNode machines
Pre-requisites
Edit the file /etc/sysconfig/selinux
Change the SELINUX property from enforcing to disabled
You need to be the root user to perform this operation
Install ssh
yum install openssh-server openssh-clients
chkconfig sshd on
service sshd start
Installing Java
Download Sun JDK ( >=1.6 ) 32 bit for linux
Download the .tar.gz file
Follow the steps to install Java
tar -zxf jdk.x.x.x.tar.gz
export JAVA_HOME=PATH_TO_YOUR_JAVA_HOME
export PATH=$JAVA_HOME/bin:$PATH
Disabling IPV6
Hadoop works only on IPv4-enabled machines, not on IPv6
Run the following command to check:
cat /proc/sys/net/ipv6/conf/all/disable_ipv6
(0 means IPv6 is still enabled; it should return 1 once IPv6 is disabled)
Configuring Hadoop
Download Hadoop tar ball and extract it
tar -zxf hadoop-1.0.3.tar.gz
The above command will create a directory, which is Hadoop's home directory
Copy the path of the directory and edit the .bashrc file
export HADOOP_HOME=/home/training/hadoop-1.0.3
export PATH=$PATH:$HADOOP_HOME/bin
Edit $HADOOP_HOME/conf/mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
</property>
</configuration>
Edit $HADOOP_HOME/conf/hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.block.size</name>
<value>67108864</value>
</property>
</configuration>
Edit $HADOOP_HOME/conf/core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/training/hadoop-temp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
</property>
</configuration>
Edit $HADOOP_HOME/conf/hadoop-env.sh
export JAVA_HOME=PATH_TO_YOUR_JDK_DIRECTORY
Note
No need to change your masters and slaves files, as you are running everything on a single node
Viewing NameNode UI
In the browser type localhost:50070
Viewing MapReduce UI
In the browser type localhost:50030
Java Installation
Configuring ssh
The authorized_keys file has to be copied to all the machines
Make sure you can ssh to all the machines in the cluster without a password prompt
Example cluster: Master 1.1.1.1, Slave 0 1.1.1.2, Slave 1 1.1.1.3
Masters file:
1.1.1.1
Slaves file:
1.1.1.2
1.1.1.3
Edit $HADOOP_HOME/conf/mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>IP_OF_MASTER:54311</value>
</property>
</configuration>
Edit $HADOOP_HOME/conf/hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.block.size</name>
<value>67108864</value>
</property>
</configuration>
Edit $HADOOP_HOME/conf/core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/training/hadoop-temp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://IP_OF_MASTER:54310</value>
</property>
</configuration>
NOTE
All the configuration files have to be the same across all the machines
Generally you do the configuration on the NameNode machine and copy it to the DataNode machines
Processes running on the Master / NameNode machine (1.1.1.1):
NameNode
JobTracker
Secondary NameNode
Processes running on each Slave / DataNode machine (1.1.1.2, 1.1.1.3):
DataNode
TaskTracker
fs.default.name
Value is hdfs://IP:PORT [hdfs://localhost:54310]
Specifies where your NameNode is running
Anyone from the outside world trying to connect to HDFS uses this URI
mapred.job.tracker
Value is IP:PORT [localhost:54311]
Specifies where your JobTracker is running
When an external client tries to run a MapReduce job, it contacts the JobTracker at this address
dfs.block.size
Default value is 64MB
File will be broken into 64MB chunks
One of the tuning parameters
One of the tuning parameters for the number of map tasks: smaller blocks mean more map tasks, larger blocks mean fewer
dfs.replication
Defines the number of copies to be made for each
block
The replication feature achieves fault tolerance in the Hadoop cluster
Data is not lost even if machines go down
dfs.replication contd
Each replica is stored on a different machine
hadoop.tmp.dir
Value of this property is a directory
This directory consists of the Hadoop file system information
Consists of the meta data image
Consists of the blocks, etc.
dfs/
  data/
    current/
      VERSION file
      blocks and meta files
  name/
    current/
      VERSION
      fsimage
      edits
  namesecondary/
    previous.checkpoint/
    current/
mapred/
NameSpace IDs
The DataNode NameSpace ID and the NameNode NameSpace ID have to be the same
You will get an Incompatible NameSpace ID error if there is a mismatch
The DataNode will not come up
SafeMode
Starting Hadoop is not a single click
When Hadoop starts up it has to do a lot of activities
Restoring the previous HDFS state
Waiting to get the block reports from all the DataNodes, etc.
During this period Hadoop will be in safe mode
It shows only the meta data to the user
It is just a read-only view of HDFS
Cannot do any file operation
Cannot run a MapReduce job
SafeMode contd
For doing any operation on Hadoop SafeMode should
be off
Run the following command to get the status of
safemode
hadoop dfsadmin -safemode get
SafeMode contd
Run the following command to turn off the safemode
hadoop dfsadmin -safemode leave
What is HDFS?
It's a layered or virtual file system on top of the local file system
It does not modify the underlying file system
HDFS contd
Hadoop Distributed File System
HDFS contd
Behind the scenes, when you put a file:
The file is broken into blocks
Each block is replicated
Replicas are stored on the local file systems of the DataNodes
hadoop fs <options>
options are the various commands
Does not support all the Linux commands
The change-directory command (cd) is not supported
You cannot open a file on HDFS using the vi editor (or any other editor) for editing; that means you cannot edit a file residing on HDFS
Example of a fully qualified HDFS path: hdfs://localhost:54310/user/dev/
Common Operations
Creating a directory on HDFS
hadoop fs -mkdir <directory name>
Putting a local file into an HDFS directory
hadoop fs -put /home/dev/foo.txt bar
Listing files
hadoop fs -ls shows: file mode, replication factor, file owner, file group, size, last modified date, last modified time, and the absolute name of the file or directory
Viewing a file
hadoop fs -cat foo
To see the list of blocks which a file (e.g. sample) consists of, use hadoop fsck <path> -files -blocks
Hands On
Please refer the Hands On document
HDFS Components
Working of DataNode
Working of Secondary NameNode
Working of JobTracker
Working of TaskTracker
Writing a file on HDFS
JobTracker runs on the Master/NN machine; DataNode and TaskTracker run on the Slave/DN machines
Small files
Small files means the file size is less than the block size
How is a file protected when machines are going down? By replicating the blocks
A client wants to analyze a big file stored as replicated blocks on HDFS
NameNode
Single point of contact to the outside world
Client should know where the name node is running
Specified by the property fs.default.name
Stores the meta data
List of files and directories
Blocks and their locations
For fast access the meta data is kept in RAM
Meta data is also stored persistently on local file system
/home/training/hadoop-temp/dfs/name/previous.checkpoint/fsimage
NameNode contd
Meta data consists of the mapping of blocks to DataNodes
NameNode contd
If NameNode is down, HDFS is inaccessible
Single point of failure
Any operation on HDFS is recorded by the NameNode in the edits file
/home/training/hadoop-temp/dfs/name/previous.checkpoint/edits file
The NameNode periodically receives heartbeat signals from the DataNodes
DataNode
Stores the actual data
Along with data also keeps a meta file for verifying the
integrity of the data
/home/training/hadoop-temp/dfs/data/current
DataNode contd
The DataNode receives instructions from the NameNode (e.g. to delete blocks, or to copy blocks to other DataNodes for under-replicated blocks)
The edits file and the fsimage file are periodically merged (checkpointed) by the Secondary NameNode
Job Tracker
MapReduce master
Client submits the job to JobTracker
JobTracker talks to the NameNode to get the list of
blocks
Job Tracker locates the task tracker on the machine
where data is located
Data Localization
Once all the mapper tasks are over it runs the reducer
tasks
Task Tracker
Responsible for running tasks (map or reduce tasks)
Sends heartbeat signals to the job tracker
Regarding the available number of slots
Status of the running tasks
The client caches block locations so it can read data without contacting the NN again
Along with the data, a checksum is also shipped for verifying data integrity
If the verification fails, the block is read from another DN
HDFS API
Hadoop's Configuration
Encapsulates client and server configuration
Use Configuration class to access the file system.
Pseudo Mode
Cluster Mode
Hadoop's Path
A file on HDFS is represented using Hadoop's Path object
A Path is created from an HDFS URI such as
hdfs://localhost:54310/user/dev/sample.txt
FileSystem API
General file system Api
public static FileSystem get(Configuration conf) throws IOException
public static FileSystem get(URI uri, Configuration conf) throws
IOException
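A minimal sketch of using the FileSystem API (not part of the original slides; the explicit fs.default.name value and the file path are only examples):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();               // picks up core-site.xml etc. from the classpath
        conf.set("fs.default.name", "hdfs://localhost:54310");  // or rely on the conf directory
        FileSystem fs = FileSystem.get(conf);                   // HDFS client for the configured URI
        FSDataInputStream in = fs.open(new Path("/user/dev/sample.txt"));  // example path
        IOUtils.copyBytes(in, System.out, 4096, true);          // stream the file contents to stdout
    }
}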
Hands On
Run the HadoopUtilityProgram.java
Pseudo Mode
Utilizing Local file system
Module 1
What is mapper?
What is reducer?
A map task typically works on one block / input split of data (dfs.block.size)
Number of blocks / input splits = number of map tasks
After all the map tasks are completed, the output from the map tasks is passed to the reducers
MapReduce Terminology
What is job?
Complete execution of mapper and reducers over the entire data set
What is task?
Single unit of execution (map or reduce task)
Map task executes typically over a block of data (dfs.block.size)
Reduce task works on mapper output
What is task attempt?
Instance of an attempt to execute a task (map or reduce task)
If a task fails while working on a particular portion of data, another attempt will run on that portion of data (possibly on that machine itself)
If a task fails 4 times, then the task is marked as failed and the entire job fails
The framework makes sure that at least one attempt of the task is run on a different machine
Terminology continued
How many tasks can run on portion of data?
Maximum 4
If speculative execution is ON, more task will run
What is failed task?
Task can be failed due to exception, machine failure etc.
A failed task will be re-attempted again (4 times)
What is killed task?
If task fails 4 times, then task is killed and entire job fails.
Task which runs as part of speculative execution will also be
marked as killed
Input Split
Portion or chunk of data on which mapper operates
Input split is just a reference to the data
Typically the input split size is equal to one block of data (dfs.block.size)
Split size = max(minimum split size, min(maximum split size, block size)); typical block sizes are 64 MB and 128 MB

Max split size     Block size   Split size taken
LONG.MAX_VALUE     64 MB        64 MB
1 MB               128 MB       1 MB
32 MB              128 MB       32 MB
128 MB             128 MB       128 MB
What is Mapper?
Mapper is the first phase of MapReduce job
Mapper contd
map(in_key, in_value) -> (out_key, out_value)
Mapper contd
Map function is called for one record at a time
Input Split consist of records
For each record in the input split, map function will be
called
Each record will be sent as key value pair to map
function
So when writing the map function, keep ONLY one record in mind
It does not keep state about how many records it has processed or how many records are yet to appear
It knows only the current record
What is reducer?
Reducer runs when all the mapper tasks are completed
After the mapper phase, all the intermediate values for a given intermediate key are combined into a list
Reducer contd
NOTE: all the values for a particular intermediate key go to the same reducer
Module 2
Java type   Hadoop Writable
Integer     IntWritable
Long        LongWritable
Float       FloatWritable
Byte        ByteWritable
String      Text
Double      DoubleWritable
Input Format
Before running the job on the data residing on HDFS, the InputFormat to be used has to be specified
The InputFormat determines the (key, value) types passed to the mapper, for example:
Key            Value
LongWritable   Text           (TextInputFormat: key = byte offset, value = the line)
Text           Text
ByteWritable   ByteWritable
For some formats (e.g. sequence files) the key and value types need to be determined from the file header
If the input split has 4 records, the map function will be called 4 times, once for each record
Each record is sent to the map function as a key-value pair
With TextInputFormat the key is the byte offset within the file and the value is one line of the file
Example:
Input lines: "Hello how are you", "Hello I am ..."
Mapper output:
(Hello,1) (how,1) (are,1) (you,1)
(Hello,1) (I,1) (am,1) ...
If the word Hello has been emitted 4 times in the map phase, the reducer receives Hello with the list of values {1,1,1,1}
MapReduce Flow-1
MapReduce Flow-2
MapReduce Flow - 3
Input Split 1: "Hello World Hello World"
Mapper output: (Hello,1) (World,1) (Hello,1) (World,1)
After partitioning, shuffling and sorting: Hello -> (1,1), World -> (1,1)
Reducer: Hello = 1+1 = 2, World = 1+1 = 2
The map task runs on the node where its block of data is present (for data localization)
Features contd
After the map phase is completed, the intermediate output has to be moved to the reducer
So it's better to run more than one reducer for load balancing
If more than one reducer is running, the PARTITIONER decides which intermediate key-value pair should go to which reducer
Features contd
Data localization is not applicable for the reducer
Intermediate key-value pairs are copied over the network to the reducer
A key and its list of values are always given to the same reducer
Module 3
Third argument (Context): using this context object you emit the output key-value pair
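The mapper code slide is not fully reproduced in this material; a minimal word-count mapper sketch that matches the reducer below (the class name WordCountMapper comes from the driver snippet later; the rest is an assumption):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key   = byte offset of the line (from TextInputFormat)
        // value = one line of the input split
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // emit (word, 1) for every word in the line
        }
    }
}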
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    // Input key/value types are given by the map output key/value types
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context output)
            throws IOException, InterruptedException {
        // The reducer gets a key and the list of values for that key,
        // e.g. Hello {1,1,1,1,1,1,1,1,1,1}
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        output.write(key, new IntWritable(sum));
    }
}
Step 1: Get the configuration object, which tells you where the NameNode and JobTracker are running
Step 6: Specify the reducer output key and output value classes
The input key and value types of the reducer are determined by the map output key and map output value classes
job.setMapperClass(WordCountMapper.class);
job.setReducerClass(WordCountReducer.class);
job.setInputFormatClass(TextInputFormat.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
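Putting the driver fragments above together, a complete driver might look like this sketch (the class name WordCountDriver and taking input/output paths from command-line arguments are assumptions):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();          // Step 1: where NameNode / JobTracker run
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setInputFormatClass(TextInputFormat.class);

        job.setMapOutputKeyClass(Text.class);              // mapper output types
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);                 // Step 6: reducer output types
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory on HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}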
Module 4
Shopping recommendations
Identifying people of interest
File B:
This is dog
My dog is fat
Output from the mapper:
This:File A  is:File A  cat:File A  Big:File A  fat:File A  hen:File A
This:File B  is:File B  dog:File B  My:File B  dog:File B  is:File B  fat:File B
Final Output:
This: File A, File B
is: File A, File B
cat: File A
fat: File A, File B
Problem Statement
Calculate the average word length for each (starting) character
Output: character and its average word length, e.g.
(3+3+5)/3 = 3.66
2/1 = 2
5/1 = 5
4/1 = 4
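One possible solution sketch (not the official one): the mapper emits (first character of the word, word length) and the reducer averages the lengths per character; class and type choices are assumptions:

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class AvgWordLength {
    public static class LetterMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        public void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            for (String w : value.toString().split("\\s+")) {
                if (w.isEmpty()) continue;
                // key = first character of the word, value = its length
                ctx.write(new Text(w.substring(0, 1).toLowerCase()), new IntWritable(w.length()));
            }
        }
    }

    public static class AvgReducer extends Reducer<Text, IntWritable, Text, DoubleWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0, count = 0;
            for (IntWritable len : values) { sum += len.get(); count++; }
            ctx.write(key, new DoubleWritable((double) sum / count));  // e.g. (3+3+5)/3 = 3.66
        }
    }
}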
Module 4
Hands On
Combiner
A large number of mappers running will produce a large amount of intermediate output that has to be moved over the network
Lot of network traffic
Shuffling/copying the mapper output to the machine where the reducer will run takes a lot of time
Combiner contd
Similar to reducer
Runs on the same machine as the mapper task
Runs the reducer code on the intermediate output of the
mapper
Advantages
Minimize the data transfer across the network
Speed up the execution
Reduces the burden on reducer
Combiner contd
The combiner has the same signature as the reducer class
You can make the existing reducer run as the combiner, if
The operation is associative and commutative in nature
The combiner is run by the framework
It may run more than once on the mapper machine
Combiner contd
Combiner contd
In the driver class specify
job.setCombinerClass(MyReducer.class);
The framework may or may not run the combiner
The combiner may run more than once on the same mapper
Depends on two properties: io.sort.factor and io.sort.mb
Partitioner
It is called after you emit your key value pairs from mapper
context.write(key,value)
A large number of mappers running will generate a large amount of intermediate data
If only one reducer is specified, then all the intermediate key-value pairs go to that single reducer
Partitioner contd
Partitioner divides the keys among the reducers
If more than one reducer running, then partitioner
decides which key value pair should go to which reducer
Default is Hash Partitioner
Calculates the hashcode and do the modulo operator
with total number of reducers which are running
hashCode of key % numOfReducers
The above operation returns a value between ZERO and (numOfReducers - 1)
Partitioner contd
Example:
Number of reducers = 3
Key = Hello
Hash code = 30 (lets say)
The key Hello and its list of values will go to 30 % 3 = 0th
reducer
Key = World
Hash Code = 31 (lets say)
The key world and its list of values will go to 31 % 3 = 1st
reducer
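A sketch of a partitioner equivalent to the default hash partitioner described above (the class name MyPartitioner and the Text/IntWritable types are assumptions):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// hashCode of the key modulo the number of reducers picks the target reducer,
// a value between 0 and (numReduceTasks - 1).
public class MyPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

// In the driver: job.setPartitionerClass(MyPartitioner.class);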
Module 4
Distributed cache
Hands On
Local Runner vs Cluster Mode
Property             Local Runner   Cluster Mode (HDFS)
fs.default.name      file:///       hdfs://IP:PORT
mapred.job.tracker   local          IP:PORT
Running jobs from the command line (implementing the Tool interface and using ToolRunner)
Can specify distributed cache properties
Can be used for tuning MapReduce jobs
Flexibility to increase or decrease the number of reduce tasks
The driver class overrides the run() method
The run method is where the actual driver code goes
No need of creating the configuration object yourself
Note the getConf() inside the job object
Generic options that can be passed on the command line:
-D property=value
-conf fileName [For overriding the default properties]
-fs uri [ -D fs.default.name=<NameNode URI> ]
-jt host:port [ -D mapred.job.tracker=<JobTracker URI>]
-files file1,file2 [ Used for distributed cache ]
-libjars jar1, jar2, jar3 [Used for distributed cache]
-archives archive1, archive2 [Used for distributed cache]
setup() and cleanup() functions
If anything extra (parameters, files, etc.) is required by the task, it can be prepared in setup(), which runs before the first call to the map function
cleanup() runs after the last call to the map function is over
Can do the cleaning there, e.g. if you have opened some file in setup(), close it in cleanup()
MR framework will copy the files to the slave nodes on its local
file system before executing the task on that node
After the task is completed the file is removed from the local file
system.
@Override
public void setup(Context context) throws IOException, InterruptedException
{
    // files is a URI[] field of the mapper class
    this.files = DistributedCache.getCacheFiles(context.getConfiguration());
    Path path = new Path(files[0].toString());
    // do something with the cached file
}
Module 5
Custom keys
Since the keys will be compared during the sorting phase, they must implement the WritableComparable interface
@Override
public void write(DataOutput out) throws IOException {
    xcoord.write(out);
    ycoord.write(out);
}
Provide setter and getter methods for the member variables
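A fuller sketch of a custom key with two coordinates, pairing the write() above with the readFields() and compareTo() it needs (the class name Point and the IntWritable field types are assumptions):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.WritableComparable;

public class Point implements WritableComparable<Point> {
    private IntWritable xcoord = new IntWritable();
    private IntWritable ycoord = new IntWritable();

    @Override
    public void write(DataOutput out) throws IOException {
        xcoord.write(out);          // serialize the fields in a fixed order
        ycoord.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        xcoord.readFields(in);      // read them back in the same order
        ycoord.readFields(in);
    }

    @Override
    public int compareTo(Point other) {   // used when keys are sorted
        int cmp = xcoord.compareTo(other.xcoord);
        return (cmp != 0) ? cmp : ycoord.compareTo(other.ycoord);
    }

    // setters and getters for xcoord / ycoord omitted for brevity
}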
Which reducer does a key go to?
The default partitioner takes the entire key to decide which reducer
The key Smith John is different from Smith Jacob, so they may go to different reducers
Setting the number of reducers
job.setNumReduceTasks()
Command line: [ -D mapred.reduce.tasks=N ]
Custom record reader
Example: filter the wiki records by project name
You can run a MapReduce job to filter the records, or
Implement your logic in a custom input format / record reader while reading the records
Extend your class from FileInputFormat<Text, IntWritable> and override createRecordReader:

@Override
public RecordReader<Text, IntWritable> createRecordReader(InputSplit input,
        TaskAttemptContext arg1) throws IOException, InterruptedException
{
    // WikiRecordReader is the custom record reader which needs to be implemented
    return new WikiRecordReader();
}
Extend your class from the RecordReader class; the existing LineRecordReader is reused internally:

public WikiRecordReader()
{
    lineReader = new LineRecordReader();   // using the existing LineRecordReader class
}

@Override
public void initialize(InputSplit input, TaskAttemptContext context)
        throws IOException, InterruptedException
{
    lineReader.initialize(input, context);
}

lineKey will be the input key to the map function
The record reader supplies the records to the map function one at a time
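The remaining RecordReader methods are not shown in the slides; a minimal sketch of how WikiRecordReader might delegate to the wrapped LineRecordReader is below (the field names and the parsing of the wiki page-count line into pageName/pageCount are assumptions):

private LineRecordReader lineReader;
private Text lineKey = new Text();
private IntWritable lineValue = new IntWritable();

@Override
public boolean nextKeyValue() throws IOException, InterruptedException {
    if (!lineReader.nextKeyValue()) {
        return false;                                   // no more records in this split
    }
    // assumed line format: projectName pageName pageCount pageSize
    String[] fields = lineReader.getCurrentValue().toString().split(" ");
    lineKey.set(fields[1]);                             // pageName becomes the map input key
    lineValue.set(Integer.parseInt(fields[2]));         // pageCount becomes the map input value
    return true;
}

@Override
public Text getCurrentKey() { return lineKey; }

@Override
public IntWritable getCurrentValue() { return lineValue; }

@Override
public float getProgress() throws IOException, InterruptedException { return lineReader.getProgress(); }

@Override
public void close() throws IOException { lineReader.close(); }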
Module 6
Custom comparator
Comparable & Comparator are interfaces in Java used to define sorting
Keys are sorted between the Map and Reduce phases
All keys must implement WritableComparable
WritableComparable
Hadoop's equivalent of the Comparable interface
Enables you to define only one way of sorting, e.g. ascending order
How to define multiple sorting strategies (e.g. sorted on values)?
Use a custom WritableComparator, as in the example below
public class AscendingComparator extends WritableComparator {
    protected AscendingComparator() {
        super(Employee.class, true);   // register the key class; true = create instances for comparison
    }
    @Override
    public int compare(WritableComparable o1, WritableComparable o2) {
        Employee e1 = (Employee) o1;
        Employee e2 = (Employee) o2;
        return e1.getFirstName().compareTo(e2.getFirstName());
    }
}

config.setOutputKeyComparatorClass(AscendingComparator.class);
Current Topic
Secondary Sorting
Motivation: sort the values for each of the keys
Reminder: keys are sorted by default
So now the values also need to be sorted
Map output (first name, last name) pairs:
John Smith, John Rambo, Gary Kirsten, John McMillan, John Andrews, Bill Rod, Tim Southee
After sorting on the composite key (first name, then last name):
Bill Rod, Gary Kirsten, John Andrews, John McMillan, John Rambo, John Smith, Tim Southee
Map output: let's say there are 2 mappers:
Steps: Mapper
@Override
public void map(Text arg0, Text arg1, Context context) throws IOException, InterruptedException {
    Employee emp = new Employee();
    emp.setFirstName(arg0);
    Text lastName = new Text(arg1.toString());
    emp.setLastName(lastName);
    context.write(emp, lastName);   // composite key (Employee), value = last name
}
Steps: Reducer
@Override
public void reduce(Employee emp, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
    // values: the last names for this first name, already sorted
}
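To make all values for one first name reach a single reduce() call while the composite key keeps them sorted by last name, a grouping comparator is needed; a sketch, assuming Employee implements WritableComparable and getFirstName() returns a Text:

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Groups composite Employee keys by firstName only, so one reduce() call receives
// all last names for a first name, already sorted by the full (firstName, lastName) key.
public class FirstNameGroupingComparator extends WritableComparator {
    protected FirstNameGroupingComparator() {
        super(Employee.class, true);   // true => create key instances for comparison
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        Employee e1 = (Employee) a;
        Employee e2 = (Employee) b;
        return e1.getFirstName().compareTo(e2.getFirstName());
    }
}

// Old-API driver (as used in these slides): conf.setOutputValueGroupingComparator(FirstNameGroupingComparator.class);
// New API: job.setGroupingComparatorClass(FirstNameGroupingComparator.class);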
Module 7
http://dammit.lt/wikistats/
It's an hourly data set
Structured data with four columns
Project Name
Page Name
Page Count
Page size
The exercises work on these files, e.g. listing the pages sorted by page count in descending order
Joins
Current Topic
What are joins?
What is map side join?
What is reduce side join?
Hands on
Employee Table
Name        ID    Country Code
James       01    AUS
Siddharth   11    IN
Suman       23    US

Country Table
Country Code   Country Name
AUS            Australia
IN             India
US             United States
Joins contd
Your job is to replace the country code in the employee table with the country name:
Name        ID    Country
James       01    Australia
Siddharth   11    India
Suman       23    United States
Joins contd
MapReduce provides two types of join
Map Side join
Reduce Side join
Both kinds of join produce the same result but differ in the way the join is done
Map Side join is done at mapper phase. No Reducer is
required
Reduce side join requires to have reducer
Current Topic
What are joins?
What is map side join?
What is reduce side join?
Hands on
Employee record:  EMPID 42, Name John, Location 13
Location record:  LocationId 13, LocationName New York
Joined output:    EMPID 42, Name John, LocationName New York
Now, join the data in the Reducer (all values for the same key will be grouped together in one reduce call)
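A compact reduce-side join sketch for the records above (the comma-separated record layout, the EMP/LOC tags, and the class names are assumptions):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ReduceSideJoin {
    public static class JoinMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        public void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] f = value.toString().split(",");
            if (f.length == 3) {
                // employee record: EMPID,Name,LocationId -> key = LocationId
                ctx.write(new Text(f[2]), new Text("EMP\t" + f[0] + "," + f[1]));
            } else {
                // location record: LocationId,LocationName -> key = LocationId
                ctx.write(new Text(f[0]), new Text("LOC\t" + f[1]));
            }
        }
    }

    public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        public void reduce(Text locationId, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            String locationName = "";
            List<String> employees = new ArrayList<String>();
            for (Text v : values) {                 // all records sharing a LocationId arrive together
                String[] parts = v.toString().split("\t", 2);
                if ("LOC".equals(parts[0])) locationName = parts[1];
                else employees.add(parts[1]);
            }
            for (String emp : employees) {
                ctx.write(new Text(emp), new Text(locationName));  // e.g. 42,John -> New York
            }
        }
    }
}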
Motivation contd
MapReduce in Java is very low level
You need to deal with all the intricacies yourself
Hive
Data warehousing tool on top of Hadoop
SQL like interface
Provides SQL like language to analyze the data stored
on HDFS
Can be used by people who know SQL
Hive contd
Under the hood hive queries are executed as
MapReduce jobs
No extra work is required
Hive Components
MetaStore
It's a database consisting of table definitions and other metadata
By default stored on the local machine in a Derby database
It can be kept on a shared machine, such as a relational database, if multiple users are using Hive
Installing Hive
Download Hive (0.10.0)
Create the following environment variables in the .bashrc file
export HIVE_HOME=<path to your hive home directory>
export HIVE_CONF_DIR=$HIVE_HOME/conf
export HIVE_LIB=$HIVE_HOME/lib
export PATH=$PATH:$HIVE_HOME/bin
export CLASSPATH=$CLASSPATH:$HIVE_LIB
Installing Hive
Create hive-site.xml (if not present) under
$HIVE_HOME/conf folder
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive/warehouse</value>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310/</value>
</property>
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
</property>
</configuration>
Installing Hive
Create two directories, /tmp and /user/hive/warehouse, residing on HDFS
Databases
Namespace that separates tables from other units to avoid naming conflicts
Table
Homogenous unit of data having same schema
Partition
Determines how the data is stored
Virtual columns, not part of data but derived from load
Buckets / Cluster
Data in each partition can be further divided into buckets
Efficient for sampling the data
Basic Queries
hive> SHOW TABLES;
hive>SHOW DATABASES;
hive> CREATE TABLE sample(firstName STRING, lastName STRING, id INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '<delimiter>'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
hive> DESCRIBE sample;
Loading data from the local file system:
LOAD DATA LOCAL INPATH 'sample_local_data' INTO TABLE sample;
Joins in Hive
Joins are complex and tricky to write as plain MapReduce jobs
Storing Results
Create a new table and store the results into this table
CREATE TABLE output(sum INT);
INSERT OVERWRITE TABLE output
SELECT sum(id) FROM sample;
The result (e.g. a sum or a count) is stored under the /user/hive/warehouse folder
The table definition just points to that location
Hive Limitations
Not all standard SQL queries are supported
Sub queries are not supported
No Support for UPDATE and DELETE operation
Cannot insert single rows
Installing Pig
Download Pig (0.10.0)
Create following environment variables in .bashrc file
export PIG_HOME=<path to pig directory>
export PATH=$PATH:$PIG_HOME/bin
(compatible with Hadoop version 1.0.3)
export PIG_CLASSPATH=$HADOOP_HOME/conf/
Specify fs.default.name & mapred.job.tracker in pig.properties file
under $PIG_HOME/conf directory
Pig basics
High level platform for MapReduce programs
Uses PigLatin Language
Simple Syntax
PigLatin queries are converted into map reduce jobs
No shared meta data is required as in Hive
Pig Terminology
Tuple:
Rows / Records
Bags:
Unordered Collection of tuples
Relation:
Generally the output of a Pig operator is stored in a relation
PigLatin scripts generally start by loading one or more datasets into relations
Loading Data
LOAD keyword
TAB is the default delimiter
If the input file is not tab delimited, loading will still not fail, but the fields will not be split as expected; specify the delimiter with PigStorage()
Note: when you are defining the column names, you also need to specify their types
Filtering
Use Filter Keyword
Grouping
Use GROUP key word
grouped_records = GROUP filtered_records by pageName;
Joins
Use JOIN keyword
Example: join two relations (e.g. persons and orders) on the personId column
Joins contd
Left outer join
For doing left outer join use LEFT OUTER key word
Joins contd
Right outer join
result = JOIN persons BY personId RIGHT OUTER, orders BY
personId;
Joins contd
Full outer join
FOREACH ... GENERATE
Iterates over the tuples in a bag
EXPLAIN
ILLUSTRATE
REGISTER
User Defined Functions (UDF)
DEFINE
Ways to run a Pig script:
pig script_name.pig (from the command line)
run script_name.pig (from the Grunt shell)
exec script_name.pig (from the Grunt shell)