Chapter 10
Big Data and Analytics
Seema Acharya
Subhashini Chellappan
Introduction to Pig
Q/A 15 minutes
What is Pig?
Key Features of Pig
The Anatomy of Pig
Pig on Hadoop
Pig Philosophy
Pig Latin Overview
Pig Latin Statements
Pig Latin: Identifiers
Pig Latin: Comments
Data Types in Pig
Simple Data Types
Complex Data Types
Running Pig
Execution Modes of Pig
Relational Operators
Eval Function
Piggy Bank
When to use Pig?
When NOT to use Pig?
Pig versus Hive
• It provides an engine for executing data flows (how your data should flow). Pig
processes data in parallel on the Hadoop cluster.
• Pig Latin contains operators for many of the traditional data operations such as
join, filter, sort, etc.
• It allows users to develop their own functions (User Defined Functions) for
reading, processing, and writing data.
• It provides an interactive shell (Grunt) where you can type Pig Latin statements.
• Pig uses both the Hadoop Distributed File System (HDFS) and the MapReduce
programming model.
• By default, Pig reads input files from HDFS. Pig stores the intermediate
data (data produced by MapReduce jobs) and the output in HDFS.
• However, Pig can also read input from and write output to other sources.
Pig Philosophy
• Pigs Eat Anything
• Pigs Live Anywhere
• Pigs are Domestic Animals
• Pigs Fly
2. Batch Mode.
Find the tuples of those students whose GPA is greater than 4.0.
DUMP B;
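The statement that builds B is not shown here. As a rough illustration of what the filter does, the following is a plain-Python sketch (not Pig Latin); the sample tuples and the field names rollno and name are hypothetical, and the Pig Latin line in the comment is an assumed counterpart rather than the original statement.

```python
# Plain-Python sketch of filtering a relation on gpa; sample data is hypothetical.
# Assumed Pig Latin counterpart: B = FILTER A BY gpa > 4.0;
from typing import NamedTuple


class Student(NamedTuple):
    rollno: int
    name: str
    gpa: float


# Relation A: a collection of student tuples (made-up values).
A = [
    Student(1001, "John", 3.0),
    Student(1002, "Jack", 4.0),
    Student(1003, "Smith", 4.5),
    Student(1004, "Scott", 4.2),
]

# Keep only the tuples whose gpa is strictly greater than 4.0.
B = [t for t in A if t.gpa > 4.0]

# Rough equivalent of DUMP B: print each surviving tuple.
for t in B:
    print(tuple(t))
```

Note that a student with a GPA of exactly 4.0 is excluded, because the condition is "greater than", not "greater than or equal to".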
B = GROUP A BY gpa;
DUMP B;
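To make the shape of a grouped relation concrete, here is a plain-Python sketch (not Pig Latin) of grouping tuples by gpa. The sample tuples are hypothetical. Each output record pairs a group key with the bag of tuples that share that key, which is the general shape GROUP produces.

```python
# Plain-Python sketch of GROUP A BY gpa; sample data is hypothetical.
from collections import defaultdict

# Relation A: (rollno, name, gpa) tuples (made-up values).
A = [
    (1001, "John", 3.0),
    (1002, "Jack", 4.0),
    (1003, "Smith", 4.5),
    (1004, "Scott", 4.0),
]

# Collect every tuple into the bag belonging to its gpa value.
bags = defaultdict(list)
for t in A:
    bags[t[2]].append(t)

# B holds (group key, bag) pairs. Sorting is only for stable display;
# Pig itself does not promise any particular order of groups.
B = sorted(bags.items())

for key, bag in B:
    print(key, bag)
```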
B = DISTINCT A;
DUMP B;
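The effect of DISTINCT can be sketched in plain Python (not Pig Latin) as removing tuples that are identical in every field. The sample tuples below are hypothetical.

```python
# Plain-Python sketch of B = DISTINCT A; sample data is hypothetical.

# Relation A with two fully duplicated tuples (made-up values).
A = [
    (1001, "John", 3.0),
    (1002, "Jack", 4.0),
    (1001, "John", 3.0),
    (1003, "Smith", 4.5),
    (1002, "Jack", 4.0),
]

# dict.fromkeys keeps the first occurrence of each whole tuple.
# This sketch preserves first-seen order; Pig does not promise an order.
B = list(dict.fromkeys(A))

for t in B:
    print(t)
```

Duplicates are judged on the whole tuple, so two students who share only a name or only a GPA would both be kept.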
Join two relations, namely “student” and “department”, based on the values
contained in the “rollno” column.
DUMP C;
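The join statement itself is not shown here. As an illustration of the idea, the following plain-Python sketch (not Pig Latin) performs an inner join of the two relations on rollno. All sample values, and every field other than rollno, are hypothetical.

```python
# Plain-Python sketch of an inner join on rollno; sample data is hypothetical.
from collections import defaultdict

# student: (rollno, name, gpa); department: (rollno, deptname). Made-up values.
student = [
    (1001, "John", 3.0),
    (1002, "Jack", 4.0),
    (1003, "Smith", 4.5),
]
department = [
    (1001, "Engineering"),
    (1002, "Science"),
    (1004, "Arts"),
]

# Index the department tuples by rollno so each student is matched quickly.
dept_by_rollno = defaultdict(list)
for d in department:
    dept_by_rollno[d[0]].append(d)

# For every matching pair, concatenate the fields of both tuples.
C = [s + d for s in student for d in dept_by_rollno.get(s[0], [])]

for t in C:
    print(t)
```

Rows without a partner on the other side (rollno 1003 and 1004 in this sample) do not appear in the result, which is the behaviour of an inner join.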
DUMP B;
DUMP X;
B = GROUP A BY studname;
DUMP C;
B = GROUP A BY studname;
DUMP C;
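In both GROUP A BY studname examples, the statement that derives C from B is not shown here. Assuming C applies an eval function to each group (an average and a maximum are used below purely as assumptions), the idea can be sketched in plain Python (not Pig Latin). The marks field and all sample values are hypothetical.

```python
# Plain-Python sketch of group-then-aggregate; data and the "marks" field are hypothetical.
from collections import defaultdict
from statistics import mean

# Relation A: (studname, marks) tuples (made-up values).
A = [
    ("John", 80.0),
    ("Jack", 70.0),
    ("John", 90.0),
    ("Jack", 60.0),
    ("Smith", 75.0),
]

# Group step: one bag of marks per student name.
B = defaultdict(list)
for studname, marks in A:
    B[studname].append(marks)

# Aggregate step: apply a function to each bag (assumed: average, maximum).
C_avg = {name: mean(bag) for name, bag in B.items()}
C_max = {name: max(bag) for name, bag in B.items()}

print(C_avg)
print(C_max)
```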
John [city#Bangalore]
Jack [city#Pune]
James [city#Chennai]
DUMP B;
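The sample rows above pair a name with a map value written as [key#value]. The plain-Python sketch below (not Pig Latin) shows one way to read such a row and look up the city key. It assumes the two fields are tab-separated and that B holds the name together with the city; neither assumption is confirmed by the lines shown here. The helper parse_map is hypothetical.

```python
# Plain-Python sketch of reading [key#value] map fields; parse_map is a hypothetical helper.

def parse_map(text):
    """Turn a string such as '[city#Bangalore]' into {'city': 'Bangalore'}."""
    inner = text.strip()[1:-1]  # drop the surrounding square brackets
    result = {}
    for pair in inner.split(","):
        if pair:
            key, value = pair.split("#", 1)
            result[key.strip()] = value.strip()
    return result


# Input rows, assumed to be tab-separated: name, then a map.
rows = [
    "John\t[city#Bangalore]",
    "Jack\t[city#Pune]",
    "James\t[city#Chennai]",
]

# For each row, keep the name and the value stored under the 'city' key.
B = []
for row in rows:
    name, raw_map = row.split("\t")
    B.append((name, parse_map(raw_map)["city"]))

for t in B:
    print(t)
```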
register '/root/pigdemos/piggybank-0.12.0.jar';
DUMP upper;
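The statement that defines the relation upper is not shown here. Assuming it applies an upper-casing function from the registered Piggy Bank jar to a name field, the per-tuple effect can be sketched in plain Python (not Pig Latin). The sample names and the helper to_upper are hypothetical.

```python
# Plain-Python sketch of applying an upper-casing function to each tuple.
# The sample names and the helper to_upper are hypothetical.

def to_upper(value):
    """Upper-case a string; pass a missing (None) value through unchanged."""
    return None if value is None else value.upper()


# Relation A: single-field tuples holding a name (made-up values).
A = [("John",), ("Jack",), (None,), ("James",)]

# Apply the function to the first field of every tuple.
upper = [(to_upper(t[0]),) for t in A]

for t in upper:
    print(t)
```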
2. When there is a time constraint because Pig is slower than MapReduce jobs.
http://pig.apache.org/docs/r0.12.0/index.html
http://www.edureka.co/blog/introduction-to-pig/