
The Hadoop ecosystem contains different sub-projects (tools) such as Sqoop, Pig, and Hive that are used to support the Hadoop modules.

 Sqoop: It is used to import and export data between HDFS and an RDBMS.
 Pig: It is a procedural language platform used to develop scripts for MapReduce operations.
 Hive: It is a platform used to develop SQL-type scripts to perform MapReduce operations.

Note: There are various ways to execute MapReduce operations:

 The traditional approach, using a Java MapReduce program for structured, semi-structured, and unstructured data.
 The scripting approach for MapReduce to process structured and semi-structured data using Pig.
 The Hive Query Language (HiveQL or HQL) for MapReduce to process structured data
using Hive.

What is Hive
Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on
top of Hadoop to summarize Big Data, and makes querying and analyzing easy.

Initially, Hive was developed by Facebook; later, the Apache Software Foundation took it up and developed it further as an open-source project under the name Apache Hive. It is used by different companies. For example, Amazon uses it in Amazon Elastic MapReduce.

Hive is not

 A relational database
 A design for OnLine Transaction Processing (OLTP)
 A language for real-time queries and row-level updates

Features of Hive
 It stores the schema in a database and the processed data in HDFS.
 It is designed for OLAP.
 It provides an SQL-type language for querying called HiveQL or HQL (a small example follows this list).
 It is familiar, fast, scalable, and extensible.
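
For example, the following is a minimal sketch of HiveQL's SQL-like syntax; the employee table and its columns are hypothetical and used only for illustration:

hive> select Name, Salary from employee where Salary > 30000 order by Salary desc;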
Architecture of Hive
The following component diagram depicts the architecture of Hive:

This component diagram contains different units. The following table describes each unit:

Unit Name and Operation:

User Interface: Hive is data warehouse infrastructure software that can create interaction between the user and HDFS. The user interfaces that Hive supports are the Hive Web UI, the Hive command line, and Hive HD Insight (on Windows Server).

Meta Store: Hive chooses respective database servers to store the schema or metadata of tables, databases, columns in a table, their data types, and the HDFS mapping.

HiveQL Process Engine: HiveQL is similar to SQL for querying schema information in the Metastore. It is one of the replacements for the traditional approach of writing a MapReduce program. Instead of writing a MapReduce program in Java, we can write a query for the MapReduce job and process it.

Execution Engine: The conjunction part of the HiveQL Process Engine and MapReduce is the Hive Execution Engine. The execution engine processes the query and generates results the same as MapReduce results. It uses the flavor of MapReduce.

HDFS or HBASE: The Hadoop Distributed File System or HBase is the data storage technique used to store the data in the file system.
Working of Hive
The following diagram depicts the workflow between Hive and Hadoop.

The following table defines how Hive interacts with the Hadoop framework:

Step No. and Operation:

1. Execute Query: The Hive interface, such as the Command Line or Web UI, sends the query to the Driver (any database driver such as JDBC, ODBC, etc.) to execute.
2. Get Plan: The driver takes the help of the query compiler, which parses the query to check the syntax and the query plan or the requirement of the query.
3. Get Metadata: The compiler sends a metadata request to the Metastore (any database).
4. Send Metadata: The Metastore sends the metadata as a response to the compiler.
5. Send Plan: The compiler checks the requirement and resends the plan to the driver. Up to here, the parsing and compiling of the query is complete.
6. Execute Plan: The driver sends the execute plan to the execution engine.
7. Execute Job: Internally, the process of executing the job is a MapReduce job. The execution engine sends the job to the JobTracker, which is in the Name node, and it assigns this job to the TaskTracker, which is in the Data node. Here, the query executes the MapReduce job.
7.1. Metadata Ops: Meanwhile, during execution, the execution engine can execute metadata operations with the Metastore.
8. Fetch Result: The execution engine receives the results from the Data nodes.
9. Send Results: The execution engine sends those resultant values to the driver.
10. Send Results: The driver sends the results to the Hive interfaces.
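
As a rough illustration of steps 2 to 6, Hive's EXPLAIN statement prints the plan that the compiler produces before the execution engine turns it into MapReduce stages. The demo.employee table used here is the one created later in this document, and the exact plan output depends on the Hive version:

hive> explain select Name, count(*) from demo.employee group by Name;

A query with a GROUP BY like this typically compiles to at least one map-reduce stage, which is what the execution engine then submits as a job.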

Hive consists mainly of three core parts:

1. Hive Clients
2. Hive Services
3. Hive Storage and Computing

Hive Clients:

Hive provides different drivers for communication with different types of applications. For Thrift-based applications, it provides a Thrift client for communication.
For Java-related applications, it provides JDBC drivers. For other types of applications, it provides ODBC drivers. These clients and drivers in turn communicate with the Hive server in the Hive services.

Hive Services:

Client interactions with Hive can be performed through the Hive Services. If the client wants to perform any query-related operations in Hive, it has to communicate through the Hive Services.

CLI is the command-line interface that acts as a Hive service for DDL (Data Definition Language) operations. All drivers communicate with the Hive server and then with the main driver in the Hive services, as shown in the above architecture diagram.

The driver present in the Hive services represents the main driver, and it communicates with all types of JDBC, ODBC, and other client-specific applications. The driver processes those requests from the different applications and passes them to the meta store and file systems for further processing.

Hive Storage and Computing:

Hive services such as the Meta store, File system, and Job Client in turn communicate with Hive storage and perform the following actions (a small example follows this list):

 Metadata information of the tables created in Hive is stored in the Hive "Meta storage database".
 Query results and the data loaded into the tables are stored in the Hadoop cluster on HDFS.
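
To see both halves of this split (metadata in the metastore, data on HDFS) for a single table, the DESCRIBE FORMATTED command can be used; demo.employee here refers to the table created later in this document:

hive> describe formatted demo.employee;

Among other fields, the output lists the table's columns (taken from the metastore) and a Location line pointing to the HDFS directory (for example /user/hive/warehouse/demo.db/employee) where the rows are actually stored.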

Job execution flow:

From the above diagram, we can understand the job execution flow in Hive with Hadoop.

The data flow in Hive behaves in the following pattern:

1. Executing the query from the UI (User Interface).
2. The driver interacts with the compiler to get the plan. (Here, the plan refers to the query execution process and the gathering of its related metadata information.)
3. The compiler creates the plan for the job to be executed. The compiler communicates with the Meta store to send a metadata request.
4. The Meta store sends the metadata information back to the compiler.
5. The compiler communicates the proposed plan to execute the query back to the driver.
6. The driver sends the execution plan to the execution engine.
7. The Execution Engine (EE) acts as a bridge between Hive and Hadoop to process the query and perform DFS operations.

 The EE first contacts the Name node and then the Data nodes to get the values stored in the tables.
 The EE fetches the desired records from the Data nodes. The actual data of the tables resides in the Data nodes only, while from the Name node it only fetches the metadata information for the query.
 It collects the actual data from the Data nodes related to the mentioned query.
 The Execution Engine (EE) communicates bi-directionally with the Meta store present in Hive to perform DDL (Data Definition Language) operations. Here, DDL operations like CREATE, DROP, and ALTER on tables and databases are done. The Meta store stores information about the database name, table names, and column names only. It fetches the data related to the mentioned query.
 The Execution Engine (EE) in turn communicates with Hadoop daemons such as the Name node, Data nodes, and JobTracker to execute the query on top of the Hadoop file system.

8. Fetching results from the driver.

9. Sending results to the execution engine. Once the results are fetched from the Data nodes, the EE sends the results back to the driver and to the UI (front end).

Hive is continuously in contact with the Hadoop file system and its daemons via the execution engine. The dotted arrow in the job flow diagram shows the execution engine's communication with the Hadoop daemons.

Different modes of Hive


Hive can operate in two modes depending on the size of the data and the number of data nodes in Hadoop.

These modes are,

 Local mode
 Map reduce mode

When to use Local mode:

 If Hadoop is installed in pseudo-distributed mode with a single data node, we use Hive in this mode.
 If the data size is small and limited to a single local machine, we can use this mode.
 Processing will be very fast on smaller data sets present in the local machine.

When to use Map reduce mode:

 If Hadoop has multiple data nodes and the data is distributed across different nodes, we use Hive in this mode.
 It performs well on large amounts of data, and queries execute in parallel.
 Processing of large data sets with better performance can be achieved through this mode.
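
Which mode Hive uses can also be influenced explicitly from the session. The following is a sketch of commonly used settings; the exact property names can vary between Hadoop and Hive versions:

hive> SET hive.exec.mode.local.auto=true;
hive> SET mapreduce.framework.name=local;
hive> SET mapreduce.framework.name=yarn;

The first setting lets Hive pick local mode automatically for small inputs; the second forces jobs to run locally (older installations use mapred.job.tracker=local instead); the third sends jobs to the cluster in MapReduce mode.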
Below are the key features of Hive that differentiate it from an RDBMS.

 Hive resembles a traditional database by supporting an SQL interface, but it is not a full database. Hive is better described as a data warehouse than as a database.

 Hive enforces the schema at read time, whereas an RDBMS enforces the schema at write time.

In an RDBMS, a table’s schema is enforced at data load time. If the data being loaded doesn’t conform to the schema, then it is rejected. This design is called schema on write.

Hive, by contrast, doesn’t verify the data when it is loaded, but rather when it is retrieved. This is called schema on read.

Schema on read makes for a very fast initial load, since the data does not have to be read,
parsed, and serialized to disk in the database’s internal format. The load operation is just a file
copy or move.

Schema on write makes query time performance faster, since the database can index columns
and perform compression on the data but it takes longer to load data into the database.
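
A small sketch of the difference, using the demo.employee table created later in this document (the bad_records.csv file name is hypothetical):

hive> load data local inpath '/home/codegyani/hive/bad_records.csv' into table demo.employee;
hive> select * from demo.employee;

The load succeeds even if the file does not match the table's schema; only the select exposes the problem, with fields that cannot be parsed as int or float returned as NULL.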

 Hive is based on the notion of write once, read many times, whereas an RDBMS is designed for reading and writing many times.

 In an RDBMS, record-level updates, insertions and deletes, transactions, and indexes are possible, whereas these are not allowed in Hive, because Hive was built to operate over HDFS data using MapReduce, where full-table scans are the norm and a table update is achieved by transforming the data into a new table.

 In an RDBMS, the maximum data size allowed is in the tens of terabytes, whereas Hive can handle hundreds of petabytes very easily.

 As Hadoop is a batch-oriented system, Hive doesn’t support OLTP (Online Transaction Processing). It is closer to OLAP (Online Analytical Processing), but not ideal, since there is significant latency between issuing a query and receiving a reply, due to the overhead of MapReduce jobs and the size of the data sets Hadoop was designed to serve.

 An RDBMS is best suited for dynamic data analysis where fast responses are expected, whereas Hive is suited for data warehouse applications, where relatively static data is analyzed, fast response times are not required, and the data is not changing rapidly.

 To overcome these limitations of Hive, HBase is integrated with Hive to support record-level operations.
 Hive is very easily scalable at low cost, but an RDBMS is not as scalable, and scaling it up is very costly.

Hive vs. Traditional database

 Hive: schema on READ – Hive does not verify the schema while the data is being loaded. Traditional database: schema on WRITE – the table schema is enforced at data load time, i.e. if the data being loaded does not conform to the schema, it is rejected.
 Hive: very easily scalable at low cost. Traditional database: not as scalable, and costly to scale up.
 Hive: based on the Hadoop notion of write once, read many times. Traditional database: data can be read and written many times.
 Hive: record-level updates are not possible. Traditional database: record-level updates, insertions and deletes, transactions, and indexes are possible.
 Hive: OLTP (On-line Transaction Processing) is not yet supported, but OLAP (On-line Analytical Processing) is supported. Traditional database: both OLTP and OLAP are supported in an RDBMS.

Data Units
In the order of granularity - Hive data is organized into:

 Databases: Namespaces function to avoid naming conflicts for tables, views, partitions,
columns, and so on. Databases can also be used to enforce security for a user or group of
users.
 Tables: Homogeneous units of data which have the same schema. An example of a table could be a page_views table, where each row could comprise the following columns (schema):
o timestamp—which is of INT type and corresponds to a UNIX timestamp of when the page was viewed.
o userid—which is of BIGINT type and identifies the user who viewed the page.
o page_url—which is of STRING type and captures the location of the page.
o referer_url—which is of STRING type and captures the location of the page from where the user arrived at the current page.
o IP—which is of STRING type and captures the IP address from which the page request was made.
 Partitions: Each table can have one or more partition keys which determine how the data is stored. Partitions—apart from being storage units—also allow the user to efficiently identify the rows that satisfy specified criteria; for example, a date_partition of type STRING and a country_partition of type STRING. Each unique value of the partition keys defines a partition of the table. For example, all "US" data from "2009-12-23" is a partition of the page_views table. Therefore, if you run analysis on only the "US" data for 2009-12-23, you can run that query only on the relevant partition of the table, thereby speeding up the analysis significantly. Note, however, that just because a partition is named 2009-12-23 does not mean that it contains all or only data from that date; partitions are named after dates for convenience; it is the user's job to guarantee the relationship between partition name and data content! Partition columns are virtual columns; they are not part of the data itself but are derived on load.
 Buckets (or Clusters): Data in each partition may in turn be divided into buckets based on the value of a hash function of some column of the table. For example, the page_views table may be bucketed by userid, which is one of the columns, other than the partition columns, of the page_views table. Buckets can be used to efficiently sample the data. A sketch of a table definition using partitions and buckets follows this list.
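
Putting the above together, a sketch of a DDL statement for such a page_views table might look like the following. The column and partition names follow the example above, the number of buckets and the delimiter are illustrative assumptions, and view_time is used instead of timestamp because timestamp is a reserved word in recent Hive versions:

hive> create table page_views (view_time int, userid bigint, page_url string, referer_url string, ip string)
partitioned by (dt string, country string)
clustered by (userid) into 32 buckets
row format delimited fields terminated by ',';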

Hive - Create Database

In Hive, the database is considered as a catalog or namespace of tables. So, we can maintain multiple tables within a database, where a unique name is assigned to each table. Hive also provides a default database named default.

 Initially, we check the default database provided by Hive. So, to check the list of existing
databases, follow the below command: -

hive> show databases;

Here, we can see the existence of a default database provided by Hive.

 Let's create a new database by using the following command: -

hive> create database demo;

So, a new database is created.


 Let's check the existence of a newly created database.

hive> show databases;

 Each database must have a unique name. If we create two databases with the same name, the following error is generated: -

 If we want to suppress the error generated by Hive when creating a database with the same name, follow the below command: -

hive> create database if not exists demo;

 Hive also allows assigning properties to the database in the form of key-value pairs.

1. hive> create database demo
2.     > WITH DBPROPERTIES ('creator' = 'Gaurav Chawla', 'date' = '2019-06-03');
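
The attached properties can later be inspected with the following command; without the extended keyword, the DBPROPERTIES are not shown:

hive> describe database extended demo;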
JDBC Program

The JDBC program to create a database is given below.

import java.sql.SQLException;
import java.sql.Connection;
import java.sql.Statement;
import java.sql.DriverManager;

public class HiveCreateDb {

   // HiveServer1 JDBC driver; for HiveServer2, use "org.apache.hive.jdbc.HiveDriver"
   // together with a "jdbc:hive2://..." connection URL.
   private static String driverName =
       "org.apache.hadoop.hive.jdbc.HiveDriver";

   public static void main(String[] args) throws SQLException, ClassNotFoundException {

      // Register the driver and create a driver instance
      Class.forName(driverName);

      // Get a connection to the default database (empty user name and password)
      Connection con =
          DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
      Statement stmt = con.createStatement();

      // CREATE DATABASE is a DDL statement, so use execute() rather than executeQuery()
      stmt.execute("CREATE DATABASE userdb");

      System.out.println("Database userdb created successfully.");

      con.close();
   }
}

Hive - Drop Database

In this section, we will see various ways to drop the existing database.

 Let's check the list of existing databases by using the following command: -

1. hive> show databases;

 Now, drop the database by using the following command.


hive> drop database demo;

 In Hive, it is not allowed to directly drop a database that contains tables. In such a case, we can drop the database either by dropping its tables first or by using the CASCADE keyword with the command.
 Let's see the cascade command used to drop the database: -

hive> drop database if exists demo cascade;
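
For comparison, the following is a sketch of the alternative route of dropping the tables first; the employee table name is only an illustration of what the demo database might contain:

hive> use demo;
hive> drop table if exists employee;
hive> drop database if exists demo;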

Hive - Create Table

In Hive, we can create a table using conventions similar to SQL. It supports a wide range of flexibility in where the data files for tables are stored. It provides two types of tables: -

 Internal table
 External table
Internal Table
The internal tables are also called managed tables, as the lifecycle of their data is controlled by Hive. By default, these tables are stored in a subdirectory under the directory defined by hive.metastore.warehouse.dir (i.e. /user/hive/warehouse). The internal tables are not flexible enough to be shared with other tools like Pig. If we try to drop an internal table, Hive deletes both the table schema and the data.

 Let's create an internal table by using the following command:-

1. hive> create table demo.employee (Id int, Name string , Salary float)
2. row format delimited
3. fields terminated by ',' ;

 Let's see the metadata of the created table by using the following command:-

hive> describe demo.employee;

Let's see the result when we try to create the existing table again. In such a case, an exception occurs. If we want to ignore this type of exception, we can use the if not exists clause while creating the table.

1. hive> create table if not exists demo.employee (Id int, Name string , Salary float)
2. row format delimited
3. fields terminated by ',' ;

 While creating a table, we can add the comments to the columns and can also define the
table properties.

hive> create table demo.new_employee (Id int comment 'Employee Id', Name string comment 'Employee Name', Salary float comment 'Employee Salary')

comment 'Table Description'

TBLProperties ('creator'='Gaurav Chawla', 'created_at' = '2019-06-06 11:00:00');

 Let's see the metadata of the created table by using the following command: -

hive> describe new_employee;


 Hive allows creating a new table by using the schema of an existing table.

hive> create table if not exists demo.copy_employee

like demo.employee;

Here, we can say that the new table is a copy of an existing table.

External Table
The external table allows us to create and access a table and its data externally. The external keyword is used to specify an external table, whereas the location keyword is used to determine the location of the loaded data.

As the table is external, the data is not present in the Hive warehouse directory. Therefore, if we try to drop the table, the metadata of the table will be deleted, but the data still exists.

To create an external table, follow the below steps: -

 Let's create a directory on HDFS by using the following command: -


hdfs dfs -mkdir /HiveDirectory

 Now, store the file on the created directory.

hdfs dfs -put hive/emp_details /HiveDirectory

 Let's create an external table using the following command: -

hive> create external table emplist (Id int, Name string , Salary float)

row format delimited

fields terminated by ','

location '/HiveDirectory';

 Now, we can use the following command to retrieve the data: -

1. select * from emplist;
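
A short sketch of the behaviour described above: dropping the external table removes only its metadata, so the files under /HiveDirectory survive and can be reattached by recreating the table with the same location.

hive> drop table emplist;
hive> create external table emplist (Id int, Name string, Salary float)
row format delimited
fields terminated by ','
location '/HiveDirectory';
hive> select * from emplist;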

Hive - Load Data

Once the internal table has been created, the next step is to load the data into it. So, in Hive, we can easily load data from a file into a table.

 Let's load the data from the file into the table by using the following command: -
1. load data local inpath '/home/codegyani/hive/emp_details' into table demo.employee;

Here, emp_details is the file name that contains the data.

 Now, we can use the following command to retrieve the data from the database.

1. select * from demo.employee;
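
As a further sketch, the local keyword can be dropped to load a file that is already on HDFS (the path below reuses the /HiveDirectory example from the previous section), and the overwrite keyword replaces the existing contents of the table instead of appending to it:

hive> load data inpath '/HiveDirectory/emp_details' into table demo.employee;
hive> load data local inpath '/home/codegyani/hive/emp_details' overwrite into table demo.employee;

Note that without local, Hive moves (rather than copies) the HDFS file into the table's warehouse directory, whereas a local load copies the file from the local file system.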

For further queries, use this link:

https://www.javatpoint.com/hive-load-data
