Mtech Scheme
(AUTONOMOUS)
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING Set No.2
1a) Define big data and explain how it differs from traditional data sets; discuss the convergence of key trends that have led to the rise of big data. (8M)
Answer:
Big data refers to extremely large and complex datasets that cannot be easily
managed, processed, or analyzed using traditional data processing tools or methods.
The term is often characterized by the three Vs: Volume (the sheer size of the data),
Velocity (the speed at which data is generated and processed), and Variety (the
diverse types of data, including structured, semi-structured, and unstructured).
Here's a breakdown of the differences between big data and traditional data sets:
Volume:
Big Data: Involves datasets that are too large to be processed and analyzed by
traditional database systems. These datasets can range from terabytes to petabytes and
beyond.
Traditional Data: Typically involves smaller datasets that can be easily handled by
conventional database systems.
Velocity:
Big Data: Refers to the speed at which data is generated, collected, and processed. This is crucial for real-time analytics and decision-making.
Traditional Data: Usually involves data that is generated and processed at a slower pace
compared to big data environments.
Variety:
Big Data: Encompasses a wide range of data types, including structured data (e.g., relational databases), semi-structured data (e.g., XML, JSON), and unstructured data (e.g., text, images, videos).
Traditional Data: Primarily deals with structured data that fits neatly into tables and
relational databases.
Data Lakes: Data lakes are a new type of architecture that is revolutionizing how businesses store and analyze data. Organizations used to store their data in relational databases; the issue with this type of storage is that it is too structured to hold a variety of data types such as images, audio files, and video files. A data lake can store data in its native format and process any variety of it, regardless of size limits.
Increased Data Generation: The proliferation of digital technologies, social
media, sensors, and IoT devices has led to an exponential increase in the
generation of data.
Advancements in Storage and Processing Technologies: Improved storage
solutions, such as distributed file systems and cloud storage, along with parallel
processing frameworks like Apache Hadoop and Apache Spark, have made it
feasible to store and process massive amounts of data.
Cost Reduction in Storage: The decreasing costs of storage have made it more
economical to store and retain large volumes of data for extended periods.
Open Source Technologies: The development and widespread adoption of open-source
technologies, such as Hadoop, Spark, and NoSQL databases, have provided scalable and
cost-effective solutions for big data processing.
Advanced Analytics and Machine Learning: The increasing demand for advanced
analytics and machine learning applications has driven the need for large and diverse
datasets to train and improve models.
Real-time Processing Requirements: The need for real-time or near-real-time data processing has become crucial in various industries, leading to the development of technologies that can handle streaming data.
In summary, the convergence of these trends has given rise to the era of big data, where
organizations can leverage massive and diverse datasets to gain valuable insights, make
informed decisions, and uncover patterns that were previously challenging to discover
with traditional data processing methods.
1b) Describe the role of unstructured data in big data analytics. Provide an example of how unstructured data is used in one industry. (7M)
ANS:
Unstructured data plays a crucial role in big data analytics, contributing valuable insights
that complement the structured data traditionally used in analytics processes.
Unstructured data refers to information that doesn't fit neatly into a traditional
relational database or structured format. This type of data includes text, images, videos,
social media posts, emails, and other content that doesn't follow a predefined data
model.
Any form of data that has no proper structure or an unknown form is unstructured
data. This type of data is challenging to derive valuable insights from because of the raw
nature of the data.
Because unstructured data lacks any specific form or structure, it is very difficult and time-consuming to process and analyze.
Email is an example of unstructured data. Structured and unstructured are two important types of big data.
Unstructured data does not adhere to any definite schema or set of rules; its arrangement is unplanned and haphazard. Photos, videos, text documents, and log files can generally be considered unstructured data. Even though the metadata accompanying an image or a video may be semi-structured, the actual data being dealt with is unstructured.
For example, in the retail industry, companies apply sentiment analysis to unstructured social media posts and customer reviews to gauge how products are received and to guide marketing and product decisions.
2a) What do you mean by linear and non-linear data structures? Specify whether sets come under linear or non-linear structures, and explain the various types of sets supported by Java. (7M)
ANS:
Linear and Non-linear Data Structures:
Linear and non-linear data structures refer to the ways in which data elements are
organized and connected within a data structure.
Linear Data Structures: Elements are arranged in a linear or sequential order.
Each element has a unique predecessor and successor, except for the first and last
elements.
Examples include arrays, linked lists, stacks, and queues.
Non-linear Data Structures:
Elements are not arranged in a sequential order.
Each element can have multiple predecessors and successors, forming a hierarchical or
interconnected structure.
Examples include trees and graphs.
Java Sets:
In the context of Java programming and big data analytics, sets are a type of collection
that represents an unordered collection of unique elements. Java provides several set
implementations in the java.util package. These sets are not specifically categorized as
linear or non-linear structures in the context of big data analytics, as they are more
related to the organization of data within a programming language. However, sets can be
used in various big data scenarios for managing unique elements efficiently.
Here are some common types of sets supported by Java:
HashSet:
Unordered collection of unique elements.
Implements the Set interface.
Uses a hash table for storage, providing constant-time complexity for basic operations
(add, remove, contains).
LinkedHashSet:
Similar to HashSet but maintains the order of insertion.
Implements the Set interface and also extends the HashSet class.
TreeSet:
Implements the Set interface and the NavigableSet interface.
Maintains elements in sorted order (natural order or according to a specified comparator).
Uses a Red-Black tree for storage, providing O(log n) time complexity for basic operations.
EnumSet:
Specialized implementation for sets where elements are enum constants.
Highly efficient and compact representation of enum values.
BitSet:
Represents a set of bits or flags.
Used for efficient manipulation of sets of flags or boolean values.
Unlike the classes above, BitSet does not implement the Set interface; it is a separate bit-vector class in java.util.
Usage in Big Data Analytics:
In big data analytics, sets can be used to handle unique identifiers, filter out duplicate
records, and manage distinct values efficiently.
The choice of a specific set implementation depends on the requirements of the
analytics process, such as the need for ordering, fast membership checks, or memory
efficiency.
While sets themselves may not be directly tied to the linear or non-linear distinction, the
algorithms and data structures used in big data analytics (e.g., graphs, trees) may exhibit
characteristics of linear or non-linear organization depending on the nature of the data
and the analytical tasks at hand.
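The behavioural differences between the main set implementations above can be seen in a short sketch (the sample values are arbitrary):

```java
import java.util.*;

public class SetDemo {
    public static void main(String[] args) {
        // HashSet: unordered, duplicates silently dropped
        Set<String> ids = new HashSet<>(Arrays.asList("u1", "u2", "u1"));
        System.out.println(ids.size()); // 2 -- useful for de-duplicating records

        // LinkedHashSet: unique elements, insertion order preserved
        Set<String> ordered = new LinkedHashSet<>(Arrays.asList("b", "a", "c", "a"));
        System.out.println(ordered); // [b, a, c]

        // TreeSet: unique elements kept in sorted order (Red-Black tree)
        Set<String> sorted = new TreeSet<>(Arrays.asList("b", "a", "c"));
        System.out.println(sorted); // [a, b, c]
    }
}
```

HashSet is the usual default; LinkedHashSet or TreeSet are chosen when iteration order or sorted traversal matters.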
2b) What are the advantages of object serialization in Java? Explain serializing and de-serializing an object with suitable examples. (8M)
ANS:
Object serialization in Java continues to offer advantages similar to those in general
programming. However, its application in big data scenarios often revolves around
distributed computing environments, data storage, and data interchange between
different systems. Here are the advantages and an example of serializing and
deserializing objects in the context of big data analytics:
Advantages of Object Serialization in Big Data Analytics:
Distributed Data Processing:
Big data analytics often involve distributed computing environments where data is
processed across multiple nodes. Serialization facilitates the efficient transmission of
objects between nodes, enabling seamless communication and collaboration.
Data Storage:Serialized objects can be stored in various data storage solutions, including
distributed file systems like Hadoop Distributed File System (HDFS) or cloud storage.
This allows for the preservation of complex data structures and their states.
Interoperability:
In big data analytics, systems may use different programming languages or frameworks.
Object serialization provides a standardized format for data representation, promoting
interoperability between different technologies.
Efficient Data Transfer:
Serialized objects can be transmitted over networks more efficiently than raw,
unstructured data. This efficiency is crucial in scenarios where large volumes of data
need to be transferred between nodes in a distributed environment.
State Preservation:
Serialization enables the preservation of object states, which is valuable in applications
where the state of an object is critical for analysis. This is particularly relevant when
dealing with complex data structures and machine learning models.
Example: Serializing and Deserializing Objects in a Big Data Context:
Consider a scenario where you have a complex data structure representing a machine
learning model, and you want to distribute this model across a cluster for parallel
processing
import java.io.*;

public class MLModel implements Serializable {
    private String modelName;
    private double[] coefficients;

    public MLModel(String modelName, double[] coefficients) {
        this.modelName = modelName;
        this.coefficients = coefficients;
    }

    public void train(double[][] trainingData, double[] labels) {
        // Training logic here
    }

    public double predict(double[] input) {
        // Prediction logic here
        return 0.0;
    }

    public static void main(String[] args) {
        // Serialize the machine learning model
        serializeMLModel();
        // Deserialize the machine learning model
        MLModel deserializedModel = deserializeMLModel();
        System.out.println("Deserialized Model: " + deserializedModel.modelName);
    }

    private static void serializeMLModel() {
        try (ObjectOutputStream oos = new ObjectOutputStream(
                new FileOutputStream("mlmodel.ser"))) {
            double[] initialCoefficients = {0.5, -0.3, 0.8};
            MLModel model = new MLModel("LinearRegression", initialCoefficients);
            oos.writeObject(model);
            System.out.println("Machine Learning Model serialized successfully.");
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    private static MLModel deserializeMLModel() {
        try (ObjectInputStream ois = new ObjectInputStream(
                new FileInputStream("mlmodel.ser"))) {
            return (MLModel) ois.readObject();
        } catch (IOException | ClassNotFoundException e) {
            e.printStackTrace();
            return null;
        }
    }
}
In this example:
The MLModel class represents a simple machine learning model with serialization
capabilities.
The serializeMLModel method serializes an instance of the MLModel class and writes it
to
a file.
The deserializeMLModel method reads the serialized model from the file and returns a
new instance.
In big data analytics, this kind of serialization allows for the distribution of machine
learning models across a cluster of nodes, enabling parallel processing and efficient
utilization of computational resources. It also facilitates the interchange of models
between
different components of a big data ecosystem.
3a) Discuss the architecture and data model of Cassandra. How does it differ from other NoSQL databases? (8M)
ANS:
Cassandra Architecture:
Cassandra is a distributed NoSQL database designed for scalability, high availability, and
fault tolerance. Its architecture is decentralized and follows a peer-to-peer model,
allowing for linear scalability by adding more nodes to the cluster. Key components of
the Cassandra architecture include:
Node:
Nodes in a Cassandra cluster are distributed across multiple machines.
Each node is responsible for a portion of the data, and all nodes are equal.
Data Distribution:
Data is distributed across nodes using a partitioning strategy (e.g., random, Murmur3).
Each node is responsible for a range of data, and a consistent hashing mechanism helps
route requests to the appropriate node.
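The consistent-hashing routing described above can be sketched with a toy token ring. The node names, token values, and the small integer token space below are illustrative assumptions, not Cassandra's actual implementation:

```java
import java.util.*;

public class TokenRing {
    // Sorted map of token -> node; a node owns the range ending at its token
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    public void addNode(int token, String node) {
        ring.put(token, node);
    }

    // Route a key hash to the first node whose token is >= the hash,
    // wrapping around to the first node when no such token exists
    public String route(int keyHash) {
        Map.Entry<Integer, String> owner = ring.ceilingEntry(keyHash);
        return (owner != null ? owner : ring.firstEntry()).getValue();
    }

    public static void main(String[] args) {
        TokenRing ring = new TokenRing();
        ring.addNode(25, "node-A");
        ring.addNode(50, "node-B");
        ring.addNode(75, "node-C");
        System.out.println(ring.route(10)); // node-A
        System.out.println(ring.route(60)); // node-C
        System.out.println(ring.route(80)); // wraps around -> node-A
    }
}
```

The key property this models: adding or removing one node only remaps the keys in that node's token range, which is what makes linear scaling by adding nodes practical.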
Gossip Protocol:
Nodes communicate with each other using a gossip protocol to share information about the cluster's state. Gossip ensures that every node eventually learns about changes in the cluster, supporting decentralized coordination.
Replication:
Cassandra replicates data across multiple nodes to ensure fault tolerance and high
availability.
Replication factor and strategy are configurable, allowing users to define how many
copies of data are stored and on which nodes.
Data Model:
Cassandra's data model is based on column-family, resembling a tabular structure.
Data is organized into keyspaces (similar to databases in traditional systems), which
contain column families (analogous to tables).
Commit Log and Memtable:
Write operations are first recorded in a commit log for durability.
Data is then written to an in-memory structure called a memtable, providing fast write
performance.
SSTables and Compaction:
Data from memtables is periodically flushed to on-disk storage as SSTables (Sorted String Tables). Compaction processes merge SSTables and discard obsolete data to maintain efficiency.
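A toy model of this write path (commit log, then memtable, then flush to an SSTable) might look as follows; the flush threshold and class names here are invented for illustration and do not correspond to Cassandra's real classes:

```java
import java.util.*;

public class WritePath {
    private final List<String> commitLog = new ArrayList<>();         // durability log
    private final TreeMap<String, String> memtable = new TreeMap<>(); // sorted in-memory table
    private final List<SortedMap<String, String>> sstables = new ArrayList<>();
    private static final int FLUSH_THRESHOLD = 3; // flush after 3 writes (toy value)

    public void write(String key, String value) {
        commitLog.add(key + "=" + value);  // 1. append to the commit log first
        memtable.put(key, value);          // 2. apply the write to the memtable
        if (memtable.size() >= FLUSH_THRESHOLD) flush();
    }

    private void flush() {
        // 3. persist the sorted memtable as an immutable "SSTable" and clear it
        sstables.add(new TreeMap<>(memtable));
        memtable.clear();
    }

    public int sstableCount() { return sstables.size(); }

    public static void main(String[] args) {
        WritePath wp = new WritePath();
        wp.write("a", "1");
        wp.write("b", "2");
        wp.write("c", "3"); // third write triggers a flush
        System.out.println(wp.sstableCount()); // 1
    }
}
```

Because writes are append-only (commit log) plus an in-memory insert, this path avoids random disk I/O, which is why Cassandra is write-optimized.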
Read Path:
Cassandra supports fast read operations by using an index to locate data efficiently.
Read requests can be served from memory, SSTables, or a combination of both.
Cassandra Data Model:
Keyspace:
The outermost container for data in Cassandra.
Defines the replication strategy and contains column families.
Column Families (Tables):
A column family is a container for rows, similar to a table in relational databases.
Rows are identified by a primary key, and each row can have different columns.
Wide Rows:
Rows in Cassandra can be wide, allowing for dynamic column addition without altering the table schema. This is useful for scenarios with evolving data structures.
CQL (Cassandra Query Language):
CQL is a SQL-like language for interacting with Cassandra databases.
Provides a familiar syntax for querying and manipulating data.
Differences from Other NoSQL Databases in Big Data Analytics:
Cassandra's architecture and data model differentiate it from other NoSQL databases,
especially in the context of big data analytics:
Decentralized and Peer-to-Peer:
Cassandra's decentralized architecture is distinctive, and its peer-to-peer model
contributes to its scalability and fault-tolerance, making it suitable for large-scale big
data scenarios.
Linear Scalability:
Cassandra's linear scalability allows it to efficiently handle growing workloads by adding
more nodes to the cluster. This is crucial in big data analytics where the volume of data
can be massive.
Tunable Consistency:
Cassandra provides tunable consistency, enabling users to balance between consistency
and availability based on specific analytics requirements.
Write-Optimized:
Cassandra is well-suited for write-intensive workloads, making it favorable for scenarios
involving continuous data ingestion and real-time analytics.
Wide-Column Model:
The wide-column data model allows for flexible schema design, accommodating the
evolving nature of big data analytics requirements.
In big data analytics, where distributed computing, scalability, and fault tolerance are
critical, Cassandra's architecture and data model make it a suitable choice for
applications that demand high write throughput, fast query performance, and
continuous availability.
It is often used in conjunction with other big data technologies to build robust and scalable data processing pipelines.
3b) Describe the process of creating and managing tables in Cassandra, including an example of table creation and data manipulation. (7M)
Creating and managing tables in Cassandra involves defining keyspaces, specifying
replication strategies, creating column families (tables), and performing data
manipulation using the Cassandra Query Language (CQL). Below is a step-by-step guide
along with an example of table creation and data manipulation in Cassandra.
1. Connect to Cassandra:
Use a CQL shell (cqlsh) or connect programmatically to a Cassandra cluster.
2. Create a Keyspace:
A keyspace is the top-level container for tables in Cassandra. It defines the replication
strategy and other configuration options.
CREATE KEYSPACE IF NOT EXISTS mykeyspace
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
3. Use the Keyspace:
Switch to the keyspace you created.
USE mykeyspace;
4. Create a Table:
Define the structure of your table, including columns, primary key, and any other
relevant options.
CREATE TABLE IF NOT EXISTS users (
user_id UUID PRIMARY KEY,
username TEXT,
email TEXT,
age INT
);
5. Insert Data into the Table:
Add records to your table.
INSERT INTO users (user_id, username, email, age) VALUES (
uuid(), 'john_doe', '[email protected]', 25
);
INSERT INTO users (user_id, username, email, age) VALUES (
uuid(), 'jane_smith', '[email protected]', 30
);
6. Query Data:
Retrieve data from the table.
SELECT * FROM users;
7. Update Data
Modify existing records.
UPDATE users SET age = 26 WHERE user_id = <UUID>;
8. Delete Data:
Remove records from the table.
DELETE FROM users WHERE user_id = <UUID>;
Example: Table Creation and Data Manipulation in Big Data Analytics
Suppose you are managing user data in a big data analytics application. You can create a
table to store user
information and manipulate the data as follows:
-- Step 2: Create a Keyspace
CREATE KEYSPACE IF NOT EXISTS bigdata
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
-- Step 3: Use the Keyspace
USE bigdata;
-- Step 4: Create a Table
CREATE TABLE IF NOT EXISTS user_data (
user_id UUID PRIMARY KEY,
username TEXT,
email TEXT,
age INT
);
-- Step 5: Insert Data into the Table
INSERT INTO user_data (user_id, username, email, age) VALUES (
uuid(), 'john_doe', '[email protected]', 25
);
INSERT INTO user_data (user_id, username, email, age) VALUES (
uuid(), 'jane_smith', '[email protected]', 30
);
-- Step 6: Query Data
SELECT * FROM user_data;
-- Step 7: Update Data
UPDATE user_data SET age = 26 WHERE user_id = <UUID>;
-- Step 8: Delete Data
DELETE FROM user_data WHERE user_id = <UUID>;
In this example:
We create a keyspace named 'bigdata'.
We define a table named 'user_data' with columns for user ID, username, email, and
age.
We insert two records into the table.
We query the data, update the age of a user, and then delete a user's record.
This example demonstrates the basic steps of creating and managing tables in Cassandra
using CQL. In a big data analytics context, such tables can store and process vast
amounts of user data efficiently, and the operations can be scaled horizontally as the
data grows.
4a) Explain the Hadoop Distributed File System architecture with a neat sketch. (8M)
ANS:
Architecture of Hadoop Distributed File System (HDFS):
Hadoop File System was developed using distributed file system design. It runs on commodity hardware. Unlike other distributed systems, HDFS is highly fault-tolerant and designed using low-cost hardware.
HDFS holds very large amounts of data and provides easy access. To store such huge data, the files are stored across multiple machines. These files are stored in a redundant fashion to rescue the system from possible data loss in case of failure. HDFS also makes applications available for parallel processing.
Features of HDFS
• It is suitable for distributed storage and processing.
• Hadoop provides a command interface to interact with HDFS.
• The built-in servers of NameNode and DataNode help users to easily check the status of the cluster.
• Streaming access to file system data.
• HDFS provides file permissions and authentication.
HDFS Architecture
Given below is the architecture of a Hadoop File System.
DataNode in Hadoop Distributed File System (HDFS):
In Hadoop Distributed File System (HDFS), a DataNode is a component that runs on each
individual machine (node) in the Hadoop cluster. Its primary responsibility is to store and
manage the actual data blocks that make up the files stored in HDFS. The DataNode is
responsible for performing read and write operations, as well as managing the
replication of data blocks across the cluster for fault tolerance.
Key responsibilities of a DataNode include:
Storage: Storing and managing data blocks on the local file system of the node it resides
on.
Replication: Replicating data blocks to other DataNodes to ensure fault tolerance and
data availability. The default replication factor is three, meaning each block is stored on
three different DataNodes.
Heartbeat and Block Report: Periodically sending heartbeat signals to the NameNode to
confirm its availability. Additionally, sending block reports to provide information about
the list of blocks it is currently storing.
NameNode's Handling of DataNode Failure:
Since the NameNode manages the metadata and namespace of the HDFS, it needs to be
aware of the health and status of each DataNode in the cluster. When a DataNode fails
or becomes unreachable, the NameNode employs several mechanisms to handle the
situation:
Heartbeat Monitoring:
The NameNode expects regular heartbeat signals from each DataNode.
If the NameNode does not receive a heartbeat within a specified time period, it marks
the DataNode as dead or unreliable.
Block Report:DataNodes periodically send block reports to the NameNode, listing all the
blocks they are currently storing.
The NameNode uses this information to track the health and status of each block and
identify missing or corrupt blocks.
Replication Factor Maintenance:
If a DataNode fails or is marked as unreliable, the NameNode takes corrective actions to
maintain the desired replication factor for each block.
It schedules the replication of the missing or under-replicated blocks to other healthy
DataNodes.
Decommissioning:
If a DataNode is identified as consistently failing or unreliable, it may be decommissioned
by the administrator.
Decommissioning involves removing the node from the active set of DataNodes,
preventing it from receiving new blocks. Existing blocks are replicated to other nodes.
Block Re-replication:
The NameNode continuously monitors the replication factor of each block.
If the replication factor falls below the desired value due to DataNode failures, the
NameNode triggers re-replication to ensure fault tolerance.
By actively monitoring heartbeats, block reports, and taking corrective actions, the
NameNode ensures the health and availability of the DataNodes in the HDFS cluster. This
proactive approach helps maintain fault tolerance and data reliability in the face of
individual DataNode failures in a distributed environment.
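The heartbeat-timeout behaviour described above can be modelled with a small sketch. The 10-second timeout and node names below are illustrative assumptions; HDFS's real defaults and internal classes differ:

```java
import java.util.*;

public class HeartbeatMonitor {
    private static final long TIMEOUT_MS = 10_000; // hypothetical timeout
    private final Map<String, Long> lastHeartbeat = new HashMap<>();

    // A DataNode reports in; record the time of its latest heartbeat
    public void heartbeat(String dataNode, long nowMs) {
        lastHeartbeat.put(dataNode, nowMs);
    }

    // Nodes whose last heartbeat is older than the timeout are considered dead
    public List<String> deadNodes(long nowMs) {
        List<String> dead = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastHeartbeat.entrySet()) {
            if (nowMs - e.getValue() > TIMEOUT_MS) dead.add(e.getKey());
        }
        Collections.sort(dead); // deterministic order for display
        return dead;
    }

    public static void main(String[] args) {
        HeartbeatMonitor m = new HeartbeatMonitor();
        m.heartbeat("dn1", 0);
        m.heartbeat("dn2", 8_000);
        System.out.println(m.deadNodes(15_000)); // [dn1] -- dn1 missed the window
    }
}
```

In the real NameNode, a node landing in the "dead" list is what triggers re-replication of the blocks it held, restoring the desired replication factor.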
5a) What is the Hive Metastore? Which classes are used by Hive to read and write HDFS files? (7M)
ANS:
Hive Metastore in Big Data Analytics:
In big data analytics, the Hive Metastore is a critical component that facilitates metadata
management for large-scale data processing using Apache Hive. It allows users to define
and manage tables, databases, and associated metadata in a relational database,
enabling efficient querying and processing of data stored in Hadoop Distributed File
System (HDFS) or other storage systems. The separation of metadata from data storage
is essential for scalability and better integration with various data processing tools and
frameworks.
The Hive Metastore is responsible for managing metadata related to Hive tables,
including their schemas, partitions, and storage location. It stores this metadata in a
relational database and allows Hive to decouple the storage of metadata from the data
itself, facilitating metadata management and enabling better integration with other
tools.
Key functions of the Hive Metastore include:
Storing metadata about Hive tables, databases, partitions, and column statistics.
Managing the schema and structure of Hive tables.
Facilitating access to metadata for query planning and execution.
Enabling compatibility with various storage formats and systems.
Classes Used by Hive to Read and Write HDFS Files:
Hive uses several classes to interact with HDFS for reading and writing data. Here are
some key classes involved in these processes:
InputFormat and OutputFormat Classes:
org.apache.hadoop.mapred.InputFormat and org.apache.hadoop.mapred.OutputFormat
classes define how data is read from and written to HDFS in the MapReduce framework.
Examples include TextInputFormat for plain text, SequenceFileInputFormat for Hadoop
sequence files, and corresponding output formats.
SerDe (Serializer/Deserializer) Classes:
SerDe classes define how Hive tables' data is serialized for storage and deserialized for
processing during queries.
Examples include org.apache.hadoop.hive.serde2.avro.AvroSerDe for Avro format and
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe for simple text-based serialization.
StorageHandler Interface:
The org.apache.hadoop.hive.ql.metadata.StorageHandler interface is implemented by
custom storage handlers, defining how data is stored and retrieved from a particular
storage system.
It includes methods for initializing tables, obtaining input and output formats, and
working with SerDe.
HiveStorageHandler Classes:
Hive includes built-in storage handlers for various storage formats, such as
org.apache.hadoop.hive.ql.io.orc.OrcStorageHandler for ORC (Optimized Row Columnar)
format and org.apache.hadoop.hive.hbase.HBaseStorageHandler for Apache HBase.
HiveFileFormatUtils:
The org.apache.hadoop.hive.ql.io.HiveFileFormatUtils class provides utility methods for
determining the default file format based on table properties.
It is used to infer the file format when the format is not explicitly specified.
HiveMetaStoreClient:
The org.apache.hadoop.hive.metastore.HiveMetaStoreClient class is used to interact
with the Hive Metastore service programmatically.
It provides methods for managing metadata, including creating and altering tables,
databases, and partitions.
In big data analytics workflows, these classes play a crucial role in managing the
interaction between Hive and HDFS. They handle the intricacies of reading and writing
data in different formats, enabling efficient data processing, querying, and analysis
across large-scale distributed datasets. The Hive Metastore ensures that metadata about
tables and databases is well-managed, providing a foundation for the integration of Hive
with other big data tools and analytics frameworks.
5b) Explain the following: a) Logical Joins b) Window Functions (8M)
Answer:
Logical Joins:
Hive Join:
HiveQL Select Joins Query – Types of Join in Hive
1. Apache Hive Join – Objective
In Apache Hive, for combining specific fields from two tables by using values common to each one, we use Hive Join – the HiveQL Select Joins query. However, we need to know the syntax of Hive Join for implementation purposes. So, in this article, "Hive Join – HiveQL Select Joins Query and its types", we will cover the syntax of joins in Hive. Also, we will learn an example of Hive Join to understand it well. Moreover, there are several types of Hive join – HiveQL Select Joins: Hive inner join, Hive left outer join, Hive right outer join, and Hive full outer join. We will also learn Hive Join tables in depth.
2. Apache Hive Join – HiveQL Select Joins Query
Basically, for combining specific fields from two tables by using values common to each one, we use the Hive JOIN clause. In other words, to combine records from two or more tables in the database, we use the JOIN clause. However, it is more or less similar to SQL JOIN. Also, we use it to combine rows from multiple tables.
Moreover, there are some points we need to observe about Hive Join:
• In joins, only equality joins are allowed.
• However, more than two tables can be joined in the same query.
• Basically, LEFT, RIGHT, and FULL OUTER joins exist in order to offer more control over rows for which the ON clause has no match.
• Also, note that Hive joins are not commutative.
• Whether they are LEFT or RIGHT joins, joins in Hive are left-associative.
1. Apache Hive Join Syntax
Following is the syntax of Hive Join – HiveQL Select Clause:
join_table:
    table_reference JOIN table_factor [join_condition]
  | table_reference {LEFT|RIGHT|FULL} [OUTER] JOIN table_reference join_condition
  | table_reference LEFT SEMI JOIN table_reference join_condition
  | table_reference CROSS JOIN table_reference [join_condition]
2. Example of Join in Hive
Example of Hive Join – HiveQL Select Clause.
However, to understand it well, let's suppose the following table named CUSTOMERS.
Table.1 – Hive Join Example
ID Name Age Address Salary
1 Ross 32 Ahmedabad 2000
2 Rachel 25 Delhi 1500
3 Chandler 23 Kota 2000
4 Monika 25 Mumbai 6500
5 Mike 27 Bhopal 8500
6 Phoebe 22 MP 4500
7 Joey 24 Indore 10000
Also, suppose another table ORDERS as follows:
Table.2 – Hive Join Example
OID Date Customer_ID Amount
102 2016-10-08 00:00:00 3 3000
100 2016-10-08 00:00:00 3 1500
101 2016-11-20 00:00:00 2 1560
103 2015-05-20 00:00:00 4 2060
3. Types of Joins in Hive
Basically, there are 4 types of Hive join:
1. Inner join in Hive
2. Left Outer Join in Hive
3. Right Outer Join in Hive
4. Full Outer Join in Hive
So, let’s discuss each Hive join in detail.
1. Inner Join
Basically, to combine and retrieve records from multiple tables we use the Hive JOIN clause; a plain JOIN in Hive behaves as an inner join, just as in SQL. Moreover, the JOIN condition is to be raised by using the primary keys and foreign keys of the tables.
Furthermore, the below query executes a JOIN on the CUSTOMERS and ORDERS tables, and then retrieves the records:
hive> SELECT c.ID, c.NAME, c.AGE, o.AMOUNT
FROM CUSTOMERS c JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);
Moreover, we get to see the following response on the successful execution of the query:
Table.3 – Inner Join in Hive
ID Name Age Amount
3 Chandler 23 3000
3 Chandler 23 1500
2 Rachel 25 1560
4 Monika 25 2060
2. Left Outer Join
On defining HiveQL Left Outer Join: even if there are no matches in the right table, it returns all the rows from the left table. To be more specific, even if the ON clause matches 0 (zero) records in the right table, this Hive JOIN still returns a row in the result, although with NULL in each column from the right table.
In addition, it returns all the values from the left table, plus the matched values from the right table, or NULL in case of no matching JOIN predicate.
However, the below query shows a LEFT OUTER JOIN between the CUSTOMERS and ORDERS tables:
hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
FROM CUSTOMERS c
LEFT OUTER JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);
Moreover, we get to see the following response on the successful execution of the HiveQL Select query:
Table.4 – Left Outer Join in Hive
ID Name Amount Date
1 Ross NULL NULL
2 Rachel 1560 2016-11-20 00:00:00
3 Chandler 3000 2016-10-08 00:00:00
3 Chandler 1500 2016-10-08 00:00:00
4 Monika 2060 2015-05-20 00:00:00
5 Mike NULL NULL
6 Phoebe NULL NULL
7 Joey NULL NULL
3. Right Outer Join
Basically, even if there are no matches in the left table, HiveQL Right Outer Join returns all the rows from the right table. To be more specific, even if the ON clause matches 0 (zero) records in the left table, this Hive JOIN still returns a row in the result, although with NULL in each column from the left table.
In addition, it returns all the values from the right table, plus the matched values from the left table, or NULL in case of no matching join predicate.
However, the below query shows a RIGHT OUTER JOIN between the CUSTOMERS and ORDERS tables:
hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
FROM CUSTOMERS c
RIGHT OUTER JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);
Moreover, we get to see the following response on the successful execution of the HiveQL Select query:
Table.5 – Right Outer Join in Hive
ID Name Amount Date
3 Chandler 3000 2016-10-08 00:00:00
3 Chandler 1500 2016-10-08 00:00:00
2 Rachel 1560 2016-11-20 00:00:00
4 Monika 2060 2015-05-20 00:00:00
4. Full Outer Join
The major purpose of this HiveQL Full Outer Join is that it combines the records of both the left and the right tables which fulfil the Hive JOIN condition. Moreover, this joined table contains either all the records from both the tables, or fills in NULL values for missing matches on either side.
However, the below query shows a FULL OUTER JOIN between the CUSTOMERS and ORDERS tables:
hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
FROM CUSTOMERS c
FULL OUTER JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);
Moreover, we get to see the following response on the successful execution of the query:
Table.6 – Full Outer Join in Hive
ID Name Amount Date
1 Ross NULL NULL
2 Rachel 1560 2016-11-20 00:00:00
3 Chandler 3000 2016-10-08 00:00:00
3 Chandler 1500 2016-10-08 00:00:00
4 Monika 2060 2015-05-20 00:00:00
5 Mike NULL NULL
6 Phoebe NULL NULL
7 Joey NULL NULL
This was all about HiveQL Select – the Apache Hive Join tutorial.
4. Conclusion
As a result, we have seen what Apache Hive Join is and the possible types of Join
in Hive (HiveQL Select).
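The join semantics described above can be sketched outside Hive. The following is a minimal Python illustration, assuming the sample CUSTOMERS and ORDERS rows from the tables above; it mimics what Hive computes, it is not Hive itself:

```python
# Sketch of Hive join semantics on the sample data above (None stands in for NULL).
customers = {1: "Ross", 2: "Rachel", 3: "Chandler", 4: "Monika",
             5: "Mike", 6: "Phoebe", 7: "Joey"}
orders = [(3, 1300), (3, 1500), (2, 1560), (4, 2060)]  # (customer_id, amount)

def inner_join():
    # Only rows where the ON condition matches survive.
    return [(cid, customers[cid], amt) for cid, amt in orders if cid in customers]

def left_outer_join():
    # Every customer appears; unmatched customers get None for the right-table columns.
    rows = []
    for cid, name in sorted(customers.items()):
        matches = [amt for ocid, amt in orders if ocid == cid]
        if matches:
            rows.extend((cid, name, amt) for amt in matches)
        else:
            rows.append((cid, name, None))
    return rows

print(inner_join())       # 4 matched rows
print(left_outer_join())  # 8 rows: 4 matched + 4 with None
```

This reproduces the shape of Tables 3 and 4: the inner join keeps only the four matched order rows, while the left outer join also emits Ross, Mike, Phoebe, and Joey with NULL amounts.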
Window functions:
Windowing Functions in Hive
Windowing allows you to create a window over a set of data and compute
aggregations over the rows surrounding each row. Windowing was introduced in
Hive 0.11. Below, we demonstrate the windowing functions available in
Hive.
Windowing in Hive includes the following functions
• Lead
The number of rows to lead can optionally be specified. If the number of rows to
lead is not specified, the lead is one row.
Returns null when the lead for the current row extends beyond the end of the
window
• Lag
The number of rows to lag can optionally be specified. If the number of rows to lag
is not specified, the lag is one row.
Returns null when the lag for the current row extends before the beginning of the
window.
• FIRST_VALUE
• LAST_VALUE
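The lead/lag behaviour above (default offset of 1, null beyond the window edge) can be sketched in plain Python; the closing prices here are made-up sample values, not the stocks dataset:

```python
# Illustrative lag/lead over one ordered partition; None stands in for Hive's NULL.
def lag(values, offset=1):
    # Value from `offset` rows earlier, or None before the window start.
    return [values[i - offset] if i - offset >= 0 else None
            for i in range(len(values))]

def lead(values, offset=1):
    # Value from `offset` rows later, or None past the window end.
    return [values[i + offset] if i + offset < len(values) else None
            for i in range(len(values))]

closes = [10.0, 10.5, 10.2, 10.8]  # closing prices ordered by date
print(lag(closes))   # [None, 10.0, 10.5, 10.2]
print(lead(closes))  # [10.5, 10.2, 10.8, None]
```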
The OVER clause
● OVER with standard aggregates:
COUNT
SUM
MIN
MAX
AVG
• OVER with a PARTITION BY statement with one or more partitioning columns.
• OVER with PARTITION BY and ORDER BY with one or more partitioning and/or
ordering columns.
Analytics functions
RANK
ROW_NUMBER
DENSE_RANK
CUME_DIST
PERCENT_RANK
NTILE
To give you a brief idea of these windowing functions in Hive, we will be using
stock market data. You can download the sample stocks data and load it into your
stocks table.
Now we will create a table to load this stock market data.
Let us dive deeper into the window functions in Hive.
Lag
This function returns the values of the previous row. You can specify an integer
offset which designates the row position; otherwise it takes the default offset of 1.
Here is a sample query for lag. Using lag, we can display the previous day's closing
price of the ticker. Lag is used with the over function; inside the over function you
can use partition by or order by clauses.
select ticker,date_,close,lag(close,1) over(partition by ticker) as yesterday_price from
acadgild.stocks
In the output, you can see the closing price of the stock for the day alongside the
previous day's price.
Lead
This function returns the values from the following rows. You can specify an integer
offset which designates the row position; otherwise it takes the default offset of 1.
Here is a sample query for lead. Using the lead function, we can find whether the
following day's closing price is higher or lower than today's:
select ticker,date_,close,case(lead(close,1) over(partition by ticker)-close)>0 when true
then "higher" when false then "lesser" end as Changes from acadgild.stocks
FIRST_VALUE
It returns the value of the first row from that window. With the below query, you can
see the first-row high price of the ticker for all the days.
select ticker,first_value(high) over(partition by ticker) as first_high from acadgild.stocks
LAST_VALUE
It is the reverse of FIRST_VALUE: it returns the value of the last row from that
window. With the below query, you can see the last-row high price value of the ticker
for all the days.
select ticker,last_value(high) over(partition by ticker) as last_high from acadgild.stocks
Let us now see the usage of the aggregate functions with OVER.
Count
It returns the count of all the values for the expression written in the over clause.
With the below query, we can find the number of rows present for each ticker.
select ticker,count(ticker) over(partition by ticker) as cnt from acadgild.stocks
For each partition, the count of the ticker is calculated.
Sum
It returns the sum of all the values for the expression written in the over clause.
With the below query, we can find the sum of all the closing stock prices for each
particular ticker.
select ticker,sum(close) over(partition by ticker) as total from acadgild.stocks
For each ticker, the sum of all the closing prices is calculated.
Suppose you want the running total of volume_for_the_day across all the days
for every ticker; you can get it with the below query.
select ticker,date_,volume_for_the_day,sum(volume_for_the_day) over(partition by
ticker order by date_) as running_total from acadgild.stocks
Finding the percentage of each row value
Now let's take a scenario where you need to find the percentage of
volume_for_the_day against the total volume for that particular ticker; that can be
done as follows.
select
ticker,date_,volume_for_the_day,(volume_for_the_day*100/(sum(volume_for_the_day)
over(partition by ticker))) from acadgild.stocks
In the output, you can see the percentage contribution of the volume for the day,
based on the total volume for that ticker.
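The running-total and percentage-of-total computations above can be mimicked in plain Python over a single ticker partition; the volume figures are made-up sample values:

```python
from itertools import accumulate

# volume_for_the_day ordered by date for one ticker partition (assumed sample values)
volumes = [100, 300, 200, 400]

# sum(...) over (partition by ticker order by date_) -> running total
running_total = list(accumulate(volumes))

# volume_for_the_day * 100 / sum(...) over (partition by ticker) -> percentage share
total = sum(volumes)
pct = [v * 100 / total for v in volumes]

print(running_total)  # [100, 400, 600, 1000]
print(pct)            # [10.0, 30.0, 20.0, 40.0]
```

Note the difference the ORDER BY makes: with it, the window grows row by row (running total); without it, the window is the whole partition (grand total used for the percentage).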
Min
It returns the minimum value of the column for the rows in that over clause. With
the below query, we can find the minimum closing stock price for each particular
ticker.
select ticker, min(close) over(partition by ticker) as minimum from acadgild.stocks
Max
It returns the maximum value of the column for the rows in that over clause. With
the below query, we can find the maximum closing stock price for each particular
ticker.
select ticker, max(close) over(partition by ticker) as maximum from acadgild.stocks
AVG
It returns the average value of the column for the rows that the over clause returns.
With the below query, we can find the average closing stock price for each
particular ticker.
select ticker, avg(close) over(partition by ticker) as average from acadgild.stocks
Now let us work on some analytic functions.
Rank
The rank function returns the rank of the values as per the result set of the over
clause. If two values are the same, it gives the same rank to both, and the
subsequent rank is skipped for the next value.
The below query will rank the closing prices of the stock for each ticker.
select ticker,close,rank() over(partition by ticker order by close) as closing from
acadgild.stocks
Row_number
Row number returns a continuous sequence of numbers for all the rows of
the result set of the over clause.
With the below query, you get the ticker, the closing price, and its row number
within each ticker.
select ticker,close,row_number() over(partition by ticker order by close) as num from
acadgild.stocks
Dense_rank
It is the same as the rank() function, with one difference: if duplicate values are
present, the rank is not skipped for the subsequent rows. Each unique value gets
a rank in sequence.
The below query will rank the closing prices of the stock for each ticker.
select ticker,close,dense_rank() over(partition by ticker order by close) as closing from
acadgild.stocks
Cume_dist
It returns the cumulative distribution of a value, ranging from 0 to 1. For example,
if the total number of records is 10, then for the 1st row the cume_dist is 1/10,
for the second 2/10, and so on up to 10/10.
The cume_dist is calculated with respect to the result set returned by the
over clause. The below query gives the cumulative distribution of each record for
every ticker.
select ticker,cume_dist() over(partition by ticker order by close) as cummulative from
acadgild.stocks
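The rank family of functions can be sketched in Python over one ordered partition; the closing prices below are made-up values chosen to include a tie, so the rank/dense_rank difference is visible:

```python
# Sketch of rank(), dense_rank(), row_number(), cume_dist() over one ordered partition.
def ranks(values):
    ordered = sorted(values)
    n = len(ordered)
    # rank: 1 + number of strictly smaller values (ties share, next rank skipped)
    rank = [1 + sum(1 for w in ordered if w < v) for v in ordered]
    # dense_rank: 1 + number of distinct smaller values (no gaps after ties)
    dense = [1 + len({w for w in ordered if w < v}) for v in ordered]
    # row_number: continuous sequence regardless of ties
    row_number = list(range(1, n + 1))
    # cume_dist: fraction of rows <= current value
    cume_dist = [sum(1 for w in ordered if w <= v) / n for v in ordered]
    return rank, dense, row_number, cume_dist

r, d, rn, cd = ranks([30, 10, 20, 20])   # sorted: 10, 20, 20, 30
print(r)   # [1, 2, 2, 4]  rank 3 is skipped after the tie
print(d)   # [1, 2, 2, 3]  no gap after the tie
print(cd)  # [0.25, 0.75, 0.75, 1.0]
```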
Percent_rank
It returns the percentage rank of each row within the result
set of the over clause. Percent_rank is calculated from
the rank of the row as (rank - 1) / (total_rows_in_group - 1). If the result set has
only one row, then the percent_rank is 0.
The below query calculates the percent_rank for every row
in each partition.
select ticker,close,percent_rank() over(partition by ticker order by close) as closing from
acadgild.stocks
Ntile
It returns the bucket number of the particular value. For
example, Ntile(5) creates 5 buckets based
on the result set of the over clause, and then places the
first 20% of the records in the 1st bucket, and so on up to the 5th bucket.
The below query creates 5 buckets for every ticker, with the first 20% of records
for every ticker in the 1st bucket, and so on.
select ticker,ntile(5) over(partition by ticker order by close) as bucket from acadgild.stocks
In the output, 5 buckets are created for every ticker: the lowest 20% of closing
prices fall in the first bucket, the next 20% in the second bucket, and so on up to
the 5th bucket for all the tickers.
This is how we can perform windowing operations in Hive.
6a)
6a) Explain how Hive facilitates big data analytics; discuss its data 8M
types, file formats and HiveQL. What is Hive?
ANS:
Before understanding the Hive data types, we will first study
Hive itself. Hive is a data warehousing technique for Hadoop.
Hadoop is the data storage and processing segment of the big data platform. Hive
holds its position as the SQL-style data processing technique of Hadoop: like other
SQL environments, Hive can be reached through SQL-like queries. The major
offerings of Hive are data analysis, ad-hoc querying, and summarizing the stored
data; from a latency perspective, the queries can take a considerable amount of
time.
Data types are classified into two types:
• Primitive Data Types
• Collection Data Types
1. Primitive Data Types
Primitive data types are the basic, built-in types. The important primitive
datatypes are listed below:
Type Size (byte) Example
TinyInt 1 20
SmallInt 2 20
Int 4 20
Bigint 8 20
Float 4 10.2222
Double 8 10.2222
Boolean boolean true/false FALSE
String sequence of characters ABCD
Timestamp integer/float/string 2/3/2012 12:34:56:1
Date integer/float/string 2/3/2019
Hive data types are implemented using Java.
Ex: the Java int is used for implementing the Int data type here.
• Character arrays are not supported in Hive.
• Hive relies on delimiters to separate its fields; Hive, in
coordination with Hadoop, thereby increases write
performance and read performance.
• Specifying the length of each column is not expected in
the Hive database.
• String literals can be expressed within either double
quotes (") or single quotes (').
• In newer versions of Hive, Varchar types are
introduced; they take a length specifier (between 1 and
65535), which acts as the largest length of value the
character string can accommodate. When a value
exceeding this length is inserted, the rightmost
characters of that value are truncated. Character
length is determined by the number of code points
contained by the character string.
• All integer literals (TINYINT, SMALLINT, BIGINT) are
basically considered as INT datatypes; only when the value
exceeds the actual int range does it get converted into a
BIGINT or another respective type.
• Decimal literals afford exact values and greater
precision for floating-point values when compared to the
DOUBLE type. Here numeric values are stored in their exact form; in the case of
double, they are not stored exactly as numeric values.
Date Value Casting Process
Date types can only be converted to/from Date, Timestamp,
or String types. Casting with user-specified formats is
documented separately.
SEQUENCEFILE
• The first alternative to the Hive default file format.
• Can be specified using the "STORED AS SEQUENCEFILE" clause
during table creation.
• Files are flat files consisting of binary key-value
pairs.
• At runtime, Hive queries are processed into MapReduce jobs,
during which records are assigned/generated with the
appropriate key-value pairs.
• It is a standard format supported by Hadoop itself, and thus
becomes native or acceptable while sharing files between
Hive and other Hadoop-related tools.
• It is less suitable for use with tools outside the Hadoop
ecosystem.
• When needed, sequence files can be compressed at the
block and record level, which is very useful for optimizing disk
space utilization and I/O, while still supporting the ability to
split files on blocks for parallel processing.
RCFILE
• RCFile = Record Columnar File.
• An efficient internal (binary) Hive format, natively
supported by Hive.
• Used when column-oriented organization is a good storage
option for certain types of data and applications.
• If data is stored by column instead of by row, then only the
data for the desired columns has to be read; this in turn
improves performance.
• Makes column compression very efficient, especially for
low-cardinality columns.
• Also, some column-oriented stores do not physically need to
store null columns.
• Helps store the columns of a table in a record columnar way:
• It first partitions rows horizontally into row splits, and then it
vertically partitions each row split in a columnar way.
• It first stores the metadata of a row split as the key part of a
record, and all the data of a row split as the value part.
• The rcfilecat tool displays the contents of RCFiles from the
Hive command line, since RCFiles cannot be read with
simple editors.
ORC
• ORC = Optimized Row Columnar.
• Designed to overcome limitations of other Hive file formats;
provides a highly efficient way to store Hive data.
• Stores data as groups of row data called stripes, along with
auxiliary information in a file footer.
• Holds compression parameters and the size of the compressed
footer in a postscript section at the end of the file.
AVRO
• One of the relatively newest Apache Hadoop-related projects.
• A language-neutral, preferred data serialization system.
• Handles multiple data formats that can be processed by
multiple languages.
• Relies on a schema.
• Uses JSON for defining data structure schemas, types,
and protocols.
• Stores data structure definitions along with the data, in
an easy-to-process form.
• It includes support for integers, numeric types, arrays,
maps, enums, variable- and fixed-length binary data, and
strings.
• It also defines a container file format intended to
provide good support for MapReduce and other
analytical frameworks.
• Data structures can specify a sort order.
• Faster sorting is possible without deserialization.
• Data created in one programming language can be
sorted by another.
• When data is read, the schema used for writing it is always
present and available, permitting record serialization to
be faster with minimal overhead per record.
• It serializes data in a compact binary format.
• It can provide both a serialization format for persistent
data, and a wire format for communication between
Hadoop nodes, and from client programs to the Hadoop
services.
• An additional advantage of storing the full data
structure definition with the data is that it permits the
data to be written faster and more compactly, without a
need to process metadata separately.
• Avro is a file format to store data in a predefined
format and can be used in any of Hadoop's tools
like Pig and Hive, and with programming languages like
Java, Python, and more.
• Avro lets one define Remote Procedure Call (RPC)
protocols. Although the data types used in RPC are usually
distinct from those in datasets, using a common
serialization system is still useful.
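As the bullets above note, Avro defines schemas in JSON. A small Python sketch of what such a record schema looks like (the `Stock` record and its field names are hypothetical, and the actual `avro` library is not used here; the schema is simply handled as JSON):

```python
import json

# A hypothetical Avro-style record schema written as JSON (field names are made up).
schema_json = """
{
  "type": "record",
  "name": "Stock",
  "fields": [
    {"name": "ticker", "type": "string"},
    {"name": "close",  "type": "double"},
    {"name": "volume", "type": ["null", "long"], "default": null}
  ]
}
"""
schema = json.loads(schema_json)
# The union type ["null", "long"] is how Avro schemas express a nullable field.
print(schema["name"], [f["name"] for f in schema["fields"]])
```

Because the schema travels with the data, any reader (in any language) can parse records without separate metadata, which is the point made in the bullets above.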
3. HiveQL (Hive Query Language) in Big Data Analytics:
HiveQL is a SQL-like language used to interact with Hive for
defining schema, querying data, and managing metadata.
Key features of HiveQL in big data analytics include:
Data Definition Language (DDL):
Allows users to define and manage tables, databases, and
views.
Example: CREATE TABLE, ALTER TABLE, CREATE
DATABASE.
Data Manipulation Language (DML):
Enables users to perform operations like inserting, updating,
and deleting data.
Example: INSERT INTO, UPDATE, DELETE.
Query Language:
Supports querying large datasets using SQL-like syntax.
Example: SELECT, JOIN, GROUP BY, ORDER BY.
UDFs (User-Defined Functions):
Users can define their custom functions to extend the
functionality of HiveQL.
Example: CREATE FUNCTION.
Built-in Functions:
Hive provides a rich set of built-in functions for common data
manipulations and transformations.
Example: SUM, AVG, MAX, MIN, etc.
HiveQL abstracts the complexities of distributed data processing
and allows users to express complex analytical queries using
familiar SQL constructs. This is particularly advantageous in
big data analytics, where many data analysts and business
intelligence professionals are already well-versed in SQL.
In summary, Hive plays a significant role in big data analytics
by providing a SQL-like interface to interact with large
datasets stored in HDFS. Its support for various data types,
file formats, and a familiar query language makes it an
essential tool for analysts and data scientists working on big
data platforms.
6b)
6b) How can we install Apache Hive on the system? Explain. 7M
ANS:
Step 1: Verifying JAVA Installation
Java must be installed on your system before installing Hive. Let
us verify java installation using the following command:
$ java -version
If Java is already installed on your system, you get to see the
following response:
java version "1.7.0_71"
Java(TM) SE Runtime Environment (build 1.7.0_71-b13)
Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)
If java is not installed in your system, then follow the steps given
below for installing java.
Installing Java
Step I:
Download java (JDK <latest version> - X64.tar.gz) by visiting the
following
link http://www.oracle.com/technetwork/java/javase/downloads/j
dk7-downloads-1880260.html.
Then jdk-7u71-linux-x64.tar.gz will be downloaded onto your
system.
Step II:
Generally you will find the downloaded java file in the Downloads
folder. Verify it and extract the jdk-7u71-linux-x64.gz file using
the following commands.
$ cd Downloads/
$ ls
jdk-7u71-linux-x64.gz
$ tar zxf jdk-7u71-linux-x64.gz
$ ls
jdk1.7.0_71 jdk-7u71-linux-x64.gz
Step III:
To make java available to all the users, you have to move it to
the location “/usr/local/”. Open root, and type the following
commands.
$ su
password:
# mv jdk1.7.0_71 /usr/local/
# exit
Step IV:
For setting up PATH and JAVA_HOME variables, add the following
commands to ~/.bashrc file.
export JAVA_HOME=/usr/local/jdk1.7.0_71
export PATH=$PATH:$JAVA_HOME/bin
Now apply all the changes into the current running system.
$ source ~/.bashrc
Step V:
Use the following commands to configure java alternatives:
# alternatives --install /usr/bin/java java /usr/local/java/bin/java 2
# alternatives --install /usr/bin/javac javac /usr/local/java/bin/javac 2
# alternatives --install /usr/bin/jar jar /usr/local/java/bin/jar 2
# alternatives --set java /usr/local/java/bin/java
# alternatives --set javac /usr/local/java/bin/javac
# alternatives --set jar /usr/local/java/bin/jar
Now verify the installation using the command java -version from
the terminal as explained above.
Step 2: Verifying Hadoop Installation
Hadoop must be installed on your system before installing Hive.
Let us verify the Hadoop installation using the following
command:
$ hadoop version
If Hadoop is already installed on your system, then you will get
the following response:
Hadoop 2.4.1 Subversion
https://svn.apache.org/repos/asf/hadoop/common -r
1529768
Compiled by hortonmu on 2013-10-07T06:28Z
Compiled with protoc 2.5.0
From source with checksum
79e53ce7994d1628b240f09af91e1af4
If Hadoop is not installed on your system, then proceed with the
following steps:
Downloading Hadoop
Download and extract Hadoop 2.4.1 from Apache Software
Foundation using the following commands.
$ su
password:
# cd /usr/local
# wget http://apache.claz.org/hadoop/common/hadoop-2.4.1/hadoop-2.4.1.tar.gz
# tar xzf hadoop-2.4.1.tar.gz
# mkdir hadoop
# mv hadoop-2.4.1/* hadoop/
# exit
# exit
Installing Hadoop in Pseudo Distributed Mode
The following steps are used to install Hadoop 2.4.1 in pseudo
distributed mode.
Step I: Setting up Hadoop
You can set Hadoop environment variables by appending the
following commands to ~/.bashrc file.
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export
HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export
PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
Now apply all the changes into the current running system.
$ source ~/.bashrc
Step II: Hadoop Configuration
You can find all the Hadoop configuration files in the location
“$HADOOP_HOME/etc/hadoop”. You need to make suitable
changes in those configuration files according to your Hadoop
infrastructure.
$ cd $HADOOP_HOME/etc/hadoop
In order to develop Hadoop programs using java, you have to
reset the java environment variables in hadoop-env.sh file by
replacing JAVA_HOME value with the location of java in your
system.
export JAVA_HOME=/usr/local/jdk1.7.0_71
Given below are the list of files that you have to edit to configure
Hadoop.
core-site.xml
The core-site.xml file contains information such as the port number
used for Hadoop instance, memory allocated for the file system,
memory limit for storing the data, and the size of Read/Write
buffers.
Open the core-site.xml and add the following properties in
between the <configuration> and </configuration> tags.
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
hdfs-site.xml
The hdfs-site.xml file contains information such as the value of
replication data, the namenode path, and the datanode path of
your local file systems. It means the place where you want to
store the Hadoop infra.
Let us assume the following data.
dfs.replication (data replication value) = 1
(In the following path /hadoop/ is the user name.
hadoopinfra/hdfs/namenode is the directory created by
hdfs file system.)
namenode path = //home/hadoop/hadoopinfra/hdfs/namenode
(hadoopinfra/hdfs/datanode is the directory created by
hdfs file system.)
datanode path = //home/hadoop/hadoopinfra/hdfs/datanode
Open this file and add the following properties in between the
<configuration>, </configuration> tags in this file.
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>file:///home/hadoop/hadoopinfra/hdfs/namenode
</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>file:///home/hadoop/hadoopinfra/hdfs/datanode
</value >
</property>
</configuration>
Note: In the above file, all the property values are user-defined
and you can make changes according to your Hadoop
infrastructure.
yarn-site.xml
This file is used to configure YARN into Hadoop. Open the yarn-site.xml file and
add the following properties in between the
<configuration>, </configuration> tags in this file.
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
mapred-site.xml
This file is used to specify which MapReduce framework we are
using. By default, Hadoop contains a template of mapred-site.xml.
First of all, you need to copy the file from mapred-site.xml.template to
mapred-site.xml using the following
command.
$ cp mapred-site.xml.template mapred-site.xml
Open mapred-site.xml file and add the following properties in
between the <configuration>, </configuration> tags in this file.
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
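All of the configuration files above share the same `<configuration>`/`<property>` shape. As an aside, a small Python sketch that generates such a Hadoop-style configuration block (the property shown is the one this tutorial assumes for mapred-site.xml):

```python
import xml.etree.ElementTree as ET

def make_configuration(props):
    """Build a Hadoop-style <configuration> element from a dict of properties."""
    conf = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(conf, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return conf

conf = make_configuration({"mapreduce.framework.name": "yarn"})
xml_text = ET.tostring(conf, encoding="unicode")
print(xml_text)  # <configuration><property><name>...</name><value>yarn</value></property></configuration>
```

This is only an illustration of the file layout; in practice you edit the XML files under $HADOOP_HOME/etc/hadoop by hand, as the steps above describe.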
Verifying Hadoop Installation
The following steps are used to verify the Hadoop installation.
Step I: Name Node Setup
Set up the namenode using the command “hdfs namenode -
format” as follows.
$ cd ~
$ hdfs namenode -format
The expected result is as follows.
10/24/14 21:30:55 INFO namenode.NameNode: STARTUP_MSG:
/******************************************************
******
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = localhost/192.168.1.11
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 2.4.1
...
...
10/24/14 21:30:56 INFO common.Storage: Storage
directory
/home/hadoop/hadoopinfra/hdfs/namenode has been
successfully formatted.
10/24/14 21:30:56 INFO
namenode.NNStorageRetentionManager: Going to
retain 1 images with txid >= 0
10/24/14 21:30:56 INFO util.ExitUtil: Exiting with
status 0
10/24/14 21:30:56 INFO namenode.NameNode: SHUTDOWN_MSG:
/******************************************************
******
SHUTDOWN_MSG: Shutting down NameNode at
localhost/192.168.1.11
*******************************************************
*****/
Step II: Verifying Hadoop dfs
The following command is used to start dfs. Executing this
command will start your Hadoop file system.
$ start-dfs.sh
The expected output is as follows:
10/24/14 21:37:56
Starting namenodes on [localhost]
localhost: starting namenode, logging to
/home/hadoop/hadoop-2.4.1/logs/hadoop-hadoop-namenode-localhost.out
localhost: starting datanode, logging to
/home/hadoop/hadoop-2.4.1/logs/hadoop-hadoop-datanode-localhost.out
Starting secondary namenodes [0.0.0.0]
Step III: Verifying Yarn Script
The following command is used to start the yarn script. Executing
this command will start your yarn daemons.
$ start-yarn.sh
The expected output is as follows:
starting yarn daemons
starting resourcemanager, logging to
/home/hadoop/hadoop-2.4.1/logs/yarn-hadoop-resourcemanager-localhost.out
localhost: starting nodemanager, logging to
/home/hadoop/hadoop-2.4.1/logs/yarn-hadoop-nodemanager-localhost.out
Step IV: Accessing Hadoop on Browser
The default port number to access Hadoop is 50070. Use the
following url to get Hadoop services on your browser.
http://localhost:50070/
Step V: Verify all applications for cluster
The default port number to access all applications of cluster is
8088. Use the following url to visit this service.
http://localhost:8088/
Step 3: Downloading Hive
We use hive-0.14.0 in this tutorial. You can download it by
visiting the following link: http://apache.petsads.us/hive/hive-0.14.0/. Let us assume
it gets downloaded into the /Downloads
directory. Here, we download the Hive archive named "apache-hive-0.14.0-bin.tar.gz"
for this tutorial. The following command is used
to verify the download:
$ cd Downloads
$ ls
On successful download, you get to see the following response:
apache-hive-0.14.0-bin.tar.gz
Step 4: Installing Hive
The following steps are required for installing Hive on your
system. Let us assume the Hive archive is downloaded onto the
/Downloads directory.
Extracting and verifying Hive Archive
The following command is used to verify the download and
extract the hive archive:
$ tar zxvf apache-hive-0.14.0-bin.tar.gz
$ ls
On successful download, you get to see the following response:
apache-hive-0.14.0-bin apache-hive-0.14.0-bin.tar.gz
Copying files to the /usr/local/hive directory
We need to copy the files as the superuser "su -". The
following commands are used to copy the files from the extracted
directory to the /usr/local/hive directory.
$ su -
passwd:
# cd /home/user/Download
# mv apache-hive-0.14.0-bin /usr/local/hive
# exit
Setting up environment for Hive
You can set up the Hive environment by appending the following
lines to ~/.bashrc file:
export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin
export CLASSPATH=$CLASSPATH:/usr/local/Hadoop/lib/*:.
export CLASSPATH=$CLASSPATH:/usr/local/hive/lib/*:.
The following command is used to execute ~/.bashrc file.
$ source ~/.bashrc
Step 5: Configuring Hive
To configure Hive with Hadoop, you need to edit the hive-env.sh file, which is
placed in the $HIVE_HOME/conf directory. The
following commands redirect to the Hive config folder and copy the
template file:
$ cd $HIVE_HOME/conf
$ cp hive-env.sh.template hive-env.sh
Edit the hive-env.sh file by appending the following line:
export HADOOP_HOME=/usr/local/hadoop
Hive installation is completed successfully. Now you require an
external database server to configure Metastore. We use Apache
Derby database.
Step 6: Downloading and Installing Apache Derby
Follow the steps given below to download and install Apache
Derby:
Downloading Apache Derby
The following command is used to download Apache Derby. It
takes some time to download.
$ cd ~
$ wget http://archive.apache.org/dist/db/derby/db-derby-10.4.2.0/db-derby-10.4.2.0-bin.tar.gz
The following command is used to verify the download:
$ ls
On successful download, you get to see the following response:
db-derby-10.4.2.0-bin.tar.gz
Extracting and verifying Derby archive
The following commands are used for extracting and verifying the
Derby archive:
$ tar zxvf db-derby-10.4.2.0-bin.tar.gz
$ ls
On successful download, you get to see the following response:
db-derby-10.4.2.0-bin db-derby-10.4.2.0-bin.tar.gz
Copying files to /usr/local/derby directory
We need to copy from the super user “su -”. The following
commands are used to copy the files from the extracted directory
to the /usr/local/derby directory:
$ su -
passwd:
# cd /home/user
# mv db-derby-10.4.2.0-bin /usr/local/derby
# exit
Setting up environment for Derby
You can set up the Derby environment by appending the following
lines to ~/.bashrc file:
export DERBY_HOME=/usr/local/derby
export PATH=$PATH:$DERBY_HOME/bin
export
CLASSPATH=$CLASSPATH:$DERBY_HOME/lib/derby.jar:$DERBY_H
OME/lib/derbytools.jar
The following command is used to execute ~/.bashrc file:
$ source ~/.bashrc
Create a directory to store Metastore
Create a directory named data in $DERBY_HOME directory to
store Metastore data.
$ mkdir $DERBY_HOME/data
Derby installation and environmental setup is now complete.
Step 7: Configuring Metastore of Hive
Configuring Metastore means specifying to Hive where the
database is stored. You can do this by editing the hive-site.xml
file, which is in the $HIVE_HOME/conf directory. First of all, copy
the template file using the following command:
$ cd $HIVE_HOME/conf
$ cp hive-default.xml.template hive-site.xml
Edit hive-site.xml and append the following lines between the
<configuration> and </configuration> tags:
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:derby://localhost:1527/metastore_db;create=true</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>
Create a file named jpox.properties and add the following lines
into it:
javax.jdo.PersistenceManagerFactoryClass = org.jpox.PersistenceManagerFactoryImpl
org.jpox.autoCreateSchema = false
org.jpox.validateTables = false
org.jpox.validateColumns = false
org.jpox.validateConstraints = false
org.jpox.storeManagerType = rdbms
org.jpox.autoCreateSchema = true
org.jpox.autoStartMechanismMode = checked
org.jpox.transactionIsolation = read_committed
javax.jdo.option.DetachAllOnCommit = true
javax.jdo.option.NontransactionalRead = true
javax.jdo.option.ConnectionDriverName = org.apache.derby.jdbc.ClientDriver
javax.jdo.option.ConnectionURL = jdbc:derby://hadoop1:1527/metastore_db;create = true
javax.jdo.option.ConnectionUserName = APP
javax.jdo.option.ConnectionPassword = mine
Step 8: Verifying Hive Installation
Before running Hive, you need to create the /tmp folder and a separate Hive warehouse folder (here, /user/hive/warehouse) in HDFS, and set group write permission (chmod g+w) on these newly created folders, as shown below:
$ $HADOOP_HOME/bin/hadoop fs -mkdir /tmp
$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/hive/warehouse
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse
The following commands are used to verify Hive installation:
$ cd $HIVE_HOME
$ bin/hive
On successful installation of Hive, you get to see the following
response:
Logging initialized using configuration in jar:file:/home/hadoop/hive-0.9.0/lib/hive-common-0.9.0.jar!/hive-log4j.properties
Hive history file=/tmp/hadoop/hive_job_log_hadoop_201312121621_1494929084.txt
………………….
hive>
The following sample command is executed to display all the
tables:
hive> show tables;
OK
Time taken: 2.798 seconds
hive>
7a) List and explain the important features of Hadoop. 7M
ANS:
Hadoop is a powerful open-source framework designed for distributed storage and
processing of large datasets across clusters of commodity hardware. Here are some
important features of Hadoop:
Features of Hadoop Which Makes It Popular
Let’s discuss the key features which make Hadoop more reliable to use, an
industry favorite, and the most powerful Big Data tool.
1. Open Source:
Hadoop is open-source, which means it is free to use. Since it is an open-source project, the source code is available online for anyone to study or modify as per their industry requirements.
2. Highly Scalable Cluster:
Hadoop is a highly scalable model. A large amount of data is divided across multiple inexpensive machines in a cluster and processed in parallel. The number of these machines or nodes can be increased or decreased as per the enterprise's requirements. Traditional RDBMS (Relational Database Management System) setups cannot be scaled to handle such large amounts of data.
3. Fault Tolerance is Available:
Hadoop uses commodity hardware (inexpensive systems) which can crash at any moment. In Hadoop, data is replicated on various DataNodes in the cluster, which ensures the availability of data even if one of your systems crashes. If one machine faces a technical issue, the data can still be read from other nodes in the cluster because the data is replicated by default. By default, Hadoop makes 3 copies of each file block and stores them on different nodes. This replication factor is configurable and can be changed by setting the replication property in the hdfs-site.xml file.
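For example, a cluster administrator could lower the replication factor by setting the dfs.replication property in hdfs-site.xml (the value 2 below is purely illustrative; the property name is the standard HDFS one):

```xml
<property>
  <name>dfs.replication</name>
  <value>2</value>
  <description>Number of copies kept for each HDFS block (default is 3)</description>
</property>
```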
4. High Availability is Provided:
Fault tolerance provides high availability in the Hadoop cluster. High availability means the availability of data on the Hadoop cluster: due to fault tolerance, if any DataNode goes down, the same data can be retrieved from any other node where the data is replicated. A highly available Hadoop cluster also has two or more NameNodes, i.e., an Active NameNode and a Passive NameNode, also known as a standby NameNode. If the Active NameNode fails, the Passive NameNode takes over its responsibility and serves the same data to the user.
5. Cost-Effective:
Hadoop is open-source and uses cost-effective commodity hardware, which provides a cost-efficient model, unlike traditional relational databases that require expensive hardware and high-end processors to deal with Big Data. The problem with traditional relational databases is that storing a massive volume of data is not cost-effective, so companies started to discard raw data, which may leave them without a complete picture of their business. Hadoop thus provides two main cost benefits: it is open source (free to use), and it runs on inexpensive commodity hardware.
6. Hadoop Provides Flexibility:
Hadoop is designed so that it can deal with any kind of dataset, such as structured (MySQL data), semi-structured (XML, JSON), and unstructured (images and videos), very efficiently. This means it can easily process any kind of data independent of its structure, which makes it highly flexible. This is very useful for enterprises: they can process large datasets easily and analyze valuable insights from sources like social media, email, etc. With this flexibility, Hadoop can be used for log processing, data warehousing, fraud detection, and more.
7. Easy to Use:
Hadoop is easy to use, since developers need not worry about the distribution of the processing work, which is managed by Hadoop itself. The Hadoop ecosystem is also very large and comes with many tools like Hive, Pig, Spark, HBase, Mahout, etc.
8. Hadoop Uses Data Locality:
The concept of data locality is used to make Hadoop processing fast. With data locality, the computation logic is moved near the data rather than moving the data to the computation logic. Moving large volumes of data around HDFS is expensive, and the data locality concept minimizes the bandwidth utilization in the system.
9. Provides Faster Data Processing:
Hadoop uses a distributed file system to manage its storage, i.e., HDFS (Hadoop Distributed File System). In a DFS (Distributed File System), a large file is broken into small file blocks that are distributed among the nodes available in a Hadoop cluster. Because this massive number of file blocks is processed in parallel, Hadoop is faster and provides high-level performance compared to traditional database management systems.
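The split-and-process-in-parallel idea can be illustrated with a single-machine sketch in plain Python (this is an analogy only, not Hadoop code; all names here are made up for illustration):

```python
from concurrent.futures import ThreadPoolExecutor

def count_words(lines):
    """Count words in one block of lines."""
    return sum(len(line.split()) for line in lines)

def distributed_word_count(lines, block_size=2):
    """Split the input into fixed-size blocks of lines, process the
    blocks in parallel, and combine the partial results -- the same
    split/process/combine idea that HDFS and MapReduce apply at
    cluster scale."""
    blocks = [lines[i:i + block_size] for i in range(0, len(lines), block_size)]
    with ThreadPoolExecutor() as pool:
        return sum(pool.map(count_words, blocks))

lines = ["hello world", "hello hadoop", "big data", "fault tolerance"]
print(distributed_word_count(lines))  # 8
```

At cluster scale the blocks live on different DataNodes and the per-block work runs where the block is stored (data locality), but the combine step is conceptually the same.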
10. Support for Multiple Data Formats:
Hadoop supports multiple data formats like CSV, JSON, Avro, and more,
making it easier to work with different types of data sources. This makes it
more convenient for developers and data analysts to handle large volumes of
data with different formats.
11. High Processing Speed:
Hadoop’s distributed processing model allows it to process large amounts of
data at high speeds. This is achieved by distributing data across multiple
nodes and processing it in parallel. As a result, Hadoop can process data
much faster than traditional database systems.
12. Machine Learning Capabilities:
Hadoop offers machine learning capabilities through its ecosystem tools like
Mahout, which is a library for creating scalable machine learning
applications. With these tools, data analysts and developers can build
machine learning models to analyze and process large datasets.
13. Integration with Other Tools:
Hadoop integrates with other popular tools like Apache Spark, Apache Flink,
and Apache Storm, making it easier to build data processing pipelines. This
integration allows developers and data analysts to use their favorite tools
and frameworks for building data pipelines and processing large datasets.
14. Secure:
Hadoop provides built-in security features like authentication, authorization,
and encryption. These features help to protect data and ensure that only
authorized users have access to it. This makes Hadoop a more secure
platform for processing sensitive data.
15. Community Support:
Hadoop has a large community of users and developers who contribute to its
development and provide support to users. This means that users can
access a wealth of resources and support to help them get the most out of
Hadoop.
ANS:
Hive Client:
Hive drivers support applications written in many languages, such as Python, Java, C++, and Ruby, using JDBC, ODBC, and Thrift drivers to perform queries on Hive. Therefore, one may design a Hive client in any language of their choice.
The three types of Hive clients are:
1. Thrift Clients: The Hive server can handle requests from a thrift client
by using Apache Thrift.
2. JDBC client: A JDBC driver connects to Hive using the Thrift
framework. Hive Server communicates with the Java applications using
the JDBC driver.
3. ODBC client: The Hive ODBC driver is similar to the JDBC driver in that it also uses Thrift to connect to the Hive Server; the difference is that applications connect through the ODBC protocol rather than JDBC.
Hive Services:
Hive provides numerous services, including the Hive server2, Beeline,
etc. The services offered by Hive are:
1. Beeline: HiveServer2 supports Beeline, a command shell through which the user can submit commands and queries. It is a JDBC client based on SQLLine CLI (a pure-Java console utility for connecting to relational databases and executing SQL queries).
2. Hive Server 2: HiveServer2 is the successor to HiveServer1. It provides
clients with the ability to execute queries against the Hive. Multiple
clients may submit queries to Hive and obtain the desired results. Open
API clients such as JDBC and ODBC are supported by HiveServer2.
Note: HiveServer1, also known as a Thrift server, is used to communicate with Hive across platforms. Different client applications can submit requests to Hive and receive the results using this server. Because HiveServer1 could not handle concurrent requests from more than one client, it was replaced by HiveServer2.
Hive Driver: The Hive driver receives the HiveQL statements submitted by
the user through the command shell and creates session handles for the
query.
Hive Compiler: The Hive compiler parses the query and performs semantic analysis and type checking on the different query blocks and query expressions, using the metadata stored in the Metastore. It then generates an execution plan from the parse results.
The execution plan is a DAG (Directed Acyclic Graph) created by the compiler, in which each stage is a map/reduce job, an operation on file metadata, or a data manipulation step on HDFS.
Optimizer: The optimizer splits the execution plan before performing the
transformation operations so that efficiency and scalability are improved.
Execution Engine: After the compilation and optimization steps, the
execution engine uses Hadoop to execute the prepared execution plan, which
is dependent on the compiler’s execution plan.
Metastore: Metastore stores metadata information about tables and partitions, including column and column type information, which the compiler uses for semantic analysis and query planning.
The metastore also stores information about the serializer and deserializer as
well as HDFS files where data is stored and provides data storage. It is
usually a relational database. Hive metadata can be queried and modified
through Metastore.
We can configure the metastore in either of two modes:
1. Remote: In remote mode, the metastore runs in its own process as a Thrift service. This mode is useful for non-Java applications, which can connect to it over Thrift.
2. Embedded: In embedded mode, the client can directly access the metastore via JDBC.
HCatalog: HCatalog is a table and storage management layer for Hadoop that gives users of different data processing tools, such as Pig and MapReduce, simple access to read and write data on the grid.
It is built on top of the Hive metastore and exposes the metastore's tabular data to other data processing tools.
WebHCat: WebHCat is the REST API for HCatalog; it provides an HTTP interface to perform Hive metadata operations. It is also a service through which the user can run Hadoop MapReduce (or YARN), Pig, and Hive jobs.
Processing and Resource Management:
Hive uses the MapReduce framework as its default engine for executing queries.
MapReduce frameworks are used to write large-scale applications that process huge quantities of data in parallel on large clusters of commodity hardware. MapReduce jobs split the data into chunks, which are processed by map and reduce tasks.
Distributed Storage:Hive is based on Hadoop, which means that it uses the
Hadoop Distributed File System for distributed storage.
Working with Hive: We will now look at how Apache Hive processes a query.
1. The user interface calls the driver's execute function to perform a query.
2. The driver accepts the query, creates a session handle for it, and passes it to the compiler for generating the execution plan.
3. The compiler sends a metadata request to the metastore.
4. The metastore sends the metadata back to the compiler. The compiler uses this metadata for type checking and semantic analysis on the expressions in the query tree, and then generates the execution plan (a directed acyclic graph) of MapReduce jobs, which includes map operator trees (operators used by mappers) as well as reduce operator trees (operators used by reducers).
5. The compiler then transmits the generated execution plan to the driver.
6. After the compiler provides the execution plan to the driver, the driver
passes the implemented plan to the execution engine for execution.
7. The execution engine then passes these stages of the DAG to suitable components. For each table or intermediate output, the associated deserializer is used to read the rows from HDFS files, and these rows are passed through the operator tree. The output is serialized to an HDFS temporary file using the serializer. These temporary files are then used to provide data to the subsequent MapReduce stages of the plan. Finally, the temporary file is moved to the table's final location.
8. As part of a fetch call from the driver, the contents of the temporary files are read from HDFS, and the driver then sends the results to the Hive interface.
Different Modes of Hive:
Hive can operate in two modes, based on the number of data nodes in Hadoop:
1. Local mode
2. Map-reduce mode
When using Local mode:
1. We can run Hive in pseudo mode if Hadoop is installed in pseudo-distributed mode with one data node.
2. In this mode, the data is limited to what fits on a single machine.
3. Smaller data sets are processed rapidly on the local machine.
When using MapReduce mode:
1. In this type of setup there are multiple data nodes, and data is distributed across different nodes. We use Hive in this scenario.
2. It can handle large amounts of data as well as parallel queries, and execute them in a timely fashion.
3. This mode processes large data sets with better performance.
8b) Describe how Spark handles DataFrames and complex data types; include an example of working with JSON data in Spark. 7M
ANS:
Data Frames:
A DataFrame is a distributed collection of data, which is
organized into named columns. Conceptually, it is equivalent to
relational tables with good optimization techniques.
A DataFrame can be constructed from an array of different
sources such as Hive tables, Structured Data files, external
databases, or existing RDDs. This API was designed for
modern Big Data and data science applications taking
inspiration from DataFrame in R Programming and Pandas
in Python.
Features of DataFrame
Here is a set of few characteristic features of DataFrame −
• Ability to process data ranging in size from kilobytes to petabytes, on a single-node cluster or a large cluster.
• Supports different data formats (Avro, CSV, Elasticsearch, Cassandra) and storage systems (HDFS, Hive tables, MySQL, etc.).
• State-of-the-art optimization and code generation through the Spark SQL Catalyst optimizer (a tree transformation framework).
• Can be easily integrated with all Big Data tools and frameworks via Spark Core.
• Provides APIs for Python, Java, Scala, and R.
SQLContext
SQLContext is a class and is used for initializing the
functionalities of Spark SQL. SparkContext class object (sc) is
required for initializing SQLContext class object.
The following command is used for initializing the SparkContext through
spark-shell.
$ spark-shell
By default, the SparkContext object is initialized with the name
sc when the spark-shell starts.
Use the following command to create SQLContext.
scala> val sqlcontext = new org.apache.spark.sql.SQLContext(sc)
Example
Let us consider an example of employee records
in a JSON file named employee.json. Use the following
commands to create a DataFrame (df) and read a JSON
document named employee.json with the following content.
employee.json − Place this file in the directory where the
current scala> pointer is located.
DataFrame Operations
DataFrame provides a domain-specific language for structured
data manipulation. Here, we include some basic examples of
structured data processing using DataFrames.
Follow the steps given below to perform DataFrame operations −
Read the JSON Document
First, we have to read the JSON document. Based on this,
generate a DataFrame named (dfs).
{"id" : "1201", "name" : "satish", "age" : "25"}
{"id" : "1202", "name" : "krishna", "age" : "28"}
{"id" : "1203", "name" : "amith", "age" : "39"}
{"id" : "1204", "name" : "javed", "age" : "23"}
{"id" : "1205", "name" : "prudvi", "age" : "23"}
Use the following command to read the JSON document named
employee.json. The data is shown as a table with the fields −
id, name, and age.
scala> val dfs = sqlContext.read.json("employee.json")
Output − The field names are taken automatically from employee.json.
dfs: org.apache.spark.sql.DataFrame = [age: string, id: string,
name: string]
Show the Data
If you want to see the data in the DataFrame, then use the following
command.
scala> dfs.show()
Output − You can see the employee data in a tabular format.
<console>:22, took 0.052610 s
+----+------+------------------+
|age | id | name |
+----+------+------------------+
| 25 | 1201 | satish |
| 28 | 1202 | krishna|
| 39 | 1203 | amith |
| 23 | 1204 | javed |
| 23 | 1205 | prudvi |
+----+------+------------------+
Use printSchema Method
If you want to see the Structure (Schema) of the
DataFrame, then use the following command.
scala> dfs.printSchema()
Output
root
|-- age: string (nullable = true)
|-- id: string (nullable = true)
|-- name: string (nullable = true)
Use Select Method
Use the following command to fetch name-column among
three columns from the DataFrame.
scala> dfs.select("name").show()
Output − You can see the values of the name column.
<console>:22, took 0.044023 s
+ ------------+
| name |
+ ------------+
| satish |
| krishna|
| amith |
| javed |
| prudvi |
+ ------------+
Use Age Filter
Use the following command for finding the employees whose
age is greater than 23 (age > 23).
scala> dfs.filter(dfs("age") > 23).show()
Output
<console>:22, took 0.078670 s
+----+------+------------------+
|age | id | name |
+----+------+------------------+
| 25 | 1201 | satish |
| 28 | 1202 | krishna|
| 39 | 1203 | amith |
+----+------+------------------+
Use groupBy Method
Use the following command for counting the number of
employees who are of the same age.
scala> dfs.groupBy("age").count().show()
Output − Two employees have age 23.
<console>:22, took 5.196091 s
+----+----------+
|age |count|
+----+----------+
| 23 | 2 |
| 25 | 1 |
| 28 | 1 |
| 39 | 1 |
+----+----------+
Complex Data Types in Spark DataFrames:
Spark supports various complex data types:
Arrays:
Represents a collection of elements of the same type.
Useful for scenarios where a field contains multiple values, such as tags or
a list of items.
Structs:
Represents a structure with named fields, allowing the grouping of related
attributes.
Useful for dealing with nested or hierarchical data structures.
Example Working with JSON Data in Spark:
Let's consider an example where we work with JSON data using PySpark:
# Import required Spark libraries
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
# Create a Spark session
spark = SparkSession.builder.appName("JsonExample").getOrCreate()
# JSON data
json_data = '''
{
"id": 1,
"name": "John Doe",
"age": 25,
"skills": ["Python", "Spark", "SQL"],
"address": {
"city": "New York",
"zipcode": "10001"
}
}
'''
# Read JSON data into a DataFrame
df = spark.read.json(spark.sparkContext.parallelize([json_data]))
# Show the DataFrame
df.show()
# Extracting data using DataFrame operations
result_df = df.select(
col("id"),
col("name"),
col("age"),
col("address.city").alias("city"),
col("address.zipcode").alias("zipcode"),
col("skills")
)
# Show the result DataFrame
result_df.show(truncate=False)
# Stop the Spark session
spark.stop()
Explanation:
Spark Session Creation:
The SparkSession is created, serving as the entry point for Spark
functionality.
Define JSON Data:
A JSON string (json_data) is defined, representing a sample JSON record.
Read JSON Data into a DataFrame:
The spark.read.json() method reads JSON data into a DataFrame (df).
Show the Original DataFrame:
The original DataFrame is displayed to illustrate the structure of the loaded
JSON data.
DataFrame Operations:
DataFrame operations are used to select specific columns (id, name, age,
address.city, address.zipcode, skills) and alias some of them.
Show Result DataFrame:
The result DataFrame is displayed, showing the extracted and manipulated
data.
Stop Spark Session:
The Spark session is stopped to release resources.
In big data analytics, this example demonstrates how Spark can efficiently
handle and process JSON data with nested structures using its DataFrame
API. The support for complex data types allows for flexible and powerful
data manipulations, making Spark well-suited for diverse big data scenarios.
9a) Explain event time and stateful processing. 7M
ANS:
Having covered the core concepts and basic APIs, this section dives into event-time and stateful processing. Event-time processing is a hot topic because we analyze information with respect to the time that it was created, not the time it was processed. The key idea behind this style of processing is that over the lifetime of the job, Spark will maintain relevant state that it can update over the course of the job before outputting it to the sink.
Let's cover these concepts in greater detail before we begin working with code to show how they work.
Event Time
Event time is an important topic to cover discretely because Spark's DStream API does not support processing information with respect to event time. At a higher level, in stream-processing systems there are effectively two relevant times for each event: the time at which it actually occurred (event time), and the time at which it was processed or reached the stream-processing system (processing time).
Event time
Event time is the time that is embedded in the data itself. It is most often, though not required to be, the time at which the event actually occurred. This is important to use because it provides a more robust way of comparing events against one another. The challenge is that event data can arrive late or out of order, which means the stream-processing system must be able to handle out-of-order or late data.
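The idea can be illustrated with a small plain-Python sketch (not Spark code; the function and names are made up for illustration): events are bucketed by the timestamp embedded in the data, so late or out-of-order arrivals still land in the correct event-time window.

```python
from collections import defaultdict

def window_counts(events, window_size):
    """Bucket each (event_time, word) pair into an event-time window
    based on the timestamp embedded in the event itself, so arrival
    order does not matter."""
    counts = defaultdict(int)
    for event_time, word in events:
        window_start = (event_time // window_size) * window_size
        counts[(window_start, word)] += 1
    return dict(counts)

# The event at t=3 arrives last (late, out of order), but it still
# lands in the first 10-second window alongside the event at t=5.
events = [(12, "hello"), (5, "hello"), (3, "hello")]
print(window_counts(events, 10))  # {(10, 'hello'): 1, (0, 'hello'): 2}
```

A processing-time system would instead bucket events by arrival order, putting the late t=3 event into the wrong window.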
Processing time
Processing time is the time at which the stream-processing system actually receives the data.
Stateful Streaming in Apache Spark
Apache Spark is a general processing engine built on top of the Hadoop ecosystem. Spark has a complete setup and a unified framework to process any kind of data. Spark can do batch processing as well as stream processing. Spark has a powerful SQL engine to run SQL queries on the data; it also has an integrated machine learning library called MLlib and a graph processing library called GraphX. Because it integrates so many capabilities, we identify Spark as a unified framework rather than just a processing engine.
Now coming to the real-time stream-processing engine of Spark: Spark doesn't process the data in true real time; it does near-real-time processing, meaning it processes the data in micro-batches. Spark's streaming context processes the data in micro-batches, but generally this processing is stateless. Say we have defined the streaming context to run every 10 seconds: it will process the data that arrived within those 10 seconds. To process the previous data we have the windows concept, but windows cannot give the accumulated results from the starting timestamp of the job.
But what if you need to accumulate the results from the start of the streaming job? That means you need to check the previous state of the RDD in order to update the new state of the RDD. This is what is known as stateful streaming in Spark. Spark provides two APIs to perform stateful streaming: updateStateByKey and mapWithState.
Now we will see how to perform stateful streaming of word count using updateStateByKey. updateStateByKey is a function on DStreams in Spark which accepts an update function as its parameter. The update function receives two parameters: the new values for the key (a Seq of values) and the previous state of the key (an Option).
Let’s take a word count program, let’s say for the first 10 seconds we have
giventhis data hello every one from acadgild. Now the wordcount program
result will be
(one,1)
(hello,1)
(from,1) (acadgild,1)(every,1)
Now without writing the updateStateByKey function, if you give some other
data, in the next 10 seconds i.e. let’s assume we give the same line hello
every one from acadigld. Now we will get the same result in the next 10
seconds alsoi.e.,
(one,1)
(hello,1)
(from,1) (acadgild,1)(every,1)
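Outside Spark, the accumulation that updateStateByKey performs can be sketched in plain Python (an illustration of the stateful logic only; the function and variable names here are made up):

```python
from collections import defaultdict

def update_function(new_values, previous_state):
    """Mirrors the two arguments updateStateByKey's update function
    receives: the new values seen for a key in this batch, and the
    key's previous state (None if the key is new)."""
    return sum(new_values) + (previous_state or 0)

def update_state(state, batch_pairs):
    """Apply one micro-batch of (word, count) pairs to the running state."""
    grouped = defaultdict(list)
    for word, count in batch_pairs:
        grouped[word].append(count)
    for word, values in grouped.items():
        state[word] = update_function(values, state.get(word))
    return state

state = {}
batch = [("hello", 1), ("every", 1), ("one", 1), ("from", 1), ("acadgild", 1)]
update_state(state, batch)  # first 10-second batch
update_state(state, batch)  # same line arrives again in the next batch
print(state["hello"])  # 2 -- counts now accumulate across batches
```

With this update function plugged into updateStateByKey, the second batch would yield (hello,2), (every,2), and so on, instead of restarting at 1.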
9b) Discuss Structured Streaming. 8M
ANS:
Structured Streaming is a high-level API for stream processing that
became production-ready in Spark 2.2. Structured Streaming allows you to
take the same operations that you perform in batch mode using Spark’s
structured APIs, and run them in a streaming fashion. This can reduce
latency and allow for incremental processing. The best thing about
Structured Streaming is that it allows you to rapidly and quickly get value
out of streaming systems with virtually no code changes. It also makes it
easy to reason about because you can write your batch job as a way to
prototype it and then you can convert it to a streaming job. The way all of
this works is by incrementally processing that data.
Structured Streaming in big data analytics refers to the application of Apache
Spark's Structured Streaming API for handling and processing real-time data at
scale. It is designed to provide a unified and high-level API for stream
processing, making it more accessible and intuitive for developers familiar
with Spark's DataFrame and SQL API. Here are key aspects of Structured
Streaming in the context of big data analytics:
Unified Programming Model:
Structured Streaming unifies batch and streaming processing, allowing
developers to use the same DataFrame and SQL API for both static (batch) and
streaming data. This brings consistency and simplifies the development
process.
Declarative SQL-Like API:
Developers can express their stream processing logic in a declarative SQL-like
syntax using the DataFrame API. This abstraction simplifies the coding
process and allows for easier integration of stream processing tasks into
existing Spark applications.
Incremental Processing:
One of the key features of Structured Streaming is its support for incremental
processing. It processes only the new data that arrives in the stream since the
last batch was processed. This enables low-latency and efficient stream
processing.
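The incremental model can be sketched in plain Python (an illustration of the idea only; Structured Streaming itself tracks offsets per source, and the names below are made up):

```python
def process_new_records(log, last_offset, totals):
    """Read only the records appended after last_offset and fold them
    into the running totals, returning the new offset and totals."""
    for offset in range(last_offset, len(log)):
        key = log[offset]
        totals[key] = totals.get(key, 0) + 1
    return len(log), totals

log = ["a", "b", "a"]
offset, totals = process_new_records(log, 0, {})           # first batch reads all 3 records
log += ["b", "a"]                                          # new data arrives
offset, totals = process_new_records(log, offset, totals)  # only the 2 new records are read
print(totals)  # {'a': 3, 'b': 2}
```

Because each batch touches only the records that arrived since the last offset, the cost of a batch is proportional to the new data, not to the whole stream so far.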
Fault Tolerance:
Structured Streaming provides end-to-end exactly-once semantics, ensuring
that every record is processed exactly once, even in the presence of failures.
This is achieved through mechanisms like checkpointing and the use of write-ahead logs.
Continuous Processing Model:
By default Structured Streaming runs as a series of small micro-batches; since Spark 2.3 it also offers an experimental continuous processing mode. This enables low end-to-end latencies and a natural handling of event time in the context of real-time analytics.
Event-Time Processing:
Structured Streaming supports event-time processing, allowing developers to
work with the timestamp of events. This is essential for applications where the
order of events matters, and handling late-arriving data is crucial.
Integration with Spark Ecosystem:
Structured Streaming seamlessly integrates with other components of the
Spark ecosystem, such as Spark SQL, MLlib, and GraphX. This allows organizations to
build unified data processing pipelines for both batch and streaming workloads.
Stateful Processing:
Structured Streaming supports stateful processing, enabling the maintenance
of intermediate state across batches. This is particularly useful for scenarios
where aggregations need to be carried over time, such as maintaining a
running count or sum.
Integration with External Systems:
The API allows for integration with external systems for various
functionalities, such as connecting to external databases, calling external APIs,
or incorporating custom logic using user-defined functions (UDFs).
Source and Sink Agnostic:
It supports a variety of data sources and sinks, including popular ones like
Apache Kafka, HDFS, Amazon S3, and more. This flexibility allows
organizations to ingest data from various sources and output the results to
different sinks.
In big data analytics, Structured Streaming enables organizations to build
robust, scalable, and fault-tolerant real-time data processing pipelines, making
it well-suited for applications ranging from fraud detection and monitoring to
live dashboarding and continuous ETL (Extract, Transform, Load) processes.
10a) Define streaming and explain duplicates in a stream. 8M
ANS:
Streaming refers to the processing of data continuously and in real-time, typically in
the context of data arriving as a continuous flow rather than in discrete batches. In a
streaming scenario, data is generated and processed continuously, and results are
produced as soon as the data becomes available.
Duplicates in Streaming:
In the context of streaming data, duplicates refer to the occurrence of identical
records or events within the data stream. Duplicate records can arise due to various
reasons, and handling them appropriately is essential for ensuring the accuracy and
reliability of streaming analytics.
Common scenarios leading to duplicates in streaming data include:
Reprocessing or Retrying:
In a distributed and fault-tolerant streaming system, it's possible for a record to be
processed more than once. This can happen if there are failures during processing,
and the system retries to process the same record.
Network Delays or Glitches:
Network delays or glitches can cause the same record to be transmitted more than
once. The receiving end of the streaming system may interpret these retransmissions
as duplicate records.
Out-of-Order Arrival:
In some cases, records may arrive out of order due to network latency or delays. This
can result in the same record being processed multiple times if the system is not
designed to handle out-of-order arrivals.
Data Source Characteristics:
The characteristics of the data source itself, such as the way data is produced and
transmitted, can contribute to duplicates. For example, a sensor that re-emits its last
reading at fixed intervals will generate identical records.
Handling Duplicates in Streaming:
Managing duplicates in a streaming environment is crucial for maintaining the
integrity of analytics and preventing inaccuracies in results. Several techniques can be
employed to handle duplicates:
Deduplication:
Deduplication involves identifying and removing duplicate records from the stream.
This can be achieved by maintaining state and checking for duplicates before
processing each record.
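The stateful check described above can be sketched in plain Python. The record shape and the `event_id` field are illustrative assumptions, not part of any particular streaming framework:

```python
def deduplicate(records):
    # State: the set of event ids already processed.
    seen = set()
    output = []
    for record in records:
        if record["event_id"] in seen:
            continue                      # duplicate: drop it
        seen.add(record["event_id"])
        output.append(record)
    return output

stream = [
    {"event_id": 1, "value": "a"},
    {"event_id": 2, "value": "b"},
    {"event_id": 1, "value": "a"},        # retransmission of event 1
]
cleaned = deduplicate(stream)             # event 1 survives only once
```

In a real long-running stream the `seen` set cannot grow forever; production systems bound this state, for instance by expiring ids after a retention period.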
Windowing:
Windowing involves grouping records within a specified time window and processing
them collectively. This can help identify and handle duplicates within the window.
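A minimal sketch of window-scoped deduplication, assuming each record carries a numeric timestamp `ts` (field names are illustrative). Records are bucketed into fixed-size windows, and only the first copy of each event id per window is kept:

```python
from collections import defaultdict

def window_dedupe(records, window_size):
    # Assign each record to a fixed-size window by its timestamp,
    # then keep only the first copy of each event_id within a window.
    windows = defaultdict(dict)
    for r in records:
        w = r["ts"] // window_size
        windows[w].setdefault(r["event_id"], r)   # first copy wins
    return {w: list(by_id.values()) for w, by_id in windows.items()}

events = [
    {"ts": 0, "event_id": "a"},
    {"ts": 2, "event_id": "a"},   # duplicate inside window 0
    {"ts": 6, "event_id": "a"},   # same id, but falls in window 1
]
result = window_dedupe(events, window_size=5)
```

Note the trade-off: duplicates that straddle a window boundary (as in window 1 above) are not caught, which is why windowing is often combined with the other techniques below.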
Timestamp-Based Processing:
Processing records based on their timestamp can help identify and discard duplicates
by considering the temporal order of events.
Idempotent Operations:
Designing operations to be idempotent ensures that processing the same record
multiple times has the same effect as processing it once. This approach is effective in
mitigating the impact of duplicates.
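The contrast between idempotent and non-idempotent operations can be shown with a small illustration (the key/value shapes are hypothetical). An upsert keyed on the record id converges to the same state no matter how many times a record is replayed, whereas a blind increment does not:

```python
def apply_upsert(store, record):
    # Idempotent: replaying the same record any number of times
    # leaves the store in the same state as applying it once.
    store[record["key"]] = record["value"]

def apply_increment(counters, record):
    # Non-idempotent, for comparison: every replay inflates the count.
    counters[record["key"]] = counters.get(record["key"], 0) + 1

store, counters = {}, {}
for _ in range(3):                                 # simulate two retries
    apply_upsert(store, {"key": "user42", "value": 10})
    apply_increment(counters, {"key": "user42"})
```

After three deliveries the upsert store still reads `{"user42": 10}`, while the counter has been inflated to 3. This is why sinks that support keyed upserts make duplicate handling far easier than append-only sinks.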
Buffering and Caching:
Buffering and caching records for a short period can help identify and eliminate
duplicates by comparing incoming records with those in the buffer.
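A bounded recent-ids buffer is one way to realise this comparison. The sketch below (class and method names are illustrative) keeps the last `capacity` event ids in arrival order and rejects any id still present in the buffer:

```python
from collections import deque

class RecentBuffer:
    """Rejects any event id seen within the last `capacity` records."""

    def __init__(self, capacity):
        self.order = deque()      # ids in arrival order, oldest first
        self.ids = set()          # same ids, for O(1) membership checks
        self.capacity = capacity

    def accept(self, event_id):
        if event_id in self.ids:
            return False          # duplicate within the buffer window
        if len(self.order) == self.capacity:
            self.ids.discard(self.order.popleft())   # evict the oldest
        self.order.append(event_id)
        self.ids.add(event_id)
        return True

buf = RecentBuffer(capacity=2)
results = [buf.accept(i) for i in (1, 1, 2, 3, 1)]
```

Because the buffer is bounded, memory stays constant, but a duplicate arriving after its original has been evicted (the final `1` above) slips through: buffering trades completeness for bounded state.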
Handling duplicates in streaming is a critical aspect of building robust and reliable
real-time data processing systems. By implementing appropriate strategies and
techniques, organizations can ensure the accuracy of analytics results and maintain
the integrity of their streaming applications.