
These are notes based on YouTube videos by the Learning Journal channel.

What is big data?


Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-
effective, innovative forms of information processing that enable enhanced insight, decision making,
and process automation. These data assets can be structured (CSV, XLS, database tables) or unstructured (text
documents, emails, scraped web pages, etc.).

3Vs of big data – Volume, Velocity and Variety.

Hadoop
Hadoop was first developed on the basis of two papers by Google engineers: "The Google File System" (the
basis of HDFS) and "MapReduce: Simplified Data Processing on Large Clusters" (the basis of Hadoop MapReduce).

The Apache Hadoop software library is an open source framework that allows for the distributed
processing of large data sets across clusters of computers using simple programming models.

It is designed to scale up from single servers to thousands of machines, each offering local computation
and storage.

Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and
handle failures at the application layer, so delivering a highly-available service on top of a cluster of
computers, each of which may be prone to failures.

As of now (May 2020), the latest Hadoop version is 3.2.1.


Hadoop core components are: HDFS (a distributed file system), MapReduce (a distributed processing
framework) and YARN (Yet Another Resource Negotiator).

The Hadoop ecosystem contains many tools that use Hadoop as their foundation:

Hive, Pig, Spark, HBase, Sqoop, Kafka, Flume, Oozie, Zookeeper etc.

HDFS
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity
hardware. It has many similarities with existing distributed file systems. However, the differences from
other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed
on low-cost hardware. Features of HDFS are:

1. Distributed – it stores a large file (TBs or PBs) not on a single machine but in chunks spread across
several machines, which makes it distributed.
2. Scalable – HDFS is deployed on low-cost commodity hardware machines, which we can add easily,
and all machines are connected through a network. This is called horizontal scaling.

3. Cost-effective – since it uses low-cost commodity hardware rather than specially built server
machines, scaling it is cost effective.
4. Fault-tolerant – HDFS is fault tolerant; any machine in its network can go down at any time.
HDFS has quick fault-detection mechanisms as well as automatic recovery.
5. High throughput – HDFS has high throughput, i.e. the number of records processed per unit
of time is high. It focuses on throughput rather than latency (time to get the first record), so HDFS is not
a good choice for low-latency requirements.
HDFS Architecture

[Figure: sample Hadoop cluster – a core switch connected to rack switches, each rack being a group of nodes.]
Above is a sample picture of a Hadoop cluster. Each Hadoop cluster has a Hadoop client through which it
interacts with the name node and data nodes.
1. The Hadoop client interacts with the cluster through a core switch, which is connected to several rack
switches in the network. A rack can be considered a collection of computers connected together.
A Hadoop cluster can have multiple racks connected via the network.
2. Hadoop follows a master-slave architecture in which one node or computer is assigned as the
master (name node), which manages the file system namespace and controls access to files by
clients.

Functions of NameNode:

 It is the master daemon that maintains and manages the DataNodes (slave nodes)
 It records the metadata of all the files stored in the cluster, e.g. The location of blocks
stored, the size of the files, permissions, hierarchy, etc. There are two files associated with the
metadata:
o FsImage: It contains the complete state of the file system namespace since the start of
the NameNode.
o EditLogs: It contains all the recent modifications made to the file system with respect to
the most recent FsImage.
 It records each change that takes place to the file system metadata. For example, if a file is
deleted in HDFS, the NameNode will immediately record this in the EditLog.
 It regularly receives a Heartbeat and a block report from all the DataNodes in the cluster to
ensure that the DataNodes are live.
 It keeps a record of all the blocks in HDFS and in which nodes these blocks are located.

HDFS writes data in blocks of 128 MB using FSDataOutputStream. The client first buffers data locally, and
when the buffer reaches the block size (128 MB), it sends a block allocation request to the name node. The
name node checks its metadata and data node information and allocates the block on one of the
available data nodes. A minimal client-side sketch follows the list below.
3. DataNodes are the nodes where the actual data blocks are stored.
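
As an illustration of the client API involved in the write path above, here is a minimal sketch (not from the
notes) of writing a file to HDFS from Scala using the Hadoop FileSystem API; the path /tmp/sample.txt and the
contents are hypothetical.

// Minimal HDFS write sketch: create() returns an FSDataOutputStream and the name node allocates blocks.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsWriteSketch {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()                    // picks up core-site.xml / hdfs-site.xml from the classpath
    val fs   = FileSystem.get(conf)
    val out  = fs.create(new Path("/tmp/sample.txt")) // hypothetical path; returns an FSDataOutputStream
    out.write("hello hdfs\n".getBytes("UTF-8"))       // data is buffered client-side and written in blocks
    out.close()                                       // flush; the name node finalizes the block metadata
    fs.close()
  }
}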

High Availability & Fault Tolerance in HDFS


1. HDFS uses replication for high availability, i.e. to ensure data uptime of around 99.99%. It replicates
a block of data on multiple data nodes so that if the data on one node goes down, it can be fetched
from another data node. We can set the replication factor while configuring Hadoop; we generally
set it to 3.
2. We always try to keep replicas on data nodes in different racks so that if a whole rack fails, the
data can be fetched from another rack. This is called rack awareness. It helps in reducing latency
and increases fault-tolerance capability.
3. With this fault-tolerance capability, if a replica gets corrupted or its data node/rack fails, HDFS
creates replicas on another node so that each block keeps as many replicas as the replication factor.
4. To ensure availability of the name node, we can have its backup – a standby name node. To keep
the standby name node in sync with the active one, Hadoop uses QJM (Quorum Journal Manager). It works as
follows:
 We put a fail-over controller on each of the active and standby name nodes. Between
them we use ZooKeeper, whose lock is held by the active name node first. If the active
name node fails or goes down, its lock expires and the lock is acquired by the standby
name node. Also, block reports and heartbeats are configured to be sent to both the active
and standby name nodes.
 QJM has multiple journal nodes (usually 3), to each of which edit logs are written
continuously. We keep edit logs on multiple journal nodes because they are critical. These
edit logs are continuously read by the standby name node, which keeps updating its in-memory
FsImage. So our standby name node can take the place of the active name node
within seconds if it goes down.
Secondary NameNode
Sometimes we need to restart the name node for various reasons. In such cases, the name node might take a
long time to rebuild the FsImage from the EditLog. For such situations we have the secondary name node.

It reads the EditLog periodically, updates the FsImage, stores it on disk and then truncates the EditLog. The EditLog
is also updated periodically on the secondary name node. In case of failure of the primary name node, the secondary
name node can quickly bring its FsImage from disk into memory, which is faster than creating the FsImage from
scratch using the EditLog. Only the changes recorded in the EditLog after the last FsImage update need to be
replayed, which takes far less time than rebuilding the FsImage from scratch.

In an HA scenario, the standby name node can take on the responsibility of the secondary name node as well.

Apache Spark
Apache Spark is a unified analytics engine for big data processing, with built-in modules for
streaming, SQL, machine learning and graph processing. It is an open-source distributed general-
purpose cluster-computing framework. Spark provides an interface for programming entire clusters
with implicit data parallelism and fault tolerance. It is 10 to 100 times faster than Hadoop map
reduce.

The Apache Spark framework primarily consists of two components:

1. A cluster computing engine(Spark Core).


2. A set of libraries, APIs and DSL(domain-specific language) functions(Spark SQL etc).
 Apache Spark doesn't provide cluster management and storage services. They need to be
provided by some external means, such as YARN, Mesos or Kubernetes for cluster resource
management and HDFS, S3, GCS, CFS etc. for distributed storage.
 Compute engine in spark core takes care of memory management, task scheduling, Fault
recovery, Interaction with cluster manager etc.
 Spark Core API has two parts: structured and unstructured.
 The structured API has DataFrames and Datasets, while the unstructured part has the low-level APIs related
to RDDs, accumulators and broadcast variables.
 Spark Core API can be used with four languages: Scala, Python, Java & R.
 On top of Spark Core API, we have another set of libraries: Spark SQL, Spark Streaming, MLLib &
GraphX.
 Spark SQL - It allows querying the data via SQL (Structured Query Language) as well as the
Apache Hive variant of SQL called HQL (Hive Query Language).
 Spark Streaming - Spark Streaming is a Spark component that supports scalable and
fault-tolerant processing of streaming data.
 MLLib - The MLLib is a Machine Learning library that contains various machine
learning algorithms.
 GraphX - The GraphX is a library that is used to manipulate graphs and perform
graph-parallel computations.
 We can execute Spark programs on a cluster using the below two methods: interactive clients (Scala
shell, pyspark, notebooks), suitable for exploration and learning, and the spark-submit utility,
suitable for running applications in production.
 How does Spark execute our program on a cluster? Spark is a distributed programming engine. It
uses a master-slave architecture. For every application, it has one dedicated driver (master) and
multiple executors (slaves). So each application has its own set of dedicated driver and
executors.
 Spark driver is responsible for analyzing, distributing, scheduling and monitoring works across
the cluster. Executor’s job is to take the work from driver, execute it and report back the status
to driver.

Client:
 Driver runs on a dedicated server (Master node) inside a dedicated process. This means it has all
available resources at its disposal to execute work.
 Driver opens up a dedicated Netty HTTP server and distributes the JAR files specified to all
Worker nodes (big advantage).
 Because the Master node has dedicated resources of its own, you don't need to "spend" worker
resources for the driver program.
 If the driver process dies, you need an external monitoring system to restart it.

Cluster:
 Driver runs on one of the cluster's Worker nodes. The worker is chosen by the Master leader
 Driver runs as a dedicated, standalone process inside the Worker.
 The driver program takes up at least 1 core and a dedicated amount of memory from one of the
workers (this can be configured).
 Driver program can be monitored from the Master node using the --supervise flag and be reset
in case it dies.
 When working in Cluster mode, all JARs related to the execution of your application need to be
publicly available to all the workers. This means you can either manually place them in a shared
place or in a folder for each of the workers.
YARN is the most widely used cluster manager for Spark. Kubernetes, a container orchestration platform
originally from Google, is also catching up now.

In client mode with YARN as the cluster manager, the driver creates a Spark session on the client machine and a
request is sent to the YARN resource manager to start a YARN application. The resource manager starts an
application master, which is responsible for launching executors. The application master then reaches out to
the YARN resource manager to allot containers; once containers are allotted, the application master starts an
executor in each of these containers. These executors can then communicate directly with the driver.

In cluster mode with YARN, we submit our packaged application using spark-submit to the resource manager.
The resource manager starts a YARN application master, which acts as the driver as well. The rest of the process
is the same as in client mode.
In local mode, a single Spark JVM is started, and the driver and executors run inside this JVM only. No cluster is
involved. It is good for learning purposes. A sketch of the spark-submit options involved follows.
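
For reference, a hedged spark-submit sketch of the two YARN deploy modes described above; the class name,
JAR path and resource sizes are hypothetical examples, not values from the notes.

# Client mode: the driver runs on the machine where spark-submit is invoked
spark-submit --master yarn --deploy-mode client \
  --class com.example.MyApp myapp.jar

# Cluster mode: the driver runs inside the YARN application master on the cluster
spark-submit --master yarn --deploy-mode cluster \
  --class com.example.MyApp \
  --executor-memory 4g --executor-cores 2 \
  myapp.jar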
Spark application process flow

Spark has three major data structures in which it holds data: RDD, Dataset and DataFrame. Under the hood,
Datasets and DataFrames are built on top of RDDs.

RDD

We can create an RDD using the below two methods:

1. Load data from a source like a file or a database.

2. Create an RDD by transforming another RDD.

We can control the number of partitions in an RDD, as shown in the sketch below.
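
A minimal sketch (the input path is hypothetical) of both ways to create an RDD and of controlling partitions:

// 1. Load from a source, requesting 4 partitions; 2. transform an existing RDD.
val linesRdd = sc.textFile("hdfs:///data/sample.txt", 4)
val wordsRdd = linesRdd.flatMap(_.split(" "))
println(wordsRdd.getNumPartitions)      // inspect the current partition count
val repart = wordsRdd.repartition(8)    // explicitly change the number of partitions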

Directed Acyclic Graphs (DAGs)


A Directed Acyclic Graph is a finite directed graph that performs a sequence of computations on
data. Each node is an RDD partition, and each edge is a transformation on top of the data. Here,
graph refers to the navigation, whereas directed and acyclic refer to how it is done.

Whenever there is a need to move data across partitions, shuffle and sort operations are used. For example, when
we need to group data by key across partitions, a shuffle and sort is triggered.
http://datastrophic.io/core-concepts-architecture-and-internals-of-apache-spark/

https://www.freecodecamp.org/news/deep-dive-into-spark-internals-and-architecture-f6e32045393b/

Spark Interview Questions

1. Cache vs Persist. When to use which? The difference between them is
that cache() will cache the RDD in memory, whereas persist(level) can cache in memory, on
disk, or off-heap memory according to the caching strategy specified by level.
persist() without an argument is equivalent to cache(). The caching strategies (storage levels) are discussed
below. Freeing up space from the Storage memory is performed by unpersist(). A small example follows.
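
A minimal sketch of cache vs persist (the input paths are hypothetical):

import org.apache.spark.storage.StorageLevel

val df = spark.read.parquet("hdfs:///data/events")    // hypothetical input
df.cache()                                            // equivalent to persist() with the default level
df.count()                                            // the first action materializes the cache

val rdd = sc.textFile("hdfs:///data/events.txt")
rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)         // explicit storage level
rdd.count()
rdd.unpersist()                                       // free the storage memory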

Spark persistence storage levels


All the different storage levels Spark supports are available
in the org.apache.spark.storage.StorageLevel class. The storage level specifies how and where to
persist or cache a Spark DataFrame or Dataset.
MEMORY_ONLY – This is the default behavior of the RDD cache() method and stores the RDD or
DataFrame as deserialized objects in JVM memory. When there is not enough memory available,
it will not save some partitions, and these will be re-computed as and when
required. This takes more memory; for a DataFrame (unlike an RDD) it can also be slower
than the MEMORY_AND_DISK level, as it recomputes the unsaved partitions, and recomputing the
in-memory columnar representation of the underlying table is expensive.
MEMORY_ONLY_SER – This is the same as MEMORY_ONLY, the difference being that it stores the
RDD as serialized objects in JVM memory. It takes less memory (space-efficient) than
MEMORY_ONLY, as it saves objects in serialized form, and it takes a few additional CPU cycles
to deserialize them.
MEMORY_ONLY_2 – Same as the MEMORY_ONLY storage level but replicates each partition to two
cluster nodes.
MEMORY_ONLY_SER_2 – Same as the MEMORY_ONLY_SER storage level but replicates each
partition to two cluster nodes.
MEMORY_AND_DISK – This is the default behavior of the DataFrame or Dataset. In this storage
level, the DataFrame is stored in JVM memory as deserialized objects. When the required
storage is greater than the available memory, it stores some of the excess partitions on disk and
reads the data from disk when it is required. It is slower because there is I/O involved.
MEMORY_AND_DISK_SER – This is the same as the MEMORY_AND_DISK storage level, the difference being
that it serializes the DataFrame objects in memory and on disk when space is not available.
MEMORY_AND_DISK_2 – Same as the MEMORY_AND_DISK storage level but replicates each
partition to two cluster nodes.
MEMORY_AND_DISK_SER_2 – Same as the MEMORY_AND_DISK_SER storage level but replicates
each partition to two cluster nodes.
DISK_ONLY – In this storage level, the DataFrame is stored only on disk, and the computation
time is high because I/O is involved.
DISK_ONLY_2 – Same as the DISK_ONLY storage level but replicates each partition to two cluster
nodes.

2. RDD, why rdd is immutable


Spark RDD is an immutable collection of objects for the following reasons:

 Immutable data can be shared safely across various processes and threads
 It allows you to easily recreate the RDD
 You can enhance the computation process by caching RDD

3. Why spark is lazy ?(concept of lazy evaluation)


Lazy evaluation in Spark refers to the concept that execution of the program doesn't happen until an
action is triggered. Transformations are lazily evaluated, i.e. they are not executed immediately;
they are evaluated only when an action is triggered. Spark maintains the record of operations to be
applied using a DAG.
In Spark, the driver program loads the code to the cluster. If the code were executed eagerly after every
operation, each task would be time and memory consuming, since data would go to the
cluster for evaluation each time.
Lazy evaluation saves the trips between the driver and the cluster, and data I/O is usually
the bottleneck for speed.
Lazy evaluation is also important to Spark because of the Catalyst optimizer (introduced in 2015).
At its core, the Catalyst optimizer optimizes query execution by planning out the sequence of
computation and skipping potentially unnecessary steps. A small example follows.
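
A minimal sketch of lazy evaluation (the log path is hypothetical): nothing executes until the action at the end.

val lines   = sc.textFile("hdfs:///logs/access.log")   // lazy: no data is read yet
val errors  = lines.filter(_.contains("ERROR"))        // transformation, lazy
val lengths = errors.map(_.length)                     // transformation, lazy
val n       = lengths.count()                          // action: the whole lineage runs now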
4. Transformation vs actions
A transformation is an operation in Spark which takes an RDD as input and returns another RDD
as output. Transformations are lazily evaluated, i.e. not evaluated until an action is triggered.
Examples: map, flatMap, reduceByKey, join etc. Basically, they are the core pieces of code where you
define your business logic.
Actions are operations in Spark which return a final value to the driver program or write data to some
external storage. Basically, when an action is triggered, Spark runs all the transformations in the RDD
lineage and returns a value. Examples: count, reduce etc.
5. Narrow and wide transformations
Transformations consisting of narrow dependencies (we'll call them narrow transformations)
are those for which each input partition will contribute to only one output partition. These
compute data that lives on a single partition, meaning there will not be any data movement
between partitions to execute narrow transformations. Examples: map and filter.

A wide dependency (or wide transformation) style transformation will have input partitions
contributing to many output partitions. You will often hear this referred to as a shuffle, whereby
Spark will exchange partitions across the cluster. These compute data that lives on many
partitions, meaning there will be data movement between partitions to execute wide
transformations.

With narrow transformations, Spark will automatically perform an operation called pipelining,
meaning that if we specify multiple filters on DataFrames, they'll all be performed in memory.
The same cannot be said for shuffles: when we perform a shuffle, Spark writes the results to
disk. Examples: groupByKey(), reduceByKey(), aggregate, join, repartition. A small sketch follows.
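
A minimal sketch contrasting narrow and wide transformations (the data is made up):

val nums   = sc.parallelize(1 to 100, 4)
val narrow = nums.map(_ * 2).filter(_ % 3 == 0)   // narrow: each input partition feeds exactly one output partition
val pairs  = narrow.map(n => (n % 10, n))
val wide   = pairs.reduceByKey(_ + _)             // wide: records are shuffled across partitions by key
wide.collect()                                    // action that triggers the whole chain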
6. Spark optimization techniques
7. Handle data skew problem in spark
Data skew is a condition in which data is unevenly distributed among partitions in the cluster. A task
executed on a skewed partition will take more time than the others. It degrades the
performance of queries, especially those with joins. It can cause a large amount of shuffling of data, which
is a very expensive operation. Solutions to data skew in Spark are:

1. Repartitioning – we can repartition or coalesce the data to distribute it more evenly across
partitions. The primary key of the datasets should be used. But it doesn't guarantee fully even
partitioning of the data.
2. Salting – if partitioning is based on the original key, we can observe an imbalanced
distribution of data across the partitions. To curb this situation we
modify our original keys into salted keys whose hash partitioning causes a
proper distribution of records among the partitions. This is called the salting
technique (a sketch follows after the links below).
https://www.youtube.com/watch?v=HIlfO1pGo0w&t=8s
https://itnext.io/handling-data-skew-in-apache-spark-9f56343e58e8

Other techniques include isolated salting, Isolated Map Join & iterative broadcast technique.

https://bigdatacraziness.wordpress.com/2018/01/05/oh-my-god-is-my-data-skewed/
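
A minimal sketch of the salting technique for a skewed aggregation; the DataFrame skewedDf and the column
names key and value are hypothetical.

import org.apache.spark.sql.functions._

val salted = skewedDf
  .withColumn("salt", (rand() * 10).cast("int"))                      // random salt 0-9
  .withColumn("saltedKey", concat_ws("_", col("key"), col("salt")))   // spread each hot key over 10 sub-keys

// Aggregate on the salted key first, then merge the partial results per original key.
val partial = salted.groupBy("saltedKey", "key").agg(sum("value").as("partialSum"))
val result  = partial.groupBy("key").agg(sum("partialSum").as("total"))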

8. Difference between reduce and reduceByKey? Why is reduce an
action and reduceByKey a transformation?
reduce aggregates the elements of a dataset through a function func. It is an action and returns a
value.

scala> val names1 = sc.parallelize(List("abe", "abby", "apple"))


names1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[1467] at parallelize at <console>:12

scala> names1.reduce((t1,t2) => t1 + t2)


res778: String = abbyabeapple
reduceByKey operates on (K,V) pairs, of course, but the func must be of type (V,V) => V,
and it returns an RDD, so it is a transformation. A small example follows.
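
A minimal sketch of reduceByKey in the same spark-shell style (the output ordering may vary):

scala> val wordPairs = sc.parallelize(List(("a", 1), ("b", 1), ("a", 1)))

scala> val counts = wordPairs.reduceByKey(_ + _)   // still an RDD; nothing has executed yet

scala> counts.collect()
res0: Array[(String, Int)] = Array((a,2), (b,1))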
9. What is spark ?
10. Big data Cluster configuration
11. Row and columnar file format (use case) – In column-oriented
formats, rows in files are broken into row splits, then each row split is stored column-wise: all
values of the first column in a row split are stored together, followed by all values of the second column, and
so on. They are best for read-heavy analytical workloads. Examples are Parquet and ORC. They are
good for analytical workloads where we need to create reports using selected columns, since a
column-oriented layout permits columns that are not accessed in a query to be skipped. Another
aspect to consider is support for schema evolution, i.e. the ability for the file structure to change
over time.
In row-based formats, row splits are stored in a row-wise fashion, i.e. the values in a row are
stored together. Row-oriented formats are appropriate when a large number of columns of a
single row are needed for processing at the same time. They are good for write-heavy
transactional loads. Examples are Avro, CSV, TSV and JSON.
Because of the way the data is optimized for fast retrieval, the column-based stores, Parquet
and ORC, offer higher compression rates than the row-based Avro format. A small sketch follows.
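
A minimal sketch of why the columnar layout suits analytical reads (the path and column names are
hypothetical); only the selected columns are read from the Parquet files.

val events = spark.read.parquet("hdfs:///warehouse/events")   // columnar source
events.select("country", "revenue")                           // column pruning: other columns are skipped
  .groupBy("country")
  .sum("revenue")
  .show()

// A row-based format like Avro suits write-heavy pipelines; reading it needs the spark-avro module (assumption):
// spark.read.format("avro").load("hdfs:///raw/events_avro")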
12. Parquet vs Avro vs ORC (use cases) – explain the row vs column based format
difference as above. With Avro's capacity to manage schema evolution, it's possible to update
components independently, at different times, with low risk of incompatibility. Avro is a row-
based storage format for Hadoop which is widely used as a serialization platform. Avro stores
the data definition (schema) in JSON format, making it easy to read and interpret by any
program. The data itself is stored in binary format, making it compact and efficient. A key feature
of Avro is robust support for data schemas that change over time - schema evolution. Avro
handles schema changes like missing fields, added fields and changed fields; as a result, old
programs can read new data and new programs can read old data.
ORC - It ideally stores data compactly and enables skipping over irrelevant parts without the need
for large, complex, or manually maintained indices. ORC stores collections of rows in one file,
and within each collection the row data is stored in a columnar format.

COMPARISONS BETWEEN DIFFERENT FILE FORMATS

AVRO vs PARQUET

1. AVRO is a row-based storage format whereas PARQUET is a columnar based storage format.

2. PARQUET is much better for analytical querying i.e. reads and querying are much more
efficient than writing.

3. Write operations in AVRO are better than in PARQUET.

4. AVRO is much more mature than PARQUET when it comes to schema evolution. PARQUET only
supports schema append, whereas AVRO supports much richer schema evolution, i.e.
adding or modifying columns.

5. PARQUET is ideal for querying a subset of columns in a multi-column table. AVRO is ideal in
case of ETL operations where we need to query all the columns.

ORC vs PARQUET

1. PARQUET is more capable of storing nested data.

2. ORC is more capable of Predicate Pushdown.

3. ORC supports ACID properties.

4. ORC is more compression efficient.

13. Different transformation and action used


https://supergloo.com/spark-scala/apache-spark-examples-of-transformations/
https://supergloo.com/spark-scala/apache-spark-examples-of-actions/
https://sparkbyexamples.com/apache-spark-rdd/spark-rdd-transformations/
14. Performance optimization done by you in spark project
15. Paired RDD
We can create an RDD containing key-value pairs. Spark provides special operations on paired
RDDs. They are useful building blocks in many programs, as they expose operations to act on
each key in parallel or regroup data across the network.
For example, pair RDDs have a reduceByKey() method that can
aggregate data separately for each key, and a join() method that can merge two
RDDs together by grouping elements with the same key. It is common to extract
fields from an RDD (representing, for instance, an event time, customer ID, or other
identifier) and use those fields as keys in pair RDD operations.
Example - val pairs = lines.map(x => (x.split(" ")(0), x))
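
A minimal sketch of pair-RDD operations (the data is made up):

val sales  = sc.parallelize(Seq(("IN", 100), ("US", 250), ("IN", 50)))
val names  = sc.parallelize(Seq(("IN", "India"), ("US", "United States")))
val totals = sales.reduceByKey(_ + _)   // aggregate values separately for each key
val joined = totals.join(names)         // (key, (total, countryName)) for matching keys
joined.collect()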
16. Reduce by key vs group by key
In reduceByKey, data is combined at each partition first, so that only one value per key per partition is
sent over the network. After this per-partition combining, much less data is shuffled across partitions.

groupByKey first shuffles data from one partition to another so that one partition has all the data for
a given key, and only then is it grouped (the aggregation is performed).

aggregateByKey is similar to reduceByKey except that it takes an initial value.

Note – all of the above three are wide operations.
17. Shuffling in spark
http://spark.apache.org/docs/latest/rdd-programming-guide.html#shuffle-operations

18. How to check job performance in Spark using the Spark UI?

The Spark UI shows the DAG, which gives the stages in which the Spark job is executed.
19. Lineage concept
Lineage is the set of steps containing the transformations on RDDs needed to get the final RDD we are interested
in. It is a logical plan which tells more about dependencies. We can check it using
<rdd>.toDebugString and read it from bottom to top (a small example follows).
A Directed Acyclic Graph is a finite directed graph that performs a sequence of computations on
data. Each node is an RDD partition, and each edge is a transformation on top of the data. Here, the
graph refers to the navigation, whereas directed and acyclic refer to how it is done. Apache Spark's
DAG view allows the user to dive into a stage and expand its details. In the stage
view, the details of all RDDs belonging to that stage are expanded. It gives more details about
how much can be run in parallel and the execution plan. We can visualize the DAG in the Spark UI or the
history server UI.
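
A minimal sketch of inspecting lineage with toDebugString (the input path is hypothetical):

scala> val counts = sc.textFile("hdfs:///logs/access.log")
     |   .flatMap(_.split(" "))
     |   .map(w => (w, 1))
     |   .reduceByKey(_ + _)

scala> println(counts.toDebugString)   // prints the RDD lineage; read it from bottom to top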
20. Use of broadcast variables
Broadcast variables are read-only shared variables that are cached and available on all nodes in
a cluster so that tasks can access and use them. Instead of sending this data along with every
task, Spark distributes broadcast variables to the machines using efficient broadcast algorithms to
reduce communication costs.
Broadcast variables are useful when joining one large table with a small table. For example, there is a column
called country code whose descriptions are stored in another small table, and in the resulting report we want the
full country name. So in the join, we broadcast the small table to all the nodes so that they don't need
to fetch it every time. It saves time and network I/O cost.
https://sparkbyexamples.com/spark/spark-broadcast-variables/
Example
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SparkByExamples.com")
  .master("local")
  .getOrCreate()

val states = Map(("NY","New York"),("CA","California"),("FL","Florida"))
val countries = Map(("USA","United States of America"),("IN","India"))

val broadcastStates = spark.sparkContext.broadcast(states)
val broadcastCountries = spark.sparkContext.broadcast(countries)

To read a broadcast variable, access .value (here the value is a Map, so Map.get is used) like below:

val fullCountry = broadcastCountries.value.get(country).get
val fullState = broadcastStates.value.get(state).get
21. Accumulators - Accumulators are variables that are used for aggregating
information across the executors. For example, this information can pertain to data or API
diagnosis, like how many records are corrupted or how many times a particular library API was
called. The driver can only read from accumulators, while executors can only write to them.
Spark guarantees to update accumulators inside actions only once. So even if a task is restarted
and the lineage is recomputed, the accumulators will be updated only once. To be on the safe
side, always use accumulators inside actions ONLY.
A good use of accumulators is in analyzing large web logs where we want to use Spark to count
occurrences of different HTTP status codes. Here we can define an accumulator for each
HTTP code. While analyzing the web log, each worker updates these accumulators as it
finds occurrences of the corresponding HTTP code. Check the below link for an example with code:
https://supergloo.com/spark-scala/spark-broadcast-accumulator-examples-scala/

For example, you can create a long accumulator on spark-shell using:

scala> val accum = sc.longAccumulator("SumAccumulator")
accum: org.apache.spark.util.LongAccumulator = LongAccumulator(id: 0, name: Some(SumAccumulator), value: 0)

The above statement creates a named accumulator "SumAccumulator". Now, let's see how to add up the
elements from an array to this accumulator.

scala> sc.parallelize(Array(1, 2, 3)).foreach(x => accum.add(x))
-----
-----
scala> accum.value
res2: Long = 6

22. Different compression techniques in spark


23. RDD Vs Dataframe VS Dataset
https://community.simplilearn.com/threads/rdd-vs-dataframe-vs-dataset-in-apache-
spark.33973/
24. Partitioning in spark
In cluster computing, the central challenge is to minimize network traffic. When the data is key-
value oriented, partitioning becomes imperative, because for subsequent transformations on the
RDD there is a fair amount of shuffling of data across the network. If similar keys or ranges of
keys are stored in the same partition, then the shuffling is minimized and the processing
becomes substantially faster. A small sketch follows.
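
A minimal sketch of explicit partitioning so that records with the same key land in the same partition (the data
is made up):

import org.apache.spark.HashPartitioner

val pairs       = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4)))
val partitioned = pairs.partitionBy(new HashPartitioner(4)).persist()  // persist so the partitioning is reused
val counts      = partitioned.reduceByKey(_ + _)                       // no extra shuffle: keys are already co-located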
25. Type Safety
Type safety means that production applications can be checked for errors at compile time, before they are run
(Datasets are type-safe, while DataFrame column references are checked only at runtime). A small example follows.
https://loicdescotte.github.io/posts/spark2-datasets-type-safety/
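
A minimal sketch of Dataset type safety vs a DataFrame, assuming an existing SparkSession named spark
(the case class and data are made up):

import spark.implicits._
case class Person(name: String, age: Int)

val ds = Seq(Person("Ann", 30), Person("Bob", 17)).toDS()
ds.filter(_.age > 18)      // checked at compile time: a typo like _.agee would not compile

val df = ds.toDF()
df.filter($"agee" > 18)    // compiles, but fails only at runtime (AnalysisException: column not found)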
26. How spark or RDD is resilient(fault-tolerance)?
The basic fault-tolerant semantic of Spark are:

 Since Apache Spark RDD is an immutable dataset, each Spark RDD remembers the lineage
of the deterministic operation that was used on fault-tolerant input dataset to create it.
 If due to a worker node failure any partition of an RDD is lost, then that partition can be re-
computed from the original fault-tolerant dataset using the lineage of operations.
 Assuming that all of the RDD transformations are deterministic, the data in the final
transformed RDD will always be the same irrespective of failures in the Spark cluster.
To achieve fault tolerance for all the generated RDDs, the received data is replicated among
multiple Spark executors on worker nodes in the cluster.

27. Difference between executor and driver – The driver is the process
where the main method runs. First it converts the user program into tasks and after that it
schedules the tasks on the executors. While Executors are worker nodes' processes in charge of
running individual tasks in a given Spark job. They are launched at the beginning of a Spark
application and typically run for the entire lifetime of an application. Once they have run the
task they send the results to the driver. They also provide in-memory storage for RDDs that are
cached by user programs through Block Manager.
28. What is catalyst optimizer?
What is Catalyst
Spark SQL was designed with an optimizer called Catalyst based on the functional programming
of Scala. Its two main purposes are: first, to add new optimization techniques to solve some
problems with “big data” and second, to allow developers to expand and customize the
functions of the optimizer.

Catalyst Spark SQL architecture and Catalyst optimizer integration

Catalyst components
The main components of the Catalyst optimizer are as follows:

Trees
The main data type in Catalyst is the tree. Each tree is composed of nodes, and each node has a
nodetype and zero or more children. These objects are immutable and can be manipulated with
functional language.

As an example, let me show you the use of the following nodes:

Merge(Attribute(x), Merge(Literal(1), Literal(2)))


Where:

Literal(value: Int): a constant value


Attribute(name: String): an attribute as input row
Merge(left: TreeNode, right: TreeNode): mix of two expressions
Rules
Trees can be manipulated using rules, which are functions of a tree to another tree. The
transformation method applies the pattern matching function recursively on all nodes of the
tree transforming each pattern to the result. Below there’s an example of a rule applied to a
tree.

tree.transform {
case Merge(Literal(c1), Literal(c2)) => Literal(c1) + Literal(c2)
}
Using Catalyst in Spark SQL
The Catalyst Optimizer in Spark offers rule-based and cost-based optimization. Rule-based
optimization indicates how to execute the query from a set of defined rules. Meanwhile, cost-
based optimization generates multiple execution plans and compares them to choose the
lowest cost one.

Phases
The four phases of the transformation that Catalyst performs are as follows:

1. Analysis
Spark SQL begins with a relation to be computed, either from an abstract syntax tree (AST)
returned by a SQL parser, or from a DataFrame object constructed using the API. In both cases,
the relation may contain unresolved attribute references or relations: for example, in the SQL
query SELECT col FROM sales, the type of col, or even whether it is a valid column name, is not
known until we look up the table sales. An attribute is called unresolved if we do not know its
type or have not matched it to an input table (or an alias). Spark SQL uses Catalyst rules and a
Catalog object that tracks the tables in all data sources to resolve these attributes.

2. Logical Optimization


The logical optimization phase applies standard rule-based optimizations to the logical plan.
(Cost-based optimization is performed by generating multiple plans using rules, and then
computing their costs.) These include constant folding, predicate pushdown, projection pruning,
null propagation, Boolean expression simplification, and other rules. It is possible to easily add
new rules.

3. Physical plan
In the physical plan phase, Spark SQL takes the logical plan and generates one or more physical
plans using the physical operators that match the Spark execution engine. The plan to be
executed is selected using the cost-based model (comparison between model costs).

4. Code generation
Code generation is the final phase of optimizing Spark SQL. To run on each machine, it is
necessary to generate Java bytecode.

[Figure: Phases of the query plan in Spark SQL. Rounded squares represent the Catalyst trees.]

Example
The Catalyst optimizer is enabled by default as of Spark 2.0, and contains optimizations to
manipulate datasets. Below is an example of the plan generated for a query of a Dataset using
the Spark SQL Scala API:

// Business object
case class Persona(id: String, nombre: String, edad: Int)
// The dataset to query
val peopleDataset = Seq(
Persona("001", "Bob", 28),
Persona("002", "Joe", 34)).toDS
// The query to execute
val query = peopleDataset.groupBy("nombre").count().as("total")
// Get Catalyst optimization plan
query.explain(extended = true)
As a result, the detailed plan for the query is obtained:

== Analyzed Logical Plan ==


nombre: string, count: bigint
SubqueryAlias total
+- Aggregate [nombre#4], [nombre#4, count(1) AS count#11L]
+- LocalRelation [id#3, nombre#4, edad#5]
== Optimized Logical Plan ==
Aggregate [nombre#4], [nombre#4, count(1) AS count#11L]
+- LocalRelation [nombre#4]
== Physical Plan ==
*(2) HashAggregate(keys=[nombre#4], functions=[count(1)], output=[nombre#4, count#11L])
+- Exchange hashpartitioning(nombre#4, 200)
+- *(1) HashAggregate(keys=[nombre#4], functions=[partial_count(1)], output=[nombre#4,
count#17L])
+- LocalTableScan [nombre#4]
Conclusions
The Spark SQL Catalyst Optimizer improves developer productivity and the performance of their
written queries. Catalyst automatically transforms relational queries to execute them more
efficiently using techniques such as filtering, indexes and ensuring that data source joins are
performed in the most efficient order. In addition, its design allows the Spark community to
implement and extend the optimizer with new features.
https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
29. Kappa Vs Lambda architecture
https://www.ericsson.com/en/blog/2015/11/data-processing-architectures--lambda-and-kappa
https://docs.microsoft.com/en-us/azure/architecture/data-guide/big-data/
30. Executor vs Executor core – An executor is a JVM process started on your
cluster for an application, and each executor can run more than one task thread, each thread
being attached to one core. These are called executor cores. A small sketch follows.
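
A minimal sketch of sizing executors and executor cores via Spark configuration (the values are arbitrary
examples, not recommendations):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("executor-sizing-example")
  .master("yarn")
  .config("spark.executor.instances", "3")   // 3 executor JVM processes
  .config("spark.executor.cores", "4")       // 4 task threads (cores) per executor
  .config("spark.executor.memory", "4g")     // heap per executor
  .getOrCreate()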
31. Hint Framework Spark SQL(broadcast join, sortmerge join
etc)
https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-hint-framework.html

https://www.waitingforcode.com/apache-spark-sql/writing-apache-spark-sql-custom-logical-
optimization-unsupported-optimization-hints/read

http://blog.madhukaraphatak.com/spark-3-introduction-part-9/

Scala Interview question


1. Why scala for spark ? Not python or R ?
2. Design patterns in scala. Singleton or factory design pattern.
3. What is trait ?(used for multiple inheritance)
4. How scala helps in solving diamond problem ?
5. Trait vs abstract classes ?
6. Var vs val
7. What is case class and its usage?
8. UDF
Hive Interview questions
1. Hive optimization techniques
https://www.qubole.com/blog/hive-best-practices/
2. Bucketing vs partitioning
3. File format for hive – orc, text, json, parquet, avro
4. Serde
Java understands objects, and hence an object is the deserialized state of data. Applying
the same concept, Hive understands "columns", and hence, given a "row" of data, the
task of converting that data into columns is the deserialization part of a Hive SerDe. In
short:

"A select statement creates deserialized data (columns) that is understood by Hive. An
insert statement creates serialized data (files) that can be stored in external storage
like HDFS."

https://medium.com/@gohitvaranasi/how-does-apache-hive-serde-work-behind-the-scenes-a-
theoretical-approach-e67636f08a2a
Happiest Minds
