
PySpark Index

• Introduction to Spark
 Why Spark was developed
 What is Spark/Spark Features
 Spark Components
• Spark Session
 Spark Context
 Spark Session
 spark-submit
• RDD Fundamentals
 What is RDD
 RDD Features
 When to use RDDs
 RDD Properties and Problems
• Create RDD
• RDD Operations
 Transformations and Actions
 Row Level Transformations (map, flatMap, filter)
 Join (join, cogroup, cartesian)
 Total Aggregations (Count, reduce)
 Shuffle and Combiner
 Key Aggregations (groupByKey, reduceByKey, aggregateByKey, countByKey)
 Ranking
 sorting
 set
 sample
 repartition and coalesce
 Reading/writing from rdd to HDFS
• Spark Cluster Architecture
 Detailed Spark Cluster Execution Architecture
 YARN Cluster Manager
 JVM Processes
 Commonly Used Terms in Spark Execution Framework
 Narrow and Wide Transformations
 RDD Lineage
 DAG Scheduler
 Task Scheduler
• RDD Persistence
• Shared Variables
 Broadcast
 Accumulator
• Spark SQL
 Architecture
 Catalyst Optimizer
 Volcano Iterator Model
 Tungsten Execution Engine
 Benchmarks
 Understanding Execution Plan
• DataFrame Fundamentals
 What is a DataFrame
 Sources
 Features
 Organization
• SparkSession
 Introduction to SparkSession
 Spark Object
 Spark Submit
 Commonly used Functions
 version
 range
 createDataFrame
 sql
 table
 sparkContext
 conf
 read (csv, text, orc, parquet, json, avro, hive, jdbc)
 udf
 newSession
 stop
 catalog
• DataTypes
• DataFrame Rows
• DataFrame Columns
 Column Functions
• DataFrame ETL
 DataFrame APIs
 selection
 filter
 sort
 set
 join
 aggregate
 groupBy
 window
 sample
 Other Aggregate
 DataFrame built-in functions
 new Column
 encryption
 String

 Regexp
 Date
 null
 Collection
 na
 math and statistics
 explode and flatten
 Formatting
 json
 Partition
 What is partition
 Repartition
 Coalesce
 Repartition vs Coalesce
 Extraction
 csv
 text
 Parquet
 Orc
 Json
 avro
 Hive
 jdbc
• Optimization and Management
 Join Strategies
 Driver Configurations
 Executor Configurations
 Parallelism Configurations
 Memory Management

Introduction to Spark

Why Spark was Developed?

• The biggest problem in any big data project is achieving scale. RDBMS databases like Oracle, SQL Server
etc. are the oldest approaches to storing and processing data, but as data grows they are unable to scale
accordingly.
• The need for new approaches led to the creation of different systems like GFS and MapReduce.
• Still, the problem was not fully resolved:
For example – let's look at the shortcomings of MapReduce.
1. Hard to manage and administer due to operational complexity.
2. MapReduce forces data processing into map and reduce stages. Other workflows like join, filter, union etc.
are missing.
3. Stateless execution – data is read from and written to disk before and after each map and reduce stage. This repeated
disk I/O took its toll: large MR jobs could run for hours, or even days.
4. Natively Java – support for other languages was missing.
5. Only supports batch processing – not good for streaming, machine learning or interactive SQL-like queries.

[Diagram: MapReduce vs Spark. Each MapReduce stage reads its input from HDFS and writes its output back to HDFS, so a chain of MR jobs alternates HDFS reads and writes. Spark keeps intermediate results in RAM across operations and only reads from and writes to HDFS at the start and end of the pipeline.]
• To handle these problems, engineers developed different systems at different times (Hive, Storm, Impala, Giraph,
Mahout etc). Each of these systems has its own APIs and cluster configuration, which adds further operational
complexity on top of Hadoop MapReduce.
• So the idea was to develop a single unified system with the capability to solve all these problems.
• In 2009, researchers at UC Berkeley who had previously worked on Hadoop MapReduce took on this challenge
with a project they called Spark.
• Ideas from Hadoop MapReduce were borrowed but enhanced in the Spark project.
• Spark was developed to be highly fault-tolerant and massively parallel, with in-memory processing support and easy APIs in
multiple languages, all in a unified manner.
• By 2013, Spark had gained widespread use and popularity. Some of its creators donated the Spark project to the
ASF and formed a company called Databricks.
• Databricks and the community of open source developers worked to release Apache Spark 1.0 in May 2014.

0.5 – June 2012
1.0 – May 2014
1.5 – Sep 2015
2.0 – July 2016
2.4 – Nov 2018
3.0 – June 2020
3.1 – Mar 2021
What is Spark?

• Unified: Spark is a unified analytics engine designed for large-scale distributed data processing, whether in
on-premises data centers or in the cloud.
• Speed: Spark provides in-memory storage for intermediate computations and limits disk I/O, which gives
it a huge performance boost; it can run workloads up to 100x faster.
• Ease of Use (Polyglot): Write applications quickly in Java, Scala, Python, R, and SQL. Scala is the
functional programming language Spark is written in. Spark achieves simplicity by providing one fundamental
abstraction, the RDD, upon which all other higher-level structured data abstractions, such
as DataFrames and Datasets, are constructed.
• Generality: Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine
learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same
application.
[Diagram: the Spark stack – Spark SQL, Spark Streaming, MLlib and GraphX built on top of Apache Spark.]
• Extensibility: Decouples compute and storage. Spark focuses on being a fast, parallel computation engine rather than a storage system.
Storage can be extended to read data from other sources.

• Runs Everywhere:
 YARN
 Mesos
 Kubernetes
 EC2 (AWS)
 Standalone
• Open Source
• Scalable
• Distributed
• Powerful Caching
• Real Time and Batch Processing

Spark Components

Spark Modules
• Core – Transformations and Actions: APIs such as map, reduce, join, filter etc. They typically work on RDDs.
• Spark SQL and DataFrames – APIs and a Spark SQL interface for batch processing on top of DataFrames or
Datasets (Datasets are not available for Python).
• Structured Streaming – APIs and a Spark SQL interface for stream data processing on top of DataFrames.
• Machine Learning Pipelines (MLlib) – machine learning data pipelines for applying machine learning algorithms on top
of DataFrames.
• GraphX – for graphs and graph-parallel computation.

[Diagram: Spark SQL, Spark Streaming, MLlib and GraphX all sit on top of Apache Spark Core.]
Ch 2 (Spark Session)

Initializing Spark
 The first thing a Spark program does is create a SparkContext object, which tells Spark how to access a cluster.
 When we run any Spark application, the driver program starts; it contains the main() function, and the
SparkContext gets initiated there.

 Prior to 2.0, to create a SparkContext we first needed to build a SparkConf object that contains information about
the application. (This is the old way of doing it.)

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName(appName).setMaster(master)
sc = SparkContext(conf=conf)

Initializing Spark
 In Spark 2.0 and onwards, we create a spark object using the SparkSession class. Then, using this object, we can
access the SparkContext.
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.master('yarn') \
.appName("Python Spark SQL basic example") \
.getOrCreate()
 spark.sparkContext

Introduction to Spark Session

SparkSession : The Entry point Spark 2.0 Onwards
 In Spark 2.0, SparkSession(spark) is the new entry point to work with RDD, DataFrame and all other functionalities.
 Prior to 2.0, SparkContext(sc) used to be an entry point.
 SparkContext: Entry point to work with RDD, Accumulators and broadcast variables (< Spark 2.0).
SQLContext: Used for initializing the functionalities of Spark SQL (< spark 2.0).
HiveContext: Super set of SQLContext (< spark 2.0).
 Almost all the APIs available in SparkContext, SQLContext, HiveContext are now available in SparkSession.
 By Default, Spark Shell provides a “spark” object which is an instance of SparkSession class (Spark 2.x).
“sc” object is created for spark 1.x.
 While working on a real project, we have to create these objects manually.

Prior to 2.0: SparkContext, SQLContext, HiveContext
2.0 and onwards: SparkSession (the old classes still exist but are not recommended for use)

Spark Session : The Entry Point Spark 2.0 Onwards

[Diagram: just as a House class can have instances such as "My House" and "My Friend's House", the SparkSession class has an instance: the spark object.]
Spark Session : Spark Object & spark-submit

Spark Session : Create
from pyspark.sql import SparkSession

spark = SparkSession \
.builder \
.master('yarn') \
.appName("Python Spark SQL basic example") \
.getOrCreate()

Master can be yarn, mesos, kubernetes or local[x], x > 0

How to Run :
1. Organize the folders and create a python file under bin folder.
2. Write above codes in the .py file.
3. Execute the file using spark-submit command.

spark2-submit \
/devl/example1/src/main/python/bin/basic.py

Spark Session : spark-submit
Spark-submit is a utility to run a pyspark application job by specifying options and configurations.

spark-submit \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
--driver-memory <value>g \
--executor-memory <value>g \
--executor-cores <number of cores> \
--jars <comma separated dependencies> \
--packages <package name> \
--py-files \
<application> <application args>

Spark Session : spark-submit
--master : Cluster Manager (yarn, mesos, kubernetes, local, local[k])
local – run locally with one worker thread.
local[k] – run locally with k worker threads; set k to the number of cores you have locally.

--deploy-mode: Either cluster or client

Cluster: The driver runs on one of the worker nodes, and you can see it as the driver on the Spark UI of your
application. We can't see the logs on the terminal; logs are available only in the UI or through the YARN CLI:
yarn logs -applicationId application_1622930712080_16253
Mainly used for production jobs.

Client: The driver runs locally, on the machine where we submit the application.
We can see the logs on the terminal.
Mainly used for interactive or debugging purposes.
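A minimal sketch of the two submission styles side by side; the application path reuses the example used elsewhere in this material and the application id is a placeholder:

# Client mode: driver runs where you submit; logs stream to this terminal
spark-submit --master yarn --deploy-mode client /devl/example1/src/main/python/bin/basic.py

# Cluster mode: driver runs on a worker node; fetch logs afterwards from YARN
spark-submit --master yarn --deploy-mode cluster /devl/example1/src/main/python/bin/basic.py
yarn logs -applicationId <application_id>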

Spark Runtime Components in Client deploy mode:

[Diagram: the Spark application, the driver and the SparkContext live in the client JVM heap; each executor JVM on the worker nodes runs its tasks (T) in its own JVM heap.]

Spark Runtime Components in Cluster deploy mode:

[Diagram: the Spark application is submitted from the client JVM, but the driver and the SparkContext run in a driver JVM inside the cluster; executor JVMs hold occupied and unoccupied task slots (T) in their heaps.]

Spark Session : spark-submit
--conf: We can provide runtime configurations, shuffle parameters and application configurations using --conf.
Ex: --conf spark.sql.shuffle.partitions=300
This configures the number of partitions that are used when shuffling data for joins or aggregations.
https://spark.apache.org/docs/latest/sql-performance-tuning.html

--conf spark.yarn.appMasterEnv.HDFS_PATH="practice/retail_db/orders"
We can set environment variables like this when Spark is running on YARN.
https://spark.apache.org/docs/latest/running-on-yarn.html#configuration

Default Spark Properties File (spark-defaults.conf):

• There is a default Spark properties file at $SPARK_HOME/conf/spark-defaults.conf.
• This can be overridden using spark-submit's "--properties-file" command-line option.
For Ex - spark-submit --properties-file [FILE]
• Individual properties of this file can be overridden by spark-submit's --conf options.
For Ex - spark.submit.deployMode=client is in the default file.
We can override this using:
spark-submit --deploy-mode cluster
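A minimal sketch of how these pieces fit together; the file contents and paths below are illustrative, not taken from a real cluster:

# $SPARK_HOME/conf/spark-defaults.conf (illustrative contents)
spark.master                      yarn
spark.submit.deployMode           client
spark.sql.shuffle.partitions      200

# Point spark-submit at a different properties file altogether
spark-submit --properties-file /devl/conf/my-spark.conf basic.py

# Override a single property from that file on the command line
spark-submit --properties-file /devl/conf/my-spark.conf --deploy-mode cluster basic.py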

Spark Session : spark-submit
--driver-memory : Amount of memory to allocate for a driver (Default: 1024M).

--executor-memory : Amount of memory to use for the executor process.

--executor-cores : Number of CPU cores to use for the executor process.

Spark Session : spark-submit
--jars: Dependency .jar files.
Ex : --jars /devl/src/main/python/lib/ojdbc7.jar,file2.jar,file3.jar

--packages: Pass the dependency packages.

Ex : --packages org.apache.spark:spark-avro_2.11:2.4.4

--py-files: Use --py-files to add .py and .zip files. Files specified with --py-files are uploaded to the cluster before
the application runs.
Ex - --py-files file1.py,file2.py,file3.zip

Spark Session : spark-submit
spark-submit \
--master yarn \
--deploy-mode client \
--conf spark.sql.shuffle.partitions=300 \
--conf spark.yarn.appMasterEnv.HDFS_PATH="practice/retail_db/orders" \
--driver-memory 1024M \
--executor-memory 1024M \
--num-executors 2 \
--jars /devl/src/main/python/lib/ojdbc7.jar,file2.jar,file3.jar \
--packages org.apache.spark:spark-avro_2.11:2.4.4 \
--py-files file1.py,file2.py,file3.zip \
/dev/example1/src/main/python/bin/basic.py arg1 arg2 arg3

What is RDD ?

Spark Data Structures
There are 2 types of data structures in PySpark – RDDs and DataFrames.
• RDD is the low-level data structure which spark uses to distribute the data between
tasks while data is being processed.
• DataFrame is created on top of RDDs. We can seamlessly move between RDD and
DataFrame.

RDD – Resilient Distributed Datasets
RDD is one of the fundamental abstractions of Spark on which it was created initially. Almost everything in
Spark is built on top of RDDs.

Resilient – “able to withstand or recover quickly from difficult conditions”.


• RDDs are immutable and can not be modified once created.
• Fault Tolerance – RDDs track data lineage information to recover lost data quickly and
automatically on failure at any point of execution cycle.
Distributed – Divided into smaller chunks called Partitions and distributed across multiple nodes of the
cluster.
Datasets – Holds data.

Note: a 200MB file would be divided into 2 partitions – 128MB and 72MB.

RDD Main Features
1. Resilience
2. Distributed
3. Lazy Evaluation
4. Immutability
5. In-memory Computation
6. Structured or semi-structured Data

RDD Main Features
1. Resilience
• Fault Tolerance – RDDs track data lineage information to recover lost data quickly and automatically on failure at any point of execution cycle.
• Spark keeps a record of lineage while tracking the transformations that have been performed. If any part of RDD is lost, then spark will utilize this
lineage record to quickly and efficiently re-compute the RDD using the identical operations.

[Diagram: an RDD is created (e.g. with sc.parallelize), transformations produce new RDDs, and actions produce results; the lineage of these steps is what Spark replays to recover lost partitions.]

2. Distributed
• RDD will be divided into partitions while data being processed. Each partition will be processed by one task.
• If spark is running in YARN cluster, the number of RDD partitions is typically based on HDFS block size which is 128MB by default. We can control the
number of minimum partitions by using additional arguments while invoking APIs such as textFile.

[Diagram: an RDD divided into partitions 1-3; each partition is processed by one task running inside an executor on a worker node.]

3. Lazy Evaluation:
• Each Transformation is a Lazy Operation. Evaluation is not started until an action is triggered.

[Diagram: RDD 1 –filter→ RDD 2 –groupBy→ RDD 3 –count→ result. Only at the count action does execution start.]

Example – Find 10 sample records containing the string "Robert" in a 1TB file.

Without Lazy Evaluation:
Step 1: Load the 1TB file.
Step 2: Perform a full scan to find all records containing "Robert".
Step 3: Retrieve 10 records.

With Lazy Evaluation:
Step 1: Wait for an action, and only then load the 1TB file.
Step 2: Filter out the first 10 records containing "Robert".
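A small PySpark sketch of the same idea (the HDFS path here is illustrative):

# textFile and filter are lazy transformations - nothing is read yet
lines = sc.textFile('practice/big_data/huge_file')
roberts = lines.filter(lambda line: 'Robert' in line)

# Only this action triggers execution, and Spark can stop
# scanning once 10 matching records have been collected
print(roberts.take(10))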

4. Immutability:
RDDs are considered to be “Immutable Storage”.

[Diagram: data in stage 1 is stored immutably; transformation 1 and transformation 2 produce the data in stage 2. Only the transformation recipe is stored (e.g. the logic to keep even numbers, col % 2 == 0); the intermediate data itself is not stored.]

5. In-memory Computation:
• The data is kept in RAM instead of on disk drives and is processed in parallel.
• This has become popular as the cost of memory has come down.
• The two main pillars of in-memory computation are RAM storage and parallel distributed processing.

[Diagram: data is read from disk once, processed across operations in RAM, and written back to disk at the end.]

6. Structured or Semi-structured Data:
Along with structured data, RDD can also be used for semi structured data like text, media streaming data etc.

When to Use RDDs?
• You want low-level API access and control of the dataset
• You are dealing with unstructured data (text or media streams)
• You don't care about the schema or structure of the data
• You don't care about optimization and performance

Problems in RDD ?

RDD Properties
RDD is defined by five main properties:
1. List of Parent RDDs (Dependencies)
2. An array of partitions that a dataset is divided into.
3. A Compute function to do a computation on partitions.
4. Optional Partitioner that defines how keys are hashed and the pairs partitioned (for key-value RDDs).
5. Optional Preferred locations – Information about the locations of the split block for an HDFS file (if on YARN).
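Some of these properties can be inspected directly from PySpark; a quick sketch using the orders dataset referenced throughout this material:

rdd = sc.textFile('practice/retail_db/orders', 4)

rdd.getNumPartitions()      # property 2: the partitions the dataset is divided into
print(rdd.toDebugString())  # property 1: the lineage of parent RDDs

pairs = rdd.map(lambda x: (x.split(',')[3], 1)).reduceByKey(lambda a, b: a + b)
pairs.partitioner           # property 4: the partitioner used for the keyed RDD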

Main Problems in RDD ?
Problem#1 Cannot Optimize !!!
Compute function: Partition => Iterator[T]

Opaque Computation
&
Opaque Data
The compute function is used to do computations on partitions.
 It maps each partition to an iterator over the data stored within the RDD.
 Given a particular partition, Spark creates an iterator and executes that piece of code on the partition, distributed
across the cluster, then merges the results back – that is the fundamental programming model of an RDD.
 The RDD does not know what the function is doing or anything about the data. It just serializes that piece of code, sends it over to the executors
and lets it execute.
For example, if you are trying to join, filter or project, Spark cannot automatically optimize it.
 The data stored in the RDD is also opaque to Spark, so Spark cannot prune data that the query does not
need.

Main Problems in RDD ?
Problem#2 (Pretty verbose to work with)

emp.map(lambda x: (x.deptid, (x.age, 1))) \
   .reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1])) \
   .map(lambda x: (x[0], x[1][0] / x[1][1])) \
   .collect()

SELECT deptid, AVG(age) FROM emp GROUP BY deptid
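For contrast, a hedged sketch of the same average-age-per-department computation with the higher-level DataFrame API, assuming emp has already been loaded as a DataFrame named empDF with deptid and age columns:

from pyspark.sql import functions as F

empDF.groupBy('deptid').agg(F.avg('age').alias('avg_age')).show()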

Create RDD

Ways to create RDD :
 External Data (HDFS, local etc)
 Local Data
 Python List/Parallelized Collections
 Other RDDs
 Existing DataFrame

Create RDD using textFile:
Using Spark Context Object sc.
ordRDD = sc.textFile('practice/retail_db/orders',8)
• The second argument is minPartitions. By default the number of partitions equals the number of HDFS blocks of the file; we can request a different number.
• If you pass a number smaller than the number of blocks, it is ignored and one task per block is executed.
• The default minimum number of partitions while loading a file is 2.

Create RDD using wholeTextFiles:

Reads each entire file as a single record into the RDD.
ordRDD = sc.wholeTextFiles('practice/retail_db/orders',8)

Create Empty RDD:


Create empty RDD with no partition.
rdd = spark.sparkContext.emptyRDD()

Python List:
lst = open('/staging/test/sample.txt').read().splitlines()
rdd = sc.parallelize(lst)

Other RDDs
rdd1 = rdd.map(lambda x : x[1])

Existing DataFrame
df = spark.createDataFrame(data=[('robert', 35), ('Mike', 45)], schema=['name', 'age'])
new_rdd = df.rdd

Create RDD from Local File:


rdd = sc.textFile("file://<local_path>")
Ex - Local Path: /staging/test

RDD Operations

[Diagram: Create an RDD → RDDs operate in parallel → Transformations → Actions → Results.]

https://spark.apache.org/docs/latest/rdd-programming-guide.html

Transformations
Row Level: map, flatMap, filter, mapValues
Joining: join, cogroup, cartesian
Key Agg: reduceByKey, aggregateByKey, groupByKey, countByKey
Sorting: sortByKey
Set: union, intersection, distinct, subtract
Sampling: sample
Pipe: pipe
Partitions: coalesce, repartition, repartitionAndSortWithinPartitions

Actions
Display: take, takeSample, takeOrdered, first, collect
Total Agg: reduce, count
File Extraction: saveAsTextFile, saveAsSequenceFile, saveAsObjectFile
foreach: foreach
Row Level Transformations (map, flatMap, filter)

map : map(f, preservesPartitioning=False)
 Perform row level transformations where one record transforms into another record.
 Number of records in input is equal to output.
 Return a new RDD by applying a function to each element of this RDD.
 When we apply a map function to an RDD, a pipelineRDD is formed, a subclass of RDD. It has all the APIs defined in
the RDD.

ord = sc.textFile('practice/retail_db/orders')
ordItems = sc.textFile('practice/retail_db/order_items')

### PS: Project all the Order_ids.
ordMap = ord.map(lambda x : x.split(',')[0])
for i in ordMap.take(5) : print(i)

### PS: Project all the Orders and their status.

ord.map(lambda x : (x.split(',')[0], x.split(',')[3])).take(5)

### PS: Combine Order id and status with '#'

ord.map(lambda x : x.split(',')[0] + '#' + x.split(',')[3]).take(5)

### PS: Convert the Order date into YYYY/MM/DD Format.

ord.map(lambda x : x.split(',')[1].split(' ')[0].replace('-','/')).first()

### PS: Create key-value pairs with key as Order id and value as the whole record.
ordMap = ord.map(lambda x : (x.split(',')[0], x))

### PS: Project all the Order_item_ids and their subtotal.

ordItemsMap = ordItems.map(lambda x : (x.split(',')[0], x.split(',')[4]))

### PS: Apply a user defined function to convert status into lowercase.
def lowerCase(str):
    return str.lower()

ord.map(lambda x : lowerCase(x.split(',')[3])).first()
flatMap : flatMap(f, preservesPartitioning=False)
 Return a new RDD by first applying a function to all elements of this RDD, and then flattening the results.
 Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a
single item). Number of records in input is less than or equal to output.

### PS : Word count in orders file.


ord = sc.textFile('practice/retail_db/orders')
wordCount = ord.flatMap(lambda x : x.split(',')).map(lambda w : (w,1)).reduceByKey(lambda x,y : x+y)

filter: filter(f)
Return a new dataset formed by selecting those elements of the source on which func returns true.

PS: Print all the orders which are CLOSED or COMPLETE and ordered in the year 2013.
ord = sc.textFile('practice/retail_db/orders')
filteredOrd = ord.filter(lambda x : (x.split(',')[3] in ("CLOSED","COMPLETE")) and (x.split(',')[1].split('-')[0] == '2013'))

mapValues: mapValues(f)
• Only applicable to pair RDDs, i.e. RDD[(K, V)].
• Does not change the key. Applies the function 'f' to all values of each key.
• The difference with map is that map operates on the entire record.

Ex - rdd = sc.parallelize([("a", [1, 2, 3]), ("b", [3, 4, 5]), ("a", [1, 2, 3, 4, 5])])

def f(x): return len(x)
rdd.mapValues(f).collect()

Join Transformations

Joining datasets
 Join
 leftOuterJoin
 rightOuterJoin
 fullOuterJoin
Join: join(other, numPartitions=None)
When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key.

ord = sc.textFile('practice/retail_db/orders')
ordItems = sc.textFile('practice/retail_db/order_items')

PS: Find the subtotal for each ORDER_CUSTOMER_ID.

ordMap = ord.map(lambda x : (x.split(',')[0], x.split(',')[2]))
ordItemsMap = ordItems.map(lambda x : (x.split(',')[1], x.split(',')[4]))
findSubtotalForCust = ordMap.join(ordItemsMap)
findSubtotalForCust.map(lambda x : x[1][0] + ',' + x[1][1]).first()
findSubtotalForCust.map(lambda x : str(x[1][0]) + ',' + str(x[1][1])).first()

cogroup:
cogroup(other, numPartitions=None):
When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (Iterable<V>, Iterable<W>)) tuples.

Ex –
x = sc.parallelize([("a", 1), ("b", 4)])
y = sc.parallelize([("a", 2)])
xy = x.cogroup(y)
for i, j in xy.take(5) : print(i + ' ' + str([list(v) for v in j]))

cartesian:
cartesian(other) : Perform a cross join.
Ex -
rdd = sc.parallelize((1,3,2))
sorted(rdd.cartesian(rdd).collect())

Aggregation Operations (Total)

Aggregation Operations:
There are several APIs to perform aggregation Operations.
• Total aggregations – reduce, count (Actions)
• By Key aggregations – reduceByKey, aggregateByKey, groupByKey, countByKey (Transformations)

Total aggregations:
reduce(f): Reduces the elements of this RDD using the specified commutative and associative binary operator. Currently
reduces partitions locally.
count(): Return the number of elements in this RDD.
Ex –
ord = sc.textFile('practice/retail_db/orders')
ordItems = sc.textFile('practice/retail_db/order_items')

### Count the number of orders which are CLOSED.

ord.filter(lambda x : x.split(',')[3] == 'CLOSED').count()

### Find the total revenue for Order IDs 1-10.

ordItems.filter(lambda x : int(x.split(',')[1]) < 11).map(lambda x : float(x.split(',')[4])).reduce(lambda x,y : x+y)
from operator import add
ordItems.filter(lambda x : int(x.split(',')[1]) < 11).map(lambda x : float(x.split(',')[4])).reduce(add)

### For a given order id 10, find the maximum subtotal among its order items.
ordItems.filter(lambda x : int(x.split(',')[1]) == 10).map(lambda x : x.split(',')[4]).reduce(lambda a,b : a if float(a) > float(b) else b)
ordItems.filter(lambda x : int(x.split(',')[1]) == 10).map(lambda x : float(x.split(',')[4])).reduce(max)

Shuffling and Combiner

Shuffling and Combiner ?
Shuffle:
• Shuffling is the process of redistributing data across partitions or even nodes.
• A shuffle operation creates a new stage.
• Based on data size we may reduce or increase the number of partitions, using the configuration spark.sql.shuffle.partitions or in code using repartition
and coalesce.
• It is a costly operation as it involves disk I/O, network I/O and data serialization/deserialization.
• Spark shuffling is triggered by transformation operations like groupByKey(), reduceByKey(), join(), union(), cogroup(), groupBy() etc.
• distinct creates a shuffle.
• count and countByKey do not create any shuffle.
• Avoid shuffling wherever possible. If shuffling is absolutely necessary, use a combiner.
• Of the 3 main key aggregation APIs, groupByKey does not use a combiner and so should be avoided. reduceByKey and
aggregateByKey use the combiner and should be preferred.

Combiner:
• It computes the intermediate values for each partition, reducing the amount of data that has to be shuffled.

Shuffling without Combiner

[Diagram: word-count example without a combiner. Every individual (fruit, 1) record from partition 1 and partition 2 is shuffled across the network, sorted/partitioned by key, and only then reduced to the final counts Apple,5 / Banana,3 / Grapes,3 / Orange,5.]

What Happens in the Shuffle Stage?
In the shuffle stage,
1. Spark runs map tasks on all partitions, which group all values for a single key.
2. The results of the map tasks are kept in memory.
3. When results do not fit into memory, Spark stores the data on disk, so there is disk I/O and data serialization
involved.
4. Shuffle therefore generates a large number of intermediate files on disk. These files are preserved until the
corresponding RDDs are no longer used and are then garbage collected.
We can also specify a temporary storage directory via spark.local.dir while configuring the SparkContext (a minimal sketch follows below).
5. Finally it runs the reduce tasks on each partition based on the key.
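A minimal sketch of pointing that temporary storage at a specific directory (the path is illustrative); spark.local.dir must be set before the SparkContext is created:

from pyspark import SparkConf, SparkContext

conf = SparkConf() \
    .setAppName('shuffle-scratch-demo') \
    .set('spark.local.dir', '/data/tmp/spark')   # illustrative scratch directory for shuffle spill files
sc = SparkContext(conf=conf)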

Shuffling with Combiner

[Diagram: the same word count with a combiner. Each partition first combines its own records (e.g. partition 1 produces Apple,3 / Orange,2 / Grapes,1 / Banana,1), so only these partial counts are shuffled and sorted before the reducer produces the same final result: Apple,5 / Banana,3 / Grapes,3 / Orange,5.]

If you have performance issues, look into the code to see whether a shuffle is happening.

toDebugString returns "A description of this RDD and its recursive dependencies for debugging." It
includes possible shuffles.

Use explain() for DataFrames.
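A quick sketch of checking an RDD lineage for shuffles with toDebugString:

ord = sc.textFile('practice/retail_db/orders')
pairs = ord.map(lambda x : (x.split(',')[3], 1))
statusCounts = pairs.reduceByKey(lambda a, b : a + b)

# Indentation changes in the output mark stage boundaries, i.e. shuffles
print(statusCounts.toDebugString())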

Aggregation Operations (Key)

Key Aggregations:
groupByKey():
aggregateByKey():
reduceByKey():
countByKey():

groupByKey(numPartitions=None, partitionFunc)
• Can be used for aggregations but should be given low priority as it does not use a combiner.
• When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs.
• If grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey or aggregateByKey will
yield much better performance.
• The number of reduce tasks is configurable through an optional argument – numPartitions.

ORDER_ITEMS columns: order_item_id, order_item_order_id, order_item_prod_id, order_item_quantity, order_item_revenue, order_item_price

Ex – For each product, find its aggregated revenue.
ordItems = sc.textFile('practice/retail_db/order_items')
ordGrp = ordItems.map(lambda x : (int(x.split(',')[2]), float(x.split(',')[4]))).groupByKey()

result = ordGrp.mapValues(sum).collect()
OR
result = ordGrp.map(lambda x : (x[0], sum(x[1]))).collect()

reduceByKey(func, numPartitions=None, partitionFunc)
• When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce
function func, which must be of type (V,V) => V.
• Like in groupByKey, the number of reduce tasks is configurable through an optional argument - numPartitions.
• It uses Combiner. Associative reduction.

Ex-
from operator import add
rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
sorted(rdd.reduceByKey(add).collect())

### Find the total revenue sold for each order.

ordItems = sc.textFile('practice/retail_db/order_items')
ordItems.map(lambda x : (int(x.split(',')[1]), float(x.split(',')[4]))).reduceByKey(lambda x,y : x+y).collect()
OR
ordItems.map(lambda x : (int(x.split(',')[1]), float(x.split(',')[4]))).reduceByKey(add).collect()

### Find the maximum revenue for each order.

ordItems.map(lambda x : (int(x.split(',')[1]), x)).reduceByKey(lambda a,b : a if (float(a.split(',')[4]) > float(b.split(',')[4])) else b).collect()

Associative Reduction:

[Diagram: reduceByKey first combines values within each partition (partition 1: (2, 199.99), (2, 250.00), (4, 49.98), (4, 299.95) → (2, 449.99), (4, 349.93); partition 2: (2, 129.99), (4, 150.0), (4, 199.92) → (2, 129.99), (4, 349.92)) and then merges the partial results across partitions: (2, 579.98), (4, 699.85).]

aggregateByKey(zeroValue, seqOp, combOp, numPartitions=None):
• First aggregates elements within each partition and then aggregates the results of all partitions to get the final result; the result can be of a
different type than the values in your RDD.
• 3 mandatory arguments:
 Zero Value: Initial value used to initialize the accumulator. Use 0 for integers and an empty/NULL value for collections.
 SeqOp: Function used to accumulate the results of each partition, storing the running accumulated result in U. (U, T) => U.
 CombOp: Function used to combine the results U of all partitions.

[Diagram: aggregateByKey walkthrough. Partition 1 holds (2, ("Joseph", 200)), (2, ("Jimmy", 250)), (4, ("Joseph", 150)), (4, ("Ram", 200)), (7, ("Joseph", 300)); partition 2 holds (2, ("Tina", 130)), (4, ("Jimmy", 50)), (4, ("Tina", 300)), (7, ("Tina", 200)), (7, ("Jimmy", 80)). The seqOp computes max revenue and count per partition – (2, 250, 2), (4, 200, 2), (7, 300, 1) and (2, 130, 1), (4, 300, 2), (7, 200, 2) – and the combOp merges these into (2, 250, 3), (4, 300, 4), (7, 300, 3).]

Ex – 1
Find the maximum revenue for each Order.

ordItems = sc.parallelize([
    (2,"Joseph",200), (2,"Jimmy",250), (2,"Tina",130), (4,"Jimmy",50), (4,"Tina",300),
    (4,"Joseph",150), (4,"Ram",200), (7,"Tina",200), (7,"Joseph",300), (7,"Jimmy",80)], 2)

# Create a Paired RDD
ordPair = ordItems.map(lambda x : (x[0], (x[1], x[2])))

# Initialize Accumulator
# Zero Value: 0, as we are finding the maximum revenue
zero_val = 0

# Sequence operation : Finding the maximum revenue within each partition
def seq_op(accumulator, element):
    if accumulator > element[1]:
        return accumulator
    else:
        return element[1]

# Combiner operation : Finding the maximum revenue across all partitions
def comb_op(accumulator1, accumulator2):
    if accumulator1 > accumulator2:
        return accumulator1
    else:
        return accumulator2

aggr_ordItems = ordPair.aggregateByKey(zero_val, seq_op, comb_op)

for i in aggr_ordItems.collect(): print(i)
Ex – 2
Find the maximum revenue for each Order. Print the customer name as well.

ordItems = sc.parallelize([
    (2,"Joseph",200), (2,"Jimmy",250), (2,"Tina",130), (4,"Jimmy",50), (4,"Tina",300),
    (4,"Joseph",150), (4,"Ram",200), (7,"Tina",200), (7,"Joseph",300), (7,"Jimmy",80)], 2)

# Create a Paired RDD
ordPair = ordItems.map(lambda x : (x[0], (x[1], x[2])))

# Initialize Accumulator
# Zero Value: ('', 0), the (name, revenue) pair we keep while finding the maximum revenue
zero_val = ('', 0)

# Sequence operation : Finding the maximum revenue within each partition
def seq_op(accumulator, element):
    if accumulator[1] > element[1]:
        return accumulator
    else:
        return element

# Combiner operation : Finding the maximum revenue across all partitions
def comb_op(accumulator1, accumulator2):
    if accumulator1[1] > accumulator2[1]:
        return accumulator1
    else:
        return accumulator2

aggr_ordItems = ordPair.aggregateByKey(zero_val, seq_op, comb_op)

for i in aggr_ordItems.collect(): print(i)
Ex – 3
Sum up all revenue and the number of records for each order.

ordItems = sc.parallelize([
    (2,"Joseph",200), (2,"Jimmy",250), (2,"Tina",130), (4,"Jimmy",50), (4,"Tina",300),
    (4,"Joseph",150), (4,"Ram",200), (7,"Tina",200), (7,"Joseph",300), (7,"Jimmy",80)], 2)

# Create a Paired RDD
ordPair = ordItems.map(lambda x : (x[0], (x[1], x[2])))

# Initialize Accumulator
# Zero Value: (0, 0), i.e. (running revenue total, running record count)
zero_val = (0, 0)

# Sequence operation : Sum up revenue and number of records within each partition
def seq_op(accumulator, element):
    return (accumulator[0] + element[1], accumulator[1] + 1)

# Combiner operation : Sum up revenue and number of records across all partitions
def comb_op(accumulator1, accumulator2):
    return (accumulator1[0] + accumulator2[0], accumulator1[1] + accumulator2[1])

aggr_ordItems = ordPair.aggregateByKey(zero_val, seq_op, comb_op)

for i in aggr_ordItems.collect(): print(i)

[Diagram: accumulator walkthrough for key 2, starting from the zero value (0, 0). Within partition 1 the seq_op folds in ("Joseph", 200) and ("Jimmy", 250) (count 0+1 = 1, then 1+1 = 2); within partition 2 it folds in ("Tina", 130) (count 0+1 = 1). The comb_op then merges the two per-partition accumulators into the final result for the key.
def seq_op(accu, element): return (accu[0] + element[1], accu[1] + 1)
def comb_op(accu1, accu2): return (accu1[0] + accu2[0], accu1[1] + accu2[1])]

Difference Between reduceByKey Vs aggregateByKey:
• aggregateByKey() is logically the same as reduceByKey(), but it lets you return the result in a different type.
• In other words, it lets you have an input of type x and an aggregated result of type y.
Ex - For example (1,2),(1,4) as input and (1,"six") as output. It also takes a zero-value that is applied at the beginning of
each key.
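A small sketch of that idea, where the input values are integers but the aggregated result per key is a string (output shown is indicative; concatenation order depends on partitioning):

rdd = sc.parallelize([(1, 2), (1, 4), (2, 3)])

zero = ''                              # accumulator is a string, not an int
seq_op = lambda acc, v: acc + str(v)   # fold an int value into the partition-local string
comb_op = lambda a, b: a + b           # concatenate the per-partition strings

rdd.aggregateByKey(zero, seq_op, comb_op).collect()
# e.g. [(1, '24'), (2, '3')] - the result type differs from the value type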

countByKey():
• Only available on RDDs of type (K, V). Returns (K, Int) pairs with the count of each key.
• Returns a collections dictionary.
• No shuffle.

ORDERS columns: Order Id, Order Date, Order Customer Id, Order Status

Ex - Count the number of orders per status.
ord = sc.textFile('practice/retail_db/orders')
ordPair = ord.map(lambda x : (x.split(',')[3], 1))
countByStatus = ordPair.countByKey()
for i in countByStatus.items() : print(i)
for i in countByStatus.keys() : print(i)

Sorting Transformations

sortByKey(ascending=True, numPartitions=None, keyfunc=<function <lambda>>)
When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in
the Boolean ascending argument.

ord = sc.textFile('practice/retail_db/orders')

Ex-1 (Sort orders using customer id.)


ordPair = ord.map(lambda x : (int(x.split(',')[2]),x))
ordSort = ordPair.sortByKey(ascending=False)
for i in ordSort.take(10) : print(i)

Ex-2 (Sort orders using customer and status.)


ordPair = ord.map(lambda x : ((int(x.split(',')[2]), x.split(',')[3]),x))
ordSort = ordPair.sortByKey(ascending=False)
for i in ordSort.take(10) : print(i)

Ranking

Global Ranking or Ranking per Group
Global Ranking:
• sortByKey and take
• takeOrdered or top
Ranking Per Group:
• Getting a ranking per group is a bit more complex but important to know.
• Per-key or per-group ranking can be achieved using:
• groupByKey with flatMap
• Python knowledge such as the sorted function, lists etc.

[Diagram: global ranking ranks all records together; ranking per group ranks records within each group (G1, G2) separately.]

Global Ranking

Using sortByKey and take:
Ex – Top five products with the highest prices.
prod = sc.textFile('practice/retail_db/products')
# some records have an empty price field, so filter them out first
prod = prod.filter(lambda x : x.split(',')[4] != '')
prodPair = prod.map(lambda x : (float(x.split(',')[4]), x))
top5Products = prodPair.sortByKey(False).take(5)

Using takeOrdered or top:

takeOrdered(num, key=None)
Get the N elements from an RDD ordered in ascending order or as specified by the optional key function.
Ex-
sc.parallelize([10, 1, 2, 9, 3, 4, 5, 6, 7]).takeOrdered(6)
Ex-
sc.parallelize([10, 1, 2, 9, 3, 4, 5, 6, 7]).takeOrdered(6, key=lambda x : -x)
Ex-
top5Products1 = prod.takeOrdered(5, lambda k: -float(k.split(',')[4]))

Ranking Per Group:

Ex – Top 2 Products with highest Prices per Category.
prod = sc.textFile('practice/retail_db/products')
prodF = prod.filter(lambda x : (int(x.split(',')[1]) in [2,3,4]) and (int(x.split(',')[0]) in [1,2,3,4,5,25,26,27,28,29,49,50,51,52,53]))
prodGroupBy = prodF.map(lambda line : (int(line.split(',')[1]), line)).groupByKey()
first = prodGroupBy.first()
sorted(first[1], key = lambda x : float(x.split(',')[4]), reverse=True)
top2ProductsByPrice = prodGroupBy.flatMap(lambda x: sorted(x[1], key=lambda k: float(k.split(",")[4]), reverse=True)[:2])

[Diagram: per-group ranking keeps the top records within each group (G1, G2) separately.]
Set Transformations

union(other)
• A union will get all the elements from both the data sets.
• In the case of a union, it will not get distinct elements. Apply distinct, if you only want to get distinct elements after union
operation.
• When we use set operations such as union and intersect, data should have a similar structure (Same Columns and Types).
Ex – Number of customers who placed an order in the July or August month.
ord = sc.textFile('practice/retail_db/orders')

julyOrd = ord.filter(lambda x : str(x.split(',')[1].split('-')[1]) == '07').map(lambda x : x.split(',')[2])


augOrd = ord.filter(lambda x : str(x.split(',')[1].split('-')[1]) == '08').map(lambda x : x.split(',')[2])
julyAugOrders = julyOrd.union(augOrd).distinct().count()

Intersection(other)
• Return the intersection of this RDD and another one.
• The output will not contain any duplicate elements, even if the input RDDs did.

Ex – Customers who placed orders in both the July and August months.


julyAugCommonOrders=julyOrd.intersection(augOrd).count()

Ex – Check if duplicates are reported.


rdd1=sc.parallelize([1,2,3,3,3])
rdd2=sc.parallelize([1,3,5])
rdd1.intersection(rdd2).collect()

distinct(numPartitions=None)
• Return a new RDD containing the distinct elements in this RDD.
Ex-
rdd1.distinct().collect()

subtract(other, numPartitions=None)


• Return each value in left RDD that is not contained in right RDD.
Ex-
rdd2.subtract(rdd1).collect()

Sampling Transformations

sample(withReplacement, fraction, seed=None) Transformations
• To get random sample records from the RDD.
• withReplacement: True or False. With True, the same record can appear in the sample more than once.
• Fraction: Between 0 to 1. 0.3 means 30%. Does not guarantee the exact 30% of the records.
• Seed: Reproduce the same sample.

rdd = sc.parallelize(range(100), 4)
rdd.sample(seed=10,fraction=0.1,withReplacement=False).collect()

takeSample(withReplacement, num, seed=None) Actions


• Return a fixed-size sampled subset of this RDD.
• This method should only be used if the resulting array is expected to be small, as all the data is loaded into the driver's memory.
• ‘num’ is exact.

rdd.takeSample(seed=10,num=10,withReplacement=True)

RDD Repartition & Coalesce

What is Partition ?
• Datasets are huge in size and cannot fit into a single node, so they have to be partitioned across
different nodes or machines.
• A partition in Spark is basically an atomic chunk of data stored on a node in the cluster. Partitions are the basic
units of parallelism.
• One partition cannot span multiple machines.
• Spark automatically partitions RDDs/DataFrames and distributes the partitions across different nodes.
• We can also configure the optimal number of partitions. Having too few or too many partitions is not good.
• How does Spark do the default partitioning of data?
Spark checks the HDFS block size. The HDFS block size for Hadoop 1.0 is 64MB and for Hadoop 2.0/YARN it is
128MB. Spark creates one partition for each block (see the sketch below).
Ex - For a file of 500MB, 4 partitions would be created.
• At times programmers are required to change the number of partitions based on the requirements of the
application job. The change can be to increase or decrease the number of partitions.
• To do this we apply either repartition or coalesce.
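A quick sketch for checking and influencing how a file is partitioned (the path is the one used elsewhere in this material):

rdd = sc.textFile('practice/retail_db/orders')
rdd.getNumPartitions()        # defaults to the number of HDFS blocks (with a minimum of 2)

rdd = sc.textFile('practice/retail_db/orders', 8)   # ask for at least 8 partitions
rdd.getNumPartitions()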

[Diagram: a 670 MB HDFS file (test_data, about 8m records) stored as 6 blocks loads into an RDD with 6 partitions; after applying a filter the data shrinks to 192 B / 32 records, but the RDD still has 6 partitions.]

### Create some Dump data for testing
df = spark.range(1000000)
df = df.select(df.id,df.id*2,df.id*3)
df = df.union(df)
df = df.union(df)
df = df.union(df)
df = df.union(df)
df = df.union(df)

### Convert DataFrame to RDD.


RDD = df.rdd.map(lambda x : str(x[0]) + ',' + str(x[1]) + ',' + str(x[2]))

### Save the file at a HDFS Path


RDD.coalesce(1).saveAsTextFile('/user/test/test_data')

### Testing
rdd = sc.textFile('/user/test/test_data')
rdd.getNumPartitions()

#Apply a filter.
rdd1 = rdd.filter(lambda x : int(x.split(',')[0]) == 1)
rdd1.getNumPartitions()
Repartition
repartition(numPartitions):
• Returns a new RDD that has exactly numPartitions partitions.
• Creates almost equal sized partitions.
• Can increase or decrease the level of parallelism.
• Spark performs better with equal sized partitions. If you need further processing of huge data, it is preferable to have equal sized partitions,
so we should consider using repartition.
• Internally, this uses a shuffle to redistribute data from all partitions, making it a very expensive operation. Avoid it if not required.
• If you are decreasing the number of partitions, consider using coalesce, where the movement of data across the partitions is lower.

Ex –
ord = sc.textFile('practice/retail_db/orders')
ord.glom().map(len).collect()

ord = ord.repartition(5)
ord.glom().map(len).collect()

Repartition

[Diagram: repartition(2) reshuffles 5 unevenly sized partitions (100, 50, 350, 400 and 100 records) into 2 roughly equal partitions of 500 records each; repartition involves a lot of shuffling.]

Repartition and Sort
repartitionAndSortWithinPartitions(numPartitions=None, partitionFunc=<function portable_hash>, ascending=True, keyfunc)
• Repartitions the RDD according to the given partitioner and, within each resulting partition, sorts records by their keys.
• Needs a key-value paired RDD.

Ex -
rdd = sc.parallelize([(9, ('a','z')), (3, ('x','f')), (6, ('j','b')), (4, ('a','b')), (8, ('s','b')), (1, ('a','b'))], 2)
rdd2 = rdd.repartitionAndSortWithinPartitions(2, lambda x : x % 2, True)
rdd2.glom().collect()

Coalesce
coalesce(numPartitions, shuffle=False):
• Return a new RDD that is reduced into `numPartitions` partitions.
• Optimized version of repartition().
• No shuffling.
• Results in a narrow dependency, e.g. if you go from 1000 partitions to 100 partitions, there will not be a shuffle,
instead each of the 100 new partitions will claim 10 of the current partitions.
• If a larger number of partitions is requested, it will stay at the current number of partitions.
• By Default Coalesce can be only used for decreasing the partitions. But by passing shuffle=True parameter it behaves
like repartition and we can increase the partitions as well.

#Reduce number of partitions using coalesce


rdd = sc.textFile('practice/retail_db/orders')
rdd.getNumPartitions()
rdd.coalesce(1).getNumPartitions()

#Try increase number of partitions using coalesce


rdd.coalesce(5).getNumPartitions()

#Try to increase the number of partitions using coalesce and the shuffle param

rdd.coalesce(5, shuffle=True).getNumPartitions()
coalesce

[Diagram: coalesce(2) merges 4 partitions (100, 300, 5 and 3 records) into 2 partitions (400 and 8 records) by folding existing partitions together; no full shuffle is involved.]

Repartition vs Coalesce

Repartition:
1. Repartition does a full shuffle.
2. Preferably used to increase the number of partitions.
3. Repartition creates new partitions and does a full shuffle.
4. Repartition results in roughly equal-sized partitions.

Coalesce:
1. Coalesce avoids a full shuffle.
2. Preferably used to decrease the number of partitions.
3. Coalesce uses existing partitions to minimize the amount of data that is shuffled.
4. Coalesce results in partitions with different sizes of data.

Common notes:
5. Coalesce may run faster than repartition, but unequal-sized partitions are generally slower to work with than equal-sized partitions.
6. It's critical to repartition or coalesce after running joining or filtering queries. If the data becomes smaller, consider using coalesce to merge
partitions; if the data becomes larger, consider using repartition to increase the number of partitions.
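A minimal sketch (reusing the retail_db orders path from the earlier examples) that contrasts the partition sizes produced by repartition and coalesce:

ord = sc.textFile('practice/retail_db/orders')        # starts with a few input partitions

# repartition(4): full shuffle, roughly equal-sized partitions
print(ord.repartition(4).glom().map(len).collect())

# coalesce(1): merges existing partitions, no full shuffle, sizes can be skewed
print(ord.coalesce(1).glom().map(len).collect())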

RDD Loading/Writing from HDFS

Reading File from HDFS into RDD:
• sc.textFile → Text File
• sc.sequenceFile → Sequence File
Writing data into HDFS From RDD:
• saveAsTextFile
• saveAsSequenceFile

RDD : Save as a Text File
saveAsTextFile(path, compressionCodecClass=None)
• Save this RDD as a text file.
• The list of supported codecs can be found in core-site.xml.

Ex – Find the customers who placed an order in the July or August months. Store the output in HDFS in text format.
Create 5 files. The compression format should be bzip2.
ord = sc.textFile('practice/retail_db/orders')
julyOrd = ord.filter(lambda x : str(x.split(',')[1].split('-')[1]) == '07')
augOrd = ord.filter(lambda x : str(x.split(',')[1].split('-')[1]) == '08')
julyAugOrders = julyOrd.union(augOrd).distinct()
julyAugOrders.coalesce(5).saveAsTextFile('practice/retail_db/dump/julyAugOrders',
compressionCodecClass='org.apache.hadoop.io.compress.BZip2Codec')

RDD : Save into a SequenceFile
saveAsSequenceFile(path, compressionCodecClass=None)
• A Sequence File is a collection of key-value pair tuples.
• It is a flat binary file type that serves as a container for data to be used in Hadoop distributed computing projects and MapReduce.

#Extract rdd to sequence file


Ex – Find the customers who placed an order in the July or August months. Store the output in HDFS in sequence file format. Create 1 file. No
compression is required.
JulyAugOrdPair = julyAugOrders.map(lambda r : (r.split(',')[0],r))
JulyAugOrdPair.coalesce(1).saveAsSequenceFile('practice/retail_db/dump/julyAugOrders')

#Keys can also be None if you do not have a suitable key.


JulyAugOrdPair = julyAugOrders.map(lambda r : (None,r))

#Load a sequence file into rdd


rdd = sc.sequenceFile('practice/retail_db/dump/julyAugOrders')

Spark Cluster Execution
Architecture

Spark Execution Architecture

(Diagram: the Driver Program runs a JVM with the Spark Context, DAG Scheduler and Task Scheduler. The driver talks to the Cluster Manager, which allocates Worker Nodes; each Worker Node runs an Executor with a cache and task slots. The numbered steps 1-9 in the diagram are described below.)
 Follows master-slave architecture.
 Client submits user application code to Driver. (1)
 A JVM is created on the driver. (2)
 The Spark Context is created in the JVM of the driver program. Only one active SparkContext per JVM. (3)
 Driver implicitly converts the user code into a logical DAG (Directed Acyclic Graph) using the DAG Scheduler. (4)
• The DAG Scheduler performs optimizations such as pipelining transformations and then converts the logical DAG
into a physical execution plan with many stages.
• After creating the physical execution plan, it creates physical execution units called tasks under each stage.
 The stages are passed on to the Task Scheduler, which launches tasks through the cluster manager. (5)
 Now the driver, via the Spark Context, talks to the cluster manager and negotiates resources. It requests worker nodes and
executors in the cluster. (6) (7)
• Spark is agnostic to the underlying cluster manager. As long as it can acquire executor processes, and these
communicate with each other, it is relatively easy to run it even on a cluster manager that also supports other
applications (e.g. Mesos/YARN).
• The Cluster Manager allocates resources and instructs the executors to execute the job.
• It also tracks the submitted jobs and reports the status of the job back to the driver.
 The driver sends application code and dependencies (defined by jar or Python files passed to the SparkContext) to the executors. (8)
 Finally, the driver also sends the tasks to the executors to run.
 Job resources which we are trying to pass as part of execution can be cached at the Worker Nodes. One of the job
resources can be our code itself.
 All the executors register themselves with the driver so that the driver has a complete view of the executors.
 Executors now start executing the tasks that are assigned by the driver program.
 When the application is running, the driver program monitors the set of executors that run. The driver also schedules future
tasks based on data placement.
 After execution, the result returns back to the Spark Context. (9)
Cluster Manager Types:
The system currently supports several cluster managers:
• Standalone – a simple cluster manager included with Spark that makes it easy to set up a cluster.
• Apache Mesos – a general cluster manager that can also run Hadoop MapReduce and service applications.
• Hadoop YARN – the resource manager in Hadoop 2.
• Kubernetes – an open-source system for automating deployment, scaling, and management of containerized
applications.

YARN Cluster Manager

YARN as Spark Cluster Manager:
• Yet Another Resource Negotiator for Hadoop 2.x.
• Cluster Management – Used for resource allocation and Scheduling.
• It has 3 major components –
 Resource Manager
 Node Manager
 Application Master

Flow:
1. The client submits the Spark application. The driver instantiates a SparkContext.
2. The driver talks to the cluster manager (YARN) and negotiates resources.
3. The YARN Resource Manager searches for a Node Manager which will, in turn, launch an ApplicationMaster for the
specific job in a container.
4. The ApplicationMaster registers itself with the resource Manager.
5. The ApplicationMaster negotiates containers for executors from the ResourceManager. Can request for more
resources from RM.
6. The ApplicationMaster notifies the Node Managers to launch the containers and executors. Executor then executes
the tasks.
7. Driver communicates with executors to coordinate the processing of tasks of an application.
8. Once the tasks are complete, ApplicationMaster un-registers with the Resource Manager.

YARN Cluster Manager

(Diagram: the Driver talks to the Resource Manager (1, 2). The Resource Manager asks a Node Manager to launch an ApplicationMaster in a container (3, 4). The ApplicationMaster negotiates containers for executors from the Resource Manager (5) and notifies the Node Managers to launch the executor containers (6). The driver then communicates with the executors to coordinate the tasks (7), and the ApplicationMaster un-registers once the job completes (8).)
Resource Manager:
• Runs on Master Node.
• Manages the resources used across the cluster.
• Two components – Scheduler and Application Manager.
Scheduler - Performs scheduling based on the requirement of resources by the applications.
Application Manager: It manages the running of the Application Master and restarts it on failure. It is also
responsible for accepting the submission of jobs.

Node Manager:
• Runs on all Worker Nodes.
• Launches and monitors the containers which are assigned by the RM.
• Responsible for the execution of the tasks on each data node.

Containers:
• A container is a set of resources like RAM, CPU cores, disk etc. on a single node; containers are scheduled by the RM and monitored
by the NM.

Application Master:
• An individual ApplicationMaster is assigned for each job by RM.
• Its chief responsibility is to negotiate resources from the RM. It works with the Node Manager to monitor
and execute the tasks.
JVM Processes

Where JVMs are created during execution across the cluster:

(Diagram: a Client JVM runs the Spark application and submits it; the Driver JVM holds the Spark Context in its JVM heap; each Executor JVM on the cluster has its own JVM heap and a set of task slots, some occupied by running tasks and some unoccupied.)
Terms used in Spark Execution Framework:
Application: User application built on Spark.
Job: A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g. save, collect).
Stage: Each job gets divided into smaller sets of tasks called stages that depend on each other (similar to the map and reduce stages in MapReduce);
you'll see this term used in the driver's logs.
Task: A unit of work that will be sent to one executor.
Application Jar: A jar containing the user's Spark application code.
Driver Program: The process running the main() function of the application and creating the SparkContext.
Cluster Manager: An external service for acquiring resources on the cluster (e.g. standalone manager, Mesos, YARN). In YARN it is Resource Manager
+ Application Master.
Deploy Mode: Distinguishes where the driver process runs. In "cluster" mode, the framework launches the driver inside of the cluster. In "client"
mode, the submitter launches the driver outside of the cluster.
Worker Node: Any node that can run application code in the cluster.
Executor: A process launched for an application on a worker node, that runs tasks and keeps data in memory or disk storage across them. Each
application has its own executors.
Cache: Cache is nothing but the job resources which we are trying to pass as part of execution. One of the job resource can be our code itself.
Typically the code will be compiled in the jar file and jar file will be passed as cache and it will be cached into the executors. Tasks will be using this
cache at the run time so that it can apply the logic onto the data while data is being processed. We can also persist or cache the RDDs.

One Application → Multiple Jobs for Each Application → Multiple Stages for Each Job → Multiple Tasks for Each Stage
Narrow and
Wide
Transformations

Narrow and Wide Transformations
Narrow Transformations:
• These types of transformations convert each input partition to only one output partition.
• Spark merges all narrow transformations into one stage.
• Fast.
• No data shuffle.
• Ex- map(), filter()
(Diagram: each partition of RDD1 maps to exactly one partition of RDD2 - Partition 1 → Partition 1, Partition 2 → Partition 2, Partition 3 → Partition 3.)
Wide Transformations:
• This type of transformation will have input partitions contributing to many output partitions.
• Each Wide Transformations creates a new stage.
• Slow compared to Narrow.
• Data shuffle.
• Ex- groupByKey(), aggregateByKey(), join, distinct(), repartition() etc.

(Diagram: each partition of RDD1 contributes rows to many partitions of RDD2, so data is shuffled across the cluster. A minimal sketch contrasting narrow and wide transformations follows below.)
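A minimal sketch (reusing the orders dataset path from earlier examples) showing how a narrow transformation works partition-by-partition while a wide transformation introduces a shuffle boundary, visible as an extra stage in toDebugString():

ord = sc.textFile('practice/retail_db/orders')

# Narrow: map works on each partition independently, no shuffle
statuses = ord.map(lambda x: (x.split(',')[3], 1))

# Wide: reduceByKey shuffles rows so that equal keys land in the same partition
statusCounts = statuses.reduceByKey(lambda a, b: a + b)

# The lineage shows a ShuffledRDD, i.e. a new stage, introduced by reduceByKey
print(statusCounts.toDebugString())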
DAG

DAG Graph

(Diagram: an example DAG split into four stages, built from operations such as parallelize, map, distinct, aggregateByKey, filter and join; each shuffle operation (aggregateByKey, join) starts a new stage.)
DAG
DAG Stands for Directed Acyclic Graph.
Directed → Nodes are directly connected from one to another.
Acyclic → There is no cycle or loop, so we cannot go back to an earlier position.
Graph → It has vertices and edges. Vertices represent RDDs and edges represent the operations on the RDDs.
These are all represented as a graph.

DAG Scheduler

(Diagram: inside the Driver Program, the Spark Context builds an operator DAG from the RDD operations (the Logical Execution Plan). When an action is triggered, the DAG Scheduler converts it into a Physical Execution Plan of stages, and the Task Scheduler launches the tasks of each stage.)
Steps to build a DAG
1. User submits an application job to Spark.
2. The driver takes the application and creates a Spark Context to process the application.
3. The Spark Context identifies all the transformation (T) and action (A) operations present in the application.
4. All the operations are arranged in a logical flow of operations called the DAG (Logical Execution Plan).
5. It stops here if the Spark Context doesn't find any action operations.
6. If it identifies an action operation, Spark submits the operator DAG to the DAG Scheduler.
7. The DAG Scheduler converts the Logical Execution Plan into a Physical Execution Plan and creates stages
and tasks. Narrow transformations are fused together into one stage; wide transformations involving a shuffle create
new stages.
8. The DAG Scheduler bundles all the tasks and sends them to the Task Scheduler, which then submits the job to the cluster
manager for execution.

RDD Lineage
RDD Lineage:
• Each RDD maintains a pointer to one or more parents along with metadata about what type of relationship it
has with the parent.
• Ex - if we call rdd2=rdd1.map(), the rdd2 keeps a reference to its parent rdd1 and this is called RDD lineage.
• Print the RDD lineage information using toDebugString() API.

(Diagram: rdd1 is the parent of rdd2 and rdd3, which are in turn the parents of rdd4 - the RDD Lineage Graph.)
toDebugString() :
• Displays the Logical Execution Plan.
• We can learn about an RDD Lineage Graph using the toDebugString API.
• Displays the description of this RDD and its recursive dependencies for debugging.

Word Count Program:
text_file = sc.textFile('practice/retail_db/word')
wordCounts = text_file.flatMap(lambda line: line.split(",")) \
                      .filter(lambda x : x.isdigit() == False) \
                      .map(lambda word: (word, 1)) \
                      .reduceByKey(lambda a, b: a + b)
top3Words = wordCounts.takeOrdered(3, lambda k: -float(k[1]))

for line in wordCounts.toDebugString().decode().split("\n"):
    print(line)

(Lineage diagram: textFile creates a HadoopRDD, followed by mapPartitionsRDDs for the narrow transformations (NT) flatMap, filter and map, all in Stage 1; reduceByKey is a wide transformation (WT) producing a ShuffledRDD in Stage 2; takeOrdered is the action (A).)
Logical Execution Plan -> DAG Scheduler -> Physical Execution Plan
(Diagram: the word count lineage per partition - the HadoopRDD and the three mapPartitionsRDDs (each with partitions p1, p2) form Stage 1; the ShuffledRDD forms Stage 2. The DAG Scheduler hands the stages to the Task Scheduler.)
DAG Scheduler to Task Scheduler:
• The DAG Scheduler submits the stages to the Task Scheduler.
• The number of tasks submitted depends on the number of partitions present in the textFile.
• For example – imagine we have 3 partitions, then there will be 3 sets of tasks created and submitted in parallel,
provided there are enough cores.

Parallel Execution of Stages:
• The stages which are not dependent on each other may be submitted to the cluster for execution in parallel.

text_file = sc.textFile('practice/retail_db/orders').map(lambda x : (int(x.split(',')[0]),x.split(',')[3]))


rdd = sc.parallelize([1,2,3,4,5,6,7]).map(lambda x : (x,1*1) )
joined = text_file.join(rdd)
joined.count()

How to find the Resource Manager URL
 Go to /etc/hadoop/conf
 Open yarn-site.xml
 Find the resource Manager url by searching name ‘yarn.resourcemanager.webapp.address’
 Default port : 8088

RDD Persistence

• One of the most important capabilities in Spark is persisting (or caching) a dataset in memory.
• When we persist an RDD, each node stores any partitions of it that it computes in memory and reuses
them.
• We can mark an RDD to be persisted using the persist() or cache() methods on it. The first time it is
computed in an action, it will be kept in memory on the nodes.
• Spark’s cache is fault-tolerant – if any partition of an RDD is lost, it will automatically be recomputed using
the transformations that originally created it.
• Each persisted RDD can be stored using a different storage level.
• These levels are set by passing a StorageLevel object to persist().

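A minimal sketch, reusing the orders path from earlier examples, showing that the first action materializes the cache and later actions reuse it:

ord = sc.textFile('practice/retail_db/orders')
ord.cache()                                    # mark for persistence (MEMORY_ONLY); nothing is computed yet

ord.count()                                    # first action: reads from HDFS and caches the partitions
ord.filter(lambda x: 'CLOSED' in x).count()    # subsequent actions reuse the cached partitions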
StorageLevel(useDisk, useMemory, useOffHeap, deserialized, replication=1)
• Data can be stored either in Disk or Memory or off-heap memory or any of these combinations.
• Data can be stored serialized or deserialized. Serialization is a way to convert an object in memory into a series of bits.
Deserialization is the process of bringing those bits back into memory as an object. Whenever we talk about a 'deserialized'
RDD/DF, we are always referring to RDDs/DFs in memory.
• Off-Heap Memory is a segment of memory that lies outside the JVM, but is used by the JVM for certain use-cases. Off-heap memory
can be used by Spark explicitly as well to store serialized data-frames and RDDs.
• Use the replicated storage levels if you want fast fault recovery.

(Slide shows the available StorageLevel options; MEMORY_ONLY is the default storage level.)
persist(storageLevel=StorageLevel(False, True, False, False, 1) )
• Set this RDD's storage level to persist its values across operations after the first time it is computed. This can only be used to
assign a new storage level if the RDD does not have a storage level set yet.
• If no storage level is specified defaults to MEMORY_ONLY.

rdd = sc.parallelize(("b", "a", "c"))


rdd.persist().is_cached
rdd.getStorageLevel()

from pyspark import StorageLevel


rdd.unpersist()                               # the storage level can only be set once, so unpersist first
rdd.persist( StorageLevel.MEMORY_AND_DISK_2 )
print(rdd.getStorageLevel())

unpersist():
• Mark the RDD as non-persistent, and remove all blocks for it from memory and disk.
• Spark automatically monitors cache usage on each node and drops out old data partitions in a least-recently-used
(LRU) fashion.
• If you would like to manually remove an RDD instead of waiting for it to fall out of the cache, use the RDD.unpersist()
method.

rdd.persist()
rdd.is_cached
rdd.unpersist()
rdd.is_cached

Usage and trade-offs:
• Spark’s storage levels are meant to provide different trade-offs between memory usage and CPU efficiency.
• If your RDDs fit comfortably with the default storage level (MEMORY_ONLY), leave them that way. This is the most
CPU-efficient option, allowing operations on the RDDs to run as fast as possible.
• When the dataset is very large, or the operations on it are very expensive, or memory is not sufficient to hold
the data, spill it to disk.
• Use the replicated storage levels if you want fast fault recovery.

Spark Shared Variables

Shared Variables
• Shared variables are the variables that are required to be used by functions and methods in parallel.
• Shared variables can be used in parallel operations.
• Spark provides two types of shared variables −
 Broadcast
 Accumulator

Broadcast Variables
• Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than
shipping a copy of it with tasks.
• Immutable and cached on each worker node only once.
• Efficient manner to give a copy of a large dataset to each node.
• They should fit in memory.

When to use Broadcast Variable:
• For processing, the executors need information regarding variables or methods. This information is serialized by Spark and
sent to each executor and is known as CLOSURE.
• If we have a huge array that is accessed from Spark CLOSURES, for example - if we have a 5-node cluster with 100 partitions
(20 partitions per node), this array will be distributed at least 100 times (20 times to each node). If we broadcast it, it
will be distributed once per node using an efficient p2p protocol.

What not to do:


• Once we have broadcast the value to the nodes, we shouldn't make changes to it, so that each node has the
exact same copy of the data. The modified value might be sent to another node later, which would give unexpected results.

Hands-on:

--Broadcast a Dictionary
days={"sun": "Sunday", "mon" : "Monday", "tue":"Tuesday"}
bcDays = spark.sparkContext.broadcast(days)
bcDays.value
bcDays.value['sun']

--Broadcast a list
numbers = (1,2,3)
broadcastNumbers=spark.sparkContext.broadcast(numbers)
broadcastNumbers.value
broadcastNumbers.value[0]

PS: Convert the 3-character day codes into full day names.
Input data= (("James","Smith","USA","mon"),
("Michael","Rose","USA","tue"),
("Robert","Williams","USA","sun"),
("Maria","Jones","USA","tue")
)

Solution:
days={"sun": "Sunday", "mon" : "Monday", "tue":"Tuesday"}
bcDays = spark.sparkContext.broadcast(days)

data = (("James","Smith","USA","mon"),
("Michael","Rose","USA","tue"),
("Robert","Williams","USA","sun"),
("Maria","Jones","USA","tue")
)

rdd = spark.sparkContext.parallelize(data)

def days_convert(day):
    return bcDays.value[day]

input = rdd.map(lambda x: (x[0], x[1], x[2], x[3])).collect()

output = rdd.map(lambda x: (x[0], x[1], x[2], days_convert(x[3]))).collect()
Accumulator Variables
• Accumulator is a shared variable to perform sum and counter operations.
• These variables are shared by all executors to update and add information through associative and commutative
operations.

Commutative -> f(x, y) = f(y, x)


Ex : sum(5, 7) = sum(7, 5)

Associative -> f(f(x, y), z) = f(x, f(y, z))


Ex : sum(sum(5, 6), 7) = sum(5, sum(6, 7))

Accumulator Variables
• Why is it needed when we can use normal variables?

counter = 0
def f1(x):
    global counter
    counter += 1

rdd = spark.sparkContext.parallelize((1,2,3))
rdd.foreach(f1)
print(counter)    # still 0 on the driver
The counter variable will not be added to or changed, because when Spark ships this code to every executor the variable becomes local to
that executor. So the variable is updated on that executor but is not sent back to the driver. To avoid this problem, we need an
accumulator. All the updates to an accumulator variable in every executor are sent back to the driver.

### sparkContext.accumulator() is used to define accumulator variables


counter = spark.sparkContext.accumulator(0)
def f2(x):
    global counter
    counter.add(1)    ### add() function is used to add/update a value in the accumulator

rdd.foreach(f2)
counter.value         ### Only accessed by the Driver

Accumulator Variables
Lets check one more example:

accum=sc.accumulator(0)
rdd=spark.sparkContext.parallelize((1,2,3,4,5))
rdd.foreach(lambda x:accum.add(x))
print(accum.value) #Accessed by driver

foreach() is an action which is applied to each element of the rdd and then adding each element to accum variable.
rdd.foreach() is executed on workers and accum.value is called from driver.

Accumulator Variables
• Spark natively supports accumulators of numeric types (int, float) and programmers can add support for new custom types using
AccumulatorParam class of PySpark.
• Accumulators do not change the lazy evaluation model of Spark. If they are being updated within an operation on an RDD, their value
is only updated once that RDD is computed as part of an action.
counter = spark.sparkContext.accumulator(0)
def f2(x):
global counter
counter.add(1)
rdd.map(f2)
• Computations inside transformations are evaluated lazily, so unless an action happens on an RDD the transformations are not
executed. As a result, accumulator updates inside functions like map() or filter() won't get executed unless some action happens
on the RDD. Spark guarantees to update accumulators inside actions only once.
Always use accumulators inside actions ONLY (ex – foreach).
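The bullets above mention that custom accumulator types can be added via PySpark's AccumulatorParam class. A minimal sketch of a custom list accumulator (the class name ListParam and the malformed-line check are ours, for illustration only):

from pyspark.accumulators import AccumulatorParam

class ListParam(AccumulatorParam):
    def zero(self, value):
        return []                        # initial (empty) value on each executor
    def addInPlace(self, acc1, acc2):
        acc1.extend(acc2)                # merge two partial results
        return acc1

badRecords = spark.sparkContext.accumulator([], ListParam())

def check(x):
    if ',' not in x:
        badRecords.add([x])              # collect lines without a delimiter

sc.textFile('practice/retail_db/orders').foreach(check)   # action, so updates are applied
print(badRecords.value)                                    # accessed on the driver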

SparkSQL Architecture

Spark SQL Architecture

(Diagram: at the bottom is a flexible data access layer that reads structured or semi-structured data from sources such as CSV, JSON, JDBC, Parquet, Hive, Avro, HDFS, Elasticsearch, Cassandra and many more. On top of it sits the DataFrame, equivalent to a relational table in SQL, on which SQL operations are performed; the data is then processed by the Catalyst Optimizer and the Tungsten Execution Engine.)
How Catalyst and Tungsten Work : An Overview (Spark 2.x)

(Diagram: user programs - a SQL query or DataFrame API calls - are turned into abstractions (trees). Catalyst applies a series of transformations to go from a parsed query plan to an optimized query plan, and Tungsten then generates code and produces the RDDs that are executed. Reference: databricks.com summit.)
Spark SQL Architecture
SQL, DataFrames are high-level APIs. It means the user programs just describe what data operations are needed
without specifying how to execute these operations.

Optimizer: An optimizer can automatically find out the most efficient plan to execute a query.

Catalyst Optimizer:
• Spark SQL is designed with Catalyst Optimizer which is based on functional programming of Scala.
• Responsible to improve the performance of user programs (SQL Query/DataFrame APIs).
• It converts a query plan into optimized query plan.
• Two main Purpose:
 Add new optimization techniques to solve big data problems.
 Allows developers/spark community to implement and extend the optimizer with new features.
• Offers both rule-based and cost-based optimization (Spark 2.0).
 Rule Based : How to execute the query from a set of defined rules.
 Cost-Based: Generates multiple execution plans and chose the lowest cost plan.

Tungsten Execution Engine:


• Generates code and executes it in the cluster in a distributed fashion.
• The engine builds upon ideas from modern compilers and MPP databases.

Trees: Abstractions of User Programs

(Figure: the expressions of a query are identified and represented as a tree. Reference: databricks.com summit.)
Query Plan
Describes data operations like aggregates, joins, filters etc.; these operations essentially generate a new dataset
based on an input dataset.

Reference: databricks.com summit


Query Plan

(Figure: an example query plan tree. Reference: databricks.com summit.)
Logical Query Plan
2 types - Logical Plan and Physical Plan

Reference: databricks.com summit


Physical Query Plan

Reference: databricks.com summit


Transform
A Transform is a function associated with every tree to implement a single rule.

Reference: databricks.com summit


Transform
Predicate Pushdown

Reference: databricks.com summit


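As an illustration of predicate pushdown, a minimal sketch (assuming a Parquet copy of the orders data, with the same columns as the CSV, at the path used later in this document); with Parquet sources the physical plan printed by explain() typically lists the condition under PushedFilters:

df = spark.read.load('practice/retail_db/orders_parquet', format='parquet')

# Catalyst pushes the filter below the projection, down to the data source scan
df.filter(df.order_id == 1000).select('order_id', 'order_status').explain()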
Transform
Constant Folding

Reference: databricks.com summit


Transform
Column Pruning

Reference: databricks.com summit


--SQL Query
select count(*) from orders
where order_cust_id = 1000

OR

--DataFrame API
ord.filter(ord.order_customer_id == 1000).count()

(Query plan: Scan -> Filter -> Project -> Aggregate)
Volcano Iterator Model:
select count(*) from orders
where order_cust_id = 1000

Each operator in the query plan implements an iterator interface that takes in the records from the operator
below it, does some processing and optionally outputs the record to the operator above it.

(Diagram: the operators Scan -> Filter -> Project -> Aggregate stacked as a pipeline of iterators. Reference: databricks.com summit.)
Volcano Iterator Model:
• Spark 1.6
• Each operator in the query plan would implement an iterator interface that takes in the records from
the operator below it, tries to do some processing and output that record optionally to the operator
above it.
• Advantages:
 Each operator is independent of the others. So it is easy to write a new
operator, and we do not need to worry about how it interacts with all the other operators.
 No need to worry about the operators before it or after it.
• Disadvantages:
 Too many virtual function calls. Since we are agnostic to the operator that is below, we have no
idea where the input data is coming from.
 Extensive memory access: Each operator has to write the intermediate row that it is trying to send
to the upstream operator into memory. So there is a memory read/write bottleneck.
 Can’t take advantage of modern CPU features like SIMD, pipelining, prefetching etc.

Whole-stage Code Generation – Spark 2.0 Tungsten Engine
• The idea is to make Spark act as a compiler (Spark tries to generate code that looks like hand-optimized code).
• Fusing operators together so the generated code looks like hand-optimized code.
 Identify chains of operators ("stages").
 Compile each stage into a single function.
 Better performance, as if a hand-built system had been written just to run your given query.

Operator mapping: Scan ~ the for loop, Filter ~ the filter condition, Project ~ no column to project, Aggregate ~ the count.

var count = 0
for (order_cust_id in orders) {
  if (order_cust_id == 1000) {
    count += 1
  }
}

Reference: databricks.com summit
Volcano vs Hand Optimized Code (Whole-stage):
• No Function Virtual Dispatches: In the Volcano model, to process a tuple would require calling the
next() function at least once. These function calls are implemented by the compiler as virtual function
dispatches. The hand-written optimized code, on the other hand, does not have a single function call.
• Intermediate data memory vs CPU registers: In the Volcano model, each time an operator passes a
tuple to another operator, it requires putting the tuple in memory (function call stack). In the hand-
written version, by contrast, the compiler (JVM JIT in this case) actually places the intermediate data in
CPU registers. Again, the number of cycles it takes the CPU to access data in memory is orders of
magnitude larger than in registers.
• Modern CPU Features: The Volcano model is unable to use modern CPU features due to its complex
function call graphs. Modern compilers are incredibly efficient when compiling and executing simple
for loops.

Benchmark

Let's check the performance with an example – add one billion numbers.

from time import time
from pyspark.sql.functions import sum

def benchmark(version):
    start = time()
    spark.range(1000 * 1000 * 1000).select(sum("id")).show()
    end = time()
    elapsed = end - start
    print(elapsed)

Spark 1.6 (whole-stage codegen off):
spark.conf.set("spark.sql.codegen.wholeStage", False)
benchmark('1.6')
Total Time: 10.4 secs

Spark 2.0 (whole-stage codegen on):
spark.conf.set("spark.sql.codegen.wholeStage", True)
benchmark('2.0')
Total Time: 0.4 secs

Understanding the Execution Plan
explain(extended=False):
• Prints the (logical and physical) plans to the console for debugging purpose.
• Extended is a Boolean parameter.
default : False. If False, it prints only the physical plan.
If True, it prints all – Parsed Logical Plan, Analyzed Logical Plan, Optimized Logical Plan and Physical Plan.
• Explain function is extended for whole-stage code generation.
• When an operator has a star (*) next to it, whole-stage code generation is enabled. In the following example,
Range, Filter, and the two Aggregates are all running with whole-stage code generation. Exchange does not
have whole-stage code generation because it is sending data across the network.

Ex –
spark.conf.set("spark.sql.codegen.wholeStage",True)
spark.range(1000).filter("id > 100").selectExpr("sum(id)").explain()

Operator Benchmarks : Processing Cost/row in ns.

primitive                              Spark 1.6   Spark 2.0   Speedup
filter                                     15         1.1      5-30x
sum w/o group                              14         0.9      5-30x
sum w/ group                               79        10.7      5-30x
hash join                                 115         4        5-30x
sort (8-bit entropy)                      620         5.3      10-100x
sort (64-bit entropy)                     620        40        10-100x
sort-merge join                           750       700        shuffling still the bottleneck
Parquet decoding (single int column)      120        13        10x

Reference: databricks.com summit
Let’s put it all
together

(Diagram: a SQL query or DataFrame enters Catalyst as an Unresolved Logical Plan; Analysis (using the Catalog) produces a Logical Plan; Logical Optimization produces an Optimized Logical Plan; Physical Planning produces several Physical Plans, from which the Cost Model selects one; Tungsten Code Generation then turns the Selected Physical Plan into RDDs.)
Spark SQL Architecture
Spark SQL:
It is a module used for structured and semi structured data processing.

Flexible Data Access:


The bottom layer in the architecture is the flexible data access. Data is acquired from various input formats like CSV,
JSON, Parquet, any database with JDBC, Hive etc.

DataSource API:
Used to read and store structured and semi-structured data into Spark SQL.
DataSource API then fetches the data which is then converted into a DataFrame API.

DataFrame API:
Equivalent to a relational table in SQL to perform SQL Operations.
Distributed collection of data organized into named Columns.
Data is stored in partitions.

DataFrame DSL/DataFrame SQL/HQL:


Data is processed.

Catalyst Optimizer:
It converts a query plan into optimized query plan.

Tungsten:
Takes the optimized query plan from Catalyst, generates code and executes it in the cluster in a
distributed fashion.
DataFrame
Fundamentals

What is a DataFrame ?
• DataFrame is a Dataset organized into named columns/rows.

(Diagram: a raw dataset of lines such as "Robert",31 / "Alicia",25 / "Deja",19 / "Manoj",31 plus a schema ('name','age') become a DataFrame:)

Name    Age
Robert  31
Alicia  25
Deja    19
Manoj   31

• Conceptually equivalent to an RDBMS table / Python data frame, plus richer optimizations.
DataFrame Sources ?
• Structured data files (CSV, JSON, AVRO, PARQUET etc)
• Hive
• Cassandra
• Python Data frame
• RDBMS Databases
• RDDs

DataFrame Features

DataFrame Features ?
1. Distributed
2. Lazy Evaluation
3. Immutability
4. Used across the Spark Ecosystem
5. Polyglot
6. Work on Huge collection of dataset
7. Support both Structured and Semi-Structured Data

DataFrame Features -
Distributed

DataFrame Features
1. DataFrame is Distributed.
• Like RDD, DataFrame is also distributed.
• Supports HA and FT.
(Diagram: the DataFrame is split into partitions; each partition becomes a task that runs in an executor on a worker node.)

Hands-on 1
1. DataFrame is Distributed.
Ex – Find out the even numbers.

(Diagram: a 300MB DataFrame is split into three partitions of 128MB, 128MB and 46MB; each partition runs the same filter (col % 2 == 0) as a separate task in the executors on the worker nodes.)
DataFrame Features – Lazy
Evaluation

DataFrame Features Contd …
2. Lazy Evaluation:
Each Transformation is a Lazy Operation. Evaluation is not started until an action is triggered.

(Diagram: DataFrame 1 -> filter -> DataFrame 2 -> group by -> DataFrame 3 -> count (action); only at this point does execution start.)
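A minimal sketch of the same idea: the transformations below only build up a plan; nothing runs until the action.

df = spark.range(1, 1000)                                             # transformation: nothing executed yet
filtered = df.filter(df.id % 2 == 0)                                  # transformation: still nothing executed
grouped = filtered.groupBy((df.id % 10).alias('bucket')).count()      # transformation

grouped.show()                                                        # action: execution starts here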
DataFrame Features Contd …

Example – Find out 10 sample records having the string "Robert" from a file (1TB).

Without Lazy Evaluation:
Step 1: Load the 1TB file.
Step 2: Perform a full scan to find out all records having "Robert".
Step 3: Retrieve 10 records.

With Lazy Evaluation:
Step 1: Wait for an action and then load the 1TB file.
Step 2: Filter out the first 10 records having "Robert".
DataFrame Features -
Immutability

DataFrame Features Contd …
3. Immutability:
DataFrames are considered to be “Immutable Storage”.

(Diagram, single node: data in stage 1 goes through transformation 1, transformation 2, … to data in stage 2. The original data is stored immutably; only the transformation recipe is stored (e.g. the logic to print all even numbers is col % 2 == 0), the intermediate data is not stored.)
DataFrame Features Contd …
Multi-Node

(Diagram, multi-node: the same idea across nodes - data in stage 1 on node 1, 2, 3, … is transformed into data in stage 2 on each node; again only the transformation recipe is stored. A minimal sketch of DataFrame immutability follows below.)
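A minimal sketch showing that a transformation returns a new DataFrame and leaves the original untouched:

df = spark.createDataFrame([('Robert', 31), ('Alicia', 25)], schema=['name', 'age'])

df2 = df.withColumn('age_plus_1', df.age + 1)   # returns a NEW DataFrame

df.columns     # ['name', 'age']                -> original is unchanged
df2.columns    # ['name', 'age', 'age_plus_1']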
DataFrame Features – Used
Across Spark Ecosystem

DataFrame Features Contd …
4. Used across the Spark Ecosystem
DataFrame is a unified API across all libraries in Spark.

(Diagram: the same DataFrame API is used consistently across Spark SQL, Spark Streaming, MLlib (Machine Learning) and GraphX; a single API for the developer to learn.)
DataFrame Features Contd …
5. Polyglot:
Support multiple Languages - Scala, Python, Java, R.
6. Works on Huge collection of dataset, feasible to work with a wide file.
7. Supports both Structured and Semi structured data (JSON, XML etc).

Hands-On

 Common functions on Data Frames

• printSchema – to print the column names and data types of the data frame
• show – to preview data (default 20 records)
• describe – to understand characteristics of the data
• count – to get the number of records
• collect – to convert the data frame into an Array
• dtypes
• columns

(Sample data frame used: Name/Age - Robert 31, Alicia 25, Deja 19, Manoj 31.)
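A minimal sketch exercising these functions on the small sample data frame above:

df = spark.createDataFrame([('Robert', 31), ('Alicia', 25), ('Deja', 19), ('Manoj', 31)],
                           schema=['Name', 'Age'])

df.printSchema()       # column names and data types
df.show()              # preview (up to 20 rows by default)
df.describe().show()   # basic statistics per column
df.count()             # number of records
df.collect()           # list of Row objects on the driver
df.dtypes              # [('Name', 'string'), ('Age', 'bigint')]
df.columns             # ['Name', 'Age']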

DataFrame Organization

DataFrame Organization of Data
• A DataFrame has 3 levels to organize and process its data - Schema, Storage and API.
• Schema:
 A DataFrame is implemented as a dataset of rows (Dataset<Row>).
 Each column is named and typed.
• Storage:
 Storage is distributed and data is stored in partitions.
 Storage can be in memory, on disc, off-heap, or any combination of these three.
• API
 Used to process the data (the DataFrame APIs).

(Diagram: the schema names and types the columns (e.g. ObjectID: String, Name: String, Age: Long, …); the rows are stored in partitions such as Partition #1 (Alicia, 35, …) and Partition #2 (Sai, 31, …); the DataFrame APIs operate on this data.)
DataFrame Organization of Data
StorageLevel(useDisk, useMemory, useOffHeap, deserialized, replication=1)
• Data can be stored either in Disk or Memory or off-heap memory or any of these combinations.
• Off-Heap Memory is a segment of memory that lies outside the JVM, but is used by the JVM for certain use-cases. Off-heap memory
can be used by Spark explicitly as well to store serialized data-frames and RDDs.
• Data can be stored serialized or deserialized. Serialization is a way to convert a Java object in memory into a series of bits.
Deserialization is the process of bringing those bits back into memory as an object. Whenever we talk about a 'deserialized'
RDD/DF we are always referring to RDDs/DFs in memory.
• Use the replicated storage levels if you want fast fault recovery.

(Slide shows the available StorageLevel options; for rdd.persist() the default is MEMORY_ONLY.)
DataFrame Organization of Data
from pyspark import StorageLevel
df = spark.range(10)
df.rdd.persist().getStorageLevel()
df.rdd.persist(StorageLevel.MEMORY_AND_DISK_2).getStorageLevel()

Introduction
To
Spark Session

SparkSession : The Entry point Spark 2.0 Onwards
 In Spark 2.0, SparkSession is the new entry point to work with RDD, DataFrame and all other functionalities.
 Prior to 2.0, SparkContext used to be an entry point.
 Almost all the APIs available in SparkContext, SQLContext, HiveContext are now available in SparkSession.
SparkContext: Entry point to work with RDD, Accumulators and broadcast variables (< Spark 2.0).
SQLContext: Used for initializing the functionalities of Spark SQL (< spark 2.0).
HiveContext: Super set of SQLContext (< spark 2.0).
 By Default, Spark Shell provides a “spark” object which is an instance of SparkSession class.

Prior to 2.0: SparkContext, SQLContext, HiveContext.
2.0 and onwards: SparkSession (the old classes still exist but are not recommended to use).

Spark Session : The Entry point Spark 2.0 Onwards

(Analogy: just as "My House" and "My Friend's House" are objects (instances) of a House class, the spark object is an instance of the SparkSession class.)
Spark Session : Spark Object
& spark-submit

Spark Session : Create
from pyspark.sql import SparkSession

spark = SparkSession \
.builder \
.master('yarn') \
.appName("Python Spark SQL basic example") \
.getOrCreate()

Master can be yarn, mesos, Kubernetes or local[x], x > 0

How to Run :
1. Organize the folders and create a python file under bin folder.
2. Write above codes in the .py file.
3. Execute the file using spark-submit command.

spark2-submit \
/devl/example1/src/main/python/bin/basic.py

Spark Session : spark-submit
Spark-submit is a utility to run a pyspark application job by specifying options and configurations.

spark-submit \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
--driver-memory <value>g \
--executor-memory <value>g \
--executor-cores <number of cores> \
--jars <comma separated dependencies> \
--packages <package name> \
--py-files \
<application> <application args>

Spark Session : spark-submit
--master : Cluster Manager (yarn, mesos, Kubernetes, local, local[k])
local – Use local to run locally with one worker thread.
local[k] – Specify k as the number of cores you have locally; this runs the application with k worker
threads.

--deploy-mode: Either cluster or client


Cluster: The driver runs on one of the worker nodes; it shows up as the driver in the Spark UI of your
application. We can't see the logs on the terminal. Logs are available only in the UI or via the yarn CLI.
yarn logs -applicationId application_1622930712080_16253
Mainly used for production jobs.

Client: Driver runs locally where we submit the application.


See the logs on the terminal.
Mainly used for interactive or debugging purpose.

Spark Session : spark-submit
--conf: We can provide runtime configurations, shuffle parameters, application configurations using –conf.
Ex: --conf spark.sql.shuffle.partitions=300
This configures the number of partitions that are used when shuffling data for joins or aggregations.
https://spark.apache.org/docs/latest/sql-performance-tuning.html

--conf spark.yarn.appMasterEnv.HDFS_PATH="practice/retail_db/orders"
We can set environment variables like this when spark is running on yarn.
https://spark.apache.org/docs/latest/running-on-yarn.html#configuration

Spark Session : spark-submit
--driver-memory : Amount of memory to allocate for a driver (Default: 1024M).

--executor-memory : Amount of memory to use for the executor process.

--executor-cores : Number of CPU cores to use for the executor process.

Spark Session : spark-submit
--jars: Dependency .jar files.
Ex : --jars /devl/src/main/python/lib/ojdbc7.jar,file2.jar,file3.jar

--packages: Pass the dependency packages.


Ex : --packages org.apache.spark:spark-avro_2.11:2.4.4

--py-files: Use --py-files to add .py and .zip files. Files specified with --py-files are uploaded to the cluster before it
runs the application.
Ex - --py-files file1.py,file2.py,file3.zip

Spark Session : spark-submit
spark-submit \
--master "yarn" \
--deploy-mode "client" \
--conf spark.sql.shuffle.partitions=300 \
--conf spark.yarn.appMasterEnv.HDFS_PATH="practice/retail_db/orders" \
--driver-memory 1024M \
--executor-memory 1024M \
--num-executors 2 \
--jars /devl/src/main/python/lib/ojdbc7.jar,file2.jar,file3.jar \
--packages org.apache.spark:spark-avro_2.11:2.4.4 \
--py-files file1.py,file2.py,file3.zip \
/dev/example1/src/main/python/bin/basic.py arg1 arg2 arg3

Spark Session : Commonly
Used Functions

Spark Session : Commonly Used Functions
• version:
Returns Spark version where your application is running. spark.version
• range(start, end=None, step=1, numPartitions=None):
This creates a DataFrame with a range of values.
• createDataFrame(data, schema=None, samplingRatio=None, verifySchema=True)) :
This creates a DataFrame from a collection(list, dict), RDD or Python Pandas.
• sql() :
Returns a dataframe representing the result of a given query.
• table():
Returns the specified table as dataframe.
• sparkContext:
Returns sparkContext
• conf():
Runtime configuration (get and set).
Ex- spark.sql.shuffle.partitions
• read(): Used to load a dataframe from external storage systems.
• udf() : Dedicated section for this.
• newSession()
• stop() : Stop the underlying SparkContext.
• catalog()
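For the last few entries (newSession, stop, catalog), a minimal sketch of typical usage:

# catalog: inspect databases/tables/functions registered in the session
spark.catalog.listDatabases()
spark.catalog.listTables()

# newSession: a separate SQL configuration and temp-view namespace, same SparkContext
spark2 = spark.newSession()

# stop: stop the underlying SparkContext when the application is done
# spark.stop()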
SparkSession - Version

Spark Session : version Method
• version: Returns Spark version where your application is running.
spark.version
spark.sparkContext.version
sc.version

SparkSession - range

Spark Session : range() Method
• range():
This creates a DataFrame with a range of values.
Ex- 1 :
df = spark.range(1,10,2)
Ex-2:
df =spark.range(10)

SparkSession - createDataFrame

Spark Session : createDataFrame() Method
• createDataFrame() :
This creates a DataFrame from a collection(list, dict), RDD or Python Pandas.

Ex-1 : Using Python List


lst = (('Robert',35),('James',25))
df = spark.createDataFrame(data=lst)                           ##Without Schema
df = spark.createDataFrame(data=lst,schema=('Name','Age'))     ##With Schema

Ex-2 : Using Python Dict


dict = ({"name":"robert","age":25}, {"name" : "james","age" : 31})
df = spark.createDataFrame(dict)

Ex-3: Using RDD


rdd = sc.parallelize(lst)
df = spark.createDataFrame(data=rdd,schema=('name string, age long'))

Ex-4: Using Row in RDD


#Row is used to create row Object using named arguments.
from pyspark.sql import Row
rdd = sc.parallelize((Row(name='James',age=31),Row(name='Robert',age=55)))
df = spark.createDataFrame(data=rdd)

Spark Session : createDataFrame() Method
• Ex-5 : Using a Python Pandas DataFrame
A Pandas dataframe is a two-dimensional structure with named rows and columns, so data is aligned in
a tabular fashion in rows and columns.

import pandas as pd
data = (('tom', 10), ('nick', 15), ('juli', 14))
df_pandas = pd.DataFrame(data,columns=('Name','Age'))
df = spark.createDataFrame(data=df_pandas)

Performance: Pandas DataFrame vs Spark DataFrame

(Figure: benchmark chart. Source: www.databricks.com)
Spark Session : sql() Method
• sql() :
Returns a dataframe representing the result of a given query.

Ex-1:
lst1 = (('Robert',35),('James',25))
lst2 = (('Robert',101),('James',102))

df_emp = spark.createDataFrame(data=lst1,schema=('EmpName','Age'))
df_emp.createOrReplaceTempView("emp")

df_dept = spark.createDataFrame(data=lst2,schema=('EmpName','DeptNo'))
df_dept.createOrReplaceTempView("dept")

df_joined = spark.sql(""" select e.EmpName, e.Age, d.DeptNo from emp e join dept d on e.EmpName = d.EmpName """)

createOrReplaceTempView("table1")         # Creates the view in the current database; valid for only one session.

createOrReplaceGlobalTempView("table1")   # Creates the view in the global_temp database.
                                          # Valid across all sessions of an application.
Ex-2: Using a Hive Table
df = spark.sql(""" select * from emp """)
Spark Session : table() Method
• table() :
Returns the specified table as dataframe.

Ex-1:
lst1 = (('Robert',35),('James',25))
df_emp = spark.createDataFrame(data=lst1,schema=('EmpName','Age'))
df_emp.createOrReplaceTempView("emp")

df_op = spark.table("emp")
sorted(df_op.collect()) == sorted(df_emp.collect())

Spark Session : conf() Method
• conf:
We can provide runtime configurations, shuffle parameters and application configurations using the --conf
flag of spark-submit or the spark.conf object.
Ex-1: System Defined
spark.conf.get("spark.sql.session.timeZone")
spark.conf.get('spark.sql.shuffle.partitions')
spark.conf.set('spark.sql.shuffle.partitions', 300)  # number of partitions used when shuffling data
                                                     # for joins or aggregations
https://spark.apache.org/docs/latest/sql-performance-tuning.html

Ex-2 : Spark Running on YARN Environment Variables


spark.conf.set('spark.yarn.appMasterEnv.HDFS_PATH','practice/retail_db/orders')
--conf spark.yarn.appMasterEnv.HDFS_PATH=practice/retail_db/orders   # from spark-submit
https://spark.apache.org/docs/latest/running-on-yarn.html#configuration

Spark Session : read() Method
• read: Returns a DataFrameReader, the interface used to load a DataFrame from external storage systems.

Load a csv File:

Load a Text File:

Load a orc File: (Column Oriented)

Load a Parquet File: (Column Oriented)

Load a json File:

Load a avro File: (Row Based)

Read a hive Table:

Read a JDBC:

Spark Session : read – csv
Load a csv File:
Ex-1: df = spark.read.load(path='practice/retail_db/orders', format='csv', \
schema=('order_id int,order_date string,order_customer_id int,order_status string'))

Ex-2: df = spark.read.load(path='practice/retail_db/orders', format='csv',inferSchema=True)

Ex-3: df = spark.read.load(path='practice/retail_db/orders', format='csv',header=True)

Ex-4:
df = spark.read.load('practice/retail_db/testSpace.txt', format='csv', sep=',',
                     ignoreLeadingWhiteSpace=True, ignoreTrailingWhiteSpace=True)
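The same reads can also be written with the format-specific shorthand on the reader; a small sketch using the same path:
df = spark.read.csv('practice/retail_db/orders', schema='order_id int,order_date string,order_customer_id int,order_status string')
df = spark.read.csv('practice/retail_db/orders', header=True, inferSchema=True)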

Spark Session : read – text
Load a text File:
• Use the text format for fixed-length records or free-form lines.
• The default field name is 'value'.
• You may also load the data into an RDD first.
• Convert an RDD to a DataFrame using toDF() and Row, as shown in the sketch below.

df = spark.read.load('practice/retail_db/orders',format='text')

--Read the whole text file into a single row.


df =spark.read.load('practice/retail_db/orders',format='text', wholeText=True)
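A minimal sketch of the RDD route mentioned above, converting lines to a DataFrame with Row and toDF() (the columns picked out are illustrative):
from pyspark.sql import Row
rdd = sc.textFile('practice/retail_db/orders')
df = rdd.map(lambda line: Row(order_id=int(line.split(',')[0]), order_status=line.split(',')[3])).toDF()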

Spark Session : read – orc/parquet
• Load a orc File
df = spark.read.load('practice/retail_db/orders_orc',format='orc')

• Load a Parquet File


df = spark.read.load('practice/retail_db/orders_parquet',format='parquet')

CSV, JSON and AVRO are Row-based File formats. Sample data in CSV:
(Sequence file is also row based)
ID,FIRST_NAME,AGE
1, Matthew, 19
2, Joe,25

ORC,PARQUET are Column-based File formats.


ID/INT/3:1,2
FIRST_NAME/STRING/11:Matthew,Joe
AGE/INT/6:19,25

Spark Session : read – orc/parquet
In general, column-oriented formats work well when queries access only a small number of columns in the
table. Conversely, row oriented formats are appropriate when a large number of columns of a single row are
needed for processing at the same time.
Column-oriented formats need more memory for reading and writing, since they have to buffer a row split
in memory, rather than just a single row. Also, it’s not usually possible to control when writes occur (via
flush or sync operations), so column-oriented formats are not suited to streaming writes, as the current file
cannot be recovered if the writer process fails. On the other hand, row-oriented formats like sequence files
and Avro datafiles can be read up to the last sync point after a writer failure. It is for this reason that Flume
uses row-oriented formats.

Spark Session : read – json
Load a json File:
df = spark.read.load('practice/retail_db/orders_json', format='json')

Spark Session : read – avro
Load a avro File:
• Avro Is a third party file format. We need to import its package or jar file while launching a spark-submit
application or pyspark shell. spark by default does not support it.

pyspark2 --master yarn \


--packages org.apache.spark:spark-avro_2.11:2.4.4

df = spark.read.load('practice/retail_db/orders', format='avro')

Spark Session : read() Method
Load a Hive Table: If Hive and Spark are integrated, we can create data frames from data in Hive tables or
run Spark SQL queries against it.

spark.sql(""" SELECT * FROM <db>.<table_name> """).show()


spark.table("<db>.<table_name>").show()

Spark Session : read() Method
Load a JDBC Table:
• Make sure the JDBC driver jar is registered using --packages or --jars while launching pyspark or spark-submit.
• Typical jdbc driver jars are located in the /usr/share/java folder. You may keep them there or copy them to your project lib folder.

pyspark2 --jars <jdbc driver jar file>

Ex-1: Table

df = spark.read.format("jdbc") \
    .option("url", "jdbc:oracle:thin:@xxxx-xxx-xxxx:1521/xxx") \
    .option("driver", "oracle.jdbc.driver.OracleDriver") \
    .option("dbtable", "ORDERS") \
    .option("user", "someUser") \
    .option("password", "somePsw") \
    .load()

Spark Session : read() Method
Ex-2: Query

df = spark.read.format("jdbc") \
    .option("url", "jdbc:oracle:thin:@xxxx-xxx-xxxx:1521/xxx") \
    .option("driver", "oracle.jdbc.driver.OracleDriver") \
    .option("dbtable", "(SELECT * FROM T_EMP WHERE ID=1) query") \
    .option("user", "someUser") \
    .option("password", "xxx") \
    .load()

Spark Session : read() Method
Ex-3: Partition
Partitioning can be done only on numeric, date or timestamp fields. If there is no numeric field, generate one; for example,
use ROWNUM to generate a dummy numeric field in an Oracle database. Define partitionColumn, numPartitions, lowerBound and
upperBound.

df = spark.read.format("jdbc") \
    .option("url", "jdbc:oracle:thin:@xxxx-xxx-xxxx:1521/xxx") \
    .option("driver", "oracle.jdbc.driver.OracleDriver") \
    .option("dbtable", "ORDERS") \
    .option("partitionColumn", "ORDER_ID") \
    .option("lowerBound", "500") \
    .option("upperBound", "1000") \
    .option("numPartitions", "5") \
    .option("user", "someUser") \
    .option("password", "somePassword") \
    .load()

Spark Session : read() Method
lowerBound = 500
upperBound = 1000
numPartitions = 5
So stride = (upperBound - lowerBound) / numPartitions = 100.
Total records in the table = 1509

Partition 1: (599 records)
First 499 records (order_id < 500)
+
Select * from orders where order_id between 500 and 599  -> 100 records

Partition 2: (100 records)
Select * from orders where order_id between 600 and 699  -> 100 records

Partition 3: (100 records)
Select * from orders where order_id between 700 and 799  -> 100 records

Partition 4: (100 records)
Select * from orders where order_id between 800 and 899  -> 100 records

Partition 5: (610 records)
Select * from orders where order_id >= 900
Spark Session : read() Method
Ex-4: Partition without a numeric field

df = spark.read.format("jdbc") \
    .option("url", "jdbc:oracle:thin:@xxxx-xxx-xxxx:1521/xxx") \
    .option("driver", "oracle.jdbc.driver.OracleDriver") \
    .option("dbtable", "(select t1.*, cast(ROWNUM as number(5)) as num_rows from (select * from orders) t1) oracle_table1") \
    .option("partitionColumn", "num_rows") \
    .option("lowerBound", "500") \
    .option("upperBound", "1000") \
    .option("numPartitions", "10") \
    .option("user", "someUser") \
    .option("password", "somePassword") \
    .load()

Spark Session : spark.udf
• UDFs are the User Defined Functions. Spark UDFs are similar to RDBMS User Defined
Functions.
• If you need a function that the pyspark built-in features don't provide, you
can create a udf and use it in DataFrames and Spark SQL.
• UDFs are error-prone and should be designed carefully. First check whether a similar function is
available in the pyspark functions library (pyspark.sql.functions). If not designed properly, we
may run into optimization and performance issues.
• We can use UDFs both in DataFrame and Spark SQL.
1. For Spark SQL, create a python function/udf and register it using spark.udf.register
method.
2. For DataFrame, create a udf by wrapping under @udf or udf() function.

Ex – 1 (Create a udf, use it in DataFrame and register for spark sql)

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType, IntegerType

@udf(returnType=StringType())
def initCap(s):
    finalStr = ""
    for word in s.split(" "):
        finalStr = finalStr + word[0:1].upper() + word[1:] + " "
    return finalStr.strip()

DataFrame:
df.select(df.emp_name, initCap(df.emp_name)).show()

Spark Sql:
spark.udf.register("initcap1", initCap)
spark.sql(""" select emp_name, initcap1(emp_name) from default.emp """).show()

Ex-2: (Using Python Function and Register in Spark sql)

def convertCap(s):
    finalStr = ""
    for word in s.split(" "):
        finalStr = finalStr + word[0:1].upper() + word[1:] + " "
    return finalStr.strip()

Spark Sql:
spark.udf.register("initcap", convertCap)
spark.sql(""" select emp_name, initcap(emp_name) from default.emp """).show()

Ex-3: (Using Python Lambda Function and Use it in Spark Sql)

from pyspark.sql.types import IntegerType


from pyspark.sql.functions import udf
slen = udf(lambda s: len(s), IntegerType())
spark.udf.register("slen", slen)
spark.sql("SELECT slen('test')").collect()

Spark Session : newSession()
spark.newSession():
• Returns a new SparkSession that has separate SQLConf, registered temporary views and
UDFs, but shares the SparkContext and table cache.
• Ex- The registered udfs will not be visible to the new session.

Ex-1 (Using udf) -> separate UDF registry (SQLConf, temp views and UDFs are not shared)


new_spark = spark.newSession()

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType, IntegerType

@udf(returnType=StringType())
def initCap(s):
    finalStr = ""
    for word in s.split(" "):
        finalStr = finalStr + word[0:1].upper() + word[1:] + " "
    return finalStr.strip()

spark.udf.register("initcap1", initCap)
spark.sql(""" select emp_name, initcap1(emp_name) from default.emp """).show()

spark.sql(""" select emp_name, initcap1(emp_name) from default.emp


Aurora Academy """).show()
of Training. Copyright [email protected]
- All Rights Reserved. Subscribe at Learn-Spark.nfo
Ex-2: (Using Table data) -> Shared SparkContext
new_spark = spark.newSession()

spark.sql (""" create table student(name int) """);


spark.sql (""" insert into student values (1) """);

spark.sql (""" select count(*) from student """).show()


new_spark.sql (""" select count(*) from student """).show()

Spark Session : stop()
spark.stop(): To stop the underlying spark context.

Spark Session : catalog()
• The Catalog, introduced in Spark 2.0, is a standard API for accessing metadata in Spark SQL.
• This works both for Spark Sql and Hive Metadata.
• Below are different methods in catalog for extracting important information.

Database functions:          currentDatabase, listDatabases, setCurrentDatabase
Table functions:             listTables, listColumns, cacheTable, isCached, uncacheTable,
                             clearCache, recoverPartitions, refreshTable, refreshByPath
View functions:              dropTempView, dropGlobalTempView
Function-related functions:  listFunctions, registerFunction (= spark.udf.register)
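A few of these catalog calls in action (a small sketch; the default database and the emp table are assumed to exist):
spark.catalog.currentDatabase()
spark.catalog.listDatabases()
spark.catalog.listTables("default")
spark.catalog.listColumns("emp", "default")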
Data Types

Supported Data Types:
Below are some of the important data types supported in both Data Frame and Spark SQL. For full list, please visit
https://spark.apache.org/docs/latest/sql-ref-datatypes.html.
from pyspark.sql.types import *
Numeric Types
IntegerType() 4-byte signed integer numbers
FloatType() 4-byte single-precision floating point numbers
DoubleType() 8-byte double-precision floating point numbers

String Types
StringType() Character String Values
VarcharType(length) Variant of StringType with Length limitation
CharType(length) Variant of VarcharType with Fixed Length

Boolean Types
BooleanType () Boolean Values (True or False. Also can have Null Values)

Binary Type
BinaryType () Byte sequence values
Supported Data Types:
Date Types
TimestampType() year, month, day, hour, minute, second, time zone
DateType () year, month, day

Complex Type
ArrayType (elementType,containsNull)
MapType (keyType, valueType,valueContainsNull)
StructType (fields)

Supported Data Types:
from pyspark.sql.types import StructType,StructField, StringType, IntegerType, DateType

Ex-1
schema = StructType((
StructField("name",StringType(),True),
StructField("id", IntegerType(),True),
))

data=(("James",1),
("Robert",2),
("Maria",3)
)

df = spark.createDataFrame(data=data,schema=schema)
df.printSchema()
df.show(truncate=False)

Supported Data Types:
Ex-2 (Map Types)
schema=MapType(StringType(),StringType())

schema = StructType((
StructField('name', StringType(), True),
StructField('properties', MapType(StringType(),StringType()),True)
))

d = ( ('James',{'hair':'black','eye':'brown'}),
('Michael',{'hair':'brown','eye':None}),
('Robert',{'hair':'red','eye':'black'})
)
df_map= spark.createDataFrame(data=d, schema = schema)
df_map.printSchema()
df_map.show(truncate=False)
df_map.select(df_map.properties).show(truncate=False)
df_map.select(df_map.properties['eye']).show(truncate=False)

Supported Data Types:
Ex-3 (Array Types)
schema=ArrayType(IntegerType())

schema = StructType((
StructField('name', StringType(), True),
StructField('mobileNumbers', ArrayType(IntegerType()),True)
))

d = ( ('James',(123,456,789)),
('Michael',(234,456,678)),
('Robert',(168,89,190))
)
df_arr = spark.createDataFrame(data=d, schema = schema)
df_arr.printSchema()
df_arr.show(truncate=False)
df_arr.select(df_arr.mobileNumbers[1]).show()

Supported Data Types:
The types we have seen so far -> aliases used in Spark SQL

For Ex-
IntegerType -> int, integer
StringType  -> string
BooleanType -> boolean

Supported Data Types:
Special Values:
• None (Null)

• Inf, -Inf (FloatType or DoubleType. Infinity)

• NaN (FloatType or DoubleType. Not a Number)


NaN = NaN returns True.
In aggregations, all NaN values are grouped together.
Ex – spark.sql (""" SELECT float('NaN') AS col """).show()
spark.sql(""" SELECT double('NaN') = double('NaN') AS col """).show()
spark.sql (""" CREATE TABLE test (c1 int, c2 double) """)
spark.sql (""" INSERT INTO test VALUES (1, double(10)) """)
spark.sql (""" INSERT INTO test VALUES (2, double(10)) """)
spark.sql (""" INSERT INTO test VALUES (3, double('NaN')) """)
spark.sql (""" INSERT INTO test VALUES (4, double('NaN')) """)
spark.sql (""" INSERT INTO test VALUES (5, double('NaN')) """)
spark.sql (""" SELECT c2,count(*) FROM TEST GROUP BY C2""").show()

DataFrame Rows

DataFrame is a Dataset organized into named columns/rows.

Name Age
Robert 31
Alicia 25
Deja 19
Majoj 31

Row
• Represented as a record/row in DataFrame.
• We can create a Row object by using named arguments, or create a custom Row like class.
• Available in pyspark.sql.Row

1. Row Object:
from pyspark.sql import Row

row = Row(name="Alice", age=11)


row.name
'name' in row
'Alice' in row.name

lst=(Row(name="Alice",age=11), Row(name="Robert",age=35),Row(name="James",age=33))
rdd = sc.parallelize(lst)
for i in rdd.collect(): print (str(i.age) + ' ' + i.name)

df = spark.createDataFrame(lst)

Row
2. Custom Class from Row
Person = Row("name", "age")
p1=Person("James", 40)
p2=Person("Alice", 35)
print(p1.name +","+p2.name)

lst=(Person("Alice",11), Person("Robert",35),Person("James",33))
rdd=sc.parallelize(lst)
for i in rdd.collect(): print(i.name)

df = spark.createDataFrame(lst)

Row
Row Methods:

count(): Return number of occurrences of value


person = Row(name="Alice", age=11,username="Alice")
person.count("Alice")

index(): Return first index of value.


person.index(11)

asDict(): Return the Row as a dict


person.asDict()

DataFrame Columns

Column
• A column in a DataFrame.
• Available in pyspark.sql.Column

ord = spark.read.load('practice/retail_db/orders', format='csv', sep=',', schema=('order_id int,order_date timestamp,order_customer_id int,order_status string'))

1. Select a Column
ord.order_id or ord["order_id"]
ord.select(col("*")).show()   # from pyspark.sql.functions import col

2. Give an alias name to a column


alias()
Ex – ord.select(ord.order_id.alias("orderId")).show(5)

3. Order a Column
asc()
asc_nulls_first()
asc_nulls_last()
desc()
desc_nulls_first()
desc_nulls_last()
Ex - ord.orderBy(ord.order_status.asc()).select(ord.order_status).distinct().show()
Column
4. cast() : Convert the type of a column. astype() is an alias of cast().
PS: Convert order_id column from Integer Type to String Type.
ord.select(ord.order_id.cast("string"))

5. between():
PS: Print all the orders between 10 and 20.
ord[ord.order_id.between(10,20)].show()
ord.where(ord.order_id.between(10,20)).show()

6. contains(), startswith(), endswith(), like(), rlike()
PS: Print all the orders with Status CLOSED.
ord.where(ord.order_status.contains('CLOSED')).show()

PS: Print all the statuses containing the letters 'LO'.


ord.where(ord.order_status.like('%LO%')).show()

7.isin(): Multiple values.


PS: Print all the status with CLOSED or PENDING Orders.
ord.where(ord.order_status.isin('CLOSED','PENDING')).select(ord.order_status).distinct().show()
Column
8. eqNullSafe() : Equality test that is safe for null values.
The regular equality operator (==) returns null when either side is null/None; use eqNullSafe() to compare such values safely.
from pyspark.sql import Row
df1 = spark.createDataFrame(( Row(id=1, value='foo'), Row(id=2, value=None) ))
df1.select( df1['value'] == 'foo', df1['value'].eqNullSafe('foo'), df1['value'].eqNullSafe(None)).show()

isNull(), isNotNull()

9. substr
PS: Find Number of completed orders in the year 2013.
ord.where((ord.order_date.substr(1,4).contains('2013')) & (ord.order_status.contains('CLOSED') )).count()
ord.where((ord.order_date.substr(1,4) == '2013') & (ord.order_status == 'CLOSED' )).count()

10.getField() : gets a field by name in a StructField


Ex - #create a struct field
df1 = spark.createDataFrame((Row(r=Row(a1=1, a2="b")),))
df1.select(df1.r.getField("a2")).show()
df1.select(df1.r.a1).show()

11. getItem(): An expression that gets an item at position ordinal out of a list, or gets an item by key out of a dict.
Ex- df = spark.createDataFrame([([1, 2], {"key": "value"})], ["lst", "dict"])
df.select(df.lst.getItem(0), df.dict.getItem("key")).show()
Column
12. when(), otherwise() :
Kind of if-else statements in SQL. Using this, we can check multiple conditions in sequence and returns a value when
the first condition is met.

Ex –
from pyspark.sql.functions import when
ord.select(ord.order_status,
when(ord.order_status == 'PENDING_PAYMENT', 'PP')
.when(ord.order_status == 'CLOSED', 'CL')
.when(ord.order_status == 'COMPLETE', 'CO')
.when(ord.order_status == 'PROCESSING', 'PR')
.otherwise(ord.order_status).alias("order_status2")).show(10)

DataFrame Transformations and
Extractions

Transformations

• DataFrame APIs (Part 1)
   selection, filter, sort, set, join, aggregation, groupBy, window, sample

• DataFrame Built-in Functions (Part 2)
   New Column, Encryption, String, RegExp, Date, Null, Collection, Na,
   Math & Statistics, Explode & Flatten, Formatting, Json, Other

Extraction (Part 3)

• csv, text, parquet, orc, avro, json, hive, jdbc
Part 1 DataFrame APIs

DataFrame APIs : Selection APIs

### Prepare Data
ord = spark.read.load('retail_db/orders',sep=',',format='csv',schema=('order_id int,order_date timestamp,
order_customer_id int,order_status string'))

data=(('Robert',35,40,40),('Robert',35,40,40),('Ram',31,33,29),('Ram',31,33,91))
emp = spark.createDataFrame(data=data,schema=('name','score1','score2','score3'))
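The API examples in this part assume the Spark SQL built-in functions have been imported, e.g.:
from pyspark.sql.functions import *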

Selection or Projection APIs
• Select(*cols)
 Select one or more columns.
ord.select(ord.order_id,'order_id',"order_id", (ord.order_id +100).alias("order10")).show()
 Can apply necessary functions on the selected columns.
ord.select(lower(ord.order_status)).show()

• selectExpr(*expr)
 This is a variant of select that accepts SQL expressions.
ord.selectExpr('substring(order_date,1,10) as order_month').show()
ord.select(substring(ord.order_date,1,4).alias('order_year')).show()
 If we want to use any functions available in SQL but not in Spark Built-in functions, then we can use selectExpr.
stack(n, expr1, ..., exprk) - Separates expr1, ..., exprk into n rows.
df.selectExpr("stack(3,1,2,3,4,5,6)").show()

Selection or Projection APIs
• withColumn(colName, col)
 Applies a transformation and returns the result as a column of the DataFrame.
 The first argument is the new column name. If it is the same as an existing column name, the transformation will
replace that column.
 Otherwise a new column is added. Avoid reusing an existing column name unless you intend to overwrite it.
ord.withColumn('order_month',substring(ord.order_date,1,10)).show()

• withColumnRenamed(existingCol, newCol)
 Rename Existing Column.
ord.withColumnRenamed('order_id','order_id1').show()

Selection or Projection APIs
• drop(*cols)
 Drop a column.
order = ord.drop('order_id','order_date')

• dropDuplicates(subset=None)
 Drop duplicate rows.
 Optionally can consider only subset of columns.
emp.dropDuplicates().show()
emp.dropDuplicates(("name","score1","score2")).show()

DataFrame APIs : Filter APIs

### Prepare Data
ord = spark.read.load('retail_db/orders',sep=',',format='csv',schema=('order_id int,order_date timestamp,
order_customer_id int,order_status string'))

Filter APIs
• filter(condition): (alias: 'where')
 Filter rows using the given condition.
 Use '&' for AND and '|' for OR (boolean expressions), with each condition in parentheses.
 Use the column function isin() for matching multiple values.
 Or use the IN operator for SQL-style syntax.

ord.where((ord.order_id > 10) & (ord.order_id < 20)).show()


--Using isin()
ord.where(ord.order_status.isin('COMPLETE','CLOSED')).show()
--Using IN Operator
ord.where("order_status IN ('COMPLETE','CLOSED') ").show()

DataFrame APIs : Sort APIs

### Prepare Data
ord = spark.read.load('retail_db/orders',sep=',',format='csv',schema=('order_id int,order_date timestamp,
order_customer_id int,order_status string'))

data=(('a',1),('d',4),('c',3),('b',2),('e',5))
df = spark.createDataFrame(data=data,schema='col1 string,col2 int')

Sorting APIs
• sort() or orderBy():
 Sort specific column(s).
ord.sort(ord.order_date.desc(),ord.order_status.asc()).show()
ord.sort(ord.order_date,ord.order_status,ascending=(0,1)).show() #1 Ascending, 0 Descending

• sortWithinPartitions:
 At times we may not want a global sort, but a sort within each partition. In that case we can use sortWithinPartitions.
df.sortWithinPartitions(df.col1.asc(),df.col2.asc()).show()

DataFrame APIs : Set Operators

Set Operator APIs
• union() and unionAll():
 In DataFrames both behave the same and keep duplicate values (unlike SQL's UNION).
 Use distinct() after union or unionAll to remove duplicates.

• unionByName():
 The difference between this function and :func:`union` is that this function
resolves columns by name (not by position)
df1 = spark.createDataFrame(data=(('a',1),('b',2)),schema=('col1 string,col2 int'))
df2 = spark.createDataFrame(data=((2,'b'),(3,'c')),schema=('col2 int,col1 string'))
df1.union(df2).show()
df1.unionByName(df2).show()

• intersect(): Returns rows present in both DataFrames; removes duplicates.


• intersectAll(): Same as intersect. But retains the duplicates.
df1 = spark.createDataFrame(data=(('a',1),('a',1),('b',2)),schema=('col1 string,col2 int'))
df2 = spark.createDataFrame(data=(('a',1),('a',1),('c',2)),schema=('col1 string,col2 int'))
df1.intersect(df2).show()
df1.intersectAll(df2).show()

• exceptAll(): Returns rows present in one DataFrame but not in the other; retains duplicates.
df1.exceptAll(df2).show()
DataFrame APIs : Join

Join APIs
Join Type                              ~ SQL Join
inner                                  INNER JOIN, JOIN
outer, full, fullouter, full_outer     FULL OUTER JOIN
left, left_outer, leftouter            LEFT JOIN
right, right_outer, rightouter         RIGHT JOIN
cross                                  CROSS JOIN
left_anti, leftanti                    LEFT ANTI JOIN
leftsemi, left_semi                    LEFT SEMI JOIN

PT: Semi joins perform better than inner joins. Use them wherever possible.

Join APIs
• join (otherDF, on=None, how=None)
on: Joining Column
how: 'inner', 'outer', 'full', 'fullouter', 'full_outer', 'leftouter', 'left', 'left_outer', 'rightouter', 'right', 'right_outer', 'leftsemi',
'left_semi', 'leftanti', 'left_anti', 'cross'.

df1 = spark.createDataFrame(data=((1,'Robert'),(2,'Ria'),(3,'James')),schema='empid int,empname string')


df2 = spark.createDataFrame(data=((2,'USA'),(4,'India')),schema='empid int,country string')
df1.join(df2,df1.empid == df2.empid,'inner').select(df1.empid,df2.country).show()

• crossJoin(self, other)
• self Join
df1 = spark.createDataFrame(data=((1,'Robert',2),(2,'Ria',3),(3,'James',5)),schema='empid int,empname
string,managerid int')
df1.alias("emp1").join(df1.alias("emp2"),col("emp1.managerid") ==
col("emp2.empid"),'inner').select(col("emp1.empid"),col("emp1.empname"),col("emp2.empid").alias("managerid"),c
ol("emp2.empname").alias("managaer_name")).show()

Use of col(): Sometimes we need to refer to a column by name only, for example an alias created with withColumn or on an aliased
DataFrame. In that case we refer to it as col(column_name).
from pyspark.sql.functions import col
Join APIs
• Multi Column Join
df1 = spark.createDataFrame(data=((1,101,'Robert'),(2,102,'Ria'),(3,103,'James')),schema='empid int,deptid
int,empname string')
df2 = spark.createDataFrame(data=((2,102,'USA'),(4,104,'India')),schema='empid int,deptid int,country string')
df1.join(df2,(df1.empid == df2.empid) & (df1.deptid == df2.deptid)).show()

• Multi DataFrame Join


df1 = spark.createDataFrame(data=((1,'Robert'),(2,'Ria'),(3,'James')),schema='empid int,empname string')
df2 = spark.createDataFrame(data=((2,'USA'),(4,'India')),schema='empid int,country string')
df3 = spark.createDataFrame(data=((1,'01-jan-2021'),(2,'01-feb-2021'),(3,'01-mar-2021')),schema='empid
int,joindate string')
df1.join(df2,df1.empid==df2.empid).join(df3,df1.empid == df3.empid).show()

DataFrame APIs : Aggregation

Aggregation APIs
• summary
df.summary("count", "min", "25%", "75%", "max").show()
df.select("age", "name").summary("count").show()
• avg, max, min
ordItems.select(avg(ordItems.price)).show()
• sum, sumDistinct
ordItems.select(sum(ordItems.price), sumDistinct(ordItems.price)).show()
• count, countDistinct
ordItems.select(count(ordItems.order_item_product_id),countDistinct(ordItems.order_item_product_id)).show()
• first,last
ordItems.sort(ordItems.price.asc()).select(first(ordItems.price)).show()
• collect_set, collect_list
df = spark.createDataFrame(((1,100),(2,150),(3,200),(4,50),(5,50)),schema='id int,salary int')
df.select(collect_list(df.salary)).show(truncate=False)
• skewness
• variance
• stddev
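A quick sketch of the last three, reusing the ordItems DataFrame assumed above:
ordItems.select(skewness(ordItems.price), variance(ordItems.price), stddev(ordItems.price)).show()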

DataFrame APIs : groupBy

GroupBy API
When we apply groupBy on a DataFrame Column, it returns GroupedData object. It has below aggregate functions:

avg(),mean()
count()
min()
max()
sum()
agg() For multiple aggregations at once
pivot()
apply()

data = (("James","Sales","NY",9000,34),
("Alicia","Sales","NY",8600,56),
("Robert","Sales","CA",8100,30),
("Lisa","Finance","CA",9000,24),
("Deja","Finance","CA",9900,40),
("Sugie","Finance","NY",8300,36),
("Ram","Finance","NY",7900,53),
("Kyle","Marketing","CA",8000,25),
("Reid","Marketing","NY",9100,50)
)
schema=("empname","dept","state","salary","age")
df = spark.createDataFrame(data=data,schema=schema)

df.groupBy(df.dept)
<pyspark.sql.group.GroupedData object at 0x7f68eaead690>
GroupBy API
Using avg(),sum(),min(),max(),count(),agg()

Ex-1 (Using 1 column)


df.groupBy(df.dept).avg("salary").show()

Ex-2 (Multiple Columns)


df.groupBy(df.dept,df.state).min("salary","age").show()

Ex-3 (Using Agg() many aggregations)


df.groupBy(df.dept).agg(min("salary").alias('min_salary'),
max("salary").alias('max_salary'),
avg("salary").alias('avg_salary')
).show()

Ex-4 (Using filter or where)


df.where(df.state == 'NY').groupBy(df.dept).agg(min("salary").alias('min_salary')).where(col("min_salary") > 8000).show()

GroupBy API
Using pivot():
Transpose rows into columns.
Ex-
df_t = df.groupBy(df.dept).pivot("state").sum("salary")
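pivot() can also be given the list of values to pivot on, which avoids an extra pass over the data to discover the distinct values; a small sketch with the two states from the sample data:
df_t = df.groupBy(df.dept).pivot("state", ["CA", "NY"]).sum("salary")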

GroupBy API
Using unpivot:
Ex-
There is no such function as unpivot. We can do it using stack() function in a selectExpr.
stack(n, expr1, ..., exprk) - Separates expr1, ..., exprk into n rows. Uses column names col0, col1, etc. by default unless specified otherwise.

spark.sql(""" select stack(3,1,2,3,4,5,6) """).show()


spark.sql(""" select dept, stack(2,'CA',CA,'NY',NY) as (state,salary) from d """).show()
df_t.selectExpr("dept","stack(2,'CA',CA,'NY',NY) as (state,salary)").show()

GroupBy API
/* Not Recorded */
Using apply(pandas_udf):
• Takes a pandas udf and apply it to the current dataframe and returns a dataframe.
• Pandas_udf is available in pyspark.sql.functions
• pandas_udf(f=None, returnType=None, functionType=None)
f -> Optional. The user-defined function.
returnType -> Optional. The return type of the user-defined function.
functionType -> Optional. A value in pyspark.sql.functions.PandasUDFType. Default: SCALAR. Exists for compatibility.

Ex-
from pyspark.sql.functions import pandas_udf, PandasUDFType
df = spark.createDataFrame(((1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)),("id", "v"))
@pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP) # doctest: +SKIP
def normalize(pdf):
v = pdf.v
return pdf.assign(v=(v - v.mean()) / v.std())

df.groupby("id").apply(normalize).show()

DataFrame APIs : window

Window Functions
• Window Functions operates on a group of rows and return a single value for each input row.
• Main Package: pyspark.sql.window. It has two classes Window and WindowSpec
• Window class has APIs such as partitionBy, orderBy, rangeBetween, rowsBetween.
• WindowSpec class defines the partitioning, ordering and frame boundaries. It has also above 4 APIs.
• These APIs, such as partitionBy, return a WindowSpec object.
>>> spec = Window.partitionBy(df.dept)
>>> Window.partitionBy(df.dept)
<pyspark.sql.window.WindowSpec object at 0x7f88d8ef3910>

• To perform a window function, we have to partition the data using Window.partitionBy.
• Let's look at 3 types of Window Functions:
 Ranking
 Analytical
 Aggregate

Window Functions

(Diagram: window function categories - Ranking, Analytical and Aggregate functions)
Window Functions
--Prepare Input Data
data = (("James","Sales","NY",9000,34),
("Alicia","Sales","NY",8600,56),
("Robert","Sales","CA",8100,30),
("John","Sales","AZ",8600,31),
("Ross","Sales","AZ",8100,33),
("Kathy","Sales","AZ",1000,39),
("Lisa","Finance","CA",9000,24),
("Deja","Finance","CA",9900,40),
("Sugie","Finance","NY",8300,36),
("Ram","Finance","NY",7900,53),
("Satya","Finance","AZ",8200,53),
("Kyle","Marketing","CA",8000,25),
("Reid","Marketing","NY",9100,50)
)
schema=("empname","dept","state","salary","age")
df = spark.createDataFrame(data=data,schema=schema)

Window Functions
• Ranking Window Functions: Used to provide a ranking to the result within a partition.
 row_number() : Sequential Row Number.
 rank() : Ranks but gaps when ties.
 dense_rank() : Ranks with out any gaps.
 percent_rank() : Relative rank (i.e. percentile) of rows within a window partition. The first row is always 0 and the last row is
always 1.
 ntile() :returns the ntile group id (from 1 to n inclusive) in an ordered window partition. For example, if n is 4, the first
quarter of the rows will get value 1, the second quarter will get 2, the third quarter will get 3, and the last quarter will
get 4.
 cume_dist():Returns the cumulative distribution of values within a window partition,
i.e. the fraction of rows that are below the current row.
Ex –
from pyspark.sql.window import *
from pyspark.sql.functions import *
spec = Window.partitionBy("dept").orderBy("salary")
df.select(df.dept,df.salary) \
.withColumn("row_number",row_number().over(spec))\
.withColumn("rank",rank().over(spec))\
.withColumn("dense_rank",dense_rank().over(spec))\
.withColumn("percent_rank",percent_rank().over(spec))\
.withColumn("cume_dist",cume_dist().over(spec))\
.withColumn("ntile",ntile(3).over(spec))\Aurora Academy of Training. Copyright [email protected]
.show() - All Rights Reserved. Subscribe at Learn-Spark.nfo
Window Functions
• Analytical Window Functions:
 lag() : Return offset row value before the current row value.
 lead() : Return offset row value after the current row value.

Ex -
df.select(df.dept,df.salary) \
.withColumn("lag_prev_sal",lag("salary",1,0).over(spec)) \
.withColumn("lead_next_sal",lead("salary",1,0).over(spec)) \
.show()

Window Functions
• Aggregate Window Functions:
 avg
 sum()
 max()
 min()
 count()
 first()
 last()

Ex-1 (partition only):
spec = Window.partitionBy("dept")
df.select(df.dept,df.salary) \
 .withColumn("sum_sal",sum("salary").over(spec)) \
 .withColumn("max_sal",max("salary").over(spec)) \
 .withColumn("min_sal",min("salary").over(spec)) \
 .withColumn("avg_sal",avg("salary").over(spec)) \
 .withColumn("count_sal",count("salary").over(spec)) \
 .show()

Ex-2 (partition + orderBy):
spec = Window.partitionBy("dept").orderBy("salary")
df.select(df.dept,df.salary) \
 .withColumn("first_sal",first("salary").over(spec)) \
 .withColumn("last_sal",last("salary").over(spec)) \
 .show()

Window Functions
• rangeBetween:
 Takes two argument (start,end) to define frame boundaries.
 Default : unboundedPreceding and unboundedFollowing.
 Both `start` and `end` are relative from the current row. For example, "0" means "current row", while "-1" means one
off before the current row, and "5" means the five off after the current row.
 Recommend to use ``Window.unboundedPreceding``, ``Window.unboundedFollowing``, and ``Window.currentRow``
to specify special boundary values, rather than using integral values directly.

spec=Window.partitionBy(df.dept).\
orderBy(df.salary). \
rangeBetween(Window.unboundedPreceding, Window.unboundedFollowing)

spec=Window.partitionBy(df.dept).\
orderBy(df.salary). \
rangeBetween(Window.currentRow, Window.unboundedFollowing)

spec=Window.partitionBy(df.dept).\
orderBy(df.salary). \
rangeBetween(Window.currentRow,500)

df.select(df.dept,df.salary).withColumn("sum_sal",sum("salary").over(spec)).show()
Window Functions
• rowsBetween:
 Takes two argument (start,end) to define frame boundaries.
 Default : unboundedPreceding and unboundedFollowing.
 Both `start` and `end` are relative from the current row. For example, "0" means "current row", while "-1" means one
off before the current row, and "5" means the five off after the current row.
 Recommend to use ``Window.unboundedPreceding``, ``Window.unboundedFollowing``, and ``Window.currentRow``
to specify special boundary values, rather than using integral values directly.

spec1=Window.partitionBy(df.dept).\
orderBy(df.salary). \
rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

spec1=Window.partitionBy(df.dept).\
orderBy(df.salary). \
rowsBetween(Window.currentRow, Window.unboundedFollowing)

spec1=Window.partitionBy(df.dept).\
orderBy(df.salary). \
rowsBetween(Window.currentRow, 2)

df.select(df.dept,df.salary).withColumn("sum_sal",sum("salary").over(spec1)).show()
Window Functions
rangeBetween(Window.unboundedPreceding, Window.unboundedFollowing)
rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
dept salary sum_sal
Finance 7900 43300
Finance 8200 43300
Finance 8300 43300
Finance 9000 43300
Finance 9900 43300

rangeBetween(Window.currentRow, Window.unboundedFollowing)
rowsBetween(Window.currentRow, Window.unboundedFollowing)

dept salary sum_sal


Finance 7900 43300
Finance 8200 35400
Finance 8300 27200
Finance 9000 18900
Finance 9900 9900

Window Functions
rangeBetween(Window.currentRow, 500)

dept salary sum_sal


Finance 7900 24400
Finance 8200 16500
Finance 8300 8300
Finance 9000 9000
Finance 9900 9900

rowsBetween(Window.currentRow, 2)

dept salary sum_sal


Finance 7900 24400
Finance 8200 25500
Finance 8300 27200
Finance 9000 18900
Finance 9900 9900

Window Functions
### Prepare Input Data
--df DataFrame
data = (("James","Sales","NY",9000,34),
("Alicia","Sales","NY",8600,56),
("Robert","Sales","CA",8100,30),
("John","Sales","AZ",8600,31),
("Ross","Sales","AZ",8100,33),
("Kathy","Sales","AZ",1000,39),
("Lisa","Finance","CA",9000,24),
("Deja","Finance","CA",9900,40),
("Sugie","Finance","NY",8300,36),
("Ram","Finance","NY",7900,53),
("Satya","Finance","AZ",8200,53),
("Kyle","Marketing","CA",8000,25),
("Reid","Marketing","NY",9100,50)
)
schema=("empname","dept","state","salary","age")
df = spark.createDataFrame(data=data,schema=schema)

--df1 DataFrame
df1 = spark.range(10)

DataFrame API: Sampling

sample(withReplacement=None, fraction=None, seed=None):
• To get random sample records from the dataset.
withReplacement: True or False. With True, the same row can appear more than once in the sample.
Fraction: Between 0 to 1. 0.3 means 30%. Does not guarantee the exact 30% of the records.
Seed: Reproduce the same sample.
Ex-
dataset=spark.range(100)
dataset.sample(withReplacement=True,fraction=0.07,seed=10).show()

sampleBy(col, fractions, seed=None):


• Returns a stratified sample without replacement based on the fraction given on each stratum.
• If a stratum is not specified, we treat its fraction as zero.
Ex-
from pyspark.sql.functions import col
dataset = spark.range(0, 100).select((col("id") % 3).alias("key"))
sampled = dataset.sampleBy("key", fractions={0: 0.1, 1: 0.2}, seed=0)
sampled.groupBy("key").count().orderBy("key").show()

DataFrame Built-in
Part 2 Functions

DataFrame APIs : New Column
Functions

New Columns Functions:
Window Ranking Functions( dense_rank, rank, row_number etc) :

monotonically_increasing_id():
• A column that generates monotonically increasing 64-bit integers.
• The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive.
• Use Case: Create a Primary Key/Unique column.
df.withColumn('id',monotonically_increasing_id()).show()

lit():
• It creates a static column with value is provided.
df.withColumn('col',lit(10)).show()
• Can also be used to concat columns.
df.select(concat('salary',lit('|'),'age').alias('value')).show()

New Columns Functions:
expr(str):
• It takes SQL Expression as a string argument, executes the expression and returns a Column Type.
Ex - df.withColumn('empname_len',expr("length(empname)")).show()
• We can use SQL-like functions that are not present in pyspark column type and built-in functions(pyspark.sql.functions).
For example – CASE WHEN, Concat operator || etc
df.withColumn("age_desc",expr(" case when age > 50 then 'Senior' else 'Adult' end")).show()
df.withColumn('emp-dept',expr(" empname ||'-'|| dept")).show()
• Arithmetic Expression
df.select(expr(" age + 10 as age_10 ")).show()
• Spark does not check these expressions at compile time the way it checks other DataFrame operations; errors only show up at runtime.

spark_partition_id():
• Generates a column with partitions ids.
Ex-
df1 = spark.range(10)
df1 = df1.repartition(5)   # repartition returns a new DataFrame; reassign it
df1.select("id",spark_partition_id()).show()

New Columns Functions:
rand(seed):
• Generates a random column with independent and identically distributed (i.i.d.) samples from uniform
distribution.
df.withColumn('random_col',rand(70)).show(truncate=False)

randn(seed):
• Generates a column with independent and identically distributed (i.i.d.) samples from the standard
normal distribution.

(Figures: uniform distribution vs standard normal distribution)

DataFrame APIs : Column
Encryption Functions

Column Encryption Functions
sha1(col):
• Used for Encryption.
• Returns the hex string result SHA-1 Family.
• Only for string columns. Convert the int column to string.
df.select(df.age,sha1(df.age.cast('string'))).show(truncate=False)

sha2(col,numBits):
• Used for Encryption.
• Returns the hex string result of SHA-2 family.
• numBits : 0, 224, 256, 384, 512
df.select(df.age,sha2(df.age.cast('string'),224)).show(truncate=False)

hash(*cols):
• Any type of column or combination of columns.
• Calculates the hash code of given columns and return result as int column.
• May be used for Encryption.
df.select(df.age,hash(df.age)).show(truncate=False)
md5(cols):
• Calculates the MD5 digest and returns the value as a 32 character hex string.
df.select(df.age,md5(df.age.cast('string'))).show(truncate=False)
DataFrame APIs : String
Functions

String Manipulation Functions
split(str, pattern) : Splits str around pattern (pattern is a regular expression).
Ex-1
df = spark.createDataFrame((('ab12cd23fe27kl',),), ('s',))
df.select(split(df.s,'[0-9]+')).show(truncate=False)
Ex-2
ord.select(split(ord.order_date,'-')).show(truncate=False)

length(col): Computes the character length of string.


ord.select(length(ord.order_status)).distinct().show()

lower(col), upper(col), initcap(col):

ltrim(col),rtrim(col),trim(col):

lpad(col,len,pad), rpad(col,len,pad): Pad the string column to width `len` with `pad`.

reverse(col): Returns a reversed string.

repeat(col, n): Repeats a string column n times.

hex(col): Computes hex value of the given column.
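A couple of quick sketches of the functions listed above, using the ord DataFrame:
ord.select(initcap(lower(ord.order_status))).show(5)
ord.select(lpad(ord.order_status, 15, '-'), reverse(ord.order_status)).show(5)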


String Manipulation Functions
concat(*cols): Concatenates multiple input columns together into a single column
ord.withColumn('IDStatus',concat(ord.order_id,ord.order_status)).show()

concat_ws(sep, *cols): Concatenates multiple input string columns together into a single string column, using the
given separator.
ord.withColumn('IDStatus',concat_ws('-',ord.order_id,ord.order_status)).show()

substring(str, pos, len): Substring starts at `pos` and is of length `len` when str is String type.
ord.withColumn('orderYear',substring(ord.order_date,1,4)).show()

substring_index(str, delim, count): Returns the substring from string str before count occurrences of the delimiter
delim. If count is positive, everything to the left of the final delimiter (counting from the left) is returned. If count is
negative, everything to the right of the final delimiter (counting from the right) is returned. substring_index performs a
case-sensitive match when searching for delim.
ord.withColumn('sub',substring_index(ord.order_date,'-',1)).show()

instr(str, substr): Locate the position of the first occurrence of substr column in the given string.
Returns null if either of the arguments are null.
ord.withColumn('instr',instr(ord.order_status,'LO')).show()

String Manipulation Functions
locate(substr, str, pos=1): Locate the position of the first occurrence of substr in a string column, after position pos.
The position is not zero based, but 1 based index. Returns 0 if substr could not be found in str.
ord.withColumn('instr',locate('00',ord.order_date,2)).show()

translate(srcCol, matching, replace): Translate any character in the `srcCol` by a character in `matching`.
df = spark.createDataFrame((('translate',),), ('col',))
df.select(translate(df.col, "rnlt", "123")).show()

String Manipulation Functions
overlay(src,replace,pos,len): New in version 3.0. Overlay the specified portion of src with replace starting from the
byte position pos of src and proceeding for len bytes.
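A small sketch (Spark 3.0+ assumed, since overlay is new in 3.0):
df = spark.createDataFrame([("SPARK_SQL", "CORE")], ("x", "y"))
df.select(overlay("x", "y", 7).alias("overlayed")).show()   # SPARK_CORE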

DataFrame APIs : RegExp
Functions

regexp_extract(str, pattern, idx):
We can extract a pattern.
str: string or column from which we want to extract data.
pattern: the regex pattern.
idx: the group number to extract. Returns an empty string if the pattern does not match.

df = spark.createDataFrame(data=(('11ss1 ab',),), schema=('str',))
df.select(df.str, regexp_extract(df.str,'(\d+)(\w+)(\s)([a-z]+)',1).alias('op')).show()

regexp_replace(str, pattern, replacement):
We can replace a matching string or substring in a column with another string.
df.select(regexp_replace(lit('11ss1 ab'),'(\d+)','xx').alias('op')).show()

Quantifier   Meaning
\d           Matches digits 0-9.
\w           Matches alphabets and digits.
\s           Matches space.
.            Matches any character except newline.
?            Matches 0 or 1 of the preceding token.
+            Matches 1 or more of the preceding token.
[a-zA-Z]     Anything in the range a-z or A-Z.

Use when() for conditional replace.


addr_data = ((1,"2625 Indian School Rd","Phoenix"),
(2,"1234 Thomas St","Glendale"))
df = spark.createDataFrame(addr_data,("id","addr","city"))
df.withColumn('new_addr',
when(df.addr.endswith('Rd'),regexp_replace(df.addr,'Rd','Road')) \
.when(df.addr.endswith('St'),regexp_replace(df.addr,'St','Street')) \
.otherwise(df.addr)) \
.show(truncate=False)
rlike():
Not a dataframe function but a column function we can use to check if a pattern is found or not.
df.select(df.str.rlike('(\d+)')).show()

DataFrame APIs : Date Functions

### Prepare Data
ord = spark.read.load('practice/retail_db/orders',sep=',',format='csv',schema=('order_id int,order_date timestamp,
order_customer_id int,order_status string'))

ord_new = ord.withColumn('new_order_date',date_add(ord.order_date,50))

current_date():
Returns the current date.
df.select(current_date()).show()

current_timestamp():

next_day(date, dayOfWeek):
Returns the first date which is later than the value of the date column.
dayOfWeek - "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun“.
ord.select(ord.order_date,next_day(ord.order_date,'Fri')).distinct().show()

last_day(date):
Returns the last day of the month which the given date belongs to.
ord.select(ord.order_date,last_day(ord.order_date)).distinct().show()

dayofweek(col):
dayofmonth(col):
dayofyear(col):
weekofyear(col):
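A combined sketch of the extractors listed above, using the ord DataFrame:
ord.select(ord.order_date, dayofweek(ord.order_date), dayofmonth(ord.order_date), weekofyear(ord.order_date)).show(5)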

second(col):
Extract the seconds of a given date as integer.
ord.select(current_timestamp(),second(current_timestamp())).distinct().show(truncate=False)
minute(col):
hour(col):
month(col):
quarter(col):
year(col):

months_between(date1, date2, roundOff=True)


The result is rounded off to 8 digits unless `roundOff` is set to `False`.
ord_new.select(ord_new.order_date, ord_new.new_order_date,
               months_between(ord_new.order_date, ord_new.new_order_date, roundOff=False)).show()

date_add(start, days) : Add Number Of Days.


ord.withColumn('new_order_date',date_add(ord.order_date,50)).show()

date_sub(start, days) : Subtract Number Of Days.


ord.withColumn('new_order_date',date_sub(ord.order_date,50)).show()

add_months(start, months): Add Number of months.


ord.withColumn('new_order_date',add_months(ord.order_date,3)).show()

datediff(end, start): Returns the number of days from `start` to `end`.


ord_new.select(datediff(ord_new.new_order_date,ord_new.order_date)).show()
date_trunc(format, timestamp)
Returns timestamp truncated to the unit specified by the format.
format: 'year', 'yyyy', 'yy', 'month', 'mon', 'mm', 'day', 'dd', 'hour', 'minute', 'second', 'week', 'quarter' .
ord.withColumn('new_order_date',date_trunc('yyyy',ord.order_date)).show()

date_format(date, format):
Converts a date/timestamp/string to a value of string in the format specified by the date.
ord.withColumn('new_order_date',date_format(ord.order_date,'yyyy/MM/dd')).show(5)

unix_timestamp(timestamp=None, format='yyyy-MM-dd HH:mm:ss'):


Convert a time string with the given pattern ('yyyy-MM-dd HH:mm:ss' by default) to a Unix timestamp (in seconds), using the
default timezone and the default locale; returns null on failure.
ord.withColumn('new_order_date',unix_timestamp(ord.order_date)).show(5)

to_timestamp(col, format=None): Converts a Column into timestamp type.


df = spark.createDataFrame([('1997-02-28 10:30:00',)], ['t'])
df.select(to_timestamp(df.t).alias('dt')).show()

from_unixtime(timestamp, format='yyyy-MM-dd HH:mm:ss'):
Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of
that moment in the current system time zone in the given format.
time_df = spark.createDataFrame([(1428476400,)], ['unix_time'])
time_df.select(from_unixtime('unix_time').alias('ts')).collect()

from_utc_timestamp(timestamp, tz):
This is a common function for databases supporting TIMESTAMP WITHOUT TIMEZONE. This function takes a timestamp
which is timezone-agnostic, and interprets it as a timestamp in UTC, and renders that timestamp as a timestamp in the
given time zone.

df = spark.createDataFrame([('1997-02-28 10:30:00', 'JST')], ['ts', 'tz'])


df.select(from_utc_timestamp(df.ts, "PST").alias('local_time')).collect()
df.select(from_utc_timestamp(df.ts, df.tz).alias('local_time')).collect()

to_date(col, format=None):
Converts a string type or timestamp type into date type using the optionally specified format.
ord.select(ord.order_date,to_date(ord.order_date,'yyyy-MM-dd')).show()

to_timestamp(col, format=None):
df.select(df.t,to_timestamp(df.t).alias('dt')).show()

DataFrame APIs : Null Functions

###Prepare Data
df = spark.createDataFrame((('Robert',1, None,114.0), ('John',None, 2577,float('nan'))), ("name", "id","phone","stAdd"))

isnull(col): Returns true if the column is null.


df.select(isnull(df.id)).show()

isnan(col): Returns true if the column is NaN.


df.select(isnan(df.stAdd)).show()

nanvl(col1, col2): Returns col1 if it is not NaN, or col2 if col1 is NaN.


df.select(df.stAdd,df.phone,nanvl(df.stAdd,df.phone)).show()

coalesce(*cols): Returns the first column that is not null.


df.select(df.phone,df.stAdd,coalesce(df.phone,df.stAdd)).show()

DataFrame APIs : Collection
Functions

### emp1 DataFrame
data =[('Alicia','Joseph',['Java','Scala','Spark'],{'hair':'black','eye':'brown'}), \
('Robert','Gee',['Spark','Java'],{'hair':'brown','eye':None}), \
('Mike','Bianca',['CSharp',''],{'hair':'red','eye':''}), \
('John','Kumar',None,None), \
('Jeff','L',['1','2'],{})]
schema = ('FirstName','LastName','Languages','properties')
emp1 = spark.createDataFrame(data=data,schema=schema)

### emp2 DataFrame


data=(('Robert',35,40,40),('Ram',31,33,29),('John',95,89,91))
schema = ('name','score1','score2','score3')
emp2= spark.createDataFrame(data=data, schema=schema)

### emp3 DataFrame


emp3 = spark.createDataFrame(data=[('John',[10,20,20],[25,11,10]),('Robert',[15,13,55],[5,None,29]),('James',[11,13,45],[5,89,79])], schema=('empName','score_arr1','score_arr2'))

### df DataFrame
df = spark.sql("SELECT array(struct(1, 'a'), struct(2, 'b')) as data")

size(col):
• Returns the length of the array or map stored in the column.
• Size is -1 for null elements.
emp1.select(size(emp1.Languages),size(emp1.properties)).show()

element_at(col,extraction):
• Returns element of array at given index in extraction if col is array.
• Returns value for the given key in extraction if col is map.
emp1.select(emp1.FirstName,element_at(emp1.Languages,2),element_at(emp1.properties,'eye')).show()

struct(*cols) :
• Create a new struct column.
emp_new = emp1.select(struct(emp1.FirstName,emp1.LastName))

array(*cols):
• Creates a new array column.
emp_new = emp2.select(array(emp2.score1,emp2.score2,emp2.score3))

array_max(col), array_min(col):
• Returns maximum or minimum values of an array column.
emp3.select(array_max("score_arr1")).show()

array_distinct(col):
• Returns distinct values of an array column.
emp3.select(array_distinct("score_arr1")).show()

array_repeat(col,count):
• Creates an array containing the given column repeated count times.
emp3.select(array_repeat("score_arr1",3)).show(truncate=False)

--Flatten the array and choose the distinct values


emp3.select(array_distinct(flatten(array_repeat("score_arr1",3)))).show(truncate=False)

slice(col,start,length)
• Returns an array containing all the elements in `col` from index `start` for length `length`.
• col is Array Type.
emp1.select(emp1.Languages,slice(emp1.Languages,3,1)).show()

array_position(col,value):
• Locates the position of the first occurrence of the given value in the given array.
• Starts with Index 1.
emp3.select(emp3.score_arr1,array_position(emp3.score_arr1,27)).show()

array_remove(col,element):
• Remove all elements that equal to element from the given array.
emp3.select(array_remove("score_arr1",20)).show()

array_sort(col):
• Sorts the input array in ascending order.
• The elements of the input array must be orderable.
• Null elements will be placed at the end of the returned array.
emp3.select("score_arr1",array_sort("score_arr1")).show()

sort_array(col,asc=True):
• Sorts the input array in ascending or descending order according to the natural ordering of the array elements.
• Null elements will be placed at the beginning of the returned array in ascending order or at the end of the returned array in
descending order.
emp3.select(emp3.score_arr1,sort_array(emp3.score_arr1,asc=False)).show()

array_contains(col, value):
• Returns null if the array is null, true if the array contains the given value, and false otherwise.
emp3.select(array_contains(emp3.score_arr1,55)).show()

array_union(col1,col2):
• Returns an array of the elements in the union of col1 and col2 without duplicates.
emp3.select(array_union(emp3.score_arr1,emp3.score_arr2)).show(truncate=False)

array_except(col1, col2):
• Returns an array of the elements in col1 but not in col2 without duplicates.
emp3.select(array_except(emp3.score_arr1,emp3.score_arr2)).show(truncate=False)

array_intersect(col1, col2):
• Returns an array of the elements in the intersection of col1 and col2 without duplicates.
emp3.select(array_intersect(emp3.score_arr1,emp3.score_arr2)).show(truncate=False)

array_join(col, delimiter, null_replacement=None):


• Concatenates the elements of `column` using the `delimiter`.
• Null values are replaced with `null_replacement` if set, otherwise they are ignored.
emp3.select(emp3.score_arr2,array_join(emp3.score_arr2,'#',null_replacement="1")).show()

arrays_zip(*cols):
• Merge arrays. First element of array1 will be merged with first element of array 2 and so on.
emp3.select(arrays_zip(emp3.score_arr1,emp3.score_arr2)).show(truncate=False)

arrays_overlap(a1,a2):
• Returns true if the arrays contain any common non-null element;
• Returns null if both the arrays are non-empty and any of them contains a null element
• Returns false otherwise.
emp3.select(arrays_overlap(emp3.score_arr1,emp3.score_arr2)).show(truncate=False)

shuffle(col):
• Random shuffle an array.
emp3.select(emp3.score_arr1,shuffle(emp3.score_arr1)).show()

create_map(*cols):
• Create a new map column.
emp1.select(create_map(emp1.FirstName,emp1.LastName)).printSchema()

map_from_entries(col):
• col is array of paired structs.
• Function returns a map created from the given array of entries.
df.select(map_from_entries("data").alias("map")).show()

map_from_arrays(col1,col2):
• Creates a new map from two arrays.
emp3.select(map_from_arrays(emp3.score_arr1,emp3.score_arr2)).printSchema()

map_keys():
• Returns an unordered array containing the keys of the map.
emp1.select(map_keys(emp1.properties)).show()

map_values():
• Returns an unordered array containing the values of the map.
emp1.select(map_values(emp1.properties)).show()

map_concat(*cols):
• Returns the union of all the given maps.
emp1.select(map_concat(emp1.properties,emp1.properties)).show(truncate=False)
sequence(start, stop, step=None):
• Generate a sequence of integers from `start` to `stop`, incrementing by `step`.
• If `step` is not set, incrementing by 1 if `start` is less than or equal to `stop`, otherwise -1.
emp2.select(emp2.score1, emp2.score2,sequence(emp2.score1,emp2.score2).alias('new_col')).show(truncate=False)

DataFrame APIs : na Functions

na Functions
Na Functions are used to work with missing data.

drop(how='any', thresh=None, subset=None):


• Remove rows with NULL Values.
• how: any or all.
If 'any', drop a row if it contains any nulls.
If 'all', drop a row only if all its values are null.
• Thresh : If specified, drop rows that have less than `thresh` non-null values. This overwrites the `how`
parameter.
• Subset: optional list of column names to consider.

Ex- data=(('Alice',80,10),('Bob',None,5),('Tom',50,50),(None,None,None),('Robert',30,35))
schema='name string, age int, height int'
df = spark.createDataFrame(data,schema)
df.na.drop().show()
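A few more minimal sketches of the drop() parameters, using the same df as above:
df.na.drop(how='all').show()
df.na.drop(thresh=2).show()
df.na.drop(subset=['age','height']).show()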

na Functions
fill(value, subset=None):
• Replace null values.
• value : Value to replace null values with.
• subset: Optional list of column names to consider.

df.na.fill(50).show() #String Columns are ignored.


df.na.fill('Ram').show() #Non-String Columns are ignored.
df.na.fill({'age' : 50, 'name' : 'Ram' }).show()
df.na.fill(value=100,subset='height').show()

replace(to_replace, value=<no value>, subset=None)


df.na.replace(10,20).show()
df.na.replace({'Alice':'Alex','Bob': 'Cob'},subset='name').show()

DataFrame APIs : Mathematics
and Statistics Functions

Mathematics Functions
abs(col):
Compute the absolute value.
df1.select(df1.col1,abs(df1.col1)).show()
exp(col):
Computes the exponential of the given value.
df1.select(df1.col1,exp(df1.col1)).show()
factorial(col):
Computes the factorial of the given value.
df1.select(df1.col1,factorial(df1.col1)).show()
sqrt(col):
cbrt(col) :
pow(col,n):
df.select(df.age,sqrt(df.age),cbrt(df.age),pow(df.age,2)).show()
floor():
ceil():
df2.select(df2.col1,floor(df2.col1),ceil(df2.col1)).show()
Mathematics Functions
round(col,scale=0):
df2.select(df2.col1,round(df2.col1,2)).show()
trunc(col,format):
• Format : 'year', 'yyyy', 'yy' or 'month', 'mon', 'mm'
ord.select(ord.order_date,trunc(ord.order_date,'year')).show()
signum(col): Returns 1 if n > 0, 0 if n = 0, -1 if n < 0.
df2.select(df2.col1,signum(df2.col1)).show()
avg(col):
sum(col):
sumDistinct(col):
mean(col):
count(col):
countDistinct(col):
min(col):
max(col):
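A minimal sketch, assuming a df with numeric age and salary columns as used in the surrounding examples:
df.select(avg(df.age),sum(df.salary),mean(df.age),count(df.age),countDistinct(df.age),min(df.age),max(df.age)).show()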
Statistics Functions
corr(col1,col2):
Return Pearson Correlation Coefficient.
df.select(corr(df.salary,df.age)).show()

covar_pop(col1, col2):
Return population covariance.
df.select(covar_pop(df.salary,df.age)).show()

covar_samp(col1, col2):
Return sample covariance.

var_pop(col): Return population variance of the values in a group.


var_samp(col): Returns the unbiased variance of the values in a group.
variance(col): Alias for var_samp; returns the unbiased sample variance of the values in a group.
stddev(col): Returns the unbiased sample standard deviation of the expression in a group.
stddev_pop(col): Returns population standard deviation of the expression in a group.
stddev_samp(col): Returns the unbiased sample standard deviation of the expression in a group.
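A minimal sketch of the variance and standard deviation functions (same df with a salary column):
df.select(var_pop(df.salary),var_samp(df.salary),stddev(df.salary),stddev_pop(df.salary)).show()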
DataFrame APIs : Explode and
Flatten Functions

explode(col):
• Can be used for array and map.
• When an array is passed to this function, it creates a new column and each element of the array in a separate row.
• When a map is passed to this function, it creates two new columns one for key and one for value.
• This will ignore null elements.

data =[('Alicia',['Java','Scala'],{'hair':'black','eye':'brown'}),
('Robert',['Spark','Java',None],{'hair':'brown','eye':None}),
('Mike',['CSharp',''],{'hair':'red','eye':''}),
('John',None,None),
('Jeff',['1','2'],{})]
schema = ('empName','Languages','properties')
emp = spark.createDataFrame(data=data,schema=schema)
emp.select(emp.empName,explode(emp.Languages)).show()
emp.select(emp.empName,explode(emp.properties)).show()

explode_outer(col): In explode nulls are ignored, but in explode_outer nulls are reported.
emp.select(emp.empName,explode_outer(emp.Languages)).show()

posexplode(col): Explode with a separate position/index field. Ignores null values.


emp.select(emp.empName,posexplode(emp.Languages)).show()
emp.select(emp.empName,posexplode(emp.properties)).show()

posexplode_outer(col): Explode with a separate position/index field. Nulls are reported.


emp.select(emp.empName,posexplode_outer(emp.Languages)).show()
emp.select(emp.empName,posexplode_outer(emp.properties)).show()
flatten():
• Convert an Array of Array Column to a single Array Column.
ArrayType(ArrayType(StringType)) → ArrayType(StringType)
• If a structure of nested arrays is deeper than two levels, only one level of nesting is removed.

Ex-
data =[('Alicia',[['Java'],['Scala'],['Python']]),\
('Robert',[[None],['Java'],['Hadoop']])
]
schema = ('empName','ArrayofArray')
emp = spark.createDataFrame(data=data,schema=schema)
emp.select(emp.empName,flatten(emp.ArrayofArray)).show()

Ex-
data =[('Alicia',[[1],[2]]),\
('Robert',[None,[1]])
]
schema = ('empName','ArrayofArray')
emp = spark.createDataFrame(data=data,schema=schema)
emp.select(emp.empName,flatten(emp.ArrayofArray)).show()

DataFrame APIs : Formatting
Functions

### Prepare Data
ordItems=spark.read.load('retail_db/order_items',sep=',',format='csv',schema=('order_item_id int,order_item_order_id int,order_item_product_id int,quantity tinyint,subtotal float,price float'))
df = spark.createDataFrame([(5, "hello")], ['a', 'b'])

format_number(col,d): Formats the number X to d decimal places with HALF_EVEN round mode, and returns the
result as a string.
ordItems.select(ordItems.subtotal,format_number(ordItems.subtotal,4)).show()

format_string(format, *cols):
Formats the arguments in printf-style and returns the result as a string column.
df.select(format_string('%d %s', df.a, df.b).alias('v')).show()

DataFrame API: json Functions

### Prepare Data
data=[(1,"""{"Zipcode":85016,"ZipCodeType":"STANDARD","City":"Phoenix","State":"AZ"}""")]
df_map=spark.createDataFrame(data,("id","value"))

data = [(1, '''[1, 2, 3]''')]
df_arr=spark.createDataFrame(data,("id","value"))

data=[(1, """{"Zipcode":85016,"ZipCodeType":"STANDARD","City":"Phoenix","State":"AZ"}""")]
df_struct=spark.createDataFrame(data,("id","value"))
from_json(col, schema):
Convert a JSON string into a Map Type or StructType or ArrayType.
Returns `null`, in the case of an unparseable string.
Json String Column → Collection Type Column
--Map Type Ex
schema=MapType(StringType(),StringType())
df_map_new = df_map.withColumn('map_column',from_json(df_map.value,schema))
df_map_new.printSchema()

--Array Type Ex
schema=ArrayType(IntegerType())
df_arr_new = df_arr.withColumn('arr_column',from_json(df_arr.value,schema))
df_arr_new.printSchema()

--Struct Type Ex
schema=StructType([StructField("Zipcode",IntegerType()),StructField("ZipCodeType",StringType()),StructField("City",StringType()),StructField("State",StringType())])
df_struct_new = df_struct.withColumn('struct_column',from_json(df_struct.value,schema))
df_struct_new.printSchema()

to_json(col)
Converts a column with StructType or ArrayType or MapType into a JSON string.
Throws an exception, in the case of an unsupported type.
Collection Type Column → Json String Column
--Using Map Type
df_map_new.select(to_json(df_map_new.map_column)).printSchema()

--Using Array Type


df_arr_new.select(to_json(df_arr_new.arr_column)).printSchema()

--Using Struct Type


df_struct_new.select(to_json(df_struct_new.struct_column)).printSchema()

json_tuple(col, *fields):
Extract elements from a JSON string column and return the results as new columns.
Json String Column → New Columns
df_map.select(json_tuple(col("value"),"Zipcode","State")).toDF("Zip","State").show()

schema_of_json(json_string):
Use schema_of_json() to create a schema string from a JSON string column. (Json String Column → Schema)
schemaOfStr=spark.range(1) \
.select(schema_of_json(lit("""{"id":101, "name":"Robert","City":"Phoenix","State":"AZ"}"""))) \
.collect()[0][0]

get_json_object(col,path):
Used to extract the JSON string based on path from the JSON column.
df_map.select(col("id"),get_json_object(col("value"),"$.ZipCodeType").alias("ZipCodeType")) \
.show(truncate=False)

DataFrame APIs : Other
Aggregate Functions

### Prepare Data
data = (("James","Sales","NY",None,34),
("Alicia","Sales","NY",8600,56),
("Robert","Sales","CA",8100,30),
("John","Sales","AZ",8600,31),
("Ross","Sales","AZ",8100,33),
("Kathy","Sales","AZ",1000,39),
("Lisa","Finance","CA",9000,24),
("Deja","Finance","CA",9900,40),
("Sugie","Finance","NY",8300,36),
("Ram","Finance","NY",7900,53),
("Satya","Finance","AZ",8200,53),
("Kyle","Marketing","CA",8000,25),
("Reid","Marketing","NY",9100,50)
)
schema=("empname","dept","state","salary","age")
df = spark.createDataFrame(data=data,schema=schema)

first(col, ignorenulls=False) :
Returns the first value in a group.
Return the first non-null value it sees when ignoreNulls is set to true.
df.select(first(df.salary)).show()

last(col, ignorenulls=False):
Returns the last value in a group; returns the last non-null value it sees when ignoreNulls is set to true.
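For example, on the same df:
df.select(last(df.salary,ignorenulls=True)).show()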

greatest(*cols):
Returns the greatest value of the list of column names, skipping null values.
df.select(greatest(df.salary,df.age)).show()

least(*cols):
Returns the least value of the list of column names, skipping null values.
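For example, on the same df:
df.select(least(df.salary,df.age)).show()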

skewness(col):
Returns the skewness of the values in a group.
df.select(skewness(df.salary)).show()

collect_list(col): Returns a list of objects with duplicates.


df.select(collect_list('age')).show(truncate=False)

DataFrame APIs : Other Misc
Functions

###Prepare Data
df = spark.createDataFrame((('Robert',1, None,114.0), ('John',None, 2577,float('nan'))), ("name", "id","phone","stAdd"))

ascii(col):
• Computes the numeric value of the first character of the string column.
df.select(ascii(lit('a'))).show()

bin(col):
• Returns the string representation of the binary value of the given column.
df.select(df.phone,bin(df.phone)).show()

expr(str): Parses the expression string into the column that it represents.
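A small illustrative sketch (assuming the df prepared above; the expressions are arbitrary):
df.select(df.name,expr("id + 1"),expr("upper(name)")).show()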

Part 3 DataFrame Extraction

Repartition & Coalesce

What is Partition ?
• Datasets are huge in size, so they cannot fit into a single node and have to be partitioned across
different nodes.
• A partition in Spark is basically an atomic chunk of data stored on a node in the cluster. Partitions are the basic
units of parallelism.
• One partition cannot span multiple machines.
• Spark automatically partitions RDDs/DataFrames and distributes the partitions across different nodes.
• We can also configure the optimal number of partitions. Having too few or too many partitions is not good.
• How does Spark do the default partitioning of data?
Spark checks the HDFS block size. The HDFS block size for Hadoop 1.0 is 64MB and for Hadoop 2.0/YARN it is
128MB. Spark creates one partition per block.
Ex- For a file of 500MB, 4 partitions would be created.
• At times programmers are required to change the number of partitions based on the requirements of the
application job. The change can be to increase the number of partitions or decrease the number of
partitions.
• So we would either apply the repartition or coalesce.

(Figure: a 2.6GB file, test_data.csv (~128 million records), is read from HDFS into 21 partitions; after applying a filter the result is only ~1MB (~64k records), but it still has 21 partitions.)
### Create some Dump data for testing
df = spark.range(1000000)
df = df.select(df.id,df.id*2,df.id*3)
df = df.union(df)
df = df.union(df)
df = df.union(df)
df = df.union(df)
df = df.union(df)
df = df.union(df)
df = df.union(df)

### Save the file at a HDFS Path


df.write.save('/user/test/test_data.csv',format='csv')

### Testing
df_new = spark.read.load('/user/test/test_data.csv',format='csv',schema=('col1 int, col2 int, col3 int'))
df_new.rdd.getNumPartitions()
--- Apply Filter
df_filter = df_new.where(df_new.col1 < 501)
df_filter.rdd.getNumPartitions()

Repartition
repartition(numPartitions, *cols):
• The DataFrame is hash partitioned.
• Creates almost equal sized partitions.
• Can increase or decrease the level of parallelism.
• Internally this uses a shuffle to redistribute data from all partitions, which is a very expensive operation; avoid it if not required.
• Spark performs better with equal sized partitions. If you need further processing of huge data, it is preferred to have equal sized partitions,
so consider using repartition.
• If you are decreasing the number of partitions, consider using coalesce, where the movement of data across partitions is lower.

Ex-df_new.rdd.glom().map(len).collect()
df_filter.rdd.glom().map(len).collect()

df_filter = df_filter.repartition(5)
df_filter.rdd.glom().map(len).collect()

Ex-
data=(('Ram',30),('Raj',25),('James',30),('Joann',25),('Kyle',25),('Robert',30),('Reid',35),('Sam',35))
df = spark.createDataFrame(data=data,schema=('name','age'))
df = df.repartition('age')
df.rdd.getNumPartitions()
spark.conf.get("spark.sql.shuffle.partitions")

df = df.repartition(4,'age')
df.rdd.getNumPartitions()
Repartition
(Figure: repartition(2) – rows from partitions P1–P5 are shuffled across the network into 2 new partitions (P6, P7); every input partition can contribute to every output partition.)
coalesce
coalesce(numPartitions):
• Return a new dataframe that is reduced into numPartitions partitions.
• Optimized version of repartition().
• No shuffling.
• Results in a narrow dependency, e.g. if you go from 1000 partitions to 100 partitions, there will not be a shuffle,
instead each of the 100 new partitions will claim 10 of the current partitions.
• If a larger number of partitions is requested, it will stay at the current number of partitions.

df_filter.rdd.glom().map(len).collect()
df_filter = df_new.coalesce(5)
df_filter.rdd.glom().map(len).collect()

coalesce
(Figure: coalesce(2) – no shuffle; existing partitions are merged locally, e.g. the new P1 = P1 + P2 and the new P3 = P3 + P4.)
Repartition vs Coalesce

Repartition:
1. Does a full shuffle.
2. Preferably used to increase the number of partitions.
3. Creates new partitions.
4. Results in roughly equal sized partitions.

Coalesce:
1. Avoids a full shuffle.
2. Preferably used to decrease the number of partitions.
3. Uses existing partitions to minimize the amount of data that is shuffled.
4. Results in partitions with different sizes of data.

Notes:
5. Coalesce may run faster than repartition, but unequal sized partitions are generally slower to work with than equal sized partitions; repartition tends to be faster overall because Spark is built to work with equal sized partitions.
6. It's critical to repartition after running filtering queries. The number of partitions does not change after filtering, so if you don't repartition you'll have way too many memory partitions; the more the filter reduces the dataset size, the bigger the problem.
DataFrame Extraction

### Prepare Data
ord = spark.read.load('practice/retail_db/orders',sep=',',format='csv',schema=('order_id int,order_date timestamp,order_customer_id int,order_status string'))

Extraction : csv File
• csv can take a lot of parameters. For complete list, use help command.
• Below are few important parameters.

csv Params:
• path
• mode : default ‘error’ or ‘errorIfExists’ : Throw exception.
‘append’ :Append contents to existing data.
‘overwrite’ : Overwrite existing data.
‘ignore’ : Ignore the operation if data exists.
• compression (none, bzip2, gzip, lz4,snappy and deflate)
• sep (default ‘ , ’)
• header (True or False. default – False)
• dateFormat (default – ‘yyyy-MM-dd’)
• timestampFormat (Default – ‘yyyy-MM-dd'T'HH:mm:ss.SSSXXX’)
• ignoreLeadingWhiteSpace (Default – True)
• ignoreTrailingWhiteSpace (Default – True )
• And more …

PS - Calculate number of orders in each status and save into 2 files.


ordCountByStatus = ord.groupBy("order_status").count()
ordCountByStatus.coalesce(2).write.save('practice/dump/retail_db/ordCountByStatus',format='csv',sep='#',mode='overwrite',header=True,compression='gzip')
Extraction : text file
• To save into a text file, make sure to append all columns into a single column with String type.
• Does not support a header.
• To create a text file from a DataFrame, it should have only one column, so it is often easier to use an RDD for creating a text file.
ordRDD = ord.rdd
ordRDD.saveAsTextFile('retail_db/orders')

Text Params:
• path
• compression
• linesep (default – ‘\n’)

PS: Convert the Order csv file to a tab-delimited text file and create 10 output files.
from pyspark.sql.functions import concat_ws
ordText = ord.select(concat_ws('\t',ord.order_id,ord.order_date,ord.order_customer_id,ord.order_status).alias('col1'))
ordText.repartition(10).write.save('practice/dump/retail_db/orderText',format='text')

Extraction : parquet File
parquet Params:
• path
• mode (Same as csv)
• partitionBy (Apply partition as per the column. One file per each partition.)
• Compression (Default Compression - spark.sql.parquet.compression.codec)

PS: Convert the Order csv file to parquet file. Create one file for each order status category.
ord.write.save('practice/dump/retail_db/orderParquet',format='parquet',mode='overwrite',partitionBy="order_status")

Extraction : orc File
orc Params:
• path
• mode (Same as csv)
• partitionBy (Apply partition as per the column. One file per each partition.)
• Compression (Default Compression - spark.sql.orc.compression.codec)

PS: Convert the Order csv file to orc file. Create one file for each order status category.
ord.write.save('practice/dump/retail_db/orderOrc',format='orc', partitionBy="order_status")

Extraction : json File
json Params:
• path
• mode
• compression (none, bzip2, gzip, lz4,snappy and deflate)
• dateFormat (default – yyyy-MM-dd)
• timestampFormat (default - yyyy-MM-dd'T'HH:mm:ss.SSSXXX)
• lineSep (default – ‘\n’)

PS: Convert the Order csv file to json file. File should be bzip2 compressed and total 1 output file.
ord.coalesce(1).write.save('practice/dump/retail_db/orderJson',format='json', compression='bzip2')

Extraction : avro File
• Avro is a third-party (external) format for Spark 2.x; Spark does not support it out of the box. We need to pass its package or jar file while launching pyspark or spark-submit.
• pyspark2 --packages org.apache.spark:spark-avro_2.11:2.4.4

PS: Convert the Order csv file to avro file.


ord.write.save('practice/dump/retail_db/orderAvro',format='avro')

Extraction: hive Table
insertInto(tableName, overwrite=False):
• Insert data to a Hive Table.
• By default it will append the data. With overwrite=True, it will overwrite the existing data.
• Table should be available, otherwise throws exception.

Ex-
ord.write.insertInto('db2.orders')

DataFrame: hive Table
saveAsTable(name, format=None, mode=None, partitionBy=None, **options)
• format – File Format
If we do not pass any file format or compression, by default it is snappy compressed and parquet format.
• mode - append, overwrite, error, ignore
Table should not exist if we do not pass a mode.
• partitionBy - Create a partition on the partition column. Number of files created for the hive table is equal to the
number of distinct values in that column.
• compression: By default it is snappy compressed.
Pass compression=‘none’ to avoid compression.

Ex-
ord.write.saveAsTable(name='db2_orders.order_test', format='orc')

Ex-
ord.write.saveAsTable(name='db2_orders.order_test', format='orc', mode='overwrite')

Ex-
ord.write.saveAsTable('db2_orders.orders1',partitionBy='order_status',format='orc',mode='overwrite',compression='none')

Extraction : jdbc
• Make sure the JDBC jar file is registered using --packages or --jars while launching pyspark or spark-submit.
• Typical jdbc files are located at /usr/share/java folder. You may keep it there or copy it to your project lib folder.
pyspark2 --jars <jdbc driver jar file>
• Complete list of properties:
https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html

mode: append, overwrite,ignore, error or errorifexists (default case)

Ex-1 ( Write into a Table. Table should not exist.)


df.write.format("jdbc") \
.option("url", url) \
.option("driver", "oracle.jdbc.driver.OracleDriver") \
.option("dbtable","new_orders" ) \
.option("user", someUser) \
.option("password", somePassword) \
.save()

If Table exist, use mode.


df.write.format("jdbc") \
.option("url", url) \
.option("driver", "oracle.jdbc.driver.OracleDriver") \
.option("dbtable","new_orders" ) \
.option("user", someUser) \
.option("password", somePassword) \
.mode("overwrite") \
.save()
Extraction : jdbc
createTableColumnTypes : Apply column datatypes instead of defaults.

Ex -2 (Using createTableColumnTypes )

ord.select('order_id').write.format("jdbc") \
.option("url", url) \
.option("driver", "oracle.jdbc.driver.OracleDriver") \
.option("createTableColumnTypes","order_id char(10)" ) \
.option("dbtable","new_orders" ) \
.option("user", someUser) \
.option("password", somePassword) \
.mode("overwrite") \
.save()

Extraction : jdbc
batchsize: The JDBC batch size, which determines how many rows to insert per round trip. This can help performance on JDBC drivers. This option
applies only to writing. It defaults to 1000.
fetchsize: The JDBC fetch size, which determines how many rows to fetch per round trip. This option applies only to reading.

ord.select('order_id').write.format("jdbc") \
.option("url", url) \
.option("driver", "oracle.jdbc.driver.OracleDriver") \
.option("createTableColumnTypes","order_id char(10)" ) \
.option("dbtable","new_orders" ) \
.option("user", someUser) \
.option("password", somePassword) \
.option("batchsize",5000) \
.mode("overwrite") \
.save()
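For reading, fetchsize is passed the same way; a minimal sketch, assuming the same url, someUser and somePassword variables used above:
df = spark.read.format("jdbc") \
.option("url", url) \
.option("driver", "oracle.jdbc.driver.OracleDriver") \
.option("dbtable","new_orders" ) \
.option("user", someUser) \
.option("password", somePassword) \
.option("fetchsize",5000) \
.load()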

Extraction : jdbc
queryTimeout: The number of seconds the driver will wait for a Statement object to execute. Zero is the default and
means there is no limit.

ord.select('order_id').write.format("jdbc") \
.option("url", url) \
.option("driver", "oracle.jdbc.driver.OracleDriver") \
.option("createTableColumnTypes","order_id char(10)" ) \
.option("dbtable","new_orders" ) \
.option("user", someUser) \
.option("password", somePassword) \
.option("queryTimeout",1) \
.mode("overwrite") \
.save()

Performance & Optimization

1. Caching Data in Memory
2. Configuration Options
3. Join Strategy Hints for SQL Queries
4. Coalesce Hint for SQL Queries
5. Adaptive Query Execution

Join Strategies

Join Strategies
• Spark has the below important join strategies.
 Broadcast Join (Hint – BROADCAST)
 Shuffle Hash Join (Hint – SHUFFLE_HASH)
 Sort Merge Join (Hint – MERGE)
 Cartesian Product Join (Hint – SHUFFLE_REPLICATE_NL)
 Broadcast Nested Loop Join (Hint – BROADCAST for non-equi joins; otherwise used as the fallback strategy)
Broadcast Join

Divided into two Steps:
1. Broadcast the smallest dataset to all executors.
2. Perform a hash join

Broadcast Join in Spark
(Figure: Table B, the smaller table, is first broadcast to every executor holding a partition of Table A using a BitTorrent-like peer-to-peer protocol (step 1); each partition of Table A is then hash-joined with its local copy of Table B (step 2) to produce the new table.)
Hash Join: Only supported for '=' joins.
• Create a hash table based on the join key of the smaller dataset.
• Loop over the larger dataset and match the hashed join key against the hash table created above.
Broadcast Join:
• One of the most impactful performance optimization techniques we can use.
• It performs a join on two datasets by first broadcasting the smaller one to all Spark executors using a BitTorrent-like
peer-to-peer protocol.
• In this way, each executor has all the information required to perform the join locally, without needing to
redistribute or shuffle data.
• Broadcast join can be very efficient for joins between a large table (fact) and relatively small tables (dimensions),
as used in a star-schema join.
• The size of the smaller table should be less than spark.sql.autoBroadcastJoinThreshold. Configurable.
Default size: 10MB
int(spark.conf.get("spark.sql.autoBroadcastJoinThreshold"))/1024/1024
• Recently Spark increased the maximum size of a broadcast table from 2GB to 8GB; it is not possible to
broadcast tables which are greater than 8GB.
• Also called map-side join or replicated join.
• Use the BROADCAST hint to force the Spark optimizer.
Auto Detection:
• In many cases, Spark can automatically detect whether to use a broadcast join or not, depending on the size of the data.
If Spark can detect that one of the joined DataFrames is small (10 MB by default), Spark will automatically broadcast it for
us.
• Spark will perform auto-detection when
 Constructs a DataFrame from scratch, e.g. spark.range.
 It reads from files with schema and/or size information, e.g. Parquet, Avro

Ex – Lets join orders and orderItems files.

ordDF = spark.read.load('practice/retail_db/orders',sep=',',format='csv',schema=('order_id int,order_date timestamp,order_customer_id int,order_status string'))

ordItemsDF = spark.read.load('practice/retail_db/order_items',sep=',',format='csv',schema=('order_item_id int,order_item_order_id int,order_item_product_id int,quantity tinyint,subtotal float,price float'))

joined = ordDF.join(ordItemsDF,ordDF.order_id == ordItemsDF.order_item_order_id)

joined.explain()
Sometimes Not:
• But not always, Spark can do auto detection. For Ex- Local Collection.
• The reason is that Spark will not determine the size of a local collection because it might be big.
• In this case, we can force optimizer to use BROADCAST Hint.

Ex –
largeDF = spark.range(1,1000000000)

data = [(1, 'a'),(2, 'b'),(3, 'c')]


schema= ['id', 'col2']
smallDF=spark.createDataFrame(data,schema)

joined = largeDF.join(smallDF, "id")

joined.explain()

Ex- Use hint to broadcast the small local collection.

from pyspark.sql.functions import broadcast
joined = largeDF.join(broadcast(smallDF), "id")   # or: largeDF.join(smallDF.hint('broadcast'), "id")
joined.explain()
Testing:
• If you want to test the joins with out presence of automatic optimization, set spark.sql.autoBroadcastJoinThreshold to -1.
• This will disable the automatic broadcast join detection.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

Notes:
• The table to be broadcast needs to be smaller than the configured spark.sql.autoBroadcastJoinThreshold value, default
10MB (or you must add a broadcast join hint).
• Only supported for ‘=’ join.
• Supported for all join types (inner, left, right) except full outer join.
• Faster than any other join strategies.

Shuffle Hash Join

Shuffle Hash Join is divided into two steps:
Step#1 : (Shuffle Phase):
• Both datasets are read and shuffled.
• So the same keys from both sides end up in the same partition or task.

(Figure: DataFrames A and B are both shuffled on the join key so that equal keys from both sides land in the same partition.)
Step#2 : (Hash Join Phase):
• After Shuffle phase, spark picks one side based on statistics(generally the smaller dataset) and it
will be hashed into buckets.
• Then hash join is performed.

(Figure: after the shuffle, within each partition the smaller side is hashed into buckets and the other side is probed against those buckets, i.e. a per-partition hash join.)
DAG Visualization:

When does shuffle hash join work:
• Only for equi joins (‘=‘).
• Works well when a dataset can not be broadcasted but one side of the partitioned data after shuffling will be small
enough for hash join.
When does shuffle hash join not work:
• Non-equi Joins (<,>,<=,>= etc)
• Does not work with data which are heavily skewed.

Note:
• The join keys don’t need to be sortable.
• Supported for all join types except full outer joins.
• Expensive join, as both shuffle and hashing are involved. Spark prefers Sort Merge Join over Shuffle Hash Join.
By default spark.sql.join.preferSortMergeJoin is True, but it is configurable.
• Hint: SHUFFLE_HASH

spark.conf.set("spark.sql.join.preferSortMergeJoin", False)
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

Ex –
df1 = spark.range(1,10000000000)
df2 = spark.range(1,10000000)
joined = df1.join(df2,"id")
joined.explain()

joined = df1.hint('shuffle_hash').join(df2,"id")
joined.explain()
Shuffle Sort Merge Join

Shuffle Sort Merge Join:
• From Spark 2.3, sort merge join is the default join algorithm in Spark. It can be turned off using
the internal parameter 'spark.sql.join.preferSortMergeJoin', which is true by default.
• Divided into two steps:
 The first step is to shuffle and sort the join keys of both the datasets.
 Apply the merge algorithm.
• Hint: merge

(Figure: both tables are shuffled and sorted on the join key; each sorted partition P1–P4 of Table A is then merged with the corresponding sorted partition of Table B to produce the new table.)
Example – sort merge join of a User table and a Product table on the user id:

User Table (Id, Name): (2, Json), (1, Kyle), (4, Reid), (3, Robert)
Product Table (Product Id, User Id): (101, 2), (102, 4), (103, 4), (104, 1)

Sorting on the joining key:
User Table (Id, Name): (1, Kyle), (2, Json), (3, Robert), (4, Reid)
Product Table (Product Id, User Id): (104, 1), (101, 2), (102, 4), (103, 4)

Merge: walk both sorted sides in order and emit rows with matching keys.
Notes:
 Only supported for ‘=‘ join.
 The join keys need to be sortable.
 Supported for all join types (inner, left, right etc).

Ex –
df1 = spark.range(1,10000000000)
df2 = spark.range(1,10000000)
joined = df1.join(df2,"id")
joined.explain()

joined = df1.hint('merge').join(df2,"id")

Cartesian Product Join

Cartesian Product Join:
• Steps:
 Both datasets are read.
 All partitions from one of the dataset are sent to all partitions in the other dataset. So shuffle.
 Once partitions from both dataset are available on one side, a nested loop join is performed.
 If there are M records in first dataset and N records in the second dataset, nested loop is performed on
M *N records.
• Very expensive and high possibility of OOM errors. Lots of shuffle.
• Supports both “=“ and non-equi joins (<=, <, >=,>)
• Supports all join types.
• Hint : shuffle_replicate_nl
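A minimal sketch of forcing this strategy with the hint above (reusing df1 and df2 from the earlier examples; assumes a Spark version that supports this hint):
joined = df1.hint('shuffle_replicate_nl').join(df2,"id")
joined.explain()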

(Figure: every partition of Dataset 1 is paired with every partition of Dataset 2.)
Pseudo-code of the nested loop:
for rec1 in table1:
    for rec2 in table2:
        evaluate the join condition on (rec1, rec2)
DAG Visualization: The join is executed all in 1 stage.

Broadcast Nested Loop Join

Broadcast Nested Loop Join:
• 2 Steps:
 Broadcast: Smallest dataset is broadcasted to all executors processing the bigger dataset.
 Nested Loop Join Phase: Every Record from one dataset is attempted to join with every record
from another dataset in a nested loop.
• Since this join is used for non-equi conditions, the iteration can not stop as soon as a match is
encountered like in Sort Merge Join. The iteration will go through the entire dataset.
• No sort in this join.
• Very slow, but no Shuffle. Preferred than Cartesian Product Join.
• This join will not work well when neither side is small enough to broadcast, and you could see Out Of
Memory exceptions.
• Supports both “=“ and non-equi joins (<=, <, >=,>)
• Supports all join types.
for rec1 in table1:
    for rec2 in table2:
        evaluate the join condition on (rec1, rec2)
spark.conf.set("spark.sql.crossJoin.enabled",True)
df1 = spark.range(1,1000)
df2 = spark.range(1,100)
joined = df1.join(df2)
joined.explain()

How Spark Prioritizes Join Strategies:

When More than one hint is specified and hints are applicable:
• If it's an '=' join:
 Broadcast hint : Pick broadcast hash join
 Merge hint : Pick sort-merge join
 Shuffle_Hash hint : Pick shuffle hash join
 Shuffle_replicate_nl hint : Pick Cartesian Product join.
• If it's not an '=' join:
 Broadcast hint: Pick broadcast nested loop join.
 Shuffle_replicate_nl : Pick Cartesian product if join type is inner like.

When no hints are specified or hints are not applicable


• If it's an '=' join:
 Pick broadcast hash join if one side is small enough to broadcast, and the join type is supported.
 Pick shuffle hash join if one side is small enough to build the local hash map, and is much smaller than the
other side, and spark.sql.join.preferSortMergeJoin is false.
 Pick sort-merge join if join keys are sortable.
 Pick Cartesian product if join type is inner .
 Pick broadcast nested loop join as the final solution. It may OOM but there is no other choice.
• If it's not an '=' join:
 Pick broadcast nested loop join if one side is small enough to broadcast.
 Pick cartesian product if join type is inner like.
 Pick broadcast nested loop join as the final solution. It may OOM but we don’t have any other choice.
>>> spark.conf.set("spark.sql.crossJoin.enabled",True)
>>> df1 = spark.range(1,1000)
>>> df2 = spark.range(1,100)
>>> joined = df1.join(df2)
>>> joined.explain()
== Physical Plan ==
BroadcastNestedLoopJoin BuildRight, Inner
:- *(1) Range (1, 1000, step=1, splits=2)
+- BroadcastExchange IdentityBroadcastMode
+- *(2) Range (1, 100, step=1, splits=2)

Driver Configurations

Driver Options:
• When we apply collect(), take() operations on datasets, it requires the data to be moved to Driver. If we do so on huge dataset,
it can crash the driver process with Out Of Memory errors(OOM).
• If you observe, we perform most of the computational work of a Spark job in the executors, so we are rarely required to do any
performance tuning for the driver.
• However sometimes, the job may fail if they collect too much data to the driver.
• Setting a proper limit can protect the driver from out of memory errors.

Spark-submit Options:
--driver-memory : Memory for driver (e.g. 1000M, 2G) (Default: 1024M)
• Driver Memory is the amount of memory to use for driver process, i.e. the process running the main() function of the application
and where SparkContext is instantiated.
--driver-cores :
• Number of cores used by the driver, only in cluster mode (Default: 1).
• Generally, not required unless you want to perform some local computations in parallel.
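For example, a hypothetical submit that raises the driver resources (my_app.py and the values are placeholders):
spark2-submit --master yarn --deploy-mode cluster --driver-memory 4G --driver-cores 2 my_app.py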

Configuration Properties:
spark.driver.memory : Default 1g (1024MB)
spark.driver.cores : Default 1
spark.driver.maxResultSize:
• Limit of each Spark action (e.g. collect) in bytes.
• Should be at least 1M, or 0 for unlimited. Jobs will be aborted if the total size is above this limit.
• Having a high limit may cause out-of-memory errors in driver (depends on spark.driver.memory and memory overhead of objects
in JVM).
• Setting a proper limit can protect the driver from out-of-memory errors.
spark.driver.memoryOverhead: Default driverMemory * 0.10, with minimum of 384
• Amount of overhead (non-heap) memory to be allocated per driver process in cluster mode, in MiB unless otherwise specified.
• This is memory that accounts for things like VM overheads, interned strings, other native overheads, etc.
• This tends to grow with the container size (typically 6-10%).
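These properties can also be passed at submit time with --conf; a minimal sketch (the values and my_app.py are placeholders):
spark2-submit --conf spark.driver.maxResultSize=2g --conf spark.driver.memoryOverhead=512m my_app.py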

All Properties:
https://spark.apache.org/docs/latest/configuration.html

Driver in Cluster deploy mode vs Client deploy mode:
(Figure: in cluster deploy mode the driver JVM, which holds the SparkContext and the Spark application, runs inside the cluster, separate from the client JVM; in client deploy mode the driver, SparkContext and Spark application all run inside the client JVM. Each JVM has its heap plus overhead and optional off-heap memory.)
Executors Configurations

(Figure: a data node running two executors; each executor is a JVM whose heap is partly used as cache, with multiple cores, and each core runs a task.)
• Executors and Cores:


 Executors are created in worker/data nodes and they are in charge of running tasks in a given spark job.
 Each executor comprises a JVM. They are launched at the beginning of a spark application and run the entire
lifetime of the spark job.
 After they run the assigned task, they send the results to the driver.
 They also provide in-memory storage for RDDs that are cached by user programs.
 Each worker node can have one or multiple cores.
 To run the tasks in parallel, we can launch executors with multiple cores.

Below three options play a very important role in spark performance as they control the amount of CPU & memory of our
spark application.
Lets understand how to configure them.

--executor-memory: Memory per executor (e.g. 1000M, 2G) (Default: 1G).


--num-executors: Number of executors to launch (Default: 2).
--executor-cores: Number of cores per executor. (Default: 1 in YARN mode).

Below conf application properties are related to Executors.


spark.executor.memory : Default 1G
spark.executor.cores : Default 1 in YARN mode (all available cores on the worker in standalone mode).
spark.executor.memoryOverhead : 10% of executor memory or 384MB, whichever is higher.
• The amount of off-heap memory to be allocated per executor, in MiB unless otherwise specified.
• This is memory that accounts for things like VM overheads, interned strings, other native overheads, etc.
• This tends to grow with the executor size (typically 6-10%).
• When we plan the performance tuning we need to consider this as well.
spark.executor.heartbeatInterval: Default 10s
• Interval between each executor's heartbeats to the driver. Heartbeats let the driver know that the executor is still
alive.

List of all Configuration properties:


https://spark.apache.org/docs/2.4.0/configuration.html
Spark Runtime Components in Cluster deploy mode:

[Diagram] The Client JVM submits the application to the cluster. Inside the cluster, the Driver JVM (JVM Heap) hosts
the Spark Context and the Spark application, and one or more Executor JVMs (JVM Heap) provide task slots; some slots
run tasks (T) while others may remain unoccupied.

Spark Runtime Components in Client deploy mode:

[Diagram] The driver (Spark Context and Spark application) runs inside the Client JVM (JVM Heap), while the Executor
JVMs (JVM Heap) run on the cluster, each providing task slots that run tasks (T).

[Diagram] A Node Manager hosts a YARN container in which the Executor JVM runs its tasks (T) in the JVM Heap, plus an
Overhead region (10% or 384 MB, whichever is higher) and an Off-Heap region (disabled by default). The remaining node
memory is used by daemon processes and other non-Spark workloads.

Let's assume the following cluster configuration:
10 nodes
16 cores per node
64 GB RAM per node

Cluster Configuration:
10 nodes, 16 cores per node, 64 GB RAM per node.

Spark job run with one core per executor (--executor-cores=1):
• Total number of cores in the cluster = 16 * 10 = 160
• So the number of executors = 160
• Number of executors per node = 160/10 = 16
• Memory per executor = 64/16 = 4 GB

[Diagram] Each node runs 16 tiny executors (Executor 1 ... Executor 160), each with 4 GB of memory, a single core and
a single task.

Problems:
• With only one core per executor, we cannot take advantage of running multiple tasks in the same JVM.
• There is ~10% overhead for each JVM. With 160 executors, 160 JVM processes would be created, adding a lot of
unnecessary overhead.
• Shared variables (broadcast variables, accumulators) would be copied 160 times.
• It does not leave enough memory for the YARN daemon processes and the Application Master.
• It does not leave enough memory for the executors themselves.

NOT GOOD

Cluster Configuration:
10 nodes, 16 cores per node, 64 GB RAM per node.

Spark job run with 5 cores per executor (--executor-cores=5):
• Leave 1 core per node for the YARN daemons.
So total available cores per node for the job = 16 - 1 = 15
So total number of cores in the cluster = 15 * 10 = 150
• So the number of available executors = 150/5 = 30
• Leave 1 executor for the YARN Application Master.
So total number of available executors = 30 - 1 = 29
• Number of executors per node = 30/10 = 3
• Memory per executor: 64 GB/3 = ~21.3 GB
Minus roughly 10% heap overhead (~2.1 GB), so the actual memory per executor is ~19 GB.

[Diagram] Each node now runs 3 executors (Executor 1 ... Executor 29/30), each with 5 cores and one task per core.

With this approach, we achieve good parallelism and the best throughput.

GOOD
BALANCED
RECOMMENDED approach
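The same arithmetic can be reproduced with a few lines of Python (a rough sketch that mirrors the example above; it is
not a sizing formula for every cluster):

# Rough executor-sizing arithmetic for the example cluster above.
nodes, cores_per_node, mem_per_node_gb = 10, 16, 64
executor_cores = 5

usable_cores_per_node = cores_per_node - 1                      # leave 1 core for YARN daemons -> 15
total_cores = usable_cores_per_node * nodes                     # 150
executors = total_cores // executor_cores - 1                   # 29 (leave 1 executor for the Application Master)
executors_per_node = usable_cores_per_node // executor_cores    # 3
mem_per_executor_gb = mem_per_node_gb / executors_per_node      # ~21.3 GB per executor container
executor_memory_gb = round(mem_per_executor_gb * 0.90, 1)       # ~19 GB after ~10% memory overhead
print(executors, executors_per_node, executor_memory_gb)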
Let's see how we can pass these options to spark-submit:

spark2-submit \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
--driver-memory 2G \
--executor-memory 19G \
--executor-cores 5 \
--num-executors 29 \
--jars <comma separated dependencies> \
--packages <package name> \
--py-files <comma separated .py/.zip/.egg files> \
<application> <application args>

Parallelism Configurations

As shuffle operations re-partition the data, we can use the two configurations below to control the number of partitions a
shuffle creates.
• spark.default.parallelism
• spark.sql.shuffle.partitions

spark.default.parallelism:
• Only applicable to RDDs.
• Defaults to the total number of cores on all nodes in the cluster.
• Wide RDD transformations like reduceByKey(), groupByKey() and join() trigger data shuffling. Before using these
operations, set the desired number of shuffle partitions as shown below (change the value as needed).
spark.conf.set("spark.default.parallelism", 150)
spark.conf.get("spark.default.parallelism")

Ex –
pyspark2 --master yarn --conf spark.default.parallelism=150
rdd = sc.parallelize(range(1000))
rdd.getNumPartitions()
150
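To see the effect on a shuffle, the sketch below (run in the same pyspark2 session, so the setting above still applies)
performs a wide transformation; when no explicit partition count is given, the shuffled RDD ends up with
spark.default.parallelism partitions (150 here):

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("c", 4)])
counts = pairs.reduceByKey(lambda x, y: x + y)   # wide transformation -> shuffle
counts.getNumPartitions()
150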

spark.sql.shuffle.partitions:
• Only applicable to DataFrames.
• Defaults to 200.
• For DataFrames, wide transformations like groupBy() and join() trigger data shuffling, so the result of these
transformations has as many partitions as the value set in spark.sql.shuffle.partitions.
spark.conf.get("spark.sql.shuffle.partitions")
Ex –
data=(('Ram',1),('Raj',2),('Ram',1),('Joann',4),('Raj',2),('Robert',5),('Reid',6),('Sam',7))
df = spark.createDataFrame(data=data,schema=('name','id'))
df.rdd.getNumPartitions()
2
df1 = df.groupBy("name").count()
df1.rdd.getNumPartitions()
200
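The value can also be changed at runtime with spark.conf.set(); the next shuffle then produces that many partitions
(50 below is just an illustrative value):

spark.conf.set("spark.sql.shuffle.partitions", 50)
df2 = df.groupBy("name").count()
df2.rdd.getNumPartitions()
50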

Memory Management

YARN Cluster Manager

[Diagram] The Resource Manager talks to the Node Managers. One Node Manager hosts a container running the Application
Master and the Driver; the other Node Managers host containers running Executors, each of which executes tasks.
YARN as Spark Cluster Manager:
• Yet Another Resource Negotiator, introduced in Hadoop 2.x.
• Cluster management – used for resource allocation and scheduling.
• It has 3 major components –
 Resource Manager
 Node Manager
 Application Master

Flow:
1. The client submits the Spark application. The driver instantiates the SparkContext.
2. The driver talks to the cluster manager (YARN) and negotiates resources.
3. The YARN Resource Manager searches for a Node Manager which will, in turn, launch an ApplicationMaster for the
specific job in a container.
4. The ApplicationMaster registers itself with the Resource Manager.
5. The ApplicationMaster negotiates containers for executors from the Resource Manager. It can request more
resources from the RM.
6. The ApplicationMaster notifies the Node Managers to launch the containers and executors. The executors then execute
the tasks.
7. The driver communicates with the executors to coordinate the processing of the application's tasks.
8. Once the tasks are complete, the ApplicationMaster un-registers with the Resource Manager.

[Diagram] Inside a Node Manager, the YARN container for an executor is split into the Heap (which itself divides into
Spark Memory ≈ 70%, User Memory ≈ 23% and Reserved Memory ≈ 7%), plus the Overhead region and the Off-heap region; the
rest of the node is used by daemon processes and other non-Spark workloads.
The 70% / 23% / 7% figures apply when we allocate 4 GB to the executor heap and use the default options.

Spark 1.6+ executor memory layout (yarn-site.xml / Node Manager (YARN) / Executor Container (YARN)):

Heap (spark.executor.memory) – there is no hard boundary between the Execution and Storage regions of Spark Memory:
• Usable Memory = Executor Memory – Reserved Memory
• Execution Memory = Usable Memory * spark.memory.fraction * (1 – spark.memory.storageFraction)
• Storage Memory = Usable Memory * spark.memory.fraction * spark.memory.storageFraction
• User Memory = Usable Memory * (1 – spark.memory.fraction)
• Reserved Memory = reserved_system_memory_bytes = 300 MB

Overhead: spark.executor.memoryOverhead – 10% of executor memory, or a minimum of 384 MB.
Off-Heap: spark.memory.offHeap.size – disabled by default.
The remaining node memory is used by daemon processes and other non-Spark workloads.
• Off-heap Memory:
 We covered this in RDD Persistence.
 Off-heap memory is a segment of memory that lies outside the JVM but is used by the JVM for certain use-cases.
Spark can also use off-heap memory explicitly to store serialized DataFrames and RDDs.
 Spark may use off-heap memory for data-intensive applications.
 Users can also persist data in off-heap memory using the persist() method (see the sketch after this list).
 Off-heap storage is not managed by the JVM's GC mechanism and so must be explicitly handled by the
application.
 It is disabled by default.
spark.memory.offHeap.enabled (Default: false)
spark.memory.offHeap.size  Can be set after enabling it.
• Overhead Memory:
 Default – 10% of executor memory, with a minimum of 384 MB.
 Can be set with the property:
spark.executor.memoryOverhead
 It basically covers expenses like VM overheads, interned strings, other native overheads, etc.

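A minimal PySpark sketch of persisting data off-heap, assuming a fresh session is created just for this demo (the app
name and the 1g size are illustrative assumptions):

from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder \
    .appName("offheap-demo") \
    .config("spark.memory.offHeap.enabled", "true") \
    .config("spark.memory.offHeap.size", "1g") \
    .getOrCreate()

df = spark.range(1000000)
df.persist(StorageLevel.OFF_HEAP)   # store serialized blocks outside the JVM heap
df.count()                          # action to materialize the cache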
• The Executor Container is a JVM process. Its memory is split into 3 sections – Heap Memory, Off-Heap Memory and Overhead Memory.
Heap Memory:
 All objects in heap memory are subject to GC.
 3 regions – Reserved Memory, User Memory and Spark Memory (unified Execution/Storage Memory).
 Reserved Memory: reserved to store Spark's internal objects. Hardcoded to 300 MB.
 User Memory:
 Stores all the user-defined data structures, any UDFs created by the user, etc.
 Not managed by Spark.
 Formula – Usable Memory * (1 - spark.memory.fraction)
Ex – 3796 * (1 – 0.75) = 949 MB
 Spark Memory: managed by Spark.
Storage Memory:
 Used for storing all the cached data and broadcast (shared) variables.
 Used by any persist() operation with a memory storage level.
 Spark evicts old blocks to make room for new ones using an LRU mechanism.
 Data evicted from the cache is either written to disk or, if MEMORY_ONLY is set, recomputed when needed again.
Formula – Usable Memory * spark.memory.fraction * spark.memory.storageFraction
Ex – 3796 * 0.75 * 0.50 = 1423 MB
Execution Memory:
 Used by Spark for objects created during the execution of a task.
 When you perform transformations and actions, all the intermediate results are stored here.
 Ex – it is used to store the hash table for a hash aggregation step.
 Supports spilling to disk if not enough memory is available.
Formula – Usable Memory * spark.memory.fraction * (1 – spark.memory.storageFraction)
Ex – 3796 * 0.75 * (1 - 0.5) = 1423 MB

Dynamic Occupancy Mechanism:

[Diagram] Execution Memory | Storage Memory | Disk

Important notes for Execution Memory:
• Execution memory blocks cannot be forcefully evicted by storage.
• Execution memory can borrow space from Storage memory if the blocks in Storage memory are not being used.
• If blocks of Execution memory are being used by Storage memory and Execution needs more memory, it can forcefully evict the excess
blocks occupied by Storage memory. The evicted storage blocks are written to disk, or recomputed if the persistence level is MEMORY_ONLY.
• Execution spills to disk if still more memory is required.

Dynamic Occupancy Mechanism:

[Diagram] Storage Memory | Execution Memory | Disk (storage cannot evict execution blocks)

Important notes for Storage Memory:
• Storage memory can borrow space from Execution memory only if the blocks in Execution memory are not being used.
• Storage memory cannot forcefully evict the excess blocks occupied by Execution memory. It has to wait until Spark releases the excess
blocks held by Execution memory and can only then occupy them.

Example (for understanding purposes):
Assume 4 GB of memory is allocated to the executor and the default configurations are left unchanged.

Executor Memory (EM) = 4 GB
Reserved Memory (RM) = 300 MB
Usable Memory = EM - RM = 4 GB – 300 MB = 4096 MB – 300 MB = 3796 MB
spark.memory.fraction (fraction of heap space used for execution and storage) = 0.75
spark.memory.storageFraction (amount of storage memory immune to eviction) = 0.50
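The same breakdown can be computed with a few lines of Python (a rough sketch of the arithmetic only; the variable
names are illustrative):

executor_memory_mb = 4 * 1024                                   # spark.executor.memory = 4 GB
reserved_memory_mb = 300                                        # hardcoded reserved memory
usable_memory_mb = executor_memory_mb - reserved_memory_mb      # 3796 MB

memory_fraction = 0.75                                          # spark.memory.fraction
storage_fraction = 0.50                                         # spark.memory.storageFraction

storage_memory_mb = usable_memory_mb * memory_fraction * storage_fraction          # 1423.5 MB
execution_memory_mb = usable_memory_mb * memory_fraction * (1 - storage_fraction)  # 1423.5 MB
user_memory_mb = usable_memory_mb * (1 - memory_fraction)                          # 949 MB
overhead_mb = max(384, 0.10 * executor_memory_mb)                                  # 409.6 MB

print(storage_memory_mb, execution_memory_mb, user_memory_mb, overhead_mb)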

Heap: spark.executor.memory = 4 GB = 4096 MB

Spark Memory (Usable Memory * spark.memory.fraction):
• Execution Memory = Usable Memory * spark.memory.fraction * (1 – spark.memory.storageFraction)
= 3796 * 0.75 * 0.5 = 1423.5 MB
• Storage Memory = Usable Memory * spark.memory.fraction * spark.memory.storageFraction
= 3796 * 0.75 * 0.5 = 1423.5 MB

User Memory = Usable Memory * (1 - spark.memory.fraction) = 3796 * (1 - 0.75) = 3796 * 0.25 = 949 MB

Reserved Memory = reserved_system_memory_bytes = 300 MB

Overhead = spark.executor.memoryOverhead = 10% or a minimum of 384 MB = 409.6 MB

Off-Heap = spark.memory.offHeap.size – disabled by default.
