
Spark Streaming

Tathagata “TD” Das

Who am I?

•  Core committer on Apache Spark

•  Lead developer on Spark Streaming

•  On leave from the PhD program at UC Berkeley


Big Streaming Data Processing
-  Fraud detection in bank transactions
-  Anomalies in sensor data
-  Cat videos in tweets
How to Process Big Streaming Data
[Diagram: raw data streams → distributed processing system → processed data]

-  Scales to hundreds of nodes
-  Achieves low latency
-  Efficiently recovers from failures
-  Integrates with batch and interactive processing



What have people been doing?
> Build two stacks – one for batch, one for streaming
-  Often both process the same data

> Existing frameworks cannot do both


-  Either stream processing of 100s of MB/s with low latency
-  Or batch processing of TBs of data with high latency

> Extremely painful to maintain two different stacks


-  Different programming models
-  Doubles implementation effort
-  Doubles operational effort
Fault-tolerant Stream Processing
> Traditional processing model
-  Pipeline of nodes
-  Each node maintains mutable state
-  Each input record updates the state and new records are sent out

[Diagram: input records flow through a pipeline of nodes (node 1, node 2, node 3), each holding mutable state]

> Mutable state is lost if a node fails

> Making stateful stream processing fault-tolerant is challenging!
Existing Streaming Systems

> Storm
-  Replays a record if it is not processed by a node
-  Processes each record at least once
-  May update mutable state twice!
-  Mutable state can be lost due to failure!

> Trident – Use transactions to update state


-  Processes each record exactly once
-  Per-state transaction to external database is slow

What is Spark Streaming?
> Receives data streams from input sources, processes them in a cluster, and pushes results out to databases/dashboards
> Scalable, fault-tolerant, second-scale latencies

[Diagram: input sources (Kafka, Flume, HDFS, Kinesis, Twitter) → Spark Streaming → HDFS, databases, dashboards]
How does Spark Streaming work?
>  Chops up data streams into batches of a few seconds
>  Spark treats each batch of data as an RDD and processes it using RDD operations
>  Processed results are pushed out in batches

[Diagram: data streams → receivers → batches as RDDs → Spark → results as RDDs]
Spark Streaming Programming Model
> Discretized Stream (DStream)
-  Represents a stream of data
-  Implemented as a sequence of RDDs

> The DStream API is very similar to the RDD API
-  Functional APIs in Scala, Java
-  Create input DStreams from different sources
-  Apply parallel operations
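A minimal end-to-end sketch of the programming model, assuming a socket text source on localhost:9999 (used purely for illustration):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WordCountSketch {
  def main(args: Array[String]): Unit = {
    // 1-second batch interval: each batch of input becomes one RDD
    val conf = new SparkConf().setAppName("WordCountSketch")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Input DStream from a TCP socket
    val lines = ssc.socketTextStream("localhost", 9999)

    // Parallel operations, just like on RDDs
    val counts = lines.flatMap(_.split(" ")).map(w => (w, 1)).reduceByKey(_ + _)

    // Output operation: print the first elements of every batch
    counts.print()

    // Nothing runs until the context is started
    ssc.start()
    ssc.awaitTermination()
  }
}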
Example – Get hashtags from Twitter
val ssc = new StreamingContext(sparkContext, Seconds(1))
val tweets = TwitterUtils.createStream(ssc, auth)

[Diagram: the Twitter Streaming API feeds the input DStream "tweets"; each batch interval (batch @ t, t+1, t+2) produces one RDD, stored in memory]
Example – Get hashtags from Twitter
val tweets = TwitterUtils.createStream(ssc, None)
val hashTags = tweets.flatMap(status => getTags(status))

transformation: modify data in one DStream to create another DStream

[Diagram: flatMap is applied to the RDD of each batch (batch @ t, t+1, t+2) of the tweets DStream, producing the transformed hashTags DStream; new RDDs, e.g. [#cat, #dog, ...], are created for every batch]
Example – Get hashtags from Twitter
val tweets = TwitterUtils.createStream(ssc, None)
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")

output operation: push data to external storage

[Diagram: for each batch (batch @ t, t+1, t+2), flatMap transforms the tweets DStream into the hashTags DStream, and save writes every batch to HDFS]
Example – Get hashtags from Twitter
val tweets = TwitterUtils.createStream(ssc, None)
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.foreachRDD(hashTagRDD => { ... })

foreach: do whatever you want with the processed data

[Diagram: for each batch (batch @ t, t+1, t+2), flatMap transforms the tweets DStream into the hashTags DStream, and foreach processes every batch - write to a database, update an analytics UI, do whatever you want]
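As a concrete illustration of the foreach pattern, here is a minimal sketch that writes each batch to an external store; createConnection and store are hypothetical helpers standing in for whichever client library is used.

hashTags.foreachRDD { hashTagRDD =>
  // Runs on the driver once per batch; the closure below runs on the executors
  hashTagRDD.foreachPartition { tags =>
    val conn = createConnection()            // hypothetical client API
    tags.foreach(tag => conn.store(tag))     // one connection per partition, not per record
    conn.close()
  }
}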
Languages
Scala API

val tweets = TwitterUtils.createStream(ssc, auth)
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")

Java API (uses Function objects)

JavaDStream<Status> tweets = ssc.twitterStream();
JavaDStream<String> hashTags = tweets.flatMap(new Function<...> { ... });
hashTags.saveAsHadoopFiles("hdfs://...");

Python API
...coming soon
Window-based Transformations
val tweets = TwitterUtils.createStream(ssc, auth)
val hashTags = tweets.flatMap(status => getTags(status))
val tagCounts = hashTags.window(Minutes(1), Seconds(5)).countByValue()

sliding window operation: window length = Minutes(1), sliding interval = Seconds(5)

[Diagram: a window of the given length slides over the DStream of data, advancing by the sliding interval]
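For counts in particular, pair-DStream window operations can also update the previous window's result incrementally instead of recomputing it from scratch; a sketch of this variant, assuming checkpointing has been enabled (required for the inverse-reduce form):

ssc.checkpoint("hdfs://...")   // required for the incremental form

val tagCounts = hashTags
  .map(tag => (tag, 1))
  .reduceByKeyAndWindow(
    (a: Int, b: Int) => a + b,   // add counts entering the window
    (a: Int, b: Int) => a - b,   // subtract counts leaving the window
    Minutes(1), Seconds(5))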
Arbitrary Stateful Computations
Specify a function that generates new state based on the previous state and new data
-  Example: Maintain per-user mood as state, and update it with their tweets

   def updateMood(newTweets, lastMood) => newMood

   val moods = tweetsByUser.updateStateByKey(updateMood _)
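The snippet above is pseudocode; a minimal concrete sketch of updateStateByKey, assuming tweetsByUser is a DStream of (userId, tweetText) pairs and scoreMood is a hypothetical function mapping a tweet to a mood score (stateful operations also require checkpointing to be enabled):

ssc.checkpoint("hdfs://...")   // required for stateful operations

// Called once per key and batch: the key's new values plus its previous state
def updateMood(newTweets: Seq[String], lastMood: Option[Double]): Option[Double] = {
  val previous = lastMood.getOrElse(0.0)
  val updated = newTweets.foldLeft(previous)((mood, tweet) => mood + scoreMood(tweet))
  Some(updated)   // returning None would drop the key's state
}

val moods = tweetsByUser.updateStateByKey(updateMood _)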

Arbitrary Combinations of Batch and Streaming Computations

Inter-mix RDD and DStream operations!
-  Example: Join incoming tweets with a spam HDFS file to filter out bad tweets

   tweets.transform(tweetsRDD => {
     tweetsRDD.join(spamFile).filter(...)
   })
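A fuller sketch of the same idea, assuming tweets is a DStream of (userId, tweetText) pairs and the spam file lists one known-spammer userId per line (the file path and the leftOuterJoin-based filtering are illustrative choices, not the slide's exact code):

// Batch RDD built once from a file on HDFS, keyed by user id
val spamUsers = sparkContext.textFile("hdfs://.../spam-users.txt").map(id => (id, true))

// transform exposes the RDD of every batch, so batch and streaming data can be joined
val cleanTweets = tweets.transform { tweetsRDD =>
  tweetsRDD.leftOuterJoin(spamUsers)
           .filter { case (_, (_, spamFlag)) => spamFlag.isEmpty }   // keep users not on the spam list
           .map { case (user, (text, _)) => (user, text) }
}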


 
DStreams + RDDs = Power
> Combine live data streams with historical data
-  Generate historical data models with Spark, etc.
-  Use data models to process the live data stream

> Combine streaming with MLlib, GraphX algorithms
-  Offline learning, online prediction
-  Online learning and prediction

[Diagram: Spark SQL, Spark Streaming, MLlib (machine learning), and GraphX (graph processing) all run on the Apache Spark core]

> Interactively query streaming data using SQL
-  select * from table_from_streaming_data
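A minimal sketch of the SQL idea, assuming Spark 1.3+ APIs: each batch is registered as a temporary table that can then be queried interactively (the Tweet case class, tweetStream, table name, and query are illustrative):

import org.apache.spark.sql.SQLContext

case class Tweet(user: String, text: String)

val sqlContext = new SQLContext(sparkContext)
import sqlContext.implicits._

tweetStream.foreachRDD { rdd =>
  // Register the latest batch under a table name that SQL queries can see
  rdd.map { case (user, text) => Tweet(user, text) }.toDF().registerTempTable("tweets")
}

// Later, from a shell or another thread:
val recent = sqlContext.sql("select * from tweets where text like '%#spark%'")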
Advantage of a Unified Stack

>  Explore data interactively to identify problems

$ ./spark-shell
scala> val file = sc.hadoopFile("smallLogs")
...
scala> val filtered = file.filter(_.contains("ERROR"))
...
scala> val mapped = filtered.map(...)
...

>  Use the same code in Spark for processing large logs

object ProcessProductionData {
  def main(args: Array[String]) {
    val sc = new SparkContext(...)
    val file = sc.hadoopFile("productionLogs")
    val filtered = file.filter(_.contains("ERROR"))
    val mapped = filtered.map(...)
    ...
  }
}

>  Use similar code in Spark Streaming for realtime processing

object ProcessLiveStream {
  def main(args: Array[String]) {
    val ssc = new StreamingContext(...)
    val stream = KafkaUtils.createStream(...)
    val filtered = stream.filter(_.contains("ERROR"))
    val mapped = filtered.map(...)
    ...
  }
}
Performance
Can process 60M records/sec (6 GB/sec) on
100 nodes at sub-second latency

[Charts: cluster throughput (GB/s) vs. number of nodes in cluster (0-100) for Grep and WordCount, with 1 sec and 2 sec batch intervals]
Fault-tolerance
> Batches of input data are replicated in memory for fault-tolerance

> Data lost due to worker failure can be recomputed from the replicated input data

> All transformations are fault-tolerant, with exactly-once transformation semantics

[Diagram: the tweets RDD (input data, replicated in memory) is flatMapped into the hashTags RDD; lost partitions are recomputed on other workers]
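Relatedly, the driver itself can be made recoverable by checkpointing the StreamingContext; a minimal sketch, assuming a reachable checkpoint directory on HDFS (paths and app name are illustrative):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("RecoverableApp")
  val ssc = new StreamingContext(conf, Seconds(1))
  ssc.checkpoint("hdfs://.../checkpoints")
  // ... set up DStreams and output operations here ...
  ssc
}

// Rebuilds the context from checkpoint data if the driver restarts,
// otherwise creates a fresh one with createContext
val ssc = StreamingContext.getOrCreate("hdfs://.../checkpoints", createContext _)
ssc.start()
ssc.awaitTermination()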
Input Sources
•  Out of the box, we provide
-  Kafka, Flume, Kinesis, Raw TCP sockets, HDFS, etc.

•  Very easy to write a custom receiver
-  Define what to do when the receiver is started and stopped (see the sketch below)

•  Also, generate your own sequence of RDDs, etc. and push them in as a “stream”
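A minimal sketch of a custom receiver, assuming a hypothetical blocking call fetchNextRecord() that returns one record at a time:

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class MyReceiver extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    // Start a thread that pulls data and hands it to Spark Streaming
    new Thread("MyReceiver") {
      override def run(): Unit = {
        while (!isStopped()) {
          store(fetchNextRecord())   // fetchNextRecord() is a hypothetical client call
        }
      }
    }.start()
  }

  def onStop(): Unit = {
    // Nothing to clean up; the thread above exits once isStopped() is true
  }
}

// Used like any other input source:
// val stream = ssc.receiverStream(new MyReceiver())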


Output Sinks
•  HDFS, S3, etc. (Hadoop API compatible filesystems)

•  Cassandra (using Spark-Cassandra connector)

•  HBase (integrated support coming to Spark soon)

•  Directly push the data anywhere


Conclusion
