
Spark Streaming

Tathagata “TD” Das

Who am I?

•  Core committer on Apache Spark

•  Lead developer on Spark Streaming

•  On leave from the PhD program at UC Berkeley


Big Streaming Data Processing
-  Fraud detection in bank transactions
-  Anomalies in sensor data
-  Cat videos in tweets
How to Process Big Streaming Data
[Diagram: raw data streams → distributed processing system → processed data]

-  Scales to hundreds of nodes
-  Achieves low latency
-  Efficiently recovers from failures
-  Integrates with batch and interactive processing



What have people been doing?
> Build two stacks – one for batch, one for streaming
-  Often both process the same data

> Existing frameworks cannot do both


-  Either stream processing of 100s of MB/s with low latency
-  Or batch processing of TBs of data with high latency

> Extremely painful to maintain two different stacks


-  Different programming models
-  Doubles implementation effort
-  Doubles operational effort
Fault-tolerant Stream Processing
> Traditional processing model
-  Pipeline of nodes
-  Each node maintains mutable state
-  Each input record updates the state and new records are sent out

[Diagram: input records flow through a pipeline of nodes (node 1, node 2, node 3), each holding mutable state]

> Mutable state is lost if a node fails

> Making stateful stream processing fault-tolerant is challenging!
Existing Streaming Systems

> Storm
-  Replays a record if it is not processed by a node
-  Processes each record at least once
-  May update mutable state twice!
-  Mutable state can be lost due to failure!

> Trident – Use transactions to update state


-  Processes each record exactly once
-  Per-state transaction to external database is slow

What is Spark Streaming?
> Receives data streams from input sources, processes them in a cluster, and pushes results out to databases/dashboards
> Scalable, fault-tolerant, second-scale latencies

[Diagram: input sources (Kafka, Flume, HDFS, Kinesis, Twitter) → Spark Streaming → HDFS, databases, dashboards]
How does Spark Streaming work?
>  Chops up data streams into batches of a few seconds
>  Spark treats each batch of data as an RDD and processes it using RDD operations
>  Processed results are pushed out in batches

[Diagram: data streams → receivers → batches as RDDs → Spark → results as RDDs]
Spark Streaming Programming Model
> Discretized Stream (DStream)
-  Represents a stream of data
-  Implemented as a sequence of RDDs

> The DStream API is very similar to the RDD API
-  Functional APIs in Scala, Java
-  Create input DStreams from different sources
-  Apply parallel operations
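A minimal end-to-end sketch of the programming model, assuming a socket text source on localhost:9999 (used purely for illustration):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WordCountSketch {
  def main(args: Array[String]): Unit = {
    // 1-second batch interval: each batch of input becomes one RDD
    val conf = new SparkConf().setAppName("WordCountSketch")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Input DStream from a TCP socket
    val lines = ssc.socketTextStream("localhost", 9999)

    // Parallel operations, just like on RDDs
    val counts = lines.flatMap(_.split(" ")).map(w => (w, 1)).reduceByKey(_ + _)

    // Output operation: print the first elements of every batch
    counts.print()

    // Nothing runs until the context is started
    ssc.start()
    ssc.awaitTermination()
  }
}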
Example – Get hashtags from Twitter
val ssc = new StreamingContext(sparkContext, Seconds(1))
val tweets = TwitterUtils.createStream(ssc, auth)

[Diagram: the Twitter Streaming API feeds the input DStream "tweets"; each batch interval (batch @ t, t+1, t+2) produces one RDD, stored in memory]
Example – Get hashtags from Twitter
val tweets = TwitterUtils.createStream(ssc, None)
val hashTags = tweets.flatMap(status => getTags(status))

transformation: modify data in one DStream to create another DStream

[Diagram: flatMap is applied to the RDD of each batch (batch @ t, t+1, t+2) of the tweets DStream, producing the transformed hashTags DStream; new RDDs, e.g. [#cat, #dog, ...], are created for every batch]
Example – Get hashtags from Twitter
val tweets = TwitterUtils.createStream(ssc, None)
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")

output operation: push data to external storage

[Diagram: for each batch (batch @ t, t+1, t+2), flatMap transforms the tweets DStream into the hashTags DStream, and save writes every batch to HDFS]
Example – Get hashtags from Twitter
val tweets = TwitterUtils.createStream(ssc, None)
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.foreachRDD(hashTagRDD => { ... })

foreach: do whatever you want with the processed data

[Diagram: for each batch (batch @ t, t+1, t+2), flatMap transforms the tweets DStream into the hashTags DStream, and foreach processes every batch - write to a database, update an analytics UI, do whatever you want]
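As a concrete illustration of the foreach pattern, here is a minimal sketch that writes each batch to an external store; createConnection and store are hypothetical helpers standing in for whichever client library is used.

hashTags.foreachRDD { hashTagRDD =>
  // Runs on the driver once per batch; the closure below runs on the executors
  hashTagRDD.foreachPartition { tags =>
    val conn = createConnection()            // hypothetical client API
    tags.foreach(tag => conn.store(tag))     // one connection per partition, not per record
    conn.close()
  }
}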
Languages
Scala API

val tweets = TwitterUtils.createStream(ssc, auth)
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")

Java API (uses Function objects)

JavaDStream<Status> tweets = ssc.twitterStream();
JavaDStream<String> hashTags = tweets.flatMap(new Function<...> { ... });
hashTags.saveAsHadoopFiles("hdfs://...");

Python API
...coming soon
Window-based Transformations
val tweets = TwitterUtils.createStream(ssc, auth)
val hashTags = tweets.flatMap(status => getTags(status))
val tagCounts = hashTags.window(Minutes(1), Seconds(5)).countByValue()

sliding window operation: window length = Minutes(1), sliding interval = Seconds(5)

[Diagram: a window of the given length slides over the DStream of data, advancing by the sliding interval]
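For counts in particular, pair-DStream window operations can also update the previous window's result incrementally instead of recomputing it from scratch; a sketch of this variant, assuming checkpointing has been enabled (required for the inverse-reduce form):

ssc.checkpoint("hdfs://...")   // required for the incremental form

val tagCounts = hashTags
  .map(tag => (tag, 1))
  .reduceByKeyAndWindow(
    (a: Int, b: Int) => a + b,   // add counts entering the window
    (a: Int, b: Int) => a - b,   // subtract counts leaving the window
    Minutes(1), Seconds(5))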
Arbitrary Stateful Computations
Specify a function that generates new state based on the previous state and new data
-  Example: Maintain per-user mood as state, and update it with their tweets

   def updateMood(newTweets, lastMood) => newMood

   val moods = tweetsByUser.updateStateByKey(updateMood _)
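The snippet above is pseudocode; a minimal concrete sketch of updateStateByKey, assuming tweetsByUser is a DStream of (userId, tweetText) pairs and scoreMood is a hypothetical function mapping a tweet to a mood score (stateful operations also require checkpointing to be enabled):

ssc.checkpoint("hdfs://...")   // required for stateful operations

// Called once per key and batch: the key's new values plus its previous state
def updateMood(newTweets: Seq[String], lastMood: Option[Double]): Option[Double] = {
  val previous = lastMood.getOrElse(0.0)
  val updated = newTweets.foldLeft(previous)((mood, tweet) => mood + scoreMood(tweet))
  Some(updated)   // returning None would drop the key's state
}

val moods = tweetsByUser.updateStateByKey(updateMood _)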

Arbitrary Combinations of Batch and Streaming Computations

Inter-mix RDD and DStream operations!
-  Example: Join incoming tweets with a spam HDFS file to filter out bad tweets

   tweets.transform(tweetsRDD => {
     tweetsRDD.join(spamFile).filter(...)
   })
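A fuller sketch of the same idea, assuming tweets is a DStream of (userId, tweetText) pairs and the spam file lists one known-spammer userId per line (the file path and the leftOuterJoin-based filtering are illustrative choices, not the slide's exact code):

// Batch RDD built once from a file on HDFS, keyed by user id
val spamUsers = sparkContext.textFile("hdfs://.../spam-users.txt").map(id => (id, true))

// transform exposes the RDD of every batch, so batch and streaming data can be joined
val cleanTweets = tweets.transform { tweetsRDD =>
  tweetsRDD.leftOuterJoin(spamUsers)
           .filter { case (_, (_, spamFlag)) => spamFlag.isEmpty }   // keep users not on the spam list
           .map { case (user, (text, _)) => (user, text) }
}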


 
DStreams + RDDs = Power
> Combine live data streams with historical data
-  Generate historical data models with Spark, etc.
-  Use data models to process the live data stream

> Combine streaming with MLlib, GraphX algorithms
-  Offline learning, online prediction
-  Online learning and prediction

[Diagram: Spark SQL, Spark Streaming, MLlib (machine learning), and GraphX (graph processing) all run on the Apache Spark core]

> Interactively query streaming data using SQL
-  select * from table_from_streaming_data
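A minimal sketch of the SQL idea, assuming Spark 1.3+ APIs: each batch is registered as a temporary table that can then be queried interactively (the Tweet case class, tweetStream, table name, and query are illustrative):

import org.apache.spark.sql.SQLContext

case class Tweet(user: String, text: String)

val sqlContext = new SQLContext(sparkContext)
import sqlContext.implicits._

tweetStream.foreachRDD { rdd =>
  // Register the latest batch under a table name that SQL queries can see
  rdd.map { case (user, text) => Tweet(user, text) }.toDF().registerTempTable("tweets")
}

// Later, from a shell or another thread:
val recent = sqlContext.sql("select * from tweets where text like '%#spark%'")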
Advantage of a Unified Stack

>  Explore data interactively to identify problems

$ ./spark-shell
scala> val file = sc.hadoopFile("smallLogs")
...
scala> val filtered = file.filter(_.contains("ERROR"))
...
scala> val mapped = filtered.map(...)
...

>  Use the same code in Spark for processing large logs

object ProcessProductionData {
  def main(args: Array[String]) {
    val sc = new SparkContext(...)
    val file = sc.hadoopFile("productionLogs")
    val filtered = file.filter(_.contains("ERROR"))
    val mapped = filtered.map(...)
    ...
  }
}

>  Use similar code in Spark Streaming for realtime processing

object ProcessLiveStream {
  def main(args: Array[String]) {
    val ssc = new StreamingContext(...)
    val stream = KafkaUtils.createStream(...)
    val filtered = stream.filter(_.contains("ERROR"))
    val mapped = filtered.map(...)
    ...
  }
}
Performance
Can process 60M records/sec (6 GB/sec) on
100 nodes at sub-second latency

[Charts: cluster throughput (GB/s) vs. number of nodes in cluster (0-100) for Grep and WordCount, with 1 sec and 2 sec batch intervals]
Fault-tolerance
> Batches of input data are replicated in memory for fault-tolerance

> Data lost due to worker failure can be recomputed from the replicated input data

> All transformations are fault-tolerant, with exactly-once transformation semantics

[Diagram: the tweets RDD (input data, replicated in memory) is flatMapped into the hashTags RDD; lost partitions are recomputed on other workers]
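Relatedly, the driver itself can be made recoverable by checkpointing the StreamingContext; a minimal sketch, assuming a reachable checkpoint directory on HDFS (paths and app name are illustrative):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("RecoverableApp")
  val ssc = new StreamingContext(conf, Seconds(1))
  ssc.checkpoint("hdfs://.../checkpoints")
  // ... set up DStreams and output operations here ...
  ssc
}

// Rebuilds the context from checkpoint data if the driver restarts,
// otherwise creates a fresh one with createContext
val ssc = StreamingContext.getOrCreate("hdfs://.../checkpoints", createContext _)
ssc.start()
ssc.awaitTermination()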
Input Sources
•  Out of the box, we provide
-  Kafka, Flume, Kinesis, Raw TCP sockets, HDFS, etc.

•  Very easy to write a custom receiver
-  Define what to do when the receiver is started and stopped (see the sketch below)

•  Also, generate your own sequence of RDDs, etc. and push them in as a “stream”
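A minimal sketch of a custom receiver, assuming a hypothetical blocking call fetchNextRecord() that returns one record at a time:

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class MyReceiver extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    // Start a thread that pulls data and hands it to Spark Streaming
    new Thread("MyReceiver") {
      override def run(): Unit = {
        while (!isStopped()) {
          store(fetchNextRecord())   // fetchNextRecord() is a hypothetical client call
        }
      }
    }.start()
  }

  def onStop(): Unit = {
    // Nothing to clean up; the thread above exits once isStopped() is true
  }
}

// Used like any other input source:
// val stream = ssc.receiverStream(new MyReceiver())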


Output Sinks
•  HDFS, S3, etc. (Hadoop API compatible filesystems)

•  Cassandra (using Spark-Cassandra connector)

•  HBase (integrated support coming to Spark soon)

•  Directly push the data anywhere


Conclusion
