Spark Streaming (2012) Paper Notes


Introduction

Most current distributed streaming systems are based on a record-at-a-time processing model, which faces several challenges:

  • Faults and stragglers: fault recovery (via replication or upstream backup) is costly, and neither scheme can handle stragglers (slow nodes)
  • Consistency: it is difficult to reason about consistent global state, since different nodes may be processing data that arrived at different times
  • Unification with batch processing: these systems cannot easily combine streaming data with historical data

Spark Streaming:

  • proposes a radical design point for a significantly more scalable streaming system: run each streaming computation as a series of deterministic batch computations on small time intervals

  • calls the resulting streams discretized streams, or D-Streams

  • the key insight is that it is possible to achieve sub-second latencies in a batch system by leveraging Resilient Distributed Datasets (RDDs)

Goals and Background

Typical application scenarios for such systems include:

  • Site activity statistics: computing real-time statistics for a website, such as page view counts
  • Spam detection: detecting spam in real time using statistical learning algorithms
  • Cluster monitoring: collecting logs from hundreds or thousands of machines and detecting problems in the cluster
  • Network intrusion detection (NIDS): detecting anomalous activity in real time from millions of log records

2.1 Previous Streaming Systems

Most previous systems employ the same record-at-a-time processing model.
In this model, streaming computations are divided into a set of long-lived stateful operators, and each operator processes records as they arrive by updating internal state and sending new records in response.

2.2 The Challenge of Fault and Straggler Tolerance

To rebuild the state of a lost or slow node, previous systems use one of two schemes:
1. Replication: each operator runs on two nodes that process the same stream in parallel, doubling the hardware cost.
2. Upstream backup: upstream nodes buffer the messages they have sent and replay them to a failed node's replacement, at the cost of long recovery times.
Neither scheme handles stragglers well: a slow node must effectively be treated as a failure, triggering a costly recovery.

Discretized Streams

Instead of relying on long-lived stateful operators that are dependent on cross-stream message order, D-Streams execute computations as a series of short, stateless, deterministic tasks. State across tasks is represented as fault-tolerant data structures (RDDs) that can be recomputed deterministically. This design supports fault tolerance well, and D-Streams additionally provide:
1. clear consistency semantics
2. a simple API
3. unification with batch processing
Concretely, the model treats streaming computations as a series of deterministic batch computations on discrete time intervals. The data received during each interval is stored across the cluster to form an input dataset for that interval; once the interval ends, this dataset is processed by deterministic parallel operations (such as map, reduce, and groupBy) to produce new datasets, which are either program outputs or intermediate state. All of these results are stored as RDDs, as the paper's processing-model figure illustrates.
Users process data by manipulating D-Stream objects, each of which is a sequence of immutable, partitioned datasets, using deterministic operators.
Finally, to recover from faults and stragglers, Spark Streaming requires both D-Streams and RDDs to track their lineage, that is, the graph of deterministic operations used to build them. Lineage is tracked at the level of the partitions within each dataset, as shown in the paper's lineage-graph figure.
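As a running example, the paper counts page views by URL: the events arriving each second form an input dataset, which is mapped to (url, 1) pairs and aggregated across intervals. A minimal sketch in the paper's Scala-like pseudocode (readStream and runningReduce are the paper's illustrative operator names, not necessarily the final public API):

```scala
// A D-Stream of page-view events, batched into 1-second intervals.
val pageViews = readStream("http://...", "1s")

// Stateless per-interval transformation: one (url, 1) pair per event.
val ones = pageViews.map(event => (event.url, 1))

// Stateful aggregation: a cumulative count per URL over all intervals
// seen so far, materialized as a new RDD at the end of each interval.
val counts = ones.runningReduce((a, b) => a + b)
```

If a partition of counts is lost (or stuck on a straggler), it can be recomputed in parallel from its lineage: the corresponding partitions of ones plus the previous interval's counts.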

3.1 Timing Considerations

As described above, each D-Stream groups records by a fixed interval of external time, i.e., the time at which each record arrives at the Spark Streaming system. But many records carry their own timestamps, and logs from multiple clients can arrive out of order due to geography and clock skew. How does Spark Streaming handle such out-of-order data? There are two approaches:
1. The system can wait for a limited slack time before processing each batch, letting slow records arrive. This adds a fixed latency to every result.
2. Correct late records at the application level. For example, suppose an application wants statistics over the interval from t to t+1. The batch covering t to t+1 computes an initial count from the records that arrived on time, and subsequent batches update this count as late records with timestamps in that interval arrive. The updates can be applied with an efficient incremental reduce operator (sketched below). The overall approach resembles "order-independent processing".
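A minimal sketch of approach 2 in the same pseudocode style as above, assuming each event carries an embedded timestamp; floorToInterval is a hypothetical helper that rounds a timestamp down to the start of its 1-second interval:

```scala
val events = readStream("http://...", "1s")

// Key each record by the event-time interval it belongs to (not by its
// arrival time), so a late record still lands in the correct bucket.
val keyed = events.map(e => (floorToInterval(e.timestamp, "1s"), 1))

// An incremental (running) reduce keeps one count per event-time bucket;
// each new batch folds its late arrivals into the previously computed
// totals instead of recomputing them from scratch.
val counts = keyed.runningReduce((a, b) => a + b)
```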

3.2 D-Stream Operators

D-Streams provide two classes of operators:
- Transformation operators, which come in two flavors: stateless ones, where each output RDD depends only on parent RDDs in the same time interval, and stateful ones, which also use older data
- Output operators, which write data out to external systems
The stateless transformations are essentially the same as in Spark, including map, reduce, groupBy, and join; the stateful ones include:
- Windowing: the window operator groups all the records from a sliding window of past time intervals into one RDD. The window length must be a multiple of the system's batch interval.
- Incremental aggregation: chiefly the reduceByWindow operator, which computes a window's aggregate incrementally from the aggregation results of the past few batches (see the sketch after this list).

- State tracking: chiefly the track operator, which derives a stream of states from the event stream, e.g., for per-session statistics.
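For the first two stateful operators, a brief sketch using the released Spark Streaming API (which post-dates the 2012 paper, so names differ slightly from the prototype's). It assumes a DStream[String] named words, a 1-second batch interval, and a configured checkpoint directory (the incremental variant requires one):

```scala
import org.apache.spark.streaming.Seconds

// Windowing: each step yields an RDD holding the last 5 seconds of
// records, sliding forward every 1 second (both durations must be
// multiples of the batch interval, as noted above).
val lastFive = words.window(Seconds(5), Seconds(1))
lastFive.count().print()

// Incremental aggregation: the second (inverse) function "subtracts" the
// interval that just left the window, so each step reuses the previous
// window's total rather than re-reducing all 5 seconds of data.
val windowTotals = words.map(_ => 1).reduceByWindow(
  _ + _,                  // fold in the newly arrived interval
  _ - _,                  // remove the interval that slid out
  Seconds(5), Seconds(1))
windowTotals.print()
```

The paper's track operator has no direct equivalent in this sketch; in the released API, updateStateByKey plays a broadly similar state-tracking role.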

3.3 Consistency

In record-at-a-time streaming systems, keeping state consistent across nodes is a hard problem. With D-Streams, consistency is straightforward: each interval's output reflects all of the input received until then, even if the output is distributed across nodes, simply because each record had to pass through the whole deterministic batch job. This makes distributed state far easier to reason about, and is equivalent to "exactly-once" processing of the data even if there are faults or stragglers.

3.4 Unification with Batch & Interactive Processing

3.5 Summary

System Architecture

The implementation of D-Streams is Spark Streaming, which consists of three components:
1. A master that tracks the D-Stream lineage graph and schedules tasks to compute new RDD partitions.
2. Worker nodes that receive data, store the partitions of input and computed RDDs, and execute tasks.
3. A client library used to send data into the system.
These components are laid out in the paper's architecture figure.

4.1 Application Execution

Each Spark Streaming application consists of one or more input streams, fed either by clients sending data directly or by loading data periodically from an external storage system such as HDFS. When data comes directly from clients, Spark Streaming replicates each input record to two worker nodes before sending an acknowledgement back to the client, so the data survives a single node failure.
Each worker node runs a block manager that manages its received data, and the master runs a block tracker that records which blocks live on which workers, so any worker can locate blocks held by its peers through the tracker.
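As a concrete illustration of the application structure, a minimal driver sketch using the released Spark Streaming API (the host and port are hypothetical); socketTextStream's default storage level replicates received blocks to two nodes, broadly matching the two-copy replication described above:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object PageViewCounts {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("PageViewCounts")
    // The batch interval fixes the granularity of the D-Streams.
    val ssc = new StreamingContext(conf, Seconds(1))

    // One input stream: lines received over a socket. Received blocks go
    // through the workers' block managers and are replicated before use.
    val lines = ssc.socketTextStream("logs.example.com", 9999)

    // A simple per-batch computation: count page views by URL.
    val counts = lines.map(url => (url, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()            // launch receivers and begin scheduling jobs
    ssc.awaitTermination() // run until the application is stopped
  }
}
```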

4.2 Optimizations for Stream Processing

To speed up the processing of streaming data, Spark Streaming adds several optimizations to the Spark engine:
1. Block placement: replicas are placed based on load rather than at random, avoiding hotspot nodes.
2. Network communication: asynchronous I/O lets tasks with remote inputs, such as reduce tasks, fetch them faster.
3. Timestep pipelining: to keep cluster resources fully utilized, jobs that depend on the RDDs of a currently running job can be submitted concurrently.
4. Lineage cutoff: a checkpointing mechanism keeps lineage graphs from growing without bound (a sketch follows this list).
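In the released API, lineage cutoff is exposed through checkpointing. A minimal sketch, reusing the hypothetical ssc and counts from the sketch in 4.1 (the HDFS path is a placeholder):

```scala
// Directory in reliable storage where checkpointed RDDs are written.
ssc.checkpoint("hdfs://...")

// Persist this stream's state every 10 seconds; once a checkpoint is
// written, the lineage before it no longer needs to be replayed on
// recovery, keeping the graph short.
counts.checkpoint(Seconds(10))
```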

Fault and Straggler Recovery

5.1 Parallel Recovery

5.2 Straggler Mitigation

Evaluation

6.1 Performance

6.2 Fault and Straggler Recovery

6.3 Real Applications

Discussion