Policy Assignment

The Spark Streaming driver will act as a listener, retrieving all configuration (SpoutSpec, RouterSpec, AlertBoltSpec, PublishSpec, StreamDefinition) from the ZooKeeper node cache and broadcasting it to tasks. If any configuration changes, the listener will consume the change message and re-broadcast it to tasks. In the detailed design, we will add an interface SpecListener and a MetadataType.ALL, and add a new case to the onNewConfig function in ZKMetadataChangeNotifyService.
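A minimal sketch of the listener/broadcast path. The SpecListener interface and MetadataType.ALL mirror the names proposed above, but the method signatures, the ChangeNotifyService class, and the string payload are all illustrative assumptions, not Eagle's actual API; in the real design the notification would be driven by the ZooKeeper node cache rather than a direct call.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative enum; the real MetadataType already has the per-spec values,
// and the proposal adds ALL for whole-config broadcast.
enum MetadataType { SPOUT_SPEC, ROUTER_SPEC, ALERT_BOLT_SPEC, PUBLISH_SPEC, STREAM_DEFINITION, ALL }

// Hypothetical SpecListener: invoked with the full spec snapshot on change.
interface SpecListener {
    void onSpecChange(Map<MetadataType, Object> specs);
}

// Stand-in for the change-notify service; onNewConfig re-broadcasts to all
// registered listeners, as the driver would re-broadcast to tasks.
class ChangeNotifyService {
    private final List<SpecListener> listeners = new ArrayList<>();
    public void register(SpecListener l) { listeners.add(l); }
    public void onNewConfig(Map<MetadataType, Object> specs) {
        for (SpecListener l : listeners) l.onSpecChange(specs);
    }
}

public class SpecListenerDemo {
    public static void main(String[] args) {
        ChangeNotifyService svc = new ChangeNotifyService();
        Map<MetadataType, Object> received = new HashMap<>();
        svc.register(received::putAll);

        Map<MetadataType, Object> specs = new HashMap<>();
        specs.put(MetadataType.ALL, "all-specs-snapshot");
        svc.onNewConfig(specs);

        System.out.println(received.get(MetadataType.ALL)); // prints all-specs-snapshot
    }
}
```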

Partition

In the alert engine on Storm, tuples are partitioned by Eagle's custom partition logic and emitted downstream by streamId. In Spark Streaming, the RDD will be repartitioned by the same logic so that each tuple is sent to the corresponding task. The number of tasks is equal to the number of RDD partitions.
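The key property is that Storm and Spark apply the same deterministic partition function, so a tuple for a given stream always lands on the same task index. A sketch under the assumption that routing hashes the streamId (Eagle's real routing logic may also involve partition keys and the RouterSpec):

```java
// Illustrative partitioner shared by both engines: stable hash of streamId
// modulo the partition count. On Spark, numPartitions would be the number of
// RDD partitions, which equals the number of tasks.
public class StreamPartitioner {
    private final int numPartitions;

    public StreamPartitioner(int numPartitions) {
        this.numPartitions = numPartitions;
    }

    // Mask the sign bit so negative hashCodes still map to a valid partition.
    public int partition(String streamId) {
        return (streamId.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        StreamPartitioner p = new StreamPartitioner(4);
        // The same streamId always routes to the same partition.
        System.out.println(p.partition("oozieStream") == p.partition("oozieStream")); // prints true
    }
}
```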

Hybrid Time Window

Apache Eagle supports session window operations based on event time. It can also flush events in a window based on system time, to avoid high latency. Apache Eagle aggregates events on the heap in order to handle events arriving out of order. Due to the nature of micro-batches, windowing support in Spark is very limited as of now: batches can only be windowed by processing time. In the future, Spark Structured Streaming will support watermarks, event time, flexible windowing, and out-of-order events natively. For now, we will use Apache Eagle's existing window functions. We will store window elements in an Accumulator and recover the window elements during RDD initialization.
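The hybrid flush rule above can be sketched as a small heap-buffered window: it flushes either when the event-time watermark passes the window's end, or when a system-time timeout expires so late events cannot delay output indefinitely. All class and field names here are illustrative assumptions, not Eagle's actual window implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative hybrid window: event-time boundary plus a system-time cap.
public class HybridWindow {
    private final long windowEndEventTime;   // event-time end of the window
    private final long maxSystemDelayMs;     // max wall-clock wait before forced flush
    private final long createdAtSystemTime;  // system time when the window opened
    private final List<String> buffer = new ArrayList<>(); // events held on the heap

    public HybridWindow(long windowEndEventTime, long maxSystemDelayMs, long nowSystemTime) {
        this.windowEndEventTime = windowEndEventTime;
        this.maxSystemDelayMs = maxSystemDelayMs;
        this.createdAtSystemTime = nowSystemTime;
    }

    public void add(String event) { buffer.add(event); }

    // Flush when the watermark passes the window end (event time), OR when the
    // window has been open too long (system time).
    public boolean shouldFlush(long watermarkEventTime, long nowSystemTime) {
        return watermarkEventTime >= windowEndEventTime
            || nowSystemTime - createdAtSystemTime >= maxSystemDelayMs;
    }

    public static void main(String[] args) {
        HybridWindow w = new HybridWindow(1000L, 500L, 0L);
        w.add("event-1");
        System.out.println(w.shouldFlush(900L, 100L)); // neither condition met yet
        System.out.println(w.shouldFlush(900L, 600L)); // system-time cap forces a flush
    }
}
```

The clock values are injected as parameters rather than read from System.currentTimeMillis(), which keeps the flush decision deterministic and testable.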

Scalability

In Spark 2.0, Spark Streaming can scale the number of executors registered with the application up and down based on the workload. See https://issues.apache.org/jira/browse/SPARK-12133.
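A possible configuration for this, using the streaming-specific dynamic allocation keys introduced by the SPARK-12133 implementation (these keys should be verified against the exact Spark version in use, and the min/max values below are placeholders):

```
spark.streaming.dynamicAllocation.enabled=true
spark.streaming.dynamicAllocation.minExecutors=2
spark.streaming.dynamicAllocation.maxExecutors=10
```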

Self-Monitoring

For now, Apache Eagle uses MultiCountMetric to store metrics. Spark can use Accumulators to store metrics, but the StreamContext interface is bound to Storm. So we should first refactor the interface to be more abstract, then implement the interface on top of Spark.
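One possible shape for that refactoring: an engine-agnostic counter interface that the Storm side could back with MultiCountMetric and the Spark side with Accumulators. The interface and class names below are hypothetical, shown only to illustrate the abstraction boundary; an in-memory implementation stands in for both engines.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical engine-agnostic metric interface.
interface StreamCounter {
    void incr(String scope, String key);
    long get(String scope, String key);
}

// In-memory reference implementation. A Storm version would delegate to
// MultiCountMetric; a Spark version would delegate the same calls to a
// registered Accumulator.
class MapStreamCounter implements StreamCounter {
    private final Map<String, Long> counts = new HashMap<>();

    public void incr(String scope, String key) {
        counts.merge(scope + "." + key, 1L, Long::sum);
    }

    public long get(String scope, String key) {
        return counts.getOrDefault(scope + "." + key, 0L);
    }
}

public class CounterDemo {
    public static void main(String[] args) {
        StreamCounter c = new MapStreamCounter();
        c.incr("alertBolt", "eval_count");
        c.incr("alertBolt", "eval_count");
        System.out.println(c.get("alertBolt", "eval_count")); // prints 2
    }
}
```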

Kafka Integration

We will modify DirectKafkaInputDStream to make sure it can support dynamically adding/deleting topics without requiring a restart of the streaming context. The detailed design is to always refresh the TopicAndPartition offsets before creating each KafkaRDD.
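The refresh step can be sketched as a pure reconciliation between the offsets we are tracking and the topic-partitions currently visible in Kafka: newly appeared partitions are adopted at a fetched starting offset, and removed ones are dropped. This is a logic-only sketch, assuming a "topic-partition" string in place of Kafka's TopicAndPartition type; the real change lives inside DirectKafkaInputDStream's batch-compute path.

```java
import java.util.HashMap;
import java.util.Map;

public class OffsetRefresher {
    // tracked: offsets consumed so far; latestFromKafka: partitions currently
    // visible in Kafka, mapped to the starting offset to use for new ones.
    public static Map<String, Long> refresh(Map<String, Long> tracked,
                                            Map<String, Long> latestFromKafka) {
        Map<String, Long> next = new HashMap<>();
        for (Map.Entry<String, Long> e : latestFromKafka.entrySet()) {
            // Keep our consumed offset for known partitions; adopt the fetched
            // starting offset for newly added ones.
            next.put(e.getKey(), tracked.getOrDefault(e.getKey(), e.getValue()));
        }
        // Partitions no longer reported by Kafka are implicitly dropped.
        return next;
    }

    public static void main(String[] args) {
        Map<String, Long> tracked = new HashMap<>();
        tracked.put("old-topic-0", 42L);

        Map<String, Long> latest = new HashMap<>();
        latest.put("old-topic-0", 0L);  // existing partition
        latest.put("new-topic-0", 0L);  // topic added at runtime

        Map<String, Long> next = refresh(tracked, latest);
        System.out.println(next.get("old-topic-0") + "," + next.get("new-topic-0")); // prints 42,0
    }
}
```

Running this reconciliation before each KafkaRDD is created is what lets new topics flow into the job without restarting the streaming context.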
