Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

  • Operator Time: This usually gives the fastest results since using timestamps in processing might require waiting for slower operations and buffering of elements since the elements will arrive at operations out-of-order. That is, they are usually not ordered by their timestamps. This will, however, not give you any guarantees about the ordering of elements when joining several data streams and when operations require a repartitioning of the elements. 
  • Event Time: This makes it possible to define orderings or time windows that are completely independent of the system time of the machines that a topology is running on. This might, however, lead to very high latencies because operations have to wait until they can be sure that they have seen all elements up to a certain timestamp. 
  • Ingress Time: This gives you a good compromise between event time and operator time. Even after repartitioning data or joining several data streams the elements can still be processed according to the time that they entered the system. The latency should also be low because the timestamps that sources attach to elements are steadily increasing.

Tracking the

...

Progress of Time

This only applies to operations that use timestamp time (ingress time or event time) since operator time is only local to one parallel instance of an operation.

When using element timestamps to order elements or perform windowing it becomes necessary to keep track of the progress of time. For example, in a windowing operator the window from t to t + 5 can only be processed once the operator is certain that no more elements with a timestamp lower than t + 5 will arrive from any of its input streams. Flink uses watermarks to keep track of the timestamp of tuples passing through the system: when a source knows that no elements with a timestamp lower than t1 will be emitted in the future it will emit a watermark with timestamp t1. Watermarks are broadcast to downstream operators. The watermark at once specific operator at any point in time is the minimum watermark time over all its inputs. Once that watermark increases the operator can react to that and further broadcast the watermark to downstream operators. In our windowing example from earlier, the operator would wait for the watermark for time t + 5, then process the window and then broadcast watermark t + 5 itself.

 

Latency considerations

Image Added

The concept of watermarks as used in Flink is inspired by Google's MillWheel: http://research.google.com/pubs/pub41378.html and by work building on top of it for Google Cloud Dataflow API: http://strataconf.com/big-data-conference-uk-2015/public/schedule/speaker/193653

Latency considerations

 

 

Ordering Records

  • Ordering for OrderedKeyDataStream
  • Ordering for windows

...

Other windows

Staging buffer