This page discusses the implementation of Flink's distributed communication via Akka, which was adopted in version 0.9. With Akka, all remote procedure calls are realized as asynchronous messages. This mainly affects the JobManager, TaskManager and JobClient components. In the future, it is likely that more components will be turned into actors, allowing them to send and process asynchronous messages.

...

Wherever possible, Flink uses asynchronous messages and handles responses as futures. Futures and the few existing blocking calls have a timeout after which the operation is considered failed. This prevents the system from deadlocking if a message gets lost or a distributed component crashes. However, on a very large cluster or a slow network, timeouts might be triggered even though nothing has failed. Therefore, the timeout for these operations can be specified via "akka.ask.timeout" in the configuration.
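The timeout behaviour described above can be illustrated with a minimal sketch. It uses Python's standard `concurrent.futures` module as a stand-in for Akka's futures, so it is an analogy rather than Flink's actual code; the names `ask`, `fast_handler` and `slow_handler` are hypothetical.

```python
import concurrent.futures
import time


def ask(executor, handler, message, timeout_s):
    """Send `message` to `handler` asynchronously and wait at most `timeout_s` seconds."""
    future = executor.submit(handler, message)
    try:
        return ("ok", future.result(timeout=timeout_s))
    except concurrent.futures.TimeoutError:
        # No reply arrived in time: report a failure instead of blocking forever.
        return ("failed", None)


def fast_handler(message):
    return message.upper()


def slow_handler(message):
    time.sleep(0.5)  # simulates a lost message or an unresponsive component
    return message.upper()


with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
    fast_reply = ask(pool, fast_handler, "ping", timeout_s=2.0)
    slow_reply = ask(pool, slow_handler, "ping", timeout_s=0.1)
```

A timeout that is too short turns a merely slow reply into a reported failure, which is why the value is configurable.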

Before an actor can talk to another actor, it has to retrieve an ActorRef for it. This lookup also requires a timeout. To make the system fail fast if an actor has not been started, the lookup timeout is set to a smaller value than the regular timeout. If you experience lookup timeouts, you can increase the lookup timeout via "akka.lookup.timeout" in the configuration.

Another peculiarity of Akka is that it limits the maximum size of a message it can send. The reason is that it reserves a serialization buffer of the same size and does not want to waste memory. If you ever encounter a transmission error because a message exceeded the maximum size, you can increase the frame size via "akka.framesize" in the configuration.
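As a rough illustration of the size limit, the following sketch checks a serialized payload against a configured frame size. The helper names `parse_size_bytes` and `fits_in_frame` are hypothetical, only the `b` suffix is handled because it is the only size unit that appears on this page, and treating the limit as inclusive is an assumption.

```python
def parse_size_bytes(text):
    """Parse a size such as '10485760b' into a number of bytes ('b' suffix only)."""
    value = text.strip()
    if not value.endswith("b") or not value[:-1].strip().isdigit():
        raise ValueError(f"expected a size like '10485760b', got {text!r}")
    return int(value[:-1])


def fits_in_frame(payload, framesize="10485760b"):
    """Return True if the serialized payload does not exceed the frame size."""
    # Assumption: a payload of exactly the frame size is still accepted.
    return len(payload) <= parse_size_bytes(framesize)
```

For example, `fits_in_frame(b"x" * 11, "10b")` is `False`, which corresponds to the case where the frame size would have to be increased.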

...

  • akka.ask.timeout: Timeout used for all futures and blocking Akka calls. If Flink fails due to timeouts then you should try to increase this value. Timeouts can be caused by slow machines or a congested network. The timeout value requires a time-unit specifier (ms/s/min/h/d) (DEFAULT: 100 s).

...

  • akka.lookup.timeout: Timeout used for the lookup of the JobManager. The timeout value has to contain a time-unit specifier (ms/s/min/h/d) (DEFAULT: 10 s).

  • akka.framesize: Maximum size of messages which are sent between the JobManager and the TaskManagers. If Flink fails because messages exceed this limit, then you should increase it. The message size requires a size-unit specifier (DEFAULT: 10485760b).

  • akka.watch.heartbeat.interval: Heartbeat interval for Akka's DeathWatch mechanism to detect dead TaskManagers. If TaskManagers are wrongly marked dead because of lost or delayed heartbeat messages, then you should increase this value. A thorough description of Akka's DeathWatch can be found here (DEFAULT: akka.ask.timeout/10).

  • akka.watch.heartbeat.pause: Acceptable heartbeat pause for Akka's DeathWatch mechanism. A low value does not allow an irregular heartbeat. A thorough description of Akka's DeathWatch can be found here (DEFAULT: akka.ask.timeout).

  • akka.watch.threshold: Threshold for the DeathWatch failure detector. A low value is prone to false positives whereas a high value increases the time to detect a dead TaskManager. A thorough description of Akka's DeathWatch can be found here (DEFAULT: 12).

  • akka.transport.heartbeat.interval: Heartbeat interval for Akka's transport failure detector. Since Flink uses TCP, the detector is not necessary. Therefore, the detector is disabled by setting the interval to a very high value. If you need the transport failure detector, set the interval to some reasonable value. The interval value requires a time-unit specifier (ms/s/min/h/d) (DEFAULT: 1000 s).

  • akka.transport.heartbeat.pause: Acceptable heartbeat pause for Akka's transport failure detector. Since Flink uses TCP, the detector is not necessary. Therefore, the detector is disabled by setting the pause to a very high value. If you need the transport failure detector, set the pause to some reasonable value. The pause value requires a time-unit specifier (ms/s/min/h/d) (DEFAULT: 6000 s).

  • akka.transport.threshold: Threshold for the transport failure detector. Since Flink uses TCP, the detector is not necessary and, thus, the threshold is set to a high value (DEFAULT: 300).

  • akka.tcp.timeout: Timeout for all outbound connections. If you experience problems with connecting to a TaskManager due to a slow network, you should increase this value (DEFAULT: akka.ask.timeout).

  • akka.throughput: Number of messages that are processed in a batch before returning the thread to the pool. Low values denote fair scheduling whereas high values can increase performance at the cost of unfairness (DEFAULT: 15).

  • akka.log.lifecycle.events: Turns on Akka's remote logging of events. Set this value to 'on' for debugging (DEFAULT: off).

  • akka.startup-timeout: Timeout after which the startup of a remote component is considered to have failed (DEFAULT: akka.ask.timeout).
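Several defaults in the list above are expressed relative to akka.ask.timeout, and the timeout values carry a time-unit specifier. The sketch below shows one way such values could be parsed and resolved; it is not Flink's implementation. The function names `parse_duration_ms` and `resolve_timeouts` are hypothetical, and accepting only whole numbers is an assumption made to keep the arithmetic exact.

```python
import re

# Milliseconds per unit, for the specifiers named in the option descriptions (ms/s/min/h/d).
_UNIT_MS = {"ms": 1, "s": 1000, "min": 60_000, "h": 3_600_000, "d": 86_400_000}


def parse_duration_ms(text):
    """Parse a value such as '100 s' or '500ms' into whole milliseconds."""
    match = re.fullmatch(r"\s*(\d+)\s*(ms|s|min|h|d)\s*", text)
    if match is None:
        raise ValueError(f"missing or unknown time unit in {text!r}")
    return int(match.group(1)) * _UNIT_MS[match.group(2)]


def resolve_timeouts(user_config):
    """Apply the documented defaults; unset derived options follow akka.ask.timeout."""
    ask = parse_duration_ms(user_config.get("akka.ask.timeout", "100 s"))
    lookup = parse_duration_ms(user_config.get("akka.lookup.timeout", "10 s"))
    derived_defaults = {
        "akka.watch.heartbeat.interval": ask // 10,  # DEFAULT: akka.ask.timeout/10
        "akka.watch.heartbeat.pause": ask,           # DEFAULT: akka.ask.timeout
        "akka.tcp.timeout": ask,                     # DEFAULT: akka.ask.timeout
        "akka.startup-timeout": ask,                 # DEFAULT: akka.ask.timeout
    }
    resolved = {"akka.ask.timeout": ask, "akka.lookup.timeout": lookup}
    for key, fallback in derived_defaults.items():
        # An explicit user value wins over the value derived from akka.ask.timeout.
        resolved[key] = parse_duration_ms(user_config[key]) if key in user_config else fallback
    return resolved
```

With an empty configuration, the heartbeat interval resolves to 10 seconds; raising akka.ask.timeout to "2 min" also raises every derived value that has not been set explicitly.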