Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

That operation was introduced for “safety” reasons, to avoid the number of cases where users can create incorrect programs by reusing mutable objects (a discouraged pattern, but possible). For example when using state backends that keep the state as objects on heap, reusing mutable objects can theoretically create cases where the same object is used in multiple state mappings.

The effect of that "safety" mechanism is that many people that try or use Flink get much lower performance than they could possibly get. From empirical evidence, almost all users that I (Stephan) have been in touch with eventually run into this issue eventually.


There are multiple observations about that design:

  •  Object copies are extremely costly. While some simple types copy virtually for free (types reliably that are detected as immutable are not copied at all), many real pipelines use types like Avro, Thrift, JSON, etc, which are very expensive to copy.
  • Keyed operations currently only occur after shuffles. The operations are hence the first in a pipeline chain and will never have a reused object anyways. That means for the most critical operation, this precaution is unnecessary.
  • The mode is inconsistent with the contract of the DataSet API, which does not copy at each step
  • To prevent these copies, users can select enableObjectReuse(), which is misleading, since it does not really reuse mutable objects, but avoids additional copies.

...