Discussion thread
Vote thread
JIRA: FLINK-17160

Release: 1.11

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

The integration with docker in Flink is currently addressed in many places which often intersect, repeat each other or apply different approaches. This makes it really hard to follow the whole topic for users and maintainers. This FLIP suggests how to unify this topic. It means having one place which has the Dockerfile, all necessary scripts and docs following each other in a consistent way without any repetitions or contradictions.

Current state of Dockerfiles

Currently, we have a lot of places in our repository and docs where different aspects of Flink docker integration are scattered around:

flink-contrib/docker-flink

This was the first ever Flink contribution about how to integrate it with docker and build the docker image with different Flink/Scala/Hadoop versions. This module addresses running Flink in standalone session mode. Also, example scripts to run it on docker compose and swarm were introduced. Additionally, an example of the integration with IBM Bluemix was added. The compose example is also copied into a separate docker example repo.

flink-container/docker

This module was introduced later to address running a Flink job in standalone mode. It is similar to the previous module, and some parts are identical. Additionally, the image can be built with python and job artefacts. It also documents how to pack a user job with Flink into the docker container and run it with docker compose. Its Dockerfile is also used in a sibling module to run it on Kubernetes.

apache/flink-docker (ML discussion thread)

This is the latest official Dockerfile for Flink. It is used to build official Flink docker images for the docker hub. It does not install Python and Hadoop. Its standard entry point configures RPC address and RPC/WebUI ports to constants.

docs/ops/deployment/docker.md

This is part of the official Flink documentation. It refers to all the other places mentioned above in an attempt to bring everything together.

In most cases, the Flink process is started within the container with a start-foreground command, which logs only to the console and not to files. This breaks the log view in the Web UI. The Kubernetes standalone session doc example uses another approach: it starts the Flink process in the background and forwards the logs from the files to the console, which keeps the logs in the Web UI working.

Other components relying on docker images

There are also other places which rely on those listed above:

There are also custom Dockerfiles for tests and development purposes. This FLIP keeps them out of scope of this effort.

Public Interfaces

The changes can affect the entry point of the existing docker hub image.

Proposed Changes

The idea is to keep all docker related resources in apache/flink-docker. It already has a detailed Dockerfile which is well suited for common use cases or at least serves as a good starting point. The suggestion is to make it extensible for other concerns which are currently addressed in other places. This mainly means refactoring of the existing code and introducing more docs as a first step. This effort should enable further improvements and follow-ups for the docker integration with Flink.

Subsequently, all other places would have to be adjusted to rely on or refer to the apache/flink-docker code or docs. Eventually, all other purely docker related places can be completely removed: flink-contrib/docker-flink, flink-container/docker and docs/ops/deployment/docker.md.


Entry point

The default entry point script can accept a command to start JM (job/session) or TM. Additionally, users can customise starting the process by e.g. setting environment vars. This would already allow users to run the default docker image in various modes without creating a custom image, like:

docker run --env ENABLE_BUILT_IN_PLUGINS=true --env-file ./env.list flink session_jobmanager
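The command dispatch described above could look roughly like the following sketch. It is not the actual apache/flink-docker entry point: the function name resolve_command, the default FLINK_HOME of /opt/flink, and the command names job_jobmanager and taskmanager are assumptions made for illustration. Only session_jobmanager comes from the example above. The sketch prints the launcher command line instead of executing it, so the mapping can be inspected without a Flink installation.

```shell
# Hypothetical sketch of an entry point dispatcher, not the actual apache/flink-docker script.
# Assumptions: FLINK_HOME defaults to /opt/flink; job_jobmanager and taskmanager are made-up command names.
FLINK_HOME="${FLINK_HOME:-/opt/flink}"

# Print the launcher command line that a given container command maps to.
# A real entry point would `exec` this command line instead of printing it.
resolve_command() {
  # Refuse to run without a command name
  [ "$#" -gt 0 ] || return 1
  cmd="$1"
  shift
  case "$cmd" in
    session_jobmanager) echo "$FLINK_HOME/bin/jobmanager.sh" start-foreground "$@" ;;
    job_jobmanager)     echo "$FLINK_HOME/bin/standalone-job.sh" start-foreground "$@" ;;
    taskmanager)        echo "$FLINK_HOME/bin/taskmanager.sh" start-foreground "$@" ;;
    *)                  echo "$cmd" "$@" ;;   # pass any other command through unchanged
  esac
}

# Show what two of the commands would resolve to
resolve_command session_jobmanager
resolve_command taskmanager
```

Passing unknown commands through unchanged would keep the image usable for ad hoc commands such as opening a shell in the container.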

User docs

User docs can be extended in markdown files of apache/flink-docker. The docs should explain:

  • the mainstream docker hub image
  • how to run it in various modes 
    • session JM
    • single job JM (plus packing job artifacts)
    • TM
    • other options
    • environment variables
      • FLINK_PROPERTIES to add more Flink config options to flink-conf.yaml (once Flink supports configuration via environment variables, we can consider deprecating FLINK_PROPERTIES)
      • ENABLE_BUILT_IN_PLUGINS
      • Custom jar paths (pointing e.g. to custom locations in mounted docker volumes)
      • Custom logging conf
  • how to extend it to build a custom image
    • install python/hadoop
    • install optional /lib dependencies from /opt or externally
    • install /plugins
    • add user job jar for single job mode
  • add docs with examples of running compose/swarm
    • give script examples (mention job/session)

Also, existing docs/scripts for other components relying on docker images have to be adjusted to reference and adopt approaches described in the docker docs of apache/flink-docker. Eventually, docker/compose/swarm/bluemix scripts can be removed in favour of examples in docs.
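To illustrate the FLINK_PROPERTIES item in the list above, here is a minimal sketch of how an entry point script might append the variable's content to the configuration file. The function name append_flink_properties and the choice to pass the file path as an argument are assumptions. The demo writes to a temporary file rather than a real flink-conf.yaml.

```shell
# Hypothetical sketch: append the content of FLINK_PROPERTIES to a Flink config file.
# The function name and its argument are assumptions, not part of the existing image.
append_flink_properties() {
  conf_file="$1"
  # Nothing to do when the variable is unset or empty
  if [ -z "${FLINK_PROPERTIES:-}" ]; then
    return 0
  fi
  # Each line of FLINK_PROPERTIES is expected to be a "key: value" pair
  printf '%s\n' "$FLINK_PROPERTIES" >> "$conf_file"
}

# Demo against a temporary file instead of a real flink-conf.yaml
conf="$(mktemp)"
echo "jobmanager.rpc.address: localhost" > "$conf"
FLINK_PROPERTIES="taskmanager.numberOfTaskSlots: 4
parallelism.default: 2"
append_flink_properties "$conf"
cat "$conf"
```

With docker run, such a variable could be supplied through --env or --env-file, as in the entry point example.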

Logging

Currently, if deployment scripts start the Flink process in the foreground, the logs are written only to the console and nothing is appended to the usual local files. Logging to the console makes sense when running a docker container, as this is the usual docker way. The problem is that the Web UI cannot display the logs because it relies on those local files. We can modify log4j-console.properties to also write logs into the usual files when the Flink process is started in the foreground. Here we also have to check that this respects the container's disk space limits (e.g. by rolling over fixed-size files).

Custom logging

We could also provide an environment variable which contains custom logging properties. The entry point script would use it to rewrite the logging properties file in the base image, similar to FLINK_PROPERTIES for config options in flink-conf.yaml.

Another option is to expose an environment variable pointing to another location of logging properties, e.g. in a mounted volume.
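Both options could be combined in the entry point script, roughly as in the sketch below. The variable names FLINK_LOG4J_PROPERTIES (inline content) and FLINK_LOG4J_FILE (path to an alternative file) are hypothetical, as is the function configure_logging. The function prints the path of the properties file that should be used. How that path is then handed to the Flink process is left open here.

```shell
# Hypothetical sketch of custom logging configuration in an entry point script.
# FLINK_LOG4J_PROPERTIES and FLINK_LOG4J_FILE are made-up variable names.
configure_logging() {
  target="$1"   # logging properties file shipped in the base image
  if [ -n "${FLINK_LOG4J_PROPERTIES:-}" ]; then
    # Option 1: overwrite the shipped file with the inline properties
    printf '%s\n' "$FLINK_LOG4J_PROPERTIES" > "$target"
    echo "$target"
  elif [ -n "${FLINK_LOG4J_FILE:-}" ] && [ -f "$FLINK_LOG4J_FILE" ]; then
    # Option 2: use a file from another location, e.g. a mounted volume
    echo "$FLINK_LOG4J_FILE"
  else
    # Default: keep the shipped file untouched
    echo "$target"
  fi
}

# Demo with a temporary directory standing in for the Flink conf directory
workdir="$(mktemp -d)"
echo "log4j.rootLogger=INFO, console" > "$workdir/log4j-console.properties"
FLINK_LOG4J_PROPERTIES="log4j.rootLogger=DEBUG, console"
configure_logging "$workdir/log4j-console.properties"
cat "$workdir/log4j-console.properties"
```

In this sketch the inline variable takes precedence over the file path. That ordering is an arbitrary choice and would need to be settled in the actual design.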

Compatibility, Deprecation, and Migration Plan

In general, there should be no compatibility issues: the mainstream docker hub image is not expected to change much, because we mostly plan refactoring and docs extension.

As discussed on the mailing list, once the user documentation is good enough, we are going to remove the existing docker/compose/swarm/bluemix scripts in:

Implementation steps

  • Document the official docker hub image and examples of how to run it (as of now)
  • Document examples of how to extend the official docker hub image (as of now)
  • Remove flink-contrib/docker-flink
  • Extend entry point script and docs with job cluster mode and user job artefacts
  • Remove flink-container/docker

Tentative improvements:

  • Modify the log4j-console.properties to also output logs into the files for WebUI
  • Make logging properties configurable
  • Split stdout/stderr file container logs

Test Plan

Initially, testing will be mostly manual. Later, we can consider more extensive docker CI tests.

Future road map

We can still make more improvements to the user experience with docker in Flink:

  • Investigate how to support developers to build a custom image for a snapshot version, e.g. for a certain commit in Flink repo
  • Rewrite Flink options in flink-conf.yaml by environment variables when Flink process starts
  • Refactor the Flink bash scripts into one thin script which uses a Java bootstrap utility to prepare and configure the Flink process being started (similar to BashJavaUtils for memory setup)

For more details, see also this FLIP's discussion thread and the more detailed proposal doc.

Rejected Alternatives

None so far.