Status
Current state: Under Discussion
Discussion thread: here (<- link to https://mail-archives.apache.org/mod_mbox/flink-dev/)
JIRA: FLINK-16478
Released: <Flink Version>
Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).
Motivation
Many Flink users have asked for a way to change the level of specific loggers dynamically at runtime, without a restart, so that issues and bugs that cannot easily be reproduced locally can be diagnosed with the help of additional logs.
Currently, the level can be changed only if the cluster runs in an environment where log4j2 is used and log4j.properties can be modified, for example on Kubernetes with the log4j.properties file backed by a ConfigMap. However, there are still cases where the log configuration file cannot be changed. A YARN deployment is a good example of this.
Public Interfaces
REST APIs
Change the logging level of a logger
URL: /logconfig
Verb: POST
Request Payload Example:
Set the level of the logger "org.apache.flink.runtime" to DEBUG.
{ "loggerName": "org.apache.flink.runtime", "level": "DEBUG" }
Or reset the level to its original one.
{ "loggerName": "org.apache.flink.runtime", "level": null }
Response Code: 200 OK
Response Payload: empty
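As an illustration of the request shape above, the following is a minimal client-side sketch using the JDK's `java.net.http` API. The class name `LogConfigRequests` and the base URL passed in by the caller are assumptions made for this example; only the `/logconfig` path, the POST verb, and the two payload fields come from this proposal.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper that builds requests for the proposed POST /logconfig endpoint. */
final class LogConfigRequests {

    /**
     * Builds the JSON payload. A null level asks for a reset to the original level.
     * No JSON escaping is done, which is only acceptable for plain logger names.
     */
    static String payload(String loggerName, String level) {
        String levelJson = (level == null) ? "null" : "\"" + level + "\"";
        return "{ \"loggerName\": \"" + loggerName + "\", \"level\": " + levelJson + " }";
    }

    /** Builds (but does not send) the POST request against the given REST base URL. */
    static HttpRequest build(String baseUrl, String loggerName, String level) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/logconfig"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(
                        payload(loggerName, level), StandardCharsets.UTF_8))
                .build();
    }
}
```

A caller would send the request with `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.discarding())` and treat a 200 status code as success.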
Proposed Changes
In general, the aforementioned public REST API will be introduced. Besides that, the following two RPC methods are also introduced to support this feature.
RPCs
ResourceManagerGateway
public interface ResourceManagerGateway {

    /**
     * Changes the level of the logger at runtime.
     *
     * <p>By providing a {@code null} LogLevel, the previously-changed level is reverted to its
     * original value.
     *
     * @param loggerName the name of the logger
     * @param level the log level
     * @return future which is completed exceptionally if the operation fails
     */
    CompletableFuture<Void> changeLogLevel(String loggerName, @Nullable LogLevel level);
}
This RPC method will be called by the handler that serves the /logconfig request. When it is called, it does two things:
- It changes the logging level on this job manager.
- It broadcasts this change to all the task managers that are currently registered on this job manager, by calling the RPC method TaskExecutorGateway.changeLogLevel as described below.
TaskExecutorGateway
public interface TaskExecutorGateway {

    /**
     * Changes the level of the logger at runtime.
     *
     * <p>By providing a {@code null} LogLevel, the previously-changed level is reverted to its
     * original value.
     *
     * @param loggerName the name of the logger
     * @param level the log level
     * @return future which is completed exceptionally if the operation fails
     */
    CompletableFuture<Void> changeLogLevel(String loggerName, @Nullable LogLevel level);
}
When this method is called by a ResourceManager, it does only one thing: it changes the logging level on this task manager.
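The two-step behaviour of the ResourceManager call (local change, then fan-out to the registered task managers) could be sketched roughly as follows. This is not the proposed implementation: `LogLevelTarget` and `LogLevelBroadcast` are hypothetical names, the level is simplified to a `String`, and the real RPC plumbing is omitted. The sketch only shows how the individual futures might be combined so that the overall future completes exceptionally if any single change fails.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.concurrent.CompletableFuture;

/** Hypothetical stand-in for anything that exposes a changeLogLevel RPC. */
interface LogLevelTarget {
    CompletableFuture<Void> changeLogLevel(String loggerName, String level);
}

/** Hypothetical helper that applies a change locally and on all task executors. */
final class LogLevelBroadcast {

    static CompletableFuture<Void> changeLogLevel(
            LogLevelTarget local,
            Collection<LogLevelTarget> taskExecutors,
            String loggerName,
            String level) {
        List<CompletableFuture<Void>> futures = new ArrayList<>();
        // Step 1: change the level on the local process.
        futures.add(local.changeLogLevel(loggerName, level));
        // Step 2: broadcast the same change to every registered task executor.
        for (LogLevelTarget taskExecutor : taskExecutors) {
            futures.add(taskExecutor.changeLogLevel(loggerName, level));
        }
        // Completes when all changes are done; fails if any of them failed.
        return CompletableFuture.allOf(futures.toArray(new CompletableFuture<?>[0]));
    }
}
```

Combining with `CompletableFuture.allOf` reports a failure to the REST handler without stopping the changes on the other task managers. Whether a partial failure should be surfaced this way is a design decision that this sketch does not settle.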
Logging Abstraction
The remaining problem is how the logging level is actually changed on a job manager or task manager. Although log4j2 is the default logging implementation included in the distribution, other logging frameworks, including log4j1 and logback, are also recommended by the documentation "How to use logging". We therefore need an abstraction that does not depend directly on a particular logging implementation. The following interface and classes are introduced so that they cover the currently supported frameworks and can also be extended to any other logging framework, including user-defined ones.
As illustrated by the class diagram above, the interface LoggingProvider is introduced, whose setLogLevel() method varies among the different logging implementations. The isEnabled() method is invoked during initialization. All LoggingProvider implementation classes that are registered through the Java service loading facility are tested, and only the first enabled one is used. Any subclass of Slf4jLoggingProvider is considered enabled as long as the factory name returned from StaticLoggerBinder.getLoggerFactoryClassStr() is the same as that of its logging implementation factory. If none is enabled, NoOpLoggingProvider is used, and warning messages are printed both during initialization and whenever the setLogLevel() method is called.
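The selection rule described above (first enabled provider wins, otherwise fall back to a no-op provider that warns) might look roughly like the sketch below. The method signatures, the `String` level parameter, the `LoggingProviders` helper class, and the use of `System.err` for warnings are assumptions made for illustration; only the names LoggingProvider, NoOpLoggingProvider, isEnabled() and setLogLevel() come from the text.

```java
import java.util.ServiceLoader;

/** Sketch of the provider abstraction; the signatures are assumed, not specified. */
interface LoggingProvider {
    boolean isEnabled();

    void setLogLevel(String loggerName, String level);
}

/** Fallback used when no registered provider reports itself as enabled. */
final class NoOpLoggingProvider implements LoggingProvider {
    @Override
    public boolean isEnabled() {
        return true;
    }

    @Override
    public void setLogLevel(String loggerName, String level) {
        System.err.println("WARN: no logging provider enabled, ignoring change for " + loggerName);
    }
}

/** Hypothetical helper that picks the provider to use at initialization. */
final class LoggingProviders {

    /** Returns the first enabled candidate, or a NoOpLoggingProvider if there is none. */
    static LoggingProvider select(Iterable<LoggingProvider> candidates) {
        for (LoggingProvider candidate : candidates) {
            if (candidate.isEnabled()) {
                return candidate;
            }
        }
        System.err.println("WARN: no enabled logging provider found, log levels cannot be changed");
        return new NoOpLoggingProvider();
    }

    /** Discovers candidates through the Java service loading facility. */
    static LoggingProvider fromServiceLoader() {
        return select(ServiceLoader.load(LoggingProvider.class));
    }
}
```

Because `ServiceLoader` is itself an `Iterable`, the same `select` method can be exercised in unit tests with a plain list of fake providers.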
Supported Log Levels and Their Mappings
Here is the table of the corresponding native levels of all the supported logging implementations.
Level / Implementation | Log4j | Logback | Java Logging |
---|---|---|---|
TRACE | org.apache.logging.log4j.Level.TRACE | ch.qos.logback.classic.Level.TRACE | java.util.logging.Level.FINEST |
DEBUG | org.apache.logging.log4j.Level.DEBUG | ch.qos.logback.classic.Level.DEBUG | java.util.logging.Level.FINE |
INFO | org.apache.logging.log4j.Level.INFO | ch.qos.logback.classic.Level.INFO | java.util.logging.Level.INFO |
WARN | org.apache.logging.log4j.Level.WARN | ch.qos.logback.classic.Level.WARN | java.util.logging.Level.WARNING |
ERROR | org.apache.logging.log4j.Level.ERROR | ch.qos.logback.classic.Level.ERROR | java.util.logging.Level.SEVERE |
SEVERE | org.apache.logging.log4j.Level.ERROR | ch.qos.logback.classic.Level.ERROR | java.util.logging.Level.SEVERE |
OFF | org.apache.logging.log4j.Level.OFF | ch.qos.logback.classic.Level.OFF | java.util.logging.Level.OFF |
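To make the Java Logging column concrete, here is a small sketch of how such a mapping and the "null reverts to the original level" rule could be implemented for `java.util.logging`. The enum `LogLevel` and the class `JulLoggingProvider` are hypothetical stand-ins; the proposal does not define their shape, and the revert bookkeeping shown here is just one possible approach.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical level enum mirroring the rows of the mapping table. */
enum LogLevel {
    TRACE(Level.FINEST), DEBUG(Level.FINE), INFO(Level.INFO), WARN(Level.WARNING),
    ERROR(Level.SEVERE), SEVERE(Level.SEVERE), OFF(Level.OFF);

    final Level julLevel;

    LogLevel(Level julLevel) {
        this.julLevel = julLevel;
    }
}

/** Hypothetical provider for java.util.logging. */
final class JulLoggingProvider {
    // Strong references: the JDK's LogManager only holds loggers weakly.
    private final Map<String, Logger> loggers = new ConcurrentHashMap<>();
    // Level each logger had before its first change (empty = inherited from parent).
    private final Map<String, Optional<Level>> originals = new ConcurrentHashMap<>();

    void setLogLevel(String loggerName, LogLevel level) {
        Logger logger = loggers.computeIfAbsent(loggerName, Logger::getLogger);
        if (level == null) {
            // Revert to the level recorded before the first change, if there was one.
            Optional<Level> original = originals.remove(loggerName);
            if (original != null) {
                logger.setLevel(original.orElse(null));
            }
            return;
        }
        // Remember the original level only once, on the first change.
        originals.putIfAbsent(loggerName, Optional.ofNullable(logger.getLevel()));
        logger.setLevel(level.julLevel);
    }
}
```

Recording the original level only on the first change means that several consecutive changes followed by one reset return the logger to its configured level, not to the previous override.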
Limitations
As an MVP (Minimum Viable Product), this improvement does not support features such as restricting the scope of a change (e.g. job manager only or task managers only) or applying a change for a limited time with a timer. The log level of the target logger is changed cluster-wide, on the job manager leader and on all the currently registered task managers. To unset a previously changed level, pass a null log-level argument.
Another limitation is that, for simplicity, changing a log level is not a persistent operation. If a task manager joins or rejoins, or a standby job manager becomes the leader, the log levels on these processes remain unchanged and will only be changed by a subsequent call.
Compatibility, Deprecation, and Migration Plan
This change does not affect compatibility and requires no deprecation or migration.
Test Plan
Besides unit tests, system tests will be included to cover the cases where different logging providers are used or no logging provider is enabled.
Rejected Alternatives
If there are alternative ways of accomplishing the same thing, what were they? The purpose of this section is to motivate why the design is the way it is and not some other way.