Status
Motivation
It is desirable to provide better visibility into the distribution of CPU resources while executing user code. One of the most visually effective ways to do that is a Flame Graph. Flame Graphs make it easy to answer questions like:
Which methods are currently consuming CPU resources?
How does consumption by one method compare to the others?
Which series of calls on the stack led to executing a particular method?
Flame Graphs are constructed by sampling stack traces a number of times. Every method call is represented by a bar, where the length of the bar is proportional to the number of times it is present in the samples.
Flink already supports Operator's Flame Graphs (FLIP-165), which are drawn in the front end with the d3-flame-graph library. My research shows that Arthas and IntelliJ IDEA both use async-profiler to provide this functionality, and it is a more professional tool for this purpose. I have already added this feature at our company.
Most importantly, the Operator's Flame Graph has a fatal flaw: when the job parallelism exceeds 500, it causes the Chrome browser to hang and become completely unresponsive.
Public Interfaces
N/A
Proposed Changes
We propose providing a TaskManager-level (process-level) flame graph generated by async-profiler.
1) Support a TaskManager-level configurable script feature, similar to YARN. Users can configure multiple scripts.
taskmanager.execution.flame-graph.dir: /opt/flink/profiler/flamegraph
taskmanager.execution.flame-script.path: /opt/flink/bin/taskmanager-flame-graph.sh // it will encapsulate async profiler
taskmanager.execution.flame-script.opts: cpu 30
User-defined scripts are supported in the same way:
taskmanager.execution.xxx.path:
taskmanager.execution.xxx.opts:
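As a rough illustration of how the keys above could be grouped into script definitions, here is a minimal sketch. It assumes every script is declared by a `taskmanager.execution.<name>.path` key with an optional `taskmanager.execution.<name>.opts` companion. The class `ScriptConfigSketch`, its nested `Script` holder, and the `parse` method are hypothetical names used only for illustration; they are not part of Flink.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper (not a Flink API): groups script-related config keys by script name. */
final class ScriptConfigSketch {
    static final String PREFIX = "taskmanager.execution.";
    static final String PATH_SUFFIX = ".path";
    static final String OPTS_SUFFIX = ".opts";

    /** Hypothetical holder for one configured script. */
    static final class Script {
        final String name;
        final String path;
        final String opts;

        Script(String name, String path, String opts) {
            this.name = name;
            this.path = path;
            this.opts = opts;
        }
    }

    /** Collects every "<PREFIX><name>.path" entry and pairs it with "<PREFIX><name>.opts" if present. */
    static Map<String, Script> parse(Map<String, String> conf) {
        Map<String, Script> scripts = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : conf.entrySet()) {
            String key = entry.getKey();
            boolean isPathKey = key.startsWith(PREFIX)
                    && key.endsWith(PATH_SUFFIX)
                    && key.length() > PREFIX.length() + PATH_SUFFIX.length();
            if (!isPathKey) {
                continue; // skips unrelated keys, e.g. the ".dir" output directory
            }
            String name = key.substring(PREFIX.length(), key.length() - PATH_SUFFIX.length());
            // Assumption: a script without an ".opts" key is run with no extra arguments.
            String opts = conf.getOrDefault(PREFIX + name + OPTS_SUFFIX, "");
            scripts.put(name, new Script(name, entry.getValue(), opts));
        }
        return scripts;
    }
}
```

With the three flame-graph keys listed above, `parse` would return a single entry named `flame-script`; the `flame-graph.dir` key is ignored here because it describes the output location rather than a script.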
2) Add two interfaces:
Call the TaskManager to run the script.
List and display the flame graphs.
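The two interfaces could be backed by logic along the following lines. This is only a sketch under stated assumptions: the script is assumed to accept its configured options followed by the target PID and the output directory (this argument order is invented for illustration), and the flame graphs are assumed to be written as `.html` files in the configured directory. `FlameGraphEndpointsSketch` and its methods are hypothetical names, not existing Flink classes.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch (not a Flink API) of the logic behind the two proposed interfaces. */
final class FlameGraphEndpointsSketch {

    /** Builds the command line. Assumed argument order: opts, then PID, then output directory. */
    static List<String> buildCommand(String scriptPath, String opts, long pid, Path outputDir) {
        List<String> command = new ArrayList<>();
        command.add(scriptPath);
        String trimmed = opts.trim();
        if (!trimmed.isEmpty()) {
            command.addAll(Arrays.asList(trimmed.split("\\s+")));
        }
        command.add(Long.toString(pid));
        command.add(outputDir.toString());
        return command;
    }

    /** Interface 1: runs the script on the TaskManager host and returns its exit code. */
    static int runScript(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).inheritIO().start();
        return process.waitFor();
    }

    /** Interface 2: lists generated flame graph files (assumed to be *.html), sorted by name. */
    static List<String> listFlameGraphs(Path dir) {
        List<String> names = new ArrayList<>();
        if (!Files.isDirectory(dir)) {
            return names; // nothing has been generated yet
        }
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "*.html")) {
            for (Path file : stream) {
                names.add(file.getFileName().toString());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Collections.sort(names);
        return names;
    }
}
```

The PID is passed in explicitly so the sketch stays independent of how the TaskManager discovers its own process ID; on Java 9 and later, `ProcessHandle.current().pid()` is one way to obtain it. Displaying a graph then amounts to serving one of the listed files to the web UI.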
Compatibility, Deprecation, and Migration Plan
- What impact (if any) will there be on existing users?
- If we are changing behavior how will we phase out the older behavior?
- If we need special migration tools, describe them here.
- When will we remove the existing behavior?
Test Plan
Describe in a few sentences how the FLIP will be tested. We are mostly interested in system tests (since unit tests are specific to implementation details). How will we know that the implementation works as expected? How will we know nothing broke?
Rejected Alternatives
If there are alternative ways of accomplishing the same thing, what were they? The purpose of this section is to motivate why the design is the way it is and not some other way.