Interactive Scala Shell: Once Flink has the ability to run interactive programs, it should be easy to add an interactive Scala shell (REPL).


Distributed mutable state: Currently delta iterations use internally a hash
index to store the state of the iteration, and they invoke index merging
functionality. One idea would be to surface an operator (with care) to the
APIs that essentially allows mutable state manipulations. Another idea
would be to implement something along the lines of a parameter server and
make such functionality accessible to the APIs.

Enhance Flink's Monitoring Capabilities: Flink has a web interface to track the progress
of running jobs. In addition to that, it contains some basic system information.
We would like to enhance the monitoring capabilities to a much broader set of features, including
system performance monitoring (cpu, memory, disk, network, processes) and application
level monitoring (records processed per second, garbage collection statistics, input/output ratio,
data distribution information, iteration statistics).
There is also a need for reworking the internal APIs of the current webinterface. Changing the 
AJAX requests to a well-defined API (Rest, ...), integration with other systems such as Ambari will be
much easier. 

 

Integrate static code analysis fro UDFs

The issue is tracked here:  Unable to locate Jira server for this macro. It may be due to Application Link configuration.

Brief description:

Flink's Optimizer takes information that tells it for UDFs which fields of the input elements are accessed, modified, or frwarded/copied. This information frequently helps to reuse partitionings, sorts, etc. It may speed up programs significantly, as it can frequently eliminate sorts and shuffles, which are costly.

Right now, users can add lightweight annotations to UDFs to provide this information (such as adding @ConstandFields("0->3, 1, 2->1").

We worked with static code analysis of UDFs before, to determine this information automatically. This is an incredible feature, as it "magically" makes programs faster.

For record-at-a-time operations (Map, Reduce, FlatMap, Join, Cross), this works surprisingly well in many cases. We used the "Soot" toolkit for the static code analysis. Unfortunately, Soot is LGPL licensed and thus we did not include any of the code so far.

I propose to add this functionality to Flink, in the form of a drop-in addition, to work around the LGPL incompatibility with ALS 2.0. Users could simply download a special "flink-code-analysis.jar" and drop it into the "lib" folder to enable this functionality. We may even add a script to "tools" that downloads that library automatically into the lib folder. This should be legally fine, since we do not redistribute LGPL code and only dynamically link it (the incompatibility with ASL 2.0 is mainly in the patentability, if I remember correctly).

Prior work on this has been done by Aljoscha Krettek and Sebastian Kunert, which could provide a code base to start with.

Appendix

Hompage to Soot static analysis toolkit: http://www.sable.mcgill.ca/soot/

Papers on static analysis and for optimization: http://stratosphere.eu/assets/papers/EnablingOperatorReorderingSCA_12.pdf and http://stratosphere.eu/assets/papers/openingTheBlackBoxes_12.pdf

Quick introduction to the Optimizer: http://stratosphere.eu/assets/papers/2014-VLDBJ_Stratosphere_Overview.pdf (Section 6)

Optimizer for Iterations: http://stratosphere.eu/assets/papers/spinningFastIterativeDataFlows_12.pdf (Sections 4.3 and 5.3)

 


Domain-specific language for spatial data: Create spatial data types
(point, region, etc) and operations thereof

Integration into Apache BigTop

Integration with Apache Ambari

Pig frontend for Flink: An initial effort was here:
http://kth.diva-portal.org/smash/get/diva2:539046/FULLTEXT01.pdf

Cascading on Flink

Optimizing the integration with columnar file formats (Parquet, ORCFile)
and perhaps eventually pushing filters down to data scans.

Statistical operators to extract statistical information from a DataSet
(e.g., histograms of value distributions)

Eclipse plugin that includes functionality for execution plan debugging

Local execution of programs using Java Collections

Utility Library:  Unable to locate Jira server for this macro. It may be due to Application Link configuration.


 

  • No labels