Helium - Turning Zeppelin into a data analytics application platform
Zeppelin provides a pluggable Interpreter architecture, which results in a wide variety of supported backend systems.
Each interpreter abstracts the complexity of the underlying computing framework (e.g. SparkInterpreter abstracts a Spark cluster) behind its own interface (e.g. SparkInterpreter provides Scala/SQL/Python as its interface).
There is also a powerful feature called the "Angular Display system" that enables users to create their own front-end interface that interacts with the interpreter.
And there is a "dependency loader" that enables users to load libraries from a remote repository.
Putting it all together, one could imagine a full application platform on top of Apache Zeppelin.
So what I propose is a framework, code-named Helium, that turns Zeppelin into a data analytics application platform by:
- Leveraging computing resources provided by Interpreters
- Generalizing the dependency loader
- Providing an SDK on top of the Angular Display system
- Adding a package repository
What is a Helium Application?
The idea is simple: instead of the user writing code and displaying the result on the notebook, the user runs packaged code and gets the result on the notebook.
Packaged code will be able to access Zeppelin-provided resources through the Resource Pool, and will use the Display System to display any output.
Helium Application = View + Algorithm + Access to Resources
How does an application display output?
Each paragraph has an output message, angular objects, and dynamic forms. A single paragraph can have multiple applications, and each of them has its own output message and angular objects, just like a paragraph's output.
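As a rough illustration of this layout, the sketch below models a paragraph that hosts several applications, each with its own output message and angular objects. All class and field names here (ParagraphSketch, AppOutputSketch) are hypothetical and are not taken from Zeppelin's code base; dynamic forms are left out to keep it short.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model: one output area per application, mirroring a paragraph's own output.
class AppOutputSketch {
    final String appName;
    String message = "";                                               // output message
    final Map<String, Object> angularObjects = new LinkedHashMap<>();  // name -> value

    AppOutputSketch(String appName) {
        this.appName = appName;
    }
}

// Hypothetical model of a paragraph that hosts several applications.
class ParagraphSketch {
    String message = "";
    final Map<String, Object> angularObjects = new LinkedHashMap<>();
    final List<AppOutputSketch> applications = new ArrayList<>();

    // Attach an application and hand back its private output area.
    AppOutputSketch attach(String appName) {
        AppOutputSketch output = new AppOutputSketch(appName);
        applications.add(output);
        return output;
    }
}

class ParagraphSketchDemo {
    public static void main(String[] args) {
        ParagraphSketch paragraph = new ParagraphSketch();
        paragraph.message = "word\tcount\nzeppelin\t3";

        AppOutputSketch cloud = paragraph.attach("wordcloud");
        cloud.message = "<div>rendered cloud</div>";
        cloud.angularObjects.put("maxWords", 50);

        System.out.println(paragraph.applications.size() + " application(s) attached");
    }
}
```

The point of the sketch is only that application state is kept separate from the paragraph's own state, so one application's angular objects never leak into another's.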
Resources are provided by an interpreter or by another Helium Application.
Every interpreter automatically provides the result of its last run.
Additionally, interpreters can provide their own resources (e.g. SparkContext).
Also, any user code in a Helium Application can provide any resource it wants.
A resource can be any Java object.
So it can be data, it can be an abstraction of computing (e.g. SparkContext), it can be anything.
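To make the idea of a pool of arbitrary Java objects more concrete, here is a minimal sketch of a named registry with a type-checked lookup. The class name ResourcePoolSketch, its methods, and the resource names "lastResult" and "rowCount" are assumptions made for illustration; the actual Resource Pool may be designed quite differently.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical resource pool: a named registry of arbitrary Java objects.
class ResourcePoolSketch {
    private final Map<String, Object> resources = new ConcurrentHashMap<>();

    // An interpreter or an application publishes a resource under a name.
    void put(String name, Object resource) {
        resources.put(name, resource);
    }

    // Look a resource up, returning it only if it has the type the caller expects.
    <T> Optional<T> get(String name, Class<T> type) {
        Object value = resources.get(name);
        return type.isInstance(value) ? Optional.of(type.cast(value)) : Optional.empty();
    }
}

class ResourcePoolDemo {
    public static void main(String[] args) {
        ResourcePoolSketch pool = new ResourcePoolSketch();
        pool.put("lastResult", "word\tcount\nzeppelin\t3"); // e.g. the result of the last run
        pool.put("rowCount", 1);                             // any Java object is allowed

        System.out.println(pool.get("lastResult", String.class).isPresent()); // prints true
        System.out.println(pool.get("rowCount", String.class).isPresent());   // prints false
    }
}
```

Returning an Optional keeps a missing resource and a resource of the wrong type from surfacing as a ClassCastException in application code.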
How a Helium Application runs
Applications are packaged into Jars and published to a Maven repository.
A spec file in the package registry is also required.
Then, depending on the Resources available in the resource pool, Zeppelin automatically suggests the Applications that the user can run.
When the user selects an Application, that application is downloaded and run in the interpreter process where the resource exists.
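The suggestion step described above could be sketched as a simple match between what each application requires and what the pool currently holds. Everything in this sketch is hypothetical: the AppSpecSketch and AppSuggesterSketch classes, the artifact coordinates, and the resource names are invented for illustration, and the real matching rule may be richer than a plain "all required resources are present" check.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Hypothetical in-memory form of a spec: application name, Maven artifact, required resources.
class AppSpecSketch {
    final String name;
    final String artifact;
    final Set<String> requiredResources;

    AppSpecSketch(String name, String artifact, Set<String> requiredResources) {
        this.name = name;
        this.artifact = artifact;
        this.requiredResources = requiredResources;
    }
}

class AppSuggesterSketch {
    // Suggest every application whose required resources are all available in the pool.
    static List<String> suggest(List<AppSpecSketch> specs, Set<String> availableResources) {
        List<String> runnable = new ArrayList<>();
        for (AppSpecSketch spec : specs) {
            if (availableResources.containsAll(spec.requiredResources)) {
                runnable.add(spec.name);
            }
        }
        return runnable;
    }

    public static void main(String[] args) {
        // Made-up specs and coordinates, for illustration only.
        List<AppSpecSketch> specs = Arrays.asList(
            new AppSpecSketch("wordcloud", "com.example:wordcloud:0.1",
                Collections.singleton("tableResult")),
            new AppSpecSketch("sparkmon", "com.example:sparkmon:0.1",
                Collections.singleton("SparkContext")));

        // Only a table result is in the pool, so only wordcloud is suggested.
        System.out.println(suggest(specs, Collections.singleton("tableResult"))); // prints [wordcloud]
    }
}
```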
A user Application extends the org.apache.zeppelin.helium.Application class in the SDK.
The SDK provides a development mode, so you can run an application inside Zeppelin without a full deployment.
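To give a feel for what writing an application might look like, the sketch below uses a stand-in base class rather than the real SDK class, whose methods are not described here. HeliumAppSketch, its run method, the RowCountApp example, and the "lastResult" resource name are all assumptions; the real org.apache.zeppelin.helium.Application may look quite different.

```java
import java.util.Collections;
import java.util.Map;

// Stand-in for the SDK base class; NOT the real org.apache.zeppelin.helium.Application.
abstract class HeliumAppSketch {
    // Algorithm + view in one step: read resources, return the text to display.
    abstract String run(Map<String, Object> resources);
}

// Hypothetical application: counts the data rows of a tab-separated table result.
class RowCountApp extends HeliumAppSketch {
    @Override
    String run(Map<String, Object> resources) {
        Object table = resources.get("lastResult");   // hypothetical resource name
        if (!(table instanceof String)) {
            return "No table result available";
        }
        String[] lines = ((String) table).split("\n");
        int rows = Math.max(0, lines.length - 1);      // the first line is the header
        return "Rows: " + rows;
    }

    // Running the class directly, outside any notebook, with hand-made resources.
    public static void main(String[] args) {
        Map<String, Object> resources =
            Collections.singletonMap("lastResult", "word\tcount\nzeppelin\t3\nspark\t5");
        System.out.println(new RowCountApp().run(resources)); // prints Rows: 2
    }
}
```

The main method here is only a plain-Java way to exercise the application locally; it is not meant to represent how the SDK's development mode is actually started.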
Here's a short video showing how the SDK works.
Package Repository and spec file
A Helium Application is packaged into a standard Jar file; therefore it can be distributed through a Maven repository.
The Package Repository is actually a collection of spec files. Each spec file provides the following information:
- Name of Application
- Artifact name in maven repository
- Resources this application requires
The package repository is going to be maintained as a separate Git repo with its own homepage (like spark-packages.org for Spark packages), so any user can add their applications there without PMC review, which scales well.
There will be a bot that automatically merges pull requests containing spec files into the master branch of the repo.
I propose the repository
There is a proof-of-concept implementation.
Actual implementation is in progress.
I have created some example applications based on the PoC implementation.
Git commit data - datasource
Wordcloud - visualize the paragraph's table result
SparkMon - application that accesses Spark
Here's a video of the three example applications.