Current state: In-progress
Discussion thread: here
- Uploading jars for java job submission
- Uploading hive scripts to be used by Sqoop
- Launching jobs after upload or launching jobs that are already put into HDFS
- Submitting MapReduce, Spark, and ResourceManager jobs/applications
- Moving and copying files
- Monitoring job statuses and acting on the reported status in some way
- Scripted access to JDBC resources such as Hive, SparkSQL, etc.
- Scripted access to HBase
1. Programming Model Definition(s)
Data Workers (ETL)
We need to define more formally what sort of programming model the Data Worker needs.
This likely starts with an understanding of what they need to do today inside the cluster and how it can be accomplished through Knox without SSH access.
Deliverables here would need to include samples targeted at the identified use cases, along with blogs that introduce and evangelize them.
We should also consider an SDK beyond the scripting capabilities that we currently use for samples and testing, etc.
Data Scientists - Tooling Vendors
Third-party Data Scientist tooling vendors may be interested in these capabilities as well.
Those vendors that write to Hadoop APIs today are forced to use internal Hadoop APIs such as UserGroupInformation.
These dependencies also drag in configuration dependencies and other jars.
The classes that are wrapped by the DSL in KnoxShell can potentially be improved to represent an SDK for tooling, custom applications, etc.
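To make the idea concrete, here is a minimal sketch of what such an SDK surface could look like. Everything here is hypothetical: `KnoxSdkClient`, `HdfsPut`, and their methods are illustrative names only, not part of the current KnoxShell API; the sketch just shows a fluent, UGI-free client style that the DSL-wrapped classes could be evolved toward.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of an SDK entry point; names are illustrative only.
public class KnoxSdkClient {
    private final String gatewayUrl;

    private KnoxSdkClient(String gatewayUrl) { this.gatewayUrl = gatewayUrl; }

    // Entry point mirroring the session-oriented style of the KnoxShell DSL.
    public static KnoxSdkClient connect(String gatewayUrl) {
        return new KnoxSdkClient(gatewayUrl);
    }

    // Fluent builder that assembles a request without exposing
    // UserGroupInformation or Hadoop configuration objects to the caller.
    public HdfsPut hdfsPut() { return new HdfsPut(gatewayUrl); }

    public static class HdfsPut {
        private final String gatewayUrl;
        private final Map<String, String> params = new HashMap<>();

        HdfsPut(String gatewayUrl) { this.gatewayUrl = gatewayUrl; }

        public HdfsPut file(String localPath) { params.put("file", localPath); return this; }
        public HdfsPut to(String hdfsPath)    { params.put("to", hdfsPath);    return this; }

        // In a real SDK this would execute the WebHDFS call through Knox;
        // here it only renders the request it would make, for illustration.
        public String describe() {
            return "PUT " + params.get("file") + " -> " + gatewayUrl
                 + "/webhdfs/v1" + params.get("to");
        }
    }
}
```

A caller would then write something like `KnoxSdkClient.connect(gatewayUrl).hdfsPut().file("/tmp/data.csv").to("/user/alice/data.csv")`, with no Hadoop configuration or internal classes on its classpath.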
2. KnoxShell Session/SSO
In order to schedule headless script invocations, we will need some notion of a stored credential.
There are two types of scenarios that we could pursue here, and both may make sense once we fully understand the required programming model and use cases.
- An active user may want to invoke a number of scripts without having to provide credentials to Knox with every API interaction (each script may call any number of APIs). The user should be able to log in the way a Kerberos user does with kinit. A knoxinit command, say, would prompt the user for a username and password and present them to a token service, which would return a token to be used in subsequent calls. This requires:
- file permission protection of the token
- a token API in Knox
- and a credential collector for the KnoxShell to acquire the token
- ability for the REST API client within KnoxShell to set the token as a bearer token in the HTTP header
- wire encryption to protect from capture and reuse
- expiration of the token
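The bearer-token requirement above can be sketched with the JDK's own HTTP client. This is not the KnoxShell REST client; the URL and token value are placeholders, and the sketch only shows the header shape the client would need to produce.

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class BearerHeaderExample {
    // Attach a previously acquired Knox session token as a Bearer credential.
    // In KnoxShell this would be done by its REST client after the credential
    // collector has loaded the token from the protected file.
    static HttpRequest withBearer(String url, String token) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("Authorization", "Bearer " + token)
                .GET()
                .build();
    }
}
```

Because the token rides in a standard `Authorization` header, wire encryption (TLS to the gateway) is what protects it from capture and replay, as noted above.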
- It may be required that scripts run as scheduled headless processes that need to be able to get a token as described in the first scenario. This is more like a keytab (knoxtab?): a file that can be used as credentials to acquire a session token. This requires:
- file permission protection of the knoxtab
- credentials encrypted by the knoxtab API to protect against compromising the credentials
- protection from being captured and reused
- expiration of the knoxtab, invalidation by password changes in the user store, or both
- invalidation of all knoxtabs by rolling the key used for encryption
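The last two points above can be sketched with standard JCE primitives. This is a minimal illustration, not a proposed knoxtab format: it assumes AES-GCM for the credential blob and shows that rolling the encryption key is enough to make every previously issued knoxtab undecryptable.

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

public class KnoxtabSketch {
    // Encrypt a credential blob under a gateway-held key. AES-GCM also
    // authenticates the ciphertext, so tampering is detected on decrypt.
    static byte[] encrypt(SecretKey key, byte[] iv, byte[] credentials) throws Exception {
        Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
        c.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
        return c.doFinal(credentials);
    }

    // Decryption fails with an exception if the key has been rolled.
    static byte[] decrypt(SecretKey key, byte[] iv, byte[] blob) throws Exception {
        Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
        c.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(128, iv));
        return c.doFinal(blob);
    }

    static SecretKey newKey() throws Exception {
        KeyGenerator kg = KeyGenerator.getInstance("AES");
        kg.init(256);
        return kg.generateKey();
    }
}
```

File permission protection (e.g. owner-only read, as with a keytab) would sit on top of this at the filesystem level.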
3. KnoxShell Interpreter for Zeppelin
4. KnoxShell Improvements
Very specific pain points and improvements are articulated in the following article.
It comes primarily from the context of a general Java SDK for Hadoop rather than from a scripting or command-line perspective.
The general use case is from within a multi-user, server-side application context.
Each of the points in that article should have a JIRA and be addressed.
5. KnoxShell Tests
Add a testing framework and some example tests for the existing shell functionality; as we add more functionality, it is hard to keep up with manual testing.
The proposal here is to use the Mini-* testing frameworks (MiniHDFS, MiniKDC, etc.) that are currently used only for Kerberos testing. We can use a similar setup to test the shell scripts.
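Independent of which Mini-* clusters back the tests, the harness shape is the same: run a KnoxShell script as a subprocess, capture its output, and assert on it. The sketch below shows only that shape; the real tests would point the command at a script run against a Mini-* backed gateway rather than the stand-in command used here.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.stream.Collectors;

public class ShellScriptHarness {
    // Run a command (in a real test, a KnoxShell script invocation) and
    // capture its combined stdout/stderr so assertions can be made on it.
    static String run(String... command) throws Exception {
        Process p = new ProcessBuilder(command).redirectErrorStream(true).start();
        try (BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
            String out = r.lines().collect(Collectors.joining("\n"));
            p.waitFor();
            return out;
        }
    }
}
```

Each example test then becomes a one-liner: invoke a script, assert on the captured output, and tear down the Mini-* cluster in the usual JUnit lifecycle hooks.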