Status

Current state: In-progress

Discussion thread: here

JIRA

Motivation

As more deployments move to the cloud, I believe we will see more demand from users to get their jobs done without needing to SSH into the cluster.
KnoxShell provides a fairly powerful scripting and programming model that does not require a Hadoop install, needs little if any configuration, and has little dependence on client-side jars.
This could easily become a popular programming model for Hadoop in the cloud.
Much of what the Data Worker must do (ETL, data scrubbing, transformation, etc.) requires the user to SSH into the cluster in most of today's Apache Hadoop deployments.
The following tasks will be important for such workers, and Knox should enable them without SSH, extensive configuration, or jar dependencies:
  • Uploading jars for Java job submission
  • Uploading Hive scripts to be used by Sqoop
  • Launching jobs after upload or launching jobs that are already in HDFS
  • Submitting MapReduce, Spark, and resource manager jobs/applications
  • Moving and copying files
  • Monitoring job statuses and acting on the reported status in some way
  • Scripted access to JDBC resources such as Hive, SparkSQL, etc.
  • Scripted access to HBase

Improvements

1. Programming Model Definition/s

Data Workers (ETL)

We need to more formally define what sort of programming model is needed for the Data Worker.

This likely starts with an understanding of what they need to do today inside of the cluster and how it can be accomplished through Knox without SSH access.

Deliverables here would need to include samples targeted at the identified use cases, along with blogs that introduce and evangelize them.

We should also consider an SDK beyond the scripting capabilities that we currently use for samples and testing, etc.

Data Scientists - Tooling Vendors

Third-party Data Scientist tooling vendors may be interested in these capabilities as well.

Those vendors that write to Hadoop APIs today are relegated to internal Hadoop APIs like UserGroupInformation, etc.

These dependencies also drag in configuration dependencies and other jars.

The classes that are wrapped by the DSL in KnoxShell can potentially be improved to represent an SDK for tooling, custom applications, etc.

2. KnoxShell Session/SSO

In order to be able to schedule headless script invocations, we will need some notion of a stored credential.

There are a couple of different scenarios that we could pursue here, and both may make sense once we fully understand the required programming model and use cases.

  1. An active user may want to invoke a number of scripts without having to provide credentials to Knox with every API interaction (each script may call any number of APIs). The user should be able to log in much as is done for Kerberos with kinit. A command, say knoxinit, would prompt the user for a username and password and provide these to a token service, which would return a token to be used in subsequent calls. This would require:
    1. file permission protection of the token
    2. a token API in Knox
    3. a credential collector for KnoxShell to acquire the token
    4. the ability for the REST API client within KnoxShell to set the token as a bearer token in the HTTP Authorization header
    5. wire encryption to protect against capture and reuse
    6. expiration of the token
  2. Scripts may need to run as scheduled headless processes that must be able to get a token as described in #1. This calls for something more like a keytab (knoxtab?) file that can be used as credentials to acquire a session token. This would require:
    1. file permission protection of the knoxtab
    2. credentials encrypted by the knoxtab API to protect against compromising the credentials
    3. protection against capture and reuse
    4. expiration, invalidation by password changes in the user store, or both?
    5. invalidation of all knoxtabs by rolling the key used for encryption
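
The client-side pieces of scenario 1 (file permission protection, expiration, and sending the token as a bearer token) can be sketched as follows. This is only an illustration under stated assumptions, not the KnoxShell implementation: it is written in Python rather than the KnoxShell Groovy DSL, assumes a POSIX file system, and the function names, the JSON layout of the cached token, and the gateway URL in the usage note are all hypothetical. Obtaining the token from a Knox token service is out of scope here.

```python
import json
import os
import stat
import time
import urllib.request


def save_token(path, token, expires_at):
    """Cache a token so that only the owning user can read it (mode 0600)."""
    # Create with 0600 so a new file is never readable by group or others.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    # Tighten a pre-existing file too, before any secret is written.
    os.fchmod(fd, 0o600)
    with os.fdopen(fd, "w") as f:
        json.dump({"token": token, "expires_at": expires_at}, f)


def load_token(path, now=None):
    """Return the cached token, or None if it is missing, unsafe, or expired."""
    now = time.time() if now is None else now
    try:
        mode = stat.S_IMODE(os.stat(path).st_mode)
        if mode & 0o077:
            # Readable or writable by group/others: refuse to trust it.
            return None
        with open(path) as f:
            data = json.load(f)
    except (OSError, ValueError):
        return None
    if not isinstance(data, dict) or data.get("expires_at", 0) <= now:
        return None
    return data.get("token")


def authorized_request(url, token):
    """Build a request that carries the token as an HTTP bearer token."""
    req = urllib.request.Request(url)
    req.add_header("Authorization", "Bearer " + token)
    return req
```

In this sketch, a knoxinit-style command would call `save_token` once after a successful login, and each script would call `load_token` and fall back to prompting for credentials when it returns `None`. The request built by `authorized_request` would then be sent over HTTPS to the gateway (for example a URL of the form `https://gateway.example.com:8443/gateway/sandbox/webhdfs/v1/tmp?op=LISTSTATUS`, where host and topology name are placeholders), so that the token is protected on the wire.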

3. KnoxShell Interpreter for Zeppelin

An interesting idea is to leverage KnoxShell scripting within a Zeppelin notebook.
We may be able to combine the session token work described in Section 2 above with KnoxSSO to make this work based on the authenticated user in some way.
This might require another credential collector in order to get the credential from the interpreter request.

%knoxshell
Hdfs.rm( session ).file( "/path/to/dir" ).recursive().now().statusCode

4. KnoxShell Improvements

Very specific pain points and improvements are articulated in the following article.

It comes primarily from the context of a general Java SDK for Hadoop rather than from a scripting or command-line perspective.

The general use case is a multi-user, server-side application context.

Each of the points in the article should have a JIRA and be addressed.


5. KnoxShell Tests

Add a testing framework and some example tests for the existing shell functionality. As we add more functionality, it becomes hard to keep up with manual testing.

The proposal here is to use the Mini-* (MiniHDFS, MiniKDC, etc.) testing framework that is currently used only for Kerberos testing. We can use a similar setup to test the shell scripts.

 

