Apache Knox Gateway - Usage Examples

This guide provides detailed examples of some basic interactions with Hadoop via the Apache Knox Gateway.

The first two examples submit a Java MapReduce job and an Oozie workflow using the
KnoxShell DSL.

  • Example #1: WebHDFS & Templeton/WebHCat via KnoxShell DSL
  • Example #2: WebHDFS & Oozie via KnoxShell DSL

The last two examples submit the same job and workflow, but do so using only
the cURL command line HTTP client.

  • Example #3: WebHDFS & Templeton/WebHCat via cURL
  • Example #4: WebHDFS & Oozie via cURL

Assumptions

This document assumes a few things about your environment in order to simplify the examples.

  1. The JVM is executable as simply java.
  2. The Apache Knox Gateway is installed and functional.
  3. The example commands are executed within the context of the GATEWAY_HOME current directory. The GATEWAY_HOME directory is the directory within the Apache Knox Gateway installation that contains the README file and the bin, conf and deployments directories.
  4. A few examples optionally require the use of commands from a standard Groovy installation. These examples are optional, but to try them you will need Groovy installed (see the command sketch following this list).
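
If you have Groovy installed, the samples can likely also be run with the standard Groovy interpreter by placing the gateway's shell.jar on the classpath. This is a sketch of that command, assuming it is run from the GATEWAY_HOME directory:

groovy -classpath bin/shell.jar samples/ExampleSubmitJob.groovy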

Customization

These examples may need to be tailored to your execution environment. In particular, hostnames and ports may need to be changed to match your cluster. There are two example files in the distribution that may need to be customized. Take a moment to review these files. All of the values that may need to be customized can be found together at the top of each file.

  • samples/ExampleSubmitJob.groovy
  • samples/ExampleSubmitWorkflow.groovy
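
For example, the customizable values at the top of samples/ExampleSubmitWorkflow.groovy (shown in full in Example #2 below) look like the following; edit the gateway URL, credentials and the JobTracker/NameNode addresses to match your cluster:

gateway = "https://localhost:8443/gateway/sample"
jobTracker = "sandbox:50300";
nameNode = "sandbox:8020";
username = "bob"
password = "bob-password"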

If you are using the Sandbox VM for your Hadoop cluster, you may want to review these configuration tips.

Example #1: WebHDFS & Templeton/WebHCat via KnoxShell DSL

This example will submit the familiar WordCount Java MapReduce job to the Hadoop cluster via the gateway using the KnoxShell DSL. There are several ways to do this depending upon your preference.

You can use the "embedded" Groovy interpreter provided with the distribution.

java -jar bin/shell.jar samples/ExampleSubmitJob.groovy

You can manually type the KnoxShell DSL script into the "embedded" Groovy interpreter provided with the distribution.

java -jar bin/shell.jar

Each line from the file below will need to be typed or copied into the interactive shell.

samples/ExampleSubmitJob.groovy
import com.jayway.jsonpath.JsonPath
import org.apache.hadoop.gateway.shell.Hadoop
import org.apache.hadoop.gateway.shell.hdfs.Hdfs
import org.apache.hadoop.gateway.shell.job.Job

import static java.util.concurrent.TimeUnit.SECONDS

gateway = "https://localhost:8443/gateway/sample"
username = "bob"
password = "bob-password"
dataFile = "LICENSE"
jarFile = "samples/hadoop-examples.jar"

hadoop = Hadoop.login( gateway, username, password )

println "Delete /tmp/test " + Hdfs.rm(hadoop).file( "/tmp/test" ).recursive().now().statusCode
println "Create /tmp/test " + Hdfs.mkdir(hadoop).dir( "/tmp/test").now().statusCode

putData = Hdfs.put(hadoop).file( dataFile ).to( "/tmp/test/input/FILE" ).later() {
    println "Put /tmp/test/input/FILE " + it.statusCode }
putJar = Hdfs.put(hadoop).file( jarFile ).to( "/tmp/test/hadoop-examples.jar" ).later() {
     println "Put /tmp/test/hadoop-examples.jar " + it.statusCode }
hadoop.waitFor( putData, putJar )

jobId = Job.submitJava(hadoop) \
    .jar( "/tmp/test/hadoop-examples.jar" ) \
    .app( "wordcount" ) \
    .input( "/tmp/test/input" ) \
    .output( "/tmp/test/output" ) \
    .now().jobId
println "Submitted job " + jobId

done = false
count = 0
while( !done && count++ < 60 ) {
    sleep( 1000 )
    json = Job.queryStatus(hadoop).jobId(jobId).now().string
    done = JsonPath.read( json, "\$.status.jobComplete" )
}
println "Done " + done

println "Shutdown " + hadoop.shutdown( 10, SECONDS )

exit
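
Before calling shutdown you can also inspect the WordCount output from the same session using the DSL's HDFS commands. The lines below are a minimal sketch and not part of the shipped sample; they assume the Hdfs.ls and Hdfs.get commands are available in your version of the DSL and that the job wrote the conventional part-r-00000 output file:

// List the output directory, then print the first result file as text.
println Hdfs.ls( hadoop ).dir( "/tmp/test/output" ).now().string
println Hdfs.get( hadoop ).from( "/tmp/test/output/part-r-00000" ).now().string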

Example #2: WebHDFS & Oozie via KnoxShell DSL

This example will also submit the familiar WordCount Java MapReduce job to the Hadoop cluster via the gateway using the KnoxShell DSL. However, in this case the job will be submitted via an Oozie workflow. There are several ways to do this depending upon your preference.

You can use the "embedded" Groovy interpreter provided with the distribution.

java -jar bin/shell.jar samples/ExampleSubmitWorkflow.groovy

You can manually type the KnoxShell DSL script into the "embedded" Groovy interpreter provided with the distribution.

java -jar bin/shell.jar

Each line from the file below will need to be typed or copied into the interactive shell.

samples/ExampleSubmitWorkflow.groovy
import com.jayway.jsonpath.JsonPath
import org.apache.hadoop.gateway.shell.Hadoop
import org.apache.hadoop.gateway.shell.hdfs.Hdfs
import org.apache.hadoop.gateway.shell.workflow.Workflow

import static java.util.concurrent.TimeUnit.SECONDS

gateway = "https://localhost:8443/gateway/sample"
jobTracker = "sandbox:50300";
nameNode = "sandbox:8020";
username = "bob"
password = "bob-password"
inputFile = "LICENSE"
jarFile = "samples/hadoop-examples.jar"

definition = """\
<workflow-app xmlns="uri:oozie:workflow:0.2" name="wordcount-workflow">
    <start to="root-node"/>
    <action name="root-node">
        <java>
            <job-tracker>$jobTracker</job-tracker>
            <name-node>hdfs://$nameNode</name-node>
            <main-class>org.apache.hadoop.examples.WordCount</main-class>
            <arg>/tmp/test/input</arg>
            <arg>/tmp/test/output</arg>
        </java>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Java failed</message>
    </kill>
    <end name="end"/>
</workflow-app>
"""

configuration = """\
<configuration>
    <property>
        <name>user.name</name>
        <value>$username</value>
    </property>
    <property>
        <name>oozie.wf.application.path</name>
        <value>hdfs://$nameNode/tmp/test</value>
    </property>
</configuration>
"""

hadoop = Hadoop.login( gateway, username, password )

println "Delete /tmp/test " + Hdfs.rm(hadoop).file( "/tmp/test" ).recursive().now().statusCode
println "Mkdir /tmp/test " + Hdfs.mkdir(hadoop).dir( "/tmp/test").now().statusCode
putWorkflow = Hdfs.put(hadoop).text( definition ).to( "/tmp/test/workflow.xml" ).later() {
    println "Put /tmp/test/workflow.xml " + it.statusCode }
putData = Hdfs.put(hadoop).file( inputFile ).to( "/tmp/test/input/FILE" ).later() {
    println "Put /tmp/test/input/FILE " + it.statusCode }
putJar = Hdfs.put(hadoop).file( jarFile ).to( "/tmp/test/lib/hadoop-examples.jar" ).later() {
    println "Put /tmp/test/lib/hadoop-examples.jar " + it.statusCode }
hadoop.waitFor( putWorkflow, putData, putJar )

jobId = Workflow.submit(hadoop).text( configuration ).now().jobId
println "Submitted job " + jobId

status = "UNKNOWN";
count = 0;
while( status != "SUCCEEDED" && count++ < 60 ) {
  sleep( 1000 )
  json = Workflow.status(hadoop).jobId( jobId ).now().string
  status = JsonPath.read( json, "\$.status" )
}
println "Job status " + status;

println "Shutdown " + hadoop.shutdown( 10, SECONDS )

exit

Example #3: WebHDFS & Templeton/WebHCat via cURL

The example below illustrates the sequence of curl commands that could be used to run a "word count" MapReduce job. It utilizes the hadoop-examples.jar from a Hadoop install for running a simple word count job; a copy of that jar has been included in the samples directory for convenience. Take care to follow the instructions below for steps 3/4 and 5/6, where the Location header returned by the call to the NameNode is copied for use with the call to the DataNode that follows it. These replacement values are identified with { } markup.
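
If you prefer not to copy the Location headers by hand, each CREATE/upload pair can be scripted. The lines below are a sketch for steps 3 and 4 only and are not part of the shipped samples; they assume a Bourne-style shell with grep, awk and tr available, and the LOCATION variable name is purely illustrative:

# Capture the Location header from the CREATE call (step 3) and reuse it for the upload (step 4).
LOCATION=$(curl -s -i -k -u bob:bob-password -X PUT \
    'https://localhost:8443/gateway/sample/namenode/api/v1/tmp/test/hadoop-examples.jar?op=CREATE' \
    | grep -i '^Location:' | awk '{print $2}' | tr -d '\r')
curl -i -k -u bob:bob-password -T samples/hadoop-examples.jar -X PUT "$LOCATION"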

# 0. Optionally cleanup the test directory in case a previous example was run without cleaning up.
curl -i -k -u bob:bob-password -X DELETE \
    'https://localhost:8443/gateway/sample/namenode/api/v1/tmp/test?op=DELETE&recursive=true'

# 1. Create a test input directory /tmp/test/input
curl -i -k -u bob:bob-password -X PUT \
    'https://localhost:8443/gateway/sample/namenode/api/v1/tmp/test/input?op=MKDIRS'

# 2. Create a test output directory /tmp/test/output
curl -i -k -u bob:bob-password -X PUT \
    'https://localhost:8443/gateway/sample/namenode/api/v1/tmp/test/output?op=MKDIRS'

# 3. Create the inode for hadoop-examples.jar in /tmp/test
curl -i -k -u bob:bob-password -X PUT \
    'https://localhost:8443/gateway/sample/namenode/api/v1/tmp/test/hadoop-examples.jar?op=CREATE'

# 4. Upload hadoop-examples.jar to /tmp/test.  Use a hadoop-examples.jar from a Hadoop install.
curl -i -k -u bob:bob-password -T samples/hadoop-examples.jar -X PUT '{Value of Location header from command above}'

# 5. Create the inode for a sample file README in /tmp/test/input
curl -i -k -u bob:bob-password -X PUT \
    'https://localhost:8443/gateway/sample/namenode/api/v1/tmp/test/input/README?op=CREATE'

# 6. Upload the README file to /tmp/test/input.  Use the README found in {GATEWAY_HOME}.
curl -i -k -u bob:bob-password -T README -X PUT '{Value of Location header from command above}'

# 7. Submit the word count job via WebHCat/Templeton.
# Take note of the Job ID in the JSON response as this will be used in the next step.
curl -v -i -k -u bob:bob-password -X POST \
    -d jar=/tmp/test/hadoop-examples.jar -d class=wordcount \
    -d arg=/tmp/test/input -d arg=/tmp/test/output \
    'https://localhost:8443/gateway/sample/templeton/api/v1/mapreduce/jar'

# 8. Look at the status of the job
curl -i -k -u bob:bob-password -X GET \
    'https://localhost:8443/gateway/sample/templeton/api/v1/queue/{Job ID returned in JSON body from previous step}'

# 9. Look at the status of the job queue
curl -i -k -u bob:bob-password -X GET \
    'https://localhost:8443/gateway/sample/templeton/api/v1/queue'

# 10. List the contents of the output directory /tmp/test/output
curl -i -k -u bob:bob-password -X GET \
    'https://localhost:8443/gateway/sample/namenode/api/v1/tmp/test/output?op=LISTSTATUS'

# 11. Optionally cleanup the test directory
curl -i -k -u bob:bob-password -X DELETE \
    'https://localhost:8443/gateway/sample/namenode/api/v1/tmp/test?op=DELETE&recursive=true'

Example #4: WebHDFS & Oozie via cURL

The example below illustrates the sequence of curl commands that could be used to run a "word count" MapReduce job via an Oozie workflow. It utilizes the hadoop-examples.jar from a Hadoop install for running a simple word count job; a copy of that jar has been included in the samples directory for convenience. Take care to follow the instructions below where replacement values are required; these replacement values are identified with { } markup. As in Example #3, the Location header returned by each CREATE call (steps 1, 3 and 5) is used by the upload command that follows it.

# 0. Optionally cleanup the test directory in case a previous example was run without cleaning up.
curl -i -k -u bob:bob-password -X DELETE \
    'https://localhost:8443/gateway/sample/namenode/api/v1/tmp/test?op=DELETE&recursive=true'

# 1. Create the inode for the workflow definition file in /tmp/test
curl -i -k -u bob:bob-password -X PUT \
    'https://localhost:8443/gateway/sample/namenode/api/v1/tmp/test/workflow.xml?op=CREATE'

# 2. Upload the workflow definition file.  This file can be found in {GATEWAY_HOME}/templates
curl -i -k -u bob:bob-password -T templates/workflow-definition.xml -X PUT \
    '{Value of Location header from command above}'

# 3. Create the inode for hadoop-examples.jar in /tmp/test/lib
curl -i -k -u bob:bob-password -X PUT \
    'https://localhost:8443/gateway/sample/namenode/api/v1/tmp/test/lib/hadoop-examples.jar?op=CREATE'

# 4. Upload hadoop-examples.jar to /tmp/test/lib.  Use a hadoop-examples.jar from a Hadoop install.
curl -i -k -u bob:bob-password -T samples/hadoop-examples.jar -X PUT \
    '{Value of Location header from command above}'

# 5. Create the inode for a sample input file README in /tmp/test/input.
curl -i -k -u bob:bob-password -X PUT \
    'https://localhost:8443/gateway/sample/namenode/api/v1/tmp/test/input/README?op=CREATE'

# 6. Upload the README file to /tmp/test/input.  The command below uses the README found in {GATEWAY_HOME}.
curl -i -k -u bob:bob-password -T README -X PUT \
    '{Value of Location header from command above}'

# 7. Create the job configuration file by replacing the {NameNode host:port} and {JobTracker host:port}
# in the command below to values that match your Hadoop configuration.
# NOTE: The hostnames must be resolvable by the Oozie daemon.  The ports are the RPC ports not the HTTP ports.
# For example {NameNode host:port} might be sandbox:8020 and {JobTracker host:port} sandbox:50300
# The source workflow-configuration.xml file can be found in {GATEWAY_HOME}/templates
# Alternatively, this file can be copied and edited manually for environments without the sed utility.
sed -e s/REPLACE.NAMENODE.RPCHOSTPORT/{NameNode host:port}/ \
    -e s/REPLACE.JOBTRACKER.RPCHOSTPORT/{JobTracker host:port}/ \
    <templates/workflow-configuration.xml >workflow-configuration.xml

# 8. Submit the job via Oozie
# Take note of the Job ID in the JSON response as this will be used in the next step.
curl -i -k -u bob:bob-password -T workflow-configuration.xml -H Content-Type:application/xml -X POST \
    'https://localhost:8443/gateway/sample/oozie/api/v1/jobs?action=start'

# 9. Query the job status via Oozie.
curl -i -k -u bob:bob-password -X GET \
    'https://localhost:8443/gateway/sample/oozie/api/v1/job/{Job ID returned in JSON body from previous step}'

# 10. List the contents of the output directory /tmp/test/output
curl -i -k -u bob:bob-password -X GET \
    'https://localhost:8443/gateway/sample/namenode/api/v1/tmp/test/output?op=LISTSTATUS'

# 11. Optionally cleanup the test directory
curl -i -k -u bob:bob-password -X DELETE \
    'https://localhost:8443/gateway/sample/namenode/api/v1/tmp/test?op=DELETE&recursive=true'
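
If you want to script the wait for completion rather than repeating step 9 by hand, a small polling loop can be used. The sketch below is not part of the shipped samples; it assumes a Bourne-style shell, assumes Oozie emits the status field without whitespace around the colon, and the {Job ID} placeholder must be replaced just as in step 9:

# Poll the Oozie job status once per second for up to 60 seconds, stopping when it reports SUCCEEDED.
for i in $(seq 1 60); do
    if curl -s -k -u bob:bob-password \
        'https://localhost:8443/gateway/sample/oozie/api/v1/job/{Job ID returned in JSON body from step 8}' \
        | grep -q '"status":"SUCCEEDED"'; then
        echo "Job SUCCEEDED"
        break
    fi
    sleep 1
done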

Trademarks

Apache Knox Gateway, Apache, the Apache feather logo and the Apache Knox Gateway project logos are trademarks of The Apache Software Foundation. All other marks mentioned may be trademarks or registered trademarks of their respective owners.

License

Apache Knox uses the standard Apache license.

Privacy Policy

Apache Knox uses the standard Apache privacy policy.
