
Topology Based Federation


Jira: KNOX-1247

KIP: KIP-11 Cloud Usecases

Introduction

This feature, which ships with Knox 1.2.0, allows federation from one Knox instance to another using the Header Based Pre-Auth authentication provider. It is typically useful when one has a hybrid on-prem/cloud model, so that the on-prem Knox instance can federate requests to the cloud instance. This can be useful in cases like:

  • WebHDFS calls to the on-prem Knox instance are re-dispatched to the cloud instance(s), resulting in files being written to or read from HDFS in the cloud.
  • Spark jobs submitted to Livy through on-prem instances get re-dispatched and are submitted as cloud workloads.
  • MapReduce jobs submitted to YARN RM through Knox will be submitted as workloads to the cloud.

The downside of this approach is that it adds an additional hop to the request, which can slow things down in some cases. It is also critical to enable two-way SSL and to establish trust between the on-prem and cloud Knox instances by provisioning certificates, since Header Based Pre-Auth authentication is not secure by itself. Perimeter security around the cloud Knox instance is a must, e.g. a VPC and IP whitelisting of the on-prem Knox instance(s).

Setup

The following diagram describes the federated request flow.


We need to provision the cloud Knox instance's certificate into the on-prem Knox instance and vice versa, and enable two-way SSL, as shown below.

We will look at the settings for the on-prem and cloud Knox topologies.

On-prem

Authentication provider (topology): any

Dispatch (topology): org.apache.knox.gateway.dispatch.HeaderPreAuthFederationDispatch

Federation header name (gateway-site.xml): gateway.custom.federation.header.name

For authentication, since we will be authenticating locally, we can use any authentication provider we choose, e.g. the local LDAP provider.

We can update the dispatch for the service that needs to be federated (WEBHDFS in the following example). You can override the dispatch for a service in the topology itself, e.g.

      <service>
          <role>WEBHDFS</role>
          <url>https://my.cloudurl.com:8443/gateway/aws/webhdfs</url>
          <dispatch>
              <classname>org.apache.knox.gateway.dispatch.HeaderPreAuthFederationDispatch</classname>
              <use-two-way-ssl>true</use-two-way-ssl>
          </dispatch>
      </service>

The gateway.custom.federation.header.name property in gateway-site.xml can be used to set a custom header name. The default value of this property is "SM_USER".

This property value needs to be the same as the preauth.custom.header property used by the HeaderPreAuth authentication provider in the cloud topology.

e.g.

    <property>
        <name>gateway.custom.federation.header.name</name>
        <value>aws_header</value>
        <description>Custom header name to be used for federated requests.</description>
    </property>


Cloud

Authentication provider (topology): HeaderPreAuth

For the cloud Knox instance, we need to use the HeaderPreAuth authentication provider and specify the "preauth.custom.header" parameter. Its value should be exactly the same as the value of the "gateway.custom.federation.header.name" property defined in the on-prem gateway-site.xml (aws_header in our example above).


Following is the relevant topology snippet

     <provider>          
         <role>federation</role> 
         <name>HeaderPreAuth</name>          
         <enabled>true</enabled>         
         <param>
              <name>preauth.custom.header</name>
              <value>aws_header</value>
           </param>          
     </provider>

That's all there is to it; your topology-based federation should now be ready.
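
To verify the end-to-end flow you can send a WebHDFS request to the on-prem Knox instance and confirm that it lands in the cloud cluster. A minimal sketch, where the on-prem hostname, the topology name 'onprem' and the demo LDAP credentials are placeholders:

    # Request goes to the on-prem Knox instance...
    curl -k -u guest:guest-password \
      "https://my.onprem-knox.com:8443/gateway/onprem/webhdfs/v1/tmp?op=LISTSTATUS"

    # ...and is re-dispatched by HeaderPreAuthFederationDispatch to the cloud instance,
    # e.g. https://my.cloudurl.com:8443/gateway/aws/webhdfs/v1/tmp?op=LISTSTATUS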





KnoxShell Kerberos support

JIRA: KNOX-1623

Introduction

KnoxShell Kerberos support should be available in Apache Knox 1.3.0. KnoxShell is an Apache Knox module that provides scripting support for talking to Apache Knox; more details on setting up KnoxShell can be found in this blog post. With Kerberos support we can now use cached tickets or keytabs to authenticate against a secure (Kerberos enabled) topology in Apache Knox. This blog demonstrates examples of how this can be achieved.

Prerequisite

  1. In order to get started, download and set up KnoxShell (setup instructions).
  2. Configure Apache Knox to use Hadoop Auth (setup instructions).

Make sure to test by sending a curl request through Knox

curl -k -i --negotiate -u : "https://{knoxhost}:{knoxport}/gateway/sandbox/webhdfs/v1/tmp?op=LISTSTATUS"


Kerberos Authentication

Following are the methods that can be used to initialize a session in KnoxShell:

session = KnoxSession.kerberosLogin(url, jaasConfig, krb5Conf, debug)

Where:

  • url is the gateway url
  • jaasConfig is the JAAS configuration (optional)
  • krb5Conf is the krb5 config file (optional)
  • debug turns on debug statements (optional)

or

session = KnoxSession.kerberosLogin(url)

where:

  • url is the gateway url
  • The default jaasConfig is used, which looks for a cached ticket at an OS-specific path
  • The krb5 conf file is looked up at its default location
  • debug is false

Example

Following is an example Groovy script that talks to a Kerberos-enabled cluster fronted by Apache Knox using Hadoop Auth:

    import groovy.json.JsonSlurper
    import org.apache.knox.gateway.shell.KnoxSession
    import org.apache.knox.gateway.shell.hdfs.Hdfs

    import org.apache.knox.gateway.shell.Credentials

    gateway = "https://gateway-site:8443/gateway/secure"

    session = KnoxSession.kerberosLogin(gateway)

    text = Hdfs.ls( session ).dir( "/" ).now().string
    json = (new JsonSlurper()).parseText( text )
    println json.FileStatuses.FileStatus.pathSuffix
    session.shutdown()

Following is an example of the relevant parts of the "secure" topology:

       <provider>
          <role>authentication</role>
          <name>HadoopAuth</name>
          <enabled>true</enabled>
          <param>
            <name>config.prefix</name>
            <value>hadoop.auth.config</value>
          </param>
          <param>
            <name>hadoop.auth.config.signature.secret</name>
            <value>some-secret</value>
          </param>
          <param>
            <name>hadoop.auth.config.type</name>
            <value>kerberos</value>
          </param>
          <param>
            <name>hadoop.auth.config.simple.anonymous.allowed</name>
            <value>false</value>
          </param>
          <param>
            <name>hadoop.auth.config.token.validity</name>
            <value>1800</value>
          </param>
          <param>
            <name>hadoop.auth.config.cookie.domain</name>
            <!-- Cookie domain for your site -->
            <value>your.site</value>
          </param>
          <param>
            <name>hadoop.auth.config.cookie.path</name>
            <!-- Topology path -->
            <value>gateway/secure</value>
          </param>
          <param>
            <name>hadoop.auth.config.kerberos.principal</name>
            <value>HTTP/your.site@EXAMPLE.COM</value>
          </param>
          <param>
            <name>hadoop.auth.config.kerberos.keytab</name>
            <value>/etc/security/keytabs/spnego.service.keytab</value>
          </param>
          <param>
            <name>hadoop.auth.config.kerberos.name.rules</name>
            <value>DEFAULT</value>
          </param>
        </provider>

Now we kinit and then run the groovy script.

Note on credential cache location: On macOS the default credential cache is in-memory, which means the credentials are held in memory and not written to disk. KnoxShell unfortunately does not have access to an in-memory cache, so the -c FILE:<cache location> option should be used when doing a kinit.

The following ticket cache location is specific to my machine; it may differ in your case.

kinit -c FILE:/tmp/krb5cc_502 admin/your.site@EXAMPLE.COM
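
You can optionally verify that the ticket landed in the file-based cache before running the script (the cache path is the one used in the kinit above):

    klist -c FILE:/tmp/krb5cc_502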

Next we just invoke the groovy script using KnoxShell

bin/knoxshell.sh samples/ExampleWebHdfsLs.groovy

If everything is set up properly you should see the HDFS LS output.


Rewrite rules in Apache Knox can be difficult to follow if you are just starting to use Apache Knox. This blog tries to cover the basics of Apache Knox rewrite rules and then goes in depth into more advanced rules and how to use them. This blog builds upon Adding a service to Apache Knox by Kevin Minder.

Rules are defined in the rewrite.xml file, an example is:

<rules>
  <rule dir="IN" name="WEATHER/weather/inbound" pattern="*://*:*/**/weather/{path=**}?{**}">
    <rewrite template="{$serviceUrl[WEATHER]}/{path=**}?{**}"/>
  </rule>
</rules>

Simple service rule

A sample service.xml entry

<service role="WEATHER" name="weather" version="0.0.1">
  <routes>
    <route path="/weather/**"/>
  </routes>
</service>

The service.xml file defines the high-level URL patterns that will be exposed by the gateway for a service.

<service role="WEATHER">
  • The role/implementation/version triad is used through Knox for integration plugins.
  • Think of the role as an interface in Java.
  • This attribute declares what role this service “implements”.
  • This will need to match the topology file’s <topology><service><role> for this service.

<service name="weather">
  • In the role/implementation/version triad this is the implementation.
  • Think of this as a Java implementation class name relative to an interface.
  • As a matter of convention this should match the directory beneath <GATEWAY_HOME>/data/services
  • The topology file can optionally contain <topology><service><name> but usually doesn’t. This would be used to select a specific implementation of a role if there were multiple.

<service version="0.0.1">
  • As a matter of convention this should match the directory beneath the service implementation name.
  • The topology file can optionally contain <topology><service><version> but usually doesn’t. This would be used to select a specific version of an implementation if there were multiple. This can be important if the protocols for a service evolve over time.


<service><routes><route path="/weather/**"></routes></service>
  • This tells the gateway that all requests starting with /weather/ are handled by this service.
  • Due to a limitation this will not include requests to /weather (i.e. no trailing /); see the example after this list.
  • The ** means zero or more paths similar to Ant.
  • The scheme, host, port, gateway and topology components are not included (e.g. https://localhost:8443/gateway/sandbox)
  • Routes can, but typically don’t, take query parameters into account.
  • In this simple form there is no direct relationship between the route path and the rewrite rules!
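
To see the effect of this route, compare a request under /weather/ with a bare /weather request. A minimal sketch, assuming the WEATHER service from Kevin Minder's post is deployed in the sandbox topology and the demo LDAP credentials are in use (the request path is only an illustration):

    # Matches the /weather/** route and is handled by the WEATHER service
    curl -k -u guest:guest-password "https://localhost:8443/gateway/sandbox/weather/forecast?zip=95054"

    # Does not match the route (no trailing /) and is typically rejected by the gateway
    curl -k -u guest:guest-password "https://localhost:8443/gateway/sandbox/weather"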

Simple rewrite rules



<rules><rule pattern="*://*:*/**/weather/{path=**}?{**}"/></rules>
  • Defines the URL pattern for which this rule will apply.
  • The * matches exactly one segment of the URL.
  • The ** matches zero or more segments of the URL.
  • The {path=**} matches zero or more path segments and provides access to them as a parameter named 'path'.
  • The {**} matches zero or more query parameters and provides access to them by name.
  • The values from matched {…} segments are “consumed” by the rewrite template below.

<rules><rule><rewrite template="{$serviceUrl[WEATHER]}/{path=**}?{**}"/></rules>
  • Defines how the URL matched by the rule will be rewritten.
  • The {$serviceUrl[WEATHER]} looks up the <service><url> for the <service><role>WEATHER. This is implemented as a rewrite function and is another custom extension point (see the example after this list).
  • The {path=**} extracts zero or more values for the 'path' parameter from the matched URL.
  • The {**} extracts any “unused” parameters and uses them as query parameters.
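
As a concrete illustration of this rule pair, here is how a gateway request would be transformed, assuming the topology maps the WEATHER role to https://api.example.com/v1 (the host, path and query are placeholders):

    # Send a request to the gateway; the inbound rule rewrites it before dispatch
    curl -k -u guest:guest-password "https://localhost:8443/gateway/sandbox/weather/forecast?zip=95054"
    # {path=**} captures "forecast", {**} captures "zip=95054", and
    # {$serviceUrl[WEATHER]} resolves to the <service><url> from the topology,
    # so the dispatched URL becomes:
    #   https://api.example.com/v1/forecast?zip=95054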

Scope

Rewrite rules can be global or local to the service they are defined in. After Apache Knox 0.6.0 all rewrite rules are local unless they are explicitly defined as global.

To define global rules, use the property 'gateway.global.rules.services' in 'gateway-site.xml', which takes a list of services whose rewrite rules are made global, e.g.

    <property>
        <name>gateway.global.rules.services</name>
        <value>"NAMENODE","JOBTRACKER", "WEBHDFS", "WEBHCAT", "OOZIE", "WEBHBASE", "HIVE", "RESOURCEMANAGER"</value>
    </property>

Note: Rewrite rules for the services "NAMENODE", "JOBTRACKER", "WEBHDFS", "WEBHCAT", "OOZIE", "WEBHBASE", "HIVE" and "RESOURCEMANAGER" are global by default.

If you want a single rule inside a global rewrite rules file to be scoped to a particular service, you can do so by using the 'scope' attribute, e.g.

    <!-- Limit the scope of this rule just to WEBHDFS service -->
    <rule dir="OUT" scope="WEBHDFS" name="WEBHDFS/webhdfs/outbound" pattern="hdfs://*:*/{path=**}?{**}">
        <rewrite template="{$frontend[url]}/webhdfs/v1/{path=**}?{**}"/>
    </rule>


Direction

Rewrite rules can be applied to inbound requests (going to the gateway from a browser, curl, etc.) or outbound responses (going from the gateway back towards the browser). The direction is indicated by the "dir" attribute:

<rule dir="IN">

The possible values are IN and OUT for inbound and outbound requests.


Flow

Flows are logical operators (AND, OR, ALL) applied to the match patterns within a rule. So a rewrite rule could match pattern A OR pattern B, match pattern A AND pattern B, or match ALL the given patterns.

Valid flow values are:

  • OR
  • AND
  • ALL

e.g. a rule with the OR flow:

<rule name="test-rule-with-complex-flow" flow="OR">
    <match pattern="*://*:*/~/{path=**}?{**}">
        <rewrite template="test-scheme-output://test-host-output:777/test-path-output/test-home/{path}?{**}"/>
    </match>
    <match pattern="*://*:*/{path=**}?{**}">
        <rewrite template="test-scheme-output://test-host-output:42/test-path-output/{path}?{**}"/>
    </match>
</rule>


Rewrite Variables

These variables can be used with the rewrite function.

$username

Username of authenticated user

	<rule name="OOZIE/oozie/user-name">
        <rewrite template="{$username}"/>
    </rule>


$inboundurl


  <rule dir="OUT" name="NODEUI/node/static" pattern="/static/{**}">
    <rewrite template="{$frontend[url]}/node/static/{**}?host={$inboundurl[host]}"/>
  </rule>


$serviceAddr

    <rule name="hdfs-addr">
        <rewrite template="hdfs://{$serviceAddr[NAMENODE]}"/>
    </rule>


$serviceHost

    <rule name="nn-host">
        <rewrite template="{$serviceHost[NAMENODE]}"/>
    </rule>


$serviceMappedAddr

    <rule name="OOZIE/oozie/name-node-url">
        <rewrite template="hdfs://{$serviceMappedAddr[NAMENODE]}"/>
    </rule>


$serviceMappedHost

 


$serviceMappedUrl

    <match pattern="{path=**}">
            <rewrite template="{$serviceMappedUrl[NAMENODE]}/{path=**}"/>
    </match>


$servicePath

    <rule name="nn-path">
        <rewrite template="{$servicePath[NAMENODE]}"/>
    </rule>


$servicePort

    <rule name="hdfs-path">
        <match pattern="{path=**}"/>
        <rewrite template="hdfs://{$serviceHost[NAMENODE]}:{$servicePort[NAMENODE]}/{path=**}"/>
    </rule>


$serviceScheme

<rule dir="IN" name="NODEUI/logs" pattern="*://*:*/**/node/logs/?{host}?{port}">
    <rewrite template="{$serviceScheme[NODEUI]}://{host}:{port}/logs/"/>
</rule>

$serviceUrl

  • $serviceUrl[SERVICE_NAME]  - looks up the <service><url> for the <service><role>SERVICE_NAME

$frontend

  • $frontend[path] - Gets the Knox path i.e. /gateway/sandbox/

$import

  • $import - This function enhances the $frontend function by adding an '@import' prefix to the $frontend path, e.g.

    <rewrite template="{$import[&quot;, url]}/stylesheets/pretty.css&quot;;"/>

    It takes the following parameters as options:

$username

  • $username - This variable is used when we need to get the impersonated principal name (or the primary principal in case the impersonated principal is absent).

    <rewrite template="test-output-scheme://{host}:{port}/test-output-path/{path=**}?user.name={$username}?{**}?test-query-output-name=test-query-output-value"/>

$prefix

  • $prefix - This function enhances the $frontend function just like $import, but gives the ability to choose the prefix (unlike the constant '@import' in the case of $import), e.g.

    <rewrite template="{$prefix[&#39;,url]}/zeppelin/components/{**}?{**}"/>
    
    
    • $prefix[PREFIX, url] - Adds the supplied PREFIX to the frontend URL, e.g. in the above case the rewritten URL would be 'https://localhost:8443/zeppelin/components/navbar/navbar.html?v=1498928142479' (mind the single tick ')

$postfix

  • $postfix - Just like prefix, postfix function is used to append a character or string to the gateway url (including topology path)

  • usage - {$postfix[url,<customString>]}

    <rewrite template="{scheme}://{host}:{port}/{gateway}/{knoxsso}/{api}/{v1}/{websso}?originalUrl={$postfix[url,/sparkhistory/]}"/>


$infix

  • $infix - This function is used to append a custom prefix and postfix
  • usage - {$infix[<customString>,url,<customString>]}

    <rewrite template="{scheme}://{host}:{port}/{gateway}/{sandbox}/?query={$infix[&#39;,url,/sparkhistory/&#39;]}"/>

$hostmap

The purpose of the Hostmap provider is to handle situations where hosts are known by one name within the cluster and another name externally. This frequently occurs when virtual machines are used and in particular when using cloud hosting services. Currently, the Hostmap provider is configured as part of the topology file.

For more information see the Knox user guide.


Rewrite rule example:


  <rewrite template="{gateway.url}/hdfs/logs?{scheme}?host={$hostmap(host)}?{port}?{**}"/>

Topology declaration example

<topology>
    <gateway>
        ...
        <provider>
            <role>hostmap</role>
            <name>static</name>
            <enabled>true</enabled>
            <param><name>external-host-name</name><value>internal-host-name</value></param>
        </provider>
        ...
    </gateway>
    ...
</topology>


$inboundurl

Only used by outbound rules

<rewrite template="{gateway.url}/datanode/static/{**}?host={$inboundurl[host]}"/>


Rules Filter

Sometimes you want the ability to rewrite *.js, *.css and other non-HTML pages. Filters are a way to rewrite these non-HTML files; they are selected based on the Content-Type of the page. The different filter types supported by Apache Knox are listed later in this post.

There are three declarations needed for filters:

  1. The filter declaration, with the Content-Type and the pattern to apply the filter to - rewrite.xml
  2. The rewrite rule to apply to the matched pattern - rewrite.xml
  3. The path to apply the filter to and whether it is applied to the response or request body - service.xml

The following is an example of filters used in proxying the Zeppelin UI; the relevant code snippets in the rewrite.xml and service.xml files are:

rewrite.xml
  <!-- Filters -->
  <rule dir="OUT" name="ZEPPELINUI/zeppelin/outbound/javascript/filter/app/home" >
    <rewrite template="{$frontend[path]}/zeppelin/app/home/home.html"/>
  </rule>
  
  <rule dir="OUT" name="ZEPPELINUI/zeppelin/outbound/javascript/filter/app/notebook" >
    <rewrite template="{$frontend[path]}/zeppelin/app/notebook/notebook.html"/>
  </rule>
  
  <rule dir="OUT" name="ZEPPELINUI/zeppelin/outbound/javascript/filter/app/jobmanager" >
    <rewrite template="{$frontend[path]}/zeppelin/app/jobmanager/jobmanager.html"/>
  </rule>
 
  <filter name="ZEPPELINUI/zeppelin/outbound/javascript/filter">
          <content type="application/javascript">
              <apply path="app/home/home.html" rule="ZEPPELINUI/zeppelin/outbound/javascript/filter/app/home"/>
              <apply path="app/notebook/notebook.html" rule="ZEPPELINUI/zeppelin/outbound/javascript/filter/app/notebook"/>
              <apply path="app/jobmanager/jobmanager.html" rule="ZEPPELINUI/zeppelin/outbound/javascript/filter/app/jobmanager"/>
          </content>
  </filter>
service.xml
    <!-- Filter -->
    <route path="/zeppelin/scripts/**">
      <rewrite apply="ZEPPELINUI/zeppelin/outbound/javascript/filter" to="response.body"/>
    </route>

A good example of how to use the filters is Proxying a UI using Knox.

Following are the different types of filters supported by Apache Knox and the Content-Types they handle.

Form URL Rewrite Filter

Uses Content-Type "application/x-www-form-urlencoded", "*/x-www-form-urlencoded"

HTML URL Rewrite Filter

Uses Content-Type "application/html", "text/html", "*/html"

JavaScript URL Rewrite Filter

Uses Content-Type "application/javascript", "text/javascript", "*/javascript", "application/x-javascript", "text/x-javascript", "*/x-javascript"

JSON URL Rewrite Filter

Uses Content-Type "application/json", "text/json", "*/json"

XML URL Rewrite Filter

Uses Content-Type "application/xml", "text/xml", "*/xml"


Pattern Matching

Pattern matching in Knox unfortunately does not follow the standard regex format. Following is how pattern matching works in some common cases.

URL Templates

Path

  • {path} => {path=*}
  • {path=*} // Match single path level. (ie wildcard)
  • {path=**} // Match multiple path levels. (ie glob)
  • {path=*.ext} // Match single level with simplified regex pattern.

Query

  • {queryParam}
  • {queryParam=*} => {queryParam=*:queryParam} // Match single queryParam value.
  • {queryParam=**} => {queryParam=**:queryParam} // Match multiple queryParam values.
  • {queryParam=*suffix:other-queryParam}


URI Parser

The following format is used for parsing URIs

^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
      12            3  4          5       6  7        8 9

The numbers in the second line above are only to assist readability; they indicate the reference points for each subexpression (i.e., each paired parenthesis). We refer to the value matched for subexpression <n> as $<n>. For example, matching the above expression to "http://www.ics.uci.edu/pub/ietf/uri/#Related" results in the following subexpression matches:

      $1 = http:
      $2 = http
      $3 = //www.ics.uci.edu
      $4 = www.ics.uci.edu
      $5 = /pub/ietf/uri/
      $6 = <undefined>
      $7 = <undefined>
      $8 = #Related
      $9 = Related

where <undefined> indicates that the component is not present, as is the case for the query component in the above example.  Therefore, we  can determine the value of the five components as

      scheme    = $2

      authority = $4

      path      = $5

      query     = $7

      fragment  = $9


JSON Parsing

For parsing JSON documents Knox uses JSONPath.

Sources

http://www.ics.uci.edu/pub/ietf/uri/#Related




Hadoop Group Lookup Provider

Introduction

Prior to the addition of the Hadoop Group Lookup Provider, group lookup was relegated to the authentication or federation provider that established the user identity.

Therefore, there was a limitation to which group lookup mechanisms were available.

As part of the KIP-1 improvements in release 0.10.0, the Knox community has introduced an identity assertion provider that integrates the Hadoop Groups Mapping capability from Hadoop common.

This allows us to compose topologies that contain any authentication/federation provider together with the Hadoop Group Lookup Provider as an identity assertion provider, eliminating the previous limitation of choices and enabling the exact same group mapping capabilities that are leveraged throughout the cluster.

This results in greater flexibility, consistency and choices for performance and complex lookup approaches.

Hadoop Group Lookup Provider

An identity assertion provider that looks up user’s ‘group membership’ for authenticated users using Hadoop’s group mapping service (GroupMappingServiceProvider).

This allows existing investments in the Hadoop mechanism to be leveraged within Knox and used within the access control policy enforcement at the perimeter.

The ‘role’ for this provider is ‘identity-assertion’ and name is ‘HadoopGroupProvider’.

    <provider>
        <role>identity-assertion</role>
        <name>HadoopGroupProvider</name>
        <enabled>true</enabled>
        <param> ... </param>
    </provider>

Configuration

All the configuration for ‘HadoopGroupProvider’ resides in the provider section in a gateway topology file. The ‘hadoop.security.group.mapping’ property determines the implementation. Some of the valid implementations are as follows.

org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback

This is the default implementation and will be picked up if ‘hadoop.security.group.mapping’ is not specified. This implementation will determine if the Java Native Interface (JNI) is available. If JNI is available, the implementation will use the API within Hadoop to resolve a list of groups for a user. If JNI is not available then the shell implementation, org.apache.hadoop.security.ShellBasedUnixGroupsMapping, is used, which shells out with the ‘bash -c groups’ command (for a Linux/Unix environment) or the ‘net group’ command (for a Windows environment) to resolve a list of groups for a user.
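
To sanity-check what groups Hadoop's group mapping resolves for a given user on the cluster itself, the hdfs groups command can be used. Note that this reflects the mapping configured in the cluster's core-site.xml, which may differ from the one configured in the Knox topology; the user name and output below are only illustrative:

    hdfs groups sam
    # sam : scientist analyst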

org.apache.hadoop.security.LdapGroupsMapping

This implementation connects directly to an LDAP server to resolve the list of groups. However, this should only be used if the required groups reside exclusively in LDAP, and are not materialized on the Unix servers.

For more information on the implementation and properties refer to Hadoop Group Mapping.

Example

The following example snippet works with the demo ldap server that ships with Apache Knox. Replace the existing ‘Default’ identity-assertion provider with the one below (HadoopGroupProvider).

    <provider>
        <role>identity-assertion</role>
        <name>HadoopGroupProvider</name>
        <enabled>true</enabled>
        <param>
            <name>hadoop.security.group.mapping</name>
            <value>org.apache.hadoop.security.LdapGroupsMapping</value>
        </param>
        <param>
            <name>hadoop.security.group.mapping.ldap.bind.user</name>
            <value>uid=tom,ou=people,dc=hadoop,dc=apache,dc=org</value>
        </param>
        <param>
            <name>hadoop.security.group.mapping.ldap.bind.password</name>
            <value>tom-password</value>
        </param>
        <param>
            <name>hadoop.security.group.mapping.ldap.url</name>
            <value>ldap://localhost:33389</value>
        </param>
        <param>
            <name>hadoop.security.group.mapping.ldap.base</name>
            <value></value>
        </param>
        <param>
            <name>hadoop.security.group.mapping.ldap.search.filter.user</name>
            <value>(&amp;(|(objectclass=person)(objectclass=applicationProcess))(cn={0}))</value>
        </param>
        <param>
            <name>hadoop.security.group.mapping.ldap.search.filter.group</name>
            <value>(objectclass=groupOfNames)</value>
        </param>
        <param>
            <name>hadoop.security.group.mapping.ldap.search.attr.member</name>
            <value>member</value>
        </param>
        <param>
            <name>hadoop.security.group.mapping.ldap.search.attr.group.name</name>
            <value>cn</value>
        </param>
    </provider>

Here, we are working with the demo LDAP server running at ‘ldap://localhost:33389’, which is populated with some dummy users for testing that we will use in this example. This example uses the user ‘tom’ for LDAP binding. If you have different LDAP/AD settings you will have to update the properties accordingly.

Let’s test our setup using the following command (assuming the gateway is started and listening on localhost:8443). Note that we are using credentials for the user ‘sam’ along with the command.

    curl -i -k -u sam:sam-password -X GET 'https://localhost:8443/gateway/sandbox/webhdfs/v1/?op=LISTSTATUS' 

The command should execute successfully and you should see the groups ‘scientist’ and ‘analyst’, to which the user ‘sam’ belongs, in gateway-audit.log, i.e.

    ||a99aa0ab-fc06-48f2-8df3-36e6fe37c230|audit|WEBHDFS|sam|||identity-mapping|principal|sam|success|Groups: [scientist, analyst]

In large enterprise LDAP setups there could be cases where users under different OUs have the same userid, e.g.

In the diagram above we can see that we have the same userid in multiple OUs, i.e. uid=jerry,ou=people,dc=hadoop,dc=apache,dc=org and uid=jerry,ou=contractor,dc=hadoop,dc=apache,dc=org.

 

What if we would like both users to successfully authenticate using Apache Knox? One thing we could do is broaden the search base by going higher up the LDAP tree, in this case using dc=hadoop,dc=apache,dc=org as the search base.

e.g.

			<param>
                <name>main.ldapRealm.searchBase</name>
                <value>dc=hadoop,dc=apache,dc=org</value>
            </param> 

 

Unfortunately, this approach will work only for one user and not the other. The reason is that Knox walks down one branch and tries to authenticate, and if authentication fails (say, because of a bad password) it will report a failure and stop, throwing the error "Failed to Authenticate with LDAP server: {1}".

The issue here is that after a failure while traversing down one branch (say, ou=people,dc=hadoop,dc=apache,dc=org), the other branch (ou=contractor,dc=hadoop,dc=apache,dc=org) is ignored.

The solution is to use multiple LDAP realms. Multiple LDAP realms let us traverse multiple branches (configured by the *.searchBase property) even if a failure is encountered in any one of them.

 

Configuring multiple realms:

The rest of the post describes a test setup showing how to configure multiple realms using the demo LDAP server that ships with Apache Knox.

Creating sample test users

Let's use the users in the example diagram above. For this test we will use the demo LDAP server.

Before starting the demo LDAP server add the following to the {KNOX_HOME}/conf/users.ldif file

Add contractor OU just before the sample users entry

# Entry for a sample contractor container
# Please replace with site specific values
dn: ou=contractor,dc=hadoop,dc=apache,dc=org
objectclass:top
objectclass:organizationalUnit
ou: contractor

Add sample users with the same uid but different OUs. You can add them after the dn: uid=tom,ou=people,dc=hadoop,dc=apache,dc=org entry.

# entry for sample user jerry
dn: uid=jerry,ou=people,dc=hadoop,dc=apache,dc=org
objectclass:top
objectclass:person
objectclass:organizationalPerson
objectclass:inetOrgPerson
cn: jerry
sn: jerry
uid: jerry
userPassword:jerry-password
# entry for sample user jerry (contractor)
dn: uid=jerry,ou=contractor,dc=hadoop,dc=apache,dc=org
objectclass:top
objectclass:person
objectclass:organizationalPerson
objectclass:inetOrgPerson
cn: jerry
sn: jerry
uid: jerry
userPassword:other-jerry-password

Now we have the same uid (jerry) in two different OUs.
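
With the entries in place, (re)start the demo LDAP server so it picks up the updated users.ldif; a minimal sketch assuming a standard Knox install layout:

    cd {KNOX_HOME}
    bin/ldap.sh start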

Update topology file:

Open the sandbox.xml topology file (or the one that you want to test).

Under the ShiroProvider (Shiro authentication provider) remove or comment out everything between property 'main.ldapRealm' and property 'main.ldapRealm.contextFactory.authenticationMechanism' and replace it with the following:

			<param>
                <name>main.ldapRealm</name>
                <value>org.apache.hadoop.gateway.shirorealm.KnoxLdapRealm</value>
            </param>
            <param>
                <name>main.ldapContextFactory</name>
                <value>org.apache.hadoop.gateway.shirorealm.KnoxLdapContextFactory</value>
            </param>
            <param>
                <name>main.ldapRealm.contextFactory</name>
                <value>$ldapContextFactory</value>
            </param>
            <param>
                <name>main.ldapRealm.userSearchAttributeName</name>
                <value>uid</value>
            </param>
            
            <param>
                <name>main.ldapRealm.contextFactory.systemUsername</name>
                <value>uid=guest,ou=people,dc=hadoop,dc=apache,dc=org</value>
            </param>
            
            <param>
                <name>main.ldapRealm.contextFactory.systemPassword</name>
                <value>guest-password</value>
            </param>
            
            <param>
                <name>ldapRealm.userObjectClass</name>
                <value>person</value>
            </param>
            
            <!-- Search Base for Realm 1 -->
            <param>
                <name>main.ldapRealm.searchBase</name>
                <value>ou=people,dc=hadoop,dc=apache,dc=org</value>
            </param>
            
            <param>
                <name>main.ldapRealm.contextFactory.url</name>
                <value>ldap://localhost:33389</value>
            </param>
            <param>
                <name>main.ldapRealm.contextFactory.authenticationMechanism</name>
                <value>simple</value>
            </param>
            
            <!-- REALM #2 Start  -->
            <param>
                <name>main.ldapRealm2</name>
                <value>org.apache.hadoop.gateway.shirorealm.KnoxLdapRealm</value>
            </param>
            <param>
                <name>main.ldapContextFactory</name>
                <value>org.apache.hadoop.gateway.shirorealm.KnoxLdapContextFactory</value>
            </param>
            <param>
                <name>main.ldapRealm2.contextFactory</name>
                <value>$ldapContextFactory</value>
            </param>
            
            <param>
                <name>main.ldapRealm2.userSearchAttributeName</name>
                <value>uid</value>
            </param>
            
            <param>
                <name>main.ldapRealm2.contextFactory.systemUsername</name>
                <value>uid=guest,ou=people,dc=hadoop,dc=apache,dc=org</value>
            </param>
            
            <param>
                <name>main.ldapRealm2.contextFactory.systemPassword</name>
                <value>guest-password</value>
            </param>
            
            <param>
                <name>ldapRealm.userObjectClass</name>
                <value>person</value>
            </param>
            <!-- Search Base for Realm 2 -->
            <param>
                <name>main.ldapRealm2.searchBase</name>
                <value>ou=contractor,dc=hadoop,dc=apache,dc=org</value>
            </param>
            
            <param>
                <name>main.ldapRealm2.contextFactory.url</name>
                <value>ldap://localhost:33389</value>
            </param>
            <param>
                <name>main.ldapRealm2.contextFactory.authenticationMechanism</name>
                <value>simple</value>
            </param>
            <!-- REALM #2 End  -->
            
            <!-- Let Knox know about the two different realms -->
            <param>
              <name>main.securityManager.realms</name> 
              <value>$ldapRealm, $ldapRealm2</value>
            </param>

 

In the above example we have defined two realms 'ldapRealm' and 'ldapRealm2'. For more information about the properties look at Apache Knox Advanced LDAP Configuration.

The important properties are described below in the context of multiple realms:

  • main.*.searchBase - Search base where Knox will start the search
  • main.*.userSearchAttributeName and ldapRealm.userObjectClass - LDAP Filter for doing a search, in our case (&(uid=jerry)(objectclass=person))
  • main.securityManager.realms - Defines the realms to be used for authentication, in this case we use two 'ldapRealm' and 'ldapRealm2'

The advantage of using multiple realms is that if the search fails to match a user in one realm (using its search base), the search continues in the other realm or realms. Caution is advised when defining the search base to prevent it from being too narrow or restricted.
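
To see what each realm's search would match before involving Knox, you can run the equivalent queries directly against the demo LDAP server. A sketch using the bind user, password and search bases from the topology above:

    # Realm 1: search under ou=people with the filter Knox builds, i.e. (&(uid=jerry)(objectclass=person))
    ldapsearch -x -H ldap://localhost:33389 \
      -D "uid=guest,ou=people,dc=hadoop,dc=apache,dc=org" -w guest-password \
      -b "ou=people,dc=hadoop,dc=apache,dc=org" "(&(uid=jerry)(objectclass=person))" dn

    # Realm 2: the same query against ou=contractor
    ldapsearch -x -H ldap://localhost:33389 \
      -D "uid=guest,ou=people,dc=hadoop,dc=apache,dc=org" -w guest-password \
      -b "ou=contractor,dc=hadoop,dc=apache,dc=org" "(&(uid=jerry)(objectclass=person))" dn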

Test:

Assuming you have the HDP sandbox, the demo LDAP server and the Apache Knox gateway running on your machine, you can test both users with the following commands and should get an HTTP 200 response:

curl -i -k -u jerry:jerry-password -X GET 'https://localhost:8443/gateway/sandbox/webhdfs/v1/?op=LISTSTATUS'

curl -i -k -u jerry:other-jerry-password -X GET 'https://localhost:8443/gateway/sandbox/webhdfs/v1/?op=LISTSTATUS'

 

You may also see a login failure logged when Apache Knox tries to authenticate against a different branch; this of course is expected and shows how Apache Knox traversed the branches to find the right user.

# curl -i -k -u jerry:other-jerry-password -X GET 'https://localhost:8443/gateway/sandbox/webhdfs/v1/?op=LISTSTATUS'
....
....
....
2017-03-01 15:51:45,530 INFO  hadoop.gateway (KnoxLdapRealm.java:getUserDn(724)) - Computed userDn: uid=jerry,ou=people,dc=hadoop,dc=apache,dc=org using ldapSearch for principal: jerry
2017-03-01 15:51:45,538 INFO  hadoop.gateway (KnoxLdapRealm.java:doGetAuthenticationInfo(203)) - Could not login: org.apache.shiro.authc.UsernamePasswordToken - jerry, rememberMe=false (0:0:0:0:0:0:0:1)
2017-03-01 15:51:45,538 ERROR hadoop.gateway (KnoxLdapRealm.java:doGetAuthenticationInfo(205)) - Shiro unable to login: javax.naming.AuthenticationException: [LDAP: error code 49 - INVALID_CREDENTIALS: Bind failed: ERR_229 Cannot authenticate user uid=jerry,ou=people,dc=hadoop,dc=apache,dc=org]
2017-03-01 15:51:45,541 INFO  hadoop.gateway (KnoxLdapRealm.java:getUserDn(724)) - Computed userDn: uid=jerry,ou=contractor,dc=hadoop,dc=apache,dc=org using ldapSearch for principal: jerry

 

 

Hadoop Auth is a Java library which enables Kerberos SPNEGO authentication for HTTP requests. It enforces authentication on protected resources; after successful authentication, Hadoop Auth creates a signed HTTP cookie with an authentication token, username, user principal, authentication type and expiration time. This cookie is used for all subsequent HTTP client requests to access a protected resource until the cookie expires.

Given Apache Knox's pluggable authentication providers, it is easy to set up Hadoop Auth with Apache Knox with only a few configuration changes. The purpose of this article is to describe this process in detail and with examples.

Assumptions

Here we are assuming that we have a working Hadoop cluster with Apache Knox (version 0.7.0 and up) and that the cluster is Kerberized. Kerberizing the cluster is beyond the scope of this article.

Setup

To use Hadoop Auth in Apache Knox we need to update the Knox topology. Hadoop Auth is configured as a provider, so we need to configure it through the provider params. Apache Knox uses the same configuration parameters used by Apache Hadoop, and they can be expected to behave in a similar fashion. To update the Knox topology using Ambari, go to Knox -> Configs -> Advanced topology.

Following is an example of the HadoopAuth provider snippet in the Apache Knox topology file

                <provider>
                  <role>authentication</role>
                  <name>HadoopAuth</name>
                  <enabled>true</enabled>
                  <param>
                    <name>config.prefix</name>
                    <value>hadoop.auth.config</value>
                  </param>
                  <param>
                    <name>hadoop.auth.config.signature.secret</name>
                    <value>my-secret-key</value>
                  </param>
                  <param>
                    <name>hadoop.auth.config.type</name>
                    <value>kerberos</value>
                  </param>
                  <param>
                    <name>hadoop.auth.config.simple.anonymous.allowed</name>
                    <value>false</value>
                  </param>
                  <param>
                    <name>hadoop.auth.config.token.validity</name>
                    <value>1800</value>
                  </param>
                  <param>
                    <name>hadoop.auth.config.cookie.domain</name>
                    <value>ambari.apache.org</value>
                  </param>
                  <param>
                    <name>hadoop.auth.config.cookie.path</name>
                    <value>gateway/default</value>
                  </param>
                  <param>
                    <name>hadoop.auth.config.kerberos.principal</name>
                    <value>HTTP/c6401.ambari.apache.org@EXAMPLE.COM</value>
                  </param>
                  <param>
                    <name>hadoop.auth.config.kerberos.keytab</name>
                    <value>/etc/security/keytabs/spnego.service.keytab</value>
                  </param>
                  <param>
                    <name>hadoop.auth.config.kerberos.name.rules</name>
                    <value>DEFAULT</value>
                  </param>
                </provider>

Following are the parameters that need to be updated at a minimum:

  1. hadoop.auth.config.signature.secret - This is the secret used to sign the delegation token in the hadoop.auth cookie. The same secret needs to be used across all instances of the Knox gateway in a given cluster. Otherwise, the delegation token will fail validation and authentication will be repeated for each request.
  2. hadoop.auth.config.cookie.domain - The domain to use for the HTTP cookie that stores the authentication token (e.g. mycompany.com)
  3. hadoop.auth.config.kerberos.principal - The web-application Kerberos principal name. The Kerberos principal name must start with HTTP/...
  4. hadoop.auth.config.kerberos.keytab - The path to the keytab file containing the credentials for the kerberos principal specified above.


For details on the other properties please refer to the Apache Knox documentation.
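
Before restarting the gateway it is also worth verifying that the keytab actually contains the configured SPNEGO principal; a quick check using the path and principal from the example above:

    klist -kt /etc/security/keytabs/spnego.service.keytab
    # should list entries for HTTP/c6401.ambari.apache.org@EXAMPLE.COM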

If you are using Ambari you will have to restart Knox; this is an Ambari requirement. No restart is required if the topology is updated outside of Ambari (Apache Knox reloads the topology every time the topology timestamp is updated).

Testing

For testing Hadoop Auth we will use the user 'guest'; we are assuming that no such user exists on the system.

  1. Let's create a user 'guest' with group 'users'. Note that the group users was chosen because of the property 'hadoop.proxyuser.knox.groups=users'

    useradd guest -u 1590 -g users
  2. Add principal using 'kadmin.local'

    kadmin.local -q "addprinc guest/c6401.ambari.apache.org"
  3. Login using kinit

    kinit guest/c6401.ambari.apache.org@EXAMPLE.COM
  4. Test by sending a curl request through Knox

    curl -k -i --negotiate -u : "https://c6401.ambari.apache.org:8443/gateway/default/webhdfs/v1/tmp?op=LISTSTATUS"

You should see output similar to

# curl -k -i --negotiate -u : "https://c6401.ambari.apache.org:8443/gateway/default/webhdfs/v1/tmp?op=LISTSTATUS"
HTTP/1.1 401 Authentication required
Date: Fri, 24 Feb 2017 14:19:25 GMT
WWW-Authenticate: Negotiate
Set-Cookie: hadoop.auth=; Path=gateway/default; Domain=ambari.apache.org; Secure; HttpOnly
Content-Type: text/html; charset=ISO-8859-1
Cache-Control: must-revalidate,no-cache,no-store
Content-Length: 320
Server: Jetty(9.2.15.v20160210)

HTTP/1.1 200 OK
Date: Fri, 24 Feb 2017 14:19:25 GMT
Set-Cookie: hadoop.auth="u=guest&p=guest/c6401.ambari.apache.org@EXAMPLE.COM&t=kerberos&e=1487947765114&s=fNpq9FYy2DA19Rah7586rgsAieI="; Path=gateway/default; Domain=ambari.apache.org; Secure; HttpOnly
Cache-Control: no-cache
Expires: Fri, 24 Feb 2017 14:19:25 GMT
Date: Fri, 24 Feb 2017 14:19:25 GMT
Pragma: no-cache
Expires: Fri, 24 Feb 2017 14:19:25 GMT
Date: Fri, 24 Feb 2017 14:19:25 GMT
Pragma: no-cache
Content-Type: application/json; charset=UTF-8
X-FRAME-OPTIONS: SAMEORIGIN
Server: Jetty(6.1.26.hwx)
Content-Length: 276

{"FileStatuses":{"FileStatus":[{"accessTime":0,"blockSize":0,"childrenNum":1,"fileId":16398,"group":"hdfs","length":0,"modificationTime":1487855904191,"owner":"hdfs","pathSuffix":"entity-file-history","permission":"755","replication":0,"storagePolicy":0,"type":"DIRECTORY"}]}}

 

 

 

 

(This article is a work in progress.)

Apache Knox has always had LDAP-based authentication through the Apache Shiro authentication provider, which makes the configuration a bit easier and more flexible. However, there are a number of limitations with the KnoxLdapRealm (KNOX-536); for instance, only a single Organizational Unit (OU) is currently supported. Group lookup will not return the groups that are defined within the tree structure below that single OU. Also, group memberships that are indirectly defined through membership in a group that is itself a member of another group are not resolved. Apache Knox 0.10.0 introduced the ability to leverage the Linux PAM authentication mechanism, and KNOX-537 added a KnoxPAMRealm to the Shiro provider for PAM support. This blog post discusses how to set up LDAP using the new PAM support provided by Knox with the Linux SSSD daemon, and some of the advantages and key features of SSSD.

Some of the advantages of using this are:

  • Support for nested OUs and nested groups

  • Faster lookups

  • Support more complex LDAP queries

  • Reduce load on the LDAP/AD server (caching by SSSD)

Scenarios

Following are the scenarios that were tested:

  • Nested groups
  • Nested OUs
  • Using Multiple Search Bases

Nested Groups

The following diagram represents the nested group structure used for testing:

In the above diagram we have OU=data which has multiple nested groups (2 levels) and we have a user 'jerry' who belongs to the final group datascience-b explicitly, but implicitly belongs to all the other groups that nest it (i.e. datascience-a and datascience)

When SSSD is properly configured (as explained later in the post) we get the following results

# id -a jerry
uid=4001(jerry) gid=4000(engineer) groups=4000(engineer),5000(datascientist),6000(datascientist-a),7000(datascientist-b)

When we try to access a resource secured by Knox using the user jerry we can see all the groups that user jerry belongs to are logged in gateway-audit.log (part of Knox logging)

Groups: [datascientist-a, datascientist-b, engineer, datascientist]

Nested OUs

Following diagram shows the nested OU structure used for testing

 

In this example we can see that the user kim is part of group 'processors' which is part of OU processing which is part of OU data which in turn is part of OU groups.

Following is the output of the 'id' command; here we can see that our user kim and the group that the user belongs to are retrieved correctly.

# id -a kim
uid=8001(kim) gid=8000(processors) groups=8000(processors)

Similarly, when we try to access a resource secured by Knox using the user kim we get the following entry in gateway-audit.log (part of Knox logging)

Groups: [processors]

This demonstrates that Knox can authenticate and retrieve groups against nested OUs.

Using Multiple Search Bases

Following diagram shows nested parallel OUs (processing and processing-2)

 

In this test we will configure two different search bases 

  • ou=processing,ou=data,ou=groups,dc=hadoop,dc=apache,dc=org
  • ou=processing-2,ou=data,ou=groups,dc=hadoop,dc=apache,dc=org

sssd.conf settings (relevant) for this test are as follows:

[sssd]
....
domains = default, processing2
....

[domain/default]
....
ldap_search_base = ou=processing,ou=data,ou=groups,dc=hadoop,dc=apache,dc=org
....

[domain/processing2]
....
ldap_search_base = ou=processing-2,ou=data,ou=groups,dc=hadoop,dc=apache,dc=org
....

To check whether SSSD correctly picks up our users we use the id command

# id kim
uid=8001(kim) gid=8000(processors) groups=8000(processors)

# id jon
uid=9001(jon) gid=9000(processors-2) groups=9000(processors-2)

Similarly, when we try to access a resource secured by Knox using the user kim and jon we get the following entry in gateway-audit.log (part of Knox logging)

for kim
success|Groups: [processors]

for jon
success|Groups: [processors-2]

Also, if you take the 'processing2' domain out of the sssd.conf file and restart sssd, the user 'jon' will not be found but 'kim' can still be found:

# id jon
id: 'jon': no such user
# id kim
uid=8001(kim) gid=8000(processors) groups=8000(processors)

Thanks to Eric Yang for pointing out this scenario.

Setup Overview

The following diagram shows a high-level set-up of the components involved.

Following are the component versions for this test

  • OpenLDAP - 2.4.40
  • SSSD - 1.14.1
  • Apache Knox - 0.10.0

LDAP

In order to support nesting of groups, LDAP needs to support the RFC 2307bis schema. For SSSD to talk to LDAP, the connection has to be secure. Acquire a copy of the public CA certificate for the certificate authority used to sign the LDAP server certificate; you can test the certificate using the following openssl command:

openssl s_client -connect <ldap_host>:<ldap_port> -showcerts -state -CAfile <path_to_ca_directory>/cacert.pem

SSSD

SSSD is stricter than pam_ldap. In order to perform an authentication, SSSD requires that the communication channel be encrypted. This means that if sssd.conf has ldap_uri = ldap://<server>, it will attempt to encrypt the communication channel with TLS (transport layer security). If sssd.conf has ldap_uri = ldaps://<server>, then SSL will be used instead of TLS. This requires that the LDAP server

  1. Supports TLS or SSL
  2. Has TLS access enabled on the standard LDAP port (389) (or an alternate port, if specified in the ldap_uri), or has SSL access enabled on the standard LDAPS port (636) (or an alternate port).
  3. Has a valid certificate trust (can be relaxed by using ldap_tls_reqcert = never,  but it is a security risk and should ONLY be done for development and demos)

Copy the public CA certs needed to talk to LDAP to /etc/openldap/certs.

To configure sssd you can use the following 'authconfig' command

authconfig --enablesssd --enablesssdauth --enablelocauthorize --enableldap --enableldapauth --ldapserver=<ldap_host> --enableldaptls --ldapbasedn=dc=my-company,dc=my-org --enableshadow --enablerfc2307bis --enablemkhomedir --enablecachecreds --update

After the command executes you can see that sssd.conf file has been updated.

An example of sssd.conf file

[sssd]
config_file_version = 2
reconnection_retries = 3
sbus_timeout = 30
services = nss, pam, autofs
domains = default

[nss]
reconnection_retries = 3
homedir_substring = /home

[pam]
reconnection_retries = 3

[domain/default]
access_provider = ldap
autofs_provider = ldap
chpass_provider = ldap
cache_credentials = True
ldap_schema = rfc2307bis

id_provider = ldap
auth_provider = ldap
ldap_uri = ldap://<ldap_host>/

ldap_tls_cacertdir = /etc/openldap/certs
ldap_id_use_start_tls = True

# default bind dn
ldap_default_bind_dn = cn=admin,dc=apache,dc=org
ldap_default_authtok_type = password
ldap_default_authtok = my_password
ldap_search_base = dc=apache,dc=org

# For group lookup
ldap_group_member = member

# Enable nesting 
ldap_group_nesting_level = 5

[sudo]

[autofs]

[ssh]

[pac]

[ifp]

The important settings to note are:

  • ldap_schema = rfc2307bis - Needed if all groups are to be returned when using nested groups or primary/secondary groups.
  • ldap_tls_cacertdir = /etc/openldap/certs - certs to talk to LDAP server
  • ldap_id_use_start_tls = True - Secure communication with LDAP
  • ldap_group_nesting_level = 5 - Enable group nesting up-to 5 levels

NOTE: You might need to add or change some options in the sssd.conf file to suit your needs, like the debug level etc. After updating, just restart the service and the changes should be reflected.
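
For example, on a systemd-based host the restart might look like this (the sss_cache step is optional, but avoids stale cached entries masking your changes):

    systemctl restart sssd
    # optionally invalidate all cached entries so new LDAP data is picked up immediately
    sss_cache -E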

Some additional settings that can be used to control caching of credentials by SSSD are

   
  • cache_credentials (Boolean) - Optional. Specifies whether to store user credentials in the local SSSD domain database cache. The default value for this parameter is false. Set this value to true for domains other than the LOCAL domain to enable offline authentication.
  • entry_cache_timeout (integer) - Optional. Specifies how long, in seconds, SSSD should cache positive cache hits. A positive cache hit is a successful query.

Test SSSD configuration

To check whether SSSD is configured correctly you can use the standard 'getent' or 'id' commands

$ getent passwd <ldap_user>
$ id -a <ldap_user>

Using the above commands you should be able to see all the groups that <ldap_user> belongs to. If you do not see the secondary groups check the 'ldap_group_nesting_level = 5' option and adjust it accordingly.

Knox

Setting up Knox is relatively easy: install Knox on the same machine as SSSD and update the topology to use PAM-based authentication.

			<param>
                <name>main.pamRealm</name> 
                <value>org.apache.hadoop.gateway.shirorealm.KnoxPamRealm</value>
            </param>
            <param>
                <name>main.pamRealm.service</name> 
                <value>login</value>
            </param>

For more information and explanation on setting up Knox, see the PAM Based Authentication section in the Knox user guide.
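
The 'login' value above refers to a PAM service definition under /etc/pam.d/. If you prefer a dedicated PAM service for Knox, a minimal sketch could look like the following; the service name 'knox' and the use of pam_sss are assumptions, so adjust for your distribution:

    # create /etc/pam.d/knox so that Knox authenticates through SSSD
    cat > /etc/pam.d/knox <<'EOF'
    auth    required   pam_sss.so
    account required   pam_sss.so
    EOF
    # then set main.pamRealm.service to 'knox' in the topology instead of 'login'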

Caveats

  • For nested group membership SSSD and LDAP should use rfc2307bis schema

  • SSSD requires SSL/TLS to talk to LDAP

Troubleshooting

 

 

 

Apache KNOX provides a single gateway to many services in your Hadoop cluster. You can leverage the KNOX shell DSL interface to interact with services such as WebHDFS, WebHCat (Templeton), Oozie, HBase, etc. For example, using Groovy and the DSL you can submit Hive queries via WebHCat (Templeton) as simply as:

println "[Hive.groovy] Copy Hive query file to HDFS"
Hdfs.put(session).text( hive_query ).to( jobDir + "/input/query.hive" ).now()

jobId = Job.submitHive(session) \
            .file("${jobDir}/input/query.hive") \
            .arg("-v").arg("--hiveconf").arg("TABLE_NAME=${tmpTableName}") \
            .statusDir("${jobDir}/output") \
            .now().jobId

submitSqoop Job API

With Apache KNOX 0.10.0, you can now write applications using the KNOX DSL for Apache SQOOP and easily submit SQOOP jobs. The WebHCat Job class in the DSL now supports submitSqoop() as follows:

Job.submitSqoop(session)
    .command("import --connect jdbc:mysql://hostname:3306/dbname ... ")
    .statusDir(remoteStatusDir)
    .now().jobId

submitSqoop Request takes the following arguments:

  • command (String) - The sqoop command string to execute.
  • files (String) - Comma separated files to be copied to the templeton controller job.
  • optionsfile (String) - The remote file which contains the Sqoop command to run.
  • libdir (String) - The remote directory containing the JDBC jar to include with the Sqoop lib.
  • statusDir (String) - The remote directory to store status output.

The request returns the jobId in the response.
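
If you want to inspect a submitted job outside the DSL, the same WebHCat (Templeton) job resource can also be queried through Knox with curl. A sketch where the topology name, credentials and job id are placeholders:

    curl -k -u guest:guest-password \
      "https://localhost:8443/gateway/default/templeton/v1/jobs/<jobId>"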

Simple example

In this example we will run a simple Sqoop job to extract the scBlastTab table to HDFS from the public genome database (MySQL) at UCSC.

First, import the following packages:

import com.jayway.jsonpath.JsonPath
import groovy.json.JsonSlurper
import org.apache.hadoop.gateway.shell.Hadoop
import org.apache.hadoop.gateway.shell.hdfs.Hdfs
import org.apache.hadoop.gateway.shell.job.Job
import static java.util.concurrent.TimeUnit.SECONDS

Next, establish connection to KNOX gateway with Hadoop.login:

// Get gatewayUrl and credentials from environment
def env = System.getenv()
gatewayUrl = env.gateway
username = env.username
password = env.password

jobDir = "/user/" + username + "/sqoop"

session = Hadoop.login( gatewayUrl, username, password )
 
println "[Sqoop.groovy] Delete " + jobDir + ": " + Hdfs.rm( session ).file( jobDir ).recursive().now().statusCode
println "[Sqoop.groovy] Mkdir " + jobDir + ": " + Hdfs.mkdir( session ).dir( jobDir ).now().statusCode

Define your SQOOP job (assuming SQOOP is already configured with the MySQL driver):

// Database connection information

db = [ driver:"com.mysql.jdbc.Driver", url:"jdbc:mysql://genome-mysql.cse.ucsc.edu/hg38", user:"genome", password:"", name:"hg38", table:"scBlastTab", split:"query" ]

targetdir = jobDir + "/" + db.table

sqoop_command = "import --driver ${db.driver} --connect ${db.url} --username ${db.user} --password ${db.password} --table ${db.table} --split-by ${db.split} --target-dir ${targetdir}"

You can now submit the sqoop_command to the cluster with submitSqoop:

jobId = Job.submitSqoop(session) \
            .command(sqoop_command) \
            .statusDir("${jobDir}/output") \
            .now().jobId

println "[Sqoop.groovy] Submitted job: " + jobId

You can then check job status and output as usual:

println "[Sqoop.groovy] Polling up to 60s for job completion..."

done = false
count = 0
while( !done && count++ < 60 ) {
  sleep( 1000 )
  json = Job.queryStatus(session).jobId(jobId).now().string
  done = JsonPath.read( json, "\$.status.jobComplete" )
  print "."; System.out.flush();
}
println ""
println "[Sqoop.groovy] Job status: " + done

// Check output directory
text = Hdfs.ls( session ).dir( jobDir + "/output" ).now().string
json = (new JsonSlurper()).parseText( text )
println json.FileStatuses.FileStatus.pathSuffix

println "\n[Sqoop.groovy] Content of stderr:"
println Hdfs.get( session ).from( jobDir + "/output/stderr" ).now().string

// Check table files
text = Hdfs.ls( session ).dir( jobDir + "/" + db.table ).now().string
json = (new JsonSlurper()).parseText( text )
println json.FileStatuses.FileStatus.pathSuffix

session.shutdown()

 

Here is sample output of the above example run against a Hadoop cluster. You need a properly configured Hadoop cluster with the Apache Knox gateway, Apache Sqoop and WebHCat (Templeton). This test was run against a BigInsights Hadoop cluster.

:compileJava UP-TO-DATE
:compileGroovy
:processResources UP-TO-DATE
:classes
:Sqoop

[Sqoop.groovy] Delete /user/biadmin/sqoop: 200
[Sqoop.groovy] Mkdir /user/biadmin/sqoop: 200
[Sqoop.groovy] Submitted job: job_1476266127941_0692
[Sqoop.groovy] Polling up to 60s for job completion...
............................................
[Sqoop.groovy] Job status: true
[exit, stderr, stdout]

[Sqoop.groovy] Content of stderr:
log4j:WARN custom level class [Relative to Yarn Log Dir Prefix] not found.
16/11/03 16:53:05 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6_IBM_27
16/11/03 16:53:06 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
16/11/03 16:53:06 WARN sqoop.ConnFactory: Parameter --driver is set to an explicit driver however appropriate connection manager is not being set (via --connection-manager). Sqoop is going to fall back to org.apache.sqoop.manager.GenericJdbcManager. Please specify explicitly which connection manager should be used next time.
16/11/03 16:53:06 INFO manager.SqlManager: Using default fetchSize of 1000
16/11/03 16:53:06 INFO tool.CodeGenTool: Beginning code generation
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/iop/4.2.0.0/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/iop/4.2.0.0/zookeeper/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
16/11/03 16:53:07 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM scBlastTab AS t WHERE 1=0
16/11/03 16:53:07 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM scBlastTab AS t WHERE 1=0
16/11/03 16:53:08 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/iop/4.2.0.0/hadoop-mapreduce
Note: /tmp/sqoop-biadmin/compile/4432005ab10742f26cc82d5438497cae/scBlastTab.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
16/11/03 16:53:09 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-biadmin/compile/4432005ab10742f26cc82d5438497cae/scBlastTab.jar
16/11/03 16:53:09 INFO mapreduce.ImportJobBase: Beginning import of scBlastTab
16/11/03 16:53:09 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
16/11/03 16:53:09 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM scBlastTab AS t WHERE 1=0
16/11/03 16:53:10 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
16/11/03 16:53:10 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
16/11/03 16:53:15 INFO db.DBInputFormat: Using read commited transaction isolation
16/11/03 16:53:15 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(query), MAX(query) FROM scBlastTab
16/11/03 16:53:16 WARN db.TextSplitter: Generating splits for a textual index column.
16/11/03 16:53:16 WARN db.TextSplitter: If your database sorts in a case-insensitive order, this may result in a partial import or duplicate records.
16/11/03 16:53:16 WARN db.TextSplitter: You are strongly encouraged to choose an integral split column.
16/11/03 16:53:16 INFO mapreduce.JobSubmitter: number of splits:5
16/11/03 16:53:16 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1476266127941_0693
16/11/03 16:53:16 INFO mapreduce.JobSubmitter: Kind: mapreduce.job, Service: job_1476266127941_0692, Ident: (org.apache.hadoop.mapreduce.security.token.JobTokenIdentifier@6fbb4061)
16/11/03 16:53:16 INFO mapreduce.JobSubmitter: Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:ehaascluster, Ident: (HDFS_DELEGATION_TOKEN token 4660 for biadmin)
16/11/03 16:53:16 INFO mapreduce.JobSubmitter: Kind: RM_DELEGATION_TOKEN, Service: 172.16.222.2:8032,172.16.222.3:8032, Ident: (owner=biadmin, renewer=mr token, realUser=HTTP/bicloud-fyre-physical-17-master-3.fyre.ibm.com@IBM.COM, issueDate=1478191971063, maxDate=1478796771063, sequenceNumber=67, masterKeyId=66)
16/11/03 16:53:16 WARN token.Token: Cannot find class for token kind kms-dt
16/11/03 16:53:16 WARN token.Token: Cannot find class for token kind kms-dt
Kind: kms-dt, Service: 172.16.222.1:16000, Ident: 00 07 62 69 61 64 6d 69 6e 04 79 61 72 6e 05 68 62 61 73 65 8a 01 58 2b 1b 7b 34 8a 01 58 4f 27 ff 34 8e 03 a4 09
16/11/03 16:53:16 INFO mapreduce.JobSubmitter: Kind: MR_DELEGATION_TOKEN, Service: 172.16.222.3:10020, Ident: (owner=biadmin, renewer=yarn, realUser=HTTP/bicloud-fyre-physical-17-master-3.fyre.ibm.com@IBM.COM, issueDate=1478191972979, maxDate=1478796772979, sequenceNumber=52, masterKeyId=49)
16/11/03 16:53:17 INFO impl.YarnClientImpl: Submitted application application_1476266127941_0693
16/11/03 16:53:17 INFO mapreduce.Job: The url to track the job: http://bicloud-fyre-physical-17-master-2.fyre.ibm.com:8088/proxy/application_1476266127941_0693/
16/11/03 16:53:17 INFO mapreduce.Job: Running job: job_1476266127941_0693
16/11/03 16:53:24 INFO mapreduce.Job: Job job_1476266127941_0693 running in uber mode : false
16/11/03 16:53:24 INFO mapreduce.Job:  map 0% reduce 0%
16/11/03 16:53:32 INFO mapreduce.Job:  map 20% reduce 0%
16/11/03 16:53:33 INFO mapreduce.Job:  map 100% reduce 0%
16/11/03 16:53:34 INFO mapreduce.Job: Job job_1476266127941_0693 completed successfully
16/11/03 16:53:34 INFO mapreduce.Job: Counters: 30
	File System Counters
		FILE: Number of bytes read=0
		FILE: Number of bytes written=799000
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=644
		HDFS: Number of bytes written=148247
		HDFS: Number of read operations=20
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=10
	Job Counters 
		Launched map tasks=5
		Other local map tasks=5
		Total time spent by all maps in occupied slots (ms)=62016
		Total time spent by all reduces in occupied slots (ms)=0
		Total time spent by all map tasks (ms)=31008
		Total vcore-milliseconds taken by all map tasks=31008
		Total megabyte-milliseconds taken by all map tasks=190513152
	Map-Reduce Framework
		Map input records=2379
		Map output records=2379
		Input split bytes=644
		Spilled Records=0
		Failed Shuffles=0
		Merged Map outputs=0
		GC time elapsed (ms)=249
		CPU time spent (ms)=6590
		Physical memory (bytes) snapshot=1758576640
		Virtual memory (bytes) snapshot=35233165312
		Total committed heap usage (bytes)=2638741504
	File Input Format Counters 
		Bytes Read=0
	File Output Format Counters 
		Bytes Written=148247
16/11/03 16:53:34 INFO mapreduce.ImportJobBase: Transferred 144.7725 KB in 23.9493 seconds (6.0449 KB/sec)
16/11/03 16:53:34 INFO mapreduce.ImportJobBase: Retrieved 2379 records.

[_SUCCESS, part-m-00000, part-m-00001, part-m-00002, part-m-00003, part-m-00004]

BUILD SUCCESSFUL
Total time: 1 mins 2.202 secs

From the output above you can see the job output as well as the content of the table directory on HDFS, which contains 5 parts (5 map tasks were used). The WebHCat (Templeton) job console output goes to stderr in this case.

As part of compiling/running your code ensure you have the following dependency: org.apache.knox:gateway-shell:0.10.0.

 

 

By: Larry McCay

 Pseudo Federation Provider

This article will walk through the process of adding a new provider for establishing the identity of a user. The simple example of the Pseudo authentication mechanism in Hadoop will be used to communicate the general ideas for extending the preauthenticated federation provider that is available out of the box in Apache Knox. This is not a provider that should be used in a production environment and has at least one major limitation. It will however illustrate the general programming model for adding preauthenticated federation providers.

 Provider Types

Apache Knox has two types of providers for establishing the identity of the source of an incoming REST request. One is an Authentication Provider and the other is a Federation Provider.

 Authentication Providers

Authentication providers are responsible for actually collecting credentials of some sort from the end user. Examples of authentication providers would be things like HTTP BASIC authentication with username and password that gets authenticated against LDAP or RDBMS, etc. Apache Knox ships with HTTP BASIC authentication against LDAP using Apache Shiro. The Shiro provider can actually be configured in multiple ways.

Authentication providers are sometimes less than ideal since many organizations only want their users to provide credentials to the enterprise trusted/preferred solutions and to use some sort of SSO or federation of that authentication event across all other applications.

 Federation Providers

Federation providers, on the other hand, never see the users' actual credentials but instead federate a previous authentication event through the processing and validation of some sort of token. This allows for greater isolation and protection of user credentials while still providing a means to verify the trustworthiness of the incoming identity assertions. Examples of federation providers would be things like OAuth 2, SAML Assertions, JWT/SWT tokens, Header based identity propagation, etc. Out of the box, Apache Knox enables the use of custom headers for propagating things like the user principal and group membership through the HeaderPreAuth federation provider.

This is generally useful for solutions such as CA SiteMinder and IBM Tivoli Access Manager. In these sorts of deployments, all traffic to Hadoop would have to go through the solution provider's gateway, which authenticates the user and can inject identity propagation headers into the request. The fact that the network security does not allow requests to bypass the solution gateway provides a level of trust for accepting the header based identity assertions. We also provide for additional validation through a pluggable mechanism and have an IP address validation that can be used out of the box.

 Let's add a Federation Provider

This article will discuss what is involved in adding a new federation provider that will actually extend the abstract bases that were introduced in the PreAuth provider module. It will be a very minimal provider that accepts a request parameter from the incoming request as the user's principal.

 The module and dependencies

The Apache Knox project uses Apache Maven for build and dependency management. We will need to create a new module for the Pseudo federation provider and include our own pom.xml.

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">

<modelVersion>4.0.0</modelVersion>

<groupId>net.minder</groupId>
<artifactId>gateway-provider-security-pseudo</artifactId>
<version>0.0.1</version>

<repositories>
    <repository>
        <id>apache.releases</id>
        <url>https://repository.apache.org/content/repositories/releases/</url>
    </repository>
</repositories>

<dependencyManagement>
    <dependencies>
        <dependency>
            <groupId>org.apache.knox</groupId>
            <artifactId>gateway-spi</artifactId>
            <version>0.6.0</version>
        </dependency>
    </dependencies>
</dependencyManagement>

<dependencies>
    <dependency>
        <groupId>org.apache.knox</groupId>
        <artifactId>gateway-spi</artifactId>
        <version>0.6.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.knox</groupId>
        <artifactId>gateway-util-common</artifactId>
        <version>0.6.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.knox</groupId>
        <artifactId>gateway-provider-security-preauth</artifactId>
        <version>0.6.0</version>
    </dependency>
    <dependency>
        <groupId>org.eclipse.jetty.orbit</groupId>
        <artifactId>javax.servlet</artifactId>
        <version>3.0.0.v201112011016</version>
    </dependency>

    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.11</version>
        <scope>test</scope>
    </dependency>
    <dependency>
        <groupId>org.easymock</groupId>
        <artifactId>easymock</artifactId>
        <version>3.0</version>
        <scope>test</scope>
    </dependency>

    <dependency>
        <groupId>org.apache.knox</groupId>
        <artifactId>gateway-test-utils</artifactId>
        <scope>test</scope>
        <version>0.6.0</version>
    </dependency>
</dependencies>

</project>

 

 Dependencies

NOTE: the "version" element must match the version indicated in the pom.xml of the Knox project. Otherwise, building will fail.

 gateway-provider-security-preauth

This particular federation provider is going to extend the existing PreAuth module with the capability to accept the user.name request parameter as an assertion of the identity by a trusted party. Therefore, we will depend on the preauth module in order to leverage the facilities available in its base classes for things like IP address validation, etc.

 gateway-spi

The gateway-spi dependency above pulls in the general interfaces, base classes and utilities that are expected for extending the Apache Knox gateway. The core GatewayServices are available through the gateway-spi module as well as a number of other foundational elements of gateway development.

 gateway-util-common

This gateway-util-common module, as the name suggests, provides common utility facilities for developing the gateway product. This is where you find the auditing, JSON and URL utility classes for gateway development.

 javax.servlet from org.eclipse.jetty.orbit

This module provides the servlet filter specific classes that are needed for the provider filter implementation.

 junit, easymock and gateway-test-utils

JUnit, EasyMock and gateway-test-utils provide the basis for writing REST based unit tests for the Apache Knox Gateway project and can be found in all of the existing unit tests for the various modules that make up the gateway offering.
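To give a feel for how these test dependencies come together, below is a minimal sketch of a unit test that uses JUnit and EasyMock to exercise the getPrimaryPrincipal method of the PseudoAuthFederationFilter shown later in this article. The test class name is an invention for illustration, and it assumes the filter's default constructor and the package placement shown below.

package org.apache.hadoop.gateway.preauth.filter;

import static org.junit.Assert.assertEquals;

import javax.servlet.http.HttpServletRequest;

import org.easymock.EasyMock;
import org.junit.Test;

public class PseudoAuthFederationFilterTest {

  @Test
  public void testPrincipalComesFromUserNameParameter() {
    // Mock a request that carries the user.name query parameter.
    HttpServletRequest request = EasyMock.createNiceMock(HttpServletRequest.class);
    EasyMock.expect(request.getParameter("user.name")).andReturn("guest").anyTimes();
    EasyMock.replay(request);

    // The filter should report "guest" as the primary principal.
    PseudoAuthFederationFilter filter = new PseudoAuthFederationFilter();
    assertEquals("guest", filter.getPrimaryPrincipal(request));
  }
}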

 Apache Knox Topologies

In Apache Knox, individual Hadoop clusters are represented by descriptors called topologies that result in the deployment of specific endpoints that expose and protect access to the services of the associated Hadoop cluster. The topology descriptor describes the available services and their respective URLs within the actual Hadoop cluster as well as the policy for protecting access to those services. The policy is defined through the description of various Providers. Each provider and service within a Knox topology has a role, and provider roles consist of: authentication, federation, authorization, identity assertion, etc. In this article we are concerned with a provider of type federation.

Since the Pseudo provider is assuming that authentication has happened at the OS level or from within another piece of middleware and that credentials were exchanged with some party other than Knox, we will be making this a federation provider. The typical provider configuration will look something like this:

<provider>
  <role>federation</role>
  <name>Pseudo</name>
  <enabled>true</enabled>
</provider>

Ultimately, an Apache Knox topology manifests as a web application deployed within the gateway process that exposes and protects the URLs associated with the services of the underlying Hadoop components in each cluster. Providers generally interject a ServletFilter into the processing path of the REST API requests that enter the gateway and are dispatched to the Hadoop cluster. The mechanism used to interject the filters, their related configuration and integration into the gateway is the ProviderDeploymentContributor.

 ProviderDeploymentContributor

/**
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements.  See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership.  The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License.  You may obtain a copy of the License at
*
*     http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.hadoop.gateway.preauth.deploy;

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;

import org.apache.hadoop.gateway.deploy.DeploymentContext;
import org.apache.hadoop.gateway.deploy.ProviderDeploymentContributorBase;
import org.apache.hadoop.gateway.descriptor.FilterParamDescriptor;
import org.apache.hadoop.gateway.descriptor.ResourceDescriptor;
import org.apache.hadoop.gateway.topology.Provider;
import org.apache.hadoop.gateway.topology.Service;

public class PseudoAuthContributor extends
    ProviderDeploymentContributorBase {
  private static final String ROLE = "federation";
  private static final String NAME = "Pseudo";
  private static final String PREAUTH_FILTER_CLASSNAME = "org.apache.hadoop.gateway.preauth.filter.PseudoAuthFederationFilter";

  @Override
  public String getRole() {
    return ROLE;
  }

  @Override
  public String getName() {
    return NAME;
  }

  @Override
  public void contributeFilter(DeploymentContext context, Provider provider, Service service, 
      ResourceDescriptor resource, List<FilterParamDescriptor> params) {
    // blindly add all the provider params as filter init params
    if (params == null) {
      params = new ArrayList<FilterParamDescriptor>();
    }
    Map<String, String> providerParams = provider.getParams();
    for(Entry<String, String> entry : providerParams.entrySet()) {
      params.add( resource.createFilterParam().name( entry.getKey().toLowerCase() ).value( entry.getValue() ) );
    }
    resource.addFilter().name( getName() ).role( getRole() ).impl( PREAUTH_FILTER_CLASSNAME ).params( params );
  }
}

The way in which the required DeploymentContributors for a given topology are located is based on the use of the role and the name of the provider as indicated within the topology descriptor. The topology deployment machinery within Knox first looks up the required DeploymentContributor by role. In this case, it identifies the identity provider as being a type of federation. It then looks for the federation provider with the name of Pseudo.
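Conceptually, this discovery works like the standard Java ServiceLoader pattern: contributors register themselves (see the META-INF/services file later in this article) and are then matched by role and name. The snippet below is only a rough sketch of that matching logic for illustration; it is not the actual Knox implementation, and it assumes getRole() and getName() are exposed by the contributor interface.

import java.util.ServiceLoader;

import org.apache.hadoop.gateway.deploy.ProviderDeploymentContributor;

// Conceptual sketch only: find a registered contributor by role and name.
public class ContributorLookupSketch {
  static ProviderDeploymentContributor find(String role, String name) {
    for (ProviderDeploymentContributor contributor :
        ServiceLoader.load(ProviderDeploymentContributor.class)) {
      if (contributor.getRole().equals(role) && contributor.getName().equals(name)) {
        return contributor;
      }
    }
    return null; // no matching contributor has been registered
  }
}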

Once the providers have been resolved into the required set of DeploymentContributors each contributor is given the opportunity to contribute to the construction of the topology web application that exposes and protects the service APIs within the Hadoop cluster.

This particular DeploymentContributor needs to add the PseudoAuthFederationFilter servlet Filter implementation to the topology specific filter chain. In addition to adding the filter to the chain, this provider will also add each of the provider params from the topology descriptor as filterConfig parameters. This enables the configuration of the resulting servlet filters from within the topology descriptor while encapsulating the specific implementation details of the provider from the end user.

 PseudoAuthFederationFilter

/**
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package org.apache.hadoop.gateway.preauth.filter;

import java.security.Principal;
import java.util.Set;

import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServletRequest;

public class PseudoAuthFederationFilter 
  extends AbstractPreAuthFederationFilter {

  @Override
  public void init(FilterConfig filterConfig) throws ServletException {
    super.init(filterConfig);
  }

  /**
   * @param httpRequest
   */
  @Override
  protected String getPrimaryPrincipal(HttpServletRequest httpRequest) {
    return httpRequest.getParameter("user.name");
  }

  /**
   * @param principals
   */
  @Override
  protected void addGroupPrincipals(HttpServletRequest request, 
      Set<Principal> principals) {
    // pseudo auth currently has no assertion of group membership
  }
}

The PseudoAuthFederationFilter above extends AbstractPreAuthFederationFilter. This particular base class takes care of a number of boilerplate type aspects of preauthenticated providers that would otherwise have to be done redundantly across providers. The two abstract methods that are specific to each provider are getPrimaryPrincipal and addGroupPrincipals. These methods are called by the base class in order to determine what principals should be created and added to the java Subject that will become the effective user identity for the request processing of the incoming request.

 getPrimaryPrincipal

Implementing the abstract method getPrimaryPrincipal allows the new provider to extract the established identity from the incoming request in whatever way is appropriate for the given provider and communicate it back to the AbstractPreAuthFederationFilter, which will in turn add it to the java Subject being created to represent the user's identity. For this particular provider, all we have to do is return the request parameter by the name of "user.name".

 addGroupPrincipals

Given a set of Principals, addGroupPrincipals is an opportunity to add group principals to the resulting java Subject that will be used to represent the user's identity. This is done by adding new org.apache.hadoop.gateway.security.GroupPrincipal instances to the set. For the Pseudo authentication mechanism in Hadoop, there really is no way to communicate group membership through the request parameters. One could easily envision adding an additional request parameter for this though - something like "user.groups", as sketched below.
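To make that idea concrete, here is a purely hypothetical sketch of what such an extension might look like. The user.groups parameter and its comma separated format are inventions for illustration and are not part of the provider built in this article; the sketch also assumes the GroupPrincipal class from the gateway security package.

package org.apache.hadoop.gateway.preauth.filter;

import java.security.Principal;
import java.util.Set;

import javax.servlet.http.HttpServletRequest;

import org.apache.hadoop.gateway.security.GroupPrincipal;

// Hypothetical variant that also accepts group membership via a request parameter.
public class PseudoAuthWithGroupsFederationFilter
    extends PseudoAuthFederationFilter {

  @Override
  protected void addGroupPrincipals(HttpServletRequest request,
      Set<Principal> principals) {
    // "user.groups" is an invented parameter, e.g. user.groups=analyst,scientist
    String groups = request.getParameter("user.groups");
    if (groups != null) {
      for (String group : groups.split(",")) {
        principals.add(new GroupPrincipal(group.trim()));
      }
    }
  }
}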

 Configure as an Available Provider

In order for the deployment machinery to be able to discover the availability of your new provider implementation, you will need to make sure that the org.apache.hadoop.gateway.deploy.ProviderDeploymentContributor file is in the resources/META-INF/services directory and that it contains the classname of the new provider's DeploymentContributor - in this case PseudoAuthContributor.

 resources/META-INF/services/org.apache.hadoop.gateway.deploy.ProviderDeploymentContributor

##########################################################################
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
##########################################################################

org.apache.hadoop.gateway.preauth.deploy.PseudoAuthContributor

 Add to Knox as a Gateway Module

At this point, the module can be built as a standalone module with:

mvn clean install

However, we want to extend the Apache Knox Gateway build to include the new module in its build and release processes. In order to do this we will need to add it to a couple of common pom.xml files.

 Root Level Pom.xml

At the root of the project source tree there is a pom.xml file that defines all of the modules that are official components of the gateway server release. You can find each of these modules in the "modules" element. We need to add our new module declaration there:

<modules>
  ...
  <module>gateway-provider-security-pseudo</module>
  ...
</modules>

Then later in the same file we have to add a fuller definition of our module to the dependencyManagement/dependencies element:

<dependencyManagement>
    <dependencies>
        ...
        <dependency>
            <groupId>${gateway-group}</groupId>
            <artifactId>gateway-provider-security-pseudo</artifactId>
            <version>${gateway-version}</version>
        </dependency>
        ...
    </dependencies>
</dependencyManagement>

 Gateway Release Module Pom.xml

Now, our Pseudo federation provider is building with the gateway project but it isn't quite included in the gateway server release artifacts. In order to get it included in the release archives and available to the runtime, we need to add it as a dependency to the appropriate release module. In this case, we are adding it to the pom.xml file within the gateway-release module:

<dependencies>
    ...
    <dependency>
        <groupId>${gateway-group}</groupId>
        <artifactId>gateway-provider-security-pseudo</artifactId>
    </dependency>
    ...
</dependencies>

Note that this is basically the same definition that was added to the root level pom.xml minus the "version" element.

 Build, Test and Deploy

At this point, we should have an integrated custom component that can be described for use within the Apache Knox topology descriptor file and engaged in the authentication of incoming requests for resources of the protected Hadoop cluster.

 building

You may use the same Maven command as before to build:

mvn clean install

This will build and run the gateway unit tests.

You may also use the following to not only build and run the tests but to also package up the release artifacts. This is a great way to quickly set up a test instance in order to manually test your new Knox functionality.

ant package

 testing

To install the newly packaged release archive in a GATEWAY_HOME environment:

ant install-test-home

This will unzip the release bits into a local ./install directory and do some initial setup tasks to ensure that it is actually runnable.

We can now start a test LDAP server that is seeded with a couple of test users:

ant start-test-ldap

The sample topology files are set up to authenticate against this LDAP server for convenience and can be used as is in order to quickly do a sanity test of the install.

At this point, we can choose to run a test Knox instance or a debug Knox instance. If you want to run a test instance without the ability to connect a debugger then:

ant start-test-gateway

If you would like to connect a debugger and step through the code to debug or ensure that your functionality is running as expected then you need a debug instance:

ant start-debug-gateway

 curl

You may now test the out of the box authentication against LDAP using HTTP BASIC by using curl and one of the simpler APIs exposed by Apache Knox:

curl -ivk --user guest:guest-password "https://localhost:8443/gateway/sandbox/webhdfs/v1/tmp?op=LISTSTATUS"

 Change Topology Descriptor

Once the server is up and running and you are able to authenticate with HTTP BASIC against the test LDAP server, you can now change the topology descriptor to leverage your new federation provider.

Find the sandbox.xml file in the install/conf/topologies directory and edit it to reflect your provider type, name and any provider specific parameters.

<provider>
   <role>federation</role>
   <name>Pseudo</name>
   <enabled>true</enabled>
   <param>
       <name>filter-init-param-name</name>
       <value>value</value>
   </param>
</provider>

Once your federation provider is configured, just save the topology descriptor. Apache Knox will notice that the file has changed and automatically redeploy that particular topology. Any provider params described in the provider element will be added to the PseudoAuthFederationFilter as servlet filter init params and can be used to configure aspects of the filter's behavior.

 curl again

We are now ready to use curl again to test the new federation provider and ensure that it is working as expected:

curl -ivk "https://localhost:8443/gateway/sandbox/webhdfs/v1/tmp?op=LISTSTATUS&user.name=guest"

 More Resources

Apache Knox Developers Guide: http://knox.apache.org/books/knox-0-6-0/dev-guide.html

Apache Knox Users Guide: http://knox.apache.org/books/knox-0-6-0/user-guide.html

Github project for this article: https://github.com/lmccay/gateway-provider-security-pseudo

 Conclusion

This article has illustrated a simplified example of implementing a federation provider for establishing the identity of a previous authentication event and propagating that into the request processing for Hadoop REST APIs inside of Apache Knox.

The process to extend the preauthenticated federation provider is a quick and simple way to extend certain SSO capabilities into providing authenticated access to Hadoop resources through Apache Knox.

The Knox community is a growing community that welcomes contributions from interested users in order to grow the capabilities to include truly useful and impactful features.

NOTE: It is important to understand that the provider illustrated in this example has limitations that preclude it from being used in production. Most notably, it has no means to follow redirects, because the user.name parameter is missing from the URL in the Location header. In order to handle this, we would need to set a cookie to determine the user identity on the redirected request.
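For completeness, the snippet below is a purely hypothetical sketch of how such a cookie could be set from within the filter's request processing; the cookie name and helper class are inventions for illustration and are not part of the provider built in this article.

import javax.servlet.http.Cookie;
import javax.servlet.http.HttpServletResponse;

// Hypothetical helper: remember the asserted principal in a cookie so that a
// redirected request, which loses the user.name query parameter, could still
// be associated with the same identity.
public class PseudoIdentityCookieSketch {
  static void rememberPrincipal(HttpServletResponse response, String principalName) {
    Cookie cookie = new Cookie("pseudo.user.name", principalName); // invented cookie name
    cookie.setSecure(true);   // only send the cookie over HTTPS
    cookie.setHttpOnly(true); // keep it away from client side script
    cookie.setPath("/");
    response.addCookie(cookie);
  }
}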

By: Kevin Minder - Nov 16, 2015

 

This article covers adding a service to Apache Knox.

The idea here is to provide an intentionally simple example, avoiding complexity wherever possible. The goal is to get something working as a starting point upon which more complicated scenarios can build. You may want to review the Apache Knox User’s Guide and Developer’s Guide before reading this.

The API used here is an OpenWeatherMap API that returns the current weather information for a given zip code. The cURL command below accesses this API directly. Give it a try.

If you are new to Knox you may also want to check out ’Setting up Apache Knox in three easy steps’.

curl 'http://api.openweathermap.org/data/2.5/weather?zip=95054,us&appid=2de143494c0b295cca9337e1e96b00e0'

This should return a JSON similar to the output shown below. Your results probably won’t be nicely formatted. Note that I’m not giving anything away here with the appid. This is what they use in all of their examples.

{  
   "coord":{"lon":-121.98,"lat":37.43},
   "weather":[{"id":800,"main":"Clear","description":"Sky is Clear","icon":"01d"}],
   "base":"cmc stations",
   "main":{"temp":282.235,"pressure":998.58,"humidity":51,"temp_min":282.235,"temp_max":282.235,"sea_level":1038.24,"grnd_level":998.58},
   "wind":{"speed":4.92,"deg":347.5},
   "clouds":{"all":0},
   "dt":1447699480,
   "sys":{"message":0.0049,"country":"US","sunrise":1447685347,"sunset":1447721770},
   "id":5323631,
   "name":"Alviso",
   "cod":200
}

This is the cURL command showing how we will expose that service via the gateway. Don’t try this now; it won’t work until later!

curl -ku guest:guest-password 'https://localhost:8443/gateway/sandbox/weather/data/2.5/weather?zip=95054,us&appid=2de143494c0b295cca9337e1e96b00e0'

So the fundamental job of the gateway is to translate the effective request URL it receives to the target URL and then transfer the request and response bodies. In this example we will ignore the request and response bodies and focus on the request URL. Let’s take a look at how these two request URLs are related.


We can start by breaking down the Gateway URL and understanding where each of the URL parts come from.

  • https - The gateway has SSL/TLS enabled: See ssl.enabled in gateway-site.xml
  • localhost - The gateway is listening on 0.0.0.0: See gateway.host in gateway-site.xml
  • 8443 - The gateway is listening on port 8443: See gateway.port in gateway-site.xml
  • gateway - The gateway context path is ‘gateway’: See gateway.path in gateway-site.xml
  • sandbox - The topology file that includes the WEATHER service is named sandbox.xml
  • weather - The unique root of all WEATHER service URLs. Identified in the service’s service.xml
  • data/2.5/weather - This portion of the URL is handled by the service’s rewrite.xml rules


In contrast we really only care about two parts of the service’s Direct URL.

  • http://api.openweathermap.org - The network address of the service itself.
  • data/2.5/weather - The path for the weather API of the service.

 

Now we need to get down to the business of actually making the gateway proxy this service. To do that we will be using the new configuration based extension model introduced in Knox 0.6.0. That will involve adding two new files under the <GATEWAY_HOME>/data/services directory and then modifying a topology file.

Note: The <GATEWAY_HOME> here represents the directory where Apache Knox is installed.

First you need to create a directory to hold your new service definition files. There are two conventions at work here that ultimately (but only loosely) relate to the content of the service.xml it will contain. Below the <GATEWAY_HOME>/data/services directory you will need to create parent and child directories weather/0.0.1. As a convention the names of these directories duplicate the values in the attributes of the root element of the contained service.xml.

Create the two files with the content shown below and place them in the directories indicated.

<GATEWAY_HOME>/data/services/weather/0.0.1/service.xml

<service role="WEATHER" name="weather" version="0.0.1">
  <routes>
    <route path="/weather/**"/>
  </routes>
</service>

<GATEWAY_HOME>/data/services/weather/0.0.1/rewrite.xml

<rules>
  <rule dir="IN" name="WEATHER/weather/inbound" pattern="*://*:*/**/weather/{path=**}?{**}">
    <rewrite template="{$serviceUrl[WEATHER]}/{path=**}?{**}"/>
  </rule>
</rules>

Once that is complete, the topology file must be updated to activate this new service in the runtime. In this case the sandbox.xml topology file is used but you may have another topology file such as default.xml. Edit whichever topology file you prefer and add the <service> markup shown below. If you aren’t using sandbox.xml be careful to replace sandbox with the name of your topology file throughout these examples.

<GATEWAY_HOME>/conf/topologies/sandbox.xml

<topology>
  ...
  <service>
    <role>WEATHER</role>
    <url>http://api.openweathermap.org:80</url>
  </service>
</topology>

With all of these changes made you must restart your Knox gateway server. Oftentimes this isn’t necessary but adding a new service definition under <GATEWAY_HOME>/data/services requires a restart.

You should now be able to execute the curl command from way back at the top that accesses the OpenWeatherMap API via the gateway.

curl -ku guest:guest-password 'https://localhost:8443/gateway/sandbox/weather/data/2.5/weather?zip=95054,us&appid=2de143494c0b295cca9337e1e96b00e0'

Now that the new service definition is working let’s go back and connect all the dots. This should help take some of the mystery out of the configuration above. The most important and confusing aspect is how values in different files are interrelated. I will focus on that.

service.xml

The service.xml file defines the high level URL patterns that will be exposed by the gateway for a service. If you are getting HTTP 404 errors there is probably a problem with this configuration.

<service role="WEATHER"

  • The role/implementation/version triad is used throughout Knox for integration plugins.
  • Think of the role as an interface in Java.
  • This attribute declares what role this service “implements”.
  • This will need to match the topology file’s <topology><service><role> for this service.

<service name="weather"

  • In the role/implementation/version triad this is the implementation.
  • Think of this as a Java implementation class name relative to an interface.
  • As a matter of convention this should match the directory beneath <GATEWAY_HOME>/data/services
  • The topology file can optionally contain <topology><service><name> but usually doesn’t. This would be used to select a specific implementation of a role if there were multiple.

<service version="0.0.1"

  • As a matter of convention this should match the directory beneath the service implementation name.
  • The topology file can optionally contain <topology><service><version> but usually doesn’t. This would be used to select a specific version of an implementation if there were multiple. This can be important if the protocols for a service evolve over time.

<service><routes><route path="/weather/**"

  • This tells the gateway that all requests starting with /weather/ are handled by this service.
  • Due to a limitation this will not include requests to /weather (i.e. no trailing /)
  • The ** means zero or more paths similar to Ant.
  • The scheme, host, port, gateway and topology components are not included (e.g. https://localhost:8443/gateway/sandbox)
  • Routes can, but typically don’t, take query parameters into account.
  • In this simple form there is no direct relationship between the route path and the rewrite rules!

rewrite.xml

The rewrite.xml is configuration that drives the rewrite provider within Knox. It is important to understand that at runtime for a given topology, all of the rewrite.xml files for all active services are combined into a single file. This explains some of the seemingly complex patterns and naming conventions.

<rules><rule dir="IN"

  • Here dir means direction and IN means it should apply to a request.
  • This rule is a global rule meaning that any other service can request that a URL be rewritten as they process URLs. The rewrite provider keeps distinct trees of URL patterns for IN and OUT rules so that services can be specific about which to apply.
  • If it were not global it would not have a direction and probably not a pattern in the element.

<rules><rule name="WEATHER/weather/inbound"

  • Rules can be explicitly invoked in various ways. In order to allow that they are named.
  • The convention is role/name/<service specific hierarchy>.
  • Remember that all rules share a single namespace.

<rules><rule pattern="*://*:*/**/weather/{path=**}?{**}"

  • Defines the URL pattern for which this rule will apply.
  • The * matches exactly one segment of the URL.
  • The ** matches zero or more segments of the URL.
  • The {path=**} matches zero or more path segments and provides access to them as a parameter named 'path'.
  • The {**} matches zero or more query parameters and provides access to them by name.
  • The values from matched {…} segments are “consumed” by the rewrite template below.

<rules><rule><rewrite template="{$serviceUrl[WEATHER]}/{path=**}?{**}"

  • Defines how the URL matched by the rule will be rewritten.
  • The {$serviceUrl[WEATHER]} looks up the <service><url> for the <service><role>WEATHER. This is implemented as a rewrite function and is another custom extension point.
  • The {path=**} extracts zero or more values for the 'path' parameter from the matched URL.
  • The {**} extracts any “unused” parameters and uses them as query parameters.

sandbox.xml

<topology>

...

  <service>
    <role>WEATHER</role>
    <url>http://api.openweathermap.org:80</url>
  </service>

...

</topology>
  • <role> causes the service definition with role WEATHER to be loaded into the runtime.
  • Since <name> and <version> are not present, a default is selected if there are multiple options.
  • <url> populates the data used by {$serviceUrl[WEATHER]} in the rules with the correct target URL.

Hopefully all of this provides a more gentle introduction to adding a service to Apache Knox than might be offered in the Apache Knox Developer’s Guide. If you have more questions, comments or suggestions please join the Apache Knox community. In particular you might be interested in one of the mailing lists.

Created by Kevin Minder, last modified on Dec 08, 2015

This article covers using Apache Knox with ActiveDirectory.

Currently Apache Knox comes set up “out of the box” with a demo LDAP server based on ApacheDS. This was a conscious decision made to simplify the initial user experience with Knox. Unfortunately, it can make the transition to popular enterprise identity stores such as ActiveDirectory confusing. This article is intended to remedy some of that confusion.

If you are new to Knox you may want to check out ’Setting up Apache Knox in three easy steps’.

Part 1

Let’s go back to basics and build up an example from first principles. To do this we will start with the simplest topology file that will work. We will iteratively transform that topology file until it integrates with ActiveDirectory for both authentication and authorization.

Sample 1

The initial topology file we will start with doesn’t integrate with ActiveDirectory at all. Instead it uses a capability of Shiro to embed users directly within its configuration. This approach is largely taken to “shake out” the process of editing topology files for various purposes. At the same time it minimizes external dependencies to help ensure a successful starting point. Now, create this topology file.

<GATEWAY_HOME>/conf/topologies/sample1.xml

<topology>
  <gateway>

    <provider>
      <role>authentication</role>
      <name>ShiroProvider</name>
      <enabled>true</enabled>
      <param name="users.admin" value="admin-secret"/>
      <param name="urls./**" value="authcBasic"/>
    </provider>

  </gateway>
  <service>
    <role>KNOX</role>
  </service>
</topology>

If you are a seasoned Knox veteran, you may notice the alternative <param name=“” value=“”/> style syntax. Both this and <param><name></name><value></value></param> style are supported. I’ve used the attribute style here for compactness.

Once this topology file is created you will be able to access the Knox Admin API, which is what the KNOX service in the topology file provides. The cURL command shown below retrieves the version information from the Knox server. Notice -u admin:admin-secret in the command below matches <param name="users.admin" value="admin-secret"/> in the topology file above.

curl -u admin:admin-secret -ik 'https://localhost:8443/gateway/sample1/api/v1/version'

Below is an example response body output from the command above.
Note: The -i causes the return of the full response including status line and headers which aren’t shown below for brevity.

<?xml version="1.0" encoding="UTF-8"?>
<ServerVersion>
   <version>0.7.0-SNAPSHOT</version>
   <hash>9632b697060bfeffa2e03425451a3e9b3980c45e</hash>
</ServerVersion>

As an aside, if you prefer JSON you can request that using the HTTP Accept header via the cURL -H flag.
Don’t forget to scroll right in these code boxes as some of these commands will start to get long.

curl -u admin:admin-secret -H 'Accept: application/json' -ik 'https://localhost:8443/gateway/sample1/api/v1/version'

Below is an example response JSON body for this command.

{
   "ServerVersion" : {
      "version" : "0.7.0-SNAPSHOT",
      "hash" : "9632b697060bfeffa2e03425451a3e9b3980c45e"
   }
}

Sample 2

With authentication working, now add authorization since the real goal is an example with ActiveDirectory including both. The second sample topology file below adds a second user (guest) and an authorization provider. The <param name="knox.acl" value="admin;*;*"/> dictates that only the admin user can access the knox service. Go ahead and create this topology file. Notice the examples use a different name for each topology file so you can always refer back to the previous ones.

<GATEWAY_HOME>/conf/topologies/sample2.xml

<topology>
  <gateway>

    <provider>
      <role>authentication</role>
      <name>ShiroProvider</name>
      <enabled>true</enabled>
      <param name="users.admin" value="admin-secret"/>
      <param name="users.guest" value="guest-secret"/>
      <param name="urls./**" value="authcBasic"/>
    </provider>

    <provider>
      <role>authorization</role>
      <name>AclsAuthz</name>
      <enabled>true</enabled>
      <param name="knox.acl" value="admin;*;*"/>
    </provider>

  </gateway>
  <service>
    <role>KNOX</role>
  </service>
</topology>

Once this is created, test it with the cURL commands below and see that the admin user can access the API but the guest user can’t.

curl -u admin:admin-secret -ik 'https://localhost:8443/gateway/sample2/api/v1/version'
curl -u guest:guest-secret -ik 'https://localhost:8443/gateway/sample2/api/v1/version'

The first command will succeed. The second command above will return an HTTP/1.1 403 Forbidden status along with an error response body.

Part 2

These embedded examples are all well and good but this article is supposed to be about ActiveDirectory. This takes us from examples that “just work” to examples that need to be customized for the environment in which they run. Specifically they require some basic network address information and a bunch of LDAP information. The list below describes the initial information you will need from your environment and shows what is being used in the samples here. You will need to adjust these values to match your environment when you use them in the samples.

A word of caution is warranted here. There are as many ways to setup LDAP and ActiveDirectory as there are IT departments. This variability requires flexibility which in turn often causes confusion, especially given poor documentation (guilty). The examples here focus on a single specific pattern that is seen frequently, but your mileage may vary.

  • Server Host - The hostname where ActiveDirectory is running. Example: ad.qa.your-domain.com
  • Server Port - The port on which ActiveDirectory is listening. Example: 389
  • System Username - The distinguished name for a user with search permissions. Example: CN=Kevin Minder,CN=Users,DC=hwqe,DC=hortonworks,DC=com
  • System Password - The password for the system user. (See: Note1) Example: ********
  • Search Base - The subset of users to search for authentication. (See: Note2) Example: CN=Users,DC=hwqe,DC=hortonworks,DC=com
  • Search Attribute - The attribute containing the username to search for authentication. Example: sAMAccountName
  • Search Class - The object class for LDAP entities to search for authentication. Example: person


Note1: In these samples the password is embedded within the topology files for simplicity. The password can instead be stored in a protected credential store.
Note2: This search base should constrain the search as much as possible to limit the amount of data returned by the query.

To start things off on the right foot, let’s execute an LDAP bind against ActiveDirectory. For this you will need your values for Server Host, Server Port, System Username and System Password described in the list above. This initial testing will be done using command line tools from OpenLDAP. If you don’t have these command line tools available, don’t despair; Knox provides some alternatives that I’ll show you later.

The command below will help ensure that the values for Server Host, Server Port, System Username and System Password are correct. In this case I’m using my own test account as the system user because it happens to have search privileges.

ldapwhoami -h ad.qa.your-domain.com -p 389 -x -D 'CN=Kevin Minder,CN=Users,DC=hwqe,DC=hortonworks,DC=com' -w '********'

This is a brief description of each command line parameter used above.

  • -h: Provide your Server Host
  • -p: Provide your Server Port
  • -x: Use simple authentication vs SASL
  • -D: Provide your System Username
  • -w: Provide your System Password

For me this command returns the output below.

u:HWQE\kminder

Now let’s make sure that the system user can actually search. Note that in this case the system user is searching for itself because -D and -b use the same value. You could change -b to search for other users.

ldapsearch -h ad.qa.your-domain.com -p 389 -x -D 'CN=Kevin Minder,CN=Users,DC=hwqe,DC=hortonworks,DC=com' -w '********' -b 'CN=Kevin Minder,CN=Users,DC=hwqe,DC=hortonworks,DC=com'
  • -b: Provide your System Username

This returns all of the LDAP attributes for the system user. Take note of a few key attributes like objectClass, which here is 'person', and sAMAccountName, which here is 'kminder'.

# extended LDIF
#
# LDAPv3
# base <CN=Users,DC=hwqe,DC=hortonworks,DC=com> with scope subtree
# filter: CN=Kevin Minder
# requesting: ALL
#

# Kevin Minder, Users, hwqe.hortonworks.com
dn: CN=Kevin Minder,CN=Users,DC=hwqe,DC=hortonworks,DC=com
objectClass: top
objectClass: person
objectClass: organizationalPerson
objectClass: user
cn: Kevin Minder
sn: Minder
givenName: Kevin
distinguishedName: CN=Kevin Minder,CN=Users,DC=hwqe,DC=hortonworks,DC=com
instanceType: 4
whenCreated: 20151117175833.0Z
whenChanged: 20151117175919.0Z
displayName: Kevin Minder
uSNCreated: 26688014
uSNChanged: 26688531
name: Kevin Minder
objectGUID:: Eedvw9dqoUK/ERLNEFrQ5w==
userAccountControl: 66048
badPwdCount: 0
codePage: 0
countryCode: 0
badPasswordTime: 130922583862610479
lastLogoff: 0
lastLogon: 130922584014955481
pwdLastSet: 130922567133848037
primaryGroupID: 513
objectSid:: AQUAAAAAAAUVAAAA7TkHmDQ43l1xd4O/MigBAA==
accountExpires: 9223372036854775807
logonCount: 0
sAMAccountName: kminder
sAMAccountType: 805306368
userPrincipalName: kminder@hwqe.hortonworks.com
objectCategory: CN=Person,CN=Schema,CN=Configuration,DC=hwqe,DC=hortonworks,DC=com
dSCorePropagationData: 16010101000000.0Z
lastLogonTimestamp: 130922567243691894

# search result
search: 2
result: 0 Success

# numResponses: 2
# numEntries: 1

Next, let’s check the values for Search Base, Search Attribute and Search Class with a command like the one below.
Again, don’t forget to scroll right to see the whole command.

ldapsearch -h ad.qa.your-domain.com -p 389 -x -D 'CN=Kevin Minder,CN=Users,DC=hwqe,DC=hortonworks,DC=com' -w '********' -b 'CN=Users,DC=hwqe,DC=hortonworks,DC=com' -z 5 '(objectClass=person)' sAMAccountName
  • -z 5: Limit the search results to 5 entries. Note that by default AD will only return a max of 1000 entries.
  • '(objectClass=person)': Limit the search results to entries where objectClass=person. This value was taken from the search result above.
  • sAMAccountName: Return only the sAMAccountName attribute.

If no results were returned go back and check the output from the search above for the correct settings. The results for this command should look something like what is shown below. Take note of the various attribute values returned for sAMAccountName. These are the usernames that will ultimately be used for login.

# extended LDIF
#
# LDAPv3
# base <CN=Users,DC=hwqe,DC=hortonworks,DC=com> with scope subtree
# filter: (objectClass=person)
# requesting: sAMAccountName
#

# Administrator, Users, hwqe.hortonworks.com
dn: CN=Administrator,CN=Users,DC=hwqe,DC=hortonworks,DC=com
sAMAccountName: Administrator

# guest, Users, hwqe.hortonworks.com
dn: CN=guest,CN=Users,DC=hwqe,DC=hortonworks,DC=com
sAMAccountName: guest

# cloudbase-init, Users, hwqe.hortonworks.com
dn: CN=cloudbase-init,CN=Users,DC=hwqe,DC=hortonworks,DC=com
sAMAccountName: cloudbase-init

# krbtgt, Users, hwqe.hortonworks.com
dn: CN=krbtgt,CN=Users,DC=hwqe,DC=hortonworks,DC=com
sAMAccountName: krbtgt

# ambari-server, Users, hwqe.hortonworks.com
dn: CN=ambari-server,CN=Users,DC=hwqe,DC=hortonworks,DC=com
sAMAccountName: ambari-server

# search result
search: 2
result: 4 Size limit exceeded

# numResponses: 6
# numEntries: 5

Sample 3

At this point you have verified all of the environmental information required for authentication and are ready to create your third topology file. Just as with the first example, this topology file will only include authentication. We will tackle authorization later.

The list below highlights the important settings in the topology file.

  • main.ldapRealm - The class name for Knox’s Shiro Realm implementation. Example: org.apache.hadoop.gateway.shirorealm.KnoxLdapRealm
  • main.ldapContextFactory - The class name for Knox’s Shiro LdapContextFactory implementation. Example: org.apache.hadoop.gateway.shirorealm.KnoxLdapContextFactory
  • main.ldapRealm.contextFactory - Sets the context factory on the realm. Example: $ldapContextFactory
  • main.ldapRealm.contextFactory.url - Sets the AD URL on the context factory. Example: ldap://ad.qa.your-domain.com:389
  • main.ldapRealm.contextFactory.systemUsername - Sets the system user’s DN on the context factory. Example: CN=Kevin Minder,CN=Users,DC=hwqe,DC=hortonworks,DC=com
  • main.ldapRealm.contextFactory.systemPassword - Sets the system user’s password on the context factory. Example: ********
  • main.ldapRealm.searchBase - The subset of users to search for authentication. Example: CN=Users,DC=hwqe,DC=hortonworks,DC=com
  • main.ldapRealm.userSearchAttributeName - The attribute whose value to use for username comparison. Example: sAMAccountName
  • main.ldapRealm.userObjectClass - The objectClass used to limit the search scope. Example: person
  • urls./** - Apply authentication to all URLs. Example: authcBasic

 

Create this sample3 topology file. Take care to replace all of the example environment values with the correct values for your environment that you discovered and verified above.

<GATEWAY_HOME>/conf/topologies/sample3.xml

<topology>
  <gateway>

    <provider>
      <role>authentication</role>
      <name>ShiroProvider</name>
      <enabled>true</enabled>
      <param name="main.ldapRealm" value="org.apache.hadoop.gateway.shirorealm.KnoxLdapRealm"/>
      <param name="main.ldapContextFactory" value="org.apache.hadoop.gateway.shirorealm.KnoxLdapContextFactory"/>
      <param name="main.ldapRealm.contextFactory" value="$ldapContextFactory"/>

      <param name="main.ldapRealm.contextFactory.url" value="ldap://ad.qa.your-domain.com:389"/>
      <param name="main.ldapRealm.contextFactory.systemUsername" value="CN=Kevin Minder,CN=Users,DC=hwqe,DC=hortonworks,DC=com"/>
      <param name="main.ldapRealm.contextFactory.systemPassword" value="********"/>

      <param name="main.ldapRealm.searchBase" value="CN=Users,DC=hwqe,DC=hortonworks,DC=com"/>
      <param name="main.ldapRealm.userSearchAttributeName" value="sAMAccountName"/>
      <param name="main.ldapRealm.userObjectClass" value="person"/>

      <param name="urls./**" value="authcBasic"/>
    </provider>

  </gateway>
  <service>
    <role>KNOX</role>
  </service>
</topology>

We could go straight to trying to access the Knox Admin API with cURL as we did before. However, let’s take this opportunity to explore the new LDAP diagnostic tools introduced in Apache Knox 0.7.0.

This first command helps diagnose basic connectivity and system user issues.

bin/knoxcli.sh system-user-auth-test --cluster sample3
System LDAP Bind successful.

If the command above works you can move on to testing the LDAP search configuration settings of the topology. If you don’t provide the username and password via the command line switches you will be prompted to enter them.

bin/knoxcli.sh user-auth-test --cluster sample3 --u kminder --p '********'
LDAP authentication successful!

Once all of that is working go ahead and try the cURL command.

curl -u kminder:******** -ik 'https://localhost:8443/gateway/sample3/api/v1/version'

Sample 4

The next step is to enable authorization. To accomplish this there is a bit more environmental information needed. The OpenLDAP command line tools are useful here again to ensure that we have the correct values. Authorization requires determining group membership, and we will use LDAP searches to determine it. The way ActiveDirectory is set up for this example, this requires knowing four additional pieces of information: groupSearchBase, groupObjectClass, groupIdAttribute and memberAttribute.

The first, 'groupSearchBase', is something that you will need to find out from your ActiveDirectory administrator. In my example, this is the value 'OU=groups,DC=hwqe,DC=hortonworks,DC=com'. This value is a distinguished name that constrains the search to groups of which a given user might be a member. Once you have this you can use 'ldapsearch' to see the attributes of some groups to determine the other three settings.

Here is an example of an 'ldapsearch’ using groupSearchBase from my environment.

ldapsearch -h ad.qa.your-domain.com -p 389 -x -D 'CN=Kevin Minder,CN=Users,DC=hwqe,DC=hortonworks,DC=com' -w '********' -b 'OU=groups,DC=hwqe,DC=hortonworks,DC=com' -z 2

This is the output.

# extended LDIF
#
# LDAPv3
# base <OU=groups,DC=hwqe,DC=hortonworks,DC=com> with scope subtree
# filter: (objectclass=*)
# requesting: ALL
#

# groups, hwqe.hortonworks.com
dn: OU=groups,DC=hwqe,DC=hortonworks,DC=com
objectClass: top
objectClass: organizationalUnit
ou: groups
distinguishedName: OU=groups,DC=hwqe,DC=hortonworks,DC=com
instanceType: 4
whenCreated: 20150812202242.0Z
whenChanged: 20150812202242.0Z
uSNCreated: 42340
uSNChanged: 42341
name: groups
objectGUID:: RYIcbNyVWki5HmeANfzAbA==
objectCategory: CN=Organizational-Unit,CN=Schema,CN=Configuration,DC=hwqe,DC=h
 ortonworks,DC=com
dSCorePropagationData: 20150827225949.0Z
dSCorePropagationData: 20150812202242.0Z
dSCorePropagationData: 16010101000001.0Z

# scientist, groups, hwqe.hortonworks.com
dn: CN=scientist,OU=groups,DC=hwqe,DC=hortonworks,DC=com
objectClass: top
objectClass: group
cn: scientist
member: CN=sam repl2,CN=Users,DC=hwqe,DC=hortonworks,DC=com
member: CN=sam repl1,CN=Users,DC=hwqe,DC=hortonworks,DC=com
member: CN=sam repl,CN=Users,DC=hwqe,DC=hortonworks,DC=com
member: CN=bob,CN=Users,DC=hwqe,DC=hortonworks,DC=com
member: CN=sam,CN=Users,DC=hwqe,DC=hortonworks,DC=com
distinguishedName: CN=scientist,OU=groups,DC=hwqe,DC=hortonworks,DC=com
instanceType: 4
whenCreated: 20150812213414.0Z
whenChanged: 20150828231624.0Z
uSNCreated: 42355
uSNChanged: 751045
name: scientist
objectGUID:: iXhbVo7kJUGkiQ+Sjlm0Qw==
objectSid:: AQUAAAAAAAUVAAAA7TkHmDQ43l1xd4O/SgUAAA==
sAMAccountName: scientist
sAMAccountType: 536870912
groupType: -2147483644
objectCategory: CN=Group,CN=Schema,CN=Configuration,DC=hwqe,DC=hortonworks,DC=
 com
dSCorePropagationData: 20150827225949.0Z
dSCorePropagationData: 16010101000001.0Z

# search result
search: 2
result: 4 Size limit exceeded

# numResponses: 3
# numEntries: 2

From the output, take note of:

  • the relevant objectClass: 'group’
  • the attribute used to enumerate members: 'member'
  • the attributes that most uniquely name the group: 'cn’ or 'sAMAccountName’

These are the groupObjectClass, memberAttribute and groupIdAttribute values, respectively. We will use groupObjectClass=group, memberAttribute=member and groupIdAttribute=sAMAccountName.
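
As a cross-check of these values, you can also ask ActiveDirectory directly which groups list a particular user in their member attribute. The command below is only a sketch; it reuses the bind credentials from the earlier searches and sam’s distinguished name taken from the output above.

ldapsearch -h ad.qa.your-domain.com -p 389 -x \
  -D 'CN=Kevin Minder,CN=Users,DC=hwqe,DC=hortonworks,DC=com' -w '********' \
  -b 'OU=groups,DC=hwqe,DC=hortonworks,DC=com' \
  '(&(objectClass=group)(member=CN=sam,CN=Users,DC=hwqe,DC=hortonworks,DC=com))' sAMAccountName

This should list the scientist, analyst, knox_hdp_users and test grp groups, matching what the Knox group lookup reports later in this walkthrough.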

The command below repeats the search above but returns just the member attribute for up to 5 groups.

ldapsearch -h ad.qa.your-domain.com -p 389 -x -D 'CN=Kevin Minder,CN=Users,DC=hwqe,DC=hortonworks,DC=com' -w '********' -b 'OU=groups,DC=hwqe,DC=hortonworks,DC=com' -z 5 member
# extended LDIF
#
# LDAPv3
# base <OU=groups,DC=hwqe,DC=hortonworks,DC=com> with scope subtree
# filter: (objectclass=*)
# requesting: member
#

# groups, hwqe.hortonworks.com
dn: OU=groups,DC=hwqe,DC=hortonworks,DC=com

# scientist, groups, hwqe.hortonworks.com
dn: CN=scientist,OU=groups,DC=hwqe,DC=hortonworks,DC=com
member: CN=sam repl2,CN=Users,DC=hwqe,DC=hortonworks,DC=com
member: CN=sam repl1,CN=Users,DC=hwqe,DC=hortonworks,DC=com
member: CN=sam repl,CN=Users,DC=hwqe,DC=hortonworks,DC=com
member: CN=bob,CN=Users,DC=hwqe,DC=hortonworks,DC=com
member: CN=sam,CN=Users,DC=hwqe,DC=hortonworks,DC=com

# analyst, groups, hwqe.hortonworks.com
dn: CN=analyst,OU=groups,DC=hwqe,DC=hortonworks,DC=com
member: CN=testLdap1,CN=Users,DC=hwqe,DC=hortonworks,DC=com
member: CN=sam repl2,CN=Users,DC=hwqe,DC=hortonworks,DC=com
member: CN=sam repl1,CN=Users,DC=hwqe,DC=hortonworks,DC=com
member: CN=sam repl,CN=Users,DC=hwqe,DC=hortonworks,DC=com
member: CN=bob,CN=Users,DC=hwqe,DC=hortonworks,DC=com
member: CN=tom,CN=Users,DC=hwqe,DC=hortonworks,DC=com
member: CN=sam,CN=Users,DC=hwqe,DC=hortonworks,DC=com

# knox_hdp_users, groups, hwqe.hortonworks.com
dn: CN=knox_hdp_users,OU=groups,DC=hwqe,DC=hortonworks,DC=com
member: CN=sam repl2,CN=Users,DC=hwqe,DC=hortonworks,DC=com
member: CN=sam repl1,CN=Users,DC=hwqe,DC=hortonworks,DC=com
member: CN=sam repl,CN=Users,DC=hwqe,DC=hortonworks,DC=com
member: CN=sam,CN=Users,DC=hwqe,DC=hortonworks,DC=com

# knox_no_users, groups, hwqe.hortonworks.com
dn: CN=knox_no_users,OU=groups,DC=hwqe,DC=hortonworks,DC=com

# test grp, groups, hwqe.hortonworks.com
dn: CN=test grp,OU=groups,DC=hwqe,DC=hortonworks,DC=com
member: CN=testLdap1,CN=Users,DC=hwqe,DC=hortonworks,DC=com
member: CN=sam,CN=Users,DC=hwqe,DC=hortonworks,DC=com

# search result
search: 2
result: 0 Success

# numResponses: 7
# numEntries: 6

Armed with this group information, you can now create a topology file that causes the Shiro authentication provider to retrieve group information. Keep in mind that we haven’t made it all the way to authorization yet. This step is just to prove that you can get the group information back from ActiveDirectory. Once we have the group lookup working, we will enable authorization in the next step.

The parameters below summarize the changes that you will be making in this topology file.

  • main.ldapRealm.userSearchBase: Replaces main.ldapRealm.searchBase. Example: CN=Users,DC=hwqe,DC=hortonworks,DC=com
  • main.ldapRealm.authorizationEnabled: Enables the group lookup functionality. Example: true
  • main.ldapRealm.groupSearchBase: The subset of groups to search for user membership. Example: OU=groups,DC=hwqe,DC=hortonworks,DC=com
  • main.ldapRealm.groupObjectClass: The objectClass to limit the search scope. Example: group
  • main.ldapRealm.groupIdAttribute: The attribute used to provide the group name. Example: sAMAccountName
  • main.ldapRealm.memberAttribute: The attribute used to provide the group’s members. Example: member

Create the sample4 topology file now.

<GATEWAY_HOME>/conf/topologies/sample4.xml

<topology>
    <gateway>

        <provider>
            <role>authentication</role>
            <name>ShiroProvider</name>
            <enabled>true</enabled>
            <param name="main.ldapRealm" value="org.apache.hadoop.gateway.shirorealm.KnoxLdapRealm"/>
            <param name="main.ldapContextFactory" value="org.apache.hadoop.gateway.shirorealm.KnoxLdapContextFactory"/>
            <param name="main.ldapRealm.contextFactory" value="$ldapContextFactory"/>

            <param name="main.ldapRealm.contextFactory.url" value="ldap://ad.qa.your-domain.com:389"/>
            <param name="main.ldapRealm.contextFactory.systemUsername" value="CN=Kevin Minder,CN=Users,DC=hwqe,DC=hortonworks,DC=com"/>
            <param name="main.ldapRealm.contextFactory.systemPassword" value="********"/>

            <param name="main.ldapRealm.userSearchBase" value="CN=Users,DC=hwqe,DC=hortonworks,DC=com"/>
            <param name="main.ldapRealm.userSearchAttributeName" value="sAMAccountName"/>
            <param name="main.ldapRealm.userObjectClass" value="person"/>

            <param name="main.ldapRealm.authorizationEnabled" value="true"/>
            <param name="main.ldapRealm.groupSearchBase" value="OU=groups,DC=hwqe,DC=hortonworks,DC=com"/>
            <param name="main.ldapRealm.groupObjectClass" value="group"/>
            <param name="main.ldapRealm.groupIdAttribute" value="sAMAccountName"/>
            <param name="main.ldapRealm.memberAttribute" value="member"/>

            <param name="urls./**" value="authcBasic"/>
        </provider>

    </gateway>
    <service>
        <role>KNOX</role>
    </service>
</topology>

Once again the Knox tooling can be used to test this configuration. This time the --g flag will be added to retrieve group information.

bin/knoxcli.sh user-auth-test --cluster sample4 --u sam --p '********' --g
LDAP authentication successful!
sam is a member of: analyst
sam is a member of: knox_hdp_users
sam is a member of: test grp
sam is a member of: scientist
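
It can also be useful to run the same check for a user that should not be a member of knox_hdp_users (kminder in this walkthrough), since that difference is exactly what the authorization provider in the next sample acts upon. A sketch, assuming you know kminder’s password:

bin/knoxcli.sh user-auth-test --cluster sample4 --u kminder --p '********' --g

The group list returned for kminder should not include knox_hdp_users, which is why the kminder request is rejected in Sample 5 below.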

Sample 5

The next sample adds an authorization provider to act upon the groups. This is the same provider that was added back in the second sample. The parameter <param name="knox.acl" value="*;knox_hdp_users;*"/> in this case dictates that only members of the group knox_hdp_users can access the Knox Admin API via the sample5 topology. Create the topology shown below. Don’t forget to tailor it to your environment.

<GATEWAY_HOME>/conf/topologies/sample5.xml

<topology>
    <gateway>

        <provider>
            <role>authentication</role>
            <name>ShiroProvider</name>
            <enabled>true</enabled>
            <param name="main.ldapRealm" value="org.apache.hadoop.gateway.shirorealm.KnoxLdapRealm"/>
            <param name="main.ldapContextFactory" value="org.apache.hadoop.gateway.shirorealm.KnoxLdapContextFactory"/>
            <param name="main.ldapRealm.contextFactory" value="$ldapContextFactory"/>
            <param name="main.ldapRealm.contextFactory.url" value="ldap://ad.qa.your-domain.com:389"/>
            <param name="main.ldapRealm.contextFactory.systemUsername" value="CN=Kevin Minder,CN=Users,DC=hwqe,DC=hortonworks,DC=com"/>
            <param name="main.ldapRealm.contextFactory.systemPassword" value="********"/>
            <param name="main.ldapRealm.userSearchBase" value="CN=Users,DC=hwqe,DC=hortonworks,DC=com"/>
            <param name="main.ldapRealm.userSearchAttributeName" value="sAMAccountName"/>
            <param name="main.ldapRealm.userObjectClass" value="person"/>
            <param name="main.ldapRealm.authorizationEnabled" value="true"/>
            <param name="main.ldapRealm.groupSearchBase" value="OU=groups,DC=hwqe,DC=hortonworks,DC=com"/>
            <param name="main.ldapRealm.groupObjectClass" value="group"/>
            <param name="main.ldapRealm.groupIdAttribute" value="sAMAccountName"/>
            <param name="main.ldapRealm.memberAttribute" value="member"/>
            <param name="urls./**" value="authcBasic"/>
        </provider>

        <provider>
            <role>authorization</role>
            <name>AclsAuthz</name>
            <enabled>true</enabled>
            <param name="knox.acl" value="*;knox_hdp_users;*"/>
        </provider>

    </gateway>
    <service>
        <role>KNOX</role>
    </service>
</topology>

Now exercise the ACL with a user that is not a member of knox_hdp_users and one that is. The kminder request should be rejected with a 403, while the sam request should succeed with a 200.

curl -u kminder:'********' -ik 'https://localhost:8443/gateway/sample5/api/v1/version'
403
curl -u sam:'********' -ik 'https://localhost:8443/gateway/sample5/api/v1/version'
200

Sample 6

Next, let’s enable caching, since this important performance enhancement isn’t enabled out of the box. The parameters below summarize the changes that will be made to the authentication provider settings.

  • main.cacheManager: The name of the class implementing the cache. Example: org.apache.shiro.cache.ehcache.EhCacheManager
  • main.securityManager.cacheManager: Sets the cache manager on the security manager. Example: $cacheManager
  • main.ldapRealm.authenticationCachingEnabled: Enables the use of caching during authentication. Example: true

Create the sample6 topology file now.

<GATEWAY_HOME>/conf/topologies/sample6.xml

<topology>
    <gateway>

        <provider>
            <role>authentication</role>
            <name>ShiroProvider</name>
            <enabled>true</enabled>
            <param name="main.ldapRealm" value="org.apache.hadoop.gateway.shirorealm.KnoxLdapRealm"/>
            <param name="main.ldapContextFactory" value="org.apache.hadoop.gateway.shirorealm.KnoxLdapContextFactory"/>
            <param name="main.ldapRealm.contextFactory" value="$ldapContextFactory"/>
            <param name="main.ldapRealm.contextFactory.url" value="ldap://ad.qa.your-domain.com:389"/>
            <param name="main.ldapRealm.contextFactory.systemUsername" value="CN=Kevin Minder,CN=Users,DC=hwqe,DC=hortonworks,DC=com"/>
            <param name="main.ldapRealm.contextFactory.systemPassword" value="********"/>
            <param name="main.ldapRealm.userSearchBase" value="CN=Users,DC=hwqe,DC=hortonworks,DC=com"/>
            <param name="main.ldapRealm.userSearchAttributeName" value="sAMAccountName"/>
            <param name="main.ldapRealm.userObjectClass" value="person"/>
            <param name="main.ldapRealm.authorizationEnabled" value="true"/>
            <param name="main.ldapRealm.groupSearchBase" value="OU=groups,DC=hwqe,DC=hortonworks,DC=com"/>
            <param name="main.ldapRealm.groupObjectClass" value="group"/>
            <param name="main.ldapRealm.groupIdAttribute" value="sAMAccountName"/>
            <param name="main.ldapRealm.memberAttribute" value="member"/>

            <param name="main.cacheManager" value="org.apache.shiro.cache.ehcache.EhCacheManager"/>
            <param name="main.securityManager.cacheManager" value="$cacheManager"/>
            <param name="main.ldapRealm.authenticationCachingEnabled" value="true"/>

            <param name="urls./**" value="authcBasic"/>
        </provider>

        <provider>
            <role>authorization</role>
            <name>AclsAuthz</name>
            <enabled>true</enabled>
            <param name="knox.acl" value="*;knox_hdp_users;*"/>
        </provider>

    </gateway>
    <service>
        <role>KNOX</role>
    </service>
</topology>

With this topology file you can execute a sequence of cURL commands to demonstrate that the authentication is indeed cached.

curl -u sam:'********' -ik 'https://localhost:8443/gateway/sample6/api/v1/version'

Now unplug your network cable, turn off Wi-Fi, or disconnect from your VPN. The intent is to temporarily prevent access to the ActiveDirectory server. The command below will continue to work even though no cookies are used and the ActiveDirectory server cannot be contacted, because the invocation above caused the user’s authentication and authorization information to be cached.

curl -u sam:'********' -ik 'https://localhost:8443/gateway/sample6/api/v1/version'

The command below uses an invalid password and is intended to prove that the previously authenticated credentials are re-verified, so the request should be rejected even though the user’s information is cached. It is important to note that Knox does not store the actual password in the cache for this verification but rather a one-way hash of the password.

curl -u sam:'invalid-password' -ik 'https://localhost:8443/gateway/sample6/api/v1/version'
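
If you just want to compare the status codes side by side, a small loop like the one below (a convenience sketch, not part of the Knox tooling) prints only the HTTP status for the valid and the invalid password. You should see a success code for the first request and an authentication failure (typically 401) for the second.

for pw in '********' 'invalid-password'; do
  # -s silences progress output, -o /dev/null discards the body, -w prints just the HTTP status code
  curl -sk -o /dev/null -w "%{http_code}\n" -u "sam:$pw" \
    'https://localhost:8443/gateway/sample6/api/v1/version'
done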

Sample 7

Finally, let’s put it all together in a real topology file that doesn’t use the Knox Admin API. The important things to observe here are:

  1. the host and ports for the Hadoop services will need to be changed to match your environment
  2. the inclusion of the Hadoop services instead of the Knox Admin API
  3. the inclusion of the identity-assertion provider
  4. the exclusion of the hostmap provider as this is rarely required unless running Hadoop on local VMs with port mapping

Create the final sample7 topology file.

<GATEWAY_HOME>/conf/topologies/sample7.xml

<topology>
    <gateway>

        <provider>
            <role>authentication</role>
            <name>ShiroProvider</name>
            <enabled>true</enabled>
            <param name="main.ldapRealm" value="org.apache.hadoop.gateway.shirorealm.KnoxLdapRealm"/>
            <param name="main.ldapContextFactory" value="org.apache.hadoop.gateway.shirorealm.KnoxLdapContextFactory"/>
            <param name="main.ldapRealm.contextFactory" value="$ldapContextFactory"/>

            <param name="main.ldapRealm.contextFactory.url" value="ldap://ad.qa.your-domain.com:389"/>
            <param name="main.ldapRealm.contextFactory.systemUsername" value="CN=Kevin Minder,CN=Users,DC=hwqe,DC=hortonworks,DC=com"/>
            <param name="main.ldapRealm.contextFactory.systemPassword" value="********"/>

            <param name="main.ldapRealm.userSearchBase" value="CN=Users,DC=hwqe,DC=hortonworks,DC=com"/>
            <param name="main.ldapRealm.userSearchAttributeName" value="sAMAccountName"/>
            <param name="main.ldapRealm.userObjectClass" value="person"/>

            <param name="main.ldapRealm.authorizationEnabled" value="true"/>
            <param name="main.ldapRealm.groupSearchBase" value="OU=groups,DC=hwqe,DC=hortonworks,DC=com"/>
            <param name="main.ldapRealm.groupObjectClass" value="group"/>
            <param name="main.ldapRealm.groupIdAttribute" value="sAMAccountName"/>
            <param name="main.ldapRealm.memberAttribute" value="member"/>

            <param name="main.cacheManager" value="org.apache.shiro.cache.ehcache.EhCacheManager"/>
            <param name="main.securityManager.cacheManager" value="$cacheManager"/>
            <param name="main.ldapRealm.authenticationCachingEnabled" value="true"/>

            <param name="urls./**" value="authcBasic"/>
        </provider>

        <provider>
            <role>authorization</role>
            <name>AclsAuthz</name>
            <enabled>true</enabled>
            <param name="knox.acl" value="*;knox_hdp_users;*"/>
        </provider>

        <provider>
            <role>identity-assertion</role>
            <name>Default</name>
            <enabled>true</enabled>
        </provider>

    </gateway>

    <service>
        <role>NAMENODE</role>
        <url>hdfs://your-nn-host.your-domain.com:8020</url>
    </service>

    <service>
        <role>JOBTRACKER</role>
        <url>rpc://your-jt-host.your-domain.com:8050</url>
    </service>

    <service>
        <role>WEBHDFS</role>
        <url>http://your-nn-host.your-domain.com:50070/webhdfs</url>
    </service>

    <service>
        <role>WEBHCAT</role>
        <url>http://your-webhcat-host.your-domain.com:50111/templeton</url>
    </service>

    <service>
        <role>OOZIE</role>
        <url>http://your-oozie-host.your-domain.com:11000/oozie</url>
    </service>

    <service>
        <role>WEBHBASE</role>
        <url>http://your-hbase-host.your-domain.com:60080</url>
    </service>

    <service>
        <role>HIVE</role>
        <url>http://your-hive-host.your-domain.com:10001/cliservice</url>
    </service>

    <service>
        <role>RESOURCEMANAGER</role>
        <url>http://your-rm-host.your-domain.com:8088/ws</url>
    </service>

</topology>

To verify topology files we frequently use the WebHDFS GETHOMEDIRECTORY command. Here it is invoked through the sample7 topology as the ActiveDirectory user sam.

curl -ku sam:'********' 'https://localhost:8443/gateway/sample7/webhdfs/v1/?op=GETHOMEDIRECTORY'

This should return a response body similar to what is shown below.

{"Path": "/user/sam"}

Hopefully this provides a more targeted and useful example of using Apache Knox with ActiveDirectory than can be provided in the Apache Knox User’s Guide. If you have more questions, comments or suggestions, please join the Apache Knox community. In particular you might be interested in one of the mailing lists.

by: Kevin Minder - Nov 18, 2015

 

This article covers setting up Apache Knox for development, or just to play around with it.

Step 1 - Clone the git repository

~/Projects> git clone https://git-wip-us.apache.org/repos/asf/knox.git

Step 2 - Build, install and start the servers

~/Projects> cd knox
~/Projects/knox> ant package install-test-home start-test-servers

This will generate a great deal of output. At the end, though, you should see something like the output below. If not, I’ve included some debugging tips further down.

start-test-ldap:
     [exec] Starting LDAP succeeded with PID 18226.

start-test-gateway:
     [exec] Starting Gateway succeeded with PID 18277.

Assuming that the servers started successfully, you can access the Knox Admin API via cURL.

~/Projects/knox> curl -ku admin:admin-password 'https://localhost:8443/gateway/admin/api/v1/version'

This will return an XML response with some version information.

<?xml version="1.0" encoding="UTF-8"?>
<ServerVersion>
   <version>0.7.0-SNAPSHOT</version>
   <hash>fa56190a3de7d33ac07392f81def235bdb2d258c</hash>
</ServerVersion>

If the servers failed to start, here are some debugging tips and tricks.

The first thing to check for is other running gateway or ldap servers. The Java jps command is convenient for doing this. If you find other gateway.jar or ldap.jar processes running, they are likely causing the issue. These will need to be stopped before you can proceed.

~/Projects/knox> jps
431 Launcher
18277 gateway.jar
18346 Jps
18226 ldap.jar
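
If you do find stray gateway.jar or ldap.jar processes, stopping them is just a matter of killing the PIDs that jps reports, for example (the PIDs here are purely illustrative):

kill 18277 18226   # substitute the gateway.jar and ldap.jar PIDs reported by jps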

The next likely culprit is some other process using a port required by the gateway (8443) or the demo LDAP server (33389). On macOS the lsof command is the tool of choice. If you find other processes already listening on these ports, they will need to be stopped before you can proceed.

~/Projects/knox> lsof -n -i4TCP:8443 | grep LISTEN
java    18277 kevin.minder  167u  IPv6 0x2d785ee90129816b      0t0  TCP *:pcsync-https (LISTEN)

~/Projects/knox> lsof -n -i4TCP:33389 | grep LISTEN
java    18226 kevin.minder  226u  IPv6 0x2d785ee91fcce56b      0t0  TCP *:33389 (LISTEN)

Step 3 - Customize the topology for your cluster

Once the Knox servers are up and running, you need to create or customize topology files to match an existing Hadoop cluster. Please note that your directories may be different from what is shown below depending on which version of Knox you are using; the version shown here is 0.7.0-SNAPSHOT. Also note that open is a macOS-specific command that will likely launch the XML file in Xcode for editing. Any text editor is fine.

~/Projects/knox> cd install/knox-0.7.0-SNAPSHOT
~/Projects/knox/install/knox-0.7.0-SNAPSHOT> open conf/topologies/sandbox.xml

Right now all you need to worry about are the <service> sections in the topology file, in particular the <url> values. If you are running a local HDP Sandbox these values will be correct; otherwise they will need to be changed.

    <service>
        <role>NAMENODE</role>
        <url>hdfs://localhost:8020</url>
    </service>

    <service>
        <role>JOBTRACKER</role>
        <url>rpc://localhost:8050</url>
    </service>

    <service>
        <role>WEBHDFS</role>
        <url>http://localhost:50070/webhdfs</url>
    </service>

    <service>
        <role>WEBHCAT</role>
        <url>http://localhost:50111/templeton</url>
    </service>

    <service>
        <role>OOZIE</role>
        <url>http://localhost:11000/oozie</url>
    </service>

    <service>
        <role>WEBHBASE</role>
        <url>http://localhost:60080</url>
    </service>

    <service>
        <role>HIVE</role>
        <url>http://localhost:10001/cliservice</url>
    </service>

    <service>
        <role>RESOURCEMANAGER</role>
        <url>http://localhost:8088/ws</url>
    </service>

Once you have made the required changes to the <service> elements, save the file. Within a few seconds the Knox gateway server will detect the change and reload the topology. Then you can access the Hadoop cluster via the gateway with the sample cURL command below.

curl -ku guest:guest-password 'https://localhost:8443/gateway/sandbox/webhdfs/v1/?op=GETHOMEDIRECTORY' 

This should return a response body similar to what is shown below.

{"Path": "/user/guest"}

Hopefully this provides the shortest possible path to getting started with Apache Knox. Most of this information can also be found in the Apache Knox User’s Guide. If you have more questions, comments or suggestions please join the Apache Knox community. In particular you might be interested in one of the mailing lists.