Introduction

This page describes how to set up Kerberos, Hadoop, Zookeeper, and Giraph so that all of these components work together with Hadoop's security features enabled.

Disclaimer

This guide is intended for development use only: it is not a best-practices guide to deploying secure Hadoop in a production setting. An actual production deployment will require additional, site-specific changes to enhance security.

Kerberos

Installation

sudo yum -y install krb5-server
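
Depending on your distribution, the Kerberos client utilities (kinit, klist, kadmin) may live in a separate package from the server. On RHEL/CentOS-style systems that package is usually krb5-workstation; this is an assumption about your package layout rather than part of the original setup:

sudo yum -y install krb5-workstation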

Configuration

/etc/krb5.conf
[logging]
 default = FILE:/var/log/krb5libs.log
 kdc = FILE:/var/log/krb5kdc.log
 admin_server = FILE:/var/log/kadmind.log

[libdefaults]
 default_realm = HADOOP.LOCALDOMAIN
 dns_lookup_realm = false
 dns_lookup_kdc = false
 ticket_lifetime = 1d
 renew_lifetime = 7d
 forwardable = yes
 proxiable = yes
 udp_preference_limit = 1
 extra_addresses = 127.0.0.1
 kdc_timesync = 1
 ccache_type = 4
 allow_weak_crypto = true

[realms]
 HADOOP.LOCALDOMAIN = {
  kdc =  localhost:88
  admin_server =  localhost:749
 }

[domain_realm]
 localhost = HADOOP.LOCALDOMAIN
 .compute-1.internal = HADOOP.LOCALDOMAIN
 .internal = HADOOP.LOCALDOMAIN
 internal = HADOOP.LOCALDOMAIN

[appdefaults]
 pam = {
  debug = false
  ticket_lifetime = 36000
  renew_lifetime = 36000
  forwardable = true
  krb4_convert = false
 }

[login]
 krb4_convert = true
 krb4_get_tickets = false

Initialize Kerberos KDC service

$ sudo kdb5_util create -s
Loading random data
Initializing database '/var/kerberos/krb5kdc/principal' for realm 'HADOOP.LOCALDOMAIN',
master key name 'K/M@HADOOP.LOCALDOMAIN'
You will be prompted for the database Master Password.
It is important that you NOT FORGET this password.
Enter KDC database master key: 

Re-enter KDC database master key to verify: 

$

Startup

sudo service krb5kdc restart
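
As a quick sanity check that the KDC actually came up, you can query the service and look at its log file (the log path is the one configured in krb5.conf above; the exact status output depends on your init scripts):

sudo service krb5kdc status
sudo tail /var/log/krb5kdc.log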

Set up principals

Use the principals.sh script from https://github.com/ekoontz/kerb-setup. Run this script as a normal user who has sudo privileges: it will call sudo as needed. Choose a password that you will use for your own (ordinary user) principal, and pass this password as the first argument of the script:

./principals.sh mypassword

The script saves the service keytabs in a single file called services.keytab in the current working directory. We'll assume this file lives in $HOME/kerb-setup/ and will use the full path $HOME/kerb-setup/services.keytab in the Hadoop configuration files below.
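
If you prefer to create the principals by hand, the script boils down to a handful of kadmin.local invocations. The sketch below is not the actual contents of principals.sh; the principal names and keytab file are inferred from the Hadoop configuration later on this page:

# Service principals for the Hadoop daemons, with randomized keys exported to a keytab.
sudo kadmin.local -q "addprinc -randkey hdfs/`hostname -f`@HADOOP.LOCALDOMAIN"
sudo kadmin.local -q "addprinc -randkey mapred/`hostname -f`@HADOOP.LOCALDOMAIN"
sudo kadmin.local -q "ktadd -k services.keytab hdfs/`hostname -f` mapred/`hostname -f`"
sudo chown `whoami` services.keytab
# An ordinary user principal for yourself, protected by the password you chose.
sudo kadmin.local -q "addprinc -pw mypassword `whoami`@HADOOP.LOCALDOMAIN"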

Hadoop

Build

git clone git://git.apache.org/hadoop-common.git
cd hadoop-common
git checkout origin/branch-1.0.2

Remove dependency on java5

Open build.xml in an editor and remove the package target's dependencies on docs and cn-docs (these are what pull in the Java 5 toolchain), so that it looks like:

  <target name="package" depends="compile, jar, javadoc, api-report, examples, tools-jar, jar-test, ant-tasks, package-librecordio"
          description="assembles multi-platform artifacts for distribution">

Run build

ant -Dcompile.native=true clean jsvc package

This produces a working Hadoop runtime in the directory $HOME/hadoop-common/build/hadoop-1.0.3-SNAPSHOT, but we still need to configure it to enable the security-related features.

Configuration

In the configuration files below, replace $HOST with the output of `hostname -f` and $HOME with the output of `echo $HOME`.
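
If you paste the three files below verbatim, one way to apply those substitutions afterwards is a quick sed pass over the conf directory. This is just a convenience sketch, assuming GNU sed and that the $HOST/$HOME placeholders appear literally in the files:

cd $HOME/hadoop-common/build/hadoop-1.0.3-SNAPSHOT/conf
sed -i "s|\$HOST|`hostname -f`|g; s|\$HOME|$HOME|g" core-site.xml hdfs-site.xml mapred-site.xml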

$HOME/hadoop-common/build/hadoop-1.0.3-SNAPSHOT/conf/core-site.xml
<configuration>
  <property>
    <name>hadoop.security.authentication</name>
    <value>kerberos</value>
  </property>
  <property>
    <name>hadoop.security.authorization</name>
    <value>true</value>
  </property>
  <property>
    <name>giraph.zkList</name>
    <value>localhost:2181</value>
  </property>
</configuration>
$HOME/hadoop-common/build/hadoop-1.0.3-SNAPSHOT/conf/hdfs-site.xml
<configuration>
  <property>
    <name>dfs.block.access.token.enable</name>
    <value>true</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://$HOST:8020/</value>
  </property>
  <property>
    <name>dfs.namenode.keytab.file</name>
    <value>$HOME/kerb-setup/services.keytab</value>
  </property>
  <property>
    <name>dfs.namenode.kerberos.principal</name>
    <value>hdfs/_HOST@HADOOP.LOCALDOMAIN</value>
  </property>
  <property>
    <name>dfs.https.enable</name>
    <value>false</value>
  </property>
  <property>
    <name>dfs.webhdfs.enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>dfs.namenode.user.name</name>
    <value>hdfs</value>
  </property>
  <property>
    <name>dfs.http.address</name>
    <value>$HOST:8070</value>
  </property>
  <!-- NOTE: this is still needed even though https is not enabled. -->
  <property>
    <name>dfs.https.port</name>
    <value>8090</value>
  </property>
  <property>
    <name>dfs.datanode.address</name>
    <value>0.0.0.0:1004</value>
  </property>
  <property>
    <name>dfs.datanode.http.address</name>
    <value>0.0.0.0:1006</value>
  </property>
  <property>
    <name>dfs.datanode.keytab.file</name>
    <value>$HOME/kerb-setup/services.keytab</value>
  </property>
  <property>
    <name>dfs.datanode.kerberos.principal</name>
    <value>hdfs/_HOST@HADOOP.LOCALDOMAIN</value>
  </property>
</configuration>
$HOME/hadoop-common/build/hadoop-1.0.3-SNAPSHOT/conf/mapred-site.xml
<configuration>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>10</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>$HOST:8030</value>
  </property>
  <property>
    <name>mapred.job.tracker.http.address</name>
    <value>0.0.0.0:8040</value>
  </property>
  <property>
    <name>mapred.task.tracker.http.address</name>
    <value>0.0.0.0:8050</value>
  </property>
  <property>
    <name>mapreduce.jobtracker.keytab.file</name>
    <value>$HOME/kerb-setup/services.keytab</value>
  </property>
  <property>
    <name>mapreduce.jobtracker.kerberos.principal</name>
    <value>mapred/_HOST@HADOOP.LOCALDOMAIN</value>
  </property>
  <property>
    <name>mapreduce.tasktracker.keytab.file</name>
    <value>$HOME/kerb-setup/services.keytab</value>
  </property>
  <property>
    <name>mapreduce.tasktracker.kerberos.principal</name>
    <value>mapred/_HOST@HADOOP.LOCALDOMAIN</value>
  </property>
</configuration>

Add the following to your $HOME/hadoop-common/build/hadoop-1.0.3-SNAPSHOT/conf/hadoop-env.sh:

(place it immediately below the # Extra Java CLASSPATH elements. Optional. line).

Note that the jars in the following HADOOP_CLASSPATH will only be present after they are fetched by Maven when you build Giraph (below). Therefore you should wait to start your Hadoop daemons until you've built Giraph.

"hadoop-env.sh"
export HADOOP_CLASSPATH=$HOME/.m2/repository/com/google/guava/guava/r09/guava-r09.jar:$HOME/.m2/repository/commons-io/commons-io/1.3.2/commons-io-1.3.2.jar:$HOME/.m2/repository/org/apache/zookeeper/zookeeper/3.3.3/zookeeper-3.3.3.jar:$HOME/.m2/repository/org/json/json/20090211/json-20090211.jar:$HOME/.m2/repository/net/iharder/base64/2.3.8/base64-2.3.8.jar
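
Once Giraph has been built (next section), you can verify that every jar on that classpath is actually in your local Maven repository before starting the daemons. A small sketch, assuming you source the hadoop-env.sh you just edited so that HADOOP_CLASSPATH is set in your shell:

source $HOME/hadoop-common/build/hadoop-1.0.3-SNAPSHOT/conf/hadoop-env.sh
# Check each colon-separated entry of HADOOP_CLASSPATH for existence.
for jar in ${HADOOP_CLASSPATH//:/ }; do
  [ -f "$jar" ] && echo "found:   $jar" || echo "MISSING: $jar"
done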

Giraph

Build

git clone git://git.apache.org/giraph.git
cd giraph
mvn -DskipTests -Phadoop_1.0 clean package
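
The job-submission step at the end of this page uses the "munged" jar produced by this build, so it's worth confirming it exists before continuing (the version in the jar name may differ; the run command below assumes giraph-0.2-SNAPSHOT.jar):

ls target/munged/giraph-*.jar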

Configuration

Note the giraph.zkList property in core-site.xml above: it points Giraph at the externally-managed ZooKeeper server that we start below.

Hadoop Daemon Startup

"hadoop-startup.sh"
cd $HOME/hadoop-common/build/hadoop-1.0.3-SNAPSHOT
# Wipe any previous local HDFS data and format a fresh namenode.
rm -rf /tmp/hadoop-`whoami`
bin/hadoop namenode -format
bin/hadoop namenode &
sleep 2
# The secure datanode binds to privileged ports (1004/1006 in hdfs-site.xml above),
# so it is started via sudo and then drops privileges to HADOOP_SECURE_DN_USER.
export HADOOP_SECURE_DN_USER=`whoami`
sudo -E bin/hadoop datanode &
bin/hadoop jobtracker &
sleep 2
bin/hadoop tasktracker &
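
After a few seconds all four daemons should be up. One way to check (jps ships with the JDK; the datanode was launched through sudo/jsvc, so it is easier to spot with ps):

jps
ps aux | grep -i datanode | grep -v grep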

Zookeeper

Build

git clone git://git.apache.org/zookeeper.git 
cd zookeeper
ant clean jar

Configuration

Create a conf/zoo.cfg file in your zookeeper directory:

dataDir=/tmp/zkdata
clientPort=2181
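
ZooKeeper will store its data under the dataDir given above; creating the directory up front does no harm even if your ZooKeeper version would create it automatically:

mkdir -p /tmp/zkdata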

Startup

bin/zkServer.sh start-foreground
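
Because start-foreground keeps the server attached to your terminal, run the following from another shell to confirm ZooKeeper is answering on the client port (the built-in ruok command should get imok back, assuming netcat is installed):

echo ruok | nc localhost 2181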

Initialize your principal

kinit

You'll be asked for a password; use the same password that you chose when you ran principals.sh in the Set up principals section above.
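
You can confirm that you now hold a Kerberos ticket for your principal with klist:

klist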

Run your job!

cd $HOME/hadoop-common/build/hadoop-1.0.3-SNAPSHOT
bin/hadoop jar ~/giraph/target/munged/giraph-0.2-SNAPSHOT.jar org.apache.giraph.\
benchmark.PageRankBenchmark -e 1 -s 3 -v -V 50 -w 2
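
While the benchmark runs, you can follow its progress on the JobTracker web UI, which mapred-site.xml above binds to port 8040; for example, open http://localhost:8040/ in a browser, or fetch the front page from the shell:

curl -s http://localhost:8040/jobtracker.jsp | head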