Introduction
This page describes how to set up Kerberos, Hadoop, Zookeeper and Giraph so that all components work with Hadoop's security features enabled.
Disclaimer
This is intended for development only: it is not a best-practices guide to secure Hadoop deployment in a production setting. An actual production deployment will require additional, site-specific changes to enhance security.
Kerberos
Installation
sudo yum -y install krb5-server
Configuration
Edit /etc/krb5.conf so that it contains the following:

[logging]
 default = FILE:/var/log/krb5libs.log
 kdc = FILE:/var/log/krb5kdc.log
 admin_server = FILE:/var/log/kadmind.log

[libdefaults]
 default_realm = HADOOP.LOCALDOMAIN
 dns_lookup_realm = false
 dns_lookup_kdc = false
 ticket_lifetime = 1d
 renew_lifetime = 7d
 forwardable = yes
 proxiable = yes
 udp_preference_limit = 1
 extra_addresses = 127.0.0.1
 kdc_timesync = 1
 ccache_type = 4
 allow_weak_crypto = true

[realms]
 HADOOP.LOCALDOMAIN = {
  kdc = localhost:88
  admin_server = localhost:749
 }

[domain_realm]
 localhost = HADOOP.LOCALDOMAIN
 .compute-1.internal = HADOOP.LOCALDOMAIN
 .internal = HADOOP.LOCALDOMAIN
 internal = HADOOP.LOCALDOMAIN

[appdefaults]
 pam = {
  debug = false
  ticket_lifetime = 36000
  renew_lifetime = 36000
  forwardable = true
  krb4_convert = false
 }

[login]
 krb4_convert = true
 krb4_get_tickets = false
Initialize Kerberos KDC service
$ sudo kdb5_util create -s
Loading random data
Initializing database '/var/kerberos/krb5kdc/principal' for realm 'HADOOP.LOCALDOMAIN',
master key name 'K/M@HADOOP.LOCALDOMAIN'
You will be prompted for the database Master Password.
It is important that you NOT FORGET this password.
Enter KDC database master key:
Re-enter KDC database master key to verify:
$
Startup
sudo service krb5kdc restart
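As a quick sanity check (assuming the yum-based layout above, where the KDC service is named krb5kdc), confirm that the daemon is running and that kdb5_util created the database. kadmin.local talks directly to the KDC database, so it works even before any principals of your own exist:

sudo service krb5kdc status
sudo kadmin.local -q listprincs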
Set up principals
Use the principals.sh script from https://github.com/ekoontz/kerb-setup. Run it as a normal user who has sudo privileges: it will call sudo as needed. Choose a password that you will use for your own (ordinary user) principal, and pass this password as the first argument of the script:
./principals.sh mypassword
This script saves the service keytabs in the current working directory in a file called services.keytab. We'll assume you have this file in the directory $HOME/kerb-setup/ and will use the full path $HOME/kerb-setup/services.keytab in the Hadoop configuration files below.
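You can verify the keytab contents with klist; it should list entries for the hdfs and mapred service principals referenced in the Hadoop configuration below:

klist -kt $HOME/kerb-setup/services.keytab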
Hadoop
Build
git clone git://git.apache.org/hadoop-common.git
cd hadoop-common
git checkout origin/branch-1.0.2
Remove dependency on java5
Open build.xml in an editor and remove the package target's dependency on docs and cn-docs, so that it looks like:
<target name="package" depends="compile, jar, javadoc, api-report, examples, tools-jar, jar-test, ant-tasks, package-librecordio" description="assembles multi-platform artifacts for distribution">
Run build
ant -Dcompile.native=true clean jsvc package
This produces a working Hadoop runtime in the directory $HOME/hadoop-common/build/hadoop-1.0.3-SNAPSHOT, but we still need to configure it to enable security-related features.
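As a quick sanity check (assuming JAVA_HOME is set in your environment), the freshly built runtime should report its version:

cd $HOME/hadoop-common/build/hadoop-1.0.3-SNAPSHOT
bin/hadoop version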
Configuration
In the configuration files below, replace $HOST with the output of `hostname -f` and $HOME with the output of `echo $HOME`.
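One way to do that substitution in place is a sed one-liner. This is just a sketch, assuming you save the three blocks below as conf/core-site.xml, conf/hdfs-site.xml and conf/mapred-site.xml; editing the files by hand works just as well:

HOST=`hostname -f`
sed -i -e "s|\$HOST|$HOST|g" -e "s|\$HOME|$HOME|g" conf/core-site.xml conf/hdfs-site.xml conf/mapred-site.xml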
conf/core-site.xml:

<configuration>
  <property>
    <name>hadoop.security.authentication</name>
    <value>kerberos</value>
  </property>
  <property>
    <name>hadoop.security.authorization</name>
    <value>true</value>
  </property>
  <property>
    <name>giraph.zkList</name>
    <value>localhost:2181</value>
  </property>
</configuration>
conf/hdfs-site.xml:

<configuration>
  <property>
    <name>dfs.block.access.token.enable</name>
    <value>true</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://$HOST:8020/</value>
  </property>
  <property>
    <name>dfs.namenode.keytab.file</name>
    <value>$HOME/kerb-setup/services.keytab</value>
  </property>
  <property>
    <name>dfs.namenode.kerberos.principal</name>
    <value>hdfs/_HOST@HADOOP.LOCALDOMAIN</value>
  </property>
  <property>
    <name>dfs.https.enable</name>
    <value>false</value>
  </property>
  <property>
    <name>dfs.webhdfs.enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>dfs.namenode.user.name</name>
    <value>hdfs</value>
  </property>
  <property>
    <name>dfs.http.address</name>
    <value>$HOST:8070</value>
  </property>
  <!-- NOTE: this is still needed even though https is not enabled. -->
  <property>
    <name>dfs.https.port</name>
    <value>8090</value>
  </property>
  <property>
    <name>dfs.datanode.address</name>
    <value>0.0.0.0:1004</value>
  </property>
  <property>
    <name>dfs.datanode.http.address</name>
    <value>0.0.0.0:1006</value>
  </property>
  <property>
    <name>dfs.datanode.keytab.file</name>
    <value>$HOME/kerb-setup/services.keytab</value>
  </property>
  <property>
    <name>dfs.datanode.kerberos.principal</name>
    <value>hdfs/_HOST@HADOOP.LOCALDOMAIN</value>
  </property>
</configuration>
conf/mapred-site.xml:

<configuration>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>10</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>$HOST:8030</value>
  </property>
  <property>
    <name>mapred.job.tracker.http.address</name>
    <value>0.0.0.0:8040</value>
  </property>
  <property>
    <name>mapred.task.tracker.http.address</name>
    <value>0.0.0.0:8050</value>
  </property>
  <property>
    <name>mapreduce.jobtracker.keytab.file</name>
    <value>$HOME/kerb-setup/services.keytab</value>
  </property>
  <property>
    <name>mapreduce.jobtracker.kerberos.principal</name>
    <value>mapred/_HOST@HADOOP.LOCALDOMAIN</value>
  </property>
  <property>
    <name>mapreduce.tasktracker.keytab.file</name>
    <value>$HOME/kerb-setup/services.keytab</value>
  </property>
  <property>
    <name>mapreduce.tasktracker.kerberos.principal</name>
    <value>mapred/_HOST@HADOOP.LOCALDOMAIN</value>
  </property>
</configuration>
Add the following to your $HOME/hadoop-common/build/hadoop-1.0.3-SNAPSHOT/conf/hadoop-env.sh, immediately below the # Extra Java CLASSPATH elements. Optional. line.

Note that the jars in the following HADOOP_CLASSPATH will only be present after they are fetched by Maven when you build Giraph (below). Therefore you should wait to start your Hadoop daemons until you've built Giraph.
export HADOOP_CLASSPATH=$HOME/.m2/repository/com/google/guava/guava/r09/guava-r09.jar:$HOME/.m2/repository/commons-io/commons-io/1.3.2/commons-io-1.3.2.jar:$HOME/.m2/repository/org/apache/zookeeper/zookeeper/3.3.3/zookeeper-3.3.3.jar:$HOME/.m2/repository/org/json/json/20090211/json-20090211.jar:$HOME/.m2/repository/net/iharder/base64/2.3.8/base64-2.3.8.jar
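After building Giraph, a small sketch like the following confirms that every jar on that classpath actually exists (run it in a shell where HADOOP_CLASSPATH is exported as above):

for jar in `echo $HADOOP_CLASSPATH | tr ':' ' '`; do
  test -f $jar || echo "missing: $jar"
done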
Giraph
Build
git clone git://git.apache.org/giraph.git
cd giraph
mvn -DskipTests -Phadoop_1.0 clean package
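If the build succeeds, the job jar used in the final step below should be present:

ls target/munged/giraph-0.2-SNAPSHOT.jar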
Configuration
Note the giraph.zkList setting in core-site.xml above.
Hadoop Daemon Startup
cd $HOME/hadoop-common/build/hadoop-1.0.3-SNAPSHOT
rm -rf /tmp/hadoop-`whoami`
bin/hadoop namenode -format
bin/hadoop namenode &
sleep 2
export HADOOP_SECURE_DN_USER=`whoami`
sudo -E bin/hadoop datanode &
bin/hadoop jobtracker &
sleep 2
bin/hadoop tasktracker &
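To confirm the daemons came up, jps (shipped with the JDK) should list NameNode, JobTracker and TaskTracker. The secure DataNode is started via jsvc under sudo, so it may not appear in your own user's jps output; check its log under logs/ if in doubt.

jps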
Zookeeper
Build
git clone git://git.apache.org/zookeeper.git
cd zookeeper
ant clean jar
Configuration
Create a conf/zoo.cfg file in your zookeeper directory:
dataDir=/tmp/zkdata
clientPort=2181
Startup
bin/zkServer.sh start-foreground
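Because start-foreground keeps ZooKeeper attached to the terminal, check from a second shell that it is answering on the address configured in giraph.zkList above (typing ls / at the client prompt should show the root znode):

bin/zkCli.sh -server localhost:2181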
Initialize your principal
kinit
You'll be asked for a password; use the same password that you chose when you ran principals.sh in the Set up principals section above.
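kinit returns silently on success; klist should then show a ticket-granting ticket for your principal in the HADOOP.LOCALDOMAIN realm:

klist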
Run your job!
cd $HOME/hadoop-common/build/hadoop-1.0.3-SNAPSHOT
bin/hadoop jar ~/giraph/target/munged/giraph-0.2-SNAPSHOT.jar org.apache.giraph.benchmark.PageRankBenchmark -e 1 -s 3 -v -V 50 -w 2
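Once the job is submitted it authenticates with your Kerberos ticket. From a second shell you can confirm the JobTracker accepted it, or watch progress on the JobTracker web UI on port 8040 (mapred.job.tracker.http.address above):

bin/hadoop job -list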