Introduction
This page describes how to set up Kerberos, Hadoop, Zookeeper and Giraph so that all components work together with Hadoop's security features enabled.
Disclaimer
This is intended for development only: it is not a best-practices guide to secure Hadoop deployment in a production setting. An actual production deployment will require additional, site-specific changes to enhance security.
Kerberos
Installation
```
sudo yum -y install krb5-server
```
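The krb5-server package provides only the KDC. If the Kerberos client tools (kinit, klist, kadmin) are not already present, you will want them as well; on a CentOS/RHEL-style system (assumed here, matching the yum command above) they come from krb5-workstation:
```
# Kerberos client utilities (kinit, klist, kadmin) -- assumes a yum-based system.
sudo yum -y install krb5-workstation
```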
Configuration
Edit /etc/krb5.conf so that it looks like the following:
```
[logging]
default = FILE:/var/log/krb5libs.log
kdc = FILE:/var/log/krb5kdc.log
admin_server = FILE:/var/log/kadmind.log
[libdefaults]
default_realm = HADOOP.LOCALDOMAIN
dns_lookup_realm = false
dns_lookup_kdc = false
ticket_lifetime = 1d
renew_lifetime = 7d
forwardable = yes
proxiable = yes
udp_preference_limit = 1
extra_addresses = 127.0.0.1
kdc_timesync = 1
ccache_type = 4
allow_weak_crypto = true
[realms]
HADOOP.LOCALDOMAIN = {
kdc = localhost:88
admin_server = localhost:749
}
[domain_realm]
localhost = HADOOP.LOCALDOMAIN
.compute-1.internal = HADOOP.LOCALDOMAIN
.internal = HADOOP.LOCALDOMAIN
internal = HADOOP.LOCALDOMAIN
[appdefaults]
pam = {
debug = false
ticket_lifetime = 36000
renew_lifetime = 36000
forwardable = true
krb4_convert = false
}
[login]
krb4_convert = true
krb4_get_tickets = false
```
Initialize Kerberos KDC service
```
$ sudo kdb5_util create -s
Loading random data
Initializing database '/var/kerberos/krb5kdc/principal' for realm 'HADOOP.LOCALDOMAIN',
master key name 'K/M@HADOOP.LOCALDOMAIN'
You will be prompted for the database Master Password.
It is important that you NOT FORGET this password.
Enter KDC database master key:
Re-enter KDC database master key to verify:
$
```
Startup
```
sudo service krb5kdc restart
```
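As an optional sanity check, you can confirm that the KDC is running and look at its log (the log path comes from the [logging] section above):
```
# Optional: verify the KDC started and inspect its log.
sudo service krb5kdc status
sudo tail /var/log/krb5kdc.log
```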
Set up principals
Set up the principals with the principals.sh script from https://github.com/ekoontz/kerb-setup. Run it as a normal user who has sudo privileges: it will call sudo as needed. Choose a password that you will use for your own (ordinary user) principal, and pass this password as the first argument of the script:
```
./principals.sh mypassword
```
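If you want to verify that the principals were created, one way (assuming the KDC database is on this same machine, as set up above) is to list them with kadmin.local:
```
# Optional: list all principals in the local KDC database.
sudo kadmin.local -q "listprincs"
```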
This script saves the service keys in a keytab file called services.keytab in the current working directory. We'll assume you have this file in the directory $HOME/kerb-setup/ and will use the full path $HOME/kerb-setup/services.keytab in the Hadoop configuration files below.
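You can also inspect the keytab to see which service principals it contains:
```
# Optional: list the principals stored in the generated keytab.
klist -kt $HOME/kerb-setup/services.keytab
```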
Hadoop
Build
```
git clone git://git.apache.org/hadoop-common.git
cd hadoop-common
git checkout origin/branch-1.0.2
```
Remove dependency on java5
Open build.xml in an editor and remove the package target's dependency on docs and cn-docs (the targets that pull in the Java 5 dependency), so that it looks like:
```
<target name="package" depends="compile, jar, javadoc, api-report, examples, tools-jar, jar-test, ant-tasks, package-librecordio"
description="assembles multi-platform artifacts for distribution">
```
Run build
```
ant -Dcompile.native=true clean jsvc package
```
This produces a working Hadoop runtime in the directory $HOME/hadoop-common/build/hadoop-1.0.3-SNAPSHOT, but we still need to configure it to enable the security-related features.
Configuration
In the configuration files below, replace $HOST with the output of `hostname -f` and $HOME with the output of `echo $HOME`.
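For example, a quick way to print the values you will be substituting:
```
# Print the values to use for $HOST and $HOME in the configuration files below.
echo "HOST = $(hostname -f)"
echo "HOME = $HOME"
```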
$HOME/hadoop-common/build/hadoop-1.0.3-SNAPSHOT/conf/core-site.xml:
```
<configuration>
<property>
<name>hadoop.security.authentication</name>
<value>kerberos</value>
</property>
<property>
<name>hadoop.security.authorization</name>
<value>true</value>
</property>
<property>
<name>giraph.zkList</name>
<value>localhost:2181</value>
</property>
</configuration>
```
$HOME/hadoop-common/build/hadoop-1.0.3-SNAPSHOT/conf/hdfs-site.xml:
```
<configuration>
<property>
<name>dfs.block.access.token.enable</name>
<value>true</value>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://$HOST:8020/</value>
</property>
<property>
<name>dfs.namenode.keytab.file</name>
<value>$HOME/kerb-setup/services.keytab</value>
</property>
<property>
<name>dfs.namenode.kerberos.principal</name>
<value>hdfs/_HOST@HADOOP.LOCALDOMAIN</value>
</property>
<property>
<name>dfs.https.enable</name>
<value>false</value>
</property>
<property>
<name>dfs.webhdfs.enabled</name>
<value>false</value>
</property>
<property>
<name>dfs.namenode.user.name</name>
<value>hdfs</value>
</property>
<property>
<name>dfs.http.address</name>
<value>$HOST:8070</value>
</property>
<!-- NOTE: this is still needed even though https is not enabled. -->
<property>
<name>dfs.https.port</name>
<value>8090</value>
</property>
<property>
<name>dfs.datanode.address</name>
<value>0.0.0.0:1004</value>
</property>
<property>
<name>dfs.datanode.http.address</name>
<value>0.0.0.0:1006</value>
</property>
<property>
<name>dfs.datanode.keytab.file</name>
<value>$HOME/kerb-setup/services.keytab</value>
</property>
<property>
<name>dfs.datanode.kerberos.principal</name>
<value>hdfs/_HOST@HADOOP.LOCALDOMAIN</value>
</property>
</configuration>
```
$HOME/hadoop-common/build/hadoop-1.0.3-SNAPSHOT/conf/mapred-site.xml:
```
<configuration>
<property>
<name>mapred.tasktracker.map.tasks.maximum</name>
<value>10</value>
</property>
<property>
<name>mapred.job.tracker</name>
<value>$HOST:8030</value>
</property>
<property>
<name>mapred.job.tracker.http.address</name>
<value>0.0.0.0:8040</value>
</property>
<property>
<name>mapred.task.tracker.http.address</name>
<value>0.0.0.0:8050</value>
</property>
<property>
<name>mapreduce.jobtracker.keytab.file</name>
<value>$HOME/kerb-setup/services.keytab</value>
</property>
<property>
<name>mapreduce.jobtracker.kerberos.principal</name>
<value>mapred/_HOST@HADOOP.LOCALDOMAIN</value>
</property>
<property>
<name>mapreduce.tasktracker.keytab.file</name>
<value>$HOME/kerb-setup/services.keytab</value>
</property>
<property>
<name>mapreduce.tasktracker.kerberos.principal</name>
<value>mapred/_HOST@HADOOP.LOCALDOMAIN</value>
</property>
</configuration>
```
Add the following to your $HOME/hadoop-common/build/hadoop-1.0.3-SNAPSHOT/conf/hadoop-env.sh, immediately below the line that reads # Extra Java CLASSPATH elements. Optional.
Note that the jars in the following HADOOP_CLASSPATH will only be present after Maven fetches them when you build Giraph (below). Therefore you should wait to start your Hadoop daemons until you've built Giraph.
```
export HADOOP_CLASSPATH=$HOME/.m2/repository/com/google/guava/guava/r09/guava-r09.jar:$HOME/.m2/repository/commons-io/commons-io/1.3.2/commons-io-1.3.2.jar:$HOME/.m2/repository/org/apache/zookeeper/zookeeper/3.3.3/zookeeper-3.3.3.jar:$HOME/.m2/repository/org/json/json/20090211/json-20090211.jar:$HOME/.m2/repository/net/iharder/base64/2.3.8/base64-2.3.8.jar
```
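Once Giraph has been built (next section), an optional sanity check is to confirm that every jar on HADOOP_CLASSPATH exists in your local Maven repository. This sketch assumes you run it in a shell where the export above has been applied (for example, after pasting it in):
```
# Optional: run after building Giraph to confirm the HADOOP_CLASSPATH jars exist.
for jar in $(echo "$HADOOP_CLASSPATH" | tr ':' ' '); do
  if [ -f "$jar" ]; then echo "found:   $jar"; else echo "MISSING: $jar"; fi
done
```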
Giraph
Build
```
git clone git://git.apache.org/giraph.git
cd giraph
mvn -DskipTests -Phadoop_1.0 clean package
```
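The job in the last section refers to the jar at ~/giraph/target/munged/giraph-0.2-SNAPSHOT.jar; the version in the name may differ depending on when you cloned, so it is worth confirming what the build produced (this assumes you cloned Giraph under your home directory, as above):
```
# Confirm the jar used in the "Run your job!" step below was built.
ls ~/giraph/target/munged/*.jar
```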
Configuration
Note the giraph.zkList setting in core-site.xml above: it points Giraph at the standalone Zookeeper instance configured below.
Hadoop Daemon Startup
```
cd $HOME/hadoop-common/build/hadoop-1.0.3-SNAPSHOT
rm -rf /tmp/hadoop-`whoami`
bin/hadoop namenode -format
bin/hadoop namenode &
sleep 2
export HADOOP_SECURE_DN_USER=`whoami`
sudo -E bin/hadoop datanode &
bin/hadoop jobtracker &
sleep 2
bin/hadoop tasktracker &
```
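You can check which daemons are up with jps, the JDK's Java process lister. The NameNode, JobTracker and TaskTracker run as your own user; the DataNode was started via sudo, so it may only appear under sudo jps:
```
# List your own Hadoop JVMs (expect NameNode, JobTracker, TaskTracker).
jps
# The secure DataNode was started via sudo, so check root-owned JVMs too.
sudo jps
```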
Zookeeper
Build
```
git clone git://git.apache.org/zookeeper.git
cd zookeeper
ant clean jar
```
Configuration
Create a conf/zoo.cfg file in your zookeeper directory:
```
dataDir=/tmp/zkdata
clientPort=2181
```
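As a small optional precaution, you can create the dataDir yourself before starting the server (Zookeeper will normally create it, but doing so explicitly avoids surprises with a stale or wrongly-owned /tmp/zkdata):
```
# Optional: pre-create the Zookeeper data directory named in zoo.cfg.
mkdir -p /tmp/zkdata
```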
Startup
```
bin/zkServer.sh start-foreground
```
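Since start-foreground keeps Zookeeper attached to this terminal, you can check it from a second shell using the standard ruok four-letter command (this assumes nc/netcat is installed); a healthy server replies imok:
```
# From another terminal: ask Zookeeper whether it is healthy. Expected reply: imok
echo ruok | nc localhost 2181
```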
Initialize your principal
You'll be asked for a password; use the same password that you chose when you ran principals.sh in the Set up principals section above.
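A minimal sketch of this step, assuming principals.sh created a user principal matching your Unix login in the HADOOP.LOCALDOMAIN realm (so kinit can infer the principal name):
```
# Obtain a ticket for your own (ordinary user) principal...
kinit
# ...and confirm that the ticket cache now holds it.
klist
```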
Run your job!
```
cd $HOME/hadoop-common/build/hadoop-1.0.3-SNAPSHOT
bin/hadoop jar ~/giraph/target/munged/giraph-0.2-SNAPSHOT.jar \
  org.apache.giraph.benchmark.PageRankBenchmark -e 1 -s 3 -v -V 50 -w 2
```