The following process was developed for launching Hadoop clusters in EC2 in order to benchmark Mahout's clustering algorithms using a large document set (see MAHOUT-588). Specifically, we used the ASF mail archives that have been parsed and converted to the Hadoop SequenceFile format (block-compressed) and saved to a public S3 folder: s3://asf-mail-archives/mahout-0.4/sequence-files. Overall, there are 6,094,444 key-value pairs in 283 files taking up around 5.7GB of disk space.
You can also use Amazon's Elastic MapReduce; see Mahout on Elastic MapReduce. However, using EC2 directly is slightly less expensive and provides greater visibility into the state of running jobs via the JobTracker Web UI. You can launch the EC2 cluster from your development machine; the following instructions were generated on an Ubuntu workstation. We assume that you have successfully worked through the Amazon EC2 Getting Started Guide.
Note, this work was supported in part by the Amazon Web Services Apache Projects Testing Program.
Launch Hadoop Cluster
Gather Amazon EC2 keys / security credentials
You will need the following:
AWS Account ID
Access Key ID
Secret Access Key
X.509 certificate and private key (e.g. cert-aws.pem and pk-aws.pem)
EC2 Key-Pair (ssh public and private keys) for the US-EAST region.
Please make sure the key-pair file's permissions allow owner read/write only, i.e. "-rw-------" (chmod 600 gsg-keypair.pem). You can create a key-pair for the US-East region using the Amazon console. If you are confused about any of these terms, please see: Understanding Access Credentials for AWS/EC2.
You should also export the EC2_PRIVATE_KEY and EC2_CERT environment variables to point to your AWS Certificate and Private Key files, for example:
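The paths below assume the certificate and private key files from the previous step are stored under /mnt/dev/aws; adjust them for your environment:

    export EC2_CERT=/mnt/dev/aws/cert-aws.pem
    export EC2_PRIVATE_KEY=/mnt/dev/aws/pk-aws.pem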
These are used by the EC2 API tools to authenticate with Amazon Web Services.
Install and Configure the Amazon EC2 API Tools
On Ubuntu, you'll need to enable the multiverse repository in /etc/apt/sources.list to find the ec2-api-tools package.
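A minimal sketch of the install, assuming multiverse has been enabled as described above (package names may vary by Ubuntu release):

    # uncomment the "multiverse" lines in /etc/apt/sources.list, then:
    sudo apt-get update
    sudo apt-get install ec2-api-tools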
Once installed, verify you have access to EC2 by executing:
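For example, list your instances (an empty result is fine if nothing is running yet); this relies on the EC2_PRIVATE_KEY and EC2_CERT variables exported earlier:

    ec2-describe-instances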
Install Hadoop 0.20.2 Locally
You need to install Hadoop locally in order to get access to the EC2 cluster deployment scripts. We use
/mnt/dev as the base working directory because this process was originally conducted on an EC2 instance; be sure to replace this path with the correct path for your environment as you work through these steps.
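A sketch of the local install, assuming the Apache archive URL below is still valid for the 0.20.2 release:

    cd /mnt/dev
    wget https://archive.apache.org/dist/hadoop/core/hadoop-0.20.2/hadoop-0.20.2.tar.gz
    tar xzf hadoop-0.20.2.tar.gz
    export HADOOP_HOME=/mnt/dev/hadoop-0.20.2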
The scripts we need are in $HADOOP_HOME/src/contrib/ec2. There are other approaches to deploying a Hadoop cluster on EC2, such as Cloudera's CDH3. We chose to use the contrib/ec2 scripts because they are very easy to use provided there is an existing Hadoop AMI available.
Open hadoop/src/contrib/ec2/bin/hadoop-ec2-env.sh in your editor and set the Amazon security variables to match your environment, for example:
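The values below are placeholders; substitute your own credentials and the key-pair you created earlier (variable names are those used by hadoop-ec2-env.sh):

    # Your Amazon account number.
    AWS_ACCOUNT_ID=<your-account-id>
    # Your Amazon AWS access key.
    AWS_ACCESS_KEY_ID=<your-access-key-id>
    # Your Amazon AWS secret access key.
    AWS_SECRET_ACCESS_KEY=<your-secret-access-key>
    # Location of your EC2 keys.
    EC2_KEYDIR=/mnt/dev/aws
    # Name of the US-East key-pair as registered with EC2 (not the file name).
    KEY_NAME=gsg-keypair
    # Local path to the key-pair's private key file.
    PRIVATE_KEY_PATH=/mnt/dev/aws/gsg-keypair.pem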
The value of PRIVATE_KEY_PATH should be the path to your EC2 key-pair .pem file, such as /mnt/dev/aws/gsg-keypair.pem. This key-pair must be created in the US-East region.
For Mahout, we recommend the following settings:
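A sketch of the Hadoop-specific settings in hadoop-ec2-env.sh; the exact S3 bucket hosting the Bixo Labs AMI is not reproduced here, so substitute the bucket for the AMI you intend to use:

    # Version of Hadoop to deploy (must match the AMI).
    HADOOP_VERSION=0.20.2
    # S3 bucket containing the public Hadoop 0.20.2 AMI (e.g. the Bixo Labs bucket).
    S3_BUCKET=<ami-bucket>
    # Keep the web ports closed; the UIs are reached through a SOCKS tunnel instead.
    ENABLE_WEB_PORTS=false
    # Instance type for the master and worker nodes.
    INSTANCE_TYPE=m1.xlarge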
You do not need to change any variables below the comment that reads "The following variables are only used when creating an AMI.".
These settings will create a cluster of EC2 xlarge instances using the Hadoop 0.20.2 AMI provided by Bixo Labs.
Launch Hadoop Cluster
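From the contrib/ec2 directory, the launch command takes the cluster name and the number of worker (slave) nodes; for the three-instance cluster described below:

    cd $HADOOP_HOME/src/contrib/ec2
    bin/hadoop-ec2 launch-cluster mahout-clustering 2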
This will launch 3 xlarge instances (two workers + one for the NameNode aka "master"). It may take up to 5 minutes to launch a cluster named "mahout-clustering"; watch the console for errors. The cluster will launch in the US-East region so you won't incur any data transfer fees to/from US-Standard S3 buckets. You can re-use the cluster name for launching other clusters of different sizes. Behind the scenes, the Hadoop scripts will create two EC2 security groups that configure the firewall for accessing your Hadoop cluster.
Assuming your cluster launched successfully, establish a SOCKS tunnel to your master node to access the JobTracker Web UI from your local browser.
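A sketch using the proxy command from the contrib/ec2 scripts (run from the same directory as the launch command and leave it running while you browse); if your copy of the scripts does not provide it, an equivalent SOCKS tunnel can be opened with ssh -D 6666 to the master's public hostname:

    bin/hadoop-ec2 proxy mahout-clustering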
This command will output the URLs for the JobTracker and NameNode Web UI, such as:
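The hostnames below are placeholders; the ports are the standard Hadoop 0.20 defaults (50030 for the JobTracker, 50070 for the NameNode):

    JobTracker http://ec2-<public-hostname>.compute-1.amazonaws.com:50030/
    NameNode   http://ec2-<public-hostname>.compute-1.amazonaws.com:50070/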
Set up FoxyProxy (Firefox plug-in)
Once the FoxyProxy plug-in is installed in Firefox, go to Options > FoxyProxy Standard > Options to set up a proxy on localhost:6666 for the JobTracker and NameNode Web UI URLs from the previous step. For more information, see the FoxyProxy documentation.
Now you are ready to run Mahout jobs in your cluster.
Launch Clustering Job from the Master Server
Log in to the master server:
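Again using the contrib/ec2 scripts from your local machine, the login command takes the cluster name:

    bin/hadoop-ec2 login mahout-clustering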
Hadoop does not start until all EC2 instances are running; check for Java processes on the master server using: ps waux | grep java
Since this is EC2, the most disk space on the master node is available under /mnt.
From a distribution
NOTE: Substitute in the appropriate version number/URLs as necessary. 0.4 is not the latest version of Mahout.
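A sketch of installing a Mahout distribution on the master, assuming the Apache archive URL below; per the note above, substitute the version you actually want:

    cd /mnt
    wget https://archive.apache.org/dist/mahout/0.4/mahout-distribution-0.4.tar.gz
    tar xzf mahout-distribution-0.4.tar.gz
    export MAHOUT_HOME=/mnt/mahout-distribution-0.4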
You'll want to increase the max heap size for the map/reduce child tasks (mapred.child.java.opts) and set an appropriate number of reduce tasks based on the size of your cluster.
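The contrib/ec2 deployment generates the site configuration on each node under the Hadoop install directory; assuming the AMI installs Hadoop to /usr/local/hadoop-0.20.2 (adjust the path for your AMI), edit it on the master:

    vi /usr/local/hadoop-0.20.2/conf/hadoop-site.xml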
(NOTE: if this file doesn't exist yet, then the cluster nodes are still starting up. Wait a few minutes and then try again.)
Add the following properties:
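A sketch of the two properties, using example values sized for the two-worker xlarge cluster above (4GB heap per task, 3 reducers per worker = 6 reduce tasks); tune them for your own cluster:

    <property>
      <name>mapred.child.java.opts</name>
      <value>-Xmx4096m</value>
    </property>
    <property>
      <name>mapred.reduce.tasks</name>
      <value>6</value>
    </property>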
You can safely run 3 reducers per node on EC2 xlarge instances with 4GB of max heap each. If you are using large instances, you may only be able to run 2 per node, or 1 if your jobs are CPU intensive.
Copy the vectors from S3 to HDFS
Use Hadoop's distcp command to copy the vectors from S3 to HDFS.
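A sketch, assuming you are copying the sequence files from the public bucket shown at the top of this page; substitute your own AWS credentials, and swap in the S3 path of the pre-computed vectors if you are copying those instead (the destination path and task timeout are examples):

    hadoop distcp -Dmapred.task.timeout=1800000 \
      s3n://<ACCESS_KEY_ID>:<SECRET_ACCESS_KEY>@asf-mail-archives/mahout-0.4/sequence-files \
      /asf-mail-archives/mahout-0.4/sequence-files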
The files are stored in the US-Standard S3 bucket so there is no charge for data transfer to your EC2 cluster, as it is running in the US-EAST region.
Launch the clustering job (from the master server)
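For example, a k-means run might look like the following; the input path (wherever your vectors live in HDFS), output paths, k, iteration count, convergence delta, and distance measure are all illustrative and should be tuned for your benchmark, and option names may differ slightly between Mahout versions:

    $MAHOUT_HOME/bin/mahout kmeans \
      -i /asf-mail-archives/mahout-0.4/tfidf-vectors \
      -c /asf-mail-archives/mahout-0.4/initial-clusters \
      -o /asf-mail-archives/mahout-0.4/kmeans-output \
      -k 100 -x 10 -cd 0.01 \
      -dm org.apache.mahout.common.distance.CosineDistanceMeasure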
You can monitor the job using the JobTracker Web UI through FoxyProxy.
Once the job completes, you can view the results using Mahout's cluster dumper.
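A sketch of the cluster dumper invocation; the cluster and dictionary paths are placeholders from a k-means run like the one above, and option names may vary slightly between Mahout versions:

    $MAHOUT_HOME/bin/mahout clusterdump \
      -s /asf-mail-archives/mahout-0.4/kmeans-output/clusters-10 \
      -d /asf-mail-archives/mahout-0.4/dictionary.file-0 \
      -dt sequencefile \
      -n 20 \
      -o clusters.txt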