= Hive and Amazon Web Services =

Background

This document explores the different ways of leveraging Hive on Amazon Web Services - namely S3, EC2 and Elastic Map-Reduce.

Hadoop already has a long tradition of being run on EC2 and S3. These are well documented in the links below which are a must read:

The second document also has pointers on how to get started using EC2 and S3. For people who are new to S3 - there's a few helpful notes in S3 for n00bs section below. The rest of the documentation below assumes that the reader can launch a hadoop cluster in EC2, copy files into and out of S3 and run some simple Hadoop jobs.

Introduction to Hive and AWS

There are three separate questions to consider when running Hive on AWS:

  1. Where to run the Hive CLI from and store the metastore db (that contains table and schema definitions).
  2. How to define Hive tables over existing datasets (potentially those that are already in S3)
  3. How to dispatch Hive queries (which are all executed using one or more map-reduce programs) to a Hadoop cluster running in EC2.

We walk you through the choices involved here and show some practical case studies that contain detailed setup and configuration instructions.

Running the Hive CLI

The CLI takes in Hive queries, compiles them into a plan (commonly, but not always, consisting of map-reduce jobs) and then submits them to a Hadoop Cluster. While it depends on Hadoop libraries for this purpose - it is otherwise relatively independent of the Hadoop cluster itself. For this reason the CLI can be run from any node that has a Hive distribution, a Hadoop distribution, a Java Runtime Engine. It can submit jobs to any compatible hadoop cluster (whose version matches that of the Hadoop libraries that Hive is using) that it can connect to. The Hive CLI also needs to access table metadata. By default this is persisted by Hive via an embedded Derby database into a folder named metastore_db on the local file system (however state can be persisted in any database - including remote mysql instances).

There are two choices on where to run the Hive CLI from:

  1. Run Hive CLI from within EC2 - the Hadoop master node being the obvious choice. There are several problems with this approach:

2. Run Hive CLI remotely from outside EC2. In this case, the user installs a Hive distribution on a personal workstation, - the main trick with this option is connecting to the Hadoop cluster - both for submitting jobs and for reading and writing files to HDFS. The section on Running jobs from a remote machine details how this can be done. Case Study 1 goes into the setup for this in more detail. This option solves the problems mentioned above:

However - the one downside of Option 2 is that jar files are copied over to the Hadoop cluster for each map-reduce job. This can cause high latency in job submission as well as incur some AWS network transmission costs. Option 1 seems suitable for advanced users who have figured out a stable Hadoop and Hive (and potentially external libraries) configuration that works for them and can create a new AMI with the same.

Loading Data into Hive Tables

It is useful to go over the main storage choices for Hadoop/EC2 environment:

Considering these factors, the following makes sense in terms of Hive tables:

  1. For long-lived tables, use S3 based storage mechanisms
    2. For intermediate data and tmp tables, use HDFS

Case Study 1 shows you how to achieve such an arrangement using the S3N filesystem.

If the user is running Hive CLI from their personal workstation - they can also use Hive's 'load data local' commands as a convenient alternative (to dfs commands) to copy data from their local filesystems (accessible from their workstation) into tables defined over either HDFS or S3.

Submitting jobs to a Hadoop cluster

This applies particularly when Hive CLI is run remotely. A single Hive CLI session can switch across different hadoop clusters (especially as clusters are bought up and terminated). Only two configuration variables:

  1. Querying files in S3 using EC2, Hive and Hadoop

    Appendix

    <<Anchor(S3n00b)>>

    S3 for n00bs

    One of the things useful to understand is how S3 is used as a file system normally. Each S3 bucket can be considered as a root of a File System. Different files within this filesystem become objects stored in S3 - where the path name of the file (path components joined with '/') become the S3 key within the bucket and file contents become the value. Different tools like [S3Fox|https:--addons.mozilla.org-en-US-firefox-addon-3247] and native S3 !FileSystem in Hadoop (s3n) show a directory structure that's implied by the common prefixes found in the keys. Not all tools are able to create an empty directory. In particular - S3Fox does (by creating a empty key representing the directory). Other popular tools like aws, s3cmd and s3curl provide convenient ways of accessing S3 from the command line - but don't have the capability of creating empty directories.