HiveAws HivingS3nRemotely
Skip to end of metadata
Go to start of metadata

= Querying S3 files from your PC (using EC2, Hive and Hadoop) =

Usage Scenario

The scenario being covered here goes as follows:

  • A user has data stored in S3 - for example Apache log files archived in the cloud, or databases backed up into S3.
  • The user would like to declare tables over the data sets here and issue SQL queries against them
  • These SQL queries should be executed using computed resources provisioned from EC2. Ideally, the compute resources can be provisioned in proportion to the compute costs of the queries
  • Results from such queries that need to be retained for the long term can be stored back in S3

This tutorial walks through the steps required to accomplish this. Please send email to the hive-users mailing list in case of any problems with this tutorial.

Required Software

On the client side (PC), the following are required:

  • Any version of Hive that incorporates HIVE-467. (As of this writing - the relevant patches are not committed. For convenience sake - a Hive distribution with this patch can be downloaded from here.)
  • A version of Hadoop ec2 scripts (src/contrib/ec2/bin) with a fix for here]. Again - since the relevant patches are not committed yet - a version of Hadoop-19 ec2 scripts with the relevant patches applied can be downloaded from [[http:--jsensarma.com-downloads-hadoop-0.19-ec2-remote.tar.gz]. These scripts must be used to launch hadoop clusters in EC2.

Hive requires a local directory of Hadoop to run (specified using environment variable HADOOP_HOME). This can be a version of Hadoop compatible with the one running on the EC2 clusters. This recipe has been tried with hadoop distribution created from from branch-19.

It is assumed that the user can successfully launch Hive CLI (bin/hive from the Hive distribution) at this point.

Hive Setup

A few Hadoop configuration variables are required to be specified for all Hive sessions. These can be set using the hive cli as follows:

The values assigned to s3n keys are just an example and need to be filled in by the user as per their account details. Explanation for the rest of the values can be found in Configuration Guide section below.

Instead of specifying these command lines each time the CLI is bought up - we can store these persistently within hive-site.xml in the conf/ directory of the Hive installation (from where they will be picked up each time the CLI is launched.

Example Public Data Sets

Some example data files are provided in the S3 bucket data.s3ndemo.hive. We will use them for the sql examples in this tutorial:

  • s3n://data.s3ndemo.hive/kv - Key Value pairs in a text file
  • s3n://data.s3ndemo.hive/pkv - Key Value pairs in a directories that are partitioned by date
  • s3n://data.s3ndemo.hive/tpch/* - Eight directories containing data corresponding to the eight tables used by TPCH benchmark. The data is generated for a scale 10 (approx 10GB) database using the standard dbgen utility provided by TPCH.

    Setting up tables (DDL Statements)

    In this example - we will use HDFS as the default table store for Hive. We will make Hive tables over the files in S3 using the external tables functionality in Hive. Executing DDL commands does not require a functioning Hadoop cluster (since we are just setting up metadata):
  • Declare a simple table containing key-value pairs:
  • Declare a partitioned table over a nested directory containing key-value pairs and associate table partitions with dirs:
  • Declare a table over a TPCH table:

The TPCH DDL statements are slightly modified versions of the original TPCH statements (since Hive does not support all the data types used in TPCH). All the TPCH DDL statements for Hive can be be found [^TpchDdlForHive.sql]

Executing Queries

Hive can execute some queries without a Hadoop cluster. For example:

select * queries with limit clauses can be performed locally on the Hive CLI itself. If you are doing this - please note that:

  • fs.default.name should be set to file:/// in case CLI is not configured to use a working Hadoop cluster
  • Please Please do not select all the rows from large data sets. This will cause large amount of data to be downloaded from S3 to outside AWS and incur charges on the host account for these data sets!

Of course - the real fun is in doing some non-trivial queries using map-reduce. For this we will need a Hadoop cluster (finally!):

  1. Start a Hadoop cluster on EC2 (using directions from Hadoop-EC2 tutorial - but making sure to use a version of ec2 scripts with HADOOP-5839 applied! User is free to allocate any number of nodes they wish - although this tutorial was tried out with 10 nodes.
  2. Note down the public hostnames of the master node. For example, the public hostname maybe something like:
  • ec2-12-34-56-78.compute-1.amazonaws.com
    1.#3 Point the Hive CLI to use this Hadoop cluster by executing:
  • 1.#4 Set up a ssh tunnel via port 2600 to the Hadoop master. This can be done by executing the following from another terminal/window:
  • $ ssh -i <path to Hadoop private key path> -D 2600 ec2-12-34-56-78.compute-1.amazonaws.com

Now we are all setup. The sample query from TPCH (1.sql) can be tried as follows:

This launches one map-reduce job and on 10 nodes with default hadoop/hive settings - this took about 10 minutes. The results in this case are stored in HDFS and can be obtained by doing a dfs -cat /tmp/tpcresults/1-2.sql/* - either from bin/hadoop or from hive CLI. The query above differs from the TPCH query in skipping the order by clause - since it's not implemented by Hive currently.

Storing results back in S3

The results could also have been stored as a file in S3 directly, for example, we could alter the previous insert clause to read as:

As another alternative, one could have created an external table over S3 and stored the results directly in it, for example:

Similarly, one could have stored the results back in a partition in an partitioned external table as well.

Using tmp tables in HDFS

Currently, Hive does not have any explicit support tmp tables. But tables defined over HDFS in EC2 are like tmp tables since they only last for the duration of the Hadoop cluster. Since they are likely to be much faster than accessing S3 directly - they can be used to stage data that may be accessed repeatedly during a session. For example - for the TPCH dataset - one may want to do some analysis of customer attributes against order details - and it would be first beneficial to materialize the join of these data sets and then do repeated queries against it. Here's some example sql that would do the same:

Appendix

<<Anchor(ConfigHell)>>

Configuration Guide

The socket related options allow Hive CLI to communicate with the Hadoop cluster using a ssh tunnel (that will be established later). The job.ugi is specified to avoid issues with permissions on HDFS. mapred.map.tasks specification is a hack that works around HADOOP-5861 and may need to be set higher for large clusters. mapred.reduce.tasks is specified to let Hive determine the number of reducers (see HIVE-490).

Links

Labels
  • No labels