Table of Contents
- Installation and Configuration
- DDL Operations
- DML Operations
- SQL Operations
- Simple Example Use Cases
DISCLAIMER: Hive has only been tested on Unix (Linux) and Mac systems using Java 1.6 for now, although it may very well work on other similar platforms. It does not work on Cygwin.
Installation and Configuration
Requirements:
- Java 1.6
- Hadoop 0.20.x
Installing Hive from a Stable Release
Start by downloading the most recent stable release of Hive from one of the Apache download mirrors (see Hive Releases).
Next you need to unpack the tarball. This will result in the creation of a subdirectory named hive-x.y.z (where x.y.z is the release number):
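$ tar -xzvf hive-x.y.z.tar.gz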
Set the environment variable HIVE_HOME to point to the installation directory:
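$ cd hive-x.y.z
$ export HIVE_HOME=$(pwd)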
Finally, add $HIVE_HOME/bin to your PATH:
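$ export PATH=$HIVE_HOME/bin:$PATH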
Building Hive from Source
The Hive SVN repository is located here: http://svn.apache.org/repos/asf/hive/trunk
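A typical checkout and build (assuming the ant-based build Hive used at the time) looks like:
$ svn co http://svn.apache.org/repos/asf/hive/trunk hive
$ cd hive
$ ant clean package
$ cd build/dist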
In the rest of the page, we use build/dist and <install-dir> interchangeably.
Compile Hive on Hadoop 23
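For example (a sketch; the Hadoop version properties shown here are illustrative and should match the 0.23 release you are actually building against):
$ ant clean package -Dhadoop.version=0.23.3 -Dhadoop-0.23.version=0.23.3 -Dhadoop.mr.rev=23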
Running Hive
Hive uses Hadoop, so:
- you must have Hadoop in your path, OR
- export HADOOP_HOME=<hadoop-install-dir>
In addition, you must create /tmp and /user/hive/warehouse (a.k.a. hive.metastore.warehouse.dir) and set them chmod g+w in HDFS before you can create a table in Hive.
Commands to perform this setup:
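$ $HADOOP_HOME/bin/hadoop fs -mkdir /tmp
$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/hive/warehouse
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse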
You may find it useful, though it's not necessary, to set HIVE_HOME:
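$ export HIVE_HOME=<hive-install-dir>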
To use the Hive command line interface (CLI) from the shell:
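$ $HIVE_HOME/bin/hive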
Running HCatalog
To run the HCatalog server from the shell in Hive release 0.11.0 and later:
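$ $HIVE_HOME/hcatalog/sbin/hcat_server.sh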
To use the HCatalog command line interface (CLI) in Hive release 0.11.0 and later:
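$ $HIVE_HOME/hcatalog/bin/hcat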
Running WebHCat (Templeton)
To run the WebHCat server from the shell in Hive release 0.11.0 and later:
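$ $HIVE_HOME/hcatalog/sbin/webhcat_server.sh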
Configuration Management Overview
- Hive by default gets its configuration from <install-dir>/conf/hive-default.xml
- The location of the Hive configuration directory can be changed by setting the HIVE_CONF_DIR environment variable.
- Configuration variables can be changed by (re-)defining them in <install-dir>/conf/hive-site.xml
- Log4j configuration is stored in <install-dir>/conf/hive-log4j.properties
- Hive configuration is an overlay on top of Hadoop – it inherits the Hadoop configuration variables by default.
- Hive configuration can be manipulated by:
- Editing hive-site.xml and defining any desired variables (including Hadoop variables) in it
- From the CLI using the set command (see below)
- Invoking hive using the syntax:
$ bin/hive -hiveconf x1=y1 -hiveconf x2=y2
this sets the variables x1 and x2 to y1 and y2 respectively
- Setting the HIVE_OPTS environment variable to "-hiveconf x1=y1 -hiveconf x2=y2", which does the same as above.
Runtime Configuration
- Hive queries are executed using map-reduce jobs and, therefore, the behavior of such queries can be controlled by the Hadoop configuration variables.
- The CLI command 'SET' can be used to set any Hadoop (or Hive) configuration variable. For example:
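hive> SET mapred.job.tracker=myhost.mycompany.com:50030;
hive> SET -v;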
The latter shows all the current settings. Without the -v option, only the variables that differ from the base Hadoop configuration are displayed.
Hive, Map-Reduce and Local-Mode
The Hive compiler generates map-reduce jobs for most queries. These jobs are then submitted to the Map-Reduce cluster indicated by the variable mapred.job.tracker.
While this usually points to a map-reduce cluster with multiple nodes, Hadoop also offers a nifty option to run map-reduce jobs locally on the user's workstation. This can be very useful to run queries over small data sets – in such cases local mode execution is usually significantly faster than submitting jobs to a large cluster. Data is accessed transparently from HDFS. Conversely, local mode only runs with one reducer and can be very slow processing larger data sets.
Starting with release 0.7, Hive fully supports local mode execution. To enable this, the user can set the following option:
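hive> SET mapred.job.tracker=local;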
In addition, mapred.local.dir should point to a path that's valid on the local machine (for example /tmp/<username>/mapred/local). Otherwise, the user will get an exception allocating local disk space.
Starting with release 0.7, Hive also supports a mode to run map-reduce jobs in local-mode automatically. The relevant options are hive.exec.mode.local.auto, hive.exec.mode.local.auto.inputbytes.max, and hive.exec.mode.local.auto.tasks.max:
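hive> SET hive.exec.mode.local.auto=true;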
Note that this feature is disabled by default. If enabled, Hive analyzes the size of each map-reduce job in a query and may run it locally if the following thresholds are satisfied:
- The total input size of the job is lower than hive.exec.mode.local.auto.inputbytes.max (128MB by default)
- The total number of map-tasks is less than hive.exec.mode.local.auto.tasks.max (4 by default)
- The total number of reduce tasks required is 1 or 0.
So for queries over small data sets, or for queries with multiple map-reduce jobs where the input to subsequent jobs is substantially smaller (because of reduction/filtering in the prior job), jobs may be run locally.
Note that there may be differences in the runtime environment of Hadoop server nodes and the machine running the Hive client (because of different jvm versions or different software libraries). This can cause unexpected behavior/errors while running in local mode. Also note that local mode execution is done in a separate, child jvm (of the Hive client). If the user so wishes, the maximum amount of memory for this child jvm can be controlled via the option
hive.mapred.local.mem. By default, it's set to zero, in which case Hive lets Hadoop determine the default memory limits of the child jvm.
Hive Logging
Hive uses log4j for logging. By default logs are not emitted to the console by the CLI. The default logging level is
WARN for Hive releases prior to 0.13.0. Starting with Hive 0.13.0, the default logging level is
INFO. The logs are stored in the folder /tmp/<user.name>, in the file hive.log.
Note: In local mode, the log file name is ".log" instead of "hive.log". This is a bug which will be fixed in release 0.13.0 (see HIVE-5528 and HIVE-5676).
If the user wishes, the logs can be emitted to the console by adding the arguments shown below:
bin/hive -hiveconf hive.root.logger=INFO,console
Alternatively, the user can change the logging level only by using:
bin/hive -hiveconf hive.root.logger=INFO,DRFA
Note that setting hive.root.logger via the 'set' command does not change logging properties since they are determined at initialization time.
Hive also stores query logs on a per-Hive-session basis in /tmp/<user.name>/, but this can be configured in hive-site.xml with the hive.querylog.location property.
Logging during Hive execution on a Hadoop cluster is controlled by Hadoop configuration. Usually Hadoop will produce one log file per map and reduce task stored on the cluster machine(s) where the task was executed. The log files can be obtained by clicking through to the Task Details page from the Hadoop JobTracker Web UI.
When using local mode (using
mapred.job.tracker=local), Hadoop/Hive execution logs are produced on the client machine itself. Starting with release 0.6, Hive uses the
hive-exec-log4j.properties (falling back to
hive-log4j.properties only if it's missing) to determine where these logs are delivered by default. The default configuration file produces one log file per query executed in local mode and stores it under
/tmp/<user.name>. The intent of providing a separate configuration file is to enable administrators to centralize execution log capture if desired (on an NFS file server, for example). Execution logs are invaluable for debugging run-time errors.
Error logs are very useful to debug problems. Please send them with any bugs (of which there are many!) to hive-dev@hadoop.apache.org.
DDL Operations
The Hive DDL operations are documented in Hive Data Definition Language.
Creating Hive Tables
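hive> CREATE TABLE pokes (foo INT, bar STRING);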
creates a table called pokes with two columns, the first being an integer and the other a string.
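hive> CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);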
creates a table called invites with two columns and a partition column called ds. The partition column is a virtual column. It is not part of the data itself but is derived from the partition that a particular dataset is loaded into.
By default, tables are assumed to be of text input format and the delimiters are assumed to be ^A (ctrl-a).
Browsing through Tables
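hive> SHOW TABLES;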
lists all the tables.
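hive> SHOW TABLES '.*s';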
lists all the tables that end with 's'. The pattern matching follows Java regular expressions. Check out this link for documentation: http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html.
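hive> DESCRIBE invites;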
shows the list of columns.
Altering and Dropping Tables
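Table names can be changed and columns can be added or replaced:
hive> ALTER TABLE events RENAME TO 3koobecaf;
hive> ALTER TABLE pokes ADD COLUMNS (new_col INT);
hive> ALTER TABLE invites ADD COLUMNS (new_col2 INT COMMENT 'a comment');
hive> ALTER TABLE invites REPLACE COLUMNS (foo INT, bar STRING, baz INT COMMENT 'baz replaces new_col2');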
Note that REPLACE COLUMNS replaces all existing columns and only changes the table's schema, not the data. The table must use a native SerDe. REPLACE COLUMNS can also be used to drop columns from the table's schema:
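hive> ALTER TABLE invites REPLACE COLUMNS (foo INT COMMENT 'only keep the first column');
Dropping tables:
hive> DROP TABLE pokes;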
Metadata Store
Metadata is in an embedded Derby database whose disk storage location is determined by the Hive configuration variable named javax.jdo.option.ConnectionURL. By default this location is ./metastore_db (see conf/hive-default.xml).
Right now, in the default configuration, this metadata can only be seen by one user at a time.
The metastore can be stored in any database that is supported by JPOX. The location and the type of the RDBMS can be controlled by the two variables javax.jdo.option.ConnectionURL and javax.jdo.option.ConnectionDriverName. Refer to JDO (or JPOX) documentation for more details on supported databases. The database schema is defined in the JDO metadata annotations file package.jdo at src/contrib/hive/metastore/src/model.
In the future, the metastore itself can be a standalone server.
If you want to run the metastore as a network server so it can be accessed from multiple nodes, see Hive Using Derby in Server Mode.
DML Operations
The Hive DML operations are documented in Hive Data Manipulation Language.
Loading data from flat files into Hive:
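hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;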
Loads a file that contains two columns separated by ctrl-a into the pokes table. 'LOCAL' signifies that the input file is on the local file system. If 'LOCAL' is omitted then it looks for the file in HDFS.
The keyword 'OVERWRITE' signifies that existing data in the table is deleted. If the 'OVERWRITE' keyword is omitted, data files are appended to existing data sets.
- NO verification of data against the schema is performed by the load command.
- If the file is in hdfs, it is moved into the Hive-controlled file system namespace.
The root of the Hive directory is specified by the option hive.metastore.warehouse.dir in hive-default.xml. We advise users to create this directory before trying to create tables via Hive.
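hive> LOAD DATA LOCAL INPATH './examples/files/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');
hive> LOAD DATA LOCAL INPATH './examples/files/kv3.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-08');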
The two LOAD statements above load data into two different partitions of the table invites. Table invites must be created as partitioned by the key ds for this to succeed.
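hive> LOAD DATA INPATH '/user/myname/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');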
The above command will load data from an HDFS file/directory to the table.
Note that loading data from HDFS will result in moving the file/directory. As a result, the operation is almost instantaneous.
SQL Operations
The Hive query operations are documented in Select.
Some example queries are shown below. They are available in build/dist/examples/queries.
More are available in the Hive sources at ql/src/test/queries/positive.
SELECTS and FILTERS
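hive> SELECT a.foo FROM invites a WHERE a.ds='2008-08-15';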
selects column 'foo' from all rows of partition
ds=2008-08-15 of the
invites table. The results are not stored anywhere, but are displayed on the console.
Note that in all the examples that follow,
INSERT (into a Hive table, local directory or HDFS directory) is optional.
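hive> INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM invites a WHERE a.ds='2008-08-15';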
selects all rows from partition
ds=2008-08-15 of the
invites table into an HDFS directory. The result data is in files (depending on the number of mappers) in that directory.
NOTE: partition columns, if any, are selected by the use of *. They can also be specified in the projection clauses.
Partitioned tables must always have a partition selected in the
WHERE clause of the statement.
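hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/local_out' SELECT a.* FROM pokes a;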
selects all rows from pokes table into a local directory.
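hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/sum' SELECT SUM(a.pc) FROM pc1 a;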
selects the sum of a column. The avg, min, or max can also be used. Note that for versions of Hive which don't include HIVE-287, you'll need to use COUNT(1) in place of COUNT(*).
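STREAMING
hive> FROM invites a INSERT OVERWRITE TABLE events SELECT TRANSFORM(a.foo, a.bar) AS (oof, rab) USING '/bin/cat' WHERE a.ds > '2008-08-09';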
This streams the data in the map phase through the script
/bin/cat (like Hadoop streaming).
Similarly – streaming can be used on the reduce side (please see the Hive Tutorial for examples).
Simple Example Use Cases
MovieLens User Ratings
First, create a table with tab-delimited text file format:
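CREATE TABLE u_data (
  userid INT,
  movieid INT,
  rating INT,
  unixtime STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;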
Then, download and extract the data files:
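wget http://files.grouplens.org/datasets/movielens/ml-100k.zip
unzip ml-100k.zip
(If this GroupLens URL has moved, download the MovieLens 100k dataset from grouplens.org and extract it locally.)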
And load it into the table that was just created:
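LOAD DATA LOCAL INPATH 'ml-100k/u.data' OVERWRITE INTO TABLE u_data;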
Count the number of rows in table u_data:
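SELECT COUNT(*) FROM u_data;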
Note that for versions of Hive which don't include HIVE-287, you'll need to use COUNT(1) in place of COUNT(*).
Now we can do some complex data analysis on the table u_data:
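Create weekday_mapper.py:
import sys
import datetime

# Read tab-separated records from stdin and replace the unixtime
# column with the day of the week before re-emitting the record.
for line in sys.stdin:
  line = line.strip()
  userid, movieid, rating, unixtime = line.split('\t')
  weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
  print '\t'.join([userid, movieid, rating, str(weekday)])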
Use the mapper script:
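CREATE TABLE u_data_new (
  userid INT,
  movieid INT,
  rating INT,
  weekday INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

add FILE weekday_mapper.py;

INSERT OVERWRITE TABLE u_data_new
SELECT
  TRANSFORM (userid, movieid, rating, unixtime)
  USING 'python weekday_mapper.py'
  AS (userid, movieid, rating, weekday)
FROM u_data;

SELECT weekday, COUNT(*)
FROM u_data_new
GROUP BY weekday;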
Note that if you're using Hive 0.5.0 or earlier you will need to use COUNT(1) in place of COUNT(*).
Apache Weblog Data
The format of Apache weblogs is customizable, though most webmasters use the default.
For default Apache weblog, we can create a table with the following command.
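This sketch follows the RegexSerDe example in the Hive contrib package; treat the jar path and the regular expression as starting points to adjust for your installation and log format:
add jar ../build/contrib/hive_contrib.jar;

CREATE TABLE apachelog (
  host STRING,
  identity STRING,
  user STRING,
  time STRING,
  request STRING,
  status STRING,
  size STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^]*) ([^]*) ([^]*) (-|\\[^\\]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ \"]*|\".*\"))?"
)
STORED AS TEXTFILE;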