...
DISCLAIMER: Hive has only been tested on Unix (Linux) and Mac systems using Java 1.6 for now,
although it may very well work on other similar platforms. It does not work on Cygwin.
...
Requirements
- Java 1.6
- Hadoop 0.20.x.
Installing Hive from a Stable Release
Start by downloading the most recent stable release of Hive from one of the Apache download mirrors (see Hive Releases).
...
In addition, you must create /tmp and /user/hive/warehouse (a.k.a. hive.metastore.warehouse.dir) and set them chmod g+w in HDFS before you can create a table in Hive.
Commands to perform this setup:

```
$ $HADOOP_HOME/bin/hadoop fs -mkdir /tmp
$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/hive/warehouse
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse
```
You may find it useful, though it's not necessary, to set HIVE_HOME:

```
$ export HIVE_HOME=<hive-install-dir>
```
...
For more information, see HCatalog Installation from Tarball and HCatalog CLI in the HCatalog manual.
Running WebHCat (Templeton)
...
For more information, see WebHCat Installation in the WebHCat manual.
Configuration management overview
- Hive by default gets its configuration from <install-dir>/conf/hive-default.xml
- The location of the Hive configuration directory can be changed by setting the HIVE_CONF_DIR environment variable.
- Configuration variables can be changed by (re-)defining them in <install-dir>/conf/hive-site.xml
- Log4j configuration is stored in <install-dir>/conf/hive-log4j.properties
- Hive configuration is an overlay on top of Hadoop; it inherits the Hadoop configuration variables by default.
- Hive configuration can be manipulated by:
  - Editing hive-site.xml and defining any desired variables (including Hadoop variables) in it
  - Using the set command from the CLI (see below)
  - Invoking Hive using the syntax $ bin/hive -hiveconf x1=y1 -hiveconf x2=y2, which sets the variables x1 and x2 to y1 and y2 respectively
  - Setting the HIVE_OPTS environment variable to "-hiveconf x1=y1 -hiveconf x2=y2", which does the same as above
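For example, the following two ways of starting the CLI are equivalent (mapred.reduce.tasks is just an illustrative variable; substitute whichever settings you need):

```
$ bin/hive -hiveconf mapred.reduce.tasks=2

$ export HIVE_OPTS="-hiveconf mapred.reduce.tasks=2"
$ bin/hive
```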
Runtime configuration
- Hive queries are executed as map-reduce jobs and, therefore, the behavior of such queries can be controlled by the Hadoop configuration variables.
- The CLI command SET can be used to set any Hadoop (or Hive) configuration variable. For example:

```
hive> SET mapred.job.tracker=myhost.mycompany.com:50030;
hive> SET -v;
```

The latter shows all the current settings. Without the -v option, only the variables that differ from the base Hadoop configuration are displayed.
Hive, Map-Reduce and Local-Mode
...
While this usually points to a map-reduce cluster with multiple nodes, Hadoop also offers a nifty option to run map-reduce jobs locally on the user's workstation. This can be very useful to run queries over small data sets; in such cases local mode execution is usually significantly faster than submitting jobs to a large cluster. Data is accessed transparently from HDFS. However, local mode only runs with one reducer and can be very slow processing larger data sets.
...
In addition, mapred.local.dir should point to a path that's valid on the local machine (for example /tmp/<username>/mapred/local). Otherwise, the user will get an exception allocating local disk space.
Starting with release 0.7, Hive also supports a mode to run map-reduce jobs in local-mode automatically. The relevant options are hive.exec.mode.local.auto, hive.exec.mode.local.auto.inputbytes.max, and hive.exec.mode.local.auto.tasks.max:

```
hive> SET hive.exec.mode.local.auto=false;
```
Note that this feature is disabled by default. If enabled, Hive analyzes the size of each map-reduce job in a query and may run it locally if the following thresholds are satisfied:
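To turn automatic local-mode on and adjust its thresholds, for example (the threshold values below are illustrative, not recommendations):

```
hive> SET hive.exec.mode.local.auto=true;
hive> SET hive.exec.mode.local.auto.inputbytes.max=134217728;
hive> SET hive.exec.mode.local.auto.tasks.max=4;
```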
...
Hive uses log4j for logging. By default logs are not emitted to the console by the CLI. The default logging level is WARN and the logs are stored in the folder /tmp/<user.name>/hive.log
If the user wishes, the logs can be emitted to the console by adding the arguments shown below:
...
Note that setting hive.root.logger via the 'set' command does not change logging properties, since they are determined at initialization time.
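Because the logger is fixed at initialization time, pass hive.root.logger when starting the CLI instead; for example:

```
$ bin/hive -hiveconf hive.root.logger=INFO,console
```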
...
When using local mode (using mapred.job.tracker=local), Hadoop/Hive execution logs are produced on the client machine itself. Starting with release 0.6, Hive uses hive-exec-log4j.properties (falling back to hive-log4j.properties only if it's missing) to determine where these logs are delivered by default. The default configuration file produces one log file per query executed in local mode and stores it under /tmp/<user.name>. The intent of providing a separate configuration file is to enable administrators to centralize execution log capture if desired (on an NFS file server, for example). Execution logs are invaluable for debugging run-time errors.
Error logs are very useful to debug problems. Please send them with any bugs (of which there are many!) to hive-dev@hadoop.apache.org.
DDL Operations
Creating Hive Tables
```
hive> CREATE TABLE pokes (foo INT, bar STRING);
```

creates a table called pokes with two columns, the first being an integer and the other a string.
```
hive> CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);
```

creates a table called invites with two columns and a partition column
called ds. The partition column is a virtual column. It is not part
of the data itself but is derived from the partition that a
particular dataset is loaded into.
By default, tables are assumed to be of text input format and the delimiters are assumed to be ^A (ctrl-a).
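If your data uses other delimiters, you can spell them out when creating the table. A sketch, assuming tab-separated text (the table name and delimiter here are just examples):

```
hive> CREATE TABLE tab_delimited (foo INT, bar STRING)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    > STORED AS TEXTFILE;
```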
Browsing through Tables
```
hive> SHOW TABLES;
```
lists all the tables.
```
hive> SHOW TABLES '.*s';
```
lists all the tables that end with 's'. The pattern matching follows Java regular
expressions. See this link for documentation: http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html.
```
hive> DESCRIBE invites;
```
shows the list of columns.
Altering and Dropping Tables
Table names can be changed and columns can be added or replaced:
```
hive> ALTER TABLE events RENAME TO 3koobecaf;
hive> ALTER TABLE pokes ADD COLUMNS (new_col INT);
hive> ALTER TABLE invites ADD COLUMNS (new_col2 INT COMMENT 'a comment');
hive> ALTER TABLE invites REPLACE COLUMNS (foo INT, bar STRING, baz INT COMMENT 'baz replaces new_col2');
```
Note that REPLACE COLUMNS replaces all existing columns and only changes the table's schema, not the data. The table must use a native SerDe. REPLACE COLUMNS can also be used to drop columns from the table's schema:
```
hive> ALTER TABLE invites REPLACE COLUMNS (foo INT COMMENT 'only keep the first column');
```
Dropping tables:
```
hive> DROP TABLE pokes;
```
...
Metadata is in an embedded Derby database whose disk storage location is determined by the Hive configuration variable named javax.jdo.option.ConnectionURL. By default this location is ./metastore_db (see conf/hive-default.xml).
Right now, in the default configuration, this metadata can only be seen by one user at a time.
The metastore can be stored in any database that is supported by JPOX. The location and the type of the RDBMS can be controlled by the two variables javax.jdo.option.ConnectionURL and javax.jdo.option.ConnectionDriverName.
Refer to the JDO (or JPOX) documentation for more details on supported databases.
The database schema is defined in the JDO metadata annotations file package.jdo at src/contrib/hive/metastore/src/model.
...
If you want to run the metastore as a network server so it can be accessed
from multiple nodes try HiveDerbyServerMode.
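As a sketch, pointing the metastore at an external MySQL database might look like the following in hive-site.xml (the host, database name, and driver setup are placeholders; the MySQL JDBC driver jar must also be on Hive's classpath):

```xml
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://dbhost/hive_metastore</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
```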
...
Loads a file that contains two columns separated by ctrl-a into the pokes table.
'LOCAL' signifies that the input file is on the local file system. If 'LOCAL' is omitted then it looks for the file in HDFS.
The keyword 'OVERWRITE' signifies that existing data in the table is deleted.
If the 'OVERWRITE' keyword is omitted, data files are appended to existing data sets.
...
- NO verification of data against the schema is performed by the load command.
- If the file is in HDFS, it is moved into the Hive-controlled file system namespace. The root of the Hive directory is specified by the option hive.metastore.warehouse.dir in hive-default.xml. We advise users to create this directory before trying to create tables via Hive.
...
The two LOAD statements above load data into two different partitions of the table
invites. Table invites must be created as partitioned by the key ds for this to succeed.
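A partitioned load names its target partition explicitly. A sketch of the shape of such a statement (the input path is a placeholder):

```
hive> LOAD DATA LOCAL INPATH '<path-to-file>' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');
```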
...
Some example queries are shown below. They are available in build/dist/examples/queries.
More are available in the Hive sources at ql/src/test/queries/positive.
SELECTS and FILTERS
```
hive> SELECT a.foo FROM invites a WHERE a.ds='2008-08-15';
```
selects column 'foo' from all rows of partition ds=2008-08-15
of the invites
table. The results are not
stored anywhere, but are displayed on the console.
Note that in all the examples that follow, INSERT
(into a hive table, local
directory or HDFS directory) is optional.
...
selects all rows from partition ds=2008-08-15
of the invites
table into an HDFS directory. The result data
is in files (depending on the number of mappers) in that directory.
NOTE: partition columns, if any, are selected by the use of *. They can also
be specified in the projection clauses.
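For example, naming the partition column in the projection instead of relying on *:

```
hive> SELECT a.foo, a.ds FROM invites a WHERE a.ds='2008-08-15';
```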
...
```
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/local_out' SELECT a.* FROM pokes a;
```
selects all rows from the pokes table into a local directory.
```
hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a;
hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a WHERE a.key < 100;
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/reg_3' SELECT a.* FROM events a;
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_4' SELECT a.invites, a.pokes FROM profiles a;
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT COUNT(*) FROM invites a WHERE a.ds='2008-08-15';
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT a.foo, a.bar FROM invites a;
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/sum' SELECT SUM(a.pc) FROM pc1 a;
```
selects the sum of a column. avg, min, and max can also be used. Note that for versions of Hive which don't include HIVE-287, you'll need to use COUNT(1) in place of COUNT(*).
...
This streams the data in the map phase through the script /bin/cat
(like hadoop streaming).
Similarly, streaming can be used on the reduce side (please see the Hive Tutorial for examples).
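A map-side streaming query of this shape can be sketched as follows, reusing the invites and events tables from earlier (any executable that reads tab-separated lines on stdin could stand in for /bin/cat):

```
hive> FROM invites a
    > INSERT OVERWRITE TABLE events
    > SELECT TRANSFORM(a.foo, a.bar) AS (oof, rab)
    > USING '/bin/cat'
    > WHERE a.ds > '2008-08-09';
```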
Simple Example Use Cases
MovieLens User Ratings
...
The format of Apache weblogs is customizable, although most webmasters use the default.
For default Apache weblog, we can create a table with the following command.
More about RegexSerDe can be found at http://issues.apache.org/jira/browse/HIVE-662.
```
add jar ../build/contrib/hive_contrib.jar;

CREATE TABLE apachelog (
  host STRING,
  identity STRING,
  user STRING,
  time STRING,
  request STRING,
  status STRING,
  size STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ \"]*|\".*\"))?",
  "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
)
STORED AS TEXTFILE;
```
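Once the table exists, the fields parsed out of each log line can be queried like ordinary columns; for example:

```
hive> SELECT host, request, status FROM apachelog WHERE status = '404' LIMIT 10;
```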