Table of Contents
DISCLAIMER: Hive has only been tested on Unix (Linux) and Mac systems using Java 1.6 for now – although it may very well work on other similar platforms. It does not work on Cygwin.
...
Next you need to unpack the tarball. This will result in the creation of a subdirectory named hive-x.y.z:
No Format |
---|
$ tar -xzvf hive-x.y.z.tar.gz
|
Set the environment variable HIVE_HOME to point to the installation directory:
No Format |
---|
$ cd hive-x.y.z
$ export HIVE_HOME=$(pwd)
|
Finally, add $HIVE_HOME/bin to your PATH:
No Format |
---|
$ export PATH=$HIVE_HOME/bin:$PATH
|
...
The Hive SVN repository is located here: http://svn.apache.org/repos/asf/hive/trunk
No Format |
---|
$ svn co http://svn.apache.org/repos/asf/hive/trunk hive
$ cd hive
$ ant clean package
$ cd build/dist
$ ls
README.txt
bin/ (all the shell scripts)
lib/ (required jar files)
conf/ (configuration files)
examples/ (sample input and query files)
|
...
Compile Hive against Hadoop 0.23
No Format |
---|
$ svn co http://svn.apache.org/repos/asf/hive/trunk hive
$ cd hive
$ ant clean package -Dhadoop.version=0.23.3 -Dhadoop-0.23.version=0.23.3 -Dhadoop.mr.rev=23
$ ant clean package -Dhadoop.version=2.0.0-alpha -Dhadoop-0.23.version=2.0.0-alpha -Dhadoop.mr.rev=23
|
...
Commands to perform this setup:
No Format |
---|
$ $HADOOP_HOME/bin/hadoop fs -mkdir /tmp
$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/hive/warehouse
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse
|
You may find it useful, though it's not necessary, to set HIVE_HOME:
No Format |
---|
$ export HIVE_HOME=<hive-install-dir>
|
To use the Hive command line interface (CLI) from the shell:
No Format |
---|
$ $HIVE_HOME/bin/hive
|
Running HCatalog
To run the HCatalog server from the shell in Hive release 0.11.0 and later:
No Format |
---|
$ $HIVE_HOME/hcatalog/sbin/hcat_server.sh
|
To use the HCatalog command line interface (CLI) in Hive release 0.11.0 and later:
No Format |
---|
$ $HIVE_HOME/hcatalog/bin/hcat
|
...
To run the WebHCat server from the shell in Hive release 0.11.0 and later:
No Format |
---|
$ $HIVE_HOME/hcatalog/sbin/webhcat_server.sh
|
...
- Hive queries are executed using map-reduce jobs and, therefore, the behavior of such queries can be controlled by the Hadoop configuration variables.
The CLI command 'SET' can be used to set any Hadoop (or Hive) configuration variable. For example:
No Format |
---|
hive> SET mapred.job.tracker=myhost.mycompany.com:50030;
hive> SET -v;
|
The latter shows all the current settings. Without the -v option, only the variables that differ from the base Hadoop configuration are displayed.
Hive, Map-Reduce and Local-Mode
The Hive compiler generates map-reduce jobs for most queries. These jobs are then submitted to the Map-Reduce cluster indicated by the variable:
No Format |
---|
mapred.job.tracker
|
While this usually points to a map-reduce cluster with multiple nodes, Hadoop also offers a nifty option to run map-reduce jobs locally on the user's workstation. This can be very useful to run queries over small data sets – in such cases local mode execution is usually significantly faster than submitting jobs to a large cluster. Data is accessed transparently from HDFS. Conversely, local mode only runs with one reducer and can be very slow processing larger data sets.
Starting with release 0.7, Hive fully supports local mode execution. To use this, the user can set the following option:
No Format |
---|
hive> SET mapred.job.tracker=local;
|
...
Starting with release 0.7, Hive also supports a mode to run map-reduce jobs in local-mode automatically. The relevant options are hive.exec.mode.local.auto, hive.exec.mode.local.auto.inputbytes.max, and hive.exec.mode.local.auto.tasks.max:
No Format |
---|
hive> SET hive.exec.mode.local.auto=false;
|
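The example above shows the option at its default value – automatic local-mode execution is disabled by default. To enable it, set the option to true, optionally together with the input-size and task-count thresholds named above (the threshold values below are illustrative, not recommended defaults):
No Format |
---|
hive> SET hive.exec.mode.local.auto=true;
hive> SET hive.exec.mode.local.auto.inputbytes.max=134217728;
hive> SET hive.exec.mode.local.auto.tasks.max=4;
|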
...
Error logs are very useful to debug problems. Please send them with any bugs (of which there are many!) to hive-dev@hadoop.apache.org.
Audit Logs
Audit logs are logged from the Hive metastore server for every metastore API invocation.
An audit log has the function and some of the relevant function arguments logged in the metastore log file. It is logged at the INFO level of log4j, so you need to make sure that the logging at the INFO level is enabled (see HIVE-3505). The name of the log entry is "HiveMetaStore.audit".
Audit logs were added in Hive 0.7 for secure client connections (HIVE-1948) and in Hive 0.10 for non-secure connections (HIVE-3277; also see HIVE-2797).
DDL Operations
The Hive DDL operations are documented in Hive Data Definition Language.
Creating Hive Tables
No Format |
---|
hive> CREATE TABLE pokes (foo INT, bar STRING);
|
creates a table called pokes with two columns, the first being an integer and the other a string.
No Format |
---|
hive> CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);
|
...
Browsing through Tables
No Format |
---|
hive> SHOW TABLES;
|
lists all the tables.
No Format |
---|
hive> SHOW TABLES '.*s';
|
lists all the tables that end with 's'. The pattern matching follows Java regular expressions; see http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html for documentation.
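The pattern semantics can be sanity-checked outside Hive; for a simple pattern like '.*s', Java and Python regular expressions agree (the table names below are the ones used elsewhere in this guide):

```python
import re

# Table names used elsewhere in this guide.
tables = ["pokes", "invites", "events", "u_data"]

# SHOW TABLES 'pattern' keeps only names that fully match the pattern.
matching = [t for t in tables if re.fullmatch(".*s", t)]

print(matching)  # → ['pokes', 'invites', 'events']
```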
No Format |
---|
hive> DESCRIBE invites;
|
shows the list of columns.
...
Table names can be changed and columns can be added or replaced:
No Format |
---|
hive> ALTER TABLE events RENAME TO 3koobecaf;
hive> ALTER TABLE pokes ADD COLUMNS (new_col INT);
hive> ALTER TABLE invites ADD COLUMNS (new_col2 INT COMMENT 'a comment');
hive> ALTER TABLE invites REPLACE COLUMNS (foo INT, bar STRING, baz INT COMMENT 'baz replaces new_col2');
|
Note that REPLACE COLUMNS replaces all existing columns and only changes the table's schema, not the data. The table must use a native SerDe. REPLACE COLUMNS can also be used to drop columns from the table's schema:
No Format |
---|
hive> ALTER TABLE invites REPLACE COLUMNS (foo INT COMMENT 'only keep the first column');
|
Dropping tables:
No Format |
---|
hive> DROP TABLE pokes;
|
Metadata Store
...
Loading data from flat files into Hive:
No Format |
---|
hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;
|
...
- NO verification of data against the schema is performed by the load command.
- If the file is in HDFS, it is moved into the Hive-controlled file system namespace.
The root of the Hive directory is specified by the option hive.metastore.warehouse.dir in hive-default.xml. We advise users to create this directory before trying to create tables via Hive.
No Format |
---|
hive> LOAD DATA LOCAL INPATH './examples/files/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');
hive> LOAD DATA LOCAL INPATH './examples/files/kv3.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-08');
|
The two LOAD statements above load data into two different partitions of the table invites. Table invites must be created as partitioned by the key ds for this to succeed.
No Format |
---|
hive> LOAD DATA INPATH '/user/myname/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');
|
...
SELECTS and FILTERS
No Format |
---|
hive> SELECT a.foo FROM invites a WHERE a.ds='2008-08-15';
|
...
Note that in all the examples that follow, INSERT (into a Hive table, local directory, or HDFS directory) is optional.
No Format |
---|
hive> INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM invites a WHERE a.ds='2008-08-15';
|
...
Partitioned tables must always have a partition selected in the WHERE clause of the statement.
No Format |
---|
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/local_out' SELECT a.* FROM pokes a;
|
selects all rows from the pokes table into a local directory.
No Format |
---|
hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a;
hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a WHERE a.key < 100;
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/reg_3' SELECT a.* FROM events a;
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_4' select a.invites, a.pokes FROM profiles a;
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT COUNT(*) FROM invites a WHERE a.ds='2008-08-15';
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT a.foo, a.bar FROM invites a;
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/sum' SELECT SUM(a.pc) FROM pc1 a;
|
selects the sum of a column. The avg, min, or max can also be used. Note that for versions of Hive which don't include HIVE-287, you'll need to use COUNT(1) in place of COUNT(*).
GROUP BY
No Format |
---|
hive> FROM invites a INSERT OVERWRITE TABLE events SELECT a.bar, count(*) WHERE a.foo > 0 GROUP BY a.bar;
hive> INSERT OVERWRITE TABLE events SELECT a.bar, count(*) FROM invites a WHERE a.foo > 0 GROUP BY a.bar;
|
Note that for versions of Hive which don't include HIVE-287, you'll need to use COUNT(1) in place of COUNT(*).
JOIN
No Format |
---|
hive> FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar) INSERT OVERWRITE TABLE events SELECT t1.bar, t1.foo, t2.foo;
|
MULTITABLE INSERT
No Format |
---|
FROM src
INSERT OVERWRITE TABLE dest1 SELECT src.* WHERE src.key < 100
INSERT OVERWRITE TABLE dest2 SELECT src.key, src.value WHERE src.key >= 100 and src.key < 200
INSERT OVERWRITE TABLE dest3 PARTITION(ds='2008-04-08', hr='12') SELECT src.key WHERE src.key >= 200 and src.key < 300
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/dest4.out' SELECT src.value WHERE src.key >= 300;
|
STREAMING
No Format |
---|
hive> FROM invites a INSERT OVERWRITE TABLE events SELECT TRANSFORM(a.foo, a.bar) AS (oof, rab) USING '/bin/cat' WHERE a.ds > '2008-08-09';
|
...
First, create a table with tab-delimited text file format:
No Format |
---|
CREATE TABLE u_data (
userid INT,
movieid INT,
rating INT,
unixtime STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
|
Then, download and extract the data files:
No Format |
---|
wget http://www.grouplens.org/sites/www.grouplens.org/external_files/data/ml-data.tar.gz
tar xvzf ml-data.tar.gz
|
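If the download is unavailable, a tiny stand-in file in the same four-column, tab-delimited format can be generated locally for experimentation (the rows and the file name u.data.sample below are illustrative, not part of the MovieLens distribution):

```python
# Generate a tiny stand-in file in the u.data format:
# userid \t movieid \t rating \t unixtime
rows = [
    (196, 242, 3, 881250949),
    (186, 302, 3, 891717742),
    (22, 377, 1, 878887116),
]
with open("u.data.sample", "w") as f:
    for userid, movieid, rating, unixtime in rows:
        f.write("%d\t%d\t%d\t%d\n" % (userid, movieid, rating, unixtime))
```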
And load it into the table that was just created:
No Format |
---|
LOAD DATA LOCAL INPATH 'ml-data/u.data'
OVERWRITE INTO TABLE u_data;
|
Count the number of rows in table u_data:
No Format |
---|
SELECT COUNT(*) FROM u_data;
|
...
Create weekday_mapper.py:
No Format |
---|
import sys
import datetime

for line in sys.stdin:
    line = line.strip()
    userid, movieid, rating, unixtime = line.split('\t')
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    print('\t'.join([userid, movieid, rating, str(weekday)]))
|
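Before wiring the script into Hive, its per-record transformation can be sanity-checked locally (the record below is illustrative; the weekday computed by fromtimestamp depends on the local timezone):

```python
import datetime

# One illustrative u.data record: userid, movieid, rating, unixtime.
line = "196\t242\t3\t881250949"

# The same transformation weekday_mapper.py applies to each stdin line.
userid, movieid, rating, unixtime = line.strip().split('\t')
weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
out = '\t'.join([userid, movieid, rating, str(weekday)])

print(out)  # four tab-separated fields; the last is a weekday in 1..7
```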
Use the mapper script:
No Format |
---|
CREATE TABLE u_data_new (
userid INT,
movieid INT,
rating INT,
weekday INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';
add FILE weekday_mapper.py;
INSERT OVERWRITE TABLE u_data_new
SELECT
TRANSFORM (userid, movieid, rating, unixtime)
USING 'python weekday_mapper.py'
AS (userid, movieid, rating, weekday)
FROM u_data;
SELECT weekday, COUNT(*)
FROM u_data_new
GROUP BY weekday;
|
...
More about RegexSerDe can be found in HIVE-662 and HIVE-1719.
No Format |
---|
CREATE TABLE apachelog (
host STRING,
identity STRING,
user STRING,
time STRING,
request STRING,
status STRING,
size STRING,
referer STRING,
agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([^]*) ([^]*) ([^]*) (-|\\[^\\]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ \"]*|\".*\"))?"
)
STORED AS TEXTFILE;
|
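Note that the Java regex above uses the `[^]` construct (match any character), which Python's re module rejects, so it cannot be tested verbatim outside the JVM. As a rough sanity check, a simplified Python-compatible pattern (an approximation of my own, not the SerDe's exact regex) can be run against a sample combined-log line:

```python
import re

# Simplified approximation of the RegexSerDe pattern above: Python has no
# Java-style [^] "any character" class, so \S and explicit quoted/bracketed
# groups are used instead.
LOG_RE = re.compile(
    r'(\S+) (\S+) (\S+) (\[[^\]]*\]) ("[^"]*") (\S+) (\S+) ("[^"]*") ("[^"]*")'
)

sample = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
          '"GET /apache_pb.gif HTTP/1.0" 200 2326 '
          '"http://www.example.com/start.html" "Mozilla/4.08"')

m = LOG_RE.match(sample)
host, identity, user, time, request, status, size, referer, agent = m.groups()
print(host, status, size)  # → 127.0.0.1 200 2326
```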