Hive has 3 main components:
Apart from these major components, Hive also contains a number of other components. These are as follows:
The following top-level directories contain helper libraries, packaged configuration files, etc.:
What is a SerDe?
Note that the "key" part is ignored when reading, and is always a constant when writing. Basically, the row object is stored in the "value".
One principle of Hive is that Hive does not own the HDFS file format. Users should be able to read the HDFS files in Hive tables directly with other tools, or use other tools to write HDFS files that can then be exposed to Hive through "CREATE EXTERNAL TABLE" or loaded into Hive through "LOAD DATA INPATH", which just moves the file into Hive's table directory.
Note that org.apache.hadoop.hive.serde is the deprecated old SerDe library. Please look at org.apache.hadoop.hive.serde2 for the latest version.
Hive currently uses these FileFormat classes to read and write HDFS files:
Hive currently uses these SerDe classes to serialize and deserialize data:
LazySimpleSerDe: This SerDe can be used to read the same data format as MetadataTypedColumnsetSerDe and TCTLSeparatedProtocol; however, it creates objects lazily, which provides better performance. Starting in Hive 0.14.0 it also supports reading and writing data with a specified character encoding, for example:
ALTER TABLE person SET SERDEPROPERTIES ('serialization.encoding'='GBK');
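The lazy object creation that LazySimpleSerDe is described as using can be illustrated with a small, self-contained sketch. This is not Hive's implementation: the class name `LazyRowSketch`, the comma delimiter, and the split counter are all assumptions made for illustration. The point is only that the raw row is kept as-is and is split into fields the first time a field is requested.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of lazy field parsing; not Hive's LazySimpleSerDe.
final class LazyRowSketch {
    private final String raw;      // undecoded row text, kept as-is
    private final char delimiter;  // field separator (an assumption of this sketch)
    private List<String> fields;   // stays null until a field is first requested
    private int splitCount = 0;    // how many times the row was actually split

    LazyRowSketch(String raw, char delimiter) {
        this.raw = raw;
        this.delimiter = delimiter;
    }

    // Returns the field at the given index, or null if the row has no such field.
    String getField(int index) {
        if (fields == null) {
            fields = split(); // parsing is deferred until this point
        }
        return index >= 0 && index < fields.size() ? fields.get(index) : null;
    }

    int getSplitCount() {
        return splitCount;
    }

    private List<String> split() {
        splitCount++;
        List<String> out = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < raw.length(); i++) {
            if (raw.charAt(i) == delimiter) {
                out.add(raw.substring(start, i));
                start = i + 1;
            }
        }
        out.add(raw.substring(start)); // last field has no trailing delimiter
        return out;
    }
}
```

A row that is never inspected is never split, and a row that is inspected several times is split only once.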
LazySimpleSerDe can treat 'T', 't', 'F', 'f', '1', and '0' as extended, legal boolean literals if the configuration property hive.lazysimple.extended_boolean_literal is set to true (Hive 0.14.0 and later). The default is false, which means only 'TRUE' and 'FALSE' are treated as legal boolean literals.
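The rule for extended boolean literals can be written down as a short sketch. The class `ExtendedBooleanSketch` and its `parse` method are hypothetical names, not part of Hive. Two behaviors are assumptions of this sketch rather than statements from the text above: 'TRUE' and 'FALSE' are matched case-insensitively, and any other input yields null.

```java
import java.util.Locale;

// Hypothetical sketch of the literal rule; not Hive's own parsing code.
final class ExtendedBooleanSketch {
    // Returns Boolean.TRUE or Boolean.FALSE for a legal literal, null otherwise.
    // extendedLiterals plays the role of hive.lazysimple.extended_boolean_literal.
    static Boolean parse(String text, boolean extendedLiterals) {
        if (text == null) {
            return null;
        }
        // Assumption: TRUE/FALSE are accepted in any letter case.
        String upper = text.toUpperCase(Locale.ROOT);
        if (upper.equals("TRUE")) {
            return Boolean.TRUE;
        }
        if (upper.equals("FALSE")) {
            return Boolean.FALSE;
        }
        if (extendedLiterals) {
            switch (text) {
                case "T": case "t": case "1":
                    return Boolean.TRUE;
                case "F": case "f": case "0":
                    return Boolean.FALSE;
                default:
                    return null;
            }
        }
        return null; // not a legal boolean literal under the current setting
    }
}
```

With the flag off, `parse("T", false)` returns null, while `parse("T", true)` returns `Boolean.TRUE`.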
Also:
s3://elasticmapreduce/samples/hive-ads/libs/jsonserde.jar
for releases prior to 0.12.0. See SerDe for detailed information about input and output processing. Also see Storage Formats in the HCatalog manual, including CTAS Issue with JSON SerDe. For information about how to create a table with a custom or native SerDe, see Row Format, Storage Format, and SerDe.
Some important points about SerDe:
Hive uses ObjectInspector to analyze the internal structure of the row object and also the structure of the individual columns.
ObjectInspector provides a uniform way to access complex objects that can be stored in multiple formats in the memory, including:
A complex object can be represented by a pair of ObjectInspector and Java Object. The ObjectInspector not only tells us the structure of the Object, but also gives us ways to access the internal fields inside the Object.
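The idea of pairing an inspector with a plain Java object can be sketched without any Hive dependency. The interface and classes below (`RowInspectorSketch`, `ListRowInspector`, `MapRowInspector`, `InspectorDemo`) are hypothetical and much simpler than Hive's real ObjectInspector API. They only show how the same consuming code can read rows held in two different in-memory representations.

```java
import java.util.List;
import java.util.Map;

// Hypothetical, simplified inspector; Hive's real ObjectInspector API is richer.
interface RowInspectorSketch {
    List<String> fieldNames();                      // structure of the row
    Object getField(Object row, String fieldName);  // access to one field
}

// Rows stored as a java.util.List, one element per column.
final class ListRowInspector implements RowInspectorSketch {
    private final List<String> names;

    ListRowInspector(List<String> names) {
        this.names = names;
    }

    public List<String> fieldNames() {
        return names;
    }

    public Object getField(Object row, String fieldName) {
        int i = names.indexOf(fieldName);
        List<?> list = (List<?>) row;
        return i >= 0 && i < list.size() ? list.get(i) : null;
    }
}

// Rows stored as a java.util.Map keyed by column name.
final class MapRowInspector implements RowInspectorSketch {
    private final List<String> names;

    MapRowInspector(List<String> names) {
        this.names = names;
    }

    public List<String> fieldNames() {
        return names;
    }

    public Object getField(Object row, String fieldName) {
        return ((Map<?, ?>) row).get(fieldName);
    }
}

final class InspectorDemo {
    // Consuming code sees only the (inspector, object) pair, never the layout.
    static String describe(RowInspectorSketch inspector, Object row) {
        StringBuilder sb = new StringBuilder();
        for (String name : inspector.fieldNames()) {
            if (sb.length() > 0) {
                sb.append(", ");
            }
            sb.append(name).append('=').append(inspector.getField(row, name));
        }
        return sb.toString();
    }
}
```

`InspectorDemo.describe` produces the same text for a list-backed row and a map-backed row, because the inspector hides how the fields are stored.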
NOTE: Apache Hive recommends that custom ObjectInspectors created for use with custom SerDes have a no-argument constructor in addition to their normal constructors for serialization purposes. See HIVE-5380 for more details.
As of Hive 0.14, a registration mechanism has been introduced for native Hive SerDes. It binds a "STORED AS" keyword dynamically to a {SerDe, InputFormat, OutputFormat} triplet, so the keyword can be used in CREATE TABLE statements in place of the full specification.
The following mappings have been added through this registration mechanism:
Syntax | Equivalent |
---|---|
STORED AS AVRO / STORED AS AVROFILE | ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' |
STORED AS ORC / STORED AS ORCFILE | ROW FORMAT SERDE ' STORED AS INPUTFORMAT ' OUTPUTFORMAT ' |
STORED AS PARQUET / STORED AS PARQUETFILE | ROW FORMAT SERDE ' STORED AS INPUTFORMAT ' OUTPUTFORMAT ' |
STORED AS RCFILE | STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileOutputFormat' |
STORED AS SEQUENCEFILE | STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.SequenceFileInputFormat' OUTPUTFORMAT 'org.apache.hadoop.mapred.SequenceFileOutputFormat' |
STORED AS TEXTFILE | STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat' |
To add a new native SerDe with STORED AS keyword, follow these steps:
Create a storage format descriptor class extending from AbstractStorageFormatDescriptor.java that returns a "stored as" keyword and the names of InputFormat, OutputFormat, and SerDe classes.
Add the name of the storage format descriptor class to the StorageFormatDescriptor registration file.
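The effect of the registration mechanism, a lookup from a "STORED AS" keyword to a triplet of class names, can be sketched as follows. The classes `StorageFormatSketch` and `StorageFormatRegistrySketch` are hypothetical and do not reproduce Hive's descriptor classes or its registration file. The class names in the sample mappings are copied from the table above, and the case-insensitive keyword lookup is an assumption of this sketch.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical value object holding one row of the mapping table.
final class StorageFormatSketch {
    final String inputFormat;
    final String outputFormat;
    final String serde; // null when the mapping names no SerDe

    StorageFormatSketch(String inputFormat, String outputFormat, String serde) {
        this.inputFormat = inputFormat;
        this.outputFormat = outputFormat;
        this.serde = serde;
    }
}

// Hypothetical registry; Hive's real registration mechanism differs in detail.
final class StorageFormatRegistrySketch {
    private final Map<String, StorageFormatSketch> byKeyword = new HashMap<>();

    // Registers one format under one or more "STORED AS" keywords.
    void register(StorageFormatSketch format, String... keywords) {
        for (String keyword : keywords) {
            byKeyword.put(keyword.toUpperCase(Locale.ROOT), format);
        }
    }

    // Returns the registered format, or null for an unknown keyword.
    StorageFormatSketch lookup(String keyword) {
        return byKeyword.get(keyword.toUpperCase(Locale.ROOT));
    }

    // Two mappings taken from the table above.
    static StorageFormatRegistrySketch withSampleMappings() {
        StorageFormatRegistrySketch registry = new StorageFormatRegistrySketch();
        registry.register(
            new StorageFormatSketch(
                "org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat",
                "org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat",
                "org.apache.hadoop.hive.serde2.avro.AvroSerDe"),
            "AVRO", "AVROFILE");
        registry.register(
            new StorageFormatSketch(
                "org.apache.hadoop.mapred.SequenceFileInputFormat",
                "org.apache.hadoop.mapred.SequenceFileOutputFormat",
                null),
            "SEQUENCEFILE");
        return registry;
    }
}
```

Registering a format under several keywords, as done for AVRO and AVROFILE, mirrors the rows of the table that list two equivalent spellings.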
The MetaStore contains metadata regarding tables, partitions, and databases. This metadata is used by the Query Processor during plan generation.
The following are the main components of the Hive Query Processor:
A helpful overview of the Hive query processor can be found in this Hive Anatomy slide deck.
As of version 0.13 Hive uses Maven instead of Ant for its build. The following instructions are not up to date. See the Hive Developer FAQ for updated instructions.
Hive can be made to compile against different versions of Hadoop.
From the root of the source tree:
ant package
will make Hive compile against Hadoop version 0.19.0. Note that:
ant -Dtarget.dir=<my-install-dir> package
ant -Dhadoop.version=0.17.1 package
ant -Dhadoop.root=~/src/hadoop-19/build/hadoop-0.19.2-dev -Dhadoop.version=0.19.2-dev
Note that:
hadoop.root points to a distribution tree for Hadoop created by running ant package in Hadoop.
hadoop.version must match the version used in building Hadoop.
In this particular example, ~/src/hadoop-19 is a checkout of the Hadoop 19 branch that uses 0.19.2-dev as its default version and creates a distribution directory in build/hadoop-0.19.2-dev by default.
Run Hive from the command line with '$HIVE_HOME/bin/hive', where $HIVE_HOME is typically build/dist under your Hive repository top-level directory.
$ build/dist/bin/hive
If Hive fails at runtime, try 'ant very-clean package' to delete the Ivy cache before rebuilding.
From Thejas:
export HIVE_OPTS='--hiveconf mapred.job.tracker=local --hiveconf fs.default.name=file:///tmp --hiveconf hive.metastore.warehouse.dir=file:///tmp/warehouse --hiveconf javax.jdo.option.ConnectionURL=jdbc:derby:;databaseName=/tmp/metastore_db;create=true'
Then you can run 'build/dist/bin/hive' and it will work against your local file system.
Hive uses JUnit for unit tests. Each of the 3 main components of Hive has its unit test implementations in the corresponding src/test directory: trunk/metastore/src/test has all the unit tests for the metastore, trunk/serde/src/test has all the unit tests for serde, and trunk/ql/src/test has all the unit tests for the query processor. The metastore and serde unit tests provide the TestCase implementations for JUnit. The query processor tests, on the other hand, are generated using Velocity. The main directories under trunk/ql/src/test that contain these tests and the corresponding results are as follows:
As of version 0.13 Hive uses Maven instead of Ant for its build. The following instructions are not up to date. See the Hive Developer FAQ for updated instructions.
Run all tests:
ant package test
Run all positive test queries:
ant test -Dtestcase=TestCliDriver
Run a specific positive test query:
ant test -Dtestcase=TestCliDriver -Dqfile=groupby1.q
The above test produces the following files:
build/ql/test/TEST-org.apache.hadoop.hive.cli.TestCliDriver.txt - Log output for the test. This can be helpful when examining test failures.
build/ql/test/logs/groupby1.q.out - Actual query result for the test. This result is compared to the expected result as part of the test.
Run the set of unit tests matching a regex, e.g. partition_wise_fileformat tests 10-16:
ant test -Dtestcase=TestCliDriver -Dqfile_regex=partition_wise_fileformat1[0-6]
Note that this option matches against the basename of the test without the .q suffix.
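The basename matching described above can be checked with a small sketch that uses java.util.regex. The class `QfileRegexSketch` is hypothetical and is not how the Hive build selects tests. It assumes plain file names (no directories) and that the pattern must match the whole basename, which the text above does not state.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch of selecting .q files by regex on their basename.
final class QfileRegexSketch {
    // Returns the file names whose basename (without ".q") matches the regex.
    static List<String> select(List<String> fileNames, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<String> selected = new ArrayList<>();
        for (String fileName : fileNames) {
            if (!fileName.endsWith(".q")) {
                continue; // only query files are candidates
            }
            String baseName = fileName.substring(0, fileName.length() - 2);
            // Assumption: the whole basename must match, not just a substring.
            if (pattern.matcher(baseName).matches()) {
                selected.add(fileName);
            }
        }
        return selected;
    }
}
```

Under these assumptions, the pattern partition_wise_fileformat1[0-6] selects partition_wise_fileformat10.q through partition_wise_fileformat16.q, but not partition_wise_fileformat1.q or partition_wise_fileformat17.q.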
Apparently the Hive tests do not run successfully after a clean unless you run ant package first. Not sure why build.xml doesn't encode this dependency.
First, write a new myname.q in ql/src/test/queries/clientpositive.
Then, run the test with the query and overwrite the result (useful when you add a new test).
ant test -Dtestcase=TestCliDriver -Dqfile=myname.q -Doverwrite=true
Then we can create a patch by:
svn add ql/src/test/queries/clientpositive/myname.q ql/src/test/results/clientpositive/myname.q.out
svn diff > patch.txt
Similarly, to add negative client tests, write a new query input file in ql/src/test/queries/clientnegative and run the same command, this time specifying the testcase name as TestNegativeCliDriver instead of TestCliDriver. Note that for negative client tests, the output file, if created using the overwrite flag, can be found in the directory ql/src/test/results/clientnegative.
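The role of the overwrite flag can be pictured as a golden-file check: normally the actual output is compared with a stored result file, and with overwrite enabled the stored file is replaced by the actual output. The sketch below is a simplified illustration under that assumption. `GoldenFileSketch` is a hypothetical name, and the real test harness may normalize output before comparing, which this sketch does not attempt.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical golden-file check; not the Hive test harness.
final class GoldenFileSketch {
    // Returns true when the test is considered to pass.
    static boolean check(Path expectedFile, List<String> actualLines, boolean overwrite) {
        try {
            if (overwrite) {
                // Record the actual output as the new expected result.
                Path parent = expectedFile.getParent();
                if (parent != null) {
                    Files.createDirectories(parent);
                }
                Files.write(expectedFile, actualLines, StandardCharsets.UTF_8);
                return true;
            }
            if (!Files.exists(expectedFile)) {
                return false; // nothing to compare against
            }
            List<String> expectedLines =
                Files.readAllLines(expectedFile, StandardCharsets.UTF_8);
            return expectedLines.equals(actualLines);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

This is why a new test is first run once with the overwrite flag: until a result file exists, there is nothing for the comparison to succeed against.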
See also Tips for Adding New Tests in Hive.
Hive code includes both client-side code (e.g., compiler, semantic analyzer, and optimizer of HiveQL) and server-side code (e.g., operator/task/SerDe implementations). Debugging is different for client-side and server-side code, as described below.
The client-side code runs on your local machine so you can easily debug it using Eclipse the same way you debug any regular local Java code. Here are the steps to debug code within a unit test.
Make sure that you have run ant model-jar in hive/metastore and ant gen-test in hive since the last time you ran ant clean.
The server-side code is distributed and runs on the Hadoop cluster, so debugging server-side Hive code is somewhat more complicated. In addition to printing to log files using log4j, you can also attach the debugger to a different JVM under unit test (single machine mode). The steps below describe how to debug server-side code.
Compile Hive code with javac.debug=on. Under Hive checkout directory:
> ant -Djavac.debug=on package
If you have already built Hive without javac.debug=on, you can clean the build and then run the above command.
> ant clean # not necessary if this is the first time you compile
> ant -Djavac.debug=on package
Run ant test with additional options to tell the Java VM that is running Hive server-side code to wait for the debugger to attach. First define some convenient macros for debugging. You can put them in your .bashrc or .cshrc.
> export HIVE_DEBUG_PORT=8000
> export HIVE_DEBUG="-Xdebug -Xrunjdwp:transport=dt_socket,address=${HIVE_DEBUG_PORT},server=y,suspend=y"
In particular HIVE_DEBUG_PORT is the port number that the JVM is listening on and the debugger will attach to. Then run the unit test as follows:
> export HADOOP_OPTS=$HIVE_DEBUG
> ant test -Dtestcase=TestCliDriver -Dqfile=<mytest>.q
The unit test will run until it shows:
[junit] Listening for transport dt_socket at address: 8000
Now, you can use jdb to attach to port 8000 to debug:
> jdb -attach 8000
or if you are running Eclipse and the Hive projects are already imported, you can debug with Eclipse. Under Eclipse Run -> Debug Configurations, find "Remote Java Application" at the bottom of the left panel. There should be a MapRedTask configuration already. If there is no such configuration, you can create one with the following property:
There is another way of debugging Hive code without going through Ant.
You need to install Hadoop and set the environment variable HADOOP_HOME to point to it.
> export HADOOP_HOME=<your hadoop home>
Then, start Hive:
> ./build/dist/bin/hive --debug
It will then behave similarly to the debugging steps outlined in Debugging Hive code. It is faster since there is no need to compile Hive code and go through Ant. It can be used to debug both client-side and server-side Hive code.
If you want to debug a particular query, start Hive and perform the steps needed before that query. Then start Hive again in debug mode to debug that query.
> ./build/dist/bin/hive
> perform steps before the query

> ./build/dist/bin/hive --debug
> run the query
Note that the local file system will be used, so the space on your machine will not be released automatically (unlike debugging via Ant, where the tables created in test are automatically dropped at the end of the test). Make sure to either drop the tables explicitly, or drop the data from /User/hive/warehouse.
Please refer to Hive User Group Meeting August 2009 Page 59-63.
Please refer to Hive User Group Meeting August 2009 Page 64-70.
Please refer to Hive User Group Meeting August 2009 Page 71-73.
Please refer to Hive User Group Meeting August 2009 Page 74-87.