ORC Files
ORC File Format
Introduced in Hive version 0.11.0.
The Optimized Row Columnar (ORC) file format provides a highly efficient way to store Hive data. It was designed to overcome limitations of the other Hive file formats. Using ORC files improves performance when Hive is reading, writing, and processing data.
...
With the ability to skip large sets of rows based on filter predicates, you can sort a table on its secondary keys to achieve a big reduction in execution time. For example, if the primary partition is transaction date, the table can be sorted on state, zip code, and last name. Then looking for records in one state will skip the records of all other states.
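As a sketch of this layout (the table and column names here are hypothetical, not from the original), the data could be loaded into each transaction-date partition with a sort on the secondary keys, so that ORC's row indexes can skip row groups whose state does not match a query predicate:

```sql
-- Hypothetical example: populate a date partition sorted on secondary
-- keys (state, zip code, last name) so that reads filtering on state
-- can skip stripes and row groups for all other states.
INSERT OVERWRITE TABLE transactions PARTITION (txn_date = '2014-01-01')
SELECT state, zip_code, last_name, amount
FROM staging_transactions
SORT BY state, zip_code, last_name;
```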
A complete specification of the format is given in the ORC specification.
HiveQL Syntax
File formats are specified at the table (or partition) level. You can specify the ORC file format with HiveQL statements such as these:
...
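A minimal sketch of such a statement (the table name and columns are hypothetical) is simply a CREATE TABLE with the ORC storage clause:

```sql
-- Hypothetical table; STORED AS ORC selects the ORC file format
-- for the table's data files.
CREATE TABLE orc_table (
  id   INT,
  name STRING
)
STORED AS ORC;
```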
Key | Default | Notes
---|---|---
orc.compress | ZLIB | high level compression (one of NONE, ZLIB, SNAPPY)
orc.compress.size | 262,144 | number of bytes in each compression chunk
orc.stripe.size | 67,108,864 | number of bytes in each stripe
orc.row.index.stride | 10,000 | number of rows between index entries (must be >= 1,000)
orc.create.index | true | whether to create row indexes
orc.bloom.filter.columns | "" | comma-separated list of column names for which a bloom filter should be created
orc.bloom.filter.fpp | 0.05 | false positive probability for bloom filter (must be >0.0 and <1.0)
For example, creating an ORC stored table without compression:
...
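As a hedged sketch of such a statement (the table name and columns are hypothetical), compression is disabled by setting the orc.compress table property from the table above to NONE:

```sql
-- Hypothetical table stored as ORC with compression disabled
-- via the orc.compress table property.
CREATE TABLE orc_table_nocompress (
  id   INT,
  name STRING
)
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "NONE");
```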
The ORC file dump utility analyzes ORC files. To invoke it, use this command:
// Hive version 0.11 through 0.14:
hive --orcfiledump <location-of-orc-file>

// Hive version 1.1.0 and later:
hive --orcfiledump [-d] [--rowindex <col_ids>] <location-of-orc-file>

// Hive version 1.2.0 and later:
hive --orcfiledump [-d] [-t] [--rowindex <col_ids>] <location-of-orc-file>

// Hive version 1.3.0 and later:
hive --orcfiledump [-j] [-p] [-d] [-t] [--rowindex <col_ids>] [--recover] [--skip-dump] [--backup-path <new-path>] <location-of-orc-file-or-directory>
Specifying -d in the command will cause it to dump the ORC file data rather than the metadata (Hive 1.1.0 and later).

Specifying --rowindex with a comma-separated list of column ids will cause it to print row indexes for the specified columns, where 0 is the top level struct containing all of the columns and 1 is the first column id (Hive 1.1.0 and later).

Specifying -t in the command will print the timezone id of the writer.

Specifying -j in the command will print the ORC file metadata in JSON format. To pretty print the JSON metadata, add -p to the command.

Specifying --recover in the command will recover a corrupted ORC file generated by Hive streaming.

Specifying --skip-dump along with --recover will perform recovery without dumping metadata.

Specifying --backup-path with a new path will let the recovery tool move corrupted files to the specified backup path (default: /tmp).
<location-of-orc-file> is the URI of the ORC file.
<location-of-orc-file-or-directory> is the URI of the ORC file or directory. From Hive 1.3.0 onward, this URI can be a directory containing ORC files.
ORC Configuration Parameters
The ORC configuration parameters are described in Hive Configuration Properties – ORC File Format.
The ORC specification has moved to the ORC project.