Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Wiki Markup
{toc}
h1. Introduction
Vectorized query execution is a Hive feature that greatly reduces the CPU usage for typical query operations like scans, filters, aggregates, and joins. A standard query execution system processes one row at a time. This involves long code paths and significant metadata interpretation in the inner loop of execution. Vectorized query execution streamlines operations by processing a block of 1024 rows at a time. Within the block, each column is stored as a vector (an array of a primitive data type like {long}, {double}, or a byte sequence). Simple operations like arithmetic and comparisons are done by quickly iterating through the vectors in a tight loop, with no or very few function calls or conditional branches inside the loop. These loops compile in a streamlined way that uses relatively few instructions and finishes each instruction in fewer clock cycles, on average, by effectively using the processor pipeline and cache memory. A detailed design document is attached to the vectorized query execution JIRA, at [
Table of Contents

Introduction

...

https://issues.apache.org/jira/browse/HIVE-4160].

...



h1. Using Vectorized Query Execution

...


h2. Enabling vectorized execution

...


To use vectorized query execution, you must store your data in [ORC|LanguageManual ORC] format, and set the following variable as shown in Hive SQL:

...



{{set hive.vectorized.execution.enabled = true;

...

}}

Vectorized execution is off by default, so your queries only utilize it if this variable is turned on. To disable vectorized execution and go back to standard execution, do the following:

...



{{set hive.vectorized.execution.enabled = false;

...

Supported data types and operations

...

}}

h2. Supported data types and operations
The following data types are currently supported for vectorized execution: {{tinyint}}, {{smallint}}, {{int}}, {{bigint}}, {{boolean}}, {{float}}, {{double}}, {{timestamp}}, and {{string}}. Using other data types will cause your query to execute using standard, row-at-a-time execution.

...



The following expressions can be vectorized when used on supported types:

...



* arithmetic: +, -, *, /, %

...


* AND, OR, NOT

...


* comparisons <, >, <=, >=, =, !=,

...

 BETWEEN
* IS \[NOT\] NULL

...


* all math functions (SIN, LOG, etc.)

...


* string functions SUBSTR, CONCAT, TRIM, LTRIM, RTRIM, LOWER, UPPER, LENGTH

...


* type casts

...


* Hive user-defined functions, including standard and generic UDFs

...


* date functions (YEAR, MONTH, DAY, HOUR, MINUTE, SECOND, UNIX_TIMESTAMP)

...



User-defined functions are supported using a backward compatibility bridge, so they don't run as fast as optimized implementations of built-in operators and functions. Filter operations are evaluated left-to-right, so for best performance, put UDFs on the right in an ANDed list of expressions in the WHERE clause. E.g., use

...



{{column1 = 10 and myUDF(column2) = "x"

...

}}

instead of

...



{{myUDF(column2) = "x" and column1 = 10

...

}}

Using a built-in operator or function that is not supported for vectorization will cause your query to run in standard row-at-a-time mode. While vectorization is in preview, some less common use cases for vectorization may cause a compile-time error. For example, this could occur if you pass non-constant second or third arguments to BETWEEN. To work around this, disable vectorization by setting {{hive.vectorized.execution.enabled}} to {{false}} for the specific query that is failing, to run it in standard mode. It is anticipated that before making vectorization on by default, any expression that can't be vectorized will cause the query to automatically run in standard mode.

...

 

Vectorized support continues to be added for additional functions and expressions. If you have a request for one, please comment on this page, or open a JIRA for it.

...



h2. Seeing whether vectorization is used for a query

...


You can verify which parts of your query are being vectorized using the *explain* feature. For example, with vectorization enabled and the table *alltypesorc* stored in ORC format, for this query:

...


{code
:sql
sql
}
select csmallint
from alltypesorc
where csmallint > 0;

the explain output contains this:

Code Block
texttext
{code}
the *explain* output contains this:
{code:text}
STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        alltypesorc
          TableScan
            alias: alltypesorc
            Filter Operator
              predicate:
                  expr: (csmallint > 0)
                  type: boolean
              Vectorized execution: true
              Select Operator
                expressions:
                      expr: csmallint
                      type: smallint
                outputColumnNames: _col0
                Vectorized execution: true
                File Output Operator
                  compressed: false
                  GlobalTableId: 0
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                  Vectorized execution: true
{code}
The notation {{Vectorized execution: true}} shows that the operator containing that notation is vectorized. Absence of this notation means the operator is not vectorized, and uses the standard row-at-a-time execution path.

h1.

...

 Version Information

...


Vectorized execution is expected to be available in Hive 13 and later. The feature is currently in the Hive trunk branch.