Table of Contents |
---|
Introduction
Vectorized query execution is a Hive feature that greatly reduces the CPU usage for typical query operations like scans, filters, aggregates, and joins. A standard query execution system processes one row at a time. This involves long code paths and significant metadata interpretation in the inner loop of execution. Vectorized query execution streamlines operations by processing a block of 1024 rows at a time. Within the block, each column is stored as a vector (an array of a primitive data type). Simple operations like arithmetic and comparisons are done by quickly iterating through the vectors in a tight loop, with no or very few function calls or conditional branches inside the loop. These loops compile in a streamlined way that uses relatively few instructions and finishes each instruction in fewer clock cycles, on average, by effectively using the processor pipeline and cache memory. A detailed design document is attached to the vectorized query execution JIRA, at https://issues.apache.org/jira/browse/HIVE-4160.
...
- arithmetic: +, -, *, /, %
- AND, OR, NOT
- comparisons <, >, <=, >=, =, !=, BETWEEN, IN ( list-of-constants ) as filters
- Boolean-valued expressions (non-filters) using AND, OR, NOT, <, >, <=, >=, =, !=
- IS [NOT] NULL
- all math functions (SIN, LOG, etc.)
- string functions SUBSTR, CONCAT, TRIM, LTRIM, RTRIM, LOWER, UPPER, LENGTH
- type casts
- Hive user-defined functions, including standard and generic UDFs
- date functions (YEAR, MONTH, DAY, HOUR, MINUTE, SECOND, UNIX_TIMESTAMP)
- the IF conditional expression
User-defined functions are supported using a backward compatibility bridge, so although they do run vectorized, they don't run as fast as optimized vector implementations of built-in operators and functions. Vectorized filter operations are evaluated left-to-right, so for best performance, put UDFs on the right in an ANDed list of expressions in the WHERE clause. E.g., use
...
You can verify which parts of your query are being vectorized using the explain feature. For example, with vectorization enabled and the table alltypesorc stored in ORC format, for this query:
Code Block | ||||
---|---|---|---|---|
| ||||
select csmallint
from alltypesorc
where csmallint > 0;
|
the explain output contains this:
Code Block | ||||
---|---|---|---|---|
| ||||
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
alltypesorc
TableScan
alias: alltypesorc
Filter Operator
predicate:
expr: (csmallint > 0)
type: boolean
Vectorized execution: true
Select Operator
expressions:
expr: csmallint
type: smallint
outputColumnNames: _col0
Vectorized execution: true
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Vectorized execution: true
|
...