Table of Contents |
---|
Introduction
...
Vectorized query execution is a Hive feature that greatly reduces the CPU usage for typical query operations like scans, filters, aggregates, and joins. A standard query execution system processes one row at a time. This involves long code paths and significant metadata interpretation in the inner loop of execution. Vectorized query execution streamlines operations by processing a block of 1024 rows at a time. Within the block, each column is stored as a vector (an array of a primitive data type). Simple operations like arithmetic and comparisons are done by quickly iterating through the vectors in a tight loop, with no or very few function calls or conditional branches inside the loop. These loops compile in a streamlined way that uses relatively few instructions and finishes each instruction in fewer clock cycles, on average, by effectively using the processor pipeline and cache memory. A detailed design document is attached to the vectorized query execution JIRA, at
https://issues.apache.org/jira/browse/HIVE-4160
...
.
...
Using
...
Vectorized
...
Query
...
Execution
...
Enabling
...
vectorized
...
execution
...
To
...
use
...
vectorized
...
query
...
execution,
...
you
...
must
...
store
...
your
...
data
...
in
...
...
format,
...
and
...
set
...
the
...
following
...
variable
...
as
...
shown
...
in
...
Hive
...
SQL:
...
set
...
hive.vectorized.execution.enabled
...
=
...
true;
...
Vectorized
...
execution
...
is
...
off
...
by
...
default,
...
so
...
your
...
queries
...
only
...
utilize
...
it
...
if
...
this
...
variable
...
is
...
turned
...
on.
...
To
...
disable
...
vectorized
...
execution
...
and
...
go
...
back
...
to
...
standard
...
execution,
...
do
...
the
...
following:
...
set
...
hive.vectorized.execution.enabled
...
=
...
false;
...
Supported data types and operations
The following data types are currently supported for vectorized execution: tinyint
, smallint
, int
, bigint
, boolean
, float
, double
, timestamp
, and string
. Using other data types will cause your query to execute using standard, row-at-a-time
...
execution.
...
The
...
following
...
expressions
...
can
...
be
...
vectorized
...
when
...
used
...
on
...
supported
...
types:
...
- arithmetic:
...
- +,
...
- -,
...
- *,
...
- /,
...
- %
...
- AND,
...
- OR,
...
- NOT
...
- comparisons
...
- <,
...
- >,
...
- <=,
...
- >=,
...
- =,
...
- !=,
...
- BETWEEN
Wiki Markup IS \[NOT\] NULL
...
- all
...
- math
...
- functions
...
- (SIN,
...
- LOG,
...
- etc.)
...
- string
...
- functions
...
- SUBSTR,
...
- CONCAT,
...
- TRIM,
...
- LTRIM,
...
- RTRIM,
...
- LOWER,
...
- UPPER,
...
- LENGTH
...
- type
...
- casts
...
- Hive
...
- user-defined
...
- functions,
...
- including
...
- standard
...
- and
...
- generic
...
- UDFs
...
- date
...
- functions
...
- (YEAR,
...
- MONTH,
...
- DAY,
...
- HOUR,
...
- MINUTE,
...
- SECOND,
...
- UNIX_TIMESTAMP)
...
User-defined
...
functions
...
are
...
supported
...
using
...
a
...
backward
...
compatibility
...
bridge,
...
so
...
they
...
don't
...
run
...
as
...
fast
...
as
...
optimized
...
implementations
...
of
...
built-in
...
operators
...
and
...
functions.
...
Filter
...
operations
...
are
...
evaluated
...
left-to-right,
...
so
...
for
...
best
...
performance,
...
put
...
UDFs
...
on
...
the
...
right
...
in
...
an
...
ANDed
...
list
...
of
...
expressions
...
in
...
the
...
WHERE
...
clause.
...
E.g.,
...
use
...
column1
...
=
...
10
...
and
...
myUDF(column2)
...
=
...
"x"
...
instead
...
of
...
myUDF(column2)
...
=
...
"x"
...
and
...
column1
...
=
...
10
...
Using
...
a
...
built-in
...
operator
...
or
...
function
...
that
...
is
...
not
...
supported
...
for
...
vectorization
...
will
...
cause
...
your
...
query
...
to
...
run
...
in
...
standard
...
row-at-a-time
...
mode.
...
While
...
vectorization
...
is
...
in
...
preview,
...
some
...
less
...
common
...
use
...
cases
...
for
...
vectorization
...
may
...
cause
...
a
...
compile-time
...
error.
...
For
...
example,
...
this
...
could
...
occur
...
if
...
you
...
pass
...
non-constant
...
second
...
or
...
third
...
arguments
...
to
...
BETWEEN.
...
To
...
work
...
around
...
this,
...
disable
...
vectorization
...
by
...
setting
...
hive.vectorized.execution.enabled
...
to
...
false
...
for
...
the
...
specific
...
query
...
that
...
is
...
failing,
...
to
...
run
...
it
...
in
...
standard
...
mode.
...
It
...
is
...
anticipated
...
that
...
before
...
making
...
vectorization
...
on
...
by
...
default,
...
any
...
expression
...
that
...
can't
...
be
...
vectorized
...
will
...
cause
...
the
...
query
...
to
...
automatically
...
run
...
in
...
standard
...
mode.
...
Vectorized
...
support
...
continues
...
to
...
be
...
added
...
for
...
additional
...
functions
...
and
...
expressions.
...
If
...
you
...
have
...
a
...
request
...
for
...
one,
...
please
...
comment
...
on
...
this
...
page,
...
or
...
open
...
a
...
JIRA
...
for
...
it.
...
Seeing
...
whether
...
vectorization
...
is
...
used
...
for
...
a
...
query
...
You
...
can
...
verify
...
which
...
parts
...
of
...
your
...
query
...
are
...
being
...
vectorized
...
using
...
the
...
explain
...
feature.
...
For
...
example,
...
with
...
vectorization
...
enabled
...
and
...
the
...
table
...
alltypesorc
...
stored
...
in
...
ORC
...
format,
...
for
...
this
...
query:
Code Block | ||||
---|---|---|---|---|
| ||||
{code:sql}
select csmallint
from alltypesorc
where csmallint > 0;
|
the explain output contains this:
Code Block | ||||
---|---|---|---|---|
| ||||
{code} the *explain* output contains this: {code:text} STAGE PLANS: Stage: Stage-1 Map Reduce Alias -> Map Operator Tree: alltypesorc TableScan alias: alltypesorc Filter Operator predicate: expr: (csmallint > 0) type: boolean Vectorized execution: true Select Operator expressions: expr: csmallint type: smallint outputColumnNames: _col0 Vectorized execution: true File Output Operator compressed: false GlobalTableId: 0 table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe Vectorized execution: true {code} |
The
...
notation
...
Vectorized
...
execution:
...
true
...
shows
...
that
...
the
...
operator
...
containing
...
that
...
notation
...
is
...
vectorized.
...
Absence
...
of
...
this
...
notation
...
means
...
the
...
operator
...
is
...
not
...
vectorized,
...
and
...
uses
...
the
...
standard
...
row-at-a-time
...
execution
...
path.
...
Version
...
Information
...
Vectorized
...
execution
...
is
...
expected
...
to
...
be
...
available
...
in
...
Hive
...
13
...
and
...
later.
...
The
...
feature
...
is
...
currently
...
in
...
the
...
Hive
...
trunk
...
branch.