Introduction
This document describes changes to a) HiveQL, b) the metastore schema, and c) the metastore Thrift API to support column-level statistics in Hive. Note that the document does not yet describe the changes needed to persist histograms in the metastore.
Proposed HiveQL changes
HiveQL currently supports the analyze command to compute statistics on tables and partitions. The analyze command will be extended to trigger statistics computation on one or more columns in a Hive table or partition. The necessary changes to HiveQL are as follows:
analyze table t [partition p] compute statistics for [columns c,...];
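
As an illustration only, statements following the proposed syntax might look like the sketch below. The table names (`users`, `page_views`), the partition column (`ds`), and the column names are hypothetical and are not part of this proposal. The partition clause is assumed to take the same `partition (col='value')` form that the existing analyze command uses.

```sql
-- Hypothetical tables and columns, used only for illustration.

-- Compute column statistics for two columns of an unpartitioned table.
analyze table users compute statistics for columns user_id, country;

-- Compute column statistics for two columns within a single partition.
-- Assumes the partition spec follows the existing analyze command's form.
analyze table page_views partition (ds='2012-09-01') compute statistics for columns user_id, url;
```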
Proposed Metastore Schema
To persist column-level statistics, we propose to add the following new tables:
...
Possible values for the histogram column are NONE and HEIGHT-BALANCED. Currently, only NONE is a valid option. When we implement support for histograms, we will extend the metastore schema to persist the histogram buckets. We will check the value of the histogram column in TAB_COL_STATS and PART_COL_STATS to decide whether valid histogram buckets exist for the column in question.
Proposed Metastore Thrift API
We propose to add the following Thrift struct to transport column statistics:
...