...
The second milestone was to support column level statistics. See Column Statistics in Hive in the Design Documents.
Supported column stats are:
BooleanColumnStatsData | DoubleColumnStatsData | LongColumnStatsData | StringColumnStatsData | BinaryColumnStatsData | DecimalColumnStatsData | Date | DateColumnStatsData | Timestamp | TimestampColumnStatsData | union ColumnStatisticsData |
1: required i64 numTrues, | 1: optional double lowValue, | 1: optional i64 lowValue, | 1: required i64 maxColLen, | 1: required i64 maxColLen, | 1: optional Decimal lowValue, | 1: required i64 daysSinceEpoch | 1: optional Date lowValue, | 1: required i64 secondsSinceEpoch | 1: optional Timestamp lowValue, | 1: BooleanColumnStatsData booleanStats, |
2: required i64 numFalses, | 2: optional double highValue, | 2: optional i64 highValue, | 2: required double avgColLen, | 2: required double avgColLen, | 2: optional Decimal highValue, | 2: optional Date highValue, | 2: optional Timestamp highValue, | 2: LongColumnStatsData longStats, | ||
3: required i64 numNulls, | 3: required i64 numNulls, | 3: required i64 numNulls, | 3: required i64 numNulls, | 3: required i64 numNulls, | 3: required i64 numNulls, | 3: required i64 numNulls, | 3: required i64 numNulls, | 3: DoubleColumnStatsData doubleStats, | ||
4: optional binary bitVectors | 4: required i64 numDVs, | 4: required i64 numDVs, | 4: required i64 numDVs, | 4: optional binary bitVectors | 4: required i64 numDVs, | 4: required i64 numDVs, | 4: required i64 numDVs, | 4: StringColumnStatsData stringStats, | ||
5: optional binary bitVectors, | 5: optional binary bitVectors, | 5: optional binary bitVectors | 5: optional binary bitVectors, | 5: optional binary bitVectors, | 5: optional binary bitVectors, | 5: BinaryColumnStatsData binaryStats, | ||||
6: optional binary histogram | 6: optional binary histogram | 6: optional binary histogram | 6: optional binary histogram | 6: optional binary histogram | 6: DecimalColumnStatsData decimalStats, | |||||
7: DateColumnStatsData dateStats, | ||||||||||
8: TimestampColumnStatsData timestampStats |
Info | ||
---|---|---|
| ||
Column level statistics were added in Hive 0.10.0 by HIVE-1362. |
...
Column level top K statistics are still pending; see HIVE-3421.
Quick overview
Description | Stored in | Collected by | Since |
---|---|---|---|
Number of partition the dataset consists of | Fictional metastore property: numPartitions | computed during displaying the properties of a partitioned table | Hive 2.3 |
Number of files the dataset consists of | Metastore table property: numFiles | Automatically during Metastore operations |
Total size of the dataset as its seen at the filesystem level | Metastore table property: totalSize |
Uncompressed size of the dataset | Metastore table property: rawDataSize | Computed, these are the basic statistics. Calculated automatically when hive.stats.autogather is enabled. | Hive 0.8 |
Number of rows the dataset consist of | Metastore table property: numRows |
Column level statistics | Metastore; TAB_COL_STATS table | Computed, Calculated automatically when hive.stats.column.autogather is enabled. Can be collected manually by: ANALYZE TABLE ... COMPUTE STATISTICS FOR COLUMNS |
...
Implementation
The way the statistics are calculated is similar for both newly created and existing tables.
...