...
The second milestone was to support column level statistics. See Column Statistics in Hive in the Design Documents.
Supported column stats are:
| BooleanColumnStatsData | DoubleColumnStatsData | LongColumnStatsData | StringColumnStatsData | BinaryColumnStatsData | DecimalColumnStatsData | Date | DateColumnStatsData | Timestamp | TimestampColumnStatsData | union ColumnStatisticsData |
|---|---|---|---|---|---|---|---|---|---|---|
| 1: required i64 numTrues, | 1: optional double lowValue, | 1: optional i64 lowValue, | 1: required i64 maxColLen, | 1: required i64 maxColLen, | 1: optional Decimal lowValue, | 1: required i64 daysSinceEpoch | 1: optional Date lowValue, | 1: required i64 secondsSinceEpoch | 1: optional Timestamp lowValue, | 1: BooleanColumnStatsData booleanStats, |
| 2: required i64 numFalses, | 2: optional double highValue, | 2: optional i64 highValue, | 2: required double avgColLen, | 2: required double avgColLen, | 2: optional Decimal highValue, | | 2: optional Date highValue, | | 2: optional Timestamp highValue, | 2: LongColumnStatsData longStats, |
| 3: required i64 numNulls, | 3: required i64 numNulls, | 3: required i64 numNulls, | 3: required i64 numNulls, | 3: required i64 numNulls, | 3: required i64 numNulls, | | 3: required i64 numNulls, | | 3: required i64 numNulls, | 3: DoubleColumnStatsData doubleStats, |
| 4: optional binary bitVectors | 4: required i64 numDVs, | 4: required i64 numDVs, | 4: required i64 numDVs, | 4: optional binary bitVectors | 4: required i64 numDVs, | | 4: required i64 numDVs, | | 4: required i64 numDVs, | 4: StringColumnStatsData stringStats, |
| | 5: optional binary bitVectors, | 5: optional binary bitVectors, | 5: optional binary bitVectors | | 5: optional binary bitVectors, | | 5: optional binary bitVectors, | | 5: optional binary bitVectors, | 5: BinaryColumnStatsData binaryStats, |
| | 6: optional binary histogram | 6: optional binary histogram | | | 6: optional binary histogram | | 6: optional binary histogram | | 6: optional binary histogram | 6: DecimalColumnStatsData decimalStats, |
| | | | | | | | | | | 7: DateColumnStatsData dateStats, |
| | | | | | | | | | | 8: TimestampColumnStatsData timestampStats |
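To make the field meanings concrete, here is a minimal Python sketch that mirrors the four scalar fields of LongColumnStatsData (lowValue, highValue, numNulls, numDVs) and computes them for a small in-memory column. The class name `LongColumnStats` and the function `summarize_long_column` are hypothetical and not part of Hive. The exact set-based distinct count is an assumption made for illustration; the presence of a `bitVectors` field suggests that Hive estimates distinct values rather than counting them exactly.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class LongColumnStats:
    """Hypothetical Python mirror of the LongColumnStatsData scalar fields."""
    low_value: Optional[int]   # 1: optional i64 lowValue
    high_value: Optional[int]  # 2: optional i64 highValue
    num_nulls: int             # 3: required i64 numNulls
    num_dvs: int               # 4: required i64 numDVs


def summarize_long_column(values: Iterable[Optional[int]]) -> LongColumnStats:
    """Compute the four fields in a single pass; None stands for SQL NULL."""
    low = high = None
    nulls = 0
    distinct = set()  # exact distinct count, for illustration only
    for v in values:
        if v is None:
            nulls += 1
            continue
        low = v if low is None else min(low, v)
        high = v if high is None else max(high, v)
        distinct.add(v)
    return LongColumnStats(low, high, nulls, len(distinct))
```

For an empty or all-NULL column the sketch leaves `low_value` and `high_value` as `None`, which matches those two fields being declared optional while the counts are required.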
Info: Column level statistics were added in Hive 0.10.0 by HIVE-1362.
...
Column level top K statistics are still pending; see HIVE-3421.
Quick overview
| Description | Stored in | Collected by | Since |
|---|---|---|---|
| Number of partitions the dataset consists of | Fictional metastore property: numPartitions | Computed when the properties of a partitioned table are displayed | Hive 2.3 |
| Number of files the dataset consists of | Metastore table property: numFiles | Automatically during Metastore operations | |
| Total size of the dataset as seen at the filesystem level | Metastore table property: totalSize | Automatically during Metastore operations | |
| Uncompressed size of the dataset | Metastore table property: rawDataSize | Computed; these are the basic statistics. Calculated automatically when hive.stats.autogather is enabled. | Hive 0.8 |
| Number of rows the dataset consists of | Metastore table property: numRows | Computed; these are the basic statistics. Calculated automatically when hive.stats.autogather is enabled. | Hive 0.8 |
| Column level statistics | Metastore; TAB_COL_STATS table | Computed. Calculated automatically when hive.stats.column.autogather is enabled. Can be collected manually by: ANALYZE TABLE ... COMPUTE STATISTICS FOR COLUMNS | |
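As a rough illustration of the two ways to collect column level statistics mentioned in the last row, the Python sketch below only assembles the corresponding HiveQL statements as strings; it does not connect to Hive or execute anything. The helper `column_stats_statements` and the table name `web_logs` are hypothetical, and the table name is merely a stand-in for the part elided as `...` in the table.

```python
from typing import List


def column_stats_statements(table: str, autogather: bool = True) -> List[str]:
    """Return HiveQL statements as plain strings; nothing is executed here."""
    # Conservative check so that only simple identifiers (optionally db.table) pass.
    if not table.replace("_", "").replace(".", "").isalnum():
        raise ValueError(f"unexpected table name: {table!r}")
    return [
        # Automatic collection: the property named in the table above.
        f"SET hive.stats.column.autogather={'true' if autogather else 'false'};",
        # Manual collection for an existing table.
        f"ANALYZE TABLE {table} COMPUTE STATISTICS FOR COLUMNS;",
    ]


if __name__ == "__main__":
    # "web_logs" is a made-up table name used only for this example.
    for stmt in column_stats_statements("web_logs"):
        print(stmt)
```

The generated statements could then be pasted into a Hive session or passed to whatever client is in use; which client that is, and how it submits statements, is outside the scope of this sketch.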
...
Implementation
The way the statistics are calculated is similar for both newly created and existing tables.
...