Statistics in Hive

This document describes the support of statistics for Hive tables (see HIVE-33).

Motivation

Statistics such as the number of rows in a table or partition and the histogram of a column of interest are important in many ways. One key use case is query optimization: statistics serve as the input to the optimizer's cost functions so that it can compare different plans and choose among them. Statistics can also sometimes answer users' queries directly. Users can get quick answers to some queries by reading only the stored statistics rather than launching long-running execution plans. Examples include the quantiles of the users' age distribution, the top 10 apps in use, and the number of distinct sessions.

Scope

Table and Partition Statistics

The first milestone in supporting statistics was to support table and partition level statistics. Table and partition statistics are now stored in the Hive Metastore for both newly created and existing tables. The following statistics are currently supported for partitions:

  • Number of rows
  • Number of files
  • Size in Bytes

For tables, the same statistics are supported with the addition of the number of partitions of the table.

Version: Table and partition statistics

Table and partition level statistics were added in Hive 0.7.0 by HIVE-1361.

Column Statistics

The second milestone was to support column level statistics. See Column Statistics in Hive in the Design Documents.

Version: Column statistics

Column level statistics were added in Hive 0.10.0 by HIVE-1362.

Top K Statistics

Column level top K statistics are still pending; see HIVE-3421.

Implementation

The way the statistics are calculated is similar for both newly created and existing tables.

For newly created tables, the job that creates a new table is a MapReduce job. During the creation, every mapper, while copying the rows from the source table in the FileSink operator, gathers statistics for the rows it encounters and publishes them into a database (possibly MySQL). At the end of the MapReduce job, the published statistics are aggregated and stored in the MetaStore.

A similar process happens for already existing tables: a map-only job is created, and every mapper, while processing the table in the TableScan operator, gathers statistics for the rows it encounters. The rest of the process is the same.

Clearly, a database is needed to store the temporarily gathered statistics. Currently there are two implementations: one uses MySQL and the other uses HBase. There are two pluggable interfaces, IStatsPublisher and IStatsAggregator, that a developer can implement to support any other storage. The interfaces are listed below:

Usage

Configuration Variables

See Statistics in Configuration Properties for a list of the variables that configure Hive table statistics. Configuring Hive describes how to use the variables.

Newly Created Tables

For newly created tables and/or partitions (that are populated through the INSERT OVERWRITE command), statistics are automatically computed by default. To prevent statistics from being automatically computed and stored in the Hive MetaStore, the user has to explicitly set the boolean variable hive.stats.autogather to false.
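As a minimal illustration, automatic gathering could be switched off for the current session as follows. A session-level set from the Hive CLI is assumed here; the same property can also be placed in the Hive configuration files.

```sql
-- Disable automatic statistics gathering for this session
set hive.stats.autogather=false;
```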

The user can also specify the implementation to be used for the storage of temporary statistics by setting the variable hive.stats.dbclass. For example, to set HBase as the implementation of temporary statistics storage (the default is jdbc:derby or fs, depending on the Hive version), the user should issue the following command:
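Assuming the value hbase selects the HBase implementation, as the paragraph above implies, the command would look like this:

```sql
-- Store temporary statistics in HBase
set hive.stats.dbclass=hbase;
```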

For JDBC implementations of temporary statistics storage (e.g., Derby or MySQL), the user should specify the appropriate connection string to the database by setting the variable hive.stats.dbconnectionstring. The user should also specify the appropriate JDBC driver by setting the variable hive.stats.jdbcdriver.
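A sketch of a JDBC setup using embedded Derby is shown below. The connection string and driver class are believed to match the Derby defaults of older Hive releases, but treat them as assumptions and check them against Configuration Properties for your Hive version.

```sql
-- Use the JDBC (Derby) implementation for temporary statistics
set hive.stats.dbclass=jdbc:derby;
-- Assumed connection string: an embedded Derby database created on demand
set hive.stats.dbconnectionstring="jdbc:derby:;databaseName=TempStatsStore;create=true";
-- Assumed driver class for embedded Derby
set hive.stats.jdbcdriver="org.apache.derby.jdbc.EmbeddedDriver";
```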

A query may not always collect statistics completely accurately. Setting hive.stats.reliable to true makes a query fail if its statistics cannot be collected reliably. The default is false.
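For instance, to have queries fail rather than store unreliable statistics, the setting could be turned on for a session (a session-level set is assumed):

```sql
-- Fail the query when statistics cannot be collected reliably
set hive.stats.reliable=true;
```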

Existing Tables

For existing tables and/or partitions, the user can issue the ANALYZE command to gather statistics and write them into Hive MetaStore. The syntax for that command is described below:
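A sketch of the syntax, assembled from the options this section names (an optional database qualifier, an optional partition spec, FOR COLUMNS, and NOSCAN). Later Hive versions may accept further clauses that are not shown here.

```sql
ANALYZE TABLE [db_name.]tablename [PARTITION(partcol1[=val1], partcol2[=val2], ...)]
  COMPUTE STATISTICS
  [FOR COLUMNS]   -- Hive 0.10.0 and later
  [NOSCAN];
```

Square brackets mark optional parts; they are notation, not literal syntax.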

When issuing that command, the user may or may not specify partition specs. If no partition specs are specified, statistics are gathered for the table as well as all the partitions (if any). If certain partition specs are specified, statistics are gathered for only those partitions. When computing statistics across all partitions, the partition columns still need to be listed. As of Hive 1.2.0, Hive fully supports qualified table names in this command. If a non-qualified table name is used, the user can only compute statistics for a table in the current database.

When the optional parameter NOSCAN is specified, the command does not scan the files, so it should be fast. Instead of all statistics, it gathers only the following:

  • Number of files
  • Physical size in bytes

Version 0.10.0: FOR COLUMNS

As of Hive 0.10.0, the optional parameter FOR COLUMNS computes column statistics for all columns in the specified table (and for all partitions if the table is partitioned). See Column Statistics in Hive for details.

To display these statistics, use DESCRIBE FORMATTED [db_name.]table_name column_name [PARTITION (partition_spec)].
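As an illustration of that form, the statement below would show the column statistics of one column in one partition. The column name userid is hypothetical; Table1 and its partition spec are borrowed from the Examples section.

```sql
-- userid is a hypothetical column of Table1
DESCRIBE FORMATTED Table1 userid PARTITION (ds='2008-04-09', hr=11);
```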

Examples

Suppose table Table1 has 4 partitions with the following specs:

  • Partition1: (ds='2008-04-08', hr=11)
  • Partition2: (ds='2008-04-08', hr=12)
  • Partition3: (ds='2008-04-09', hr=11)
  • Partition4: (ds='2008-04-09', hr=12)

and you issue the following command:
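A statement consistent with the outcome described next, assuming the usual ANALYZE TABLE ... COMPUTE STATISTICS form with a fully specified partition:

```sql
ANALYZE TABLE Table1 PARTITION(ds='2008-04-09', hr=11) COMPUTE STATISTICS;
```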

then statistics are gathered for partition3 (ds='2008-04-09', hr=11) only.

If you issue the command:
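Given the result that follows, this would presumably be the same partition spec with the FOR COLUMNS option added:

```sql
ANALYZE TABLE Table1 PARTITION(ds='2008-04-09', hr=11) COMPUTE STATISTICS FOR COLUMNS;
```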

then column statistics are gathered for all columns for partition3 (ds='2008-04-09', hr=11). This is available in Hive 0.10.0 and later.

If you issue the command:
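To cover both hours of one day, the partition column hr would be listed without a value. This is a sketch inferred from the result stated next:

```sql
-- hr is listed but left unbound, so every hr value under ds='2008-04-09' matches
ANALYZE TABLE Table1 PARTITION(ds='2008-04-09', hr) COMPUTE STATISTICS;
```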

then statistics are gathered for partitions 3 and 4 only (hr=11 and hr=12).

If you issue the command:
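The column-statistics counterpart, again inferred from the stated result rather than taken from the original listing:

```sql
ANALYZE TABLE Table1 PARTITION(ds='2008-04-09', hr) COMPUTE STATISTICS FOR COLUMNS;
```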

then column statistics for all columns are gathered for partitions 3 and 4 only (Hive 0.10.0 and later).

If you issue the command:
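Since the partition columns must still be listed when computing statistics across all partitions, a matching statement would leave both columns unbound (an inferred sketch):

```sql
-- Both partition columns are listed with no values, so all partitions match
ANALYZE TABLE Table1 PARTITION(ds, hr) COMPUTE STATISTICS;
```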

then statistics are gathered for all four partitions.

If you issue the command:
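With FOR COLUMNS added to the all-partitions form, the statement would plausibly read:

```sql
ANALYZE TABLE Table1 PARTITION(ds, hr) COMPUTE STATISTICS FOR COLUMNS;
```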

then column statistics for all columns are gathered for all four partitions (Hive 0.10.0 and later).

For a non-partitioned table, you can issue the command:
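For this case no PARTITION clause is needed. Table1 here stands in for a hypothetical non-partitioned table:

```sql
-- Table1 is assumed to be non-partitioned in this example
ANALYZE TABLE Table1 COMPUTE STATISTICS;
```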

to gather statistics of the table.

For a non-partitioned table, you can issue the command:
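The column-level variant for the same hypothetical non-partitioned Table1 would be:

```sql
ANALYZE TABLE Table1 COMPUTE STATISTICS FOR COLUMNS;
```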

to gather column statistics of the table (Hive 0.10.0 and later).

If Table1 is a partitioned table, then for basic statistics you have to specify partition specifications as above in the analyze statement. Otherwise a semantic analyzer exception will be thrown.

However, for column statistics, if no partition specification is given in the analyze statement, statistics for all partitions are computed.

You can view the stored statistics by issuing the DESCRIBE command. Statistics are stored in the Parameters array. Suppose you issue the analyze command for the whole table Table1, then issue the command:
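One way to do this at table level is sketched below. The EXTENDED keyword is an assumption; DESCRIBE FORMATTED lists the same parameters in a tabular layout.

```sql
DESCRIBE EXTENDED Table1;
```

In the parameters of the output, look for keys such as numFiles, numRows, and totalSize.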

then among the output, the following would be displayed:

If you issue the command:
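For a single partition, the same command takes a partition spec. The partition chosen here is an assumption for illustration:

```sql
DESCRIBE EXTENDED Table1 PARTITION(ds='2008-04-09', hr=11);
```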

then among the output, the following would be displayed:

If you issue the command:
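A NOSCAN statement matching the result described next, assuming the same partial partition spec as in the earlier examples:

```sql
-- NOSCAN skips reading the files; only file count and size are gathered
ANALYZE TABLE Table1 PARTITION(ds='2008-04-09', hr) COMPUTE STATISTICS NOSCAN;
```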

then only the number of files and the physical size in bytes are gathered, and only for partitions 3 and 4.

Current Status (JIRA)

| Type | Key | Summary | Assignee | Reporter | Priority | Status | Resolution | Created | Updated | Due |
| Bug | HIVE-11128 | Stats Annotation misses extracting stats for cols in some cases | Ashutosh Chauhan | Ashutosh Chauhan | Major | Resolved | Fixed | Jun 26, 2015 | Jun 27, 2015 | |
| Bug | HIVE-10840 | NumberFormatException while running analyze table partition compute statics query | Ashutosh Chauhan | Jagruti Varia | Major | Resolved | Fixed | May 27, 2015 | May 28, 2015 | |
| Bug | HIVE-10832 | ColumnStatsTask failure when processing large amount of partitions | Unassigned | Chao Sun | Major | Open | Unresolved | May 27, 2015 | May 29, 2015 | |
| Improvement | HIVE-10812 | Scaling PK/FK's selectivity for stats annotation | Pengcheng Xiong | Pengcheng Xiong | Major | Resolved | Fixed | May 23, 2015 | Jun 05, 2015 | |
| Bug | HIVE-10807 | Invalidate basic stats for insert queries if autogather=false | Ashutosh Chauhan | Gopal V | Major | Patch Available | Unresolved | May 23, 2015 | May 30, 2015 | |
| Bug | HIVE-10690 | ArrayIndexOutOfBounds exception in MetaStoreDirectSql.aggrColStatsForPartitions() | Vaibhav Gumashta | Jason Dere | Major | Resolved | Pending Closed | May 12, 2015 | May 23, 2015 | |
| Bug | HIVE-10231 | Compute partition column stats fails if partition col type is date | Chaoyu Tang | Chaoyu Tang | Major | Closed | Fixed | Apr 06, 2015 | May 18, 2015 | |
| Bug | HIVE-10226 | Column stats for Date columns not supported | Jason Dere | Jason Dere | Major | Closed | Fixed | Apr 06, 2015 | May 18, 2015 | |
| Improvement | HIVE-10007 | Support qualified table name in analyze table compute statistics for columns | Chaoyu Tang | Chaoyu Tang | Major | Closed | Fixed | Mar 18, 2015 | May 18, 2015 | |
| Improvement | HIVE-9931 | Approximate nDV statistics from ORC bloom filter population | Unassigned | Gopal V | Major | Open | Unresolved | Mar 11, 2015 | Mar 11, 2015 | |
| Bug | HIVE-9717 | The max/min function used by AggrStats for decimal type is not what we expected | Pengcheng Xiong | Pengcheng Xiong | Major | Closed | Duplicate | Feb 18, 2015 | May 18, 2015 | |
| Bug | HIVE-9647 | Discrepancy in cardinality estimates between partitioned and un-partitioned tables | Pengcheng Xiong | Mostafa Mokhtar | Major | Closed | Fixed | Feb 10, 2015 | May 18, 2015 | |
| Bug | HIVE-9620 | Cannot retrieve column statistics using HMS API if column name contains uppercase characters | Chaoyu Tang | Juan Yu | Major | Closed | Fixed | Feb 09, 2015 | May 18, 2015 | |
| Bug | HIVE-9619 | Uninitialized read of numBitVectors in NumDistinctValueEstimator | Alexander Pivovarov | Alexander Pivovarov | Minor | Closed | Fixed | Feb 09, 2015 | May 18, 2015 | |
| Test | HIVE-9147 | Add unit test for HIVE-7323 | Unassigned | Peter Slawski | Minor | Patch Available | Unresolved | Dec 17, 2014 | May 08, 2015 | |
| Bug | HIVE-8975 | Possible performance regression on bucket_map_join_tez2.q | Prasanth Jayachandran | Jesus Camacho Rodriguez | Major | Resolved | Fixed | Nov 26, 2014 | Feb 12, 2015 | |
| Bug | HIVE-8863 | Cannot drop table with uppercase name after "compute statistics for columns" | Chaoyu Tang | Juan Yu | Major | Resolved | Fixed | Nov 14, 2014 | Feb 12, 2015 | |
| Sub-task | HIVE-8580 | Support LateralViewJoinOperator and LateralViewForwardOperator in stats annotation | Prasanth Jayachandran | Prasanth Jayachandran | Critical | Closed | Won't Fix | Oct 23, 2014 | Nov 13, 2014 | |
| Sub-task | HIVE-8549 | NPE in PK-FK inference when one side of join is complex tree | Prasanth Jayachandran | Prasanth Jayachandran | Critical | Closed | Fixed | Oct 21, 2014 | Nov 13, 2014 | |
| Bug | HIVE-8329 | Enable postgres for storing stats | Damien Carol | Damien Carol | Major | Resolved | Won't Fix | Oct 02, 2014 | Jun 16, 2015 | |