LanguageManual GroupBy

Group By Syntax

In groupByExpression, columns are specified by name, not by position number. However, in Hive 0.11.0 and later, columns can be specified by position if hive.groupby.orderby.position.alias is set to true (the default is false).
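
As a rough illustration of the two styles, the sketch below groups first by column name and then by position. The table pv_users and its gender column are hypothetical names chosen for this example; the second query assumes Hive 0.11.0 or later.

```sql
-- pv_users and gender are hypothetical names used only for illustration.

-- Grouping by column name (works in all versions):
SELECT gender, count(*) FROM pv_users GROUP BY gender;

-- Grouping by position (Hive 0.11.0 and later, once the property is enabled).
-- Position 1 refers to the first expression in the select list (gender).
SET hive.groupby.orderby.position.alias=true;
SELECT gender, count(*) FROM pv_users GROUP BY 1;
```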

Simple Examples

To count the number of rows in a table, use COUNT(*).
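
A minimal sketch, assuming a hypothetical table named table2:

```sql
-- table2 is a hypothetical table name.
SELECT COUNT(*) FROM table2;
```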

Note that for versions of Hive which don't include HIVE-287, you'll need to use COUNT(1) in place of COUNT(*).

To count the number of distinct users by gender, combine count(DISTINCT ...) on the user column with a GROUP BY on the gender column.
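
One way such a query might look, assuming a hypothetical table pv_users with gender and userid columns (none of these names come from the text above):

```sql
-- pv_users, gender, and userid are hypothetical names used for illustration.
SELECT pv_users.gender, count(DISTINCT pv_users.userid)
FROM pv_users
GROUP BY pv_users.gender;
```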

Multiple aggregations can be done at the same time; however, no two aggregations can have different DISTINCT columns. For example, a query is allowed when its count(DISTINCT) and sum(DISTINCT) specify the same column.
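
A sketch of an allowed combination, again assuming the hypothetical pv_users table. Both DISTINCT aggregations use the same column (userid), and count(*) has no DISTINCT at all:

```sql
-- pv_users, gender, and userid are hypothetical names used for illustration.
SELECT pv_users.gender,
       count(DISTINCT pv_users.userid),
       count(*),
       sum(DISTINCT pv_users.userid)
FROM pv_users
GROUP BY pv_users.gender;
```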


However, a query that applies DISTINCT to two different columns is not allowed, because multiple DISTINCT expressions on different columns are not supported in the same query.
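
For contrast, a sketch of a query that breaks this rule. The table pv_users and the columns userid and ip are hypothetical; the point is only that the two DISTINCT aggregations name different columns:

```sql
-- Not allowed under the rule described above:
-- DISTINCT is applied to two different columns (userid and ip).
-- pv_users, gender, userid, and ip are hypothetical names.
SELECT pv_users.gender,
       count(DISTINCT pv_users.userid),
       count(DISTINCT pv_users.ip)
FROM pv_users
GROUP BY pv_users.gender;
```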

Select statement and group by clause

When using a group by clause, the select statement can only include columns that are included in the group by clause. In addition, the select statement can contain any number of aggregation functions (e.g. count).
Consider a simple example: a table t1 with two columns, a and b.
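
A sketch of the table used in this example. The column types are an assumption; the text only establishes that t1 has columns a and b and that b can be summed.

```sql
-- Column types are assumed; only the names t1, a, and b come from the text.
CREATE TABLE t1 (a INT, b INT);
```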

A group by query on this table could select a together with sum(b), grouped by a.

Such a query works because the select clause contains a (the group by key) and an aggregation function (sum(b)).
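
The working query described here can be written as:

```sql
-- Valid: a is the group by key, and b appears only inside an aggregate.
SELECT a, sum(b) FROM t1 GROUP BY a;
```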

However, a query that selects both a and b while grouping only by a DOES NOT work.

This is because the select clause has an additional column (b) that is not included in the group by clause, and b is not an aggregation function either. To see why, suppose the table t1 contained several rows with a=100 and different values of b.
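
As an illustration, the comments below show one possible set of rows for t1 (the values are made up for this example), followed by the query that Hive rejects:

```sql
-- Hypothetical contents of t1 (values chosen only for illustration):
--   a    b
--   100  1
--   100  2
--   100  3

-- Invalid: b is neither a group by key nor inside an aggregation function.
SELECT a, b FROM t1 GROUP BY a;
```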

Since the grouping is only done on a, what value of b should Hive display for the group a=100? One can argue that it should be the first value or the lowest value, but there are clearly multiple possible options. Hive does away with this guessing by making it invalid SQL (HQL, to be precise) to have a column in the select clause that is not included in the group by clause.

Advanced Features

Multi-Group-By Inserts

The output of the aggregations or simple selects can be further sent into multiple tables or even to Hadoop DFS files (which can then be manipulated using HDFS utilities). For example, if along with the gender breakdown one needed to find the breakdown of unique page views by age, both results could be produced by a single query with multiple insert clauses.
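
A sketch of such a multi-insert query. The source table pv_users, the target table pv_gender_sum (assumed to exist already), the column names, and the output directory are all hypothetical:

```sql
-- All table, column, and path names here are hypothetical.
-- pv_users is scanned once; each INSERT clause has its own SELECT and GROUP BY.
FROM pv_users
INSERT OVERWRITE TABLE pv_gender_sum
  SELECT pv_users.gender, count(DISTINCT pv_users.userid)
  GROUP BY pv_users.gender
INSERT OVERWRITE DIRECTORY '/tmp/pv_age_sum'
  SELECT pv_users.age, count(DISTINCT pv_users.userid)
  GROUP BY pv_users.age;
```

The first clause writes the gender breakdown to a table, and the second writes the age breakdown to files in a DFS directory.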

Map-side Aggregation for Group By

hive.map.aggr controls how aggregations are performed. If it is set to true, Hive does the first-level aggregation directly in the map task.
This usually provides better efficiency, but may require more memory to run successfully. The default is false in Hive 0.2 and true in Hive 0.3 and later.
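
A minimal sketch of enabling map-side aggregation for the current session before running an aggregation, assuming a hypothetical table named table2:

```sql
-- Turn on map-side (first-level) aggregation for this session.
SET hive.map.aggr=true;

-- table2 is a hypothetical table name.
SELECT COUNT(*) FROM table2;
```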


Grouping Sets, Cubes, Rollups, and the GROUPING__ID Function

Grouping sets, CUBE and ROLLUP operators, and the GROUPING__ID function were added in Hive release 0.10.0.

See Enhanced Aggregation, Cube, Grouping and Rollup for information about these aggregation operators.
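
For orientation only, the sketch below shows the general shape of these operators in a group by clause. The table tab1 and its columns a, b, and c are hypothetical, and the queries assume Hive 0.10.0 or later; the page referenced above remains the authoritative description.

```sql
-- tab1, a, b, and c are hypothetical names used for illustration.

-- Aggregate over the (a, b) grouping and the a-only grouping in one query.
SELECT a, b, sum(c) FROM tab1 GROUP BY a, b GROUPING SETS ((a, b), a);

-- Rollup: groupings (a, b), (a), and the grand total.
SELECT a, b, sum(c) FROM tab1 GROUP BY a, b WITH ROLLUP;

-- Cube: every combination of the grouping columns.
SELECT a, b, sum(c) FROM tab1 GROUP BY a, b WITH CUBE;
```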

Also see the JIRAs:

  • HIVE-2397 Support with rollup option for group by
  • HIVE-3433 Implement CUBE and ROLLUP operators in Hive
  • HIVE-3471 Implement grouping sets in Hive
  • HIVE-3613 Implement grouping_id function

New in Hive release 0.11.0:

  • HIVE-3552 Performant manner for performing cubes/rollups/grouping sets for a high number of grouping set keys