Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Code Block
join_table:
    table_reference [INNER] JOIN table_factor [join_condition]
  | table_reference {LEFT|RIGHT|FULL} [OUTER] JOIN table_reference join_condition
  | table_reference LEFT SEMI JOIN table_reference join_condition
  | table_reference CROSS JOIN table_reference [join_condition] (as of Hive 0.10)

table_reference:
    table_factor
  | join_table

table_factor:
    tbl_name [alias]
  | table_subquery alias
  | ( table_references )

join_condition:
    ON equality_expression ( AND equality_expression )*

equality_expression:
    expression = expression

Only equality joins, outer joins, and left semi joins are supported in Hive. Hive does not support join conditions that are not equality conditions as it is very difficult to express such conditions as a map/reduce job. Also, more than two tables can be joined in Hive.

See Select Syntax for the context of this join syntax.

See Select Syntax for the context of this join syntax.

Info
titleVersion 0.13.0+: Implicit join notation

Implicit join notation is supported starting with Hive 0.13.0

Info
titleVersion

Implicit join notation is supported starting with Hive 0.13.0 (see HIVE-5558). This allows the FROM clause to join a comma-separated list of tables, omitting the JOIN keyword. For example:

SELECT *
FROM table1 t1, table2 t2, table3 t3
WHERE t1.id = t2.id AND t2.id = t3.id AND t1.zipcode = '02535';AND t1.zipcode = '02535';

Info
titleVersion 0.13.0+: Unqualified column references

Unqualified column references are supported in join conditions, starting with Hive 0.13.0 (see HIVE-6393). Hive attempts to resolve these against the inputs to a Join. If an unqualified column reference resolves to more than one table, Hive will flag it as an ambiguous reference.

For example:

CREATE TABLE a (k1 string, v1 string);
CREATE TABLE b (k2 string, v2 string);

SELECT k1, v1, k2, v2
FROM a JOIN b ON k1 = k2; 

Info
titleVersion 2.2.0+: Complex expressions in ON clause

Complex expressions in ON clause are supported, starting with Hive 2.2.0 (see HIVE-15211, HIVE-15251). Prior to that, Hive did not support join conditions that are not equality conditions.

In particular, syntax for join conditions was restricted as follows:

join_condition:
ON equality_expression ( AND equality_expression )*
equality_expression:
expression = expression

Examples

Some salient points to consider when writing join queries are as follows:

  • Only equality joins Complex join expressions are allowed e.g.

    Code Block
      SELECT a.* FROM a JOIN b ON (a.id = b.id)
    
    Code Block
      SELECT a.* FROM a JOIN b ON (a.id = b.id AND a.department = b.department)
    

    are both valid joins, however

    Code Block
      SELECT a.* FROM a LEFT OUTER JOIN b ON (a.id <> b.id)
    

    are valid joinsis NOT allowed.

  • More than 2 tables can be joined in the same query e.g.

    Code Block
      SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2)
    

    is a valid join.

...

  • LEFT SEMI JOIN implements the uncorrelated IN/EXISTS subquery semantics in an efficient way. As of Hive 0.13 the IN/NOT IN/EXISTS/NOT EXISTS operators are supported using subqueries so most of these JOINs don't have to be performed manually anymore. The restrictions of using LEFT SEMI JOIN is are that the right-hand-side table should only be referenced in the join condition (ON-clause), but not in WHERE- or SELECT-clauses etc.

    Code Block
      SELECT a.key, a.value
      FROM a
      WHERE a.key in
       (SELECT b.key
        FROM B);
    

    can be rewritten to:

    Code Block
       SELECT a.key, a.val
       FROM a LEFT SEMI JOIN b onON (a.key = b.key)
    
  • If all but one of the tables being joined are small, the join can be performed as a map only job. The query

    Code Block
      SELECT /*+ MAPJOIN(b) */ a.key, a.value
      FROM a joinJOIN b onON a.key = b.key
    

    does not need a reducer. For every mapper of A, B is read completely. The restriction is that a FULL/RIGHT OUTER JOIN b cannot be performed.

  • If the tables being joined are bucketized on the join columns, and the number of buckets in one table is a multiple of the number of buckets in the other table, the buckets can be joined with each other. If table A has 4 buckets and table B has 4 buckets, the following join

    Code Block
      SELECT /*+ MAPJOIN(b) */ a.key, a.value
      FROM a joinJOIN b onON a.key = b.key
    

    can be done on the mapper only. Instead of fetching B completely for each mapper of A, only the required buckets are fetched. For the query above, the mapper processing bucket 1 for A will only fetch bucket 1 of B. It is not the default behavior, and is governed by the following parameter

    Code Block
      set hive.optimize.bucketmapjoin = true
    
  • If the tables being joined are sorted and bucketized on the join columns, and they have the same number of buckets, a sort-merge join can be performed. The corresponding buckets are joined with each other at the mapper. If both A and B have 4 buckets,

    Code Block
      SELECT /*+ MAPJOIN(b) */ a.key, a.value
      FROM A a joinJOIN B b onON a.key = b.key
    

    can be done on the mapper only. The mapper for the bucket for A will traverse the corresponding bucket for B. This is not the default behavior, and the following parameters need to be set:

    Code Block
      set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
      set hive.optimize.bucketmapjoin = true;
      set hive.optimize.bucketmapjoin.sortedmerge = true;
    

...

  • If all but one of the tables being joined are small, the join can be performed as a map only job. The query

    Code Block
      SELECT /*+ MAPJOIN(b) */ a.key, a.value
      FROM a joinJOIN b onON a.key = b.key
    

    does not need a reducer. For every mapper of A, B is read completely.

...

  • Consider the possibility of multiple mapjoins on different keys:

    Code Block
    select /*+MAPJOIN(smallTableTwo)*/ idOne, idTwo, value FROM
      ( select /*+MAPJOIN(smallTableOne)*/ idOne, idTwo, value FROM
        bigTable JOIN smallTableOne on (bigTable.idOne = smallTableOne.idOne)                                                   
      ) firstjoin                                                             
      JOIN                                                                  
      smallTableTwo onON (firstjoin.idTwo = smallTableTwo.idTwo)                       
    

    The above query is not supported. Without the mapjoin hint, the above query would be executed as 2 map-only jobs. If the user knows in advance that the inputs are small enough to fit in memory, the following configurable parameters can be used to make sure that the query executes in a single map-reduce job.

    • hive.auto.convert.join.noconditionaltask - Whether Hive enable the optimization about converting common join into mapjoin based on the input file size. If this paramater is on, and the sum of size for n-1 of the tables/partitions for a n-way join is smaller than the specified size, the join is directly converted to a mapjoin (there is no conditional task).
    • hive.auto.convert.join.noconditionaltask.size - If hive.auto.convert.join.noconditionaltask is off, this parameter does not take affect. However, if it is on, and the sum of size for n-1 of the tables/partitions for a n-way join is smaller than this size, the join is directly converted to a mapjoin(there is no conditional task). The default is 10MB.

...