
...

Several approaches can be considered to handle this problem. If we have reliable statistics about the data, such as detailed histograms, we can process rows sharing the same value with a dedicated join algorithm that consumes or drops all of them in one shot. Alternatively, we can divide the rows into pieces small enough to fit into memory and perform the probing in several passes, though the extra passes will likely hurt performance (see the sketch below). Finally, we can fall back to a regular shuffle join once we detect that the map join cannot handle the situation.
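
The multi-pass idea can be illustrated with a small sketch. The class and method names below are illustrative only, not part of Hive's code; the sketch merely shows how an oversized build partition could be joined chunk by chunk, re-scanning the probe rows once per chunk.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Minimal sketch of multi-pass probing for an oversized build partition.
 * Names and types here are hypothetical, not Hive APIs.
 */
public class MultiPassProbe {

    /** A row keyed by the join key (simplified to strings). */
    record Row(String key, String value) {}

    /**
     * Joins an oversized build partition against its probe partition in
     * several passes: each pass loads only as many build rows as fit in
     * memory and re-scans the probe rows against that in-memory chunk.
     */
    static List<String> joinInPasses(List<Row> buildRows, List<Row> probeRows,
                                     int maxRowsInMemory) {
        List<String> output = new ArrayList<>();
        for (int start = 0; start < buildRows.size(); start += maxRowsInMemory) {
            int end = Math.min(start + maxRowsInMemory, buildRows.size());
            // Build an in-memory hash table for this chunk only.
            Map<String, List<Row>> hashTable = new HashMap<>();
            for (Row r : buildRows.subList(start, end)) {
                hashTable.computeIfAbsent(r.key(), k -> new ArrayList<>()).add(r);
            }
            // Re-scan the probe side for every chunk; these extra scans are
            // the performance cost mentioned above.
            for (Row probe : probeRows) {
                List<Row> matches = hashTable.get(probe.key());
                if (matches != null) {
                    for (Row m : matches) {
                        output.add(probe.value() + " x " + m.value());
                    }
                }
            }
        }
        return output;
    }
}

The cost of this approach is one extra scan of the probe partition per chunk, which is why it is only attractive when statistics or a shuffle-join fallback are unavailable.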

Bloom Filter

A cheap Bloom filter is built during the build phase of the Hybrid hash table and consulted during the probe phase before a row is spilled into the matchfile, which can greatly reduce the size of the on-disk matchfile. The goal is to minimize the number of records spilled to disk that have no match in the spilled hash tables. The optimization also benefits left outer joins: a row entering the hybrid join whose key is not in the filter can be output immediately with appropriate nulls indicating a lack of match, whereas without the filter it would have to be serialized to disk only to be reloaded, still without a match, at the end of the probe.
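
As a rough illustration of the mechanism (the class names and the simple two-hash Bloom filter below are hypothetical, not Hive's actual implementation), the probe side consults the filter before writing a row to the matchfile:

import java.util.BitSet;

/**
 * Sketch (not Hive's implementation) of how a Bloom filter built over the
 * build-side keys lets the probe phase skip spilling rows that cannot
 * possibly match a spilled hash table partition.
 */
public class BloomFilterSpillSketch {

    /** Tiny Bloom filter over string keys; two hash functions, fixed size. */
    static class KeyBloomFilter {
        private final BitSet bits;
        private final int size;

        KeyBloomFilter(int size) {
            this.size = size;
            this.bits = new BitSet(size);
        }

        void add(String key) {
            bits.set(hash1(key));
            bits.set(hash2(key));
        }

        /** False means "definitely not present"; true means "maybe present". */
        boolean mightContain(String key) {
            return bits.get(hash1(key)) && bits.get(hash2(key));
        }

        private int hash1(String key) {
            return Math.floorMod(key.hashCode(), size);
        }

        private int hash2(String key) {
            return Math.floorMod(key.hashCode() * 31 + 17, size);
        }
    }

    /**
     * Probe-phase decision for a row whose partition was spilled to disk:
     * consult the Bloom filter first and only spill rows that might match.
     * For a left outer join, a definite non-match can be emitted with nulls
     * immediately instead of being written to the matchfile.
     */
    static void handleSpilledPartitionRow(KeyBloomFilter filter, String joinKey,
                                          SpillFile matchfile, boolean leftOuter) {
        if (filter.mightContain(joinKey)) {
            matchfile.spill(joinKey);   // may still match; defer to a later pass
        } else if (leftOuter) {
            emitWithNulls(joinKey);     // no match possible; output right away
        }                               // inner join: the row can simply be dropped
    }

    // Hypothetical stand-ins for the spill file and output machinery.
    interface SpillFile { void spill(String key); }
    static void emitWithNulls(String key) { System.out.println(key + "\tNULL"); }
}

Because a Bloom filter can report false positives but never false negatives, only the "definitely not present" answer is acted on, so correctness of the join is unaffected.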
