
...

Several approaches can be considered to handle this problem. If we have reliable statistics about the data, such as detailed histograms, we can process rows sharing the same value with a dedicated join algorithm that consumes or drops all of them in one shot. Alternatively, we can divide the rows into pieces small enough to fit into memory and perform the probing in several passes, though the extra passes will likely hurt performance (see the sketch below). Finally, we can fall back to a regular shuffle join once we detect that the map join cannot handle the situation.
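
The multi-pass idea can be illustrated with a small sketch. The class and method names below are illustrative only, not part of Hive's code; the sketch merely shows how an oversized build partition could be joined chunk by chunk, re-scanning the probe rows once per chunk.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Minimal sketch of multi-pass probing for an oversized build partition.
 * Names and types here are hypothetical, not Hive APIs.
 */
public class MultiPassProbe {

    /** A row keyed by the join key (simplified to strings). */
    record Row(String key, String value) {}

    /**
     * Joins an oversized build partition against its probe partition in
     * several passes: each pass loads only as many build rows as fit in
     * memory and re-scans the probe rows against that in-memory chunk.
     */
    static List<String> joinInPasses(List<Row> buildRows, List<Row> probeRows,
                                     int maxRowsInMemory) {
        List<String> output = new ArrayList<>();
        for (int start = 0; start < buildRows.size(); start += maxRowsInMemory) {
            int end = Math.min(start + maxRowsInMemory, buildRows.size());
            // Build an in-memory hash table for this chunk only.
            Map<String, List<Row>> hashTable = new HashMap<>();
            for (Row r : buildRows.subList(start, end)) {
                hashTable.computeIfAbsent(r.key(), k -> new ArrayList<>()).add(r);
            }
            // Re-scan the probe side for every chunk; these extra scans are
            // the performance cost mentioned above.
            for (Row probe : probeRows) {
                List<Row> matches = hashTable.get(probe.key());
                if (matches != null) {
                    for (Row m : matches) {
                        output.add(probe.value() + " x " + m.value());
                    }
                }
            }
        }
        return output;
    }
}

The cost of this approach is one extra scan of the probe partition per chunk, which is why it is only attractive when statistics or a shuffle-join fallback are unavailable.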

Bloom Filter

A cheap Bloom filter is built during the build phase of the Hybrid hash table and consulted during the probe phase before a row is spilled into the matchfile, which can greatly reduce the size of the on-disk matchfile. The goal is to minimize the number of records spilled to disk that have no match in the spilled hash tables. The optimization also benefits left outer joins: a row entering the hybrid join whose key is not in the filter can be output immediately with appropriate nulls indicating a lack of match, whereas without the filter it would have to be serialized to disk only to be reloaded, still without a match, at the end of the probe.
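
As a rough illustration of the mechanism (the class names and the simple two-hash Bloom filter below are hypothetical, not Hive's actual implementation), the probe side consults the filter before writing a row to the matchfile:

import java.util.BitSet;

/**
 * Sketch (not Hive's implementation) of how a Bloom filter built over the
 * build-side keys lets the probe phase skip spilling rows that cannot
 * possibly match a spilled hash table partition.
 */
public class BloomFilterSpillSketch {

    /** Tiny Bloom filter over string keys; two hash functions, fixed size. */
    static class KeyBloomFilter {
        private final BitSet bits;
        private final int size;

        KeyBloomFilter(int size) {
            this.size = size;
            this.bits = new BitSet(size);
        }

        void add(String key) {
            bits.set(hash1(key));
            bits.set(hash2(key));
        }

        /** False means "definitely not present"; true means "maybe present". */
        boolean mightContain(String key) {
            return bits.get(hash1(key)) && bits.get(hash2(key));
        }

        private int hash1(String key) {
            return Math.floorMod(key.hashCode(), size);
        }

        private int hash2(String key) {
            return Math.floorMod(key.hashCode() * 31 + 17, size);
        }
    }

    /**
     * Probe-phase decision for a row whose partition was spilled to disk:
     * consult the Bloom filter first and only spill rows that might match.
     * For a left outer join, a definite non-match can be emitted with nulls
     * immediately instead of being written to the matchfile.
     */
    static void handleSpilledPartitionRow(KeyBloomFilter filter, String joinKey,
                                          SpillFile matchfile, boolean leftOuter) {
        if (filter.mightContain(joinKey)) {
            matchfile.spill(joinKey);   // may still match; defer to a later pass
        } else if (leftOuter) {
            emitWithNulls(joinKey);     // no match possible; output right away
        }                               // inner join: the row can simply be dropped
    }

    // Hypothetical stand-ins for the spill file and output machinery.
    interface SpillFile { void spill(String key); }
    static void emitWithNulls(String key) { System.out.println(key + "\tNULL"); }
}

Because a Bloom filter can report false positives but never false negatives, only the "definitely not present" answer is acted on, so correctness of the join is unaffected.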
