This index is built by adding Bloom filters with a very low false-positive probability (e.g. 1/10^9) to the Parquet file footers, along with the min/max range of the record keys in that def~base-file. The advantage of this index over HBase is the obvious removal of a big external dependency, as well as cleaner handling of rollbacks and partial updates, since the index is part of the data file itself. At runtime, checking the Bloom Index for a given set of record keys effectively amounts to checking all the Bloom filters within a given partition against the incoming records, using a Spark join. Much of the engineering effort on the Bloom Index has gone into scaling this join by caching the incoming RDD[HoodieRecord] and dynamically tuning join parallelism, to avoid hitting Spark limitations such as the 2GB maximum partition size. As a result, the Bloom Index implementation has been able to reliably handle single upserts of up to 5TB.
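The per-file lookup described above can be illustrated with a minimal, single-process sketch. It is not Hudi's implementation: `BloomFilter`, `FileFooterIndex`, and `candidate_files` are hypothetical names, the filter uses the textbook sizing formulas, and a plain loop stands in for the Spark join. Because a Bloom filter can report false positives, a hit only marks a file as a candidate that still has to be confirmed against the keys actually stored in it.

```python
import hashlib
import math


class BloomFilter:
    """Textbook Bloom filter sized for a target false-positive probability."""

    def __init__(self, expected_entries: int, fpp: float):
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        self.num_bits = max(8, math.ceil(-expected_entries * math.log(fpp) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / expected_entries * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, key: str):
        # Double hashing: derive all bit positions from two halves of one digest.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


class FileFooterIndex:
    """Hypothetical stand-in for the index data kept in one base file's footer."""

    def __init__(self, file_id: str, keys, fpp: float = 1e-9):
        self.file_id = file_id
        self.min_key = min(keys)
        self.max_key = max(keys)
        self.bloom = BloomFilter(len(keys), fpp)
        for key in keys:
            self.bloom.add(key)

    def might_contain(self, key: str) -> bool:
        # Cheap min/max range check first, Bloom filter probe second.
        return self.min_key <= key <= self.max_key and self.bloom.might_contain(key)


def candidate_files(footers, record_key: str):
    """Return ids of files in a partition that may hold record_key."""
    return [f.file_id for f in footers if f.might_contain(record_key)]


footers = [
    FileFooterIndex("file-1", ["key-001", "key-002", "key-003"]),
    FileFooterIndex("file-2", ["key-101", "key-102"]),
]
print(candidate_files(footers, "key-002"))  # ['file-1']
print(candidate_files(footers, "key-050"))  # [] (outside both key ranges)
```

In this sketch the range check rejects `file-2` for `key-002` before its Bloom filter is ever probed, which is the same short-circuit that range pruning provides at scale.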

Bloom Index supports both global and non-global lookups and works best when the keys have some temporal prefix, which makes range pruning maximally effective and cuts down the search space dramatically. If the workload consists of random keys, it is prudent to turn off range pruning, since nearly every file's key range then overlaps the incoming keys and pruning eliminates very little. See the Tuning Guide for more details.
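The effect of key layout on range pruning can be seen in a small sketch. The file ids, key formats, and the `overlapping_files` helper below are made up for illustration; the only assumption carried over from the text is that each file records the min and max of its record keys.

```python
def overlapping_files(file_ranges, key):
    """Return ids of files whose [min_key, max_key] range contains key."""
    return [fid for fid, (lo, hi) in file_ranges.items() if lo <= key <= hi]


# Keys with a temporal prefix: each file covers a narrow, disjoint range.
temporal_ranges = {
    "file-1": ("20240101-0001", "20240101-9999"),
    "file-2": ("20240102-0001", "20240102-9999"),
    "file-3": ("20240103-0001", "20240103-9999"),
}

# Random keys (e.g. hash-like ids): every file spans almost the whole key space.
random_ranges = {
    "file-1": ("01a3", "fe90"),
    "file-2": ("00b7", "ffd2"),
    "file-3": ("02c4", "fd11"),
}

print(overlapping_files(temporal_ranges, "20240102-0420"))  # ['file-2']
print(overlapping_files(random_ranges, "7f3e"))  # ['file-1', 'file-2', 'file-3']
```

With temporal prefixes only one file survives pruning, so only one Bloom filter needs to be checked. With random keys all files survive, so the range comparison is pure overhead. In Hudi releases that expose it, this behaviour is controlled by the `hoodie.bloom.index.prune.by.ranges` setting; confirm the exact name and default for your version in the Tuning Guide.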

DAG with Range Pruning:
