...

We propose modifying Hive to add Spark as a third execution backend (HIVE-7292), parallel to MapReduce and Tez.

...

More information about Spark can be found here:

...

Although sortByKey does no grouping, grouping is easy to add on top of it because rows with the same key arrive consecutively. groupByKey, on the other hand, gathers the values for each key into a collection, which naturally fits MapReduce’s reducer interface.
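
The difference between the two approaches can be sketched in plain Python, with no Spark dependency. A sorted list of pairs stands in for the output of sortByKey, and a dictionary stands in for the result of groupByKey. The function names and sample rows are illustrative assumptions, not Hive or Spark APIs.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def group_sorted(pairs):
    """Group (key, value) pairs that are already sorted by key.

    Mimics grouping on top of a sortByKey-style output: equal keys are
    adjacent, so a single pass suffices and only one group is open at a time.
    """
    for key, run in groupby(pairs, key=itemgetter(0)):
        yield key, [value for _, value in run]


def group_by_key(pairs):
    """Collect all values per key, mimicking a groupByKey-style result.

    Keys need not be ordered, but every group is held in memory at once.
    """
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


rows = [("b", 2), ("a", 1), ("b", 3), ("a", 4)]
sorted_rows = sorted(rows, key=itemgetter(0))  # stand-in for sortByKey

sorted_groups = list(group_sorted(sorted_rows))
hashed_groups = group_by_key(rows)
```

Both paths hand the reducer a key together with all of its values; they differ only in whether the keys arrive in sorted order.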

...

Implementing joins in the MapReduce world is rather complicated, as Hive itself demonstrates. Hive has reduce-side joins as well as map-side joins (including map-side hash lookup and map-side sorted merge). We will keep Hive’s join implementations. However, extra attention must be paid to the shuffle behavior (key generation, partitioning, sorting, etc.), since Hive relies extensively on MapReduce’s shuffle to implement reduce-side joins. Spark is expected to provide, now or in the future, flexible control over shuffling, as pointed out in the previous section (Shuffle, Group, and Sort).

See Hive on Spark: Join Design Master for the detailed design.
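
To illustrate why a reduce-side join depends on shuffle behavior, here is a deliberately simplified sketch in plain Python. It is not Hive’s implementation: the function name, the numeric table tags, and the sample tables are assumptions made for illustration. Rows are tagged with their source table, sorted by (key, tag) as a stand-in for the shuffle, and then joined one key at a time.

```python
from itertools import groupby, product
from operator import itemgetter


def reduce_side_join(left, right):
    """Inner-join two lists of (key, value) rows in reduce-side style."""
    # "Map" phase: tag each row with its source table (0 = left, 1 = right).
    tagged = [(k, 0, v) for k, v in left] + [(k, 1, v) for k, v in right]
    # "Shuffle" phase: sort by (key, tag) so rows with the same key are
    # adjacent and left rows precede right rows within each key.
    tagged.sort(key=itemgetter(0, 1))
    # "Reduce" phase: for each key, pair every left row with every right row.
    joined = []
    for key, run in groupby(tagged, key=itemgetter(0)):
        sides = ([], [])
        for _, tag, value in run:
            sides[tag].append(value)
        joined.extend((key, l, r) for l, r in product(*sides))
    return joined


users = [(1, "alice"), (3, "carol")]
orders = [(1, "book"), (2, "pen"), (1, "lamp")]
result = reduce_side_join(users, orders)
```

If the sort key or the grouping of equal keys were changed, the reduce phase would no longer see all matching rows together, which is why key generation, partitioning, and sorting need the attention described above.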

Number of Tasks

As specified above, Spark transformations such as partitionBy will be used to connect mapper-side operations to reducer-side operations. The number of partitions can optionally be given for those transformations, which essentially determines the number of reducers.
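
The relationship between partition count and reducer count can be sketched in plain Python. This is a stand-in under stated assumptions, not Spark’s partitionBy: the helper names are hypothetical, keys are small integers so that hashing is deterministic, and each "reducer" simply sums values per key.

```python
def partition_by(pairs, num_partitions):
    """Route each (key, value) pair to a bucket chosen by key hash."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions


def run_reducers(partitions):
    """Run one "reducer task" per partition; each sums its values per key."""
    results = []
    for partition in partitions:
        totals = {}
        for key, value in partition:
            totals[key] = totals.get(key, 0) + value
        results.append(totals)
    return results


rows = [(1, 10), (2, 20), (3, 30), (4, 40), (1, 5)]
reduced = run_reducers(partition_by(rows, 2))
```

Asking for two partitions yields exactly two reducer tasks, and all rows for a given key land in the same one.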

...