Preliminaries
Overview
HIVE-556 requests that Hive support non-equality joins, commonly called theta joins. Theta joins cover all kinds of non-equality join conditions such as <, <=, >, >=, <>, LIKE, RLIKE, and generic UDFs.
Range joins, a specific kind of theta join, are extremely useful in temporal and spatial applications. This work will focus first on range join, as it is arguably the most important theta join; after range join is implemented, the remaining theta join operators can be implemented with relative ease. Supporting theta join, and specifically range join, is important because it makes Hive competitive in additional use cases and removes a deficiency relative to traditional systems.
Specific Use Cases
GeoLocation
This use case is very common today for many typical users of Hive: websites, mobile telephone services, and online ad marketers. The user wants to join website access logs with geolocation information for IP addresses. This can be implemented as an interval join. Currently this use case is handled by creating a custom UDF which performs the lookup; compiling and deploying UDFs is an error-prone process and not a common task for most users of Hive.
As an example of this use case, we can utilize the MaxMind GeoIP by-country dataset, which contains 79,980 records. Consider joining that dataset with a modest access log of 5,000,000 entries using a cartesian product and then filtering the result. This approach produces 399,900,000,000 tuples that must be filtered in the mapper. Hive trunk was modified to implement a map-side interval join, and the join described above completed in 21 seconds on a single host, while the cartesian-product-then-filter approach ran for many hours before it was killed.
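The arithmetic behind this comparison is easy to verify. The probe-count figure below is a back-of-the-envelope estimate assuming a balanced interval tree, not a measurement from the experiment:

```python
import math

geo_rows = 79_980       # MaxMind GeoIP by-country records
log_rows = 5_000_000    # access-log entries

# Cartesian product then filter: every pair is materialized.
cartesian_tuples = geo_rows * log_rows

# Interval join: one O(log n) tree probe per log entry.
probe_comparisons = log_rows * math.ceil(math.log2(geo_rows))

print(cartesian_tuples)    # 399900000000
print(probe_comparisons)   # 85000000
```

Roughly 400 billion filtered tuples versus about 85 million comparisons, which is consistent with the several-orders-of-magnitude difference observed.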
Side-Table Similarity
This use case was originally proposed in HIVE-556 by its creator, Min Zhou. The user has a side table with some entries and wants to join it with a main table based on similarity. The example provided used the RLIKE operator, but LIKE or a generic boolean UDF could be used as well.
Requirements
A notable non-goal of this work is n-way theta joins. The algorithm for n-way theta joins [4] was inspired by the algorithm for 2-way theta joins [3], so n-way theta joins can be implemented in future work.
Goals
 Map-side 2-way theta joins
 Reduce-side 2-way theta joins
Specific Non-Goals
 Map-side n-way theta joins
 Reduce-side n-way theta joins
 Additional statistics collection
Literature Review
Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters [1]
This work adds a Merge step to MapReduce which allows easy expression of relational algebra operators. It is interesting, but because it requires modification of the MapReduce framework it is not immediately useful.
Efficient Parallel Set-Similarity Joins Using MapReduce [2]
This work studies a special type of set similarity, specifically similar strings or bit vectors, and could be useful for implementing some operators such as LIKE. Note that the method requires statistics to be calculated at run time, which results in multiple MapReduce jobs.
Processing Theta-Joins using MapReduce [3]
This work proposes an algorithm, 1-Bucket-Theta, to perform theta joins on two relations in a single MapReduce job given some basic statistics, namely the cardinalities of the two relations. This approach also allows a parallel implementation of cartesian product. The work additionally details how further input statistics can be exploited to improve efficiency. The approach is to partition a join matrix of the two relations.
Efficient Multi-way Theta-Join Processing Using MapReduce [4]
This work is inspired by [3] and expands the method to n-way joins. The approach used is to partition a hypercube of the relations. An approach to merging the resulting many MapReduce jobs into a single job is also discussed, with results similar to YSmart [5].
Design
Map-side
A large number of theta join use cases have the nice property that only one of the relations is "large"; therefore many theta joins can be converted to map joins. Presently these use cases utilize a map-side cartesian product with post-join filters. As noted in the geolocation use case above, some of these use cases, specifically range joins, can see a speedup of several orders of magnitude using theta join.
Currently the map-side join utilizes a hash map, and a join is performed when the incoming key matches a key in the hash map. To support range join this will be abstracted into a pluggable interface; the plugin decides how two keys are joined. The equality join implementation will continue to utilize a hash map, while range join can use a data structure such as an interval tree. Other such optimizations can be made; for example, the not-equals join condition <> can use a view on top of a map.
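As a rough sketch of the pluggable interface (class and method names here are hypothetical, not Hive's actual API), the equality plugin keeps a hash map while a range plugin substitutes an ordered structure; a sorted list with binary search stands in below for a full interval tree over non-overlapping ranges:

```python
import bisect

class EqualityLookup:
    """Equality join plugin: plain hash-map probe."""
    def __init__(self, rows):          # rows: {key: value}
        self.table = dict(rows)
    def probe(self, key):
        v = self.table.get(key)
        return [v] if v is not None else []

class RangeLookup:
    """Range join plugin over non-overlapping (lo, hi, value) rows.
    A sorted list plus binary search; an interval tree would be
    used for overlapping intervals."""
    def __init__(self, rows):
        self.rows = sorted(rows)                 # sort by lo
        self.lows = [r[0] for r in self.rows]
    def probe(self, key):
        i = bisect.bisect_right(self.lows, key) - 1
        if i >= 0 and self.rows[i][0] <= key <= self.rows[i][1]:
            return [self.rows[i][2]]
        return []
```

Both plugins expose the same `probe` shape, which is the point of the abstraction: the join operator need not know how keys are matched.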
Reduce-side
Reduce-side joins will be implemented via 1-Bucket-Theta as described in [3]. This requires the cardinalities of the two relations, so to perform a reduce-side theta join statistics must be turned on. Initially, if the required statistics do not exist, an exception will be thrown indicating the problem. After the initial implementation we can use a method to estimate the cardinalities.
As previously mentioned, a detailed description of 1-Bucket-Theta is located in [3], so the discussion of the algorithm's internals here will be brief. Joining two relations S and T can be viewed as a matrix with S, the smaller relation, labeling the rows and T labeling the columns.
The matrix is partitioned among r, the number of reducers. An example join matrix follows for four reducers, numbered 1-4; each cell shows the reducer responsible for that region:
Row Ids   T1   T2   T3   T4
S1         1    1    2    2
S2         1    1    2    2
S3         3    3    4    4
S4         3    3    4    4
In the map phase, each tuple in S is sent to all reducers whose region intersects the tuple's row id. For example, the S-tuple with row id 2 is sent to reducers 1 and 2. Similarly, each tuple in T is sent to all reducers whose region intersects the tuple's column id. For example, the T-tuple with column id 4 is sent to reducers 2 and 4.
In Hive and MapReduce, row ids are not commonly available. Therefore we choose a random row id between 1 and |S| (the cardinality of S) for tuples of S, and between 1 and |T| for tuples of T. Thus a reduce-side theta join must know the estimated cardinality of each relation, and statistics must be enabled. Random row ids result in well-balanced reducer input when processing larger relations. As noted in [3], the partitioning scheme works such that if a relation is much smaller than its pair, the smaller relation is broadcast to all reducers; therefore the random skew that would occur for small relations does not impact the algorithm in practice. Additionally, in Hive, if a relation is small the join is converted to a map-side join and 1-Bucket-Theta is not utilized.
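The row-id-to-region mapping can be sketched as follows. This is simplified to a square grid of regions whose dimensions divide the cardinalities evenly; the scheme in [3] handles the general case. The numbers match the four-reducer example matrix:

```python
# Sketch of 1-Bucket-Theta region assignment for an |S| x |T|
# join matrix split into a grid x grid layout of regions,
# numbered row-major starting at 1. A real mapper would first
# draw row_id = random.randint(1, card_s) (or card_t for T).
def regions(card_s, card_t, grid):
    rows_per = card_s // grid       # S rows per region row
    cols_per = card_t // grid       # T columns per region column

    def for_s(row_id):
        # Region row containing this S row id; emit one copy
        # per region column.
        r = (row_id - 1) // rows_per
        return [r * grid + c + 1 for c in range(grid)]

    def for_t(col_id):
        # Region column containing this T column id; emit one
        # copy per region row.
        c = (col_id - 1) // cols_per
        return [r * grid + c + 1 for r in range(grid)]

    return for_s, for_t

for_s, for_t = regions(4, 4, 2)
print(for_s(2))   # [1, 2] - S row 2 goes to reducers 1 and 2
print(for_t(4))   # [2, 4] - T column 4 goes to reducers 2 and 4
```

Each S tuple is replicated once per region column and each T tuple once per region row, so every (S, T) pair meets in exactly one reducer.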
Mapper
In the mapper the join matrix will be initialized, a random row id chosen, and the tuple emitted for each reducer that intersects the row id. Hive already has a mechanism to set the hash code for a specific tuple, which can be reused here. Additionally, the tuples need to be sorted such that tuples for S arrive in the reducer first; luckily, Hive already implements this via the JoinReorder class.
Reducer
The reducer is fairly simple: it buffers up the S relation and then performs the requested join on each T tuple that is received.
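The reduce side can be sketched as below (tuple shapes and tags are hypothetical, not Hive's actual operator API). Because S tuples are sorted first, the reducer buffers them and then streams T tuples through the theta predicate:

```python
def theta_reduce(tuples, predicate):
    """tuples: iterable of (tag, row) pairs where tag is "S" or
    "T", with all S tuples arriving before any T tuple (as
    guaranteed by the JoinReorder-style sorting). Yields each
    (s_row, t_row) pair satisfying the theta predicate."""
    s_buffer = []
    for tag, row in tuples:
        if tag == "S":
            s_buffer.append(row)        # buffer the smaller relation
        else:
            # All S tuples have been seen; test this T tuple
            # against every buffered S tuple.
            for s_row in s_buffer:
                if predicate(s_row, row):
                    yield (s_row, row)

# Example: a range predicate s.lo <= t.key <= s.hi
pairs = list(theta_reduce(
    [("S", (0, 9)), ("S", (10, 19)), ("T", (5,)), ("T", (12,))],
    lambda s, t: s[0] <= t[0] <= s[1],
))
# pairs == [((0, 9), (5,)), ((10, 19), (12,))]
```

Only S is buffered, which is why the algorithm labels the smaller relation's tuples to arrive first.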