Hadoop-GIS Design Documents
Hadoop-GIS is a scalable, high-performance spatial data warehousing system for running large-scale spatial queries on Hadoop.
Hadoop-GIS supports multiple types of spatial queries on MapReduce through space partitioning, a customizable spatial query engine (RESQUE), implicit parallel spatial query execution on MapReduce, and effective methods for amending query results by handling boundary objects. It uses global partition indexing and customizable, on-demand local spatial indexing to achieve efficient query processing. Hadoop-GIS is also integrated into Hive to support declarative spatial queries.
Hadoop-GIS comes with two deliverables:
- Hadoop-GIS library: a set of libraries for performing MapReduce-based spatial queries, which can be called from users' applications.
- HiveSP: an integration of Hadoop-GIS with Hive that supports both structured queries and spatial queries through a unified query language and interface.
We added three spatial data types to Hive: Point, Line, and Polygon. To do so, we modified the data type declaration in the schema definition, so tables can be created with spatial-type columns. For example, the following statement creates a table with a Point column:
CREATE TABLE IF NOT EXISTS point_table ( tile_id STRING, d_id STRING, rec_id STRING, outline POINT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' ;
With these three spatial data types, queries on spatial data can be combined with standard HQL queries.
We created query pipelines between Hive and RESQUE (our customizable spatial query engine), and modified Hive's source code (e.g. SemanticAnalyzer.java) to support them. A spatial predicate (e.g. st_intersects or st_equals) is passed from Hive to RESQUE. RESQUE then executes the spatial query (e.g. testing for intersection or equality) and returns the results to Hive.
Ten Spatial Predicates
Hadoop-GIS currently supports the following 10 spatial predicates:
Spatial queries can be used just like standard HQL in the Hive shell. For example, to find the intersections between two spatial tables (say tc and td), use the following HQL statement:
- SELECT tc.rec_id, td.rec_id FROM tc JOIN td ON (st_intersects(tc.outline, td.outline) = TRUE);
Spatial Range Query
A spatial range query is implemented as index-based filtering of HDFS files. We first apply one of the 10 predicates above, st_contains, to filter the input spatial data, and then pass the filtered data on to more complex computation. Because data outside the query range is never processed further, this filtering can save considerable computation time.
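The filtering step can be sketched in Python as follows. This is a minimal illustration, not the Hadoop-GIS implementation: it assumes each HDFS file (partition) is summarized in a global index by a minimum bounding rectangle (MBR), and it tests containment on object MBRs rather than on exact geometries as st_contains would. The names `Rect`, `range_filter`, `partition_index`, and `partitions` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle used as a minimum bounding rectangle (MBR)."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    def intersects(self, other: "Rect") -> bool:
        return not (other.xmin > self.xmax or other.xmax < self.xmin
                    or other.ymin > self.ymax or other.ymax < self.ymin)

    def contains(self, other: "Rect") -> bool:
        return (self.xmin <= other.xmin and self.ymin <= other.ymin
                and self.xmax >= other.xmax and self.ymax >= other.ymax)


def range_filter(window, partition_index, load_partition):
    """Yield (partition_id, object_id) for objects whose MBR lies inside window.

    partition_index: dict mapping partition id -> Rect covering that partition
    load_partition:  callable returning an iterable of (object_id, Rect)
    """
    for pid, extent in partition_index.items():
        if not window.intersects(extent):
            continue  # the whole partition is skipped without being read
        for oid, mbr in load_partition(pid):
            if window.contains(mbr):
                yield pid, oid


# Toy data standing in for HDFS files (hypothetical).
partition_index = {"p0": Rect(0, 0, 10, 10), "p1": Rect(10, 0, 20, 10)}
partitions = {
    "p0": [("a", Rect(1, 1, 2, 2)), ("b", Rect(8, 8, 9, 9))],
    "p1": [("c", Rect(11, 1, 12, 2))],
}
hits = list(range_filter(Rect(0, 0, 5, 5), partition_index, partitions.__getitem__))
```

In this toy run, partition p1 lies entirely outside the query window, so its objects are never loaded; only the objects of p0 are tested individually.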
Nearest Neighbor Queries
We use nearest neighbor queries (a distance function between spatial objects) to cluster the data into partitions. Spatial queries (e.g. st_intersects or st_contains) can then be executed efficiently within each partition. For example, st_intersects only needs to be evaluated among objects in the same partition; there is no need to evaluate it between objects in two different partitions.
RESQUE, our customizable spatial query engine, depends on two libraries: GEOS and spatialindex.
RESQUE uses the C++ interface of geos-3.3.8 to create spatial objects and perform spatial operations. In RESQUE, we use GeometryFactory and Geometry objects to store the spatial data, and then call the spatial operation APIs (e.g. geos::geom::Geometry::intersects(const Geometry *g) const and geos::geom::Geometry::crosses(const Geometry *g) const) to perform the spatial operations. Finally, RESQUE writes the output to the data stream (the query pipeline). Hive then reads the output from the data stream and either prints it in the Hive shell or continues with further operations.
RESQUE uses spatialindex to catch the exceptions raised in each spatial operation.
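The idea of comparing only objects that share a partition can be sketched as follows. This is a simplified illustration under stated assumptions: a uniform grid stands in for the distance-based clustering described above, MBR overlap stands in for st_intersects, and objects that cross tile boundaries are assigned to every tile they overlap, with duplicate result pairs removed at the end. The function names are hypothetical.

```python
from collections import defaultdict
from itertools import product


def tiles_for(mbr, tile_size):
    """Return the grid tiles (col, row) overlapped by mbr = (xmin, ymin, xmax, ymax)."""
    xmin, ymin, xmax, ymax = mbr
    cols = range(int(xmin // tile_size), int(xmax // tile_size) + 1)
    rows = range(int(ymin // tile_size), int(ymax // tile_size) + 1)
    return product(cols, rows)


def mbr_intersects(a, b):
    """True if two (xmin, ymin, xmax, ymax) rectangles overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def partitioned_join(left, right, tile_size):
    """Join two dicts of id -> MBR, comparing only objects that share a tile."""
    buckets = defaultdict(lambda: ([], []))
    for oid, mbr in left.items():
        for tile in tiles_for(mbr, tile_size):
            buckets[tile][0].append(oid)
    for oid, mbr in right.items():
        for tile in tiles_for(mbr, tile_size):
            buckets[tile][1].append(oid)

    results = set()  # a set drops pairs reported by more than one tile
    for left_ids, right_ids in buckets.values():
        for l_id, r_id in product(left_ids, right_ids):
            if mbr_intersects(left[l_id], right[r_id]):
                results.add((l_id, r_id))
    return results


# Toy tables standing in for tc and td (hypothetical data).
tc = {"c1": (1, 1, 3, 3), "c2": (9, 9, 11, 11)}
td = {"d1": (2, 2, 4, 4), "d2": (9.5, 9.5, 12, 12), "d3": (30, 30, 31, 31)}
pairs = partitioned_join(tc, td, tile_size=10)
```

Here c2 and d2 both straddle four tiles, so the same pair is found in each of them and reported once. Object d3 shares no tile with any object of tc and is never compared.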
Modification for Hive Core Source Code
We mainly modified the semantic analyzer (SemanticAnalyzer.java) in the Hive core source code, so that Hive recognizes the spatial functions, calls RESQUE for spatial computation, and reads the output through our query pipelines. The path of SemanticAnalyzer.java is:
We mainly modified the analyzeInternal function and other related functions.
In analyzeInternal, we call analyzeSpatial to detect whether the HQL query contains a spatial request. In analyzeSpatial, we modify the semantic tree to prepare for the spatial operations, for example by extracting the requested predicate (st_intersects, st_contains, or another one). We also optimize the semantic tree in analyzeSpatial.
Then, in genSpatialJoinOperator, we use the extracted spatial operation to generate a spatial query command and call RESQUE through our query pipelines.
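The predicate-extraction step can be illustrated with a small Python sketch. This is only an approximation of the idea: the actual implementation works on Hive's semantic tree in Java, whereas this sketch matches the query text with a regular expression. `SpatialCall` and `find_spatial_call` are hypothetical names, and the sketch stops short of building a RESQUE command because the command format is not described here.

```python
import re
from typing import NamedTuple, Optional


class SpatialCall(NamedTuple):
    """A spatial predicate and its two operands (hypothetical structure)."""
    predicate: str
    left: str
    right: str


# Matches two-argument calls such as st_intersects(tc.outline, td.outline).
_SPATIAL_CALL = re.compile(
    r"\b(st_[a-z_]+)\s*\(\s*([\w.]+)\s*,\s*([\w.]+)\s*\)", re.IGNORECASE
)


def find_spatial_call(hql: str) -> Optional[SpatialCall]:
    """Return the first two-argument st_* call in the query, or None."""
    match = _SPATIAL_CALL.search(hql)
    if match is None:
        return None  # no spatial request: the query follows the normal path
    name, left, right = match.groups()
    return SpatialCall(name.lower(), left, right)


query = ("SELECT tc.rec_id, td.rec_id FROM tc JOIN td "
         "ON (st_intersects(tc.outline, td.outline) = TRUE);")
call = find_spatial_call(query)
plain = find_spatial_call("SELECT rec_id FROM tc;")
```

For the join query shown earlier, `call` carries the predicate name and the two column references that a spatial join operator would need, while `plain` is `None` for a query without a spatial request.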