Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Hadoop-GIS Design Documents

Overview

Hadoop-GIS is a scalable and high performance spatial data warehousing system for running large-scale spatial queries on Hadoop. Hadoop-GIS supports multiple types of spatial queries on MapReduce through space partitioning, customizable spatial query engine RESQUE, implicit parallel spatial query execution on MapReduce, and effective methods for amending query results through handling boundary objects on MapReduce. Hadoop-GIS takes advantage of global partition indexing and customizable on demand local spatial indexing to achieve efficient query processing. Hadoop-GIS is integrated into Hive to support declarative spatial queries with an integrated architecture. 

Hadoop-GIS comes with two deliverables: 

...

relies on RESQUE for spatial query processing. RESQUE is a internally developed tile based spatial query engine which is written in C++ and deployed as shared library.

HiveSP: we integrate Hadoop-GIS with Hive, to support both structured queries and spatial queries with a unified query language (HQL) and interface (Hive Shell).

 

Data Types

Query Language

At the language layer, Hadoop-GIS extends HQL to support spatial data types and query constructs.

JOIN operator – we kept JOIN keyword for backward compatibility. However, whenever there is a spatial operator in the join predicate, the query is considered as spatial join query and a spatial join query processing pipeline is applied to process this query.

e.g SELECT *  FROM a JOIN b on ST_INTERSECTS (a.spatialcolumn ,b.spatialcolumn)  = TRUE ;

Data Type

We will add a spatial data type in Hive: GEOMETRY. Geometry is an extension of String type with special serialization/deserialization, and operation support. For example, users can create a table with spatial column as shown in following exampleWe will add three data types in Hive: Point, Line, and Polygon. We modified the data type declaration in the schema definition. We can support creating table with spatial type column. For example, we can create a table with Point type column:

CREATE TABLE IF NOT EXISTS pointspatial_table ( tile_id STRING STRING, d_id STRING, rec_id STRING, outline POINT GEOMETRY) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' ;With these three spatial data types, we can support queries on spatial data to work with standard HQL queries.

Query Pipelines

We have created efficient query pipelines between Hive and RESQUE (our customizable spatial query engine). We have modified Hive’s source code (e.g. SemanticAnalyzer.java) to support our query pipelines. The spatial predicate (e.g. st_intersects or st_equals) will be delivered from Hive to RESQUE. Then RESQUE will do the spatial query (i.e. finding intersection or finding equality) and return the query results to Hive.

Ten Spatial Predicates

, to support various spatial queries. Again, with the philosophy of “minimum change to Hive”, current query pipelines implemented as set of “custom MapReduce codes” which interacts with Hive via TRANSFORM mechanism.

Specifically, a spatial SQL query will be intercepted at the query translation phase to translate the query into a Hive executable query operators. Basically, the spatial query processing part will be translated into set of custom Map and Reduce scripts which will interact with Hive via STDIN and STDOUT.

Before Hive submits the spatial query operator to the RESQUE for processing, it will use appropriate serialization method to transform data into a format that RESQUE can recognize. Then after REQUE processing, RESQUE will desterilize the data into a format that can be recognized by Hive.

Spatial Predicates

At this moment, Hadoop-GIS support the following spatial predicates which implemented as Hive UDF. More predicates being developed and will be integrate in future.Now we can support the following 10 spatial predicates in Hadoop-GIS:

  • st_intersects
  • st_touches
  • st_crosses
  • st_contains
  • st_adjacent
  • st_disjoint
  • st_equals
  • st_dwithin
  • st_within
  • st_overlaps

You We can use the spatial query just like using standard HQL in Hive shell. For example, if we want to find the intersection between two spatial spatially join two tables (say tc ta and td tb), you we can use the issue following HQL sentence in Hive Shell:

  • SELECT tcta.rec_id, tdtb.rec_id FROM tc ta JOIN td tb ON (st_intersects(tcta.outline, tdtb.outline) = TRUE); 

...

This is an index based HDFS file filtering. We firstly use one of the above 10 predicates, st_contains, to filter the input spatial data. Then we will deliver the filtered data to the further complex computation. We can save lots of computation time by this spatial range filtering.

Nearest Neighbor Queries

We use the Nearest Neighbor Queries (distance function) between spatial objects to cluster the data into specific partitions. Then under these specific partitions, we can do spatial queries (e.g. st_intersects or st_contains) efficiently. For example, we can just do st_intersects query among objects in the same partition. There is no need to do st_intersects among objects in two different partitions.

Dependent Libraries

Our customizable spatial query engine RESQUE uses two dependent libraries: GEOS and spatialindex.

RESQUE uses geos-3.3.8 C++ interface to create the spatial object and to do the spatial operations. In RESQUE, we use GeometryFactory and Geometry objects to store the spatial data. Then we use spatial operations APIs (i.e. geos::geom::Geometry:: intersects (const Geometry *g) const and geos::geom::Geometry:: crosses (const Geometry *g) const) to do the spatial operations. At last, we write the output to the data stream (the query pipelines). Then Hive will read the output from data stream (the query pipelines) and print the output in Hive Shell or continue to do any further operations.

RESQUE use spatialindex to catch the exception in each spatial operation.

will get the following output for the above st_intersects query:

  • ……
  • Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
  • 2013-09-08 01:23:18,838 Stage-1 map = 0%,  reduce = 0%
  • 2013-09-08 01:23:24,889 Stage-1 map = 100%,  reduce = 0%
  • 2013-09-08 01:23:33,954 Stage-1 map = 100%,  reduce = 100%
  • Ended Job = job_201309080121_0001
  • MapReduce Jobs Launched:
  • Job 0: Map: 2  Reduce: 1   HDFS Read: 187984 HDFS Write: 40 SUCCESS
  • Total MapReduce CPU Time Spent: 0 msec
  • OK
  • 1             1
  • 1             2
  • 1             3
  • 1             4
  • 15           87
  • 34           78
  • 54           74
  • 61           54
  • Time taken: 28.207 seconds

Changes in Hive

We have tried to make minimum change to Hive to not to compromise the compatibility.

Changes are mostly at the language, and query optimization layer.

Lanague layer: Hive.g is changed to add data types and other spatial language support.

Parsing/Analyzing: Mostly the SemanticAnalyzer is changed (by adding functions) to generate an executable query plan.

Optimization: The generated query plan is optimized with a function which can produce optimal query plan according to the spatial predicate and table information.

The RESQUE library will be deployed as shared library, and a path to this library will be provided to hive to invoke functions in the library via RANSFORM mechanism.