Hadoop-GIS Design Documents

Overview

Hadoop-GIS is a scalable, high-performance spatial data warehousing system for running large-scale spatial queries on Hadoop.

Hadoop-GIS supports multiple types of spatial queries on MapReduce through space partitioning, a customizable spatial query engine (RESQUE), implicitly parallel spatial query execution on MapReduce, and effective methods for amending query results by handling boundary objects. It takes advantage of global partition indexing and customizable on-demand local spatial indexing to achieve efficient query processing. Hadoop-GIS is integrated into Hive to support declarative spatial queries.

Hadoop-GIS comes with two deliverables: 

  • Hadoop-GIS library: a set of libraries for performing MapReduce-based spatial queries, which can be called from user applications.
  • HiveSP: an integration of Hadoop-GIS with Hive that supports both structured queries and spatial queries through a unified query language and interface.

Data Types

We add three data types to Hive: Point, Line, and Polygon. To do so, we modified the data type declaration in the schema definition, so that tables can be created with spatial-type columns. For example, the following statement creates a table with a Point column:

CREATE TABLE IF NOT EXISTS point_table ( tile_id STRING, d_id STRING, rec_id STRING, outline POINT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' ;

With these three spatial data types, queries on spatial data can be written alongside standard HQL queries.

Query Pipelines

We created query pipelines between Hive and RESQUE, our customizable spatial query engine, and modified Hive's source code (e.g. SemanticAnalyzer.java) to support them. Hive delivers the spatial predicate (e.g. st_intersects or st_equals) to RESQUE. RESQUE then evaluates the spatial query (e.g. finding intersections or testing equality) and returns the results to Hive.
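The pipeline idea can be illustrated with a minimal Python sketch. It is not RESQUE itself: the tab-separated row layout (left id, left geometry, right id, right geometry), the encoding of a geometry as an axis-aligned rectangle "xmin,ymin,xmax,ymax", and the names run_pipeline, parse_rect, and PREDICATES are all assumptions made for illustration. The sketch only shows the flow described above: a predicate name is handed over, rows are read from an input stream, and matching pairs are written to an output stream.

```python
import io

def rect_intersects(a, b):
    """True if two axis-aligned rectangles (xmin, ymin, xmax, ymax) share a point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

# Hypothetical predicate table; rectangles stand in for real geometries.
PREDICATES = {
    "st_intersects": rect_intersects,
    "st_equals": lambda a, b: a == b,
}

def parse_rect(text):
    """Parse 'xmin,ymin,xmax,ymax' into a tuple of floats (assumed encoding)."""
    return tuple(float(v) for v in text.split(","))

def run_pipeline(predicate_name, in_stream, out_stream):
    """Read candidate pairs from in_stream, write the ids of matching pairs."""
    predicate = PREDICATES[predicate_name]
    for line in in_stream:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        left_id, left_geom, right_id, right_geom = line.split("\t")
        if predicate(parse_rect(left_geom), parse_rect(right_geom)):
            out_stream.write(f"{left_id}\t{right_id}\n")

# Usage: two candidate pairs, only the first one intersects.
rows = io.StringIO("a\t0,0,2,2\tb\t1,1,3,3\nc\t0,0,1,1\td\t5,5,6,6\n")
result = io.StringIO()
run_pipeline("st_intersects", rows, result)
print(result.getvalue(), end="")
```

In a real deployment the two streams would be the process's standard input and output, which is how Hive exchanges tab-separated rows with an external program.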

Ten Spatial Predicates

Hadoop-GIS currently supports the following 10 spatial predicates:

  • st_intersects
  • st_touches
  • st_crosses
  • st_contains
  • st_adjacent
  • st_disjoint
  • st_equals
  • st_dwithin
  • st_within
  • st_overlaps

Spatial queries can be used in the Hive shell just like standard HQL. For example, to find the pairs of intersecting objects in two spatial tables (say tc and td), use the following HQL statement:

SELECT tc.rec_id, td.rec_id FROM tc JOIN td ON (st_intersects(tc.outline, td.outline) = TRUE);

Spatial Range Query

Spatial range queries are implemented as index-based HDFS file filtering. We first use one of the 10 predicates above, st_contains, to filter the input spatial data, and then pass the filtered data on to more complex computation. Filtering by spatial range first can save a large amount of computation time.
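A minimal Python sketch of this two-step filtering follows. It is an illustration under stated assumptions, not the Hadoop-GIS implementation: the partition index is modeled as a dictionary from a tile id to the tile's bounding box, each tile's records stand in for one HDFS file, every geometry is reduced to an axis-aligned rectangle (xmin, ymin, xmax, ymax), and each record is assumed to be stored in exactly one tile. The names range_query, tile_index, and tile_records are hypothetical.

```python
def box_intersects(a, b):
    """True if two rectangles (xmin, ymin, xmax, ymax) share a point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def box_contains(outer, inner):
    """True if rectangle `inner` lies entirely inside rectangle `outer`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])

def range_query(window, tile_index, tile_records):
    """Return (tiles that had to be read, ids of records inside the window)."""
    # Step 1: use the partition index to skip tiles that cannot contain a hit.
    candidate_tiles = [tile_id for tile_id, box in tile_index.items()
                       if box_intersects(box, window)]
    # Step 2: apply the containment test only to records in the surviving tiles.
    hits = []
    for tile_id in candidate_tiles:
        for rec_id, box in tile_records[tile_id]:
            if box_contains(window, box):
                hits.append(rec_id)
    return candidate_tiles, hits

# Usage: three tiles, a query window that touches only the first one.
tile_index = {"t0": (0, 0, 10, 10), "t1": (10, 0, 20, 10), "t2": (0, 10, 10, 20)}
tile_records = {
    "t0": [("a", (1, 1, 2, 2)), ("b", (8, 8, 9, 9))],
    "t1": [("c", (11, 1, 12, 2))],
    "t2": [("d", (1, 11, 2, 12))],
}
tiles_read, matches = range_query((0, 0, 5, 5), tile_index, tile_records)
print(tiles_read, matches)
```

The saving comes from step 1: tiles t1 and t2 are never opened, so their records are never tested against the window.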

Nearest Neighbor Queries

We use nearest neighbor queries (a distance function between spatial objects) to cluster the data into partitions. Spatial queries (e.g. st_intersects or st_contains) can then be run efficiently within each partition. For example, an st_intersects query only needs to compare objects in the same partition; objects in two different partitions do not need to be compared.
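The benefit of comparing only objects in the same partition can be sketched in Python. Note the substitutions: this sketch uses a fixed grid rather than distance-based clustering to form partitions, models geometries as axis-aligned rectangles (xmin, ymin, xmax, ymax), and copies an object that spans several tiles into each of them, removing duplicate result pairs at the end. These choices, and the names tiles_for, partition, and partitioned_join, are assumptions for illustration only.

```python
from collections import defaultdict
from itertools import product

def box_intersects(a, b):
    """True if two rectangles (xmin, ymin, xmax, ymax) share a point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def tiles_for(box, tile_size):
    """All grid tiles (tx, ty) that a rectangle overlaps."""
    x0, y0 = int(box[0] // tile_size), int(box[1] // tile_size)
    x1, y1 = int(box[2] // tile_size), int(box[3] // tile_size)
    return [(tx, ty) for tx in range(x0, x1 + 1) for ty in range(y0, y1 + 1)]

def partition(records, tile_size):
    """Group (rec_id, box) records by tile; boundary objects land in several tiles."""
    tiles = defaultdict(list)
    for rec_id, box in records:
        for tile in tiles_for(box, tile_size):
            tiles[tile].append((rec_id, box))
    return tiles

def partitioned_join(left, right, tile_size):
    """Intersection join that only compares objects sharing a tile."""
    left_tiles = partition(left, tile_size)
    right_tiles = partition(right, tile_size)
    pairs = set()  # a set removes duplicates produced by boundary objects
    for tile, left_recs in left_tiles.items():
        for (lid, lbox), (rid, rbox) in product(left_recs, right_tiles.get(tile, [])):
            if box_intersects(lbox, rbox):
                pairs.add((lid, rid))
    return sorted(pairs)

# Usage: c2 straddles four tiles but still yields a single result pair.
tc = [("c1", (1, 1, 3, 3)), ("c2", (9, 9, 11, 11))]
td = [("d1", (2, 2, 4, 4)), ("d2", (10, 10, 12, 12)), ("d3", (20, 20, 21, 21))]
print(partitioned_join(tc, td, tile_size=10))
```

Because two intersecting rectangles always share at least one tile, restricting comparisons to co-located objects does not lose results, while d3 is never compared with anything.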

Dependent Libraries

Our customizable spatial query engine RESQUE uses two dependent libraries: GEOS and spatialindex.

RESQUE uses the C++ interface of geos-3.3.8 to create spatial objects and perform spatial operations. In RESQUE, we use GeometryFactory and Geometry objects to store the spatial data, and the spatial operation APIs (e.g. geos::geom::Geometry::intersects(const Geometry *g) const and geos::geom::Geometry::crosses(const Geometry *g) const) to perform the operations. Finally, we write the output to the data stream (the query pipeline). Hive then reads the output from the data stream and either prints it in the Hive shell or continues with further operations.

RESQUE uses spatialindex to catch the exceptions raised in each spatial operation.

Changes in Hive

We have tried to keep changes to Hive to a minimum so as not to compromise compatibility.
Changes are mostly at the language and query optimization layers.

Language layer: Hive.g is changed to add the data types and other spatial language support.
Parsing/Analyzing: Mostly the SemanticAnalyzer is changed (by adding functions) to generate an executable query plan.
Optimization: The generated query plan is optimized by a function that produces an optimal query plan according to the spatial predicate and table information.

The RESQUE library will be deployed as a shared library, and a path to this library will be provided to Hive so that it can invoke functions in the library via the TRANSFORM mechanism.
