[Proposal] Geospatial support in Pinot

(Noted: the current doc is copied from this google doc, you can request access to the doc to see changes/discussions. Some of the content is meant to be changed/update, and we will keep improving it until it is accepted)

Motivation

Geospatial data has been widely used across industry, spanning multiple verticals, such as ride-sharing and delivery, transportation infrastructure, defense and intel, public health. Deriving insights from timely and accurate geospatial data could enable mission-critical use cases in the organizations and fuel a vibrant marketplace across industry. In this document, we discuss the challenges of analyzing geospatial at scale and propose the geospatial support in Pinot.

To better illustrate this, we show a few queries with geospatial features and how it derives important business insights.

The following query returns how many trips occur within a given area on a specific date. ST_Point is the standard geo function to construct a two-dimensional point using longitude and latitude, and ST_Contains is the standard geo function to compute whether a point is within a geo shape.

 SELECT c.city_id, count(*)FROM trips_table as tWHERE ST_Contains(\$geo_shape, ST_Point(t.dest_lng, t.dest_lat))WHERE datestr = \$date

Note that for a real city, it is not uncommon to see its geofence composed of hundreds or thousands of points. The time cost of executing ST_CONTAINS for one pair of point and geofence is proportional to the number of points in the geofence.

The following query answers the question “What streets does Atlantic Commons join with?” using the tables of all New York streets.

 SELECT nameFROM nyc_streetsWHERE ST_DWithin(  geom,  ST_GeomFromText('LINESTRING(586782 4504202,586864 4504216)', 26918),  0.1);

where the function ST_DWithin returns true if the geometries are within the specified distance of one another.

Challenges

Complex data types

Unlike other built-in types which are primitive, Geospatial features require a hierarchy of types to abstract and encapsulate spatial structures, such as points, lines and polygons. For example, OGC defines an object model for simple feature geometry that is commonly used by various GIS systems as shown in Figure 1.

Figure 1: geometry type hierarchy by OGC

Moreover, the data types are further complicated by geographics handling. Geometry uses Cartesian coordinates to represent a linear distance from an origin as plotted on a plane, while geographical (or spherical) coordinates describe angular coordinates (i.e. longitude, latitude) on a globe. As a result, the functionality of the two data types is very different, as reflected in the calculation of the distance between points far apart (.e.g distance between Berlin and Los Angeles explained in this article).

Many geospatial formats

There are many different geospatial formats established over the past decades:

The different formats adds the complexity of emitting and consuming geometries in a large number of formats for various applications and use cases.

Indexing large spatial datasets

Indexes are important for efficient records retrieval on large datasets. Without indexing, any search would require a sequential scan of every record. Given the complex data types and relationship operation, building effective and efficient geospatial indices has been a known challenge for many GIS databases. Several optimization algorithms have been developed in the past, such as Quadtree, Geohash and Grid. In particular, indexing geospatial data with grid systems has been a common approach for dealing with very large scales in recent years, such as Uber’s hexagonal grid H3, and Google’s spherical grid S2. One powerful feature provided by the Grid indexing is the flexibility of trading between accuracy and performance using the approximations. Implementers can decide between grids fixed to a single accuracy which can be somewhat lossy yet more performant or grids with multiple accuracies which can be less performant but mitigate against losses.

Scope

This proposal has the following in the scope

• Introduce the Geometry and Geography data types
• Define a collection of common geospatial functions such as ST_Length, ST_Equals defined in Simple Feature Access.
• Define type conversion functions for a few Geometry formats such as WKT and WKB
• Introduce an H3-grid based indexing for geospatial data
• Define a collection of geospatial functions on H3 index

The followings are not in the scope

• The support of comprehensive geospatial formats. However, the addition of new formats should not be disruptive to the implementation in the scope
• Other geospatial indexing methods than H3 Grid index. However, the pluggable indexing framework shall allow the addition of other indexing.
• Spatial join, which requires the JOIN support first

Data Type and Data Model

Types

There are several options to add the geospatial support in Pinot’s data type system.

1. The full Geometry type hierarchy. A full Geometry type hierarchy will provide a native and detailed geo-type support in Pinot’s data model, so that the geo functions can return specific types (e.g. ST_Point returns Point type) and strong type inference/analysis can be performed on the query. However, it demands significant changes to data types and downstream modules (e.g. reader/writer, functions, operators)
2. A generic type Geometry to data types. The geometry value has its internal type encoding of the specific subclasses (e.g. Point, Line, Polygon) but all geo functions operate on the generic Geometry type. This approach introduces only one more data type, so that geo functions/operators can have some type validation and analysis.
3. Reuse the Bytes type but not add any new type. This approach keeps the existing data model as it is and therefore requires less engineering efforts. However, all geo functions take generic bytes as the input and output, and therefore cannot perform static time type validation. Moreover, geometry de/serialization from/to bytes is required in geo functions’ input and output, and therefore it’s less performant when there are nested geo functions (e.g. ST_Distance(ST_POINT(-118.4079, 33.9434),ST_POINT(2.5559,49.0083)) )

The proposal in this doc is to implement #3 in the short term for its simplicity. In the longer-term, #2 is preferred when pluggable data type is available in Pinot, in which the Geometry type can be added as a plugin.

Because approach #3 uses raw bytes as the type, the value can be implementation-specific. It might be worth taking additional annotation in the field spec for the additional type information, such as geometry vs geography.

Values

There are two aspects to consider for the geospatial values in Pinot, the geo representation for computing in Pinot, and its serialization to bytes.

Geometry objects

On Github, there are several open-source Geospatial libraries available

Among them, a common choice for Geometry type is JTS, which is a topology suite for Euclidean planar linear geometry. JTS is used by multiple projects, in particular, PostGIS/Presto/GeoSpark in the big data space.

In the meanwhile, ESRI is a common choice for geography objects, and it has been employed by a number of big data projects such as Hive, Cassandra, HBase, and Storm. Though ESRI also supports OGC geometry standard, it has been called out for less performant and weaker ISO standard conformation.

As a result, we’ll choose JTS for representing the Geometry objects, and ESRI for representing Geography objects. Those objects will be used by the geometry and geography functions respectively.

Serialization

The geo object needs to be serialized and deserialized into bytes for both storage and function evaluation. Though there is an OGC standard WKB to define a common binary format for geometry data transferring and storage, it has the drawback of inefficient memory access and space usage. There have been efforts in the format improvement, such as TWKB. However, TWKB adoption is not wide enough, and in particular JTS has not supported it yet

In the meanwhile, many other big data frameworks have chosen to implement their own serialization format for efficiency. For example, Presto uses Slice for efficient encoding and GeoSpark chooses Kryo. It’s important to note that the lack of common standards creates challenges for deep integration with Presto for geo functions push-down to Pinot.

One option of serialization is to use the same encoding as Presto does. This is not ideal as Presto’s encoding is not a standard format and therefore it does not guarantee the implementation won’t change. Another option is to use a standard format (e.g. WKB) in the Presto connector with the cost of the additional format conversion.

The proposal is to implement a customized serialization format with an efficient byte encoding. The leading byte denotes the geo type, and it is followed by the type-specific bytes.

Ingestion

Another important consideration is the data transformation during the data ingestion. It would have a strong dependency on the upcoming column transform function feature (design proposal), as the raw message to ingest typically have the geo data in other forms such as WKT, or latitude/longitude pairs.

For example, the schema below shows the “meetupRsvp” quickstart example. And all extended columns are highlighted yellow.

 {  "metricFieldSpecs": [    {      "dataType": "INT",      "name": "rsvp_count"    }  ],  "dimensionFieldSpecs": [    {      "dataType": "STRING",      "name": "venue_name"    },    {      "dataType": "STRING",      "name": "event_name"    },    {      "dataType": "STRING",      "name": "event_id"    },    {      "dataType": "LONG",      "name": "event_time"    },    {      "dataType": "STRING",      "name": "group_city"    },    {      "dataType": "STRING",      "name": "group_country"    },    {      "dataType": "LONG",      "name": "group_id"    },    {      "dataType": "STRING",      "name": "group_name"    },    {      "dataType": "DOUBLE",      "name":"group_lat"    },    {      "dataType": "DOUBLE",      "name":"group_lon"    },    {      "dataType": "BYTES",      "name":"group_location",      "transformFunction": "Groovy({stPoint(group_lat,group_lon)}, group_lat, group_lon)"    }      ],  "timeFieldSpec": {    "incomingGranularitySpec": {      "timeType": "MILLISECONDS",      "dataType": "LONG",      "name": "mtime"    }  },  "schemaName": "meetupRsvp"}

Two more columns of location latitude (“group_lat”) and longitude (“group_lon”) are fetched from the Meetup RSVP Stream. In addition, one geolocation column (“group_location”) is created via the transform function that creates a Point from the latitude and longitude via the built-in function stPoint (which is also used for implementing the ST_POINT function).

Geospatial Functions and Operators

A subset of geospatial functions are planned initially to be implemented in this proposal. The functions can be categorized into the following. Note:

• The specific geometry/geography types are annotated but in the implementation they are declared as bytes in the metadata.
• All functions that conform to the SQL/MM 3 standard are annotated with [SQL/MM]

Also note the list of supported functions can grow as the implementation progresses.

Aggregations

ST_Union(geometry[] g1_array) → Geometry [SQL/MM]

This aggregate function returns a MULTI geometry or NON-MULTI geometry from a set of geometries. it ignores NULL geometries.

Constructors

ST_GeomFromText(String wkt) → Geometry [SQL/MM]

ST_GeomFromText(string wkt, int srid) → Geometry [SQL/MM]

Returns a geometry type object from WKT representation, with the optional spatial system reference.

ST_GeomFromWKB(bytes wkb) → Geometry [SQL/MM]

ST_GeomFromWKB(bytes wkb, int srid) → Geometry [SQL/MM]

Returns a geometry type object from WKB representation.

ST_Point(double x, double y) → Point [SQL/MM]

Returns a geometry type point object with the given coordinate values.

ST_Polygon(String wkt) → Polygon [SQL/MM]

Returns a geometry type polygon object from WKT representation.

ST_GeogFromWKB(bytes wkb) → Geography

Creates a geography instance from a Well-Known Binary geometry representation (WKB)

ST_GeogFromText(String wkt) → Geography

Return a specified geography value from Well-Known Text representation or extended (WKT).

Measurements

ST_Area(Geometry/Geography g) → double [SQL/MM]

For geometry type, it returns the 2D Euclidean area of a geometry. For geography, returns the area of a polygon or multi-polygon in square meters using a spherical model for Earth.

ST_Distance(Geometry/Geography g1, Geometry/Geography g2) → double [SQL/MM]

For geometry type, returns the 2-dimensional cartesian minimum distance (based on spatial ref) between two geometries in projected units. For geography, returns the great-circle distance in meters between two SphericalGeography points. Note that g1, g2 shall have the same type.

ST_GeometryType(Geometry g) → String   [SQL/MM]

Returns the type of the geometry as a string. EG: 'ST_Linestring', 'ST_Polygon','ST_MultiPolygon' etc.

Outputs

ST_AsBinary(Geometry/Geography g) → bytes [SQL/MM]

Returns the WKB representation of the geometry.

ST_AsText(Geometry/Geography g) → string [SQL/MM]

Returns the WKT representation of the geometry/geography.

Relationship

ST_Contains(Geometry, Geometry) → boolean [SQL/MM]

Returns true if and only if no points of the second geometry lie in the exterior of the first geometry, and at least one point of the interior of the first geometry lies in the interior of the second geometry.

ST_Equals(Geometry, Geometry) → boolean [SQL/MM]

Returns true if the given geometries represent the same geometry.

ST_Within(Geometry, Geometry) → boolean [SQL/MM]

Returns true if first geometry is completely inside second geometry.

Geospatial Indexing

TBD. Geospatial indexing is an optimization for certain geospatial join and selection with functions like ST_Contains, ST_Distance. The indexing design will be planned after the basic data model and geo functions support are in place, as well as some performance data points are collected via benchmark. Moreover, it also relates to the design of the Join support in Pinot.

• No labels