This repository provides support for Bing tiles for Hive. Most of the functionality is a port of Presto's implementation, but there are differences described below.
Among other uses, Bing tiles can be used for efficient spatial joins in environments, like Hive, without built-in support for spatial joins or spatial indexing.
In environments like Hive without built-in support for spatial joins, we can use Bing tiles to make for more efficient joins.
As an example, let's say we have two tables: one table of Census geographies, with their WKT representations, and another of lat-long locations reported by IoT devices. We want to find the Census block group each report from a device occurred in. The natural, naive algorithm to do so computes the Cartesian product of the two tables, and then discards rows corresponding to the wrong block groups. Because Hive doesn't support spatial indexing, that's how it'll evaluate this query:
create table default.overlap as
select
io.*,
cg.geoid
from default.iot_observations io
cross join default.census_geo cg
where
cg.level = 'blockgroup' and
ST_Contains(ST_GeomFromText(cg.wkt),
ST_Point(io.longitude, io.latitude));
Of course, this is impractically slow for even moderately large datasets. In place of the quadtree and R-tree indexes we'd typically use to speed it up, which Hive lacks, we can use a grid-based index via Bing tiles:
create table default.overlap as
with
cg as (
select
cgi.geoid,
cgi.level,
ST_GeomFromText(cgi.wkt) as geom,
t.tile
from default.census_geo cgi
lateral view explode
(BT_TilesCoveringGeometry(ST_Envelope(ST_GeomFromText(cgi.wkt)), 10)) t as tile
),
iot as (
select
ioti.obs_id,
ST_Point(cast(ioti.latitude as double), cast(ioti.longitude as double)) as geom,
BT_FromLatLon(cast(ioti.latitude as double), cast(ioti.longitude as double), 10) as tile
from default.iot_observations ioti
where
-- bing tiles don't work very close to the poles, but that's usually fine
abs(cast(ioti.latitude as double)) <= 85
)
select
iot.obs_id,
cg.geoid,
cg.level
from iot
inner join cg on cg.tile = iot.tile
where
ST_Contains(cg.geom, iot.geom);
Joining on the tile reduces the number of rows to compare enough to let the query finish in a reasonable amount of time. (We could follow a similar grid-based strategy with the standard ST_Bin
and ST_BinEnvelope
functions, but the Bing tiles make it much easier.)
See also this post from the Presto team.
Differences from the Presto implementation include:
- Hive doesn't have user-defined types, so there's no Bing tile type. Tiles are instead stored as their quadkeys, in a text field.
- Different names for the functions. They're prefixed with
BT_
to mimic theST_
prefix of the OGC geospatial functions. In particular, thebing_tile_coordinates
function from Presto has been split intoBT_GetX
andBT_GetY
, and thebing_tile
constructor from coordinates has been renamed toBT_FromCoordinates
.
Download the most recent released jar file from the releases page, or for the development version, build the jar from this repo in the usual maven way.
You can either put the jar on the Java classpath by hand, or add it to Hive:
add jar /path/to/jar/dir/bing-tile-hive-X.Y.jar;
Finally, run the appropriate CREATE FUNCTION
statements to add the classes in the jar as UDFs. A script to create temporary (i.e., session-specific) functions is in this repo under /sql/install.sql
.