Introduction

GeoKV is a persistent key value store for data with a latitude/longitude that allows you to efficiently read/write this data and that is optimized for the efficient processing of this data under the assumption that at any time you access data in very specific regions.

The GeoKV also allows for querying using bounding boxes, polygons or circles with a radius. This particular feature is of course backed by my other GitHub project: geotools.

Why?

I’ve done a lot of processing of data related to places in the past few years. Generally what happens is that I get a bunch of data in either xml or json format with latitude/longitude and other meta data. Then I proceed to parse and process the data with some scripts. The size of the data typically ranges in the hundreds of MB or a few GB, compressed. Typically too much to keep it all in memory but not enough to waste time with elaborate hadoop based setups.

Typically the processing involves parsing, doing some processing and then writing. Mostly I have done this processing in a streaming fashion, which is kind of restrictive. A lot of the processing that I really would like to do would involve reading data about a place, examining relevant data that is nearby and then writing one or more results.

With data sets that don’t quite fit in memory and an access pattern that is geographically localized, what I really would like is a persistent key value store that deals with disk access in such a way that I can spend most of the time doing efficient in memory lookups/writes and the occasional disk access to read/write entire regions back to disk. Also, a lot of this data compresses really well. So, why waste the bytes and treat compression as an afterthought?

That, in a nutshell, is what GeoKV is about. It groups geospatial data by geohashes with a configurable length and simply reads/writes these blocks of data as a whole. The assumption here is that the data in a single geohash easily fits into memory, which is true for most datasets. The most ridiculous density you are likely to encounter would be in gigantic metropoles in e.g. China or South America where millions of people live in big cities and you might find tens of thousands of data points per square kilometer. Nothing you can’t handle with a few GB of memory.

Usage

The java API is very similar to the Map API, with the difference that the put operation requires a latitude and longitude in addition to the key. Keys are always Strings, Values can be any type. Since we have to persist, you have to tell it how to do that by providing a ValueProcessor with a serialize and parse method.

You can put, get, and remove values as well as iterate over the entire collection. Also, several filter operations are provided that allow you to iterate over values inside a bounding box, polygon or circle. Because you iterate, you don’t actually have to have the entire result set in memory.

How does it work?

GeoKV divides the world into rectangles (a bucket) and each bucket has its own gzip file on disk with all the content for the area covered by the associated rectangle. These bucket rectangles are calculated from the latitude and longitude of each entry using a geohash with a configurable length. Geohashes are a nice way to encode coordinates that have the useful property that nearby points have the same geohash prefix. For example, a geohash with length 7 covers about a city block and all geohashes of have a rectangular shape.

Each bucket file is a gzip file of lines with tab separated key, geohash, and serialized value. The name and directory of the file inside the datadir are based on the geohash prefix associated with the bucket.

The store loads these files as needed and uses a guava LRU cache to keep recently accessed buckets in memory. So, if you get and put entries in the same area, the buckets are mostly in memory. When buckets drop out of the cache, the data gets written to disk.

When iterating or querying the store, things are always returned bucket by bucket to prevent needless disk access. To prevent wasting memory during such operations, an Iterable is returned instead of a collection.

Caveats

When randomly accessing entries, performance may suffer because loading and unloading buckets is expensive. This may be an issue when ingesting large amounts of unsorted content.
GeoKV is not a transactional store nor does it do atomic writes. Instead it is an in memory store that occasionally accesses the disk and in fact tries to not do that too often in order to keep things fast.
GeoKV provides no protection against data corruption. So, if you run out of memory or experience some other failure, you may lose data and/or end up with a corrupted store. Should this happen, you may be able to recover some of the data by checking the gz files and regenerating ids.gz with a script (not provided).
Most IO exceptions are re-thrown as IllegalStateExceptions and no attempt is made to recover. That keeps the code small and simple, which is nice, but it has certain implications for robustness and reliability.
GeoKV should be thread safe. However, I’ve not done any elaborate testing for this and there may be issues here. The put and remove operations contain synchronized blocks and the various data structures used tend to be suitable for concurrent usage.
GeoKV could be used as a server but is not designed for that usage currently. Instead it is intended to be used for processing of geo spatial data in a way that makes efficient usage of memory and disk. These useful properties also make it of interest for server usage but more work would be needed to turn it into a proper server product.
You should not change the bucket size after you have picked one and written some data. GeoKV does not support changing this currently.
Given the above, you might want to back your data up regularly.
Given the above, you should probably not run this in a server side environment and certainly not in a production environment. If you do so anyway, I’d be happy to accept patches and bug reports and work with you to improve things.

Consider yourself warned. Otherwise it is all good :-).

Installation

It’s a maven project. So, checking it out and doing a mvn clean install should do the trick. You will need my geotools project as well, which you can find here: https://github.com/jillesvangurp/geotools.

Alternatively, you can exercise your rights under the license and simply copy and adapt. The license allows you to do this and I have no problems with this.

If you like to use pre-built jars, you can utilize my private maven repository as explained here

Should anyone like this licensed differently, please contact me.

If anyone wants to fix stuff just send me a pull request.

Changelog

1.2
- Migrate to geogeometry 2.0
- Fix deadlock issue with bucket locking
1.1
- Add better bucket locking to prevent reads overlapping with writes, which may lead to gzip related exceptions.
1.0
- First release of GeoKV

jillesvangurp / geokv