JavaScript GeoArrow Module Proposal

kylebarron opened this issue 8 months ago

The strength of Arrow is in its interoperability, and therefore I think it's worthwhile to discuss how to ensure all the pieces around GeoArrow in JavaScript fit together really well.

This is a corollary to the Python GeoArrow Module Proposal but focused on GeoArrow interoperability in JavaScript and WebAssembly. I don't know anyone doing GeoArrow-Wasm stuff in C, so this will focus on my efforts in Rust and TypeScript. Unlike in Python, there aren't other people currently working on JavaScript GeoArrow infrastructure, so this is a manifesto to solidify my ideas.

WebAssembly limitations

WebAssembly is sandboxed, which means that Wasm code can only access and modify memory within its own memory space. So Wasm code cannot access JavaScript objects directly.

This also means that two Wasm modules can't share memory. So if you have one Wasm-based NPM library that loads GeoParquet to GeoArrow and another Wasm-based NPM library that implements spatial operations on GeoArrow, there must be a copy from the first module's memory space into JavaScript and then into the second module's memory space.
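As a rough illustration of that double hop, here is a minimal sketch using two standalone `WebAssembly.Memory` objects to stand in for the linear memories of two separately published modules. The names `readerMemory`, `opsMemory`, and `copyBetweenModules`, and the byte offsets, are hypothetical; a real module would hand out offsets from its own allocator.

```typescript
// Moves bytes from one module's linear memory to another's, mirroring the
// two steps described above: out of module A into JS, then into module B.
function copyBetweenModules(
  src: WebAssembly.Memory,
  srcOffset: number,
  byteLength: number,
  dst: WebAssembly.Memory,
  dstOffset: number,
): void {
  // Copy 1: slice() allocates a JS-owned buffer and copies the bytes out.
  const staged = new Uint8Array(src.buffer, srcOffset, byteLength).slice();
  // Copy 2: set() copies the staged bytes into the second module's memory.
  new Uint8Array(dst.buffer, dstOffset, byteLength).set(staged);
}

// Hypothetical stand-ins for a reader module and an operations module.
const readerMemory = new WebAssembly.Memory({ initial: 1 });
const opsMemory = new WebAssembly.Memory({ initial: 1 });

// Pretend the reader module produced four bytes at offset 0.
new Uint8Array(readerMemory.buffer).set([1, 2, 3, 4], 0);
copyBetweenModules(readerMemory, 0, 4, opsMemory, 128);
```

Neither module can reach the other's buffer directly, so JavaScript is the only place where both memories are visible at once.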

This means that grouping Wasm functionality together into a single module is more performant, as I/O and operations can be done in a single memory space. This runs up against bundle size: JavaScript bundlers are able to tree-shake JavaScript code, but they can't tree-shake a prebuilt Wasm binary. Instead, the original Rust would have to be recompiled, excluding unwanted functions.

The solution I'm gravitating towards is to have a variety of NPM libraries, described in this document, where I/O and operations are distributed both as their own libraries and in a "kitchen sink" build, which contains everything at the cost of a larger bundle size. Advanced users can compile custom Wasm binaries from the Rust source, with only the desired functionality.

Goals

Similar goals to the Python module proposal:

Modular: the user can install what they need and choose which dependencies they want, with the goal of somewhat fine-tuned control of bundle size.
Interoperable: the user can use WebAssembly-based and pure-JavaScript GeoArrow libraries together smoothly.
Extensible: future developers can develop on top of geoarrow-wasm and largely reuse its JS bindings without having to create ones from scratch.
Strongly typed: a method like convex_hull should always return a PolygonArray instead of a generic GeometryArray that the user can't "see into" statically.
Static typing: Full typing support and IDE autocompletion.
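To make the typing goals concrete, here is a small pure-TypeScript sketch of what a precise return type buys. `MultiPointArray`, `PolygonArray`, and `convexHull` are hypothetical stand-ins written for this example, not the geoarrow-wasm API, and the hull is computed with Andrew's monotone chain only so the example does real work.

```typescript
type Coord = readonly [number, number];

// Hypothetical array types; the `kind` tag keeps them distinct to the type checker.
class MultiPointArray {
  readonly kind = "multipoint" as const;
  constructor(readonly geoms: readonly Coord[][]) {}
}

class PolygonArray {
  readonly kind = "polygon" as const;
  // One exterior ring per geometry; rings are left unclosed for brevity.
  constructor(readonly exteriors: readonly Coord[][]) {}
}

// Positive when o -> a -> b turns counter-clockwise.
function cross(o: Coord, a: Coord, b: Coord): number {
  return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0]);
}

// Andrew's monotone chain convex hull for a single multipoint.
function hullOf(points: readonly Coord[]): Coord[] {
  const pts = [...points].sort((p, q) => p[0] - q[0] || p[1] - q[1]);
  if (pts.length < 3) return pts;
  const build = (input: readonly Coord[]): Coord[] => {
    const chain: Coord[] = [];
    for (const p of input) {
      while (
        chain.length >= 2 &&
        cross(chain[chain.length - 2], chain[chain.length - 1], p) <= 0
      ) {
        chain.pop();
      }
      chain.push(p);
    }
    chain.pop(); // the last point is the first point of the other chain
    return chain;
  };
  return [...build(pts), ...build([...pts].reverse())];
}

// The signature states statically that hulls are polygons, not "some geometry".
function convexHull(input: MultiPointArray): PolygonArray {
  return new PolygonArray(input.geoms.map(hullOf));
}

const hulls = convexHull(
  new MultiPointArray([[[0, 0], [2, 0], [2, 2], [0, 2], [1, 1]]]),
);
// `hulls.exteriors` is available without any runtime type check or cast.
```

With a generic return type the caller would need a cast or a runtime check before touching `exteriors`; here the IDE can autocomplete it directly.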

Data Movement

In contrast to Python, which is able to share the same memory space with native code, data movement between Wasm and JS is not always free, because they occupy two separate memory spaces. JS can see into Wasm memory but not the opposite. This means that data movement from Wasm -> JS can be zero-copy, but JS -> Wasm requires a copy.
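The asymmetry can be sketched with a bare `WebAssembly.Memory`. This is a minimal illustration, assuming the byte offset `0` was handed out by the module's allocator (here it is simply made up); `copyIntoWasm` and `viewFromWasm` are hypothetical helper names.

```typescript
const memory = new WebAssembly.Memory({ initial: 1 }); // one 64 KiB page

// JS -> Wasm: the values have to be copied into the module's linear memory.
function copyIntoWasm(mem: WebAssembly.Memory, data: Float64Array, byteOffset: number): void {
  new Float64Array(mem.buffer, byteOffset, data.length).set(data);
}

// Wasm -> JS: a typed array over `mem.buffer` is a view, not a copy.
function viewFromWasm(mem: WebAssembly.Memory, byteOffset: number, length: number): Float64Array {
  return new Float64Array(mem.buffer, byteOffset, length);
}

const jsData = Float64Array.of(1.5, 2.5, 3.5);
copyIntoWasm(memory, jsData, 0);
const view = viewFromWasm(memory, 0, 3);

// Simulate the Wasm side mutating its own memory.
new Float64Array(memory.buffer, 0, 1)[0] = 42;
const seenThroughView = view[0]; // the view observes the change
const jsCopyUnchanged = jsData[0]; // the JS-side original does not

// Caveat: growing the memory detaches the old ArrayBuffer, so existing
// views become empty and must be re-created from the new `memory.buffer`.
memory.grow(1);
const lengthAfterGrow = view.length;
```

The caveat at the end is why zero-copy views are usually short-lived, or copied out once if the data must outlive further Wasm allocations.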

The easiest data movement in JS is to use Arrow IPC buffers to move serialized data between JS and Wasm, but this has a number of drawbacks:

Significant memory overhead: when constructing the IPC buffer, all Data chunks need to be copied into a new ArrayBuffer, a full copy of the dataset, before the copy into/out of Wasm.
All Data chunks in JS memory are references onto the same backing ArrayBuffer (from the original IPC buffer), which means a Data instance can't be transferred to a WebWorker without a copy.
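The second drawback can be seen with plain typed arrays. In this sketch a single `ArrayBuffer` stands in for a decoded IPC buffer and two `Float64Array` views stand in for the buffers of two Data chunks; `sharesBuffer` and `copyToOwnBuffer` are hypothetical helpers.

```typescript
// One backing buffer, as after decoding an IPC message.
const ipcBuffer = new ArrayBuffer(48);
const xs = new Float64Array(ipcBuffer, 0, 3);
const ys = new Float64Array(ipcBuffer, 24, 3);
xs.set([1, 2, 3]);
ys.set([10, 20, 30]);

function sharesBuffer(a: ArrayBufferView, b: ArrayBufferView): boolean {
  return a.buffer === b.buffer;
}

// slice() allocates a new, exactly sized ArrayBuffer and copies the values,
// so the result can be transferred without affecting any other view.
function copyToOwnBuffer(view: Float64Array): Float64Array {
  return view.slice();
}

const standaloneYs = copyToOwnBuffer(ys);
```

Transferring `ys.buffer` to a worker would move the whole 48-byte buffer and leave `xs` unusable, which is why the copy is needed first.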

The most performant data movement in JS is to directly view data from Wasm memory and conversely for JS to write array data directly into the Wasm memory space. I've been working on this in arrow-js-ffi and it's a crucial part of Arrow interoperability in Wasm. This solves both of the downsides of Arrow IPC, as it avoids an extra data copy and the Data instances in JS have a backing buffer not shared with any other Data.

Module hierarchy

Here's a quick (messy) picture of the dependency graph. An arrow points to the library it depends on, so here geoarrow-wasm depends on geoarrow-rs.

The most important part is that there are no dependency cycles.

Rust Core (non-Wasm)

geoarrow-rs is the Rust core with all core GeoArrow functionality. All algorithms, core I/O, etc. are implemented in this crate so that as much as possible can be shared among pure-Rust, JS, and Python.

This crate does not on its own have any JS bindings. All JS functionality is exported in separate crates/packages below.

Rust crate name: geoarrow

Arrow-Wasm Core

Shared Arrow definitions and FFI functionality to/from Arrow JS.

Rust crate name: arrow-wasm
JS package name: None? It's unclear whether this should even be published to NPM, as it's not useful on its own; it's useful as a building block for other libraries.
Dependencies:
- Only the arrow crate.
Defines common abstractions in Rust with JS-facing APIs for Table, Vector, Data, DataType.
Enables zero-copy (or one-copy, but serialization-free) interop with Arrow JS.

Computational library

Standalone library for spatial operations on GeoArrow arrays, without any I/O except for Arrow IPC and FFI. The slim compilation feature of geoarrow-wasm.

Rust crate name: geoarrow-wasm
JS package name: @geoarrow/geoarrow-wasm-slim
Dependencies:
- geoarrow-rs for computational algorithms to wrap for JS
- arrow-wasm for JS bindings for Arrow FFI with Arrow JS
- Other dependencies in the graph are only used with the full compilation feature, described below under "Kitchen Sink"
Algorithms to operate on GeoArrow memory
- All operations that have a pure-Rust core and can be compiled seamlessly to Wasm
- For now, includes all algorithms. Maybe in the future, we could have different NPM packages for different sets of libraries, but that sounds like a lot of work.

I/O Wasm libraries

There should exist standalone libraries with a minimal bundle size to read and write various file formats to/from GeoArrow.

`parquet-wasm`

Standalone library to read and write Parquet files in Wasm.

Rust crate name: parquet-wasm
JS package name: parquet-wasm
Dependencies:
- arrow-wasm for JS bindings for Arrow FFI with Arrow JS

`geoparquet-wasm`

Standalone library to read and write GeoParquet files in Wasm.

Rust crate name: geoparquet-wasm
JS package name: @geoarrow/geoparquet-wasm
Dependencies:
- parquet-wasm for JS bindings to read/write Parquet
- geoarrow-rs to encode/decode WKB geometries to/from GeoArrow
Functional API:
- readGeoParquet: wraps parquet-wasm's readParquet, converting WKB column to GeoArrow before returning an arrow-wasm Table instance
- writeGeoParquet: wraps parquet-wasm's writeParquet, converting GeoArrow in the Table input to WKB before passing on to writeParquet.
- readGeoParquetStream: wraps parquet-wasm's readParquetStream
- TODO: more async APIs
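The WKB-to-GeoArrow conversion mentioned for readGeoParquet can be illustrated for the simplest case, 2D points. This is only a sketch of the idea in TypeScript, not how geoarrow-rs implements it: it ignores nulls, Z/M dimensions, and every geometry type other than Point, and `wkbPointsToInterleaved` and `encodeWkbPoint` are hypothetical names.

```typescript
const WKB_POINT = 1; // geometry type code for Point in WKB

// Decodes WKB points into one flat [x0, y0, x1, y1, ...] buffer, which is the
// interleaved coordinate layout of a GeoArrow point array.
function wkbPointsToInterleaved(wkbs: readonly Uint8Array[]): Float64Array {
  const coords = new Float64Array(wkbs.length * 2);
  wkbs.forEach((wkb, i) => {
    const view = new DataView(wkb.buffer, wkb.byteOffset, wkb.byteLength);
    const littleEndian = view.getUint8(0) === 1; // byte 0: byte-order flag
    const geomType = view.getUint32(1, littleEndian); // bytes 1..4: type
    if (geomType !== WKB_POINT) {
      throw new Error(`expected WKB Point, got geometry type ${geomType}`);
    }
    coords[2 * i] = view.getFloat64(5, littleEndian); // bytes 5..12: x
    coords[2 * i + 1] = view.getFloat64(13, littleEndian); // bytes 13..20: y
  });
  return coords;
}

// Helper for the example: encodes a little-endian 2D WKB point (21 bytes).
function encodeWkbPoint(x: number, y: number): Uint8Array {
  const bytes = new Uint8Array(21);
  const view = new DataView(bytes.buffer);
  view.setUint8(0, 1);
  view.setUint32(1, WKB_POINT, true);
  view.setFloat64(5, x, true);
  view.setFloat64(13, y, true);
  return bytes;
}

const pointCoords = wkbPointsToInterleaved([
  encodeWkbPoint(1, 2),
  encodeWkbPoint(-3.5, 4.25),
]);
```

The result is a single contiguous buffer instead of one small byte array per row, which is what makes the GeoArrow representation cheap to hand across the Wasm boundary.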

`flatgeobuf-wasm`

Standalone library to read and write FlatGeobuf files in Wasm.

Rust crate name: flatgeobuf-wasm
JS package name: @geoarrow/flatgeobuf-wasm
Dependencies:
- arrow-wasm for JS bindings for Arrow FFI with Arrow JS
- geoarrow-rs to read/write FlatGeobuf to/from GeoArrow
Functional API:
- readFlatGeobuf: parses FlatGeobuf buffer, returning an arrow-wasm Table instance
- writeFlatGeobuf: creates a FlatGeobuf buffer from an arrow-wasm Table instance.
- Future: readFlatGeobufStream: generates an async iterable of arrow-wasm RecordBatch from a remote FlatGeobuf file
- Future: read data by bounding-box from a remote file

The kitchen sink

The full compilation feature of geoarrow-wasm.

Rust crate name: geoarrow-wasm
JS package name: @geoarrow/geoarrow-wasm
Dependencies:
- arrow-wasm for JS bindings for Arrow FFI with Arrow JS
- geoparquet-wasm for JS bindings for GeoParquet
- flatgeobuf-wasm for JS bindings for FlatGeobuf
- geoarrow-rs for algorithms
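The dependency lists above can be written down as a graph and checked mechanically for cycles. This is a minimal sketch: the edges are transcribed from the lists in this proposal (with the kitchen sink build of geoarrow-wasm), and `findCycle` is a hypothetical helper doing a depth-first search.

```typescript
// package/crate -> the packages/crates it depends on
const deps: Record<string, string[]> = {
  "arrow": [],
  "geoarrow-rs": [],
  "arrow-wasm": ["arrow"],
  "parquet-wasm": ["arrow-wasm"],
  "geoparquet-wasm": ["parquet-wasm", "geoarrow-rs"],
  "flatgeobuf-wasm": ["arrow-wasm", "geoarrow-rs"],
  "geoarrow-wasm": ["arrow-wasm", "geoparquet-wasm", "flatgeobuf-wasm", "geoarrow-rs"],
};

// Returns one cycle as a list of names (first === last), or null if none exists.
function findCycle(graph: Record<string, string[]>): string[] | null {
  const state = new Map<string, "visiting" | "done">();
  const path: string[] = [];
  const visit = (node: string): string[] | null => {
    if (state.get(node) === "done") return null;
    if (state.get(node) === "visiting") {
      // Reached a node that is still on the current path: that is a cycle.
      return [...path.slice(path.indexOf(node)), node];
    }
    state.set(node, "visiting");
    path.push(node);
    for (const dep of graph[node] ?? []) {
      const cycle = visit(dep);
      if (cycle) return cycle;
    }
    path.pop();
    state.set(node, "done");
    return null;
  };
  for (const node of Object.keys(graph)) {
    const cycle = visit(node);
    if (cycle) return cycle;
  }
  return null;
}

const proposalCycle = findCycle(deps);
```

A check like this could run in CI so that a new crate cannot quietly introduce a cycle.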

Pure JS Interop

This is designed to smoothly interop with pure-JavaScript Arrow libraries.

Arrow JS

The canonical implementation of Arrow in JS. It only supports IPC for data I/O.

Arrow JS FFI

A library to read/write Arrow data across the Wasm boundary. This interops with the core arrow-wasm crate above.

GeoArrow JS

A pure-JavaScript (TypeScript) implementation of GeoArrow. This uses the exact same memory layout as GeoArrow in Rust, so it should be possible to mix and match between pure-JS and wasm-based algorithms without changing data representations.
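Because the layout is just Arrow buffers, a pure-JS algorithm can work on the same arrays a Wasm module produces. Below is a minimal sketch for a GeoArrow LineString array with interleaved 2D coordinates: an offsets buffer marks where each line starts and ends in the flat coordinate buffer. `lineStringLengths` is a hypothetical function, not part of GeoArrow JS, and nulls are ignored.

```typescript
// geomOffsets has one more entry than there are geometries; line i uses the
// coordinates at indices geomOffsets[i] .. geomOffsets[i + 1] - 1.
// coords is interleaved: [x0, y0, x1, y1, ...].
function lineStringLengths(geomOffsets: Int32Array, coords: Float64Array): Float64Array {
  const lengths = new Float64Array(geomOffsets.length - 1);
  for (let i = 0; i < lengths.length; i++) {
    for (let j = geomOffsets[i] + 1; j < geomOffsets[i + 1]; j++) {
      const dx = coords[2 * j] - coords[2 * (j - 1)];
      const dy = coords[2 * j + 1] - coords[2 * (j - 1) + 1];
      lengths[i] += Math.hypot(dx, dy);
    }
  }
  return lengths;
}

// Two lines: (0,0)->(3,4) and (0,0)->(1,0)->(1,1).
const lineOffsets = Int32Array.of(0, 2, 5);
const lineCoords = Float64Array.of(0, 0, 3, 4, 0, 0, 1, 0, 1, 1);
const lineLengths = lineStringLengths(lineOffsets, lineCoords);
```

The same two buffers could equally be views into Wasm memory, which is the point of sharing the layout.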

Isaac Besora Vilardaga · Answer 1 · Wed Nov 29 2023 16:47:15 GMT+0800 (China Standard Time)

Stupid question but is implementing something like geoparquet-wasm and all its dependencies in pure JS out of the question? Wasm modules not being able to share memory and being more performant when bundled together seems to push you towards a single package approach but not being able to have tree shaking sounds like a big issue.

Kyle Barron · Answer 2 · Thu Nov 30 2023 00:21:25 GMT+0800 (China Standard Time)

👋 Hi @ibesora, thanks for chiming in

towards a single package approach but not being able to have tree shaking sounds like a big issue

Yes... but that's why I plan to publish all the above modules as standalone NPM packages. So if you only want the I/O, you can import just @geoarrow/geoparquet-wasm. If you want the I/O plus spatial operations, then you'd bring in @geoarrow/geoarrow-wasm (the kitchen sink build) to have both I/O and everything else in a single memory space.

Effectively it just makes you choose which sets of functionality you want when adding the dependency. It's a downside of WebAssembly, but I think there are more than enough upsides to still warrant the work.

Stupid question but is implementing something like geoparquet-wasm and all its dependencies in pure JS out of the question?

There are differences of opinion in the community on this topic, but my own opinion is that it's not the best use of engineering effort.

Parquet is super complex with extensive data types (e.g. recursive nested lists and structs), varied encodings (e.g. run length encoding, delta encoding), and an array of available compressions. As of when I wrote parquet-wasm a year and a half ago, all previous pure-JS Parquet attempts had been abandoned (1, 2). More recently, Ib Green in loaders.gl has been working on GeoParquet in pure JS, but I fear a very long tail of bugs.

With Wasm, I'm able to reuse rock-solid libraries. No one has ever made an issue in parquet-wasm with a Parquet file that failed to read, because the Rust implementation of Parquet is really solid. And on top of that, we get really good performance for free. When I tested against loaders.gl's implementation in April 2022, the Wasm version was 480x faster. That's not intended to be disrespectful to loaders.gl and Ib's efforts... just that this is really, really hard!

Of course you can write a really efficient JS GeoParquet library with enough engineering resources, but I'm trying to bootstrap an ecosystem of GeoArrow with just myself and mostly in free time. And by putting as much as possible in Rust, we can reuse the exact same core code in Wasm and in Python, for free.

Nicholas Roberts · Answer 3 · Fri Feb 16 2024 22:28:40 GMT+0800 (China Standard Time)

Given the presence of Parquet IO (with a fair few differences in priority - e.g. parquet-wasm is obviously not intended to have a Python binding) in both this repo and parquet-wasm, is it still a worthwhile goal to delegate to parquet-wasm?

I get the sense that the cross-crate interaction is proving to be too much of an impediment (that or the API surfaces are just too different), or is the current situation one of 'implement separately, unify when the dust settles'?

Kyle Barron · Answer 4 · Fri Feb 16 2024 23:09:25 GMT+0800 (China Standard Time)

Thanks for chiming in @H-Plus-Time! This has been on my mind recently, and I really don't have any conclusions, so any suggestions are welcomed.

I think the core problem is I wish to have Parquet support that is

Non-spatial for JS
Spatial-aware for JS
Spatial-aware for Rust and Python

How to reuse code across those is unclear, especially with a tangled web of dependencies.

At this point parquet-wasm is intricately tied to wasm-bindgen. And its arrow table object is an arrow-wasm table. In this repo I'm exploring how Parquet works with object-store, because for Python, remote support for e.g. S3 is crucial.

Maybe I was wrong in kylebarron/parquet-wasm#392 (comment) and having an object-store based implementation in JS will be easiest? Or the GeoParquet reader uses the Rust implementation from this repo instead of from parquet-wasm.

Nicholas Roberts · Answer 5 · Thu Feb 22 2024 12:40:04 GMT+0800 (China Standard Time)

At this point parquet-wasm is intricately tied to wasm-bindgen

Agreed, I wouldn't use it outside the JS geoparquet-wasm subcrate (the js dir). Both the Wasm- and Python-targeting parts necessarily have their own binding-specific bits; that's honestly the most useful part of parquet-wasm (that and the quasi-ObjectStore).

And its arrow table object is an arrow-wasm table

I wonder about this - am I right in figuring that going from an ArrowTable to a GeoTable (or vice versa) would be relatively low-cost?

I can kind of see how one would do from_arrow_wasm_table in the outer GeoTable (sort of, it does look like the build_arrow_schema function requires a builder, though I suppose setting parse_geoparquet_metadata to pub would be sufficient when dealing with an already finalized table). Since most of the arrow-wasm types have bidirectional From impls for their equivalent types, might be able to get away with it without too much extra code.

The streams would be another kettle of fish - I suspect that a more generic version of SharedIO (also a... much better name :| ), with as much of AsyncParquetTable's behaviour shoved into it as possible, would be part of that.

Ignoring all the custom IO bits, that one top level reader struct would be quite acceptable to duplicate (since it's impossible to involve traits or generics in wasm-bindgen'd structs) - <50 lines of duplication.

Maybe I was wrong in kylebarron/parquet-wasm#392 (comment) and having an object-store based implementation in JS will be easiest?

Yeah, I didn't think deeply enough about it at the time. For this proposal to work, the bulk of the IO code really needs to come from neither geoarrow-rs nor parquet-wasm; object-store is the way.

I think with a combination of object-store-wasm, and avoiding the extra HEAD request, it should be feasible to pull in parquet-wasm as a dep of geoparquet-wasm.

I should have a repo up for that last part today (just as soon as I get off this paper straw of a connection (plane wifi)).

Kyle Barron · Answer 6 · Thu Feb 22 2024 14:47:31 GMT+0800 (China Standard Time)

I wonder about this - am I right in figuring that going from an ArrowTable to a GeoTable (or vice versa) would be relatively low-cost?

That'll always be O(1). As a note, I do want to rework the GeoTable a bit to relax the geometry restriction and allow it to have either no geometry or multiple geometry columns, which might bring it to be just a Table.

Kyle Barron · Answer 7 · Thu Feb 22 2024 14:50:28 GMT+0800 (China Standard Time)

I think it's probably fine for geoparquet-wasm to return the same general arrow-wasm object as parquet-wasm. As long as it contains the extension metadata, you'll still be able to see it represents a geometry.
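As a rough sketch of what "seeing" the geometry through metadata could look like on the JS side: Arrow extension types are recorded under the `ARROW:extension:name` field metadata key, and GeoArrow extension names start with `geoarrow.` (for example `geoarrow.point`). `FieldLike` below is a hypothetical minimal shape used for the example, not the exact type exposed by Arrow JS or arrow-wasm.

```typescript
// Hypothetical minimal field shape: a name plus string key/value metadata.
interface FieldLike {
  name: string;
  metadata: Map<string, string>;
}

const EXTENSION_NAME_KEY = "ARROW:extension:name";

// Returns the names of the columns whose extension name marks them as GeoArrow.
function geometryColumns(fields: readonly FieldLike[]): string[] {
  return fields
    .filter((f) => (f.metadata.get(EXTENSION_NAME_KEY) ?? "").startsWith("geoarrow."))
    .map((f) => f.name);
}

const exampleFields: FieldLike[] = [
  { name: "id", metadata: new Map() },
  { name: "geometry", metadata: new Map([[EXTENSION_NAME_KEY, "geoarrow.point"]]) },
];
const geometryNames = geometryColumns(exampleFields);
```

This also accommodates tables with no geometry column or several, since the check runs per field.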