kylebarron / arrow-js-ffi

Zero-copy reading of Arrow data from WebAssembly

Home Page: https://www.npmjs.com/package/arrow-js-ffi

Explore "nanoarrow-js"

kylebarron opened this issue

Arrow JS is a big library! It's not really a tenable dependency for a library or application that is very conscious of bundle size.

This is actually the same story as in C/C++/Python. The C++ Arrow library got so big that many projects didn't want to depend on it. That's why nanoarrow was created: a super minimal library that works with the C Data Interface representation of Arrow arrays.

I think there's definitely potential for a low level Arrow library in JS, that hews very closely to the C Data Interface.

Data structures would be essentially the JS counterpart of the C Data Interface structs. All array data (no matter the logical type) would be a Uint8Array, which could later be viewed as another typed array or decoded as strings.

Because array data are all Uint8Arrays, it means an array could either be "owned" in JS memory or "viewed" from wasm memory. So the memory safety wouldn't be great, but this is JS after all!
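
As a rough sketch of what that could look like (all names here, such as RawArrowArray, viewInt32 and getString, are hypothetical and not part of any existing library), an array can be a plain object of byte buffers plus free functions that reinterpret those bytes on demand. The buffer order used for strings follows the Arrow columnar layout for Utf8 arrays: validity bitmap, 32-bit offsets, then value bytes.

```typescript
// Hypothetical sketch; not the API of arrow-js-ffi, Arrow JS, or nanoarrow.

/** Loose JS counterpart of the C Data Interface `ArrowArray` struct. */
interface RawArrowArray {
  length: number;
  nullCount: number;
  offset: number;
  /** Every buffer is raw bytes, regardless of the logical type. */
  buffers: (Uint8Array | null)[];
  children: RawArrowArray[];
}

/** Reinterpret a byte buffer as 32-bit integers without copying. */
function viewInt32(bytes: Uint8Array): Int32Array {
  // Throws a RangeError if bytes.byteOffset is not a multiple of 4.
  return new Int32Array(bytes.buffer, bytes.byteOffset, bytes.byteLength >> 2);
}

const decoder = new TextDecoder();

/** Read element `i` of a Utf8 array laid out as [validity, offsets, values]. */
function getString(array: RawArrowArray, i: number): string | null {
  const [validity, offsetBytes, valueBytes] = array.buffers;
  const j = array.offset + i;
  // A missing validity bitmap means every slot is valid.
  if (validity && (validity[j >> 3] & (1 << (j & 7))) === 0) return null;
  const offsets = viewInt32(offsetBytes!);
  return decoder.decode(valueBytes!.subarray(offsets[j], offsets[j + 1]));
}

// "Owned" buffers allocated in JS. A "viewed" array would instead build each
// Uint8Array over wasm memory, e.g. new Uint8Array(memory.buffer, ptr, len).
const utf8Bytes = new TextEncoder().encode("foobarbaz");
const offsetBuffer = new Uint8Array(new Int32Array([0, 3, 6, 9]).buffer);
const strings: RawArrowArray = {
  length: 3,
  nullCount: 0,
  offset: 0,
  buffers: [null, offsetBuffer, utf8Bytes],
  children: [],
};

console.log(getString(strings, 1)); // "bar"
```

Because getString is a free function over plain data rather than a method on a class, a bundler can drop it entirely when nothing imports it, which is the motivation for a functional API.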

It would make sense to have toArrowJS and fromArrowJS functions that convert to and from Arrow JS arrays/Data instances.

An emphasis should be placed on a functional api instead of a class API to keep bundle size low.

Ideally, this would allow high-performance programs to rely on Arrow memory without fear of a huge bundle size impact! But this would be complementary to Arrow JS, not a competitor.

Look at the zarrita.js implementation for ideas on TypeScript typing for this approach. Seems like type guards would be very useful here.

let arrayData: ArrowArray = ...;

function isStringArray(data: ArrowArray): data is StringArray 

Keep in mind, though, that TypeScript's type system is structural: if StringArray doesn't change the actual interface, a StringArray object will type check the same as any other ArrowArray.
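
One way around that, sketched here with hypothetical types (the `type` tag field and the Int32ArrayData name are assumptions, not anything defined in this thread), is to give each variant a literal discriminant so the guard has something real to check and the variants stop being structurally identical:

```typescript
// Hypothetical types for illustration only.
interface BaseArray {
  length: number;
  buffers: (Uint8Array | null)[];
}

// The literal `type` field is what makes the variants structurally distinct.
interface StringArray extends BaseArray {
  type: "utf8";
}

interface Int32ArrayData extends BaseArray {
  type: "int32";
}

type ArrowArray = StringArray | Int32ArrayData;

/** Type guard: narrows an ArrowArray to StringArray by checking its tag. */
function isStringArray(data: ArrowArray): data is StringArray {
  return data.type === "utf8";
}

function summarize(data: ArrowArray): string {
  if (isStringArray(data)) {
    // Here `data` is narrowed to StringArray.
    return `utf8 array of ${data.length} values`;
  }
  // Here `data` is narrowed to Int32ArrayData.
  return `${data.type} array of ${data.length} values`;
}

const example: ArrowArray = { type: "utf8", length: 2, buffers: [] };
console.log(summarize(example)); // "utf8 array of 2 values"
```

With the tag in place, passing an Int32ArrayData where a StringArray is required is a compile-time error, which it would not be if both interfaces had identical members.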

Arrow JS is a larger library but it is super treeshakeable. So if you don't need IPC reading/writing for example, you can get a much smaller bundle. If you just import one type, it can be tiny.

As a disclaimer, I'm horrible at bundling, so it's very possible I'm doing something wrong, but in geoarrow/geoarrow-js#20 I found that the apache-arrow import wasn't getting tree-shaken by esbuild.

In particular, comparing the tree shaking output of

import { BufferType, Type } from "apache-arrow/enum";
import { Data } from "apache-arrow/data";
import { Vector } from "apache-arrow/vector";
import { Field } from "apache-arrow/schema";

with

import { BufferType, Type } from "apache-arrow";
import { Data } from "apache-arrow";
import { Vector } from "apache-arrow";
import { Field } from "apache-arrow";

Just that one change (from the latter to the former) reduced the minified earcut worker from 205 kB to 74 kB.

I was suspicious because originally the unminified worker output from the latter still had IPC read/write code.

In the end, because I knew that this worker only used attributes of the Data class and no methods, I avoided any arrow import from the worker at all and got the compressed size down to 6 kB. But for workers that return Arrow data, that won't be possible.

So, naively, it seems to enable tree shaking I have to ensure imports are from the internal file? Or maybe I'm using esbuild wrong 🤷‍♂️

In any case, as I mentioned here, I'm already spread too thin and don't think I have the bandwidth to make a stable nanoarrow-js right now.

I think esbuild doesn't tree-shake. We have a bundle test in Arrow that compares different bundlers.

$ yarn test:bundle
$ gulp bundle
[13:00:47] Using gulpfile ~/Code/arrow/js/gulpfile.js
[13:00:47] Starting 'bundle'...
[13:00:47] Starting 'bundle:clean'...
[13:00:47] Finished 'bundle:clean' after 12 ms
[13:00:47] Starting 'bundle:esbuild'...
[13:00:48] field-bundle.js: 196.29 kB (gzipped: 46.15 kB)
[13:00:48] makeTable-bundle.js: 197.89 kB (gzipped: 46.45 kB)
[13:00:48] makeVector-bundle.js: 197.81 kB (gzipped: 46.42 kB)
[13:00:48] schema-bundle.js: 196.29 kB (gzipped: 46.15 kB)
[13:00:48] table-bundle.js: 196.29 kB (gzipped: 46.14 kB)
[13:00:48] tableFromArrays-bundle.js: 199.53 kB (gzipped: 47.07 kB)
[13:00:48] tableFromIPC-bundle.js: 197.54 kB (gzipped: 46.4 kB)
[13:00:48] vector-bundle.js: 196.29 kB (gzipped: 46.15 kB)
[13:00:48] vectorFromArray-bundle.js: 199.44 kB (gzipped: 47.04 kB)
[13:00:48] Finished 'bundle:esbuild' after 197 ms
[13:00:48] Starting 'bundle:rollup'...
[13:00:53] table-bundle.js: 88.28 kB (gzipped: 19.07 kB)
[13:00:53] vectorFromArray-bundle.js: 101.59 kB (gzipped: 21.51 kB)
[13:00:53] vector-bundle.js: 66.66 kB (gzipped: 14.99 kB)
[13:00:53] schema-bundle.js: 13.79 kB (gzipped: 3.54 kB)
[13:00:53] field-bundle.js: 799 B (gzipped: 367 B)
[13:00:53] tableFromIPC-bundle.js: 195.94 kB (gzipped: 40.75 kB)
[13:00:53] makeTable-bundle.js: 91.71 kB (gzipped: 19.61 kB)
[13:00:53] makeVector-bundle.js: 74.75 kB (gzipped: 16.03 kB)
[13:00:53] tableFromArrays-bundle.js: 112.72 kB (gzipped: 24.4 kB)
[13:00:53] Finished 'bundle:rollup' after 4.88 s
[13:00:53] Starting 'bundle:webpack'...
[13:00:55] field-bundle.js: 14.68 kB (gzipped: 3.7 kB)
[13:00:55] makeTable-bundle.js: 74.28 kB (gzipped: 17.84 kB)
[13:00:55] makeVector-bundle.js: 60.11 kB (gzipped: 14.52 kB)
[13:00:55] schema-bundle.js: 14.68 kB (gzipped: 3.7 kB)
[13:00:55] table-bundle.js: 72.61 kB (gzipped: 17.53 kB)
[13:00:55] tableFromArrays-bundle.js: 91.64 kB (gzipped: 22.31 kB)
[13:00:55] tableFromIPC-bundle.js: 167.49 kB (gzipped: 37.04 kB)
[13:00:55] vector-bundle.js: 58.48 kB (gzipped: 14.2 kB)
[13:00:55] vectorFromArray-bundle.js: 83.03 kB (gzipped: 20 kB)
[13:00:55] Finished 'bundle:webpack' after 2.67 s

I filed an issue about it at evanw/esbuild#1922, but it sounds like esbuild expects annotations, so we should add those if esbuild is becoming popular.

I've had pretty good experiences with Rollup. It would be awesome if the problem just went away with a better bundler so you don't have to rewrite the Arrow APIs.
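
The thread doesn't spell out which annotations are meant. Assuming it refers to the `/* @__PURE__ */` call annotation that esbuild (like Rollup and Terser) recognizes, a minimal sketch looks like this; makeByteWidths and BYTE_WIDTHS are made-up names, not Arrow JS code:

```typescript
// Hypothetical module-level code, for illustration only.

function makeByteWidths(): Map<string, number> {
  return new Map([
    ["int32", 4],
    ["float64", 8],
  ]);
}

// A bundler cannot prove that an arbitrary top-level call has no side
// effects, so without a hint it keeps the call even when the result is never
// imported. The annotation marks the call as removable when unused.
const BYTE_WIDTHS = /* @__PURE__ */ makeByteWidths();

console.log(BYTE_WIDTHS.get("float64")); // 8
```

The annotation only affects what a bundler may drop and does not change runtime behavior. A package can also declare `"sideEffects": false` in its package.json so that whole unused modules can be skipped.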

I'm going to close this because I don't have the maintenance bandwidth to try and implement data structures for Arrow outside of Arrow JS, and I don't have a use case at this point where Arrow JS's bundle size is a deal-breaker.