Explore "nanoarrow-js"
kylebarron opened this issue
Arrow JS is a big library! It's not really a tenable dependency for a very bundle-size-conscious library or application.
This is actually the same story as in C/C++/Python. The C++ Arrow library got so big that many projects didn't want to depend on it. That's why nanoarrow was created: a super minimal library that works with the C Data Interface representation of Arrow arrays.
I think there's definitely potential for a low-level Arrow library in JS that hews very closely to the C Data Interface.
Data structures would essentially be the JS counterparts of the C Data Interface structs. All array data (no matter the logical type) would be a `Uint8Array` that could later be viewed as another type or as strings. Because array data are all `Uint8Array`s, an array could either be "owned" in JS memory or "viewed" from Wasm memory. So the memory safety wouldn't be great, but this is JS after all!
It would make sense to have `toArrowJS` and `fromArrowJS` functions that convert to and from Arrow JS arrays/`Data` instances.
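As a rough sketch of the shape I have in mind (all names and fields below are hypothetical; they just loosely mirror the `ArrowSchema`/`ArrowArray` structs of the C Data Interface):

```ts
import type { Data, Vector } from "apache-arrow";

// Hypothetical JS counterpart of the C ArrowSchema struct.
interface SchemaField {
  format: string;          // C Data Interface format string, e.g. "i" = int32, "u" = utf8
  name: string;
  nullable: boolean;
  children: SchemaField[];
}

// Hypothetical JS counterpart of the C ArrowArray struct. Every buffer is a
// plain Uint8Array, whether it's "owned" in JS memory or a "view" onto
// WebAssembly memory.
interface ArrowArray {
  field: SchemaField;      // carried along so consumers know how to interpret buffers
  length: number;
  nullCount: number;
  offset: number;
  buffers: Uint8Array[];   // validity, offsets, data, ... depending on the type
  children: ArrowArray[];
}

// Conversions to/from Arrow JS would live at the edges of the library.
declare function toArrowJS(array: ArrowArray): Vector;
declare function fromArrowJS(data: Data): ArrowArray;
```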
An emphasis should be placed on a functional API instead of a class-based API, to keep bundle size low.
Ideally, this would allow high-performance programs to rely on Arrow memory without fear of a huge bundle-size impact! But this would be complementary to, not competitive with, Arrow JS.
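For example (again just a sketch building on the hypothetical structs above, and assuming the standard Arrow buffer layout for utf8 arrays), reading a string would be a free function that a bundler can drop entirely if it's never imported, rather than a method on an array class:

```ts
// Free function instead of a method: unused accessors can be tree-shaken away.
// Assumes buffers[1] holds int32 offsets and buffers[2] holds UTF-8 data, as in
// the Arrow layout for (non-large) utf8 arrays.
function getString(array: ArrowArray, index: number): string {
  const offsetsBytes = array.buffers[1];
  const offsets = new Int32Array(
    offsetsBytes.buffer,
    offsetsBytes.byteOffset,
    offsetsBytes.byteLength >> 2
  );
  const i = index + array.offset;
  const utf8 = array.buffers[2].subarray(offsets[i], offsets[i + 1]);
  return new TextDecoder().decode(utf8);
}
```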
Look at the zarrita.js implementation to consider TypeScript typing for this approach. It seems like type guards would be very useful here:
```ts
let arrayData: ArrowArray = ...;
function isStringArray(data: ArrowArray): data is StringArray
```
Keep in mind, though, that if `StringArray` doesn't change the actual interface, a `StringArray` object will type-check the same as a normal array; TypeScript's typing is structural, so the guard needs some runtime field to check against.
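One way around that (hypothetical again) is to have the guard check a runtime tag, e.g. the C-Data-Interface-style format string from the sketch above, so that `StringArray` is a real narrowing of `ArrowArray` rather than a structurally identical alias:

```ts
// "u" = utf8 and "U" = large utf8 in the C Data Interface format strings.
type StringArray = ArrowArray & { field: { format: "u" | "U" } };

function isStringArray(data: ArrowArray): data is StringArray {
  return data.field.format === "u" || data.field.format === "U";
}

// Downstream code then branches on the runtime check:
// if (isStringArray(arrayData)) { const s = getString(arrayData, 0); }
```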
Arrow JS is a larger library, but it is super tree-shakeable. So if you don't need IPC reading/writing, for example, you can get a much smaller bundle. If you just import one type, it can be tiny.
As a disclaimer, I'm horrible at bundling, so it's very possible I'm doing something wrong, but in geoarrow/geoarrow-js#20 I found that the `apache-arrow` import wasn't getting tree-shaken by esbuild.
In particular, I compared the tree-shaking output of

```ts
import { BufferType, Type } from "apache-arrow/enum";
import { Data } from "apache-arrow/data";
import { Vector } from "apache-arrow/vector";
import { Field } from "apache-arrow/schema";
```

with

```ts
import { BufferType, Type } from "apache-arrow";
import { Data } from "apache-arrow";
import { Vector } from "apache-arrow";
import { Field } from "apache-arrow";
```

Just that one change (from the latter to the former) reduced the minified earcut worker from 205 kB to 74 kB.
I was suspicious because originally the unminified worker output from the latter still had IPC read/write code.
In the end, because I knew this worker was only using attributes of the `Data` class and no methods, I avoided any Arrow import from the worker at all and got the compressed size down to 6 kB. But for workers that return Arrow data, that won't be possible.
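For reference, the trick there is just structural typing: the worker declares its own minimal interface for the handful of `Data` attributes it reads, so real `Data` instances are assignable to it but `apache-arrow` never gets pulled into the worker bundle. A rough sketch (the attribute names are illustrative; match them to whatever you actually read):

```ts
// A structural stand-in for the few attributes of Arrow JS's Data that the
// worker touches. No import of "apache-arrow" is needed, so none of it ends
// up in the worker bundle.
interface DataLike {
  length: number;
  nullCount: number;
  values: Float64Array;    // or whatever typed array the column actually holds
}

function countNonNull(data: DataLike): number {
  return data.length - data.nullCount;
}
```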
So, naively, it seems that to enable tree shaking I have to ensure imports come from the internal files? Or maybe I'm using esbuild wrong 🤷‍♂️
In any case, as I mentioned here, I'm already spread too thin and don't think I have the bandwidth to make a stable `nanoarrow-js` right now.
I think esbuild doesn't tree-shake. We have a bundle test in Arrow that compares different bundlers:
```
$ yarn test:bundle
$ gulp bundle
[13:00:47] Using gulpfile ~/Code/arrow/js/gulpfile.js
[13:00:47] Starting 'bundle'...
[13:00:47] Starting 'bundle:clean'...
[13:00:47] Finished 'bundle:clean' after 12 ms
[13:00:47] Starting 'bundle:esbuild'...
[13:00:48] field-bundle.js: 196.29 kB (gzipped: 46.15 kB)
[13:00:48] makeTable-bundle.js: 197.89 kB (gzipped: 46.45 kB)
[13:00:48] makeVector-bundle.js: 197.81 kB (gzipped: 46.42 kB)
[13:00:48] schema-bundle.js: 196.29 kB (gzipped: 46.15 kB)
[13:00:48] table-bundle.js: 196.29 kB (gzipped: 46.14 kB)
[13:00:48] tableFromArrays-bundle.js: 199.53 kB (gzipped: 47.07 kB)
[13:00:48] tableFromIPC-bundle.js: 197.54 kB (gzipped: 46.4 kB)
[13:00:48] vector-bundle.js: 196.29 kB (gzipped: 46.15 kB)
[13:00:48] vectorFromArray-bundle.js: 199.44 kB (gzipped: 47.04 kB)
[13:00:48] Finished 'bundle:esbuild' after 197 ms
[13:00:48] Starting 'bundle:rollup'...
[13:00:53] table-bundle.js: 88.28 kB (gzipped: 19.07 kB)
[13:00:53] vectorFromArray-bundle.js: 101.59 kB (gzipped: 21.51 kB)
[13:00:53] vector-bundle.js: 66.66 kB (gzipped: 14.99 kB)
[13:00:53] schema-bundle.js: 13.79 kB (gzipped: 3.54 kB)
[13:00:53] field-bundle.js: 799 B (gzipped: 367 B)
[13:00:53] tableFromIPC-bundle.js: 195.94 kB (gzipped: 40.75 kB)
[13:00:53] makeTable-bundle.js: 91.71 kB (gzipped: 19.61 kB)
[13:00:53] makeVector-bundle.js: 74.75 kB (gzipped: 16.03 kB)
[13:00:53] tableFromArrays-bundle.js: 112.72 kB (gzipped: 24.4 kB)
[13:00:53] Finished 'bundle:rollup' after 4.88 s
[13:00:53] Starting 'bundle:webpack'...
[13:00:55] field-bundle.js: 14.68 kB (gzipped: 3.7 kB)
[13:00:55] makeTable-bundle.js: 74.28 kB (gzipped: 17.84 kB)
[13:00:55] makeVector-bundle.js: 60.11 kB (gzipped: 14.52 kB)
[13:00:55] schema-bundle.js: 14.68 kB (gzipped: 3.7 kB)
[13:00:55] table-bundle.js: 72.61 kB (gzipped: 17.53 kB)
[13:00:55] tableFromArrays-bundle.js: 91.64 kB (gzipped: 22.31 kB)
[13:00:55] tableFromIPC-bundle.js: 167.49 kB (gzipped: 37.04 kB)
[13:00:55] vector-bundle.js: 58.48 kB (gzipped: 14.2 kB)
[13:00:55] vectorFromArray-bundle.js: 83.03 kB (gzipped: 20 kB)
[13:00:55] Finished 'bundle:webpack' after 2.67 s
```
I filed an issue about it at evanw/esbuild#1922, but it sounds like esbuild expects side-effect annotations, so we should add those if esbuild is becoming popular.
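Roughly, that means marking side-effect-free calls with `/* @__PURE__ */` comments (assuming those are the annotations in question) so esbuild can drop them when the result is unused. For example:

```ts
import { Field, Int32 } from "apache-arrow";

// Illustrative only (not actual Arrow JS source): the /* @__PURE__ */ marker
// tells esbuild the constructor call has no side effects, so this whole
// declaration can be dropped if `defaultField` is never imported.
export const defaultField = /* @__PURE__ */ new Field("value", new Int32());
```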
I've had pretty good experiences with Rollup. It would be awesome if the problem just went away with a better bundler, so you don't have to rewrite the Arrow APIs.
I'm going to close this because I don't have the maintenance bandwidth to try to implement data structures for Arrow outside of Arrow JS, and I don't have a use case at this point where Arrow JS's bundle size is a deal-breaker.