Easily create and run user-defined functions (UDF) on Apache Arrow. You can define functions in Rust, Python, Java or JavaScript. The functions can be executed natively, or in WebAssembly, or in a remote server.
Language | Native | WebAssembly | Remote |
---|---|---|---|
Rust | arrow-udf | arrow-udf-wasm | |
Python | arrow-udf-python | arrow-udf-flight/python | |
JavaScript | arrow-udf-js or arrow-udf-js-deno | ||
Java | arrow-udf-flight/java |
In addition to the standard types defined by Arrow, these crates also support the following data types through Arrow's extension type. When using extension types, you need to add the ARROW:extension:name
key to the field's metadata.
Extension Type | Physical Type | Metadata |
---|---|---|
JSON | Utf8 | ARROW:extension:name = arrowudf.json |
Decimal | Utf8 | ARROW:extension:name = arrowudf.decimal |
JSON type is stored in string array in text form.
let json_field = Field::new(name, DataType::Utf8, true)
.with_metadata([("ARROW:extension:name".into(), "arrowudf.json".into())].into());
let json_array = StringArray::from(vec![r#"{"key": "value"}"#]);
Different from the fixed-point decimal type built into Arrow, this decimal type represents floating-point numbers with arbitrary precision or scale, that is, the unconstrained numeric in Postgres. The decimal type is stored in a string array in text form.
let decimal_field = Field::new(name, DataType::Utf8, true)
.with_metadata([("ARROW:extension:name".into(), "arrowudf.decimal".into())].into());
let decimal_array = StringArray::from(vec!["0.0001", "-1.23", "0"]);
We have benchmarked the performance of function calls in different environments. You can run the benchmarks with the following command:
cargo bench --bench wasm
Performance comparison of calling gcd
on a chunk of 1024 rows:
gcd/native 1.5237 µs x1
gcd/wasm 15.547 µs x10
gcd/js(quickjs) 85.007 µs x55
gcd/js(deno) 93.584 µs x62
gcd/python 175.29 µs x115
- RisingWave: A Distributed SQL Database for Stream Processing.