rapidsai / node

GPU-accelerated data science and visualization in node

Home Page: https://rapidsai.github.io/node/

Feature Request: Non-casting binops

thomcom opened this issue · comments

Presently, all column types that receive a .mul(2) call are cast to Float64. This is not always desirable. It would be preferable for a type to be retained unless it must be upcast, such as for overflow or int.mul(float). I'm pretty sure this is all supported by libcudf.

rapids@tcomer-NVIDIA:~/node/modules/demo/api-server$ node
Welcome to Node.js v18.2.0.
Type ".help" for more information.
> const {Series, Int32, Int64, Float32, Float64} = require('@rapidsai/cudf')
undefined
> let a = Series.new([0, 1]).cast(new Int32)
undefined
> let b = Series.new([0, 1]).cast(new Int64)
undefined
> a.mul(2).type
Float64 [Float] { precision: 2 }
> b.mul(2).type
Float64 [Float] { precision: 2 }

All numbers in JS are double-precision floats (or 64-bit integers when using the n literal suffix for BigInt). Without runtime numeric analysis, we can't know that the number the user provided fits anything narrower than a double, so we always have to up-cast whenever a plain number is passed. The workaround to get the non-casting behavior is to construct a wrapping Scalar and pass it instead:

$ node
> const {Series, Int32, addon: { Scalar }} = require('@rapidsai/cudf')
undefined
> let a = Series.new([0, 1]).cast(new Int32)
undefined
> a.mul(new Scalar({value: 2, type: new Int32})).type
Int32 [Int] { isSigned: true, bitWidth: 32 }

That's pretty heavy-handed just for wanting your column to retain its dtype. Maybe add an argument like .add(2, elevateType=false, mr)?
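
For reference, here's a rough sketch of what an opt-out like that could look like; the helper name, the keepType flag, and the wiring are my own assumptions, not the actual API:

const {Series, Int32, Float64, addon: {Scalar}} = require('@rapidsai/cudf')

// Hypothetical sketch only -- keepType is not a real option in @rapidsai/cudf.
// If keepType is true, wrap the JS number in a Scalar of the column's own dtype
// instead of letting it default to Float64.
function mulKeepingType(series, value, keepType = false) {
  const type = keepType ? series.type : new Float64
  return series.mul(new Scalar({value, type}))
}

const a = Series.new([0, 1]).cast(new Int32)
mulKeepingType(a, 2, true).type  // Int32 -- no up-cast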

I can think of a few things (rough sketches of both follow the list):

  1. Add convenience functions for constructing scalars:
    a.mul(Scalar.int32(2))
  2. Add a signature that accepts a template literal string arg that infers the numeric type via pattern matching:
    a.mul(`2`).type // Int32
    a.mul(`12.32`).type // Float64
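
To make both options concrete, here's a sketch; neither helper exists in @rapidsai/cudf today, and the names and the pattern-matching rule are assumptions for illustration:

const {Int32, Float64, addon: {Scalar}} = require('@rapidsai/cudf')

// Option 1 sketch: static convenience constructors on Scalar (hypothetical).
Scalar.int32 = (value) => new Scalar({value, type: new Int32})

// Option 2 sketch: infer the dtype from a template-literal string (hypothetical).
// A string with no decimal point is treated as Int32; otherwise Float64.
function scalarFromLiteral(str) {
  return /^-?\d+$/.test(str)
    ? new Scalar({value: Number(str), type: new Int32})
    : new Scalar({value: Number(str), type: new Float64})
}

scalarFromLiteral(`2`)      // Int32 scalar
scalarFromLiteral(`12.32`)  // Float64 scalar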

#1 just doesn't seem Javascript-y to me at all. As a JS developer, I'd always assume that the type of the scalar argument is supposed to be cast to the type of the column. What about always constructing a scalar when we call a.mul?

#2 works well for simple arithmetic like this, but gets really expensive if we're trying to do a programmatic mul of some kind, though we'd hope a scatter and a column x column mul would be used instead of an iteration. In fact, this way it compels a developer doing more than a few muls to write more efficient code, so I think I could be on board with it.
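
For instance, the column x column form would look roughly like this (a sketch of the pattern, assuming Series.mul accepts another Series as the column x column case implies):

const {Series, Int32} = require('@rapidsai/cudf')

const a       = Series.new([1, 2, 3, 4]).cast(new Int32)
const factors = Series.new([2, 3, 4, 5]).cast(new Int32)

// One element-wise column x column mul instead of a loop of scalar muls.
const result = a.mul(factors)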

We always create a Scalar from the input, but the issue is figuring out the Scalar's dtype. If you pass a JS number, we must create a Scalar({type: new Float64}), because JS numbers are always doubles.

If we wanted to create another dtype, we'd have to do some kind of runtime numeric analysis on the double to see if it fits into a smaller (or different) numeric type.
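
For illustration, that analysis might look something like this (a sketch only; nothing like it exists in the library, and the cutoffs are just the usual integer ranges):

// Hypothetical runtime numeric analysis of a JS number (not part of @rapidsai/cudf).
// Picks the narrowest dtype name the double happens to fit into.
function inferNumericType(x) {
  if (Number.isInteger(x)) {
    if (x >= -128 && x <= 127) return 'Int8'
    if (x >= -2147483648 && x <= 2147483647) return 'Int32'
    return 'Int64'
  }
  return 'Float64'
}

inferNumericType(2)      // 'Int8' -- but the user may well have meant Int32 or Float64
inferNumericType(12.32)  // 'Float64'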

Ok, I should have phrased #1 better: my preference is, in some form, "the Scalar matches the type of the input column, instead of the type of the JS number, unless otherwise specified." Auto-casting every operation to Float64, and then having to cast back down to another type after every operation unless we remember to create a cudf.Scalar explicitly each time, really seems like poor usability to me.

#1 beats the current situation, though! I'll work on a convenience method for Scalar construction in a bit.

@thomcom We can't do that because it'd be incorrect. If you do Series.new({data: [1, 2, 3], type: new Int8}).mul(12345.6789), the resulting values neither match an "int" dtype nor fit into a 1-byte signed integer. We have to compare the LHS and RHS input dtypes and find a common dtype between them in order to ensure correctness. Float64 is often the only common dtype in the degenerate case.
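
A simplified sketch of that kind of common-dtype resolution (the rules here are assumptions for illustration; libcudf's actual promotion logic is more involved):

// Simplified common-dtype resolution (illustrative assumption, not libcudf's rules).
function commonType(lhs, rhs) {
  if (lhs === rhs) return lhs                          // Int8 * Int8 -> Int8
  const isFloat = (t) => t.startsWith('Float')
  if (isFloat(lhs) || isFloat(rhs)) return 'Float64'   // any float operand -> Float64
  return 'Int64'                                       // mixed integer widths -> widest int
}

commonType('Int8', 'Float64')  // 'Float64' -- 12345.6789 can't be represented in an Int8
commonType('Int32', 'Int64')   // 'Int64'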

Like you said, in the Javascript case, Float64 isn't just the degenerate case, it is every case unless other steps are taken. At this point the default behavior seems acceptable to me, but if the user creates a typed scalar like Series.new({data: [1, 2, 3], type: new Int8}).mul(new Scalar({value: 100, type: new Int8})), then the result should be [100, -56, 44] (wrapping as signed 8-bit integers).
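
In code, that expectation would look roughly like this (a sketch; the Int8 result dtype and the wrapped values are expectations based on the discussion above, not verified output):

const {Series, Int8, addon: {Scalar}} = require('@rapidsai/cudf')

const s = Series.new({data: [1, 2, 3], type: new Int8})

// With a typed Scalar, the common dtype of Int8 and Int8 should be Int8,
// so no up-cast to Float64; values then wrap as signed 8-bit: [100, -56, 44].
s.mul(new Scalar({value: 100, type: new Int8})).type  // expected: Int8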

Do we already have this functionality, but users need to know to create a Scalar as the argument to mul?

Yes, all the binary ops already accept scalars as inputs.

I guess I'll close this? :)