rapidsai / node

GPU-accelerated data science and visualization in node

Home Page: https://rapidsai.github.io/node/

Cannot execute a `sum` on a `DataFrame` created with `readParquet`

maxime-petitjean opened this issue

If I try to execute this code:

const { DataFrame } = require('@rapidsai/cudf');
const frame = DataFrame.readParquet({ sourceType: 'files', sources: ['data.parquet'] });
const result = frame.sum(); // throws!

I get the error `sum operation requires dataframe to be entirely of dtype FloatingPoint OR Integral.`, even though the parquet file contains only Float64 columns.

If I explicitly cast the columns to Float64, it works!

const { DataFrame, Float64 } = require('@rapidsai/cudf');
const frame = DataFrame.readParquet({ sourceType: 'files', sources: ['data.parquet'] });
const casted = frame.cast({ col1: new Float64(), col2: new Float64() });
const result = casted.sum(); // OK
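
For a frame with many columns, the cast map used in the workaround above could be built from a list of column names instead of being written by hand. The following is only a sketch: `buildCastMap` is a hypothetical helper, not part of `@rapidsai/cudf`, and the `Float64` class here is an empty stand-in so the snippet runs without a GPU.

```javascript
// Hypothetical helper (not part of @rapidsai/cudf): build a
// { columnName: dtype } object of the shape passed to `cast` above.
function buildCastMap(names, makeType) {
  // Call the factory once per column so each column gets its own dtype instance.
  return Object.fromEntries(names.map((name) => [name, makeType()]));
}

// Stand-in for the library's Float64 class, for demonstration only.
class Float64 {}

const castMap = buildCastMap(['col1', 'col2'], () => new Float64());
console.log(Object.keys(castMap)); // [ 'col1', 'col2' ]
```

With the real library this would presumably be used as `frame.cast(buildCastMap(frame.names, () => new Float64()))`, assuming `frame.names` lists the column names and every column should become Float64.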

If I log the frame's column types, I get:

  • before cast: { col1: { typeId: 3, precision: 2 }, col2: { typeId: 3, precision: 2 } }
  • after cast: { col1: Float64 [Float] { precision: 2 }, col2: Float64 [Float] { precision: 2 } }

The class instance of each column type seems to be lost in the `readParquet` function (type serialisation?).
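
A minimal sketch of how such a loss could produce the error, assuming the dtype check is class-based (for example via `instanceof`). The classes, `isFloatingPoint`, and `restoreType` below are hypothetical stand-ins and not the actual `@rapidsai/cudf` implementation; only the `typeId` and `precision` values are taken from the logged output above.

```javascript
// Hypothetical stand-ins for the library's dtype classes.
class FloatingPoint {
  constructor(precision) {
    this.typeId = 3;            // value seen in the logged types
    this.precision = precision; // 2 in the logged types
  }
}

class Float64 extends FloatingPoint {
  constructor() { super(2); }
}

// Assumed shape of a class-based dtype check.
function isFloatingPoint(type) {
  return type instanceof FloatingPoint;
}

// A plain object carries the same fields but has no class prototype.
const fromReader = { typeId: 3, precision: 2 };
const fromCast = new Float64();

console.log(isFloatingPoint(fromReader)); // false
console.log(isFloatingPoint(fromCast));   // true

// Rebuilding the class instance from the plain fields restores the check.
function restoreType(plain) {
  if (plain.typeId === 3 && plain.precision === 2) { return new Float64(); }
  return plain; // other types are left untouched in this sketch
}

console.log(isFloatingPoint(restoreType(fromReader))); // true
```

Under that assumption, the two log lines would describe the same logical type, and only the second would pass the check that `sum` performs.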

@maxime-petitjean thanks for the bug report! That sounds like we're not fixing the types coming from C++ after loading the parquet file. I'll make a PR real quick with a fix.