Writing a Date column drops associated time information
mbostock opened this issue
I think this is the same bug as duckdb/duckdb-wasm#1231…
Consider this test case:
```js
import * as Arrow from "apache-arrow";
import * as Parquet from "parquet-wasm/node/arrow1.js";

const table = Arrow.tableFromArrays({test: [new Date("2012-01-01T12:34:56.789Z")]});
process.stdout.write(Parquet.writeParquet(Parquet.Table.fromIPCStream(Arrow.tableToIPC(table, "stream"))));
```
The resulting `test` column erroneously contains the value `2012-01-01` instead of `2012-01-01T12:34:56.789Z`, dropping the associated time information.
This is actually an upstream Arrow JS bug. Here's a repro case independent of Parquet:
```js
const arrow = require('apache-arrow');
const {writeFileSync} = require('fs');

const table = arrow.tableFromArrays({
  test: [new Date("2012-01-01T12:34:56.789Z")],
});

const buffer = arrow.tableToIPC(table, 'file');
writeFileSync('table.arrow', buffer);
```
and then in Python:
```python
import pyarrow.feather as feather

table = feather.read_table('table.arrow')
table.schema
# test: date64[ms] not null
table.to_pandas()
#         test
# 0 2012-01-01
```
Also, if you look at the field info in JS before exporting to Python, you'll see the column is typed as `DateMillisecond`, which doesn't preserve any time-of-day information:
```js
> table.schema.fields[0]
Field {
  name: 'test',
  type: DateMillisecond [Date] { unit: 1 },
  nullable: false,
  metadata: Map(0) {}
}
```
Closing, as I don't think this is related to parquet-wasm, but happy to discuss further.