apache / iceberg-python

Apache PyIceberg

Home Page: https://py.iceberg.apache.org/

Upcasting and Downcasting inconsistencies with PyArrow Schema

sungwy opened this issue · comments

Apache Iceberg version

0.6.0 (latest release)

Please describe the bug 🐞

schema_to_pyarrow converts BinaryType to the pa.large_binary() type. This creates an inconsistency in the Arrow table schema produced by a data scan between:

  1. when schema_to_pyarrow is used because there is no data in the table (pa.large_binary()), and
  2. when the physical_schema of the file fragment is used to read the table (pa.binary()).

Related PR: #409
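
To make case 1 concrete, here is a minimal sketch (not taken from the issue) assuming the PyIceberg 0.6.0 API, where pyiceberg.io.pyarrow.schema_to_pyarrow is the conversion in question; the column name "blob" is made up:

import pyarrow as pa

from pyiceberg.io.pyarrow import schema_to_pyarrow
from pyiceberg.schema import Schema
from pyiceberg.types import BinaryType, NestedField

# An Iceberg schema with a single optional binary column.
iceberg_schema = Schema(
    NestedField(field_id=1, name="blob", field_type=BinaryType(), required=False)
)

# schema_to_pyarrow is what an empty scan falls back to for its result schema;
# per this report it maps BinaryType to large_binary rather than binary.
arrow_schema = schema_to_pyarrow(iceberg_schema)
print(arrow_schema.field("blob").type)                 # large_binary
print(arrow_schema.field("blob").type == pa.binary())  # False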

The implication of this bug is that a pa.Table read from the same Iceberg table may have a different schema depending on whether or not there is data within the defined table scan.

More importantly, it also means that if one file in a table scan is empty and another has data, the schema mismatch between the two Arrow tables will result in an error when we attempt pa.concat_tables(tables).
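
For illustration, here is a standalone PyArrow sketch (not PyIceberg code; the column name is made up) of that failure mode when one side of the scan ends up with binary and the other with large_binary:

import pyarrow as pa

# Table read from a file with data: the physical Parquet schema gives binary.
with_data = pa.table({"blob": pa.array([b"abc"], type=pa.binary())})

# Table produced for an empty scan via the converted schema: large_binary.
empty = pa.table({"blob": pa.array([], type=pa.large_binary())})

# The field types differ, so concatenation fails with pyarrow.lib.ArrowInvalid
# ("Schema at index 1 was different: ...").
pa.concat_tables([with_data, empty])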

While working on this issue, I noticed that Parquet has a restriction that individual values larger than 2GB cannot be stored; at the very least, Arrow has a check that prevents this:

ArrowInvalid: Parquet cannot store strings with size 2GB or more

https://github.com/apache/arrow/blob/main/cpp/src/parquet/encoding.cc#L169
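
That check can be reproduced with plain PyArrow; the sketch below is illustrative only (it allocates about 2GB of memory) and is not taken from the issue:

import pyarrow as pa
import pyarrow.parquet as pq

# A single 2GB value is representable in memory as large_binary...
big_value = b"x" * (2 * 1024**3)
table = pa.table({"blob": pa.array([big_value], type=pa.large_binary())})

# ...but writing it to Parquet trips the encoder check linked above:
# ArrowInvalid: Parquet cannot store strings with size 2GB or more
pq.write_table(table, "big.parquet")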

If Parquet cannot hold data that is larger than 2GB, is there a benefit to supporting the Arrow large_ types in PyIceberg?

I'm seeing the same restriction when using Polars' write_parquet, so it looks like a Parquet limitation rather than an Arrow-specific check:

ComputeError: parquet: File out of specification: A page can only contain i32::MAX uncompressed bytes. This one contains 3000000010

https://github.com/pola-rs/polars/blob/efac81c65d623081f249f54e4e25ea05d6454ec1/crates/polars-parquet/src/parquet/write/page.rs#L24
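
The equivalent Polars sketch (again illustrative and memory-hungry, not taken from the issue) hits the same wall at write time:

import polars as pl

# Roughly 3GB in a single cell, in line with the 3000000010 bytes in the error above.
df = pl.DataFrame({"blob": ["x" * 3_000_000_000]})

# Raises ComputeError: parquet: File out of specification: A page can only
# contain i32::MAX uncompressed bytes.
df.write_parquet("big.parquet")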

This is interesting, why would Polars go with large_binary by default? See #409

For Arrow, the binary cannot store more than 2GB in a single buffer, not a single field. See Arrow docs for more context.

> This is interesting, why would Polars go with large_binary by default?

See: pola-rs/polars#7422

Still, I think the inconsistency is not good.

> For Arrow, the binary cannot store more than 2GB in a single buffer, not a single field. See Arrow docs for more context.

My apologies - I think I might not have done a good job explaining the problem, @Fokko. I think the issue is with Parquet, not Arrow or Polars. I'm using these two libraries as examples to show that a record exceeding 2GB, even if it can be represented in memory as a large Arrow data type, cannot be written into a Parquet file. This issue raised on Polars seems to reiterate that point as well: pola-rs/polars#10774

This is just based on my research this week, so it is definitely possible that I'm missing something here, but so far I haven't been able to write a genuinely large record (>2GB) into Parquet.

I agree that you cannot write a single field of 2GB+ to a Parquet file. In that case, Parquet is probably not the best way of storing such a big blob.
The difference is in how the offsets are stored: with large_binary the offsets are 64-bit longs, and with binary they are 32-bit integers. When we create an array in Arrow, e.g. [foo, bar, arrow], it is stored as:

data = 'foobararrow'
offsets = [0, 3, 6, 11]

If the offsets are only 32 bits, then the data needs to be chunked into smaller buffers, which negatively impacts performance.
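
A small PyArrow sketch of that layout (an editorial illustration, not from the thread):

import pyarrow as pa

values = [b"foo", b"bar", b"arrow"]
small = pa.array(values, type=pa.binary())        # 32-bit (int32) offsets
large = pa.array(values, type=pa.large_binary())  # 64-bit (int64) offsets

# buffers() returns [validity, offsets, data]: the data buffer holds the
# concatenated bytes and the offsets buffer marks where each value starts.
print(small.buffers()[2].to_pybytes())  # b'foobararrow'
print(small.buffers()[1].size)          # 16: four 4-byte offsets (0, 3, 6, 11)
print(large.buffers()[1].size)          # 32: four 8-byte offsets

Because the last offset of a binary array must fit in an int32, a single chunk tops out at 2GB of total data; beyond that, the data has to be split into multiple chunks.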

Gotcha - thank you for the explanation, @Fokko. I hadn't considered how using large_binary could actually improve performance because the data is grouped together into large buffers.

I think I might have been conflating the issue of the 2GB limit in Parquet with the question of whether a large type is necessary.

Put simply, I was asking: if we can't write data that large into Parquet, why bother using a type that is specifically designed to support larger data (which can't be written into the file anyway)?

But now I see that the motivation for supporting large types is different from the motivation for writing larger data, and we might also introduce a different file format as the storage medium that could support writing larger values in the future.