apache / iceberg-python

Apache PyIceberg

Home Page: https://py.iceberg.apache.org/


Create iceberg table from existing parquet files with slightly different schemas (schema merge is possible).

sergun opened this issue

Question

Hi!

What is the right way to create an Iceberg table from existing parquet files with slightly different schemas, such that a merge of their schemas is possible?
I would like to create the Iceberg table with the iceberg-python library (without Spark).

There's a Table.add_files API which supports directly adding parquet files. But it seems like the parquet files must have the same schema.
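For reference, a sketch of that API (assuming table is an already-loaded PyIceberg Table object and the files match its schema):

    # registers the existing parquet files with the table without rewriting them
    table.add_files(file_paths=["data/1.parquet", "data/2.parquet"])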

You can also read all the parquet files into memory with PyArrow, merge the different schemas (using pyarrow.unify_schemas) and then write Arrow as Iceberg table.
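A rough sketch of that route for flat (top-level) schemas; the catalog name, table identifier, and file paths are placeholders, and promote_options needs pyarrow >= 14 (nested struct differences will still fail, see below):

    import pyarrow as pa
    import pyarrow.parquet as pq
    from pyiceberg.catalog import load_catalog

    t1 = pq.read_table("data/1.parquet")
    t2 = pq.read_table("data/2.parquet")
    # "default" null-fills top-level columns missing from one of the inputs
    merged = pa.concat_tables([t1, t2], promote_options="default")

    catalog = load_catalog("default")  # placeholder catalog name
    table = catalog.create_table("db.my_table", schema=merged.schema)
    table.append(merged)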

Maybe DuckDB works well here since read_parquet can union schemas, see union_by_name and https://duckdb.org/docs/data/parquet/tips.html#use-union_by_name-when-loading-files-with-different-schemas
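For example, a quick sketch (file paths are placeholders):

    import duckdb

    # union_by_name=true matches columns across files by name instead of position
    rel = duckdb.sql(
        "SELECT * FROM read_parquet(['data/1.parquet', 'data/2.parquet'], union_by_name=true)"
    )
    merged = rel.arrow()  # hand the result back to PyArrow as a pa.Table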

Thank you @kevinjqliu !
Do you know how to read parquet files with a unified schema in PyArrow?

I successfully merged schemas:

    import pyarrow as pa
    import pyarrow.parquet as pq

    t1 = pq.read_table("data/1.parquet")
    t2 = pq.read_table("data/2.parquet")
    # unify_schemas merges the schemas, including nested fields and type promotions
    schema = pa.unify_schemas([t1.schema, t2.schema])
    print(schema)

but the next lines give an error:

    t1 = pq.read_table("data/1.parquet", schema=schema)
    t2 = pq.read_table("data/2.parquet", schema=schema)
    # union of t1 and t2 and write to Iceberg should follow

which fails with:

    pyarrow.lib.ArrowTypeError: struct fields don't match or are in the wrong order:
    Input fields: struct<z: int64, x: int64>
    output fields: struct<z: int64, x: int64, y: int64, w: struct<w1: int64>>

Regarding DuckDB: unfortunately, union by name does not work for nested parquet files with schema changes at any level of the nested structures. BTW, it works for JSON in DuckDB. Here is my question in the DuckDB discussions:
duckdb/duckdb#11633

Looks like your schema is nested, which makes things more complicated. It's pretty difficult to deal with merging nested schemas. I'm not sure if there's an out-of-the-box solution for this.
One possible solution could be to use pandas (or another engine) to merge the data once it is read into memory. Then use the union schema as the Iceberg table's schema.
Another solution can be to read into memory, flatten the schemas, and then write to Iceberg.
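A sketch of the flattening route; note that pa.Table.flatten() expands only one struct level per call, so it has to be applied repeatedly:

    import pyarrow as pa

    def fully_flatten(table: pa.Table) -> pa.Table:
        # flatten() turns a struct column "w" into top-level "w.w1", "w.w2", ...
        # one nesting level at a time, so loop until no struct columns remain
        while any(pa.types.is_struct(f.type) for f in table.schema):
            table = table.flatten()
        return table

    # the flattened tables have flat schemas, which unify and concatenate cleanly
    merged = pa.concat_tables([fully_flatten(t1), fully_flatten(t2)], promote_options="default")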

That said, most of the difficulties here are not related to Iceberg. One thing I wonder is if PyIceberg can handle schema evolution of nested structs.

@kevinjqliu
It is strange to me that PyArrow has pa.unify_schemas(...), which is able (I double-checked) to unify nested schemas (even with type promotions),
but there is no "dual" functionality to cast, concatenate, or read the corresponding data into a pa.Table: none of e.g. pa.Table.cast(...), pa.concat_tables(...), or pa.parquet.read_table(...) works.
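That "dual" operation can be hand-rolled, though. An untested sketch, assuming no duplicated field names:

    import pyarrow as pa

    def align_array(arr: pa.Array, target: pa.DataType) -> pa.Array:
        # recursively cast arr to target, null-padding struct fields
        # that are missing from the source type
        if pa.types.is_struct(target) and pa.types.is_struct(arr.type):
            children = []
            for field in target:
                if arr.type.get_field_index(field.name) >= 0:
                    children.append(align_array(arr.field(field.name), field.type))
                else:
                    children.append(pa.nulls(len(arr), field.type))
            return pa.StructArray.from_arrays(
                children, fields=list(target), mask=arr.is_null()
            )
        return arr.cast(target)  # plain columns: ordinary casting/promotion

    def align_table(table: pa.Table, schema: pa.Schema) -> pa.Table:
        # reorder, cast, and null-pad a table so it matches the unified schema
        columns = []
        for field in schema:
            if field.name in table.column_names:
                arr = table.column(field.name).combine_chunks()
                columns.append(align_array(arr, field.type))
            else:
                columns.append(pa.nulls(len(table), field.type))
        return pa.Table.from_arrays(columns, schema=schema)

    schema = pa.unify_schemas([t1.schema, t2.schema])
    merged = pa.concat_tables([align_table(t1, schema), align_table(t2, schema)])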

One thing I wonder is if PyIceberg can handle schema evolution of nested structs.

Looks like it can.
From https://py.iceberg.apache.org/api/#add-column:

    from pyiceberg.types import IntegerType, StringType

    with table.update_schema() as update:
        update.add_column("retries", IntegerType(), "Number of retries to place the bid")
        # In a struct
        update.add_column("details.confirmed_by", StringType(), "Name of the exchange")

BTW: I found some explanation of why merging Arrow tables with different schemas is not possible:
apache/arrow#35424
The reason looks weird, but yes, as I remember, e.g. Spark dataframes may have columns with duplicated names.

Probably it is possible to implement table merge in PyArrow after checking that there are no duplicated column names in each struct and at the root level.

The reason looks weird, but yes, as I remember, e.g. Spark dataframes may have columns with duplicated names.

Wow, I learned something today. I hope nobody uses that in real life.

That said, most of the difficulties here are not related to Iceberg. One thing I wonder is if PyIceberg can handle schema evolution of nested structs.

Nested structs, and structs inside maps and lists, are all supported :)

@sergun In PyIceberg we also have a union_by_name that will add the missing columns to the schema. Would that work?
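A minimal sketch of that path (catalog and table names are made up; unified_schema stands for the schema produced above, and recent PyIceberg releases accept a pyarrow schema here):

    from pyiceberg.catalog import load_catalog

    catalog = load_catalog("default")          # placeholder catalog name
    table = catalog.load_table("db.my_table")  # placeholder table name

    # adds any fields from unified_schema that the table does not have yet;
    # existing columns are left untouched
    with table.update_schema() as update:
        update.union_by_name(unified_schema)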

Is there any Java solution that imports parquet files? @Fokko