duckdb / duckdb_iceberg

Can't use the extension if my data catalog did not create a version-hint.text file

jacopotagliabue opened this issue · comments

My s3 bucket with iceberg (picture below) cannot be queried with

iceberg_scan('s3://bucket/iceberg', ALLOW_MOVED_PATHS=true)

nor

iceberg_scan('s3://bucket/iceberg/*', ALLOW_MOVED_PATHS=true)

In particular, the system is trying to find a very specific file (so the * pattern is ignored):

duckdb.duckdb.Error: Invalid Error: HTTP Error: Unable to connect to URL https://bucket.s3.amazonaws.com/iceberg/metadata/version-hint.text

Unfortunately that file does not exist in my iceberg/ folder, nor in any of the iceberg/sub/metadata folders. Compared to the example data zip in the DuckDB docs about Iceberg, it is clear "my iceberg tables" are missing that file, which the current implementation requires.

That said, version-hint.text seems like something we do not strictly need: that info could default to the latest version, or become an additional parameter, instead of the scan failing when the file is not found?

Original discussion with @Alex-Monahan in dbt Slack is here: note that I originally got pointed to this as a possible cause, so perhaps reading a table that is formally Iceberg is not really independent of the data catalog it belongs to?

[image: s3_structure]

Sorry, to be a bit clearer: even if we fix the version-hint problem, the fact that the system uses https://bucket.s3.amazonaws.com/iceberg/metadata/ as a base path does not seem aligned with the state of my data lake (see the picture above for the current layout, written by Spark + Nessie).

Happy to help debug this if there's something we can quickly try out.

I ran into a similar issue using AWS with Glue as the catalog for Iceberg.

The metadata files stored in S3 are of the following pattern:

00000-0b4430d2-fbee-4b0d-90c9-725f013d6f82.metadata.json
00001-6e3b4909-7e6b-486f-bf81-b1331eba3ac8.metadata.json

I suspect Glue holds the pointer to the current metadata.

Currently no Iceberg catalog implementations are available in the iceberg extension. Without a version hint you will need to pass the direct path to the correct metadata file manually; see:
#18

@samansmink thanks, but the work-around does not seem to work though: I get s3://bucket/iceberg/taxi_fhvhv_bbb/metadata/aaa.metadata.json from my data catalog manually and pass it to my query:

SELECT PULocationID, DOLocationID, trip_miles, trip_time FROM iceberg_scan('s3://bucket/iceberg/taxi/metadata/aaa.metadata.json') WHERE pickup_datetime >= '2022-01-01T00:00:00-05:00' AND pickup_datetime < '2022-01-02T00:00:00-05:00'

I still get a 404 for the version file:

duckdb.duckdb.Error: Invalid Error: HTTP Error: Unable to connect to URL "....metadata.json/metadata/version-hint.text": 404 (Not Found)

As if it were trying to append metadata/version-hint.text to my JSON path. Am I doing something dumb?

Small update - I needed to upgrade to 0.9.2 to scan a JSON file (posting here in case others stumble). The new error I get is No such file or directory on a path the scan found:

"s3a://bucketiceberg/taxi_fhvhv/metadata/snap-aaaa.avro"

If I try with allow_moved_paths (the only thing that came to mind), I then get:

duckdb.duckdb.InvalidInputException: Invalid Input Error: Enabling allow_moved_paths is not enabled for directly scanning metadata files.

Any way around all of this?

Small update 2 - I think I know why the avro path resolution does not work, just by looking closely at:

duckdb.duckdb.IOException: IO Error: Cannot open file "s3a://.......avro": No such file or directory

A Nessie file system (written with Spark) uses s3a:// as the prefix, not s3:// as DuckDB presumably does. In fact, if I manually change s3a://.......avro into s3://.......avro, I can find the file in my data lake!

A quick way to patch this would be to replace the Nessie prefix with the standard s3:// one for object storage paths (or allow a flag that toggles that behavior, etc.). A longer-term fix would be to have Nessie return non-Nessie-specific, more general paths.
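In the meantime, a minimal client-side sketch of that idea (the helper name is mine, not part of the extension): rewrite the Hadoop-style prefix before handing a path to DuckDB.

```python
def to_s3_url(path: str) -> str:
    """Rewrite Hadoop-style s3a:// (and s3n://) URLs to the plain s3://
    scheme that DuckDB understands; other paths pass through unchanged."""
    for prefix in ("s3a://", "s3n://"):
        if path.startswith(prefix):
            return "s3://" + path[len(prefix):]
    return path
```

Note this only helps for the path you pass in yourself; s3a:// paths embedded inside manifests would still need a fix on the extension side.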

What do you think could be a short-term work-around @samansmink ?

@jacopotagliabue s3a urls are indeed not supported currently.

If s3a:// URLs are interoperable with s3:// URLs, which, as far as I can tell from a quick look, seems to be the case, we could consider adding support for them to DuckDB, which would solve this issue.

That would be great and the easiest fix - I'll reach out to the Nessie folks anyway to let them know about this, but if you could make the change in DuckDB, that would (presumably?) solve the current issue.

For Java iceberg users out there, I found a solution to retrieve the latest metadata without having to query the catalog directly.

Once you load the table from the catalog, you can issue the following method that will return the latest metadata location.
You can use that location with iceberg_scan function.

```java
import org.apache.iceberg.BaseTable;
import org.apache.iceberg.Table;

public static String currentMetadataLocation(Table table) {
    // The catalog-loaded Table is a BaseTable underneath; its operations
    // track the location of the current metadata.json file.
    return ((BaseTable) table).operations().current().metadataFileLocation();
}
```

I tested it on both Glue and Nessie.

It should make things somewhat easier, but I still hope there will be a cleaner solution in the extension later on.

hi @harel-e, just making sure I understand.

If you pass the JSON path you get back from a Nessie endpoint using the standard API for the table, and then issue something like:

SELECT PULocationID, DOLocationID, trip_miles, trip_time FROM iceberg_scan('s3://bucket/iceberg/taxi/metadata/aaa.metadata.json') WHERE pickup_datetime >= '2022-01-01T00:00:00-05:00' AND pickup_datetime < '2022-01-02T00:00:00-05:00'

you are able to get duckdb iceberg working?

Yes, DuckDB 0.9.2 with Iceberg is working for me on the following setups:

a. AWS S3 + AWS Glue
b. MinIO + Nessie

I was able to get this working by looking up the current metadata URL using the Glue API/CLI, then using that URL to query Iceberg.

select count(*) from iceberg_scan('s3://cfanalytics-abc123/cloudfront_logs_analytics/metadata/abf3a652-02cb-4a8e-8b6c-2089a2acfe6c.metadata.json');

Works for me at the moment.
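For anyone scripting that lookup, a sketch with boto3 (the database/table names in the usage are placeholders): Glue exposes the current Iceberg metadata file as the metadata_location table parameter.

```python
def glue_metadata_location(glue, database: str, table: str) -> str:
    """Fetch the current Iceberg metadata.json location tracked by Glue.

    `glue` is a boto3 Glue client; Iceberg tables registered in Glue carry
    the current metadata file path in the 'metadata_location' parameter.
    """
    resp = glue.get_table(DatabaseName=database, Name=table)
    return resp["Table"]["Parameters"]["metadata_location"]

# Usage (placeholders):
#   import boto3, duckdb
#   glue = boto3.client("glue", region_name="us-east-1")
#   path = glue_metadata_location(glue, "analytics", "cloudfront_logs")
#   duckdb.sql(f"SELECT count(*) FROM iceberg_scan('{path}')")
```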

This appears to also be an issue with Iceberg tables created using the Iceberg quick start at https://iceberg.apache.org/spark-quickstart/#docker-compose (using DuckDB 0.10.0).

There are a few other oddities and observations:

  • If you manually create a version-hint.text file pointing to one of the existing metadata.json files, the iceberg scanner ends up looking for a file prefixed with a "v": e.g., 00000-d30b41d6-48c0-42db-b32e-29083b874a80 in version-hint.text makes it look for v00000-d30b41d6-48c0-42db-b32e-29083b874a80.metadata.json, but only 00000-d30b41d6-48c0-42db-b32e-29083b874a80.metadata.json exists in the directory.
  • If you also copy the .metadata.json to the expected v....metadata.json path, everything works as expected.
  • If you accidentally create the .metadata.json file as a binary MinIO xl.meta file (as I did), you can crash DuckDB with a segfault --- which may be more of a security risk than anything else.
  • If version-hint.text contains characters that are invalid in a path (e.g., a trailing newline), they are included verbatim in the requested ...metadata.json path.

The prepending of the "v" when looking for the .metadata.json seems the most burdensome part: it is not terribly difficult to maintain a version-hint.text file, but it would be difficult to rename versions.
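Given those quirks, the manual workaround can be sketched as follows, assuming (as observed above) that the scanner reads version-hint.text verbatim and prepends "v" to the hint; the function and file names are illustrative.

```python
import shutil
from pathlib import Path

def write_version_hint(metadata_dir: str, version: str) -> None:
    """Write version-hint.text with no trailing newline (the scanner uses
    the contents verbatim) and mirror the real metadata file to the
    v-prefixed name the scanner will request."""
    meta = Path(metadata_dir)
    (meta / "version-hint.text").write_text(version)  # no newline
    src = meta / f"{version}.metadata.json"
    dst = meta / f"v{version}.metadata.json"  # name the scanner asks for
    if src.exists() and not dst.exists():
        shutil.copyfile(src, dst)
```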

Confirming that

SELECT PULocationID, DOLocationID, trip_miles, trip_time FROM iceberg_scan('s3://bucket/iceberg/taxi/metadata/aaa.metadata.json') WHERE pickup_datetime >= '2022-01-01T00:00:00-05:00' AND pickup_datetime < '2022-01-02T00:00:00-05:00'

still does not work with a Dremio-created table on a Nessie catalog.

Error is:
duckdb.duckdb.Error: Invalid Error: HTTP Error: Unable to connect to URL "https://bauplan-openlake-db87a23.s3.amazonaws.com/iceberg/taxi_fhvhv_partitioned/metadata/00000-136374fe-87d3-4cc6-8202-0a11f6af0b56.metadata.json/metadata/version-hint.text": 404 (Not Found)

Any chance we could make the version hint optional, given that it is not part of the official Iceberg spec and many implementations seem to ignore it?

Can confirm that this still does not work for Iceberg tables created with catalog.create_table().

query: f"SELECT * FROM iceberg_scan('{lakehouse_path}') WHERE id = {mock_team_id}"

error: duckdb.duckdb.HTTPException: HTTP Error: Unable to connect to URL "https://local-lakehousesta-locallakehousebuck-mnrnr57ascjc.s3.amazonaws.com/metadata/version-hint.text": 404 (Not Found)

Pyiceberg workaround: load the Iceberg table using a pyiceberg catalog (I'm using Glue), then use the metadata_location field for the scan.

```python
from pyiceberg.catalog import load_catalog

lakehouse_catalog = load_catalog("glue", **{"type": "glue", "s3.region": "us-east-1"})

team_table = lakehouse_catalog.load_table("default.Team")

changed_team_record = conn.sql(
    f"SELECT * FROM iceberg_scan('{team_table.metadata_location}') WHERE id = {mock_team_id}"
).to_df()
```