duckdb / duckdb_iceberg

Can't use the extension if my data catalog did not create a version-hint.text file

jacopotagliabue opened this issue · comments

My s3 bucket with iceberg (picture below) cannot be queried with

iceberg_scan('s3://bucket/iceberg', ALLOW_MOVED_PATHS=true)

nor

iceberg_scan('s3://bucket/iceberg/*', ALLOW_MOVED_PATHS=true)

In particular, the system is trying to find a very specific file (so the * pattern is ignored):

duckdb.duckdb.Error: Invalid Error: HTTP Error: Unable to connect to URL https://bucket.s3.amazonaws.com/iceberg/metadata/version-hint.text

Unfortunately that file does not exist in my iceberg/ folder, nor in any of the iceberg/sub/metadata folders. Compared to the example data zip in the DuckDB docs about Iceberg, it is clear "my iceberg tables" are missing that file, which the current implementation requires.

That said, version-hint.text seems like something we do not strictly need: that info could default to the latest version, or become an additional parameter, instead of the scan failing when the file is not found?

Original discussion with @Alex-Monahan in dbt Slack is here: note that I originally got pointed to this as a possible cause, so perhaps reading a table that is formally Iceberg is not really independent of the data catalog it belongs to?

[image: s3_structure]

Sorry, to be a bit clearer: even if we fix the version-hint problem, the fact that the system uses https://bucket.s3.amazonaws.com/iceberg/metadata/ as a base path does not seem aligned with the state of my data lake (see the picture above for the current layout, written by Spark + Nessie).

Happy to help debug this if there's something we can quickly try out.

I ran into a similar issue using AWS with Glue as the catalog for Iceberg.

The metadata files stored in S3 are of the following pattern:

00000-0b4430d2-fbee-4b0d-90c9-725f013d6f82.metadata.json
00001-6e3b4909-7e6b-486f-bf81-b1331eba3ac8.metadata.json

I suspect Glue holds the pointer to the current metadata.

Currently no Iceberg catalog implementations are available in the iceberg extension. Without a version hint you will need to pass the direct path to the correct metadata file manually; see:
#18

@samansmink thanks, but the work-around does not seem to work though: I get s3://bucket/iceberg/taxi_fhvhv_bbb/metadata/aaa.metadata.json from my data catalog manually and pass it to my query:

SELECT PULocationID, DOLocationID, trip_miles, trip_time FROM iceberg_scan('s3://bucket/iceberg/taxi/metadata/aaa.metadata.json') WHERE pickup_datetime >= '2022-01-01T00:00:00-05:00' AND pickup_datetime < '2022-01-02T00:00:00-05:00'

I still get a 404 for the version file:

duckdb.duckdb.Error: Invalid Error: HTTP Error: Unable to connect to URL "....metadata.json/metadata/version-hint.text": 404 (Not Found)

As if it were trying to append metadata/version-hint.text to my JSON path. Am I doing something dumb?

Small update - I needed to upgrade to 0.9.2 to scan a JSON file (posting here in case others stumble). The new error I get is No such file or directory on a path the scan found:

"s3a://bucketiceberg/taxi_fhvhv/metadata/snap-aaaa.avro"

If I try with allow_moved_paths (the only thing that came to mind), I then get:

duckdb.duckdb.InvalidInputException: Invalid Input Error: Enabling allow_moved_paths is not enabled for directly scanning metadata files.

Any way around all of this?

Small update 2 - I think I know why the avro path resolution does not work, just by looking closely at:

duckdb.duckdb.IOException: IO Error: Cannot open file "s3a://.......avro": No such file or directory

A Nessie file system (written with Spark) uses s3a:// as the prefix, not s3:// as DuckDB presumably does. In fact, if I manually change s3a://.......avro into s3://.......avro, I can find the file in my data lake!

A quick way to patch this would be to replace the Nessie prefix with the standard s3:// one for object storage paths (or allow a flag that toggles that behavior, etc.). A longer-term fix would be to have Nessie return non-Nessie-specific, more general paths.
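In the meantime, a minimal client-side sketch of that idea (the helper name is mine, not part of the extension): rewrite the Hadoop-style prefix before handing a path to DuckDB.

```python
def to_s3_url(path: str) -> str:
    """Rewrite Hadoop-style s3a:// (and s3n://) URLs to the plain s3://
    scheme that DuckDB understands; other paths pass through unchanged."""
    for prefix in ("s3a://", "s3n://"):
        if path.startswith(prefix):
            return "s3://" + path[len(prefix):]
    return path
```

Note this only helps for the path you pass in yourself; s3a:// paths embedded inside manifests would still need a fix on the extension side.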

What do you think could be a short-term work-around @samansmink ?

@jacopotagliabue s3a urls are indeed not supported currently.

If s3a:// URLs are interoperable with s3:// URLs, which, as far as I can tell from a quick look, seems to be the case, we could consider adding support for them to DuckDB, which would solve this issue.

That would be great and the easiest fix - I'll reach out to the Nessie folks anyway to let them know about this, but if you could make the change in DuckDB, that would (presumably?) solve the current issue.

For Java iceberg users out there, I found a solution to retrieve the latest metadata without having to query the catalog directly.

Once you load the table from the catalog, you can issue the following method that will return the latest metadata location.
You can use that location with iceberg_scan function.

```java
import org.apache.iceberg.BaseTable;
import org.apache.iceberg.Table;

public static String currentMetadataLocation(Table table) {
    // The catalog-loaded Table is a BaseTable underneath; its operations
    // track the location of the current metadata.json file.
    return ((BaseTable) table).operations().current().metadataFileLocation();
}
```

I tested it on both Glue and Nessie.

It should make things somewhat easier, but I still hope there will be a cleaner solution in the extension later on.

hi @harel-e, just making sure I understand.

If you pass the JSON path you get back from a Nessie endpoint using the standard API for the table, and then issue something like:

SELECT PULocationID, DOLocationID, trip_miles, trip_time FROM iceberg_scan('s3://bucket/iceberg/taxi/metadata/aaa.metadata.json') WHERE pickup_datetime >= '2022-01-01T00:00:00-05:00' AND pickup_datetime < '2022-01-02T00:00:00-05:00'

you are able to get duckdb iceberg working?

Yes, DuckDB 0.9.2 with Iceberg is working for me on the following setups:

a. AWS S3 + AWS Glue
b. MinIO + Nessie

I was able to get this working by looking up the current metadata URL using the Glue API/CLI, then using that URL to query Iceberg.

select count(*) from iceberg_scan('s3://cfanalytics-abc123/cloudfront_logs_analytics/metadata/abf3a652-02cb-4a8e-8b6c-2089a2acfe6c.metadata.json');

Works for me at the moment.
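For anyone scripting that lookup, a sketch with boto3 (the database/table names in the usage are placeholders): Glue exposes the current Iceberg metadata file as the metadata_location table parameter.

```python
def glue_metadata_location(glue, database: str, table: str) -> str:
    """Fetch the current Iceberg metadata.json location tracked by Glue.

    `glue` is a boto3 Glue client; Iceberg tables registered in Glue carry
    the current metadata file path in the 'metadata_location' parameter.
    """
    resp = glue.get_table(DatabaseName=database, Name=table)
    return resp["Table"]["Parameters"]["metadata_location"]

# Usage (placeholders):
#   import boto3, duckdb
#   glue = boto3.client("glue", region_name="us-east-1")
#   path = glue_metadata_location(glue, "analytics", "cloudfront_logs")
#   duckdb.sql(f"SELECT count(*) FROM iceberg_scan('{path}')")
```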

This appears to also be an issue with Iceberg tables created using the Iceberg quick start at https://iceberg.apache.org/spark-quickstart/#docker-compose (using DuckDB 0.10.0).

There are a few other oddities and observations:

  • If you manually create a version-hint.text file pointing to one of the existing metadata.json files, the iceberg scanner ends up looking for a file prefixed with a "v": e.g., 00000-d30b41d6-48c0-42db-b32e-29083b874a80 in version-hint.text makes it look for v00000-d30b41d6-48c0-42db-b32e-29083b874a80.metadata.json, but only 00000-d30b41d6-48c0-42db-b32e-29083b874a80.metadata.json exists in the directory.
  • If you also copy the .metadata.json to the expected v....metadata.json path, everything works as expected.
  • If you accidentally create the .metadata.json file as a binary MinIO xl.meta file (as I did), you can crash DuckDB with a segfault --- which may be more of a security risk than anything else.
  • If version-hint.text contains characters that are invalid in a path (e.g., a trailing newline), they are included verbatim in the requested ...metadata.json path.

The prepending of the "v" when looking for the .metadata.json seems the most burdensome part: it is not terribly difficult to maintain a version-hint.text file, but it would be difficult to rename versions.
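Given those quirks, the manual workaround can be sketched as follows, assuming (as observed above) that the scanner reads version-hint.text verbatim and prepends "v" to the hint; the function and file names are illustrative.

```python
import shutil
from pathlib import Path

def write_version_hint(metadata_dir: str, version: str) -> None:
    """Write version-hint.text with no trailing newline (the scanner uses
    the contents verbatim) and mirror the real metadata file to the
    v-prefixed name the scanner will request."""
    meta = Path(metadata_dir)
    (meta / "version-hint.text").write_text(version)  # no newline
    src = meta / f"{version}.metadata.json"
    dst = meta / f"v{version}.metadata.json"  # name the scanner asks for
    if src.exists() and not dst.exists():
        shutil.copyfile(src, dst)
```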

Confirming that

SELECT PULocationID, DOLocationID, trip_miles, trip_time FROM iceberg_scan('s3://bucket/iceberg/taxi/metadata/aaa.metadata.json') WHERE pickup_datetime >= '2022-01-01T00:00:00-05:00' AND pickup_datetime < '2022-01-02T00:00:00-05:00'

still does not work with a Dremio-created table on a Nessie catalog.

Error is:
duckdb.duckdb.Error: Invalid Error: HTTP Error: Unable to connect to URL "https://bauplan-openlake-db87a23.s3.amazonaws.com/iceberg/taxi_fhvhv_partitioned/metadata/00000-136374fe-87d3-4cc6-8202-0a11f6af0b56.metadata.json/metadata/version-hint.text": 404 (Not Found)

Any chance we could make the version hint optional, given that it is not part of the official Iceberg spec and many implementations seem to ignore it?

Can confirm that this still does not work for Iceberg tables created with catalog.create_table().

query: f"SELECT * FROM iceberg_scan('{lakehouse_path}') WHERE id = {mock_team_id}"

error: duckdb.duckdb.HTTPException: HTTP Error: Unable to connect to URL "https://local-lakehousesta-locallakehousebuck-mnrnr57ascjc.s3.amazonaws.com/metadata/version-hint.text": 404 (Not Found)

Pyiceberg workaround: load the Iceberg table using a pyiceberg catalog (I'm using Glue), then use the metadata_location field for the scan.

```python
from pyiceberg.catalog import load_catalog

lakehouse_catalog = load_catalog("glue", **{"type": "glue", "s3.region": "us-east-1"})

team_table = lakehouse_catalog.load_table("default.Team")

changed_team_record = conn.sql(
    f"SELECT * FROM iceberg_scan('{team_table.metadata_location}') WHERE id = {mock_team_id}"
).to_df()
```