gupy-io / target-s3-parquet

Is this target only for Athena?

pnadolny13 opened this issue · comments

Hey @ndrluis @lorransr @Marcos314 - this is more of a question than an issue, but I was just checking out this target and, given the requirement to set athena_database and the use of AWS Wrangler, I was wondering whether it is meant to be used specifically with Athena, or whether it is also intended to support writing plain Parquet files to S3. Similar to what https://github.com/transferwise/pipelinewise-target-s3-csv does with CSVs, but in Parquet format instead. No problem either way, but I wanted to clarify the use cases for this target 😄 .

Hello @pnadolny13,

Is this target only for Athena?

No, but it uses Glue as the catalog: this target creates the schema inside the database specified by that property.

For example, here at Gupy we are using Trino with the Glue Catalog for our raw zone.

It's also possible to write to S3 with dataset mode disabled, in which case the expected behaviour is simply to write the Parquet file.

We are open to accepting any changes to our target so that it behaves the way you expect.

https://github.com/gupy-io/target-s3-parquet/blob/main/target_s3_parquet/sinks.py#L87
https://aws-sdk-pandas.readthedocs.io/en/stable/stubs/awswrangler.s3.to_parquet.html
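To make the distinction concrete, here is a minimal sketch of how the two modes map onto `awswrangler.s3.to_parquet` calls; the bucket, prefix, and database names are hypothetical, and the actual call the target makes lives in the sinks.py linked above.

```python
import awswrangler as wr
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})

# Dataset mode enabled: awswrangler also registers/updates the table in the
# Glue catalog (the database comes from the target's athena_database setting).
wr.s3.to_parquet(
    df,
    path="s3://my-bucket/raw/my_table/",  # hypothetical bucket/prefix
    dataset=True,
    database="my_glue_database",          # hypothetical Glue database
    table="my_table",
    mode="append",
)

# Dataset mode disabled: just write a plain Parquet file to S3,
# with no Glue catalog interaction.
wr.s3.to_parquet(
    df,
    path="s3://my-bucket/raw/my_table/part-0000.parquet",
    dataset=False,
)
```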

@ndrluis great thanks for the response, very helpful!

@ndrluis

This looks very promising for populating Minio S3 storage with Iceberg tables using a JDBC catalog. Any chance this could work from your POV?

Hello @mtthsbrr, I'm not sure whether AWS Data Wrangler works with MinIO S3, but regarding Iceberg tables, I know we cannot support them because there is currently no way to write Iceberg tables from Python. As for the JDBC catalog, another abstraction would have to exist before it could be implemented.

That said, we do use Iceberg with a JDBC catalog here at Gupy, but in the more advanced layers of our lake. There is a possibility that in the future we will develop, or contribute to, a project that implements writing this way.