ArroyoSystems / arroyo

Distributed stream processing engine in Rust

Home Page: https://arroyo.dev

Feature request: File sink to Parquet using Parquet partitions

kzk2000 opened this issue · comments

Hi,
I'm excited about the new file sink to Parquet feature.
Would it be possible to also support Parquet file partitioning?

My use case: I subscribe to a websocket that streams crypto market data for many product_ids on the same feed. Ideally, I'd like to sink this into partitioned Parquet files on disk, partitioned by, say, product and 1-minute timestamps, e.g. ["product_id", "HH:MM"].

If that's already doable, happy to follow your guidance on how to do this.
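
For concreteness, this is roughly the partition key I have in mind, sketched in Rust (the helper name and the minute bucketing are just illustrative, not anything in Arroyo):

// Hypothetical helper: derive the ("product_id", "HH:MM") partition key
// from a product id and an event timestamp given as seconds since the
// Unix epoch (UTC).
fn partition_key(product_id: &str, event_ts: i64) -> (String, String) {
    let secs_of_day = event_ts.rem_euclid(86_400);
    let (hh, mm) = (secs_of_day / 3_600, (secs_of_day % 3_600) / 60);
    (product_id.to_string(), format!("{hh:02}:{mm:02}"))
}

fn main() {
    // 2023-09-25 14:57:00 UTC -> ("BTC-USD", "14:57")
    println!("{:?}", partition_key("BTC-USD", 1_695_653_820));
}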

Hi there, thanks for the question. No, we don't currently support partitioning parquet output, but the approach we've taken should allow for it. I take it you'd want a file per (product_id, timestamp) pair? A rough plan for implementing this would look like:

  • Integrate partitioning into the SQL schema. It looks like DataFusion supports a Hive-style PARTITIONED BY syntax that would meet our needs.
  • Update the parquet sink to write a file per partition. Additionally, integrate with the watermark system when partitioning by a time dimension so that you only write finished files (see the sketch after this list).
  • Convert the SQL table definition into the parquet sink operator.
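
To make the second step concrete, here is a rough Rust sketch of how a partitioned sink could buffer rows per partition and flush on watermark advance. All names and types are hypothetical stand-ins rather than Arroyo's actual internals, and buffered Vecs stand in for open Parquet writers:

use std::collections::HashMap;

// Hypothetical record type standing in for Arroyo's internal row representation.
#[derive(Debug)]
struct Record {
    product_id: String,
    minute_ts: i64, // start of the event-time minute bucket, seconds since the epoch
    price: f64,
}

// Sketch of a partitioned sink: rows are grouped by a Hive-style partition
// path, and a time-keyed partition is only flushed once the watermark has
// passed the end of its bucket.
struct PartitionedParquetSink {
    // Buffered rows per partition path; a real sink would hold an open
    // Arrow/Parquet writer per partition instead of a Vec.
    open_partitions: HashMap<String, Vec<Record>>,
}

impl PartitionedParquetSink {
    fn new() -> Self {
        Self { open_partitions: HashMap::new() }
    }

    // Hive-style partition path built from the partition columns,
    // e.g. "product_id=BTC-USD/minute=1695650220".
    fn partition_path(record: &Record) -> String {
        format!("product_id={}/minute={}", record.product_id, record.minute_ts)
    }

    fn write(&mut self, record: Record) {
        let path = Self::partition_path(&record);
        self.open_partitions.entry(path).or_default().push(record);
    }

    // Called whenever the watermark advances: any partition whose minute
    // bucket ended before the watermark is finished and can be written out.
    fn on_watermark(&mut self, watermark: i64) {
        let finished: Vec<String> = self
            .open_partitions
            .iter()
            .filter(|(_, rows)| rows.iter().all(|r| r.minute_ts + 60 <= watermark))
            .map(|(path, _)| path.clone())
            .collect();

        for path in finished {
            let rows = self.open_partitions.remove(&path).unwrap();
            // A real implementation would close the Parquet writer and move
            // the finished file under the partition's directory here.
            println!("flushed {} rows for partition {}: {:?}", rows.len(), path, rows);
        }
    }
}

fn main() {
    let mut sink = PartitionedParquetSink::new();
    sink.write(Record { product_id: "BTC-USD".into(), minute_ts: 1_695_650_220, price: 26_400.0 });
    sink.write(Record { product_id: "ETH-USD".into(), minute_ts: 1_695_650_220, price: 1_590.0 });
    // The watermark has passed the end of that minute, so both partitions flush.
    sink.on_watermark(1_695_650_300);
}

The key property is that a partition keyed on a time column is only flushed once the watermark has passed the end of its bucket, so only finished files ever appear in the output.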

Also happy to discuss more details in the discord: https://discord.gg/cjCr5rVmyR

Adding comments from Discord: https://discord.com/channels/1092704334808092754/1153049232647934072/1155941228102303764

@jacksonrnewhouse — 09/25/2023 2:57 PM
@miek, wanted to follow up on partitioned sinks. I think this is worth implementing, as it is a pretty common pattern. After looking through the different options, I think the simplest way would be to add an option to the filesystem sink where you specify the partition-by columns, e.g.

CREATE TABLE bids (
  auction bigint,
  bidder bigint,
  price bigint,
  hour timestamp
) WITH (
  connector = 'filesystem',
  path = 'https://s3.us-west-2.amazonaws.com/demo/s3-uri',
  format = 'parquet',
  parquet_compression = 'zstd',
  parquet_version = '2.6',  -- miek's suggestion to add 'version' here, too
  partition_by = 'hour'
);

This is being done "outside" of SQL, but given that DataFusion doesn't currently support partitioned inserts, this seems like the most sensible approach.

Ha, this is very close to what I've implemented on the partitioning branch. Should have it in the next version this coming week.

The initial version of this was released in 0.7! In 0.8 it will be improved with a more performant implementation and integration with Delta Lake.