ArroyoSystems / arroyo

Distributed stream processing engine in Rust

Home Page: https://arroyo.dev

Feature request: File sink to Parquet using Parquet partitions

kzk2000 opened this issue · comments

Hi,
I'm excited about the new file sink to Parquet feature.
Would it be possible to also support Parquet file partitioning?

My use case: I subscribe to a websocket that streams crypto market data for many product_ids on the same feed. Ideally, I'd like to sink this into partitioned Parquet files on disk, partitioned by, say, product and 1-minute timestamps, e.g. ["product_id", "HH:MM"].

If that's already doable, happy to follow your guidance on how to do this.
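
For concreteness, this is roughly the partition key I have in mind, sketched in Rust (the helper name and the minute bucketing are just illustrative, not anything in Arroyo):

// Hypothetical helper: derive the ("product_id", "HH:MM") partition key
// from a product id and an event timestamp given as seconds since the
// Unix epoch (UTC).
fn partition_key(product_id: &str, event_ts: i64) -> (String, String) {
    let secs_of_day = event_ts.rem_euclid(86_400);
    let (hh, mm) = (secs_of_day / 3_600, (secs_of_day % 3_600) / 60);
    (product_id.to_string(), format!("{hh:02}:{mm:02}"))
}

fn main() {
    // 2023-09-25 14:57:00 UTC -> ("BTC-USD", "14:57")
    println!("{:?}", partition_key("BTC-USD", 1_695_653_820));
}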

Hi there, thanks for the question. No, we don't currently support partitioning parquet output, but the approach we've taken should allow for it. I take it you'd want a file per (product_id, timestamp) pair? A rough plan for implementing this would look like:

  • Integrate partitioning into the SQL schema. It looks like DataFusion supports a Hive-style PARTITIONED BY syntax that would meet our needs.
  • Update the parquet sink to write a file per partition. Additionally, integrate with the watermark system when partitioning by a time dimension so that you only write finished files (see the sketch after this list).
  • Convert the SQL table definition into the parquet sink operator.
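
To make the second step concrete, here is a rough Rust sketch of how a partitioned sink could buffer rows per partition and flush on watermark advance. All names and types are hypothetical stand-ins rather than Arroyo's actual internals, and buffered Vecs stand in for open Parquet writers:

use std::collections::HashMap;

// Hypothetical record type standing in for Arroyo's internal row representation.
#[derive(Debug)]
struct Record {
    product_id: String,
    minute_ts: i64, // start of the event-time minute bucket, seconds since the epoch
    price: f64,
}

// Sketch of a partitioned sink: rows are grouped by a Hive-style partition
// path, and a time-keyed partition is only flushed once the watermark has
// passed the end of its bucket.
struct PartitionedParquetSink {
    // Buffered rows per partition path; a real sink would hold an open
    // Arrow/Parquet writer per partition instead of a Vec.
    open_partitions: HashMap<String, Vec<Record>>,
}

impl PartitionedParquetSink {
    fn new() -> Self {
        Self { open_partitions: HashMap::new() }
    }

    // Hive-style partition path built from the partition columns,
    // e.g. "product_id=BTC-USD/minute=1695650220".
    fn partition_path(record: &Record) -> String {
        format!("product_id={}/minute={}", record.product_id, record.minute_ts)
    }

    fn write(&mut self, record: Record) {
        let path = Self::partition_path(&record);
        self.open_partitions.entry(path).or_default().push(record);
    }

    // Called whenever the watermark advances: any partition whose minute
    // bucket ended before the watermark is finished and can be written out.
    fn on_watermark(&mut self, watermark: i64) {
        let finished: Vec<String> = self
            .open_partitions
            .iter()
            .filter(|(_, rows)| rows.iter().all(|r| r.minute_ts + 60 <= watermark))
            .map(|(path, _)| path.clone())
            .collect();

        for path in finished {
            let rows = self.open_partitions.remove(&path).unwrap();
            // A real implementation would close the Parquet writer and move
            // the finished file under the partition's directory here.
            println!("flushed {} rows for partition {}: {:?}", rows.len(), path, rows);
        }
    }
}

fn main() {
    let mut sink = PartitionedParquetSink::new();
    sink.write(Record { product_id: "BTC-USD".into(), minute_ts: 1_695_650_220, price: 26_400.0 });
    sink.write(Record { product_id: "ETH-USD".into(), minute_ts: 1_695_650_220, price: 1_590.0 });
    // The watermark has passed the end of that minute, so both partitions flush.
    sink.on_watermark(1_695_650_300);
}

The key property is that a partition keyed on a time column is only flushed once the watermark has passed the end of its bucket, so only finished files ever appear in the output.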

Also happy to discuss more details in the discord: https://discord.gg/cjCr5rVmyR

Adding comments from Discord: https://discord.com/channels/1092704334808092754/1153049232647934072/1155941228102303764

@jacksonrnewhouse — 09/25/2023 2:57 PM
@miek, wanted to follow up on partitioned sinks. I think this is worth implementing, as it is a pretty common pattern. After looking through the different options, I think the simplest way would be to add an option to the filesystem sink where you specify the partition-by columns, e.g.

CREATE TABLE bids (
  auction bigint,
  bidder bigint,
  price bigint,
  hour timestamp
) WITH (
  connector = 'filesystem',
  path = 'https://s3.us-west-2.amazonaws.com/demo/s3-uri',
  format = 'parquet',
  parquet_compression = 'zstd',
  parquet_version = '2.6',  -- miek's suggestion to add 'version' here, too
  partition_by = 'hour'
);

This is being done "outside" of SQL, but given that DataFusion doesn't currently support partitioned inserts, this seems like the most sensible approach.

Ha, this is very close to what I've implemented on the partitioning branch. Should have it in the next version this coming week.

The initial version of this was released in 0.7! In 0.8 it will be improved with a more performant implementation and integration with Delta Lake.