bmoscon / cryptostore

A scalable storage service for cryptocurrency data

pyarrow date partitioned dataset support

quazzuk opened this issue

Hi

I'm just wondering whether there are any plans to support pyarrow partitioned datasets as a storage engine? I have other tick-level data sources stored as date-partitioned datasets and was examining the cryptostore source to see how best to support this format directly, avoiding intermediate copies and additional data processing.

Do you have any suggestions on how date partitioning could be supported in combination with incremental appends? I would like to be able to configure it to store data incrementally every 1 or 2 hours to the file, and then close it and create a new file when the date changes. My understanding is that the current mechanism is a rolling window, not date aligned. Would you suggest creating a new data storage engine (similar to the existing parquet support) and/or modifying the Aggregator functionality? Or perhaps it's more easily supported via the export mechanism (currently used for syncing to cloud storage), writing to a partitioned pyarrow dataset and deleting the original parquet file after the export completes?
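
For concreteness, the on-disk layout I have in mind is what pyarrow produces when writing with partition_cols; a rough sketch below (the column names are purely illustrative, not cryptostore's actual schema):

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Illustrative tick data; in practice the timestamps come from the exchange feed.
df = pd.DataFrame({
    "timestamp": pd.to_datetime(["2021-03-01 09:00:00", "2021-03-02 10:30:00"]),
    "price": [50000.0, 50250.0],
    "size": [0.1, 0.2],
})
# Derive a date column to partition on.
df["date"] = df["timestamp"].dt.date.astype(str)

table = pa.Table.from_pandas(df, preserve_index=False)
# Produces a hive-style layout: trades_dataset/date=2021-03-01/<file>.parquet, etc.
pq.write_to_dataset(table, root_path="trades_dataset", partition_cols=["date"])
```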

Thanks
Andy

Hi @quazzuk
Not on my side, and as far as I know Bryant is using Arctic, so unless someone is ready to take on the topic, I guess it is unlikely.

I would like to be able to configure it to store data incrementally every 1 or 2 hours to the file

Yes, it does not work exactly this way: you specify a storage interval and a number of times to append. If there is always new data, then there is always something to append, and with the right parameter values you end up with 1 or 2 hours of data in the file.
If it is a pair with little new data, then for some storage intervals no new data is written, and it takes more time to reach the full number of appends.
But I think it could be made to work at a fixed interval, by decreasing the append_counter even if no new data is appended.
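
Roughly, as a sketch only (this is not the actual cryptostore code; the queue, write_parquet and roll_file callables are hypothetical, the point is just the counter handling):

```python
import time

def aggregate(queue, storage_interval, append_count, write_parquet, roll_file):
    """Sketch: append at a fixed cadence and roll the file after a fixed
    number of intervals, whether or not new data arrived."""
    counter = append_count
    while True:
        time.sleep(storage_interval)
        data = queue.drain()      # hypothetical: may return an empty batch
        if data:
            write_parquet(data)   # append to the current parquet file
        counter -= 1              # decrement even if nothing was written
        if counter == 0:
            roll_file()           # close the current file, start a new one
            counter = append_count
```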

Would you suggest creating a new data storage engine (similar to the existing parquet support) and/or modifying the Aggregator functionality?

The logic you appear to want is:
Every storage_interval, write the data into different partitions depending on its creation date.
Can you confirm that the time you care about is the time of data creation (not the time of reception)?

I feel ticket #56 shares a 'similar' difficulty, and it was closed unsolved.
To write a full order book at the beginning of a parquet file, you have to:

  • analyze the data you receive (is it a full snapshot or deltas)
  • if it is a full snapshot, then close the previous file and start a new one (a rough sketch follows below).
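
Purely as an illustration (the is_snapshot flag is an assumption about the update format, not necessarily what cryptofeed/cryptostore exposes):

```python
def handle_book_update(update, writer, roll_file):
    """Sketch: start a new parquet file whenever a full snapshot arrives,
    so each file begins with a complete order book."""
    if update.get("is_snapshot"):   # assumed flag; real snapshot detection may differ
        roll_file()                 # close the previous file first
    writer.append(update)           # snapshot or delta goes into the current file
```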

Assuming the data we receive is 'sorted', meaning no older data can be received after younger data (not sure about this?),
then I would suggest that you:

  • start from the existing parquet storage
  • the option you want to modify is append_counter: instead of relying on a counter, you want the appending to continue until reaching the start of a new date-based partition.
  • you have to modify the condition triggering the close of the file and the creation of a new parquet file. Instead of closing the file when append_counter is reached, you want to close the file when there is data whose creation date falls after a specific time. With the remainder of the data (which is newer), you create a new parquet file.
  • you have to think about how to generate these 'specific times' dynamically (for instance using pd.period_range()?); see the sketch after this list.
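
A rough sketch of that split, with made-up names (pd.period_range() is the only real call here; the rest is just to show the idea of cutting an incoming batch at the partition boundary):

```python
import pandas as pd

# Daily partition boundaries generated up front, e.g. with pd.period_range().
boundaries = pd.period_range("2021-03-01", "2021-03-10", freq="D").to_timestamp()

def split_at_boundary(df, boundaries):
    """Return (rows finishing the current file, rows starting the next one)."""
    # The first boundary strictly after the batch's first timestamp marks the
    # end of the current date partition.
    current_end = boundaries[boundaries > df["timestamp"].iloc[0]].min()
    in_current = df[df["timestamp"] < current_end]
    in_next = df[df["timestamp"] >= current_end]
    return in_current, in_next
```

If in_next is non-empty, that is the signal to close the current parquet file and start the next date's file with those rows.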

Not impossible I think.
Good luck. :)

@yohplala thanks for the pointers, I'll take a look...