bmoscon / cryptostore

A scalable storage service for cryptocurrency data

S3 Parquet file_format

mattgrint opened this issue · comments

When using Parquet with S3, it seems that the file_format configuration parameter is overridden to make the transfer to S3. For example, if we use the file_format: [exchange, data_type, timestamp], the .parquet.tmp files will aggregate on the local disk but then transfer into a file path including [exchange, data_type, symbol, timestamp].

This appears to lead to incorrect (dis)aggregation behaviour - if we use the file_format [exchange, data_type, timestamp], the data are collected in .parquet.tmp files without a symbol, and when the data are transferred to S3 and split into different files, these each contain not just the named symbol but also a selection of others as well.

Is this expected behaviour / a configuration issue? I believe something similar has been briefly mentioned here: #154 (comment)

Thanks

Hi @mattgrint

the .parquet.tmp files will aggregate on the local disk but then transfer into a file path including [exchange, data_type, symbol, timestamp].

This behavior is normal and is 'legacy' behavior. See line 166 in data/parquet.py: when files are transferred to remote storage (GCS, GD or S3), the file name is forced.

 path = self._default_path(exchange, data_type, pair) + f'/{exchange}-{data_type}-{pair}-{timestamp}.parquet'
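To illustrate with made-up values (my only assumption here is that _default_path joins exchange, data_type and pair with '/'; I have not checked it), the local file and the forced S3 key would differ like this:

# Sketch only, assuming _default_path(exchange, data_type, pair) returns 'exchange/data_type/pair'.
exchange, data_type, pair, timestamp = 'EXCHANGE', 'trades', 'A', 1616510013

local_name = f'{exchange}-{data_type}-{timestamp}.parquet'   # file_format [exchange, data_type, timestamp]
s3_key = f'{exchange}/{data_type}/{pair}' + f'/{exchange}-{data_type}-{pair}-{timestamp}.parquet'

print(local_name)  # EXCHANGE-trades-1616510013.parquet
print(s3_key)      # EXCHANGE/trades/A/EXCHANGE-trades-A-1616510013.parquet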

This appears to lead to incorrect (dis)aggregation behaviour - if we use the file_format [exchange, data_type, timestamp], the data are collected in .parquet.tmp files without a symbol, and when the data are transferred to S3 and split into different files, these each contain not just the named symbol but also a selection of others as well.

I think your formulation is not quite right.
I think the trouble is not when files are transferred to S3, but right when they are written to parquet.
If you collect several symbols but do not include symbol in file_format, then cryptostore may, by luck, succeed in keeping track of which file contains data for which symbol, provided the files were not created with the same timestamp.
But sometimes (I find it a bit weird though...), if they have the same creation timestamp (which is second-precise, so I am quite surprised this can happen), then cryptostore will append data for different symbols to the same file.

Please, do you have any way to check what the situation is before transferring to S3?
(just comment out the S3 part in your configuration file)
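For example, a quick way to see which symbols each local file actually contains; a minimal sketch using pandas, assuming the symbol column is named 'pair' (files still being written may not be readable yet):

# Minimal sketch: list which symbols each local parquet file contains.
# Assumes the files are valid parquet and the symbol column is named 'pair'.
import glob
import pandas as pd

for path in sorted(glob.glob('*.parquet*')):
    df = pd.read_parquet(path)
    print(path, sorted(df['pair'].unique()))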

@bmoscon
Bryant, would you agree to make all four fields (feed, symbol, channel, timestamp) compulsory in file_format?
In my opinion this would solve this ticket.

It is worth noting that these reports are appearing now that parquet file appending exists.
That does not necessarily mean that parquet file appending is the cause, though.
In the past, without appending, files would simply be overwritten, which indeed meant no data from different symbols in the same file, but data loss instead (I guess less visible).
This is a thought.

Best,

@mattgrint
I think I understand now why you don't use symbol in file_format while you are collecting data for different symbols.
It is because you are transferring to S3, and the data is then moved into a directory whose name embeds this information, right?

Please, also check if using symbol in file_format solves your trouble.
I am guessing it will.

Hi @yohplala

Thanks for the helpful response.

If you collect several symbols, but do not specify a file_format with symbol, then cryptostore may by luck succeed to keep track of which file contains data for which symbol if they have not been created with the same timestamps.

Yes, I think we are getting at the same core point - the files are saved to the local disk after discarding the information about which file contains which symbols, so they cannot then be written out to S3 with symbol-level precision in their naming without searching every file to see which symbols it contains and splitting it out again.

My specific use case is collecting Option data, where we may have several hundred symbols but relatively infrequent data. In the requested data_type, a symbol field is provided in the actual data that is received, so ideally I would collect all of these in a single file rather than having several hundred sparse ticker or trade data files. Collecting the data locally in the .parquet.tmp file seems to work (all the symbols for each data type are in each file, but are identifiable by the symbol field); however, when transferred to S3, as you point out, the file format is forced to include symbol, which (I think) causes the incorrect splitting of the files.

Rather than forcing the requirement to include all four fields ['feed', 'symbol', 'channel', 'timestamp'], could we not just copy the .parquet files verbatim? The .parquet.tmp files on the local disk are correctly aggregated for my purposes and as you say the problem is introduced in trying to force the four-field file_format when transferring to S3.

Say, for example, we are collecting the trades and ticker channels for three symbols "A", "B" and "C" for a single exchange, "EXCHANGE". On the local disk, these are all collected in files using the format [feed, channel, timestamp]. We could end up with these files for example:

Exchange-ticker-1616510013.parquet.tmp

feed      pair  bid    ask     receipt_timestamp  timestamp
EXCHANGE  A     0.206  2.5810  1.616499e+09       1.616499e+09
EXCHANGE  B     0.305  2.7920  1.616499e+09       1.616499e+09
EXCHANGE  C     0.109  1.8580  1.616499e+09       1.616499e+09

Exchange-trades-1616510013.parquet.tmp

feed      pair  timestamp     receipt_timestamp  side  amount  price      id
EXCHANGE  A     1.616495e+09  1.616495e+09       buy   1.0     0.0005     137945189
EXCHANGE  B     1.616495e+09  1.616495e+09       sell  2.0     0.0070     137945157
EXCHANGE  C     1.616495e+09  1.616495e+09       buy   2.0     1787.2500  137945160

This contains all the data that is needed and it is straightforward to determine the data for each symbol. I think a simple transfer to S3 would suit this use case perfectly, rather than forcing the different file format, unless I am missing something?
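And if the combined files were copied verbatim, recovering per-symbol data downstream would stay trivial; a sketch with pandas, using the column names from the tables above (the output file name pattern is just an example):

# Sketch: split a combined trades file by symbol after the fact.
import pandas as pd

df = pd.read_parquet('Exchange-trades-1616510013.parquet')
for pair, group in df.groupby('pair'):
    # e.g. write one file per symbol, or just filter as needed
    group.to_parquet(f'Exchange-trades-{pair}-1616510013.parquet')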

Please let me know your thoughts.

I don't have a completely clear picture of the transfer step, and I am amazed by the result you mention.

First, parquet appending to local storage:
What you think is a feature is actually unexpected behavior :) or at least behavior I was certainly not expecting when implementing appending.

Data for each symbol is supposed to be tracked independently of what happens for the other symbols, and is supposed to be written to distinct files.

But to keep track of what happens for each combination, a dict is used with keys based on file_format.
And if all combinations end up with the same file name, this does not work... and it behaves as you witnessed.

...
file_name = f'{exchange}-{data_type}-{pair}-{timestamp}.parquet'            # for instance
...
f_name_tips = tuple(file_name.split(timestamp))
...
self.buffer[f_name_tips] = {'counter': 0, 'writer': writer, 'timestamp': timestamp}
...

The buffer dict keeps track of the data necessary to manage the correct file (how to find it, and when to close the current one and create a new one once append_counter is consumed).
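To make the collision concrete, here is a small sketch of my understanding (not the actual cryptostore code): with a file_format that omits symbol, every symbol produces the same f_name_tips key, so everything is appended through the same writer:

# Sketch of the keying problem: no {pair} in the file name means all symbols share one key.
exchange, data_type, timestamp = 'EXCHANGE', 'trades', '1616510013'

buffer = {}
for pair in ('A', 'B', 'C'):
    file_name = f'{exchange}-{data_type}-{timestamp}.parquet'   # symbol not in file_format
    f_name_tips = tuple(file_name.split(timestamp))
    buffer.setdefault(f_name_tips, []).append(pair)

print(buffer)   # {('EXCHANGE-trades-', '.parquet'): ['A', 'B', 'C']} -> one file for all three symbols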

Now, coming to S3, I have no idea what happens.
There is no such thing in the code as 'splitting files given their content'. What is supposed to happen is a simple copy.
So in a last attempt to find an explanation, maybe S3 has specific behavior with parquet files?
Appending to a parquet file creates a new row group each time.
You could consider each of these row groups as an independent file, and the parquet file as a tar archive of all these row groups.
So perhaps your S3 UI is showing you the row groups directly?
(I am not using S3, and don't know about it)
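If it helps, the row groups can be inspected directly; a sketch with pyarrow (assuming the file is valid parquet and the file name is just an example):

# Sketch: each append creates a new row group; list and read them with pyarrow.
import pyarrow.parquet as pq

pf = pq.ParquetFile('Exchange-trades-1616510013.parquet')
print(pf.num_row_groups)                  # one row group per append
print(pf.read_row_group(0).to_pandas())   # a single row group read on its own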