bmoscon / cryptostore

A scalable storage service for cryptocurrency data

[Parquet] ValueError: Table schema does not match schema used to create file

anovv opened this issue

Describe the bug
I'm running cryptostore to get the l2 book/trades/ticker feeds from Binance for multiple pairs. It works fine for the first few minutes, but then I get

2021-02-04 11:01:48,993 : ERROR : Aggregator running on PID 56410 died due to exception
...
ValueError: Table schema does not match schema used to create file:
...

To Reproduce
I'm using Kafka as a medium and storing data locally on disk in Parquet files.

Config: https://pastebin.com/fpQQDm0r

Trace: https://pastebin.com/9T3wHTTW

Operating System:
Running locally on Mac

Cryptofeed Version
I'm building off master on GitHub.

Looks like 'amount' in the table schema is double while 'amount' in the file is int64. Why is that?

Got another error like this, with a different field now ('size'), same type mismatch: double vs int64
Trace: https://pastebin.com/630QDAHd

Hello @dirtyValera, hi @bmoscon,
I had a quick look at the source and I may have a clue.
size and amount are two columns I am not initializing when creating the Parquet files; their types are inferred.

You have set a 60 s storage interval, and on your ETH-based pairs I would not be surprised if there is only one trade, or even none, per 60 s.

      - QTUM-ETH
      - EOS-ETH
      - SNT-ETH
      - BNT-ETH

Like BNT for instance, or SNT. (I don't even know what SNT is :))
So my guess is that there are very few trades each time a Parquet file is written (maybe only one), and that in this case the amount or size can be inferred as an int.
I think the reverse mismatch is also possible.
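
For what it's worth, here is a minimal pyarrow reproduction of that inference guess (illustrative only, not cryptostore's actual write path; the column and file names are made up):

import pyarrow as pa
import pyarrow.parquet as pq

# First batch: every amount happens to be a whole number -> pyarrow infers int64
batch1 = pa.table({"price": [1.23], "amount": [5]})
writer = pq.ParquetWriter("trades.parquet", batch1.schema)
writer.write_table(batch1)

# Later batch: a fractional amount -> pyarrow infers double, which no longer
# matches the schema the writer was created with
batch2 = pa.table({"price": [1.24], "amount": [0.5]})
writer.write_table(batch2)  # ValueError: Table schema does not match schema used to create file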

I don't have time to delve more into the topic at the moment.
Please, on your side:

  • can you confirm that for more heavily traded pairs this error message does not show up? (take the BTC-USDT pair, you will be fine :))
  • as a workaround for the time being, with these pairs, you can:
    • switch off append_counter and set storage_interval to 300. Fingers crossed that you will not run out of memory...
      The ETH-BTC and BNB-BTC l2 book data will however consume some. If you have at least 4 GB of RAM, it should be OK. If you have less than 2 GB, ... really not sure.
    • simply switch off append_counter with a slightly higher storage_interval. It will create more files, but this may be an acceptable temporary solution?

Same error here. Using Redis & Binance.

# Cryptostore sample config file

cache: redis
redis:
    ip: '127.0.0.1'
    port: 6379
    socket: null
    del_after_read: true
    retention_time: null
    start_flush: true

binance_symbols: &binance_symbols [BTC-USDT, ETH-USDT, DOT-USDT, EGLD-USDT, XLM-USDT, BNB-USDT, ADA-USDT, LTC-USDT, NEO-USDT, ATOM-USDT, IOTA-USDT, SAND-USDT, COCOS-USDT, ALGO-USDT, THETA-USDT, DASH-USDT, DUSK-USDT, BTC-EUR, ETH-EUR, DOT-EUR, XLM-EUR, BNB-EUR, ADA-EUR, LTC-EUR, BTC-AUD, ETH-AUD, BNB-AUD, BTC-BRL, ETH-BRL, BNB-BRL, LTC-BRL, BTC-RUB, ETH-RUB, BNB-RUB, LTC-RUB, EUR-USDT, AUD-USDT, USDT-BRL, USDT-RUB]
exchanges:
    BINANCE:
        retries: -1
        l2_book:
            symbols: *binance_symbols
            max_depth: 500
            book_delta: true
        ticker: *binance_symbols


storage: [parquet]
storage_retries: 5
storage_retry_wait: 30
storage_interval: 60
#storage: [arctic]
#arctic: mongodb://127.0.0.1
parquet:
    del_file: true
    append_counter: 4
    path: ./
    file_format: [timestamp]
    prefix_date: false
    compression:
        codec: BROTLI
        level: 6
    S3:
        endpoint: null
        key_id: xxxxxx
        secret: xxxxx
        bucket: xxxxx
        prefix: null

Hello @emiliobasualdo, hello @dirtyValera,
I posted this PR to fix this trouble.
It works on the type of data you reported: size and amount.
Please, can you give it a check?
(I tested on my side, but not on your config example)

Thanks for your feedback,
Bests,

This looks to have been fixed. Please confirm either way. I'll close this ticket out if I don't hear back.

@yohplala @bmoscon Thanks for the answer!

I have tested it and still have the following error:

....
2021-02-24 15:31:43,716 : INFO : l2_book-BINANCE-EGLD-USDT: Read 21 messages from Redis
2021-02-24 15:31:43,724 : ERROR : Aggregator running on PID 2292 died due to exception
Traceback (most recent call last):
  File "/home/ec2-user/.local/lib/python3.8/site-packages/cryptostore/aggregator/aggregator.py", line 108, in loop
    store.write(exchange, dtype, pair, time.time())
  File "/home/ec2-user/.local/lib/python3.8/site-packages/cryptostore/data/storage.py", line 37, in write
    s.write(exchange, data_type, pair, timestamp)
  File "/home/ec2-user/.local/lib/python3.8/site-packages/cryptostore/data/parquet.py", line 146, in write
    writer.write_table(table=self.data)
  File "/home/ec2-user/.local/lib/python3.8/site-packages/pyarrow/parquet.py", line 649, in write_table
    raise ValueError(msg)
ValueError: Table schema does not match schema used to create file:
table:
timestamp: double
receipt_timestamp: double
delta: bool
side: dictionary<values=string, indices=int32, ordered=0>
price: double
size: double vs.
file:
feed: dictionary<values=string, indices=int32, ordered=0>
symbol: dictionary<values=string, indices=int32, ordered=0>
bid: double
ask: double
receipt_timestamp: double
timestamp: double
Task exception was never retrieved
....

Please confirm this is the correct way of upgrading:

git clone https://github.com/bmoscon/cryptostore.git
cd cryptostore
python setup.py install --force

I checked the PR with the fix and believe it doesn't solve the issue.

Cherry-picking fields to force-cast to double is not a scalable solution (as you can see, the above example has a new field, receipt_timestamp, which also needs a cast).

Is there any way to cast all fields to the same type?
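
For context, a blanket cast of the kind asked about could look roughly like this with pyarrow (a sketch only, not cryptostore's code; align_to_file_schema is a hypothetical helper, and the maintainer explains below why he does not want to go this way):

import pyarrow as pa

def align_to_file_schema(table: pa.Table, file_schema: pa.Schema) -> pa.Table:
    # Reorder columns to the file's order, then cast each column to the file's type.
    # Raises if a column is missing or a value cannot be converted safely.
    table = table.select(file_schema.names)
    return table.cast(file_schema)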

receipt timestamp is not new, it's been in the codebase for a while. The traceback posted above has a much more serious issue:

side: dictionary<values=string, indices=int32, ordered=0>

the side should always be a string. Also, the comparison of the fields indicates that it's comparing a book schema to a ticker schema, so this is probably something unrelated to the original issue.

Yes, my point was that the fix just whitelists (to be cast) the fields which were reported to be buggy (from my two previous traces). By receipt_timestamp being new I meant that it popped up with the same problem of type mismatch (not that it was just introduced). Hence casting only the specific fields that people report here is not a viable solution.

receipt timestamp is not new, it's been in the codebase for a while. The traceback posted above has a much more serious issue:
side: dictionary<values=string, indices=int32, ordered=0>
the side should always be a string.

@bmoscon, this is not an issue. I turned it into a category a while ago (called a dictionary in the pyarrow world). As it has only two values, you save a terrific amount of space.
This has never raised any issue on my side, and I confirm when opening a file that it is correctly stored as a category.
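
For illustration, this is roughly what that dictionary encoding looks like in pyarrow (the values here are just examples):

import pyarrow as pa

side = pa.array(["bid", "ask", "bid", "bid", "ask"])
encoded = side.dictionary_encode()
print(encoded.type)  # dictionary<values=string, indices=int32, ordered=0>
# Only the two distinct strings are stored once; each row keeps a small integer index.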

I [...] believe it doesn't solve the issue.

Believing is not always a solution ;)

Cherry-picking fields to force-cast to double is not a scalable solution (as you can see, the above example has a new field, receipt_timestamp, which also needs a cast).
Is there any way to cast all fields to the same type?

No, I would not want to do that.
All channels go through these lines of code, including future ones that do not exist yet, and 'weird' ones like MARKET_INFO from COINGECKO and TRANSACTIONS from WHALE_ALERT.
It means an awful lot of different possible columns.

Also the comparison of the fields indicates that its comparing a book schema to a ticker schema, so probably something unrelated to this

Yes, this is the trouble. The schemas of the file and the table are certainly never going to match.

@emiliobasualdo

There is something I don't understand in your config file, the file_format parameter is simply:

file_format: [timestamp]

@bmoscon, is this not an error? Shouldn't it be caught?
Without the other fields exchange, symbol, data_type, how is it that you are not overwriting your own file, @emiliobasualdo?

@emiliobasualdo I am sorry, I am not a YAML expert, what does &binance_symbols mean in this line?

binance_symbols: &binance_symbols [BTC-USDT, ETH-USDT,
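
(For reference, &binance_symbols defines a YAML anchor and *binance_symbols is an alias that reuses it, so the same symbol list feeds both l2_book and ticker. A quick check with PyYAML, trimmed to two symbols:)

import yaml  # PyYAML

doc = """
binance_symbols: &binance_symbols [BTC-USDT, ETH-USDT]
exchanges:
  BINANCE:
    ticker: *binance_symbols
"""
print(yaml.safe_load(doc)["exchanges"]["BINANCE"]["ticker"])  # ['BTC-USDT', 'ETH-USDT']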

Yes, the file_format would write all updates to the same file. That's the issue for @emiliobasualdo. You can only use that format if you are only storing one type of data, not two.

Yes, the file_format would write all updates to the same file. That's the issue for @emiliobasualdo. You can only use that format if you are only storing one type of data, not two.

  • and only one pair, as symbol is not stored for all channels,
  • and the same for exchange.
    You can see it in the example above: the book data does not seem to record feed and symbol.

Shouldn't we raise an exception if not all 4 fields are in file_format?

@dirtyValera

By receipt_timestamp being new I meant that it popped up with the same problem of type mismatch (not that it was just introduced).

Please, could you highlight the error message indicating receipt_timestamp is raising trouble? I did not notice it.

Please, can you give it a try on your side?
I 'believe' from your wording that you did not, and I would like to make sure the trouble only lies in file_format.
Thanks a lot for your help.
Bests

@yohplala I didn't have this issue with receipt_timestamp, this is what I guessed from @emiliobasualdo's trace. I could be wrong so please feel free to close the issue if it's not related.

@yohplala I didn't have this issue with receipt_timestamp, this is what I guessed from @emiliobasualdo's trace. I could be wrong so please feel free to close the issue if it's not related.

Please, on your side, does the PR solve your issue?
Could you give it a try?

The error reported by @emiliobasualdo is clearly related to the fact that all the Parquet files generated by cryptostore in the same second will have the same name, whatever the exchange, pair and data type, so during the same second the file is overwritten x times for x different sources of data. It cannot work. IMHO, cryptostore should raise an exception if all 4 fields are not provided. @bmoscon, I can do a PR on this if you agree with this solution.
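
A minimal sketch of the kind of check proposed above (the required field names and the error wording are assumptions for illustration, not cryptostore's actual code):

REQUIRED_FIELDS = {"exchange", "symbol", "data_type", "timestamp"}  # assumed names

def validate_file_format(file_format: list) -> None:
    # Refuse a file_format that would make different feeds collide on the same file name.
    missing = REQUIRED_FIELDS - set(file_format)
    if missing:
        raise ValueError(f"parquet file_format must include {sorted(REQUIRED_FIELDS)}; missing: {sorted(missing)}")

# validate_file_format(['timestamp']) would reject the config shown above.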

@dirtyValera, but your own problem is not related to file naming. Is it now solved?

Thanks for your feedback.
Bests

@yohplala, I will test it and let you know, although I believe it should solve the issue