bmoscon / cryptostore

A scalable storage service for cryptocurrency data

[Parquet] ValueError: Table schema does not match schema used to create file

anovv opened this issue

Describe the bug
I'm running cryptostore to get the l2 book/trades/ticker feeds from Binance for multiple pairs. It works fine for the first few minutes, but then I get

2021-02-04 11:01:48,993 : ERROR : Aggregator running on PID 56410 died due to exception
...
ValueError: Table schema does not match schema used to create file:
...

To Reproduce
I'm using Kafka as a medium and storing data locally on disk in Parquet files.

Config: https://pastebin.com/fpQQDm0r

Trace: https://pastebin.com/9T3wHTTW

Operating System:
Running locally on Mac

Cryptofeed Version
I'm building off master on GitHub.

Looks like 'amount' in the table schema is double while 'amount' in the file is int64. Why is that?

Got another error like this, with a different field now ('size'), same type mismatch: double vs int64
Trace: https://pastebin.com/630QDAHd

Hello @dirtyValera, hi @bmoscon,
I had a quick look at the source and I may have a clue.
size and amount are two columns I am not initializing when creating the Parquet files; their types are inferred.

You have set a 60 s storage interval, and on your ETH-based pairs I would not be surprised if there is only one trade, or even none, per 60 s.

      - QTUM-ETH
      - EOS-ETH
      - SNT-ETH
      - BNT-ETH

Like BNT for instance, or SNT. (I don't even know what SNT is :))
So my guess is that there are very few trades each time a Parquet file is written (maybe only one), and that in this case the amount or size can be inferred as an int.
I think the reverse mismatch is also possible.
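
For what it's worth, here is a minimal pyarrow reproduction of that inference guess (illustrative only, not cryptostore's actual write path; the column and file names are made up):

import pyarrow as pa
import pyarrow.parquet as pq

# First batch: every amount happens to be a whole number -> pyarrow infers int64
batch1 = pa.table({"price": [1.23], "amount": [5]})
writer = pq.ParquetWriter("trades.parquet", batch1.schema)
writer.write_table(batch1)

# Later batch: a fractional amount -> pyarrow infers double, which no longer
# matches the schema the writer was created with
batch2 = pa.table({"price": [1.24], "amount": [0.5]})
writer.write_table(batch2)  # ValueError: Table schema does not match schema used to create file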

I don't have time to delve more into the topic at the moment.
Please, on your side:

  • can you confirm that for more heavily traded pairs this error message does not show up? (take the BTC-USDT pair, you will be fine :))
  • as a workaround for the time being, with these pairs, you can:
    • switch off append_counter and set storage_interval to 300. Fingers crossed that you will not run out of memory...
      The ETH-BTC and BNB-BTC l2 book data will however consume some. If you have at least 4 GB of RAM, it should be OK. If you have less than 2 GB, ... really not sure.
    • simply switch off append_counter with a slightly higher storage_interval. It will create more files, but this may be an acceptable temporary solution?

Same error here. Using Redis & Binance.

# Cryptostore sample config file

cache: redis
redis:
    ip: '127.0.0.1'
    port: 6379
    socket: null
    del_after_read: true
    retention_time: null
    start_flush: true

binance_symbols: &binance_symbols [BTC-USDT, ETH-USDT, DOT-USDT, EGLD-USDT, XLM-USDT, BNB-USDT, ADA-USDT, LTC-USDT, NEO-USDT, ATOM-USDT, IOTA-USDT, SAND-USDT, COCOS-USDT, ALGO-USDT, THETA-USDT, DASH-USDT, DUSK-USDT, BTC-EUR, ETH-EUR, DOT-EUR, XLM-EUR, BNB-EUR, ADA-EUR, LTC-EUR, BTC-AUD, ETH-AUD, BNB-AUD, BTC-BRL, ETH-BRL, BNB-BRL, LTC-BRL, BTC-RUB, ETH-RUB, BNB-RUB, LTC-RUB, EUR-USDT, AUD-USDT, USDT-BRL, USDT-RUB]
exchanges:
    BINANCE:
        retries: -1
        l2_book:
            symbols: *binance_symbols
            max_depth: 500
            book_delta: true
        ticker: *binance_symbols


storage: [parquet]
storage_retries: 5
storage_retry_wait: 30
storage_interval: 60
#storage: [arctic]
#arctic: mongodb://127.0.0.1
parquet:
    del_file: true
    append_counter: 4
    path: ./
    file_format: [timestamp]
    prefix_date: false
    compression:
        codec: BROTLI
        level: 6
    S3:
        endpoint: null
        key_id: xxxxxx
        secret: xxxxx
        bucket: xxxxx
        prefix: null

Hello @emiliobasualdo, hello @dirtyValera,
I posted this PR to fix this trouble.
It works on the type of data you reported: size and amount.
Please, can you give it a check?
(I tested on my side, but not on your config example)

Thanks for your feedback,
Bests,

This looks to have been fixed. Please confirm either way. I'll close this ticket out if I don't hear back.

@yohplala @bmoscon Thanks for the answer!

I have tested it and still have the following error:

....
2021-02-24 15:31:43,716 : INFO : l2_book-BINANCE-EGLD-USDT: Read 21 messages from Redis
2021-02-24 15:31:43,724 : ERROR : Aggregator running on PID 2292 died due to exception
Traceback (most recent call last):
  File "/home/ec2-user/.local/lib/python3.8/site-packages/cryptostore/aggregator/aggregator.py", line 108, in loop
    store.write(exchange, dtype, pair, time.time())
  File "/home/ec2-user/.local/lib/python3.8/site-packages/cryptostore/data/storage.py", line 37, in write
    s.write(exchange, data_type, pair, timestamp)
  File "/home/ec2-user/.local/lib/python3.8/site-packages/cryptostore/data/parquet.py", line 146, in write
    writer.write_table(table=self.data)
  File "/home/ec2-user/.local/lib/python3.8/site-packages/pyarrow/parquet.py", line 649, in write_table
    raise ValueError(msg)
ValueError: Table schema does not match schema used to create file:
table:
timestamp: double
receipt_timestamp: double
delta: bool
side: dictionary<values=string, indices=int32, ordered=0>
price: double
size: double vs.
file:
feed: dictionary<values=string, indices=int32, ordered=0>
symbol: dictionary<values=string, indices=int32, ordered=0>
bid: double
ask: double
receipt_timestamp: double
timestamp: double
Task exception was never retrieved
....

Please confirm this is the correct way of upgrading:

git clone https://github.com/bmoscon/cryptostore.git
cd cryptostore
python setup.py install --force

I checked the PR with the fix and believe it doesn't solve the issue.

Cherry-picking fields to force-cast to double is not a scalable solution (as you can see, the above example has a new field, receipt_timestamp, which also needs a cast).

Is there any way to cast all fields to the same type?
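
For context, a blanket cast of the kind asked about could look roughly like this with pyarrow (a sketch only, not cryptostore's code; align_to_file_schema is a hypothetical helper, and the maintainer explains below why he does not want to go this way):

import pyarrow as pa

def align_to_file_schema(table: pa.Table, file_schema: pa.Schema) -> pa.Table:
    # Reorder columns to the file's order, then cast each column to the file's type.
    # Raises if a column is missing or a value cannot be converted safely.
    table = table.select(file_schema.names)
    return table.cast(file_schema)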

receipt timestamp is not new, it's been in the codebase for a while. The traceback posted above has a much more serious issue:

side: dictionary<values=string, indices=int32, ordered=0>

the side should always be a string. Also, the comparison of the fields indicates that it's comparing a book schema to a ticker schema, so this is probably something unrelated to the original issue.

Yes, my point was that the fix just whitelists (to be cast) the fields which were reported to be buggy (from my two previous traces). By receipt_timestamp being new I meant that it popped up with the same problem of type mismatch (not that it was just introduced). Hence casting only the specific fields that people report here is not a viable solution.

receipt timestamp is not new, it's been in the codebase for a while. The traceback posted above has a much more serious issue:
side: dictionary<values=string, indices=int32, ordered=0>
the side should always be a string.

@bmoscon, this is not an issue. I turned it into a category a while ago (called a dictionary in the pyarrow world). As it has only two values, you save a terrific amount of space.
This has never raised any issue on my side, and I confirm when opening a file that it is correctly stored as a category.
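
For illustration, this is roughly what that dictionary encoding looks like in pyarrow (the values here are just examples):

import pyarrow as pa

side = pa.array(["bid", "ask", "bid", "bid", "ask"])
encoded = side.dictionary_encode()
print(encoded.type)  # dictionary<values=string, indices=int32, ordered=0>
# Only the two distinct strings are stored once; each row keeps a small integer index.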

I [...] believe it doesn't solve the issue.

Believing is not always a solution ;)

Cherry-picking fields to force-cast to double is not a scalable solution (as you can see, the above example has a new field, receipt_timestamp, which also needs a cast).
Is there any way to cast all fields to the same type?

No, I would not want to do that.
All channels go through these lines of code, including future ones that do not exist yet, and 'weird' ones like MARKET_INFO from COINGECKO and TRANSACTIONS from WHALE_ALERT.
It means an awful lot of different possible columns.

Also the comparison of the fields indicates that its comparing a book schema to a ticker schema, so probably something unrelated to this

Yes, this is the trouble. The schemas of the file and the table are certainly never going to match.

@emiliobasualdo

There is something I don't understand in your config file, the file_format parameter is simply:

file_format: [timestamp]

@bmoscon, is this not an error? Shouldn't it be caught?
Without the other fields exchange, symbol, data_type, how is it that you are not overwriting your own file, @emiliobasualdo?

@emiliobasualdo I am sorry, I am not a YAML expert, what does &binance_symbols mean in this line?

binance_symbols: &binance_symbols [BTC-USDT, ETH-USDT,
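
(For reference, &binance_symbols defines a YAML anchor and *binance_symbols is an alias that reuses it, so the same symbol list feeds both l2_book and ticker. A quick check with PyYAML, trimmed to two symbols:)

import yaml  # PyYAML

doc = """
binance_symbols: &binance_symbols [BTC-USDT, ETH-USDT]
exchanges:
  BINANCE:
    ticker: *binance_symbols
"""
print(yaml.safe_load(doc)["exchanges"]["BINANCE"]["ticker"])  # ['BTC-USDT', 'ETH-USDT']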

Yes, the file_format would write all updates to the same file. That's the issue for @emiliobasualdo. You can only use that format if you are only storing one type of data, not two.

Yes, the file_format would write all updates to the same file. That's the issue for @emiliobasualdo. You can only use that format if you are only storing one type of data, not two.

  • and only one pair, as symbol is not stored for all channels,
  • and the same for exchange.
    You can see it in the example above: the book data does not seem to record feed and symbol.

Shouldn't we raise an exception if not all 4 fields are in file_format?

@dirtyValera

By receipt_timestamp being new I meant that it popped up with the same problem of type mismatch (not that it was just introduced).

Please, could you highlight the error message indicating receipt_timestamp is raising trouble? I did not notice it.

Please, can you give it a try on your side?
I 'believe' from your wording that you did not, and I would like to make sure the trouble only lies in file_format.
Thanks a lot for your help.
Bests

@yohplala I didn't have this issue with receipt_timestamp, this is what I guessed from @emiliobasualdo's trace. I could be wrong so please feel free to close the issue if it's not related.

@yohplala I didn't have this issue with receipt_timestamp, this is what I guessed from @emiliobasualdo's trace. I could be wrong so please feel free to close the issue if it's not related.

Please, on your side, does the PR solve your issue?
Could you give it a try?

The error reported by @emiliobasualdo is clearly related to the fact that all the Parquet files generated by cryptostore in the same second will have the same name, whatever the exchange, pair and data type, so during the same second the file is overwritten x times for x different sources of data. It cannot work. IMHO, cryptostore should raise an exception if all 4 fields are not provided. @bmoscon, I can do a PR on this if you agree with this solution.
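
A minimal sketch of the kind of check proposed above (the required field names and the error wording are assumptions for illustration, not cryptostore's actual code):

REQUIRED_FIELDS = {"exchange", "symbol", "data_type", "timestamp"}  # assumed names

def validate_file_format(file_format: list) -> None:
    # Refuse a file_format that would make different feeds collide on the same file name.
    missing = REQUIRED_FIELDS - set(file_format)
    if missing:
        raise ValueError(f"parquet file_format must include {sorted(REQUIRED_FIELDS)}; missing: {sorted(missing)}")

# validate_file_format(['timestamp']) would reject the config shown above.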

@dirtyValera, but your own problem is not related to file naming. Is it now solved?

Thanks for your feedback.
Bests

@yohplala, I will test it and let you know, although I believe it should solve the issue