bmoscon / cryptostore

A scalable storage service for cryptocurrency data

[Parquet + Kafka] receipt_timestamp is bool for l2_book files

anovv opened this issue

Describe the bug
I'm trying to understand if I'm doing something wrong, but it seems like a bug to me. Parquet files for l2 books have receipt_timestamp as bool, and delta as double (with the value of a timestamp). Could it be that these two are accidentally swapped? Please see the links/screenshots.

File contents
https://pastebin.com/EEKVAmZF

File schema
https://pastebin.com/fpvsV6N1

Screenshots
(two screenshots attached)

Config
https://pastebin.com/Yf6SjzkQ
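
A minimal sketch of a schema check that shows the swap (the file name is hypothetical; pyarrow is assumed to be installed):

```python
import pyarrow.parquet as pq

# Read only the schema of a stored l2_book file (file name is hypothetical).
schema = pq.read_schema("l2_book-COINBASE-BTC-USD.parquet")
print(schema)

# Expected: receipt_timestamp as double and delta as bool.
# Observed with the Kafka backend: receipt_timestamp comes back as bool
# and delta as double (holding a timestamp-like value), i.e. swapped.
```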

Hi @dirtyValera
Using Redis and Parquet on my side, I see no trouble.
Would you mind giving Redis a try?
If it works, this might then be due to Kafka.

(screenshot: 2021-02-27 13-42-26)

@yohplala, Redis seems to be working correctly. What could be the problem here? I thought the Parquet write logic was independent of the medium.

(screenshot attached)

Parquet is only the end of the pipeline.
You are seeing this trouble with Kafka + Parquet, but that does not mean it would not appear with Kafka + another storage backend.
It would be useful to know about the combination of Kafka + Arctic, for instance.

@bmoscon
do you think you could give it a try?

@yohplala, not directly related to the issue, but I have a question regarding timestamp field precision. It looks like the real value is truncated when stored in Parquet. If I subscribe to the Kafka topic I get higher precision, but if I read the Parquet file with pandas and, for example, call diff() on the timestamp column, I get zeroes. Is it a bug?
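
To be concrete, a minimal sketch of the kind of check that produces the zeroes (the file name is hypothetical):

```python
import pandas as pd

# Read a stored l2_book file with pandas (file name is hypothetical).
df = pd.read_parquet("l2_book-COINBASE-BTC-USD.parquet")

# Over the Kafka topic the raw timestamps differ in their sub-second
# digits, but in the stored file consecutive rows often compare equal,
# so the differences come back as zero.
print(df["timestamp"].diff().head())
```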

Hi, please, could you provide an example? I am not sure I understand.
On my side, the timestamp is inferred by the pyarrow library as a float64. Because of the float64 type, the value may be 'approximated'.
It is then up to you to convert it into a datetime64 through pandas.
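
For instance, a minimal sketch of such a conversion (the epoch values are made up):

```python
import pandas as pd

# Timestamps are stored as float64 epoch seconds, so the sub-second
# part is subject to floating-point approximation (values made up).
ts = pd.Series([1614430946.123456, 1614430947.654321])

# Convert to datetime64[ns] through pandas.
dt = pd.to_datetime(ts, unit="s")
print(dt)
```
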
I don't know how the timestamp is managed through Kafka.
Best,

@yohplala, I think I figured it out. I thought each row should have a unique timestamp, which is not the case, since we get multiple price updates within a single frame from the socket.
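
A toy illustration of this, with made-up values:

```python
import pandas as pd

# Several price updates arriving in one websocket frame share the same
# exchange timestamp (values are made up for illustration).
frame = pd.DataFrame({
    "timestamp": [1614430946.1, 1614430946.1, 1614430946.1, 1614430947.3],
    "price": [45000.0, 45000.5, 44999.5, 45001.0],
})

# diff() is zero within a frame and non-zero only at frame boundaries,
# which matches the zeroes I was seeing.
print(frame["timestamp"].diff())
```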

Still able to reproduce

I'll try and reproduce this locally this week