[Parquet + Kafka] receipt_timestamp is bool for l2_book files
anovv opened this issue
Describe the bug
I'm trying to understand if I'm doing something wrong, but it seems like a bug to me. Parquet files for l2 books have receipt_timestamp as bool, and delta as double (with the value of a timestamp). Could it be that these two are accidentally swapped? Please see links/screenshots
File contents
https://pastebin.com/EEKVAmZF
File schema
https://pastebin.com/fpvsV6N1
@yohplala, Redis seems to be working correctly. What could be the problem here? I thought parquet write logic was independent from the medium.
Parquet is only the end of the process.
You are seeing this problem with Kafka + Parquet, but that does not mean it would not appear with Kafka + another storage backend.
It would be useful to know about the combination of Kafka + Arctic, for instance.
@bmoscon
do you think you could give it a try?
@yohplala, not directly related to the issue, but I have a question regarding timestamp field precision. It looks like the real value is truncated when being stored in Parquet. If I subscribe to the Kafka topic I get higher precision. If I read the Parquet file with pandas and, for example, call diff() on the timestamp column, I get zeroes. Is it a bug?
Hi, could you please provide an example? I am not sure I understand.
On my side, timestamp is inferred by the pyarrow library as a float64. Because of the float64 type, it is possible that the value is only an 'approximation'.
Then it is up to you to convert it into a datetime64 through pandas.
I don't know how timestamps are handled through Kafka.
Bests,
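To put a number on the float64 'approximation' and show the pandas conversion mentioned above, here is a small sketch. The timestamp values are made up, and it assumes the column holds epoch seconds.

```python
import math

import pandas as pd

# Hypothetical epoch-second timestamps stored as float64.
ts = pd.Series(
    [1614556800.123456, 1614556800.123457, 1614556801.5],
    dtype="float64",
)

# Gap between adjacent representable float64 values at this magnitude.
# Anything finer than this cannot be stored in the column.
resolution = math.ulp(float(ts.iloc[0]))
print(resolution)

# Convert to datetime64, assuming the unit is seconds since the epoch.
as_datetime = pd.to_datetime(ts, unit="s")
print(as_datetime.dtype)
```

At current epoch values the float64 spacing is a fraction of a microsecond, so float64 rounding on its own would not be expected to make consecutive microsecond-level timestamps collapse to the same value.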
@yohplala, I think I figured it out. I thought each row should have a unique timestamp, which is not the case, since we get multiple price updates in a single frame from the socket, and they all share that frame's timestamp.
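A small sketch of why diff() returns zeroes in that situation. The rows, column names, and values below are made up for illustration and are not taken from a real l2_book file.

```python
import pandas as pd

# Hypothetical l2_book rows: each frame from the socket carries several
# price updates, and every update in a frame gets the same timestamp.
df = pd.DataFrame({
    "timestamp": [1614556800.25, 1614556800.25, 1614556800.25,
                  1614556800.75, 1614556800.75],
    "side": ["bid", "bid", "ask", "bid", "ask"],
    "price": [100.0, 99.5, 100.5, 100.0, 101.0],
})

# Row-to-row differences are zero inside a frame.
row_diffs = df["timestamp"].diff()

# Number of updates that share each timestamp.
frame_sizes = df.groupby("timestamp").size()

# Differences between distinct frame timestamps only.
frame_diffs = df["timestamp"].drop_duplicates().diff()

print(row_diffs.tolist())
print(frame_sizes.tolist())
print(frame_diffs.dropna().tolist())
```

Dropping duplicate timestamps before calling diff() gives the time between frames rather than between rows.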
Still able to reproduce
I'll try and reproduce this locally this week