PX4 / pyulog

Python module & scripts for ULog files

Add data corruption handling

devbharat opened this issue · comments

From a Slack discussion here

In the sdlog2 logging format (ancient, I know) there used to be the identifier bytes 0xA3 0x95 before the start of every valid packet, so if there was corruption in a previously logged packet you could seek for these bytes in the file stream and recover parsing. It seems that ULog packets start directly with 3 bytes of MSG_TYPE and MSG_SIZE. How does one recover if, while parsing a file, you encounter a MSG_TYPE byte that isn't supposed to be there?
You can't just look for a valid MSG_TYPE byte in the stream following the corrupt byte; it might randomly occur in the payload of other packets and you'd end up reading incorrect bytes for MSG_SIZE.
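
For context, a ULog message header is just a `uint16` msg_size followed by a `uint8` msg_type, with no per-message marker. The naive parser below is an illustrative sketch (not pyulog's actual implementation) of why this matters: once a header byte is corrupted, the `offset += 3 + msg_size` step lands at an arbitrary position, and any byte there can masquerade as a header.

```python
import struct

def parse_messages(buf: bytes):
    """Naive ULog data-section parser that trusts every header it reads.
    A single corrupted msg_size desynchronizes everything that follows."""
    offset = 0
    while offset + 3 <= len(buf):
        msg_size, msg_type = struct.unpack_from('<HB', buf, offset)
        payload = buf[offset + 3: offset + 3 + msg_size]
        yield msg_type, payload
        offset += 3 + msg_size  # a corrupt msg_size sends us to a random offset
```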

If you look at the ULog spec, you will see that there is a sync message. It's not implemented because it has not been required so far.
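
The spec defines the sync message as a regular message (type 'S') whose payload is a fixed 8-byte magic sequence, so recovery can be as simple as scanning forward for that sequence. A minimal sketch follows; the magic value is quoted from my reading of the spec, so verify it against the current documentation before relying on it.

```python
# Sync magic as given in the ULog file format spec (verify before use).
SYNC_MAGIC = bytes([0x2F, 0x73, 0x13, 0x20, 0x25, 0x0C, 0xBB, 0x12])

def resync(buf: bytes, offset: int) -> int:
    """Return the offset of the first byte after the next sync message,
    or len(buf) if no sync message is found."""
    idx = buf.find(SYNC_MAGIC, offset)
    if idx < 0:
        return len(buf)           # no sync message left: discard the rest
    return idx + len(SYNC_MAGIC)  # parsing resumes right after the magic
```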

I had implemented a 2 Hz, 8-byte sync message to recover parsing and have been testing it for a while. I am really wondering whether this way of logging (no per-packet sync bytes, with a sync 'stream' sent at a certain rate) is really the better choice.

The reason being: once the MSG_TYPE/MSG_SIZE bytes are corrupt, you have to 'trash' all bytes until the next sync message, so you lose a bunch of well-formed packets in the middle. You can always send the sync stream at a higher rate, but then I don't see the point of not having per-packet sync. I guess there is some trade-off between the two approaches for keeping the logfile size smallest, and it would vary with the number of bytes you expect to be corrupted. Was there some discussion/literature about this when the spec was first designed? Also, a similar question regarding the choice of 8 bytes for sync, rather than a smaller number.
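
To make the trade-off concrete, here is a back-of-the-envelope comparison of the two approaches; every number below is an assumption for illustration, not a measurement from PX4.

```python
log_rate_bytes_s = 50_000   # assumed average logging rate
avg_msg_size     = 60       # assumed average message size incl. header
sync_msg_size    = 3 + 8    # header + 8-byte sync magic
per_packet_sync  = 2        # sdlog2-style 2-byte marker per message

# Periodic sync at 2 Hz: constant, tiny overhead, but the worst-case loss
# is everything logged between two sync messages (~0.5 s of data here).
periodic_overhead = 2 * sync_msg_size            # bytes per second
worst_case_loss   = log_rate_bytes_s / 2         # bytes

# Per-packet markers: loss is bounded by one message, but the overhead
# scales with the message rate.
msgs_per_s          = log_rate_bytes_s / avg_msg_size
per_packet_overhead = msgs_per_s * per_packet_sync

print(f"periodic sync: ~{periodic_overhead} B/s overhead, "
      f"up to ~{worst_case_loss:.0f} B lost per corruption")
print(f"per-packet marker: ~{per_packet_overhead:.0f} B/s overhead, "
      f"at most one message lost per corruption")
```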

Exactly, these are the trade-offs. I have not added it since the current parser handles it even in the case of corruptions. Corruptions are generally extremely rare, and if you have them it means there's a bug somewhere (e.g. in the file system implementation or the transmission protocol).

Was there some discussion/literature about this when the spec was first designed?

I did not design that, so I don't know. I only added the sync message as a precaution, because I had the same questions in mind as you do now. My idea was, should it ever be required, to write it after every N bytes (where N is a trade-off between log size overhead and worst-case loss).
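
A minimal writer-side sketch of that "sync every N bytes" idea, assuming the 'S' sync message layout and magic from the spec (verify both before use):

```python
import struct

SYNC_MAGIC   = bytes([0x2F, 0x73, 0x13, 0x20, 0x25, 0x0C, 0xBB, 0x12])
SYNC_MESSAGE = struct.pack('<HB', len(SYNC_MAGIC), ord('S')) + SYNC_MAGIC

def write_with_sync(out, messages, n=4096):
    """Write already-serialized ULog messages, inserting a sync message
    whenever at least n bytes have gone out since the last one."""
    since_sync = 0
    for msg in messages:
        out.write(msg)
        since_sync += len(msg)
        if since_sync >= n:
            out.write(SYNC_MESSAGE)
            since_sync = 0
```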

I have not added it since the current parser handles it even in the case of corruptions.

Not if the corrupt bytes belong to the header (MSG_TYPE / MSG_SIZE).

Corruptions are generally extremely rare, and if you have them it means there's a bug somewhere (e.g. in the file system implementation or the transmission protocol).

I'd hope so. I see an average of one corrupt header byte per 2 MB of data in the ULog file, but I am not sure how rare/not-rare that number is.

I see an average of one corrupt header byte per 2 MB of data in the ULog file, but I am not sure how rare/not-rare that number is.

Is that for a single file or in general? If in general, there's something wrong, and even a single case is not OK. What is the path the log file takes (including the transmission protocols and implementations you use) until it gets to you?

In general. I've been looking into it, and I am now pretty sure the issue is with the file writing itself (I added a CRC16 before transmission, checked it after transmission (it's Boost IPC with reliable_message_queue) right before flushing to file, and then checked it again in pyulog while post-processing). The platform is an RPi and the log is written to its SD card. I'll look further into it though.
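
For illustration, this is the kind of CRC framing described (not the author's actual code): append a CRC-16/CCITT-FALSE over each chunk before it goes into the IPC queue, and verify it on the consumer side before flushing to file.

```python
import struct

def crc16_ccitt(data: bytes, crc: int = 0xFFFF) -> int:
    """Bitwise CRC-16/CCITT-FALSE (poly 0x1021, init 0xFFFF)."""
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if (crc & 0x8000) else (crc << 1)
            crc &= 0xFFFF
    return crc

def frame(chunk: bytes) -> bytes:
    """Producer side: append the CRC before pushing into the queue."""
    return chunk + struct.pack('<H', crc16_ccitt(chunk))

def unframe(framed: bytes) -> bytes:
    """Consumer side: verify the CRC before flushing to file."""
    chunk, (crc,) = framed[:-2], struct.unpack('<H', framed[-2:])
    if crc16_ccitt(chunk) != crc:
        raise ValueError("CRC mismatch: chunk corrupted in transit")
    return chunk
```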

Anyway, we digress: making the post-processing robust needs to happen regardless.

I found it. The issue was that the code pushing PX4 ULog messages (the MAVLink handler parsing ULog MAVLink packets) was sometimes pushing ULog packets into the IPC queue in parts, and the data would appear corrupt when one of the onboard computer's own ULog packets would interleave. Once that was fixed, there were no data corruptions anymore, but I would still keep (and push) the sync recovery code.