OpenCyphal / libcanard

A compact implementation of the Cyphal/CAN protocol in C for high-integrity real-time embedded systems

Home Page: http://opencyphal.org


TID distance of 1 may cause timed out state to be selected for incoming packet

magshaw opened this issue

We've been running a system with many nodes, some sharing the same set of messages.

We found that after a target node had timed out once, the system enters a strange state where new messages (with a higher TID) pick up state from an old transfer and are not processed because of TID/toggle mismatches.

On further inspection, tracing this back through the code, I think there is a bug in the computation of the not_previous_tid variable: it prevents needs_restart from becoming true, so the transfer is rejected for bad TIDs and toggles.

Substituting:

    const bool not_previous_tid =
        computeTransferIDForwardDistance((uint8_t) rx_state->transfer_id, TRANSFER_ID_FROM_TAIL_BYTE(tail_byte)) > 1;

for

    const bool not_previous_tid =
        computeTransferIDForwardDistance((uint8_t) rx_state->transfer_id, TRANSFER_ID_FROM_TAIL_BYTE(tail_byte)) != 0;

fixes this issue. This seems plausible as well, since a distance of 1 does indeed mean the received TID is not the previous one.

Does this seem reasonable or are we maybe missing something here?

I apologize for the slow response. I read your post back in June, but I knew immediately that sorting this out was going to take time, so I had to table the issue until later. That later is now.

> Does this seem reasonable or are we maybe missing something here?

I've just spent an hour digging through my old records trying to understand what is happening. Long story short, the problem is that the arguments to computeTransferIDForwardDistance() are swapped. Being human, I am prone to mistakes of that sort.

This is the pseudocode from Specification:

[image: transfer-ID comparison pseudocode from the Specification]

Observe that the local state is on the right; the received transfer-ID is on the left. The algorithm requires us to compute the number of increment operations that need to be applied to the received value in order to equalize it with the local state. This algorithm is implemented correctly in libuavcan and pyuavcan:

https://github.com/UAVCAN/libuavcan/blob/67e56232362aea9f9606d0e80454e6abcae5ff5a/libuavcan/src/transport/uc_transfer_receiver.cpp#L201

https://github.com/UAVCAN/pyuavcan/blob/281680b28657574818348ec5a67347f129525d44/pyuavcan/transport/can/_session/_transfer_receiver.py#L53

Notice that in libcanard the argument order is different, hence the problem you've observed. Swapping the argument order on your branch while keeping the correct comparison > 1 (strictly greater) makes your unit tests pass.

Hi Pavel, my apologies too, things have been hectic my end! I will get this done and back to you tonight/tomorrow.

Fixed in #142

I encountered the same issue. The cause turned out to be that the sender should use a separate transfer_id for each message, rather than having all messages share a single incrementing transfer_id.

A publishing node shall use a separate transfer-ID counter per subject (topic). Usage of a shared transfer-ID counter for all subjects is non-compliant and will cause issues.