Deadlock with last log: "DEBUG portalnet::gossip: propagating validated content ids=["0x.."]"
morph-dev opened this issue
While on flamingo rotation, I noticed on glados that several machines were stuck.
I checked their logs, and the very last two log messages on them were:
DEBUG discv5::service: Received RPC response: Response: Response 050108000000<redacted> to request: TALK: protocol: 500b, request: 040400000000<redacted> from: Node: 0x<redacted>, addr: <redacted>:9009
DEBUG portalnet::gossip: propagating validated content ids=["0x<redacted>"]
(the <redacted> values weren't the same, but that seems irrelevant)
After restarting the Docker containers, the nodes kept working fine.
It seems to me that there is some deadlock happening, most likely in the same place (and most likely during gossiping), but further investigation is needed.
These lines in portalnet/src/gossip.rs look like potential candidates for the deadlock:
let permit = match utp_controller {
    Some(ref utp_controller) => match utp_controller.get_outbound_semaphore() {
        Some(permit) => Some(permit),
        None => continue,
    },
    None => None,
};
try_acquire_owned() doesn't block, so that rules out this candidate.
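For reference, a minimal sketch of why a try_acquire_owned-style call can't be the blocking point, assuming the semaphore behind get_outbound_semaphore() is a tokio::sync::Semaphore (an assumption on my part; the UtpController internals may differ):

use std::sync::Arc;
use tokio::sync::Semaphore;

#[tokio::main]
async fn main() {
    // Semaphore with a single permit, standing in for the outbound uTP limit.
    let semaphore = Arc::new(Semaphore::new(1));

    // Take the only permit and hold on to it.
    let _held = semaphore.clone().try_acquire_owned().unwrap();

    // With no permits left, try_acquire_owned() returns Err immediately
    // rather than waiting, so a caller that skips the peer on Err (like
    // the `None => continue` branch above) never blocks here.
    match semaphore.clone().try_acquire_owned() {
        Ok(_permit) => println!("got a permit"),
        Err(err) => println!("no permit available, returned immediately: {err}"),
    }
}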
Two more nodes got stuck, this time running a build that contains my last PR with extra log messages, so I was able to conclude that the deadlock is happening at this line: portalnet/src/gossip.rs#L61
let kbuckets = kbuckets.read();
The documentation says:
Note that attempts to recursively acquire a read lock on a RwLock when the current thread already holds one may result in a deadlock.
With that being said, it seems that either this thread already holds the read lock, or something else is deadlocked and holds the write lock indefinitely. I'm more inclined to think it's the former, considering that the last log message is always the same (otherwise I would expect something else to get stuck as well, leading to a different message).
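For illustration, a minimal sketch of that failure mode, assuming a parking_lot-style RwLock (fair locking, so a waiting writer blocks new readers); the kbuckets variable and timings here are made up for the demo and are not trin's actual code:

use parking_lot::RwLock;
use std::sync::Arc;
use std::thread;
use std::time::Duration;

fn main() {
    let kbuckets = Arc::new(RwLock::new(0u32));

    // Thread A: holds a read lock, then tries to acquire a second read
    // lock on the same RwLock while still holding the first one.
    let a = {
        let kbuckets = Arc::clone(&kbuckets);
        thread::spawn(move || {
            let first = kbuckets.read();
            // Give thread B time to queue a write lock in between.
            thread::sleep(Duration::from_millis(100));
            // Because a writer is now waiting, this second read blocks
            // behind it, while the writer blocks behind `first` -- a
            // deadlock. try_read_for is used only so the demo terminates.
            match kbuckets.try_read_for(Duration::from_secs(1)) {
                Some(second) => println!("second read ok: {} {}", *first, *second),
                None => println!("second read timed out: recursive read deadlock"),
            }
        })
    };

    // Thread B: requests a write lock while thread A still holds its read lock.
    let b = {
        let kbuckets = Arc::clone(&kbuckets);
        thread::spawn(move || {
            thread::sleep(Duration::from_millis(50));
            let mut guard = kbuckets.write();
            *guard += 1;
        })
    };

    a.join().unwrap();
    b.join().unwrap();
}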
We didn't observe any more deadlocks, so I will consider this fixed by #1458.