Deadlock with last log: "DEBUG portalnet::gossip: propagating validated content ids=["0x.."]"
morph-dev opened this issue
While on flamingo rotation, I noticed on glados that several machines were stuck.
I checked their logs, and the very last two log messages on them were:
DEBUG discv5::service: Received RPC response: Response: Response 050108000000<redacted> to request: TALK: protocol: 500b, request: 040400000000<redacted> from: Node: 0x<redacted>, addr: <redacted>:9009
DEBUG portalnet::gossip: propagating validated content ids=["0x<redacted>"]
(the <redacted> values weren't the same, but that seems irrelevant)
After restarting the Docker containers, the nodes kept working fine.
It seems to me that there is some deadlock happening, most likely in the same place (and most likely during gossiping), but further investigation is needed.
These lines in portalnet/src/gossip.rs look like potential candidates for the deadlock:
let permit = match utp_controller {
    Some(ref utp_controller) => match utp_controller.get_outbound_semaphore() {
        Some(permit) => Some(permit),
        None => continue,
    },
    None => None,
};
try_acquire_owned() doesn't block, so that rules out this candidate.
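For reference, a minimal sketch of why a try_acquire_owned-style call can't be the blocking point, assuming the semaphore behind get_outbound_semaphore() is a tokio::sync::Semaphore (an assumption on my part; the UtpController internals may differ):

use std::sync::Arc;
use tokio::sync::Semaphore;

#[tokio::main]
async fn main() {
    // Semaphore with a single permit, standing in for the outbound uTP limit.
    let semaphore = Arc::new(Semaphore::new(1));

    // Take the only permit and hold on to it.
    let _held = semaphore.clone().try_acquire_owned().unwrap();

    // With no permits left, try_acquire_owned() returns Err immediately
    // rather than waiting, so a caller that skips the peer on Err (like
    // the `None => continue` branch above) never blocks here.
    match semaphore.clone().try_acquire_owned() {
        Ok(_permit) => println!("got a permit"),
        Err(err) => println!("no permit available, returned immediately: {err}"),
    }
}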
Two more nodes got stuck, this time running a build that contains my last PR with extra log messages, so I was able to conclude that the deadlock is happening at this line: portalnet/src/gossip.rs#L61
let kbuckets = kbuckets.read();
The documentation says:
Note that attempts to recursively acquire a read lock on a RwLock when the current thread already holds one may result in a deadlock.
With that being said, it seems that either this thread already holds the read lock, or something else is deadlocked and holds the write lock indefinitely. I'm more inclined to think it's the former, considering that the last log message is always the same (otherwise I would expect something else to get stuck as well, leading to a different message).
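For illustration, a minimal sketch of that failure mode, assuming a parking_lot-style RwLock (fair locking, so a waiting writer blocks new readers); the kbuckets variable and timings here are made up for the demo and are not trin's actual code:

use parking_lot::RwLock;
use std::sync::Arc;
use std::thread;
use std::time::Duration;

fn main() {
    let kbuckets = Arc::new(RwLock::new(0u32));

    // Thread A: holds a read lock, then tries to acquire a second read
    // lock on the same RwLock while still holding the first one.
    let a = {
        let kbuckets = Arc::clone(&kbuckets);
        thread::spawn(move || {
            let first = kbuckets.read();
            // Give thread B time to queue a write lock in between.
            thread::sleep(Duration::from_millis(100));
            // Because a writer is now waiting, this second read blocks
            // behind it, while the writer blocks behind `first` -- a
            // deadlock. try_read_for is used only so the demo terminates.
            match kbuckets.try_read_for(Duration::from_secs(1)) {
                Some(second) => println!("second read ok: {} {}", *first, *second),
                None => println!("second read timed out: recursive read deadlock"),
            }
        })
    };

    // Thread B: requests a write lock while thread A still holds its read lock.
    let b = {
        let kbuckets = Arc::clone(&kbuckets);
        thread::spawn(move || {
            thread::sleep(Duration::from_millis(50));
            let mut guard = kbuckets.write();
            *guard += 1;
        })
    };

    a.join().unwrap();
    b.join().unwrap();
}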
We didn't observe any more deadlocks, so I will consider this fixed by #1458.