Concurrent `deactivatePeerConnection` calls result in TrConnectionExists error
bolt12 opened this issue · comments
We have identified a bug related to issue #4555, which involves an erroneous transition of peer states and a multiplexer (mux) error due to concurrent calls to `deactivatePeerConnection`. The sequence of events leading to this issue is as follows:
- A `DemoteHotPeer` job is initiated.
- This triggers `deactivatePeerConnection`, resulting in the termination of hot mini-protocols.
- Concurrently, the `peerMonitoringLoop` detects the successful termination of a hot mini-protocol and also invokes `deactivatePeerConnection`.
- Both invocations of `deactivatePeerConnection` block, awaiting the timeout `spsDeactivateTimeout (atomically $ awaitAllResults SingHot pchAppHandles)`, but fail due to a mux error.
- The failure of `jobDemoteActivePeers` leads to the execution of an error handler, which erroneously removes the peer from the `inProgressDemoteHot`, active, and established sets.
- The `peerMonitoringLoop` encounters an exception, executing an error handler that incorrectly sets the peer to `PeerCold` status and rethrows the error.
Core Problem:
The peer is prematurely removed from the relevant sets by `jobDemoteActivePeers` while the connection persists. The `peerMonitoringLoop` then inappropriately transitions the peer directly to `PeerCold` status. This causes the peer to be momentarily forgotten and quickly relearned, leading the governor to attempt reconnection while the old connection still exists, triggering a `TrConnectionExists` trace.
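To make the failure mode concrete, here is a minimal pure model of the two error handlers described above. It is only a sketch: the `Model` record, its fields, and the handler functions are hypothetical names invented for illustration and do not mirror the real types in the code base. The only identifiers borrowed from the issue are the `PeerHot`/`PeerCold` status names.

```haskell
-- Toy model only. All types and functions here are hypothetical and are
-- NOT the real definitions from the networking code.

data PeerStatus = PeerHot | PeerWarm | PeerCold
  deriving (Eq, Show)

data Model = Model
  { governorKnowsPeer :: Bool        -- peer is still in the active/established sets
  , peerStatus        :: PeerStatus  -- status recorded by the monitoring loop
  , connectionOpen    :: Bool        -- the old connection has not been torn down yet
  } deriving (Eq, Show)

-- A hot peer with a live connection, just before the demotion starts.
initial :: Model
initial = Model { governorKnowsPeer = True, peerStatus = PeerHot, connectionOpen = True }

-- Error handler of the failed demotion job, as described in the issue:
-- it drops the peer from the sets although the connection persists.
demoteJobErrorHandler :: Model -> Model
demoteJobErrorHandler m = m { governorKnowsPeer = False }

-- Error handler of the monitoring loop, as described in the issue:
-- it jumps straight to PeerCold without waiting for the connection to close.
monitorErrorHandler :: Model -> Model
monitorErrorHandler m = m { peerStatus = PeerCold }

-- A reconnection attempt collides with the old connection when the peer
-- looks forgotten and cold while the connection is still open.
reconnectHitsExistingConnection :: Model -> Bool
reconnectHitsExistingConnection m =
  not (governorKnowsPeer m) && peerStatus m == PeerCold && connectionOpen m

main :: IO ()
main = print (reconnectHitsExistingConnection (monitorErrorHandler (demoteJobErrorHandler initial)))
```

Running this prints `True`: after both handlers have fired, the model is in exactly the state in which a reconnection attempt would find the old connection still present.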
The Solution:
- Revise the error handler in `peerMonitoringLoop` to invoke `waitForOutboundDemotion` before transitioning the peer to `PeerCold` status.
- Refrain from removing the peer from sets upon a failed hot demotion, as this scenario should be managed by the `peerMonitoringLoop`.
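The first point of the proposed solution could look roughly like the sketch below. It is a stand-alone illustration, not the actual patch: `waitForOutboundDemotionStub` is a hypothetical stand-in for `waitForOutboundDemotion` (whose real signature is not given in this issue), the connection is modelled as a plain `TVar Bool`, and `monitorWithFix` is an invented name for the wrapped loop body.

```haskell
import Control.Concurrent (forkIO, threadDelay)
import Control.Exception (ErrorCall (..), SomeException, catch, throwIO, try)
import GHC.Conc (STM, TVar, atomically, newTVarIO, readTVar, readTVarIO, retry, writeTVar)

data PeerStatus = PeerHot | PeerWarm | PeerCold
  deriving (Eq, Show)

-- Hypothetical stand-in for waitForOutboundDemotion: block until the
-- (modelled) connection is no longer open.
waitForOutboundDemotionStub :: TVar Bool -> STM ()
waitForOutboundDemotionStub connOpen = do
  open <- readTVar connOpen
  if open then retry else pure ()

-- Run the loop body; on any exception, wait for the connection to go away
-- BEFORE marking the peer cold, then rethrow the original error.
monitorWithFix :: TVar Bool -> TVar PeerStatus -> IO () -> IO ()
monitorWithFix connOpen status body =
  body `catch` \e -> do
    atomically (waitForOutboundDemotionStub connOpen)
    atomically (writeTVar status PeerCold)
    throwIO (e :: SomeException)

main :: IO ()
main = do
  connOpen <- newTVarIO True
  status   <- newTVarIO PeerHot
  -- Simulate the connection being torn down shortly after the failure.
  _ <- forkIO (threadDelay 10000 >> atomically (writeTVar connOpen False))
  result <- try (monitorWithFix connOpen status (throwIO (ErrorCall "mux error")))
              :: IO (Either SomeException ())
  finalStatus <- readTVarIO status
  stillOpen   <- readTVarIO connOpen
  putStrLn (either (const "error rethrown") (const "no error") result)
  print (finalStatus, stillOpen)
```

In this model the status only becomes `PeerCold` once the connection flag has been cleared, so the peer is never reported cold while the old connection is still open. The error is still rethrown, matching the behaviour of the existing handler.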