IntersectMBO / ouroboros-network

Specifications of network protocols and implementations of components running these protocols which support a family of Ouroboros Consensus protocols; the diffusion layer of the Cardano Node.

Home Page: https://ouroboros-network.cardano.intersectmbo.org

Concurrent `deactivatePeerConnection` calls result in TrConnectionExists error

bolt12 opened this issue

We have identified a bug related to issue #4555, which involves an erroneous transition of peer states and a multiplexer (mux) error due to concurrent calls to deactivatePeerConnection. The sequence of events leading to this issue is as follows:

  1. A DemoteHotPeer job is initiated.
  2. This triggers deactivatePeerConnection, resulting in the termination of hot mini-protocols.
  3. Concurrently, the peerMonitoringLoop detects the successful termination of a hot mini-protocol and also invokes deactivatePeerConnection.
  4. Both invocations of deactivatePeerConnection block on `timeout spsDeactivateTimeout (atomically $ awaitAllResults SingHot pchAppHandles)`, but fail due to a mux error.
  5. The failure of jobDemoteActivePeers leads to the execution of an error handler, which erroneously removes the peer from inProgressDemoteHot, active, and established sets.
  6. The peerMonitoringLoop encounters an exception, executing an error handler that incorrectly sets the peer to PeerCold status and rethrows the error.

Core Problem:
The peer is prematurely removed from relevant sets by jobDemoteActivePeers, while the connection persists. The peerMonitoringLoop then inappropriately transitions the peer directly to PeerCold status. This causes the peer to be momentarily forgotten and quickly relearned, leading the governor to attempt reconnection while the old connection still exists, triggering a TrConnectionExists trace.
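
To make the inconsistency concrete, here is a minimal pure model of it. All names in it (`GovernorView`, `failedDemotionHandler`, `wouldHitConnectionExists`) are hypothetical simplifications for illustration and are not the actual ouroboros-network types; the real governor state has many more fields.

```haskell
import qualified Data.Set as Set
import           Data.Set (Set)

-- Hypothetical, simplified stand-in for the governor's view of its peers.
type Peer = String

data GovernorView = GovernorView
  { established         :: Set Peer
  , active              :: Set Peer
  , inProgressDemoteHot :: Set Peer
  } deriving (Eq, Show)

-- Connections still held open, modelled as a plain set (an assumption).
type OpenConnections = Set Peer

-- The behaviour described in step 5: on a failed hot demotion the
-- handler drops the peer from all three sets.
failedDemotionHandler :: Peer -> GovernorView -> GovernorView
failedDemotionHandler p v = v
  { established         = Set.delete p (established v)
  , active              = Set.delete p (active v)
  , inProgressDemoteHot = Set.delete p (inProgressDemoteHot v)
  }

-- A new connection attempt conflicts when the governor no longer tracks
-- the peer as established while the old connection is still open.
wouldHitConnectionExists :: Peer -> GovernorView -> OpenConnections -> Bool
wouldHitConnectionExists p v conns =
  Set.notMember p (established v) && Set.member p conns

main :: IO ()
main = do
  let p      = "peer-a"
      before = GovernorView (Set.singleton p) (Set.singleton p) (Set.singleton p)
      conns  = Set.singleton p   -- the connection persists throughout
      after  = failedDemotionHandler p before
  print (wouldHitConnectionExists p before conns)  -- False
  print (wouldHitConnectionExists p after  conns)  -- True
```

The model only captures the bookkeeping mismatch; it does not reproduce the concurrency or the mux error itself.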

The Solution:

  • Revise the error handler in peerMonitoringLoop to invoke waitForOutboundDemotion before transitioning the peer to PeerCold status.
  • Refrain from removing the peer from sets upon a failed hot demotion, as this scenario should be managed by the peerMonitoringLoop.
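
The first point can be sketched with plain STM. This is an illustrative sketch under stated assumptions, not the actual patch: `Conn`, `connDemoted`, `waitForDemotion` and `onMonitorError` are hypothetical names, and `waitForDemotion` merely stands in for waitForOutboundDemotion, whose real signature is not shown in this issue.

```haskell
import Control.Concurrent     (forkIO, threadDelay)
import Control.Concurrent.STM

data PeerStatus = PeerHot | PeerWarm | PeerCold
  deriving (Eq, Show)

-- Hypothetical stand-in for a peer connection handle.
data Conn = Conn
  { connStatus  :: TVar PeerStatus
  , connDemoted :: TVar Bool  -- assumed to be set once the outbound side is released
  }

-- Hypothetical analogue of waitForOutboundDemotion: retries the
-- transaction until the demotion flag becomes True.
waitForDemotion :: Conn -> STM ()
waitForDemotion c = readTVar (connDemoted c) >>= check

-- Error handler in the spirit of the proposed fix: the peer is marked
-- PeerCold only after the outbound demotion has completed.
onMonitorError :: Conn -> IO ()
onMonitorError c = atomically $ do
  waitForDemotion c
  writeTVar (connStatus c) PeerCold

main :: IO ()
main = do
  c <- Conn <$> newTVarIO PeerHot <*> newTVarIO False
  -- Simulate the connection manager finishing the demotion a little later.
  _ <- forkIO $ do
    threadDelay 100000
    atomically $ writeTVar (connDemoted c) True
  onMonitorError c                    -- blocks until the flag is set
  readTVarIO (connStatus c) >>= print -- PeerCold
```

Because the status write sits in the same transaction as the wait, no other thread can observe PeerCold while the old connection is still considered outbound in this model.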