JRS tests are failing with branching after restart
imalygin opened this issue
Description
Several tests are failing fairly consistently with branching after a restart.
Here are the examples:
Crypto-Update-Setting-1.5k-15m
- data directory
Crypto-Update-Jar-1.5k-25m
- data directory
@alittley:
It can be explained with the following sequence of events:
1. node3 creates an event X, which is gossiped out and written to PCES by at least 1 node in the network, but NOT by node3
2. the network is restarted without a freeze
3. node3 comes back online well before other nodes, and thus cannot gossip for the full duration of the OBSERVING status, as is otherwise intended
4. node3 transitions to CHECKING, having not yet received event X
5. node3 creates event X' and gossips it out
6. the network (including node3 itself) correctly determines that node3 branched
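The sequence above can be sketched as a toy simulation. This is only an illustration under stated assumptions, not the platform's code: `Event`, `BranchDetector`, `latest_self_event`, and `simulate_restart` are hypothetical names, PCES is modeled as a plain list of events, and "branching" is simplified to "one creator produces two different events on top of the same self-parent".

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Event:
    """Toy event (hypothetical stand-in, not the platform's event class)."""
    event_id: str
    creator: str
    self_parent: Optional[str]  # id of the creator's previous event, if any


class BranchDetector:
    """Flags a creator that builds two different events on the same self-parent."""

    def __init__(self) -> None:
        # (creator, self_parent) -> id of the first event seen with that pair
        self._first_child: Dict[Tuple[str, Optional[str]], str] = {}

    def observe(self, event: Event) -> bool:
        """Return True if this event branches from one seen earlier."""
        key = (event.creator, event.self_parent)
        first = self._first_child.setdefault(key, event.event_id)
        return first != event.event_id


def latest_self_event(pces: List[Event], creator: str) -> Optional[str]:
    """Return the id of the creator's last event in a replayed stream, or None."""
    latest = None
    for event in pces:
        if event.creator == creator:
            latest = event.event_id
    return latest


def simulate_restart() -> Tuple[Optional[str], List[bool]]:
    p = Event("P", "node3", None)
    x = Event("X", "node3", "P")

    # Step 1: X reaches a peer's PCES but is missing from node3's own PCES.
    node3_pces = [p]
    peer_pces = [p, x]

    # Steps 2-5: after the restart node3 only knows P, so it builds X' on P.
    parent = latest_self_event(node3_pces, "node3")
    x_prime = Event("X'", "node3", parent)

    # Step 6: a peer that replayed X and then receives X' sees the branch.
    detector = BranchDetector()
    results = [detector.observe(e) for e in peer_pces]
    results.append(detector.observe(x_prime))
    return parent, results
```

In this sketch only the last observation is flagged: X and X' share the creator node3 and the self-parent P. The real detection logic is likely more involved; the point is just that an event missing from the creator's own PCES is enough to produce the branch.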
Steps to reproduce
Run Crypto-Update-Setting-1.5k-15m
or Crypto-Update-Jar-1.5k-25m
Additional context
No response
Hedera network
other
Version
v0.51
Operating system
None
My original description (duplicated in the body of the ticket) may not be correct. In another test failure, the branching still occurs shortly after startup, but the node that starts branching isn't the first node to reach CHECKING. By the time the node branches, all nodes in the network had already been gossiping for some time, so I'm not sure why the branching node didn't hear about its previous self event.
Another addition:
I don't think nodes not freezing before an update is actually playing a role. I found the pre-update logs for these tests (which I didn't previously know existed), and nodes are freezing before restarting.
Instead, I now suspect that it's just post-freeze signature events that are causing the branching.