Websockets hanging
skzap opened this issue
It seems some people are seeing an error where the node's consensus hangs, rarely and randomly. According to the logs, it's because the node never reaches the 2/3+ threshold to pass consensus. I suspect that some websockets are left hanging (as if the network cable had been cut) and are not properly terminated and re-opened, which is crucial for consensus. If a node thinks an active leader is online when that leader isn't actually validating blocks, we end up in a situation where an observer node just gets stuck and an active leader node forks onto its own chain.
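To illustrate why a few silent leaders are enough to stall things, here is a minimal sketch of a threshold check. It is not the node's actual code: the function name and parameters are made up for illustration, and I'm assuming "2/3+" means strictly more than two thirds of the active leaders.

```python
def reaches_threshold(confirmations: int, active_leaders: int) -> bool:
    """Return True when strictly more than 2/3 of active leaders confirmed.

    Hypothetical helper: assumes "2/3+" means a strict supermajority.
    """
    # Integer arithmetic avoids floating point rounding at the boundary.
    return 3 * confirmations > 2 * active_leaders


# With 9 active leaders, 7 confirmations pass but 6 do not, so three
# leaders that look online but never validate are enough to block consensus.
print(reaches_threshold(7, 9))
print(reaches_threshold(6, 9))
```

The point is that leaders on hung websockets still count toward the total while contributing no confirmations.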
I'm creating a websocket-terminate branch that already tracks websocket health. The data is visible in /peers from the API, e.g. curl http://localhost:3001/peers | jq ".[].lastMessageTime"
I will try to verify that the theory is correct: hung websockets should show up in /peers with a very long lastMessageTime.
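A rough sketch of how that check could be automated once the /peers JSON has been fetched. The big assumption here is that lastMessageTime is a Unix timestamp in milliseconds; the issue doesn't say, so adjust if it is actually a duration. The function name, the "id" field and the threshold are hypothetical.

```python
import time
from typing import Any, Optional


def stale_peers(
    peers: list[dict[str, Any]],
    max_silence_ms: int,
    now_ms: Optional[int] = None,
) -> list[dict[str, Any]]:
    """Return the peers whose last message is older than max_silence_ms.

    Assumes each peer dict carries "lastMessageTime" as a Unix timestamp
    in milliseconds (an assumption, not confirmed by the issue).
    """
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    # A peer with no lastMessageTime is treated as never heard from,
    # so it defaults to timestamp 0 and is reported as stale.
    return [
        p for p in peers
        if now_ms - p.get("lastMessageTime", 0) > max_silence_ms
    ]


# Example with made-up data: peer "a" has been silent for 9 s, peer "b" for 1 s.
peers = [
    {"id": "a", "lastMessageTime": 1_000},
    {"id": "b", "lastMessageTime": 9_000},
]
suspects = stale_peers(peers, max_silence_ms=5_000, now_ms=10_000)
print([p["id"] for p in suspects])
```

Any peer returned here would be a candidate for terminating and re-opening the websocket.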
It seems to have gotten better since the memory-fix branch was merged into master. Closing for now, but this might become an issue again in the future. Maybe it was related to a bad node that's now gone from consensus.