waku-org / js-waku

JavaScript implementation of Waku v2

Home Page:https://js.waku.org

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

bug: lightPush is not able to keep node connections

vpavlin opened this issue · comments

bug report

Problem

Similarly to my issues with Filter (#1923), when I keep a light-js example running for ~15 minutes with clean waku:peers localStorage, LightPush "breaks" - I get an error message when attempting to send in debug logs:

waku:warn:light-push No peers found. Ensure you have a connection to the network. +0ms

After reload it took less then 5 minutes to the lightnode to lose all the lightpush peers - started with 3

Screenshot from 2024-04-19 10-52-15

3 minutes later there was only 1 peer
Screenshot from 2024-04-19 10-54-21

And a minute later there were no peers.

It does not seem to recover from this state

Proposed Solutions

  • Ensure that js-waku attempts to reconnect to a peer after losing a connection

Notes

  • js-waku version: 0.0.25-fd60cc2.0

Did some investigating of this issue. Replicated locally and noticed some patterns that lead to connections closed or peers deleted from store:

libp2p:connection:outbound:4zjz071714024970277:error could not create new stream for protocols /ipfs/ping/1.0.0 +2s Error: unexpected end of input

libp2p:ping:error error while pinging 16Uiu2HAmQSMNExfUYUqfuXWkD5DaNZnMYnigRxFKbk3tcEFQeQeE +34s Error: unexpected end of input

waku:error:keep-alive Ping failed for peer (16Uiu2HAmQSMNExfUYUqfuXWkD5DaNZnMYnigRxFKbk3tcEFQeQeE).                Next ping will be attempted in 300 seconds.               +20m

libp2p:gossipsub connection ended 16Uiu2HAmQSMNExfUYUqfuXWkD5DaNZnMYnigRxFKbk3tcEFQeQeE +924ms
common.js:113

libp2p:gossipsub delete peer 16Uiu2HAmQSMNExfUYUqfuXWkD5DaNZnMYnigRxFKbk3tcEFQeQeE +0ms

This one is the most common:

waku:error:connection-manager Error dialing peer 16Uiu2HAmKAeC9VYRWWHwZxpjtPdf3bGeFg6aN2L1YYQmyc3FJRUU - The dial request has no valid addresses +1ms

connection_manager.js:201 waku:info:connection-manager Deleting undialable peer 16Uiu2HAmKAeC9VYRWWHwZxpjtPdf3bGeFg6aN2L1YYQmyc3FJRUU from peer store. Reason: The dial request has no valid addresses +1ms
libp2p:connection:outbound:6qf7pg1714026544431:error error handling incoming stream id r7 +0ms Too many inbound protocol streams for protocol "/ipfs/ping/1.0.0" - limit 2 ERR_TOO_MANY_INBOUND_PROTOCOL_STREAMS CodeError: Too many inbound protocol streams for protocol "/ipfs/ping/1.0.0" - limit 2
    at http://localhost:3000/dist/bundle.js:55349:41
    
libp2p:connection:outbound:6qf7pg1714026544431 closing connection to /dns4/node-01.ac-cn-hongkong-c.waku.test.statusim.net/tcp/8000/wss/p2p/16Uiu2HAkzHaTP5JsUwfR9NR8Rj9HC24puS6ocaU8wze4QrXr9iXp +588ms

Eventually I reached a state where there was a single peer left that I can send a message to using light push. The peer is NOT in the keepAliveManager.

The connection gets reset and the peer is removed....

libp2p:gossipsub:error CodeError: stream reset    at MplexStream.reset (http://localhost:3000/dist/bundle.js:39511:21)    at MplexStreamMuxer._handleIncoming (http://localhost:3000/dist/bundle.js:39919:28)    at MplexStreamMuxer.sink (http://localhost:3000/dist/bundle.js:39826:36)    at async Promise.all (index 0) +10m
common.js:113 libp2p:gossipsub connection ended 16Uiu2HAmSJvSJphxRdbnigUV5bjRRZFBhTtWFTSyiKaQByCjwmpV +10m
common.js:113 libp2p:gossipsub delete peer 16Uiu2HAmSJvSJphxRdbnigUV5bjRRZFBhTtWFTSyiKaQByCjwmpV +0ms

but the same peer quickly reconnects:

libp2p:mplex initiator stream with id 31 and protocol /vac/waku/relay/2.0.0 ended +2ms
common.js:113 libp2p:mplex new receiver stream 8 +188ms
common.js:113 libp2p:connection:outbound:8k5axg1714026544411 incoming stream opened on /vac/waku/relay/2.0.0 +3m
common.js:113 libp2p:gossipsub new peer 16Uiu2HAmSJvSJphxRdbnigUV5bjRRZFBhTtWFTSyiKaQByCjwmpV +191ms
common.js:113 libp2p:gossipsub Added peer has no IP in current address 16Uiu2HAmSJvSJphxRdbnigUV5bjRRZFBhTtWFTSyiKaQByCjwmpV /dns4/node-01.ac-cn-hongkong-c.waku.sandbox.status.im/tcp/8000/wss/p2p/16Uiu2HAmSJvSJphxRdbnigUV5bjRRZFBhTtWFTSyiKaQByCjwmpV +0ms
common.js:113 libp2p:gossipsub create inbound stream 16Uiu2HAmSJvSJphxRdbnigUV5bjRRZFBhTtWFTSyiKaQByCjwmpV +0ms
common.js:113 libp2p:connection:outbound:8k5axg1714026544411 starting new stream for protocols /vac/waku/relay/2.0.0 +1ms

and I can keep sending messages. It's been a little while and this last peer still hasn't been lost. I'm curious if the peer being absent from waku.connectionManager.keepAliveManager has something to do with it.

I while leave the light-js example running overnight and check on it again. Let me know if any of these logs are helpful @danisharora099 @weboko

I modified connection manager such that it doesn't delete the dial error after a peer is removed. I waited to reach the state where there were no more peers to send a light push, and found that every single dial error was the same:

[
    "The dial request has no valid addresses",
    "The dial request has no valid addresses",
    ...
    "The dial request has no valid addresses",
    "The dial request has no valid addresses"
]

This seems odd. I understand it should be the case for some of the peers, but for 133 of them to result in this error seems excessive.

I modified connection manager such that it doesn't delete the dial error after a peer is removed. I waited to reach the state where there were no more peers to send a light push, and found that every single dial error was the same:

[
    "The dial request has no valid addresses",
    "The dial request has no valid addresses",
    ...
    "The dial request has no valid addresses",
    "The dial request has no valid addresses"
]

This seems odd. I understand it should be the case for some of the peers, but for 133 of them to result in this error seems excessive.

It seems that most peers are not running wss, in which case this is expected. I have a PR open for this but I want to do some additional testing before merging: #1983

Something else I discovered is that for some reason the peer store "loses" the metadata for a peer, specifically the shard info. Without this, the function called to getPeers for light push filters out all of the peers because it thinks they don't support autosharding

Something else I discovered is that for some reason the peer store "loses" the metadata for a peer, specifically the shard info. Without this, the function called to getPeers for light push filters out all of the peers because it thinks they don't support autosharding

I believe I have found the culprit, but like the PR above it probably needs more testing #1984

I believe I identified the prime suspect causing this issue (and dropping connections in general):

libp2p:connection:outbound:6qf7pg1714026544431:error error handling incoming stream id r7 +0ms Too many inbound protocol streams for protocol "/ipfs/ping/1.0.0" - limit 2 ERR_TOO_MANY_INBOUND_PROTOCOL_STREAMS CodeError: Too many inbound protocol streams for protocol "/ipfs/ping/1.0.0" - limit 2
    at http://localhost:3000/dist/bundle.js:55349:41
    
libp2p:connection:outbound:6qf7pg1714026544431 closing connection to /dns4/node-01.ac-cn-hongkong-c.waku.test.statusim.net/tcp/8000/wss/p2p/16Uiu2HAkzHaTP5JsUwfR9NR8Rj9HC24puS6ocaU8wze4QrXr9iXp +588ms

It seems what happens is:

  1. peer tries to ping js-waku
  2. there are already maximum amount of inbound streams opened from the peer to the node for the ping protocol
  3. libp2p closes the entire connection?

I believe I identified the prime suspect causing this issue (and dropping connections in general):

libp2p:connection:outbound:6qf7pg1714026544431:error error handling incoming stream id r7 +0ms Too many inbound protocol streams for protocol "/ipfs/ping/1.0.0" - limit 2 ERR_TOO_MANY_INBOUND_PROTOCOL_STREAMS CodeError: Too many inbound protocol streams for protocol "/ipfs/ping/1.0.0" - limit 2
    at http://localhost:3000/dist/bundle.js:55349:41
    
libp2p:connection:outbound:6qf7pg1714026544431 closing connection to /dns4/node-01.ac-cn-hongkong-c.waku.test.statusim.net/tcp/8000/wss/p2p/16Uiu2HAkzHaTP5JsUwfR9NR8Rj9HC24puS6ocaU8wze4QrXr9iXp +588ms

It seems what happens is:

  1. peer tries to ping js-waku
  2. there are already maximum amount of inbound streams opened from the peer to the node for the ping protocol
  3. libp2p closes the entire connection?

A mitigation for this that seems to work is increasing the maximum number of incoming connections for the ping protocol to 10 from the default of 2. I tried this with the light-js example and it's been running for 40 minutes without dropping below 9 peers, sending a light push message every 10 seconds. I'm not sure what are the performance implications of increasing this number but it seems preferable to the node losing peer connections and not recovering them.

The most important fixes for this have been included in the latest release of js-waku. The remaining PR for filtering wss connections is not as relevant. @weboko can this be considered resolved?

Agree that this task is complete. If we observe any issues - let's action in the scope of some other bug / task.
Thank you for documenting investigation well!