bug: lightPush is not able to keep node connections

Question

bug: lightPush is not able to keep node connections

vpavlin opened this issue 3 months ago · comments

bug report

Problem

Similarly to my issues with Filter (#1923), when I keep a light-js example running for ~15 minutes with clean waku:peers localStorage, LightPush "breaks" - I get an error message when attempting to send in debug logs:

waku:warn:light-push No peers found. Ensure you have a connection to the network. +0ms

After reload it took less then 5 minutes to the lightnode to lose all the lightpush peers - started with 3

3 minutes later there was only 1 peer

And a minute later there were no peers.

It does not seem to recover from this state

Proposed Solutions

Ensure that js-waku attempts to reconnect to a peer after losing a connection

Notes

js-waku version: 0.0.25-fd60cc2.0

Arseniy Klempner · Answer 1 · Thu Apr 25 2024 15:17:01 GMT+0800 (China Standard Time)

Did some investigating of this issue. Replicated locally and noticed some patterns that lead to connections closed or peers deleted from store:

libp2p:connection:outbound:4zjz071714024970277:error could not create new stream for protocols /ipfs/ping/1.0.0 +2s Error: unexpected end of input

libp2p:ping:error error while pinging 16Uiu2HAmQSMNExfUYUqfuXWkD5DaNZnMYnigRxFKbk3tcEFQeQeE +34s Error: unexpected end of input

waku:error:keep-alive Ping failed for peer (16Uiu2HAmQSMNExfUYUqfuXWkD5DaNZnMYnigRxFKbk3tcEFQeQeE).                Next ping will be attempted in 300 seconds.               +20m

libp2p:gossipsub connection ended 16Uiu2HAmQSMNExfUYUqfuXWkD5DaNZnMYnigRxFKbk3tcEFQeQeE +924ms
common.js:113

libp2p:gossipsub delete peer 16Uiu2HAmQSMNExfUYUqfuXWkD5DaNZnMYnigRxFKbk3tcEFQeQeE +0ms

This one is the most common:

waku:error:connection-manager Error dialing peer 16Uiu2HAmKAeC9VYRWWHwZxpjtPdf3bGeFg6aN2L1YYQmyc3FJRUU - The dial request has no valid addresses +1ms

connection_manager.js:201 waku:info:connection-manager Deleting undialable peer 16Uiu2HAmKAeC9VYRWWHwZxpjtPdf3bGeFg6aN2L1YYQmyc3FJRUU from peer store. Reason: The dial request has no valid addresses +1ms

libp2p:connection:outbound:6qf7pg1714026544431:error error handling incoming stream id r7 +0ms Too many inbound protocol streams for protocol "/ipfs/ping/1.0.0" - limit 2 ERR_TOO_MANY_INBOUND_PROTOCOL_STREAMS CodeError: Too many inbound protocol streams for protocol "/ipfs/ping/1.0.0" - limit 2
    at http://localhost:3000/dist/bundle.js:55349:41
    
libp2p:connection:outbound:6qf7pg1714026544431 closing connection to /dns4/node-01.ac-cn-hongkong-c.waku.test.statusim.net/tcp/8000/wss/p2p/16Uiu2HAkzHaTP5JsUwfR9NR8Rj9HC24puS6ocaU8wze4QrXr9iXp +588ms

Eventually I reached a state where there was a single peer left that I can send a message to using light push. The peer is NOT in the keepAliveManager.

The connection gets reset and the peer is removed....

libp2p:gossipsub:error CodeError: stream reset    at MplexStream.reset (http://localhost:3000/dist/bundle.js:39511:21)    at MplexStreamMuxer._handleIncoming (http://localhost:3000/dist/bundle.js:39919:28)    at MplexStreamMuxer.sink (http://localhost:3000/dist/bundle.js:39826:36)    at async Promise.all (index 0) +10m
common.js:113 libp2p:gossipsub connection ended 16Uiu2HAmSJvSJphxRdbnigUV5bjRRZFBhTtWFTSyiKaQByCjwmpV +10m
common.js:113 libp2p:gossipsub delete peer 16Uiu2HAmSJvSJphxRdbnigUV5bjRRZFBhTtWFTSyiKaQByCjwmpV +0ms

but the same peer quickly reconnects:

libp2p:mplex initiator stream with id 31 and protocol /vac/waku/relay/2.0.0 ended +2ms
common.js:113 libp2p:mplex new receiver stream 8 +188ms
common.js:113 libp2p:connection:outbound:8k5axg1714026544411 incoming stream opened on /vac/waku/relay/2.0.0 +3m
common.js:113 libp2p:gossipsub new peer 16Uiu2HAmSJvSJphxRdbnigUV5bjRRZFBhTtWFTSyiKaQByCjwmpV +191ms
common.js:113 libp2p:gossipsub Added peer has no IP in current address 16Uiu2HAmSJvSJphxRdbnigUV5bjRRZFBhTtWFTSyiKaQByCjwmpV /dns4/node-01.ac-cn-hongkong-c.waku.sandbox.status.im/tcp/8000/wss/p2p/16Uiu2HAmSJvSJphxRdbnigUV5bjRRZFBhTtWFTSyiKaQByCjwmpV +0ms
common.js:113 libp2p:gossipsub create inbound stream 16Uiu2HAmSJvSJphxRdbnigUV5bjRRZFBhTtWFTSyiKaQByCjwmpV +0ms
common.js:113 libp2p:connection:outbound:8k5axg1714026544411 starting new stream for protocols /vac/waku/relay/2.0.0 +1ms

and I can keep sending messages. It's been a little while and this last peer still hasn't been lost. I'm curious if the peer being absent from waku.connectionManager.keepAliveManager has something to do with it.

I while leave the light-js example running overnight and check on it again. Let me know if any of these logs are helpful @danisharora099 @weboko

Arseniy Klempner · Answer 2 · Fri Apr 26 2024 02:54:45 GMT+0800 (China Standard Time)

I modified connection manager such that it doesn't delete the dial error after a peer is removed. I waited to reach the state where there were no more peers to send a light push, and found that every single dial error was the same:

[
    "The dial request has no valid addresses",
    "The dial request has no valid addresses",
    ...
    "The dial request has no valid addresses",
    "The dial request has no valid addresses"
]

This seems odd. I understand it should be the case for some of the peers, but for 133 of them to result in this error seems excessive.

Arseniy Klempner · Answer 3 · Fri Apr 26 2024 14:10:42 GMT+0800 (China Standard Time)

I modified connection manager such that it doesn't delete the dial error after a peer is removed. I waited to reach the state where there were no more peers to send a light push, and found that every single dial error was the same:
[
    "The dial request has no valid addresses",
    "The dial request has no valid addresses",
    ...
    "The dial request has no valid addresses",
    "The dial request has no valid addresses"
]
This seems odd. I understand it should be the case for some of the peers, but for 133 of them to result in this error seems excessive.

It seems that most peers are not running wss, in which case this is expected. I have a PR open for this but I want to do some additional testing before merging: #1983

Something else I discovered is that for some reason the peer store "loses" the metadata for a peer, specifically the shard info. Without this, the function called to getPeers for light push filters out all of the peers because it thinks they don't support autosharding

Arseniy Klempner · Answer 4 · Fri Apr 26 2024 14:37:54 GMT+0800 (China Standard Time)

Something else I discovered is that for some reason the peer store "loses" the metadata for a peer, specifically the shard info. Without this, the function called to getPeers for light push filters out all of the peers because it thinks they don't support autosharding

I believe I have found the culprit, but like the PR above it probably needs more testing #1984

Arseniy Klempner · Answer 5 · Sat Apr 27 2024 06:01:35 GMT+0800 (China Standard Time)

I believe I identified the prime suspect causing this issue (and dropping connections in general):

libp2p:connection:outbound:6qf7pg1714026544431:error error handling incoming stream id r7 +0ms Too many inbound protocol streams for protocol "/ipfs/ping/1.0.0" - limit 2 ERR_TOO_MANY_INBOUND_PROTOCOL_STREAMS CodeError: Too many inbound protocol streams for protocol "/ipfs/ping/1.0.0" - limit 2
    at http://localhost:3000/dist/bundle.js:55349:41
    
libp2p:connection:outbound:6qf7pg1714026544431 closing connection to /dns4/node-01.ac-cn-hongkong-c.waku.test.statusim.net/tcp/8000/wss/p2p/16Uiu2HAkzHaTP5JsUwfR9NR8Rj9HC24puS6ocaU8wze4QrXr9iXp +588ms

It seems what happens is:

peer tries to ping js-waku
there are already maximum amount of inbound streams opened from the peer to the node for the ping protocol
libp2p closes the entire connection?

Arseniy Klempner · Answer 6 · Sat Apr 27 2024 08:02:15 GMT+0800 (China Standard Time)

I believe I identified the prime suspect causing this issue (and dropping connections in general):
libp2p:connection:outbound:6qf7pg1714026544431:error error handling incoming stream id r7 +0ms Too many inbound protocol streams for protocol "/ipfs/ping/1.0.0" - limit 2 ERR_TOO_MANY_INBOUND_PROTOCOL_STREAMS CodeError: Too many inbound protocol streams for protocol "/ipfs/ping/1.0.0" - limit 2
    at http://localhost:3000/dist/bundle.js:55349:41
    
libp2p:connection:outbound:6qf7pg1714026544431 closing connection to /dns4/node-01.ac-cn-hongkong-c.waku.test.statusim.net/tcp/8000/wss/p2p/16Uiu2HAkzHaTP5JsUwfR9NR8Rj9HC24puS6ocaU8wze4QrXr9iXp +588ms
It seems what happens is:

peer tries to ping js-waku

there are already maximum amount of inbound streams opened from the peer to the node for the ping protocol

libp2p closes the entire connection?

A mitigation for this that seems to work is increasing the maximum number of incoming connections for the ping protocol to 10 from the default of 2. I tried this with the light-js example and it's been running for 40 minutes without dropping below 9 peers, sending a light push message every 10 seconds. I'm not sure what are the performance implications of increasing this number but it seems preferable to the node losing peer connections and not recovering them.

Arseniy Klempner · Answer 7 · Wed May 01 2024 12:40:07 GMT+0800 (China Standard Time)

The most important fixes for this have been included in the latest release of js-waku. The remaining PR for filtering wss connections is not as relevant. @weboko can this be considered resolved?

Sasha · Answer 8 · Wed May 15 2024 05:48:53 GMT+0800 (China Standard Time)

Agree that this task is complete. If we observe any issues - let's action in the scope of some other bug / task.
Thank you for documenting investigation well!