libp2p / rust-libp2p

The Rust Implementation of the libp2p networking stack.

Home Page: https://libp2p.io


Kademlia. New inbound substream to PeerId exceeds inbound substream limit. No older substream waiting to be reused.

shamil-gadelshin opened this issue

Summary

I use the start_providing Kademlia API.

Here is my error after 32 requests:

2022-10-20T09:54:15.538792Z  WARN libp2p_kad::handler: New inbound substream to PeerId("12D3KooWNwexVxD22CAdxv5CAkxrD3QgUqhycYAE1e1NH2sLbZ7v") exceeds inbound substream limit. No older substream waiting to be reused. Dropping new substream.    

This error is similar to the recent discussion: #2957

However, I have only 2 peers (local machine setup) and 3 seconds between start_providing requests.
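The call pattern is roughly the following (a hypothetical sketch, not the exact code from my project; `swarm` is assumed to be a `Swarm<Kademlia<MemoryStore>>` driven with tokio):

```rust
use std::time::Duration;

use futures::StreamExt;
use libp2p::kad::record::Key;

// Announce the same key roughly every 3 seconds while the Swarm is polled
// in the same select loop.
let mut interval = tokio::time::interval(Duration::from_secs(3));
loop {
    tokio::select! {
        _ = interval.tick() => {
            let key = Key::new(&b"example-key".to_vec());
            swarm.behaviour_mut().start_providing(key).expect("record store accepts the key");
        }
        event = swarm.select_next_some() => {
            println!("Swarm event: {event:?}");
        }
    }
}
```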

Here is my local inbound_streams buffer:

 ["WaitingUser: id=UniqueConnecId(11)", "WaitingUser: id=UniqueConnecId(13)", "WaitingUser: id=UniqueConnecId(16)", "WaitingUser: id=UniqueConnecId(18)", "WaitingUser: id=UniqueConnecId(20)", "WaitingUser: id=UniqueConnecId(22)", "WaitingUser: id=UniqueConnecId(24)", "WaitingUser: id=UniqueConnecId(26)", "WaitingUser: id=UniqueConnecId(28)", "WaitingUser: id=UniqueConnecId(3)", "WaitingUser: id=UniqueConnecId(30)", "WaitingUser: id=UniqueConnecId(32)", "WaitingUser: id=UniqueConnecId(34)", "WaitingUser: id=UniqueConnecId(36)", "WaitingUser: id=UniqueConnecId(39)", "WaitingUser: id=UniqueConnecId(41)", "WaitingUser: id=UniqueConnecId(43)", "WaitingUser: id=UniqueConnecId(45)", "WaitingUser: id=UniqueConnecId(47)", "WaitingUser: id=UniqueConnecId(49)", "WaitingUser: id=UniqueConnecId(5)", "WaitingUser: id=UniqueConnecId(51)", "WaitingUser: id=UniqueConnecId(53)", "WaitingUser: id=UniqueConnecId(55)", "WaitingUser: id=UniqueConnecId(57)", "WaitingUser: id=UniqueConnecId(59)", "WaitingUser: id=UniqueConnecId(61)", "WaitingUser: id=UniqueConnecId(63)", "WaitingUser: id=UniqueConnecId(65)", "WaitingUser: id=UniqueConnecId(67)", "WaitingUser: id=UniqueConnecId(7)", "WaitingUser: id=UniqueConnecId(9)"]    

It seems that I need to acknowledge some of the requests, but the API doesn't seem to expect this. Also, I can send hundreds of get_closest_peers requests with no errors.

When the error begins to manifest, the requesting peer gets these QueryStats:

QueryStats { requests: 1, success: 0, failure: 1, start: Some(Instant { t: 2021689972560 }), end: Some(Instant { t: 2021689984792 }) }

This indicates that the first of the two requests fails (it seems that start_providing issues a FindNode and then an AddProvider request), while numerous GetClosestPeers requests work fine. It doesn't make sense to me.

Also, when I increase the interval between start_providing API calls to 17 seconds, it doesn't seem to produce an error. I tried lowering the Kademlia query timeout from the default 60 seconds to just 1 second (I suspected some pending process), but it doesn't make a difference.
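The timeout change itself is just the usual KademliaConfig tweak (minimal sketch; `local_peer_id` is assumed to be defined elsewhere):

```rust
use std::time::Duration;

use libp2p::kad::record::store::MemoryStore;
use libp2p::kad::{Kademlia, KademliaConfig};

// Lower the query timeout from the default 60 s to 1 s.
let mut config = KademliaConfig::default();
config.set_query_timeout(Duration::from_secs(1));

let store = MemoryStore::new(local_peer_id.clone());
let kademlia = Kademlia::with_config(local_peer_id, store, config);
```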

Am I missing something?

Expected behaviour

I expect the local setup to handle multiple requests per second with no issues.

Actual behaviour

Debug Output

FindNodeReq: ConnectionId(2)
2022-10-20T11:43:57.299813Z  INFO subspace_networking::node_runner: Kademlia event: InboundRequest { request: FindNode { num_closer_peers: 1 } }
2022-10-20T11:43:57.300924Z DEBUG yamux::connection::stream: 44d40e66/139: eof    
2022-10-20T11:43:57.301396Z DEBUG multistream_select::listener_select: Listener: confirming protocol: /subspace/kad/0.1.0    
2022-10-20T11:43:57.301431Z DEBUG multistream_select::listener_select: Listener: sent confirmed protocol: /subspace/kad/0.1.0    
2022-10-20T11:43:57.301473Z DEBUG libp2p_core::upgrade::apply: Successfully applied negotiated protocol    
2022-10-20T11:43:57.302384Z  INFO subspace_networking::behavior::custom_record_store: New provider record added: ProviderRecord { key: Key(b"\0 \x89\x06;^\xab\xa0hP\xc4\xb6\xc9W\x1cFV\xc8|\xce\xb6\x04&M%\xe9 y\xe3\x9a\x89\x81[\xa6"), provider: PeerId("12D3KooWSPnggmEVXnPYDjaueqRm5EjC1cVejyaRUg2V6BThVxpR"), expires: Some(Instant { t: 4087037477371 }), addresses: ["/ip4/127.0.0.1/tcp/60010"] }
2022-10-20T11:43:57.302474Z  INFO subspace_networking::node_runner: Kademlia event: InboundRequest { request: AddProvider { record: None } }
2022-10-20T11:43:57.302482Z  INFO subspace_networking::node_runner: Add provider request received: None
2022-10-20T11:43:57.899984Z DEBUG libp2p_gossipsub::behaviour: Starting heartbeat    
2022-10-20T11:43:57.900226Z DEBUG libp2p_gossipsub::behaviour: Completed Heartbeat    
2022-10-20T11:43:58.899019Z DEBUG libp2p_gossipsub::behaviour: Starting heartbeat    
2022-10-20T11:43:58.899206Z DEBUG libp2p_gossipsub::behaviour: Completed Heartbeat    
2022-10-20T11:43:59.000034Z DEBUG multistream_select::listener_select: Listener: confirming protocol: /subspace/kad/0.1.0    
2022-10-20T11:43:59.000098Z DEBUG multistream_select::listener_select: Listener: sent confirmed protocol: /subspace/kad/0.1.0    
2022-10-20T11:43:59.000151Z DEBUG libp2p_core::upgrade::apply: Successfully applied negotiated protocol    
2022-10-20T11:43:59.000197Z  WARN libp2p_kad::handler: New inbound substream to PeerId("12D3KooWSPnggmEVXnPYDjaueqRm5EjC1cVejyaRUg2V6BThVxpR") exceeds inbound substream limit. No older substream waiting to be reused. Dropping new substream.    
2022-10-20T11:43:59.000496Z  WARN libp2p_kad::handler: Inbound streams: ["WaitingUser: id=UniqueConnecId(12)", "WaitingUser: id=UniqueConnecId(14)", "WaitingUser: id=UniqueConnecId(16)", "WaitingUser: id=UniqueConnecId(18)", "WaitingUser: id=UniqueConnecId(2)", "WaitingUser: id=UniqueConnecId(21)", "WaitingUser: id=UniqueConnecId(23)", "WaitingUser: id=UniqueConnecId(25)", "WaitingUser: id=UniqueConnecId(27)", "WaitingUser: id=UniqueConnecId(29)", "WaitingUser: id=UniqueConnecId(31)", "WaitingUser: id=UniqueConnecId(33)", "WaitingUser: id=UniqueConnecId(35)", "WaitingUser: id=UniqueConnecId(37)", "WaitingUser: id=UniqueConnecId(39)", "WaitingUser: id=UniqueConnecId(42)", "WaitingUser: id=UniqueConnecId(44)", "WaitingUser: id=UniqueConnecId(46)", "WaitingUser: id=UniqueConnecId(48)", "WaitingUser: id=UniqueConnecId(5)", "WaitingUser: id=UniqueConnecId(50)", "WaitingUser: id=UniqueConnecId(52)", "WaitingUser: id=UniqueConnecId(54)", "WaitingUser: id=UniqueConnecId(56)", "WaitingUser: id=UniqueConnecId(58)", "WaitingUser: id=UniqueConnecId(60)", "WaitingUser: id=UniqueConnecId(62)", "WaitingUser: id=UniqueConnecId(64)", "WaitingUser: id=UniqueConnecId(66)", "WaitingUser: id=UniqueConnecId(68)", "WaitingUser: id=UniqueConnecId(7)", "WaitingUser: id=UniqueConnecId(9)"]    
2022-10-20T11:43:59.898505Z DEBUG libp2p_gossipsub::behaviour: Starting heartbeat    
2022-10-20T11:43:59.898595Z DEBUG libp2p_gossipsub::behaviour: Completed Heartbeat    
2022-10-20T11:43:59.950796Z DEBUG subspace_networking::node_runner: Initiate connection to known peers local_peer_id=12D3L7AUynGJcx7Lb4SAo4bR1m78UYydB4TRy2Nre6vViLHPgEem connected_peers=1
2022-10-20T11:44:00.701763Z DEBUG multistream_select::listener_select: Listener: confirming protocol: /subspace/kad/0.1.0    
2022-10-20T11:44:00.701828Z DEBUG multistream_select::listener_select: Listener: sent confirmed protocol: /subspace/kad/0.1.0    
2022-10-20T11:44:00.701882Z DEBUG libp2p_core::upgrade::apply: Successfully applied negotiated protocol  

Possible Solution

Version

  • I use 0.46.1 in my own branch, but the latest 0.49.0 produces the same result.

Would you like to work on fixing this bug?

No

I mentioned this issue on the last community call. @mxinden @thomaseizinger You already had a similar discussion, but this issue arises under a much smaller load.

Thanks for reporting this! This definitely doesn't look right. I am putting it on my list to work on for next week :)

Are you polling the Swarm properly? Can you share the code that you are running?
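For reference, "polling properly" means the Swarm future is driven continuously and nothing slow blocks that same loop. A minimal sketch of such a loop (assuming `swarm` is your configured Swarm with the Kademlia behaviour):

```rust
use futures::StreamExt;
use libp2p::swarm::SwarmEvent;

// The Swarm only makes progress while it is being polled, so the event loop
// must keep spinning and must not await anything slow between iterations.
loop {
    match swarm.select_next_some().await {
        SwarmEvent::Behaviour(event) => println!("Kademlia event: {event:?}"),
        other => println!("Other swarm event: {other:?}"),
    }
}
```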

I looked into the code, and these substreams shouldn't really be sitting there idling for very long. They are waiting for the NetworkBehaviour to gather the data, which should be extremely quick, after which they can resume.

Maybe the main event loop of the behaviours is blocked for some reason? Can you reproduce the problem if you are just using the Kademlia behaviour and nothing else?

Also, can you try and test with #3074? I think that one is more correct in terms of task wake-up behaviour. Maybe that is the issue.

Friendly ping @shamil-gadelshin!

Does #3074 help at all?

I tried that. The issue stays. Sorry for the delayed response. I was going to debug it again and provide you with extensive feedback. I'm going to do it this week.

Thank you for your attention to this bug. I appreciate it.

Ah damn. Any chance you can provide a minimal example that produces the problem?

Sure. I will try debugging my code again to rule out a silly mistake and then try to reproduce the issue with a minimal setup.

Here is the minimal Kademlia setup that shows the difference between start_providing and put_record or get_closest_peers. It's possible that I misuse the Kademlia API, and I would appreciate a hint @thomaseizinger

Thank you! I'll have a look!

A friendly ping @thomaseizinger. Did you have a chance to look at the kad-example project? Does it reproduce the error?

Sorry, I haven't yet but I'll do so today!

Okay, I figured out what I think the issue is.

As per the spec, outbound substreams may be reused. Our implementation never does that (we only send one request per substream), but our inbound streams wait for additional messages on that stream and thus fill up the buffer.

There seems to be a bug where the implementation of the inbound stream does not detect that the other side has closed the stream, which should make it stop waiting for a message.

Damn, that is not the issue ...

Okay, I have a fix.

The issue was that we had substreams waiting for a response from the behaviour even though, for AddProvider, there is no response. So instead of the substream actually being reused, it sat in the WaitingUser state and never got answered.
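Roughly speaking, the inbound handling behaved like this (a simplified illustration, not the actual libp2p-kad handler code):

```rust
// Simplified illustration of the leak: every inbound request parked its
// substream until the behaviour produced an answer, but AddProvider never
// gets one, so those slots stayed occupied until the 32-substream limit.
enum RequestKind {
    FindNode,
    GetProviders,
    AddProvider,
}

enum InboundState {
    /// Substream parked until the behaviour produces a response.
    WaitingUser,
    /// No response expected; the substream can be reused or closed right away.
    Reusable,
}

fn state_after(request: RequestKind) -> InboundState {
    match request {
        // These requests have responses, so parking the substream is correct.
        RequestKind::FindNode | RequestKind::GetProviders => InboundState::WaitingUser,
        // Before the fix, this arm also ended up in WaitingUser and leaked the
        // slot; AddProvider is fire-and-forget, so the substream should become
        // reusable immediately.
        RequestKind::AddProvider => InboundState::Reusable,
    }
}
```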

I opened a fix here: #3152

With this patch, the example you provided no longer issues the warnings. Thanks for providing that example, it was really helpful in the debugging process!

I will test the fix in the test project and in our main project as well. Thanks a lot!