libp2p / rust-libp2p

The Rust Implementation of the libp2p networking stack.

Home Page: https://libp2p.io


Kademlia. New inbound substream to PeerId exceeds inbound substream limit. No older substream waiting to be reused.

shamil-gadelshin opened this issue

Summary

I use the start_providing Kademlia API.

Here is my error after 32 requests:

2022-10-20T09:54:15.538792Z  WARN libp2p_kad::handler: New inbound substream to PeerId("12D3KooWNwexVxD22CAdxv5CAkxrD3QgUqhycYAE1e1NH2sLbZ7v") exceeds inbound substream limit. No older substream waiting to be reused. Dropping new substream.    

This error is similar to the recent discussion: #2957

However, I have only 2 peers (local machine setup) and 3 seconds between start_providing requests.
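The call pattern is roughly the following (a hypothetical sketch, not the exact code from my project; `swarm` is assumed to be a `Swarm<Kademlia<MemoryStore>>` driven with tokio):

```rust
use std::time::Duration;

use futures::StreamExt;
use libp2p::kad::record::Key;

// Announce the same key roughly every 3 seconds while the Swarm is polled
// in the same select loop.
let mut interval = tokio::time::interval(Duration::from_secs(3));
loop {
    tokio::select! {
        _ = interval.tick() => {
            let key = Key::new(&b"example-key".to_vec());
            swarm.behaviour_mut().start_providing(key).expect("record store accepts the key");
        }
        event = swarm.select_next_some() => {
            println!("Swarm event: {event:?}");
        }
    }
}
```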

Here is my local inbound_streams buffer:

 ["WaitingUser: id=UniqueConnecId(11)", "WaitingUser: id=UniqueConnecId(13)", "WaitingUser: id=UniqueConnecId(16)", "WaitingUser: id=UniqueConnecId(18)", "WaitingUser: id=UniqueConnecId(20)", "WaitingUser: id=UniqueConnecId(22)", "WaitingUser: id=UniqueConnecId(24)", "WaitingUser: id=UniqueConnecId(26)", "WaitingUser: id=UniqueConnecId(28)", "WaitingUser: id=UniqueConnecId(3)", "WaitingUser: id=UniqueConnecId(30)", "WaitingUser: id=UniqueConnecId(32)", "WaitingUser: id=UniqueConnecId(34)", "WaitingUser: id=UniqueConnecId(36)", "WaitingUser: id=UniqueConnecId(39)", "WaitingUser: id=UniqueConnecId(41)", "WaitingUser: id=UniqueConnecId(43)", "WaitingUser: id=UniqueConnecId(45)", "WaitingUser: id=UniqueConnecId(47)", "WaitingUser: id=UniqueConnecId(49)", "WaitingUser: id=UniqueConnecId(5)", "WaitingUser: id=UniqueConnecId(51)", "WaitingUser: id=UniqueConnecId(53)", "WaitingUser: id=UniqueConnecId(55)", "WaitingUser: id=UniqueConnecId(57)", "WaitingUser: id=UniqueConnecId(59)", "WaitingUser: id=UniqueConnecId(61)", "WaitingUser: id=UniqueConnecId(63)", "WaitingUser: id=UniqueConnecId(65)", "WaitingUser: id=UniqueConnecId(67)", "WaitingUser: id=UniqueConnecId(7)", "WaitingUser: id=UniqueConnecId(9)"]    

It seems that I need to acknowledge some of the requests, but the API doesn't seem to expect this. Also, I can send hundreds of get_closest_peers requests with no errors.

When the error begins to manifest, the requesting peer gets these QueryStats:

QueryStats { requests: 1, success: 0, failure: 1, start: Some(Instant { t: 2021689972560 }), end: Some(Instant { t: 2021689984792 }) }

This indicates that the first of the two requests fails (it seems that start_providing issues a FindNode and then an AddProvider request), while numerous GetClosestPeers requests work fine. It doesn't make sense to me.

Also, when I increase the interval between start_providing API calls to 17 seconds, it doesn't seem to produce an error. I tried lowering the Kademlia query timeout from the default 60 seconds to just 1 second (I suspected some pending process), but it doesn't make a difference.
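The timeout change itself is just the usual KademliaConfig tweak (minimal sketch; `local_peer_id` is assumed to be defined elsewhere):

```rust
use std::time::Duration;

use libp2p::kad::record::store::MemoryStore;
use libp2p::kad::{Kademlia, KademliaConfig};

// Lower the query timeout from the default 60 s to 1 s.
let mut config = KademliaConfig::default();
config.set_query_timeout(Duration::from_secs(1));

let store = MemoryStore::new(local_peer_id.clone());
let kademlia = Kademlia::with_config(local_peer_id, store, config);
```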

Am I missing something?

Expected behaviour

I expect the local setup to handle multiple requests per second with no issues.

Actual behaviour

Debug Output

FindNodeReq: ConnectionId(2)
2022-10-20T11:43:57.299813Z  INFO subspace_networking::node_runner: Kademlia event: InboundRequest { request: FindNode { num_closer_peers: 1 } }
2022-10-20T11:43:57.300924Z DEBUG yamux::connection::stream: 44d40e66/139: eof    
2022-10-20T11:43:57.301396Z DEBUG multistream_select::listener_select: Listener: confirming protocol: /subspace/kad/0.1.0    
2022-10-20T11:43:57.301431Z DEBUG multistream_select::listener_select: Listener: sent confirmed protocol: /subspace/kad/0.1.0    
2022-10-20T11:43:57.301473Z DEBUG libp2p_core::upgrade::apply: Successfully applied negotiated protocol    
2022-10-20T11:43:57.302384Z  INFO subspace_networking::behavior::custom_record_store: New provider record added: ProviderRecord { key: Key(b"\0 \x89\x06;^\xab\xa0hP\xc4\xb6\xc9W\x1cFV\xc8|\xce\xb6\x04&M%\xe9 y\xe3\x9a\x89\x81[\xa6"), provider: PeerId("12D3KooWSPnggmEVXnPYDjaueqRm5EjC1cVejyaRUg2V6BThVxpR"), expires: Some(Instant { t: 4087037477371 }), addresses: ["/ip4/127.0.0.1/tcp/60010"] }
2022-10-20T11:43:57.302474Z  INFO subspace_networking::node_runner: Kademlia event: InboundRequest { request: AddProvider { record: None } }
2022-10-20T11:43:57.302482Z  INFO subspace_networking::node_runner: Add provider request received: None
2022-10-20T11:43:57.899984Z DEBUG libp2p_gossipsub::behaviour: Starting heartbeat    
2022-10-20T11:43:57.900226Z DEBUG libp2p_gossipsub::behaviour: Completed Heartbeat    
2022-10-20T11:43:58.899019Z DEBUG libp2p_gossipsub::behaviour: Starting heartbeat    
2022-10-20T11:43:58.899206Z DEBUG libp2p_gossipsub::behaviour: Completed Heartbeat    
2022-10-20T11:43:59.000034Z DEBUG multistream_select::listener_select: Listener: confirming protocol: /subspace/kad/0.1.0    
2022-10-20T11:43:59.000098Z DEBUG multistream_select::listener_select: Listener: sent confirmed protocol: /subspace/kad/0.1.0    
2022-10-20T11:43:59.000151Z DEBUG libp2p_core::upgrade::apply: Successfully applied negotiated protocol    
2022-10-20T11:43:59.000197Z  WARN libp2p_kad::handler: New inbound substream to PeerId("12D3KooWSPnggmEVXnPYDjaueqRm5EjC1cVejyaRUg2V6BThVxpR") exceeds inbound substream limit. No older substream waiting to be reused. Dropping new substream.    
2022-10-20T11:43:59.000496Z  WARN libp2p_kad::handler: Inbound streams: ["WaitingUser: id=UniqueConnecId(12)", "WaitingUser: id=UniqueConnecId(14)", "WaitingUser: id=UniqueConnecId(16)", "WaitingUser: id=UniqueConnecId(18)", "WaitingUser: id=UniqueConnecId(2)", "WaitingUser: id=UniqueConnecId(21)", "WaitingUser: id=UniqueConnecId(23)", "WaitingUser: id=UniqueConnecId(25)", "WaitingUser: id=UniqueConnecId(27)", "WaitingUser: id=UniqueConnecId(29)", "WaitingUser: id=UniqueConnecId(31)", "WaitingUser: id=UniqueConnecId(33)", "WaitingUser: id=UniqueConnecId(35)", "WaitingUser: id=UniqueConnecId(37)", "WaitingUser: id=UniqueConnecId(39)", "WaitingUser: id=UniqueConnecId(42)", "WaitingUser: id=UniqueConnecId(44)", "WaitingUser: id=UniqueConnecId(46)", "WaitingUser: id=UniqueConnecId(48)", "WaitingUser: id=UniqueConnecId(5)", "WaitingUser: id=UniqueConnecId(50)", "WaitingUser: id=UniqueConnecId(52)", "WaitingUser: id=UniqueConnecId(54)", "WaitingUser: id=UniqueConnecId(56)", "WaitingUser: id=UniqueConnecId(58)", "WaitingUser: id=UniqueConnecId(60)", "WaitingUser: id=UniqueConnecId(62)", "WaitingUser: id=UniqueConnecId(64)", "WaitingUser: id=UniqueConnecId(66)", "WaitingUser: id=UniqueConnecId(68)", "WaitingUser: id=UniqueConnecId(7)", "WaitingUser: id=UniqueConnecId(9)"]    
2022-10-20T11:43:59.898505Z DEBUG libp2p_gossipsub::behaviour: Starting heartbeat    
2022-10-20T11:43:59.898595Z DEBUG libp2p_gossipsub::behaviour: Completed Heartbeat    
2022-10-20T11:43:59.950796Z DEBUG subspace_networking::node_runner: Initiate connection to known peers local_peer_id=12D3L7AUynGJcx7Lb4SAo4bR1m78UYydB4TRy2Nre6vViLHPgEem connected_peers=1
2022-10-20T11:44:00.701763Z DEBUG multistream_select::listener_select: Listener: confirming protocol: /subspace/kad/0.1.0    
2022-10-20T11:44:00.701828Z DEBUG multistream_select::listener_select: Listener: sent confirmed protocol: /subspace/kad/0.1.0    
2022-10-20T11:44:00.701882Z DEBUG libp2p_core::upgrade::apply: Successfully applied negotiated protocol  

Possible Solution

Version

  • I use 0.46.1 in my own branch, but the latest 0.49.0 produces the same result.

Would you like to work on fixing this bug?

No

I mentioned this issue on the last community call. @mxinden @thomaseizinger You already had a similar discussion, but this issue arises under a much smaller load.

Thanks for reporting this! This definitely doesn't look right. I am putting it on my list to work on for next week :)

Are you polling the Swarm properly? Can you share the code that you are running?
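For reference, "polling properly" means the Swarm future is driven continuously and nothing slow blocks that same loop. A minimal sketch of such a loop (assuming `swarm` is your configured Swarm with the Kademlia behaviour):

```rust
use futures::StreamExt;
use libp2p::swarm::SwarmEvent;

// The Swarm only makes progress while it is being polled, so the event loop
// must keep spinning and must not await anything slow between iterations.
loop {
    match swarm.select_next_some().await {
        SwarmEvent::Behaviour(event) => println!("Kademlia event: {event:?}"),
        other => println!("Other swarm event: {other:?}"),
    }
}
```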

I looked into the code, and these substreams shouldn't really be sitting there idling for very long. They are waiting for the NetworkBehaviour to gather the data, which should be extremely quick, after which they can resume.

Maybe the main event loop of the behaviours is blocked for some reason? Can you reproduce the problem if you are just using the Kademlia behaviour and nothing else?

Also, can you try and test with #3074? I think that one is more correct in terms of task wake-up behaviour. Maybe that is the issue.

Friendly ping @shamil-gadelshin!

Does #3074 help at all?

I tried that. The issue stays. Sorry for the delayed response. I was going to debug it again and provide you with extensive feedback. I'm going to do it this week.

Thank you for your attention to this bug. I appreciate it.

Ah damn. Any chance you can provide a minimal example that produces the problem?

Sure. I will try debugging my code again to rule out a silly mistake and then try to reproduce the issue with a minimal setup.

Here is the minimal Kademlia setup that shows the difference between start_providing and put_record or get_closest_peers. It's possible that I misuse the Kademlia API, and I would appreciate a hint @thomaseizinger

Thank you! I'll have a look!

A friendly ping @thomaseizinger. Did you have a chance to look at the kad-example project? Does it reproduce the error?

Sorry, I haven't yet but I'll do so today!

Okay, I figured out what I think the issue is.

As per the spec, outbound substreams may be reused. Our implementation never does that (we only send one request per substream), but our inbound streams wait for additional messages on that stream and thus fill up the buffer.

There seems to be a bug where the implementation of the inbound stream does not detect that the other side has closed the stream, which should make it stop waiting for a message.

Damn, that is not the issue ...

Okay, I have a fix.

The issue was that we had substreams waiting for a response from the behaviour even though, for AddProvider, there is no response. So instead of the substream actually being reused, it sat in the WaitingUser state and never got answered.
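Roughly speaking, the inbound handling behaved like this (a simplified illustration, not the actual libp2p-kad handler code):

```rust
// Simplified illustration of the leak: every inbound request parked its
// substream until the behaviour produced an answer, but AddProvider never
// gets one, so those slots stayed occupied until the 32-substream limit.
enum RequestKind {
    FindNode,
    GetProviders,
    AddProvider,
}

enum InboundState {
    /// Substream parked until the behaviour produces a response.
    WaitingUser,
    /// No response expected; the substream can be reused or closed right away.
    Reusable,
}

fn state_after(request: RequestKind) -> InboundState {
    match request {
        // These requests have responses, so parking the substream is correct.
        RequestKind::FindNode | RequestKind::GetProviders => InboundState::WaitingUser,
        // Before the fix, this arm also ended up in WaitingUser and leaked the
        // slot; AddProvider is fire-and-forget, so the substream should become
        // reusable immediately.
        RequestKind::AddProvider => InboundState::Reusable,
    }
}
```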

I opened a fix here: #3152

With this patch, the example you provided no longer issues the warnings. Thanks for providing that example, it was really helpful in the debugging process!

I will test the fix in the test project and in our main project as well. Thanks a lot!