It seems that the priority is not working
ufm opened this issue
We have three links with different priorities:
root@ygg5:/# yggdrasilctl getpeers|grep dada
tls://[fe80::38b7:e8ff:fe5b:3be7%25ens20] Up Out 200:dada:feda:f443:52ec:d2e4:b853:60bc 3m22s 5kb 93kb 0 -
tls://[fe80::be24:11ff:fe78:89f5%25ens19] Up Out 200:dada:feda:f443:52ec:d2e4:b853:60bc 3m22s 15kb 71kb 3 -
tcp://193.111.115.215:54230 Up In 200:dada:feda:f443:52ec:d2e4:b853:60bc 3m22s 62kb 359kb 4 -
Sending traffic to this address:
root@ygg5:/# ping -s 10000 200:dada:feda:f443:52ec:d2e4:b853:60bc
PING 200:dada:feda:f443:52ec:d2e4:b853:60bc(200:dada:feda:f443:52ec:d2e4:b853:60bc) 10000 data bytes
10008 bytes from 200:dada:feda:f443:52ec:d2e4:b853:60bc: icmp_seq=1 ttl=64 time=7.09 ms
10008 bytes from 200:dada:feda:f443:52ec:d2e4:b853:60bc: icmp_seq=2 ttl=64 time=2.27 ms
10008 bytes from 200:dada:feda:f443:52ec:d2e4:b853:60bc: icmp_seq=3 ttl=64 time=1.65 ms
10008 bytes from 200:dada:feda:f443:52ec:d2e4:b853:60bc: icmp_seq=4 ttl=64 time=1.17 ms
10008 bytes from 200:dada:feda:f443:52ec:d2e4:b853:60bc: icmp_seq=5 ttl=64 time=1.10 ms
--- 200:dada:feda:f443:52ec:d2e4:b853:60bc ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4006ms
rtt min/avg/max/mdev = 1.100/2.655/7.088/2.255 ms
And now it's evident that the traffic was distributed across all three links:
root@ygg5:/# yggdrasilctl getpeers|grep dada
tls://[fe80::38b7:e8ff:fe5b:3be7%25ens20] Up Out 200:dada:feda:f443:52ec:d2e4:b853:60bc 3m48s 26kb 93kb 0 -
tls://[fe80::be24:11ff:fe78:89f5%25ens19] Up Out 200:dada:feda:f443:52ec:d2e4:b853:60bc 3m48s 25kb 71kb 3 -
tcp://193.111.115.215:54230 Up In 200:dada:feda:f443:52ec:d2e4:b853:60bc 3m48s 83kb 410kb 4 -
Priority is for when there are multiple links to the same node, i.e. Ethernet and WiFi peerings, and you want to prefer one over the other.
It does not change routing decisions between different peers.
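For context, priorities are set per peering in the configuration. A minimal sketch of a yggdrasil.conf Peers section (the addresses and port numbers are illustrative; the `?priority=` URI option is how recent Yggdrasil versions express it, so check the documentation for your version):

```
{
  Peers: [
    # priority 0 is the most preferred; higher values act as fallbacks
    "tls://[fe80::1%ens20]:443?priority=0"
    "tls://[fe80::2%ens19]:443?priority=3"
    "tcp://192.0.2.1:54230?priority=4"
  ]
}
```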
Hmm. These are three links between the same nodes: two over Ethernet (physically different interfaces) and one over TCP.
Also protocol traffic can and will be sent over all peerings, it should just be that actual application traffic is prioritised accordingly, so you may need to send more than just pings to see it in action / separate from the protocol traffic background noise.
root@ygg5:~# yggdrasilctl getpeers|grep dada
tls://[fe80::38b7:e8ff:fe5b:3be7%25ens20] Up Out 200:dada:feda:f443:52ec:d2e4:b853:60bc 8h28m20s 2mb 67mb 0 -
tls://[fe80::be24:11ff:fe78:89f5%25ens19] Up Out 200:dada:feda:f443:52ec:d2e4:b853:60bc 8h28m20s 2mb 64mb 3 -
tcp://193.111.115.215:54230 Up In 200:dada:feda:f443:52ec:d2e4:b853:60bc 8h28m20s 14mb 390mb 4 -
root@ygg5:~#
root@ygg5:~# iperf3 -c 200:dada:feda:f443:52ec:d2e4:b853:60bc
Connecting to host 200:dada:feda:f443:52ec:d2e4:b853:60bc, port 5201
[ 5] local 22d:d3dd:3afe:9599:3da9:d89f:6ae:8401 port 41842 connected to 200:dada:feda:f443:52ec:d2e4:b853:60bc port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 58.4 MBytes 490 Mbits/sec 236 146 KBytes
[ 5] 1.00-2.00 sec 55.0 MBytes 461 Mbits/sec 81 634 KBytes
[ 5] 2.00-3.00 sec 67.5 MBytes 566 Mbits/sec 74 829 KBytes
[ 5] 3.00-4.00 sec 82.5 MBytes 692 Mbits/sec 64 1.05 MBytes
[ 5] 4.00-5.00 sec 75.0 MBytes 629 Mbits/sec 98 634 KBytes
[ 5] 5.00-6.00 sec 50.0 MBytes 419 Mbits/sec 103 488 KBytes
[ 5] 6.00-7.00 sec 47.5 MBytes 398 Mbits/sec 74 488 KBytes
[ 5] 7.00-8.00 sec 32.5 MBytes 273 Mbits/sec 41 488 KBytes
[ 5] 8.00-9.00 sec 42.5 MBytes 356 Mbits/sec 74 390 KBytes
[ 5] 9.00-10.00 sec 50.0 MBytes 419 Mbits/sec 68 97.5 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 561 MBytes 470 Mbits/sec 913 sender
[ 5] 0.00-10.00 sec 560 MBytes 470 Mbits/sec receiver
iperf Done.
root@ygg5:~# yggdrasilctl getpeers|grep dada
tls://[fe80::38b7:e8ff:fe5b:3be7%25ens20] Up Out 200:dada:feda:f443:52ec:d2e4:b853:60bc 8h28m44s 3mb 144mb 0 -
tls://[fe80::be24:11ff:fe78:89f5%25ens19] Up Out 200:dada:feda:f443:52ec:d2e4:b853:60bc 8h28m44s 3mb 140mb 3 -
tcp://193.111.115.215:54230 Up In 200:dada:feda:f443:52ec:d2e4:b853:60bc 8h28m44s 16mb 844mb 4 -
In my opinion, there is enough traffic to see that it is distributed among all peers regardless of priority.
P.S. The fact that the reverse traffic goes through some other path is a separate conversation. I always believed that having a direct link between two nodes is a sufficient guarantee for the traffic to always flow directly between those nodes.
That is concerning, as you're absolutely right, having a direct peer should be the one case where we can be certain that the network will avoid unnecessary intermediate hops. There was a bug in the early v0.5.X versions that could have prevented that in some cases, but I thought I managed to fix it -- maybe not.
If possible, could you provide the output of yggdrasil --version and any important info about how it was installed? (built from git, downloaded an "official" build from our GitHub, installed a distro package, etc.) I just want to make sure it's running the right thing before I get too carried away trying to reproduce/debug it on my end.
Both nodes:
Build name: yggdrasil
Build version: 0.5.4
Installed from
deb http://neilalexander.s3.dualstack.eu-west-2.amazonaws.com/deb/ debian yggdrasil
One node is Ubuntu jammy, the other Ubuntu focal.
The local (broadcast) mesh has 15 nodes.
One of the nodes is the current root of our long-suffering network and public peer, so it's better not to shake it too much. :)
Ah, that's useful to know! I haven't been able to reproduce it in local tests, but it's not too hard to believe that there could be an edge case where things stop working right for the root node specifically, so I guess I should test from an isolated network.
Perhaps there are issues with link priorities? Although @neilalexander claims that everything is working correctly, it's evident that it's not.
If you need my configurations, let me know, and I'll provide them to you (I can even include the keys, but not here. We can do it in Matrix, for example).
Not claiming that everything is working correctly for sure, but at the moment I can't reproduce what you are seeing. I have three peerings to the same node with different priorities, and traffic is definitely taking the path that I expect (the one with the lowest priority value and, in case two links have the same priority, the highest uptime).
Can you please confirm that getPeers on both sides shows the same priority values for each peering?
First side:
root@yds:/home/ufm# yggdrasilctl getpeers|grep 22d
tls://[fe80::be24:11ff:fe47:857b%25ens20] Up In 22d:d3dd:3afe:9599:3da9:d89f:6ae:8401 154h37m57s 292mb 13mb 0 -
tls://[fe80::6487:beff:fecb:9145%25ens22] Up In 22d:d3dd:3afe:9599:3da9:d89f:6ae:8401 154h37m57s 147mb 13mb 3 -
tcp://193.93.119.42:14244 Up Out 22d:d3dd:3afe:9599:3da9:d89f:6ae:8401 154h37m57s 883mb 71mb 4 -
Second side:
root@ygg5:/home/ufm# yggdrasilctl getpeers|grep dada
tls://[fe80::38b7:e8ff:fe5b:3be7%25ens20] Up Out 200:dada:feda:f443:52ec:d2e4:b853:60bc 154h38m49s 13mb 292mb 0 -
tls://[fe80::be24:11ff:fe78:89f5%25ens19] Up Out 200:dada:feda:f443:52ec:d2e4:b853:60bc 154h38m49s 13mb 147mb 3 -
tcp://193.111.115.215:54230 Up In 200:dada:feda:f443:52ec:d2e4:b853:60bc 154h38m49s 71mb 883mb 4 -
Interface ens20 on the first side corresponds to ens20 on the second.
Interface ens22 on the first side corresponds to ens19 on the second.
TCP is TCP.
As you can see, everything matches: the transmitted/received byte counts, uptime and priority.
Just in case, I'll repeat: the second side is the root of the tree. Perhaps the issue is indeed related to this?
Thanks for the confirmation. I've been looking over the code and not really sure how this can happen, unless it's not actually session traffic but protocol traffic instead. The order of evaluation when selecting the next-hop is:
- Shortest tree distance
- If distance equal, lower key
- If key equal, lower priority
- If priority equal, higher uptime
... which makes me think that something is wrong in the first two steps. Being root might have something to do with it, I'll keep digging.
Could both issues be related? The fact that the peer with the lowest priority isn't selected and that the reverse traffic is taking a different path despite the existence of a direct connection.
I made some changes to the configuration and re-ran the tests. I removed the root role from my node (moved it to a separate instance connected only to my node), and I have a few updates.
- @Arceliar Regarding non-symmetric traffic, it seems to have been my mistake. I forgot that iperf3, by default, doesn't perform a bidirectional test.
- The fact that a node is the root does not affect the selection of the link priority.
- The priority is chosen incorrectly.
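As an aside on the bidirectional point: iperf3 only measures client-to-server by default, so exercising the reverse path needs an extra flag (these are standard iperf3 flags; --bidir requires iperf 3.7 or newer):

```
iperf3 -c 200:dada:feda:f443:52ec:d2e4:b853:60bc          # forward only (default)
iperf3 -c 200:dada:feda:f443:52ec:d2e4:b853:60bc -R       # reverse: server sends
iperf3 -c 200:dada:feda:f443:52ec:d2e4:b853:60bc --bidir  # both directions at once
```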
Here is how the test results look now:
root@ygg5:~# yggdrasilctl getpeers|grep dada
tls://[fe80::38b7:e8ff:fe5b:3be7%25ens20] Up In 200:dada:feda:f443:52ec:d2e4:b853:60bc 27m31s 5gb 5gb 0 -
tls://[fe80::be24:11ff:fe78:89f5%25ens19] Up In 200:dada:feda:f443:52ec:d2e4:b853:60bc 27m31s 31gb 32gb 3 -
tcp://193.111.115.215:53354 Up In 200:dada:feda:f443:52ec:d2e4:b853:60bc 27m29s 6kb 9kb 4 -
@neilalexander It's evident that the link with priority 3 is being selected. However, some traffic still goes through the link with priority 0, though only around 20% of it. (I hope I didn't make a mistake here, and priority 0 is higher than priority 3, right?)
OK, I think we've narrowed down the problem and I've pushed a new yggdrasil-develop package build to my S3 repo that contains the fix. If you don't mind testing, the fixed version is 0.5.4-5.
I think the reason I was unable to reproduce it was that it was predicated on both the priority value and the order of connections, I guess in my environment they matched up but in yours they didn't.
It seems that in 0.5.4-5-g5da4c11 this issue is not present. Thank you very much!
Looks like it's fixed.
Thanks for testing & confirming!