PlatformLab / HomaModule

A Linux kernel module that implements the Homa transport protocol.

Problem with Multiple Links

amjal opened this issue

I have been trying to incorporate Homa into Mininet hosts. That is a cheaper and easier way for me to do certain tests that do not require >10 Gbps bandwidth. Plus, this way I can use different software switches like bmv2.

I set up a small topology consisting of 12 hosts and 100 Mbps links. Adding entries to /etc/hosts allowed me to run the cp_node utility program on my Mininet network. In my benchmark, I ran these commands each on a different host:

server1 > cp_node server --protocol homa --first-port 4000 > logs/homa_vs_tcp/homa-server-1.txt &
client1 > cp_node client --protocol homa --first-port 4000 --workload 10000 --servers 1 > logs/homa_vs_tcp/homa-client-1.txt &
server2 > cp_node server --protocol homa --first-port 4001 > logs/homa_vs_tcp/homa-server-2.txt &
client2 > cp_node client --protocol homa --first-port 4001 --workload 10000 --servers 2 > logs/homa_vs_tcp/homa-client-2.txt &
...

I also ran these commands to test TCP:

server1 > cp_node server --protocol tcp --first-port 4000 > logs/homa_vs_tcp/tcp-server-1.txt &
client1 > cp_node client --protocol tcp --first-port 4000 --workload 10000 --servers 1 > logs/homa_vs_tcp/tcp-client-1.txt &
server2 > cp_node server --protocol tcp --first-port 4001 > logs/homa_vs_tcp/tcp-server-2.txt &
client2 > cp_node client --protocol tcp --first-port 4001 --workload 10000 --servers 2 > logs/homa_vs_tcp/tcp-client-2.txt &
...

Here is the module configuration:

sysctl .net.homa.link_mbps=100
sysctl .net.homa.timeout_resends=50
sysctl .net.homa.resend_ticks=50
sysctl .net.homa.resend_interval=50

Here are the results I got for Homa on the clients (these are for one client but the results are the same for all):

1708487176.717311252 Clients: 0.09 Kops/sec, 0.01 Gbps out, 0.01 Gbps in, RTT (us) P50 10667.64 P99 10676.73 P99.9 10680.13, avg. req. length 10000.0 bytes
1708487180.718111920 Clients: 0.09 Kops/sec, 0.01 Gbps out, 0.01 Gbps in, RTT (us) P50 10667.84 P99 10700.83 P99.9 10714.93, avg. req. length 10000.0 bytes

Here are the results I got for TCP on the clients (these are for one client but the results are the same for all):

1708487191.763639441 Clients: 0.56 Kops/sec, 0.04 Gbps out, 0.04 Gbps in, RTT (us) P50 1772.89 P99 1808.76 P99.9 1976.60, avg. req. length 10000.0 bytes
1708487195.765673529 Clients: 0.56 Kops/sec, 0.05 Gbps out, 0.05 Gbps in, RTT (us) P50 1774.09 P99 1831.05 P99.9 1905.58, avg. req. length 10000.0 bytes

Looking at the time traces, there is evidence as to why Homa has lower throughput and higher latency compared to TCP:

 1007.364 us (+   0.683 us) [C24] Finished queueing packet: rpc id 1694709, offset 8520, len 1480, granted 10000
 1149.781 us (+   1.410 us) [C24] Finished queueing packet: rpc id 1694712, offset 0, len 8520, granted 10000
 1894.174 us (+   0.404 us) [C24] Finished queueing packet: rpc id 1694712, offset 8520, len 1480, granted 10000
 2036.510 us (+   1.410 us) [C24] Finished queueing packet: rpc id 1694714, offset 0, len 8520, granted 10000
 2781.758 us (+   0.791 us) [C24] Finished queueing packet: rpc id 1694714, offset 8520, len 1480, granted 10000
 2923.029 us (+   1.413 us) [C24] Finished queueing packet: rpc id 1694716, offset 0, len 8520, granted 10000

The send calls are at least 120-150 us apart. What I expect is that when c1 -> s1, c2 -> s2, ... are happening at the same time, we should see packet transmissions that are only a couple of microseconds apart. But it seems like the pacer_thread is pacing all the packets globally. That is because of the line in the homa_pacer_xmit function where it busy-waits for the NIC queue wait time to drop below the max_nic_queue_ns config parameter.
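
To make the effect concrete, here is a toy user-space sketch of what I think is going on. This is not the actual HomaModule code; the 2000 ns value for max_nic_queue_ns and the device names are just made-up examples:

#include <stdint.h>
#include <stdio.h>

static uint64_t link_idle_time;                  /* one value for the whole host */
static const uint64_t max_nic_queue_ns = 2000;   /* example limit on NIC-queue backlog */
static const uint64_t link_mbps = 100;

/* Returns the time (ns) at which the pacer lets this packet go. */
static uint64_t pace_packet(uint64_t now_ns, uint64_t bytes, const char *dev)
{
    uint64_t xmit_ns = bytes * 8 * 1000 / link_mbps;   /* wire time at 100 Mbps */
    uint64_t start = now_ns;

    /* In this model the pacer won't add more than max_nic_queue_ns of backlog,
     * even if the packet is destined for a different (idle) device. */
    if (link_idle_time > now_ns + max_nic_queue_ns)
        start = link_idle_time - max_nic_queue_ns;
    if (link_idle_time < start)
        link_idle_time = start;
    link_idle_time += xmit_ns;
    printf("%s: released at %llu ns\n", dev, (unsigned long long)start);
    return start;
}

int main(void)
{
    /* h1 and h2 start transmitting at the same time on separate virtual links,
     * but they share one link_idle_time, so h2's packet is held back. */
    pace_packet(0, 8520, "h1-eth0");   /* released at 0 ns */
    pace_packet(0, 8520, "h2-eth0");   /* released at ~680000 ns, not at 0 */
    return 0;
}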

Just increasing the max_nic_queue_ns config parameter is not a solution, since it interferes with SRPT on the client side. The problem is that the module keeps and updates only one link_idle_time value, whereas we might have multiple links on the host, each of which requires its own link_idle_time value (Mininet creates virtual links for the hosts). To verify that this is the source of the issue, I disabled the throttle queue by setting the HOMA_FLAG_DONT_THROTTLE bit. Here are the results when we bypass the throttle queue and thereby the busy wait on link_idle_time:

1708567417.962724782 Clients: 0.49 Kops/sec, 0.04 Gbps out, 0.04 Gbps in, RTT (us) P50 2047.35 P99 2091.64 P99.9 3550.46, avg. req. length 10000.0 bytes
1708567421.964527966 Clients: 0.49 Kops/sec, 0.04 Gbps out, 0.04 Gbps in, RTT (us) P50 2045.86 P99 2090.98 P99.9 2311.49, avg. req. length 10000.0 bytes

These numbers are a lot better. So how about keeping one link_idle_time value for each link instead of one for the whole host? I don't think implementing it is too much of a headache, since the device for each RPC is already available through homa_get_dst(rpc->peer, rpc->hsk)->dev, so we could track the active Homa links using a linked list or something similar.
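
Roughly, this is the kind of per-device state I am imagining. All of these names (homa_link_state, active_links, homa_link_for_rpc) are hypothetical, not existing HomaModule code; only homa_get_dst and the fields it is called with exist today:

struct homa_link_state {                 /* hypothetical per-device entry */
    struct list_head links;              /* linked into a hypothetical homa->active_links list */
    struct net_device *dev;              /* the device this entry paces */
    atomic64_t link_idle_time;           /* idle time for this device only */
};

/* Hypothetical lookup: find the entry for the device that this RPC's route
 * uses; homa_get_dst() is the existing helper mentioned above. */
static struct homa_link_state *homa_link_for_rpc(struct homa *homa,
                                                 struct homa_rpc *rpc)
{
    struct net_device *dev = homa_get_dst(rpc->peer, rpc->hsk)->dev;
    struct homa_link_state *link;

    list_for_each_entry(link, &homa->active_links, links)
        if (link->dev == dev)
            return link;
    return NULL;    /* caller would allocate a new entry and add it to the list */
}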

I think there may be a different issue at work here.

First, it makes sense to space out the packets by 700 us, since this is how long it takes to actually transmit the first 8 KB of the message (8520 bytes at 100 Mbps is roughly 680 us). Also, it's taking 10 ms round-trip for Homa, so an extra 700 us each way doesn't explain the overall time.

I suspect that the real problem is max_gso_size. Can you run "sysctl .net.homa.max_gso_size" to see what it returns? If this is larger than the MTU of the network (1500 B?), try changing it to the MTU: "sudo sysctl .net.homa.max_gso_size=1500" and see if things get better. I suspect what's happening is this:

  • By default, Homa assumes that the NIC is capable of performing segmentation offload for Homa, using the same mechanism as TCP (TSO).
  • This works for the Mellanox NICs where I test, but for many NICs it doesn't: the NIC sees that the protocol isn't TCP, so it refuses to do the segmentation and drops the packet.
  • This results in timeouts and retransmissions (check whether there are RESEND requests in your timetraces). Retransmitted packets don't use TSO, so they get through successfully.
  • Setting max_gso_size to the MTU will ensure that Homa doesn't try to use TSO.

Another experiment to try is to change the workload to "--workload 100" and see if Homa gets faster. Then gradually increase the message length until things get dramatically slower. I suspect this will happen somewhere around 1500 bytes.

If this doesn't solve the problem, grab a timetrace from each of client and server and attach them here; I'll then take a look.

I enabled the throttle queue using sysctl .net.homa.flags=0, and set the GSO size equal to the MTU using sysctl .net.homa.max_gso_size=1500. These are the Homa results:

1708740145.851823025 Clients: 0.09 Kops/sec, 0.01 Gbps out, 0.01 Gbps in, RTT (us) P50 10491.33 P99 10520.76 P99.9 10799.92, avg. req. length 10000.0 bytes
1708740146.851936851 Clients: 0.10 Kops/sec, 0.01 Gbps out, 0.01 Gbps in, RTT (us) P50 10490.95 P99 10515.80 P99.9 10799.92, avg. req. length 10000.0 bytes

I see no RESEND requests in the module time traces. I think I can clarify the original issue a bit further. When I say I expect to see packet transmissions that are a couple of microseconds apart, I don't mean that all the packet tx timestamps should be spaced a couple of microseconds apart. Maybe looking at the link activity on my network during the Homa experiment helps:

[image: per-link activity on the Mininet network during the Homa experiment]

These links are all virtual and handled by one physical host running one instance of the pacer_thread. When multiple virtual hosts transmit packets over separate virtual NICs, starting at the same time, I expect the packet tx timestamps of the different virtual hosts to be very close together. But timestamps of different packets from the same virtual host should be spaced according to the packet size and link speed, which means hundreds of microseconds in my network.

Consider a real-life network where hosts have synchronized transmission start times. In that scenario, each host has its own Homa kernel module and perhaps a single link connecting it to the ToR (at least that is what the module assumes). In that case, pkt1 of host1 and pkt1 of host2 will have the same global transmission timestamp, because host1 and host2 started transmitting at the same time.

Now imagine the whole network resides on a single physical host with one Homa kernel module and one pacer_thread running. This is the case I have with the Mininet network: all the virtual links you see in the image are handled by the host machine. The problem is that the module still assumes there is only one link on the machine. Of course this has implications for my virtual network, as I explained in the original post, but it will also likely cause problems in real-life cases that have multiple active NICs on one physical host, such as NIC teaming. With the current implementation, this is what happens:

At t0: h1 adds pkt1 -> throttle queue
At t0: h2 adds pkt1 -> throttle queue
At t0: pacer_thread sends h1.pkt1 
pacer_thread busy-waits on link_idle_time for 150us // unnecessary, since h1 & h2 are connected to separate links
At t0 + 150us: pacer_thread sends h2.pkt1 // should have been sent at t0

First, your experiment with flags suggests that pacing isn't the problem, since turning it off didn't improve performance.

Second, you lost me with the description of your environment. If you have multiple virtual hosts, each with its own virtual link, then there should be a separate pacer for each virtual host, which manages its virtual link. Having a shared pacer doesn't make sense to me. Perhaps you could describe in more detail exactly what your topology is in terms of virtual and physical hosts, links, and switches, exactly where you have plugged Homa into that topology, and what you are trying to achieve. It sounds like you may be trying to use the Linux Homa module in an environment for which it wasn't designed.

Oops, I see that by setting flags=0 you turned throttling on, not off. So ignore my "First" comment.

If you have multiple virtual hosts, each with its own virtual link, then there should be a separate pacer for each virtual host, which manages its virtual link.

Right, I should have clarified what my setup looks like. A Mininet network consists of virtual interfaces for the switches and the hosts. The Mininet hosts are essentially user-space shell processes with dedicated network namespaces and routing tables. The switches can be custom programs (any software switch). I think of a Mininet network as a virtual network that resides on the main host machine and uses its resources: memory, CPU, and NICs. When a Mininet host generates traffic, whether it is TCP, UDP, or Homa, a bash process essentially generates the traffic, and it is handled by the TCP/IP stack of the main host that the network (and its hosts) resides in. So the Mininet hosts do not have their own operating systems; they all share the main host's kernel, and it is the IP stack of the main host's kernel that routes the packets generated by the Mininet hosts through the correct virtual interfaces. Since there is only one kernel, there is only one Homa kernel module and only one pacer thread.
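
In case a concrete illustration helps (this is not Mininet code, just a hypothetical standalone demo; it needs root): a Mininet-style "host" is essentially a process that has unshared its network namespace, while the kernel underneath, and therefore the Homa module and its single pacer thread, stays shared:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/utsname.h>

int main(void)
{
    struct utsname before, after;

    uname(&before);
    if (unshare(CLONE_NEWNET) != 0) {    /* give this process its own network namespace */
        perror("unshare(CLONE_NEWNET)");
        return 1;
    }
    uname(&after);

    /* Same kernel before and after: only the view of interfaces and routes changed. */
    printf("kernel before: %s, kernel after: %s\n", before.release, after.release);

    system("ip link");    /* inside the new namespace only a fresh loopback is visible */
    return 0;
}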

It sounds like you may be trying to use the Linux Homa module in an environment for which it wasn't designed.

Maybe, I wouldn't know. But I don't think the environment is the issue here. As long as the Homa packets that are generated are correctly directed to the module, which is happening, the Homa module is not concerned with where they come from. And as long as the IP stack routes the packets it gets from the module correctly, which it does, Homa is not concerned with the routing either. Besides, TCP works without any problems in this same environment. With Homa, it's just this one issue: regardless of how many NICs are on a host, the module assumes there is only one and creates a single pacer_kthread to handle the NIC queues. When one NIC has a tx queue whose drain time exceeds max_nic_queue_ns, the module, since it perceives only one NIC queue, does not allow packets to be added to any other queue until that one queue drains. This wastes bandwidth on the virtual interfaces that share the pacer thread. It is also an issue in a variety of other scenarios, for example when we have multiple physical links and want to use NIC teaming for redundancy or higher throughput.

I think there is value in making Homa work with Mininet, because it allows researchers and developers to develop and test simple things against the module on their local systems with zero cost and zero waiting time (given that the system has enough resources to support the emulated network), and only go to the CloudLab platform if they need >10 Gbps link speeds for their tests/experiments. What do you think?

Correct! The solution that originally occurred to me, and that I discussed in the previous comments, was to modify Homa's pacer to keep a list of active devices (kind of like output_queue in the Linux kernel), check the link_idle_time for each, and dequeue packets from them in a round-robin fashion. But I see that trying to enforce SRPT on the sender using the max_nic_queue_ns parameter might have deeper flaws, such as interference with the protocol itself. For example, it also paces the unscheduled packets that are supposed to be sent as a burst at line rate. But after you mentioned queue disciplines, I did a little searching, and it seems Linux's priority qdisc coupled with the appropriate classifier can achieve what a Homa sender needs.

The need for Homa to use Linux's queue disciplines instead of its own pacer has come up in several different contexts over the last few weeks, so this is now on my list of things to do. However, it probably won't happen until at least the summer.