PlatformLab / HomaModule

A Linux kernel module that implements the Homa transport protocol.

NAPI and SoftIRQ CPU Metrics

pmo73 opened this issue

Hello, I am currently preparing for my master's thesis and would like to analyze the pacer in more detail. My idea is to use a SmartNIC to offload the pacer functionality to hardware, for both TX and RX, in order to reduce CPU utilization and hopefully reduce tail latency even further. While working on this I took a closer look at the CPU metrics and saw that the code states that the usage for NAPI and SoftIRQ can only be measured with a modified kernel. Could you tell me what needs to be modified, or provide me with a patch? These measurements would be interesting for me for the overall CPU utilization.

It would be great if you could send me the patches when you are back. Thanks in advance for your support.

Hello, thank you for your time and support.
Could it be that something went wrong with the attachment? I can't find one here in the issue.

Weird... I see the attachment in my copy of the email, but I don't see it on GitHub. I'm trying again, this time sending the response directly from GitHub, rather than by email.

-John-

0001-Homa-instrumentation.patch

Now it worked, thank you very much.

Hi John, we have fetched your modified kernel 6.1.38+ from CloudLab. I took a look at the patch you sent and wanted to ask: is all of this code commented out in your modified kernel? I have made measurements with the standard Linux kernel 6.2.0, and for Homa I see clear differences in bandwidth: your modified kernel achieves a much higher bandwidth, and now I am looking for the differences between the two kernels. Your patch actually only contains lines that are relevant for statistics and have nothing to do with packet processing, and they are all commented out. In short: are these lines also commented out in the kernel on CloudLab, or are there other differences? I haven't tried building and measuring a kernel with your patch myself.

Hi John, sorry for the very delayed reply. I tried to apply your patch and copied all headers that are not included in the patch from HomaModule to the expected locations, but ultimately failed with the error message in the attachment. Are there files missing from the patch, or can you tell me what I am doing wrong?

I have done further tests on the different kernel versions, and it turns out that the difference is influenced by the module parameters. I had always used the parameters from your XL710 cluster, but once I started changing them the difference between the kernel versions disappeared. To make sure I find the best settings, I would be interested to know how you arrived at your parameters. Did you determine them with your cp_config script by trying many different values, and can the parameters be determined independently of each other?

[Attachment: screenshot of the compilation errors]

First, the compilation errors:

  • It looks like the file kernel/timetrace.c isn't including the right header file to declare socket structures such as sockaddr_in6. You'll need to figure out which kernel header file declares that and add a #include in kernel/timetrace.c (for some reason this isn't a problem for me).
  • It looks like there is no definition for the function trace_homa_event (in looking at the code, I don't understand why it compiles for me, but it does seem to). I don't think that the function homa_trace is used, so you can probably just delete that definition from kernel/timetrace.c.

Second, Homa's configuration parameters. The current settings were determined by experimentation, but I haven't done enough experimentation to be confident that these are absolutely the best values. For many of the parameters the exact value doesn't seem to matter, within a broad range. Other parameters are workload dependent: what works well for one workload may not be optimal for another (for these, I've picked "in between" values that produce the best possible results over a range of workloads). Some parameters are relatively independent of each other, while others are not.

Ideally Homa should have a configuration tool you can run that tries different values and picks the best ones, but I haven't had time to write such a tool yet.
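In the meantime, a crude one-parameter sweep is easy to script yourself. The sketch below is only an illustration, not a real tool; the cp_vs_tcp command is a placeholder, so substitute whatever invocation and options you normally use.

```python
#!/usr/bin/env python3
# Rough sketch (not part of HomaModule): set one Homa parameter to a series
# of values and rerun the same benchmark for each value, so that the results
# can be compared afterwards.
import subprocess

# Placeholder: replace with the cp_vs_tcp invocation you normally use.
BENCHMARK = ["./cp_vs_tcp"]

for gso in (10000, 20000, 40000, 60000):
    # Homa's parameters are exposed through sysctl (here: max_gso_size).
    subprocess.run(["sudo", "sysctl", f".net.homa.max_gso_size={gso}"],
                   check=True)
    print(f"=== running benchmark with max_gso_size={gso} ===")
    subprocess.run(BENCHMARK, check=True)
```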

Hello John, thank you very much for your reply. I have added the missing headers and deleted the function accordingly and was able to build the kernel with the patch.

What I do see is that the parameters depend very much on the workload, most clearly with the parameter "max_gso_size". I swept over different values: for small values such as approximately 10000, the workloads W2 and W3 reach the specified bandwidth, whereas W4 and W5 only reach half of it. If I make the value larger, for example 60000, W2 and W3 no longer reach the specified bandwidth, whereas W4 and W5 do. The latter seems to make sense to me, but do you have an explanation for why the small workloads are so massively affected by this parameter? They should actually be completely unaffected if the Homa module just forwards larger packets to Linux, or did I misunderstand something?

Hello John,
my test setup consists of 4 nodes, all equipped with the following hardware:

CPU: Intel Xeon E5-2697A v4 @ 2.60GHz
RAM: 4x 32GB DDR4-2666 MHz DIMMs
NIC: Mellanox ConnectX-5 100Gbps
Switch: Arista 7050QX

However, the network cards are only linked to the switch at 40 Gbps rather than 100 Gbps, since the switch supports at most 40 Gbps. In any case, the network cards should support TSO.

At the same time, I also have a second test setup consisting of 2 nodes, in which I have repeated the whole thing with SmartNICs that do not implement TSO at all, so segmentation for Homa and TCP has to be done in software. I see the same effect there: the bandwidth for workloads W2 and W3 gets worse the higher I set the parameter.

Hello John, I hope I have understood your suggestion correctly. I have now executed the script cp_vs_tcp only for workload W2 and changed only the parameter max_gso, leaving all other settings identical. I tested with 4 nodes and the Mellanox ConnectX-5 network cards. I have attached the two cperf.log files. Since I understand that the parameter should only affect large packets, I would not expect any difference between the two measurements. In fact, however, the bandwidth drops by 1 Gbps for the larger value.

Is this the kind of experiment you meant, or did you have something else in mind?

cperf_max_gso_10000.log
cperf_max_gso_50000.log

Thanks for the additional information; I think I'm starting to understand the experiment. I'm assuming that the metric of interest for you is the "Overall for homa_w2 experiment" line in each file, showing about 2 Gbps with max_gso=50000 and about 3.1 Gbps with max_gso=10000? That is indeed curious; I will need to dive deeper to figure out what is going on.

Can you run the same experiment again, and while the experiment is running (about 15-20 seconds through the 30-second test for Homa), invoke the command sudo sysctl .net.homa.action=7 on one of the nodes in the experiment? This will capture detailed "timetraces" of absolutely everything happening on every node over a period of a few tens of milliseconds. Then, once the experiment is finished, invoke the command ttprint.py > nodeN.tt (where N is the node number) on each of the nodes, and send me all of the .tt files. In addition, can you send me all of the files in the log directory generated by the experiment? In particular, I'm interested in the .metrics files, but other files may also prove useful. With that information I should have a pretty good chance of figuring out what's going on.
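If it helps, those two steps can be wrapped in a small helper script along the following lines. This is only a rough sketch; adjust the timing and the path to ttprint.py for your setup, and start it on one node at the same time as the benchmark.

```python
#!/usr/bin/env python3
# Rough sketch: freeze Homa's timetraces partway through the run, then dump
# this node's trace once the experiment has finished.  Timings and the path
# to ttprint.py are assumptions; adjust them for your setup.
import socket
import subprocess
import time

# Wait until roughly 15-20 seconds into the 30-second Homa run, then freeze
# the timetraces (this captures traces on every node, not just this one).
time.sleep(17)
subprocess.run(["sudo", "sysctl", ".net.homa.action=7"], check=True)

# Once the whole experiment has finished, dump this node's timetrace.  The
# same ttprint.py step has to be repeated on each of the other nodes.
input("Press Enter when the experiment has finished... ")
node = socket.gethostname()
with open(f"{node}.tt", "w") as out:
    subprocess.run(["ttprint.py"], stdout=out, check=True)
```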

By the way, you mentioned in your original message that your goal is to analyze the pacer. If so, W2 probably isn't a good workload to be using, because (a) almost all messages are small, so they don't need the pacer, and (b) this workload can't come close to saturating the network, so it's even less likely that the pacer will kick in.

Hi John, that's right, I looked at the throughput for the "Overall for homa_w2 experiment" case for the experiment and was puzzled as to why it drops.

I ran your commands for both parameter settings, i.e. max_gso at 10000 and 50000. I packed the results, together with all the other files from the logs folder, into the attached tar file. Unfortunately I had to delete the rtt files because they are too big to upload here on GitHub. If you need them as well, I would have to send them to you by other means.

Thank you for your help and advice. Originally my goal was to implement the pacer in hardware, but in the first measurements on my test setup I realized that I could not see much influence from the pacer, in contrast to your paper. This is probably due to the lower number of nodes, and almost certainly due to the parameters I was using at the time, with which I only achieved a fraction of the throughput. Those first measurements did make clear, however, that my small number of nodes made receive side scaling a problem, since only 1 RX queue is used with 2 nodes, so I am now working on several concepts for RSS and for a Homa segmentation offload in my master's thesis. In this respect, my topic has changed.

homa-test_max_gso_workload2.tar.gz

I have figured out what is causing the performance difference. The settings for the benchmark specify 3 client threads to generate the request stream. This turns out to be just barely enough for this workload on your machines in the max_gso=10000 case. In the max_gso=50000 case it takes Homa a lot longer to allocate packet buffers (I'm not totally sure why this is the case, but there are known issues with Homa's approach to packet buffer allocation; this needs to be redone). As a result, the client threads can't issue requests at the desired rate. The solution is to increase the --client-ports parameter from 3 to 5 or 6; with additional threads, you should get the same performance with max_gso=50000.

By the way, what kind of machines are you running on (type and clock rate)?

Hi John, I have repeated the tests and the throughput increased, but not to the previous value. I have again compressed the log folders and attached them. With a higher value such as 6, the throughput dropped again.

I would be interested to know how you work out which parameter needs to be changed.

My test setup consists of the following hardware:
CPU: Intel Xeon E5-2697A v4 @ 2.60GHz
RAM: 4x 32GB DDR4-2666 MHz DIMMs
NIC: Mellanox ConnectX-5 100Gbps
Switch: Arista 7050QX (max 40Gbps)

4nodes_client_ports5_copy.tar.gz
4nodes_client_ports6_copy.tar.gz

Sorry for my slow response. I finally got some time to look into this, but I don't see any timetrace files in the information you sent. Can you make another run and collect the .tt files as described in my earlier comment? With those, hopefully I'll be able to figure out what is going on. The performance improvement from increasing --client-ports to 6 was not as much as I would have hoped.

In response to your question about how I know which parameter to change: this is done on a case-by-case basis by looking at the available data. Unfortunately I can't provide a simple recipe for tracking down performance problems; so far every situation seems to be a bit different. In my previous analysis, I noticed lines like "Lag due to overload: 41.7%" in the nodeN.log files for the max_gso 50000 case. This line means that the clients were not able to actually generate the requested workload (they lagged about 42% behind the rate required to generate 3.2 Gbps of traffic). This meant either (a) requests were taking too long to complete (there is a limit on how many outstanding RPCs the experiment will generate at once) or (b) the sending clients couldn't generate requests fast enough. I then saw lines like "Outstanding client RPCs: 11" in the same files. I know that the limit is 200 from the --client-max argument, so the benchmark wasn't hitting the limit and the problem couldn't be case (a). Thus it had to be case (b). I was also able to verify from the timetrace files that the clients were generating requests back-to-back as fast as they could, which confirms case (b).
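For what it's worth, extracting those two numbers from all of the node logs is easy to script. The sketch below is not an official tool; the log directory and the exact line formats are assumptions, so adjust them to match your files.

```python
#!/usr/bin/env python3
# Rough sketch: scan the nodeN.log files for the two lines discussed above
# ("Lag due to overload" and "Outstanding client RPCs") to see whether the
# clients are lagging and whether the --client-max limit is being reached.
import glob
import re

CLIENT_MAX = 200   # the value passed via --client-max in this experiment

for path in sorted(glob.glob("logs/node*.log")):   # log location is an assumption
    text = open(path).read()
    lag = re.search(r"Lag due to overload:\s*([\d.]+)%", text)
    rpcs = re.search(r"Outstanding client RPCs:\s*(\d+)", text)
    lag_pct = float(lag.group(1)) if lag else 0.0
    outstanding = int(rpcs.group(1)) if rpcs else 0
    if outstanding >= 0.9 * CLIENT_MAX:
        verdict = "near the --client-max limit, so case (a) is possible"
    elif lag_pct > 0:
        verdict = "clients can't generate requests fast enough, case (b)"
    else:
        verdict = "no client-side lag"
    print(f"{path}: lag {lag_pct:.1f}%, outstanding RPCs {outstanding}: {verdict}")
```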

With the new run I'm seeing "Lag due to overload" of 15-25%, which is better, but there shouldn't be any lag at all.

After sending my previous message, I noticed that the number of "Outstanding client RPCs" jumped up to about 180-190 (close to the limit of 200) with 6 client ports, which suggests that maybe the bottleneck has shifted from the clients to the servers. So, you might also try increasing --server-ports from 3 to 4 and see if that improves throughput.

Hi John, I had actually forgotten the timetrace files in the previous test. I have now increased the number of threads for the server and the throughput has increased slightly, but not to the target bandwidth. I have reattached the logs and also added the timetrace files.

Thank you for your analysis and your explanation of how to recognize which parameter you need to change.

4nodes_client_port_6_server_port_4.tar.gz

With the timetrace files you provided I was able to track down what's going on (but it wasn't easy). The server is still the bottleneck. This surprised me, given the number of ports and threads on the servers, but with the timetraces I was able to see that the application-level threads are being interfered with fairly significantly by the kernel transport and device driver code; the kernel-level code is stealing about half of the available cycles from the application code. How about adjusting the parameters as follows:

--client-ports 6 --port-receivers 2 --server-ports 4 --port-threads 5

In addition to increasing the number of threads per server port, this reduces the number of receiver threads per client port (I don't think 3 threads per port are needed, and reducing this frees up cores for server threads). If this still doesn't work, you might try reducing --port-receivers to 1 and increasing --port-threads to 6 (it's possible this will make things worse because of the low value of --port-receivers).

This benchmark may be approaching the limit of what can be handled with a 32-core machine; it's possible that there just aren't enough cores to make the benchmark work at this rate. It looks to me like TCP is struggling to get 3.2 Gbps as well.

Also, can you upgrade to the head before re-running? Your current version of Homa is missing a few recent improvements in instrumentation.

Hi John, thank you for your new analysis. Unfortunately my test setup was completely booked last week and is again this week, so I won't be able to run the tests until next week. I'll get back to you then with the new test results.

No rush...