PlatformLab / HomaModule

A Linux kernel module that implements the Homa transport protocol.

NAPI and SoftIRQ CPU Metrics

pmo73 opened this issue

Hello, I am currently preparing for my master's thesis and would like to analyze the pacer in more detail. My idea is to use a SmartNIC to offload the pacer functionality to hardware, for both TX and RX, in order to reduce CPU utilization and hopefully reduce tail latency even further. While working on this I took a closer look at the CPU metrics and saw that the code states that the usage for NAPI and SoftIRQ can only be measured with a modified kernel. Could you tell me what needs to be modified, or provide me with a patch? These measurements would be interesting for me for the overall CPU utilization.

It would be great if you could send me the patches when you are back. Thanks in advance for your support.

Hello, thank you for your time and support.
Could it be that something went wrong with the attachment? I can't find one here in the issue.

Weird... I see the attachment in my copy of the email, but I don't see it on GitHub. I'm trying again, this time sending the response directly from GitHub, rather than by email.

-John-

0001-Homa-instrumentation.patch

Now it worked, thank you very much.

Hi John, we have fetched your modified kernel 6.1.38+ from CloudLab. I took a look at the patch you sent and wanted to ask: is all of this code commented out in your modified kernel? I have made measurements with the standard Linux kernel 6.2.0, and for Homa I see clear differences in bandwidth: your modified kernel achieves a much higher bandwidth, and now I am looking for the differences between the two kernels. Your patch actually only contains lines that are relevant for statistics and have nothing to do with packet processing, and they are all commented out. In short: are these lines also commented out in the kernel on CloudLab, or are there other differences? I haven't tried building and measuring a kernel with your patch myself.

Hi John, sorry for the very delayed reply. I tried to apply your patch and copied all headers that are not included in the patch from HomaModule to the expected locations, but ultimately failed with the error message in the attachment. Are there files missing from the patch, or can you tell me what I am doing wrong?

I have done further tests on the different kernel versions, and it turns out that the difference is influenced by the module parameters. I had always used the parameters from your XL710 cluster, but once I started changing them the difference between the kernel versions disappeared. To make sure I find the best settings, I would be interested to know how you arrived at your parameters. Did you determine them with your cp_config script by trying many different values, and can the parameters be determined independently of each other?

[Attachment: screenshot of the compilation errors]

First, the compilation errors:

  • It looks like the file kernel/timetrace.c isn't including the right header file to declare socket structures such as sockaddr_in6. You'll need to figure out which kernel header file declares that and add a #include in kernel/timetrace.c (for some reason this isn't a problem for me).
  • It looks like there is no definition for the function trace_homa_event (in looking at the code, I don't understand why it compiles for me, but it does seem to). I don't think that the function homa_trace is used, so you can probably just delete that definition from kernel/timetrace.c.

Second, Homa's configuration parameters. The current settings were determined by experimentation, but I haven't done enough experimentation to be confident that these are absolutely the best values. For many of the parameters the exact value doesn't seem to matter, within a broad range. Other parameters are workload dependent: what works well for one workload may not be optimal for another (for these, I've picked "in between" values that produce the best possible results over a range of workloads). Some parameters are relatively independent of each other, while others are not.

Ideally Homa should have a configuration tool you can run that tries different values and picks the best ones, but I haven't had time to write such a tool yet.
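In the meantime, a crude one-parameter sweep is easy to script yourself. The sketch below is only an illustration, not a real tool; the cp_vs_tcp command is a placeholder, so substitute whatever invocation and options you normally use.

```python
#!/usr/bin/env python3
# Rough sketch (not part of HomaModule): set one Homa parameter to a series
# of values and rerun the same benchmark for each value, so that the results
# can be compared afterwards.
import subprocess

# Placeholder: replace with the cp_vs_tcp invocation you normally use.
BENCHMARK = ["./cp_vs_tcp"]

for gso in (10000, 20000, 40000, 60000):
    # Homa's parameters are exposed through sysctl (here: max_gso_size).
    subprocess.run(["sudo", "sysctl", f".net.homa.max_gso_size={gso}"],
                   check=True)
    print(f"=== running benchmark with max_gso_size={gso} ===")
    subprocess.run(BENCHMARK, check=True)
```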

Hello John, thank you very much for your reply. I have added the missing headers and deleted the function accordingly and was able to build the kernel with the patch.

What I do see is that the parameters depend very much on the workload, most clearly with the parameter "max_gso_size". I swept over different values: for small values such as approximately 10000, the workloads W2 and W3 reach the specified bandwidth, whereas W4 and W5 only reach half of it. If I make the value larger, for example 60000, W2 and W3 no longer reach the specified bandwidth, whereas W4 and W5 do. The latter seems to make sense to me, but do you have an explanation for why the small workloads are so massively affected by this parameter? They should actually be completely unaffected if the Homa module just forwards larger packets to Linux, or did I misunderstand something?

Hello John,
my test setup consists of 4 nodes, all equipped with the following hardware:

CPU: Intel Xeon E5-2697A v4 @ 2.60GHz
RAM: 4x 32GB DDR4-2666 MHz DIMMs
NIC: Mellanox ConnectX-5 100Gbps
Switch: Arista 7050QX

However, the network cards are only linked to the switch at 40 Gbps rather than 100 Gbps, since the switch supports at most 40 Gbps. In any case, the network cards should support TSO.

At the same time, I also have a second test setup consisting of 2 nodes, in which I have repeated the whole thing with SmartNICs that do not implement TSO at all, so segmentation for Homa and TCP has to be done in software. I see the same effect there: the bandwidth for workloads W2 and W3 gets worse the higher I set the parameter.

Hello John, I hope I have understood your suggestion correctly. I have now executed the script cp_vs_tcp only for workload W2 and changed only the parameter max_gso, leaving all other settings identical. I tested with 4 nodes and the Mellanox ConnectX-5 network cards. I have attached the two cperf.log files. Since I understand that the parameter should only affect large packets, I would not expect any difference between the two measurements. In fact, however, the bandwidth drops by 1 Gbps for the larger value.

Is this the kind of experiment you meant, or did you have something else in mind?

cperf_max_gso_10000.log
cperf_max_gso_50000.log

Thanks for the additional information; I think I'm starting to understand the experiment. I'm assuming that the metric of interest for you is the "Overall for homa_w2 experiment" line in each file, showing about 2 Gbps with max_gso=50000 and about 3.1 Gbps with max_gso=10000? That is indeed curious; I will need to dive deeper to figure out what is going on.

Can you run the same experiment again, and while the experiment is running (about 15-20 seconds through the 30-second test for Homa), invoke the command sudo sysctl .net.homa.action=7 on one of the nodes in the experiment? This will capture detailed "timetraces" of absolutely everything happening on every node over a period of a few tens of milliseconds. Then, once the experiment is finished, invoke the command ttprint.py > nodeN.tt (where N is the node number) on each of the nodes, and send me all of the .tt files. In addition, can you send me all of the files in the log directory generated by the experiment? In particular, I'm interested in the .metrics files, but other files may also prove useful. With that information I should have a pretty good chance of figuring out what's going on.
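If it helps, those two steps can be wrapped in a small helper script along the following lines. This is only a rough sketch; adjust the timing and the path to ttprint.py for your setup, and start it on one node at the same time as the benchmark.

```python
#!/usr/bin/env python3
# Rough sketch: freeze Homa's timetraces partway through the run, then dump
# this node's trace once the experiment has finished.  Timings and the path
# to ttprint.py are assumptions; adjust them for your setup.
import socket
import subprocess
import time

# Wait until roughly 15-20 seconds into the 30-second Homa run, then freeze
# the timetraces (this captures traces on every node, not just this one).
time.sleep(17)
subprocess.run(["sudo", "sysctl", ".net.homa.action=7"], check=True)

# Once the whole experiment has finished, dump this node's timetrace.  The
# same ttprint.py step has to be repeated on each of the other nodes.
input("Press Enter when the experiment has finished... ")
node = socket.gethostname()
with open(f"{node}.tt", "w") as out:
    subprocess.run(["ttprint.py"], stdout=out, check=True)
```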

By the way, you mentioned in your original message that your goal is to analyze the pacer. If so, W2 probably isn't a good workload to be using, because (a) almost all messages are small, so they don't need the pacer, and (b) this workload can't come close to saturating the network, so it's even less likely that the pacer will kick in.

Hi John, that's right, I looked at the throughput for the "Overall for homa_w2 experiment" case for the experiment and was puzzled as to why it drops.

I ran your commands for both parameter settings, i.e. max_gso at 10000 and 50000. I packed the results, together with all the other files from the logs folder, into the attached tar file. Unfortunately I had to delete the rtt files because they are too big to upload here on GitHub. If you need them as well, I would have to send them to you by other means.

Thank you for your help and advice. Originally my goal was to implement the pacer in hardware, but in the first measurements on my test setup I realized that I could not see much influence from the pacer, in contrast to your paper. This is probably due to the lower number of nodes, and almost certainly due to the parameters I was using at the time, with which I only achieved a fraction of the throughput. Those first measurements did make clear, however, that my small number of nodes made receive side scaling a problem, since only 1 RX queue is used with 2 nodes, so I am now working on several concepts for RSS and for a Homa segmentation offload in my master's thesis. In this respect, my topic has changed.

homa-test_max_gso_workload2.tar.gz

I have figured out what is causing the performance difference. The settings for the benchmark specify 3 client threads to generate the request stream. This turns out to be just barely enough for this workload on your machines in the max_gso=10000 case. In the max_gso=50000 case it takes Homa a lot longer to allocate packet buffers (I'm not totally sure why this is the case, but there are known issues with Homa's approach to packet buffer allocation; this needs to be redone). As a result, the client threads can't issue requests at the desired rate. The solution is to increase the --client-ports parameter from 3 to 5 or 6; with additional threads, you should get the same performance with max_gso=50000.

By the way, what kind of machines are you running on (type and clock rate)?

Hi John, I have repeated the tests and the throughput increased, but not to the previous value. I have again compressed the log folders and attached them. With a higher value such as 6, the throughput dropped again.

I would be interested to know how you work out which parameter needs to be changed.

My test setup consists of the following hardware:
CPU: Intel Xeon E5-2697A v4 @ 2.60GHz
RAM: 4x 32GB DDR4-2666 MHz DIMMs
NIC: Mellanox ConnectX-5 100Gbps
Switch: Arista 7050QX (max 40Gbps)

4nodes_client_ports5_copy.tar.gz
4nodes_client_ports6_copy.tar.gz

Sorry for my slow response. I finally got some time to look into this, but I don't see any timetrace files in the information you sent. Can you make another run and collect the .tt files as described in my earlier comment? With those, hopefully I'll be able to figure out what is going on. The performance improvement from increasing --client-ports to 6 was not as much as I would have hoped.

In response to your question about how I know which parameter to change: this is done on a case-by-case basis by looking at the available data. Unfortunately I can't provide a simple recipe for tracking down performance problems; so far every situation seems to be a bit different. In my previous analysis, I noticed lines like "Lag due to overload: 41.7%" in the nodeN.log files for the max_gso 50000 case. This line means that the clients were not able to actually generate the requested workload (they lagged about 42% behind the rate required to generate 3.2 Gbps of traffic). This meant either (a) requests were taking too long to complete (there is a limit on how many outstanding RPCs the experiment will generate at once) or (b) the sending clients couldn't generate requests fast enough. I then saw lines like "Outstanding client RPCs: 11" in the same files. I know that the limit is 200 from the --client-max argument, so the benchmark wasn't hitting the limit and the problem couldn't be case (a). Thus it had to be case (b). I was also able to verify from the timetrace files that the clients were generating requests back-to-back as fast as they could, which confirms case (b).
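For what it's worth, extracting those two numbers from all of the node logs is easy to script. The sketch below is not an official tool; the log directory and the exact line formats are assumptions, so adjust them to match your files.

```python
#!/usr/bin/env python3
# Rough sketch: scan the nodeN.log files for the two lines discussed above
# ("Lag due to overload" and "Outstanding client RPCs") to see whether the
# clients are lagging and whether the --client-max limit is being reached.
import glob
import re

CLIENT_MAX = 200   # the value passed via --client-max in this experiment

for path in sorted(glob.glob("logs/node*.log")):   # log location is an assumption
    text = open(path).read()
    lag = re.search(r"Lag due to overload:\s*([\d.]+)%", text)
    rpcs = re.search(r"Outstanding client RPCs:\s*(\d+)", text)
    lag_pct = float(lag.group(1)) if lag else 0.0
    outstanding = int(rpcs.group(1)) if rpcs else 0
    if outstanding >= 0.9 * CLIENT_MAX:
        verdict = "near the --client-max limit, so case (a) is possible"
    elif lag_pct > 0:
        verdict = "clients can't generate requests fast enough, case (b)"
    else:
        verdict = "no client-side lag"
    print(f"{path}: lag {lag_pct:.1f}%, outstanding RPCs {outstanding}: {verdict}")
```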

With the new run I'm seeing "Lag due to overload" of 15-25%, which is better, but there shouldn't be any lag at all.

After sending my previous message, I noticed that the number of "Outstanding client RPCs" jumped up to about 180-190 (close to the limit of 200) with 6 client ports, which suggests that maybe the bottleneck has shifted from the clients to the servers. So, you might also try increasing --server-ports from 3 to 4 and see if that improves throughput.

Hi John, I had actually forgotten the timetrace files in the previous test. I have now increased the number of threads for the server and the throughput has increased slightly, but not to the target bandwidth. I have reattached the logs and also added the timetrace files.

Thank you for your analysis and your explanation of how to recognize which parameter you need to change.

4nodes_client_port_6_server_port_4.tar.gz

With the timetrace files you provided I was able to track down what's going on (but it wasn't easy). The server is still the bottleneck. This surprised me, given the number of ports and threads on the servers, but with the timetraces I was able to see that the application-level threads are being interfered with fairly significantly by the kernel transport and device driver code; the kernel-level code is stealing about half of the available cycles from the application code. How about adjusting the parameters as follows:

--client-ports 6 --port-receivers 2 --server-ports 4 --port-threads 5

In addition to increasing the number of threads per server port, this reduces the number of receiver threads per client port (I don't think 3 threads per port are needed, and reducing this frees up cores for server threads). If this still doesn't work, you might try reducing --port-receivers to 1 and increasing --port-threads to 6 (it's possible this will make things worse because of the low value of --port-receivers).

This benchmark may be approaching the limit of what can be handled with a 32-core machine; it's possible that there just aren't enough cores to make the benchmark work at this rate. It looks to me like TCP is struggling to get 3.2 Gbps as well.

Also, can you upgrade to the head before re-running? Your current version of Homa is missing a few recent improvements in instrumentation.

Hi John, thank you for your new analysis. Unfortunately my test setup was completely booked last week and is again this week, so I won't be able to run the tests until next week. I'll get back to you then with the new test results.

No rush...