Remove multipath-tools package since it causes scheduling latency, as its RTPRIO is 99
shuhaowu opened this issue · comments
TL;DR: Ubuntu by default ships a real-time service with an rtprio of 99 (the maximum possible) that can cause scheduling latency of around ~150us. We should remove the multipath-tools
package to get rid of that service, as it does not seem necessary.
cc: @carlossvg @LanderU
The Problem
As part of my benchmarking and testing for some blog posts I'm writing, I've noticed that the maximum latency of the Raspberry Pi 4 when running cyclictest alongside stress-ng -c 4
is around 250us. This is quite high: if you have a process that is supposed to run at 1kHz, you have lost 25% of your compute time. I then figured I would trace the system via ftrace to see what's causing the problem:
# stress-ng -c 4 # (In a separate terminal)
# trace-cmd start -p wakeup_rt cyclictest --mlockall --smp --priority=80 --interval=200 --distance=0 -D 60s
plugin 'wakeup_rt'
# /dev/cpu_dma_latency set to 0us
policy: fifo: loadavg: 3.48 1.36 0.69 8/173 12907
T: 0 (12904) P:80 I:200 C: 149982 Min: 15 Act: 29 Avg: 24 Max: 313
T: 1 (12905) P:80 I:200 C: 149801 Min: 16 Act: 26 Avg: 54 Max: 434
T: 2 (12906) P:80 I:200 C: 149613 Min: 16 Act: 68 Avg: 37 Max: 280
T: 3 (12907) P:80 I:200 C: 149425 Min: 18 Act: 92 Avg: 68 Max: 266
# trace-cmd stop
# trace-cmd show
# tracer: wakeup_rt
#
# wakeup_rt latency trace v1.1.5 on 5.4.140-rt64
# --------------------------------------------------------------------
# latency: 400 us, #345/345, CPU#1 | (M:preempt_rt VP:0, KP:0, SP:0 HP:0 #P:4)
# -----------------
# | task: cyclictest-12905 (uid:0 nice:0 policy:1 rt_prio:80)
# -----------------
#
# _------=> CPU#
# / _-----=> irqs-off
# | / _----=> need-resched
# || / _---=> hardirq/softirq
# ||| / _--=> preempt-depth
# ||||| / _--=> preempt-lazy-depth
# |||||| / _-=> migrate-disable
# ||||||| / delay
# cmd pid |||||||| time | caller
# \ / |||||||| \ | /
stress-n-12898 1dN.h4.. 1us : 12898:120:R + [001] 12905: 19:R cyclictest
[omitted for brevity]
stress-n-12898 1d...3.. 57us : cpu_have_feature <-__switch_to
multipat-1456 1d...3.. 58us : finish_task_switch <-__schedule
[omitted for brevity]
multipat-1456 1d...3.. 382us : update_curr_rt <-put_prev_task_rt
multipat-1456 1d...3.. 383us : update_rt_rq_load_avg <-put_prev_task_rt
multipat-1456 1d...3.. 384us : pick_next_task_stop <-__schedule
multipat-1456 1d...3.. 384us : pick_next_task_dl <-__schedule
multipat-1456 1d...3.. 385us : pick_next_task_rt <-__schedule
multipat-1456 1d...3.. 389us : __schedule <-schedule
multipat-1456 1d...3.. 389us : 1456: 0:S ==> [001] 12905: 19:R cyclictest
Note that with ftrace enabled, the entire system is slower, which explains why I'm seeing a latency of ~434us as measured by cyclictest and 400us as measured by wakeup_rt. However, the above log shows that most of the time is spent in multipat-1456, which turns out to be:
$ ps -e -o pid,class,rtprio,comm | grep 1456
1456 RR 99 multipathd
This is a process scheduled with SCHED_RR at a priority of 99. No wonder it gets ahead of cyclictest, which I set to a priority of 80 (a common choice for RT applications). A look at the Ubuntu docs suggests that multipathd is involved with multi-path storage. While I'm not familiar with it, I don't think it is relevant to the RT base image for robotics (or even to other RT applications).
Bonus picture: here's a kernelshark visualization of multipathd (blue) jumping in front of cyclictest (green). The two black lines before the blue bar (multipathd) are Marker A and Marker B respectively, which have a delta of 0.000197482s (197us). This is expected, as cyclictest is configured to wake up every 200us. This periodicity is kept very well (as indicated by the regular appearance of the green lines) until the appearance of multipathd, which steals the CPU for 139us.
Testing the solution
So I masked and stopped multipathd and verified that it is dead:
# systemctl mask multipathd
# systemctl stop multipathd
# ps aux | grep multipath
ubuntu 13061 0.0 0.0 7692 616 pts/2 S+ 17:10 0:00 grep --color=auto multipath
Then I started cyclictest with stress-ng again
# stress-ng -c 4 # (In a separate terminal)
# cyclictest --mlockall --smp --priority=80 --interval=200 --distance=0 -D 15m
policy: fifo: loadavg: 6.79 6.17 4.06 5/166 13260
T: 0 (13103) P:80 I:200 C:4499997 Min: 13 Act: 18 Avg: 20 Max: 132
T: 1 (13104) P:80 I:200 C:4499803 Min: 13 Act: 16 Avg: 19 Max: 115
T: 2 (13105) P:80 I:200 C:4499606 Min: 13 Act: 21 Avg: 20 Max: 136
T: 3 (13106) P:80 I:200 C:4499435 Min: 13 Act: 16 Avg: 19 Max: 138
This is a significantly better result than the ~250us nominal maximum I saw with multipathd running.
After a bit of investigation, it turns out multipathd comes from the Ubuntu base image. I'm not sure why it is installed on the Raspberry Pi image, as it is definitely not on my desktop system. We can remove this package at image-build time to prevent this service from being installed.
root@rpi4-image-build:/# apt list --installed | grep multi
WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
multipath-tools/now 0.8.3-1ubuntu2 arm64 [installed,local]
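A possible image-build step could look like the sketch below. It is deliberately a dry run: the `run` helper only prints each command so the list can be reviewed, and in a real build you would execute the commands as root inside the image chroot. The package and unit names are the ones observed above; everything else is illustrative.

```shell
#!/bin/sh
# Sketch of an image-build step to keep multipathd out of the image.
# Dry-run: `run` prints each command instead of executing it.
set -eu

run() { echo "+ $*"; }   # swap for: run() { "$@"; } to actually execute

run apt-get purge -y multipath-tools
run apt-get autoremove -y
# In case a dependency pulls the package back in later, mask the units too:
run systemctl mask multipathd.service multipathd.socket
```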
Sorry I haven't gotten around to the command line builder yet. I got a branch, but as you can see I keep getting distracted by other stuff.
I agree to remove multipath-tools. I saw it on htop and I was thinking of removing it too.
What I would do is check whether there are other real-time processes in the system we could remove. For everything we remove, we should add a note in the instructions and the release notes so people are aware of the differences from the Ubuntu base image. Ideally, we should be able to get results as good as with Raspbian (see https://www.osadl.org/Latency-plot-of-system-in-rack-4-slot.qa-latencyplot-r4s1.0.html?shadow=1).
Btw, the kernelshark picture is great, this is the kind of analysis I expected to do using ros2tracing and compass. Once your blogpost is ready please share it in the RTWG :)
@shuhaowu @carlossvg, yeah, we can sudo systemctl disable multipathd
in our build process.
I created the PR to remove it.
@carlossvg I looked at the results you linked, and it is not obvious to me whether they stressed the system at the same time as running cyclictest. My personal testing seems to replicate their latency plot when the system is idling (left plot):
Under stress, with an RT kernel, the situation is a bit worse. The above right plot was actually generated before I removed multipathd. With multipathd removed, the distribution has not changed, although the maximum has been significantly reduced:
I have data from various kinds of benchmarking here: https://github.com/shuhaowu/rt-demo/blob/master/data/cyclictest-rpi4/plot.ipynb. Maybe it's worthwhile documenting that stuff somewhere as well.
Oh! I forgot to mention this in the last post, but there don't seem to be any additional processes running with RT priority on the system that shouldn't be:
ubuntu@ubuntu:~$ ps -e -o pid,class,rtprio,comm | grep RR
ubuntu@ubuntu:~$ ps -e -o pid,class,rtprio,comm | grep FF
10 FF 1 rcu_preempt
11 FF 1 rcub/0
12 FF 1 rcuc/0
13 FF 99 posixcputmr/0
14 FF 99 migration/0
18 FF 99 migration/1
19 FF 99 posixcputmr/1
20 FF 1 rcuc/1
25 FF 99 migration/2
26 FF 99 posixcputmr/2
27 FF 1 rcuc/2
32 FF 99 migration/3
33 FF 99 posixcputmr/3
34 FF 1 rcuc/3
42 FF 50 irq/11-fe00b880
101 FF 99 watchdogd
109 FF 50 irq/43-PCIe PME
110 FF 50 irq/43-aerdrv
111 FF 49 irq/43-s-aerdrv
114 FF 50 irq/44-xhci_hcd
121 FF 50 irq/29-VCHIQ do
149 FF 50 irq/22-DMA IRQ
150 FF 50 irq/24-DMA IRQ
151 FF 50 irq/15-fe204000
153 FF 50 irq/17-fe804000
154 FF 50 irq/25-DMA IRQ
155 FF 50 irq/30-mmc1
157 FF 50 irq/30-mmc0
158 FF 49 irq/30-s-mmc0
218 FF 50 irq/18-fe980000
277 FF 50 irq/18-fe980000
278 FF 50 irq/18-dwc2_hso
1490 FF 50 irq/36-eth0
1491 FF 50 irq/37-eth0
1686 FF 50 irq/16-ttyS0
One thing we might want to do for an application is tune the priority of the IRQ handlers. 50 looks like a sane default, but if an application is running at priority 80, there's a chance that the IRQ handler never runs. For example, an RT application that needs USB probably needs to ensure that the priority of its IRQ handler is properly tuned. This is application dependent, so it probably belongs in an advice section as opposed to being done here by default.
@carlossvg I looked at the results you linked, and it is not obvious to me whether they stressed the system at the same time as running cyclictest. My personal testing seems to replicate their latency plot when the system is idling (left plot):
@shuhaowu This is a good question. According to this https://www.osadl.org/Latency-plots.latency-plots.0.html, the latency plot would correspond to the green bar between 8:00 and 12:00. If this is the case the experiment includes some simulated load.
I have data from various kinds of benchmarking here: https://github.com/shuhaowu/rt-demo/blob/master/data/cyclictest-rpi4/plot.ipynb. Maybe it's worthwhile documenting that stuff somewhere as well.
I took a look at the demo you are working on and I'm really interested in two of the topics you cover:
- Stressing approaches comparison: I would like to select a default stressing approach we can reuse for the experiments. Stressing the CPUs is a good start, but it's very limited. I like the hackbench and the OSADL simulated-workload approaches. On the other hand, stress-ng is very complete, and we should be able to simulate every scenario with it.
- Data passing demos: I'm interested in documenting all the different options and providing examples for ROS 2 applications
One thing we might want to do for an application is tune the priority of the IRQ handlers. 50 looks like a sane default, but if an application is running at priority 80, there's a chance that the IRQ handler never runs. For example, an RT application that needs USB probably needs to ensure that the priority of its IRQ handler is properly tuned. This is application dependent, so it probably belongs in an advice section as opposed to being done here by default.
Yep. In the case of ROS 2 applications using the network stack, this is important to take into account. For example, in that case, I would keep the application priority above 50 and the network IRQ priorities above the application thread priorities. Here is an interesting talk related to that topic: https://www.youtube.com/watch?v=-pehAzaP1eg There is a discussion at the end related to tuning the system IRQ thread priorities. Something we could do is add a script to tune the network (and other) IRQ thread priorities and affinities. Then we can refer to this script in the documentation.
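One possible shape for such a script is sketched below. This is only an illustration: the thread-name pattern, the priority 85, and the CPU list are example values, not recommendations, and the script deliberately prints the `chrt`/`taskset` commands on stdout so they can be reviewed (and piped to `sh` as root) rather than executing them directly.

```shell
#!/bin/sh
# Reads "pid comm" pairs (e.g. from `ps -e -o pid=,comm=`) on stdin and
# prints chrt/taskset commands for the irq threads matching a pattern.
# Dry-run on purpose: pipe the output to `sh` once you have reviewed it.
set -eu

emit_irq_tuning() {
  pattern="$1"; prio="$2"; cpus="$3"
  awk -v pat="$pattern" -v prio="$prio" -v cpus="$cpus" '
    $2 ~ pat {
      print "chrt -f -p " prio " " $1    # SCHED_FIFO at the given priority
      print "taskset -cp " cpus " " $1   # pin to the given CPU list
    }'
}

# Example policy: put the eth0 irq threads above an app running at 80,
# pinned to CPU 0 (numbers purely illustrative):
ps -e -o pid=,comm= | emit_irq_tuning 'irq/.*eth0' 85 0
```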
I just finished writing my blog post (part 2/4) about the problem for this issue: https://shuhaowu.com/blog/2022/02-linux-rt-appdev-part2.html. I'll share it in the RTWG later as well. It should contain nothing this group doesn't already know about, tho. That said, it's designed for a separate audience (robotics people interested in RT, and perhaps other RT people like audio/low-latency game emulation). There's also a part 1(https://shuhaowu.com/blog/2022/01-linux-rt-appdev-part1.html), which is even more generic.
I took a look at the demo you are working on and I'm really interested in two of the topics you cover:
Stressing approaches comparison: I would like to select a default stressing approach we can reuse for the experiments. Stressing the CPUs is a good start, but it's very limited. I like the hackbench and the OSADL simulated-workload approaches. On the other hand, stress-ng is very complete, and we should be able to simulate every scenario with it.
Maybe there's interest in defining a "standard" set of benchmarks, built around running cyclictest along with stress-ng (and maybe others). I'm not sure if this is possible, as every application is different. That said, providing a starting point could be nice, as I don't think a standard set is defined anywhere else. The tests I did in that repo are based on various talks and articles I've seen.
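As a sketch of what such a default set might look like, the script below prints one cyclictest invocation per stressor for review before running. The stressor list, the 15-minute duration, and the `./data` output directory are purely illustrative, not a proposed standard.

```shell
#!/bin/sh
# Prints one cyclictest run per stressor; review the list, then pipe it
# to `sh` on the target (as root) to actually execute the runs.
set -eu

print_runs() {
  dur=15m; out=./data; i=0   # ./data is a hypothetical results directory
  for stress in "stress-ng -c 4" "hackbench -l 100000" "stress-ng --iomix 2"; do
    i=$((i + 1))
    # -h 400 keeps a latency histogram we can plot afterwards.
    echo "($stress &); cyclictest --mlockall --smp --priority=80" \
         "--interval=200 --distance=0 -D $dur -h 400 > $out/run$i.hist"
  done
}

print_runs
```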
Data passing demos: I'm interested in documenting all the different options and providing examples for ROS 2 applications
This is an area that I feel is under-explored, especially in robotics-related RT: how to pass data. I've seen a few conference talks about this (see this one as a starting point), but they're mostly focused on audio programming. I'm currently playing with their techniques in my rt-demo
repo, and I'll likely finish it over the next few weeks. I'm hoping I'll summarize all of my knowledge in part 4 of the blog series above.
Something we could do is to add a script to tune the network (and other) IRQ thread priorities and affinities.
Definitely something we can do. Another thing I would love to do is figure out exactly which IRQ handlers you need to tune for a given application. I'm not an expert in tracing this stuff yet, so perhaps once I learn it I'll write it up as part 5 of my series 😄 .
This was done.