ros-realtime / ros-realtime-rpi4-image

An image for the Raspberry Pi 4 with ROS 2 and Linux RT preinstalled


Remove multipath-tools package since it causes scheduling latency, as its RTPRIO is 99

shuhaowu opened this issue · comments

TL;DR: By default, Ubuntu runs a real-time service with an rtprio of 99 (the maximum possible), which can cause a scheduling latency of around 150us. We should remove the multipath-tools package to get rid of that service, as it does not seem necessary.

cc: @carlossvg @LanderU

The Problem

As part of the benchmarking and testing for some blog posts I'm writing, I noticed that the maximum latency of the Raspberry Pi 4 when running cyclictest alongside stress-ng -c 4 is around 250us. This is quite high: if you have a process that is supposed to run at 1kHz (a 1000us period), you lose 25% of your compute time. I then traced the system with ftrace to see what was causing the problem:

# stress-ng -c 4 # (In a separate terminal)
# trace-cmd start -p wakeup_rt cyclictest --mlockall --smp --priority=80 --interval=200 --distance=0 -D 60s
  plugin 'wakeup_rt'                                                                                               
# /dev/cpu_dma_latency set to 0us                                                                                  
policy: fifo: loadavg: 3.48 1.36 0.69 8/173 12907                                          
                                                                                                                   
T: 0 (12904) P:80 I:200 C: 149982 Min:     15 Act:   29 Avg:   24 Max:     313     
T: 1 (12905) P:80 I:200 C: 149801 Min:     16 Act:   26 Avg:   54 Max:     434                 
T: 2 (12906) P:80 I:200 C: 149613 Min:     16 Act:   68 Avg:   37 Max:     280           
T: 3 (12907) P:80 I:200 C: 149425 Min:     18 Act:   92 Avg:   68 Max:     266   
# trace-cmd stop
# trace-cmd show
# tracer: wakeup_rt                                                                                                
#                                                                                                                  
# wakeup_rt latency trace v1.1.5 on 5.4.140-rt64                                                                   
# --------------------------------------------------------------------                                             
# latency: 400 us, #345/345, CPU#1 | (M:preempt_rt VP:0, KP:0, SP:0 HP:0 #P:4)                                     
#    -----------------                                                                                             
#    | task: cyclictest-12905 (uid:0 nice:0 policy:1 rt_prio:80)                                                   
#    -----------------                                                                                             
#                                                                                                                  
#                    _------=> CPU#                                                                                
#                   / _-----=> irqs-off                                                                            
#                  | / _----=> need-resched                                                                        
#                  || / _---=> hardirq/softirq                                                                     
#                  ||| / _--=> preempt-depth                                                                       
#                  ||||| / _--=> preempt-lazy-depth                                                                
#                  |||||| / _-=> migrate-disable                                                                   
#                  ||||||| /     delay                                                                             
# cmd     pid      |||||||| time   |  caller                                                                       
#     \   /        ||||||||   \    |  /                                                                            
stress-n-12898     1dN.h4..    1us :    12898:120:R   + [001]   12905: 19:R cyclictest 
[omitted for brevity]
stress-n-12898     1d...3..   57us : cpu_have_feature <-__switch_to
multipat-1456      1d...3..   58us : finish_task_switch <-__schedule
[omitted for brevity]
multipat-1456      1d...3..  382us : update_curr_rt <-put_prev_task_rt
multipat-1456      1d...3..  383us : update_rt_rq_load_avg <-put_prev_task_rt
multipat-1456      1d...3..  384us : pick_next_task_stop <-__schedule
multipat-1456      1d...3..  384us : pick_next_task_dl <-__schedule
multipat-1456      1d...3..  385us : pick_next_task_rt <-__schedule
multipat-1456      1d...3..  389us : __schedule <-schedule
multipat-1456      1d...3..  389us :     1456:  0:S ==> [001]   12905: 19:R cyclictest

Note that with ftrace enabled, the entire system is slower, which explains why I'm seeing a latency of ~434us as measured by cyclictest and 400us as measured by wakeup_rt. However, the above log shows that we spent most of the time in multipat-1456, which we can find out is:

$ ps -e -o pid,class,rtprio,comm | grep 1456
   1456 RR      99 multipathd

This is a process scheduled with SCHED_RR at a priority of 99. No wonder it can get ahead of cyclictest, which I run at a priority of 80 (a common choice for RT applications). According to the Ubuntu docs, multipathd is involved with multi-path storage. While I'm not familiar with what it is, I don't think it is relevant to the RT base image for robotics (or even other RT applications).
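One way to check the claim that most of the wakeup latency was spent in multipathd is to add up the gaps between trace timestamps per task. The following is a minimal sketch, not part of trace-cmd: the function name trace_time_per_task is made up for illustration, and it assumes each gap belongs to the task named on the line that starts the gap, which is only approximate when lines have been omitted from the trace.

```shell
#!/bin/sh
# Sum the time between consecutive wakeup_rt trace lines, charging each gap
# to the task named on the line that starts the gap.
# Reads `trace-cmd show` output on stdin. Header lines (starting with "#")
# and lines without a "<N>us" timestamp field are skipped.
trace_time_per_task() {
    awk '
        /^#/ { next }
        {
            ts = -1
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^[0-9]+us$/) { ts = $i + 0; break }
            }
            if (ts < 0) next
            if (prev_task != "") total[prev_task] += ts - prev_ts
            prev_task = $1
            prev_ts = ts
        }
        END {
            for (t in total) printf "%s %dus\n", t, total[t]
        }
    ' | sort
}
```

It could be used as `trace-cmd show | trace_time_per_task`. The output is one line per task with the microseconds charged to it, so a task that dominates the wakeup latency stands out immediately.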

Bonus picture: here's a kernelshark visualization of multipathd (blue) jumping in front of cyclictest (green). The two black lines before the blue bar (multipathd) are Marker A and Marker B respectively, which have a delta of 0.000197482s (197us). This is expected, as cyclictest is configured to wake up every 200us. This periodicity is kept very well (as indicated by the regular appearance of the green lines) until multipathd appears and steals the CPU for 139us.

[Image: kernelshark visualization of multipathd (blue) preempting cyclictest (green)]

Testing the solution

So I disabled multipathd and verified that it is dead:

# systemctl mask multipathd
# systemctl stop multipathd
# ps aux | grep multipath
ubuntu     13061  0.0  0.0   7692   616 pts/2    S+   17:10   0:00 grep --color=auto multipath

Then I started cyclictest with stress-ng again

# stress-ng -c 4 # (In a separate terminal)
# cyclictest --mlockall --smp --priority=80 --interval=200 --distance=0 -D 15m

policy: fifo: loadavg: 6.79 6.17 4.06 5/166 13260           

T: 0 (13103) P:80 I:200 C:4499997 Min:     13 Act:   18 Avg:   20 Max:     132
T: 1 (13104) P:80 I:200 C:4499803 Min:     13 Act:   16 Avg:   19 Max:     115
T: 2 (13105) P:80 I:200 C:4499606 Min:     13 Act:   21 Avg:   20 Max:     136
T: 3 (13106) P:80 I:200 C:4499435 Min:     13 Act:   16 Avg:   19 Max:     138

This is a significantly better result than the nominal 250us that I saw with multipathd running.
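To compare runs like the two above in terms of a control loop's time budget, the worst-case latency can be pulled out of the cyclictest summary and expressed as a share of the loop period. This is a rough sketch: both function names are invented for illustration, and the parsing assumes the "T: ... Max: N" summary format shown above.

```shell
#!/bin/sh
# Print the largest "Max:" value (in microseconds) across all cyclictest
# thread summary lines read from stdin.
cyclictest_max_us() {
    awk '
        /^T:/ {
            for (i = 1; i < NF; i++) {
                if ($i == "Max:" && $(i + 1) + 0 > max) max = $(i + 1) + 0
            }
        }
        END { print max + 0 }
    '
}

# Print the percentage of one loop period that a given latency consumes.
# $1: latency in microseconds, $2: loop rate in Hz.
# percent = latency_us / (1e6 / hz) * 100 = latency_us * hz / 10000
latency_budget_percent() {
    awk -v lat="$1" -v hz="$2" 'BEGIN { printf "%.1f\n", lat * hz / 10000 }'
}
```

For example, `latency_budget_percent 250 1000` prints 25.0, matching the 25% figure for a 1kHz loop, while the 138us maximum measured without multipathd works out to 13.8.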

After a bit of investigation, it turns out that multipathd comes from the Ubuntu base image. I'm not sure why it is installed on the Raspberry Pi image, as it is definitely not on my desktop system. We can remove this package during the image build to prevent this service from being installed.

root@rpi4-image-build:/# apt list --installed | grep multi

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.                                                                               

multipath-tools/now 0.8.3-1ubuntu2 arm64 [installed,local]

Sorry I haven't gotten around to the command line builder yet. I got a branch, but as you can see I keep getting distracted by other stuff.

I agree to remove multipath-tools. I saw it on htop and I was thinking of removing it too.

What I would do is check whether there are other real-time processes in the system that we could remove. For everything we remove, we should add a note in the instructions and the release notes so people are aware of the differences from the Ubuntu base image. Ideally, we should be able to get results as good as with Raspbian (see https://www.osadl.org/Latency-plot-of-system-in-rack-4-slot.qa-latencyplot-r4s1.0.html?shadow=1).

Btw, the kernelshark picture is great, this is the kind of analysis I expected to do using ros2tracing and compass. Once your blogpost is ready please share it in the RTWG :)

@shuhaowu @carlossvg, yeah, we can sudo systemctl disable multipathd in our build process.

I created the PR to remove it.

@carlossvg I looked at the results you linked and it is not obvious to me whether they stressed the system at the same time as running cyclictest. My personal testing seems to replicate their latency plot when the system is idling (left plot):

[Image: cyclictest latency plots, idle system (left) and under stress (right)]

Under stress, with a RT kernel, the situation is a bit worse. The above right plot was actually generated before I removed multipathd. With multipathd removed, the distribution has not been changed, although the maximum has been significantly reduced:

[Image: cyclictest latency plot under stress with multipathd removed]

I have data from various kinds of benchmarks here: https://github.com/shuhaowu/rt-demo/blob/master/data/cyclictest-rpi4/plot.ipynb. Maybe it's worthwhile documenting that stuff somewhere as well.

Oh! I forgot to mention in the last post, but there don't seem to be any additional processes running with RT priority on the system that shouldn't be:

ubuntu@ubuntu:~$ ps -e -o pid,class,rtprio,comm | grep RR
ubuntu@ubuntu:~$ ps -e -o pid,class,rtprio,comm | grep FF
     10 FF       1 rcu_preempt
     11 FF       1 rcub/0
     12 FF       1 rcuc/0
     13 FF      99 posixcputmr/0
     14 FF      99 migration/0
     18 FF      99 migration/1
     19 FF      99 posixcputmr/1
     20 FF       1 rcuc/1
     25 FF      99 migration/2
     26 FF      99 posixcputmr/2
     27 FF       1 rcuc/2
     32 FF      99 migration/3
     33 FF      99 posixcputmr/3
     34 FF       1 rcuc/3
     42 FF      50 irq/11-fe00b880
    101 FF      99 watchdogd
    109 FF      50 irq/43-PCIe PME
    110 FF      50 irq/43-aerdrv
    111 FF      49 irq/43-s-aerdrv
    114 FF      50 irq/44-xhci_hcd
    121 FF      50 irq/29-VCHIQ do
    149 FF      50 irq/22-DMA IRQ
    150 FF      50 irq/24-DMA IRQ
    151 FF      50 irq/15-fe204000
    153 FF      50 irq/17-fe804000
    154 FF      50 irq/25-DMA IRQ
    155 FF      50 irq/30-mmc1
    157 FF      50 irq/30-mmc0
    158 FF      49 irq/30-s-mmc0
    218 FF      50 irq/18-fe980000
    277 FF      50 irq/18-fe980000
    278 FF      50 irq/18-dwc2_hso
   1490 FF      50 irq/36-eth0
   1491 FF      50 irq/37-eth0
   1686 FF      50 irq/16-ttyS0
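Since almost everything in a listing like this is a kernel thread, a small filter can help when checking for user-space services such as multipathd. The sketch below is an assumption-laden helper, not an existing tool: the name userspace_rt_procs is hypothetical, it expects the extra ppid column in the ps output, and it treats anything parented by kthreadd (PID 2) as a kernel thread.

```shell
#!/bin/sh
# Filter `ps -e -o pid,ppid,class,rtprio,comm` output (stdin) down to
# user-space processes using SCHED_FIFO (FF) or SCHED_RR (RR) at or above a
# minimum priority. $1: minimum rtprio (default 1).
# Assumption: kernel threads are children of kthreadd (PID 2), so rows with
# PPID 2, and PID 2 itself, are treated as kernel threads and skipped.
userspace_rt_procs() {
    awk -v min="${1:-1}" '
        NR == 1 && $1 == "PID" { next }   # skip the ps header row
        ($3 == "FF" || $3 == "RR") && $1 != 2 && $2 != 2 && $4 + 0 >= min {
            name = $5
            for (i = 6; i <= NF; i++) name = name " " $i
            print $1, $3, $4, name
        }
    '
}
```

A possible invocation is `ps -e -o pid,ppid,class,rtprio,comm | userspace_rt_procs 80`, which would list only user-space processes able to preempt or compete with an application running at priority 80.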

One thing we might want to do for an application is tune the priority of the IRQ handlers. 50 looks like a sane default, but if an application is running at priority 80, then there's a chance that the IRQ handler never runs. For example, an RT application that needs USB probably needs to ensure that the priority of the IRQ handler for it is properly tuned. This is application dependent, so it probably belongs in an advice section rather than being done here by default.

@carlossvg I looked at the results you linked and it is not obvious to me whether they stressed the system at the same time as running cyclictest. My personal testing seems to replicate their latency plot when the system is idling (left plot):

@shuhaowu This is a good question. According to this https://www.osadl.org/Latency-plots.latency-plots.0.html, the latency plot would correspond to the green bar between 8:00 and 12:00. If this is the case the experiment includes some simulated load.

I have data from various kinds of benchmarks here: https://github.com/shuhaowu/rt-demo/blob/master/data/cyclictest-rpi4/plot.ipynb. Maybe it's worthwhile documenting that stuff somewhere as well.

I took a look at the demo you are working on and I'm really interested in two of the topics you cover:

  • Stressing approaches comparison: I would like to select a stressing approach we can reuse as the default for the experiments. Stressing the CPUs is a good start but it's very limited. I like the hackbench and the OSADL simulated workload approaches. On the other hand, stress-ng is very complete and we should be able to simulate every scenario with it.
  • Data passing demos: I'm interested in documenting all the different options and providing examples for ROS 2 applications

One thing we might want to do for an application is tune the priority of the IRQ handlers. 50 looks like a sane default, but if an application is running at priority 80, then there's a chance that the IRQ handler never runs. For example, an RT application that needs USB probably needs to ensure that the priority of the IRQ handler for it is properly tuned. This is application dependent, so it probably belongs in an advice section rather than being done here by default.

Yep. In the case of ROS 2 applications using the network stack, this is something important to take into account. For example, in that case, I would keep the application priority above 50 and the network IRQs priorities above the application thread priorities. Here is an interesting talk related to that topic. https://www.youtube.com/watch?v=-pehAzaP1eg There is a discussion at the end related to running the system IRQ thread priorities. Something we could do is to add a script to tune the network (and other) IRQ thread priorities and affinities. Then we can refer to this script in the documentation.
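A script along those lines might start as a dry run that only prints the commands it would execute. The following is a sketch under assumptions rather than a tested tuning recipe: the function name irq_tuning_cmds, the priority 90, and the CPU mask are all hypothetical values chosen for illustration, and whether writing smp_affinity takes effect depends on the interrupt controller.

```shell
#!/bin/sh
# Read `ps -e -o pid,class,rtprio,comm` output on stdin and print (not run)
# the commands that would retune each threaded IRQ handler whose name
# contains a given string.
# $1: string to look for in the thread name, e.g. "eth0"
# $2: new SCHED_FIFO priority for the IRQ thread
# $3: CPU affinity bitmask in hex, e.g. "2" for CPU 1
irq_tuning_cmds() {
    awk -v pat="$1" -v prio="$2" -v mask="$3" '
        $4 ~ /^irq\/[0-9]+-/ {
            name = $4
            for (i = 5; i <= NF; i++) name = name " " $i
            if (index(name, pat) == 0) next
            irq = $4
            sub(/^irq\//, "", irq)   # "irq/36-eth0" becomes "36-eth0"
            sub(/-.*/, "", irq)      # "36-eth0" becomes "36"
            printf "chrt -f -p %s %s\n", prio, $1
            printf "echo %s > /proc/irq/%s/smp_affinity\n", mask, irq
        }
    '
}
```

For instance, `ps -e -o pid,class,rtprio,comm | irq_tuning_cmds eth0 90 2` would print a chrt line and an smp_affinity line for each eth0 IRQ thread. Keeping it as a dry run lets the output be reviewed, or piped to a root shell, once the values suit the application.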

I just finished writing my blog post (part 2/4) about the problem for this issue: https://shuhaowu.com/blog/2022/02-linux-rt-appdev-part2.html. I'll share it in the RTWG later as well. It should contain nothing this group doesn't already know about, tho. That said, it's designed for a separate audience (robotics people interested in RT, and perhaps other RT people like audio/low-latency game emulation). There's also a part 1 (https://shuhaowu.com/blog/2022/01-linux-rt-appdev-part1.html), which is even more generic.

I took a look at the demo you are working on and I'm really interested in two of the topics you cover:

Stressing approaches comparison: I would like to select a stressing approach we can reuse as the default for the experiments. Stressing the CPUs is a good start but it's very limited. I like the hackbench and the OSADL simulated workload approaches. On the other hand, stress-ng is very complete and we should be able to simulate every scenario with it.

Maybe there's interest in defining a "standard" set of benchmarks that run cyclictest along with stress-ng (and maybe others). I'm not sure if this is possible, as every application is different. That said, providing a starting point could be nice, as I don't think anyone else has defined a standard set. The tests I did in that repo are based on various talks and articles I've seen.

Data passing demos: I'm interested in documenting all the different options and providing examples for ROS 2 applications

This is an area that I feel is under-explored, especially in robotics-related RT: how to pass data. I've seen a few conference talks about this (see this one as a starting point), but they're mostly focused on audio programming. I'm currently playing with their techniques in my rt-demo repo, and I'll likely finish it over the next few weeks. I'm hoping I'll summarize all of my knowledge in part 4 of the blog series above.

Something we could do is to add a script to tune the network (and other) IRQ thread priorities and affinities.

Definitely something we can do. Another thing I would love to do is figure out exactly which IRQ handlers you need to tune for a particular application. I'm not an expert in tracing this stuff yet, so perhaps once I learn it I'll write it up as part 5 of my series 😄 .

This was done.