The reference system

Because ROS is developed in a distributed fashion across many different organizations, it is sometimes hard to benchmark and concretely show how a given change to a given system improves or degrades that system's performance. For example, did a change from one executor to another actually reduce CPU usage, or was it something else entirely?

The reference_system package was developed to provide the fundamental building blocks for creating complex systems that can then be used to evaluate features or performance in a standardized and repeatable way.

The first project to use this reference_system is the autoware_reference_system.

More complex reference systems could be proposed in the future, using the same basic node building blocks from the reference_system package.

Defining a reference system

A reference system is defined by:

- a platform, which is defined by:
  - Hardware (e.g. an off-the-shelf single-board computer, embedded ECU, etc.)
    - if multiple configurations are available for such hardware, specify which one is used
  - Operating System (OS) like RT Linux, QNX, etc., along with any special configurations made
- for simplicity and ease of benchmarking, all nodes must run in a single process
- a fixed number of nodes
  - each node with:
    - a fixed number of publishers and subscribers
    - a fixed processing time or a fixed publishing rate
- a fixed message type of fixed size to be used for every node
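
The attributes listed above can be captured in a small data structure. The following is a minimal Python sketch, not part of the reference_system package: the class names, field names, the `ExampleMessage` type, and the numbers in `example_system()` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Platform:
    hardware: str               # e.g. "Raspberry Pi 4B, 4 GB RAM"
    operating_system: str       # e.g. "Linux with a real-time kernel"
    os_configuration: str = ""  # any special configuration that was made


@dataclass(frozen=True)
class NodeSpec:
    name: str
    publishers: int
    subscribers: int
    processing_time_ms: Optional[int] = None    # fixed processing time, or ...
    publishing_rate_hz: Optional[float] = None  # ... fixed publishing rate

    def __post_init__(self):
        # Assumption: at least one of the two timing attributes must be fixed.
        if self.processing_time_ms is None and self.publishing_rate_hz is None:
            raise ValueError(f"{self.name}: fix a processing time or a publishing rate")


@dataclass(frozen=True)
class ReferenceSystem:
    platform: Platform
    nodes: Tuple[NodeSpec, ...]  # fixed number of nodes
    message_type: str            # one fixed message type for every node
    message_size_bytes: int      # ... of fixed size
    single_process: bool = True  # all nodes must run in a single process

    def __post_init__(self):
        if not self.single_process:
            raise ValueError("all nodes must run in a single process")
        if self.message_size_bytes <= 0:
            raise ValueError("message size must be positive")


def example_system() -> ReferenceSystem:
    """Build a tiny two-node system; all values are illustrative only."""
    platform = Platform("Raspberry Pi 4B, 4 GB RAM", "Linux with a real-time kernel")
    nodes = (
        NodeSpec("sensor", publishers=1, subscribers=0, publishing_rate_hz=10.0),
        NodeSpec("transform", publishers=1, subscribers=1, processing_time_ms=5),
    )
    return ReferenceSystem(platform, nodes, "ExampleMessage", 4096)
```

Freezing the dataclasses reflects the intent that a reference system stays fixed once defined, so that two benchmark runs compare the same system.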

With these attributes defined, the reference system can be replicated across many different configurations, so that each configuration can be benchmarked against the others in a reliable and fair manner.

With this approach, portable and repeatable tests can also be defined to reliably confirm whether a given reference system meets the requirements.

Supported Platforms

To enable as many people as possible to replicate this reference system, the platform(s) were chosen to be easily accessible (inexpensive, high volume), well documented, widely used by the community, and likely to be supported well into the future.

Platforms were not chosen for performance of the reference system. We know we could run “faster” with a more powerful CPU or GPU, but then it would be harder for others to validate findings and test their own configurations. Accessibility is the key here and will be the main consideration when further platforms are proposed for this benchmark list.

Platforms:

Raspberry Pi 4B:
- 4 GB RAM version is the assumed default
  - other versions could also be tested / added by the community
- real-time Linux kernel

Note: create an issue to add more platforms to the list, keeping in mind the above criteria

!!! warning
    Each reference system can also be run on other targets; however, the results will change drastically depending on the specifications of the target hardware.

Base node types

Most real-world systems can be boiled down to only a handful of base node "types" that are then repeated to make up the real-world system. This does not cover all possible node types; however, it allows numerous complicated systems to be developed using the same base building blocks.

Sensor Node
- input node to system
- one publisher, zero subscribers
- publishes message cyclically at some fixed frequency

Transform Node
- one subscriber, one publisher
- starts processing for N milliseconds after a message is received
- publishes message after processing is complete

Fusion Node
- 2 subscribers, one publisher
- starts processing for N milliseconds after a message is received from all subscriptions
- publishes message after processing is complete

Cyclic Node
- N subscribers, one publisher
- cyclically processes all received messages since the last cycle for N milliseconds
- publishes message after processing is complete

Command Node
- prints output stats every time a message is received

Intersection Node
- behaves like N transform nodes
- N subscribers, N publishers bundled together in one-to-one connections
- starts processing on the connection where the sample was received
- publishes message after processing is complete
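
The difference between the transform, fusion and cyclic behaviors described above is mostly about when processing is triggered. The following plain-Python sketch models only that triggering logic. It does not use ROS or the reference_system API, and every name in it (`busy_wait_ms`, `on_message`, `on_timer`, the dictionary messages) is a hypothetical stand-in.

```python
import time


def busy_wait_ms(duration_ms):
    """Burn CPU for roughly duration_ms to stand in for real processing."""
    end = time.perf_counter() + duration_ms / 1000.0
    while time.perf_counter() < end:
        pass


class TransformNode:
    """One subscriber, one publisher: process, then publish."""

    def __init__(self, publish, processing_ms):
        self.publish = publish
        self.processing_ms = processing_ms

    def on_message(self, msg):
        busy_wait_ms(self.processing_ms)
        self.publish(msg)


class FusionNode:
    """Two subscribers, one publisher: process only once both inputs arrived."""

    def __init__(self, publish, processing_ms):
        self.publish = publish
        self.processing_ms = processing_ms
        self.pending = {}  # input index (0 or 1) -> latest message

    def on_message(self, input_index, msg):
        self.pending[input_index] = msg
        if len(self.pending) < 2:
            return  # still waiting for the other subscription
        busy_wait_ms(self.processing_ms)
        self.publish({"fused": [self.pending[0], self.pending[1]]})
        self.pending.clear()


class CyclicNode:
    """N subscribers, one publisher: collect messages, process them on a timer."""

    def __init__(self, publish, processing_ms):
        self.publish = publish
        self.processing_ms = processing_ms
        self.received = []

    def on_message(self, msg):
        self.received.append(msg)  # no processing on arrival

    def on_timer(self):
        batch, self.received = self.received, []
        busy_wait_ms(self.processing_ms)
        self.publish({"batch": batch})
```

For example, wiring `FusionNode(out.append, 5)` to a list `out` and calling `on_message(0, ...)` leaves `out` empty; only the following `on_message(1, ...)` produces one published message.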

These basic building-block nodes can be mixed and matched to create quite complex systems that replicate real-world scenarios, so that different configurations can be benchmarked against each other.

New base node types can be added if necessary.

Implemented reference systems

The first reference system benchmark proposed is based on the Autoware.Auto LiDAR data pipeline, as stated above and as shown in the node graph image above.

Autoware Reference System
- ROS2
  - Executors
    - Single Threaded
    - Static Single Threaded
    - Multithreaded
    - Callback Group
    - Prioritized

Results below show various characteristics of the same simulated system (Autoware.Auto).

Testing and Dependencies

Common benchmarking scripts are provided within the reference_system/reference_system_py directory, which is itself a Python module. The methods and tools provided there can assist with running standardized benchmarking tests and with generating reports. See the autoware_reference_system for an example.

Unit and integration tests have also been written for the reference_system and can be found within the test directory. If a new system type is added, new unit and integration tests should be added as well.

Set up Raspberry Pi 4 for the test

The goal is to provide a clean computation environment for the test, avoiding interference from other Ubuntu components.

Set up a constant CPU frequency

The frequency is set to 1.50 GHz for all CPUs.

# run the following commands as root
sudo su

echo -n "setting constant CPU frequency to 1.50 GHz ... "
# disable ondemand governor
systemctl disable ondemand

# set performance governor for all cpus
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor >/dev/null

# set constant frequency
echo 1500000 | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_min_freq >/dev/null
echo 1500000 | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_max_freq >/dev/null

# reset frequency counters
echo 1 | tee /sys/devices/system/cpu/cpu*/cpufreq/stats/reset >/dev/null

echo done

sleep 1
# print the frequency statistics to confirm the setting
echo $(cpufreq-info | grep stats | cut -d ' ' -f 23-25)
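
To double-check the result of the commands above, the same sysfs files can be read back. The following Python sketch is only an illustration: the function names are hypothetical, and it assumes that each `cpuN/cpufreq` directory exposes `scaling_governor`, `scaling_min_freq` and `scaling_max_freq`, as used in the script above.

```python
from pathlib import Path


def read_cpufreq_settings(cpu_root="/sys/devices/system/cpu"):
    """Return governor and min/max frequency (in kHz) for every CPU found."""
    settings = {}
    for cpufreq_dir in sorted(Path(cpu_root).glob("cpu[0-9]*/cpufreq")):
        settings[cpufreq_dir.parent.name] = {
            "governor": (cpufreq_dir / "scaling_governor").read_text().strip(),
            "min_khz": int((cpufreq_dir / "scaling_min_freq").read_text()),
            "max_khz": int((cpufreq_dir / "scaling_max_freq").read_text()),
        }
    return settings


def is_frequency_pinned(settings, khz=1_500_000):
    """True if every CPU uses the performance governor with min == max == khz."""
    return bool(settings) and all(
        s["governor"] == "performance" and s["min_khz"] == khz == s["max_khz"]
        for s in settings.values()
    )
```

Calling `is_frequency_pinned(read_cpufreq_settings())` on the target should return `True` after the setup; the `cpu_root` parameter exists only so the check can be tried against a copy of the directory tree.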

Isolate CPUs

CPUs 2 and 3 are isolated to run the tests.

sudo apt install -y sysstat u-boot-tools

# modify kernel cmdline
cd ~
# extract the script text by skipping the first 72 bytes (the image header) of boot.scr
dd if=/boot/firmware/boot.scr of=boot.script bs=72 skip=1

# edit boot.script and modify bootargs to
ubuntu@ubuntu:~$ cat boot.script | grep "setenv bootargs" | head -1
setenv bootargs " ${bootargs} rcu_nocbs=2,3 nohz_full=2,3 isolcpus=2,3 irqaffinity=0,1 audit=0 watchdog=0 skew_tick=1 quiet splash"

# generate boot.scr
mkimage -A arm64 -O linux -T script -C none -d boot.script boot.scr

# replace boot.scr
sudo cp boot.scr /boot/firmware/boot.scr

sudo reboot

# check cmdline
ubuntu@ubuntu:~$ cat /proc/cmdline
 coherent_pool=1M 8250.nr_uarts=1 snd_bcm2835.enable_compat_alsa=0 snd_bcm2835.enable_hdmi=1 bcm2708_fb.fbwidth=0 bcm2708_fb.fbheight=0 bcm2708_fb.fbswap=1 smsc95xx.macaddr=DC:A6:32:2E:54:97 vc_mem.mem_base=0x3ec00000 vc_mem.mem_size=0x40000000  net.ifnames=0 dwc_otg.lpm_enable=0 console=ttyS0,115200 console=tty1 root=LABEL=writable rootfstype=ext4 elevator=deadline rootwait fixrtc rcu_nocbs=2,3 nohz_full=2,3 isolcpus=2,3 irqaffinity=0,1 audit=0 watchdog=0 skew_tick=1 quiet splash

# check interrupts
# expected: only the interrupt counts for CPUs 0 and 1 increase
watch -n1 cat /proc/interrupts

# check soft interrupts
watch -n1 cat /proc/softirqs

# check isolated CPUs
cat /sys/devices/system/cpu/isolated
2-3
cat /sys/devices/system/cpu/present
0-3

# run reference system on CPU2
taskset -c 2 install/autoware_reference_system/lib/autoware_reference_system/autoware_default_singlethreaded > /dev/null

# get the pid (from a second terminal) and show the CPUs the process is allowed to run on
RF_PID=$(pidof autoware_default_singlethreaded) && grep ^Cpu /proc/$RF_PID/status

# check how many threads are running
ps -aL | grep $RF_PID
   3835    3835 ttyS0    00:03:46 autoware_defaul
   3835    3836 ttyS0    00:00:00 autoware_defaul
   3835    3837 ttyS0    00:00:00 autoware_defaul
   3835    3838 ttyS0    00:00:00 autoware_defaul
   3835    3839 ttyS0    00:00:00 gc
   3835    3840 ttyS0    00:00:00 dq.builtins
   3835    3841 ttyS0    00:00:00 dq.user
   3835    3842 ttyS0    00:00:00 tev
   3835    3843 ttyS0    00:00:00 recv
   3835    3844 ttyS0    00:00:00 recvMC
   3835    3845 ttyS0    00:00:00 recvUC
   3835    3846 ttyS0    00:00:00 autoware_defaul
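
The checks above can also be scripted. The sketch below parses CPU lists such as the `2-3` shown by `/sys/devices/system/cpu/isolated` and the `isolcpus=2,3` argument from `/proc/cmdline`, then compares a process's affinity against the isolated set. All function names are hypothetical, and the parser assumes plain lists of CPU numbers and ranges, without flags or stride syntax.

```python
import os


def parse_cpu_list(text):
    """Parse a CPU list such as "2-3" or "0,2-3" into a set of integers."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # an empty file means that no CPU is listed
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def cmdline_cpu_param(cmdline, name):
    """Return the CPU set of a kernel argument like isolcpus=2,3 (empty if absent)."""
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key == name and sep:
            return parse_cpu_list(value)
    return set()


def runs_only_on_isolated(pid, isolated_path="/sys/devices/system/cpu/isolated"):
    """True if the process may only be scheduled on isolated CPUs (Linux only)."""
    with open(isolated_path) as f:
        isolated = parse_cpu_list(f.read())
    return os.sched_getaffinity(pid) <= isolated
```

With the setup from this section, `cmdline_cpu_param(open("/proc/cmdline").read(), "isolcpus")` should yield `{2, 3}`, and `runs_only_on_isolated(pid)` should hold for a process started with `taskset -c 2`.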

Hints

If you run colcon build on a Raspberry Pi 4 with little memory, use export MAKEFLAGS="-j 1" to inhibit parallelism. Otherwise, the system could hang due to memory swapping.
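
If the build is started from a script rather than an interactive shell, the same hint can be applied by passing `MAKEFLAGS` through the environment. This is a minimal sketch with hypothetical helper names; it assumes `colcon` is on the `PATH` and is run from the workspace root.

```python
import os
import subprocess


def make_build_env(jobs=1, base_env=None):
    """Copy the environment and set MAKEFLAGS to limit make's parallel jobs."""
    env = dict(os.environ if base_env is None else base_env)
    env["MAKEFLAGS"] = f"-j {jobs}"
    return env


def run_colcon_build(jobs=1):
    """Run `colcon build` with the restricted MAKEFLAGS; raises on failure."""
    return subprocess.run(["colcon", "build"], env=make_build_env(jobs), check=True)
```

Copying the environment instead of modifying `os.environ` keeps the restriction local to the build command.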
