openwrt / docker

Docker containers of the ImageBuilder and SDK

Home Page: https://gitlab.com/openwrt/docker

unstable: high cpu usage by /sbin/urngd

lePereT opened this issue

Hi all, I'm getting a lot of instability on macOS Mojave, running Docker version 19.03.8 and docker-machine version 0.16.2.

If I just run the command from the README:

docker run --rm -it openwrtorg/rootfs

I get a number of error messages during launch:

rich$ docker run --rm -it openwrtorg/rootfs
Failed to resize receive buffer: Operation not permitted
ip: RTNETLINK answers: Operation not permitted
Press the [f] key and hit [enter] to enter failsafe mode
Press the [1], [2], [3] or [4] key and hit [enter] to select the debug level
ip: can't send flush request: Operation not permitted
ip: SIOCSIFFLAGS: Operation not permitted
Please press Enter to activate this console.

Once in the shell, it's sluggish, and I notice that one core of my CPU is pegged at 100%. Running top inside the container reveals the following:

Mem: 433964K used, 579256K free, 290552K shrd, 9536K buff, 323160K cached
CPU:  99% usr   0% sys   0% nic   0% idle   0% io   0% irq   0% sirq
Load average: 0.99 0.58 0.24 2/163 817
  PID  PPID USER     STAT   VSZ %VSZ %CPU COMMAND
   92     1 root     R      780   0% 100% /sbin/urngd
  279     1 root     S     1300   0%   0% /sbin/rpcd -s /var/run/ubus.sock -t 30
  434     1 root     S     1196   0%   0% /sbin/netifd
    1     0 root     S     1116   0%   0% /sbin/procd
   76     1 root     S     1084   0%   0% /bin/ash --login

Am I doing something wrong?

Thanks for the report, I've never touched urngd but maybe @ynezz has a clue...

So, running killall /sbin/urngd once terminal access is gained appears to make urngd behave. Not ideal. Also, what are the following error messages about:

Failed to resize receive buffer: Operation not permitted
ip: RTNETLINK answers: Operation not permitted
...
ip: can't send flush request: Operation not permitted
ip: SIOCSIFFLAGS: Operation not permitted

Just to confirm: the problem persists with an Ubuntu 18.04 VM as the host.

Mem: 865520K used, 143284K free, 984K shrd, 34440K buff, 579772K cached
CPU:  99% usr   0% sys   0% nic   0% idle   0% io   0% irq   0% sirq
Load average: 0.39 0.11 0.04 4/154 711
  PID  PPID USER     STAT   VSZ %VSZ %CPU COMMAND
   91     1 root     R      776   0%  99% /sbin/urngd
  444     1 root     S     1208   0%   0% /sbin/netifd
    1     0 root     S     1176   0%   0% /sbin/procd

I can't reproduce the error. Did you try to reproduce it on other machines?

I can't reproduce that error, even on Ubuntu 18.04 (but with a 5.6.7 kernel). It would help to get strace output from urngd while it's in this state. It should be as easy as running

opkg update; opkg install strace; strace --no-abbrev --attach $(pidof urngd)

inside a container spawned with

docker run --cap-add SYS_PTRACE --rm -it openwrtorg/rootfs

I'll attempt to do this in the next week or so. I'll close the issue for now to prevent noise :) Thanks to you both for your responses.

I would like to reopen this issue.

I am running into the same bug when OpenWrt runs in a Docker container that does not allow the RNDADDENTROPY ioctl on /dev/random.

This causes a busy loop with high CPU usage: the write poll event on /dev/random keeps triggering but can never be satisfied, because the entropy the kernel is asking for can never be added.

Should I provide a possible fix? I would simply stop the polling for a certain amount of time in case RNDADDENTROPY fails.
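
To illustrate the mechanism, here is a minimal, self-contained C sketch of what urngd effectively does (this is not the urngd source; the entropy payload and counts are dummies for illustration): poll() reports /dev/random as writable whenever the kernel's entropy estimate is low, and without CAP_SYS_ADMIN the RNDADDENTROPY ioctl fails with EPERM, so the pool never refills and the write event fires again immediately.

#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/random.h>

int main(void)
{
    /* Dummy entropy payload; the real urngd fills this from its
       jitter-entropy collector. */
    union {
        struct rand_pool_info info;
        unsigned char storage[sizeof(struct rand_pool_info) + 32];
    } e;

    memset(e.storage, 0x42, sizeof(e.storage)); /* fake "entropy" bytes */
    e.info.entropy_count = 64;                  /* bits we claim to add */
    e.info.buf_size = 32;                       /* bytes of data in e.info.buf */

    int fd = open("/dev/random", O_RDWR);
    if (fd < 0)
        return 1;

    struct pollfd pfd = { .fd = fd, .events = POLLOUT };

    while (poll(&pfd, 1, -1) > 0) {             /* pool low -> writable */
        if (ioctl(fd, RNDADDENTROPY, &e.info) < 0) {
            /* Without CAP_SYS_ADMIN this fails with EPERM; the pool never
               refills, POLLOUT fires again at once, and the loop spins at
               100% CPU, which matches the behaviour reported above. */
            fprintf(stderr, "RNDADDENTROPY: %s\n", strerror(errno));
        }
    }
    return 0;
}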

I have the same issue with an Ubuntu 18.04 VM and OpenWrt (19.07.2) in the Docker container.

@thg2k please provide a fix

@aparcar I did, but it was refused by the maintainer.

http://lists.openwrt.org/pipermail/openwrt-devel/2021-January/033587.html

It is indeed a very bad workaround, but it solves the problem without causing any regressions and it's easy to audit. A better fix would be to use uloop timers and improve logging, but I have no interest in spending more time on this. It is still a fix, and I recommend merging it.
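
For reference, a rough sketch of what that uloop-timer approach could look like, using libubox's uloop API. The names feed_entropy, rnd_cb, retry_cb and RETRY_MS are made up here, and this is not the patch from the mailing list:

#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/random.h>
#include <libubox/uloop.h>

#define RETRY_MS (30 * 1000)                   /* arbitrary backoff period */

static struct uloop_fd rnd_fd;                 /* watches /dev/random for POLLOUT */
static struct uloop_timeout retry_timer;

/* Hypothetical stand-in for urngd's entropy feeding: push (dummy) data into
   the kernel pool via RNDADDENTROPY; returns the ioctl result. */
static int feed_entropy(int fd)
{
    static union {
        struct rand_pool_info info;
        unsigned char storage[sizeof(struct rand_pool_info) + 32];
    } e;

    memset(e.storage, 0x42, sizeof(e.storage));
    e.info.entropy_count = 64;
    e.info.buf_size = 32;
    return ioctl(fd, RNDADDENTROPY, &e.info);
}

static void retry_cb(struct uloop_timeout *t)
{
    /* Backoff elapsed: start watching /dev/random again. */
    uloop_fd_add(&rnd_fd, ULOOP_WRITE);
}

static void rnd_cb(struct uloop_fd *u, unsigned int events)
{
    if (feed_entropy(u->fd) < 0) {
        /* The kernel refused the ioctl (e.g. no CAP_SYS_ADMIN in the
           container): stop polling and retry later instead of spinning on
           an always-ready write event. */
        uloop_fd_delete(u);
        uloop_timeout_set(&retry_timer, RETRY_MS);
    }
}

int main(void)
{
    uloop_init();
    rnd_fd.fd = open("/dev/random", O_RDWR);
    rnd_fd.cb = rnd_cb;
    retry_timer.cb = retry_cb;
    uloop_fd_add(&rnd_fd, ULOOP_WRITE);
    uloop_run();
    uloop_done();
    return 0;
}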

I got this problem on my MT7621 router too; maybe there is something wrong in the source code.

I ran into this same problem when using PVE (Proxmox VE) to run OpenWrt in a Linux container (LXC). According to the random(4) Linux manual page, the CAP_SYS_ADMIN capability is required for almost all of the related ioctl requests.

I had included the default OpenWrt config file (same as this lxc-template), which contains lxc.cap.drop = sys_admin. I removed this line and /sbin/urngd no longer pegs my CPU.

I think there is also a way to grant the SYS_ADMIN capability to a Docker container, but that capability is heavily overloaded, so the decision is yours.
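
For what it's worth, on the Docker side that would presumably just be adding --cap-add SYS_ADMIN to the README command quoted earlier, e.g.

docker run --cap-add SYS_ADMIN --rm -it openwrtorg/rootfs

(untested here, and as noted, SYS_ADMIN grants far more than the random-pool ioctls).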

Moreover, it seems that just uninstalling the urngd package could also solve this problem, but I'm not sure about the side effects.
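
(For reference, that would presumably be opkg remove urngd, or stopping the service with /etc/init.d/urngd stop && /etc/init.d/urngd disable if the package ships the usual init script; I haven't verified either.)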

I ran into this problem today on a Linksys WRT1900ACS with an uptime of 248 days, running:

~# cat /etc/openwrt_release 
DISTRIB_ID='OpenWrt'
DISTRIB_RELEASE='21.02.0'
DISTRIB_REVISION='r16279-5cc0535800'
DISTRIB_TARGET='mvebu/cortexa9'
DISTRIB_ARCH='arm_cortex-a9_vfpv3-d16'
DISTRIB_DESCRIPTION='OpenWrt 21.02.0 r16279-5cc0535800'
DISTRIB_TAINTS=''

Suddenly at around 1am my load jumped.
[screenshot: Screenshot_2022-06-27_11-45-53]

Killing urngd helped. But restarting it brought the load back up again. So, now I've killed urngd without restarting it. I will keep the system up to see if there are any impacts of having urngd stopped.

What, by the way, could be using urngd? Maybe those processes just need a restart. Perhaps dnsmasq? Anything else? Does OLSRd or babeld use urngd?

It looks like I am also seeing this on a TP-Link Archer C7 v2.

root@foobar:~# cat /etc/openwrt_release
DISTRIB_ID='OpenWrt'
DISTRIB_RELEASE='19.07.2'
DISTRIB_REVISION='r10947-65030d81f3'
DISTRIB_TARGET='ar71xx/generic'
DISTRIB_ARCH='mips_24kc'
DISTRIB_DESCRIPTION='OpenWrt 19.07.2 r10947-65030d81f3'
DISTRIB_TAINTS=''
