Runaway LWPs/threads on recording daemon
abalashov opened this issue · comments
I am running RTPEngine mr6.5.4.2 built from source on EL7, plus recording-daemon from the same suite. libav*
dependencies come from the nux-dextop
repo. RTPEngine is writing frames into the /proc
sink (--recording-method=proc
) and the recording daemon is writing out mixed mono WAVs, with file
-only metadata, no DB, and all in all the following invocation options:
/usr/local/sbin/rtpengine-recording \
--spool-dir=/recordings \
--output-storage=file \
--output-dir=/recordings \
--output-format=wav \
--output-mixed \
--pidfile=/var/run/rtpengine-recording.pid
What I am seeing is runaway growth in the number of worker threads spawned by the recording daemon, wildly disproportionate to the number of RTPEngine targets:
# cat /proc/rtpengine/0/status
Refcount: 1
Control PID: 3131
Targets: 72
# ps aux | grep -i rtpengine-rec
root 8635 19.4 5.3 4172872 416356 ? Sl 18:48 18:03 /usr/local/sbin/rtpengine-recording --spool-dir=/recordings --output-storage=file --output-dir=/recordings --output-format=wav --output-mixed --pidfile=/var/run/rtpengine-recording.pid
root 25573 0.0 0.0 112712 996 pts/0 S+ 20:21 0:00 grep --color=auto -i rtpengine-rec
# ps -p 8635 -lfT | wc -l
418
Almost all of them appear to be in a futex
state, so I assume some sort of deadlock, e.g.
1 S root 8635 25622 1 0 80 0 - 1047316 futex_ 20:22 ? 00:00:00 /usr/local/sbin/rtpengine-recording --spool-dir=/recordings --output-storage=file --output-dir=/recordings --output-format=wav --output-mixed --pidfile=/var/run/rtpengine-rec
1 S root 8635 25623 1 0 80 0 - 1047316 futex_ 20:22 ? 00:00:00 /usr/local/sbin/rtpengine-recording --spool-dir=/recordings --output-storage=file --output-dir=/recordings --output-format=wav --output-mixed --pidfile=/var/run/rtpengine-rec
1 S root 8635 25625 1 0 80 0 - 1047316 futex_ 20:22 ? 00:00:00 /usr/local/sbin/rtpengine-recording --spool-dir=/recordings --output-storage=file --output-dir=/recordings --output-format=wav --output-mixed --pidfile=/var/run/rtpengine-rec
The way this issue was detected is that the recording daemon started complaining about running into file descriptor limits ("Too many open files" error), which struck me as curious given the relatively small number of concurrent streams recorded and the fact that the recording daemon is running as EUID/EGID root
.
However, what I have found is that every one of those LWPs has several hundred open descriptors. For instance, PID 8635 above:
# cd /proc/8635/fd
# ls -w 5 | wc -l
291
This seems to be the story with all the LWPs:
# ps -p 8635 -fT | awk '{print $3}' | while read THIS_PID; do echo -n "$THIS_PID: "; find "/proc/$THIS_PID/fd" | wc -l; done
SPID: find: ‘/proc/SPID/fd’: No such file or directory
0
8635: 284
8636: 284
8637: 284
8638: 284
8639: 284
8640: 284
[... same all the way down the line ...]
Since the descriptor count is exactly the same across all the LWPs, I assume this is because they are cloned into every LWP. But regardless, it contributes to a rather large cumulative descriptor count across all the LWPs for that process:
# ps -p 8635 -fT | awk '{print $3}' | while read THIS_PID; do echo -n "$THIS_PID: "; find "/proc/$THIS_PID/fd" | wc -l; done | awk '{print $2}' | awk 'BEGIN { sum = 0 } { sum += $1 } END { print sum }'
find: ‘/proc/SPID/fd’: No such file or directory
110826
The number of LWPs steadily increases. We found it at a peak of 1200 before restarting the recording daemon. At that point, we seem to have bumped into the system-wide FD limit:
# cat /proc/sys/fs/file-max
763006
This situation appears to play out regardless of whether the recording daemon is invoked with a certain number of --num-threads=...
explicitly, or left at the defaults (as now).
There is nothing interesting in the logs (until the "Too many open files" messages start). Just fairly routine things like:
INFO: [C 2fcb0ec6-e8ef-4e84-8fda-163e9ac7626d-94e8e401f7b2a8ed.meta] [S tag-1-media-1-component-2-RTCP-id-1] EOF on stream tag-1-media-1-component-2-RTCP-id-1
And:
WARNING: [C 63b7c294-a546-4623-a034-6d2b26f54cc3-63ab5712488d039b.meta] [S tag-0-media-1-component-1-RTP-id-2] [0x554f12d] Cannot decode RTP payload type 101 (telephone-event/8000)
Since only a static number of poller threads is created in main()
:
for (int i = 0; i < num_threads; i++)
start_poller_thread();
a reasonable supposition for the runaway thread growth is that some or all of the libraries used by the recording daemon do their work in threads of their own, and that these threads are not exiting properly:
linux-vdso.so.1 => (0x00007fffc19ee000)
libm.so.6 => /lib64/libm.so.6 (0x00007f9afe96f000)
libglib-2.0.so.0 => /lib64/libglib-2.0.so.0 (0x00007f9afe659000)
libgthread-2.0.so.0 => /lib64/libgthread-2.0.so.0 (0x00007f9afe457000)
libavcodec.so.56 => /lib64/libavcodec.so.56 (0x00007f9afd1fd000)
libavformat.so.56 => /lib64/libavformat.so.56 (0x00007f9afce2c000)
libavutil.so.54 => /lib64/libavutil.so.54 (0x00007f9afcbca000)
libswresample.so.1 => /lib64/libswresample.so.1 (0x00007f9afc9af000)
libavfilter.so.5 => /lib64/libavfilter.so.5 (0x00007f9afc63a000)
libmysqlclient.so.18 => /usr/lib64/mysql/libmysqlclient.so.18 (0x00007f9afc13a000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f9afbf1e000)
libz.so.1 => /lib64/libz.so.1 (0x00007f9afbd08000)
libdl.so.2 => /lib64/libdl.so.2 (0x00007f9afbb04000)
libssl.so.10 => /lib64/libssl.so.10 (0x00007f9afb892000)
libcrypto.so.10 => /lib64/libcrypto.so.10 (0x00007f9afb430000)
libc.so.6 => /lib64/libc.so.6 (0x00007f9afb063000)
/lib64/ld-linux-x86-64.so.2 (0x00007f9afec71000)
libpcre.so.1 => /lib64/libpcre.so.1 (0x00007f9afae01000)
libva.so.1 => /lib64/libva.so.1 (0x00007f9afabe1000)
libxvidcore.so.4 => /lib64/libxvidcore.so.4 (0x00007f9afa8ce000)
libx265.so.79 => /lib64/libx265.so.79 (0x00007f9afa37d000)
libx264.so.142 => /lib64/libx264.so.142 (0x00007f9afa009000)
libvorbisenc.so.2 => /lib64/libvorbisenc.so.2 (0x00007f9af9b3a000)
libvorbis.so.0 => /lib64/libvorbis.so.0 (0x00007f9af990d000)
libvo-amrwbenc.so.0 => /lib64/libvo-amrwbenc.so.0 (0x00007f9af96f3000)
libtheoraenc.so.1 => /lib64/libtheoraenc.so.1 (0x00007f9af94c6000)
libtheoradec.so.1 => /lib64/libtheoradec.so.1 (0x00007f9af92b6000)
libspeex.so.1 => /lib64/libspeex.so.1 (0x00007f9af909d000)
libschroedinger-1.0.so.0 => /lib64/libschroedinger-1.0.so.0 (0x00007f9af8dd1000)
libopus.so.0 => /lib64/libopus.so.0 (0x00007f9af8b8f000)
libopenjpeg.so.1 => /lib64/libopenjpeg.so.1 (0x00007f9af896b000)
libopencore-amrwb.so.0 => /lib64/libopencore-amrwb.so.0 (0x00007f9af8757000)
libopencore-amrnb.so.0 => /lib64/libopencore-amrnb.so.0 (0x00007f9af852d000)
libmp3lame.so.0 => /lib64/libmp3lame.so.0 (0x00007f9af82b4000)
libgsm.so.1 => /lib64/libgsm.so.1 (0x00007f9af80a8000)
libfdk-aac.so.1 => /lib64/libfdk-aac.so.1 (0x00007f9af7df4000)
libgmp.so.10 => /lib64/libgmp.so.10 (0x00007f9af7b7c000)
libgnutls.so.28 => /lib64/libgnutls.so.28 (0x00007f9af7842000)
libbz2.so.1 => /lib64/libbz2.so.1 (0x00007f9af7632000)
libsoxr.so.0 => /lib64/libsoxr.so.0 (0x00007f9af73cf000)
libswscale.so.3 => /lib64/libswscale.so.3 (0x00007f9af7148000)
libpostproc.so.53 => /lib64/libpostproc.so.53 (0x00007f9af6f2a000)
libavresample.so.2 => /lib64/libavresample.so.2 (0x00007f9af6d0b000)
libfreetype.so.6 => /lib64/libfreetype.so.6 (0x00007f9af6a4c000)
libass.so.5 => /lib64/libass.so.5 (0x00007f9af681c000)
libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007f9af6515000)
libgssapi_krb5.so.2 => /lib64/libgssapi_krb5.so.2 (0x00007f9af62c8000)
libkrb5.so.3 => /lib64/libkrb5.so.3 (0x00007f9af5fdf000)
libcom_err.so.2 => /lib64/libcom_err.so.2 (0x00007f9af5ddb000)
libk5crypto.so.3 => /lib64/libk5crypto.so.3 (0x00007f9af5ba8000)
librt.so.1 => /lib64/librt.so.1 (0x00007f9af59a0000)
libnuma.so.1 => /lib64/libnuma.so.1 (0x00007f9af5794000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f9af557e000)
libogg.so.0 => /usr/lib64/libogg.so.0 (0x00007f9af5377000)
liborc-0.4.so.0 => /lib64/liborc-0.4.so.0 (0x00007f9af50f3000)
libp11-kit.so.0 => /lib64/libp11-kit.so.0 (0x00007f9af4dc4000)
libtasn1.so.6 => /lib64/libtasn1.so.6 (0x00007f9af4bb1000)
libnettle.so.4 => /lib64/libnettle.so.4 (0x00007f9af4980000)
libhogweed.so.2 => /lib64/libhogweed.so.2 (0x00007f9af4759000)
libgomp.so.1 => /lib64/libgomp.so.1 (0x00007f9af4533000)
libpng15.so.15 => /lib64/libpng15.so.15 (0x00007f9af4308000)
libfribidi.so.0 => /lib64/libfribidi.so.0 (0x00007f9af40ec000)
libfontconfig.so.1 => /lib64/libfontconfig.so.1 (0x00007f9af3eaa000)
libharfbuzz.so.0 => /lib64/libharfbuzz.so.0 (0x00007f9af3c0d000)
libkrb5support.so.0 => /lib64/libkrb5support.so.0 (0x00007f9af39fd000)
libkeyutils.so.1 => /lib64/libkeyutils.so.1 (0x00007f9af37f9000)
libresolv.so.2 => /lib64/libresolv.so.2 (0x00007f9af35e0000)
libffi.so.6 => /lib64/libffi.so.6 (0x00007f9af33d8000)
libexpat.so.1 => /lib64/libexpat.so.1 (0x00007f9af31ae000)
libuuid.so.1 => /lib64/libuuid.so.1 (0x00007f9af2fa9000)
libgraphite2.so.3 => /lib64/libgraphite2.so.3 (0x00007f9af2d7b000)
libselinux.so.1 => /lib64/libselinux.so.1 (0x00007f9af2b54000)
But, I have no way of identifying which library might be the problem.
One thing I did try was to remove mysql_thread_init()
from the poller_thread()
invocation in epoll.c
(and the corresponding mysql_thread_end()
), since we are not using MySQL at all for storage. However, this did not seem to have any effect.
One thing I did learn from attaching gdb
to a random selection of these LWPs is that they're all threads spawned by libavfilter
:
# gdb attach 13775
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-114.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
attach: No such file or directory.
Attaching to process 13775
warning: process 13775 is a cloned process
Reading symbols from /usr/local/sbin/rtpengine-recording...done.
[spam elided]
(gdb) where
#0 0x00007f98dd438965 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007f98ddbc33ab in worker () from /lib64/libavfilter.so.5
#2 0x00007f98dd434dd5 in start_thread () from /lib64/libpthread.so.0
#3 0x00007f98dc66fead in clone () from /lib64/libc.so.6
Here are the ffmpeg library versions in use:
# rpm -qa | grep -i ffmpeg
ffmpeg-2.8.15-2.el7.nux.x86_64
ffmpeg-devel-2.8.15-2.el7.nux.x86_64
ffmpeg-libs-2.8.15-2.el7.nux.x86_64
I suppose one thing I haven't tried is sourcing them from a place other than nux-dextop
. But some quick Googling suggested that this is the 'canonical' way to install them on EL7/CentOS 7, e.g.
https://linuxize.com/post/how-to-install-ffmpeg-on-centos-7/
One thing I am trying now is a much newer version of the ffmpeg packages from something called awel-media-release
:
http://awel.domblogger.net/7/media/x86_64/repoview/awel-media-release.html
Much newer versions:
# rpm -qa | grep -i ffmpeg
ffmpeg-devel-3.4.2-1.el7_5.awel.0.x86_64
ffmpeg-3.4.2-1.el7_5.awel.0.x86_64
ffmpeg-libs-3.4.2-1.el7_5.awel.0.x86_64
So far it is looking promising, but as it is after business hours call volumes have collapsed, so I can't really get truly meaningful feedback until tomorrow possibly.
So far, I've got this situation:
Refcount: 1
Control PID: 3131
Targets: 9
And about 30 LWPs. That number drops to 28 or 26 from time to time, spikes to 32 or so. Doesn't seem to be moving much beyond this level, but neither do the call volumes.
Is there any insight on the relationship between the threads spawned by the recording daemon and the call volumes? It's very difficult to tell if the upgrade of the ffmpeg
libs fixed the problem or if the low call volumes after hours are merely masking the same problem. About the only thing that's different is that there isn't the same all-but-monotonic upward increase as before...
I wasn't aware that libavfilter (or ffmpeg libs in general) would spawn any threads. There's certainly nothing in the code that would instruct it to do that. Gonna have to look into what it's doing there.
Just as a data point following the ffmpeg libs update 👍
It is now after 21:00 here, well outside of business hours, and there are no calls --
# cat /proc/rtpengine/0/status
Refcount: 1
Control PID: 3131
Targets: 0
There have not been any for quite some time. Yet, there are 38 LWPs spawned off of rtpengine-recording
:
# ps -p `pidof rtpengine-recording` -lfT | wc -l
38
The recording daemon was invoked without --num-threads
value, so it started with a default of 10. Since the last time the recording daemon was restarted, there has been a maximum of about 16 or 18 RTPEngine targets, and the the LWP count has crept up from 10 to about 32, then back down to 28, then back up to 30, and generally hovering somewhere in this area.
Here is the state of the 38 processes:
# ps -p `pidof rtpengine-recording` -lfT | awk '{print $1" "$12" "$16}'
F WCHAN CMD
1 do_sig /usr/local/sbin/rtpengine-recording
1 ep_pol /usr/local/sbin/rtpengine-recording
1 ep_pol /usr/local/sbin/rtpengine-recording
1 ep_pol /usr/local/sbin/rtpengine-recording
1 ep_pol /usr/local/sbin/rtpengine-recording
1 ep_pol /usr/local/sbin/rtpengine-recording
1 ep_pol /usr/local/sbin/rtpengine-recording
1 ep_pol /usr/local/sbin/rtpengine-recording
1 ep_pol /usr/local/sbin/rtpengine-recording
1 futex_ /usr/local/sbin/rtpengine-recording
1 futex_ /usr/local/sbin/rtpengine-recording
1 futex_ /usr/local/sbin/rtpengine-recording
1 futex_ /usr/local/sbin/rtpengine-recording
1 futex_ /usr/local/sbin/rtpengine-recording
1 futex_ /usr/local/sbin/rtpengine-recording
1 futex_ /usr/local/sbin/rtpengine-recording
1 futex_ /usr/local/sbin/rtpengine-recording
1 futex_ /usr/local/sbin/rtpengine-recording
1 futex_ /usr/local/sbin/rtpengine-recording
1 futex_ /usr/local/sbin/rtpengine-recording
1 futex_ /usr/local/sbin/rtpengine-recording
1 futex_ /usr/local/sbin/rtpengine-recording
1 futex_ /usr/local/sbin/rtpengine-recording
1 futex_ /usr/local/sbin/rtpengine-recording
1 futex_ /usr/local/sbin/rtpengine-recording
1 futex_ /usr/local/sbin/rtpengine-recording
1 futex_ /usr/local/sbin/rtpengine-recording
1 futex_ /usr/local/sbin/rtpengine-recording
1 futex_ /usr/local/sbin/rtpengine-recording
1 futex_ /usr/local/sbin/rtpengine-recording
1 futex_ /usr/local/sbin/rtpengine-recording
1 futex_ /usr/local/sbin/rtpengine-recording
1 futex_ /usr/local/sbin/rtpengine-recording
1 futex_ /usr/local/sbin/rtpengine-recording
1 futex_ /usr/local/sbin/rtpengine-recording
1 futex_ /usr/local/sbin/rtpengine-recording
Attaching to a process at random:
(gdb) where
#0 0x00007f9453d15965 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007f9454affcbe in thread_worker () from /lib64/libavutil.so.55
#2 0x00007f9453d11dd5 in start_thread () from /lib64/libpthread.so.0
#3 0x00007f9452f4d02d in clone () from /lib64/libc.so.6
The same seems to be true of the others which are in the futex_
state.
I guess a key question is: given the zero call load, and no calls in the sink...
# pwd
/proc/rtpengine/0/calls
# ls
#
Overall, the state is:
# strace -fp 28115
strace: Process 28115 attached with 37 threads
[pid 26802] futex(0x7f942c01f844, FUTEX_WAIT_PRIVATE, 3, NULL <unfinished ...>
[pid 26801] futex(0x7f942c01f7d4, FUTEX_WAIT_PRIVATE, 3, NULL <unfinished ...>
[pid 20644] futex(0x7f94240ff1a4, FUTEX_WAIT_PRIVATE, 3, NULL <unfinished ...>
[pid 20643] futex(0x7f94240ff134, FUTEX_WAIT_PRIVATE, 3, NULL <unfinished ...>
[pid 15791] futex(0x7f94440f63e4, FUTEX_WAIT_PRIVATE, 3, NULL <unfinished ...>
[pid 15790] futex(0x7f94440f6374, FUTEX_WAIT_PRIVATE, 3, NULL <unfinished ...>
[pid 15202] futex(0x7f943c09c464, FUTEX_WAIT_PRIVATE, 3, NULL <unfinished ...>
[pid 15201] futex(0x7f943c09c3f4, FUTEX_WAIT_PRIVATE, 3, NULL <unfinished ...>
[pid 13898] futex(0x7f942c0031a4, FUTEX_WAIT_PRIVATE, 3, NULL <unfinished ...>
[pid 13897] futex(0x7f942c003134, FUTEX_WAIT_PRIVATE, 3, NULL <unfinished ...>
[pid 9260] futex(0x7f94380242a4, FUTEX_WAIT_PRIVATE, 3, NULL <unfinished ...>
[pid 9259] futex(0x7f9438024234, FUTEX_WAIT_PRIVATE, 3, NULL <unfinished ...>
[pid 32658] futex(0x7f943c0c9384, FUTEX_WAIT_PRIVATE, 3, NULL <unfinished ...>
[pid 32657] futex(0x7f943c0c9314, FUTEX_WAIT_PRIVATE, 3, NULL <unfinished ...>
[pid 31874] futex(0x7f9424006204, FUTEX_WAIT_PRIVATE, 3, NULL <unfinished ...>
[pid 31873] futex(0x7f9424006194, FUTEX_WAIT_PRIVATE, 3, NULL <unfinished ...>
[pid 31847] futex(0x7f9420032a44, FUTEX_WAIT_PRIVATE, 3, NULL <unfinished ...>
[pid 31846] futex(0x7f94200329d4, FUTEX_WAIT_PRIVATE, 3, NULL <unfinished ...>
[pid 31196] futex(0x7f94301a3664, FUTEX_WAIT_PRIVATE, 3, NULL <unfinished ...>
[pid 31195] futex(0x7f94301a35f4, FUTEX_WAIT_PRIVATE, 3, NULL <unfinished ...>
[pid 30782] futex(0x7f942c01d0a4, FUTEX_WAIT_PRIVATE, 3, NULL <unfinished ...>
[pid 30781] futex(0x7f942c01d034, FUTEX_WAIT_PRIVATE, 3, NULL <unfinished ...>
[pid 28218] futex(0x7f943c00c684, FUTEX_WAIT_PRIVATE, 3, NULL <unfinished ...>
[pid 28217] futex(0x7f943c00c614, FUTEX_WAIT_PRIVATE, 3, NULL <unfinished ...>
[pid 28122] epoll_wait(5, <unfinished ...>
[pid 28121] epoll_wait(5, <unfinished ...>
[pid 28120] epoll_wait(5, <unfinished ...>
[pid 28119] epoll_wait(5, <unfinished ...>
[pid 28118] epoll_wait(5, <unfinished ...>
[pid 28116] epoll_wait(5, <unfinished ...>
[pid 28115] rt_sigtimedwait([INT TERM], NULL, NULL, 8 <unfinished ...>
[pid 29690] futex(0x7f94301a1564, FUTEX_WAIT_PRIVATE, 3, NULL <unfinished ...>
[pid 26629] futex(0x7f943401e734, FUTEX_WAIT_PRIVATE, 3, NULL <unfinished ...>
[pid 28117] epoll_wait(5, <unfinished ...>
[pid 26630] futex(0x7f943401e7a4, FUTEX_WAIT_PRIVATE, 3, NULL <unfinished ...>
[pid 29689] futex(0x7f94301a14f4, FUTEX_WAIT_PRIVATE, 3, NULL <unfinished ...>
[pid 28123] epoll_wait(5, ^Cstrace: Process 28115 detached
... why are there 38 LWPs? Why aren't these processes being wound back down?
Another interesting wrinkle -- it looks like the core rtpengine-recording
process is holding a number of file handles open for calls which are long over:
# lsof -p 28115
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
rtpengine 28115 root cwd DIR 259,1 242 64 /
rtpengine 28115 root rtd DIR 259,1 242 64 /
rtpengine 28115 root txt REG 259,1 635224 23382 /usr/local/sbin/rtpengine-recording
rtpengine 28115 root mem REG 259,1 61624 4195748 /usr/lib64/libnss_files-2.17.so
rtpengine 28115 root mem REG 259,1 155784 4305473 /usr/lib64/libselinux.so.1
rtpengine 28115 root mem REG 259,1 192728 4270323 /usr/lib64/libgraphite2.so.3.0.1
rtpengine 28115 root mem REG 259,1 20112 4195793 /usr/lib64/libuuid.so.1.3.0
rtpengine 28115 root mem REG 259,1 173320 4308009 /usr/lib64/libexpat.so.1.6.0
rtpengine 28115 root mem REG 259,1 15512 4270327 /usr/lib64/libXau.so.6.0.0
rtpengine 28115 root mem REG 259,1 32304 4307332 /usr/lib64/libffi.so.6.0.1
rtpengine 28115 root mem REG 259,1 19384 4307944 /usr/lib64/libgpg-error.so.0.10.0
rtpengine 28115 root mem REG 259,1 105824 4195760 /usr/lib64/libresolv-2.17.so
rtpengine 28115 root mem REG 259,1 15688 4307979 /usr/lib64/libkeyutils.so.1.5
rtpengine 28115 root mem REG 259,1 67104 4195787 /usr/lib64/libkrb5support.so.0.1
rtpengine 28115 root mem REG 259,1 179296 4517318 /usr/lib64/libpng15.so.15.13.0
rtpengine 28115 root mem REG 259,1 652984 4270325 /usr/lib64/libharfbuzz.so.0.10705.0
rtpengine 28115 root mem REG 259,1 276968 4258680 /usr/lib64/libfontconfig.so.1.11.1
rtpengine 28115 root mem REG 259,1 114176 4292710 /usr/lib64/libfribidi.so.0.4.0
rtpengine 28115 root mem REG 259,1 75848 4292657 /usr/lib64/libXext.so.6.4.0
rtpengine 28115 root mem REG 259,1 165976 4292651 /usr/lib64/libxcb.so.1.1.0
rtpengine 28115 root mem REG 259,1 160776 4308726 /usr/lib64/libhogweed.so.2.5
rtpengine 28115 root mem REG 259,1 201296 4308728 /usr/lib64/libnettle.so.4.7
rtpengine 28115 root mem REG 259,1 78056 4194398 /usr/lib64/libtasn1.so.6.5.3
rtpengine 28115 root mem REG 259,1 1261848 4307342 /usr/lib64/libp11-kit.so.0.3.0
rtpengine 28115 root mem REG 259,1 535064 4307338 /usr/lib64/libgcrypt.so.11.8.2
rtpengine 28115 root mem REG 259,1 28360 4258655 /usr/lib64/libogg.so.0.8.0
rtpengine 28115 root mem REG 259,1 88776 4241095 /usr/lib64/libgcc_s-4.8.5-20150702.so.1
rtpengine 28115 root mem REG 259,1 210824 4195779 /usr/lib64/libk5crypto.so.3.1
rtpengine 28115 root mem REG 259,1 15920 4252611 /usr/lib64/libcom_err.so.2.1
rtpengine 28115 root mem REG 259,1 967848 4195785 /usr/lib64/libkrb5.so.3.3
rtpengine 28115 root mem REG 259,1 320400 4271970 /usr/lib64/libgssapi_krb5.so.2.2
rtpengine 28115 root mem REG 259,1 991616 4600929 /usr/lib64/libstdc++.so.6.0.19
rtpengine 28115 root mem REG 259,1 795608 4517319 /usr/lib64/libfreetype.so.6.14.0
rtpengine 28115 root mem REG 259,1 200744 4292712 /usr/lib64/libass.so.5.3.2
rtpengine 28115 root mem REG 259,1 130688 4292839 /usr/lib64/libpostproc.so.54.7.100
rtpengine 28115 root mem REG 259,1 530552 4292843 /usr/lib64/libswscale.so.4.8.100
rtpengine 28115 root mem REG 259,1 15720 4292671 /usr/lib64/libvdpau.so.1.0.0
rtpengine 28115 root mem REG 259,1 1318800 4292655 /usr/lib64/libX11.so.6.3.0
rtpengine 28115 root mem REG 259,1 68192 4305577 /usr/lib64/libbz2.so.1.0.6
rtpengine 28115 root mem REG 259,1 1300504 4517312 /usr/lib64/libgnutls.so.28.43.3
rtpengine 28115 root mem REG 259,1 495712 4308038 /usr/lib64/libgmp.so.10.2.0
rtpengine 28115 root mem REG 259,1 120760 4292827 /usr/lib64/librtmp.so.1
rtpengine 28115 root mem REG 259,1 48848 4258670 /usr/lib64/libgsm.so.1.0.12
rtpengine 28115 root mem REG 259,1 314840 4292855 /usr/lib64/libmp3lame.so.0.0.0
rtpengine 28115 root mem REG 259,1 152104 4292714 /usr/lib64/libopenjpeg.so.1.5.1
rtpengine 28115 root mem REG 259,1 351424 4292723 /usr/lib64/libopus.so.0.6.1
rtpengine 28115 root mem REG 259,1 102832 4292728 /usr/lib64/libspeex.so.1.5.0
rtpengine 28115 root mem REG 259,1 65936 4270307 /usr/lib64/libtheoradec.so.1.1.4
rtpengine 28115 root mem REG 259,1 185720 4270309 /usr/lib64/libtheoraenc.so.1.1.2
rtpengine 28115 root mem REG 259,1 185280 4258660 /usr/lib64/libvorbis.so.0.4.6
rtpengine 28115 root mem REG 259,1 2944200 4258662 /usr/lib64/libvorbisenc.so.2.0.9
rtpengine 28115 root mem REG 259,1 1334664 4292809 /usr/lib64/libvpx.so.1.3.0
rtpengine 28115 root mem REG 259,1 1056944 4292826 /usr/lib64/libx264.so.152
rtpengine 28115 root mem REG 259,1 700656 4292675 /usr/lib64/libxvidcore.so.4.3
rtpengine 28115 root mem REG 259,1 113232 4292804 /usr/lib64/libcrystalhd.so.3.6
rtpengine 28115 root mem REG 259,1 402384 4304403 /usr/lib64/libpcre.so.1.2.0
rtpengine 28115 root mem REG 259,1 2151672 4271975 /usr/lib64/libc-2.17.so
rtpengine 28115 root mem REG 259,1 2516624 4305516 /usr/lib64/libcrypto.so.1.0.2k
rtpengine 28115 root mem REG 259,1 470360 4305519 /usr/lib64/libssl.so.1.0.2k
rtpengine 28115 root mem REG 259,1 19288 4549116 /usr/lib64/libdl-2.17.so
rtpengine 28115 root mem REG 259,1 90248 4305546 /usr/lib64/libz.so.1.2.7
rtpengine 28115 root mem REG 259,1 141968 4195758 /usr/lib64/libpthread-2.17.so
rtpengine 28115 root mem REG 259,1 3135712 4463089 /usr/lib64/mysql/libmysqlclient.so.18.0.0
rtpengine 28115 root mem REG 259,1 2606848 4292833 /usr/lib64/libavfilter.so.6.107.100
rtpengine 28115 root mem REG 259,1 120408 4292841 /usr/lib64/libswresample.so.2.9.100
rtpengine 28115 root mem REG 259,1 451152 4292837 /usr/lib64/libavutil.so.55.78.100
rtpengine 28115 root mem REG 259,1 2299880 4292835 /usr/lib64/libavformat.so.57.83.100
rtpengine 28115 root mem REG 259,1 12793112 4292829 /usr/lib64/libavcodec.so.57.107.100
rtpengine 28115 root mem REG 259,1 7016 4517251 /usr/lib64/libgthread-2.0.so.0.5600.1
rtpengine 28115 root mem REG 259,1 1156600 4327750 /usr/lib64/libglib-2.0.so.0.5600.1
rtpengine 28115 root mem REG 259,1 1137024 4549118 /usr/lib64/libm-2.17.so
rtpengine 28115 root mem REG 259,1 163400 4271992 /usr/lib64/ld-2.17.so
rtpengine 28115 root mem REG 259,1 26254 23351 /usr/lib64/gconv/gconv-modules.cache
rtpengine 28115 root 0r CHR 1,3 0t0 1031 /dev/null
rtpengine 28115 root 1w CHR 1,3 0t0 1031 /dev/null
rtpengine 28115 root 2w CHR 1,3 0t0 1031 /dev/null
rtpengine 28115 root 3r CHR 1,9 0t0 1036 /dev/urandom
rtpengine 28115 root 4u unix 0xffff889cd1eb6400 0t0 1718397 socket
rtpengine 28115 root 5u a_inode 0,10 0 6358 [eventpoll]
rtpengine 28115 root 6r a_inode 0,10 0 6358 inotify
rtpengine 28115 root 11w REG 259,1 10223694 50290819 /recordings/2019_07_15_21/182e3fda-dba8-4bf6-b9bf-280dc866d999-b5cb69e0bc8d64b3-mix.wav
rtpengine 28115 root 12w REG 259,1 18088014 50317350 /recordings/2019_07_15_22/c9741d47-3b5f-4152-a272-f01c1f7e9391-e2efe72a0067b4fc-mix.wav
rtpengine 28115 root 13w REG 259,1 9437262 50317360 /recordings/2019_07_15_22/b7b92598-d3b4-441c-91db-d85bc99c9fba-2910f679df3904b8-mix.wav
rtpengine 28115 root 16w REG 259,1 4718670 29361752 /recordings/2019_07_15_21/294da9bf-0e81-4740-9b73-5ab73f0f92ae-d18988d0ef76ce8b-mix.wav
rtpengine 28115 root 17w REG 259,1 11010126 50290837 /recordings/2019_07_15_21/78588313-360c-47c0-bfc2-fd44595404ba-dd8552bc1038e41f-mix.wav
rtpengine 28115 root 22w REG 259,1 9699406 50290836 /recordings/2019_07_15_21/70c9a7bb-52e5-43aa-b1de-a5d524e67b33-d2bc03a62f3def3f-mix.wav
rtpengine 28115 root 25w REG 259,1 8912974 50317402 /recordings/2019_07_15_23/04334bdf-ef1e-4e77-86fa-74c307b97284-8ef5a2dbcd345055-mix.wav
rtpengine 28115 root 26w REG 259,1 4718670 50317383 /recordings/2019_07_15_22/6a438470-ade9-4d1d-a019-4bf3a15941f4-1ae275927be992dd-mix.wav
rtpengine 28115 root 27w REG 259,1 4718670 50290834 /recordings/2019_07_15_21/1f6c46c5-3cde-4ae6-9c3e-4d6d01b080ce-a6458de9d1c4a447-mix.wav
rtpengine 28115 root 29w REG 259,1 17563726 50317391 /recordings/2019_07_15_23/b1075f2d-4fd0-43e9-97b4-8a5ea5eb590d-b009e6186bfc7bf7-mix.wav
rtpengine 28115 root 32w REG 259,1 6815822 50290828 /recordings/2019_07_15_21/3f5ec2bf-0f43-4358-a95e-27eeec32cabf-5d6fde1bee7cd506-mix.wav
rtpengine 28115 root 33w REG 259,1 6029390 50317396 /recordings/2019_07_15_23/f082060b-2d55-454e-b5ec-60e95a924d01-6000e64b2e3d7bf4-mix.wav
rtpengine 28115 root 38w REG 259,1 5505102 50317400 /recordings/2019_07_15_23/09fd0e88-4dc8-4787-8feb-bd8a0ee434f1-ed563f22b8db66d6-mix.wav
rtpengine 28115 root 53w REG 259,1 6029390 50317376 /recordings/2019_07_15_22/7dfdc865-405d-449f-9b36-ab773f82cbc6-f9e6aa5012ec760d-mix.wav
Looking at all these calls, they seem to have one thing in common, e.g. f082060b-2d55-454e-b5ec-60e95a924d01
:
Jul 15 23:11:12 iad-prd-p-c7-rtppxy-097.p28.cloud rtpengine[3147]: INFO: [f082060b-2d55-454e-b5ec-60e95a924d01]: Closing call due to timeout
Not sure if that bears somehow upon the issue.
Lastly, the handles still open by the core process number 14:
# lsof -p 28115 | grep wav | wc -l
14
And these are the precise descriptors held open by each LWP/subprocess:
# cd /proc/26801/fd
# ls -la
total 0
dr-x------. 2 root root 0 Jul 16 01:23 .
dr-xr-xr-x. 9 root root 0 Jul 15 23:04 ..
lr-x------. 1 root root 64 Jul 16 01:23 0 -> /dev/null
l-wx------. 1 root root 64 Jul 16 01:23 1 -> /dev/null
l-wx------. 1 root root 64 Jul 16 01:23 11 -> /recordings/2019_07_15_21/182e3fda-dba8-4bf6-b9bf-280dc866d999-b5cb69e0bc8d64b3-mix.wav
l-wx------. 1 root root 64 Jul 16 01:23 12 -> /recordings/2019_07_15_22/c9741d47-3b5f-4152-a272-f01c1f7e9391-e2efe72a0067b4fc-mix.wav
l-wx------. 1 root root 64 Jul 16 01:23 13 -> /recordings/2019_07_15_22/b7b92598-d3b4-441c-91db-d85bc99c9fba-2910f679df3904b8-mix.wav
l-wx------. 1 root root 64 Jul 16 01:23 16 -> /recordings/2019_07_15_21/294da9bf-0e81-4740-9b73-5ab73f0f92ae-d18988d0ef76ce8b-mix.wav
l-wx------. 1 root root 64 Jul 16 01:23 17 -> /recordings/2019_07_15_21/78588313-360c-47c0-bfc2-fd44595404ba-dd8552bc1038e41f-mix.wav
l-wx------. 1 root root 64 Jul 16 01:23 2 -> /dev/null
l-wx------. 1 root root 64 Jul 16 01:23 22 -> /recordings/2019_07_15_21/70c9a7bb-52e5-43aa-b1de-a5d524e67b33-d2bc03a62f3def3f-mix.wav
l-wx------. 1 root root 64 Jul 16 01:23 25 -> /recordings/2019_07_15_23/04334bdf-ef1e-4e77-86fa-74c307b97284-8ef5a2dbcd345055-mix.wav
l-wx------. 1 root root 64 Jul 16 01:23 26 -> /recordings/2019_07_15_22/6a438470-ade9-4d1d-a019-4bf3a15941f4-1ae275927be992dd-mix.wav
l-wx------. 1 root root 64 Jul 16 01:23 27 -> /recordings/2019_07_15_21/1f6c46c5-3cde-4ae6-9c3e-4d6d01b080ce-a6458de9d1c4a447-mix.wav
l-wx------. 1 root root 64 Jul 16 01:23 29 -> /recordings/2019_07_15_23/b1075f2d-4fd0-43e9-97b4-8a5ea5eb590d-b009e6186bfc7bf7-mix.wav
lr-x------. 1 root root 64 Jul 16 01:23 3 -> /dev/urandom
l-wx------. 1 root root 64 Jul 16 01:23 32 -> /recordings/2019_07_15_21/3f5ec2bf-0f43-4358-a95e-27eeec32cabf-5d6fde1bee7cd506-mix.wav
l-wx------. 1 root root 64 Jul 16 01:23 33 -> /recordings/2019_07_15_23/f082060b-2d55-454e-b5ec-60e95a924d01-6000e64b2e3d7bf4-mix.wav
l-wx------. 1 root root 64 Jul 16 01:23 38 -> /recordings/2019_07_15_23/09fd0e88-4dc8-4787-8feb-bd8a0ee434f1-ed563f22b8db66d6-mix.wav
lrwx------. 1 root root 64 Jul 16 01:23 4 -> socket:[1718397]
lrwx------. 1 root root 64 Jul 16 01:23 5 -> anon_inode:[eventpoll]
l-wx------. 1 root root 64 Jul 16 01:23 53 -> /recordings/2019_07_15_22/7dfdc865-405d-449f-9b36-ab773f82cbc6-f9e6aa5012ec760d-mix.wav
lr-x------. 1 root root 64 Jul 16 01:23 6 -> anon_inode:inotify
As another data point from this morning ("Serious Call Volumes" have not started yet):
- RTPEngine 1: Targets: 14 / Processes: 70
- RTPEngine 2: Targets: 14 / Processes: 72
- RTPEngine 3: Targets: 16 / Processes: 46
Needless to say, it's a bit hard to make sense of this, though it does seem to be an improvement from the runaway increase of before. But after 9 AM calls will spike into much higher territory and then we can say more.
All the "superfluous" processes beyond the initial workers spawned are from `libavutil, as before:
(gdb) where
#0 0x00007f9453d15965 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007f9454affcbe in thread_worker () from /lib64/libavutil.so.55
#2 0x00007f9453d11dd5 in start_thread () from /lib64/libpthread.so.0
#3 0x00007f9452f4d02d in clone () from /lib64/libc.so.6
And, the number of WAV file handles held by the recording daemon as a whole has increased -- to 34 on this particular host. As before, a salient characteristic of the Call-IDs of all the calls whose handles are being held open is that they seemed to have been timed-out streams:
Jul 16 00:21:21 host INFO: [0d4803a8-9d0b-45f4-bed3-c79b9d3a6336]: Closing call due to timeout
...
Jul 16 00:21:21 host rtpengine-recording[24311]: INFO: [C 0d4803a8-9d0b-45f4-bed3-c79b9d3a6336-61b473e479218803.meta] [S tag-0-media-1-component-2-RTCP-id-3] EOF on stream tag-0-media-1-component-2-RTCP-id-3
Jul 16 00:21:21 host rtpengine-recording[24311]: INFO: [C 0d4803a8-9d0b-45f4-bed3-c79b9d3a6336-61b473e479218803.meta] [S tag-0-media-1-component-1-RTP-id-2] EOF on stream tag-0-media-1-component-1-RTP-id-2
Jul 16 00:21:21 host rtpengine-recording[24311]: INFO: [C 0d4803a8-9d0b-45f4-bed3-c79b9d3a6336-61b473e479218803.meta] [S tag-1-media-1-component-2-RTCP-id-1] EOF on stream tag-1-media-1-component-2-RTCP-id-1
Jul 16 00:21:21 host rtpengine-recording[24311]: INFO: [C 0d4803a8-9d0b-45f4-bed3-c79b9d3a6336-61b473e479218803.meta] [S tag-1-media-1-component-1-RTP-id-0] EOF on stream tag-1-media-1-component-1-RTP-id-0
I cannot help but think that there is some clearer relationship between the number of "stale" file handles opened from "timed out" calls and the number of deadlocked processes, though I cannot find it. There is certainly a correlation; overall, the more such handles, the more processes. But exactly how much more I am unable to establish; it seems to vary, and the process count isn't accounted for by the number of stale handles per se.
Now that we have had production loads all day, I think the verdict is in: the ffmpeg
library update didn't really do anything. Here are the LWP counts and the target counts on the three respective RTPEngine instances:
# ps -p `pidof rtpengine-recording` -lfT | wc -l && echo && cat /proc/rtpengine/0/status
1020
Refcount: 1
Control PID: 28441
Targets: 62
# ps -p `pidof rtpengine-recording` -lfT | wc -l && echo && cat /proc/rtpengine/0/status
1094
Refcount: 1
Control PID: 19633
Targets: 66
# ps -p `pidof rtpengine-recording` -lfT | wc -l && echo && cat /proc/rtpengine/0/status
1074
Refcount: 1
Control PID: 3131
Targets: 70
Moreover, the stale WAV file handles have grown commensurately:
# lsof -p `pidof rtpengine-recording` | grep '.wav' | wc -l
538
# lsof -p `pidof rtpengine-recording` | grep '.wav' | wc -l
532
lsof -p `pidof rtpengine-recording` | grep '.wav' | wc -l
538
Can you tell if it's leaking memory also?
That's hard to say. But with 1000+ LWPs, we can be sure it is using a prodigious amount of memory. :-)
Well yes, I suppose the threads themselves are using up memory too...
My guess is that there's some kind of close/destroy/free/cleanup invocation missing somewhere. Are you able to run this under valgrind? Not recommended for production as performance is horrible, but in a test/lab environment?
I don't think I can do that, no.
What do you make of the fact that the stale WAV handles seem to be tied to streams which disappeared from a timeout?
Can you confirm that for sure? Because the recording daemon doesn't really care about how a call was closed, timeout or otherwise. Once the metadata spool file gets deleted, the call is closed. Assuming the metadata spool files actually do get deleted?
I can confirm that all the file handles that remain held open, as rendered by lsof -p <PID of master process>
, correspond to Call-IDs which reflect a timeout in the rtpengine
log, yes.
And I can say that there are only 20 entries at the moment in /proc/rtpengine/0/calls
, but 1258 LWPs.
I can also say that the chronology of the stuck LWPs and the timestamps of timed out calls line up oddly well:
1 S root 28115 1856 1 0 80 0 - 3105633 futex_ 20:50 ? 00:00:00 /usr/local/sbin/rtpengine-recording --spool-dir=/recordings --output-storage=file --output-dir=/recordings --output-format=wav --output-mixed --pidfile=/var/run/rtpengine-recording.pid
1 S root 28115 1857 1 0 80 0 - 3105633 futex_ 20:50 ? 00:00:00 /usr/local/sbin/rtpengine-recording --spool-dir=/recordings --output-storage=file --output-dir=/recordings --output-format=wav --output-mixed --pidfile=/var/run/rtpengine-recording.pid
1 S root 28115 2025 1 0 80 0 - 3105633 futex_ 20:52 ? 00:00:00 /usr/local/sbin/rtpengine-recording --spool-dir=/recordings --output-storage=file --output-dir=/recordings --output-format=wav --output-mixed --pidfile=/var/run/rtpengine-recording.pid
1 S root 28115 2026 1 0 80 0 - 3105633 futex_ 20:52 ? 00:00:00 /usr/local/sbin/rtpengine-recording --spool-dir=/recordings --output-storage=file --output-dir=/recordings --output-format=wav --output-mixed --pidfile=/var/run/rtpengine-recording.pid
1 S root 28115 2037 1 0 80 0 - 3105633 futex_ 20:52 ? 00:00:00 /usr/local/sbin/rtpengine-recording --spool-dir=/recordings --output-storage=file --output-dir=/recordings --output-format=wav --output-mixed --pidfile=/var/run/rtpengine-recording.pid
1 S root 28115 2038 1 0 80 0 - 3105633 futex_ 20:52 ? 00:00:00 /usr/local/sbin/rtpengine-recording --spool-dir=/recordings --output-storage=file --output-dir=/recordings --output-format=wav --output-mixed --pidfile=/var/run/rtpengine-recording.pid
1 S root 28115 2044 1 0 80 0 - 3105633 futex_ 20:52 ? 00:00:00 /usr/local/sbin/rtpengine-recording --spool-dir=/recordings --output-storage=file --output-dir=/recordings --output-format=wav --output-mixed --pidfile=/var/run/rtpengine-recording.pid
1 S root 28115 2045 1 0 80 0 - 3105633 futex_ 20:52 ? 00:00:00 /usr/local/sbin/rtpengine-recording --spool-dir=/recordings --output-storage=file --output-dir=/recordings --output-format=wav --output-mixed --pidfile=/var/run/rtpengine-recording.pid
1 S root 28115 2094 1 0 80 0 - 3105633 futex_ 20:53 ? 00:00:00 /usr/local/sbin/rtpengine-recording --spool-dir=/recordings --output-storage=file --output-dir=/recordings --output-format=wav --output-mixed --pidfile=/var/run/rtpengine-recording.pid
1 S root 28115 2095 1 0 80 0 - 3105633 futex_ 20:53 ? 00:00:00 /usr/local/sbin/rtpengine-recording --spool-dir=/recordings --output-storage=file --output-dir=/recordings --output-format=wav --output-mixed --pidfile=/var/run/rtpengine-recording.pid
1 S root 28115 2186 1 0 80 0 - 3105633 futex_ 20:54 ? 00:00:00 /usr/local/sbin/rtpengine-recording --spool-dir=/recordings --output-storage=file --output-dir=/recordings --output-format=wav --output-mixed --pidfile=/var/run/rtpengine-recording.pid
1 S root 28115 2187 1 0 80 0 - 3105633 futex_ 20:54 ? 00:00:00 /usr/local/sbin/rtpengine-recording --spool-dir=/recordings --output-storage=file --output-dir=/recordings --output-format=wav --output-mixed --pidfile=/var/run/rtpengine-recording.pid
1 S root 28115 2553 1 0 80 0 - 3105633 futex_ 20:55 ? 00:00:00 /usr/local/sbin/rtpengine-recording --spool-dir=/recordings --output-storage=file --output-dir=/recordings --output-format=wav --output-mixed --pidfile=/var/run/rtpengine-recording.pid
1 S root 28115 2554 1 0 80 0 - 3105633 futex_ 20:55 ? 00:00:00 /usr/local/sbin/rtpengine-recording --spool-dir=/recordings --output-storage=file --output-dir=/recordings --output-format=wav --output-mixed --pidfile=/var/run/rtpengine-recording.pid
1 S root 28115 2877 1 0 80 0 - 3105633 futex_ 20:59 ? 00:00:00 /usr/local/sbin/rtpengine-recording --spool-dir=/recordings --output-storage=file --output-dir=/recordings --output-format=wav --output-mixed --pidfile=/var/run/rtpengine-recording.pid
1 S root 28115 2878 1 0 80 0 - 3105633 futex_ 20:59 ? 00:00:00 /usr/local/sbin/rtpengine-recording --spool-dir=/recordings --output-storage=file --output-dir=/recordings --output-format=wav --output-mixed --pidfile=/var/run/rtpengine-recording.pid
1 S root 28115 4012 1 0 80 0 - 3105633 futex_ 21:05 ? 00:00:00 /usr/local/sbin/rtpengine-recording --spool-dir=/recordings --output-storage=file --output-dir=/recordings --output-format=wav --output-mixed --pidfile=/var/run/rtpengine-recording.pid
1 S root 28115 4013 1 0 80 0 - 3105633 futex_ 21:05 ? 00:00:00 /usr/local/sbin/rtpengine-recording --spool-dir=/recordings --output-storage=file --output-dir=/recordings --output-format=wav --output-mixed --pidfile=/var/run/rtpengine-recording.pid
1 S root 28115 4044 1 0 80 0 - 3105633 futex_ 21:05 ? 00:00:00 /usr/local/sbin/rtpengine-recording --spool-dir=/recordings --output-storage=file --output-dir=/recordings --output-format=wav --output-mixed --pidfile=/var/run/rtpengine-recording.pid
1 S root 28115 4045 1 0 80 0 - 3105633 futex_ 21:05 ? 00:00:00 /usr/local/sbin/rtpengine-recording --spool-dir=/recordings --output-storage=file --output-dir=/recordings --output-format=wav --output-mixed --pidfile=/var/run/rtpengine-recording.pid
And:
Jul 16 21:00:31 host rtpengine[3147]: INFO: [b1f98ee9-a82d-49dd-8295-e3e48da392c9]: Closing call due to timeout
Jul 16 21:00:42 host rtpengine[3147]: INFO: [8f11933f-58e5-4fb6-856f-b3dc68c5da52]: Closing call due to timeout
Jul 16 21:00:56 host rtpengine[3147]: INFO: [ca8c4084-5de7-4bfa-8ca6-71645e3eae33]: Closing call due to timeout
Jul 16 21:01:10 host rtpengine[3147]: INFO: [ff1ccee6-645d-4d61-80ec-58caf7015427]: Closing call due to timeout
Jul 16 21:01:27 host rtpengine[3147]: INFO: [edacdc53-d45d-496a-9b7a-eee12d9904d9]: Closing call due to timeout
Jul 16 21:01:45 host rtpengine[3147]: INFO: [d9c4df76-c18e-43c9-9172-7a71137439e2]: Closing call due to timeout
Jul 16 21:01:57 host rtpengine[3147]: INFO: [03b01d3b-29ad-454e-85be-2b8704e54f16]: Closing call due to timeout
Jul 16 21:01:58 host rtpengine[3147]: INFO: [d53e45a4-af63-4ec6-b105-5e918243c195]: Closing call due to timeout
Jul 16 21:02:05 host rtpengine[3147]: INFO: [1358e926-26c9-4480-94d7-af9195bcdf27]: Closing call due to timeout
Jul 16 21:02:14 host rtpengine[3147]: INFO: [62b8faea-de64-4566-bf99-e7f1d41b71ab]: Closing call due to timeout
Jul 16 21:02:18 host rtpengine[3147]: INFO: [8fd02e81-e45d-4a14-a6f0-5df883e42aaf]: Closing call due to timeout
Jul 16 21:02:19 host rtpengine[3147]: INFO: [b6270741-c1aa-47fd-a304-97af6f115238]: Closing call due to timeout
Jul 16 21:02:34 host rtpengine[3147]: INFO: [e067d252-60b9-4a28-9bd7-684ed9e626ff]: Closing call due to timeout
Jul 16 21:02:34 host rtpengine[3147]: INFO: [a118f87c-48cf-4a62-b012-d23442bf2748]: Closing call due to timeout
Jul 16 21:02:39 host rtpengine[3147]: INFO: [db7fa038-d777-4520-ba64-d0dd07fceb8c]: Closing call due to timeout
Jul 16 21:02:45 host rtpengine[3147]: INFO: [57d4d9db-5509-47ce-a4f7-458f9a04eb73]: Closing call due to timeout
Jul 16 21:03:59 host rtpengine[3147]: INFO: [5d4388e7-2b50-439a-a1fd-6bdd1d483c40]: Closing call due to timeout
Jul 16 21:05:02 host rtpengine[3147]: INFO: [fb8c4c9b-5c9d-4b0c-9e7f-c963da28f19d]: Closing call due to timeout
Jul 16 21:05:12 host rtpengine[3147]: INFO: [52332284-7974-4474-aaa5-2a9f51700a5d]: Closing call due to timeout
Jul 16 21:05:52 host rtpengine[3147]: INFO: [25ac8b14-3aec-4637-943d-fae8b16fb66a]: Closing call due to timeout
What about the reverse though? Did any calls that were not closed from a timeout also result in a stale file/LWP?
I can confirm that every single one of the file handles held open by the recording daemon corresponds to a timed out call:
# lsof -p `pidof rtpengine-recording` | grep -i '.wav' | awk '{print $9}' | awk -F '/' '{print $3"/"$4}' | perl -ne 'chomp; if(/^2019/) { print "$_\n"; }' | awk -F '/' '{print $2}' | perl -ne 'chomp; if(/^(\S{36})/) { print "$1\n"; }' | while read i; do fgrep $i /var/log/messages | fgrep timeout; done | wc -l
The line count there is precisely identical to the one returned by lsof -p
pidof rtpengine-recording | grep '.wav' | wc -l
.
But the metadata spool files have been deleted regardless?
Correct -- none of the Call-IDs found in /proc/rtpengine/0/calls
are among the Call-IDs in the stale file handles.
Oh, no those are not the metadata spool files. Check the directory you have configured as spool-dir
(default /var/spool/rtpengine
)
Oh, I see. I put the .meta
files in the same directory as the recordings themselves (/recordings
) (which get swept into a timestamped directory after 5 mins by a cron job). Anyway, there are only 8 meta files in there at present, and they are all active calls. So yes, they are getting reliably deleted.
Ah. Ok. Don't do that. Use a separate spool directory. Try that for starters.
Okay. But can I ask why? :) This was non-obvious.
Because the recording daemon watches the spool directory for changes using inotify, reads each file upon changes, and if it writes the recordings to the same directory, it gets confused. I'm not sure if that fixes what you're seeing, but it's a first step.
I have changed the spool directory to /recordings-spool
and restarted. It's after 17:00 now so calls will be dying down, but there are still some. Let's see what happens. Thank you for the suggestion.
Well, that's odd. Now there are no metadata files being written to the spool directory, even though new calls are coming in, e.g.
Jul 16 21:29:16 host rtpengine[3147]: NOTICE: [e322a179-0197-487a-8038-17617b44147f]: Creating new call
Jul 16 21:29:16 host rtpengine[3147]: NOTICE: [e322a179-0197-487a-8038-17617b44147f]: Turning on call recording.
Did you change the spool directory on rtpengine's side too?
No, I didn't. I just realised that. I can't do that without dropping production calls, which would be a problem. Let me see what I can do to handle that 'gracefully'.
I assume that --recording-dir
is the option I should change on the rtpengine side?
--recording-dir=FILE Directory for storing pcap and metadata files
Yes.
You can also leave the spool dir unchanged and just change the output dir for the recordings.
I just thought of that too. :-) Stand by...
Okay, I have moved /recordings
to /recordings-out
and created a new /recordings
, and got rtpengine to recognise it after a SIGHUP without restarting, and started the recording daemon with the new spool and output directories. Let's see what happens...
Well, call volumes have collapsed since it's nearly 18:00. I only have 8 targets up right now. So, the pickings are slim for large-scale troubleshooting.
But I have a sense this did not fix the issue -- there are a few new LWPs stuck in the futex_
state on top of the pollers. We started with 10 threads.
# ps -p `pidof rtpengine-recording` -lfT
F S UID PID SPID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD
1 S root 12094 12094 1 0 80 0 - 209082 do_sig 21:38 ? 00:00:00 /usr/local/sbin/rtpengine-recording --spool-dir=/recordings --output-storage=file --output-dir=/recordings-out --output-format=wav --output-mixed --
1 S root 12094 12095 1 0 80 0 - 209082 ep_pol 21:38 ? 00:00:02 /usr/local/sbin/rtpengine-recording --spool-dir=/recordings --output-storage=file --output-dir=/recordings-out --output-format=wav --output-mixed --
1 S root 12094 12096 1 0 80 0 - 209082 ep_pol 21:38 ? 00:00:01 /usr/local/sbin/rtpengine-recording --spool-dir=/recordings --output-storage=file --output-dir=/recordings-out --output-format=wav --output-mixed --
1 S root 12094 12097 1 0 80 0 - 209082 ep_pol 21:38 ? 00:00:01 /usr/local/sbin/rtpengine-recording --spool-dir=/recordings --output-storage=file --output-dir=/recordings-out --output-format=wav --output-mixed --
1 S root 12094 12098 1 0 80 0 - 209082 ep_pol 21:38 ? 00:00:02 /usr/local/sbin/rtpengine-recording --spool-dir=/recordings --output-storage=file --output-dir=/recordings-out --output-format=wav --output-mixed --
1 S root 12094 12099 1 0 80 0 - 209082 ep_pol 21:38 ? 00:00:02 /usr/local/sbin/rtpengine-recording --spool-dir=/recordings --output-storage=file --output-dir=/recordings-out --output-format=wav --output-mixed --
1 S root 12094 12100 1 0 80 0 - 209082 ep_pol 21:38 ? 00:00:01 /usr/local/sbin/rtpengine-recording --spool-dir=/recordings --output-storage=file --output-dir=/recordings-out --output-format=wav --output-mixed --
1 S root 12094 12101 1 0 80 0 - 209082 ep_pol 21:38 ? 00:00:01 /usr/local/sbin/rtpengine-recording --spool-dir=/recordings --output-storage=file --output-dir=/recordings-out --output-format=wav --output-mixed --
1 S root 12094 12102 1 0 80 0 - 209082 ep_pol 21:38 ? 00:00:02 /usr/local/sbin/rtpengine-recording --spool-dir=/recordings --output-storage=file --output-dir=/recordings-out --output-format=wav --output-mixed --
1 S root 12094 14316 1 0 80 0 - 209082 futex_ 21:50 ? 00:00:00 /usr/local/sbin/rtpengine-recording --spool-dir=/recordings --output-storage=file --output-dir=/recordings-out --output-format=wav --output-mixed --
1 S root 12094 14317 1 0 80 0 - 209082 futex_ 21:50 ? 00:00:00 /usr/local/sbin/rtpengine-recording --spool-dir=/recordings --output-storage=file --output-dir=/recordings-out --output-format=wav --output-mixed --
1 S root 12094 15285 1 0 80 0 - 209082 futex_ 21:57 ? 00:00:00 /usr/local/sbin/rtpengine-recording --spool-dir=/recordings --output-storage=file --output-dir=/recordings-out --output-format=wav --output-mixed --
1 S root 12094 15286 1 0 80 0 - 209082 futex_ 21:57 ? 00:00:00 /usr/local/sbin/rtpengine-recording --spool-dir=/recordings --output-storage=file --output-dir=/recordings-out --output-format=wav --output-mixed --
However, all the file handles held open right now are for live calls, so I'm going to have to see if those handles disappear afterward.
Well, one promising sign ... there was a call for which a file handle was held open before:
rtpengine 12094 root 16w REG 259,1 786510 52219679 /recordings-out/305120aa-7fae-490d-8add-fefd13d9c95b-daf1e79cc7568895-mix.wav
... which has since closed due to a timeout:
Jul 16 21:59:42 host rtpengine[3147]: INFO: [305120aa-7fae-490d-8add-fefd13d9c95b]: Closing call due to timeout
... and the file handle has disappeared:
# lsof -p `pidof rtpengine-recording` | grep -i wav
rtpengine 12094 root 11w REG 259,1 10485838 52219676 /recordings-out/d3fc186f-cc8e-41db-9b66-0fc361e4dec5-1188a59a85ff9ddb-mix.wav
rtpengine 12094 root 16w REG 259,1 786510 52172452 /recordings-out/530cef3a-b717-410e-a2dc-b0d38754a2f5-23cb1755bd932643-mix.wav
rtpengine 12094 root 21w REG 259,1 2621518 52164511 /recordings-out/4982ca2d-44f2-4103-9259-a8fe2c8357b4-de9d69453e84cb9c-mix.wav
rtpengine 12094 root 26w REG 259,1 262222 52172465 /recordings-out/99a3aad9-66c4-4c48-820c-5bebe5f8bb8f-c1c6e7205d70c1a2-mix.wav
On the other hand, the LWP count still seems wildly at odds with the amount of streams total on the system:
# ps -p `pidof rtpengine-recording` -lfT | wc -l && echo && cat /proc/rtpengine/0/status
20
Refcount: 1
Control PID: 3131
Targets: 4
It'll occasionally decrease by 2 or so, but overall the trend is to increase and increase. That makes me pessimistic that the directory change fixed the issue.
We're just going to have to wait until tomorrow to get any real results.
Well, some cause for optimism, though I don't want to call it prematurely until we see tomorrow's production call loads.
Nevertheless, since I made the suggested change, we have dropped to zero call load on that RTPEngine and got back down to the default ten threads (absent a --num-threads
value it defaults to 10):
# ps -p `pidof rtpengine-recording` -lfT | wc -l && echo && cat /proc/rtpengine/0/status
10
Refcount: 1
Control PID: 3131
Targets: 0
This is not a result I had seen before.
Hi @rfuchs, your suggestion to separate the spool and recording directories appears to have solved the problem. Thank you very much!
If you don't mind, I'm going to submit a PR with amendments to the README to caution against this for other users. Putting the .meta data in the directory with the actual recordings file was probably not a behaviour you anticipated, but adverse consequences of doing so were neither documented nor obvious to those who don't know how the daemon works. :-)
Sounds good, thanks. I'm thinking of even having a check to refuse startup when something like this is configured.
#810 has been submitted.