sipwise / rtpengine

The Sipwise media proxy for Kamailio

Runaway LWPs/threads on recording daemon

abalashov opened this issue

I am running RTPEngine mr6.5.4.2 built from source on EL7, plus the recording daemon from the same suite. The libav* dependencies come from the nux-dextop repo. RTPEngine is writing frames into the /proc sink (--recording-method=proc) and the recording daemon is writing out mixed mono WAVs, with file-only metadata and no DB. The full invocation is:

/usr/local/sbin/rtpengine-recording \
   --spool-dir=/recordings \
   --output-storage=file \
   --output-dir=/recordings \
   --output-format=wav \
   --output-mixed \
   --pidfile=/var/run/rtpengine-recording.pid

What I am seeing is runaway growth in the number of worker threads spawned by the recording daemon, wildly disproportionate to the number of RTPEngine targets:

# cat /proc/rtpengine/0/status 
Refcount:    1
Control PID: 3131
Targets:     72
# ps aux | grep -i rtpengine-rec
root      8635 19.4  5.3 4172872 416356 ?      Sl   18:48  18:03 /usr/local/sbin/rtpengine-recording --spool-dir=/recordings --output-storage=file --output-dir=/recordings --output-format=wav --output-mixed --pidfile=/var/run/rtpengine-recording.pid
root     25573  0.0  0.0 112712   996 pts/0    S+   20:21   0:00 grep --color=auto -i rtpengine-rec
# ps -p 8635 -lfT | wc -l
418

Almost all of them appear to be waiting on a futex, so I assume some sort of deadlock, e.g.:

1 S root      8635 25622     1  0  80   0 - 1047316 futex_ 20:22 ?      00:00:00 /usr/local/sbin/rtpengine-recording --spool-dir=/recordings --output-storage=file --output-dir=/recordings --output-format=wav --output-mixed --pidfile=/var/run/rtpengine-rec
1 S root      8635 25623     1  0  80   0 - 1047316 futex_ 20:22 ?      00:00:00 /usr/local/sbin/rtpengine-recording --spool-dir=/recordings --output-storage=file --output-dir=/recordings --output-format=wav --output-mixed --pidfile=/var/run/rtpengine-rec
1 S root      8635 25625     1  0  80   0 - 1047316 futex_ 20:22 ?      00:00:00 /usr/local/sbin/rtpengine-recording --spool-dir=/recordings --output-storage=file --output-dir=/recordings --output-format=wav --output-mixed --pidfile=/var/run/rtpengine-rec
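
A small caveat on the `wc -l` figure above: `ps -lfT` prints a header line, so the number of threads is one less than the number of lines. A minimal sketch that counts the entries under /proc/<pid>/task instead, where there is one directory per thread. The function name count_lwps and its optional second argument (an alternative proc root, handy for testing) are illustrative assumptions, not part of any tool:

```shell
# Count the threads (LWPs) of a process by listing /proc/<pid>/task,
# which contains one directory per thread. There is no header to subtract.
# Usage: count_lwps <pid> [proc-root]   (proc-root defaults to /proc)
count_lwps() {
    cl_pid=$1
    cl_root=${2:-/proc}
    cl_n=$(ls "$cl_root/$cl_pid/task" 2>/dev/null | wc -l)
    # Arithmetic expansion strips any padding that wc may add.
    echo "$((cl_n))"
}
```

For example, `count_lwps 8635` would report the thread count for the process shown above.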

The way this issue was detected is that the recording daemon started complaining about running into file descriptor limits ("Too many open files" error), which struck me as curious given the relatively small number of concurrent streams recorded and the fact that the recording daemon is running as EUID/EGID root.

However, what I have found is that every one of those LWPs has several hundred open descriptors. For instance, PID 8635 above:

# cd /proc/8635/fd
# ls -w 5 | wc -l
291

This seems to be the story with all the LWPs:

# ps -p 8635 -fT | awk '{print $3}' | while read THIS_PID; do echo -n "$THIS_PID: "; find "/proc/$THIS_PID/fd" | wc -l; done 
SPID: find: ‘/proc/SPID/fd’: No such file or directory
0
8635: 284
8636: 284
8637: 284
8638: 284
8639: 284
8640: 284
[... same all the way down the line ...]

Since the descriptor count is exactly the same across all the LWPs, I assume this is because they are cloned into every LWP. But regardless, it contributes to a rather large cumulative descriptor count across all the LWPs for that process:

# ps -p 8635 -fT | awk '{print $3}' | while read THIS_PID; do echo -n "$THIS_PID: "; find "/proc/$THIS_PID/fd" | wc -l; done | awk '{print $2}' | awk 'BEGIN { sum = 0 } { sum += $1 } END { print sum }'
find: ‘/proc/SPID/fd’: No such file or directory
110826
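
One caveat on the sum above: threads created through pthreads normally share a single descriptor table with their process, so /proc/<tid>/fd shows the same entries for every LWP, and adding up the per-LWP counts multiplies one table by the number of threads. The sketch below prints the thread count and the process-wide descriptor count side by side so the two can be compared. The name fd_summary and the optional proc-root argument are hypothetical, for illustration only:

```shell
# Print "<threads> <descriptors>" for one process.
# The descriptor count is taken once from /proc/<pid>/fd, because all
# threads of a process normally share that table.
# Usage: fd_summary <pid> [proc-root]   (proc-root defaults to /proc)
fd_summary() {
    fs_pid=$1
    fs_root=${2:-/proc}
    fs_threads=$(ls "$fs_root/$fs_pid/task" 2>/dev/null | wc -l)
    fs_fds=$(ls "$fs_root/$fs_pid/fd" 2>/dev/null | wc -l)
    echo "$((fs_threads)) $((fs_fds))"
}
```

If this reading is right, the second number is the one that counts against the process's open-file limit, however many threads there are.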

The number of LWPs steadily increases. We found it at a peak of 1200 before restarting the recording daemon. At that point, we seem to have bumped into the system-wide FD limit:

# cat /proc/sys/fs/file-max
763006
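
It may help to check which limit is actually being hit. "Too many open files" is the standard error text for EMFILE, the per-process limit (RLIMIT_NOFILE), which applies to root as well; the system-wide limit produces ENFILE, "Too many open files in system". That mapping assumes the daemon logs the standard error string unchanged. A sketch that reads both limits from /proc; the function names and the optional proc-root argument are illustrative assumptions:

```shell
# Print the soft and hard open-file limits of a process, taken from the
# "Max open files" row of /proc/<pid>/limits.
# Usage: nofile_limits <pid> [proc-root]
nofile_limits() {
    nl_pid=$1
    nl_root=${2:-/proc}
    awk '/^Max open files/ { print $4, $5 }' "$nl_root/$nl_pid/limits"
}

# Print the system-wide "<allocated> <maximum>" file handle figures from
# /proc/sys/fs/file-nr (fields 1 and 3; field 2 is the free count).
# Usage: system_file_usage [proc-root]
system_file_usage() {
    sf_root=${1:-/proc}
    awk '{ print $1, $3 }' "$sf_root/sys/fs/file-nr"
}
```

Comparing `nofile_limits 8635` with the descriptor count of the process, and `system_file_usage` with file-max, should show which ceiling is closer.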

This situation appears to play out regardless of whether the recording daemon is invoked with an explicit --num-threads=... value or left at the default (as now).

There is nothing interesting in the logs (until the "Too many open files" messages start). Just fairly routine things like:

INFO: [C 2fcb0ec6-e8ef-4e84-8fda-163e9ac7626d-94e8e401f7b2a8ed.meta] [S tag-1-media-1-component-2-RTCP-id-1] EOF on stream tag-1-media-1-component-2-RTCP-id-1

And:

WARNING: [C 63b7c294-a546-4623-a034-6d2b26f54cc3-63ab5712488d039b.meta] [S tag-0-media-1-component-1-RTP-id-2] [0x554f12d] Cannot decode RTP payload type 101 (telephone-event/8000)

Since only a static number of poller threads is created in main():

	for (int i = 0; i < num_threads; i++)
		start_poller_thread();

a reasonable supposition for the runaway thread growth is that some or all of the libraries used by the recording daemon do their work in threads of their own, and that these threads are not exiting properly:

	linux-vdso.so.1 =>  (0x00007fffc19ee000)
	libm.so.6 => /lib64/libm.so.6 (0x00007f9afe96f000)
	libglib-2.0.so.0 => /lib64/libglib-2.0.so.0 (0x00007f9afe659000)
	libgthread-2.0.so.0 => /lib64/libgthread-2.0.so.0 (0x00007f9afe457000)
	libavcodec.so.56 => /lib64/libavcodec.so.56 (0x00007f9afd1fd000)
	libavformat.so.56 => /lib64/libavformat.so.56 (0x00007f9afce2c000)
	libavutil.so.54 => /lib64/libavutil.so.54 (0x00007f9afcbca000)
	libswresample.so.1 => /lib64/libswresample.so.1 (0x00007f9afc9af000)
	libavfilter.so.5 => /lib64/libavfilter.so.5 (0x00007f9afc63a000)
	libmysqlclient.so.18 => /usr/lib64/mysql/libmysqlclient.so.18 (0x00007f9afc13a000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f9afbf1e000)
	libz.so.1 => /lib64/libz.so.1 (0x00007f9afbd08000)
	libdl.so.2 => /lib64/libdl.so.2 (0x00007f9afbb04000)
	libssl.so.10 => /lib64/libssl.so.10 (0x00007f9afb892000)
	libcrypto.so.10 => /lib64/libcrypto.so.10 (0x00007f9afb430000)
	libc.so.6 => /lib64/libc.so.6 (0x00007f9afb063000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f9afec71000)
	libpcre.so.1 => /lib64/libpcre.so.1 (0x00007f9afae01000)
	libva.so.1 => /lib64/libva.so.1 (0x00007f9afabe1000)
	libxvidcore.so.4 => /lib64/libxvidcore.so.4 (0x00007f9afa8ce000)
	libx265.so.79 => /lib64/libx265.so.79 (0x00007f9afa37d000)
	libx264.so.142 => /lib64/libx264.so.142 (0x00007f9afa009000)
	libvorbisenc.so.2 => /lib64/libvorbisenc.so.2 (0x00007f9af9b3a000)
	libvorbis.so.0 => /lib64/libvorbis.so.0 (0x00007f9af990d000)
	libvo-amrwbenc.so.0 => /lib64/libvo-amrwbenc.so.0 (0x00007f9af96f3000)
	libtheoraenc.so.1 => /lib64/libtheoraenc.so.1 (0x00007f9af94c6000)
	libtheoradec.so.1 => /lib64/libtheoradec.so.1 (0x00007f9af92b6000)
	libspeex.so.1 => /lib64/libspeex.so.1 (0x00007f9af909d000)
	libschroedinger-1.0.so.0 => /lib64/libschroedinger-1.0.so.0 (0x00007f9af8dd1000)
	libopus.so.0 => /lib64/libopus.so.0 (0x00007f9af8b8f000)
	libopenjpeg.so.1 => /lib64/libopenjpeg.so.1 (0x00007f9af896b000)
	libopencore-amrwb.so.0 => /lib64/libopencore-amrwb.so.0 (0x00007f9af8757000)
	libopencore-amrnb.so.0 => /lib64/libopencore-amrnb.so.0 (0x00007f9af852d000)
	libmp3lame.so.0 => /lib64/libmp3lame.so.0 (0x00007f9af82b4000)
	libgsm.so.1 => /lib64/libgsm.so.1 (0x00007f9af80a8000)
	libfdk-aac.so.1 => /lib64/libfdk-aac.so.1 (0x00007f9af7df4000)
	libgmp.so.10 => /lib64/libgmp.so.10 (0x00007f9af7b7c000)
	libgnutls.so.28 => /lib64/libgnutls.so.28 (0x00007f9af7842000)
	libbz2.so.1 => /lib64/libbz2.so.1 (0x00007f9af7632000)
	libsoxr.so.0 => /lib64/libsoxr.so.0 (0x00007f9af73cf000)
	libswscale.so.3 => /lib64/libswscale.so.3 (0x00007f9af7148000)
	libpostproc.so.53 => /lib64/libpostproc.so.53 (0x00007f9af6f2a000)
	libavresample.so.2 => /lib64/libavresample.so.2 (0x00007f9af6d0b000)
	libfreetype.so.6 => /lib64/libfreetype.so.6 (0x00007f9af6a4c000)
	libass.so.5 => /lib64/libass.so.5 (0x00007f9af681c000)
	libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007f9af6515000)
	libgssapi_krb5.so.2 => /lib64/libgssapi_krb5.so.2 (0x00007f9af62c8000)
	libkrb5.so.3 => /lib64/libkrb5.so.3 (0x00007f9af5fdf000)
	libcom_err.so.2 => /lib64/libcom_err.so.2 (0x00007f9af5ddb000)
	libk5crypto.so.3 => /lib64/libk5crypto.so.3 (0x00007f9af5ba8000)
	librt.so.1 => /lib64/librt.so.1 (0x00007f9af59a0000)
	libnuma.so.1 => /lib64/libnuma.so.1 (0x00007f9af5794000)
	libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f9af557e000)
	libogg.so.0 => /usr/lib64/libogg.so.0 (0x00007f9af5377000)
	liborc-0.4.so.0 => /lib64/liborc-0.4.so.0 (0x00007f9af50f3000)
	libp11-kit.so.0 => /lib64/libp11-kit.so.0 (0x00007f9af4dc4000)
	libtasn1.so.6 => /lib64/libtasn1.so.6 (0x00007f9af4bb1000)
	libnettle.so.4 => /lib64/libnettle.so.4 (0x00007f9af4980000)
	libhogweed.so.2 => /lib64/libhogweed.so.2 (0x00007f9af4759000)
	libgomp.so.1 => /lib64/libgomp.so.1 (0x00007f9af4533000)
	libpng15.so.15 => /lib64/libpng15.so.15 (0x00007f9af4308000)
	libfribidi.so.0 => /lib64/libfribidi.so.0 (0x00007f9af40ec000)
	libfontconfig.so.1 => /lib64/libfontconfig.so.1 (0x00007f9af3eaa000)
	libharfbuzz.so.0 => /lib64/libharfbuzz.so.0 (0x00007f9af3c0d000)
	libkrb5support.so.0 => /lib64/libkrb5support.so.0 (0x00007f9af39fd000)
	libkeyutils.so.1 => /lib64/libkeyutils.so.1 (0x00007f9af37f9000)
	libresolv.so.2 => /lib64/libresolv.so.2 (0x00007f9af35e0000)
	libffi.so.6 => /lib64/libffi.so.6 (0x00007f9af33d8000)
	libexpat.so.1 => /lib64/libexpat.so.1 (0x00007f9af31ae000)
	libuuid.so.1 => /lib64/libuuid.so.1 (0x00007f9af2fa9000)
	libgraphite2.so.3 => /lib64/libgraphite2.so.3 (0x00007f9af2d7b000)
	libselinux.so.1 => /lib64/libselinux.so.1 (0x00007f9af2b54000)

But I have no way of identifying which library might be the problem.

One thing I did try was to remove mysql_thread_init() from the poller_thread() invocation in epoll.c (and the corresponding mysql_thread_end()), since we are not using MySQL at all for storage. However, this did not seem to have any effect.

One thing I did learn from attaching gdb to a random selection of these LWPs is that they're all threads spawned by libavfilter:

# gdb attach 13775
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-114.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
attach: No such file or directory.
Attaching to process 13775

warning: process 13775 is a cloned process
Reading symbols from /usr/local/sbin/rtpengine-recording...done.
[spam elided]
(gdb) where
#0  0x00007f98dd438965 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f98ddbc33ab in worker () from /lib64/libavfilter.so.5
#2  0x00007f98dd434dd5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f98dc66fead in clone () from /lib64/libc.so.6

Here are the ffmpeg library versions in use:

# rpm -qa | grep -i ffmpeg
ffmpeg-2.8.15-2.el7.nux.x86_64
ffmpeg-devel-2.8.15-2.el7.nux.x86_64
ffmpeg-libs-2.8.15-2.el7.nux.x86_64

I suppose one thing I haven't tried is sourcing them from a place other than nux-dextop. But some quick Googling suggested that this is the 'canonical' way to install them on EL7/CentOS 7, e.g.

https://linuxize.com/post/how-to-install-ffmpeg-on-centos-7/

One thing I am trying now is a much newer version of the ffmpeg packages from something called awel-media-release:

http://awel.domblogger.net/7/media/x86_64/repoview/awel-media-release.html

Much newer versions:

# rpm -qa | grep -i ffmpeg
ffmpeg-devel-3.4.2-1.el7_5.awel.0.x86_64
ffmpeg-3.4.2-1.el7_5.awel.0.x86_64
ffmpeg-libs-3.4.2-1.el7_5.awel.0.x86_64

So far it is looking promising, but it is after business hours and call volumes have collapsed, so I probably can't get truly meaningful feedback until tomorrow.

So far, I've got this situation:

Refcount:    1
Control PID: 3131
Targets:     9

And about 30 LWPs. That number drops to 28 or 26 from time to time and spikes to 32 or so. It doesn't seem to be moving much beyond this level, but neither do the call volumes.

Is there any insight on the relationship between the threads spawned by the recording daemon and the call volumes? It's very difficult to tell if the upgrade of the ffmpeg libs fixed the problem or if the low call volumes after hours are merely masking the same problem. About the only thing that's different is that there isn't the same all-but-monotonic upward increase as before...

I wasn't aware that libavfilter (or ffmpeg libs in general) would spawn any threads. There's certainly nothing in the code that would instruct it to do that. Gonna have to look into what it's doing there.

Just as a data point following the ffmpeg libs update:

It is now after 21:00 here, well outside of business hours, and there are no calls --

# cat /proc/rtpengine/0/status 
Refcount:    1
Control PID: 3131
Targets:     0

There have not been any for quite some time. Yet, there are 38 LWPs spawned off of rtpengine-recording:

# ps -p `pidof rtpengine-recording` -lfT | wc -l 
38

The recording daemon was invoked without a --num-threads value, so it started with a default of 10. Since the last time the recording daemon was restarted, there has been a maximum of about 16 or 18 RTPEngine targets, and the LWP count has crept up from 10 to about 32, then back down to 28, then back up to 30, generally hovering somewhere in this area.

Here is the state of the 38 processes:

# ps -p `pidof rtpengine-recording` -lfT  | awk '{print $1" "$12" "$16}'
F WCHAN CMD
1 do_sig /usr/local/sbin/rtpengine-recording
1 ep_pol /usr/local/sbin/rtpengine-recording
1 ep_pol /usr/local/sbin/rtpengine-recording
1 ep_pol /usr/local/sbin/rtpengine-recording
1 ep_pol /usr/local/sbin/rtpengine-recording
1 ep_pol /usr/local/sbin/rtpengine-recording
1 ep_pol /usr/local/sbin/rtpengine-recording
1 ep_pol /usr/local/sbin/rtpengine-recording
1 ep_pol /usr/local/sbin/rtpengine-recording
1 futex_ /usr/local/sbin/rtpengine-recording
1 futex_ /usr/local/sbin/rtpengine-recording
1 futex_ /usr/local/sbin/rtpengine-recording
1 futex_ /usr/local/sbin/rtpengine-recording
1 futex_ /usr/local/sbin/rtpengine-recording
1 futex_ /usr/local/sbin/rtpengine-recording
1 futex_ /usr/local/sbin/rtpengine-recording
1 futex_ /usr/local/sbin/rtpengine-recording
1 futex_ /usr/local/sbin/rtpengine-recording
1 futex_ /usr/local/sbin/rtpengine-recording
1 futex_ /usr/local/sbin/rtpengine-recording
1 futex_ /usr/local/sbin/rtpengine-recording
1 futex_ /usr/local/sbin/rtpengine-recording
1 futex_ /usr/local/sbin/rtpengine-recording
1 futex_ /usr/local/sbin/rtpengine-recording
1 futex_ /usr/local/sbin/rtpengine-recording
1 futex_ /usr/local/sbin/rtpengine-recording
1 futex_ /usr/local/sbin/rtpengine-recording
1 futex_ /usr/local/sbin/rtpengine-recording
1 futex_ /usr/local/sbin/rtpengine-recording
1 futex_ /usr/local/sbin/rtpengine-recording
1 futex_ /usr/local/sbin/rtpengine-recording
1 futex_ /usr/local/sbin/rtpengine-recording
1 futex_ /usr/local/sbin/rtpengine-recording
1 futex_ /usr/local/sbin/rtpengine-recording
1 futex_ /usr/local/sbin/rtpengine-recording
1 futex_ /usr/local/sbin/rtpengine-recording
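
A listing like this is easier to read as a tally per wait channel. A minimal sketch, assuming the three-column "F WCHAN CMD" layout produced by the awk filter above is fed in on standard input; tally_wchan is a hypothetical helper name:

```shell
# Read "F WCHAN CMD" lines on stdin, skip the header line, and print one
# "<count> <wchan>" line per wait channel, largest count first.
tally_wchan() {
    awk 'NR > 1 { count[$2]++ }
         END { for (w in count) print count[w], w }' | sort -rn
}
```

It would be used by appending `| tally_wchan` to the ps/awk pipeline above.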

Attaching to a process at random:

(gdb) where
#0  0x00007f9453d15965 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f9454affcbe in thread_worker () from /lib64/libavutil.so.55
#2  0x00007f9453d11dd5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f9452f4d02d in clone () from /lib64/libc.so.6

The same seems to be true of the others which are in the futex_ state.

I guess a key question is: given the zero call load, and no calls in the sink...

# pwd
/proc/rtpengine/0/calls
# ls
# 

Overall, the state is:

# strace -fp 28115
strace: Process 28115 attached with 37 threads
[pid 26802] futex(0x7f942c01f844, FUTEX_WAIT_PRIVATE, 3, NULL <unfinished ...>
[pid 26801] futex(0x7f942c01f7d4, FUTEX_WAIT_PRIVATE, 3, NULL <unfinished ...>
[pid 20644] futex(0x7f94240ff1a4, FUTEX_WAIT_PRIVATE, 3, NULL <unfinished ...>
[pid 20643] futex(0x7f94240ff134, FUTEX_WAIT_PRIVATE, 3, NULL <unfinished ...>
[pid 15791] futex(0x7f94440f63e4, FUTEX_WAIT_PRIVATE, 3, NULL <unfinished ...>
[pid 15790] futex(0x7f94440f6374, FUTEX_WAIT_PRIVATE, 3, NULL <unfinished ...>
[pid 15202] futex(0x7f943c09c464, FUTEX_WAIT_PRIVATE, 3, NULL <unfinished ...>
[pid 15201] futex(0x7f943c09c3f4, FUTEX_WAIT_PRIVATE, 3, NULL <unfinished ...>
[pid 13898] futex(0x7f942c0031a4, FUTEX_WAIT_PRIVATE, 3, NULL <unfinished ...>
[pid 13897] futex(0x7f942c003134, FUTEX_WAIT_PRIVATE, 3, NULL <unfinished ...>
[pid  9260] futex(0x7f94380242a4, FUTEX_WAIT_PRIVATE, 3, NULL <unfinished ...>
[pid  9259] futex(0x7f9438024234, FUTEX_WAIT_PRIVATE, 3, NULL <unfinished ...>
[pid 32658] futex(0x7f943c0c9384, FUTEX_WAIT_PRIVATE, 3, NULL <unfinished ...>
[pid 32657] futex(0x7f943c0c9314, FUTEX_WAIT_PRIVATE, 3, NULL <unfinished ...>
[pid 31874] futex(0x7f9424006204, FUTEX_WAIT_PRIVATE, 3, NULL <unfinished ...>
[pid 31873] futex(0x7f9424006194, FUTEX_WAIT_PRIVATE, 3, NULL <unfinished ...>
[pid 31847] futex(0x7f9420032a44, FUTEX_WAIT_PRIVATE, 3, NULL <unfinished ...>
[pid 31846] futex(0x7f94200329d4, FUTEX_WAIT_PRIVATE, 3, NULL <unfinished ...>
[pid 31196] futex(0x7f94301a3664, FUTEX_WAIT_PRIVATE, 3, NULL <unfinished ...>
[pid 31195] futex(0x7f94301a35f4, FUTEX_WAIT_PRIVATE, 3, NULL <unfinished ...>
[pid 30782] futex(0x7f942c01d0a4, FUTEX_WAIT_PRIVATE, 3, NULL <unfinished ...>
[pid 30781] futex(0x7f942c01d034, FUTEX_WAIT_PRIVATE, 3, NULL <unfinished ...>
[pid 28218] futex(0x7f943c00c684, FUTEX_WAIT_PRIVATE, 3, NULL <unfinished ...>
[pid 28217] futex(0x7f943c00c614, FUTEX_WAIT_PRIVATE, 3, NULL <unfinished ...>
[pid 28122] epoll_wait(5,  <unfinished ...>
[pid 28121] epoll_wait(5,  <unfinished ...>
[pid 28120] epoll_wait(5,  <unfinished ...>
[pid 28119] epoll_wait(5,  <unfinished ...>
[pid 28118] epoll_wait(5,  <unfinished ...>
[pid 28116] epoll_wait(5,  <unfinished ...>
[pid 28115] rt_sigtimedwait([INT TERM], NULL, NULL, 8 <unfinished ...>
[pid 29690] futex(0x7f94301a1564, FUTEX_WAIT_PRIVATE, 3, NULL <unfinished ...>
[pid 26629] futex(0x7f943401e734, FUTEX_WAIT_PRIVATE, 3, NULL <unfinished ...>
[pid 28117] epoll_wait(5,  <unfinished ...>
[pid 26630] futex(0x7f943401e7a4, FUTEX_WAIT_PRIVATE, 3, NULL <unfinished ...>
[pid 29689] futex(0x7f94301a14f4, FUTEX_WAIT_PRIVATE, 3, NULL <unfinished ...>
[pid 28123] epoll_wait(5, ^Cstrace: Process 28115 detached

... why are there 38 LWPs? Why aren't these processes being wound back down?

Another interesting wrinkle -- it looks like the core rtpengine-recording process is holding a number of file handles open for calls which are long over:

# lsof -p 28115 
COMMAND     PID USER   FD      TYPE             DEVICE SIZE/OFF     NODE NAME
rtpengine 28115 root  cwd       DIR              259,1      242       64 /
rtpengine 28115 root  rtd       DIR              259,1      242       64 /
rtpengine 28115 root  txt       REG              259,1   635224    23382 /usr/local/sbin/rtpengine-recording
rtpengine 28115 root  mem       REG              259,1    61624  4195748 /usr/lib64/libnss_files-2.17.so
rtpengine 28115 root  mem       REG              259,1   155784  4305473 /usr/lib64/libselinux.so.1
rtpengine 28115 root  mem       REG              259,1   192728  4270323 /usr/lib64/libgraphite2.so.3.0.1
rtpengine 28115 root  mem       REG              259,1    20112  4195793 /usr/lib64/libuuid.so.1.3.0
rtpengine 28115 root  mem       REG              259,1   173320  4308009 /usr/lib64/libexpat.so.1.6.0
rtpengine 28115 root  mem       REG              259,1    15512  4270327 /usr/lib64/libXau.so.6.0.0
rtpengine 28115 root  mem       REG              259,1    32304  4307332 /usr/lib64/libffi.so.6.0.1
rtpengine 28115 root  mem       REG              259,1    19384  4307944 /usr/lib64/libgpg-error.so.0.10.0
rtpengine 28115 root  mem       REG              259,1   105824  4195760 /usr/lib64/libresolv-2.17.so
rtpengine 28115 root  mem       REG              259,1    15688  4307979 /usr/lib64/libkeyutils.so.1.5
rtpengine 28115 root  mem       REG              259,1    67104  4195787 /usr/lib64/libkrb5support.so.0.1
rtpengine 28115 root  mem       REG              259,1   179296  4517318 /usr/lib64/libpng15.so.15.13.0
rtpengine 28115 root  mem       REG              259,1   652984  4270325 /usr/lib64/libharfbuzz.so.0.10705.0
rtpengine 28115 root  mem       REG              259,1   276968  4258680 /usr/lib64/libfontconfig.so.1.11.1
rtpengine 28115 root  mem       REG              259,1   114176  4292710 /usr/lib64/libfribidi.so.0.4.0
rtpengine 28115 root  mem       REG              259,1    75848  4292657 /usr/lib64/libXext.so.6.4.0
rtpengine 28115 root  mem       REG              259,1   165976  4292651 /usr/lib64/libxcb.so.1.1.0
rtpengine 28115 root  mem       REG              259,1   160776  4308726 /usr/lib64/libhogweed.so.2.5
rtpengine 28115 root  mem       REG              259,1   201296  4308728 /usr/lib64/libnettle.so.4.7
rtpengine 28115 root  mem       REG              259,1    78056  4194398 /usr/lib64/libtasn1.so.6.5.3
rtpengine 28115 root  mem       REG              259,1  1261848  4307342 /usr/lib64/libp11-kit.so.0.3.0
rtpengine 28115 root  mem       REG              259,1   535064  4307338 /usr/lib64/libgcrypt.so.11.8.2
rtpengine 28115 root  mem       REG              259,1    28360  4258655 /usr/lib64/libogg.so.0.8.0
rtpengine 28115 root  mem       REG              259,1    88776  4241095 /usr/lib64/libgcc_s-4.8.5-20150702.so.1
rtpengine 28115 root  mem       REG              259,1   210824  4195779 /usr/lib64/libk5crypto.so.3.1
rtpengine 28115 root  mem       REG              259,1    15920  4252611 /usr/lib64/libcom_err.so.2.1
rtpengine 28115 root  mem       REG              259,1   967848  4195785 /usr/lib64/libkrb5.so.3.3
rtpengine 28115 root  mem       REG              259,1   320400  4271970 /usr/lib64/libgssapi_krb5.so.2.2
rtpengine 28115 root  mem       REG              259,1   991616  4600929 /usr/lib64/libstdc++.so.6.0.19
rtpengine 28115 root  mem       REG              259,1   795608  4517319 /usr/lib64/libfreetype.so.6.14.0
rtpengine 28115 root  mem       REG              259,1   200744  4292712 /usr/lib64/libass.so.5.3.2
rtpengine 28115 root  mem       REG              259,1   130688  4292839 /usr/lib64/libpostproc.so.54.7.100
rtpengine 28115 root  mem       REG              259,1   530552  4292843 /usr/lib64/libswscale.so.4.8.100
rtpengine 28115 root  mem       REG              259,1    15720  4292671 /usr/lib64/libvdpau.so.1.0.0
rtpengine 28115 root  mem       REG              259,1  1318800  4292655 /usr/lib64/libX11.so.6.3.0
rtpengine 28115 root  mem       REG              259,1    68192  4305577 /usr/lib64/libbz2.so.1.0.6
rtpengine 28115 root  mem       REG              259,1  1300504  4517312 /usr/lib64/libgnutls.so.28.43.3
rtpengine 28115 root  mem       REG              259,1   495712  4308038 /usr/lib64/libgmp.so.10.2.0
rtpengine 28115 root  mem       REG              259,1   120760  4292827 /usr/lib64/librtmp.so.1
rtpengine 28115 root  mem       REG              259,1    48848  4258670 /usr/lib64/libgsm.so.1.0.12
rtpengine 28115 root  mem       REG              259,1   314840  4292855 /usr/lib64/libmp3lame.so.0.0.0
rtpengine 28115 root  mem       REG              259,1   152104  4292714 /usr/lib64/libopenjpeg.so.1.5.1
rtpengine 28115 root  mem       REG              259,1   351424  4292723 /usr/lib64/libopus.so.0.6.1
rtpengine 28115 root  mem       REG              259,1   102832  4292728 /usr/lib64/libspeex.so.1.5.0
rtpengine 28115 root  mem       REG              259,1    65936  4270307 /usr/lib64/libtheoradec.so.1.1.4
rtpengine 28115 root  mem       REG              259,1   185720  4270309 /usr/lib64/libtheoraenc.so.1.1.2
rtpengine 28115 root  mem       REG              259,1   185280  4258660 /usr/lib64/libvorbis.so.0.4.6
rtpengine 28115 root  mem       REG              259,1  2944200  4258662 /usr/lib64/libvorbisenc.so.2.0.9
rtpengine 28115 root  mem       REG              259,1  1334664  4292809 /usr/lib64/libvpx.so.1.3.0
rtpengine 28115 root  mem       REG              259,1  1056944  4292826 /usr/lib64/libx264.so.152
rtpengine 28115 root  mem       REG              259,1   700656  4292675 /usr/lib64/libxvidcore.so.4.3
rtpengine 28115 root  mem       REG              259,1   113232  4292804 /usr/lib64/libcrystalhd.so.3.6
rtpengine 28115 root  mem       REG              259,1   402384  4304403 /usr/lib64/libpcre.so.1.2.0
rtpengine 28115 root  mem       REG              259,1  2151672  4271975 /usr/lib64/libc-2.17.so
rtpengine 28115 root  mem       REG              259,1  2516624  4305516 /usr/lib64/libcrypto.so.1.0.2k
rtpengine 28115 root  mem       REG              259,1   470360  4305519 /usr/lib64/libssl.so.1.0.2k
rtpengine 28115 root  mem       REG              259,1    19288  4549116 /usr/lib64/libdl-2.17.so
rtpengine 28115 root  mem       REG              259,1    90248  4305546 /usr/lib64/libz.so.1.2.7
rtpengine 28115 root  mem       REG              259,1   141968  4195758 /usr/lib64/libpthread-2.17.so
rtpengine 28115 root  mem       REG              259,1  3135712  4463089 /usr/lib64/mysql/libmysqlclient.so.18.0.0
rtpengine 28115 root  mem       REG              259,1  2606848  4292833 /usr/lib64/libavfilter.so.6.107.100
rtpengine 28115 root  mem       REG              259,1   120408  4292841 /usr/lib64/libswresample.so.2.9.100
rtpengine 28115 root  mem       REG              259,1   451152  4292837 /usr/lib64/libavutil.so.55.78.100
rtpengine 28115 root  mem       REG              259,1  2299880  4292835 /usr/lib64/libavformat.so.57.83.100
rtpengine 28115 root  mem       REG              259,1 12793112  4292829 /usr/lib64/libavcodec.so.57.107.100
rtpengine 28115 root  mem       REG              259,1     7016  4517251 /usr/lib64/libgthread-2.0.so.0.5600.1
rtpengine 28115 root  mem       REG              259,1  1156600  4327750 /usr/lib64/libglib-2.0.so.0.5600.1
rtpengine 28115 root  mem       REG              259,1  1137024  4549118 /usr/lib64/libm-2.17.so
rtpengine 28115 root  mem       REG              259,1   163400  4271992 /usr/lib64/ld-2.17.so
rtpengine 28115 root  mem       REG              259,1    26254    23351 /usr/lib64/gconv/gconv-modules.cache
rtpengine 28115 root    0r      CHR                1,3      0t0     1031 /dev/null
rtpengine 28115 root    1w      CHR                1,3      0t0     1031 /dev/null
rtpengine 28115 root    2w      CHR                1,3      0t0     1031 /dev/null
rtpengine 28115 root    3r      CHR                1,9      0t0     1036 /dev/urandom
rtpengine 28115 root    4u     unix 0xffff889cd1eb6400      0t0  1718397 socket
rtpengine 28115 root    5u  a_inode               0,10        0     6358 [eventpoll]
rtpengine 28115 root    6r  a_inode               0,10        0     6358 inotify
rtpengine 28115 root   11w      REG              259,1 10223694 50290819 /recordings/2019_07_15_21/182e3fda-dba8-4bf6-b9bf-280dc866d999-b5cb69e0bc8d64b3-mix.wav
rtpengine 28115 root   12w      REG              259,1 18088014 50317350 /recordings/2019_07_15_22/c9741d47-3b5f-4152-a272-f01c1f7e9391-e2efe72a0067b4fc-mix.wav
rtpengine 28115 root   13w      REG              259,1  9437262 50317360 /recordings/2019_07_15_22/b7b92598-d3b4-441c-91db-d85bc99c9fba-2910f679df3904b8-mix.wav
rtpengine 28115 root   16w      REG              259,1  4718670 29361752 /recordings/2019_07_15_21/294da9bf-0e81-4740-9b73-5ab73f0f92ae-d18988d0ef76ce8b-mix.wav
rtpengine 28115 root   17w      REG              259,1 11010126 50290837 /recordings/2019_07_15_21/78588313-360c-47c0-bfc2-fd44595404ba-dd8552bc1038e41f-mix.wav
rtpengine 28115 root   22w      REG              259,1  9699406 50290836 /recordings/2019_07_15_21/70c9a7bb-52e5-43aa-b1de-a5d524e67b33-d2bc03a62f3def3f-mix.wav
rtpengine 28115 root   25w      REG              259,1  8912974 50317402 /recordings/2019_07_15_23/04334bdf-ef1e-4e77-86fa-74c307b97284-8ef5a2dbcd345055-mix.wav
rtpengine 28115 root   26w      REG              259,1  4718670 50317383 /recordings/2019_07_15_22/6a438470-ade9-4d1d-a019-4bf3a15941f4-1ae275927be992dd-mix.wav
rtpengine 28115 root   27w      REG              259,1  4718670 50290834 /recordings/2019_07_15_21/1f6c46c5-3cde-4ae6-9c3e-4d6d01b080ce-a6458de9d1c4a447-mix.wav
rtpengine 28115 root   29w      REG              259,1 17563726 50317391 /recordings/2019_07_15_23/b1075f2d-4fd0-43e9-97b4-8a5ea5eb590d-b009e6186bfc7bf7-mix.wav
rtpengine 28115 root   32w      REG              259,1  6815822 50290828 /recordings/2019_07_15_21/3f5ec2bf-0f43-4358-a95e-27eeec32cabf-5d6fde1bee7cd506-mix.wav
rtpengine 28115 root   33w      REG              259,1  6029390 50317396 /recordings/2019_07_15_23/f082060b-2d55-454e-b5ec-60e95a924d01-6000e64b2e3d7bf4-mix.wav
rtpengine 28115 root   38w      REG              259,1  5505102 50317400 /recordings/2019_07_15_23/09fd0e88-4dc8-4787-8feb-bd8a0ee434f1-ed563f22b8db66d6-mix.wav
rtpengine 28115 root   53w      REG              259,1  6029390 50317376 /recordings/2019_07_15_22/7dfdc865-405d-449f-9b36-ab773f82cbc6-f9e6aa5012ec760d-mix.wav

Looking at all these calls, they seem to have one thing in common, e.g. f082060b-2d55-454e-b5ec-60e95a924d01:

Jul 15 23:11:12 iad-prd-p-c7-rtppxy-097.p28.cloud rtpengine[3147]: INFO: [f082060b-2d55-454e-b5ec-60e95a924d01]: Closing call due to timeout

Not sure if that bears somehow upon the issue.
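
To repeat that log check for every open recording, the Call-ID can be cut out of the file name. The sketch below assumes the naming seen in the lsof output above, i.e. the Call-ID followed by a dash, 16 hexadecimal characters and "-mix.wav"; that pattern is inferred from these examples rather than from the recording daemon's documentation, and callid_from_wav is a hypothetical helper name:

```shell
# Print the Call-ID embedded in a mixed recording's file name.
# Assumes names of the form <call-id>-<16 hex chars>-mix.wav.
# Usage: callid_from_wav <path-to-wav>
callid_from_wav() {
    basename "$1" | sed 's/-[0-9a-f]\{16\}-mix\.wav$//'
}
```

The result can then be passed to grep against the RTPEngine log to look for the "Closing call due to timeout" line.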

Lastly, the core process still holds 14 WAV handles open:

# lsof -p 28115  | grep wav | wc -l
14
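
The same count can be taken without lsof by resolving the symlinks under /proc/<pid>/fd, which also gives the descriptor number next to each path. A minimal sketch; list_open_wavs and its optional proc-root argument are illustrative assumptions:

```shell
# Print "<fd> <path>" for every descriptor of a process that points at a
# .wav file, by reading the symlinks in /proc/<pid>/fd.
# Usage: list_open_wavs <pid> [proc-root]   (proc-root defaults to /proc)
list_open_wavs() {
    lw_pid=$1
    lw_root=${2:-/proc}
    for lw_fd in "$lw_root/$lw_pid/fd"/*; do
        # Skip entries that are not symlinks (or an unmatched glob).
        lw_target=$(readlink "$lw_fd") || continue
        case $lw_target in
            *.wav) echo "${lw_fd##*/} $lw_target" ;;
        esac
    done
}
```

Piping `list_open_wavs 28115` through `wc -l` should agree with the lsof figure above.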

And these are the precise descriptors held open by each LWP/subprocess:

# cd /proc/26801/fd
# ls -la
total 0
dr-x------. 2 root root  0 Jul 16 01:23 .
dr-xr-xr-x. 9 root root  0 Jul 15 23:04 ..
lr-x------. 1 root root 64 Jul 16 01:23 0 -> /dev/null
l-wx------. 1 root root 64 Jul 16 01:23 1 -> /dev/null
l-wx------. 1 root root 64 Jul 16 01:23 11 -> /recordings/2019_07_15_21/182e3fda-dba8-4bf6-b9bf-280dc866d999-b5cb69e0bc8d64b3-mix.wav
l-wx------. 1 root root 64 Jul 16 01:23 12 -> /recordings/2019_07_15_22/c9741d47-3b5f-4152-a272-f01c1f7e9391-e2efe72a0067b4fc-mix.wav
l-wx------. 1 root root 64 Jul 16 01:23 13 -> /recordings/2019_07_15_22/b7b92598-d3b4-441c-91db-d85bc99c9fba-2910f679df3904b8-mix.wav
l-wx------. 1 root root 64 Jul 16 01:23 16 -> /recordings/2019_07_15_21/294da9bf-0e81-4740-9b73-5ab73f0f92ae-d18988d0ef76ce8b-mix.wav
l-wx------. 1 root root 64 Jul 16 01:23 17 -> /recordings/2019_07_15_21/78588313-360c-47c0-bfc2-fd44595404ba-dd8552bc1038e41f-mix.wav
l-wx------. 1 root root 64 Jul 16 01:23 2 -> /dev/null
l-wx------. 1 root root 64 Jul 16 01:23 22 -> /recordings/2019_07_15_21/70c9a7bb-52e5-43aa-b1de-a5d524e67b33-d2bc03a62f3def3f-mix.wav
l-wx------. 1 root root 64 Jul 16 01:23 25 -> /recordings/2019_07_15_23/04334bdf-ef1e-4e77-86fa-74c307b97284-8ef5a2dbcd345055-mix.wav
l-wx------. 1 root root 64 Jul 16 01:23 26 -> /recordings/2019_07_15_22/6a438470-ade9-4d1d-a019-4bf3a15941f4-1ae275927be992dd-mix.wav
l-wx------. 1 root root 64 Jul 16 01:23 27 -> /recordings/2019_07_15_21/1f6c46c5-3cde-4ae6-9c3e-4d6d01b080ce-a6458de9d1c4a447-mix.wav
l-wx------. 1 root root 64 Jul 16 01:23 29 -> /recordings/2019_07_15_23/b1075f2d-4fd0-43e9-97b4-8a5ea5eb590d-b009e6186bfc7bf7-mix.wav
lr-x------. 1 root root 64 Jul 16 01:23 3 -> /dev/urandom
l-wx------. 1 root root 64 Jul 16 01:23 32 -> /recordings/2019_07_15_21/3f5ec2bf-0f43-4358-a95e-27eeec32cabf-5d6fde1bee7cd506-mix.wav
l-wx------. 1 root root 64 Jul 16 01:23 33 -> /recordings/2019_07_15_23/f082060b-2d55-454e-b5ec-60e95a924d01-6000e64b2e3d7bf4-mix.wav
l-wx------. 1 root root 64 Jul 16 01:23 38 -> /recordings/2019_07_15_23/09fd0e88-4dc8-4787-8feb-bd8a0ee434f1-ed563f22b8db66d6-mix.wav
lrwx------. 1 root root 64 Jul 16 01:23 4 -> socket:[1718397]
lrwx------. 1 root root 64 Jul 16 01:23 5 -> anon_inode:[eventpoll]
l-wx------. 1 root root 64 Jul 16 01:23 53 -> /recordings/2019_07_15_22/7dfdc865-405d-449f-9b36-ab773f82cbc6-f9e6aa5012ec760d-mix.wav
lr-x------. 1 root root 64 Jul 16 01:23 6 -> anon_inode:inotify

As another data point from this morning ("Serious Call Volumes" have not started yet):

  • RTPEngine 1: Targets: 14 / Processes: 70
  • RTPEngine 2: Targets: 14 / Processes: 72
  • RTPEngine 3: Targets: 16 / Processes: 46

Needless to say, it's a bit hard to make sense of this, though it does seem to be an improvement over the earlier runaway increase. But after 9 AM calls will spike into much higher territory and then we can say more.

All the "superfluous" processes beyond the initial workers spawned are from libavutil, as before:

(gdb) where
#0  0x00007f9453d15965 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f9454affcbe in thread_worker () from /lib64/libavutil.so.55
#2  0x00007f9453d11dd5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f9452f4d02d in clone () from /lib64/libc.so.6

And the number of WAV file handles held by the recording daemon as a whole has increased -- to 34 on this particular host. As before, a salient characteristic of the Call-IDs of all the calls whose handles are being held open is that they seem to have been timed-out streams:

Jul 16 00:21:21 host INFO: [0d4803a8-9d0b-45f4-bed3-c79b9d3a6336]: Closing call due to timeout
...
Jul 16 00:21:21 host rtpengine-recording[24311]: INFO: [C 0d4803a8-9d0b-45f4-bed3-c79b9d3a6336-61b473e479218803.meta] [S tag-0-media-1-component-2-RTCP-id-3] EOF on stream tag-0-media-1-component-2-RTCP-id-3
Jul 16 00:21:21 host rtpengine-recording[24311]: INFO: [C 0d4803a8-9d0b-45f4-bed3-c79b9d3a6336-61b473e479218803.meta] [S tag-0-media-1-component-1-RTP-id-2] EOF on stream tag-0-media-1-component-1-RTP-id-2
Jul 16 00:21:21 host rtpengine-recording[24311]: INFO: [C 0d4803a8-9d0b-45f4-bed3-c79b9d3a6336-61b473e479218803.meta] [S tag-1-media-1-component-2-RTCP-id-1] EOF on stream tag-1-media-1-component-2-RTCP-id-1
Jul 16 00:21:21 host rtpengine-recording[24311]: INFO: [C 0d4803a8-9d0b-45f4-bed3-c79b9d3a6336-61b473e479218803.meta] [S tag-1-media-1-component-1-RTP-id-0] EOF on stream tag-1-media-1-component-1-RTP-id-0

I cannot help but think that there is some clearer relationship between the number of "stale" file handles opened from "timed out" calls and the number of deadlocked processes, though I cannot find it. There is certainly a correlation; overall, the more such handles, the more processes. But exactly how much more I am unable to establish; it seems to vary, and the process count isn't accounted for by the number of stale handles per se.

Now that we have had production loads all day, I think the verdict is in: the ffmpeg library update didn't really do anything. Here are the LWP counts and the target counts on the three respective RTPEngine instances:

# ps -p `pidof rtpengine-recording` -lfT | wc -l && echo && cat /proc/rtpengine/0/status 
1020

Refcount:    1
Control PID: 28441
Targets:     62

# ps -p `pidof rtpengine-recording` -lfT | wc -l && echo && cat /proc/rtpengine/0/status 
1094

Refcount:    1
Control PID: 19633
Targets:     66

# ps -p `pidof rtpengine-recording` -lfT | wc -l && echo && cat /proc/rtpengine/0/status 
1074

Refcount:    1
Control PID: 3131
Targets:     70

Moreover, the stale WAV file handles have grown commensurately:

# lsof -p `pidof rtpengine-recording` | grep '.wav' | wc -l
538
# lsof -p `pidof rtpengine-recording` | grep '.wav' | wc -l
532
# lsof -p `pidof rtpengine-recording` | grep '.wav' | wc -l
538

Can you tell if it's leaking memory also?

That's hard to say. But with 1000+ LWPs, we can be sure it is using a prodigious amount of memory. :-)

Well yes, I suppose the threads themselves are using up memory too...

My guess is that there's some kind of close/destroy/free/cleanup invocation missing somewhere. Are you able to run this under valgrind? Not recommended for production as performance is horrible, but in a test/lab environment?

I don't think I can do that, no.

What do you make of the fact that the stale WAV handles seem to be tied to streams which disappeared from a timeout?

Can you confirm that for sure? Because the recording daemon doesn't really care about how a call was closed, timeout or otherwise. Once the metadata spool file gets deleted, the call is closed. Assuming the metadata spool files actually do get deleted?

I can confirm that all the file handles that remain held open, as rendered by lsof -p <PID of master process>, correspond to Call-IDs which reflect a timeout in the rtpengine log, yes.

And I can say that there are only 20 entries at the moment in /proc/rtpengine/0/calls, but 1258 LWPs.

I can also say that the chronology of the stuck LWPs and the timestamps of timed out calls line up oddly well:

1 S root     28115  1856     1  0  80   0 - 3105633 futex_ 20:50 ?      00:00:00 /usr/local/sbin/rtpengine-recording --spool-dir=/recordings --output-storage=file --output-dir=/recordings --output-format=wav --output-mixed --pidfile=/var/run/rtpengine-recording.pid
1 S root     28115  1857     1  0  80   0 - 3105633 futex_ 20:50 ?      00:00:00 /usr/local/sbin/rtpengine-recording --spool-dir=/recordings --output-storage=file --output-dir=/recordings --output-format=wav --output-mixed --pidfile=/var/run/rtpengine-recording.pid
1 S root     28115  2025     1  0  80   0 - 3105633 futex_ 20:52 ?      00:00:00 /usr/local/sbin/rtpengine-recording --spool-dir=/recordings --output-storage=file --output-dir=/recordings --output-format=wav --output-mixed --pidfile=/var/run/rtpengine-recording.pid
1 S root     28115  2026     1  0  80   0 - 3105633 futex_ 20:52 ?      00:00:00 /usr/local/sbin/rtpengine-recording --spool-dir=/recordings --output-storage=file --output-dir=/recordings --output-format=wav --output-mixed --pidfile=/var/run/rtpengine-recording.pid
1 S root     28115  2037     1  0  80   0 - 3105633 futex_ 20:52 ?      00:00:00 /usr/local/sbin/rtpengine-recording --spool-dir=/recordings --output-storage=file --output-dir=/recordings --output-format=wav --output-mixed --pidfile=/var/run/rtpengine-recording.pid
1 S root     28115  2038     1  0  80   0 - 3105633 futex_ 20:52 ?      00:00:00 /usr/local/sbin/rtpengine-recording --spool-dir=/recordings --output-storage=file --output-dir=/recordings --output-format=wav --output-mixed --pidfile=/var/run/rtpengine-recording.pid
1 S root     28115  2044     1  0  80   0 - 3105633 futex_ 20:52 ?      00:00:00 /usr/local/sbin/rtpengine-recording --spool-dir=/recordings --output-storage=file --output-dir=/recordings --output-format=wav --output-mixed --pidfile=/var/run/rtpengine-recording.pid
1 S root     28115  2045     1  0  80   0 - 3105633 futex_ 20:52 ?      00:00:00 /usr/local/sbin/rtpengine-recording --spool-dir=/recordings --output-storage=file --output-dir=/recordings --output-format=wav --output-mixed --pidfile=/var/run/rtpengine-recording.pid
1 S root     28115  2094     1  0  80   0 - 3105633 futex_ 20:53 ?      00:00:00 /usr/local/sbin/rtpengine-recording --spool-dir=/recordings --output-storage=file --output-dir=/recordings --output-format=wav --output-mixed --pidfile=/var/run/rtpengine-recording.pid
1 S root     28115  2095     1  0  80   0 - 3105633 futex_ 20:53 ?      00:00:00 /usr/local/sbin/rtpengine-recording --spool-dir=/recordings --output-storage=file --output-dir=/recordings --output-format=wav --output-mixed --pidfile=/var/run/rtpengine-recording.pid
1 S root     28115  2186     1  0  80   0 - 3105633 futex_ 20:54 ?      00:00:00 /usr/local/sbin/rtpengine-recording --spool-dir=/recordings --output-storage=file --output-dir=/recordings --output-format=wav --output-mixed --pidfile=/var/run/rtpengine-recording.pid
1 S root     28115  2187     1  0  80   0 - 3105633 futex_ 20:54 ?      00:00:00 /usr/local/sbin/rtpengine-recording --spool-dir=/recordings --output-storage=file --output-dir=/recordings --output-format=wav --output-mixed --pidfile=/var/run/rtpengine-recording.pid
1 S root     28115  2553     1  0  80   0 - 3105633 futex_ 20:55 ?      00:00:00 /usr/local/sbin/rtpengine-recording --spool-dir=/recordings --output-storage=file --output-dir=/recordings --output-format=wav --output-mixed --pidfile=/var/run/rtpengine-recording.pid
1 S root     28115  2554     1  0  80   0 - 3105633 futex_ 20:55 ?      00:00:00 /usr/local/sbin/rtpengine-recording --spool-dir=/recordings --output-storage=file --output-dir=/recordings --output-format=wav --output-mixed --pidfile=/var/run/rtpengine-recording.pid
1 S root     28115  2877     1  0  80   0 - 3105633 futex_ 20:59 ?      00:00:00 /usr/local/sbin/rtpengine-recording --spool-dir=/recordings --output-storage=file --output-dir=/recordings --output-format=wav --output-mixed --pidfile=/var/run/rtpengine-recording.pid
1 S root     28115  2878     1  0  80   0 - 3105633 futex_ 20:59 ?      00:00:00 /usr/local/sbin/rtpengine-recording --spool-dir=/recordings --output-storage=file --output-dir=/recordings --output-format=wav --output-mixed --pidfile=/var/run/rtpengine-recording.pid
1 S root     28115  4012     1  0  80   0 - 3105633 futex_ 21:05 ?      00:00:00 /usr/local/sbin/rtpengine-recording --spool-dir=/recordings --output-storage=file --output-dir=/recordings --output-format=wav --output-mixed --pidfile=/var/run/rtpengine-recording.pid
1 S root     28115  4013     1  0  80   0 - 3105633 futex_ 21:05 ?      00:00:00 /usr/local/sbin/rtpengine-recording --spool-dir=/recordings --output-storage=file --output-dir=/recordings --output-format=wav --output-mixed --pidfile=/var/run/rtpengine-recording.pid
1 S root     28115  4044     1  0  80   0 - 3105633 futex_ 21:05 ?      00:00:00 /usr/local/sbin/rtpengine-recording --spool-dir=/recordings --output-storage=file --output-dir=/recordings --output-format=wav --output-mixed --pidfile=/var/run/rtpengine-recording.pid
1 S root     28115  4045     1  0  80   0 - 3105633 futex_ 21:05 ?      00:00:00 /usr/local/sbin/rtpengine-recording --spool-dir=/recordings --output-storage=file --output-dir=/recordings --output-format=wav --output-mixed --pidfile=/var/run/rtpengine-recording.pid

And:

Jul 16 21:00:31 host rtpengine[3147]: INFO: [b1f98ee9-a82d-49dd-8295-e3e48da392c9]: Closing call due to timeout
Jul 16 21:00:42 host rtpengine[3147]: INFO: [8f11933f-58e5-4fb6-856f-b3dc68c5da52]: Closing call due to timeout
Jul 16 21:00:56 host rtpengine[3147]: INFO: [ca8c4084-5de7-4bfa-8ca6-71645e3eae33]: Closing call due to timeout
Jul 16 21:01:10 host rtpengine[3147]: INFO: [ff1ccee6-645d-4d61-80ec-58caf7015427]: Closing call due to timeout
Jul 16 21:01:27 host rtpengine[3147]: INFO: [edacdc53-d45d-496a-9b7a-eee12d9904d9]: Closing call due to timeout
Jul 16 21:01:45 host rtpengine[3147]: INFO: [d9c4df76-c18e-43c9-9172-7a71137439e2]: Closing call due to timeout
Jul 16 21:01:57 host rtpengine[3147]: INFO: [03b01d3b-29ad-454e-85be-2b8704e54f16]: Closing call due to timeout
Jul 16 21:01:58 host rtpengine[3147]: INFO: [d53e45a4-af63-4ec6-b105-5e918243c195]: Closing call due to timeout
Jul 16 21:02:05 host rtpengine[3147]: INFO: [1358e926-26c9-4480-94d7-af9195bcdf27]: Closing call due to timeout
Jul 16 21:02:14 host rtpengine[3147]: INFO: [62b8faea-de64-4566-bf99-e7f1d41b71ab]: Closing call due to timeout
Jul 16 21:02:18 host rtpengine[3147]: INFO: [8fd02e81-e45d-4a14-a6f0-5df883e42aaf]: Closing call due to timeout
Jul 16 21:02:19 host rtpengine[3147]: INFO: [b6270741-c1aa-47fd-a304-97af6f115238]: Closing call due to timeout
Jul 16 21:02:34 host rtpengine[3147]: INFO: [e067d252-60b9-4a28-9bd7-684ed9e626ff]: Closing call due to timeout
Jul 16 21:02:34 host rtpengine[3147]: INFO: [a118f87c-48cf-4a62-b012-d23442bf2748]: Closing call due to timeout
Jul 16 21:02:39 host rtpengine[3147]: INFO: [db7fa038-d777-4520-ba64-d0dd07fceb8c]: Closing call due to timeout
Jul 16 21:02:45 host rtpengine[3147]: INFO: [57d4d9db-5509-47ce-a4f7-458f9a04eb73]: Closing call due to timeout
Jul 16 21:03:59 host rtpengine[3147]: INFO: [5d4388e7-2b50-439a-a1fd-6bdd1d483c40]: Closing call due to timeout
Jul 16 21:05:02 host rtpengine[3147]: INFO: [fb8c4c9b-5c9d-4b0c-9e7f-c963da28f19d]: Closing call due to timeout
Jul 16 21:05:12 host rtpengine[3147]: INFO: [52332284-7974-4474-aaa5-2a9f51700a5d]: Closing call due to timeout
Jul 16 21:05:52 host rtpengine[3147]: INFO: [25ac8b14-3aec-4637-943d-fae8b16fb66a]: Closing call due to timeout

What about the reverse though? Did any calls that were not closed from a timeout also result in a stale file/LWP?

I can confirm that every single one of the file handles held open by the recording daemon corresponds to a timed out call:

# lsof -p `pidof rtpengine-recording` | grep -i '.wav' | awk '{print $9}' | awk -F '/' '{print $3"/"$4}' | perl -ne 'chomp; if(/^2019/) { print "$_\n"; }' | awk -F '/' '{print $2}' | perl -ne 'chomp; if(/^(\S{36})/) { print "$1\n"; }' | while read i; do fgrep $i /var/log/messages | fgrep timeout; done | wc -l

The line count there is precisely identical to the one returned by lsof -p `pidof rtpengine-recording` | grep '.wav' | wc -l.
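
The reverse check asked about above (is any WAV handle held open for a call that did not time out?) can be made explicit with a shorter pipeline. This is only a sketch: the function names are made up for illustration, and it assumes that Call-IDs are 36-character lowercase UUIDs and that the log line format is the one quoted in this thread.

```shell
#!/bin/sh
# Hypothetical helpers, not part of rtpengine.

# Read file paths on stdin and print the unique 36-character Call-ID
# prefix of every "<call-id>-<hex>-mix.wav" file name.
wav_call_ids() {
    sed -n 's|.*/\([0-9a-f-]\{36\}\)-[0-9a-f]*-mix\.wav$|\1|p' | sort -u
}

# Read Call-IDs on stdin and print those that have NO
# "Closing call due to timeout" line in the log file given as $1.
ids_without_timeout() {
    log="$1"
    while read -r id; do
        grep -F "[$id]: Closing call due to timeout" "$log" >/dev/null ||
            printf '%s\n' "$id"
    done
}

# Example invocation (paths as used in this thread):
# lsof -p "$(pidof rtpengine-recording)" | awk '/\.wav$/ {print $NF}' \
#     | wav_call_ids | ids_without_timeout /var/log/messages
```

Empty output would mean that every open WAV handle belongs to a call that was logged as timed out. Handles belonging to calls that are still live would be listed as well, so those need to be discounted by hand.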

But the metadata spool files have been deleted regardless?

Correct -- none of the Call-IDs found in /proc/rtpengine/0/calls are among the Call-IDs in the stale file handles.

Oh, no, those are not the metadata spool files. Check the directory you have configured as spool-dir (default /var/spool/rtpengine).

Oh, I see. I put the .meta files in the same directory as the recordings themselves (/recordings) (which get swept into a timestamped directory after 5 mins by a cron job). Anyway, there are only 8 meta files in there at present, and they are all active calls. So yes, they are getting reliably deleted.

Ah. Ok. Don't do that. Use a separate spool directory. Try that for starters.

Okay. But can I ask why? :) This was non-obvious.

Because the recording daemon watches the spool directory for changes using inotify, reads each file upon changes, and if it writes the recordings to the same directory, it gets confused. I'm not sure if that fixes what you're seeing, but it's a first step.
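
Given that constraint, a wrapper script could verify that the two directories differ before launching the daemon. The following is a minimal sketch and not something rtpengine ships: the function name is hypothetical, and it only compares the resolved paths of the two directories, so it would not catch one directory being nested inside the other.

```shell
#!/bin/sh
# Hypothetical pre-start check, not part of rtpengine.

# Return 0 if the two directories resolve to different physical paths,
# 1 if they are the same directory (also via symlinks),
# 2 if either one cannot be entered.
dirs_are_distinct() {
    a=$(cd "$1" 2>/dev/null && pwd -P) || return 2
    b=$(cd "$2" 2>/dev/null && pwd -P) || return 2
    [ "$a" != "$b" ]
}

# Example use in a start script (directory names are placeholders):
# if ! dirs_are_distinct /recordings-spool /recordings; then
#     echo "spool-dir and output-dir must be two different directories" >&2
#     exit 1
# fi
```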

I have changed the spool directory to /recordings-spool and restarted. It's after 17:00 now so calls will be dying down, but there are still some. Let's see what happens. Thank you for the suggestion.

Well, that's odd. Now there are no metadata files being written to the spool directory, even though new calls are coming in, e.g.

Jul 16 21:29:16 host rtpengine[3147]: NOTICE: [e322a179-0197-487a-8038-17617b44147f]: Creating new call
Jul 16 21:29:16 host rtpengine[3147]: NOTICE: [e322a179-0197-487a-8038-17617b44147f]: Turning on call recording.

Did you change the spool directory on rtpengine's side too?

No, I didn't. I just realised that. I can't do that without dropping production calls, which would be a problem. Let me see what I can do to handle that 'gracefully'.

I assume that --recording-dir is the option I should change on the rtpengine side?

  --recording-dir=FILE                                        Directory for storing pcap and metadata files

You can also leave the spool dir unchanged and just change the output dir for the recordings.

I just thought of that too. :-) Stand by...

Okay, I have moved /recordings to /recordings-out and created a new /recordings, and got rtpengine to recognise it after a SIGHUP without restarting, and started the recording daemon with the new spool and output directories. Let's see what happens...

Well, call volumes have collapsed since it's nearly 18:00. I only have 8 targets up right now. So, the pickings are slim for large-scale troubleshooting.

But I have a sense this did not fix the issue -- there are a few new LWPs stuck in the futex_ state on top of the pollers. We started with 10 threads.

# ps -p `pidof rtpengine-recording` -lfT 
F S UID        PID  SPID  PPID  C PRI  NI ADDR SZ WCHAN  STIME TTY          TIME CMD
1 S root     12094 12094     1  0  80   0 - 209082 do_sig 21:38 ?       00:00:00 /usr/local/sbin/rtpengine-recording --spool-dir=/recordings --output-storage=file --output-dir=/recordings-out --output-format=wav --output-mixed --
1 S root     12094 12095     1  0  80   0 - 209082 ep_pol 21:38 ?       00:00:02 /usr/local/sbin/rtpengine-recording --spool-dir=/recordings --output-storage=file --output-dir=/recordings-out --output-format=wav --output-mixed --
1 S root     12094 12096     1  0  80   0 - 209082 ep_pol 21:38 ?       00:00:01 /usr/local/sbin/rtpengine-recording --spool-dir=/recordings --output-storage=file --output-dir=/recordings-out --output-format=wav --output-mixed --
1 S root     12094 12097     1  0  80   0 - 209082 ep_pol 21:38 ?       00:00:01 /usr/local/sbin/rtpengine-recording --spool-dir=/recordings --output-storage=file --output-dir=/recordings-out --output-format=wav --output-mixed --
1 S root     12094 12098     1  0  80   0 - 209082 ep_pol 21:38 ?       00:00:02 /usr/local/sbin/rtpengine-recording --spool-dir=/recordings --output-storage=file --output-dir=/recordings-out --output-format=wav --output-mixed --
1 S root     12094 12099     1  0  80   0 - 209082 ep_pol 21:38 ?       00:00:02 /usr/local/sbin/rtpengine-recording --spool-dir=/recordings --output-storage=file --output-dir=/recordings-out --output-format=wav --output-mixed --
1 S root     12094 12100     1  0  80   0 - 209082 ep_pol 21:38 ?       00:00:01 /usr/local/sbin/rtpengine-recording --spool-dir=/recordings --output-storage=file --output-dir=/recordings-out --output-format=wav --output-mixed --
1 S root     12094 12101     1  0  80   0 - 209082 ep_pol 21:38 ?       00:00:01 /usr/local/sbin/rtpengine-recording --spool-dir=/recordings --output-storage=file --output-dir=/recordings-out --output-format=wav --output-mixed --
1 S root     12094 12102     1  0  80   0 - 209082 ep_pol 21:38 ?       00:00:02 /usr/local/sbin/rtpengine-recording --spool-dir=/recordings --output-storage=file --output-dir=/recordings-out --output-format=wav --output-mixed --
1 S root     12094 14316     1  0  80   0 - 209082 futex_ 21:50 ?       00:00:00 /usr/local/sbin/rtpengine-recording --spool-dir=/recordings --output-storage=file --output-dir=/recordings-out --output-format=wav --output-mixed --
1 S root     12094 14317     1  0  80   0 - 209082 futex_ 21:50 ?       00:00:00 /usr/local/sbin/rtpengine-recording --spool-dir=/recordings --output-storage=file --output-dir=/recordings-out --output-format=wav --output-mixed --
1 S root     12094 15285     1  0  80   0 - 209082 futex_ 21:57 ?       00:00:00 /usr/local/sbin/rtpengine-recording --spool-dir=/recordings --output-storage=file --output-dir=/recordings-out --output-format=wav --output-mixed --
1 S root     12094 15286     1  0  80   0 - 209082 futex_ 21:57 ?       00:00:00 /usr/local/sbin/rtpengine-recording --spool-dir=/recordings --output-storage=file --output-dir=/recordings-out --output-format=wav --output-mixed --
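
A listing like the one above is easier to compare over time when it is condensed into a count per wait channel. A small sketch, assuming the `ps -lfT` column layout shown here, where WCHAN is the 12th field (the function name is made up for illustration):

```shell
#!/bin/sh
# Hypothetical helper: summarise `ps -lfT` output read from stdin as
# "<count> <wchan>" lines, most frequent wait channel first.
count_by_wchan() {
    # Skip the header line (its first field is "F"); WCHAN is field 12.
    awk '$1 != "F" { n[$12]++ } END { for (w in n) print n[w], w }' | sort -rn
}

# Example invocation:
# ps -p "$(pidof rtpengine-recording)" -lfT | count_by_wchan
```

Note that piping `ps -lfT` straight into `wc -l` also counts the header line, so the actual number of LWPs is one less than the figure it prints.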

However, all the file handles held open right now are for live calls, so I'm going to have to see if those handles disappear afterward.

Well, one promising sign ... there was a call for which a file handle was held open before:

rtpengine 12094 root   16w      REG              259,1   786510   52219679 /recordings-out/305120aa-7fae-490d-8add-fefd13d9c95b-daf1e79cc7568895-mix.wav

... which has since closed due to a timeout:

Jul 16 21:59:42 host rtpengine[3147]: INFO: [305120aa-7fae-490d-8add-fefd13d9c95b]: Closing call due to timeout

... and the file handle has disappeared:

# lsof -p `pidof rtpengine-recording` | grep -i wav
rtpengine 12094 root   11w      REG              259,1 10485838   52219676 /recordings-out/d3fc186f-cc8e-41db-9b66-0fc361e4dec5-1188a59a85ff9ddb-mix.wav
rtpengine 12094 root   16w      REG              259,1   786510   52172452 /recordings-out/530cef3a-b717-410e-a2dc-b0d38754a2f5-23cb1755bd932643-mix.wav
rtpengine 12094 root   21w      REG              259,1  2621518   52164511 /recordings-out/4982ca2d-44f2-4103-9259-a8fe2c8357b4-de9d69453e84cb9c-mix.wav
rtpengine 12094 root   26w      REG              259,1   262222   52172465 /recordings-out/99a3aad9-66c4-4c48-820c-5bebe5f8bb8f-c1c6e7205d70c1a2-mix.wav

On the other hand, the LWP count still seems wildly at odds with the total number of streams on the system:

# ps -p `pidof rtpengine-recording` -lfT | wc -l && echo && cat /proc/rtpengine/0/status 
20

Refcount:    1
Control PID: 3131
Targets:     4

It'll occasionally decrease by 2 or so, but overall the trend is to increase and increase. That makes me pessimistic that the directory change fixed the issue.

We're just going to have to wait until tomorrow to get any real results.

Well, some cause for optimism, though I don't want to call it prematurely until we see tomorrow's production call loads.

Nevertheless, since I made the suggested change, we have dropped to zero call load on that RTPEngine and got back down to the default ten threads (absent a --num-threads value it defaults to 10):

# ps -p `pidof rtpengine-recording` -lfT | wc -l && echo && cat /proc/rtpengine/0/status 
10

Refcount:    1
Control PID: 3131
Targets:     0

This is not a result I had seen before.

Hi @rfuchs, your suggestion to separate the spool and recording directories appears to have solved the problem. Thank you very much!

If you don't mind, I'm going to submit a PR with amendments to the README to caution other users against this. Putting the .meta files in the same directory as the actual recording files was probably not a behaviour you anticipated, but the adverse consequences of doing so were neither documented nor obvious to those who don't know how the daemon works. :-)

Sounds good, thanks. I'm thinking of even having a check to refuse startup when something like this is configured.

#810 has been submitted.