Watchdog failure stenotype abort after 2 minutes (Debian 10)
dspruell-s01 opened this issue · comments
- Debian GNU/Linux 10 (buster)
- Linux 4.19.0-10-amd64
- go version go1.11.6 linux/amd64
Using latest stenographer, can get through install and setup and systemd starts correctly, but seems to die (or be killed).
$ sudo systemctl status stenographer
stenographer.service - packet capture to disk
Loaded: loaded (/lib/systemd/system/stenographer.service; disabled; vendor preset: enabled)
Active: active (running) since Tue 2020-09-01 14:44:46 UTC; 2min 59s ago
Main PID: 10793 (stenographer)
Tasks: 13 (limit: 4915)
Memory: 21.1M
CGroup: /system.slice/stenographer.service
├─10793 /usr/bin/stenographer
└─10829 [stenotype]
Sep 01 14:44:46 astraea systemd[1]: Started packet capture to disk.
Sep 01 14:46:47 astraea stenographer[10793]: 2020-09-01T14:46:47.287721Z T:fffff7 [util.cc:117] WATCHDOG FAILURE: Thread 0 ABORTABORTABORT
Sep 01 14:46:47 astraea stenographer[10793]: /usr/bin/stenotype(+0xb0d6) [0x56553817f0d6]
/usr/bin/stenotype(+0x96e2) [0x56553817d6e2]
/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbbb2f) [0x7f8f8876bb2f]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x7fa3) [0x7f8f88a8efa3]
Sep 01 14:46:47 astraea stenographer[10793]: /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f8f8844b4cf]
Sep 01 14:46:47 astraea stenographer[10793]: 2020/09/01 14:46:47 Stenotype stopped after 2m1.002707097s: stenotype wait failed: signal: aborted
...
Sep 01 15:26:58 astraea stenographer[10793]: 2020-09-01T15:26:58.396742Z T:e0a8f7 [util.cc:117] WATCHDOG FAILURE: Thread 0
ABORTABORTABORT
Sep 01 15:26:58 astraea stenographer[10793]: /usr/bin/stenotype(+0xb0d6) [0x55c80320c0d6]
/usr/bin/stenotype(+0x96e2) [0x55c80320a6e2]
/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbbb2f) [0x7f7e626b1b2f]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x7fa3) [0x7f7e629d4fa3]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f7e623914cf]
Sep 01 15:26:58 astraea stenographer[10793]: 2020/09/01 15:26:58 Stenotype stopped after 2m0.559905173s: stenotype wait failed: signal: aborted
Sep 01 15:28:58 astraea stenographer[10793]: 2020-09-01T15:28:58.937431Z T:9bfff7 [util.cc:117] WATCHDOG FAILURE: Thread 0
ABORTABORTABORT
Sep 01 15:28:58 astraea stenographer[10793]: /usr/bin/stenotype(+0xb0d6) [0x55742f5110d6]
/usr/bin/stenotype(+0x96e2) [0x55742f50f6e2]
/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbbb2f) [0x7fca2234fb2f]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x7fa3) [0x7fca22672fa3]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7fca2202f4cf]
Sep 01 15:28:59 astraea stenographer[10793]: 2020/09/01 15:28:59 Stenotype stopped after 2m0.551980822s: stenotype wait failed: signal: aborted
Unsure if related, but over course of execution the kernel log contains many of these:
[ 1864.826642] audit: type=1326 audit(1598971487.263:8): auid=4294967295 uid=106 gid=112 ses=4294967295 subj==unconfined pid=10800 comm="stenotype" exe="/usr/bin/stenotype" sig=31 arch=c000003e syscall=257 compat=0 ip=0x7f8f88a98d0e code=0x0
...
[ 4276.756724] audit: type=1326 audit(1598973899.366:49): auid=4294967295 uid=106 gid=112 ses=4294967295 subj==unconfined pid=11116 comm="stenotype" exe="/usr/bin/stenotype" sig=31 arch=c000003e syscall=231 compat=0 ip=0x7f7e6235e9d6 code=0x0
Did you install from the Debian buster package (if so, what version?) or did you build stenographer/stenotype yourself?
Did you install from the Debian buster package (if so, what version?) or did you build stenographer/stenotype yourself?
Have tried both now. (Didn't realize there was a package in repo, thanks. The top of INSTALL.md leads one to think this is still a future thing). The original post was building myself. I've since retried the setup from the buster package and the result appears the same. Package versions:
ii stenographer 0.0~git20180422.0.73ce5dd-1+b2 amd64 full-packet-capture utility -- server
ii stenographer-client 0.0~git20180422.0.73ce5dd-1 all full-packet-capture utility -- clients
Logging:
Sep 02 19:48:04 astraea systemd[1]: Started packet capture to disk.
Sep 02 19:50:05 astraea stenographer[1786]: 2020-09-02T19:50:05.237657Z T:b55787 [util.cc:117] WATCHDOG FAILURE: Thread 0
ABORTABORTABORT
Sep 02 19:50:05 astraea stenographer[1786]: /usr/sbin/stenotype(+0xb136) [0x558aa0cb3136]
/usr/sbin/stenotype(+0x9756) [0x558aa0cb1756]
/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbbb2f) [0x7f533719ab2f]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x7fa3) [0x7f53374bdfa3]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f5336e7a4cf]
Sep 02 19:50:05 astraea stenographer[1786]: 2020/09/02 19:50:05 Stenotype stopped after 2m0.525674512s: stenotype wait failed: signal: aborted
Sep 02 19:52:05 astraea stenographer[1786]: 2020-09-02T19:52:05.772114Z T:f7fff7 [util.cc:117] WATCHDOG FAILURE: Thread 0
ABORTABORTABORT
Sep 02 19:52:05 astraea stenographer[1786]: /usr/sbin/stenotype(+0xb136) [0x55e69b758136]
/usr/sbin/stenotype(+0x9756) [0x55e69b756756]
/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbbb2f) [0x7faf7e839b2f]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x7fa3) [0x7faf7eb5cfa3]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7faf7e5194cf]
Sep 02 19:52:05 astraea stenographer[1786]: 2020/09/02 19:52:05 Stenotype stopped after 2m0.539895563s: stenotype wait failed: signal: aborted
Sep 02 19:54:06 astraea stenographer[1786]: 2020-09-02T19:54:06.293815Z T:3e9067 [util.cc:117] WATCHDOG FAILURE: Thread 0
ABORTABORTABORT
Sep 02 19:54:06 astraea stenographer[1786]: /usr/sbin/stenotype(+0xb136) [0x564d672f8136]
/usr/sbin/stenotype(+0x9756) [0x564d672f6756]
/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbbb2f) [0x7f62c0528b2f]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x7fa3) [0x7f62c084bfa3]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f62c02084cf]
Sep 02 19:54:06 astraea stenographer[1786]: 2020/09/02 19:54:06 Stenotype stopped after 2m0.511912014s: stenotype wait failed: signal: aborted
System is currently test environment under VMware ESXi. 8x Intel(R) Xeon(R) Platinum 8160 CPU @ 2.10GHz, 64 GB mem.
# /etc/stenographer/config
{
"Threads": [
{ "PacketsDirectory": "/var/lib/stenographer/thread0/packets"
, "IndexDirectory": "/var/lib/stenographer/thread0/index"
, "MaxDirectoryFiles": 30000
, "DiskFreePercentage": 10
}
]
, "StenotypePath": "/usr/sbin/stenotype"
, "Interface": "ens192"
, "Port": 1234
, "Host": "127.0.0.1"
, "Flags": []
, "CertPath": "/etc/stenographer/certs"
}
Capture interface:
ens192: <BROADCAST,MULTICAST,NOARP,ALLMULTI,PROMISC,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether 00:0c:29:6f:f9:b5 brd ff:ff:ff:ff:ff:ff
RX: bytes packets errors dropped overrun mcast
19288537 39509 0 0 0 103
TX: bytes packets errors dropped carrier collsns
1786 23 0 0 0 0
driver: vmxnet3
version: 1.4.16.0-k-NAPI
firmware-version:
expansion-rom-version:
bus-info: 0000:0b:00.0
supports-statistics: yes
supports-test: no
supports-eeprom-access: no
supports-register-dump: yes
supports-priv-flags: no
Have you tried adding "--seccomp=none"
to the "Flags"
section in the configuration file?
Also, does the interface receive traffic? Just asking because of #94
Have you tried adding
"--seccomp=none"
to the"Flags"
section in the configuration file?
I had not tried that, and doing so seemingly resolves this. After restarting with the Flag set ("Flags": ["--seccomp=none"]
), I see the process remain running and data files stored.
$ sudo ls -lRA /var/lib/stenographer/
/var/lib/stenographer/:
total 4
drwx------ 4 stenographer stenographer 4096 Sep 2 19:39 thread0
/var/lib/stenographer/thread0:
total 8
drwx------ 2 stenographer stenographer 4096 Sep 8 18:45 index
drwx------ 2 stenographer stenographer 4096 Sep 8 18:45 packets
/var/lib/stenographer/thread0/index:
total 788
-rw------- 1 stenographer stenographer 53558 Sep 8 18:32 1599589913628559
-rw------- 1 stenographer stenographer 52220 Sep 8 18:33 1599589974776521
...
/var/lib/stenographer/thread0/packets:
total 84992
-rw------- 1 stenographer stenographer 5242880 Sep 8 18:32 1599589913628559
-rw------- 1 stenographer stenographer 6291456 Sep 8 18:33 1599589974776521
...
Also, does the interface receive traffic? Just asking because of #94
Probably moot now, but yes - a small trickle from a lab in this case.
Would it be possible to document more information/guidance about the seccomp flag and its impact in Stenographer relating to cases like this?
If the switch to --seccomp=none
resolved this issue, that almost definitely means that a non-whitelisted Linux syscall is being called as part of the startup of Stenotype. If you'd be interested in finding a complete fix for this, we'd need to do so via an strace
of the stenotype binary, to determine exactly which new syscall is being called.
Short description of what's going on:
Seccomp allows a process to say "I'm only going to use the following N syscalls", to define the complete set of behaviors which the Linux kernel should expect and allow from the userspace process. Stenographer (specifically Stenotype) uses this to define boundaries it must abide by in case of compromise due to a maliciously crafted packet. However, unfortunately over time this set of syscalls tends to drift, via changes to glibc, the linux kernel, etc. The list at https://github.com/google/stenographer/blob/master/stenotype/stenotype.cc lines 294-399 are the currently blessed set.
If we can, running stenotype
with --seccomp=trace
should result in an ENOSYS error which should be findable in strace
output. A simple approach to this is to:
- run as you did when things broke, and when
stenotype
stalls, runps auxww | grep stenotype
to get the command line arguments it's running with - add
--seccomp=trace
to that, and run with:sudo strace -f -o /tmp/strace_out stenotype <arguments>
- attach the resulting
/tmp/strace_out
file to this bug
Or, you can look for the ENOSYS error in /tmp/strace_out and try to trace it back to the specific syscall being made.
@gconnell Thanks for the input! I will capture the trace data.
Is that by chance the same thing that the kernel is picking up in the audit logs here?
Sep 2 22:12:46 astraea kernel: [22616.709325] audit: type=1326 audit(1599084766.102:153): auid=4294967295 uid=106 gid=112 ses=4294967295 subj==unconfined pid=2410 comm="stenotype" exe="/usr/sbin/stenotype" sig=31 arch=c000003e syscall=231 compat=0 ip=0x7f88753bc9d6 code=0x0
It repeatedly logged these syscall numbers:
syscall=231
syscall=257
This seems to have been the result:
$ sudo strace -f -o /tmp/strace_out /usr/sbin/stenotype --threads=1 --iface=ens192 --dir=/tmp/stenographer447647775 --seccomp=trace
2020-09-08T21:10:35.663205Z T:2245d7 [stenotype.cc:462] CHECK(SUCCEEDED(__check_success_error__)) output.Rotate(file_dirname, micros, flag_preallocate_file_mb << 20): Function not implemented <- open
ABORTABORTABORT
/usr/sbin/stenotype(+0xb136) [0x562b15d20136]
/usr/sbin/stenotype(+0x265af) [0x562b15d3b5af]
/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbbb2f) [0x7f7ea30d9b2f]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x7fa3) [0x7f7ea33fcfa3]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f7ea2db94cf]
Segmentation fault
$ echo $?
139
Seems to be these, but attaching full output.
1245 openat(AT_FDCWD, "/tmp/stenographer447647775/PKT0/.1599599435662987", O_WRONLY|O_CREAT|O_DSYNC|O_DIRECT, 0600) = -1 ENOSYS (Function not implemented)
1245 openat(AT_FDCWD, "/etc/localtime", O_RDONLY|O_CLOEXEC) = -1 ENOSYS (Function not implemented)
Ah, perfect. Given that, I expect this is actually the case of Debian being behind head. IIUC, 9c5b80d#diff-0ec2ed7e09c9fb335892ac49999a491e should be the fix necessary to support this.
I don't personally handle rebuilding on Debian, and I'm actually not sure who/what does. But with the idea that maybe tagging head might cause a release to go out, I'm going to throw a new version tag up and see what happens.
I'm the maintainer of stenographer in Debian, so thanks for the hint! Debian unstable (and also buster-backports) has a stenographer version from May 29, so an update will be in order. Will get it into unstable ASAP, and it will be uploading a backport for buster soon after it hits testing. Unfortunately there can be no updates to buster stable.
v1.0.1 should be a good tag to build against. Thanks for Debian-ing!
Likewise, thanks guys! And I got to learn a few things as well.
Oh, I didn't realize there were tagged releases now! Even better, will switch to tracking those -- that would also give us notifications about new upstream releases.
I am the worst release tagger known to man, which pretty much means that I tag a release whenever someone opens an Issue to ask me to do so. That said, I'll try to make this a thing that happens ;)
Good to know! :)
In unstable now, will migrate to testing in ~2 days if all goes well: https://tracker.debian.org/news/1174872/accepted-stenographer-101-1-source-into-unstable/
1.0.1 is now in testing and buster-backports.