syscall tests fail on newer Linux kernels
m-warmer opened this issue
Description
The socket_netlink_route tests seem to be failing on my Debian bullseye system.
Another example is //test/syscalls:socket_filesystem_test_native,
which also returns errors on my machine:
Note: Google Test filter = RenameTest.SysfsFileToSelf
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from RenameTest
[ RUN ] RenameTest.SysfsFileToSelf
test/syscalls/linux/rename.cc:433: Failure
Value of: rename(path.c_str(), path.c_str())
Expected: not -1 (success)
Actual: -1 (of type int), with errno PosixError(errno=30 Read-only file system)
[ FAILED ] RenameTest.SysfsFileToSelf (0 ms)
[----------] 1 test from RenameTest (0 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (0 ms total)
[ PASSED ] 0 tests.
[ FAILED ] 1 test, listed below:
[ FAILED ] RenameTest.SysfsFileToSelf
Steps to reproduce
Run
bazel test //test/syscalls:socket_netlink_route_test_native
The test fails with a timeout and the log file contains the following errors:
Note: Google Test filter = NetlinkRouteTest.LookupAll
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from NetlinkRouteTest
[ RUN ] NetlinkRouteTest.LookupAll
test/syscalls/linux/socket_netlink_route.cc:617: Failure
Expected: (count) > (0), actual: 0 vs 0
[ FAILED ] NetlinkRouteTest.LookupAll (0 ms)
[----------] 1 test from NetlinkRouteTest (1 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (1 ms total)
[ PASSED ] 0 tests.
[ FAILED ] 1 test, listed below:
[ FAILED ] NetlinkRouteTest.LookupAll
1 FAILED TEST
Failed to match any benchmarks against regex: .
--- FAIL: NetlinkRouteTest_LookupAll (0.01s)
main.go:141: test "NetlinkRouteTest.LookupAll" exited with status 1, want 0
Note: Google Test filter = NetlinkRouteTest.AddAndRemoveAddr
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from NetlinkRouteTest
[ RUN ] NetlinkRouteTest.AddAndRemoveAddr
[ OK ] NetlinkRouteTest.AddAndRemoveAddr (0 ms)
[----------] 1 test from NetlinkRouteTest (0 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (0 ms total)
[ PASSED ] 1 test.
Failed to match any benchmarks against regex: .
Note: Google Test filter = NetlinkRouteTest.GetRouteDump
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from NetlinkRouteTest
[ RUN ] NetlinkRouteTest.GetRouteDump
test/syscalls/linux/socket_netlink_route.cc:729: Failure
Value of: routeFound
Actual: false
Expected: true
[ FAILED ] NetlinkRouteTest.GetRouteDump (1 ms)
[----------] 1 test from NetlinkRouteTest (1 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (1 ms total)
[ PASSED ] 0 tests.
[ FAILED ] 1 test, listed below:
[ FAILED ] NetlinkRouteTest.GetRouteDump
1 FAILED TEST
Failed to match any benchmarks against regex: .
--- FAIL: NetlinkRouteTest_GetRouteDump (0.01s)
main.go:141: test "NetlinkRouteTest.GetRouteDump" exited with status 1, want 0
Note: Google Test filter = NetlinkRouteTest.GetRouteRequest
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from NetlinkRouteTest
[ RUN ] NetlinkRouteTest.GetRouteRequest
test/syscalls/linux/socket_netlink_route.cc:774: Failure
Value of: hdr->nlmsg_type
Expected: is equal to 24
Actual: 2 (of type unsigned short)
Found route table=36, protocol=0, scope=0, type=0test/syscalls/linux/socket_netlink_route.cc:792: Failure
Expected equality of these values:
msg->rtm_family
Which is: '\x9B' (155)
2
test/syscalls/linux/socket_netlink_route.cc:793: Failure
Expected equality of these values:
msg->rtm_dst_len
Which is: '\xFF' (255)
32
test/syscalls/linux/socket_netlink_route.cc:794: Failure
Value of: (msg->rtm_flags & RTM_F_CLONED) == RTM_F_CLONED
Actual: false
Expected: true
1001a
, len=28
test/syscalls/linux/socket_netlink_route.cc:815: Failure
Value of: rtDstFound
Actual: false
Expected: true
[ FAILED ] NetlinkRouteTest.GetRouteRequest (0 ms)
[----------] 1 test from NetlinkRouteTest (1 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (1 ms total)
[ PASSED ] 0 tests.
[ FAILED ] 1 test, listed below:
[ FAILED ] NetlinkRouteTest.GetRouteRequest
1 FAILED TEST
Failed to match any benchmarks against regex: .
--- FAIL: NetlinkRouteTest_GetRouteRequest (0.01s)
main.go:141: test "NetlinkRouteTest.GetRouteRequest" exited with status 1, want 0
Note: Google Test filter = NetlinkRouteTest.RecvmsgTrunc
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from NetlinkRouteTest
[ RUN ] NetlinkRouteTest.RecvmsgTrunc
test/syscalls/linux/socket_netlink_route.cc:871: Failure
Expected: (trunclen) >= (sizeof(struct nlmsghdr) + sizeof(struct ifaddrmsg)), actual: 20 vs 24
-- Test timed out at 2021-10-29 20:00:36 UTC --
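For anyone reading the GetRouteRequest and RecvmsgTrunc output above: in the Linux UAPI headers, nlmsg_type 24 is RTM_NEWROUTE and 2 is NLMSG_ERROR, so the reply looks like an error message rather than a route. The 24 in the RecvmsgTrunc check is the 16-byte nlmsghdr plus the 8-byte ifaddrmsg. Below is a minimal Python sketch for decoding a netlink header by hand. The helper name describe_nlmsg and the sample bytes are made up for illustration and are not taken from the gVisor tests.

```python
import struct

# struct nlmsghdr (<linux/netlink.h>): u32 len, u16 type, u16 flags, u32 seq, u32 pid.
NLMSGHDR = struct.Struct("=IHHII")
# struct ifaddrmsg (<linux/if_addr.h>): four u8 fields followed by a u32 index.
IFADDRMSG = struct.Struct("=BBBBI")

# A few message type values from the Linux UAPI headers.
NLMSG_TYPES = {
    2: "NLMSG_ERROR",
    3: "NLMSG_DONE",
    20: "RTM_NEWADDR",
    24: "RTM_NEWROUTE",
}


def describe_nlmsg(buf: bytes) -> str:
    """Return a one-line description of the first netlink header in buf."""
    length, msg_type, flags, seq, pid = NLMSGHDR.unpack_from(buf)
    name = NLMSG_TYPES.get(msg_type, f"type {msg_type}")
    return f"{name} len={length} flags={flags:#x} seq={seq} pid={pid}"


# Hypothetical reply header: NLMSG_ERROR where RTM_NEWROUTE was expected.
# 36 = 16-byte nlmsghdr + 20-byte nlmsgerr (int error + the original nlmsghdr).
sample = NLMSGHDR.pack(36, 2, 0, 1, 0)
print(describe_nlmsg(sample))

# Size that the RecvmsgTrunc assertion compares against.
print(NLMSGHDR.size + IFADDRMSG.size)
```

Dumping the raw reply buffer through something like this would show whether the kernel is rejecting the request outright, which is a different problem from the route simply being absent.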
runsc version
Build from HEAD at 1953d2ad28d405a3ab028feba7b6fca18339e9be with bazel 4.2.1
docker version (if using docker)
Client:
Version: 20.10.5+dfsg1
API version: 1.41
Go version: go1.15.9
Git commit: 55c4c88
Built: Wed Aug 4 19:55:57 2021
OS/Arch: linux/amd64
Context: default
Experimental: true
Server:
Engine:
Version: 20.10.5+dfsg1
API version: 1.41 (minimum version 1.12)
Go version: go1.15.9
Git commit: 363e9a8
Built: Wed Aug 4 19:55:57 2021
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.4.5~ds1
GitCommit: 1.4.5~ds1-2
runc:
Version: 1.0.0~rc93+ds1
GitCommit: 1.0.0~rc93+ds1-5+b2
docker-init:
Version: 0.19.0
GitCommit:
uname
Linux debian 5.10.0-9-amd64 #1 SMP Debian 5.10.70-1 (2021-09-30) x86_64 GNU/Linux
kubectl (if using Kubernetes)
No response
repo state (if built from source)
release-20211019.0-52-g1953d2ad2
runsc debug logs (if available)
No response
Running the entire syscall native test suite with make syscall-native-tests
gives me lots of errors.
Executed 216 out of 216 tests: 157 tests pass and 59 fail locally.
There were tests whose specified size is too big. Use the --test_verbose_timeout_warnings command line option to see which ones these are.
With errors like
Note: Google Test filter = AllInetTests/RawPacketTest.Receive/0
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from AllInetTests/RawPacketTest
[ RUN ] AllInetTests/RawPacketTest.Receive/0
test/syscalls/linux/packet_socket_raw.cc:75: Failure
Value of: sendto(sock, kMessage, sizeof(kMessage), 0, reinterpret_cast<struct sockaddr*>(&dest), sizeof(dest))
Expected: is equal to 20
Actual: -1 (of type long), with errno PosixError(errno=101 Network is unreachable)
test/syscalls/linux/packet_socket_raw.cc:174: Failure
Value of: RetryEINTR(poll)(&pfd, 1, 2000)
Expected: is equal to 1
Actual: 0 (of type int)
-- Test timed out at 2021-10-29 21:04:29 UTC --
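With 59 failing targets, it can help to reduce each log to the list of failed test names before looking at individual assertions. A rough Python sketch follows, assuming the logs keep the usual gtest "[ FAILED ] Suite.Test" result lines shown above. The function name failed_tests is hypothetical.

```python
import re

# Matches gtest result lines such as "[  FAILED  ] Suite.Test (0 ms)".
# Spacing inside the brackets is allowed to vary; summary lines such as
# "[ FAILED ] 1 test, listed below:" do not match because they lack "Suite.Test".
FAILED_LINE = re.compile(r"^\[\s*FAILED\s*\]\s+([\w/]+\.[\w/]+)")


def failed_tests(log_text: str) -> list:
    """Return the unique failed test names in order of first appearance."""
    seen = {}
    for line in log_text.splitlines():
        match = FAILED_LINE.match(line.strip())
        if match:
            seen.setdefault(match.group(1), None)
    return list(seen)


sample = """\
[ RUN ] NetlinkRouteTest.LookupAll
[ FAILED ] NetlinkRouteTest.LookupAll (0 ms)
[ PASSED ] 0 tests.
[ FAILED ] 1 test, listed below:
[ FAILED ] NetlinkRouteTest.LookupAll
[ FAILED ] AllInetTests/RawPacketTest.Receive/0 (2001 ms)
"""
print(failed_tests(sample))
```

Tests that time out before gtest prints a result line, like the RawPacketTest run above, will not show up in this list, so the timeout markers still need to be checked separately.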
I tried it again on a faster setup (8-core native instead of a 4-core VM) and most tests pass now. However, I still get failures:
//test/syscalls:tcp_socket_test_native FAILED in 2 out of 4 in 13.8s
Stats over 4 runs: max = 13.8s, min = 12.9s, avg = 13.3s, dev = 0.4s
//test/syscalls:socket_stress_test_native FAILED in 1 out of 8 in 21.2s
Stats over 8 runs: max = 21.2s, min = 5.3s, avg = 11.7s, dev = 5.8s
With a log file containing
Note: Google Test filter = AllConnectedSockets/PersistentListenerConnectStressTest.ShutdownCloseFirst/5
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from AllConnectedSockets/PersistentListenerConnectStressTest
[ RUN ] AllConnectedSockets/PersistentListenerConnectStressTest.ShutdownCloseFirst/5
Testing with setsockopt(1, 2, 1) connected dual stack TCP socket
test/syscalls/linux/socket_generic_stress.cc:132: Failure
Value of: _expr_result
Expected: is OK
Actual: PosixError(errno=99 Cannot assign requested address) (connect_result = RetryEINTR(connect)(connected, AsSockAddr(&bind_addr), sizeof(bind_addr))) == -1 && errno == EINPROGRESS ? 0 : connect_result
[ FAILED ] AllConnectedSockets/PersistentListenerConnectStressTest.ShutdownCloseFirst/5, where GetParam() = 80-byte object <A0-7C 2F-93 04-56 00-00 33-00 00-00 00-00 00-00 33-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 0A-00 00-00 01-00 00-00 06-00 00-00 00-00 00-00 E0-7C 2F-93 04-56 00-00 00-00 00-00 01-7F 00-00 D9-29 78-91 04-56 00-00 A1-29 78-91 04-56 00-00> (2911 ms)
[----------] 1 test from AllConnectedSockets/PersistentListenerConnectStressTest (2911 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (2912 ms total)
[ PASSED ] 0 tests.
[ FAILED ] 1 test, listed below:
[ FAILED ] AllConnectedSockets/PersistentListenerConnectStressTest.ShutdownCloseFirst/5, where GetParam() = 80-byte object <A0-7C 2F-93 04-56 00-00 33-00 00-00 00-00 00-00 33-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 0A-00 00-00 01-00 00-00 06-00 00-00 00-00 00-00 E0-7C 2F-93 04-56 00-00 00-00 00-00 01-7F 00-00 D9-29 78-91 04-56 00-00 A1-29 78-91 04-56 00-00>
1 FAILED TEST
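The logs above report raw errno values (30, 101 and 99). A small Python sketch for turning those into symbolic names, assuming it is run on Linux, where these numbers correspond to EROFS, ENETUNREACH and EADDRNOTAVAIL. The helper name describe_errno is made up for illustration.

```python
import errno
import os


def describe_errno(code: int) -> str:
    """Format an errno value as "<number> <symbol>: <message>"."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{code} {name}: {os.strerror(code)}"


# errno values taken from the failing assertions in the logs above.
# The numeric values are platform specific; this assumes Linux numbering.
for code in (30, 101, 99):
    print(describe_errno(code))
```

EADDRNOTAVAIL on connect() in a stress test is often a sign of running out of ephemeral ports, but that is only a guess here and would need to be confirmed on the affected machine.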
So my real question is: what are the requirements for running the tests? It doesn't seem to be documented anywhere.
Hi @m-warmer. Usually these test failures on native are due to minor kernel changes that break assumptions made by the tests themselves. We run on several different kernel versions, so when these issues come up, we usually question the test itself and whether what it tests is valid. These failures rarely indicate compatibility issues, which is what we're testing for, but we like to keep the syscall tests valid on as many kernel versions as possible (hence a preference for not checking "IsRunningOnGvisor" or "IsVersionAtLeast").
Are you on GCP? If so, which debian version? Otherwise, I'll fire up a debian VM or use my workstation, which is past that 5.10 version, and see if I can repro when I have time.
I forgot to add that the second set of tests were run on a native ubuntu 20.04 install instead of a local debian VM. In both cases they were run on my own computers and not on GCP.
It seems a bit much to ask you to install debian 11/bullseye just to see if any tests fail. My guess would be that the 4 core VM with debian had timeouts given the warning at the end of the run. The tests on a native ubuntu install seemed to fare much better, but it still surprised me that two tests failed.
I'm new to gVisor and was setting up a development environment to see if I could add a feature and make a pull request. When setting up my machine I noticed these errors and was wondering what the expected development environment is, as I'd rather focus on what I want to change than fight the test suite.
@m-warmer @crappycrypto (same person?) Just to clarify: *_runsc(_ptrace | _kvm) tests run on gVisor. *_native tests are run in native containers, which call into your linux host. If _native tests are failing, then there's a change in the host kernel syscall implementation we haven't considered.
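As a quick illustration of the naming convention just described, here is a small Python sketch that sorts Bazel test targets by the kernel they exercise. It relies only on the suffix rule stated above; the function name classify_target and the _runsc_ptrace example target are hypothetical.

```python
import re

# Suffix rule from the comment above: *_runsc, *_runsc_ptrace and *_runsc_kvm
# targets run on gVisor, while *_native targets call into the host kernel.
GVISOR_SUFFIX = re.compile(r"_runsc(_ptrace|_kvm)?$")


def classify_target(target: str) -> str:
    """Return "gVisor", "host kernel" or "unknown" for a Bazel test target."""
    name = target.rsplit(":", 1)[-1]
    if name.endswith("_native"):
        return "host kernel"
    if GVISOR_SUFFIX.search(name):
        return "gVisor"
    return "unknown"


for target in (
    "//test/syscalls:socket_netlink_route_test_native",
    "//test/syscalls:socket_netlink_route_test_runsc_ptrace",  # hypothetical name
):
    print(target, "->", classify_target(target))
```

Filtering a failure list this way separates host-kernel drift from failures that point at gVisor itself.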
Fixing native tests is always welcome, but I wouldn't worry about this unless syscall test failures on gVisor start happening. And you'll know that on presubmit, as the code will run on our infra in GCP.
I'll keep this open so I can fix it eventually.
Yep, same person different device. Thanks for the clarification. I'll ignore any failures in the native tests for now, unless I know I touched that area of gvisor.
For me it's okay to close this issue. Figuring out all the subtle differences between kernel versions and configs seems like a huge effort, while most software should not depend on such behaviour.
A friendly reminder that this issue had no activity for 120 days.
This issue has been closed due to lack of activity.