google / gvisor

Application Kernel for Containers

Home Page: https://gvisor.dev

syscall tests fail on newer Linux kernels

m-warmer opened this issue · comments

Description

The socket_netlink_route tests seem to be failing on my debian bullseye system.

Another example is //test/syscalls:socket_filesystem_test_native which also returns errors on my machine.

Note: Google Test filter = RenameTest.SysfsFileToSelf
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from RenameTest
[ RUN      ] RenameTest.SysfsFileToSelf
test/syscalls/linux/rename.cc:433: Failure
Value of: rename(path.c_str(), path.c_str())
Expected: not -1 (success)
  Actual: -1 (of type int), with errno PosixError(errno=30 Read-only file system)
[  FAILED  ] RenameTest.SysfsFileToSelf (0 ms)
[----------] 1 test from RenameTest (0 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (0 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] RenameTest.SysfsFileToSelf

Steps to reproduce

Run

bazel test //test/syscalls:socket_netlink_route_test_native

The test fails with a timeout, and the log file contains the following errors:

Note: Google Test filter = NetlinkRouteTest.LookupAll
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from NetlinkRouteTest
[ RUN      ] NetlinkRouteTest.LookupAll
test/syscalls/linux/socket_netlink_route.cc:617: Failure
Expected: (count) > (0), actual: 0 vs 0
[  FAILED  ] NetlinkRouteTest.LookupAll (0 ms)
[----------] 1 test from NetlinkRouteTest (1 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (1 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] NetlinkRouteTest.LookupAll

 1 FAILED TEST
Failed to match any benchmarks against regex: .
--- FAIL: NetlinkRouteTest_LookupAll (0.01s)
    main.go:141: test "NetlinkRouteTest.LookupAll" exited with status 1, want 0
Note: Google Test filter = NetlinkRouteTest.AddAndRemoveAddr
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from NetlinkRouteTest
[ RUN      ] NetlinkRouteTest.AddAndRemoveAddr
[       OK ] NetlinkRouteTest.AddAndRemoveAddr (0 ms)
[----------] 1 test from NetlinkRouteTest (0 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (0 ms total)
[  PASSED  ] 1 test.
Failed to match any benchmarks against regex: .
Note: Google Test filter = NetlinkRouteTest.GetRouteDump
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from NetlinkRouteTest
[ RUN      ] NetlinkRouteTest.GetRouteDump
test/syscalls/linux/socket_netlink_route.cc:729: Failure
Value of: routeFound
  Actual: false
Expected: true
[  FAILED  ] NetlinkRouteTest.GetRouteDump (1 ms)
[----------] 1 test from NetlinkRouteTest (1 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (1 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] NetlinkRouteTest.GetRouteDump

 1 FAILED TEST
Failed to match any benchmarks against regex: .
--- FAIL: NetlinkRouteTest_GetRouteDump (0.01s)
    main.go:141: test "NetlinkRouteTest.GetRouteDump" exited with status 1, want 0
Note: Google Test filter = NetlinkRouteTest.GetRouteRequest
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from NetlinkRouteTest
[ RUN      ] NetlinkRouteTest.GetRouteRequest
test/syscalls/linux/socket_netlink_route.cc:774: Failure
Value of: hdr->nlmsg_type
Expected: is equal to 24
  Actual: 2 (of type unsigned short)
Found route table=36, protocol=0, scope=0, type=0test/syscalls/linux/socket_netlink_route.cc:792: Failure
Expected equality of these values:
  msg->rtm_family
    Which is: '\x9B' (155)
  2
test/syscalls/linux/socket_netlink_route.cc:793: Failure
Expected equality of these values:
  msg->rtm_dst_len
    Which is: '\xFF' (255)
  32
test/syscalls/linux/socket_netlink_route.cc:794: Failure
Value of: (msg->rtm_flags & RTM_F_CLONED) == RTM_F_CLONED
  Actual: false
Expected: true
1001a
, len=28
test/syscalls/linux/socket_netlink_route.cc:815: Failure
Value of: rtDstFound
  Actual: false
Expected: true
[  FAILED  ] NetlinkRouteTest.GetRouteRequest (0 ms)
[----------] 1 test from NetlinkRouteTest (1 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (1 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] NetlinkRouteTest.GetRouteRequest

 1 FAILED TEST
Failed to match any benchmarks against regex: .
--- FAIL: NetlinkRouteTest_GetRouteRequest (0.01s)
    main.go:141: test "NetlinkRouteTest.GetRouteRequest" exited with status 1, want 0
Note: Google Test filter = NetlinkRouteTest.RecvmsgTrunc
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from NetlinkRouteTest
[ RUN      ] NetlinkRouteTest.RecvmsgTrunc
test/syscalls/linux/socket_netlink_route.cc:871: Failure
Expected: (trunclen) >= (sizeof(struct nlmsghdr) + sizeof(struct ifaddrmsg)), actual: 20 vs 24
-- Test timed out at 2021-10-29 20:00:36 UTC --

runsc version

Build from HEAD at 1953d2ad28d405a3ab028feba7b6fca18339e9be with bazel 4.2.1

docker version (if using docker)

Client:
 Version:           20.10.5+dfsg1
 API version:       1.41
 Go version:        go1.15.9
 Git commit:        55c4c88
 Built:             Wed Aug  4 19:55:57 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server:
 Engine:
  Version:          20.10.5+dfsg1
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.15.9
  Git commit:       363e9a8
  Built:            Wed Aug  4 19:55:57 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.5~ds1
  GitCommit:        1.4.5~ds1-2
 runc:
  Version:          1.0.0~rc93+ds1
  GitCommit:        1.0.0~rc93+ds1-5+b2
 docker-init:
  Version:          0.19.0
  GitCommit:

uname

Linux debian 5.10.0-9-amd64 #1 SMP Debian 5.10.70-1 (2021-09-30) x86_64 GNU/Linux

kubectl (if using Kubernetes)

No response

repo state (if built from source)

release-20211019.0-52-g1953d2ad2

runsc debug logs (if available)

No response

Running the entire native syscall test suite with make syscall-native-tests gives me lots of errors.

Executed 216 out of 216 tests: 157 tests pass and 59 fail locally.
There were tests whose specified size is too big. Use the --test_verbose_timeout_warnings command line option to see which ones these are.

With errors like

Note: Google Test filter = AllInetTests/RawPacketTest.Receive/0
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from AllInetTests/RawPacketTest
[ RUN      ] AllInetTests/RawPacketTest.Receive/0
test/syscalls/linux/packet_socket_raw.cc:75: Failure
Value of: sendto(sock, kMessage, sizeof(kMessage), 0, reinterpret_cast<struct sockaddr*>(&dest), sizeof(dest))
Expected: is equal to 20
  Actual: -1 (of type long), with errno PosixError(errno=101 Network is unreachable)
test/syscalls/linux/packet_socket_raw.cc:174: Failure
Value of: RetryEINTR(poll)(&pfd, 1, 2000)
Expected: is equal to 1
  Actual: 0 (of type int)
-- Test timed out at 2021-10-29 21:04:29 UTC --

I tried it again on a faster setup (8 core native instead of a 4 core VM) and most tests pass now. However I still get failures

//test/syscalls:tcp_socket_test_native                                   FAILED in 2 out of 4 in 13.8s
  Stats over 4 runs: max = 13.8s, min = 12.9s, avg = 13.3s, dev = 0.4s
//test/syscalls:socket_stress_test_native                                FAILED in 1 out of 8 in 21.2s
  Stats over 8 runs: max = 21.2s, min = 5.3s, avg = 11.7s, dev = 5.8s

With a log file containing

Note: Google Test filter = AllConnectedSockets/PersistentListenerConnectStressTest.ShutdownCloseFirst/5
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from AllConnectedSockets/PersistentListenerConnectStressTest
[ RUN      ] AllConnectedSockets/PersistentListenerConnectStressTest.ShutdownCloseFirst/5
Testing with setsockopt(1, 2, 1) connected dual stack TCP socket
test/syscalls/linux/socket_generic_stress.cc:132: Failure
Value of: _expr_result
Expected: is OK
  Actual: PosixError(errno=99 Cannot assign requested address) (connect_result = RetryEINTR(connect)(connected, AsSockAddr(&bind_addr), sizeof(bind_addr))) == -1 && errno == EINPROGRESS ? 0 : connect_result
[  FAILED  ] AllConnectedSockets/PersistentListenerConnectStressTest.ShutdownCloseFirst/5, where GetParam() = 80-byte object <A0-7C 2F-93 04-56 00-00 33-00 00-00 00-00 00-00 33-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 0A-00 00-00 01-00 00-00 06-00 00-00 00-00 00-00 E0-7C 2F-93 04-56 00-00 00-00 00-00 01-7F 00-00 D9-29 78-91 04-56 00-00 A1-29 78-91 04-56 00-00> (2911 ms)
[----------] 1 test from AllConnectedSockets/PersistentListenerConnectStressTest (2911 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (2912 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] AllConnectedSockets/PersistentListenerConnectStressTest.ShutdownCloseFirst/5, where GetParam() = 80-byte object <A0-7C 2F-93 04-56 00-00 33-00 00-00 00-00 00-00 33-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 0A-00 00-00 01-00 00-00 06-00 00-00 00-00 00-00 E0-7C 2F-93 04-56 00-00 00-00 00-00 01-7F 00-00 D9-29 78-91 04-56 00-00 A1-29 78-91 04-56 00-00>

 1 FAILED TEST

So my real question is: what are the requirements for running the tests? They don't seem to be documented anywhere.

Hi @m-warmer. Failures in the native tests are usually caused by minor kernel changes that break assumptions made by the tests themselves. We run on several different kernel versions, so when these issues come up, we usually question the test itself and whether what it tests is valid. These failures rarely indicate compatibility issues, which is what we're testing for, but we like to keep the syscall tests valid on as many kernel versions as possible (hence a preference for not checking "IsRunningOnGvisor" or "IsVersionAtLeast").

Are you on GCP? If so, which debian version? Otherwise, I'll fire up a debian VM or use my workstation, which is past that 5.10 version, and see if I can repro when I have time.
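For readers unfamiliar with the kind of check the reply prefers to avoid, here is a minimal sketch of a kernel version gate, using the release string from the uname output in the report. The names parse_release and kernel_at_least are hypothetical and are not gVisor's helpers; the sketch only assumes that the release string starts with dotted numeric components.

```python
import re


def parse_release(release: str) -> tuple:
    """Extract (major, minor, patch) from a kernel release string.

    Hypothetical helper for illustration; a missing patch level is read as 0.
    """
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    major, minor, patch = match.groups()
    return (int(major), int(minor), int(patch or 0))


def kernel_at_least(release: str, major: int, minor: int) -> bool:
    """Return True if the release is at or above major.minor."""
    return parse_release(release)[:2] >= (major, minor)


if __name__ == "__main__":
    # Release string taken from the uname line in the report.
    release = "5.10.0-9-amd64"
    print(parse_release(release))
    print(kernel_at_least(release, 5, 10))
```

Gating an assertion on a check like this makes a test pass on more hosts, but it also hides whatever behavior differs between kernels, which may be part of why the reply prefers to question the test's assumption instead.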

I forgot to add that the second set of tests were run on a native ubuntu 20.04 install instead of a local debian VM. In both cases they were run on my own computers and not on GCP.

It seems a bit much to ask you to install debian 11/bullseye just to see if any tests fail. My guess would be that the 4 core VM with debian had timeouts given the warning at the end of the run. The tests on a native ubuntu install seemed to fare much better, but it still surprised me that two tests failed.

I'm new to gvisor and was setting up a development environment to see if I could add a feature and make a pull request. While setting up my machine I noticed these errors and wondered what the expected development environment is, as I'd rather focus on what I want to change than fight the test suite.

@m-warmer @crappycrypto (same person?) Just to clarify: *_runsc(_ptrace | _kvm) tests run on gVisor. *_native tests are run in native containers, which call into your linux host. If _native tests are failing, then there's a change in the host kernel syscall implementation we haven't considered.

Fixing native tests is always welcome, but I wouldn't worry about this unless syscall tests on gVisor start failing. You'll know that at presubmit, as the code will run on our infra in GCP.

I'll keep this open so I can fix it eventually.

Yep, same person different device. Thanks for the clarification. I'll ignore any failures in the native tests for now, unless I know I touched that area of gvisor.

For me it's okay to close this issue. Figuring out all the subtle differences between kernel versions and configs seems like a huge effort, while most software should not depend on such behaviour.

A friendly reminder that this issue had no activity for 120 days.

This issue has been closed due to lack of activity.