shadow / shadow

Shadow is a discrete-event network simulator that directly executes real application code, enabling you to simulate distributed systems with thousands of network-connected processes in realistic and scalable private network experiments using your laptop, desktop, or server running Linux.

Home Page: https://shadow.github.io

Partial read triggers an event in Shadow, but not Linux

ppopth opened this issue

Describe the issue
When a file is registered with epoll using the EPOLLET flag and is then partially read (some data is left in the buffer), epoll_wait returns a different result on real Linux than it does in Shadow. This is a different issue from #2673.

To Reproduce
Add the following test to src/test/epoll/test_epoll.rs and run the test.

diff --git a/src/test/epoll/test_epoll.rs b/src/test/epoll/test_epoll.rs
index df0e67db2..f6b9849ce 100644
--- a/src/test/epoll/test_epoll.rs
+++ b/src/test/epoll/test_epoll.rs
@@ -4,6 +4,7 @@ use nix::errno::Errno;
 use nix::sys::epoll::{self, EpollFlags};
 use nix::unistd;
 
+use test_utils::socket_utils::{socket_init_helper, SocketInitMethod};
 use test_utils::{ensure_ord, set, ShadowTest, TestEnvironment};
 
 #[derive(Debug)]
@@ -329,6 +330,59 @@ fn test_ctl_invalid_op() -> anyhow::Result<()> {
     })
 }
 
+fn test_write_then_read() -> anyhow::Result<()> {
+    let (fd_client, fd_server) = socket_init_helper(
+        SocketInitMethod::Inet,
+        libc::SOCK_STREAM,
+        libc::SOCK_NONBLOCK,
+        /* bind_client = */ false,
+    );
+    let epollfd = epoll::epoll_create()?;
+
+    test_utils::run_and_close_fds(&[epollfd, fd_client, fd_server], || {
+        let mut event = epoll::EpollEvent::new(EpollFlags::EPOLLET | EpollFlags::EPOLLIN, 0);
+        epoll::epoll_ctl(
+            epollfd,
+            epoll::EpollOp::EpollCtlAdd,
+            fd_server,
+            Some(&mut event),
+        )?;
+
+        let timeout = Duration::from_millis(100);
+
+        let thread = std::thread::spawn(move || {
+            vec![
+                do_epoll_wait(epollfd, timeout, /* do_read= */ false),
+                // The second one is supposed to timeout.
+                do_epoll_wait(epollfd, timeout, /* do_read= */ false),
+            ]
+        });
+
+        // Wait for readers to block.
+        std::thread::sleep(timeout / 3);
+
+        // Make the read-end readable.
+        unistd::write(fd_client, &[0, 0])?;
+
+        // Wait and read some, but not all, from the buffer.
+        std::thread::sleep(timeout / 3);
+        unistd::read(fd_server, &mut [0])?;
+
+        let results = thread.join().unwrap();
+
+        // The first wait should have received the event
+        ensure_ord!(results[0].epoll_res, ==, Ok(1));
+        ensure_ord!(results[0].duration, <, timeout);
+        ensure_ord!(results[0].events[0], ==, epoll::EpollEvent::new(EpollFlags::EPOLLIN, 0));
+
+        // The second wait should have timed out with no events received.
+        ensure_ord!(results[1].epoll_res, ==, Ok(0));
+        ensure_ord!(results[1].duration, >=, timeout);
+
+        Ok(())
+    })
+}
+
 fn main() -> anyhow::Result<()> {
     // should we restrict the tests we run?
     let filter_shadow_passing = std::env::args().any(|x| x == "--shadow-passing");
@@ -340,6 +394,7 @@ fn main() -> anyhow::Result<()> {
     let mut tests: Vec<test_utils::ShadowTest<(), anyhow::Error>> = vec![
         ShadowTest::new("threads-edge", test_threads_edge, all_envs.clone()),
         ShadowTest::new("threads-level", test_threads_level, all_envs.clone()),
+        ShadowTest::new("write-then-read", test_write_then_read, all_envs.clone()),
         // in Linux these two tests have a race condition and don't always pass
         ShadowTest::new(
             "threads-level-with-late-read",

In Shadow, the second epoll_wait receives an event; on Linux, it times out without one.

Operating System (please complete the following information):

  • OS and version: Ubuntu 22.04.3 LTS
  • Kernel version: Linux thinkpad-t14 6.2.0-39-generic # 40~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Nov 16 10:53:04 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

Shadow (please complete the following information):

  • Version: commit c133925
  • Which processes you are trying to run inside the Shadow simulation: the tests

Additional context

(Writing this out so that I can understand it better.) If I understand correctly, the server socket should initially be "not readable". The second thread then blocks waiting for an edge-triggered "not readable -> readable" state change. The main thread sends two bytes to the server, which should move the server socket from "not readable" to "readable"; the epoll wait in the second thread should unblock, then immediately block again waiting for another edge-triggered "not readable -> readable" state change. The main thread then reads one byte from the server socket, which should not cause any state transition because there were two bytes in the receive buffer. So the second epoll wait should time out.

The problem appears to be that for C TCP sockets, when the application calls read() but doesn't read the socket's entire buffer, the socket transitions from READABLE -> NOT_READABLE -> READABLE, which epoll sees as an "edge condition".

[mod.rs:147] [shadow_rs::host::syscall::handler] SYSCALL_HANDLER_PRE: read (0) — (testnode.test_epoll.1000, tid=1000)
[descriptor.c:204] [_legacyfile_handleStatusChange] Status changed on desc 0x7f275800e300, from ACTIVE|READABLE|WRITEABLE to ACTIVE|WRITEABLE
[descriptor.c:204] [_legacyfile_handleStatusChange] Status changed on desc 0x7f275800e300, from ACTIVE|WRITEABLE to ACTIVE|READABLE|WRITEABLE
[tcp.c:2662] [tcp_receiveUserData] 127.0.0.1:21576 (descriptor 0x7f275800e300) <-> 127.0.0.1:24536: receiving 1 user bytes
[entry.rs:68] [shadow_rs::host::descriptor::epoll::entry] Notify old state FileState(ACTIVE | READABLE | WRITABLE), new state FileState(ACTIVE | WRITABLE), changed FileState(READABLE)
[entry.rs:68] [shadow_rs::host::descriptor::epoll::entry] Notify old state FileState(ACTIVE | WRITABLE), new state FileState(ACTIVE | READABLE | WRITABLE), changed FileState(READABLE)
[mod.rs:209] [shadow_rs::host::syscall::handler] SYSCALL_HANDLER_POST: read (0) result 1 — (testnode.test_epoll.1000, tid=1000)
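
The effect of the two status changes in the log can be reproduced with a toy edge detector. This is a simplified model, not Shadow's epoll implementation: `EdgeDetector`, `notify`, `take_event`, and `replay` are hypothetical names, and the file state is reduced to a single "readable" boolean.

```rust
/// Toy model of an edge-triggered EPOLLIN watcher (hypothetical; not Shadow's code).
/// It queues an event whenever the watched file goes from not readable to readable.
#[derive(Debug, Default)]
struct EdgeDetector {
    readable: bool,
    pending: bool,
}

impl EdgeDetector {
    /// Called for every status-change notification sent by the file.
    fn notify(&mut self, now_readable: bool) {
        if !self.readable && now_readable {
            // Rising edge: queue an event for the next wait.
            self.pending = true;
        }
        self.readable = now_readable;
    }

    /// Models one epoll_wait call: reports whether an event was queued and clears it.
    fn take_event(&mut self) -> bool {
        std::mem::take(&mut self.pending)
    }
}

/// Replays the test timeline and returns what the two epoll_wait calls observe.
/// `partial_read_notifications` lists the readable states reported while one of
/// the two buffered bytes is being read.
fn replay(partial_read_notifications: &[bool]) -> (bool, bool) {
    let mut epoll = EdgeDetector::default();

    // write(fd_client, [0, 0]): the server socket becomes readable.
    epoll.notify(true);
    let first_wait = epoll.take_event();

    // read(fd_server, 1 byte): one byte is still buffered afterwards.
    for &state in partial_read_notifications {
        epoll.notify(state);
    }
    let second_wait = epoll.take_event();

    (first_wait, second_wait)
}
```

With no notifications during the partial read (`replay(&[])`), the second wait sees no event, which is the Linux result. Feeding in the two transitions from the log (`replay(&[false, true])`) queues a second event, which is what Shadow reports.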

This is caused by the way the C TCP code handles partial packet data. When data is read from a C TCP socket, legacysocket_removeFromInputBuffer is called, which updates the STATUS_FILE_READABLE flag, and then tcp_receiveUserData updates the STATUS_FILE_READABLE flag again.

gssize tcp_receiveUserData(TCP* tcp, const Host* host, UntypedForeignPtr buffer, gsize, ...) {
    ...
    while(remaining > 0) {
        ...
        // this can remove `STATUS_FILE_READABLE`
        Packet* packet = legacysocket_removeFromInputBuffer((LegacySocket*)tcp, host);
        ...
    }
    if ((legacysocket_getInputBufferLength(&(tcp->super)) > 0) || (tcp->partialUserDataPacket != NULL)) {
        // this can add `STATUS_FILE_READABLE` back again
        legacyfile_adjustStatus(&(tcp->super.super), STATUS_FILE_READABLE, TRUE, 0);
    } else {
        ...
    }
    ...
}
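
One possible way to avoid the intermediate transition is to work out the final readable state first and adjust the flag only once. The sketch below is a simplified Rust model of that idea, not Shadow's C code and not necessarily the fix that was adopted. `Socket`, `deliver`, `recv`, `update_on_dequeue`, and `notifications_during_partial_read` are hypothetical names. It also assumes that the dequeue step clears the flag when the packet queue becomes empty.

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for a TCP socket's input side and its status listeners.
#[derive(Debug, Default)]
struct Socket {
    /// Whole packets waiting in the input buffer.
    packets: VecDeque<Vec<u8>>,
    /// Unread tail of a packet that was already dequeued.
    partial: Vec<u8>,
    readable: bool,
    /// Every (old, new) readable transition that listeners were told about.
    notifications: Vec<(bool, bool)>,
}

impl Socket {
    /// Changes the readable flag and records a notification only on a real change.
    fn set_readable(&mut self, readable: bool) {
        if self.readable != readable {
            self.notifications.push((self.readable, readable));
            self.readable = readable;
        }
    }

    /// A packet arrives from the network.
    fn deliver(&mut self, packet: &[u8]) {
        self.packets.push_back(packet.to_vec());
        self.set_readable(true);
    }

    /// Reads up to `max` bytes. With `update_on_dequeue` set, the flag is cleared
    /// as soon as the packet queue empties (the assumed problematic behavior);
    /// otherwise the flag is adjusted once, after the copy is finished.
    fn recv(&mut self, max: usize, update_on_dequeue: bool) -> Vec<u8> {
        let mut out = Vec::new();
        while out.len() < max {
            if self.partial.is_empty() {
                match self.packets.pop_front() {
                    Some(p) => self.partial = p,
                    None => break,
                }
                if update_on_dequeue && self.packets.is_empty() {
                    self.set_readable(false);
                }
            }
            let n = (max - out.len()).min(self.partial.len());
            out.extend(self.partial.drain(..n));
        }
        // Final state: readable if any queued or partially read data remains.
        let has_data = !self.packets.is_empty() || !self.partial.is_empty();
        self.set_readable(has_data);
        out
    }
}

/// Buffers one two-byte packet, reads one byte, and returns the notifications
/// that the read produced.
fn notifications_during_partial_read(update_on_dequeue: bool) -> Vec<(bool, bool)> {
    let mut sock = Socket::default();
    sock.deliver(&[0, 0]);
    sock.notifications.clear();
    let got = sock.recv(1, update_on_dequeue);
    assert_eq!(got, vec![0]);
    sock.notifications
}
```

In this model, `notifications_during_partial_read(true)` yields the readable -> not readable -> readable pair seen in the log, while `notifications_during_partial_read(false)` yields no notifications, so an edge-triggered listener would not be woken.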