shadow / shadow

Shadow is a discrete-event network simulator that directly executes real application code, enabling you to simulate distributed systems with thousands of network-connected processes in realistic and scalable private network experiments using your laptop, desktop, or server running Linux.

Home Page: https://shadow.github.io

Implement the splice syscall

cohosh opened this issue

Describe the issue
splice is a system call that allows for the efficient movement of data between file descriptors if one of the descriptors is a pipe.

The Go networking library switched to using splice in the ReadFrom/WriteTo methods of net.TCPConn to improve performance. This means that copies between net.TCPConn values in Shadow (for example, via io.Copy) fail with the following Go logs:

2000/01/01 00:00:10 error copying to ORPort writeto tcp 0.0.0.0:8081->11.0.0.1:29902: readfrom tcp 127.0.0.1:20027->127.0.0.1:8080: splice: function not implemented
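The "writeto"/"readfrom" prefixes in that log line come from the fast paths that io.Copy prefers over a plain Read/Write loop. As a rough illustration (the names takesFastPath, plainReader, plainWriter, and demo are made up for this sketch and are not part of Go or Shadow), the following checks whether a given source/destination pair would be handed off to WriteTo or ReadFrom:

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net"
)

// Compile-time check: *net.TCPConn implements io.ReaderFrom, which is one of
// the methods io.Copy prefers over its own copy loop.
var _ io.ReaderFrom = (*net.TCPConn)(nil)

// plainReader and plainWriter are hypothetical wrappers that expose only
// Read or Write, hiding any WriteTo/ReadFrom on the wrapped value.
type plainReader struct{ r io.Reader }

func (p plainReader) Read(b []byte) (int, error) { return p.r.Read(b) }

type plainWriter struct{ w io.Writer }

func (p plainWriter) Write(b []byte) (int, error) { return p.w.Write(b) }

// takesFastPath reports whether io.Copy(dst, src) would delegate to
// src.WriteTo or dst.ReadFrom instead of running its generic loop.
func takesFastPath(dst io.Writer, src io.Reader) bool {
	if _, ok := src.(io.WriterTo); ok {
		return true
	}
	_, ok := dst.(io.ReaderFrom)
	return ok
}

// demo compares a destination that has ReadFrom with a wrapped one.
func demo() (direct, wrapped bool) {
	var buf bytes.Buffer
	src := plainReader{bytes.NewReader([]byte("x"))}
	return takesFastPath(&buf, src), takesFastPath(plainWriter{&buf}, src)
}

func main() {
	direct, wrapped := demo()
	fmt.Println("*bytes.Buffer destination takes a fast path:", direct)
	fmt.Println("wrapped destination takes a fast path:", wrapped)
}
```

For a *net.TCPConn on Linux, that fast path is where the splice call is attempted, which is presumably why the failure shows up even though the application only calls io.Copy.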

To Reproduce
I've created a minimal example of a simple TCP proxy in Go: https://github.com/cohosh/go-tcp-shadow-minimal

Operating System (please complete the following information):

  • OS and version: Debian GNU/Linux trixie/sid
  • Kernel version: Linux 6.7.9-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.7.9-2 (2024-03-13) x86_64 GNU/Linux

Shadow (please complete the following information):

  • Version and build information: post the output of shadow --show-build-info
Shadow 3.1.0 — v3.1.0-205-g1ca8075dd 2024-04-01--16:54:23
GLib 2.78.4
Built on 2024-04-02--15:00:39
Built from git branch main
Shadow was built with PROFILE=release, OPT_LEVEL=3, DEBUG=false, RUSTFLAGS="-C force-frame-pointers=y", CFLAGS="-std=gnu11 -O3 -ggdb -fno-omit-frame-pointer -Wreturn-type -Wswitch -DNDEBUG"
For more information, visit https://shadow.github.io or https://github.com/shadow
  • Which processes you are trying to run inside the Shadow simulation:
    minimal tcp proxy, tgen

A couple of notes about implementing this while I'm thinking about it:

  1. splice has a few similar syscalls like tee, sendfile, and copy_file_range. When designing the interface for splice, it might be worth considering these other syscalls to see if it's possible to support more than just splice with the same interface.
  2. Currently, to get bytes in or out of a file, you use the file's readv and writev methods. But these are meant for copying bytes to/from the plugin's memory, so they take an iovec of plugin pointers, which isn't useful for moving bytes between two files within Shadow. It's probably best to add new methods to the file (maybe named something like splice_in/splice_out) and implement them for each file type (RegularFile, TcpSocket, EventFd, etc). The tricky part is that you need to move data from one file to another without dropping any bytes (the move must be infallible). If you remove bytes from one file, you must add them to the other file, which means the receiving file must have enough space for them. Once you remove data from one file, you can't add it back to the original file if the receiving file doesn't have enough space. This could be especially tricky when copying bytes from a pipe to a regular file, since we don't know whether the Linux kernel "write-to-file" will succeed. It might be useful to have some sort of pattern where a file gives you a handle to some bytes that lets you copy them, and only once the copy succeeds do you "commit" the changes on the handle, causing the original file to remove those bytes.
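The handle-then-commit idea in note 2 could look roughly like the sketch below. It is written in Go only to match the other code in this thread; Shadow itself would need an equivalent in its own codebase. All names here (pipeBuf, boundedSink, Peek, Commit, spliceOnce, errWouldBlock) are hypothetical and do not correspond to existing Shadow types.

```go
package main

import (
	"errors"
	"fmt"
)

// errWouldBlock is a hypothetical stand-in for EWOULDBLOCK.
var errWouldBlock = errors.New("would block")

// pipeBuf is a hypothetical source that can lend out buffered bytes
// without removing them yet.
type pipeBuf struct{ data []byte }

// Peek returns up to n buffered bytes without consuming them.
func (p *pipeBuf) Peek(n int) []byte {
	if n < 0 {
		n = 0
	}
	if n > len(p.data) {
		n = len(p.data)
	}
	return p.data[:n]
}

// Commit drops the first n bytes. It is called only after the destination
// has accepted them, so no byte is ever lost in between.
func (p *pipeBuf) Commit(n int) { p.data = p.data[n:] }

// boundedSink is a hypothetical destination with limited space.
type boundedSink struct {
	data  []byte
	limit int
}

// Write accepts as many bytes as fit and reports how many were taken.
func (s *boundedSink) Write(b []byte) (int, error) {
	free := s.limit - len(s.data)
	if free <= 0 {
		return 0, errWouldBlock
	}
	if len(b) > free {
		b = b[:free]
	}
	s.data = append(s.data, b...)
	return len(b), nil
}

// spliceOnce moves up to n bytes from src to dst, committing on the source
// only the bytes that the destination actually accepted.
func spliceOnce(dst *boundedSink, src *pipeBuf, n int) (int, error) {
	chunk := src.Peek(n)
	if len(chunk) == 0 {
		return 0, nil
	}
	written, err := dst.Write(chunk)
	src.Commit(written)
	return written, err
}

func main() {
	src := &pipeBuf{data: []byte("hello world")}
	dst := &boundedSink{limit: 5}
	n, err := spliceOnce(dst, src, 8)
	fmt.Printf("moved=%d err=%v dst=%q remaining=%q\n", n, err, dst.data, src.data)
}
```

Because the source only drops what the destination reports as written, a partial or failed write leaves the remaining bytes in the source rather than in limbo.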

A workaround in the short term is to avoid calling io.Copy from Go code. Instead, you could call the following function, which mirrors Go's internal implementation but does not make calls to WriteTo/ReadFrom (thus avoiding splice syscalls).

import (
	"errors"
	"io"
)

// myCopyBuffer is adapted from copyBuffer in Go's io package, the actual
// implementation of Copy and CopyBuffer. The WriteTo/ReadFrom fast paths,
// which can use splice internally, are commented out.
// See https://cs.opensource.google/go/go/+/master:src/io/io.go;l=407
//
// If buf is nil, one is allocated.
func myCopyBuffer(dst io.Writer, src io.Reader, buf []byte) (written int64, err error) {
	// // If the reader has a WriteTo method, use it to do the copy.
	// // Avoids an allocation and a copy.
	// if wt, ok := src.(WriterTo); ok {
	// 	return wt.WriteTo(dst)
	// }
	// // Similarly, if the writer has a ReadFrom method, use it to do the copy.
	// if rf, ok := dst.(ReaderFrom); ok {
	// 	return rf.ReadFrom(src)
	// }
	if buf == nil {
		size := 32 * 1024
		if l, ok := src.(*io.LimitedReader); ok && int64(size) > l.N {
			if l.N < 1 {
				size = 1
			} else {
				size = int(l.N)
			}
		}
		buf = make([]byte, size)
	}
	for {
		nr, er := src.Read(buf)
		if nr > 0 {
			nw, ew := dst.Write(buf[0:nr])
			if nw < 0 || nr < nw {
				nw = 0
				if ew == nil {
					ew = errors.New("invalid write result")
				}
			}
			written += int64(nw)
			if ew != nil {
				err = ew
				break
			}
			if nr != nw {
				err = io.ErrShortWrite
				break
			}
		}
		if er != nil {
			if er != io.EOF {
				err = er
			}
			break
		}
	}
	return written, err
}
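A shorter variant of the same idea, which is not from this thread and is offered only as an untested-in-Shadow sketch, is to keep calling io.Copy but wrap both ends in anonymous structs that embed only io.Writer and io.Reader. The wrappers do not expose ReadFrom/WriteTo, so io.Copy should fall back to its generic loop. The helper names copyNoFastPath, demoCopy, and wrappedHasReadFrom are made up for illustration.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"strings"
)

// copyNoFastPath copies src to dst through wrappers that expose only
// Read and Write, so io.Copy cannot find WriteTo or ReadFrom on them.
func copyNoFastPath(dst io.Writer, src io.Reader) (int64, error) {
	return io.Copy(struct{ io.Writer }{dst}, struct{ io.Reader }{src})
}

// demoCopy copies a string into a buffer via copyNoFastPath.
func demoCopy(s string) (string, int64, error) {
	var dst bytes.Buffer
	n, err := copyNoFastPath(&dst, strings.NewReader(s))
	return dst.String(), n, err
}

// wrappedHasReadFrom reports whether the wrapper still exposes ReadFrom
// for a destination (*bytes.Buffer) that has that method itself.
func wrappedHasReadFrom() bool {
	var dst bytes.Buffer
	_, ok := io.Writer(struct{ io.Writer }{&dst}).(io.ReaderFrom)
	return ok
}

func main() {
	out, n, err := demoCopy("splice-free")
	fmt.Println(out, n, err)
	fmt.Println("wrapper exposes ReadFrom:", wrappedHasReadFrom())
}
```

The trade-off is the same as with myCopyBuffer: every byte passes through a userspace buffer, so the performance benefit of splice is given up in exchange for compatibility.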

A full patch for the dummy examples included in the goptlib examples directory is attached as a working example. You can apply the patch like this:

git clone https://gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/goptlib.git
cd goptlib
git checkout tags/v1.5.0 -b dummy_no_splice
git apply copy_no_splice.patch
cd examples/dummy-client
CGO_ENABLED=1 go build
cd ../dummy-server
CGO_ENABLED=1 go build

Of course the better solution is to support splice in Shadow, but maybe this workaround can help in the meantime.

copy_no_splice.patch

@stevenengler I wonder if, instead of the try/commit scheme you outlined, we could support a new interface called something like splice_from or read_from that we would call on the splice destination fd, passing in the source fd. Then the destination fd could figure out internally how much space it has available and read from the source fd at most min(space, requested_len), potentially blocking or returning EWOULDBLOCK as appropriate.

I haven't checked tee, sendfile, and copy_file_range, so I'm not yet sure if the above would be general enough to cover all cases though.
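To make the destination-driven proposal concrete, here is a minimal sketch of what a splice_from-style call might look like. It is written in Go to match the other code in this thread, and every name in it (source, pipe, socket, Take, SpliceFrom, errWouldBlock) is hypothetical rather than an existing Shadow interface.

```go
package main

import (
	"errors"
	"fmt"
)

// errWouldBlock is a hypothetical stand-in for EWOULDBLOCK.
var errWouldBlock = errors.New("EWOULDBLOCK")

// source is a hypothetical view of the splice source: it hands over up to
// n buffered bytes and removes them from its own buffer.
type source interface {
	Take(n int) []byte
}

// pipe is a hypothetical in-memory source.
type pipe struct{ buf []byte }

func (p *pipe) Take(n int) []byte {
	if n < 0 {
		n = 0
	}
	if n > len(p.buf) {
		n = len(p.buf)
	}
	out := p.buf[:n]
	p.buf = p.buf[n:]
	return out
}

// socket is a hypothetical destination with a bounded send buffer.
type socket struct {
	sendBuf []byte
	limit   int
}

// SpliceFrom is called on the destination. It works out its own free space
// first, then takes at most min(space, requested) bytes from the source, so
// it never removes bytes that it cannot store.
func (s *socket) SpliceFrom(src source, requested int) (int, error) {
	space := s.limit - len(s.sendBuf)
	if space <= 0 {
		return 0, errWouldBlock
	}
	n := requested
	if space < n {
		n = space
	}
	chunk := src.Take(n)
	s.sendBuf = append(s.sendBuf, chunk...)
	return len(chunk), nil
}

func main() {
	p := &pipe{buf: []byte("abcdefgh")}
	s := &socket{limit: 5}
	n, err := s.SpliceFrom(p, 100)
	fmt.Printf("moved=%d err=%v sendBuf=%q left=%q\n", n, err, s.sendBuf, p.buf)
}
```

Since the destination decides the length before touching the source, no separate commit step is needed in this sketch; whether that holds for every file type is the open question raised above.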

> @stevenengler I wonder if, instead of the try/commit scheme you outlined, we could support a new interface called something like splice_from or read_from that we would call on the splice destination fd, passing in the source fd. Then the destination fd could figure out internally how much space it has available and read from the source fd at most min(space, requested_len), potentially blocking or returning EWOULDBLOCK as appropriate.

Yeah I think that would work. For RegularFile, I think we could just assume the underlying Linux file can accept infinite bytes, and then just loop calling write(osfile.fd, ...) until all bytes (min(available, requested_len)) are written. Then we don't need to worry about the problem where the destination doesn't have enough space.
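The "loop calling write until all bytes are written" step can be sketched as follows. The write primitive is modeled as a plain function so the example stays self-contained; writeFn, writeAll, and shortWriter are illustrative names, and the short-write behavior of shortWriter is an assumption used to exercise the loop, not a model of any particular file system.

```go
package main

import (
	"errors"
	"fmt"
)

// writeFn models a write(2)-style primitive that may accept fewer bytes
// than it was given.
type writeFn func(p []byte) (int, error)

// writeAll keeps calling write until every byte of data has been accepted,
// an error occurs, or a call makes no progress.
func writeAll(write writeFn, data []byte) (int, error) {
	total := 0
	for total < len(data) {
		n, err := write(data[total:])
		if n > 0 {
			total += n
		}
		if err != nil {
			return total, err
		}
		if n <= 0 {
			return total, errors.New("write made no progress")
		}
	}
	return total, nil
}

// shortWriter returns a hypothetical primitive that accepts at most perCall
// bytes per call and appends them to *out.
func shortWriter(out *[]byte, perCall int) writeFn {
	return func(p []byte) (int, error) {
		if len(p) > perCall {
			p = p[:perCall]
		}
		*out = append(*out, p...)
		return len(p), nil
	}
}

func main() {
	var out []byte
	n, err := writeAll(shortWriter(&out, 3), []byte("hello world"))
	fmt.Printf("wrote=%d err=%v out=%q\n", n, err, out)
}
```

If the underlying write does fail partway (disk full, for example), the caller still learns how many bytes were written, which matters because those bytes have already left the source.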

Since we'd need to add an additional method for reading/taking bytes from a file, I'm not sure whether the method should have a Read-like API, where you pass in a mutable buffer and the file writes to the buffer, or whether it should return a Bytes object instead. Since one of the two files needs to be a pipe, I think it might be better to deal with Bytes objects to avoid memory copies (pipes use Bytes internally). But I think this can be decided by whatever is easiest to implement.

Agreed, using Bytes objects to avoid copies when we can seems like a good idea to me :)