gVisor writes ethernet frames to tun devices
benbuzbee opened this issue
Description
If we create a tun device inside the netns that gVisor will use, gVisor writes ethernet frames to it when it should write IP packets.
https://www.kernel.org/doc/Documentation/networking/tuntap.txt
Depending on the type of device chosen the userspace program has to read/write
IP packets (with tun) or ethernet frames (with tap)
tun devices are more common, for example openvpn --mktun or wireguard-go. OpenVPN can support a tap device, but it is not the default. wireguard-go, on the other hand, cannot.
In our particular use case, we would like to create a tun device inside the sandbox whose traffic is shipped via WireGuard over an ethernet device that does not live in the network namespace, for added egress security.
In a related problem, gVisor ignores the pointtopoint and noarp flags on the interface and requires a gateway & ARP. Ideally gVisor notes the pointtopoint device and doesn't try to ARP, and also allows us to specify a default route without a gateway in this case.
Steps to reproduce
These steps are a bit annoying because you need something that reads the tun data, so I took https://web.ecs.syr.edu/~wedu/seed/Labs/VPN/files/simpletun.c and modified its server mode to just drop the packets for demonstration. The modified code is attached and was compiled via gcc simpletun.txt -o simpletun
simpletun.txt
Create rootfs with network tools
ID=$(docker run -d --rm --entrypoint /usr/bin/tail praqma/network-multitool -f /dev/null)
docker export $ID> root.tar
docker stop $ID
mkdir root
tar -C root -xf root.tar
Create test container and configure netns with tun device
Simple config.json.txt
You will need to compile the attached code for simpletun to read from the tun device and drop packets so the requests don't hang
sudo runsc create test
sudo rm -f /var/run/netns/test && sudo ln -s /proc/$(sudo runsc state test | jq -r '.pid')/ns/net /var/run/netns/test
sudo ip netns exec test ./simpletun -i tun0 -s -d
sudo ip netns exec test ip link set lo up
sudo ip netns exec test ip link set tun0 up
sudo ip netns exec test ip addr add 10.0.2.2/24 dev tun0
sudo ip netns exec test ip route add default via 10.0.2.1 dev tun0
Start it and check settings
$ sudo ip netns exec test ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: tun0: <NO-CARRIER,POINTOPOINT,MULTICAST,NOARP,UP> mtu 1500 qdisc fq_codel state DOWN group default qlen 100
link/none
inet 10.0.2.2/24 scope global tun0
valid_lft forever preferred_lft forever
$ sudo runsc start test
$ sudo runsc exec test /sbin/ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope global dynamic
inet6 ::1/128 scope global dynamic
2: tun0: <UP,LOWER_UP> mtu 1500
link/ether 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 10.0.2.2/24 scope global dynamic
inet6 fe80::98e5:3224:586a:4f9d/64 scope global dynamic
OK, so now we have gVisor running with the tun device. Note that it has dropped the pointtopoint and multicast flags and given the device the MAC address 00:00:00:00:00:00.
Verify gVisor is writing ethernet frames
In one terminal:
$ sudo ip netns exec test tcpdump -vv -l -x -i tun0
In another:
$ sudo runsc exec test /usr/bin/curl -vv 10.0.2.10
Note:
tcpdump: listening on tun0, link-type RAW (Raw IP), capture size 262144 bytes
19:27:47.814582 unknown ip 15
0x0000: ffff ffff ffff 0000 0000 0000 0806 0001
0x0010: 0800 0604 0001 0000 0000 0000 0a00 0202
0x0020: 0000 0000 0000 0a00 0264
This is an Ethernet ARP frame. tcpdump is also confused because it knows the device is a tun and doesn't expect this framing.
You can run simpletun with -a
If you run simpletun with the -a flag it will create a tap device and you can see tcpdump recognizes the ARP
tcpdump: listening on tun0, link-type EN10MB (Ethernet), capture size 262144 bytes
19:29:37.704942 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.0.2.10 tell 10.0.2.2, length 28
0x0000: 0001 0800 0604 0001 86d9 dba2 f42b 0a00
0x0010: 0202 0000 0000 0000 0a00 020a
runsc version
runsc version release-20210315.0
spec: 1.0.2
uname
Linux browser-devbox 5.8.0-23-generic #24~20.04.1-Ubuntu SMP Sat Oct 10 04:57:02 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Thanks for the report. I believe I understand where it's going wrong: runsc today treats all underlying FDs as ethernet devices.
Here's where we scrape the network interfaces in the namespace
gvisor/runsc/sandbox/network.go, lines 198 and 212 in 6eb8596
As you can see, it's currently hardcoded to treat all underlying FDs as ethernet. We could trivially change that to say:
EthernetHeader: mac != "",
I believe that should fix most of the issues.
Could you try that? If it works, I can roll up a PR to fix this.
@benbuzbee ping
Seems to be on a good track. I will have to integrate more stuff to really verify it's working correctly e2e, but I can see it now sends a TCP SYN:
18:18:08.888624 IP (tos 0x0, ttl 64, id 21798, offset 0, flags [none], proto TCP (6), length 60)
10.0.2.2.46653 > 10.0.2.10.http: Flags [S], cksum 0xfd48 (correct), seq 1809025410, win 29184, options [mss 1460,sackOK,TS val 2298317770 ecr 0,nop,wscale 7], length 0
I will try to spend more time and verify e2e