google / gvisor

Application Kernel for Containers

Home Page: https://gvisor.dev

ICMPv6 neighbor discovery between network namespaces does not work in sandbox networking mode

th0m opened this issue

Description

My setup is the following:

  • two runsc containers
  • each container is in a different network namespace
  • each network namespace has a macvlan interface in bridge mode with the same host parent interface
  • each macvlan interface has a link local address set up in the fe80::/64 range

From each container, I'd expect ICMPv6 neighbor discovery to work and to be able to ping the link-local IP of the other container.
ICMPv6 neighbor discovery works in host networking mode but does not work in the default sandbox/netstack networking mode.

Please see minimal reproduction steps below.
Let me know if there is anything else I can provide to help diagnose the issue.

Steps to reproduce

Here are minimal steps to reproduce the issue:

Bundle setup

# create bundle
mkdir -p bundle/rootfs
cd bundle
docker export $(docker create debian) | tar -xf - -C rootfs
runsc spec -netns /run/netns/ctr -- sleep 10000

Container in sandbox networking mode, ICMPv6 neighbor discovery fails

# clean up
sudo runsc kill ip6nd
sudo runsc delete ip6nd
sudo ip netns delete ctr
sudo ip netns delete testing

# create macvlan iface in ctr netns used by the runsc container
sudo ip link add link wlp0s20f3 dev ctr0 type macvlan mode bridge
sudo ip netns add ctr
sudo ip link set ctr0 netns ctr
sudo ip netns exec ctr ip a a fe80::f00/64 dev ctr0
sudo ip netns exec ctr ip l set dev ctr0 up

# start the container in sandbox networking mode and wait 10s for it to start
sudo runsc run -detach -bundle . ip6nd
sleep 10

# create macvlan iface in testing netns
sudo ip link add link wlp0s20f3 dev tst0 type macvlan mode bridge
sudo ip netns add testing
sudo ip link set tst0 netns testing
sudo ip netns exec testing ip l set dev tst0 up

# ping from testing netns to ctr netns fails
sudo ip netns exec testing ping fe80::f00
PING fe80::f00(fe80::f00) 56 data bytes
^C
--- fe80::f00 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 2037ms

Container in host networking mode, ICMPv6 neighbor discovery succeeds

# clean up
sudo runsc kill ip6nd
sudo runsc delete ip6nd
sudo ip netns delete ctr
sudo ip netns delete testing

# create macvlan iface in ctr netns used by the runsc container
sudo ip link add link wlp0s20f3 dev ctr0 type macvlan mode bridge
sudo ip netns add ctr
sudo ip link set ctr0 netns ctr
sudo ip netns exec ctr ip a a fe80::f00/64 dev ctr0
sudo ip netns exec ctr ip l set dev ctr0 up

# start the container in host networking mode and wait 10s for it to start
sudo runsc --network=host run -detach -bundle . ip6nd
sleep 10

# create macvlan iface in testing netns
sudo ip link add link wlp0s20f3 dev tst0 type macvlan mode bridge
sudo ip netns add testing
sudo ip link set tst0 netns testing
sudo ip netns exec testing ip l set dev tst0 up

# ping from testing netns to ctr netns succeeds
sudo ip netns exec testing ping fe80::f00
PING fe80::f00(fe80::f00) 56 data bytes
64 bytes from fe80::f00%tst0: icmp_seq=1 ttl=64 time=0.106 ms
64 bytes from fe80::f00%tst0: icmp_seq=2 ttl=64 time=0.057 ms
64 bytes from fe80::f00%tst0: icmp_seq=3 ttl=64 time=0.073 ms
^C
--- fe80::f00 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2049ms
rtt min/avg/max/mdev = 0.057/0.078/0.106/0.020 ms

runsc version

$ runsc --version
runsc version release-20220222.0
spec: 1.0.2-dev

docker version (if using docker)

$ docker version
Client:
 Version:           20.10.7
 API version:       1.41
 Go version:        go1.13.8
 Git commit:        20.10.7-0ubuntu5.1
 Built:             Mon Nov  1 00:33:40 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server:
 Engine:
  Version:          20.10.7
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.13.8
  Git commit:       20.10.7-0ubuntu5.1
  Built:            Thu Oct 21 23:58:58 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.5.5-0ubuntu3
  GitCommit:        
 runc:
  Version:          1.0.1-0ubuntu2
  GitCommit:        
 docker-init:
  Version:          0.19.0
  GitCommit:        

uname

Linux tlefebvre-Latitude-7420 5.13.0-30-generic #33-Ubuntu SMP Fri Feb 4 17:03:31 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

kubectl (if using Kubernetes)

No response

repo state (if built from source)

No response

runsc debug logs (if available)

No response

Any thoughts on this @hbhasker? Thank you!

Sorry I haven't had time to look at this report.

@nybidari Could you take a look?

No worries, thanks for assigning this issue.
@nybidari feel free to let me know if you need any additional information.

I will take a look at this one.

Caveat: this is my first time hearing about MACVLAN interfaces. I'm mostly going with the description here.

runsc was written to operate in a Docker-esque environment. For networking, this means a netns containing one end of a veth pair. In sandbox mode, runsc scrapes the addresses and routes from its veth device, removes them from the veth (the host no longer needs them and shouldn't respond to pings and whatnot -- that's runsc's job), and sets up its own network stack with those addresses and routes.
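
To see that scraping in action with the repro above, you can compare the addresses and routes on ctr0 before and after the sandbox starts (a diagnostic sketch only; the exact output depends on your setup):

# before `runsc run`: fe80::f00/64 (and any kernel-autoconfigured
# link-local address) should be present on ctr0
sudo ip netns exec ctr ip -6 addr show dev ctr0
sudo ip netns exec ctr ip -6 route show dev ctr0

# after `runsc run`: the addresses and routes should be gone from the
# host-side device and live inside runsc's netstack instead
sudo ip netns exec ctr ip -6 addr show dev ctr0
sudo ip netns exec ctr ip -6 route show dev ctr0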

My guess is that, because MACVLAN appears not to be a "two-ended pipe" like veth, deleting addresses and routes from the device makes it basically unreachable. Does other network functionality work, or is it just neighbor discovery? Can you ping another host locally or on the internet?

The reason this works in host networking mode is that we don't do any of the above: a connection made from within runsc results in a connection made on the host. We don't modify the interface in the namespace.

What's the motivation for using MACVLAN? I'm thinking about how to deal with this, but the simplest approach would be the traditional veth+bridge setup.
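
For reference, a minimal veth+bridge wiring for the same netns could look something like the sketch below (br0, veth-host and veth-ctr are placeholder names, how br0 is uplinked to the physical network is left out, and runsc would scrape the veth end inside ctr as usual):

# hypothetical bridge plus veth pair instead of the macvlan device
sudo ip link add br0 type bridge
sudo ip link set br0 up
sudo ip link add veth-host type veth peer name veth-ctr
sudo ip link set veth-host master br0
sudo ip link set veth-host up
sudo ip link set veth-ctr netns ctr
sudo ip netns exec ctr ip a a fe80::f00/64 dev veth-ctr
sudo ip netns exec ctr ip l set dev veth-ctr up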

Maybe we could re-add the routes after deleting the address when using MACVLAN. That might get the kernel routing packets to the right place. Hard to know before testing -- it really depends on how MACVLAN is implemented by Linux.
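
As a quick manual experiment along those lines (untested; whether it helps depends entirely on how the kernel handles a macvlan device that has no addresses):

# after runsc has scraped ctr0, try restoring just the on-link route
sudo ip netns exec ctr ip -6 route add fe80::/64 dev ctr0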

@th0m can give context on this specific use case (because I forgot), but in general think of macvlan as a lighter-weight approach to container networking. In the default bridge/veth mode:

  • the host OS needs to create the bridge
  • there is overhead in setting up and maintaining NAT between the host network and the bridge, plus everything that comes with it: IPv6 does not play well with NAT, connection tracking consumes kernel resources, and there is extra addressing and routing to manage

In some cases we just want to reuse the network "infra" that is already available for containers, which means plugging them directly into the network. While that is technically possible via bridge/veth, macvlan is more convenient and has less overhead. One example: we want to run services reachable at L2 over IPv6 link-local addresses. These need no extra setup from a networking perspective, and macvlan is very convenient for that.

Does other network functionality work, or is it just neighbor discovery?

For IPv6, if neighbour discovery is broken, everything else is broken as well (it plays the same role for IPv6 that ARP does for IPv4).
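
One way to see that in the repro (a diagnostic suggestion, not part of the original report) is to check the neighbour cache in the testing netns after the failed ping; an unanswered solicitation typically leaves the entry in INCOMPLETE or FAILED state:

sudo ip netns exec testing ip -6 neigh show dev tst0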

I ran the repro and confirmed that the issue exists, albeit with some weirdness:

  • runsc scrapes two IPs from the host namespace: fe80::f00 and an auto-generated (EUI-64?) address.
  • The auto-generated address is immediately ping-able.
  • fe80::f00 becomes ping-able eventually, although I'm unsure whether this happens as a result of the successful EUI ping or just "over time".

I'm not sure when I'll have time to look more deeply into this, but support for macvlan would be a useful contribution. It seems like there's something missing in the initial setup with multiple IPv6 addresses, although the missing logic could be on the host or in gVisor.
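
To reproduce that observation before starting the sandbox (a sketch; the second address is the kernel's own link-local autoconfiguration, derived from the macvlan device's MAC, so it appears as soon as ctr0 comes up):

# ctr0 should carry both the manually added fe80::f00/64 and a
# kernel-autoconfigured fe80::.../64 address
sudo ip netns exec ctr ip -6 addr show dev ctr0 scope link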