worstcase / blockade

Docker-based utility for testing network failures and partitions in distributed applications

Home Page: http://blockade.readthedocs.org

What would it take to add support for Docker Swarm please?

julianharty opened this issue

As https://github.com/worstcase/blockade/blob/master/docs/install.rst states, Docker Swarm isn't currently supported. I think I'd need Docker Swarm support in order to test machine clusters, e.g. a Kafka cluster running on virtual machines (e.g. from VMware) or on a cluster of physical machines such as Raspberry Pis :)

If you could provide some pointers on what it'd take to add support for Docker Swarm, that'd be a great help. Of course, if there are other ways to solve my testing challenge (meddling with nodes running Kafka, Zookeeper, etc. and their connections), those would be appreciated too.

PS: thanks for an excellent project and capability. I came here by following a well-trodden path of Jepsen, then jepsen-python, which uses blockade 👍

Thanks for your message. I think Swarm support is possible and would definitely be nice to have. There are two main obstacles I am aware of:

Blockade uses a HostExec mechanism to run commands in a host container alongside your application containers. These commands manipulate the network stack of the host itself, so they must run on the same host as the target application container. This is easy now because a single host runs all the containers, so we can simply run the commands there. With Swarm, however, the application is potentially spread across multiple hosts. To further complicate things, for performance reasons the host container is kept running for the duration of your blockade and is fed commands as needed via docker exec.

To support Swarm, this mechanism would need to be extended to run and track host containers across every Swarm host used by your application. I believe this could be accomplished using Docker placement constraints. HostExec.run currently accepts only a command argument, but it could perhaps be extended to also accept a container_id. It could then work out which host the container runs on, and use placement constraints to co-schedule a host container onto that host as well (if one isn't already running).
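To make the idea above a bit more concrete, here is a minimal Python sketch of per-host tracking. It is not Blockade's actual code: the SwarmHostExec class, its three injected callables, and the node_constraint helper are all hypothetical names invented for illustration. The constraint string assumes standalone Docker Swarm, where a container can be pinned with an environment variable of the form constraint:node==<name>; Swarm mode services use a different syntax.

```python
from typing import Callable, Dict, List


def node_constraint(node: str) -> str:
    """Build a placement constraint for one node.

    Assumption: standalone Docker Swarm, where this string is passed as an
    environment variable when the host container is created.
    """
    return "constraint:node==" + node


class SwarmHostExec:
    """Hypothetical per-node variant of HostExec (illustration only)."""

    def __init__(
        self,
        locate_node: Callable[[str], str],
        start_host_container: Callable[[str], str],
        exec_in: Callable[[str, List[str]], object],
    ) -> None:
        # locate_node: container_id -> name of the Swarm node running it
        # start_host_container: node name -> id of a new host container there
        # exec_in: (host container id, command) -> result of a docker exec
        self._locate_node = locate_node
        self._start_host_container = start_host_container
        self._exec_in = exec_in
        self._host_containers: Dict[str, str] = {}  # node name -> container id

    def run(self, command: List[str], container_id: str) -> object:
        """Run command on whichever node hosts container_id."""
        node = self._locate_node(container_id)
        host_container = self._host_containers.get(node)
        if host_container is None:
            # First command for this node: start one long-lived host
            # container there and reuse it for every later command.
            host_container = self._start_host_container(node)
            self._host_containers[node] = host_container
        return self._exec_in(host_container, command)
```

Injecting the three callables keeps the Docker-specific parts (inspecting a container to find its node, creating the constrained host container, and docker exec) out of the tracking logic, so the bookkeeping can be exercised without a Swarm cluster.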

The other obstacle requires some thought and experimentation. Blockade uses iptables to implement network partitions, applying the rules through the HostExec mechanism described above. We create a chain for each partition, direct all traffic from containers in that partition to the chain, then add drop rules for any container NOT in the partition. Here is the implementation. I am not sure how well this will extend to Swarm. The same mechanism might work if we simply run the same iptables commands on all relevant Swarm hosts, using the improved HostExec mechanism. But I am not sure about this.
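As a rough illustration of the chain-per-partition scheme described above, the following Python sketch generates the iptables commands for a set of partitions. It is an approximation, not the linked implementation: the partition_rules function, the blockade-p chain prefix, and the use of the FORWARD chain are assumptions made for the example. Only standard iptables options appear (-N to create a chain, -I and -A to insert and append rules, -s and -d to match source and destination, -j for the target).

```python
from typing import List


def partition_rules(
    partitions: List[List[str]], prefix: str = "blockade-p"
) -> List[List[str]]:
    """Return iptables commands isolating each partition from the others.

    partitions is a list of groups of container IP addresses. The chain
    prefix and the choice of FORWARD are assumptions for this sketch.
    """
    all_ips = [ip for members in partitions for ip in members]
    commands: List[List[str]] = []
    for index, members in enumerate(partitions, start=1):
        chain = "{}{}".format(prefix, index)
        # One chain per partition.
        commands.append(["iptables", "-N", chain])
        # Send traffic originating from partition members to that chain.
        for ip in members:
            commands.append(["iptables", "-I", "FORWARD", "-s", ip, "-j", chain])
        # Inside the chain, drop traffic to every container NOT in the partition.
        for ip in all_ips:
            if ip not in members:
                commands.append(["iptables", "-A", chain, "-d", ip, "-j", "DROP"])
    return commands
```

On Swarm, one open question is whether each host needs the full rule set or only the chains for the containers it actually runs. The sketch does not settle that; it only shows the shape of the commands that would have to be run on each relevant host.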

These are the issues I know about. There could certainly be more complexities that arise during implementation. Because Blockade manipulates the underlying network stack, I've come across a lot of surprises during implementation.