thecodeteam / mesos-module-dvdi

Mesos Docker Volume Driver Isolator module

Isolator recovery is problematic with legacy mounts.

jieyu opened this issue

The isolator determines the 'legacy mounts' (mounts that no active container is using) during slave recovery by looking only at the checkpointed state, and it unmounts those legacy mounts immediately. However, known orphans might not have been killed yet (known orphans are containers that are known to the launcher). The Mesos agent does an async cleanup on those known orphan containers.

Unmounting the volumes while those orphan containers are still using them may be problematic. The correct approach is to wait for 'cleanup' to be called for those known orphan containers (i.e., still create an Info struct for each orphan container, and do the proper unmount in the 'cleanup' function), as in the sketch below.
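Roughly what I have in mind (just a sketch; `Info`, `infos`, `Mount`, `readCheckpointedMounts`, and `unmountVolume` are placeholders for the module's bookkeeping, not its actual members):

```cpp
#include <list>
#include <vector>

#include <mesos/mesos.hpp>
#include <mesos/slave/isolator.hpp>

#include <process/future.hpp>
#include <process/owned.hpp>

#include <stout/foreach.hpp>
#include <stout/hashmap.hpp>
#include <stout/hashset.hpp>
#include <stout/nothing.hpp>

using mesos::ContainerID;
using mesos::slave::ContainerState;

using process::Future;
using process::Owned;

Future<Nothing> DockerVolumeDriverIsolatorProcess::recover(
    const std::list<ContainerState>& states,
    const hashset<ContainerID>& orphans)
{
  // All checkpointed mounts, keyed by the container that created them.
  hashmap<ContainerID, std::vector<Mount>> checkpointed =
    readCheckpointedMounts();

  foreach (const ContainerState& state, states) {
    // Live container: keep tracking its mounts as in use.
    if (checkpointed.contains(state.container_id())) {
      infos.put(state.container_id(),
                Owned<Info>(new Info(checkpointed[state.container_id()])));
      checkpointed.erase(state.container_id());
    }
  }

  foreach (const ContainerID& containerId, orphans) {
    // Known orphan: do NOT unmount here. Track it so the agent's
    // async destroy of the orphan eventually reaches cleanup(),
    // which unmounts once the container is really gone.
    if (checkpointed.contains(containerId)) {
      infos.put(containerId,
                Owned<Info>(new Info(checkpointed[containerId])));
      checkpointed.erase(containerId);
    }
  }

  // Anything left has no live container and no known orphan: a true
  // legacy mount, safe to unmount immediately.
  foreachvalue (const std::vector<Mount>& mounts, checkpointed) {
    foreach (const Mount& mount, mounts) {
      unmountVolume(mount);
    }
  }

  return Nothing();
}
```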

When this was developed, I ran a test:

  1. Start a task with an external mount.
  2. Kill the Mesos slave process to simulate a crash.
  3. The task terminates.
  4. The Mesos slave service is restarted.

I did not observe a cleanup call related to the task, resulting in a mount leak.

What are the guarantees in both cases: a slave service crash and a full cluster node restart?
Should cleanup always be called for every container that was running before the crash or restart?

The current implementation is based on the assumption that if the container is still running, it will be in the orphans list on the recover call. If it is on the orphans list, unmount is NOT called. Are you saying that there is a scenario where a container can remain running and not be in the orphans list?

A container is called an orphan if it's not in 'ContainerState' but the launcher detects it (e.g., through cgroups). This is common when the operator wants to do an upgrade that requires killing all existing containers. The operator usually does this by stopping the agent, wiping the metadata directory, and restarting the slave. In that case, 'inUseMounts' does not contain the mounts for that orphan container. Since you checkpointed it, it'll be considered a legacy mount ('legacyMounts'), and umount will be called on it even though the launcher hasn't cleaned it up yet.
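For reference, the isolator recovery entry point looks roughly like this in the Mesos versions from that era (the exact signature varies by release):

```cpp
// Approximate recovery entry point from mesos/slave/isolator.hpp:
//
//   virtual process::Future<Nothing> recover(
//       const std::list<mesos::slave::ContainerState>& states,
//       const hashset<ContainerID>& orphans);
//
// `states`  - checkpointed ContainerState entries (containers the
//             agent believes are running).
// `orphans` - containers the launcher detected (e.g., via cgroups)
//             without checkpointed state; the agent destroys these
//             asynchronously after recovery.
```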

OK, so I understand you are saying the mount will be present, and ContainerState will NOT contain a record for the container.

The current recover implementation is:
  - the mount will be in legacyMounts
  - the mount will not be in inUseMounts

Unmount is called for any mount that is in legacyMounts but not in inUseMounts.
There is no other unmount processing.
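In other words (a sketch only; `ExternalMount`, its hash support, and `unmount()` are stand-ins for the module's actual types and helpers):

```cpp
#include <stout/foreach.hpp>
#include <stout/hashset.hpp>

// Sketch of the current recover() behavior described above.
void unmountLegacyMounts(
    const hashset<ExternalMount>& legacyMounts,
    const hashset<ExternalMount>& inUseMounts)
{
  foreach (const ExternalMount& mount, legacyMounts) {
    if (!inUseMounts.contains(mount)) {
      // A checkpointed mount that no recovered container claims is
      // unmounted immediately; recovery performs no other unmounts.
      unmount(mount);
    }
  }
}
```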

So the change you propose would result in no unmount calls in recover() under any circumstances.

This might be OK for the scenario you outlined of stopping the agent, wiping the metadata directory and restarting, assuming cleanup will always be called for every container not running at restart.

But what about other scenarios:

  1. Kill -9 the agent (to simulate a crash) and restart it, with no metadata directory wipe.
    1a. Container still running when the agent restarts - should be OK; I expect the container will be on the ContainerState list and no unmount will be called, which is what you want.
    1b. Container not still running when the agent restarts - it seems like you need an unmount, unless the agent will make a cleanup() call for the container. Will this happen?

For the 1b case, the container will still be in the ContainerState list because the agent believes it's still running. The containerizer will reap the checkpointed pid and find that the container has terminated. It'll subsequently call 'destroy', resulting in 'cleanup' being called.
https://github.com/apache/mesos/blob/master/src/slave/containerizer/mesos/containerizer.cpp#L573
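So on the isolator side, the unmount for that container happens in cleanup() once destroy finishes, continuing the earlier sketch (`Info`, `infos`, `Mount`, and `unmountVolume` remain placeholders):

```cpp
// Once the containerizer destroys the container, cleanup() runs and
// can safely release the mounts tracked during recover().
Future<Nothing> DockerVolumeDriverIsolatorProcess::cleanup(
    const ContainerID& containerId)
{
  if (infos.contains(containerId)) {
    foreach (const Mount& mount, infos[containerId]->mounts) {
      // Safe now: the container has been destroyed, so nothing is
      // using the volume anymore.
      unmountVolume(mount);
    }
    infos.erase(containerId);
  }

  // Unknown container: nothing for this isolator to clean up.
  return Nothing();
}
```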