containers / youki

A container runtime written in Rust

Home Page:https://containers.github.io/youki/

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

On some systems, mounting cgroup v1 by `setup_emulated_subsystem` may missing some subsystems

JCKeep opened this issue · comments

I'm writing an oci runtime using some youki's source code by learning, and find some bugs.

On my system, /proc/self/cgroup file be like:

# cat /proc/self/cgroup
11:perf_event:/
10:freezer:/
9:memory:/
8:pids:/
7:devices:/
6:cpuacct,cpu:/
5:blkio:/
4:cpuset:/
3:hugetlb:/
2:net_prio,net_cls:/
1:name=systemd:/user.slice/user-0.slice/session-264.scope

the cpuacct,cpu and net_prio,net_cls was unsorted, so in function mount_cgroup_v1:

        let process_cgroups: HashMap<String, String> = Process::myself()?
            .cgroups()?
            .into_iter()
            .map(|c| {
                let hierarchy = c.hierarchy;
                // When youki itself is running inside a container, the cgroup path
                // will include the path of pid-1, which needs to be stripped before
                // mounting.
                let root_pathname = root_cgroups
                    .iter()
                    .find(|c| c.hierarchy == hierarchy)
                    .map(|c| c.pathname.as_ref())
                    .unwrap_or("");
                let path = c
                    .pathname
                    .strip_prefix(root_pathname)
                    .unwrap_or(&c.pathname);
                (c.controllers.join(","), path.to_owned())
            })
            .collect();

process_cgroups will record cpuacct,cpu and net_prio,net_cls, but system comount in name cpu,cpuacct and net_cls,net_prio, so when using setup_emulated_subsystem it find no subsystem to bind.

// in function setup_emulated_subsystem, it will search `cpu,cpuacct` and `net_cls,net_prio`,
// but process_cgroups store `cpuacct,cpu` and `net_prio,net_cls`
if let Some(proc_path) = process_cgroups.get(named_hierarchy.as_ref()) {

So when I test, it will not mount some cgroupv1 subsystems and not generate error.

I fixed it on my system like this, mybe this will work for other system?

        let process_cgroups: HashMap<String, String> = Process::myself()?
            .cgroups()?
            .into_iter()
            .map(|c| {
                let hierarchy = c.hierarchy;
                // When youki itself is running inside a container, the cgroup path
                // will include the path of pid-1, which needs to be stripped before
                // mounting.
                let root_pathname = root_cgroups
                    .iter()
                    .find(|c| c.hierarchy == hierarchy)
                    .map(|c| c.pathname.as_ref())
                    .unwrap_or("");
                let path = c
                    .pathname
                    .strip_prefix(root_pathname)
                    .unwrap_or(&c.pathname);
                
                // some system the controllers may unsorted
                (c.controllers.iter().sorted().join(","), path.to_owned())
            })
            .collect();

Thanks for your report, @JCKeep

the cpuacct,cpu and net_prio,net_cls was unsorted,

What does this mean? May I ask you to give me a more specific example or something?

@utam0k youki scan /proc/self/cgroup to collect the cgroups, when call setup_emulated_subsystem to bind mount cgroup subsystems, it will result in errors but not generate error.

for example, it may not mount net_cls,net_prio subsystem, because my /proc/self/cgroup is

[root@iZbp16aq5lgolei74n44yqZ ~]# cat /proc/self/cgroup
11:perf_event:/
10:freezer:/
9:memory:/
8:pids:/
7:devices:/
6:cpuacct,cpu:/
5:blkio:/
4:cpuset:/
3:hugetlb:/
2:net_prio,net_cls:/
1:name=systemd:/user.slice/user-0.slice/session-558.scope

in setup_emulated_subsystem, it will search process_cgroups by name "net_cls,net_prio", but it store "net_prio,net_cls"

#[cfg(feature = "v1")]
    fn setup_emulated_subsystem(
        &self,
        cgroup_mount: &SpecMount,
        options: &MountOptions,
        subsystem_name: &str,
        named: bool,
        host_mount: &Path,
        process_cgroups: &HashMap<String, String>,
    ) -> Result<()> {
        tracing::debug!("Mounting (emulated) {:?} cgroup subsystem", subsystem_name);
        let named_hierarchy: Cow<str> = if named {
            format!("name={subsystem_name}").into()
        } else {
            subsystem_name.into()
        };

        // we search "net_cls,net_prio", but it store "net_prio,net_cls"
        if let Some(proc_path) = process_cgroups.get(named_hierarchy.as_ref()) {
        ....
        } else {
            tracing::warn!("Could not mount {:?} cgroup subsystem", subsystem_name);
        }

图片

I understood this problem. I have a question. How does runc handle this case? Do you know?

@utam0k runc works
图片

@JCKeep We should fix it, but I wonder how runc treats it.

@utam0k runc use the mount options of /proc/self/mountinfo to get the cgroup subsystem mounts.
On my system, /proc/self/mountinfo like:
图片
the subsystem listed in /proc/self/cgroup will match cgroup mount options.

such as:
图片

@JCKeep Which code?

@utam0k you may have a look at function mountCgroupV1 and getCgroupMountsHelper

Why don't you align to runc's implementation?