brettwooldridge / NuProcess

Low-overhead, non-blocking I/O, external Process implementation for Java

unshare(CLONE_FS) failed, return code: -1, last error: 1

MasseGuillaume opened this issue · comments

Exception in thread "NuProcessLinuxCwdChangeable-1" java.lang.RuntimeException: unshare(CLONE_FS) failed, return code: -1, last error: 1
        at com.zaxxer.nuprocess.internal.BasePosixProcess.checkReturnCode(BasePosixProcess.java:793)
        at com.zaxxer.nuprocess.internal.BasePosixProcess$LinuxCwdThreadFactory$1.run(BasePosixProcess.java:97)
        at java.lang.Thread.run(Thread.java:748)

We saw this issue previously when Travis CI changed its default permissions; see #80 (comment).

This is the commit that resolved the same issue in our own travis build environment: 72c9d8e

Following the links from there might help you find the root cause. It has something to do with the CAP_SYS_ADMIN capability in Docker containers.

@lfbayer Thanks for the link. It looks like we upgraded our Linux kernel and hit the linked issue. I will check whether it's possible to loosen the security settings.

commented

@brettwooldridge @lfbayer I'm running into this issue. If I understand it correctly, this effectively means that nuprocess cannot be run within docker because of the lack of CLONE_FS permissions in non-privileged agents. Those using Travis can work around it by using sudo: required (which disables docker), and those forced to use docker can run privileged agents with userns_mode=host (which is dangerous).

However, there are legitimate cases where enabling privileged agents is not possible, especially when nuprocess powers a build tool or compile/test server (as we're doing in Bloop) that practically any CI will run within docker to compile and test code. We love nuprocess, but we cannot ask those users to enable privileged agents in order to use our tool.

Would it be possible to avoid the use of unshare(CLONE_FS) in the linux base posix process implementation and use something else instead? I could make a PR if I have the right pointers. Thank you in advance.

We use docker all the time without privileged mode, so the problem is something other than nuprocess being unable to run in docker. I would need to investigate more to understand the specific root cause.

commented

Interesting. Do you think it has to do with the storage driver? I'm using overlay2.

@jvican @lfbayer I think it comes down to how the container (docker) is configured. By default Docker, on CentOS, runs in "unconfined" mode, which is permissive AFAIK. "Desktop" versions, such as that for Mac, are even more permissive.

Unshare is needed to support changing the current working directory of a subprocess (without that change affecting the Java process or other subprocesses). If you do not need that feature, we can put in a change (likely a system property) to avoid unshare.
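To make "a working directory per subprocess" concrete, here is a minimal sketch using the JDK's own `ProcessBuilder`, not NuProcess. It only illustrates the behaviour being discussed: the child starts in its own directory and the parent JVM is left alone. The class and method names are made up for illustration, and it assumes a POSIX `sh` is on the path.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Hypothetical demo class; it is not part of NuProcess.
public class ChildCwdDemo {

    /** Starts `pwd -P` in the given directory and returns the line the child printed. */
    static String childWorkingDir(File dir) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder("sh", "-c", "pwd -P");
        pb.directory(dir);            // working directory for this child only
        pb.redirectErrorStream(true); // merge stderr so a failure is visible in the output
        Process p = pb.start();
        String line;
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream(), StandardCharsets.UTF_8))) {
            line = r.readLine();
        }
        p.waitFor();
        return line;
    }

    public static void main(String[] args) throws Exception {
        File dir = Files.createTempDirectory("cwd-demo").toFile();
        System.out.println("child cwd  : " + childWorkingDir(dir));
        // The JVM's own notion of its working directory is not touched by the child.
        System.out.println("parent cwd : " + System.getProperty("user.dir"));
        dir.delete();
    }
}
```

The blocking `ProcessBuilder` API is shown only because it ships with every JDK; it says nothing about how NuProcess implements the same feature internally.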

There is another hack that I've been looking at ... using JNA to invoke the native spawn process code in jvm.so ... but I haven't had time to spend on it.

commented

Thank you for the pointers. I was assuming that privileged would allow me to run in unconfined mode, but that wasn't the case. I fixed this issue by allowing CAP_SYS_ADMIN in my seccomp configuration. That usually translates to adding cap_add: SYS_ADMIN to a docker-compose file, or passing in the corresponding --cap-add=SYS_ADMIN flag to docker. NOTE: Remember that you need to rebuild the images with --build.
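When debugging this kind of failure, it can help to confirm from inside the container whether the capability was actually granted. The sketch below reads the `CapEff` mask from `/proc/self/status` and tests bit 21, which is `CAP_SYS_ADMIN` in `<linux/capability.h>`. The class and method names are hypothetical, and the check is diagnostic only: "last error: 1" is `EPERM` on Linux, and a seccomp filter can still reject `unshare` even when the capability bit is set.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical diagnostic helper; it is not part of NuProcess or Docker.
public class CapabilityCheck {

    static final int CAP_SYS_ADMIN = 21; // capability number from <linux/capability.h>

    /** Returns the effective capability mask found in the status lines, or 0 if no CapEff line exists. */
    static long parseCapEff(List<String> statusLines) {
        for (String line : statusLines) {
            if (line.startsWith("CapEff:")) {
                String hex = line.substring("CapEff:".length()).trim();
                return Long.parseUnsignedLong(hex, 16);
            }
        }
        return 0L;
    }

    /** True if the bit for the given capability number is set in the mask. */
    static boolean hasCapability(long mask, int capability) {
        return (mask & (1L << capability)) != 0;
    }

    public static void main(String[] args) throws IOException {
        Path status = Paths.get("/proc/self/status");
        if (!Files.isReadable(status)) {
            System.out.println("/proc/self/status is not readable (probably not Linux)");
            return;
        }
        long capEff = parseCapEff(Files.readAllLines(status, StandardCharsets.UTF_8));
        System.out.println("CAP_SYS_ADMIN effective: " + hasCapability(capEff, CAP_SYS_ADMIN));
    }
}
```

Running this inside the CI container before and after changing the Docker configuration shows whether the setting reached the JVM process at all.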

> There is another hack that I've been looking at ... using JNA to invoke the native spawn process code in jvm.so ... but I haven't had time to spend on it.

I agree that long-term this is the best solution.

UPDATE: Oops, looks like that didn't work either.

commented

After a lot of work in our Docker-based CI infrastructure, I haven't been able to make this exception go away, not even by enabling unconfined and privileged together for all our infrastructure. I don't know what's up.

> There is another hack that I've been looking at ... using JNA to invoke the native spawn process code in jvm.so ... but I haven't had time to spend on it.

Is there anything I can do to help move this forward?

I’ve been working on it, you can follow some of the discussion here:

https://groups.google.com/forum/?nomobile=true#!topic/jna-users/VvSct3YstXA

A response from earlier today gives me some ideas to pursue. Despite the issues I’ve run into, it looks eminently doable.

I don’t have a lot of time to devote to it, but I’m chipping away, 30 minutes at a time.

commented

Happy to see it's doable. Let me know when you get it working (to help you test it) or if you need help. I'm looking forward to this change.

@jvican Good news. Using a small test harness, I was able to successfully spawn a process using the internal JVM native API. I'll be working over the next few days (mainly this weekend) to attempt integration into the NuProcess source.

Once complete, barring any unsolvable issues, the following issues will be fixed: #13, #73, #85.

EDIT: And, of course, this one.

@jvican I have committed changes that use the native JVM process spawning code. All tests are passing on OS X and Linux on JDK 7 and JDK 8. Could you clone and build the repo and let me know whether it addresses your issue?

commented

Fabulous! I just tried it and the error is gone. Thank you for the hard work, the new 1.2.0 release is 💯. FYI, it also passed my tests on Windows.

Fixed in v1.2.0