Cannot stop container; status gets out of sync with LXC
stefan-pdx opened this issue · comments
I came across an interesting behavior where a Docker container's status got out of sync with an lxc container.
`docker ps` showed my container (`ae23c705afdb`) with a status of running. However, whenever I try running `docker kill` or `docker stop`, the command hangs indefinitely. `docker rm` says that it cannot remove a running container. `docker inspect ae23c705afdb` confirmed the running state and showed a PID of 11780. `ps aux | grep 11780` showed a running process (`lxc-start -n ae23c705afdbbcfcd723c7bb17fbdbc7c8632da41e5e8c38bbf714a701b5b536 -f /var/lib/docker/containers/...`), but its state is shown as `D`, "uninterruptible sleep", so the process does not respond to any signals. `lxc-list` does not show the corresponding LXC container as running, so `lxc-kill xxx` does not work.
I ended up just having to do a reboot. Any thoughts on this?
I think it would be useful to attach to the `lxc-start` process and take a look at where it blocks: `gdb -p <PID>` and then issue the `bt` command to show a stack trace. Output of `strace -fp <PID>` can be helpful too.
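The same state information can also be read from `/proc` without attaching a debugger. A minimal sketch, using the current shell's PID as a stand-in for the hung `lxc-start` PID:

```shell
# Stand-in PID for demonstration; substitute the hung lxc-start PID.
pid=$$

# Field 3 of /proc/<pid>/stat is the process state:
# "D" = uninterruptible sleep (stuck in a syscall, ignores signals).
state=$(awk '{print $3}' "/proc/$pid/stat")
echo "state: $state"

# For a D-state process, the kernel stack shows where it is blocked:
# sudo cat /proc/$pid/stack

# Userspace backtrace and syscall trace, as suggested above:
# sudo gdb -p "$pid" -batch -ex bt
# sudo strace -f -p "$pid"
```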
BTW, Google took me to a Launchpad bug which looks similar.
Sounds like it might be related to #1300?
@nekto0n @pwaller I faced the same problem. However, I am seeing it with a simple `sleep` command.
- `lxc-ls` and `lxc-kill` don't work on the Docker container even when it is running.
- `lxc-info` shows that the container is STOPPED, whereas `docker ps` shows it as UP.
For example, I run:
$ docker run -i -t ubuntu sleep 600
On another terminal
$ docker ps
ID IMAGE COMMAND CREATED STATUS PORTS
1c49f1d5ccd4 ubuntu:12.04 sleep 600 33 seconds ago Up 32 seconds
$
$ ps -eaf | grep lxc-start
root 6352 1698 0 16:23 pts/16 00:00:00 lxc-start -n 1c49f1d5ccd41a7436596f6fdbc53158986f51522caefc166bc88cb248997e30 -f /var/lib/docker/containers/1c49f1d5ccd41a7436596f6fdbc53158986f51522caefc166bc88cb248997e30/config.lxc -- /.dockerinit -g 172.17.42.1 -e TERM=xterm -e HOME=/ -e PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin -e container=lxc -e HOSTNAME=1c49f1d5ccd4 -- sleep 600
$
$ sudo lxc-info -n 1c49f1d5ccd41a7436596f6fdbc53158986f51522caefc166bc88cb248997e30
state: STOPPED
mahendra@kautilya:~/affirm/salt/srv/salt$ sudo docker ps
ID IMAGE COMMAND CREATED STATUS PORTS
1c49f1d5ccd4 ubuntu:12.04 sleep 600 About a minute ago Up About a minute
mahendra@kautilya:~/affirm/salt/srv/salt$
$
$ sudo lxc-ls
$ sudo lxc-ls
$ sudo lxc-kill -n 1c49f1d5ccd41a7436596f6fdbc53158986f51522caefc166bc88cb248997e30 15
lxc-kill: failed to get the init pid
$ sudo docker ps
ID IMAGE COMMAND CREATED STATUS PORTS
1c49f1d5ccd4 ubuntu:12.04 sleep 600 2 minutes ago Up 2 minutes
My system information
$ docker version
Client version: 0.6.4
Go version (client): go1.1.2
Git commit (client): 2f74b1c
Server version: 0.6.4
Git commit (server): 2f74b1c
Go version (server): go1.1.2
Last stable version: 0.6.4
$ uname -a
Linux kautilya 3.11.0-12-generic #19-Ubuntu SMP Wed Oct 9 16:20:46 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
$
I just got a similar one: `docker stop` would not stop the container, it simply hung. When I ran `docker inspect` it had a PID logged, but when I looked up that PID it was not there. Eventually I restarted the Docker service and `docker stop` started working. Then I was able to `docker rm` the container, but got:
Unable to remove filesystem for ac772babe9ba8ed8dc1369fb59ea07ac0e82c48002c3feb31635aaff4a414679: remove /var/lib/docker/containers/ac772babe9ba8ed8dc1369fb59ea07ac0e82c48002c3feb31635aaff4a414679/rw: device or resource busy
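For what it's worth, "device or resource busy" on removal usually means the container's `rw` layer is still mounted or some process holds a file open in it. A hedged diagnostic sketch (the container ID is taken from the error above; `fuser`/`lsof` may need installing):

```shell
# Container ID taken from the error message above.
id=ac772babe9ba8ed8dc1369fb59ea07ac0e82c48002c3feb31635aaff4a414679

# Is the container's filesystem still mounted somewhere?
grep "$id" /proc/mounts || echo "no lingering mount for $id"

# If a mount lingers, these show which processes keep it busy:
# sudo fuser -vm "/var/lib/docker/containers/$id/rw"
# sudo lsof +D "/var/lib/docker/containers/$id"

# After stopping the offenders, a lazy unmount often unblocks the rm:
# sudo umount -l "/var/lib/docker/containers/$id/rw"
```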
Tentatively scheduling for 0.8.
I think this will be affected by the execution drivers work, so @crosbymichael and @creack I'll assign it to one of you.
We had similar problems with non-stoppable containers/hanging processes and subsequently locked files on the following configuration (Red Hat EL 6.4), with a Docker container whose CMD issues the Tomcat start command `catalina.sh run`:
Linux solv213 3.8.13-13.el6uek.x86_64 #1 SMP Wed Aug 21 14:28:36 PDT 2013 x86_64 x86_64 x86_64 GNU/Linux
Docker version:
Client version: 0.8.0
Go version (client): go1.2
Git commit (client): cc3a8c8/0.8.0
Server version: 0.8.0
Git commit (server): cc3a8c8/0.8.0
Go version (server): go1.2
Changing the CMD to `/bin/bash -c "startup.sh; while [ true ]; do sleep 1; done;"` made stopping/redeployments work.
Is there any advice on what kind of process the CMD instruction should start?
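One likely reason such a wrapper changes behavior: the CMD process runs as PID 1 inside the container, and PID 1 gets no default signal handlers, so the SIGTERM sent by `docker stop` can be ignored until the timeout's SIGKILL. A sketch of an entrypoint that traps and forwards the signal; the `sleep 30` child and the self-sent signal are stand-ins for the real service and for `docker stop`:

```shell
#!/bin/sh
# Sketch of a signal-forwarding entrypoint. `sleep 30` stands in for the
# real service (e.g. catalina.sh run).
sleep 30 &
child=$!

# Forward SIGTERM/SIGINT to the child instead of ignoring them as PID 1 would.
trap 'kill -TERM "$child" 2>/dev/null' TERM INT

# Demo only: pretend `docker stop` sends us SIGTERM after one second.
( sleep 1; kill -TERM $$ ) &

wait "$child" || true          # returns when the signal arrives
wait "$child" 2>/dev/null || true  # reap the child after forwarding
echo "service stopped"
```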
Similar issue on an Ubuntu 13.04 box on which we have a few containers running:
Linux box 3.8.0-19-generic #30-Ubuntu SMP Wed May 1 16:35:23 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
which left about 300 MB of free memory. Docker is the latest one, v0.8.1:
Client version: 0.8.1
Go version (client): go1.2
Git commit (client): a1598d1
Server version: 0.8.1
Git commit (server): a1598d1
Go version (server): go1.2
Last stable version: 0.8.1
Some problems described at http://phusion.github.io/baseimage-docker/ might be relevant. Whether their proposed baseimage is a good solution is for everyone to decide for themselves :)
In order to minimize surprises and avoid too many issues, I would propose adding some details to the Docker documentation.
Is this bug still present now that Docker uses straight libcontainer by default?
Does anyone have a good way to reproduce this?
@crosbymichael I'm pretty sure bootstrapping and starting Discourse with devicemapper is broken (https://github.com/discourse/discourse_docker); follow the guide with devicemapper picked (edit out the line in `./launcher` that does the pre-req).
Any news on this issue? We are experiencing the same on
Linux sv-arg-bld-d1 2.6.32-431.23.3.el6.x86_64 #1 SMP Wed Jul 16 06:12:23 EDT 2014 x86_64 x86_64 x86_64 GNU/Linux
Docker version 1.0.0, build 63fe64c/1.0.0
At some point `docker ps` is no longer up to date and stopping/starting containers doesn't work anymore. Restarting the Docker daemon at that point leaves a lot of broken containers due to volume mounting issues (`umount` doesn't work).
@mrdfuse That kernel is outdated and it needs to be updated.
Please keep your systems up to date using system provided packages.
@ashahab-altiscale Can you look into this, please?
@unclejack Looking at this.
I checked with the company that does our infrastructure. There are two support programs from Red Hat Enterprise: either you install the DVD version and get support on that, or you choose the daily version and only get support when you constantly update all your packages.
That company manages hundreds of servers for us, which need to remain as stable as possible (financial environment). In such an environment it is simply not done to constantly change packages/kernels. They are now upgrading to Red Hat 6.6; the migration trajectory will take a couple of months.
I'm not saying I expect the Docker devs to keep supporting older kernels/packages; I'm only trying to explain that you can't expect everyone to always use the latest and greatest. Docker is fairly new and, as I understand it, depends upon kernel features and bug fixes in later packages. As such I think Docker is not (yet) fit for us. Again, not blaming anyone; I understand you choose to only support later kernels/packages.
@mrdfuse
I cannot reproduce this:
12:32:39-ashahab~$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
4a7c657c9586 ubuntu:latest "sleep 600" 12 seconds ago Up 10 seconds grave_goldstine
12:32:41-ashahab~$ ps -eaf | grep lxc-start
root 11766 1522 1 12:32 pts/3 00:00:00 lxc-start -n 4a7c657c9586d5ec8d83abd55ee8a39ace855888a4053f5f099b5d2e79ccd06d -f /var/lib/docker/containers/4a7c657c9586d5ec8d83abd55ee8a39ace855888a4053f5f099b5d2e79ccd06d/config.lxc -- /.dockerinit -g 172.17.42.1 -i 172.17.1.151/16 -mtu 1500 -- sleep 600
ashahab 12179 11915 0 12:32 pts/5 00:00:00 grep lxc-start
12:32:59-ashahab~$ sudo lxc-info -n 4a7c657c9586d5ec8d83abd55ee8a39ace855888a4053f5f099b5d2e79ccd06d
Name: 4a7c657c9586d5ec8d83abd55ee8a39ace855888a4053f5f099b5d2e79ccd06d
State: RUNNING
PID: 11777
IP: 172.17.1.151
CPU use: 0.77 seconds
BlkIO use: 1.79 MiB
Memory use: 1.90 MiB
KMem use: 0 bytes
12:33:23-ashahab~$ docker stop 4a7c657c9586d5ec8d83abd55ee8a39ace855888a4053f5f099b5d2e79ccd06d
4a7c657c9586d5ec8d83abd55ee8a39ace855888a4053f5f099b5d2e79ccd06d
12:33:23-ashahab~$ docker version
Client version: 1.3.2-dev
Client API version: 1.16
Go version (client): go1.3.3
Git commit (client): 320706f
OS/Arch (client): linux/amd64
Server version: 1.3.2-dev
Server API version: 1.16
Go version (server): go1.3.3
Git commit (server): 320706f
I have tried this on a 3.15 kernel.
@mrdfuse I remember the problems you're facing in your environment. However, RHEL6 should be kept up to date. The 2.6.32 kernel is actually receiving fixes and backports from newer kernels. Kernel 2.6.32 with only the features it had when it was put on kernel.org isn't supported by Docker in any way. That kernel is maintained by Red Hat to ensure that it's also OK for containers, and that's why I was recommending an update in this case. Red Hat is actually maintaining that kernel to provide their customers with a stable system that can be used for containers and many other things.
Kernels such as 3.10, 3.12, 3.13 (on Ubuntu 14.04) and 3.14 might be better, but updates are always recommended. Installing updates is more important for kernel 2.6.32 because it didn't have some of the features and fixes when it was released.
Since you're already paying that company for support, please tell them about issues like this one and the other one you've reported in that environment. They should test and upgrade to new kernels when you run into such bugs, just like they do when they need to update for security fixes. You're going to miss out on the newest fixes otherwise.
I've seen bugs go away after installing the system updates on Ubuntu and CentOS. From what I recall, it was always kernel related and that's to be expected - the kernel is being worked on all the time and fixes are pulled in all the time.
There's also no way to fix certain kernel bugs through Docker or work around them. I actually know some problems related to devicemapper were fixed through kernel updates on RHEL6 (some affected all systems).
If you have an easy way to reproduce this on your systems, please provide the exact steps and the output so we can reproduce and investigate.
In the few months I have been running Docker I only ran into this issue twice, so I highly doubt I can simply reproduce this :(
About the kernel, I thought I read here that 2.6.32-431 is the minimum version? We'll be updating to RHEL6.6 anyways in the near future, so it doesn't matter that much anymore.
Can you try with the latest version of Docker and LXC 1.0.7?
Closing as stale. Please ping me with details to reproduce on latest and I will reopen.
Seems I can reproduce it consistently with:
MacBook-Pro:mesos-logstash vik$ docker --version
Docker version 1.9.1, build a34a1d5
MacBook-Pro:mesos-logstash vik$ docker-machine --version
docker-machine version 0.5.1 (7e8e38e)
The container, whose image is based on `ubuntu:14.04`, starts with a Java process as its entry point. Based on logs from the process, it completes. However, the container remains `Up`. Attempts to execute any commands in the running container via `docker exec` return without doing anything. Attempts to kill the container hang.
From an SSH session into the docker-machine VM:
docker@minimesos:~$ docker version
Client:
Version: 1.9.1
API version: 1.21
Go version: go1.4.3
Git commit: a34a1d5
Built: Fri Nov 20 17:56:04 UTC 2015
OS/Arch: linux/amd64
Server:
Version: 1.9.1
API version: 1.21
Go version: go1.4.3
Git commit: a34a1d5
Built: Fri Nov 20 17:56:04 UTC 2015
OS/Arch: linux/amd64
docker@minimesos:~$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
43a263b7dc87 containersol/minimesos:latest "java -Dminimesos.hos" 16 minutes ago Up 16 minutes backstabbing_brattain
docker@minimesos:~$ ps -eaf | grep lxc
docker 12769 12421 0 11:25 pts/0 00:00:00 grep lxc
What should I do to get some useful debug information?
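A few generic commands that usually help in this situation, sketched below. The container ID is the one from the `docker ps` output above, and the block guards itself in case `docker` is not on the current host:

```shell
# Container ID taken from the `docker ps` output above.
id=43a263b7dc87

if command -v docker >/dev/null 2>&1; then
    # What the daemon believes: running flag and the PID it tracks.
    docker inspect -f 'running={{.State.Running}} pid={{.State.Pid}}' "$id" 2>/dev/null \
        || echo "container $id not found on this host"
else
    echo "docker not installed; run these inside the docker-machine VM"
fi

# Then check whether the tracked PID actually exists, and in what state
# (a "D" in the STAT column means uninterruptible sleep):
# pid=$(docker inspect -f '{{.State.Pid}}' "$id")
# ps -o pid,stat,wchan,cmd -p "$pid"

# Daemon-side view on boot2docker:
# tail -n 50 /var/log/docker.log
```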
@sadovnikov are you using the LXC driver, or the native driver?
I'm not very familiar with these technologies yet. The command I use to create the docker-machine VM is `docker-machine create -d virtualbox --virtualbox-memory 2048 --virtualbox-cpu-count 1 minimesos`. It creates:
Boot2Docker version 1.9.1, build master : cef800b - Fri Nov 20 19:33:59 UTC 2015
Docker version 1.9.1, build a34a1d5
How do I know which driver it uses?
@sadovnikov in that case, you're using the default (native) driver, so your issue is probably unrelated to the issue discussed here