moby / moby

The Moby Project - a collaborative project for the container ecosystem to assemble container-based systems

Home Page:https://mobyproject.org/


Cannot stop container; status gets out of sync with LXC

stefan-pdx opened this issue · comments

I came across an interesting behavior where a Docker container's status got out of sync with an lxc container.

  1. docker ps showed my container (ae23c705afdb) with a status of running. However, whenever I tried running docker kill or docker stop, the command hung indefinitely. docker rm said that it could not remove a running container.
  2. docker inspect ae23c705afdb confirmed the running state and showed a PID of 11780.
  3. ps aux | grep 11780 showed a process that was running (lxc-start -n ae23c705afdbbcfcd723c7bb17fbdbc7c8632da41e5e8c38bbf714a701b5b536 -f /var/lib/docker/containers/...), but its state was shown as D, or "uninterruptible sleep". This process thus does not respond to any signals.
  4. lxc-list does not show the corresponding lxc container running, so lxc-kill xxx does not work.

I ended up just having to do a reboot. Any thoughts on this?
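
The D state noted in step 3 can also be read straight from /proc, which may be easier to script than parsing ps output. The following is a minimal sketch, assuming a Linux host with /proc mounted; proc_state and is_uninterruptible are hypothetical helper names, not part of Docker or LXC.

```shell
#!/bin/sh
# Print the one-letter scheduler state of a process (R, S, D, Z, ...).
# Reads the "State:" line of /proc/<pid>/status (Linux only).
proc_state() {
    awk '/^State:/ { print $2 }' "/proc/$1/status"
}

# Succeed if the process is in uninterruptible sleep (state D).
is_uninterruptible() {
    [ "$(proc_state "$1")" = "D" ]
}

# Usage (11780 is the PID reported in the post above):
# is_uninterruptible 11780 && echo "PID 11780 is stuck in D state"
```

A process that stays in D across repeated checks is blocked inside the kernel, which would be consistent with docker stop and docker kill hanging.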

I think it would be useful to attach to the lxc-start process and take a look at where it blocks. Run gdb -p <PID> and then issue the bt command to show a stack trace. The output of strace -fp <PID> can be helpful too.
BTW, Google took me to a Launchpad bug which looks similar.
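
One caveat on the suggestion above: gdb and strace both rely on ptrace, and attaching to a process in uninterruptible sleep typically blocks until the process leaves that state. A complementary approach is to read what the kernel itself reports for the task. The sketch below assumes a Linux host; /proc/<pid>/wchan is usually world-readable, while /proc/<pid>/stack generally needs root and is absent on some kernels, so the function falls back to a message. where_blocked is a hypothetical helper name.

```shell
#!/bin/sh
# Show where a process is blocked inside the kernel, as far as /proc allows.
where_blocked() {
    pid=$1
    if [ ! -d "/proc/$pid" ]; then
        echo "no such process: $pid" >&2
        return 1
    fi
    # Kernel function the task is sleeping in (may print 0 if it is not sleeping).
    printf 'wchan: %s\n' "$(cat "/proc/$pid/wchan" 2>/dev/null)"
    # Full kernel stack; usually root-only and not available on every kernel.
    if ! cat "/proc/$pid/stack" 2>/dev/null; then
        echo "stack: unavailable (try as root)"
    fi
    return 0
}

# Usage:
# sudo sh -c '. ./where_blocked.sh; where_blocked <PID>'
```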

Sounds like it might be related to #1300?

@nekto0n, @pwaller I faced the same problem. However, I am seeing it with a simple sleep command.

  • lxc-ls and lxc-kill don't work on the Docker container even when it is running.
  • lxc-info shows that the container is STOPPED whereas docker ps shows it as UP.

For example, I run:

$ docker run -i -t ubuntu sleep 600

In another terminal:

$ docker ps
ID                  IMAGE               COMMAND             CREATED             STATUS              PORTS
1c49f1d5ccd4        ubuntu:12.04        sleep 600           33 seconds ago      Up 32 seconds    
$
$ ps -eaf | grep lxc-start
root      6352  1698  0 16:23 pts/16   00:00:00 lxc-start -n 1c49f1d5ccd41a7436596f6fdbc53158986f51522caefc166bc88cb248997e30 -f /var/lib/docker/containers/1c49f1d5ccd41a7436596f6fdbc53158986f51522caefc166bc88cb248997e30/config.lxc -- /.dockerinit -g 172.17.42.1 -e TERM=xterm -e HOME=/ -e PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin -e container=lxc -e HOSTNAME=1c49f1d5ccd4 -- sleep 600
$
$ sudo lxc-info -n 1c49f1d5ccd41a7436596f6fdbc53158986f51522caefc166bc88cb248997e30
state:   STOPPED
mahendra@kautilya:~/affirm/salt/srv/salt$ sudo docker ps
ID                  IMAGE               COMMAND             CREATED              STATUS              PORTS
1c49f1d5ccd4        ubuntu:12.04        sleep 600           About a minute ago   Up About a minute                       
mahendra@kautilya:~/affirm/salt/srv/salt$       
$
$ sudo lxc-ls
$ sudo lxc-ls
$ sudo lxc-kill -n 1c49f1d5ccd41a7436596f6fdbc53158986f51522caefc166bc88cb248997e30 15
lxc-kill: failed to get the init pid
$ sudo docker ps
ID                  IMAGE               COMMAND             CREATED             STATUS              PORTS
1c49f1d5ccd4        ubuntu:12.04        sleep 600           2 minutes ago       Up 2 minutes                            

My system information

$ docker version
Client version: 0.6.4
Go version (client): go1.1.2
Git commit (client): 2f74b1c
Server version: 0.6.4
Git commit (server): 2f74b1c
Go version (server): go1.1.2
Last stable version: 0.6.4
$ uname -a
Linux kautilya 3.11.0-12-generic #19-Ubuntu SMP Wed Oct 9 16:20:46 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
$ 
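
The mismatch shown above (lxc-info reports STOPPED while docker ps reports Up) can be checked mechanically. This is a minimal sketch; lxc_state is a hypothetical helper that only parses text in the two lxc-info output formats quoted in this thread (`state:   STOPPED` and `State:          RUNNING`), and the commented usage assumes lxc-info is installed and run as root.

```shell
#!/bin/sh
# Extract the state word from lxc-info output read on stdin.
# Matches "state:" case-insensitively to cover both output formats.
lxc_state() {
    awk 'tolower($1) == "state:" { print $2; exit }'
}

# Usage (full container ID from the example above):
# id=1c49f1d5ccd41a7436596f6fdbc53158986f51522caefc166bc88cb248997e30
# if [ "$(sudo lxc-info -n "$id" | lxc_state)" != "RUNNING" ]; then
#     echo "LXC does not consider $id running; compare with docker ps"
# fi
```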

I just got a similar one: docker stop would not stop it; it simply hung.

When I ran inspect, it had a PID logged, but when I looked up the PID it was not there.

Eventually I restarted the docker service and stop started working. I was then able to rm the image but got:

Unable to remove filesystem for ac772babe9ba8ed8dc1369fb59ea07ac0e82c48002c3feb31635aaff4a414679: remove /var/lib/docker/containers/ac772babe9ba8ed8dc1369fb59ea07ac0e82c48002c3feb31635aaff4a414679/rw: device or resource busy
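
The situation described here, where docker inspect records a PID that no longer exists, can be detected with a short check. A minimal sketch follows, assuming a Linux host and a Docker version whose inspect command supports --format with the .State.Pid field; pid_alive and check_container are hypothetical helper names.

```shell
#!/bin/sh
# Succeed if a process with the given PID currently exists.
pid_alive() {
    [ -n "$1" ] && [ "$1" -gt 0 ] 2>/dev/null && [ -d "/proc/$1" ]
}

# Compare Docker's recorded PID for a container with the process table.
# Returns 0 if the PID exists, 1 if it is missing, 2 if inspect failed.
check_container() {
    pid=$(docker inspect --format '{{.State.Pid}}' "$1") || return 2
    if pid_alive "$pid"; then
        echo "container $1: PID $pid exists"
    else
        echo "container $1: recorded PID $pid not found; daemon state looks stale"
        return 1
    fi
}

# Usage:
# check_container ac772babe9ba
```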

Tentatively scheduling for 0.8.

I think this will be affected by the execution drivers work, so @crosbymichael and @creack I'll assign it to one of you.

We had similar problems with unstoppable containers and hanging processes, and subsequently locked files, on the following configuration (Red Hat EL 6.4), with a Docker container whose CMD issues the Tomcat start command catalina.sh run:

Linux solv213 3.8.13-13.el6uek.x86_64 #1 SMP Wed Aug 21 14:28:36 PDT 2013 x86_64 x86_64 x86_64 GNU/Linux

Docker version:

Client version: 0.8.0
Go version (client): go1.2
Git commit (client): cc3a8c8/0.8.0
Server version: 0.8.0
Git commit (server): cc3a8c8/0.8.0
Go version (server): go1.2

Changing the CMD to /bin/bash -c "startup.sh; while [ true ]; do sleep 1; done;" made the stopping/redeployments work.

Is there any advice on what the CMD instructions should start?
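
One general consideration, which may or may not be the cause of the hang reported here: docker stop works by signalling the container's main process, and a process running as PID 1 in a container only receives signals it has installed handlers for. A wrapper that starts the real service and forwards termination signals is one commonly used pattern. The sketch below is an illustration under those assumptions, not an official recommendation; run_forwarding is a hypothetical name.

```shell
#!/bin/sh
# Run a command in the background, forward TERM/INT to it, and
# return its exit status.
run_forwarding() {
    "$@" &
    child=$!
    trap 'kill -TERM "$child" 2>/dev/null' TERM INT
    wait "$child"
    status=$?
    # A trapped signal interrupts wait (status > 128) before the child
    # has exited; wait again to collect the real exit status.
    while [ "$status" -gt 128 ] && kill -0 "$child" 2>/dev/null; do
        wait "$child"
        status=$?
    done
    trap - TERM INT
    return "$status"
}

# Usage as a container command (catalina.sh run is the command from the post):
# run_forwarding catalina.sh run
```

Compared with the keep-alive loop above, this keeps the service in the foreground so the container exits when the service does.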


Similar issue on an Ubuntu 13.04 box, where we have a few containers running:

Linux box 3.8.0-19-generic #30-Ubuntu SMP Wed May 1 16:35:23 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

The running containers left about 300 MB of free memory. Docker is the latest version, v0.8.1:

Client version: 0.8.1
Go version (client): go1.2
Git commit (client): a1598d1
Server version: 0.8.1
Git commit (server): a1598d1
Go version (server): go1.2
Last stable version: 0.8.1

Some problems described at http://phusion.github.io/baseimage-docker/ might be relevant. Whether the solution proposed by their baseimage is a good one is for everyone to decide for themselves :)

To minimize surprises and avoid too many issues, I would propose adding some details to the Docker documentation.

Is this bug still present now that Docker uses straight libcontainer by default?

Does anyone have a good way to reproduce this?


@crosbymichael pretty sure bootstrapping and starting Discourse with device mapper is broken https://github.com/discourse/discourse_docker , follow the guide with DM picked (edit out the line in ./launcher that does the pre-req)

Any news on this issue? We are experiencing the same on
Linux sv-arg-bld-d1 2.6.32-431.23.3.el6.x86_64 #1 SMP Wed Jul 16 06:12:23 EDT 2014 x86_64 x86_64 x86_64 GNU/Linux
Docker version 1.0.0, build 63fe64c/1.0.0

At some point docker ps is no longer up to date, and stopping/starting containers doesn't work anymore. Restarting the docker daemon at that point leaves a lot of broken containers due to volume mounting issues (umount'ing doesn't work).
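
A failed umount or a "device or resource busy" error on removal often means a mount under the container's directory is still active. The sketch below lists such mounts; it assumes a Linux host, reads /proc/mounts by default (so it only sees the current mount namespace), and mounts_under is a hypothetical helper name.

```shell
#!/bin/sh
# List mount points that start with the given path prefix.
# $1: path prefix, $2: mounts table to read (defaults to /proc/mounts).
mounts_under() {
    prefix=$1
    mounts_file=${2:-/proc/mounts}
    # Field 2 of each line is the mount point.
    awk -v p="$prefix" 'index($2, p) == 1 { print $2 }' "$mounts_file"
}

# Usage:
# mounts_under /var/lib/docker/containers/<container-id>
```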

@mrdfuse That kernel is outdated and it needs to be updated.

Please keep your systems up to date using system provided packages.

@ashahab-altiscale Can you look into this, please?

@unclejack Looking at this.

I checked with the company that runs our infrastructure. There are two Red Hat Enterprise support programs: either you install the DVD version and get support on that, or you choose the daily version and only get support if you constantly update all your packages.
That company manages hundreds of servers for us, which need to remain as stable as possible (financial environment). In such an environment, constantly changing packages/kernels is simply not done. They are now upgrading to RedHat 6.6; the migration will take a couple of months.

I'm not saying I expect the Docker devs to keep supporting older kernels/packages; I'm only trying to explain that you can't expect everyone to always use the latest and greatest. Docker is fairly new and, as I understand it, depends on kernel features and bugfixes in later packages. As such, I think Docker is not (yet) a fit for us. Again, I'm not blaming anyone; I understand you choose to support only later kernels/packages.

@mrdfuse
I cannot reproduce this:

12:32:39-ashahab~$ docker ps
CONTAINER ID        IMAGE               COMMAND                CREATED             STATUS              PORTS               NAMES
4a7c657c9586        ubuntu:latest       "sleep 600"            12 seconds ago      Up 10 seconds                           grave_goldstine        
12:32:41-ashahab~$ ps -eaf | grep lxc-start
root     11766  1522  1 12:32 pts/3    00:00:00 lxc-start -n 4a7c657c9586d5ec8d83abd55ee8a39ace855888a4053f5f099b5d2e79ccd06d -f /var/lib/docker/containers/4a7c657c9586d5ec8d83abd55ee8a39ace855888a4053f5f099b5d2e79ccd06d/config.lxc -- /.dockerinit -g 172.17.42.1 -i 172.17.1.151/16 -mtu 1500 -- sleep 600
ashahab  12179 11915  0 12:32 pts/5    00:00:00 grep lxc-start

12:32:59-ashahab~$ sudo lxc-info -n 4a7c657c9586d5ec8d83abd55ee8a39ace855888a4053f5f099b5d2e79ccd06d
Name:           4a7c657c9586d5ec8d83abd55ee8a39ace855888a4053f5f099b5d2e79ccd06d
State:          RUNNING
PID:            11777
IP:             172.17.1.151
CPU use:        0.77 seconds
BlkIO use:      1.79 MiB
Memory use:     1.90 MiB
KMem use:       0 bytes
12:33:23-ashahab~$ docker stop 4a7c657c9586d5ec8d83abd55ee8a39ace855888a4053f5f099b5d2e79ccd06d
4a7c657c9586d5ec8d83abd55ee8a39ace855888a4053f5f099b5d2e79ccd06d

12:33:23-ashahab~$ docker version
Client version: 1.3.2-dev
Client API version: 1.16
Go version (client): go1.3.3
Git commit (client): 320706f
OS/Arch (client): linux/amd64
Server version: 1.3.2-dev
Server API version: 1.16
Go version (server): go1.3.3
Git commit (server): 320706f

I have tried this on a 3.15 kernel.

@mrdfuse I remember the problems you're facing in your environment. However, RHEL6 should be kept up to date. The 2.6.32 kernel is actually receiving fixes and backports from newer kernels. Kernel 2.6.32 with only the features it had when it was released on kernel.org isn't supported by Docker in any way. That kernel is maintained by Red Hat to ensure that it's also OK for containers, and that's why I was recommending an update in this case. Red Hat is actually maintaining that kernel to provide their customers with a stable system to be used for containers and many other things.

Kernels such as 3.10, 3.12, 3.13 (on Ubuntu 14.04) and 3.14 might be better, but updates are always recommended. Installing updates is more important for kernel 2.6.32 because it didn't have some of the features and fixes when it was released.

Since you're already paying that company for support, please tell them about issues like this one and the other one you've reported in that environment. They should test and upgrade to new kernels when you run into such bugs, just like they do when they need to update for security fixes. You're going to miss out on the newest fixes otherwise.

I've seen bugs go away after installing the system updates on Ubuntu and CentOS. From what I recall, it was always kernel related and that's to be expected - the kernel is being worked on all the time and fixes are pulled in all the time.

There's also no way to fix certain kernel bugs through Docker or work around them. I actually know some problems related to devicemapper were fixed through kernel updates on RHEL6 (some affected all systems).

If you have an easy way to reproduce this on your systems, please provide the exact steps and the output so we can reproduce and investigate.

In the few months I have been running Docker I only ran into this issue twice, so I highly doubt I can simply reproduce this :(

About the kernel, I thought I read here that 2.6.32-431 is the minimum version? We'll be updating to RHEL6.6 anyways in the near future, so it doesn't matter that much anymore.

Can you try with the latest version of Docker and LXC 1.0.7?

Closing as stale. Please ping me with details to reproduce on the latest version and I will reopen.

Seems I can reproduce it consistently with

MacBook-Pro:mesos-logstash vik$ docker --version
Docker version 1.9.1, build a34a1d5
MacBook-Pro:mesos-logstash vik$ docker-machine --version
docker-machine version 0.5.1 (7e8e38e)

The container, whose image is based on ubuntu:14.04, starts with a Java process as its entry point. Based on the logs from the process, it completes. However, the container remains Up. Attempts to execute any commands in the running container with docker exec return without doing anything. Attempts to kill the container hang.

From an SSH session on the docker-machine VM:

docker@minimesos:~$ docker version
Client:
 Version:      1.9.1
 API version:  1.21
 Go version:   go1.4.3
 Git commit:   a34a1d5
 Built:        Fri Nov 20 17:56:04 UTC 2015
 OS/Arch:      linux/amd64

Server:
 Version:      1.9.1
 API version:  1.21
 Go version:   go1.4.3
 Git commit:   a34a1d5
 Built:        Fri Nov 20 17:56:04 UTC 2015
 OS/Arch:      linux/amd64

docker@minimesos:~$ docker ps
CONTAINER ID        IMAGE                           COMMAND                  CREATED             STATUS              PORTS               NAMES
43a263b7dc87        containersol/minimesos:latest   "java -Dminimesos.hos"   16 minutes ago      Up 16 minutes                           backstabbing_brattain
docker@minimesos:~$ ps -eaf | grep lxc
docker   12769 12421  0 11:25 pts/0    00:00:00 grep lxc

What should I do to get some useful debug information?

@sadovnikov are you using the LXC driver, or the native driver?

I'm not very familiar with these technologies yet. The command I use to create docker-machine VM is docker-machine create -d virtualbox --virtualbox-memory 2048 --virtualbox-cpu-count 1 minimesos. It creates

Boot2Docker version 1.9.1, build master : cef800b - Fri Nov 20 19:33:59 UTC 2015
Docker version 1.9.1, build a34a1d5

How do I know whether I'm using the LXC driver?

@sadovnikov in that case, you're using the default (native) driver, so your issue is probably unrelated to the issue discussed here
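
For Docker releases of this era, the execution driver in use can usually be read from docker info, which prints an "Execution Driver" line (for example native-0.2 for the default driver). A minimal sketch, assuming that output format; exec_driver is a hypothetical helper that only parses text, and newer Docker releases may not print this line at all.

```shell
#!/bin/sh
# Print the value of the "Execution Driver:" line from docker info
# output read on stdin; prints nothing if the line is absent.
exec_driver() {
    sed -n 's/^[[:space:]]*Execution Driver:[[:space:]]*//p'
}

# Usage:
# docker info 2>/dev/null | exec_driver
```

A value beginning with lxc would indicate the LXC driver discussed in this issue; a value beginning with native indicates the default driver.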