krallin / tini

A tiny but valid `init` for containers

not reaping zombie or defunct child processes

bingerambo opened this issue · comments

Inside the container, I run a Python process as a child of tini, with the following steps:

  1. Run the container built from the Dockerfile below.

OS info:

[root@node3 ~]# uname -a
Linux node3 3.10.0-862.el7.x86_64 #1 SMP Fri Apr 20 16:44:24 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
[root@node3 ~]#
[root@node3 ~]#
[root@node3 ~]#
[root@node3 ~]# cat /etc/redhat-release
CentOS Linux release 7.5.1804 (Core)

docker version:

[root@node3 ~]# docker version
Client:
 Version:           18.09.6
 API version:       1.39
 Go version:        go1.10.8
 Git commit:        481bc77156
 Built:             Sat May  4 02:34:58 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          18.09.0
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.10.4
  Git commit:       4d60db4
  Built:            Wed Nov  7 00:19:08 2018
  OS/Arch:          linux/amd64
  Experimental:     false

Dockerfile:

ADD tini /tini
RUN chmod +x /tini
ENV PYTHONPATH=/root/tf/models
WORKDIR /examples

ENTRYPOINT ["/tini", "-g", "-w", "-vvv", "--", "bash"]
CMD ["-c","/usr/bin/python ~/tf/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py  --batch_size=64 --model=official_resnet18 --optimizer=momentum  --num_gpus=2 --num_epochs=1 --weight_decay=1e-4 --data_dir=/tmp/imagenet"]

process status:

tini: pid 1
python: pid 8

root@deee30ff32b4:/examples# ps axjf
  PPID    PID   PGID    SID TTY       TPGID STAT   UID   TIME COMMAND
     0    397    397    397 pts/1       410 Ss       0   0:00 bash
   397    410    410    397 pts/1       410 R+       0   0:00  \_ ps axjf
     0      1      1      1 pts/0         8 Ss       0   0:00 /tini -g -w -vvv -- bash -c /usr/bin/python ~/tf/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py  --batch_size=64 --model=official_resnet
     1      8      8      1 pts/0         8 Sl+      0   2:31 /usr/bin/python /root/tf/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --batch_size=64 --model=official_resnet18 --optimizer=momentum -
  2. Kill the python process.

Kill the python process (the child of tini; its pid is 8):

root@deee30ff32b4:/examples#
root@deee30ff32b4:/examples#
root@deee30ff32b4:/examples# kill -9 8
root@deee30ff32b4:/examples#
root@deee30ff32b4:/examples#
root@deee30ff32b4:/examples#
root@deee30ff32b4:/examples#
root@deee30ff32b4:/examples#
root@deee30ff32b4:/examples#
root@deee30ff32b4:/examples# ps axjf
  PPID    PID   PGID    SID TTY       TPGID STAT   UID   TIME COMMAND
     0    397    397    397 pts/1       411 Ss       0   0:00 bash
   397    411    411    397 pts/1       411 R+       0   0:00  \_ ps axjf
     0      1      1      1 pts/0         8 Ss       0   0:00 /tini -g -w -vvv -- bash -c /usr/bin/python ~/tf/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py  --batch_size=64 --model=official_resnet
     1      8      8      1 pts/0         8 Zl+      0   4:30 [tf_cnn_benchmar] <defunct>
  3. A zombie process remains: the python process is defunct but is not reaped by its parent (tini), so it stays a zombie.
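For context, this is the job an init process is supposed to do: collect exited children with `waitpid`, which is the call that actually removes zombie entries from the process table. The following is my own minimal sketch of that reaping loop, not tini's actual code (the real tini is written in C and also forwards signals to its child):

```python
import os
import time

def reap_children():
    """Reap every child that has already exited (non-blocking).

    Returns a list of (pid, status) pairs. Simplified sketch of what an
    init-style process does in its main loop.
    """
    reaped = []
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break                     # no children left at all
        if pid == 0:
            break                     # children exist, but none has exited yet
        reaped.append((pid, status))
    return reaped

# Demo: fork a child that exits immediately. Until waitpid collects it,
# the child sits in the process table as a zombie.
child = os.fork()
if child == 0:
    os._exit(7)

time.sleep(0.2)                       # give the child time to exit
results = reap_children()             # this is the step that removes the zombie
```

In the issue above, the equivalent `waitpid` inside tini keeps returning "no exited children", which is why the zombie is never collected.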

Output printed by tini and the python process:

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
W0810 01:44:12.306231 140481052276544 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
[TRACE tini (1)] No child to reap
Initializing graph
WARNING:tensorflow:From /root/tf/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py:2238: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
W0810 01:44:13.657472 140481052276544 deprecation.py:323] From /root/tf/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py:2238: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2020-08-10 01:44:13.911021: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0, 1
2020-08-10 01:44:13.911240: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-08-10 01:44:13.911262: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 1
2020-08-10 01:44:13.911272: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N Y
2020-08-10 01:44:13.911358: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 1:   Y N
2020-08-10 01:44:13.911696: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30555 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:3d:00.0, compute capability: 7.0)
2020-08-10 01:44:13.912073: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 30555 MB memory) -> physical GPU (device: 1, name: Tesla V100-SXM2-32GB, pci bus id: 0000:88:00.0, compute capability: 7.0)
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
INFO:tensorflow:Running local_init_op.
I0810 01:44:17.589881 140481052276544 session_manager.py:491] Running local_init_op.
INFO:tensorflow:Done running local_init_op.
I0810 01:44:17.782803 140481052276544 session_manager.py:493] Done running local_init_op.
Running warm up
[TRACE tini (1)] No child to reap
2020-08-10 01:44:18.530569: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap

Now I cannot remove the zombie process; only rebooting the machine clears it.

When the python process hangs like this, tini does not reap it, and the defunct python process remains a zombie.

I'm not sure which of the following is the reason:

  1. tini did not receive the SIGCHLD signal for the defunct child. Did tini miss handling SIGCHLD?
  2. The python process, as a child of tini, never caused a SIGCHLD to be sent to tini. Maybe it is stuck because of a hardware or program problem.
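One detail worth noting about hypothesis 2: SIGCHLD is delivered by the kernel on the parent's behalf when a child changes state; the child process does not "send" it itself, and the zombie entry persists until the parent calls `waitpid`. A small illustration of my own (not tini code):

```python
import os
import signal
import time

received = []
# Record any SIGCHLD delivered to this (parent) process.
signal.signal(signal.SIGCHLD, lambda signum, frame: received.append(signum))

child = os.fork()
if child == 0:
    # The child does nothing special on exit; it is the *kernel* that
    # notifies the parent with SIGCHLD when the child terminates.
    os._exit(0)

time.sleep(0.2)                  # give the kernel time to deliver SIGCHLD
pid, status = os.waitpid(child, 0)   # only now is the zombie entry removed
```

So if the child has genuinely terminated, SIGCHLD delivery is the kernel's job; a missing SIGCHLD usually means the child has not actually finished exiting yet.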

Having to use kill -9 is a red flag to me. See also https://serverfault.com/a/76296/58240

Do you mind sharing all the lines of output from Tini (from the point at which you start your process until the point at which you're stuck) — you can exclude the other output, but please include all the tini ones. There should at least be some additional input starting with an info line reporting what Tini spawned.

Note that the point at which you're stuck is basically a loop where Tini asks the Kernel "do I have any children that have exited?", and the Kernel is answering "you do not".

What happens if you send SIGTERM to Tini? If nothing happens, what if you send it SIGKILL? Do the processes get torn down?
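To make that concrete: tini's "No child to reap" trace corresponds to a non-blocking `waitpid` returning 0, which means children exist but none has exited. A hedged Python illustration (my own sketch, standing in for the stuck python process):

```python
import os
import time

child = os.fork()
if child == 0:
    time.sleep(5)                # child stays alive, like the stuck python process
    os._exit(0)

# Non-blocking check: returns (0, 0) while the child is still running,
# i.e. the kernel's answer is "you have no exited children" --
# exactly the state behind tini's "No child to reap" line.
pid, status = os.waitpid(-1, os.WNOHANG)
assert pid == 0

# Here SIGKILL terminates the child and makes it reapable. In the issue
# above it did not, because a thread was stuck inside kernel code.
os.kill(child, 9)
pid, status = os.waitpid(child, 0)
```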

@krallin
I am sorry that I did not save the complete log output. The python program runs the TensorFlow framework on NVIDIA GPU cards, training a deep learning job.
I did save the log context around the problem and the /proc/pid/taskid status, which I expect may be useful.

  1. The python program started and then blocked.
    Output from python and tini:
.............................................

Initializing graph
WARNING:tensorflow:From /root/tf/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py:2238: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
W0810 01:44:13.657472 140481052276544 deprecation.py:323] From /root/tf/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py:2238: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2020-08-10 01:44:13.911021: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0, 1
2020-08-10 01:44:13.911240: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-08-10 01:44:13.911262: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 1
2020-08-10 01:44:13.911272: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N Y
2020-08-10 01:44:13.911358: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 1:   Y N
2020-08-10 01:44:13.911696: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30555 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:3d:00.0, compute capability: 7.0)
2020-08-10 01:44:13.912073: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 30555 MB memory) -> physical GPU (device: 1, name: Tesla V100-SXM2-32GB, pci bus id: 0000:88:00.0, compute capability: 7.0)
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
INFO:tensorflow:Running local_init_op.
I0810 01:44:17.589881 140481052276544 session_manager.py:491] Running local_init_op.
INFO:tensorflow:Done running local_init_op.
I0810 01:44:17.782803 140481052276544 session_manager.py:493] Done running local_init_op.
Running warm up
[TRACE tini (1)] No child to reap
2020-08-10 01:44:18.530569: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
.............................................
  2. Run `docker stop` to stop the python program.
[root@node3 ~]#
[root@node3 ~]# docker stop e99a157e24f8
  3. tini output: tini received the SIGTERM signal and passed it to the child. After that, tini printed "No child to reap" the whole time.
.............................................

[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[DEBUG tini (1)] Passing signal: 'Terminated'
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap
[TRACE tini (1)] No child to reap

.............................................


  4. The python process info, /proc/pid/taskid details:
    The defunct python process (named tf_cnn_benchmar) had 2 threads:
    thread taskid 175308: status running
    thread taskid 174982: status zombie
    Why does the defunct process still contain a running thread?
top - 09:51:14 up 1 day, 1:16, 1 user,  load average: 1.21, 1.24, 1.32
Threads:   2 total,   1 running,   0 sleeping,   0 stopped,   1 zombie
%Cpu(s):  0.1 us,  1.1 sy,  0.0 ni, 98.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :   79106937+total,   72575788+free,   17352112 used,   47959384 buff/cache
KiB Swap:          0 total,          0 free,          0 used.   76956729+avail Mem

   PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
175308 root      20   0       0      0      0 R 99.9  0.0 157:58.34 tf_cnn_benchmar
174982 root      20   0       0      0      0 Z 0.0   0.0   0:06.41 tf_cnn_benchmar
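The per-thread states that top shows here (R for the spinning thread, Z for the zombie one) can also be read directly from /proc. A small Linux-specific helper of my own, for checking this without top:

```python
import os

def thread_states(pid):
    """Return {tid: state_letter} for every thread of `pid` (Linux only).

    The state letter is the third field of /proc/<pid>/task/<tid>/stat:
    R = running, S = sleeping, D = uninterruptible sleep, Z = zombie, ...
    """
    states = {}
    task_dir = f"/proc/{pid}/task"
    for tid in os.listdir(task_dir):
        with open(f"{task_dir}/{tid}/stat") as f:
            data = f.read()
        # The comm field (field 2) may contain spaces or parentheses,
        # so split after the last ')' rather than naively on whitespace.
        state = data.rsplit(")", 1)[1].split()[0]
        states[int(tid)] = state
    return states

states = thread_states(os.getpid())
```

A process only becomes reapable once every one of its threads has exited. Here one thread appears stuck spinning inside the nvidia_uvm kernel module (see the stack below), which would explain why the thread group never finishes exiting and why neither tini's `waitpid` nor SIGKILL can clear it.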

zombie thread /proc/pid/taskid/stack info:

[root@node3 174982]# cat stack
[<ffffffff8d297bcb>] do_exit+0x6bb/0xa40
[<ffffffff8d297fcf>] do_group_exit+0x3f/0xa0
[<ffffffff8d2a887e>] get_signal_to_deliver+0x1ce/0x5e0
[<ffffffff8d22a527>] do_signal+0x57/0x6e0
[<ffffffff8d22ac22>] do_notify_resume+0x72/0xc0
[<ffffffff8d91fb1d>] int_signal+0x12/0x17
[<ffffffffffffffff>] 0xffffffffffffffff

running thread /proc/pid/taskid/stack info:

[root@node3 175308]# cat stack
[<ffffffffc23182a2>] uvm_spin_loop+0xc2/0x100 [nvidia_uvm]
[<ffffffffc23519dd>] uvm_tracker_wait+0x8d/0x1a0 [nvidia_uvm]
[<ffffffffc234d74d>] uvm_page_tree_wait+0x1d/0x30 [nvidia_uvm]
[<ffffffffc234e398>] uvm_page_table_range_vec_init+0x158/0x1d0 [nvidia_uvm]
[<ffffffffc235e2d7>] uvm_va_range_map_rm_allocation+0x157/0x310 [nvidia_uvm]
[<ffffffffc235e772>] uvm_map_external_allocation_on_gpu+0x1b2/0x230 [nvidia_uvm]
[<ffffffffc235ea6b>] uvm_api_map_external_allocation+0x27b/0x4c0 [nvidia_uvm]
[<ffffffffc231a017>] uvm_unlocked_ioctl+0xd57/0xe70 [nvidia_uvm]
[<ffffffff8d42fb90>] do_vfs_ioctl+0x350/0x560
[<ffffffff8d42fe41>] SyS_ioctl+0xa1/0xc0
[<ffffffff8d91fa51>] tracesys+0x9d/0xc3
[<ffffffffffffffff>] 0xffffffffffffffff

When I sent SIGKILL, the process was not removed. Only rebooting the machine cleared it.