lablup / backend.ai-jail

A programmable security sandbox for Backend.AI kernels

Child count increases too much due to missing exit tracking

achimnol opened this issue

TensorFlow code often spawns many threads, and the jail reports "too many" threads even though the actual number of threads is within the configured limit.

Potential solutions:

  • Read "/proc/{pid}/status" directly to get the actual number of threads from the OS. This may incur some overhead whenever the child spawns a new process or thread.
  • Guard the childCount variable with explicit locks.
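
To make the first option concrete, here is a minimal Python sketch of reading the thread count from procfs. The jail itself is not written in Python, and the helper names `parse_thread_count` and `read_thread_count` are made up for illustration. It relies only on the "Threads:" line that Linux exposes in "/proc/{pid}/status".

```python
import os


def parse_thread_count(status_text):
    """Return the integer value of the 'Threads:' field in status text."""
    for line in status_text.splitlines():
        if line.startswith("Threads:"):
            return int(line.split(":", 1)[1].strip())
    raise ValueError("no 'Threads:' field found")


def read_thread_count(pid):
    """Read /proc/<pid>/status (Linux only) and return its thread count."""
    with open(f"/proc/{pid}/status") as f:
        return parse_thread_count(f.read())


if __name__ == "__main__" and os.path.exists("/proc/self/status"):
    # Print the number of threads of the current interpreter process.
    print(read_thread_count(os.getpid()))
```

Each call opens and parses one file, which is where the per-spawn overhead mentioned above would come from.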

Even so, TensorFlow seems to keep increasing the number of threads as we call the regressors repeatedly.
We need to find a good solution for this.

NOTE:
Even the following code produces far more threads than the number of CPU cores allocated to the container:

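import tensorflow as tf  # TensorFlow 1.x API

# Limit both thread pools to one thread and expose a single CPU device.
config = tf.ConfigProto(intra_op_parallelism_threads=1, inter_op_parallelism_threads=1,
                        allow_soft_placement=True, device_count={'CPU': 1})
session = tf.Session(config=config)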

Adding locks did not change anything, as expected: childCount is incremented and decremented within a single goroutine that receives waitpid results via a channel.
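
The reason locks are redundant here can be illustrated with a rough Python analogue of that pattern. This is not the jail's actual code: a `queue.Queue` stands in for the Go channel, and the names `run_tracker` and `simulate` are hypothetical. Many producers may post events concurrently, but only one consumer ever touches the counter.

```python
import queue
import threading


def run_tracker(events):
    """Consume 'spawn'/'exit' events until the None sentinel arrives."""
    child_count = 0  # only the thread running this function touches it
    for kind in iter(events.get, None):
        if kind == "spawn":
            child_count += 1
        elif kind == "exit":
            child_count -= 1
    return child_count


def simulate(n_producers=4, n_events=1000):
    """Post balanced spawn/exit events from several threads, then count."""
    events = queue.Queue()

    def producer():
        for _ in range(n_events):
            events.put("spawn")
            events.put("exit")

    threads = [threading.Thread(target=producer) for _ in range(n_producers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    events.put(None)  # stop sentinel, queued after all events
    return run_tracker(events)
```

Because the queue serializes the events, the counter never sees concurrent updates, so an extra lock around it cannot change the result.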

After writing a function that reads procfs to recursively collect the number of threads of all children, I found that the original jail implementation is correct and that the thread count (the "Threads" field) in "/proc/{pid}/status" includes only the direct child threads.
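
For reference, a check of that kind could look roughly like the Python sketch below. It is not the function actually written for the jail. The names `read_status` and `total_threads` and the `proc_root` parameter are assumptions added so the logic can be exercised against a fake directory tree. It finds descendants through the "PPid" field of every process and sums their "Threads" fields.

```python
import os


def read_status(pid, proc_root="/proc"):
    """Parse <proc_root>/<pid>/status into a dict of field -> raw string."""
    fields = {}
    with open(os.path.join(proc_root, str(pid), "status")) as f:
        for line in f:
            key, sep, value = line.partition(":")
            if sep:
                fields[key] = value.strip()
    return fields


def total_threads(root_pid, proc_root="/proc"):
    """Sum the 'Threads' field over root_pid and all of its descendants."""
    statuses = {}
    for entry in os.listdir(proc_root):
        if entry.isdigit():
            try:
                statuses[int(entry)] = read_status(entry, proc_root)
            except OSError:
                continue  # the process exited while we were scanning

    # Build a parent -> children map from each process's PPid field.
    children = {}
    for pid, status in statuses.items():
        children.setdefault(int(status.get("PPid", 0)), []).append(pid)

    # Walk the process tree iteratively, starting from root_pid.
    total, stack = 0, [root_pid]
    while stack:
        pid = stack.pop()
        if pid not in statuses:
            continue
        total += int(statuses[pid].get("Threads", 0))
        stack.extend(children.get(pid, []))
    return total
```

The scan is a snapshot, so processes that start or exit while it runs can make the total slightly stale.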

So we need to find a way to further reduce the number of threads used by TensorFlow itself.