clean_stale_shared_memory duplicating the master process when called in a train.py script

Question

antoinedandi opened this issue 3 months ago

To reproduce

Call clean_stale_shared_memory() at the beginning of a train.py script that is itself launched with composer in a distributed setup.

Expected behavior

The stale shared memory is cleaned at the beginning of training, and training then proceeds normally.

What I get:

The process is duplicated on GPU:0 and is never destroyed.

Saaketh Narayan · Answer 1 · Thu May 09 2024 04:17:56 GMT+0800 (China Standard Time)

Hmm, interesting... Normally, you shouldn't need to call clean_stale_shared_memory() at the start of your training script. Is this causing issues during training for you?

Karan Jariwala · Answer 2 · Fri Jun 14 2024 09:16:49 GMT+0800 (China Standard Time)

@antoinedandi clean_stale_shared_memory() removes stale open shared memory files, but if no stale files are found, it doesn't perform any action. I'm curious if the issue is truly originating from clean_stale_shared_memory(). Do you have a reproducible script we can test?
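
For anyone putting together the reproducible script requested above, here is a minimal sketch of a logging wrapper that could go at the top of train.py. It is not taken from the thread. The helper names describe_process and run_cleanup are hypothetical, and it assumes the launcher exports the usual RANK, LOCAL_RANK and WORLD_SIZE environment variables. The wrapper only prints each process's PID and rank before and after the cleanup call, so the PIDs can be compared with the process list shown for GPU:0.

```python
import os


def describe_process() -> str:
    """Return a one-line tag identifying this process in a distributed launch.

    Assumption: the launcher sets RANK, LOCAL_RANK and WORLD_SIZE. The
    defaults below only apply to a plain single-process run.
    """
    rank = os.environ.get("RANK", "0")
    local_rank = os.environ.get("LOCAL_RANK", "0")
    world_size = os.environ.get("WORLD_SIZE", "1")
    return (
        f"pid={os.getpid()} rank={rank} "
        f"local_rank={local_rank} world_size={world_size}"
    )


def run_cleanup(clean_fn) -> None:
    """Call clean_fn once, logging the process identity before and after.

    clean_fn is any zero-argument callable, for example
    clean_stale_shared_memory imported in train.py.
    """
    print(f"[before cleanup] {describe_process()}", flush=True)
    clean_fn()
    print(f"[after cleanup] {describe_process()}", flush=True)
```

In train.py this would be called as run_cleanup(clean_stale_shared_memory), using whatever import of clean_stale_shared_memory the script already has. Running once with the call and once with it commented out should show whether the extra process on GPU:0 depends on it.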