clean_stale_shared_memory duplicating the master process when called in a train.py script
antoinedandi opened this issue · comments
To reproduce
Call clean_stale_shared_memory() at the beginning of a train.py script that is itself launched with the composer launcher in a distributed setup.
Expected behavior
The stale shared memory is cleaned at the beginning of the run, and training then proceeds normally.
What I get instead:
The process is duplicated on GPU 0 and is never destroyed.
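As a possible mitigation while the root cause is investigated, one pattern is to guard the cleanup so that only the node-local rank-0 process performs it, rather than every rank calling it at once. This is a minimal sketch, not a confirmed fix: the helper name is hypothetical, and it assumes the launcher exports a LOCAL_RANK environment variable (as composer and most torch-style launchers do).

```python
import os

def run_cleanup_on_local_rank_zero(cleanup_fn):
    """Invoke `cleanup_fn` only on the node-local rank-0 process.

    Hypothetical helper, not part of streaming's API. Assumes the
    launcher exports LOCAL_RANK; in a single-process run, where the
    variable is absent, this defaults to rank 0 and still cleans up.
    """
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    if local_rank == 0:
        cleanup_fn()
        return True   # cleanup ran in this process
    return False      # another process on this node is responsible
```

In train.py this would be called as `run_cleanup_on_local_rank_zero(clean_stale_shared_memory)`, ideally followed by a distributed barrier so the other ranks do not start reading the dataset until the cleanup has finished.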
Hmm, interesting... normally you shouldn't need to call clean_stale_shared_memory() at the start of your training script. Is this causing issues during training for you?
@antoinedandi clean_stale_shared_memory() removes stale open shared memory files, but if no stale files are found, it doesn't perform any action. I'm curious whether the issue truly originates from clean_stale_shared_memory(). Do you have a reproducible script we can test?
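For context, the kind of "stale" state this function targets can be illustrated with Python's standard multiprocessing.shared_memory module. This is a generic sketch of how a named shared memory segment outlives the handle that created it (e.g. after a crashed run) until some process explicitly unlinks it; it is not Streaming's actual implementation, and the segment name is made up for the demo.

```python
from multiprocessing import shared_memory

NAME = "demo_stale_seg"  # hypothetical segment name for this demo

# Clear any leftover from a previous run of this demo.
try:
    shared_memory.SharedMemory(name=NAME).unlink()
except FileNotFoundError:
    pass

# Create a segment and close the handle WITHOUT unlinking, as a crashed
# process might -- the segment outlives the handle system-wide.
seg = shared_memory.SharedMemory(create=True, size=16, name=NAME)
seg.buf[:4] = b"data"
seg.close()

# A later process can still attach to the stale segment by name...
stale = shared_memory.SharedMemory(name=NAME)
assert bytes(stale.buf[:4]) == b"data"

# ...and a cleanup step unlinks it so it is gone for good.
stale.close()
stale.unlink()
```

A cleanup utility like clean_stale_shared_memory() is essentially automating that last unlink step for segments left behind by earlier runs.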