mosaicml / streaming

A Data Streaming Library for Efficient Neural Network Training

Home Page: https://streaming.docs.mosaicml.com

clean_stale_shared_memory duplicating the master process when called in a train.py script

antoinedandi opened this issue

To reproduce

Call clean_stale_shared_memory() at the beginning of a train.py script that is itself launched with composer in a distributed setup.
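For illustration only, the call pattern described above might look like the sketch below. This is not the reporter's actual script. The import path streaming.base.util, the LOCAL_RANK environment variable, and the helper names describe_process and run_with_logging are assumptions made for the example. Logging the PID before and after the call can help check whether the extra process on GPU:0 appears around the cleanup.

```python
import os


def describe_process() -> str:
    """Return a one-line summary of this process for log correlation."""
    # LOCAL_RANK is assumed to be set by the distributed launcher; default to "0".
    local_rank = os.environ.get("LOCAL_RANK", "0")
    return f"pid={os.getpid()} local_rank={local_rank}"


def run_with_logging(cleanup_fn) -> None:
    """Call cleanup_fn, printing the process identity before and after."""
    print(f"[before cleanup] {describe_process()}", flush=True)
    cleanup_fn()
    print(f"[after cleanup] {describe_process()}", flush=True)


def main() -> None:
    try:
        # Assumed import path; adjust to the installed streaming version.
        from streaming.base.util import clean_stale_shared_memory
    except ImportError:
        print("streaming is not installed; skipping cleanup", flush=True)
        return
    run_with_logging(clean_stale_shared_memory)
    # ... the rest of the training script would follow here ...
```

A train.py built this way would call main() at the bottom and be started with the launcher, for example `composer -n 2 train.py` (the process count is arbitrary here). The printed PIDs can then be compared with the PIDs of the processes that remain on the GPU.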

Expected behavior

The memory is cleaned at the beginning of training, and training then proceeds normally.

What I get:
The process is duplicated on GPU:0 and is never destroyed.

Hmm, interesting... Normally, you shouldn't need to call clean_stale_shared_memory() at the start of your training script. Is this causing issues during training for you?

@antoinedandi clean_stale_shared_memory() removes stale open shared memory files; if no stale files are found, it does nothing. I'm curious whether the issue truly originates from clean_stale_shared_memory(). Do you have a reproducible script we can test?
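The "remove if stale, otherwise do nothing" behavior described above can be sketched with Python's standard library alone. This is not the streaming library's implementation. The helper name remove_if_stale is hypothetical, and the sketch assumes a POSIX system where attaching to a missing block raises FileNotFoundError.

```python
from multiprocessing import shared_memory


def remove_if_stale(name: str) -> bool:
    """Unlink the named shared memory block if it exists.

    Returns True if a block was found and removed, False if there was
    nothing to clean up (in which case no action is taken).
    """
    try:
        # Attach to an existing block; raises if no block has this name.
        shm = shared_memory.SharedMemory(name=name)
    except FileNotFoundError:
        return False
    shm.close()   # release this process's mapping
    shm.unlink()  # remove the block from the system
    return True
```

Calling remove_if_stale on a name that no longer exists returns False without side effects, which matches the no-op case mentioned in the comment.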