vwxyzjn / cleanrl

High-quality single file implementation of Deep Reinforcement Learning algorithms with research-friendly features (PPO, DQN, C51, DDPG, TD3, SAC, PPG)

Home Page: http://docs.cleanrl.dev

Data corruption due to run naming convention when running on Slurm/GridEngine

Bam4d opened this issue · comments

Problem Description

In many CleanRL scripts, a timestamp is used as a differentiator in the naming of the jobs:

https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ppo.py#L134

In rare cases (for example, when running on a Slurm cluster or Sun Grid Engine), two scripts may be executed within a second of each other and end up with the same timestamp.

If a shared drive is used to store results (very common for these cluster setups), the jobs can actually overwrite each other's data.
You will end up with a bunch of weirdly similar-looking runs in wandb.
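To illustrate the collision, here is a minimal sketch. It assumes the run name is built from the env id, experiment name, seed, and `int(time.time())`; the helper `timestamp_run_name` and the example values are hypothetical and only stand in for what the linked line does.

```python
def timestamp_run_name(env_id: str, exp_name: str, seed: int, now: float) -> str:
    # Assumed naming scheme: the timestamp is truncated to whole seconds.
    return f"{env_id}__{exp_name}__{seed}__{int(now)}"


# Two jobs with identical arguments that start 0.75 s apart,
# inside the same wall-clock second (timestamps are made up).
job_a = timestamp_run_name("CartPole-v1", "ppo", 1, now=1700000000.12)
job_b = timestamp_run_name("CartPole-v1", "ppo", 1, now=1700000000.87)

# Both jobs resolve to the same name, so on a shared drive
# they would write into the same run directory.
print(job_a == job_b)
```

Note that the collision only happens when every other component of the name (env id, experiment name, seed) is also identical across the two jobs.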

Current Behavior

Data is overwritten because the timestamp-based naming convention causes collisions when jobs write to a shared drive.

wandb/tensorboard might throw an error, but I've only ever seen this once.

Expected Behavior

Data should not be overwritten and runs should always have unique names.

Possible Solution

I have replaced time.time() with uuid.uuid4(), which is extremely unlikely to cause collisions.
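A minimal sketch of that replacement, assuming the same `env_id__exp_name__seed__suffix` layout for the run name. The helper `unique_run_name` is hypothetical and not part of CleanRL; only `uuid.uuid4()` comes from the standard library.

```python
import uuid


def unique_run_name(env_id: str, exp_name: str, seed: int) -> str:
    # uuid4() draws 122 random bits, so two jobs starting in the same
    # second still get different suffixes with overwhelming probability.
    return f"{env_id}__{exp_name}__{seed}__{uuid.uuid4().hex}"


# Generate many names with identical arguments, as a cluster might do
# when launching the same configuration repeatedly.
names = {unique_run_name("CartPole-v1", "ppo", 1) for _ in range(1000)}
print(len(names))
```

One trade-off: unlike a timestamp, a UUID suffix does not sort chronologically, so runs can no longer be ordered by name alone.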

Steps to Reproduce

Steps to reproduce are a bit pointless unless you have access to a fairly empty cluster; however, I believe the bug is trivial enough to be understood without repro steps.

Is the main purpose of it to run more random seeds? This should no longer be an issue with the new Slurm integration in the benchmark utility: https://docs.cleanrl.dev/get-started/benchmark-utility/#slurm-integration . It basically increments the seed per run :)

env_ids={{env_ids}}
seeds={{seeds}}
env_id=${env_ids[$SLURM_ARRAY_TASK_ID / {{len_seeds}}]}
seed=${seeds[$SLURM_ARRAY_TASK_ID % {{len_seeds}}]}

echo "Running task $SLURM_ARRAY_TASK_ID with env_id: $env_id and seed: $seed"

srun {{command}} --env-id $env_id --seed $seed #
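The index arithmetic in the template above can be sketched in Python as follows. The env ids and seeds are made-up example values, and `task_to_config` is a hypothetical helper that mirrors the two bash array lookups; it is not part of the benchmark utility.

```python
def task_to_config(task_id: int, env_ids: list[str], seeds: list[int]) -> tuple[str, int]:
    # Mirrors env_ids[$SLURM_ARRAY_TASK_ID / len_seeds] (integer division)
    # and seeds[$SLURM_ARRAY_TASK_ID % len_seeds] from the template.
    env_id = env_ids[task_id // len(seeds)]
    seed = seeds[task_id % len(seeds)]
    return env_id, seed


env_ids = ["CartPole-v1", "Acrobot-v1"]  # example values
seeds = [1, 2, 3]                        # example values

# One array task per (env_id, seed) pair.
configs = [task_to_config(i, env_ids, seeds) for i in range(len(env_ids) * len(seeds))]
for task_id, (env_id, seed) in enumerate(configs):
    print(f"task {task_id}: env_id={env_id} seed={seed}")
```

Because every array task gets a distinct (env_id, seed) pair, run names that include both values would differ even when the tasks start in the same second. This holds within a single job array; two separate submissions with the same arguments could presumably still collide.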