ZiwenZhuang / parkour

[CoRL 2023] Robot Parkour Learning

Home Page: https://robot-parkour.github.io

Repository from GitHub: https://github.com/ZiwenZhuang/parkour

Questions About Using Multi-node GPUs to Collect Trajectories

JLCucumber opened this issue

Hi all,

I'm currently running the go2 distillation task with three GPUs on different devices (multi-node). The devices are assigned as follows:

  • train.py: laptop 4080 (12 GB)
  • collect.py: 4090 (24 GB) & A4000 (16 GB)

The key training configuration is:

  • num_envs: 128
  • collection terrain: num_rows=6, num_cols=30
  • data_dir = "/mnt/rpl_project/data"

I followed the instructions to start training: (1) set multi_process_=True and launch train.py; (2) launch collect.py on the 4090; (3) launch collect.py on the A4000.

However, I noticed that each collection device creates a separate folder to store its data. In my case, they are:

  • Folder A: Jul04_20-03-46_(....)_Jul04_20-59-01
  • Folder B: Jul06_15-00-21_(....)_Jul04_20-59-01

This caused a problem: my training device reads only one of the folders and logs the message "multiple metadata files found, using the first one". I expected both trajectory folders to be used for training; otherwise multi-GPU collection gives no speedup. Is there something I forgot to set?

I'd appreciate any help or suggestions.

Cheers

Hi,

Don't worry. That message only means the first metadata file is being used; each collect.py process writes its own metadata file into the data folder.

As long as you run the exact same collect.py on every GPU in your current experiment, everything will be fine and you can ignore this terminal output.
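
For anyone who wants to confirm that condition rather than take it on trust, here is a minimal sketch that compares every metadata file under the data folder against the first one. It assumes the metadata is stored as JSON in a file named `metadata.json` inside each collector's folder; that file name, the layout, and the helper `find_mismatched_metadata` are hypothetical and not taken from the repository.

```python
import json
from pathlib import Path


def find_mismatched_metadata(data_dir, metadata_name="metadata.json"):
    """Return (paths, mismatched) for all metadata files found under data_dir.

    `metadata_name` is an assumed file name, not confirmed by the repository.
    `mismatched` lists the files whose content differs from the first one.
    """
    paths = sorted(Path(data_dir).rglob(metadata_name))
    if not paths:
        return [], []
    # The first file in sorted order is the reference the others are checked against.
    reference = json.loads(paths[0].read_text())
    mismatched = [p for p in paths[1:] if json.loads(p.read_text()) != reference]
    return paths, mismatched


if __name__ == "__main__":
    paths, mismatched = find_mismatched_metadata("/mnt/rpl_project/data")
    print(f"{len(paths)} metadata file(s) found, {len(mismatched)} differ from the first")
```

If the mismatched list is empty, every collector wrote the same metadata, so reading only the first file loses nothing.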

Hi,

Thanks for the explanation! Just want to double check:

So it doesn't matter how many metadata files are found, as long as they all refer to the same training process, and all the collection devices are actually contributing to speeding up training. Is that right?

Thanks again for the help!
Cheers
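
One way to check empirically whether both collectors are writing data is to count the files in each collector's folder at two points in time. This is only a sketch: it assumes each collect.py process writes into its own top-level subfolder of data_dir, as described above, and `count_files_per_run` is a hypothetical helper, not part of the repository.

```python
import time
from pathlib import Path


def count_files_per_run(data_dir):
    """Map each top-level subfolder of data_dir to the number of files beneath it."""
    counts = {}
    for run_dir in sorted(Path(data_dir).iterdir()):
        if run_dir.is_dir():
            # Count regular files at any depth below this collector's folder.
            counts[run_dir.name] = sum(1 for p in run_dir.rglob("*") if p.is_file())
    return counts


if __name__ == "__main__":
    before = count_files_per_run("/mnt/rpl_project/data")
    time.sleep(60)  # arbitrary interval; adjust to how fast the collectors write
    after = count_files_per_run("/mnt/rpl_project/data")
    for name in after:
        print(name, before.get(name, 0), "->", after[name])
```

If the counts change for both folders between the two snapshots, both collectors are producing data. Whether train.py consumes both folders is a separate question that this check does not answer.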