qjchen1972 / testML

Quickly testing deep learning environments

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

testML

Test code built based on Python>=2.0

This training script can be run both on a single gpu in debug mode, and also in a larger training run with distributed data parallel (ddp).

To run on a single GPU, example:

$ python play.py

To run with DDP on 4 gpus on 1 node, example:

$ torchrun --standalone --nproc_per_node=4 play.py

To run with DDP on 4 gpus across 2 nodes, example:

  • Run on the first (master) node with example IP 123.456.123.456: $ torchrun --nproc_per_node=8 --nnodes=2 --node_rank=0 --master_addr=123.456.123.456 --master_port=1234 play.py
  • Run on the worker node: $ torchrun --nproc_per_node=8 --nnodes=2 --node_rank=1 --master_addr=123.456.123.456 --master_port=1234 play.py (If your cluster does not have Infiniband interconnect prepend NCCL_IB_DISABLE=1)

About

Quickly testing deep learning environments


Languages

Language:Python 100.0%