stanford-crfm / BioMedLM


torch.distributed.launch on eight 40 GB A100s, CUDA out of memory

zhengbiqing opened this issue · comments

I run:

export CUDA_VISIBLE_DEVICES='0,1,2,3,4,5,6,7'
task=gene
datadir=data/$task
outdir=runs/$task/GPT2
name=gene0913
checkpoint=/root/siton-glusterfs-eaxtsxdfs/xts/data/BioMedLM
python -m torch.distributed.launch --nproc_per_node=8 --nnodes=1 --node_rank=0 --use_env run_seqcls_gpt.py \
  --tokenizer_name $checkpoint --model_name_or_path $checkpoint \
  --train_file $datadir/train.json --validation_file $datadir/dev.json --test_file $datadir/test.json \
  --do_train --do_eval --do_predict \
  --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --gradient_accumulation_steps 1 \
  --learning_rate 2e-6 --warmup_ratio 0.5 --num_train_epochs 5 --max_seq_length 32 \
  --logging_steps 1 --save_strategy no --evaluation_strategy no \
  --output_dir $outdir --overwrite_output_dir --bf16 --seed 1000 --run_name $name

but I still get CUDA out of memory.
Does anyone know how many GPUs are needed to fine-tune seqcls?
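
For scale: BioMedLM has 2.7B parameters, and plain DDP replicates the full model, gradients, and optimizer state on every GPU. With fp32 AdamW that is roughly 16 bytes per parameter (weights + gradients + two moment buffers), about 43 GB per GPU before any activations, so a 40 GB A100 runs out of memory even at batch size 1. Sharding the gradients and optimizer states across the eight GPUs (for example with DeepSpeed ZeRO stage 2) avoids this. Below is a minimal sketch, assuming run_seqcls_gpt.py parses standard HuggingFace TrainingArguments so that --deepspeed and --gradient_checkpointing pass through (worth verifying against the script); the file ds_zero2.json and its contents are illustrative, not taken from the repo:

# Write a minimal (illustrative) ZeRO stage 2 config; stage 2 shards
# optimizer states and gradients across the 8 GPUs.
cat > ds_zero2.json <<'EOF'
{
  "zero_optimization": { "stage": 2 },
  "bf16": { "enabled": "auto" },
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto"
}
EOF

# Same arguments as above, launched through deepspeed instead of
# torch.distributed.launch; --gradient_checkpointing additionally trades
# compute for activation memory.
deepspeed --num_gpus=8 run_seqcls_gpt.py \
  --tokenizer_name $checkpoint --model_name_or_path $checkpoint \
  --train_file $datadir/train.json --validation_file $datadir/dev.json --test_file $datadir/test.json \
  --do_train --do_eval --do_predict \
  --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --gradient_accumulation_steps 1 \
  --learning_rate 2e-6 --warmup_ratio 0.5 --num_train_epochs 5 --max_seq_length 32 \
  --logging_steps 1 --save_strategy no --evaluation_strategy no \
  --output_dir $outdir --overwrite_output_dir --bf16 --seed 1000 --run_name $name \
  --gradient_checkpointing \
  --deepspeed ds_zero2.json

With ZeRO stage 2 the per-GPU footprint drops to roughly the ~10.8 GB fp32 weights plus a 1/8 shard of the gradients and optimizer states (about 15 GB total), leaving headroom for activations on a 40 GB card.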