ise-uiuc / magicoder

Magicoder: Source Code Is All You Need

Home Page: https://arxiv.org/abs/2312.02120


Confusion about the training code

jaywongs opened this issue

First of all, thank you for your amazing work!
I'm attempting to replicate the training process, and I have a question about the train.py file. In your paper, you mention using two A100-80G GPUs, but I couldn't find any mention of multiprocessing or distributed training in the code. Did you use DeepSpeed for training? If not, could you provide guidance on modifying the code to make it compatible with a multi-GPU setup?
Thanks once again!

Hi, those options are passed on the shell command line, which we have not documented yet. Roughly, here is how training is invoked:

accelerate launch src/magicoder/train.py \
	--model_key $MODEL_KEY \
	--model_name_or_path $MODEL_KEY \
	--use_flash_attention True \
	--datafile_paths $DATASET_PATH \
	--output_dir $OUTPUT_DIR \
	--bf16 True \
	--num_train_epochs 2 \
	--per_device_train_batch_size 2 \
	--gradient_accumulation_steps 128 \
	--group_by_length False \
	--ddp_find_unused_parameters False \
	--optim adafactor \
	--max_grad_norm -1 \
	--warmup_steps $WARMUP_STEP \
	--learning_rate 5e-5 \
	--lr_scheduler_type linear

We will provide clearer documentation later.
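
In the meantime, note that the multi-GPU part comes from accelerate itself rather than from anything inside train.py: accelerate launch spawns one process per GPU and sets up distributed data parallelism. Here is a minimal sketch of a two-GPU setup (these are standard accelerate options; whether the run above was configured interactively or via inline flags is not documented here):

# One-time interactive setup; the answers are saved to
# ~/.cache/huggingface/accelerate/default_config.yaml and reused by
# subsequent `accelerate launch` calls.
accelerate config

# Alternatively, pass the distributed options inline, keeping the
# training flags (--model_key, --datafile_paths, etc.) exactly as in
# the command above. DeepSpeed is not required here, although
# accelerate can also drive it via --use_deepspeed.
accelerate launch --multi_gpu --num_processes 2 --mixed_precision bf16 \
	src/magicoder/train.py --model_key $MODEL_KEY --output_dir $OUTPUT_DIR

With per_device_train_batch_size 2, gradient_accumulation_steps 128, and two processes, the effective batch size works out to 2 x 128 x 2 = 512.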

Thank you for your reply; it worked! Looking forward to the documentation~

Hey, thanks for the answer! Looking forward to the full scripts.
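
Until the official scripts land, a minimal wrapper that reproduces the command above might look like the following. Every variable value below is an illustrative placeholder, not the authors' actual configuration: the model key assumes the DeepSeek-Coder base model mentioned in the paper, and the dataset path, output directory, and warmup steps are hypothetical.

#!/usr/bin/env bash
set -euo pipefail

# Illustrative placeholders only -- substitute your own values.
MODEL_KEY="deepseek-ai/deepseek-coder-6.7b-base"  # assumed base model
DATASET_PATH="data/oss_instruct.jsonl"            # hypothetical path
OUTPUT_DIR="outputs/magicoder-ds"                 # hypothetical path
WARMUP_STEP=15                                    # hypothetical value

accelerate launch src/magicoder/train.py \
	--model_key $MODEL_KEY \
	--model_name_or_path $MODEL_KEY \
	--use_flash_attention True \
	--datafile_paths $DATASET_PATH \
	--output_dir $OUTPUT_DIR \
	--bf16 True \
	--num_train_epochs 2 \
	--per_device_train_batch_size 2 \
	--gradient_accumulation_steps 128 \
	--group_by_length False \
	--ddp_find_unused_parameters False \
	--optim adafactor \
	--max_grad_norm -1 \
	--warmup_steps $WARMUP_STEP \
	--learning_rate 5e-5 \
	--lr_scheduler_type linear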

Hey, thanks for the answer. Is the clearer documentation done yet?