timoschick / pet

This repository contains the code for "Exploiting Cloze Questions for Few-Shot Text Classification and Natural Language Inference"

Home Page: https://arxiv.org/abs/2001.07676


Training Time Issue

imethanlee opened this issue · comments

Hi,

What is the expected time to train a PET model on the yelp_full dataset (with default arguments)? I started training the day before yesterday on an RTX 3090 GPU and it is still running.

Thanks.

I don't know how efficient RTX 3090s are, but on a single Nvidia GeForce 1080 Ti, training PET (not iPET) with the default parameters takes a few hours. In case you haven't fixed the issue yourself yet, could you provide me with the exact command that you used to train the model? Also, did you check (e.g., with nvidia-smi) whether the GPU is actually being used?
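As a quick sketch (not part of the PET codebase), one way to do the nvidia-smi check from Python while training runs in another process; `gpu_utilization` is a hypothetical helper name:

```python
import shutil
import subprocess
from typing import Optional


def gpu_utilization() -> Optional[str]:
    """Return nvidia-smi's utilization/memory report, or None if no NVIDIA driver is visible."""
    if shutil.which("nvidia-smi") is None:
        return None  # nvidia-smi not on PATH: no visible NVIDIA GPU
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu,memory.used", "--format=csv"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


print(gpu_utilization())
```

If utilization stays near 0% while memory usage is flat, the training process is most likely running on CPU.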

Hi @timoschick,

I am having the same issue here. I started the training on an RTX 3090 yesterday and it is still running. The command I am using is as follows:

python pet/cli.py \
    --method pet \
    --pattern_ids 0 3 5 \
    --data_dir ${DATA_DIR} \
    --model_type albert \
    --model_name_or_path albert-xxlarge-v2 \
    --task_name boolq \
    --output_dir ${OUTPUT_DIR} \
    --do_train \
    --do_eval \
    --pet_per_gpu_eval_batch_size 8 \
    --pet_per_gpu_train_batch_size 2 \
    --pet_gradient_accumulation_steps 8 \
    --pet_max_steps 250 \
    --pet_max_seq_length 256 \
    --pet_repetitions 3 \
    --sc_per_gpu_train_batch_size 2 \
    --sc_per_gpu_unlabeled_batch_size 2 \
    --sc_gradient_accumulation_steps 8 \
    --sc_max_steps 5000 \
    --sc_max_seq_length 256 \
    --sc_repetitions 1
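For what it's worth, a back-of-the-envelope calculation from the flags above (my own arithmetic, not from the repo) suggests the per-run workload is modest, so a multi-day run points at a hardware/driver problem rather than the training volume:

```python
# Effective batch size = per-GPU batch size * gradient accumulation steps
pet_effective_batch = 2 * 8  # --pet_per_gpu_train_batch_size * --pet_gradient_accumulation_steps
sc_effective_batch = 2 * 8   # --sc_per_gpu_train_batch_size * --sc_gradient_accumulation_steps

# Examples processed per run = effective batch * max steps
pet_examples = pet_effective_batch * 250   # --pet_max_steps
sc_examples = sc_effective_batch * 5000    # --sc_max_steps

print(pet_effective_batch, pet_examples)  # 16 4000
print(sc_effective_batch, sc_examples)    # 16 80000
```

Note that with `--pattern_ids 0 3 5` and `--pet_repetitions 3`, the PET stage repeats nine times, but even so the totals are small for an RTX-class GPU.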

Just a heads up -- I bumped the PyTorch version to 1.8.0 and CUDA to 11.3, and that solved the performance issue. I can now run through the first 126 epochs in about 12 minutes, compared to 1.5 hours before. I am still waiting to see whether this affects the results, but performance is much better.

@jmcrey So, were the results okay?

I'm now using a 1080 Ti, training with CUDA 11.5 and TensorRT for 3 epochs.
My pre-trained model is RoBERTa-large and the dataset is AG News; the other arguments are set to their defaults.
It looks like training will take about half a day.