ai4co / rl4co

A PyTorch library for all things Reinforcement Learning (RL) for Combinatorial Optimization (CO)

Home Page: https://rl4.co

About the training time.

yuanxuanS opened this issue · comments

Could you please tell me the training time of each experiment? I saw that you trained them on 4 GPUs simultaneously. When I transfer the approach to another problem, I need to train for more than 3 days on a single Tesla P100.

Hi @yuanxuanS !
Actually, we did not train on 4 GPUs simultaneously. We only trained the larger AM-XL version on 2x3090s at the time (with the initial version of RL4CO, which had some inefficiencies that have since been resolved). For TSP/CVRP with 50 nodes, training takes less than 7 hours with 1,280,000 samples per epoch (train_data_size), batch size 512, and 100 epochs on a single 3090 (note that we use mixed precision from Lightning plus FlashAttention), which is faster than Kool's original implementation on the same machine.
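For reference, here is a minimal sketch of a run with that configuration using the current RL4CO API. The argument names (`generator_params`, `train_data_size`, `batch_size`, `precision`) follow the docs at the time of writing and may differ between versions, so treat it as an assumption rather than the exact script we used:

```python
from rl4co.envs import TSPEnv
from rl4co.models import AttentionModel
from rl4co.utils.trainer import RL4COTrainer

# TSP with 50 nodes (older versions took num_loc directly instead of
# generator_params)
env = TSPEnv(generator_params={"num_loc": 50})

# Attention Model trained with REINFORCE and a rollout baseline
model = AttentionModel(
    env,
    baseline="rollout",
    batch_size=512,
    train_data_size=1_280_000,  # samples per epoch
)

# Single-GPU trainer with Lightning mixed precision
trainer = RL4COTrainer(
    max_epochs=100,
    accelerator="gpu",
    devices=1,
    precision="16-mixed",
)
trainer.fit(model)
```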

Could you tell us more about your setting? Which problem and hyperparameters did you use?

Closing as stale. Feel free to re-open @yuanxuanS if you find any issue with training time!