ml-jku / clamp

Code for the paper Enhancing Activity Prediction Models in Drug Discovery with the Ability to Understand Human Language

Home Page: https://arxiv.org/abs/2303.03363


Do you have a PyTorch v2 training script with multi-GPU support?

linhduongtuan opened this issue · comments

Hi Phillip,

Do you have a PyTorch v2 training script with multi-GPU support? If so, would you be able to share it with me?

As far as I know, your arXiv paper states that the total compute runtime was around 170 days over 800 runs (without linear probing). I am wondering why you used only one GPU to train such a large model.

Have a nice weekend.
Linh

Hi Linh,
not as of now, but feel free to add a PR for the support. This shouldn't take too long.
There was no need to add multi-GPU support for my setup: training on PubChem23 runs in under 2 days on a single A100.
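For anyone considering such a PR, a minimal sketch of the simplest multi-GPU drop-in is `torch.nn.DataParallel` (for serious scaling, `DistributedDataParallel` with a launcher would be the better choice). The model below is a hypothetical stand-in, not the actual CLAMP architecture:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the CLAMP adapter model; the real one lives
# in the ml-jku/clamp repo and would be constructed by its training script.
model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 1))

# Wrap for multi-GPU only when more than one device is visible; on a
# single GPU (or CPU) the model is used unwrapped, so the script still runs.
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)  # replicates the module across GPUs

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# Forward pass works identically with or without the wrapper.
x = torch.randn(8, 512).to(device)
out = model(x)
```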

The models aren't too large, since only an "adapter" is trained; the large pretrained models are used as frozen encoders, only once, to embed the inputs.
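The frozen-encoder-plus-adapter setup can be sketched as follows; the module shapes here are hypothetical stand-ins (in CLAMP the frozen parts are large pretrained molecule/text encoders):

```python
import torch
import torch.nn as nn

encoder = nn.Linear(512, 256)   # stand-in for a large frozen pretrained encoder
adapter = nn.Linear(256, 128)   # stand-in for the small trainable adapter

# Freeze the encoder: it is used only to embed inputs, never updated.
for p in encoder.parameters():
    p.requires_grad = False

# Only the adapter's parameters are handed to the optimizer.
optimizer = torch.optim.Adam(adapter.parameters(), lr=1e-4)

# Embeddings can be precomputed once, without building a graph...
with torch.no_grad():
    emb = encoder(torch.randn(4, 512))

# ...so gradients flow only through the cheap adapter.
out = adapter(emb)
```

Because the expensive encoder runs under `no_grad`, its activations need not be stored for backpropagation, which is what keeps training this fast on one GPU.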

Best, Philipp

Just checked: torch '2.0.1+cu117' works fine for me for training.

Hi @phseidl Philipp,
Thank you for answering my question.
It's good to know you used an adapter for downstream finetuning on the PubChem23 dataset.

Since I am struggling to preprocess the PubChem23 dataset, would you be able to share your preprocessed PubChem23 data with me?

I have a follow-up question: do you plan to publish to the Hugging Face Hub? It would be great if you could push your results and source code there. Sometimes I encounter errors with the mlflow package; in my opinion, using wandb to monitor everything would be fine.

Best,
Linh

Hi @linhduongtuan ,
sorry for the late response; I have added a reproducible way to download the pubchem23 dataset.

wget -N -r https://cloud.ml.jku.at/s/fi83oGMN2KTbsNQ/download -O pubchem23.zip
unzip pubchem23.zip
rm pubchem23.zip

(added it to ./data/pubchem.md)

Completely agree. I'm working on a new project at the moment, so it's not a priority, but I would like to add the models to the HF Hub.

Best, Philipp