Question about the training for stage 3
QizhiPei opened this issue
Thanks for the interesting work that bridges 3D molecule and text. I'm a little confused about the training for stage 3.
- Is the model jointly trained on the PubChem and PubChemQC datasets at the same time? `bash ./scripts/stage3_train.sh` seems to train the model on only one of them. Do I need to manually merge the two datasets provided on Hugging Face?
- `bash ./scripts/stage3_train.sh` seems to use the `pretrain` subset of the PubChem and PubChemQC datasets, so I'm a little confused about when their `train` subsets are used. I know that the `pretrain` subset of PubChem is used for stage 1 and stage 2 pre-training, and the `train` subset of PubChem is used for stage 1 and stage 2 fine-tuning, but for stage 3 I'm confused.
Any help you might provide is appreciated and thanks for your time and attention.
Thanks for your interest in our work.
(1) Yes, it is jointly trained on both PubChem and PubChemQC. Please check the data provider functions for details (`instruct_dataset.py` and `balance_dataset.py` in the `data_provider` folder).
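For readers wondering how two corpora can be trained on jointly without manually merging the files: a common pattern is a wrapper dataset that interleaves samples from both sources, oversampling the smaller one so each epoch covers both. The sketch below is a hypothetical stand-in for what `balance_dataset.py` might do (the class name, toy data, and sampling scheme are assumptions, not the repo's actual code):

```python
class BalancedConcatDataset:
    """Minimal sketch of joint sampling from two datasets.

    Hypothetical illustration only: alternates between PubChem and
    PubChemQC, wrapping indices so the smaller corpus is oversampled
    and both are seen throughout training.
    """

    def __init__(self, pubchem, pubchemqc):
        self.datasets = [pubchem, pubchemqc]
        self.max_len = max(len(d) for d in self.datasets)

    def __len__(self):
        # One "epoch" draws max_len samples from each source.
        return self.max_len * len(self.datasets)

    def __getitem__(self, idx):
        # Even indices come from pubchem, odd from pubchemqc.
        ds = self.datasets[idx % len(self.datasets)]
        # Wrap the per-source index so the shorter dataset repeats.
        return ds[(idx // len(self.datasets)) % len(ds)]


# Toy usage with placeholder (molecule, text) pairs.
pubchem = [("mol_a", "text_a"), ("mol_b", "text_b"), ("mol_c", "text_c")]
pubchemqc = [("mol_x", "qc_x")]
mixed = BalancedConcatDataset(pubchem, pubchemqc)
print(len(mixed))  # 6: three draws from each source
print(mixed[1])    # ('mol_x', 'qc_x') — the small corpus is revisited
```

In practice the same idea is what PyTorch's `ConcatDataset` plus a weighted sampler achieves; the point is that no manual file merge is needed, since the data provider handles the mixing.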
(2) Sorry for the confusion in our code. In stage 3, only the `train` subsets of PubChem and PubChemQC are used. We have revised the mode name (`pretrain` -> `train`) in stage 3 to avoid this confusion.
Thanks for your response. Does that mean the `pretrain` subset of PubChemQC is not used for 3D-MoLM training?
Yes
Thanks for your quick reply~