lsh0520 / 3D-MoLM


Question about the training for stage 3

QizhiPei opened this issue · comments

Thanks for the interesting work that bridges 3D molecules and text. I'm a little confused about the training for stage 3.

  1. Is the model jointly trained on the PubChem and PubChemQC datasets at the same time? bash ./scripts/stage3_train.sh seems to train the model on only one of them. Do I need to manually merge the two datasets provided on Hugging Face?
  2. bash ./scripts/stage3_train.sh seems to use the pretrain subset of the PubChem and PubChemQC datasets, so I'm confused about when the train subsets of PubChem and PubChemQC are used. I know that the pretrain subset of PubChem is used for stage 1 and stage 2 pre-training, and the train subset of PubChem is used for stage 1 and stage 2 fine-tuning, but for stage 3 I'm confused.

Any help you might provide is appreciated and thanks for your time and attention.

Thanks for your interest in our work.

(1) Yes, it is jointly trained on both PubChem and PubChemQC. Please check the data provider functions for details (instruct_dataset.py & balance_dataset.py in the data_provider folder).
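For intuition, here is a minimal sketch of joint sampling from two datasets via a shuffled loader. ToyInstructDataset is a hypothetical stand-in, not the actual provider defined in instruct_dataset.py or balance_dataset.py:

```python
from torch.utils.data import ConcatDataset, DataLoader, Dataset

class ToyInstructDataset(Dataset):
    """Hypothetical stand-in for an instruction dataset built in data_provider/."""
    def __init__(self, name, n):
        self.samples = [f"{name}-sample-{i}" for i in range(n)]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]

pubchem = ToyInstructDataset("PubChem", 100)
pubchemqc = ToyInstructDataset("PubChemQC", 100)

# ConcatDataset chains the two datasets end to end; shuffling the
# DataLoader then interleaves PubChem and PubChemQC samples in each epoch.
joint = ConcatDataset([pubchem, pubchemqc])
loader = DataLoader(joint, batch_size=8, shuffle=True)

for batch in loader:
    # Each batch mixes samples from both sources.
    print(batch)
    break
```

With shuffle=True, the two sources are interleaved at training time, so no manual merge of the downloaded files is needed.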

(2) Sorry for the confusion in our code. Actually, in stage 3, only the train subsets of PubChem and PubChemQC are used. We have revised the mode name (pretrain -> train) in stage 3 to avoid this confusion.
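As a rough illustration of the naming (all identifiers below are hypothetical, not taken from the repo), a mode flag can be mapped to the subset directory it reads:

```python
import os

def subset_root(data_root: str, mode: str) -> str:
    """Hypothetical mode -> subset mapping: stage 1/2 pre-training reads
    the 'pretrain' subset; stage 1/2 fine-tuning and stage 3 read 'train'."""
    if mode not in {"pretrain", "train"}:
        raise ValueError(f"unknown mode: {mode!r}")
    return os.path.join(data_root, mode)
```

Under this sketch, renaming stage 3's mode from pretrain to train makes the directory it reads match the subset actually used.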

Thanks for your response. Does this mean that the pretrain subset of PubChemQC is not used for 3D-MoLM training?

Yes

Thanks for your quick reply~

@QizhiPei @lsh0520 Hi, I'm also a little confused about stage 2 and stage 3. What is the difference between stage2-ft and stage3-train?
Do they correspond to the Specialist and Generalist models, respectively?