luyug / Condenser

EMNLP 2021 - Pre-training architectures for dense retrieval

failed to reproduce the condenser pretraining results on V100

1024er opened this issue

I am trying to reproduce the Condenser pre-training results. I evaluated the checkpoints on the STS-B task with sentence-transformers, but the results differ.
(1) bert-base-uncased
2022-01-03 17:07:01 - Load pretrained SentenceTransformer: output/training_stsbenchmark_bert-base-uncased-2022-01-03_17-04-06
2022-01-03 17:07:02 - Use pytorch device: cuda
2022-01-03 17:07:02 - EmbeddingSimilarityEvaluator: Evaluating the model on sts-test dataset:
2022-01-03 17:07:05 - Cosine-Similarity : Pearson: 0.8484 Spearman: 0.8419
2022-01-03 17:07:05 - Manhattan-Distance: Pearson: 0.8345 Spearman: 0.8322
2022-01-03 17:07:05 - Euclidean-Distance: Pearson: 0.8349 Spearman: 0.8328
2022-01-03 17:07:05 - Dot-Product-Similarity: Pearson: 0.7521 Spearman: 0.7421

(2) Luyu/condenser
2022-01-03 17:12:46 - Load pretrained SentenceTransformer: output/training_stsbenchmark_Luyu-condenser-2022-01-03_17-09-51
2022-01-03 17:12:48 - Use pytorch device: cuda
2022-01-03 17:12:48 - EmbeddingSimilarityEvaluator: Evaluating the model on sts-test dataset:
2022-01-03 17:12:50 - Cosine-Similarity : Pearson: 0.8528 Spearman: 0.8504
2022-01-03 17:12:50 - Manhattan-Distance: Pearson: 0.8394 Spearman: 0.8380
2022-01-03 17:12:50 - Euclidean-Distance: Pearson: 0.8396 Spearman: 0.8378
2022-01-03 17:12:50 - Dot-Product-Similarity: Pearson: 0.7942 Spearman: 0.7819

(3) self-trained checkpoint
2022-01-03 17:34:30 - Load pretrained SentenceTransformer: output/training_stsbenchmark_output--2022-01-03_17-31-48
2022-01-03 17:34:32 - Use pytorch device: cuda
2022-01-03 17:34:32 - EmbeddingSimilarityEvaluator: Evaluating the model on sts-test dataset:
2022-01-03 17:34:34 - Cosine-Similarity : Pearson: 0.8498 Spearman: 0.8469
2022-01-03 17:34:34 - Manhattan-Distance: Pearson: 0.8415 Spearman: 0.8396
2022-01-03 17:34:34 - Euclidean-Distance: Pearson: 0.8423 Spearman: 0.8402
2022-01-03 17:34:34 - Dot-Product-Similarity: Pearson: 0.7959 Spearman: 0.7826

I ran the pre-training on 8x 32 GB V100 GPUs with the following settings:

python -m torch.distributed.launch --nproc_per_node 8 run_pre_training.py \
  --output_dir output \
  --model_name_or_path bert-base-uncased \
  --do_train \
  --save_steps 20000 \
  --per_device_train_batch_size 128 \
  --gradient_accumulation_steps 1 \
  --fp16 \
  --warmup_ratio 0.1 \
  --learning_rate 1e-4 \
  --num_train_epochs 8 \
  --overwrite_output_dir \
  --dataloader_num_workers 16 \
  --n_head_layers 2 \
  --skip_from 6 \
  --max_seq_length 128 \
  --train_dir data \
  --weight_decay 0.01 \
  --late_mlm

I use per_device_train_batch_size = 128, so the global batch size is 128 x 8 = 1024.
The pre-training data is BookCorpus + Wikipedia, created with the data-preparation code released by NVIDIA.

raw data:
5.0G bookscorpus_one_book_per_line.txt
13G wikicorpus_en_one_article_per_line.txt

after preprocessing:
24G book_wiki.json
containing 41,420,334 lines with maxlen=128

I have used this data to train bert-large and reached F1 = 90% on the SQuAD task, so I think the corpus should be fine.

Could you please give me some suggestions? Thank you.

I also find that the variance of the Spearman correlation on the test set is quite large. Is the result in the paper the average over multiple runs?

I ran the default settings 4 times:
python training_stsbenchmark.py Luyu/condenser

2022-01-03 20:58:39 - Load SentenceTransformer from folder: output/training_stsbenchmark_Luyu-condenser-2022-01-03_20-55-43
2022-01-03 20:58:41 - Use pytorch device: cuda
2022-01-03 20:58:41 - EmbeddingSimilarityEvaluator: Evaluating the model on sts-test dataset:
2022-01-03 20:58:44 - Cosine-Similarity : Pearson: 0.8549 Spearman: 0.8509
2022-01-03 20:58:44 - Manhattan-Distance: Pearson: 0.8470 Spearman: 0.8447
2022-01-03 20:58:44 - Euclidean-Distance: Pearson: 0.8473 Spearman: 0.8450
2022-01-03 20:58:44 - Dot-Product-Similarity: Pearson: 0.8059 Spearman: 0.7951

2022-01-03 20:59:11 - Load SentenceTransformer from folder: output/training_stsbenchmark_Luyu-condenser-2022-01-03_20-56-02
2022-01-03 20:59:14 - Use pytorch device: cuda
2022-01-03 20:59:14 - EmbeddingSimilarityEvaluator: Evaluating the model on sts-test dataset:
2022-01-03 20:59:16 - Cosine-Similarity : Pearson: 0.8570 Spearman: 0.8541
2022-01-03 20:59:16 - Manhattan-Distance: Pearson: 0.8504 Spearman: 0.8497
2022-01-03 20:59:16 - Euclidean-Distance: Pearson: 0.8508 Spearman: 0.8496
2022-01-03 20:59:16 - Dot-Product-Similarity: Pearson: 0.8144 Spearman: 0.8035

2022-01-03 19:27:40 - Load pretrained SentenceTransformer: output/training_stsbenchmark_Luyu-condenser-2022-01-03_19-24-38
2022-01-03 19:27:41 - Use pytorch device: cuda
2022-01-03 19:27:41 - EmbeddingSimilarityEvaluator: Evaluating the model on sts-test dataset:
2022-01-03 19:27:43 - Cosine-Similarity : Pearson: 0.8538 Spearman: 0.8493
2022-01-03 19:27:43 - Manhattan-Distance: Pearson: 0.8457 Spearman: 0.8433
2022-01-03 19:27:43 - Euclidean-Distance: Pearson: 0.8462 Spearman: 0.8434
2022-01-03 19:27:43 - Dot-Product-Similarity: Pearson: 0.8088 Spearman: 0.7982

2022-01-03 21:00:00 - Load SentenceTransformer from folder: output/training_stsbenchmark_Luyu-condenser-2022-01-03_20-57-04
2022-01-03 21:00:03 - Use pytorch device: cuda
2022-01-03 21:00:03 - EmbeddingSimilarityEvaluator: Evaluating the model on sts-test dataset:
2022-01-03 21:00:05 - Cosine-Similarity : Pearson: 0.8505 Spearman: 0.8468
2022-01-03 21:00:05 - Manhattan-Distance: Pearson: 0.8432 Spearman: 0.8414
2022-01-03 21:00:05 - Euclidean-Distance: Pearson: 0.8440 Spearman: 0.8424
2022-01-03 21:00:05 - Dot-Product-Similarity: Pearson: 0.8108 Spearman: 0.8005

sentence-transformers == 1.2.1
transformers == 4.2.0

I'd need more information on your pre-training/fine-tuning setup/scripts to understand the situation and provide suggestions.

One caveat with the sentence-transformer package is that some example scripts use mean pooling by default while Condenser is designed for CLS pooling; you may need to make some slight code adjustments.
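For example, with the standard sentence-transformers modules API you can switch the pooling layer to CLS pooling explicitly. This is only a minimal sketch; the model name and max_seq_length below just mirror the setup described above:

from sentence_transformers import SentenceTransformer, models

# Load the Condenser checkpoint as the transformer backbone.
word_embedding_model = models.Transformer("Luyu/condenser", max_seq_length=128)

# Pool with the [CLS] token instead of the default mean pooling.
pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(),
    pooling_mode_cls_token=True,
    pooling_mode_mean_tokens=False,
    pooling_mode_max_tokens=False,
)

model = SentenceTransformer(modules=[word_embedding_model, pooling_model])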

Thank you, you are right. I changed the pooling type to [CLS] and was able to achieve almost comparable results:

(screenshot of STS-B results)

Thank you for your code and guidance.

Hi, could you please tell me where to get the training corpus?

5.0G bookscorpus_one_book_per_line.txt

13G wikicorpus_en_one_article_per_line.txt

@Hannibal046 I'd recommend either the NVIDIA Megatron repo mentioned above, or the wikipedia and bookcorpusopen datasets from the Hugging Face dataset hub.
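A minimal sketch of pulling those two corpora with the Hugging Face datasets library (the dataset/config names are the ones on the hub at the time of writing and may change; tokenizing and splitting into 128-token spans for run_pre_training.py still has to follow the repo's data format):

from datasets import load_dataset

# English Wikipedia dump (preprocessed config hosted on the hub).
wiki = load_dataset("wikipedia", "20200501.en", split="train")

# Open BookCorpus replica; one book per example in the "text" field.
books = load_dataset("bookcorpusopen", split="train")

print(wiki[0]["text"][:200])
print(books[0]["text"][:200])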

@luyug Hi, thanks for your response. However, the NVIDIA repo's download_wikipedia script does not work for me. How should I post-process the data after downloading the wiki and book corpus from Hugging Face? I am currently trying to pre-train a BERT model from scratch. Any help would be appreciated. Thanks so much!