princeton-nlp / SimCSE

[EMNLP 2021] SimCSE: Simple Contrastive Learning of Sentence Embeddings https://arxiv.org/abs/2104.08821

Problems pretraining SimCSE on a custom dataset

mphomokoatle opened this issue · comments

Hi,

I am having trouble pretraining the unsupervised SimCSE model (princeton-nlp/unsup-simcse-bert-base-uncased) on a custom dataset. I get the following error:

"File "train.py", line 51
model_name_or_path: Optional[str] = field(
^
SyntaxError: invalid syntax"

I am using Python 3.9.7 on CentOS Linux 7.

Thanks.

Hi,

This doesn't seem to be a widespread issue in other people's environments, so the file may have somehow been corrupted. Can you try re-cloning the repo?
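For context, the line the error points at is part of a standard HfArgumentParser-style dataclass. A minimal sketch of what that region of train.py looks like (the default value and help text here are illustrative assumptions, not copied from the repo):

```python
# Sketch of the dataclass argument definition around train.py line 51.
# The default and metadata text below are illustrative assumptions.
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ModelArguments:
    # The "name: type = field(...)" pattern requires Python >= 3.7 (dataclasses).
    model_name_or_path: Optional[str] = field(
        default=None,
        metadata={"help": "Path to a pretrained model or model identifier from huggingface.co/models"},
    )
```

This syntax is valid on Python 3.9.7, so a SyntaxError at that line usually points to a truncated or corrupted copy of the file, or to the script being run with a much older interpreter (e.g. a system `python` that is Python 2), rather than to the code itself.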


@gaotianyu1350 Thank you for your response. I re-cloned the repo and it worked.
Quick question: after training unsupervised SimCSE on a custom dataset, will I also get my own custom vocab.txt and tokenizer, or will my output model (my-unsup-simcse-bert-base-uncased) just reuse the one from the original BERT checkpoint? I am asking because it seems like I inherited the vocab and tokenizer from the original model (i.e. Wikipedia data) rather than from my own data.

Output of training unsupervised SimCSE on the custom data:

[INFO|modeling_utils.py:1151] 2023-10-20 06:41:25,541 >> All the weights of BertForCL were initialized from the model checkpoint at /home/mmokoatle/Training_SIMCSE/bert.
If your task is similar to the task the model of the checkpoint was trained on, you can already use BertForCL for predictions without further training.
Map: 100%|██████████| 256/256 [00:00<00:00, 303.19 examples/s]
[INFO|trainer.py:441] 2023-10-20 06:41:26,423 >> The following columns in the training set don't have a corresponding argument in BertForCL.forward and have been ignored: .
10/20/2023 06:41:26 - INFO - simcse.trainers - ***** Running training *****
10/20/2023 06:41:26 - INFO - simcse.trainers - Num examples = 256
10/20/2023 06:41:26 - INFO - simcse.trainers - Num Epochs = 3
10/20/2023 06:41:26 - INFO - simcse.trainers - Instantaneous batch size per device = 16
10/20/2023 06:41:26 - INFO - simcse.trainers - Total train batch size (w. parallel, distributed & accumulation) = 16
10/20/2023 06:41:26 - INFO - simcse.trainers - Gradient Accumulation steps = 1
10/20/2023 06:41:26 - INFO - simcse.trainers - Total optimization steps = 48
10/20/2023 06:41:26 - INFO - simcse.trainers - Continuing training from checkpoint, will skip to saved global_step
10/20/2023 06:41:26 - INFO - simcse.trainers - Continuing training from epoch 1
10/20/2023 06:41:26 - INFO - simcse.trainers - Continuing training from global step 16
10/20/2023 06:41:26 - INFO - simcse.trainers - Will skip the first 1 epochs then the first 0 batches in the first epoch.
100%|██████████| 48/48 [06:47<00:00, 12.29s/it]10/20/2023 06:48:13 - INFO - simcse.trainers -

Training completed. Do not forget to share your model on huggingface.co/models =)

100%|██████████| 48/48 [06:47<00:00, 8.49s/it]
[INFO|trainer.py:1344] 2023-10-20 06:48:13,808 >> Saving model checkpoint to /home/mmokoatle/Training_SIMCSE/result/my-unsup-simcse-bert-base-uncased
[INFO|configuration_utils.py:300] 2023-10-20 06:48:13,810 >> Configuration saved in /home/mmokoatle/Training_SIMCSE/result/my-unsup-simcse-bert-base-uncased/config.json
[INFO|modeling_utils.py:817] 2023-10-20 06:48:14,779 >> Model weights saved in /home/mmokoatle/Training_SIMCSE/result/my-unsup-simcse-bert-base-uncased/pytorch_model.bin
10/20/2023 06:48:14 - INFO - __main__ - ***** Train results *****
10/20/2023 06:48:14 - INFO - __main__ - epoch = 3.0
10/20/2023 06:48:14 - INFO - __main__ - train_runtime = 407.3796
10/20/2023 06:48:14 - INFO - __main__ - train_samples_per_second = 0.118
/var/spool/PBS/mom_priv/jobs/5219044.sched01.SC: line 13: 195572 Segmentation fault python /home/mmokoatle/Training_SIMCSE/train.py --model_name_or_path /home/mmokoatle/Training_SIMCSE/bert --train_file /home/mmokoatle/SimCSE/hg19_4000_seq_tokens_k9.txt --output_dir /home/mmokoatle/Training_SIMCSE/result/my-unsup-simcse-bert-base-uncased --num_train_epochs 3 --per_device_train_batch_size 16 --learning_rate 3e-5 --max_seq_length 512 --pooler_type cls --mlp_only_train --overwrite_output_dir --temp 0.05 --do_train

Hi, you should still get the same vocab file and tokenizer, as the training does not change the tokenization part.
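To double-check this on your side, here is a minimal sketch that compares the saved tokenizer with the one from the base checkpoint. The paths are taken from the training command in the log above, and it assumes the tokenizer files were written to the output directory alongside the model (as the vocab.txt you found suggests):

```python
# Compare the tokenizer saved with the trained model against the base checkpoint's.
# Paths are the ones from the training command above; adjust to your setup.
from transformers import AutoTokenizer

base = AutoTokenizer.from_pretrained("/home/mmokoatle/Training_SIMCSE/bert")
trained = AutoTokenizer.from_pretrained(
    "/home/mmokoatle/Training_SIMCSE/result/my-unsup-simcse-bert-base-uncased"
)

print(base.vocab_size, trained.vocab_size)      # identical sizes
print(base.get_vocab() == trained.get_vocab())  # True: the vocab is inherited, not re-learned
```

If you need a vocabulary learned from your own corpus, that would mean training a new tokenizer and pretraining an encoder with it before running SimCSE; the SimCSE training itself only updates the model weights, not the tokenizer.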

@gaotianyu1350 Thank you! 🙂
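For completeness, a minimal sketch of loading the resulting checkpoint for sentence embeddings with plain transformers. The sentences are placeholders, and taking the [CLS] hidden state directly follows the unsupervised setup with the --mlp_only_train flag used in the command above:

```python
# Load the trained checkpoint and embed a few sentences.
# With --mlp_only_train, the MLP pooler is only used during training,
# so the [CLS] hidden state is taken directly at inference time.
import torch
from transformers import AutoModel, AutoTokenizer

path = "/home/mmokoatle/Training_SIMCSE/result/my-unsup-simcse-bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModel.from_pretrained(path)  # SimCSE-specific extra weights are ignored with a warning
model.eval()

sentences = ["example sequence one", "example sequence two"]  # placeholder inputs
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

embeddings = outputs.last_hidden_state[:, 0]  # one [CLS] vector per sentence
print(embeddings.shape)
```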
