princeton-nlp / SimCSE

[EMNLP 2021] SimCSE: Simple Contrastive Learning of Sentence Embeddings https://arxiv.org/abs/2104.08821

Question About bad results from trained model

Alison-starbeat opened this issue · comments

Sorry to bother you! I'm new to NLP. I tried to use unsupervised SimCSE on my own data, with the goal of achieving the best recall and precision scores on my own test set. I trained with 10,000 to 90,000 examples, 1-2 epochs, a learning rate of 1e-5, and a batch size of 64, starting from a base model (the Chinese version of roformer-sim). But I found that the results from the trained model were worse than those of the base model.
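For context, here is a minimal sketch of what one unsupervised-SimCSE training step looks like with the hyperparameters described above (batch size 64, learning rate 1e-5, temperature 0.05). This is not this repo's `train.py` nor the exact roformer-sim setup; the checkpoint name and pooling choice are placeholders:

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

# Placeholder checkpoint; swap in the roformer-sim Chinese checkpoint actually used.
model_name = "bert-base-chinese"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.train()  # keep dropout ON: the two views of a sentence differ only by dropout masks

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
temperature = 0.05

def simcse_step(sentences):
    # Encode the SAME sentences twice; dropout makes the two [CLS] views differ slightly.
    batch = tokenizer(sentences, padding=True, truncation=True, max_length=64,
                      return_tensors="pt")
    z1 = model(**batch).last_hidden_state[:, 0]   # view 1 ([CLS] pooling)
    z2 = model(**batch).last_hidden_state[:, 0]   # view 2 (different dropout mask)

    # In-batch InfoNCE: z1[i] should match z2[i]; every other z2[j] is a negative.
    sim = F.cosine_similarity(z1.unsqueeze(1), z2.unsqueeze(0), dim=-1) / temperature
    labels = torch.arange(sim.size(0))
    loss = F.cross_entropy(sim, labels)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the negatives come from the same batch, two (near-)identical sentences landing in one batch get pushed apart even though they are semantically the same, which is the failure mode speculated about below.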

I guess the problem lies in my dataset: it may naturally contain lots of similar sentence pairs, which could hurt the contrastive learning step. Could this be true? What could I do to improve the results?
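One hypothetical way to check that hypothesis before training is to measure how often sentences repeat in the training file. The file name and normalization are placeholders, and this only catches exact duplicates (near-duplicates would need fuzzy or embedding-based matching):

```python
from collections import Counter

def duplicate_ratio(path):
    # Count how often each (whitespace-stripped) sentence repeats in the training file.
    with open(path, encoding="utf-8") as f:
        counts = Counter(line.strip() for line in f if line.strip())
    total = sum(counts.values())
    duplicated = sum(c for c in counts.values() if c > 1)
    return duplicated / total if total else 0.0

# If a large share of lines repeat, in-batch negatives will often be (near-)identical
# to the positive pair, which weakens the contrastive signal.
print(f"{duplicate_ratio('train.txt'):.1%} of training sentences are exact duplicates")
```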

Thank you for your patience and hope for your reply!

Hi, can you elaborate more on the issue? For example, what this dataset is about, what the baseline model is, etc.

Stale issue message