princeton-nlp / SimCSE

[EMNLP 2021] SimCSE: Simple Contrastive Learning of Sentence Embeddings https://arxiv.org/abs/2104.08821

Validating on other language

Lukecn1 opened this issue · comments

Hi there, thank you for the excellent paper and repo!

If I want to train a supervised SimCSE model using a PLM for a language other than English, how can I validate model performance during training, given that the repo defaults to STS and other English-language tasks?

Regards, Lukas

Hi,

If a comparable STS dataset exists for the language, you can use it. Otherwise, you can use the hyperparameters tuned for English as a rough estimate.

Fair game. As I don't have an STS dataset for the language I need, I'll try to implement validation functionality that doesn't rely on STS data.

I'll make a PR once I have implemented and tested it :)

There are translated versions of the STS-Benchmark in other languages: https://huggingface.co/datasets/stsb_multi_mt, https://github.com/PhilipMay/stsb-multi-mt/tree/main/data and https://github.com/PhilipMay/stsb-multi-mt. PS: You need to convert these CSV files to tab-separated ('\t') format and use them to replace the files in STS-Benchmark.
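A minimal sketch of that preprocessing step, in case it helps. It assumes each CSV row is `sentence1,sentence2,score` and that the evaluation loader reads the score from column index 4 and the two sentences from indices 5 and 6, as in the original STS-Benchmark files; please check both against the files and loader in your own checkout. The function name `stsb_csv_to_tsv` and the `"na"` placeholder values are made up for illustration.

```python
import csv
import io


def stsb_csv_to_tsv(csv_text: str, has_header: bool = False) -> str:
    """Convert 'sentence1,sentence2,score' CSV rows to 7-column tab-separated lines.

    Assumed output layout: genre, file, year, id, score, sentence1, sentence2.
    """
    reader = csv.reader(io.StringIO(csv_text))
    if has_header:
        next(reader, None)  # drop the header row if the file has one

    out_lines = []
    for idx, row in enumerate(reader):
        if len(row) < 3:
            continue  # skip blank or malformed rows
        sent1, sent2, score = row[0], row[1], row[2]
        # Tabs or newlines inside a sentence would break the tab-separated layout,
        # so collapse all whitespace runs to single spaces.
        sent1 = " ".join(sent1.split())
        sent2 = " ".join(sent2.split())
        # The first four columns are placeholders (assumed unused by the loader).
        out_lines.append("\t".join(["na", "na", "na", str(idx), score, sent1, sent2]))
    return "\n".join(out_lines) + "\n"


if __name__ == "__main__":
    sample = 'A man is playing.,"A man plays, loudly.",3.8\n'
    print(stsb_csv_to_tsv(sample), end="")
```

To convert a real file, read it with `open(path, encoding="utf-8", newline="")`, pass the contents through the function, and write the result over the matching train/dev/test file in the STS-Benchmark directory (keep a backup of the English originals).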

Thanks for sharing :)

I found a way around it and have written a routine that allows validating on custom data during training.
This also enables validation on data other than sentence pairs.
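The routine itself is not shown in this thread. Purely as an illustration (this is not the code referred to above), a custom sentence-pair validation metric could look roughly like the sketch below: embed both sentences, take the cosine similarity, and report the Spearman correlation against gold scores, which is the usual STS-style measure. All names here (`validate`, `encode`, `pairs`, `gold`) are hypothetical, and `encode` stands in for whatever function returns an embedding from your model.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation (0.0 if either input has zero variance)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed inputs."""
    return pearson(ranks(x), ranks(y))


def validate(encode, pairs, gold):
    """Score an encoder on (sentence_a, sentence_b) pairs against gold similarities."""
    sims = [cosine(encode(a), encode(b)) for a, b in pairs]
    return spearman(sims, gold)
```

Calling something like `validate` every few hundred steps on a held-out set, and keeping the checkpoint with the highest score, would mirror how an STS dev set is typically used for model selection. For data that is not sentence pairs, a different metric would be needed.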