Validating on other languages
Lukecn1 opened this issue · comments
Hi there, thank you for the excellent paper and repo!
If I want to train a supervised SimCSE model using a PLM for a language other than English, how can I validate model performance during training, given that the repo defaults to STS and other English-language tasks?
Regards, Lukas
Hi,
If there is a similar STS dataset for that language, you can use it instead. Otherwise, you can reuse the hyperparameters tuned for English as a rough starting point.
Fair enough. As I don't have an STS dataset for the language I need, I'll try to implement a validation routine that doesn't rely on STS data.
I'll make a PR once I have implemented and tested it :)
There are translated STS-Benchmark datasets in other languages: https://huggingface.co/datasets/stsb_multi_mt and https://github.com/PhilipMay/stsb-multi-mt (data files under https://github.com/PhilipMay/stsb-multi-mt/tree/main/data). PS: You need to preprocess these CSV files into tab-separated format and swap them in for the STS-Benchmark files.
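The preprocessing step above could be sketched like this. Note the assumptions: I'm guessing the input CSVs carry the two sentences followed by the score, and that the SentEval-style STS-Benchmark loader reads the score from the fifth tab-separated column and the sentences from the sixth and seventh — verify both against the actual files before relying on this.

```python
import csv

def convert_to_senteval_tsv(src_path, dst_path):
    """Rewrite a comma-separated stsb-multi-mt file as a tab-separated file
    in an STS-Benchmark-like layout (score in column 5, sentences in
    columns 6 and 7; column positions are assumptions, not verified)."""
    with open(src_path, newline="", encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for row in csv.reader(src):
            sent1, sent2, score = row[0], row[1], row[2]
            # Pad the leading metadata columns that the loader skips over.
            dst.write("\t".join(["-", "-", "-", "-", score, sent1, sent2]) + "\n")
```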
Thanks for sharing :)
I found a way around it and have written a routine that allows validating on custom data during training.
This also enables validating on data other than sentence pairs.