Validating on other language

Question

Lukecn1 opened this issue 2 years ago

Hi there, thank you for the excellent paper and repo!

If I want to train a supervised SimCSE model using a PLM for a language other than English, how can I validate the model's performance during training, given that the repo defaults to STS and other English-language tasks?

Regards, Lukas

Tianyu Gao · Answer 1 · Mon Jun 27 2022 03:05:25 GMT+0800 (China Standard Time)

Hi,

If a comparable STS dataset exists for your language, you can validate on that dataset. Otherwise, you can use the hyperparameters tuned for English as a rough estimate.
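As a rough illustration of what validating on an STS-style dataset involves: each sentence pair is scored by the cosine similarity of its two embeddings, and those scores are compared with the gold similarity scores using Spearman correlation. The sketch below is a minimal, dependency-free version of that idea. The function names (`cosine`, `average_ranks`, `spearman`, `sts_validation_score`) are hypothetical and not taken from the repo, and the embeddings are assumed to be plain lists of floats already produced by your model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def average_ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y) if std_x and std_y else 0.0

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def sts_validation_score(emb_a, emb_b, gold):
    """Spearman correlation between pairwise cosine similarity and gold scores."""
    sims = [cosine(u, v) for u, v in zip(emb_a, emb_b)]
    return spearman(sims, gold)
```

A higher score means the model's similarity ranking agrees more closely with the human annotations, so the value can be tracked across checkpoints in the same way as the English STS score.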

Lukecn1 · Answer 2 · Mon Jun 27 2022 17:00:51 GMT+0800 (China Standard Time)

Fair game. As I don't have an STS dataset for the language I need, I'll try to implement validation functionality that doesn't rely on STS data.

I'll make a PR once I have implemented and tested it :)

Yiren Jian · Answer 3 · Thu Aug 11 2022 18:30:02 GMT+0800 (China Standard Time)

There are translated versions of STS-Benchmark in other languages: https://huggingface.co/datasets/stsb_multi_mt, https://github.com/PhilipMay/stsb-multi-mt/tree/main/data and https://github.com/PhilipMay/stsb-multi-mt. PS: You need to preprocess these CSV files so that they are tab-separated ('\t') and use them to replace the corresponding files in STS-Benchmark.
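The preprocessing step might look roughly like the sketch below. It rests on two assumptions that should be checked against your own files: that each translated CSV row holds `sentence1`, `sentence2` and a similarity score in that order, and that the evaluation loader expects seven tab-separated columns with the score in column 4 and the two sentences in columns 5 and 6 (zero-based). The function name `csv_to_sts_tsv` and the `"none"` placeholder values are hypothetical.

```python
import csv
import io

def csv_to_sts_tsv(csv_text, has_header=True):
    """Convert 'sentence1,sentence2,score' CSV text to 7-column TSV text.

    Assumed target layout (verify against your loader):
    genre, file, year, id, score, sentence1, sentence2
    """
    reader = csv.reader(io.StringIO(csv_text))
    if has_header:
        next(reader, None)  # skip the header row if the file has one

    lines = []
    for idx, row in enumerate(reader):
        if len(row) < 3:
            continue  # skip blank or malformed rows
        # Collapse internal whitespace so stray tabs or newlines inside a
        # sentence cannot break the tab-separated layout.
        sent1 = " ".join(row[0].split())
        sent2 = " ".join(row[1].split())
        score = row[2].strip()
        # The first three columns are placeholders for metadata that the
        # translated files do not carry.
        fields = ["none", "none", "none", str(idx), score, sent1, sent2]
        lines.append("\t".join(fields))
    return "".join(line + "\n" for line in lines)
```

To convert a file, read it with `open(path, encoding="utf-8", newline="")`, pass the text to `csv_to_sts_tsv`, and write the result over the matching STS-Benchmark split. Set `has_header=False` if the CSV has no header row.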

Lukecn1 · Answer 4 · Fri Aug 12 2022 17:02:08 GMT+0800 (China Standard Time)

Thanks for sharing :)

I found a way around it and have written a routine that allows validating on custom data during training.
This also makes it possible to validate on data other than sentence pairs.