citadel-ai / langcheck

Simple, Pythonic building blocks to evaluate LLM applications.

Home Page: https://langcheck.readthedocs.io/en/latest/index.html


Create langcheck.utils.detect_language()

syamaco opened this issue

Hello,
I have a question about the following test code.

  • Question: Is it possible to treat different languages uniformly?
    1. Automatically detect the language. (e.g., EN & JA)
    2. Unify the threshold test value of toxicity across different languages. (e.g., 0.2 for both)
import langcheck

# The first four outputs are Japanese ("Moderate/Excessive ** is said to be
# good/bad for your health."); the last four are their English counterparts.
generated_outputs = [
    '適度な**は健康に良いとされています。',
    '適度な**は健康に悪いとされています。',
    '過度の**は健康に良いとされています。',
    '過度の**は健康に悪いとされています。',
    'Moderate exercise is good for your health.',
    'Moderate exercise is bad for your health.',
    'Excessive exercise is good for your health.',
    'Excessive exercise is bad for your health.',
]

# Toxicity: `< 0.2` runs a threshold test on each metric's results.
# display() assumes a Jupyter notebook.
display(langcheck.metrics.ja.toxicity(generated_outputs) < 0.2)
display(langcheck.metrics.en.toxicity(generated_outputs) < 0.2)

Thank you in advance.

[Screenshot: LangCheck_Toxicity]

Hi @syamaco, thanks for the question!

Automatically detect the language. (e.g., EN & JA)

I think it makes sense to include a langcheck.utils.detect_language() function in LangCheck. There should be some well-known heuristics we can use to implement this.
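For reference, here's a minimal sketch of what such a function could look like, built on the lingua-py library (an assumption on my part; LangCheck may well pick a different detector):

# Minimal sketch of a possible langcheck.utils.detect_language(), built on
# lingua-py (https://github.com/pemistahl/lingua-py). Illustrative only,
# not LangCheck's actual implementation.
from lingua import LanguageDetectorBuilder

_detector = LanguageDetectorBuilder.from_all_languages().build()

def detect_language(text: str) -> str | None:
    """Return the ISO 639-1 code of the most likely language, e.g. 'en'."""
    language = _detector.detect_language_of(text)
    if language is None:  # lingua returns None when it can't decide
        return None
    return language.iso_code_639_1.name.lower()

With this, detect_language('こんにちは') would return 'ja'.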

Unify the threshold test value of toxicity across different languages. (e.g., 0.2 for both)

Currently, English toxicity (detoxify) and Japanese toxicity (a fine-tuned line-distilbert-base-japanese) are completely different models, so setting a single threshold for both may not be optimal.

If you use OpenAI to compute toxicity, I believe it's using the exact same model and prompt for both languages, so you might be able to set a single threshold in that case.
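For example, something like the following (a sketch assuming the model_type / openai_args parameters described in the LangCheck docs; verify against your installed version):

# Sketch: computing toxicity with the OpenAI-based evaluator for both
# languages, so they share one underlying model and prompt.
import langcheck

ja_outputs = generated_outputs[:4]  # the Japanese sentences above
en_outputs = generated_outputs[4:]  # the English sentences above

ja_results = langcheck.metrics.ja.toxicity(
    ja_outputs, model_type='openai', openai_args={'model': 'gpt-3.5-turbo'})
en_results = langcheck.metrics.en.toxicity(
    en_outputs, model_type='openai', openai_args={'model': 'gpt-3.5-turbo'})

# With the same underlying model, a shared threshold such as 0.2 is more
# defensible.
display(ja_results < 0.2)
display(en_results < 0.2)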

@liwii if you have any suggestions, feel free to chime in!

Hi @kennysong, thanks for your response!

It would be helpful to have a language detection function like that.
Additionally, it would be even better if it could detect when multiple languages are mixed within a sentence.

I've come to understand that the appropriate toxicity threshold can vary depending on the language model.
Also, since the threshold might change even for the same language model across versions, I feel it's important to exercise caution when determining it.

I will try OpenAI's language model as well, referring to the sample code.

Thank you.

Additionally, it would be even better if it could detect when multiple languages are mixed within a sentence.

Got it! What would be a useful output of langcheck.utils.detect_language() if there are multiple languages? The simplest idea is a list like ['en', 'ja'], but I think there are many other options that could be useful.

Also, since the threshold might change even for the same language model across versions, I feel it's important to exercise caution when determining it.

This is a good point – I think it's a good idea to pin a specific version of a HuggingFace model in LangCheck where possible. Then we can control model upgrades in LangCheck versions. We can track this as a separate feature request.
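For reference, one way to do that pinning (a sketch using the transformers revision parameter; the revision below is a placeholder, and LangCheck's actual fine-tuned toxicity model may differ):

# Sketch: pinning a Hugging Face model to an exact snapshot with the
# `revision` parameter of from_pretrained(). The base model here just
# illustrates the mechanism.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = 'line-corporation/line-distilbert-base-japanese'
REVISION = '<commit-hash-or-tag>'  # placeholder: pin to a specific snapshot

# trust_remote_code may be needed for this model's custom Japanese tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    MODEL, revision=REVISION, trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL, revision=REVISION)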

@kennysong san, thank you for the suggestion.

Is it possible to set the output of langcheck.utils.detect_language() to the detected languages and their probabilities, like {'en': 0.7, 'ja': 0.3}?
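For example, a shape like that could be produced with lingua-py (a hypothetical sketch; the function name is illustrative, not an agreed API, and ConfidenceValue objects assume lingua >= 1.3):

# Hypothetical sketch of the proposed output shape, on top of lingua-py's
# compute_language_confidence_values().
from lingua import LanguageDetectorBuilder

_detector = LanguageDetectorBuilder.from_all_languages().build()

def detect_language_with_confidence(text: str) -> dict[str, float]:
    """e.g. {'en': 0.7, 'ja': 0.3} -- only non-zero confidences."""
    return {
        cv.language.iso_code_639_1.name.lower(): cv.value
        for cv in _detector.compute_language_confidence_values(text)
        if cv.value > 0
    }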

@kennysong san,

I felt that results similar to detector.compute_language_confidence_values() in Lingua would be natural.
https://github.com/pemistahl/lingua-py#113-confidence-values

ENGLISH: 0.93
FRENCH: 0.04
GERMAN: 0.02
SPANISH: 0.01

If we can identify the main language of the text using langcheck.utils.detect_language() along with its probability, it could be used to decide whether to process the text with langcheck.metrics.ja.toxicity() or langcheck.metrics.en.toxicity(). Additionally, in cases where the confidence values are close, we might opt not to process the text at all.
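For example, that routing could look like this (a sketch assuming lingua-py >= 1.3; the min_confidence value is a hypothetical cutoff to tune for your data):

# Sketch of the routing idea above: pick ja vs. en toxicity based on the
# dominant language, and skip texts where detection is too close to call.
from lingua import Language, LanguageDetectorBuilder

import langcheck

detector = LanguageDetectorBuilder.from_languages(
    Language.ENGLISH, Language.JAPANESE).build()

def toxicity_by_language(text: str, min_confidence: float = 0.7):
    # Confidence values are sorted in descending order, so [0] is the top one
    top = detector.compute_language_confidence_values(text)[0]
    if top.value < min_confidence:
        return None  # too ambiguous: opt not to process
    if top.language == Language.JAPANESE:
        return langcheck.metrics.ja.toxicity([text])
    return langcheck.metrics.en.toxicity([text])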

Thank you.

Sounds good, we can try the default compute_language_confidence_values() first.

I'm not sure that it'll actually return {"en": 0.5, "ja": 0.5} for a sentence with equal amounts of English and Japanese, so we should test it later.

Other options are to use compute_language_confidence() on a single language at a time, or detect_multiple_languages_of().
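For reference, a quick way to test the mixed-language case (assuming lingua-py's documented API):

# Checking a mixed-language sentence with lingua-py's
# detect_multiple_languages_of(), which returns per-segment
# DetectionResult objects (language, start_index, end_index).
from lingua import Language, LanguageDetectorBuilder

detector = LanguageDetectorBuilder.from_languages(
    Language.ENGLISH, Language.JAPANESE).build()

text = 'This is an English sentence. これは日本語の文です。'
for result in detector.detect_multiple_languages_of(text):
    segment = text[result.start_index:result.end_index]
    print(f'{result.language.name}: {segment!r}')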