citadel-ai / langcheck

Simple, Pythonic building blocks to evaluate LLM applications.

Home Page: https://langcheck.readthedocs.io/en/latest/index.html


Create langcheck.utils.detect_language()

syamaco opened this issue

Hello,
I have a question about the following test code.

  • Question: Is it possible to treat different languages uniformly?
    1. Automatically detect the language. (e.g., EN & JA)
    2. Unify the threshold test value of toxicity across different languages. (e.g., 0.2 for both)
import langcheck

# The first four outputs are Japanese ("Moderate/Excessive ** is said to be
# good/bad for your health."); the last four are their English counterparts.
generated_outputs = [
    '適度な**は健康に良いとされています。',
    '適度な**は健康に悪いとされています。',
    '過度の**は健康に良いとされています。',
    '過度の**は健康に悪いとされています。',
    'Moderate exercise is good for your health.',
    'Moderate exercise is bad for your health.',
    'Excessive exercise is good for your health.',
    'Excessive exercise is bad for your health.',
]

# Toxicity: `< 0.2` runs a threshold test on each metric's results.
# display() assumes a Jupyter notebook.
display(langcheck.metrics.ja.toxicity(generated_outputs) < 0.2)
display(langcheck.metrics.en.toxicity(generated_outputs) < 0.2)

Thank you in advance.

[Screenshot: LangCheck_Toxicity]

Hi @syamaco, thanks for the question!

Automatically detect the language. (e.g., EN & JA)

I think it makes sense to include a langcheck.utils.detect_language() function in LangCheck. There should be some well-known heuristics we can use to implement this.
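For reference, here's a minimal sketch of what such a function could look like, built on the lingua-py library (an assumption on my part; LangCheck may well pick a different detector):

# Minimal sketch of a possible langcheck.utils.detect_language(), built on
# lingua-py (https://github.com/pemistahl/lingua-py). Illustrative only,
# not LangCheck's actual implementation.
from lingua import LanguageDetectorBuilder

_detector = LanguageDetectorBuilder.from_all_languages().build()

def detect_language(text: str) -> str | None:
    """Return the ISO 639-1 code of the most likely language, e.g. 'en'."""
    language = _detector.detect_language_of(text)
    if language is None:  # lingua returns None when it can't decide
        return None
    return language.iso_code_639_1.name.lower()

With this, detect_language('こんにちは') would return 'ja'.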

Unify the threshold test value of toxicity across different languages. (e.g., 0.2 for both)

Currently, English toxicity (detoxify) and Japanese toxicity (a fine-tuned line-distilbert-base-japanese) are completely different models, so setting a single threshold for both may not be optimal.

If you use OpenAI to compute toxicity, I believe it's using the exact same model and prompt for both languages, so you might be able to set a single threshold in that case.
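For example, something like the following (a sketch assuming the model_type / openai_args parameters described in the LangCheck docs; verify against your installed version):

# Sketch: computing toxicity with the OpenAI-based evaluator for both
# languages, so they share one underlying model and prompt.
import langcheck

ja_outputs = generated_outputs[:4]  # the Japanese sentences above
en_outputs = generated_outputs[4:]  # the English sentences above

ja_results = langcheck.metrics.ja.toxicity(
    ja_outputs, model_type='openai', openai_args={'model': 'gpt-3.5-turbo'})
en_results = langcheck.metrics.en.toxicity(
    en_outputs, model_type='openai', openai_args={'model': 'gpt-3.5-turbo'})

# With the same underlying model, a shared threshold such as 0.2 is more
# defensible.
display(ja_results < 0.2)
display(en_results < 0.2)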

@liwii if you have any suggestions, feel free to chime in!

Hi @kennysong, thanks for your response!

It would be helpful to have a language detection function like that.
Additionally, it would be even better if it could detect when multiple languages are mixed within a sentence.

I've come to understand that the appropriate toxicity threshold can vary depending on the language model.
Also, since the threshold might change even for the same language model across versions, I feel it's important to exercise caution when determining it.

I will try OpenAI's language model as well, referring to the sample code.

Thank you.

Additionally, it would be even better if it could detect when multiple languages are mixed within a sentence.

Got it! What would be a useful output of langcheck.utils.detect_language() if there are multiple languages? The simplest idea is a list like ['en', 'ja'], but I think there are many other options that could be useful.

Also, since the threshold might change even for the same language model across versions, I feel it's important to exercise caution when determining it.

This is a good point – I think it's a good idea to pin a specific version of a HuggingFace model in LangCheck where possible. Then we can control model upgrades in LangCheck versions. We can track this as a separate feature request.
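For reference, one way to do that pinning (a sketch using the transformers revision parameter; the revision below is a placeholder, and LangCheck's actual fine-tuned toxicity model may differ):

# Sketch: pinning a Hugging Face model to an exact snapshot with the
# `revision` parameter of from_pretrained(). The base model here just
# illustrates the mechanism.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = 'line-corporation/line-distilbert-base-japanese'
REVISION = '<commit-hash-or-tag>'  # placeholder: pin to a specific snapshot

# trust_remote_code may be needed for this model's custom Japanese tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    MODEL, revision=REVISION, trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL, revision=REVISION)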

@kennysong san, thank you for the suggestion.

Is it possible to set the output of langcheck.utils.detect_language() to the detected languages and their probabilities, like {'en': 0.7, 'ja': 0.3}?
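For example, a shape like that could be produced with lingua-py (a hypothetical sketch; the function name is illustrative, not an agreed API, and ConfidenceValue objects assume lingua >= 1.3):

# Hypothetical sketch of the proposed output shape, on top of lingua-py's
# compute_language_confidence_values().
from lingua import LanguageDetectorBuilder

_detector = LanguageDetectorBuilder.from_all_languages().build()

def detect_language_with_confidence(text: str) -> dict[str, float]:
    """e.g. {'en': 0.7, 'ja': 0.3} -- only non-zero confidences."""
    return {
        cv.language.iso_code_639_1.name.lower(): cv.value
        for cv in _detector.compute_language_confidence_values(text)
        if cv.value > 0
    }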

@kennysong san,

I felt that results similar to detector.compute_language_confidence_values() in Lingua would be natural.
https://github.com/pemistahl/lingua-py#113-confidence-values

ENGLISH: 0.93
FRENCH: 0.04
GERMAN: 0.02
SPANISH: 0.01

If we can identify the main language of the text using langcheck.utils.detect_language() along with its probability, it could be used to decide whether to process the text with langcheck.metrics.ja.toxicity() or langcheck.metrics.en.toxicity(). Additionally, in cases where the confidence values are close, we might opt not to process the text at all.
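For example, that routing could look like this (a sketch assuming lingua-py >= 1.3; the min_confidence value is a hypothetical cutoff to tune for your data):

# Sketch of the routing idea above: pick ja vs. en toxicity based on the
# dominant language, and skip texts where detection is too close to call.
from lingua import Language, LanguageDetectorBuilder

import langcheck

detector = LanguageDetectorBuilder.from_languages(
    Language.ENGLISH, Language.JAPANESE).build()

def toxicity_by_language(text: str, min_confidence: float = 0.7):
    # Confidence values are sorted in descending order, so [0] is the top one
    top = detector.compute_language_confidence_values(text)[0]
    if top.value < min_confidence:
        return None  # too ambiguous: opt not to process
    if top.language == Language.JAPANESE:
        return langcheck.metrics.ja.toxicity([text])
    return langcheck.metrics.en.toxicity([text])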

Thank you.

Sounds good, we can try the default compute_language_confidence_values() first.

I'm not sure that it'll actually return {"en": 0.5, "ja": 0.5} for a sentence with equal amounts of English and Japanese, so we should test it later.

Other options are to use compute_language_confidence() on a single language at a time, or detect_multiple_languages_of().
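For reference, a quick way to test the mixed-language case (assuming lingua-py's documented API):

# Checking a mixed-language sentence with lingua-py's
# detect_multiple_languages_of(), which returns per-segment
# DetectionResult objects (language, start_index, end_index).
from lingua import Language, LanguageDetectorBuilder

detector = LanguageDetectorBuilder.from_languages(
    Language.ENGLISH, Language.JAPANESE).build()

text = 'This is an English sentence. これは日本語の文です。'
for result in detector.detect_multiple_languages_of(text):
    segment = text[result.start_index:result.end_index]
    print(f'{result.language.name}: {segment!r}')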