score_br_model

Goal

This project is a quick-and-dirty tool for evaluating how well some large language models (LLMs) carry out tasks when interacting in the Breton language.
So far, only 2 tasks are implemented:
- br2fr (Breton to French translation)
- fr2br (French to Breton translation)
The evaluation produces a proximity score based on the semantic distance between the text produced by an LLM and a reference text written by a human evaluator.
The semantic distance is based on the proximity of OpenAI embeddings.
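
As an illustration of how embedding proximity can be turned into a score, here is a minimal sketch using cosine similarity. It is not taken from translate_and_eval.py: the function names, the choice of cosine similarity, and the embedding model name (text-embedding-3-small) are all assumptions made for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors, in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction; treat as "no proximity"
    return dot / (norm_a * norm_b)


def embed(texts, model="text-embedding-3-small"):
    """Fetch one embedding per text from OpenAI (needs OPENAI_API_KEY).

    The model name is an assumption, not necessarily the one the tool uses.
    """
    from openai import OpenAI  # imported lazily so the math above works offline

    client = OpenAI()
    response = client.embeddings.create(model=model, input=texts)
    return [item.embedding for item in response.data]


def proximity_score(produced, expected):
    """Hypothetical score: similarity of the LLM output to the reference text."""
    produced_vec, expected_vec = embed([produced, expected])
    return cosine_similarity(produced_vec, expected_vec)
```

With this sketch, a translation whose embedding points in the same direction as the reference scores close to 1, and unrelated texts score lower.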
Currently, the models from the following providers can be tested:
- OpenAI: e.g. gpt-3.5-turbo, gpt-4-turbo, gpt-4o, gpt-4o-mini-2024-07-18
- Google: e.g. gemini-1.0-pro, gemini-1.5-flash, gemini-1.5-pro, palm-2-chat-bison-32k
- Anthropic: e.g. claude-3-5-sonnet-20240620, claude-3-haiku-20240307, claude-3-sonnet-20240229, claude-3-opus-20240229
- Meta: e.g. llama-3.1-70b-versatile, llama-3.1-8b-instant, llama3-8b-8192, llama3-70b-8192
- Mistral: e.g. open-mistral-7b, mistral-large-latest
- Cohere: e.g. command-r-plus

Ubuntu OS
An OPENAI_API_KEY (cf. https://platform.openai.com/api-keys)
A GOOGLE_API_KEY (cf. https://ai.google.dev/gemini-api/docs/api-key)
An ANTHROPIC_API_KEY (cf. https://console.anthropic.com/settings/keys)
A GROQ_API_KEY (cf. https://console.groq.com/keys) for Llama models
A MISTRAL_API_KEY (cf. https://console.mistral.ai/api-keys/)
A COHERE_API_KEY (cf. https://dashboard.cohere.com/api-keys)
An OPENROUTER_API_KEY (cf. https://openrouter.ai/keys) for Google Palm models
Only OPENAI_API_KEY is mandatory, because it is also needed to compute the evaluation scores.
A mandatory source file of your choice (e.g. samples_br.txt)
An optional target file of your choice (e.g. samples_fr.txt). If not provided, evaluation will not be performed.
A dedicated configuration file (e.g. samples_br.yaml)

git clone https://github.com/marxav/score_br_model.git
cd score_br_model
python3 -m venv env
source env/bin/activate
pip install openai pandas ipykernel tabulate llmlite google-generativeai anthropic groq mistralai cohere
echo OPENAI_API_KEY=your-secret-key-1 >> .env
echo GOOGLE_API_KEY=your-secret-key-2 >> .env
echo ANTHROPIC_API_KEY=your-secret-key-3 >> .env
echo GROQ_API_KEY=your-secret-key-4 >> .env
echo MISTRAL_API_KEY=your-secret-key-5 >> .env
echo COHERE_API_KEY=your-secret-key-6 >> .env
echo OPENROUTER_API_KEY=your-secret-key-7 >> .env
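
The .env file built by the commands above holds one KEY=value pair per line. The following is a minimal sketch of how a script could read it and verify that the mandatory key is present. How the tool actually loads its keys is not stated here, so the helper names read_env_file and check_keys are hypothetical.

```python
from pathlib import Path

# Key names taken from the prerequisites list.
KNOWN_KEYS = [
    "OPENAI_API_KEY",
    "GOOGLE_API_KEY",
    "ANTHROPIC_API_KEY",
    "GROQ_API_KEY",
    "MISTRAL_API_KEY",
    "COHERE_API_KEY",
    "OPENROUTER_API_KEY",
]


def read_env_file(path=".env"):
    """Parse KEY=value lines into a dict; blank lines and # comments are skipped."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def check_keys(values):
    """Raise if the mandatory key is missing; return the optional keys not set."""
    if not values.get("OPENAI_API_KEY"):
        raise RuntimeError("OPENAI_API_KEY is mandatory (it is also used for scoring)")
    return [k for k in KNOWN_KEYS if k != "OPENAI_API_KEY" and not values.get(k)]
```

The list returned by check_keys tells you which providers cannot be tested with the current .env file.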

The source text to be translated must be in a *.txt file (e.g. samples_br.txt).
To evaluate the translation, a second file must contain the reference translation (e.g. samples_fr.txt), against which the model output is compared.
Running translate_and_eval.py creates two files:
- A log file containing all translations and scores;
  - For example: samples_br_logs.tsv
- A result file containing the summary of scores.
  - For example: samples_br_res.tsv
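
The naming pattern in the examples above (samples_br.txt gives samples_br_logs.tsv and samples_br_res.tsv) can be sketched as follows. This is an assumption inferred from those two example names, not the tool's actual code, and output_paths and read_tsv are hypothetical helpers.

```python
import csv
from pathlib import Path


def output_paths(source_file):
    """Derive the log and result file paths from the source file name."""
    source = Path(source_file)
    logs = source.with_name(f"{source.stem}_logs.tsv")
    results = source.with_name(f"{source.stem}_res.tsv")
    return logs, results


def read_tsv(path):
    """Load a tab-separated file as a list of dicts keyed by its header row."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))
```

Calling output_paths("samples_br.txt") yields the two example file names listed above, and read_tsv can then be used to inspect either file without assuming its column names.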

Enhance the scoring metric(s)
Add more samples in samples.tsv
Add a leaderboard of the tested LLMs and their scores on the different tasks
- Either like an LMSYS leaderboard
- Or via a product like https://scale.com/leaderboard

Some models may refuse to translate sentences that they classify as:
- HARM_CATEGORY_SEXUALLY_EXPLICIT,
- HARM_CATEGORY_HATE_SPEECH,
- HARM_CATEGORY_HARASSMENT,
- HARM_CATEGORY_DANGEROUS_CONTENT.
Currently, we use a file with lines like "br:port nawak" and "fr:trop chouette"
- For the br2fr task, "br:port nawak" is sent to the model; the answer is then compared with "fr:trop chouette", which is used as the reference value.
- For the fr2br task, "fr:trop chouette" is sent to the model; the answer is then compared with "br:port nawak", which is used as the reference value.
- A model that saves its history could then learn that "br:port nawak" has to be translated as "fr:trop chouette", which would bias the evaluation.
- TODO: try to check if this happens... or avoid using the same input-file for the two different tasks.
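
To make the shared-input concern concrete, here is a minimal sketch of how one file of "br:" and "fr:" lines can feed both tasks. It assumes that each "br:" line is paired with the next "fr:" line (or the reverse); the function names parse_pairs and build_examples are hypothetical and not taken from the tool.

```python
def parse_pairs(lines):
    """Group consecutive 'br:' and 'fr:' lines into (breton, french) pairs."""
    pairs = []
    current = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        lang, sep, text = line.partition(":")
        if not sep or lang not in ("br", "fr"):
            continue  # ignore lines without a recognised language prefix
        current[lang] = text.strip()
        if "br" in current and "fr" in current:
            pairs.append((current["br"], current["fr"]))
            current = {}
    return pairs


def build_examples(pairs, task):
    """Return (prompt_text, reference_text) tuples for the given task."""
    if task == "br2fr":
        return [(br, fr) for br, fr in pairs]
    if task == "fr2br":
        return [(fr, br) for br, fr in pairs]
    raise ValueError(f"unknown task: {task}")


pairs = parse_pairs(["br:port nawak", "fr:trop chouette"])
print(build_examples(pairs, "br2fr"))
print(build_examples(pairs, "fr2br"))
```

The two printed lists contain the same sentences with the roles swapped, which is exactly why a model that retains history from one task could have an advantage on the other.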

tregor_2110_br.txt is a sample of a text written by Gireg Konan in the newspaper Le Tregor, n°2110, June 6th, 2024.