This repository contains the scripts used to generate and evaluate the datasets from the Eval-UA-tion 1.0 benchmark for evaluating LLMs in Ukrainian.
It began as my Master's Thesis (see /other/MA) and was accepted at UNLP 2024 (The Third Ukrainian Natural Language Processing Workshop); a preprint is available: Eval-UA-tion 1.0: Benchmark for Evaluating Ukrainian (Large) Language Models - Archive ouverte HAL.
- Benchmark for evaluating LLMs in the Ukrainian language
- 3 novel tasks (9 datasets in total).
- All tasks have human baselines, and most are contamination-safe (for now...)
- See presentation (https://serhii.net/F/MA/presentation/#/3/1) for more.
- TODO: the UNLP video and presentation will be added here once available.
- /code contains the (messy) code used to generate and evaluate most of the tasks
- /other/MA contains my Master's Thesis and defense presentation (which contain more detail than the paper)
- The code is licensed under CC BY 4.0; the Thesis and presentation under CC BY-NC 4.0, unless stated otherwise.
- The presentation uses the MIT-licensed animation from the README of the excellent anilev6/HumanResponseBot ("a special research project").
These awesome people helped proofread the stories, annotate the datasets, and establish the human baselines (in alphabetical order):
- Oleksii K.
- Viacheslav Kravchenko
- Daria Kravets
- Anna-Izabella Levbarg
- Lina Mykhailenko
- Mariia Tkachenko
- @arturius453
Anna-Izabella Levbarg wrote the anilev6/HumanResponseBot Telegram bot used for all human baselines.