This repository contains the scripts used to generate and evaluate the datasets from the Eval-UA-tion 1.0 benchmark for evaluating LLMs in Ukrainian.
It began as my Master's Thesis (see /other/MA) and was accepted at UNLP 2024 (The Third Ukrainian Natural Language Processing Workshop); a preprint is available: Eval-UA-tion 1.0: Benchmark for Evaluating Ukrainian (Large) Language Models - Archive ouverte HAL.
- Benchmark for evaluating LLMs in the Ukrainian language
- 3 novel tasks (9 datasets in total).
- All tasks have human baselines, and most are contamination-safe (for now...)
- See presentation (https://serhii.net/F/MA/presentation/#/3/1) for more.
- TODO: the UNLP video and presentation will be added here once available.
- /code contains the (messy) code used to generate and evaluate most of the tasks
- /other/MA contains my Master's Thesis and defense presentation (which contain more detail than the paper)
- The code is licensed under CC BY 4.0; the Thesis and presentation under CC BY-NC 4.0, unless stated otherwise.
- The presentation uses the MIT-licensed animation from the README of the excellent anilev6/HumanResponseBot ("a special research project").
These awesome people helped proofread the stories, annotate the datasets, and establish the human baselines (in alphabetical order):
- Oleksii K.
- Viacheslav Kravchenko
- Daria Kravets
- Anna-Izabella Levbarg
- Lina Mykhailenko
- Mariia Tkachenko
- @arturius453
Anna-Izabella Levbarg wrote the anilev6/HumanResponseBot Telegram bot used for all human baselines.