CIRCSE / LT4HALA

[Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA)](https://circse.github.io/LT4HALA/)

EvaLatin 2024

gcelano opened this issue

Given that one of the challenges of the Shared Task 2024 is "to understand which treebank (or combination of treebanks) is the most suitable to deal with new test data", what decision criteria are expected to guide participants in developing a model, if no training data is provided and it has only been disclosed that the test data will contain "prose and poetic texts from different time periods"? This information is too generic to guide informed choices, since the UD Latin treebanks are quite unbalanced with respect to genre and period and also differ in their annotation schemes (e.g., "iobj" is mentioned in the Shared Task guidelines, but it only appears in the LLCT treebank). The Shared Task guidelines then give an example of the test data (from Caesar, De Bello Gallico, 4.1), but that same sentence is also available as training data in the PROIEL treebank (moreover, this sentence shows another issue left unspecified, namely tokenization, as in ne que vs. neque). Without a better definition, the outcome of the Shared Task is going to be largely random.
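As an illustration of the scheme differences, the label inventories of the individual Latin treebanks can be checked directly from their CoNLL-U files. The sketch below is only illustrative; the file names are the standard UD release names and are assumed here, not taken from the task materials.

```python
# Sketch: compare dependency-relation inventories across the UD Latin treebanks.
# The file names follow the standard UD release naming and are assumptions;
# adjust the paths to wherever the treebanks are stored locally.
from collections import Counter
from pathlib import Path

TREEBANKS = {
    "ITTB":    "la_ittb-ud-train.conllu",
    "LLCT":    "la_llct-ud-train.conllu",
    "PROIEL":  "la_proiel-ud-train.conllu",
    "Perseus": "la_perseus-ud-train.conllu",
    "UDante":  "la_udante-ud-train.conllu",
}

def deprel_counts(path):
    """Count DEPREL values (8th column) in a CoNLL-U file, skipping comments, multiword-token and empty-node lines."""
    counts = Counter()
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if "-" in cols[0] or "." in cols[0]:
            continue
        counts[cols[7]] += 1
    return counts

inventories = {name: deprel_counts(path) for name, path in TREEBANKS.items()}
all_labels = set().union(*inventories.values())

# Report labels that are missing from at least one treebank (e.g. "iobj").
for label in sorted(all_labels):
    missing = [name for name, inv in inventories.items() if label not in inv]
    if missing:
        print(f"{label}: absent from {', '.join(missing)}")
```

The same kind of check can be pointed at lemmatization, tokenization (e.g. ne que vs. neque) or morphological features by reading the corresponding CoNLL-U columns.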

Dear Giuseppe,
thanks for the comment.
The purpose of the task is precisely to understand the level of compatibility of the current treebanks (see this paper: https://aclanthology.org/2023.udw-1.2.pdf). In theory, saying that the data follows the UD guidelines for Latin should be sufficient: the fact that the treebanks are annotated according to treebank-specific choices is exactly what makes the task necessary. Furthermore, dealing with unbalanced data or annotation problems is perfectly normal in the real world: a solid system should be able to handle these cases too. Also note that participants can submit multiple runs to try out various configurations. Finally, I would like to stress that evaluation campaigns should not be considered competitions to be won but forums for discussion, a discussion which, thanks to the shared tasks and the new data prepared through the valuable work of the annotators, can be based on empirical evidence rather than just on theoretical assumptions.
Ciao,
Rachele

Hi @RacheleSprugnoli. A condition for a model to perform well is that the test data resemble the training data. If no training data is given, relevant information about the test data is what allows participants to build a model in a principled way (i.e., not randomly). I now realize from your answer that we are writing under different assumptions: you refer to the UD guidelines as if there were a single annotation style for Latin; in reality, however, the Latin UD treebanks follow different annotation styles, and these differences amount to more than the variation expected from different annotators' hands (this is also evidenced by the article you mention, where, despite the harmonization effort, the parser results are still poor). Substantial annotation differences remain in the current Latin UD treebanks, and they will therefore significantly affect any model: this is why relevant information on the composition of the new test set and on the annotation choices adopted for it is necessary to prevent any model built for the Shared Task from being unduly dependent on chance.
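To make the notion of "resemblance" concrete, one simple measure is the Jensen-Shannon divergence between the dependency-label distributions of two treebanks. The sketch below is only an illustration of this idea; the file names are the standard UD release names and are assumed here.

```python
# Sketch: quantify how much two treebanks "resemble" each other by comparing
# their dependency-relation distributions with Jensen-Shannon divergence
# (base 2, so 0 = identical distributions, 1 = completely disjoint).
import math
from collections import Counter
from pathlib import Path

def deprel_distribution(path):
    """Relative frequencies of DEPREL values in a CoNLL-U file."""
    counts = Counter()
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if "-" in cols[0] or "." in cols[0]:  # skip multiword-token and empty-node lines
            continue
        counts[cols[7]] += 1
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

def js_divergence(p, q):
    """Jensen-Shannon divergence between two label distributions given as dicts."""
    m = {l: 0.5 * (p.get(l, 0.0) + q.get(l, 0.0)) for l in set(p) | set(q)}
    kl = lambda a: sum(a[l] * math.log2(a[l] / m[l]) for l in a)
    return 0.5 * kl(p) + 0.5 * kl(q)

# Assumed file names (standard UD release naming); any pair of treebanks can be compared.
proiel = deprel_distribution("la_proiel-ud-train.conllu")
llct = deprel_distribution("la_llct-ud-train.conllu")
print(f"JSD(PROIEL, LLCT) = {js_divergence(proiel, llct):.3f}")
```

The same comparison could be run between each candidate training treebank and whatever sample of the test domain is available, which is where the lack of information about the test set becomes a practical obstacle.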

Hi again!
We agree that the treebanks are annotated differently and that this is not a good thing: if everyone annotates following their own principles, the possibility of making comparisons at the monolingual and cross-lingual level is lost (and such comparisons are among the motivations behind the development of UD).
As already said in the previous answer, the objective is not so much to find a model that performs well but to highlight the problems of the treebanks through collaboration with people who are interested not in finishing first in the ranking but in engaging with their own models on new data.
The test data was annotated using the latest version of the UD guidelines, in the awareness that some syntactic relations (like iobj) are much debated, and setting aside the normal discrepancies that manual annotation can involve (no one is perfect!).
Best,
Rachele

Issue closed.