Adding dataset ARletta
lithlefranc opened this issue
Hello!
Here is our dataset YAML file:
schema: https://htr-united.github.io/schema/2023-06-27/schema.json
title: ARletta
url: https://zenodo.org/records/11191457
authors:
  - name: Lith
    surname: Lefranc
  - name: Ilja
    surname: Van Damme
  - name: Thibault
    surname: Clérice
  - name: Mike
    surname: Kestemont
institutions:
  - name: University of Antwerp
  - name: National Institute for Research in Digital Science and Technology, Paris
description: Open-source handwritten text recognition models for historic Dutch
project-name: Bias in History
project-website: https://www.bias-in-history.eu/
language:
  - nld
  - fra
production-software: eScriptorium + Kraken
automatically-aligned: false
script:
  - iso: Latn
script-type: only-manuscript
time:
  notBefore: '1600'
  notAfter: '1940'
hands:
  count: more-than-10
  precision: estimated
license:
  name: CC-BY-SA 4.0
  url: https://creativecommons.org/licenses/by-sa/4.0/
format: Page-XML
volume:
  - metric: lines
    count: 431359
  - metric: regions
    count: 44536
  - metric: pages
    count: 10267
  - metric: characters
    count: 14253206
transcription-guidelines: |
  Diplomatic transcription: all of the text was transcribed verbatim, preserving all of its original features:
  - orthography: preserve original spelling
  - abbreviations: do not expand abbreviations
  - capitalization: retain original use of uppercase and lowercase letters
  - punctuation: transcribe punctuation marks exactly as they appear, even if they are unconventional by modern standards
  - special characters: include any special characters or symbols as they appear
  - formatting: maintain original formatting such as underlining or strikethrough
  - errors and corrections: include all errors and corrections found in the text
  - non-interpretative: avoid interpreting or modernizing the text
  - use the '@' symbol for characters you cannot read and tag them as 'unclear' on baseline level
  - tag marginal text as 'marginalia' and main body text as 'paragraph' on region level
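As a quick plausibility check on the `volume` figures in the card, the counts can be turned into per-page and per-line averages. This is only a minimal sketch: the numbers are copied from the YAML above, while the `volume` dictionary, the `ratio` helper, and the `summary` keys are names made up for this illustration and are not part of the HTR-United schema or tooling.

```python
# Volume counts copied from the dataset card above.
# The dictionary layout and helper below are illustrative assumptions,
# not part of the HTR-United schema or its tooling.
volume = {
    "lines": 431359,
    "regions": 44536,
    "pages": 10267,
    "characters": 14253206,
}


def ratio(numerator: str, denominator: str) -> float:
    """Return volume[numerator] / volume[denominator], rounded to 2 decimals."""
    return round(volume[numerator] / volume[denominator], 2)


# Derived averages that help spot an obviously mistyped count.
summary = {
    "lines_per_page": ratio("lines", "pages"),
    "regions_per_page": ratio("regions", "pages"),
    "lines_per_region": ratio("lines", "regions"),
    "characters_per_line": ratio("characters", "lines"),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

With the counts as given, this works out to roughly 42 lines per page and 33 characters per line, which looks plausible for handwritten pages; a count that was off by an order of magnitude would stand out immediately.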
Hello Lith,
Thank you very much for this contribution and sorry for the late response! I have created a PR corresponding to the addition of the dataset card in the catalog.
Regarding the description of the transcription guidelines, I think the description could be improved. Could you provide more details or refer to a transcription rulebook published somewhere else?
I have a remark that is not linked to the addition to HTR-United: on the Zenodo repo, you mention twice "/datasets/antw-expert: the image files and preprocessed transcription files for the Antwerp data (annotated by the expert);". Is the repository missing something or is it a typo?
Hello Alix,
Thanks for your remark on the Zenodo repository. That is a typo indeed. I have corrected it.
Regarding the transcription guidelines: do the adjustments suffice?
Thank you!
Best wishes,
Lith
Hello,
Perfect, I just updated the yml file and merged the entry to the catalog :)
Thank you again for your contribution!