HTR-United / htr-united

Ground Truth Resources for the HTR of patrimonial documents

Home Page: https://htr-united.github.io

Adding dataset ARletta

lithlefranc opened this issue

Hello!

Here is our dataset YAML file:

schema: https://htr-united.github.io/schema/2023-06-27/schema.json
title: ARletta
url: https://zenodo.org/records/11191457
authors:
 - name: Lith
   surname: Lefranc
 - name: Ilja
   surname: Van Damme
 - name: Thibault
   surname: Clérice
 - name: Mike
   surname: Kestemont
institutions:
 - name: University of Antwerp
 - name: National Institute for Research in Digital Science and Technology, Paris
description: Open-source handwritten text recognition models for historic Dutch
project-name: Bias in History
project-website: https://www.bias-in-history.eu/
language:
 - nld
 - fra
production-software: eScriptorium + Kraken
automatically-aligned: false
script:
 - iso: Latn
script-type: only-manuscript
time:
 notBefore: '1600'
 notAfter: '1940'
hands:
 count: more-than-10
 precision: estimated
license:
 name: CC-BY-SA 4.0
 url: https://creativecommons.org/licenses/by-sa/4.0/
format: Page-XML
volume:
 - metric: lines
   count: 431359
 - metric: regions
   count: 44536
 - metric: pages
   count: 10267
 - metric: characters
   count: 14253206
transcription-guidelines: |
 Diplomatic transcription: all of the text was transcribed verbatim, preserving all of its original features:
 - orthography: preserve original spelling
 - abbreviations: do not expand abbreviations
 - capitalization: retain original use of uppercase and lowercase letters
 - punctuation: transcribe punctuation marks exactly as they appear, even if they are unconventional by modern standards
 - special characters: include any special characters or symbols as they appear
 - formatting: maintain original formatting such as underlining or strikethrough
 - errors and corrections: include all errors and corrections found in the text
 - non-interpretative: avoid interpreting or modernizing the text
 - use the '@' symbol for characters you cannot read and tag them as 'unclear' at the baseline level
 - tag marginal text as 'marginalia' and main body text as 'paragraph' at the region level
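
For anyone preparing a similar entry, a few quick sanity checks on the numeric fields can catch slips before submission. The sketch below is only an illustration: it copies the `time` and `volume` values from the entry above into a plain Python dict, and the helper names `check_entry` and `per_page_averages` are hypothetical, not part of any HTR-United tooling. It does not validate against the official schema.

```python
# Hypothetical sanity checks for a catalog entry; not HTR-United's own validator.
# The values are copied from the 'time' and 'volume' fields of the entry above.
entry = {
    "time": {"notBefore": "1600", "notAfter": "1940"},
    "volume": [
        {"metric": "lines", "count": 431359},
        {"metric": "regions", "count": 44536},
        {"metric": "pages", "count": 10267},
        {"metric": "characters", "count": 14253206},
    ],
}


def check_entry(entry):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []

    # The date range should run forwards.
    start = int(entry["time"]["notBefore"])
    end = int(entry["time"]["notAfter"])
    if start > end:
        problems.append("time.notBefore is later than time.notAfter")

    # Each metric should appear once, with a positive integer count.
    seen = set()
    for item in entry["volume"]:
        metric, count = item["metric"], item["count"]
        if metric in seen:
            problems.append(f"duplicate volume metric: {metric}")
        seen.add(metric)
        if not isinstance(count, int) or count <= 0:
            problems.append(f"volume count for {metric} must be a positive integer")

    return problems


def per_page_averages(entry):
    """Divide every non-page metric by the page count as a plausibility check."""
    counts = {item["metric"]: item["count"] for item in entry["volume"]}
    pages = counts["pages"]
    return {m: c / pages for m, c in counts.items() if m != "pages"}


problems = check_entry(entry)
averages = per_page_averages(entry)
print(problems)
print({metric: round(value, 1) for metric, value in averages.items()})
```

The per-page averages are a rough plausibility check only: a wildly high or low number of lines per page usually points to a miscounted metric rather than to an unusual corpus.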

Hello Lith,

Thank you very much for this contribution, and sorry for the late response! I have created a PR that adds the dataset card to the catalog.

Regarding the transcription guidelines, I think the description could be improved. Could you provide more details, or refer to a transcription rulebook published elsewhere?

I also have a remark unrelated to the addition to HTR-United: on the Zenodo repo, the line "/datasets/antw-expert: the image files and preprocessed transcription files for the Antwerp data (annotated by the expert);" appears twice. Is the repository missing something, or is it a typo?

Hello Alix,
Thanks for your remark on the Zenodo repository. That was indeed a typo, and I have corrected it.
Regarding the transcription guidelines: do the adjustments suffice?
Thank you!
Best wishes,
Lith

Hello,

Perfect, I just updated the YAML file and merged the entry into the catalog :)
Thank you again for your contribution!