HTR-United / htr-united

Ground Truth Resources for the HTR of patrimonial documents

Home Page: https://htr-united.github.io

Adding "EHRI Dataset"

FloChiff opened this issue

Hi!

Here is my dataset YAML file:

schema: "https://htr-united.github.io/schema/2021-10-15/schema.json"
title: EHRI Dataset
url: 'https://github.com/FloChiff/ehri-dataset'
project-name: >
    European Holocaust Research Infrastructure
project-website: 'https://www.ehri-project.eu/'
authors:
    - name: 'Floriane'
      surname: 'Chiffoleau'
      roles:
      - 'transcriber'
    - name: 'Sarah'
      surname: 'Beniere'
      roles:
      - 'transcriber'
    - name: 'Michal'
      surname: 'Frankl'
      roles:
      - 'transcriber'
    - name: 'Wolfgang'
      surname: 'Schellenbacher'
      roles:
      - 'transcriber'
    - name: 'Zoltán'
      surname: 'Vági'
      roles:
      - 'transcriber'
    - name: 'Gábor'
      surname: 'Kádár'
      roles:
      - 'transcriber'
    - name: 'Magdalena'
      surname: 'Sedlická'
      roles:
      - 'transcriber'
    - name: 'Miriam'
      surname: 'Schulz'
      roles:
      - 'transcriber'
    - name: 'Christine'
      surname: 'Schmidt'
      roles:
      - 'transcriber'
    - name: 'Jessica'
      surname: 'Green'
      roles:
      - 'transcriber'
    - name: 'Martina'
      surname: 'Ravagnan'
      roles:
      - 'transcriber'
    - name: 'Daniela'
      surname: 'Bartáková'
      roles:
      - 'transcriber'
    - name: 'Judith'
      surname: 'Levin'
      roles:
      - 'transcriber'
    - name: 'Daphna'
      surname: 'Sehayek'
      roles:
      - 'transcriber'
    - name: 'Michał'
      surname: 'Czajka'
      roles:
      - 'transcriber'
    - name: 'Marta'
      surname: 'Wojas'
      roles:
      - 'transcriber'
    - name: 'Dagmara'
      surname: 'Chełstowska'
      roles:
      - 'transcriber'
    - name: 'Winfried'
      surname: 'Garscha'
      roles:
      - 'transcriber'
    - name: 'Claudia'
      surname: 'Kuretsidis-Haider'
      roles:
      - 'transcriber'
description: >
  Multilingual dataset from various corpora of the EHRI project
language:
  - eng
  - ces
  - deu
  - slk
  - hun
  - pol
  - dan
script: 
  - Latn
script-type: 'only-typed'
time: 
  notBefore: "1936"
  notAfter: "1958"
hands: 
  count: 'unknown'
  precision: 'estimated'
license:
  - {name: 'CC-BY 4.0', url: 'https://creativecommons.org/licenses/by/4.0/'}
format: 'Alto-XML'
volume:
  - metric: files
    count: 252
  - metric: characters
    count: 540645
  - metric: lines
    count: 9203
production-software: eScriptorium + Kraken

Hello Floriane!
Thank you for this submission!

I recommend using the form on GitHub's website to make sure you use the latest schema for the dataset description. I took the liberty of reformatting the information as follows:

schema: https://htr-united.github.io/schema/2023-06-27/schema.json
title: EHRI Dataset
url: https://github.com/FloChiff/ehri-dataset
authors:
    - name: Floriane
      surname: Chiffoleau
      roles:
      - transcriber
    - name: Sarah
      surname: Beniere
      roles:
      - transcriber
    - name: Michal
      surname: Frankl
      roles:
      - transcriber
    - name: Wolfgang
      surname: Schellenbacher
      roles:
      - transcriber
    - name: Zoltán
      surname: Vági
      roles:
      - transcriber
    - name: Gábor
      surname: Kádár
      roles:
      - transcriber
    - name: Magdalena
      surname: Sedlická
      roles:
      - transcriber
    - name: Miriam
      surname: Schulz
      roles:
      - transcriber
    - name: Christine
      surname: Schmidt
      roles:
      - transcriber
    - name: Jessica
      surname: Green
      roles:
      - transcriber
    - name: Martina
      surname: Ravagnan
      roles:
      - transcriber
    - name: Daniela
      surname: Bartáková
      roles:
      - transcriber
    - name: Judith
      surname: Levin
      roles:
      - transcriber
    - name: Daphna
      surname: Sehayek
      roles:
      - transcriber
    - name: Michał
      surname: Czajka
      roles:
      - transcriber
    - name: Marta
      surname: Wojas
      roles:
      - transcriber
    - name: Dagmara
      surname: Chełstowska
      roles:
      - transcriber
    - name: Winfried
      surname: Garscha
      roles:
      - transcriber
    - name: Claudia
      surname: Kuretsidis-Haider
      roles:
      - transcriber
institutions: []
description: Multilingual dataset from various corpora of the EHRI project
project-name: European Holocaust Research Infrastructure
project-website: https://www.ehri-project.eu/
language:
  - eng
  - ces
  - deu
  - slk
  - hun
  - dan
  - pol
production-software: eScriptorium + Kraken
automatically-aligned: false
script:
  - iso: Latn
script-type: only-typed
time:
  notBefore: '1936'
  notAfter: '1958'
hands:
  count: unknown
  precision: estimated
license:
  name: CC-BY 4.0
  url: https://creativecommons.org/licenses/by/4.0/
format: Alto-XML
volume:
  - metric: files
    count: 252
  - metric: characters
    count: 540645
  - metric: lines
    count: 9203
transcription-guidelines: provide information on the transcription guidelines

Can you:

  • make sure that the information was copied correctly?
  • provide content for the "transcription-guidelines" entry?
  • add more details to the description of the dataset? For example, specifying that the dataset is made of typewritten forms, etc.

Also, though this is just a suggestion, I think the dataset could be called "EHRI Multilingual Dataset", since being multilingual is a really important aspect of it.
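
The reformatting described in this comment can be sketched in Python. This is only an illustration based on the differences visible between the two entries in this thread, not the official HTR-United tooling: the names `upgrade_entry`, `NEW_SCHEMA`, and `old_entry` are hypothetical, and the dictionary is a hand-written, shortened stand-in for the parsed YAML.

```python
from copy import deepcopy

# Schema URL used by the reformatted entry in this thread.
NEW_SCHEMA = "https://htr-united.github.io/schema/2023-06-27/schema.json"


def upgrade_entry(old: dict) -> dict:
    """Return a copy of `old` reshaped like the reformatted entry (hypothetical helper)."""
    new = deepcopy(old)
    new["schema"] = NEW_SCHEMA

    # Script codes become mappings: "Latn" -> {"iso": "Latn"}.
    new["script"] = [
        item if isinstance(item, dict) else {"iso": item}
        for item in new.get("script", [])
    ]

    # The one-item license list becomes a single mapping.
    lic = new.get("license")
    if isinstance(lic, list) and len(lic) == 1:
        new["license"] = lic[0]

    # Keys that appear only in the reformatted entry; the defaults mirror the thread.
    new.setdefault("institutions", [])
    new.setdefault("automatically-aligned", False)

    # Folded scalars (">") keep a trailing newline once parsed; strip it.
    for key in ("description", "project-name"):
        if isinstance(new.get(key), str):
            new[key] = new[key].strip()
    return new


# Shortened stand-in for the first entry, as a YAML parser would return it.
old_entry = {
    "schema": "https://htr-united.github.io/schema/2021-10-15/schema.json",
    "title": "EHRI Dataset",
    "project-name": "European Holocaust Research Infrastructure\n",
    "script": ["Latn"],
    "license": [
        {"name": "CC-BY 4.0", "url": "https://creativecommons.org/licenses/by/4.0/"}
    ],
}

new_entry = upgrade_entry(old_entry)
print(new_entry["script"], new_entry["license"]["name"])
```

Comparing the output of such a helper with the reformatted entry, key by key, is one way to check that nothing was lost in the copy. List order should be ignored for fields such as `language`, where `dan` and `pol` are swapped between the two entries.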

Hello Alix!
Thank you for your input.

Here is the content of the YAML with the additions that you asked for:

schema: https://htr-united.github.io/schema/2023-06-27/schema.json
title: EHRI Multilingual Dataset
url: https://github.com/FloChiff/ehri-dataset
authors:
    - name: Floriane
      surname: Chiffoleau
      roles:
      - transcriber
    - name: Sarah
      surname: Beniere
      roles:
      - transcriber
    - name: Michal
      surname: Frankl
      roles:
      - transcriber
    - name: Wolfgang
      surname: Schellenbacher
      roles:
      - transcriber
    - name: Zoltán
      surname: Vági
      roles:
      - transcriber
    - name: Gábor
      surname: Kádár
      roles:
      - transcriber
    - name: Magdalena
      surname: Sedlická
      roles:
      - transcriber
    - name: Miriam
      surname: Schulz
      roles:
      - transcriber
    - name: Christine
      surname: Schmidt
      roles:
      - transcriber
    - name: Jessica
      surname: Green
      roles:
      - transcriber
    - name: Martina
      surname: Ravagnan
      roles:
      - transcriber
    - name: Daniela
      surname: Bartáková
      roles:
      - transcriber
    - name: Judith
      surname: Levin
      roles:
      - transcriber
    - name: Daphna
      surname: Sehayek
      roles:
      - transcriber
    - name: Michał
      surname: Czajka
      roles:
      - transcriber
    - name: Marta
      surname: Wojas
      roles:
      - transcriber
    - name: Dagmara
      surname: Chełstowska
      roles:
      - transcriber
    - name: Winfried
      surname: Garscha
      roles:
      - transcriber
    - name: Claudia
      surname: Kuretsidis-Haider
      roles:
      - transcriber
institutions: []
description: This dataset was created from files in various corpora produced by the EHRI Project. As the project disseminates archives from World War II and the Holocaust, the dataset consists of documents in several languages (Czech, Danish, English, German, Hungarian, Polish, and Slovak) and of various types (reports, testimonies, letters, etc.). The common thread among all of these documents is that they are typewritten.
project-name: European Holocaust Research Infrastructure
project-website: https://www.ehri-project.eu/
language:
  - eng
  - ces
  - deu
  - slk
  - hun
  - dan
  - pol
production-software: eScriptorium + Kraken
automatically-aligned: false
script:
  - iso: Latn
script-type: only-typed
time:
  notBefore: '1936'
  notAfter: '1958'
hands:
  count: unknown
  precision: estimated
license:
  name: CC-BY 4.0
  url: https://creativecommons.org/licenses/by/4.0/
format: Alto-XML
volume:
  - metric: files
    count: 252
  - metric: characters
    count: 540645
  - metric: lines
    count: 9203
transcription-guidelines: The texts reproduce exactly what is on the images, except for two characters from the Slovak and Czech parts of the dataset. Those languages use a caron on several letters of their alphabets. These letters were encoded as such, except when the caron was placed on a 'd' or a 't', as this could not be entered in eScriptorium. In those cases, the letter is followed by an apostrophe-like stroke instead.

I hope everything is okay now.
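
For anyone comparing other Czech or Slovak text with these transcriptions, the 'd' and 't' convention from the guidelines can be sketched as a small normalization step. This is a sketch under assumptions: the thread does not say which code point was used for the apostrophe-like stroke, so a plain ASCII apostrophe (U+0027) is assumed here, only lowercase letters are handled, and the function name `to_dataset_convention` is hypothetical.

```python
import unicodedata

# Assumption: the "apostrophe-like stroke" is a plain ASCII apostrophe (U+0027).
STROKE = "'"


def to_dataset_convention(text: str, stroke: str = STROKE) -> str:
    """Rewrite lowercase d/t with caron as the base letter followed by a stroke."""
    # Compose first, so that "d" + combining caron (U+030C) becomes U+010F.
    composed = unicodedata.normalize("NFC", text)
    return (
        composed
        .replace("\u010f", "d" + stroke)  # d with caron -> d'
        .replace("\u0165", "t" + stroke)  # t with caron -> t'
    )


# Other letters with a caron (here U+010D, c with caron) are left untouched.
print(to_dataset_convention("lo\u010f, chu\u0165, \u010daj"))
```

If the dataset turns out to use a different character for the stroke, it can be passed through the `stroke` parameter.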