HTR-United / htr-united

Ground Truth Resources for the HTR of patrimonial documents

Home Page: https://htr-united.github.io

Adding dataset EpiSearch (Astori’s letters)

federico-boschetti opened this issue

Hello! We are glad to send you the metadata for the dataset described in https://doi.org/10.5281/zenodo.7719291.

Here is our dataset YAML file:

schema: https://htr-united.github.io/schema/2022-04-15/schema.json
title: EpiSearch HTR
url: https://github.com/vedph/episearch-htr
authors:
 - name: Lorenzo
   surname: Calvelli
   orcid: 0000-0002-0920-9156
   roles:
     - project-manager
 - name: Tatiana
   surname: Tommasi
   orcid: 0009-0000-2815-0113
   roles:
     - transcriber
 - name: Federico
   surname: Boschetti
   orcid: 0000-0002-7810-7735
   roles:
     - support
institutions: []
description: Ground Truth for Astori’s letters (see the README.md file for details)
project-name: EpiSearch
project-website: https://github.com/vedph/episearch-htr
language:
 - ita
production-software: eScriptorium + Kraken
script:
 - iso: Latn
script-type: only-manuscript
time:
 notBefore: '1705'
 notAfter: '1709'
hands:
 count: '1'
 precision: exact
license:
 - name: CC-BY-SA 4.0
   url: https://creativecommons.org/licenses/by-sa/4.0/
format: Alto-XML
volume:
 - metric: files
   count: 34

Hello @federico-boschetti!

Thank you for your contribution! I made #122 to add the dataset description to the catalog.

I have two questions regarding the dataset:

  1. I saw that some lines are not segmented or transcribed. It's not a problem, but I just wanted to make sure it is intentional.

  2. Regarding the organization of the repository, I think it would be easier for users if you put all the JPEG and XML files in a data/ folder instead of having them all at the root level (as we suggested in the template). Do you think you could do this?

Otherwise, as far as the description is concerned, it's all good for merging.
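For anyone who wants to check point 1 on their own copy of the files, here is a minimal sketch that counts lines which were segmented but left without transcription in an ALTO document. It assumes the usual ALTO layout of `TextLine` elements containing `String` elements with a `CONTENT` attribute, and it matches tags by local name so that it does not depend on a particular ALTO namespace version. The function name `count_untranscribed_lines` is made up for this example. Lines that were never segmented do not appear in the XML at all, so this cannot detect them.

```python
import xml.etree.ElementTree as ET


def local(tag: str) -> str:
    """Strip the '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def count_untranscribed_lines(alto_xml: str) -> tuple[int, int]:
    """Return (empty_lines, total_lines) for one ALTO document.

    Assumption: a line counts as untranscribed when its String
    children carry no non-whitespace CONTENT.
    """
    root = ET.fromstring(alto_xml)
    total = 0
    empty = 0
    for elem in root.iter():
        if local(elem.tag) != "TextLine":
            continue
        total += 1
        # Concatenate the CONTENT of every String inside this line.
        text = "".join(
            child.get("CONTENT", "")
            for child in elem.iter()
            if local(child.tag) == "String"
        )
        if not text.strip():
            empty += 1
    return empty, total
```

Running it over each XML file and printing the two numbers gives a quick overview of how much of each page is intentionally left blank.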

Hello @alix-tz!
Thank you for your feedback.

  1. The omissions are intentional: introductory formulae and signatures were over-represented and lowered training performance.
  2. I created the "data" directory and moved the images and XML files into it, as you suggested.
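
For other contributors facing the same reorganization, the move in point 2 can be sketched roughly as follows. This is only an illustration, not the procedure actually used here: the function name `move_into_data` and the extension list are assumptions, and it simply moves every JPEG and XML file found at the repository root into a `data/` folder. Moving each image together with its XML file keeps them side by side, which matters if the XML refers to the image by bare file name.

```python
from pathlib import Path
import shutil

# Assumed set of extensions for images and transcriptions.
EXTENSIONS = {".jpg", ".jpeg", ".xml"}


def move_into_data(repo_root: Path, folder: str = "data") -> list[Path]:
    """Move root-level image and XML files into repo_root/folder.

    Returns the new paths of the files that were moved.
    """
    target = repo_root / folder
    target.mkdir(exist_ok=True)
    moved = []
    # Only look at the root level; files in subfolders are left alone.
    for path in sorted(repo_root.iterdir()):
        if path.is_file() and path.suffix.lower() in EXTENSIONS:
            destination = target / path.name
            shutil.move(str(path), str(destination))
            moved.append(destination)
    return moved
```

Calling `move_into_data(Path("."))` from the repository root leaves files such as the README in place and returns the list of relocated files, which can then be committed.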

Awesome! I just confirmed the addition of the description of the dataset to the catalog.

Thank you again!