Adding dataset for Malayalam - (issue #104)

Question

nidame opened this issue a year ago

Hello! Here is the metadata for "Ground Truth for printed Malayalam". I hope the data is correct.
Belongs to issue #104

Here is our dataset YAML file:

schema: https://htr-united.github.io/schema/2022-04-15/schema.json
title: Ground Truth data for printed Malayalam
url: https://doi.org/10.11588/data/L2KRZO
authors: []
institutions:
 - name: Tübingen University Library
   roles:
     - project-manager
description: >-
 Ground Truth (GT) data (JPG and ALTO XML files) which can be used to train OCR
 models that recognize printed text in Malayalam script. The training material
 is gathered from 19th- and 20th-century prints.


 Models were trained on the GT data in Transkribus with the HTR+ and PyLaia
 engines, resulting in a CER of 2.29% on the validation set with HTR+ and 3.20%
 with PyLaia. Training was performed on 43 pages with approx. 9,000 words. The
 validation set consisted of 5 pages (ca. 1,000 words).


 Transcription was performed by Tübingen University Library; the Ground Truth
 data was created by Elena Mucciarelli (University of Groningen) with support
 and model training by Dorothee Huff (Tübingen University Library).
 (2022-11-02)
project-name: DigitalSouthAsia
project-website: http://idb.ub.uni-tuebingen.de/digitue/southasia
language:
 - mal
production-software: Transkribus
script:
 - iso: Mlym
script-type: only-typed
time:
 notBefore: '1850'
 notAfter: '1996'
hands:
 count: unknown
 precision: exact
license:
 - name: CC-BY 4.0
   url: https://creativecommons.org/licenses/by/4.0/
format: Alto-XML
volume:
 - metric: pages
   count: 43
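
For readers unfamiliar with the CER figures quoted in the description, here is a minimal sketch of how a character error rate is commonly computed: the edit distance between the reference transcription and the model output, divided by the reference length. This is an illustration only and not necessarily how Transkribus computes its figures. The function names `levenshtein` and `cer` are hypothetical, and the sketch counts Unicode code points, so a Malayalam vowel sign counts as its own character rather than as part of a grapheme cluster.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    # prev[j] holds the distance between the reference prefix seen so far
    # and the first j characters of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Assumed toy pair, not taken from the dataset.
    print(f"{cer('kitten', 'sitting'):.2%}")
```

A CER of 2.29% under this definition would mean roughly 2 to 3 character edits per 100 reference characters.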