calfa-co / Patrologia-Graeca

Main repository of the CGPG project for OCR and Text Analysis of the Patrologia Graeca

Repository on GitHub: https://github.com/calfa-co/Patrologia-Graeca

Patrologia-Graeca

The CGPG project (Calfa GREgORI Patrologia Graeca), led by Jean-Marie Auwers (UCLouvain), aims to apply OCR to the Patrologia Graeca volumes that are not yet available in digital form. The project relies on the expertise of GREgORI and Calfa.

The project is sponsored by the ASBL Byzantion, the Fondation Sedes Sapientiae, the Institut Religions, Spiritualités, Cultures, Sociétés (RSCS, UCLouvain) and the Centre d'études orientales (CIOL, UCLouvain) and by a generous donor who wishes to remain anonymous. Other sponsors have recently expressed their willingness to support the project.

Modus operandi

The project builds specialized OCR models that automatically read the heavily damaged typefaces of the Patrologia Graeca and extract only the Greek content. The resulting texts are then tagged (lemmatization, POS, and morphology). This GitHub repository provides the raw data produced. A proofread version of each text will gradually be made available through the GREgORI interfaces.

Works and Authors Dataset

| ID | Edition File | Edition URL | Author | Author URL | Author Date | Work Description | Word Count | Raw Text | Markup TXT | SkE | Analysis |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 71 | PG071_ed.pdf | Link | Cyril of Alexandria | Wikipedia | 4th-5th AD | Commentarius in Oseam prophetam, in Joelem prophetam, In Amos prophetam, In Abdiam prophetam, In Jonam prophetam, In Michæam prophetam, In Nahum prophetam, In Habacuc prophetam, In Sophoniam prophetam, In Aggæum prophetam. | 208423 | available | available | available | forthcoming |
| 73 | PG073_ed.pdf | Link | Cyril of Alexandria | Wikipedia | 4th-5th AD | In Joannis Evangelium | 230336 | available | available | available | forthcoming |
| 087.1 | PG087.1_ed.pdf | Link | Procopius of Gaza | Wikipedia | 5th-6th AD | Commentarii in OT | 211763 | available | available | forthcoming | forthcoming |
| 101 | PG101_ed.pdf | Link | Photios I of Constantinople | Wikipedia | 9th AD | Amphilochiana, Commentarii in NT | 229437 | available | available | forthcoming | forthcoming |
| 109 | PG109_ed.pdf | Link | Scriptores Post Theophanem | | N/A | | 211898 | available | available | available | forthcoming |
| 112 | PG112_ed.pdf | Link | Constantine Porphyrogenitus | Wikipedia | 10th AD | De ceremoniis | 153718 | available | available | forthcoming | forthcoming |
| 123 | PG123_ed.pdf | Link | Theophylact of Ohrid | Wikipedia | 11th-12th AD | Commentarii in NT | 247369 | available | available | forthcoming | forthcoming |
| 124 | PG124_ed.pdf | Link | Theophylact of Ohrid | Wikipedia | 11th-12th AD | Commentarii in NT | 263430 | available | available | forthcoming | forthcoming |
| 125 | PG125_ed.pdf | Link | Theophylact of Ohrid | Wikipedia | 11th-12th AD | Commentarii in NT | 249703 | available | available | forthcoming | forthcoming |
| 126 | PG126_ed.pdf | Link | Theophylact of Ohrid | Wikipedia | 11th-12th AD | Commentarii in NT; et alia opera | 229628 | available | available | forthcoming | forthcoming |
| 134 | PG134_ed.pdf | Link | Joannes Zonaras | Wikipedia | 11th-12th AD | Annales | 271191 | available | available | available | forthcoming |
| 146 | PG146_ed.pdf | Link | Nikephoros Kallistos Xanthopoulos | Wikipedia | 13th-14th AD | Ecclesiastica Historia | 242816 | available | available | available | forthcoming |
| 155 | PG155_ed.pdf | Link | Simeon of Thessalonica | Wikipedia | 14th-15th AD | Dialogus in Christo (et alia opera) | 204532 | available | available | available | forthcoming |
| 158 | PG158_ed.pdf | Link | Michael Glykas (et al.) | Wikipedia | 12th AD | Annales (et alia) | 195632 | available | available | available | forthcoming |
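
As a quick check on the size of the released corpus, the word counts listed above can be summed. The sketch below hard-codes the figures from the "Word Count" column; the names `WORD_COUNTS` and `total_words` are illustrative and do not correspond to any file or script in the repository.

```python
# Word counts copied from the "Word Count" column above, keyed by PG volume ID.
# This dictionary is an illustrative structure, not a repository file.
WORD_COUNTS = {
    "71": 208423,
    "73": 230336,
    "087.1": 211763,
    "101": 229437,
    "109": 211898,
    "112": 153718,
    "123": 247369,
    "124": 263430,
    "125": 249703,
    "126": 229628,
    "134": 271191,
    "146": 242816,
    "155": 204532,
    "158": 195632,
}


def total_words(counts):
    """Return the sum of all per-volume word counts."""
    return sum(counts.values())


if __name__ == "__main__":
    # prints "14 volumes, 3,149,876 words"
    print(f"{len(WORD_COUNTS)} volumes, {total_words(WORD_COUNTS):,} words")
```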

File formats description

  • *_text_raw.txt: UTF-8 plain text, raw OCR output.
  • *_text_markup.txt: derived from the *_text_raw.txt file, with text-structure markup (volume number, page number in the source PDF file), end-of-line hyphenation removed, and empty lines removed.
  • *_text_markup_ske.vert: derived from the *_text_markup.txt file, ready for use on the Sketch Engine platform; upcoming versions will add lexical analysis (lemmatization and POS tagging).
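
The step from raw text to markup text (rejoining hyphenated words and dropping empty lines) can be approximated as follows. This is a minimal sketch under stated assumptions, not the project's actual procedure: it assumes that a word split across two lines ends with an ASCII hyphen `-`, and the function name `dehyphenate` is hypothetical. It does not add the volume or page markup, whose exact syntax is not described here.

```python
def dehyphenate(raw_text: str) -> str:
    """Join words split by a line-final hyphen and drop empty lines.

    Assumption: a line ending in "-" continues on the next non-empty line.
    """
    out = []
    carry = ""  # first half of a hyphenated word, waiting for its second half
    for line in raw_text.splitlines():
        line = line.strip()
        if not line:
            continue  # empty lines are removed
        line = carry + line
        carry = ""
        if line.endswith("-"):
            # Hold back the last (incomplete) word and prepend it to the next line.
            head, _, tail = line[:-1].rpartition(" ")
            carry = tail
            if head:
                out.append(head)
        else:
            out.append(line)
    if carry:
        out.append(carry + "-")  # dangling hyphen at end of input: keep it as-is
    return "\n".join(out)
```

A real hyphen at the end of a line (for instance in a compound) would be joined too, so the output of such a heuristic still needs proofreading.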

For optimal use in Sketch Engine, configure the corpus (Manage Corpus/Configure/Expert settings) by replacing

ATTRIBUTE "lc" {
    DYNAMIC "utf8lowercase"
    DYNLIB "internal"
    DYNTYPE "freq"
    FROMATTR "word"
    FUNTYPE "0"
    LABEL "word (lowercase)"
    TRANSQUERY "yes"
}

by

ATTRIBUTE "intuitive_word" {
}
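
Before uploading a *_text_markup_ske.vert file, it can be useful to inspect it locally. Sketch Engine's vertical format puts one token per line (with any attributes separated by tabs) and structure tags on lines of their own. The sketch below separates the two kinds of lines; the function name `read_vertical` is hypothetical, the tag test is a simple heuristic, and the tag names actually used in the repository files are not listed here.

```python
def read_vertical(lines):
    """Yield (kind, value) pairs from vertical-format lines.

    kind is "tag" for structure lines such as <doc> or </doc>, and "token"
    for token lines, whose value is the list of tab-separated attributes.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        if line.startswith("<") and line.endswith(">"):
            yield "tag", line
        else:
            yield "token", line.split("\t")


def count_tokens(lines):
    """Count token lines, ignoring structure tags."""
    return sum(1 for kind, _ in read_vertical(lines) if kind == "token")
```

For a file on disk, `count_tokens(open(path, encoding="utf-8"))` gives a token count that can be compared with the figure Sketch Engine reports after compilation.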

Ground-truth

A first training dataset was released on Zenodo in 2022: https://zenodo.org/records/7296539.

@dataset{vidal_gorene_2022_7296539,
  author       = {Vidal-Gorène, Chahan and
                  Kindt, Bastien},
  title        = {Patrologia Graeca (OCR ground truth)},
  month        = nov,
  year         = 2022,
  publisher    = {Zenodo},
  version      = {1.0.0},
  doi          = {10.5281/zenodo.7296539},
  url          = {https://doi.org/10.5281/zenodo.7296539}
}
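
The record can also be queried programmatically. The sketch below derives the Zenodo record ID from the DOI given above and builds the corresponding URL on Zenodo's public REST API; the helper names are hypothetical, and the structure of the returned JSON is not assumed here beyond it being a JSON object.

```python
import json
import urllib.request

ZENODO_API = "https://zenodo.org/api/records/"
DOI_PREFIX = "10.5281/zenodo."


def record_id_from_doi(doi):
    """Extract the numeric record ID from a Zenodo DOI."""
    if not doi.startswith(DOI_PREFIX):
        raise ValueError(f"not a Zenodo DOI: {doi}")
    return doi[len(DOI_PREFIX):]


def api_url(doi):
    """Build the REST API URL for the record behind a Zenodo DOI."""
    return ZENODO_API + record_id_from_doi(doi)


def fetch_record(doi):
    """Download and parse the record metadata (requires network access)."""
    with urllib.request.urlopen(api_url(doi)) as response:
        return json.load(response)


# Example (performs a network request):
# record = fetch_record("10.5281/zenodo.7296539")
```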

Bibliography

Within the scope of the CGPG project

About guidelines for transcription and layout analysis

@article{vidalgorene:hal-03982432,
  TITLE = {{La reconnaissance automatique d'écriture à l'épreuve des langues peu dotées}},
  AUTHOR = {Vidal-Gorène, Chahan},
  URL = {https://enc.hal.science/hal-03982432},
  JOURNAL = {{The Programming Historian en français}},
  NUMBER = {5},
  YEAR = {2023},
  DOI = {10.46430/phfr0023},
}
@article{vidalgorene:hal-04565386,
  TITLE = {{Reconhecimento autom{\'a}tico de manuscritos para o teste de idiomas n{\~a}o latinos}},
  AUTHOR = {Vidal-Gorène, Chahan and Paulino, Joana},
  URL = {https://hal.science/hal-04565386},
  JOURNAL = {{Programming Historian em portugu{\^e}s}},
  NUMBER = {4},
  YEAR = {2024},
  DOI = {10.46430/phpt0046},
}

Related publications

@article{kindt2024fondation,
  author    = {Kindt, B. and Auwers, J.-M.},
  title     = {La Fondation Sedes Sapientiae soutient le projet de valorisation numérique de la Patrologie Grecque},
  journal   = {Bulletin de la Fondation Sedes Sapientiae},
  volume    = {45},
  month     = jan,
  year      = {2024},
  pages     = {19--21}
}
@article{kindt2022analyse,
  title={Analyse automatique du grec ancien par r{\'e}seau de neurones. {\'E}valuation sur le corpus De Thessalonica Capta},
  author={Kindt, Bastien and Vidal-Gor{\`e}ne, Chahan and Delle Donne, Saulo},
  journal={Bulletin de l’Acad{\'e}mie Belge pour l’{\'E}tude des Langues Anciennes et Orientales},
  pages={537--562},
  year={2022}
}
@article{kindt2022manuscript,
  title={From Manuscript to Tagged Corpora, An Automated Process for Ancient Armenian or Other Under-Resourced Languages of the Christian East},
  author={Kindt, Bastien and Vidal-Gor{\`e}ne, Chahan},
  journal={Armeniaca-International Journal of Armenian Studies},
  volume={1},
  pages={73--96},
  year={2022}
}
