The CGPG project (Calfa GREgORI Patrologia Graeca), led by Jean-Marie Auwers (UCLouvain), aims to apply OCR to the Patrologia Graeca volumes that are not yet available in digital form. The project relies on the expertise of GREgORI and Calfa.
The project is sponsored by the ASBL Byzantion, the Fondation Sedes Sapientiae, the Institut Religions, Spiritualités, Cultures, Sociétés (RSCS, UCLouvain) and the Centre d'études orientales (CIOL, UCLouvain) and by a generous donor who wishes to remain anonymous. Other sponsors have recently expressed their willingness to support the project.
The project develops specialized OCR models for the automatic reading of the heavily damaged Patrologia Graeca fonts and for the extraction of the Greek content only. The texts produced are then tagged (lemmatization, POS, and morphology). This GitHub repository provides the raw data produced. A proofread version of each text will gradually be made available within the GREgORI interfaces.
ID | Edition File | Edition URL | Author | Author URL | Author Date | Work Description | Word Count | Raw Text | Markup TXT | SkE | Analysis |
---|---|---|---|---|---|---|---|---|---|---|---|
71 | PG071_ed.pdf | Link | Cyril of Alexandria | Wikipedia | 4th-5th AD | Commentarius in Oseam prophetam, in Joelem prophetam, In Amos prophetam, In Abdiam prophetam, In Jonam prophetam, In Michæam prophetam, In Nahum prophetam, In Habacuc prophetam, In Sophoniam prophetam, In Aggæum prophetam. | 208423 | available | available | available | forthcoming |
73 | PG073_ed.pdf | Link | Cyril of Alexandria | Wikipedia | 4th-5th AD | In Joannis Evangelium | 230336 | available | available | available | forthcoming |
087.1 | PG087.1_ed.pdf | Link | Procopius of Gaza | Wikipedia | 5th-6th AD | Commentarii in OT | 211763 | available | available | forthcoming | forthcoming |
101 | PG101_ed.pdf | Link | Photios I of Constantinople | Wikipedia | 9th AD | Amphilochiana, Commentarii in NT | 229437 | available | available | forthcoming | forthcoming |
109 | PG109_ed.pdf | Link | Scriptores post Theophanem | N/A | ∅ | ∅ | 211898 | available | available | available | forthcoming |
112 | PG112_ed.pdf | Link | Constantine Porphyrogenitus | Wikipedia | 10th AD | De ceremoniis | 153718 | available | available | forthcoming | forthcoming |
123 | PG123_ed.pdf | Link | Theophylact of Ohrid | Wikipedia | 11th-12th AD | Commentarii in NT | 247369 | available | available | forthcoming | forthcoming |
124 | PG124_ed.pdf | Link | Theophylact of Ohrid | Wikipedia | 11th-12th AD | Commentarii in NT | 263430 | available | available | forthcoming | forthcoming |
125 | PG125_ed.pdf | Link | Theophylact of Ohrid | Wikipedia | 11th-12th AD | Commentarii in NT | 249703 | available | available | forthcoming | forthcoming |
126 | PG126_ed.pdf | Link | Theophylact of Ohrid | Wikipedia | 11th-12th AD | Commentarii in NT; et alia opera | 229628 | available | available | forthcoming | forthcoming |
134 | PG134_ed.pdf | Link | Joannes Zonaras | Wikipedia | 11th-12th AD | Annales | 271191 | available | available | available | forthcoming |
146 | PG146_ed.pdf | Link | Nikephoros Kallistos Xanthopoulos | Wikipedia | 13th-14th AD | Ecclesiastica Historia | 242816 | available | available | available | forthcoming |
155 | PG155_ed.pdf | Link | Simeon of Thessalonica | Wikipedia | 14th-15th AD | Dialogus in Christo (et alia opera) | 204532 | available | available | available | forthcoming |
158 | PG158_ed.pdf | Link | Michael Glykas (et al.) | Wikipedia | 12th AD | Annales (et alia) | 195632 | available | available | available | forthcoming |
- `*_text_raw.txt`: UTF-8 plain text, raw OCR result.
- `*_text_markup.txt`: derived from the `*_text_raw.txt` file, with text-structure markup (volume number, page number of the source PDF file), hyphenation removed, and empty lines deleted.
- `*_text_markup_ske.vert`: derived from the `*_text_markup.txt` file, usable on the Sketch Engine platform; upcoming versions will feature lexical analysis (lemmatization and POS tagging).
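The relationship between the three file types can be illustrated with a minimal Python sketch. It is not the project's actual pipeline: the function names `clean_raw_text` and `to_vertical`, the rule that a hyphen directly before a line break marks a split word, the simple regex tokenizer, and the `<doc id="...">` wrapper are all assumptions made for illustration. The only format convention relied on is that a Sketch Engine vertical file holds one token per line.

```python
import re
import unicodedata


def clean_raw_text(raw: str) -> str:
    """Rejoin words hyphenated across line breaks and delete empty lines."""
    # NFC keeps each accented Greek letter as one code point so \w matches it.
    text = unicodedata.normalize("NFC", raw)
    # Assumption: a hyphen directly before a line break marks a split word.
    text = re.sub(r"(\w)-[ \t]*\r?\n[ \t]*(\w)", r"\1\2", text)
    lines = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


def to_vertical(text: str, doc_id: str) -> str:
    """Emit one token per line, wrapped in a hypothetical <doc> structure."""
    # Words become one token each; every punctuation mark is its own token.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return "\n".join([f'<doc id="{doc_id}">', *tokens, "</doc>"])


sample = "Ἐν ἀρχῇ ἦν ὁ λό-\nγος,\n\n\nκαὶ ὁ λόγος ἦν."
print(to_vertical(clean_raw_text(sample), "PG073"))
```

A rule this simple would also join genuine hyphenated compounds that happen to fall at a line end, so the output of such a script should be treated as a starting point for proofreading and not as a finished text.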
For optimal use in Sketch Engine, configure the corpus (Manage Corpus/Configure/Expert settings) by replacing
ATTRIBUTE "lc" {
    DYNAMIC "utf8lowercase"
    DYNLIB "internal"
    DYNTYPE "freq"
    FROMATTR "word"
    FUNTYPE "0"
    LABEL "word (lowercase)"
    TRANSQUERY "yes"
}
by
ATTRIBUTE "intuitive_word" {
}
A first training dataset was released on Zenodo in 2022: https://zenodo.org/records/7296539.
@dataset{vidal_gorene_2022_7296539,
author = {Vidal-Gorène, Chahan and
Kindt, Bastien},
title = {Patrologia Graeca (OCR ground truth)},
month = nov,
year = 2022,
publisher = {Zenodo},
version = {1.0.0},
doi = {10.5281/zenodo.7296539},
url = {https://doi.org/10.5281/zenodo.7296539}
}
@article{vidalgorene:hal-03982432,
TITLE = {{La reconnaissance automatique d'écriture à l'épreuve des langues peu dotées}},
AUTHOR = {Vidal-Gorène, Chahan},
URL = {https://enc.hal.science/hal-03982432},
JOURNAL = {{The Programming Historian en français}},
NUMBER = {5},
YEAR = {2023},
DOI = {10.46430/phfr0023},
}
@article{vidalgorene:hal-04565386,
TITLE = {{Reconhecimento autom{\'a}tico de manuscritos para o teste de idiomas n{\~a}o latinos}},
AUTHOR = {Vidal-Gorène, Chahan and Paulino, Joana},
URL = {https://hal.science/hal-04565386},
JOURNAL = {{Programming Historian em portugu{\^e}s}},
NUMBER = {4},
YEAR = {2024},
DOI = {10.46430/phpt0046},
}
@article{kindt2024fondation,
author = {Kindt, B. and Auwers, J.-M.},
title = {La Fondation Sedes Sapientiae soutient le projet de valorisation numérique de la Patrologie Grecque},
journal = {Bulletin de la Fondation Sedes Sapientiae},
volume = {45},
month = jan,
year = {2024},
pages = {19--21}
}
@article{kindt2022analyse,
title={Analyse automatique du grec ancien par r{\'e}seau de neurones. {\'E}valuation sur le corpus De Thessalonica Capta},
author={Kindt, Bastien and Vidal-Gor{\`e}ne, Chahan and Delle Donne, Saulo},
journal={Bulletin de l’Acad{\'e}mie Belge pour l’{\'E}tude des Langues Anciennes et Orientales},
pages={537--562},
year={2022}
}
@article{kindt2022manuscript,
title={From Manuscript to Tagged Corpora, An Automated Process for Ancient Armenian or Other Under-Resourced Languages of the Christian East},
author={Kindt, Bastien and Vidal-Gor{\`e}ne, Chahan},
journal={Armeniaca-International Journal of Armenian Studies},
volume={1},
pages={73--96},
year={2022}
}