OCR-D / gt_structure_text

The OCR-D Ground Truth text and structure corpus was created between 2015 and 2017. Since 2017, the corpus has been further curated and supplemented with metadata where appropriate. It consists of PAGE XML files that include annotations of text and structure.

Home Page: https://OCR-D.github.io/gt_structure_text/

gt_structure_text

The OCR-D Ground Truth text and structure corpus was created between 2015 and 2017. Since 2017, the corpus has been further curated and supplemented with metadata where appropriate. It consists of PAGE XML files that include annotations of text and structure. The data is based on transcriptions stored in the German Text Archive (Deutsches Textarchiv, DTA) (https://www.deutschestextarchiv.de/).
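
As a minimal sketch of how the text annotations in such a file might be read, the following Python snippet collects the line-level transcription of every TextLine element. It assumes the usual PAGE XML layout (PcGts > Page > TextRegion > TextLine > TextEquiv > Unicode) and reads the namespace from the root element, since the schema version can differ between files. The embedded sample document and the function name `line_texts` are illustrative only and are not taken from the corpus.

```python
import xml.etree.ElementTree as ET

# Tiny hand-written PAGE XML document, for illustration only (not corpus data).
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="sample.tif" imageWidth="100" imageHeight="100">
    <TextRegion id="r1">
      <TextLine id="r1_l1">
        <Word id="w1"><TextEquiv><Unicode>Erste</Unicode></TextEquiv></Word>
        <TextEquiv><Unicode>Erste Zeile</Unicode></TextEquiv>
      </TextLine>
      <TextLine id="r1_l2">
        <TextEquiv><Unicode>Zweite Zeile</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>
"""


def line_texts(xml_string):
    """Return (line id, text) pairs for every TextLine in a PAGE XML string."""
    root = ET.fromstring(xml_string)
    # The root tag looks like "{namespace}PcGts"; reuse whatever namespace the file declares.
    namespace = root.tag[1:].split("}")[0]
    ns = {"pc": namespace}
    result = []
    for line in root.iterfind(".//pc:TextLine", ns):
        # Only the TextEquiv that is a direct child of the line holds the
        # line-level text; Word-level TextEquiv elements are skipped.
        unicode_el = line.find("pc:TextEquiv/pc:Unicode", ns)
        text = unicode_el.text if unicode_el is not None else None
        result.append((line.get("id"), text))
    return result


for line_id, text in line_texts(SAMPLE):
    print(line_id, text)
```

For a file on disk, the same function could be fed the file contents, or `ET.parse(path).getroot()` could replace the `ET.fromstring` call.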

Metadata

Language: eng, fra, deu, heb, lat
Format: Page-XML
Time: 1500-1900
GT Type: data_structure_and_text
License: CC-BY-SA-4.0
Transcription Guidelines: OCR-D Ground Truth Guidelines, https://ocr-d.de/en/gt-guidelines/trans/
Project: OCR-D
Project-URL: https://ocr-d.de/

Sources

The volume of transcriptions:

TextLine  Page  TxtRegion  ImgRegion  GraphRegion  TabRegion  SepRegion  MathRegion  MusicRegion  NoiseRegion
6609      217   1648       1          74           3          141        1           4            17
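
Counts like these could be derived by tallying element names in each PAGE XML file. The sketch below is one possible way to do that in Python and is not the tooling used to produce the table. It assumes that the abbreviated column names correspond to PAGE XML element names (for example TxtRegion to TextRegion and SepRegion to SeparatorRegion). The sample document and the function name `count_elements` are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Tiny hand-written PAGE XML document, for illustration only (not corpus data).
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="sample.tif" imageWidth="100" imageHeight="100">
    <TextRegion id="r1">
      <TextLine id="r1_l1"/>
      <TextLine id="r1_l2"/>
    </TextRegion>
    <SeparatorRegion id="r2"/>
    <TextRegion id="r3">
      <TextLine id="r3_l1"/>
    </TextRegion>
  </Page>
</PcGts>
"""


def count_elements(xml_string):
    """Count Page, TextLine and *Region elements in a PAGE XML string."""
    root = ET.fromstring(xml_string)
    counts = Counter()
    for element in root.iter():
        if not isinstance(element.tag, str):
            continue  # skip comments or processing instructions, if present
        # Strip the "{namespace}" prefix to get the local element name.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name in ("Page", "TextLine") or local_name.endswith("Region"):
            counts[local_name] += 1
    return counts


totals = count_elements(SAMPLE)
for name, number in sorted(totals.items()):
    print(name, number)
```

Because `Counter` objects can be added together, per-file results could be summed to obtain totals for a whole document or for the corpus.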

List of transcriptions

document TxtRegion ImgRegion LineDrawRegion GraphRegion TabRegion ChartRegion SepRegion MathRegion ChemRegion MusicRegion AdRegion NoiseRegion UnknownRegion CustomRegion TextLine Page
nn_lied_1520 5 1 1 22 1
silesius_seelenlust01_1657 38 1 7 4 137 5
nn_mirabilia_1500 10 2 58 3
loeber_heuschrecken_1693 15 1 3 87 3
rollenhagen_reysen_1603 22 1 81 3
luther_babstum_1526 7 2 51 2
huebner_handbuch_1696 26 4 4 78 3
reinkingk_policey_1653_teil1 20 1 146 3
benner_herrnhuterey04_1748 37 6 144 4
reinkingk_policey_1653_teil2 21 1 108 2
vespucci_insule_1506 7 62 2
arnold_ketzerhistorie01_1699 43 6 378 4
luz_blitz_1784 17 1 4 110 4
basilius_legendi_1515 12 2 82 3
clauren_mimil_1815 44 1 206 9
pistoris_regiment_1506 12 90 3
nn_lied_1515 6 25 1
valentinus_occulta_1603 22 1 1 164 6
gerstner_mechaniktafeln01_1831 2 1 2 1
bohse_helicon_1696 35 3 2 121 5
pinder_epiphanie_1506 31 1 5 169 4
boeschenstain_gedicht_1520 9 1 45 1
alberti_pictura_1540 22 1 94 3
osiander_predigt_1553 7 57 2
herder_geschichte03_1787 5 3 14 1
heyden_paedono_1548 19 72 3
witzstat_buchszbaum_1540 13 47 2
oesterreicher_sachsen_1548 8 2 48 2
brenz_abentmal_1550 22 89 4
kistler_kraeuter_1500 14 58 2
kant_aufklaerung_1784 15 4 55 2
buerger_gedichte_1778 14 6 52 2
petrarca_psalmi_1506 13 2 64 3
blumenbach_anatomie_1805 20 84 3
praetorius_verrichtung_1668 38 2 197 5
ruempler_gartenbau_1882 105 2 3 9 1 6
calvi_beutelschneider01_1627 21 3 87 3
wecker_kochbuch_1598 35 156 4
dannhauer_catechismus10_1673 18 151 4
hilbert_zahlkoerper_1897 46 4 5
laube_europa0202_1837 15 2 7 43 5
nn_historia_1500 5 1 35 2
hohberg_georgica01_1682_teil2 27 159 2
aventinus_grammatica_1515 29 19 1 129 3
rhegius_artzney_1529 12 1 80 3
lohenstein_agrippina_1665 56 3 1 109 3
estor_rechtsgelehrsamkeit02_1758 44 1 3 153 4
nn_vertrag_1525 5 35 2
trota_mordtbrenner_1540 20 2 44 2
schiller_raeuber_1781 15 2 54 2
hohberg_georgica01_1682_teil1 14 3 66 2
euler_rechenkunst01_1738 94 8 31 234 6
arnimb_goethe03_1835 5 1 22 1
karlstadt_sermon_1523 5 1 1 65 2
bebel_frau_1879 20 3 164 4
ballenstedt_delatio_1777 26 3 98 3
lessing_menschengeschlecht_1780 8 1 15 1
nn_besuch_1780 5 3 1 76 4
aepinus_bekentnis_1548 20 3 101 4
weigel_gnothi02_1618 22 1 128 4
vischer_aesthetikregister_1858 1 1
glauber_opera01_1658 127 3 2 376 6
sachs_drey_1553 7 54 2
bernd_lebensbeschreibung_1738 15 4 1 71 3
justi_abhandlung01_1758 37 1 1 131 4
meyfart_rhetorica_1634 27 4 113 4
luther_auszlegunge_1520 10 59 2
praetorius_syntagma02_1619_teil1 72 1 4 168 4
praetorius_syntagma02_1619_teil2 30 1 5 136 4
