jtauber / lgpsi-processing

repo to set up processing scripts and data for LGPSI

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

LGPSI Processing

Repo for processing LGPSI. For official content, see seumasjeltzz/LinguaeGraecaePerSeIllustrata.

Contributors

  • Seumas Macdonald
  • James Tauber

Directory Layout

So far, the only data directories being manually modified are:

  • orig (the original files from Seumas)
  • manual-data (data needed for processing, e.g. lemma overrides from Seumas)

The main output directories are:

  • text (the processed text in GLTP format)
  • analysis (further analysis of the text in GLTP format)

Other directories are:

  • cache (for storing the Morpheus cache)
  • config (for storing configuration like for text-validation)
  • scripts (where all the code lives)

Scripts

The scripts are run in this order (after dependencies in the Pipfile are installed):

  • ./scripts/orig-to-para.py converts from orig files to para files under text
  • ./scripts/para-to-sent.py converts those para files to sentence-based sent files
  • ./scripts/add-norm.py produces the norm files in analysis from the sent files
  • ./scripts/lemmatise.py produces the lemma files in analysis from the norm files using Morpheus and manual-data/lemma_overrides.yaml

The folowing are modules not called from the command-line:

  • morpheus.py (Morphology API client)
  • utils.py (common functions shared between scripts)

Other scripts include:

  • sort-yaml.py <filename> sorts the given yaml file with top-level keys in alphabetical order

License

The content is CC-BY-SA and the code is MIT.

About

repo to set up processing scripts and data for LGPSI

License:Creative Commons Attribution Share Alike 4.0 International


Languages

Language:Python 100.0%