jfilter / german-lemmatizer-docker

✂️ Combining the power of several tools for lemmatization of German text

German Lemmatizer Docker Image

A Docker image to lemmatize German texts.

Built upon spaCy, IWNLP, and GermanLemma.

It works as follows. First, spaCy tags each token with its part of speech (POS). Then German Lemmatizer looks up the lemma in both IWNLP and GermanLemma. If the two disagree, the lemma from IWNLP is chosen. If they agree, or only one tool finds a lemma, that lemma is taken. The casing of the original token is preserved where possible.
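The selection rule described above can be sketched in a few lines of Python. This is only an illustration, not the code shipped in the image: the function names `choose_lemma` and `match_casing` are hypothetical, the two lookup results are passed in as plain arguments (with `None` meaning "not found"), the fallback to the original token when neither tool finds a lemma is an assumption, and the casing rule shown is just one possible reading of "preserve the casing".

```python
from typing import Optional


def match_casing(token: str, lemma: str) -> str:
    """Give the lemma the same initial casing as the original token.

    Assumption: "preserving the casing" means copying the case of the
    first character only; the real implementation may differ.
    """
    if not token or not lemma:
        return lemma
    first = lemma[0].upper() if token[0].isupper() else lemma[0].lower()
    return first + lemma[1:]


def choose_lemma(
    token: str,
    iwnlp_lemma: Optional[str],
    germalemma_lemma: Optional[str],
) -> str:
    """Pick a lemma from the two lookup results (None means "not found")."""
    if iwnlp_lemma is not None:
        # Covers both agreement and disagreement: IWNLP wins either way.
        lemma = iwnlp_lemma
    elif germalemma_lemma is not None:
        # Only GermanLemma found something.
        lemma = germalemma_lemma
    else:
        # Assumption: if neither tool knows the token, keep it unchanged.
        lemma = token
    return match_casing(token, lemma)
```

For instance, `choose_lemma("Ging", None, "gehen")` takes the only available lemma and re-capitalizes it to match the token.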

You may want to use the Python wrapper: German Lemmatizer

Installation

  1. Install Docker.

Usage

  1. Read and accept the license terms of the TIGER Corpus (free to use for non-commercial purposes).

  2. Start Docker.

  3. To execute, you have two options:

    1. To lemmatize a string from the terminal, run:
    docker run -it filter/german-lemmatizer:0.5.0 "Was ist das für ein Leben?" [--remove_stop]
    2. To lemmatize a collection of texts, mount two local folders into the Docker container (NB: you have to give absolute paths):
    docker run -it -v $(pwd)/some_input_folder:/input -v $(pwd)/some_output_folder:/output filter/german-lemmatizer:0.5.0 [--line] [--escape] [--remove_stop]

    With --line, each line is treated as a separate document instead of the whole file.

    With --escape, newlines are treated as escaped ('\n' -> '\\n') within each document (one document per line), so the text in the input file has to be preprocessed accordingly.

    --remove_stop removes stop words as defined by spaCy.
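As a rough sketch of how the folder-based mode could be prepared from Python: the helper below writes one escaped document per line (the layout described for --line together with --escape) and assembles the docker run command with absolute paths. The function names, the input file name, and the idea of generating the command from Python are all assumptions for illustration; only the image tag, the mount points, and the flags come from the commands above.

```python
import shlex
from pathlib import Path

# Image tag taken from the commands in this README.
IMAGE = "filter/german-lemmatizer:0.5.0"


def escape_document(text: str) -> str:
    # Replace each real newline with the two characters backslash and "n",
    # so that the whole document fits on a single line of the input file.
    return text.replace("\n", "\\n")


def write_input_file(documents, path):
    # Hypothetical helper: one escaped document per line.
    lines = [escape_document(doc) for doc in documents]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")


def build_docker_command(input_dir, output_dir,
                         line=False, escape=False, remove_stop=False):
    # resolve() turns relative folders into the absolute paths Docker needs.
    cmd = [
        "docker", "run", "-it",
        "-v", f"{Path(input_dir).resolve()}:/input",
        "-v", f"{Path(output_dir).resolve()}:/output",
        IMAGE,
    ]
    if line:
        cmd.append("--line")
    if escape:
        cmd.append("--escape")
    if remove_stop:
        cmd.append("--remove_stop")
    return cmd


if __name__ == "__main__":
    # Print the command so it can be pasted into a terminal.
    command = build_docker_command(
        "some_input_folder", "some_output_folder", line=True, escape=True
    )
    print(shlex.join(command))
```

The sketch only prints the command rather than running it, since `-it` expects an interactive terminal. It also does not handle backslashes that already occur in the text.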

The Case for Reproducibility

Everything (all the code and all the data) is packaged in the Docker image. This means that every lemmatization is reproducible. In the future, I may update the code and/or data, but each image is tagged with a specific version.

Dev Remarks

  • Tried to base it on a Docker Alpine image, but there were too many installation hassles.
  • Tried to parallelise with joblib, but it created too much overhead.
  • To build an image, run docker build -t lemma . in this folder.
  • For debugging purposes, you may want to enter the container and override the entry point: docker run -it --entrypoint /bin/bash lemma
  • To publish a tagged version: docker build -t filter/german-lemmatizer:0.5.0 . and docker push filter/german-lemmatizer:0.5.0

License

MIT.

Sponsoring

This work was created as part of a project that was funded by the German Federal Ministry of Education and Research.
