KorAP / KorAP-XML-TEI

Conversion of TEI P5 based formats to KorAP-XML

NAME

tei2korapxml - Conversion of TEI P5 based formats to KorAP-XML

SYNOPSIS

cat corpus.i5.xml | tei2korapxml - > corpus.korapxml.zip

DESCRIPTION

tei2korapxml is a script to convert TEI P5 and I5 based documents to the KorAP-XML format.

This program is usually called from inside another script.

FORMATS

Input restrictions

  • TEI P5 formatted input with certain restrictions:

    • mandatory: text-header with integrated textsigle (or convertible identifier), text-body

    • optional: corp-header with integrated corpsigle, doc-header with integrated docsigle

  • Tokens inside the primary text must not be separated by newlines. Newlines are removed (see KorAP::XML::TEI::Data), and converting them into blanks instead could introduce blanks between two tokens where there should be none (e.g. punctuation characters like , or . should not be separated from their predecessor token). See also the code section ~ whitespace handling ~ in script/tei2korapxml.

  • Header types, like <idsHeader [...] type="document" [...] >, need to be defined on the same line as the header tag.
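The newline restriction above can be illustrated with standard tools. The following is only a sketch that uses tr as a stand-in for the two strategies mentioned (removing newlines versus turning them into blanks); it is not the converter's own whitespace handling, and the sample text is made up.

```shell
# Sample primary text in which tokens are (wrongly) separated by newlines.
text=$(printf 'Hallo\n,\nWelt\n.')

# Strategy 1: remove newlines. Tokens that needed a blank are glued together.
removed=$(printf '%s' "$text" | tr -d '\n')

# Strategy 2: turn newlines into blanks. Punctuation is detached from its
# predecessor token.
replaced=$(printf '%s' "$text" | tr '\n' ' ')

printf 'removed:  %s\nreplaced: %s\n' "$removed" "$replaced"
```

Neither result reproduces the intended text "Hallo, Welt.", which is why the input should not rely on newlines as token separators.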

Notes on the output

  • zip file output (default on stdout) with utf8 encoded entries (which together form the KorAP-XML format)

INSTALLATION

tei2korapxml requires libxml2-dev bindings and File::ShareDir::Install to be installed. When these requirements are met, the preferred way to install the script is to use cpanm.

$ cpanm https://github.com/KorAP/KorAP-XML-TEI.git

In case everything went well, the tei2korapxml tool will be available on your command line immediately.

Minimum requirement for KorAP::XML::TEI is Perl 5.36.

OPTIONS

--input|-i

The input file to process. If no specific input is defined and a single dash - is passed as an argument, data is read from STDIN.

--root|-r

The root directory for output. Defaults to . (the current directory).

--help|-h

Print help information.

--version|-v

Print version information.

--tokenizer-korap|-tk

Use the standard KorAP/DeReKo tokenizer.

--tokenizer-internal|-ti

Tokenize the data using two embedded tokenizers, one taking an aggressive and one a conservative approach.

--tokenizer-call|-tc

Call an external tokenizer process that tokenizes from STDIN and outputs the offsets of all tokens.

Texts are separated using \x04\n. The external process should add a new line per text.

If the "--use-tokenizer-sentence-splits" option is activated, sentences are marked by offset as well in new lines.
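The framing of this exchange can be sketched with a toy process. The function below is a hypothetical stand-in, not a real tokenizer: it reads texts terminated by \x04\n from STDIN and prints one line per text, but for brevity that line holds a word count instead of token offsets. The actual offset format expected by tei2korapxml is not shown here.

```shell
# Hypothetical stand-in for an external tokenizer (name and output are
# illustrative only). Newlines inside a text become blanks, every EOT
# character (\004) becomes the line break that ends a text, and awk then
# prints one number per non-empty line. Whitespace-only texts are skipped
# in this toy version.
toy_tokenizer() {
  tr '\n\004' ' \n' | awk 'NF { print NF }'
}

# Two texts, each terminated by EOT followed by a newline.
printf 'Der erste Text.\004\nEin zweiter,\nText.\004\n' | toy_tokenizer
```

The point of the sketch is the contract only: two texts go in, two lines come out.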

To use Datok including sentence splitting, call tei2korapxml as follows:

$ cat corpus.i5.xml | tei2korapxml -s \
    -tc 'datok tokenize \
         -t ./tokenizer.matok \
         -p --newline-after-eot --no-sentences \
         --no-tokens --sentence-positions -' - \
         > corpus.korapxml.zip

--skip-inline-tokens

Boolean flag indicating that inline tokens should not be processed. Defaults to false (meaning inline tokens will be processed).

--skip-inline-token-annotations

Boolean flag indicating that inline token annotations should not be processed. Defaults to true (meaning inline token annotations won't be processed).

--skip-inline-tags <tags>

Expects a comma-separated list of tags to be ignored when the structure is parsed. The content of these tags will, however, still be processed.

--xmlid-to-textsigle <from-regex>@<to-c/to-d/to-t>

Expects a regular replacement expression (separated by @ between the search and the replacement) to convert text id attributes to text sigles with three parts (separated by /).

Example:

tei2korapxml  \
  --xmlid-to-textsigle 'ICC.German\.([^.]+\.[^.]+)\.(.+)@ICCGER/$1/$2' \
  -tk - < t/data/icc_german_sample.p5.xml

Converts text id ICC.German.DeReKo.WPD17.G11.00238 to sigle ICCGER/DeReKo.WPD17/G11.00238.
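A rule like this can be tried out before running the converter. The sketch below splits the rule at @ and applies it with sed; because sed writes backreferences as \1 rather than $1, the replacement part is translated first. This is an illustration under those assumptions, not the way tei2korapxml applies the rule internally, and all variable names are made up. The pattern used here happens to behave the same in Perl and in sed's extended syntax, which need not hold for every rule.

```shell
rule='ICC.German\.([^.]+\.[^.]+)\.(.+)@ICCGER/$1/$2'

# Split the rule at the first @ into search pattern and replacement.
pattern=${rule%%@*}
replacement=${rule#*@}

# Translate Perl-style $1, $2 into sed-style \1, \2.
sed_replacement=$(printf '%s' "$replacement" | sed -E 's/\$([0-9])/\\\1/g')

# Apply the substitution; | is used as delimiter because the sigle contains /.
sigle=$(printf '%s\n' 'ICC.German.DeReKo.WPD17.G11.00238' \
  | sed -E "s|${pattern}|${sed_replacement}|")

printf '%s\n' "$sigle"
```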

--inline-tokens <foundry>#[<file>]

Define the foundry and file (without extension) to store inline token information in. Unless --skip-inline-token-annotations is set, this will contain annotations as well. Defaults to tokens and morpho.

The inline token data will also be stored in the inline structures file (see --inline-structures), unless the inline token foundry is prefixed with an exclamation mark (!), indicating that inline tokens are stored exclusively in the inline tokens file.

Example:

tei2korapxml --inline-tokens '!gingko#morpho' < data.i5.xml > korapxml.zip

--inline-structures <foundry>#[<file>]

Define the foundry and file (without extension) to store inline structure information in. Defaults to struct and structures.
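The <foundry>#[<file>] notation shared by --inline-tokens and --inline-structures can be read as follows. The function is a hypothetical helper written for illustration, not part of tei2korapxml, and it assumes that a missing file part falls back to the option's default file name.

```shell
# parse_spec SPEC DEFAULT_FILE
# Prints "foundry file exclusive", where exclusive is 1 if SPEC starts
# with an exclamation mark and 0 otherwise.
parse_spec() {
  spec=$1
  default_file=$2
  exclusive=0

  # A leading ! marks the exclusive variant; strip it before splitting.
  case $spec in
    '!'*) exclusive=1; spec=${spec#\!} ;;
  esac

  # Everything before the first # is the foundry.
  foundry=${spec%%\#*}

  # Everything after the first # is the file, if one is given
  # (assumption: otherwise the default file name applies).
  case $spec in
    *'#'?*) file=${spec#*\#} ;;
    *)      file=$default_file ;;
  esac

  printf '%s %s %s\n' "$foundry" "$file" "$exclusive"
}

parse_spec '!gingko#morpho' morpho   # prints "gingko morpho 1"
```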

--base-foundry <foundry>

Define the base foundry to store newly generated token information in. Defaults to base.

--data-file <file>

Define the file (without extension) to store primary data information in. Defaults to data.

--header-file <file>

Define the file name (without extension) to store header information on the corpus, document, and text level in. Defaults to header.

--use-tokenizer-sentence-splits|-s

Replace existing sentence boundary information with, or add, the sentence boundaries provided by the tokenizer. Currently the KorAP tokenizer and certain external tokenizers support these boundaries.

--tokens-file <file>

Define the file (without extension) to store generated token information in (either from the KorAP tokenizer or an externally called tokenizer). Defaults to tokens.

--log|-l

Log level for Log::Any. Defaults to notice.

ENVIRONMENT VARIABLES

KORAPXMLTEI_DEBUG

Activate minimal debugging. Defaults to false.

COPYRIGHT AND LICENSE

Copyright (C) 2021-2024, IDS Mannheim

Author: Peter Harders

Contributors: Nils Diewald, Marc Kupietz, Carsten Schnober

KorAP::XML::TEI is developed as part of the KorAP Corpus Analysis Platform at the Leibniz Institute for the German Language (IDS), member of the Leibniz-Gemeinschaft.

This program is free software published under the BSD-2 License.
