NAME

tei2korapxml - Conversion of TEI P5 based formats to KorAP-XML

SYNOPSIS

cat corpus.i5.xml | tei2korapxml - > corpus.korapxml.zip

DESCRIPTION

tei2korapxml is a script to convert TEI P5 and I5 based documents to the KorAP-XML format.

This program is usually called from inside another script.

FORMATS

Input restrictions

TEI P5 formatted input with certain restrictions:
- mandatory: text-header with integrated textsigle (or convertible identifier), text-body
- optional: corp-header with integrated corpsigle, doc-header with integrated docsigle
Tokens inside the primary text must not be separated by newlines. Newlines are removed (see KorAP::XML::TEI::Data), and converting them into blanks between two tokens could introduce blanks where there should be none (e.g. punctuation characters like , or . should not be separated from their predecessor token). See also the code section ~ whitespace handling ~ in script/tei2korapxml.
Header types, like <idsHeader [...] type="document" [...] > need to be defined in the same line as the header tag.
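
The newline restriction above can be illustrated with a minimal sketch. It uses a made-up three-token text and plain tr, not the tool's own whitespace handling, to show why neither removing newlines nor turning them into blanks recovers the intended text when tokens are newline separated:

```shell
# Hypothetical primary text with one token per line
# (the layout the restriction above warns about).
text=$(printf 'Hello\nworld\n.')

# Newlines removed: the two words are glued together.
removed=$(printf '%s' "$text" | tr -d '\n')

# Newlines turned into blanks: the full stop is detached from its predecessor.
blanked=$(printf '%s' "$text" | tr '\n' ' ')

printf '%s\n' "$removed"   # Helloworld.
printf '%s\n' "$blanked"   # Hello world .
```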

Notes on the output

The output is a zip file (written to STDOUT by default) with UTF-8 encoded entries, which together form the KorAP-XML format.

INSTALLATION

tei2korapxml requires libxml2-dev bindings and File::ShareDir::Install to be installed. When these requirements are met, the preferred way to install the script is to use cpanm.

$ cpanm https://github.com/KorAP/KorAP-XML-TEI.git

In case everything went well, the tei2korapxml tool will be available on your command line immediately.

Minimum requirement for KorAP::XML::TEI is Perl 5.36.

OPTIONS

--input|-i

The input file to process. If no specific input is defined and a single dash - is passed as an argument, data is read from STDIN.

--root|-r

The root directory for output. Defaults to . (the current directory).

--help|-h

Print help information.

--version|-v

Print version information.

--tokenizer-korap|-tk

Use the standard KorAP/DeReKo tokenizer.

--tokenizer-internal|-ti

Tokenize the data using two embedded tokenizers that take an aggressive and a conservative approach, respectively.

--tokenizer-call|-tc

Call an external tokenizer process that reads the texts from STDIN and outputs the offsets of all tokens.

Texts are separated using \x04\n. The external process should add a new line per text.

If the "--use-tokenizer-sentence-splits" option is activated, sentences are marked by offset as well in new lines.

To use Datok including sentence splitting, call tei2korapxml as follows:

$ cat corpus.i5.xml | tei2korapxml -s \
    -tc 'datok tokenize \
         -t ./tokenizer.matok \
         -p --newline-after-eot --no-sentences \
         --no-tokens --sentence-positions -' - \
         > corpus.korapxml.zip
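
The framing described for --tokenizer-call (texts separated by \x04\n on STDIN, one output line per text) can be sketched with a toy stand-in for an external tokenizer. The function toy_tokenizer is hypothetical and not part of tei2korapxml or Datok. It splits on whitespace only, and the output layout (space-separated start and end offsets) is an assumption made for illustration, so check the tokenizer you actually use for the exact format:

```shell
# Hypothetical toy tokenizer: reads texts separated by \x04\n from STDIN
# and prints one line of whitespace-token offsets per text.
# Assumption: offsets are emitted as "start end start end ...".
toy_tokenizer() {
  awk 'BEGIN { RS = "\004" }
  {
    text = $0
    sub(/^\n/, "", text)      # drop the newline that follows each \x04
    if (text == "") next      # simplification: empty records are skipped
    pos = 0; line = ""
    while (match(text, /[^ \t\n]+/)) {
      start = pos + RSTART - 1
      end = start + RLENGTH
      line = line (line == "" ? "" : " ") start " " end
      pos = end
      text = substr(text, RSTART + RLENGTH)
    }
    print line
  }'
}

# Two texts, each terminated by \x04 followed by a newline.
offsets=$(printf 'Hello world\004\nfoo bar baz\004\n' | toy_tokenizer)
printf '%s\n' "$offsets"
```

For the two sample texts this prints one line each: the offsets of "Hello" and "world", then those of "foo", "bar" and "baz".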

--skip-inline-tokens

Boolean flag indicating that inline tokens should not be processed. Defaults to false (meaning inline tokens will be processed).

--skip-inline-token-annotations

Boolean flag indicating that inline token annotations should not be processed. Defaults to true (meaning inline token annotations won't be processed).

--skip-inline-tags <tags>

Expects a comma-separated list of tags to be ignored when the structure is parsed. The content of these tags, however, will still be processed.

--xmlid-to-textsigle <from-regex>@<to-c/to-d/to-t>

Expects a regular expression replacement (with @ separating the search pattern from the replacement) to convert text id attributes to text sigles with three parts (separated by /).

Example:

tei2korapxml  \
  --xmlid-to-textsigle 'ICC.German\.([^.]+\.[^.]+)\.(.+)@ICCGER/$1/$2' \
  -tk - < t/data/icc_german_sample.p5.xml

Converts text id ICC.German.DeReKo.WPD17.G11.00238 to sigle ICCGER/DeReKo.WPD17/G11.00238.
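
To try out a replacement expression before running a full conversion, the same substitution can be approximated with sed. This is only a sketch: it mirrors the pattern from the example above in sed's extended regular expression syntax (\1 and \2 instead of $1 and $2) and is not how tei2korapxml applies the expression internally:

```shell
# Text id taken from the example above.
xml_id='ICC.German.DeReKo.WPD17.G11.00238'

# Same search pattern and replacement, with @ used as the sed delimiter.
sigle=$(printf '%s\n' "$xml_id" |
  sed -E 's@ICC.German\.([^.]+\.[^.]+)\.(.+)@ICCGER/\1/\2@')

printf '%s\n' "$sigle"   # ICCGER/DeReKo.WPD17/G11.00238
```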

--inline-tokens <foundry>#[<file>]

Define the foundry and file (without extension) to store inline token information in. Unless --skip-inline-token-annotations is set, this will contain annotations as well. Defaults to tokens and morpho.

The inline token data will also be stored in the inline structures file (see --inline-structures), unless the inline token foundry is prefixed with an exclamation mark (!), indicating that inline tokens are stored exclusively in the inline tokens file.

Example:

tei2korapxml --inline-tokens '!gingko#morpho' < data.i5.xml > korapxml.zip
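
How the parts of such an argument relate can be sketched in plain shell. The variable names below are made up for illustration, and the snippet only takes the string apart. It does not reproduce the tool's own option parsing or its defaults:

```shell
# Argument value from the example above.
spec='!gingko#morpho'

# A leading ! marks the inline tokens file as the exclusive target.
exclusive=no
case $spec in
  '!'*) exclusive=yes; spec=${spec#'!'} ;;
esac

# The foundry is everything before the first #.
foundry=${spec%%#*}

# The file name (without extension) follows the #, if present.
case $spec in
  *'#'*) file=${spec#*#} ;;
  *)     file= ;;
esac

printf 'foundry=%s file=%s exclusive=%s\n' "$foundry" "$file" "$exclusive"
# foundry=gingko file=morpho exclusive=yes
```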

--inline-structures <foundry>#[<file>]

Define the foundry and file (without extension) to store inline structure information in. Defaults to struct and structures.

--base-foundry <foundry>

Define the base foundry to store newly generated token information in. Defaults to base.

--data-file <file>

Define the file (without extension) to store primary data information in. Defaults to data.

--header-file <file>

Define the file name (without extension) to store header information on the corpus, document, and text level in. Defaults to header.

--use-tokenizer-sentence-splits|-s

Replace existing sentence boundary information with the boundaries provided by the tokenizer, or add them if none exist. Currently, the KorAP tokenizer and certain external tokenizers support these boundaries.

--tokens-file <file>

Define the file (without extension) to store generated token information in (either from the KorAP tokenizer or an externally called tokenizer). Defaults to tokens.

--log|-l

Log level for Log::Any. Defaults to notice.

ENVIRONMENT VARIABLES

KORAPXMLTEI_DEBUG: Activate minimal debugging. Defaults to false.

COPYRIGHT AND LICENSE

Author: Peter Harders

Contributors: Nils Diewald, Marc Kupietz, Carsten Schnober

KorAP::XML::TEI is developed as part of the KorAP Corpus Analysis Platform at the Leibniz Institute for the German Language (IDS), member of the Leibniz-Gemeinschaft.

This program is free software published under the BSD-2 License.
