- GIZA++
- Python 3.x+
- PyTorch 1.1+
- NumPy
- Scikit-Learn
- MorphAGram (add a MorphAGram directory in the main directory of this repo)
We use GIZA++ to train and produce word-level alignments between the target language and a source language for which POS annotations are available. The alignments are based on a parallel corpus that is white-space tokenized.
- Create a directory alignments that contains the GIZA++ and mkcls installation directories, the run_gizapp.sh and giza-convert.py scripts, and a workspace directory to store the inputs and outputs.
- For the source language <SL> (ISO3 code), the target language <TL> (ISO3 code) and the dataset <D>, produce the following files:
  - the source-target GIZA++ input parallel file <SL>-<TL>-<D>.parallel (per line: <white_space_tokenized_source_sentence> ||| <white_space_tokenized_target_sentence>)
  - the GIZA++ input configuration file <SL>-<TL>-<D>.giza.config. Use the config file data/gizapp-sample.config, and replace 'ENG' by <SL>, 'AFR' by <TL> and 'bible' by <D>.
  - a key file of sentence IDs <SL>-<TL>-<D>.keys, one ID per line. The order of the IDs should correspond to the order of the sentences in <SL>-<TL>-<D>.parallel.
- Run the run_gizapp.sh script to train and produce the alignments from the source to the target with the three parameters <SL>, <TL> and <D>. This will create a new directory workspace/<SL>-<TL>-<D>-gfiles with the necessary GIZA++ output files.
- Run the giza-convert.py script to produce the forward alignments as follows: python giza-convert.py workspace/<SL>-<TL>-<D>-gfiles/<SL>-<TL>.alignments N > workspace/<SL>-<TL>-<D>-giza.forward
- Repeat the second and third steps while switching <SL> and <TL> in order to produce the backward alignments.
- Run the giza-convert.py script to produce the backward alignments as follows: python giza-convert.py workspace/<TL>-<SL>-<D>-gfiles/<TL>-<SL>.alignments Y > workspace/<SL>-<TL>-<D>-giza.backward
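The parallel file and the key file described in the steps above can be produced together so that their line orders stay in sync. The following is a minimal sketch, not part of the repository: the function name write_giza_inputs and the in-memory list of (sentence ID, source tokens, target tokens) tuples are assumptions made for illustration.

```python
from pathlib import Path


def write_giza_inputs(workspace, sl, tl, dataset, sentences):
    """Write <SL>-<TL>-<D>.parallel and <SL>-<TL>-<D>.keys into workspace.

    sentences is a list of (sentence_id, source_tokens, target_tokens)
    tuples. This input structure is an assumption made for the sketch.
    """
    workspace = Path(workspace)
    workspace.mkdir(parents=True, exist_ok=True)
    prefix = f"{sl}-{tl}-{dataset}"
    parallel_path = workspace / f"{prefix}.parallel"
    keys_path = workspace / f"{prefix}.keys"
    with open(parallel_path, "w", encoding="utf-8") as parallel, \
            open(keys_path, "w", encoding="utf-8") as keys:
        for sentence_id, source_tokens, target_tokens in sentences:
            # One sentence pair per line: source ||| target
            source = " ".join(source_tokens)
            target = " ".join(target_tokens)
            parallel.write(f"{source} ||| {target}\n")
            # The key file keeps the same order as the parallel file
            keys.write(f"{sentence_id}\n")
    return parallel_path, keys_path
```

Writing both files in one loop avoids any mismatch between the order of the IDs and the order of the sentence pairs.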
Use an off-the-shelf POS tagger to tag the source text. The output should be a tab-separated file with one sentence per line, where the first column has the sentence ID and the second column has the space-separated words and their POS tags as word_POS.
Example:
This_PRON is_AUX a_DET simple_ADJ example_NOUN ._PUNCT
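A line in this format can be read back as follows. This is a minimal sketch under two assumptions: the sentence ID and the tagged text are separated by a single tab, and the tag is whatever follows the last underscore of each item. The function name parse_tagged_line is hypothetical.

```python
def parse_tagged_line(line):
    """Split one tagged line into its sentence ID and (word, tag) pairs."""
    sentence_id, tagged_text = line.rstrip("\n").split("\t", 1)
    pairs = []
    for item in tagged_text.split():
        # Split on the last underscore so words containing '_' stay intact
        word, tag = item.rsplit("_", 1)
        pairs.append((word, tag))
    return sentence_id, pairs
```

Splitting on the last underscore rather than the first keeps tokens such as a_b_NOUN readable as the word a_b with the tag NOUN.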
The script projection_handler.py is responsible for projecting the tags from the source language onto the target one and coupling the token and type constraints. The script relies on the GIZA++ output along with the annotated source text.
- key_path: alignments/workspace/--.keys
- forward_alignment_path: alignments/workspace/---giza.forward
- forward_source_vocabulary_path: alignments/workspace/---gfiles/-.vcb
- forward_target_vocabulary_path: alignments/workspace/---gfiles/-.vcb
- forward_alignment_probability_path: alignments/workspace/---gfiles/g.t3.final
- backward_alignment_path: alignments/workspace/---giza.backward
- backward_source_vocabulary_path: alignments/workspace/---gfiles/-.vcb
- backward_target_vocabulary_path: alignments/workspace/---gfiles/-.vcb
- backward_alignment_probability_path: alignments/workspace/---gfiles/g.t3.final
- tagged_source_path: the path of the tagged source text
- target_data_path: the path of the target text, a tab-separated file with one sentence per line, where the first column has the sentence ID and the second column has the white-space tokenized sentence.
- pos_output_path: the path of the projected annotations
- prob_output_path: the path of the probabilities of the projected annotations
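To give an idea of what token-level projection through bidirectional alignments can look like, here is a deliberately simplified sketch. It is not the logic of projection_handler.py: it only keeps links on which the forward and backward alignments agree, it ignores type constraints and alignment probabilities, and the function name, the index-pair representation and the use of '***' for an unprojected word are all assumptions made for illustration.

```python
def project_tags(source_tags, forward_links, backward_links, target_length):
    """Project source tags onto target positions via agreed alignment links.

    forward_links:  set of (source_index, target_index) pairs
    backward_links: set of (target_index, source_index) pairs
    Both representations are assumptions made for this sketch.
    """
    # Keep only the links that both alignment directions agree on
    flipped_backward = {(s, t) for t, s in backward_links}
    agreed = forward_links & flipped_backward

    projected = ["***"] * target_length
    for source_index, target_index in sorted(agreed):
        # The first agreed link wins if a target word has several
        if projected[target_index] == "***":
            projected[target_index] = source_tags[source_index]
    return projected
```

Target words without an agreed link stay untagged, which is why projected annotations are typically partial.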
In order to apply stem-based alignment and projection, run both alignment and projection in the stem space, and then replace the tagged stems by their corresponding words using the map_word_stem.py script.
- source_data_path: the path of the source text, a tab-separated file with one sentence per line, where the first column has the sentence ID and the second column has the white-space tokenized sentence.
- tagged_stem_path: the path of the stem-based projected annotations (as produced by projection_handler.py)
- pos_output_path: the path of the output annotations in which the stems are replaced by their corresponding words
In order to apply morpheme-based alignment and projection, run both alignment and projection in the morpheme space, and then replace the tagged morphemes by their corresponding words and their tags by their representative tags using the map_word_morpheme.py script.
- source_data_path: the path of the source text
- tagged_morpheme_path: the path of the morpheme-based projected annotations (as produced by projection_handler.py)
- prob_path: the path of the probabilities of the morpheme-based projected annotations (as produced by projection_handler.py). This is not needed in the stem-based approach as the stem-based probabilities are the same as the word-based probabilities.
- stem_path: the path of the source text where the stems are marked by '+', e.g., 're+play+s' or '+make+'. This is needed in the STEM mechanism for the selection of the representative morpheme. If set to 'NA', the RANK mechanism is used instead.
- pos_output_path: the path of the output annotations in which the morphemes are replaced by their corresponding words
- prob_output_path: the path of the probabilities of the output annotations
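The stem marking used in stem_path can be decoded with a simple split. The sketch below assumes that every marked word contains exactly two '+' characters, one on each side of the stem, as in the two examples above. The function name split_marked_word is hypothetical.

```python
def split_marked_word(marked):
    """Split a '+'-marked word into (prefix, stem, suffix).

    Assumes exactly two '+' markers surrounding the stem,
    e.g. 're+play+s' or '+make+'.
    """
    parts = marked.split("+")
    if len(parts) != 3:
        raise ValueError(f"expected exactly two '+' markers in {marked!r}")
    prefix, stem, suffix = parts
    return prefix, stem, suffix
```

A word without affixes yields empty strings for the prefix and the suffix, so the stem is always the middle element.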
The script multi_source_projection.py is responsible for multi-source projection.
- target_language: the ISO3 code of the target language
- source_languages: the comma-separated ISO3 codes of the source languages
- use_bea: T = use Bayesian inference, weighted-maximum voting otherwise
- use_weights_with_bea: T = use the weights in the initialization of Bayesian inference, unweighted otherwise
- use_weights_with_argmax: T = use the weights in the argmax of Bayesian inference, unweighted otherwise
- scale_weights: T = scale the weights between 0 and 1, unscaled otherwise
- pos_path: the path of the projected annotations (as produced by projection_handler.py). It is a wildcard path in which the source language is replaced by '#SOURCE_LANG#'.
- prob_path: the path of the probabilities of the projected annotations (as produced by projection_handler.py). It is a wildcard path in which the source language is replaced by '#SOURCE_LANG#'.
- pos_output_path: the path of the output multi-source projected annotations
- prob_output_path: the path of the output probabilities of the multi-source projected annotations
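When Bayesian inference is turned off, the tags coming from the different source languages are combined by weighted-maximum voting. The following sketch shows one plausible reading of that idea for a single target word: sum the weights per tag and pick the tag with the largest total. The function name, the (tag, weight) input and the alphabetical tie-breaking are assumptions and may differ from what multi_source_projection.py does.

```python
from collections import defaultdict


def weighted_vote(candidates):
    """Pick the tag with the largest summed weight.

    candidates: list of (tag, weight) pairs, one per source language
    that projected a tag onto this word. Returns None if the list is empty.
    """
    totals = defaultdict(float)
    for tag, weight in candidates:
        totals[tag] += weight
    if not totals:
        return None
    # Iterating over sorted tags makes ties resolve alphabetically
    return max(sorted(totals), key=totals.get)
```

For example, two sources voting VERB with moderate weights can outvote a single source voting NOUN with a higher individual weight.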
The script training_data_generator.py is responsible for generating the training data given the projected annotations and their weights. It relies on the outputs produced by either projection_handler.py or multi_source_projection.py. The produced POS training file has one sentence per line, where each word is represented as word_POS, and empty tags are marked as ***.
Example:
Deur_ADP saam_*** te_PART werk_VERB ,_PUNCT kan_*** ons_PRON meer_DET bereik_VERB ._PUNCT
- pos_path: the path of the projected annotations (as produced by projection_handler.py or multi_source_projection.py)
- prob_path: the path of the probabilities of the projected annotations (as produced by projection_handler.py or multi_source_projection.py)
- training_path: the path of the output training file
The script tagger.py is responsible for training and testing the neural POS tagger in a single run. However, it should be straightforward to split the training and testing phases, if needed.
- target_language: the ISO3 code of the target language
- source_language: the ISO3 code of the source language
- data_path: the path of the training and testing data (directory)
- output_path: the path of the final tagged output (directory)
- model_path: the path of the final model (directory)
- training_data_set: the name of the training dataset
- test_data_sets: the comma-separated name(s) of the test dataset(s) (or 'NA' for no testing)
- training_size: the number of words to train on, -1 = all
- max_sentence_length: the maximum sentence length (in words) to train on, -1 = all
- min_density: the minimum percentage of tagged words relative to the number of words in a sentence to train on, -1 = all
- use_word_embeddings: whether to use randomly initialized word embeddings (recommended)
- use_length_affix_embeddings: whether to use randomly initialized prefix/suffix (of lengths 1, 2, 3 and 4) embeddings (recommended).
- use_segmentation_affix_embeddings: whether to use randomly initialized MorphAGram prefix/suffix embeddings (requires MorphAGram and a segmentation model)
- use_segmentation_complex_affix_embeddings: whether to use randomly initialized MorphAGram complex prefix/suffix embeddings (requires MorphAGram and a segmentation model)
- use_segmentation_stem_embeddings: whether to use randomly initialized MorphAGram stem embeddings (requires MorphAGram and a segmentation model)
- use_char_embeddings: whether to use character embeddings
- segmentation_grammar_output_path: the path of the segmentation output based on the segmentation grammar needed for running MorphAGram, 'NA' otherwise
- use_brown_clusters: whether to use Brown clusters (recommended)
- brown_cluster_path: the path of Brown clusters
- use_contextual_embeddings: whether to use contextual embeddings (e.g., BERT or XLM) (recommended)
- contextual_embeddings_dimensions: the vector size for contextual embeddings
- contextual_embedding_path: the path of the precomputed contextual embeddings (see the description under the Notes section)
- contextual_tokenization_path: the path of the tokenization file, a tabular file of two columns: words and space-separated tokens
- subword_combination_method: how to combine the embeddings of subwords; the values are: AVERAGE, FIRST, FIRST_LAST and LONGEST (recommended: FIRST_LAST)
- epochs: number of epochs (recommended: 12)
- learning_rate: learning rate (recommended: 0.0001)
- learning_decay_rate: learning decay rate (recommended: 0.1)
- dropout_rate: dropout rate (recommended: 0.7)
- fix_tags: whether to match the output tags with the UD annotation guidelines for the underlying language (e.g., converting PRT to ADV in TUR)
- run_postprocessing: whether to force rule-based tagging for punctuation marks, symbols and numbers in the output (recommended)
- overwrite_by_output: whether to use the tags in the test dataset(s) (e.g., when partially annotated) to overwrite the output of the system
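The four values of subword_combination_method can be pictured with the sketch below, which turns the vectors of a word's subwords into a single word vector. The exact semantics in tagger.py are not spelled out here, so the following are assumptions: AVERAGE takes the element-wise mean of all subword vectors, FIRST takes the first subword's vector, FIRST_LAST takes the element-wise mean of the first and last vectors (concatenation would be another plausible reading), and LONGEST takes the vector of the longest subword string. The function name is hypothetical and plain Python lists are used to keep the example dependency-free.

```python
def combine_subwords(subwords, vectors, method="FIRST_LAST"):
    """Combine per-subword vectors into one word vector.

    subwords: list of subword strings for one word
    vectors:  list of equally sized lists of floats, one per subword
    The meaning given to each method is an assumption of this sketch.
    """
    if not vectors or len(subwords) != len(vectors):
        raise ValueError("subwords and vectors must be non-empty and aligned")

    def mean(rows):
        # Element-wise mean over a list of vectors
        return [sum(column) / len(rows) for column in zip(*rows)]

    if method == "AVERAGE":
        return mean(vectors)
    if method == "FIRST":
        return list(vectors[0])
    if method == "FIRST_LAST":
        return mean([vectors[0], vectors[-1]])
    if method == "LONGEST":
        longest = max(range(len(subwords)), key=lambda i: len(subwords[i]))
        return list(vectors[longest])
    raise ValueError(f"unknown method: {method}")
```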
- The system assumes all the contextual embeddings are precomputed. However, it is straightforward to change this into runtime computations, if needed.
- In the embeddings file, each sentence should occupy n+2 lines. The first line contains the white-space tokenized text; the second line contains a vector of subwords or subword IDs, while the (n+2)th line contains the comma-separated vector of the nth token.
Example:
Eta haur eçagut cedin Ioppe gucian eta sinhets ceçaten anhitzec Iauna baithan .
['[CLS]', 25623, 56155, 28, 2968, 13306, 405, 3035, 28136, 7340, 2497, 69438, 522, 3811, 7831, 405, 2968, 510, 3616, 24374, 3240, 12044, 1946, 63670, 1121, 6, 5, '[SEP]']
0.6276292204856873, -0.8384165167808533, 0.6102157235145569, -0.2547730505466461, -0.45138606429100037,.....
- The training file should be named as (target_language)-(source_language)-(training_data_set).txt, e.g., EUS-ENG-TRAIN.txt.
- The test file(s) should be named as (target_language)-(source_language)-(test_data_set).txt, e.g., EUS-ENG-TEST.txt.
- The test files should have the same format as the training one.
- We use the output of the Brown-Clustering implementation here.
- The system supports the following set of languages for the postprocessing: AFR, AMH, BUL, EUS, FIN, HIN, IND, KAT, KAZ, JPN, LIT, PER, POR, TEL and TUR.
- The booleans are expressed as 'Yes', 'True', 'T', 'Y' or '1' for the TRUE value.
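The boolean convention and the file-naming convention from the notes above are easy to get wrong when scripting around the tagger, so here is a small sketch of both. The function names are hypothetical, and the comparison is case-sensitive on the assumption that only the five listed spellings count as TRUE.

```python
TRUE_VALUES = {"Yes", "True", "T", "Y", "1"}


def parse_bool(value):
    """Return True only for the spellings listed as TRUE values."""
    return value in TRUE_VALUES


def dataset_file_name(target_language, source_language, data_set):
    """Build a file name such as EUS-ENG-TRAIN.txt."""
    return f"{target_language}-{source_language}-{data_set}.txt"
```

For instance, dataset_file_name("EUS", "ENG", "TEST") gives EUS-ENG-TEST.txt, matching the test-file example above.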
The script multi_source_decoding.py is responsible for combining decoded outputs that correspond to single-source POS models.
- target_language: the ISO3 code of the target language
- source_languages: the comma-separated ISO3 codes of the source languages
- use_bea: T = use Bayesian inference, weighted-maximum voting otherwise
- use_inference_weights: T = use inference weights, alignment probabilities otherwise
- use_weights_with_bea: T = use the weights in the initialization of Bayesian inference, unweighted otherwise
- use_weights_with_argmax: T = use the weights in the argmax of Bayesian inference, unweighted otherwise
- scale_weights: T = scale the weights between 0 and 1, unscaled otherwise
- pos_path: the path of the decoded POS annotations (as produced by tagger.py - only used with inference-based weights). It is a wildcard path in which the source language is replaced by '#SOURCE_LANG#'.
- prob_path: the path of the probabilities of the decoded POS annotations (as produced by tagger.py - only used with inference-based weights). It is a wildcard path in which the source language is replaced by '#SOURCE_LANG#'.
- training_path: the path of the training files (as produced by training_data_generator.py - only used for statistics). It is a wildcard path in which the source language is replaced by '#SOURCE_LANG#'.
- forward_alignment_path: alignments/workspace/---giza.forward - only used with alignment-based weights. It is a wildcard path in which the source language is replaced by '#SOURCE_LANG#'.
- backward_alignment_path: alignments/workspace/---giza.backward - only used with alignment-based weights. It is a wildcard path in which the source language is replaced by '#SOURCE_LANG#'.
- gold_sets: the name of the gold/test dataset (for evaluation purpose)
- pos_output_file: the path of the gold file (for evaluation purpose)
- pos_output_file: the path of the output combined POS tags
- eval_output_file: the path of the output evaluation script (for evaluation purpose)
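Several of the parameters above are wildcard paths that contain '#SOURCE_LANG#'. The sketch below shows how such a pattern can be expanded into one concrete path per source language, given the comma-separated list passed as source_languages. The function name and the example path are hypothetical.

```python
PLACEHOLDER = "#SOURCE_LANG#"


def expand_wildcard(path_pattern, source_languages):
    """Map each source language code to its concrete path.

    source_languages is the comma-separated string of ISO3 codes.
    """
    codes = [code.strip() for code in source_languages.split(",") if code.strip()]
    return {code: path_pattern.replace(PLACEHOLDER, code) for code in codes}


# Hypothetical usage with an illustrative path pattern
paths = expand_wildcard("output/AFR-#SOURCE_LANG#.pos", "ENG,DEU")
```

Here paths maps ENG to output/AFR-ENG.pos and DEU to output/AFR-DEU.pos.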