conditional-random-field named-entity-recognition text-preprocessing

preprocessing-crf-ner

Description

This work assesses how text preprocessing tasks affect named entity recognition (NER) performance on Indonesian text. The assessment covers several feature dimensions and considers possible interactions among the preprocessing tasks.

Preprocessing Procedures

Contractions Expansion
Lowercase Conversion
Stemming
Number to Words Conversion
Hyphen and Comma Splitting
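
The sketch below shows one way three of these procedures could be written as composable token-level steps, so that individual tasks and their combinations can be switched on and off. It is an illustration under stated assumptions, not the code in main.py: the function names, the CONTRACTIONS map, and the splitting rule are hypothetical. Stemming and number-to-words conversion are left out because they depend on language-specific resources not described here.

```python
import re

# Hypothetical abbreviation map, for illustration only.
# The list actually used in the experiments is not shown in this README.
CONTRACTIONS = {"yg": "yang", "tdk": "tidak"}


def expand_contractions(tokens):
    """Replace known shortened forms with their full form."""
    return [CONTRACTIONS.get(tok.lower(), tok) for tok in tokens]


def lowercase(tokens):
    """Convert every token to lowercase."""
    return [tok.lower() for tok in tokens]


def split_hyphen_comma(tokens):
    """Split tokens on hyphens and commas, dropping empty pieces.

    A token made only of separators (e.g. ",") is kept unchanged.
    """
    out = []
    for tok in tokens:
        parts = [p for p in re.split(r"[-,]", tok) if p]
        out.extend(parts if parts else [tok])
    return out


def preprocess(tokens, steps):
    """Apply the selected preprocessing steps in order."""
    for step in steps:
        tokens = step(tokens)
    return tokens


example = preprocess(
    ["Anak-anak", "yg", "Pergi", ","],
    [expand_contractions, lowercase, split_hyphen_comma],
)
print(example)
```

Passing the steps as a list makes it straightforward to evaluate each task alone or in combination. Note that a step which changes the number of tokens, such as splitting, also requires the NER labels to be realigned; that is not handled in this sketch.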

Feature Extraction

The word
The length of the word or number of characters
Prefixes and suffixes of the word of varying lengths
The word in lowercase
A stemmed version of the word, produced by stripping vowels and the letters g, y, and n from the end of the word while leaving a stem of at least 2 characters
If the word is a punctuation mark
If the word is a digit
Features mentioned above for the previous word, the following word, and the words two places before and after
Word POS tag
If the word is at the beginning of the sentence (BOS) or the end of the sentence (EOS) or neither
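
The feature list above can be sketched as a per-token feature dictionary, the input format that sklearn-crfsuite accepts. This is a minimal sketch, not the repository's implementation: the function names, feature keys, prefix and suffix lengths (1 to 3), and the POS tags in the example are assumptions.

```python
import string


def simple_stem(word):
    """Strip trailing vowels and g, y, n, keeping at least 2 characters."""
    stem = word
    while len(stem) > 2 and stem[-1].lower() in "aeiougyn":
        stem = stem[:-1]
    return stem


def token_features(word, postag, prefix=""):
    """Features of a single token; `prefix` marks a neighbouring position."""
    feats = {
        prefix + "word": word,
        prefix + "len": len(word),
        prefix + "lower": word.lower(),
        prefix + "stem": simple_stem(word),
        prefix + "is_punct": bool(word)
        and all(ch in string.punctuation for ch in word),
        prefix + "is_digit": word.isdigit(),
        prefix + "postag": postag,
    }
    # Prefixes and suffixes of varying lengths (1 to 3 is an assumption).
    for n in (1, 2, 3):
        feats[f"{prefix}prefix{n}"] = word[:n]
        feats[f"{prefix}suffix{n}"] = word[-n:]
    return feats


def word2features(sent, i):
    """Features for token i of `sent`, a list of (word, postag) pairs."""
    word, postag = sent[i]
    feats = token_features(word, postag)
    feats["BOS"] = i == 0
    feats["EOS"] = i == len(sent) - 1
    # Same features for the words one and two places before and after.
    for offset in (-2, -1, 1, 2):
        j = i + offset
        if 0 <= j < len(sent):
            w, p = sent[j]
            feats.update(token_features(w, p, prefix=f"{offset:+d}:"))
    return feats


# Example sentence with illustrative POS tags.
sent = [("Joko", "NNP"), ("Widodo", "NNP"), ("tiba", "VB"), (".", "Z")]
features = [word2features(sent, i) for i in range(len(sent))]
```

A list of such dictionaries per sentence, paired with a list of label strings, is the shape expected by the fit method of sklearn_crfsuite.CRF.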

Requirements

Both Linux and Windows are supported. Linux is recommended for performance and compatibility reasons.
64-bit Python 3.7 installation.
I recommend sklearn-crfsuite 0.3.6, which I used for all experiments.
Download singgalang.tsv and store it in the data directory.
Download all_indo_man_tag_corpus_model.crf.tagger and store it in the pre-trained-model directory.

Usage

python main.py

About

Implementation of text preprocessing impact analysis on named entity recognition (NER) based on conditional random field (CRF) in Indonesian text.

GNU General Public License v3.0
