sorice / preprocess

Text normalization and shallow (stopwords, POS) and deep (NER, SYN, SRL) preprocessing techniques.

Preprocess Library

Preprocess is a Python package for preprocessing text using some NLP techniques:

  • normalization techniques (e.g. URL recognition)
  • punctuation pattern recognition
  • symbol filtering and substitution
  • shallow NLP techniques (e.g. Part of Speech tagging)
  • deep NLP techniques (e.g. Named Entity Recognition)

The Python ecosystem for text preprocessing is large and difficult to configure and use. When I started to use preprocessing for some more complex NLP tasks, it proved impossible to configure and generate standalone apps without heavy dependencies using nltk as a baseline. At the same time, every normalization step taken from a different approach/library has a different input type and arguments. For that reason I decided to wrap all those functions in a standard, unified library named preprocess, which works as follows:

>>> from preprocess import func
>>> func('text')
'text_result'

This package integrates text normalization techniques from several Python packages, such as nltk and normalizr. It also contains many ideas extracted from other normalization or text preprocessing packages.

Some regular expressions used in shallow parsing are based on observations of frequent errors in text obtained from PDF conversion.

Additionally, some functions intend to keep the original length of the text after normalization, e.g. 'state-of-the-art' becomes 'state_of_the_art', and 'doing... some' becomes 'doing    some' (there are 4 whitespaces between doing and some). The objective was to wrangle the data, not to munge it; this objective is not achieved in all cases, and some alignment examples can be read.
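
A minimal sketch of that length-preserving idea using plain regular expressions (the function names below are hypothetical, not the package's API):

import re

def underscore_multiwords(text):
    # 'state-of-the-art' -> 'state_of_the_art' (same length)
    return re.sub(r'(\w)-(\w)', r'\1_\2', text)

def blank_ellipsis(text):
    # replace runs of three or more dots with the same number of spaces
    return re.sub(r'\.{3,}', lambda m: ' ' * len(m.group()), text)

s = 'doing... some state-of-the-art NLP'
t = blank_ellipsis(underscore_multiwords(s))
assert len(t) == len(s)  # character offsets still align with the original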

Requirements

Linux

Java X is needed; check the requirements of your downloaded Stanford models to get the correct version.

(the current versions for this example are the stanford-xxx-2015-04-20 models)

$ apt install openjdk-8-jre pandoc
$ pip install nltk nose numpy

Installation

$ pip install preprocess

Generating the doc

Read carefully the #doc section in requirements.txt. The documentation renders some notebooks to Sphinx docs, so an uncommon set of Python libraries is used, and the pandoc package is needed at the OS level.

$ apt install pandoc
$ pip3 install -r requirements.txt

Basic Usage

>>> import preprocess
>>> preprocess.lowercase('Stanford parser was created by Stanford University')
'stanford parser was created by stanford university'

Basic usage includes the following functions:

  • lowercase, replace_urls, abbreviations, expand_contractions, ngrams
  • replace_symbols: based on recognition of Unicode and UTF-8 representations of Greek symbols, etc.
  • replace_punctuation: based on punctuation regular expressions
  • multipart_words: hyphenated or multi-part expressions.
  • some non-classical text manipulation operations, such as those made to ease parsing of texts obtained from PDF conversion:
    • extraspace_for_endingpoints: add an extra whitespace before the ending point of a sentence.
    • add_doc_ending_point: check whether the last sentence of a doc ends with a dot; if not, add it.
    • del_tokens_len_one: a rudimentary form of stopword removal
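
A hedged sketch of chaining a few of these functions, assuming each one follows the package's string-in/string-out convention shown above (the sample text is illustrative, and exact outputs may differ between versions):

>>> import preprocess
>>> text = 'Visit https://example.com... a state-of-the-art parser!'
>>> text = preprocess.replace_urls(text)         # normalize the URL
>>> text = preprocess.multipart_words(text)      # 'state-of-the-art' -> 'state_of_the_art'
>>> text = preprocess.replace_punctuation(text)  # normalize punctuation patterns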

Advanced usage includes the following functions:

  • pos, ner, syntdp, sngrams, remove_stopword, contextual_ngrams, stopword_ngrams

Advanced Usage

Configuration

For the preprocess package to function correctly, the supported NLP models must be configured properly. Use the following steps to get it right:

  1. get the Stanford CoreNLP software from the Internet: Stanford CoreNLP software

    1. pick the language of your preference from the Stanford models repository: Stanford CoreNLP models. (the currently tested version filename is stanford-parser-full-2015-04-20.zip)
  2. get the Stanford Named Entity Recognizer

  3. get the Stanford POS Tagger

  4. unzip all the parsers into your /path/to/stanford/jars/

  5. extract the lexical models inside /path/to/stanford/jars/stanford-parser-full-2015-04-20/stanford-parser-3.5.2-models.jar

    (note: the Named Entity Recognizer keeps its models in the classifiers folder, and the POS Tagger keeps a set of models in the models folder)

$ cd /path/to/stanford/jars/
$ ls -l
stanford-ner-2015-04-20
stanford-parser-full-2015-04-20
stanford-postagger-2015-04-20
$ cd stanford-parser-full-2015-04-20
$ ls -l
bin
conf
data
models <- your extracted models
...
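
As a quick sanity check of this layout, here is a minimal sketch of calling the unpacked POS tagger directly through nltk, which preprocess builds on (the concrete jar and model filenames below are assumptions based on the listing above; adjust them to your downloaded versions):

from nltk.tag import StanfordPOSTagger

JARS = '/path/to/stanford/jars'
tagger = StanfordPOSTagger(
    # assumed model and jar paths inside the unpacked POS tagger folder
    model_filename=JARS + '/stanford-postagger-2015-04-20/models/english-bidirectional-distsim.tagger',
    path_to_jar=JARS + '/stanford-postagger-2015-04-20/stanford-postagger.jar',
)
tagger.tag('What is the airspeed of an unladen swallow ?'.split())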

Usage

>>> import preprocess
>>> preprocess.pos('What is the airspeed of an unladen swallow ?')
[('What', 'WP'),
 ('is', 'VBZ'),
 ('the', 'DT'),
 ('airspeed', 'NN'),
 ('of', 'IN'),
 ('an', 'DT'),
 ('unladen', 'JJ'),
 ('swallow', 'VB'),
 ('?', '.')]

Future

Future changes to this library are based on its initial objectives:

  1. A pure Python library: as of this version, the dependency on the Stanford models, which are developed in Java, makes this milestone impossible.

    • The future: replace them with spacy, pntl, or another self-contained deep learning tagging implementation trained on a free, professional collection of texts.
  2. Optimization, to get better processing times on big collections of data: many functions are in pure Python.

    • The future: implement the pure Python functions in Cython or Rust.
  3. Standard input: many libraries, ideas, or pieces of code reused in this library receive their inputs in different ways (numerical vectors, strings, sets of words, etc.); the objective is to pass a simple string or a well-known object type like TfIdfModel.

    • The future: check whether the input is a string instance and, with a decorator, convert any other object type to string (see the sketch after this list).
  4. Low weight: to have the fewest possible dependencies for academic or commercial deployment, and the least possible complexity.

    • The future: avoid the nltk dependency, or any other, by reusing the necessary code and adapting it to the preprocess architecture.
  5. Integration: add every preprocessing technique mentioned in the SEMEVAL or CLEF papers on reused text detection or semantic text similarity, and in other fundamental papers in this area.

    • Citation: add a complete set of citations for all techniques, and link each one to its corresponding function in the library.
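
For illustration, a minimal sketch of the decorator idea from objective 3, assuming simple coercion rules (all names here are hypothetical and not part of the current API):

from functools import wraps

def ensure_string(func):
    # hypothetical decorator: coerce common input types to a plain string
    @wraps(func)
    def wrapper(text, *args, **kwargs):
        if isinstance(text, bytes):                 # raw bytes -> decode
            text = text.decode('utf8')
        elif isinstance(text, (list, tuple, set)):  # token collections -> join
            text = ' '.join(map(str, text))
        elif not isinstance(text, str):             # anything else -> str()
            text = str(text)
        return func(text, *args, **kwargs)
    return wrapper

@ensure_string
def lowercase(text):
    return text.lower()

lowercase(['Stanford', 'Parser'])  # 'stanford parser'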

About

License: BSD 3-Clause "New" or "Revised" License

