wink-nlp-utils

Easily tokenize, stem, phonetize, remove stop words, manage elisions, create ngrams, bag of words and more

Prepare raw text for Natural Language Processing (NLP) using wink-nlp-utils. It is part of wink, a growing family of high-quality packages for Statistical Analysis, Natural Language Processing and Machine Learning in NodeJS.

It offers a set of APIs that work on strings (such as names, sentences and paragraphs) and on tokens represented as an array of strings/words. They perform the pre-processing required for many simple ML tasks such as semantic search and classification.

Installation

Use npm to install:

npm install wink-nlp-utils --save

Usage

// Load wink-nlp-utils
var nlp = require( 'wink-nlp-utils' );

// Use a string Function - Input argument is a string
// Extract person's name from a string:
var name = nlp.string.extractPersonsName( 'Dr. Sarah Connor M. Tech., PhD. - AI' );
// name -> 'Sarah Connor'

// Compose all possible sentences from a string:
var str = '[I] [am having|have] [a] [problem|question]';
console.log( nlp.string.composeCorpus( str ) );
// -> [ 'I am having a problem',
// ->   'I am having a question',
// ->   'I have a problem',
// ->   'I have a question' ]

// Use a tokens Function - Input argument is an array of tokens
// Remove stop words:
var t = nlp.tokens.removeWords( [ 'mary', 'had', 'a', 'little', 'lamb' ] );
// t -> [ 'mary', 'little', 'lamb' ]

APIs

string

lowerCase( s )

Converts the input string s to lower case.

upperCase( s )

Converts the input string s to upper case.

trim( s )

Trims leading and trailing spaces from the input string s.

removeExtraSpaces( s )

Removes leading & trailing white spaces along with any extra spaces appearing in between from the input string s.

retainAlphaNums( s )

Retains only alpha-numerals and spaces and removes all other characters, including leading/trailing/extra spaces from the input string s.

extractPersonsName( s )

Attempts to extract a person's name from the input string s in formats like Dr. Eugine Cyron B. Tech., M. Tech., PhD. - Electrical by dropping the titles and degrees. It assumes the following name format:

[<salutations>] <name part in FN, MN, LN> [<degrees>]

extractRunOfCapitalWords( s )

Returns an array of words appearing as Title Case or in ALL CAPS in the input string s.

removePunctuations( s )

Replaces each punctuation mark in the input string s with a space. It looks for .,;!?:"!'... - () [] {} and replaces each occurrence with a space. Use removeExtraSpaces( s ) afterwards to remove the resulting extra spaces.
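Because each punctuation mark becomes a space, the output usually contains runs of spaces. The following is a minimal sketch of that two-step clean-up in plain JavaScript. The functions replacePunctuationSketch and collapseSpacesSketch are hypothetical stand-ins written for illustration, not the library's code, and the character class only approximates the list above.

```javascript
// Hypothetical stand-in for removePunctuations(): each punctuation mark
// becomes one space (the character class approximates the documented list).
function replacePunctuationSketch( s ) {
  return s.replace( /[.,;!?:"'\-()\[\]{}]/g, ' ' );
}

// Hypothetical stand-in for removeExtraSpaces(): trim, then collapse
// every run of white space into a single space.
function collapseSpacesSketch( s ) {
  return s.trim().replace( /\s+/g, ' ' );
}

var raw = 'Hello, world! (Is this OK?)';
var spaced = replacePunctuationSketch( raw );
var cleaned = collapseSpacesSketch( spaced );

console.log( JSON.stringify( spaced ) ); // same length as raw, with gaps
console.log( cleaned );                  // 'Hello world Is this OK'
```

Keeping the two steps separate means the string length is unchanged after the first step, which can help if character offsets matter before the spaces are collapsed.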

removeSplChars( s )

Replaces each special character such as ~@#%^*+= in the input string s with a space. The resulting extra spaces can be removed using removeExtraSpaces( s ).

removeHTMLTags( s )

Replaces HTML tags and escape sequences in the input string s with a space character. The resulting extra spaces can be removed using removeExtraSpaces( s ).

removeElisions( s )

Removes basic elisions found in the input string s. For example, I'll becomes I and Isn't becomes Is. An apostrophe found in the string s remains as is.

splitElisions( s )

Splits elisions from the input string s by inserting a space. Elisions like we're or I'm are split as we re or I m.

amplifyNotElision( s )

Amplifies the not elision by replacing it with the word not in the input string s. It must be called before removeElisions(), which would otherwise drop the negation. Can't, Isn't and Haven't are amplified as Ca not, Is not and Have not.
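The ordering requirement can be seen with a small sketch. Both functions below are hypothetical stand-ins written in plain JavaScript for illustration, and the list of elisions handled is an assumption, not the library's actual list.

```javascript
// Hypothetical stand-in for amplifyNotElision(): n't becomes ' not'.
function amplifyNotSketch( s ) {
  return s.replace( /n't\b/gi, ' not' );
}

// Hypothetical stand-in for removeElisions(): drops a few basic elisions.
// The alternatives listed here are illustrative, not exhaustive.
function removeElisionsSketch( s ) {
  return s.replace( /n't\b|'(ll|re|ve|m|d)\b/gi, '' );
}

var text = "Isn't it odd that I'll wait";

// Amplify first: the negation survives the elision removal.
var amplifiedFirst = removeElisionsSketch( amplifyNotSketch( text ) );
// Remove only: the negation is silently lost.
var removedOnly = removeElisionsSketch( text );

console.log( amplifiedFirst ); // 'Is not it odd that I wait'
console.log( removedOnly );    // 'Is it odd that I wait'
```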

marker( s )

Generates a marker for the input string s: its 1-grams, sorted and joined back into a string. It is a quick and aggressive way to detect similarity in short strings. This aggressiveness leads to more false positives, such as Meter and Metre or no melon and no lemon.
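The false positives mentioned above follow directly from sorting the characters. Here is a minimal sketch in plain JavaScript; markerSketch is a hypothetical helper, and both the lower-casing and the dropping of duplicate characters are assumptions made for this illustration rather than documented behaviour.

```javascript
// Hypothetical sketch of a marker: the distinct characters of the string,
// sorted and joined. Lower-casing and de-duplication are assumptions.
function markerSketch( s ) {
  var chars = Array.from( new Set( s.toLowerCase() ) );
  return chars.sort().join( '' );
}

// Different strings collapse to the same marker once order is discarded.
console.log( markerSketch( 'Meter' ) === markerSketch( 'Metre' ) );       // true
console.log( markerSketch( 'no melon' ) === markerSketch( 'no lemon' ) ); // true
```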

soc( s, ifn, idx )

Creates a set of characters (soc) from the input string s. This is useful in even more aggressive string matching using Jaccard or Tversky Indexes as compared to marker().
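To show how a set of characters feeds a similarity measure, the sketch below computes a Jaccard index (size of the intersection divided by size of the union) over two character sets. It is plain JavaScript written for illustration; socSketch and jaccardSketch are hypothetical names, and the ifn and idx arguments are left out.

```javascript
// Hypothetical stand-in for soc(): the set of characters in the string.
function socSketch( s ) {
  return new Set( s );
}

// Jaccard index: |A intersect B| / |A union B|.
// Two empty sets are treated as identical (index 1) in this sketch.
function jaccardSketch( a, b ) {
  var intersection = 0;
  a.forEach( function ( x ) {
    if ( b.has( x ) ) intersection += 1;
  } );
  var union = a.size + b.size - intersection;
  return ( union === 0 ) ? 1 : intersection / union;
}

// 'night' and 'nacht' share n, h, t out of 7 distinct characters.
console.log( jaccardSketch( socSketch( 'night' ), socSketch( 'nacht' ) ) );
// Anagram-like strings are indistinguishable at this level.
console.log( jaccardSketch( socSketch( 'meter' ), socSketch( 'metre' ) ) ); // 1
```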

ngram( s, size )

Generates ngrams of the given size from the input string s and returns them as an array. The default value of size is 2; if 0 is passed as size, ngrams of size 2 are returned.

bong( s, size, ifn, idx )

Generates a bag of ngrams of the given size from the input string s. The default value of size is 2. This function returns an object containing each ngram (key) and its frequency of occurrence (value). While ngram() preserves the sequence but carries no frequency information, bong() captures the frequency of each ngram but carries no sequence information. Input arguments ifn and idx are optional. For special cases where an index is required, please refer to the helper function index().
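The difference between a sequence of ngrams and a bag of ngrams can be sketched as follows. This is plain JavaScript written for illustration, assuming character-level ngrams; ngramSketch and bongSketch are hypothetical names, and the optional ifn and idx arguments are omitted.

```javascript
// Hypothetical stand-in for ngram(): character ngrams in order of appearance.
// A size of 0 (or no size) falls back to 2, as described above.
function ngramSketch( s, size ) {
  var n = size || 2;
  var grams = [];
  for ( var i = 0; i + n <= s.length; i += 1 ) {
    grams.push( s.slice( i, i + n ) );
  }
  return grams;
}

// Hypothetical stand-in for bong(): ngram -> frequency, order discarded.
function bongSketch( s, size ) {
  var bag = Object.create( null );
  ngramSketch( s, size ).forEach( function ( g ) {
    bag[ g ] = ( bag[ g ] || 0 ) + 1;
  } );
  return bag;
}

console.log( ngramSketch( 'mama' ) );                  // [ 'ma', 'am', 'ma' ]
console.log( JSON.stringify( bongSketch( 'mama' ) ) ); // {"ma":2,"am":1}
```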

song( s, size, ifn, idx )

Generates a set of ngrams of the given size from the input string s. The default value of size is 2. While ngram() preserves the sequence and bong() captures the frequency of each ngram, song() keeps only the distinct ngrams and carries neither sequence nor frequency information. Input arguments ifn and idx are optional. For special cases where an index is required, please refer to the helper function index().

stem( s )

The input string s is stemmed using the Porter2 English Stemming Algorithm.

sentences( s, splChar )

Splits the text contained in the input string s into sentences, returned as an array. Punctuation marks found at the end of a sentence are retained. The function can also handle sentences beginning with numbers, though that is not good English practice. It uses ~ as the splChar for splitting, so ~ must not be present in the input string; otherwise, pass another splChar as the second argument.

composeCorpus( s )

Generates all possible sentences from the input string s. The string s must follow a special syntax, as illustrated in the example below:

'[I] [am having|have] [a] [problem|question]'

Each phrase must be enclosed in [] and the alternative options for a phrase (if any) must be separated by a | character. The corpus is composed by computing the cartesian product of all the phrases, as shown in the Usage section. It returns an array of sentences (i.e. strings).
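The cartesian product described above can be sketched in a few lines of plain JavaScript. The function composeCorpusSketch is a hypothetical illustration of the idea, not the library's implementation, and it assumes well-formed, non-nested brackets.

```javascript
// Hypothetical sketch: expand '[a|b] [c]' style templates into sentences
// by taking the cartesian product of the options in each bracket.
function composeCorpusSketch( s ) {
  // Collect every bracketed group, e.g. '[am having|have]'.
  var groups = s.match( /\[[^\]]*\]/g ) || [];
  if ( groups.length === 0 ) return [];
  // Strip the brackets and split each group into its options.
  var options = groups.map( function ( g ) {
    return g.slice( 1, -1 ).split( '|' );
  } );
  // Extend every partial sentence with every option of the next group.
  return options.reduce( function ( sentences, opts ) {
    var next = [];
    sentences.forEach( function ( prefix ) {
      opts.forEach( function ( o ) {
        next.push( prefix ? prefix + ' ' + o : o );
      } );
    } );
    return next;
  }, [ '' ] );
}

console.log( composeCorpusSketch( '[I] [am having|have] [a] [problem|question]' ) );
// -> [ 'I am having a problem',
// ->   'I am having a question',
// ->   'I have a problem',
// ->   'I have a question' ]
```

The number of sentences is the product of the option counts per bracket (here 1 x 2 x 1 x 2 = 4), so templates with many alternatives grow quickly.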

tokenize0( s )

Tokenizes by splitting the input string s on non-words. This means tokens consist only of alphas, numerals and underscores; all other characters are stripped, as they are treated as separators. However, negations are retained and amplified, while all other elisions are removed. tokenize0 is useful when the text strings are clean and do not require pre-processing such as removing punctuation and extra spaces or handling elisions.

tokenize( s )

The function follows the set of rules given below to remove or preserve punctuation and special characters in the input string s:

First, single quotes are processed, as they may be part of elisions in the string; and … are converted to ellipses.
Not elisions are amplified, and the string is then split on elisions, so words with elisions get tokenized.
The word cannot is split into can not.
The punctuation marks . , - commonly embedded in numbers are left intact; all other punctuation marks are tokenized.
Currency symbols are padded with spaces, i.e. they become separate tokens.
An underscore (_) embedded in a word is preserved.
Special characters are preserved and may or may not become separate tokens.
Finally, extra/leading/trailing spaces are removed and the string is split on spaces to tokenize.

phonetize( s )

Phonetizes the input string s using an algorithmic adaptation of Metaphone.

tokens

Tokens are created by splitting a string into words, keywords and symbols. These tokens are used as input to various activities during text analysis.

stem( t )

Each element of the input array of tokens t is stemmed using the Porter2 English Stemming Algorithm. Not to be confused with stem() under string, which performs stemming on an input string s, whereas this function requires a token array t as input.

bow( t, logCounts )

Creates a Bag of Words from the input array of tokens t. Specifying the logCounts parameter flags the use of log2( word counts ) instead of the counts directly. The idea behind using log2 is to ensure that a word's importance does not increase linearly with its count. The resulting bag is required as input for computing similarity using bow.cosine().
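The effect of log counts, and how a bag of words feeds cosine similarity, can be sketched as follows. This is plain JavaScript written for illustration; bowSketch and cosineSketch are hypothetical names, and the 1 + count offset inside log2 is an assumption chosen so that a single occurrence maps to 1 (the text above only says log2 of the word counts).

```javascript
// Hypothetical stand-in for bow(): token -> count, optionally damped by log2.
// The '1 +' offset is an assumption made for this sketch.
function bowSketch( tokens, logCounts ) {
  var bag = Object.create( null );
  tokens.forEach( function ( t ) {
    bag[ t ] = ( bag[ t ] || 0 ) + 1;
  } );
  if ( logCounts ) {
    Object.keys( bag ).forEach( function ( k ) {
      bag[ k ] = Math.log2( 1 + bag[ k ] );
    } );
  }
  return bag;
}

// Cosine similarity between two bags treated as sparse vectors.
function cosineSketch( a, b ) {
  var dot = 0;
  var normA = 0;
  var normB = 0;
  Object.keys( a ).forEach( function ( k ) {
    normA += a[ k ] * a[ k ];
    if ( b[ k ] !== undefined ) dot += a[ k ] * b[ k ];
  } );
  Object.keys( b ).forEach( function ( k ) {
    normB += b[ k ] * b[ k ];
  } );
  if ( normA === 0 || normB === 0 ) return 0;
  return dot / Math.sqrt( normA * normB );
}

var counts = bowSketch( [ 'rain', 'rain', 'go', 'away' ] );
var damped = bowSketch( [ 'rain', 'rain', 'go', 'away' ], true );
console.log( counts.rain, damped.rain ); // 2 versus log2( 3 )
console.log( cosineSketch( counts, bowSketch( [ 'rain', 'go' ] ) ) );
```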

sow( t, ifn, idx )

Creates a Set of tokens from the input array t. It is required as an input for computing similarity using Jaccard or Tversky Indexes. Input arguments ifn and idx are optional; please refer to the helper function index().

phonetize( t )

Each element of the input array of tokens t is phonetized using an algorithmic adaptation of Metaphone. This is not to be confused with phonetize( s ), which phonetizes a single string.

set( t )

Creates a Set of tokens from the input array t. It is required as an input for computing similarity using Jaccard or Tversky Indexes. This is not to be confused with set( s ) of string sets for computing similarity.

removeWords( t, givenStopWords )

Removes the givenStopWords from the input array of tokens t. If the givenStopWords parameter is not specified, the default stop words are used. The list of default stop words is loaded from stop_words.json located under the lib/dictionaries/ directory.

The givenStopWords are constructed using nlp.helper.words() as outlined below:

words( w, givenMappers )

Creates stop words for removeWords() from an array of words (i.e. strings) and serially applies all the mapper functions supplied in the optional givenMappers array. A mapper should take a string as input and return a transformed string. A typical example of a mapper is nlp.string.phonetize().

propagateNegation( t, upto )

Looks for negation tokens in the input array of tokens t and propagates the negation to the next upto tokens by prefixing each of them with a !. This is useful for handling text containing negations during similarity detection or classification.
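A minimal sketch of the propagation in plain JavaScript is given below. The function propagateNegationSketch and the NEGATION_TOKENS list are hypothetical, written for illustration; the library's own list of negation tokens and any rules for stopping early are not reproduced here.

```javascript
// Hypothetical list of negation tokens, for illustration only.
var NEGATION_TOKENS = new Set( [ 'not', 'no', 'never' ] );

// Prefix up to 'upto' tokens that follow a negation token with '!'.
function propagateNegationSketch( tokens, upto ) {
  var remaining = 0;
  return tokens.map( function ( t ) {
    if ( NEGATION_TOKENS.has( t ) ) {
      remaining = upto; // restart the window at every negation
      return t;
    }
    if ( remaining > 0 ) {
      remaining -= 1;
      return '!' + t;
    }
    return t;
  } );
}

console.log( propagateNegationSketch( [ 'mary', 'is', 'not', 'feeling', 'good', 'today' ], 2 ) );
// -> [ 'mary', 'is', 'not', '!feeling', '!good', 'today' ]
```

With the prefix in place, good and !good become distinct tokens, so a bag of words built afterwards no longer treats "good" and "not good" as the same evidence.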

bigrams( t )

Generates bigrams from the input array of tokens t.
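As an illustration of what token bigrams are, the plain JavaScript sketch below pairs each token with its successor. The function bigramsSketch is hypothetical, and representing each bigram as a two-element array is an assumption about the output shape.

```javascript
// Hypothetical sketch: every pair of adjacent tokens, in order.
// Returning two-element arrays is an assumption made for this example.
function bigramsSketch( tokens ) {
  var pairs = [];
  for ( var i = 0; i + 1 < tokens.length; i += 1 ) {
    pairs.push( [ tokens[ i ], tokens[ i + 1 ] ] );
  }
  return pairs;
}

console.log( bigramsSketch( [ 'mary', 'had', 'a', 'lamb' ] ) );
// -> [ [ 'mary', 'had' ], [ 'had', 'a' ], [ 'a', 'lamb' ] ]
```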

helper

The helper name space contains functions that return function(s). These returned functions can be used to generate input arguments for the calling function.

words( w, givenMappers )

Returns an object containing the following functions: (a) set(), which returns a set of the words given in the input array w, and (b) exclude(), which is suitable for filtering operations.

If the second argument givenMappers is passed as an array of mapper functions, these are applied to the input array before it is converted into a set. Typical examples of mapper functions are nlp.string.stem() and nlp.string.phonetize().

It has an alias returnWordsFilter().
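The shape of the returned object can be sketched in plain JavaScript as follows. The function wordsSketch is a hypothetical illustration; in particular, the assumption that exclude() takes one token and returns true when the token is not in the set is inferred from "suitable for filtering operations" and is not confirmed by the text above.

```javascript
// Hypothetical sketch of a words helper: apply the mappers serially to
// each word, build a set, and expose set() and exclude().
function wordsSketch( w, givenMappers ) {
  var mappers = givenMappers || [];
  var mapped = w.map( function ( word ) {
    return mappers.reduce( function ( acc, fn ) {
      return fn( acc );
    }, word );
  } );
  var wordSet = new Set( mapped );
  return {
    set: function () {
      return wordSet;
    },
    // Assumed signature: true when the token should be kept.
    exclude: function ( token ) {
      return !wordSet.has( token );
    }
  };
}

// A lower-casing mapper stands in for stem() or phonetize().
var toLower = function ( s ) { return s.toLowerCase(); };
var stopWords = wordsSketch( [ 'A', 'Had' ], [ toLower ] );
var kept = [ 'mary', 'had', 'a', 'little', 'lamb' ].filter( stopWords.exclude );

console.log( kept ); // [ 'mary', 'little', 'lamb' ]
```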

index()

Builds an index and returns 2 functions as follows:

(a) build() is useful with bag & set creation functions where, by passing the build function, an index of each key/member can be built. (b) result() can be probed anytime to access the output of build().

Probing result() returns the ifn and idx values for the calling function, as in soc(), song(), bong(), bow() and sow(). Note: usage of ifn is limited only by the developer's imagination!

It has an alias returnIndexer().

returnQuotedTextExtractor( leftQuote, rightQuote )

Returns a function that extracts every occurrence of quoted text between the leftQuote and the rightQuote characters from its argument s, a string. Both quote characters default to the double quote ("). Usage is illustrated below:

var extractQuotedText = nlp.helper.returnQuotedTextExtractor();
console.log( extractQuotedText( 'Raise 2 issues - "fix a bug" & "run tests"' ) );
// -> [ 'fix a bug', 'run tests' ]
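The "function that returns a function" pattern used here can be sketched in plain JavaScript. The names quotedTextExtractorSketch and escapeForRegex are hypothetical; defaulting the right quote to the left quote and returning an empty array when nothing matches are assumptions made for this illustration.

```javascript
// Escape characters that are special inside a regular expression.
function escapeForRegex( ch ) {
  return ch.replace( /[.*+?^${}()|[\]\\]/g, '\\$&' );
}

// Hypothetical sketch: build an extractor bound to a pair of quote characters.
function quotedTextExtractorSketch( leftQuote, rightQuote ) {
  var lq = leftQuote || '"';
  var rq = rightQuote || lq; // assumption: right quote defaults to the left one
  var pattern = escapeForRegex( lq ) + '(.*?)' + escapeForRegex( rq );
  return function ( s ) {
    // A fresh regex per call avoids shared lastIndex state.
    var rgx = new RegExp( pattern, 'g' );
    var found = [];
    var match = rgx.exec( s );
    while ( match !== null ) {
      found.push( match[ 1 ] );
      match = rgx.exec( s );
    }
    return found;
  };
}

var extractSketch = quotedTextExtractorSketch();
console.log( extractSketch( 'Raise 2 issues - "fix a bug" & "run tests"' ) );
// -> [ 'fix a bug', 'run tests' ]

var extractBracketed = quotedTextExtractorSketch( '[', ']' );
console.log( extractBracketed( 'see [first] and [second]' ) );
// -> [ 'first', 'second' ]
```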

Need Help?

If you spot a bug and the same has not yet been reported, raise a new issue or consider fixing it and sending a pull request.

Copyright & License

It is licensed under the terms of the GNU Affero General Public License as published by the Free Software Foundation, version 3 of the License.
