KT12/tag-lemmatize

penn-treebank wordnet tagger wordnet-tags speech-tagger nltk python lemmatizer pos-tag nlp

Summary

tag-lemmatize is a small bolt-on utility function to be used in concert with the nltk package. The function accepts un-tokenized strings. The original intent was to write a small function which would ease the use of the VADER sentiment analysis tool.

The function uses nltk.tokenize.word_tokenizeto tokenize the string. It then tags parts-of-speech (POS) taking into account context using nltk.pos_tag, which assigns a Penn Treebank POS tag. The function then converts the Penn Treebank tag into the appropriate WordNet POS tag. Finally, it lemmatizes each word using nltk.stem.WordNetLemmatizer.

Installation

Clone and add to path.

import to the Python interpreter.

tag_and_lem is the primary function.

Motivation

The nltk pre-trained part-of-speech tagger uses Penn Treebank tags which must be converted to Wordnet tags in order to use nltk's lemmatizer. This small utility should make it easier to test of Natural Language Processing techniques without training a tagger which uses Wordnet tags.

Requirements

Python 2.6+ nltk

Contributors

@KT12

If this small function was useful, please star/follow me!

About

nltk utility which more accurately lemmatizes text using pre-trained part-of-speech tagger.

penn-treebank wordnet tagger wordnet-tags speech-tagger nltk python lemmatizer pos-tag nlp

MIT License

Languages

Language:Python 100.0%