Jiannan28 / stemtokstop

Stemmer + tokenizer + stopper; uses NLTK to make life easier



stemtokstop = stemmer + tokenizer + stop-word remover. It's nothing special, just a simple application of NLTK wrapped in Flask.
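The whole pipeline fits in a few lines of NLTK. A minimal sketch of the idea (the function and variable names here are illustrative, not the actual API of stemtokstop.py):

from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize

def stem_tok_stop(text, language="english"):
    # Tokenize, drop stop words, then stem what remains.
    stops = set(stopwords.words(language))
    stemmer = SnowballStemmer(language)
    tokens = word_tokenize(text)
    return [stemmer.stem(t) for t in tokens if t.lower() not in stops]

print(stem_tok_stop("The quick brown foxes are jumping over the lazy dogs"))
# -> ['quick', 'brown', 'fox', 'jump', 'lazi', 'dog']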

INSTALL

$ pip install nltk
$ pip install snowballstemmer  # 1.2.0 supports Turkish
$ python
>>> import nltk
>>> nltk.download() # And download all
>>> ^D
$ python stemtokstop.py
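stemtokstop.py essentially wraps a pipeline like the sketch above in a small Flask app. Roughly like this, reusing the stem_tok_stop sketch from earlier (the route name is an assumption; check stemtokstop.py for the real one):

from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/stemtokstop", methods=["POST"])  # hypothetical route
def handle():
    # Read raw UTF-8 text from the request body, return the processed tokens.
    text = request.get_data(as_text=True)
    return jsonify(tokens=stem_tok_stop(text))

if __name__ == "__main__":
    app.run()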

For Japanese tokenization, I chose Masato Hagiwara's [TinySegmenter](https://code.google.com/p/mhagiwara/source/browse/trunk/nltk/jpbook/tinysegmenter.py). It is BSD-licensed, so I keep a copy here.

If you would like more precise results for Japanese, install [MeCab](http://mecab.googlecode.com/svn/trunk/mecab/doc/index.html), built as UTF-8 only. stemtokstop will then use MeCab in place of TinySegmenter.
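The switch is a plain import-and-fall-back. A sketch, assuming the MeCab Python bindings and the bundled tinysegmenter.py:

try:
    import MeCab
    _tagger = MeCab.Tagger("-Owakati")  # wakati mode: space-separated tokens
    def tokenize_ja(text):
        return _tagger.parse(text).split()
except ImportError:
    import tinysegmenter
    _segmenter = tinysegmenter.TinySegmenter()
    def tokenize_ja(text):
        return _segmenter.tokenize(text)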

TEST

Run stemtokstop.py in one terminal, and run test.py in another:

$ python test.py

You should see Sent: and Recv: lines. Use your linguistic knowledge to judge whether the results are satisfactory.
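Under the hood, test.py is just a client posting sample sentences and echoing the reply. The pattern looks roughly like this (the URL and port are assumptions; see test.py for the real values):

import urllib.request

def send(text, url="http://127.0.0.1:5000/stemtokstop"):  # hypothetical URL
    print("Sent:", text)
    req = urllib.request.Request(url, data=text.encode("utf-8"))
    with urllib.request.urlopen(req) as resp:
        print("Recv:", resp.read().decode("utf-8"))

send("The quick brown foxes are jumping over the lazy dogs")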

WHAT IF I DON'T LIKE IT?

Open an issue, or better, submit a pull request.

I'm not satisfied with the current output, because I'd like noun stems rather than the truncated stems a stemmer produces: for example, europe, not europ, for Europe. It should be possible to swap in another stemmer.
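To illustrate the difference, NLTK's WordNet lemmatizer already keeps the dictionary form in this case, so it is one candidate replacement (a sketch, not what stemtokstop currently does):

from nltk.stem import WordNetLemmatizer
from nltk.stem.snowball import SnowballStemmer

print(SnowballStemmer("english").stem("europe"))         # -> europ
print(WordNetLemmatizer().lemmatize("europe", pos="n"))  # -> europe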

As for Japanese stop words, I enumerate them with a heuristic. Japanese, like Chinese, needs part-of-speech tagging to get high accuracy; arbitrarily removing stop words (as the implementation here does) throws away meaningful words.
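For reference, with MeCab installed a POS-aware filter would keep content words instead of deleting a fixed stop list. A sketch; the POS categories kept here are my assumption:

import MeCab

CONTENT_POS = {"名詞", "動詞", "形容詞"}  # noun, verb, adjective

def content_words_ja(text):
    # Walk MeCab's node list; keep tokens whose first POS field is a content category.
    node = MeCab.Tagger().parseToNode(text)
    words = []
    while node:
        if node.feature.split(",")[0] in CONTENT_POS:
            words.append(node.surface)
        node = node.next
    return words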

LICENSE

Apache License 2.0. Please refer to LICENSE.
