christophsk / fast-rake

A very efficient implementation of the Rapid Automatic Keyword Extraction (RAKE) algorithm

fast-rake

The fast-rake package is an optimized implementation of the RAKE algorithm for unsupervised keyword extraction. It is built to process large collections of text efficiently and without interruption. The performance gains come from optimized regular expressions built from stopword lists and from a few Python-specific optimizations.

The Rapid Automatic Keyword Extraction (RAKE) algorithm is described in "Automatic Keyword Extraction from Individual Documents", Rose, S., et al. (2010).

Features

  • Use of optimized regular expressions for splitting sentences into candidate keywords. Included are optimized stopword lists from gensim, google, nltk, scikit-learn, and SMART.

  • Allows for custom stopword lists to augment the built-in stopword lists.

  • The RAKE implementation is easy to subclass. Two (sub)classes are available showing how different sentence and word tokenizers can be incorporated:

    • RakePunkt uses the punkt sentence tokenizer from nltk for improved sentence splitting. All the required nltk_data are included and installed.
    • RakeNLTK subclasses RakePunkt and adds the word tokenizer TreebankWordTokenizer from nltk.
  • Python-specific optimizations to speed each step of the algorithm.

  • Safe for multiprocessing (see examples/bbc_mp.py).
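
The stopword-based splitting described in the first two features can be sketched as follows. This is a minimal illustration under stated assumptions, not fast-rake's actual code: the tiny stopword set and the helper names `build_stopword_pattern` and `candidate_phrases` are hypothetical, and the package's own patterns are presumably more heavily optimized. The idea shown is that a single compiled alternation of stopwords (built-in plus custom) can split each sentence into candidate keywords.

```python
import re

# Hypothetical, tiny stopword list for illustration only;
# it is not one of the package's built-in lists.
STOPWORDS = {"of", "the", "a", "and", "are", "over", "for", "all"}


def build_stopword_pattern(stopwords, custom_stopwords=None):
    """Compile one alternation regex that matches any stopword as a whole word."""
    words = set(stopwords) | set(custom_stopwords or ())
    # Longest entries first so they win over shorter entries sharing a prefix.
    alternation = "|".join(re.escape(w) for w in sorted(words, key=len, reverse=True))
    return re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)


def candidate_phrases(sentence, pattern):
    """Split a sentence on punctuation and stopwords, keeping the non-empty pieces."""
    pieces = []
    # Punctuation also ends a candidate phrase.
    for chunk in re.split(r"[,.;:!?()]", sentence):
        for piece in pattern.split(chunk):
            piece = piece.strip()
            if piece:
                pieces.append(piece)
    return pieces


# "considered" is added as a custom stopword to augment the base list.
pattern = build_stopword_pattern(STOPWORDS, custom_stopwords=["considered"])
print(candidate_phrases(
    "Criteria of compatibility of a system of linear Diophantine equations, "
    "strict inequations, and nonstrict inequations are considered.",
    pattern,
))
```

Compiling the pattern once and reusing it for every sentence is what makes this approach cheap on large corpora; the per-sentence work is a single `split` call.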

Test & Install

To install the package as a module, run the following from the repository root:

pip install .

If pytest is installed, tests can be run via:

python -m pytest -v

Examples

The following example is from Rose, et al.:

Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered. Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating sets of solutions for all types of systems are given. These criteria and the corresponding algorithms for constructing a minimal supporting set of solutions can be used in solving all the considered types of systems and systems of mixed types.

A Rake instance is callable (the implementation uses __call__), so keywords are extracted by calling the instance on the text:

>>> from fast_rake import Rake
>>>
>>> # default arguments are shown
>>> smart_rake = Rake(stopword_name="smart", custom_stopwords=None, max_kw=None, ngram_range=None, top_percent=1.0, kw_only=False)
>>>
>>> text = "Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered. Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating sets of solutions for all types of systems are given. These criteria and the corresponding algorithms for constructing a minimal supporting set of solutions can be used in solving all the considered types of systems and systems of mixed types." 
>>> kw = smart_rake(text)

The resulting list, kw, holds (keyword, score) tuples in descending order of score:

[
        ("minimal generating sets", 8.666666666666666),
        ("linear Diophantine equations", 8.5),
        ("minimal supporting set", 7.666666666666666),
        ("minimal set", 4.666666666666666),
        ("linear constraints", 4.5),
        ("natural numbers", 4.0),
        ("strict inequations", 4.0),
        ("nonstrict inequations", 4.0),
        ("Upper bounds", 4.0),
        ("mixed types", 3.666666666666667),
        ("considered types", 3.166666666666667),
        ("set", 2.0),
        ("types", 1.6666666666666667),
        ("considered", 1.5),
        ("Compatibility", 1.0),
        ("systems", 1.0),
        ("Criteria", 1.0),
        ("compatibility", 1.0),
        ("system", 1.0),
        ("components", 1.0),
        ("solutions", 1.0),
        ("algorithms", 1.0),
        ("construction", 1.0),
        ("criteria", 1.0),
        ("constructing", 1.0),
        ("solving", 1.0),
]
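
The scores in the listing are consistent with the word metric deg(w)/freq(w) from Rose et al., where a phrase's score is the sum of its word scores. The sketch below is an illustration of that metric, not the package's code; `rake_scores` is a hypothetical helper. It scores only the four candidates that contain "minimal", "set", or "sets", since no other candidate in the listing shares those words.

```python
from collections import defaultdict


def rake_scores(phrases):
    """Score phrases with the deg(w)/freq(w) word metric from Rose et al. (2010)."""
    freq = defaultdict(int)    # number of occurrences of each word
    degree = defaultdict(int)  # summed length of every phrase the word occurs in
    for phrase in phrases:
        words = phrase.split()
        for word in words:
            freq[word] += 1
            degree[word] += len(words)
    # A phrase's score is the sum of its member word scores.
    return {
        phrase: sum(degree[w] / freq[w] for w in phrase.split())
        for phrase in phrases
    }


# The four candidates from the example that contain "minimal", "set", or "sets".
candidates = ["minimal generating sets", "minimal supporting set", "minimal set", "set"]
for phrase, score in sorted(rake_scores(candidates).items(), key=lambda kv: -kv[1]):
    print(f"{phrase}: {score:.3f}")
```

For example, "minimal" occurs in three phrases of lengths 3, 3, and 2, so its word score is 8/3; "generating" and "sets" each score 3, giving 8/3 + 3 + 3 ≈ 8.667 for "minimal generating sets".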

Example Use Case

The data are 2,225 BBC News articles from BBC-Dataset-News-Classification (not included). examples/bbc_news.py presents a typical use case of finding keywords for each document in a corpus.

bbc_news.py --input-dir BBC-Dataset-News-Classification/dataset/data_files --algorithm rake-og --stopwords nltk 
Rake v1.2.0, stopwords: nltk
num docs: 2,225
time: 1.93149 secs
rate: 1151.96 docs/sec
UserWarnings: 0
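
The per-document loop behind a run like this can be sketched as below. This is not the contents of examples/bbc_news.py: `extract_corpus` and `placeholder_extractor` are hypothetical names, and a throwaway two-file corpus stands in for the BBC data so the sketch runs on its own. It shows one way to apply a callable extractor to every file and report a docs/sec rate.

```python
import tempfile
import time
from pathlib import Path


def extract_corpus(paths, extractor):
    """Run `extractor` on each file's text; return per-document keywords and docs/sec."""
    start = time.perf_counter()
    results = {}
    for path in paths:
        text = Path(path).read_text(encoding="utf-8", errors="replace")
        results[str(path)] = extractor(text)
    elapsed = time.perf_counter() - start
    rate = len(results) / elapsed if elapsed > 0 else float("inf")
    return results, rate


def placeholder_extractor(text):
    """Stand-in for a keyword extractor: returns (word, 1.0) pairs for long words."""
    return [(w, 1.0) for w in text.split() if len(w) > 6]


# Build a throwaway two-document corpus so the sketch runs without the BBC data.
with tempfile.TemporaryDirectory() as tmp:
    for name, body in [("a.txt", "linear Diophantine equations"), ("b.txt", "natural numbers")]:
        Path(tmp, name).write_text(body, encoding="utf-8")
    paths = sorted(Path(tmp).glob("*.txt"))
    results, rate = extract_corpus(paths, placeholder_extractor)

print(f"num docs: {len(results):,}")
print(f"rate: {rate:.2f} docs/sec")
```

Because a Rake instance is callable, it could presumably be passed as `extractor` in place of the placeholder, with the instance created once and reused for every document.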

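fast-rake is safe for multiprocessing. The example bbc_mp.py uses joblib as the multiprocessing backend (you must install joblib to run this example). In the run below, 8 concurrent workers roughly double the throughput of the single-process run above (2,295 vs. 1,152 docs/sec).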

bbc_mp.py --dataset bbc --top-dir BBC-Dataset-News-Classification/dataset/data_files --njobs -1 --algorithm rake-og --stopwords nltk 
running dataset: bbc
Rake v1.2.0, stopwords: nltk
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 tasks      | elapsed:    0.8s
[Parallel(n_jobs=-1)]: Done  16 tasks      | elapsed:    0.8s
[Parallel(n_jobs=-1)]: Done  43 tasks      | elapsed:    0.8s
[Parallel(n_jobs=-1)]: Done 108 tasks      | elapsed:    0.8s
[Parallel(n_jobs=-1)]: Done 303 tasks      | elapsed:    0.9s
[Parallel(n_jobs=-1)]: Done 1024 tasks      | elapsed:    0.9s
[Parallel(n_jobs=-1)]: Done 2084 tasks      | elapsed:    0.9s
[Parallel(n_jobs=-1)]: Done 2160 tasks      | elapsed:    1.0s
[Parallel(n_jobs=-1)]: Done 2210 out of 2225 | elapsed:    1.0s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done 2225 out of 2225 | elapsed:    1.0s finished
num docs: 2,225
time: 0.96945 secs
rate: 2295.11 docs/sec
UserWarnings: 0

License

Copyright © 2024, Lion Technologies, LLC.

MIT License

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
