TextRank

A simple, clean Python implementation of TextRank based on (Mihalcea and Tarau, 2004). It implements both keyword extraction and extractive summarization.

Prerequisites

Python 2.7 or Python 3.*
NumPy
NLTK

Once NLTK is installed, you need to download the data files used by the stopword list, the tokenizer and the POS tagger. To do so, open a Python shell and run the following:

import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

Usage

Extract the top 10 keywords from a document (here ./samples/00.txt):

python textrank.py -p ./samples/00.txt -l 10

Summarize a document (here ./samples/00.txt) in 3 sentences:

python textrank.py -p ./samples/00.txt -s -l 3

Implementation Details

TextRank is a graph-based ranking algorithm inspired by Google's PageRank (Brin and Page, 1998). It scores each node using global information computed recursively from the entire graph. It is a fast, completely unsupervised method that requires no training. This implementation adopts the design choices reported to give the best results in (Mihalcea and Tarau, 2004).
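As a rough illustration of the recursive scoring, the sketch below runs a PageRank-style iteration over a small weighted, undirected graph. It is not the code in textrank.py: the function name rank_nodes, the dict-of-dicts graph layout and the max_iter cap are assumptions made for this example. The damping factor and stopping threshold are the values listed among the design choices.

```python
def rank_nodes(graph, damping=0.85, threshold=1e-4, max_iter=100):
    """Score the nodes of a weighted, undirected graph, TextRank style.

    graph: dict mapping node -> {neighbour: weight} (hypothetical layout;
    edges are assumed to be stored symmetrically).
    """
    if not graph:
        return {}
    scores = {node: 1.0 for node in graph}
    # Total edge weight per node, used to normalise what it passes on.
    out_weight = {node: sum(nbrs.values()) for node, nbrs in graph.items()}
    for _ in range(max_iter):
        new_scores = {}
        for node, nbrs in graph.items():
            # Each neighbour contributes its score in proportion to the
            # share of its total edge weight that points at this node.
            incoming = sum(
                weight / out_weight[nbr] * scores[nbr]
                for nbr, weight in nbrs.items()
                if out_weight[nbr] > 0
            )
            new_scores[node] = (1 - damping) + damping * incoming
        # Largest change in any node's score during this pass.
        delta = max(abs(new_scores[n] - scores[n]) for n in graph)
        scores = new_scores
        if delta < threshold:
            break
    return scores


# Tiny star graph: "a" is linked to both "b" and "c".
star = {"a": {"b": 1.0, "c": 1.0}, "b": {"a": 1.0}, "c": {"a": 1.0}}
star_scores = rank_nodes(star)
```

In this toy graph the hub node "a" ends up with the highest score, because it receives the full weight of both neighbours while each neighbour receives only half of the hub's.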

The following provides a summary of the design choices:

Undirected graphs for keyword extraction
The graphs are created using a co-occurrence matrix with co-occurrence window N=2
Similarity matrix for sentence extraction/summarization
Similarity measure is based on normalized word overlap between adjacent sentences
Text is normalized to lower case, and some non-interpretable Unicode characters are replaced by their proper Unicode counterparts
Tokenization is performed using NLTK's enhanced Treebank Word Tokenizer
NLTK's built-in English stopword list is used to remove "unimportant" words
Part-of-speech (POS) tagging is employed, and only nouns and adjectives are kept
NLTK's Porter stemmer is used during sentence extraction/summarization but not during keyword extraction
For the ranking algorithm, a damping factor of 0.85 is used, and iteration stops once the change in rank score (delta) falls below 0.0001
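The two graph constructions listed above can be sketched as follows. This is a minimal illustration, not the repository's code. The function names are hypothetical, a window of N=2 is read here as "adjacent tokens only", edge weights are assumed to count co-occurrences, and the similarity uses the log-length normalization from (Mihalcea and Tarau, 2004), which may differ from the normalization used in textrank.py.

```python
import math
from collections import defaultdict


def cooccurrence_graph(tokens, window=2):
    """Build an undirected, weighted co-occurrence graph from a token list.

    Two tokens are linked when they fall inside the same window of
    `window` consecutive tokens; with window=2 that means neighbours only.
    """
    graph = defaultdict(lambda: defaultdict(float))
    for i, left in enumerate(tokens):
        for right in tokens[i + 1 : i + window]:
            if left == right:
                continue  # no self-loops
            # Store the edge in both directions to keep it undirected.
            graph[left][right] += 1.0
            graph[right][left] += 1.0
    return {node: dict(nbrs) for node, nbrs in graph.items()}


def sentence_similarity(words_a, words_b):
    """Word overlap normalised by the log of both sentence lengths."""
    if not words_a or not words_b:
        return 0.0
    overlap = len(set(words_a) & set(words_b))
    norm = math.log(len(words_a)) + math.log(len(words_b))
    if overlap == 0 or norm <= 0:
        return 0.0  # also covers two one-word sentences (log 1 = 0)
    return overlap / norm


keyword_graph = cooccurrence_graph(["linear", "systems", "linear", "equations"])
similarity = sentence_similarity(["a", "b", "c"], ["b", "c", "d"])
```

The tokens passed in are assumed to be already lower-cased, stopword-filtered and POS-filtered (and, for sentences, stemmed), as described in the list above.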

Note that precision, recall and F1-score will not be exactly the same as in the original paper. The paper gives no further details on the syntactic filter beyond POS tags, and different POS taggers give different results depending on their implementation. It is also unclear what tokenization and which stopword list the paper uses. In this implementation we actually obtain better results than those reported in the original paper (based on the Hulth and DUC datasets).
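For reference, the three metrics mentioned above can be computed for keyword extraction as in the sketch below. The evaluation script behind the reported results is not shown here, so the function name and the exact-string matching of keywords are assumptions for illustration only.

```python
def precision_recall_f1(extracted, reference):
    """Compare extracted keywords against a reference set by exact match."""
    extracted, reference = set(extracted), set(reference)
    correct = len(extracted & reference)
    # Guard against empty sets to avoid division by zero.
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical keyword lists, not taken from any dataset.
p, r, f1 = precision_recall_f1(
    ["party", "vote", "campaign", "people"],
    ["party", "vote", "election"],
)
```

Here two of the four extracted keywords match, so precision is 2/4, recall is 2/3 and F1 is 4/7.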

Examples

Three-sentence summary of a random political article online (see full version)

The Liberal party has wheeled out its elder statesman, former prime minister John Howard, in a last-ditch attempt to convince Liberal voters in Wentworth not to punish the government with a protest vote on the weekend. “I don’t think those normal Liberal voters in Wentworth want a Labor government,” Howard said. During a street walk in Double Bay, Howard experienced first hand the sentiments in Wentworth, with one voter telling him candidly he was appalled by the treatment of Turnbull at the hands of his own party and would not be voting Liberal.

Top 10 keywords based on the same article:

people; campaign; vote; Wentworth; Turnbull; Phelps; party; Liberal; Howard; government

Three-sentence summary of a random tech article (see full version)

Telstra chief executive Andy Penn has chosen Swedish technology company Ericsson as a partner in its upcoming launch of Australia's ultra-fast mobile network 5G, two months after the government banned Chinese-equipment provider Huawei. As Telstra continues plans to turn the business around, Mr Penn is rallying for the NBN Co to cut the wholesale prices it charges providers as margins on retail providers have been squeezed leaving it less profitable to sell these services. Launching 5G, and staying ahead of competitors such as Optus and a combined TPG-Vodafone, is part of the Telstra boss' plan to turn around the telco, after a difficult couple of years for the share price amid the rollout of the National Broadband Network and intensifying mobile competition.

Top 10 keywords based on the same article:

sites; Mr; Penn; Australia; providers; technology; mobile; networks; Ericsson; Telstra

Future Improvements

For keyword extraction, join adjacent top-ranked keywords in the text; e.g. given "former prime minister John Howard", if both "prime" and "minister" are selected as top-ranked keywords, they should be joined into the single keyword "prime minister"
Use pre-trained GloVe embeddings and perform similarity and co-occurrence weightings using GloVe vectors
Sometimes the sentence ordering in the extracted summary is sub-optimal; improve this using either syntax rules, or an additional model that performs sentence ordering
Add a capability for fetching articles/text from websites, perform HTML parsing, and then run keyword/sentence extraction
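The first improvement in the list above (joining adjacent top-ranked keywords) could look roughly like the sketch below. It is not implemented in this repository; the function name and its inputs (the original token sequence plus the set of selected keywords) are assumptions for illustration.

```python
def join_adjacent_keywords(tokens, keywords):
    """Collapse runs of adjacent top-ranked tokens into multi-word keywords.

    tokens: the document's tokens in their original order.
    keywords: the single-word keywords selected by the ranking step.
    """
    keywords = set(keywords)
    phrases, run = [], []
    # The trailing None acts as a sentinel that flushes the final run.
    for token in list(tokens) + [None]:
        if token in keywords:
            run.append(token)
            continue
        if run:
            phrase = " ".join(run)
            if phrase not in phrases:  # keep first occurrence only
                phrases.append(phrase)
            run = []
    return phrases


joined = join_adjacent_keywords(
    "former prime minister John Howard".split(),
    {"prime", "minister", "Howard"},
)
```

With the example from the list, "prime" and "minister" are adjacent and become "prime minister", while "Howard" stays a single-word keyword because "John" was not selected.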

References

R. Mihalcea and P. Tarau. 2004. TextRank: Bringing Order into Texts.

D. Greene and P. Cunningham. 2006. Practical Solutions to the Problem of Diagonal Dominance in Kernel Document Clustering. In Proc. 23rd International Conference on Machine Learning (ICML'06). ACM Press.

S. Brin and L. Page. 1998. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1-7).

acatovic / textrank