This repository contains a very simple implementation of extractive text summarization. The summarizer was partially implemented from this paper (without adding a boost factor) - https://pdfs.semanticscholar.org/2df1/595bcbee37de1147784585a097f3a2819fdf.pdf
The code for the summarizer service can be found in the service folder. After creating the service, this project was hosted as a Flask API.
- Read a text in and split it into individual tokens.
- Remove the stop words to filter the text
- Assign a weight value to each individual term. The weight is calculated as:
weight = (frequency of that term)/(total number of terms)
- Add a boost factor to bold, italic or underlined text
- Find the weight of each sentence (sum of individual weights)
- Rank individual sentences according to weight
- Extract the n highest ranked sentences
- Read the text
- Pre-process the data
- Convert to lower case
- Remove special characters
- Remove digits
- Replace all runs of extra spaces with a single space
- Return the clean text
- Tokenize the data into sentences
- Remove stop words
- Create a word-count dictionary
- Normalize the word-frequency dictionary (weighted word count matrix/dictionary)
weight = (frequency of that word)/(total number of terms)
- Assign score to each sentence
- Rank individual sentences according to weight and extract the n highest ranked sentences
- Read the text
Read a text document, or ask for input from the user. Here we create a function that takes the input text and returns the summarized text. The second argument to the function is the number of high-scoring sentences you want to extract.
def summarize_text(text, num_sent):
    ...
    ...
    return summary
- Pre-process the data
Steps and code are shown below -
import re

def preprocess(text):
    # Convert to lower case
    clean_text = text.lower()
    # Remove special characters
    clean_text = re.sub(r"\W", " ", clean_text)
    # Remove digits
    clean_text = re.sub(r"\d", " ", clean_text)
    # Replace all runs of extra spaces with a single space
    clean_text = re.sub(r"\s+", " ", clean_text)
    # Return the clean text
    return clean_text
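For instance, running the pre-processing function above on a small sample (the sample string is made up for illustration):

```python
import re

def preprocess(text):
    clean_text = text.lower()                    # convert to lower case
    clean_text = re.sub(r"\W", " ", clean_text)  # remove special characters
    clean_text = re.sub(r"\d", " ", clean_text)  # remove digits
    clean_text = re.sub(r"\s+", " ", clean_text) # collapse runs of whitespace
    return clean_text

print(preprocess("Hello, World! 123"))  # -> "hello world " (note the trailing space)
```

Note that punctuation and digits become spaces first, so the final `\s+` pass is what keeps the output tidy.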
- Tokenize the data into sentences
We use sent_tokenize() provided by the nltk library
sentences = nltk.sent_tokenize(text)
- Remove stop words
Again, we use nltk
stop_words = nltk.corpus.stopwords.words('english')
- (contd. from 4) Remove stop words and create word count dictionary
word_count_dict = {}
for word in nltk.word_tokenize(clean_text):
    if word not in stop_words:
        if word not in word_count_dict.keys():
            word_count_dict[word] = 1
        else:
            word_count_dict[word] += 1
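A small self-contained illustration of this counting step (it uses a tiny hand-picked stop-word set and a simple `split()` in place of nltk's stop-word list and word tokenizer, so it runs without nltk; the sample text is made up):

```python
# Tiny stand-in stop-word list; the real code uses nltk's English list
stop_words = {"the", "a", "is", "on"}

clean_text = "the cat sat on the mat the cat slept"

word_count_dict = {}
for word in clean_text.split():
    if word not in stop_words:
        # dict.get with a default avoids the explicit membership check
        word_count_dict[word] = word_count_dict.get(word, 0) + 1

print(word_count_dict)  # -> {'cat': 2, 'sat': 1, 'mat': 1, 'slept': 1}
```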
- Normalize the word-frequency dictionary (weighted word count matrix/dictionary)
# Find the total number of terms (not necessarily unique) = sum of values in the word_count_dict
total_terms = sum(word_count_dict.values())
# Normalize the word-frequency dictionary (weighted word count matrix/dictionary)
for key in word_count_dict.keys():
    word_count_dict[key] = word_count_dict[key]/total_terms
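Continuing the made-up counts from the previous illustration, normalization looks like this (after it, the weights sum to 1):

```python
word_count_dict = {"cat": 2, "sat": 1, "mat": 1, "slept": 1}

# Total number of terms (not necessarily unique); note .values() must be *called*
total_terms = sum(word_count_dict.values())  # 5

# weight = (frequency of that word) / (total number of terms)
for key in word_count_dict:
    word_count_dict[key] = word_count_dict[key] / total_terms

print(word_count_dict["cat"])  # -> 0.4
```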
- Assign scores to each sentence
sentence_score_dict = {}
for sentence in sentences:
    for word in nltk.word_tokenize(sentence.lower()):
        if word in word_count_dict.keys():
            if len(sentence.split(' ')) < 25:  # 25 chosen arbitrarily, to exclude very long sentences
                if sentence not in sentence_score_dict.keys():
                    sentence_score_dict[sentence] = word_count_dict[word]
                else:
                    sentence_score_dict[sentence] += word_count_dict[word]
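A self-contained sketch of the scoring step, with made-up sentences and word weights, using a plain `split()` plus punctuation stripping as a stand-in for nltk's word tokenizer:

```python
# Hypothetical inputs: sentences and normalized word weights (values made up)
sentences = ["The cat sat on the mat.", "The cat slept."]
word_count_dict = {"cat": 0.4, "sat": 0.2, "mat": 0.2, "slept": 0.2}

sentence_score_dict = {}
for sentence in sentences:
    if len(sentence.split(" ")) >= 25:  # skip very long sentences
        continue
    for word in sentence.lower().split():
        word = word.strip(".,!?")  # crude stand-in for nltk.word_tokenize
        if word in word_count_dict:
            sentence_score_dict[sentence] = (
                sentence_score_dict.get(sentence, 0) + word_count_dict[word]
            )

print(sentence_score_dict)
# first sentence scores ~0.8, second ~0.6
```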
- Rank individual sentences according to weight and extract the n highest ranked sentences
best_sentences = heapq.nlargest(num_sent, sentence_score_dict, key=sentence_score_dict.get)
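Putting the steps together, here is a minimal end-to-end sketch of `summarize_text`. It swaps nltk's sentence/word tokenizers and stop-word list for simple stdlib stand-ins (a regex sentence split, `re.findall` word extraction, and a tiny hand-picked stop-word set) so the example runs without nltk; the actual service uses nltk as described above. The sample text is made up:

```python
import heapq
import re

# Tiny stand-in stop-word list; the real code uses nltk's English list
STOP_WORDS = {"the", "a", "an", "is", "was", "on", "in", "it", "and", "of", "to"}

def summarize_text(text, num_sent):
    # Crude sentence split on terminal punctuation (stand-in for nltk.sent_tokenize)
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    # Pre-process: lower-case, drop non-word chars and digits, collapse spaces
    clean_text = re.sub(r"\s+", " ", re.sub(r"[\W\d]", " ", text.lower()))

    # Word-count dictionary, excluding stop words
    word_count_dict = {}
    for word in clean_text.split():
        if word not in STOP_WORDS:
            word_count_dict[word] = word_count_dict.get(word, 0) + 1

    # Normalize: weight = frequency / total number of terms
    total_terms = sum(word_count_dict.values())
    weights = {w: c / total_terms for w, c in word_count_dict.items()}

    # Score sentences shorter than 25 words by summing their word weights
    scores = {}
    for sentence in sentences:
        if len(sentence.split(" ")) >= 25:
            continue
        for word in re.findall(r"\w+", sentence.lower()):
            if word in weights:
                scores[sentence] = scores.get(sentence, 0) + weights[word]

    # Extract the num_sent highest ranked sentences
    return heapq.nlargest(num_sent, scores, key=scores.get)

text = ("Cats sleep a lot. Cats chase mice. Dogs bark loudly. "
        "Cats and dogs can live together.")
print(summarize_text(text, 2))
```

Frequent words such as "cats" and "dogs" dominate the weights, so the sentence mentioning both ranks highest.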