This repository contains a very simple implementation of extractive text summarization. The summarizer was partially implemented from this paper (without adding a boost factor) - https://pdfs.semanticscholar.org/2df1/595bcbee37de1147784585a097f3a2819fdf.pdf
The code for the summarizer service can be found in the service folder. After creating the service, this project was hosted as a Flask API.
- Read a text in and split it into individual tokens.
- Remove the stop words to filter the text
- Assign a weight value to each individual term. The weight is calculated as:
weight = (frequency of that term)/(total number of terms)
- Add a boost factor to bold, italic or underlined text
- Find the weight of each sentence (sum of individual weights)
- Rank individual sentences according to weight
- Extract the n highest ranked sentences
- Read the text
- Pre-process the data
- Convert to lower case
- Remove special characters
- Remove digits
- Replace all runs of extra spaces with a single space
- Return the clean text
- Tokenize the data into sentences
- Remove stop words
- Create a word-count dictionary
- Normalize the word-frequency dictionary (weighted word count matrix/dictionary)
weight = (frequency of that word)/(total number of terms)
- Assign score to each sentence
- Rank individual sentences according to weight and extract the n highest ranked sentences
- Read the text
Read a text document, or ask for input from the user. Here we create a function that takes the input text and returns the summarized text. The second argument to the function is the number of high-scoring sentences you want to extract.
def summarize_text(text, num_sent):
    ...
    ...
    return summary
- Pre-process the data
Steps and code are shown below -
import re

def preprocess(text):
    # Convert to lower case
    clean_text = text.lower()
    # Remove special characters
    clean_text = re.sub(r"\W", " ", clean_text)
    # Remove digits
    clean_text = re.sub(r"\d", " ", clean_text)
    # Replace all runs of extra spaces with a single space
    clean_text = re.sub(r"\s+", " ", clean_text)
    # Return the clean text
    return clean_text
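For instance, running the pre-processing function above on a small sample (the sample string is made up for illustration):

```python
import re

def preprocess(text):
    clean_text = text.lower()                    # convert to lower case
    clean_text = re.sub(r"\W", " ", clean_text)  # remove special characters
    clean_text = re.sub(r"\d", " ", clean_text)  # remove digits
    clean_text = re.sub(r"\s+", " ", clean_text) # collapse runs of whitespace
    return clean_text

print(preprocess("Hello, World! 123"))  # -> "hello world " (note the trailing space)
```

Note that punctuation and digits become spaces first, so the final `\s+` pass is what keeps the output tidy.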
- Tokenize the data into sentences
We use sent_tokenize() provided by the nltk library
sentences = nltk.sent_tokenize(text)
- Remove stop words
Again, we use nltk
stop_words = nltk.corpus.stopwords.words('english')
- (contd. from 4) Remove stop words and create word count dictionary
word_count_dict = {}
for word in nltk.word_tokenize(clean_text):
    if word not in stop_words:
        if word not in word_count_dict.keys():
            word_count_dict[word] = 1
        else:
            word_count_dict[word] += 1
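A small self-contained illustration of this counting step (it uses a tiny hand-picked stop-word set and a simple `split()` in place of nltk's stop-word list and word tokenizer, so it runs without nltk; the sample text is made up):

```python
# Tiny stand-in stop-word list; the real code uses nltk's English list
stop_words = {"the", "a", "is", "on"}

clean_text = "the cat sat on the mat the cat slept"

word_count_dict = {}
for word in clean_text.split():
    if word not in stop_words:
        # dict.get with a default avoids the explicit membership check
        word_count_dict[word] = word_count_dict.get(word, 0) + 1

print(word_count_dict)  # -> {'cat': 2, 'sat': 1, 'mat': 1, 'slept': 1}
```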
- Normalize the word-frequency dictionary (weighted word count matrix/dictionary)
# Find the total number of terms (not necessarily unique) = sum of values in the word_count_dict
total_terms = sum(word_count_dict.values())
# Normalize the word-frequency dictionary (weighted word count matrix/dictionary)
for key in word_count_dict.keys():
    word_count_dict[key] = word_count_dict[key]/total_terms
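Continuing the made-up counts from the previous illustration, normalization looks like this (after it, the weights sum to 1):

```python
word_count_dict = {"cat": 2, "sat": 1, "mat": 1, "slept": 1}

# Total number of terms (not necessarily unique); note .values() must be *called*
total_terms = sum(word_count_dict.values())  # 5

# weight = (frequency of that word) / (total number of terms)
for key in word_count_dict:
    word_count_dict[key] = word_count_dict[key] / total_terms

print(word_count_dict["cat"])  # -> 0.4
```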
- Assign scores to each sentence
sentence_score_dict = {}
for sentence in sentences:
    for word in nltk.word_tokenize(sentence.lower()):
        if word in word_count_dict.keys():
            if len(sentence.split(' ')) < 25:  # 25 chosen arbitrarily, to exclude very long sentences
                if sentence not in sentence_score_dict.keys():
                    sentence_score_dict[sentence] = word_count_dict[word]
                else:
                    sentence_score_dict[sentence] += word_count_dict[word]
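A self-contained sketch of the scoring step, with made-up sentences and word weights, using a plain `split()` plus punctuation stripping as a stand-in for nltk's word tokenizer:

```python
# Hypothetical inputs: sentences and normalized word weights (values made up)
sentences = ["The cat sat on the mat.", "The cat slept."]
word_count_dict = {"cat": 0.4, "sat": 0.2, "mat": 0.2, "slept": 0.2}

sentence_score_dict = {}
for sentence in sentences:
    if len(sentence.split(" ")) >= 25:  # skip very long sentences
        continue
    for word in sentence.lower().split():
        word = word.strip(".,!?")  # crude stand-in for nltk.word_tokenize
        if word in word_count_dict:
            sentence_score_dict[sentence] = (
                sentence_score_dict.get(sentence, 0) + word_count_dict[word]
            )

print(sentence_score_dict)
# first sentence scores ~0.8, second ~0.6
```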
- Rank individual sentences according to weight and extract the n highest ranked sentences
best_sentences = heapq.nlargest(num_sent, sentence_score_dict, key=sentence_score_dict.get)
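Putting the steps together, here is a minimal end-to-end sketch of `summarize_text`. It swaps nltk's sentence/word tokenizers and stop-word list for simple stdlib stand-ins (a regex sentence split, `re.findall` word extraction, and a tiny hand-picked stop-word set) so the example runs without nltk; the actual service uses nltk as described above. The sample text is made up:

```python
import heapq
import re

# Tiny stand-in stop-word list; the real code uses nltk's English list
STOP_WORDS = {"the", "a", "an", "is", "was", "on", "in", "it", "and", "of", "to"}

def summarize_text(text, num_sent):
    # Crude sentence split on terminal punctuation (stand-in for nltk.sent_tokenize)
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    # Pre-process: lower-case, drop non-word chars and digits, collapse spaces
    clean_text = re.sub(r"\s+", " ", re.sub(r"[\W\d]", " ", text.lower()))

    # Word-count dictionary, excluding stop words
    word_count_dict = {}
    for word in clean_text.split():
        if word not in STOP_WORDS:
            word_count_dict[word] = word_count_dict.get(word, 0) + 1

    # Normalize: weight = frequency / total number of terms
    total_terms = sum(word_count_dict.values())
    weights = {w: c / total_terms for w, c in word_count_dict.items()}

    # Score sentences shorter than 25 words by summing their word weights
    scores = {}
    for sentence in sentences:
        if len(sentence.split(" ")) >= 25:
            continue
        for word in re.findall(r"\w+", sentence.lower()):
            if word in weights:
                scores[sentence] = scores.get(sentence, 0) + weights[word]

    # Extract the num_sent highest ranked sentences
    return heapq.nlargest(num_sent, scores, key=scores.get)

text = ("Cats sleep a lot. Cats chase mice. Dogs bark loudly. "
        "Cats and dogs can live together.")
print(summarize_text(text, 2))
```

Frequent words such as "cats" and "dogs" dominate the weights, so the sentence mentioning both ranks highest.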