MadhavBahl / Extractive-Text-Summarizer

Repository accompanying the blog

Extractive Text Summarizer

Short Notes

This repository contains a very simple implementation of extractive text summarization. The summarizer partially implements the approach described in this paper (without the boost factor): https://pdfs.semanticscholar.org/2df1/595bcbee37de1147784585a097f3a2819fdf.pdf

The code for the summarizer service can be found in the service folder. After the service was created, the project was hosted as a Flask API.
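A minimal sketch of such a Flask wrapper is shown below. The route name and JSON payload shape are assumptions for illustration and may differ from the actual code in the service folder; it also assumes the summarize_text function defined later in these notes is available.

# Hypothetical Flask wrapper around the summarizer
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/summarize', methods=['POST'])
def summarize():
    # Expect a JSON body like {"text": "...", "num_sent": 3}
    data = request.get_json()
    summary = summarize_text(data['text'], data.get('num_sent', 3))
    return jsonify({'summary': summary})

if __name__ == '__main__':
    app.run()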

Steps

From the above-mentioned paper:

  1. Read a text in and split it into individual tokens.
  2. Remove the stop words to filter the text.
  3. Assign a weight to each individual term (a worked example follows this list). The weight is calculated as:
     weight = (frequency of that term) / (total number of terms)
    
  4. Add a boost factor to bold, italic or underlined text
  5. Find the weight of each sentence (sum of individual weights)
  6. Rank individual sentences according to weight
  7. Extract the n highest-ranked sentences
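As a worked example of steps 3 and 5 (numbers made up for illustration): a term occurring 4 times in a document of 200 terms gets weight 4/200 = 0.02, and a sentence containing terms with weights 0.02, 0.01, and 0.03 scores 0.02 + 0.01 + 0.03 = 0.06.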

Things implemented

  1. Read the text
  2. Pre-process the data
    • Convert to lower case
    • Remove special characters
    • Remove digits
    • Replace runs of extra spaces with a single space
    • Return the clean text
  3. Tokenize the data into sentences
  4. Remove stop words
  5. Create a word-count dictionary
  6. Normalize the word-frequency dictionary (weighted word count matrix/dictionary)
     weight = (frequency of that word) / (total number of terms)
    
  7. Assign a score to each sentence
  8. Rank individual sentences according to weight and extract the n highest-ranked sentences

The steps involved are illustrated below -

  1. Read the text

Read a text document, or ask the user for input. Here we create a function which takes the input text and returns the summarized text. The second argument to the function is the number of high-scoring sentences you want to extract.

def summarize_text(text, num_sent):
    ...
    ...
    return summary
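The function would then be called with the raw text and the desired number of sentences, e.g. summary = summarize_text(article_text, 3), where article_text is whatever document you have read in (the variable name is illustrative).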
  2. Pre-process the data

Steps and code are shown below -

import re

def preprocess(text):
    # Convert to lower case
    clean_text = text.lower()
    # Remove special characters
    clean_text = re.sub(r"\W", " ", clean_text)
    # Remove digits
    clean_text = re.sub(r"\d", " ", clean_text)
    # Replace runs of extra spaces with a single space
    clean_text = re.sub(r"\s+", " ", clean_text)
    # Return the clean text
    return clean_text
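A quick sanity check of what this returns (input made up for illustration):

print(preprocess("Hello, World! It's 2019."))  # -> 'hello world it s ' (one trailing space)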
  1. Tokenize the data into sentences

We use sent_tokenize() provided by the nltk library

sentences = nltk.sent_tokenize(text)
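If you have not used nltk before, the tokenizer models and the stop word list (used in the next step) need a one-time download:

import nltk
nltk.download('punkt')      # models used by sent_tokenize / word_tokenize
nltk.download('stopwords')  # stop word corpus used in the next step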
  4. Remove stop words

Again, we use nltk

stop_words = nltk.corpus.stopwords.words('english')
  5. (contd. from 4) Remove stop words and create a word-count dictionary
word_count_dict = {}

for word in nltk.word_tokenize(clean_text):
    if word not in stop_words:
        if word not in word_count_dict:
            word_count_dict[word] = 1
        else:
            word_count_dict[word] += 1
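Equivalently, collections.Counter from the standard library builds the same dictionary in fewer lines (a minor idiomatic alternative, not what the repository code does):

from collections import Counter

# Counter is a dict subclass, so the rest of the code works unchanged
word_count_dict = Counter(
    word for word in nltk.word_tokenize(clean_text)
    if word not in stop_words
)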
  6. Normalize the word-frequency dictionary (weighted word count matrix/dictionary)
# Find the total number of terms (not necessarily unique) = sum of the counts in word_count_dict
total_terms = sum(word_count_dict.values())

# Normalize the word-frequency dictionary (weighted word count matrix/dictionary)
for key in word_count_dict:
    word_count_dict[key] = word_count_dict[key] / total_terms
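Note that after this step the weights sum to 1, so each value can be read as that term's share of the stop-word-filtered text.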
  7. Assign a score to each sentence
sentence_score_dict = {}
for sentence in sentences:
    if len(sentence.split(' ')) >= 25:  # 25 chosen arbitrarily, to skip very long sentences
        continue
    for word in nltk.word_tokenize(sentence.lower()):
        if word in word_count_dict:
            if sentence not in sentence_score_dict:
                sentence_score_dict[sentence] = word_count_dict[word]
            else:
                sentence_score_dict[sentence] += word_count_dict[word]
  8. Rank individual sentences according to weight and extract the n highest-ranked sentences
import heapq

best_sentences = heapq.nlargest(num_sent, sentence_score_dict, key=sentence_score_dict.get)
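Putting the pieces together, a minimal end-to-end version of summarize_text could look like the sketch below. This is an assembled illustration based on the snippets above, not a verbatim copy of the repository code; joining the selected sentences with a space is one reasonable choice for the return value.

import heapq
import re
import nltk

def preprocess(text):
    # Lower-case, strip special characters and digits, collapse whitespace
    clean_text = text.lower()
    clean_text = re.sub(r"\W", " ", clean_text)
    clean_text = re.sub(r"\d", " ", clean_text)
    clean_text = re.sub(r"\s+", " ", clean_text)
    return clean_text

def summarize_text(text, num_sent):
    sentences = nltk.sent_tokenize(text)
    stop_words = nltk.corpus.stopwords.words('english')

    # Weighted word-frequency dictionary over the cleaned text
    clean_text = preprocess(text)
    word_count_dict = {}
    for word in nltk.word_tokenize(clean_text):
        if word not in stop_words:
            word_count_dict[word] = word_count_dict.get(word, 0) + 1
    total_terms = sum(word_count_dict.values())
    for key in word_count_dict:
        word_count_dict[key] = word_count_dict[key] / total_terms

    # Score each (not too long) sentence as the sum of its term weights
    sentence_score_dict = {}
    for sentence in sentences:
        if len(sentence.split(' ')) >= 25:
            continue
        for word in nltk.word_tokenize(sentence.lower()):
            if word in word_count_dict:
                sentence_score_dict[sentence] = (
                    sentence_score_dict.get(sentence, 0) + word_count_dict[word]
                )

    # Extract the num_sent highest-ranked sentences
    best_sentences = heapq.nlargest(num_sent, sentence_score_dict,
                                    key=sentence_score_dict.get)
    return ' '.join(best_sentences)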
