miso-belica / sumy

Module for automatic summarization of text documents and HTML pages.

Home Page: https://miso-belica.github.io/sumy/

Summaries are the same for reversed-order sentences

xodarap777 opened this issue:

I tested the LSA, KL, TextRank, LexRank, and SumBasic summarizers on a document of roughly 10,000 words that covers a wide range of topics. I ran each summarizer on the document as-is and again with its sentences parsed in reverse order. To double-check the sentence order, I printed each summarizer's output with a maximum sentence count larger than the total number of sentences, so that every sentence was returned.

The problem is that, for any given summarizer, the summary was the same regardless of sentence order at every output size I tried (10 sentences, 30% of the total, and 50% of the total). I then repeated the tests with randomly ordered sentences and again got the same results for forward-order, reverse-order, and random-order inputs.

This may be my misunderstanding, but KL, at the least, should give different results according to the order of sentences - right? If so, something's wrong.

Hi @xodarap777, to be honest, I am quite confused. For example, I don't understand what you mean by "printing out the result of the summarizers with a max sentence count larger than the total". Could you provide an example document together with the code you use, and describe the actual and the expected result? I understand it as follows:

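from sumy.models.dom import ObjectDocumentModel, Paragraph
from sumy.nlp.stemmers import Stemmer
from sumy.nlp.tokenizers import Tokenizer
from sumy.parsers.html import HtmlParser
from sumy.summarizers.kl import KLSummarizer
from sumy.utils import get_stop_words

LANGUAGE = "english"
SENTENCES_COUNT = 10


def test_the_sentences_should_be_in_different_order():
    url = "https://en.wikipedia.org/wiki/Automatic_summarization"
    parser = HtmlParser.from_url(url, Tokenizer(LANGUAGE))
    stemmer = Stemmer(LANGUAGE)
    summarizer = KLSummarizer(stemmer)
    summarizer.stop_words = get_stop_words(LANGUAGE)
    # Same document with both the paragraphs and their sentences reversed.
    reversed_document = ObjectDocumentModel(
        Paragraph(reversed(p.sentences)) for p in reversed(parser.document.paragraphs)
    )

    sentences = summarizer(parser.document, SENTENCES_COUNT)
    reversed_sentences = summarizer(reversed_document, SENTENCES_COUNT)

    # reversed() returns an iterator, so materialize both sides before comparing.
    assert tuple(reversed(sentences)) == tuple(reversed_sentences)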

As you can see in 935035b, I added a test for this, and it seems OK to me.
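
On the original question of whether sentence order should change which sentences get selected, a small dependency-free sketch may help. It is not sumy's implementation of any summarizer. It is a hypothetical toy scorer (`toy_summarize`, a name made up here) that rates each sentence by the average document-wide frequency of its words. Because those statistics are bag-of-words counts, reordering the input leaves the set of selected sentences unchanged, and only the order in which they are returned follows the input.

```python
from collections import Counter


def toy_summarize(sentences, count):
    """Pick the `count` highest-scoring sentences, returned in input order.

    Hypothetical toy scorer for illustration only, not sumy's algorithm.
    """
    # Bag-of-words statistics: the counts ignore where a sentence appears.
    words = [s.lower().split() for s in sentences]
    freq = Counter(w for ws in words for w in ws)

    def score(i):
        # Average document-wide frequency of the words in sentence i.
        return sum(freq[w] for w in words[i]) / len(words[i])

    # Rank by score; break ties on the text so ranking ignores position.
    ranked = sorted(range(len(sentences)), key=lambda i: (-score(i), sentences[i]))
    chosen = sorted(ranked[:count])  # restore input order for the output
    return [sentences[i] for i in chosen]


document = [
    "cats chase mice",
    "dogs chase cats",
    "birds sing",
    "mice fear owls",
    "fish swim",
]

forward = toy_summarize(document, 2)
backward = toy_summarize(document[::-1], 2)

print(set(forward) == set(backward))  # True: same sentences are selected
print(backward == forward[::-1])      # True: only the output order flips
```

Under these assumptions, identical content for forward, reversed, and shuffled input is what a purely frequency-based scorer would produce. Whether a particular sumy summarizer is meant to be position-sensitive is a separate question that this sketch does not answer.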