miso-belica / sumy

Module for automatic summarization of text documents and HTML pages.

Home Page: https://miso-belica.github.io/sumy/

Tip: how to make it summarize mid-tail languages, e.g. Polish

Manamama opened this issue

Problem:

The sumy module relies on the nltk package for stemming and stop words, but nltk does not support some languages, e.g. Polish, out of the box.

Solution:

Stop words:

Download a Polish stop words file from e.g. here, rename it to polish.txt, and place it in the sumy stop words directory. On my system that is ~/.local/lib/python3.10/site-packages/sumy/data/stopwords/polish.txt; the exact path depends on your Python version and on where sumy is installed.
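The copy step can also be scripted, so the path does not have to be guessed. Below is a minimal sketch using only the standard library. The helper names find_stopwords_dir and install_stop_words are made up for this example (they are not part of sumy), and the data/stopwords layout inside the package is assumed from the path above.

```python
import importlib.util
import shutil
from pathlib import Path


def find_stopwords_dir(package="sumy"):
    """Return <package dir>/data/stopwords, or None if the package is not installed.

    Hypothetical helper; the data/stopwords layout is assumed from the path
    mentioned in the issue.
    """
    spec = importlib.util.find_spec(package)
    if spec is None or not spec.submodule_search_locations:
        return None
    package_dir = Path(list(spec.submodule_search_locations)[0])
    return package_dir / "data" / "stopwords"


def install_stop_words(source_file, stopwords_dir, language="polish"):
    """Copy a downloaded stop words file to <stopwords_dir>/<language>.txt."""
    target = Path(stopwords_dir) / f"{language}.txt"
    shutil.copyfile(source_file, target)
    return target


# Example usage (not executed here; the download file name is a placeholder):
#   stopwords_dir = find_stopwords_dir()
#   if stopwords_dir is not None:
#       install_stop_words("downloaded_stop_words.txt", stopwords_dir)
```

Note that a file copied into site-packages may be lost when sumy is upgraded or reinstalled. The --stopwords option that handle_arguments reads looks like a way to point at a file kept elsewhere.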

Stemming:

Use the pystempel package, which provides a stemmer for the Polish language. Here’s the code:

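from stempel import StempelStemmer
from sumy.nlp.stemmers import Stemmer  # sumy's original stemmer, used as the fallback


class CallableStemmer:
    """Adapter: exposes an object that has a .stem() method as a plain callable."""

    def __init__(self, stemmer):
        self.stemmer = stemmer

    def __call__(self, word):
        return self.stemmer.stem(word)


def get_stemmer(language):
    # 'pol' is the code used originally; 'polish' matches the polish.txt file name
    if language in ('pol', 'polish'):
        # Create a StempelStemmer object for Polish
        stemmer_obj = StempelStemmer.default()
        # Wrap it in a CallableStemmer
        return CallableStemmer(stemmer_obj)
    else:
        # For non-Polish languages, use the original Stemmer
        return Stemmer(language)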

Then, in the handle_arguments function (in this section), replace the line that creates the stemmer with a call to get_stemmer:

def handle_arguments(args, default_input_stream=sys.stdin):
    # ... (other code) ...

    language = args["--language"]
    if args["--stopwords"]:
        stop_words = read_stop_words(args["--stopwords"])
    else:
        stop_words = get_stop_words(language)

    parser = parser(document_content, Tokenizer(language))
    stemmer = get_stemmer(language)

    # ... (other code) ...

This way, if the language is Polish, get_stemmer will return a CallableStemmer that wraps a StempelStemmer. For any other language, it will return the original Stemmer.
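To see why the wrapper is needed, here is a small self-contained sketch of the same adapter pattern. As far as I can tell, sumy calls the stemmer like a function, while the Stempel object exposes a stem() method. FakeStemmer is a made-up stand-in so that the example runs without pystempel; it is not a real Polish stemmer. The fallback to the original word is an extra assumption on my side, in case the wrapped stemmer returns nothing for a word.

```python
class CallableStemmer:
    """Adapter that lets an object with a .stem() method be called like a function."""

    def __init__(self, stemmer):
        self.stemmer = stemmer

    def __call__(self, word):
        stem = self.stemmer.stem(word)
        # Assumption: fall back to the unchanged word if no stem is returned
        return stem if stem else word


class FakeStemmer:
    """Hypothetical stand-in for StempelStemmer: strips a few suffixes only."""

    SUFFIXES = ("ami", "ach", "ów")

    def stem(self, word):
        for suffix in self.SUFFIXES:
            # Only strip when enough of the word is left over
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[: -len(suffix)]
        return None  # no rule matched


stem = CallableStemmer(FakeStemmer())
print(stem("kotami"))  # suffix "ami" is stripped
print(stem("kot"))     # no rule matches, so the word is returned unchanged
```

With pystempel installed, CallableStemmer(StempelStemmer.default()) would take the place of the fake object, as in get_stemmer above.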

Credit for most of the code: MS Copilot aka Bing

Hi, thank you for the issue. If NLTK support is not good enough maybe it would be better to add the support you are suggesting into NLTK. WDYT?

I have seen that you have stemmers in your code for Slovak, Greek etc. It might be better to add Polish there instead.

(BTW, I know next to nothing about such architecture, I have just been hacking here...)