Tip: how to make it summarize mid-tail languages, e.g. Polish
Manamama opened this issue
Problem:
The sumy module uses the nltk package for stemming and its own bundled word lists for stop words, but neither supports e.g. the Polish language out of the box.
Solution:
Stop words:
Download the Polish stop words file from e.g. here, rename it to polish.txt, and place it in the sumy stop words directory (e.g. ~/.local/lib/python3.10/site-packages/sumy/data/stopwords/polish.txt; the exact path depends on your Python version and where sumy is installed).
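Since the install path differs between machines, the copy step can be scripted. The following is only a sketch: the helper names sumy_stopwords_dir and install_stop_words are made up for illustration, and the data/stopwords layout is assumed from the path quoted above.

```python
import importlib.util
import shutil
from pathlib import Path


def sumy_stopwords_dir():
    """Return the assumed stop words directory of the installed sumy, or None."""
    spec = importlib.util.find_spec("sumy")
    if spec is None or not spec.submodule_search_locations:
        return None  # sumy is not installed in this environment
    package_root = Path(list(spec.submodule_search_locations)[0])
    # Assumption: same layout as the path quoted in this issue.
    return package_root / "data" / "stopwords"


def install_stop_words(source_file, target_dir, language="polish"):
    """Copy source_file into target_dir as <language>.txt and return the new path."""
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"{language}.txt"
    shutil.copyfile(source_file, target)
    return target
```

With sumy installed, something like install_stop_words("downloaded_stopwords.txt", sumy_stopwords_dir()) should put the file where the step above expects it (the source file name here is a placeholder).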
Stemming:
Use the pystempel package, which provides a stemmer for the Polish language. Here’s the code:
from stempel import StempelStemmer
# sumy's own stemmer (import needed if this code lives outside sumy's __main__.py)
from sumy.nlp.stemmers import Stemmer


class CallableStemmer:
    """Adapt an object with a .stem(word) method to a plain callable."""

    def __init__(self, stemmer):
        self.stemmer = stemmer

    def __call__(self, word):
        return self.stemmer.stem(word)


def get_stemmer(language):
    if language == 'pol':
        # Create a StempelStemmer object for Polish
        stemmer_obj = StempelStemmer.default()
        # Wrap it in a CallableStemmer
        return CallableStemmer(stemmer_obj)
    else:
        # For non-Polish languages, use the original Stemmer
        return Stemmer(language)
Then in this section, in the handle_arguments function, replace the line where the stemmer is created with a call to get_stemmer:
def handle_arguments(args, default_input_stream=sys.stdin):
    # ... (other code) ...
    language = args["--language"]
    if args["--stopwords"]:
        stop_words = read_stop_words(args["--stopwords"])
    else:
        stop_words = get_stop_words(language)
    parser = parser(document_content, Tokenizer(language))
    stemmer = get_stemmer(language)
    # ... (other code) ...
This way, if the language is Polish, get_stemmer will return a CallableStemmer that wraps a StempelStemmer. For any other language, it will return the original Stemmer.
Credit for most of the code: MS Copilot aka Bing
Hi, thank you for the issue. If NLTK support is not good enough maybe it would be better to add the support you are suggesting into NLTK. WDYT?
I have seen that you have stemmers in your code for Slovak, Greek etc. We had better add Polish there, instead.
(BTW, I know next to nothing about such architecture, I have just been hacking here... )