miso-belica / sumy

Module for automatic summarization of text documents and HTML pages.

Home Page: https://miso-belica.github.io/sumy/

Tokenizer/Stemmer and a few other questions

vprelovac opened this issue

Hey Mišo

I've spent a lot of time on TextRank, and while digging deeper into Sumy I want to ask you a few clarifying questions about some of the choices you made. This is all for the English language.

  1. _WORD_PATTERN = re.compile(r"^[^\W\d_]+$", re.UNICODE)

This is used with word_tokenize() to filter out 'non-word' tokens. The problem is that it "kills" words like "data-mining" or "sugar-free". Also, word_tokenize is very slow. Here is an alternative that replaces both, for your consideration (see the comparison sketch at the end of this comment):

WORDS = re.compile(r"\w+(?:['-]\w+)*")
words = WORDS.findall(sentence)
  2. What made you choose Snowball over the Porter stemmer?

Snowball: DVDs -> dvds
Porter: DVDs -> dvd

I don't have a particular opinion; I'm just wondering how you made the decision.

  3. How did you come up with your stopwords (for English)? They are very different from the NLTK defaults, for example.

  4. The heuristics in the plaintext parser are interesting.

Take this example of text extracted from https://www.karoly.io/amazon-lightsail-review-2018/:

Is Amazon Lightsail worth it?
Written by Niklas Karoly 10/28/2018 · 8 min read
Amazon AWS Lightsail review 2018
In November of 2016 AWS launched its brand Amazon Lightsail to target the ever growing market that DigitalOcean , Linode and co. made popular.

This ends up as two sentences instead of four (a minimal repro follows below).
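To illustrate the tokenizing difference, here is a quick comparison sketch (using a plain split() in place of word_tokenize just for brevity):

import re

# sumy's current filter: a token must be letters only, so anything
# containing a hyphen or an apostrophe is dropped
_WORD_PATTERN = re.compile(r"^[^\W\d_]+$", re.UNICODE)

# the proposed alternative: keep internal hyphens/apostrophes
WORDS = re.compile(r"\w+(?:['-]\w+)*")

sentence = "Sugar-free gum isn't data-mining."
print([w for w in sentence.split() if _WORD_PATTERN.match(w)])
# ['gum']
print(WORDS.findall(sentence))
# ['Sugar-free', 'gum', "isn't", 'data-mining']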
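The sentence-splitting issue is also easy to reproduce with NLTK's punkt model directly (which is, as far as I can tell, what Sumy's Tokenizer wraps for English); lines without a terminal end mark are simply merged into the next sentence:

from nltk.tokenize import sent_tokenize  # requires the "punkt" data package

text = ("Is Amazon Lightsail worth it?\n"
        "Written by Niklas Karoly 10/28/2018 · 8 min read\n"
        "Amazon AWS Lightsail review 2018\n"
        "In November of 2016 AWS launched its brand Amazon Lightsail to "
        "target the ever growing market that DigitalOcean, Linode and "
        "co. made popular.")

for s in sent_tokenize(text):
    print(repr(s))
# only "Is Amazon Lightsail worth it?" is split off; the headline-style
# lines carry no end mark, so the rest comes back as one long "sentence"

# one possible workaround: treat line breaks as hard boundaries first
sentences = [s for line in text.splitlines() for s in sent_tokenize(line)]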

Hi Vladimir, I think you know the code better than I do, because TextRank was not contributed by me (at least not the current implementation). But I will try to check the code and respond to your questions.

  1. I am not against replacing the implementation with a simpler/faster one, but tokenizing is not always about regexes. There are other languages, and Sumy relies on NLTK and other libs, so I don't want to make it perfect for one language and break it for the others. Also, I trust NLTK to do its job better than I would. But you are right that those words should be fixed and tested. Also, Sumy is pluggable, so you can provide your own tokenizer implemented as you described. I think the regex can be simplified:
WORDS = re.compile(r"[\w'-]+")
words = WORDS.findall(sentence)
  2. Snowball vs. Porter stemmer: to be honest, I don't remember the decision. It was years ago. I don't even know whether I tried both and picked the better one or simply used the first one I saw in the documentation. (A quick comparison sketch follows this list.)

  3. As far as I remember, it's a mix of NLTK, frequent words from Wikipedia, and the stopwords from other projects I was involved in. Sumy was my experiment in the early days; I used whatever gave me better results, and I started to make it more generic when more people "joined" the project on GitHub.

  4. Yep, the sentences are separated by the correct end mark, not by the newline, if that is what you mean.
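The DVDs example is easy to check with NLTK's stemmer classes (a sketch; Sumy uses the Snowball stemmer for English under the hood):

from nltk.stem import PorterStemmer, SnowballStemmer

print(PorterStemmer().stem("DVDs"))             # 'dvd'  - the plural 's' is stripped unconditionally
print(SnowballStemmer("english").stem("DVDs"))  # 'dvds' - Snowball keeps the 's' because "dvd" contains no vowel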

  1. I agree it is complex. However, you already decided to use a regex approach for English, and my point is that the regex I provided is higher quality and faster overall.

Note: your tweaked version would leave lonely dashes floating.

  2. Fair enough.

  3. Sharing my current stopwords (a usage sketch follows this list):
    stopWords=frozenset(['front', 'wednesday', 'whole', 'thin', "you're", 'appear', 'could', 'further', 'q', 'fri', 'willing', 'years', 'saturday', 'be', 'is', 's', 'various', 'example', 'your', "i'd", 'specifying', 'entirely', 'follows', 'therefore', 'asking', "we're", 'otherwise', 'newsinfo', 'doesn', 'becomes', 'ie', 't', 'inner', 'friday', 'ltd', 'however', 'different', 'herein', 'got', 'mightn', 'lately', "that'll", 'been', 'sometime', 'wherein', 'i', 'inquirer', 'no', 'along', '1', 'ever', 'hereupon', 'mean', 'value', 'described', 'via', '2', 'move', 'shouldn', 'december', 'five', 'anyone', "that's", 'sincere', 'toward', 'useful', 'had', 'normally', 'seems', 'am', 'allows', 'sent', 'april', 'instead', '5', 'yourselves', 'fifth', 'top', 'all', 'hasnt', 'inward', 'say', 'thickv', 'll', 'soon', 'weren', 'while', 'a', '10', 'might', 'sixty', 'anyways', "we've", 'please', 'little', 'least', 'definitely', 'eg', 'her', 'accordingly', 'hereafter', 'home', 'sun', 'y', 'seriously', 'whose', 'clearly', 'the', 'said', 'came', 'herself', 'stories', 'wouldn', 'ain', 'z', 'on', 'doing', 'until', 'except', 'anyhow', 'former', 'concerning', 'same', 'whereby', 'possible', 'going', 'still', "it's", "he's", 'keep', 'see', 'done', 'find', "c's", 'thus', 'indicates', 'ours', 'itself', 'thank', 'inc', 'lest', 'beyond', "wouldn't", 'currently', "we'd", 'himself', 'just', 'thu', 'although', 'consider', 'between', 'far', 'percent', 'o', 'will', 'looking', 'tries', "they're", 'okay', 'cannot', 'put', 'hundred', 'thereafter', 'mainly', 'ex', 'look', 'ten', 'allow', 'thanks', 'getting', 'much', "i've", 'gotten', 'my', 'plus', 'w', 'become', 'why', 'wants', 'after', 'zero', "when's", 'certain', 'unlikely', "how's", '0', 'photo', 'necessary', 'more', 'says', 'ma', 'whereas', 'so', 'whether', 'self', 'afterwards', 'rappler', 'yet', 'especially', 'wonder', "don't", '6', 'in', 'hopefully', 'having', "she'd", 'others', 'myself', 'often', 'tried', 'may', 'awfully', 'whoever', 'does', 'own', 'anything', 'besides', 'gives', "shouldn't", 'c', 'reasonably', 'again', 'associated', 'best', 'tends', 'amount', "aren't", 'ye', 'pm', 'anyway', 'would', 'sorry', 'mine', 'reuters', 'everywhere', 'found', 'of', 'specify', "i'm", 'looks', "hadn't", 're', 'yung', 'able', 'last', "you've", 'few', 'something', 'tue', 'this', "you'd", 'empty', "isn't", 'must', 'either', 'considering', 'whereafter', "we'll", 'eleven', 'usually', 'time', "hasn't", 'our', 'greetings', 'since', 'you', 'thursday', 'particularly', 'gone', 'don', 'above', 'new', 'amongst', 'seen', 'up', 'consequently', 'many', 'needs', 'behind', 'has', 'couldn', 'contain', 'tell', 'under', 'twenty', 'use', 'well', 'following', 'sports', 'later', 'go', 'every', 'but', 'it', 'indeed', 'namely', 'not', "weren't", 'once', 'each', 'first', 'beside', 'hardly', 'did', 'thence', 'liked', 'sub', 'used', 'b', 'hi', 'think', 'maybe', "should've", 'ako', 'rather', 'eight', 'against', "haven't", 'hers', 'too', 'was', 'beforehand', 'rapplercom', 'right', 'vs', 'seem', 'unto', 'sat', 'seemed', 'then', 'welcome', 'when', 'part', 'serious', 'can', 'sup', 'here', 'wherever', 'saying', 'ang', 'second', 'alone', 'another', 'with', 'co', 'according', 'ask', 'nowhere', 'wed', 'despite', 'particular', 'by', 'nothing', 'year', 'qv', 'regarding', 'nd', 'his', 'january', 'side', 'section', 'tuesday', 'never', 'both', 'indicated', "here's", 'quite', 'k', 'full', "couldn't", 'february', 'aren', 'somewhere', 'available', 'yes', 'into', 'per', 'g', "they've", 'thats', 'n', 'than', 'sometimes', 'uucp', 
'always', 'back', 'get', 'merely', 'nobody', 'october', 'yourself', 'followed', 'specified', 'even', 'for', 'nor', 'shall', 'rd', 'whence', 'somebody', 'howbeit', 'f', 'news', 'down', 'july', "let's", 'third', 'yours', 'fifteen', 'hadn', 'seeming', '3', 'bottom', 'v', 'saw', 'contains', 'immediate', 'now', 'trying', 'though', 'march', 'story', 'certainly', 'mon', "why's", 'tweet', 'placed', 'latterly', 'monday', 'try', 'haven', 'made', 'changes', 'those', 'latter', 'enough', 'noone', 'together', 'viz', 'someone', 'september', "where's", 'onto', 'make', 'were', 'elsewhere', 'do', 'thorough', 'overall', "he'd", 'thereupon', 'non', 'gets', 'containing', 'he', 'most', 'downwards', 'kept', 'everybody', "shan't", 'towards', 'happens', 'cant', 'already', 'how', 'un', 'using', 'sure', 'nine', 'meanwhile', "didn't", 'great', 'selves', 've', 'because', 'outside', 'some', 'there', 'four', 'amoungst', 'from', 'take', 'way', 'detail', 'throughout', 'moreover', 'anywhere', "i'll", 'among', 'oh', 'actually', 'isn', 'l', 'comes', 'six', 'wasn', 'an', 'ourselves', 'them', 'over', 'wish', "what's", 'only', 'keeps', 'being', 'upon', 'regardless', 'm', 'didn', 'd', 'several', 'else', "they'll", 'describe', 'novel', 'e', 'better', 'that', 'exactly', 'who', 'people', 'want', 'none', 'course', 'june', 'without', 'me', 'sensible', 'sa', 'nevertheless', 'very', 'unless', 'presumably', 'needn', 'about', 'let', 'somewhat', 'whenever', 'indicate', 'such', 'mill', 'shan', 'before', '2012', 'ok', 'during', 'yun', 'us', 'due', 'come', 'que', 'appreciate', 'fire', 'themselves', 'within', 'insofar', 'name', 'everyone', 'are', 'forth', 'at', 'ones', 'believe', 'brief', 'secondly', 'th', 'everything', 'also', 'thanx', 'next', 'if', 'away', 'somehow', 'furthermore', 'seven', 'mostly', 'help', "it'll", "doesn't", 'took', 'perhaps', 'neither', 'what', "there's", "t's", 'less', 'apart', 'hereby', 'as', 'they', 'thereby', "needn't", 'should', 'other', 'near', 'went', 'hither', 'inasmuch', 'provides', 'cause', 'forty', 'de', "he'll", "wasn't", 'and', 'p', 'x', '9', 'anybody', "it'd", 'yahoo', 'corresponding', 'around', 'one', 'truly', 'hasn', 'formerly', 'out', 'hello', "mightn't", 'off', 'three', 'twelve', 'ought', 'she', 'which', 'theres', 'won', 'thoroughly', 'two', 'whither', 'causes', '8', 'became', 'call', 'u', 'mustn', 'any', 'h', 'need', 'becoming', 'homepage', 'fifty', "a's", 'almost', 'or', 'known', 'really', 'taken', 'edu', 'likely', 'where', 'we', 'have', "mustn't", 'given', 'ignored', 'nearly', 'uses', 'show', "she'll", 'ko', 'hence', "can't", 'unfortunately', 'november', 'respectively', 'j', 'r', "ain't", 'relatively', 'probably', 'et', 'theirs', "she's", 'fill', 'august', "won't", 'these', "c'mon", 'sunday', 'through', 'him', 'etc', 'regards', "who's", 'whom', 'thru', 'com', 'appropriate', 'knows', 'know', 'seeing', 'goes', 'below', "they'd", 'whereupon', 'na', 'con', "you'll", 'aside', 'old', '4', 'twice', 'across', 'give', 'obviously', 'its', '2013', 'therein', '7', 'ng', 'whatever', 'like', 'to', 'their'])

  4. Yes, but that is wrong, as these are clearly four sentences.
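For reference, this is roughly how I plug that set in; a sketch against Sumy's public API (the stop_words setter is the one from the README example, and TextRankSummarizer is just my choice here):

from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.text_rank import TextRankSummarizer

document_text = "..."  # whatever text is being summarized
parser = PlaintextParser.from_string(document_text, Tokenizer("english"))
summarizer = TextRankSummarizer()
summarizer.stop_words = stopWords  # the frozenset above
for sentence in summarizer(parser.document, 3):
    print(sentence)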

Thanks!

1 - It's not completely true. Sumy uses nltk.word_tokenize, and the regex is used only to filter some words out. You are right that it probably should not filter words containing - or ', but your version removes NLTK completely and relies only on the regex, and I am not sure that is OK for me, especially when it's not hard to implement and use a custom tokenizer with Sumy. Anyway, thanks for the explanation of why you decided to go with the more complicated regex :)
3 - Yep, you can use these or any others. That's why I left Sumy open for custom components.
4 - Yes, it is, as far as I can see. Unfortunately, NLTK couldn't detect it. If you have a better implementation of a Python sentence tokenizer, I will be happy to test it and replace NLTK in Sumy with it 👍
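For the record, a custom tokenizer really is small. A sketch, assuming the to_sentences/to_words method names from sumy.nlp.tokenizers.Tokenizer (whether plain duck typing is enough may depend on the Sumy version):

import re
from nltk.tokenize import sent_tokenize

class NewlineAwareTokenizer:
    """Splits on newlines first so headline-style lines are not merged."""
    _WORDS = re.compile(r"\w+(?:['-]\w+)*")

    def to_sentences(self, paragraph):
        return tuple(s for line in paragraph.splitlines()
                     for s in sent_tokenize(line))

    def to_words(self, sentence):
        return tuple(self._WORDS.findall(sentence))

# usage, e.g.:
# parser = PlaintextParser.from_string(text, NewlineAwareTokenizer())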