diasks2 / pragmatic_tokenizer

A multilingual tokenizer to split a string into tokens

single quotes return different result based on language setting

maia opened this issue:

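A word surrounded by single quotes should not be tokenized differently in English and German. With language: :en the opening quote is split off as its own token, but with language: :de it stays attached to the word ('racist):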

> text = "Charlie Hebdo backlash over 'racist' Alan Kurdi cartoon - https://t.co/J8N2ylVV3w"
> PragmaticTokenizer::Tokenizer.new(text, language: :en).tokenize
=> ["charlie", "hebdo", "backlash", "over", "'", "racist", "'", "alan", "kurdi", "cartoon", "-", "https://t.co/j8n2ylvv3w"]
> PragmaticTokenizer::Tokenizer.new(text, language: :de).tokenize
=> ["charlie", "hebdo", "backlash", "over", "'racist", "'", "alan", "kurdi", "cartoon", "-", "https://t.co/j8n2ylvv3w"]
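
For reference, the language-independent behaviour expected here can be sketched without the library. This is only an illustration under stated assumptions: split_single_quotes and naive_tokenize are hypothetical helpers, not part of pragmatic_tokenizer, and they split on whitespace only, so they say nothing about how the gem works internally or how the fix was made.

```ruby
# Hypothetical helper (not part of pragmatic_tokenizer): detaches a leading
# and/or trailing single quote from a word, whatever the language setting.
def split_single_quotes(word)
  tokens = []
  core = word.dup

  # Split off an opening quote, e.g. "'racist'" -> "'" + "racist'".
  if core.length > 1 && core.start_with?("'")
    tokens << "'"
    core = core[1..-1]
  end

  # Remember a closing quote so it can be emitted after the word itself.
  trailing = nil
  if core.length > 1 && core.end_with?("'")
    trailing = "'"
    core = core[0...-1]
  end

  tokens << core unless core.empty?
  tokens << trailing if trailing
  tokens
end

# Hypothetical whitespace tokenizer used only to exercise the helper above.
def naive_tokenize(text)
  text.downcase.split.flat_map { |word| split_single_quotes(word) }
end

text = "Charlie Hebdo backlash over 'racist' Alan Kurdi cartoon - https://t.co/J8N2ylVV3w"
tokens = naive_tokenize(text)
puts tokens.inspect
```

Apostrophes inside a word (as in "that's") are left alone, since only quotes at the very start or end of a word are split off.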

Thanks, should be fixed now: 30e0fba