Single quotes are tokenized differently depending on the language setting
maia opened this issue
maia commented
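A word surrounded by single quotes should be tokenized the same way in English and German. Currently the :de setting leaves the opening quote attached to the word ('racist), while :en splits it off, see: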
> text = "Charlie Hebdo backlash over 'racist' Alan Kurdi cartoon - https://t.co/J8N2ylVV3w"
> PragmaticTokenizer::Tokenizer.new(text, language: :en).tokenize
=> ["charlie", "hebdo", "backlash", "over", "'", "racist", "'", "alan", "kurdi", "cartoon", "-", "https://t.co/j8n2ylvv3w"]
> PragmaticTokenizer::Tokenizer.new(text, language: :de).tokenize
=> ["charlie", "hebdo", "backlash", "over", "'racist", "'", "alan", "kurdi", "cartoon", "-", "https://t.co/j8n2ylvv3w"]
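To make the expected behaviour concrete, here is a minimal sketch of a language-independent quote split applied to already-tokenized output. The helper name split_single_quotes is hypothetical and not part of PragmaticTokenizer, and this is not necessarily how the library handles quotes internally. It only illustrates that the :de result should normalize to the :en result.

```ruby
# Hypothetical helper, NOT part of PragmaticTokenizer: splits a leading or
# trailing single quote off each token so that quoted words come out the
# same regardless of the language setting.
def split_single_quotes(tokens)
  tokens.flat_map do |token|
    # A bare quote (or any one-character token) is already a separate token.
    next [token] if token.length < 2

    parts = []
    if token.start_with?("'")
      parts << "'"
      token = token[1..-1]
    end
    if token.length > 1 && token.end_with?("'")
      parts << token[0..-2] << "'"
    else
      parts << token
    end
    parts
  end
end

# Excerpts of the two results reported above.
de_tokens = ["over", "'racist", "'", "alan"]
en_tokens = ["over", "'", "racist", "'", "alan"]

normalized = split_single_quotes(de_tokens)
puts normalized == en_tokens
```

Note that a split this naive would also separate genuine apostrophes at word boundaries (for example "'tis"), so a real fix probably needs more context than this sketch uses.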
Kevin Dias commented
Thanks, should be fixed now: 30e0fba