arbox / tokenizer

A simple tokenizer in Ruby for NLP tasks.

URLs are tokenized into individual characters

joshweir opened this issue

Tokenizer::WhitespaceTokenizer.new.tokenize "www.google.com"
=> ["www", ".", "g", "o", "o", "g", "l", "e", ".", "c", "o", "m"]

I want website URLs to be tokenized as a single token (effectively a single noun), so I would expect www.google.com to tokenize as "www.google.com". I am happy to fork this repo and would like to contribute a fix.
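
A minimal sketch of the expected behaviour, for illustration only. It does not use or modify this gem: `tokenize_keeping_urls` and `URL_PATTERN` are hypothetical names, and the URL pattern is a deliberately simple assumption (anything starting with `http://`, `https://` or `www.`), not a full URL grammar.

```ruby
# Hypothetical helper, NOT part of the tokenizer gem.
# Assumption: a "URL" is any whitespace-delimited chunk that starts
# with http://, https:// or www. (a simplification, not a full URL grammar).
URL_PATTERN = %r{\A(?:https?://|www\.)\S+\z}i

def tokenize_keeping_urls(text)
  # Split on whitespace first, then decide per chunk.
  text.split.flat_map do |chunk|
    if chunk.match?(URL_PATTERN)
      # Keep the whole URL as one token.
      [chunk]
    else
      # Otherwise separate word characters from punctuation.
      chunk.scan(/\w+|[^\w\s]/)
    end
  end
end

p tokenize_keeping_urls("www.google.com")
p tokenize_keeping_urls("Visit www.google.com now!")
```

One limitation of this sketch: punctuation attached to the end of a URL (for example a sentence-final period) stays part of the URL token.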

Thanks for the repo, by the way; it's useful.

@joshweir Thank you for reporting. I'll review this next week.