A fast NLP tokenizer that detects sentences, words, numbers, URLs, hostnames, emails, filenames, dates, and phone numbers. Tokenization also standardizes documents, stripping punctuation and removing duplicates.
git clone https://github.com/callforpapers-source/doc2term
cd doc2term
python setup.py install
The installation compiles the bundled C code, so gcc is required.
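After installing, a quick smoke test (a minimal sketch, assuming the package installed into the active Python environment) is to import the module and normalize a short string; the import fails if the C extension did not build:

import doc2term  # raises ImportError if the C extension failed to compile

# doc2term_str is the normalization entry point shown in the examples below.
print(doc2term.doc2term_str("Hello, world!"))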
Example notebook: doc2term
>>> import doc2term
>>> doc2term.doc2term_str("Actions speak louder than words. ... ")
"Actions speak louder than words ."
>>> doc2term.doc2term_str("You can't judge a book by its cover. ... from thoughtcatalog.com")
"You can't judge a book by its cover . from"
>>> doc2term.doc2term_str("You can't judge a book by its cover. ... from thoughtcatalog.com", include_hosts_files=1)
"You can't judge a book by its cover . from thoughtcatalog.com"