This project contains examples from the LinkedIn course titled NLP with Python for Machine Learning Essential Training by Derek Jedamski.
It assumes you already have Python installed. These instructions are taken directly from http://www.nltk.org/install.html.
From the terminal:
sudo pip3 install -U nltk
From the terminal:
sudo pip3 install pandas
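To confirm that both installs worked, a quick check like the one below may help. It is only a sketch: `check_packages` is a hypothetical helper name, not something from the course, and it relies solely on the standard library to see whether each package can be located.

```python
import importlib.util


def check_packages(names=("nltk", "pandas")):
    """Map each package name to True if Python can locate it, else False."""
    # find_spec() returns None when a top-level package cannot be found.
    return {name: importlib.util.find_spec(name) is not None for name in names}


# Report the status of the two packages installed above.
for name, found in check_packages().items():
    print(f"{name}: {'installed' if found else 'missing'}")
```

If either package shows up as missing, rerunning the matching pip3 command above is the first thing to try.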
Natural language processing is a field concerned with the ability of a computer to understand, analyze, manipulate, and potentially generate human language.
By human language, we're simply referring to any language used for everyday communication. This can be English, Spanish, French, anything like that. Python doesn't naturally know what any given word means. All it sees is a string of characters.
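A small illustration of that last point, using a made-up example sentence: to Python the text is just a sequence of characters, and two strings that a person would read as the same word are different unless every character matches.

```python
sentence = "I love NLP"

# Python only sees the individual characters, spaces included.
characters = list(sentence)
print(characters)
# ['I', ' ', 'l', 'o', 'v', 'e', ' ', 'N', 'L', 'P']

# There is no notion of meaning: a change in case is enough
# to make two "identical" words compare as different.
same_word = "NLP" == "nlp"
print(same_word)  # False
```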
The Natural Language Toolkit is the most widely used package for handling natural language processing tasks in Python. Usually called NLTK for short, it is a suite of open-source tools originally created in 2001 at the University of Pennsylvania to make building NLP processes in Python easier. The package has been expanded in the years since through the extensive contributions of open-source users.
Takeaways
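- Useful methods for tokenizing
  - findall() - searches for the actual words while ignoring the things that separate them
  - split() - searches for the characters that split the words while ignoring the actual words themselves
- Useful regexes for tokenizing
  - \W & \w - words (\w matches a word character, \W matches anything that is not one)
  - \S & \s - whitespaces (\s matches a whitespace character, \S matches anything that is not one)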
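The takeaways above can be sketched with Python's built-in re module. The sample sentence is made up for illustration, and the patterns shown are only one reasonable way to combine these methods and regexes.

```python
import re

text = "I am learning NLP, and it's fun!"

# findall() with \w+ looks for the words themselves.
words_found = re.findall(r"\w+", text)
print(words_found)
# ['I', 'am', 'learning', 'NLP', 'and', 'it', 's', 'fun']

# split() with \W+ looks for the separators and returns what lies between them.
# The trailing "!" is a separator too, so the result ends with an empty string.
words_split = re.split(r"\W+", text)
print(words_split)
# ['I', 'am', 'learning', 'NLP', 'and', 'it', 's', 'fun', '']

# \S+ and \s+ treat whitespace as the only separator, so punctuation
# stays attached to the neighboring word.
tokens_found = re.findall(r"\S+", text)
tokens_split = re.split(r"\s+", text)
print(tokens_found)
# ['I', 'am', 'learning', 'NLP,', 'and', "it's", 'fun!']
```

Note how the \w and \W patterns break "it's" into two tokens, while the whitespace-based patterns keep it whole. Which behavior is preferable depends on the cleaning steps that follow.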