Data cleanup code written in Python. It uses the Natural Language Toolkit (NLTK), along with other libraries and custom logic, to preprocess text data:
- Read files from a given path one by one.
- Read lines from each file.
- Substitute white spaces in each line.
- Substitute white spaces before apostrophes.
- Substitute contractions in each line.
- Tokenize the sentences in each line.
- Separate numbers from letters, e.g. "60m" becomes "60 m".
- Convert words to lower case.
- Remove Stop words.
- Convert all numerals to their word equivalents, e.g. "65" becomes "sixty five".
- Stem the words using Porter Stemmer.
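
The steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the names `clean_line`, `clean_files`, `number_to_words`, `CONTRACTIONS` and `STOP_WORDS` are hypothetical, the contraction and stop-word tables are deliberately tiny, and a plain regular expression stands in for NLTK's tokenizers so that the sketch runs without downloading NLTK data. Stemming is passed in as a callable, so NLTK's Porter stemmer can be plugged in where it is installed.

```python
import re
from pathlib import Path

# Hypothetical, deliberately small tables; the real lists are not shown in this README.
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "it's": "it is"}
STOP_WORDS = {"the", "a", "an", "is", "it", "and", "to", "of"}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n):
    """Spell out an integer in the range 0..999 (assumed limit for this sketch)."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("" if ones == 0 else " " + ONES[ones])
    hundreds, rest = divmod(n, 100)
    return ONES[hundreds] + " hundred" + ("" if rest == 0 else " " + number_to_words(rest))


def clean_line(line, stem=None):
    """Apply the listed cleanup steps to one line and return a list of tokens."""
    line = re.sub(r"\s+", " ", line).strip()           # collapse runs of whitespace
    line = re.sub(r"\s+'", "'", line)                  # drop spaces before apostrophes
    for short, full in CONTRACTIONS.items():           # expand contractions
        line = re.sub(r"\b" + re.escape(short) + r"\b", full, line, flags=re.IGNORECASE)
    line = re.sub(r"(\d)([A-Za-z])", r"\1 \2", line)   # "60m" -> "60 m"
    line = re.sub(r"([A-Za-z])(\d)", r"\1 \2", line)   # "m60" -> "m 60"
    line = line.lower()
    tokens = re.findall(r"[a-z]+|\d+", line)           # regex stand-in for a tokenizer
    tokens = [t for t in tokens if t not in STOP_WORDS]
    words = []
    for token in tokens:
        if token.isdigit() and int(token) < 1000:
            words.extend(number_to_words(int(token)).split())
        else:
            words.append(token)                        # larger numbers are left as digits
    return [stem(w) for w in words] if stem else words


def clean_files(directory, stem=None):
    """Read every file in a directory one by one and yield cleaned tokens per line."""
    for path in sorted(Path(directory).iterdir()):
        if path.is_file():
            with path.open(encoding="utf-8") as handle:
                for line in handle:
                    yield clean_line(line, stem)
```

With these toy tables, `clean_line("The  tower is 60m tall and it 's 65 years old")` returns `['tower', 'sixty', 'm', 'tall', 'sixty', 'five', 'years', 'old']`. To add the final stemming step, pass `PorterStemmer().stem` (from `nltk.stem`) as the `stem` argument. NLTK's `sent_tokenize`, `word_tokenize` and `stopwords.words("english")` are the usual replacements for the regex tokenizer and the small stop-word set, though they require the corresponding NLTK data packages to be downloaded first.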