anandkn1884 / dataCleanup

Data cleanup code written in Python

dataCleanup

Data cleanup code written in Python that uses the Natural Language Toolkit (NLTK) and various other libraries and logic for data preprocessing. The steps are:

Read files from a given path one by one.
Read lines from each file.
Normalize white space in each line.
Remove white space before apostrophes.
Expand contractions in each line.
Tokenize the sentences in each line.
Separate numbers and letters, e.g. 60m to 60 m.
Convert words to lower case.
Remove stop words.
Convert all numerals to their word equivalents, e.g. 65 to Sixty Five.
Stem the words using Porter Stemmer.
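
The steps above can be sketched roughly as follows. This is a minimal, standard-library-only illustration and not the repository's actual code: the names `iter_lines`, `clean_line`, `number_to_words`, `CONTRACTIONS` and `STOP_WORDS` are hypothetical, the contraction and stop-word tables are tiny placeholders, the regex tokenizer is a simplified stand-in for NLTK tokenization, and UTF-8 input is assumed.

```python
import re
from pathlib import Path

# Hypothetical, deliberately tiny tables; the repository's own lists are not shown.
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "it's": "it is", "don't": "do not"}
STOP_WORDS = {"a", "an", "the", "is", "it", "of", "to", "and", "in"}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n):
    """Spell out an integer in 0..999 (the range limit is an assumption of this sketch)."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + (" " + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    return ONES[hundreds] + " hundred" + (" " + number_to_words(rest) if rest else "")


def iter_lines(directory):
    """Yield lines from each regular file in `directory`, one file at a time."""
    for path in sorted(Path(directory).iterdir()):
        if path.is_file():
            with path.open(encoding="utf-8") as handle:  # encoding is assumed
                yield from handle


def clean_line(line, stem=None):
    """Return the cleaned tokens of one line; `stem` is an optional word -> word callable."""
    # Collapse runs of white space, then drop white space before apostrophes.
    line = re.sub(r"\s+", " ", line).strip()
    line = re.sub(r"\s+'", "'", line)
    # Expand contractions, ignoring case.
    for short, full in CONTRACTIONS.items():
        line = re.sub(r"\b" + re.escape(short) + r"\b", full, line, flags=re.IGNORECASE)
    # Tokenize (simplified: runs of word characters; punctuation is dropped).
    tokens = re.findall(r"\w+", line)
    # Separate numbers and letters inside each token: "60m" -> "60", "m".
    tokens = [part for tok in tokens for part in re.findall(r"[A-Za-z]+|\d+", tok)]
    # Lower-case, then remove stop words.
    tokens = [tok.lower() for tok in tokens]
    tokens = [tok for tok in tokens if tok not in STOP_WORDS]
    # Convert numerals to words; numbers of 1000 or more are left as digits here.
    words = []
    for tok in tokens:
        if tok.isdigit() and int(tok) < 1000:
            words.extend(number_to_words(int(tok)).split())
        else:
            words.append(tok)
    # Stem only if a stemmer was supplied.
    return [stem(word) for word in words] if stem else words
```

To use NLTK for the final step, `nltk.stem.PorterStemmer().stem` can be passed as the `stem` argument. Likewise, `nltk.corpus.stopwords.words("english")` could replace the placeholder stop-word set once the corpus has been downloaded. Note that this sketch emits lower-case number words rather than the capitalized form shown in the list above.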
