anandkn1884 / dataCleanup

Data cleanup code written in Python

dataCleanup

Data cleanup code written in Python that uses the Natural Language Toolkit (NLTK) and other libraries, together with custom logic, for data preprocessing. The code performs the following steps:

  1. Read files from a given path one by one.
  2. Read lines from each file.
  3. Substitute whitespace in each line.
  4. Substitute whitespace before apostrophes.
  5. Substitute contractions in each line.
  6. Tokenize the sentences in each line.
  7. Separate numbers and letters, e.g. 60m becomes 60 m.
  8. Convert words to lower case.
  9. Remove stop words.
  10. Convert all numerals to their word equivalents, e.g. 65 becomes sixty five.
  11. Stem the words using the Porter stemmer.
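
The per-line steps above can be illustrated with a minimal sketch. It is not the repository's actual code: the names `clean_line` and `number_to_words`, the tiny `CONTRACTIONS` and `STOP_WORDS` tables, and the 0 to 999 limit on numeral conversion are all assumptions made for illustration. To keep the sketch runnable with the standard library alone, file reading, sentence tokenization, and Porter stemming are left out, and a simple regular expression stands in for NLTK's tokenizer.

```python
import re

# Hypothetical, deliberately tiny lookup tables. The repository's real
# contraction and stop-word lists are not shown in this README.
CONTRACTIONS = {"can't": "cannot", "won't": "will not",
                "it's": "it is", "don't": "do not"}
STOP_WORDS = {"a", "an", "the", "is", "it", "to", "of", "and"}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def number_to_words(n):
    """Spell out an integer in 0..999 (an assumed limit for this sketch)."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + (" " + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words + (" " + number_to_words(rest) if rest else "")


def clean_line(line):
    """Apply the whitespace, contraction, case, stop-word and numeral steps."""
    # Collapse runs of whitespace, then drop whitespace before apostrophes.
    line = re.sub(r"\s+", " ", line).strip()
    line = re.sub(r"\s+'", "'", line)
    # Expand contractions, ignoring case.
    for short, full in CONTRACTIONS.items():
        line = re.sub(re.escape(short), full, line, flags=re.IGNORECASE)
    # Separate numbers and letters: "60m" becomes "60 m".
    line = re.sub(r"(\d)([A-Za-z])", r"\1 \2", line)
    line = re.sub(r"([A-Za-z])(\d)", r"\1 \2", line)
    # Lower-case, then split into word and number tokens (regex stand-in
    # for an NLTK tokenizer; punctuation is discarded).
    tokens = re.findall(r"[a-z]+|\d+", line.lower())
    # Remove stop words.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Convert numerals to words; numbers above 999 are left unchanged.
    result = []
    for token in tokens:
        if token.isdigit() and int(token) < 1000:
            result.extend(number_to_words(int(token)).split())
        else:
            result.append(token)
    return result


print(clean_line("It 's   60m to the shop"))  # ['sixty', 'm', 'shop']
```

In a fuller version, NLTK's `sent_tokenize`, its `stopwords` corpus, and `PorterStemmer` would likely replace the regular-expression tokenizer and the hand-written stop-word set, with a stemming pass applied to the final tokens.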

Languages

Python: 100.0%