LokeshNanda / nlp_spark

Natural Language Processing with Spark's MLlib

# Natural Language Processing with Spark's ML

## Requires

  • Anaconda Python 3.4
    • NLTK
    • langid
    • findspark (only needed for a local Spark install)
  • Spark 1.6
    • Local install OK

# Example Description

  • How do you build a classifier that separates data science tweets from spam on Twitter?
  • How do you choose the right algorithm?
  • What do you need to get started?

## Use PySpark to preprocess text data

  • Language Classification
  • Stop Word Removal
  • Custom Twitter-Specific Cleanup
  • Part of Speech Tagging
  • Lemmatization/Stemming of Text
  • General Cleanup
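
The Twitter-specific cleanup and stop word removal steps above can be sketched in plain Python. This is a minimal illustration, not the notebook's actual code: the function name `clean_tweet`, the regular expressions, and the tiny `STOP_WORDS` set are all assumptions (a real run would more likely use NLTK's stop word list). Language classification, part-of-speech tagging, and lemmatization are left out because they depend on langid and NLTK.

```python
import re

# Tiny illustrative stop-word set (an assumption); a full list such as
# NLTK's English stop words would normally be used instead.
STOP_WORDS = frozenset(
    {"a", "an", "and", "the", "is", "to", "of", "in", "for", "on", "with"}
)

URL_RE = re.compile(r"https?://\S+")   # links
MENTION_RE = re.compile(r"@\w+")       # @user mentions
RETWEET_RE = re.compile(r"^\s*rt\b")   # leading retweet marker
NON_ALPHA_RE = re.compile(r"[^a-z\s]")  # digits, punctuation, '#'


def clean_tweet(text):
    """Lowercase a tweet, strip Twitter-specific noise, drop stop words.

    Returns the remaining tokens as a list of strings.
    """
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = RETWEET_RE.sub(" ", text)
    # Hashtags keep their word; only the '#' character is removed here.
    text = NON_ALPHA_RE.sub(" ", text)
    return [token for token in text.split() if token not in STOP_WORDS]


if __name__ == "__main__":
    print(clean_tweet("RT @spark_user: Learning #DataScience with Spark is fun!"))
```

In a PySpark job, a function like this could be wrapped with `pyspark.sql.functions.udf` and applied to the tweet text column.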

## Converting text to numerical data with ML Pipelines

  • Tokenization
  • Term Frequency Hashing
  • Inverse Document Frequency
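
To make the three stages above concrete, here is a small pure-Python sketch of what term frequency hashing and inverse document frequency compute. It is not Spark's implementation: the helper names are hypothetical, `zlib.crc32` stands in for Spark's hash function, and the small `num_features` default is chosen for readability (Spark's `HashingTF` defaults to 2^18 buckets). The IDF weight uses the smoothed form log((m + 1) / (df + 1)) documented for Spark's `IDF`, where m is the number of documents.

```python
import math
import zlib
from collections import Counter


def bucket_of(token, num_features):
    """Hash a token into one of num_features buckets (the hashing trick)."""
    return zlib.crc32(token.encode("utf-8")) % num_features


def hashing_tf(tokens, num_features=16):
    """Return a sparse {bucket: term count} vector for one document."""
    counts = Counter(bucket_of(token, num_features) for token in tokens)
    return dict(counts)


def fit_idf(tf_vectors):
    """Return {bucket: log((n_docs + 1) / (doc_freq + 1))} over all documents."""
    n_docs = len(tf_vectors)
    doc_freq = Counter()
    for vector in tf_vectors:
        doc_freq.update(vector.keys())  # each document counts once per bucket
    return {b: math.log((n_docs + 1) / (df + 1)) for b, df in doc_freq.items()}


def tf_idf(tf_vector, idf):
    """Scale each term count by the IDF weight of its bucket."""
    return {b: tf * idf.get(b, 0.0) for b, tf in tf_vector.items()}


if __name__ == "__main__":
    docs = [["spark", "nlp", "spark"], ["spark", "spam"]]  # already tokenized
    tf_vectors = [hashing_tf(doc) for doc in docs]
    idf = fit_idf(tf_vectors)
    for vector in tf_vectors:
        print(tf_idf(vector, idf))
```

A token that appears in every document gets an IDF weight of zero, so it contributes nothing to the final feature vector. Distinct tokens can collide in the same bucket; a larger `num_features` makes that less likely.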

## Training & Testing a Model

  • Cross-validation with the ML Pipeline CrossValidator
  • Evaluation with an ML Pipeline Evaluator
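
The two steps above might be wired together roughly as follows. This is a hedged sketch rather than the notebook's code: the function name `build_cross_validator`, the column names `text` and `label`, the choice of `LogisticRegression`, and the grid values are all assumptions made for illustration. The classes themselves come from `pyspark.ml` and need an active SparkContext when the function is called.

```python
def build_cross_validator(num_folds=3):
    """Build a CrossValidator around a text-classification pipeline.

    Assumes the training DataFrame has a string column "text" and a
    numeric 0/1 column "label" (hypothetical column names).
    """
    # Imported inside the function so this definition loads even on a
    # machine without Spark; calling it requires pyspark and a SparkContext.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.feature import HashingTF, IDF, Tokenizer
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    tokenizer = Tokenizer(inputCol="text", outputCol="words")
    hashing_tf = HashingTF(inputCol="words", outputCol="tf")
    idf = IDF(inputCol="tf", outputCol="features")
    # LogisticRegression reads "features" and "label" by default.
    lr = LogisticRegression(maxIter=20)
    pipeline = Pipeline(stages=[tokenizer, hashing_tf, idf, lr])

    # Illustrative grid: every combination is trained on each fold.
    grid = (
        ParamGridBuilder()
        .addGrid(hashing_tf.numFeatures, [1 << 10, 1 << 14])
        .addGrid(lr.regParam, [0.01, 0.1])
        .build()
    )

    evaluator = BinaryClassificationEvaluator(metricName="areaUnderROC")
    return CrossValidator(
        estimator=pipeline,
        estimatorParamMaps=grid,
        evaluator=evaluator,
        numFolds=num_folds,
    )
```

Calling `build_cross_validator().fit(train_df)` would return a fitted model whose `transform(test_df)` output can be scored with the same kind of evaluator, where `train_df` and `test_df` are hypothetical DataFrames split from the labeled tweets.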

## Watch the Talk
