thucnt / sparkLearning

spark-LDA-example

A simple Spark LDA example. This project contains a basic document clustering example that also covers data cleaning.

Document clustering is performed in the following steps:

  1. Spark RegexTokenizer: tokenization

  2. Stanford NLP Morphology: stemming and lemmatization

  3. Spark StopWordsRemover: removing stop words and punctuation

  4. Spark TF-IDF: computing term frequencies or TF-IDF weights

  5. Spark LDA: clustering of documents
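
The Spark-only steps above (1, 3, 4 and 5) could be chained roughly as in the sketch below. It is not the project's actual code. The column names, the `\W+` split pattern, the minimum token length, the use of `CountVectorizer` for term counts, and the object name `LdaSketch` are all assumptions made for illustration. The Stanford NLP lemmatization step (2) is left out because it needs a custom transformer or UDF, which this sketch does not attempt.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.clustering.LDA
import org.apache.spark.ml.feature.{CountVectorizer, IDF, RegexTokenizer, StopWordsRemover}
import org.apache.spark.sql.SparkSession

// Hypothetical object name; not taken from the repository.
object LdaSketch {

  // Builds the Spark ML stages for steps 1, 3, 4 and 5.
  // All column names here are illustrative assumptions.
  def buildPipeline(numTopics: Int): Pipeline = {
    // Step 1: lowercase (the default) and split on runs of non-word characters,
    // which also discards punctuation.
    val tokenizer = new RegexTokenizer()
      .setInputCol("text")
      .setOutputCol("tokens")
      .setPattern("\\W+")
      .setMinTokenLength(3)

    // Step 3: drop stop words (Spark's default English list).
    val remover = new StopWordsRemover()
      .setInputCol("tokens")
      .setOutputCol("filtered")

    // Step 4a: term counts per document (assumed choice; HashingTF would also work).
    val tf = new CountVectorizer()
      .setInputCol("filtered")
      .setOutputCol("rawFeatures")

    // Step 4b: rescale the counts by inverse document frequency.
    val idf = new IDF()
      .setInputCol("rawFeatures")
      .setOutputCol("features")

    // Step 5: fit an LDA topic model on the weighted vectors.
    val lda = new LDA()
      .setK(numTopics)
      .setMaxIter(20)
      .setFeaturesCol("features")
      .setTopicDistributionCol("topicDistribution")

    new Pipeline().setStages(Array(tokenizer, remover, tf, idf, lda))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("lda-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Toy corpus, made up for this sketch.
    val docs = Seq(
      "Spark makes distributed data processing simple",
      "Topic models group documents by shared vocabulary",
      "Stop words carry little information about a topic"
    ).toDF("text")

    val model = buildPipeline(numTopics = 2).fit(docs)

    // Each row gets a vector of per-topic weights.
    model.transform(docs).select("text", "topicDistribution").show(truncate = false)

    spark.stop()
  }
}
```

Keeping every stage in a single `Pipeline` means the fitted vocabulary, IDF weights and topic model travel together, so new documents can be scored with one `transform` call.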
