thucnt / sparkLearning

spark-LDA-example

A simple Spark LDA example. This project contains a basic document clustering example that also covers data cleaning.

The document clustering pipeline consists of the following steps:

Spark RegexTokenizer: for tokenization
Stanford NLP Morphology: for stemming and lemmatization
Spark StopWordsRemover: for removing stop words and punctuation
Spark TF-IDF: for computing term frequencies or TF-IDF weights
Spark LDA: for clustering the documents
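
The Spark stages listed above can be chained in a single `spark.ml` Pipeline. The following is a minimal sketch, not the project's actual code: the object name `LdaSketch`, the toy corpus, the column names, and the parameter values (`k = 2`, `maxIter = 10`, the `\W+` split pattern) are all assumptions made for illustration. The Stanford NLP lemmatization step is left out so that the sketch depends on Spark alone, and `CountVectorizer` followed by `IDF` is one possible way to produce the TF-IDF features.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.clustering.LDA
import org.apache.spark.ml.feature.{CountVectorizer, IDF, RegexTokenizer, StopWordsRemover}
import org.apache.spark.sql.{DataFrame, SparkSession}

object LdaSketch {

  // Builds the pipeline on a hypothetical toy corpus and returns the
  // input documents with an added "topicDistribution" column.
  def cluster(spark: SparkSession): DataFrame = {
    import spark.implicits._

    // Hypothetical documents; a real run would load a corpus instead.
    val docs = Seq(
      "Spark makes distributed data processing simple.",
      "Topic models such as LDA group documents by theme.",
      "Stop words and punctuation carry little topical signal."
    ).toDF("text")

    // Split on runs of non-word characters, which also drops punctuation.
    // Tokens are lowercased by default.
    val tokenizer = new RegexTokenizer()
      .setInputCol("text")
      .setOutputCol("tokens")
      .setPattern("\\W+")
      .setMinTokenLength(2)

    // Remove stop words using Spark's default English list.
    val remover = new StopWordsRemover()
      .setInputCol("tokens")
      .setOutputCol("filtered")

    // Term frequencies, then inverse-document-frequency weighting.
    val tf = new CountVectorizer()
      .setInputCol("filtered")
      .setOutputCol("tf")
    val idf = new IDF()
      .setInputCol("tf")
      .setOutputCol("features")

    // LDA with an assumed number of topics.
    val lda = new LDA()
      .setK(2)
      .setMaxIter(10)
      .setFeaturesCol("features")

    val pipeline = new Pipeline()
      .setStages(Array(tokenizer, remover, tf, idf, lda))

    pipeline.fit(docs).transform(docs)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("lda-sketch")
      .master("local[*]")
      .getOrCreate()

    cluster(spark).select("text", "topicDistribution").show(false)
    spark.stop()
  }
}
```

Each row of the result carries a `topicDistribution` vector with one weight per topic; assigning a document to its highest-weight topic is one simple way to read the output as a clustering.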
