alejandronotario / LDA-Topic-Modeling

LDA-Topic-Modeling



Overview

This repository contains code for LDA (Latent Dirichlet Allocation) topic modeling. LDA usually requires a lot of memory and can be quite slow in Python, so it helps to know a couple of ways to run it faster when datasets are very large. Here, that means using Apache Spark with its Python API.

The work was done on Dataproc in Google Cloud, using a cluster configuration that supports working with Jupyter Notebooks.

The dataset comes from Kaggle and contains over a million news headlines.

Prerequisites

  • A Google Cloud Account

  • Python 3

  • PySpark 2.2.1

  • Gensim Mallet module

Resources

Python LDA Model

This notebook is based on the Gensim toolkit.
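As a rough illustration of what a Gensim-based run can look like, here is a minimal sketch. It is not the notebook's code: it uses Gensim's built-in `LdaModel` rather than the Mallet wrapper listed in the prerequisites, and the helper names (`tokenize`, `fit_gensim_lda`), the tiny stop-word list, and the default parameter values are assumptions made for the example.

```python
import re

# Tiny illustrative stop-word list (an assumption); a real run would use a fuller one.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "for", "to", "and", "over"}


def tokenize(headline):
    """Lowercase a headline, keep alphabetic tokens, and drop stop words."""
    words = re.findall(r"[a-z]+", headline.lower())
    return [w for w in words if w not in STOP_WORDS]


def fit_gensim_lda(headlines, num_topics=2, passes=10):
    """Fit a Gensim LdaModel on raw headline strings (hypothetical helper)."""
    # Imported here so tokenize() stays usable without Gensim installed.
    from gensim.corpora import Dictionary
    from gensim.models import LdaModel

    texts = [tokenize(h) for h in headlines]
    dictionary = Dictionary(texts)                    # maps each token to an integer id
    corpus = [dictionary.doc2bow(t) for t in texts]   # bag-of-words vector per headline
    model = LdaModel(corpus=corpus, id2word=dictionary,
                     num_topics=num_topics, passes=passes, random_state=0)
    return model, dictionary, corpus
```

Given a list of headline strings, `model, dictionary, corpus = fit_gensim_lda(headlines, num_topics=10)` followed by `model.print_topics(num_words=5)` would list the top words per topic. Everything here runs in the memory of a single machine, which is the limitation that motivates the Spark version.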

pySpark LDA Model

There are two notebooks, each taking a quite different approach.
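One way to run LDA with the Spark Python API is the `pyspark.ml` pipeline sketched below. This is only an illustration under stated assumptions, not a copy of either notebook: the column name `headline_text`, the helper names, and the default values for `k`, `vocab_size`, and `max_iter` are all hypothetical.

```python
def build_lda_pipeline(text_col="headline_text", k=10, vocab_size=5000, max_iter=20):
    """Return an unfitted Pipeline: tokenize -> remove stop words -> count -> LDA.

    Call this only after a SparkSession exists, since the stages need one.
    """
    # Imported here so the pure-Python helper below works without PySpark installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.clustering import LDA
    from pyspark.ml.feature import CountVectorizer, StopWordsRemover, Tokenizer

    stages = [
        Tokenizer(inputCol=text_col, outputCol="tokens"),
        StopWordsRemover(inputCol="tokens", outputCol="filtered"),
        CountVectorizer(inputCol="filtered", outputCol="features", vocabSize=vocab_size),
        LDA(k=k, maxIter=max_iter, seed=1, featuresCol="features"),
    ]
    return Pipeline(stages=stages)


def terms_for_topic(vocabulary, term_indices):
    """Map the termIndices of one describeTopics() row to vocabulary words."""
    return [vocabulary[i] for i in term_indices]


def topic_words(pipeline_model, max_terms=5):
    """Return {topic id: top words} from a fitted pipeline built above."""
    cv_model = pipeline_model.stages[2]    # fitted CountVectorizerModel
    lda_model = pipeline_model.stages[3]   # fitted LDAModel
    rows = lda_model.describeTopics(maxTermsPerTopic=max_terms).collect()
    return {row["topic"]: terms_for_topic(cv_model.vocabulary, row["termIndices"])
            for row in rows}
```

With an active `SparkSession` and a DataFrame `df` holding the headlines in a string column, `fitted = build_lda_pipeline(k=10).fit(df)` followed by `topic_words(fitted)` would give the top words of each topic. Because tokenizing, counting, and LDA training all run on the cluster's executors, the full corpus never has to fit in the memory of one Python process.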

About

License: MIT License

