Predict-Next-Word

This is Coursera JHU Data Science Specialization capstone project, to build a Shiny App with Natural Processing Model (NLP) to predict what user next input word fast and accurately.

The dataset is from SwiftKey, a industry partner with JHU. The original text dataset can be downloaded here.

Models I used

N-Gram Tokenization. The usual n-gram consists of a sequence of n words, e.g. "I like to eat" is a Quandgram. This project used up to Pentagram (N=5).
Stupid-Backoff. One of the most used scheme for large language models. For more infomation about this model, click here.

Files/folders description

preprocessing.R: Loading,sampling text dataset. Data cleaning & profanity filtering. Building N-Grams.
global.R: Store all the functions and text analysis models to be used in the app.
ui.R/server.R: Shiny App file.
ngram.rds: Cleaned ngrams data from preprocessing.
Milestone2: A folder with a report on exploratory analysis of original text data.

ShinyApp Homepage

Links

Check out my Shiny App HERE
Link to my milestone report on exploratory analysis about whole dataset HERE.

About

Coursera Data Science Capstone Project. Built Shiny App to predict next word.

Languages

Language:R 100.0%