N-Gram-Language-Model

Includes:

  • Index words
  • Store ngrams in a Trie data structure (see the sketch after this list)
  • Efficiently extract ngrams and their frequencies
  • Compute out-of-vocabulary (OOV) rate
  • Compute ngram probabilities using absolute discounting with interpolation smoothing
  • Compute Perplexity
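
As a rough illustration of the trie idea mentioned above (the class and method names here are hypothetical, not the repository's actual API), each trie node maps a word to a child node and stores the count of the ngram ending at that node:

```python
class TrieNode:
    """Hypothetical sketch of an ngram-count trie node (not the repo's actual class)."""

    def __init__(self):
        self.children = {}  # word -> TrieNode
        self.count = 0      # frequency of the ngram (or prefix) ending at this node

    def add_ngram(self, ngram):
        """Insert one ngram (a sequence of words), incrementing counts along the path."""
        node = self
        for word in ngram:
            node = node.children.setdefault(word, TrieNode())
            node.count += 1

    def get_count(self, ngram):
        """Return the stored frequency of an ngram, or 0 if it was never inserted."""
        node = self
        for word in ngram:
            if word not in node.children:
                return 0
            node = node.children[word]
        return node.count
```

Walking from the root along the words of an ngram yields its frequency, and the children of a node enumerate all observed continuations of that history.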

Introduction

A statistical language model is a probabilistic model that assigns a probability to a sequence of words. Given a history context represented by the preceding words, it can predict the next word in the sequence.

The probability that we want to model can be factorized using the chain rule as follows:

$$p(w_1, \ldots, w_N) = \prod_{n=1}^{N} p(w_n \mid w_0, \ldots, w_{n-1})$$

where $w_0 = \langle s \rangle$ is a special token that denotes the start of the sentence.

In practice, we usually use what are called N-Gram models, which make a Markov assumption to limit the history context to the most recent words. Examples of N-Grams are:

  • Unigram: $p(w_n)$
  • Bigram: $p(w_n \mid w_{n-1})$
  • Trigram: $p(w_n \mid w_{n-2}, w_{n-1})$

Training

Using the maximum likelihood criterion, these probabilities can be estimated from counts. For example, for the bigram model,

$$p(w_n \mid w_{n-1}) = \frac{N(w_{n-1}, w_n)}{\sum_{w} N(w_{n-1}, w)} = \frac{N(w_{n-1}, w_n)}{N(w_{n-1})}$$

where $N(\cdot)$ denotes the frequency of an ngram in the training corpus.
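
As a hedged sketch of this estimate (the function below is illustrative, not the repository's code), bigram probabilities can be obtained by counting and dividing:

```python
from collections import Counter

def mle_bigram_probs(sentences):
    """Estimate p(w_n | w_{n-1}) as relative frequencies (maximum likelihood).
    `sentences` is an iterable of token lists; '<s>' marks the sentence start."""
    history_counts = Counter()
    bigram_counts = Counter()
    for tokens in sentences:
        tokens = ["<s>"] + tokens
        history_counts.update(tokens[:-1])               # N(w_{n-1})
        bigram_counts.update(zip(tokens, tokens[1:]))    # N(w_{n-1}, w_n)
    return {(h, w): c / history_counts[h] for (h, w), c in bigram_counts.items()}

# Tiny example: every bigram occurs once here, so each conditional probability is 1.0
probs = mle_bigram_probs([["the", "cat", "sat"]])
```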

However, this is problematic for unseen data: the counts are 0, so the estimated probability becomes 0 (or undefined when the history itself was never observed). To solve this problem, we use smoothing techniques. There are different smoothing techniques, and the one used here is absolute discounting with interpolation.
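
A minimal sketch of absolute discounting with interpolation for the bigram case, assuming hypothetical count dictionaries and an MLE unigram distribution as the lower-order model (the repository's implementation may differ in details such as the choice of discount and lower-order distribution):

```python
def discounted_bigram_prob(h, w, bigram_counts, unigram_counts, total_words, d=0.7):
    """p(w | h) with absolute discounting, interpolated with the unigram model.
    The probability mass removed by the discount d is redistributed via the
    unigram probability of w."""
    unigram_prob = unigram_counts.get(w, 0) / total_words
    history_count = unigram_counts.get(h, 0)
    if history_count == 0:
        return unigram_prob                               # unseen history: back off fully
    discounted = max(bigram_counts.get((h, w), 0) - d, 0) / history_count
    n_continuations = sum(1 for (hist, _) in bigram_counts if hist == h)
    backoff_weight = d * n_continuations / history_count  # mass freed by discounting
    return discounted + backoff_weight * unigram_prob
```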

Perplexity

To measure the performance of a language model, we compute the perplexity of the test corpus using the trained m-grams:

$$PP = \left[ \prod_{n=1}^{N} p(w_n \mid w_{n-m+1}, \ldots, w_{n-1}) \right]^{-1/N}$$
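
As a hedged sketch (the function name and the `bigram_prob` callback are assumptions, not the repository's interface), perplexity follows directly from the per-word probabilities of the test corpus:

```python
import math

def perplexity(test_sentences, bigram_prob):
    """PP = exp(-(1/N) * sum_n log p(w_n | w_{n-1})) over all N test words."""
    log_prob_sum = 0.0
    num_words = 0
    for tokens in test_sentences:
        prev = "<s>"
        for word in tokens:
            log_prob_sum += math.log(bigram_prob(prev, word))  # assumes p > 0 (smoothed)
            prev = word
            num_words += 1
    return math.exp(-log_prob_sum / num_words)
```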

Results

The model was tested on the Europarl dataset (see the `data` directory):

Test PP with bigrams = 130.09

Test PP with trigrams = 94.82

About

Language modeling based on ngram models and smoothing techniques
