
ngrams-sandbox

A playground to fiddle with n-gram based language models for predictive text analysis.

Problem Statement

Using the chain rule of probability, we can expand the probability of a word sequence into a product of conditional probabilities:

P(w1 w2 ... wn) = P(wn | w1 ... wn-1) * ... * P(w2 | w1) * P(w1)

For example:

P(Humpty Dumpty sat) = P(Humpty Dumpty sat | Humpty Dumpty) * P(Humpty Dumpty | Humpty) * P(Humpty)

(Here P(Humpty Dumpty | Humpty) denotes the probability of the phrase "Humpty Dumpty" given that it starts with "Humpty", i.e., the probability that "Dumpty" follows "Humpty".)

To compute each term on the right-hand side, we use counts from the sample corpus of text given below.

For the first term (rightmost):

P(Humpty) = count(Humpty)/total_word_count

Notice that for the unigram, we divide by the total number of words in the corpus to derive the probability.

For the second term:
P(Humpty Dumpty | Humpty) = count(Humpty Dumpty) / count(Humpty)

Likewise, for the third term:
P(Humpty Dumpty sat | Humpty Dumpty) = count(Humpty Dumpty sat) / count(Humpty Dumpty)
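
For concreteness, applying this to the sample corpus given below (26 words in total, assuming punctuation is stripped and matching is case-sensitive):

P(Humpty) = 3/26
P(Humpty Dumpty | Humpty) = 2/3
P(Humpty Dumpty sat | Humpty Dumpty) = 1/2

Multiplying the three terms gives P(Humpty Dumpty sat) = (1/2) * (2/3) * (3/26) = 1/26 ≈ 0.038.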


NOTE:

  • ProbabilityComputer is a general, non-optimized implementation that uses the chain rule of probability only. To optimize further, we would combine the Markov assumption with a specific n-gram model, e.g., a bigram or trigram model.

  • MemoizedMarkovProbabilityModel is a more optimized implementation that uses memoization, the Markov assumption, maximum-likelihood estimation, and special-symbol padding.


Using the Markov assumption, this can usually be simplified as follows:

  • P(Humpty Dumpty sat | Humpty Dumpty) = C(Humpty Dumpty sat) / C(Humpty Dumpty), n=3
  • P(<seq> Humpty Dumpty sat on a wall </seq>) = P(Humpty | <seq>) * P(Dumpty | Humpty) * ... * P(</seq> | wall), n=2

where C is the count function, P is the probability function, and <seq> and </seq> are special padding symbols marking the start and end of a sequence.

For more details on the Markov assumption please refer to the bibliography section below.
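
To make these ideas concrete, here is a minimal bigram (n=2) sketch that pads the text with the special symbols <seq> and </seq>, applies the Markov assumption, and scores with maximum-likelihood estimates. It is a hypothetical illustration in the spirit of MemoizedMarkovProbabilityModel, not the library's actual class or API:

import java.util.HashMap;
import java.util.Map;

// Minimal bigram (n = 2) sketch: padding + Markov assumption + maximum-likelihood
// estimates. A hypothetical illustration, not the library's actual class.
class BigramSketch {
  private final Map<String, Integer> unigramCounts = new HashMap<>();
  private final Map<String, Integer> bigramCounts = new HashMap<>();

  void train(String corpus) {
    // Pad with special symbols so sentence boundaries are modeled too.
    String[] words = ("<seq> " + corpus + " </seq>").split("\\s+");
    for (int i = 0; i < words.length; i++) {
      unigramCounts.merge(words[i], 1, Integer::sum);
      if (i + 1 < words.length) {
        bigramCounts.merge(words[i] + " " + words[i + 1], 1, Integer::sum);
      }
    }
  }

  // Maximum-likelihood estimate: P(next | prev) = C(prev next) / C(prev).
  double conditional(String prev, String next) {
    int prevCount = unigramCounts.getOrDefault(prev, 0);
    if (prevCount == 0) return 0.0;
    return (double) bigramCounts.getOrDefault(prev + " " + next, 0) / prevCount;
  }

  // Markov assumption: the sentence probability is a product of bigram terms only.
  double sentenceProbability(String sentence) {
    String[] words = ("<seq> " + sentence + " </seq>").split("\\s+");
    double probability = 1.0;
    for (int i = 0; i + 1 < words.length; i++) {
      probability *= conditional(words[i], words[i + 1]);
    }
    return probability;
  }

  public static void main(String[] args) {
    var model = new BigramSketch();
    model.train("Humpty Dumpty sat on a wall");
    System.out.println(model.sentenceProbability("Humpty Dumpty sat on a wall")); // 1.0
  }
}

Training on the single sentence "Humpty Dumpty sat on a wall" and scoring that same sentence yields 1.0, since each padded bigram and each preceding word occur exactly once in the training text.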

Sample Corpus

Humpty Dumpty sat on a wall,
Humpty Dumpty had a great fall.
All the king's horses and all the king's men
Couldn't put Humpty together again.

Given the above corpus, for example, the 3-grams would be as follows (a sketch for generating them appears after the list):

  • Humpty Dumpty sat
  • Dumpty sat on
  • sat on a
  • on a wall
  • a wall Humpty
  • wall Humpty Dumpty
  • Humpty Dumpty had
  • Dumpty had a
  • had a great
  • a great fall
  • great fall All
  • fall All the
  • All the king's
  • the king's horses
  • king's horses and
  • horses and all
  • and all the
  • all the king's
  • the king's men
  • king's men Couldn't
  • men Couldn't put
  • Couldn't put Humpty
  • put Humpty together
  • Humpty together again
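
A minimal way to generate such n-grams is to slide a window of size n over the tokenized text. The helper below is a hypothetical illustration (tokenization is simplified to whitespace splitting, so punctuation handling is left out), not part of the library's API:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class NgramSketch {
  // Slide a window of size n over the whitespace-tokenized text.
  static List<String> ngrams(String text, int n) {
    String[] words = text.split("\\s+");
    List<String> result = new ArrayList<>();
    for (int i = 0; i + n <= words.length; i++) {
      result.add(String.join(" ", Arrays.copyOfRange(words, i, i + n)));
    }
    return result;
  }

  public static void main(String[] args) {
    // Prints: Humpty Dumpty sat / Dumpty sat on / sat on a / on a wall
    ngrams("Humpty Dumpty sat on a wall", 3).forEach(System.out::println);
  }
}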

User Guide

If you wish to use the classes provided as a library, consider the following code snippet:

class SomeClass {
  public static void main(String[] args) {
    var n = 2; // model order; 2 means bigrams
    var model = new MemoizedMarkovProbabilityModel(n); // create the model
    var corpus = "Humpty Dumpty sat on a wall"; // training corpus
    model.train(corpus); // train the model on the corpus
    var probability = model.computeProbability("Humpty Dumpty");
    System.out.println(probability); // probability of the phrase under the model
    var nextWord = model.predictNextWord("Humpty Dumpty");
    System.out.println(nextWord); // most likely word to follow the phrase
  }
}

Bibliography
