
ngrams-sandbox

A playground to fiddle with n-gram based language models for predictive text analysis.

Problem Statement

Using the chain rule of probability, we can expand the probability of a word sequence into a product of conditional probabilities:

P(w1 w2 ... wn) = P(wn | w1 ... wn-1) * ... * P(w2 | w1) * P(w1)

For example:

P(Humpty Dumpty sat) = P(Humpty Dumpty sat | Humpty Dumpty) * P(Humpty Dumpty | Humpty) * P(Humpty)

(Here P(Humpty Dumpty | Humpty) denotes the probability of the phrase "Humpty Dumpty" given that it starts with "Humpty", i.e., the probability that "Dumpty" follows "Humpty".)

To compute each term on the right-hand side, we use counts from the sample corpus of text given below.

For the first term (rightmost):

P(Humpty) = count(Humpty)/total_word_count

Notice that for the unigram, we divide by the total number of words in the corpus to derive the probability.

For the second term:
P(Humpty Dumpty | Humpty) = count(Humpty Dumpty) / count(Humpty)

Likewise, for the third term:
P(Humpty Dumpty sat | Humpty Dumpty) = count(Humpty Dumpty sat) / count(Humpty Dumpty)
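
For concreteness, applying this to the sample corpus given below (26 words in total, assuming punctuation is stripped and matching is case-sensitive):

P(Humpty) = 3/26
P(Humpty Dumpty | Humpty) = 2/3
P(Humpty Dumpty sat | Humpty Dumpty) = 1/2

Multiplying the three terms gives P(Humpty Dumpty sat) = (1/2) * (2/3) * (3/26) = 1/26 ≈ 0.038.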


NOTE:

  • ProbabilityComputer is a general, non-optimized implementation that uses the chain rule of probability only. To optimize further, we would combine the Markov assumption with a specific n-gram model, e.g., a bigram or trigram model.

  • MemoizedMarkovProbabilityModel is a more optimized implementation that uses memoization, the Markov assumption, maximum-likelihood estimation, and special-symbol padding.


Using the Markov assumption, this can usually be simplified as follows:

  • P(Humpty Dumpty sat | Humpty Dumpty) = C(Humpty Dumpty sat) / C(Humpty Dumpty), n=3
  • P(<seq> Humpty Dumpty sat on a wall </seq>) = P(Humpty | <seq>) * P(Dumpty | Humpty) * ... * P(</seq> | wall), n=2

where C is the count function, P is the probability function, and <seq> and </seq> are special padding symbols marking the start and end of a sequence.

For more details on the Markov assumption please refer to the bibliography section below.
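
To make these ideas concrete, here is a minimal bigram (n=2) sketch that pads the text with the special symbols <seq> and </seq>, applies the Markov assumption, and scores with maximum-likelihood estimates. It is a hypothetical illustration in the spirit of MemoizedMarkovProbabilityModel, not the library's actual class or API:

import java.util.HashMap;
import java.util.Map;

// Minimal bigram (n = 2) sketch: padding + Markov assumption + maximum-likelihood
// estimates. A hypothetical illustration, not the library's actual class.
class BigramSketch {
  private final Map<String, Integer> unigramCounts = new HashMap<>();
  private final Map<String, Integer> bigramCounts = new HashMap<>();

  void train(String corpus) {
    // Pad with special symbols so sentence boundaries are modeled too.
    String[] words = ("<seq> " + corpus + " </seq>").split("\\s+");
    for (int i = 0; i < words.length; i++) {
      unigramCounts.merge(words[i], 1, Integer::sum);
      if (i + 1 < words.length) {
        bigramCounts.merge(words[i] + " " + words[i + 1], 1, Integer::sum);
      }
    }
  }

  // Maximum-likelihood estimate: P(next | prev) = C(prev next) / C(prev).
  double conditional(String prev, String next) {
    int prevCount = unigramCounts.getOrDefault(prev, 0);
    if (prevCount == 0) return 0.0;
    return (double) bigramCounts.getOrDefault(prev + " " + next, 0) / prevCount;
  }

  // Markov assumption: the sentence probability is a product of bigram terms only.
  double sentenceProbability(String sentence) {
    String[] words = ("<seq> " + sentence + " </seq>").split("\\s+");
    double probability = 1.0;
    for (int i = 0; i + 1 < words.length; i++) {
      probability *= conditional(words[i], words[i + 1]);
    }
    return probability;
  }

  public static void main(String[] args) {
    var model = new BigramSketch();
    model.train("Humpty Dumpty sat on a wall");
    System.out.println(model.sentenceProbability("Humpty Dumpty sat on a wall")); // 1.0
  }
}

Training on the single sentence "Humpty Dumpty sat on a wall" and scoring that same sentence yields 1.0, since each padded bigram and each preceding word occur exactly once in the training text.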

Sample Corpus

Humpty Dumpty sat on a wall,
Humpty Dumpty had a great fall.
All the king's horses and all the king's men
Couldn't put Humpty together again.

Given the above corpus, for example, the 3-grams would be as follows (a sketch for generating them appears after the list):

  • Humpty Dumpty sat
  • Dumpty sat on
  • sat on a
  • on a wall
  • a wall Humpty
  • wall Humpty Dumpty
  • Humpty Dumpty had
  • Dumpty had a
  • had a great
  • a great fall
  • great fall All
  • fall All the
  • All the king's
  • the king's horses
  • king's horses and
  • horses and all
  • and all the
  • all the king's
  • the king's men
  • king's men Couldn't
  • men Couldn't put
  • Couldn't put Humpty
  • put Humpty together
  • Humpty together again
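
A minimal way to generate such n-grams is to slide a window of size n over the tokenized text. The helper below is a hypothetical illustration (tokenization is simplified to whitespace splitting, so punctuation handling is left out), not part of the library's API:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class NgramSketch {
  // Slide a window of size n over the whitespace-tokenized text.
  static List<String> ngrams(String text, int n) {
    String[] words = text.split("\\s+");
    List<String> result = new ArrayList<>();
    for (int i = 0; i + n <= words.length; i++) {
      result.add(String.join(" ", Arrays.copyOfRange(words, i, i + n)));
    }
    return result;
  }

  public static void main(String[] args) {
    // Prints: Humpty Dumpty sat / Dumpty sat on / sat on a / on a wall
    ngrams("Humpty Dumpty sat on a wall", 3).forEach(System.out::println);
  }
}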

User Guide

If you wish to use the classes provided as a library, consider the following code snippet:

class SomeClass {
  public static void main(String[] args) {
    var n = 2; // model order; 2 means bigrams
    var model = new MemoizedMarkovProbabilityModel(n); // create the model
    var corpus = "Humpty Dumpty sat on a wall"; // training corpus
    model.train(corpus); // train the model on the corpus
    var probability = model.computeProbability("Humpty Dumpty");
    System.out.println(probability); // probability of the phrase under the model
    var nextWord = model.predictNextWord("Humpty Dumpty");
    System.out.println(nextWord); // most likely word to follow the phrase
  }
}

Bibliography
