A playground for experimenting with n-gram-based language models and next-word prediction.
Using the chain rule of probability, we can decompose the probability of a word sequence as follows:

P(Humpty Dumpty sat) = P(sat | Humpty Dumpty) * P(Dumpty | Humpty) * P(Humpty)
To compute each term on the right-hand side, we use a sample text corpus.

For the first term (from the right):
P(Humpty) = count(Humpty)/total_word_count
Notice that for unigrams, we divide by the total number of words in the corpus to derive the probability.
For the second term:
P(Dumpty | Humpty) = count(Humpty Dumpty) / count(Humpty)
Likewise, for the third term:
P(sat | Humpty Dumpty) = count(Humpty Dumpty sat) / count(Humpty Dumpty)
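As a rough, self-contained sketch of how these counts could be computed (this is not the library's code; `countOccurrences` is a hypothetical helper):

```java
import java.util.Arrays;
import java.util.List;

class ChainRuleSketch {
    // Counts how often `phrase` occurs as a contiguous word sequence in `words`.
    static long countOccurrences(List<String> words, List<String> phrase) {
        long count = 0;
        for (int i = 0; i + phrase.size() <= words.size(); i++) {
            if (words.subList(i, i + phrase.size()).equals(phrase)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList(
                "Humpty Dumpty sat on a wall Humpty Dumpty had a great fall".split("\\s+"));

        double pHumpty =
                (double) countOccurrences(words, List.of("Humpty")) / words.size();
        double pDumptyGivenHumpty =
                (double) countOccurrences(words, List.of("Humpty", "Dumpty"))
                        / countOccurrences(words, List.of("Humpty"));
        double pSatGivenHumptyDumpty =
                (double) countOccurrences(words, List.of("Humpty", "Dumpty", "sat"))
                        / countOccurrences(words, List.of("Humpty", "Dumpty"));

        // Chain rule: P(Humpty Dumpty sat) = P(Humpty) * P(Dumpty | Humpty) * P(sat | Humpty Dumpty)
        System.out.println(pHumpty * pDumptyGivenHumpty * pSatGivenHumptyDumpty);
    }
}
```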
NOTE:
- `ProbabilityComputer` is a general, non-optimized implementation that uses the chain rule of probability only. To optimize further, we would combine the Markov assumption with a specific n-gram model (for example, bigram or trigram).
- `MemoizedMarkovProbabilityModel` is a more optimized implementation that uses memoization, the Markov assumption, maximum likelihood estimation, and special-symbol padding.
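As an illustration of the memoization idea only (the actual internals of `MemoizedMarkovProbabilityModel` may differ), here is a minimal sketch that caches phrase counts so each is computed at most once:

```java
import java.util.HashMap;
import java.util.Map;

// A minimal sketch of memoized phrase counting; not the library's actual implementation.
class MemoizedCountSketch {
    private final Map<String, Long> cache = new HashMap<>();
    private final String[] words;

    MemoizedCountSketch(String corpus) {
        this.words = corpus.split("\\s+");
    }

    // Returns how often `phrase` occurs in the corpus, computing it at most once.
    long count(String phrase) {
        return cache.computeIfAbsent(phrase, p -> {
            String[] target = p.split("\\s+");
            long occurrences = 0;
            for (int i = 0; i + target.length <= words.length; i++) {
                boolean match = true;
                for (int j = 0; j < target.length; j++) {
                    if (!words[i + j].equals(target[j])) {
                        match = false;
                        break;
                    }
                }
                if (match) occurrences++;
            }
            return occurrences;
        });
    }

    public static void main(String[] args) {
        var counts = new MemoizedCountSketch(
                "Humpty Dumpty sat on a wall Humpty Dumpty had a great fall");
        System.out.println(counts.count("Humpty Dumpty")); // 2 (computed, then cached)
        System.out.println(counts.count("Humpty Dumpty")); // 2 (served from the cache)
    }
}
```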
Using the Markov assumption, this can usually be simplified as follows:

P(sat | Humpty Dumpty) = C(Humpty Dumpty sat) / C(Humpty Dumpty), for n=3

P(<seq> Humpty Dumpty sat on a wall </seq>) = P(Humpty | <seq>) * P(Dumpty | Humpty) * ... * P(</seq> | wall), for n=2

where `C` is the count function and `P` is the probability function.
For more details on the Markov assumption, please refer to the bibliography section below.
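As a concrete, self-contained sketch of the n=2 case, assuming `<seq>` and `</seq>` are treated as ordinary tokens (not necessarily how the library does it):

```java
import java.util.HashMap;
import java.util.Map;

class PaddedBigramSketch {
    public static void main(String[] args) {
        // Pad the training sentence with the special boundary symbols.
        String[] tokens = "<seq> Humpty Dumpty sat on a wall </seq>".split("\\s+");

        // Collect unigram and bigram counts (maximum likelihood estimation).
        Map<String, Integer> unigrams = new HashMap<>();
        Map<String, Integer> bigrams = new HashMap<>();
        for (int i = 0; i < tokens.length; i++) {
            unigrams.merge(tokens[i], 1, Integer::sum);
            if (i + 1 < tokens.length) {
                bigrams.merge(tokens[i] + " " + tokens[i + 1], 1, Integer::sum);
            }
        }

        // P(<seq> Humpty ... wall </seq>) = product of P(w_i | w_{i-1})
        //                                 = product of C(w_{i-1} w_i) / C(w_{i-1}).
        double probability = 1.0;
        for (int i = 1; i < tokens.length; i++) {
            double c2 = bigrams.getOrDefault(tokens[i - 1] + " " + tokens[i], 0);
            double c1 = unigrams.get(tokens[i - 1]);
            probability *= c2 / c1;
        }
        System.out.println(probability); // 1.0 for this single-sentence corpus
    }
}
```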
Humpty Dumpty sat on a wall,
Humpty Dumpty had a great fall.
All the king's horses and all the king's men
Couldn't put Humpty together again.
Given the above corpus, for example, the 3-grams would be as follows (a sketch for extracting them programmatically appears after the list):
- Humpty Dumpty sat
- Dumpty sat on
- sat on a
- on a wall
- a wall Humpty
- wall Humpty Dumpty
- Humpty Dumpty had
- Dumpty had a
- had a great
- a great fall
- great fall All
- fall All the
- All the king's
- the king's horses
- king's horses and
- horses and all
- and all the
- all the king's
- the king's men
- king's men Couldn't
- men Couldn't put
- Couldn't put Humpty
- put Humpty together
- Humpty together again
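A minimal sketch of how such 3-grams could be extracted with a sliding window (again, not the library's code):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class NGramExtractionSketch {
    // Returns all contiguous n-grams (as space-joined strings) in the text.
    static List<String> ngrams(String text, int n) {
        String[] words = text.split("\\s+");
        List<String> result = new ArrayList<>();
        for (int i = 0; i + n <= words.length; i++) {
            result.add(String.join(" ", Arrays.asList(words).subList(i, i + n)));
        }
        return result;
    }

    public static void main(String[] args) {
        String rhyme = "Humpty Dumpty sat on a wall, "
                + "Humpty Dumpty had a great fall. "
                + "All the king's horses and all the king's men "
                + "Couldn't put Humpty together again.";
        // Strip punctuation first so the n-grams match the listing above.
        ngrams(rhyme.replaceAll("[,.]", ""), 3).forEach(System.out::println);
    }
}
```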
If you wish to play with the classes provided as a library, consider the following code snippet:
```java
class SomeClass {
    public static void main(String[] args) {
        var n = 2; // for 2-grams (bigrams)
        var model = new MemoizedMarkovProbabilityModel(n); // create the model
        var corpus = "Humpty Dumpty sat on a wall"; // training corpus
        model.train(corpus); // train the model

        var probability = model.computeProbability("Humpty Dumpty");
        System.out.println(probability);

        var nextWord = model.predictNextWord("Humpty Dumpty");
        System.out.println(nextWord);
    }
}
```
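Since "sat" is the only word that ever follows "Dumpty" in this one-sentence corpus, one would expect `predictNextWord` to return "sat" here; the exact output format depends on the implementation.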