willwade / pylm

Adaptive Language Models in Python

Home Page: https://pypredictor.streamlit.app


⚠️ A Python reinterpretation of the PPM JS library. The original can be found at https://github.com/google-research/google-research/tree/master/jslm - see the original for more code comments.

This directory contains a collection of simple adaptive language models that are cheap enough, memory- and processor-wise, to train in a browser on the fly.

Language Models

Prediction by Partial Matching (PPM)

Prediction by Partial Matching (PPM) character language model. See the bibliography below.
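
For orientation, here is a minimal usage sketch. The class and method names below (Vocabulary, PPMLanguageModel, create_context, add_symbol_and_update, add_symbol_to_context, get_probs) mirror the original jslm API and are assumptions about this Python port, not confirmed identifiers from this repository:

# Minimal PPM usage sketch. Module paths and snake_case method names are
# assumptions modelled on the original jslm API, not confirmed repo code.
from vocabulary import Vocabulary
from ppm_language_model import PPMLanguageModel

# Build a small character vocabulary; add_symbol is assumed to return the id.
vocab = Vocabulary()
ids = {ch: vocab.add_symbol(ch) for ch in "abcdefghijklmnopqrstuvwxyz "}

# Create an order-5 PPM model and train it adaptively on a short string.
model = PPMLanguageModel(vocab, 5)
context = model.create_context()
for ch in "hello world hello there":
    model.add_symbol_and_update(context, ids[ch])

# Query the distribution over the next character given the prefix "hel".
context = model.create_context()
for ch in "hel":
    model.add_symbol_to_context(context, ids[ch])
probs = model.get_probs(context)  # probabilities indexed by symbol id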

Bibliography

  1. Cleary, John G. and Witten, Ian H. (1984): “Data Compression Using Adaptive Coding and Partial String Matching”, IEEE Transactions on Communications, vol. 32, no. 4, pp. 396–402.
  2. Moffat, Alistair (1990): “Implementing the PPM data compression scheme”, IEEE Transactions on Communications, vol. 38, no. 11, pp. 1917–1921.
  3. Kneser, Reinhard and Ney, Hermann (1995): “Improved backing-off for M-gram language modeling”, Proc. of Acoustics, Speech, and Signal Processing (ICASSP), May, pp. 181–184. IEEE.
  4. Chen, Stanley F. and Goodman, Joshua (1999): “An empirical study of smoothing techniques for language modeling”, Computer Speech & Language, vol. 13, no. 4, pp. 359–394, Elsevier.
  5. Ward, David J. and Blackwell, Alan F. and MacKay, David J. C. (2000): “Dasher – A Data Entry Interface Using Continuous Gestures and Language Models”, UIST '00 Proceedings of the 13th annual ACM symposium on User interface software and technology, pp. 129–137, November, San Diego, USA.
  6. Drinic, Milenko and Kirovski, Darko and Potkonjak, Miodrag (2003): “PPM Model Cleaning”, Proc. of Data Compression Conference (DCC'2003), pp. 163–172. March, Snowbird, UT, USA. IEEE
  7. Jin Hu Huang and David Powers (2004): “Adaptive Compression-based Approach for Chinese Pinyin Input”, Proceedings of the Third SIGHAN Workshop on Chinese Language Processing, pp. 24–27, Barcelona, Spain. ACL.
  8. Cowans, Phil (2005): “Language Modelling In Dasher – A Tutorial”, June, Inference Lab, Cambridge University (presentation).
  9. Steinruecken, Christian and Ghahramani, Zoubin and MacKay, David (2015): “Improving PPM with dynamic parameter updates”, Proc. of Data Compression Conference (DCC'2015), pp. 193–202, April, Snowbird, UT, USA. IEEE.
  10. Steinruecken, Christian (2015): “Lossless Data Compression”, PhD dissertation, University of Cambridge.

Histogram Language Model

Very simple context-less histogram character language model.
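
For intuition, here is a self-contained illustrative sketch of such a context-less histogram model with add-one smoothing (a toy re-implementation for explanation, not necessarily the code in this repository):

from collections import Counter

class HistogramCharModel:
    """Context-less adaptive histogram over characters (illustrative sketch)."""

    def __init__(self, alphabet):
        self.alphabet = list(alphabet)
        self.counts = Counter()
        self.total = 0

    def update(self, ch):
        # Adapt the histogram after observing one character.
        self.counts[ch] += 1
        self.total += 1

    def prob(self, ch):
        # Add-one (Laplace) smoothed estimate so unseen characters keep some mass.
        return (self.counts[ch] + 1) / (self.total + len(self.alphabet))

model = HistogramCharModel("abcdefghijklmnopqrstuvwxyz ")
for ch in "hello world":
    model.update(ch)
print(round(model.prob("l"), 4))  # 'l' occurs 3 times in 11 characters -> 0.1053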

Bibliography

  1. Steinruecken, Christian (2015): “Lossless Data Compression”, PhD dissertation, University of Cambridge.
  2. Pitman, Jim and Yor, Marc (1997): “The two-parameter Poisson–Dirichlet distribution derived from a stable subordinator.”, The Annals of Probability, vol. 25, no. 2, pp. 855–900.
  3. Stanley F. Chen and Joshua Goodman (1999): “An empirical study of smoothing techniques for language modeling”, Computer Speech and Language, vol. 13, pp. 359–394.

Pólya Tree (PT) Language Model

Context-less predictive distribution based on balanced binary search trees. A tentative implementation is here.
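
As a rough illustration of the idea only (not the tentative implementation referenced above), a Pólya-tree-style predictor can route each symbol along a path through a balanced binary tree over the sorted alphabet and keep adaptive left/right counts at every internal node:

class PolyaTreeCharModel:
    """Illustrative context-less Polya-tree-style predictor (not the repo's code)."""

    def __init__(self, alphabet):
        self.alphabet = sorted(alphabet)
        # Counts keyed by (lo, hi) alphabet interval: [left_count, right_count].
        self.counts = {}

    def _path(self, ch):
        # Yield (interval, went_right) branching decisions from root to leaf.
        lo, hi = 0, len(self.alphabet)
        while hi - lo > 1:
            mid = (lo + hi) // 2
            right = self.alphabet.index(ch) >= mid
            yield (lo, hi), right
            lo, hi = (mid, hi) if right else (lo, mid)

    def prob(self, ch):
        # Product over the path of smoothed branch probabilities (Beta(1/2, 1/2) prior).
        p = 1.0
        for node, right in self._path(ch):
            left, rite = self.counts.get(node, [0, 0])
            branch = (rite if right else left) + 0.5
            p *= branch / (left + rite + 1.0)
        return p

    def update(self, ch):
        # Increment the chosen branch count at every node on the symbol's path.
        for node, right in self._path(ch):
            left, rite = self.counts.get(node, [0, 0])
            if right:
                rite += 1
            else:
                left += 1
            self.counts[node] = [left, rite]

model = PolyaTreeCharModel("abcd")
for ch in "aab":
    model.update(ch)
print(round(model.prob("a"), 3))  # ~0.547 after seeing "aab"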

Bibliography

  1. Gleave, Adam and Steinruecken, Christian (2017): “Making compression algorithms for Unicode text”, arXiv preprint arXiv:1701.04047.
  2. Steinruecken, Christian (2015): “Lossless Data Compression”, PhD dissertation, University of Cambridge.
  3. Mauldin, R. Daniel and Sudderth, William D. and Williams, S. C. (1992): “Polya Trees and Random Distributions”, The Annals of Statistics, vol. 20, no. 3, pp. 1203–1221.
  4. Lavine, Michael (1992): “Some aspects of Polya tree distributions for statistical modelling”, The Annals of Statistics, vol. 20, no. 3, pp. 1222–1235.
  5. Neath, Andrew A. (2003): “Polya Tree Distributions for Statistical Modeling of Censored Data”, Journal of Applied Mathematics and Decision Sciences, vol. 7, no. 3, pp. 175–186.

Example

Please see a simple example of the model API usage in example.py.

The example takes no command-line arguments. To run it with Python, invoke

> python example.py

Notes

  • Something is wrong with my PPM library. It's continually predicting the same IDs no matter the context. I don't get it.

Test Utility (and Demo of Character or Word prediction)

A simple test driver, language_model_driver.py, can be used to check that the model behaves correctly under Python 3+. The driver takes three parameters: the maximum order of the language model, the training file, and the test file (both in plain text). Currently only the PPM model is supported. Note that the driver demonstrates both next-letter and next-word prediction; use a max_length of around 30 for characters. Be warned too: training is fast, but running test_model can take a long time for the word models. Look at the code - you will need a larger max_length for words.

Example:

> python language_model_driver.py 30 training_small.txt training_small_test.txt
Results: numSymbols = 54, ppl = 13.268624243648365, entropy = 3.7299468876181376 bits/char
Top 5 character predictions for 'he': ['l', ' ', 'e', 't', 'o']
Results: numSymbols = 54, ppl = 9.575973715690846, entropy = 3.2594191923509106 bits/char
Top 5 word predictions for 'Hello ': ['<OOV>', 'everyone', 'sequence', 'test', 'world']
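
The two reported figures are related by definition: entropy is the average negative log2 probability the model assigns to each test symbol, and perplexity is 2 raised to that entropy. A small sketch of that calculation (assumed to match what the driver reports, based on the numbers above):

import math

def entropy_and_perplexity(probs):
    # probs: the probability the model assigned to each symbol in the test set.
    entropy = -sum(math.log2(p) for p in probs) / len(probs)  # bits per symbol
    return entropy, 2.0 ** entropy  # perplexity = 2^entropy

# Sanity check against the run above: perplexity should equal 2 ** entropy.
print(2.0 ** 3.7299468876181376)  # ~13.2686, the reported ppl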

Example train and test files to use

train.txt

hello world hello everyone hello there hello world
this is a test this is a trial this is a sequence
welcome to the model test welcome to the world
Gorgeous Doris Day is lovely. One day i went to the beach. 
Today I was at the shops. What day is it today?

test.txt

hello world this is a test sequence
welcome to the test


License: MIT License

