Markov Text Generator

A quick Python implementation of a text generator based on a Markov process.

Overview

Sometimes you just want to generate random text that's usually sort of grammatical. For this purpose, a Markov chain is a good fit. The generated text is nonsense, but sometimes nonsense is all you need. Here's a sample, produced by running the command python3 gentext.py models/jack-masden.json from the project root:

This is the same for Transwestern Pipeline Company name should see you. Tracy, wanted to sit in? checked by anyone as planned, with this additional schedule, 2. need to communicate this one? Do you can constructively talk through one last week of his list for the updated version of the best estimate at your comments as to you need to have input and Administration. Legal Consolidation Data Viewer DataWarehouse User Role Consolidated Thank you. You are not sure it will be aware of you pland to outweigh the percentage of Networks, and privileged material for was ok Let's plan assessments from...
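The idea behind a Markov text generator is simple: record which tokens follow which in the training corpus, then repeatedly sample a successor of the current token. The sketch below illustrates the technique in miniature; it is a standalone illustration using made-up function names, not the code shipped in this repo.

```python
import random
from collections import defaultdict

def build_model(tokens):
    """Map each token to the list of tokens observed to follow it."""
    model = defaultdict(list)
    for current, nxt in zip(tokens, tokens[1:]):
        model[current].append(nxt)
    return model

def generate(model, start, length):
    """Walk the chain: repeatedly sample a successor of the last token."""
    out = [start]
    for _ in range(length - 1):
        successors = model.get(out[-1])
        if not successors:
            break  # dead end: this token never had a successor in the corpus
        out.append(random.choice(successors))
    return " ".join(out)

corpus = "the cat sat on the mat the cat ate the fish".split()
model = build_model(corpus)
print(generate(model, "the", 10))
```

Because successors are sampled in proportion to how often they occurred, frequent word pairs from the corpus dominate the output, which is what gives the text its "usually sort of grammatical" feel.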

Installation

This package is not currently on PyPI. You can install this repo as a pip package using the following command:

pip install git+ssh://git@github.com/lambdacasserole/markov-text-generator.git

Usage (Command Line)

You can use the models entirely from the command line if you like; it's really straightforward.

Training

If you want to train a new model from a set of text files, use genmodel.py. Do this:

python3 genmodel.py <file1> [file2] ... [filen]

It'll take as many files as you give it, as long as you give it at least one. The serialized model is written to standard output, so to train on two files called emails.txt and tweets.txt and save the resulting model to my_model.json, do this:

python3 genmodel.py emails.txt tweets.txt > my_model.json

Note that there are filters built into genmodel.py that do some basic data cleaning. These are designed for the Enron dataset [1] and remove email headers and similar noise, so you'll need to adjust them for your own training set.
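To give a sense of what such cleaning looks like, here is a minimal, hypothetical header-stripping filter; the function name and patterns are illustrative only, not the filters actually shipped in genmodel.py, so adapt them to your own corpus.

```python
import re

# Hypothetical header filter for illustration; genmodel.py's real
# filters differ. Matches common email header lines at line starts.
HEADER_PATTERN = re.compile(
    r"^(From|To|Cc|Bcc|Subject|Date|Sent|Message-ID):.*$",
    re.IGNORECASE | re.MULTILINE,
)

def strip_headers(text):
    """Remove header-like lines, leaving only the message body."""
    return HEADER_PATTERN.sub("", text).strip()

raw = "From: alice@example.com\nSubject: lunch\n\nShall we meet at noon?"
print(strip_headers(raw))  # -> Shall we meet at noon?
```

Filtering out headers matters because anything left in the training text ends up in the model, and boilerplate like "Subject:" lines would otherwise leak into the generated output.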

Text Generation

Once you have a trained model, you can use gentext.py to generate text. This is even simpler. To generate text from my_model.json, do this:

python3 gentext.py my_model.json

This will generate a 100-token string by default. If you want to generate longer/shorter strings, you can specify the length of the string in tokens like so (in this case, 1000 tokens will be generated):

python3 gentext.py my_model.json 1000

Usage (Python)

You can also do things programmatically from within Python. It's a bit more involved, but still very simple. This example mirrors the command-line section: we want to train a model from two text files, emails.txt and tweets.txt, save it to my_model.json, and then generate some text:

import genmodel as gm

# Analyse files, getting frequency analysis and starting tokens.
analysis, starts = gm.analyze(["emails.txt", "tweets.txt"])

# Compute model from analysis.
model = gm.compute_model(analysis, starts)

# Save the model to disk.
model.persist("my_model.json")

# Generate and print a 100-token string.
print(model.generate_text(100))

Training Data

The training data used to generate the model files in /models is drawn from the Enron dataset [1]. I selected 5 people from it at random, generated random names for them, and trained a model on each person's sent_items or sent folder.

Acknowledgements

The training data used to generate the model files in /models comes from the Enron Dataset [1].

References

  1. Klimt, B. and Yang, Y., 2004, September. The Enron corpus: A new dataset for email classification research. In European Conference on Machine Learning (pp. 217-226). Springer, Berlin, Heidelberg.

License: MIT

