patrick-brian-mooney / markov-sentence-generator

Generates "natural-sounding" text using a Markov model and sample textual training input. Given some sample text from which to build a model, the program prints out one or more sentences by randomly traversing a Markov chain that models the source text.

Patrick Mooney's Markov Sentence Generator

v2.4, 5 December 2020

This program generates one or more sentences of "natural-sounding" random text based on one or more existing texts. It analyzes the sample text, builds a set of Markov chains that models it, and then generates new text by randomly traversing that set of chains. Use it from the terminal by doing something like:

$ ./text_generator.py [options] -i FILENAME [-i FILENAME ...]

Note that users of non-Unix-based operating systems (notably Windows) may need to drop the ./ at the beginning of that command. The script should, in theory, run fine on non-Linux operating systems, but I haven't tested this myself. Feedback is welcome on this or other matters. Collaboration is also quite welcome. (See the file PROGRAMMING.md for more information.) This script requires Python 3.5 or later.

text_generator.py needs existing text to use as the basis for the text that it generates. You must either specify at least one plain-text file (with -i or --input) for this purpose, or else must use -l or --load to specify a file containing compiled probability data ("saved chains"), created with -o on a previous run. The -l (or --load) option is a convenience to save processing time: the program will run more quickly, but you can't combine -l/--load with -i/--input, nor can you use more than one -l/--load in a single program run. There are other options—those that would alter an existing model, primarily—that are incompatible with -l/--load, too. See below for more details.
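The save-then-load workflow described above can be pictured with a small sketch. This is only an illustration: the real on-disk format written by -o is not described in this README, so the use of pickle, the file name chains.pkl, and the dictionary layout are all assumptions made for the example.

```python
import pickle
import tempfile
from pathlib import Path

# A toy stand-in for compiled probability data: state -> possible next words.
# (Hypothetical layout; not the script's actual format.)
chains = {("the",): ["cat", "dog"], ("cat",): ["sat"], ("dog",): ["sat"]}

with tempfile.TemporaryDirectory() as tmp:
    saved = Path(tmp) / "chains.pkl"  # hypothetical file name

    # "First run": analyze the text (skipped here) and save the result.
    with saved.open("wb") as f:
        pickle.dump({"order": 1, "chains": chains}, f)

    # "Later run": skip the analysis and load the saved model instead.
    with saved.open("rb") as f:
        model = pickle.load(f)

print(model["order"], model["chains"] == chains)
```

In a layout like this, the chain length is stored with the model. That is one way to see why options that would change the model, such as -m or -i, make no sense alongside --load.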

If you're looking for something to play with, try passing in a book from Project Gutenberg with -i or --input, and try using different (fairly small) integers for the -m parameter, e.g. -m 2 or -m 5.

A quick reference list of available command-line options:

| short form | long form | effect |
|---|---|---|
| -h | --help | Display a long help message. |
| -v | --verbose | Increase how chatty the script is. Primarily useful in debugging. |
| -q | --quiet | Decrease how chatty the script is. |
| -m NUM | --markov-length=NUM | Length (in words) of the Markov chains used by the program. Cannot be used with --load or -l. |
| -i FILENAME | --input=FILENAME | Specify an input file to use as the basis of the generated text. Cannot be used with --load or -l. |
| -l FILE | --load=FILE | Load generated probability data ("chains") from a previous run that have been saved with -o or --output. |
| -o FILE | --output=FILE | Specify a file into which the generated probability data (the "chains") should be saved. |
| -c NUM | --count=NUM | Specify how many sentences the script should generate. |
| -r | --chars | Use individual characters, rather than individual words, as the tokens for the text generator. Cannot be used with --load or -l. |
| -w NUM | --columns=NUM | Wrap the output to a specified number of columns. If NUM is -1 (or not specified), the sentence generator does its best to wrap to the width of the current terminal. If NUM is 0, no wrapping at all is performed, and words may be split between lines. |
| -p NUM | --pause=NUM | Pause for roughly NUM seconds after each paragraph. The actual pause length may be more or less than specified. |
| | --html | Wrap paragraphs of text output by the program with <p> ... </p>. |

You can use ./text_generator.py --help to get more detailed usage information.
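As an illustration of the -w/--columns rules in the table, here is a minimal sketch of that kind of wrapping using only the standard library. The function name wrap_output is made up for this example and is not taken from the script, which may do this differently.

```python
import shutil
import textwrap


def wrap_output(text, columns=-1):
    """Wrap text following the -w/--columns rules described in the table.

    -1 means "use the current terminal width"; 0 means "do not wrap".
    (Hypothetical helper, for illustration only.)
    """
    if columns == 0:
        return text
    if columns == -1:
        # Falls back to 80 columns when no terminal size can be detected.
        columns = shutil.get_terminal_size(fallback=(80, 24)).columns
    return textwrap.fill(text, width=columns)


sample = "word " * 30
print(wrap_output(sample, 20))
```

With columns set to 0 the text is returned untouched, so any line breaking is left to the terminal, which is why words may end up split between lines.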

Chain length defaults to 1 (which is fastest), but increasing this may generate more "realistic" text (depending on what you think that means and how lucky the algorithm gets on a particular run), though slightly more slowly and at the cost of requiring additional memory (and disk space, if you save the generated chains with -o). Depending on the text, increasing the chain length past 6 or 7 words probably won't do much good—at that point you're usually plucking whole sentences from the source text(s) anyway, so using a Markov model to pick sentences is probably overkill.
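To make the effect of chain length concrete, here is a minimal sketch of a Markov text model with a configurable order. It is not the code used by text_generator.py: the function names build_chains and generate are invented for this example, and it emits a fixed number of tokens rather than whole sentences.

```python
import random
from collections import defaultdict


def build_chains(tokens, order=1):
    """Map each run of `order` consecutive tokens to the tokens that follow it."""
    chains = defaultdict(list)
    for i in range(len(tokens) - order):
        state = tuple(tokens[i:i + order])
        chains[state].append(tokens[i + order])
    return dict(chains)


def generate(chains, length=20, rng=random):
    """Randomly walk the chains, starting from a randomly chosen state."""
    state = rng.choice(sorted(chains))
    output = list(state)
    while len(output) < length:
        followers = chains.get(state)
        if not followers:  # dead end: this state was never followed by anything
            break
        output.append(rng.choice(followers))
        state = tuple(output[-len(state):])
    return output


text = "the cat sat on the mat and the dog sat on the log"
words = text.split()

for order in (1, 2):
    chains = build_chains(words, order)
    print(order, len(chains), " ".join(generate(chains, length=12)))
```

With a longer chain, each state has fewer possible followers, so the output tracks the source more closely; that is the trade-off described above. Passing list(text) instead of text.split() would give character-level tokens, similar in spirit to the --chars option.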

You can also use this script as a module from other Python 3.X code to produce text; see the file PROGRAMMING.md for more information about how to do so. It plays nicely with Cython to build faster versions of the text-generating libraries, as well; there's a brief introduction in PROGRAMMING.md.

This script is Patrick Mooney's fork of Harry R. Schwartz's Markov Sentence Generator, initially created for my automated text blog Ulysses Redux. (I also use it on many of my other automated text projects.) HRS did the hard work here; my changes reflect adaptations to the needs of my own projects (and were largely motivated by a desire to learn a bit about Python, and about text modeling). It also seeks to be more generally useful as a command-line program than its progenitor, though how well I have succeeded at that goal is of course a matter of opinion. The command-line interface options are intended to be a superset of those used in Jamie Zawinski's DadaDodo, which is also a Markov-based text generator (though this program is not explicitly intended to be a drop-in replacement for DadaDodo and, notably, it cannot read compiled DadaDodo chains, nor produce chains DadaDodo can read).

If you want to help develop this script, you are welcome to do so: I am interested in good ideas and I welcome collaboration. If you'd rather go off on your own with it, why, then you should be aware that this script, like Schwartz's original, is licensed under the GPL, either version 3 or (at your option) any later version. A copy of version 3 of the GPL is included as the file LICENSE.md; a listing of changes is included as the file HISTORY.md.

License:GNU General Public License v3.0

