RobertTLange/sequitur

Type "sequitur -h" for a list of command-line options.

To get a basic idea of how the program and the algorithm work (on Unix):
$ echo -n abcdbcabcdbc | ./sequitur -p

To compress and decompress:
$ sequitur -c < input > compressed
$ sequitur -u < compressed > uncompressed

Here are some notes, and credits to those who have helped refine
the code:

______________________________________________________________________

December 2004:

I have added test.pl to provide a suite of minimal regression tests.

______________________________________________________________________

June 2004:

Roberto Maglica (romag@email.si) ported sequitur to Windows using the
Windows port of gcc 2.95-2, and cleaned up much of the code. He did this as
part of his BSc graduation thesis "Stiskanje podatkov z metodo Sequitur"
("Data Compression Using the Sequitur Method"), submitted to the Faculty of
Computer and Information Science, Ljubljana, Slovenia
http://www.fri.uni-lj.si

Roberto's comments follow:

- setting binary-mode input/output when working on the Windows platform.
Unlike Unix, binary mode is not the default on Windows, so we have to
explicitly set it in order to correctly read and write data.

- module getopt.c, which contains the getopt() command-line parsing
function, again to use with Windows.

- fixed a bug with the -f ("memory limit") option. When a non-terminal is
output for the first time, we must output its rule definition (its
right-hand side), even if that non-terminal is used only once in the
right-hand sides of the rules. The bug was that, in this case, the
program output a code for the non-terminal instead of the rule
definition.

- support for more than 256 terminal symbols. I did this by rearranging
the codes in the 'symbol' context (compress.cc module): the first few
codes (0, 1, ...) are used for special symbols such as START_RULE and
END_OF_FILE, then odd numbers are assigned to terminal symbols and even
numbers to non-terminal symbols.

- a few optimizations to reduce the size of the output. These include:

  + recording the least and the greatest terminal symbol, so that the
    compression module knows the range the terminal symbols fall in and
    inserts only symbols from that range into the "symbol" context,
    which results in fewer bits being used to code them

  + recording the maximal rule length (instead of using a MAX_LENGTH
    constant), and not inserting symbols 0 and 1 into the "lengths"
    context, because no rule is 0 or 1 characters long; again, this
    results in fewer bits being used

  + if we did not use -f, create "symbol" and "lengths" contexts as
    static. Using a static context, rather than dynamic, for the same
    set of symbols, results in fewer bits being used.

  The gain from these optimizations is small -- typically, 0.5% of the
  length of the uncompressed data. However, it is present.
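
The odd/even code arrangement described above can be sketched roughly as
follows. This is only an illustration, not the code from compress.cc: the
names kNumSpecial, kBase, encode_terminal, encode_nonterminal and decode
are hypothetical, and the number of reserved special codes (4 here) is an
assumption.

```cpp
#include <cstdint>

// ASSUMPTION: the number of reserved special codes (START_RULE,
// END_OF_FILE, etc.) is illustrative; the real value is in compress.cc.
const std::uint32_t kNumSpecial = 4;

// First even code at or above the reserved range.
const std::uint32_t kBase = (kNumSpecial + 1) & ~std::uint32_t(1);

enum class Kind { Special, Terminal, NonTerminal };

struct Decoded {
    Kind kind;
    std::uint32_t index;  // index within its own kind
};

// Terminals take the odd codes above the reserved range ...
inline std::uint32_t encode_terminal(std::uint32_t t) {
    return kBase + 2 * t + 1;
}

// ... and non-terminals take the even codes, so neither alphabet
// needs a fixed upper bound such as 256.
inline std::uint32_t encode_nonterminal(std::uint32_t n) {
    return kBase + 2 * n;
}

// Recover the kind and index from a code.
inline Decoded decode(std::uint32_t code) {
    if (code < kBase) return {Kind::Special, code};
    if (code % 2 == 1) return {Kind::Terminal, (code - kBase - 1) / 2};
    return {Kind::NonTerminal, (code - kBase) / 2};
}
```

Because the two kinds interleave, either one can grow without colliding
with the other; for example, decode(encode_terminal(300)) yields a
terminal with index 300 under these assumptions.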

There are still issues to be resolved. For example, using the -k option
crashes the program on my system almost every time -- I have not
investigated this.

______________________________________________________________________

Richard O'Keefe <ok@cs.otago.ac.nz> uses Linux on an UltraSPARC. He has made
many helpful comments to clean up my non-portable code. The modifications
that I didn't implement because they don't work on RedHat Linux are:

- change $(CC) to $(CCC) in the Makefile, except for the .c files
- change "-lstdc++" to "$(LIBS)" and define LIBS=

______________________________________________________________________