bay3s/d-tree

***************************************************************************
* Readme file for the decision tree learning algorithm
*
* (C) 1999 Dan Foygel (dfoygel@cs.cmu.edu)
* Carnegie Mellon University
*
* Based heavily on code written by Dimitris Margaritis (dmarg@cs.cmu.edu)
***************************************************************************

***********************
* PROGRAM DESCRIPTION *
***********************

In this directory you will find the code for a decision tree learning
program. The executable is called "dt". You will have to compile it
for the computer architecture you'll be using. There are two
techniques for using this program.

USAGE #1: The "dt" program takes either 4 or 6 arguments:

- (Optional) The random number generator seed can be specified by
typing "-s <seed>" _right after_ "dt" (it will not work if you put
it anywhere else on the command line). If no seed is specified,
the seed will be chosen (semi-)randomly from the microseconds of
the computer clock.

- (Optional) dt can run in batch mode if you type "-b <number>"
_right after_ "dt". When doing this, the program will run the
algorithm <number> times and only report the summary statistics.
See "Batch" section in this README for detailed information.

(NOTE: The -s and -b flags cannot both be used.)

- The fraction of the examples that are to be used for growing the
decision tree.

- The fraction of the examples to be used for post-pruning
(reduced-error pruning) of the decision tree.

- The fraction of the examples to be used for testing the accuracy of
the grown decision tree, after training and (possibly)
post-pruning.

- The name of the file containing the examples. The file must be in
the "SSV" format, which is explained below.

The three sets of examples specified by the three fractions are
mutually exclusive. The fractions must add up to at most 1.0 (less
than 1.0 is ok).
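
As a rough illustration of what a three-way split by fractions
involves, here is a minimal Python sketch. It is not the code used by
dt: the function name split_examples, the shuffle-then-slice approach,
and rounding each set size down are all assumptions made for the
example. Python's generator also differs from the one in dt, so the
same seed will not reproduce dt's split.

```python
import random

def split_examples(examples, train_frac, prune_frac, test_frac, seed=None):
    """Partition examples into disjoint training, pruning, and test sets."""
    rng = random.Random(seed)      # a fixed seed gives a repeatable split
    shuffled = list(examples)      # copy so the caller's list is untouched
    rng.shuffle(shuffled)

    n = len(shuffled)
    n_train = int(train_frac * n)  # rounding down is an assumption
    n_prune = int(prune_frac * n)
    n_test = int(test_frac * n)

    # Consecutive slices of one shuffled list can never overlap.
    train = shuffled[:n_train]
    prune = shuffled[n_train:n_train + n_prune]
    test = shuffled[n_train + n_prune:n_train + n_prune + n_test]
    return train, prune, test
```

When the fractions sum to less than 1.0, the examples left over after
the third slice are simply not used.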

USAGE #2: The examples for each set are given in separate files:

dt -tpt <trainfile> <prunefile> <testfile>
dt -tp <trainfile> <prunefile>
dt -tt <trainfile> <testfile>

This form lets you specify exactly which examples are used for
training, which for pruning, and which for testing. This is useful
for understanding how pruning works.

*************
* EXAMPLES: *
*************

dt 1 0 0 tennis.ssv

This will cause dt to use 100% of the available examples in tennis.ssv
to train the decision tree. No pruning or testing will be done.

dt .4 0 .3 tennis.ssv

This will cause dt to use 40% of the available examples in tennis.ssv
to train the decision tree and a separate 30% of the available
examples as a test set to evaluate the final learned tree. No pruning
will be done.

dt .4 .3 .3 tennis.ssv

This will cause dt to use 40% of the available examples in tennis.ssv
to train the decision tree, another 30% for post-pruning, and the
remaining 30% as a test set to evaluate the final learned tree.

dt -s 123456 .4 .3 .3 tennis.ssv

This will do exactly the same thing as before, except the seed 123456
will be used to ensure repeatable random number generation. Use this
argument when you want to make sure that the data is split into the
training, pruning, and test sets the same way every time.

NOTE: Running the program with the wrong number of arguments, or
fractions not in the range [0.0, 1.0], or fractions summing up to more
than 1.0 will cause the program to abort with a message displaying its
usage.
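
The conditions on the fractions can be written down as a short check.
The Python sketch below is only an illustration: the function name
fractions_are_valid is hypothetical, and the small tolerance on the
sum is an assumption added to absorb floating-point rounding (the
README does not say whether dt does anything similar).

```python
def fractions_are_valid(train_frac, prune_frac, test_frac):
    """Return True if the three fractions satisfy the rules in the NOTE."""
    fractions = (train_frac, prune_frac, test_frac)
    # Each fraction must lie in the range [0.0, 1.0].
    in_range = all(0.0 <= f <= 1.0 for f in fractions)
    # The sum may be at most 1.0; the tolerance is an assumption.
    return in_range and sum(fractions) <= 1.0 + 1e-9
```

With this check, the fractions from "dt .4 .3 .3 tennis.ssv" pass,
while a combination such as .6 .3 .3 is rejected because it sums to
more than 1.0.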

**************
* BATCH MODE *
**************

Example:

dt -b 100 .4 .3 .3 tennis.ssv

This will run the decision tree learner 100 times (using 40% of the
data for training, 30% for pruning, and 30% for test) using a
different random split of the data each time. Instead of reporting
individual trees and statistics, only aggregate numbers will be
displayed - the mean and standard deviation for the number of nodes in
the tree, the training accuracy, and the test accuracy.

Use this mode when you want to compare particular parameter settings -
a batch size of at least 100 will ensure a reasonable level of
reliability.
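
To make the aggregate numbers concrete, here is a minimal Python
sketch of how per-run results could be reduced to a mean and standard
deviation. It is not taken from dt: run_batch and run_once are
hypothetical names, run_once stands in for one complete
train/prune/test cycle on a fresh random split, and the use of the
sample standard deviation (dividing by n-1) is an assumption, since
the README does not say which form dt reports.

```python
import statistics

def run_batch(run_once, num_runs):
    """Call run_once(run_index) num_runs times and aggregate the results.

    run_once is a hypothetical callable returning a tuple
    (num_nodes, train_accuracy, test_accuracy) for a single run.
    """
    results = [run_once(i) for i in range(num_runs)]
    names = ("nodes", "train_acc", "test_acc")
    summary = {}
    # zip(*results) turns the list of per-run tuples into one column per statistic.
    for name, column in zip(names, zip(*results)):
        mean = statistics.mean(column)
        # Sample standard deviation; defined as 0.0 for a single run.
        std = statistics.stdev(column) if len(column) > 1 else 0.0
        summary[name] = (mean, std)
    return summary
```

The returned dictionary maps each statistic name to a (mean, standard
deviation) pair, which mirrors the three quantities reported in batch
mode.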

*******************
* SSV FILE FORMAT *
*******************

All data files use the SSV file format. It is a simple text format,
consisting of lines of either administrative information (the
"header", first 3 lines), or data lines (the rest). Each line
consists of a number of words, separated by any number of spaces or
tabs. A line cannot itself contain a newline, so each header line and
each example must fit on a single line.

Header (first 3 lines):

The first line contains two numbers: the number of fields
(attributes, target attribute included) and the number 0
(included for reasons of backwards compatibility - please do not
modify or remove). The second line contains as many words as there
are fields. Each word is the name of an attribute. The third
line contains as many characters as there are attributes. Each
character is either 'c' (continuous attribute), 'b' (binary, 0/1
attribute) or 'd' (discrete attribute, more than two alternatives).

Data (rest):

The rest of the file contains the data, with one example per line.
Note that binary attributes can only be represented with the two
numbers 0 and 1. Discrete attributes can contain an arbitrary
number of values, each corresponding to a different string. The
"dt" program automatically deduces the cardinality of each discrete
attribute.

NOTE: the target attribute is ALWAYS the first column and can only
be binary.

Note that this is a rigid format, and you should make sure to follow
it if you decide to add additional data.
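
The rules above can be summarized in a small reader. The Python sketch
below is an illustration only, not the parser used by dt. The function
name parse_ssv, skipping blank lines, and accepting the type
characters either separated by spaces or run together are assumptions.
The sample data and attribute names are made up for the example and
are not the contents of tennis.ssv.

```python
def parse_ssv(text):
    """Parse SSV text into (names, types, rows)."""
    # Words may be separated by any number of spaces or tabs.
    lines = [line.split() for line in text.splitlines() if line.strip()]
    if len(lines) < 3:
        raise ValueError("an SSV file needs a 3-line header")

    num_fields = int(lines[0][0])    # the second number is the legacy 0
    names = lines[1]
    types = list("".join(lines[2]))  # accepts both "b d c" and "bdc"
    if len(names) != num_fields or len(types) != num_fields:
        raise ValueError("header does not match the declared number of fields")
    if any(t not in "cbd" for t in types):
        raise ValueError("attribute types must be c, b, or d")
    if types[0] != "b":
        raise ValueError("the target (first column) must be binary")

    rows = []
    for words in lines[3:]:
        if len(words) != num_fields:
            raise ValueError("wrong number of values in a data line")
        row = []
        for word, kind in zip(words, types):
            if kind == "c":
                row.append(float(word))
            elif kind == "b":
                if word not in ("0", "1"):
                    raise ValueError("binary attributes must be 0 or 1")
                row.append(int(word))
            else:
                row.append(word)     # discrete values stay as strings
        rows.append(row)
    return names, types, rows

# Hypothetical sample: binary target first, then one discrete and one
# continuous attribute.
sample = """3 0
play outlook temperature
b d c
1 sunny 71.5
0 rain 64
"""
names, types, rows = parse_ssv(sample)
```

Checking the header against the declared number of fields before
reading any data makes a malformed file fail early, which suits a
format this rigid.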

***********
* THE END *
***********
