pwc2 / decision-trees

Implementation of decision trees for binary categorical data using numpy. Includes regular decision trees, random forest, and boosted trees.

decision-trees

decision-trees contains implementations of the decision tree classifier, random forest, and boosted trees (with AdaBoost) for binary categorical data.

Requirements:

  • numpy 1.17.3

  • pandas 0.25.2

Usage:

For a single decision tree:

import pandas as pd

from models.decision_tree import DecisionTreeClassifier

train_set = pd.read_csv('data/pa3_train.csv')
validation_set = pd.read_csv('data/pa3_val.csv')
test_set = pd.read_csv('data/pa3_test.csv')

tree = DecisionTreeClassifier(train=train_set, validation=validation_set, test=test_set,
                              label='class', max_depth=2)
results = tree.train()
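
To illustrate what growing a tree on binary features involves, here is a minimal sketch of choosing a split with numpy. It assumes features and labels are encoded as 0/1 and uses Gini impurity as the criterion; the function names `gini` and `best_binary_split` are hypothetical, and the criterion this repository actually uses may differ.

```python
import numpy as np


def gini(y):
    """Gini impurity of a 0/1 label vector (0.0 for an empty vector)."""
    if y.size == 0:
        return 0.0
    p = y.mean()  # fraction of positive labels
    return 2.0 * p * (1.0 - p)


def best_binary_split(X, y):
    """Return (feature index, impurity reduction) for the best split.

    X is an (n_samples, n_features) array of 0/1 values and y is an
    (n_samples,) array of 0/1 labels. Returns (None, 0.0) when no
    feature reduces impurity.
    """
    n = y.size
    parent = gini(y)
    best_feature, best_gain = None, 0.0
    for j in range(X.shape[1]):
        mask = X[:, j] == 1
        left, right = y[mask], y[~mask]
        if left.size == 0 or right.size == 0:
            continue  # this feature does not separate the samples
        # Weighted average impurity of the two children
        child = (left.size * gini(left) + right.size * gini(right)) / n
        gain = parent - child
        if gain > best_gain:
            best_feature, best_gain = j, gain
    return best_feature, best_gain


# Toy data: feature 0 separates the classes perfectly, feature 1 does not.
X = np.array([[1, 0], [1, 1], [0, 0], [0, 1]])
y = np.array([1, 1, 0, 0])
feature, gain = best_binary_split(X, y)
```

A depth-limited tree would apply this selection recursively to each child until max_depth is reached or no split reduces impurity.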

For a random forest:

import pandas as pd

from models.random_forest import RandomForestClassifier

train_set = pd.read_csv('data/pa3_train.csv')
validation_set = pd.read_csv('data/pa3_val.csv')
test_set = pd.read_csv('data/pa3_test.csv')

rf = RandomForestClassifier(train=train_set, validation=validation_set, test=test_set,
                             label='class', n_trees=5, n_features=5, seed=1, max_depth=2)
results = rf.train()
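
As a rough illustration of what n_trees, n_features, and seed could control, the sketch below draws a bootstrap sample and a random feature subset for each tree, then combines per-tree predictions by majority vote. This is an assumption about the general technique, not a description of this repository's code: `bootstrap_plan` and `majority_vote` are hypothetical names, and many random forest implementations resample features at every split rather than once per tree.

```python
import numpy as np


def bootstrap_plan(n_samples, n_total_features, n_trees=5, n_features=5, seed=1):
    """Return a list of (row_indices, feature_indices), one pair per tree."""
    rng = np.random.default_rng(seed)  # seeded for reproducibility
    plan = []
    for _ in range(n_trees):
        # Rows are drawn with replacement (bootstrap sample)
        rows = rng.integers(0, n_samples, size=n_samples)
        # Features are drawn without replacement
        cols = rng.choice(n_total_features, size=n_features, replace=False)
        plan.append((rows, cols))
    return plan


def majority_vote(predictions):
    """Combine an (n_trees, n_samples) array of 0/1 votes; ties go to 1."""
    predictions = np.asarray(predictions)
    return (predictions.mean(axis=0) >= 0.5).astype(int)


plan = bootstrap_plan(n_samples=100, n_total_features=20)
votes = majority_vote([[1, 0, 1], [1, 0, 0], [0, 0, 1]])
```

Calling `bootstrap_plan` twice with the same seed yields the same samples, which is the usual reason for exposing a seed parameter.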

For boosted trees with AdaBoost:

import pandas as pd

from models.adaboost import AdaBoostClassifier

train_set = pd.read_csv('data/pa3_train.csv')
validation_set = pd.read_csv('data/pa3_val.csv')
test_set = pd.read_csv('data/pa3_test.csv')

boosted_trees = AdaBoostClassifier(train=train_set, validation=validation_set, test=test_set,
                                     label='class', n_classifiers=5, max_depth=2)
results = boosted_trees.train()
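
For readers unfamiliar with AdaBoost, here is a minimal sketch of a single boosting round in its standard discrete form: compute the weighted error of a weak classifier, derive its vote weight, and re-weight the samples. It assumes labels mapped to -1/+1; `adaboost_round` is a hypothetical helper and is not taken from this repository.

```python
import numpy as np


def adaboost_round(weights, y_true, y_pred):
    """One discrete AdaBoost round with labels in {-1, +1}.

    Returns (alpha, new_weights), where alpha is the classifier's vote
    weight and new_weights are the normalized sample weights.
    """
    miss = y_true != y_pred
    err = np.sum(weights[miss]) / np.sum(weights)
    # Clip to avoid division by zero or log(0) for perfect/useless learners
    err = np.clip(err, 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)
    # Misclassified samples (y_true * y_pred == -1) gain weight
    new_weights = weights * np.exp(-alpha * y_true * y_pred)
    return alpha, new_weights / new_weights.sum()


weights = np.full(4, 0.25)
y_true = np.array([1, 1, -1, -1])
y_pred = np.array([1, 1, -1, 1])  # last sample is misclassified
alpha, new_weights = adaboost_round(weights, y_true, y_pred)
```

In this toy case the single misclassified sample ends up carrying half of the total weight, so the next weak classifier is pushed to get it right.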

Data:

The data/ folder contains .csv files with training, validation, and test sets.
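
Since the models expect binary categorical data, it can help to check a dataset before training. The sketch below assumes the values are encoded as 0/1, which the source does not state explicitly; `non_binary_columns` and the feature column names are hypothetical, while 'class' is the label column used in the usage examples.

```python
import pandas as pd


def non_binary_columns(df):
    """Return the names of columns holding values other than 0 and 1."""
    return [c for c in df.columns if not df[c].dropna().isin([0, 1]).all()]


# Small in-memory stand-in for a frame loaded with pd.read_csv
demo = pd.DataFrame({
    "feat_a": [0, 1, 1],
    "feat_b": [0, 2, 1],  # contains a non-binary value
    "class": [1, 0, 1],
})
bad = non_binary_columns(demo)
```

The same check could be run on each frame returned by pd.read_csv before it is passed to a classifier.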

To run models:

  • run_part1.py creates decision trees with varied depths.
  • run_part2.py creates random forests with varied parameters.
  • run_part3.py creates boosted trees with varied parameters.

python main.py runs all three parts in order; output is saved in the model_output folder.

Future improvements:

  • Refactor AdaBoostClassifier and RandomForestClassifier classes to inherit attributes from DecisionTreeClassifier class.
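
One possible shape for that refactor is sketched below. The class names `DecisionTreeBase`, `RandomForestSketch`, and `AdaBoostSketch` and the attribute names are hypothetical stand-ins; only the constructor parameters mirror the usage examples above, and the real classes may store their state differently.

```python
class DecisionTreeBase:
    """Hypothetical stand-in for the state DecisionTreeClassifier holds."""

    def __init__(self, train, validation, test, label, max_depth=2):
        self.train_set = train
        self.validation_set = validation
        self.test_set = test
        self.label = label
        self.max_depth = max_depth


class RandomForestSketch(DecisionTreeBase):
    """Adds only the forest-specific parameters to the shared state."""

    def __init__(self, train, validation, test, label,
                 n_trees=5, n_features=5, seed=1, max_depth=2):
        super().__init__(train, validation, test, label, max_depth=max_depth)
        self.n_trees = n_trees
        self.n_features = n_features
        self.seed = seed


class AdaBoostSketch(DecisionTreeBase):
    """Adds only the boosting-specific parameter to the shared state."""

    def __init__(self, train, validation, test, label,
                 n_classifiers=5, max_depth=2):
        super().__init__(train, validation, test, label, max_depth=max_depth)
        self.n_classifiers = n_classifiers


# Datasets are omitted here; only the attribute wiring is demonstrated.
forest = RandomForestSketch(train=None, validation=None, test=None, label='class')
boosted = AdaBoostSketch(train=None, validation=None, test=None, label='class')
```

With this layout the dataset and label handling would live in one place, and each ensemble class would declare only what is specific to it.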

Languages

Python 88.4%, Jupyter Notebook 11.6%