zshwuhan / HoeffdingTree

Incremental classification and regression tree learning algorithm.

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

HoeffdingTree

Implementation of algorithms for incremental classification [1,2] and regression [3,6] tree learning from time-changing data streams. See [8] for detailed description of the main ideas behind these algorithms.

As of this writing this is still work in progress. The plan is to integrate this into QMiner [9] soon. (Note that this code requires GLib.)

Note: If you'll try to build this, note that GLib uses v0.10 version of libuv; so run git clone -b v0.10 https://github.com/joyent/libuv.git in case you need to build it

Note: We started integrating this code into QMiner.

Note: I will no longer maintain this repository. As of now the learner will be availble through QMiner. I might clean up the code so others can use it, but I believe it will be much, much nicer when exposed through QMiner Javascript API. It will be easy to use, assume zero C++ knowledge, require no recompilation, etc. See QMiner documentation for details.

Note: Check out incremental classification and regression tree learning in QMiner here. User code is in src/ht.js. (This is a working version on my fork. For the stable one go here.)

Simple usage example

Configuration file describes data stream. Below is a simple config example.

dataFormat: (status,age,sex,survived)
status: discrete(first,second,third,crew)
age: discrete(adult,child)
sex: discrete(male,female)
survived: discrete(yes,no)

Future version of the HoeffdingTree --- to be available in QMiner --- will accept JSON configuration.

{
	"dataFormat": ["status", "sex", "age", "survived"],
	"sex": {
		"type": "discrete",
		"values": ["male", "female"]
	},
	"status": {
		"type": "discrete",
		"values": ["first", "second", "third", "crew"]
	},
	"age": {
		"type": "discete",
		"values": ["child", "adult"]
	},
	"survived": {
		"type": "discrete",
		"values": ["yes", "no"]
	}
}

The goal is to expose it through QMiner Javascript API. (This will happen the next few days.)

Simple usage example.

// ... 
int main(int argc, char** argv) {
	// create environment
	Env = TEnv(argc, argv, TNotify::StdNotify);
	// pass some parameters 
	const bool ConceptDriftP = Env.IsArgStr("drift"); // use version that handles concept-drift? 
	const double SplitConfidence = Env.GetIfArgPrefixFlt("-splitConfidence:", 1e-6, "Split confidence"); // 1e-6 
	const double TieBreaking = Env.GetIfArgPrefixFlt("-tieBreaking:", 0.01, "Tie breaking"); // 1e-2 
	const TStr ConfigFNm = Env.GetIfArgPrefixStr("-config:", "titanic.config", "Config file");
	const TStr AttrHeuristic= Env.GetIfArgPrefixStr("-attrEval:", "InfoGain", "Attribute evaluation heuristic");
	const int GracePeriod = Env.GetIfArgPrefixInt("-gracePeriod:", 300, "Grace period"); // 3e2 
	const int DriftCheck = Env.GetIfArgPrefixInt("-driftCheck:", 10000, "Drift check"); // 1e4 
	const int WindowSize = Env.GetIfArgPrefixInt("-windowSize:", 100000, "Window size"); // 1e5 
	
	// usage example 
	PHoeffdingTree ht = THoeffdingTree::New("docs/" + ConfigFNm, GracePeriod, SplitConfidence, TieBreaking,
		DriftCheck, WindowSize);
	ht->SetAdaptive(ConceptDriftP);
	ProcessData("data/titanic-220M.dat", ht);
	ht->Export("exports/titanic-"+TInt(ht->ExportN).GetStr()+".gv", TExportType::DOT);
	return 0;
}
// process data line-by-line 
void ProcessData(const TStr& FileNm, PHoeffdingTree HoeffdingTree) {
	Assert(TFile::Exists(FileNm));
	TFIn FIn(FileNm);
	TStr Line;
	while (FIn.GetNextLn(Line)) { HoeffdingTree->Process(Line, ','); }
}

References

About

Incremental classification and regression tree learning algorithm.

License:GNU General Public License v3.0


Languages

Language:C++ 97.4%Language:Python 2.4%Language:Batchfile 0.2%