Implementation of algorithms for incremental classification [1,2] and regression [3,6] tree learning from time-changing data streams. See [8] for detailed description of the main ideas behind these algorithms.
As of this writing this is still work in progress. The plan is to integrate this into QMiner [9] soon. (Note that this code requires GLib.)
Note: If you'll try to build this, note that GLib uses v0.10 version of libuv; so run git clone -b v0.10 https://github.com/joyent/libuv.git
in case you need to build it
Note: We started integrating this code into QMiner.
Note: I will no longer maintain this repository. As of now the learner will be availble through QMiner. I might clean up the code so others can use it, but I believe it will be much, much nicer when exposed through QMiner Javascript API. It will be easy to use, assume zero C++ knowledge, require no recompilation, etc. See QMiner documentation for details.
Note: Check out incremental classification and regression tree learning in QMiner here. User code is in src/ht.js
. (This is a working version on my fork. For the stable one go here.)
Configuration file describes data stream. Below is a simple config example.
dataFormat: (status,age,sex,survived)
status: discrete(first,second,third,crew)
age: discrete(adult,child)
sex: discrete(male,female)
survived: discrete(yes,no)
Future version of the HoeffdingTree --- to be available in QMiner --- will accept JSON configuration.
{
"dataFormat": ["status", "sex", "age", "survived"],
"sex": {
"type": "discrete",
"values": ["male", "female"]
},
"status": {
"type": "discrete",
"values": ["first", "second", "third", "crew"]
},
"age": {
"type": "discete",
"values": ["child", "adult"]
},
"survived": {
"type": "discrete",
"values": ["yes", "no"]
}
}
The goal is to expose it through QMiner Javascript API. (This will happen the next few days.)
Simple usage example.
// ...
int main(int argc, char** argv) {
// create environment
Env = TEnv(argc, argv, TNotify::StdNotify);
// pass some parameters
const bool ConceptDriftP = Env.IsArgStr("drift"); // use version that handles concept-drift?
const double SplitConfidence = Env.GetIfArgPrefixFlt("-splitConfidence:", 1e-6, "Split confidence"); // 1e-6
const double TieBreaking = Env.GetIfArgPrefixFlt("-tieBreaking:", 0.01, "Tie breaking"); // 1e-2
const TStr ConfigFNm = Env.GetIfArgPrefixStr("-config:", "titanic.config", "Config file");
const TStr AttrHeuristic= Env.GetIfArgPrefixStr("-attrEval:", "InfoGain", "Attribute evaluation heuristic");
const int GracePeriod = Env.GetIfArgPrefixInt("-gracePeriod:", 300, "Grace period"); // 3e2
const int DriftCheck = Env.GetIfArgPrefixInt("-driftCheck:", 10000, "Drift check"); // 1e4
const int WindowSize = Env.GetIfArgPrefixInt("-windowSize:", 100000, "Window size"); // 1e5
// usage example
PHoeffdingTree ht = THoeffdingTree::New("docs/" + ConfigFNm, GracePeriod, SplitConfidence, TieBreaking,
DriftCheck, WindowSize);
ht->SetAdaptive(ConceptDriftP);
ProcessData("data/titanic-220M.dat", ht);
ht->Export("exports/titanic-"+TInt(ht->ExportN).GetStr()+".gv", TExportType::DOT);
return 0;
}
// process data line-by-line
void ProcessData(const TStr& FileNm, PHoeffdingTree HoeffdingTree) {
Assert(TFile::Exists(FileNm));
TFIn FIn(FileNm);
TStr Line;
while (FIn.GetNextLn(Line)) { HoeffdingTree->Process(Line, ','); }
}
- [1] Domingos and Hulten, Mining high-speed data streams, KDD`00
- [2] Hulten et al., Mining time-changing data streams, KDD`01
- [3] Ikonomovska et al., Learning model trees from evolving data streams, Data Mining and Knowledge Discovery (2011)
- [4] Gama et al., On evaluating stream learning algorithms, Machine Learning (2013)
- [5] Gama et al., Issues in Evaluation of Stream Learning Algorithms, KDD`09
- [6] Elena Ikonomovska, Algorithms for Learning Regression Trees and Ensembles on Evolving Data Streams, PhD thesis (2012)
- [7] Wassily Hoeffding, Probability Inequalities for Sums of Bounded Random Variables, Journal of the American Statistical Association (1963)
- [8] Blaz Sovdat, Algorithms for incremental learning of decision trees from time-changing data streams, Undergraduate thesis (2013)
- [9] Blaz Fortuna, Jan Rupnik, and others. QMiner: Data Analytics Platform for Processing Streams of Structured and Unstructured Data, NIPS`14.