blip
This is the "Bayesian network Learning Improved Project" (blip), an open-source Java package that offers a wide range of structure learning algorithms. It is developed by Mauro Scanagatta and distributed under the LGPL-3 by IDSIA.
It focuses on score-based learning, mainly with the BIC and BDeu score functions, and allows the user to learn BNs from datasets containing thousands of variables. It provides state-of-the-art algorithms for the following tasks: parent set identification (BIC*), general structure optimization (WINASOBS-ENT), bounded-treewidth structure optimization (KMAX), and structure learning on incomplete data sets (SEM-KMAX).
An R binding is also available at https://github.com/mauro-idsia/r.blip.
References
This package implements the algorithms detailed in the following papers:
- Learning Bayesian Networks with Thousands of Variables (NIPS 2015) Mauro Scanagatta, Giorgio Corani, Cassio P. de Campos, Marco Zaffalon
- Learning Treewidth-Bounded Bayesian Networks with Thousands of Variables (NIPS 2016) Mauro Scanagatta, Giorgio Corani, Cassio P. de Campos, Marco Zaffalon
- Efficient learning of bounded-treewidth Bayesian networks from complete and incomplete data sets (IJAR 2018) - supplementary material
- Improved Local Search in Bayesian Networks Structure Learning (AMBN 2017)
- Approximated Structural Learning for Large Bayesian Networks (ECML PKDD 2018) - supplementary material
Usage
The process of learning a bounded-treewidth BN is illustrated below, using the "child" network as an example.
Dataset format
The input dataset must follow the same format as the file "child-5000.dat", namely a space-separated file containing:
* First line: list of variable names, separated by spaces;
* Second line: list of variable cardinalities, separated by spaces;
* Following lines: list of values taken by the variables in each datapoint, separated by spaces.
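As an illustration of this layout, the following minimal sketch renders a tiny dataset as ".dat" lines. The class DatWriter and its methods are hypothetical helpers, not part of blip, and the sketch assumes that each value is written as an integer state index; check "child-5000.dat" for the exact value encoding expected by the tool.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.StringJoiner;

/** Hypothetical helper (not part of blip) that renders a dataset in the .dat layout. */
class DatWriter {

    /** Joins integer values with single spaces. */
    static String joinInts(int[] values) {
        StringJoiner joiner = new StringJoiner(" ");
        for (int v : values) {
            joiner.add(Integer.toString(v));
        }
        return joiner.toString();
    }

    /** Builds the file lines: names, cardinalities, then one line per datapoint. */
    static List<String> toDatLines(String[] names, int[] cardinalities, int[][] rows) {
        if (names.length != cardinalities.length) {
            throw new IllegalArgumentException("one cardinality per variable is required");
        }
        List<String> lines = new ArrayList<>();
        lines.add(String.join(" ", names));   // first line: variable names
        lines.add(joinInts(cardinalities));   // second line: cardinalities
        for (int[] row : rows) {
            if (row.length != names.length) {
                throw new IllegalArgumentException("each datapoint needs one value per variable");
            }
            // Assumption: values are integer-coded states.
            lines.add(joinInts(row));
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        String[] names = {"A", "B", "C"};
        int[] cardinalities = {2, 3, 2};
        int[][] rows = {{0, 2, 1}, {1, 0, 0}};
        List<String> lines = toDatLines(names, cardinalities, rows);
        lines.forEach(System.out::println);
        // Write to disk only when an output path is given on the command line.
        if (args.length > 0) {
            Files.write(Paths.get(args[0]), lines, StandardCharsets.UTF_8);
        }
    }
}
```

The length checks reject rows that do not supply exactly one value per variable, which is the most likely way to produce a malformed file.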
Parent set identification
The first step is to build the parent set score cache. The state-of-the-art approach is to use BIC* (for the BIC score):
java -jar blip.jar scorer.is -d data/child-5000.dat -j data/child-5000.jkl -t 10 -b 0
Main options:
- -d VAL : Datafile input path (.dat format)
- -j VAL : Parent set scores output file (.jkl format)
- -t N : Maximum time limit, in seconds (default: 10)
- -b N : Number of machine cores to use - if 0, all are used (default: 1)
General structure optimization
Once the parent set score cache is available, the next step is to learn the structure. The state-of-the-art approach is to use WINASOBS (Windows operator applied to ASOBS) with ENT (entropy-based) ordering:
java -jar blip.jar solver.winasobs.adv -smp ent -d data/child-5000.dat -j data/child-5000.jkl -r data/child.wa.res -t 10 -b 0
Main options:
- -smp VAL : Advanced sampler (possible values: std, mi, ent, r_mi, r_ent)
- -d VAL : Datafile input path (.dat format)
- -j VAL : Parent set scores input file (.jkl format)
- -r VAL : Structure output file (.res format)
- -t N : Maximum time limit, in seconds (default: 10)
- -b N : Number of machine cores to use - if 0, all are used (default: 1)
Bounded-treewidth structure optimization
Given the parent set score cache, it is possible to learn a structure under a bounded-treewidth constraint. The state-of-the-art approach is to use k-max:
java -jar blip.jar solver.kmax -w 4 -j data/child-5000.jkl -r data/child-5000.kmax.res -t 10 -b 0
Main options:
- -w N : Maximum treewidth allowed
- -j VAL : Parent set scores input file (.jkl format)
- -r VAL : Structure output file (.res format)
- -t N : Maximum time limit, in seconds (default: 10)
- -b N : Number of machine cores to use - if 0, all are used (default: 1)
Structure learning from incomplete data sets
To learn a structure from data containing missing values, the state-of-the-art approach is to use SEM-kMAX:
java -jar blip.jar imputation.sem -d data/child-5000-missing.dat -o data/child-5000-imputed.dat -r data/child.res -t 1 -tmp data/tmp -w 6 -b 0
Main options:
- -d VAL : Datafile (with missing values) input path (.dat format)
- -o VAL : Datafile (with imputed values) output path (.dat format)
- -r VAL : Structure output file (.res format)
- -t N : Time regulation parameter (default: 1)
- -tmp VAL : Temporary directory
- -w N : Learning treewidth (default: 6)
- -b N : Number of machine cores to use - if 0, all are used (default: 1)
Interpreting the result
The format of the ".res" file is as follows: each line gives the parent set assigned to one variable, together with its score.
For example, the line "4: -2797.39 (10,17,18)" indicates that the variable with index 4 in the dataset is assigned the variables with indices 10, 17 and 18 as parents. This parent set has score -2797.39 (by default the score function is BIC).
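To make the line layout concrete, here is a minimal sketch of a parser for a single ".res" line. The class ResEntry is a hypothetical helper, not part of blip. It assumes the layout of the example line (index, colon, score, parenthesised comma-separated parent indices); treating a missing or empty parenthesis group as "no parents" is also an assumption about how parentless variables are written.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of blip): one parsed line of a .res file. */
class ResEntry {
    final int variable;          // index of the variable in the dataset
    final double score;          // score of the chosen parent set
    final List<Integer> parents; // indices of the parent variables

    // index ":" score, then an optional "(p1,p2,...)" group.
    private static final Pattern LINE = Pattern.compile(
            "^\\s*(\\d+)\\s*:\\s*([-+0-9.eE]+)\\s*(?:\\(([^)]*)\\))?\\s*$");

    ResEntry(int variable, double score, List<Integer> parents) {
        this.variable = variable;
        this.score = score;
        this.parents = parents;
    }

    /** Parses a line such as "4: -2797.39 (10,17,18)". */
    static ResEntry parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unrecognized .res line: " + line);
        }
        int variable = Integer.parseInt(m.group(1));
        double score = Double.parseDouble(m.group(2));
        List<Integer> parents = new ArrayList<>();
        String group = m.group(3);
        // Assumption: an absent or empty group means the variable has no parents.
        if (group != null && !group.trim().isEmpty()) {
            for (String p : group.split(",")) {
                parents.add(Integer.parseInt(p.trim()));
            }
        }
        return new ResEntry(variable, score, parents);
    }

    public static void main(String[] args) {
        ResEntry e = parse("4: -2797.39 (10,17,18)");
        System.out.println(e.variable + " <- " + e.parents + " score " + e.score);
    }
}
```

Mapping the indices back to names only requires the first line of the ".dat" file, since the indices refer to the column order of the dataset.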
Learn the parameters
Using the structure found, it is possible to learn the parameters with:
java -jar blip.jar parle -d data/child-5000.dat -r data/child-5000.kmax.res -n data/child-5000.kmax.uai
Main options:
- -d VAL : Datafile input path (.dat format)
- -r VAL : Structure input file (.res format)
- -n VAL : BN output file (.uai format)
The final output will be a full Bayesian network in UAI format.
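The bounded-treewidth workflow (parent set scoring, k-max, parameter learning) can be chained from Java. The sketch below only assembles and optionally runs the same command lines shown in this document. The class BlipPipeline, its method names and the "--run" switch are hypothetical, and stopping on a non-zero exit code assumes that blip signals failure that way, which this document does not state.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical driver (not part of blip) chaining scorer.is, solver.kmax and parle. */
class BlipPipeline {

    /** Builds "java -jar <jar> <args...>" as an argument list. */
    static List<String> blip(String jar, String... args) {
        List<String> cmd = new ArrayList<>(Arrays.asList("java", "-jar", jar));
        cmd.addAll(Arrays.asList(args));
        return cmd;
    }

    /** Commands for the three steps, using the file naming of the examples above. */
    static List<List<String>> boundedTreewidthSteps(String jar, String base,
                                                    int treewidth, int seconds) {
        String dat = base + ".dat";
        String jkl = base + ".jkl";
        String res = base + ".kmax.res";
        String uai = base + ".kmax.uai";
        String t = Integer.toString(seconds);
        List<List<String>> steps = new ArrayList<>();
        steps.add(blip(jar, "scorer.is", "-d", dat, "-j", jkl, "-t", t, "-b", "0"));
        steps.add(blip(jar, "solver.kmax", "-w", Integer.toString(treewidth),
                "-j", jkl, "-r", res, "-t", t, "-b", "0"));
        steps.add(blip(jar, "parle", "-d", dat, "-r", res, "-n", uai));
        return steps;
    }

    /** Runs each step in order. Assumption: a non-zero exit code means failure. */
    static void run(List<List<String>> steps) throws IOException, InterruptedException {
        for (List<String> step : steps) {
            int exit = new ProcessBuilder(step).inheritIO().start().waitFor();
            if (exit != 0) {
                throw new IllegalStateException("Step failed: " + String.join(" ", step));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        List<List<String>> steps = boundedTreewidthSteps("blip.jar", "data/child-5000", 4, 10);
        for (List<String> step : steps) {
            System.out.println(String.join(" ", step));
        }
        // Execute only when explicitly requested; otherwise just print the commands.
        if (args.length > 0 && args[0].equals("--run")) {
            run(steps);
        }
    }
}
```

Printing the commands by default makes it easy to check them against the examples above before running anything.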