atharvas / autoppl

C++ template library for probabilistic programming

Home Page: https://jamesyang007.github.io/autoppl/


This is a private fork intended for my own personal use. Please go to https://github.com/JamesYang007/autoppl to view the most up-to-date information on this project.

AutoPPL

Table of Contents

  • Overview
  • Features
  • Design Choices
  • Installation (dev)
  • Examples
  • Contributors
  • Third Party Tools

Overview

AutoPPL is a C++ template library providing high-level support for probabilistic programming. Using operator overloading and expression templates, AutoPPL provides a generic framework for specifying probabilistic models and applying inference algorithms.

Features

Distributions

Built-in distributions include the normal and uniform distributions (ppl::normal, ppl::uniform) used throughout the examples below.

Sampling Algorithms

Supported samplers include Metropolis-Hastings (ppl::mh) and the No-U-Turn Sampler (ppl::nuts). See Examples for more detail on how to use these sampling algorithms.

Design Choices

Intuitive Model Specification

In Bayesian statistics, the first step in performing probabilistic inference is to specify a generative model for the data of interest. For example, in mathematical notation, a Gaussian model could look like the following:

X ~ N(mu, I)
mu ~ U(-1, 1)

Given its simplicity and expressiveness, we wanted to mimic this compact notation as much as possible for users of our library.

For the model specified above, the code in AutoPPL could look like the following:

ppl::Data<double> X;
ppl::Param<double> mu;
auto model = (
    mu |= ppl::uniform(-1., 1.),
    X |= ppl::normal(mu, 1.)
);

A ppl::Param<T> stores a univariate parameter of our model, while a ppl::Data<T> holds samples that have been observed from the modeled distribution. The goal of Bayesian statistics is to sample from the posterior distribution given the observed data, or to maximize the likelihood of the data over a set of possible parameters. We support these methods using advanced modeling and sampling algorithms.

Efficient Memory Usage

We assume that users are able to specify the probabilistic model at compile time. As a result, AutoPPL can construct all model expressions at compile time purely from type information using expression templates. A model expression is simply a binary tree that relates variables with distributions, which may in turn depend on other variables.

As an example, a model expression such as model in the previous section is a binary tree, constructed at compile time, that stores information about the relationship between mu, X, and their (conditional) distributions.

A model object makes no heap allocations and is extremely small: it is simply a compact, contiguous slab of memory representing the binary tree. The model object in the previous section is about 48 bytes on x86_64-apple-darwin17.7.0 using clang-10.
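The exact figure is platform- and compiler-dependent, but it can be checked directly with sizeof; a minimal sketch (the 64-byte bound here is an illustrative assumption, not a library guarantee):

// Illustrative only: exact sizes vary across platforms and compilers.
ppl::Data<double> X;
ppl::Param<double> mu;
auto model = (
    mu |= ppl::uniform(-1., 1.),
    X |= ppl::normal(mu, 1.)
);
static_assert(sizeof(model) <= 64,
              "expected a small, stack-allocated model object");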

For a more complicated model such as the following:

ppl::Data<double> X;
std::array<ppl::Param<double>, 6> theta;

auto model = (
    theta[0] |= ppl::uniform(-1., 1.),
    theta[1] |= ppl::uniform(theta[0], theta[0] + 2.),
    theta[2] |= ppl::normal(theta[1], theta[0] * theta[0]),
    theta[3] |= ppl::normal(-2., 1.),
    theta[4] |= ppl::uniform(-0.5, 0.5),
    theta[5] |= ppl::normal(theta[2] + theta[3], theta[4]),
    X |= ppl::normal(theta[5], 1.)
);

The size of the model is 192 bytes on the same machine and compiler.

A model expression simply references the variables it uses, such as theta[0], theta[1], ..., theta[5], and X. It does not store any data; it only serves to relate variables.
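To sketch the underlying technique (a simplified illustration, not AutoPPL's actual internals; the node names below are hypothetical), an expression-template tree can encode its entire shape in nested template types while storing only references:

// Hypothetical, simplified node types illustrating the expression-template
// idea; AutoPPL's real implementation is more elaborate.
template <class Var, class Dist>
struct EqNode {         // represents `var |= dist`
    Var& var;           // reference only: no parameter or data is copied
    const Dist& dist;
};

template <class Left, class Right>
struct GlueNode {       // represents `expr, expr`
    const Left& left;
    const Right& right;
};

// A model like (mu |= uniform(...), X |= normal(...)) then has type
// GlueNode<EqNode<...>, EqNode<...>>: a few references wide, with the
// full tree structure known at compile time.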

High-performance Inference Methods

Users interface with the inference methods via model expressions and other configuration parameters for that particular method. Hence, the inference algorithms are completely general and work with any model so long as the model expression is properly constructed. Moreover, due to the statically-known model specification, algorithms have opportunities to make compile-time optimizations. See Examples for more detail. Drawing 100 samples from the posterior of a hierarchical Gaussian model using a NUTS sampler, we see significant gains over STAN and PyMC3 in both compilation and runtime.

Y ~ N(l1 + l2, sigma) 
l1 ~ N(0, 10)
l2 ~ N(0, 10)
sigma ~ Uniform(0, 20)

AutoPPL

Compilation time: 3.39s
Run time: 0.987008s

STAN

Compilation time: 40.48s
Run time: 0.613672s (Warm-up)
          0.704271s (Sampling)
          1.31794s (Total)

PyMC3

Run time: 38s

Installation (dev)

First, clone the repository:

git clone --recurse-submodules https://github.com/JamesYang007/autoppl

To build and run tests, run the following:

./setup.sh
./clean-build.sh debug
cd build/debug
ctest

Examples

Sampling from Joint Distribution

Although AutoPPL was designed to perform inference on posterior distributions, one can certainly use it to sample from any joint distribution defined by the priors and conditional distributions. As an example, we can sample 1000 points from a standard normal distribution using Metropolis-Hastings in the following way:

std::array<double, 1000> samples;
ppl::Param<double> theta {samples.data()};
auto model = (
    theta |= ppl::normal(0., 1.)
);
ppl::mh(model, 1000);

Note that the parameter theta is bound to the storage samples at construction. After calling mh, the 1000 samples are placed into samples.
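Since the target here is a standard normal, the empirical moments of the draws give a quick sanity check; a minimal post-processing sketch using only the standard library:

#include <numeric>  // std::accumulate

double mean = std::accumulate(samples.begin(), samples.end(), 0.0)
              / samples.size();
double var = 0.;
for (double s : samples) var += (s - mean) * (s - mean);
var /= samples.size() - 1;  // unbiased sample variance
// mean should be near 0 and var near 1 for a standard normal target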

In general, one can sample from a distribution so long as its joint PDF is known, or, equivalently and more commonly, so long as the prior and conditional PDFs are known.

As another example, we may sample from a more complicated joint distribution:

std::array<double, 1000> theta1_samples;
std::array<double, 1000> theta2_samples;
ppl::Param<double> theta1 {theta1_samples.data()};
ppl::Param<double> theta2 {theta2_samples.data()};
auto model = (
    theta1 |= ppl::uniform(-1., 1.),
    theta2 |= ppl::normal(theta1, 1.)
);
ppl::mh(model, 1000);

Sampling Posterior Mean and Standard Deviation

The following is an example of fitting a Gaussian model to some data. We put a Normal(0, 3) prior on the mean and a Uniform(0, 2) prior on the standard deviation. We recommend the state-of-the-art NUTS sampler for sampling from a posterior distribution.

std::array<double, 1000> mu_samples, sigma_samples;

ppl::Data<double> x {1.0, 1.5, 1.7, 1.2, 1.5};
ppl::Param<double> mu {mu_samples.data()};
ppl::Param<double> sigma {sigma_samples.data()};

auto model = (
    mu |= ppl::normal(0., 3.),
    sigma |= ppl::uniform(0., 2.),
    x |= ppl::normal(mu, sigma)
);

size_t warmup = 1000;
size_t n_samples = 1000;
size_t n_adapt = 1000;
ppl::nuts(model, warmup, n_samples, n_adapt);
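Once nuts returns, point estimates follow directly from the bound storage; a sketch of estimating the posterior means (assuming the arrays have been filled as above):

#include <numeric>  // std::accumulate

double mu_mean = std::accumulate(mu_samples.begin(), mu_samples.end(), 0.0)
                 / mu_samples.size();
double sigma_mean = std::accumulate(sigma_samples.begin(), sigma_samples.end(), 0.0)
                    / sigma_samples.size();
// With the five observations above (sample average 1.38), mu_mean should
// land near 1.38, pulled slightly toward the Normal(0, 3) prior mean.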

Bayesian Linear Regression

This example covers ordinary linear regression in a Bayesian setting. We created a fictitious dataset of (x, y) coordinates whose true relationship is y = x + 1. By specifying two parameters for the weight and bias, we propose the following probabilistic model:

y ~ N(wx + b, 0.5)
w ~ U(0, 2)
b ~ U(0, 2)

In AutoPPL, we can write the following code and sample from the posterior:

std::array<double, 1000> w_storage;
std::array<double, 1000> b_storage;

ppl::Data<double> x{2.4, 3.1, 3.6, 4, 4.5, 5.};
ppl::Data<double> y{3.5, 4, 4.4, 5.01, 5.46, 6.1};
ppl::Param<double> w {w_storage.data()};
ppl::Param<double> b {b_storage.data()};

auto model = (
    w |= ppl::uniform(0., 2.),
    b |= ppl::uniform(0., 2.),
    y |= ppl::normal(x * w + b, 0.5)
);

size_t warmup = 1000;
size_t n_samples = 1000;
size_t n_adapt = 1000;
ppl::nuts(model, warmup, n_samples, n_adapt);
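As with the previous example, posterior means for w and b can be read off the storage arrays and used for point prediction; a sketch (x0 is a hypothetical new input):

#include <numeric>  // std::accumulate

double w_mean = std::accumulate(w_storage.begin(), w_storage.end(), 0.0)
                / w_storage.size();
double b_mean = std::accumulate(b_storage.begin(), b_storage.end(), 0.0)
                / b_storage.size();
// Under the y = x + 1 generating process, both means should be close to 1.
double x0 = 3.0;                        // hypothetical new input
double y_pred = w_mean * x0 + b_mean;   // point prediction, roughly 4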

Contributors

  • James Yang (JamesYang007): github.com/JamesYang007
  • Jacob Austin (jacobaustin123): github.com/jacobaustin123
  • Jenny Chen (jenchen1398): github.com/jenchen1398
  • Lucie Le Blanc (lucieleblanc): github.com/lucieleblanc

Third Party Tools

Many thanks to the following third party tools used in this project:

  • Armadillo: matrix library used for inference algorithms.
  • Clang: one of the main compilers used.
  • CMake: build system.
  • Coveralls: checks test coverage.
  • Cpp Coveralls: checks test coverage specifically for C++ code.
  • FastAD: automatic differentiation library.
  • GCC: one of the main compilers used.
  • Google Benchmark: benchmarks library algorithms.
  • GoogleTest: unit/integration tests.
  • Travis CI: continuous integration for Linux using GCC.
  • Valgrind: checks for memory leaks and errors.
