Foundry

Foundry is a package for forging interpretable predictive modeling pipelines with a sklearn style-API. It includes:

A Glm class with a Pytorch backend. This class is highly extensible, supporting (almost) any distribution in pytorch's distributions module.
A preprocessing module that includes helpful classes like DataFrameTransformer and InteractionFeatures.
An evaluation module with tools for interpreting any sklearn-API model via MarginalEffects.

You should use Foundry to augment your workflows if any of the following are true:

You are attempting to model a target that is 'weird': for example, highly skewed data, binomial count-data, censored or truncated data, etc.
You need some help battling some annoying aspects of feature-engineering: for example, you want an expressive way of specifying interaction-terms in your model; or perhaps you just want consistent support for getting feature-names despite being stuck on python 3.7.
You want to interpret your model: for example, perform statistical inference on its parameters, or understand the direction and functional-form of its predictors.

Getting Started

foundry can be installed with pip:

pip install git+https://github.com/strongio/foundry.git#egg=foundry

Let's walk through a quick example:

# data:
from foundry.data import get_click_data
# preprocessing:
from foundry.preprocessing import DataFrameTransformer, SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, PowerTransformer
from sklearn.pipeline import make_pipeline
# glm:
from foundry.glm import Glm
# evaluation:
from foundry.evaluation import MarginalEffects

Here's a dataset of click user pageviews and clicks for domain with lots of pages:

df_train, df_val = get_click_data()
df_train

	attributed_source	user_agent_platform	page_id	page_market	page_feat1	page_feat2	page_feat3	num_clicks	num_views
0	8	Windows	7	b	0.0	0.0	35.0	0.0	32.0
1	8	Windows	7	b	0.0	1.0	0.0	0.0	14.0
2	8	Windows	7	a	0.0	0.0	5.0	0.0	8.0
3	8	Windows	7	a	0.0	0.0	9.0	0.0	7.0
4	8	Windows	7	a	0.0	0.0	20.0	0.0	40.0
...	...	...	...	...	...	...	...	...	...
423188	1	Android	95	f	0.0	0.0	25.0	0.0	1.0
423189	10	Android	26	a	0.0	2.0	7.0	15.0	860.0
423190	10	Android	32	a	0.0	0.0	36.0	37.0	651.0
423191	0	Other	10	b	0.0	0.0	26.0	0.0	1.0
423192	0	Other	31	a	0.0	1.0	34.0	0.0	1.0

423193 rows × 9 columns

We'd like to build a model that let's us predict future click-rates for different pages (page_id), page-attributes (e.g. market), and user-attributes (e.g. platform), and also learn about each of these features -- e.g. perform statistical inference on model-coefficients ("are users with missing user-agent data significantly worse than average?")

Unfortunately, these data don't fit nicely into the typical regression/classification divide: each observations captures counts of clicks and counts of pageviews. Our target is the click-rate (clicks/views) and our sample-weight is the pageviews.

One workaround would be to expand our dataset so that each row indicates is_click (True/False) -- then we could use a standard classification algorithm:

df_train_expanded, df_val_expanded = get_click_data(expanded=True)
df_train_expanded

	attributed_source	user_agent_platform	page_id	page_market	page_feat1	page_feat2	page_feat3	is_click
0	8	Windows	67	b	0.0	1.0	0.0	False
1	8	Windows	67	b	0.0	1.0	0.0	False
2	8	Windows	67	b	0.0	1.0	0.0	False
3	8	Windows	67	b	0.0	1.0	0.0	False
4	8	Windows	67	b	0.0	1.0	0.0	False
...	...	...	...	...	...	...	...	...
7760666	7	OSX	61	c	3.0	1.0	12.0	False
7760667	7	OSX	61	c	3.0	1.0	12.0	False
7760668	7	OSX	61	c	3.0	1.0	12.0	False
7760669	7	OSX	61	c	3.0	1.0	12.0	False
7760670	7	OSX	61	c	3.0	1.0	12.0	False

7760671 rows × 8 columns

But this is hugely inefficient: our dataset of ~400K explodes to almost 8MM.

Within foundry, we have the Glm, which supports binomial data directly:

Glm('binomial', penalty=10_000)

Glm(family='binomial', penalty=10000)

Let's set up a sklearn model pipeline using this Glm. We'll use foundry's DataFrameTransformer to support passing feature-names to the Glm (newer versions of sklearn support this via the set_output() API).

preproc = DataFrameTransformer([
    (
        'one_hot', 
        make_pipeline(SimpleImputer(strategy='most_frequent'), OneHotEncoder()), 
        ['attributed_source', 'user_agent_platform', 'page_id', 'page_market']
    )
    ,
    (
        'power', 
        PowerTransformer(),
        ['page_feat1', 'page_feat2', 'page_feat3']
    )
])

glm = make_pipeline(
    preproc, 
    Glm('binomial', penalty=1_000)
).fit(
    X=df_train,
    y={
        'value' : df_train['num_clicks'],
        'total_count' : df_train['num_views']
    },
)

Epoch 8; Loss 0.3183; Convergence 0.0003131/0.001:  42%|█████▊        | 5/12 [00:00<00:00, 10.99it/s]

Estimating laplace coefs... (you can safely keyboard-interrupt to cancel)


Epoch 8; Loss 0.3183; Convergence 0.0003131/0.001:  42%|█████▊        | 5/12 [00:07<00:10,  1.55s/it]

By default, the Glm will estimate not just the parameters of our model, but also the uncertainty associated with them. We can access a dataframe of these with the coef_dataframe_ attribute:

df_coefs = glm[-1].coef_dataframe_
df_coefs

	name	estimate	se
0	probs__one_hot__attributed_source_0	0.000042	0.031622
1	probs__one_hot__attributed_source_1	-0.003277	0.031578
2	probs__one_hot__attributed_source_2	-0.058870	0.030623
3	probs__one_hot__attributed_source_3	-0.485669	0.024011
4	probs__one_hot__attributed_source_4	-0.663989	0.016975
...	...	...	...
141	probs__one_hot__page_market_z	0.353556	0.025317
142	probs__power__page_feat1	0.213486	0.002241
143	probs__power__page_feat2	0.724601	0.004021
144	probs__power__page_feat3	0.913425	0.004974
145	probs__bias	-5.166077	0.022824

146 rows × 3 columns

Using this, it's easy to plot our model-coefficients:

df_coefs[['param', 'trans', 'term']] = df_coefs['name'].str.split('__', n=3, expand=True)

df_coefs[df_coefs['name'].str.contains('page_feat')].plot('term', 'estimate', kind='bar', yerr='se')
df_coefs[df_coefs['name'].str.contains('user_agent_platform')].plot('term', 'estimate', kind='bar', yerr='se')

<AxesSubplot:xlabel='term'>

Model-coefficients are limited because they only give us a single number, and for non-linear models (like our binomial GLM) this doesn't tell the whole story. For example, how could we translate the importance of page_feat3 into understanable terms? This only gets more difficult if our model includes interaction-terms.

To aid in this, there is MarginalEffects, a tool for plotting our model-predictions as a function of each predictor:

glm_me = MarginalEffects(glm)
glm_me.fit(
    X=df_val_expanded, 
    y=df_val_expanded['is_click'],
    vary_features=['page_feat3']
).plot()

<ggplot: (8777751556441)>

Here we see that how this predictor's impact on click-rates varies due to floor effects.

As a bonus, we plotted the actual values alongside the predictions, and we can see potential room for improvement in our model: it looks like very high values of this predictor have especially high click-rates, so an extra step in feature-engineering that captures this discontinuity may be warranted.

About

MIT License

Languages

Language:Python 100.0%