dswah / pyGAM

[HELP REQUESTED] Generalized Additive Models in Python

Home Page: https://pygam.readthedocs.io

Problem with partial dependence and categories

jonathan-taylor opened this issue · comments

It seems categorical variables must contain 0 as one of the values. This is apparent in partial_dependence:

import numpy as np
from pygam import LinearGAM, s, f

X = np.random.standard_normal((100, 3))
X[:,2] = np.random.choice([0,1], 100, replace=True)
Y = np.random.standard_normal(100)

G = LinearGAM(s(0) + s(1) + f(2)).fit(X, Y)
G.partial_dependence(0)

This works fine, but:

X2 = X.copy()
X2[:,2] += 2
G2 = LinearGAM(s(0) + s(1) + f(2)).fit(X2, Y)
G2.partial_dependence(0)

raises the following:

ValueError: X data is out of domain for categorical feature 2. Expected data on [2.0, 3.0], but found data on [0.0, 0.0]

The issue is that check_X inspects the categorical columns of the _modelmat that gets built, which is 0 everywhere except in the term's own column. Really, _modelmat just needs valid values: partial dependence only requires that the other columns be constant, not necessarily 0. I'd also recommend centering the partial dependence values, since it is their shape that is of interest rather than their level.
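To make the "constant but valid" idea concrete, here is a minimal NumPy-only sketch. `make_pdep_grid` is a hypothetical helper, not part of pyGAM, and the choice of fill values (column mean for numeric features, most frequent observed level for categorical ones) is an assumption made for illustration:

```python
import numpy as np

def make_pdep_grid(X, feature, categorical=(), n_points=100):
    """Hypothetical helper (not pyGAM API): build an evaluation matrix
    that varies one feature and holds the others at valid constants."""
    X = np.asarray(X, dtype=float)
    if feature in categorical:
        # A categorical feature can only vary over its observed levels.
        values = np.unique(X[:, feature])
    else:
        values = np.linspace(X[:, feature].min(), X[:, feature].max(), n_points)

    grid = np.empty((len(values), X.shape[1]))
    for j in range(X.shape[1]):
        if j == feature:
            grid[:, j] = values
        elif j in categorical:
            # Fill with the most frequent observed level instead of 0.
            levels, counts = np.unique(X[:, j], return_counts=True)
            grid[:, j] = levels[np.argmax(counts)]
        else:
            # Fill numeric columns with their mean.
            grid[:, j] = X[:, j].mean()
    return grid

rng = np.random.default_rng(0)
X2 = rng.standard_normal((100, 3))
X2[:, 2] = rng.choice([2, 3], size=100)   # categorical levels without 0

grid = make_pdep_grid(X2, feature=0, categorical=(2,))
print(grid.shape)             # (100, 3)
print(np.unique(grid[:, 2]))  # one observed level, never 0
```

A grid built this way stays inside the domain of every column, so a domain check like the one in the traceback above would have nothing to reject.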

An issue with this fix is that the standard errors (the error bars) will depend on where we evaluate. It might be better to return \hat{\mu}(X_grid) - \hat{\mu}(\bar{X}), so that it is evaluated along a line through \bar{X}.
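A rough sketch of that centered quantity, assuming only a generic `predict` callable (for example, the `predict` method of a fitted model). `centered_partial_dependence` is a hypothetical name, the toy `predict` below is made up for the demonstration, and the sketch does not attempt the standard errors:

```python
import numpy as np

def centered_partial_dependence(predict, X, feature, n_points=50):
    """Hypothetical helper: mu_hat(X_grid) - mu_hat(X_bar), where X_grid
    moves along one feature on a line through the column means X_bar."""
    X = np.asarray(X, dtype=float)
    # Caveat: for a categorical column the mean is generally not a valid
    # level; a real implementation would substitute an observed level.
    x_bar = X.mean(axis=0)
    values = np.linspace(X[:, feature].min(), X[:, feature].max(), n_points)
    X_grid = np.tile(x_bar, (n_points, 1))
    X_grid[:, feature] = values
    return values, predict(X_grid) - predict(x_bar[None, :])

# Toy stand-in for a fitted model's prediction function (an assumption).
def predict(Z):
    return 2.0 * Z[:, 0] + Z[:, 1] ** 2

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 2))
values, pdep = centered_partial_dependence(predict, X, feature=0)
```

For this additive toy model the centered curve reduces to `2 * (values - X[:, 0].mean())`, so it passes through zero at the mean of the feature and only its shape remains.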

@jonathan-taylor have you noticed that G2.partial_dependence(2) actually works, and only G2.partial_dependence(0) and G2.partial_dependence(1) fail? That means the problem caused by X2[:,2] shows up when evaluating features 0 and 1. How can we explain this?

Yeah, I see; I'm getting this too. I think the problem arises because, when computing partial_dependence for any other feature, the categorical column gets filled with zeros (which I think pyGAM assumes is the omitted category), and those zeros cannot be evaluated because 0 is not one of the fitted levels.
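Until the behavior changes, one possible workaround (an assumption based on the observation that 0 must be among the levels, not something confirmed in this thread) is to recode the categorical column to integer codes 0, 1, ... before fitting. `encode_levels` is a hypothetical helper, not part of pyGAM:

```python
import numpy as np

def encode_levels(column):
    """Hypothetical helper: map arbitrary category labels to codes
    0..k-1 and return the original levels for decoding later."""
    column = np.asarray(column).ravel()
    levels, codes = np.unique(column, return_inverse=True)
    return codes.astype(float), levels

raw = np.array([2.0, 3.0, 3.0, 2.0])
codes, levels = encode_levels(raw)
print(codes)                      # [0. 1. 1. 0.]
print(levels[codes.astype(int)])  # [2. 3. 3. 2.]
```

The encoded column can replace X2[:,2] before calling fit, and `levels` maps the codes back to the original labels when reporting results.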