Problem with partial dependence and categories
jonathan-taylor opened this issue · comments
It seems categorical variables must contain 0 as one of the values. This is apparent in partial_dependence:
import numpy as np
from pygam import LinearGAM, s, f
X = np.random.standard_normal((100, 3))
X[:,2] = np.random.choice([0,1], 100, replace=True)
Y = np.random.standard_normal(100)
G = LinearGAM(s(0) + s(1) + f(2)).fit(X, Y)
G.partial_dependence(0)
This works fine, but:
X2 = X.copy()
X2[:,2] += 2
G2 = LinearGAM(s(0) + s(1) + f(2)).fit(X2, Y)
G2.partial_dependence(0)
raises the following:
ValueError: X data is out of domain for categorical feature 2. Expected data on [2.0, 3.0], but found data on [0.0, 0.0]
The issue is that check_X looks at the categorical columns of the formed _modelmat, which has 0s everywhere but the term's column. Really, _modelmat just needs valid values: partial dependence only requires that the other columns are constant, not necessarily 0. I'd also recommend centering the partial dependence values, as it is their shape that is of interest rather than the value...
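The point about holding the other columns at valid constants rather than zeros can be sketched in plain Python. This is only an illustration of the idea, not pyGAM's code: `build_grid` and its arguments are hypothetical names, and the choice of fill values (smallest observed category for categorical columns, column mean otherwise) is an assumption.

```python
from statistics import mean


def build_grid(X, feature, categorical, n_points=5):
    """Hypothetical helper: vary one continuous `feature` over its observed
    range while holding every other column at a constant, valid value."""
    n_cols = len(X[0])
    columns = [[row[j] for row in X] for j in range(n_cols)]

    # Constant fill values: an observed category for categorical columns,
    # the column mean for continuous ones (both are assumptions).
    fill = [min(col) if j in categorical else mean(col)
            for j, col in enumerate(columns)]

    # Evenly spaced values across the observed range (needs n_points >= 2).
    lo, hi = min(columns[feature]), max(columns[feature])
    step = (hi - lo) / (n_points - 1)

    grid = []
    for k in range(n_points):
        row = list(fill)
        row[feature] = lo + k * step
        grid.append(row)
    return grid


# Column 2 is categorical with levels {2.0, 3.0}, as in the X2 example.
X = [[-1.0, 0.5, 2.0], [0.0, 1.5, 3.0], [1.0, -0.5, 2.0], [2.0, 0.5, 3.0]]
grid = build_grid(X, feature=0, categorical={2})
```

Every row of `grid` keeps the categorical column at a level seen during fitting, so a domain check on that column would have nothing to complain about.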
See #302
One issue with this fix is that the standard errors (the error bars) will depend on where we evaluate. It might be better to return \hat{\mu}(X_grid) - \hat{\mu}(\bar{X}), so that it would be evaluated along a line through \bar{X}.
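A minimal sketch of that centered quantity, \hat{\mu}(X_grid) - \hat{\mu}(\bar{X}), in plain Python. The function name `centered_partial_dependence` and the toy `mu_hat` are hypothetical stand-ins for a fitted model's prediction. Note that a column mean is generally not a valid level for a categorical column, so a real version would presumably need a different reference value there.

```python
from statistics import mean


def centered_partial_dependence(mu_hat, X, feature, grid_values):
    """Evaluate mu_hat along a line through the column means x_bar,
    varying only `feature`, and subtract mu_hat(x_bar)."""
    n_cols = len(X[0])
    x_bar = [mean(row[j] for row in X) for j in range(n_cols)]
    baseline = mu_hat(x_bar)

    values = []
    for v in grid_values:
        point = list(x_bar)   # all other columns stay at their means
        point[feature] = v
        values.append(mu_hat(point) - baseline)
    return values


def mu_hat(x):
    # Toy prediction function standing in for a fitted model (assumption).
    return 2.0 * x[0] + x[1] ** 2


X = [[0.0, 1.0], [2.0, 3.0], [4.0, 2.0]]
pd0 = centered_partial_dependence(mu_hat, X, feature=0,
                                  grid_values=[0.0, 2.0, 4.0])
# pd0 is [-4.0, 0.0, 4.0]: zero at x_bar, only the shape remains.
```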
@jonathan-taylor did you notice that G2.partial_dependence(2) actually works, and only G2.partial_dependence(0) and G2.partial_dependence(1) fail? That means the issue caused by X2[:,2] is affecting the terms for X2[:,0] and X2[:,1]. How can we explain this?
Yeah, I see, I'm getting this too. I think the problem arises because, when computing partial_dependence for any other feature, the categorical column gets filled in with zeros (which I think pyGAM assumes is the omitted category), and those zeros can't be evaluated when 0 is not one of the fitted categories.
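To make the mechanism concrete, here is a rough stand-alone sketch of a range check of this kind. It is not pyGAM's actual check_X: `check_categorical_domain` is a hypothetical function that only mimics the error message quoted in this thread.

```python
def check_categorical_domain(column, fitted_min, fitted_max, feature):
    """Hypothetical range check: raise if any value in `column` falls
    outside the range of categories seen when the model was fitted."""
    lo, hi = min(column), max(column)
    if lo < fitted_min or hi > fitted_max:
        raise ValueError(
            "X data is out of domain for categorical feature {}. "
            "Expected data on [{}, {}], but found data on [{}, {}]".format(
                feature, fitted_min, fitted_max, lo, hi))


# Column 2 of the matrix built for partial_dependence(0): all zeros.
zero_filled = [0.0] * 5

# Fitted on categories {0, 1}: zeros are in range, so the check passes.
check_categorical_domain(zero_filled, 0.0, 1.0, feature=2)

# Fitted on categories {2, 3}: the same zeros are now out of domain.
try:
    check_categorical_domain(zero_filled, 2.0, 3.0, feature=2)
    message = None
except ValueError as err:
    message = str(err)
```

Under these assumptions, partial_dependence(2) would be unaffected because its own column is filled with real category values, while the other terms inherit the zero-filled categorical column.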