Problem with partial dependence and categories

Question

jonathan-taylor opened this issue 3 years ago · comments

It seems categorical variables must contain 0 as one of the values. This is apparent in partial_dependence:

import numpy as np
from pygam import LinearGAM, s, f

X = np.random.standard_normal((100, 3))
X[:,2] = np.random.choice([0,1], 100, replace=True)
Y = np.random.standard_normal(100)

G = LinearGAM(s(0) + s(1) + f(2)).fit(X, Y)
G.partial_dependence(0)

This works fine, but:

X2 = X.copy()
X2[:,2] += 2
G2 = LinearGAM(s(0) + s(1) + f(2)).fit(X2, Y)
G2.partial_dependence(0)

raises the following:

ValueError: X data is out of domain for categorical feature 2. Expected data on [2.0, 3.0], but found data on [0.0, 0.0]

The issue is that check_X validates the categorical columns of the matrix formed for _modelmat, which has 0s everywhere but the term's column. Really, that matrix just needs valid values: partial dependence only requires that the other columns be held constant, not that they be 0. I'd also recommend centering the partial dependence values, since it is their shape that is of interest rather than their level...
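
To make the failure mode concrete, here is a minimal NumPy-only sketch. The function `check_categorical_domain` is a simplified, hypothetical stand-in for the kind of range check described above; it is not pyGAM's actual `check_X`, and the grid bounds are arbitrary. It shows why a zero-filled grid fails once the categories are {2, 3}, and why holding the column at any observed level is enough.

```python
import numpy as np

def check_categorical_domain(X, feature, lo, hi):
    """Simplified stand-in for the domain check (NOT pyGAM's check_X)."""
    col = X[:, feature]
    if col.min() < lo or col.max() > hi:
        raise ValueError(
            f"X data is out of domain for categorical feature {feature}. "
            f"Expected data on [{lo}, {hi}], "
            f"but found data on [{col.min()}, {col.max()}]"
        )

# Grid for term 0: only column 0 varies, every other column is zero-filled.
grid = np.zeros((100, 3))
grid[:, 0] = np.linspace(-2.0, 2.0, 100)

# The categories seen in training were {2, 3}, so the zero fill is rejected.
try:
    check_categorical_domain(grid, feature=2, lo=2.0, hi=3.0)
    zero_fill_ok = True
except ValueError as err:
    zero_fill_ok = False
    message = str(err)

# Holding the categorical column at an observed level keeps it constant AND valid.
grid_valid = grid.copy()
grid_valid[:, 2] = 2.0
check_categorical_domain(grid_valid, feature=2, lo=2.0, hi=3.0)  # passes silently
```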

Jonathan Taylor commented 3 years ago

See #302

Jonathan Taylor · Answer 1 · Wed Sep 15 2021 07:07:25 GMT+0800 (China Standard Time)

An issue with this fix is that the width of the standard error bars will depend on where we evaluate. It might be better to return \hat{\mu}(X_{grid}) - \hat{\mu}(\bar{X}), so that it would be evaluated along a line through \bar{X}.
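
A rough sketch of that centered quantity, assuming only a generic `predict` callable that maps an (n, m) array to n fitted values. The helper name `centered_profile` and its `x_ref` argument are hypothetical and not part of pyGAM. For a categorical column the mean is generally not a valid level, so `x_ref` lets a reference row be supplied instead.

```python
import numpy as np

def centered_profile(predict, X_train, feature, n=100, x_ref=None):
    """Hypothetical helper: mu_hat(X_grid) - mu_hat(x_ref) along a line through x_ref.

    predict : callable taking an (n, m) array and returning n predictions
    x_ref   : reference row; defaults to the column means of X_train
    """
    X_train = np.asarray(X_train, dtype=float)
    if x_ref is None:
        x_ref = X_train.mean(axis=0)
    x_ref = np.asarray(x_ref, dtype=float)

    # Vary one feature over its observed range, hold the others at x_ref.
    grid = np.linspace(X_train[:, feature].min(), X_train[:, feature].max(), n)
    X_line = np.tile(x_ref, (n, 1))
    X_line[:, feature] = grid

    # Subtract the prediction at the reference point, so the profile is 0 there.
    return grid, predict(X_line) - predict(x_ref[None, :])
```

With a fitted model this could be called as `centered_profile(G.predict, X, feature=0)`, passing `x_ref` explicitly when column 2 must stay at a valid category.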

5ch0r5ch1 · Answer 2 · Thu Jan 04 2024 01:42:18 GMT+0800 (China Standard Time)

@jonathan-taylor have you noticed that G2.partial_dependence(2) actually works, and only G2.partial_dependence(0) and G2.partial_dependence(1) fail? That means the problem caused by X2[:,2] is affecting the terms for X2[:,0] and X2[:,1]. How can we explain this?

Nick Eubank · Answer 3 · Thu Feb 15 2024 23:23:31 GMT+0800 (China Standard Time)

Yeah, I see. I'm getting this too. I think the problem arises because, when partial_dependence is computed for any other feature, the categorical column gets filled in with zeros (which I think pyGAM assumes is the omitted category), and those zeros can't be evaluated.
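
If the zero fill is indeed the cause, one possible workaround is to build the grid yourself and overwrite the categorical column with a level that was observed in training before asking for the partial dependence. The sketch below assumes that `generate_X_grid(term=...)` returns the zero-filled grid and that `partial_dependence` accepts it back through its `X` argument. The helper `pdep_with_valid_fill` is hypothetical and not part of pyGAM.

```python
import numpy as np

def pdep_with_valid_fill(gam, term, fill):
    """Hypothetical helper (not part of pyGAM).

    gam  : a fitted pyGAM model (anything exposing generate_X_grid
           and partial_dependence)
    term : index of the term to profile
    fill : dict mapping feature index -> value observed in training;
           it should not include the feature of the term being profiled
    """
    # Start from the library's own grid, which is zero outside the term's column.
    XX = np.array(gam.generate_X_grid(term=term), dtype=float)

    # Replace the zero fill with valid, constant values.
    for feature, value in fill.items():
        XX[:, feature] = value

    return XX, gam.partial_dependence(term=term, X=XX)
```

With the objects from the original report this would be called as `XX, pdep = pdep_with_valid_fill(G2, term=0, fill={2: X2[0, 2]})`. The other columns stay constant, which is all that partial dependence needs.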