SingleCrossFit `invalid value encountered in log`

Question

SingleCrossFit `invalid value encountered in log`

miaow27 opened this issue 2 years ago · comments

@pzivich, When using Singlecrossfit TMLE for a continuous outcome with sm.Gaussian GLM class. I have encountered the following error:

xxx/lib/python3.7/site-packages/zepid/causal/doublyrobust/crossfit.py:1663: RuntimeWarning: invalid value encountered in log
  log = sm.GLM(ys, np.column_stack((h1ws, h0ws)), offset=np.log(probability_to_odds(py_os)),
xxx/lib/python3.7/site-packages/zepid/causal/doublyrobust/crossfit.py:1669: RuntimeWarning: invalid value encountered in log
  ystar0 = np.append(ystar0, logistic.cdf(np.log(probability_to_odds(py_ns)) - epsilon[1] / pa0s))

Here is how I defined the estimator for superleaner as well as the parameter input.

link_i = sm.genmod.families.links.identity()
SL_glm = GLMSL(family = sm.families.family.Gaussian(link=link_i))
GLMSL(family = sm.families.family.Binomial())

sctmle = SingleCrossfitTMLE(dataset = df, exposure='treatment', outcome='paid_amt', continuous_bound = 0.01)
sctmle.exposure_model('gender_cd_F + prospective_risk + age_nbr', GLMSL(family = sm.families.family.Binomial()), bound=[0.01, 0.99])
sctmle.outcome_model('gender_cd_F + prospective_risk + age_nbr', SL_glm)
sctmle.fit(n_splits = 2, n_partitions=3, random_state=12345, method = 'median')
sctmle.summary()

If I uses any other ML estimates such as Lasso, GBM, RandomForest from Sklearn for outcome model estimator, it will work fine.
The error only related to use of GLMSL family.

Could you share any idea of the reason of this error and how I can fix this issue?
Much appreciated!

Paul Zivich · Answer 1 · Wed Oct 05 2022 07:21:21 GMT+0800 (China Standard Time)

I'll have to look into what's happening here. As a simple way to sidestep the issue, have you tried sci-kit learn's LinearRegression. That should be the same thing as GLMSL with the Gaussian family. If the rest of sklearn works fine, this should be the quickest fix until I figure out what's happening here

The original motivation for GLMSL was to get around the automatic penalization that sci-kit learn's LogisticRegression does (since I find tuning the C parameter to be inconsistent.

Paul Zivich · Answer 2 · Wed Oct 05 2022 07:23:23 GMT+0800 (China Standard Time)

What the warning looks like is that a 'bad' value is being returned by GLMSL somehow, which is passing a np.nan to the targeting step. That might be some bad fitting on the GLM's part? I will see if I can replicate in the next week

Miao Wang · Answer 3 · Wed Oct 05 2022 09:42:25 GMT+0800 (China Standard Time)

Awesome! I dig further into different scenario when I only fit 1 estimator at a time for both exposure model and the outcome model using SingleCorssfit. It seems to be coming from the outcome model for non-penalized regression (Linear regression from SKlearn or sm.familiy.Gaussian) as well as Gradient Boost Machine.

My outcome is highly skewed to the right as below:

However, I have tried cap at 90% or bound to 0-1. it is still throwing the same error.

See detailed information below.

Detailed Estimator Defined

if outcome_binary:
        SL_emp = EmpiricalMeanSL()
        SL_glm = LogisticRegression(penalty = 'none', tol = 1e-3, solver = 'saga', max_iter = 1000)
        SL_lasso = LogisticRegression(penalty = 'l1', C = 1/lasso_alpha, tol = 1e-3, solver = 'saga', max_iter = 1000)
        SL_gmsl = GLMSL(family = sm.families.family.Binomial())
        SL_rf = RandomForestClassifier(n_estimators = 100, min_samples_split=5, min_samples_leaf = 0.2) 
        SL_gbm = GradientBoostingClassifier(n_estimators = 100, min_samples_split=5, min_samples_leaf = 0.2) 
    else:
        SL_emp = EmpiricalMeanSL()
        SL_glm = LinearRegression()
        SL_lasso = Lasso(alpha = 0.1)
        link_i = sm.genmod.families.links.identity()
        SL_gmsl = GLMSL(family = sm.families.family.Gaussian(link=link_i))
        SL_rf = RandomForestRegressor(n_estimators = 100, min_samples_split=5, min_samples_leaf = 0.2) 
        SL_gbm = GradientBoostingRegressor(n_estimators = 100, min_samples_split=5, min_samples_leaf = 0.2)

Result and Error

Result

It seems that Either Sklearn version of Regression of Statsmodel version of Regression failed. As well as GBM. However, Lasso is okay

ps_model: SL_emp, outcome_model: SL_emp, ATE: 299.7524716151026
ps_model: SL_emp, outcome_model: SL_glm, ATE: nan -> showing the below error
ps_model: SL_emp, outcome_model: SL_lasso, ATE: 299.1976197069547
ps_model: SL_emp, outcome_model: SL_gmsl, ATE: nan -> showing the below error
ps_model: SL_emp, outcome_model: SL_rf, ATE: 94.01894562905046
ps_model: SL_emp, outcome_model: SL_gbm, ATE: nan -> showing the below error

ps_model: SL_glm, outcome_model: SL_emp, ATE: 147.88143010969998
ps_model: SL_glm, outcome_model: SL_glm, ATE: nan -> showing the below error
ps_model: SL_glm, outcome_model: SL_lasso, ATE: 147.93800377478118
ps_model: SL_glm, outcome_model: SL_gmsl, ATE: nan -> showing the below error
ps_model: SL_glm, outcome_model: SL_rf, ATE: 76.54026004386105
ps_model: SL_glm, outcome_model: SL_gbm, ATE: nan -> showing the above error

ps_model: SL_lasso, outcome_model: SL_emp, ATE: 148.62888266058008
ps_model: SL_lasso, outcome_model: SL_glm, ATE: nan -> showing the below error
ps_model: SL_lasso, outcome_model: SL_lasso, ATE: 148.65439623543404
ps_model: SL_lasso, outcome_model: SL_gmsl, ATE: nan -> showing the below error
ps_model: SL_lasso, outcome_model: SL_rf, ATE: 76.49970421812958
ps_model: SL_lasso, outcome_model: SL_gbm, ATE: nan -> showing the below error

ps_model: SL_gmsl, outcome_model: SL_emp, ATE: 104.65699066369935
ps_model: SL_gmsl, outcome_model: SL_glm, ATE: nan -> showing the below error
ps_model: SL_gmsl, outcome_model: SL_lasso, ATE: 104.65699066369935
ps_model: SL_gmsl, outcome_model: SL_gmsl, ATE: nan -> showing the below error
ps_model: SL_gmsl, outcome_model: SL_rf, ATE: 80.77866133286848
ps_model: SL_gmsl, outcome_model: SL_gbm, ATE: nan -> showing the below error

ps_model: SL_rf, outcome_model: SL_emp, ATE: 91.54591473217262
ps_model: SL_rf, outcome_model: SL_glm, ATE: nan -> showing the below error
ps_model: SL_rf, outcome_model: SL_lasso, ATE: 91.54591473217262
ps_model: SL_rf, outcome_model: SL_gmsl, ATE: nan -> showing the below error
ps_model: SL_rf, outcome_model: SL_rf, ATE: 76.42024750089442
ps_model: SL_rf, outcome_model: SL_gbm, ATE: nan -> showing the below error

ps_model: SL_gbm, outcome_model: SL_emp, ATE: 50.585009347615674
ps_model: SL_gbm, outcome_model: SL_glm, ATE: nan -> showing the below error
ps_model: SL_gbm, outcome_model: SL_lasso, ATE: 50.585009347615674
ps_model: SL_gbm, outcome_model: SL_gmsl, ATE: nan -> showing the below error
ps_model: SL_gbm, outcome_model: SL_rf, ATE: 58.8199678774994
ps_model: SL_gbm, outcome_model: SL_gbm, ATE: nan -> showing the below error

Error Message

Each line represent a test case for getting ATE for continuous outcome (everything else is same).
When there is ATE:nan it comes with error message as below.

The reason it repeated 3 times is because I defined partition = 3.

I also notice the error occur at line 1663, 1669, 1671 and 1668 in crossfit.py within DoubleCrossfitTMLE class.
However, I only called SingleCrossfitTMLE.

/users/.conda/envs/cobra_dev/lib/python3.7/site-packages/zepid/causal/doublyrobust/crossfit.py:1663: RuntimeWarning: invalid value encountered in log
  log = sm.GLM(ys, np.column_stack((h1ws, h0ws)), offset=np.log(probability_to_odds(py_os)),
/users/.conda/envs/cobra_dev/lib/python3.7/site-packages/zepid/causal/doublyrobust/crossfit.py:1669: RuntimeWarning: invalid value encountered in log
  ystar0 = np.append(ystar0, logistic.cdf(np.log(probability_to_odds(py_ns)) - epsilon[1] / pa0s))
/users/.conda/envs/cobra_dev/lib/python3.7/site-packages/zepid/causal/doublyrobust/crossfit.py:1671: RuntimeWarning: invalid value encountered in log
  offset=np.log(probability_to_odds(py_os))))
/users/.conda/envs/cobra_dev/lib/python3.7/site-packages/zepid/causal/doublyrobust/crossfit.py:1668: RuntimeWarning: invalid value encountered in log
  ystar1 = np.append(ystar1, logistic.cdf(np.log(probability_to_odds(py_as)) + epsilon[0] / pa1s))
/users/.conda/envs/cobra_dev/lib/python3.7/site-packages/zepid/causal/doublyrobust/crossfit.py:1663: RuntimeWarning: invalid value encountered in log
  log = sm.GLM(ys, np.column_stack((h1ws, h0ws)), offset=np.log(probability_to_odds(py_os)),
/users/.conda/envs/cobra_dev/lib/python3.7/site-packages/zepid/causal/doublyrobust/crossfit.py:1669: RuntimeWarning: invalid value encountered in log
  ystar0 = np.append(ystar0, logistic.cdf(np.log(probability_to_odds(py_ns)) - epsilon[1] / pa0s))
/users/a474369/.conda/envs/cobra_dev/lib/python3.7/site-packages/zepid/causal/doublyrobust/crossfit.py:1671: RuntimeWarning: invalid value encountered in log
  offset=np.log(probability_to_odds(py_os))))
/users/.conda/envs/cobra_dev/lib/python3.7/site-packages/zepid/causal/doublyrobust/crossfit.py:1668: RuntimeWarning: invalid value encountered in log
  ystar1 = np.append(ystar1, logistic.cdf(np.log(probability_to_odds(py_as)) + epsilon[0] / pa1s))
/users/.conda/envs/cobra_dev/lib/python3.7/site-packages/zepid/causal/doublyrobust/crossfit.py:1663: RuntimeWarning: invalid value encountered in log
  log = sm.GLM(ys, np.column_stack((h1ws, h0ws)), offset=np.log(probability_to_odds(py_os)),
/users/.conda/envs/cobra_dev/lib/python3.7/site-packages/zepid/causal/doublyrobust/crossfit.py:1668: RuntimeWarning: invalid value encountered in log
  ystar1 = np.append(ystar1, logistic.cdf(np.log(probability_to_odds(py_as)) + epsilon[0] / pa1s))
/users/.conda/envs/cobra_dev/lib/python3.7/site-packages/zepid/causal/doublyrobust/crossfit.py:1669: RuntimeWarning: invalid value encountered in log
  ystar0 = np.append(ystar0, logistic.cdf(np.log(probability_to_odds(py_ns)) - epsilon[1] / pa0s))
/users/.conda/envs/cobra_dev/lib/python3.7/site-packages/zepid/causal/doublyrobust/crossfit.py:1671: RuntimeWarning: invalid value encountered in log
  offset=np.log(probability_to_odds(py_os))))

I hope these information can provide some insights about why Singlecrossfit failed!

Miao Wang · Answer 4 · Sat Oct 08 2022 05:28:33 GMT+0800 (China Standard Time)

I'll have to look into what's happening here. As a simple way to sidestep the issue, have you tried sci-kit learn's LinearRegression. That should be the same thing as GLMSL with the Gaussian family. If the rest of sklearn works fine, this should be the quickest fix until I figure out what's happening here

The original motivation for GLMSL was to get around the automatic penalization that sci-kit learn's LogisticRegression does (since I find tuning the C parameter to be inconsistent.

I think it tends to have this particular error for GLMSL or even LASSO (from SKlearn, when alpha is not large enough).

Paul Zivich · Answer 5 · Sat Oct 08 2022 05:31:27 GMT+0800 (China Standard Time)

Yeah, it sounds like you are running into sparse data, so GLMSL returns nan because that's what statsmodels.GLM returns. Those nan get fed to the targeting model (another GLM, the one called out in the warning).

So, it sounds like a specific sparsity issue coming up with the data (not with something in GLMSL itself). Penalized model (like you did) is probably the way to fix it. Implicitly random forests do something like that, so that also explains why they work here.

Miao Wang · Answer 6 · Sat Oct 08 2022 05:41:02 GMT+0800 (China Standard Time)

This is very helpful insight! log-transform the outcome would be ideal, but i will not able to get ATE but ratio instead.
Random Forest with parameter that prevent overfitting or large penalization might be the way to go...

Thank you!

Paul Zivich · Answer 7 · Sat Oct 08 2022 05:45:58 GMT+0800 (China Standard Time)

Personally, I would recommend using SuperLearer with a LASSO, Ridge, and Random Forest (and maybe some other algorithms that regularize). That should keep you on the ATE, and all those are regularized, so they should avoid this issue.

My recommendation with using LASSO + Ridge is based on some observations that random forests can be a little unpredictable when used alone in finite samples (also see "Demystifying Statistical Inference When Using Machine Learning in Causal Research"). Essentially you want to give prevent overfitting by random forest if you can

Miao Wang · Answer 8 · Mon Oct 10 2022 05:13:30 GMT+0800 (China Standard Time)

This is extremely helpful suggestion! Thank you so much!