pzivich / zEpid

Epidemiology analysis package

Home Page: http://zepid.readthedocs.org

Perfect separation error for using `SingleCrossfitTMLE`

miaow27 opened this issue · comments

I am using `SingleCrossfitTMLE` from zEpid v0.9.0.

When using plain `TMLE`, it works fine:

from sklearn.ensemble import RandomForestRegressor
from zepid.causal.doublyrobust import TMLE

ranger = RandomForestRegressor()
tmle = TMLE(df, a_var, y_var_bound)
tmle.exposure_model(g_formula)
tmle.outcome_model(q_formula, custom_model=ranger)
tmle.fit()

However, when using `SingleCrossfitTMLE`, it keeps raising `Perfect separation detected, results not available`:

import statsmodels.api as sm
from sklearn.ensemble import RandomForestRegressor
from zepid.superlearner import GLMSL, SuperLearner
from zepid.causal.doublyrobust import SingleCrossfitTMLE

# SuperLearner set-up
g_labels = ["LogR"]
g_candidates = [GLMSL(sm.families.family.Binomial())]
Q_labels = ['RandomForest']
Q_candidates = [RandomForestRegressor(random_state=12345)]

# Single cross-fit TMLE
sctmle = SingleCrossfitTMLE(df, exposure=a_var, outcome=y_var)
sctmle.exposure_model(g_formula, SuperLearner(g_candidates, g_labels, folds=3, loss_function="nloglik"))
sctmle.outcome_model(q_formula, SuperLearner(Q_candidates, Q_labels, folds=3))
sctmle.fit(n_partitions=3, random_state=12345)
sctmle.summary()

I have checked all pairwise correlations between the treatment variable and the confounders, and removed any categorical variable pairs with correlation above 0.2, but I am still seeing the same error. Could you suggest a way to resolve this issue?
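
For reference, the correlation screen I ran was roughly along these lines (just a sketch; it assumes the categorical covariates are already numerically coded, and `confounders` stands in for the covariates in my g-formula):

# Sketch of the correlation check described above; `df`, `a_var`, and
# `y_var` are the same objects used in the code above
confounders = [c for c in df.columns if c not in (a_var, y_var)]
corr = df[[a_var] + confounders].corr()[a_var].drop(a_var)

# covariates with |correlation| > 0.2 against the treatment
print(corr[corr.abs() > 0.2])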

Hi @miaow27, the perfect separation error comes up from the targeting step in TMLE. It's hard to tell exactly what is happening here. Could you paste the full error message?

There are two possible causes: the random forest is over-fitting, or something in the g-model is highly correlated with the exposure (which can lead to perfect separation when that model is fit within a random split).

Essentially, the cross-fit procedure breaks everything into two pieces and then fits the algorithm (the SL with a single learner, in your code above) on each piece. When the data get split like this, random forests sometimes have a tendency to over-fit (especially inside a SL).
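
As a rough illustration of why that split step can hurt, here is a small sketch (plain sklearn on simulated data, not the zEpid internals):

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Toy example: a default random forest fit on half of the data scores far
# better on the half it saw than on the held-out half, which is the kind of
# over-fitting that shows up inside the cross-fit splits.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] + rng.normal(size=200)

X_a, X_b, y_a, y_b = train_test_split(X, y, test_size=0.5, random_state=0)
rf = RandomForestRegressor(random_state=0).fit(X_a, y_a)
print("in-split R^2: ", rf.score(X_a, y_a))
print("out-of-split R^2:", rf.score(X_b, y_b))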

  • The easiest fix would be to tune the hyperparameters of the random forest. I would try changing `min_samples_split` to something like 5 or 10 (instead of the default of 2).
  • Another potential fix would be to increase the number of folds in the SL (3 is pretty low; essentially it takes the split and then splits it into 3 pieces, so more folds give each fit more data). You could also forgo the SL, since it only has one candidate learner.
  • Lastly, you could instead add some 'smoother' learners to the Q SL. Something like a GAM or MARS would shrink the influence of the random forest (if the variance of the random forest is very high). A rough sketch of these tweaks is included after this list.
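
A sketch of those tweaks might look like the following (the specific values are just starting points; a plain GLM stands in for the GAM/MARS suggestion only to show the mechanics, and the Gaussian family assumes a continuous outcome):

import statsmodels.api as sm
from sklearn.ensemble import RandomForestRegressor
from zepid.superlearner import GLMSL, SuperLearner

# 1. A less aggressive random forest (larger min_samples_split)
rf_tuned = RandomForestRegressor(min_samples_split=10, random_state=12345)

# 2. More folds in the SL, plus a smooth parametric learner alongside the
#    forest so the SL can down-weight the forest if its variance is high
Q_labels = ["GLM", "RandomForest"]
Q_candidates = [GLMSL(sm.families.family.Gaussian()), rf_tuned]

sctmle.outcome_model(q_formula, SuperLearner(Q_candidates, Q_labels, folds=10))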

If it is the g-model correlation, then I would try a different seed; that might get the cross-fit to run. How to fix that issue properly is a little trickier to think through.
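
If you want to check that quickly, a seed-retry loop like this sketch might help (it assumes the error surfaces as statsmodels' PerfectSeparationError; if zEpid only prints a message instead, adjust the except clause accordingly):

from statsmodels.tools.sm_exceptions import PerfectSeparationError

# Sketch: rebuild the estimator and retry the cross-fit with different seeds,
# keeping the first split that does not trigger perfect separation.
for seed in range(20):
    sctmle = SingleCrossfitTMLE(df, exposure=a_var, outcome=y_var)
    sctmle.exposure_model(g_formula, SuperLearner(g_candidates, g_labels, folds=3, loss_function="nloglik"))
    sctmle.outcome_model(q_formula, SuperLearner(Q_candidates, Q_labels, folds=3))
    try:
        sctmle.fit(n_partitions=3, random_state=seed)
        print("random_state =", seed, "ran without perfect separation")
        break
    except PerfectSeparationError:
        continue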

@pzivich, thanks so much for the suggestions. It ended up being an issue with the exposure model (g-model), where somehow one of the splits contained only one class. After I increased the folds in the SL to 10, it worked. If I forgo the SL and use `sctmle.exposure_model(g_formula, GLMSL(sm.families.family.Binomial()))`, it also works.
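
For anyone who finds this later, the two working configurations were roughly:

# Option 1: more folds in the exposure-model super learner
sctmle.exposure_model(
    g_formula,
    SuperLearner(g_candidates, g_labels, folds=10, loss_function="nloglik"),
)

# Option 2: skip the SL for the g-model and fit a plain logistic GLM
sctmle.exposure_model(g_formula, GLMSL(sm.families.family.Binomial()))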