Unexpected results for CATE methods when predicting on new data
AlxndrMlk opened this issue
Describe the bug
When using DoWhy 0.10.1, I am getting unexpected predictions on new data.
In particular, I tried replicating the results in the "S-Learner: The Lone Ranger" section of this notebook, which was originally developed using DoWhy 0.8.
I used two ways to generate the predictions on the test set:
[1] Using `_estimator_object`:

```python
estimate._estimator_object.effect(my_test_data)
```
[2] Using the `model.estimate_effect()` method:

```python
model.estimate_effect(
    identified_estimand=estimand,
    method_name='backdoor.econml.metalearners.SLearner',
    fit_estimator=False,
    target_units=my_test_data,
).cate_estimates
```
In both cases, the results were constant across all rows.
To make sure it was not an issue with the data, I also generated predictions for the training data using both methods ([1] and [2]).
The results were again constant, which is inconsistent with the original predictions generated on the training data.
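As a quick illustration of what "constant" means here (a sketch using the placeholder `my_test_data` from above, not output from an actual run):

```python
import numpy as np

# Predictions from method [1] on the test covariates.
preds = estimate._estimator_object.effect(my_test_data)

# A working CATE model should vary across rows; with the reported bug,
# the estimates collapse to a single repeated value.
print(np.unique(preds).size)  # 1 under the bug
print(np.ptp(preds))          # 0.0 under the bug (max - min)
```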
Steps to reproduce the behavior
- Install DoWhy 0.10.1
- Run the following cells from this notebook:
  - imports
  - cells 47-58
- Run the following code (generates predictions on the test set):
```python
estimate._estimator_object.effect(
    earnings_interaction_test.drop(['true_effect', 'took_a_course'], axis=1)
)
```
or
```python
model.estimate_effect(
    identified_estimand=estimand,
    method_name='backdoor.econml.metalearners.SLearner',
    fit_estimator=False,
    target_units=earnings_interaction_test.drop(['true_effect', 'took_a_course'], axis=1),
).cate_estimates
```
- Compare the original predictions on the training data:

```python
estimate.cate_estimates
```

(Let's call this result ORIGINAL_PRED.)

with the predictions on the training data generated using methods [1] and [2]:

```python
estimate._estimator_object.effect(
    earnings_interaction_train.drop(['earnings', 'took_a_course'], axis=1)
)
```

(Let's call this result NEW_PRED_1.)

```python
model.estimate_effect(
    identified_estimand=estimand,
    method_name='backdoor.econml.metalearners.SLearner',
    fit_estimator=False,
    target_units=earnings_interaction_train.drop(['earnings', 'took_a_course'], axis=1),
).cate_estimates
```

(Let's call this result NEW_PRED_2.)
Expected behavior
We expect ORIGINAL_PRED to be identical to NEW_PRED_1 and NEW_PRED_2 (at least assuming a fixed seed), but this was not the case for me.
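Concretely, with the three results defined above, a check like this should pass (a sketch; it assumes the estimates come back as NumPy arrays):

```python
import numpy as np

# fit_estimator=False reuses the already-fitted estimator, so re-scoring
# the training data should reproduce the original in-sample estimates.
assert np.allclose(ORIGINAL_PRED.ravel(), NEW_PRED_1.ravel())
assert np.allclose(ORIGINAL_PRED.ravel(), NEW_PRED_2.ravel())
```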
The results are as expected in DoWhy 0.8 when using:

```python
model.causal_estimator.effect(
    earnings_interaction_test.drop(['true_effect', 'took_a_course'], axis=1)
)
```

to generate predictions on new data.
Version information:
- DoWhy version 0.10.1
Additional context
...
---

Thanks for raising this, @AlxndrMlk.
I had a chance to check this issue, and the error is due to the columns being passed in the wrong order.
If you change `earnings_interaction_test.drop(['true_effect', 'took_a_course'], axis=1)` to `earnings_interaction_test[['python_proficiency', 'age']]`, the code runs as expected.
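A sketch of the workaround applied to both calls from the issue (the key point being that the column order must match the order the estimator saw at fit time):

```python
# Explicit column selection in the fit-time order; DataFrame.drop keeps the
# test frame's own column order, which is what caused the mismatch.
X_test = earnings_interaction_test[['python_proficiency', 'age']]

# Method [1]: score the fitted EconML estimator directly.
estimate._estimator_object.effect(X_test)

# Method [2]: reuse the fitted estimator through estimate_effect.
model.estimate_effect(
    identified_estimand=estimand,
    method_name='backdoor.econml.metalearners.SLearner',
    fit_estimator=False,
    target_units=X_test,
).cate_estimates
```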
However, I do realize that this behavior in v0.10 creates an additional burden on the user. I am adding PR #1061, which restores the v0.8 behavior: the user can simply provide the dataframe (`earnings_interaction_test`) and the relevant columns are selected automatically.
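Assuming the PR works as described, the original calls should then behave as in v0.8, e.g.:

```python
# With the fix, DoWhy selects the relevant covariate columns itself,
# so the full test dataframe can be passed directly.
estimate._estimator_object.effect(earnings_interaction_test)
```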