py-why / EconML

ALICE (Automated Learning and Intelligence for Causation and Economics) is a Microsoft Research project aimed at applying Artificial Intelligence concepts to economic decision making. One of its goals is to build a toolkit that combines state-of-the-art machine learning techniques with econometrics in order to bring automation to complex causal inference problems. To date, the ALICE Python SDK (econml) implements orthogonal machine learning algorithms such as the double machine learning work of Chernozhukov et al. This toolkit is designed to measure the causal effect of some treatment variable(s) t on an outcome variable y, controlling for a set of features x.

Home Page: https://www.microsoft.com/en-us/research/project/alice/


High dimensional categorical confounders (W)

suryadipta opened this issue

Hi,
I have a clarification question on how to use categorical variables taking a large number of unique values as confounders in the family of DML models. A simplified hypothetical view of my causal model is as follows:

causal_graph = """ digraph { catvar1 -> log_T; catvar1 -> log_Y; catvar2 -> log_T; catvar2 -> log_Y; X_scaled -> log_Y; catvar3 -> log_T; catvar3 -> log_Y; catvar4 -> log_T; catvar4 -> log_Y; log_T -> log_Y; } """

The following generates the causal graph that supports the assumptions that I am making:
from dowhy import CausalModel  # assuming DoWhy's CausalModel, which matches this call signature
model = CausalModel(data=dd, graph=causal_graph, treatment='log_T', outcome='log_Y')

In my case, the issue is that catvar1 is a high-cardinality variable with around 1000 unique values. In the orange juice price-elasticity example in Sec. 4 (https://github.com/py-why/EconML/blob/main/notebooks/Double%20Machine%20Learning%20Examples.ipynb), the categorical variable 'brand', which takes only 3 distinct values, is converted to dummy variables that are then used in the model. Could you advise me on the correct approach for my data, and whether any of the estimators (e.g. SparseLinearDML?) handles such high-dimensional categorical confounders automatically, without having to use the get_dummies function?

I have obtained CATE estimates (together with confidence intervals) by using catvar1 through catvar4 as confounders in W without converting them to dummies, only label-encoding them to numeric values, since that approach supports my causal graph (using the CausalForestDML estimator), but I am not sure whether that is the correct approach.
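For concreteness, a sketch of what I did (the nuisance learners and the encoding shown here are just placeholders):

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from econml.dml import CausalForestDML

cat_cols = ["catvar1", "catvar2", "catvar3", "catvar4"]

# integer-code each categorical confounder (what the label encoding does)
W = dd[cat_cols].apply(lambda s: s.astype("category").cat.codes)
X = dd[["X_scaled"]]
Y = dd["log_Y"]
T = dd["log_T"]

est = CausalForestDML(
    model_y=RandomForestRegressor(min_samples_leaf=20),  # first-stage outcome model
    model_t=RandomForestRegressor(min_samples_leaf=20),  # first-stage treatment model
    cv=3,
    random_state=0,
)
est.fit(Y, T, X=X, W=W)

cate = est.effect(X)                          # CATE point estimates
lb, ub = est.effect_interval(X, alpha=0.05)   # 95% confidence intervals
```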

So, the basic question is whether a high-cardinality confounding variable can be used as-is in estimating the nuisance regressions, for example when the nuisance regressions use XGBoost and the log of quantity and the log of price are the outcome and the treatment, respectively, as in the example in Sec. 4 (https://github.com/py-why/EconML/blob/main/notebooks/Double%20Machine%20Learning%20Examples.ipynb).

Any thoughts/suggestions will be greatly appreciated! Thank you in advance!

If you want to use your variable as a control (W), then as long as your first-stage nuisance models can deal with categoricals in this form (like XGBoost, as you mention), this should be fine.
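For example, a minimal sketch with XGBoost first-stage models (using your column names; XGBRegressor and its settings here are only an illustration) could look like:

```python
from xgboost import XGBRegressor
from econml.dml import LinearDML

# W holds the integer-coded high-cardinality confounders; tree-based learners
# such as XGBoost can split directly on those codes
est = LinearDML(
    model_y=XGBRegressor(n_estimators=200, max_depth=6),  # E[log_Y | X, W]
    model_t=XGBRegressor(n_estimators=200, max_depth=6),  # E[log_T | X, W]
    cv=3,
    random_state=0,
)
est.fit(Y, T, X=X, W=W)
est.summary()  # effect of log_T on log_Y, i.e. the elasticity
```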

However, if you want to use your variable as a feature (X), meaning that the price elasticity can vary depending on the setting of that categorical variable, then for a model like CausalForestDML that should probably still be fine (because the final model fit on X is a forest-based non-parametric model). But for models like LinearDML and SparseLinearDML you should one-hot-encode it (and drop one category, so that the columns are linearly independent), because those models fit treatment effect models that are linear in the numeric values in X.
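For the X case, a sketch of the one-hot route (using catvar2 purely as an example of a heterogeneity driver):

```python
import pandas as pd
from econml.dml import SparseLinearDML

# one-hot-encode the categorical that drives heterogeneity, dropping the first
# level so the dummy columns are linearly independent
X_oh = pd.get_dummies(dd[["X_scaled", "catvar2"]], columns=["catvar2"], drop_first=True)

est = SparseLinearDML(cv=3, random_state=0)
est.fit(Y, T, X=X_oh, W=W)
cate = est.effect(X_oh)  # elasticity now varies with X_scaled and catvar2
```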

Keith,
Thank you very much for your feedback! I was wondering whether an alternative to one-hot-encoding a high-cardinality confounding variable (in W) might be to remove its fixed effects by demeaning the outcome and the treatment variables, where the demeaning happens at the level of this variable. In the orange juice example, we would then use the demeaned outcome (log quantity) and the demeaned treatment (log price) in the elasticity estimation. However, this will not work if the high-cardinality variable is a source of treatment heterogeneity (in X).
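Concretely, what I have in mind is a within-transformation like the following (a sketch; W_other would hold the remaining encoded confounders):

```python
from econml.dml import LinearDML

# within-transformation: subtract catvar1-level means from outcome and treatment,
# which absorbs additive catvar1 fixed effects
dd["log_Y_dm"] = dd["log_Y"] - dd.groupby("catvar1")["log_Y"].transform("mean")
dd["log_T_dm"] = dd["log_T"] - dd.groupby("catvar1")["log_T"].transform("mean")

# any DML estimator could then be fit on the demeaned variables, with the
# remaining (lower-cardinality) confounders still passed in W
est = LinearDML(cv=3, random_state=0)
est.fit(dd["log_Y_dm"], dd["log_T_dm"], X=dd[["X_scaled"]], W=W_other)
```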

Also, I was wondering whether another encoding technique, such as feature hashing, could be used instead of one-hot-encoding the variable, because otherwise we will have too many variables in X, which can lead to overfitting problems.
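For example, something like scikit-learn's FeatureHasher could map the ~1000 categories of catvar1 into a fixed, much smaller number of columns (a sketch of what I have in mind; the number of hashed columns is a tunable choice):

```python
import pandas as pd
from sklearn.feature_extraction import FeatureHasher

# hash the ~1000 categories of catvar1 into 32 columns; collisions are possible,
# but the dimensionality stays fixed regardless of the cardinality
hasher = FeatureHasher(n_features=32, input_type="string")
hashed = hasher.transform([[v] for v in dd["catvar1"].astype(str)])  # sparse, shape (n, 32)
X_hashed = pd.DataFrame(hashed.toarray(), index=dd.index).add_prefix("catvar1_h")
```

These hashed columns could then be appended to X (or W) in place of the dummy columns.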

Once again, your thoughts and suggestions are greatly appreciated!

Keith,
My basic question has been answered; I believe we can close this issue. Thank you and your team for your feedback and for this wonderful contribution to the community.