Are categorical features currently supported by causalnex with label encoding?

Question

tonyabracadabra opened this issue 2 years ago · comments

Xupeng (Tony) Tong commented 2 years ago

I know that label encoding categorical variables makes the algorithm run on them, but is it mathematically valid to infer causal relationships between variables that have been encoded this way?

Xupeng (Tony) Tong · Answer 1 · Tue Sep 27 2022 13:23:14 GMT+0800 (China Standard Time)

Hey folks, are there any updates on this question? @oentaryorj @GabrielAzevedoFerreiraQB, any insights would be helpful. I think we might need to handle the independence test for categorical variables separately, and I am not sure whether that is implemented in the system now.

Gabriel Azevedo Ferreira · Answer 2 · Tue Sep 27 2022 14:14:14 GMT+0800 (China Standard Time)

Hey Tony,

Hope you are well! Thanks for the great question!

You're absolutely right.

For NOTEARS, we do need continuous variables, as you correctly mentioned.
A simple label encoding doesn't always make sense. For example, encoding a variable such as "countries" directly (with arbitrary, effectively random, integer codes) would not give NOTEARS any signal from which to learn relationships.
However, in certain situations it is still possible to do such encoding:
- cases where the variables are binary
- cases where the values have a natural order, say days of the week (to a certain extent)
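
To make the two cases above concrete, here is a minimal sketch in plain Python. The helper names (`encode_ordinal`, `encode_binary`) and the sample data are hypothetical and are not part of causalnex; the point is only that the integer codes follow an order that is stated explicitly rather than assigned arbitrarily.

```python
# Hypothetical helpers and data for illustration; not part of the causalnex API.

# The order is spelled out explicitly so the integer codes carry meaning.
DAY_ORDER = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def encode_ordinal(values, ordered_categories):
    """Map each category to its position in an explicitly ordered list."""
    index = {category: position for position, category in enumerate(ordered_categories)}
    return [index[value] for value in values]


def encode_binary(values, positive):
    """Map a two-valued variable to 0/1, with `positive` encoded as 1."""
    distinct = set(values)
    if len(distinct) > 2:
        raise ValueError(f"expected at most 2 distinct values, got {sorted(map(str, distinct))}")
    return [1 if value == positive else 0 for value in values]


days = ["Wed", "Mon", "Sun", "Fri"]
smoker = ["yes", "no", "no", "yes"]

encoded_days = encode_ordinal(days, DAY_ORDER)
encoded_smoker = encode_binary(smoker, positive="yes")

print(encoded_days)    # [2, 0, 6, 4]
print(encoded_smoker)  # [1, 0, 0, 1]
```

In a pandas workflow the same mapping could be applied to a column with `Series.map` and a dictionary. A variable like "countries" has no such order, so any integer assignment to it would be arbitrary.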

One thing to note, though, is that NOTEARS is not "scale invariant", meaning that if we multiply a variable by a constant, the NOTEARS results change. There are discussions on the best way to handle this, but I'd (personally!) recommend thinking carefully about how to normalize the variables when dealing with encoded discrete variables.
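
As an illustration of the scaling point, the sketch below standardizes a column to zero mean and unit variance (z-scores) before it would be passed to structure learning. Standardization is only one common normalization choice and is an assumption here, not something this thread or causalnex prescribes; the `standardize` helper and the sample values are hypothetical.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Return z-scores: subtract the mean, divide by the population std."""
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:
        raise ValueError("cannot standardize a constant column")
    return [(value - mean) / std for value in values]


# A label-encoded ordinal column, and the same column multiplied by a constant.
encoded = [0, 1, 2, 3, 4]
rescaled = [10 * value for value in encoded]

z_encoded = standardize(encoded)
z_rescaled = standardize(rescaled)

# After standardization the two versions coincide (up to floating-point error),
# so the arbitrary multiplicative scale of the encoding no longer matters.
same = all(abs(a - b) < 1e-9 for a, b in zip(z_encoded, z_rescaled))
print(same)
```

This removes the effect of multiplying a column by a positive constant, but it does not make an arbitrary label assignment meaningful, and whether z-scoring is appropriate for a given encoded variable remains a modelling decision.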

Xupeng (Tony) Tong · Answer 3 · Tue Sep 27 2022 16:19:55 GMT+0800 (China Standard Time)

Thanks Gabriel for answering my question!

I saw that the release notes say "Added categorical distributed data support for pytorch NOTEARS." What does that mean?

Are there any plans to support causal discovery on mixed-type data, based on newly published papers?

jinjones · Answer 4 · Fri Jun 23 2023 12:26:46 GMT+0800 (China Standard Time)

In that case, can I use one-hot encoding for categorical variables?