mckinsey / causalnex

A Python library that helps data scientists to infer causation rather than observing correlation.

Home Page:http://causalnex.readthedocs.io/

Are categorical features currently supported by causalnex with label encoding?

tonyabracadabra opened this issue · comments

I know that label-encoding categorical variables lets the algorithm run on them, but is it mathematically valid to infer causal relationships from the label-encoded values?

Hey folks, are there any updates on this question? @oentaryorj @GabrielAzevedoFerreiraQB Any insights would be helpful. I think we might need to handle the independence test for categorical variables separately, and I am not sure whether that is implemented in the system now.

Hey Tony,

Hope you are well! Thanks for the great question!

You're absolutely right.

  • For NOTEARS, we do need continuous variables as you correctly mentioned.
  • It doesn't always make sense to do a simple label encoding. For example, encoding a "countries" variable directly (assigning integers "randomly") would not give NOTEARS any signal from which to learn a relationship.
  • However, in certain situations it is still possible to do such encoding:
    • case where variables are binary
    • case where the variable has a natural order, say days of the week (to a certain extent)
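
The two cases above can be sketched with pandas. This is only an illustration: the column names (`smoker`, `day`) and the data are made up, and the choice of Monday as the first day is an assumption, not something causalnex prescribes.

```python
import pandas as pd

# Hypothetical toy data: one binary and one ordered categorical column.
df = pd.DataFrame({
    "smoker": ["yes", "no", "no", "yes"],
    "day": ["Mon", "Wed", "Sun", "Tue"],
})

# Binary case: map the two levels to 0 and 1.
df["smoker_enc"] = df["smoker"].map({"no": 0, "yes": 1})

# Ordinal case: state the order explicitly so the integer codes follow it,
# instead of relying on alphabetical or first-seen order.
day_order = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
df["day_enc"] = pd.Categorical(df["day"], categories=day_order, ordered=True).codes

encoded = df[["smoker_enc", "day_enc"]]
print(encoded)
```

A nominal variable such as "countries" has no such order, so any list passed as `categories` would be arbitrary, which is the problem described above.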

One thing to note, though, is that NOTEARS is not "scale invariant": if we multiply a variable by a constant, the NOTEARS results change. There are discussions on the best way to handle this, but I'd (personally!) recommend thinking more carefully about normalizing the variables when dealing with encoded discrete variables.
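
As a rough sketch of what normalizing could look like, the snippet below z-scores every column with pandas before the data would be handed to structure learning. The `standardize` helper, the column names and the values are hypothetical, and z-scoring is just one common choice; whether it is the best normalization for NOTEARS is left open here.

```python
import numpy as np
import pandas as pd

def standardize(frame: pd.DataFrame) -> pd.DataFrame:
    """Rescale every column to zero mean and unit (population) standard deviation."""
    return (frame - frame.mean()) / frame.std(ddof=0)

# Hypothetical data: two encoded discrete columns and one continuous column.
df = pd.DataFrame({
    "smoker_enc": [1, 0, 0, 1, 1],
    "day_enc": [0, 2, 6, 1, 4],
    "income": [32000.0, 45000.0, 51000.0, 28000.0, 39000.0],
})

scaled = standardize(df)

# Multiplying a column by a constant (e.g. changing its unit) no longer
# changes the input once it has been standardized.
rescaled = standardize(df.assign(income=df["income"] * 1000))
print(np.allclose(scaled, rescaled))
```

After this step all columns share a common scale, so the learned structure should not depend on the arbitrary units or integer codes chosen for any one variable.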

Thanks Gabriel for answering my question!

I saw that the release notes say "Added categorical distributed data support for pytorch NOTEARS." What does that mean?

Are there any plans to support causal discovery on mixed-type data, along the lines of newly published papers?

In that case, can I do one-hot encoding for categorical variables?