Doubt on variance importance and (multi)collinearity

MiqG opened this issue 4 years ago

Miquel Anglada Girotto commented 4 years ago

Hi!

First, thank you for developing such a cool new concept for network inference in omics!

After reading your paper, I was wondering whether the variable importances obtained could be confounded by multicollinearity between genes, as explained here.
As I understand it, highly collinear features (genes) hold very similar information about the target variable, so each of them will be used for splits close to the root only a few times across the trees. Each of them will therefore end up with a low importance when averaging over the whole ensemble.
Is this true of the current implementation of variable importance?

Thank you very much again!

Miquel

Vân Anh Huynh-Thu · Answer 1 · Tue Jan 12 2021 00:20:38 GMT+0800 (China Standard Time)

Hi Miquel,

Yes, this is an issue that can occur with the current implementation of variable importance.
When several input genes are highly correlated, the information that they carry about the target gene tends to be spread across them, so each of them individually receives a lower importance score.
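
The dilution effect can be illustrated with a small, self-contained sketch. This is not the tool's implementation: it assumes a toy ensemble of one-split regression trees (stumps), each fitted on a bootstrap sample with a single randomly chosen input, and scores each input by the variance reduction of its splits. The data, the function names (`make_data`, `best_split_gain`, `importances`) and all parameter values are hypothetical and chosen only for illustration.

```python
import random


def make_data(n_copies, n=200, seed=1):
    """Toy data: n_copies near-identical inputs plus one independent input."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        z = rng.gauss(0, 1)  # signal shared by the collinear inputs
        w = rng.gauss(0, 1)  # signal of the independent input
        copies = [z + rng.gauss(0, 0.05) for _ in range(n_copies)]
        X.append(copies + [w])
        y.append(z + w + rng.gauss(0, 0.1))
    return X, y


def best_split_gain(x, y):
    """Largest variance reduction achievable with one threshold on x."""
    n = len(y)
    pairs = sorted(zip(x, y))
    total = sum(yi for _, yi in pairs)
    left, best = 0.0, 0.0
    for k in range(1, n):
        left += pairs[k - 1][1]
        if pairs[k - 1][0] == pairs[k][0]:
            continue  # no threshold separates identical values
        right = total - left
        # SSE(all) - SSE(left) - SSE(right), written with running sums
        gain = left * left / k + right * right / (n - k) - total * total / n
        best = max(best, gain)
    return best / n


def importances(X, y, n_trees=300, seed=0):
    """Normalised importances from bootstrap stumps on random single inputs."""
    rng = random.Random(seed)
    n, p = len(y), len(X[0])
    scores = [0.0] * p
    for _ in range(n_trees):
        rows = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        j = rng.randrange(p)  # input considered by this stump
        scores[j] += best_split_gain([X[i][j] for i in rows],
                                     [y[i] for i in rows])
    total = sum(scores)
    return [s / total for s in scores]


# Last entry is the independent input; the others are the collinear copies.
alone = importances(*make_data(n_copies=1))
collinear = importances(*make_data(n_copies=3))
print("one copy:    ", [round(v, 2) for v in alone])
print("three copies:", [round(v, 2) for v in collinear])
```

In this toy setting the first input carries the same information in both runs, yet its share of the total importance drops once near-identical copies are added, because the copies compete for the same splits. A real tree ensemble selects splits differently, so the size of the effect will differ.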

However, I have never tried to correct for this issue in the context of network inference.