vahuynh / GENIE3

Machine learning-based approach for the inference of gene regulatory networks from expression data.

Question about variable importance and (multi)collinearity

MiqG opened this issue

Hi!

First, thank you for developing such a cool new concept for network inference in omics!

After reading your paper, I was wondering whether the variable importances obtained could be confounded by multicollinearity between genes, as explained here.
My understanding is that highly collinear features (genes) hold very similar information with respect to the target variable, so each of them will be used for splitting observations close to the root only a few times per tree. Therefore, each of them will end up with a low importance when averaging over the whole ensemble.
Is this true with the current implementation of variable importance?

Thank you very much again!

Miquel

Hi Miquel,

Yes, this is an issue that you can have with the current implementation of variable importance.
When several input genes are highly correlated, the information that they carry about the target gene will tend to be spread across them, so each of them receives a lower importance score.

However, I have never tried to correct for this issue in the context of network inference.
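
To make the dilution effect concrete, here is a minimal toy sketch in pure Python. It is not the GENIE3 code: it assumes a simplified ensemble of bagged one-split regression trees (stumps) and scores each feature by its average variance reduction. The names `best_stump`, `stump_importances`, `gene_a`, `gene_b` and `gene_c`, and the simulated data, are all hypothetical and only serve the illustration.

```python
import random

def best_stump(X, y, rows):
    """Best single split on the given rows: (feature index, variance reduction)."""
    n = len(rows)
    total = sum(y[i] for i in rows)
    best_feature, best_gain = None, 0.0
    for j in range(len(X[0])):
        order = sorted(rows, key=lambda i: X[i][j])
        left_sum = 0.0
        for k in range(1, n):
            left_sum += y[order[k - 1]]
            if X[order[k - 1]][j] == X[order[k]][j]:
                continue  # no threshold fits between equal values
            right_sum = total - left_sum
            # Drop in the sum of squared errors when splitting after position k.
            gain = left_sum ** 2 / k + right_sum ** 2 / (n - k) - total ** 2 / n
            if gain > best_gain:
                best_feature, best_gain = j, gain
    return best_feature, best_gain / n

def stump_importances(X, y, n_stumps=200, seed=0):
    """Average variance reduction credited to each feature over bagged stumps."""
    rng = random.Random(seed)
    n = len(y)
    importances = [0.0] * len(X[0])
    for _ in range(n_stumps):
        rows = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        feature, gain = best_stump(X, y, rows)
        if feature is not None:
            importances[feature] += gain
    return [v / n_stumps for v in importances]

# Simulated data (an assumption for illustration, not real expression data).
rng = random.Random(42)
n = 200
z = [rng.gauss(0, 1) for _ in range(n)]        # hidden regulator activity
target = [v + rng.gauss(0, 0.3) for v in z]    # target gene
gene_a = [v + rng.gauss(0, 0.3) for v in z]    # informative input gene
gene_b = [v + rng.gauss(0, 0.3) for v in z]    # highly correlated with gene_a
gene_c = [rng.gauss(0, 1) for _ in range(n)]   # unrelated input gene

X_single = [list(r) for r in zip(gene_a, gene_c)]
X_dup = [list(r) for r in zip(gene_a, gene_b, gene_c)]

imp_single = stump_importances(X_single, target)
imp_dup = stump_importances(X_dup, target)

print("without gene_b: gene_a = %.3f" % imp_single[0])
print("with gene_b:    gene_a = %.3f, gene_b = %.3f" % (imp_dup[0], imp_dup[1]))
```

Both runs use the same bootstrap samples, so the only difference is the presence of `gene_b`. Whenever `gene_b` wins a split that `gene_a` would otherwise have won, the credit goes to `gene_b`, and the score of `gene_a` drops even though its relationship with the target is unchanged. Real forests with deeper trees and random feature subsets behave in a more complex way, but the sharing of credit among correlated inputs follows the same logic.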