JuliaStats / GLM.jl

Generalized linear models in Julia

Add VIF?

azev77 opened this issue · comments

There was a request for VIF on Discourse.
It's pretty simple to add, using the fact that the VIFs are the diagonal elements of the inverse of the predictors' correlation matrix.
-Belsley, D. A., E. Kuh, and R. E. Welsch. Regression Diagnostics. Hoboken, NJ: John Wiley & Sons, 1980.

vif_GLM(mod) = diag(inv(cor(mod.model.pp.X[:,2:end])))

using DataFrames, GLM, Statistics, LinearAlgebra, RDatasets
airquality = rename(dataset("datasets", "airquality"), "Solar.R" => "Solar_R")
#
y  = airquality.Wind
X1 = airquality.Temp
X2 = airquality.Solar_R
X3 = airquality.Ozone
d = DataFrame(X1=X1, X2=X2, X3=X3, y=y)
d = d[completecases(d), :]                     # drop rows with missing values
X = [one.(d.X1) d.X1 d.X2 d.X3] .|> Float64    # model matrix with intercept column
y = d.y .|> Float64
#
β_LS = X \ y                                   # least-squares coefficients
vifm = diag(inv(cor(X[:, 2:end])))             # VIFs from the inverse correlation matrix
#
m_y  = lm(@formula(y ~ 1 + X1 + X2 + X3), d)
m_X1 = lm(@formula(X1 ~ 1 + X2 + X3), d)
m_X2 = lm(@formula(X2 ~ 1 + X1 + X3), d)
m_X3 = lm(@formula(X3 ~ 1 + X1 + X2), d)

vif_X1 = 1.0/(1.0-r2(m_X1))
vif_X2 = 1.0/(1.0-r2(m_X2))
vif_X3 = 1.0/(1.0-r2(m_X3))
#
vifm ≈ [vif_X1; vif_X2; vif_X3]
#
vif_GLM(m_y) ≈ vifm ≈ [vif_X1; vif_X2; vif_X3]

or

function vif_glm1(mod::StatsModels.TableRegressionModel)
    diag(inv(cor(mod.model.pp.X[:,2:end])))
end

Sure, feel free to make a pull request (with tests and documentation). Though just like #415 it would make sense to add the empty function definition to StatsBase and StatsModels too.

@azev77 are you (or anyone else) working on this?

If not, I'll give it a swing. I just don't want to duplicate effort.

Not working on it

After working on this, I learned that the solution mentioned at the beginning of the post is a little outdated and doesn't accurately handle categorical variables. (No criticism meant; I apologize if I came off harsh.) There are more modern methods for calculating VIFs.

See:
Fox, J. and Monette, G. (1992) Generalized collinearity diagnostics. JASA, 87, 178–183.
Fox, J. (2016) Applied Regression Analysis and Generalized Linear Models, Third Edition. Sage.
Fox, J. and Weisberg, S. (2018) An R Companion to Applied Regression, Third Edition, Sage.

R provides a vif function in the car package that is maintained by John Fox. They have done extensive work on VIF calculations. The vif function in R calculates Generalized Variance-Inflation Factors.

I'm not going to pretend to be more of an expert on the topic than them. So, I copied the code for the default vif function from R to julia.

@nalimilan, I have a couple of design questions.

  1. My current workings of a vif function only work on linear models. There is a vif function in R that works on Mixed Models. Should I add that code to MixedModels.jl, or include it here? (I assume it would be better put in MixedModels.jl.)

  2. I used PrettyTables.jl to print the results of the function output. I really don't want to figure out terminal printing without PrettyTables.jl. However, I am willing to try if it would be preferable for the overall health of the package. (I think PrettyTables is very nice though because of its ability to print out LaTeX tables.)

Below is a screenshot of a linear model that does have a categorical variable, and one that doesn't; and then how the vif function I copied works with each model.

Please let me know what you think. Thank you.

[Screenshot (Screen Shot 2022-12-21 at 8.45.54 AM): the two fitted models and the vif output for each]

Just to clarify:

There is more work to be done with VIFs when it comes to mixed models, weighted linear models, and models with interactions in them. I'll keep working on the code to ensure that these other cases are handled before opening a pull request.

Just to clarify the output:

The beginning part of the R docs about their vif function is:

"If all terms in an unweighted linear model have 1 df, then the usual variance-inflation factors are calculated.

If any terms in an unweighted linear model have more than 1 df, then generalized variance-inflation factors (Fox and Monette, 1992) are calculated. These are interpretable as the inflation in size of the confidence ellipse or ellipsoid for the coefficients of the term in comparison with what would be obtained for orthogonal data."
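For readers following the math, the Fox & Monette (1992) generalized VIF for a term is det(R₁₁)·det(R₂₂)/det(R), where R is the correlation matrix of all predictors, R₁₁ the block for the term's columns, and R₂₂ the block for the remaining columns. A minimal sketch (the name `gvif_sketch` and the `groups` argument are illustrative, not any package's actual API; `X` is assumed to be the model matrix without its intercept column, and each range in `groups` lists the columns one term occupies):

```julia
using LinearAlgebra, Statistics

# Sketch of the Fox & Monette (1992) GVIF. A categorical term with k levels
# occupies k-1 contiguous model-matrix columns, given as one range in `groups`.
function gvif_sketch(X::AbstractMatrix, groups::Vector{UnitRange{Int}})
    R = cor(X)            # correlation matrix of all (non-intercept) predictors
    detR = det(R)
    map(groups) do idx
        rest = setdiff(1:size(X, 2), idx)
        # determinant ratio: term block × remainder block ÷ full matrix
        det(R[idx, idx]) * det(R[rest, rest]) / detR
    end
end
```

When every term has 1 df, each GVIF reduces to the usual VIF, i.e. the corresponding diagonal element of `inv(cor(X))`, which matches the first sentence of the R docs quoted above.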

Be careful. I don’t think we can copy R code & keep the MIT license

Thanks for the heads up. How much different does it have to be to not count as copying? Lolz.

The R code merely implements mathematical equations for VIFS and GVIFS, and outputs it in R format.

Mine also just implements mathematical equations and outputs it in Julia.

If you know of resources on what makes something "copied" or not, please let me know. I can read further into it.

R's car library is distributed under the GPL-2 and GPL-3 licenses, which give permission to modify the code, and the MIT license doesn't seem to prohibit anything about how software is created. From my limited legal knowledge, for whatever that's worth, it looks like there is no violation of either license in what I did. If anyone else with more legal knowledge would like to chime in, feel free to do so.

Thanks for working on this @HiramTheHero. Unfortunately, @azev77 is right that we cannot include code derived from GPL code in GLM.jl: the MIT license is less restrictive than the GPL, so the whole package would have to become GPL. It's hard to define what counts as "derived" code, though. Unless the code is less than a handful of lines, to be on the safe side I would recommend writing to John Fox to ask him for permission to release your Julia code inspired by his R code under the MIT license (you can put me in CC as I know him).

  • My current workings of a vif function only work on linear models. There is a vif function in R that works on Mixed Models. Should I add that code to MixedModels.jl, or include it here? (I assume it would be better put in MixedModels.jl.)

Better add code for LMs and GLMs here, and code for mixed models in MixedModels.jl. Or if the code can be written only using APIs defined in StatsAPI.jl, maybe it could live in StatsModels.jl. Anyway we should add an empty definition of vif in StatsAPI.jl so that all packages that need it can override it.

  • I used PrettyTables.jl to print the results of the function output. I really don't want to figure out terminal printing without PrettyTables.jl. However, I am willing to try if it would be preferable for the overall health of the package. (I think PrettyTables is very nice though because of its ability to print out LaTeX tables.)

We probably don't want GLM.jl to depend on PrettyTables.jl just for that. You could take inspiration from the code used for FTestResult, you'll see it's relatively simple. But this is actually a tricky issue: ideally, there would be an easy way to extract the value for a given variable via indexing, e.g. vif(m)["X"] or vif(m)["X", "GVIF"]. This would be possible using NamedArrays.jl or AxisKeys.jl, but due to the existence of several competing packages for this we don't use any of them in JuliaStats packages. This is annoying.
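For reference, a PrettyTables-free layout in the spirit of FTestResult's plain printing might look something like the sketch below. `VIFTable` and its fields are hypothetical names, not an agreed API; this only shows that aligned terminal output needs nothing beyond `rpad` and `println`:

```julia
# Hypothetical result container; names are illustrative only.
struct VIFTable
    names::Vector{String}
    vifs::Vector{Float64}
end

function Base.show(io::IO, t::VIFTable)
    # pad the term column to the widest name (or the header, if wider)
    w = max(maximum(length, t.names), length("Term"))
    println(io, rpad("Term", w), "  VIF")
    for (n, v) in zip(t.names, t.vifs)
        println(io, rpad(n, w), "  ", round(v; digits=4))
    end
end
```

Indexed access like `vif(m)["X"]` would still need something on top of this, as noted above.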

Thank you @nalimilan for the response.

I've sent an email to John Fox with your GitHub email CC'd.

Better add code for LMs and GLMs here, and code for mixed models in MixedModels.jl. Or if the code can be written only using APIs defined in StatsAPI.jl, maybe it could live in StatsModels.jl. Anyway we should add an empty definition of vif in StatsAPI.jl so that all packages that need it can override it.

To calculate the VIF, I've used the det function from the LinearAlgebra package, and the cov2cor function in the StatsBase package. It looks like both of those packages are dependencies within StatsModels. Just let me know where you would prefer it to be placed. I'll make sure to add an empty vif function to StatsAPI.jl.
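For anyone following along, the `cov2cor` step mentioned here is the standard covariance-to-correlation conversion from StatsBase, and it agrees with computing the correlation matrix directly (a small sketch; `X` is any complete-case numeric predictor matrix):

```julia
using LinearAlgebra, Statistics, StatsBase

X = randn(100, 3)               # stand-in for a complete-case predictor matrix
C = cov(X)
R = cov2cor(C, sqrt.(diag(C)))  # rescale covariances by the standard deviations
# R ≈ cor(X): both routes give the same correlation matrix
```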

We probably don't want GLM.jl to depend on PrettyTables.jl just for that. You could take inspiration from the code used for FTestResult, you'll see it's relatively simple.

👍

But this is actually a tricky issue: ideally, there would be an easy way to extract the value for a given variable via indexing, e.g. vif(m)["X"] or vif(m)["X", "GVIF"]. This would be possible using NamedArrays.jl or AxisKeys.jl, but due to the existence of several competing packages for this we don't use any of them in JuliaStats packages. This is annoying.

I created two structs that hold the results of the vif function. So, their contents can be accessed. Not necessarily by a specific variable though.

struct SingleVifResult
    coefNames::Vector{String}
    vif::Vector{Float64}
end

struct VifResults
    coefNames::Vector{String}
    dof::Vector{Int}
    gvif::Vector{Float32}
    gvifModified::Vector{Float32}
end

Accessing Results:

julia> vif(MOD2).coefNames
4-element Vector{String}:
 "SepalWidth"
 "PetalLength"
 "PetalWidth"
 "Species"

julia> vif(MOD2).gvif
4-element Vector{Float32}:
  2.2274659
 23.161648
 21.0214
 40.039177

Just an update: we emailed John Fox and he was very supportive of us integrating his VIF methods into the Julia ecosystem.

It sounds like someone else has already built a VIF functionality for MixedModels, and that creator is wanting to extend their method to GLM. (JuliaStats/MixedModels.jl#661 (comment)) To avoid duplication of effort, I won't be putting more effort into this until a decision is made by the package maintainers on what direction should be taken.

The implementation in MixedModelsExtras will actually already handle (some) models fit with GLM.jl:

https://github.com/palday/MixedModelsExtras.jl/blob/d92910c011ca4c920a4e5e7dfc84b99d87f12754/test/vif.jl#L45-L61

@testset "GVIF and RegressionModel" begin
    duncan = rdataset("car", "Duncan")


    lm1 = lm(@formula(Prestige ~ 1 + Income + Education), duncan)
    @test termnames(lm1) == coefnames(lm1)
    vif_lm1 = vif(lm1)


    # values here taken from car
    @test isapprox(vif_lm1, [2.1049, 2.1049]; atol=1e-5)
    @test isapprox(vif_lm1, gvif(lm1))


    lm2 = lm(@formula(Prestige ~ 1 + Income + Education + Type), duncan)
    @test termnames(lm2) == ["(Intercept)", "Income", "Education", "Type"]
    @test isapprox(gvif(lm2), [2.209178, 5.297584, 5.098592]; atol=1e-5)
    @test isapprox(gvif(lm2; scale=true),
                   [1.486330, 2.301648, 1.502666]; atol=1e-5)
end

Might be worth moving the implementation to a more upstream package for better visibility. It didn't look like the implementations were using anything MixedModel specific (but I might have missed it)

What is the status of this issue. Is there any help needed to get it moving?

@bkamins The code is written; there is just a little bit of discussion about what the API should look like and where various parts should live: JuliaStats/MixedModels.jl#661 (comment)

Currently we have up to three functions:

  • termnames -- owned and implemented in StatsModels?
  • vif -- owned by StatsAPI and potentially implemented there? (The naive implementation for RegressionModel is very straightforward.)
  • gvif -- owned by StatsAPI or maybe StatsModels? Should this be a separate function or a kwarg to vif?

cc @kleinschmidt if we put a termnames in StatsModels.

My opinion is:

  • it is OK for vif and gvif to be separate functions (especially since they already are in MixedModelsExtras.jl)
  • where is the proposal of the code? (I quickly checked https://github.com/palday/MixedModelsExtras.jl/blob/main/src/vif.jl and was not sure that all corner cases, especially with respect to how intercept term could be passed to the model are covered)
  • regarding the package choice where to put them; vif and gvif can be owned by StatsAPI.jl; they could be implemented in StatsModels.jl or StatsAPI.jl (wherever is more convenient; in practice if these functions are used both packages will be loaded anyway I think)

Any updates on this feature being implemented?

if @kleinschmidt is happy for StatsModels to own termnames, then I'll open a PR there (against both 0.6 and 0.7 series), get that merged, then I'll move vif over to StatsAPI.

@HiramTheHero in the meantime, you should "just" be able to do

using MixedModelsExtras
vif(my_glm_model)

and things will probably work.

Just a side note for anyone following along with this thread. The following code works if your model has categorical variables and you're wanting scaled GVIFs (similar output to vif in R's car package).

using MixedModelsExtras
gvif(my_glm_model, scale = true) # This produces GVIF^(1/(2*Df))
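The `scale=true` keyword corresponds to car's GVIF^(1/(2·Df)) column. As a formula, the scaling is a one-liner (the name `scaled_gvif` is illustrative; the numbers in the usage check are the Duncan-data values quoted from the MixedModelsExtras tests above, where the 3-level `Type` term has 2 df):

```julia
# Put GVIFs for terms with different df on a comparable scale, as in
# R's car package: GVIF^(1/(2*Df)).
scaled_gvif(g, df) = g .^ (1 ./ (2 .* df))

scaled_gvif([2.209178, 5.297584, 5.098592], [1, 1, 2])
# ≈ [1.486330, 2.301648, 1.502666], matching gvif(lm2; scale=true) above
```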

if @kleinschmidt is happy for StatsModels to own termnames, then I'll open a PR there (against both 0.6 and 0.7 series), get that merged, then I'll move vif over to StatsAPI.

@kleinschmidt Is that OK for you?

I think I mentioned elsewhere that I'm very much against VIF being added to any packages, as it's a statistical footgun in the waiting. I'll explain why here.

The VIF measures the degree to which two variables are confounded with each other. In other words, if the VIF between A and B is high, the variables are strongly correlated with each other, and either one could be the true explanation of the outcome variable. We cannot know, without additional data, whether A, B, or both are the real cause of the outcome variable Y.

The problem with this measure is that early in the history of econometrics, it was misinterpreted by the person deriving it as saying that, if we remove B from our regression, we will be able to measure the effect of A more accurately. This is not the case--pretending that a confounder doesn't exist will make your regression worse, not better. Unfortunately, it is common to remove variables with a high VIF from a regression, because economists (and scientists in general) are often trained to think of their p-values and estimated standard errors as indicating their study's quality.

Removing variables with a high VIF invalidates any calculated statistics or p-values, as you are now applying a post-hoc procedure, and (in the case of observational data) routinely breaks causal inference, by removing variables that need to be included to avoid confounding. This procedure introduces a garden of forking paths, where the analysis researchers actually perform is strongly dependent on the results of their experiment, introducing substantial model-based variability that is not captured in the p-values or standard errors reported by GLM.jl.

We are not responsible for how users use or misuse a particular statistical quantity -- otherwise, I certainly wouldn't want to be responsible for any code that outputs a p-value! There are valid uses of VIF and not all statistical education follows the pattern in econometrics. The typical applications I've seen in my domains is an indicator of whether or not individual coefficient standard errors are conflated or whether there are potential numerical issues. I had not seen the advice to remove terms with a high VIF, but rather discussions on orthogonalization, residualization, etc.

The typical applications I've seen in my domains is an indicator of whether or not individual coefficient standard errors are conflated or whether there are potential numerical issues. I had not seen the advice to remove terms with a high VIF, but rather discussions on orthogonalization, residualization, etc.

In that case, if the goal is to use it as a diagnostic, I think it makes sense to have the VIF as an internal function. However, I would avoid exporting it or including it in StatsAPI.jl, to avoid encouraging common misunderstandings that users must "check" for collinearity, and GLM.jl should automatically center/orthogonalize variables when necessary (rather than pushing this problem onto the user).

We are not responsible for how users use or misuse a particular statistical quantity [...] There are valid uses of VIF and not all statistical education follows the pattern in econometrics.

I understand that in some cases, statistics can be used incorrectly or misinterpreted. My main concern is that from the statistical (not numerical) perspective, the VIF has very few valid uses (unlike the p-value), but many common invalid ones.

I don't think we can bear all the responsibility for users' abuse of software, but we should still make an effort to make the software easy to use correctly. If we created a function called optimize_errors that did regression by maximizing (rather than minimizing) the sum of squared errors, that would be a very bad idea, because it's far more likely to be used by mistake than intentionally. This is true even if the function is correctly documented, so that using the values is technically "the user's fault" for misusing the statistic. It's like the old Far Side cartoon--if you put that button there, sooner or later, someone's going to press it ;)

As @palday noted there are valid use cases for the VIF and people expect to be able to compute it. Automatically orthogonalizing variables wouldn't replace this.

(FWIW, in my field the BIC is also often abused to claim that a variable has no effect, yet I wouldn't suggest dropping it as it's a classic indicator.)

(FWIW, in my field the BIC is also often abused to claim that a variable has no effect, yet I wouldn't suggest dropping it as it's a classic indicator.)

I think that's a mostly correct interpretation (it's correct if you assume that the coefficient might be 0). This is roughly my point: there are correct and incorrect uses for BIC, or p-values, or most other measures, but most uses are at least 50% correct. People don't understand lots of statistics, but they at least know basics like:

  1. A lower p-value means less uncertainty than a higher one
  2. A lower BIC is evidence you should include a variable in your model
  3. A higher R^2 indicates a better model

i.e. even if they're misinterpreting them, they at least know whether the variable is good or bad, so it's not doing more harm than just not doing statistics at all.

For a better analogy, imagine people routinely dropped variables that improved the BIC, i.e. they tried to select the model that assigned the lowest probability to the data.

A higher R^2 indicates a better model

No, even assuming that there is a clear definition of R^2 for a given model (which there isn't for mixed models).

https://www.johnmyleswhite.com/notebook/2016/07/23/why-im-not-a-fan-of-r-squared/

A higher R^2 indicates a better model

No, even assuming that there is a clear definition of R^2 for a given model (which there isn't for mixed models).

https://www.johnmyleswhite.com/notebook/2016/07/23/why-im-not-a-fan-of-r-squared/

I'm not a huge fan of R^2 either, and it's definitely easy to misinterpret. But at least this interpretation is somewhat accurate in some common situations. "High R^2 is better" is dramatically oversimplified and ignores a lot of other factors that can matter more to a model; unfortunately, this misunderstanding causes people to make mistakes in how they interpret it. But it still has a kernel of truth at its core: lower variance means better predictions. Sometimes, people using R^2 are using it correctly (e.g. to compare linear regression models; as the blog post notes, the true model will always have a better R^2 than an incorrect model, given a big enough sample size). I don't think I've ever seen a correct use of VIF in a paper.

As another example, imagine replacing all our p values with a brand-new ρ-value, defined as 1-p. Just because our users might make mistakes sometimes, doesn't mean we should make it easier for them to misunderstand--our goal should be to avoid footguns rather than encourage them.

But it still has a kernel of truth at its core: lower variance means better predictions.

Again, no! For one thing, R^2 captures a particular measure of variance (relative to a saturated model), not the variance of the predictions. The Stein paradox and the bias-variance tradeoff speak to there being a tradeoff to be made, but making that tradeoff should be informed by the broader context of the analysis.

VIF is not a quantity constructed explicitly to be misleading like your alternative rho-value or error-maximizing approach, but rather one that has valid applications and is included in software implemented by prominent applied statisticians (e.g. John Fox in R's car and the Applied Regression Analysis book that car is the companion to). I am interested in having a canonical implementation so that users who know what they're doing statistically but maybe don't understand numerical stability can use it. That's why I wrote the implementation in MixedModelsExtras.

If you want to view it from a responsibility to users perspective, then users will compute this value somehow. Whether it's using RCall or some problematic code they copied and pasted from somewhere, they will find a way to get it computed. Maybe their boss or a reviewer or a regulatory agency demanded it, but they will find a way to compute it. The least we can do is make sure that they have the correctly computed value and not something that uses the wrong or numerically unstable formula. That's how we as software authors can support our users doing valid inference. It's up to them to use that value correctly. For that part, we can be educators (and I do also teach), but that is a different role than software author.

Finally, I want to note that the proposal is for defining a function to compute this value and not for showing it in default outputs. I would also be opposed to including VIF in the default show method for models.

Again, no! For one thing, R2 captures a particular measure of variance (relative to a saturated model) and not the variance of the predictions. The Stein paradox and the bias-variance tradeoff speak to there being a tradeoff that needs to be made, but making that tradeoff should be informed by the broader context of the analysis.

Again, I'm talking in the context of comparing simple linear regression models on the same dataset, in which case the scaling isn't important and there is no bias-variance tradeoff (as OLS is unbiased). But regardless, my point is it's common to misunderstand R^2 partially, but complete reversals are rare. Few people try to choose models that actively minimize R^2.

VIF is not a quantity constructed explicitly to be misleading

Right, but my point is it's just as misleading as those examples. The intentions of the inventor don't matter; what matters is whether the function is misleading. If a function does something very different from what users expect it to do, there's a problem with the function.

Whether it's using RCall or some problematic code they copied and pasted from somewhere, they will find a way to get it computed. Maybe their boss or a reviewer or a regulatory agency demanded it, but they will find a way to compute it. The least we can do is make sure that they have the correctly computed value and not something that uses the wrong or numerically unstable formula.

FWIW, I am this user about to copy-pasta a VIF formula from the wilds of the internet (not because I particularly want to, or understand the subject greatly, but simply because it is being required by the course material I am engaged with) and I would greatly appreciate a vetted function conveniently located somewhere in the Julia/MLJ ecosystem instead.