JuliaStats / GLM.jl

Generalized linear models in Julia

Add VIF?

azev77 opened this issue · comments

There was a request for VIF on Discourse.
It's pretty simple to add, using the fact that the VIFs are the diagonal elements of the inverse of the predictors' correlation matrix.
-Belsley, D. A., E. Kuh, and R. E. Welsch. Regression Diagnostics. Hoboken, NJ: John Wiley & Sons, 1980.

vif_GLM(mod) = diag(inv(cor(mod.model.pp.X[:,2:end])))

using DataFrames, GLM, Statistics, LinearAlgebra, RDatasets
airquality = rename(dataset("datasets", "airquality"), "Solar.R" => "Solar_R")
#
y  = airquality.Wind
X1 = airquality.Temp
X2 = airquality.Solar_R
X3 = airquality.Ozone
d = DataFrame(X1=X1, X2=X2, X3=X3, y=y)
d = d[completecases(d), :]                     # drop rows with missing values
X = [one.(d.X1) d.X1 d.X2 d.X3] .|> Float64    # model matrix with intercept column
y = d.y .|> Float64
#
β_LS = X \ y                                   # least-squares coefficients
vifm = diag(inv(cor(X[:, 2:end])))             # VIFs from the inverse correlation matrix
#
m_y  = lm(@formula(y ~ 1 + X1 + X2 + X3), d)
m_X1 = lm(@formula(X1 ~ 1 + X2 + X3), d)
m_X2 = lm(@formula(X2 ~ 1 + X1 + X3), d)
m_X3 = lm(@formula(X3 ~ 1 + X1 + X2), d)

vif_X1 = 1.0/(1.0-r2(m_X1))
vif_X2 = 1.0/(1.0-r2(m_X2))
vif_X3 = 1.0/(1.0-r2(m_X3))
#
vifm ≈ [vif_X1; vif_X2; vif_X3]
#
vif_GLM(m_y) ≈ vifm ≈ [vif_X1; vif_X2; vif_X3]

or

function vif_glm1(mod::StatsModels.TableRegressionModel)
    diag(inv(cor(mod.model.pp.X[:,2:end])))
end

Sure, feel free to make a pull request (with tests and documentation). Though just like #415 it would make sense to add the empty function definition to StatsBase and StatsModels too.

@azev77 are you (or anyone else) working on this?

If not, I'll give it a swing. I just don't want to duplicate effort.

Not working on it

After working on this, I learned that the solution mentioned at the beginning of the post is a little outdated and doesn't accurately handle categorical variables. (No criticism meant; I apologize if I came off harsh.) There are more modern methods for calculating VIFs.

See:
Fox, J. and Monette, G. (1992) Generalized collinearity diagnostics. JASA, 87, 178–183.
Fox, J. (2016) Applied Regression Analysis and Generalized Linear Models, Third Edition. Sage.
Fox, J. and Weisberg, S. (2018) An R Companion to Applied Regression, Third Edition, Sage.

R provides a vif function in the car package that is maintained by John Fox. They have done extensive work on VIF calculations. The vif function in R calculates Generalized Variance-Inflation Factors.

I'm not going to pretend to be more of an expert on the topic than them. So, I copied the code for the default vif function from R to julia.

@nalimilan, I have a couple of design questions.

  1. My current workings of a vif function only work on linear models. There is a vif function in R that works on Mixed Models. Should I add that code to MixedModels.jl, or include it here? (I assume it would be better put in MixedModels.jl.)

  2. I used PrettyTables.jl to print the results of the function output. I really don't want to figure out terminal printing without PrettyTables.jl. However, I am willing to try if it would be preferable for the overall health of the package. (I think PrettyTables is very nice though because of its ability to print out LaTeX tables.)

Below is a screenshot of a linear model that does have a categorical variable, and one that doesn't; and then how the vif function I copied works with each model.

Please let me know what you think. Thank you.

[Screenshot (Screen Shot 2022-12-21 at 8.45.54 AM): the two fitted models and the vif output for each]

Just to clarify:

There is more work to be done with VIFs when it comes to mixed models, weighted linear models, and models with interactions in them. I'll keep working on the code to ensure that these other cases are handled before opening a pull request.

Just to clarify the output:

The beginning part of the R docs about their vif function is:

"If all terms in an unweighted linear model have 1 df, then the usual variance-inflation factors are calculated.

If any terms in an unweighted linear model have more than 1 df, then generalized variance-inflation factors (Fox and Monette, 1992) are calculated. These are interpretable as the inflation in size of the confidence ellipse or ellipsoid for the coefficients of the term in comparison with what would be obtained for orthogonal data."
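For readers following the math, the Fox & Monette (1992) generalized VIF for a term is det(R₁₁)·det(R₂₂)/det(R), where R is the correlation matrix of all predictors, R₁₁ the block for the term's columns, and R₂₂ the block for the remaining columns. A minimal sketch (the name `gvif_sketch` and the `groups` argument are illustrative, not any package's actual API; `X` is assumed to be the model matrix without its intercept column, and each range in `groups` lists the columns one term occupies):

```julia
using LinearAlgebra, Statistics

# Sketch of the Fox & Monette (1992) GVIF. A categorical term with k levels
# occupies k-1 contiguous model-matrix columns, given as one range in `groups`.
function gvif_sketch(X::AbstractMatrix, groups::Vector{UnitRange{Int}})
    R = cor(X)            # correlation matrix of all (non-intercept) predictors
    detR = det(R)
    map(groups) do idx
        rest = setdiff(1:size(X, 2), idx)
        # determinant ratio: term block × remainder block ÷ full matrix
        det(R[idx, idx]) * det(R[rest, rest]) / detR
    end
end
```

When every term has 1 df, each GVIF reduces to the usual VIF, i.e. the corresponding diagonal element of `inv(cor(X))`, which matches the first sentence of the R docs quoted above.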

Be careful. I don’t think we can copy R code & keep the MIT license

Thanks for the heads up. How much different does it have to be to not count as copying? Lolz.

The R code merely implements mathematical equations for VIFS and GVIFS, and outputs it in R format.

Mine also just implements mathematical equations and outputs it in Julia.

If you know of resources on what makes something "copied" or not, please let me know. I can read further into it.

R's car library is distributed under the GPL-2 and GPL-3 licenses, which give permission to modify the code, and the MIT license doesn't seem to prohibit anything about how software is created. From my limited legal knowledge, for whatever that's worth, it looks like there is no violation of either license in what I did. If anyone else with more legal knowledge would like to chime in, feel free to do so.

Thanks for working on this @HiramTheHero. Unfortunately, @azev77 is right that we cannot include code derived from GPL code in GLM.jl: the MIT license is less restrictive than the GPL, so the whole package would have to become GPL. It's hard to define what counts as "derived" code, though. Unless the code is less than a handful of lines, to be on the safe side I would recommend writing to John Fox to ask him for permission to release your Julia code inspired by his R code under the MIT license (you can put me in CC as I know him).

  • My current workings of a vif function only work on linear models. There is a vif function in R that works on Mixed Models. Should I add that code to MixedModels.jl, or include it here? (I assume it would be better put in MixedModels.jl.)

Better add code for LMs and GLMs here, and code for mixed models in MixedModels.jl. Or if the code can be written only using APIs defined in StatsAPI.jl, maybe it could live in StatsModels.jl. Anyway we should add an empty definition of vif in StatsAPI.jl so that all packages that need it can override it.

  • I used PrettyTables.jl to print the results of the function output. I really don't want to figure out terminal printing without PrettyTables.jl. However, I am willing to try if it would be preferable for the overall health of the package. (I think PrettyTables is very nice though because of its ability to print out LaTeX tables.)

We probably don't want GLM.jl to depend on PrettyTables.jl just for that. You could take inspiration from the code used for FTestResult, you'll see it's relatively simple. But this is actually a tricky issue: ideally, there would be an easy way to extract the value for a given variable via indexing, e.g. vif(m)["X"] or vif(m)["X", "GVIF"]. This would be possible using NamedArrays.jl or AxisKeys.jl, but due to the existence of several competing packages for this we don't use any of them in JuliaStats packages. This is annoying.
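For reference, a PrettyTables-free layout in the spirit of FTestResult's plain printing might look something like the sketch below. `VIFTable` and its fields are hypothetical names, not an agreed API; this only shows that aligned terminal output needs nothing beyond `rpad` and `println`:

```julia
# Hypothetical result container; names are illustrative only.
struct VIFTable
    names::Vector{String}
    vifs::Vector{Float64}
end

function Base.show(io::IO, t::VIFTable)
    # pad the term column to the widest name (or the header, if wider)
    w = max(maximum(length, t.names), length("Term"))
    println(io, rpad("Term", w), "  VIF")
    for (n, v) in zip(t.names, t.vifs)
        println(io, rpad(n, w), "  ", round(v; digits=4))
    end
end
```

Indexed access like `vif(m)["X"]` would still need something on top of this, as noted above.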

Thank you @nalimilan for the response.

I've sent an email to John Fox with your GitHub email CC'd.

Better add code for LMs and GLMs here, and code for mixed models in MixedModels.jl. Or if the code can be written only using APIs defined in StatsAPI.jl, maybe it could live in StatsModels.jl. Anyway we should add an empty definition of vif in StatsAPI.jl so that all packages that need it can override it.

To calculate the VIF, I've used the det function from the LinearAlgebra package, and the cov2cor function in the StatsBase package. It looks like both of those packages are dependencies within StatsModels. Just let me know where you would prefer it to be placed. I'll make sure to add an empty vif function to StatsAPI.jl.
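For anyone following along, the `cov2cor` step mentioned here is the standard covariance-to-correlation conversion from StatsBase, and it agrees with computing the correlation matrix directly (a small sketch; `X` is any complete-case numeric predictor matrix):

```julia
using LinearAlgebra, Statistics, StatsBase

X = randn(100, 3)               # stand-in for a complete-case predictor matrix
C = cov(X)
R = cov2cor(C, sqrt.(diag(C)))  # rescale covariances by the standard deviations
# R ≈ cor(X): both routes give the same correlation matrix
```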

We probably don't want GLM.jl to depend on PrettyTables.jl just for that. You could take inspiration from the code used for FTestResult, you'll see it's relatively simple.

👍

But this is actually a tricky issue: ideally, there would be an easy way to extract the value for a given variable via indexing, e.g. vif(m)["X"] or vif(m)["X", "GVIF"]. This would be possible using NamedArrays.jl or AxisKeys.jl, but due to the existence of several competing packages for this we don't use any of them in JuliaStats packages. This is annoying.

I created two structs that hold the results of the vif function. So, their contents can be accessed. Not necessarily by a specific variable though.

struct SingleVifResult
    coefNames::Vector{String}
    vif::Vector{Float64}
end

struct VifResults
    coefNames::Vector{String}
    dof::Vector{Int}
    gvif::Vector{Float32}
    gvifModified::Vector{Float32}
end

Accessing Results:

julia> vif(MOD2).coefNames
4-element Vector{String}:
 "SepalWidth"
 "PetalLength"
 "PetalWidth"
 "Species"

julia> vif(MOD2).gvif
4-element Vector{Float32}:
  2.2274659
 23.161648
 21.0214
 40.039177

Just an update: we emailed John Fox and he was very supportive of us integrating his VIF methods into the Julia ecosystem.

It sounds like someone else has already built a VIF functionality for MixedModels, and that creator is wanting to extend their method to GLM. (JuliaStats/MixedModels.jl#661 (comment)) To avoid duplication of effort, I won't be putting more effort into this until a decision is made by the package maintainers on what direction should be taken.

The implementation in MixedModelsExtras will actually already handle (some) models fit with GLM.jl:

https://github.com/palday/MixedModelsExtras.jl/blob/d92910c011ca4c920a4e5e7dfc84b99d87f12754/test/vif.jl#L45-L61

@testset "GVIF and RegressionModel" begin
    duncan = rdataset("car", "Duncan")


    lm1 = lm(@formula(Prestige ~ 1 + Income + Education), duncan)
    @test termnames(lm1) == coefnames(lm1)
    vif_lm1 = vif(lm1)


    # values here taken from car
    @test isapprox(vif_lm1, [2.1049, 2.1049]; atol=1e-5)
    @test isapprox(vif_lm1, gvif(lm1))


    lm2 = lm(@formula(Prestige ~ 1 + Income + Education + Type), duncan)
    @test termnames(lm2) == ["(Intercept)", "Income", "Education", "Type"]
    @test isapprox(gvif(lm2), [2.209178, 5.297584, 5.098592]; atol=1e-5)
    @test isapprox(gvif(lm2; scale=true),
                   [1.486330, 2.301648, 1.502666]; atol=1e-5)
end

Might be worth moving the implementation to a more upstream package for better visibility. It didn't look like the implementations were using anything MixedModel specific (but I might have missed it)

What is the status of this issue. Is there any help needed to get it moving?

@bkamins The code is written; there is just a little bit of discussion about what the API should look like and where various parts should live: JuliaStats/MixedModels.jl#661 (comment)

Currently we have up to three functions:

  • termnames -- owned and implemented in StatsModels?
  • vif -- owned by StatsAPI and potentially implemented there? (The naive implementation for RegressionModel is very straightforward.)
  • gvif -- owned by StatsAPI or maybe StatsModels? Should this be a separate function or a kwarg to vif?

cc @kleinschmidt if we put a termnames in StatsModels.

My opinion is:

  • it is OK for vif and gvif to be separate functions (especially since they already are in MixedModelsExtras.jl)
  • where is the proposal of the code? (I quickly checked https://github.com/palday/MixedModelsExtras.jl/blob/main/src/vif.jl and was not sure that all corner cases, especially with respect to how intercept term could be passed to the model are covered)
  • regarding the package choice where to put them; vif and gvif can be owned by StatsAPI.jl; they could be implemented in StatsModels.jl or StatsAPI.jl (wherever is more convenient; in practice if these functions are used both packages will be loaded anyway I think)

Any updates on this feature being implemented?

if @kleinschmidt is happy for StatsModels to own termnames, then I'll open a PR there (against both 0.6 and 0.7 series), get that merged, then I'll move vif over to StatsAPI.

@HiramTheHero in the meantime, you should "just" be able to do

using MixedModelsExtras
vif(my_glm_model)

and things will probably work.

Just a side note for anyone following along with this thread. The following code works if your model has categorical variables and you're wanting scaled GVIFs (similar output to vif in R's car package).

using MixedModelsExtras
gvif(my_glm_model, scale = true) # This produces GVIF^(1/(2*Df))
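The `scale=true` keyword corresponds to car's GVIF^(1/(2·Df)) column. As a formula, the scaling is a one-liner (the name `scaled_gvif` is illustrative; the numbers in the usage check are the Duncan-data values quoted from the MixedModelsExtras tests above, where the 3-level `Type` term has 2 df):

```julia
# Put GVIFs for terms with different df on a comparable scale, as in
# R's car package: GVIF^(1/(2*Df)).
scaled_gvif(g, df) = g .^ (1 ./ (2 .* df))

scaled_gvif([2.209178, 5.297584, 5.098592], [1, 1, 2])
# ≈ [1.486330, 2.301648, 1.502666], matching gvif(lm2; scale=true) above
```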

if @kleinschmidt is happy for StatsModels to own termnames, then I'll open a PR there (against both 0.6 and 0.7 series), get that merged, then I'll move vif over to StatsAPI.

@kleinschmidt Is that OK for you?

I think I mentioned elsewhere that I'm very much against VIF being added to any packages, as it's a statistical footgun in the waiting. I'll explain why here.

The VIF measures the degree to which two variables are confounded with each other. In other words, if the VIF between A and B is high, the variables are strongly correlated with each other, and either one could be the true explanation of the outcome variable. We cannot know, without additional data, whether A, B, or both are the real cause of the outcome variable Y.

The problem with this measure is that early in the history of econometrics, it was misinterpreted by the person deriving it as saying that, if we remove B from our regression, we will be able to measure the effect of A more accurately. This is not the case--pretending that a confounder doesn't exist will make your regression worse, not better. Unfortunately, it is common to remove variables with a high VIF from a regression, because economists (and scientists in general) are often trained to think of their p-values and estimated standard errors as indicating their study's quality.

Removing variables with a high VIF invalidates any calculated statistics or p-values, as you are now applying a post-hoc procedure, and (in the case of observational data) routinely breaks causal inference, by removing variables that need to be included to avoid confounding. This procedure introduces a garden of forking paths, where the analysis researchers actually perform is strongly dependent on the results of their experiment, introducing substantial model-based variability that is not captured in the p-values or standard errors reported by GLM.jl.

We are not responsible for how users use or misuse a particular statistical quantity -- otherwise, I certainly wouldn't want to be responsible for any code that outputs a p-value! There are valid uses of VIF and not all statistical education follows the pattern in econometrics. The typical applications I've seen in my domains is an indicator of whether or not individual coefficient standard errors are conflated or whether there are potential numerical issues. I had not seen the advice to remove terms with a high VIF, but rather discussions on orthogonalization, residualization, etc.

The typical applications I've seen in my domains is an indicator of whether or not individual coefficient standard errors are conflated or whether there are potential numerical issues. I had not seen the advice to remove terms with a high VIF, but rather discussions on orthogonalization, residualization, etc.

In that case, if the goal is to use it as a diagnostic, I think it makes sense to have the VIF as an internal function. However, I would avoid exporting it or including it in StatsAPI.jl, to avoid encouraging common misunderstandings that users must "check" for collinearity, and GLM.jl should automatically center/orthogonalize variables when necessary (rather than pushing this problem onto the user).

We are not responsible for how users use or misuse a particular statistical quantity [...] There are valid uses of VIF and not all statistical education follows the pattern in econometrics.

I understand that in some cases, statistics can be used incorrectly or misinterpreted. My main concern is that from the statistical (not numerical) perspective, the VIF has very few valid uses (unlike the p-value), but many common invalid ones.

I don't think we can bear all the responsibility for users' abuse of software, but we should still make an effort to make the software easy to use correctly. If we created a function called optimize_errors that did regression by maximizing (rather than minimizing) the sum of squared errors, that would be a very bad idea, because it's far more likely to be used by mistake than intentionally. This is true even if the function is correctly documented, so that using the values is technically "the user's fault" for misusing the statistic. It's like the old Far Side cartoon--if you put that button there, sooner or later, someone's going to press it ;)

As @palday noted there are valid use cases for the VIF and people expect to be able to compute it. Automatically orthogonalizing variables wouldn't replace this.

(FWIW, in my field the BIC is also often abused to claim that a variable has no effect, yet I wouldn't suggest dropping it as it's a classic indicator.)

(FWIW, in my field the BIC is also often abused to claim that a variable has no effect, yet I wouldn't suggest dropping it as it's a classic indicator.)

I think that's a mostly correct interpretation (it's correct if you assume that the coefficient might be 0). This is roughly my point: there are correct and incorrect uses for BIC, or p-values, or most other measures, but most uses are at least 50% correct. People don't understand lots of statistics, but they at least know basics like:

  1. A lower p-value means less uncertainty than a higher one
  2. A lower BIC is evidence you should include a variable in your model
  3. A higher R^2 indicates a better model

i.e. even if they're misinterpreting them, they at least know whether the variable is good or bad, so it's not doing more harm than just not doing statistics at all.

For a better analogy, imagine people routinely dropped variables that improved the BIC, i.e. they tried to select the model that assigned the lowest probability to the data.

A higher R^2 indicates a better model

No, even assuming that there is a clear definition of R^2 for a given model (which there isn't for mixed models).

https://www.johnmyleswhite.com/notebook/2016/07/23/why-im-not-a-fan-of-r-squared/

A higher R^2 indicates a better model

No, even assuming that there is a clear definition of R^2 for a given model (which there isn't for mixed models).

https://www.johnmyleswhite.com/notebook/2016/07/23/why-im-not-a-fan-of-r-squared/

I'm not a huge fan of R^2 either, and it's definitely easy to misinterpret. But at least this interpretation is somewhat accurate in some common situations. "High R^2 is better" is dramatically oversimplified and ignores a lot of other factors that can matter more to a model; unfortunately, this misunderstanding causes people to make mistakes in how they interpret it. But it still has a kernel of truth at its core: lower variance means better predictions. Sometimes, people using R^2 are using it correctly (e.g. to compare linear regression models; as the blog post notes, the true model will always have a better R^2 than an incorrect model, given a big enough sample size). I don't think I've ever seen a correct use of VIF in a paper.

As another example, imagine replacing all our p values with a brand-new ρ-value, defined as 1-p. Just because our users might make mistakes sometimes, doesn't mean we should make it easier for them to misunderstand--our goal should be to avoid footguns rather than encourage them.

But it still has a kernel of truth at its core: lower variance means better predictions.

Again, no! For one thing, R^2 captures a particular measure of variance (relative to a saturated model), not the variance of the predictions. The Stein paradox and the bias-variance tradeoff speak to there being a tradeoff to be made, but making that tradeoff should be informed by the broader context of the analysis.

VIF is not a quantity constructed explicitly to be misleading like your alternative rho-value or error-maximizing approach, but rather one that has valid applications and is included in software implemented by prominent applied statisticians (e.g. John Fox in R's car and the Applied Regression Analysis book that car is the companion to). I am interested in having a canonical implementation so that users who know what they're doing statistically but maybe don't understand numerical stability can use it. That's why I wrote the implementation in MixedModelsExtras.

If you want to view it from a responsibility to users perspective, then users will compute this value somehow. Whether it's using RCall or some problematic code they copied and pasted from somewhere, they will find a way to get it computed. Maybe their boss or a reviewer or a regulatory agency demanded it, but they will find a way to compute it. The least we can do is make sure that they have the correctly computed value and not something that uses the wrong or numerically unstable formula. That's how we as software authors can support our users doing valid inference. It's up to them to use that value correctly. For that part, we can be educators (and I do also teach), but that is a different role than software author.

Finally, I want to note that the proposal is for defining a function to compute this value and not for showing it in default outputs. I would also be opposed to including VIF in the default show method for models.

Again, no! For one thing, R2 captures a particular measure of variance (relative to a saturated model) and not the variance of the predictions. The Stein paradox and the bias-variance tradeoff speak to there being a tradeoff that needs to be made, but making that tradeoff should be informed by the broader context of the analysis.

Again, I'm talking in the context of comparing simple linear regression models on the same dataset, in which case the scaling isn't important and there is no bias-variance tradeoff (as OLS is unbiased). But regardless, my point is it's common to misunderstand R^2 partially, but complete reversals are rare. Few people try to choose models that actively minimize R^2.

VIF is not a quantity constructed explicitly to be misleading

Right, but my point is it's just as misleading as those examples. The intentions of the inventor don't matter; what matters is whether the function is misleading. If a function does something very different from what users expect it to do, there's a problem with the function.

Whether it's using RCall or some problematic code they copied and pasted from somewhere, they will find a way to get it computed. Maybe their boss or a reviewer or a regulatory agency demanded it, but they will find a way to compute it. The least we can do is make sure that they have the correctly computed value and not something that uses the wrong or numerically unstable formula.

FWIW, I am this user about to copy-pasta a VIF formula from the wilds of the internet (not because I particularly want to, or understand the subject greatly, but simply because it is being required by the course material I am engaged with) and I would greatly appreciate a vetted function conveniently located somewhere in the Julia/MLJ ecosystem instead.