JuliaStats / GLM.jl

Generalized linear models in Julia


Inconsistencies in Computing Standard Errors in a Weighted Model

abianco88 opened this issue

I observed what I believe is unexpected behavior when using GLM.jl to fit a weighted linear model (i.e., a "standard" weighted linear regression) and comparing the results with R (CRAN R v3.6.1).

  1. First, the standard errors produced by Julia do not match those produced by R. The standard errors calculated by R are correct in the sense that they match exactly those obtained from the textbook formulae (see the sketch after this list).
  2. Second, there are cases in which Julia fails to fit the model and throws an error, apparently when the weights are small in absolute value.
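
For concreteness, here is a minimal sketch of the textbook weighted least squares formulas I am referring to, assuming analytical (inverse-variance) weights; the variable names are mine:

# Textbook WLS: β = (X'WX)⁻¹ X'Wy, Var(β) = σ̂² (X'WX)⁻¹ with σ̂² = r'Wr / (n - p).
using CSV, DataFrames, LinearAlgebra

df = CSV.read("path/to/file/test.csv", copycols = true)
X  = [ones(nrow(df)) df.x]                     # design matrix: intercept + x
W  = Diagonal(df.w1)                           # analytical (inverse-variance) weights
β  = (X' * W * X) \ (X' * W * df.y)            # WLS point estimates
r  = df.y .- X * β                             # raw residuals
σ² = (r' * W * r) / (nrow(df) - size(X, 2))    # dispersion on n - p degrees of freedom
se = sqrt.(diag(σ² * inv(X' * W * X)))         # standard errors, as reported by R's lm()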

For reproducibility, I attached a mock data set (test.csv, zipped as test.zip) of simulated data containing 4 columns (y, x, w1 and w2) and 100 observations. The specifications tested are listed below, along with the Julia and R code used and the terminal output.

  • Case A: y ~ intercept + x, with weights w1
  • Case B: y ~ intercept + x, with weights w2

Julia code snippet:

#= ####       Julia v1.4.1        #### =#
using CSV, DataFrames, GLM

df = CSV.read("path/to/file/test.csv", copycols = true)

# Case A: runs, but returns standard errors that differ from R's
GLM.glm(@formula(y ~ x), df, Normal(), IdentityLink(), wts = df.w1)

# Case B: does not run and throws an error
GLM.glm(@formula(y ~ x), df, Normal(), IdentityLink(), wts = df.w2)

Julia output for case A:

StatsModels.TableRegressionModel{GeneralizedLinearModel{GLM.GlmResp{Array{Float64,1},Normal{Float64},IdentityLink},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

y ~ 1 + x

Coefficients:
──────────────────────────────────────────────────────────────────────────
             Estimate  Std. Error  z value  Pr(>|z|)  Lower 95%  Upper 95%
──────────────────────────────────────────────────────────────────────────
(Intercept)  0.391087   0.0844367  4.63172    <1e-5    0.225594   0.55658
x            0.170312   0.139899   1.21739    0.2235  -0.103885   0.444509
──────────────────────────────────────────────────────────────────────────

Julia output for case B (stacktrace truncated for readability):

Error showing value of type StatsModels.TableRegressionModel{GeneralizedLinearModel{GLM.GlmResp{Array{Float64,1},Normal{Float64},IdentityLink},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}:
ERROR: DomainError with -0.3638936290479994:
sqrt will only return a complex result if called with a complex argument. Try sqrt(Complex(x)).

R code snippet:

####       CRAN R v3.6.1        ####
df <- read.csv("path/to/file/test.csv")

# Case A
summary(lm(y ~ x, data = df, weights = w1))

# Case B
summary(lm(y ~ x, data = df, weights = w2))

R output for case A:

Call:
lm(formula = y ~ x, data = df, weights = w1)

Weighted Residuals:
     Min       1Q   Median       3Q      Max 
-0.46747 -0.14740  0.01759  0.15185  0.42419 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.39109    0.06094   6.418 4.93e-09 ***
x            0.17031    0.10096   1.687   0.0948 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2185 on 98 degrees of freedom
Multiple R-squared:  0.02822,	Adjusted R-squared:  0.0183 
F-statistic: 2.846 on 1 and 98 DF,  p-value: 0.0948

R output for case B:

Call:
lm(formula = y ~ x, data = df, weights = w2)

Weighted Residuals:
      Min        1Q    Median        3Q       Max 
-0.064187 -0.020239  0.002415  0.020851  0.058245 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.39109    0.06094   6.418 4.93e-09 ***
x            0.17031    0.10096   1.687   0.0948 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.03001 on 98 degrees of freedom
Multiple R-squared:  0.02822,	Adjusted R-squared:  0.0183 
F-statistic: 2.846 on 1 and 98 DF,  p-value: 0.0948

For case A, the standard errors reported by Julia clearly do not match those from R.
In case B, Julia throws an error that I believe is linked to the stderror function, while R works fine.

What is causing the difference in case A, and what is causing the aforementioned error in case B? I read in issue #327 a comment by @nalimilan from July 26, 2019 stating:

"weights will be interpreted as case/frequency weights, which differs e.g. from the R behavior"

This raises the question: how should I treat weights in Julia so that the results are comparable with those from R? Also, would that explain why GLM.jl sometimes has trouble calculating the variance of the estimates when weights are employed?
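
My working hypothesis (an assumption on my part, not verified in the GLM.jl source) is that under frequency weights the residual degrees of freedom are sum(wts) - p rather than n - p, which would make the two sets of standard errors differ by a fixed factor. A quick check against the outputs above:

# Hypothesis (assumption): frequency weights imply dof_residual = sum(wts) - p
# instead of n - p; if so, the two ratios below should agree.
# df as loaded in the Julia snippet above.
n, p = 100, 2                                        # observations, parameters
predicted_ratio = sqrt((n - p) / (sum(df.w1) - p))   # expected SE_Julia / SE_R
observed_ratio  = 0.0844367 / 0.06094                # ≈ 1.386, from the outputs above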

Below are my Julia specs.

julia> versioninfo()
Julia Version 1.4.1
Commit 381693d3df* (2020-04-14 17:20 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i5-6300U CPU @ 2.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, skylake)
(@v1.4) pkg> st
Status `...\.julia\environments\v1.4\Project.toml`
  [336ed68f] CSV v0.6.1
  [5d742f6a] CSVFiles v1.0.0
  [a93c6f00] DataFrames v0.20.2
  [38e38edf] GLM v1.3.9

test.zip

As I noted in the comment you quote, in GLM.jl weights are frequency (a.k.a. case) weights, while in R they are analytical weights, so it's expected that you get different results. I don't think there's currently a way to use analytical weights. But are you sure that's what you need? They generally arise from data in which each observation is the average of multiple observations.
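
For what it's worth, in the Gaussian/identity case there appears to be a simple workaround (a sketch, under the assumption that GLM.jl divides the weighted residual sum of squares by sum(wts) - p): rescale the weights so they sum to the number of observations. The overall scale of the weights cancels in the point estimates, while the residual degrees of freedom become n - p, matching R:

# Possible workaround (assumes dof_residual = sum(wts) - p under frequency
# weights): rescale the weights to sum to n. The scale of W cancels in
# (X'WX)⁻¹X'Wy, so coefficients are unchanged, while the dispersion becomes
# RSS_w / (n - p), as in R's lm(..., weights = ...).
# df as loaded in the Julia snippet above.
w1_scaled = df.w1 .* (length(df.w1) / sum(df.w1))
GLM.glm(@formula(y ~ x), df, Normal(), IdentityLink(), wts = w1_scaled)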

Regarding the second error: since frequency weights are supposed to be, well, frequencies, it doesn't make much sense to pass very small values like that. Some have even suggested throwing an error when non-integer values are passed, but I don't really like that idea, since accepting them allows getting point estimates for any kind of weights when you're not interested in the standard errors. In the longer term we should allow passing other kinds of AbstractWeights.
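
If the degrees-of-freedom reading above is right, that would also explain the case B failure: when the weights sum to less than the number of parameters, sum(wts) - p is negative, the estimated dispersion is negative with it, and the standard-error computation ends up taking the square root of a negative variance. A sketch of the suspected mechanism (again, the dof convention is an assumption, not verified in the source):

# Suspected failure mode (assumption: dof_residual = sum(wts) - p under
# frequency weights). With tiny weights, sum(wts) can fall below p, the
# dispersion estimate RSS_w / dof turns negative, and
# sqrt.(dispersion .* diag(inv(X'WX))) throws a DomainError like the one above.
# df as loaded in the Julia snippet above.
p   = 2                 # intercept + x
dof = sum(df.w2) - p    # negative whenever the weights sum to less than p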

I appreciate your answer, @nalimilan. Thank you for clarifying the nature of the weights accepted by the glm/fit functions.

I know the issue has been closed, but I want to humbly point out that analytical weights, also known as inverse-variance weights, are precisely what is needed to estimate a weighted least squares (weighted linear regression) model, which assumes the errors may be heteroskedastic but are uncorrelated (i.e., the conditional covariance matrix must be diagonal, but the entries on the diagonal may differ from each other). Indeed, for the linear estimator to be the best linear unbiased estimator (BLUE), the weight of each observation must be proportional to the reciprocal of the variance of the response for that observation, as formalized below. For this reason, I wonder whether analytical weights will ever be supported in GLM.jl, given that pull request #194 is still open and the 2017 decision to settle on frequency weights only appeared to be a temporary solution.
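
In symbols (a standard textbook statement, not specific to GLM.jl): for the heteroskedastic linear model with uncorrelated errors,

y = X\beta + \varepsilon, \qquad \operatorname{Var}(\varepsilon_i) = \sigma^2 / w_i, \qquad \operatorname{Cov}(\varepsilon_i, \varepsilon_j) = 0 \ (i \neq j),

the weighted least squares estimator with W = \operatorname{diag}(w_1, \ldots, w_n),

\hat\beta = (X^\top W X)^{-1} X^\top W y, \qquad \operatorname{Var}(\hat\beta) = \sigma^2 (X^\top W X)^{-1},

is BLUE by the Gauss-Markov theorem precisely because each w_i is proportional to 1/\operatorname{Var}(\varepsilon_i).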

Yes, analytical weights are very useful; somebody just needs to do the work to finish that pull request. (FWIW, the choice of frequency weights is much older than 2017.)