FixedEffects / FixedEffectModels.jl

Fast Estimation of Linear Models with IV and High Dimensional Categorical Variables

Wrong r-squared when run without intercept or fixed effects

clintonTE opened this issue · comments

r2 is wrong when the model has no fixed effects or intercept.

using DataFrames, BenchmarkTools, Dates, Random, StatsBase, FixedEffectModels, LinearAlgebra, GLM
Random.seed!(11)
function mwefem(N=10^5)
  #simulate the data
  X = [rand(N,2) ones(N)]
  β = collect(1:3)
  ε = rand(N) .* 10
  y = X * β .+ ε

  d = hcat(DataFrame(X, :auto), DataFrame(y=y))
  b = svd(X)\y
  ϵ = y .- X * b
  actualr2 = 1 - sum(ϵ.^2)/sum((y .- mean(y)).^2)

  f = @formula(y ~ x1 + x2 + x3 + 0)
  fem = FixedEffectModels.reg(d, f)
  @info "FixedEffectModels r2: $(FixedEffectModels.r2(fem))"
  @info "Actual r2: $(actualr2)"
  @info "GLM r2: $(r2(lm(f, d)))"

end
@time mwefem()
[ Info: FixedEffectModels r2: 0.9155702184900961
[ Info: Actual r2: 0.04458471550606924
[ Info: GLM r2: 0.04458471550607035

With respect to the use case, when running a mixture of fixed effects specifications and specifications with an intercept only, I find it is often easiest to programmatically account for the intercept as its own column.
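One way to sketch this (the helper name is hypothetical; in the DataFrame workflow the equivalent is adding a column of ones, as the MWE above does with `ones(N)`, and suppressing the implicit intercept with `+ 0` in the formula):

```julia
# Hypothetical helper: append an explicit column of ones to a design matrix
# so the "intercept" is treated as a regular regressor across specifications.
with_intercept(X::AbstractMatrix) = hcat(X, ones(eltype(X), size(X, 1)))

X = [1.0 2.0; 3.0 4.0; 5.0 6.0]
Xi = with_intercept(X)   # 3×3 matrix whose last column is all ones
```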

In the absence of an intercept, FixedEffectModels.jl computes the r2 as 1 - sum(ϵ.^2)/sum(y.^2), which gives 0.9155.
This is similar to what Stata does.
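For reference, the two conventions can be compared directly on toy data (a sketch; the variable names are illustrative):

```julia
using LinearAlgebra, Statistics

# Least-squares fit of y on X with no intercept column.
X = [1.0 2.0; 2.0 1.0; 3.0 4.0; 4.0 3.0]
y = [1.0, 2.0, 3.0, 5.0]
b = X \ y
ϵ = y .- X * b

r2_centered   = 1 - sum(abs2, ϵ) / sum(abs2, y .- mean(y))  # GLM.jl convention
r2_uncentered = 1 - sum(abs2, ϵ) / sum(abs2, y)             # Stata-style, no intercept
```

Since sum(abs2, y) ≥ sum(abs2, y .- mean(y)), the uncentered R² is never smaller than the centered one, which is why the reported 0.9155 dwarfs the 0.0446.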

That seems more like an artifact of how Stata handles the command in the no-fixed-effects scenario than anything else, though I can't confirm without a license. Without fixed effects, the reported value is neither the explained variance nor the squared correlation of the predictions. That is, it will be right with fixed effects (because of the demeaning) but wrong without them. Why not give the correct answer in both cases?

I see what you mean, thanks for the links.

It's still disconcerting, though, to get a different answer with a manual intercept column. I'm not sure what the most intuitive way to handle it would be; maybe treat a column of 1s as a special case and/or emit a warning about the behaviour when there are no fixed effects.
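A check along those lines could be sketched as follows (purely illustrative, not part of the package):

```julia
# Sketch: return true (and warn) if the design matrix contains a constant
# nonzero column, i.e. a manually supplied intercept.
function warn_if_manual_intercept(X::AbstractMatrix)
    for j in 1:size(X, 2)
        col = view(X, :, j)
        if all(==(first(col)), col) && first(col) != 0
            @warn "Column $j is constant; without fixed effects the reported r2 uses the uncentered formula."
            return true
        end
    end
    return false
end
```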

True. Just checked and the same issue happens in Stata, so not sure what to do.

I'm starting to understand more now...
Possible choices:
a) Just document the behaviour. That would make clear what is going on.
b) Change the API (but not the behaviour). If r2() is the generic one from StatsModels, users expect it to behave like r2() in GLM or R's lm; signaling in the API that the Stata-derived r2() behaves quite differently would help. This is what caused my misunderstanding.
c) Change the behaviour of the software, i.e. use the R lm calculation for the no-intercept case.
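If option (c) meant always using the centered total sum of squares, matching GLM.jl and the manual-intercept run above, the change could be sketched as (names are illustrative):

```julia
using Statistics

# Sketch: always compute R² against the centered total sum of squares,
# regardless of whether the model includes an intercept.
r2_always_centered(y, ϵ) = 1 - sum(abs2, ϵ) / sum(abs2, y .- mean(y))
```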