FixedEffects / FixedEffectModels.jl

Fast Estimation of Linear Models with IV and High Dimensional Categorical Variables

Wrong r-squared when run without intercept or fixed effects

clintonTE opened this issue · comments

r2 is wrong when the model has no fixed effects or intercept.

using DataFrames, BenchmarkTools, Dates, Random, StatsBase, FixedEffectModels, LinearAlgebra, GLM
Random.seed!(11)
function mwefem(N=10^5)
  #simulate the data
  X = [rand(N,2) ones(N)]
  β = collect(1:3)
  ε = rand(N) .* 10
  y = X * β .+ ε

  d = hcat(DataFrame(X, :auto), DataFrame(y=y))
  b = svd(X)\y
  ϵ = y .- X * b
  actualr2 = 1 - sum(ϵ.^2)/sum((y .- mean(y)).^2)

  f = @formula(y ~ x1 + x2 + x3 + 0)
  fem = FixedEffectModels.reg(d, f)
  @info "FixedEffectModels r2: $(FixedEffectModels.r2(fem))"
  @info "Actual r2: $(actualr2)"
  @info "GLM r2: $(r2(lm(f, d)))"

end
@time mwefem()
[ Info: FixedEffectModels r2: 0.9155702184900961
[ Info: Actual r2: 0.04458471550606924
[ Info: GLM r2: 0.04458471550607035

With respect to the use case, when running a mixture of fixed effects specifications and specifications with an intercept only, I find it is often easiest to programmatically account for the intercept as its own column.
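One way to sketch this (the helper name is hypothetical; in the DataFrame workflow the equivalent is adding a column of ones, as the MWE above does with `ones(N)`, and suppressing the implicit intercept with `+ 0` in the formula):

```julia
# Hypothetical helper: append an explicit column of ones to a design matrix
# so the "intercept" is treated as a regular regressor across specifications.
with_intercept(X::AbstractMatrix) = hcat(X, ones(eltype(X), size(X, 1)))

X = [1.0 2.0; 3.0 4.0; 5.0 6.0]
Xi = with_intercept(X)   # 3×3 matrix whose last column is all ones
```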

In the absence of an intercept, FixedEffectModels.jl computes the r2 as 1 - sum(ϵ.^2)/sum(y.^2), which gives 0.9155.
This is similar to what Stata does.
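For reference, the two conventions can be compared directly on toy data (a sketch; the variable names are illustrative):

```julia
using LinearAlgebra, Statistics

# Least-squares fit of y on X with no intercept column.
X = [1.0 2.0; 2.0 1.0; 3.0 4.0; 4.0 3.0]
y = [1.0, 2.0, 3.0, 5.0]
b = X \ y
ϵ = y .- X * b

r2_centered   = 1 - sum(abs2, ϵ) / sum(abs2, y .- mean(y))  # GLM.jl convention
r2_uncentered = 1 - sum(abs2, ϵ) / sum(abs2, y)             # Stata-style, no intercept
```

Since sum(abs2, y) ≥ sum(abs2, y .- mean(y)), the uncentered R² is never smaller than the centered one, which is why the reported 0.9155 dwarfs the 0.0446.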

That seems more like an artifact of how Stata handles the command in the no-fixed-effects scenario than anything else, though I can't confirm without a license. Without fixed effects, the reported value is neither the explained variance nor the squared correlation of the predictions. That is, it will be right with fixed effects (because of the demeaning) but wrong without them. Why not give the correct answer in both cases?

I see what you mean, thanks for the links.

It's still disconcerting, though, to get a different answer with a manual intercept column. I'm not sure what the most intuitive way to handle it would be; maybe treat a column of 1s as a special case and/or emit a warning about the behaviour when there are no fixed effects.
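A check along those lines could be sketched as follows (purely illustrative, not part of the package):

```julia
# Sketch: return true (and warn) if the design matrix contains a constant
# nonzero column, i.e. a manually supplied intercept.
function warn_if_manual_intercept(X::AbstractMatrix)
    for j in 1:size(X, 2)
        col = view(X, :, j)
        if all(==(first(col)), col) && first(col) != 0
            @warn "Column $j is constant; without fixed effects the reported r2 uses the uncentered formula."
            return true
        end
    end
    return false
end
```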

True. Just checked and the same issue happens in Stata, so not sure what to do.

I'm starting to understand more now...
Possible choices:
a) Just document the behaviour. That would make clear what is going on.
b) Change the API (but not the behaviour). If r2() is the generic one from StatsModels, users expect it to behave like r2() in GLM or R's lm; signaling in the API that the Stata-derived r2() behaves quite differently would help. This is what caused my misunderstanding.
c) Change the behaviour of the software, i.e. use the R lm calculation for the no-intercept case.
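If option (c) meant always using the centered total sum of squares, matching GLM.jl and the manual-intercept run above, the change could be sketched as (names are illustrative):

```julia
using Statistics

# Sketch: always compute R² against the centered total sum of squares,
# regardless of whether the model includes an intercept.
r2_always_centered(y, ϵ) = 1 - sum(abs2, ϵ) / sum(abs2, y .- mean(y))
```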