FixedEffects / FixedEffectModels.jl

Fast Estimation of Linear Models with IV and High Dimensional Categorical Variables

Bug for residuals

alfaromartino opened this issue

It seems that if you don't specify the intercept as a fixed effect, residuals can't be retrieved. The following is an MWE:

using DataFrames, FixedEffectModels

df = DataFrame(a = rand(100), b = rand(100), c = rand(100), d = ones(100))

ols = FixedEffectModels.reg(df, @formula(a ~ b + c), save = true)
residuals(ols) # this gives error

ols = FixedEffectModels.reg(df, @formula(a ~ b + c + fe(d)), save = true)
residuals(ols) #this works fine

Maybe it's worth noting this in the documentation rather than fixing it? After all, it's a package for the inclusion of fixed effects. But sometimes you want to compare results with a simple OLS.

I also just noticed that adding an intercept in this way crashes predict when there are missing values (it works fine when there are none). The MWE is:

df = DataFrame(a = rand(100), b = rand(100), c = rand(100), d = ones(100))
allowmissing!(df)
df.b[[30,40]] .= missing

ols = FixedEffectModels.reg(df, @formula(a ~ b + c + fe(d)), save = true)
residuals(ols) #this works fine
predict(ols,df) #this doesn't work

ols = FixedEffectModels.reg(df, @formula(a ~ b + c), save = true)
residuals(ols) #this gives error 
predict(ols,df) #this works fine

I have just fixed the first issue. @nilshg: could you have a look at the second one?

So what happens here is that, with missing values, we get two predicted fixed effects for the one group included in the model:

julia> unique(ols.fe)
2×2 DataFrame
 Row │ d         fe_d
     │ Float64?  Float64?
─────┼──────────────────────────
   1 │      1.0        0.608499
   2 │      1.0  missing

This means that when the fixed effects are left-joined onto the original data so they can be added to the predicted response, the data set is duplicated: every row with a fixed effect value of 1 turns into one row with a fixed effect of 0.6 and one with a missing fixed effect. We then get an error because the nonmissings vector used to pick the rows without missing data is only half as long as the data set we're indexing into after the duplication.

I'm a little surprised that we end up with a missing fixed effect here, as I would have expected rows with missing observations to get dropped. The immediate issue would be solved by replacing

fes = leftjoin(select(df, m.fekeys), unique(m.fe); on = m.fekeys, makeunique = true, matchmissing = :equal)

with

fes = leftjoin(select(df, m.fekeys), dropmissing(unique(m.fe)); on = m.fekeys, makeunique = true, matchmissing = :equal)

but it feels like a fix someplace before we reach predict might be more appropriate?
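
For reference, the duplication and the effect of dropmissing can be reproduced with DataFrames alone. This is only a toy sketch: the names df, fe, dup and ok and the values in fe are made up for illustration, and fe merely stands in for unique(m.fe) rather than coming from an actual fitted model.

```julia
using DataFrames

# Toy stand-in for the original data: three rows, all in the single group d = 1.0.
df = DataFrame(d = [1.0, 1.0, 1.0])

# Toy stand-in for unique(m.fe): the same key appears twice, once with an
# estimated effect and once with a missing one (values are illustrative).
fe = DataFrame(d = [1.0, 1.0], fe_d = [0.6, missing])

# Joining on d matches every row of df against both rows of fe,
# so the result has twice as many rows as df.
dup = leftjoin(df, fe; on = :d, matchmissing = :equal)
nrow(dup)   # 6

# Dropping the row with the missing effect leaves one row per key,
# and the join keeps the original number of rows.
ok = leftjoin(df, dropmissing(fe); on = :d, matchmissing = :equal)
nrow(ok)    # 3
```

Note that dropmissing without a column argument removes a row if any of its columns is missing, including a key column, which may or may not be what predict needs.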