JuliaStats / GLM.jl

Generalized linear models in Julia

How do I create a weighted model?

cmcaine opened this issue

The documentation suggests that this might work:

using GLM, DataFrames
df = DataFrame(x = rand(100), y = rand(100), w = repeat([1,1,1,100], 25));
fit(LinearModel, @formula(y ~ x), df, wts = df.w)

But that throws a MethodError:

ERROR: MethodError: no method matching fit(::Type{LinearModel}, ::Array{Float64,2}, ::Array{Float64,1}; wts=[1, 1, 1, 100, 1, 1, 1, 100, 1, 1  …  1, 100, 1, 1, 1, 100, 1, 1, 1, 100])
Closest candidates are:
  fit(::Type{LinearModel}, ::AbstractArray{T,2} where T, ::AbstractArray{T,1} where T) at /home/colin/.julia/dev/GLM/src/lm.jl:136 got unsupported keyword argument "wts"
  fit(::Type{LinearModel}, ::AbstractArray{T,2} where T, ::AbstractArray{T,1} where T, ::Bool) at /home/colin/.julia/dev/GLM/src/lm.jl:136 got unsupported keyword argument "wts"
  fit(::Type{StatsBase.Histogram}, ::Any...; kwargs...) at /home/colin/.julia/packages/StatsBase/rTkaz/src/hist.jl:319
  ...
Stacktrace:
 [1] #fit#57(::Dict{Symbol,Any}, ::Base.Iterators.Pairs{Symbol,Array{Int64,1},Tuple{Symbol},NamedTuple{(:wts,),Tuple{Array{Int64,1}}}}, ::Function, ::Type{LinearModel}, ::FormulaTerm{Term,Term}, ::DataFrame) at /home/colin/.julia/packages/StatsModels/G9zlM/src/statsmodel.jl:88
 [2] (::getfield(StatsBase, Symbol("#kw##fit")))(::NamedTuple{(:wts,),Tuple{Array{Int64,1}}}, ::typeof(fit), ::Type{LinearModel}, ::FormulaTerm{Term,Term}, ::DataFrame) at ./none:0
 [3] top-level scope at none:0

This looks like it might work, but this can't be the intended API:

using GLM, DataFrames
df = DataFrame(x1 = rand(100), x2 = rand(100), y = rand(100), w = repeat([1,1,1,100], 25));

m = convert(Matrix, df)
X = m[:, 1:2] # Type signature of cholpred can't handle X being a vector, probably a bug?
y = m[:, 3]
w = m[:, 4]

lr = GLM.LmResp{typeof(y)}(
    fill!(similar(y), 0),
    similar(y, 0),
    w,
    y)

fit!(LinearModel(lr, GLM.cholpred(X, false)))

Where do the docs suggest this? We should fix that until we support it.

The docstring on fit says that it accepts weights as the wts argument: https://juliastats.github.io/GLM.jl/stable/api/#StatsBase.fit

A closer reading shows that the model type has to be GeneralizedLinearModel, and LinearModel is not a subtype of it. That makes sense, since GeneralizedLinearModel sounds like (and is) a concrete type, but I didn't really read that bit, and it tripped me up all the same.
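
For anyone who wants to see the type relationship directly, here is a minimal sketch. It assumes a GLM.jl version in which both model types descend from the unexported abstract type GLM.LinPredModel; that name is an internal detail and may change between releases.

```julia
using GLM

# LinearModel and GeneralizedLinearModel are siblings, not parent and child,
# so a method documented for GeneralizedLinearModel does not apply to LinearModel.
lm_is_glm = LinearModel <: GeneralizedLinearModel
println(lm_is_glm)                                   # false

# Assumption: both share the internal abstract supertype GLM.LinPredModel.
println(LinearModel <: GLM.LinPredModel)
println(GeneralizedLinearModel <: GLM.LinPredModel)
```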

I don't know how to foolproof the documentation. I'll try to have a think about it. For now this issue might help people find the answer.

For anyone wanting to do this, a simple way is to just construct a GLM that is equivalent to a linear model:

glm(@formula(y ~ x), data, Normal(), IdentityLink(), wts = data.w)

I haven't yet confirmed that this gives me the weighted least squares fit that I wanted (I don't know enough about GLMs, and it has been way too hot ;)), but that's my problem.
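
One way to check the coefficients is to compare them with the closed-form weighted least squares solution of (X'WX) β = X'Wy. The following is a rough sketch, not a confirmed recipe: the seed, the variable names X, W and beta_wls, and the tolerance are all illustrative choices, and the weights are written as Float64 on the assumption that some GLM.jl versions require wts to have the same element type as y.

```julia
using GLM, DataFrames, LinearAlgebra, Random

Random.seed!(1)  # arbitrary seed so the sketch is reproducible
data = DataFrame(x = rand(100), y = rand(100),
                 w = repeat([1.0, 1.0, 1.0, 100.0], 25))

# Workaround from this thread: a Normal GLM with an identity link.
m = glm(@formula(y ~ x), data, Normal(), IdentityLink(), wts = data.w)

# Closed-form weighted least squares: solve (X'WX) β = X'Wy.
X = hcat(ones(nrow(data)), data.x)   # intercept column plus x
W = Diagonal(data.w)
beta_wls = (X' * W * X) \ (X' * W * data.y)

# The two coefficient vectors should agree up to floating-point error.
agree = isapprox(coef(m), beta_wls; rtol = 1e-6)
println(agree)
```

This only compares the point estimates. Standard errors and other inferential quantities depend on how the weights are interpreted, so agreement here says nothing about those.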

A separate docstring for fit on LinearModel should probably be added.

Be very careful with this: weights will be interpreted as case/frequency weights, which differs from the behavior of, e.g., R.
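
To illustrate what the frequency interpretation means in practice: a weight of k on a row is treated as if that row had been observed k times. The sketch below compares a weighted fit with an unweighted fit on a table where each row is physically repeated. It assumes GLM.jl 1.x behavior; the data, seed and names (small, expanded, idx) are made up for illustration. If the weights really are treated as frequency weights, both comparisons should come out true.

```julia
using GLM, DataFrames, Random

Random.seed!(2)  # arbitrary seed for reproducibility
small = DataFrame(x = rand(30), y = rand(30), w = repeat([1, 2, 3], 10))

# Weighted fit on the 30 original rows (weights converted to Float64).
weighted = glm(@formula(y ~ x), small, Normal(), IdentityLink(),
               wts = Float64.(small.w))

# Unweighted fit on a table in which row i appears w[i] times (60 rows).
idx = reduce(vcat, [fill(i, k) for (i, k) in enumerate(small.w)])
expanded = small[idx, :]
unweighted = glm(@formula(y ~ x), expanded, Normal(), IdentityLink())

# Coefficients match under any weight interpretation; matching standard
# errors are what the frequency interpretation specifically implies.
same_coef = isapprox(coef(weighted), coef(unweighted); rtol = 1e-6)
same_se = isapprox(stderror(weighted), stderror(unweighted); rtol = 1e-6)
println((same_coef, same_se))
```

In R, by contrast, lm bases the residual degrees of freedom on the number of rows rather than on the sum of the weights, so the same weights can give the same coefficients but different standard errors.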