Linear models failing to display with ambiguous datatypes with unhelpful error messages

jakewilliami opened this issue

The linear model fails somewhere in show (I guess) if you give it a DataFrame with a column of eltype Any. MWE:

julia> df_test = DataFrame(date_numeric = [737810, 737841, 737869, 737900, 737930, 737961, 737991, 738022, 738053, 738083, 738114], num = Any[-74587.93, -74550.49, -74482.09, -74441.45, -74316.17, -74252.81, -73976.21, -73587.65, -73170.53, -72753.41, -72304.01])
11×2 DataFrame
 Row │ date_numeric  num
     │ Int64         Any
   1 │       737810  -74587.9
   2 │       737841  -74550.5
   3 │       737869  -74482.1
   4 │       737900  -74441.4
   5 │       737930  -74316.2
   6 │       737961  -74252.8
   7 │       737991  -73976.2
   8 │       738022  -73587.6
   9 │       738053  -73170.5
  10 │       738083  -72753.4
  11 │       738114  -72304.0

julia> model = fit(LinearModel, @formula(date_numeric ~ num), df_test)
ERROR: ArgumentError: FDist: the condition ν1 > zero(ν1) && ν2 > zero(ν2) is not satisfied.
  # ...

julia> df_test = DataFrame(date_numeric = [737810, 737841, 737869, 737900, 737930, 737961, 737991, 738022, 738053, 738083, 738114], num = Float64[-74587.93, -74550.49, -74482.09, -74441.45, -74316.17, -74252.81, -73976.21, -73587.65, -73170.53, -72753.41, -72304.01]);

julia> model = fit(LinearModel, @formula(date_numeric ~ num), df_test)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}}}}, Matrix{Float64}}

date_numeric ~ 1 + num

                      Coef.    Std. Error       t  Pr(>|t|)  Lower 95%      Upper 95%
(Intercept)   746741.0       1103.35       676.79    <1e-21  7.44245e5  749237.0
num                0.118876     0.0149383    7.96    <1e-04  0.0850827       0.152668

It took me quite a while to figure out what exactly was happening here. I realised something weird was going on when running model = fit(LinearModel, @formula(date_numeric ~ num), df_test); in the REPL worked, but the same call failed when I didn't suppress the display (i.e., when I removed the semicolon).

The problem is that variables with eltype Any are treated as categorical. You get the same error by wrapping num in a CategoricalArray. With 11 distinct values in 11 rows, the categorical coding uses up every degree of freedom, so the underlying problem is that coeftable fails for a model with zero residual degrees of freedom. See #458.
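
To make the degrees-of-freedom arithmetic concrete, here is a small sketch in plain Julia. It assumes the categorical column is dummy-coded with one reference level plus an intercept, which is my reading of the comment above rather than something checked against the GLM or StatsModels source. The variable names are just for illustration.

```julia
# Values from the MWE above, stored with eltype Any as in the failing case.
num = Any[-74587.93, -74550.49, -74482.09, -74441.45, -74316.17, -74252.81,
          -73976.21, -73587.65, -73170.53, -72753.41, -72304.01]

n = length(num)                  # number of observations
nlevels = length(unique(num))    # levels if the column is read as categorical

# Assumption: dummy coding with one reference level, plus an intercept.
ncoef_categorical = 1 + (nlevels - 1)
dof_categorical = n - ncoef_categorical   # residual degrees of freedom

# Read as a continuous predictor instead: intercept plus one slope.
num_float = Float64.(num)        # narrows the eltype from Any to Float64
dof_continuous = n - 2

println((eltype(num), dof_categorical))        # (Any, 0)
println((eltype(num_float), dof_continuous))   # (Float64, 9)
```

With zero residual degrees of freedom the second parameter of the F distribution is zero, which matches the ν2 > zero(ν2) condition in the error message.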

I agree with the OP that the error message could be improved here. Here is another case, which I just found when reading data from an Excel sheet with XLSX.readtable (it parses everything as Any by default).

julia> d = DataFrame(Y = Any[1,2,3], X = Any[3,2,1])
3×2 DataFrame
 Row │ Y    X   
     │ Any  Any 
   1 │ 1    3
   2 │ 2    2
   3 │ 3    1

julia> lm(@formula(Y ~ X), d)
ERROR: MethodError: no method matching fit(::Type{LinearModel}, ::Matrix{Float64}, ::Matrix{Float64}, ::Nothing)
Closest candidates are:
  fit(::StatisticalModel, ::Any...) at /Users/74097/.julia/packages/StatsBase/IPydo/src/statmodels.jl:178
  fit(::Type{StatsBase.Histogram}, ::Any...; kwargs...) at /Users/74097/.julia/packages/StatsBase/IPydo/src/hist.jl:383
  fit(::Type{LinearModel}, ::AbstractMatrix{var"#s49"} where var"#s49"<:Real, ::AbstractVector{var"#s50"} where var"#s50"<:Real, ::Union{Nothing, Bool}; wts, dropcollinear) at /Users/74097/.julia/packages/GLM/hDWc9/src/lm.jl:156

Can we not check the element type of each column and show a warning if we find Any?
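
In the meantime, a user-side check along these lines might look like the sketch below. Both any_columns and narrow_eltypes are hypothetical helper names, not part of GLM, StatsModels, or DataFrames; the sketch only assumes DataFrames is installed.

```julia
using DataFrames

# Hypothetical helper: return the names of columns whose eltype is exactly Any,
# emitting one warning per flagged column.
function any_columns(df::AbstractDataFrame)
    flagged = String[]
    for name in names(df)
        if eltype(df[!, name]) === Any
            @warn "Column $name has eltype Any and may be treated as categorical"
            push!(flagged, name)
        end
    end
    return flagged
end

# Hypothetical helper: rebuild each column via broadcasting so that Julia
# picks the narrowest eltype that fits the values actually present.
narrow_eltypes(df::AbstractDataFrame) = mapcols(col -> identity.(col), df)

d = DataFrame(Y = Any[1, 2, 3], X = Any[3, 2, 1])
flagged = any_columns(d)     # warns for Y and X
d2 = narrow_eltypes(d)
eltype.(eachcol(d2))         # both columns are now Int64
```

Narrowing only helps when a column's values really share a numeric type; a column mixing, say, numbers and strings would still need explicit cleaning before fitting.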