Linear models failing to display with ambiguous datatypes with unhelpful error messages
jakewilliami opened this issue
The linear model is failing somewhere in `show` (I guess) if you give it `DataFrame`s with columns of type `Any`. MWE:
julia> df_test = DataFrame(date_numeric = [737810, 737841, 737869, 737900, 737930, 737961, 737991, 738022, 738053, 738083, 738114], num = Any[-74587.93, -74550.49, -74482.09, -74441.45, -74316.17, -74252.81, -73976.21, -73587.65, -73170.53, -72753.41, -72304.01])
11×2 DataFrame
 Row │ date_numeric  num
     │ Int64         Any
─────┼────────────────────────────
   1 │       737810  -74587.9
   2 │       737841  -74550.5
   3 │       737869  -74482.1
   4 │       737900  -74441.4
   5 │       737930  -74316.2
   6 │       737961  -74252.8
   7 │       737991  -73976.2
   8 │       738022  -73587.6
   9 │       738053  -73170.5
  10 │       738083  -72753.4
  11 │       738114  -72304.0
julia> model = fit(LinearModel, @formula(date_numeric ~ num), df_test)
ERROR: ArgumentError: FDist: the condition ν1 > zero(ν1) && ν2 > zero(ν2) is not satisfied.
Stacktrace:
# ...
julia> df_test = DataFrame(date_numeric = [737810, 737841, 737869, 737900, 737930, 737961, 737991, 738022, 738053, 738083, 738114], num = Float64[-74587.93, -74550.49, -74482.09, -74441.45, -74316.17, -74252.81, -73976.21, -73587.65, -73170.53, -72753.41, -72304.01]);
julia> model = fit(LinearModel, @formula(date_numeric ~ num), df_test)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}}}}, Matrix{Float64}}

date_numeric ~ 1 + num

Coefficients:
─────────────────────────────────────────────────────────────────────────────────────
                     Coef.    Std. Error       t  Pr(>|t|)  Lower 95%      Upper 95%
─────────────────────────────────────────────────────────────────────────────────────
(Intercept)  746741.0       1103.35       676.79    <1e-21  7.44245e5  749237.0
num               0.118876     0.0149383    7.96    <1e-04  0.0850827       0.152668
─────────────────────────────────────────────────────────────────────────────────────
It took me quite a while to figure out what exactly was happening here. I realised something odd was going on when running `model = fit(LinearModel, @formula(date_numeric ~ num), df_test);` in the REPL worked, but the same call failed when I didn't suppress the display (i.e., when I removed the semicolon).
The problem is that variables with eltype `Any` are treated as categorical. You get the same error by wrapping `num` in a `CategoricalArray`. With 11 observations and 11 distinct levels, the model has as many coefficients as data points, so the underlying problem is that `coeftable` fails for a model with zero residual degrees of freedom. See #458.
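The degrees-of-freedom arithmetic behind this can be sketched with plain vectors, using only Base Julia. This is a rough illustration rather than what GLM.jl does internally: the variable names are made up for the example, and it assumes the usual dummy coding (an intercept plus one column per non-reference level).

```julia
# Values of `num` from the MWE, stored with the too-wide element type Any.
num_any = Any[-74587.93, -74550.49, -74482.09, -74441.45, -74316.17, -74252.81,
              -73976.21, -73587.65, -73170.53, -72753.41, -72304.01]

# If the column is treated as categorical, every distinct value becomes a level.
n_obs = length(num_any)
n_levels = length(unique(num_any))
n_params = 1 + (n_levels - 1)     # intercept + one dummy per non-reference level
resid_dof = n_obs - n_params      # no residual degrees of freedom are left

# Narrowing the element type first lets the column be treated as continuous.
num_float = Float64.(num_any)     # identity.(num_any) would also narrow the eltype
n_params_cont = 2                 # intercept + one slope
resid_dof_cont = n_obs - n_params_cont
```

With a `DataFrame`, the equivalent workaround would be something like `df_test.num = Float64.(df_test.num)` before calling `fit`.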
I agree with the OP that the error message could be improved here. Here is another case, which I just found when reading data from an Excel sheet with `XLSX.readtable` (it parses everything as `Any` by default).
julia> d = DataFrame(Y = Any[1,2,3], X = Any[3,2,1])
3×2 DataFrame
 Row │ Y    X
     │ Any  Any
─────┼──────────
   1 │ 1    3
   2 │ 2    2
   3 │ 3    1
julia> lm(@formula(Y ~ X), d)
ERROR: MethodError: no method matching fit(::Type{LinearModel}, ::Matrix{Float64}, ::Matrix{Float64}, ::Nothing)
Closest candidates are:
fit(::StatisticalModel, ::Any...) at /Users/74097/.julia/packages/StatsBase/IPydo/src/statmodels.jl:178
fit(::Type{StatsBase.Histogram}, ::Any...; kwargs...) at /Users/74097/.julia/packages/StatsBase/IPydo/src/hist.jl:383
fit(::Type{LinearModel}, ::AbstractMatrix{var"#s49"} where var"#s49"<:Real, ::AbstractVector{var"#s50"} where var"#s50"<:Real, ::Union{Nothing, Bool}; wts, dropcollinear) at /Users/74097/.julia/packages/GLM/hDWc9/src/lm.jl:156
Can we not check the element type of each column and show a warning if we find `Any`?
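A minimal sketch of what such a check could look like, written against a `NamedTuple` of column vectors so that it runs with Base Julia alone. Both `any_typed_columns` and `narrow_with_warning` are hypothetical helper names for this illustration, not part of GLM.jl or StatsModels.jl, and the warning text is made up.

```julia
# Hypothetical helper (not part of GLM.jl): names of columns whose eltype is Any.
function any_typed_columns(cols::NamedTuple)
    return [name for (name, col) in pairs(cols) if eltype(col) === Any]
end

# Hypothetical helper: warn about such columns, then return narrowed copies.
function narrow_with_warning(cols::NamedTuple)
    flagged = any_typed_columns(cols)
    if !isempty(flagged)
        @warn "Columns with element type Any may be treated as categorical" columns = flagged
    end
    # identity.(col) rebuilds each vector with an element type inferred from its values.
    return map(col -> identity.(col), cols)
end

d = (Y = Any[1, 2, 3], X = Any[3, 2, 1])
flagged = any_typed_columns(d)     # columns that would trigger the warning
narrowed = narrow_with_warning(d)  # same data, concrete element types
```

For a `DataFrame`, the same idea should carry over by iterating `pairs(eachcol(df))` instead of the `NamedTuple`. Whether a warning or a clearer error is the better behaviour inside the package is a design question this sketch does not settle.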