JuliaStats / GLM.jl

Generalized linear models in Julia

Wrong loglikelihood definitions

getzze opened this issue · comments

The definition of loglikelihood (and nullloglikelihood) for LinearModel seems incorrect to me. As defined in StatsBase (https://juliastats.org/StatsBase.jl/latest/statmodels/#StatsBase.deviance), the deviance is 2*(l_S - l), where l is the loglikelihood of the model and l_S the loglikelihood of the saturated model, that is often 0. So I would expect that the loglikelihood is proportional to the deviance.

Also for GLMs, the definition of the loglikelihood is a bit confusing to me, I don't get why the deviance is used as an estimate of the second parameter of the distribution. Is the deviance supposed to be an estimate of the variance?

AFAIK the resulting values have been checked against R and/or Stata. See discussion at #115 (comment). I'm not sure the loglikelihood for the saturated model is guaranteed to be zero: do you have any reference?

Also for GLMs, the definition of the loglikelihood is a bit confusing to me, I don't get why the deviance is used as an estimate of the second parameter of the distribution. Is the deviance supposed to be an estimate of the variance?

Well the deviance is just the residual sum of squares for a GLM with identity link and normal distribution, so yes.

I don't have a reliable reference for that, but the saturated model corresponds to a model with as many parameters as data points, see for example this thread: https://stats.stackexchange.com/questions/283/what-is-a-saturated-model
So the saturated model gives the maximum loglikelihood possible (with this kind of model). If the data points span the support of the distribution from the model (for instance, the data are real numbers and the model is a Normal GLM), then you can fit all the data points exactly, and therefore the loglikelihood of the saturated model is 0. However, if some data points are outside of the support (e.g. negative data points with a Poisson GLM), they cannot be fitted exactly and l_S is less than zero. This case is very unlikely because you usually make sure that the data is in the support of the model (or you choose the appropriate model for your data).

But taking the log of the deviance that is already a log, cannot be correct.
I looked at MixedModels.jl and the loglikelihood is defined as proportional to the deviance:
https://github.com/JuliaStats/MixedModels.jl/blob/master/src/linearmixedmodel.jl#L446-L451

Well the deviance is just the residual sum of squares for a GLM with identity link and normal distribution, so yes.

My question was about other distributions. The deviance happens to be the RSS for a Normal distribution, which is also, up to a proportionality constant, an estimator of the variance, the second parameter of the Normal distribution.
But I don't think this chain of relationship is true for other distributions but maybe I am wrong.
What is the problem of defining the loglikelihood with the deviance (as it is stated in StatsBase)?

On a sidenote, Stata is proprietary and closed-source afaik, should we make tests that rely on it?

I don't have a reliable reference for that, but the saturated model corresponds to a model with as many parameters as data points, see for example this thread: https://stats.stackexchange.com/questions/283/what-is-a-saturated-model
So the saturated model gives the maximum loglikelihood possible (with this kind of model). If the data points span the support of the distribution from the model (for instance, the data are real numbers and the model is a Normal GLM), then you can fit all the data points exactly, and therefore the loglikelihood of the saturated model is 0.

Yes, I know what the saturated model is, but I don't think the likelihood of such a model is supposed to be 1. See https://stats.stackexchange.com/questions/184753/in-a-glm-is-the-log-likelihood-of-the-saturated-model-always-zero
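
A minimal sketch of that point, assuming a Poisson response and a hand-written log-pmf (the helper `poisson_logpmf` and the counts are made up for illustration; they are not GLM.jl functions). The saturated model sets μ = y for every observation, and its log-likelihood still comes out below zero:

```julia
# Hypothetical helper: log-pmf of Poisson(μ) at count y, written out by hand.
# log p(y | μ) = y*log(μ) - μ - log(y!), with the convention 0*log(0) = 0.
function poisson_logpmf(y::Integer, μ::Real)
    ylogμ = y == 0 ? 0.0 : y * log(μ)
    logfact = sum(log, 1:y; init=0.0)    # log(y!)
    return ylogμ - μ - logfact
end

y = [0, 1, 3, 7]                         # made-up counts

# Saturated model: one parameter per observation, so μ_i = y_i.
ll_saturated = sum(poisson_logpmf(yi, yi) for yi in y)

# Null model for comparison: a single mean shared by all observations.
μ0 = sum(y) / length(y)
ll_null = sum(poisson_logpmf(yi, μ0) for yi in y)

println("saturated loglik: ", ll_saturated)
println("null loglik:      ", ll_null)
println("null deviance:    ", 2 * (ll_saturated - ll_null))
```

The deviance of the null model is still well defined as twice the gap between the two values; only the assumption that the saturated term vanishes is lost.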

But taking the log of the deviance that is already a log, cannot be correct.

Where do we take the log of the deviance? :-/

I looked at MixedModels.jl and the loglikelihood is defined as proportional to the deviance:
https://github.com/JuliaStats/MixedModels.jl/blob/master/src/linearmixedmodel.jl#L446-L451

Note that this is only the definition for linear models. The general definition is at https://github.com/JuliaStats/MixedModels.jl/blob/705fb880b75174b1d9dcdcc95d5c9f5ead4da232/src/generalizedlinearmixedmodel.jl#L379.

But I don't think this chain of relationship is true for other distributions but maybe I am wrong.

Actually not all distributions use the deviance to compute the likelihood of a given data point. See the various method definitions for loglik_obs.

What is the problem of defining the loglikelihood with the deviance (as it is stated in StatsBase)?

We haven't been able to replicate R and Stata's loglikelihood or pseudo-R² values with the deviance, but it works using the current definition of loglikelihood. See JuliaStats/StatsBase.jl#396. On the contrary, the current computation of the log-likelihood follows documented formulas (though I don't have references offhand, they are not too hard to find).

On a sidenote, Stata is proprietary and closed-source afaik, should we make tests that rely on it?

Well unless you have a better reference, I'd rather trust existing implementations. Also as I said it matches R, for which the code is available. If you think something is incorrect, you have to come up either with strong references, or with a case where it's obviously wrong.

Where do we take the log of the deviance? :-/

-n/2 * (log(2π * deviance(r)/n) + 1)

Sorry, I should have quoted this line first, to make my point clearer.

Thanks for the link on the loglikelihood of the saturated model. So yes, the loglikelihood of the saturated model of a GLM should not be assumed to be zero, and I understand why we have to use the actual loglikelihood (logpdf) for each distribution.

Actually not all distributions use the deviance to compute the likelihood of a given data point.

The general definition is at https://github.com/JuliaStats/MixedModels.jl/blob/705fb880b75174b1d9dcdcc95d5c9f5ead4da232/src/generalizedlinearmixedmodel.jl#L379.

In MixedModels.jl, only Bernoulli and Poisson are allowed (Normal falls back to a Linear MM), which are single-parameter distributions. They are also the ones that don't use the deviance in the definitions in loglik_obs, and I agree they are correct. My concern is for the other distributions: Gamma, Inverse Gaussian and Normal (as it does not fall back to a linear model). These 3 distributions need a second parameter apart from the mean μ, which is not optimized by the GLM fit itself; only μ is optimized. It can be estimated instead, using a formula that gives its value from the data and μ.
For a Normal distribution, if we estimate the second parameter (standard deviation), it corresponds to the sqrt of the deviance divided by N, so I guess the formula of loglik_obs makes sense. However the statement from StatsBase that the deviance is -2 times the loglik plus a constant is not true anymore, because the loglik is now, with φ = deviance = Σ(y-μ)² and σ² = φ/N:

l = -N/2 log(2π) - N/2 log(σ²) - Σ (y-μ)²/(2σ²)
  = -N/2 log(2π) - N/2 log(φ/N) - N/2 
  ≠ -N/2 log(2π) - φ/2

The last line is the one I would expect, which assumes that σ=1.
As the formula of loglik is inconsistent with the statement of a proportionality with the deviance for the Normal distribution, I guess it is also inconsistent for InvGaussian and Gamma. So I wanted to get an explanation of why these formulas were used.
Afaik and as it says in the first link you sent, the deviance is the useful quantity for GLMs, the one that is used to compute pseudo-R2 with the formula 1 - D/D_0, so I would not care too much for the constant in the relation deviance = -2 loglik + constant but at least I would expect the proportionality constant to be correct. Actually, I just realized from JuliaStats/StatsBase.jl#396 that R2 is computed using loglik and not deviances, I will make a PR.

I think the mistake is that instead of using dispersion in the formulas for the loglik in LM and GLM, deviance was used :) Maybe from an old naming convention...
The formulas of loglik_obs are correct if φ = dispersion.

I found that the MLE of the dispersion is equal to the mean deviance D/N for the Normal and InverseGaussian, but not for the Gamma. In the Gamma case, if φ is the dispersion and D the deviance, the MLE solves log φ + ψ0(1/φ) = -D/(2N), where ψ0 is the digamma function.
I guess the relation loglik = - deviance / 2 + constant is valid only if the dispersion is assumed to be 1. Otherwise it is loglik = - deviance / (2φ) + constant. It does not change the loglikelihood ratio (as long as the deviance of the saturated model is zero, which is the case). But then it should be stated in StatsBase definition of the deviance.
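
Written out for the Normal case, as a sketch using the symbols of this thread (D(μ) = Σ(y-μ)² is the deviance, φ = σ² is the dispersion and is held fixed, N is the number of observations):

```latex
\begin{align*}
\ell(\mu;\phi) &= -\frac{N}{2}\log(2\pi\phi) - \frac{D(\mu)}{2\phi}, \\
\ell_S(\phi) &= \ell(y;\phi) = -\frac{N}{2}\log(2\pi\phi), \\
2\,\bigl(\ell_S(\phi) - \ell(\mu;\phi)\bigr) &= \frac{D(\mu)}{\phi}.
\end{align*}
```

Setting φ = 1 recovers deviance = -2 loglik + constant. For any other fixed φ, the left-hand side equals the deviance divided by φ.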

Actually, I just realized from JuliaStats/StatsBase.jl#396 that R2 is computed using loglik and not deviances, I will make a PR.

Well, there's been a long discussion at JuliaStats/StatsBase.jl#396, and the conclusion was that pseudo-R² are defined in terms of log-likelihood rather than in terms of deviance. Also, PR JuliaStats/StatsBase.jl#400 is supposed to have been checked against Stata.

I'm not sure about the Gamma case, but it is so commonly stated that the deviance is twice the log-likelihood up to a constant that I would be surprised if that wasn't correct for that function. Can you give a concrete example which appears to be wrong?

Cc: @lbittarello

That is for LM:

using Test
using StatsBase
using GLM

N = 100
a, b = -5, 10

x = range(-1, 1, length=N)
y = a * x .+ b .+ randn(N)

# using Plots
# plot(x, y)

X = hcat(ones(size(x, 1)), x)
res = lm(X, y)

μ0 = mean(y)        # Null model
μ  = X * coef(res)  # LM
μS = y              # Saturated model

# Replicate the formula from GLM#master
fdeviance(y, μ) = sum(abs2.(y .- μ))
floglik(y, μ) = begin n=length(y); -n/2 * (log(2π * fdeviance(y, μ)/n) + 1) end

# Make sure deviance of the saturated model is 0
#!! SUCCESSFUL TEST !!
@test(fdeviance(y, μS) ≈ 0)

dev  = fdeviance(y, μ)
dev0 = fdeviance(y, μ0)
devS = fdeviance(y, μS)

ll  = floglik(y, μ)
ll0 = floglik(y, μ0)
llS = floglik(y, μS)

#!! SUCCESSFUL TEST !!
@test(fdeviance(y, μ) ≈ deviance(res))
#!! SUCCESSFUL TEST !!
@test(floglik(y, μ) ≈ loglikelihood(res))

### Failure of the statement "deviance is -2 times loglikelihood plus a constant"
#?? FAILING TEST ??
@test_broken(dev - dev0 ≈ -2(ll - ll0))

r2_dev = 1 - dev/dev0
r2_ll  = 1 - ll/ll0

#?? FAILING TEST ??
@test_broken(r2_dev ≈ r2_ll)

R2 = 1 - sum(abs2.(y .- μ)) / sum(abs2.(y .- μ0))

#!! SUCCESSFUL TEST !!
@test(r2_dev ≈ R2)

#?? FAILING TEST ??
@test_broken(r2_ll ≈ R2 )

For Normal GLM:

using Test
using StatsBase
using Distributions
using GLM

N = 100
a, b = -5, 10
D = Gamma(5, 1)

x = range(-1, 1, length=N)
y = a * x .+ b .+ randn(N)

# using Plots
# plot(x, y)

X = hcat(ones(size(x, 1)), x)
#res = lm(X, y)
res = glm(X, y, Normal())

μ0 = mean(y) * ones(size(y))  # Null model
μ  = X * coef(res)            # Normal GLM
μS = y                        # Saturated model

# Replicate the formula from GLM#master
fdeviance(y, μ) = sum(abs2.(y .- μ))
floglik(y, μ) = begin
    N = size(y, 1)
    ϕ = fdeviance(y, μ) / N
    ll = 0
    for (yi, μi) in zip(y, μ)
        ll += logpdf(Normal(μi, sqrt(ϕ)), yi)
    end
    ll
end

## The dispersion is set to 1 in the model
floglik_good(y, μ) = begin
    N = size(y, 1)
    ll = 0
    for (yi, μi) in zip(y, μ)
        ll += logpdf(Normal(μi, 1), yi)
    end
    ll
end

# Make sure deviance of the saturated model is 0
#!! SUCCESSFUL TEST !!
@test(fdeviance(y, μS) ≈ 0)

dev  = fdeviance(y, μ)
dev0 = fdeviance(y, μ0)
devS = fdeviance(y, μS)

ll  = floglik(y, μ)
ll0 = floglik(y, μ0)
llS = floglik(y, μS)
ll_good  = floglik_good(y, μ)
ll0_good = floglik_good(y, μ0)

#!! SUCCESSFUL TEST !!
@test(fdeviance(y, μ) ≈ deviance(res))
#!! SUCCESSFUL TEST !!
@test(floglik(y, μ) ≈ loglikelihood(res))

### Failure of the statement "deviance is -2 times loglikelihood plus a constant"
#?? FAILING TEST ??
@test_broken(dev - dev0 ≈ -2(ll - ll0))

@test(dev - dev0 ≈ -2(ll_good - ll0_good))

r2_dev = 1 - dev/dev0
r2_ll  = 1 - ll/ll0

#?? FAILING TEST ??
@test_broken(r2_dev ≈ r2_ll)

R2 = 1 - sum(abs2.(y .- μ)) / sum(abs2.(y .- μ0))

#!! SUCCESSFUL TEST !!
@test(r2_dev ≈ R2)

#?? FAILING TEST ??
@test_broken(r2_ll ≈ R2 )

Stata is proprietary and closed-source afaik, should we make tests that rely on it?

Perhaps not. But Stata is very well documented with references and all (unlike R...), so it's a good starting point.

pseudo-R² are defined in terms of log-likelihood rather than in terms of deviance

Yes. To reiterate: neither r2_dev ≈ r2_ll nor r2_ll ≈ R2 is true.

I think the mistake is that instead of using dispersion in the formulas for the loglik in LM and GLM, deviance was used

The deviance is a consistent estimator of the dispersion, isn't it? But the Pearson residuals are generally preferred (McCullagh and Nelder, 1989). Note that R also uses the deviance estimator: see here, here and around here (sorry, R source code is a rabbit hole).
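
To make the two estimators concrete, here is a rough sketch for a Gamma response. It assumes the fitted means `μ` are already available and that `p` coefficients were estimated. The data, the fitted values and the helper names are invented for illustration, and the `n - p` denominator is one common convention, not necessarily what GLM.jl or R use:

```julia
# Unit deviance and squared Pearson residual for the Gamma family,
# whose variance function is V(μ) = μ².
gamma_unit_deviance(y, μ) = 2 * ((y - μ) / μ - log(y / μ))
gamma_pearson_sq(y, μ) = (y - μ)^2 / μ^2

y = [1.2, 0.7, 2.5, 3.1, 1.9, 4.4]       # made-up positive responses
μ = [1.0, 1.0, 2.0, 3.0, 2.0, 4.0]       # made-up fitted means
p = 2                                    # assumed number of estimated coefficients
n = length(y)

D  = sum(gamma_unit_deviance.(y, μ))     # deviance
X2 = sum(gamma_pearson_sq.(y, μ))        # Pearson χ² statistic

ϕ_deviance = D / (n - p)                 # deviance-based dispersion estimate
ϕ_pearson  = X2 / (n - p)                # Pearson-based dispersion estimate

println("deviance-based dispersion: ", ϕ_deviance)
println("Pearson-based dispersion:  ", ϕ_pearson)
```

For the Normal family the two coincide, because the unit deviance and the squared Pearson residual are both (y - μ)². For the Gamma family they generally differ.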

it is so commonly stated that the deviance is twice the log-likelihood up to a constant

The deviance is defined as twice the difference between the log likelihood of the saturated model and the model at hand; the log likelihood of the saturated model is constant and does not depend on the model at hand; therefore, the deviance is twice the negative log likelihood of the model at hand plus a constant.

The deviance is a consistent estimator of the dispersion, isn't it?

I didn't know that, but it seems to be the case, yes.

McCullagh and Nelder, 1989

In chapter 2.3.1, they make the distinction between the deviance and the scaled deviance, the latter being equal to the deviance divided by the dispersion. In chapter 2.1.2, they say that the scaled deviance is equal to -2 times the difference of loglikelihoods. Note that this is the same as saying that deviance = -2 loglik + constant with the dispersion assumed to be 1.
Another important point from ch. 2.3.1 is that the dispersion ϕ is considered to be fixed. In the current definition in GLM.jl, the dispersion is estimated from the deviance, which is not fixed! If the μ estimate changes, the deviance changes.

Another important point from ch. 2.3.1 is that the dispersion ϕ is considered to be fixed. In the current definition in GLM.jl, the dispersion is estimated from the deviance, which is not fixed! If the μ estimate changes, the deviance changes.

Where do you see that they say it's fixed? AFAICT it's fixed for a given combination of data and model, but they don't say it's fixed for a given combination of data and models family.

Where do you see that they say it's fixed? AFAICT it's fixed for a given combination of data and model, but they don't say it's fixed for a given combination of data and models family.

The dispersion is fixed (i.e. the same) for the model and the null model. That is not the case with the current implementation in GLM.jl, so taking the difference of loglik of the model and of the nullmodel does not correspond to deviance and the formula for r² does not work.
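
A small numerical sketch of that difference for the Normal case, using only the standard library. The data are synthetic, with a deterministic stand-in for noise, and `dev` and `loglik` are helpers written for this example rather than GLM.jl functions:

```julia
using LinearAlgebra

N = 50
x = range(-1, 1, length=N)
noise = sin.(37 .* (1:N))                # deterministic stand-in for random noise
y = -5 .* x .+ 10 .+ noise

X  = hcat(ones(N), x)
β  = X \ y                               # least-squares coefficients
μ  = X * β                               # fitted model
μ0 = fill(sum(y) / N, N)                 # null (intercept-only) model

dev(y, μ) = sum(abs2, y .- μ)

# Normal log-likelihood for a given dispersion ϕ (= σ²).
loglik(y, μ, ϕ) = -length(y) / 2 * log(2π * ϕ) - dev(y, μ) / (2ϕ)

# (a) Dispersion re-estimated for each model as dev/N (the profiled form).
ll_prof  = loglik(y, μ,  dev(y, μ)  / N)
ll0_prof = loglik(y, μ0, dev(y, μ0) / N)

# (b) One dispersion shared by both models.
ϕ = 1.0
ll_fix  = loglik(y, μ,  ϕ)
ll0_fix = loglik(y, μ0, ϕ)

# With a shared ϕ, the scaled deviance difference equals twice the
# log-likelihood difference.
same_fixed = (dev(y, μ0) - dev(y, μ)) / ϕ ≈ 2 * (ll_fix - ll0_fix)

# With a profiled dispersion, twice the log-likelihood difference is
# N*log(dev0/dev) instead.
same_prof = N * log(dev(y, μ0) / dev(y, μ)) ≈ 2 * (ll_prof - ll0_prof)

println(same_fixed, " ", same_prof)
```

In case (a) the term dev/(2ϕ) collapses to N/2 for both models, so the deviance only enters through the logarithm, which is why the difference of loglikelihoods no longer matches the difference of deviances.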

Well you still don't say where you read that...

Apart from the fact that the deviance is used as an estimator for the dispersion instead of the Pearson Chi2, the formulas for the loglikelihoods of GLMs and LMs are actually correct. It took me a while to realize it because I was not expecting the dispersion to be estimated. I never used R, so maybe they do it like that, but from the textbooks, the dispersion is not estimated in GLMs, it is assumed to be known (usually set to 1). With a fixed dispersion, all the formulas I mentioned work: the deviance is -2 times the difference of loglikelihoods (with a dispersion set to 1), r² = 1 - dev/dev_0, etc. Also, the dof does not include the dispersion, it is equal to the number of regressors, so it can be used for the F-test of nested models.

I think it would be good to allow all the GLM module to work with known, fixed dispersion, like python statsmodels (https://www.statsmodels.org/stable/generated/statsmodels.genmod.generalized_linear_model.GLM.html?highlight=glm#statsmodels.genmod.generalized_linear_model.GLM). Estimation or not of the dispersion could be set when creating the model.

Where do you see that they say it's fixed?

Chapter 2.3.1 of the book:

Let l(μ, φ; y) be the log likelihood maximized over β for a fixed
value of the dispersion parameter φ. The maximum likelihood
achievable in a full model with n parameters is l(y, φ; y), which is
ordinarily finite. The discrepancy of a fit is proportional to twice
the difference between the maximum log likelihood achievable and
that achieved by the model under investigation.

Note that textbooks generally ignore the dispersion parameter because it doesn't affect the estimation of the mean.

OK, so what's the conclusion? Do we need to change anything in the end?

But taking the log of the deviance that is already a log, cannot be correct.
I looked at MixedModels.jl and the loglikelihood is defined as proportional to the deviance:
https://github.com/JuliaStats/MixedModels.jl/blob/master/src/linearmixedmodel.jl#L446-L451

Note that the "deviance" for linear mixed models in MixedModels in the documentation and show methods is called either "the objective" (for optimization) or "-2 log likelihood". We do this because it's not clear at all what the saturated model would be for mixed-models (do you add parameters in the fixed or random effects to attain the one parameter per observation?). In that sense, things are done "on the deviance scale" because we don't know what the constant for the saturated model is. For most things depending on the deviance, that constant cancels out (e.g. difference of deviances for the likelihood-ratio test), so it doesn't matter. In other words, it often makes sense to treat -2 log likelihood like the deviance, but it is not the deviance. For generalized linear mixed models, we actually compute the deviance and likelihood through two independent methods.

Regarding estimation of the dispersion in GLMs: It's taken to be 1 by convention for distributions where the variance is a function of the mean (Poisson, Binomial, Bernoulli). Of course, this can be problematic for e.g. zero-inflated Poisson, but then again, strictly speaking, that is no longer a Poisson distribution, but rather a related, yet distinct distribution. For distributions where the variance isn't a function of the mean (Gamma, Gaussian, Inverse Gaussian), the dispersion is computed based on the data.

There are use cases for forcing the dispersion/variance to a particular value (e.g. meta analysis), so that would be a nice feature to have, but that would involve some rather more extensive plumbing changes (and probably a new type to make dispatch work nicely on some things with different behavior). Likewise, it could be nice to have an "empirical dispersion" or other tool to check for over-/under-dispersion.