stanfordmlgroup / ngboost

Natural Gradient Boosting for Probabilistic Prediction

LogNormal Dist as implemented is not inclusive of 0.

CompRhys opened this issue

I have failure data that follows a Tweedie distribution. Because the number of zeros is low, I wanted to try modelling it with a lognormal distribution in ngboost. However, as implemented, fitting fails with the following error:

  File "/Users/rhys/opt/miniconda3/envs/py310/lib/python3.10/site-packages/ngboost/ngboost.py", line 276, in fit
    self.fit_init_params_to_marginal(Y)
  File "/Users/rhys/opt/miniconda3/envs/py310/lib/python3.10/site-packages/ngboost/ngboost.py", line 121, in fit_init_params_to_marginal
    self.init_params = self.Manifold.fit(
  File "/Users/rhys/opt/miniconda3/envs/py310/lib/python3.10/site-packages/ngboost/distns/lognormal.py", line 124, in fit
    m, s = sp.stats.norm.fit(np.log(Y))
  File "/Users/rhys/.local/lib/python3.10/site-packages/scipy/stats/_continuous_distns.py", line 63, in wrapper
    return fun(self, *args, **kwds)
  File "/Users/rhys/.local/lib/python3.10/site-packages/scipy/stats/_continuous_distns.py", line 364, in fit
    raise RuntimeError("The data contains non-finite values.")
RuntimeError: The data contains non-finite values.

If I add a small positive amount of noise to the Y labels then the model trains, and this is probably the solution in my case, but I wanted to highlight the issue explicitly.
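The noise workaround mentioned above might look roughly like the sketch below. The helper name `add_positive_noise` and the default `scale` are hypothetical and not part of ngboost; the only point is that every label ends up strictly positive, so `np.log(Y)` stays finite.

```python
import numpy as np


def add_positive_noise(Y, scale=1e-6, seed=0):
    """Return a copy of Y with a small, strictly positive amount added to each label.

    Hypothetical helper: `scale` is an assumed magnitude and should be chosen
    relative to the smallest non-zero label in the data.
    """
    rng = np.random.default_rng(seed)
    Y = np.asarray(Y, dtype=float)
    # Uniform noise in [scale, 2 * scale) is always > 0, so zeros become positive.
    noise = rng.uniform(low=scale, high=2 * scale, size=Y.shape)
    return Y + noise


Y = np.array([0.0, 1.5, 3.0])
Y_pos = add_positive_noise(Y)
# log is now finite for every label, including the former zero
log_is_finite = np.isfinite(np.log(Y_pos)).all()
```

One caveat with this approach: the former zeros land very far out in log space (log(1e-6) is about -13.8), so the chosen scale can noticeably influence the fitted mu and sigma.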

>>> import numpy as np
>>> import scipy.stats
>>> a = np.exp(np.random.randn(1000000))
>>> scipy.stats.lognorm.fit(a, floc=0, method="MM")
(1.0019818648205723, 0, 0.9985804299185244)
>>> scipy.stats.norm.fit(np.log(a))
(0.0007051204420198203, 0.9990557672515611)
>>> a[0]=0
>>> scipy.stats.lognorm.fit(a, floc=0, method="MM")
(1.0019818648205723, 0, 0.9985805077931086)
>>> scipy.stats.norm.fit(np.log(a))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/chemix-rhys/.local/lib/python3.10/site-packages/scipy/stats/_continuous_distns.py", line 63, in wrapper
    return fun(self, *args, **kwds)
  File "/Users/chemix-rhys/.local/lib/python3.10/site-packages/scipy/stats/_continuous_distns.py", line 364, in fit
    raise RuntimeError("The data contains non-finite values.")
RuntimeError: The data contains non-finite values.

FYI, norm.fit returns (mu, sigma) and lognorm.fit returns (s, loc, scale), where s = sigma and scale = exp(mu). The MM (method of moments) fit has to be used instead of the MLE fit because the MLE fit also requires the data to be > 0. As there appears to be a speed penalty associated with the MM fit, perhaps it could be used only if Y contains 0?

(base) ➜  ~ python -mtimeit -s'import numpy as np; import scipy.stats; np.random.seed(0); a=np.exp(np.random.randn(10000))' 'scipy.stats.norm.fit(np.log(a))'
5000 loops, best of 5: 89.7 usec per loop
(base) ➜  ~ python -mtimeit -s'import numpy as np; import scipy.stats; np.random.seed(0); a=np.exp(np.random.randn(10000))' 'scipy.stats.lognorm.fit(a, floc=0, method="MM")'
50 loops, best of 5: 5.14 msec per loop
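The suggestion of keeping the fast log-space fit and switching to moment matching only when Y contains 0 could be sketched as below. This is a minimal illustration, not the ngboost implementation: `fit_lognormal_marginal` is a hypothetical name, and the zero-tolerant branch uses the closed-form moment equations for a lognormal with loc = 0 in place of the `scipy.stats.lognorm.fit(..., method="MM")` call shown above. In principle it should land on essentially the same (mu, sigma) as that call, since both match the first two moments.

```python
import numpy as np


def fit_lognormal_marginal(Y):
    """Return (mu, sigma) of a lognormal fitted to Y, tolerating zeros in Y.

    Hypothetical helper for illustration only.
    """
    Y = np.asarray(Y, dtype=float)
    if np.any(Y < 0):
        raise ValueError("lognormal targets must be non-negative")

    if np.all(Y > 0):
        # Fast path: normal MLE on log(Y), i.e. what norm.fit(np.log(Y)) computes.
        log_y = np.log(Y)
        return log_y.mean(), log_y.std()

    # Zeros present: match the sample mean m and variance v instead.
    # For a lognormal with loc = 0:
    #   m = exp(mu + sigma^2 / 2),  v = (exp(sigma^2) - 1) * m^2
    m = Y.mean()
    v = Y.var()
    if m == 0:
        raise ValueError("all targets are zero; the moments are degenerate")
    sigma2 = np.log1p(v / m**2)
    mu = np.log(m) - 0.5 * sigma2
    return mu, np.sqrt(sigma2)


rng = np.random.default_rng(0)
y = np.exp(rng.standard_normal(200_000))
mu_mle, sigma_mle = fit_lognormal_marginal(y)  # all positive: log-space fit
y[0] = 0.0
mu_mm, sigma_mm = fit_lognormal_marginal(y)    # contains a zero: moment matching
```

Because the moment equations are solved analytically, this branch needs no numerical optimizer, which may sidestep most of the speed penalty seen in the timings. The trade-off is that moment estimates for heavy-tailed data are noisier than the log-space MLE, so the two branches will not agree exactly.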