LogNormal Dist as implemented is not inclusive of 0.
CompRhys opened this issue
I have failure data that follows a Tweedie distribution. Because the number of zeros is low, I wanted to try modelling it with a lognormal distribution in ngboost. However, as implemented, fitting raises the following error:
File "/Users/rhys/opt/miniconda3/envs/py310/lib/python3.10/site-packages/ngboost/ngboost.py", line 276, in fit
self.fit_init_params_to_marginal(Y)
File "/Users/rhys/opt/miniconda3/envs/py310/lib/python3.10/site-packages/ngboost/ngboost.py", line 121, in fit_init_params_to_marginal
self.init_params = self.Manifold.fit(
File "/Users/rhys/opt/miniconda3/envs/py310/lib/python3.10/site-packages/ngboost/distns/lognormal.py", line 124, in fit
m, s = sp.stats.norm.fit(np.log(Y))
File "/Users/rhys/.local/lib/python3.10/site-packages/scipy/stats/_continuous_distns.py", line 63, in wrapper
return fun(self, *args, **kwds)
File "/Users/rhys/.local/lib/python3.10/site-packages/scipy/stats/_continuous_distns.py", line 364, in fit
raise RuntimeError("The data contains non-finite values.")
RuntimeError: The data contains non-finite values.
If I add a small positive amount of noise to the Y labels, the model trains. That is probably the solution in my case, but I wanted to highlight the issue explicitly.
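A minimal sketch of the noise workaround described above. The helper name `jitter_labels`, the noise scale `eps`, and the uniform noise are assumptions chosen for illustration; they are not part of ngboost, and the right scale depends on the units of the data.

```python
import numpy as np


def jitter_labels(Y, eps=1e-6, seed=0):
    """Hypothetical helper: add strictly positive noise so that log(Y) is finite.

    Assumes Y is non-negative. eps is an illustrative value and should be
    small relative to the smallest non-zero label.
    """
    rng = np.random.default_rng(seed)
    Y = np.asarray(Y, dtype=float)
    # Noise is drawn from [eps, 2 * eps), so every label ends up > 0.
    noise = rng.uniform(low=eps, high=2 * eps, size=Y.shape)
    return Y + noise


Y = np.array([0.0, 0.5, 2.0, 0.0, 7.3])
Y_pos = jitter_labels(Y)
# np.log(Y_pos) now contains only finite values, so the initial fit can proceed.
```

Jittering every label, rather than only the zeros, keeps the transformation uniform across the dataset. Shifting only the zeros would be an equally reasonable variant.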
>>> import numpy as np
>>> import scipy.stats
>>> a = np.exp(np.random.randn(1000000))
>>> scipy.stats.lognorm.fit(a, floc=0, method="MM")
(1.0019818648205723, 0, 0.9985804299185244)
>>> scipy.stats.norm.fit(np.log(a))
(0.0007051204420198203, 0.9990557672515611)
>>> a[0]=0
>>> scipy.stats.lognorm.fit(a, floc=0, method="MM")
(1.0019818648205723, 0, 0.9985805077931086)
>>> scipy.stats.norm.fit(np.log(a))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/chemix-rhys/.local/lib/python3.10/site-packages/scipy/stats/_continuous_distns.py", line 63, in wrapper
return fun(self, *args, **kwds)
File "/Users/chemix-rhys/.local/lib/python3.10/site-packages/scipy/stats/_continuous_distns.py", line 364, in fit
raise RuntimeError("The data contains non-finite values.")
RuntimeError: The data contains non-finite values.
FYI, norm.fit returns (mu, sigma) and lognorm.fit returns (s, loc, scale), where s = sigma and scale = exp(mu). The MM (method of moments) fit has to be used instead of the MLE fit, because the MLE fit also requires the data to be strictly positive. As there appears to be a speed penalty associated with the MM fit, perhaps it could be used only if Y contains 0?
(base) ➜ ~ python -m timeit -s 'import numpy as np; import scipy.stats; np.random.seed(0); a=np.exp(np.random.randn(10000))' 'scipy.stats.norm.fit(np.log(a))'
5000 loops, best of 5: 89.7 usec per loop
(base) ➜ ~ python -m timeit -s 'import numpy as np; import scipy.stats; np.random.seed(0); a=np.exp(np.random.randn(10000))' 'scipy.stats.lognorm.fit(a, floc=0, method="MM")'
50 loops, best of 5: 5.14 msec per loop
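The conditional fallback suggested above could be sketched roughly as follows. The function name `fit_lognormal_marginal` is hypothetical and this is not the ngboost implementation; it only illustrates keeping the fast norm.fit path when every label is positive and switching to the method-of-moments fit when zeros are present. Rejecting negative labels is an extra assumption.

```python
import numpy as np
import scipy.stats


def fit_lognormal_marginal(Y):
    """Hypothetical helper: estimate (mu, sigma) of log(Y) for lognormal data.

    Uses the fast fit on log(Y) when all labels are positive, and falls back
    to the slower method-of-moments fit only when Y contains zeros.
    """
    Y = np.asarray(Y, dtype=float)
    if np.any(Y < 0):
        # Assumption: negative labels are invalid for a lognormal model.
        raise ValueError("Y must be non-negative")
    if np.all(Y > 0):
        # Fast path: log(Y) is finite everywhere.
        mu, sigma = scipy.stats.norm.fit(np.log(Y))
    else:
        # Fallback: the MM fit tolerates zeros; s = sigma, scale = exp(mu).
        s, _loc, scale = scipy.stats.lognorm.fit(Y, floc=0, method="MM")
        mu, sigma = np.log(scale), s
    return mu, sigma


rng = np.random.default_rng(0)
Y = np.exp(rng.standard_normal(10000))
Y[0] = 0.0  # a single zero is enough to trigger the fallback path
mu, sigma = fit_lognormal_marginal(Y)
```

With this split, the MM cost shown in the timings is paid only by datasets that actually contain zeros. The two paths use different estimators, so their results on near-identical data will differ slightly, as the session above shows.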