Dask-GLM doesn't converge with Dask array
pentschev opened this issue
After some profiling, this is what I found for Dask-GLM with a Dask array. The cProfile columns are ncalls, tottime, percall, cumtime, percall and filename:lineno(function):
14339 0.139 0.000 0.814 0.000 /home/pentschev/.local/lib/python3.5/site-packages/dask/local.py:430(fire_task)
44898 19.945 0.000 19.945 0.000 {method 'acquire' of '_thread.lock' objects}
4055 0.042 0.000 19.992 0.005 /usr/lib/python3.5/threading.py:261(wait)
14339 0.107 0.000 20.234 0.001 /usr/lib/python3.5/queue.py:147(get)
14339 0.018 0.000 20.253 0.001 /home/pentschev/.local/lib/python3.5/site-packages/dask/local.py:140(queue_get)
122 0.117 0.001 22.327 0.183 /home/pentschev/.local/lib/python3.5/site-packages/dask/local.py:345(get_async)
122 0.013 0.000 22.346 0.183 /home/pentschev/.local/lib/python3.5/site-packages/dask/threaded.py:33(get)
122 0.004 0.000 22.733 0.186 /home/pentschev/.local/lib/python3.5/site-packages/dask/base.py:345(compute)
1 0.020 0.020 23.224 23.224 /home/pentschev/.local/lib/python3.5/site-packages/dask_glm/algorithms.py:200(admm)
1 0.000 0.000 23.267 23.267 /home/pentschev/.local/lib/python3.5/site-packages/dask_glm/utils.py:13(normalize_inputs)
1 0.000 0.000 23.268 23.268 /home/pentschev/.local/lib/python3.5/site-packages/dask_glm/estimators.py:65(fit)
A large portion of the time seems to be spent waiting to acquire a thread lock. Also, looking at the callers, we see 100 compute() calls originating from admm(), which means it's not converging and only stops at max_iter, as @cicdw suggested:
/home/pentschev/.local/lib/python3.5/site-packages/dask/base.py:345(compute) <- 100 0.004 19.637 /home/pentschev/.local/lib/python3.5/site-packages/dask_glm/algorithms.py:197(admm)
Running with NumPy, the algorithm converges, showing only 7 compute() calls:
/home/pentschev/.local/lib/python3.5/site-packages/dask/base.py:345(compute) <- 7 0.000 0.120 /home/pentschev/.local/lib/python3.5/site-packages/dask_glm/algorithms.py:197(admm)
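The reasoning above counts compute() calls per admm() invocation as a proxy for iterations. A minimal, self-contained sketch of that measurement follows, using only the standard-library `cProfile` and `pstats` modules. The `solver` and `compute` functions are hypothetical stand-ins (one compute() per iteration, early exit on a tolerance), not Dask-GLM code, and `calls_from` reads the raw `stats` dictionary that `pstats.Stats` builds, which is the same data `print_callers()` displays.

```python
import cProfile
import pstats


def compute(x):
    # Hypothetical stand-in for one dask compute() call per iteration.
    return x * 0.5


def solver(max_iter=100, tol=1e-3):
    # Hypothetical iterative solver: one compute() per iteration,
    # stopping early once successive iterates differ by less than tol.
    x = 1.0
    for _ in range(max_iter):
        new_x = compute(x)
        if abs(new_x - x) < tol:
            break
        x = new_x


def calls_from(stats, callee, caller):
    # Sum the calls to `callee` made by `caller`, matching on function
    # names only (the filename/lineno parts of the keys are ignored).
    total = 0
    for (_, _, name), (_, _, _, _, callers) in stats.stats.items():
        if name != callee:
            continue
        for (_, _, caller_name), counts in callers.items():
            if caller_name == caller:
                total += counts[0]  # counts = (ncalls, primitive calls, tottime, cumtime)
    return total


def profile_solver(**kwargs):
    profiler = cProfile.Profile()
    profiler.runcall(solver, **kwargs)
    return pstats.Stats(profiler)


converged = profile_solver(tol=1e-3)
stuck = profile_solver(tol=0.0)  # abs(...) < 0.0 is never true, so it runs to max_iter
print(calls_from(converged, "compute", "solver"))  # 10
print(calls_from(stuck, "compute", "solver"))      # 100
```

A call count equal to max_iter is consistent with the loop never meeting its stopping criterion, which is the same inference drawn from the 100 vs. 7 compute() calls in the profiles.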
I'm running Dask 1.1.4 and the Dask-GLM master branch, to ensure that my local changes aren't introducing any bugs. However, if I run my Dask-GLM branch with CuPy as the backend, it also converges in 7 iterations.
To me, this suggests that we have one of those well-hidden, hard-to-track bugs in Dask. Before I spend hours on this, does anyone have suggestions on what we could look for?
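One possible way to narrow this down, sketched here as an assumption rather than an established Dask-GLM debugging procedure: record the same per-iteration scalar (for example a residual norm) from the NumPy run and the Dask run, then find the first iteration where they disagree. The helper `first_divergence` is hypothetical, and the trace values below are illustrative only, not measurements from this issue.

```python
import math


def first_divergence(trace_a, trace_b, rel_tol=1e-8, abs_tol=1e-12):
    """Return the first iteration index where two traces disagree.

    Hypothetical helper: each trace is a list of per-iteration scalars
    recorded from the same solver on two array backends. If one trace
    is a prefix of the other, the index where the shorter one ends is
    returned; None means the traces match entirely.
    """
    n = min(len(trace_a), len(trace_b))
    for i in range(n):
        if not math.isclose(trace_a[i], trace_b[i], rel_tol=rel_tol, abs_tol=abs_tol):
            return i
    return None if len(trace_a) == len(trace_b) else n


# Illustrative values only: index 2 differs by less than rel_tol,
# index 3 clearly diverges.
numpy_trace = [1.0, 0.5, 0.25, 0.125]
dask_trace = [1.0, 0.5, 0.2500000001, 0.3]
print(first_divergence(numpy_trace, dask_trace))  # 3
```

Comparing traces rather than final results points at the specific iteration (and therefore the specific array operations) where the two backends start to behave differently.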
Originally posted by @pentschev in dask/dask-blog#15
Note also that Dask-GLM estimators were deprecated in #66 in favor of Dask-ML. If this is truly a bug in Dask-GLM, it may have been fixed in Dask-ML already.