dask / dask-glm


Question about convergence on repetitive data

mrocklin opened this issue · comments

This is a question for @moody-marlin and @mcg1969

I expect that many large datasets will be highly repetitive. That is, I expect a sample of the data to be decently representative of the full dataset. Given this structure, it feels inefficient for our iterative algorithms to go over the entire dataset before updating parameters.

Are there situations in which it is advantageous to cycle through different subsets of the full dataset?

As a naive (probably wrong) example perhaps we would partition the data into ten randomly selected subsets, and then perform a single round of gradient descent on each in turn to obtain a new gradient direction.

This is the insight that randomized algorithms like stochastic gradient descent take advantage of; in theory you could compute the gradient on a randomly selected chunk and use it to update the coefficients at each iteration. I'm not convinced this would work well with only 10 chunks, but we could also take random subsets of data from within a given chunk to increase the randomness. I do think there is an appetite for randomized algorithms, and given the current code structure and how easy dask makes it to get a list of chunks, I think a simple implementation would be fairly easy to write.
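A minimal sketch of what a chunk-wise update loop could look like, using plain numpy arrays as stand-ins for materialized dask chunks. The function name `sgd_over_chunks`, the logistic-regression loss, and all parameters here are illustrative assumptions, not part of dask-glm:

```python
import numpy as np

def sgd_over_chunks(chunks, beta, lr=0.1, epochs=5, rng=None):
    """Cycle through data chunks, taking one gradient step per chunk.

    chunks: list of (X, y) pairs, e.g. materialized blocks of a dask.array.
    beta:   initial coefficient vector.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    for _ in range(epochs):
        # Visit chunks in a fresh random order each epoch.
        for i in rng.permutation(len(chunks)):
            X, y = chunks[i]
            p = 1 / (1 + np.exp(-X @ beta))   # logistic predictions on this chunk
            grad = X.T @ (p - y) / len(y)     # average gradient over the chunk
            beta = beta - lr * grad
    return beta
```

With repetitive data, each chunk's gradient approximates the full-data gradient, so the coefficients can make progress many times per pass instead of once.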

Related, what are the costs of rechunking large datasets every 10-20 iterations?

If you want to do a full shuffle then the answer is "somewhat expensive". If you want to split per-chunk then the answer is "pretty cheap".

We can trivially cycle through random chunks, though in the common case this probably violates i.i.d. assumptions. We can easily produce ten 1/10th-size datasets by splitting each chunk into 10 pieces and treating those splits as 10 dask.arrays. This has the cost of increasing the partition count by 10x, which is usually OK, though perhaps not if we're running into scheduling overheads (which we are). Full shuffles are fairly expensive, though I might be overly pessimistic here.
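The per-chunk split could be sketched like this, again with numpy arrays standing in for dask blocks. The helper `split_chunks` is a hypothetical name; in practice each of the k regrouped piece-lists would be reassembled into its own dask.array:

```python
import numpy as np

def split_chunks(chunks, k, rng=None):
    """Split each chunk into k random pieces and regroup them into
    k smaller datasets, each holding roughly 1/k of the full data.

    chunks: list of numpy arrays (stand-ins for dask.array blocks).
    Returns a list of k lists of sub-chunks.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    datasets = [[] for _ in range(k)]
    for chunk in chunks:
        # Shuffle within the chunk, then deal the pieces out round-robin.
        idx = rng.permutation(len(chunk))
        for j, part in enumerate(np.array_split(idx, k)):
            datasets[j].append(chunk[part])
    return datasets
```

This only shuffles within each chunk, which is why it is cheap: no data moves between partitions, but the partition count grows by a factor of k.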

@moody-marlin for ADMM, does it matter if the data is sorted or randomized?

I suspect that randomization matters more if you consider only batches or samples at a time.

@hussainsultan yea, randomization is best -- sorted data is probably not a good idea (for example: imagine one of the local updates trying to fit a logistic regression on data with no observed 1 events).
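The failure mode described above is easy to check for. A small sketch, with a hypothetical helper name and (X, y) chunk pairs assumed as before:

```python
import numpy as np

def chunks_are_fittable(chunks):
    """Check that every chunk contains both label classes, so a local
    logistic-regression update is well-posed on each one."""
    return all(len(np.unique(y)) == 2 for _, y in chunks)
```

On label-sorted data, contiguous chunks can easily contain only one class, while any interleaving of the classes avoids the problem: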

ADMM will still converge if the data is not randomized, but it may take a long time -- is that what you meant by "not a good idea"?