dask / dask-glm


Question about convergence on repetitive data

mrocklin opened this issue · comments

This is a question for @moody-marlin and @mcg1969

I expect that many large datasets will be highly repetitive. That is, I expect a sample of the data to be decently representative of the full dataset. Given this structure, it feels inefficient for our iterative algorithms to go over the entire dataset before updating parameters.

Are there situations in which it is advantageous to cycle through different subsets of the full dataset?

As a naive (probably wrong) example perhaps we would partition the data into ten randomly selected subsets, and then perform a single round of gradient descent on each in turn to obtain a new gradient direction.

This is the insight that randomized algorithms like stochastic gradient descent take advantage of; in theory you could compute the gradient on a randomly selected chunk and use it to update the coefficients at each iteration. I'm not convinced this would work well with only 10 chunks, but we could also take random subsets of data from within a given chunk to increase the randomness. I do think there is an appetite for randomized algorithms, and given the current code structure and how easy dask makes it to get a list of chunks, I think a simple implementation would be fairly easy to write.
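A minimal sketch of what a chunk-wise update loop could look like, using plain numpy arrays as stand-ins for materialized dask chunks. The function name `sgd_over_chunks`, the logistic-regression loss, and all parameters here are illustrative assumptions, not part of dask-glm:

```python
import numpy as np

def sgd_over_chunks(chunks, beta, lr=0.1, epochs=5, rng=None):
    """Cycle through data chunks, taking one gradient step per chunk.

    chunks: list of (X, y) pairs, e.g. materialized blocks of a dask.array.
    beta:   initial coefficient vector.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    for _ in range(epochs):
        # Visit chunks in a fresh random order each epoch.
        for i in rng.permutation(len(chunks)):
            X, y = chunks[i]
            p = 1 / (1 + np.exp(-X @ beta))   # logistic predictions on this chunk
            grad = X.T @ (p - y) / len(y)     # average gradient over the chunk
            beta = beta - lr * grad
    return beta
```

With repetitive data, each chunk's gradient approximates the full-data gradient, so the coefficients can make progress many times per pass instead of once.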

Related, what are the costs of rechunking large datasets every 10-20 iterations?

If you want to do a full shuffle then the answer is "somewhat expensive". If you want to split per-chunk then the answer is "pretty cheap".

We can trivially cycle through random chunks, though in the common case this probably violates i.i.d. assumptions. We can easily produce ten 1/10th-size datasets by splitting each chunk into 10 pieces and treating those splits as 10 dask.arrays. This has the cost of increasing the partition count by 10x, which is usually OK, though perhaps not if we're running into scheduling overheads (which we are). Full shuffles are fairly expensive, though I might be overly pessimistic here.
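The per-chunk split could be sketched like this, again with numpy arrays standing in for dask blocks. The helper `split_chunks` is a hypothetical name; in practice each of the k regrouped piece-lists would be reassembled into its own dask.array:

```python
import numpy as np

def split_chunks(chunks, k, rng=None):
    """Split each chunk into k random pieces and regroup them into
    k smaller datasets, each holding roughly 1/k of the full data.

    chunks: list of numpy arrays (stand-ins for dask.array blocks).
    Returns a list of k lists of sub-chunks.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    datasets = [[] for _ in range(k)]
    for chunk in chunks:
        # Shuffle within the chunk, then deal the pieces out round-robin.
        idx = rng.permutation(len(chunk))
        for j, part in enumerate(np.array_split(idx, k)):
            datasets[j].append(chunk[part])
    return datasets
```

This only shuffles within each chunk, which is why it is cheap: no data moves between partitions, but the partition count grows by a factor of k.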

@moody-marlin for ADMM, does it matter if the data is sorted or randomized?

I suspect that randomization matters more if you consider only batches or samples at a time.

@hussainsultan yea, randomization is best -- sorted data is probably not a good idea (for example: imagine one of the local updates trying to fit a logistic regression on data with no observed 1 events).
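The failure mode described above is easy to check for. A small sketch, with a hypothetical helper name and (X, y) chunk pairs assumed as before:

```python
import numpy as np

def chunks_are_fittable(chunks):
    """Check that every chunk contains both label classes, so a local
    logistic-regression update is well-posed on each one."""
    return all(len(np.unique(y)) == 2 for _, y in chunks)
```

On label-sorted data, contiguous chunks can easily contain only one class, while any interleaving of the classes avoids the problem: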

ADMM will still converge if the data is not randomized, but it may take a long time -- is that what you meant by "not a good idea"?