scikit-learn / scikit-learn

scikit-learn: machine learning in Python

Home Page: https://scikit-learn.org


Implement Parallelized SGD as in NIPS 2010 paper

ogrisel opened this issue · comments

See http://cs.markusweimer.com/pub/2010/2010-NIPS.pdf

This will be "trivial" to implement efficiently once we have proper support for shared memory (both for the coef vector and the data) in joblib: joblib/joblib#44

I'll assign it to myself, but if someone else wants to step in, please feel free to add a comment here to coordinate efforts.

Is there active development on this issue?

sounds great, good to have it

All the links seem dead apart from the arxiv one. There is still no consensus on how to do this well afaik. I vote to close. @ogrisel ?

@amueller Noob question: What is the issue with the approach discussed in the first paper on this thread (from NIPS 2010: http://papers.nips.cc/paper/4006-parallelized-stochastic-gradient-descent.pdf)?

What about this paper? 'HOGWILD!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent' from NIPS 2011: http://papers.nips.cc/paper/4390-hogwild-a-lock-free-approach-to-parallelizing-stochastic-gradient-descent.pdf

Hogwild needs very low-level parallelism. We don't have that in scikit-learn yet. I guess we could implement it as a first piece of OpenMP-based parallelism?
The first is a parameter server approach, right? Not sure how much gain that gives on a single PC, but feel free to do some benchmarks ;)
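
For reference, the scheme in the first paper (SimuParallelSGD) boils down to: split the data across workers, run SGD independently on each chunk, and average the resulting weight vectors. A rough sketch with NumPy and joblib, on made-up data and a plain squared loss (not scikit-learn code):

```python
import numpy as np
from joblib import Parallel, delayed

def local_sgd(X, y, n_epochs=5, lr=0.01, seed=0):
    # independent SGD run on one worker's chunk of the data (squared loss)
    rng = np.random.RandomState(seed)
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for i in rng.permutation(X.shape[0]):
            w -= lr * (X[i] @ w - y[i]) * X[i]
    return w

rng = np.random.RandomState(0)
X = rng.randn(10_000, 20)
y = X @ rng.randn(20)

n_jobs = 4
chunks = np.array_split(np.arange(X.shape[0]), n_jobs)
# no shared state is needed, so the default process-based backend is fine
weights = Parallel(n_jobs=n_jobs)(
    delayed(local_sgd)(X[idx], y[idx], seed=j) for j, idx in enumerate(chunks)
)
w_avg = np.mean(weights, axis=0)  # final model = average of the local solutions
```

The appeal is that the workers never need to communicate until the final averaging step.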

Hogwild needs very low-level parallelism.

Are you sure that it really needs that? It seems to me that it mostly needs a common read-write buffer, but no locks or direct message passing. In which case joblib and a rw memmap pool should do it if we need multiprocessing. Maybe that would fail under Windows because of file-opening semantics. However, we might not even need multiprocessing and we might be able to get away with threading, in which case the multiple workers would simply be hitting the same array.
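
Something like the following is what I have in mind for the threading case: one shared coefficient array and several threads writing updates into it with no locking at all (Hogwild!-style). It is only a sketch; with a pure-Python inner loop the GIL prevents the updates from actually running concurrently, so any real speedup would need the update step to release the GIL (e.g. in Cython):

```python
import threading
import numpy as np

def hogwild_worker(X, y, w, n_steps, lr, seed):
    # lock-free: every thread reads and writes the shared vector w in place
    rng = np.random.RandomState(seed)
    for _ in range(n_steps):
        i = rng.randint(X.shape[0])
        w -= lr * (X[i] @ w - y[i]) * X[i]

rng = np.random.RandomState(0)
X = rng.randn(5_000, 10)
y = X @ rng.randn(10)
w = np.zeros(10)  # the common read-write buffer

threads = [threading.Thread(target=hogwild_worker, args=(X, y, w, 2_000, 0.01, t))
           for t in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```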

right, it only needs a common buffer. Not sure what I was thinking.

Could you please share a code example of how to use it?


Not available yet, but I'm working on it!

It would be super to use it.


@ogrisel @amueller @GaelVaroquaux
We tried approaches described in [1] and [2] and found that [2] worked better, but the speedup is not great. See the plot below:
[speedup plot omitted]
This was done on a PC with an Intel i7 processor. The code for this is available here (scroll down to PSGD5).

I think the issue is that the parallelization is still too high-level. We used Joblib to parallelize the call to plain_sgd under the BaseSGDRegressor class, see here. I think it needs to be parallelized in Cython here. I didn't have any success with that as I couldn't get it to compile (embarrassing) with the prange function.

If you think this is a good way to go about it, I'd like to give it another shot.

Looking forward to your feedback!

References:
[1] Zinkevich, M., Weimer, M., Li, L., & Smola, A. (2010). Parallelized stochastic gradient descent. Advances in Neural Information Processing Systems, 1–36. Retrieved from http://papers.nips.cc/paper/4006-parallelized-stochastic-gradient-descent
[2] Niu, F., Recht, B., Re, C., & Wright, S. J. (2011). HOGWILD!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent. Advances in Neural Information Processing Systems, (1), 21. Retrieved from http://arxiv.org/abs/1106.5730

@angadgill you should be able to use prange, I think. But you said you are using hogwild, right? How did you get rid of the locking? joblib is even blocking, right?

@amueller

  • I wasn't able to use prange as I couldn't compile the modified Cython file. I don't remember the error now, but I'll try it again and post the error message here.
  • It is like Hogwild in the sense that it uses shared memory: all threads update the shared weights. I think this is what is going on, based on the behavior; the joblib docs aren't clear on how shared memory works when the threading backend is used (see the minimal check sketched after this list).
  • Locking wasn't an issue because most of the computation for plain_sgd is implemented in Cython under nogil. Perhaps I'm mistaken in thinking that all threads are updating the same set of weights.
  • joblib is blocking, correct. It waits for all threads to finish processing. That is fine though.
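
For what it's worth, here is the minimal check mentioned above that the threading backend really does hand every worker the same array rather than a copy (a standalone snippet, not code from the fork):

```python
import numpy as np
from joblib import Parallel, delayed

def bump(shared, idx):
    # with backend="threading" the workers run in the same process,
    # so this writes straight into the caller's array (no pickling, no copy)
    shared[idx] += 1.0

coef = np.zeros(4)
Parallel(n_jobs=4, backend="threading")(delayed(bump)(coef, i) for i in range(4))
print(coef)  # [1. 1. 1. 1.] -- all workers mutated the same buffer
```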

Sorry can you provide a link to the code for your hogwild implementation? I'm not sure I'm looking at the right code.

I don't understand that line. All the jobs get the same data, right?

Also, it's blocking. You say that's fine, but to me that sounds like the opposite of Hogwild...

@amueller Yes, all jobs get the same data and update the same coefficients, but do fewer iterations (number of iterations per job = total iterations / number of jobs) controlled by this parameter.
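
To make that concrete, the dispatch looks roughly like this; `run_sgd` is just an illustrative stand-in for the call to plain_sgd in the fork, and the parameter names here are made up:

```python
import numpy as np
from joblib import Parallel, delayed

def run_sgd(X, y, coef, n_iter, seed):
    # stand-in for plain_sgd: shuffled passes writing into the shared coef
    rng = np.random.RandomState(seed)
    for _ in range(n_iter):
        for i in rng.permutation(X.shape[0]):
            coef -= 0.01 * (X[i] @ coef - y[i]) * X[i]

X = np.random.randn(1_000, 5)
y = X @ np.random.randn(5)
coef = np.zeros(5)

total_iter, n_jobs = 20, 4
Parallel(n_jobs=n_jobs, backend="threading")(
    # every job sees the same X, y and the same coef buffer;
    # only the iteration budget is divided up
    delayed(run_sgd)(X, y, coef, total_iter // n_jobs, seed=j)
    for j in range(n_jobs)
)
```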

By blocking I mean that Python has to wait for the Parallel function to return before it can execute the next line of code. Is this what you're talking about?

cool thanks


@angadgill but if they all get the same data they all produce the same update?

@amueller Not necessarily, since data is sampled from the shared dataset. plain_sgd shuffles the data before using it to take steps (gradient descent). Since plain_sgd is run in parallel, each thread would use different parts of the dataset to update the weights!

Do you have a code example for some dataset, with good or bad results for the different methods?


@Sandy4321 Yes! Please see this comment and this IPython notebook.

@angadgill
Thanks a lot! Looking forward to seeing this in scikit-learn ASAP.

By the way, this code is quite old now (about five years). Are there any newer repos?

Hi all, is this implemented in scikit-learn or not?

I think we can close this, as it seems not so trivial and there doesn't seem to be much need for it here. Unless @ogrisel feels differently.