scikit-learn / scikit-learn

scikit-learn: machine learning in Python

Home Page: https://scikit-learn.org


Implement Parallelized SGD as in NIPS 2010 paper

ogrisel opened this issue · comments

See http://cs.markusweimer.com/pub/2010/2010-NIPS.pdf

This will be "trivial" to implement efficiently once we have proper support for shared memory (both for the coef vector and the data) in joblib: joblib/joblib#44

I'll assign it to myself, but if someone else wants to step in, please feel free to add a comment here to coordinate efforts.

Is there active development on this issue?

sounds great, good to have it

All the links seem dead apart from the arxiv one. There is still no consensus on how to do this well afaik. I vote to close. @ogrisel ?

@amueller Noob question: What is the issue with the approach discussed in the first paper on this thread (from NIPS 2010: http://papers.nips.cc/paper/4006-parallelized-stochastic-gradient-descent.pdf)?

What about this paper? 'HOGWILD!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent' from NIPS 2011: http://papers.nips.cc/paper/4390-hogwild-a-lock-free-approach-to-parallelizing-stochastic-gradient-descent.pdf

Hogwild needs very low-level parallelism. We don't have that in scikit-learn yet. I guess we could implement it as a first piece of OpenMP-based parallelism?
The first is a parameter server approach, right? Not sure how much gain that gives on a single PC, but feel free to do some benchmarks ;)
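
For reference, the scheme in the first paper (SimuParallelSGD) boils down to: split the data across workers, run SGD independently on each chunk, and average the resulting weight vectors. A rough sketch with NumPy and joblib, on made-up data and a plain squared loss (not scikit-learn code):

```python
import numpy as np
from joblib import Parallel, delayed

def local_sgd(X, y, n_epochs=5, lr=0.01, seed=0):
    # independent SGD run on one worker's chunk of the data (squared loss)
    rng = np.random.RandomState(seed)
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for i in rng.permutation(X.shape[0]):
            w -= lr * (X[i] @ w - y[i]) * X[i]
    return w

rng = np.random.RandomState(0)
X = rng.randn(10_000, 20)
y = X @ rng.randn(20)

n_jobs = 4
chunks = np.array_split(np.arange(X.shape[0]), n_jobs)
# no shared state is needed, so the default process-based backend is fine
weights = Parallel(n_jobs=n_jobs)(
    delayed(local_sgd)(X[idx], y[idx], seed=j) for j, idx in enumerate(chunks)
)
w_avg = np.mean(weights, axis=0)  # final model = average of the local solutions
```

The appeal is that the workers never need to communicate until the final averaging step.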

Hogwild needs very low-level parallelism.

Are you sure that it really needs that? It seems to me that it mostly needs a common read-write buffer, but no locks or direct message passing. In which case joblib and a rw memmap pool should do it if we need multiprocessing. Maybe that would fail under Windows because of file-opening semantics. However, we might not even need multiprocessing and we might be able to get away with threading, in which case the multiple workers would simply be hitting the same array.
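
Something like the following is what I have in mind for the threading case: one shared coefficient array and several threads writing updates into it with no locking at all (Hogwild!-style). It is only a sketch; with a pure-Python inner loop the GIL prevents the updates from actually running concurrently, so any real speedup would need the update step to release the GIL (e.g. in Cython):

```python
import threading
import numpy as np

def hogwild_worker(X, y, w, n_steps, lr, seed):
    # lock-free: every thread reads and writes the shared vector w in place
    rng = np.random.RandomState(seed)
    for _ in range(n_steps):
        i = rng.randint(X.shape[0])
        w -= lr * (X[i] @ w - y[i]) * X[i]

rng = np.random.RandomState(0)
X = rng.randn(5_000, 10)
y = X @ rng.randn(10)
w = np.zeros(10)  # the common read-write buffer

threads = [threading.Thread(target=hogwild_worker, args=(X, y, w, 2_000, 0.01, t))
           for t in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```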

right, it only needs a common buffer. Not sure what I was thinking.

Could you please share a code example of how to use it?


Not available yet, but I'm working on it!

It would be super to use it.


@ogrisel @amueller @GaelVaroquaux
We tried approaches described in [1] and [2] and found that [2] worked better, but the speedup is not great. See the plot below:
[speedup plot omitted]
This was done on a PC with an Intel i7 processor. The code for this is available here (scroll down to PSGD5).

I think the issue is that the parallelization is still too high-level. We used Joblib to parallelize the call to plain_sgd under the BaseSGDRegressor class, see here. I think it needs to be parallelized in Cython here. I didn't have any success with that as I couldn't get it to compile (embarrassing) with the prange function.

If you think this is a good way to go about it, I'd like to give it another shot.

Looking forward to your feedback!

References:
[1] Zinkevich, M., Weimer, M., Li, L., & Smola, A. (2010). Parallelized stochastic gradient descent. Advances in Neural Information Processing Systems, 1–36. Retrieved from http://papers.nips.cc/paper/4006-parallelized-stochastic-gradient-descent
[2] Niu, F., Recht, B., Re, C., & Wright, S. J. (2011). HOGWILD!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent. Advances in Neural Information Processing Systems, (1), 21. Retrieved from http://arxiv.org/abs/1106.5730

@angadgill you should be able to use prange, I think. But you said you are using hogwild, right? How did you get rid of the locking? joblib is even blocking, right?

@amueller

  • I wasn't able to use prange as I couldn't compile the modified Cython file. I don't remember the error now, but I'll try it again and post the error message here.
  • It is like Hogwild in the sense that it uses shared memory: all threads update the shared weights. I think this is what is going on, based on the behavior; the joblib docs aren't clear on how shared memory works when the threading backend is used (see the minimal check sketched after this list).
  • Locking wasn't an issue because most of the computation for plain_sgd is implemented in Cython under nogil. Perhaps I'm mistaken in thinking that all threads are updating the same set of weights.
  • joblib is blocking, correct. It waits for all threads to finish processing. That is fine though.
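
For what it's worth, here is the minimal check mentioned above that the threading backend really does hand every worker the same array rather than a copy (a standalone snippet, not code from the fork):

```python
import numpy as np
from joblib import Parallel, delayed

def bump(shared, idx):
    # with backend="threading" the workers run in the same process,
    # so this writes straight into the caller's array (no pickling, no copy)
    shared[idx] += 1.0

coef = np.zeros(4)
Parallel(n_jobs=4, backend="threading")(delayed(bump)(coef, i) for i in range(4))
print(coef)  # [1. 1. 1. 1.] -- all workers mutated the same buffer
```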

Sorry can you provide a link to the code for your hogwild implementation? I'm not sure I'm looking at the right code.

I don't understand that line. All the jobs get the same data, right?

Also, it's blocking. You say that's fine, but to me that sounds like the opposite of Hogwild...

@amueller Yes, all jobs get the same data and update the same coefficients, but do fewer iterations (number of iterations per job = total iterations / number of jobs) controlled by this parameter.
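
To make that concrete, the dispatch looks roughly like this; `run_sgd` is just an illustrative stand-in for the call to plain_sgd in the fork, and the parameter names here are made up:

```python
import numpy as np
from joblib import Parallel, delayed

def run_sgd(X, y, coef, n_iter, seed):
    # stand-in for plain_sgd: shuffled passes writing into the shared coef
    rng = np.random.RandomState(seed)
    for _ in range(n_iter):
        for i in rng.permutation(X.shape[0]):
            coef -= 0.01 * (X[i] @ coef - y[i]) * X[i]

X = np.random.randn(1_000, 5)
y = X @ np.random.randn(5)
coef = np.zeros(5)

total_iter, n_jobs = 20, 4
Parallel(n_jobs=n_jobs, backend="threading")(
    # every job sees the same X, y and the same coef buffer;
    # only the iteration budget is divided up
    delayed(run_sgd)(X, y, coef, total_iter // n_jobs, seed=j)
    for j in range(n_jobs)
)
```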

By blocking I mean that Python has to wait for the Parallel function to return before it can execute the next line of code. Is this what you're talking about?

cool thanks


@angadgill but if they all get the same data they all produce the same update?

@amueller Not necessarily, since data is sampled from the shared dataset. plain_sgd shuffles the data before using it to take steps (gradient descent). Since plain_sgd is run in parallel, each thread would use different parts of the dataset to update the weights!

Do you have a code example for some dataset, with good or bad results for the different methods?


@Sandy4321 Yes! Please see this comment and this IPython notebook.

@angadgill
Thanks a lot! Looking forward to seeing this in scikit-learn ASAP.

By the way, this code is quite old now (about five years). Are there any newer repos?

Hi all, is this implemented in scikit-learn or not?

I think we can close this, as it seems not so trivial and there doesn't seem to be much need for it here. Unless @ogrisel feels differently.