lmj / lparallel

Parallelism for Common Lisp

Home Page:http://lparallel.org

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Dynamic bindings for workers

phmarek opened this issue · comments

I've got a need to provide some special variables to the threads.

These bindings are not simply global items, like make-kernel :bindings allows; these depend on the specific task that's being
requested. The special variables are mostly the same every time, but
their content depends on the caller.

While I could provide that via a closure, this doesn't sound like a good idea:

  (let ((var1 *var1*))
    (lparallel:premove-if-not
      (lambda (x)
        (let ((*var1* var1))
          ....))
      sequence ...))

So I'd like to ask/discuss how to solve that.

worker-loop with the exec-task/worker function looks like a good
place to do such bindings; one open question is whether to provide a
list of specials to relay at make-kernel time, and/or whether the various
functions (eg. premove-if-not) should get an additional argument
(:relay-bindings?).

Why isn't the closure in your example a good idea? This may be generalized as dynamic-lambda, which I've considered adding to lparallel as a convenience utility (the final version would also parse declarations).

I guess you are looking to bypass the overhead of creating closures/bindings by executing tasks within the same bindings, with those bindings being determined after the kernel is created? I don't think there would be much (if any) performance benefit, and it suffers from modularity problems, but you can try

(ql:quickload :lparallel)
(defvar *foo* 0)
(setf lparallel:*kernel* (lparallel:make-kernel 4 :bindings '((*foo* . nil))))
(lparallel:broadcast-task (lambda () (setf *foo* 3)))
*foo* ;=> 0
(lparallel:pmapcar (lambda (x) (+ x *foo*)) '(1 2 3)) ;=> (4 5 6)

Any resurrected worker threads will have lost the new *foo* value, of course.

Thank you very much for the quick answer!

Why isn't the closure in your example a good idea?

Because of a) performance (imagine setting the bindings every time psort calls the comparison function!), and b) convenience - a simple (pmapcar #'1+ sequence) becomes a much more unsightly blob.

with those bindings being determined after the kernel is created

Hmmm, not quite.
To keep the psort example - the bindings should be relayed from the thread and at the time calling psort, to the worker threads outside the per-thread-psort loop (resp. sort) call, to do the bindings only once, if possible.

Imagine multiple independent threads (eg. hunchentoot calls) doing something like

  (let ((*var* (thread-specific-data, eg. (random 5))))
    (pmapcar (lambda (x) (+ *var* x)) input1))

The above should preferably "DWIM". But as pmapcar et.al. won't be rewritten as macros doing a code-walk to find specials, I'm find with declaring the specials to relay.

Either at make-kernel time, or at the actual call site (but that might be too verbose if required every time, so at least some defaults would be nice with make-kernel); but the actual values must be gathered when calling.

There are at least three ways to transfer the values of current dynamic bindings to tasks. The easiest is dynamic-lambda, and if you wish to cut down on verbosity you can write task-lambda for a specific case

(defmacro task-lambda (lambda-list &body body)
  `(dynamic-lambda ,lambda-list (*foo*)
     ,@body))

with the usage being

(defvar *foo*)
(let ((*foo* (+ 1 1 1)))
  (lparallel:pmapcar (task-lambda (x) (+ x *foo*)) '(1 2 3)))
;=> (4 5 6)

That is the most modular solution. In the unlikely scenario that the overhead of creating a dynamic binding is significant (does that really happen in practice?), you can go to the other extreme by using global variables, which are both maximally efficient and maximally unmodular.

(ql:quickload :global-vars)
(global-vars:define-global-parameter -foo- 0)
(setf -foo- (+ 1 1 1))
(lparallel:pmapcar (lambda (x) (+ x -foo-)) '(1 2 3))
;=> (4 5 6)

There is also an intermediate solution between these two extremes. You can wrap the unmodular code inside a macro:

(defmacro task-let (bindings &body body)
  (let ((vars (mapcar #'first bindings))
        (forms (mapcar #'second bindings))
        (vals (loop repeat (length bindings) collect (gensym))))
    `(let ,(mapcar #'list vals forms)
       (lparallel:broadcast-task
        (lambda ()
          ,@(mapcar (lambda (var val) `(setf ,var ,val)) vars vals)))
       ,@body)))

with the usage being

(defvar *foo*)
(task-let ((*foo* (+ 1 1 1)))
  (lparallel:pmapcar (lambda (x) (+ x *foo*)) '(1 2 3)))
;=> (4 5 6)

Is there really a case for which all three of these solutions are inadequate? You mentioned (pmapcar #'1+ ...), but that can't possibly depend upon a dynamic variable. For psort, it's hard to imagine a comparison function that would depend upon some external state, and even if that happens it's hard to imagine that the binding overhead would be significant, and even if that happens it's hard to imagine that none of these three solutions would be adequate.

I don't mean to sound hard. The way to make your case is to show a real situation where using global variables (as above) gives a meaningful improvement over task-lambda and task-let. I think what you are proposing comes down to wrapping the worker task loops inside a progv form and providing the ability to restart those task loops with new progv arguments. That lacks modularity and isn't very different from the intermediate solution above.

A code walker that tweaks tasks containing specials is an interesting idea (even if mentioned it in jest), but remember that the main thread may be expected to have different dynamic bindings than the worker threads.

Well, perhaps describing my usecase helps a bit.

I've got some data on disk; a request coming in via hunchentoot causes that data to be loaded, and then processed. To reduce the answer latency I'd like to do some processing in parallel.

Now, some of the tasks need to see the "whole data" (and some "slices") to know how to process the given job[1]; communicating these was done (in the single-threaded case) via *specials*. With lparallel this isn't that easy any more - the value of the specials depends on the hunchentoot thread (because multiple requests might come in before others have finished), and re-binding the specials in predicates, sort order functions, and/or the :KEY arguments just looks wrong.

Ad 1: Lots of logfiles. Sorting them eg. requires to know the servers' time offsets for the given point in time, so that logfiles can be associated across multiple servers. So there's a special that points to the whole dataset, and the psort predicate to sort a subset of logfile lines has to take the servers' time differences into account - which is fetched via the dataset.

Rather than having specials in predicates etc., perhaps you can just make closures and avoid the whole issue? That is, instead of

(defvar *foo* 0)
(defun predicate (x y)
  (< x (+ y *foo*)))
(lparallel:psort (vector 1 2 3) #'predicate)

you have

(defun make-predicate (foo)
  (lambda (x y)
    (< x (+ y foo))))
(lparallel:psort (vector 1 2 3) (make-predicate *foo*))

Going back to the proposed change, did I understand it correctly? You want progv around the worker loops? So when pmapcar is given this :relay-bindings flag, it restarts all the worker loops with the given bindings? If so, then isn't task-let already sufficient? If not, how would it work?

I already thought whether I can switch to closures; but it's not that easy.
Some accessors need to know about the dataset to be able to loop up data; using a special variable for that worked fine so far. Passing in additional data wouldn't make the code nicer, too - again, think of psort seq #'< :key #'accessor versus having a LAMBDA in there...

You want progv around the worker loops?

No, not around worker-loop, but in worker-loop, around exec-task/worker; IIUC psort results in a few tasks to be sent to the workers, and this would be the place so that the bindings would be done once per thread per task, right?

So when pmapcar is given this :relay-bindings flag, it restarts all the worker loops with the given bindings?

No restarting required. When a new task is created, it would need to pass the specials' symbols and their values along with the task.

If so, then isn't task-let already sufficient? If not, how would it work?

task-let has the disadvantage of doing multiple setf calls for each invocation. So using that within pmapcar would reduce performance quite a bit, and that's detrimental to the lparallel idea...

Please stand by, I'll post a POC pull request later, perhaps that shows better what I'm trying to achieve.

See phmarek@1989186 for a first draft (untested, just for discussion).

I thought the purpose here was to amortize the binding of dynamic variables: performing a binding once and then many tasks are executed under that binding. That's what task-let does. You didn't like dynamic-lambda, which creates a binding when the task executes. But your proposal does just that: it creates a dynamic binding for each task. Your aim is to improve performance, but that can only be slower because it adds overhead for task creation and execution. And progv can only be slower than let because the latter is amenable to compiler analysis.

Maybe you misunderstood task-let? It's not placed within pmapcar but around it. An arbitrary number of pmapcars and other parallel tasks may be executed inside task-let, and all tasks will see the values given in the task-let arguments. And the cost of setf is totally negligible compared to task execution.

Could we put aside this comparison between (make-key *foo*) and #'key? If you really want the latter then you'd just write (setf (fdefinition 'key) (make-key *foo*)). From my perspective, writing make-key and other functions that take specials as arguments and return closures is the clearest and most elegant approach. Not only does it bypass all these binding issues, but because it makes the dependencies explicit you'll better understand your own code six months from now.

Maybe you misunderstood task-let? It's not placed within pmapcar but around it.

Yeah, I guess that's what this discussion boils down to.

writing make-key and other functions that take specials as arguments and return closures is the clearest and most elegant approach.

Not sure about that; I'm seeing specials more as per-thread global variables, and their implicit benefit is to not have to pass them along the whole call chain.

I'll try to use task-let in my code base, and see what comes out of it.
Although, if I understand it correctly for my use case - if the worker pool is global and shared between two hunchentoot requests, tjem the task-let broadcast-task call sets some specials, but as there's no relation/locking between worker threads and the hunchentoot request thread there might be mismatches in the specials?

Well I wrote task-let not really believing that it would be needed, since the overhead of binding dynamic variables is not usually on the radar, especially for web servers. But if for whatever reason you want to avoid rebinding, then task-let would be the way to do it. The price for that is non-modularity: two task-lets running simultaneously on the same kernel would be a problem.

If you want to maintain modularity and you don't want closures-without-specials or closures-with-specials (task-lambda), there is yet another solution: use the condition system. The following effectively does what your proposed change would do.

(defvar *foo*)
(defvar *bar*)

(define-condition transfer-specials (condition) ())

(defun make-transfer-specials-handler ()
  (let ((foo *foo*)
        (bar *bar*))
    (lambda (condition)
      (declare (ignore condition))
      (setf *foo* foo)
      (setf *bar* bar))))

(setf lparallel:*kernel*
      (lparallel:make-kernel 4 :bindings '((*foo* . 3) (*bar* . 4))))

(let ((*foo* 99)
      (*bar* 100))
  (lparallel:task-handler-bind
      ((transfer-specials (make-transfer-specials-handler)))
    (lparallel:pmapcar (lambda (x)
                         (signal 'transfer-specials)
                         (+ x *foo* *bar*))
                       '(1 2 3))))
;=> (200 201 202)

It's safe because a single thread won't be executing two tasks at once. If it matters, you can save/restore the original values of the specials, leaving the worker thread the way it was. Of course a simple macro would fix the monotony in make-transfer-specials-handler.

Well, excuse me if I bother you (and/or sound dumb); perhaps for a better explanation, here's an illustration.

(defvar *data-set*)

(defclass data-set ()
  ((server-names        :type list)
   (server-time-delta   :type hash-table)
   (log-lines           :type list)))
  
(defclass log-line ()
  ((timestamp           :type fixnum)
   (server              :type string)
   (tags)
   (content)))


(lparallel:make-kernel 5 :bindings '(*data-set)
                       ; :submit-time-bindings '(*data-set*)
                       )


(defun effective-time (line)
  (- (timestamp line)
     (gethash (server line)
              (server-time-delta *data-set*))))


(hunchentoot:define-easy-handler (foo :uri "/foo") (data from to filter)
  (let ((*data-set* (load-my-data data))
        (ch (lparallel:make-channel)))
    (let ((result (lparallel:premove-if-not (compile-filter filter)
                                            (log-lines *data-set*))))
      (loop for line in (lparallel:psort result #'effective-time)
            ...HTML output ...
            ))))

How would you make that run with lparallel? With my suggested :submit-time-bindings (which includes a progv, but typically for empty lists), this above would be all that is needed.

I can see how dynamic-lambda would solve that case; still, I don't think that

      (loop for line in (lparallel:psort result 
                                         (dynamic-lambda (x) (*data-set*)
                                           (effective-time x))

looks nice. Neither does the signal handler version...

If we want to cut out the progv, we could do as well with the normal :bindings, and have each worker run some function to set the specials. Still I believe that should only be done once per task, and not on every invocation of a predicate!

If there was a way to get the calling thread, and to ask special variable bindings from a given thread, then we could have a simple function that gets called from the worker threads, once.

I guess I'm running around in circles... I'll do some benchmarks, and get back to you.

To the issue of performance, it's difficult to believe that dynamically binding a variable has any cost significance compared to what goes on in a web server or most any other application.

To the issue of looking "nice", lisp lets you abstract as far as you like. For example,

(defmacro dynamic-wrap ((fn vars) &body body)
  (let ((vals (loop repeat (length vars) collect (gensym))))
    `(let ,(mapcar #'list vals vars)
       (flet ((,fn (&rest args)
                (let ,(mapcar #'list vars vals)
                  (apply #',fn args))))
         ,@body))))

(defvar *foo*)

(defun add-foo (x)
  (+ x *foo*))

(let ((*foo* 3))
  (dynamic-wrap (add-foo (*foo*))
    (lparallel:pmapcar #'add-foo '(1 2 3))))
;=> (4 5 6)

or you could write with a functional-language accent,

(defun add-foo (foo x)
  (+ x foo))

(let ((*foo* 3))
  (lparallel:pmapcar (alexandria:curry #'add-foo *foo*) '(1 2 3)))
;=> (4 5 6)

or you could use a defun/dynamic macro that writes a make- closure for you,

(defmacro defun/dynamic (name lambda-list vars &body body)
  (let ((args (gensym)))
    `(progn
       (defun ,name ,lambda-list
         ,@body)
       (defun ,(alexandria:symbolicate '#:make- name) ()
         (dynamic-lambda (&rest ,args) ,vars
           (apply #',name ,args))))))

(defun/dynamic add-foo (x) (*foo*)
  (+ x *foo*))

(let ((*foo* 3))
  (lparallel:pmapcar (make-add-foo) '(1 2 3)))
;=> (4 5 6)

Given all the options available, including the aforementioned technique using the condition system (which wouldn't require rebinding the psort predicate, not that that matters), I don't think adding overhead to task creation and execution would be justified. I want lparallel to remain lightweight in order to cover even the hard cases (which are hard because the functions to be parallelized are cost-wise comparable to task overhead). Also, garbage collectors are sworn enemies of parallelization, and increasing the size of tasks would only make the enemy stronger.

@phmarek After further considering special-variable-heavy cases like hunchentoot, I think you've convinced me that lparallel should provide such functionality. Please see b2e81a6. Your feedback would be valuable. Because it's implemented on top of the condition system, there's no additional overhead when the feature isn't used. Synopsis:

(defpackage :example (:use :cl :lparallel))
(in-package :example)

(defvar *foo*)
(defvar *bar*)

(define-dynamic-task add-foo-bar (x)
  (+ x *foo* *bar*))

(with-dynamic-tasks ((*foo* 3) (*bar* 100))
  (pmapcar #'add-foo-bar '(1 2 3)))
;=> (104 105 106)

*dynamic-task-vars* holds a list of vars that are bound in dynamic tasks.

(defparameter *my-vars* '(*foo* *bar*))

(let ((*foo* 3)
      (*bar* 100)
      (*dynamic-task-vars* *my-vars*))
  (with-dynamic-tasks ()
    (pmapcar #'add-foo-bar '(1 2 3))))
;=> (104 105 106)

The lambda counterpart is dynamic-task.

(with-dynamic-tasks ((*foo* 3) (*bar* 100))
  (pmapcar (dynamic-task (x)
             (+ x *foo* *bar*))
           '(1 2 3)))
;=> (104 105 106)

Looks good so far...

The only gripe I have is that CALL-DYNAMIC-TASK (and with it the PROGV) seems to be called on every invocation.

With CALL-DYNAMIC-TASK traced:

(with-dynamic-tasks ((*foo* 3) (*bar* 100))
  (pmapcar (dynamic-task (x)
             (declare (ignore x))
             sb-thread:*current-thread*)
           :parts 1
           '(1 2 3)))

gives me

  0: (LPARALLEL.DYNAMIC-TASK::CALL-DYNAMIC-TASK
      #<FUNCTION (LAMBDA ()) {1006D7377B}>)
  0: LPARALLEL.DYNAMIC-TASK::CALL-DYNAMIC-TASK returned
       #<SB-THREAD:THREAD "lparallel" RUNNING {1005B15493}>
  0: (LPARALLEL.DYNAMIC-TASK::CALL-DYNAMIC-TASK
      #<FUNCTION (LAMBDA ()) {1006D7377B}>)
  0: LPARALLEL.DYNAMIC-TASK::CALL-DYNAMIC-TASK returned
       #<SB-THREAD:THREAD "lparallel" RUNNING {1005B15493}>
  0: (LPARALLEL.DYNAMIC-TASK::CALL-DYNAMIC-TASK
      #<FUNCTION (LAMBDA ()) {1006D7377B}>)
  0: LPARALLEL.DYNAMIC-TASK::CALL-DYNAMIC-TASK returned
       #<SB-THREAD:THREAD "lparallel" RUNNING {1005B15493}>
(#<SB-THREAD:THREAD "lparallel" RUNNING {1005B15493}>
 #<SB-THREAD:THREAD "lparallel" RUNNING {1005B15493}>
 #<SB-THREAD:THREAD "lparallel" RUNNING {1005B15493}>)

So it's always the same thread that runs the DYNAMIC-TASK, but nonetheless the condition is emitted, caught, handled multiple times...

And there's something I don't understand yet... perhaps I'm doing something wrong, but the inner LET doesn't work as expected.

(with-dynamic-tasks ((*foo* 2) (*bar* 100))
  (pmapcar #'add-foo-bar
             (let ((*foo* 10) (*bar* 300))
                          (pmapcar #'add-foo-bar
                                                '(1)))))
(205)

Something like this can easily happen if some function hierarchy isn't aware of other functions using LPARALLEL, too.

Yes, I know that this risks a deadlock if too few threads are available...
But in one place I'd actually like to run some calculation in one thread asynchronously, and pass the results (which are converted to HTML as they're coming in) via a queue. But the function that is called asynchronously might (and, even, should) use some more LPARALLEL threads to reduce the time to completion.

But yeah, from the API it looks fine.

The only way to avoid doing a binding inside each task is to break modularity, as in task-let. But the overhead of dynamic bindings is something you almost certainly shouldn't care about in a web server.

Dynamic tasks need to communicate with their parent thread. The endpoints of that line of communication are, on one end, with-dynamic-tasks in the parent thread and, on the other end, a signal call that define-dynamic-task (or dynamic-task) inserts into the task code. When there's a new binding in the parent thread without a corresponding with-dynamic-tasks, subsequent tasks won't see that new binding.

There is a way to make this easier on the user, but it adds a little bit of task overhead: a special variable lookup. The trade-off is that the API is much cleaner -- in fact it's just a single special variable. See 9a0aac5. Usage is:

(defvar *foo*)
(defvar *bar*)

(let ((*foo* 3)
      (*bar* 100)
      (lparallel:*transfer-specials* '(*foo* *bar*)))
  (lparallel:pmapcar (lambda (x)
                       (+ x *foo* *bar*))
                     '(1 2 3)))
;=> (104 105 106)

Unfortunately the special variable lookup is in most profiled and most scrutinized code path of the entire library, so I'd have to run some tests before being comfortable with such a change.

The only way to avoid doing a binding inside each task is to break modularity, as in task-let.

I'm well aware that it needs to be done in each task - but the current approaches do it per invocation of the function passed in.

But the overhead of dynamic bindings is something you almost certainly shouldn't care about in a web server.

Please forget about the web server - that's just an initiator.
The actual workload will then consist of pmapcar, psort, etc., across (potentially) a few million rows; so doing the LET in each predicate invocation hurts much more than once per task, as each task will then do (a few million)/(number of threads) lambda calls (for pmapcar, and some more for psort)!

So I still believe that the (PROGV vars vals (exec-task/worker ...)) would be the best-performing place.

This would also reduce the impact of the lparallel:*transfer-specials* lookup.

*transfer-specials* does just what you want: dynamic bindings inside the task are seen by whatever is called inside the task, such as a predicate for psort. It doesn't wrap the predicate and it doesn't require the user to wrap the predicate. It doesn't require dynamic-lambda or the like. It's much like your :submit-time-bindings but better because (a) it doesn't bloat the task size from a cons to a four-slot structure (the extra slots will be rarely used in general); (b) there's only a tiny (probably imperceptible) effect on code not using the feature; (c) there's no unnecessary tying of specials to kernels -- one needn't bother setting up a kernel with the right :submit-time-bindings.

But going back to dynamic-lambda, even millions of dynamic bindings is probably nothing compared to whatever you are doing with that data. Until you have a benchmark showing a significant difference between a predicate with and without dynamic-lambda, you shouldn't worry about it. If you are that focused performance then you shouldn't have a special in the predicate in the first place -- you should write closures like make-predicate above.

I've pushed a tweak to the transfer-specials branch. Previously psort needed *transfer-specials* to list itself (it's the only function in lparallel that spawns tasks recursively), but no longer. All recursive tasks will now inherit *transfer-specials*.

Sorry about the delay - I'm a bit busy right now. I haven't forgotten, and this issue is still in an open tab ;)
I'll get back to you when I've investigated my needs and your proposed solution.

Thank you very much!

If you're still following then the new feature will probably be more like the transfer-bindings branch. API is

(defvar *foo*)
(defvar *bar*)

(let ((*foo* 3)
      (*bar* 100))
  (lparallel:transfer-bindings '(*foo* *bar*)
    (lparallel:pmapcar (lambda (x)
                         (+ x *foo* *bar*))
                       '(1 2 3))))
;=> (104 105 106)

The feature should be scoped like this since it adds overhead even to tasks not using the bindings. I think it would be unexpected if setting the value of a special variable caused a global slowdown.

I also wondered if task-let and task-progv would be more appropriate, which would not bind in the current thread in the fashion of task-handler-bind. But considering that task stealing is an option (a task can run in the current thread), binding in the current thread seems like a good idea.

I'm sorry about the long delay; I kept the browser tab open all the time, but there was too much going on.

Please decide on some solution (*transfer-specials* is nicely transparent and easily understandable), and push it to the master/release branch so that it's picked up by QL soonish; I don't care about the way I can get that to work, only that I can use it.

Thank you very much for all your patience and time!

How about merging a solution to master, to get it into QL and "downstream"?

Thanks!