simonmar / monad-par

Nested scheduler bug

simonmar opened this issue · comments

The following program fails with <<loop>> and "thread blocked indefinitely in an MVar operation" exceptions with monad-par-0.3, but not with monad-par-0.1.0.3. It looks like it exercises nesting quite heavily.

{-

$ cabal install -O2 monad-par-0.3
$ ghc -O2 -threaded -rtsopts -with-rtsopts=-N turbofibpar.hs
$ ./turbofibpar 10000000
2089877
$ ./turbofibpar 10000000
turbofibpar: <<loop>>turbofibpar: turbofibpar: turbofibpar: thread blocked indefinitely in an MVar operation
<<loop>>

<<loop>>
turbofibpar: <<loop>>
$ ghc -V
The Glorious Glasgow Haskell Compilation System, version 7.4.1

-}

import Control.Monad.Par
import System.Environment (getArgs)
import Control.DeepSeq

data M = M !Integer !Integer !Integer !Integer
instance NFData M

instance Num M where
  m * n = runPar $ do
    m' <- spawn (return m)
    n' <- spawn (return n)
    m'' <- get m'
    n'' <- get n'
    return (m'' `mul` n'')

(M a b c d) `mul` (M x y z w) = M
  (a * x + b * z) (a * y + b * w)
  (c * x + d * z) (c * y + d * w)

fib :: Integer -> Integer
fib n = let M f _ _ _ = M 0 1 1 1 ^ (n + 1) in f

main :: IO ()
main = print . length . show . fib . read . head =<< getArgs

I can't reproduce this on two cores, but I can on four. I added it to the repo:

https://github.com/simonmar/monad-par/blob/master/tests/issue21.hs

Interesting that we pass the perversely nested parfib (a binary tree of runPars) in our examples but fail this "turbo" version.

Adam, it has the same failure with Control.Monad.Par.Meta.SMP!! Control.Monad.Par.Scheds.Direct, on the other hand, never implemented "Nested" support the way the Trace scheduler did, so it passes this test.

Shall we ask Daniel to take a look at his nested Trace code?

@simonmar: in the commented example runs, it looks like one run finishes while the next one fails. Is this a run with the older monad-par, or is this issue non-deterministic?

Ryan and I have a thread about other problems with nesting behavior on the meta-par side of the fence; maybe we can use this as our smoke test going forward, since regular nested parfib failed to uncover it.

It is non-deterministic. For me it fails sometimes with -N3, never with -N2.

Unfortunately I don't completely understand the nested scheduler; I think we'll probably have to ask Daniel. (And going forward we need some documentation about how the nested scheduler works - we had planned to write it up, but somehow never got around to it.)

Any movement on this? I abandoned using monad-par since our app would randomly crash in production. I would love to get back to using it.

Daniel is the only person who currently understands the nested scheduler, and he hasn't had time to look at it. Fortunately it's quite easy to work around: you can use the "direct" scheduler instead by importing Control.Monad.Par.Scheds.Direct instead of Control.Monad.Par, or you can back off to monad-par-0.1.0.3, which has the very simple trace scheduler from before we added nesting. I believe the direct scheduler is currently the fastest.
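
Roughly, the first workaround is just an import swap. A minimal sketch, assuming (as described above) that Scheds.Direct is a drop-in replacement exporting the same runPar/spawn/get names:

  import Control.Monad.Par.Scheds.Direct  -- instead of: import Control.Monad.Par

  -- The rest of the program is unchanged; only the import selects the scheduler.
  parSum :: Int
  parSum = runPar $ do
    a <- spawn (return (sum [1 .. 100 :: Int]))
    b <- spawn (return (sum [101 .. 200 :: Int]))
    x <- get a
    y <- get b
    return (x + y)

  main :: IO ()
  main = print parSum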

Thanks for pinging the thread. I will try to hack on this myself if I can make the time.

Let me also mention another workaround. Are you only using the methods of ParFuture? (And combinators built on them like parMap?) If so, it should be perfectly fine to use the Sparks scheduler, which will exhibit good "Nested" behavior.
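
For example, something in this ParFuture-only style should be fine (a sketch from memory of the monad-par-0.3 module layout: the ParFuture methods come from Control.Monad.Par.Class and runPar from the Sparks scheduler):

  import Control.Monad.Par.Class (get, spawn)
  import Control.Monad.Par.Scheds.Sparks (runPar)

  -- Only ParFuture operations are used: spawn a couple of futures, then get them.
  sumAndProduct :: (Integer, Integer)
  sumAndProduct = runPar $ do
    s <- spawn (return (sum [1 .. 10000]))
    p <- spawn (return (product [1 .. 100]))
    x <- get s
    y <- get p
    return (x, y)

  main :: IO ()
  main = print sumAndProduct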

Thanks. I will give these a shot.

I confirmed that the Sparks scheduler has no problem with the above test, as expected. Working on the Trace bug now...

Hmm. Using this as a case study for debugging MVar errors...

Simon, is there any way that I can get the threadID of the thread that is blocked indefinitely?

Of course, it would be even better to know where the code for the readMVar that blocks indefinitely actually IS. I'm afraid the stack traces produced by +RTS -xc are not helping in this case...

It's the FIRST takeMVar in runPar_internal within TraceInternal.hs where the indefinite blocking is happening.

I don't understand Daniel's idling protocol yet, but it looks like "status" is operated on only with atomicModifyIORef, EXCEPT in pushWork, where it is read with a plain readIORef.

Maybe Daniel can comment on the UID scheme. I think the idea should be that if the Par RTS is a shared resource, then we have to be careful not to introduce interdependencies between two "runPars". If one Haskell thread is running a non-terminating computation, it's not fair for a new runPar on another thread to get entangled with it and blocked on it.

Establishing this should be EASIER than with, say, Cilk, because we don't use incoming user threads (not created by us) to steal work. (Cilk allows that, but makes sure they only steal work with the same UID -- 'team' in Cilk terminology.)

The current TraceInternal has a restriction on work stealing based simply on the linear order in which runPars happened to hit the global RTS (uid >=). I'm not clear what this ordering is supposed to ensure, however...

I wrote this about a year ago, and either I lost the extra documentation, or I never wrote it in the first place. I'll try to fix that. Let me see if I can clarify a bit about how this is supposed to work.

The uid system is to ensure that threads return where they're supposed to. Basically, we want to avoid the problems that arise when a worker thread encounters a runPar in a task, blocks to work on other tasks, and then never finds its way back to the task it was working on. The uid system I'm using is rather primitive (but hopefully sound). When a thread encounters a runPar, it bumps the global uid counter, sets its own uid to match, and labels all tasks created from this runPar to the same uid. Then, we make the restriction that a given thread can only work on tasks with uid greater than or equal to its own. If none exist, then it will wait until all the work is done on the tasks of its uid (potentially being woken up to do work in the interim) and then "close out" the runPar, returning the value it produced. Thus, we don't allow the threads to work on tasks with a lower uid simply to force them to be in the right place to return from the runPar.
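
To make that restriction concrete, here is a sketch of the rule just described (the names are invented for illustration and are not the actual TraceInternal definitions):

  type UID = Int

  -- Each task is tagged with the uid of the runPar that created it; the real
  -- scheduler also carries the work itself, but only the shape matters here.
  data Task = Task { taskUid :: UID }

  -- A thread whose current uid is myUid may only run tasks with uid >= myUid.
  -- It never takes work from an older (lower-uid) runPar, so it stays free to
  -- "close out" its own runPar and return the value it produced.
  eligible :: UID -> Task -> Bool
  eligible myUid task = taskUid task >= myUid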

For the record, we keep track of the tasks on a per-thread basis using the WorkPool data type. It's basically a list where each element is a tuple of a uid, a count of how many workers are working on tasks of that uid, and the tasks at that uid. We take as an invariant that the workpools are sorted such that the head is always the highest uid.

The other thing that is interesting is the distinction between Idle and ExtIdle types. Worker threads go idle when they are waiting for tasks, but in some cases we need to wake up threads that are not workers (like some random thread that happened to call a runPar): we keep track of these as external idlers.
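
A rough model of this bookkeeping, purely for illustration (the real definitions live in TraceInternal.hs and differ in detail):

  import Control.Concurrent.MVar (MVar)

  type UID = Int

  -- Per-thread pool of tasks, grouped by the runPar (uid) that created them,
  -- together with a count of how many workers are currently busy on that uid.
  -- Invariant: sorted so that the head holds the highest uid.
  type WorkPool task = [(UID, Int, [task])]

  -- Blocked parties that may need waking: workers that ran out of eligible
  -- tasks, and "external" idlers, i.e. non-worker threads that called runPar
  -- and are parked waiting for its final result.
  data Idler
    = Idle    (MVar ())  -- signalled when new eligible work shows up
    | ExtIdle (MVar ())  -- signalled when the corresponding runPar completes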

As far as the bug is concerned, I still haven't figured it out. We know that it is the first takeMVar in runPar_internal, but I don't see why that should hang. This is the MVar that is blocking on the end result of everything, and it is being kept in AllStatus as the one and only ExtIdle value. I'm not sure why it is never woken up. Clearly, all the worker threads are getting deadlocked, but I can't see why they should.

As for getting debug information, I've been replacing the takeMVar m command with the following (requires adding Control.Exception as an import):
catch (takeMVar m) ((\_ -> doDebugStuff) :: BlockedIndefinitelyOnMVar -> IO a)
Because doDebugStuff can be anything, you can use the IO monad and look up values in IORefs and stuff.
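
A self-contained version of that idiom, with the debug action spelled out as a placeholder (statusRef and its contents are stand-ins, not the real TraceInternal state):

  import Control.Concurrent.MVar (MVar, takeMVar)
  import qualified Control.Exception as E
  import Data.IORef (IORef, readIORef)

  -- If the RTS decides this takeMVar is blocked indefinitely, dump whatever
  -- state we can reach and then rethrow so the failure stays visible.
  takeMVarOrDump :: Show s => IORef s -> MVar a -> IO a
  takeMVarOrDump statusRef m =
    takeMVar m `E.catch` \E.BlockedIndefinitelyOnMVar -> do
      st <- readIORef statusRef
      putStrLn ("blocked indefinitely; status = " ++ show st)
      E.throwIO E.BlockedIndefinitelyOnMVar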

Lastly, I don't know if this is a workaround or a clue to fixing this, but if I change the instance declaration in the failing program to:

  m * n = runPar $ do
    m'  <- spawn (return m)
    n'  <- {- spawn -} (return n)
    m'' <- get m'
    n'' <- return {- get -} n'
    return (m'' `mul` n'')

then it seems to work fine.

Just one more bit of information. I implemented a simple strategy for nested support in the Direct scheduler. You can enable it with a #define:

e490bb5

It's simple because it ONLY supports nested runPars that happen on a thread that's already a worker thread. If any user-forked threads do their own runPar calls, those will instantiate their own gang of worker threads. In general I would argue that it is necessary to keep around more than numCapabilities workers to retain the fairness expected of user-forked IO threads, but I'll get into that argument later.

Here's the funny thing -- it works for nested parfib, but gets the SAME error on issue21.hs. We now have three different implementations of nested runPar behavior (Meta.SMP, Trace, Direct) that ALL trip the same bug in the same way. There must be some significant conceptual error here!

Ok, now that I've had a chance to read Daniel's UID description in detail, let me lay out what I see as the constraints/limits on this kind of scheme. Basically I don't think it's safe to have ExtIdle threads block rather than participate in the computation. (FYI, the Cilk model is that they participate but they don't help anyone else. They only steal [back] work of their own UID.)

The way I see it, forkIO promises that the thread will be scheduled fairly vis-a-vis the other threads. (This can be betrayed by other threads with non-allocating loops, but that's a separate issue.)

Let's say there are 8 cores, and 8 corresponding runPar workers, but the user forks 100 threads. Further, 92 of the 100 threads call many short runPars in a loop, whereas 8 of them evaluate runPars that take a year to evaluate.

I argue that there is no way we can avoid the scenario where the small tasks might block for a year with only 8 workers. We need >=100 workers in that scenario to guarantee progress for all. The problem is that once a task gets into a runPar worker, it of course runs to completion without any kind of preemption. It's for parallelism, not concurrency, after all!

Anyway, this limitation was my reason for only optimizing "nested" (runPar from an existing worker thread) in Direct. However, there is probably some benefit to having external threads plug into the existing workers (and help!) to use ~107 threads in my example rather than 800.
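
To make the scenario concrete, here is a sketch of the shape I have in mind (the workloads are placeholders, not a benchmark):

  import Control.Concurrent (forkIO, threadDelay)
  import Control.Monad (forM_, forever)
  import Control.Monad.Par (get, runPar, spawn)

  -- Stands in for a runPar that takes "a year": the spawned sum never finishes.
  longPar :: Int -> Integer
  longPar n = runPar (spawn (return (sum [fromIntegral n ..])) >>= get)

  -- Stands in for a short runPar.
  shortPar :: Int -> Integer
  shortPar n = runPar (spawn (return (sum [1 .. fromIntegral n])) >>= get)

  main :: IO ()
  main = do
    -- 8 user threads whose runPars never complete...
    forM_ [1 .. 8 :: Int] $ \i -> forkIO (print (longPar i))
    -- ...and 92 user threads doing many short runPars in a loop. With only
    -- numCapabilities workers and no preemption inside a worker, these can be
    -- starved for as long as the long runPars keep the workers occupied.
    forM_ [1 .. 92 :: Int] $ \i -> forkIO (forever (print (shortPar i)))
    threadDelay (60 * 1000 * 1000)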

I think the situation we're trying to avoid is something like this:

runPar $ do
  let x = runPar $ ...
  fork (.. x ..)
  .. x ..

Now, we have an outer runPar and an inner runPar that is triggered by the reference to x in the outer runPar.
Remember that while the inner runPar is running, x is a blackhole. Now, if the worker that started the inner runPar gets blocked on its current work item and takes another item - which happens to be the fork in the outer runPar - and then tries to evaluate x, we get a deadlock. This is what Daniel's UID scheme is trying to avoid, unless I'm mistaken.
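
For concreteness, here is a rendering of that shape against the public API (the bodies are made up; what matters is only that x is shared between the fork and the outer continuation, so treat it as the pattern rather than a reliable reproducer):

  import Control.Monad.Par

  shape :: Int
  shape = runPar $ do
    -- While the inner runPar is being evaluated, x is a blackhole.
    let x = runPar (return (42 :: Int))
    r <- new
    fork (put r (x + 1))  -- the forked task demands x ...
    y <- get r
    return (x + y)        -- ... and so does the outer continuation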

Ah, yes, blackholes. Just to make sure I'm on the same wavelength, Simon: your example is a deadlock only in the one-worker case, right? (Otherwise the blocked work of the inner runPar can be picked up by another worker and continued. Blackholes are death, though, because they are invisible to our scheduler.)

From Simon's example it sounds like the intention is that a runPar instance cannot steal from its parents, correct? And the monotonically increasing UID is a conservative approximation of that (children will necessarily have a higher UID than all their parents, being evaluated later). Would it be possible, for argument's sake, to explicitly check parent-child relationships?

Btw, right now in Direct, if I turn off idling I'm actually seeing livelock rather than deadlock. (On -N3 it's using 300%, or more rarely 200%, CPU.) I've spent quite a bit of time on this bug by now and am impressed by its tenacity.

Watching, as this affects criterion.

Released version 0.3.4, which doesn't suffer from this bug.