simonmar / monad-par

thread blocked indefinitely in an MVar operation with parMap

Shimuuar opened this issue · comments

The bug was originally reported against criterion, but Till Berger created a reduced test case:

import Control.Monad.Par

test :: [Int] -> IO [Int]
test xs = do
    let list = runPar $ parMap (\x -> x + 1) xs
    putStrLn $ show list
    test list

main = do
    test [1]

If compiled with -threaded, the program fails after a few (5-30) iterations with the message "thread blocked indefinitely in an MVar operation" or, occasionally, with the message "Impossible state in globalWorkComplete". If it is compiled without threading it still fails, but it requires many more iterations (tens of thousands).

Still a concern here.

This is likely the same as issue #21. Sorry about this - we've known about the problem for a while, but unfortunately the code in question was written by Daniel Winograd-Court during his internship at Microsoft and it is a bit inscrutable.

There are workarounds:

  1. Use the direct scheduler: import Control.Monad.Par.Scheds.Direct instead of Control.Monad.Par
  2. Use monad-par-0.1.0.3 instead of 0.3
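As a rough sketch of the first workaround, the reduced test case from the top of this ticket could be switched to the direct scheduler as shown here. Two things are assumptions and not taken from this thread: that parMap is imported from Control.Monad.Par.Combinator (which may live in the monad-par-extras package, depending on the version installed), and the hypothetical iteration bound added so the program terminates.

```haskell
-- Sketch only: the original repro, using the direct scheduler.
import Control.Monad.Par.Scheds.Direct (runPar)
-- Assumption: parMap is available from this module in your installed version.
import Control.Monad.Par.Combinator (parMap)

-- One runPar call per step, as in the original test case.
step :: [Int] -> [Int]
step xs = runPar (parMap (+ 1) xs)

-- Hypothetical bound 'n' replaces the original infinite loop.
test :: Int -> [Int] -> IO ()
test 0 _ = return ()
test n xs = do
    let list = step xs
    print list
    test (n - 1) list

main :: IO ()
main = test 1000 [1]
```

Build with -threaded as before; if the scheduler is the culprit, the loop should run to completion.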

I propose to do one of the following (Ryan, please let me know your preference): either

  1. make the direct scheduler the default, for the time being, or
  2. go back to the original non-nested Trace scheduler from 0.1.0.3

It is not even necessary to use parMap; a simple spawn will do.

For instance,

import Data.List(foldl')
import qualified Control.Monad.Par as P

psum :: [Int] -> Int
psum xs = foldl' fun 0 xs
  where fun acc i = P.runPar $ (P.spawn.return $ i+acc) >>= P.get >>= return

main = do
    print $ psum [1..128]

When compiled with -threaded, this fails with "thread blocked indefinitely in an MVar operation", even with +RTS -N1.

But as @simonmar says, using Control.Monad.Par.Scheds.Direct seems to fix it.
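For reference, a minimal sketch of the same spawn example with only the import swapped to the direct scheduler. It assumes Control.Monad.Par.Scheds.Direct exports runPar, spawn and get under the same names as Control.Monad.Par; nothing else is changed apart from dropping the redundant ">>= return".

```haskell
import Data.List (foldl')
-- Only change from the failing version: the scheduler module.
import qualified Control.Monad.Par.Scheds.Direct as P

psum :: [Int] -> Int
psum xs = foldl' fun 0 xs
  where
    -- One tiny runPar per element: spawn the addition, then wait for it.
    fun acc i = P.runPar (P.spawn (return (i + acc)) >>= P.get)

main :: IO ()
main = print (psum [1 .. 128])
```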

After looking at this a bit more, I'm not sure it has anything to do with nesting. There's no nesting going on in this particular example, unlike #21. I think it's just a flat-out bug, triggered by a particular interleaving of threads while runPar is shutting down. Here is the RTS debugging output:

2b3dcd264b40: cap 0: running thread 3 (ThreadRunGHC)
2b3dcd264b40: cap 0: created thread 11
2b3dcd264b40: cap 0: thread 3 stopped (blocked on an MVar)
        thread    3 @ 0x2b3dcda05ee0 is blocked on an MVar @ 0x2b3dcda16f50 (TSO_DIRTY)
2b3dcd264b40: giving up capability 0
2b3dcd264b40: passing capability 0 to worker 0x2b3dcdf01700
2b3dcdf01700: woken up on capability 0
2b3dcdf01700: resuming capability 0
2b3dcdf01700: cap 0: running thread 11 (ThreadRunGHC)
2b3dcdf01700: cap 0: waking up thread 3 on cap 0
2b3dcdf01700: cap 0: thread 11 stopped (yielding)
2b3dcdf01700: giving up capability 0
2b3dcdf01700: passing capability 0 to bound task 0x2b3dcd264b40
2b3dcd264b40: woken up on capability 0
2b3dcd264b40: resuming capability 0
2b3dcd264b40: cap 0: running thread 3 (ThreadRunGHC)
2b3dcd264b40: cap 0: thread 3 stopped (blocked on an MVar)
        thread    3 @ 0x2b3dcda05ee0 is blocked on an MVar @ 0x2b3dcda18828 (TSO_DIRTY)
2b3dcd264b40: giving up capability 0
2b3dcd264b40: passing capability 0 to worker 0x2b3dcdf01700
2b3dcdf01700: woken up on capability 0
2b3dcdf01700: resuming capability 0
2b3dcdf01700: cap 0: running thread 11 (ThreadRunGHC)
2b3dcdf01700: cap 0: thread 11 stopped (finished)
2b3dcdf01700: giving up capability 0
2b3dcdf01700: freeing capability 0
2b3dcdd00700: returning; I want capability 0
2b3dcdd00700: resuming capability 0
2b3dcdd00700: cap 0: running thread 2 (ThreadRunGHC)
2b3dcdd00700: cap 0: thread 2 stopped (suspended while making a foreign call)
2b3dcdd00700: passing capability 0 to worker 0x2b3dcdf01700
2b3dcdf01700: woken up on capability 0
2b3dcdf01700: resuming capability 0
2b3dcdf01700: deadlocked, forcing major GC...

Thread 11 is the Par monad thread and thread 3 is the main thread. Thread 11 wakes up thread 3 and then yields (this seems to be crucial). Thread 3 then blocks again, on a different MVar, and never wakes up.

I don't understand the nested trace scheduler well enough to say why, but maybe this will help Daniel.

Any progress here?

@rrnewton is preparing a release that will have the fix (workaround actually). See #26.

Also, I backed off the trace scheduler to the non-nested version (18e1968), because the nested version has at least two separate bugs (this one and #21).

Released version 0.3.4 that doesn't suffer from this bug.

I seem to have hit this bug, or something very much like it, today when running criterion with the new haskell platform release. I'll try to see whether it's the same one or not.

@bos
@simonmar

The test case at the top of this ticket doesn't trigger the problem. I will investigate more; it might be a criterion-side problem instead.

@cartazio: this ticket is closed, we released a version of monad-par without the bug (0.3.4). Maybe you're using an older version?

@simonmar I'm on the haskell platform, the one released last week.

it might be an unrelated problem in criterion that triggers a similar error message.

The test case at the opening of the ticket doesn't seem to trigger the bug, but building my criterion test suite with -threaded triggers the error.

It might not be a monad-par bug, but if I can figure out a simple small repro, I'll share it here as well as opening a suitable criterion ticket.

(In airport, on the way back into the US.)

I'd like to take a look at this. Alas, if it is monad-par, you're hitting
it indirectly through criterion, so you can't play around with different
schedulers as easily.

But you can fairly easily play around with different monad-par versions,
which expose different schedulers. For example, install criterion along
with one of:

  • monad-par 0.1.0.3 -- Trace scheduler without nesting
  • monad-par 0.3 -- Trace scheduler + nesting (known bugs)
  • 0.3.4.2 -- Direct scheduler (with idling + parent stealing to be
    specific, but no nesting)

A reproducer would be great...

-Ryan

Actually, 0.3.4.1 was a mistake, but is perhaps another useful datapoint.
It was released in a debugging mode. (Busy waiting, no idling.)

@rrnewton here's the repro: haskell/criterion#28

I'm just using the vanilla 64-bit haskell platform released last week, on my mac.

#31 is my repro with current monad-par. I've not had the time to peel the statistics / criterion wrapper off it, but it seems related to this issue, since the only use of monad-par in the code is indirect, via parMap and runPar.