simonmar / monad-par

Trace scheduler works as a stack, not a queue.

SkorikGG opened this issue · comments

For example:

...
runPar $ do
    ...
    fork f1 >> fork f2 >> fork f3 >> fork f4 >> fork f5 >> fork f6 >> fork f7 >> fork f8
    fork f9 >> fork f10 >> fork f11 >> fork f12 >> fork f13 >> fork f14 >> fork f15 >> fork f16
    ...

On a quad-core processor, the forks will execute in the order f1, f2, f3, f4, f16, f15, f14, ..., f5.
Thus f5 only gets its turn after all the remaining forks have been processed.
This behavior can degrade performance and reduce parallelism because of the long wait for f5's result.
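
A toy model may make the ordering concrete (this is a sketch of the behavior described above, not monad-par's actual scheduler code; it also ignores the concurrent steals that let f1..f4 start while the parent is still forking):

import Data.List (foldl')

-- fork adds its child to the pool; idle workers take from the front.
pushStack, pushQueue :: [a] -> a -> [a]
pushStack pool t = t : pool      -- LIFO: newest fork runs first
pushQueue pool t = pool ++ [t]   -- FIFO: oldest fork runs first

-- Push all forks, then read the pool in pop-from-front order.
drainOrder :: ([a] -> a -> [a]) -> [a] -> [a]
drainOrder push = foldl' push []

main :: IO ()
main = do
    let forks = ["f" ++ show i | i <- [1 .. 16 :: Int]]
    print (drainOrder pushStack forks)  -- ["f16","f15",...,"f1"]
    print (drainOrder pushQueue forks)  -- ["f1","f2",...,"f16"]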

I've opened pull request #71.

Could you explain why this "degrades performance"? In your example it doesn't matter which order we execute things in, so long as we keep all the cores busy.

A performance drop of a few percent can be seen in this example:

{-# LANGUAGE BangPatterns #-}

import Control.Monad.Par
import System.Clock

data IVList a = Nil | Cons a (IVar (IVList a))
type Stream a = IVar (IVList a)

-- Fold the results of mapping g over the list, keeping at most n
-- outstanding forked computations in flight.
parFoldMapBuffer :: Int -> (c -> b -> Par c) -> (a -> Par b) -> c -> [a] -> Par c
parFoldMapBuffer n f g z0 xs0 = do
    sr <- new
    mswxs <- start n sr xs0          -- prefill the buffer with n forks
    folder sr z0 mswxs
  where
    -- Fork up to m computations, building the stream as we go.
    start _ sw [] = put_ sw Nil >> return Nothing
    start 0 sw xs = return $ Just (sw, xs)
    start !m sw (x:xs) = do
        sw'  <- new
        ivy  <- new
        put_ sw (Cons ivy sw')
        fork $ g x >>= put_ ivy
        start (m-1) sw' xs
    -- Consume results in order; each consumed result lets one more
    -- computation be forked, keeping the buffer full.
    folder sr !z mswxs = do
        se <- get sr
        case se of
            Nil -> return z
            Cons ivy sr' -> do
                y  <- get ivy
                z' <- f z y
                mswxs' <- case mswxs of
                    Nothing         -> return Nothing
                    Just (sw, [])   -> put_ sw Nil >> return Nothing
                    Just (sw, x:xs) -> do
                        sw'  <- new
                        ivy' <- new
                        put_ sw (Cons ivy' sw')
                        fork $ g x >>= put_ ivy'
                        return $ Just (sw', xs)
                folder sr' z' mswxs'

fn :: Int -> Double -> Double
fn 0 !x = x
fn !n !x = fn (n-1) $ sin $ 3 * x

g :: Double -> Double
g x = fn 10000 x

test :: Int -> Int -> Double
test m n = runPar $ parFoldMapBuffer 128 (\x y -> return $ fn 2500 (x+y)) (return . g . fromIntegral) 0.0 [m..n]

main :: IO ()
main = do
    t1 <- getTime Realtime
    print $ test 0 50000
    t2 <- getTime Realtime
    print $ fromIntegral (sec t2 - sec t1) + 1e-9 * fromIntegral (nsec t2 - nsec t1)
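
To reproduce the eventlog below, the program above presumably needs the clock package and a build with the threaded RTS and eventlog support, something like ghc -O2 -threaded -eventlog -rtsopts Main.hs, then a run with ./Main +RTS -N4 -l.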

Some uneven activity across the processor cores is noticeable in the eventlog.
Eventlog.zip

In addition, the new commit in pull request #71 makes the scheduler's behavior match what one would intuitively expect. With this change, the following function works properly (with full parallelism):

{-# LANGUAGE BangPatterns #-}

import Control.Monad.Par

data IVList a = Nil | Cons a (IVar (IVList a))
type Stream a = IVar (IVList a)

-- Unbuffered variant: the producer forks one computation per element
-- as fast as it can, while the consumer folds the results in order.
parFoldMap :: (c -> b -> Par c) -> (a -> Par b) -> c -> [a] -> Par c
parFoldMap f g z0 xs0 = do
    sr <- new
    fork $ producer sr xs0
    consumer sr z0
  where
    producer sw [] = put_ sw Nil
    producer sw (x:xs) = do
        sw' <- new
        ivy <- new
        put_ sw (Cons ivy sw')
        fork $ g x >>= put_ ivy
        producer sw' xs
    consumer sr !z = do
        se <- get sr
        case se of
            Nil -> return z
            Cons ivy sr' -> do
                y  <- get ivy
                z' <- f z y
                consumer sr' z'
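
A hypothetical driver for this variant, mirroring the earlier test (the name testUnbuffered is mine; fn and g are reused from the first program). With queue order the producer's forks run in the order they were created, so the consumer's get always waits on the oldest fork, which is also the first one to execute:

testUnbuffered :: Int -> Int -> Double
testUnbuffered m n =
    runPar $ parFoldMap (\x y -> return $ fn 2500 (x + y))
                        (return . g . fromIntegral)
                        0.0 [m .. n]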

For comparison, the same pattern implemented with the spark pool:

import Control.Parallel (par, pseq)
import Data.List (foldl')

parFoldMap2 :: (c -> b -> c) -> (a -> b) -> c -> [a] -> c
parFoldMap2 f g z xs = ys `pseq` foldl' f z ys
  where
    ys  = start ys' ys'
    ys' = map g xs
    -- Spark every element, then hand back the full list for folding.
    start []     rest = rest
    start (y:ys) rest = y `par` start ys rest
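
A hypothetical pure driver for the spark version, again reusing fn and g (the name test2 is mine); it needs the threaded RTS, e.g. +RTS -N4, for the sparks to actually be converted:

test2 :: Int -> Int -> Double
test2 m n = parFoldMap2 (\x y -> fn 2500 (x + y)) (g . fromIntegral)
                        0.0 [m .. n]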

It's been a while since I looked at monad-par in detail, so apologies if I'm being a bit slow here. Could you explain why we get more parallelism with your change?

My intuition is that there isn't any clearly "better" scheduling strategy without knowing something about the workload, and the scheduler can't have any prior knowledge of that. So demonstrating an improvement on one example isn't sufficient evidence to change the scheduling strategy; there will be other examples that get slower. You have to make the case either that

  • the examples that benefit from this strategy are more common, or
  • you've somehow made a consistent improvement by fixing bugs or inefficiencies in the original implementation.
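
For what it's worth, the classic argument for stack-like (depth-first) pools comes from divide-and-conquer workloads, where every forked task forks further tasks: running the newest fork first keeps the set of live tasks small, while strict queue order expands the call tree breadth-first. A sketch of such a workload, the well-known parfib example (not from this thread):

import Control.Monad.Par

-- Each call forks a subtree, so the scheduler's pop order decides
-- whether evaluation proceeds depth-first (stack-like) or
-- breadth-first (queue-like) through the call tree.
parfib :: Int -> Par Int
parfib n
  | n < 2     = return n
  | otherwise = do
      xf <- spawn_ (parfib (n - 1))  -- fork the larger subproblem
      y  <- parfib (n - 2)           -- keep working in the parent
      x  <- get xf
      return (x + y)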