haskell-perf

GHC Patch: https://github.com/composewell/ghc/tree/ghc-8.10.7-eventlog-enhancements

Enable Linux perf counters

Enable unrestricted use of perf counters:

# echo -1 > /proc/sys/kernel/perf_event_paranoid

Disable CPU scaling

Set the scaling governer of all your cpus to performance:

echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
echo performance > /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor
...
...
echo performance > /sys/devices/system/cpu/cpu7/cpufreq/scaling_governor

Generating the eventlog

To generate the event log, we need to compile the program with the eventlog enabled and run the program setting the -l rts option.

There are multiple ways of doing this.

Using plain GHC:

ghc Main.hs -rtsopts -eventlog
./Main +RTS -l -RTS

Using Cabal:

The .cabal file should contain the following ghc options

ghc-options: -eventlog "-with-rtsopts=-l"

If the -threaded option is used while compiling. You may want to use the -N1 rts option.

Creating windows

Helper function to create windows:

{-# LANGUAGE BangPatterns #-}

import Control.Monad.IO.Class (MonadIO(..))
import Debug.Trace (traceEventIO)

{-# INLINE withTracingFlow #-}
withTracingFlow :: MonadIO m => String -> m a -> m a
withTracingFlow tag action = do
    liftIO $ traceEventIO ("START:" ++ tag)
    !res <- action
    liftIO $ traceEventIO ("END:" ++ tag)
    pure res

We can wrap parts of the flow we want to analyze with withTracingFlow using a tag to help us identify it.

End of Window

You can put the END of the window in different paths but ensure that all paths are covered:

  r <- f x
  case r of
    Just val -> do
      -- _ <- L.runIO $ traceEventIO $ "END:" ++ "window"
      -- Some processing
    Nothing -> do
      -- _ <- L.runIO $ traceEventIO $ "END:" ++ "window"
      -- Some processing

Measurement Overhead

Even when you are measuring an empty block of code there will be some minimum timing and allocations reported because of the measurement overhead.

    _ <- traceEventIO $ "START:emptyWindow"
    _ <- traceEventIO $ "END:emptyWindow"

The timing is due to the time measurement system call itself. The allocations are due to the traceEventIO haskell code execution. TODO: fix the allocations.

Measurement with Lazy Evaluation

If we want to measure the cost of the lookup in the code below we need to evaluate it right there:

    m <- readIORef _configCache
    return . snd $ SimpleLRU.lookup k m

For correct measurement use the following code:

    m <- readIORef _configCache
    _ <- traceEventIO $ "START:" ++ "mapLookup"
    let !v = HM.lookup k m
    _ <- traceEventIO $ "END:" ++ "mapLookup"
    return v

Labelling Threads

We should label our threads to identify the thread to scrutinize while reading the stats.

For example,

To scrutinize the main thread:

import GHC.Conc (myThreadId, labelThread)

main :: IO ()
main = do
    tid <- myThreadId
    labelThread tid "main-thread"
    withTracingFlow "main" $ do
       ...

To scrutinize the server thread in warp we can use the following middleware:

eventlogMiddleware :: Application -> Application
eventlogMiddleware app request respond = do
    tid <- myThreadId
    labelThread tid "server"
    traceEventIO ("START:server")
    app request respond1

    where

    respond1 r = do
        res <- respond r
        traceEventIO ("END:server")
        return res

We can use eventlogMiddleware as the outermost layer.

Reading the results

We get a lot of output currently. We are in the process of simplifying the statistics and making the details controllable via options.

Currently, the program prints a lot of information. It's essential to understand what to ignore given the use case.

The use-case we assume is: Understand the window CPU time and Thread allocated.

Consider the following program:

{-# LANGUAGE BangPatterns #-}

import Control.Monad (unless)
import Control.Monad.IO.Class (MonadIO(..))
import Debug.Trace (traceEventIO)
import GHC.Conc (myThreadId, labelThread)

{-# INLINE withTracingFlow #-}
withTracingFlow :: MonadIO m => String -> m a -> m a
withTracingFlow tag action = do
    liftIO $ traceEventIO ("START:" ++ tag)
    !res <- action
    liftIO $ traceEventIO ("END:" ++ tag)
    pure res

{-# INLINE printSumLoop #-}
printSumLoop :: Int -> Int -> Int -> IO ()
printSumLoop _ _ 0 = print "All Done!"
printSumLoop chunksOf from times = do
    withTracingFlow "sum" $ print $ sum [from..(from + chunksOf)]
    printSumLoop chunksOf (from + chunksOf) (times - 1)

main :: IO ()
main = do
    tid <- myThreadId
    labelThread tid "main-thread"
    withTracingFlow "main" $ do
         printSumLoop 10000 1 100

The statics gleaned from the eventlog of the above program will look like the following:

--------------------------------------------------
Summary Stats
--------------------------------------------------

Global thread wise stat summary
tid       label samples ThreadCPUTime ThreadAllocated
--- ----------- ------- ------------- ---------------
  1 main-thread       2       967,479         434,384
  2           -       1         5,854          17,664

  -           -       3       973,333         452,048


Window [1:main] thread wise stat summary
ProcessCPUTime: 1,174,455
ProcessUserCPUTime: 0
ProcessSystemCPUTime: 1,175,000

ThreadCPUTime:934,898
GcCPUTime:0
RtsCPUTime:239,557
tid       label samples ThreadCPUTime ThreadAllocated
--- ----------- ------- ------------- ---------------
  1 main-thread       1       934,898         429,952

  -           -       1       934,898         429,952


Window [1:sum] thread wise stat summary
ProcessCPUTime: 953,862
ProcessUserCPUTime: 0
ProcessSystemCPUTime: 949,000

ThreadCPUTime:833,991
GcCPUTime:0
RtsCPUTime:119,871
tid       label samples ThreadCPUTime ThreadAllocated
--- ----------- ------- ------------- ---------------
  1 main-thread     100       833,991         328,224

  -           -     100       833,991         328,224


--------------------------------------------------
Detailed Stats
--------------------------------------------------

Window [1:main] thread wise stats for [ThreadCPUTime]
tid       label   total count     avg minimum maximum stddev
--- ----------- ------- ----- ------- ------- ------- ------
  1 main-thread 934,898     1 934,898 934,898 934,898      0


Grand total: 934,898

Window [1:main] thread wise stats for [ThreadAllocated]
tid       label   total count     avg minimum maximum stddev
--- ----------- ------- ----- ------- ------- ------- ------
  1 main-thread 429,952     1 429,952 429,952 429,952      0


Grand total: 429,952

Window [1:sum] thread wise stats for [ThreadCPUTime]
tid       label   total count   avg minimum maximum stddev
--- ----------- ------- ----- ----- ------- ------- ------
  1 main-thread 833,991   100 8,340   5,533  63,493  5,714


Grand total: 833,991

Window [1:sum] thread wise stats for [ThreadAllocated]
tid       label   total count   avg minimum maximum stddev
--- ----------- ------- ----- ----- ------- ------- ------
  1 main-thread 328,224   100 3,282   2,960  31,584  2,844


Grand total: 328,224

Global thread wise stats for [ThreadCPUTime]
tid       label   total count     avg minimum maximum  stddev
--- ----------- ------- ----- ------- ------- ------- -------
  1 main-thread 967,479     2 483,740  33,519 933,960 450,220
  2           -   5,854     1   5,854   5,854   5,854       0


Grand total: 973,333

Global thread wise stats for [ThreadAllocated]
tid       label   total count     avg minimum maximum  stddev
--- ----------- ------- ----- ------- ------- ------- -------
  1 main-thread 434,384     2 217,192   4,920 429,464 212,272
  2           -  17,664     1  17,664  17,664  17,664       0


Grand total: 452,048

From the Global thread wise stat summary under Summary Stats figure out the thread id we want to scrutinize. In this case, we care about the main-thread. The thread id is 1.

We can skip to the Detailed Stats section.

We want to look at all the windows we want to scrutinize that run in the main-thread. The windows in the above program are main and sum. The thread id is prepended to the windows. So we want to look at sections corresponding to [1:main] and [1:sum].

That is,

Window [1:main] thread wise stats for [ThreadCPUTime]
tid       label   total count     avg minimum maximum stddev
--- ----------- ------- ----- ------- ------- ------- ------
  1 main-thread 934,898     1 934,898 934,898 934,898      0


Grand total: 934,898

Window [1:main] thread wise stats for [ThreadAllocated]
tid       label   total count     avg minimum maximum stddev
--- ----------- ------- ----- ------- ------- ------- ------
  1 main-thread 429,952     1 429,952 429,952 429,952      0


Grand total: 429,952

Window [1:sum] thread wise stats for [ThreadCPUTime]
tid       label   total count   avg minimum maximum stddev
--- ----------- ------- ----- ----- ------- ------- ------
  1 main-thread 833,991   100 8,340   5,533  63,493  5,714


Grand total: 833,991

Window [1:sum] thread wise stats for [ThreadAllocated]
tid       label   total count   avg minimum maximum stddev
--- ----------- ------- ----- ----- ------- ------- ------
  1 main-thread 328,224   100 3,282   2,960  31,584  2,844

Consider one specific section,

Window [1:sum] thread wise stats for [ThreadCPUTime]
tid       label   total count   avg minimum maximum stddev
--- ----------- ------- ----- ----- ------- ------- ------
  1 main-thread 833,991   100 8,340   5,533  63,493  5,714

This section is a table. It has 8 columns. It can have multiple rows. We should only scrutinize the row where the tid matches main-thread. ie. tid == 1.

The granularity of ThreadCPUTime is in nanoseconds and ThreadAllocated is in bytes.

Columns:

tid: The thread id
label: The thread label
total: The total accumulated sum of all the samples
count: Number of samples or the times this window is seen
avg: The average size of the samples
minimum: The minimum of all the samples
maximum: The maximum of all the samples
stddev: The standard deviation of the samples

NOTE: It is important to look at stddev. If stddev is more than 30% of the average and if the difference between the minimum and maximum is too much, the average might have unecessary outliers. In the future we would like to remove outliers automatically.

composewell / haskell-perf