StgState allocations dominate
sgraf812 opened this issue · comments
Here's a profile of a simplified benchmark case of NoFib's bernoulli
after #8 has been fixed:
COST CENTRE MODULE SRC %time %alloc
lookupEnvSO Stg.Interpreter.Base lib/Stg/Interpreter/Base.hs:(631,1)-(649,21) 6.1 3.4
evalStackContinuation.\ Stg.Interpreter lib/Stg/Interpreter.hs:(355,74)-(394,35) 5.6 9.1
builtinStgEval Stg.Interpreter lib/Stg/Interpreter.hs:(154,1)-(201,103) 5.1 4.5
evalExpr.\ Stg.Interpreter lib/Stg/Interpreter.hs:(497,45)-(502,23) 4.8 5.3
evalExpr Stg.Interpreter lib/Stg/Interpreter.hs:(423,1)-(533,93) 3.9 1.0
compare Stg.Syntax lib/Stg/Syntax.hs:(30,3)-(32,12) 3.8 0.0
evalExpr.\ Stg.Interpreter lib/Stg/Interpreter.hs:(504,37)-(510,27) 3.0 1.9
addInterClosureCallGraphEdge.addEdge Stg.Interpreter.Base lib/Stg/Interpreter/Base.hs:820:7-127 2.5 0.8
setInsert Stg.Interpreter.Base lib/Stg/Interpreter/Base.hs:(793,1)-(795,36) 2.5 0.0
decodeStgbin' Stg.IO lib/Stg/IO.hs:52:1-22 2.5 4.6
readHeap Stg.Interpreter.Base lib/Stg/Interpreter/Base.hs:(655,1)-(660,71) 2.2 0.9
addIntraClosureCallGraphEdge.addEdge Stg.Interpreter.Base lib/Stg/Interpreter/Base.hs:831:7-127 2.1 0.8
lookupEnv Stg.Interpreter.Base lib/Stg/Interpreter/Base.hs:652:1-53 2.0 1.6
addBinderToEnv Stg.Interpreter.Base lib/Stg/Interpreter/Base.hs:621:1-49 2.0 1.6
lookup# Data.HashMap.Base Data/HashMap/Base.hs:509:1-80 1.9 0.5
compare Stg.Interpreter.Base lib/Stg/Interpreter/Base.hs:1224:17-19 1.5 0.0
matchFirstLit Stg.Interpreter lib/Stg/Interpreter.hs:(537,1)-(544,112) 1.5 3.0
== Stg.Syntax lib/Stg/Syntax.hs:75:13-14 1.4 0.0
stackPop.\ Stg.Interpreter.Base lib/Stg/Interpreter/Base.hs:560:57-166 1.3 0.7
evalStackMachine.\ Stg.Interpreter lib/Stg/Interpreter.hs:339:24-82 1.3 2.5
setProgramPoint Stg.Interpreter.Base lib/Stg/Interpreter/Base.hs:841:1-80 1.3 9.8
stackPop Stg.Interpreter.Base lib/Stg/Interpreter/Base.hs:(558,1)-(563,19) 1.2 4.9
builtinStgApply Stg.Interpreter lib/Stg/Interpreter.hs:(204,1)-(237,69) 1.1 1.2
addZippedBindersToEnv.\ Stg.Interpreter.Base lib/Stg/Interpreter/Base.hs:624:60-86 1.1 1.2
matchFirstCon Stg.Interpreter lib/Stg/Interpreter.hs:(564,1)-(569,31) 1.1 1.9
tryNextDebugCommand Stg.Interpreter.Debugger lib/Stg/Interpreter/Debugger.hs:(28,1)-(34,12) 1.0 0.4
store Stg.Interpreter.Base lib/Stg/Interpreter/Base.hs:(579,1)-(589,106) 0.9 2.6
store.\ Stg.Interpreter.Base lib/Stg/Interpreter/Base.hs:580:32-70 0.7 1.7
freshHeapAddress Stg.Interpreter.Base lib/Stg/Interpreter/Base.hs:(568,1)-(570,87) 0.7 2.4
declareBinding.\ Stg.Interpreter lib/Stg/Interpreter.hs:(579,22)-(584,58) 0.6 1.0
stackPush Stg.Interpreter.Base lib/Stg/Interpreter/Base.hs:(553,1)-(555,96) 0.5 4.6
>>=.\.\ Data.Conduit.Internal.Conduit src/Data/Conduit/Internal/Conduit.hs:152:51-68 0.5 4.0
store.\ Stg.Interpreter.Base lib/Stg/Interpreter/Base.hs:589:38-106 0.5 1.7
addIntraClosureCallGraphEdge Stg.Interpreter.Base lib/Stg/Interpreter/Base.hs:(830,1)-(838,5) 0.3 1.3
addInterClosureCallGraphEdge Stg.Interpreter.Base lib/Stg/Interpreter/Base.hs:(819,1)-(827,5) 0.3 1.3
freshHeapAddress.\ Stg.Interpreter.Base lib/Stg/Interpreter/Base.hs:570:30-87 0.2 2.1
Most of the functions there are related to stack or heap manipulation. Looking at the code and the fact that setProgramPoint (which does only one thing: modify the StgState's ssCurrentProgramPoint) contributes almost 10% of all allocations, I think the lovely simple design of a single StgState which contains the whole interpreter state in a huge immutable record might be the next bottleneck.
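To make the suspected cost concrete, here is a minimal sketch of a one-field update on an immutable record. ToyState, its fields, setToyProgramPoint and runSteps are hypothetical stand-ins, not the real StgState or setProgramPoint; the real record has far more fields, which is exactly what would make each update expensive.

```haskell
import Data.List (foldl')

-- Hypothetical stand-in for StgState; the real record is much larger.
data ToyState = ToyState
  { tsHeap         :: [Int]
  , tsStack        :: [Int]
  , tsStep         :: !Int
  , tsProgramPoint :: !Int
  }

-- Same shape as a setter that touches a single field.
-- A record update builds a fresh ToyState constructor and copies the
-- pointers of all untouched fields into it, so the allocation per call
-- grows with the number of fields, not with the size of the change.
setToyProgramPoint :: Int -> ToyState -> ToyState
setToyProgramPoint pp s = s { tsProgramPoint = pp }

-- Simulates an interpreter loop that sets the program point on every step.
runSteps :: Int -> ToyState -> ToyState
runSteps n s0 = foldl' (\s i -> setToyProgramPoint i s) s0 [1 .. n]
```

Unless the optimiser manages to eliminate the intermediate records, every iteration of runSteps allocates one new ToyState, even though only one field changes.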
Unfortunately, we don't have mutable fields (yet) in GHC Haskell. So here are other suggestions:
- Make all fields of StgState STVars or MVars. Probably the most performant option.
- Segregate StgState into two (or more) records, StgStateHot/StgStateCold. Put hot stuff like ssCurrentProgramPoint in StgStateHot. Bonus points for a record pattern synonym that keeps the old interface (but then call sites must be absolutely sure to inline away the PS).
Having pure state was the main goal and achievement of the interpreter. This will not be changed for sure because it would ruin readability and simplicity. Haskell simply needs a better compiler. IMO it is a seriously bad habit to make Haskell programs more imperative to gain performance. Instead improve the compiler.
Use staged compilation to make it faster.
https://github.com/AndrasKovacs/staged
Please customize the interpreter for your needs. The idea is that one could easily specialize and refactor the interpreter to do experiments without worrying about the code quality, focusing instead on the creative and research-domain-specific parts.
> Having pure state was the main goal and achievement of the interpreter
Yes, and I agree that's a big deal. From what I heard, implementing our instrumentation ideas on top of your work was quite a breeze.
> Instead improve the compiler.
A static analysis that reuses heap cells like that is non-trivial. I also live in the here and now, and at the moment we don't have such an analysis.
> Use staged compilation to make it faster.
I agree that might be a valuable path to explore, but it is not much of a short-term solution. It is also unclear to me whether that even optimises away all the StgState overhead.
What do you think about my second suggestion?
> Segregate StgState into two (or more) records StgStateHot/StgStateCold. Put hot stuff like ssCurrentProgramPoint in StgStateHot. Bonus points for a record pattern synonym that keeps the old interface (but then call sites must be absolutely sure to inline away the PS)
I think that will go a long way towards less copying of large StgStates and it won't impact customisability of the interpreter at all.
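A minimal sketch of what such a split could look like, with a record pattern synonym that presents the old flat view. All names here (StateHot, StateCold, SplitState, FlatState and the fs*/hot*/cold* fields) are hypothetical and the field set is reduced to four; this is not the interpreter's actual layout.

```haskell
{-# LANGUAGE PatternSynonyms #-}

-- Small record for fields that change on almost every step.
data StateHot = StateHot
  { hotProgramPoint :: !Int
  , hotStep         :: !Int
  }

-- Everything else; in a real split this would hold most of the fields.
data StateCold = StateCold
  { coldHeap  :: [Int]
  , coldStack :: [Int]
  }

data SplitState = SplitState
  { ssHot  :: !StateHot
  , ssCold :: StateCold
  }

-- Bidirectional record pattern synonym that keeps a flat, single-record view.
pattern FlatState :: Int -> Int -> [Int] -> [Int] -> SplitState
pattern FlatState { fsProgramPoint, fsStep, fsHeap, fsStack } =
  SplitState (StateHot fsProgramPoint fsStep) (StateCold fsHeap fsStack)

{-# COMPLETE FlatState #-}

-- Direct hot-path setter: rebuilds StateHot and the two-field wrapper,
-- and reuses the existing StateCold pointer unchanged.
setHotProgramPoint :: Int -> SplitState -> SplitState
setHotProgramPoint pp s = s { ssHot = (ssHot s) { hotProgramPoint = pp } }

-- Old-style update through the synonym. As written, this matches on and
-- rebuilds both halves, so it only pays off if the optimiser removes that.
setViaSynonym :: Int -> SplitState -> SplitState
setViaSynonym pp s = s { fsProgramPoint = pp }
```

The two setters give the same result; the difference is only in how much gets reallocated, which is why the caveat about inlining the pattern synonym at call sites matters.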
I do not plan to optimize the interpreter further. To me the interpreter should be a high-level specification, which it is right now, and it should not have any optimization-related noise. The reason I stick to this idea is that I plan to do experiments where I literally use the interpreter as a specification and generate code from it (i.e. a free monad based interpreter).
So I need to keep the code simple.
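For readers unfamiliar with the idea, here is a minimal sketch of how a free monad lets one description of interpreter steps be both executed and turned into code. It is only an illustration under assumptions: the Free type is hand-rolled to avoid dependencies, and Instr, setPP, push, run and emit are hypothetical names, not the planned design.

```haskell
{-# LANGUAGE DeriveFunctor #-}

-- Hand-rolled free monad, so the sketch needs no extra packages.
data Free f a = Pure a | Free (f (Free f a))

instance Functor f => Functor (Free f) where
  fmap g (Pure a) = Pure (g a)
  fmap g (Free x) = Free (fmap (fmap g) x)

instance Functor f => Applicative (Free f) where
  pure = Pure
  Pure g <*> x = fmap g x
  Free g <*> x = Free (fmap (<*> x) g)

instance Functor f => Monad (Free f) where
  Pure a >>= k = k a
  Free x >>= k = Free (fmap (>>= k) x)

-- Hypothetical instruction set; not the interpreter's real operations.
data Instr next
  = SetProgramPoint Int next
  | Push Int next
  deriving Functor

setPP, push :: Int -> Free Instr ()
setPP n = Free (SetProgramPoint n (Pure ()))
push n  = Free (Push n (Pure ()))

-- Interpretation 1: execute against a pure (program point, stack) state.
run :: Free Instr a -> (Int, [Int]) -> (a, (Int, [Int]))
run (Pure a)                     s        = (a, s)
run (Free (SetProgramPoint n k)) (_, st)  = run k (n, st)
run (Free (Push n k))            (pp, st) = run k (pp, n : st)

-- Interpretation 2: emit pseudo-code instead of executing.
emit :: Free Instr a -> [String]
emit (Pure _)                     = []
emit (Free (SetProgramPoint n k)) = ("pp = " ++ show n) : emit k
emit (Free (Push n k))            = ("push " ++ show n) : emit k

example :: Free Instr ()
example = setPP 1 >> push 10 >> setPP 2
```

Here run example (0, []) executes the steps on a pure state, while emit example walks the same value and produces a list of pseudo-instructions. Hand-written optimizations inside the step definitions would leak into every such interpretation.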
BTW you could optimize it for your custom research if you wish, just fork it. Do not look at the interpreter as a software product, so do not hesitate to do ad-hoc modifications on it, it's cheap.
So please implement your optimization ideas by yourself in your fork.
> A static analysis that reuses heap cells like that is non-trivial. I also live in the here and now, and at the moment we don't have such an analysis.
One of the GRIN Compiler's goals is to experiment with such analyses and make them real.