Optimization: Special handling for common shadow stack instruction sequences
Robbepop opened this issue · comments
Wasm producers such as LLVM often generate Wasm code with a so-called shadow stack in order to handle parameters and return values that did not fit the Wasm function parameter and result passing model.
Common sequences include:
global.get 0
i32.const $n
i32.sub
local.tee $v
global.set 0
and
global.get 0
i32.const -$n
i32.add
local.tee $v
global.set 0
as well as the counterpart
local.get $v
i32.const $n
i32.add
global.set 0
Where $n
denotes a positive i32
integer literal and $v
denotes a local variable index.
Currently Wasmi (register) translates this to roughly the following bytecode:
$r <- global.get 0
$v <- i32.sub_imm $r $n
global.set 0 $v
However, with special treatment in the Wasmi translator and introduction of new Wasmi bytecode instructions we could easily translate the entire sequences above to a single instruction.
$v <- global.0.i32.sub_assign_imm $n
- We no longer state the global index (0) since Wasm producers implicitly always use the zero-th index and we can use this fact to further compress the special bytecode for these sequences.
- Another optimization that we could apply is that the constant immediate value is always divisible by 4 (since this is about pointer arithmetic on 32-bit systems). So we could store the constants as divided by 4 to increase the effective value range.
- Instead of supporting both
global.0.i32.sub_assign_imm
andglobal.0.i32.add_assign_imm
we can use the fact that sequences withi32.add
always have a negative constant immediate value and thus convert thei32.add
into ani32.sub
using a negated constant.
It is to be expected that this optimization yields good results because:
- the most costly things in an efficient interpreter such as Wasmi is instruction dispatch and this technique results in fewer instructions for the same compute effect
- past similar optimizations with op-code fusion yielded good results
- when inspecting Wasm binaries such as
spidermonkey.wat
orwestend.wat
(Polkadot runtime) we observed thousands of occurrences of the above Wasm instruction sequences.
Having #924 implemented makes this optimization simpler to implement since we no longer have to differentiate between i32.add r c
and i32.sub r c
variants of the above instruction sequences.