Optimization: Special handling for common shadow stack instruction sequences

Question

Optimization: Special handling for common shadow stack instruction sequences

Robbepop opened this issue 8 months ago · comments

Wasm producers such as LLVM often generate Wasm code with a so-called shadow stack in order to handle parameters and return values that did not fit the Wasm function parameter and result passing model.

Common sequences include:

global.get 0
i32.const $n
i32.sub
local.tee $v
global.set 0

and

global.get 0
i32.const -$n
i32.add
local.tee $v
global.set 0

as well as the counterpart

local.get $v
i32.const $n
i32.add
global.set 0

Where $n denotes a positive i32 integer literal and $v denotes a local variable index.
Currently Wasmi (register) translates this to roughly the following bytecode:

$r <- global.get 0
$v <- i32.sub_imm $r $n
global.set 0 $v

However, with special treatment in the Wasmi translator and introduction of new Wasmi bytecode instructions we could easily translate the entire sequences above to a single instruction.

$v <- global.0.i32.sub_assign_imm $n

We no longer state the global index (0) since Wasm producers implicitly always use the zero-th index and we can use this fact to further compress the special bytecode for these sequences.
Another optimization that we could apply is that the constant immediate value is always divisible by 4 (since this is about pointer arithmetic on 32-bit systems). So we could store the constants as divided by 4 to increase the effective value range.
Instead of supporting both global.0.i32.sub_assign_imm and global.0.i32.add_assign_imm we can use the fact that sequences with i32.add always have a negative constant immediate value and thus convert the i32.add into an i32.sub using a negated constant.

It is to be expected that this optimization yields good results because:

the most costly things in an efficient interpreter such as Wasmi is instruction dispatch and this technique results in fewer instructions for the same compute effect
past similar optimizations with op-code fusion yielded good results
when inspecting Wasm binaries such as spidermonkey.wat or westend.wat (Polkadot runtime) we observed thousands of occurrences of the above Wasm instruction sequences.

Robin Freyler · Answer 1 · Sun Feb 04 2024 17:37:30 GMT+0800 (China Standard Time)

Having #924 implemented makes this optimization simpler to implement since we no longer have to differentiate between i32.add r c and i32.sub r c variants of the above instruction sequences.