noloader / SHA-Intrinsics

SHA-1, SHA-256 and SHA-512 compression functions using Intel, ARMv8 and Power8 SHA intrinsics

POWER8 SHA Vector operations

munroesj52 opened this issue · comments

Your performance problems may be related to load/store and not the crypto operations. The SHA ops are all listed in the User Manual as 2 cycles and AES as 6-7 cycles.

Make sure your compiler is actually inlining the ops. For GCC use __attribute__((flatten)). You may also need to unroll your loops a bit; for GCC use __attribute__((optimize("unroll-loops"))).
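As a rough illustration of the two GCC attributes mentioned above, here is a minimal sketch. The names mix_round, process_words and the SKETCH_* macros are hypothetical and not taken from cryptopp or SHA-Intrinsics; the macros expand to nothing on compilers other than GCC. Note that the GCC manual describes the optimize attribute as intended for debugging rather than production use, so treat this as an experiment to confirm whether unrolling helps.

```cpp
#include <cstddef>
#include <cstdint>

// Assumption: only GCC proper is given the attributes; other compilers
// get empty macros so the sketch still builds.
#if defined(__GNUC__) && !defined(__clang__)
# define SKETCH_FLATTEN __attribute__((flatten))
# define SKETCH_UNROLL  __attribute__((optimize("unroll-loops")))
#else
# define SKETCH_FLATTEN
# define SKETCH_UNROLL
#endif

// Hypothetical stand-in for one round of a block transform:
// XOR in a key word, then rotate left by 5 bits.
static inline std::uint32_t mix_round(std::uint32_t x, std::uint32_t k)
{
    x ^= k;
    return (x << 5) | (x >> 27);
}

// flatten asks GCC to inline, where possible, every call made inside this
// function body; optimize("unroll-loops") requests loop unrolling for this
// function only, without changing flags for the rest of the file.
SKETCH_FLATTEN SKETCH_UNROLL
std::uint32_t process_words(const std::uint32_t* keys, std::size_t rounds,
        std::uint32_t x)
{
    for (std::size_t i = 0; i < rounds; ++i)
        x = mix_round(x, keys[i]);
    return x;
}
```

Checking the generated assembly (for example with objdump -d) is the reliable way to see whether the calls were actually inlined and the loop unrolled.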

If you are still disappointed, you can use performance tools such as the performance simulator and PipeStat to find the bottlenecks.

You have questions or is this just a ping?

Ok this looks bad?

size_t Rijndael_Enc_AdvancedProcessBlocks128_6x1_ALTIVEC(const word32 *subKeys, size_t rounds,
            const byte *inBlocks, const byte *xorBlocks, byte *outBlocks, size_t length, word32 flags)
{
    return AdvancedProcessBlocks128_6x1_ALTIVEC(POWER8_Enc_Block, POWER8_Enc_6_Blocks,
        subKeys, rounds, inBlocks, xorBlocks, outBlocks, length, flags);
}

etc passing pointers to thunks which are then called from:

template <typename F1, typename F6, typename W>
inline size_t AdvancedProcessBlocks128_6x1_ALTIVEC(F1 func1, F6 func6,
        const W *subKeys, size_t rounds, const byte *inBlocks,
        const byte *xorBlocks, byte *outBlocks, size_t length, word32 flags)

as
func1(block, subKeys, rounds);
and
func6(block0, block1, block2, block3, block4, block5, subKeys, rounds);

These should be inlined into AdvancedProcessBlocks128_6x1_ALTIVEC_PWR8 so that the compiler and the POWER8 super-scalar, out-of-order processor can effectively queue up the data and keep its pipelines (there are 16 of them) filled.
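One way to make that inlining easier for the compiler, sketched here under assumptions, is to pass the block function as a stateless function object instead of a function pointer, so the call target is fixed by the template type. EncBlockSketch, process_blocks_sketch and encrypt_blocks_sketch are hypothetical names, and the arithmetic is a placeholder rather than real AES; this is not how cryptopp is written.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical single-block transform standing in for POWER8_Enc_Block.
// The arithmetic is a placeholder, not a cipher.
struct EncBlockSketch
{
    inline void operator()(std::uint32_t& block, const std::uint32_t* subKeys,
            std::size_t rounds) const
    {
        for (std::size_t i = 0; i < rounds; ++i)
            block = (block ^ subKeys[i]) * 0x9E3779B1u;
    }
};

// Because F1 is a class type with an inline operator(), the call below
// resolves to one known function at compile time; no function pointer has
// to be tracked before the body can be inlined into the loop.
template <typename F1>
inline std::size_t process_blocks_sketch(F1 func1, const std::uint32_t* subKeys,
        std::size_t rounds, std::uint32_t* blocks, std::size_t count)
{
    for (std::size_t i = 0; i < count; ++i)
        func1(blocks[i], subKeys, rounds);
    return 0; // nothing is left unprocessed in this sketch
}

// Non-template entry point, analogous in shape to the exported wrapper.
std::size_t encrypt_blocks_sketch(const std::uint32_t* subKeys, std::size_t rounds,
        std::uint32_t* blocks, std::size_t count)
{
    return process_blocks_sketch(EncBlockSketch(), subKeys, rounds, blocks, count);
}
```

A compiler may still inline through a function pointer when it can prove the target, so this is a hint rather than a guarantee.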

And add __attribute__((flatten)) and __attribute__((optimize("unroll-loops"))) on top of that!

Compiled cryptopp for POWER8 (Ubuntu 18.04) and profiled (perf record) cryptest b.

Rijndael_Enc_AdvancedProcessBlocks128_6x1_ALTIVEC is 2nd in the list at 4.36% (Baseline_Multiply16 is #1 at 5.74%).

The vcipher/vcipherlast instructions barely register at ~9.5% (of the 4.36%). The rest is data fumbling (load/store/permute), plus a lot of branchy code dealing with data alignment. A place to start is to pass the parms in registers (the ABI allows up to 12 vector reg parms, including small arrays and structs up to 8 registers each) and move the data handling into the driver loop.
This might allow some loop-unrolling and load look-ahead in the driver functions.

Also looked at Baseline_Multiply16. It's not the multiplies. The sums are taking the time, and there is only one carry bit in the XER. POWER9 adds a second carry, but it's a bit awkward to use (it has to be cleared before use).
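To show why the sums can dominate, here is a generic multi-word addition sketch. It is not the code of Baseline_Multiply16; add_words_sketch is a hypothetical name. Each word's result depends on the carry out of the previous word, so the additions form one serial chain no matter how many execution pipelines are available.

```cpp
#include <cstddef>
#include <cstdint>

// Adds two n-word integers stored least-significant word first.
// Writes the n-word sum and returns the final carry-out (0 or 1).
std::uint64_t add_words_sketch(std::uint64_t* sum, const std::uint64_t* a,
        const std::uint64_t* b, std::size_t n)
{
    std::uint64_t carry = 0;
    for (std::size_t i = 0; i < n; ++i)
    {
        // Add the incoming carry first; wraparound means t < carry.
        const std::uint64_t t = a[i] + carry;
        const std::uint64_t c1 = (t < carry) ? 1u : 0u;

        // Then add the second operand; wraparound means s < t.
        const std::uint64_t s = t + b[i];
        const std::uint64_t c2 = (s < t) ? 1u : 0u;

        sum[i] = s;
        // At most one of c1 and c2 can be set, so OR is sufficient.
        carry = c1 | c2;
    }
    return carry;
}
```

With a single hardware carry bit, two independent chains like this cannot be interleaved without saving and restoring the carry, which is one plausible reading of the bottleneck described above.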

Take a look at PVECLIB, especially the quadword multiplies and the multiple quadword precision multiplies:
vec_muludq
and
vec_mul512x512