riscv / riscv-bitmanip

Working draft of the proposed RISC-V Bitmanipulation extension

Home Page:https://jira.riscv.org/browse/RVG-122

differentiate MAX/MIN[U] by register arguments.

David-Horner opened this issue

Forgive me if this was addressed previously [in which case please point me to the discussion].

Just as SGE[U] is defined in terms of SLT[U] with the arguments exchanged, so too
MAX[U] can be defined as MIN[U] with the arguments exchanged, where MIN[U] requires rs1 to have a higher register number than rs2.

Similarly, MIN[U] with rs1 and rs2 the same should be reserved.
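As a rough illustration of the rule proposed above, the sketch below models how a decoder might pick the operation of a single merged opcode from the register numbers alone. The function names `decode_minmax` and `count_encodings` are hypothetical and not taken from any spec; the 5-bit register fields are the standard RISC-V ones.

```python
def decode_minmax(rs1: int, rs2: int) -> str:
    """Pick the operation of a hypothetical merged MIN/MAX opcode.

    Assumption (from the proposal in this issue): rs1 > rs2 selects MIN,
    rs1 < rs2 selects MAX, and rs1 == rs2 is reserved.
    """
    if not (0 <= rs1 < 32 and 0 <= rs2 < 32):
        raise ValueError("register numbers must be in 0..31")
    if rs1 > rs2:
        return "min"
    if rs1 < rs2:
        return "max"
    # min(x, x) == max(x, x) == x, so this encoding carries no information.
    return "reserved"


def count_encodings() -> dict:
    """Count how the 32 * 32 (rs1, rs2) pairs split between the cases."""
    counts = {"min": 0, "max": 0, "reserved": 0}
    for rs1 in range(32):
        for rs2 in range(32):
            counts[decode_minmax(rs1, rs2)] += 1
    return counts


if __name__ == "__main__":
    print(count_encodings())
```

Under these assumptions, each operation keeps every unordered pair of distinct source registers, and the 32 same-register encodings are freed. The price is that the decoder has to compare the two register fields.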

I suspect there are other BIT operators that have similar unity operations and should be reserved, and similar reflexive operations that could be implemented by numerically ordering the operands.

We rightly simplified RVI to allow some redundancies; this was and is very prudent for the base operations.
However, the designers did not make glaringly wasteful opcode map decisions such as this one for MAX[U].
This was partly because of a judicious avoidance of these more esoteric instructions.

However, as we move to "optimizations" we should also ensure optimization of the opcode space, and if that necessitates increased implementation cost, so be it. All optimizations come with a cost. We are at a place now where we can control and limit the impacts [better than waiting for unintended consequences because we failed to discriminate and limit those effects].

I will link the related code-size-reduction issue here:
Towards quantifying Optimization: guidelines and principles. - None without disruption
riscvarchive/riscv-code-size-reduction#24

@David-Horner @kdockser

Currently, with B, computing the absolute difference [1] takes:

MIN x3,x1,x2
MAX x4,x1,x2
SUB x4,x4,x3

What happens if you need to order the registers? You might need to move the data around:

MIN x3,x1,x2
MV x6,x1
MV x5,x2
MAX x4,x6,x5
SUB x4,x4,x3

Of course, you can hope the compiler will be smart enough to not use x1, so that it needs only one move:

MIN x3,x10,x11
MV x5,x11
MAX x4,x10,x5
SUB x4,x4,x3

In all cases, it makes for worse code if you need both MIN and MAX, and it puts a lot of extra burden on the register allocator. That is not a good idea from a compiler writer's point of view, nor from an implementer's point of view, as it adds another weird case to the decoding.

Please don't over-optimize the encoding to the point the ISA is no longer regular/orthogonal enough for compilers and simple code patterns.

[1] Someone cares about the 32-bit version: riscv/riscv-p-spec#38

Edit: got the order wrong in MAX (I think - hopefully I got it right now...)

I appreciate the reference to the worst case analysis. It is important to examine all cases including edge/fringe. But for this situation we do not need both MIN and MAX:

SUB x3,x1,x2
SUB x4,x2,x1
MAX x4,x3,x4

will do the required operation.

Sorry, I should have used MINU/MAXU in there to make it more obvious. Using sub/sub/maxu is not a solution for unsigned (0x1-0x0 -> 0x1, 0x0-0x1 -> 0xFFFFFFFF, maxu(0x1,0xFFFFFFFF) -> 0xFFFFFFFF, which is not the answer; 0x1 is).

There are also some interesting corner cases for signed (e.g. 0x80000000-0x0 -> 0x80000000, 0x0-0x80000000 -> 0x80000000, so max will be 0x80000000, which is negative and so not an absolute value).
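The two corner cases above can be checked with a small model of 32-bit register arithmetic. This is only a sketch: the helper names (`sub`, `minu`, `maxu`, `max_signed`, and the `absdiff_*` functions) are made up for illustration and model the instruction sequences discussed in this thread, not any particular implementation.

```python
MASK = 0xFFFFFFFF  # model 32-bit registers (RV32)


def to_signed(x: int) -> int:
    """Interpret a 32-bit pattern as a two's-complement value."""
    x &= MASK
    return x - (1 << 32) if x & 0x80000000 else x


def sub(a: int, b: int) -> int:
    """SUB: wrapping 32-bit subtraction."""
    return (a - b) & MASK


def minu(a: int, b: int) -> int:
    return min(a & MASK, b & MASK)


def maxu(a: int, b: int) -> int:
    return max(a & MASK, b & MASK)


def max_signed(a: int, b: int) -> int:
    """MAX: signed comparison, result returned as a 32-bit pattern."""
    a &= MASK
    b &= MASK
    return a if to_signed(a) >= to_signed(b) else b


def absdiff_minmax_u(a: int, b: int) -> int:
    """MINU / MAXU / SUB sequence."""
    return sub(maxu(a, b), minu(a, b))


def absdiff_subsub_u(a: int, b: int) -> int:
    """SUB / SUB / MAXU sequence (incorrect for unsigned)."""
    return maxu(sub(a, b), sub(b, a))


def absdiff_subsub_s(a: int, b: int) -> int:
    """SUB / SUB / MAX sequence (signed)."""
    return max_signed(sub(a, b), sub(b, a))


if __name__ == "__main__":
    print(hex(absdiff_minmax_u(1, 0)))           # min/max/sub on 1 and 0
    print(hex(absdiff_subsub_u(1, 0)))           # sub/sub/maxu on 1 and 0
    print(hex(absdiff_subsub_s(0x80000000, 0)))  # signed overflow case
```

For ordinary signed inputs such as 5 and 3, the sub/sub/max sequence does give 2; it is the unsigned variant and the overflow case that break.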

Maybe I'm missing something, but I understood the proposal was to merge the MIN[U] and MAX[U] opcodes and use the register ordinals to distinguish the operation.

So wouldn't

MINU x3,x1,x2
MAXU x4,x1,x2
SUB x4,x4,x3

Just need to be written as

CombinedMINUMAXUOpcode x3,x2,x1 // rs1 > rs2 -> min operation
CombinedMINUMAXUOpcode x4,x1,x2 // rs1 < rs2 -> max operation
SUB x4,x4,x3

Obviously we'd keep the MINU/MAXU mnemonics for readability. I only merged the opcode name for illustration.
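Because min and max are commutative, the rewrite above suggests an assembler could choose the operand order itself, with no extra moves. The sketch below shows one way that lowering might look. Everything in it is hypothetical: the function name `lower_minmax`, the merged mnemonic `"minmax"`, and the choice to emit a plain move when both sources are the same register (an encoding the proposal would reserve).

```python
def lower_minmax(op: str, rd: int, rs1: int, rs2: int) -> tuple:
    """Lower a MIN/MAX mnemonic onto a hypothetical merged opcode.

    Assumed rule (from this thread): in the merged encoding,
    rs1 > rs2 means min and rs1 < rs2 means max.
    """
    if op not in ("min", "max"):
        raise ValueError("op must be 'min' or 'max'")
    if rs1 == rs2:
        # min(x, x) == max(x, x) == x; the same-register encoding is
        # reserved under the proposal, so fall back to a register move.
        return ("mv", rd, rs1)
    lo, hi = sorted((rs1, rs2))
    if op == "min":
        return ("minmax", rd, hi, lo)  # higher register number first
    return ("minmax", rd, lo, hi)      # lower register number first


if __name__ == "__main__":
    print(lower_minmax("min", 3, 1, 2))
    print(lower_minmax("max", 4, 1, 2))
```

With this kind of lowering, the register allocator would not need to know about the ordering rule; only the encoder and decoder would.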

Hi Ken,

I was just trying to understand the proposal and how a compiler would implement it if the extra moves in the discussion were indeed required. I did not mean to imply that I thought the proposal should be implemented.

While all optimizations come at some cost, we also need to look at the benefit and especially the benefit/cost efficiency.
16-bit opcode space is at a premium and needs to be allocated very very carefully.
32-bit opcode space is relatively plentiful, but still needs to be allocated carefully.
The space savings that can be had from changing instruction semantics based on the ordinal values of the registers used don't appear to outweigh the cost of decode complexity. Being more selective about which instructions get added to the architecture will yield better dividends.
I am closing this issue, but we can continue to discuss it elsewhere.