laforest / Octavo

Verilog FPGA Parts Library. Old Octavo soft-CPU project.

Home Page: http://fpgacpu.ca/


Restrict BTM to structured programming support

laforest opened this issue

Currently, we include one BTM per live branch operation, merging their results via an OR-reduction. This supports arbitrary control/data flow graphs (CDFGs) and scales well, but might be overkill.
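As a rough behavioral sketch of the current scheme (not the actual Verilog): each parallel entry outputs zeros unless it fires, so the results can be merged with a plain OR. The `BranchEntry` fields, the function names, and the assumption that at most one entry fires at a given PC are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from functools import reduce
from operator import or_


@dataclass
class BranchEntry:
    origin: int       # hypothetical: PC value at which this branch triggers
    destination: int  # hypothetical: PC value to jump to when taken


def entry_output(entry, pc, condition):
    """Return (jump, destination) for one entry; all zeros if it does not fire."""
    fires = (pc == entry.origin) and condition
    return (1, entry.destination) if fires else (0, 0)


def btm_or_reduce(entries, pc, conditions):
    """Merge all parallel entry outputs with an OR-reduction.

    Assumes at most one entry fires at a given PC, otherwise the ORed
    destinations would be meaningless.
    """
    outputs = [entry_output(e, pc, c) for e, c in zip(entries, conditions)]
    jump = reduce(or_, (j for j, _ in outputs), 0)
    destination = reduce(or_, (d for _, d in outputs), 0)
    return jump, destination


entries = [BranchEntry(origin=10, destination=3),
           BranchEntry(origin=20, destination=7)]
print(btm_or_reduce(entries, pc=20, conditions=[True, True]))
```

Each entry is independent of the others, which is what makes allocating them look like simple register allocation.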

Branches in structured programming are properly nested, thus the state of active branches follows a LIFO order.

Instead of loading separate parallel BTM entries which can cover all possible orderings of branches, we can implement a single BTM with a stack to hold entries. By pushing popped entries onto a second stack, we avoid having to reload entries in nested loops.
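A minimal sketch of the two-stack idea, assuming a fixed stack depth and treating entries as opaque values. The class and method names are made up for illustration, and raising an error on overflow is just a placeholder, not a proposed design.

```python
class StackedBTM:
    """Behavioral model: one BTM whose entries live on a LIFO stack.

    Popped entries go onto a second (spare) stack so that re-entering a
    nested loop does not require reloading its entry.
    """

    def __init__(self, depth):
        self.depth = depth  # hypothetical fixed hardware stack depth
        self.active = []    # top of stack is the innermost live branch
        self.spare = []     # entries popped from `active`, kept for reuse

    def load(self, entry):
        """Push a freshly loaded entry for a newly entered branch."""
        if len(self.active) >= self.depth:
            # Placeholder only: nesting deeper than the stack is unresolved.
            raise OverflowError("branch nesting exceeds stack depth")
        self.active.append(entry)

    def leave(self):
        """The innermost branch is done: move its entry to the spare stack."""
        entry = self.active.pop()
        self.spare.append(entry)
        return entry

    def reenter(self):
        """Re-enter the most recently left branch without reloading it."""
        entry = self.spare.pop()
        self.active.append(entry)
        return entry


btm = StackedBTM(depth=4)
btm.load("outer_loop")
btm.load("inner_loop")
btm.leave()    # inner loop falls through; its entry is kept on the spare stack
btm.reenter()  # outer loop iterates again; inner entry comes back, no reload
```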

Some thought will have to go into how to deal with branch nestings that exceed stack depth, and how to handle unstructured code.

It turns out that the branch following a taken or not-taken branch can be that same branch or any other branch, so a stack of BTM entries would need to push or pop an arbitrary number of steps depending on whether the current branch is taken.

This would require extra bits in each BTM entry, and starts to recapitulate the (discarded) original "sub-processor" approach to branches, which spends most of its time idling and replicates the state of the PC.

Handling cases where there are more branches than BTM entries resembles the stack allocation problem, which is much uglier to solve than the simple register allocation problem we have now with independent BTM entries.

Although capturing the basic pattern of properly nested loops as a stack of BTM entries would free up a lot of BTM entries, it does force a sequence onto branches which might otherwise fold together, such as when the bottom of an enclosing loop is empty, and we can thus fold its branch with that of the inner loop. Thus we'd need multiple stacks to get branch folding anyway...

However, perhaps we can place the BTM entry stacks under software control, so we can pre-load the entries for multiple program parts, as a sort of branching context, and use a single instruction (a write to H memory) to push/pop one or more stacks (as circular buffers) to change contexts at useful times! That way we get fuller use of the MLAB memories (32 depth / 8 threads = 4 entries per stack), we keep branch folding, basic operation remains unchanged, and we can cache BTM entries on the stack (so we can unload them from data memory too), rotating them out as needed.
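To make the arithmetic concrete, here is a rough sketch of one such software-controlled stack modeled as a circular buffer. The depth and thread count come from the numbers above; the class name, the `rotate` interface, and the idea that a positive step means "push" and a negative one "pop" are assumptions for illustration only.

```python
MLAB_DEPTH = 32
THREADS = 8
ENTRIES_PER_STACK = MLAB_DEPTH // THREADS  # 4 entries per thread


class BranchContextRing:
    """One thread's BTM entry stack, modeled as a circular buffer."""

    def __init__(self, entries):
        if len(entries) != ENTRIES_PER_STACK:
            raise ValueError("expected exactly %d entries" % ENTRIES_PER_STACK)
        self.entries = list(entries)  # pre-loaded branching contexts
        self.top = 0                  # index of the currently active entry

    def current(self):
        """Return the entry that is active in the current context."""
        return self.entries[self.top]

    def rotate(self, steps):
        """Move the top by `steps` (hypothetical: +N pushes, -N pops).

        Stands in for the single write to H memory that changes context.
        Wrapping means no entry is ever lost, only rotated out of view.
        """
        self.top = (self.top + steps) % ENTRIES_PER_STACK


ring = BranchContextRing(["ctx0", "ctx1", "ctx2", "ctx3"])
ring.rotate(1)   # switch to the next pre-loaded context
ring.rotate(-2)  # step back two contexts, wrapping around the buffer
```

Because the buffer wraps, entries stay cached across context switches, but only four contexts per thread fit before something has to be reloaded.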

We could also automatically "context-switch" the BTM by indexing on the PC.
However, this complicates calculating the addresses for pre-loading.

Ultimately, the problem is that these schemes break down badly when the number of live branches exceeds the capacity of the BTM. Managing the BTM gets very complex, relative to the simple register-allocation scheme we can use now.

For now, close this, and re-visit maybe after #2 gets fixed: we may simply want to convert the BTM entries into CALL/RET-controlled stacks to save branch contexts.