Pipelined IP not working as expected
TheZoq2 opened this issue
Hi. I've been playing around with Bambu to figure out what it's capable of, and I've run into something I didn't expect. I wanted to see whether it could be used to implement dynamic programming with an external Verilog module as the "kernel".
With the following C++ code:
using value_type = int;
extern "C" {
    extern value_type compute_cost(
        value_type prev,
        value_type x1,
        value_type x2,
        value_type x3,
        value_type u1,
        value_type u2
    );
}
int perform_dp_intern() {
    int values[100][100];
    for(int i = 1; i < 100; i++) {
        for(int j = 0; j < 100; j++) {
            values[i][j] = compute_cost(values[i-1][j], 0, 0, 0, 0, 0);
        }
    }
    return values[99][0];
}
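intuitively, the iterations of the inner for-loop are independent of each other and can be executed in parallel, which means that if compute_cost is pipelined and has an "initiation interval" of 1, each pass over the inner loop should be able to run in about 100 cycles, plus a few for filling and draining the pipeline.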
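As a rough sanity check of that expectation, here is a minimal sketch of the arithmetic. It assumes an idealized pipeline with a latency of 10 cycles and an initiation interval of 1 (the values declared in the IP description below), and it ignores loop control and memory-port limits. The function name pipelined_cycles is made up for illustration and is not part of Bambu.

```cpp
#include <cstdint>

// Idealized cycle count for n back-to-back calls to a pipelined unit:
// the first result is ready after `latency` cycles, and every further
// call adds `ii` cycles (the initiation interval).
// Hypothetical helper for illustration only.
constexpr std::uint64_t pipelined_cycles(std::uint64_t n,
                                         std::uint64_t latency,
                                         std::uint64_t ii) {
    return n == 0 ? 0 : latency + (n - 1) * ii;
}

// One pass over the inner loop: 100 calls, latency 10, II 1.
static_assert(pipelined_cycles(100, 10, 1) == 109,
              "100 calls need 10 + 99 * 1 cycles");
```

Under these assumptions one pass over the inner loop takes 109 cycles, so the 99 passes of the outer loop would take 99 * 109 = 10791 cycles if the passes themselves do not overlap.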
I tried adding an IP.xml file like this:
<?xml version="1.0"?>
<technology>
  <library>
    <name>STD_FU</name>
    <cell>
      <name>compute_cost</name>
      <operation operation_name="compute_cost" bounded="1" cycles="10" initiation_time="1" stage_period="1.2"/>
      <circuit>
        <component_o id="compute_cost">
          <license>PANDA_LGPLv3</license>
          <structural_type_descriptor id_type="compute_cost"/>
          <port_o id="clock" dir="IN" is_clock="1">
            <structural_type_descriptor type="BOOL" size="1"/>
          </port_o>
          <port_o id="reset" dir="IN">
            <structural_type_descriptor type="BOOL" size="1"/>
          </port_o>
          <port_o id="start_port" dir="IN">
            <structural_type_descriptor type="BOOL" size="1"/>
          </port_o>
          <port_o id="prev" dir="IN">
            <structural_type_descriptor type="INT" size="32"/>
          </port_o>
          <port_o id="x1" dir="IN">
            <structural_type_descriptor type="INT" size="32"/>
          </port_o>
          <port_o id="x2" dir="IN">
            <structural_type_descriptor type="INT" size="32"/>
          </port_o>
          <port_o id="x3" dir="IN">
            <structural_type_descriptor type="INT" size="32"/>
          </port_o>
          <port_o id="u1" dir="IN">
            <structural_type_descriptor type="INT" size="32"/>
          </port_o>
          <port_o id="u2" dir="IN">
            <structural_type_descriptor type="INT" size="32"/>
          </port_o>
          <port_o id="out1" dir="OUT">
            <structural_type_descriptor type="INT" size="32"/>
          </port_o>
          <NP_functionality LIBRARY="compute_cost " VERILOG_FILE_PROVIDED="compute_cost.v"/>
        </component_o>
      </circuit>
    </cell>
  </library>
</technology>
which, as I understand it, specifies that compute_cost takes 10 clock cycles to complete and has an II of 1 (though I might be misunderstanding the initiation_time attribute).
However, when simulating this, the whole thing takes ~100k cycles, and the start_port of the compute_cost module is only set to 1 every 10 clock cycles, as can be seen in the screenshot.
Am I misunderstanding how external IP blocks behave or did I run into a bug?
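For comparison, the observed ~100k cycles is roughly what one would expect if every call waits for the previous one to finish, which would also match start_port being asserted only once every 10 cycles. A minimal sketch of that arithmetic, assuming each call occupies the unit for the full 10 cycles and ignoring any other overhead (sequential_cycles is a hypothetical helper, not a Bambu function):

```cpp
#include <cstdint>

// Cycle count when calls are issued strictly one after another:
// every call occupies the unit for `latency` cycles before the
// next one starts. Hypothetical helper for illustration only.
constexpr std::uint64_t sequential_cycles(std::uint64_t outer_iterations,
                                          std::uint64_t inner_iterations,
                                          std::uint64_t latency) {
    return outer_iterations * inner_iterations * latency;
}

// Loop bounds from the code above: i runs over 1..99, j over 0..99,
// and each call takes 10 cycles.
static_assert(sequential_cycles(99, 100, 10) == 99000,
              "99 * 100 calls at 10 cycles each");
```

That gives 99000 cycles, which is close to the simulated figure and suggests the calls are not being overlapped at all.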
Edit: I ran Bambu using ~/panda/bin/bambu dp.cpp --print-dot IP.xml --compiler=I386_CLANG11
and my Bambu version is PandA 0.9.7-dev - Revision 151822f6eb6b28b68ef7cde4c7c3c0add185ed9d-panda-0.9.7-dev
Dear Frans,
Your syntax is completely correct, but Bambu does not currently support loop pipelining. Sorry about that.