ferrandi / PandA-bambu

PandA-bambu public repository

Pipelined IP not working as expected

TheZoq2 opened this issue

Hi. I've been playing around with bambu trying to figure out what it's capable of and have run into something I didn't expect. I wanted to see if it could be used to implement dynamic programming using an external Verilog module as the "kernel".

Consider the following C++ code:

using value_type = int;
extern "C" {
extern value_type compute_cost(
        value_type prev,
        value_type x1,
        value_type x2,
        value_type x3,
        value_type u1,
        value_type u2
    );
}

int perform_dp_intern() {
    int values[100][100];

    for(int i = 1; i < 100; i++) {
        for(int j = 0; j < 100; j++) {
            values[i][j] = compute_cost(values[i-1][j], 0, 0, 0, 0, 0);
        }
    }

    return values[99][0];
}

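Intuitively, the iterations of the inner for-loop are independent of each other and can be executed in parallel, which means that if compute_cost is pipelined and has an "initiation interval" of 1, each pass over the inner loop should be able to run in 100 cycles plus a bit for filling and draining the pipeline.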

I tried adding an IP.xml file like this:

<?xml version="1.0"?>
<technology>
  <library>
    <name>STD_FU</name>
    <cell>
      <name>compute_cost</name>
      <operation operation_name="compute_cost"  bounded="1" cycles="10" initiation_time="1" stage_period="1.2"/>
      <circuit>
        <component_o id="compute_cost">
          <license>PANDA_LGPLv3</license>
          <structural_type_descriptor id_type="compute_cost"/>
          <port_o id="clock" dir="IN" is_clock="1">
            <structural_type_descriptor type="BOOL" size="1"/>
          </port_o>
          <port_o id="reset" dir="IN">
            <structural_type_descriptor type="BOOL" size="1"/>
          </port_o>
          <port_o id="start_port" dir="IN">
            <structural_type_descriptor type="BOOL" size="1"/>
          </port_o>
          <port_o id="prev" dir="IN">
            <structural_type_descriptor type="INT" size="32"/>
          </port_o>
          <port_o id="x1" dir="IN">
            <structural_type_descriptor type="INT" size="32"/>
          </port_o>
          <port_o id="x2" dir="IN">
            <structural_type_descriptor type="INT" size="32"/>
          </port_o>
          <port_o id="x3" dir="IN">
            <structural_type_descriptor type="INT" size="32"/>
          </port_o>
          <port_o id="u1" dir="IN">
            <structural_type_descriptor type="INT" size="32"/>
          </port_o>
          <port_o id="u2" dir="IN">
            <structural_type_descriptor type="INT" size="32"/>
          </port_o>
          <port_o id="out1" dir="OUT">
            <structural_type_descriptor type="INT" size="32" />
          </port_o>
          <NP_functionality LIBRARY="compute_cost " VERILOG_FILE_PROVIDED="compute_cost.v"/>
        </component_o>
      </circuit>
    </cell>
  </library>
</technology>

As I understand it, this specifies that compute_cost takes 10 clock
cycles to complete and has an II of 1 (though I might be misunderstanding the
initiation_time attribute).

However, when simulating this, the whole thing takes ~100k cycles and the
start_port of the compute_cost module is only set to 1 every 10 clock
cycles, as can be seen from this screenshot:

[image: screenshot of the simulation]

Am I misunderstanding how external IP blocks behave or did I run into a bug?
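
As a rough sanity check on these numbers, the two schedules can be compared with a back-of-the-envelope model. This is only a sketch: the function names are made up for illustration, and the model assumes that outer-loop iterations do not overlap and ignores any control overhead bambu adds around the calls.

```cpp
#include <cstdint>

// Illustrative cycle model; none of these names come from bambu.

// No overlap at all: every call waits for the previous one to finish.
constexpr std::uint64_t sequential_cycles(std::uint64_t outer_iters,
                                          std::uint64_t inner_iters,
                                          std::uint64_t latency) {
    return outer_iters * inner_iters * latency;
}

// Inner loop pipelined: a new call starts every `ii` cycles and the last
// call of each pass needs `latency` cycles to finish.
// Assumption: outer iterations run one after another without overlap.
constexpr std::uint64_t pipelined_cycles(std::uint64_t outer_iters,
                                         std::uint64_t inner_iters,
                                         std::uint64_t latency,
                                         std::uint64_t ii) {
    return outer_iters * ((inner_iters - 1) * ii + latency);
}

// Values from the post: 99 outer iterations (i = 1..99), 100 inner
// iterations, cycles="10", initiation_time="1".
static_assert(sequential_cycles(99, 100, 10) == 99000,
              "fully sequential schedule");
static_assert(pipelined_cycles(99, 100, 10, 1) == 10791,
              "pipelined inner loop");
```

Under these assumptions the fully sequential schedule comes to 99,000 cycles, which is consistent with the ~100k observed in simulation and with start_port being raised once every 10 cycles. A pipelined inner loop would come to about 10,800 cycles.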

Edit: I ran bambu using ~/panda/bin/bambu dp.cpp --print-dot IP.xml --compiler=I386_CLANG11 and my bambu version is Version: PandA 0.9.7-dev - Revision 151822f6eb6b28b68ef7cde4c7c3c0add185ed9d-panda-0.9.7-dev

Dear Frans,
your syntax is completely correct, but Bambu does not currently support loop pipelining. Sorry for that.