Skip to content

Yang-Qirui/Resource-Sharing

Repository files navigation

Resource-Sharing Based on Decreasing Area

A demo repo to study on resource sharing algorithm in EDA, aimed to decrease the circuits' area.

Problem

We measured the compile result of Design Compiler, and find out that its resource sharing algorithm is greedy, which means it can't achieve a best sharing decision. For example:

// test.v

module top(input CLK,
           input [31:0] a,
           input [31:0] b,
           input [31:0] c,
           input [31:0] d,
           input [31:0] e,
           input [31:0] f,
           input [1:0] state,
           output reg[31:0] out
);
always @(posedge CLK) begin
    if (state == 0) begin
        o = a + b + c;
    end
    else if (state == 1) begin
        o = a + b + d;
    end
    else begin
        if (state == 2) begin
            o = b + c + e;
        end
        else begin
            o = c + e + f;
        end
    end
end
endmodule

if we manually do the resource sharing, it should be:

// test_manual.v

if (state == 0) begin
    tmp1 = a;
    tmp2 = b;
    tmp3 = c;
end
else if (state == 1) begin
    tmp1 = a;
    tmp2 = b;
    tmp3 = d;
end
else begin
    if (state == 2) begin
        tmp1 = e;
        tmp2 = b;
        tmp3 = c;
    end
    else begin
        tmp1 = e;
        tmp2 = f;
        tmp3 = c;
    end
end
o = tmp1 + tmp2 + tmp3

and the result is, for every tmp, it only has to choose from two different input, which means it only needs a two input mux to choose value which needed the tmp reg to hold.

But to Design Compiler, it seems to choose to first share b and e (since 3 of 4 branches have the input b), and then share a and c, lastly share c,d,e,f.

The problem is, this sharing decision is apparently not the best one. it use 2 two input mux and 1 four input mux.

// part of test_mapped.v generated by design compiler

// share b and e
OAI21_X1 U476 ( .B1(n387), .B2(n394), .A(n490), .ZN(\U3/U2/Z_8 ) );
NAND2_X1 U477 ( .A1(b[8]), .A2(n489), .ZN(n490) );
INV_X1 U478 ( .A(e[8]), .ZN(n394) );

// share a and c
AOI22_X1 U374 ( .A1(c[1]), .A2(n353), .B1(a[1]), .B2(n354), .ZN(n384) );

// share e,c,d,f
INV_X1 U526 ( .A(e[21]), .ZN(n442) );
INV_X1 U430 ( .A(c[21]), .ZN(n443) );
OAI221_X1 U428 ( .B1(n388), .B2(n442), .C1(n443), .C2(n391), .A(n444), .ZN(
        \U3/U3/Z_21 ) );
AOI22_X1 U429 ( .A1(d[21]), .A2(n351), .B1(f[21]), .B2(n393), .ZN(n444) );

And we measured the area of those two type of resource sharing. We first let the Design compiler do a compile process to the origin code test.v, and then disable its resource sharing (use command set hlo_resource_allocation none) , and let it compile the test_manual.v. And we got those results: (former one is design compiler's resource sharing)

So, how to manual do this better resource sharing?

Preparations

Take the following case as an example, before we start, make sure the equations do not have brackets:

branch1:
o = a + b - c * e;
branch2:
o = a - b * d + f;
branch3:
o = b - c + e * f;
branch4:
o = a + e * f - a;

Even though we know that plus operation is similar to minus operation, we still don't want to import more other operations (like use invert the input and plus one), which will increase the area. So we choose to preserve the operators, and 'align' those equations. Aligned equations should be like:

branch1:
o = a + b - c * e;
branch2:
o = a + f - b * d;
branch3:
o = b + e * f - c;
branch4:
o = a + e * f - a;

This indicates that for the output reg o, every branch includes one plus, one multiply and one minus operators. (Later we will discuss about the branches with different numbers of operators).

We first use share the multiply operations first (also discuss later), and it becomes:

branch1:
o = (a + b) - (tmp);
branch2:
o = (a + f) - (tmp);
branch3:
o = (b + tmp) - (c);
branch4:
o = (a + tmp) - (a);

reg tmp means the multiply results. And then, we can divide this code into two parts, in each part the operators only consists plus. And those parts do resource sharing seperately.

For the first part, it looks like:

branch1:
o1 = (a + b)
branch2:
o1 = (a + f)
branch3:
o1 = (b + tmp)
branch4:
o1 = (a + tmp)

and we can form a table, 1s means the branch has such inputs, zero means hasn't.

a b f tmp
branch1 1 1 0 0
branch2 1 0 1 0
branch3 0 1 0 1
branch4 1 0 0 1

Then use the table to form a matrix: $$ A=\begin{bmatrix} 1 & 1 & 0 & 0 \ 1 & 0 & 1 & 0 \ 0 & 1 & 0 & 1 \ 1 & 0 & 0 & 1 \end{bmatrix} $$ And the problem becomes: $$ \begin{align} (1)&Define\ a\ matrix\ A\ with\ m\ rows\ n\ columns\ ,a\ constant\ number\ c. \ &m\ equals\ to\ the\ number\ of\ branches,n\ equals\ to\ the\ number\ of\ inputs, c\ equals\ to\ the\ number\ of\ plus\ operators.\ &Therefore\ the\ elements\ in\ A\ can\ only\ be\ 0\ or\ 1,the\ number\ of\ 1s\ equal\ to\ c\ times\ m.\ &Apparently, if\ we\ add\ up\ all\ columns,we'll\ get \begin{bmatrix}c\c\c\c\\end{bmatrix}.\ (2)&Define\ \vec{i}\ as\ a\ representation\ of\ the\ i\ th\ column\ vector\ of\ matrix\ A.\ Problem:&Use\ as\ little\ as\ possible\ 'vector\ plus'\ operation\ and\ all\ columns\ to\ get\ c\ \begin{bmatrix}1\1\...\1\end{bmatrix}.\ Example:&A=\begin{bmatrix} 1 & 1 & 0 & 0 \ 1 & 0 & 1 & 0 \ 0 & 1 & 0 & 1 \ 1 & 0 & 0 & 1 \end{bmatrix}\\ &\vec{1}+\vec{2}=\begin{bmatrix}1\1\1\1\\end{bmatrix},and\ A\ become\ A_1= \begin{bmatrix} 1 & 0 & 0 \ 0 & 1 & 0 \ 0 & 0 & 1 \ 0 & 0 & 1 \end{bmatrix} \\ &\vec{2}+\vec{3}+\vec{4}=\begin{bmatrix}1\1\1\1\\end{bmatrix},and\ we\ get\ two\ (c = 2)\ \begin{bmatrix}1\1\1\1\end{bmatrix}\\ &so,the\ result\ is\ 3\ times\ of\ vector\ plus \ operation. \end{align} $$ So $$\vec{1} + \vec{2}$$ means sharing a and b,$$\vec{2} + \vec{3} + \vec{4}$$ means sharing b,f and tmp. Totally use 1 two input mux and 1 three input mux.

Algorithm

In last part, we convert the problem into a math problem. But one thing need to be pay attention to, that is our algorithm need to be global optimum rather than local optimum. This means we can't use greedy strategy, which lead to the simulation of more consequences. Take the former case as an example. We can't claim that $$\vec{1} + \vec{2}$$ and $$\vec{2} + \vec{3} + \vec{4}$$ use the least cost of mux until we measure other possible consequences. This means we must take the executing time of out algorithm into consideration since the case can be extremely large.

We define the process of choosing two columns to do vector plus as an 'operation'. As far as you can see, for a 4*4 matrix, if we randomly choose two columns to do 'operation',it may cost at most m=4 times of 'operations' to get a [1,1,1,1] vector. So we can roughly calculate the scale of all consequences: $$ m \times C_n^2 \times c=O(mn^2c) $$. This complexity isn't durable. So we need to use some methods to decrease it.

Cut leaves

First, this problem can form

The worst case is just similar to the Monte Carlo Method. We need to enumerate all situation to find the best consequence.

About

A demo repo to study on resource sharing algorithm

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages