Verilog vector inner product - loops

I am trying to implement a synthesizable Verilog module that computes the inner product of two vectors (arrays), each containing eight 16-bit unsigned integers. Design Compiler reported an error that symbol i must be a constant or parameter. I don't know how to fix it. Here's my code:
module VecMul16bit (a, b, c, clk, rst);
// Two vector inner product, each has 8 elements
// Each element is 16 bits
// So the Output should be at least 2^32*2^3 = 2^35 in order to
// prevent overflow
// Output is 35 bits
input clk;
input rst;
input [127:0] a,b;
output [35:0] c;
reg [15:0] a_cp [0:7];
reg [15:0] b_cp [0:7];
reg [35:0] c_reg;
reg k,c_done;
integer i;
always @ (a)
begin
for (i=0; i<=7; i=i+1) begin
a_cp[i] = a[i*15:i*15+15];
end
end
always @ (b)
begin
for (i=0; i<=7; i=i+1) begin
b_cp[i] = b[i*15:i*15+15];
end
end
assign c = c_reg;
always @(posedge clk or posedge rst)
begin
if (rst) begin
c_reg <= 0;
k <= 0;
c_done <= 0;
end else begin
c_reg <= c_done ? c_reg : (c_reg + a_cp[k]*b_cp[k]);
k <= c_done ? k : k + 1;
c_done <= c_done ? 1 : (k == 7);
end
end
endmodule
As you can see, I'm trying to copy a to a_cp through a loop. Is this the right way to do it?
If so, how should I define i, and can a constant be used as the loop variable in a for loop?

A part select in Verilog must have constant bounds. So this is not allowed:
a_cp[i] = a[i*15:i*15+15];
Verilog-2001 introduced an indexed part select syntax where you specify the starting position and the width of the selected group of bits. So you need to replace the above line with:
a_cp[i] = a[i*16 +: 16];
This takes a 16-bit slice of a starting at bit i*16 and counting upwards, i.e. a[i*16+15 : i*16]. Note that the stride is 16, not 15: with i*15, consecutive 16-bit slices would overlap by one bit. You can use -: instead of +:, in which case you count downwards from the starting bit.
Be careful: it is very easy to type :+ instead of +: and :+ is valid syntax and so might not be spotted by your compiler (but could still be a bug). In fact I did exactly that when typing this EDA Playground example, though my typo was caught by the compiler in this case.
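A quick way to see which bits an indexed part select picks out is to model it with integer shifts and masks. The Python sketch below is only a behavioral model, not HDL; the function names part_select_up and unpack are made up for illustration. It treats a[base +: width] on a [127:0] vector as bits base+width-1 down to base, and compares a stride of 16 with a stride of 15 for eight 16-bit elements.

```python
def part_select_up(vec, base, width):
    """Model of Verilog vec[base +: width] on a descending-range vector."""
    return (vec >> base) & ((1 << width) - 1)


def unpack(vec, stride, width=16, count=8):
    """Split vec into `count` slices of `width` bits, `stride` bits apart."""
    return [part_select_up(vec, i * stride, width) for i in range(count)]


# Pack eight distinct 16-bit words into one 128-bit integer,
# element 0 in the least significant bits.
words = [0x1111 * (i + 1) for i in range(8)]
packed = 0
for i, w in enumerate(words):
    packed |= w << (16 * i)

stride16 = unpack(packed, 16)  # slices line up with the packed words
stride15 = unpack(packed, 15)  # slices overlap by one bit

print(stride16 == words)
print(stride15 == words)
```

With a stride of 15 the slices drift by one bit per element, so eight 16-bit elements packed into 128 bits need a stride of 16.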

Actually, what you need for your code to be synthesizable is to declare i as a genvar. Something like this (using macros; put them above your module):
`define PACK_ARRAY_2D2(PK_WIDTH,PK_LEN,PK_DIMS,PK_SRC,PK_DEST,PK_OFFS) ({\
genvar pk_idx; genvar pk_dims; \
generate \
for (pk_idx=0; pk_idx<(PK_LEN); pk_idx=pk_idx+1) \
begin \
for (pk_dims=0; pk_dims<(PK_DIMS); pk_dims=pk_dims+1) \
begin \
assign PK_DEST[(((PK_WIDTH)*(pk_idx+pk_dims+1))-1+((PK_WIDTH)*PK_OFFS*pk_idx)):(((PK_WIDTH)*(pk_idx+pk_dims))+((PK_WIDTH)*PK_OFFS*pk_idx))] = PK_SRC[pk_idx][pk_dims][((PK_WIDTH)-1):0];\
end\
end\
endgenerate\
})
`define UNPACK_ARRAY_2D2(PK_WIDTH,PK_LEN,PK_DIMS,PK_DEST,PK_SRC,PK_OFFS) ({\
genvar unpk_idx; genvar unpk_dims; \
generate \
for (unpk_idx=0; unpk_idx<(PK_LEN); unpk_idx=unpk_idx+1) \
begin \
for (unpk_dims=0; unpk_dims<(PK_DIMS); unpk_dims=unpk_dims+1)\
begin \
assign PK_DEST[unpk_idx][unpk_dims][((PK_WIDTH)-1):0] = PK_SRC[(((PK_WIDTH)*(unpk_idx+unpk_dims+1))-1+((PK_WIDTH)*PK_OFFS*unpk_idx)):(((PK_WIDTH)*(unpk_idx+unpk_dims))+((PK_WIDTH)*PK_OFFS*unpk_idx))];\
end end endgenerate\
})
And here is how to use it (just put the macros inside pack_unpack.v), with a module that transposes a matrix as an example:
// Macros for Matrix
`include "pack_unpack.v"
module matrix_weight_transpose(
input signed [9*5*32-1:0] weight, // 5 columns, 9 rows, 32 bit data length
output signed [9*5*32-1:0] weight_transposed // 9 columns, 5 rows, 32 bit data length
);
wire [31:0] weight_in [8:0][4:0];
`UNPACK_ARRAY_2D2(32,9,5,weight_in,weight,4)
wire [31:0] weight_out [4:0][8:0];
`PACK_ARRAY_2D2(32,5,9,weight_out,weight_transposed,8)
generate // Computing the transpose
genvar i;
for (i = 0; i < 9; i = i + 1)
begin : columns
genvar j;
for (j = 0; j < 5; j = j + 1)
begin : rows
assign weight_out[j][i] = weight_in[i][j];
end
end
endgenerate
endmodule

Related

how to preset the register arrays in Verilog?

I am trying to define a register file, 32-bit wide 32-bit deep, in Verilog. How to preset all the values to zero or to any value I want with/without a for loop?
Here's my code, I tried but failed:
module register_file(rna, rnb, qa, qb);
input [4:0]rna;
input [4:0]rnb;
output [31:0]qa;
output [31:0]qb;
genvar i;
reg [31:0]registers[0:31];
assign registers[0]=32'b0;
registers[1]=32'b0;
registers[2]=32'b0;
registers[3]=32'b0;
endmodule
The usual way to preset register values is with a clock and a reset signal. For example:
reg [31:0]registers[0:31];
integer i;
always @(posedge clk) begin
if (reset) begin
for (i = 0; i < 32; i = i + 1) // clear all 32 entries
registers[i] <= 0;
end
else begin
// do some real work with registers here
end
end
In some cases you might want to do the initial setting in an initial block in your testbench:
initial begin
for (i = 0; ...) registers[i]= 0;
end
The above is not usually synthesizable.
There are a few other ways available in SystemVerilog.

Verilog OR of array elements

I want to OR a parameterized number of 32-bit buses as follows:
out_bus = bus1 | bus2 | bus3 | ... | busN;
I also want to declare the buses as an array (N is a fixed parameter, defined at compile time):
reg [31:0] bus[N-1:0];
The best I can figure how to do this is something like this:
parameter N;
reg [N-1:0] temp;
reg [31:0] out_bus;
reg [31:0] bus[N-1:0];
always @(*) begin
for (j=0; j<32; j=j+1) begin : bits
for (k=0; k < N; k=k+1) begin : bus
temp = bus[k][j];
end
out_bus[j] = |temp;
end
end
This needs to be synthesizable. There's got to be a cleaner/better way, no?
If you were using SystemVerilog, you could replace the entire always block with
assign out_bus = bus.or();
This uses one fewer for loop and one fewer temporary signal:
reg [31:0] out_bus;
reg [31:0] bus[N-1:0];
integer k;
always @(*) begin
out_bus = {32{1'b0}};
for (k=0; k < N; k=k+1) begin
out_bus = out_bus | bus[k];
end
end
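To get expected values for a testbench of the loop above, the same accumulate-and-OR pattern can be modelled in a few lines of Python. This is only a reference model; the function name or_reduce and the 32-bit mask are assumptions chosen to match the declarations in the question.

```python
def or_reduce(buses, width=32):
    """Bitwise OR of all entries, truncated to `width` bits.

    Mirrors the Verilog loop: start from zero, then OR in one bus per step.
    """
    mask = (1 << width) - 1
    out = 0
    for b in buses:
        out |= b & mask
    return out


example = or_reduce([0x00000001, 0x00000010, 0x00000F00])
print(hex(example))
```

An empty list yields zero, which matches the value out_bus is cleared to before the loop.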
The following code gave the expected results in Quartus, as verified in the schematic view.
module example #(
parameter WIDTH = 32,
parameter DEPTH = 4
)(
input [DEPTH-1:0][WIDTH-1:0] DataIn,
output reg [WIDTH-1:0] DataOut
);
reg [WIDTH-1:0] ORDatain;
always @(*)
begin
ORDatain = {WIDTH{1'b0}}; // clear to zero for any WIDTH, not just 32
for(int index=0; index <DEPTH; index++)
ORDatain = ORDatain | DataIn[index];
end
assign DataOut = ORDatain;
endmodule

Converting a fixed point Matlab code to Verilog

I have fixed-point Matlab code that needs to be converted to Verilog. Below is the Matlab code. yfftshift is 5000x1 and y2shape is 100x50.
rows=100;
colms=50;
r=1;
for m=0:colms-1
for n=0:rows-1
y2shape(n+1,m+1)=yfftshift(r,1);
r=r+1;
end
end
How can I create memories in Verilog and call them inside the for loop?
The easiest way to handle fixed-point values in Verilog is to introduce a scale factor and allocate registers large enough to hold the maximum scaled value. For example, if you know that the maximum value of your numbers will be 40, and three digits of precision to the right of the decimal point are enough, a scaling factor of 1000 can be used with 16-bit registers. Verilog treats reg values as unsigned by default, so if values can be negative, it's necessary to add "signed" to the declarations. The Verilog could be:
`define NUMBER_ROWS 100
`define NUMBER_COLS 50
`define MAX_ROW (`NUMBER_ROWS-1)
`define MAX_COL (`NUMBER_COLS-1)
module moveMemory();
reg clk;
reg [15:0] y2shape [`MAX_ROW:0][`MAX_COL:0];
reg [15:0] yfftshift [`NUMBER_ROWS * `NUMBER_COLS - 1:0]; // one entry per matrix element
integer rowNumber, colNumber;
always @(posedge clk)
begin
for (rowNumber = 0; rowNumber < `NUMBER_ROWS; rowNumber = rowNumber + 1)
for (colNumber = 0; colNumber < `NUMBER_COLS; colNumber = colNumber + 1)
y2shape[rowNumber][colNumber] <= yfftshift[rowNumber * `NUMBER_COLS + colNumber];
end
endmodule
This is OK for an FPGA or simulation project, but for full custom work, an SRAM macro would be used to avoid the die area associated with 16,000 registers. For an FPGA implementation, you've probably already paid for the 16K registers, or you may be able to do some extra work to get the synthesizer to map the registers to an on-chip SRAM.
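Going back to the scale-factor idea at the start of this answer, the arithmetic is easy to check before committing to a register width. The Python sketch below is only illustrative: SCALE, to_fixed, fits_unsigned and fits_signed are hypothetical names. It uses the figures from the text, a maximum value of 40 and a scale factor of 1000.

```python
SCALE = 1000  # three decimal digits of precision


def to_fixed(x):
    """Convert a real value to its scaled integer representation."""
    return round(x * SCALE)


def fits_unsigned(v, bits):
    """True if v is representable as an unsigned `bits`-bit number."""
    return 0 <= v < (1 << bits)


def fits_signed(v, bits):
    """True if v is representable as a two's complement `bits`-bit number."""
    return -(1 << (bits - 1)) <= v < (1 << (bits - 1))


max_scaled = to_fixed(40.0)
print(max_scaled, fits_unsigned(max_scaled, 16), fits_signed(max_scaled, 16))
```

One thing this makes visible: the scaled maximum fits an unsigned 16-bit register, but a signed 16-bit register only reaches 32767, so once "signed" is added the same range would need at least 17 bits.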
The test bench:
// Testing code
integer loadCount, rowShowNumber, colShowNumber;
initial
begin
// Initialize array with some data
for (loadCount=0; loadCount < (`NUMBER_ROWS * `NUMBER_COLS); loadCount = loadCount + 1)
yfftshift[loadCount] <= loadCount;
clk <= 0;
// Clock the block
#1
clk <= 1;
// Display the results
#1
$display("Y2SHAPE has these values at time ", $time);
for (rowShowNumber = 0; rowShowNumber < `NUMBER_ROWS; rowShowNumber = rowShowNumber + 1)
for (colShowNumber = 0; colShowNumber < `NUMBER_COLS; colShowNumber = colShowNumber + 1)
$display("y2shape[%0d][%0d] is %d ", rowShowNumber, colShowNumber, y2shape[rowShowNumber][colShowNumber]);
end
The simulation results for NUMBER_ROWS=10, NUMBER_COLS=5
Y2SHAPE has these values at time 2
y2shape[0][0] is 0
y2shape[0][1] is 1
y2shape[0][2] is 2
y2shape[0][3] is 3
y2shape[0][4] is 4
.
.
.
y2shape[9][2] is 47
y2shape[9][3] is 48
y2shape[9][4] is 49
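For comparison with the Matlab loop in the question: there the column index m is the outer loop and r increments on every inner step, so the source vector is consumed column by column, whereas the Verilog above reads it row by row. The Python sketch below (0-based indices; the function names are made up for illustration) writes out both address mappings so the difference can be checked.

```python
ROWS, COLS = 100, 50


def matlab_index(n, m, rows=ROWS):
    """0-based source index used by the Matlab loop for y2shape(n+1, m+1)."""
    return m * rows + n


def verilog_index(n, m, cols=COLS):
    """0-based source index used by the Verilog for y2shape[n][m]."""
    return n * cols + m


# Both mappings visit every source element exactly once, in a different order.
matlab_all = sorted(matlab_index(n, m) for n in range(ROWS) for m in range(COLS))
verilog_all = sorted(verilog_index(n, m) for n in range(ROWS) for m in range(COLS))

print(matlab_index(0, 1), verilog_index(0, 1))
```

If the Matlab ordering is what is required, the read address in the Verilog would presumably become colNumber * `NUMBER_ROWS + rowNumber. That is an inference from the loop bounds in the question, and the simulation output shown above would change accordingly.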

GAUT HLS tool error : "No alternatives to process, unable to select best one"

I am trying to synthesize the following C code using the GAUT tool:
#define N 16
int main (const int tab[N], int* out)
{
// static const int tab[N] = {98,-39,-327,439,950,-2097,-1674,9883,9883,-1674,-2097,950,439,-327,-39,98};
int k = 0, i=1;
for( i = 1; i < N; i++)
{
// invariant: k is the index of the smallest
// element of x[0..i-1]
if(tab[i] < tab[k])
k = i;
}
*out = tab[k];
return 0;
}
Simple program to find the minimum in an array.
It compiles successfully and generates a DFG that looks reasonable.
However, when I try to synthesize, I get this error:
"No alternatives to process, unable to select best one"
And thus can't go on with the implementation flow.
Does anyone know what the problem is? I am facing it with other small test programs as well. I hope a specialist will be able to answer.
Thank you.
Since this is tagged VHDL, perhaps it's worth looking at a straight VHDL port, bypassing the tool completely. This only took a few minutes, and it's in three parts:
1) VHDL has a quirk in that to use an array as a port parameter, it must be a named type (int_array). (C has a different quirk passing arrays around : it doesn't, it passes a pointer instead)
package Types is
type int_array is array (natural range <>) of integer;
end Types;
package body Types is
end Types;
2) The bit that does the work: I left the C code in as a comment to illustrate how closely they correspond:
use Work.Types.all;
-- int main (const int tab[N], int* out)
entity MinArray is
Generic ( N : Natural);
Port ( Tab : in int_array;
Output : out integer );
end MinArray;
architecture Behavioral of MinArray is
-- int k = 0, i=1;
-- for( i = 1; i < N; i++)
-- {
-- if(tab[i] < tab[k])
-- k = i;
-- }
-- *out = tab[k];
-- return 0;
--}
begin
Process(Tab) is
variable k : natural;
begin
k := tab'low; -- start at the first index, whatever range the actual array has
for i in tab'range loop
if tab(i) < tab(k) then
k := i;
end if;
end loop;
Output <= tab(k);
end process;
end Behavioral;
3) A test harness:
use Work.Types.all;
ENTITY tester IS
Port ( Minimum : out integer );
END tester;
ARCHITECTURE behavior OF tester IS
--#define N 16
-- static const int tab[N] = {98,-39,-327,439,950,-2097,-1674,9883,9883,-1674,-2097,950,439,-327,-39,98};
constant N : natural := 16;
constant tab : int_array (1 to N) := (98,-39,-327,439,950,-2097,-1674,9883,9883,-1674,-2097,950,439,-327,-39,98 );
BEGIN
uut: entity work.MinArray
Generic Map (N => N)
PORT MAP(
Tab => Tab,
Output => Minimum );
END;
Note that this is all synthesisable in Xilinx XST,
Advanced HDL Synthesis Report
Macro Statistics
# RAMs : 1
32x32-bit single-port distributed Read Only RAM : 1
# Comparators : 15
32-bit comparator greater : 15
# Multiplexers : 32
1-bit 2-to-1 multiplexer : 24
2-bit 2-to-1 multiplexer : 1
3-bit 2-to-1 multiplexer : 4
4-bit 2-to-1 multiplexer : 3
but (because the input tables are a constant array) all the above hardware disappears in the optimisation stage.
One of the important things in high-level synthesis is exploring different datatypes, for example different word widths, such as the 15-bit word required to store the test data. To explore this, let's modify the "Types" package as follows:
type small_int is range -16384 to 16383;
type int_array is array (natural range <>) of small_int;
I also changed the Output port type to small_int. And as we can see from the synthesis report, hardware usage has been reduced accordingly.
Macro Statistics
# RAMs : 1
32x15-bit single-port distributed Read Only RAM : 1
# Comparators : 15
15-bit comparator greater : 15
# Multiplexers : 32
1-bit 2-to-1 multiplexer : 24
2-bit 2-to-1 multiplexer : 1
3-bit 2-to-1 multiplexer : 4
4-bit 2-to-1 multiplexer : 3
So perhaps a question is : how much easier do the C tools make exploring the design space like custom word widths?
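The 15-bit figure can be checked directly from the test data. The Python sketch below (min_signed_width is a hypothetical helper name) finds the smallest two's complement width that holds every value in tab, and also gives the minimum the design should output.

```python
def min_signed_width(values):
    """Smallest two's complement width that can hold every value."""
    lo, hi = min(values), max(values)
    bits = 1
    while not (-(1 << (bits - 1)) <= lo and hi <= (1 << (bits - 1)) - 1):
        bits += 1
    return bits


tab = [98, -39, -327, 439, 950, -2097, -1674, 9883,
       9883, -1674, -2097, 950, 439, -327, -39, 98]

width = min_signed_width(tab)
smallest = min(tab)
print(width, smallest)
```

The largest entry, 9883, is what sets the width: it exceeds the 14-bit signed maximum of 8191 but fits within the range -16384 to 16383 used for small_int.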

VHDL: Add list of numbers using loop

To start off, I have a very limited knowledge of C, just basic functions. I have been set a task in VHDL, of which I have no experience.
The task is to write a program in VHDL that will use a loop to add a list of 10 numbers (13,8,6,5,19,21,7,1,12,3).
I was thinking of a way of doing this in C first, to see if I could somewhat mimic the method. So far I have only come up with
int start = 0;
int add = start;
int increment = 5;
for (int i=0; i<10; i++) {
add = add + increment;
}
Now I know that is VERY basic, but it's the best I can do. That loop will only increment the sum by 5 each time, as opposed to adding the numbers in the list that I have.
Any help is much appreciated. It's my first question, so apologies if I am breaking any 'unwritten laws'.
You mention that this is part of a study on Parwan processors, so the way to think about it depends a lot on how you are studying them.
If you are building up an implementation of the processor, then just learning the syntax for logical operations is the important part, and you should focus on the types
unsigned(7 downto 0), covering 0 to 255, and signed(7 downto 0), covering -128 to 127. By making use of the package ieee.numeric_std.all you get the addition operation defined for those types.
If, however, the processor is already defined for you, take a good look at the processor interfaces. The code you will write for this will be much more of an explicit state machine.
Either way, I find the best way to start is to write a test bench. This is the part that will feed in the list of inputs, because ultimately you won't want it to be a for (int i=0; i<10; i++), but rather a while(1) style of processing.
That's all theory stuff, so here's some pseudo code for a simple accumulator process:
signal acc : unsigned(7 downto 0) := (others => '0'); -- accumulator register
signal b : unsigned(7 downto 0) := to_unsigned(5, 8); -- value to be added
-- each cycle you would change b
accumulator : process (clk)
begin
if rising_edge(clk) then
acc <= acc + b; -- wraps around modulo 256
end if;
end process;
or maybe better yet take a look here: Accumulator
The solution below could help you get started with your problem in VHDL.
For an implementation in an FPGA, better solutions could be worked out, so just consider it a start:
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
entity add is
port (
clk : in std_logic;
rst : in std_logic;
add : in std_logic;
sum : out std_logic_vector(31 downto 0));
end entity add;
architecture RTL of add is
constant rom_size : integer := 10;
type t_rom is array (0 to rom_size-1) of unsigned(31 downto 0);
constant rom : t_rom := (
to_unsigned(13, sum'length),
to_unsigned(8, sum'length),
to_unsigned(6, sum'length),
to_unsigned(5, sum'length),
to_unsigned(19, sum'length),
to_unsigned(21, sum'length),
to_unsigned(7, sum'length),
to_unsigned(1, sum'length),
to_unsigned(12, sum'length),
to_unsigned(3, sum'length));
signal add_d : std_logic;
signal index : integer range 0 to rom_size;
signal sum_i : unsigned(sum'range);
begin
p_add : process (clk) is
begin
if rising_edge(clk) then -- rising clock edge
if rst = '1' then -- synchronous reset (active high)
sum_i <= (others => '0');
add_d <= '0';
index <= 0;
else
add_d <= add; -- rising edge detection
if add_d = '0' and add = '1' and index < rom_size then -- rising edge of add -> add next item, stop after the last one
sum_i <= sum_i + rom(index);
index <= index + 1;
end if;
end if;
end if;
end process p_add;
-- output
sum <= std_logic_vector(sum_i);
end architecture RTL;
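To know what a testbench for this should see, the edge-detect-and-accumulate behaviour can be modelled in Python. This is a rough sketch under stated assumptions: simulate is a made-up name, each list entry is the value of add sampled at one rising clock edge, the design starts from its reset state, and the check that stops at the end of the list is an assumption added here rather than something taken from the VHDL as posted.

```python
ROM = [13, 8, 6, 5, 19, 21, 7, 1, 12, 3]


def simulate(add_samples):
    """Return sum_i after clocking in the given samples of the add input."""
    sum_i, index, add_d = 0, 0, 0
    for add in add_samples:
        # Rising edge of add: previous sample was 0, current sample is 1.
        # The index check is an assumed guard against running past the ROM.
        if add_d == 0 and add == 1 and index < len(ROM):
            sum_i = (sum_i + ROM[index]) & 0xFFFFFFFF  # 32-bit register
            index += 1
        add_d = add  # registered copy of add for the next cycle
    return sum_i


ten_pulses = simulate([1, 0] * 10)
held_high = simulate([1, 1, 1, 1])
print(ten_pulses, held_high)
```

Ten pulses on add walk through the whole list once, while holding add high produces only one addition, because only the 0-to-1 transition counts.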
First, I'll point out there's no need to add the complexity of std_logic_vectors or vector arithmetic with signed and unsigned. This works fine with simple integers:
So, you have some numbers coming in and a sum going out:
entity summer is
port (
inputs : in integer_vector := (13,8,6,5,19,21,7,1,12,3);
sum_out : out integer);
end entity summer;
Note, I've initialised the inputs port with your values; normally you'd drive that port from your testbench. (integer_vector is a predefined type in VHDL-2008.)
Now to add them up, you need a process:
process(inputs)
variable sum : integer;
begin
sum := 0;
for i in inputs'range loop
sum := sum + inputs(i);
end loop;
sum_out <= sum;
end process;
That's a simplistic solution - to create a "best" solution you need a more detailed specification. For example: how often will the inputs change? How soon do you need the answer after the inputs change? Is there a clock?

Resources