Vivado HLS Synthesize error - vivado-hls

I'm currently trying to do some project on Vivado HLS. However, I got an error as shown in the title during synthesis. However, this error appears:
error:** invalid operands to binary expression ('double' and 'datau32' (aka 'ap_axiu<32, 2, 5, 6>')) imgOut= (0.2126*Imgin[coord] + 0.7152*Imgin[coord+1] + 0.0722*Imgin[coord+2])
This is my HLS code:
#include "core.h"
void imgreading(hls::stream<datau32> &inStream, datau32 Imgin[imagesize])
{
for(int i=0;i<imagesize;i++)
{
Imgin[i]=(datau32)inStream.read();
}
}
void resize_half(hls::stream<datau32> &inStream, hls::stream<datau32> &outStream)
{
#pragma HLS INTERFACE axis port=inStream
#pragma HLS INTERFACE axis port=outStream
#pragma HLS INTERFACE s_axilite port=return bundle=CRTL_BUS
datau32 Imgin[imagesize];
imgreading (inStream,Imgin);
datau32 imgOut;
int coord=0;
#pragma HLS DATAFLOW
for (int a=0; a<240; a++) {
for(int b=0; b<320; b++){
#pragma HLS PIPELINE II=1
coord=6*(a*640+b);
imgOut= (0.2126*Imgin[coord] + 0.7152*Imgin[coord+1] + 0.0722*Imgin[coord+2]) ;
datau32 dataOutSideChannel;
dataOutSideChannel.data = imgOut;
outStream.write (dataOutSideChannel);
}
}
}

The tools complains that it cannot deal with the binary operators in imgOut= (0.2126*Imgin[coord] + 0.7152*Imgin[coord+1] + 0.0722*Imgin[coord+2]). Binary operators are operators with two operands, here * and +. As mentioned in the error message, datau32 is of type ap_axiu<32, 2, 5, 6>. Imgin and imgOut have datau32 as basetype. Hence, the message appears to be referring to the multiplications of Imgin[...] with the floating-point constants (0.2126, etc.)
Anyway, ap_axiu is used to declare AXI buses. It is a structure with the following format (See page 110 of UG902 for Vivado HLS 2017.1.):
template<int D,int U,int TI,int TD>
struct ap_axiu{
ap_uint<D> data;
ap_uint<D/8> keep;
ap_uint<D/8> strb;
ap_uint<U> user;
ap_uint<1> last;
ap_uint<TI> id;
ap_uint<TD> dest;
};
So what you are trying to do is multiply a floating-point constant with a structure. That is not allowed in C++.
If you did intent to use an AXI bus, you will have to use the data field of the structure to convey the data. The result of multiplying the integer data field with a double is another double. double is 64-bit wide, so you will have to handle that discrepancy too. You could use constants of type float, which is likely enough precision. Or you could make your AXI bus wider. Or you could reduce the precision after the computation by casting to float. Or you could use 2 bus cycles to transfer a single element. Note that if you want to convert a double or float to an integer, you will have to use reinterpret_cast to avoid losing precision. Note that you will also have to assign values to all other fields of the ap_axiu structure. Note that you will also have to assign values to all other fields of the ap_axiu structure (keep, strb, etc.).
An easier way to use the AXI bus is by declaring inStream and outStream as arrays, e.g. ap_uint<32> inStream[320*240]. The handshaking (TREADY and TVALID) is automatically taken care of. If you need the so-called side-channel (the remaining signals like TLAST or TUSER), you cannot use this method. That may be the case, for example, if you want to transmit packets of data instead of a continuous stream (can be done with TLAST), or if your data size is not a multiple of the bus size such that you need a byte enable signal (TKEEP).
I can also imagine that you never intended to use an AXI bus. There are types such as ap_uint and ap_fixed that can be used to submit data on a plain bus.
Finally, I want to emphasize that you should always debug your code in software first. There are many bugs that are hard to solve based on the output of the synthesis alone. Some messages tend to point people in the wrong direction. I recommend that you debug your code using the C simulation functionality first. Alternatively, you can compile the code outside Vivado HLS with a regular C compiler such as gcc. I also recommend using a memory checker such as valgrind to ensure that your code does not write outside array bounds, etc. The tool does not always find these issues, but it does result in unusable hardware.
I think this is the solution that you are looking for:
void resize_half(ap_uint<32> inAXI[640 * 480 * 3], ap_uint<32> outAXI[320 * 240])
{
#pragma HLS INTERFACE axis port=inAXI
#pragma HLS INTERFACE axis port=outAXI
#pragma HLS INTERFACE s_axilite port=return bundle=CRTL_BUS
#pragma HLS dataflow
hls::stream<ap_uint<32> > Stream[3];
for (int i = 0; i < 480; i++)
for (int j = 0; j < 640; j++)
for (int k = 0; k < 3; k++)
{
#pragma HLS PIPELINE II=1
ap_uint<32> value = inAXI[3 * (640 * i + j) + k];
if (i % 2 == 0 && j % 2 == 0)
Stream[k].write(value);
}
for (int a = 0; a < 240; a++)
{
for (int b = 0; b < 320; b++)
{
#pragma HLS PIPELINE II=1
ap_uint<32> x = Stream[0].read();
ap_uint<32> y = Stream[1].read();
ap_uint<32> z = Stream[2].read();
outAXI[320 * a + b] = 0.2126 * x + 0.7152 * y + 0.0722 * z;
}
}
}

Related

Enabling HVX SIMD in Hexagon DSP by using instruction intrinsics

I was using Hexagon-SDK 3.0 to compile my sample application for HVX DSP architecture. There are many tools related to Hexagon-LLVM available to use located folder at:
~/Qualcomm/HEXAGON_Tools/7.2.12/Tools/bin
I wrote a small example to calculate the product of two arrays to makes sure I can utilize the HVX hardware acceleration. However, when I generate my assembly, either with -S , or, with -S -emit-llvm I don't find any definition of HVX instructions such as vmem, vX, etc. My C application is executing on hexagon-sim for now till I manage to find a way to run in on the board as well.
As far as I understood, I need to define my HVX part of the code in C Intrinsics, but was not able to adapt the existing examples to match my own needs. It would be great if somebody could demonstrate how this process can be done. Also in the Hexagon V62 Programmer's Reference Manual many of the intrinsic instructions are not defined.
Here is my small app in pure C:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#if defined(__hexagon__)
#include "hexagon_standalone.h"
#include "subsys.h"
#endif
#include "io.h"
#include "hvx.cfg.h"
#define KERNEL_SIZE 9
#define Q 8
#define PRECISION (1<<Q)
double vectors_dot_prod2(const double *x, const double *y, int n)
{
double res = 0.0;
int i = 0;
for (; i <= n-4; i+=4)
{
res += (x[i] * y[i] +
x[i+1] * y[i+1] +
x[i+2] * y[i+2] +
x[i+3] * y[i+3]);
}
for (; i < n; i++)
{
res += x[i] * y[i];
}
return res;
}
int main (int argc, char* argv[])
{
int n;
long long start_time, total_cycles;
/* -----------------------------------------------------*/
/* Allocate memory for input/output */
/* -----------------------------------------------------*/
//double *res = memalign(VLEN, 4 *sizeof(double));
const double *x = memalign(VLEN, n *sizeof(double));
const double *y = memalign(VLEN, n *sizeof(double));
if ( *x == NULL || *y == NULL ){
printf("Error: Could not allocate Memory for image\n");
return 1;
}
#if defined(__hexagon__)
subsys_enable();
SIM_ACQUIRE_HVX;
#if LOG2VLEN == 7
SIM_SET_HVX_DOUBLE_MODE;
#endif
#endif
/* -----------------------------------------------------*/
/* Call fuction */
/* -----------------------------------------------------*/
RESET_PMU();
start_time = READ_PCYCLES();
vectors_dot_prod2(x,y,n);
total_cycles = READ_PCYCLES() - start_time;
DUMP_PMU();
printf("Array product of x[i] * y[i] = %f\n",vectors_dot_prod2(x,y,4));
#if defined(__hexagon__)
printf("AppReported (HVX%db-mode): Array product of x[i] * y[i] =%f\n", VLEN, vectors_dot_prod2(x,y,4));
#endif
return 0;
}
I compile it using hexagon-clang:
hexagon-clang -v -O2 -mv60 -mhvx-double -DLOG2VLEN=7 -I../../common/include -I../include -DQDSP6SS_PUB_BASE=0xFE200000 -o arrayProd.o -c arrayProd.c
Then link it with subsys.o (is found in DSK and already compiled) and -lhexagon to generate my executable:
hexagon-clang -O2 -mv60 -o arrayProd.exe arrayProd.o subsys.o -lhexagon
Finally, run it using the sim:
hexagon-sim -mv60 arrayProd.exe
A bit late, but might still be useful.
Hexagon Vector eXtensions are not emitted automatically and current instruction set (as of 8.0 SDK) only supports integer manipulation, so compiler will not emit anything for the C code containing "double" type (it is similar to SSE programming, you have to manually pack xmm registers and use SSE intrinsics to do what you need).
You need to define what your application really requires.
E.g., if you are writing something 3D-related and really need to calculate double (or float) dot products, you might convert yout floats to 16.16 fixed point and then use instructions (i.e., C intrinsics) like
Q6_Vw_vmpyio_VwVh and Q6_Vw_vmpye_VwVuh to emulate fixed-point multiplication.
To "enable" HVX you should use HVX-related types defined in
#include <hexagon_types.h>
#include <hexagon_protos.h>
The instructions like 'vmem' and 'vmemu' are emitted automatically for statements like
// I assume 64-byte mode, no `-mhvx-double`. For 128-byte mode use 32 int array
int values[16] = { 1, 2, 3, ..... };
/* The following line compiles to
{
r4 = __address_of_values
v1 = vmem(r4 + #0)
}
You can get the exact code by using '-S' switch, as you already do
*/
HVX_Vector v = *(HVX_Vector*)values;
Your (fixed-point) version of dot_product may read out 16 integers at a time, multiply all 16 integers in a couple of instructions (see HVX62 programming manual, there is a tip to implement 32-bit integer multiplication from 16-bit one),
then shuffle/deal/ror data around and sum up rearranged vectors to get dot product (this way you may calculate 4 dot products almost at once and if you preload 4 HVX registers - that is 16 4D vectors - you may calculate 16 dot products in parallel).
If what you are doing is really just byte/int image processing, you might use specific 16-bit and 8-bit hardware dot products in Hexagon instruction set, instead of emulating doubles and floats.

OpenMP: Parallelize for loops inside a while loop

I am implementing an algorithm to compute graph layout using force-directed. I would like to add OpenMP directives to accelerate some loops. After reading some courses, I added some OpenMP directives according to my understanding. The code is compiled, but don’t return the same result as the sequential version.
I wonder if you would be kind enough to look at my code and help me to figure out what is going wrong with my OpenMP version.
Please download the archive here:
http://www.mediafire.com/download/3m42wdiq3v77xbh/drawgraph.zip
As you see, the portion of code which I want to parallelize is:
unsigned long graphLayout(Graph * graph, double * coords, unsigned long maxiter)
Particularly, these two loops which consumes alot of computational resources:
/* compute repulsive forces (electrical: f=-C.K^2/|xi-xj|.Uij) */
for(int j = 0 ; j < graph->nvtxs ; j++) {
if(i == j) continue;
double * _xj = _position+j*DIM;
double dist = DISTANCE(_xi,_xj);
// power used for repulsive force model (standard is 1/r, 1/r^2 works well)
// double coef = 0.0; -C*K*K/dist; // power 1/r
double coef = -C*K*K*K/(dist*dist); // power 1/r^2
for(int d = 0 ; d < DIM ; d++) force[d] += coef*(_xj[d]-_xi[d])/dist;
}
/* compute attractive forces (spring: f=|xi-xj|^2/K.Uij) */
for(int k = graph->xadj[i] ; k < graph->xadj[i+1] ; k++) {
int j = graph->adjncy[k]; /* edge (i,j) */
double * _xj = _position+j*DIM;
double dist = DISTANCE(_xi,_xj);
double coef = dist*dist/K;
for(int d = 0 ; d < DIM ; d++) force[d] += coef*(_xj[d]-_xi[d])/dist;
}
Thank you in advance for any help you can provide!
You have data races in your code, e.g., when doing maxmove = nmove; or energy += nforce2;. In any multi-threaded code, you cannot write into a variable shared by threads until you use an atomic access (#pragma omp atomic read/write/update) or until you synchronize an access to such a variable explicitly (critical sections, locks). Read some tutorial about OpenMP first, there are more problems with your code, e.g.
if(nmove > maxmove) maxmove = nmove;
this line will generally not work even with atomics (you would have to use so-called compare-and-exchange atomic operation to solve this). Much better solution here is to let each thread to calculate its local maximum and then reduce it into a global maximum.

Can anyone help me to optimize this for loop use SSE?

I have a for loop which will run many times, and will cost a lot of time:
for (int z=0; z<temp; z++)
{
float findex= a + b * A[z];
int iindex = findex ;
outArray[z] += inArray[iindex] + (findex - iindex) * (inArray[iindex+1] - inArray[iindex]);
a++;
}
I have optimized this code, but have no performance improvement! Maybe my SSE code is bad, can any one help me?
Try using the restrict keyword on inArray and outArray. Otherwise the compiler has to assume that inArray could be == outArray. In this case no parallelization would be possible.
Your loop has a loop carried dependency when you write to outArray[z]. Your CPU can do more than one floating point sum at once but with your current loop you only allows one sum of outArray[z]. To fix this you should unroll your loop.
for (int z=0; z<temp; z+=2) {
float findex_v1 = a + b * A[z];
int iindex_v1 = findex_v1;
outArray[z] += inArray[iindex_v1] + (findex_v1 - iindex_v1) * (inArray[iindex_v1+1] - inArray[iindex_v1]);
float findex_v2 = (a+1) + b * A[z+1];
int iindex_v2 = findex_v2;
outArray[z+1] += inArray[iindex_v2] + (findex_v2 - iindex_v2) * (inArray[iindex_v2+1] - inArray[iindex_v2]);
a+=2;
}
In terms of SIMD the problem is that you have to gather non-contiguous data when you access inArray[iindex_v1]. AVX2 has some gather instructions but I have not tried them. Otherwise it may be best to do the gather without SIMD. All the operations accessing z access contiguous memory so that part is easy. Psuedo-code (without unrolling) would look something like this
int indexa[4];
float inArraya[4];
float dinArraya[4];
int4 a4 = a + float4(0,1,2,3);
for (int z=0; z<temp; z+=4) {
//use SSE for contiguous memory
float4 findex4 = a4 + b * float4.load(&A[z]);
int4 iindex4 = truncate_to_int(findex4);
//don't use SSE for non-contiguous memory
iindex4.store(indexa);
for(int i=0; i<4; i++) {
inArraya[i] = inArray[indexa[i]];
dinArraya[i] = inArray[indexa[i+1]] - inArray[indexa[i]];
}
//loading from and array right after writing to it causes a CPU stall
float4 inArraya4 = float4.load(inArraya);
float4 dinArraya4 = float4.load(dinArraya);
//back to SSE
float4 outArray4 = float4.load(&outarray[z]);
outArray4 += inArray4 + (findex4 - iindex4)*dinArray4;
outArray4.store(&outArray[z]);
a4+=4;
}

Race conditions despite atomicAdd functions (CUDA)?

I have a problem that is parallel on two levels: I have a ton of sets of (x0, x1, y0, y1) coordinate pairs, which are turned into variables vdx, vdy, vyy and for each of these sets I'm trying to calculate the values of all "monomials" composed of them up to degree n (i.e. all possible combinations of different powers of them, like vdx^3*vdy*vyy^2 or vdx*1*vyy^4). These values are then added up over all the sets.
My strategy (and for now I'd just like to get it to work, it doesn't have to be optimized with multiple kernels or complex reductions, unless it really has to) is to have each thread deal with one set of coordinate pairs and calculate the values of all their corresponding monomials. Each block's shared memory holds all the monomial sums, and when the block is done, the first thread in the block adds the result to the global sum. Since each block's shared memory is accessed by all threads in all places, I'm using atomicAdd; same with the blocks and the global memory.
Unfortunately there still seems to be a race condition somewhere, since I different results every time I run the kernel.
If it helps, I'm currently using degree = 3 and omitting one of the variables, which means that in the code below, the innermost for loop (over evbl) doesn't do anything and just repeats 4 times. Indeed, the output of the kernel looks like this: 51502,55043.1,55043.1,51502,47868.5,47868.5,48440.5,48440.6,46284.7,46284.7,46284.7,46284.7,46034.3,46034.3,46034.3,46034.3,44972.8,44972.8,44972.8,44972.8,43607.6,43607.6,43607.6,43607.6,43011,43011,43011,43011,42747.8,42747.8,42747.8,42747.8,45937.8,45937.8,46509.9,46509.9,... and it's noticable that there is a (rough) pattern of 4-tuples. But everytime I run it the values are all very different.
Everything is in floats, but I'm on a 2.1 GPU and so that shouldn't be a problem. cuda-memcheck also reports no errors.
Can somebody with more CUDA experience give me some pointers how to track down the race condition here?
__global__ void kernel(...) {
extern __shared__ float s_data[];
// just use global memory for now
// get threadID:
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if(idx >= nPairs) return;
// ... do some calculations to get x/y...
// calculate vdx, vdy and vyy
float vdx = (x1 - x0)/(float)xheight;
float vdy = (y1 - y0)/(float)xheight;
float vyy = 0.5*(y0 + y1)/(float)xheight;
const int offs1 = degree + 1;
const int offs2 = offs1 * offs1;
const int offs3 = offs2 * offs1;
float sol = 1.0;
// now calculate monomial results and store in shared memory
for(int evdx = 0; evdx <= degree; evdx++) {
for(int evdy = 0; evdy <= degree; evdy++) {
for(int evyy = 0; evyy <= degree; evyy++) {
for(int evbl = 0; evbl <= degree; evbl++) {
s = powf(vdx, evdx) + powf(vdy, evdy) + powf(vyy, evyy);
atomicAdd(&(s_data[evbl + offs1*evyy + offs2*evdy +
offs3*evdx]), sol/1000.0 );
}
}
}
}
// now copy shared memory to global
__syncthreads();
if(threadIdx.x == 0) {
for(int i = 0; i < nMonomials; i++) {
atomicAdd(&outmD[i], s_data[i]);
}
}
}
You are using shared memory but you are never initializing it.

optimize MSE algorithm using openmp

I wanted to optimize below code using openMP
double val;
double m_y = 0.0f;
double m_u = 0.0f;
double m_v = 0.0f;
#define _MSE(m, t) \
val = refData[t] - calData[t]; \
m += val*val;
#pragma omp parallel
{
#pragma omp for
for( i=0; i<(width*height)/2; i++ ) { //yuv422: 2 pixels at a time
_MSE(m_u, 0);
_MSE(m_y, 1);
_MSE(m_v, 2);
_MSE(m_y, 3);
#pragma omp reduction(+:refData) reduction(+:calData)
refData += 4;
calData += 4;
// int id = omp_get_thread_num();
//printf("Thread %d performed %d iterations of the loop\n",id ,i);
}
}
Any suggestion welcome for optimizing above code currently I have wrong output.
I think the easiest thing you can do is allow it to split into 4 threads, and calculate the UYVY errors in each of those. Instead of making them separate values, make them an array:
double sqError[4] = {0};
const int numBytes = width * height * 2;
#pragma omp parallel for
for( int elem = 0; elem < 4; elem++ ) {
for( int i = elem; i < numBytes; i += 4 ) {
int val = refData[i] - calData[i];
sqError[elem] += (double)(val*val);
}
}
This way, each thread operates exclusively on one thing and there is no contention.
Maybe it's not the most advanced use of OMP, but you should see a speedup.
After your comment about performance hit, I did some experiments and found that indeed the performance was worse. I suspect this may be due to cache misses.
You said:
performance hit this time with openMP : Time :0.040637 with serial
Time :0.018670
So I reworked it using the reduction on each variable and using a single loop:
#pragma omp parallel for reduction(+:e0) reduction(+:e1) reduction(+:e2) reduction(+:e3)
for( int i = 0; i < numBytes; i += 4 ) {
int val = refData[i] - calData[i];
e0 += (double)(val*val);
val = refData[i+1] - calData[i+1];
e1 += (double)(val*val);
val = refData[i+2] - calData[i+2];
e2 += (double)(val*val);
val = refData[i+3] - calData[i+3];
e3 += (double)(val*val);
}
With my test case on a 4-core machine, I observed a little less than 4-fold improvement:
serial: 2025 ms
omp with 2 loops: 6850 ms
omp with reduction: 455 ms
[Edit] On the subject of why the first piece of code performed worse than the non-parallel version, Hristo Iliev said:
Your first piece of code is a terrible example of what false sharing
does in multithreaded codes. As sqError has only 4 elements of 8 bytes
each, it fits in a single cache line (even in a half cache line on
modern x86 CPUs). With 4 threads constantly writing to neighbouring
elements, this would generate a massive amount of inter-core cache
invalidation due to false sharing. One can get around this by using
instead a structure like this struct _error { double val; double
pad[7]; } sqError[4]; Now each sqError[i].val will be in a separate
cache line, hence no false sharing.
The code looks like it's calculating the MSE but adding to the same sum, m. For parallelism to work properly, you need to eliminate sharing of m, one approach would be preallocating an array (width*height/2 I imagine) just to store the different sums, or ms. Finally, add up all the sums at the end.
Also, test that this is actually faster!

Resources