How to flatten a non-perfect loop in Vitis HLS - vivado-hls

My project is to encode an input string into an integer vector, and I already have an encoding method: I create a lookup table, stream the input string in, compare each char of the input string with the char key of the lookup table, fetch the corresponding vector value, and add the vectors up. Here is an example:
lookup table
$ {1, -1, -1, 1, -1, ...., 1}
c {1, -1, 1, 1, 1, ...., 1}
C {1, 1, -1, 1, -1, ...., 1}
....
* {-1, 1, -1, 1, -1, ...., 1}
& {1, 1, -1, 1, 1, ...., 1}
input[in_len] = {'*','C','(','=','O',')','[','C','#','H',']','(','C','C','C','C','N','C','(','=','O',')','O','C','C','O','C',')','N','C','(','=','O',')','O','C','C','O','C'};
""
Here is my code:
void ENCODING_HV(FPGA_DATA input[in_len],
                 FPGA_DATA lookup_key[NUM_TOKEN],
                 LOOKUP_DATA lookup_HV[NUM_TOKEN],
                 int result_HV[CHUNK_NUM][CHUNK_SIZE]){
    for(int i = 0; i < in_len; i++){
        for(int j = 0; j < NUM_TOKEN; j++){
            if(input[i] == lookup_key[j]){
                for(int k = 0; k < CHUNK_NUM; k++){
                    //#pragma HLS PIPELINE II=1
                    for(int l = 0; l < CHUNK_SIZE; l++){
#pragma HLS ARRAY_RESHAPE variable=lookup_HV complete dim=3
#pragma HLS ARRAY_RESHAPE variable=result complete dim=2
                        if(lookup_HV[j].map[k][l] == -1)
                            result_HV[k][l] = result_HV[k][l] + 1;
                        else
                            result_HV[k][l] = result_HV[k][l] - 1;
                    }
                }
            }
        }
    }
}
In the second for loop I have an if statement that compares the input char with the key char of the lookup table. Vitis HLS reports "Cannot flatten loop 'VITIS_LOOP_60_2'", and synthesis takes a long time. Could anyone give me an idea how to fix it?
Thank you
WARNING: [HLS 200-960] Cannot flatten loop 'VITIS_LOOP_60_2' (Stream_Interface/HLS_scholarly/data_tokenize_2.cpp:60:28) in function 'create_sample_HV' the outer loop is not a perfect loop because there is nontrivial logic before entering the inner loop.
Resolution: For help on HLS 200-960 see www.xilinx.com/cgi-bin/docs/rdoc?v=2021.1;t=hls+guidance;d=200-960.html

Disclaimer: since you aren't providing the full code, I cannot test my solution myself.
Performance-wise, as a rule of thumb, if-statements are "cheap" in hardware and HLS, since most of the time they resolve into MUXes. That is not true in software, where branches cause pipeline discontinuities.
In your algorithm, the if-statement guards two nested for-loops and may skip them entirely. In HLS, however, for-loops cannot simply be "skipped", because they end up representing physical hardware components. Hence, looking at your code, I would move the if-statement inside the nested for-loops, since doing so does not change the result of the algorithm.
A final solution might therefore be:
void ENCODING_HV(FPGA_DATA input[in_len],
                 FPGA_DATA lookup_key[NUM_TOKEN],
                 LOOKUP_DATA lookup_HV[NUM_TOKEN],
                 int result_HV[CHUNK_NUM][CHUNK_SIZE]){
#pragma HLS ARRAY_RESHAPE variable=lookup_HV complete dim=3
#pragma HLS ARRAY_RESHAPE variable=result_HV complete dim=2
    for(int i = 0; i < in_len; i++){
        for(int j = 0; j < NUM_TOKEN; j++){
            for(int k = 0; k < CHUNK_NUM; k++){
#pragma HLS PIPELINE II=1
                for(int l = 0; l < CHUNK_SIZE; l++){
                    if(input[i] == lookup_key[j]){
                        if(lookup_HV[j].map[k][l] == -1)
                            result_HV[k][l] = result_HV[k][l] + 1;
                        else
                            result_HV[k][l] = result_HV[k][l] - 1;
                    }
                }
            }
        }
    }
}
Side note: with all the if-statements inside the nested for-loops, you can even move the PIPELINE pragma up one loop level so that the innermost for-loops get fully unrolled.
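For illustration, here is a sketch of that variant (same function as above, not tested against the full project): pipelining the j loop makes Vitis HLS fully unroll every loop nested below it, i.e. both the k and l loops. Whether II=1 is actually achieved depends on how result_HV is partitioned and on the read-modify-write dependence on it.
void ENCODING_HV(FPGA_DATA input[in_len],
                 FPGA_DATA lookup_key[NUM_TOKEN],
                 LOOKUP_DATA lookup_HV[NUM_TOKEN],
                 int result_HV[CHUNK_NUM][CHUNK_SIZE]){
#pragma HLS ARRAY_RESHAPE variable=lookup_HV complete dim=3
#pragma HLS ARRAY_RESHAPE variable=result_HV complete dim=2
    for(int i = 0; i < in_len; i++){
        for(int j = 0; j < NUM_TOKEN; j++){
#pragma HLS PIPELINE II=1
            // pipelining here fully unrolls the k and l loops below
            for(int k = 0; k < CHUNK_NUM; k++){
                for(int l = 0; l < CHUNK_SIZE; l++){
                    if(input[i] == lookup_key[j]){
                        if(lookup_HV[j].map[k][l] == -1)
                            result_HV[k][l] = result_HV[k][l] + 1;
                        else
                            result_HV[k][l] = result_HV[k][l] - 1;
                    }
                }
            }
        }
    }
}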

Related

Search of max value and index in a vector

I'm trying to parallelize this piece of code that searches for the max on a column.
The problem is that the parallelized version runs slower than the serial one.
Probably the search of the pivot (max on a column) is slower due to the synchronization on the maximum value and the index, right?
int i, j, t, k;
// Decrease the dimension of a factor 1 and iterate each time
for (i = 0, j = 0; i < rwA, j < cwA; i++, j++) {
    int i_max = i; // max index set as i
    double matrixA_maxCw_value = fabs(matrixA[i_max][j]);
    #pragma omp parallel for reduction(max:matrixA_maxCw_value,i_max) //OVERHEAD
    for (t = i+1; t < rwA; t++) {
        if (fabs(matrixA[t][j]) > matrixA_maxCw_value) {
            matrixA_maxCw_value = matrixA[t][j];
            i_max = t;
        }
    }
    if (matrixA[i_max][j] == 0) {
        j++; //Check if there is a pivot in the column, if not pass to the next column
    }
    else {
        //Swap the rows, of A, L and P
        #pragma omp parallel for //OVERHEAD
        for (k = 0; k < cwA; k++) {
            swapRows(matrixA, i, k, i_max);
            swapRows(P, i, k, i_max);
            if (k < i) {
                swapRows(L, i, k, i_max);
            }
        }
        lupFactorization(matrixA, L, i, j, rwA);
    }
}

void swapRows(double **matrixA, int i, int j, int i_max) {
    double temp_val = matrixA[i][j];
    matrixA[i][j] = matrixA[i_max][j];
    matrixA[i_max][j] = temp_val;
}
I do not want different code; I only want to know why this happens: on a 1000x1000 matrix the serial version takes 4.1 s and the parallelized version 4.28 s.
The same thing happens with the row swap (the overhead is very small, but it is there), which in theory can be done in parallel without problems. Why does this happen?
There are at least two things wrong with your parallelization.
#pragma omp parallel for reduction(max:matrixA_maxCw_value,i_max) //OVERHEAD
for (t = i+1; t < rwA; t++) {
    if (fabs(matrixA[t][j]) > matrixA_maxCw_value) {
        matrixA_maxCw_value = matrixA[t][j];
        i_max = t;
    }
}
You are getting the biggest index of all of them, but that does not mean it belongs to the max value. For instance, consider the following array:
[8, 7, 6, 5, 4 ,3, 2 , 1]
If you parallelize with two threads, the first thread will have max=8 and index=0, and the second thread will have max=4 and index=4. After the reduction is done, the max will be 8 but the index will be 4, which is obviously wrong.
OpenMP has built-in reduction operators that handle a single target value; in your case, however, you want to reduce over two values at once, the max and its array index. Since OpenMP 4.0 you can define your own reduction (a user-defined reduction).
You can have a look at a full example implementing such logic here
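For illustration, here is a minimal sketch of such a user-defined reduction that carries both the maximum absolute value and its row index (assumes OpenMP 4.0 or later; the function name find_pivot and the struct are illustrative, not taken from your code):
#include <math.h>

typedef struct { double val; int idx; } maxloc_t;

/* combine two partial results: keep the one with the larger value */
#pragma omp declare reduction(maxloc : maxloc_t : \
        omp_out = (omp_in.val > omp_out.val ? omp_in : omp_out)) \
        initializer(omp_priv = { -1.0, -1 })

int find_pivot(double **matrixA, int rwA, int i, int j)
{
    maxloc_t best = { fabs(matrixA[i][j]), i };

    #pragma omp parallel for reduction(maxloc : best)
    for (int t = i + 1; t < rwA; t++) {
        if (fabs(matrixA[t][j]) > best.val) {   /* each thread updates its private copy */
            best.val = fabs(matrixA[t][j]);
            best.idx = t;
        }
    }
    return best.idx;   /* row index of the pivot (largest |value|) in column j */
}
This way the value and the index travel together, so the reduction can never pair the max of one thread with the index found by another.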
The other issue is this part:
#pragma omp parallel for //OVERHEAD
for (k = 0; k < cwA; k++) {
    swapRows(matrixA, i, k, i_max);
    swapRows(P, i, k, i_max);
    if (k < i) {
        swapRows(L, i, k, i_max);
    }
}
You are swapping those elements in parallel, which leads to an inconsistent state.
First you need to solve those issues before analyzing why your code gets no speedup.
First correctness, then efficiency. But don't expect much speedup from the current implementation: the amount of computation performed in parallel is not enough to justify the overhead of the parallelism.

Given a list of prime ordered pairs, combine two lines with like terms to create an ordered triplet

Using C, I have generated a list of prime numbers up to 10 million, and from that list created another list consisting of ordered pairs [x, y] such that the difference between x and y is six.
However, when I try to write code that generates [x, y, z] triples or [w, x, y, z] quadruples up to 10 million, my computer simply isn't powerful enough to finish the job in a reasonable amount of time.
My idea is to write code that opens the text file of ordered pairs, recognizes a repeated number on two separate lines, and combines those lines like so:
[5, 11]
[11, 17]
[17, 23]
[23, 29]
would turn into [5, 11, 17, 23, 29]
How can I do this in C?
edit: Here's my code to generate progressions of 5 numbers that differ by 6:
#include <stdio.h>
#include <stdlib.h>   /* for abs() */

int main(void)
{
    freopen("test.txt", "w+", stdout);
    FILE *twinPrimes;
    twinPrimes = fopen("primes.txt", "r");
    //read file into array
    int numberArray[664579];
    int i;
    int j;
    int k;
    int l;
    int m;
    for (i = 0; i < 664579; i++)
    {
        fscanf(twinPrimes, "%d", &numberArray[i]);
    }
    for (i = 0; i < 664579 - 20; i++) {   /* -20 keeps the deepest index (at most i+20) in bounds */
        for (j = i+1; j < i+6; j++) {
            for (k = j+1; k < j+6; k++) {
                for (l = k+1; l < k+6; l++) {
                    for (m = l+1; m < l+6; m++) {
                        if (abs(numberArray[i] - numberArray[j]) == 6) {
                            if (abs(numberArray[j] - numberArray[k]) == 6) {
                                if (abs(numberArray[k] - numberArray[l]) == 6) {
                                    if (abs(numberArray[l] - numberArray[m]) == 6) {
                                        printf("[%d, %d, %d, %d, %d]\n",
                                               numberArray[i], numberArray[j], numberArray[k],
                                               numberArray[l], numberArray[m]);
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
    fclose(twinPrimes);
    return 0;
}
However, the problem I'm still having is that I can't generate longer progressions without slowing down my computer or having it take extremely long to finish.
I do have a text file with a list of ordered pairs of primes that differ by 6, though. I think a MUCH more efficient program can be written by telling it to take the union of two lines with a common element, as shown above. However, I have no clue where to start with this code. Any help is appreciated.
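For reference, here is one possible sketch of that idea (not the asker's code; it assumes the primes are in "primes.txt" and that 10 million is the limit, as in the question): instead of merging lines of the pair file, mark every prime in a lookup table and walk each chain p, p+6, p+12, ... directly, printing runs of length 5 or more. Every prime is visited only once. Lower the len >= 5 threshold to get triples or quadruples instead.
#include <stdio.h>
#include <stdlib.h>

#define LIMIT 10000000   /* assumed upper bound, as in the question */

int main(void)
{
    char *isPrime = calloc(LIMIT + 1, 1);   /* isPrime[v] == 1 iff v is prime */
    FILE *primes = fopen("primes.txt", "r");
    int p;

    if (!isPrime || !primes) return 1;
    while (fscanf(primes, "%d", &p) == 1)
        if (p >= 0 && p <= LIMIT)
            isPrime[p] = 1;
    fclose(primes);

    for (p = 2; p <= LIMIT; p++) {
        /* only start at the first element of a chain */
        if (!isPrime[p] || (p > 6 && isPrime[p - 6]))
            continue;
        int len = 1;
        while (p + 6 * len <= LIMIT && isPrime[p + 6 * len])
            len++;
        if (len >= 5) {                      /* print the whole progression */
            printf("[%d", p);
            for (int i = 1; i < len; i++)
                printf(", %d", p + 6 * i);
            printf("]\n");
        }
    }
    free(isPrime);
    return 0;
}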

OpenMP - Why does the number of comparisons decrease?

I have the following algorithm:
int hostMatch(long *comparisons)
{
    int i = -1;
    int lastI = textLength - patternLength;
    *comparisons = 0;

    #pragma omp parallel for schedule(static, 1) num_threads(1)
    for (int k = 0; k <= lastI; k++)
    {
        int j;
        for (j = 0; j < patternLength; j++)
        {
            (*comparisons)++;
            if (textData[k+j] != patternData[j])
            {
                j = patternLength+1; //break
            }
        }
        if (j == patternLength && k > i)
            i = k;
    }
    return i;
}
When changing num_threads I get the following results for number of comparisons:
01 = 9949051000
02 = 4992868032
04 = 2504446034
08 = 1268943748
16 = 776868269
32 = 449834474
64 = 258963324
Why is the number of comparisons not constant? Interestingly, the number of comparisons halves when the number of threads doubles. Is there some sort of race condition going on for (*comparisons)++ where OMP just skips the increment if the variable is in use?
My current understanding is that the iterations of the k loop are split near-evenly amongst the threads. Each iteration has a private integer j as well as a private copy of integer k, and a non-parallel for loop which adds to the comparisons until terminated.
The naive way around the race condition is to declare the operation as an atomic update:
#pragma omp atomic update
(*comparisons)++;
Note that a critical section here is unnecessary and much more expensive. An atomic update can be declared on a primitive binary or unary operation on any l-value expression with scalar type.
Yet this is still not optimal, because the value of *comparisons has to bounce between CPU caches all the time and an expensive locked instruction is performed. Instead you should use a reduction. For that you need an additional local variable; the pointer won't work here.
int hostMatch(long *comparisons)
{
    int i = -1;
    int lastI = textLength - patternLength;
    long comparisons_tmp = 0;

    #pragma omp parallel for reduction(+:comparisons_tmp)
    for (int k = 0; k <= lastI; k++)
    {
        int j;
        for (j = 0; j < patternLength; j++)
        {
            comparisons_tmp++;
            if (textData[k+j] != patternData[j])
            {
                j = patternLength+1; //break
            }
        }
        if (j == patternLength && k > i)
            i = k;
    }
    *comparisons = comparisons_tmp;
    return i;
}
P.S. schedule(static, 1) seems like a bad idea, since it will lead to inefficient memory access patterns on textData. Just leave it out and let the compiler do its thing. If a measurement shows that it's not working efficiently, give it some better hints.
You said it yourself: (*comparisons)++; has a race condition. It is a critical section that has to be serialized (I don't think (*pointer)++ is an atomic operation).
So basically two threads read the same value (e.g. 2), both increment it (to 3), and write it back, so you get 3 instead of 4. You have to make sure that operations on variables that are not local to your parallelized function/loop don't overlap.
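A tiny self-contained demonstration of that lost-update effect (hypothetical example, not the asker's code; compile with -fopenmp, preferably without heavy optimization so each increment really is a separate read-modify-write):
#include <stdio.h>

int main(void)
{
    long counter = 0;                   /* shared by all threads */
    const long iterations = 1000000;

    #pragma omp parallel for
    for (long i = 0; i < iterations; i++)
        counter++;                      /* racy read-modify-write, updates get lost */

    /* the result is typically smaller than expected and varies run to run */
    printf("expected %ld, got %ld\n", iterations, counter);
    return 0;
}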

OpenMP and 17 Nested For-Loops

I have a giant nested for-loop, designed to set a large array to its default value. I'm trying to use OpenMP for the first time to parallelize it, and have no idea where to begin. I have been reading tutorials, and am afraid the process will be performed independently on N cores, instead of the N cores dividing the process amongst themselves to produce a single common output. The code is in C, compiled in Visual Studio v14. Any help for this newbie is appreciated -- thanks!
(Attached below is the monster nested for-loop...)
for (j = 0; j < box1; j++)
{
  for (k = 0; k < box2; k++)
  {
    for (l = 0; l < box3; l++)
    {
      for (m = 0; m < box4; m++)
      {
        for (x = 0; x < box5; x++)
        {
          for (y = 0; y < box6; y++)
          {
            for (xa = 0; xa < box7; xa++)
            {
              for (xb = 0; xb < box8; xb++)
              {
                for (nb = 0; nb < memvara; nb++)
                {
                  for (na = 0; na < memvarb; na++)
                  {
                    for (nx = 0; nx < memvarc; nx++)
                    {
                      for (nx1 = 0; nx1 < memvard; nx1++)
                      {
                        for (naa = 0; naa < adirect; naa++)
                        {
                          for (nbb = 0; nbb < tdirect; nbb++)
                          {
                            for (ncc = 0; ncc < fs; ncc++)
                            {
                              for (ndd = 0; ndd < bs; ndd++)
                              {
                                for (o = 0; o < outputnum; o++)
                                {
                                  lookup->n[j][k][l][m][x][y][xa][xb][nb][na][nx][nx1][naa][nbb][ncc][ndd][o] = -3; //set to default value
                                }
                              }
                            }
                          }
                        }
                      }
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
If n is actually a multidimensional array, you can do this:
size_t i;
size_t count = sizeof(lookup->n) / sizeof(int);
int *p = (int*)lookup->n;

for (i = 0; i < count; i++)
{
    p[i] = -3;
}
Now, that's much easier to parallelize.
Read more on why this works here (applies to C as well): How do I use arrays in C++?
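A possible follow-up sketch (assuming lookup->n really is a plain multidimensional array of int, as above): once the loop is flat, a single OpenMP directive splits the whole initialization across threads. A signed loop index is used because OpenMP 2.0 requires one; if your compiler rejects long long here, fall back to int (sufficient as long as the total element count fits).
long long count = (long long)(sizeof(lookup->n) / sizeof(int));
int *p = (int*)lookup->n;
long long i;

#pragma omp parallel for
for (i = 0; i < count; i++)
{
    p[i] = -3;
}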
This is more of an extended comment than an answer.
Find the iteration limit (i.e. the variable among box1, box2, etc.) with the largest value. Revise your loop nest so that the outermost loop runs over that, and simply parallelise that outermost loop. Choosing the largest value means that you'll get, in the limit, an equal number of inner-loop iterations to run for each thread.
Collapsing loops, whether you can use OpenMP's collapse clause or have to do it by hand, is only useful when you have reason to believe that parallelising over only the outermost loop will result in significant load imbalance. That seems very unlikely in this case, so distributing the work (approximately) evenly across the available threads at the outermost level would probably provide reasonably good load balancing.
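A minimal self-contained analogue of that suggestion (hypothetical sizes, not the asker's variables): only the outermost loop, chosen to be the one with the largest trip count, carries the parallelism; the inner loops run sequentially inside each thread.
#include <stdio.h>

#define BOX_A 64   /* largest extent, made the outermost loop */
#define BOX_B 8
#define BOX_C 4

static int table[BOX_A][BOX_B][BOX_C];

int main(void)
{
    int a, b, c;

    #pragma omp parallel for private(b, c)
    for (a = 0; a < BOX_A; a++)          /* iterations split across threads */
        for (b = 0; b < BOX_B; b++)      /* inner loops run sequentially */
            for (c = 0; c < BOX_C; c++)  /* within each thread */
                table[a][b][c] = -3;

    printf("%d\n", table[BOX_A - 1][BOX_B - 1][BOX_C - 1]);
    return 0;
}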
I believe, based on tertiary research, that the solution might be found in adding #pragma omp parallel for collapse(N) directly above the nested loops. However, this only works in OpenMP v3.0 and later, and the whole project is based on Visual Studio (and therefore OpenMP v2.0) for now...

Parallelizing giving wrong output

I've got some problems trying to parallelize an algorithm. The intention is to apply some modifications to a 100x100 matrix. When I run the algorithm without OpenMP, everything runs smoothly in about 34-35 seconds; when I parallelize it on 2 threads (I need it to be with 2 threads only) it gets down to about 22 seconds, but the output is wrong, and I think it's a synchronization problem that I cannot fix.
Here's the code :
for (p = 0; p < sapt; p++){
    memset(count, 0, Nc*sizeof(int));
    for (i = 0; i < N; i++){
        for (j = 0; j < N; j++){
            for (m = 0; m < Nc; m++)
                dist[m] = N+1;
            omp_set_num_threads(2);
            #pragma omp parallel for shared(configurationMatrix, dist) private(k,m) schedule(static,chunk)
            for (k = 0; k < N; k++){
                for (m = 0; m < N; m++){
                    if (i == k && j == m)
                        continue;
                    if (MAX(abs(i-k),abs(j-m)) < dist[configurationMatrix[k][m]])
                        dist[configurationMatrix[k][m]] = MAX(abs(i-k),abs(j-m));
                }
            }
            int max = -1;
            for (m = 0; m < Nc; m++){
                if (dist[m] == N+1)
                    continue;
                if (dist[m] > max){
                    max = dist[m];
                    configurationMatrix2[i][j] = m;
                }
            }
        }
    }
    memcpy(configurationMatrix, configurationMatrix2, N*N*sizeof(int));
    #pragma omp parallel for shared(count, configurationMatrix) private(i,j)
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            count[configurationMatrix[i][j]]++;
    for (i = 0; i < Nc; i++)
        fprintf(out, "%i ", count[i]);
    fprintf(out, "\n");
}
Where: sapt = 100;
count -> a vector that holds how many of each element the matrix contains at each step
(e.g. count[1] = 60 --> the element '1' appears 60 times in my matrix, and so on);
dist -> a vector that holds the max distance from element (i,j) of, say, value K to element (k,m) of the same value K
(e.g. dist[1] = 10 --> the distance from the element of value 1 to the furthest element of value 1).
Then I write the results to an output file, but again, the output is wrong.
If I understand your code correctly this line
count[configurationMatrix[i][j]] ++;
increments count at the element whose index is at configurationMatrix[i][j]. I don't see that your code takes any steps to ensure that threads are not simultaneously trying to increment the same element of count. It's entirely feasible that two different elements of configurationMatrix provide the same index into count and that those two elements are handled by different threads. Since ++ is not an atomic operation your code has a data race; multiple threads can contend for update access to the same variable and you lose any guarantees of correctness, or determinism, in the result.
I think you may have other examples of the same problem in other parts of your code too. You are silent on the errors you observe in the results of the parallel program compared with the results from the serial program yet those errors are often very useful in diagnosing a problem. For example, if the results of the parallel program are not the same every time you run it, that is very suggestive of a data race somewhere in your code.
How to fix this? Since you only have 2 threads, the easiest fix would be to not parallelise this part of the program. You could wrap the data race inside an OpenMP critical section, but that's really just another way of serialising your code. Finally, you could possibly modify your algorithm and data structures to avoid this problem entirely.
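For completeness, a sketch of one more way to make the count update correct (assumes OpenMP 4.5 or later, which supports array-section reductions; the identifiers are the ones from the question): each thread gets a private copy of count, and the copies are summed at the end.
#pragma omp parallel for private(i,j) reduction(+ : count[:Nc])
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        count[configurationMatrix[i][j]]++;   /* now increments a thread-private copy */
If your compiler does not support array-section reductions, putting #pragma omp atomic update directly above the increment is a simpler (though slower) correct alternative.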
