OpenMP - make array counter with thread different start number - c

I am strugling with quite tricky (I guess) problem with parallel file reading.
Right now I have maped file using mmap and I want it to read values and put into three arrays. Well, maybe explanations isn't so clear, so this is my current code:
int counter = header;
#pragma omp parallel for shared (red,green,blue,map) private (i,j) firstprivate(counter)
for(i = 0; i < x; i++)
{
for(j = 0; j < y; j++)
{
red[i][j] = map[counter];
green[i][j] = map[counter+1];
blue[i][j] = map[counter+2];
counter+=3;
}
}
the header is beginning of a file - just image related info like size and some comments.
The problem here is, that counter has to be private. What I am finding difficult is to come up with dividing the counter across threads with different start numbers.
Can someone give me a hint how it can be achieved?

You could, if I've understood your code correctly, leave counter as a shared variable and enclose the update to it in a single section, something like
blue[i][j] = map[counter+2];
#pragma omp single
{
counter+=3;
}
}
I write something like because I haven't carefully checked the syntax nor considered the usefulness of some of the optional clauses for the single directive. I suspect that this might prove a real drag on performance.
An alternative would be to radically re-order your loop nest, perhaps (and again this is not carefully checked):
#pragma omp parallel shared (red,green,blue,map) private (i,j)
#pragma for collapse(2) schedule(hmmm, this needs some thought)
for(counter = 0; counter < counter_max; counter += 3)
{
for(i = 0; i < x; i++)
{
for(j = 0; j < y; j++)
{
red[i][j] = map[counter];
}
}
}
... loop for blue now ...
... loop for green now ...
Note that in this version OpenMP will collapse the first two loops into one iteration space, I expect this will provide better performance than uncollapsed looping but it's worth experimenting to figure that out. It's probably also worth experimenting with the schedule clause.

If I understood well your problem is that of initializing counter to different values for different threads.
If this is the case, then it can be solved like the linearization of matrix indexes:
size_t getLineStart(size_t ii, size_t start, size_t length) {
return start + ii*length*3;
}
int counter;
#pragma omp parallel for shared (red,green,blue,map) private (i,j) private(counter)
for(size_t ii = 0; ii < x; ii++) {
////////
// Get the starting point for the current iteration
counter = getLineStart(ii,header,y);
////////
for(size_t jj = 0; jj < y; jj++) {
red [i][j] = map[counter ];
green[i][j] = map[counter+1];
blue [i][j] = map[counter+2];
counter+=3;
}
}

These kind of counters are useful when it's hard to know how many values will be counted but then they don't work easily with OpenMP. But in your case it's trivial to figure out what the index is: 3*(y*i+j). I suggest fusing the loop and doing something like this:
#pragma omp parallel for
for(n=0; n<x*y; n++) {
(&red [0][0])[n] = map[3*n+header+0];
(&green[0][0])[n] = map[3*n+header+1];
(&blue [0][0])[n] = map[3*n+header+2];
}

Related

Parallelizing inner loop with residual calculations in OpenMP with SSE vectorization

I'm trying to parallelizing the inner loop of a program that has data dependencies (min) outside the scope of the loops. I'm having an issue where the residual calculations occuring outside the scope of the inner j loop. The code gets errors if the "#pragma omp parallel" part is included on the j loop even if the loop doesn't run at all due to a k value being too low. say (1,2,3) for example.
for (i = 0; i < 10; i++)
{
#pragma omp parallel for shared(min) private (j, a, b, storer, arr) //
for (j = 0; j < k-4; j += 4)
{
mm_a = _mm_load_ps(&x[j]);
mm_b = _mm_load_ps(&y[j]);
mm_a = _mm_add_ps(mm_a, mm_b);
_mm_store_ps(storer, mm_a);
#pragma omp critical
{
if (storer[0] < min)
{
min = storer[0];
}
if (storer[1] < min)
{
min = storer[1];
}
//etc
}
}
do
{
#pragma omp critical
{
if (x[j]+y[j] < min)
{
min = x[j]+y[j];
}
}
}
} while (j++ < (k - 1));
round_min = min
}
The j-based loop is a parallel loop so you cannot use j after the loop. This is especially true since you explicitly put j as private, so only visible locally in the thread but not outside the parallel region. You can explicitly compute the position of the remaining j value using (k-4+3)/4*4 just after the parallel loop.
Furthermore, here is few important points:
You may not really need to vectorize the code yourself: you can use omp simd reduction. OpenMP can do all the boring job of computing the residual calculations for you automatically. Moreover, the code will be portable and much simpler. The generated code may also likely be faster than yours. Note however that some compilers might not be able to vectorize the code (GCC and ICC does, while Clang and MSVC often need some help).
Critical section (omp critical) are very costly. In your case this will just annihilate any possible improvement related to the parallel section. The code will likely be slower due to cache-line bouncing.
Reading data written by _mm_store_ps is inefficient here although some compiler (like GCC) may be able to understand the logic of your code and generate a faster implementation (extracting lane data).
Horizontal SIMD reductions inefficient. Use vertical ones that are much faster and that can be easily used here.
Here is a corrected code taking into account the above points:
for (i = 0; i < 10; i++)
{
// Assume min is already initialized correctly here
#pragma omp parallel for simd reduction(min:min) private(j)
for (j = 0; j < k; ++j)
{
const float tmp = x[j] + y[j];
if(tmp < min)
min = tmp;
}
// Use min here
}
The above code is vectorized correctly on x86 architecture on GCC/ICC (both with -O3 -fopenmp), Clang (with -O3 -fopenmp -ffastmath) and MSVC (with /O2 /fp:precise -openmp:experimental).

Paralleling two loops Vivado HLS

For sake of exemplifying, I'm having a loop in which contents of an array are read and written in once cycle per iteration.
int my_array[10] = ...;
for(i=0; i<10; i++) {
if(i<5) {
my_array[i*2] = func_b(my_array[i*2]);//func_b takes double the time of func_a, but it also runs on 1/2 of the time.
}
func_a(myarray[i]);//func_a is executed quickly
}
same element is accessed for read/write operation only on i=0, with proper delay and synchronization algorithm should allow for parallelization. I'm struggling to find proper HLS pragma's to force them to run in parallel like so
//#pragma HLS to delay by two cycles
for(i=0; i<10; i++) {
func_a(my_array[i]);
}
//#pragma HLS to allow to run each iteration in parallel with the first loop, if possible in two cycles
for(i=0; i<5; i++) {
my_array[i*2] = func_b(my_array[i]);//func_b might be split into two halves for each to fit into one cycle.
}
ideally first loop should finish two cycles after the second one has completed, and in that way read correct value of the last my_array element. I might have a misconception, but I would hope that dividing work in such way (across two cycles), should increase clock speed?
The rest of the generated Verilog code is fine although it is cryptic enough that I would rather stick with C than to try modify what HDL was generated. Any suggestions on how to parallelize it or if it's doable?
As a first note, the clock speed is not necessary affected by whether the loops run in parallel or not. What might improve is the latency of the algorithm, i.e. the total amount of cycles required to run it.
There are elements processed by both functions in the same cycle. So unless you rewrite the functions (perhaps merging them together), you have a data dependency which you cannot break (or your algorithm will simply fail). Check the indexes you are accessing and check if the order of the operations remains the same:
for(int i = 0; i < 10; i++) {
if (i < 5) {
// Accessing indexes: 0, 2, 4, 6, 8
my_array[i * 2] = func_b(my_array[i * 2]);
}
// Accessing indexes: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
func_a(myarray[i]);
}
Your proposed solution, the second code snippet, won't work because is not doing the same thing in the same order as in the first code snippet (that must be valid if you just run the C/C++ code, which must have the same outcome).
However, one thing you can try is to put a DATAFLOW pragma like so:
#pragma HLS DATAFLOW
for(i = 0; i < 5; i++) {
my_array[i * 2] = func_b(my_array[i * 2]);
}
for(i = 0; i < 10; i++) {
func_a(my_array[i]);
}
But I'm not sure whether HLS will correctly handle the my_array buffer, since it would be shared by both the loops/processes.
One final idea I have to fully allow DATAFLOW, i.e. by creating a producer-consumer algorithm, is is to split the my_array into two different buffers, one for each part of the algorithm (i even or odd), like so:
int my_array_evens[5] = ...;
int my_array_odds[5] = ...;
#pragma HLS DATAFLOW
for(i = 0; i < 5; i++) {
my_array_evens[i * 2] = func_b(my_array_evens[i * 2]); // Dataflow producer
}
for(i = 0; i < 10; i++) {
if (i % 2 == 0) {
func_a(my_array_evens[i]); // Dataflow consumer
} else {
func_a(my_array_odds[i]);
}
}
NOTE: If the functions are relatively small, add the PIPELINE pragmas in the loops to 'speed' things up.

How can I best "parallelise" a set of four nested for()-loops in a Brute-Force attack?

I have the following homework task:
I need to brute force 4-char passphrase with the following mask
%%##
( where # - is a numeric character, % - is an alpha character )
in several threads using OpenMP.
Here is a piece of code, but I'm not sure if it is doing the right thing:
int i, j, m, n;
const char alph[26] = "abcdefghijklmnopqrstuvwxyz";
const char num[10] = "0123456789";
#pragma omp parallel for private(pass) schedule(dynamic) collapse(4)
for (i = 0; i < 26; i++)
for (j = 0; j < 26; j++)
for (m = 0; m < 10; m++)
for (n = 0; n < 10; n++) {
pass[0] = alph[i];
pass[1] = alph[j];
pass[2] = num[m];
pass[3] = num[n];
/* Working with pass here */
}
So my question is :
How to correctly specify the "parallel for" instruction, in order to split the range of passphrases between several cores?
Help is much appreciated.
Your code is pretty much right, except for using alph instead of num. If you're able to define the pass variable within the loop, that'll save you many a headache.
A full MWE might look like:
//Compile with, e.g.: gcc -O3 temp.c -std=c99 -fopenmp
#include <stdio.h>
#include <unistd.h>
#include <string.h>
int PassCheck(char *pass){
usleep(50); //Sleep for 100 microseconds to simulate work
return strncmp(pass, "qr34", 4)==0;
}
int main(){
const char alph[27] = "abcdefghijklmnopqrstuvwxyz";
const char num[11] = "0123456789";
char goodpass[5] = "----"; //Provide a default password to indicate an error state
int i, j, m, n;
#pragma omp parallel for collapse(4)
for (i = 0; i < 26; i++)
for (j = 0; j < 26; j++)
for (m = 0; m < 10; m++)
for (n = 0; n < 10; n++){
char pass[4];
pass[0] = alph[i];
pass[1] = alph[j];
pass[2] = num[m];
pass[3] = num[n];
if(PassCheck(pass)){
//It is good practice to use `critical` here in case two
//passwords are somehow both valid. This won't arise in
//your code, but is worth thinking about.
#pragma omp critical
{
memcpy(goodpass, pass, 4);
goodpass[4] = '\0';
//#pragma omp cancel for //Escape for loops!
}
}
}
printf("Password was '%s'.\n",goodpass);
return 0;
}
Dynamic scheduling
Using a dynamic schedule here is probably pointless. Your expectation should be that each password will take, on average, about the same amount of time to check. Therefore, each iteration of the loop will take about the same amount of time. Therefore, there is no need to use dynamic scheduling because your loops will remain evenly distributed.
Visual noise
Note that the loop nest is stacked, rather than indented. You'll often see this in code where there are many nested loops as it tends to reduce visual noise.
Breaking early
#pragma omp cancel for is available as of OpenMP 4.0; however, I got a warning using it in this context, so I've commented it out. If you are able to get it working, that'll reduce your run-time by half since all effort is wasted once the correct password has been found and the password will, on average, be located half-way through the search space.
Where the guessed password is generated
One of the commentors suggests moving, e.g. pass[0] so that it is not in the innermost loop. This is a bad idea as doing so will prevent you from using collapse(4). As a result you could parallelize the outer loop, but you run the risk that its iteration count cannot be evenly divided by the number of threads, resulting in a large load imbalance. Alternatively, you could parallelize the inner loop, which exposes you to the same problem plus high synchronization costs each time the loop ends.
Why usleep?
The usleep function causes the code to run slowly. This is intentional; it provides feedback on the effect of parallelism, since the workload is so small.
If I remove the usleep, then the code completes in 0.003s on a single core and 0.004s on 4 cores. You cannot tell that the parallelism is even working. Leaving usleep in gives 8.950s on a single core and 2.257s on 4 cores, an apt demonstration of the effectiveness of the parallelism.
Naturally, you would remove this line once you're sure that parallelism is working correctly.
Further, any actual brute-force password cracker would likely be computing an expensive hash function inside the PassCheck function. Including usleep() here allows us to simulate that function and experiment with high-level design without having to the function first.

openmp data dependency/ wait for var

How to solve following problem with OpenMP:
Edit: Panni is right, my parallel approach doesn't work, so what's a nice way to do this?
Situation:
Function call order:
calc();
move();
function calc() parallelizes a loop via #pragma omp parallel for and changes a array containing structs with x and y values.
function move() parallelizes a loop via #pragma omp for only and accesses the values changed in calc().
Question:
How can I make sure in OpenMP that function 1 has set the values, before I access function 2?
More specific:
#include <...>
static b_t *b;
static void calc() {
#pragma omp parallel for private(j)
for(i = 0; i < n - 1; i++)
for(j = i + 1; j < n; j++) {
b[i].f.x += some_value;
b[i].f.y += some_value;
b[j].f.x -= some_value;
b[j].f.y -= some_value;
}
}
}
static move() {
#pragma omp for
for (i = 0; i < n; i++) {
d.x = b[i].f.x;
d.y = b[i].f.y;
}
}
int main() {
for (i = 0; t < end; t += dt) {
calc();
move();
}
}
So, how can I make sure move() gets only called after calc()? I thought about #pragma omp task depend(in/out) but it doesn't seem to work for me.
Since you're calling the function calc() and move() in sequential order the calc() function will always be executed first.
So the program runs calc() waits until it is completed (no matter how many threads are used) and then continues to run move()
Panni has said exactly what i want to say.
It sucks I still can't comment.
And I don't want to answer for simply like this.
So I guess you may suffer a performance issue for two reasons.
imbalanced workload scheduling on a triangular.
cache miss problem if cache can't hold the whole b.
move func looks weird to me, so I didn't take it into consideration.

Parallelizing a for loop in Visual Studio 2010 (OpenMP)

I've recently reading up about OpenMP and was trying to parallelize some existing for loops in my program to get a speed-up. However, for some reason I seem to be getting garbage data written to the file. What I mean by that is I don't have Points 1,2,3,4 etc. written to my file, I have Points 1,4,7,8 etc. I suspect this is because I am not keeping track of the threads and it just leads to race conditions?
I have been reading as much as I can find about OpenMP, since it seems like a great abstraction to do multi-threaded programing. I'd appreciate any pointers please to get to the bottom of what I might be doing incorrectly.
Here is what I have been trying to do so far (only the relevant bit of code):
#include <omp.h>
pixelIncrement = Image.rowinc/2;
#pragma omp parallel for
for (int i = 0; i < Image.nrows; i++ )
{
int k =0;
row = Image.data + i * pixelIncrement;
#pragma omp parallel for
for (int j = 0; j < Image.ncols; j++)
{
k++;
disparityThresholdValue = row[j];
// Don't want to save certain points
if ( disparityThresholdValue < threshHold)
{
// Get the data points
x = (int)Image.x[k];
y = (int)Image.y[k];
z = (int)Image.z[k];
grayValue= (int)Image.gray[k];
cloudObject->points[k].x = x;
cloudObject->points[k].y = y;
cloudObject->points[k].z = z;
cloudObject->points[k].grayValue = grayValue;
fprintf( cloudPointsFile, "%f %f %f %d\n", x, y, z, grayValue);
}
}
}
fclose( pointFile );
I did enable OpenMP in my Compiler settings (C/C++ -> Language -> Open MP Support (/openmp).
Any suggestions as to what might be the problem? I am using a Quadcore processor on Windows XP 32-bit.
Are all points written to the file, but just not sequentially, or is the actual point data messed up?
The first case is expected in parallel programming - once you execute something side-by-side you wont be able to guarantee order unless you synchronize the access (at which point you can just leave out the parallelization as it becomes effectively linear). If you need to rely on order, you can parallelize any calculations but need to write it down in one thread.
If the points itself are messed up, check where your variables are declared and if multiple threads are accessing the same.
A few problems here:
#pragma omp parallel for
for (int i = 0; i < Image.nrows; i++ )
{
int k =0;
row = Image.data + i * pixelIncrement;
#pragma omp parallel for
for (int j = 0; j < Image.ncols; j++)
{
k++;
There's no need for the inner parallel for. The outer loop should contain enough work to keep all cores busy.
Also, for the inner loop k is a shared variable and gets incremented in a non-atomic way. x, y, z are also shared among the inner loop threads and overwritten "randomly". Remove the inner directive and see how it goes.
When you have a loop with a nested loop there is no need for a second omp pragma.
It will already paralelize the first loop. Remember that this is valid only if the second loop has to be executed in sequence. You have a sequencial incrementation, so you can not execute the second loop in a random order. OMP pragmas are a very easy and cool way to paralelize code but do not use them too much!
More details here -> Parallel Loops with OpenMP

Resources