OpenMP reduction, variable not private? - c

I have an array like this (0,0 is bottom left):
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0
0 0 1 0 1 0 1 0 0
0 0 1 1 1 1 1 1 1
1 0 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1
My goal is to get the index of the highest row that is not entirely 0. For this I wrote the code below (which works fine):
max=0;
for (i=0 ; i<width ; ++i) {
for (j=max ; j<height ; ++j) {
if (array[i*height+j]!=0) {
max=j;
}
}
}
For the inner loop I initialize j to max, because the global maximum cannot be less than a local maximum. This way I can reduce the number of tests.
Then I tried to parallelize it with OpenMP. My code is now:
max=0;
#pragma omp parallel for default(none) \
shared(array, width, height) \
collapse(2) \
reduction(max:max)
for (i=0 ; i<width ; ++i) {
for (j=max ; j<height ; ++j) {
if (array[i*height+j]!=0) {
max=j;
}
}
}
This leads to a segmentation fault. To make it work, I changed j=max to j=0, so the problem seems to come from the max variable.
I don't understand why, because with the reduction clause this variable should be private (or lastprivate) to each thread. So why does it crash? And how can I use my "optimization" with OpenMP?

First of all, the user High Performance Mark is right in his comment: you shouldn't use collapse when the bounds of an inner loop depend on a value computed inside the loop nest. In your example, the start of the "j" loop depends on "max", so the collapsed loop will produce an incorrect result. However, this is not the cause of your segmentation fault.
I suggest debugging your example to find the source of the crash: each thread's private copy of "max" is initialized with a large negative number, so "j" starts at that value as well. Thus, when trying to access array[i*height+(-2147483648)], you get a segmentation fault.
This happens because OpenMP specifies an initial value for each reduction operator. In the case of the max operator, you can find the following description in the specification of OpenMP 3.1:
max Least representable value in the reduction list item type
In our case, that means that each thread will have at the start of the parallel region a private copy of the max variable holding the value of the lowest number that can be stored as an int (usually -2147483648).
I've written a very rudimentary workaround for your example. I removed the collapse clause and I'm initializing the max variable manually at the start of the parallel region:
#pragma omp parallel default(none) private(j) shared(array, width, height) reduction(max:max)
{
// Explicit initialization
max = 0;
#pragma omp for
for (i=0 ; i<width ; ++i) {
for (j=max ; j<height ; ++j) {
if (array[i*height+j]!=0) {
max=j;
}
}
}
}
As an extra remark, you shouldn't need to assign max=j every time. You could instead check when the first 0 is found and use the previous position.
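For reference, here is the workaround above wrapped into a self-contained function. This is only a sketch: the function name highest_nonzero_row is made up for illustration, the loop counters are declared inside the loops so that they are private without an extra clause, and reduction(max:...) assumes a compiler that supports OpenMP 3.1 or later. Without OpenMP enabled the pragmas are ignored and the function runs serially with the same result.

```c
#include <stddef.h>

/* Hypothetical helper: returns the highest row index j for which some
 * column i has array[i*height + j] != 0, or 0 if every cell is 0. */
int highest_nonzero_row(const int *array, int width, int height)
{
    int max = 0;

    #pragma omp parallel default(none) shared(array, width, height) reduction(max:max)
    {
        /* The reduction gives every thread a private copy of max that
         * starts at the smallest representable int, so reset it before
         * it is used as a loop bound. */
        max = 0;

        #pragma omp for
        for (int i = 0; i < width; ++i) {
            /* Start at this thread's own running maximum. */
            for (int j = max; j < height; ++j) {
                if (array[i * height + j] != 0) {
                    max = j;
                }
            }
        }
    }
    /* After the region, max holds the largest of all private copies. */
    return max;
}
```

Each thread only skips rows below its own local maximum, so it may test a few more cells than the serial version, but the final result is the same.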
Hope it helps

Related

Why isn't my use of ncurses alternative charset working properly?

I'm trying to write a program that generates a crossword grid, and I'm using the ncurses library because I just need a simple interface to display the grid. The problem is that when I call box() with ACS_VLINE and ACS_HLINE, it doesn't work: it writes 'q' and 'x' instead of the box lines. It worked at the beginning, but it suddenly stopped working and I don't know why.
I'm simply initializing ncurses with initscr() and noecho().
Here's the part of the code where I draw the box:
int crossword(char ***grid, WINDOW **win, WINDOW ****c, int x, int y)
{
int i;
int j;
int ch;
t_word_list *wrdlist;
clear();
(void)x;
(void)y;
if (!(wrdlist = get_words("data/words.list")))
return (-1);
box(*win, ACS_VLINE, ACS_HLINE);
i = -1;
while ((*c)[++i])
{
j = -1;
while ((*c)[i][++j])
mvwaddch((*c)[i][j], 0, 0, (*grid)[i][j]);
}
wrefresh(stdscr);
while ((ch = getch()))
if (ch == 27)
{
nodelay(stdscr, TRUE);
ch = getch();
nodelay(stdscr, FALSE);
if (ch < 0)
break ;
}
return (endwin());
}
Output:
lqqqqqqqqqqqqqqqqqqqqqk
x 0 0 0 0 0 0 0 0 0 0 x
x 0 0 0 0 0 0 0 0 0 0 x
x 0 0 0 0 0 0 0 0 0 0 x
x 0 0 0 0 0 0 0 0 0 0 x
x 0 0 0 0 0 0 0 0 0 0 x
x 0 0 0 0 0 0 0 0 0 0 x
x 0 0 0 0 0 0 0 0 0 0 x
x 0 0 0 0 0 0 0 0 0 0 x
x 0 0 0 0 0 0 0 0 0 0 x
x 0 0 0 0 0 0 0 0 0 0 x
mqqqqqqqqqqqqqqqqqqqqqj
EDIT: I recreated the problem with minimal code:
#include <curses.h>
int main(void)
{
WINDOW *win;
initscr();
win = subwin(stdscr, 10, 10, 1, 1);
box(win, ACS_VLINE, ACS_HLINE);
wrefresh(stdscr);
getch();
return (0);
}
Output:
lqqqqqqqqk
x x
x x
x x
x x
x x
x x
x x
x x
mqqqqqqqqj
The flags I use for compilation: gcc main.c -lcurses
Converting parts of comments into an answer.
What did you change between when it worked and when it stopped working? … Can you recreate the steps you would have used while creating the working version in a new MCVE (Minimal, Complete, Verifiable Example; or MRE, or whatever name SO now uses) or an SSCCE (Short, Self-Contained, Correct Example)? This would allow you to find out what breaks the working code with the bare minimum of code.
… I just edited my atoi function that I just use in the main for the sizeX and sizeY; I didn't touch anything else and it suddenly stopped working. I tried to undo what I did after it wasn't working and it still doesn't work.
So you changed something else as well, whether or not you realized it. It's possible that the terminal settings are screwed up — funnier things have been known. Have you tried creating a new terminal window and trying again in the new window?
Oh yes! It was the terminal! It worked after a 'reset', thank you! I don't know why I didn't think about that earlier.
Until curses (ncurses) programs have proved themselves reliable, always consider the possibility that a flawed version of the program under test messed up the terminal settings. Use stty -g to generate a string when the terminal is working properly (when first created, before you run your program). You can then use that string to reset the terminal to the same known state (assuming it is stty settings that are the problem). Sometimes, a new terminal window is necessary even so.
good_stty=$(stty -g)
…testing program under development…
stty "$good_stty"
Sometimes, you may need to type control-J and then stty "$good_stty" (or stty sane) and another control-J because the line ending settings have been modified and not restored correctly.
The problem may be the encoding configured for your console. If you connect through PuTTY, you should also follow the step below.
Verify that the console is configured for UTF-8:
dpkg-reconfigure console-setup

what are the Number of states for the array with 0's and 1's

Given an array with just 0's and 1's.
If a 0 is immediately to the left of a 1, they swap their values (as the examples show, all such pairs swap in the same step).
Count the number of steps needed to move all 0's to the right of the array.
EXAMPLE1
if array=[0 1 0 0 1 0 1 0 1]
[1 0 0 1 0 1 0 1 0]
[1 0 1 0 1 0 1 0 0]
[1 1 0 1 0 1 0 0 0]
[1 1 1 0 1 0 0 0 0]
[1 1 1 1 0 0 0 0 0]
The answer is 5 steps.
EXAMPLE2
if array=[0 1 0 1 0]
[1 0 1 0 0]
[1 1 0 0 0]
The answer is 2 steps.
I wrote code to do what is asked, but it is very slow for large arrays. Please help.
I think this problem is best solved using A* search with a Hamming-like heuristic function (we would need to prove that the heuristic is admissible). Here are the steps:
For the given array, first compute the goal array that we want to reach (simply sort the array descending).
Represent the problem as a (graph) search problem, where the source node is the given array, adjacent nodes are defined to be the arrays that can be obtained by a single 0 1 swap.
Push the source node to a priority queue, where the priority is defined as the sum of the distance from the source node (g(.)) and the heuristic estimate of the distance to the goal node (h(.) = Hamming distance). The node in the queue with minimum priority will be popped first, breaking ties arbitrarily.
Iteratively pop a node from the priority queue until the goal node is reached and push all the adjacent nodes not visited yet to the queue, updating the distance from the source for the popped node.
Stop once the goal node is popped.
Calculate the waiting time for each 1. Now take the waiting time of the last 1, say t1, and add the number of zeros before it, say totalz.
Your answer = t1 + totalz
#include <bits/stdc++.h>
using namespace std;
#define mp make_pair
#define pb push_back
#define vll vector<ll>
#define F first
#define S second
#define pll pair<ll,ll>
#define FOR1(i,a) for(i=0;i<=a;i++)
#define FOR2(i,a,b) for(i=a;i<=b;i++)
#define endl '\n'
#define clr(a) memset(a,0,sizeof(a))
#define all(x) x.begin(),x.end()
typedef long long ll;
int main()
{
ll t,i;
cin>>t;
while(t--)
{
ll n,totalz=0,interval=0,k=0,j;
cin>>n;
ll asyncTime[n]={0},a[n]={0};
bool flag=false;
FOR1(i,n-1)
cin>>a[i];
FOR1(i,n-1)
{
if(!a[i])
totalz+=1;
else
{
flag=true;
asyncTime[k]=0;
k+=1;
j=i+1;
break;
}
}
int l,lastpos;
FOR2(i,j,n-1)
{
if(!a[i])
totalz+=1,interval+=1;
else
{
if(asyncTime[k-1]>=interval)
asyncTime[k]=asyncTime[k-1]-interval+1;
else
asyncTime[k]=0;
interval=0;
k+=1;
lastpos=i;
}
}
// FOR1(i,n-1)
// cout<<a[i]<<" ";
// cout<<endl;
// FOR1(i,k-1)
// cout<<asyncTime[k]<<" ";
// cout<<endl;
if(flag)
{
if(asyncTime[k-1]==k-1&&lastpos==k-1)
cout<<"0"<<endl;
else
cout<<asyncTime[k-1]+totalz<<endl;
}
else
cout<<"0"<<endl;
}
return 0;
}
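The waiting-time idea can also be written without storing a per-element array. The sketch below is an illustration in C, not the code posted above; the names steps_fast and steps_simulated are made up. It tracks t, the step at which the most recent moving 1 reaches its final position (its waiting time plus the zeros to its left). A 1 with z > 0 zeros to its left needs at least z steps and cannot finish before the 1 ahead of it has, so t becomes max(t + 1, z); a 1 with no zero to its left never moves and is skipped. The second function applies the swap rule literally and is only meant for cross-checking small inputs.

```c
#include <stddef.h>

/* O(n) count of simultaneous-swap steps until all 1s are left of all 0s. */
long steps_fast(const int *a, size_t n)
{
    long zeros = 0; /* zeros seen so far, i.e. to the left of a[i] */
    long t = 0;     /* step at which the previous moving 1 is in place */

    for (size_t i = 0; i < n; ++i) {
        if (a[i] == 0) {
            ++zeros;
        } else if (zeros > 0) {
            long earliest = t + 1; /* must finish after the 1 ahead of it */
            t = (earliest > zeros) ? earliest : zeros;
        }
    }
    return t;
}

/* Brute-force reference: applies the rule step by step (modifies a). */
long steps_simulated(int *a, size_t n)
{
    long steps = 0;

    for (;;) {
        int swapped = 0;
        for (size_t i = 0; i + 1 < n; ++i) {
            if (a[i] == 0 && a[i + 1] == 1) {
                a[i] = 1;
                a[i + 1] = 0;
                swapped = 1;
                ++i; /* the 0 just moved must not swap again this step */
            }
        }
        if (!swapped) {
            return steps;
        }
        ++steps;
    }
}
```

On the two examples from the question, both functions return 5 and 2 respectively; the fast version does a single pass over the array, which is what matters for large inputs.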

Need help understanding this C Code (Array)

I need help understanding this code. I can't figure out how this program keeps track of how many times each number appears in the responses array.
I don't understand what's going on in the for loop, especially this line: ++frequency[responses[answer]];
#include<stdio.h>
#define RESPONSE_SIZE 40
#define FREQUENCY_SIZE 11
int main(void)
{
int answer; /* counter to loop through 40 responses */
int rating; /* counter to loop through frequencies 1-10 */
/* initialize frequency counters to 0 */
int frequency[FREQUENCY_SIZE] = {0};
/* place the survey responses in the responses array */
int responses[RESPONSE_SIZE] = {1,2,6,4,8,5,9,7,8,10,1,6,3,8,6,10,3,8,2,7,6,5,7,6,8,6,7,5,6,6,5,6,7,5,6,4,8,6,8,10};
/* for each answer, select value of an element of array responses
and use that value as subscript in array frequency to determine element to increment */
for(answer = 0 ; answer < RESPONSE_SIZE; answer++){
++frequency[responses[answer]];
}
printf("%s%17s\n", "Rating", "Frequency");
/* output the frequencies in a tabular format */
for(rating = 1; rating < FREQUENCY_SIZE; rating++){
printf("%6d%17d\n", rating, frequency[rating]);
}
return 0;
}
++frequency[responses[answer]] is a dense way of writing
int r = responses[answer];
frequency[r] = frequency[r] + 1;
with the caveat that frequency[r] is only evaluated once.
So, if answer equals 0, then responses[answer] equals 1, so we add 1 to frequency[1].
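The same two-step expansion can be put into a small function. This is a sketch for illustration only: the name count_frequencies and the bounds check are additions that are not in the original program, which assumes every response is between 1 and 10.

```c
#include <stddef.h>

/* Hypothetical helper: counts how often each value 0..bins-1 occurs in
 * responses. Returns how many responses were out of range and skipped. */
size_t count_frequencies(const int *responses, size_t n,
                         int *frequency, size_t bins)
{
    size_t skipped = 0;

    /* Start every counter at 0, like "= {0}" in the original. */
    for (size_t b = 0; b < bins; ++b) {
        frequency[b] = 0;
    }

    for (size_t answer = 0; answer < n; ++answer) {
        int r = responses[answer];          /* step 1: read the response */
        if (r < 0 || (size_t)r >= bins) {   /* would index outside frequency */
            ++skipped;
            continue;
        }
        frequency[r] = frequency[r] + 1;    /* step 2: bump that counter */
    }
    return skipped;
}
```

Called with the responses array from the question and bins set to FREQUENCY_SIZE, it fills frequency the same way the original loop does.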
Edit
The following table shows what happens to frequency through the loop (old value => new value):
answer  responses[answer]  frequency[responses[answer]]
------  -----------------  ----------------------------
0       1                  frequency[1]: 0 => 1
1       2                  frequency[2]: 0 => 1
2       6                  frequency[6]: 0 => 1
3       4                  frequency[4]: 0 => 1
...     ...                ...
10      1                  frequency[1]: 1 => 2
etc.
for(answer = 0 ; answer < RESPONSE_SIZE; answer++){
++frequency[responses[answer]]; // <---
}
The loop above counts the number of times each number appears in the array responses, and stores that count at the number's own index in the array frequency. This line does the counting:
++frequency[responses[answer]];
So it increments the value at index responses[answer] of the array frequency.
Let's say responses[answer] has the value 1; then the value at index 1 of frequency is incremented.
The second for loop just prints the result, as mentioned.

My OpenCL code changes the output based on a seemingly noop

I'm running the same OpenCL kernel code on an Intel CPU and on a NVIDIA GPU and the results are wrong on the first but right on the latter; the strange thing is that if I do some seemingly irrelevant changes the output works as expected in both cases.
The goal of the function is to calculate the matrix multiplication between A (triangular) and B (regular), where the position of A in the operation is determined by the value of the variable left. The bug only appears when left is true and when the for loop iterates at least twice.
Here is a fragment of the code; for the sake of clarity I have omitted some bits that shouldn't affect the result.
__kernel void blas_strmm(int left, int upper, int nota, int unit, int row, int dim, int m, int n,
float alpha, __global const float *a, __global const float *b, __global float *c) {
/* [...] */
int ty = get_local_id(1);
int y = ty + BLOCK_SIZE * get_group_id(1);
int by = y;
__local float Bs[BLOCK_SIZE][BLOCK_SIZE];
/* [...] */
for(int i=start; i<end; i+=BLOCK_SIZE) {
if(left) {
ay = i+ty;
bx = i+tx;
}
else {
ax = i+tx;
by = i+ty;
}
barrier(CLK_LOCAL_MEM_FENCE);
/* [...] (Load As) */
if(bx >= m || by >= n)
Bs[tx][ty] = 0;
else
Bs[tx][ty] = b[bx*n+by];
barrier(CLK_LOCAL_MEM_FENCE);
/* [...] (Calculate Csub) */
}
if(y < n && x < (left ? row : m)) // In bounds
c[x*n+y] = alpha*Csub;
}
Now it gets weird.
As you can see, by always equals y if left is true. I checked (with some printfs, mind you) and left is always true, and the code on the else branch inside the loop is never executed. Nevertheless, if I remove or comment out the by = i+ty line there, the code works. Why? I don't know yet, but I thought it might be related to by not having the expected value assigned.
My train of thought led me to check whether there was ever a discrepancy between by and y, as they should always have the same value; I added a line that checked if by != y, but that comparison always returned false, as expected. So I went on and replaced the occurrence of by with y, so that the line
if(bx >= m || by >= n)
transformed into
if(bx >= m || y >= n)
and it worked again, even though I'm still using the variable by properly three lines below.
With an open mind I tried some other things and I got to the point that the code works if I add the following line inside the loop, as long as it is situated at any point after the initial if/else and before the if condition that I mentioned just before.
if(y >= n) left = 1;
The statement inside (left = 1) can be replaced with anything (a printf, another useless assignment, etc.), but the condition is a bit more restrictive. Here are some examples that make the code output the correct values:
if(y >= n) left = 1;
if(y < n) left = 1;
if(y+1 < n+1) left = 1;
if(n > y) left = 1;
And some that don't work, note that m = n in the particular example that I'm testing:
if(y >= n+1) left = 1;
if(y > n) left = 1;
if(y >= m) left = 1;
/* etc. */
That's the point where I am now. I have added a line that shouldn't affect the program at all but it makes it work. This magic solution is not satisfactory to me and I would like to know what's happening inside my CPU and why.
Just to be sure I'm not forgetting anything, here is the full function code and a gist with example inputs and outputs.
Thank you very much.
Solution
Both users DarkZeros and sharpneli were right about their assumptions: the barriers inside the for loop weren't being hit the right amount of times. In particular, there was a bug involving the very first element of each local group that made it run one iteration less than the rest, provoking an undefined behaviour. It was painfully obvious to see in hindsight.
Thank you all for your answers and time.
Have you checked that the get_local_size always returns the correct value?
You said "In short, the full length of the matrix is divided in local blocks of BLOCK_SIZE and run in parallel; ". Remember that OpenCL allows any concurrency only within a workgroup. So if you call enqueueNDrange with global size of [32,32] and local size of [16,16] it is possible that the first thread block runs from start to finish, then the second one, then third etc. You cannot synchronize between workgroups.
What are your EnqueueNDRange call(s)? Example of the calls required to get your example output would be heavily appreciated (mostly interested in the global and local size arguments).
(I'd ask this in a comment but I am a new user).
Edit (I had an answer, but upon verification it did not hold; I still need more info):
http://multicore.doc.ic.ac.uk/tools/GPUVerify/
By using that I got a complaint that a barrier could be reached by a nonuniform control flow.
It all depends on what values dim, nota and upper get. Could you provide some examples?
I did some testing. Assuming left = 1. nota != upper and dim = 32, row as 16 or 32 or whatnot, still worked and got the following result:
...
gid0: 2 gid1: 0 lid0: 14 lid1: 13 start: 0 end: 32
gid0: 2 gid1: 0 lid0: 14 lid1: 14 start: 0 end: 32
gid0: 2 gid1: 0 lid0: 14 lid1: 15 start: 0 end: 32
gid0: 2 gid1: 0 lid0: 15 lid1: 0 start: 0 end: 48
gid0: 2 gid1: 0 lid0: 15 lid1: 1 start: 0 end: 48
gid0: 2 gid1: 0 lid0: 15 lid1: 2 start: 0 end: 48
...
So if my assumptions about the variable values are even close to correct, you have a barrier divergence issue there: some threads encounter a barrier that other threads never will. I'm surprised it did not deadlock.
The first thing I see that can fail badly is that you are using barriers inside a for loop.
If the threads do not all execute the for loop the same number of times, the results are completely undefined. And you clearly state that the problem only occurs if the for loop runs more than once.
Do you ensure this condition?

How to resolve "return 0 or return 1" within a for loop in open mp?

This is what I have done so far to replace the return 1; and return 0; statements. It is a sudoku solver using a backtracking algorithm, and I am trying to parallelize it, but I can't get the complete result (correct me if my implementation is wrong).
What is actually happening?
Can anybody help?
This is the site I referred to, and I tried to follow their approach: http://www.thinkingparallel.com/2007/06/29/breaking-out-of-loops-in-openmp/#reply
int solver (int row, int col)
{
int i;
boolean flag = FALSE;
if (outBoard[row][col] == 0)
{
#pragma omp parallel num_threads(2)
#pragma omp parallel for //it works if i remove this line
for (i = 1; i < 10; i++)
{
if (checkExist(row, col, i)) //if not, assign number i to the empty cell
outBoard[row][col] = i;
#pragma omp flush (flag)
if (!flag)
{
if (row == 8 && col == 8)
{
//return 1;
flag = TRUE;
#pragma omp flush (flag)
}
else if (row == 8)
{
if (solver(0, col+1))
{
//return 1;
flag = TRUE;
#pragma omp flush (flag)
}
}
else if (solver(row+1, col))
{
//return 1;
flag = TRUE;
#pragma omp flush (flag)
}
}
}
if (flag) { return 1; }
if (i == 10)
{
if (outBoard[row][col] != inBoardA[row][col])
outBoard[row][col] = 0;
return 0;
}
}
else
{
if (row == 8 && col == 8)
{
return 1;
}
else if (row == 8)
{
if (solver(0,col+1)) return 1;
}
else
{
if (solver(row+1,col)) return 1;
}
return 0;
}
}
5 0 0 0 0 3 7 0 0
7 4 6 1 0 2 3 0 0
0 3 8 0 9 7 5 0 2
9 7 0 4 0 0 2 0 0
0 0 2 0 0 0 4 0 0
0 0 4 0 0 5 0 1 9
4 0 3 2 7 0 9 8 0
0 0 5 3 0 9 6 7 4
0 0 7 5 0 0 0 0 3
Sudoku solved :
5 2 9 8 0 3 7 4 1
7 4 6 1 5 2 3 9 0
1 3 8 0 9 7 5 6 2
9 7 0 4 1 0 2 3 6
0 1 2 9 6 0 4 5 8
3 6 4 7 8 5 0 1 9
4 0 3 2 7 6 9 8 5
2 8 5 3 0 9 6 7 4
6 9 7 5 4 8 1 2 3
The //return 1; lines are from the original serial code. Since return is not allowed inside a parallel for, I used #pragma omp flush with a flag to replace them, but the result is incomplete: a few cells of the sudoku are still left empty.
Thanks for answering :>
First, since solver is called recursively, it doubles the number of threads with each level of recursion. I don't think it's what you intended to do.
Edit: This is only true if nested parallelism is enabled with omp_set_nested(), and by default it is not. So only the first call of solver will fork.
#pragma omp parallel num_threads(2)
#pragma omp parallel for
in your code tries to create one parallel region within another, and it will cause the loop that follows to execute twice, because outer parallel already created two threads. This should be replaced with
#pragma omp parallel num_threads(2)
#pragma omp for
or equivalent #pragma omp parallel for num_threads(2).
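As an illustration of the combined directive together with a shared flag standing in for the early return, here is a minimal sketch on a much simpler problem than the solver (searching an array). The function find_index and its parameters are hypothetical, and the atomic read/write forms assume OpenMP 3.1 or later; without OpenMP the pragmas are ignored and the loop runs serially.

```c
/* Hypothetical helper: returns an index i with data[i] == target,
 * or -1 if there is none. */
int find_index(const int *data, int n, int target)
{
    int found = 0;   /* shared flag that replaces "return" in the loop */
    int result = -1; /* only written inside the critical section */

    #pragma omp parallel for num_threads(2) shared(found, result)
    for (int i = 0; i < n; ++i) {
        int stop;

        #pragma omp atomic read
        stop = found;
        if (stop) {
            continue; /* cannot leave an omp for early; skip the work */
        }

        if (data[i] == target) {
            #pragma omp critical
            {
                if (result < 0) {
                    result = i;
                }
            }
            #pragma omp atomic write
            found = 1;
        }
    }
    return result;
}
```

The loop still visits every index, but iterations that start after the flag is set do no work. Note that this only covers the control-flow part; it does nothing about several threads writing to shared state such as outBoard.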
Second, this code:
if (checkExist(row,col,i))//if not, assign number i to the empty cell
outBoard[row][col] = i;
creates a race condition, with both threads writing different values to the same cell in parallel. You might want to create a separate copy of the board for each thread to work with.
Another code part,
if (outBoard[row][col] != inBoardA[row][col])
outBoard[row][col] = 0;
seems to be outside the parallel region, but in nested calls to solver it's also executed in parallel in different threads created by the outer-most solver.
Final(e) (18.09) Anyway, even if you manage to debug/change your code to run correctly in parallel (as I myself did just for the heck of it - i'll try to provide the code if anyone is interested, which I doubt), the outcome will be that parallel execution of this code doesn't give you much advantage. The reason to my mind is as follows:
Imagine that when solver iterates over the 9 possible cell values, it creates 9 branches of execution. If you create 2 threads using OpenMP, it will distribute the top-level branches between the threads in some way, say 5 branches executed by one thread and 4 by the other, and within each thread the branches will be executed consecutively, one by one. If the initial sudoku state is valid, only one branch will lead to the correct solution. The other branches will be cut short when they encounter a contradiction, so some will take longer and some will take less time to run, while the branch leading to the correct solution will run the longest. You cannot predict how long each branch will take, so there is no way to reasonably balance the workload among the threads. Even if you use OpenMP dynamic scheduling, chances are that while one thread executes the longest branch, the other thread(s) will already have finished all other branches and will wait for the last one, because there are too few branches (so dynamic scheduling will be of little help).
Since creating threads and synchronizing data between them incurs substantial overhead (compared to a sequential solver running time of 0.01-10 ms), you'll see a parallel execution time somewhat longer or shorter than the sequential one, depending on the input.
In any case, if the sequential solver running time is under 10 ms, why do you want to make it parallel?
