My OpenCL code changes the output based on a seemingly noop - c

I'm running the same OpenCL kernel code on an Intel CPU and on an NVIDIA GPU; the results are wrong on the former but right on the latter. The strange thing is that if I make some seemingly irrelevant changes, the output is as expected in both cases.
The goal of the function is to calculate the matrix multiplication between A (triangular) and B (regular), where the position of A in the operation is determined by the value of the variable left. The bug only appears when left is true and when the for loop iterates at least twice.
Here is a fragment of the code; for the sake of clarity I have omitted some bits that shouldn't affect the result.
__kernel void blas_strmm(int left, int upper, int nota, int unit, int row, int dim, int m, int n,
                         float alpha, __global const float *a, __global const float *b, __global float *c) {
    /* [...] */
    int ty = get_local_id(1);
    int y = ty + BLOCK_SIZE * get_group_id(1);
    int by = y;
    __local float Bs[BLOCK_SIZE][BLOCK_SIZE];
    /* [...] */
    for(int i=start; i<end; i+=BLOCK_SIZE) {
        if(left) {
            ay = i+ty;
            bx = i+tx;
        }
        else {
            ax = i+tx;
            by = i+ty;
        }
        /* [...] (Load As) */
        if(bx >= m || by >= n)
            Bs[tx][ty] = 0;
        else
            Bs[tx][ty] = b[bx*n+by];
        /* [...] (Calculate Csub) */
    }
    if(y < n && x < (left ? row : m)) // In bounds
        c[x*n+y] = alpha*Csub;
}
Now it gets weird.
As you can see, by always equals y if left is true. I checked (with some printfs, mind you) and left is always true, and the code on the else branch inside the loop is never executed. Nevertheless, if I remove or comment out the by = i+ty line there, the code works. Why? I don't know yet, but I thought it might be related to by not having the expected value assigned.
My train of thought took me to check whether there was ever a discrepancy between by and y, as they should always have the same value. I added a line that checked if by != y, but that comparison always returned false, as expected. So I went on and replaced that occurrence of by with y, so the line
if(bx >= m || by >= n)
transformed into
if(bx >= m || y >= n)
and it worked again, even though I'm still using the variable by properly three lines below.
With an open mind I tried some other things and I got to the point that the code works if I add the following line inside the loop, as long as it is situated at any point after the initial if/else and before the if condition that I mentioned just before.
if(y >= n) left = 1;
The code inside (left = 1) can be replaced with anything (a printf, another useless assignment, etc.), but the condition is a bit more restrictive. Here are some examples that make the code output the correct values:
if(y >= n) left = 1;
if(y < n) left = 1;
if(y+1 < n+1) left = 1;
if(n > y) left = 1;
And some that don't work, note that m = n in the particular example that I'm testing:
if(y >= n+1) left = 1;
if(y > n) left = 1;
if(y >= m) left = 1;
/* etc. */
That's the point where I am now. I have added a line that shouldn't affect the program at all but it makes it work. This magic solution is not satisfactory to me and I would like to know what's happening inside my CPU and why.
Just to be sure I'm not forgetting anything, here is the full function code and a gist with example inputs and outputs.
Thank you very much.
Both users DarkZeros and sharpneli were right in their assumptions: the barriers inside the for loop weren't being hit the same number of times by every work-item. In particular, there was a bug involving the very first element of each local group that made it run one iteration fewer than the rest, causing undefined behaviour. It was painfully obvious in hindsight.
Thank you all for your answers and time.

Have you checked that the get_local_size always returns the correct value?
You said "In short, the full length of the matrix is divided in local blocks of BLOCK_SIZE and run in parallel; ". Remember that OpenCL allows any concurrency only within a workgroup. So if you call enqueueNDrange with global size of [32,32] and local size of [16,16] it is possible that the first thread block runs from start to finish, then the second one, then third etc. You cannot synchronize between workgroups.
What are your EnqueueNDRange call(s)? Example of the calls required to get your example output would be heavily appreciated (mostly interested in the global and local size arguments).
(I'd ask this in a comment but I am a new user).
EDIT (had an answer, but upon verification it did not hold; still need more info):
By using that I got a complaint that a barrier could be reached by a nonuniform control flow.
It all depends on what values dim, nota and upper get. Could you provide some examples?
I did some testing. Assuming left = 1. nota != upper and dim = 32, row as 16 or 32 or whatnot, still worked and got the following result:
gid0: 2 gid1: 0 lid0: 14 lid1: 13 start: 0 end: 32
gid0: 2 gid1: 0 lid0: 14 lid1: 14 start: 0 end: 32
gid0: 2 gid1: 0 lid0: 14 lid1: 15 start: 0 end: 32
gid0: 2 gid1: 0 lid0: 15 lid1: 0 start: 0 end: 48
gid0: 2 gid1: 0 lid0: 15 lid1: 1 start: 0 end: 48
gid0: 2 gid1: 0 lid0: 15 lid1: 2 start: 0 end: 48
So if my assumptions about the variable values are even close to correct, you have a barrier divergence issue there. Some threads encounter a barrier which other threads never will. I'm surprised it did not deadlock.

The first thing I see that can fail badly is that you are using barriers inside a for loop.
If the threads do not all enter the for loop the same number of times, the results are completely undefined. And you clearly state that the problem only occurs if the for loop runs more than once.
Do you ensure this condition?
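One way to check that condition off-device is to compute, for each work-item of a group, how many times the loop body (and therefore the barrier) would run. The following is a minimal host-side sketch in plain C, not OpenCL code; the helper names are hypothetical and the step of 16 is an assumption about BLOCK_SIZE, based on the local ids 0..15 printed in the other answer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Number of iterations of: for (i = start; i < end; i += step). */
static int trip_count(int start, int end, int step)
{
    assert(step > 0);
    if (end <= start)
        return 0;
    return (end - start + step - 1) / step;
}

/* True if every work-item in the group would reach an in-loop barrier
 * the same number of times. start[k] and end[k] are the loop bounds
 * computed by work-item k (hypothetical inputs, e.g. gathered via printf). */
static bool barrier_counts_uniform(const int *start, const int *end,
                                   size_t n, int step)
{
    for (size_t k = 1; k < n; k++)
        if (trip_count(start[k], end[k], step) != trip_count(start[0], end[0], step))
            return false;
    return true;
}
```

With bounds such as start = 0, end = 32 for one work-item and start = 0, end = 48 for another in the same group, the trip counts are 2 and 3, so the check fails and the barrier inside the loop is divergent.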


OpenCL CLK_LOCAL_MEM_FENCE causing abort trap 6

I'm doing some exercise about convolution over images (info here) using OpenCL. When I use images whose size is not a square (like r x c) CLK_LOCAL_MEM_FENCE makes the program stop with abort trap 6.
What I do is essentially fill the local memory with the proper values, wait for that filling to finish using barrier(CLK_LOCAL_MEM_FENCE), and then calculate the values.
It seems that when I use images like the ones described above, barrier(CLK_LOCAL_MEM_FENCE) causes problems; if I comment out that command everything works fine (which is weird, since there is then no synchronization). Any idea what may cause this problem?
EDIT: the problem appears when the height, the width, or both are not a multiple of the local item size (16 x 16). The global item size is always a pair of values that are multiples of 16, like (512 x 512).
int c = get_global_id(0);
int r = get_global_id(1);
int lc = get_local_id(0);
int lr = get_local_id(1);
// this ignores indexes out of the input image.
if (c >= ImageWidth || r >= ImageHeight) return;
// fill a local array...
if (c < outputImageWidth && r < outputImageHeight)
OutputImage[r* outputImageWidth +c] = someValue;
OpenCL requires that each work-group barrier is executed by every work-item in that work-group.
In the code that you have posted, you have an early exit clause to prevent out-of-range accesses. This is a common trick for getting nice work-group sizes in OpenCL 1.X, but unfortunately this breaks the above condition, and this will lead to undefined behaviour (typically either a hang or a crash).
You will need to modify your kernel to avoid this, by either removing the early exit clause (and perhaps clamping out-of-range work-items instead, if applicable), or by restructuring the kernel so that out-of-range work-items continue at least as far as the barrier before exiting.
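The clamping option can be illustrated with a small sketch. It is written as plain C so it can be tried on the host; the helper names are hypothetical and it assumes a row-major image. Inside a kernel the same arithmetic would let every work-item, including the out-of-range ones, load a valid pixel into local memory and reach the barrier, with only the final write guarded by the bounds check.

```c
#include <assert.h>

/* Clamp a coordinate into [0, limit - 1]. */
static int clamp_coord(int v, int limit)
{
    assert(limit > 0);
    if (v < 0)
        return 0;
    if (v >= limit)
        return limit - 1;
    return v;
}

/* Flat index into a row-major image for a work-item whose (c, r)
 * may lie outside the image; such work-items read the nearest edge pixel. */
static int clamped_pixel_index(int c, int r, int width, int height)
{
    return clamp_coord(r, height) * width + clamp_coord(c, width);
}
```

For a 500 x 300 image processed with a 512 x 512 global size, work-item (511, 511) would read the pixel at (499, 299) instead of returning early.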
You can change the code order without affecting the behaviour to fix it:
int c = get_global_id(0);
int r = get_global_id(1);
int lc = get_local_id(0);
int lr = get_local_id(1);
// fill a local array... with all the threads
// ie: for(i=0;i<size;i+=get_local_size(0))
// ...
// this ignores indexes out of the input image.
if (c >= ImageWidth || r >= ImageHeight) return;
if (c < outputImageWidth && r < outputImageHeight)
OutputImage[r* outputImageWidth +c] = someValue;

How do I create a "twirly" in a C program task?

Hey guys I have created a program in C that tests all numbers between 1 and 10000 to check if they are perfect using a function that determines whether a number is perfect. Once it finds these it prints them to the user, they are 6, 28, 496 and 8128. After this the program then prints out all the factors of each perfect number to the user. This is all fine. Here is my problem.
The final part of my task asks me to:
"Use a "twirly" to indicate that your program is happily working away. A "twirly" is the following characters printed over the top of each other in the following order: '|' '/' '-' '\'. This has the effect of producing a spinning wheel - ie a "twirly". Hint: to do this you can use \r (instead of \n) in printf to give a carriage return only (instead of a carriage return linefeed). (Note: this may not work on some systems - you do not have to do it this way.)"
I have no idea what a twirly is or how to implement one. My tutor said it has something to do with the sleep and delay functions which I also don't know how to use. Can anyone help me with this last stage, it sucks that all my coding is complete but I can't get this "twirly" thing to work.
If you want to simultaneously perform the tasks of
testing the numbers and
displaying the twirly on screen
while the process goes on, then you had better look into using threads. Using POSIX threads you can run the test on one thread while another thread displays the twirly to the user on the terminal.
int Test();
void Display();
int main(){
    // create one thread for each task, Test and Display
    // start the threads
    // wait for the Test thread to finish
    // terminate the Display thread after the Test thread completes
    // exit
}
Refer to chapter 12 of the Beginning Linux Programming ebook for threads.
Given the program upon which the user is "waiting", I believe the problem as stated and the solutions using sleep() or threads are misguided.
To produce all the perfect numbers below 10,000 using C on a modern personal computer takes about 1/10 of a second. So any device to show the computer is "happily working away" would either never be seen or would significantly interfere with the time it takes to get the job done.
But let's make a working twirly for perfect number search anyway. I've left off printing the factors to keep this simple. Since 10,000 is too low to see the twirly in action, I've upped the limit to 100,000:
#include <stdio.h>
#include <string.h>
int main()
{
    const char *twirly = "|/-\\";
    for (unsigned x = 1; x <= 100000; x++)
    {
        unsigned sum = 0;
        for (unsigned i = 1; i <= x / 2; i++)
        {
            if (x % i == 0)
            {
                sum += i;
            }
        }
        if (sum == x)
        {
            printf("%u\n", x); // x is unsigned, so use %u
        }
        printf("%c\r", twirly[x / 2500 % strlen(twirly)]);
        fflush(stdout); // no newline is printed, so flush to make the twirly visible
    }
    return 0;
}
No need for sleep() or threads, just key it into the complexity of the problem itself and have it update at reasonable intervals.
Now here's the catch, although the above works, the user will never see a fifth perfect number pop out with a 100,000 limit and even with a 100,000,000 limit, which should produce one more, they'll likely give up as this is a bad (slow) algorithm for finding them. But they'll have a twirly to watch.
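The idea of keying the twirly to the work itself can also be factored into two small helpers. This is a sketch only; the function names and the interval parameter are assumptions for illustration, not part of the assignment.

```c
#include <assert.h>
#include <stdbool.h>

/* Sum the proper divisors of x and compare the sum with x. */
static bool is_perfect(unsigned x)
{
    unsigned sum = 0;
    for (unsigned i = 1; i <= x / 2; i++)
        if (x % i == 0)
            sum += i;
    return x != 0 && sum == x;
}

/* Pick the twirly frame for candidate x, advancing by one frame
 * every `interval` candidates. */
static char twirly_frame(unsigned x, unsigned interval)
{
    static const char frames[] = "|/-\\";
    assert(interval > 0);
    return frames[(x / interval) % 4];
}
```

In the search loop this would be used as printf("%c\r", twirly_frame(x, 2500)); followed by fflush(stdout);, so the frame changes as the search progresses rather than on a timer.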
p as integer, set p = 0
str as character pointer, set str = "|/-\\"
set length = 4
loop i: 1 to 10000
    sum as integer
    set sum = 0
    loop j: 1 to i/2
        if i%j == 0
            add j to sum
    if sum == i
        print i
    if i%100 == 0
        print str[p] using "%c\r" as format specifier
        increment p and assign its modulo by length to p

Configuring and limiting output of PI controller

I have implemented a simple PI controller; the code is as follows:
PI_controller() {
    // handling input value and errors
    previous_error = current_error;
    current_error = 0 - input_value;
    // PI regulation
    P = current_error;       // P is proportional value
    I += previous_error;     // I is integral value
    output = Kp*P + Ki*I;    // Kp and Ki are coefficients
}
Input value is always between -π and +π.
Output value must be between -4000 and +4000.
My question is - how to configure and (most importantly) limit the PI controller properly.
Too much to comment but not a definitive answer. What is "a simple PI controller"? And "how long is a piece of string"? I don't see why you (effectively) code
P = (current_error = 0 - input_value);
which simply negates the error of -π to π. You then aggregate the error with
I += previous_error;
but haven't stated the cumulative error bounds, and then calculate
output = Kp*P + Ki*I;
which must be -4000 <= output <= 4000. So you are looking for values of Kp and Ki that keep you within bounds, or perhaps don't keep you within bounds except in average conditions.
I suggest an empirical solution. Try a series of runs, filing the results, stepping the values of Kp and Ki by 5 steps each, first from extreme neg to pos values. Limit the output as you stated, counting the number of results that break the limit.
Next, halve the range of one of Kp and Ki and make a further informed choice as to which one to limit. And so on. "Divide and conquer".
As to your requirement "how to limit the PI controller properly", are you sure that 4000 is the limit and not 4096 or even 4095?
if (output < -4000) output = -4000;
if (output > 4000) output = 4000;
To configure your Kp and Ki you really should analyze the frequency response of your system and design your PI to give the desired response. To simply limit the output decide if you need to freeze the integrator, or just limit the immediate output. I'd recommend freezing the integrator.
I_tmp = previous_error + I;
output_tmp = Kp*P + Ki*I_tmp;
if( output_tmp < -4000 ) {
    output = -4000;          // saturated: leave I unchanged (frozen)
} else if( output_tmp > 4000 ) {
    output = 4000;           // saturated: leave I unchanged (frozen)
} else {
    I = I_tmp;
    output = output_tmp;
}
That's not a super elegant, vetted algorithm, but it gives you an idea.
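To make the integrator-freezing idea concrete, here is a self-contained sketch. The struct, the function name and the gain values used in the usage note are assumptions for illustration, and unlike the question's code it accumulates the current error rather than the previous one.

```c
#include <assert.h>

typedef struct {
    double kp, ki;            /* gains; must be tuned for the real system */
    double out_min, out_max;  /* output limits, e.g. -4000 and +4000 */
    double integral;          /* accumulated error */
} PiController;

/* One controller step. When the unclamped output would exceed a limit,
 * the limit is returned and the integrator keeps its old value (frozen). */
static double pi_update(PiController *pi, double error)
{
    double integral_tmp = pi->integral + error;
    double out = pi->kp * error + pi->ki * integral_tmp;

    assert(pi->out_min <= pi->out_max);
    if (out > pi->out_max)
        return pi->out_max;
    if (out < pi->out_min)
        return pi->out_min;

    pi->integral = integral_tmp;  /* commit only when not saturated */
    return out;
}
```

With kp = 1000, ki = 100 and a constant error of 3.0, successive calls return 3300, 3600 and 3900; the fourth call would give 4200, so it returns 4000 and the integral stays at 9 instead of winding up.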
If I understand your question correctly you are asking about anti windup for your integrator.
There are cleverer ways to do it, but a simple
if ( abs (I) < x)
I += previous_error;
will prevent windup of the integrator.
Then you need to figure out x, Kp and Ki so that abs(x*Ki) + abs(3.14*Kp) < 4000
[edit] Of course, as macduff states, you first need to analyse your system and choose the correct Ki and Kp; x is the only really free variable in the above equation.

Traversing INT array in two ways

This is a fun bit of robotics code (in C) that has to traverse an int array in two directions.
I have an array of positions like this: int pos[] = {0, 45, 90, 135, 180, 135, 90, 45};
These positions are used to move a servo motor.
 45   90   135
   \   |   /
    \  |  /
     \ | /
0 ----------- 180
In main loop() I check distance from an obstacle, and if it's < xx Cm my servo must rotate to next step (next array position) until it finds a free way ( > xx Cm ).
My main is easy:
int main (int argc, const char * argv[]) { for (;;) find(); }
and my core function (find) is this:
void find() {
    for ( i=0; i<sizeof(pos); i++ ) {        // Traversing position array
        distance = rand() % 7;               // Simulate obstacle distance
        move( pos[i] );                      // Simulate movements
        if (i==sizeof(pos)) { i=1; }         // Try to reset the "i" counter. PROBLEM!
        if ( distance<=5 ) continue;         // Is there an obstacle?
        sleep(2);                            // Debug sleep
    }
    find();                                  // Similar recursion
}
I don't know what is wrong in this code, but I need to keep moving the servo until there is no obstacle.
At position 90 I find an obstacle. I want to loop array from left to right and viceversa controlling distance every step. If I don't find a freeway, print("ko") else print("ok").
How do I fix this code to work correctly?
You really want i < sizeof(pos) / sizeof(*pos) rather than i < sizeof(pos). The size of an array is not the number of its elements but rather the total byte count it occupies in memory.
sizeof(pos) yields 8 * sizeof(int). If an int is 4 bytes, you are looping 32 times instead of 8.
Also, i == sizeof(pos) will never be true in the body of the loop because the condition of the for statement limits i to sizeof(pos) - 1.
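The element-count idiom is often wrapped in a macro. The sketch below uses a hypothetical ARRAY_LEN macro and a counting helper purely for illustration; note that the macro only works on a real array, not on a pointer to its first element.

```c
#include <assert.h>
#include <stddef.h>

/* Number of elements in an array (not valid for pointers). */
#define ARRAY_LEN(a) (sizeof(a) / sizeof((a)[0]))

static const int pos[] = {0, 45, 90, 135, 180, 135, 90, 45};

/* Visit every position exactly once and report how many were visited. */
static size_t count_positions(void)
{
    size_t visited = 0;
    for (size_t i = 0; i < ARRAY_LEN(pos); i++)
        visited++;
    return visited;
}
```

Here the loop runs 8 times regardless of sizeof(int), whereas a bound of sizeof(pos) would run 8 * sizeof(int) times and read past the end of the array.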
If I understand your question correctly, you want the servo to make a sweeping movement from left to right and then back from right to left. Measuring the distance to an object that can be in front of the robot at each angle. If there is a free way ahead of the robot, the find method returns.
int pos[] = {0, 45, 90, 135, 180, -1};
void find()
{
    int i = 0;
    int direction = 1;
    do {
        i += direction;
        // test i == 0 first so that pos[-1] is never read
        if (i == 0) direction = 1;
        else if (pos[i+direction] == -1) direction = -1;
    } while(measure_distance() <= 5);
}
Instead of recursion, there is a do-while loop that only exits when the measured distance is greater than 5.
The 'pos' array has a sentinel at the end (-1). This is an invalid angle and can be used to find the end of the array. There is no need to calculate the number of elements.
The left-right, right-left movement comes from using the 'direction' variable. It is rather easy to detect either the beginning (i==0) or the end of the 'pos' array (pos[i+1] == -1), at which point we reverse the direction.
There is also no need to repeat the angles after 180 degrees. The sequence we get is:
0 45 90 135 180 135 90 45 0 45 90 ...
We can even reduce the code with one line...
if (i == 0 || pos[i+direction] == -1) direction *= -1;
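The sweep can be tried in isolation with a small step function. This is a sketch: the names sweep_pos and sweep_step are hypothetical, and it tests i == 0 before looking at the neighbouring element so that the element before the start of the array is never read.

```c
#include <assert.h>

static const int sweep_pos[] = {0, 45, 90, 135, 180, -1};  /* -1 is the sentinel */

/* Advance one step of the sweep. *i is the current index and *direction
 * is +1 or -1; start with *i = 0 and *direction = 1.
 * Returns the angle at the new index. */
static int sweep_step(int *i, int *direction)
{
    *i += *direction;
    /* Reverse at the start of the array or just before the sentinel. */
    if (*i == 0 || sweep_pos[*i + *direction] == -1)
        *direction *= -1;
    return sweep_pos[*i];
}
```

Starting from index 0, repeated calls return 45, 90, 135, 180, 135, 90, 45, 0, 45, 90 and so on; in the robot code each returned angle would be passed to move() before measuring the distance.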
Don't forget to initialize your rand function using
/* initialize random seed: */
srand ( time(NULL) );
distance = rand() % 7;
It's probably good practise to say:
#define POSLENGTH 8
and then iterate using i<POSLENGTH: as others have pointed out using sizeof(pos) is probably not going to work.
Also, arrays in C are 0 based: elements go 0,1,2,3,...n-1.
So, you need to say:
if (i==POSLENGTH-1) i=0;
Try using a while loop instead of a for loop. Increment the value when there is no obstacle and break when you find an obstruction:
i = rand()%7;
move( pos[i]);
if (i<5)
This will randomly choose the position until you get an obstacle and also there will be no need to reset it as the loop will automatically break on encountering an obstacle.

Intermittent bugs - sometimes this code works and sometimes it doesn't!

This code works intermittently. It's running on a small microcontroller. It will work fine even after restarting the processor, but if I change some part of the code, it breaks. This makes me think that it's some kind of pointer bug or memory corruption. What's happening is that the coordinate p_res.pos.x is sometimes read as 0 (the incorrect value) and sometimes as 96 (the correct value) when it is passed to write_circle_outlined. y seems to be correct most of the time. If anyone can spot anything obviously wrong, please point it out!
int demo_game()
{
    long int d;
    int x, y;
    struct WorldCamera p_viewer;
    struct Point3D_LLA p_subj;
    struct Point2D_CalcRes p_res;
    p_viewer.hfov = 27;
    p_viewer.vfov = 32;
    p_viewer.width = 192;
    p_viewer.height = 128;
    p_viewer.p.lat = 51.26f;
    p_viewer.p.lon = -1.0862f;
    p_viewer.p.alt = 100.0f;
    p_subj.lat = 51.20f;
    p_subj.lon = -1.0862f;
    p_subj.alt = 100.0f;
    fill_buffer(draw_buffer_mask, 0x0000);
    fill_buffer(draw_buffer_level, 0xffff);
    compute_3d_transform(&p_viewer, &p_subj, &p_res, 10000.0f);
    x = p_res.pos.x;
    y = p_res.pos.y;
    write_circle_outlined(x, y, 1.0f / p_res.est_dist, 0, 0, 0, 1);
    p_viewer.p.lat -= 0.0001f;
    //p_viewer.p.alt -= 0.00001f;
    d = 20000;
    return 1;
}
The code for compute_3d_transform is:
void compute_3d_transform(struct WorldCamera *p_viewer, struct Point3D_LLA *p_subj, struct Point2D_CalcRes *res, float cliph)
{
    // Estimate the distance to the waypoint. This isn't intended to replace
    // proper lat/lon distance algorithms, but provides a general indication
    // of how far away our subject is from the camera. It works accurately for
    // short distances of less than 1km, but doesn't give distances in any
    // meaningful unit (lat/lon distance?)
    res->est_dist = hypot2(p_viewer->p.lat - p_subj->lat, p_viewer->p.lon - p_subj->lon);
    // Save precious cycles if outside of visible world.
    if(res->est_dist > cliph)
        goto quick_exit;
    // Compute the horizontal angle to the point.
    // atan2(y,x) so atan2(lon,lat) and not atan2(lat,lon)!
    res->h_angle = RAD2DEG(angle_dist(atan2(p_viewer->p.lon - p_subj->lon, p_viewer->p.lat - p_subj->lat), p_viewer->yaw));
    res->small_dist = res->est_dist * 0.0025f; // by trial and error this works well.
    // Using the estimated distance and altitude delta we can calculate
    // the vertical angle.
    res->v_angle = RAD2DEG(atan2(p_viewer->p.alt - p_subj->alt, res->est_dist));
    // Normalize the results to fit in the field of view of the camera if
    // the point is visible. If they are outside of (0,hfov] or (0,vfov]
    // then the point is not visible.
    res->h_angle += p_viewer->hfov / 2;
    res->v_angle += p_viewer->vfov / 2;
    // Set flags.
    if(res->h_angle < 0 || res->h_angle > p_viewer->hfov)
        res->flags |= X_OVER;
    if(res->v_angle < 0 || res->v_angle > p_viewer->vfov)
        res->flags |= Y_OVER;
    res->pos.x = (res->h_angle / p_viewer->hfov) * p_viewer->width;
    res->pos.y = (res->v_angle / p_viewer->vfov) * p_viewer->height;
    return;
quick_exit:
    res->flags |= X_OVER | Y_OVER;
    return;
}
Structure for the results:
typedef struct Point2D_Pixel { unsigned int x, y; };
// Structure for storing calculated results (from camera transforms.)
typedef struct Point2D_CalcRes
{
    struct Point2D_Pixel pos;
    float h_angle, v_angle, est_dist, small_dist;
    int flags;
};
The code is part of an open source project of mine so it's okay to post a lot of code here.
I see that some of your calculations depend on p_viewer->yaw, but I do not see any initialization for p_viewer->yaw. Is this your problem?
A couple of things that seem sketchy:
You can return from compute_3d_transform without setting many of the fields in p_res/res but the caller never checks for this situation.
You consistently read from res->flags without initializing it first.
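Both points can be addressed by starting from a known state and reporting whether the result is usable. The sketch below uses a hypothetical, simplified struct and flag rather than the project's real types, and the coordinate values are placeholders.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for Point2D_CalcRes. */
struct CalcResult {
    unsigned int x, y;
    int flags;
};

#define FLAG_CLIPPED 0x1  /* hypothetical flag */

/* Returns true only when x and y have actually been filled in. */
static bool compute_result(float dist, float clip, struct CalcResult *res)
{
    res->flags = 0;               /* never OR into an uninitialized value */
    if (dist > clip) {
        res->flags |= FLAG_CLIPPED;
        return false;             /* early exit: x and y are left untouched */
    }
    res->x = 96;                  /* placeholder values for the sketch */
    res->y = 64;
    return true;
}
```

A caller would declare the result as struct CalcResult r = {0}; and draw only when compute_result() returns true, so a stale or garbage coordinate is never passed on.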
Whenever the output differs between runs, it possibly means some value is not initialized and the outcome depends on the garbage value present in a variable. Keeping that in mind, I looked for uninitialized variables. The structure p_res is not initialized.
if(res->est_dist > cliph)
goto quick_exit;
That means the if condition may turn out to be true or false depending on what garbage value is stored in res->est_dist. When the condition turns out to be true, execution goes straight to the quick_exit label and p_res.pos.x is not updated. If the condition turns out to be false, it is updated.
When I used to program C, I would use a divide and conquer debugging technique for this kind of problem to try to isolate the offending operation (paying attention to whether the symptoms change as debugging code is added, which is indicative of dangling pointer type bugs).
Essentially, start with the first line where the value is known to be good (and prove that it is consistently good at that line). Then identify where it is known to be bad. Then insert a test approximately halfway between the two points to see if it's bad. If not, insert a test halfway between the mid-point and the known bad location; if it is bad, insert a test halfway between the mid-point and the known good location, and so on.
If the line identified is itself a function call, this process can be repeated in that called function, and so on.
When using this kind of approach, it's important to minimize the amount of added code and the artificial "noise", which can create timing changes.
Use this if you don't have (or can't use) an interactive debugger, or if the problem does not manifest when using one.
