OpenCL atomics do nothing - c

I am trying to program an example histogram tool using OpenCL. To start, I was just interested in atomically incrementing each bin. I came up with the following kernel code:
__kernel void Histogram(
__global const int* input,
__global int* histogram,
int numElements) {
// get index into global data array
int iGID = get_global_id(0);
// bound check, equivalent to the limit on a 'for' loop
if (iGID >= numElements) {
if( iGID < 100 ) {
// initialize histogram
histogram[iGID] = 0;
int bin = input[iGID];
But the output histogram is zero in every bin. Why is that? Furthermore, really strange things happen if I put a printf(" ") in the last line: suddenly, it works. I am completely lost. Does anyone have an idea why this happens?
I enabled all extensions

I solved the problem myself.
When nothing else fixed the problem, I tried changing the CLDevice to the CPU. Everything worked as it was supposed to (unfortunately very slowly :D). But this gave me the idea that it might not be a code problem but an OpenCL infrastructure problem.
I updated the OpenCL platform of AMD and now everything works.
Thank you, in case you thought about my problem.


Program is taking way longer than expected, is it running properly?

not sure this is the right place...
I am running brute-force code to solve an asymmetric traveling salesman problem.
It has 17 cities, one of which is fixed, so there are 16! (> 20 trillion) permutations to check.
unsigned long TotalCost(unsigned long *Matrix, short *Path, short
unsigned long result = 0;
unsigned long Cost;
int iD;
for (iD = 1; iD <= Dimention; iD++)
Cost = Matrix[Dimention*Path[iD - 1] + Path[iD]];
if (Cost > 0)
result = result + Cost;
return 4099999999;
return result;
void swapP(short *x, short *y)
short temp;
temp = *x;
*x = *y;
*y = temp;
void permute(unsigned long *Matrix, short Dimention, unsigned long *CurrentMin, short *PerPath, short **MinPath, short l, short r)
short i;
unsigned long CCost;
if (l == r)
CCost = TotalCost(Matrix, PerPath, Dimention);
if (CCost < (*CurrentMin))
for (i = 0; i <= Dimention; i++)
(*MinPath)[i] = PerPath[i];
(*CurrentMin) = CCost;
PrintResults(Matrix, PerPath, Dimention, 2);
for (i = l; i <= r; i++)
swapP((PerPath+l), (PerPath+i));
permute(Matrix, Dimention, CurrentMin, PerPath, MinPath, l+1, r);
swapP((PerPath+l), (PerPath+i)); //backtrack
int main (void)
// The ommited code here, allocs memory for the matrix, HcG and HrGR array
// it also initializes them
permute(Matrix, Dimention, &TotalMin, HcG, &HrGR, 1, Dimention - 1);
I tested the above code for an instance of five cities and it returned successfully as expected in a few milliseconds.
For the 17 cities, I initially thought it would take a few hours to solve, and then a couple of days. It has been running for 4 days now and I'm beginning to suspect that the program, for some reason, is no longer running, as if it's frozen.
I'm not getting any errors, but it's taking way longer than I expected. The program prints the total cost and the path every time it finds a path with lower cost, but it stopped printing half an hour after it started.
I am using Ubuntu 18.04, the program is "running" in a terminal, and the system monitor shows Memory: N/A. Does that mean it's not using memory?
It also shows CPU: 6%. Can I increase it?
Is there a way to check if it is running properly? Or estimate how long it will take to finish?
I'm so unsure about its integrity that I think I should stop the process, but at the same time I really want to see the results.
I only glanced through your code, but I have done things like this many times in the past. My general approach for this is as follows (although it adds a small cost) ...
Add a print statement (perhaps with a mod counter) placed so that you would expect output approximately once every 2 to 3 minutes. Include some information in the print so that you can tell how far your simulation has progressed. (Note: among that information you probably want to print variables that, if they get trashed, could cause infinite looping, for example "Dimention", which you have misspelled, by the way.)
I would personally not have jumped from 5 cities to 17. Rather 5 to 7, then maybe 9 or 10 ... just to confirm all is working and to get an idea how much time increase to expect with your particular CPU.
Finally, in the situation you are in now, is it possible to get another window and run "ps" to see if your job is getting any CPU time? If not, my approach would be to kill it and implement as I described above. HTH.
Note also, the code you have omitted (memory allocation, etc) is critical: the code as written has the potential to go out of bounds, and possibly not crash (if only slightly out of bounds) but rather end up trashing variables (depending on memory layout) that could (as mentioned above) create an infinite or near-infinite loop.
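The progress-print suggestion above can be sketched as follows. This is a minimal sketch, not the poster's code: factorial, eta_seconds and report_progress are hypothetical helper names, and the idea is to call report_progress once per complete permutation (the l == r branch of permute). Since 16! is 20,922,789,888,000, even a hypothetical rate of 100 million permutations per second would need roughly 2.4 days, so an estimate of this kind is worth printing early.

```c
#include <stdio.h>
#include <time.h>

/* n! as a 64-bit value; only valid for n <= 20, which covers 16 free cities. */
unsigned long long factorial(unsigned n)
{
    unsigned long long f = 1;
    for (unsigned k = 2; k <= n; k++)
        f *= k;
    return f;
}

/* Linear extrapolation of the remaining time; returns -1 if nothing is done yet. */
double eta_seconds(unsigned long long done, unsigned long long total, double elapsed)
{
    if (done == 0)
        return -1.0;
    return elapsed * (double)(total - done) / (double)done;
}

/* Call once per complete permutation; prints every `interval` calls.
   Assumes interval > 0. */
void report_progress(unsigned long long *done, unsigned long long total,
                     unsigned long long interval, time_t start)
{
    (*done)++;
    if (*done % interval == 0) {
        double elapsed = difftime(time(NULL), start);
        fprintf(stderr, "%llu of %llu permutations, %.0f s elapsed, about %.0f s left\n",
                *done, total, elapsed, eta_seconds(*done, total, elapsed));
    }
}
```

With total set to factorial(16) and interval tuned so that a line appears every couple of minutes, the overhead stays small, and a run that has silently stalled becomes obvious because the counter stops moving.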

Problematic while loop in OpenCL kernel: Execution hangs

I wrote an OpenCL kernel that generates random numbers inside a while loop in the device. Once an acceptable random number is obtained, the kernel should exit the loop and give the result back to the host. Typically, the number of iterations per workitem is ~100-1000.
The problem is that this code hangs when I enable the while loop and never returns a result. If I just disable the while loop–i.e. generating only one random number instead of 100s–the kernel works fine.
Does anybody have any idea what might be going on? The kernel code is below and also available at this github repo. One possibility is that the system (macOS in my case) prevents the GPU from spending a long time executing a task, as described here, but I am not sure.
#include <clRNG/mrg31k3p.clh> // for random number generation
#include "exposure.clh" // defines function exposure
__kernel void cr(__global clrngMrg31k3pHostStream* streams, __global float* xa, __global float* ya, const int n) {
int i = get_global_id(0);
float x,y,sampling;
if (i<n) {
// Loop that produces individual CRs
while (1) {
clrngMrg31k3pStream private_stream_d; // This is not a pointer!
clrngMrg31k3pCopyOverStreamsFromGlobal(1, &private_stream_d, &streams[i]);
// random number between 0 and 360
// random number between 0 and 1
// To avoid concentrations towards the poles, generates sin(delta)
// between -1 and +1, then converts to delta
y = asin((float)(2.*y-1.))*180./M_PI_F; // dec
// If sampling<exposure for a given CR, it is accepted
if (sampling <= exposure(y)) {
You are re-creating the random stream over and over again; perhaps it always creates the same output, which is why your while loop never terminates. Try creating the random stream above your loop that pulls from it.
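The effect described in this answer can be reproduced in plain C. This is a minimal sketch under stated assumptions: stream_t and next_uniform are hypothetical stand-ins built on a simple linear congruential generator, not the clRNG API, and both loops are capped at max_iter so the example always terminates.

```c
#include <stdint.h>

/* Hypothetical stand-in for a per-work-item random stream (not the clRNG API). */
typedef struct { uint32_t state; } stream_t;

/* One step of a 32-bit linear congruential generator; returns a value in [0, 1). */
double next_uniform(stream_t *s)
{
    s->state = s->state * 1664525u + 1013904223u;
    return s->state / 4294967296.0;
}

/* Problematic pattern: the private copy is re-created on every pass, so every
   pass draws the same number. Returns the accepting pass, or -1 if none
   occurs within max_iter passes. */
int sample_copy_inside(const stream_t *global, double threshold, int max_iter)
{
    for (int it = 1; it <= max_iter; it++) {
        stream_t local = *global;          /* state is reset here each time */
        if (next_uniform(&local) <= threshold)
            return it;
    }
    return -1;
}

/* Suggested pattern: copy once, then keep drawing from the same private stream. */
int sample_copy_outside(const stream_t *global, double threshold, int max_iter)
{
    stream_t local = *global;              /* state is copied only once */
    for (int it = 1; it <= max_iter; it++) {
        if (next_uniform(&local) <= threshold)
            return it;
    }
    return -1;
}
```

If the first draw happens to be rejected, the first version can never accept, which on a GPU would look like a hang. Capping the iteration count, as done here, is also a cheap safeguard while debugging a kernel.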

Memory problems with C

Hello and have a good day, I've come here after days of trial and error so forgive me if I'm being silly.
I have the following code. The idea is to first read all the files I have and store all the data in an Nsites x Nx x Ny matrix, and then use the data for other unrelated things.
The amount of data is not very large: I have 800 data files which occupy no more than 80MB, but if I try to use a number for DataFiles higher than 134 I get a Segmentation Fault error.
I think it's weird, because if it works with DataFiles=100, why shouldn't it work for higher values?
I thought it was because for some reason my program does not get enough memory allocated for the process, or because I'm having an issue when allocating the memory. But I always have the same amount of data, my data files have exactly 88*44 values, and working only up to 134 files is... I don't have experience with "high amounts" of data/memory usage, but I think that 1000*88*44, which is about 10^6 doubles, is not too much.
I'm using the GCC compiler and Ubuntu (14.02 I think). When I try to compile and execute this program on Windows using Codeblocks it just crashes (another mystery).
Oh I also had a terminal open with RAM memory usage and with 134 files it was nothing big to handle for the computer.
EDIT: I also tried making several [100][Nx][Ny] arrays and use them one by one but that also lead to the Segmentation Fault error.
EDIT2: minor erratas text and code
Also, I'm doing it this way because I need all that data simultaneously... I've been thinking of ways to avoid this, but over the last couple of days I didn't find any alternative.
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
const int Nx=88; //
const int Ny=44; //
const int DataFiles=100; // How many data files are we going to read
int main() {
int i, j, ki , kj ,index;
double fun[DataFiles][Nx][Ny], Res[DataFiles][Nx][Ny],mean[Nx][Ny];
FILE * conf;
char file[100];
for (index=0; index<DataFiles; index++){
fscanf(conf,"%i %i %lf", &i, &j, &fun[index][ki][kj]);
mean[ki][kj] = mean[ki][kj] + fun[index][ki][kj] ;
fclose (conf);
// do things with my loaded data
You ran out of stack. Generally speaking, don't allocate more than 8k at once on the stack. Oops. Replace
double fun[DataFiles][Nx][Ny], Res[DataFiles][Nx][Ny],mean[Nx][Ny];
with
double (*fun)[Nx][Ny] = malloc(sizeof(fun[0]) * DataFiles), (*Res)[Nx][Ny] = malloc(sizeof(Res[0]) * DataFiles), mean[Nx][Ny];
if (!fun || !Res) {
/* handle OOM */
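The threshold of 134 files in the question is consistent with running out of stack. The sketch below adds up the automatic storage taken by fun, Res and mean; stack_bytes is a hypothetical helper, and the 8 MiB figure is an assumption (the usual default stack limit on Linux), not something stated in the question.

```c
#include <stddef.h>

/* Bytes of automatic storage used by
   double fun[files][nx][ny], Res[files][nx][ny], mean[nx][ny]. */
size_t stack_bytes(size_t files, size_t nx, size_t ny)
{
    size_t per_file = nx * ny * sizeof(double);   /* one nx-by-ny slice */
    return 2 * files * per_file + per_file;       /* fun + Res, plus mean */
}
```

With 8-byte doubles, stack_bytes(134, 88, 44) is 8,332,544 bytes, just under 8 MiB (8,388,608 bytes), while stack_bytes(135, 88, 44) is 8,394,496 bytes, just over it. Moving the two large arrays to the heap removes that limit.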

Getting the compiler to auto-vectorize code in a sensible manner

I'm trying to figure out how to structure the main loop code for a numerical simulation in such a way that the compiler generates nicely vectorized instructions in a compact way.
The problem is most easily explained with C pseudocode, but I also have a Fortran version which is affected by the same kind of issue. Consider the following loop, where lots_of_code_* are complicated expressions that each produce a fair number of machine instructions.
void process(const double *in_arr, double *out_arr, int len)
for (int i = 0; i < len; i++)
const double a = lots_of_code_a(i, in_arr);
const double b = lots_of_code_b(i, in_arr);
const double z = lots_of_code_z(i, in_arr);
out_arr[i] = final_expr(a, b, ..., z);
When compiled with an AVX target the Intel compiler generates code which goes like
The resulting binary is already quite sizable. My actual calculation loop, though, looks more like the following:
void process(const double *in_arr1, ... , const double *in_arr30,
double *out_arr1, ... double *out_arr30,
int len)
for (int i = 0; i < len; i++)
const double a1 = lots_of_code_a(i, in_arr1);
const double a30 = lots_of_code_a(i, in_arr30);
const double b1 = lots_of_code_b(i, in_arr1);
const double b30 = lots_of_code_b(i, in_arr30);
const double z1 = lots_of_code_z(i, in_arr1);
const double z30 = lots_of_code_z(i, in_arr30);
out_arr1[i] = final_expr1(a1, ..., z1);
out_arr30[i] = final_expr30(a30, ..., z30);
This results in a very large binary indeed (400KB for the Fortran version, 800KB for C99). If I now define lots_of_code_* as functions, then each function gets turned into non-vectorized code. Whenever the compiler decides to inline a function it does vectorize it, but seems to also duplicate the code each time as well.
In my mind, the ideal code should look like:
call AVX_lots_of_code_a
call AVX_lots_of_code_a
call SSE_lots_of_code_a
call SSE_lots_of_code_a
call scalar_lots_of_code_a
call scalar_lots_of_code_a
This clearly results in a much smaller code which is still just as well optimized as the fully-inlined version. With luck it might even fit in L1.
Obviously I can write this myself using intrinsics or whatever, but is it possible to get the compiler to automatically vectorize in the way described above through "normal" source code?
I understand that the compiler will probably never generate separate symbols for each vectorized version of the functions, but I thought it could still just inline each function once inside process and use internal jumps to repeat the same code block, rather than duplicating code for each input array.
Formal answer to questions like yours:
Consider using OpenMP 4.0 SIMD-enabled (I didn't say inlined) functions or equivalent proprietary mechanisms. They are available in the Intel compiler and in GCC 4.9 or later.
See more details here:
//Invoke this function from vectorized loop
#pragma omp declare simd
int vfun(int x, int y)
{
    return x*x+y*y;
}
It will give you capability to vectorize loop with function calls without inlining and as a result without huge code generation. (I didn't really explore your code snippet in details; instead I answered the question you asked in textual form)
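To show how such a function is used from the calling side, here is a minimal sketch. The body of lots_of_code_a is a hypothetical stand-in for one of the question's expressions, and the single-array process signature is simplified for illustration. With GCC the pragmas take effect under -fopenmp or -fopenmp-simd; without those flags they are ignored and the code still runs as scalar code.

```c
#include <stddef.h>

/* Hypothetical stand-in for one of the "lots_of_code" expressions.
   The pragma asks the compiler to also emit vector variants of it. */
#pragma omp declare simd notinbranch
double lots_of_code_a(double x)
{
    return x * x + 1.0;
}

/* The loop itself is marked for SIMD execution and calls the function
   instead of relying on it being inlined. */
void process(const double *restrict in, double *restrict out, size_t len)
{
    #pragma omp simd
    for (size_t i = 0; i < len; i++)
        out[i] = lots_of_code_a(in[i]);
}
```

In a real build the function would typically live in its own translation unit, with the same pragma on its declaration in a header, so that callers link against the vector variants rather than inlining the body. Whether the compiler actually emits calls to those variants is worth checking in the generated assembly.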
The immediate problem that comes to mind is the lack of restrict on the input/output-pointers. The input is const though, so it's probably not too much of a problem, unless you have multiple output-pointers.
Other than that, I recommend -fassociative-math or whatever the ICC equivalent is. Structurally, you seem to iterate over the array, doing multiple independent operations on the array that are only munged together in the very end. Strict fp compliance might kill you on the array-operations.
Finally, there's probably no way this will get vectorized if you need more intermediate results than vector_registers - input_arrays.
Edit:
I think I see your problem now. You call the same function on different data, and want each result stored independently, right?
The problem is that the same function always writes to the same output register, so subsequent, vectorized calls would clobber earlier results. The solution could be:
A stack of results (either in memory or like the old x87 FPU-stack), that gets pushed every time. If in memory, it is slow, if x87, it's not vectorized. Bad idea.
Effectively multiple functions to write into different registers. Code duplication. Bad idea.
Rotating registers, like on the Itanium. You don't have an Itanium? You're not alone.
It's possible that this can't be easily vectorized on current architectures. Sorry.
Edit, you're apparently fine with going to memory:
void function1(double const *restrict inarr1, double const *restrict inarr2,
               double *restrict outarr, size_t n)
{
    for (size_t i = 0; i<n; i++)
    {
        double intermediateres[NUMFUNCS];
        double * rescursor = intermediateres;
        *rescursor++ = mungefunc1(inarr1[i]);
        *rescursor++ = mungefunc1(inarr2[i]);
        *rescursor++ = mungefunc2(inarr1[i]);
        *rescursor++ = mungefunc2(inarr2[i]);
        outarr[i] = finalmunge(intermediateres[0],...,intermediateres[NUMFUNCS-1]);
    }
}
This might be vectorizable. I don't think it'll be all that fast, going at memory speed, but you never know till you benchmark.
If you move the lots_of_code blocks into separate compilation units without the for loop, they will probably not vectorize. Unless the compiler has a motive for vectorization, it will not vectorize the code, because vectorization might lead to longer latencies in the pipelines. To get around that, split the loop into 30 loops, and put each one of them in a separate compilation unit like this:
for (int i = 0; i < len; i++)
lots_of_code_a(i, in_arr1);

Seg Fault when initializing array

I'm taking a class on C, and running into a segmentation fault. From what I understand, seg faults are supposed to occur when you're accessing memory that hasn't been allocated, or otherwise outside the bounds. 'Course all I'm trying to do is initialize an array (though rather large at that)
Am I simply misunderstanding how to parse a 2d array? Misplacing a bound is exactly what would cause a seg fault-- am I wrong in using a nested for-loop for this?
The professor provided the clock functions, so I'm hoping that's not the problem. I'm running this code in Cygwin, could that be the problem? Source code follows. Using c99 standard as well.
To be perfectly clear: I am looking for help understanding (and eventually fixing) the reason my code produces a seg fault.
#include <stdio.h>
#include <time.h>
int main(void){
//first define the array and two doubles to count elapsed seconds.
double rowMajor, colMajor;
rowMajor = colMajor = 0;
int majorArray [1000][1000] = {};
clock_t start, end;
//set it up to perform the test 100 times.
for(int k = 0; k<10; k++)
//first we do row major
for(int i = 0; i < 1000; i++)
for(int j = 0; j<1000; j++)
majorArray[i][j] = 314;
rowMajor+= (end-start)/(double)CLOCKS_PER_SEC;
//at this point, we've only done rowMajor, so elapsed = rowMajor
//now we do column major
for(int i = 0; i < 1000; i++)
for(int j = 0; j<1000; j++)
majorArray[j][i] = 314;
colMajor += (end-start)/(double)CLOCKS_PER_SEC;
//now that we've done the calculations 100 times, we can compare the values.
printf("Row major took %f seconds\n", rowMajor);
printf("Column major took %f seconds\n", colMajor);
printf("Row major is faster\n");
printf("Column major is faster\n");
return 0;
Your program works correctly on my computer (x86-64/Linux) so I suspect you're running into a system-specific limit on the size of the call stack. I don't know how much stack you get on Cygwin, but your array is 4,000,000 bytes (with 32-bit int) - that could easily be too big.
Try moving the declaration of majorArray out of main (put it right after the #includes) -- then it will be a global variable, which comes from a different allocation pool that can be much bigger.
By the way, this comparison is backwards:
printf("Row major is faster\n");
printf("Column major is faster\n");
Also, to do a test like this you really ought to repeat the process for many different array sizes and shapes.
You are trying to grab 1000 * 1000 * sizeof( int ) bytes on the stack. This is more than your OS allows for stack growth. If you are on any Unix, check ulimit -a for the maximum stack size of the process.
As a rule of thumb - allocate big structures on the heap with malloc(3). Or use static arrays - outside of scope of any function.
In this case, you can replace the declaration of majorArray with:
int (*majorArray)[1000] = calloc(1000, sizeof *majorArray);
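A fuller sketch of the heap approach, for illustration only: fill_on_heap and N are hypothetical names, and the function fills the matrix the same way the question's loops do. The element size passed to calloc is the size of one row (sizeof *m), not the size of the pointer.

```c
#include <stdlib.h>

enum { N = 1000 };

/* Allocates an N x N int matrix on the heap, fills it with `value`,
   and returns the sum of all elements, or -1 if allocation fails. */
long fill_on_heap(int value)
{
    int (*m)[N] = calloc(N, sizeof *m);   /* N rows of N ints each */
    if (m == NULL)
        return -1;

    long sum = 0;
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            m[i][j] = value;              /* same indexing as a 2D array */
            sum += m[i][j];
        }
    }

    free(m);
    return sum;
}
```

Because m is a pointer to rows of N ints, the rest of the program can keep the m[i][j] syntax unchanged; only the declaration and the matching free differ from the stack version.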
I was unable to find any error in your code, so I compiled and ran it, and it worked as expected.
You have, however, a semantic error in your code:
//set it up to perform the test 100 times.
for(int k = 0; k<10; k++)
Should be:
//set it up to perform the test 100 times.
for(int k = 0; k<100; k++)
Also, the condition at the end should be changed to its inverse:
Finally, to avoid the problem of the os-specific stack size others mentioned, you should define your matrix outside main():
#include <stdio.h>
#include <time.h>
int majorArray [1000][1000];
int main(void){
//first define the array and two doubles to count elapsed seconds.
double rowMajor, colMajor;
rowMajor = colMajor = 0;
This code runs fine for me under Linux and I can't see anything obviously wrong about it. You can try to debug it via gdb. Compile it like this:
gcc -g -o testcode test.c
and then say
gdb ./testcode
and in gdb say run
If it crashes, say where and gdb tells you where the crash occurred. Then you know which line the error is in.
The program works perfectly when compiled with gcc and run on Linux, so Cygwin may very well be your problem here.
If it runs correctly elsewhere, you're most likely trying to grab more stack space than the OS allows. You're allocating 4MB on the stack (1 million integers), which is way too much to allocate "safely" on the stack. malloc() and free() are your best bets here.
