OpenCL CLK_LOCAL_MEM_FENCE causing abort trap 6 - c

I'm doing some exercise about convolution over images (info here) using OpenCL. When I use images whose size is not a square (like r x c) CLK_LOCAL_MEM_FENCE makes the program stop with abort trap 6.
What I do is essentially filing up the local memory with proper values, waiting for this process of filling the local memory to finish, using barrier(CLK_LOCAL_MEM_FENCE) and then calculating the values.
It seems like when I use images like those I've told you about barrier(CLK_LOCAL_MEM_FENCE) gives issues, if I comment that command everything work fine (which is weird since there's no synchronization). What may cause this problem any idea?
EDIT: the problem comes when the hight or the width or both are not multiple of the the local items size (16 x 16). The global items size is aways a couple of values multiple of 16 like (512 x 512).
int c = get_global_id(0);
int r = get_global_id(1);
int lc = get_local_id(0);
int lr = get_local_id(1);
// this ignores indexes out of the input image.
if (c >= ImageWidth || r >= ImageHeight) return;
// fill a local array...
barrier(CLK_LOCAL_MEM_FENCE);
if (c < outputImageWidth && r < outputImageHeight)
{
// LOCAL DATA PROCESSED
OutputImage[r* outputImageWidth +c] = someValue;
}

OpenCL requires that each work-group barrier is executed by every work-item in that work-group.
In the code that you have posted, you have an early exit clause to prevent out-of-range accesses. This is a common trick for getting nice work-group sizes in OpenCL 1.X, but unfortunately this breaks the above condition, and this will lead to undefined behaviour (typically either a hang or a crash).
You will need to modify your kernel to avoid this, by either removing the early exit clause (and perhaps clamping out-of-range work-items instead, if applicable), or by restructuring the kernel so that out-of-range work-items continue at least as far as the barrier before exiting.

You can change the code order without affecting the behaviour to fix it:
int c = get_global_id(0);
int r = get_global_id(1);
int lc = get_local_id(0);
int lr = get_local_id(1);
// fill a local array... with all the threads
// ie: for(i=0;i<size;i+=get_local_size(0))
// ...
barrier(CLK_LOCAL_MEM_FENCE);
// this ignores indexes out of the input image.
if (c >= ImageWidth || r >= ImageHeight) return;
if (c < outputImageWidth && r < outputImageHeight)
{
// LOCAL DATA PROCESSED
OutputImage[r* outputImageWidth +c] = someValue;
}

Related

Why does thread not recognize change of a flag?

I have a strange situation under C/Visual Studio on a Windows 7 platform. There is a problem from time to time and I spent a lot of time to find it. The problem is within a third party library, for which I have the complete code. There a thread is created (the printLog statements are from myself):
...
plafParams->eventThreadFlag = 2;
printLog("before CreateThread");
if (plafParams->hReadThread_p = CreateThread(NULL, 0, ( LPTHREAD_START_ROUTINE ) plafPortReadThread, ( void * ) dlmsInstance, 0,
&plafParams->portReadThreadID) )
{
printLog("after CreateThread: OK");
plafParams->eventThreadFlag = 3;
}
else
{
unsigned int lasterr = GetLastError();
printLog("error CreateThread, last error:%x", lasterr);
/* Could not create the read thread. */
...
...
return FAILURE;
}
printLog("SUCCESS");
...
...
The thread function is:
void *plafPortReadThread(DLMS_GLOBALS *dlmsInstance)
{
PLAF_PARAMS *plafParams;
plafParams = (PLAF_PARAMS *)(dlmsInstance->plafParams);
printLog("start - plafPortReadThread, plafParams->eventThreadFlag=%x", plafParams->eventThreadFlag);
while ((plafParams->eventThreadFlag != 1) && (plafParams->eventThreadFlag != 3))
{
if (plafParams->eventThreadFlag == 0)
{
printLog("exit 1 - plafPortReadThread, plafParams->eventThreadFlag=%x", plafParams->eventThreadFlag);
CloseHandle(plafParams->hReadThread_p);
plafFree((void **)&plafParams);
ExitThread(0);
break;
}
}
printLog("start - plafPortReadThread, proceed=%d", proceed);
...
Now, when the flag is set before the while loop is started within the thread, everything works OK:
SUCCESS
start - plafPortReadThread, plafParams->eventThreadFlag=3
But sometimes the thread is quick enough so the while loop is started before the flag is actually set within the outer part.
The output is then:
start - plafPortReadThread, plafParams->eventThreadFlag=2
SUCCESS
Most surprisingly the while loop doesn't exit, even after the flag has been set to 3.
It seems, that the compiler "optimizes" the flag and assumes, that it cannot be changed from outside.
What could be the problem? I'm really surprised. Or is there something else I have overseen completely? I know, that the code is not very elegant and that such things should better be done with semaphores or signals. But it is not my code and I want to change as little as possible.
After removing the whole while condition it works as expected.
Should I change the struct or its fields to volatile ? Everybody says, that volatile is useless in our days and not needed anymore, except in the case, where a memory location is changed by peripherals...
Prior to C11 this is totally platform-dependent, because the effect you are observing is due to the memory model used by your platform. This is different from a compiler optimization as synchronization points between threads require the compiler to insert barrier instructions, instead of, e.g., making something a constant. For C11 for section 7.17.3 specifies the different models. So your value is not optimized out statically, thread A just never reads the value thread B wrote, but still has its local value.
In practice many projects don't use C11 yet, and thus you will likely have to check the documentation of your platform. Note that in many cases you don't have to modify the type of the variable for the flag (in case you can't). Most memory models specify synchronization points that also forbid reordering of certain instructions, i.e. in:
int x = 3;
_Atomic int a = 1;
x = 5;
a = 2;
the compiler will often have to ensure that x has the value 3 when a has the value 1, and that when a is assigned the value 2, x will have the value 5. volatile does not participate in this relationship (in the C/C++ 11 models - often confused because it does participate in Java's happened-before), and is mostly useless, unless your writes should never be optimized out because they have side-effects such as a LED blinking which the compiler can't understand:
volatile int x = 1; // some special location - blink then clear
x = 1; // blink then clear
x = 1; // blink then clear

tasks run in thread takes longer than in serial?

So im doing some computation on 4 million nodes.
the very bask serial version just have a for loop which loops 4 million times and do 4 million times of computation. this takes roughly 1.2 sec.
when I split the for loop to, say, 4 for loops and each does 1/4 of the computation, the total time became 1.9 sec.
I guess there are some overhead in creating for loops and maybe has to do with cpu likes to compute data in chunk.
The real thing bothers me is when I try to put 4 loops to 4 thread on a 8 core machine, each thread would take 0.9 seconds to finish.
I am expecting each of them to only take 1.9/4 second instead.
I dont think there are any race condition or synchronize issue since all I do was having a for loop to create 4 threads, which took 200 microseconds. And then a for loop to joins them.
The computation read from a shared array and write to a different shared array.
I am sure they are not writing to the same byte.
Where could the overhead came from?
main: ncores: number of cores. node_size: size of graph (4 million node)
for(i = 0 ; i < ncores ; i++){
int *t = (int*)malloc(sizeof(int));
*t = i;
int iret = pthread_create( &thread[i], NULL, calculate_rank_p, (void*)(t));
}
for (i = 0; i < ncores; i++)
{
pthread_join(thread[i], NULL);
}
calculate_rank_p: vector is the rank vector for page rank calculation
Void *calculate_rank_pthread(void *argument) {
int index = *(int*)argument;
for(i = index; i < node_size ; i+=ncores)
current_vector[i] = calc_r(i, vector);
return NULL;
}
calc_r: this is just a page rank calculation using compressed row format.
double calc_r(int i, double *vector){
double prank = 0;
int j;
for(j = row_ptr[i]; j < row_ptr[i+1]; j++){
prank += vector[col_ind[j]] * val[j];
}
return prank;
}
everything that is not declared are global variable
The computation read from a shared array and write to a different shared array. I am sure they are not writing to the same byte.
It's impossible to be sure without seeing relevant code and having some more details, but this sounds like it could be due to false sharing, or ...
the performance issue of false sharing (aka cache line ping-ponging), where threads use different objects but those objects happen to be close enough in memory that they fall on the same cache line, and the cache system treats them as a single lump that is effectively protected by a hardware write lock that only one core can hold at a time. This causes real but invisible performance contention; whichever thread currently has exclusive ownership so that it can physically perform an update to the cache line will silently throttle other threads that are trying to use different (but, alas, nearby) data that sits on the same line.
http://www.drdobbs.com/parallel/eliminate-false-sharing/217500206
UPDATE
This looks like it could very well trigger false sharing, depending on the size of a vector (though there is still not enough information in the post to be sure, as we don't see how the various vector are allocated.
for(i = index; i < node_size ; i+=ncores)
Instead of interleaving which core works on which data i += ncores give each of them a range of data to work on.
For me the same surprise when build and run in Debug (other test code though).
In release all as expected ;)

segmentation fault when using omp parallel for, but not sequentially

I'm having trouble using the #pragma omp parallel for
Basically I have several hundred DNA sequences that I want to run against an algorithm called NNLS.
I figured that doing it in parallel would give me a pretty good speed up, so I applied the #pragma operators.
When I run it sequentially there is no issue, the results are fine, but when I run it with #pragma omp parallel for I get a segfault within the algorithm (sometimes at different points).
#pragma omp parallel for
for(int i = 0; i < dir_count; i++ ) {
int z = 0;
int w = 0;
struct dirent *directory_entry;
char filename[256];
directory_entry = readdir(input_directory_dh);
if(strcmp(directory_entry->d_name, "..") == 0 || strcmp(directory_entry->d_name, ".") == 0) {
continue;
}
sprintf(filename, "%s/%s", input_fasta_directory, directory_entry->d_name);
double *count_matrix = load_count_matrix(filename, width, kmer);
//normalize_matrix(count_matrix, 1, width)
for(z = 0; z < width; z++)
count_matrix[z] = count_matrix[z] * lambda;
// output our matricies if we are in debug mode
printf("running NNLS on %s, %d, %d\n", filename, i, z);
double *trained_matrix_copy = malloc(sizeof(double) * sequences * width);
for(w = 0; w < sequences; w++) {
for(z = 0; z < width; z++) {
trained_matrix_copy[w*width + z] = trained_matrix[w*width + z];
}
}
double *solution = nnls(trained_matrix_copy, count_matrix, sequences, width, i);
normalize_matrix(solution, 1, sequences);
for(z = 0; z < sequences; z++ ) {
solutions(i, z) = solution[z];
}
printf("finished NNLS on %s\n", filename);
free(solution);
free(trained_matrix_copy);
}
gdb always exits at a different pint in my thread, so I can't figure out what is going wrong.
What I have tried:
allocating a copy of each matrix, so that they would not be writing on top of eachother
using a mixture of private/shared operators for the #pragma piece
using different input sequences
writing out my trained_matrix and count_matrix prior to calling NNLS, ensuring that they look OK. (they do!)
I'm sort of out of ideas. Does anyone have some advice?
Solution: make sure not use static variables in your function when multithreading (damned f2c translator)
Defining "#pragma omp parallel for" is not going to give you what you want. Based on the algorithm you have, you must have a solid plan on which variables are going to shared and which ones going to private among the processors.
Looking at this link should give you a quick start on how to correctly share the work among the threads.
Based on your statement "I get a segfault within the algorithm (sometimes at different points)", I would think there is a race condition between the threads or improper initialization of variables.
Function readdir is not thread safe. To quote the Linux man page for readdir(3):
The data returned by readdir() may be overwritten by subsequent calls to readdir()
for the same directory stream.
Consider putting the calls to readdir inside a critical section. Before leaving the critical section, copy the filename returned from readdir() to a local temporary variable, since the next thread to enter the critical section may overwrite it.
Also consider protecting your output operations with a critical section too, otherwise the output from different threads might be jumbled together.
A very possible reason is the stack limit. As MutantTurkey mentioned, if you have a lot of static variables (like a huge array defined in subroutine), they may use up your stack.
To solve this, first run ulimit -s to check the stack limit for the process. You can use ulimit -s unlimited to set it as ulimited. Then if it still crashes, try to increase the stack for OPENMP by setting OMP_STACKSIZE environmental variable to a huge value, like 100MB.
Intel has a discussion at https://software.intel.com/en-us/articles/determining-root-cause-of-sigsegv-or-sigbus-errors. It has more information of stack and heap memory.

Do we need a lock in a single writer multi-reader system?

In os books they said there must be a lock to protect data from accessed by reader and writer at the same time.
but when I test the simple example in x86 machine,it works well.
I want to know, is the lock here nessesary?
#define _GNU_SOURCE
#include <sched.h>
#include <stdlib.h>
#include <stdio.h>
#include <pthread.h>
struct doulnum
{
int i;
long int l;
char c;
unsigned int ui;
unsigned long int ul;
unsigned char uc;
};
long int global_array[100] = {0};
void* start_read(void *_notused)
{
int i;
struct doulnum d;
int di;
long int dl;
char dc;
unsigned char duc;
unsigned long dul;
unsigned int dui;
while(1)
{
for(i = 0;i < 100;i ++)
{
dl = global_array[i];
//di = d.i;
//dl = d.l;
//dc = d.c;
//dui = d.ui;
//duc = d.uc;
//dul = d.ul;
if(dl > 5 || dl < 0)
printf("error\n");
/*if(di > 5 || di < 0 || dl > 10 || dl < 5)
{
printf("i l value %d,%ld\n",di,dl);
exit(0);
}
if(dc > 15 || dc < 10 || dui > 20 || dui < 15)
{
printf("c ui value %d,%u\n",dc,dui);
exit(0);
}
if(dul > 25 || dul < 20 || duc > 30 || duc < 25)
{
printf("uc ul value %u,%lu\n",duc,dul);
exit(0);
}*/
}
}
}
int start_write(void)
{
int i;
//struct doulnum dl;
while(1)
{
for(i = 0;i < 100;i ++)
{
//dl.i = random() % 5;
//dl.l = random() % 5 + 5;
//dl.c = random() % 5 + 10;
//dl.ui = random() % 5 + 15;
//dl.ul = random() % 5 + 20;
//dl.uc = random() % 5 + 25;
global_array[i] = random() % 5;
}
}
return 0;
}
int main(int argc,char **argv)
{
int i;
cpu_set_t cpuinfo;
pthread_t pt[3];
//struct doulnum dl;
//dl.i = 2;
//dl.l = 7;
//dl.c = 12;
//dl.ui = 17;
//dl.ul = 22;
//dl.uc = 27;
for(i = 0;i < 100;i ++)
global_array[i] = 2;
for(i = 0;i < 3;i ++)
if(pthread_create(pt + i,NULL,start_read,NULL) < 0)
return -1;
/* for(i = 0;i < 3;i ++)
{
CPU_ZERO(&cpuinfo);
CPU_SET_S(i,sizeof(cpuinfo),&cpuinfo);
if(0 != pthread_setaffinity_np(pt[i],sizeof(cpu_set_t),&cpuinfo))
{
printf("set affinity %d\n",i);
exit(0);
}
}
CPU_ZERO(&cpuinfo);
CPU_SET_S(3,sizeof(cpuinfo),&cpuinfo);
if(0 != pthread_setaffinity_np(pthread_self(),sizeof(cpu_set_t),&cpuinfo))
{
printf("set affinity recver\n");
exit(0);
}*/
start_write();
return 0;
}
If you don't synchronise reads and writes, a reader could read while a writer is writing, and read the data in a half-written state if the write operation is not atomic. So yes, synchronisation would be necessary to keep that from happening.
You surely need synchronization here . The simple reason being that there is a distinct possibility that data be in a inconsistent state when start_write is updating the information in the global array and one of your 3 threads try to read the same data from the global array .
What you quote is also incorrect . " must be a lock to protect data from accessed by reader and writer at the same time" should be "must be a lock to protect data from modified by reader and writer at the same time"
if the shared data is being modified by one of the threads and another thread is reading from it you need to use lock to protect it .
if the shared data is being accessed by two or more threads then you dont need to protect it .
It will work fine if the threads are just reading from global_array. printf should be fine since this does a single IO operation in append mode.
However, since the main thread calls start_write to update the global_array at the same time the other threads are in start_read then they are going to be reading the values in a very unpredictable manner. It depends highly on how the threads are implemented in the OS, how many CPUs/cores you have, etc.. This might work well on your dual core development box but then fail spectactuarly when you move to a 16 core production server.
For example, if the threads were not synchronizing, they might never see any updates to global_array in the right circumstances. Or some threads would see changes faster than others. It's all about the timing of when memory pages are flushed to central memory and when the threads see the changes in their caches. To ensure consistent results you need synchronization (memory barriers) to force the caches to the updated.
The general answer is you need some way to ensure/enforce necessary atomicity, so the reader doesn't see an inconsistent state.
A lock (done correctly) is sufficient but not always necessary. But in order to prove that it's not necessary, you need to be able to say something about the atomicity of the operations involved.
This involves both the architecture of the target host and, to some extent, the compiler.
In your example, you're writing a long to an array. In this case, the question is is the storage of a long atomic? It probably is, but it depends on the host. It's possible that the CPU writes out a portion of the long (upper/lower words/bytes) separately and thus the reader could get a value never written. (This is, I believe, unlikely on most modern CPU archs, but you'd have to check to be sure.)
It's also possible for there to be write buffering in the CPU. It's been a long time since I looked at this, but I believe it's possible to get store reordering if you don't have the necessary write barrier instructions. It's unclear from your example if you would be relying on this.
Finally, you'd probably need to flag the array as volatile (again, I haven't done this in a while so I'm rusty on the specifics) in order to ensure that the compiler doesn't make assumptions about the data not changing underneath it.
It depends on how much you care about portability.
At least on an actual Intel x86 processor, when you're reading/writing dword (32-bit) data that's also dword aligned, the hardware gives you atomicity "for free" -- i.e., without your having to do any sort of lock to enforce it.
Changing much of anything (up to an including compiler flags that might affect the data'a alignment) can break that -- but in ways that might remain hidden for a long time (especially if you have low contention over a particular data item). It also leads to extremely fragile code -- for example, switching to a smaller data type can break the code, even if you're only using a subset of the values.
The current atomic "guarantee" is pretty much an accidental side-effect of the way the cache and bus happen to be designed. While I'm not sure I'd really expect a change that broke things, I wouldn't consider it particularly far-fetched either. The only place I've seen documentation of this atomic behavior was in the same processor manuals that cover things like model-specific registers that definitely have changed (and continue to change) from one model of processor to the next.
The bottom line is that you really should do the locking, but you probably won't see a manifestation of the problem with your current hardware, no matter how much you test (unless you change conditions like mis-aligning the data).

Intermittent bugs - sometimes this code works and sometimes it doesn't!

This code intermittently works. It's running on a small microcontroller. It will work fine even after restarting the processor, but if I change some part of the code, it breaks. This makes me think that it's some kind of pointer bug or memory corruption. What's happening is the coordinate, p_res.pos.x is sometimes read as 0 (the incorrect value) and 96 (the correct value) when it is passed to write_circle_outlined. y seems to be correct most of the time. If anyone can spot anything obviously wrong please point it out!
int demo_game()
{
long int d;
int x, y;
struct WorldCamera p_viewer;
struct Point3D_LLA p_subj;
struct Point2D_CalcRes p_res;
p_viewer.hfov = 27;
p_viewer.vfov = 32;
p_viewer.width = 192;
p_viewer.height = 128;
p_viewer.p.lat = 51.26f;
p_viewer.p.lon = -1.0862f;
p_viewer.p.alt = 100.0f;
p_subj.lat = 51.20f;
p_subj.lon = -1.0862f;
p_subj.alt = 100.0f;
while(1)
{
fill_buffer(draw_buffer_mask, 0x0000);
fill_buffer(draw_buffer_level, 0xffff);
compute_3d_transform(&p_viewer, &p_subj, &p_res, 10000.0f);
x = p_res.pos.x;
y = p_res.pos.y;
write_circle_outlined(x, y, 1.0f / p_res.est_dist, 0, 0, 0, 1);
p_viewer.p.lat -= 0.0001f;
//p_viewer.p.alt -= 0.00001f;
d = 20000;
while(d--);
}
return 1;
}
The code for compute_3d_transform is:
void compute_3d_transform(struct WorldCamera *p_viewer, struct Point3D_LLA *p_subj, struct Point2D_CalcRes *res, float cliph)
{
// Estimate the distance to the waypoint. This isn't intended to replace
// proper lat/lon distance algorithms, but provides a general indication
// of how far away our subject is from the camera. It works accurately for
// short distances of less than 1km, but doesn't give distances in any
// meaningful unit (lat/lon distance?)
res->est_dist = hypot2(p_viewer->p.lat - p_subj->lat, p_viewer->p.lon - p_subj->lon);
// Save precious cycles if outside of visible world.
if(res->est_dist > cliph)
goto quick_exit;
// Compute the horizontal angle to the point.
// atan2(y,x) so atan2(lon,lat) and not atan2(lat,lon)!
res->h_angle = RAD2DEG(angle_dist(atan2(p_viewer->p.lon - p_subj->lon, p_viewer->p.lat - p_subj->lat), p_viewer->yaw));
res->small_dist = res->est_dist * 0.0025f; // by trial and error this works well.
// Using the estimated distance and altitude delta we can calculate
// the vertical angle.
res->v_angle = RAD2DEG(atan2(p_viewer->p.alt - p_subj->alt, res->est_dist));
// Normalize the results to fit in the field of view of the camera if
// the point is visible. If they are outside of (0,hfov] or (0,vfov]
// then the point is not visible.
res->h_angle += p_viewer->hfov / 2;
res->v_angle += p_viewer->vfov / 2;
// Set flags.
if(res->h_angle < 0 || res->h_angle > p_viewer->hfov)
res->flags |= X_OVER;
if(res->v_angle < 0 || res->v_angle > p_viewer->vfov)
res->flags |= Y_OVER;
res->pos.x = (res->h_angle / p_viewer->hfov) * p_viewer->width;
res->pos.y = (res->v_angle / p_viewer->vfov) * p_viewer->height;
return;
quick_exit:
res->flags |= X_OVER | Y_OVER;
return;
}
Structure for the results:
typedef struct Point2D_Pixel { unsigned int x, y; };
// Structure for storing calculated results (from camera transforms.)
typedef struct Point2D_CalcRes
{
struct Point2D_Pixel pos;
float h_angle, v_angle, est_dist, small_dist;
int flags;
};
The code is part of an open source project of mine so it's okay to post a lot of code here.
I see some of your calculation depends on p_viewer->yaw, but I do not see any intialization for p_viewer->yaw. Is this your problem?
A couple of things that seem sketchy:
You can return from compute_3d_transform without setting many of the fields in p_res/res but the caller never checks for this situation.
You consistently read from res->flags without initializing it first.
Whenever the output differs, it possibly means some value is not initialized and the outcome depends on the garbage value present in a variable. Keeping that in mind, I looked for uninitialized variables. the structure p_res is not initialized.
if(res->est_dist > cliph)
goto quick_exit;
that means if condition may turn out to be true or false depending on what garbage value is stored in res->est_dist. When if condition turns out to true, it goes straight to quick_exit label and doesn't update p_res.pos.x. If condition turned out to be false then its updated.
When I used to program C, I would use a divide and conquer debugging technique for this kind of problem to try to isolate the offending operation (paying attention to whether the symptoms change as debugging code is added, which is indicative of dangling pointer type bugs).
Essentially, start with the first line where the value is known to be good (and prove that it is consistently good at that line). Then identify where is it known to be bad. Then approx. halfway between the two points insert a test to see if it's bad. If not, then insert a test halfway between the mid-point and the known bad location, if it is bad then insert a test halfway between the mid-point and the known good location, and so on.
If the line identified is itself a function call, this process can be repeated in that called function, and so on.
When using this kind of approach, it's important to minimize the amount of added code and the artificial "noise", which can create timing changes.
Use this if you don't have (or can't use) an interactive debugger, or if the problem does not manifest when using one.

Resources