segfault from bad OpenCL 2.0 kernel - c

I'm trying to learn the new features of OpenCL 2.0, and I've created a small kernel in an attempt to demonstrate device-side enqueue. The kernel is below:
#pragma OPENCL EXTENSION cl_amd_printf : enable

__kernel void call_me(__global int *a);

__kernel void templateKernel(__global unsigned int * output,
                             __global unsigned int * input,
                             const unsigned int multiplier);

__kernel void call_me(__global int *a)
{
    //do nothing
    int id = get_global_id(0);
    //a[id] = b[id];
}

__kernel void templateKernel(__global unsigned int * output,
                             __global unsigned int * input,
                             const unsigned int multiplier)
{
    uint tid = get_global_id(0);
    int lid = get_local_id(0);
    int gid = get_group_id(0);
    int broadcast = 1;
    int global_size = get_global_size(0);

    if(gid == 0) {
        broadcast = work_group_broadcast(5, 0);
    }
    int collection = work_group_scan_exclusive_add(broadcast);

    void (^kernel_block)(void) = ^{call_me(input);};

    //output[tid] = input[tid] * multiplier + collection + broadcast;
    output[tid] = collection;
    //output[tid] = global_size;

    size_t size = 100;
    //printf("hey %d\n", broadcast);
    ndrange_t ndrange = ndrange_1D(size);
    queue_t default_queue = get_default_queue();
    /*
    if(tid == 0){
        int status = enqueue_kernel(
            default_queue,
            CLK_ENQUEUE_FLAGS_WAIT_KERNEL,
            ndrange,
            kernel_block
        );
    }
    */
}
This kernel is supposed to do nothing beyond making one successful device-side enqueue call that doesn't segfault the program. What's wrong with it? The segmentation fault goes away when the enqueue_kernel call is removed. My OpenCL C compiler is set to -cl-std=CL2.0 and is confirmed to be working, since the work_group_broadcast and work_group_scan_exclusive_add calls behave properly.
I'm using AMDAPPSDK 3.0 Beta. Any help is appreciated.

I have solved my own problem.
The issue was that in OpenCL 2.0, the API call to create command queues, clCreateCommandQueue(), has been deprecated. Instead, AMD suggests using the new API call clCreateCommandQueueWithProperties(), which can enable device-side queues for device-side kernel calls.
In addition to using the new API call, one must also create at least two command queues: one for the host side and one for the device side.
The device queue is created on the host, using the additional properties that come with the new API call.
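For reference, here is a minimal sketch of that host-side setup (context and device are assumed to be already created; the queue size is illustrative and error handling is omitted):

cl_int err;

/* Host-side queue: no special properties needed. */
cl_command_queue host_queue =
    clCreateCommandQueueWithProperties(context, device, NULL, &err);

/* Device-side queue: must be out-of-order and flagged as the on-device
   default queue so get_default_queue() in the kernel can find it. */
cl_queue_properties props[] = {
    CL_QUEUE_PROPERTIES,
    (cl_queue_properties)(CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE |
                          CL_QUEUE_ON_DEVICE | CL_QUEUE_ON_DEVICE_DEFAULT),
    CL_QUEUE_SIZE, 16 * 1024,   /* bytes reserved for device-side enqueues */
    0
};
cl_command_queue device_queue =
    clCreateCommandQueueWithProperties(context, device, props, &err);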

Related

Confused about Passing user data to PortAudio Callbacks

This is my first post here and I'm fairly new to programming, especially in C. A couple of weeks ago I started working through The Audio Programming Book (MIT Press) and have been expanding on some examples to try to understand things further.
I think my question lies with how I'm trying to pass data (retrieved from the user in an initialization function) to a PortAudio callback. I feel like what I've done isn't that different from the examples (both from the book and PortAudio's examples like paex_sine.c), but for some reason I can't get my code to work, and I've been banging my head against a wall trying to understand why. I've tried searching pretty extensively for solutions or example code to study, but I kind of don't know what I don't know, so that hasn't returned much.
How do I get user data into the callback?
Am I just not understanding how pointers and structs work and trying to force them to do things they don't want to?
Or, am I just overlooking something really obvious?
The following code either gives a really high pitched output, short high pitched blips, or no (audible) output:
#include <stdio.h>
#include <math.h>
#include "portaudio.h"

#define FRAME_BLOCK_LEN 64
#define SAMPLING_RATE 44100
#define TWO_PI (3.14159265f * 2.0f)

PaStream *audioStream;
double si = 0;

typedef struct
{
    float frequency;
    float phase;
}
paTestData;

int audio_callback (const void *inputBuffer, void *outputBuffer,
                    unsigned long framesPerBuffer,
                    const PaStreamCallbackTimeInfo* timeinfo,
                    PaStreamCallbackFlags statusFlags,
                    void *userData )
{
    paTestData *data = (paTestData*)userData;
    float *out = (float*)outputBuffer;
    unsigned long i;

    // data->frequency = 400;
    for(i = 0; i < framesPerBuffer; i++){
        si = TWO_PI * data->frequency / SAMPLING_RATE; // calculate sampling-incr
        *out++ = sin(data->phase);
        *out++ = sin(data->phase);
        data->phase += si; // add sampling-incr to phase
    }
    return paContinue;
}

void init_stuff()
{
    float frequency;
    int i;
    PaStreamParameters outputParameters;
    paTestData data;

    printf("type the modulator frequency in Hz: ");
    scanf("%f", &data.frequency); // get modulator frequency
    printf("you chose data.frequency %.2f\n", data.frequency);
    data.phase = 0.0;

    printf("initializing Portaudio. Please wait...\n");
    Pa_Initialize(); // initialize Portaudio

    outputParameters.device = Pa_GetDefaultOutputDevice(); /* default output device */
    outputParameters.channelCount = 2;                     /* stereo output */
    outputParameters.sampleFormat = paFloat32;             /* 32 bit floating point output */
    outputParameters.suggestedLatency = Pa_GetDeviceInfo( outputParameters.device )->defaultLowOutputLatency;
    outputParameters.hostApiSpecificStreamInfo = NULL;

    Pa_OpenStream( // open paStream object
        &audioStream,      // portaudio stream object
        NULL,              // input params
        &outputParameters, // output params
        SAMPLING_RATE,     // SampleRate
        FRAME_BLOCK_LEN,   // frames per buffer
        paNoFlag,          // set no Flag
        audio_callback,    // callback function address
        &data );           // user data

    Pa_StartStream(audioStream); // start the callback mechanism
    printf("running... press space bar and enter to exit\n");
}

void terminate_stuff()
{
    Pa_StopStream(audioStream);  // stop callback mechanism
    Pa_CloseStream(audioStream); // destroy audio stream object
    Pa_Terminate();              // terminate portaudio
}

int main(void)
{
    init_stuff();
    while(getchar() != ' ') Pa_Sleep(100);
    terminate_stuff();
    return 0;
}
Uncommenting data->frequency = 400; at least plays a 400 Hz sine wave, but that ignores any user input gathered in init_stuff().
If I put a printf("%f\n", data->frequency); inside the callback, it prints 0.000000 or something like -146730090609497866240.000000.
It's pretty unpredictable, and this really makes me think it's pointer related.
My goal for this code is to eventually incorporate envelope generators to change the pitch, and possibly wavetable oscillators so I'm not calculating sin(x) on every iteration.
I can get envelopes and wavetables to work with a blocking API like portsf, which the book uses, but trying to adapt any of that code from earlier chapters to use PortAudio callbacks is turning my brain to mush.
Thanks so much!
The problem you're having with your callback data is that it goes out of scope, and its memory is deallocated, as soon as init_stuff finishes execution.
You should allocate memory for your callback data with malloc and pass the resulting pointer to the callback.
For example:
void init_stuff()
{
    float frequency;
    int i;
    PaStreamParameters outputParameters;
    paTestData *data = (paTestData *) malloc(sizeof(paTestData));

    printf("type the modulator frequency in Hz: ");
    scanf("%f", &(data->frequency)); // get modulator frequency
    printf("you chose data.frequency %.2f\n", data->frequency);
    data->phase = 0.0;
    ...
    Pa_OpenStream( // open paStream object
        &audioStream,      // portaudio stream object
        NULL,              // input params
        &outputParameters, // output params
        SAMPLING_RATE,     // SampleRate
        FRAME_BLOCK_LEN,   // frames per buffer
        paNoFlag,          // set no Flag
        audio_callback,    // callback function address
        data );            // user data
...
I wasn't able to get the original code working using malloc, but based on both suggestions I realized another workable solution: since returning from init_stuff() caused my data to be deallocated, for now I simply make all my assignments and the call to Pa_OpenStream() from main.
Works beautifully, and I can now send whatever data I want to the callback. Thanks for the help!
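A minimal sketch of that arrangement, assuming init_stuff() is changed to take the data pointer (a hypothetical signature) and forward it to Pa_OpenStream():

/* Sketch: the callback data lives in main (static storage), so it stays
   valid for the whole life of the stream instead of dying with init_stuff. */
int main(void)
{
    static paTestData data;   /* not deallocated until program exit */

    printf("type the modulator frequency in Hz: ");
    scanf("%f", &data.frequency);
    data.phase = 0.0;

    init_stuff(&data);        /* assumed to pass &data on to Pa_OpenStream */
    while(getchar() != ' ') Pa_Sleep(100);
    terminate_stuff();
    return 0;
}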

How to assign threads to different cores in C?

I created a program that does the addition of 8 numbers using 4 threads, then takes the product of the results. How can I ensure that each thread uses a separate core for maximum performance gains? I am new to pthreads, so I really don't have any idea how to use them properly. Please keep answers as simple as possible.
My code:
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

int global[9];

void *sum_thread(void *arg)
{
    int *args_array;
    args_array = arg;
    int n1,n2,sum;
    n1=args_array[0];
    n2=args_array[1];
    sum = n1*n2;
    printf("N1 * N2 = %d\n",sum);
    return (void*) sum;
}

void *sum_thread1(void *arg)
{
    int *args_array;
    args_array = arg;
    int n3,n4,sum2;
    n3=args_array[2];
    n4=args_array[3];
    sum2=n3*n4;
    printf("N3 * N4 = %d\n",sum2);
    return (void*) sum2;
}

void *sum_thread2(void *arg)
{
    int *args_array;
    args_array = arg;
    int n5,n6,sum3;
    n5=args_array[4];
    n6=args_array[5];
    sum3=n5*n6;
    printf("N5 * N6 = %d\n",sum3);
    return (void*) sum3;
}

void *sum_thread3(void *arg)
{
    int *args_array;
    args_array = arg;
    int n8,n7,sum4;
    n7=args_array[6];
    n8=args_array[7];
    sum4=n7*n8;
    printf("N7 * N8 = %d\n",sum4);
    return (void*) sum4;
}

int main()
{
    int sum3,sum2,sum,sum4;
    int prod;

    // the input
    global[0]=9220; global[1]=1110; global[2]=1120; global[3]=2320;
    global[4]=5100; global[5]=6720; global[6]=7800; global[7]=9290;

    pthread_t tid_sum;
    pthread_create(&tid_sum,NULL,sum_thread,global);
    pthread_join(tid_sum,(void*)&sum);

    pthread_t tid_sum1;
    pthread_create(&tid_sum1,NULL,sum_thread1,global);
    pthread_join(tid_sum1,(void*)&sum2);

    pthread_t tid_sum2;
    pthread_create(&tid_sum2,NULL,sum_thread2,global);
    pthread_join(tid_sum2,(void*)&sum3);

    pthread_t tid_sum3;
    pthread_create(&tid_sum3,NULL,sum_thread3,global);
    pthread_join(tid_sum3,(void*)&sum4);

    prod=sum+sum2+sum3+sum4;
    printf("The sum of the products is: %d", prod);
    return 0;
}
You don't have to, don't want to, and mustn't manage hardware resources at such a low level (I don't know whether you somehow could). That's a job for your OS, and partially for the standard libraries: they have been tested, optimized, and standardized properly.
I doubt you can do better. If you could pin threads to cores by hand, either you are an expert hardware/OS programmer or you are destroying decades of work :) .
Also consider this fact: your code would no longer be portable if you indexed the cores manually, since it would depend on the number of cores of your machine.
On the other hand, multithreaded programs should work (and sometimes even better) even with only one core. An example is the case where one of the threads does nothing until an event happens: you can make that thread go to "sleep" so that only the other threads use the CPU; when the event happens, it resumes execution. In a non-multithreaded program, polling is generally used instead, which burns CPU resources doing nothing.
Also, as @yano said, your multithreaded program is not really parallel in this case, since you create each thread and then wait for it to finish with pthread_join before starting the next one; a sketch of the parallel version follows.
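Here is that sketch (it replaces the create/join sequence in main and uses intptr_t from <stdint.h> for the cast); creating all four threads before joining any of them lets them actually run concurrently:

/* Sketch: start every worker first, then join, so the four products
   are computed in parallel instead of one after another. */
pthread_t tid[4];
void *(*workers[4])(void *) = { sum_thread, sum_thread1, sum_thread2, sum_thread3 };
int results[4];

for (int i = 0; i < 4; i++)
    pthread_create(&tid[i], NULL, workers[i], global);

for (int i = 0; i < 4; i++) {
    void *ret;
    pthread_join(tid[i], &ret);
    results[i] = (int)(intptr_t)ret;   /* the threads return ints cast to void* */
}
prod = results[0] + results[1] + results[2] + results[3];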

Audio samples producer multiple threads OSX

This question is a follow-up to a former question (Audio producer threads with OSX AudioComponent consumer thread and callback in C), including a test example, which works and behaves as expected but does not quite answer the question. I have substantially rephrased the question and re-coded the example so that it contains only plain-C code. (I've found that the few Objective-C portions of code in the former example only caused confusion and distracted the reader from what's essential in the question.)
In order to take advantage of multiple processor cores, as well as to make the CoreAudio pull-model render thread as lightweight as possible, the LPCM samples' producer routine clearly has to "sit" on a different thread, outside the real-time-priority render thread/callback. It must feed the samples to a circular buffer (TPCircularBuffer in this example), from which the system schedules data pull-outs in quanta of inNumberFrames.
The Grand Central Dispatch API offers a simple solution, which I've deduced through some individual research (including trial-and-error coding). This solution is elegant, since it neither blocks anything nor creates conflicts between the push and pull models. Yet GCD, which is supposed to take care of "sub-threading", by far does not meet the specific parallelization requirements for the producer code's work threads, so I had to explicitly spawn a number of POSIX threads, depending on the number of logical cores available. Although the results are already remarkable in terms of speeding up the computation, I still feel a bit uncomfortable mixing POSIX and GCD. In particular this goes for the variable wait_interval, and for computing it properly rather than by predicting how many PCM samples the render thread may require for the next cycle.
Here's the shortened and simplified (pseudo)code for my test program, in plain-C.
Controller declaration:
#include "TPCircularBuffer.h"
#include <AudioToolbox/AudioToolbox.h>
#include <AudioUnit/AudioUnit.h>
#include <dispatch/dispatch.h>
#include <sys/sysctl.h>
#include <pthread.h>
typedef struct {
TPCircularBuffer buffer;
AudioComponentInstance toneUnit;
Float64 sampleRate;
AudioStreamBasicDescription streamFormat;
Float32* f; //array of updated frequencies
Float32* a; //array of updated amps
Float32* prevf; //array of prev. frequencies
Float32* preva; //array of prev. amps
Float32* val;
int* arg;
int* previous_arg;
UInt32 frames;
int state;
Boolean midif; //wip
} MyAudioController;
MyAudioController gen;
dispatch_semaphore_t mSemaphore;
Boolean multithreading, NF;
typedef struct data{
int tid;
int cpuCount;
}data;
Controller management:
void setup (void){
    // Initialize circular buffer
    TPCircularBufferInit(&(gen.buffer), kBufferLength);
    // Create the semaphore
    mSemaphore = dispatch_semaphore_create(0);
    // Setup audio
    createToneUnit(&gen);
}

void dealloc (void) {
    // Release buffer resources
    TPCircularBufferCleanup(&(gen.buffer));
    // Clean up semaphore
    dispatch_release(mSemaphore);
    // dispose of audio
    if(gen.toneUnit){
        AudioOutputUnitStop(gen.toneUnit);
        AudioUnitUninitialize(gen.toneUnit);
        AudioComponentInstanceDispose(gen.toneUnit);
    }
}
Dispatcher call (launching producer queue from the main thread):
void dproducer (Boolean on, Boolean multithreading, Boolean NF)
{
    if (on == true)
    {
        dispatch_async(dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_HIGH, 0), ^{
            if((multithreading)||(NF))
                producerSum(on);
            else
                producer(on);
        });
    }
    return;
}
Threadable producer routine:
void producerSum(Boolean on)
{
    int rc;
    int num = getCPUnum();
    pthread_t threads[num];
    data thread_args[num];
    void* resulT;
    static Float32 frames [FR_MAX];
    Float32 wait_interval;
    int bytesToCopy;
    Float32 floatmax;
    int32_t availableBytes;   // declaration missing in the original listing
    Float32 **fbuffW;         // per-thread work buffers (declaration missing)

    while(on){
        wait_interval = FACT*(gen.frames)/(gen.sampleRate);
        Float32 damp = 1./(Float32)(gen.frames);
        bytesToCopy = gen.frames*sizeof(Float32);
        memset(frames, 0, FR_MAX*sizeof(Float32));
        availableBytes = 0;
        fbuffW = (Float32**)calloc(num + 1, sizeof(Float32*));

        for (int i=0; i<num; ++i)
        {
            fbuffW[i] = (Float32*)calloc(gen.frames, sizeof(Float32));
            thread_args[i].tid = i;
            thread_args[i].cpuCount = num;
            rc = pthread_create(&threads[i], NULL, producerT, (void *) &thread_args[i]);
        }
        for (int i=0; i<num; ++i) rc = pthread_join(threads[i], &resulT);

        for(UInt32 samp = 0; samp < gen.frames; samp++)
            for(int i = 0; i < num; i++)
                frames[samp] += fbuffW[i][samp];

        //code for managing producer state and GUI updates
        { ... }

        float *head = TPCircularBufferHead(&(gen.buffer), &availableBytes);
        memcpy(head, (const void*)frames, MIN(bytesToCopy, availableBytes)); //copies frames to head
        TPCircularBufferProduce(&(gen.buffer), MIN(bytesToCopy, availableBytes));

        dispatch_semaphore_wait(mSemaphore, dispatch_time(DISPATCH_TIME_NOW, wait_interval * NSEC_PER_SEC));

        if(gen.state == stopped){ gen.state = idle; on = false; }

        for(int i = 0; i <= num; i++)
            free(fbuffW[i]);
        free(fbuffW);
    }
    return;
}
A single producer thread may look somewhat like this:
void *producerT (void *TN)
{
    Float32 samples[FR_MAX];
    data threadData;
    threadData = *((data *)TN);
    int tid = threadData.tid;
    int step = threadData.cpuCount;
    int *ret = calloc(1,sizeof(int));

    do_something(tid, step, &samples);
    { … }
    return (void*)ret;
}
Here is the render callback (CoreAudio real-time consumer thread):
static OSStatus audioRenderCallback(void *inRefCon,
                                    AudioUnitRenderActionFlags *ioActionFlags,
                                    const AudioTimeStamp *inTimeStamp,
                                    UInt32 inBusNumber,
                                    UInt32 inNumberFrames,
                                    AudioBufferList *ioData) {
    MyAudioController *THIS = (MyAudioController *)inRefCon;

    // An event happens in the render thread- signal whoever is waiting
    if (THIS->state == active) dispatch_semaphore_signal(mSemaphore);

    // Mono audio rendering: we only need one target buffer
    const int channel = 0;
    Float32* targetBuffer = (Float32 *)ioData->mBuffers[channel].mData;
    memset(targetBuffer, 0, inNumberFrames*sizeof(Float32));

    // Pull samples from circular buffer
    int32_t availableBytes;
    Float32 *buffer = TPCircularBufferTail(&THIS->buffer, &availableBytes);

    //copy circularBuffer content to target buffer
    int bytesToCopy = ioData->mBuffers[channel].mDataByteSize;
    memcpy(targetBuffer, buffer, MIN(bytesToCopy, availableBytes));
    { … };
    TPCircularBufferConsume(&THIS->buffer, availableBytes);
    THIS->frames = inNumberFrames;
    return noErr;
}
Grand Central Dispatch already takes care of dispatching operations to multiple processor cores and threads. In typical real-time audio rendering or processing, one never needs to wait on a signal or semaphore, since the circular buffer consumption rate is very predictable and drifts extremely slowly over time. The AVAudioSession API (if available) and the Audio Unit API and callback allow you to set and determine the callback buffer size, and thus the maximum rate at which the circular buffer can change. Thus you can dispatch all render operations on a timer, render exactly the number of samples needed per timer period, and let the buffer size and state compensate for any jitter in thread dispatch time.
For extremely long-running audio renders, you might want to measure the drift between timer operations and real-time audio consumption (sample rate), and tweak the number of samples rendered or the timer offset.
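A minimal sketch of that timer-driven approach, in plain C with GCD (kRenderPeriodNs and renderOnePeriod() are illustrative names, not part of the original code):

/* Sketch: a GCD timer replaces the semaphore wait; each tick produces exactly
   one callback period's worth of samples into the circular buffer. */
static const uint64_t kRenderPeriodNs = 1451247;   /* 64 frames @ 44.1 kHz */

dispatch_source_t timer =
    dispatch_source_create(DISPATCH_SOURCE_TYPE_TIMER, 0, 0,
                           dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_HIGH, 0));
dispatch_source_set_timer(timer,
                          dispatch_time(DISPATCH_TIME_NOW, 0),
                          kRenderPeriodNs,        /* interval */
                          kRenderPeriodNs / 10);  /* leeway */
dispatch_source_set_event_handler(timer, ^{
    renderOnePeriod(&gen);  /* hypothetical: fill gen.frames samples into gen.buffer */
});
dispatch_resume(timer);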

User-threaded scheduling API on mac OSX using ucontext & signals

I'm designing a scheduling algorithm that has the following features:
Have 2 user-threads (contexts) in one process (I'm supposed to do 3 threads, but that didn't work on OSX yet, so I decided to make 2 work for now).
Preemptive switching using a SIGALRM signal that goes off every 1 second and transfers control from one context to another, saving the current state (registers and current position) of the context that was running before the switch.
What I have noticed is the following:
The ucontext.h library behaves strangely on Mac OSX, whereas on Linux it behaves exactly the way it is supposed to (the example from this man page: http://man7.org/linux/man-pages/man3/makecontext.3.html works perfectly as intended on Linux, whereas on Mac it fails with a segmentation fault before it does any swapping). Unfortunately, I have to make it run on OSX, not Linux.
I managed to work around the swapcontext error on OSX by using getcontext() and then setcontext() to do the swapping of contexts.
In my signal handler function I use sa_sigaction( int sig, siginfo_t *s, void * cntxt ), since the 3rd argument, once recast as a ucontext_t pointer, holds the information about the context that was interrupted (which is true on Linux, as I tested), but on Mac it doesn't point to the proper location, and when I use it I get a segmentation fault yet again.
I have designed the test functions for each context to loop inside a while loop, since I want to interrupt them and make sure they go back to executing at the proper location within that function. I have defined a static global count variable that helps me see whether I was in the proper user-thread or not.
One last note: I found out that calling getcontext() inside the while loop within the test functions constantly updates the position of my current context (since it is an empty while loop), and therefore calling setcontext() when that context's turn comes makes it execute from the proper place. This workaround is not viable, though, since these functions will be provided from outside the API.
#include <stdio.h>
#include <sys/ucontext.h>
#include <string.h>
#include <stdlib.h>
#include <stdint.h>
#include <stdbool.h>
#include <errno.h>
#include <signal.h> // for sigaction/siginfo_t (missing in the original listing)

/*****************************************************************************/
/* time-utility */
/*****************************************************************************/
#include <sys/time.h> // struct timeval

void timeval_add_s( struct timeval *tv, uint64_t s ) {
    tv->tv_sec += s;
}

void timeval_diff( struct timeval *c, struct timeval *a, struct timeval *b ) {
    // use signed variables
    long aa;
    long bb;
    long cc;

    aa = a->tv_sec;
    bb = b->tv_sec;
    cc = aa - bb;
    cc = cc < 0 ? -cc : cc;
    c->tv_sec = cc;

    aa = a->tv_usec;
    bb = b->tv_usec;
    cc = aa - bb;
    cc = cc < 0 ? -cc : cc;
    c->tv_usec = cc;
out:
    return;
}

/*****************************************************************************/
/* Variables */
/*****************************************************************************/
static int count;

/* For now only the T1 & T2 are used */
static ucontext_t T1, T2, T3, Main, Main_2;
ucontext_t *ready_queue[ 4 ] = { &T1, &T2, &T3, &Main_2 };
static int thread_count;
static int current_thread;

/* timer struct */
static struct itimerval a;
static struct timeval now, then;

/* SIGALRM struct */
static struct sigaction sa;

#define USER_THREAD_SWICTH_TIME 1
static int check;
/*****************************************************************************/
/* signals */
/*****************************************************************************/
void handle_schedule( int sig, siginfo_t *s, void * cntxt ) {
    ucontext_t * temp_current = (ucontext_t *) cntxt;

    if( check == 0 ) {
        check = 1;
        printf("We were in main context user-thread\n");
    } else {
        ready_queue[ current_thread - 1 ] = temp_current;
        printf("We were in User-Thread # %d\n", count );
    }
    if( current_thread == thread_count ) {
        current_thread = 0;
    }
    printf("---------------------------X---------------------------\n");
    setcontext( ready_queue[ current_thread++ ] );
out:
    return;
}

/* initializes the signal handler for SIGALRM, sets all the values for the alarm */
static void start_init( void ) {
    int r;

    sa.sa_sigaction = handle_schedule;
    sigemptyset( &sa.sa_mask );
    sa.sa_flags = SA_SIGINFO;
    r = sigaction( SIGALRM, &sa, NULL );
    if( r == -1 ) {
        printf("Error: cannot handle SIGALRM\n");
        goto out;
    }
    gettimeofday( &now, NULL );
    timeval_diff( &( a.it_value ), &now, &then );
    timeval_add_s( &( a.it_interval ), USER_THREAD_SWICTH_TIME );
    setitimer( ITIMER_REAL, &a, NULL );
out:
    return;
}
/*****************************************************************************/
/* Thread Init */
/*****************************************************************************/
static void thread_create( void * task_func(void), int arg_num, int task_arg ) {
    ucontext_t* thread_temp = ready_queue[ thread_count ];

    getcontext( thread_temp );
    thread_temp->uc_link = NULL;
    thread_temp->uc_stack.ss_size = SIGSTKSZ;
    thread_temp->uc_stack.ss_sp = malloc( SIGSTKSZ );
    thread_temp->uc_stack.ss_flags = 0;

    if( arg_num == 0 ) {
        makecontext( thread_temp, task_func, arg_num );
    } else {
        makecontext( thread_temp, task_func, arg_num, task_arg );
    }
    thread_count++;
out:
    return;
}

/*****************************************************************************/
/* Testing Functions */
/*****************************************************************************/
void thread_funct( int i ) {
    printf( "---------------------------------This is User-Thread #%d--------------------------------\n", i );
    while(1) { count = i; } //getcontext( ready_queue[ 0 ] );}
out:
    return;
}

void thread_funct_2( int i ) {
    printf( "---------------------------------This is User-Thread #%d--------------------------------\n", i );
    while(1) { count = i; } //getcontext( ready_queue[ 1 ] ); }
out:
    return;
}

/*****************************************************************************/
/* Main Functions */
/*****************************************************************************/
int main( void ) {
    int r;

    gettimeofday( &then, NULL );
    thread_create( (void *)thread_funct, 1, 1);
    thread_create( (void *)thread_funct_2, 1, 2);
    start_init();
    while(1);
    printf( "completed\n" );
out:
    return 0;
}
What am I doing wrong here? I have to change this around a bit to run it properly on Linux, and running the version that works on Linux on OSX causes a segmentation fault, but why would it work on one OS and not the other?
Is this related by any chance to the stack size I allocate in each context?
Am I supposed to allocate stack space for my signal? (It says that if I don't, a default stack is used, and if I do, it doesn't really make a difference.)
If the use of ucontext will never give predictable behavior on Mac OSX, then what is the alternative for implementing user-threading on OSX? I tried using setjmp & longjmp, but I run into the same issue: when a context is interrupted in the middle of executing a certain function, how can I get the exact position where that context got interrupted, in order to continue where I left off next time?
So after days of testing and debugging I finally got this. I had to dig deep into the implementation of ucontext.h and found differences between the two OSes. It turns out that OSX's implementation of ucontext.h is different from Linux's. For instance, the mcontext_t struct within the ucontext_t struct, which usually holds the values of the registers (IP, SP, BP, general registers...) of each context, is declared as a pointer on OSX, whereas on Linux it is not.
A couple of other things needed to be set as well, especially the context's stack pointer (rsp) register, the base pointer (rbp) register, the instruction pointer (rip) register, the destination index (rdi) register... All of these had to be set correctly at the beginning/creation of each context, as well as after it returns for the first time. I also had to create an mcontext struct to hold these registers and have my ucontext_t struct's uc_mcontext pointer point to it.
After all that was done, I was able to use the ucontext_t pointer passed as an argument to the sa_sigaction signal handler function (after recasting it to ucontext_t) to resume exactly where the context left off last time. Bottom line: it was a messy affair. Anyone interested in more details can msg me. JJ out.
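As an illustration of that uc_mcontext difference, here is a sketch only (assuming an x86-64 macOS SDK where mcontext_t is a pointer type; field names vary across SDK versions):

/* Sketch: on OSX, uc_mcontext is a pointer, so each context needs its own
   backing register storage before it can be saved or restored. */
#include <ucontext.h>

static _STRUCT_MCONTEXT mctx[4];   /* one register block per context */

static void attach_mcontext(ucontext_t *uc, int i) {
    uc->uc_mcontext = &mctx[i];    /* on Linux uc_mcontext is an inline struct */
    uc->uc_mcsize = sizeof(_STRUCT_MCONTEXT);
}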

OpenCL kernel argument struct has zero values

I'm having several problems with OpenCL (total noob), but I think that if I manage to solve this one I will be able to solve some of the others. I have the following kernel, which is supposed to store in a double array a number calculated from the data of a struct. The argument that I pass to the kernel is a struct array; it is initialised, and the values are non-zero (I tested it).
When executing the kernel, though, I get a "Floating point exception". If I've got it right, this means that the local_density variable is zero and the division causes an error. What I don't get is why it is zero, since on the host the values are non-zero. Am I doing something wrong in the kernel?
#pragma OPENCL EXTENSION cl_khr_fp64 : enable

typedef struct
{
    double speeds[9];
} t_speed;

__kernel void prepare(__global const t_speed* cells,
                      __global const int* obstacles,
                      __global double* results,
                      const unsigned int count)
{
    int pos = get_global_id(0);
    if(pos >= count) return;

    if(obstacles[pos] == 1) results[pos] = 0.00;
    else
    {
        double local_density = 0.00;
        for(int kk = 0; kk < 9; kk++)
            local_density += cells[pos].speeds[kk];

        results[pos] = (cells[pos].speeds[1] + cells[pos].speeds[5] +
                        cells[pos].speeds[8] - (cells[pos].speeds[3] +
                        cells[pos].speeds[6] + cells[pos].speeds[7])) /
                       local_density;
    }
}
Here is also the initialization of the variable that I pass as an argument; params->ny and params->nx have correct values.
cells = (t_speed*) malloc(sizeof(t_speed) * (params->ny * params->nx));
Here is also the kernel-argument setup for the cells variable.
m_cells = clCreateBuffer(context, CL_MEM_READ_ONLY, sizeof(t_speed) * count, NULL, NULL);
err = clEnqueueWriteBuffer(commands, m_cells, CL_TRUE, 0, sizeof(t_speed) * count, cells, 0, NULL, NULL);
err |= clSetKernelArg(av_velocity_prepare_kernel, 0, sizeof(cl_mem), &m_cells);
------------------------------------------ EDIT ------------------------------------------
OK, what is really weird is that I'm getting the same error (Floating point exception) even with the very simple kernel below. Does anyone have a clue?
#pragma OPENCL EXTENSION cl_khr_fp64 : enable

__kernel void test(__global float* result, const unsigned int n)
{
    int i = get_global_id(0);
    if(i >= n) return;
    result[i] += 1.0f;
}
I noticed that you are declaring your buffer as CL_MEM_READ_ONLY, yet you are writing to it inside the kernel. According to the OpenCL spec, this is undefined. Try using CL_MEM_READ_WRITE instead.
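For example, a minimal sketch (m_results is a hypothetical name for the buffer the kernel writes its results into; err is assumed declared as in the question):

/* Sketch: the buffer written by the kernel gets read-write access. */
cl_mem m_results = clCreateBuffer(context, CL_MEM_READ_WRITE,
                                  sizeof(cl_double) * count, NULL, &err);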
OK, so it was a completely different thing than I thought. The problem was that when I called
clEnqueueNDRangeKernel(command_queue, kernel, work_dim, global_work_offset,
                       global_work_size, local_work_size, num_events_in_wait_list,
                       event_wait_list, event)
the global_work_size was not divisible by local_work_size. That caused the Floating point exception.
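One common fix, sketched under the assumption of a one-dimensional kernel that keeps the if(i >= n) return; guard shown above (the work-group size of 64 is illustrative):

/* Sketch: round the global size up to the next multiple of the local size;
   the kernel's early-return guard discards the padded work-items. */
size_t local_work_size = 64;   /* assumed work-group size */
size_t global_work_size =
    ((count + local_work_size - 1) / local_work_size) * local_work_size;

err = clEnqueueNDRangeKernel(commands, kernel, 1, NULL,
                             &global_work_size, &local_work_size,
                             0, NULL, NULL);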
