Problematic while loop in OpenCL kernel: Execution hangs

I wrote an OpenCL kernel that generates random numbers inside a while loop in the device. Once an acceptable random number is obtained, the kernel should exit the loop and give the result back to the host. Typically, the
number of iterations per workitem is ~100-1000.
The problem is that this code hangs when I enable the while loop and never returns a result. If I just disable the while loop–i.e. generating only one random number instead of 100s–the kernel works fine.
Does anybody have any idea what might be going on? The kernel code is below and also available at this github repo. One possibility is that the system (macOS in my case) prevents the GPU from spending a long time on a single task, as described here, but I am not sure.
#include <clRNG/mrg31k3p.clh> // for random number generation
#include "exposure.clh" // defines function exposure
__kernel void cr(__global clrngMrg31k3pHostStream* streams, __global float* xa, __global float* ya, const int n) {
int i = get_global_id(0);
float x,y,sampling;
if (i<n) {
// Loop that produces individual CRs
while (1) {
clrngMrg31k3pStream private_stream_d; // This is not a pointer!
clrngMrg31k3pCopyOverStreamsFromGlobal(1, &private_stream_d, &streams[i]);
// random number between 0 and 360
x=360.*clrngMrg31k3pRandomU01(&private_stream_d);
// random number between 0 and 1
y=clrngMrg31k3pRandomU01(&private_stream_d);
// To avoid concentrations towards the poles, generates sin(delta)
// between -1 and +1, then converts to delta
y = asin((float)(2.*y-1.))*180./M_PI_F; // dec
// If sampling<exposure for a given CR, it is accepted
sampling=clrngMrg31k3pRandomU01(&private_stream_d);
if (sampling <= exposure(y)) {
xa[i]=x;
ya[i]=y;
break;
}
}
}
}

You are re-creating the random stream over and over again; perhaps it always creates the same output, which is why your while loop never terminates. Try creating the random stream above your loop that pulls from it.
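The effect described in this answer can be reproduced on the CPU. The following is a minimal sketch in plain C, not clRNG code: `Stream`, `stream_u01` and `draws_until_accept` are hypothetical stand-ins, and the small linear congruential generator only plays the role of the clRNG stream. Resetting the state inside the loop mimics calling clrngMrg31k3pCopyOverStreamsFromGlobal on every iteration.

```c
#include <stdint.h>

/* Hypothetical stand-in for a clRNG stream: a tiny LCG, for illustration only. */
typedef struct { uint32_t state; } Stream;

/* Advances the stream and returns a value in [0, 1). */
static double stream_u01(Stream *s)
{
    s->state = s->state * 1664525u + 1013904223u;
    return s->state / 4294967296.0;
}

/* Draws until a value <= threshold is found or max_iter is reached.
   reseed_each_iter != 0 mimics copying the stream inside the loop.
   Returns the number of iterations used, or -1 if the cap was hit. */
int draws_until_accept(uint32_t seed, double threshold, int max_iter,
                       int reseed_each_iter)
{
    Stream s = { seed };
    for (int it = 1; it <= max_iter; it++) {
        if (reseed_each_iter)
            s.state = seed;  /* same starting state, so the same draw every time */
        if (stream_u01(&s) <= threshold)
            return it;
    }
    return -1;
}
```

With the reset inside the loop, a rejected first draw is rejected forever, and only the iteration cap ends the loop. With the stream set up once before the loop, the draws differ and an acceptable value eventually appears. In the kernel, the equivalent change would be to move the two stream-setup lines above the while loop.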

Related

Why am I getting huge slowdown when parallelising with OpenMP and using static scheduling?

I'm working to parallelise a disease spread model in c using OpenMP but am only seeing massive (order of magnitude) slowdown. I'll point out at the outset that I am a complete novice with both OpenMP and c.
The code loops over every point in the simulation and checks its status (susceptible, infected, recovered) and for each status, follows an algorithm to determine its status at the next time step.
I'll give the loop for infected points for illustrative purposes. Lpoints is a list of indices for points in the simulation, Nneigh gives the number of neighbours each point has and Lneigh gives the indices of these neighbours.
for (ipoint=0;ipoint<Nland;ipoint++) { //loop over all points
if (Lpoints_old[ipoint]==I) { //act on infected points
/* Probability Prec of infected population recovering */
xi = genrand();
if (xi<Pvac) { /* This point recovers (I->R) */
Lpoints[ipoint] = R;
/* printf("Point %d gained immunity\n",ipoint); */
}
else {
/* Probability of being blockaded by neighbours */
nsn = 0;
for (in=0;in<Nneigh[ipoint];in++) { /*count susceptible neighbours (nsn)*/
//if (npoint<0) printf("Bad npoint 1: %d in=%d\n",ipoint,in);
//fflush(stdout);
npoint = Lneigh[ipoint][in];
if (Lpoints_old[npoint]==S) nsn++;
}
Prob = (double)nsn*Pblo;
xi = genrand();
if (xi<Prob) { /* The population point is blockaded (I->R)*/
Lpoints[ipoint] = R;
}
else { /* Still infected */
Lpoints[ipoint] = I;
}
} /*else*/
} /*infected*/
} /*for*/
I tried to parallelise by adding #pragma omp parallel for default(shared) private(ipoint,xi,in,npoint,nsn,Prob) before the for loop. (I tried using default(none), as is generally recommended, but it wouldn't compile.) On the small grid I am using for testing, the original serial code runs in about 5 seconds and the OpenMP version runs in around 50.
I have searched for ages online and every similar problem seems to be the result of false cache sharing and has been solved by using static scheduling with a chunk size divisible by 8. I tried varying the chunk size to no effect whatsoever, only getting the timings to the original order when the chunk size surpassed the size of the problem (i.e. back to linearly carrying out on one thread.)
Slowdown doesn't seem any better when the problem is more appropriately scaled as far as I can tell either. I have no idea why this isn't working and what's going wrong. Any help greatly appreciated.
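The question does not show genrand(), so the following is only a guess at one possible cause: if genrand() updates a single shared generator state, every thread contends for (and races on) that state. A minimal sketch of the infected-point loop that avoids any shared generator is given below. The names `rand01` and `step_infected`, the stateless hash-based generator, and the `int **`-style layout of the neighbour list are all assumptions made for illustration, not the original code.

```c
#include <stdint.h>

enum { S, I, R };  /* point statuses, named as in the question */

/* Hypothetical stateless generator: hashes (seed, point, draw) to [0, 1).
   It is not the question's genrand(). */
static double rand01(uint64_t seed, uint64_t point, uint64_t draw)
{
    uint64_t z = seed + 0x9E3779B97F4A7C15ULL * (2 * point + draw + 1);
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    z ^= z >> 31;
    return (double)(z >> 11) / 9007199254740992.0;  /* divide by 2^53 */
}

/* One time step for infected points only; old[] is read, next[] is written. */
void step_infected(const int *old, int *next, int n,
                   const int *nneigh, const int *const *lneigh,
                   double Pvac, double Pblo, uint64_t seed)
{
    /* Variables declared inside the loop body are private to each thread. */
    #pragma omp parallel for schedule(static)
    for (int ip = 0; ip < n; ip++) {
        if (old[ip] != I)
            continue;                  /* only infected points are handled here */
        if (rand01(seed, (uint64_t)ip, 0) < Pvac) {
            next[ip] = R;              /* recovers (I->R) */
            continue;
        }
        int nsn = 0;                   /* number of susceptible neighbours */
        for (int in = 0; in < nneigh[ip]; in++)
            if (old[lneigh[ip][in]] == S)
                nsn++;
        double prob = (double)nsn * Pblo;
        next[ip] = (rand01(seed, (uint64_t)ip, 1) < prob) ? R : I;
    }
}
```

The seed would need to change from one time step to the next, otherwise each point sees the same draws every step. Whether this removes the slowdown depends on what genrand() actually does, so profiling the original code is still the first thing to try.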

OpenCL - Local Memory

I understand the difference between global and local memory in general.
But I have problems using local memory.
1) What has to be considered when converting global-memory variables to local-memory variables?
2) How do I use local barriers?
Maybe someone can help me with a small example.
I tried to do a Jacobi computation using local memory, but I only get 0 as the result. Maybe someone can give me some advice.
Working Solution:
#define IDX(_M,_i,_j) (_M)[(_i) * N + (_j)]
#define U(_i, _j) IDX(uL, _i, _j)
__kernel void jacobi(__global VALUE* u, __global VALUE* f, __global VALUE* tmp, VALUE factor) {
int i = get_global_id(0);
int j = get_global_id(1);
int iL = get_local_id(0);
int jL = get_local_id(1);
__local VALUE uL[(N+2)*(N+2)];
__local VALUE fL[(N+2)*(N+2)];
IDX(uL, iL, jL) = IDX(u, i, j);
IDX(fL, iL, jL) = IDX(f, i, j);
barrier(CLK_LOCAL_MEM_FENCE);
IDX(tmp, i, j) = (VALUE)0.25 * ( U(iL-1, jL) + U(iL, jL-1) + U(iL, jL+1) + U(iL+1, jL) - factor * IDX(fL, iL, jL));
}
Thanks.
1) Query the CL_DEVICE_LOCAL_MEM_SIZE value; it is 16 kB at minimum and larger on some hardware. If your local variables fit in this and are reused many times, you should copy them into local memory before use. Even if you don't, the GPU's automatic use of its L2 cache for global memory accesses can still keep the cores well utilized.
If the global-to-local copy takes a significant share of the time, you can do an asynchronous work-group copy (async_work_group_copy) while the cores are calculating.
Another important point: more free local memory means more concurrent threads per core. If a GPU has 64 cores per compute unit, only 64 threads can run when all local memory is used. When more is free, 128, 192, ... 2560 threads can run at the same time, provided there are no other limitations.
A profiler can show the bottlenecks, so you can judge whether it is worth a try.
For example, a naive matrix-matrix multiplication using nested loops relies on the L1 and L2 caches, but submatrices can fit in local memory. Maybe 48x48 submatrices of floats fit in the compute unit of a mid-range graphics card and can be reused N times in the calculation before being replaced by the next submatrix.
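Whether such tiles fit is simple arithmetic. Here is a small sketch in host-side C; `tile_bytes` and `tiles_fit` are hypothetical helper names, and the local memory size passed in would normally come from querying CL_DEVICE_LOCAL_MEM_SIZE with clGetDeviceInfo.

```c
#include <stddef.h>

/* Bytes needed for `tiles` square float tiles of side `dim`. */
size_t tile_bytes(size_t dim, size_t tiles)
{
    return dim * dim * sizeof(float) * tiles;
}

/* Nonzero if the tiles fit in a local memory of local_mem_bytes. */
int tiles_fit(size_t dim, size_t tiles, size_t local_mem_bytes)
{
    return tile_bytes(dim, tiles) <= local_mem_bytes;
}
```

With 4-byte floats, one 48x48 tile needs 9216 bytes. A tiled multiplication usually keeps one tile of each input matrix, so 18432 bytes, which does not fit in the 16 kB minimum mentioned above but would fit in, say, 32 kB.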
Querying CL_DEVICE_LOCAL_MEM_TYPE can return LOCAL or GLOBAL; if it returns GLOBAL, using local memory is not recommended.
Lastly, the size of any memory allocation (except __private) must be known at compile time (for the device, not the host), because the compiler must know how many wavefronts can be issued to achieve maximum performance (and perhaps for other compiler optimizations). That is why recursive functions are not allowed in OpenCL 1.2. But you can copy a function and rename it n times to get pseudo-recursion.
2) A barrier is a meeting point for all threads in a work-group. Similar to a cyclic barrier, they all stop there and wait for the rest before continuing. If it is a local barrier, all work-group threads finish their local memory operations before leaving that point. If you want to write some numbers 1, 2, 3, 4, ... to a local array, you can't be sure whether all threads have written their numbers until a local barrier is passed; after it, the array is certain to hold the final values.
All work-group threads must hit the same barrier. If one cannot reach it, the kernel hangs or you get an error.
__local int localArray[64]; // not each thread. For all threads.
// per compute unit.
if(localThreadId!=0)
localArray[localThreadId]=localThreadId; // 64 values written in O(1)
// not sure if 2nd thread done writing, just like last thread
if(localThreadId==0) // 1st core of each compute unit loads from VRAM
localArray[localThreadId]=globalArray[globalThreadId];
barrier(CLK_LOCAL_MEM_FENCE); // probably all threads wait 1st thread
// (maybe even 1st SIMD or
// could be even whole 1st wavefront!)
// here all threads written their own id to local array. safe to read.
// except first element which is a variable from global memory
// lets add that value to all other values
if(localThreadId!=0)
localArray[localThreadId]+=localArray[0];
Working example(local work group size=64):
inputs: 0,1,2,3,4,0,0,0,0,0,0,..
__kernel void vecAdd(__global float* x )
{
int id = get_global_id(0);
int idL = get_local_id(0);
__local float loc[64];
loc[idL]=x[id];
barrier (CLK_LOCAL_MEM_FENCE);
float distance_square_sum=0;
for(int i=0;i<64;i++)
{
float diff=loc[idL]-loc[i];
float diff_squared=diff*diff;
distance_square_sum+=diff_squared;
}
x[id]=distance_square_sum;
}
output: 30, 74, 246, 546, 974, 30, 30, 30...
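To check the listed output without a GPU, the same computation can be written as a plain CPU loop. This is a reference sketch, not part of the original answer; `vec_add_reference` is a hypothetical name, and unlike the kernel it writes to a separate output array instead of overwriting x.

```c
#define GROUP_SIZE 64  /* local work-group size used in the example */

/* CPU reference: for each element, the sum of squared differences to every
   element of its own group of GROUP_SIZE values.
   n must be a multiple of GROUP_SIZE. */
void vec_add_reference(const float *x, float *out, int n)
{
    for (int base = 0; base < n; base += GROUP_SIZE) {
        for (int idL = 0; idL < GROUP_SIZE; idL++) {
            float sum = 0.0f;
            for (int i = 0; i < GROUP_SIZE; i++) {
                float diff = x[base + idL] - x[base + i];
                sum += diff * diff;
            }
            out[base + idL] = sum;
        }
    }
}
```

For the input 0,1,2,3,4 followed by 59 zeros, an element with value 0 gives 1+4+9+16 = 30, and an element with value 1 gives 1+0+1+4+9 plus 59 ones = 74, matching the output above.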

How do I create a "twirly" in a C program task?

Hey guys I have created a program in C that tests all numbers between 1 and 10000 to check if they are perfect using a function that determines whether a number is perfect. Once it finds these it prints them to the user, they are 6, 28, 496 and 8128. After this the program then prints out all the factors of each perfect number to the user. This is all fine. Here is my problem.
The final part of my task asks me to:
"Use a "twirly" to indicate that your program is happily working away. A "twirly" is the following characters printed over the top of each other in the following order: '|' '/' '-' '\'. This has the effect of producing a spinning wheel - ie a "twirly". Hint: to do this you can use \r (instead of \n) in printf to give a carriage return only (instead of a carriage return linefeed). (Note: this may not work on some systems - you do not have to do it this way.)"
I have no idea what a twirly is or how to implement one. My tutor said it has something to do with the sleep and delay functions which I also don't know how to use. Can anyone help me with this last stage, it sucks that all my coding is complete but I can't get this "twirly" thing to work.
If you want to test the numbers and display the twirly on screen at the same time, you had better look into using threads. With POSIX threads you can start the test on one thread while another thread displays the twirly to the user on the terminal.
#include<stdlib.h>
#include<pthread.h>
int Test();
void Display();
int main(){
// create threads each for both tasks test and Display
//call threads
//wait for Test thread to finish
//terminate display thread after Test thread completes
//exit code
}
Refer to chapter 12 of the Beginning Linux Programming ebook for threads.
Given the program upon which the user is "waiting", I believe the problem as stated and the solutions using sleep() or threads are misguided.
Producing all the perfect numbers below 10,000 in C on a modern personal computer takes about 1/10 of a second. So any device to show the computer is "happily working away" would either never be seen or would significantly interfere with the time it takes to get the job done.
But let's make a working twirly for perfect number search anyway. I've left off printing the factors to keep this simple. Since 10,000 is too low to see the twirly in action, I've upped the limit to 100,000:
#include <stdio.h>
#include <string.h>
int main()
{
const char *twirly = "|/-\\";
for (unsigned x = 1; x <= 100000; x++)
{
unsigned sum = 0;
for (unsigned i = 1; i <= x / 2; i++)
{
if (x % i == 0)
{
sum += i;
}
}
if (sum == x)
{
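printf("%u\n", x); /* x is unsigned, so %u rather than %d */
}
printf("%c\r", twirly[x / 2500 % strlen(twirly)]);
fflush(stdout); /* "\r" alone does not flush line-buffered output */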
}
return 0;
}
No need for sleep() or threads, just key it into the complexity of the problem itself and have it update at reasonable intervals.
Now here's the catch, although the above works, the user will never see a fifth perfect number pop out with a 100,000 limit and even with a 100,000,000 limit, which should produce one more, they'll likely give up as this is a bad (slow) algorithm for finding them. But they'll have a twirly to watch.
i as integer
p as integer
set p = 0
str as character pointer
set str = "|/-\\"
set len = 4
loop i: 1 to 10000
    sum as integer
    set sum = 0
    loop j: 1 to i/2
        if i%j == 0
            sum+=j
    if sum == i
        print i as a perfect number
    if i%100 == 0
        print str[p] using "%c\r" as format specifier
        Increment p and assign its modulo by len to p
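The pseudocode above can be split into two small C functions. This is a sketch under the same assumptions as the pseudocode; `is_perfect` and `twirly_frame` are hypothetical names chosen for illustration.

```c
/* Returns 1 if n equals the sum of its proper divisors, else 0. */
int is_perfect(unsigned n)
{
    unsigned sum = 0;
    for (unsigned j = 1; j <= n / 2; j++)
        if (n % j == 0)
            sum += j;
    return n != 0 && sum == n;
}

/* Returns the twirly character for a given step: | / - \ repeating. */
char twirly_frame(unsigned step)
{
    static const char frames[] = "|/-\\";
    return frames[step % 4];
}
```

Inside the search loop, one would print twirly_frame(i / 100) with "%c\r" whenever i % 100 == 0, followed by fflush(stdout) so the character appears immediately.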

Create an array of values from different text files in C

I'm working in C on 64-bit Ubuntu 14.04.
I have a number of .txt files, each containing lines of floating point values (1 value per line). The lines represent parts of a complex sample, and they're stored as real(a1) \n imag(a1) \n real(a2) \n imag(a2), if that makes sense.
In a specific scenario there are 4 text files each containing 32768 samples (thus 65536 values), but I need to make the final version dynamic to accommodate up to 32 files (the maximum samples per file would not exceed 32768 though). I'll only be reading the first 19800 samples (depending on other things) though, since the entire signal is contained in those 39600 points (19800 samples).
A common abstraction is to represent the files / samples as a matrix, where columns represent return signals and rows represent the value of each signal at a sampling instant, up until the maximum duration.
What I'm trying to do is take the first sample from each return signal and move it into an array of double-precision floating point values to do some work on, move on to the second sample for each signal (which will overwrite the previous array) and do some work on them, and so forth, until the last row of samples have been processed.
Is there a way in which I can dynamically open files for each signal (depending on the number of pulses I'm using in that particular instance), read the first sample from each file into a buffer and ship that off to be processed. On the next iteration, the file pointers will all be aligned to the second sample, it would then move those into an array and ship it off again, until the desired amount of samples (19800 in our hypothetical case) has been reached.
I can read samples just fine from the files using fscanf:
rx_length = 19800;
int x;
float buf;
double *range_samples = calloc(num_pulses, 2 * sizeof(*range_samples));
for (i=0; i < 2 * rx_length; i++){
x = fscanf(pulse_file, "%f", &buf);
*(range_samples) = buf;
}
All that needs to happen (in my mind) is that I need to cycle both sample# and pulse# (in that order), so when finished with one pulse it would move on to the next set of samples for the next pulse, and so forth. What I don't know how to do is to somehow declare file pointers for all return signal files, when the number of them can vary inbetween calls (e.g. do the whole thing for 4 pulses, and on the next call it can be 16 or 64).
If there are any ideas / comments / suggestions I would love to hear them.
Thanks.
I would make the code you posted a function that takes an array of file names as an argument:
void doPulse( const char **file_names, const int size )
{
FILE *file = 0;
// declare your other variables
for ( int i = 0; i < size; ++i )
{
file = fopen( file_names[i], "r" ); // fopen needs a mode argument
// make sure file is open
// do the work on that file
fclose( file );
file = 0;
}
}
What you need is a generator. It would be reasonably easy in C++, but as you tagged C, I can imagine a function, taking a custom struct (the state of the object) as parameter. It could be something like (pseudo code) :
typedef struct GtorState {
char **files; // array of file names
int filesIndex; // index of the file currently open
FILE *currentFile;
} GtorState;
void gtorInit(GtorState *state, char **files) {
// loads the array of files into state, sets the index to 0, and opens the first file
}
int nextValue(GtorState *state, double *real, double *imag) {
// read 2 values from currentFile and assign them to real and imag
// if eof, close currentFile and open files[++filesIndex]
// if real and imag were found returns 0, else 1 if eof on last file, 2 if error
}
Then your main program could contain:
GtorState state;
// initialize the list of files to process
gtorInit(&state, files);
double real, imag;
int cr;
while (0 == (cr = nextValue(&state, &real, &imag))) {
// process (real, imag)
}
if (cr == 2) {
// process (at least display) error
}
Alternatively, your main program could iterate over the values of the different files and call a processing function that keeps its own state, analogous to the generator above, and at the end use that state to get the results.
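The generator pseudocode above could be filled in roughly as follows. This is a sketch, not the answerer's implementation: the `fileCount` field and parameter are additions needed to know where the list ends, the file names are borrowed rather than copied, and the return codes follow the convention stated in the pseudocode (0 for a value pair, 1 for the end of the last file, 2 for an error).

```c
#include <stdio.h>

/* Hypothetical generator state; field names follow the pseudocode. */
typedef struct {
    const char **files;  /* borrowed array of file names */
    int fileCount;       /* assumption: the caller passes the count */
    int filesIndex;      /* index of the file currently open */
    FILE *currentFile;
} GtorState;

/* Returns 0 on success, 2 if the first file cannot be opened. */
int gtorInit(GtorState *state, const char **files, int fileCount)
{
    state->files = files;
    state->fileCount = fileCount;
    state->filesIndex = 0;
    state->currentFile = (fileCount > 0) ? fopen(files[0], "r") : NULL;
    return (fileCount == 0 || state->currentFile) ? 0 : 2;
}

/* Returns 0 when a (real, imag) pair was read, 1 at the end of the
   last file, 2 on a read or format error. */
int nextValue(GtorState *state, double *real, double *imag)
{
    while (state->currentFile) {
        int n = fscanf(state->currentFile, "%lf %lf", real, imag);
        if (n == 2)
            return 0;
        int clean_eof = (n == EOF) && !ferror(state->currentFile);
        fclose(state->currentFile);
        state->currentFile = NULL;
        if (!clean_eof)
            return 2;  /* malformed value, odd number of values, or I/O error */
        if (++state->filesIndex >= state->fileCount)
            return 1;  /* no more files */
        state->currentFile = fopen(state->files[state->filesIndex], "r");
        if (!state->currentFile)
            return 2;
    }
    return 1;
}
```

Note that this walks the files one after another, as in the pseudocode. Reading one sample from every file per iteration, as the question describes, would instead need an array of FILE pointers allocated to the number of pulses.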
Tried a slightly different approach and it's working really well.
Instead of reading from the different files each time I want to do something, I read the entire contents of each file into a 2D array range_phase_data[sample_number][pulse_number], and then access different parts of the array depending on which range bin I'm currently working on.
Here's an excerpt:
#define REAL(z,i) ((z)[2*(i)])
#define IMAG(z,i) ((z)[2*(i)+1])
for (i=0; i<rx_length; i++){
printf("\t[%s] Range bin %i. Samples %i to %i.\n", __FUNCTION__, i, 2*i, 2*i+1);
for (j=0; j<num_pulses; j++){
REAL(fft_buf, j) = range_phase_data[2*i][j];
IMAG(fft_buf, j) = range_phase_data[2*i+1][j];
}
printf("\t[%s] Range bin %i done, ready to FFT.\n", __FUNCTION__, i);
// do stuff with the data
}
This removes the need to dynamically allocate file pointers; instead it just opens the files one at a time and writes the data to the corresponding column in the matrix.
Cheers.

Pipelining 1D Convolution algorithm using C on DSP development board

The DSP board I am currently using is DSK6416 from Spectrum Digital, and I am implementing a convolution algorithm in C to convolve input voice samples with a pre-recorded impulse response array. The objective is to speak into the microphone, and output the processed effect so we sound like we are speaking in that environment where the impulse response array is obtained.
The challenge I am facing now is doing the convolution live and keep up the pace of the input and output speed of the interrupt function at 8 kHz.
Here is my brain storming idea:
My current inefficient implementation that does not work is as follows:
The interrupt will stop the convolution process, output the index, and resume convolution at 8 kHz, i.e. once every 1/8000 s (125 microseconds).
However, a complete iteration of convolution takes much longer than 125 microseconds. So when the interrupt wants to output the data from the output array, the data is not ready yet.
My ideal implementation for fast pipelining convolution algorithm:
We would have many convolution processes running in the background while outputting the completed ones as time goes on. There will be many pipes running in parallel.
If I use the pipelining approach, we would need to have N = 10000 pipeline processes running in the background...
Now I have the idea down (at least I think I do, I might be wrong), I have no clue how to implement this on the DSK board using C programming language because C does not support object orientation.
The following is the pseudo-code for our C implementation:
#include <stdio.h>
#include "DSK6416_AIC23.h"
Uint32 fs=DSK6416_AIC23_FREQ_48KHZ; //set sampling rate
#define DSK6416_AIC23_INPUT_MIC 0x0015
#define DSK6416_AIC23_INPUT_LINE 0x0011
Uint16 inputsource=DSK6416_AIC23_INPUT_MIC; // select input
//input & output parameters declaration
#define MAX_SIZE 10000
Uint32 curr_input;
Int16 curr_input2;
short input[1];
short impulse[MAX_SIZE ];
short output[MAX_SIZE ];
Int16 curr_output;
//counters declaration
Uint32 a, b, c, d; //dip switch counters
int i, j, k; //convolution iterations
int x; //counter for initializing output;
interrupt void c_int11() //interrupt running at 8 kHz
{
//Reads Input
//Start new pipe
//Outputs output to speaker
}
void main()
{
//Read Impulse.txt into impulse array
comm_intr();
while(1)
{
if (DIP switch pressed)
{
//convolution here (our current inefficient convolution algorithm)
//Need to run multiple of the same process in the background in parallel.
for (int k = 0; k < MAX_SIZE; k++)
{
if (k==MAX_SIZE-1 && i == 0) // special condition overwriting element at i = MAX_SIZE -1
{
output[k] = (impulse[k]*input[0]);
}
else if (k+i < MAX_SIZE) // convolution from i to MAX_SIZE
{
output[k+i] += (impulse[k]*input[0]);
}
else if (k+i-MAX_SIZE != i-1) // convolution from 0 to i-2
{
output[k+i-MAX_SIZE] += (impulse[k]*input[0]);
}
else // overwrite element at i-1
{
output[i-1] = (impulse[k]*input[0]);
}
}
}
else //if DIP switch is not pressed
{
DSK6416_LED_off(0);
DSK6416_LED_off(1);
DSK6416_LED_off(2);
DSK6416_LED_off(3);
j = 0;
curr_output = input[0]; // input has only one element
output_sample(curr_output); //outputs unprocessed dry voice
}
} //end of while
fclose(fp);
}
Is there a way to implement pipeline in C code to compile on the hardware DSP board so we can run multiple convolution iterations in the background all at the same time?
I drew some pictures, but I am new to this board so I can't post images.
Please let me know if you need my pictorial ideas to help you help me~
Any help on how to implement this code is very much appreciated !!
You probably need to process data in chunks of some N samples. While one chunk is being I/O'd in a DAC/ADC interrupt handler, another one is being processed somewhere in main(). The main thing here is to make sure your processing of a chunk of N samples takes less time than receiving/transmitting N samples.
Here's what it may look like in time (all things in every step (except step 1) happen "in parallel"):
1) buf1=buf3=zeroes, buf2=anything
2) ISR: DAC sends buf1, ADC receives buf2; main(): processes buf3
3) ISR: DAC sends buf3, ADC receives buf1; main(): processes buf2
4) ISR: DAC sends buf2, ADC receives buf3; main(): processes buf1
5) Repeat indefinitely from step 2.
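The schedule above boils down to rotating three roles among three buffers once per block. Below is a minimal sketch of just that bookkeeping, with hypothetical names (`BufferRoles`, `roles_init`, `roles_rotate`) and no board-specific code; the actual I/O and the synchronisation between the ISR and main() are left out.

```c
/* Roles of the three buffers, stored as buffer numbers 1..3. */
typedef struct {
    int dac;   /* buffer being sent by the DAC */
    int adc;   /* buffer being filled by the ADC */
    int proc;  /* buffer being processed in main() */
} BufferRoles;

/* Assignment used in step 2 of the schedule. */
BufferRoles roles_init(void)
{
    BufferRoles r = { 1, 2, 3 };
    return r;
}

/* Advances by one block period. */
BufferRoles roles_rotate(BufferRoles r)
{
    BufferRoles next;
    next.dac = r.proc;   /* the block just processed goes out next */
    next.adc = r.dac;    /* the block just sent is free to be refilled */
    next.proc = r.adc;   /* the block just received gets processed */
    return next;
}
```

Starting from (1, 2, 3), one rotation gives DAC=3, ADC=1, processing=2, the next gives DAC=2, ADC=3, processing=1, and the third returns to the starting assignment, which matches steps 2 to 4.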
Also, you may want to implement your convolution in assembly for extra speed. I'd look at some TI app notes or what not for an implementation. Perhaps it's available in some library too.
You may also consider doing convolution via Fast Fourier Transform.
Your DSP only has so many CPU cycles available per second. You need to analyze your algorithm to determine how many CPU cycles it takes to process each sample on average. That needs to be less than the number of CPU cycles between samples. No amount of pipelining or object orientation will help if you don't have an algorithm that completes in a small enough number of cycles per sample on average.
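The budget check is plain arithmetic and can be sketched as below. The function names are hypothetical, and both the clock rate and the cycles-per-tap figure are inputs you would have to look up or measure for your own board; nothing here is specific to the DSK6416.

```c
#include <stdint.h>

/* CPU cycles available between two consecutive samples. */
uint64_t cycles_per_sample(uint64_t cpu_hz, uint64_t sample_rate_hz)
{
    return cpu_hz / sample_rate_hz;
}

/* Nonzero if a direct convolution with `taps` coefficients fits the budget,
   assuming `cycles_per_tap` cycles per multiply-accumulate (a guess that
   should be replaced by a measured value). */
int direct_convolution_fits(uint64_t cpu_hz, uint64_t sample_rate_hz,
                            uint64_t taps, uint64_t cycles_per_tap)
{
    return taps * cycles_per_tap <= cycles_per_sample(cpu_hz, sample_rate_hz);
}
```

As an illustration only: with a placeholder 1 GHz clock and 8 kHz sampling there are 125000 cycles per sample, so a 10000-tap impulse response leaves 12.5 cycles per tap. If the inner loop costs more than that on average, the direct approach cannot keep up, whatever the code structure.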
