Accessing variable by pointer in OpenCL kernel - c

I am writing a ray tracing program in OpenCL. My kernel has a function, Quadratic, that takes three float values and two pointers to float.
Function:
bool Quadratic(float A, float B, float C, float *t0, float *t1) {
float discrim = B * B - ( 4.0 * A * C );
if (discrim <= 0.0) return false;
float rootDiscrim = sqrtf(discrim);
float q;
if (B < 0) q = -0.5f * ( B - rootDiscrim);
else q = -0.5f * ( B + rootDiscrim);
*t0 = q / A;
*t1 = C / q;
float temp;
return true;
}
Calling the Function:
float t0;
float t1;
if (Quadratic(A, B, C, &t0, &t1)) c[(i*dimy)+j] = t0;
else c[(i*dimy)+j] = 0.0;
Produces the following error:
pyopencl.RuntimeError: clBuildProgram failed: build program failure -
Build on <pyopencl.Device 'ATI Radeon HD 6750M' on 'Apple' at 0x1021b00>:
Error returned by cvms_element_build_from_source
In trying to work out what the problem was I created the following test function which seems to work:
bool TestFunc(float Y, float *x) {
*x = Y;
return true;
}
float x;
if (TestFunc(50.0, &x)) c[(i*dimy)+j] = x;
As far as I can see, both functions have the same types of inputs and outputs. Any help would be greatly appreciated.

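It turns out the problem was the call to sqrtf. OpenCL C has no sqrtf; its built-in sqrt is overloaded for float arguments. Once sqrtf was changed to sqrt, the kernel built and worked perfectly.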

Related

Add noise to the frame using Polar Method in C

I wrote a program that reads the original YUV file and is supposed to add Gaussian noise with mean 0 to the modified one.
The problem is that I don't know how to call the polar function from the main function; when I tried, it always generated errors.
Does anyone have any ideas on how to solve my problem?
Thanks
void polar(double *x1, double *x2)
{
double u, v, q, p;
do {
u = 2.0 * random() - 1;
v = 2.0 * random() - 1;
q = u * u + v * v;
} while (q >= 1.0 || q == 0.0);
p = sqrt(-2 * log(q) / q);
*x1 = u * p;
*x2 = v * p;
}
int main(void)
{
FILE *fp1, *fp2;
int a;
double a1,a2;
fp1= fopen("FOOTBALL_352x288_30_orig_01.yuv","rb");
fp2= fopen("FOOTBALL_352x288_30_orig_02.yuv","wb");
int tab[10]="";
while(!feof(fp1))
{
fread(tab,sizeof(int),1,fp1);
fwrite(tab,sizeof(int),1,fp2);
}
fclose(fp1);
fclose(fp2);
return 0;
}
The polar function expects the addresses of two doubles as inputs. Since main already declares a1 and a2, you can call polar(&a1, &a2). By the time polar returns, it will have stored the two generated values in a1 and a2. To check this, try printing these variables before and after your call to the polar function.

2D array print error

This is my code snippet:
float *fittedLine[NO_OF_INTERVALS];
int i;
for(k=0;k<NO_OF_INTERVALS;k++){
fittedLine[k] = plotLines(tStart, tEnd, yStart, yEnd);
for(i=0;i<14;i++)
printf("%f\n", fittedLine[k][i]);
}
My problem: The print statement is not giving proper outputs; Garbage values are printed.
I tried debugging by putting a breakpoint in line 5 and tried printing
fittedLine[0][0], fittedLine[0][1] and so on.
That gives proper output. But when coming to the print statement, things start to fall apart. What is happening here? plotLines returns a 1D array.
plotLines function:
float * plotLines(float t0, float tf, float y0, float yf){
float m = (yf - y0)/(tf - t0);
float t = t0;
int i = 0;
float y[200];
while(t<tf){
y[i] = m*t + yf - m*tf; ///y - y1 = m(x - x1)
i++;
t = t+SAMPLINGTIME;
}
return y;
}
It would be helpful to see the plotLines function, but my guess is that you are returning a local variable:
float *plotLines() {
float temp[14];
// blah...
return temp;
}
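This is bad. temp is a local array, so its lifetime ends when the function returns; the returned pointer dangles, and reading through it is undefined behavior, which is why you see garbage. You need to allocate the array using malloc and return that pointer: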
float *plotLines() {
float *temp = malloc(sizeof(float) * 14);
// Make sure temp is not NULL
// blah...
return temp;
}
Don't forget to free all that memory later. If you only use the results temporarily, a static array could work, but every call then returns the same buffer, so a later call overwrites the earlier results:
float *plotLines() {
static float temp[14];
// blah...
return temp;
}
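Applied to the plotLines function from the question, the malloc approach might look like the sketch below. The name plot_lines, the explicit dt and max_samples parameters, and the count output are assumptions added for illustration; the question uses a SAMPLINGTIME constant and a fixed 200-element array instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Samples the line through (t0, y0) and (tf, yf) every dt, for t in [t0, tf).
 * Returns a heap array that the caller must free, or NULL on failure.
 * *count receives the number of samples actually written. */
float *plot_lines(float t0, float tf, float y0, float yf,
                  float dt, size_t max_samples, size_t *count)
{
    assert(count != NULL && dt > 0.0f && tf > t0);
    *count = 0;
    float *y = malloc(max_samples * sizeof *y);
    if (y == NULL)
        return NULL;
    float m = (yf - y0) / (tf - t0);
    size_t i = 0;
    /* The i < max_samples check keeps the loop inside the buffer. */
    for (float t = t0; t < tf && i < max_samples; t += dt)
        y[i++] = y0 + m * (t - t0);
    *count = i;
    return y;
}
```

The caller would store each returned pointer in fittedLine[k], print only the first count values, and call free on every pointer when done.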

Returning an array of structs from a function - C programming

So I'm trying to write a function that will return an array of several values. At the moment, it is running correctly but only outputting the final calculated value. How would I make it so the output includes all calculated values?
My code looks like this:
//Practice to output an array of structs
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
struct boat_params {
double V, Uc, Vc;
};
struct boat_params submerged_volume(double L1, double L2, double Lavg, double H) {
struct boat_params volume;
double V_sub, Uc_sub, Vc_sub;
V_sub = 0;
//Boat description
double C, delta;
double theta, theta_rad, theta_min, theta_min2, theta_lim, theta_lim2, theta_lim_deg;
double Ug1, Ug2, Vg1, Vg2, V1, V2;
double pi;
pi = 4*atan(1);
C = sqrt(L1*L1 + L2*L2);
delta = acos(L1/C);
theta_lim = asin(H/L1);
theta_lim_deg = (theta_lim/pi) * 180.0;
theta_min = asin(H/C) - delta;
theta_min2 = 0;
//Calculating the submerged volume and centre of gravity for each different angle
for (theta = 0; theta <= 10; theta ++) {
//**Note: I've taken out the actual calculations of V_sub, Uc_sub, and Vc_sub for brevity**
volume.V = V_sub;
volume.Uc = Uc_sub;
volume.Vc = Vc_sub;
}
return volume;
}
int main () {
double L1, L2, Lavg, H;
struct boat_params volume;
L1 = 17.6;
L2 = 3;
Lavg = 4;
H = 4.5;
volume = submerged_volume(L1, L2, Lavg, H);
printf("V = %lf\nUc = %lf\nVc = %lf\n", volume.V, volume.Uc, volume.Vc);
return 0;
}
I can get it to correctly output the last calculated value (for theta = 10) but that's the only value I'm getting. How would I calculate V_sub, Uc_sub, and Vc_sub for each theta value? and output each value. I'm assuming this means turning the struct into an array and filling each element of the array with values of the struct for that theta but I don't know how to do this!
I really appreciate any help and thank you in advance.
Also: If possible I'd like to avoid pointers but understand this may not be possible! I'm still very new and not good at using them!
You are quite right: you will need an array for that. If the number of elements in the array is constant, you could also create a struct that contains exactly that number of elements, but please don't do that.
To operate on arrays you will, unfortunately, need pointers. A very common way to do this in C is not to return a pointer but to pass a 'result' pointer in. The caller then allocates and frees the storage and can use ordinary array syntax. In your code the number of values is constant (11, for theta = 0 to 10), which makes this solution possible with a fixed-size array. Alternatively, you could allocate space on the heap (using malloc) and return a pointer, but then the caller must free memory it never allocated, which is counterintuitive and leaks memory if forgotten. Consider the following solution:
void submerged_volume(double L1, double L2, double Lavg, double H, struct boat_params *result) {
int i; // integer index: a double such as theta cannot be added to a pointer
for (i = 0; i <= 10; i++) {
double theta = i;
// your calculations of V_sub, Uc_sub and Vc_sub for this theta here
result[i].V = V_sub;
result[i].Uc = Uc_sub;
result[i].Vc = Vc_sub;
}
}
// somewhere in your code where you want to use your function
struct boat_params values[11];
int i;
submerged_volume(/* parameters */, values);
for (i = 0; i <= 10; ++i) {
printf("V = %lf\nUc = %lf\nVc = %lf\n", values[i].V, values[i].Uc, values[i].Vc);
}
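A compilable sketch of the result-pointer approach is shown here. The formulas are placeholders chosen only so the output varies with theta; they are not the boat calculations, which the question omitted. N_ANGLES is a name assumed for the 11 values of theta, and Lavg is dropped because the placeholders do not use it.

```c
#include <assert.h>
#include <stddef.h>

#define N_ANGLES 11 /* theta = 0..10, as in the question's loop */

struct boat_params {
    double V, Uc, Vc;
};

/* Fills result[0] .. result[N_ANGLES - 1], one entry per theta.
 * The caller owns the array, so nothing here needs to be freed. */
void submerged_volume(double L1, double L2, double H,
                      struct boat_params *result)
{
    assert(result != NULL);
    for (int i = 0; i < N_ANGLES; i++) {
        double theta = (double)i;
        result[i].V = L1 * theta;  /* placeholder, not the real formula */
        result[i].Uc = L2 * theta; /* placeholder */
        result[i].Vc = H * theta;  /* placeholder */
    }
}
```

The caller declares struct boat_params values[N_ANGLES]; on the stack, passes values to the function, and then loops over values[i] to print each result.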
Try this, just add your logic to the loop and maths:
#include <stdio.h>
#include <stdlib.h>
#define ARRSIZE 100
typedef struct boat_params {
double V, Uc, Vc;
} Volume;
void submerged_volume(double L1, double L2, double Lavg, double H, Volume *volumes[]) {
double theta;
int i = 0; /* only example, change as needed */
Volume *p;
for (theta = 0; theta <= 10; theta ++) {
p = malloc(sizeof(* p));
if (p == NULL) {
fprintf(stderr, "malloc failed to allocate a new element\n");
exit(EXIT_FAILURE);
}
p->V = 1; //V_sub;
p->Uc = 2; //Uc_sub;
p->Vc = 3; //Vc_sub;
volumes[i] = p;
i++;
}
}
int main () {
double L1, L2, Lavg, H;
int i;
L1 = 17.6;
L2 = 3;
Lavg = 4;
H = 4.5;
Volume *volumes[ARRSIZE];
submerged_volume(L1, L2, Lavg, H, volumes);
printf("V = %lf\nUc = %lf\nVc = %lf\n", volumes[0]->V, volumes[0]->Uc, volumes[0]->Vc); /* first element for example */
for (i = 0; i <= 10; i++) free(volumes[i]); /* release the 11 elements allocated above */
return 0;
}
If you don't know the size of the volumes array in advance, you should consider using a linked list.

Model using Euler method and pointer arithmetic not functioning

I'm new to C, and quite unfamiliar with writing any program larger than a few lines.
I'm trying to write a model for an object in free fall acted upon by gravity and drag. It uses Euler's method to solve two first-order differential equations, one for position and one for velocity.
So we have: F = m dv/dt = -mg - k|v|v and dy/dt = v
These are solved by: Vn+1 = Vn - (delta t*(g+(k/m)|Vn|Vn)) and Yn+1 = Yn + (delta t * Vn)
(In this Vn+1 is the n+1th term etc.)
In my program I've tried to have two functions, one for position and one for velocity, which work by passing pointers to the Y and V values between them and the main function. It should then loop until Y = 0 and print the values at each step.
When I run it it comes up with something like this: http://imgur.com/DNHIhHI
Could anyone tell me either what is wrong with this, or if I need to use a different approach completely?
Many Thanks, Code below
#include <stdio.h>
void Velocity(double *ptr, double m, double k, double t);
void Position(double *pst, double *ptr, double t );
int main()
{
double k = 18833.5608;
double t = 0;
double m;
double speed = 0;
double *ptr = &speed;
double y = 1000;
double *pst = &y;
printf("Enter mass of object: \n");
scanf("%f" , &m);
do
{
Velocity( ptr, m, k, t );
printf("Velocity at time %f is: %f\n" , t, speed);
Position( pst, ptr, t);
printf("Position at time %f is: %f\n" , t , y);
t++;
}
while((y>0));
return 0;
}
void Velocity(double *velo, double m, double k, double t)
{
double g = 9.80665;
*velo = *velo - (t*(g+((k/m)*fabs(*velo)**(velo))));
}
void Position(double *Y , double *velo, double t )
{
*Y = *Y+(t*(*velo));
}
When writing programs that do calculations -- in any language, not just C -- try to make the code that does the computation take arguments and return results but not mutate variables. That is, do not write:
void do_calculation( double * result, double x, double y)
{
*result = x + y;
}
...
double r;
do_calculation(&r, 123, 456);
instead write
double do_calculation(double x, double y)
{
return x + y;
}
...
double r = do_calculation(123, 456);
Make sense?
If you want to modify an existing value, again, don't pass it in as a variable to be mutated. Instead of
void do_calculation(double * accumulator, double x, double y)
{
*accumulator = *accumulator + x + y;
}
...
double r = 10;
do_calculation(&r, 123, 456);
instead say
double do_calculation(double original, double x, double y)
{
return original + x + y;
}
...
double r = 10;
r = do_calculation(r, 123, 456);
Now, once you've got your program architected more sensibly, you need to learn how to debug small programs. Some good advice on that subject can be found here:
http://ericlippert.com/2014/03/05/how-to-debug-small-programs/
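A misconception: I believe you're trying to solve the equations by stepping through small increments of time. Nothing is wrong with that, but the step passed to Velocity and Position must be the time increment, not the total elapsed time t, which starts at 0 and grows every iteration. Make the time increment small and correct the formulas: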
#include <stdio.h>
#include <math.h>
void Velocity(double *velocity, double m, double k, double t)
{
double g = 9.80665;
double velo = *(velocity);
velo = velo - (t*(g+((k/m)*fabs(velo)*(velo)))); // fabs, not abs: abs() takes an int
*(velocity)=velo;
}
void Position(double *position , double *velocity, double t )
{
double Y = *(position);
double velo = *(velocity);
Y = Y+(t*(velo));
*(position)=Y;
}
int main()
{
double k = 18833.5608;
double t = 0;
double dt = 0.001; //making a small increment of time
double m=100;
double speed = 0;
double y = 1000;
//printf("Enter mass of object: \n");
//scanf("%lf" , &m); // m is a double, so scanf needs %lf rather than %f
do
{
Velocity( &speed, m, k, dt );
printf("Velocity at time %f is: %f\n" , t, speed);
Position( &y, &speed, dt);
printf("Position at time %f is: %f\n" , t , y);
t+=dt; //increment time by delta t
}
while((y>0));
return 0;
}
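The two answers can be combined: keep the small time step, but write the update as a function that takes the current state and returns the next one instead of mutating through pointers. The sketch below follows the update formulas stated in the question, where the position step uses the old velocity Vn. The names struct state, euler_step and G are assumptions made for illustration.

```c
#include <assert.h>
#include <math.h>

#define G 9.80665 /* gravitational acceleration in m/s^2 */

struct state {
    double y; /* position */
    double v; /* velocity, negative when falling */
};

/* One explicit Euler step of size dt:
 *   v_next = v - dt * (g + (k/m) * |v| * v)
 *   y_next = y + dt * v        (uses the old velocity)
 * Returns the new state and leaves the argument untouched. */
struct state euler_step(struct state s, double m, double k, double dt)
{
    assert(m > 0.0 && dt > 0.0);
    struct state next;
    next.v = s.v - dt * (G + (k / m) * fabs(s.v) * s.v);
    next.y = s.y + dt * s.v;
    return next;
}
```

A driver loop would then read s = euler_step(s, m, k, dt); t += dt; repeated while s.y > 0, printing s.v and s.y as often as needed.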

How do I parallelize this triple loop in an efficient way?

I'm trying to parallelize a function which takes as input three arrays (x, y, and prb) and one scalar, and outputs three arrays (P1, Pt1, and Px).
The original c code is here (the outlier and E are inconsequential):
#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#define max(A, B) ((A) > (B) ? (A) : (B))
#define min(A, B) ((A) < (B) ? (A) : (B))
void cpd_comp(
double* x,
double* y,
double* prb,
double* sigma2,
double* outlier,
double* P1,
double* Pt1,
double* Px,
double* E,
int N,
int M,
int D
)
{
int n, m, d;
double ksig, diff, razn, outlier_tmp, sp;
double *P, *temp_x;
P = (double*) calloc(M, sizeof(double));
temp_x = (double*) calloc(D, sizeof(double));
ksig = -2.0 * *sigma2;
for (n=0; n < N; n++) {
sp=0;
for (m=0; m < M; m++) {
razn=0;
for (d=0; d < D; d++) {
diff=*(x+n+d*N)-*(y+m+d*M); diff=diff*diff;
razn+=diff;
}
*(P+m)=exp(razn/ksig) ;
sp+=*(P+m);
}
*(Pt1+n)=*(prb+n);
for (d=0; d < D; d++) {
*(temp_x+d)=*(x+n+d*N)/ sp;
}
for (m=0; m < M; m++) {
*(P1+m)+=((*(P+m)/ sp) **(prb+n));
for (d=0; d < D; d++) {
*(Px+m+d*M)+= (*(temp_x+d)**(P+m)**(prb+n));
}
}
*E += -log(sp);
}
*E +=D*N*log(*sigma2)/2;
free((void*)P);
free((void*)temp_x);
return;
}
Here is my attempt at parallelizing it:
#include <cuda.h>
#include <cuda_runtime.h>
#include <device_launch_parameters.h>
#include <thrust/device_ptr.h>
#include <thrust/reduce.h>
/*headers*/
void cpd_comp(
float * x, //Points to register [N*D]
float * y, //Points to be registered [M*D]
float * prb, //Vector of probabilities [N]
float * sigma2, //Square of sigma
float ** P1, //P1, output, [M]
float ** Pt1, //Pt1, output, [N]
float ** Px, //Px, output, [M*3]
int N, //Number of points, i.e. rows, in x
int M //Number of points, i.e. rows, in y
);
__global__ void d_computeP(
float * P,
float * P1,
float * Px,
float * ProbabilityMatrix,
float * x,
float * y,
float * prb,
float ksig,
const int N,
const int M);
__global__ void d_sumP(
float * sp,
float * P1timessp,
float * Pxtimessp,
float * P1,
float * Px,
const int N,
const int M);
/*implementations*/
void cpd_comp(
float * x, //Points to register [N*D]
float * y, //Points to be registered [M*D]
float * prb, //Vector of probabilities [N]
float * sigma2, //Scalar
float ** P1, //P1, output, [M]
float ** Pt1, //Pt1, output, [N]
float ** Px, //Px, output, [M*3]
int N, //Number of points, i.e. rows, in x
int M //Number of points, i.e. rows, in y
){
//X is generatedPointPos
//Y is points
float
*P,
*P1timessp,
*Pxtimessp,
ksig = -2.0 * (*sigma2),
*h_sumofP = new float[N], //sum of P, on host
*d_sumofP; //sum of P, on device
cudaMalloc((void**)&P, sizeof(float)*M*N);
cudaMalloc((void**)&P1timessp,sizeof(float)*M*N);
cudaMalloc((void**)&Pxtimessp,sizeof(float)*M*N*3);
cudaMalloc((void**)&d_sumofP, sizeof(float)*N);
cudaMalloc((void**)P1, sizeof(float)*M);
cudaMalloc((void**)Px, sizeof(float)*M*3);
cudaMalloc((void**)Pt1, sizeof(float)*N);
d_computeP<<<dim3(N,M/1024+1),M>1024?1024:M>>>(P,P1timessp,Pxtimessp,NULL,x,y,prb,ksig,N,M);
for(int n=0; n<N; n++){
thrust::device_ptr<float>dev_ptr(P);
h_sumofP[n] = thrust::reduce(dev_ptr+M*n,dev_ptr+M*(n+1),0.0f,thrust::plus<float>());
}
cudaMemcpy(d_sumofP,h_sumofP,sizeof(float)*N,cudaMemcpyHostToDevice);
d_sumP<<<M/1024+1,M>1024?1024:M>>>(d_sumofP,P1timessp,Pxtimessp,*P1,*Px,N,M);
cudaMemcpy(*Pt1,prb,sizeof(float)*N,cudaMemcpyDeviceToDevice);
cudaFree(P);
cudaFree(P1timessp);
cudaFree(Pxtimessp);
cudaFree(d_sumofP);
delete[]h_sumofP;
}
/*kernels*/
__global__ void d_computeP(
float * P,
float * P1,
float * Px,
float * ProbabilityMatrix,
float * x,
float * y,
float * prb,
float ksig,
const int N,
const int M){
//thread configuration: <<<dim3(N,M/1024+1),1024>>>
int m = threadIdx.x+blockIdx.y*blockDim.x;
int n = blockIdx.x;
if(m>=M || n>=N) return;
float
x1 = x[3*n],
x2 = x[3*n+1],
x3 = x[3*n+2],
diff1 = x1 - y[3*m],
diff2 = x2 - y[3*m+1],
diff3 = x3 - y[3*m+2],
razn = diff1*diff1+diff2*diff2+diff3*diff3,
Pm = __expf(razn/ksig), //fast exponentiation
prbn = prb[n];
P[M*n+m] = Pm;
__syncthreads();
P1[N*m+n] = Pm*prbn;
Px[3*(N*m+n)+0] = x1*Pm*prbn;
Px[3*(N*m+n)+1] = x2*Pm*prbn;
Px[3*(N*m+n)+2] = x3*Pm*prbn;
}
__global__ void d_sumP(
float * sp,
float * P1timessp,
float * Pxtimessp,
float * P1,
float * Px,
const int N,
const int M){
//computes P1 and Px
//thread configuration: <<<M/1024+1,1024>>>
int m = threadIdx.x+blockIdx.x*blockDim.x;
if(m>=M) return;
float
P1m = 0,
Pxm1 = 0,
Pxm2 = 0,
Pxm3 = 0;
for(int n=0; n<N; n++){
float spn = 1/sp[n];
P1m += P1timessp[N*m+n]*spn;
Pxm1 += Pxtimessp[3*(N*m+n)+0]*spn;
Pxm2 += Pxtimessp[3*(N*m+n)+1]*spn;
Pxm3 += Pxtimessp[3*(N*m+n)+2]*spn;
}
P1[m] = P1m;
Px[3*m+0] = Pxm1;
Px[3*m+1] = Pxm2;
Px[3*m+2] = Pxm3;
}
However, to my horror, it runs much, much slower than the original version. How do I make it run faster? Please explain things thoroughly since I am very new to CUDA and parallel programming and have no experience in algorithms.
Do note that the C version uses column-major ordering and the CUDA version uses row-major. I have done several tests to make sure that the result is correct. It's just extremely slow and takes up a LOT of memory.
Any help is greatly appreciated!
EDIT: More information: N and M are on the order of a few thousand (say, 300-3000) and D is always 3. The CUDA version expects arrays to be device memory, except for variables prefixed with h_.
Before trying any CUDA-specific optimizations, profile your code to see where time is being spent.
Try to arrange your array reads and writes so that consecutive CUDA threads access consecutive memory addresses (coalesced access). For example, currently you have
int m = threadIdx.x+blockIdx.y*blockDim.x;
int n = blockIdx.x;
if(m>=M || n>=N) return;
diff1 = x1 - y[3*m],
diff2 = x2 - y[3*m+1],
diff3 = x3 - y[3*m+2],
So thread 1 will read from y[0],y[1],y[2], thread 2 from y[3],y[4],y[5], etc., which means neighboring threads read addresses three floats apart. Instead, rearrange your data so that thread 1 reads from y[0],y[M],y[2*M] and thread 2 reads from y[1],y[M+1],y[2*M+1] etc. You should follow this access pattern for the other arrays as well.
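The rearrangement can be done once on the host before the data is copied to the device. The following is a minimal sketch in plain C; the function name interleaved_to_planar is hypothetical, and it assumes the points are stored as D consecutive coordinates per point, as the kernel's y[3*m+d] indexing suggests.

```c
#include <assert.h>
#include <stddef.h>

/* Converts M points with D coordinates each from interleaved layout
 * src[D*m + d] to planar layout dst[d*M + m].
 * src and dst must not overlap and must each hold M*D floats. */
void interleaved_to_planar(const float *src, float *dst, size_t M, size_t D)
{
    assert(src != NULL && dst != NULL && src != dst);
    for (size_t m = 0; m < M; m++)
        for (size_t d = 0; d < D; d++)
            dst[d * M + m] = src[D * m + d];
}
```

With the planar layout, thread m in the kernel would read y[m], y[M + m] and y[2*M + m], so threads with neighboring m touch neighboring addresses on each of the three reads.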
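Also, you may want to consider whether you can avoid the use of __syncthreads(). It only synchronizes the threads of one block, and this kernel uses no shared memory and has each thread write only its own elements, so I don't quite follow why it's necessary in this algorithm. It might be worth removing it to see if it improves performance (even if it produces incorrect results).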
The key to good CUDA performance is almost always to make as near to optimal memory access as possible. Your memory access pattern looks very similar to matrix multiplication. I would start with a good CUDA matrix multiplication implementation, being sure to understand why it's implemented the way it is, and then modify that to suit your needs.
