Initializating struct error: 'theGrid' being used without being initialized - c

I am trying to create a 3d grid for my OpenCl/GL fluid. The problem Im having is that for some reason the my grid initialization function does not work properly. Here is my *.h, *.c setup and (at the end) call in main:
(grid.h):
#if RunGPU
#define make_float3(x,y,z) (float3)(x,y,z)
#define make_int3(i,j,k) (int3)(i,j,k)
#else
typedef struct i3{
int i,j,k;
} int3;
typedef struct f3{
float x,y,z;
} float3;
#define __global
#define make_float3(x,y,z) {x , y , z}
#define make_int3(x,y,z) {x , y ,z}
#endif
typedef struct grid3 * grid3_t; // u,v,w
typedef struct grid * grid_t; // p
struct grid3 {
__global float3* values_;
__global float * H_;
__global float * h_;
int dimx_;
int dimy_;
int dimz_;
} ;
struct grid {
__global float * values_;
int dimx_;
int dimy_;
int dimz_;
};
void grid3_init(grid3_t grid,__global float3* vel,__global float* H,__global float *h, int X, int Y, int Z);
(grid.c):
void grid3_init(grid3_t grid,__global float3* val,__global float* H,__global float *h, int X, int Y, int Z){
grid->values_ = val;
grid->H_ = H;
grid->h_ = h;
grid->dimx_ = X;
grid->dimy_ = Y;
grid->dimz_ = Z;
}
In main im initializing my grid like so:
int main(int argc, char** argv)
{
const int size3d = Bx*(By+2)*Bz;
const int size2d = Bx*Bz;
float3 * velocities = (float3*)malloc(size3d*sizeof(float3));
float * H = (float*)malloc(size2d*sizeof(float));
float * h = (float*)malloc(size2d*sizeof(float));
for(int i = 0; i < size3d; i++){
float3 tmp = make_float3(0.f,0.f,0.f);
velocities[i] = tmp;
if(i < size2d){
H[i] = 1;
h[i] = 2;
}
}
grid3_t theGrid;
grid3_init(theGrid, velocities, H, h, Bx, By, Bz); // <- ERROR OCCURS HERE
}
The error im getting is during runtime - "Run-Time Check Failure #3 - The variable 'theGrid' is being used without being initialized". But thats precisely the job of grid3_init?
As im trying to write code to work for both Host and GPU I have to sacrifice the use of classes and work strictly with structs - which I have less experience with.
At this point I dont really know what to google either, I appriciate any help i can get.

struct grid3 theGrid;
grid3_init(&theGrid, velocities, H, h, Bx, By, Bz);
You need to create grid3 instance and pass its pointer to grid3_init. Your existing code just uses uninitialized pointer.

Related

How to pass pointers to vector extensions in C

I'm trying to use GCC's vector extensions, the exact code that I tried is:
typedef float Vector4 __attribute__ ((vector_size (16)));
void defVector(Vector4* v, float x,float y,float z,float w){
v[0] = x;
v[1] = y;
v[2] = z;
v[3] = w;
}
int main(int argc, char* argv){
Vector4 a;
defVector(&a, 1, 2, 3, 4);
}
and keep getting errors:
incompatible types when assigning to type ‘Vector4 {aka __vector(4) float}’ from type ‘float’
v[0] = x;
Can't dereference it too or I get another error.
I would like to not copy the entire thing to the function stack every time I use it, and it's a necessity to make pointers to the return values like
int someFunc(Vector4 v, Vector4* r){
...
r[0] = return_value;
return 0;
}
I tried everything I know to access the values inside the funtion.
What I'm missing here?
Based on the OP's example in which a local function gets a "vector-extension" pointer:
#include <stdio.h>
#include <stdint.h>
typedef float Vector4 __attribute__ ((vector_size (16)));
void defVector(Vector4 *v,float a, float b);
void defVector(Vector4 *v,float a, float b){
v[0] = *(Vector4*)&a;
v[1] = *(Vector4*)&b;
a = *(float*)&v[0] + 1.2;
b = *(float*)&v[1] + 2.1;
printf("%f,%f,%u",a,b,\
(uint32_t)sizeof(Vector4)/(uint32_t)sizeof(float));
}
int main(void) {
static Vector4 vectorA;
static float x1 = 3.4;
static float x2 = 4.3;
defVector(&vectorA,x1,x2);
}
This code will print (demo): 4.6,6.4,4 , so the value of two float were assigned to two units (being the number of units sizeof(Vector4)/sizeof(float)) of a Vector4 type variable.

Using Custom Vectors in C

I'm fairly new to C/C++, but I'm trying to debug some code. It uses a vector that someone had called CART8 and is structured as such:
typedef struct crt8 {
double x;
double y;
double z; } CART8;
Now my question is this. How do I create and populate an instance of an vector of type CART8 called vector1? I've read through a lot of material, and even found a site that indicate how you would create the vector...as indicated above, but no information on HOW to actually use it.
typedef is used extensively in C to refer to struct variables without specifying the struct prefix, for example if I had:
struct vector {
double x;
double y;
double z;
};
than to initialize it I'd have to do:
struct vector vector1;
vector1.x = 1.11;
vector1.y = 1.22;
vector1.z = 1.33;
But if I used a typedef in the declaration:
typedef struct vector {
double x;
double y;
double z;
} vector_type;
than I could simplify this initialization like so (note the struct prefix is not needed now):
vector_type vector1;
vector1.x = 1.11;
vector1.y = 1.22;
vector1.z = 1.33;
Of course, I could still use the full struct vector initialization in this case as well
So in your case:
#include <stdio.h>
typedef struct crt8 {
double x;
double y;
double z;
} CART8;
int main(int argc, char** argv)
{
CART8 vector1;
vector1.x = 2.526;
vector1.y = 3.416;
vector1.z = 4.32;
printf("%f %f %f\n", vector1.x, vector1.y, vector1.z);
}
Alternatively, you can always resort to the original struct definition:
#include <stdio.h>
typedef struct crt8 {
double x;
double y;
double z;
} CART8;
int main(int argc, char** argv)
{
struct crt8 x;
x.x = 2.341;
x.y = 3.43;
x.z = 4.521;
printf("%f %f %f\n", x.x, x.y, x.z);
}
You wrote:
typedef struct crt8 {
double x;
double y;
double z;
} CART8;
This defined a new 'type'. The 'typename' is struct crt8 or the alias you defined CART8. This is how you instantiate an object from that type in C:
struct crt8 myVector;
or you can use the alias 'CART8' that you've defined:
CART8 myVector;
Either way, this is how you populate the 'members' of your object:
CART8 x; // Creation of object
x.x = 100;
x.y = 101;
x.z = 102;
Here is a demonstrative program that shows various ways how objects of the structure can be created, initialized, and used.
#include <stdio.h>
#include <math.h>
typedef struct crt8 {
double x;
double y;
double z; } CART8;
int main( void )
{
CART8 vector1 = { 1.1, 2.2, 3.3 };
CART8 vector2 = { .x = 1.1, .y = 2.2, .z = 3.3 };
CART8 vector3;
vector3.x = 1.1;
vector3.y = 2.2;
vector3.z = 3.3;
CART8 vector4 = vector1;
CART8 vector5 = { vector1.x + vector2.z, vector1.y + vector2.y, vector1.z + vector2.x };
printf( "vector5 = { %lf, %lf, %lf }\n", vector5.x, vector5.y, vector5.z );
printf( "Magnitude = %lf", sqrt( pow( vector1.x, 2 ) + pow( vector1.y, 2 ) + pow( vector1.z, 2 ) ) );
return 0;
}
The output is
vector5 = { 4.400000, 4.400000, 4.400000 }
Magnitude = 4.115823

Copy Array of pointers inside a struct using CUDA

I wish to copy an array of pointers from one struct to another. The Struct looks like this:
typedef struct COORD3D
{
int x,y,z;
}
COORD3D;
typedef struct structName
{
double *volume;
COORD3D size;
// .. some other vars
}
structName;
I wish to do this inside a function where I pass in the address of an empty instance of the struct and the address of the struct with the data I wish to copy. Currently I do this serially via:
void foo(structName *dest, structName *source)
{
// .. some other work
int size = source->size.x * source->size.y * source->size.z;
dest->volume = (double*)malloc(size*sizeof(double));
int i;
for(i=0;i<size;i++)
dest->volume[i] = source->volume[i];
}
I want to do this in CUDA to speed up the process (as the array is very large [~12 million elements].
I have tried the following however, although the code compiles and runs, I get incorrect results stored in the array (seems to be very large random numbers)
void foo(structName *dest, structName *source)
{
// .. some other work
int size = source->size.x * source->size.y * source->size.z;
dest->volume = (double*)malloc(size*sizeof(double));
// Device Pointers
double *DEVICE_SOURCE, *DEVICE_DEST;
// Declare memory on GPU
cudaMalloc(&DEVICE_DEST,size);
cudaMalloc(&DEVICE_SOURCE,size);
// Copy Source to GPU
cudaMemcpy(DEVICE_SOURCE,source->volume,size,
cudaMemcpyHostToDevice);
// Setup Blocks/Grids
dim3 dimGrid(ceil(source->size.x/10.0),
ceil(source->size.y/10.0),
ceil(source->size.z/10.0));
dim3 dimBlock(10,10,10);
// Run CUDA Kernel
copyVol<<<dimGrid,dimBlock>>> (DEVICE_SOURCE,
DEVICE_DEST,
source->size.x,
source->size.y,
source->size.z);
// Copy Constructed Array back to Host
cudaMemcpy(dest->volume,DEVICE_DEST,size,
cudaMemcpyDeviceToHost);
}
The Kernel looks like this:
__global__ void copyVol(double *source, double *dest,
int x, int y, int z)
{
int posX = blockIdx.x * blockDim.x + threadIdx.x;
int posY = blockIdx.y * blockDim.y + threadIdx.y;
int posZ = blockIdx.z * blockDim.z + threadIdx.z;
if (posX < x && posY < y && posZ < z)
{
dest[posX+(posY*x)+(posZ*y*x)] =
source[posX+(posY*x)+(posZ*y*x)];
}
}
Can anyone tell me where I am going wrong?
I am risking a wrong answer, but have you left out the size of the data type?
cudaMalloc(&DEVICE_DEST,size);
should be
cudaMalloc(&DEVICE_DEST,size*sizeof(double));
Also
cudaMemcpy(DEVICE_SOURCE,source->volume,size, cudaMemcpyHostToDevice);
should be
cudaMemcpy(DEVICE_SOURCE,source->volume,size*sizeof(double), cudaMemcpyHostToDevice);
and so on.

C Typedef Struct / Union auto-cast

Im having a small math library for 3d vector and Im trying to "unify" it.
Instead of having multiple typedef struct for vector3f, vector3i, color3, angles etc... Im trying to put everything inside the same struct like this:
typedef struct
{
union
{
float x;
float r;
float ax;
int x_int;
};
union
{
float y;
float g;
float ay;
int y_int;
};
union
{
float z;
float b;
float az;
int z_int;
};
} vec3;
Everything works peachy as long as the type is float, however when it falls to int Im having some strange values (which is understandable). My question is: Is there a way to cast directly/automatically inside the structure definition or I have to create extra functions to typecast between float and int?
Due to the answers below, maybe I should modify my original question to the following:
What is the best way to "unify" (and by unify I mean have like 1 struct) to be able to handle at the same time the following:
vector3f (float x,y,z)
vector3i (int x,y,z)
RGB (float r,g,b)
RGB (unsigned char r,g,b)
euler angle (ax, ay, az)
Thanks in advance!
If you mean that you want to put '360.0f' into float z of a union and have int z_int == 3, or vice versa, you can't. That is not the purpose of a union, and the binary representation of 3 (an integer) and 3.0 (a floating point value) are dissimiliar.
However, you could just remove the int and cast one of the floats to an int.
#include <stdio.h>
#include <stdlib.h>
typedef struct genericStruct
{
void *valueOne;
void *valueTwo;
}GS;
int main()
{
GS *gs = malloc(sizeof(*gs));
int valueInt = 10;
float valueFloat = 3.141592653589;
int *inputIntPtr = (int*)malloc(sizeof(int));
float *inputFloatPtr = (float*)malloc(sizeof(float));
void *voidPtr = NULL;
*inputIntPtr = valueInt;
*inputFloatPtr = valueFloat;
voidPtr = inputIntPtr;
gs->valueOne = voidPtr;
int *outputIntPtr = (int*)malloc(sizeof(int));
outputIntPtr = gs->valueOne;
printf("Input ptr = %d\n", *inputIntPtr);
printf("Output ptr = %d\n", *outputIntPtr);
voidPtr = inputFloatPtr;
gs->valueTwo = voidPtr;
float *outputFloatPtr = (float*)malloc(sizeof(float));
outputFloatPtr = gs->valueTwo;
printf("Input ptr = %f\n", *inputFloatPtr);
printf("output ptr = %f\n", *outputFloatPtr);
free(gs);
free(inputIntPtr);
free(inputFloatPtr);
free(outputIntPtr);
free(outputFloatPtr);
return 0;
}
And this what I meant by using void types.
This is a small piece of code that i wrote for you.It should do the job.I hope i was able to do what you asked for...
typedef struct{
void *ptr1;
void *ptr2;
void *ptr3;
}VEC;
main(){
VEC v ;
VEC *ptr;
int a = 5;
double b = 6;
float c = 7;
v.ptr1 = NULL;
v.ptr2 = NULL;
v.ptr3 = NULL;
ptr = &v;
v.ptr1 = (int *)&a;
ptr->ptr1 = (int *)&a;
v.ptr2 = (double *)&b;
ptr->ptr2 = (double *)&b;
v.ptr3 = (float *)&c;
ptr->ptr3 = (float *)&c;
printf("%d\n",*(int *)v.ptr1);
printf("%d\n",*(int *)(ptr->ptr1));
printf("%lf\n",*(double *)v.ptr2);
printf("%lf\n",*(double *)(ptr->ptr2));
printf("%f\n",*(float *)v.ptr3);
printf("%f\n",*(float *)(ptr->ptr3));
}
Or change all variables to void pointer type and then cast them to float or integer. Is it OK?

How do I parallelize this triple loop in an efficient way?

I'm trying to parallelize a function which takes as input three arrays (x, y, and prb) and one scalar, and outputs three arrays (P1, Pt1, and Px).
The original c code is here (the outlier and E are inconsequential):
#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#define max(A, B) ((A) > (B) ? (A) : (B))
#define min(A, B) ((A) < (B) ? (A) : (B))
void cpd_comp(
double* x,
double* y,
double* prb,
double* sigma2,
double* outlier,
double* P1,
double* Pt1,
double* Px,
double* E,
int N,
int M,
int D
)
{
int n, m, d;
double ksig, diff, razn, outlier_tmp, sp;
double *P, *temp_x;
P = (double*) calloc(M, sizeof(double));
temp_x = (double*) calloc(D, sizeof(double));
ksig = -2.0 * *sigma2;
for (n=0; n < N; n++) {
sp=0;
for (m=0; m < M; m++) {
razn=0;
for (d=0; d < D; d++) {
diff=*(x+n+d*N)-*(y+m+d*M); diff=diff*diff;
razn+=diff;
}
*(P+m)=exp(razn/ksig) ;
sp+=*(P+m);
}
*(Pt1+n)=*(prb+n);
for (d=0; d < D; d++) {
*(temp_x+d)=*(x+n+d*N)/ sp;
}
for (m=0; m < M; m++) {
*(P1+m)+=((*(P+m)/ sp) **(prb+n));
for (d=0; d < D; d++) {
*(Px+m+d*M)+= (*(temp_x+d)**(P+m)**(prb+n));
}
}
*E += -log(sp);
}
*E +=D*N*log(*sigma2)/2;
free((void*)P);
free((void*)temp_x);
return;
}
Here is my attempt at parallelizing it:
#include <cuda.h>
#include <cuda_runtime.h>
#include <device_launch_parameters.h>
#include <thrust/device_ptr.h>
#include <thrust/reduce.h>
/*headers*/
void cpd_comp(
float * x, //Points to register [N*D]
float * y, //Points to be registered [M*D]
float * prb, //Vector of probabilities [N]
float * sigma2, //Square of sigma
float ** P1, //P1, output, [M]
float ** Pt1, //Pt1, output, [N]
float ** Px, //Px, output, [M*3]
int N, //Number of points, i.e. rows, in x
int M //Number of points, i.e. rows, in
);
__global__ void d_computeP(
float * P,
float * P1,
float * Px,
float * ProbabilityMatrix,
float * x,
float * y,
float * prb,
float ksig,
const int N,
const int M);
__global__ void d_sumP(
float * sp,
float * P1timessp,
float * Pxtimessp,
float * P1,
float * Px,
const int N,
const int M);
/*implementations*/
void cpd_comp(
float * x, //Points to register [N*D]
float * y, //Points to be registered [M*D]
float * prb, //Vector of probabilities [N]
float * sigma2, //Scalar
float ** P1, //P1, output, [M]
float ** Pt1, //Pt1, output, [N]
float ** Px, //Px, output, [M*3]
int N, //Number of points, i.e. rows, in x
int M //Number of points, i.e. rows, in y
){
//X is generatedPointPos
//Y is points
float
*P,
*P1timessp,
*Pxtimessp,
ksig = -2.0 * (*sigma2),
*h_sumofP = new float[N], //sum of P, on host
*d_sumofP; //sum of P, on device
cudaMalloc((void**)&P, sizeof(float)*M*N);
cudaMalloc((void**)&P1timessp,sizeof(float)*M*N);
cudaMalloc((void**)&Pxtimessp,sizeof(float)*M*N*3);
cudaMalloc((void**)&d_sumofP, sizeof(float)*N);
cudaMalloc((void**)P1, sizeof(float)*M);
cudaMalloc((void**)Px, sizeof(float)*M*3);
cudaMalloc((void**)Pt1, sizeof(float)*N);
d_computeP<<<dim3(N,M/1024+1),M>1024?1024:M>>>(P,P1timessp,Pxtimessp,NULL,x,y,prb,ksig,N,M);
for(int n=0; n<N; n++){
thrust::device_ptr<float>dev_ptr(P);
h_sumofP[n] = thrust::reduce(dev_ptr+M*n,dev_ptr+M*(n+1),0.0f,thrust::plus<float>());
}
cudaMemcpy(d_sumofP,h_sumofP,sizeof(float)*N,cudaMemcpyHostToDevice);
d_sumP<<<M/1024+1,M>1024?1024:M>>>(d_sumofP,P1timessp,Pxtimessp,*P1,*Px,N,M);
cudaMemcpy(*Pt1,prb,sizeof(float)*N,cudaMemcpyDeviceToDevice);
cudaFree(P);
cudaFree(P1timessp);
cudaFree(Pxtimessp);
cudaFree(d_sumofP);
delete[]h_sumofP;
}
/*kernels*/
__global__ void d_computeP(
float * P,
float * P1,
float * Px,
float * ProbabilityMatrix,
float * x,
float * y,
float * prb,
float ksig,
const int N,
const int M){
//thread configuration: <<<dim3(N,M/1024+1),1024>>>
int m = threadIdx.x+blockIdx.y*blockDim.x;
int n = blockIdx.x;
if(m>=M || n>=N) return;
float
x1 = x[3*n],
x2 = x[3*n+1],
x3 = x[3*n+2],
diff1 = x1 - y[3*m],
diff2 = x2 - y[3*m+1],
diff3 = x3 - y[3*m+2],
razn = diff1*diff1+diff2*diff2+diff3*diff3,
Pm = __expf(razn/ksig), //fast exponentiation
prbn = prb[n];
P[M*n+m] = Pm;
__syncthreads();
P1[N*m+n] = Pm*prbn;
Px[3*(N*m+n)+0] = x1*Pm*prbn;
Px[3*(N*m+n)+1] = x2*Pm*prbn;
Px[3*(N*m+n)+2] = x3*Pm*prbn;
}
__global__ void d_sumP(
float * sp,
float * P1timessp,
float * Pxtimessp,
float * P1,
float * Px,
const int N,
const int M){
//computes P1 and Px
//thread configuration: <<<M/1024+1,1024>>>
int m = threadIdx.x+blockIdx.x*blockDim.x;
if(m>=M) return;
float
P1m = 0,
Pxm1 = 0,
Pxm2 = 0,
Pxm3 = 0;
for(int n=0; n<N; n++){
float spn = 1/sp[n];
P1m += P1timessp[N*m+n]*spn;
Pxm1 += Pxtimessp[3*(N*m+n)+0]*spn;
Pxm2 += Pxtimessp[3*(N*m+n)+1]*spn;
Pxm3 += Pxtimessp[3*(N*m+n)+2]*spn;
}
P1[m] = P1m;
Px[3*m+0] = Pxm1;
Px[3*m+1] = Pxm2;
Px[3*m+2] = Pxm3;
}
However, to my horror, it runs much, much slower than the original version. How do I make it run faster? Please explain things thoroughly since I am very new to CUDA and parallel programming and have no experience in algorithms.
Do note that the c version has column-major ordering and the CUDA version has row-major. I have done several tests to make sure that the result is correct. It's just extremely slow and takes up a LOT of memory.
Any help is greatly appreciated!
EDIT: More information: N and M are on the order of a few thousand (say, 300-3000) and D is always 3. The CUDA version expects arrays to be device memory, except for variables prefixed with h_.
Before trying any CUDA-specific optimizations, profile your code to see where time is being spent.
Try and arrange your array reads/writes so that each CUDA thread uses a strided access pattern. For example, currently you have
int m = threadIdx.x+blockIdx.y*blockDim.x;
int n = blockIdx.x;
if(m>=M || n>=N) return;
diff1 = x1 - y[3*m],
diff2 = x2 - y[3*m+1],
diff3 = x3 - y[3*m+2],
So thread 1 will read from y[0],y[1],y[2] etc. Instead, rearrange your data so that thread 1 reads from y[0],y[M],y[2*M] and thread 2 reads from y[1],y[M+1],y[2*M+1] etc. You should follow this access pattern for other arrays.
Also, you may want to consider whether you can avoid the use of __syncthreads(). I don't quite follow why it's necessary in this algorithm, it might be worth removing it to see if it improves performance ( even if it produces incorrect results ).
The key to good CUDA performance is almost always to make as near to optimal memory access as possible. Your memory access pattern looks very similar to matrix multiplication. I would start with a good CUDA matrix multiplication implementation, being sure to understand why it's implemented the way it is, and then modify that to suit your needs.

Resources