Below is my kernel. When I multiply (or otherwise combine) two values defined with #define and assign the result to a kernel argument, I get error status -48 (invalid kernel). Is it not possible to multiply these, or am I doing something else wrong?
#define cl_sizeX 1024;
#define pi 3.1415926535897;
#define N 1024;
#define M 1024;
#define lambda 632e-9;//632e-9;
#define X 12.1e-6;
__kernel void helloworld(__global char* in, __global char* out)
{
    int num = get_global_id(0);
    out[num] = in[num] + 1;
}
__kernel void multiply_arrays(__global int* first, __global int* second, __global int* out_array)
{
    int num = get_global_id(0);
    out_array[num] = first[num] * second[num];
}
__kernel void create_library(__global float* z0){
    //Variable definitions
    int a = get_global_id(0);
    int i1 = get_global_id(1);
    int i2 = get_global_id(2);
    //z0[a] = ((N*pow(X, 2)) / lambda) + (a - 1)*((N*pow(X, 2)) / (100 * lambda));
    z0[a] = N*X; // This is where I get the error
}
When I assign z0[a] = N; I don't get an error, and I couldn't figure out why.
I use Windows 8.1 and Visual Studio 13 for coding.
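If you remove the ; after each of the #define statements, the kernel will compile. The preprocessor pastes the macro text in verbatim, semicolon included, so z0[a] = N*X; expands to z0[a] = 1024;*12.1e-6;; which is a syntax error, whereas z0[a] = N; expands to z0[a] = 1024;; which is still valid.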
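The effect of the stray semicolon can be reproduced in ordinary C, since the OpenCL C preprocessor substitutes macros in the same textual way. The following is a minimal host-side sketch, not the asker's kernel: fill_z0 is a hypothetical stand-in for the kernel body, and the f suffix on X is an assumption made here to keep the literal single precision.

```c
#include <assert.h>
#include <stddef.h>

/* Object-like macros are pasted into the code verbatim, so they must not
   end in a semicolon. */
#define N 1024
#define X 12.1e-6f

/* Hypothetical host-side stand-in for the kernel line z0[a] = N*X. */
void fill_z0(float *z0, size_t count)
{
    assert(z0 != NULL);
    for (size_t a = 0; a < count; ++a) {
        /* Expands to 1024 * 12.1e-6f. With "#define N 1024;" the same
           line would expand to "1024; * 12.1e-6f;", which does not parse. */
        z0[a] = N * X;
    }
}
```

Changing either macro back to the form with a trailing semicolon makes this file fail to compile at the assignment, which mirrors the build failure in the kernel.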
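You are also assigning a double to a float: a literal such as 12.1e-6 without an f suffix is a double, which could raise an error in the compiler on a device without double-precision support.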
Use clGetProgramBuildInfo with CL_PROGRAM_BUILD_LOG to retrieve the compiler output from clBuildProgram; the build log will give you a much better idea of the problem.
I want to use Eigen for a matrix multiplication in which the input matrices are int16_t and the output matrix is int32_t, but this causes a compiler error:
/Eigen/src/Core/AssignEvaluator.h:834:3: Static_assert failed due to requirement 'Eigen::internal::has_ReturnType::Scalar, assign_op > >::value' "YOU_MIXED_DIFFERENT_NUMERIC_TYPES__YOU_NEED_TO_USE_THE_CAST_METHOD_OF_MATRIXBASE_TO_CAST_NUMERIC_TYPES_EXPLICITLY"
Below is the test code:
#include <iostream>
#include <cstdint>
#include <Eigen/Dense>
typedef Eigen::Matrix<int16_t, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor> MatX16;
typedef Eigen::Matrix<int32_t, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor> MatX32;
#define MAP_MATRIX(name, ptr, M, N) Eigen::Map<MatX32> name(ptr, M, N)
#define MAP_CONST_MATRIX(name, ptr, M, N) Eigen::Map<const MatX16> name(ptr, M, N)
int main(int argc, const char * argv[]) {
    int M, N, K;
    M = 10; N = 10; K = 10;
    // eigen int16xint16 = int32
    int16_t lhs[100] = {1};
    int16_t rhs[100] = {2};
    int32_t res[100] = {0};
    MAP_CONST_MATRIX(eA, lhs, M, K);
    MAP_CONST_MATRIX(eB, rhs, K, N);
    MAP_MATRIX(eC, res, M, N);
    eC = eA * eB; // this line triggers the static assertion
    return 0;
}
The product of two int16 matrices will be an int16 again. You can cast the result to int32:
eC = (eA * eB).cast<int32_t>();
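However, what you probably want is to cast the original factors to int32, so that the products are computed and accumulated in 32 bits instead of being cast only after a 16-bit product. Additionally, you can tell Eigen that eC will not alias with either eA or eB: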
eC.noalias() = eA.cast<int32_t>() * eB.cast<int32_t>();
Note that casting is not vectorized (yet), so you probably get sub-optimal code with this. Your compiler might be smart enough to partially auto-vectorize the product, though.
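To see why it matters whether the factors or the result are widened, here is a plain C sketch of the same idea. It is not Eigen code, and the function name matmul_i16_i32 is made up for illustration; it simply widens each int16_t operand to int32_t before multiplying, which is the analogue of casting the factors.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* C = A * B for row-major matrices: A is m x k, B is k x n, C is m x n.
   Each int16_t operand is widened to int32_t before the multiply, so the
   products and the running sum are held in 32 bits. */
void matmul_i16_i32(const int16_t *a, const int16_t *b, int32_t *c,
                    size_t m, size_t k, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < m; ++i) {
        for (size_t j = 0; j < n; ++j) {
            int32_t acc = 0;
            for (size_t p = 0; p < k; ++p) {
                acc += (int32_t)a[i * k + p] * (int32_t)b[p * n + j];
            }
            c[i * n + j] = acc;
        }
    }
}
```

For example, a 1x2 row {300, 300} times a 2x1 column {300, 300} gives 180000, a value that does not fit in an int16_t, so storing the product in 16 bits first and casting afterwards would lose it.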
I wish to copy an array that is held through a pointer from one struct to another. The struct looks like this:
typedef struct COORD3D
{
    int x,y,z;
} COORD3D;
typedef struct structName
{
    double *volume;
    COORD3D size;
    // .. some other vars
} structName;
I wish to do this inside a function where I pass in the address of an empty instance of the struct and the address of the struct with the data I wish to copy. Currently I do this serially via:
void foo(structName *dest, structName *source)
{
    // .. some other work
    int size = source->size.x * source->size.y * source->size.z;
    dest->volume = (double*)malloc(size*sizeof(double));
    int i;
    for (i = 0; i < size; i++)
        dest->volume[i] = source->volume[i];
}
I want to do this in CUDA to speed up the process, as the array is very large (~12 million elements).
I have tried the following. The code compiles and runs, but I get incorrect results stored in the array (they seem to be very large random numbers):
void foo(structName *dest, structName *source)
// .. some other work
int size = source->size.x * source->size.y * source->size.z;
dest->volume = (double*)malloc(size*sizeof(double));
// Device Pointers
// Declare memory on GPU
// Copy Source to GPU
// Setup Blocks/Grids
dim3 dimGrid(ceil(source->size.x/10.0),
dim3 dimBlock(10,10,10);
// Run CUDA Kernel
copyVol<<<dimGrid,dimBlock>>> (DEVICE_SOURCE,
// Copy Constructed Array back to Host
The Kernel looks like this:
__global__ void copyVol(double *source, double *dest,
int x, int y, int z)
int posX = blockIdx.x * blockDim.x + threadIdx.x;
int posY = blockIdx.y * blockDim.y + threadIdx.y;
int posZ = blockIdx.z * blockDim.z + threadIdx.z;
if (posX < x && posY < y && posZ < z)
dest[posX+(posY*x)+(posZ*y*x)] =
Can anyone tell me where I am going wrong?
I am risking a wrong answer, but have you left out the size of the data type? cudaMemcpy takes a size in bytes, not a number of elements, so
cudaMemcpy(DEVICE_SOURCE,source->volume,size, cudaMemcpyHostToDevice);
should be
cudaMemcpy(DEVICE_SOURCE,source->volume,size*sizeof(double), cudaMemcpyHostToDevice);
and so on for every other allocation and copy that takes a size.
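The same bytes-versus-elements distinction applies to the standard C memcpy, which makes it easy to check on the host. The sketch below is only an illustration under that analogy; copy_volume is a hypothetical helper, not part of the asker's code.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Returns a newly allocated copy of `count` doubles, or NULL if the
   allocation fails. The caller is responsible for freeing the result. */
double *copy_volume(const double *source, size_t count)
{
    assert(source != NULL);
    /* The size passed to malloc and memcpy is in bytes, not elements. */
    size_t bytes = count * sizeof(double);
    double *dest = malloc(bytes);
    if (dest != NULL) {
        memcpy(dest, source, bytes);
    }
    return dest;
}
```

Passing count instead of bytes to memcpy would copy only one eighth of the data when a double is 8 bytes, leaving the rest of the destination uninitialized, which is consistent with the "very large random numbers" symptom described in the question.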
I am trying to create a 3D grid for my OpenCL/GL fluid simulation. The problem I'm having is that, for some reason, my grid initialization function does not work properly. Here is my *.h and *.c setup and, at the end, the call in main:
#if RunGPU
#define make_float3(x,y,z) (float3)(x,y,z)
#define make_int3(i,j,k) (int3)(i,j,k)
#else
typedef struct i3{
    int i,j,k;
} int3;
typedef struct f3{
    float x,y,z;
} float3;
#define __global
#define make_float3(x,y,z) {x , y , z}
#define make_int3(x,y,z) {x , y ,z}
#endif
typedef struct grid3 * grid3_t; // u,v,w
typedef struct grid * grid_t; // p
struct grid3 {
    __global float3* values_;
    __global float * H_;
    __global float * h_;
    int dimx_;
    int dimy_;
    int dimz_;
} ;
struct grid {
    __global float * values_;
    int dimx_;
    int dimy_;
    int dimz_;
} ;
void grid3_init(grid3_t grid,__global float3* vel,__global float* H,__global float *h, int X, int Y, int Z);
void grid3_init(grid3_t grid,__global float3* val,__global float* H,__global float *h, int X, int Y, int Z){
    grid->values_ = val;
    grid->H_ = H;
    grid->h_ = h;
    grid->dimx_ = X;
    grid->dimy_ = Y;
    grid->dimz_ = Z;
}
In main I'm initializing my grid like so:
int main(int argc, char** argv)
{
    const int size3d = Bx*(By+2)*Bz;
    const int size2d = Bx*Bz;
    float3 * velocities = (float3*)malloc(size3d*sizeof(float3));
    float * H = (float*)malloc(size2d*sizeof(float));
    float * h = (float*)malloc(size2d*sizeof(float));
    for(int i = 0; i < size3d; i++){
        float3 tmp = make_float3(0.f,0.f,0.f);
        velocities[i] = tmp;
        if(i < size2d){
            H[i] = 1;
            h[i] = 2;
        }
    }
    grid3_t theGrid;
    grid3_init(theGrid, velocities, H, h, Bx, By, Bz); // <- ERROR OCCURS HERE
}
The error I'm getting occurs at run time: "Run-Time Check Failure #3 - The variable 'theGrid' is being used without being initialized". But isn't that precisely the job of grid3_init?
As I'm trying to write code that works on both the host and the GPU, I have to give up classes and work strictly with structs, which I have less experience with.
At this point I don't really know what to google either. I appreciate any help I can get.
struct grid3 theGrid;
grid3_init(&theGrid, velocities, H, h, Bx, By, Bz);
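You need to create a grid3 instance and pass its address to grid3_init. Your existing code declares only a pointer (grid3_t is a typedef for struct grid3 *) and never points it at any storage, so grid3_init writes through an uninitialized pointer.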
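The pattern can be sketched in isolation with a cut-down struct. The names demo_grid and demo_grid_init below are hypothetical and exist only for this illustration; the point is that the caller owns the storage and the init function only fills it in.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified, host-only stand-in for the grid struct in the question. */
struct demo_grid {
    float *values_;
    int dimx_;
    int dimy_;
    int dimz_;
};

/* Fills in a struct that the caller has already allocated. It cannot
   create the struct itself, because it only receives a pointer by value. */
void demo_grid_init(struct demo_grid *grid, float *values, int x, int y, int z)
{
    assert(grid != NULL);
    grid->values_ = values;
    grid->dimx_ = x;
    grid->dimy_ = y;
    grid->dimz_ = z;
}
```

A caller would declare struct demo_grid g; on the stack (or allocate one with malloc) and then call demo_grid_init(&g, values, 2, 2, 2);. Declaring only struct demo_grid *g; and passing g reproduces the problem from the question.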
I have written a program in C and am trying to port it to CUDA.
The program outputs files of numbers for plotting a graph.
With CUDA I can get the program to output the files, but the calculations have not been done.
Here is the code with the algorithms:
__device__ void nextState(int i, darray oldv, darray newv, darray w, int t){
    double dv;
    dv = -8*oldv[i]*(oldv[i]-0.1)*(oldv[i]-1) - oldv[i]*w[i];
    /* Stimulate in leftmost region */
    if ((t >=10) && (t<=15) && (i < 4))
        dv += 2;
    /* diffusion */
    newv[i] = oldv[i] + 0.1 *dv +
              0.1 *1.0*(oldv[i-1]-2*oldv[i]+oldv[i+1])/(1.0*1.0);
    w[i] = w[i] + 0.1 *eps(oldv[i],w[i]);
}
__device__ double eps(double u, double v)
{
    return (0.002 + (0.2*v)/(u+0.3));
}
__global__ void run_state(darray* oldv, darray* newv, darray* w, int* t)
{
    int i = threadIdx.x;
    nextState(i, *oldv, *newv, *w, *t);
}
Also, N is defined with #define N 256; and the kernel is launched with run_state<<< 1, N>>>(d_oldv, d_newv, d_w, d_t);
so it should output 256 values. It does, but all of them are 0.000...
So I am wondering whether I have made a mistake in any of these functions.
Thanks in advance.
If you are trying to check whether your function works properly, try adding the __host__ qualifier so that the same function can also be called on the host; then you can test and debug it locally. If it works on the host but not on the device, you are probably doing something wrong when copying the data from host to device and back.
Declare your function like this:
__host__ __device__ void nextState(int i, darray oldv, darray newv, darray w, int t)
This is a bit of a two-parter. I'm trying to do all of this in C. First, here is my program:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>
#include <string.h>
double f(double x);
void Trap(double a, double b, int n, double* integral_p);
int main(int argc, char* argv[]) {
    double integral=0.0; //Integral Result
    double a=6, b=10; //Left and Right Points
    int n; //Number of Trapezoids (Higher=more accurate)
    int degree;
    if (argc != 3) {
        printf("Error: Invalid Command Line arguments, format:./trapezoid N filename");
        return 1;
    }
    n = atoi(argv[2]);
    FILE *fp = fopen( argv[1], "r" );
#   pragma omp parallel
    Trap(a, b, n, &integral);
    printf("With n = %d trapezoids....\n", n);
    printf("of the integral from %f to %f = %.15e\n",a, b, integral);
    return 0;
}
double f(double x) {
    double return_val;
    return_val = pow(3.0*x,5)+pow(2.5*x,4)+pow(-1.5*x,3)+pow(0*x,2)+pow(1.7*x,1)+4;
    return return_val;
}
void Trap(double a, double b, int n, double* integral_p) {
    double h, x, my_integral;
    double local_a, local_b;
    int i, local_n;
    int my_rank = omp_get_thread_num();
    int thread_count = omp_get_num_threads();
    h = (b-a)/n;
    local_n = n/thread_count;
    local_a = a + my_rank*local_n*h;
    local_b = local_a + local_n*h;
    my_integral = (f(local_a) + f(local_b))/2.0;
    for (i = 1; i <= local_n-1; i++) {
        x = local_a + i*h;
        my_integral += f(x);
    }
    my_integral = my_integral*h;
#   pragma omp critical
    *integral_p += my_integral;
}
As you can see, it applies the trapezoidal rule over a given interval.
First of all, it DOES work if you hardcode the values and the function. But I need to read them from a file in the following format:
3.0 2.5 -1.5 0.0 1.7 4.0
6 10
Which means:
It is of degree 5 (no more than 50 ever)
3.0x^5 +2.5x^4 −1.5x^3 +1.7x+4 is the polynomial (we skip ^2 since it's 0)
and the Interval is from 6 to 10
My main concern is the f(x) function, which I have hardcoded. I have no idea how to make it handle up to degree 50 other than literally typing out 50 pow calls and reading in the values to see what they could be. Does anyone have any ideas?
Also, what would be the best way to read in the file? fgetc? I'm not really sure how to read input in C (especially since everything I read in is an int; is there some way to convert it?).
For a large degree polynomial, would something like this work?
double f(double x, double coeff[], int nCoeff)
{
    double return_val = 0.0;
    int exponent = nCoeff-1;
    int i;
    for(i=0; i<nCoeff-1; ++i, --exponent)
        /* the coefficient scales the power: c*x^e, not (c*x)^e */
        return_val = coeff[i]*pow(x, exponent) + return_val;
    /* add on the final constant, 4, in our example */
    return return_val + coeff[nCoeff-1];
}
In your example, you would call it like:
double coefficients[] = {3.0, 2.5, -1.5, 0, 1.7, 4};
/* This expresses 3x^5 + 2.5x^4 - 1.5x^3 + 0x^2 + 1.7x + 4 */
my_integral = f(x, coefficients, 6);
By passing an array of coefficients (the exponents are assumed), you don't have to deal with variadic arguments. The hardest part is constructing the array, and that is pretty simple.
It should go without saying, if you put the coefficients array and number-of-coefficients into global variables, then the signature of f(x) doesn't need to change:
double f(double x)
{
    // access glbl_coeff and glbl_NumOfCoeffs, instead of parameters
}
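As a variant of the coefficient-array idea, the polynomial can also be evaluated with Horner's rule, which avoids calling pow altogether. This is a sketch rather than a drop-in replacement: poly_eval is a name chosen here, and it assumes the coefficients are stored highest degree first with the constant term last, as in the example array above.

```c
#include <assert.h>
#include <stddef.h>

/* Evaluates coeff[0]*x^(n-1) + coeff[1]*x^(n-2) + ... + coeff[n-1]
   using Horner's rule: one multiply and one add per coefficient. */
double poly_eval(double x, const double *coeff, size_t n_coeff)
{
    assert(coeff != NULL && n_coeff > 0);
    double result = coeff[0];
    for (size_t i = 1; i < n_coeff; ++i) {
        result = result * x + coeff[i];
    }
    return result;
}
```

With the coefficients {3.0, 2.5, -1.5, 0.0, 1.7, 4.0} this evaluates 3x^5 + 2.5x^4 - 1.5x^3 + 1.7x + 4, and the same function handles a degree-50 polynomial by passing 51 coefficients.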
For your f() function, consider making it variadic (varargs is another name for this).
This way you could pass the function one argument telling it how many "pows" you want, with each subsequent argument being a double value. Is this what you are asking for with the f() part of your question?
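A minimal sketch of the variadic approach, using the standard <stdarg.h> macros, might look like the following. The name poly_eval_va and the argument order (x, then the count, then the coefficients from highest degree to the constant term) are assumptions made for this example.

```c
#include <assert.h>
#include <stdarg.h>

/* Evaluates a polynomial whose n_coeff coefficients follow as variadic
   arguments, highest degree first. Every coefficient must be passed as a
   double (write 4.0, not 4), because va_arg reads it back as a double. */
double poly_eval_va(double x, int n_coeff, ...)
{
    assert(n_coeff > 0);
    va_list args;
    va_start(args, n_coeff);
    double result = 0.0;
    for (int i = 0; i < n_coeff; ++i) {
        /* Horner step: shift the running value up one power, add the next term. */
        result = result * x + va_arg(args, double);
    }
    va_end(args);
    return result;
}
```

A call such as poly_eval_va(x, 3, 1.0, 2.0, 3.0) evaluates x^2 + 2x + 3. One limitation is that the arguments have to be written out at the call site, so when the coefficients are read from a file at run time, passing an array is likely the more practical option.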