algorithms in Cuda

I have written a program in C and am trying to port it to CUDA.
The program outputs files with numbers for a graph.
With CUDA I can get the program to output the files, but the calculations have not been done.
Here is the code with the algorithms:
__device__ void nextState(int i, darray oldv, darray newv, darray w, int t){
double dv;
dv = -8*oldv[i]*(oldv[i]-0.1)*(oldv[i]-1) - oldv[i]*w[i];
/* Stimulate in leftmost region */
if ((t >=10) && (t<=15) && (i < 4))
dv += 2;
/* diffusion */
newv[i] = oldv[i] + 0.1 *dv +
0.1 *1.0*(oldv[i-1]-2*oldv[i]+oldv[i+1])/(1.0*1.0);
w[i] = w[i] + 0.1 *eps(oldv[i],w[i])
*(-w[i]-8*oldv[i]*(oldv[i]-0.1-1));
}
__device__ double eps(double u, double v)
{
return (0.002 + (0.2*v)/(u+0.3));
}
__global__ void run_state(darray* oldv, darray* newv, darray* w, int* t)
{
int i = threadIdx.x;
nextState(i, *oldv, *newv, *w, *t);
}
I also have #define N 256; and launch with run_state<<< 1, N>>>(d_oldv, d_newv, d_w, d_t);
so it should output 256 values. It does that, but all of them are 0.000...
So I am wondering if I have made a mistake in any of these functions.
Thanks in advance

If you are trying to check whether your function works properly, add the __host__ qualifier so the same function can also be called on the host; then you can test and debug it locally. If it works on the host but not on the device, you are probably doing something wrong when copying the data from host to device and back.
Declare your function like this:
__host__ __device__ void nextState(int i, darray oldv, darray newv, darray w, int t)

Related

Segmentation Fault 11 in C caused by larger operation numbers

I know that segmentation fault 11 means the program has attempted to access an area of memory that it is not allowed to access.
Here I am trying to calculate a Fourier transform, using the following code.
It works well when nPoints = 2^15 (and with fewer points), but it crashes when I increase the number of points to 2^16. Is that caused by using too much memory? I did not notice much memory use while it ran. Although the code uses recursion, it transforms in place, so I thought it would not need much memory. Where is the problem?
Thanks in advance
PS: One thing I forgot to say is that the result above was on macOS (8 GB memory).
When I run the code on Windows (16 GB memory), it crashes at nPoints = 2^14. That makes me unsure whether memory allocation is the cause, since the Windows PC has more memory (but it is hard to say, because the two operating systems use different memory strategies).
#include <stdio.h>
#include <tgmath.h>
#include <string.h>
// in place FFT with O(n) memory usage
long double PI;
typedef long double complex cplx;
void _fft(cplx buf[], cplx out[], int n, int step)
{
if (step < n) {
_fft(out, buf, n, step * 2);
_fft(out + step, buf + step, n, step * 2);
for (int i = 0; i < n; i += 2 * step) {
cplx t = exp(-I * PI * i / n) * out[i + step];
buf[i / 2] = out[i] + t;
buf[(i + n)/2] = out[i] - t;
}
}
}
void fft(cplx buf[], int n)
{
cplx out[n];
for (int i = 0; i < n; i++) out[i] = buf[i];
_fft(buf, out, n, 1);
}
int main()
{
const int nPoints = pow(2, 15);
PI = atan2(1.0l, 1) * 4;
double tau = 0.1;
double tSpan = 12.5;
long double dt = tSpan / (nPoints-1);
long double T[nPoints];
cplx At[nPoints];
for (int i = 0; i < nPoints; ++i)
{
T[i] = dt * (i - nPoints / 2);
At[i] = exp( - T[i]*T[i] / (2*tau*tau));
}
fft(At, nPoints);
return 0;
}

You cannot allocate very large arrays on the stack. The default stack size on macOS is 8 MiB. The size of your cplx type is 32 bytes, so an array of 2^16 cplx elements is 2 MiB, and you have two of them (one in main and one in fft), so that is 4 MiB. That fits on the stack, and, at that size, the program runs to completion when I try it. At 2^17, it fails, which makes sense because then the program has two arrays taking 8 MiB on the stack. The proper way to allocate such large arrays is to include <stdlib.h> and use cplx *At = malloc(nPoints * sizeof *At); followed by if (!At) { /* Print some error message about being unable to allocate memory and terminate the program. */ }. You should do that for At, T, and out. Also, when you are done with each array, you should free it, as with free(At);.
To calculate an integer power of two, use the integer operation 1 << power, not the floating-point operation pow(2, 16). We have designed pow well on macOS, but, on other systems, it may return approximations even when exact results are possible. An approximate result may be slightly less than the exact integer value, so converting it to an integer truncates to the wrong result. If it may be a power of two larger than suitable for an int, then use (type) 1 << power, where type is a suitably large integer type.

The following instrumented code clearly shows that the OP's code repeatedly updates the same locations in the out[] array and does not update most of the locations in that array.
#include <stdio.h>
#include <tgmath.h>
#include <assert.h>
// in place FFT with O(n) memory usage
#define N_POINTS (1<<15)
double T[N_POINTS];
double At[N_POINTS];
double PI;
// prototypes
void _fft(double buf[], double out[], int step);
void fft( void );
int main( void )
{
PI = 3.14159;
double tau = 0.1;
double tSpan = 12.5;
double dt = tSpan / (N_POINTS-1);
for (int i = 0; i < N_POINTS; ++i)
{
T[i] = dt * (i - (N_POINTS / 2));
At[i] = exp( - T[i]*T[i] / (2*tau*tau));
}
fft();
return 0;
}
void fft()
{
double out[ N_POINTS ];
for (int i = 0; i < N_POINTS; i++)
out[i] = At[i];
_fft(At, out, 1);
}
void _fft(double buf[], double out[], int step)
{
printf( "step: %d\n", step );
if (step < N_POINTS)
{
_fft(out, buf, step * 2);
_fft(out + step, buf + step, step * 2);
for (int i = 0; i < N_POINTS; i += 2 * step)
{
double t = exp(-I * PI * i / N_POINTS) * out[i + step];
buf[i / 2] = out[i] + t;
buf[(i + N_POINTS)/2] = out[i] - t;
printf( "index: %d buf update: %d, %d\n", i, i/2, (i+N_POINTS)/2 );
}
}
}
I suggest running it like this (on Linux, where untitled1 is the name of the executable):
./untitled1 > out.txt
less out.txt
The out.txt file is 8630880 bytes.
An examination of that file shows the lack of coverage and shows that any one entry is NOT the sum of the prior two entries, so I suspect this is not a valid Fourier transform.

Copy Array of pointers inside a struct using CUDA

I wish to copy an array of pointers from one struct to another. The Struct looks like this:
typedef struct COORD3D
{
int x,y,z;
}
COORD3D;
typedef struct structName
{
double *volume;
COORD3D size;
// .. some other vars
}
structName;
I wish to do this inside a function where I pass in the address of an empty instance of the struct and the address of the struct with the data I wish to copy. Currently I do this serially via:
void foo(structName *dest, structName *source)
{
// .. some other work
int size = source->size.x * source->size.y * source->size.z;
dest->volume = (double*)malloc(size*sizeof(double));
int i;
for(i=0;i<size;i++)
dest->volume[i] = source->volume[i];
}
I want to do this in CUDA to speed up the process, as the array is very large (~12 million elements).
I have tried the following. Although the code compiles and runs, I get incorrect results stored in the array (they seem to be very large random numbers).
void foo(structName *dest, structName *source)
{
// .. some other work
int size = source->size.x * source->size.y * source->size.z;
dest->volume = (double*)malloc(size*sizeof(double));
// Device Pointers
double *DEVICE_SOURCE, *DEVICE_DEST;
// Declare memory on GPU
cudaMalloc(&DEVICE_DEST,size);
cudaMalloc(&DEVICE_SOURCE,size);
// Copy Source to GPU
cudaMemcpy(DEVICE_SOURCE,source->volume,size,
cudaMemcpyHostToDevice);
// Setup Blocks/Grids
dim3 dimGrid(ceil(source->size.x/10.0),
ceil(source->size.y/10.0),
ceil(source->size.z/10.0));
dim3 dimBlock(10,10,10);
// Run CUDA Kernel
copyVol<<<dimGrid,dimBlock>>> (DEVICE_SOURCE,
DEVICE_DEST,
source->size.x,
source->size.y,
source->size.z);
// Copy Constructed Array back to Host
cudaMemcpy(dest->volume,DEVICE_DEST,size,
cudaMemcpyDeviceToHost);
}
The Kernel looks like this:
__global__ void copyVol(double *source, double *dest,
int x, int y, int z)
{
int posX = blockIdx.x * blockDim.x + threadIdx.x;
int posY = blockIdx.y * blockDim.y + threadIdx.y;
int posZ = blockIdx.z * blockDim.z + threadIdx.z;
if (posX < x && posY < y && posZ < z)
{
dest[posX+(posY*x)+(posZ*y*x)] =
source[posX+(posY*x)+(posZ*y*x)];
}
}
Can anyone tell me where I am going wrong?

I am risking a wrong answer, but have you left out the size of the data type?
cudaMalloc(&DEVICE_DEST,size);
should be
cudaMalloc(&DEVICE_DEST,size*sizeof(double));
Also
cudaMemcpy(DEVICE_SOURCE,source->volume,size, cudaMemcpyHostToDevice);
should be
cudaMemcpy(DEVICE_SOURCE,source->volume,size*sizeof(double), cudaMemcpyHostToDevice);
and so on.

Multiply two pre-defined values in the kernel

Below is my kernel code. When I multiply (or do other operations with) two values defined with #define and assign the result to a kernel argument, I get error status -48 (invalid kernel). Is it not possible to multiply these, or am I doing something else wrong?
#define cl_sizeX 1024;
#define pi 3.1415926535897;
#define N 1024;
#define M 1024;
#define lambda 632e-9;//632e-9;
#define X 12.1e-6;
__kernel void helloworld(__global char* in, __global char* out)
{
int num = get_global_id(0);
out[num] = in[num] + 1;
}
__kernel void multiply_arrays(__global int* first, __global int* second, __global int* out_array)
{
int num = get_global_id(0);
out_array[num] = first[num] * second[num];
}
__kernel void create_library(__global float* z0){
//Variable definitions
int a = get_global_id(0);
int i1 = get_global_id(1);
int i2 = get_global_id(2);
//z0[a] = ((N*pow(X, 2)) / lambda) + (a - 1)*((N*pow(X, 2)) / (100 * lambda));
z0[a] = N*X; // This is where I get the error
}
When I assign z0[a] = N; I don't get an error, and I couldn't figure out why.
I use Windows 8.1 and Visual Studio 13 for coding.

If you remove the ; after all the #define statements the kernel will compile.

You are assigning a double to a float which could be raising an error in the compiler.
Use clGetProgramBuildInfo with CL_PROGRAM_BUILD_LOG to get the actual clBuildProgram output from the compiler which will give you a better idea of the problem.

In C, initializing an array using a variable led to stack overflow error or caused R to crash when code is called in R

Okay. My original question turned out to be caused by not initializing some arrays. The original issue had to do with code crashing R. When I was trying to debug it by commenting things out, I by mistake commented out the lines that initialized the arrays. So I thought my problem had to do with passing pointers.
The actual problem is this. As I said before, I want to use outer_pos to calculate outer differences and pass both the pointers of the results and the total number of positive differences back to a function that calls outer_pos
#include <R.h>
#include <Rmath.h>
#include <stdio.h>
#include <math.h>
#include <stdlib.h>
void outer_pos(double *x, double *y, int *n, double *output){
int i, j, l=0;
for(i=0; i<*n; i++){
for(j=0; j<*n; j++){
if((x[j]-x[i])>0){
output[l+1]=(y[j]-y[i])/(x[j]-x[i]);
output[0]=(double)(++l);
}
}
}
Rprintf("%d\n", (int)output[0]);
}
void foo1(double *x, double *y, int *nsamp){
int i, j, k, oper=2, l;
double* v1v2=malloc(sizeof(double)*((*nsamp)*(*nsamp-1)/2 + 1));
outer_pos(x, y, nsamp, &v1v2[0]);
double v1v2b[1999000]; // <--------------HERE
for(i=1; i<= (int)v1v2[0]; i++){
v1v2b[i-1]=1;
}
}
Suppose foo1 is the function that calls outer_pos. Here I specified the size of the array v1v2b using an actual number 1999000. This value corresponds to the number of positive differences. Calling foo1 from R causes no problem. It's all fine.
In the scenario above, I know the number of positive differences, so I can use the actual value to set the array size. But I would like to accommodate situations where I don't necessarily know the value. foo2 below is intended to do that. As you can see, v1v2b is declared using the first value of the array v1v2. Recall that the first slot of the output of outer_pos stores the number of positive differences. So basically I use this value to set v1v2b's size. However, calling this function in R causes R to either show a stack overflow error or crash.
void foo2(double *x, double *y, int *nsamp){
int i, j, k, oper=2, l;
double* v1v2=malloc(sizeof(double)*((*nsamp)*(*nsamp-1)/2 + 1));
outer_pos(x, y, nsamp, &v1v2[0]);
double v1v2b[(int)v1v2[0]]; //<--------HERE
for(i=1; i<= (int)v1v2[0]; i++){
v1v2b[i-1]=1;
}
}
So I thought, maybe it has to do with indexation. Maybe the actual size of v1v2b is too small, or something, so the loop iterates outside the bound. So I created foo2b in which I commented out the loop, and use Rprintf to print the first slot of v1v2 to see if the value stored in it is correct. But it seems that the value v1v2[0] is correct, namely 1999000. So I don't know what is happening here.
Sorry for the confusion with my previous question!!
void foo2b(double *x, double *y, int *nsamp){
int i, j, k, oper=2, l;
double* v1v2=malloc(sizeof(double)*((*nsamp)*(*nsamp-1)/2 + 1));
outer_pos(x, y, nsamp, &v1v2[0]);
double v1v2b[(int)v1v2[0]]; //<----Array size declared by a variable
Rprintf("%d", (int)v1v2[0]);
//for(i=1; i<= (int)v1v2[0]; i++){
//v1v2b[i-1]=v1v2[i];
//}
}
R code to run the code above:
x=rnorm(2000)
y=rnorm(2000)
.C("foo1", x=as.double(x), y=as.double(y), nsamp=as.integer(2000))
.C("foo2", x=as.double(x), y=as.double(y), nsamp=as.integer(2000))
.C("foo2b", x=as.double(x), y=as.double(y), nsamp=as.integer(2000))
** FOLLOW UP **
I modified my code based on Martin's suggestion to check if the stack overflow issue can be resolved:
void foo2b(double *x, double *y, int *nsamp) {
int n = *nsamp, i;
double *v1v2, *v1v2b;
v1v2 = (double *) R_alloc(n * (n - 1) / 2 + 1, sizeof(double));
/* outer_pos(x, y, nsamp, v1v2); */
v1v2b = (double *) R_alloc((size_t) v1v2[0], sizeof(int));
for(i=0; i< (int)v1v2[0]; i++){
v1v2b[i]=1;
}
//qsort(v1v2b, (size_t) v1v2[0], sizeof(double), mycompare);
/* ... */
}
After compiling it, I ran the code:
x=rnorm(1000)
y=rnorm(1000)
.C("foo2b", x=as.double(x), y=as.double(y), nsamp=as.integer(length(x)))
And got an error message:
Error: cannot allocate memory block of size 34359738368.0 Gb
** FOLLOW UP 2 **
It seems that the error message shows up every other run of the function. At least it did not crash R...So basically function alternates between running with no problem and showing an error message.
(I included both headers in my script file).

As before, you're allocating on the stack, but should be allocating from the heap. Correct this using malloc / free as you did in your previous question (actually, I think the recommended approach is Calloc / Free or if your code returns to R simply R_alloc; R_alloc automatically recovers the memory when returning to R, even in the case of an error that R catches).
qsort is mentioned in a comment. It takes as its final argument a user-supplied function that defines how its first argument is to be sorted. The signature of qsort (from man qsort) is
void qsort(void *base, size_t nmemb, size_t size,
int(*compar)(const void *, const void *));
with the final argument being 'a pointer to a function that takes two constant void pointers and returns an int'. A function satisfying this signature and sorting pointers to two doubles according to the specification on the man page is
int mycompare(const void *p1, const void *p2)
{
const double d1 = *(const double *) p1,
d2 = *(const double *) p2;
return d1 < d2 ? -1 : (d1 > d2 ? 1 : 0);
}
So
#include <Rdefines.h>
#include <stdlib.h>
int mycompare(const void *p1, const void *p2)
{
const double d1 = *(const double *) p1,
d2 = *(const double *) p2;
return d1 < d2 ? -1 : (d1 > d2 ? 1 : 0);
}
void outer_pos(double *x, double *y, int *n, double *output){
int i, j, l = 0;
for (i = 0; i < *n; i++) {
for (j = 0; j < *n; j++) {
if ((x[j] - x[i]) > 0) {
output[l + 1] = (y[j] - y[i]) / (x[j] - x[i]);
output[0] = (double)(++l);
}
}
}
}
void foo2b(double *x, double *y, int *nsamp) {
int n = *nsamp;
double *v1v2, *v1v2b;
v1v2 = (double *) R_alloc(n * (n - 1) / 2 + 1, sizeof(double));
outer_pos(x, y, nsamp, v1v2);
v1v2b = (double *) R_alloc((size_t) v1v2[0], sizeof(double));
qsort(v1v2b, (size_t) v1v2[0], sizeof(double), mycompare);
/* ... */
}

When foo2b calls outer_pos, it is passing two allocated but uninitialized arrays as x and y. You can't depend on their contents, thus you have different results from different invocations.
Edit
You're dangerously close to your stack size with 1999000 doubles, which take just over 15.25MB, and that's because you're on Mac OS. On most other platforms, threads don't get anywhere near 16M of stack.
You don't start out with a clean (empty) stack when you call this function -- you're deep into R functions, each creating frames that take space on the stack.
Edit 2
Below, you are using an uninitialized value v1v2[0] as an argument to R_alloc. That you get an error sometimes (and not always) is not a surprise.
v1v2 = (double *) R_alloc(n * (n - 1) / 2 + 1, sizeof(double));
/* outer_pos(x, y, nsamp, v1v2); */
v1v2b = (double *) R_alloc((size_t) v1v2[0], sizeof(int));

C File Input/Trapezoid Rule Program

This is a bit of a two-parter, and I'm trying to do it all in C. First, here is my program:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>
#include <string.h>
double f(double x);
void Trap(double a, double b, int n, double* integral_p);
int main(int argc, char* argv[]) {
double integral=0.0; //Integral Result
double a=6, b=10; //Left and Right Points
int n; //Number of Trapezoids (Higher=more accurate)
int degree;
if (argc != 3) {
printf("Error: Invalid Command Line arguements, format:./trapezoid N filename");
exit(0);
}
n = atoi(argv[2]);
FILE *fp = fopen( argv[1], "r" );
# pragma omp parallel
Trap(a, b, n, &integral);
printf("With n = %d trapezoids....\n", n);
printf("of the integral from %f to %f = %.15e\n",a, b, integral);
return 0;
}
double f(double x) {
double return_val;
return_val = pow(3.0*x,5)+pow(2.5*x,4)+pow(-1.5*x,3)+pow(0*x,2)+pow(1.7*x,1)+4;
return return_val;
}
void Trap(double a, double b, int n, double* integral_p) {
double h, x, my_integral;
double local_a, local_b;
int i, local_n;
int my_rank = omp_get_thread_num();
int thread_count = omp_get_num_threads();
h = (b-a)/n;
local_n = n/thread_count;
local_a = a + my_rank*local_n*h;
local_b = local_a + local_n*h;
my_integral = (f(local_a) + f(local_b))/2.0;
for (i = 1; i <= local_n-1; i++) {
x = local_a + i*h;
my_integral += f(x);
}
my_integral = my_integral*h;
# pragma omp critical
*integral_p += my_integral;
}
As you can see, it applies the trapezoidal rule over a given interval.
It DOES work if you hardcode the values and the function. But I need to read from a file in the format of
5
3.0 2.5 -1.5 0.0 1.7 4.0
6 10
Which means:
It is of degree 5 (no more than 50 ever)
3.0x^5 +2.5x^4 −1.5x^3 +1.7x+4 is the polynomial (we skip ^2 since it's 0)
and the Interval is from 6 to 10
My main concern is the f(x) function, which I have hardcoded. I have NO IDEA how to make it handle up to degree 50 besides literally typing out 50 pow calls and reading in the values to see what they could be. Does anyone have any ideas?
Also, what would be the best way to read in the file? fgetc? I'm not really sure how to read input in C (especially since everything I read in is an int; is there some way to convert it?)

For a large degree polynomial, would something like this work?
double f(double x, double coeff[], int nCoeff)
{
double return_val = 0.0;
int exponent = nCoeff-1;
int i;
for(i=0; i<nCoeff-1; ++i, --exponent)
{
/* coefficient times x^exponent, e.g. 3.0 * x^5 */
return_val = coeff[i]*pow(x, exponent) + return_val;
}
/* add on the final constant, 4, in our example */
return return_val + coeff[nCoeff-1];
}
In your example, you would call it like:
void sampleCall(double x)
{
double coefficients[] = {3.0, 2.5, -1.5, 0, 1.7, 4};
/* This expresses 3x^5 + 2.5x^4 - 1.5x^3 + 0x^2 + 1.7x + 4 */
double my_integral = f(x, coefficients, 6);
}
By passing an array of coefficients (the exponents are assumed), you don't have to deal with variadic arguments. The hardest part is constructing the array, and that is pretty simple.
It should go without saying, if you put the coefficients array and number-of-coefficients into global variables, then the signature of f(x) doesn't need to change:
double f(double x)
{
// access glbl_coeff and glbl_NumOfCoeffs, instead of parameters
}

For your f() function, consider making it variadic (varargs is another name):
http://www.gnu.org/s/libc/manual/html_node/Variadic-Functions.html
This way you could pass the function one argument telling it how many "pows" you want, with each subsequent argument being a double value. Is this what you are asking for with the f() function part of your question?
