Data Gathering Portion of CUDA Code is Unexpectedly Outputting "0"s - c

EDIT
In the initial posting's code snippet (see below) I was not properly sending the struct to the device, this has been fixed, but the results are still the same. In my full code this mistake was not present. (There were two mistakes in that command in my initial posting -- one, the structure was being copied from HostToDevice, but was actually reversed, and the size of the copy was also wrong. Apologies; both errors were fixed, but the recompiled code still displays the zeros phenomena described below, as does my full code.)
EDIT 2
In the haste of my de-proprietarization rewrite of the code I made a couple errors which dalekchef kindly pointed out to me (the copy of the struct to the device was performed BEFORE the allocation on the device, in my rewritten code and the device cudaMalloc calls were not multiplied with the sizeof(...) the type of the array elements. I added these fixes, recompiled and retested, but it did not fix the problem. Also double checked my original code -- it did not have those mistakes. Apologies again, for the confusion.
I'm trying to dump statistics from a large simulations program. A similar pared down code is displayed below. Both codes exhibit the same problem -- they output zeroes, when they should be outputting averaged values.
#include "stdio.h"
struct __align__(8) DynamicVals
{
double a;
double b;
int n1;
int n2;
int perDump;
};
__device__ int *dev_arrN1, *dev_arrN2;
__device__ double *dev_arrA, *dev_arrB;
__device__ DynamicVals *dev_myVals;
__device__ int stepsA, stepsB;
__device__ double sumA, sumB;
__device__ int stepsN1, stepsN2;
__device__ int sumN1, sumN2;
__global__ void TEST
(int step, double dev_arrA[], double dev_arrB[],
int dev_arrN1[], int dev_arrN2[],DynamicVals *dev_myVals)
{
if (step % dev_myVals->perDump)
{
dev_arrN1[step/dev_myVals->perDump] = 0;
dev_arrN2[step/dev_myVals->perDump] = 0;
dev_arrA[step/dev_myVals->perDump] = 0.0;
dev_arrB[step/dev_myVals->perDump] = 0.0;
stepsA = 0;
stepsB = 0;
stepsN1 = 0;
stepsN2 = 0;
sumA = 0.0;
sumB = 0.0;
sumN1 = 0;
sumN2 = 0;
}
sumA += dev_myVals->a;
sumB += dev_myVals->b;
sumN1 += dev_myVals->n1;
sumN2 += dev_myVals->n2;
stepsA++;
stepsB++;
stepsN1++;
stepsN2++;
if ( sumA > 100000000 )
{
dev_arrA[step/dev_myVals->perDump] +=
sumA / stepsA;
sumA = 0.0;
stepsA = 0;
}
if ( sumB > 100000000 )
{
dev_arrB[step/dev_myVals->perDump] +=
sumB / stepsB;
sumB = 0.0;
stepsB = 0;
}
if ( sumN1 > 1000000 )
{
dev_arrN1[step/dev_myVals->perDump] +=
sumN1 / stepsN1;
sumN1 = 0;
stepsN1 = 0;
}
if ( sumN2 > 1000000 )
{
dev_arrN2[step/dev_myVals->perDump] +=
sumN2 / stepsN2;
sumN2 = 0;
stepsN2 = 0;
}
if ((step+1) % dev_myVals->perDump)
{
dev_arrA[step/dev_myVals->perDump] +=
sumA / stepsA;
dev_arrB[step/dev_myVals->perDump] +=
sumB / stepsB;
dev_arrN1[step/dev_myVals->perDump] +=
sumN1 / stepsN1;
dev_arrN2[step/dev_myVals->perDump] +=
sumN2 / stepsN2;
}
}
int main()
{
const int TOTAL_STEPS = 10000000;
DynamicVals vals;
int *arrN1, *arrN2;
double *arrA, *arrB;
int statCnt;
vals.perDump = TOTAL_STEPS/10;
statCnt = TOTAL_STEPS/vals.perDump+1;
vals.a = 30000.0;
vals.b = 60000.0;
vals.n1 = 10000;
vals.n2 = 20000;
cudaMalloc( (void**)&dev_arrA, statCnt*sizeof(double) );
cudaMalloc( (void**)&dev_arrB, statCnt*sizeof(double) );
cudaMalloc( (void**)&dev_arrN1, statCnt*sizeof(int) );
cudaMalloc( (void**)&dev_arrN2, statCnt*sizeof(int) );
cudaMalloc( (void**)&dev_myVals, sizeof(DynamicVals));
cudaMemcpy(dev_myVals, &vals, sizeof(DynamicVals),
cudaMemcpyHostToDevice);
arrA = (double *)malloc(statCnt * sizeof(double));
arrB = (double *)malloc(statCnt * sizeof(double));
arrN1 = (int *)malloc(statCnt * sizeof(int));
arrN2 = (int *)malloc(statCnt * sizeof(int));
for (int i=0; i< TOTAL_STEPS; i++)
TEST<<<1,1>>>(i, dev_arrA,dev_arrB,dev_arrN1,dev_arrN2,dev_myVals);
cudaMemcpy(arrA,dev_arrA,statCnt * sizeof(double),cudaMemcpyDeviceToHost);
cudaMemcpy(arrB,dev_arrB,statCnt * sizeof(double),cudaMemcpyDeviceToHost);
cudaMemcpy(arrN1,dev_arrN1,statCnt * sizeof(int),cudaMemcpyDeviceToHost);
cudaMemcpy(arrN2,dev_arrN2,statCnt * sizeof(int),cudaMemcpyDeviceToHost);
for (int i=0; i< statCnt; i++)
{
printf("Step: %d ; A=%g B=%g N1=%d N2=%d\n",
i*vals.perDump,
arrA[i], arrB[i], arrN1[i], arrN2[i]);
}
}
Output:
Step: 0 ; A=0 B=0 N1=0 N2=0
Step: 1000000 ; A=0 B=0 N1=0 N2=0
Step: 2000000 ; A=0 B=0 N1=0 N2=0
Step: 3000000 ; A=0 B=0 N1=0 N2=0
Step: 4000000 ; A=0 B=0 N1=0 N2=0
Step: 5000000 ; A=0 B=0 N1=0 N2=0
Step: 6000000 ; A=0 B=0 N1=0 N2=0
Step: 7000000 ; A=0 B=0 N1=0 N2=0
Step: 8000000 ; A=0 B=0 N1=0 N2=0
Step: 9000000 ; A=0 B=0 N1=0 N2=0
Step: 10000000 ; A=0 B=0 N1=0 N2=0
Now, if I were to use a small period for my dumps or if my #s were smaller, I could get away with just a direct
add
divide by period and the end of period
...algorithm, but I use temporary sums as otherwise my int would overflow (the double wouldn't overflow, but I was concerned about it losing precision).
If I use the above direct algorithm for smaller values I get correct non-zero values, but the second I use the intermediates (e.g. stepsA, sumA, etc.) the values go to zero.
I know I'm doing something silly here... what am I missing?
Notes:
A.) Yes, I know this code in its above form is not parallel and by itself does not warrant parallelization. It is part of a small statistics collecting portion of a much longer code. In that code it is encased in a thread index specific conditional logic to prevent clashing (making it parallel) and serves as data gathering to a simulations program (which warrants parallelization). Hopefully you can understand where the above code originates and avoid snide comments about its lack of thread-safety. (This disclaimer is added out of past experience receiving unproductive comments from people who didn't understand I was posting an excerpt, not a full code, despite me writing in less explicit terms as such.)
B.) Yes, I know the names of the variables are ambiguous. That is the point. The code I'm working on is proprietary, though it will eventually be open sourced. I only write this as I have posted similarly anonymized codes in the past and received rude commentary about my naming convention.
C.) Yes, I have read the CUDA manual several times, though I do make errors and I admit there's some features I don't understand. I'm not using shared memory here, but I am using shared memory (OF COURSE) in my full code.
D.) Yes, the above code does represent the exact same features as the data dumping portion of my non-working code, with the logic not related to this particular problem removed, and with it the thread safety conditional. The variable names have been changed, but algorithmically it should be unaltered and this is verified by the exact same non-working output (zeroes).
E.) I do realize the "dynamic" struct in the above snippet has non-dynamic values. I named the structure that because in the full code, this struct contains simulations data, and is dynamic. The static nature in the pared-down code should not make the statistics collecting code fail, it will simply mean that the average for each dump should be constant (and non-zero).

A couple of things:
It seems like you are calling cudaMemcpy for dev_MyVals before you are calling cudaMalloc for it. This is not how it should be.
ALSO: You do not multiply by sizeof int when you do your cudaMalloc calls.
You should really check all of your CUDA calls cudaMalloc/cudaMemcpy for an error code. They should all return an error or CUDA_SUCCESS. I believe the CUDA examples all show how to do this.
Also, for future reference NEVER use the modulo operator in CUDA it is incredibly slow. Just Google for "Modulo CUDA" for some alternatives.
Let me know how it goes, this will probably take a couple of iterations to fix.

The biggest problem I see here is one of scope. The way this code is written leads me to conclude that you might not understand how variable scoping in C++ works in general, and how device and host code scope works in CUDA in particular. A couple of observations:
When you do this type of thing in code:
__device__ double *dev_arrA, *dev_arrB;
__global__ void TEST(int step, double dev_arrA[], double dev_arrB[], ....)
you have a variable scope problem. dev_arrA is declared at both compilation unit scope and function scope. The two declarations do not refer to the same variable -- the function unit scope declaration (in the kernel) takes precedence over the compilation unit scope declaration inside the kernel. you modify that variable, you are modifying the kernel scope declaration, not the __device__variable. This can lead to all sorts of subtle and unexpactd behaviour. It is much better to avoid ever having the same variable declared at multiple scopes.
When you declare a variable using the __device__ specifier, it is intended to be exclusively a device context symbol, and should only be used directly in device code. So something like this:
__device__ double *dev_arrA;
int main()
{
....
cudaMalloc( (void**)&dev_arrA, statCnt*sizeof(double) );
....
}
is illegal. You cannot call an API function like cudaMalloc directly on a __device__ variable. Even though it will compile (because of the hackery involved in the CUDA compilation tradjectories for host and device code), it is incorrect to do so. In the above example dev_arrA is a device symbol. You can interact with it via the API symbol manipulation calls, but that is all it is technically legal to do. In you code, variables intended to hold device pointers and be passed as kernel arguments (like dev_arrA) should be declared at main() scope, and passed by value to the kernel.
It is a combination of the above two things which is probably causing your problems.
But the difficulty is that you have chosen to post roughy 150 lines of code (a lot of which is redundant) as a repro case. I doubt anyone cares enough about your problems to go through that much code with a fine tooth comb and pinpoint where the precise problem is. Further, you habit of doing these nasty "top edits" in your questions quickly turn what might have been reasonably written starting points into unintelligible psuedo changelogs which are incredibly hard to follow and are unlikely to be of help to anyone. Also, the mildly passive-aggressive notes section serves no real purpose - it adds nothing of value to the question.
So I will leave you with a greatly simplified version of the code you posted which I think has all the basic things which you are trying to do working. I leave it as an "exercise for the reader" to turn it back into whatever it is that you are trying to do.
#include "stdio.h"
typedef float Real;
struct __align__(8) DynamicVals
{
Real a;
int n1;
int perDump;
};
__device__ int stepsA;
__device__ Real sumA;
__device__ int stepsN1;
__device__ int sumN1;
__global__ void TEST
(int step, Real dev_arrA[], int dev_arrN1[], DynamicVals *dev_myVals)
{
if (step % dev_myVals->perDump)
{
dev_arrN1[step/dev_myVals->perDump] = 0;
dev_arrA[step/dev_myVals->perDump] = 0.0;
stepsA = 0;
stepsN1 = 0;
sumA = 0.0;
sumN1 = 0;
}
sumA += dev_myVals->a;
sumN1 += dev_myVals->n1;
stepsA++;
stepsN1++;
dev_arrA[step/dev_myVals->perDump] += sumA / stepsA;
dev_arrN1[step/dev_myVals->perDump] += sumN1 / stepsN1;
}
inline void gpuAssert(cudaError_t code, char *file, int line,
bool abort=true)
{
if (code != cudaSuccess)
{
fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code),
file, line);
if (abort) exit(code);
}
}
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
int main()
{
const int TOTAL_STEPS = 1000;
DynamicVals vals;
int *arrN1;
Real *arrA;
int statCnt;
vals.perDump = TOTAL_STEPS/10;
statCnt = TOTAL_STEPS/vals.perDump;
vals.a = 30000.0;
vals.n1 = 10000;
Real *dev_arrA;
int *dev_arrN1;
DynamicVals *dev_myVals;
gpuErrchk( cudaMalloc( (void**)&dev_arrA, statCnt*sizeof(Real)) );
gpuErrchk( cudaMalloc( (void**)&dev_arrN1, statCnt*sizeof(int)) );
gpuErrchk( cudaMalloc( (void**)&dev_myVals, sizeof(DynamicVals)) );
gpuErrchk( cudaMemcpy(dev_myVals, &vals, sizeof(DynamicVals),
cudaMemcpyHostToDevice) );
arrA = (Real *)malloc(statCnt * sizeof(Real));
arrN1 = (int *)malloc(statCnt * sizeof(int));
for (int i=0; i< TOTAL_STEPS; i++) {
TEST<<<1,1>>>(i, dev_arrA,dev_arrN1,dev_myVals);
gpuErrchk( cudaPeekAtLastError() );
}
gpuErrchk( cudaMemcpy(arrA,dev_arrA,statCnt * sizeof(Real),
cudaMemcpyDeviceToHost) );
gpuErrchk( cudaMemcpy(arrN1,dev_arrN1,statCnt * sizeof(int),
cudaMemcpyDeviceToHost) );
for (int i=0; i< statCnt; i++)
{
printf("Step: %d ; A=%g N1=%d\n",
i*vals.perDump, arrA[i], arrN1[i] );
}
}

Related

Why does the following code run out of memory?

The following code describes 2 harmonic oscillators. They are initially uncoupled and independent and am only looking at one of them, which is a mechanical oscillator. The other oscillator's variables have been declared and forced to 0. I first do 300,000 time iterations and do this for 80 different frequencies of wdm (mechanical drive w).
//find frequency response
int index_A;
double wdm_1;
wdm_1=wm-2*3.142*2e5;
double wdm_2;
wdm_2=wm+2*3.142*2e5;
double wdm_prec=2*3.142*5e3;
index_A=(wdm_2-wdm_1)/wdm_prec;
printf("%d \n", index_A);
However, my code runs till 58 frequencies and gives this error:
Strangely, if I run the code in 2 parts of 40 frequencies each and append the files together, it works fine.
Resonance peak in frequency domain
Also, when I reduce the size of v0 to 3, the code works properly. However, I will need the other variables later.
int j=0;
for (j=0; j<= index_A ; j++){
wdm=wdm_1+j*wdm_prec;
printf("%d \n",j);
v0[0] = 0;
v0[1] = 0;
v0[2] = 0;
v0[3] = 0;
v0[4] = 0;
v0[5] = 0;
v0[6]= wdm;
for (i=0; i< n ; i++){
if (cabs(xa)>=1){
printf("Breaking Loop \n");
break;
}
v1 = rk4vec_ameya_complex_1 ( tau, 7, v0, dtau, rk4vec_f_ameya_complex_1 );
memcpy(v0, v1, 4 * sizeof ( double complex ) );
tau=tau+dtau;
}
fprintf(f1, "%g, %g \n", wdm/(2*3.142), cabs(v1[2]) );
}
printf("Completed");
fclose(f1);
At the end of time iterations, I save the value of the frequency wdm and last value of displacement x=v1[2] in a file and move on to do time iterations with another frequency. Hence, my file contains the frequency response.
fprintf(f1, "%g, %g \n", wdm/(2*3.142), cabs(v1[2]) );
I have used runge kutta (rk4.c) from people.sc.fsu.edu/~jburkardt/c_src/rk4/rk4.html and modified it for complex datatype using
#include <complex.h>
Following is the function runge-kutta-4 has to solve:
/******************************************************************************/
double complex *rk4vec_f_ameya_complex_1 ( double t, int n, double complex u[] )
/******************************************************************************/
{
double complex drive_m;
double complex drive_c;
double x;
double xrf0_1;
double complex *uprime;
uprime = ( double * ) malloc ( 7 * sizeof ( double complex ) );
//Check if memory unavailable
if(uprime==NULL){
printf("No memory available \n");
return 0;
}
///////////////////Second Order////////////////////////
xrf0_1=xrf0*(1-exp(-0.2*t));
drive_m=(xrf0*cexp(I*((u[6]-wm)/gammac)*t)/(2*wm*gammac));
uprime[2]=u[3];
uprime[3]=(wm/gammac)*(drive_m-u[3]*(2*I+2*gammam/wm)-u[2]*(2*I*gammam/gammac));
return uprime;
free(uprime);
}
Kindly suggest any solutions if my usage of malloc() is responsible for running out of memory.
One obvious problem - you free uprime after you have returned from rk4vec_f_ameya_complex_1, I'd expect you to be getting an unreachable code warning. Also as uprime is always the same size why do you use malloc at all? If you just make it an array of size 7 then you won't have issues with malloc and your code will probably run faster.

A lot of 0's received when using cudaMemcpy()

I've just started to learn CUDA and i wanted to fill an array (a 2D array represented as a 1D array) with random numbers. I followed another posts in order to generate random numbers, but i don't know if there is a problem with the generation of numbers or with the memory recovering from the device or anything else. The problem is that, though i have tried to fill any cell of the array with the id of the thread that is atending it in order to see the results after copying into the host memory, i receive an array that is filled with 0 in any position after recovering the data with cudaMemcpy().
I'm programming on Visual Studio 2013, with cuda 7.5, on a i5 2500k as my processor and a 960 GTX graphic card.
Here is the main and the method where i try to fill it. I'll update the cuRand Initialization too. If you need to see something else, just tell me.
__global__ void setup_cuRand(curandState * state, unsigned long seed)
{
int id = threadIdx.x;
curand_init(seed, id, 0, &state[id]);
}
__global__ void poblar(int * adn, curandState * state){
curandState localState = state[threadIdx.x];
int random = curand(&localState);
adn[threadIdx.x] = random;
// It doesn't mind if i use the following instruction, the result is a lot of 0's
//adn[threadIdx.x] = threadIdx.x;
}
int main()
{
const int adnLength = NUMCROMOSOMAS * SIZECROMOSOMAS; // 256 * 128 (32.768)
const size_t adnSize = adnLength * sizeof(int);
int adnCPU[adnLength];
int * adnDevice;
cudaError_t error = cudaSetDevice(0);
if (error != cudaSuccess)
exit(-EXIT_FAILURE);
curandState * randState;
error = cudaMalloc(&randState, adnLength * sizeof(curandState));
if (error != cudaSuccess){
cudaFree(randState);
exit(-EXIT_FAILURE);
}
//Here is initialized cuRand
setup_cuRand <<<1, adnLength >> > (randState, unsigned(time(NULL)));
error = cudaMalloc((void **)&adnDevice, adnSize);
if (error == cudaErrorMemoryAllocation){// cudaSuccess){
cudaFree(adnDevice);
cudaFree(randState);
printf("\n error");
exit(-EXIT_FAILURE);
}
poblar <<<1, adnLength >>> (adnDevice, randState);
error = cudaMemcpy(adnCPU, adnDevice, adnSize, cudaMemcpyDeviceToHost);
//After here, for any i, adnCPU[i] is 0 and i cannot figure what is wrong
if (error == cudaSuccess){
for (int i = 0; i < NUMCROMOSOMAS; i++){
for (int j = 0; j < SIZECROMOSOMAS; j++){
printf("%i,", adnCPU[(i*SIZECROMOSOMAS) + j]);
}
printf("\n");
}
}
return 0;
}
EDIT after answer solved: There was a particularity over the answer given, and is that you need a lower number of threads (half of that quantity worked for me) in order to seed correctly the random numbers with cuRand. For some reason, i could create the threads perfectly but i couldn't seed the pseudo-random algorithm generator.
The maximum number of threads per block is 1024 on your hardware, hence, you may not schedule a call with adnLength if it is larger than 1024.
The error you are having is most probably a call configuration error, and it is returned by cudaPeekAtLastError, as it occurs before any GPU work, right after the triple angled-bracket call. Indeed cudaMemcpy may not return it, even though it returns error from previous asynchronous calls.
The error that may occur is cudaErrorLaunchOutOfResources.

C: fastest way to evaluate a function on a finite set of small integer values by using a lookup table?

I am currently working on a project where I would like to optimize some numerical computation in Python by calling C.
In short, I need to compute the value of y[i] = f(x[i]) for each element in an huge array x (typically has 10^9 entries or more). Here, x[i] is an integer between -10 and 10 and f is function that takes x[i] and returns a double. My issue is that f but it takes a very long time to evaluate in a way that is numerically stable.
To speed things up, I would like to just hard code all 2*10 + 1 possible values of f(x[i]) into constant array such as:
double table_of_values[] = {f(-10), ...., f(10)};
And then just evaluate f using a "lookup table" approach as follows:
for (i = 0; i < N; i++) {
y[i] = table_of_values[x[i] + 11]; //instead of y[i] = f(x[i])
}
Since I am not really well-versed at writing optimized code in C, I am wondering:
Specifically - since x is really large - I'm wondering if it's worth doing second-degree optimization when evaluating the loop (e.g. by sorting x beforehand, or by finding a smart way to deal with the negative indices (aside from just doing [x[i] + 10 + 1])?
Say x[i] were not between -10 and 10, but between -20 and 20. In this case, I could still use the same approach, but would need to hard code the lookup table manually. Is there a way to generate the look-up table dynamically in the code so that I make use of the same approach and allow for x[i] to belong to a variable range?
It's fairly easy to generate such a table with dynamic range values.
Here's a simple, single table method:
#include <malloc.h>
#define VARIABLE_USED(_sym) \
do { \
if (1) \
break; \
if (!! _sym) \
break; \
} while (0)
double *table_of_values;
int table_bias;
// use the smallest of these that can contain the values the x array may have
#if 0
typedef int xval_t;
#endif
#if 0
typedef short xval_t;
#endif
#if 1
typedef char xval_t;
#endif
#define XLEN (1 << 9)
xval_t *x;
// fslow -- your original function
double
fslow(int i)
{
return 1; // whatever
}
// ftablegen -- generate variable table
void
ftablegen(double (*f)(int),int lo,int hi)
{
int len;
table_bias = -lo;
len = hi - lo;
len += 1;
// NOTE: you can do free(table_of_values) when no longer needed
table_of_values = malloc(sizeof(double) * len);
for (int i = lo; i <= hi; ++i)
table_of_values[i + table_bias] = f(i);
}
// fcached -- retrieve cached table data
double
fcached(int i)
{
return table_of_values[i + table_bias];
}
// fripper -- access x and table arrays
void
fripper(xval_t *x)
{
double *tptr;
int bias;
double val;
// ensure these go into registers to prevent needless extra memory fetches
tptr = table_of_values;
bias = table_bias;
for (int i = 0; i < XLEN; ++i) {
val = tptr[x[i] + bias];
// do stuff with val
VARIABLE_USED(val);
}
}
int
main(void)
{
ftablegen(fslow,-10,10);
x = malloc(sizeof(xval_t) * XLEN);
fripper(x);
return 0;
}
Here's a slightly more complex way that allows many similar tables to be generated:
#include <malloc.h>
#define VARIABLE_USED(_sym) \
do { \
if (1) \
break; \
if (!! _sym) \
break; \
} while (0)
// use the smallest of these that can contain the values the x array may have
#if 0
typedef int xval_t;
#endif
#if 1
typedef short xval_t;
#endif
#if 0
typedef char xval_t;
#endif
#define XLEN (1 << 9)
xval_t *x;
struct table {
int tbl_lo; // lowest index
int tbl_hi; // highest index
int tbl_bias; // bias for index
double *tbl_data; // cached data
};
struct table ftable1;
struct table ftable2;
double
fslow(int i)
{
return 1; // whatever
}
double
f2(int i)
{
return 2; // whatever
}
// ftablegen -- generate variable table
void
ftablegen(double (*f)(int),int lo,int hi,struct table *tbl)
{
int len;
tbl->tbl_bias = -lo;
len = hi - lo;
len += 1;
// NOTE: you can do free tbl_data when no longer needed
tbl->tbl_data = malloc(sizeof(double) * len);
for (int i = lo; i <= hi; ++i)
tbl->tbl_data[i + tbl->tbl_bias] = fslow(i);
}
// fcached -- retrieve cached table data
double
fcached(struct table *tbl,int i)
{
return tbl->tbl_data[i + tbl->tbl_bias];
}
// fripper -- access x and table arrays
void
fripper(xval_t *x,struct table *tbl)
{
double *tptr;
int bias;
double val;
// ensure these go into registers to prevent needless extra memory fetches
tptr = tbl->tbl_data;
bias = tbl->tbl_bias;
for (int i = 0; i < XLEN; ++i) {
val = tptr[x[i] + bias];
// do stuff with val
VARIABLE_USED(val);
}
}
int
main(void)
{
x = malloc(sizeof(xval_t) * XLEN);
// NOTE: we could use 'char' for xval_t ...
ftablegen(fslow,-37,62,&ftable1);
fripper(x,&ftable1);
// ... but, this forces us to use a 'short' for xval_t
ftablegen(f2,-99,307,&ftable2);
return 0;
}
Notes:
fcached could/should be an inline function for speed. Notice that once the table is calculated once, fcached(x[i]) is quite fast. The index offset issue you mentioned [solved by the "bias"] is trivially small in calculation time.
While x may be a large array, the cached array for f() values is fairly small (e.g. -10 to 10). Even if it were (e.g.) -100 to 100, this is still about 200 elements. This small cached array will [probably] stay in the hardware memory cache, so access will remain quite fast.
Thus, sorting x to optimize H/W cache performance of the lookup table will have little to no [measurable] effect.
The access pattern to x is independent. You'll get best performance if you access x in a linear manner (e.g. for (i = 0; i < 999999999; ++i) x[i]). If you access it in a semi-random fashion, it will put a strain on the H/W cache logic and its ability to keep the needed/wanted x values "cache hot"
Even with linear access, because x is so large, by the time you get to the end, the first elements will have been evicted from the H/W cache (e.g. most CPU caches are on the order of a few megabytes)
However, if x only has values in a limited range, changing the type from int x[...] to short x[...] or even char x[...] cuts the size by a factor of 2x [or 4x]. And, that can have a measurable improvement on the performance.
Update: I've added an fripper function to show the fastest way [that I know of] to access the table and x arrays in a loop. I've also added a typedef named xval_t to allow the x array to consume less space (i.e. will have better H/W cache performance).
UPDATE #2:
Per your comments ...
fcached was coded [mostly] to illustrate simple/single access. But, it was not used in the final example.
The exact requirements for inline has varied over the years (e.g. was extern inline). Best use now: static inline. However, if using c++, it may be, yet again different. There are entire pages devoted to this. The reason is because of compilation in different .c files, what happens when optimization is on or off. Also, consider using a gcc extension. So, to force inline all the time:
__attribute__((__always_inline__)) static inline
fripper is the fastest because it avoids refetching globals table_of_values and table_bias on each loop iteration. In fripper, compiler optimizer will ensure they remain in registers. See my answer: Is accessing statically or dynamically allocated memory faster? as to why.
However, I coded an fripper variant that uses fcached and the disassembled code was the same [and optimal]. So, we can disregard that ... Or, can we? Sometimes, disassembling the code is a good cross check and the only way to know for sure. Just an extra item when creating fully optimized C code. There are many options one can give to the compiler regarding code generation, so sometimes it's just trial and error.
Because benchmarking is important, I threw in my routines for timestamping (FYI, [AFAIK] the underlying clock_gettime call is the basis for python's time.clock()).
So, here's the updated version:
#include <malloc.h>
#include <time.h>
typedef long long s64;
#define SUPER_INLINE \
__attribute__((__always_inline__)) static inline
#define VARIABLE_USED(_sym) \
do { \
if (1) \
break; \
if (!! _sym) \
break; \
} while (0)
#define TVSEC 1000000000LL // nanoseconds in a second
#define TVSECF 1e9 // nanoseconds in a second
// tvget -- get high resolution time of day
// RETURNS: absolute nanoseconds
s64
tvget(void)
{
struct timespec ts;
s64 nsec;
clock_gettime(CLOCK_REALTIME,&ts);
nsec = ts.tv_sec;
nsec *= TVSEC;
nsec += ts.tv_nsec;
return nsec;
)
// tvgetf -- get high resolution time of day
// RETURNS: fractional seconds
double
tvgetf(void)
{
struct timespec ts;
double sec;
clock_gettime(CLOCK_REALTIME,&ts);
sec = ts.tv_nsec;
sec /= TVSECF;
sec += ts.tv_sec;
return sec;
)
double *table_of_values;
int table_bias;
double *dummyptr;
// use the smallest of these that can contain the values the x array may have
#if 0
typedef int xval_t;
#endif
#if 0
typedef short xval_t;
#endif
#if 1
typedef char xval_t;
#endif
#define XLEN (1 << 9)
xval_t *x;
// fslow -- your original function
double
fslow(int i)
{
return 1; // whatever
}
// ftablegen -- generate variable table
void
ftablegen(double (*f)(int),int lo,int hi)
{
int len;
table_bias = -lo;
len = hi - lo;
len += 1;
// NOTE: you can do free(table_of_values) when no longer needed
table_of_values = malloc(sizeof(double) * len);
for (int i = lo; i <= hi; ++i)
table_of_values[i + table_bias] = f(i);
}
// fcached -- retrieve cached table data
SUPER_INLINE double
fcached(int i)
{
return table_of_values[i + table_bias];
}
// fripper_fcached -- access x and table arrays
void
fripper_fcached(xval_t *x)
{
double val;
double *dptr;
dptr = dummyptr;
for (int i = 0; i < XLEN; ++i) {
val = fcached(x[i]);
// do stuff with val
dptr[i] = val;
}
}
// fripper -- access x and table arrays
void
fripper(xval_t *x)
{
double *tptr;
int bias;
double val;
double *dptr;
// ensure these go into registers to prevent needless extra memory fetches
tptr = table_of_values;
bias = table_bias;
dptr = dummyptr;
for (int i = 0; i < XLEN; ++i) {
val = tptr[x[i] + bias];
// do stuff with val
dptr[i] = val;
}
}
int
main(void)
{
ftablegen(fslow,-10,10);
x = malloc(sizeof(xval_t) * XLEN);
dummyptr = malloc(sizeof(double) * XLEN);
fripper(x);
fripper_fcached(x);
return 0;
}
You can have negative indices in your arrays. (I am not sure if this is in the specifications.) If you have the following code:
int arr[] = {1, 2 ,3, 4, 5};
int* lookupTable = arr + 3;
printf("%i", lookupTable[-2]);
it will print out 2.
This works because arrays in c are defined as pointers. And if the pointer does not point to the begin of the array, you can access the item before the pointer.
Keep in mind though that if you have to malloc() the memory for arr you probably cannot use free(lookupTable) to free it.
I really think Craig Estey is on the right track for building your table in an automatic way. I just want to add a note for looking up the table.
If you know that you will run the code on a Haswell machine (with AVX2) you should make sure your code utilise VGATHERDPD which you can utilize with the _mm256_i32gather_pd intrinsic. If you do that, your table lookups will fly! (You can even detect avx2 on the fly with cpuid(), but that's another story)
EDIT:
Let me elaborate with some code:
#include <stdint.h>
#include <stdio.h>
#include <immintrin.h>
/* I'm not sure if you need the alignment */
double table[8] __attribute__((aligned(16)))= { 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8 };
int main()
{
int32_t i[4] = { 0,2,4,6 };
__m128i index = _mm_load_si128( (__m128i*) i );
__m256d result = _mm256_i32gather_pd( table, index, 8 );
double* f = (double*)&result;
printf("%f %f %f %f\n", f[0], f[1], f[2], f[3]);
return 0;
}
Compile and run:
$ gcc --std=gnu99 -mavx2 gathertest.c -o gathertest && ./gathertest
0.100000 0.300000 0.500000 0.700000
This is fast!

A simple reduction program in CUDA

In the below code, I am trying to implement a simple parallel reduction with blocksize and number of threads per block being 1024. However, after implementing partial reduction, I wish to see whether my implementation is going right or not and in that process I make the program print the first element of the host memory (after data has been copied from device memory to host memory).
My host memory is initialize with '1' and is copied to device memory for reduction. And the printf statement after the reduction process still gives me '1' at the first element of the array.
Is there a problem in what I am getting to print or is it something logical in the implementation of reduction?
In addition printf statements in the kernel do not print anything. Is there something wrong in my syntax or the call to the printf statement?
My code is as below:
ifndef CUDACC
define CUDACC
endif
include "cuda_runtime.h"
include "device_launch_parameters.h"
include
include
ifndef THREADSPERBLOCK
define THREADSPERBLOCK 1024
endif
ifndef NUMBLOCKS
define NUMBLOCKS 1024
endif
global void reduceKernel(int *c)
{
extern shared int sh_arr[];
int index = blockDim.x*blockIdx.x + threadIdx.x;
int sh_index = threadIdx.x;
// Storing data from Global memory to shared Memory
sh_arr[sh_index] = c[index];
__syncthreads();
for(unsigned int i = blockDim.x/2; i>0 ; i>>=1)
{
if(sh_index < i){
sh_arr[sh_index] += sh_arr[i+sh_index];
}
__syncthreads();
}
if(sh_index ==0)
c[blockIdx.x]=sh_arr[sh_index];
printf("value stored at %d is %d \n", blockIdx.x, c[blockIdx.x]);
return;
}
int main()
{
int *h_a;
int *d_a;
int share_memSize, h_memSize;
size_t d_memSize;
share_memSize = THREADSPERBLOCK*sizeof(int);
h_memSize = THREADSPERBLOCK*NUMBLOCKS;
h_a = (int*)malloc(sizeof(int)*h_memSize);
d_memSize=THREADSPERBLOCK*NUMBLOCKS;
cudaMalloc( (void**)&d_a, h_memSize*sizeof(int));
for(int i=0; i<h_memSize; i++)
{
h_a[i]=1;
};
//printf("last element of array %d \n", h_a[h_memSize-1]);
cudaMemcpy((void**)&d_a, (void**)&h_a, h_memSize, cudaMemcpyHostToDevice);
reduceKernel<<<NUMBLOCKS, THREADSPERBLOCK, share_memSize>>>(d_a);
cudaMemcpy((void**)&h_a, (void**)&d_a, d_memSize, cudaMemcpyDeviceToHost);
printf("sizeof host memory %d \n", d_memSize); //sizeof(h_a));
printf("sum after reduction %d \n", h_a[0]);
}
There are a number of problems with this code.
much of what you've posted is not valid code. As just a few examples, your global and shared keywords are supposed to have double-underscores before and after, like this: __global__ and __shared__. I assume this is some sort of copy-paste error or formatting error. There are problems with your define statements as well. You should endeavor to post code that doesn't have these sorts of problems.
Any time you are having trouble with a CUDA code, you should use proper cuda error checking and run your code with cuda-memcheck before asking for help. If you had done this , it would have focused your attention on item 3 below.
Your cudaMemcpy operations are broken in a couple of ways:
cudaMemcpy((void**)&d_a, (void**)&h_a, h_memSize, cudaMemcpyHostToDevice);
First, unlike cudaMalloc, but like memcpy, cudaMemcpy just takes ordinary pointer arguments. Second, the size of the transfer (like memcpy) is in bytes, so your sizes need to be scaled up by sizeof(int):
cudaMemcpy(d_a, h_a, h_memSize*sizeof(int), cudaMemcpyHostToDevice);
and similarly for the one after the kernel.
printf from every thread in a large kernel (like this one which has 1048576 threads) is probably not a good idea. You won't actually get all the output you expect, and on windows (appears you are running on windows) you may run into a WDDM watchdog timeout due to kernel execution taking too long. If you need to printf from a large kernel, be selective and condition your printf on threadIdx.x and blockIdx.x
The above things are probably enough to get some sensible printout, and as you point out you're not finished yet anyway: "I wish to see whether my implementation is going right or not ". However, this kernel, as crafted, overwrites its input data with output data:
__global__ void reduceKernel(int *c)
...
c[blockIdx.x]=sh_arr[sh_index];
This will lead to a race condition. Rather than trying to sort this out for you, I'd suggest separating your output data from your input data. Even better, you should study the cuda reduction sample code which also has an associated presentation.
Here is a modified version of your code which has most of the above issues fixed. It's still not correct. It still has defect 5 above in it. Rather than completely rewrite your code to fix defect 5, I would direct you to the cuda sample code mentioned above.
$ cat t820.cu
#include <stdio.h>
#ifndef THREADSPERBLOCK
#define THREADSPERBLOCK 1024
#endif
#ifndef NUMBLOCKS
#define NUMBLOCKS 1024
#endif
__global__ void reduceKernel(int *c)
{
extern __shared__ int sh_arr[];
int index = blockDim.x*blockIdx.x + threadIdx.x;
int sh_index = threadIdx.x;
// Storing data from Global memory to shared Memory
sh_arr[sh_index] = c[index];
__syncthreads();
for(unsigned int i = blockDim.x/2; i>0 ; i>>=1)
{
if(sh_index < i){
sh_arr[sh_index] += sh_arr[i+sh_index];
}
__syncthreads();
}
if(sh_index ==0)
c[blockIdx.x]=sh_arr[sh_index];
// printf("value stored at %d is %d \n", blockIdx.x, c[blockIdx.x]);
return;
}
int main()
{
int *h_a;
int *d_a;
int share_memSize, h_memSize;
size_t d_memSize;
share_memSize = THREADSPERBLOCK*sizeof(int);
h_memSize = THREADSPERBLOCK*NUMBLOCKS;
h_a = (int*)malloc(sizeof(int)*h_memSize);
d_memSize=THREADSPERBLOCK*NUMBLOCKS;
cudaMalloc( (void**)&d_a, h_memSize*sizeof(int));
for(int i=0; i<h_memSize; i++)
{
h_a[i]=1;
};
//printf("last element of array %d \n", h_a[h_memSize-1]);
cudaMemcpy(d_a, h_a, h_memSize*sizeof(int), cudaMemcpyHostToDevice);
reduceKernel<<<NUMBLOCKS, THREADSPERBLOCK, share_memSize>>>(d_a);
cudaMemcpy(h_a, d_a, d_memSize*sizeof(int), cudaMemcpyDeviceToHost);
printf("sizeof host memory %d \n", d_memSize); //sizeof(h_a));
printf("first block sum after reduction %d \n", h_a[0]);
}
$ nvcc -o t820 t820.cu
$ cuda-memcheck ./t820
========= CUDA-MEMCHECK
sizeof host memory 1048576
first block sum after reduction 1024
========= ERROR SUMMARY: 0 errors
$

Seg Fault when initializing array

I'm taking a class on C, and running into a segmentation fault. From what I understand, seg faults are supposed to occur when you're accessing memory that hasn't been allocated, or otherwise outside the bounds. 'Course all I'm trying to do is initialize an array (though rather large at that)
Am I simply misunderstanding how to parse a 2d array? Misplacing a bound is exactly what would cause a seg fault-- am I wrong in using a nested for-loop for this?
The professor provided the clock functions, so I'm hoping that's not the problem. I'm running this code in Cygwin, could that be the problem? Source code follows. Using c99 standard as well.
To be perfectly clear: I am looking for help understanding (and eventually fixing) the reason my code produces a seg fault.
#include <stdio.h>
#include <time.h>
int main(void){
//first define the array and two doubles to count elapsed seconds.
double rowMajor, colMajor;
rowMajor = colMajor = 0;
int majorArray [1000][1000] = {};
clock_t start, end;
//set it up to perform the test 100 times.
for(int k = 0; k<10; k++)
{
start=clock();
//first we do row major
for(int i = 0; i < 1000; i++)
{
for(int j = 0; j<1000; j++)
{
majorArray[i][j] = 314;
}
}
end=clock();
rowMajor+= (end-start)/(double)CLOCKS_PER_SEC;
//at this point, we've only done rowMajor, so elapsed = rowMajor
start=clock();
//now we do column major
for(int i = 0; i < 1000; i++)
{
for(int j = 0; j<1000; j++)
{
majorArray[j][i] = 314;
}
}
end=clock();
colMajor += (end-start)/(double)CLOCKS_PER_SEC;
}
//now that we've done the calculations 100 times, we can compare the values.
printf("Row major took %f seconds\n", rowMajor);
printf("Column major took %f seconds\n", colMajor);
if(rowMajor<colMajor)
{
printf("Row major is faster\n");
}
else
{
printf("Column major is faster\n");
}
return 0;
}
Your program works correctly on my computer (x86-64/Linux) so I suspect you're running into a system-specific limit on the size of the call stack. I don't know how much stack you get on Cygwin, but your array is 4,000,000 bytes (with 32-bit int) - that could easily be too big.
Try moving the declaration of majorArray out of main (put it right after the #includes) -- then it will be a global variable, which comes from a different allocation pool that can be much bigger.
By the way, this comparison is backwards:
if(rowMajor>colMajor)
{
printf("Row major is faster\n");
}
else
{
printf("Column major is faster\n");
}
Also, to do a test like this you really ought to repeat the process for many different array sizes and shapes.
You are trying to grab 1000 * 1000 * sizeof( int ) bytes on the stack. This is more then your OS allows for the stack growth. If on any Unix - check the ulimit -a for max stack size of the process.
As a rule of thumb - allocate big structures on the heap with malloc(3). Or use static arrays - outside of scope of any function.
In this case, you can replace the declaration of majorArray with:
int (*majorArray)[1000] = calloc(1000, sizeof majorArray);
I was unable to find any error in your code, so I compiled it and run it and worked as expected.
You have, however, a semantic error in your code:
start=clock();
//set it up to perform the test 100 times.
for(int k = 0; k<10; k++)
{
Should be:
//set it up to perform the test 100 times.
for(int k = 0; k<10; k++)
{
start=clock();
Also, the condition at the end should be changed to its inverse:
if(rowMajor<colMajor)
Finally, to avoid the problem of the os-specific stack size others mentioned, you should define your matrix outside main():
#include <stdio.h>
#include <time.h>
int majorArray [1000][1000];
int main(void){
//first define the array and two doubles to count elapsed seconds.
double rowMajor, colMajor;
rowMajor = colMajor = 0;
This code runs fine for me under Linux and I can't see anything obviously wrong about it. You can try to debug it via gdb. Compile it like this:
gcc -g -o testcode test.c
and then say
gdb ./testcode
and in gdb say run
If it crashes, say where and gdb tells you, where the crash occurred. Then you now in which line the error is.
The program is working perfectly when compiled by gcc, & run in Linux, Cygwin may very well be your problem here.
If it runs correctly elsewhere, you're most likely trying to grab more stack space than the OS allows. You're allocating 4MB on the stack (1 mill integers), which is way too much for allocating "safely" on the stack. malloc() and free() are your best bets here.

Resources