Weird bug in c implementing schoolbook multiplication using gmp - c

I have a weird problem.
I am trying to implement the schoolbook multiplication. I am aware that the function mpz_mul does that for me but it is my task to implement it myself as a homework.
So here is my code:
void mpz_school_mul(mpz_t c, mpz_t a, mpz_t b)
{
size_t i;
mp_limb_t b_i;
mpz_t c_part;
mpz_init(c_part);
/* Backup a for the special case a := a * b. */
mpz_t a_backup;
mpz_init(a_backup);
mpz_set(a_backup, a);
/* Clear the result */
mpz_set_ui(c,0);
gmp_printf("i = %zx, size(b) = %zx, a = %Zx, b = %Zx\n", i, mpz_size(b), a, b);
for(i = 0; i < mpz_size(b); i++)
{
printf("test\n");
b_i = mpz_getlimbn(b,i);
/* c = a*b_i*B^i + ... + a*b_0*B^0 */
/* Calculate a*b_i for every round. */
mpz_mul_limb(c_part,a_backup,b_i);
/* Shift it to the right position (*B^i). */
mpz_mul_base(c_part,c_part,i);
/* Sum all a*b_i*B^i */
mpz_school_add(c, c, c_part);
}
mpz_clear(a_backup);
mpz_clear(c_part);
}
This code works well for me, and I have tested it with several parameters. The results are correct, so I don't think I need to change much in the calculation part. ;)
For example, these parameters work as intended:
mpz_set_str(a, "ffffffff00000000abcdabcd", 16);
mpz_set_str(b, "cceaffcc00000000abcdabcd", 16);
mpz_school_mul(c,a,b);
Now to the bug:
When I run the program with a parameter b whose lowest limb is zero (I'm using a 32-bit VM), the program crashes:
mpz_set_str(a, "ffffffff00000000abcdabcd", 16);
mpz_set_str(b, "cceaffcc00000000", 16);
mpz_school_mul(c,a,b);
The output with this parameter b_0 = 0 is:
i = 0, size(b) = 2, a = ffffffff00000000abcdabcd, b = cceaffcc00000000
I think the for-loop gets stuck, because the output of printf("test\n"); does not show up in this run.
Thanks for your help ;)
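For comparison, the round structure described in the code comments (c = a*b_i*B^i + ... + a*b_0*B^0) can be sketched without GMP on plain arrays of 32-bit limbs. This is only an illustration under stated assumptions: the name school_mul, the least-significant-limb-first layout and the base B = 2^32 are choices made here, not part of the original homework code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: out = a * b, limbs stored least significant first.
   out must have room for na + nb limbs and must not overlap a or b. */
void school_mul(uint32_t *out, const uint32_t *a, size_t na,
                const uint32_t *b, size_t nb)
{
    assert(out != NULL && a != NULL && b != NULL);

    /* Clear the result. */
    for (size_t k = 0; k < na + nb; k++)
        out[k] = 0;

    /* One round per limb of b: add a * b[i] at offset i (that is, * B^i). */
    for (size_t i = 0; i < nb; i++) {
        uint64_t carry = 0;
        for (size_t j = 0; j < na; j++) {
            /* Cannot overflow: (2^32-1)^2 + 2*(2^32-1) = 2^64 - 1. */
            uint64_t t = (uint64_t)a[j] * b[i] + out[i + j] + carry;
            out[i + j] = (uint32_t)t;   /* low 32 bits stay in place */
            carry = t >> 32;            /* high 32 bits move on      */
        }
        out[i + na] = (uint32_t)carry;  /* this limb was still zero  */
    }
}
```

A zero limb in b needs no special case here: that round simply adds zero.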

The bug in the problem is fixed now.
Here is the solution:
I tried fprintf(stderr, "test\n"); instead of printf("test\n"); and ran the code again. This time "test" showed up in my console. (stderr is unbuffered, whereas output written to stdout can sit in a buffer that is never flushed if the program hangs.)
It may also have to do with a wrong inclusion order of two header files.
Because I had this problem with the print, I hadn't checked my other functions.
Once I figured out that the for-loop wasn't the problem, I added prints after each command. That let me find the error in the void mpz_mul_base(mpz_t c, mpz_t a, mp_size_t i) function, where I didn't handle the case c_part = 0. With this parameter, the following code in the mpz_mul_base(c_part,c_part,i); call ran into an endless loop:
if(mpz_size(c) >= n)
for(i = mpz_size(c); i >= n; i--)
{
mpz_setlimbn(c, clear, i);
}
I replaced the >= with > and everything works fine now.
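The post does not show the types of i and n. Assuming the comparison ends up being done on an unsigned type such as size_t, a countdown written as "i >= n" can never finish for n == 0, because an unsigned value is always >= 0 and wraps around instead of going negative. A minimal sketch of a countdown that avoids this, using a hypothetical helper clear_from on a plain array rather than the GMP functions from the post:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: set limbs[n] .. limbs[size - 1] to zero, counting down. */
void clear_from(unsigned *limbs, size_t size, size_t n)
{
    assert(limbs != NULL && n <= size);

    /* "i-- > n" compares first and decrements afterwards, so the body sees
       i = size - 1 down to n and the loop also ends when n == 0.
       With "for (i = size; i >= n; i--)" and unsigned i, n == 0 never ends. */
    for (size_t i = size; i-- > n; )
        limbs[i] = 0;
}
```

Calling clear_from(v, 4, 0) clears all four elements and returns; clear_from(v, 4, 2) leaves v[0] and v[1] untouched.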

Related

How to call C function from R?

How can you call a function written in C from R, using R data?
E.g. to use a function like:
double* addOneToVector(int n, const double* vector) {
double* ans = malloc(sizeof(double)*n);
for (int i = 0; i < n; ++i)
ans[i] = vector[i] + 1;
return ans;
}
in the context:
x = 1:3
x = addOneToVector(x)
x # 2, 3, 4
I searched Stack Overflow first, but I noticed there is no answer for that here.
The general idea is (commands for Linux, but the same idea applies on other operating systems):
Create a function that takes only pointers to basic types and does everything by side effects (returns void), e.g. in a file called foo.c:
void addOneToVector(int* n, double* vector) {
for (int i = 0; i < *n; ++i)
vector[i] += 1.0;
}
Compile the C source file as a dynamic library; you can use the R shortcut to do this:
$ R CMD SHLIB foo.c
This creates a file called foo.so on Linux or Mac, or foo.dll on Windows.
Load the dynamic library from R.
On Linux or Mac:
dyn.load("foo.so")
or on Windows:
dyn.load("foo.dll")
Call the C function using R's .C function, i.e.:
x = 1:3
ret_val = .C("addOneToVector", n=length(x), vector=as.double(x))
It returns a list from which you can get the values of the inputs after the call, e.g.
ret_val$vector # 2, 3, 4
You can now wrap it to make it easier to use from R.
There is a nice page describing whole process with more details here (also covering Fortran):
http://users.stat.umn.edu/~geyer/rc/
I just did the same thing in a very simple way using the Rcpp package. It allows you to write C++ functions directly in R.
library("Rcpp")
cppFunction("
NumericVector addOneToVector(NumericVector vector) {
int n = vector.size();
for (int i = 0; i < n; ++i)
vector[i] = vector[i] + 1.0;
return vector;
}")
Find more details at http://adv-r.had.co.nz/Rcpp.html. C++ functions can be set up very quickly with these instructions.
First off, I want to thank both @m0nhawk and @Jan for their immensely useful contributions to this problem.
I tried both methods on my MacBook: first the one m0nhawk showed, which requires writing a function in C (without a main function), compiling it with R CMD SHLIB <prog.c>, and then invoking the function from R using the .C command.
Here's a small piece of C code I wrote (I'm not a pro in C, just learning in bits and pieces).
Step 1: Write the C Program
#include <stdio.h>
int func_test() {
for(int i = 0; i < 5; i++) {
printf("The value of i is: %d\n", i);
}
return 0;
}
Step 2: Compile the program using
R CMD SHLIB func_test.c
This will produce a func_test.so file
Step 3: Now write the R code that invokes this C function from within RStudio
dyn.load("/users/my_home_dir/xxx/ccode/ac.so")
.C("func_test")
Step 4: Output:
.C("func_test")
The value of i is: 0
The value of i is: 1
The value of i is: 2
The value of i is: 3
The value of i is: 4
list()
Then I tried the direct method suggested by Jan, using the Rcpp package:
library("Rcpp")
cppFunction("
NumericVector addOneToVector(NumericVector vector) {
int n = vector.size();
for (int i = 0; i < n; ++i)
vector[i] = vector[i] + 1.0;
return vector;
}")
# Test code to test the function
addOneToVector(c(1,2,3))
Both methods worked superbly. I can now start writing functions in C or C++ and use them in R
Thank you once again!

c - Avoid if in loop

Context
Debian 64.
Core 2 duo.
I am fiddling with a loop. I came up with different variations of the same loop, but I would like to avoid conditional branching if possible.
That said, I think the simple version will be difficult to beat.
I thought about SSE or bit shifting, but that would still require a jump (look at the computed goto below). Spoiler: a computed jump doesn't seem to be the way to go.
The code is compiled without PGO, because on this piece of code PGO makes it slower.
flags :
gcc -march=native -O3 -std=c11 test_comp.c
Unrolling the loop didn't help here.
63 in ASCII is '?'.
The printf is here to force the code to execute. Nothing more.
My need :
Some logic that avoids the condition. I take this as a challenge for my holidays :)
The code :
Test with the sentence. The character '?' is guaranteed to be there but at a random position.
hjkjhqsjhdjshnbcvvyzayuazeioufdhkjbvcxmlkdqijebdvyxjgqddsyduge?iorfe
#include <stdlib.h>
#include <stdio.h>
int main(int argc, char **argv){
/* This is quite slow. Average actually.
Executes in 369,041 cycles here (cachegrind) */
for (int x = 0; x < 100; ++x){
if (argv[1][x] == 63){
printf("%d\n",x);
break;
}
}
/* This is the slowest.
Executes in 370,385 cycles here (cachegrind) */
register unsigned int i = 0;
static void * restrict table[] = {&&keep,&&end};
keep:
++i;
goto *table[(argv[1][i-1] == 63)];
end:
printf("i = %d",i-1);
/* This is slower. Because of the calculation..
Executes in 369,109 cycles here (cachegrind) */
for (int x = 100; ; --x){
if (argv[1][100 - x ] == 63){printf("%d\n",100-x);break;}
}
return 0;
}
Question
Is there a way to make it faster, maybe by avoiding the branch?
The branch miss rate is huge at 11.3% (cachegrind with --branch-sim=yes).
I cannot believe this is the best one can achieve.
If any of you are skilled with assembly, please chime in.
Assuming you have a buffer of well-known size that can hold the maximum number of chars to test against, like
char buffer[100];
make it one byte larger
char buffer[100 + 1];
then fill it with the sequence to test against
read(fileno(stdin), buffer, 100);
and put your test-char '?' at the very end
buffer[100] = '?';
This allows for a loop with only one test condition:
size_t i = 0;
while ('?' != buffer[i])
{
++i;
}
if (100 == i)
{
/* test failed */
}
else
{
/* test passed for i */
}
Leave all other optimisation to the compiler.
However, I couldn't resist, so here's a possible approach to micro-optimisation:
char buffer[100 + 1];
read(fileno(stdin), buffer, 100);
buffer[100] = '?';
char * p = buffer;
while ('?' != *p)
{
++p;
}
if ((p - buffer) == 100)
{
/* test failed */
}
else
{
/* test passed for (p - buffer) */
}
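The sentinel idea above can be packaged as a function and compared with the standard library's memchr, which many C libraries implement with wide or vectorised comparisons. This is a sketch only: the function names are made up for illustration, and whether either variant beats the plain loop on a given machine has to be measured.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: index of the first target in buf[0 .. len - 1],
   or len if it is absent. buf must have room for len + 1 chars, because
   buf[len] is overwritten with the sentinel. */
size_t find_with_sentinel(char *buf, size_t len, char target)
{
    assert(buf != NULL);
    buf[len] = target;            /* guarantees that the loop terminates */

    size_t i = 0;
    while (buf[i] != target)      /* single test per character */
        ++i;
    return i;                     /* i == len means "not found" */
}

/* Same contract, but delegating the search to memchr. */
size_t find_with_memchr(const char *buf, size_t len, char target)
{
    assert(buf != NULL);
    const char *p = memchr(buf, target, len);
    return p != NULL ? (size_t)(p - buf) : len;
}
```

Both functions return the same index for the same input, so they can be swapped freely when benchmarking.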

Visual Studio 2010: 'File location'.exe is not recognized as internal or external command... (C program)

My Thermo professor assigned our class a computational project in which we have to calculate some thermodynamic functions. He provided us with some code to work from: a program that essentially finds the area under the curve of the function x^2 between two points. The code is said to be correct, and it looks correct to me. However, I've been having FREQUENT problems with all of my programs giving me the error "'File location'.exe is not recognized as an internal or external command, operable program or batch file." when first running a project or, mostly, when reopening projects.
I've been researching the problem for many hours. I tried adjusting the environment variables as so many other sites suggested, but either I'm not doing it right or it's not working. All I keep reading is people explaining the purpose of an .exe file and that I have to locate that file and open it. The problem is that I cannot find ANY .exe file. There is the project I created, with the source.c file in which I wrote the program. Everything else has lengthy extensions that I've never seen before.
I'm growing increasingly impatient with Visual Studio's inconsistent behavior lately. I've just made the switch from MATLAB, which, although an inferior programming language, is far more user-friendly and easier to program with. For those of you interested in the code I'm running, it is below:
#include <stdio.h>
#include <iostream>
#include <math.h>
using namespace std;
double integration();
double integration()
{
int num_of_intervals = 4, i;
double final_sum = 0, lower_limit = 2, upper_limit = 3, var, y = 1, x;
x = (upper_limit - lower_limit) / num_of_intervals; // Calculating delta x value
if(num_of_intervals % 2 != 0) //Simpson's rule can be performed only on even number of intervals
{
printf("Cannot perform integration. Number of intervals should be even");
return 0;
}
for(i = 0 ; i < num_of_intervals ; i++)
{
if(i != 0) //Coefficients for even and odd places. Even places, it is 2 and for odd it is 4.
{
if(i % 2 == 0)
y = 2;
else
y = 4;
}
var = lower_limit + (i * x);// Calculating the function variable value
final_sum = final_sum + (pow(var, 2) * y); //Calculating the sum
}
final_sum = (final_sum + pow(upper_limit , 2)) * x / 3; //Final sum
return final_sum;
}
int main()
{
printf("The integral value of x2 between limits 2 and 3 is %lf \n" , integration());
system("PAUSE");
return 0;
}
Thanks in advance,
Dom
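The integration routine in the question applies Simpson's rule with the coefficient pattern 1, 4, 2, 4, ..., 1. As a cross-check that is independent of the Visual Studio problem, the same rule can be sketched in plain C with the integrand passed in as a function pointer. The names simpson and square are illustrative assumptions, not part of the professor's code.

```c
#include <assert.h>

/* Integrand used in the question: f(x) = x^2. */
static double square(double x)
{
    return x * x;
}

/* Composite Simpson's rule on [a, b] with n intervals; n must be even. */
double simpson(double (*f)(double), double a, double b, int n)
{
    assert(f != 0 && n > 0 && n % 2 == 0);

    double h = (b - a) / n;                 /* width of one interval */
    double sum = f(a) + f(b);               /* end points, weight 1  */

    for (int i = 1; i < n; i++) {
        double weight = (i % 2 != 0) ? 4.0 : 2.0;   /* odd: 4, even: 2 */
        sum += weight * f(a + i * h);
    }
    return sum * h / 3.0;
}
```

Simpson's rule is exact for polynomials up to degree three, so simpson(square, 2.0, 3.0, 4) should agree with the analytic value 19/3 up to rounding error.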

Data Gathering Portion of CUDA Code is Unexpectedly Outputting "0"s

EDIT
In the initial posting's code snippet (see below) I was not properly sending the struct to the device. This has been fixed, but the results are still the same. In my full code this mistake was not present. (There were two mistakes in that command in my initial posting: first, the copy of the structure was meant to be HostToDevice but was actually reversed, and second, the size of the copy was wrong. Apologies; both errors were fixed, but the recompiled code still displays the zeros phenomenon described below, as does my full code.)
EDIT 2
In the haste of my de-proprietarization rewrite of the code I made a couple of errors, which dalekchef kindly pointed out to me: in my rewritten code, the copy of the struct to the device was performed BEFORE the allocation on the device, and the sizes in the device cudaMalloc calls were not multiplied by the sizeof(...) of the array element type. I added these fixes, recompiled and retested, but it did not fix the problem. I also double-checked my original code; it did not have those mistakes. Apologies again for the confusion.
I'm trying to dump statistics from a large simulation program. A similar pared-down code is displayed below. Both codes exhibit the same problem: they output zeroes when they should be outputting averaged values.
#include "stdio.h"
struct __align__(8) DynamicVals
{
double a;
double b;
int n1;
int n2;
int perDump;
};
__device__ int *dev_arrN1, *dev_arrN2;
__device__ double *dev_arrA, *dev_arrB;
__device__ DynamicVals *dev_myVals;
__device__ int stepsA, stepsB;
__device__ double sumA, sumB;
__device__ int stepsN1, stepsN2;
__device__ int sumN1, sumN2;
__global__ void TEST
(int step, double dev_arrA[], double dev_arrB[],
int dev_arrN1[], int dev_arrN2[],DynamicVals *dev_myVals)
{
if (step % dev_myVals->perDump)
{
dev_arrN1[step/dev_myVals->perDump] = 0;
dev_arrN2[step/dev_myVals->perDump] = 0;
dev_arrA[step/dev_myVals->perDump] = 0.0;
dev_arrB[step/dev_myVals->perDump] = 0.0;
stepsA = 0;
stepsB = 0;
stepsN1 = 0;
stepsN2 = 0;
sumA = 0.0;
sumB = 0.0;
sumN1 = 0;
sumN2 = 0;
}
sumA += dev_myVals->a;
sumB += dev_myVals->b;
sumN1 += dev_myVals->n1;
sumN2 += dev_myVals->n2;
stepsA++;
stepsB++;
stepsN1++;
stepsN2++;
if ( sumA > 100000000 )
{
dev_arrA[step/dev_myVals->perDump] +=
sumA / stepsA;
sumA = 0.0;
stepsA = 0;
}
if ( sumB > 100000000 )
{
dev_arrB[step/dev_myVals->perDump] +=
sumB / stepsB;
sumB = 0.0;
stepsB = 0;
}
if ( sumN1 > 1000000 )
{
dev_arrN1[step/dev_myVals->perDump] +=
sumN1 / stepsN1;
sumN1 = 0;
stepsN1 = 0;
}
if ( sumN2 > 1000000 )
{
dev_arrN2[step/dev_myVals->perDump] +=
sumN2 / stepsN2;
sumN2 = 0;
stepsN2 = 0;
}
if ((step+1) % dev_myVals->perDump)
{
dev_arrA[step/dev_myVals->perDump] +=
sumA / stepsA;
dev_arrB[step/dev_myVals->perDump] +=
sumB / stepsB;
dev_arrN1[step/dev_myVals->perDump] +=
sumN1 / stepsN1;
dev_arrN2[step/dev_myVals->perDump] +=
sumN2 / stepsN2;
}
}
int main()
{
const int TOTAL_STEPS = 10000000;
DynamicVals vals;
int *arrN1, *arrN2;
double *arrA, *arrB;
int statCnt;
vals.perDump = TOTAL_STEPS/10;
statCnt = TOTAL_STEPS/vals.perDump+1;
vals.a = 30000.0;
vals.b = 60000.0;
vals.n1 = 10000;
vals.n2 = 20000;
cudaMalloc( (void**)&dev_arrA, statCnt*sizeof(double) );
cudaMalloc( (void**)&dev_arrB, statCnt*sizeof(double) );
cudaMalloc( (void**)&dev_arrN1, statCnt*sizeof(int) );
cudaMalloc( (void**)&dev_arrN2, statCnt*sizeof(int) );
cudaMalloc( (void**)&dev_myVals, sizeof(DynamicVals));
cudaMemcpy(dev_myVals, &vals, sizeof(DynamicVals),
cudaMemcpyHostToDevice);
arrA = (double *)malloc(statCnt * sizeof(double));
arrB = (double *)malloc(statCnt * sizeof(double));
arrN1 = (int *)malloc(statCnt * sizeof(int));
arrN2 = (int *)malloc(statCnt * sizeof(int));
for (int i=0; i< TOTAL_STEPS; i++)
TEST<<<1,1>>>(i, dev_arrA,dev_arrB,dev_arrN1,dev_arrN2,dev_myVals);
cudaMemcpy(arrA,dev_arrA,statCnt * sizeof(double),cudaMemcpyDeviceToHost);
cudaMemcpy(arrB,dev_arrB,statCnt * sizeof(double),cudaMemcpyDeviceToHost);
cudaMemcpy(arrN1,dev_arrN1,statCnt * sizeof(int),cudaMemcpyDeviceToHost);
cudaMemcpy(arrN2,dev_arrN2,statCnt * sizeof(int),cudaMemcpyDeviceToHost);
for (int i=0; i< statCnt; i++)
{
printf("Step: %d ; A=%g B=%g N1=%d N2=%d\n",
i*vals.perDump,
arrA[i], arrB[i], arrN1[i], arrN2[i]);
}
}
Output:
Step: 0 ; A=0 B=0 N1=0 N2=0
Step: 1000000 ; A=0 B=0 N1=0 N2=0
Step: 2000000 ; A=0 B=0 N1=0 N2=0
Step: 3000000 ; A=0 B=0 N1=0 N2=0
Step: 4000000 ; A=0 B=0 N1=0 N2=0
Step: 5000000 ; A=0 B=0 N1=0 N2=0
Step: 6000000 ; A=0 B=0 N1=0 N2=0
Step: 7000000 ; A=0 B=0 N1=0 N2=0
Step: 8000000 ; A=0 B=0 N1=0 N2=0
Step: 9000000 ; A=0 B=0 N1=0 N2=0
Step: 10000000 ; A=0 B=0 N1=0 N2=0
Now, if I were to use a small period for my dumps, or if my numbers were smaller, I could get away with just a direct
add
divide by period at the end of the period
...algorithm, but I use temporary sums because otherwise my int would overflow (the double wouldn't overflow, but I was concerned about it losing precision).
If I use the above direct algorithm for smaller values I get correct non-zero values, but the second I use the intermediates (e.g. stepsA, sumA, etc.) the values go to zero.
I know I'm doing something silly here... what am I missing?
Notes:
A.) Yes, I know this code in its above form is not parallel and by itself does not warrant parallelization. It is part of a small statistics collecting portion of a much longer code. In that code it is encased in a thread index specific conditional logic to prevent clashing (making it parallel) and serves as data gathering to a simulations program (which warrants parallelization). Hopefully you can understand where the above code originates and avoid snide comments about its lack of thread-safety. (This disclaimer is added out of past experience receiving unproductive comments from people who didn't understand I was posting an excerpt, not a full code, despite me writing in less explicit terms as such.)
B.) Yes, I know the names of the variables are ambiguous. That is the point. The code I'm working on is proprietary, though it will eventually be open sourced. I only write this as I have posted similarly anonymized codes in the past and received rude commentary about my naming convention.
C.) Yes, I have read the CUDA manual several times, though I do make errors and I admit there are some features I don't understand. I'm not using shared memory here, but I am using shared memory (OF COURSE) in my full code.
D.) Yes, the above code does represent the exact same features as the data dumping portion of my non-working code, with the logic not related to this particular problem removed, and with it the thread safety conditional. The variable names have been changed, but algorithmically it should be unaltered and this is verified by the exact same non-working output (zeroes).
E.) I do realize the "dynamic" struct in the above snippet has non-dynamic values. I named the structure that because in the full code, this struct contains simulations data, and is dynamic. The static nature in the pared-down code should not make the statistics collecting code fail, it will simply mean that the average for each dump should be constant (and non-zero).
A couple of things:
It seems like you are calling cudaMemcpy for dev_myVals before you are calling cudaMalloc for it. This is not how it should be.
Also, you do not multiply by sizeof(int) in your cudaMalloc calls.
You should really check all of your CUDA calls (cudaMalloc/cudaMemcpy) for an error code. They all return either an error code or cudaSuccess. I believe the CUDA examples all show how to do this.
Also, for future reference, NEVER use the modulo operator in CUDA; it is incredibly slow. Just Google "Modulo CUDA" for some alternatives.
Let me know how it goes; this will probably take a couple of iterations to fix.
The biggest problem I see here is one of scope. The way this code is written leads me to conclude that you might not understand how variable scoping in C++ works in general, and how device and host code scope works in CUDA in particular. A couple of observations:
When you do this type of thing in code:
__device__ double *dev_arrA, *dev_arrB;
__global__ void TEST(int step, double dev_arrA[], double dev_arrB[], ....)
you have a variable scope problem. dev_arrA is declared at both compilation unit scope and function scope. The two declarations do not refer to the same variable: inside the kernel, the function scope declaration takes precedence over the compilation unit scope declaration. When you modify that variable, you are modifying the kernel scope declaration, not the __device__ variable. This can lead to all sorts of subtle and unexpected behaviour. It is much better to avoid ever having the same variable declared at multiple scopes.
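The same shadowing rule can be seen in plain host-side C, without any CUDA involved. The sketch below is an analogue, not the poster's code: values and fill are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* File-scope variable, playing the role of the __device__ pointer. */
double *values = NULL;

/* The parameter is also called values, so inside this function the name
   refers to the parameter and the file-scope variable is hidden. */
void fill(double values[], size_t n)
{
    assert(values != NULL);
    for (size_t i = 0; i < n; i++)
        values[i] = 1.0;          /* writes through the parameter only */
}
```

After calling fill on a local array, the array holds ones while the file-scope values is still NULL. Compilers can flag this pattern; GCC and Clang do so with -Wshadow.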
When you declare a variable using the __device__ specifier, it is intended to be exclusively a device context symbol, and should only be used directly in device code. So something like this:
__device__ double *dev_arrA;
int main()
{
....
cudaMalloc( (void**)&dev_arrA, statCnt*sizeof(double) );
....
}
is illegal. You cannot call an API function like cudaMalloc directly on a __device__ variable. Even though it will compile (because of the hackery involved in the CUDA compilation trajectories for host and device code), it is incorrect to do so. In the above example dev_arrA is a device symbol. You can interact with it via the API symbol manipulation calls, but that is all it is technically legal to do. In your code, variables intended to hold device pointers and be passed as kernel arguments (like dev_arrA) should be declared at main() scope, and passed by value to the kernel.
It is a combination of the above two things which is probably causing your problems.
But the difficulty is that you have chosen to post roughly 150 lines of code (a lot of which is redundant) as a repro case. I doubt anyone cares enough about your problems to go through that much code with a fine-tooth comb and pinpoint where the precise problem is. Further, your habit of doing these nasty "top edits" in your questions quickly turns what might have been reasonably written starting points into unintelligible pseudo-changelogs which are incredibly hard to follow and are unlikely to be of help to anyone. Also, the mildly passive-aggressive notes section serves no real purpose; it adds nothing of value to the question.
So I will leave you with a greatly simplified version of the code you posted which I think has all the basic things which you are trying to do working. I leave it as an "exercise for the reader" to turn it back into whatever it is that you are trying to do.
#include "stdio.h"
typedef float Real;
struct __align__(8) DynamicVals
{
Real a;
int n1;
int perDump;
};
__device__ int stepsA;
__device__ Real sumA;
__device__ int stepsN1;
__device__ int sumN1;
__global__ void TEST
(int step, Real dev_arrA[], int dev_arrN1[], DynamicVals *dev_myVals)
{
if (step % dev_myVals->perDump)
{
dev_arrN1[step/dev_myVals->perDump] = 0;
dev_arrA[step/dev_myVals->perDump] = 0.0;
stepsA = 0;
stepsN1 = 0;
sumA = 0.0;
sumN1 = 0;
}
sumA += dev_myVals->a;
sumN1 += dev_myVals->n1;
stepsA++;
stepsN1++;
dev_arrA[step/dev_myVals->perDump] += sumA / stepsA;
dev_arrN1[step/dev_myVals->perDump] += sumN1 / stepsN1;
}
inline void gpuAssert(cudaError_t code, const char *file, int line,
bool abort=true)
{
if (code != cudaSuccess)
{
fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code),
file, line);
if (abort) exit(code);
}
}
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
int main()
{
const int TOTAL_STEPS = 1000;
DynamicVals vals;
int *arrN1;
Real *arrA;
int statCnt;
vals.perDump = TOTAL_STEPS/10;
statCnt = TOTAL_STEPS/vals.perDump;
vals.a = 30000.0;
vals.n1 = 10000;
Real *dev_arrA;
int *dev_arrN1;
DynamicVals *dev_myVals;
gpuErrchk( cudaMalloc( (void**)&dev_arrA, statCnt*sizeof(Real)) );
gpuErrchk( cudaMalloc( (void**)&dev_arrN1, statCnt*sizeof(int)) );
gpuErrchk( cudaMalloc( (void**)&dev_myVals, sizeof(DynamicVals)) );
gpuErrchk( cudaMemcpy(dev_myVals, &vals, sizeof(DynamicVals),
cudaMemcpyHostToDevice) );
arrA = (Real *)malloc(statCnt * sizeof(Real));
arrN1 = (int *)malloc(statCnt * sizeof(int));
for (int i=0; i< TOTAL_STEPS; i++) {
TEST<<<1,1>>>(i, dev_arrA,dev_arrN1,dev_myVals);
gpuErrchk( cudaPeekAtLastError() );
}
gpuErrchk( cudaMemcpy(arrA,dev_arrA,statCnt * sizeof(Real),
cudaMemcpyDeviceToHost) );
gpuErrchk( cudaMemcpy(arrN1,dev_arrN1,statCnt * sizeof(int),
cudaMemcpyDeviceToHost) );
for (int i=0; i< statCnt; i++)
{
printf("Step: %d ; A=%g N1=%d\n",
i*vals.perDump, arrA[i], arrN1[i] );
}
}

OpenCL kernel argument struct has zero values

I'm having several problems with OpenCL (total noob), but I think that if I manage to solve this one I will be able to solve some of the others. I have the following kernel, which should store in a double array a number calculated from the data of a struct. The argument that I pass to the kernel is a struct array; it is initialised and its values are non-zero (I tested it).
When executing the kernel, though, I get a "Floating point exception". If I got it right, it means that the local_density variable is zero and the division causes an error. What I don't get is why it is zero, since on the host the values are non-zero. Am I doing something wrong in the kernel?
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
typedef struct
{
double speeds[9];
} t_speed;
__kernel void prepare(__global const t_speed* cells,
__global const int* obstacles,
__global double* results,
const unsigned int count)
{
int pos = get_global_id(0);
if(pos >= count) return;
if(obstacles[pos] == 1) results[pos] = 0.00;
else
{
double local_density = 0.00;
for(int kk = 0; kk < 9; kk++)
local_density += cells[pos].speeds[kk];
results[pos] = (cells[pos].speeds[1] + cells[pos].speeds[5] +
cells[pos].speeds[8] - (cells[pos].speeds[3] +
cells[pos].speeds[6] + cells[pos].speeds[7])) /
local_density;
}
}
Here is also the initialization of the variable that I pass as an argument. params->ny and params->nx have correct values.
cells = (t_speed*) malloc(sizeof(t_speed) * (params->ny * params->nx));
I also quote the kernel argument setup for the cells variable.
m_cells = clCreateBuffer(context, CL_MEM_READ_ONLY, sizeof(t_speed) * count, NULL, NULL);
err = clEnqueueWriteBuffer(commands, m_cells, CL_TRUE, 0, sizeof(t_speed) * count, cells, 0, NULL, NULL);
err |= clSetKernelArg(av_velocity_prepare_kernel, 0, sizeof(cl_mem), &m_cells);
------------------------------------------ EDIT ------------------------------------------
OK, what is really weird is that I'm getting the same error (Floating point exception) even with the following very simple kernel. Does anyone have a clue?
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
__kernel void test(__global float* result, const unsigned int n)
{
int i = get_global_id(0);
if(i >= n) return;
result[i] += 1.0f;
}
I noticed that you are declaring your buffer as CL_MEM_READ_ONLY, yet you are writing to it inside the kernel. According to the OpenCL spec, this is undefined. Try using CL_MEM_READ_WRITE instead.
OK, so it was a completely different thing than I thought it was. The problem was that when I was calling
clEnqueueNDRangeKernel (command_queue, kernel, work_dim, *global_work_offset,
*global_work_size, *local_work_size, num_events_in_wait_list,
*event_wait_list, *event)
the global_work_size was not divisible by local_work_size. That caused the Floating point exception.
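A common way to satisfy this divisibility requirement is to round the global size up to the next multiple of the local size on the host and let the kernel skip the surplus work-items, which the "if(pos >= count) return;" guard in the kernel above already does. A minimal host-side sketch, with round_up_global_size as a hypothetical helper name:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: smallest multiple of local_size that is
   greater than or equal to work_items. */
size_t round_up_global_size(size_t work_items, size_t local_size)
{
    assert(local_size > 0);
    size_t rem = work_items % local_size;
    return rem == 0 ? work_items : work_items + (local_size - rem);
}
```

For example, 1000 work-items with a local size of 64 give a global size of 1024; the 24 extra work-items return immediately because of the bounds check in the kernel.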
