CUDA returns wrong value from __device__ function when cos is used

I'm trying to run this code on an NVIDIA GPU and it returns strange values. It consists of two modules, main.cu and exmodul.cu (listed below). To build it I use:
nvcc -dc -arch sm_35 main.cu
nvcc -dc -arch sm_35 exmodul.cu
nvcc -arch sm_35 -lcudart -o main main.o exmodul.o
When I run it, the last line of the output is wrong: gd must be 1.
result=0
result=0
result=0
result=0
gd=-0.5
When I change 1.0 in exmodul.cu to a number greater than 1.000000953 or below 0.999999999999999945, it returns the proper result.
When I change 1.1 in exmodul.cu to another value it also fails, except for the value 1.0.
Behavior doesn't depend on constant 2.0 in the same module.
When I use another function instead of cos like sin or exp it works properly.
Use of double q = cos(1.1); has no effect.
When I copy function extFunc() to module main.cu it works properly.
If I uncomment *gd=1.0; in main.cu it returns correct 1.0.
Tested on Nvidia GT750M and GeForce GTX TITAN Black. (on GT750M returns different value gd=6.1232329394368592e-17 but still wrong). OS: Debian Jessie.
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2013 NVIDIA Corporation
Built on Thu_Mar_13_11:58:58_PDT_2014
Cuda compilation tools, release 6.0, V6.0.1
Have you got any idea what is wrong?
Thanks, Lukas
main.cu
#include <stdio.h> // printf
#include "exmodul.h" // extFunc
__global__ void mykernel(double*gd);
void deviceCheck();
int main(int argc, char *argv[])
{
double gd, *d_gd;
cudaMalloc(&d_gd, sizeof(double)); deviceCheck();
mykernel<<<1,1>>>(d_gd); deviceCheck();
cudaMemcpy(&gd, d_gd, sizeof(double), cudaMemcpyDeviceToHost);
deviceCheck();
cudaFree(d_gd); deviceCheck();
fprintf(stderr,"gd=%.17g\n",gd);
return 0;
}
void deviceCheck()
{
cudaError_t result = cudaSuccess;
cudaDeviceSynchronize();
result = cudaGetLastError();
fprintf(stderr,"result=%d\n",result); fflush(stderr);
}
__global__ void mykernel(double *gd)
{
*gd = extFunc();
//*gd=1.0;
__syncthreads();
return;
}
exmodul.cu
#include "exmodul.h"
__device__ double extFunc()
{
double q = 1.1;
q = cos(q);
if(q<2.0) { q = 1.0; }
return q;
}
exmodul.h
__device__ double extFunc();

I was able to reproduce problematic behavior on a supported config (CUDA 6.5, CentOS 6.2, K40).
When I switched from CUDA 6.5 to CUDA 7 RC, the problem went away.
The problem also did not appear to be reproducible on an older config (CUDA 5.5, CentOS 6.2, M2070)
I suggest switching to CUDA 7 RC to address this issue. I suspect an underlying bug in the compilation process that has been fixed already.

Related

How do I link the Accelerate Framework to a c program in MacOs?

I just started with C development and I need to compile and link a program that uses Apple's Accelerate framework:
Simple example accelerate.c:
#include <stdio.h>
#include <Accelerate/Accelerate.h>
double vectorvector_product(double * a, double * b, int dim){
// This function returns the dot product of a and b;
// a and b must have the same dimension dim.
return cblas_ddot(dim,a,1,b,1);
}
int main(){
double a[4] = {1.0,2.0,3.0,4.0};
double b[4] = {1.0,2.0,3.0,4.0};
double res = vectorvector_product(a,b,4);
printf("Res: %f",res);
}
I compiled it with clang:
>>> cc -Wall -g -c accelerate.c
And obtained a new file accelerate.o
What would I do now in order to properly link it?
All I know is that this Accelerate framework is located at /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/System/Library/Frameworks/Accelerate.framework
>>> ls /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/System/Library/Frameworks/Accelerate.framework
Accelerate.tbd Frameworks Headers Modules Versions
P.S.: If I run this program with Xcode it magically works, but I need to do it from the command line and I would like to know what I'm doing.
Apparently the correct way to link against the Accelerate framework is to pass -framework Accelerate as an argument, e.g.
>>> cc -framework Accelerate accelerate.c
will compile and link accelerate.c by generating an executable a.out.

Why does this SIMD example code in C compile with MinGW but the executable doesn't run on my Windows machine?

I'm learning the basics of SIMD so I was given a simple code snippet to see the principle at work with SSE and SSE2.
I recently installed MinGW to compile C code on Windows with gcc instead of using the Visual Studio compiler.
The objective of the example is to add two floats and then multiply by a third one.
The headers included are the following (which I guess are used to be able to use the SSE intrinsics):
#include <time.h>
#include <stdio.h>
#include <xmmintrin.h>
#include <pmmintrin.h>
#include <stdlib.h> // rand, RAND_MAX
#include <sys/time.h> // for timing
Then I have a function to check what time it is, to compare time between calculations:
double now(){
struct timeval t; double f_t;
gettimeofday(&t, NULL);
f_t = t.tv_usec; f_t = f_t/1000000.0; f_t +=t.tv_sec;
return f_t;
}
The function to do the calculation in the "scalar" sense is the following:
void run_scalar(){
unsigned int i;
for( i = 0; i < N; i++ ){
rs[i] = (a[i]+b[i])*c[i];
}
}
Here is the code for the sse2 function:
void run_sse2(){
unsigned int i;
__m128 *mm_a = (__m128 *)a;
__m128 *mm_b = (__m128 *)b;
__m128 *mm_c = (__m128 *)c;
__m128 *mm_r = (__m128 *)rv;
for( i = 0; i <N/4; i++)
mm_r[i] = _mm_mul_ps(_mm_add_ps(mm_a[i],mm_b[i]),mm_c[i]);
}
The vectors are defined the following way (N is the size of the vectors and it is defined elsewhere) and a function init() is called to initialize them:
float a[N] __attribute__((aligned(16)));
float b[N] __attribute__((aligned(16)));
float c[N] __attribute__((aligned(16)));
float rs[N] __attribute__((aligned(16)));
float rv[N] __attribute__((aligned(16)));
void init(){
unsigned int i;
for( i = 0; i < N; i++ ){
a[i] = (float)rand () / RAND_MAX / N;
b[i] = (float)rand () / RAND_MAX / N;
c[i] = (float)rand () / RAND_MAX / N;
}
}
Finally here is the main that calls the functions and prints the results and computing time.
int main(){
double t;
init();
t = now();
run_scalar();
t = now()-t;
printf("S = %10.9f Scalar code time: %f second(s)\n",1e5*sum(rs),t);
t = now();
run_sse2();
t = now()-t;
printf("S = %10.9f Vector code 2 time: %f second(s)\n",1e5*sum(rv),t);
}
If I compile this code with the command line "gcc -o vec vectorial.c -msse -msse2 -msse3" or "mingw32-gcc -o vec vectorial.c -msse -msse2 -msse3", it compiles without any problems, but for some reason I can't run it on my Windows machine. In the command prompt I get "Access is denied", and a big message appears on the screen saying "This app can't run on your PC, to find a version for your PC, check with the software publisher".
I don't really understand what is going on, neither do I have much experience with MinGW or C (just an introductory course to C++ done on Linux machines). I've tried playing around with different headers because I thought maybe I was targeting a different processor than the one on my PC but couldn't solve the issue. Most of the info I found was confusing.
Can someone help me understand what is going on? Is it a problem in the MinGW configuration, which might be targeting a Linux platform? Is it something in the code that has no equivalent on Windows?
I'm trying to run it on a 64-bit Windows 8.1 PC.
Edit: Tried the configuration suggested in the site linked below. The output remains the same.
If I try to run it through MSYS I get "Bad file number".
If I try to run it through the command prompt I get "Access is denied".
I'm guessing there's some sort of bug arising from permissions. Tried turning off the antivirus and User Account control but still no luck.
Any ideas?
There is nothing wrong with your code. You did not provide the definitions of sum() or N, but that is not a problem. The switches -msse -msse2 do not appear to be required.
I was able to compile and run your code on Linux (Ubuntu x86_64, compiled with gcc 4.8.2 and 4.6.3, on Atom D2700 and AMD Athlon LE-1640) and Windows7/64 (compiled with gcc 4.5.3 (32bit) and 4.8.2 (64bit), on Core i3-4330 and Core i7-4960X). It was running without problem.
Are you sure your CPU supports the required instructions? What exactly was the error code you got? Which MinGW configuration did you use? Out of curiosity, I used the one available at http://win-builds.org/download.html which was very straight-forward.
However, using the optimization flag -O3 created the best result -- with the scalar loop! Also useful are -m64 -mtune=native -s.

Integrating FFTW C function calls inside SystemVerilog code

I have installed the FFTW C library successfully on my Linux system. More info about FFTW is at http://www.fftw.org/. I have sample C code that calls FFTW functions successfully. Below are the C code and the command to build it:
Code:
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <math.h>
#include <fftw3.h>
int main(void)
{
double FFT_in[] = {0.1, 0.6, 0.1, 0.4, 0.5, 0, 0.8, 0.7, 0.8, 0.6, 0.1,0};
double *IFFT_out;
int i,size = 12;
fftw_complex *middle;
fftw_plan fft;
fftw_plan ifft;
middle = (fftw_complex*) fftw_malloc(sizeof(fftw_complex)*size);
IFFT_out = (double *) malloc(size*sizeof(double));
fft = fftw_plan_dft_r2c_1d(size, FFT_in, middle, FFTW_ESTIMATE); //Setup fftw plan for fft (real 1D data)
ifft = fftw_plan_dft_c2r_1d(size, middle, IFFT_out, FFTW_ESTIMATE); //Setup fftw plan for ifft
fftw_execute(fft);
fftw_execute(ifft);
printf("Input: \tFFT_coefficient[i][0] \tFFT_coefficient[i][1] \tRecovered Output:\n");
for(i=0;i<size;i++)
printf("%f\t%f\t\t\t%f\t\t\t%f\n",(FFT_in[i]),middle[i][0],middle[i][1],IFFT_out[i]/size);
fftw_destroy_plan(fft);
fftw_destroy_plan(ifft);
fftw_free(middle);
free(IFFT_out);
return 0;
}
I can build this code successfully with the gcc command below:
gcc -g -Wall -I/home/usr/fftw/local/include -L/home/usr/fftw/local/lib fftw_test.c -lfftw3 -lm -o fftw_test
Now I want to call this function from SystemVerilog using the DPI method. Before I show my issue, below is a sample DPI-C/SystemVerilog test case which I could run successfully:
C code:
#include <stdio.h>
#include <stdlib.h>
#include "svdpi.h"
int add(int x, int y)
{
int z;
z=x+y;
printf("This is from C:%d+%d=%d\n",x,y,z);
return z;
}
Calling the above C function from SystemVerilog using DPI:
module top;
import "DPI-C" function int add(int a, int b);
int a,b,j;
initial begin
$display("Entering in SystemVerilog Initial Block\n");
#20
a=20;b=10;
j = add(a,b);
$display("This is from System Verilog:Value of J=%d",j);
$display("Exiting from SystemVerilog Initial Block");
#5 $finish;
end
endmodule
Finally, I could run this successfully with the irun command
irun -f run.f, where run.f contains the following:
# Compile the SystemVerilog files
basicadd.sv
-access +rwc
# Generate a header file called _sv_export.h
-dpiheader _sv_export.h
# Delay compilation of testexport.c until after elaboration
-cpost add.c -end
# Redirect output of ncsc_run to a log file called ncsc_run.log
-log_ncsc_run ncsc_run.log
Now my real issue: when I try to link FFTW with SystemVerilog, I am not able to run it. Below is my C code, which is very similar to the first C code I posted:
C code:
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <math.h>
#include <fftw3.h>
#include "svdpi.h"
void fftw_test_DPI(double FFT_in[],int size)
{
double *IFFT_out;
int i;
fftw_complex *middle;
fftw_plan fft;
fftw_plan ifft;
middle = (fftw_complex*) fftw_malloc(sizeof(fftw_complex)*size);
IFFT_out = (double *) malloc(size*sizeof(double));
fft = fftw_plan_dft_r2c_1d(size, FFT_in, middle, FFTW_ESTIMATE); //Setup fftw plan for fft (real 1D data)
ifft = fftw_plan_dft_c2r_1d(size, middle, IFFT_out, FFTW_ESTIMATE); //Setup fftw plan for ifft
fftw_execute(fft);
fftw_execute(ifft);
printf("Input: \tFFT_coefficient[i][0] \tFFT_coefficient[i][1] \tRecovered Output:\n");
for(i=0;i<size;i++)
printf("%f\t%f\t\t\t%f\t\t\t%f\n",FFT_in[i],middle[i][0],middle[i][1],IFFT_out[i]/size);
fftw_destroy_plan(fft);
fftw_destroy_plan(ifft);
fftw_free(middle);
free(IFFT_out);
//return IFFT_out;
}
Here is my SystemVerilog code, where I call the above C function:
module top;
import "DPI-C" function void fftw_test_DPI(real FFT_in[0:11], int size);
real j [0:11];
integer i,size;
real FFT_in [0:11];
initial begin
size = 12;
FFT_in[0] = 0.1;
FFT_in[1] = 0.6;
FFT_in[2] = 0.1;
FFT_in[3] = 0.4;
FFT_in[4] = 0.5;
FFT_in[5] = 0.0;
FFT_in[6] = 0.8;
FFT_in[7] = 0.7;
FFT_in[8] = 0.8;
FFT_in[9] = 0.6;
FFT_in[10] = 0.1;
FFT_in[11] = 0.0;
$display("Entering in SystemVerilog Initial Block\n");
#20
fftw_test_DPI(FFT_in,size);
$display("Printing recovered output from system verilog\n");
//for(i=0;i<size;i++)
//$display("%f\t\n",(j[i])/size);
$display("Exiting from SystemVerilog Initial Block");
#5 $finish;
end
endmodule
Finally, I tried running it in several ways. I will mention a couple of them:
1st method:
irun -f run_fftw.f where run_fftw.f contains:
# Compile the SystemVerilog files
fftw_test.sv
-access +rwc
# Generate a header file called _sv_export.h
-dpiheader _sv_export.h
# Delay compilation of fftw_test.c until after elaboration
-cpost fftw_test_DPI.c -end
-I/home/usr/fftw/local/include -L/home/usr/fftw/local/lib fftw_test_DPI.c -lfftw3 -lm
# Redirect output of ncsc_run to a log file called ncsc_run.log
-log_ncsc_run ncsc_run.log
but this results in the error below:
building library run.so
ld: /home/usr/fftw/local/lib/libfftw3.a(mapflags.o): relocation R_X86_64_32 against `.rodata' can not be used when making a shared object; recompile with -fPIC
/home/usr/fftw/local/lib/libfftw3.a: could not read symbols: Bad value
collect2: ld returned 1 exit status
make: * [/home/usr/DPI/./INCA_libs/irun.lnx8664.12.20.nc/librun.so] Error 1
ncsc_run: *E,TBBLDF: Failed to build test library
/home/usr/DPI/./INCA_libs/irun.lnx8664.12.20.nc/librun.so
2nd method:
1.gcc -fPIC -O2 -c -g -I/home/ss69/fftw/local/include -I. -L/home/usr/fftw/local/lib -L. -o fftw_test_DPI.o fftw_test_DPI.c -lfftw3 -lm
2.gcc -shared -o libdpi.so fftw_test_DPI.o
3.irun -64bit -sv_lib libdpi.so fftw_test.sv
This results in the error below:
Loading snapshot worklib.top:sv .................... Done
ncsim> run
Entering in SystemVerilog Initial Block
ncsim: symbol lookup error: ./libdpi.so: undefined symbol: fftw_malloc
I know my post is kind of difficult to follow, but I would highly appreciate any help.
This is a good description and summary of the problem that we discussed earlier. I do not usually work on a *nix system, so I cannot suggest any specific details. What I will point out is that approaching a problem like this in steps is usually a good method.
Your first step of having the FFT code called directly from main() is good. However, I did not see a step where (NOT USING SystemVerilog or ncsim) you compiled the FFT code into a library and then called that code from main().
It would seem that getting the code stored in a separate library and called from main() is a necessary step. After you have that working, you should be able to include the FFT library in the SystemVerilog flow without having to compile the FFT routine at the same time.
You might need to export (or setenv) LD_LIBRARY_PATH as ${LD_LIBRARY_PATH}:/home/usr/fftw/local/lib:. (the trailing "." adds the current directory). Unless your libdpi.so is linked against the static FFTW library, your code will need to load the dynamic version of the FFTW library (libfftw.so?) at run time, because it uses the fftw_* APIs.

R.h and Rmath.h in native C program

"R.h" and "Rmath.h" are header files for an interface between R.app and C. But they seem to be usable only through the R command 'R CMD SHLIB something.c'.
I wish to compile my native C program with gcc so that it includes them. I'm using Snow Leopard, where I'm not able to locate those header files!
Any help?
Please see the 'Writing R Extensions' manual for details. You can easily compile and link against Rmath.h and the standalone R Math library, but not R.h. (You can use R.h via Rcpp / RInside, but that is a different story.)
There are a number of examples floating around for use of libRmath, one is in the manual itself. Here is one I ship in the Debian package r-mathlib containing this standalone math library:
/* copyright header omitted here for brevity */
#define MATHLIB_STANDALONE 1
#include <Rmath.h>
#include <stdio.h>
typedef enum {
BUGGY_KINDERMAN_RAMAGE,
AHRENS_DIETER,
BOX_MULLER,
USER_NORM,
INVERSION,
KINDERMAN_RAMAGE
} N01type;
int
main(int argc, char** argv)
{
/* something to force the library to be included */
qnorm(0.7, 0.0, 1.0, 0, 0);
printf("*** loaded '%s'\n", argv[0]);
set_seed(123, 456);
N01_kind = AHRENS_DIETER;
printf("one normal %f\n", norm_rand());
set_seed(123, 456);
N01_kind = BOX_MULLER;
printf("normal via BM %f\n", norm_rand());
return 0;
}
and on Linux you simply build like this (as I place the library and header in standard locations in the package; add -I and -L as needed on OS X)
/tmp $ cp -vax /usr/share/doc/r-mathlib/examples/test.c mathlibtest.c
`/usr/share/doc/r-mathlib/examples/test.c' -> `mathlibtest.c'
/tmp $ gcc -o mathlibtest mathlibtest.c -lRmath -lm
/tmp $ ./mathlibtest
*** loaded '/tmp/mathlibtest'
one normal 1.119638
normal via BM -1.734578
/tmp $

Some issue with Atomic add in CUDA kernel operation

I'm having an issue with my kernel.cu file.
When I call nvcc -v kernel.cu -o kernel.o I get this error:
kernel.cu(17): error: identifier "atomicAdd" is undefined
My code:
#include "dot.h"
#include <cuda.h>
#include "device_functions.h" //might call atomicAdd
__global__ void dot (int *a, int *b, int *c){
__shared__ int temp[THREADS_PER_BLOCK];
int index = threadIdx.x + blockIdx.x * blockDim.x;
temp[threadIdx.x] = a[index] * b[index];
__syncthreads();
if( 0 == threadIdx.x ){
int sum = 0;
for( int i = 0; i<THREADS_PER_BLOCK; i++)
sum += temp[i];
atomicAdd(c, sum);
}
}
Any suggestions?
You need to specify an architecture to nvcc which supports atomic memory operations (the default architecture is 1.0 which does not support atomics). Try:
nvcc -arch=sm_11 -v kernel.cu -o kernel.o
and see what happens.
EDIT in 2015 to note that the default architecture in CUDA 7.0 is now 2.0, which supports atomic memory operations, so this should not be a problem in newer toolkit versions.
Today with the latest cuda SDK and toolkit this solution will not work.
People also say that adding:
compute_11,sm_11; OR compute_12,sm_12; OR compute_13,sm_13;
compute_20,sm_20;
compute_30,sm_30;
to CUDA in the Project Properties in Visual Studio 2010 will work. It doesn't.
You have to specify this for the .cu file itself in its own properties (under the C++/CUDA->Device->Code Generation tab), such as:
compute_13,sm_13;
compute_20,sm_20;
compute_30,sm_30;
