Some issue with Atomic add in CUDA kernel operation

I'm having an issue with my kernel.cu file.
Calling nvcc -v kernel.cu -o kernel.o, I get this error:
kernel.cu(17): error: identifier "atomicAdd" is undefined
My code:
#include "dot.h"
#include <cuda.h>
#include "device_functions.h" //might call atomicAdd
__global__ void dot (int *a, int *b, int *c){
__shared__ int temp[THREADS_PER_BLOCK];
int index = threadIdx.x + blockIdx.x * blockDim.x;
temp[threadIdx.x] = a[index] * b[index];
__syncthreads();
if( 0 == threadIdx.x ){
int sum = 0;
for( int i = 0; i<THREADS_PER_BLOCK; i++)
sum += temp[i];
atomicAdd(c, sum);
}
}
Any suggestions?

You need to specify an architecture to nvcc which supports atomic memory operations (the default architecture is 1.0 which does not support atomics). Try:
nvcc -arch=sm_11 -v kernel.cu -o kernel.o
and see what happens.
EDIT in 2015 to note that the default architecture in CUDA 7.0 is now 2.0, which supports atomic memory operations, so this should not be a problem in newer toolkit versions.
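As a rough host-side illustration of the pattern the kernel in the question uses (each block sums its own products, then adds that partial sum to the shared result with one atomic add), here is a minimal sketch in plain C11. It is a CPU analog, not CUDA code: atomic_fetch_add from <stdatomic.h> stands in for atomicAdd, and THREADS_PER_BLOCK, dot_block and dot_host are hypothetical names and values, since dot.h is not shown in the question.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define THREADS_PER_BLOCK 4  /* hypothetical value; dot.h is not shown in the question */

/* Host-side model of one block: sum the block's products, then publish
   the partial sum with a single atomic add, as thread 0 does in the kernel. */
static void dot_block(const int *a, const int *b, size_t block, atomic_int *c)
{
    int sum = 0;
    for (size_t t = 0; t < THREADS_PER_BLOCK; t++) {
        size_t index = t + block * THREADS_PER_BLOCK;
        sum += a[index] * b[index];
    }
    atomic_fetch_add(c, sum);
}

/* n must be a multiple of THREADS_PER_BLOCK, mirroring the kernel,
   which has no bounds check on index. */
static int dot_host(const int *a, const int *b, size_t n)
{
    assert(n % THREADS_PER_BLOCK == 0);
    atomic_int c = 0;  /* the accumulator has to start at zero */
    for (size_t block = 0; block < n / THREADS_PER_BLOCK; block++)
        dot_block(a, b, block, &c);
    return atomic_load(&c);
}
```

Calling dot_host on small arrays gives a reference value to compare against the GPU result. The sketch also makes two assumptions of the kernel visible: the result location must be zeroed before the launch, and the input length must be a multiple of the block size.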

With the latest CUDA SDK and toolkit, this solution no longer works.
People also say that adding:
compute_11,sm_11; OR compute_12,sm_12; OR compute_13,sm_13;
compute_20,sm_20;
compute_30,sm_30;
to the CUDA settings in the Project Properties in Visual Studio 2010 will work. It doesn't.
You have to specify this for the .cu file itself, in its own properties (under the C++/CUDA->Device->Code Generation tab), such as:
compute_13,sm_13;
compute_20,sm_20;
compute_30,sm_30;

Related

How to translate neon intrinsics to llvm-IR using llvm-clang on x86

Using clang, we can generate LLVM IR by compiling a C program:
clang -S -emit-llvm hello.c -o hello.ll
I would like to translate NEON intrinsics to LLVM IR, for code like this:
/* neon_example.c - Neon intrinsics example program */
#include <stdint.h>
#include <stdio.h>
#include <assert.h>
#include <arm_neon.h>
/* fill array with increasing integers beginning with 0 */
void fill_array(int16_t *array, int size)
{ int i;
for (i = 0; i < size; i++)
{
array[i] = i;
}
}
/* return the sum of all elements in an array. This works by calculating 4 totals (one for each lane) and adding those at the end to get the final total */
int sum_array(int16_t *array, int size)
{
/* initialize the accumulator vector to zero */
int16x4_t acc = vdup_n_s16(0);
int32x2_t acc1;
int64x1_t acc2;
/* this implementation assumes the size of the array is a multiple of 4 */
assert((size % 4) == 0);
/* counting backwards gives better code */
for (; size != 0; size -= 4)
{
int16x4_t vec;
/* load 4 values in parallel from the array */
vec = vld1_s16(array);
/* increment the array pointer to the next element */
array += 4;
/* add the vector to the accumulator vector */
acc = vadd_s16(acc, vec);
}
/* calculate the total */
acc1 = vpaddl_s16(acc);
acc2 = vpaddl_s32(acc1);
/* return the total as an integer */
return (int)vget_lane_s64(acc2, 0);
}
/* main function */
int main()
{
int16_t my_array[100];
fill_array(my_array, 100);
printf("Sum was %d\n", sum_array(my_array, 100));
return 0;
}
But it doesn't support NEON intrinsics, and prints error messages like this:
/home/user/llvm-proj/build/bin/../lib/clang/4.0.0/include/arm_neon.h:65:24: error:
'neon_vector_type' attribute is not supported for this target
typedef __attribute__((neon_vector_type(8))) float16_t float16x8_t;
^
I think the reason is that my host is x86, but the target is ARM.
I have no idea how to cross-compile with Clang to produce LLVM IR (the clang version is 4.0, on Ubuntu 14.04).
Are there any target options or other tools that would help?
And is there any difference between the LLVM IR for SSE and for NEON?

Using ELLCC, a pre-packaged clang based tool chain (http://ellcc.org), I was able to compile and run your program by adding -mfpu=neon:
rich@dev:~$ ~/ellcc/bin/ecc -target arm32v7-linux -mfpu=neon neon.c
rich@dev:~$ ./a.
a.exe a.out
rich@dev:~$ file a.out
a.out: ELF 32-bit LSB executable, ARM, EABI5 version 1 (SYSV), statically linked, BuildID[sha1]=613c22f6bbc277a8d577dab7bb27cd64443eb390, not stripped
rich@dev:~$ ./a.out
Sum was 4950
rich@dev:~$
It was compiled on an x86 and I ran it using QEMU.
Using normal clang, you'll also need the appropriate -target option for ARM. ELLCC uses slightly different -target options.
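To check the "Sum was 4950" output on the x86 host without any ARM toolchain, a plain C model of sum_array() from the question can serve as a reference. This is only a sketch: sum_array_scalar is a hypothetical name, and the arrays acc and acc1 merely imitate the four 16-bit lanes and the two widening pairwise additions (the roles of vpaddl_s16 and vpaddl_s32). It is not what the compiler generates.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of sum_array(): four 16-bit lane totals, followed by
   two widening pairwise additions. */
static int sum_array_scalar(const int16_t *array, int size)
{
    int16_t acc[4] = {0, 0, 0, 0};

    /* same precondition as the NEON version */
    assert((size % 4) == 0);

    for (int i = 0; i < size; i += 4)
        for (int lane = 0; lane < 4; lane++)
            /* each lane total is kept in 16 bits, as with vadd_s16 */
            acc[lane] = (int16_t)(acc[lane] + array[i + lane]);

    /* 4 x int16 -> 2 x int32, then 2 x int32 -> 1 x int64 */
    int32_t acc1[2] = { (int32_t)acc[0] + acc[1], (int32_t)acc[2] + acc[3] };
    int64_t acc2 = (int64_t)acc1[0] + acc1[1];
    return (int)acc2;
}
```

For the array 0..99 used in main(), the four lane totals are 1200, 1225, 1250 and 1275, which add up to 4950. Because the lane totals are only 16 bits wide, larger inputs can overflow a lane before the widening step.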

Cuda returns wrong value from device function when cos is used

I'm trying to run this code on an NVIDIA GPU and it returns strange values. It consists of two modules, main.cu and exmodul.cu (listed below). For building I'm using:
nvcc -dc -arch sm_35 main.cu
nvcc -dc -arch sm_35 exmodul.cu
nvcc -arch sm_35 -lcudart -o main main.o exmodul.o
When I run it, the last line of the output is strange: gd should be 1.
result=0
result=0
result=0
result=0
gd=-0.5
When I change the 1.0 in exmodul.cu to a number greater than 1.000000953 or below 0.999999999999999945, it returns the proper result.
When I change the 1.1 in exmodul.cu to another value, it also fails, except for the value 1.0.
Behavior doesn't depend on constant 2.0 in the same module.
When I use another function instead of cos like sin or exp it works properly.
Use of double q = cos(1.1); has no effect.
When I copy function extFunc() to module main.cu it works properly.
If I uncomment *gd=1.0; in main.cu it returns correct 1.0.
Tested on an Nvidia GT750M and a GeForce GTX TITAN Black (the GT750M returns a different value, gd=6.1232329394368592e-17, which is still wrong). OS: Debian Jessie.
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2013 NVIDIA Corporation
Built on Thu_Mar_13_11:58:58_PDT_2014
Cuda compilation tools, release 6.0, V6.0.1
Have you got any idea what is wrong?
Thanks, Lukas
main.cu
#include <stdio.h> // printf
#include "exmodul.h" // extFunc
__global__ void mykernel(double*gd);
void deviceCheck();
int main(int argc, char *argv[])
{
double gd, *d_gd;
cudaMalloc(&d_gd, sizeof(double)); deviceCheck();
mykernel<<<1,1>>>(d_gd); deviceCheck();
cudaMemcpy(&gd, d_gd, sizeof(double), cudaMemcpyDeviceToHost);
deviceCheck();
cudaFree(d_gd); deviceCheck();
fprintf(stderr,"gd=%.17g\n",gd);
return 0;
}
void deviceCheck()
{
cudaError_t result = cudaSuccess;
cudaDeviceSynchronize();
result = cudaGetLastError();
fprintf(stderr,"result=%d\n",result); fflush(stderr);
}
__global__ void mykernel(double *gd)
{
*gd = extFunc();
//*gd=1.0;
__syncthreads();
return;
}
exmodul.cu
#include "exmodul.h"
__device__ double extFunc()
{
double q = 1.1;
q = cos(q);
if(q<2.0) { q = 1.0; }
return q;
}
exmodul.h
__device__ double extFunc();

I was able to reproduce problematic behavior on a supported config (CUDA 6.5, CentOS 6.2, K40).
When I switched from CUDA 6.5 to CUDA 7 RC, the problem went away.
The problem also did not appear to be reproducible on an older config (CUDA 5.5, CentOS 6.2, M2070)
I suggest switching to CUDA 7 RC to address this issue. I suspect an underlying bug in the compilation process that has been fixed already.

Why does this SIMD example code in C compile with minGW but the executable doesn't run on my windows machine?

I'm learning the basics of SIMD so I was given a simple code snippet to see the principle at work with SSE and SSE2.
I recently installed minGW to compile C code in windows with gcc instead of using the visual studio compiler.
The objective of the example is to add two floats and then multiply by a third one.
The headers included are the following (which I guess are used to be able to use the SSE intrinsics):
#include <stdio.h>
#include <stdlib.h> // for rand()
#include <time.h>
#include <xmmintrin.h>
#include <pmmintrin.h>
#include <sys/time.h> // for timing
Then I have a function to check what time it is, to compare time between calculations:
double now(){
struct timeval t; double f_t;
gettimeofday(&t, NULL);
f_t = t.tv_usec; f_t = f_t/1000000.0; f_t +=t.tv_sec;
return f_t;
}
The function to do the calculation in the "scalar" sense is the following:
void run_scalar(){
unsigned int i;
for( i = 0; i < N; i++ ){
rs[i] = (a[i]+b[i])*c[i];
}
}
Here is the code for the sse2 function:
void run_sse2(){
unsigned int i;
__m128 *mm_a = (__m128 *)a;
__m128 *mm_b = (__m128 *)b;
__m128 *mm_c = (__m128 *)c;
__m128 *mm_r = (__m128 *)rv;
for( i = 0; i <N/4; i++)
mm_r[i] = _mm_mul_ps(_mm_add_ps(mm_a[i],mm_b[i]),mm_c[i]);
}
The vectors are defined the following way (N is the size of the vectors and it is defined elsewhere) and a function init() is called to initialize them:
float a[N] __attribute__((aligned(16)));
float b[N] __attribute__((aligned(16)));
float c[N] __attribute__((aligned(16)));
float rs[N] __attribute__((aligned(16)));
float rv[N] __attribute__((aligned(16)));
void init(){
unsigned int i;
for( i = 0; i < N; i++ ){
a[i] = (float)rand () / RAND_MAX / N;
b[i] = (float)rand () / RAND_MAX / N;
c[i] = (float)rand () / RAND_MAX / N;
}
}
Finally here is the main that calls the functions and prints the results and computing time.
int main(){
double t;
init();
t = now();
run_scalar();
t = now()-t;
printf("S = %10.9f Temps du code scalaire : %f seconde(s)\n",1e5*sum(rs),t);
t = now();
run_sse2();
t = now()-t;
printf("S = %10.9f Temps du code vectoriel 2: %f seconde(s)\n",1e5*sum(rv),t);
}
For some reason, if I compile this code with the command line "gcc -o vec vectorial.c -msse -msse2 -msse3" or "mingw32-gcc -o vec vectorial.c -msse -msse2 -msse3", it compiles without any problems, but I can't run it on my Windows machine: in the command prompt I get an "access denied" error, and a big message appears on the screen saying "This app can't run on your PC, to find a version for your PC, check with the software publisher".
I don't really understand what is going on, neither do I have much experience with MinGW or C (just an introductory course to C++ done on Linux machines). I've tried playing around with different headers because I thought maybe I was targeting a different processor than the one on my PC but couldn't solve the issue. Most of the info I found was confusing.
Can someone help me understand what is going on? Is it a problem in the minGW configuration that is compiling in targeting a Linux platform? Is it something in the code that doesn't have the equivalent in windows?
I'm trying to run it on a 64 bit Windows 8.1 pc
Edit: Tried the configuration suggested in the site linked below. The output remains the same.
If I try to run it through MSYS, I get "Bad File number".
If I try to run it through the command prompt, I get "Access is Denied".
I'm guessing there's some sort of bug arising from permissions. Tried turning off the antivirus and User Account control but still no luck.
Any ideas?

There is nothing wrong with your code, except that you did not provide the definitions of sum() and N, which is not a problem, however. The switches -msse -msse2 appear not to be required.
I was able to compile and run your code on Linux (Ubuntu x86_64, compiled with gcc 4.8.2 and 4.6.3, on Atom D2700 and AMD Athlon LE-1640) and Windows7/64 (compiled with gcc 4.5.3 (32bit) and 4.8.2 (64bit), on Core i3-4330 and Core i7-4960X). It was running without problem.
Are you sure your CPU supports the required instructions? What exactly was the error code you got? Which MinGW configuration did you use? Out of curiosity, I used the one available at http://win-builds.org/download.html which was very straight-forward.
However, using the optimization flag -O3 created the best result -- with the scalar loop! Also useful are -m64 -mtune=native -s.
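Since sum() and N are not shown in the question, anyone trying to reproduce the program has to supply them. The following is a minimal sketch of what they might look like, not the asker's actual code: the value of N and the body of sum() are assumptions, and add_mul is a hypothetical scalar reference with explicit parameters that can be used to check the SSE result.

```c
#include <assert.h>
#include <stddef.h>

#define N 1024  /* hypothetical size; the question defines N elsewhere */

/* Hypothetical sum(): adds the N elements of v, accumulating in double
   so it can be passed straight to the %f conversions in main(). */
static double sum(const float *v)
{
    assert(v != NULL);
    double s = 0.0;
    for (size_t i = 0; i < N; i++)
        s += v[i];
    return s;
}

/* Scalar reference for (a + b) * c, the same arithmetic as run_scalar(). */
static void add_mul(const float *a, const float *b, const float *c,
                    float *r, size_t n)
{
    for (size_t i = 0; i < n; i++)
        r[i] = (a[i] + b[i]) * c[i];
}
```

Because run_sse2() processes N/4 groups of four floats, N should be a multiple of 4. Comparing sum(rs) with sum(rv) after both runs is a quick way to confirm that the vector version computes the same values.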

OpenMP in C using GCC in Windows to create DLL

I've recently started to play around with OpenMP and like it very much.
I am a just-for-fun Classic-VB programmer and like coding functions for my VB programs in C. As such, I use Windows 7 x64 and GCC 4.7.2.
I usually set up all my C functions in one large C file and then compile a DLL out of it. Now I would like to use OpenMP in my DLL.
First of all, I set up a simple example and compiled an exe file from it:
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
int main(void)
{
int n = 520000;
int i;
int a[n];
int NumThreads;
omp_set_num_threads(4);
#pragma omp parallel for
for (i = 0; i < n; i++)
{
a[i] = 2 * i;
NumThreads = omp_get_num_threads();
}
printf("Value = %d.\n", a[77]);
printf("Number of threads = %d.", NumThreads);
return(0);
}
I compile that using gcc -fopenmp !MyC.c -o !MyC.exe and it works like a charm.
However, when I try to use OpenMP in my DLL, it fails. For example, I set up this function:
__declspec(dllexport) int __stdcall TestAdd3i(struct SAFEARRAY **InArr1, struct SAFEARRAY **InArr2, struct SAFEARRAY **OutArr) //OpenMP Test
{
int LengthArr;
int i;
int *InArrElements1;
int *InArrElements2;
int *OutArrElements;
LengthArr = (*InArr1)->rgsabound[0].cElements;
InArrElements1 = (int*) (**InArr1).pvData;
InArrElements2 = (int*) (**InArr2).pvData;
OutArrElements = (int*) (**OutArr).pvData;
omp_set_num_threads(4);
#pragma omp parallel for private(i)
for (i = 0; i < LengthArr; i++)
{
OutArrElements[i] = InArrElements1[i] + InArrElements2[i];
}
return(omp_get_num_threads());
}
The structs are defined, of course. I compile that using
gcc -fopenmp -c -DBUILD_DLL dll.c -o dll.o
gcc -fopenmp -shared -o mydll.dll dll.o -lgomp -Wl,--add-stdcall-alias
The compiler and linker do not complain (there are not even warnings) and the DLL file is actually built. But when I try to call the function from within VB, the VB compiler claims that the DLL file could not be found (run-time error 53). The strange thing is that as soon as a single OpenMP "command" is present inside the .c file, the VB compiler claims a missing DLL even if I call a function that does not contain a single line of OpenMP code. When I comment out all the OpenMP code, the function works as expected, but of course doesn't use OpenMP for parallelization.
What is wrong here? Any help appreciated, thanks in advance! :-)

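The problem in this case is most probably that the system cannot find a library the DLL depends on. LD_LIBRARY_PATH is a Linux setting and has no effect on Windows, where dependent DLLs are looked up in the application directory and in the directories listed in PATH. A DLL built with -fopenmp depends on GCC's OpenMP runtime (typically libgomp-1.dll, plus a pthreads DLL), so make sure those files are in one of these locations. Otherwise loading fails and VB reports that the DLL could not be found.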

Using sqrtf() in C: "undefined reference to `sqrtf'"

I am using Linux, Ubuntu 12.04 (Precise Pangolin), and Geany for coding. The code I am writing in C worked completely fine until I used the sqrtf command to find the square root of a float.
Error: HAC3.c:(.text+0xfd7): undefined reference to `sqrtf' .
The part of code I am using sqrtf() in:
float syn(float *a, float *b, int dimensions)
{
float similarity=0;
float sumup=0;
float sumdown=0;
float as=0;
float bs=0;
int i;
for(i=0; i<dimensions; i++)
{
sumup = sumup + a[i] * b[i];
as = as + a[i] * a[i];
bs = bs + b[i] * b[i];
}
sumdown = sqrtf(as) * sqrtf(bs);
similarity = sumup / sumdown;
return similarity;
}
I included math.h, but this doesn't seem to be the problem.
Is there a way to fix Geany so this won't come up again?

Go to Build -> Set Build Commands then under C commands click on the empty label and it will let you specify a new label (name it Link). Type in it gcc -Wall -o "%e" "%f" -lm - where -lm will tell it to link the math library to your app. Click OK.
Then click on Build and select your newly created label - Link. This should do it for you.

You need to link with -lm to provide the math functions.

In addition to the many fine answers here, the portable form of the command that supports C99 version of <math.h> is specified by POSIX as c99 -l m. That having been said, every important Linux compiler supports -lm.
