CUDA: LNK2005 error on __device__ function used in header file


I have a device function that is defined in a header file. The reason it is in a header file is because it is used by a global kernel, which needs to be in a header file since it is a template kernel.
When this header file is included across 2 or more .cu files, I get a LNK2005 error during linking:
FooDevice.cu.obj : error LNK2005: "int __cdecl getCurThreadIdx(void)" (?getCurThreadIdx@@YAHXZ) already defined in Main.cu.obj
Why does this error occur, and how can I fix it?
Here is sample code that produces the above error:
FooDevice.h:
#ifndef FOO_DEVICE_H
#define FOO_DEVICE_H
__device__ int getCurThreadIdx()
{
return ( ( blockIdx.x * blockDim.x ) + threadIdx.x );
}
template< typename T >
__global__ void fooKernel( const T* inArr, int num, T* outArr )
{
const int threadNum = ( gridDim.x * blockDim.x );
for ( int idx = getCurThreadIdx(); idx < num; idx += threadNum )
outArr[ idx ] = inArr[ idx ];
return;
}
__global__ void fooKernel2( const int* inArr, int num, int* outArr );
#endif // FOO_DEVICE_H
FooDevice.cu:
#include "FooDevice.h"
// One other kernel that uses getCurThreadIdx()
__global__ void fooKernel2( const int* inArr, int num, int* outArr )
{
const int threadNum = ( gridDim.x * blockDim.x );
for ( int idx = getCurThreadIdx(); idx < num; idx += threadNum )
outArr[ idx ] = inArr[ idx ];
return;
}
Main.cu:
#include "FooDevice.h"
int main()
{
int num = 10;
int* dInArr = NULL;
int* dOutArr = NULL;
const int arrSize = num * sizeof( *dInArr );
cudaMalloc( &dInArr, arrSize );
cudaMalloc( &dOutArr, arrSize );
// Using template kernel
fooKernel<<< 10, 10 >>>( dInArr, num, dOutArr );
return 0;
}

Why is this error caused?
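Because the header is included in both FooDevice.cu and Main.cu, and it contains the full definition of getCurThreadIdx(), not just a declaration. Each .cu file is compiled into its own object file with its own copy of the function, so the linker finds two definitions of the same symbol. The include guard does not help here: it only prevents the header from being included twice within a single translation unit.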
How to fix it?
If you have the following defined in foo.h
template<typename T> __device__ T foo(T x)
{
return x;
}
And two .cu files that both include foo.h and also contain a call to it, e.g.
int x = foo<int>(1);
Then you can mark foo() as inline:
template<typename T>
inline __device__ T foo(T x)
{
return x;
}
and call:
int x = foo<int>(1);
This allows the same definition to appear in several translation units without a multiple-definition error at link time.
Function templates are exempt from the One Definition Rule, and there may be more than one definition of them in different translation units. A full function template specialization is not a template but an ordinary function, so you need to use the inline keyword to avoid violating the ODR if you want to put it in a header file included in several translation units.
Taken from http://www.velocityreviews.com/forums/t447911-why-does-explicit-specialization-of-function-templates-cause-generation-of-code.html
See also: http://en.wikipedia.org/wiki/One_Definition_Rule
I changed your code like this:
inline __device__ int getCurThreadIdx()
{
return ( ( blockIdx.x * blockDim.x ) + threadIdx.x );
}
template< typename T >
__global__ void fooKernel( const T* inArr, int num, T* outArr )
{
const int threadNum = ( gridDim.x * blockDim.x );
for ( int idx = getCurThreadIdx(); idx < num; idx += threadNum )
outArr[ idx ] = inArr[ idx ];
return;
}
And it now compiles and links. Your definition of getCurThreadIdx() without inline was violating the One Definition Rule.

It should be inlined. You could try adding the inline keyword.
Maybe you could remove the unnecessary code and create a simple text example for us to see? Usually the problem lies in the details...

Related

Why pass a function as an argument to a function?

I have a quick question.
I'm studying C with Dev-C++ (as a start), and I have seen that you can pass a function as an argument to another function. That's fine, but why would you?
For example, you can write this as a parameter:
void myfunc(void(*func)(int)){}
But isn't it better to simply call the function by its name with its arguments?
For example:
void myfunction (){name of func to call(myargs); }
Is there a difference?
It seems to do the same thing with simpler and shorter code.
Edit:
I only want to know, given
void map (int (*fun) (int),int x[ ], int l) {
for(i = 0; i < l; i++)
x[i] = fun(x[i]);
}
why you would use this instead of:
void map (int x[ ], int l) {
for(i = 0; i < l; i++)
x[i] = nameoffunction(yourargument);
}

You can use a function pointer as a parameter if you want your function to do different things depending on what the user wants.
Here's a simple example:
#include <stdio.h>
int add(int x, int y)
{
return x + y;
}
int subtract(int x, int y)
{
return x - y;
}
int multiply(int x, int y)
{
return x * y;
}
int divide(int x, int y)
{
return x / y;
}
int operation(int x, int y, int (*func)(int, int))
{
printf(" x=%d, y=%d\n", x, y);
return func(x,y);
}
int main()
{
int x = 8, y = 4;
printf("x+y=%d\n", operation(x,y,add));
printf("x-y=%d\n", operation(x,y,subtract));
printf("x*y=%d\n", operation(x,y,multiply));
printf("x/y=%d\n", operation(x,y,divide));
return 0;
}

A very good example is the classic sorting function qsort. It's a library function, which means that you only have access to its prototype. To keep qsort general, it takes a compare function that you write yourself. A typical implementation looks like this for regular integers:
int cmpfunc (const void * a, const void * b)
{
int x = *(const int*)a;
int y = *(const int*)b;
return ( x > y ) - ( x < y ); /* avoids the overflow that x - y can cause */
}
And then, if you have an array arr of integers, you can sort it with qsort(arr, sizeof(arr) / sizeof(arr[0]), sizeof(arr[0]), cmpfunc)
You might ask why this is not built in the qsort function? After all, it would be easy to make it work for both floats and integers. Yes, but imagine if you have an array of structs that look like this:
struct {
char *firstname;
char *lastname;
int age;
} persons[10];
How would you sort this? Well, that's not obvious. You might want to sort by first name, by last name, or by age. In that case, write three different compare functions.

I only want to know, given
void map (int (*fun) (int),int x[ ], int l) {
for(i = 0; i < l; i++)
x[i] = fun(x[i]);
}
why you would use this instead of:
void map (int x[ ], int l) {
for(i = 0; i < l; i++)
x[i] = nameoffunction(yourargument);
}
Let's answer the question with a question - what if you want to perform more than one type of mapping? What if you want to map both x² and √x?
You could certainly do something like
void map( int x[], int l, int type )
{
for ( int i = 0; i < l; i++ )
{
if ( type == MAP_SQUARED )
x[i] = int_square( x[i] );
else if ( type == MAP_ROOT )
x[i] = int_root( x[i] );
...
}
}
which works, but is hard to read and cumbersome to maintain - every time you want to perform a new mapping, you have to add a new case to the map function.
Compare that to
void map( int x[], int l, int (*fun)(int) )
{
for ( int i = 0; i < l; i++ )
x[i] = fun( x[i] );
}
...
map( x, l, int_square );
map( y, l, int_root );
You don't have to hack the map function to get different mappings - you only have to pass the function that operates on the individual elements. If you want to perform a new mapping, all you have to do is write a new function - you don't have to edit the map function at all.
The C standard library uses this form of delegation in several places, including the qsort function (allowing you to sort arrays of any type in any order) and the signal function (allowing you to change how a program reacts to interrupts dynamically).

Return an array from a function when a dynamic size is passed to the function in C

I am trying to return an array from a function. It works fine as long as I use a hard-coded value for the size of the array. However, when I make the size dynamic (calculated from nproc = sysconf(_SC_NPROCESSORS_ONLN);), I get the following error:
-->gcc test.c
test.c: In function ‘getRandom’:
test.c:14:16: error: storage size of ‘r’ isn’t constant
static int r[nproc];
^
test.c:18:21: warning: implicit declaration of function ‘time’; did you mean ‘nice’? [-Wimplicit-function-declaration]
srand( (unsigned)time( NULL ) );
^~~~
nice
When I change static int r[10]; to static int r[nproc]; it fails. I need to keep the size dynamic, as it is going to be calculated at runtime. Can someone please help me get past this problem?
Code:
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
/* function to generate and return random numbers */
int * getRandom(int nproc ) {
printf("nproc is %d\n",nproc);
//static int r[10];
static int r[nproc];
int i;
/* set the seed */
srand( (unsigned)time( NULL ) );
for ( i = 0; i < 10; ++i) {
r[i] = rand();
printf( "r[%d] = %d\n", i, r[i]);
}
return r;
}
/* main function to call above defined function */
int main () {
/* a pointer to an int */
int *p;
int i;
int nproc;
nproc = sysconf(_SC_NPROCESSORS_ONLN);
p = getRandom(nproc);
for ( i = 0; i < 10; i++ ) {
printf( "*(p + %d) : %d\n", i, *(p + i));
}
return 0;
}
I need to know how to achieve this in C.

You cannot define a static array with a size that is not a compile-time constant, because it is like defining a global array with a dynamic size. This is illegal in C: objects with static storage duration, like global variables, are laid out as part of the binary, so their size must be known at compile time.
If you really want to keep it static, you need to define the array with the maximum possible size, and then pass the nproc value as the upper limit each time you call the getRandom() function.
Example:
/* function to generate and return random numbers */
int * getRandom(int nproc ) {
printf("nproc is %d\n",nproc);
static int r[MAX_POSSIBLE_LENGTH];
int i;
/* set the seed */
srand( (unsigned)time( NULL ) );
for ( i = 0; i < nproc; ++i) {
r[i] = rand();
printf( "r[%d] = %d\n", i, r[i]);
}
return r;
}
There is also the possibility of allocating (or reallocating) an array of the required size with malloc/realloc in the caller of getRandom(), and then passing the pointer and the size to getRandom():
void getRandom(int *pArr, unsigned int size);
In this case you won't need to hold any static arrays.

I think what you want is:
int *r = (int *) malloc(sizeof(int) * nproc);// allocating for nproc integers

Multiply two pre-defined values in the kernel

Below is my kernel. When I multiply (or do other operations with) two values that are defined with the #define keyword and assign the result to an argument of the kernel, I get an error with error status -48 (invalid kernel). Is it not possible to multiply these, or am I doing something else wrong?
#define cl_sizeX 1024;
#define pi 3.1415926535897;
#define N 1024;
#define M 1024;
#define lambda 632e-9;//632e-9;
#define X 12.1e-6;
__kernel void helloworld(__global char* in, __global char* out)
{
int num = get_global_id(0);
out[num] = in[num] + 1;
}
__kernel void multiply_arrays(__global int* first, __global int* second, __global int* out_array)
{
int num = get_global_id(0);
out_array[num] = first[num] * second[num];
}
__kernel void create_library(__global float* z0){
//Variable definitions
int a = get_global_id(0);
int i1 = get_global_id(1);
int i2 = get_global_id(2);
//z0[a] = ((N*pow(X, 2)) / lambda) + (a - 1)*((N*pow(X, 2)) / (100 * lambda));
z0[a] = N*X; // This is where i get error
When I assign z0[a] = N; I don't get an error, and I couldn't figure out why.
I use Windows 8.1 and Visual Studio 13 for coding.

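If you remove the ; after all the #define statements, the kernel will compile. The preprocessor substitutes the replacement text as written, semicolon included, so z0[a] = N*X; expands to z0[a] = 1024;*12.1e-6;; which does not compile, because the leftover statement *12.1e-6; tries to dereference a floating-point constant. z0[a] = N; happens to work because it expands to z0[a] = 1024;; which is just an assignment followed by an empty statement.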

You are assigning a double to a float which could be raising an error in the compiler.
Use clGetProgramBuildInfo with CL_PROGRAM_BUILD_LOG to get the actual clBuildProgram output from the compiler which will give you a better idea of the problem.

How to add an element (cv::Point) to a shared array - CUDA

I'm new to CUDA. I need help using CUDA to find only the pixels in a binary (monochrome) image that are white (value 255). The pixels found then need to be stored in an output array. My solution is based on a critical section. However, it gives incorrect results.
//----- call kernel: -----
{
const dim3 block(16,16);
const dim3 grid(divUp(_binImg.cols, block.x), divUp(_binImg.rows, block.y));
// others allocations, declarations ...
cudaCalcWhitePixels<<<grid, block>>>(_binImg, _index, _pointsX, _pointsY);
}
__device__ int lock = 0;
__global__ void cudaCalcWhitePixels(cv::gpu::PtrStepSzb _binImg, int *_index, int *_pointsX, int *_pointsY)
{
extern int lock;
const int x = blockIdx.x * blockDim.x + threadIdx.x;
const int y = blockIdx.y * blockDim.y + threadIdx.y;
__syncthreads();
if(x < _binImg.cols && y < _binImg.rows)
{
if(_binImg.ptr(y)[x] == 255)
{
do{} while(atomicCAS(&lock, 0, 1) != 0)
//----- critical section ------
_pointsX[*_index] = x;
_pointsY[*_index] = y;
(*_index)++;
lock = 0;
//----- end CS ------
}
}
}
It seems to me that the critical section is not working properly. White pixels make up approximately 1% of the image.
Could you please advise me? Thank you and have a nice day :)
EDIT:
solution:
__global__ void cudaCalcWhitePixels(cv::gpu::PtrStepSzb _binImg, int *_index, int *_pointsX, int *_pointsY)
{
int myIndex = 0;
const int x = blockIdx.x * blockDim.x + threadIdx.x;
const int y = blockIdx.y * blockDim.y + threadIdx.y;
__syncthreads();
if(x < _binImg.cols && y < _binImg.rows)
{
if(_binImg.ptr(y)[x] == 255)
{
//----- critical section ------
myIndex = atomicAdd(_index, 1);
_pointsX[myIndex] = x;
_pointsY[myIndex] = y;
//----- end CS ------
}
}
}

This code from the following URL may help you understand how to use atomicCAS() to create a critical section.
https://github.com/ArchaeaSoftware/cudahandbook/blob/master/memory/spinlockReduction.cu
class cudaSpinlock {
public:
cudaSpinlock( int *p );
void acquire();
void release();
private:
int *m_p;
};
inline __device__
cudaSpinlock::cudaSpinlock( int *p )
{
m_p = p;
}
inline __device__ void
cudaSpinlock::acquire( )
{
while ( atomicCAS( m_p, 0, 1 ) );
}
inline __device__ void
cudaSpinlock::release( )
{
atomicExch( m_p, 0 );
}
Since (*_index)++; is the only operation in your critical section that needs to be atomic, you could consider using atomicAdd() instead.
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomicadd
On the other hand, you could try to use thrust::copy_if() to simplify the coding.

qsort does not work for double array

I am trying to sort an array of double values using qsort, but it doesn't seem to work. What has gone wrong here?
#include <stdio.h>
#include <stdlib.h>
static double compare (const void * a, const void * b)
{
if (*(double*)a > *(double*)b) return 1;
else if (*(double*)a < *(double*)b) return -1;
else return 0;
}
int main() {
int idx;
double* sum_least_square_err;
sum_least_square_err = (double*) malloc (2500*2500*sizeof(double));
sum_least_square_err[0] = 0.642;
sum_least_square_err[1] = 0.236;
sum_least_square_err[2] = 0.946;
idx = 3;
qsort(sum_least_square_err, idx, sizeof(sum_least_square_err), compare);
int i;
for (i=0; i<idx; i++){
fprintf(stderr,"sum_least_square_err[%d] = %.3f\n", i, sum_least_square_err[i]);
}
fprintf(stderr,"MAEE = %.3f\n", sum_least_square_err[idx/2]);
free(sum_least_square_err);
}
Result:
sum_least_square_err[0] = 0.642
sum_least_square_err[1] = 0.236
sum_least_square_err[2] = 0.946
MAEE = 0.236

Change:
static double compare (const void * a, const void * b)
to:
static int compare (const void * a, const void * b)
and change:
qsort(sum_least_square_err, idx, sizeof(sum_least_square_err), compare);
to:
qsort(sum_least_square_err, idx, sizeof(sum_least_square_err[0]), compare);
Note: you should have got an appropriate compiler warning about the first bug - are you compiling with gcc -Wall or equivalent, and if so are you taking notice of compiler warnings ? (If not then please take the hint and let the compiler catch problems such as this for you in future.)

I believe your error is at the line:
qsort(sum_least_square_err, idx, sizeof(sum_least_square_err), compare);
The problem is the 3rd parameter should be sizeof(double), that is, the size of an element of the array. You were instead passing the size of a pointer, which can be (and usually is) different from the size of the element.
For more information, see: http://www.cplusplus.com/reference/clibrary/cstdlib/qsort/
Edit: And Paul R is right in his answer: The prototype of your comparison function is wrong. The prototype should be:
int ( * comparator ) ( const void *, const void * )
Last but not least, in your code:
if (*(double*)a > *(double*)b) return 1;
else if (*(double*)a < *(double*)b) return -1;
You are casting away the const. This has no consequence here, but it is still bad form.
