I have a program which (for now) calculates values of two functions at random points on the GPU, sends these values back to the host, and then visualizes them. This is what I get, some nice semi-random points:
Now, if I modify my kernel code and add the local array initialization code at the very end,
__global__ void optymalize(curandState * state, float* testPoints)
{
    int ind=blockDim.x*blockIdx.x+threadIdx.x;
    int step=blockDim.x*gridDim.x;
    for(int i=ind*2;i<NOF*TEST_POINTS;i+=step*2)
    {
        float* x=generateX(state);
        testPoints[i]=ZDT_f1(x);
        testPoints[i+1]=ZDT_f2(x);
    }
    //works fine with 'new'
    //float* test_array=new float[2];
    float test_array[2]={1.0f,2.0f};
}
I get something like this every time:
Does anyone know the cause of this behavior? All the drawn points are computed BEFORE test_array is initialized, yet they are affected by it. It doesn't happen when I initialize test_array before the 'for' loop.
Host/device code:
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include "curand_kernel.h"
#include "device_functions.h"
#include <random>
#include <iostream>
#include <time.h>
#include <fstream>
using namespace std;
#define XSIZE 5
#define TEST_POINTS 100
#define NOF 2
#define BLOCK_COUNT 64
#define THR_COUNT 128
#define POINTS_PER_THREAD (NOF*TEST_POINTS+THR_COUNT*BLOCK_COUNT-1)/(THR_COUNT*BLOCK_COUNT)
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, char *file, int line, bool abort=false)
{
if (code != cudaSuccess)
{
fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
if (abort) exit(code);
}
}
__device__ float g(float* x)
{
float tmp=1;
for(int i=1;i<XSIZE;i++)
tmp*=x[i];
return 1+9*(tmp/(XSIZE-1));
}
__device__ float ZDT_f1(float* x)
{
return x[0];
}
__device__ float ZDT_f2(float* x)
{
float gp=g(x);
return gp*(1-sqrtf(x[0]/gp));
}
__device__ bool oneDominatesTwo(float* x1, float* x2)
{
for(int i=0;i<XSIZE;i++)
if(x1[i]>=x2[i])
return false;
return true;
}
__device__ float* generateX(curandState* globalState)
{
int ind = threadIdx.x;
float x[XSIZE];
for(int i=0;i<XSIZE;i++)
x[i]=curand_uniform(&globalState[ind]);
return x;
}
__global__ void setup_kernel ( curandState * state, unsigned long seed )
{
int id = blockDim.x*blockIdx.x+threadIdx.x;
curand_init ( seed, id, 0, &state[id] );
}
__global__ void optymalize(curandState * state, float* testPoints)
{
int ind=blockDim.x*blockIdx.x+threadIdx.x;
int step=blockDim.x*gridDim.x;
for(int i=ind*2;i<NOF*TEST_POINTS;i+=step*2)
{
float* x=generateX(state);
testPoints[i]=ZDT_f1(x);
testPoints[i+1]=ZDT_f2(x);
}
__syncthreads();
//float* test_array=new float[2];
//test_array[0]=1.0f;
//test_array[1]=1.0f;
float test_array[2]={1.0f,1.0f};
}
void saveResultToFile(float* result)
{
ofstream resultFile;
resultFile.open ("result.txt");
for(unsigned int i=0;i<NOF*TEST_POINTS;i+=NOF)
{
resultFile << result[i] << " "<<result[i+1]<<"\n";
}
resultFile.close();
}
int main()
{
float* dev_fPoints;
float* fPoints=new float[NOF*TEST_POINTS];
gpuErrchk(cudaMalloc((void**)&dev_fPoints, NOF * TEST_POINTS * sizeof(float)));
curandState* devStates;
gpuErrchk(cudaMalloc(&devStates,THR_COUNT*sizeof(curandState)));
cudaEvent_t start;
gpuErrchk(cudaEventCreate(&start));
cudaEvent_t stop;
gpuErrchk(cudaEventCreate(&stop));
gpuErrchk(cudaThreadSetLimit(cudaLimitMallocHeapSize, 128*1024*1024));
gpuErrchk(cudaEventRecord(start, NULL));
setup_kernel<<<BLOCK_COUNT, THR_COUNT>>>(devStates,unsigned(time(NULL)));
gpuErrchk(cudaDeviceSynchronize());
gpuErrchk(cudaGetLastError());
optymalize<<<BLOCK_COUNT,THR_COUNT>>>(devStates, dev_fPoints);
gpuErrchk(cudaDeviceSynchronize());
gpuErrchk(cudaGetLastError());
gpuErrchk(cudaMemcpy(fPoints, dev_fPoints, NOF * TEST_POINTS * sizeof(float), cudaMemcpyDeviceToHost));
gpuErrchk(cudaEventRecord(stop, NULL));
gpuErrchk(cudaEventSynchronize(stop));
float msecTotal = 0.0f;
cudaEventElapsedTime(&msecTotal, start, stop);
cout<<"Kernel execution time: "<<msecTotal<< "ms"<<endl;
saveResultToFile(fPoints);
system("start pythonw plot_data.py result.txt");
cudaFree(dev_fPoints);
cudaFree(devStates);
system("pause");
return 0;
}
Plot script code:
import matplotlib.pyplot as plt
import sys

if len(sys.argv) < 2:
    print("Usage: python PlotScript <filename>")
    sys.exit(0)

path = sys.argv[1]
x = []
y = []
with open(path, "r") as f:
    for line in f:
        vals = line.strip().split(" ")
        x.append(vals[0])
        y.append(vals[1])

plt.plot(x, y, 'ro')
plt.show()
The basic problem was in code you originally didn't show in your question, specifically this:
__device__ float* generateX(curandState* globalState)
{
    int ind = threadIdx.x;
    float x[XSIZE];
    for(int i=0;i<XSIZE;i++)
        x[i]=curand_uniform(&globalState[ind]);
    return x;
}
Returning an address or reference to a local scope variable from a function results in undefined behaviour. It is only valid to use x by reference or value within generateX while it is in scope. There should be no surprise that adding or moving other local scope variables around within the kernel changes the kernel behaviour.
Fix this function so it populates an array passed by reference, rather than returning the address of a local scope array. And pay attention to compiler warnings - there will have been one for this which should have immediately set off alarm bells that there was something wrong.
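For instance, a minimal sketch of that fix (keeping the names from the question; the caller supplies the storage) might look like this:
__device__ void generateX(curandState* globalState, float* x)
{
    int ind = threadIdx.x;
    for(int i=0;i<XSIZE;i++)
        x[i]=curand_uniform(&globalState[ind]);
}

// inside the kernel loop:
float x[XSIZE];
generateX(state, x);
testPoints[i]=ZDT_f1(x);
testPoints[i+1]=ZDT_f2(x);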
Related
I'm trying to write some code which would allow me to render 3D graphics in the console using characters and escape sequences (for color). I need it for one specific program I want to write, but, if possible, I would like to make it more universal. I'm experiencing something like screen tearing and I want to get rid of it (so that the whole screen is printed "at once"). The test simply displays a screen filled with spaces on a white and black background (one full white frame, then one full black one) at a one-second interval.
I have tried:
At the beginning I thought about line buffering on stdout. I tried both disabling it and creating a full buffer with a size sufficient to hold every character on the screen. The second option gives better results, by which I mean that fewer frames are torn, but some still are.
I thought it might be a problem with my terminal emulator (this question gave me the idea), so I started to mess around with other ones. I got the best results with Kitty, but it's not there yet.
The next thing was to mess with Kitty's configuration. I noticed that if I increase the input_delay setting to about 20 ms, the problem is almost gone. Only a few frames, rather than every frame, are torn.
So I came to the conclusion that terminal emulators (or at least Kitty) are simply too fast, and that there may be some sort of race condition here, where the buffer is not yet fully flushed and the terminal emulator displays both what was partially flushed and part of the old frame. Am I wrong? If not, is there any way I can force terminals to wait for the input to finish before displaying it, or at least enforce an input delay in C?
Here is the relevant part of the code:
main.c
#include "TermCTRL/termCTRL.h"
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <time.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
int main()
{
termcell_t cell;
int k;
uint16_t x,y;
termCTRL_get_term_size(&x, &y);
sleep(1);
termCTRL_init();
uint8_t a = 0;
for(k=0; k<200; k++)
{
a^=255;
cell.bg.B = a;
cell.bg.G = a;
cell.bg.R = a;
cell.fg.B = a;
cell.fg.G = a;
cell.fg.R = a;
cell.symbol[0] = ' '; //symbol is in fact a string, because I want to use UTF chars too
cell.symbol[1] = '\0';
for(int xd=0; xd<x; xd++)
for(int yd=0; yd<y; yd++)
{
termCTRL_load_termcell(xd, yd, &cell);
}
termCTRL_update_screen();
sleep(1);
}
termCTRL_close();
return 0;
}
termCTRL.h
#pragma once
#include <stdint.h>
#define INPLACE_TERMCELL(FG_R, FG_G, FG_B, BG_R, BG_G, BG_B, SYMBOL) \
(termcell_t) { {FG_R, FG_G, FG_B}, {BG_R, BG_G, BG_B}, SYMBOL }
#define termCTRL_black_fill_screen() \
termCTRL_fill_screen(&INPLACE_TERMCELL(0, 0, 0, 0, 0, 0, " "))
typedef struct termcell_color_t
{
uint16_t R;
uint16_t G;
uint16_t B;
} termcell_color_t;
typedef struct termcell_t
{
termcell_color_t fg;
termcell_color_t bg;
char symbol[4];
} termcell_t;
typedef enum termCTRL_ERRNO
{
termCTRL_OUT_OF_BORDER = -2,
termCTRL_INVALID_TERMCELL = -1,
termCTRL_INTERNAL_ERROR = 0,
termCTRL_OK = 1,
} termCTRL_ERRNO;
void termCTRL_init();
void termCTRL_close();
void termCTRL_get_term_size(uint16_t *col, uint16_t *row);
termCTRL_ERRNO termCTRL_load_termcell(uint16_t x, uint16_t y, termcell_t *in);
void termCTRL_update_screen();
termCTRL_ERRNO termCTRL_fill_screen(termcell_t *cell);
termCTRL.c
#include "termCTRL.h"
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#define CONVERTED_TERMCELL_SIZE 44
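/* One converted cell is two 19-character truecolor escape sequences
   ("\x1b[38;2;RRR;GGG;BBBm" for the foreground, "\x1b[48;2;RRR;GGG;BBBm"
   for the background) followed by the symbol (up to 4 bytes of UTF-8)
   and a terminating NUL. */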
#define CAST_SCREEN_TO_BUFFER \
char (*screen_buffer)[term_xsize][term_ysize][CONVERTED_TERMCELL_SIZE]; \
screen_buffer = _screen_buffer
static void *_screen_buffer = NULL;
static uint16_t term_xsize, term_ysize;
static char *IO_buff = NULL;
void termCTRL_get_term_size(uint16_t *col, uint16_t *row)
{
struct winsize w;
ioctl(STDOUT_FILENO, TIOCGWINSZ, &w);
*col = w.ws_col;
*row = w.ws_row;
}
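/* Write a value in the range 0-255 as exactly three zero-padded ASCII digits
   into out[0..2]; no terminating NUL is added. */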
void int_decompose(uint8_t in, char *out)
{
uint8_t x = in/100;
out[0] = x + '0';
in -= x*100;
x = in/10;
out[1] = x + '0';
in -= x*10;
out[2] = in + '0';
}
termCTRL_ERRNO termCTRL_move_cursor(uint16_t x, uint16_t y)
{
char mov_str[] = "\x1b[000;000H";
if(x<term_xsize && y<term_ysize)
{
int_decompose(y, &mov_str[2]);
int_decompose(x, &mov_str[6]);
if(fputs(mov_str, stdout) == EOF) return termCTRL_INTERNAL_ERROR;
else return termCTRL_OK;
}
else
{
return termCTRL_OUT_OF_BORDER;
}
}
termCTRL_ERRNO termCTRL_load_termcell(uint16_t x, uint16_t y, termcell_t *in)
{
CAST_SCREEN_TO_BUFFER;
if(in == NULL) return termCTRL_INVALID_TERMCELL;
if(x >= term_xsize || y >= term_ysize) return termCTRL_OUT_OF_BORDER;
//because screen buffer was initialized, it is only needed to replace RGB values and symbol.
//whole escape sequence is already there
int_decompose(in->fg.R, &(*screen_buffer)[x][y][7]);
int_decompose(in->fg.G, &(*screen_buffer)[x][y][11]);
int_decompose(in->fg.B, &(*screen_buffer)[x][y][15]);
int_decompose(in->bg.R, &(*screen_buffer)[x][y][26]);
int_decompose(in->bg.G, &(*screen_buffer)[x][y][30]);
int_decompose(in->bg.B, &(*screen_buffer)[x][y][34]);
strcpy(&(*screen_buffer)[x][y][38], in->symbol); //copy symbol, note that it could be UTF char
return termCTRL_OK;
}
termCTRL_ERRNO termCTRL_fill_screen(termcell_t *cell)
{
uint16_t x, y;
termCTRL_ERRNO ret;
for(y=0; y<term_ysize; y++)
for(x=0; x<term_xsize; x++)
{
ret = termCTRL_load_termcell(x, y, cell);
if(ret != termCTRL_OK)
return ret;
}
return ret;
}
void termCTRL_update_screen()
{
uint16_t x, y;
CAST_SCREEN_TO_BUFFER;
termCTRL_move_cursor(0, 0);
for(y=0; y<term_ysize-1; y++)
{
for(x=0; x<term_xsize; x++)
fputs((*screen_buffer)[x][y], stdout);
fputs("\n", stdout);
}
//last line got special treatment because it can't have \n
for(x=0; x<term_xsize; x++)
fputs((*screen_buffer)[x][y], stdout);
fflush(stdout);
}
void termCTRL_init()
{
uint16_t x, y;
termCTRL_get_term_size(&term_xsize, &term_ysize);
IO_buff = calloc(term_xsize*term_ysize, CONVERTED_TERMCELL_SIZE);
setvbuf(stdout, IO_buff, _IOFBF, term_xsize*term_ysize*CONVERTED_TERMCELL_SIZE);
_screen_buffer = calloc(term_xsize*term_ysize, CONVERTED_TERMCELL_SIZE);
fputs("\e[?25l", stdout); //hide cursor
fputs("\x1b[2J", stdout); //clear screen
CAST_SCREEN_TO_BUFFER;
for(y=0; y<term_ysize; y++)
for (x=0; x<term_xsize; x++)
sprintf( (*screen_buffer)[x][y], "\x1b[38;2;200;200;000m\x1b[48;2;000;000;000m ");
termCTRL_update_screen();
}
void termCTRL_close()
{
free(_screen_buffer);
setvbuf(stdout, NULL, _IONBF, 0);
free(IO_buff);
printf("\e[?25h"); //show cursor
printf("\x1b[m"); //reset colors
printf("\x1b[2J"); //clear screen
}
I am doing matrix multiplication using CUDA. I think I am about to succeed, but a very strange error stops me, and I can't find where the code goes wrong. Below is the code sample:
#include <stdio.h>
#include <cuda.h>
#define BLOCK_SIZE 16;
__global__ void matmulKernel(float* mat_in1,float* mat_in2, float* mat_out,int mat_dim);
int main() {
float *h_M, *h_N, *h_P, *d_M, *d_N, *d_P;
int i,width=10;
int size=width*width*sizeof(float);
dim3 block_dim(BLOCK_SIZE,BLOCK_SIZE,1);
int grid_size=width/BLOCK_SIZE;
if(width%BLOCK_SIZE) grid_size++;
dim3 grid_dim (grid_size,grid_size,1);
h_M=(float*)malloc(size);
h_N=(float*)malloc(size);
h_P=(float*)malloc(size);
cudaMalloc((void**)&d_M,size);
cudaMalloc((void**)&d_N,size);
cudaMalloc((void**)&d_P,size);
if(h_M==0||h_N==0||h_P==0||d_M==0||d_N==0||d_P==0) {
printf("memory locate fail!\n");
}
for(i=0;i<width*width;i++) {
h_M[i]=1.2*i;
h_N[i]=1.4*i;
}
cudaMemcpy(d_M,h_M,size,cudaMemcpyHostToDevice);
cudaMemcpy(d_N,h_N,size,cudaMemcpyHostToDevice);
matmulKernel<<<grid_dim,block_dim>>>(d_M,d_N,d_P,width);
cudaMemcpy(h_P,d_P,size,cudaMemcpyDeviceToHost);
printf("firt row of the results matrix P:\n");
for(i=0;i<width;i++) {
printf("%f, %f",h_P[i]);
}
printf("\n");
return 0;
}
__global__ void matmulKernel(float* mat1,float* mat2, float* matP,int dim) {
int thread_x,thread_y,i;
thread_x=blockIdx.x*blockDim.x+threadIdx.x;
thread_y=blockIdx.y*blockDim.y+threadIdx.y;
if(thread_x<dim&&thread_y<dim) {
float P_value=0.;
for(i=0;i<dim;i++) {
P_value+=mat1[thread_y*dim+i]*mat2[i*dim+thread_x];
}
matP[thread_y*dim+thread_x]=P_value;
}
}
Compiling with nvcc, the error is:
matmul.cu(11): error: expected a ")"
matmul.cu(11): error: expected an expression
matmul.cu(11): error: expected an expression
matmul.cu(13): error: expected a ")"
I cannot see why the compiler reports this error; can anyone please tell me where I am wrong?
You have a stray semicolon.
Change:
#define BLOCK_SIZE 16;
to:
#define BLOCK_SIZE 16
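To see why the compiler complains where it does: the trailing semicolon becomes part of the macro's replacement text, so after preprocessing the affected lines read roughly like this:
dim3 block_dim(16;,16;,1);      // from dim3 block_dim(BLOCK_SIZE,BLOCK_SIZE,1); -> "expected a )"
if(width%16;) grid_size++;      // from if(width%BLOCK_SIZE) grid_size++;        -> "expected a )"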
I made a mistake while using a union, and I don't know why.
The problem is in the function goto_xy().
I copied it from a book, but it does not compile.
In this function I am trying to position the cursor, but the REGS variable is not declared. I want to know what its purpose is.
#include<stdio.h>
#include<windows.h>
#include<dos.h>
#include<conio.h>
void goto_xy(int x,int y); //goto is a keyword; define the subfunction that positions the cursor in the coordinate system
void rectangle_clear(int x1,int x2,int y1,int y2); //define the rectangle_clear opening subfunction
void center_clear(int x1,int x2,int y1,int y2); //define the center_clear opening subfunction
void creat(); //define the subfunction of creating the star
int main() //the main function
{
creat();
getch();
center_clear(0,25,0,79);
getch();
}
void center_clear(int x1,int x2,int y1,int y2) //the subfunction which creates the stars while opening the project
{
int x00,y00,x0,y0,i,d;
if((y2-y1)>(x2-x1))
{
d=(x2-x1)/2;
x0=(x1+x2)/2;
y0=y1+d;
y00=y2-d;
for(i=0;i<(d+1);i++)
{
rectangle_clear((x0-i),(x00+i),(y0-i),(y00+i));
}
delay(10); //to delay the dismissal of the star
}
else
{
d=(y2-y1)/2;
y0=(y1+y2)/2;
x0=x1+d;
x00=x2-d;
for(i=0;i<d+1;i++)
{
rectangle_clear((x0-i),(x00+i),(y0-i),(y00+i));
}
delay(10);
}
}
void rectangle_clear(int x1,int x2,int y1,int y2) //to create the star in the shape of a rectangle
{
int i,j;
for(i=y1;i<y2;i++)
{
goto_xy(x1,i);
putchar(' ');
goto_xy(x2,i);
putchar(' ');
delay(10);
}
for(j=x1;j<x2;j++)
{
goto_xy(i,y1);
putchar(' ');
goto_xy(i,y2);
putchar(' ');
delay(10);
}
}
void goto_xy(int x,int y)
{
union REGS r;
r.h.ah=2;
r.h.dl=y;
r.h.dh=x;
r.h.bh=0;
int86(0x10,&r,&r);
}
void creat()
{
int i,j;
for(i=0;i<24;i++)
{
for(j=0;j<79;j++)
{
goto_xy(i,j);
printf("a");
}
}
}
It appears to me that union REGS must already be present in one of the header files you are including.
As can be seen from your code below, you even access members of the union, like h, and the members of h, which means the union is defined in some header file you are including.
void goto_xy(int x,int y)
{
union REGS r;
r.h.ah=2; //Here you are accessing the member of REGS and even the sub-members of h
r.h.dl=y;
r.h.dh=x;
r.h.bh=0;
int86(0x10,&r,&r);
}
EDIT:
A Google search tells me that union REGS is defined in dos.h and looks something like this:
union REGS {
    struct WORDREGS x;
    struct BYTEREGS h;
};
So, you need to include dos.h to solve your problem. But it appears that despite you including it, the problem is still present. You can also open dos.h and check whether union REGS is present or not.
See here for more details.
To define a union, you need to do the following:
union REGS { some_type h; other_type f; };
Now you can declare a variable of type union REGS and use it.
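As a self-contained illustration (the type and member names here are made up for the example, they are not the dos.h definition):
#include <stdio.h>

union Sample {      /* all members share the same storage */
    int   i;
    float f;
};

int main(void)
{
    union Sample s; /* in C the 'union' keyword is part of the type name */
    s.i = 42;
    printf("%d\n", s.i);
    return 0;
}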
I want to declare my texture once and use it in all my kernels and files. Therefore, I declare it as extern in a header and include that header in all the other files (following the SO question How do I use extern to share variables between source files?).
I have a header file cudaHeader.cuh containing my texture:
extern texture<uchar4, 2, cudaReadModeElementType> texImage;
In my file1.cu, I allocate my CUDA array and bind it to the texture:
cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc< uchar4 >( );
cudaStatus=cudaMallocArray( &cu_array_image, &channelDesc, width, height );
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "cudaMallocArray failed! cu_array_image couldn't be created.\n");
return cudaStatus;
}
cudaStatus=cudaMemcpyToArray( cu_array_image, 0, 0, image, size_image, cudaMemcpyHostToDevice);
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "cudaMemcpyToArray failed! Copy from the host memory to the device texture memory failed.\n");
return cudaStatus;
}
// set texture parameters
texImage.addressMode[0] = cudaAddressModeWrap;
texImage.addressMode[1] = cudaAddressModeWrap;
texImage.filterMode = cudaFilterModePoint;
texImage.normalized = false; // access with normalized texture coordinates
// Bind the array to the texture
cudaStatus=cudaBindTextureToArray( texImage, cu_array_image, channelDesc);
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "cudaBindTextureToArray failed! cu_array couldn't be bind to texImage.\n");
return cudaStatus;
}
In file2.cu, I use the texture in the kernel function as follows:
__global__ void kernel(int width, int height, unsigned char *dev_image) {
int x = blockIdx.x*blockDim.x + threadIdx.x;
int y = blockIdx.y*blockDim.y + threadIdx.y;
if(y< height) {
uchar4 tempcolor=tex2D(texImage, x, y);
//if(tempcolor.x==0)
// printf("tempcolor.x %d \n", tempcolor.x);
dev_image[y*width*3+x*3]= tempcolor.x;
dev_image[y*width*3+x*3+1]= tempcolor.y;
dev_image[y*width*3+x*3+2]= tempcolor.z;
}
}
The problem is that my texture contains nothing, or corrupted values, when I use it in file2.cu. Even if I use the kernel function directly in file1.cu, the data are not correct.
If I add: texture<uchar4, 2, cudaReadModeElementType> texImage; in file1.cu and file2.cu, the compiler says that there is a redefinition.
EDIT:
I tried the same thing with CUDA version 5.0 but the same problem appears. If I print the address of texImage in file1.cu and file2.cu, I don't get the same address. There must be a problem with the declaration of the variable texImage.
This is a very old question and answers were provided in the comments by talonmies and Tom. In the pre-CUDA 5.0 scenario, extern textures were not feasible due to the lack of a true linker providing external linkage possibilities. As a consequence, and as mentioned by Tom,
you can have different compilation units, but they cannot reference each other
In the post-CUDA 5.0 scenario, extern textures are possible and I want to provide a simple example below, showing this in the hope that it could be useful to other users.
kernel.cu compilation unit
#include <stdio.h>
texture<int, 1, cudaReadModeElementType> texture_test;
/********************/
/* CUDA ERROR CHECK */
/********************/
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
if (code != cudaSuccess)
{
fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
if (abort) exit(code);
}
}
/*************************/
/* LOCAL KERNEL FUNCTION */
/*************************/
__global__ void kernel1() {
printf("ThreadID = %i; Texture value = %i\n", threadIdx.x, tex1Dfetch(texture_test, threadIdx.x));
}
__global__ void kernel2();
/********/
/* MAIN */
/********/
int main() {
const int N = 16;
// --- Host data allocation and initialization
int *h_data = (int*)malloc(N * sizeof(int));
for (int i=0; i<N; i++) h_data[i] = i;
// --- Device data allocation and host->device memory transfer
int *d_data; gpuErrchk(cudaMalloc((void**)&d_data, N * sizeof(int)));
gpuErrchk(cudaMemcpy(d_data, h_data, N * sizeof(int), cudaMemcpyHostToDevice));
gpuErrchk(cudaBindTexture(NULL, texture_test, d_data, N * sizeof(int)));
kernel1<<<1, 16>>>();
gpuErrchk(cudaPeekAtLastError());
gpuErrchk(cudaDeviceSynchronize());
kernel2<<<1, 16>>>();
gpuErrchk(cudaPeekAtLastError());
gpuErrchk(cudaDeviceSynchronize());
gpuErrchk(cudaUnbindTexture(texture_test));
}
kernel2.cu compilation unit
#include <stdio.h>
extern texture<int, 1, cudaReadModeElementType> texture_test;
/**********************************************/
/* DIFFERENT COMPILATION UNIT KERNEL FUNCTION */
/**********************************************/
__global__ void kernel2() {
printf("Texture value = %i\n", tex1Dfetch(texture_test, threadIdx.x));
}
Remember to compile generating relocatable device code, namely with -rdc=true, to enable external linkage.
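For example, with the two files above, something like the following build line should work (the output name is arbitrary; add an architecture flag as needed for your GPU):
nvcc -rdc=true kernel.cu kernel2.cu -o test_extern_texture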
The Problem
I have prepared a sample CUDA code using constant memory. I can run it successfully with CUDA 4.2, but I get "invalid device symbol" when I compile it with CUDA 5.
I have attached the sample code below.
The Code
#include <iostream>
#include <stdio.h>
#include <cuda_runtime.h>
#include <cuda.h>
struct CParameter
{
int A;
float B;
float C;
float D;
};
__constant__ CParameter * CONSTANT_PARAMETER;
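// Note: this declares a *pointer* in constant memory; the CParameter object it
// points to is allocated in ordinary global memory with cudaMalloc below.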
#define PARAMETER "CONSTANT_PARAMETER"
bool ERROR_CHECK(cudaError_t Status)
{
if(Status != cudaSuccess)
{
printf(cudaGetErrorString(Status));
return false;
}
return true;
}
// Kernel that executes on the CUDA device
__global__ void square_array(float *a, int N)
{
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx<N)
{
a[idx] = CONSTANT_PARAMETER->A * a[idx];
}
}
////Main Function/////
int main(void)
{
/////Variable Definition
const int N = 10;
size_t size = N * sizeof(float);
cudaError_t Status = cudaSuccess;
CParameter * m_dParameter;
CParameter * m_hParameter;
float * m_D;
float * m_H;
//Memory Allocation Host
m_hParameter = new CParameter;
m_H = new float[N];
//Memory Allocation Device
cudaMalloc((void **) &m_D, size);
cudaMalloc((void**)&m_dParameter,sizeof(CParameter));
////Data Initialization
for (int i=0; i<N; i++)
m_H[i] = (float)i;
m_hParameter->A = 5;
m_hParameter->B = 3;
m_hParameter->C = 98;
m_hParameter->D = 100;
//Memory Copy from Host To Device
Status = cudaMemcpy(m_D, m_H, size, cudaMemcpyHostToDevice);
ERROR_CHECK(Status);
Status = cudaMemcpy(m_dParameter,m_hParameter,sizeof(CParameter),cudaMemcpyHostToDevice);
ERROR_CHECK(Status);
Status = cudaMemcpyToSymbol(PARAMETER, &m_dParameter, sizeof(m_dParameter));
ERROR_CHECK(Status);
// Do calculation on device:
int block_size = 4;
int n_blocks = N/block_size + (N%block_size == 0 ? 0:1);
square_array <<<n_blocks, block_size>>>(m_D,N);
// Retrieve result from device and store it in host array
cudaMemcpy(m_H, m_D, sizeof(float)*N, cudaMemcpyDeviceToHost);
// Print results
for (int i=0; i<N; i++)
printf("%d %f\n", i, m_H[i]);
// Cleanup
free(m_H);
free(m_hParameter);
cudaFree(m_dParameter);
cudaFree(m_D);
return 0;
}
I am on Windows with the CUDA 5.0 production release, and the graphics card is a GTX 590.
Any help will be appreciated.
In an effort to avoid being "Stringly Typed", the use of character strings to refer to device symbols was deprecated in CUDA runtime API functions in CUDA 4.1, and removed in CUDA 5.0.
The CUDA 5 release notes read:
** The use of a character string to indicate a device symbol, which was possible
with certain API functions, is no longer supported. Instead, the symbol should be
used directly.
If you change your code to the following, it should work.
Status = cudaMemcpyToSymbol(CONSTANT_PARAMETER, &m_dParameter, sizeof(m_dParameter));
ERROR_CHECK(Status);
From the CUDA 5.0 Release Notes:
** The use of a character string to indicate a device symbol, which was possible with certain API functions, is no longer supported. Instead, the symbol should be used directly. "
These API functions still exist, but they accept the target symbol argument only as a bare identifier now, not as either a bare identifier or a string literal naming an identifier. For example:
__device__ __constant__ type ident;
main() { cudaMemcpyToSymbol("ident", ...); } // no longer valid, returns cudaErrorInvalidSymbol
main() { cudaMemcpyToSymbol(ident, ...); }   // valid
So get rid of this:
#define PARAMETER "CONSTANT_PARAMETER"
And change this:
Status = cudaMemcpyToSymbol(PARAMETER, &m_dParameter, sizeof(m_dParameter));
To this:
Status = cudaMemcpyToSymbol(CONSTANT_PARAMETER, &m_dParameter, sizeof(m_dParameter));
And I think it will work.
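For reference, here is a minimal self-contained sketch of the symbol-based form; the names are illustrative (not from the question) and error checking is omitted for brevity:
#include <cstdio>

__constant__ float d_scale;   // device symbol in constant memory

__global__ void scale_array(float *a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] *= d_scale;
}

int main()
{
    const int N = 8;
    float h[N], *d;
    for (int i = 0; i < N; i++) h[i] = (float)i;

    cudaMalloc((void**)&d, N * sizeof(float));
    cudaMemcpy(d, h, N * sizeof(float), cudaMemcpyHostToDevice);

    float s = 2.0f;
    cudaMemcpyToSymbol(d_scale, &s, sizeof(s));   // pass the symbol directly, not a string

    scale_array<<<1, N>>>(d, N);
    cudaMemcpy(h, d, N * sizeof(float), cudaMemcpyDeviceToHost);

    for (int i = 0; i < N; i++) printf("%d %f\n", i, h[i]);
    cudaFree(d);
    return 0;
}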