OpenCL transpose kernel: how is get_local_id being used?

Code taken from a sample. I created a project with it and it works, but I don't understand some parts.
For the sake of the example, say I have a 32x32 matrix; there are 36 work-items, so get_global_id(0) goes from 0 to 35 I presume, and size = MATRIX_DIM/4 = 8.
__kernel void transpose(__global float4 *g_mat,
                        __local float4 *l_mat, uint size) {

   __global float4 *src, *dst;

   /* Determine row and column location */
   int col = get_global_id(0);
   int row = 0;
   while(col >= size) {
      col -= size--;
      row++;
   }
   col += row;
   size += row;

   /* Read source block into local memory */
   src = g_mat + row * size * 4 + col;
   l_mat += get_local_id(0)*8;
In the clEnqueueNDRangeKernel call, the local_work_size argument was set to NULL, which according to the manual means the implementation figures it out:
local_work_size can also be a NULL value in which case the OpenCL implementation will determine how to break the global work-items into appropriate work-group instances.
But I don't understand the multiply by 8, which gives an address offset into local memory for the work group I suppose. Can someone please explain this?
   l_mat[0] = src[0];
   l_mat[1] = src[size];
   l_mat[2] = src[2*size];
   l_mat[3] = src[3*size];

   /* Process block on diagonal */
   if(row == col) {
      src[0] =
         (float4)(l_mat[0].x, l_mat[1].x, l_mat[2].x, l_mat[3].x);
      src[size] =
         (float4)(l_mat[0].y, l_mat[1].y, l_mat[2].y, l_mat[3].y);
      src[2*size] =
         (float4)(l_mat[0].z, l_mat[1].z, l_mat[2].z, l_mat[3].z);
      src[3*size] =
         (float4)(l_mat[0].w, l_mat[1].w, l_mat[2].w, l_mat[3].w);
   }
   /* Process block off diagonal */
   else {
      /* Read destination block into local memory */
      dst = g_mat + col * size * 4 + row;
      l_mat[4] = dst[0];
      l_mat[5] = dst[size];
      l_mat[6] = dst[2*size];
      l_mat[7] = dst[3*size];

      /* Set elements of destination block */
      dst[0] =
         (float4)(l_mat[0].x, l_mat[1].x, l_mat[2].x, l_mat[3].x);
      dst[size] =
         (float4)(l_mat[0].y, l_mat[1].y, l_mat[2].y, l_mat[3].y);
      dst[2*size] =
         (float4)(l_mat[0].z, l_mat[1].z, l_mat[2].z, l_mat[3].z);
      dst[3*size] =
         (float4)(l_mat[0].w, l_mat[1].w, l_mat[2].w, l_mat[3].w);

      /* Set elements of source block */
      src[0] =
         (float4)(l_mat[4].x, l_mat[5].x, l_mat[6].x, l_mat[7].x);
      src[size] =
         (float4)(l_mat[4].y, l_mat[5].y, l_mat[6].y, l_mat[7].y);
      src[2*size] =
         (float4)(l_mat[4].z, l_mat[5].z, l_mat[6].z, l_mat[7].z);
      src[3*size] =
         (float4)(l_mat[4].w, l_mat[5].w, l_mat[6].w, l_mat[7].w);
   }
}

l_mat is being used as a local store for the threads in a work-group. Specifically, it is used because accesses to local memory are orders of magnitude faster than accesses to global memory.
Each thread needs 8 float4s. Doing the following pointer arithmetic
l_mat += get_local_id(0)*8;
moves the l_mat pointer for each thread so that it doesn't overlap with other threads' data.
This could cause an error: since local_work_size wasn't specified, there is no guarantee that l_mat is large enough to hold the values for every thread in the group.

l_mat is used as a temporary buffer for storing the two matrix blocks being swapped, for all the work-items in the group.
So each work-item needs to store 2 * 4 float4s, hence: offset = get_local_id(0)*2*4 = get_local_id(0)*8.
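For reference, here is a host-side sketch of how this sizing could be made explicit instead of passing NULL for local_work_size: pick a concrete work-group size and size the __local argument to 8 float4s per work-item (buffer and kernel variable names here are hypothetical):

/* Hypothetical host-side setup: an explicit work-group size whose
   __local buffer holds 8 float4s per work-item. */
size_t local_size  = 12;   /* must evenly divide global_size */
size_t global_size = 36;   /* one work-item per block pair, as above */
cl_uint size_arg   = 8;    /* MATRIX_DIM/4 */
clSetKernelArg(kernel, 0, sizeof(cl_mem), &g_mat_buf);
clSetKernelArg(kernel, 1, local_size * 8 * sizeof(cl_float4), NULL); /* l_mat */
clSetKernelArg(kernel, 2, sizeof(cl_uint), &size_arg);
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, &local_size,
                       0, NULL, NULL);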


Strategy for doing final reduction

I am trying to implement an OpenCL version of a reduction over an array of floats.
To achieve it, I took the following code snippet found on the web:
__kernel void sumGPU ( __global const double *input,
                       __global double *partialSums,
                       __local double *localSums)
{
   uint local_id = get_local_id(0);
   uint group_size = get_local_size(0);

   // Copy from global memory to local memory
   localSums[local_id] = input[get_global_id(0)];

   // Loop for computing localSums
   for (uint stride = group_size/2; stride > 0; stride /= 2)
   {
      // Wait for each round of pairwise additions within the work-group
      barrier(CLK_LOCAL_MEM_FENCE);

      // Divide the work-group into two halves and add elements pairwise
      // between local_id and local_id + stride
      if (local_id < stride)
         localSums[local_id] += localSums[local_id + stride];
   }

   // Write the result into partialSums[nWorkGroups]
   if (local_id == 0)
      partialSums[get_group_id(0)] = localSums[0];
}
This kernel code works well but I would like to compute the final sum by adding all the partial sums of each work group.
Currently, I do this final summation on the CPU with a simple loop over the nWorkGroups partial sums.
I also saw another solution using atomic functions, but it seems to be implemented only for int, not for floats. I think that only CUDA provides atomic functions for float.
I also saw that I could use another kernel which performs this summation, but I would like to avoid that solution in order to keep the source simple and readable. Maybe I cannot do without it...
I must tell you that I use OpenCL 1.2 (returned by clinfo) on a Radeon HD 7970 Tahiti 3GB (I think that OpenCL 2.0 is not supported with my card).
More generally, I would like to get advice about the simplest method to perform this last final summation with my graphics card model and OpenCL 1.2.
If the floats' order of magnitude is smaller than exa-scale, then:
Instead of
if (local_id == 0)
   partialSums[get_group_id(0)] = localSums[0];
You could use
if (local_id == 0)
{
   if (strategy == ATOMIC)
   {
      long integer_part = getIntegerPart(localSums[0]);
      atom_add(&totalSumIntegerPart[0], integer_part);
      long float_part = 1000000 * getFloatPart(localSums[0]);
      // 1000000 to keep 7 meaningful digits as an integer
      atom_add(&totalSumFloatPart[0], float_part);
   }
}
This will overflow the fractional-part accumulator, so when you divide it by 1000000 in another kernel it may hold more than 1000000; you then take its integer part and add it to the real integer part:
float value = 0;
if (strategy == ATOMIC)
{
   float float_part = getFloatPart_(totalSumFloatPart[0]);
   float integer_part = getIntegerPart_(totalSumFloatPart[0])
                        + totalSumIntegerPart[0];
   value = integer_part + float_part;
}
Just a few atomic operations shouldn't have a noticeable effect on the whole kernel time.
Some of these get___part helpers can be written easily using floor and similar functions; some need a division by 1000000.
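For completeness, here is a sketch of what those placeholder helpers could look like in OpenCL C (the names are this answer's placeholders, not a standard API, and the 64-bit atom_add additionally requires the cl_khr_int64_base_atomics extension):

/* Device side: split a double into integer and fractional parts. */
long getIntegerPart(double v)  { return (long)floor(v); }
double getFloatPart(double v)  { return v - floor(v); }   /* in [0,1) */

/* Read-back side: undo the 1000000 scaling of the fractional accumulator. */
long getIntegerPart_(long scaled) { return scaled / 1000000; }
float getFloatPart_(long scaled)  { return (scaled % 1000000) / 1000000.0f; }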
Sorry for the previous code; it has a problem, too:
CLK_GLOBAL_MEM_FENCE only affects the current work-group.
I got confused. =[
If you want the reduction sum done by the GPU, you should enqueue a second reduction kernel with clEnqueueNDRangeKernel after clFinish(commandQueue).
Please just take the concept.
__kernel void sumGPU ( __global const double *input,
                       __global double *partialSums,
                       __local double *localSums)
{
   uint local_id = get_local_id(0);
   uint group_size = get_local_size(0);

   // Copy from global memory to local memory
   localSums[local_id] = input[get_global_id(0)];

   // Loop for computing localSums
   for (uint stride = group_size/2; stride > 0; stride /= 2)
   {
      // Wait for each round of pairwise additions within the work-group
      barrier(CLK_LOCAL_MEM_FENCE);

      // Divide the work-group into two halves and add elements pairwise
      // between local_id and local_id + stride
      if (local_id < stride)
         localSums[local_id] += localSums[local_id + stride];
   }

   // Write the result into partialSums[nWorkGroups]
   if (local_id == 0)
      partialSums[get_group_id(0)] = localSums[0];

   barrier(CLK_GLOBAL_MEM_FENCE);

   if (get_group_id(0) == 0) {
      if (local_id < get_num_groups(0)) {   // 16384
         for (int n = 0; n < get_num_groups(0); n += group_size)
            localSums[local_id] += partialSums[local_id + n];
         barrier(CLK_LOCAL_MEM_FENCE);

         for (int s = group_size/2; s > 0; s /= 2) {
            if (local_id < s)
               localSums[local_id] += localSums[local_id + s];
            barrier(CLK_LOCAL_MEM_FENCE);
         }

         if (local_id == 0)
            partialSums[0] = localSums[0];
      }
   }
}
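Host-side, the recommended two-pass approach could look roughly like this (a sketch with hypothetical buffer names; it glosses over padding and bounds handling when the element count is not a multiple of the work-group size):

/* Run the work-group reduction repeatedly, ping-ponging buffers,
   until a single value remains. */
size_t n = inputLength, local = 256;
cl_mem in = inputBuf, out = partialBuf;
while (n > 1) {
    size_t global = ((n + local - 1) / local) * local;
    clSetKernelArg(sumKernel, 0, sizeof(cl_mem), &in);
    clSetKernelArg(sumKernel, 1, sizeof(cl_mem), &out);
    clSetKernelArg(sumKernel, 2, local * sizeof(cl_double), NULL);
    clEnqueueNDRangeKernel(queue, sumKernel, 1, NULL, &global, &local,
                           0, NULL, NULL);
    n = global / local;                    /* one partial sum per work-group */
    cl_mem tmp = in; in = out; out = tmp;  /* swap: last output becomes next input */
}
clFinish(queue);   /* final sum is now at element 0 of the buffer 'in' points to */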

sending c struct via MPI fails partially

I am sending a (particle) struct using the MPI_Type_create_struct() as done e.g. here, or explained in detail here.
I'm collecting all particles which are going to a specific proc, memcpy() them into the send buffer and MPI_Isend() them.
So far, so good. MPI_Iprobe()'ing for the message gives me the right count of particles sent.
So I MPI_Recv() the buffer and extract the data (now even by copying the structs one by one). No matter how many particles I send, only the first particle's data is correct.
There are three possible mistakes:
The MPI_Type_create_struct() doesn't create a proper map of my struct, due to my usage of offsetof() like in the first link. Maybe my struct contains invisible padding as explained in the second link.
I'm doing some simple mistakes while copying particles into the send buffer and from the receive buffer back (I do print the send buffer - and it works - but maybe I'm overlooking something)
Something totally different.
(Sorry for the really ugly presentation of the code; I could not manage to present it in a decent way. You'll find the code - the line is already marked - on GitHub, too!)
Here are the construction of the mpi datatype,
typedef struct {
   int ID;
   double x[DIM];
} pchase_particle_t;

const int items = 2;
int block_lengths[2] = {1, DIM};
MPI_Datatype mpi_types[2] = {MPI_INT, MPI_DOUBLE};
MPI_Aint offsets[2];

offsets[0] = offsetof(pchase_particle_t, ID);
offsets[1] = offsetof(pchase_particle_t, x);

MPI_Type_create_struct(items, block_lengths, offsets, mpi_types, &W->MPI_Particle);
MPI_Type_commit(&W->MPI_Particle);
the sending
/* handle all mpi send/recv status data */
MPI_Request *send_request = P4EST_ALLOC(MPI_Request, W->p4est->mpisize);
MPI_Status *recv_status = P4EST_ALLOC(MPI_Status, W->p4est->mpisize);

/* setup send/recv buffers */
pchase_particle_t **recv_buf = P4EST_ALLOC(pchase_particle_t *, num_senders);
pchase_particle_t **send_buf = P4EST_ALLOC(pchase_particle_t *, num_receivers);
int recv_count = 0, recv_length, flag, j;

/* send all particles to their belonging procs */
for (i = 0; i < num_receivers; i++) {
   /* resolve particle list for proc i */
   sc_list_t *tmpList = *((sc_list_t **) sc_array_index(W->particles_to, receivers[i]));
   pchase_particle_t *tmpParticle;
   int send_count = 0;

   /* get space for the particles to be sent */
   send_buf[i] = P4EST_ALLOC(pchase_particle_t, tmpList->elem_count);

   /* copy all particles into the send buffer and remove them from this proc */
   while (tmpList->first != NULL) {
      tmpParticle = sc_list_pop(tmpList);
      memcpy(send_buf[i] + send_count * sizeof(pchase_particle_t), tmpParticle, sizeof(pchase_particle_t));
      /* free particle */
      P4EST_FREE(tmpParticle);
      /* update particle counter */
      send_count++;
   }

   /* print send buffer */
   for (j = 0; j < send_count; j++) {
      pchase_particle_t *tmpParticle = send_buf[i] + j * sizeof(pchase_particle_t);
      printf("[pchase %i sending] particle[%i](%lf,%lf)\n", W->p4est->mpirank, tmpParticle->ID, tmpParticle->x[0], tmpParticle->x[1]);
   }
   printf("[pchase %i sending] particle count: %i\n", W->p4est->mpirank, send_count);

   /* send particles to right owner */
   mpiret = MPI_Isend(send_buf[i], send_count, W->MPI_Particle, receivers[i], 13, W->p4est->mpicomm, &send_request[i]);
   SC_CHECK_MPI(mpiret);
}
and the receiving.
recv_count = 0;
/* check for messages until all arrived */
while (recv_count < num_senders) {
   /* probe if any of the senders has already sent its message */
   for (i = 0; i < num_senders; i++) {
      MPI_Iprobe(senders[i], MPI_ANY_TAG, W->p4est->mpicomm,
                 &flag, &recv_status[i]);
      if (flag) {
         /* resolve number of particles being received */
         MPI_Get_count(&recv_status[i], W->MPI_Particle, &recv_length);
         printf("[pchase %i receiving message] %i particles arrived from sender %i with tag %i\n",
                W->p4est->mpirank, recv_length, recv_status[i].MPI_SOURCE, recv_status[i].MPI_TAG);

         /* get space for the particles being received */
         recv_buf[recv_count] = P4EST_ALLOC(pchase_particle_t, recv_length);

         /* receive a list with recv_length particles */
         mpiret = MPI_Recv(recv_buf[recv_count], recv_length, W->MPI_Particle, recv_status[i].MPI_SOURCE,
                           recv_status[i].MPI_TAG, W->p4est->mpicomm, &recv_status[i]);
         SC_CHECK_MPI(mpiret);

         /* insert all received particles into the push list */
         pchase_particle_t *tmpParticle;
         for (j = 0; j < recv_length; j++) {
            /* retrieve all particle details from recv_buf */
            tmpParticle = recv_buf[recv_count] + j * sizeof(pchase_particle_t);
            pchase_particle_t *addParticle = P4EST_ALLOC(pchase_particle_t, 1);
            addParticle->ID = tmpParticle->ID;
            addParticle->x[0] = tmpParticle->x[0];
            addParticle->x[1] = tmpParticle->x[1];
            printf("[pchase %i receiving] particle[%i](%lf,%lf)\n",
                   W->p4est->mpirank, addParticle->ID, addParticle->x[0], addParticle->x[1]);

            /* push received particle to push list and update world counter */
            sc_list_append(W->particle_push_list, addParticle);
            W->n_particles++;
         }
         /* we received another particle list */
         recv_count++;
      }
   }
}
edit: reindented..
edit: Only the first particle's data is correct, meaning that all its properties (ID and coordinates) are identical to those of the sent particle. The others, however, are initialized with zeros, i.e. ID=0, x[0]=0.0, x[1]=0.0. Maybe that's a hint for the solution.
There is an error in your pointer arithmetic. send_buf[i] is already of type pchase_particle_t *, and therefore send_buf[i] + j * sizeof(pchase_particle_t) does not point to the j-th element of the i-th buffer but rather to the (j * sizeof(pchase_particle_t))-th element. Thus your particles are not stored contiguously in memory but rather separated by sizeof(pchase_particle_t) - 1 empty array elements. These get sent instead of the correct particles because the MPI_Isend call accesses buffer memory contiguously. The same applies to the code of the receiver.
You do not see the error in the sender code because your debug print uses the same incorrect pointer arithmetic and hence accesses memory using the same stride. I guess your send counts are small and you get memory allocated on the data segment heap, otherwise you should have received SIGSEGV for out-of-bounds array access very early in the data packing process (e.g. in the memcpy part).
Resolution: do not multiply the array index by sizeof(pchase_particle_t).
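Concretely, the fixed lines would be:

/* sender: pointer arithmetic on a pchase_particle_t* already scales
   by sizeof(pchase_particle_t), so the bare index is enough */
memcpy(send_buf[i] + send_count, tmpParticle, sizeof(pchase_particle_t));

/* receiver (and the debug print): same correction */
tmpParticle = recv_buf[recv_count] + j;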

Race conditions despite atomicAdd functions (CUDA)?

I have a problem that is parallel on two levels: I have a ton of sets of (x0, x1, y0, y1) coordinate pairs, which are turned into variables vdx, vdy, vyy and for each of these sets I'm trying to calculate the values of all "monomials" composed of them up to degree n (i.e. all possible combinations of different powers of them, like vdx^3*vdy*vyy^2 or vdx*1*vyy^4). These values are then added up over all the sets.
My strategy (and for now I'd just like to get it to work, it doesn't have to be optimized with multiple kernels or complex reductions, unless it really has to) is to have each thread deal with one set of coordinate pairs and calculate the values of all their corresponding monomials. Each block's shared memory holds all the monomial sums, and when the block is done, the first thread in the block adds the result to the global sum. Since each block's shared memory is accessed by all threads in all places, I'm using atomicAdd; same with the blocks and the global memory.
Unfortunately there still seems to be a race condition somewhere, since I get different results every time I run the kernel.
If it helps, I'm currently using degree = 3 and omitting one of the variables, which means that in the code below, the innermost for loop (over evbl) doesn't do anything and just repeats 4 times. Indeed, the output of the kernel looks like this: 51502,55043.1,55043.1,51502,47868.5,47868.5,48440.5,48440.6,46284.7,46284.7,46284.7,46284.7,46034.3,46034.3,46034.3,46034.3,44972.8,44972.8,44972.8,44972.8,43607.6,43607.6,43607.6,43607.6,43011,43011,43011,43011,42747.8,42747.8,42747.8,42747.8,45937.8,45937.8,46509.9,46509.9,... and it's noticeable that there is a (rough) pattern of 4-tuples. But every time I run it the values are all very different.
Everything is in floats, but I'm on a 2.1 GPU and so that shouldn't be a problem. cuda-memcheck also reports no errors.
Can somebody with more CUDA experience give me some pointers how to track down the race condition here?
__global__ void kernel(...) {
   extern __shared__ float s_data[];
   // just use global memory for now
   // get threadID:
   int idx = blockIdx.x * blockDim.x + threadIdx.x;
   if(idx >= nPairs) return;

   // ... do some calculations to get x/y...

   // calculate vdx, vdy and vyy
   float vdx = (x1 - x0)/(float)xheight;
   float vdy = (y1 - y0)/(float)xheight;
   float vyy = 0.5*(y0 + y1)/(float)xheight;

   const int offs1 = degree + 1;
   const int offs2 = offs1 * offs1;
   const int offs3 = offs2 * offs1;
   float sol = 1.0;

   // now calculate monomial results and store in shared memory
   for(int evdx = 0; evdx <= degree; evdx++) {
      for(int evdy = 0; evdy <= degree; evdy++) {
         for(int evyy = 0; evyy <= degree; evyy++) {
            for(int evbl = 0; evbl <= degree; evbl++) {
               // monomial value: product of the powers of the variables
               sol = powf(vdx, evdx) * powf(vdy, evdy) * powf(vyy, evyy);
               atomicAdd(&(s_data[evbl + offs1*evyy + offs2*evdy +
                                  offs3*evdx]), sol/1000.0 );
            }
         }
      }
   }

   // now copy shared memory to global
   __syncthreads();
   if(threadIdx.x == 0) {
      for(int i = 0; i < nMonomials; i++) {
         atomicAdd(&outmD[i], s_data[i]);
      }
   }
}
You are using shared memory but you are never initializing it.
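A minimal sketch of the missing step, assuming nMonomials is the number of shared-memory slots in use; note it must be placed before the early return on idx >= nPairs, so that every thread reaches the barrier:

// Zero the shared accumulator cooperatively at the very top of the kernel.
for (int i = threadIdx.x; i < nMonomials; i += blockDim.x)
    s_data[i] = 0.0f;
__syncthreads();   // all slots are zero before any atomicAdd touches them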

wrong partial zero result from copying shared memory to global memory

I wrote a simple CUDA kernel as follows:
__global__ void cudaDoSomethingInSharedMemory(float* globalArray, size_t pitch){
   __shared__ float sharedInputArray[1088];
   __shared__ float sharedOutputArray[1088];

   int tid = threadIdx.x;  //Use 1D block
   int rowIdx = blockIdx.x;  //Use 1D grid
   int rowOffset = pitch/sizeof(float);  //Offset in elements (not in bytes)

   //Copy data from global memory to shared memory (checked)
   while(tid < 1088){
      sharedInputArray[tid] = *(((float*) globalArray) + rowIdx*rowOffset + tid);
      tid += blockDim.x;
      __syncthreads();
   }
   __syncthreads();

   //Do something (already simplified and the problem still exists)
   tid = threadIdx.x;
   while(tid < 1088){
      if(tid%2==1){
         if(tid == 1087){
            sharedOutputArray[tid/2 + 544] = 321;
         }
         else{
            sharedOutputArray[tid/2 + 544] = 321;
         }
      }
      tid += blockDim.x;
      __syncthreads();
   }

   tid = threadIdx.x;
   while(tid < 1088){
      if(tid%2==0){
         if(tid==0){
            sharedOutputArray[tid/2] = 123;
         }
         else{
            sharedOutputArray[tid/2] = 123;
         }
      }
      tid += blockDim.x;
      __syncthreads();
   }
   __syncthreads();

   //Copy data from shared memory back to global memory (and add read-back for test)
   float temp = -456;
   tid = threadIdx.x;
   while(tid < 1088){
      *(((float*) globalArray) + rowIdx*rowOffset + tid) = sharedOutputArray[tid];
      temp = *(((float*) globalArray) + rowIdx*rowOffset + tid);  //(1*) Errors are found.
      __syncthreads();
      tid += blockDim.x;
   }
   __syncthreads();
}
The code changes "sharedOutputArray" from "interlaced" to "clustered": "123 321 123 321 ... 123 321" is changed to "123 123 123 ... 123 321 321 321 ... 321", and the clustered result is output to the global memory array "globalArray". "globalArray" is allocated by "cudaMallocPitch()".
This kernel is used to process a 2D array. The idea is simple: one block for one row (so 1D grid and the number of blocks equals the number of rows) and N threads for each row. The row number is 1920 and column number is 1088. So there are 1920 blocks.
The problem is: when N (the number of threads in one block) is 64, 128 or 256, everything works (at least looks like it works) fine. However, when N is 512 (I am using a GTX570 with CUDA compute capability 2.0, where the maximum size of each block dimension is 1024), errors happen.
The errors are: the elements (each a 4-byte float) in a row in global memory from position 256 to 287 (index starting at 0; the error strip is 32 elements, i.e. 128 bytes, long) are 0 rather than 123. It looks like "123 123 123 ... 0 0 0 0 0 ... 0 123 123 ...". I checked the line marked (1*) above: those elements were 123 in "sharedOutputArray", yet when an element (for example tid==270) was read back at (1*), "temp" showed 0. I also checked "tid==255" and "tid==288", and those elements were 123 (correct). This type of error happened in almost all 1920 rows.
I tried to "synchronize" the threads (maybe even over-synchronized) but it did not work. What confuses me is why 64, 128 or 256 threads work fine but 512 does not. I know using 512 threads may not be optimal for performance; I just would like to know where I made the mistake.
Thank you in advance.
You are using __syncthreads() inside conditional code where the condition does not evaluate uniformly between the threads of a block. Don't do that.
In your case you can simply remove the __syncthreads() inside the while loops, as it serves no purpose.
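For example, the first copy loop becomes the following, with a single barrier after the loop so every thread reaches it exactly once:

tid = threadIdx.x;
while (tid < 1088) {
    sharedInputArray[tid] = globalArray[rowIdx * rowOffset + tid];
    tid += blockDim.x;
}
__syncthreads();   // safe: no thread is still inside the divergent loop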

How to read back a CUDA Texture for testing?

OK, so far I can create an array on the host computer (of type float), copy it to the GPU, then bring it back to the host as another array (to test whether the copy was successful by comparing to the original).
I then create a CUDA array from the array on the GPU. Then I bind that array to a CUDA texture.
I now want to read that texture back and compare with the original array (again, to test that it copied correctly). I saw some sample code that uses the readTexels() function shown below. It doesn't seem to work for me... (basically, everything works except for the section in the bindToTexture(float* deviceArray) function starting at the readTexels(SIZE, testArrayDevice) line).
Any suggestions of a different way to do this? Or are there some obvious problems I missed in my code?
Thanks for the help guys!
#include <stdio.h>
#include <assert.h>
#include <cuda.h>

#define SIZE 20

//Create a channel description to use.
cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);

//Create a texture to use.
texture<float, 2, cudaReadModeElementType> cudaTexture;
//cudaTexture.filterMode = cudaFilterModeLinear;
//cudaTexture.normalized = false;

__global__ void readTexels(int amount, float *Array)
{
   int index = blockIdx.x * blockDim.x + threadIdx.x;
   if (index < amount)
   {
      float x = tex1D(cudaTexture, float(index));
      Array[index] = x;
   }
}

float* copyToGPU(float* hostArray, int size)
{
   //Create pointers, one for the array to be on the device, and one for bringing it back to the host for testing.
   float* deviceArray;
   float* testArray;

   //Allocate some memory for the two arrays so they don't get overwritten.
   testArray = (float *)malloc(sizeof(float)*size);

   //Allocate some memory for the array to be put onto the GPU device.
   cudaMalloc((void **)&deviceArray, sizeof(float)*size);

   //Actually copy the array from hostArray to deviceArray.
   cudaMemcpy(deviceArray, hostArray, sizeof(float)*size, cudaMemcpyHostToDevice);

   //Copy the deviceArray back to testArray in host memory for testing.
   cudaMemcpy(testArray, deviceArray, sizeof(float)*size, cudaMemcpyDeviceToHost);

   //Make sure contents of testArray match the original contents in hostArray.
   for (int i = 0; i < size; i++)
   {
      if (hostArray[i] != testArray[i])
      {
         printf("Location [%d] does not match in hostArray and testArray.\n", i);
      }
   }

   //Don't forget to free these arrays after you're done!
   free(testArray);
   return deviceArray; //TODO: FREE THE DEVICE ARRAY VIA cudaFree(deviceArray);
}

cudaArray* bindToTexture(float* deviceArray)
{
   //Create a CUDA array to translate deviceArray into.
   cudaArray* cuArray;

   //Allocate memory for the CUDA array.
   cudaMallocArray(&cuArray, &cudaTexture.channelDesc, SIZE, 1);

   //Copy the deviceArray into the CUDA array.
   cudaMemcpyToArray(cuArray, 0, 0, deviceArray, sizeof(float)*SIZE, cudaMemcpyHostToDevice);

   //Release the deviceArray
   cudaFree(deviceArray);

   //Bind the CUDA array to the texture.
   cudaBindTextureToArray(cudaTexture, cuArray);

   //Make a test array on the device and on the host to verify that the texture has been saved correctly.
   float* testArrayDevice;
   float* testArrayHost;

   //Allocate memory for the two test arrays.
   cudaMalloc((void **)&testArrayDevice, sizeof(float)*SIZE);
   testArrayHost = (float *)malloc(sizeof(float)*SIZE);

   //Read the texels of the texture to the test array in the device.
   readTexels(SIZE, testArrayDevice);

   //Copy the device test array to the host test array.
   cudaMemcpy(testArrayHost, testArrayDevice, sizeof(float)*SIZE, cudaMemcpyDeviceToHost);

   //Print contents of the array out.
   for (int i = 0; i < SIZE; i++)
   {
      printf("%f\n", testArrayHost[i]);
   }

   //Free the memory for the test arrays.
   free(testArrayHost);
   cudaFree(testArrayDevice);

   return cuArray; //TODO: UNBIND THE CUDA TEXTURE VIA cudaUnbindTexture(cudaTexture);
                   //TODO: FREE THE CUDA ARRAY VIA cudaFreeArray(cuArray);
}

int main(void)
{
   float* hostArray;
   hostArray = (float *)malloc(sizeof(float)*SIZE);

   for (int i = 0; i < SIZE; i++)
   {
      hostArray[i] = 10.f + i;
   }

   float* deviceAddy = copyToGPU(hostArray, SIZE);

   free(hostArray);
   return 0;
}
Briefly:
------------- in your main.cu ---------------------------------------------------------------------------------------
-1. Define the texture as a global variable
texture<uint, 2, cudaReadModeElementType> refTexture; // global variable!
// meaning: address the texture with (x,y) (2D) and get an unsigned int
In the main function:
-2. Use arrays combined with the texture
cudaArray* myArray; // declaration
// ask for memory
cudaMallocArray ( &myArray,
                  &refTexture.channelDesc, /* with this you don't need to fill a channel descriptor */
                  width,
                  height);
-3. Copy data from CPU to GPU (to the array)
cudaMemcpyToArray ( myArray,                    // destination: the array
                    0, 0,                       // offsets
                    sourceData,                 // pointer uint*
                    width*height*sizeof(uint),  // total amount of bytes to be copied
                    cudaMemcpyHostToDevice);
-4. Bind the texture and the array
cudaBindTextureToArray( refTexture, myArray );
-5. Change some parameters in the texture
refTexture.normalized = false; // don't automatically convert fetched data to [0,1)
refTexture.addressMode[0] = cudaAddressModeClamp; // if my indexing is out of bounds, automatically use a valid index (0 if the index is negative, the last one if it is too large)
refTexture.addressMode[1] = cudaAddressModeClamp;
---------- in the kernel --------------------------------------------------------
// find out the indexes (f,c) to be processed by this thread
uint f = (blockIdx.x * blockDim.x) + threadIdx.x;
uint c = (blockIdx.y * blockDim.y) + threadIdx.y;
// this is curious and necessary: indexes for reading from a texture
// are floats! Even if you are certain to access (4,5) you have to
// match the "center", which is (4.5, 5.5)
uint read = tex2D( refTexture, c+0.5f, f+0.5f); // refTexture is a global variable
Now you process read and write the results to another zone of the device's global
memory, not to the texture itself!
readTexels() is a kernel (__global__) function, i.e. it runs on the GPU. Therefore you need to use the correct syntax to launch a kernel.
Take a look through the CUDA Programming Guide and some of the SDK samples, both available via the NVIDIA CUDA site to see how to launch a kernel.
Hint: It'll end up something like readTexels<<<grid,block>>>(...)
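A sketch of what that launch could look like for this 20-element test (the block size here is an arbitrary choice):

int threadsPerBlock = 32;
int blocks = (SIZE + threadsPerBlock - 1) / threadsPerBlock;
readTexels<<<blocks, threadsPerBlock>>>(SIZE, testArrayDevice);
cudaDeviceSynchronize();   // wait for the kernel before copying results back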
