Aligning a 12-byte struct within a warp in CUDA

Let's say we have a struct of three integers which is not aligned:
struct data {
    int x;
    int y;
    int z;
};
I pass an array of this struct to a kernel. I'm aware that I should pass a structure of arrays instead of an array of structures, but that is not important for this question.
The 32 threads inside a warp access memory in a coalesced manner (elements i to i + 31), which amounts to 384 bytes in total. 384 bytes is a multiple of the L1 cache line size (128 bytes), so the warp needs three 128-byte memory transactions.
Now if we have an aligned struct:
struct __align__(16) aligned_data {
    int x;
    int y;
    int z;
};
If the access pattern remains the same as in the previous example, the warp fetches 512 bytes, which is four 128-byte memory transactions.
So is the first example more efficient, or is the second one still more efficient even though it fetches more memory?

The only real way to answer a question like this is by benchmarking. And if you do, you may not get the same answer depending on your hardware. When I run this:
#define NITER (128)

struct data {
    int x;
    int y;
    int z;
};

struct __align__(16) aligned_data {
    int x;
    int y;
    int z;
};

template<typename T, int niter>
__global__
void kernel(T *in, int *out, int dowrite=0)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int nthreads = blockDim.x * gridDim.x;
    int oval = 0;
#pragma unroll
    for(int i=0; i<niter; ++i,tid+=nthreads) {
        T val = in[tid];
        oval += val.x + val.y + val.z;
    }
    if (dowrite) {
        out[tid] = oval;
    }
}

template __global__ void kernel<data,NITER>(data *, int*, int);
template __global__ void kernel<aligned_data,NITER>(aligned_data *, int*, int);
int main()
{
    const int bs = 512;
    const int nb = 32;
    const int nvals = bs * nb * NITER;

    data *d_; cudaMalloc((void **)&d_, sizeof(data) * size_t(nvals));
    aligned_data *ad_; cudaMalloc((void **)&ad_, sizeof(aligned_data) * size_t(nvals));

    for(int i=0; i<10; ++i) {
        kernel<data,NITER><<<nb, bs>>>(d_, (int *)0, 0);
        kernel<aligned_data,NITER><<<nb, bs>>>(ad_, (int *)0, 0);
        cudaDeviceSynchronize();
    }

    cudaDeviceReset();
    return 0;
}
I see that the aligned structure version gives overall higher performance on a compute 5.2 capability device:
Time(%) Time Calls Avg Min Max Name
52.71% 2.3995ms 10 239.95us 238.10us 241.79us void kernel<data, int=128>(data*, int*, int)
47.29% 2.1529ms 10 215.29us 214.91us 215.51us void kernel<aligned_data, int=128>(aligned_data*, int*, int)
In this case I would assume that the roughly 10% improvement is down to the lower number of load instructions which are issued. In the unaligned case the compiler issues three 32-bit loads to fetch the structure, whereas in the aligned case the compiler issues a single 128-bit load to fetch the structure. The reduction in instructions seems to offset the 25% of net memory bandwidth that is wasted. On other hardware with a different memory instruction throughput to bandwidth ratio, the result might well be different.
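For illustration, what the compiler does in the aligned case can also be written out by hand as a vector load. This is only a sketch of the idea, not code from the benchmark above; it assumes the buffer comes from cudaMalloc (so it is at least 16-byte aligned) and relies on __align__(16) padding the struct to 16 bytes:

// Sketch: reading aligned_data through a single 16-byte vector load.
// The int4 reinterpretation below is illustrative, not part of the benchmark.
__global__ void sum_aligned(const aligned_data *in, int *out, int n)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < n) {
        // one 128-bit load instead of three 32-bit loads; the fourth lane is padding
        int4 v = reinterpret_cast<const int4 *>(in)[tid];
        out[tid] = v.x + v.y + v.z;
    }
}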

Related

Multiple Flexible Array Member or VLA with Shared Memory and Semaphores

I need to define a struct with two semaphores and three (at least) or possibly more arrays as members, whose sizes are variables. Indicative example (not valid syntax, but to give contextual meaning; lr is a typedef for double):
int nx = 400, ny = 400, nz = 400;

struct ShMem {
    sem_t prod;
    sem_t cons;
    lr u_x[nx+1][ny+2][nz+2];
    lr u_y[nx+2][ny+1][nz+2];
    lr u_z[nx+2][ny+2][nz+2];
};
What I need to do is make the struct ShMem a shared memory block between two programs, a producer and a consumer, which compute on and read this memory block with the help of the semaphores present in the struct.
Since the array sizes are variables that will only be defined at runtime, how do I get a 3-dimensional variable-length array?
Comment:
If, let's say, I have nx, ny and nz #defined to 400, I follow these steps (already tested):
#define nx (400)
#define ny (400)
#define nz (400)

struct ShMem {
    sem_t prod;
    sem_t cons;
    lr u_x[nx+1][ny+2][nz+2];
    lr u_y[nx+2][ny+1][nz+2];
    lr u_z[nx+2][ny+2][nz+2];
};
...
// shared memory allocation
ShmID = shmget(ShmKEY, sizeof(struct ShMem), IPC_CREAT|0666);
...
An additional requirement is that the application needs those arrays as 3D arrays, so that I can index them as u_x[i][j][k], where i, j, k are indices in the x, y, and z directions respectively.
Edit after Lundin's and Felix's solutions.
CONSTRAINT - u_x, u_y and u_z need to be 3D arrays (or *** pointers) accessed as u_x[i][j][k]. This can't be changed since this is legacy code; the arrays need to be set up so that this access pattern is preserved, because it is used that way everywhere in the code.
As already discussed in the comments, C doesn't support something like that. So, you will have to build it yourself. A simple example using a macro to make the "3D access" inside the structure readable could look like this:
#include <stdlib.h>

typedef int lr;

struct foo {
    size_t ny;
    size_t nz;
    lr *u_y;
    lr *u_z;
    lr u_x[];
};

#define val(o, a, x, y, z) ((o).a[(o).ny * (o).nz * x + (o).nz * y + z])

struct foo *foo_create(size_t nx, size_t ny, size_t nz)
{
    size_t arrsize = nx * ny * nz;
    struct foo *obj = malloc(sizeof *obj + 3 * arrsize * sizeof *(obj->u_x));
    if (!obj) return 0;

    obj->ny = ny;
    obj->nz = nz;
    obj->u_y = obj->u_x + arrsize;
    obj->u_z = obj->u_y + arrsize;
    return obj;
}

int main(void)
{
    struct foo *myFoo = foo_create(10, 10, 10);

    // set u_y[9][5][2] in *myFoo to 42:
    val(*myFoo, u_y, 9, 5, 2) = 42;

    free(myFoo);
}
This uses the single flexible array member (FAM) at the end of the struct that C supports, so you can allocate such a struct as a single block. To place it in shared memory, just replace the malloc() and use the same calculation for the size.
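As a rough sketch of that substitution (foo_shm_create and the key argument are illustrative names, not part of the answer above; it assumes System V shared memory as in the question):

#include <sys/ipc.h>
#include <sys/shm.h>

struct foo *foo_shm_create(key_t key, size_t nx, size_t ny, size_t nz)
{
    size_t arrsize = nx * ny * nz;
    size_t total = sizeof(struct foo) + 3 * arrsize * sizeof(lr);

    int id = shmget(key, total, IPC_CREAT | 0666);
    if (id < 0) return 0;

    struct foo *obj = shmat(id, 0, 0);
    if (obj == (void *)-1) return 0;

    obj->ny = ny;
    obj->nz = nz;
    // note: u_y and u_z are absolute pointers, so both processes must attach
    // the segment at the same address (or recompute these after shmat)
    obj->u_y = obj->u_x + arrsize;
    obj->u_z = obj->u_y + arrsize;
    return obj;
}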
You have to build something like this:
struct ShMem {
    int some_stuff_here;
    size_t x[3];
    size_t y[3];
    size_t z[3];
    int array[];
};
And then ignore that the flexible array member type is a plain int array. Instead do something like:
size_t size = sizeof( int[x1][y1][z1] ) +
              sizeof( int[x2][y2][z2] ) +
              sizeof( int[x3][y3][z3] );

struct ShMem *shmem = malloc(sizeof *shmem + size);
And then when accessing, you use an array pointer type instead of the int[]. The code turns a bit nasty to read:
for(size_t i=0; i<3; i++)
{
    typedef int(*arr_t)[shmem->y[i]][shmem->z[i]]; // desired array pointer type
    int some_offset = ... // calculate based on previously used x y z
    arr_t arr = (arr_t)(shmem->array + some_offset); // cast to the array pointer type

    for(size_t x=0; x<shmem->x[i]; x++)
    {
        for(size_t y=0; y<shmem->y[i]; y++)
        {
            for(size_t z=0; z<shmem->z[i]; z++)
            {
                arr[x][y][z] = something;
            }
        }
    }
}
This is actually well-defined behavior, since data allocated with malloc doesn't have an effective type until you store to it.
"some_offset" in the above example could be a counter variable or something stored inside the struct itself.

OpenACC error with copyin and copyout

General Information
NOTE: I am also fairly new to C and OpenACC.
Hi, I am trying to develop an image-blurring program, but first I wanted to see if I could parallelize the for loops and copyin/copyout my values.
The problem I am currently facing is when I try to copyin and copyout my data and output variables. The error looks to be a buffer overflow (I have also googled it and that is what people have said), but I am not sure how I should go about fixing it. I think I am doing something wrong with the pointers, but I am not sure.
Thanks so much in advance, if you think that I missed some information please let me know and I can provide it.
Question
What does the error actually mean?
How should I go about fixing the issue?
Is there anything I should look into so I can fix this kind of issue myself in the future?
Error
FATAL ERROR: variable in data clause is partially present on the device: name=output
file:/nfs/u50/singhn8/4F03/A3/main.c ProcessImageACC line:48
output lives at 0x7ffca75f6288 size 16 not present
Present table dump for device[1]: NVIDIA Tesla GPU 1, compute capability 3.5
host:0x7fe98eaf9010 device:0xb05dc0000 size:2073600 presentcount:1 line:47 name:(null)
host:0x7fe98f0e8010 device:0xb05bc0000 size:2073600 presentcount:1 line:47 name:(null)
host:0x7ffca75f6158 device:0xb05ac0400 size:4 presentcount:1 line:47 name:filterRad
host:0x7ffca75f615c device:0xb05ac0000 size:4 presentcount:1 line:47 name:row
host:0x7ffca75f6208 device:0xb05ac0200 size:4 presentcount:1 line:47 name:col
host:0x7ffca75f6280 device:0xb05ac0600 size:16 presentcount:1 line:48 name:data
Program Definition
#include <sys/time.h>
#include <stdio.h>
#include <stdlib.h>
#include <openacc.h>
// ================================================
// ppmFile.h
// ================================================
#include <sys/types.h>
typedef struct Image
{
    int width;
    int height;
    unsigned char *data;
} Image;

Image* ImageCreate(int width, int height);
Image* ImageRead(char *filename);
void ImageWrite(Image *image, char *filename);
int ImageWidth(Image *image);
int ImageHeight(Image *image);
void ImageClear(Image *image, unsigned char red, unsigned char green, unsigned char blue);
void ImageSetPixel(Image *image, int x, int y, int chan, unsigned char val);
unsigned char ImageGetPixel(Image *image, int x, int y, int chan);
Blur Filter Function
// ================================================
// The Blur Filter
// ================================================
void ProcessImageACC(Image **data, int filterRad, Image **output) {
    int row = (*data)->height;
    int col = (*data)->width;

    #pragma acc data copyin(row, col, filterRad, (*data)->data[0:row * col]) copyout((*output)->data[0:row * col])
    #pragma acc kernels
    {
        #pragma acc loop independent
        for (int j = 0; j < row; j++) {
            #pragma acc loop independent
            for (int i = 0; i < col; i++) {
                (*output)->data[j * col + i] = (*data)->data[j * col + i];
            }
        }
    }
}
Main Function
// ================================================
// Main Program
// ================================================
int main(int argc, char *argv[]) {
    // vars used for processing:
    Image *data, *result;
    int dataSize;
    int filterRadius = atoi(argv[1]);

    // ===read the data===
    data = ImageRead(argv[2]);

    // ===send data to nodes===
    // send data size in bytes
    dataSize = sizeof(unsigned char) * data->width * data->height * 3;

    // ===process the image===
    // allocate space to store result
    result = (Image *)malloc(sizeof(Image));
    result->data = (unsigned char *)malloc(dataSize);
    result->width = data->width;
    result->height = data->height;

    // initialize all to 0
    for (int i = 0; i < (result->width * result->height * 3); i++) {
        result->data[i] = 0;
    }

    // apply the filter
    ProcessImageACC(&data, filterRadius, &result);

    // ===save the data back===
    ImageWrite(result, argv[3]);
    return 0;
}
The problem here is that, in addition to the data arrays, the output and data pointers need to be copied over as well. From the compiler feedback messages, you can see the compiler implicitly copying them over.
% pgcc -c image.c -ta=tesla:cc70 -Minfo=accel
ProcessImageACC:
46, Generating copyout(output->->data[:col*row])
Generating copyin(data->->data[:col*row],col,filterRad,row)
47, Generating implicit copyout(output[:1])
Generating implicit copyin(data[:1])
50, Loop is parallelizable
52, Loop is parallelizable
Accelerator kernel generated
Generating Tesla code
50, #pragma acc loop gang, vector(4) /* blockIdx.y threadIdx.y */
52, #pragma acc loop gang, vector(32) /* blockIdx.x threadIdx.x */
Now you might be able to get this to work by using unstructured data regions to create both the data and the pointers, and then "attach" the pointers to the arrays (i.e. fill in the device copies of the pointers with the addresses of the device data arrays).
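A rough sketch of that approach might look like the following. This is only illustrative (ProcessImageACC_attach is a made-up name), assumes an OpenACC 2.6 or later compiler where enter data supports the attach clause, and I have not benchmarked it:

void ProcessImageACC_attach(Image **data, int filterRad, Image **output) {
    Image *d = *data, *o = *output;
    int row = d->height;
    int col = d->width;

    // create the structs and their data arrays on the device
    #pragma acc enter data copyin(d[0:1], o[0:1])
    #pragma acc enter data copyin(d->data[0:row*col]) create(o->data[0:row*col])
    // point the device structs' data members at the device arrays
    #pragma acc enter data attach(d->data, o->data)

    #pragma acc kernels present(d[0:1], o[0:1])
    {
        #pragma acc loop independent
        for (int j = 0; j < row; j++) {
            #pragma acc loop independent
            for (int i = 0; i < col; i++) {
                o->data[j * col + i] = d->data[j * col + i];
            }
        }
    }

    #pragma acc exit data copyout(o->data[0:row*col])
    #pragma acc exit data delete(d->data[0:row*col], d[0:1], o[0:1])
}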
Though an easier option is to create temp arrays to point to the data, and then copy the data to the device. This will also increase the performance of your code (both on the GPU and CPU) since it eliminates the extra levels of indirection.
void ProcessImageACC(Image **data, int filterRad, Image **output) {
    int row = (*data)->height;
    int col = (*data)->width;
    unsigned char *ddata, *odata;

    odata = (*output)->data;
    ddata = (*data)->data;

    #pragma acc data copyin(ddata[0:row * col]) copyout(odata[0:row * col])
    #pragma acc kernels
    {
        #pragma acc loop independent
        for (int j = 0; j < row; j++) {
            #pragma acc loop independent
            for (int i = 0; i < col; i++) {
                odata[j * col + i] = ddata[j * col + i];
            }
        }
    }
}
Note that scalars are firstprivate by default so there's no need to add the row, col, and filterRad variables in the data clause.

OpenCL: type casting on GPU

I store my data in a char array, and I need to read float and int variables from there.
This code works fine on CPU:
global float *p;
p = (global float*)get_pointer_to_the_field(char_array, index);
*p += 10;
But on GPU I get the error -5: CL_OUT_OF_RESOURCES. The reading itself works, but doing something with the value (adding 10 in this case) causes the error. How could I fix it?
Update:
This works on GPU:
float f = *p;
f += 10;
However, I still can't write this value back to the array.
Here is the kernel:
void write_value(global char *data, int tuple_pos, global char *field_value,
                 int which_field, global int offsets[], global int *num_of_attributes) {
    int tuple_size = offsets[*num_of_attributes];
    global char *offset = data + tuple_pos * tuple_size;
    offset += offsets[which_field];
    memcpy(offset, field_value, (offsets[which_field+1] - offsets[which_field]));
}

global char *read_value(global char *data, int tuple_pos,
                        int which_field, global int offsets[], global int *num_of_attributes) {
    int tuple_size = offsets[*num_of_attributes];
    global char *offset = data + tuple_pos * tuple_size;
    offset += offsets[which_field];
    return offset;
}

kernel void update_single_value(global char* input_data, global int* pos, global int offsets[],
                                global int *num_of_attributes, global char* types) {
    int g_id = get_global_id(1);
    int attr_id = get_global_id(0);
    int index = pos[g_id];

    if (types[attr_id] == 'f') {        // if float
        global float *p;
        p = (global float*)read_value(input_data, index, attr_id, offsets, num_of_attributes);
        float f = *p;
        f += 10;
        //*p += 10; // not working on GPU
    }
    else if (types[attr_id] == 'i') {   // if int
        global int *p;
        p = (global int*)read_value(input_data, index, attr_id, offsets, num_of_attributes);
        int i = *p;
        i += 10;
        //*p += 10;
    }
    else {                              // if char
        write_value(input_data, index, read_value(input_data, index, attr_id, offsets, num_of_attributes), attr_id, offsets, num_of_attributes);
    }
}
It updates the values of a table's tuples: int and float fields are increased by 10, and char fields are just replaced with the same content.
Are you enabling the byte_addressable_store extension? As far as I'm aware, bytewise writes to global memory aren't well-defined in OpenCL unless you enable this. (You'll need to check if the extension is supported by your implementation.)
You might also want to consider using the "correct" type in the kernel argument - this might help the compiler produce more efficient code. If the type can vary dynamically, you could perhaps try using a union type (or union fields in a struct type), although I haven't tested this with OpenCL myself.
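For illustration, a minimal sketch of both suggestions. The pragma is the standard way to request the extension inside the kernel source; the field_t union and the add_ten kernel are my own simplified illustration and assume every field is exactly 4 bytes, which is not the poster's actual tuple layout:

#pragma OPENCL EXTENSION cl_khr_byte_addressable_store : enable

// view the same 4 bytes as raw chars, an int, or a float
typedef union {
    char  c[4];
    int   i;
    float f;
} field_t;

kernel void add_ten(global field_t *fields, global const char *types)
{
    int gid = get_global_id(0);
    if (types[gid] == 'f')
        fields[gid].f += 10.0f;
    else if (types[gid] == 'i')
        fields[gid].i += 10;
}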
It turned out that the problem occurs because the int and float values in the char array aren't 4-byte aligned. When I write to addresses like
offset = data + tuple_pos*4; // or 8, 16 etc
everything works fine. However, the following causes the error:
offset = data + tuple_pos*3; // or any other number not divisible by 4
This means that either I should change the whole design and store the values some other way, or add "empty" padding bytes to the char array to make the int and float values 4-byte aligned (which isn't a really good solution).
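For what it's worth, the padding could be done when the offsets table is built on the host, e.g. by rounding every field offset up to the next multiple of 4 (align4 and build_offsets are illustrative helpers, not code from the project):

int align4(int offset)
{
    return (offset + 3) & ~3;   // round up to the next multiple of 4
}

void build_offsets(const int *field_sizes, int nfields, int *offsets)
{
    int off = 0;
    for (int i = 0; i < nfields; i++) {
        off = align4(off);        // insert "empty" bytes before each field
        offsets[i] = off;
        off += field_sizes[i];
    }
    offsets[nfields] = align4(off);   // padded tuple size
}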

Copy Array of pointers inside a struct using CUDA

I wish to copy an array of pointers from one struct to another. The struct looks like this:
typedef struct COORD3D
{
    int x, y, z;
} COORD3D;

typedef struct structName
{
    double *volume;
    COORD3D size;
    // .. some other vars
} structName;
I wish to do this inside a function where I pass in the address of an empty instance of the struct and the address of the struct with the data I wish to copy. Currently I do this serially via:
void foo(structName *dest, structName *source)
{
    // .. some other work
    int size = source->size.x * source->size.y * source->size.z;

    dest->volume = (double*)malloc(size*sizeof(double));

    int i;
    for(i=0; i<size; i++)
        dest->volume[i] = source->volume[i];
}
I want to do this in CUDA to speed up the process, as the array is very large (~12 million elements).
I have tried the following; however, although the code compiles and runs, I get incorrect results stored in the array (they appear to be very large random numbers):
void foo(structName *dest, structName *source)
{
    // .. some other work
    int size = source->size.x * source->size.y * source->size.z;

    dest->volume = (double*)malloc(size*sizeof(double));

    // Device Pointers
    double *DEVICE_SOURCE, *DEVICE_DEST;

    // Declare memory on GPU
    cudaMalloc(&DEVICE_DEST,size);
    cudaMalloc(&DEVICE_SOURCE,size);

    // Copy Source to GPU
    cudaMemcpy(DEVICE_SOURCE,source->volume,size,
               cudaMemcpyHostToDevice);

    // Setup Blocks/Grids
    dim3 dimGrid(ceil(source->size.x/10.0),
                 ceil(source->size.y/10.0),
                 ceil(source->size.z/10.0));
    dim3 dimBlock(10,10,10);

    // Run CUDA Kernel
    copyVol<<<dimGrid,dimBlock>>> (DEVICE_SOURCE,
                                   DEVICE_DEST,
                                   source->size.x,
                                   source->size.y,
                                   source->size.z);

    // Copy Constructed Array back to Host
    cudaMemcpy(dest->volume,DEVICE_DEST,size,
               cudaMemcpyDeviceToHost);
}
The Kernel looks like this:
__global__ void copyVol(double *source, double *dest,
                        int x, int y, int z)
{
    int posX = blockIdx.x * blockDim.x + threadIdx.x;
    int posY = blockIdx.y * blockDim.y + threadIdx.y;
    int posZ = blockIdx.z * blockDim.z + threadIdx.z;

    if (posX < x && posY < y && posZ < z)
    {
        dest[posX+(posY*x)+(posZ*y*x)] =
            source[posX+(posY*x)+(posZ*y*x)];
    }
}
Can anyone tell me where I am going wrong?
I am risking a wrong answer, but have you left out the size of the data type?
cudaMalloc(&DEVICE_DEST,size);
should be
cudaMalloc(&DEVICE_DEST,size*sizeof(double));
Also
cudaMemcpy(DEVICE_SOURCE,source->volume,size, cudaMemcpyHostToDevice);
should be
cudaMemcpy(DEVICE_SOURCE,source->volume,size*sizeof(double), cudaMemcpyHostToDevice);
and so on.
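Spelled out for the remaining calls in the question, that would be:

cudaMalloc(&DEVICE_SOURCE,size*sizeof(double));

and

cudaMemcpy(dest->volume,DEVICE_DEST,size*sizeof(double),
           cudaMemcpyDeviceToHost);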

CUDA speedup for simple calculations

I have the following code in cuda_computation.cu
#include <iostream>
#include <stdio.h>
#include <cuda.h>
#include <assert.h>

void checkCUDAError(const char *msg);

__global__ void euclid_kernel(float *x, float* y, float* f)
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    int i = blockIdx.x;
    int j = threadIdx.x;
    f[idx] = sqrt((x[i]-x[j])*(x[i]-x[j]) + (y[i]-y[j])*(y[i]-y[j]));
}

int main()
{
    float *xh;
    float *yh;
    float *fh;

    float *xd;
    float *yd;
    float *fd;

    size_t n = 256;
    size_t numBlocks = n;
    size_t numThreadsPerBlock = n;
    size_t memSize = numBlocks * numThreadsPerBlock * sizeof(float);

    xh = (float *) malloc(n * sizeof(float));
    yh = (float *) malloc(n * sizeof(float));
    fh = (float *) malloc(memSize);

    for(int ii(0); ii!=n; ++ii)
    {
        xh[ii] = ii;
        yh[ii] = ii;
    }

    cudaMalloc( (void **) &xd, n * sizeof(float) );
    cudaMalloc( (void **) &yd, n * sizeof(float) );
    cudaMalloc( (void **) &fd, memSize );

    for(int run(0); run!=10000; ++run)
    {
        //change value to avoid optimizations
        xh[0] = ((float)run)/10000.0;

        cudaMemcpy( xd, xh, n * sizeof(float), cudaMemcpyHostToDevice );
        checkCUDAError("cudaMemcpy");
        cudaMemcpy( yd, yh, n * sizeof(float), cudaMemcpyHostToDevice );
        checkCUDAError("cudaMemcpy");

        dim3 dimGrid(numBlocks);
        dim3 dimBlock(numThreadsPerBlock);
        euclid_kernel<<< dimGrid, dimBlock >>>( xd, yd, fd );
        cudaThreadSynchronize();
        checkCUDAError("kernel execution");

        cudaMemcpy( fh, fd, memSize, cudaMemcpyDeviceToHost );
        checkCUDAError("cudaMemcpy");
    }

    cudaFree(xd);
    cudaFree(yd);
    cudaFree(fd);

    free(xh);
    free(yh);
    free(fh);

    return 0;
}

void checkCUDAError(const char *msg)
{
    cudaError_t err = cudaGetLastError();
    if( cudaSuccess != err)
    {
        fprintf(stderr, "Cuda error: %s: %s.\n", msg, cudaGetErrorString( err) );
        exit(-1);
    }
}
It takes about 6 s to run on an FX QUADRO 380, while the corresponding serial version using just one i7-870 core takes only about 3 s. Am I missing something? Is the code under-optimised in some way? Or is it just expected behaviour that, for simple calculations like this all-pairs Euclidean distance, the overhead of moving memory exceeds the computational gain?
I think you are being killed by the time to move the data.
Especially since you are calling the CUDA kernel with individual values, it might be quicker to upload a large set of values as a 1D array and operate on them.
Also sqrt isn't done in HW on CUDA (at least not on my GPU), whereas the CPU has optimized FPU HW for this and is probably 10x faster than the GPU, and for a small job like this it is probably keeping all the results in cache between the timing runs.
Reduce your global memory reads since they are expensive.
You have 4 global memory reads per thread which can be reduced to 2 using shared memory.
__global__ void euclid_kernel(const float *inX_g, const float *inY_g, float *outF_g)
{
    const unsigned int threadId = blockIdx.x * blockDim.x + threadIdx.x;

    __shared__ float xBlock_s;
    __shared__ float yBlock_s;

    if(threadIdx.x == 0)
    {
        xBlock_s = inX_g[blockIdx.x];
        yBlock_s = inY_g[blockIdx.x];
    }
    __syncthreads();

    float xSub = xBlock_s - inX_g[threadIdx.x];
    float ySub = yBlock_s - inY_g[threadIdx.x];

    outF_g[threadId] = sqrt(xSub * xSub + ySub * ySub);
}
You should also test with different block sizes (as long as you keep 100% occupancy).
You are splitting the problem so that each block is responsible for a single i vs all 256 j's. This is bad locality, as those 256 j's have to be reloaded for every block, for a total of 2*256*(256 + 1) loads. Instead, split your grid so that each block is responsible for a range of, say, 16 i's and 16 j's, which is still 256 blocks*256 threads. But each block now loads only 2*(16+16) values, for a total of 2*256*32 loads. The idea is to reuse each loaded value as many times as possible. This may not have a huge impact with 256x256, but it becomes more and more important as the size scales.
This optimization is used for efficient matrix multiplies, which have a similar locality problem. See http://en.wikipedia.org/wiki/Loop_tiling, or google for "optimized matrix multiply" for more details. And perhaps the matrix multiplication kernel in the NVIDIA SDK gives some details and ideas.
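A minimal sketch of what such a tiling could look like for this kernel follows; TILE, the 2D launch, and the kernel name are my own choices rather than code from the question, and it assumes n is a multiple of TILE (as it is for n = 256):

#define TILE 16

// each block handles a TILE x TILE tile of the (i, j) pair matrix,
// loading only 2*(TILE + TILE) coordinates into shared memory
__global__ void euclid_tiled(const float *x, const float *y, float *f, int n)
{
    __shared__ float xi[TILE], yi[TILE], xj[TILE], yj[TILE];

    int i0 = blockIdx.y * TILE;
    int j0 = blockIdx.x * TILE;
    int ti = threadIdx.y;
    int tj = threadIdx.x;

    // stage the two coordinate slices once per block
    if (ti == 0) { xj[tj] = x[j0 + tj]; yj[tj] = y[j0 + tj]; }
    if (tj == 0) { xi[ti] = x[i0 + ti]; yi[ti] = y[i0 + ti]; }
    __syncthreads();

    int i = i0 + ti;
    int j = j0 + tj;
    float dx = xi[ti] - xj[tj];
    float dy = yi[ti] - yj[tj];
    f[i * n + j] = sqrtf(dx * dx + dy * dy);   // same i*n + j layout as the original

    // launch with: dim3 grid(n/TILE, n/TILE), block(TILE, TILE);
    // euclid_tiled<<<grid, block>>>(xd, yd, fd, n);
}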
