When I read the programming guide, I got the feeling that shared memory will always improve performance, but it seems that is not the case.
I have two functions:
#include <stdlib.h>

const int Ntimes=1;
__global__ void testgl(float *A, float *C, int numElements){
    int ti = threadIdx.x;
    int b0 = blockDim.x*blockIdx.x;
    if (b0+ti < numElements){
        for(int i=0;i<Ntimes;i++){
            A[b0+ti]=A[b0+ti]*A[b0+ti]*10-2*A[b0+ti]+1;
        }
        C[b0+ti] = A[b0+ti]*A[b0+ti];
    }
}

__global__ void testsh(float *A, float *C, int numElements){
    int ti = threadIdx.x;
    int b0 = blockDim.x*blockIdx.x;
    __shared__ float a[1024];
    if (b0+ti < numElements){
        a[ti]=A[b0+ti];
    }
    __syncthreads();
    if (b0+ti < numElements){
        for(int i=0;i<Ntimes;i++){
            a[ti]=a[ti]*a[ti]*10-2*a[ti]+1;
        }
        C[b0+ti] = a[ti]*a[ti];
    }
}
int main(void){
    int numElements = 500000;
    size_t size = numElements * sizeof(float);

    // Allocate the host input
    float *h_A = (float *)malloc(size);
    float *h_B = (float *)malloc(size);

    // Allocate the host output
    float *h_C = (float *)malloc(size);
    float *h_D = (float *)malloc(size);

    // Initialize the host input
    for (int i = 0; i < numElements; i++){
        h_A[i] = rand()/(float)RAND_MAX;
        h_B[i] = h_A[i];
    }

    // Allocate the device input and output
    float *d_A = NULL; cudaMalloc((void **)&d_A, size);
    float *d_B = NULL; cudaMalloc((void **)&d_B, size);
    float *d_C = NULL; cudaMalloc((void **)&d_C, size);
    float *d_D = NULL; cudaMalloc((void **)&d_D, size);

    // Copy to device
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Launch the two test kernels
    int threadsPerBlock = 1024;
    int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;
    testgl<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_C, numElements);
    testsh<<<blocksPerGrid, threadsPerBlock>>>(d_B, d_D, numElements);

    // Copy the device result to the host
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    cudaMemcpy(h_D, d_D, size, cudaMemcpyDeviceToHost);

    // Free device global memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    cudaFree(d_D);

    // Free host memory
    free(h_A);
    free(h_B);
    free(h_C);
    free(h_D);

    // Reset the device and exit
    cudaDeviceReset();
    return 0;
}
If Ntimes is set to be 1, testgl costs 49us, and testsh costs 97us.
If Ntimes is set to be 100, testgl costs 9.7ms, and testsh costs 8.9ms.
I do not know why it takes more than 100 times longer.
So it seems shared memory helps only when we want to do a lot of work on the device, is that right?
The card used here is a GTX 680.
Thanks in advance.
shared memory will always improve the performance
That's not true. It depends on the algorithm. If you have perfectly coalesced memory access in the kernel and you access global memory only once, it may not help. But if you are implementing, say, a matrix multiplication where partial sums need to be held, then it will be useful.
It also helps if you access the same memory location more than once in the kernel, since shared memory latency is roughly 100 times lower than global memory latency (it is on-chip memory).
When you find that a kernel is bandwidth limited, that is a good time to consider whether there is scope for using shared memory to increase performance. It is also a good strategy to check the occupancy calculator to see whether the use of shared memory will affect occupancy.
shared memory helps only when we want to do a lot of things in device?
Partially, yes. Shared memory helps when we want to do a lot of work on the device.
In your case, since the above kernel accesses global memory more than once, it should help. It would be helpful if you could provide a complete reproducer to analyze, and to know the details of the card you are running on.
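To make the reuse point concrete, here is a minimal sketch (not from the original post) of a case where shared memory genuinely pays off: a 3-point moving average, where each input element is needed by up to three threads. Staging the block's data (plus a one-element halo on each side) in shared memory reduces the global reads to one per element. The kernel name and the fixed block size of 1024 are illustrative assumptions.

// Hypothetical example: 3-point moving average with shared-memory staging.
// Assumes blockDim.x <= 1024.
__global__ void avg3_shared(const float *in, float *out, int n)
{
    __shared__ float s[1024 + 2];                   // block data plus one halo element per side
    int g = blockIdx.x * blockDim.x + threadIdx.x;  // global index
    int l = threadIdx.x + 1;                        // local index, shifted for the left halo

    if (g < n)
        s[l] = in[g];                               // one global read per element
    if (threadIdx.x == 0)                           // left halo (zero padding at the boundary)
        s[0] = (g > 0) ? in[g - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1 || g == n - 1)  // right halo
        s[l + 1] = (g + 1 < n) ? in[g + 1] : 0.0f;
    __syncthreads();

    if (g < n)                                      // each shared value is read by up to three threads
        out[g] = (s[l - 1] + s[l] + s[l + 1]) / 3.0f;
}

In the question's testsh kernel, by contrast, each thread only ever touches its own element a[ti], so shared memory adds a copy and a __syncthreads() without any cross-thread reuse, which is likely part of why it is slower for Ntimes=1.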
Related
I have been trying to store a 2D array in texture memory and read from it via cudaBindTexture2D,
but the value returned is 0. I'm not sure whether this is the right use of cudaBindTexture2D and tex2D().
I made a pretty simple code to try it out:
#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>

texture<uint, cudaTextureType2D, cudaReadModeElementType> tex;

__global__
void texture2DTest(int *x){
    *x = tex2D(tex,0,0);
}

void initTable(int textureTable[][9]){
    int i=0;
    int j=0;
    for(i=0; i<10; i++){
        for(j=0; j<9; j++){
            textureTable[i][j]=0;
        }
    }
    textureTable[0][0] = 12;
}

int main (int argc, char ** argv){
    int textureTable[10][9];
    int *d_x;
    int x=2;
    size_t pitch;

    initTable(textureTable);

    cudaMalloc(&d_x, sizeof(int));
    cudaMemcpy(d_x, &x, sizeof(int), cudaMemcpyHostToDevice);

    cudaMallocPitch((void**)textureTable, &pitch, 9, 10);
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<uint>();
    cudaBindTexture2D(NULL, tex, textureTable, desc, 9, 10, pitch);

    texture2DTest<<<1,1>>>(d_x);
    cudaThreadSynchronize();

    cudaMemcpy(&x, d_x, sizeof(int), cudaMemcpyDeviceToHost);
    printf(" \n %d \n", x);

    cudaUnbindTexture(tex);
    return 0;
}
Thank you.
There are quite a few issues in the provided code.
The device memory allocation using cudaMallocPitch is totally broken. You are trying to allocate device memory into a 2D array which is already allocated on the host.
Doing so results in memory corruption and undefined behavior. A separate pointer variable is required for the device memory allocation, and the memory should be copied from host to device after allocation.
The third argument of cudaMallocPitch expects the width of the memory in bytes, not in elements.
Textures can only be bound to device memory, so cudaBindTexture2D expects a device memory pointer as input.
Fixing all of the above issues, your final main will look something like this:
int main (int argc, char ** argv)
{
    int textureTable[10][9];
    int *d_x;
    int x = 2;
    size_t pitch;

    initTable(textureTable);

    cudaMalloc(&d_x, sizeof(int));
    cudaMemcpy(d_x, &x, sizeof(int), cudaMemcpyHostToDevice);

    int* d_textureTable; //Device texture table

    //Allocate pitch linear memory to device texture table
    cudaMallocPitch((void**)&d_textureTable, &pitch, 9 * sizeof(int), 10);

    //Use Memcpy2D as the pitch of host and device memory may be different
    cudaMemcpy2D(d_textureTable, pitch, textureTable, 9 * sizeof(int), 9 * sizeof(int), 10, cudaMemcpyHostToDevice);

    cudaChannelFormatDesc desc = cudaCreateChannelDesc<uint>();
    cudaBindTexture2D(NULL, tex, d_textureTable, desc, 9, 10, pitch);

    texture2DTest<<<1,1>>>(d_x);
    cudaThreadSynchronize();

    cudaMemcpy(&x, d_x, sizeof(int), cudaMemcpyDeviceToHost);
    printf(" \n %d \n", x);

    cudaUnbindTexture(tex);

    //Don't forget to free the allocated memory
    cudaFree(d_textureTable);
    cudaFree(d_x);

    return 0;
}
I am writing a simple code for the addition of the elements of two matrices A and B; it is inspired by the example given in chapter 2 of the CUDA C Programming Guide.
#include <stdio.h>
#include <stdlib.h>

#define N 2

__global__ void MatAdd(int A[][N], int B[][N], int C[][N]){
    int i = threadIdx.x;
    int j = threadIdx.y;
    C[i][j] = A[i][j] + B[i][j];
}

int main(){
    int A[N][N] = {{1,2},{3,4}};
    int B[N][N] = {{5,6},{7,8}};
    int C[N][N] = {{0,0},{0,0}};
    int (*pA)[N], (*pB)[N], (*pC)[N];

    cudaMalloc((void**)&pA, (N*N)*sizeof(int));
    cudaMalloc((void**)&pB, (N*N)*sizeof(int));
    cudaMalloc((void**)&pC, (N*N)*sizeof(int));

    cudaMemcpy(pA, A, (N*N)*sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(pB, B, (N*N)*sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(pC, C, (N*N)*sizeof(int), cudaMemcpyHostToDevice);

    int numBlocks = 1;
    dim3 threadsPerBlock(N,N);
    MatAdd<<<numBlocks,threadsPerBlock>>>(A,B,C);

    cudaMemcpy(C, pC, (N*N)*sizeof(int), cudaMemcpyDeviceToHost);

    int i, j;
    printf("C = \n");
    for(i=0;i<N;i++){
        for(j=0;j<N;j++){
            printf("%d ", C[i][j]);
        }
        printf("\n");
    }

    cudaFree(pA);
    cudaFree(pB);
    cudaFree(pC);

    printf("\n");
    return 0;
}
When I run it I keep getting the initial matrix C = [0 0 ; 0 0] instead of the element-wise sum of the two matrices A and B. I have previously done another example on the addition of the elements of two arrays and it seemed to work fine; however, this time I don't know why it does not work.
I believe there's something wrong with the cudaMalloc command, but I don't really know what else it could be.
Any ideas?
Using MatAdd<<<numBlocks,threadsPerBlock>>>(pA,pB,pC); instead of MatAdd<<<numBlocks,threadsPerBlock>>>(A,B,C); solves the problem.
The reason is that A, B and C are allocated on the CPU, while pA, pB and pC are allocated on the GPU using cudaMalloc(). Once pA, pB and pC are allocated, the values are sent from the CPU to the GPU by cudaMemcpy(pA, A, (N*N)*sizeof(int), cudaMemcpyHostToDevice);
Then the addition is performed on the GPU, that is, with pA, pB and pC. To use printf, the result pC is sent from the GPU back to the CPU via cudaMemcpy(C, pC, (N*N)*sizeof(int), cudaMemcpyDeviceToHost);
Think of it as if the CPU cannot see pA and the GPU cannot see A.
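For reference, a minimal sketch of the corrected call site; the error check after the launch is an illustrative addition and was not part of the original answer:

// Corrected launch: pass the device pointers, not the host arrays.
MatAdd<<<numBlocks,threadsPerBlock>>>(pA, pB, pC);
// Optional: catch launch/configuration errors before copying the result back.
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
    printf("MatAdd launch failed: %s\n", cudaGetErrorString(err));
cudaMemcpy(C, pC, (N*N)*sizeof(int), cudaMemcpyDeviceToHost);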
So I have been stuck on this problem for a while. My struct looks like this:
typedef struct
{
    int _size;
    int _dim[DIMENSIONS];
    float *_data;
} matrix;
Now the problem for me is how to malloc and memcpy. This is how I'm doing it:
matrix * d_in;
matrix * d_out;
const int THREADS_BYTES = sizeof(int) + sizeof(int)*DIMENSIONS + sizeof(float)*h_A->_size;
cudaMalloc((void **) &d_in, THREADS_BYTES);
cudaMemcpy(d_in, h_A, THREADS_BYTES, cudaMemcpyHostToDevice);
EDIT: this is how I allocated h_A:
matrix A; // = (matrix*)malloc(sizeof(matrix));
A._dim[0] = 40;
A._dim[1] = 60;
A._size = A._dim[0]*A._dim[1];
A._data = (float*)malloc(A._size*sizeof(float));
matrix *h_A = &A;
Where h_A is a matrix I allocated. I call my kernel like this:
DeviceComp<<<gridSize, blockSize>>>(d_out, d_in);
However, in my kernel I cannot reach the data the struct points to; I can only access the dim array and the size variable.
This is a common problem. When you did the malloc operation on the host (for h_A->_data), you allocated host data, which is not accessible from the device.
This answer describes in some detail what is going on and how to fix it.
In your case, something like this should work:
matrix A; // = (matrix*)malloc(sizeof(matrix));
A._dim[0] = 40;
A._dim[1] = 60;
A._size = A._dim[0]*A._dim[1];
A._data = (float*)malloc(A._size*sizeof(float));
matrix *h_A = &A;
float *d_data;
cudaMalloc((void **) &d_data, A._size*sizeof(float));
matrix * d_in;
matrix * d_out;
const int THREADS_BYTES = sizeof(int) + sizeof(int)*DIMENSIONS + sizeof(float)*h_A->_size;
cudaMalloc((void **) &d_in, THREADS_BYTES);
cudaMemcpy(d_in, h_A, THREADS_BYTES, cudaMemcpyHostToDevice);
cudaMemcpy(&(d_in->_data), &d_data, sizeof(float *), cudaMemcpyHostToDevice);
Note that this doesn't actually copy the data area from the host copy of A to the device copy. It simply creates a device-accessible data area, equal in size to the host data area. If you also want to copy the data area, that will require another cudaMemcpy operation, using h_A->_data and d_data.
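For completeness, a one-line sketch of that extra copy, using the member names from the question:

// Copy the payload itself from the host data area to the device data area.
cudaMemcpy(d_data, h_A->_data, h_A->_size * sizeof(float), cudaMemcpyHostToDevice);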
I'm having a bit of trouble understanding how to send a 2D array to CUDA. I have a program that parses a large file with 30 data points on each line. I read about 10 rows at a time and then create a matrix for the lines and items (so in my example of 10 rows with 30 data points, it would be int list[10][30];). My goal is to send this array to my kernel and have each block process a row (I have gotten this to work perfectly in normal C, but CUDA has been a bit more challenging).
Here's what I'm doing so far, but no luck (note: sizeOfBuckets = rows, and sizeOfBucketsHoldings = items in a row... I know I should win an award for odd variable names):
int list[sizeOfBuckets][sizeOfBucketsHoldings]; //this is created at the start of the file and I can confirm it's filled with the correct data

#define sizeOfBuckets 10 //size of buckets before sending to process list
#define sizeOfBucketsHoldings 30

//Cuda part
//define device variables
int *dev_current_list[sizeOfBuckets][sizeOfBucketsHoldings];

//time to malloc the 2D array on device
size_t pitch;
cudaMallocPitch((int**)&dev_current_list, (size_t *)&pitch, sizeOfBucketsHoldings * sizeof(int), sizeOfBuckets);

//copy data from host to device
cudaMemcpy2D( dev_current_list, pitch, list, sizeOfBuckets * sizeof(int), sizeOfBuckets * sizeof(int), sizeOfBucketsHoldings * sizeof(int), cudaMemcpyHostToDevice );

process_list<<<count,1>>> (sizeOfBuckets, sizeOfBucketsHoldings, dev_current_list, pitch);

//free memory of device
cudaFree( dev_current_list );

__global__ void process_list(int sizeOfBuckets, int sizeOfBucketsHoldings, int *current_list, int pitch) {
    int tid = blockIdx.x;
    for (int r = 0; r < sizeOfBuckets; ++r) {
        int* row = (int*)((char*)current_list + r * pitch);
        for (int c = 0; c < sizeOfBucketsHoldings; ++c) {
            int element = row[c];
        }
    }
}
The error I'm getting is:
main.cu(266): error: argument of type "int *(*)[30]" is incompatible with parameter of type "int *"
1 error detected in the compilation of "/tmp/tmpxft_00003f32_00000000-4_main.cpp1.ii".
Line 266 is the kernel call process_list<<<count,1>>> (count, countListItem, dev_current_list, pitch); I think the problem is that I am trying to declare my array in my function as int *, but how else can I create it? In my pure C code, I use int current_list[num_of_rows][num_items_in_row], which works, but I can't get the same approach to work in CUDA.
My end goal is simple: I just want each block to process one row (sizeOfBuckets) and then loop through all items in that row (sizeOfBucketsHoldings). I originally just did a normal cudaMalloc and cudaMemcpy but it wasn't working, so I looked around and found out about cudaMallocPitch and cudaMemcpy2D (both of which were not in my CUDA by Example book) and I have been trying to study examples, but they seem to give me the same error (I'm currently reading the CUDA C Programming Guide; I found this idea on page 22, but still no luck). Any ideas, or suggestions of where to look?
Edit:
To test this, I just want to add the values in each row together (I copied the logic from the CUDA by Example array-addition example).
My kernel:
__global__ void process_list(int sizeOfBuckets, int sizeOfBucketsHoldings, int *current_list, size_t pitch, int *total) {
    //TODO: we need to flip the list as well
    int tid = blockIdx.x;
    for (int c = 0; c < sizeOfBucketsHoldings; ++c) {
        total[tid] = total + current_list[tid][c];
    }
}
Here's how I declare the total array in my main:
int *dev_total;
cudaMalloc( (void**)&dev_total, sizeOfBuckets * sizeof(int) );
You have some mistakes in your code.
When you copy the host array to the device, you should pass a one-dimensional host pointer. See the function signature.
You don't need to declare a static 2D array for the device memory. That just creates a static array in host memory; the device allocation should be re-created through a plain device pointer instead. Keep in mind it must be treated as one-dimensional (pitched) memory, too. See this function signature.
This example should help you with memory allocation:
#include <cstdio>

__global__ void process_list(int sizeOfBucketsHoldings, int* total, int* current_list, int pitch)
{
    int tid = blockIdx.x;
    total[tid] = 0;
    for (int c = 0; c < sizeOfBucketsHoldings; ++c)
    {
        total[tid] += *((int*)((char*)current_list + tid * pitch) + c);
    }
}

int main()
{
    size_t sizeOfBuckets = 10;
    size_t sizeOfBucketsHoldings = 30;
    size_t width = sizeOfBucketsHoldings * sizeof(int); // needs to be in bytes
    size_t height = sizeOfBuckets;

    int* list = new int [sizeOfBuckets * sizeOfBucketsHoldings]; // one dimensional
    for (int i = 0; i < sizeOfBuckets; i++)
        for (int j = 0; j < sizeOfBucketsHoldings; j++)
            list[i * sizeOfBucketsHoldings + j] = i;

    size_t pitch_h = sizeOfBucketsHoldings * sizeof(int); // always in bytes

    int* dev_current_list;
    size_t pitch_d;
    cudaMallocPitch((int**)&dev_current_list, &pitch_d, width, height);

    int *test;
    cudaMalloc((void**)&test, sizeOfBuckets * sizeof(int));
    int* h_test = new int[sizeOfBuckets];

    cudaMemcpy2D(dev_current_list, pitch_d, list, pitch_h, width, height, cudaMemcpyHostToDevice);

    process_list<<<10, 1>>>(sizeOfBucketsHoldings, test, dev_current_list, pitch_d);
    cudaDeviceSynchronize();

    cudaMemcpy(h_test, test, sizeOfBuckets * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < sizeOfBuckets; i++)
        printf("%d %d\n", i, h_test[i]);

    return 0;
}
To access your 2D array in a kernel you should use the pattern base_addr + y * pitch_d + x.
WARNING: the pitch is always in bytes, so you need to cast your pointer to a byte-sized type (char*) before adding the pitch.
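As a small illustration (the helper name is made up, not from the original answer), the device-side access then looks like this:

// pitch is in bytes, so step through rows via a char* before indexing columns.
__device__ int element_at(const int *base, size_t pitch, int y, int x)
{
    const int *row = (const int *)((const char *)base + (size_t)y * pitch);
    return row[x];
}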
I have the following code in cuda_computation.cu
#include <iostream>
#include <stdio.h>
#include <cuda.h>
#include <assert.h>

void checkCUDAError(const char *msg);

__global__ void euclid_kernel(float *x, float* y, float* f)
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    int i = blockIdx.x;
    int j = threadIdx.x;
    f[idx] = sqrt((x[i]-x[j])*(x[i]-x[j]) + (y[i]-y[j])*(y[i]-y[j]));
}

int main()
{
    float *xh;
    float *yh;
    float *fh;
    float *xd;
    float *yd;
    float *fd;

    size_t n = 256;
    size_t numBlocks = n;
    size_t numThreadsPerBlock = n;
    size_t memSize = numBlocks * numThreadsPerBlock * sizeof(float);

    xh = (float *) malloc(n * sizeof(float));
    yh = (float *) malloc(n * sizeof(float));
    fh = (float *) malloc(memSize);

    for(int ii(0); ii!=n; ++ii)
    {
        xh[ii] = ii;
        yh[ii] = ii;
    }

    cudaMalloc( (void **) &xd, n * sizeof(float) );
    cudaMalloc( (void **) &yd, n * sizeof(float) );
    cudaMalloc( (void **) &fd, memSize );

    for(int run(0); run!=10000; ++run)
    {
        //change value to avoid optimizations
        xh[0] = ((float)run)/10000.0;

        cudaMemcpy( xd, xh, n * sizeof(float), cudaMemcpyHostToDevice );
        checkCUDAError("cudaMemcpy");
        cudaMemcpy( yd, yh, n * sizeof(float), cudaMemcpyHostToDevice );
        checkCUDAError("cudaMemcpy");

        dim3 dimGrid(numBlocks);
        dim3 dimBlock(numThreadsPerBlock);

        euclid_kernel<<< dimGrid, dimBlock >>>( xd, yd, fd );
        cudaThreadSynchronize();
        checkCUDAError("kernel execution");

        cudaMemcpy( fh, fd, memSize, cudaMemcpyDeviceToHost );
        checkCUDAError("cudaMemcpy");
    }

    cudaFree(xd);
    cudaFree(yd);
    cudaFree(fd);

    free(xh);
    free(yh);
    free(fh);

    return 0;
}

void checkCUDAError(const char *msg)
{
    cudaError_t err = cudaGetLastError();
    if( cudaSuccess != err)
    {
        fprintf(stderr, "Cuda error: %s: %s.\n", msg, cudaGetErrorString( err) );
        exit(-1);
    }
}
It takes about 6 seconds to run on an FX Quadro 380, while the corresponding serial version using just one i7-870 core takes about 3 seconds. Am I missing something? Is the code under-optimized in some way? Or is it just expected behaviour that for simple calculations (like this all-pairs Euclidean distance) the overhead of moving memory exceeds the computational gain?
I think you are being killed by the time to move the data.
Especially since you are calling the CUDA kernel on individual values, it might be quicker to upload a large set of values as a 1D array and operate on all of them at once.
Also, sqrt isn't done in hardware on CUDA (at least not on my GPU), whereas the CPU has optimized FPU hardware for this and is probably 10x faster than the GPU; for a small job like this, the CPU is also probably keeping all the results in cache between the timing runs.
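One way to check whether the transfers dominate is to time the copies and the kernel separately with CUDA events. This is only a sketch, reusing the variable names from the question:

// Illustrative timing of one iteration: separates transfer cost from compute cost.
cudaEvent_t evStart, evStop;
float msCopy = 0.0f, msKernel = 0.0f;
cudaEventCreate(&evStart);
cudaEventCreate(&evStop);

cudaEventRecord(evStart);
cudaMemcpy(xd, xh, n * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(yd, yh, n * sizeof(float), cudaMemcpyHostToDevice);
cudaEventRecord(evStop);
cudaEventSynchronize(evStop);
cudaEventElapsedTime(&msCopy, evStart, evStop);

cudaEventRecord(evStart);
euclid_kernel<<<dimGrid, dimBlock>>>(xd, yd, fd);
cudaEventRecord(evStop);
cudaEventSynchronize(evStop);
cudaEventElapsedTime(&msKernel, evStart, evStop);

printf("copies: %.3f ms, kernel: %.3f ms\n", msCopy, msKernel);

cudaEventDestroy(evStart);
cudaEventDestroy(evStop);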
Reduce your global memory reads since they are expensive.
You have 4 global memory reads per thread which can be reduced to 2 using shared memory.
__global__ void euclid_kernel(const float * inX_g, const float* inY_g, float * outF_g)
{
    const unsigned int threadId = blockIdx.x * blockDim.x + threadIdx.x;

    __shared__ float xBlock_s;
    __shared__ float yBlock_s;

    if(threadIdx.x == 0)
    {
        xBlock_s = inX_g[blockIdx.x];
        yBlock_s = inY_g[blockIdx.x];
    }
    __syncthreads();

    float xSub = xBlock_s - inX_g[threadIdx.x];
    float ySub = yBlock_s - inY_g[threadIdx.x];

    outF_g[threadId] = sqrt(xSub * xSub + ySub * ySub);
}
You should also test different block sizes (as long as you keep 100% occupancy).
You are splitting the problem so that each block is responsible for a single i versus all 256 j's. This gives poor locality, as those 256 j's have to be reloaded for every block, for a total of 2*256*(256 + 1) loads. Instead, split your grid so that each block is responsible for a range of, say, 16 i's and 16 j's, which is still 256 blocks * 256 threads. But each block now loads only 2*(16+16) values, for a total of 2*256*32 loads. The idea is to reuse each loaded value as many times as possible. This may not have a huge impact with 256x256, but it becomes more and more important as the size scales; a sketch of such a tiled kernel is shown below.
This optimization is used for efficient matrix multiplies, which have a similar locality problem. See http://en.wikipedia.org/wiki/Loop_tiling, or search for "optimized matrix multiply" for more details. The matrix multiplication kernel in the NVIDIA SDK may also give some details and ideas.
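A hedged sketch of such a tiled kernel, assuming n = 256 points and a 16x16 block (the kernel name and indexing scheme are illustrative, not taken from the original answer):

// Each block handles a 16x16 tile of (i, j) pairs and loads only 16 x/y values per axis
// into shared memory, so every loaded coordinate is reused 16 times.
#define TILE 16
__global__ void euclid_tiled(const float *x, const float *y, float *f, int n)
{
    __shared__ float xi[TILE], yi[TILE], xj[TILE], yj[TILE];
    int i = blockIdx.y * TILE + threadIdx.y;     // row of the distance matrix
    int j = blockIdx.x * TILE + threadIdx.x;     // column of the distance matrix

    // Cooperative loads: 2*(16+16) global reads per block in total.
    if (threadIdx.y == 0 && blockIdx.y * TILE + threadIdx.x < n) {
        xi[threadIdx.x] = x[blockIdx.y * TILE + threadIdx.x];
        yi[threadIdx.x] = y[blockIdx.y * TILE + threadIdx.x];
    }
    if (threadIdx.y == 1 && blockIdx.x * TILE + threadIdx.x < n) {
        xj[threadIdx.x] = x[blockIdx.x * TILE + threadIdx.x];
        yj[threadIdx.x] = y[blockIdx.x * TILE + threadIdx.x];
    }
    __syncthreads();

    if (i < n && j < n) {
        float dx = xi[threadIdx.y] - xj[threadIdx.x];
        float dy = yi[threadIdx.y] - yj[threadIdx.x];
        f[i * n + j] = sqrtf(dx * dx + dy * dy);
    }
}

// Launch, for example: dim3 block(TILE, TILE); dim3 grid(n / TILE, n / TILE); with n = 256.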