How to generalize square matrix multiplication to handle arbitrary dimensions - C

I have written this program and I am having some trouble understanding how to use multiple blocks by using a dim3 variable in the kernel call line. This code works fine when I am doing 1000*1000 matrix multiplication, but I am not getting the correct answer for lower dimensions like 100*100 or 200*200.
#include <stdio.h>
#include <cuda.h>

#define width 1000

__global__ void kernel(int *a, int *b, int *c)
{
    int tx = threadIdx.x + blockIdx.x*blockDim.x;
    int ty = threadIdx.y + blockIdx.y*blockDim.y;
    int sum = 0, k;
    for (k = 0; k < width; ++k)
        sum += a[ty*width + k]*b[k*width + tx];
    c[ty*width + tx] = sum;
}

int main()
{
    int a[width*width], c[width*width], b[width*width];
    int *dev_a, *dev_b, *dev_c;
    int i, count = 0;
    int size = (width*width)*sizeof(int);
    for (i = 0; i < width*width; i++) {
        a[i] = 1;
        b[i] = 1;
    }
    cudaMalloc((void **)&dev_a, size);
    cudaMalloc((void **)&dev_b, size);
    cudaMalloc((void **)&dev_c, size);
    cudaMemcpy(dev_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, size, cudaMemcpyHostToDevice);
    dim3 dimBlock(20, 20);
    dim3 blockID(50, 50);
    kernel<<<blockID, dimBlock>>>(dev_a, dev_b, dev_c);
    cudaMemcpy(c, dev_c, size, cudaMemcpyDeviceToHost);
    for (i = 0; i < width*width; i++) {
        count++;
        if (count == width + 1) {
            count = 1;
            printf("\n");
        }
        printf("%d ", c[i]);
    }
    printf("\n");
    return 0;
}

This code will work for very specific dimensions but not for others.
It will work for square matrix multiplication when width is exactly equal to the product of your block dimension (number of threads - 20 in the code you have shown) and your grid dimension (number of blocks - 50 in the code you have shown).
So when width is 20*50 (1000) it will work as shown. But if you change width to some other value (say 800) and make no other changes, the code won't work. In the case of 800, however, I could get your code working by changing the grid dimension from 50 to 40, so that width = 800 = 20*40.
But what if you need to multiply two matrices of width 799? There is no product of grid and block dimensions that will match that width exactly.
This is a fairly standard problem in CUDA programming - we cannot always come up with convenient block and grid dimensions to exactly match the work (i.e. data) size, and if we launch too many threads (or blocks of threads), things don't seem to work.
To fix this problem we must do 2 things:
Be sure to launch at least enough, but maybe more than enough threads (blocks of threads) to cover the entire data set
Add conditional code in the kernel, so that only the threads corresponding to valid data do any real work.
To address item 1 above, we modify our grid dimension calculations to something like this:
dim3 dimBlock(16,16);
dim3 blockID((width+dimBlock.x-1)/dimBlock.x,(width+dimBlock.y-1)/dimBlock.y);
This "round up" grid computation launches at least enough blocks to cover the data even when width doesn't divide evenly: for width = 799 and a 16-wide block, (799+15)/16 = 50 blocks per dimension, i.e. 800 threads, slightly more than needed. To address item 2 above we modify our kernel code to condition thread behavior on whether or not the thread corresponds to valid data:
__global__ void kernel(int *a, int *b, int *c, int mwidth)
{
    int tx = threadIdx.x + blockIdx.x*blockDim.x;
    int ty = threadIdx.y + blockIdx.y*blockDim.y;
    if ((tx < mwidth) && (ty < mwidth)) {
        int sum = 0, k;
        for (k = 0; k < mwidth; ++k)
            sum += a[ty*mwidth + k]*b[k*mwidth + tx];
        c[ty*mwidth + tx] = sum;
    }
}
And since we've modified the kernel with a new parameter, we have to pass that parameter on invocation:
kernel<<<blockID,dimBlock>>>(dev_a,dev_b,dev_c, width);
That should be what is needed to logically extend the code you have shown to handle "arbitrary" dimensions. I would also suggest adding proper cuda error checking any time you are having trouble with a CUDA code.
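A minimal sketch of such error checking (one common pattern; the macro name here is just illustrative):

#include <stdio.h>
#include <stdlib.h>

#define cudaCheck(call) do { \
    cudaError_t err = (call); \
    if (err != cudaSuccess) { \
        fprintf(stderr, "CUDA error: %s at %s:%d\n", \
                cudaGetErrorString(err), __FILE__, __LINE__); \
        exit(1); \
    } \
} while (0)

/* usage: wrap every runtime call, and check kernel launches afterwards */
cudaCheck(cudaMemcpy(dev_a, a, size, cudaMemcpyHostToDevice));
kernel<<<blockID, dimBlock>>>(dev_a, dev_b, dev_c, width);
cudaCheck(cudaGetLastError());      /* catches launch-configuration errors */
cudaCheck(cudaDeviceSynchronize()); /* catches errors during kernel execution */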


Calculating potential energy in a molecular dynamics simulation

Background
Imagine that we have N particles inside a box of length L, which interact with each other through a Lennard-Jones potential.
I want to compute the total potential energy of the system. I implemented the function POT which calculates all the contributions from all the particles and gives the correct results (this is tested and can be assumed true).
I also wrote a function POT_ONE which only calculates the potential energy of one particle with respect to all the others. This means that if I want to calculate the total potential energy I will have to call this function N times (making sure that the particle does not interact with itself) and then divide by 2 since I double count the interactions.
Goal
My goal is to make the second function yield the same results as the first one.
Problem
There is something really strange going on: if I put 4 particles, the two functions give the same results. If I put a fifth one, there is a deviation. Then for 6, 7, 8 particles it again gives correct results, and then for N=9 I get a different result. In the case N=1000 the result I get from POT_ONE is something like 113383820348202024.
My results for N=5 are:
-0.003911 with POT and
12.864234 with POT_ONE
In case someone tries to run the code and wants to check the N=4 case, they should change the number of particles (np), which is defined as a global variable, and then comment out the line pos[12]=1;pos[13]=1;pos[14]=1;.
Code
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

/* GENERAL PARAMETERS */
int dim = 3;            // number of dimensions
int np = 5;             // number of particles
double L = 36.413;      // box length (A)
double invL = 1/36.413; // inverse of box length

/* ARGON CHARACTERISTICS */
double sig = 3.4;  // Angstroms (A)
double e = 0.001;  // eV

double distSQ(double array[]){
    /* calculates the squared distance given the array x=[dx,dy,dz] */
    int i;
    double r2 = 0;
    for (i = 0; i < dim; i++) r2 += array[i]*array[i];
    return r2;
}//distSQ

void MIC(double dr[], double L, int dim){
    /* MINIMUM IMAGE CONVENTION: dr[] is the array dr = [dx,dy,dz] describing relative
       positions of two particles, L is the box length, dim the number of dimensions */
    int i;
    for (i = 0; i < dim; i++) dr[i] -= round(dr[i]*invL)*L;
}//MIC

void POT(double x1[], double* potential){
    /* given the positions of each particle in the form x=[x0,y0,z0;x1,y1,z1;...;xn-1,yn-1,zn-1],
       the number of dimensions dim and particles np, it calculates the potential
       energy of the configuration */
    //variables for potential calculation
    int i, j, k;
    double *x2;
    double r2inv;  // 1/r^2
    double foo, bar;
    double dr[dim];
    *potential = 0;  // set potential energy to zero
    //main part of POT
    for (i = 0; i < np-1; i++){
        x2 = x1 + dim;
        for (j = i+1; j < np; j++){
            //calculate relative distances between particles i & j,
            //apply periodic BCs, then calculate the squared distance
            //and the potential energy between them.
            for (k = 0; k < dim; k++) dr[k] = x2[k] - x1[k];
            MIC(dr, L, dim);  //periodic boundary conditions
            r2inv = 1/distSQ(dr);
            //calculate potential energy
            foo = sig*sig*r2inv;
            bar = foo*foo*foo;
            *potential += bar*(bar-1);
        }//for j
        x1 += dim;
    }//for i
    *potential *= 4*e;  //scale and give energy units
}//POT

void POT_ONE(int particle, double pos[], double* potential){
    *potential = 0;
    int i, k;
    double dr[dim];
    double r2inv, foo, bar;
    double par_pos[dim];
    int index = particle*dim;
    par_pos[0] = pos[index];
    par_pos[1] = pos[index+1];
    par_pos[2] = pos[index+2];
    for (i = 0; i < np; i++){
        if (i != particle){
            for (k = 0; k < dim; k++) dr[k] = pos[k] - par_pos[k];
            MIC(dr, L, dim);
            r2inv = 1/distSQ(dr);
            foo = sig*sig*r2inv;
            bar = foo*foo*foo;
            *potential += bar*(bar-1);
        }
        pos += dim;
    }
    *potential *= 4*e;  //scale and give energy units
}//POT_ONE

int main(){
    int D = np*dim;
    double* pos = malloc(D*sizeof(double));
    double potential = 0;  // calculated with POT
    double U = 0;          // calculated with POT_ONE
    double tempU = 0;
    pos[0] = 0;  pos[1] = 0;  pos[2] = 0;
    pos[3] = 4;  pos[4] = 0;  pos[5] = 0;
    pos[6] = 0;  pos[7] = 4;  pos[8] = 0;
    pos[9] = 0;  pos[10] = 0; pos[11] = 4;
    pos[12] = 1; pos[13] = 1; pos[14] = 1;
    POT(pos, &potential);
    printf("POT: %f\n", potential);
    int i;
    for (i = 0; i < np; i++){
        POT_ONE(i, pos, &tempU);
        U += tempU;
    }
    U = U/2;
    printf("POT_ONE: %f\n\n", U);
    return 0;
}
Your error is in POT, where you forgot to update x2 at the end of the inner loop.
for (i = 0; i < np - 1; i++) {
double *x2 = x1 + dim;
for (j = i + 1; j < np; j++) {
// ... calculate stuff ..
x2 += dim;
}
x1 += dim;
}
An easier and arguably more readable variant is to forgo pointer arithmetic altogether and use boring old indices:
for (k = 0; k < dim; k++) {
dr[k] = x[j * dim + k] - x[i * dim + k];
}
Further observations:
Please make your variables local to the scope where they are used. A large list of uninitialized variables at the top of the function makes it very hard to track variables, even in a short function like yours.
Please consider returning single values from functions instead of passing in pointers. In my opinion, that makes functions like the square of the distance more readable.
The structure of your code is hard to see, because everything is run together very tightly, even the comments.
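Combining the last two suggestions with the bug fix, POT could return the energy directly; a sketch using the globals from the question (untested):

double POT(const double x[]){
    int i, j, k;
    double potential = 0;
    double dr[3];  /* dim == 3 here */
    for (i = 0; i < np-1; i++){
        for (j = i+1; j < np; j++){
            /* index-based access: no x2 pointer to forget to update */
            for (k = 0; k < dim; k++) dr[k] = x[j*dim + k] - x[i*dim + k];
            MIC(dr, L, dim);
            double r2inv = 1/distSQ(dr);
            double foo = sig*sig*r2inv;
            double bar = foo*foo*foo;
            potential += bar*(bar-1);
        }
    }
    return 4*e*potential;  /* scale and give energy units */
}

main would then simply do potential = POT(pos);.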

How to avoid the error of AVX2 when the matrix dimension isn't a multiple of 4?

I made a matrix-vector multiplication program using AVX2 and FMA in C. I compiled it using GCC 7 with -mfma and -mavx.
However, I got the error "incorrect checksum for freed object - object was probably modified after being freed."
I think the error occurs when the matrix dimension isn't a multiple of 4.
I know AVX2 uses ymm registers that hold 4 double-precision floating-point numbers, so I can use AVX2 without error when the matrix dimension is a multiple of 4.
But here is my question:
How can I use AVX2 efficiently when the matrix dimension isn't a multiple of 4?
Here is my code.
#include "stdio.h"
#include "math.h"
#include "stdlib.h"
#include "time.h"
#include "x86intrin.h"
void mv(double *a,double *b,double *c, int m, int n, int l)
{
__m256d va,vb,vc;
int k;
int i;
for (k = 0; k < l; k++) {
vb = _mm256_broadcast_sd(&b[k]);
for (i = 0; i < m; i+=4) {
va = _mm256_loadu_pd(&a[m*k+i]);
vc = _mm256_loadu_pd(&c[i]);
vc = _mm256_fmadd_pd(vc, va, vb);
_mm256_storeu_pd( &c[i], vc );
}
}
}
int main(int argc, char* argv[]) {
// set variables
int m;
double* a;
double* b;
double* c;
int i;
int temp=0;
struct timespec startTime, endTime;
m=9;
// main program
// set vector or matrix
a=(double *)malloc(sizeof(double) * m*m);
b=(double *)malloc(sizeof(double) * m*1);
c=(double *)malloc(sizeof(double) * m*1);
for (i=0;i<m;i++) {
a[i]=1;
b[i]=1;
c[i]=0.0;
}
for (i=m;i<m*m;i++) {
a[i]=1;
}
// check start time
clock_gettime(CLOCK_REALTIME, &startTime);
mv(a, b, c, m, 1, m);
// check end time
clock_gettime(CLOCK_REALTIME, &endTime);
free(a);
free(b);
free(c);
return 0;
}
You load and store vectors of 4 double, but your loop condition only checks that the first vector element is in-bounds, so you can write outside objects by up to 3x8 = 24 bytes when m is not a multiple of 4.
You need something like i < (m-3) in the main loop, and a cleanup strategy for handling the last partial vector of data. Vectorizing with SIMD is very much like unrolling: you have to check in the loop condition that it's OK to process multiple future elements.
A scalar cleanup loop works well, but we can do better. For example, do as many 128-bit vectors as possible after the last full 256-bit vector (i.e. up to 1), before going scalar.
In many cases (e.g. write-only destination) an unaligned vector load that ends at the end of your arrays is very good (when m>=4). It can overlap with your main loop if m%4 != 0, but that's fine because your output array doesn't overlap your inputs, so redoing an element as part of a single cleanup is cheaper than branching to avoid it.
But that doesn't work here, because your logic is c[i+0..3] += ..., so redoing an element would make it wrong.
// cleanup using a 128-bit FMA, then scalar if there's an odd element.
// untested
void mv(double *a, double *b, double *c, int m, int n, int l)
{
    /* the loop below should actually work for m=1..3, but a separate strategy might be good.
    if (m < 4) {
        // maybe check m >= 2 and use __m128 vectors?
        // or vectorize differently?
    }
    */
    for (int k = 0; k < l; k++) {
        __m256d vb = _mm256_broadcast_sd(&b[k]);
        int i;
        for (i = 0; i < (m-3); i += 4) {
            __m256d va = _mm256_loadu_pd(&a[m*k+i]);
            __m256d vc = _mm256_loadu_pd(&c[i]);
            vc = _mm256_fmadd_pd(va, vb, vc);   // fmadd(a,b,c) = a*b + c
            _mm256_storeu_pd(&c[i], vc);
        }
        if (i < (m-1)) {
            __m128d lasta = _mm_loadu_pd(&a[m*k+i]);
            __m128d lastc = _mm_loadu_pd(&c[i]);
            lastc = _mm_fmadd_pd(lasta, _mm256_castpd256_pd128(vb), lastc);
            _mm_storeu_pd(&c[i], lastc);
            // i+=2;  // last element only checks m odd/even, doesn't use i
        }
        // if (i < m)
        if (m & 1) {
            // odd number of elements, do the last non-vector one
            c[m-1] += a[m*k + m-1] * _mm256_cvtsd_f64(vb);
        }
    }
}
I haven't looked at exactly how gcc/clang -O3 compile that. Sometimes compilers try to get too smart with cleanup code (e.g. trying to auto-vectorize scalar cleanup loops).
Other strategies could include doing the last up-to-4 elements with an AVX masked store: you need the same mask for the end of every matrix row, so generating it once and then using it at the end of every row could be good. See Vectorizing with unaligned buffers: using VMASKMOVPS: generating a mask from a misalignment count? Or not using that insn at all. (To simplify branching, you'd set it up so your main loop only goes to i < (m-4), then you always run the cleanup. In the m%4 == 0 case, the mask is all-ones so you do the final full vector.) If you can't safely read past the end of the matrix, you probably need a masked load as well as masked store.
You could also look at aligning your rows for efficiency, or a row stride that's separate from the logical length of rows. (i.e. pad rows out to 32-byte boundaries). Leaving padding at the end of rows simplifies the cleanup, because you can always do whole vectors that write padding.
Special case m==2: instead of broadcasting one element from b[], you'd like to broadcast 2 elements into two 128-bit lanes of a __m256d, so one 256-bit FMA could do 2 rows at once.
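An untested sketch of that m==2 idea (my own assumption of how it might look: l is even, and a uses the same column-major a[m*k+i] layout as mv; an odd l would need one extra scalar column):

// m == 2: lane 0 handles column k, lane 1 handles column k+1
void mv2(double *a, double *b, double *c, int l)
{
    __m256d vacc = _mm256_setzero_pd();
    for (int k = 0; k < l; k += 2) {
        // vb2 = { b[k], b[k], b[k+1], b[k+1] }: one b element per 128-bit lane
        // (a 128-bit broadcast + in-lane shuffle would avoid the slower set_pd)
        __m256d vb2 = _mm256_set_pd(b[k+1], b[k+1], b[k], b[k]);
        __m256d va  = _mm256_loadu_pd(&a[2*k]);  // columns k and k+1, contiguous
        vacc = _mm256_fmadd_pd(va, vb2, vacc);   // fmadd(a,b,c) = a*b + c
    }
    // horizontal step: add the two lanes, then accumulate into c[0..1]
    __m128d lanes = _mm_add_pd(_mm256_castpd256_pd128(vacc),
                               _mm256_extractf128_pd(vacc, 1));
    _mm_storeu_pd(c, _mm_add_pd(_mm_loadu_pd(c), lanes));
}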

Merge sort using CUDA: efficient implementation for small input arrays

I have the following problem: given two sorted arrays A and B, I have to produce a sorted array C with the elements of A and B.
I found some solutions for this problem using CUDA, for example Merge Path: http://www.cc.gatech.edu/~bader/papers/GPUMergePath-ICS2012.pdf
However, their problem is driven by the size of the original arrays, at least 10k elements. I have a different problem.
The arrays I have to merge are much smaller (1000 elements at most) and the complexity is given by the number of merges to be done (on the order of 10^10: 10^5 arrays of size ~1000, merged with each other pairwise).
Part of their problem is to split the original arrays into equally sized parts that are processed in parallel. The arrays I have to merge are small enough to fit entirely in the GPU shared memory.
Thrust is not my first choice because the output of my procedure is not the sorted array itself, but a calculation with its elements: so I think that a specialized kernel should be faster than first sorting the element indices and then using them for the calculation.
A serial version of the algorithm looks like:
i=0
j=0
k=0
T=4
while i<N and j<M:
if A[i]<B[j]:
start_i = max(0,i-T)
C[k]=sum(A[start_i:i+1])
i+=1
else:
start_j = max(0,j-T)
C[k]=sum(B[start_j:j+1])
j+=1
k+=1
while i<N:
start_i = max(0,i-T)
C[k]=sum(A[start_i:i+1])
i+=1
k+=1
while j<M:
start_j = max(0,j-T)
C[k]=sum(B[start_j:j+1])
j+=1
k+=1
How can I exploit CUDA capabilities to solve this problem?
The two most important optimization goals for any CUDA program should be to:
expose (sufficient) parallelism
make efficient use of memory
There are certainly many other things that can be considered during optimization, but these are the two most important items to address first.
A merge operation (not quite the same as a merge-sort) is, at first glance, an inherently serial operation. We cannot make a proper decision about which item to choose, from either A or B input array, to place next in the output array C, until we have made all previous selections in C. In this respect, the merge algorithm (in this realization) makes it difficult to expose parallelism, and the paper linked in the question is almost entirely focused on that topic.
The goal of the algorithm described in the paper is to decompose the two input vectors A and B into multiple smaller pieces that can be worked on independently, so as to expose parallelism. In particular, the goal is to keep all the SMs in a GPU busy, and keep all the SP's in an SM busy. Once a sufficient decomposition of work is performed, each SP is ultimately performing a sequential merge (as mentioned in the paper):
Merging stage - Each core merges the two sub arrays that it has been given using the same algorithm as a simple sequential merge.
However, as you've pointed out, what you want to do is somewhat different. You already have many arrays, and you want to perform independent merge operations on those many arrays. Since your array count is ~100,000, this is enough independent pieces of work to consider mapping each to a GPU SP (i.e. thread). This means that we can then, just as in the paper, use a simple sequential merge on each core/SP/thread. So the problem of exposing parallelism is, in your case, already done (perhaps to a sufficient degree).
At this point we could consider implementing this as-is. The code I show later offers this as a starting point for comparison. However, what we discover is that the performance is not very good, because a merge algorithm fundamentally has a data-dependent access sequence, so it is (more) difficult to arrange for coalesced access on the GPU. The authors of the paper propose to mitigate this problem by first reading the data (in a coalesced fashion) into shared memory, and then having the algorithm work on it out of shared memory, where the penalty for disorganized access is less.
I'll propose a different approach:
arrange the sequential merge algorithm so that each element of A and B need only be read once
arrange the storage of A, B, and C in column-major form as opposed to the more "natural" row-major storage that one might consider. This is effectively transposing the storage matrices for A, B, and C vectors. This allows for an improvement in coalesced access, as the GPU threads navigate their way through the merging operation on their individual A and B vectors. It's far from perfect, but the improvement is substantial.
Here's a worked example that implements the above idea, running a simple sequential merge in each thread, and each thread merging one of the A vectors with one of the B vectors:
$ cat t784.cu
#include <stdio.h>
#include <stdlib.h>
#include <thrust/sort.h>
#include <thrust/merge.h>

#define NUM_SETS 100000
#define DSIZE 100
typedef int mytype;
// for ascending sorted data
#define cmp(A,B) ((A)<(B))
#define nTPB 512
#define nBLK 128

#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL

long long dtime_usec(unsigned long long start){
    timeval tv;
    gettimeofday(&tv, 0);
    return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

template <typename T>
__host__ __device__ void smerge(const T * __restrict__ a, const T * __restrict__ b, T * __restrict__ c,
                                const unsigned len_a, const unsigned len_b,
                                const unsigned stride_a = 1, const unsigned stride_b = 1,
                                const unsigned stride_c = 1){
    unsigned len_c = len_a+len_b;
    unsigned nc = 0;
    unsigned na = 0;
    unsigned nb = 0;
    unsigned fa = (len_b == 0);
    unsigned fb = (len_a == 0);
    T nxta = a[0];
    T nxtb = b[0];
    while (nc < len_c){
        if (fa)      { c[stride_c*nc++] = nxta; na++; nxta = a[stride_a*na]; }
        else if (fb) { c[stride_c*nc++] = nxtb; nb++; nxtb = b[stride_b*nb]; }
        else if (cmp(nxta,nxtb)){
            c[stride_c*nc++] = nxta;
            na++;
            if (na == len_a) fb++;
            else nxta = a[stride_a*na];
        }
        else {
            c[stride_c*nc++] = nxtb;
            nb++;
            if (nb == len_b) fa++;
            else nxtb = b[stride_b*nb];
        }
    }
}

template <typename T>
__global__ void rmtest(const T * __restrict__ a, const T * __restrict__ b, T * __restrict__ c, int num_arr, int len){
    int idx = threadIdx.x+blockDim.x*blockIdx.x;
    while (idx < num_arr){
        int sel = idx*len;
        smerge(a+sel, b+sel, c+(2*sel), len, len);
        idx += blockDim.x*gridDim.x;
    }
}

template <typename T>
__global__ void cmtest(const T * __restrict__ a, const T * __restrict__ b, T * __restrict__ c, int num_arr, int len, int stride_a, int stride_b, int stride_c){
    int idx = threadIdx.x+blockDim.x*blockIdx.x;
    while (idx < num_arr){
        smerge(a+idx, b+idx, c+idx, len, len, stride_a, stride_b, stride_c);
        idx += blockDim.x*gridDim.x;
    }
}

template <typename T>
int rmvalidate(T *a, T *b, T *c, int num_arr, int len){
    T *vc = (T *)malloc(2*len*sizeof(T));
    for (int i = 0; i < num_arr; i++){
        thrust::merge(a+(i*len), a+((i+1)*len), b+(i*len), b+((i+1)*len), vc);
#ifndef TIMING
        for (int j = 0; j < len*2; j++)
            if (vc[j] != c[(i*2*len)+j]) {printf("rm mismatch i: %d, j: %d, was: %d, should be: %d\n", i, j, c[(i*2*len)+j], vc[j]); return 0;}
#endif
    }
    return 1;
}

template <typename T>
int cmvalidate(const T *c1, const T *c2, int num_arr, int len){
    for (int i = 0; i < num_arr; i++)
        for (int j = 0; j < 2*len; j++)
            if (c1[i*(2*len)+j] != c2[j*(num_arr)+i]) {printf("cm mismatch i: %d, j: %d, was: %d, should be: %d\n", i, j, c2[j*(num_arr)+i], c1[i*(2*len)+j]); return 0;}
    return 1;
}

int main(){
    mytype *h_a, *h_b, *h_c, *d_a, *d_b, *d_c;
    h_a = (mytype *)malloc(DSIZE*NUM_SETS*sizeof(mytype));
    h_b = (mytype *)malloc(DSIZE*NUM_SETS*sizeof(mytype));
    h_c = (mytype *)malloc(DSIZE*NUM_SETS*sizeof(mytype)*2);
    cudaMalloc(&d_a, (DSIZE*NUM_SETS+1)*sizeof(mytype));
    cudaMalloc(&d_b, (DSIZE*NUM_SETS+1)*sizeof(mytype));
    cudaMalloc(&d_c, DSIZE*NUM_SETS*sizeof(mytype)*2);
    // test "row-major" storage
    for (int i = 0; i < DSIZE*NUM_SETS; i++){
        h_a[i] = rand();
        h_b[i] = rand();
    }
    thrust::sort(h_a, h_a+DSIZE*NUM_SETS);
    thrust::sort(h_b, h_b+DSIZE*NUM_SETS);
    cudaMemcpy(d_a, h_a, DSIZE*NUM_SETS*sizeof(mytype), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, DSIZE*NUM_SETS*sizeof(mytype), cudaMemcpyHostToDevice);
    unsigned long gtime = dtime_usec(0);
    rmtest<<<nBLK, nTPB>>>(d_a, d_b, d_c, NUM_SETS, DSIZE);
    cudaDeviceSynchronize();
    gtime = dtime_usec(gtime);
    cudaMemcpy(h_c, d_c, DSIZE*NUM_SETS*2*sizeof(mytype), cudaMemcpyDeviceToHost);
    unsigned long ctime = dtime_usec(0);
    if (!rmvalidate(h_a, h_b, h_c, NUM_SETS, DSIZE)) {printf("fail!\n"); return 1;}
    ctime = dtime_usec(ctime);
    printf("CPU time: %f, GPU RM time: %f\n", ctime/(float)USECPSEC, gtime/(float)USECPSEC);
    // test "col-major" storage
    mytype *ch_a, *ch_b, *ch_c;
    ch_a = (mytype *)malloc(DSIZE*NUM_SETS*sizeof(mytype));
    ch_b = (mytype *)malloc(DSIZE*NUM_SETS*sizeof(mytype));
    ch_c = (mytype *)malloc(DSIZE*NUM_SETS*sizeof(mytype)*2);  // *2: holds the merged output
    for (int i = 0; i < NUM_SETS; i++)
        for (int j = 0; j < DSIZE; j++){
            ch_a[j*NUM_SETS+i] = h_a[i*DSIZE+j];
            ch_b[j*NUM_SETS+i] = h_b[i*DSIZE+j];
        }
    cudaMemcpy(d_a, ch_a, DSIZE*NUM_SETS*sizeof(mytype), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, ch_b, DSIZE*NUM_SETS*sizeof(mytype), cudaMemcpyHostToDevice);
    gtime = dtime_usec(0);
    cmtest<<<nBLK, nTPB>>>(d_a, d_b, d_c, NUM_SETS, DSIZE, NUM_SETS, NUM_SETS, NUM_SETS);
    cudaDeviceSynchronize();
    gtime = dtime_usec(gtime);
    cudaMemcpy(ch_c, d_c, DSIZE*NUM_SETS*2*sizeof(mytype), cudaMemcpyDeviceToHost);
    if (!cmvalidate(h_c, ch_c, NUM_SETS, DSIZE)) {printf("fail!\n"); return 1;}
    printf("GPU CM time: %f\n", gtime/(float)USECPSEC);
    return 0;
}
$ nvcc -O3 -DTIMING -o t784 t784.cu
$ ./t784
CPU time: 0.030691, GPU RM time: 0.045814
GPU CM time: 0.002784
$
Notes:
The GPU is actually slower than the naive single-threaded CPU code when the memory organization is row major. But for the column-major organization (which tends to improve opportunities for coalesced access) the GPU code is about 10x faster than the CPU code for my test case. This ~10x speedup factor is roughly in the range (~10-20x) of the speedup factors shown in the paper for a GPU MergePath 32-bit integer speedup vs. x86 serial merge.
using int vs. float datatypes makes a significant difference in the CPU timing. int seems to be faster (on the CPU) so I'm showing that version here. (This disparity is mentioned in the paper as well.)
The -DTIMING switch added to the compile command pares down the first validation function so that it just does the CPU merge operation, for timing.
The basic merge code is templated to be able to handle different data types, and parameterized so that it can be used in either the column-major or row-major operation.
I've dispensed with CUDA error checking for brevity of presentation. However, any time you're having trouble with a CUDA code, you should always use proper cuda error checking.
What about using thrust (as I suggested in the comments)? It should be possible to use thrust::merge with a suitable device/sequential execution policy, to more or less mimic what I have done above. However, thrust expects vectors to be contiguous, and so, without additional complexity, it could only be used in the row-major case, which we've seen is severely penalized by bad memory access patterns. It should be possible to create a set of permutation iterators in thrust that would allow the column-major, strided access that improves the memory scenario, but I have not pursued that.
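For reference, an untested sketch of that permutation-iterator idea (my own assumption of how it might look, not something benchmarked here): each thread runs a sequential thrust::merge over strided views of its A and B vectors, mimicking cmtest above. Requires -std=c++11 for auto.

#include <thrust/merge.h>
#include <thrust/execution_policy.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/permutation_iterator.h>

// maps a logical element index j to its strided, offset storage location
struct strided {
    typedef int result_type;
    int stride, offset;
    __host__ __device__ int operator()(int j) const { return j*stride + offset; }
};

template <typename T>
__global__ void cmthrust(const T *a, const T *b, T *c, int num_arr, int len){
    int idx = threadIdx.x+blockDim.x*blockIdx.x;
    while (idx < num_arr){
        strided map = {num_arr, idx};  // column-major: stride = num_arr, column = idx
        auto mi = thrust::make_transform_iterator(thrust::make_counting_iterator(0), map);
        auto ab = thrust::make_permutation_iterator(a, mi);
        auto bb = thrust::make_permutation_iterator(b, mi);
        auto cb = thrust::make_permutation_iterator(c, mi);
        // sequential merge per thread, analogous to smerge above
        thrust::merge(thrust::seq, ab, ab+len, bb, bb+len, cb);
        idx += blockDim.x*gridDim.x;
    }
}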

C picture rotation optimization

This is for all you C experts out there...
The first function takes a two-dimensional matrix src[dim][dim] representing pixels of an image and rotates it 90 degrees into a destination matrix dst[dim][dim]. The second function takes the same src[dim][dim] and smooths the image by replacing every pixel value with the average of all the pixels around it (in a window of at most 3 × 3 centered at that pixel).
I need to optimize the program for time and cycles. How else could I optimize the following?
void rotate(int dim, pixel *src, pixel *dst)
{
    int i, j, nj;
    nj = 0;
    /* below are the main computations for the implementation of rotate. */
    for (j = 0; j < dim; j++) {
        nj = dim-1-j; /* code motion: moved operation outside the inner for loop */
        for (i = 0; i < dim; i++) {
            dst[RIDX(nj, i, dim)] = src[RIDX(i, j, dim)];
        }
    }
}
/* A struct used to compute averaged pixel value */
typedef struct {
    int red;
    int green;
    int blue;
    int num;
} pixel_sum;

/* Compute min and max of two integers, respectively */
static int minimum(int a, int b)
{ return (a < b ? a : b); }
static int maximum(int a, int b)
{ return (a > b ? a : b); }

/*
 * initialize_pixel_sum - Initializes all fields of sum to 0
 */
static void initialize_pixel_sum(pixel_sum *sum)
{
    sum->red = sum->green = sum->blue = 0;
    sum->num = 0;
    return;
}

/*
 * accumulate_sum - Accumulates field values of p in corresponding fields of sum
 */
static void accumulate_sum(pixel_sum *sum, pixel p)
{
    sum->red += (int) p.red;
    sum->green += (int) p.green;
    sum->blue += (int) p.blue;
    sum->num++;
    return;
}

/*
 * assign_sum_to_pixel - Computes averaged pixel value in current_pixel
 */
static void assign_sum_to_pixel(pixel *current_pixel, pixel_sum sum)
{
    current_pixel->red = (unsigned short) (sum.red/sum.num);
    current_pixel->green = (unsigned short) (sum.green/sum.num);
    current_pixel->blue = (unsigned short) (sum.blue/sum.num);
    return;
}

/*
 * avg - Returns averaged pixel value at (i,j)
 */
static pixel avg(int dim, int i, int j, pixel *src)
{
    int ii, jj;
    pixel_sum sum;
    pixel current_pixel;
    initialize_pixel_sum(&sum);
    for (ii = maximum(i-1, 0); ii <= minimum(i+1, dim-1); ii++)
        for (jj = maximum(j-1, 0); jj <= minimum(j+1, dim-1); jj++)
            accumulate_sum(&sum, src[RIDX(ii, jj, dim)]);
    assign_sum_to_pixel(&current_pixel, sum);
    return current_pixel;
}

void smooth(int dim, pixel *src, pixel *dst)
{
    int i, j;
    /* below are the main computations for the implementation of the smooth function. */
    for (j = 0; j < dim; j++)
        for (i = 0; i < dim; i++)
            dst[RIDX(i, j, dim)] = avg(dim, i, j, src);
}
I moved dim-1-j outside of the inner for loop of rotate, which reduces the time and cycles used in the program, but is there anything else that can be done for either function?
Thanks!
There are several optimizations you can do; some a compiler might do for you, but it is best to write them out yourself. For example: move constant expressions out of the loop (you did that once; there are more places where you can do it - don't forget that the loop condition is checked on every iteration, so optimize it in this manner too) and, as Chris pointed out, use pointers that you increment instead of full array indexing. I also see some function calls that can be rewritten in-line.
I also want to point to an article on Stack Overflow about matrix multiplication and optimizing it to use the processor cache. In essence it first rearranges the arrays into memory blocks that fit the cache, then performs the operation on those blocks, then moves to the next block, and so on. You may be able to re-use the ideas for your rotation.
See Optimizing assembly generated by Microsoft Visual Studio Compiler
For the rotation, you get better utilization of the cache by decomposing the image into smaller tiles, as in the sketch below.
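A minimal sketch of such a tiled rotate (assuming the RIDX macro and pixel type from the question; TILE = 32 is a guess that should be tuned to your cache):

#define TILE 32  /* assumed tile size; tune for your cache */
void rotate_tiled(int dim, pixel *src, pixel *dst)
{
    int i, j, ii, jj;
    for (jj = 0; jj < dim; jj += TILE)
        for (ii = 0; ii < dim; ii += TILE)
            /* same assignments as rotate, just visited tile by tile */
            for (j = jj; j < jj + TILE && j < dim; j++) {
                int nj = dim - 1 - j;
                for (i = ii; i < ii + TILE && i < dim; i++)
                    dst[RIDX(nj, i, dim)] = src[RIDX(i, j, dim)];
            }
}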
For the smoothing,
1) expand the whole operation inside the main double loop, do not use these intermediate micro-functions;
2) completely unroll the accumulation and averaging (it's only a sum of 9 terms), hard coding the indexes (see the sketch after this list);
3) process in different loops along the edges (where not all 9 pixels are available) and in the middle. The middle deserves maximum optimization (especially (2));
4) try and avoid the divisions by 9 (you can think of replacing the division by a table lookup).
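As an illustration of points (2) and (3), a sketch of the interior-pixel loop with the window hard-coded (assuming the usual row-major RIDX(i, j, dim) = i*dim + j and the pixel type from the question; the edge rows and columns still need their own clamped loops):

/* interior pixels only: all nine neighbours exist, so no min/max clamping */
for (i = 1; i < dim - 1; i++) {
    for (j = 1; j < dim - 1; j++) {
        pixel *p = &src[RIDX(i - 1, j - 1, dim)];  /* top-left of the 3x3 window */
        int red   = p[0].red   + p[1].red   + p[2].red
                  + p[dim].red   + p[dim+1].red   + p[dim+2].red
                  + p[2*dim].red + p[2*dim+1].red + p[2*dim+2].red;
        int green = p[0].green + p[1].green + p[2].green
                  + p[dim].green   + p[dim+1].green   + p[dim+2].green
                  + p[2*dim].green + p[2*dim+1].green + p[2*dim+2].green;
        int blue  = p[0].blue  + p[1].blue  + p[2].blue
                  + p[dim].blue   + p[dim+1].blue   + p[dim+2].blue
                  + p[2*dim].blue + p[2*dim+1].blue + p[2*dim+2].blue;
        /* compilers strength-reduce the constant /9 to a multiply; the table
           lookup of point (4) is an alternative if profiling shows it matters */
        dst[RIDX(i, j, dim)].red   = (unsigned short)(red / 9);
        dst[RIDX(i, j, dim)].green = (unsigned short)(green / 9);
        dst[RIDX(i, j, dim)].blue  = (unsigned short)(blue / 9);
    }
}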
Top speed will be obtained by handcrafting vectorized optimization (SSE/AVX), but this requires some deal of experience. Multicore parallelization is also an option.
To give you an idea, it is possible to apply a 3x3 average on a 1 MB grayscale image in less than 0.5 ms (single core, Core i7 @ 3.4 GHz). We can extrapolate to 2 ms or so for a 1 Mpixel RGB image.
Since you can't provide a running program these are just ideas of things that could help:
Assuming values in the range [0,256), use uint8_t for your rgbn values. This takes up 1/4 of the memory of the int version but will likely require more cycles; I can't say whether it would be faster without measuring. The idea is that since you use 1/4 of the memory, you are more likely to keep more values in the L1-L3 caches.
Since your neighbors are the same whether you are rotated or not, calculate the average before rotating. I suspect this would help out with caching but again can't be sure; it depends on some code I can't see.
Parallelize the outer loop. Since you have easy grid dimensions and the inputs and outputs don't have read/write conflicts, this is a trivial thing to do (see the OpenMP sketch after this list). It will certainly take more cycles in total but will possibly be faster in wall-clock time.
Hard-code your edges; you are currently doing maximum and minimum operations on every call to avg, but for the inner points they are unnecessary. Calculate the edges and the inner points separately.
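A sketch of the parallelized outer loop from the third point (an assumption that OpenMP is acceptable here; compile with -fopenmp):

void smooth_parallel(int dim, pixel *src, pixel *dst)
{
    int i, j;
    /* each thread gets a chunk of j values; no two threads write the same dst element */
    #pragma omp parallel for private(i)
    for (j = 0; j < dim; j++)
        for (i = 0; i < dim; i++)
            dst[RIDX(i, j, dim)] = avg(dim, i, j, src);
}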

CUDA: Is it safe to apply `+=` in parallel to elements of an array located on the device?

I noticed strange (incorrect) behavior after compiling and executing a CUDA script, and was able to isolate it to the following minimal example. First I define an export-to-CSV function for integer arrays (just for debugging convenience):
#include <stdio.h>
#include <stdlib.h>

void int1DExportCSV(int *ptr, int n){
    FILE *f;
    f = fopen("1D IntOutput.CSV", "w");
    int i = 0;
    for (i = 0; i < n-1; i++){
        fprintf(f, "%i,", ptr[i]);
    }
    fprintf(f, "%i", ptr[n-1]);
    fclose(f);  /* flush and close the file */
}
Then I defined a kernel function which increases a certain element of an input array by one:
__global__ void kernel(int *ptr){
    int x = blockIdx.x;
    int y = blockIdx.y;
    int offset = x + gridDim.x * y;
    ptr[offset] += 1;
}
The main function allocates an array a filled with zeros, allocates an empty results array b, and allocates a device copy of a called dev_a:
#define DIM 64

int main(void){
    int *a;
    a = (int*)malloc(DIM*DIM*sizeof(int));
    int i;
    for (i = 0; i < DIM*DIM; i++){
        a[i] = 0;
    }
    int *b;
    b = (int*)malloc(DIM*DIM*sizeof(int));
    int *dev_a;
    cudaMalloc( (void**)&dev_a, sizeof(int)*DIM*DIM );
    cudaMemcpy( dev_a, a, DIM*DIM*sizeof(int), cudaMemcpyHostToDevice );
Then I feed dev_a into a DIM-by-DIM-by-DIM grid of blocks, each with DIM threads, copy the results back, and export them to CSV:
    dim3 blocks(DIM,DIM,DIM);
    kernel<<<blocks,DIM>>>(dev_a);
    cudaMemcpy( b, dev_a, sizeof(int)*DIM*DIM, cudaMemcpyDeviceToHost );
    cudaFree(dev_a);
    int1DExportCSV(b, DIM*DIM);
}
The resulting CSV file is DIM*DIM in length, and is filled with DIM's. However, while the length is correct, it should be filled with DIM*DIM's, since I am essentially launching a DIM*DIM*DIM*DIM hypercube of threads, in which the last two dimensions are all devoted to incrementing a unique element of the device array dev_a by one.
My first reaction was to suspect that the ptr[offset] += 1 step might be the culprit, since multiple threads potentially execute this step at the exact same time, so each thread might be updating a stale value of ptr[offset], unaware that a bunch of other threads are doing the same. However, I don't know enough about the "taboos" of CUDA to tell whether this is a reasonable guess.
Hardware problems are (to the best of my knowledge) not an issue; I am using a GTX560 Ti, so launching a 3-dimensional grid of blocks is allowed, and my thread count per block is 64, well below the maximum of 1024 imposed by the Fermi architecture.
Am I making a simple mistake? Or is there a subtle error in my example?
Additionally, I noticed that when I increase DIM to 256, the resulting array appears to be filled with random integers between 290 and 430! I am completely baffled by this behavior.
No, it's not safe. The threads in a block are stepping on each other.
Your threads in each threadblock are all updating the same location in memory:
ptr[offset] += 1;
offset is the same for every thread in the block:
int x = blockIdx.x;
int y = blockIdx.y;
int offset = x + gridDim.x * y;
That is a no-no. The results are undefined.
Instead use atomics:
atomicAdd(ptr+offset, 1);
or a parallel reduction method of some sort.
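Applied to the kernel in the question, that one-line change looks like this:

__global__ void kernel(int *ptr){
    int x = blockIdx.x;
    int y = blockIdx.y;
    int offset = x + gridDim.x * y;
    atomicAdd(ptr + offset, 1);  // atomic read-modify-write: no lost updates
}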
