Shouldn't 3x3 convolution be much faster on GPU (OpenCL)?

I'm learning how to optimize code for the GPU. I have read about the importance of memory locality, and I have also seen some tutorials and examples of GPU convolution. Based on that, I wrote and tested several kernels of my own. Surprisingly, I found that the simplest naive kernel is the fastest, and it is less than 10x faster than the CPU. (Yes, I amortized the upload/download time by running the kernel 64x.)
What am I doing wrong? I would expect convolution to be exactly the kind of operation GPUs are optimized for. If I can get a 100x speed-up on matrix multiplication, why is convolution so slow?
performance [CPU ticks/pixel] (lower is better):
CPU-naive 9.5
GPU-naive 1.64
GPU-local 2.56
GPU-local_async 15.10
GPU-scanline-private 7.35
GPU-scanline_async 15.37
EDIT: I added GPU-scanline_async later, after reading advice about async_work_group_copy.
I wonder about 2 things:
1. Is the kernel speed limited by memory bandwidth or by computing power? From what I have read, I would expect memory. But the test results suggest the opposite.
   - Kernel GPU-local is slower than GPU-naive even though it does far fewer global memory reads.
   - Modifying the kernel to use Gaussian-filter coefficients (i.e. adding a multiplication per pixel) makes it >2x slower, although it does the same number of memory reads.
   - But if it is limited by processing power, then why do I get 100x faster matrix multiplication on the GPU than on the CPU?
2. Why is the kernel GPU-scanline-private so slow? The memory locality is much better (just 3 instead of 9 reads from global memory per pixel) and the logic is minimal (no ifs/switches).
The test was done on my laptop with an Intel Core i7 6700HQ (Skylake) CPU and an nVidia 960M GPU by running the kernels 64x/frame on a floating-point array of 256x256 pixels. The full code can be seen here.
=========== Kernel codes ===========
kernel GPU-Naive 2D global=(256,256) local=(16,16)
__kernel void blur2D_naive(
__global float* I,
__global float* O
){
const int ix = get_global_id (0)+1;
const int iy = get_global_id (1)+1;
const int nx = get_global_size(0)+2;
int i = iy * nx + ix;
// 1.6 ticks/pixel
O[i] =( I[i-nx-1] + I[i-nx] + I[i-nx+1] +
I[i -1] + I[i ] + I[i +1] +
I[i+nx-1] + I[i+nx] + I[i+nx+1] ) * 0.11111111111;
// modified with gaussian mask 4.9 ticks/pixel
//O[i] =( 0.0625*I[i-nx-1] + 0.125*I[i-nx] + 0.0625*I[i-nx+1] +
// 0.125 *I[i -1] + 0.25 *I[i ] + 0.125 *I[i +1] +
// 0.0625*I[i+nx-1] + 0.125*I[i+nx] + 0.0625*I[i+nx+1] );
}
kernel GPU-local 2D global=(256,256) local=(16,16)
#define NBx 18 // tile size including borders [halo] 16+2
#define NBy 18
// seems to be slower than naive method
__kernel void blur2D_local(
__global float* I,
__global float* O
){
__local float L[NBx*NBy];
const int2 iG = (int2)(get_global_id (0)+1 , get_global_id (1)+1 );
const int2 nG = (int2)(get_global_size(0)+2 , get_global_size(1)+2 );
const int2 iL = (int2)(get_local_id (0)+1 , get_local_id (1)+1 );
const int2 nL = (int2)(get_local_size (0)+2 , get_local_size (1)+2 );
const int2 iGR = (int2)(get_group_id (0) , get_group_id (1) );
// copy boundary pixels to local memory
switch( get_local_id(1) ){ // some threads copy one more of boundary (halo) pixels
case 4:
switch( get_local_id(0) ){ // copy corner points
case 0: L[ 0 ] = I[ nG.x* get_group_id(1)*get_local_size(1) + get_group_id(0)*get_local_size(0) ]; break; // upper-left
case 1: L[ NBx-1 ] = I[ nG.x* get_group_id(1)*get_local_size(1) + get_group_id(0)*get_local_size(0)+(NBx-1) ]; break; // upper-right
case 2: L[ (NBy-1)*NBx ] = I[ nG.x*(get_group_id(1)*get_local_size(1)+(NBy-1)) + get_group_id(0)*get_local_size(0) ]; break; // lower-left
case 3: L[ NBy* NBx-1 ] = I[ nG.x*(get_group_id(1)*get_local_size(1)+(NBy-1)) + get_group_id(0)*get_local_size(0)+(NBx-1) ]; break; // lower-right
}
// copy border lines
case 0: L[ iL.x ] = I[ nG.x* get_group_id(1)*get_local_size(1) + iG.x ]; break; // top line
case 1: L[ NBx*(NBy-1) + iL.x ] = I[ nG.x*(get_group_id(1)*get_local_size(1)+(NBy-1) ) + iG.x ]; break; // bottom line
case 2: L[ NBx*iL.x ] = I[ nG.x*(get_group_id(1)*get_local_size(1)+get_local_id(0) ) + get_group_id(0)*get_local_size(0) ]; break; // left line
case 3: L[ NBx*iL.x + (NBx-1) ] = I[ nG.x*(get_group_id(1)*get_local_size(1)+get_local_id(0) ) + (get_group_id(0)*get_local_size(0)+(NBx-1)) ]; break; // right line
} // each thread copied at most 1 border pixel
int ig = iG.y*nG.x + iG.x;
int il = iL.y*nL.x + iL.x;
L[il] = I[ig]; // each thread copies its own pixel to local memory
barrier(CLK_LOCAL_MEM_FENCE);
const float renorm = 1.0/9.0;
O[ig] =( L[il-NBx-1] + L[il-NBx] + L[il-NBx+1] +
L[il -1] + L[il ] + L[il +1] +
L[il+NBx-1] + L[il+NBx] + L[il+NBx+1] ) / 9.0;
}
kernel GPU-local_async 2D global=(256,16) local=(16,16)
#define nTiles 16
#define NBx 18
#define NBy 18
#define copy_tile(event,ig0,I,L) { int ig_=ig0; int il_=0; for(int i=0; i<NBy; i++){ event = async_work_group_copy( L+il_, I+ig_, NBx, event ); ig_+=nx; il_+=NBx; } }
// https://streamcomputing.eu/blog/2014-06-19/using-async_work_group_copy-on-2d-data/
__kernel void blur2D_local_async(
__global float* I,
__global float* O
){
const int nx = get_global_size(0)+2;
__local float LI[NBx*NBy*2];
int iL0 = 0;
int iL1 = NBx*NBy;
event_t event = 0;
int ig0 = get_group_id(0)*get_local_size(0);
copy_tile(event,ig0,I,LI);
for( int it=0; it<nTiles; it++ ){
int ig = ig0 + (get_local_id(1)+1)*nx + get_local_id(0)+1;
int il = (get_local_id(1)+1)*NBx + get_local_id(0) + iL0;
ig0 += get_local_size(1)*nx;
event_t event_ = 0;
copy_tile(event_,ig0,I,LI+iL1);
wait_group_events(1, &event);
//barrier(CLK_LOCAL_MEM_FENCE);
O[ig] =( LI[il-NBx] + LI[il-NBx+1] + LI[il-NBx+2] +
LI[il ] + LI[il +1] + LI[il +2] +
LI[il+NBx] + LI[il+NBx+1] + LI[il+NBx+2] ) * 0.11111111111;
int iLtmp=iL0; iL0=iL1; iL1=iLtmp;
event = event_;
}
}
kernel GPU-scanline_private 1D global=(256) local=(32)
__kernel void blur2D_scanline_priv(
int nx, int ny,
__global float* I,
__global float* O
){
int ig = get_global_id(0)+1;
float3 Lm = (float3)( I[ig-1], I[ig], I[ig+1] ); ig += nx;
float3 L0 = (float3)( I[ig-1], I[ig], I[ig+1] );
for(int iy=1; iy<(ny-1); iy++ ){
ig += nx;
float3 Lp= (float3)( I[ig-1], I[ig], I[ig+1] );
O[ig-nx] =
( Lm.x + Lm.y + Lm.z +
L0.x + L0.y + L0.z +
Lp.x + Lp.y + Lp.z ) * 0.11111111111;
Lm=L0; L0=Lp;
}
}
kernel GPU-scanline_async 1D global=(256) local=(32)
#define NB 34
__kernel void blur2D_scanline_async(
int nx, int ny,
__global float* I,
__global float* O
){
__local float L[NB*4];
int i0=0;
int i1=NB;
int i2=NB*2;
int i3=NB*3;
event_t event = 0;
int ig0 = get_group_id(0)*get_local_size(0);
event = async_work_group_copy( L , I+ig0, NB, event ); ig0 += nx;
event = async_work_group_copy( L+NB , I+ig0, NB, event ); ig0 += nx;
event = async_work_group_copy( L+NB*2, I+ig0, NB, event ); ig0 += nx;
const int il = get_local_id(0);
int ig = get_global_id(0)+1;
for(int iy=1; iy<(ny-2); iy++ ){
wait_group_events(1, &event);
event = async_work_group_copy( L+i3, I+ig0, NB, event ); ig0 += nx;
ig += nx;
O[ig] =
( L[i0+il] + L[i0+il+1] + L[i0+il+2] +
L[i1+il] + L[i1+il+1] + L[i1+il+2] +
L[i2+il] + L[i2+il+1] + L[i2+il+2] ) * 0.11111111111;
__local float *Ltmp;
int itmp=i0; i0=i1; i1=i2; i2=i3; i3=itmp;
}
}
kernel CPU-naive
void blur(int nx, int ny, float * I, float * O ){
float renorm = 1.0/9.0;
for(int iy=1;iy<ny-1;iy++){ for(int ix=1;ix<nx-1;ix++){
int i = iy*nx+ix;
O[i] =( I[i-nx-1] + I[i-nx] + I[i-nx+1] +
I[i -1] + I[i ] + I[i +1] +
I[i+nx-1] + I[i+nx] + I[i+nx+1] ) * renorm;
} }
}

In matrix multiplication, each sub-matrix (patch) is re-used against every patch along a line of the other matrix. With 2x2 patches in a 20x20 matrix, each sub-matrix is used 10 times. GPU implementations generally use 16x16 or 32x32 patches, which means that for a 2kx2k multiplication each 16x16 patch is re-used at least 128 times.
MM reuse = 128
Add the re-use inside each sub-matrix by sub-matrix multiplication, and it is enough to push the GPU to its limits.
In a 3x3 convolution, a 3x3 patch is not re-used for a whole scanline or a whole picture. Only its pixels are re-used.
3x3 stencil: each pixel is re-used by 8 neighbouring stencils.
5x5 stencil: each pixel is re-used by 24 neighbouring stencils.
To catch up with matrix multiplication, it would need
an 11x11 stencil to have a re-use of 120,
which is also more local than matrix multiplication and should reach more GFLOPS than it, but it does not do equal amounts of multiplications and additions.
It does 9 additions + 1 multiplication.
8 potential multiplications are lost. Nearly half of the GFLOPS limit is lost.
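The re-use figures above can be checked with a few lines of arithmetic. This is only a sketch of the counting argument, not a performance model; the function names are made up for illustration.

```c
#include <assert.h>

/* Times one matrix tile is re-used when an n x n matrix is split into
   tile x tile blocks: once per block along a line of the other matrix. */
int matmul_tile_reuse(int n, int tile) {
    assert(tile > 0 && n % tile == 0);
    return n / tile;
}

/* Number of neighbouring k x k stencils that read the same pixel:
   all k*k stencils covering it, minus the pixel's own stencil. */
int stencil_pixel_reuse(int k) {
    assert(k > 0 && k % 2 == 1);
    return k * k - 1;
}

/* Smallest odd stencil width whose per-pixel re-use reaches `target`. */
int stencil_width_for_reuse(int target) {
    int k = 1;
    while (stencil_pixel_reuse(k) < target) k += 2;
    return k;
}
```

With these definitions, `matmul_tile_reuse(2048, 16)` gives the 128 quoted above, and `stencil_pixel_reuse(11)` gives the 120 of the 11x11 stencil.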
You should try async workgroup copies:
load top-left 18x18,
load top 18x18 and compute top-left, asynchronously,
load top-right 18x18, compute top and store top-left, asynchronously,
load right 18x18, compute top-right and store top, asynchronously,
load ... compute ... store ... all asynchronously, so both local memory and main memory can be used (main memory accesses would benefit as in the naive version, maybe through L1).
Matrix multiplication (with 16x16 sub-matrices) vs convolution (17x17 brush size):
Matrix: the L2 re-use ratio increases with main matrix size, and the L1 re-use ratio increases with sub-matrix size.
Convolution: the total re-use ratio is the same for all image sizes, but the L1 usage ratio increases with brush size (good).
Matrix: 16*16*16 multiplications + 16*16*16 additions per workgroup
Convolution: 17*17 additions + 1 multiplication per thread (bad)
Matrix: uniform thread usage, no if-else, all local memory is re-used
Convolution: needs to load at least 16 pixels beyond the borders (ghost walls 16 pixels thick). These are re-used by neighbouring workgroups, but those workgroups may run on another compute unit and only share L2, instead of running on the same compute unit and sharing L1 (ugly).
That is why I suggested async workgroup copies: to process those neighbours on the same compute unit (and L1) and increase the re-use ratio.
Matrix: increasing the patch size also increases re-use at a cubic rate in the sub-matrix multiplications (but decreases L2 re-use because there are fewer patches per line, which makes total re-use grow at roughly a quadratic rate).
Convolution: increasing the patch size increases re-use at a quadratic rate.
Matrix: local memory must be at least 2x tile area (sub mat-mat mul)
Convolution: local memory must be at least tile area + ghost walls area
Matrix: can do 4x4 sub-sub-multiplications in private memory (which use each element 4 times), which means 4x4 memory = 64 add + 64 mul
Convolution: loading 4x4 into private memory does nothing more than a 4-pixel compute (for a 3x3 brush), which means 4x4 memory = 36 add + 4 mul
Having an addition-heavy kernel leaves room for another multiplication-heavy kernel to run concurrently, or for such work to be done asynchronously in the same kernel. If you are using this for image processing, maybe you can add some "blend" or "resize" kernels so they work together?
The scanline version loads 3 elements, does 9 add + 1 mul, then repeats. Loaded elements stay for 3 iterations, which means they are re-used only 3 times, and their neighbours (in the x or y direction) may not fall in a neighbouring thread or even a neighbouring workgroup. Also, 3 loads versus 1 store is unbalanced. If memory bandwidth is 100 GB/s, it would use 50 GB/s for loads and 15 GB/s for stores, unless they are coming from L1.
You can decrease the add/mul imbalance by using an accumulator.
store = (accumulator) * 0.1111111
accumulator+=new vector // 3 adds
accumulator-=old vector // 3 adds
So now it is 6 adds + 1 mul, which is more balanced: a 1 Tflops GPU will have 500 Gflops for adds and 90 Gflops for muls.
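A host-side C sketch of the accumulator idea, written as what a single work-item of the scanline kernel would do for one column. It is a variant, not the exact scheme above: it accumulates scalar row sums rather than 3-element vectors, which gives 4 adds + 1 mul per pixel. The function name and the row-major layout are assumptions, and a long-running float accumulator can drift slightly because of the repeated add/subtract.

```c
#include <assert.h>

/* 3x3 box blur of one image column using a running accumulator.
   I and O hold nx*ny floats, row-major; ix must be an interior column.
   Border rows of O are left untouched. */
void blur_column_running(const float *I, float *O, int nx, int ny, int ix) {
    assert(nx >= 3 && ny >= 3 && ix >= 1 && ix < nx - 1);
    /* Horizontal 3-sums of the first two rows. */
    float rm = I[0 * nx + ix - 1] + I[0 * nx + ix] + I[0 * nx + ix + 1];
    float r0 = I[1 * nx + ix - 1] + I[1 * nx + ix] + I[1 * nx + ix + 1];
    float acc = rm + r0;
    for (int iy = 1; iy < ny - 1; iy++) {
        const float *row = I + (iy + 1) * nx + ix;
        float rp = row[-1] + row[0] + row[1];   /* new row sum: 2 adds */
        acc += rp;                              /* row entering the window */
        O[iy * nx + ix] = acc * (1.0f / 9.0f);  /* single multiply */
        acc -= rm;                              /* row leaving the window */
        rm = r0;
        r0 = rp;
    }
}
```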
The naive version doesn't use local memory, leaving room for more wavefronts in flight. The local memory version actually breaks the L1 access pattern and allows fewer wavefronts in flight. This reduces VALU occupancy.
You can decrease local memory usage by doing the scanline at workgroup level instead of thread level. What I mean is something like:
load from memory: x x x x x x x x x x
do scanline for it: (left to right,1-D) a b c d e f g h i j
now use it for scanline on workgroup level: a c c u m u l a t o r (+new)
(top to bottom) z x z x z x z x z x (- old)
calculate frontline 1-d scanline: 30 additions for each new row
calculate wide vector 2-d scanline: 30*30 additions
each pixel gets 1 value instead of adding 3 values
storing: 16x16 multiplications
much less local memory used, more balanced (~8 add 1 mul)
This has a 1-d scanline that is single-threaded for N cycles, or a multi-threaded reduce for log N cycles (assuming enough threads in a compute unit).
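The two-stage idea (a 1-d pass along each row, then a pass down the columns over the row results) can be sketched on the host as a separable box blur. This is a simplified serial reference under assumed names, useful for validating a GPU version; it uses a full-size scratch buffer instead of the small local-memory window described above.

```c
#include <assert.h>
#include <stdlib.h>

/* 3x3 box blur as two 1-D passes over a row-major nx*ny image.
   Border pixels of O are left untouched. Returns 0 on success,
   -1 if the scratch buffer cannot be allocated. */
int blur3x3_separable(const float *I, float *O, int nx, int ny) {
    assert(nx >= 3 && ny >= 3);
    float *H = malloc((size_t)nx * (size_t)ny * sizeof *H);
    if (!H) return -1;
    /* Pass 1: horizontal 3-sums for every row (2 adds per pixel). */
    for (int iy = 0; iy < ny; iy++)
        for (int ix = 1; ix < nx - 1; ix++) {
            int i = iy * nx + ix;
            H[i] = I[i - 1] + I[i] + I[i + 1];
        }
    /* Pass 2: vertical 3-sums of the row sums (2 adds + 1 mul per pixel). */
    for (int iy = 1; iy < ny - 1; iy++)
        for (int ix = 1; ix < nx - 1; ix++) {
            int i = iy * nx + ix;
            O[i] = (H[i - nx] + H[i] + H[i + nx]) * (1.0f / 9.0f);
        }
    free(H);
    return 0;
}
```

The design choice here is that each pixel's horizontal sum is computed once and then shared by the three output rows that need it, which is where the reduction in additions comes from.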

Related

How do I obtain the theoretical/Linpack FLOPS performance on basic vector/matrix operations?

The question is simple: how do I further optimize my code, given that basic matrix operations are critical and common in my calculation? BLAS and LAPACK operations are good for linear algebra, but neither of them provides basic element-by-element addition/multiplication operations (Hadamard). Theoretical performance may be difficult, but Linpack performance or 60~80% Linpack performance should be achievable. (I can only reach 12%; if I use multiply-add, only 25%.)
For references
Theoretical performance: 8259u has 4 cores * 3.8GHz * 16 FLOPS = 240 GFlops
Linpack performance: 8259u can run as fast as 140~160 GFlops double precision operations.
Platform: Macbook Pro 2018, Monterey
CPU: i5-8259u, 4c8t
RAM: 8GB
CC: gcc 11.3.0
CFLAGS: -mavx2 -mfma -fopenmp -O3
Here's my attempt
the flops are calculated as follows:
double time = stop - start;
double ops = 1.0 * Nx * Ny * iterNum; //2.0 for complex numbers
double flops = ops / time;
double gFlops = flops / 1E9;
Here's some results when I run my code. Real and complex results are almost the same, so I only show the real results (roughly):
//Nx = Ny = 2048, iterNum = 10000
//Typical matrix size and iteration depth for my calculation
threads = 1: 1 GFlops
threads = 2: 2 GFlops
threads = 4: 3 GFlops
threads = 8: 4 GFlops
threads = 16: 9 GFlops
threads = 32: 11 GFlops
threads = 64: 15 GFlops
threads = 128: 18 GFlops
threads = 256: 19 GFlops
threads = 512: 21 GFlops
threads = 1024: 20 GFlops
threads = 2048: 40 GFlops // wrong answer
To make it convenient to keep large matrices on the heap and to integrate with MathGL, each matrix is flattened into a vector of Nx * Ny elements stored row by row.
// for real numbers
x = (double *)_mm_malloc(Nx * Ny * sizeof(double), 32);
y = (double *)_mm_malloc(Nx * Ny * sizeof(double), 32);
z = (double *)_mm_malloc(Nx * Ny * sizeof(double), 32);
sum = (double *)_mm_malloc(Nx * Ny * sizeof(double), 32);
// for complex numbers
x = (double *)_mm_malloc(Nx * Ny * sizeof(double complex), 32);
y = (double *)_mm_malloc(Nx * Ny * sizeof(double complex), 32);
z = (double *)_mm_malloc(Nx * Ny * sizeof(double complex), 32);
sum = (double *)_mm_malloc(Nx * Ny * sizeof(double complex), 32);
The addition was done in parallel using OpenMP.
double start = omp_get_wtime();
#pragma omp parallel private(shift)
{
for (int tds = omp_get_thread_num(); tds < threads; tds = tds + threads)
{
shift = Nx * Ny / threads * tds;
for (int i = 0; i < iterNum; i++)
{
AddComplex(sum+shift, sum+shift, z+shift, Nx/threads, Ny);
}
}
}
double stop = omp_get_wtime();
I wrote explicit vectorization code using AVX intrinsics "immintrin.h".
//real matrix addition
void AddReal(double *summation, const double *summand, const double *addend, int Nx, int Ny)
{
int nBlock = Nx * Ny / realPackSize;
int nRem = Nx * Ny % realPackSize;
register __m256d packSummand, packAddend, packSum;
const double *px = summand;
const double *py = addend;
double *pSum = summation;
for (int i = 0; i < nBlock; i++)
{
packSummand = _mm256_load_pd(px);
packAddend = _mm256_load_pd(py);
packSum = _mm256_add_pd(packSummand, packAddend);
_mm256_store_pd(pSum, packSum);
px = px + realPackSize;
py = py + realPackSize;
pSum = pSum + realPackSize;
}
for (int i = 0; i < nRem; i++)
{
pSum[i] = px[i] + py[i];
}
px = NULL;
py = NULL;
pSum = NULL;
return;
}
//Complex matrix addition
void AddComplex(double complex *summation, const double complex *summand, const double complex *addend, int Nx, int Ny)
{
int nBlock = Nx * Ny / complexPackSize;
int nRem = Nx * Ny % complexPackSize;
register __m256d packSummand, packAddend, packSum;
const double complex *px = summand;
const double complex *py = addend;
double complex *pSum = summation;
for (int i = 0; i < nBlock; i++)
{
packSummand = _mm256_load_pd(px);
packAddend = _mm256_load_pd(py);
packSum = _mm256_add_pd(packSummand, packAddend);
_mm256_store_pd(pSum, packSum);
px = px + complexPackSize;
py = py + complexPackSize;
pSum = pSum + complexPackSize;
}
for (int i = 0; i < nRem; i++)
{
pSum[i] = px[i] + py[i];
}
px = NULL;
py = NULL;
pSum = NULL;
return;
}

Level 1 (e.g. dot product) and level 2 (e.g. matrix-vector multiplication) BLAS functions are known not to scale (especially level 1 BLAS functions), as opposed to level 3 (e.g. matrix multiplication). Indeed, they are generally memory-bound: the amount of data read/written is O(n) while the number of floating-point operations is also O(n). This is not the case for level 3 BLAS, which is generally clearly compute-bound.
"Theoretical performance may be difficult, but Linpack performance or 60~80% Linpack performance should be achievable"
If the computation is memory-bound, then no, this is not possible. Linpack is clearly compute-bound on nearly all machines. The thing is that memory is slow, and the speed of RAM has not increased as fast as the speed of processors over the last decades. This is known as the memory wall (formulated a few decades ago and still true nowadays).
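The memory-bound argument can be turned into a rough upper bound: the achievable FLOP rate is the sustained memory bandwidth divided by the bytes moved per floating-point operation. The sketch below only does that arithmetic; the function names are hypothetical, and any bandwidth figure passed in is an assumption to be replaced by a measured value for the machine in question.

```c
#include <assert.h>

/* Upper bound on the FLOP rate (GFLOP/s) of a streaming kernel that is
   limited by memory: bandwidth divided by bytes moved per operation. */
double bandwidth_bound_gflops(double bandwidth_gb_s, double bytes_per_flop) {
    assert(bandwidth_gb_s > 0.0 && bytes_per_flop > 0.0);
    return bandwidth_gb_s / bytes_per_flop;
}

/* Bytes moved per addition for z[i] = x[i] + y[i] on doubles:
   two loads and one store, plus one extra read of z when the cache
   uses write-allocate and no streaming store is used. */
double add_bytes_per_flop(int write_allocate) {
    return (write_allocate ? 4 : 3) * (double)sizeof(double);
}
```

For example, with an assumed sustained bandwidth of 30 GB/s, `bandwidth_bound_gflops(30.0, add_bytes_per_flop(0))` is 1.25 GFLOP/s, which is orders of magnitude below the 140~160 GFlops Linpack figure quoted in the question, whatever the quality of the SIMD code.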
"Here's some results when I run my code."
Getting a faster computation from using 1024 threads instead of 512 on a mobile processor with 4 cores and 8 threads makes me think that there is a huge problem somewhere. The maximum should be reached with 8 threads; otherwise, the computation is clearly inefficient. Indeed, running more threads than hardware threads causes the OS scheduler to perform expensive context switches (higher overhead). In the end, your processor never runs more than 8 tasks at a time. There are three possibilities:
The timings are not correct (the provided piece of code about that seems fine to me)
The program is bogus
The computation exhibits a super-linear speed-up (possibly due to cache)
"I wrote explicit vectorization code using AVX intrinsics "immintrin.h"."
The hot loop contains 2 loads, 1 store, 1 add and a few instructions incrementing integers. Your processor can do 2 loads and 1 store per cycle, so the SIMD part can be done at a throughput of 1 iteration per cycle (though the latency can be much bigger), assuming nBlock is large enough.
Your processor can do 2 adds per cycle, so half the throughput is lost. However, you cannot write something faster than that if the loads/stores are mandatory.
If complexPackSize is smaller than a SIMD register, then I think the processor has to perform more complex operations because of the overlap with the previous iteration, which will certainly make the loop run much less efficiently (a loop-carried dependency makes the loop latency-bound, which is very inefficient here). If complexPackSize is much larger than a cache line, then prefetching will likely be an issue.
Your processor cannot execute too many instructions at the same time. The increment instructions and the loop check cause 5 instructions to be executed, which consume at least 1 cycle. This reduces the throughput by a factor of 2 again, so no more than 25% of the theoretical performance can be reached. This can be improved a bit by unrolling the loop. Unrolling might also improve execution because the _mm256_add_pd instruction has a pretty high latency. One should keep in mind that SIMD instructions are great for throughput but not for latency. Thus, when latency is not an issue, SIMD code should be fast.
Note that the write-allocate cache policy causes data to be read when _mm256_store_pd is used, increasing the amount of data transferred from RAM and reducing the observed throughput. _mm256_stream_pd can be used to avoid this effect, but it is fast only if the data are not read again right afterwards or do not fit in the cache anyway. It also requires the data to be aligned. In fact, _mm256_store_pd requires that too, and if it is not the case, it can cause a crash or a silent bug. The same applies to _mm256_load_pd: _mm256_loadu_pd should be used instead for unaligned data. I am not sure the data read is always aligned. It should be fine if complexPackSize is a power of two divisible by 32, as well as shift. However, I highly doubt this is the case for shift, especially with a large number of threads. I also find it very suspicious to use a constant complexPackSize while the SIMD registers have a fixed size. Did you check the results in all cases?
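One way to check the alignment concern before calling the aligned load/store intrinsics is sketched below. The helper names are hypothetical; the second function assumes the buffer holds doubles, that its base pointer is already 32-byte aligned, and that chunk starts are computed with the same integer division as `shift` in the question.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Nonzero if p is aligned to `alignment` bytes (a power of two). */
int is_aligned(const void *p, size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)p & (alignment - 1)) == 0;
}

/* Nonzero if every chunk start base + tds * (total_len / threads),
   counted in doubles, stays 32-byte aligned when base is. */
int chunk_offsets_aligned(size_t total_len, size_t threads) {
    assert(threads != 0);
    size_t chunk_len = total_len / threads;   /* integer division, as for shift */
    return (chunk_len * sizeof(double)) % 32 == 0;
}
```

A debug build could assert `is_aligned(px, 32)` at the top of AddReal and fall back to the unaligned intrinsics when the check fails.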

OpenCL, C - Leibniz Formula for Pi

I'm trying to get some experience with OpenCL. The environment is set up and I can create and execute kernels. I am currently trying to compute pi in parallel using the Leibniz formula but have been receiving some strange results.
The kernel is as follows:
__kernel void leibniz_cl(__global float *space, __global float *result, int chunk_size)
{
__local float pi[THREADS_PER_WORKGROUP];
pi[get_local_id(0)] = 0.;
for (int i = 0; i < chunk_size; i += THREADS_PER_WORKGROUP) {
// `idx` is the work item's `i` in the grander scheme
int idx = (get_group_id(0) * chunk_size) + get_local_id(0) + i;
float idx_f = 1 / ((2 * (float) idx) + 1);
// Make the fraction negative if needed
if(idx & 1)
idx_f = -idx_f;
pi[get_local_id(0)] += idx_f;
}
// Reduction within workgroups (in `pi[]`)
for(int groupsize = THREADS_PER_WORKGROUP / 2; groupsize > 0; groupsize >>= 1) {
if (get_local_id(0) < groupsize)
pi[get_local_id(0)] += pi[get_local_id(0) + groupsize];
barrier(CLK_LOCAL_MEM_FENCE);
}
If I end the function here and set result to pi[get_local_id(0)] for !get_global_id(0) (as in the reduction for the first group), printing result prints -nan.
Remainder of kernel:
// Reduction amongst workgroups (into `space[]`)
if(!get_local_id(0)) {
space[get_group_id(0)] = pi[get_local_id(0)];
for(int groupsize = get_num_groups(0) / 2; groupsize > 0; groupsize >>= 1) {
if(get_group_id(0) < groupsize)
space[get_group_id(0)] += space[get_group_id(0) + groupsize];
barrier(CLK_LOCAL_MEM_FENCE);
}
}
barrier(CLK_LOCAL_MEM_FENCE);
if(get_global_id(0) == 0)
*result = space[get_group_id(0)] * 4;
}
Returning space[get_group_id(0)] * 4 returns either -nan or a very large number which clearly is not an approximation of pi.
I can't decide if it is an OpenCL concept I'm missing or a parallel execution one in general. Any help is appreciated.
Links
Reduction template: OpenCL float sum reduction
Leibniz Formula: https://www.wikiwand.com/en/Leibniz_formula_for_%CF%80

Maybe these are not the most critical issues with the code, but they can be the source of the problem:
You definitely should use barrier(CLK_LOCAL_MEM_FENCE); before the local reduction. This can be avoided only if you know that the work-group size is equal to or smaller than the number of threads in a wavefront running the same instruction in parallel: 64 for AMD GPUs, 32 for NVidia GPUs.
The global reduction must be done in multiple kernel launches because barrier() works only for work-items of the same work-group. The clear and 100% working way to insert a global barrier into a kernel is to split it in two at the place where the global barrier is needed.
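The two-stage structure can be emulated serially on the host to check the arithmetic independently of any synchronization issue. This is a sketch in plain C, not OpenCL: the function names are made up, the loops over `lid` stand in for work-items, and doubles are used instead of the kernel's floats.

```c
#include <assert.h>
#include <stdlib.h>

/* Stage 1: what one work-group computes. Each of the group_size
   work-items sums its strided share of the chunk, then the group
   combines the per-item sums with a tree reduction. */
static double group_partial_sum(int group, int group_size, int chunk_size) {
    double *pi = calloc((size_t)group_size, sizeof *pi);
    assert(pi != NULL);
    for (int lid = 0; lid < group_size; lid++)
        for (int i = 0; i < chunk_size; i += group_size) {
            long idx = (long)group * chunk_size + lid + i;
            double term = 1.0 / (2.0 * (double)idx + 1.0);
            pi[lid] += (idx & 1) ? -term : term;
        }
    /* On a GPU, a barrier is needed before this loop and after each step. */
    for (int stride = group_size / 2; stride > 0; stride >>= 1)
        for (int lid = 0; lid < stride; lid++)
            pi[lid] += pi[lid + stride];
    double result = pi[0];
    free(pi);
    return result;
}

/* Stage 2: a separate pass (a second kernel launch on a GPU, or a
   host loop as here) adds the per-group partial sums. */
double leibniz_pi(int num_groups, int group_size, int chunk_size) {
    assert(group_size > 0 && (group_size & (group_size - 1)) == 0);
    assert(chunk_size % group_size == 0);
    double sum = 0.0;
    for (int g = 0; g < num_groups; g++)
        sum += group_partial_sum(g, group_size, chunk_size);
    return 4.0 * sum;
}
```

Comparing a kernel's per-group values against `group_partial_sum` makes it easier to tell whether a wrong result comes from stage 1 or from the cross-group step.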

Strategy for doing final reduction

I am trying to implement an OpenCL version of the reduction of an array of floats.
To achieve it, I took the following code snippet found on the web:
__kernel void sumGPU ( __global const double *input,
__global double *partialSums,
__local double *localSums)
{
uint local_id = get_local_id(0);
uint group_size = get_local_size(0);
// Copy from global memory to local memory
localSums[local_id] = input[get_global_id(0)];
// Loop for computing localSums
for (uint stride = group_size/2; stride>0; stride /=2)
{
// Waiting for each 2x2 addition into given workgroup
barrier(CLK_LOCAL_MEM_FENCE);
// Divide WorkGroup into 2 parts and add elements 2 by 2
// between local_id and local_id + stride
if (local_id < stride)
localSums[local_id] += localSums[local_id + stride];
}
// Write result into partialSums[nWorkGroups]
if (local_id == 0)
partialSums[get_group_id(0)] = localSums[0];
}
This kernel code works well, but I would like to compute the final sum by adding all the partial sums of each work group.
Currently, I do this final sum step on the CPU with a simple loop of nWorkGroups iterations.
I also saw another solution with atomic functions, but it seems to be implemented for int, not for floats. I think that only CUDA provides atomic functions for float.
I also saw that I could use another kernel which performs this sum operation, but I would like to avoid this solution in order to keep the source simple and readable. Maybe I cannot do without this solution...
I must tell you that I use OpenCL 1.2 (returned by clinfo) on a Radeon HD 7970 Tahiti 3GB (I think that OpenCL 2.0 is not supported with my card).
More generally, I would like to get advice about the simplest method to perform this final summation with my graphics card model and OpenCL 1.2.

If that float's order of magnitude is smaller than exa scale, then:
Instead of
if (local_id == 0)
partialSums[get_group_id(0)] = localSums[0];
You could use
if (local_id == 0)
{
if(strategy==ATOMIC)
{
long integer_part=getIntegerPart(localSums[0]);
atom_add (&totalSumIntegerPart[0] ,integer_part);
long float_part=1000000*getFloatPart(localSums[0]);
// 1000000 for saving meaningful 7 digits as integer
atom_add (&totalSumFloatPart[0] ,float_part);
}
}
The accumulated fractional part can exceed 1000000, so when you divide it by 1000000 in another kernel, you take the integer part of the result and add it to the real integer part:
float value=0;
if(strategy==ATOMIC)
{
float float_part=getFloatPart_(totalSumFloatPart[0]);
float integer_part=getIntegerPart_(totalSumFloatPart[0])
+ totalSumIntegerPart[0];
value=integer_part+float_part;
}
Just a few atomic operations shouldn't noticeably affect the total kernel time.
Some of these get___part functions can be written easily using floor and similar functions. Some need a division by 1M.
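The split into an integer counter and a scaled fractional counter can be tried out on the host first. The following is a serial C sketch under stated assumptions: the struct and function names are hypothetical, values are assumed non-negative, and the two plain additions stand in for the integer atomics (on OpenCL 1.x, 64-bit atom_add is provided by the cl_khr_int64_base_atomics extension, which the device would have to support).

```c
#include <assert.h>

#define FRAC_SCALE 1000000LL   /* keeps about 6 decimal digits of the fraction */

typedef struct {
    long long integer_part;   /* sum of the integer parts */
    long long frac_part;      /* sum of the fractions, scaled by FRAC_SCALE */
} SplitSum;

/* Adds one non-negative partial sum. On the GPU both additions
   would be integer atomic adds on global counters. */
void split_sum_add(SplitSum *acc, double value) {
    assert(value >= 0.0);
    long long ip = (long long)value;   /* truncation equals floor for value >= 0 */
    long long fp = (long long)((value - (double)ip) * FRAC_SCALE + 0.5);
    acc->integer_part += ip;
    acc->frac_part += fp;
}

/* Recombines the counters. The fraction counter may exceed FRAC_SCALE,
   so its carry is moved to the integer part first. */
double split_sum_value(const SplitSum *acc) {
    long long carry = acc->frac_part / FRAC_SCALE;
    long long rest  = acc->frac_part % FRAC_SCALE;
    return (double)(acc->integer_part + carry) + (double)rest / FRAC_SCALE;
}
```

Anything below 1/FRAC_SCALE in each partial sum is rounded away, so the scale factor sets the precision of the final result.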

Sorry for the previous code.
It also has a problem:
barrier(CLK_GLOBAL_MEM_FENCE) only affects the current work-group.
I got confused. =[
If you want to do the reduction sum on the GPU, you should enqueue the reduction kernel with clEnqueueNDRangeKernel after clFinish(commandQueue).
Please just take the concept.
__kernel void sumGPU ( __global const double *input,
__global double *partialSums,
__local double *localSums)
{
uint local_id = get_local_id(0);
uint group_size = get_local_size(0);
// Copy from global memory to local memory
localSums[local_id] = input[get_global_id(0)];
// Loop for computing localSums
for (uint stride = group_size/2; stride>0; stride /=2)
{
// Waiting for each 2x2 addition into given workgroup
barrier(CLK_LOCAL_MEM_FENCE);
// Divide WorkGroup into 2 parts and add elements 2 by 2
// between local_id and local_id + stride
if (local_id < stride)
localSums[local_id] += localSums[local_id + stride];
}
// Write result into partialSums[nWorkGroups]
if (local_id == 0)
partialSums[get_group_id(0)] = localSums[0];
barrier(CLK_GLOBAL_MEM_FENCE);
if(get_group_id(0)==0){
if(local_id < get_num_groups(0)){ // 16384
for(int n=0 ; n<get_num_groups(0) ; n+= group_size )
localSums[local_id] += partialSums[local_id+n];
barrier(CLK_LOCAL_MEM_FENCE);
for(int s=group_size/2;s>0;s/=2){
if(local_id < s)
localSums[local_id] += localSums[local_id+s];
barrier(CLK_LOCAL_MEM_FENCE);
}
if(local_id == 0)
partialSums[0] = localSums[0];
}
}
}
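The "enqueue the reduction kernel again" concept amounts to repeating the same pass until one value is left. Below is a host-side C sketch of that driver loop with hypothetical names; each call to `reduce_pass` stands in for one kernel launch followed by clFinish. The serial version can safely work in place, whereas a GPU version would normally ping-pong between two buffers to avoid races between work-groups.

```c
#include <assert.h>
#include <stddef.h>

/* One "kernel launch": every group of group_size elements is reduced to
   one partial sum, written to the front of buf. Returns the number of
   partial sums produced. */
static size_t reduce_pass(double *buf, size_t n, size_t group_size) {
    size_t num_groups = (n + group_size - 1) / group_size;
    for (size_t g = 0; g < num_groups; g++) {
        double sum = 0.0;
        for (size_t i = g * group_size; i < n && i < (g + 1) * group_size; i++)
            sum += buf[i];
        buf[g] = sum;
    }
    return num_groups;
}

/* Repeats the pass until a single value remains, as repeated kernel
   launches would. The contents of buf are overwritten. */
double reduce_multi_pass(double *buf, size_t n, size_t group_size) {
    assert(n > 0 && group_size > 1);
    while (n > 1)
        n = reduce_pass(buf, n, group_size);
    return buf[0];
}
```

With 16384 work-groups and a group size of 256, for instance, two further passes are enough (16384 to 64 to 1).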

Can anyone help me optimize this for loop using SSE?

I have a for loop which runs many times and takes a lot of time:
for (int z=0; z<temp; z++)
{
float findex= a + b * A[z];
int iindex = findex ;
outArray[z] += inArray[iindex] + (findex - iindex) * (inArray[iindex+1] - inArray[iindex]);
a++;
}
I have tried to optimize this code with SSE, but got no performance improvement! Maybe my SSE code is bad. Can anyone help me?

Try using the restrict keyword on inArray and outArray. Otherwise the compiler has to assume that inArray could be == outArray. In this case no parallelization would be possible.
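A minimal sketch of what that could look like, assuming the loop is wrapped in a function (the function name and signature are made up, since the question only shows the loop body) and that the three arrays really never overlap:

```c
#include <assert.h>

/* The loop from the question with restrict-qualified pointers (C99).
   restrict is a promise by the caller that the arrays do not alias;
   breaking that promise is undefined behaviour. */
void interp_accumulate(int n, float a, float b,
                       const float *restrict A,
                       const float *restrict inArray,
                       float *restrict outArray) {
    assert(n >= 0);
    for (int z = 0; z < n; z++) {
        float findex = a + b * A[z];
        int iindex = (int)findex;               /* truncation, as in the question */
        float frac = findex - (float)iindex;
        outArray[z] += inArray[iindex] + frac * (inArray[iindex + 1] - inArray[iindex]);
        a += 1.0f;
    }
}
```

Whether the compiler then vectorizes the loop still depends on how it handles the indexed reads from inArray, so checking the vectorization report is worthwhile.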

Your loop has a loop-carried dependency when you write to outArray[z]. Your CPU can do more than one floating-point sum at once, but your current loop only allows one sum into outArray[z] at a time. To fix this you should unroll your loop.
for (int z=0; z<temp; z+=2) {
float findex_v1 = a + b * A[z];
int iindex_v1 = findex_v1;
outArray[z] += inArray[iindex_v1] + (findex_v1 - iindex_v1) * (inArray[iindex_v1+1] - inArray[iindex_v1]);
float findex_v2 = (a+1) + b * A[z+1];
int iindex_v2 = findex_v2;
outArray[z+1] += inArray[iindex_v2] + (findex_v2 - iindex_v2) * (inArray[iindex_v2+1] - inArray[iindex_v2]);
a+=2;
}
In terms of SIMD, the problem is that you have to gather non-contiguous data when you access inArray[iindex_v1]. AVX2 has some gather instructions, but I have not tried them. Otherwise it may be best to do the gather without SIMD. All the operations indexed by z access contiguous memory, so that part is easy. Pseudo-code (without unrolling) would look something like this:
int indexa[4];
float inArraya[4];
float dinArraya[4];
float4 a4 = a + float4(0,1,2,3);
for (int z=0; z<temp; z+=4) {
//use SSE for contiguous memory
float4 findex4 = a4 + b * float4.load(&A[z]);
int4 iindex4 = truncate_to_int(findex4);
//don't use SSE for non-contiguous memory
iindex4.store(indexa);
for(int i=0; i<4; i++) {
inArraya[i] = inArray[indexa[i]];
dinArraya[i] = inArray[indexa[i]+1] - inArray[indexa[i]];
}
//loading from an array right after writing to it causes a CPU stall
float4 inArraya4 = float4.load(inArraya);
float4 dinArraya4 = float4.load(dinArraya);
//back to SSE (to_float: pseudo-code conversion from int4 to float4)
float4 outArray4 = float4.load(&outArray[z]);
outArray4 += inArraya4 + (findex4 - to_float(iindex4))*dinArraya4;
outArray4.store(&outArray[z]);
a4+=4;
}
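The same three-phase structure can be written in portable C without intrinsics, processing four elements per iteration so that the contiguous phases are easy for a compiler to vectorize while the gather stays scalar. This is a sketch with an assumed function name and signature, not a tuned SSE implementation; it also adds a scalar tail for lengths that are not a multiple of 4.

```c
#include <assert.h>

/* Interpolation loop processed four elements at a time in three phases. */
void interp_accumulate_blocked(int n, float a, float b,
                               const float *A, const float *inArray,
                               float *outArray) {
    assert(n >= 0);
    int z = 0;
    for (; z + 4 <= n; z += 4) {
        float findex[4], lo[4], delta[4];
        int iindex[4];
        /* Phase 1: contiguous loads from A, SIMD-friendly. */
        for (int i = 0; i < 4; i++) {
            findex[i] = (a + (float)i) + b * A[z + i];
            iindex[i] = (int)findex[i];
        }
        /* Phase 2: scalar gather from non-contiguous positions. */
        for (int i = 0; i < 4; i++) {
            lo[i] = inArray[iindex[i]];
            delta[i] = inArray[iindex[i] + 1] - lo[i];
        }
        /* Phase 3: contiguous read-modify-write of outArray. */
        for (int i = 0; i < 4; i++)
            outArray[z + i] += lo[i] + (findex[i] - (float)iindex[i]) * delta[i];
        a += 4.0f;
    }
    /* Scalar tail for the last n % 4 elements. */
    for (; z < n; z++) {
        float findex = a + b * A[z];
        int iindex = (int)findex;
        outArray[z] += inArray[iindex]
                     + (findex - (float)iindex) * (inArray[iindex + 1] - inArray[iindex]);
        a += 1.0f;
    }
}
```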

Reduce matrix rows with CUDA

Windows 7, NVidia GeForce 425M.
I wrote a simple CUDA code which calculates the row sums of a matrix.
The matrix has uni-dimensional representation (pointer to a float).
The serial version of code is below (it has 2 loops, as expected):
void serial_rowSum (float* m, float* output, int nrow, int ncol) {
float sum;
for (int i = 0 ; i < nrow ; i++) {
sum = 0;
for (int j = 0 ; j < ncol ; j++)
sum += m[i*ncol+j];
output[i] = sum;
}
}
Inside the CUDA code, I call the kernel function sweeping the matrix by rows. Below, the kernel call snippet:
dim3 threadsPerBlock((unsigned int) nThreadsPerBlock); // has to be multiple of 32
dim3 blocksPerGrid((unsigned int) ceil(nrow/(float) nThreadsPerBlock));
kernel_rowSum<<<blocksPerGrid, threadsPerBlock>>>(d_m, d_output, nrow, ncol);
and the kernel function which performs the parallel sum of the rows (still has 1 loop):
__global__ void kernel_rowSum(float *m, float *s, int nrow, int ncol) {
int rowIdx = threadIdx.x + blockIdx.x * blockDim.x;
if (rowIdx < nrow) {
float sum=0;
for (int k = 0 ; k < ncol ; k++)
sum+=m[rowIdx*ncol+k];
s[rowIdx] = sum;
}
}
So far so good. The serial and parallel (CUDA) results are equal.
The whole point is that the CUDA version takes almost twice the time of the serial one to compute, even if I change the nThreadsPerBlock parameter: I tested nThreadsPerBlock from 32 to 1024 (maximum number of threads per block allowed for my card).
IMO, the matrix dimension is big enough to justify parallelization: 90,000 x 1,000.
Below, I report the time elapsed for the serial and parallel versions using different nThreadsPerBlock. Time reported in msec over an average of 100 samples:
Matrix: nrow = 90000 x ncol = 1000
Serial: Average Time Elapsed per Sample in msec (100 samples): 289.18.
CUDA (32 ThreadsPerBlock): Average Time Elapsed per Sample in msec (100 samples): 497.11.
CUDA (1024 ThreadsPerBlock): Average Time Elapsed per Sample in msec (100 samples): 699.66.
Just in case, the version with 32/1024 nThreadsPerBlock is the fastest/slowest one.
I understand that there is a kind of overhead when copying from Host to Device and the other way around, but maybe the slowness is because I am not implementing the fastest code.
Since I am far from being a CUDA expert:
Am I coding the fastest version for this task? How could I improve my code?
Can I get rid of the loop in the kernel function?
Any thoughts appreciated.
EDIT 1
Although I describe a standard rowSum, I am actually interested in AND/OR operations over rows that hold {0,1} values, like rowAND/rowOR. That means I cannot exploit the cuBLAS trick of multiplying by a column vector of 1's, as suggested by some commenters.
EDIT 2
As suggested by other users and endorsed here:
FORGET ABOUT TRYING TO WRITE YOUR OWN FUNCTIONS, use Thrust library instead and the magic comes.

Since you mentioned that you need a general reduction algorithm rather than sum only, I will give 3 approaches here. The kernel approach may have the highest performance, the Thrust approach is the easiest to implement, and the cuBLAS approach works only with sum but has good performance.
Kernel Approach
Here's a very good doc introducing how to optimize standard parallel reduction. Standard reduction can be divided into 2 stages:
Multiple thread blocks each reduces one part of the data;
One thread block reduces from result of stage 1 to the final 1 element.
For your multi-reduction problem (reducing the rows of a matrix), stage 1 alone is enough. The idea is to reduce 1 row per thread block. For further considerations, like multiple rows per thread block or 1 row per multiple thread blocks, you can refer to the paper provided by @Novak. This may improve performance further, especially for matrices with an unfavorable shape.
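To make the "one row per thread block" idea concrete, here is a host-side C++ sketch (not a CUDA kernel) that emulates the pairwise tree pattern a block would run in shared memory. The names `tree_reduce_row` and `reduce_rows` are hypothetical, and the operator is left generic so that AND/OR work as well as sum; it only assumes the operator is associative and that `identity` is its neutral element.

```cpp
#include <cstddef>
#include <vector>

// Host-side emulation of a block-level tree reduction over one row.
// Op must be associative; identity must be its neutral element.
template <typename T, typename Op>
T tree_reduce_row(const T* row, std::size_t ncol, T identity, Op op) {
    // Pad to the next power of two with the identity, as a kernel would
    // when the block size is larger than the row length.
    std::size_t n = 1;
    while (n < ncol) n *= 2;
    std::vector<T> buf(n, identity);
    for (std::size_t j = 0; j < ncol; ++j) buf[j] = row[j];

    // At each step, "thread" t combines element t with element t + stride.
    for (std::size_t stride = n / 2; stride > 0; stride /= 2)
        for (std::size_t t = 0; t < stride; ++t)
            buf[t] = op(buf[t], buf[t + stride]);
    return buf[0];
}

// One "block" per row of a row-major nrow x ncol matrix.
template <typename T, typename Op>
std::vector<T> reduce_rows(const std::vector<T>& m, std::size_t nrow,
                           std::size_t ncol, T identity, Op op) {
    std::vector<T> out(nrow);
    for (std::size_t i = 0; i < nrow; ++i)
        out[i] = tree_reduce_row(m.data() + i * ncol, ncol, identity, op);
    return out;
}
```

For a rowAND over {0,1} data one would call `reduce_rows(m, nrow, ncol, 1, [](int a, int b){ return a & b; })`; for rowOR the identity becomes 0 and the operator `|`.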
Thrust Approach
General multi-reduction can be done with thrust::reduce_by_key in a few minutes. You can find some discussion in Determining the least element and its position in each matrix column with CUDA Thrust.
However, thrust::reduce_by_key does not assume that each row has the same length, so you pay a performance penalty. Another post, How to normalize matrix columns in CUDA with max performance?, gives a profiling comparison between thrust::reduce_by_key and the cuBLAS approach for the sum of rows. It may give you a basic understanding of the performance.
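As a rough illustration of what thrust::reduce_by_key computes in this setting, the following host-side C++ sketch combines consecutive elements that share a key, with the key derived on the fly as the row index `i / ncol`. The function name `reduce_by_row_key` is hypothetical; this only emulates the semantics and says nothing about how Thrust implements it.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Host-side emulation of reduce_by_key semantics: consecutive elements with
// equal keys are combined into a single output value. The key is the row
// index computed from the linear index, so no key array is stored.
// Assumes ncol > 0 and a row-major matrix.
template <typename T, typename Op>
std::vector<std::pair<std::size_t, T>>
reduce_by_row_key(const std::vector<T>& m, std::size_t ncol, Op op) {
    std::vector<std::pair<std::size_t, T>> out;
    for (std::size_t i = 0; i < m.size(); ++i) {
        std::size_t key = i / ncol;                  // linear index -> row index
        if (!out.empty() && out.back().first == key)
            out.back().second = op(out.back().second, m[i]);
        else
            out.emplace_back(key, m[i]);             // a new run of keys starts
    }
    return out;
}
```

Because runs are detected by comparing keys, nothing here relies on all rows having the same length, which is the generality the paragraph above says you pay for.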
cuBLAS Approach
The sum of rows/cols of a matrix A can be seen as a matrix-vector multiplication where the elements of the vector are all ones. It can be represented by the following MATLAB code.
y = A * ones(size(A,2),1);
where y is the sum of rows of A.
The cuBLAS library provides a high-performance matrix-vector multiplication function, cublas<t>gemv(), for this operation.
Timing results show that this routine is only 10~50% slower than simply reading all the elements of A once, which can be seen as the theoretical upper limit of the performance for this operation.

Reducing the rows of a matrix can be solved by using CUDA Thrust in three ways (they may not be the only ones, but addressing this point is out of scope). As also recognized by the same OP, using CUDA Thrust is preferable for such a kind of problem. Also, an approach using cuBLAS is possible.
APPROACH #1 - reduce_by_key
This is the approach suggested at this Thrust example page. It includes a variant using make_discard_iterator.
APPROACH #2 - transform
This is the approach suggested by Robert Crovella at CUDA Thrust: reduce_by_key on only some values in an array, based off values in a “key” array.
APPROACH #3 - inclusive_scan_by_key
This is the approach suggested by Eric at How to normalize matrix columns in CUDA with max performance?.
APPROACH #4 - cublas<t>gemv
It uses cuBLAS gemv to multiply the relevant matrix by a column of 1's.
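Approach #3 is the least obvious of the four, so here is a host-side C++ sketch of the idea: an inclusive scan that restarts at every row boundary leaves the row total in the last element of each row, and those elements sit at indices `(ncol - 1) + r * ncol`. The function names are hypothetical and the code only emulates the semantics on the CPU, assuming a row-major matrix with `ncol > 0`.

```cpp
#include <cstddef>
#include <vector>

// Inclusive scan restarted at each row boundary; this is what a scan keyed
// by the row index (i / ncol) produces.
std::vector<float> scan_by_row(const std::vector<float>& m, std::size_t ncol) {
    std::vector<float> out(m.size());
    for (std::size_t i = 0; i < m.size(); ++i)
        out[i] = (i % ncol == 0) ? m[i] : out[i - 1] + m[i];
    return out;
}

// Gather the last scanned element of every row: index (ncol - 1) + r * ncol.
std::vector<float> row_sums_from_scan(const std::vector<float>& m,
                                      std::size_t nrow, std::size_t ncol) {
    std::vector<float> scanned = scan_by_row(m, ncol);
    std::vector<float> sums(nrow);
    for (std::size_t r = 0; r < nrow; ++r)
        sums[r] = scanned[(ncol - 1) + r * ncol];
    return sums;
}
```

The gather step corresponds to the permutation iterator in the full code, which starts at `d_temp.begin() + Ncols - 1` and steps by multiples of `Ncols`.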
THE FULL CODE
Here is the code condensing the four approaches. The Utilities.cu and Utilities.cuh files are maintained here and are omitted. The TimingGPU.cu and TimingGPU.cuh files are maintained here and are omitted as well.
#include <cublas_v2.h>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <thrust/random.h>
#include <thrust/sequence.h>
#include <stdio.h>
#include <iostream>
#include "Utilities.cuh"
#include "TimingGPU.cuh"
// --- Required for approach #2
__device__ float *vals;
/**************************************************************/
/* CONVERT LINEAR INDEX TO ROW INDEX - NEEDED FOR APPROACH #1 */
/**************************************************************/
template <typename T>
struct linear_index_to_row_index : public thrust::unary_function<T,T> {
T Ncols; // --- Number of columns
__host__ __device__ linear_index_to_row_index(T Ncols) : Ncols(Ncols) {}
__host__ __device__ T operator()(T i) { return i / Ncols; }
};
/******************************************/
/* ROW_REDUCTION - NEEDED FOR APPROACH #2 */
/******************************************/
struct row_reduction {
const int Ncols; // --- Number of columns
row_reduction(int _Ncols) : Ncols(_Ncols) {}
__device__ float operator()(float& x, int& y ) {
float temp = 0.f;
for (int i = 0; i<Ncols; i++)
temp += vals[i + (y*Ncols)];
return temp;
}
};
/**************************/
/* NEEDED FOR APPROACH #3 */
/**************************/
template<typename T>
struct MulC: public thrust::unary_function<T, T>
{
T C;
__host__ __device__ MulC(T c) : C(c) { }
__host__ __device__ T operator()(T x) { return x * C; }
};
/********/
/* MAIN */
/********/
int main()
{
const int Nrows = 5; // --- Number of rows
const int Ncols = 8; // --- Number of columns
// --- Random uniform integer distribution between 10 and 99
thrust::default_random_engine rng;
thrust::uniform_int_distribution<int> dist(10, 99);
// --- Matrix allocation and initialization
thrust::device_vector<float> d_matrix(Nrows * Ncols);
for (size_t i = 0; i < d_matrix.size(); i++) d_matrix[i] = (float)dist(rng);
TimingGPU timerGPU;
/***************/
/* APPROACH #1 */
/***************/
timerGPU.StartCounter();
// --- Allocate space for row sums and indices
thrust::device_vector<float> d_row_sums(Nrows);
thrust::device_vector<int> d_row_indices(Nrows);
// --- Compute row sums by summing values with equal row indices
//thrust::reduce_by_key(thrust::make_transform_iterator(thrust::counting_iterator<int>(0), linear_index_to_row_index<int>(Ncols)),
// thrust::make_transform_iterator(thrust::counting_iterator<int>(0), linear_index_to_row_index<int>(Ncols)) + (Nrows*Ncols),
// d_matrix.begin(),
// d_row_indices.begin(),
// d_row_sums.begin(),
// thrust::equal_to<int>(),
// thrust::plus<float>());
thrust::reduce_by_key(
thrust::make_transform_iterator(thrust::make_counting_iterator(0), linear_index_to_row_index<int>(Ncols)),
thrust::make_transform_iterator(thrust::make_counting_iterator(0), linear_index_to_row_index<int>(Ncols)) + (Nrows*Ncols),
d_matrix.begin(),
thrust::make_discard_iterator(),
d_row_sums.begin());
printf("Timing for approach #1 = %f\n", timerGPU.GetCounter());
// --- Print result
for(int i = 0; i < Nrows; i++) {
std::cout << "[ ";
for(int j = 0; j < Ncols; j++)
std::cout << d_matrix[i * Ncols + j] << " ";
std::cout << "] = " << d_row_sums[i] << "\n";
}
/***************/
/* APPROACH #2 */
/***************/
timerGPU.StartCounter();
thrust::device_vector<float> d_row_sums_2(Nrows, 0);
float *s_vals = thrust::raw_pointer_cast(&d_matrix[0]);
gpuErrchk(cudaMemcpyToSymbol(vals, &s_vals, sizeof(float *)));
thrust::transform(d_row_sums_2.begin(), d_row_sums_2.end(), thrust::counting_iterator<int>(0), d_row_sums_2.begin(), row_reduction(Ncols));
printf("Timing for approach #2 = %f\n", timerGPU.GetCounter());
for(int i = 0; i < Nrows; i++) {
std::cout << "[ ";
for(int j = 0; j < Ncols; j++)
std::cout << d_matrix[i * Ncols + j] << " ";
std::cout << "] = " << d_row_sums_2[i] << "\n";
}
/***************/
/* APPROACH #3 */
/***************/
timerGPU.StartCounter();
thrust::device_vector<float> d_row_sums_3(Nrows, 0);
thrust::device_vector<float> d_temp(Nrows * Ncols);
thrust::inclusive_scan_by_key(
thrust::make_transform_iterator(thrust::make_counting_iterator(0), linear_index_to_row_index<int>(Ncols)),
thrust::make_transform_iterator(thrust::make_counting_iterator(0), linear_index_to_row_index<int>(Ncols)) + (Nrows*Ncols),
d_matrix.begin(),
d_temp.begin());
thrust::copy(
thrust::make_permutation_iterator(
d_temp.begin() + Ncols - 1,
thrust::make_transform_iterator(thrust::make_counting_iterator(0), MulC<int>(Ncols))),
thrust::make_permutation_iterator(
d_temp.begin() + Ncols - 1,
thrust::make_transform_iterator(thrust::make_counting_iterator(0), MulC<int>(Ncols))) + Nrows,
d_row_sums_3.begin());
printf("Timing for approach #3 = %f\n", timerGPU.GetCounter());
for(int i = 0; i < Nrows; i++) {
std::cout << "[ ";
for(int j = 0; j < Ncols; j++)
std::cout << d_matrix[i * Ncols + j] << " ";
std::cout << "] = " << d_row_sums_3[i] << "\n";
}
/***************/
/* APPROACH #4 */
/***************/
cublasHandle_t handle;
timerGPU.StartCounter();
cublasSafeCall(cublasCreate(&handle));
thrust::device_vector<float> d_row_sums_4(Nrows);
thrust::device_vector<float> d_ones(Ncols, 1.f);
float alpha = 1.f;
float beta = 0.f;
cublasSafeCall(cublasSgemv(handle, CUBLAS_OP_T, Ncols, Nrows, &alpha, thrust::raw_pointer_cast(d_matrix.data()), Ncols,
thrust::raw_pointer_cast(d_ones.data()), 1, &beta, thrust::raw_pointer_cast(d_row_sums_4.data()), 1));
printf("Timing for approach #4 = %f\n", timerGPU.GetCounter());
for(int i = 0; i < Nrows; i++) {
std::cout << "[ ";
for(int j = 0; j < Ncols; j++)
std::cout << d_matrix[i * Ncols + j] << " ";
std::cout << "] = " << d_row_sums_4[i] << "\n";
}
return 0;
}
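A note on the cublasSgemv call in approach #4: cuBLAS assumes column-major storage, so the row-major Nrows x Ncols buffer is seen as a column-major Ncols x Nrows matrix with leading dimension Ncols, and CUBLAS_OP_T transposes it back. The host-side C++ sketch below mirrors that indexing with alpha = 1 and beta = 0; `gemv_T_col_major` and `row_sums_row_major` are hypothetical names for illustration, not cuBLAS functions.

```cpp
#include <vector>

// Reference for the transposed gemv on column-major storage:
// A is m x n with leading dimension lda, x has length m, y has length n,
// and y[j] = sum over i of A[i + j*lda] * x[i].
std::vector<float> gemv_T_col_major(int m, int n, const std::vector<float>& A,
                                    int lda, const std::vector<float>& x) {
    std::vector<float> y(n, 0.0f);
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i)
            y[j] += A[i + j * lda] * x[i];
    return y;
}

// Row sums of a row-major nrows x ncols matrix M.
std::vector<float> row_sums_row_major(const std::vector<float>& M,
                                      int nrows, int ncols) {
    std::vector<float> ones(ncols, 1.0f);
    // The row-major nrows x ncols buffer is reinterpreted as a column-major
    // ncols x nrows matrix, hence m = ncols, n = nrows, lda = ncols.
    return gemv_T_col_major(ncols, nrows, M, ncols, ones);
}
```

Each "column" of the reinterpreted matrix is one row of the original, so the transposed product with a vector of ones yields one sum per original row.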
TIMING RESULTS (tested on a Kepler K20c)
Matrix size    #1      #1-v2   #2      #3      #4      #4 (no plan)
100 x 100      0.63    1.00    0.10    0.18    139.4   0.098
1000 x 1000    1.25    1.12    3.25    1.04    101.3   0.12
5000 x 5000    8.38    15.3    16.05   13.8    111.3   1.14
100 x 5000     1.25    1.52    2.92    1.75    101.2   0.40
5000 x 100     1.35    1.99    0.37    1.74    139.2   0.14
It seems that approaches #1 and #3 outperform approach #2, except in the cases of small numbers of columns. The best approach, however, is approach #4, which is significantly more convenient than the others, provided that the time needed to create the plan can be amortized during the computation.

If this is the extent (summing the rows) of the operations you need to do with this data, I wouldn't expect a sizable benefit from the GPU. You have exactly one arithmetic operation per data element, and for that you are paying the cost of transferring that data element to the GPU. And beyond a certain problem size (whatever it takes to keep the machine busy) you get no added benefit from larger problem sizes, because the arithmetic work is only O(n) in the number of elements: the ratio of operations to bytes moved stays constant however large the matrix gets.
So this isn't a particularly exciting problem to solve on the GPU.
But as talonmies has indicated, you have a coalescing problem in the way you have crafted it, which will further slow things down. Let's take a look at a small example:
     C1  C2  C3  C4
R1   11  12  13  14
R2   21  22  23  24
R3   31  32  33  34
R4   41  42  43  44
Above is a simple pictorial example of a small portion of your matrix. The machine data storage is such that elements (11), (12), (13), and (14) are stored in adjacent memory locations.
For coalesced access, we want an access pattern such that adjacent memory locations are requested from the same instruction, executed across the warp.
We need to think about execution of your code from the standpoint of a warp, that is 32 threads executing in lock-step. What is your code doing? Which elements is it retrieving (asking for) at each step/instruction? Let's take a look at this line of code:
sum+=m[rowIdx*ncol+k];
Adjacent threads in the warp have adjacent (i.e. consecutive) values for rowIdx as you have created that variable. So when k = 0, which data element is being asked for by each thread when we try to retrieve the value m[rowIdx*ncol+k] ?
In block 0, thread 0 has a rowIdx of 0. Thread 1 has a rowIdx of 1, etc. So the values being asked for by each thread at this instruction are:
Thread:   Memory Location:   Matrix Element:
0         m[0]               (11)
1         m[ncol]            (21)
2         m[2*ncol]          (31)
3         m[3*ncol]          (41)
But this is not coalesced access! Elements (11), (21), etc. are not adjacent in memory. For coalesced access, we would like that Matrix Element column to read like this:
Thread:   Memory Location:   Matrix Element:
0         m[?]               (11)
1         m[?]               (12)
2         m[?]               (13)
3         m[?]               (14)
If you then work backwards to determine what the value of ? should be, you will come up with an instruction something like this:
sum+=m[k*ncol+rowIdx];
This will give coalesced access, but it will not give you the correct answer, because we are now summing matrix columns instead of matrix rows. We can fix this by re-organizing your data storage to be in column-major order rather than row-major order. (You should be able to google that for ideas, right?) Conceptually, this is equivalent to transposing your matrix m. Whether this is convenient for you to do or not is outside the scope of your question, as I see it, and not really a CUDA issue. It may be a simple thing for you to do as you are creating the matrix on the host or transferring the matrix from host to device. But in summary, I don't know of a way to sum the matrix rows with 100% coalesced access, if the matrix is stored in row-major order. (You could resort to a sequence of row-reductions but that looks painful to me.)
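The difference between the two access patterns can be quantified with a small host-side C++ sketch that lists the element indices one warp requests at loop step k and counts how many memory segments they fall into. The helper names are hypothetical, and the 128-byte segment size is an assumption used for illustration; the actual transaction granularity depends on the GPU. For the column-major case the sketch uses `nrow` as the leading dimension, which is what the index becomes once a non-square matrix has been stored transposed.

```cpp
#include <cstddef>
#include <set>
#include <vector>

// Indices requested by threads 0..nthreads-1 at loop step k when each thread
// walks one row of a row-major matrix: m[rowIdx * ncol + k].
std::vector<std::size_t> indices_row_major(std::size_t nthreads,
                                           std::size_t ncol, std::size_t k) {
    std::vector<std::size_t> idx(nthreads);
    for (std::size_t t = 0; t < nthreads; ++t) idx[t] = t * ncol + k;
    return idx;
}

// Same threads, but with the matrix stored column-major (transposed), so
// that element (rowIdx, k) lives at m[k * nrow + rowIdx].
std::vector<std::size_t> indices_col_major(std::size_t nthreads,
                                           std::size_t nrow, std::size_t k) {
    std::vector<std::size_t> idx(nthreads);
    for (std::size_t t = 0; t < nthreads; ++t) idx[t] = k * nrow + t;
    return idx;
}

// Number of distinct aligned segments touched by one warp-wide load.
std::size_t segments_touched(const std::vector<std::size_t>& idx,
                             std::size_t elem_bytes,
                             std::size_t segment_bytes) {
    std::set<std::size_t> segs;
    for (std::size_t i : idx) segs.insert(i * elem_bytes / segment_bytes);
    return segs.size();
}
```

With the question's ncol = 1000 and 4-byte floats, the 32 loads of a warp in the row-major kernel land in 32 different segments, while the column-major layout puts them in a single segment at k = 0.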
It's not uncommon, when we are thinking about ways to accelerate code on the GPU, to consider re-organizing our data storage to facilitate the GPU. This is one example.
And, yes, what I'm outlining here still retains a loop in the kernel.
As an additional comment, I would suggest timing the data copy portions, and kernel (compute) portions separately. I can't tell from your question whether you are timing just the kernel or the entire (GPU) operation, including the data copies. If you time the data copies separately, you may discover that just the data copy time exceeds your CPU time. Any effort put into optimizing your CUDA code will not affect the data copy time. This might be a useful data point before you spend much time on this.