I want to implement some image-processing algorithms which are intended to run on a beagleboard. These algorithms use convolutions extensively. I'm trying to find a good C implementation for 2D convolution (probably using the Fast Fourier Transform). I also want the algorithm to be able to run on the beagleboard's DSP, because I've heard that the DSP is optimized for these kinds of operations (with its multiply-accumulate instruction).
I have no background in the field, so I think it won't be a good idea to implement the convolution myself (I probably won't do it as well as someone who understands all the math behind it). I believe a good C convolution implementation for a DSP exists somewhere, but I wasn't able to find it.
Could someone help?
EDIT: Turns out the kernel is pretty small. Its dimensions are either 2x2 or 3x3, so I guess I'm not looking for an FFT-based implementation. I was searching for convolution on the web to see its definition so I can implement it in a straightforward way (I don't really know what convolution is). All I've found is something with multiplied integrals, and I have no idea how to do it with matrices. Could somebody give me a piece of code (or pseudocode) for the 2x2 kernel case?
What are the dimensions of the image and the kernel? If the kernel is large then you can use FFT-based convolution, otherwise for small kernels just use direct convolution.
The DSP might not be the best way to do this though - just because it has a MAC instruction doesn't mean that it will be more efficient. Does the ARM CPU on the BeagleBoard have NEON SIMD? If so then that might be the way to go (and more fun too).
For a small kernel, you can do direct convolution like this:
// in, out are m x n images (integer data)
// K is the kernel size (KxK) - currently needs to be an odd number, e.g. 3
// coeffs[K][K] is a 2D array of integer coefficients
// scale is a scaling factor to normalise the filter gain
for (int i = K / 2; i < m - K / 2; ++i) // iterate through image
{
    for (int j = K / 2; j < n - K / 2; ++j)
    {
        int sum = 0; // sum will be the sum of input data * coeff terms
        for (int ii = -K / 2; ii <= K / 2; ++ii) // iterate over kernel
        {
            for (int jj = -K / 2; jj <= K / 2; ++jj)
            {
                int data = in[i + ii][j + jj];
                int coeff = coeffs[ii + K / 2][jj + K / 2];
                sum += data * coeff;
            }
        }
        out[i][j] = sum / scale; // scale sum of convolution products and store in output
    }
}
You can modify this to support even values of K - it just takes a little care with the upper/lower limits on the two inner loops.
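For the 2x2 case from the edit, a minimal sketch could look like the code below. Note that with an even K there is no centre sample, so anchoring the output at the top-left corner of the kernel window is an assumption (one common convention); shift the result by half a pixel if your application needs a different alignment. As above, in and out are m x n images, coeffs[2][2] is the kernel and scale normalises the filter gain:
// 2x2 kernel, output anchored at the top-left of each 2x2 window
for (int i = 0; i < m - 1; ++i)
{
    for (int j = 0; j < n - 1; ++j)
    {
        int sum = 0;
        for (int ii = 0; ii < 2; ++ii)
        {
            for (int jj = 0; jj < 2; ++jj)
            {
                sum += in[i + ii][j + jj] * coeffs[ii][jj];
            }
        }
        out[i][j] = sum / scale;
    }
}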
I know it might be off topic, but due to the similarity between C and JavaScript I believe it could still be helpful. PS: Inspired by Paul R's answer.
Two-dimensional (2D) convolution algorithm in JavaScript using arrays:
function newArray(size){
    var result = new Array(size);
    for (var i = 0; i < size; i++) {
        result[i] = new Array(size);
    }
    return result;
}

function convolveArrays(filter, image){
    var result = newArray(image.length - filter.length + 1);
    for (var i = 0; i < image.length; i++) {
        var imageRow = image[i];
        for (var j = 0; j < imageRow.length; j++) {
            var sum = 0;
            for (var w = 0; w < filter.length; w++) {
                if (image.length - i < filter.length) break;
                var filterRow = filter[w];
                for (var z = 0; z < filter.length; z++) {
                    if (imageRow.length - j < filterRow.length) break;
                    sum += image[w + i][z + j] * filter[w][z];
                }
            }
            if (i < result.length && j < result.length)
                result[i][j] = sum;
        }
    }
    return result;
}
You can check the full blog post at http://ec2-54-232-84-48.sa-east-1.compute.amazonaws.com/two-dimensional-convolution-algorithm-with-arrays-in-javascript/
Related
I have been fighting with a very weird bug for almost a month. Asking you guys is my last hope. I wrote a program in C that integrates the 2D Cahn–Hilliard equation using the Implicit Euler (IE) scheme in Fourier (or reciprocal) space:
h_q(t_n+1) = [h_q(t_n) + dt*N[h_q(t_n)]] / (1 - dt*L_q)
where the "hats" mean that we are in Fourier space: h_q(t_n+1) and h_q(t_n) are the FTs of h(x,y) at times t_n and t_(n+1), N[h_q] is the nonlinear operator applied to h_q, in Fourier space, and L_q is the linear one, again in Fourier space. I don't want to go too much into the details of the numerical method I am using, since I am sure that the problem is not coming from there (I tried using other schemes).
My code is actually quite simple. Here is the beginning, where basically I declare variables, allocate memory and create the plans for the FFTW routines.
# include <stdlib.h>
# include <stdio.h>
# include <time.h>
# include <math.h>
# include <fftw3.h>
# define pi M_PI
int main(){
// define lattice size and spacing
int Nx = 150; // n of points on x
int Ny = 150; // n of points on y
double dx = 0.5; // bin size on x and y
// define simulation time and time step
long int Nt = 1000; // n of time steps
double dt = 0.5; // time step size
// number of frames to plot (at denominator)
long int nframes = Nt/100;
// define the noise
double rn, drift = 0.05; // punctual drift of h(x)
srand(666); // seed the RNG
// other variables
int i, j, nt; // variables for space and time loops
// declare FFTW3 routine
fftw_plan FT_h_hft; // routine to perform fourier transform
fftw_plan FT_Nonl_Nonlft;
fftw_plan IFT_hft_h; // routine to perform inverse fourier transform
// declare and allocate memory for real variables
double *Linft = fftw_alloc_real(Nx*Ny);
double *Q2 = fftw_alloc_real(Nx*Ny);
double *qx = fftw_alloc_real(Nx);
double *qy = fftw_alloc_real(Ny);
// declare and allocate memory for complex variables
fftw_complex *dh = fftw_alloc_complex(Nx*Ny);
fftw_complex *dhft = fftw_alloc_complex(Nx*Ny);
fftw_complex *Nonl = fftw_alloc_complex(Nx*Ny);
fftw_complex *Nonlft = fftw_alloc_complex(Nx*Ny);
// create the FFTW plans
FT_h_hft = fftw_plan_dft_2d ( Nx, Ny, dh, dhft, FFTW_FORWARD, FFTW_ESTIMATE );
FT_Nonl_Nonlft = fftw_plan_dft_2d ( Nx, Ny, Nonl, Nonlft, FFTW_FORWARD, FFTW_ESTIMATE );
IFT_hft_h = fftw_plan_dft_2d ( Nx, Ny, dhft, dh, FFTW_BACKWARD, FFTW_ESTIMATE );
// open file to store the data
char acstr[160];
FILE *fp;
sprintf(acstr, "CH2d_IE_dt%.2f_dx%.3f_Nt%ld_Nx%d_Ny%d_#f%.ld.dat",dt,dx,Nt,Nx,Ny,Nt/nframes);
After this preamble, I initialise my function h(x,y) with a uniform random noise, and I also take the FT of it. I set the imaginary part of h(x,y), which is dh[i*Ny+j][1] in the code, to 0, since it is a real function. Then I calculate the wavevectors qx and qy, and with them I compute the linear operator of my equation in Fourier space, which is Linft in the code. I consider only the fourth derivative of h as the linear term (with a minus sign), so that the FT of the linear term is simply -q^4... but again, I don't want to go into the details of my integration method; the question is not about it.
// generate h(x,y) at initial time
for ( i = 0; i < Nx; i++ ) {
    for ( j = 0; j < Ny; j++ ) {
        rn = (double) rand()/RAND_MAX; // extract a random number between 0 and 1
        dh[i*Ny+j][0] = drift-2.0*drift*rn; // shift of +-drift
        dh[i*Ny+j][1] = 0.0;
    }
}
// execute plan for the first time
fftw_execute (FT_h_hft);
// calculate wavenumbers
for (i = 0; i < Nx; i++) { qx[i] = 2.0*i*pi/(Nx*dx); }
for (i = 0; i < Ny; i++) { qy[i] = 2.0*i*pi/(Ny*dx); }
for (i = 1; i < Nx/2; i++) { qx[Nx-i] = -qx[i]; }
for (i = 1; i < Ny/2; i++) { qy[Ny-i] = -qy[i]; }
// calculate the FT of the linear operator
for ( i = 0; i < Nx; i++ ) {
    for ( j = 0; j < Ny; j++ ) {
        Q2[i*Ny+j] = qx[i]*qx[i] + qy[j]*qy[j];
        Linft[i*Ny+j] = -Q2[i*Ny+j]*Q2[i*Ny+j];
    }
}
Then, finally, comes the time loop. Essentially, what I do is the following:
1. Every once in a while, I save the data to a file and print some information on the terminal. In particular, I print the highest value of the FT of the nonlinear term. I also check if h(x,y) is diverging to infinity (it shouldn't happen!).
2. Calculate h^3 in direct space (that is simply dh[i*Ny+j][0]*dh[i*Ny+j][0]*dh[i*Ny+j][0]). Again, the imaginary part is set to 0.
3. Take the FT of h^3.
4. Obtain the complete nonlinear term in reciprocal space (that is N[h_q] in the IE algorithm written above) by computing -q^2*(FT[h^3] - FT[h]). In the code, I am referring to the lines Nonlft[i*Ny+j][0] = -Q2[i*Ny+j]*(Nonlft[i*Ny+j][0] -dhft[i*Ny+j][0]) and the one below, for the imaginary part. I do this because the nonlinear term of the equation is the Laplacian of (h^3 - h): in Fourier space the Laplacian is just multiplication by -q^2, and by linearity FT[h^3 - h] = FT[h^3] - FT[h].
5. Advance in time using the IE method, transform back in direct space, and then normalise.
Here is the code:
for(nt = 0; nt < Nt; nt++) {
    if((nt % nframes)== 0) {
        printf("%.0f %%\n",((double)nt/(double)Nt)*100);
        printf("Nonlft %.15f \n",Nonlft[(Nx/2)*(Ny/2)][0]);
        // write data to file
        fp = fopen(acstr,"a");
        for ( i = 0; i < Nx; i++ ) {
            for ( j = 0; j < Ny; j++ ) {
                fprintf(fp, "%4d %4d %.6f\n", i, j, dh[i*Ny+j][0]);
            }
        }
        fclose(fp);
    }
    // check if h is going to infinity
    if (isnan(dh[1][0])!=0) {
        printf("crashed!\n");
        return 0;
    }
    // calculate nonlinear term h^3 in direct space
    for ( i = 0; i < Nx; i++ ) {
        for ( j = 0; j < Ny; j++ ) {
            Nonl[i*Ny+j][0] = dh[i*Ny+j][0]*dh[i*Ny+j][0]*dh[i*Ny+j][0];
            Nonl[i*Ny+j][1] = 0.0;
        }
    }
    // Fourier transform of nonlinear term
    fftw_execute (FT_Nonl_Nonlft);
    // second derivative in Fourier space is just multiplication by -q^2
    for ( i = 0; i < Nx; i++ ) {
        for ( j = 0; j < Ny; j++ ) {
            Nonlft[i*Ny+j][0] = -Q2[i*Ny+j]*(Nonlft[i*Ny+j][0] -dhft[i*Ny+j][0]);
            Nonlft[i*Ny+j][1] = -Q2[i*Ny+j]*(Nonlft[i*Ny+j][1] -dhft[i*Ny+j][1]);
        }
    }
    // Implicit Euler scheme in Fourier space
    for ( i = 0; i < Nx; i++ ) {
        for ( j = 0; j < Ny; j++ ) {
            dhft[i*Ny+j][0] = (dhft[i*Ny+j][0] + dt*Nonlft[i*Ny+j][0])/(1.0 - dt*Linft[i*Ny+j]);
            dhft[i*Ny+j][1] = (dhft[i*Ny+j][1] + dt*Nonlft[i*Ny+j][1])/(1.0 - dt*Linft[i*Ny+j]);
        }
    }
    // transform h back in direct space
    fftw_execute (IFT_hft_h);
    // normalize
    for ( i = 0; i < Nx; i++ ) {
        for ( j = 0; j < Ny; j++ ) {
            dh[i*Ny+j][0] = dh[i*Ny+j][0] / (double) (Nx*Ny);
            dh[i*Ny+j][1] = dh[i*Ny+j][1] / (double) (Nx*Ny);
        }
    }
}
Last part of the code: free the memory and destroy the FFTW plans.
// terminate the FFTW3 plan and free memory
fftw_destroy_plan (FT_h_hft);
fftw_destroy_plan (FT_Nonl_Nonlft);
fftw_destroy_plan (IFT_hft_h);
fftw_cleanup();
fftw_free(dh);
fftw_free(Nonl);
fftw_free(qx);
fftw_free(qy);
fftw_free(Q2);
fftw_free(Linft);
fftw_free(dhft);
fftw_free(Nonlft);
return 0;
}
If I run this code, I obtain the following output:
0 %
Nonlft 0.0000000000000000000
1 %
Nonlft -0.0000000000001353512
2 %
Nonlft -0.0000000000000115539
3 %
Nonlft 0.0000000001376379599
...
69 %
Nonlft -12.1987455309071730625
70 %
Nonlft -70.1631962517720353389
71 %
Nonlft -252.4941743351609204637
72 %
Nonlft 347.5067875825179726235
73 %
Nonlft 109.3351142318568633982
74 %
Nonlft 39933.1054502610786585137
crashed!
The code crashes before reaching the end and we can see that the Nonlinear term is diverging.
Now, the thing that doesn't make sense to me is that if I change the lines in which I calculate the FT of the Nonlinear term in the following way:
// calculate nonlinear term h^3 -h in direct space
for ( i = 0; i < Nx; i++ ) {
    for ( j = 0; j < Ny; j++ ) {
        Nonl[i*Ny+j][0] = dh[i*Ny+j][0]*dh[i*Ny+j][0]*dh[i*Ny+j][0] -dh[i*Ny+j][0];
        Nonl[i*Ny+j][1] = 0.0;
    }
}
// Fourier transform of nonlinear term
fftw_execute (FT_Nonl_Nonlft);
// second derivative in Fourier space is just multiplication by -q^2
for ( i = 0; i < Nx; i++ ) {
    for ( j = 0; j < Ny; j++ ) {
        Nonlft[i*Ny+j][0] = -Q2[i*Ny+j]* Nonlft[i*Ny+j][0];
        Nonlft[i*Ny+j][1] = -Q2[i*Ny+j]* Nonlft[i*Ny+j][1];
    }
}
Which means that I am using this definition of the nonlinear term:
N[h_q] = -q^2 * FT[h^3 - h]
instead of this one:
N[h_q] = -q^2 * (FT[h^3] - FT[h])
Then the code is perfectly stable and no divergence happens! Even for billions of time steps! Why does this happen, since the two ways of calculating Nonlft should be equivalent?
Thank you very much to anyone who will take the time to read all of this and give me some help!
EDIT: To make things even more weird, I should point out that this bug does NOT happen for the same system in 1D. In 1D both methods of calculating Nonlft are stable.
EDIT: I add a short animation of what happens to the function h(x,y) just before crashing. Also: I quickly re-wrote the code in MATLAB, which uses Fast Fourier Transform functions based on the FFTW library, and the bug is NOT happening... the mystery deepens.
I solved it!!
The problem was the calculation of the Nonl term:
Nonl[i*Ny+j][0] = dh[i*Ny+j][0]*dh[i*Ny+j][0]*dh[i*Ny+j][0];
Nonl[i*Ny+j][1] = 0.0;
That needs to be changed to:
Nonl[i*Ny+j][0] = dh[i*Ny+j][0]*dh[i*Ny+j][0]*dh[i*Ny+j][0] -3.0*dh[i*Ny+j][0]*dh[i*Ny+j][1]*dh[i*Ny+j][1];
Nonl[i*Ny+j][1] = -dh[i*Ny+j][1]*dh[i*Ny+j][1]*dh[i*Ny+j][1] +3.0*dh[i*Ny+j][0]*dh[i*Ny+j][0]*dh[i*Ny+j][1];
In other words: I need to consider dh as a complex function (even though it should be real).
Basically, because of stupid rounding errors, the IFT of the FT of a real function (in my case dh), is NOT purely real, but will have a very small imaginary part. By setting Nonl[i*Ny+j][1] = 0.0 I was completely ignoring this imaginary part.
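For anyone who wants to see this directly, a small diagnostic of this kind (my own addition, reusing the variable names from the code above and placed right after the normalisation loop) prints the largest residual imaginary part left in dh after the inverse transform:
double max_im = 0.0;
for ( i = 0; i < Nx; i++ ) {
    for ( j = 0; j < Ny; j++ ) {
        if (fabs(dh[i*Ny+j][1]) > max_im) max_im = fabs(dh[i*Ny+j][1]);
    }
}
printf("max |Im(h)| = %e\n", max_im); // small, but not exactly zero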
The issue, then, was that I was recursively summing FT(dh), which is dhft, with an object obtained from the IFT of FT(dh), which is Nonlft, while ignoring the residual imaginary parts!
Nonlft[i*Ny+j][0] = -Q2[i*Ny+j]*(Nonlft[i*Ny+j][0] -dhft[i*Ny+j][0]);
Nonlft[i*Ny+j][1] = -Q2[i*Ny+j]*(Nonlft[i*Ny+j][1] -dhft[i*Ny+j][1]);
Obviously, calculating the nonlinear term as dh^3 - dh in direct space and then doing
Nonlft[i*Ny+j][0] = -Q2[i*Ny+j]* Nonlft[i*Ny+j][0];
Nonlft[i*Ny+j][1] = -Q2[i*Ny+j]* Nonlft[i*Ny+j][1];
avoids the problem of this "mixed" sum.
Phew... such a relief! I wish I could assign the bounty to myself! :P
EDIT: I'd like to add that, before using the fftw_plan_dft_2d functions, I was using fftw_plan_dft_r2c_2d and fftw_plan_dft_c2r_2d (real-to-complex and complex-to-real), and I was seeing the same bug. However, I suppose that I couldn't have solved it if I didn't switch to fftw_plan_dft_2d, since the c2r function automatically "chops off" the residual imaginary part coming from the IFT. If this is the case and I'm not missing something, I think that this should be written somewhere on the FFTW website, to prevent users from running into problems like this. Something like "r2c and c2r transforms are not good to implement pseudospectral methods".
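For reference, the real-data plans I was using looked roughly like this (a sketch rather than my exact code; the names are mine, and note the Nx x (Ny/2+1) layout that FFTW uses for the complex half of a real-data transform):
double *h = fftw_alloc_real(Nx*Ny);
fftw_complex *hft = fftw_alloc_complex(Nx*(Ny/2+1));
fftw_plan FT_h_hft_r2c = fftw_plan_dft_r2c_2d(Nx, Ny, h, hft, FFTW_ESTIMATE);
fftw_plan IFT_hft_h_c2r = fftw_plan_dft_c2r_2d(Nx, Ny, hft, h, FFTW_ESTIMATE);
With these, the c2r transform hands you back a purely real array by construction, which is exactly why the residual imaginary part gets discarded silently.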
EDIT: I found another SO question that addresses exactly the same problem.
I'm trying to make an IIR filter. I made an FIR filter, but IIR feels more difficult than FIR.
I think IIR is similar to FIR, but it has me confused.
I think the filters are like this:
FIR : y[n] = b0*x[n] + ... + b(M-1)*x[n-M+1]
IIR : y[n] = {b0*x[n] + ... + b(M-1)*x[n-M+1]} - {a1*y[n-1] + ... + aN*y[n-N]}
In this case, how about a0? Is it just 1?
The y[n-1], ... part is the problem. I'm confused about how to compute it.
Here is my code.
for (n = 0; n < length; n++) {
    coeffa = coeffs_A;
    coeffb = coeffs_B;
    inputp = &insamp[filterLength - 1 + n];
    acc = 0;
    bcc = 0;
    for (k = 0; k < filterLength; k++) {
        bcc += (*coeffb++) * (*inputp--);
    }
    for (k = 0; k < filterLength; k++) {
        acc += (*coeffa++) * (////////);
    }
    output[n] = bcc - acc;
}
In this case, filterLength is 7 and n is 80.
The //////// part is what I want to know.
Am I thinking about this wrong?
Typically IIR filters are implemented using direct form I or direct form II topologies. Each form requires memory states. These memory states keep the histories of the outputs and inputs in them. This makes it a lot simpler to implement the IIR filter. gtkiostream implements a direct form II approach and may be a helpful reference.
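For reference, a direct form II (transposed) second-order section looks something like the sketch below. The variable names, the input[]/output[] buffers, and the assumption that a0 has been normalised to 1 are mine, not taken from gtkiostream:
// One biquad (second-order) stage, direct form II transposed.
// b0, b1, b2, a1, a2 come from your filter design; a0 is assumed to be 1.
double w1 = 0.0, w2 = 0.0; // delay states (the "memory states" mentioned above)
for (int n = 0; n < length; n++) {
    double x = input[n];
    double y = b0*x + w1;
    w1 = b1*x - a1*y + w2;
    w2 = b2*x - a2*y;
    output[n] = y;
}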
In addressing your question, I like the approach of directly estimating your filter from the inputs and outputs, using the input and output buffers as memory states. As the a coefficients always operate on outputs, use the outputs as the missing variable, something like this:
acc += (*coeffa++) * output[n-k];
I took your program and rewrote it a bit to make it an MCVE and easier to read:
#include <stdio.h>
#define flen 3
#define slen 10
int main(void)
{
    int n, k;
    double a[flen] = {1, -0.5, 0}, b[flen] = {1, 0, 0};
    double input[slen] = {0, 0, 1}; // 0s for flen-1 to avoid out of bounds
    double output[slen] = {0};
    double acc, bcc;
    for (n = flen-1; n < slen; n++) {
        acc = 0;
        bcc = 0;
        for (k = 0; k < flen; k++) {
            bcc += b[k] * input[n-k];
        }
        for (k = 0; k < flen; k++) {
            acc += a[k] * output[n-k];
        }
        output[n] = bcc - acc;
        printf("%f\t%f\t%f\t%f\n", input[n], acc, bcc, output[n]);
    }
    return 0;
}
I am not entirely sure this is correct because I do not have the means to test with different filter settings against a filter design tool.
The impulse response for the given coefficients seems to be okay, but I haven't done anything in this area for a few years now, so I am not certain. The general idea should be okay, I think.
I've been attempting to optimise one of the loops in my C code to make it use the cache more efficiently. I have a few issues. I'm not 100% sure if I'm even writing the loop-blocking code correctly, because I'm seeing no increase in speed in the run time of my program. Here is the code:
for (int k = 0; k < N; k += b) {
    for (int i = k; i < MIN(N, i+b); ++i) {
        ax[i] = 0.0f;
        ay[i] = 0.0f;
        for (int j = 0; j < N; j++) {
            dx = x[j] - x[i];
            dy = y[j] - y[i];
            r2 = dx*dx + dy*dy + eps;
            r2inv = 1.0f / sqrt(r2);
            r6inv = r2inv * r2inv * r2inv;
            s = m[j] * r6inv;
            ax[i] += s * dx;
            ay[i] += s * dy;
        }
    }
}
I also have another issue: how do I go about choosing a correct block size? I understand that you want to load in enough to fill the L1 cache.
Thanks for the help in advance.
What you are doing is rather pointless, because i goes from 0 to N-1 in your code, just in a slightly more complicated way. So you benefit exactly zero from your attempts at tiling.
What is more critical is the array y, so that is what you should be tiling (if N is large, and if the speed isn't limited by the division and square root). For every value i, you make one complete pass through the array y. You can also easily save a few floating point operations for each j, and since r6inv is symmetrical between i and j, only half the values need to be calculated.
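To illustrate, tiling the j loop might look roughly like the sketch below. This is my own untested sketch: it assumes ax/ay are cleared once before the tile loop, a block size B chosen so that the x, y and m tiles fit in L1, and a MIN macro as in the question; dx/dy are scalar temporaries as above:
// clear the accumulators once
for (int i = 0; i < N; i++) { ax[i] = 0.0f; ay[i] = 0.0f; }
for (int jj = 0; jj < N; jj += B) {          // loop over tiles of j
    int jmax = MIN(N, jj + B);
    for (int i = 0; i < N; i++) {
        float axi = ax[i], ayi = ay[i];      // keep partial sums in registers
        for (int j = jj; j < jmax; j++) {    // x[j], y[j], m[j] stay hot in L1
            float dx = x[j] - x[i];
            float dy = y[j] - y[i];
            float r2 = dx*dx + dy*dy + eps;
            float r2inv = 1.0f / sqrtf(r2);
            float r6inv = r2inv * r2inv * r2inv;
            float s = m[j] * r6inv;
            axi += s * dx;
            ayi += s * dy;
        }
        ax[i] = axi;
        ay[i] = ayi;
    }
}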
EDIT: Found a solution! Like the commenters suggested, using memset is an insanely better approach. Replace the entire for loop with
memset(lookup->n, -3, (dimensions*sizeof(signed char)));
where
long int dimensions = box1 * box2 * box3 * box4 * box5 * box6 * box7 * box8 * memvara * memvarb * memvarc * memvard * adirect * tdirect * fs * bs * outputnum;
Intro
Right now, I'm looking at a beast of a for-loop:
for (j = 0; j < box1; j++) {
  for (k = 0; k < box2; k++) {
    for (l = 0; l < box3; l++) {
      for (m = 0; m < box4; m++) {
        for (x = 0; x < box5; x++) {
          for (y = 0; y < box6; y++) {
            for (xa = 0; xa < box7; xa++) {
              for (xb = 0; xb < box8; xb++) {
                for (nb = 0; nb < memvara; nb++) {
                  for (na = 0; na < memvarb; na++) {
                    for (nx = 0; nx < memvarc; nx++) {
                      for (nx1 = 0; nx1 < memvard; nx1++) {
                        for (naa = 0; naa < adirect; naa++) {
                          for (nbb = 0; nbb < tdirect; nbb++) {
                            for (ncc = 0; ncc < fs; ncc++) {
                              for (ndd = 0; ndd < bs; ndd++) {
                                for (o = 0; o < outputnum; o++) {
                                  lookup->n[j][k][l][m][x][y][xa][xb][nb][na][nx][nx1][naa][nbb][ncc][ndd][o] = -3; // set to default value
                                }
                              }
                            }
                          }
                        }
                      }
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
The Problem
This loop is called every cycle in the main run to reset values to an initial state. Unfortunately, it is necessary for the structure of the program that this many values are kept in a single data structure.
Here's the kicker: for every 60 seconds of program run time, 57 seconds goes to this function alone.
The Question
My question is this: would hash tables be an appropriate substitute for a linear array? This array has an O(n^17) cardinality, yet hash tables have an ideal of O(1).
If so, what hash library would you recommend? This program is in C and has no native hash support.
If not, what would you recommend instead?
Can you provide some pseudo-code on how you think this should be implemented?
Notes
OpenMP was used in an attempt to parallelize this loop. Numerous implementations only resulted in slightly-to-greatly increased run time.
Memory usage is not particularly an issue -- this program is intended to be run on an insanely high-spec'd computer.
We are student researchers, thrust into a heretofore unknown world of optimization and parallelization -- please bear with us, and thank you for any help.
Hash vs Array
As comments have specified, an array should not be a problem here. Lookup into an array with a known offset is O(1).
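For a sense of why this is cheap: if the whole table lives in one contiguous block, every combination of indices collapses into a single offset computation. Here is a three-index sketch with a hypothetical flat buffer (the same idea extends to all 17 dimensions, and it is also what makes the single memset in the EDIT possible):
/* Hypothetical flattened layout: lookup->n[j][k][l] stored in one contiguous
   block 'flat' of box1*box2*box3 signed chars. */
size_t offset = ((size_t)j * box2 + k) * box3 + l;   /* O(1) address arithmetic */
flat[offset] = -3;
/* ...and the whole table can be reset in one call: */
memset(flat, -3, (size_t)box1 * box2 * box3 * sizeof(signed char));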
The Bottleneck
It seems to me that the bulk of the work here (and the reason it is slow) is the number of pointer de-references in the inner-loop.
To explain in a bit more detail, consider myData[x][y][z] in the following code:
for (int x = 0; x < someVal1; x++) {
    for (int y = 0; y < someVal2; y++) {
        for (int z = 0; z < someVal3; z++) {
            myData[x][y][z] = -3; // x and y only change in outer-loops.
        }
    }
}
To compute the location for the -3, we do a lookup and add a value - once for myData[x], then again to get to myData[x][y], and once more finally for myData[x][y][z].
Since this lookup is in the inner-most portion of the loop, we have redundant reads. myData[x] and myData[x][y] are being recomputed, even when only z's value is changing. The lookups were performed during a previous iteration, but the results weren't stored.
For your loop, there are many layers of lookups being computed each iteration, even when only the value of o is changing in that inner-loop.
An Improvement for the Bottleneck
To make one lookup, per loop iteration, per loop level, simply store intermediate lookups. Using int* as the indirection (though any type would work here), the sample code above (with myData) would become:
int **a, *b;
for (int x = 0; x < someVal1; x++) {
    a = myData[x]; // Store the lookup.
    for (int y = 0; y < someVal2; y++) {
        b = a[y]; // Indirection based on the stored lookup.
        for (int z = 0; z < someVal3; z++) {
            b[z] = -3; // This can be extrapolated as needed to deeper levels.
        }
    }
}
This is just sample code, small adjustments may be necessary to get it to compile (casts and so forth). Note that there is probably no advantage to using this approach with a 3-dimensional array. However, for a 17-dimensional large data set with simple inner-loop operations (such as assignment), this approach should help quite a bit.
Finally, I'm assuming you aren't actually just assigning the value of -3. You can use memset to accomplish that goal much more efficiently.
I need to do a calculation like A[x][y] = sum_{z=0..n-1} (B[x][y][z] + C[x][y][z]), where matrix A has dimensions [height][width] and matrices B and C have dimensions [height][width][n].
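(For reference, in plain C the intended computation is just the following; this is only a sketch of the intent in array notation, assuming float data, not the CUDA code I'm asking about:)
// Plain-C reference of the intended computation
for (int x = 0; x < height; ++x) {
    for (int y = 0; y < width; ++y) {
        A[x][y] = 0.0f;
        for (int z = 0; z < n; ++z) {
            A[x][y] += B[x][y][z] + C[x][y][z];
        }
    }
}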
Values are mapped to memory with something like:
index = 0;
for (z = 0; z < n; ++z)
    for (y = 0; y < width; ++y)
        for (x = 0; x < height; ++x) {
            matrix[index] = value;
            index++;
        }
Q1: is this Cuda kernel ok?
idx = blockIdx.x*blockDim.x + threadIdx.x;
idy = blockIdx.y*blockDim.y + threadIdx.y;
for (z = 0; z < n; z++) {
    A[idx*width+idy] += B[idx*width+idy+z*width*height] + C[idx*width+idy+z*width*height];
}
Q2: Is this faster way to do the calculation?
idx = blockIdx.x*blockDim.x + threadIdx.x;
idy = blockIdx.y*blockDim.y + threadIdx.y;
idz = blockIdx.z*blockDim.z + threadIdx.z;
int stride_x = blockDim.x * gridDim.x;
int stride_y = blockDim.y * gridDim.y;
int stride_z = blockDim.z * gridDim.z;
while ( idx < height && idy < width && idz < n ) {
    atomicAdd( &(A[idx*width+idy]), B[idx*width+idy+idz*width*height] + C[idx*width+idy+idz*width*height] );
    idx += stride_x;
    idy += stride_y;
    idz += stride_z;
}
The first kernel is OK, but the accesses to matrices B and C are not coalesced.
As for the second kernel function: you have a data race, because more than one thread can write to the A[idx*width+idy] address. You need additional synchronization, such as atomicAdd.
As for the general question:
I think experiments will show which is better. It depends on the typical matrix sizes you have. Remember that the maximum thread block size on Fermi is less than 1024, so if the matrices are large you get many thread blocks, and usually that is slower (having many thread blocks).
Real simple in ArrayFire:
array A = randu(nx,ny,nz);
array B = sum(A,2); // sum along 3rd dimension
print(B);
Q1: Test it with matrices where you know the answer
Remark: You might have problems when using very large matrices. Use a while loop with appropriate increments. Cuda by Example is as usual the reference book.
An example for implementing a nested loop can be found here: For nested loops with CUDA. There a while loop is implemented.
marina.k is right about the race condition. That would favor approach one, as atomic operations tend to slow down the code.