Why is FFT of (A+B) different from FFT(A) + FFT(B)?

Why is FFT of (A+B) different from FFT(A) + FFT(B)? - c

I have been fighting with a very weird bug for almost a month. Asking you guys is my last hope. I wrote a program in C that integrates the 2d Cahn–Hilliard equation using the Implicit Euler (IE) scheme in Fourier (or reciprocal) space:
Where the "hats" mean that we are in Fourier space: h_q(t_n+1) and h_q(t_n) are the FTs of h(x,y) at times t_n and t_(n+1), N[h_q] is the nonlinear operator applied to h_q, in Fourier space, and L_q is the linear one, again in Fourier space. I don't want to go too much into the details of the numerical method I am using, since I am sure that the problem is not coming from there (I tried using other schemes).
My code is actually quite simple. Here is the beginning, where basically I declare variables, allocate memory and create the plans for the FFTW routines.
# include <stdlib.h>
# include <stdio.h>
# include <time.h>
# include <math.h>
# include <fftw3.h>
# define pi M_PI
int main(){
// define lattice size and spacing
int Nx = 150; // n of points on x
int Ny = 150; // n of points on y
double dx = 0.5; // bin size on x and y
// define simulation time and time step
long int Nt = 1000; // n of time steps
double dt = 0.5; // time step size
// number of frames to plot (at denominator)
long int nframes = Nt/100;
// define the noise
double rn, drift = 0.05; // punctual drift of h(x)
srand(666); // seed the RNG
// other variables
int i, j, nt; // variables for space and time loops
// declare FFTW3 routine
fftw_plan FT_h_hft; // routine to perform fourier transform
fftw_plan FT_Nonl_Nonlft;
fftw_plan IFT_hft_h; // routine to perform inverse fourier transform
// declare and allocate memory for real variables
double *Linft = fftw_alloc_real(Nx*Ny);
double *Q2 = fftw_alloc_real(Nx*Ny);
double *qx = fftw_alloc_real(Nx);
double *qy = fftw_alloc_real(Ny);
// declare and allocate memory for complex variables
fftw_complex *dh = fftw_alloc_complex(Nx*Ny);
fftw_complex *dhft = fftw_alloc_complex(Nx*Ny);
fftw_complex *Nonl = fftw_alloc_complex(Nx*Ny);
fftw_complex *Nonlft = fftw_alloc_complex(Nx*Ny);
// create the FFTW plans
FT_h_hft = fftw_plan_dft_2d ( Nx, Ny, dh, dhft, FFTW_FORWARD, FFTW_ESTIMATE );
FT_Nonl_Nonlft = fftw_plan_dft_2d ( Nx, Ny, Nonl, Nonlft, FFTW_FORWARD, FFTW_ESTIMATE );
IFT_hft_h = fftw_plan_dft_2d ( Nx, Ny, dhft, dh, FFTW_BACKWARD, FFTW_ESTIMATE );
// open file to store the data
char acstr[160];
FILE *fp;
sprintf(acstr, "CH2d_IE_dt%.2f_dx%.3f_Nt%ld_Nx%d_Ny%d_#f%.ld.dat",dt,dx,Nt,Nx,Ny,Nt/nframes);
After this preamble, I initialise my function h(x,y) with a uniform random noise, and I also take the FT of it. I set the imaginary part of h(x,y), which is dh[i*Ny+j][1] in the code, to 0, since it is a real function. Then I calculate the wavevectors qx and qy, and with them, I compute the linear operator of my equation in Fourier space, which is Linft in the code. I consider only the - fourth derivative of h as the linear term, so that the FT of the linear term is simply -q^4... but again, I don't want to go into the details of my integration method. The question is not about it.
// generate h(x,y) at initial time
for ( i = 0; i < Nx; i++ ) {
for ( j = 0; j < Ny; j++ ) {
rn = (double) rand()/RAND_MAX; // extract a random number between 0 and 1
dh[i*Ny+j][0] = drift-2.0*drift*rn; // shift of +-drift
dh[i*Ny+j][1] = 0.0;
}
}
// execute plan for the first time
fftw_execute (FT_h_hft);
// calculate wavenumbers
for (i = 0; i < Nx; i++) { qx[i] = 2.0*i*pi/(Nx*dx); }
for (i = 0; i < Ny; i++) { qy[i] = 2.0*i*pi/(Ny*dx); }
for (i = 1; i < Nx/2; i++) { qx[Nx-i] = -qx[i]; }
for (i = 1; i < Ny/2; i++) { qy[Ny-i] = -qy[i]; }
// calculate the FT of the linear operator
for ( i = 0; i < Nx; i++ ) {
for ( j = 0; j < Ny; j++ ) {
Q2[i*Ny+j] = qx[i]*qx[i] + qy[j]*qy[j];
Linft[i*Ny+j] = -Q2[i*Ny+j]*Q2[i*Ny+j];
}
}
Then, finally, it comes the time loop. Essentially, what I do is the following:
Every once in a while, I save the data to a file and print some information on the terminal. In particular, I print the highest value of the FT of the Nonlinear term. I also check if h(x,y) is diverging to infinity (it shouldn't happen!),
Calculate h^3 in direct space (that is simply dh[i*Ny+j][0]*dh[i*Ny+j][0]*dh[i*Ny+j][0]). Again, the imaginary part is set to 0,
Take the FT of h^3,
Obtain the complete Nonlinear term in reciprocal space (that is N[h_q] in the IE algorithm written above) by computing -q^2*(FT[h^3] - FT[h]). In the code, I am referring to the lines Nonlft[i*Ny+j][0] = -Q2[i*Ny+j]*(Nonlft[i*Ny+j][0] -dhft[i*Ny+j][0]) and the one below, for the imaginary part. I do this because:
Advance in time using the IE method, transform back in direct space, and then normalise.
Here is the code:
for(nt = 0; nt < Nt; nt++) {
if((nt % nframes)== 0) {
printf("%.0f %%\n",((double)nt/(double)Nt)*100);
printf("Nonlft %.15f \n",Nonlft[(Nx/2)*(Ny/2)][0]);
// write data to file
fp = fopen(acstr,"a");
for ( i = 0; i < Nx; i++ ) {
for ( j = 0; j < Ny; j++ ) {
fprintf(fp, "%4d %4d %.6f\n", i, j, dh[i*Ny+j][0]);
}
}
fclose(fp);
}
// check if h is going to infinity
if (isnan(dh[1][0])!=0) {
printf("crashed!\n");
return 0;
}
// calculate nonlinear term h^3 in direct space
for ( i = 0; i < Nx; i++ ) {
for ( j = 0; j < Ny; j++ ) {
Nonl[i*Ny+j][0] = dh[i*Ny+j][0]*dh[i*Ny+j][0]*dh[i*Ny+j][0];
Nonl[i*Ny+j][1] = 0.0;
}
}
// Fourier transform of nonlinear term
fftw_execute (FT_Nonl_Nonlft);
// second derivative in Fourier space is just multiplication by -q^2
for ( i = 0; i < Nx; i++ ) {
for ( j = 0; j < Ny; j++ ) {
Nonlft[i*Ny+j][0] = -Q2[i*Ny+j]*(Nonlft[i*Ny+j][0] -dhft[i*Ny+j][0]);
Nonlft[i*Ny+j][1] = -Q2[i*Ny+j]*(Nonlft[i*Ny+j][1] -dhft[i*Ny+j][1]);
}
}
// Implicit Euler scheme in Fourier space
for ( i = 0; i < Nx; i++ ) {
for ( j = 0; j < Ny; j++ ) {
dhft[i*Ny+j][0] = (dhft[i*Ny+j][0] + dt*Nonlft[i*Ny+j][0])/(1.0 - dt*Linft[i*Ny+j]);
dhft[i*Ny+j][1] = (dhft[i*Ny+j][1] + dt*Nonlft[i*Ny+j][1])/(1.0 - dt*Linft[i*Ny+j]);
}
}
// transform h back in direct space
fftw_execute (IFT_hft_h);
// normalize
for ( i = 0; i < Nx; i++ ) {
for ( j = 0; j < Ny; j++ ) {
dh[i*Ny+j][0] = dh[i*Ny+j][0] / (double) (Nx*Ny);
dh[i*Ny+j][1] = dh[i*Ny+j][1] / (double) (Nx*Ny);
}
}
}
Last part of the code: empty the memory and destroy FFTW plans.
// terminate the FFTW3 plan and free memory
fftw_destroy_plan (FT_h_hft);
fftw_destroy_plan (FT_Nonl_Nonlft);
fftw_destroy_plan (IFT_hft_h);
fftw_cleanup();
fftw_free(dh);
fftw_free(Nonl);
fftw_free(qx);
fftw_free(qy);
fftw_free(Q2);
fftw_free(Linft);
fftw_free(dhft);
fftw_free(Nonlft);
return 0;
}
If I run this code, I obtain the following output:
0 %
Nonlft 0.0000000000000000000
1 %
Nonlft -0.0000000000001353512
2 %
Nonlft -0.0000000000000115539
3 %
Nonlft 0.0000000001376379599
...
69 %
Nonlft -12.1987455309071730625
70 %
Nonlft -70.1631962517720353389
71 %
Nonlft -252.4941743351609204637
72 %
Nonlft 347.5067875825179726235
73 %
Nonlft 109.3351142318568633982
74 %
Nonlft 39933.1054502610786585137
crashed!
The code crashes before reaching the end and we can see that the Nonlinear term is diverging.
Now, the thing that doesn't make sense to me is that if I change the lines in which I calculate the FT of the Nonlinear term in the following way:
// calculate nonlinear term h^3 -h in direct space
for ( i = 0; i < Nx; i++ ) {
for ( j = 0; j < Ny; j++ ) {
Nonl[i*Ny+j][0] = dh[i*Ny+j][0]*dh[i*Ny+j][0]*dh[i*Ny+j][0] -dh[i*Ny+j][0];
Nonl[i*Ny+j][1] = 0.0;
}
}
// Fourier transform of nonlinear term
fftw_execute (FT_Nonl_Nonlft);
// second derivative in Fourier space is just multiplication by -q^2
for ( i = 0; i < Nx; i++ ) {
for ( j = 0; j < Ny; j++ ) {
Nonlft[i*Ny+j][0] = -Q2[i*Ny+j]* Nonlft[i*Ny+j][0];
Nonlft[i*Ny+j][1] = -Q2[i*Ny+j]* Nonlft[i*Ny+j][1];
}
}
Which means that I am using this definition:
instead of this one:
Then the code is perfectly stable and no divergence happens! Even for billions of time steps! Why does this happen, since the two ways of calculating Nonlft should be equivalent?
Thank you very much to anyone who will take the time to read all of this and give me some help!
EDIT: To make things even more weird, I should point out that this bug does NOT happen for the same system in 1D. In 1D both methods of calculating Nonlft are stable.
EDIT: I add a short animation of what happens to the function h(x,y) just before crashing. Also: I quickly re-wrote the code in MATLAB, which uses Fast Fourier Transform functions based on the FFTW library, and the bug is NOT happening... the mystery deepens.

I solved it!!
The problem was the calculation of the Nonl term:
Nonl[i*Ny+j][0] = dh[i*Ny+j][0]*dh[i*Ny+j][0]*dh[i*Ny+j][0];
Nonl[i*Ny+j][1] = 0.0;
That needs to be changed to:
Nonl[i*Ny+j][0] = dh[i*Ny+j][0]*dh[i*Ny+j][0]*dh[i*Ny+j][0] -3.0*dh[i*Ny+j][0]*dh[i*Ny+j][1]*dh[i*Ny+j][1];
Nonl[i*Ny+j][1] = -dh[i*Ny+j][1]*dh[i*Ny+j][1]*dh[i*Ny+j][1] +3.0*dh[i*Ny+j][0]*dh[i*Ny+j][0]*dh[i*Ny+j][1];
In other words: I need to consider dh as a complex function (even though it should be real).
Basically, because of stupid rounding errors, the IFT of the FT of a real function (in my case dh), is NOT purely real, but will have a very small imaginary part. By setting Nonl[i*Ny+j][1] = 0.0 I was completely ignoring this imaginary part.
The issue, then, was that I was recursively summing FT(dh), dhft, and an object obtained using the IFT(FT(dh)), this is Nonlft, but ignoring the residual imaginary parts!
Nonlft[i*Ny+j][0] = -Q2[i*Ny+j]*(Nonlft[i*Ny+j][0] -dhft[i*Ny+j][0]);
Nonlft[i*Ny+j][1] = -Q2[i*Ny+j]*(Nonlft[i*Ny+j][1] -dhft[i*Ny+j][1]);
Obviously, calculating Nonlft as dh^3 -dh and then doing
Nonlft[i*Ny+j][0] = -Q2[i*Ny+j]* Nonlft[i*Ny+j][0];
Nonlft[i*Ny+j][1] = -Q2[i*Ny+j]* Nonlft[i*Ny+j][1];
Avoided the problem of doing this "mixed" sum.
Phew... such a relief! I wish I could assign the bounty to myself! :P
EDIT: I'd like to add that, before using the fftw_plan_dft_2d functions, I was using fftw_plan_dft_r2c_2d and fftw_plan_dft_c2r_2d (real-to-complex and complex-to-real), and I was seeing the same bug. However, I suppose that I couldn't have solved it if I didn't switch to fftw_plan_dft_2d, since the c2r function automatically "chops off" the residual imaginary part coming from the IFT. If this is the case and I'm not missing something, I think that this should be written somewhere on the FFTW website, to prevent users from running into problems like this. Something like "r2c and c2r transforms are not good to implement pseudospectral methods".
EDIT: I found another SO question that addresses exactly the same problem.

Related

(Computational fluid dynamics) Problem with arrays. Why is my C program outputting -1.#IND00

I have a problem with a program i'm writing in C that solves the 1D linear convection equation. Basically i have initialized an two arrays. The first array (u0_array) is an array of ones with the arrays elements set equal to two over the interval or 0.5 < x < 1. The second array (usol_array) serves as a temporary array that the result or solution will be stored in.
The problem i am running into is the nested for loop at the end of my code. This loop applies the update equation required to calculate the next point, to each element in the array. When i run my script and try to print the results the output i get is just -1.IND00 for each iteration of the loop. (I am following Matlab style pseudo code which i also have attached below) I'm very new with C so my inexperience shows. I don't know why this is happening. If anyone could suggest a possible fix to this i would be very grateful. I have attached my code so far below with a few comments so you can follow my thought process. I have also attached the Matlab style pseudo code i'm followed.
# include <math.h>
# include <stdlib.h>
# include <stdio.h>
# include <time.h>
int main ( )
{
//eqyation to be solved 1D linear convection -- du/dt + c(du/dx) = 0
//initial conditions u(x,0) = u0(x)
//after discretisation using forward difference time and backward differemce space
//update equation becomes u[i] = u0[i] - c * dt/dx * (u0[i] - u[i - 1]);
int nx = 41; //num of grid points
int dx = 2 / (nx - 1); //magnitude of the spacing between grid points
int nt = 25;//nt is the number of timesteps
double dt = 0.25; //the amount of time each timestep covers
int c = 1; //assume wavespeed
//set up our initial conditions. The initial velocity u_0
//is 2 across the interval of 0.5 <x < 1 and u_0 = 1 everywhere else.
//we will define an array of ones
double* u0_array = (double*)calloc(nx, sizeof(double));
for (int i = 0; i < nx; i++)
{
u0_array[i] = 1;
}
// set u = 2 between 0.5 and 1 as per initial conditions
//note 0.5/dx = 10, 1/dx+1 = 21
for (int i = 10; i < 21; i++)
{
u0_array[i] = 2;
//printf("%f, ", u0_array[i]);
}
//make a temporary array that allows us to store the solution
double* usol_array = (double*)calloc(nx, sizeof(double));
//apply numerical scheme forward difference in
//time an backward difference in space
for (int i = 0; i < nt; i++)
{
//first loop iterates through each time step
usol_array[i] = u0_array[i];
//printf("%f", usol_array[i]);
//MY CODE WORKS FINE AS I WANT UP TO THIS LOOP
//second array iterates through each grid point for that time step and applies
//the update equation
for (int j = 1; j < nx - 1; j++)
{
u0_array[j] = usol_array[j] - c * dt/dx * (usol_array[j] - usol_array[j - 1]);
printf("%f, ", u0_array[j]);
}
}
return EXIT_SUCCESS;
}
For reference also the pseudo code i am following is attached below
1D linear convection pseudo Code (Matlab Style)

Instead of integer division, use FP math.
This avoid the later division by 0 and -1.#IND00
// int dx = 2 / (nx - 1); quotient is 0.
double dx = 2.0 / (nx - 1);
OP's code does not match comment
// u[i] = u0[i] - c * dt/dx * (u0[i] - u[i - 1]
u0_array[j] = usol_array[j] - c * dt/dx * (usol_array[j] - usol_array[j - 1]);
Easier to see if the redundant _array removed.
// v correct?
// u[i] = u0[i] - c * dt/dx * (u0[i] - u[i - 1]
u0[j] = usol[j] - c * dt/dx * (usol[j] - usol[j - 1]);
// ^^^^ Correct?

What you probably want is in matlab terms
for i = 1:nt
usol = u0
u0(2:nx) = usol(2:nx) - c*dt/dx*(usol(2:nx)-usol(1:nx-1))
end%for
This means that you have 2 inner loops over the space dimension, one for each of the two vector operations. These two separate loops have to be explicit in the C code
//apply numerical scheme forward difference in
//time an backward difference in space
for (int i = 0; i < nt; i++) {
//first loop iterates through each time step
for (int j = 0; j < nx; j++) {
usol_array[j] = u0_array[j];
//printf("%f", usol_array[i]);
}
//second loop iterates through each grid point for that time step
//and applies the update equation
for (int j = 1; j < nx; j++)
{
u0_array[j] = usol_array[j] - c * dt/dx * (usol_array[j] - usol_array[j - 1]);
printf("%f, ", u0_array[j]);
}
}

How to integrate my calculation C code with OpenMP

I am using a dual channels DAQ card with data stream mode. I wrote some code for analysis/calculation and put them to the main code for operation. However, the FIFO overflow warning sign always occur once its total data reach around 6000 MSamples (the DAQ on board memory is 8GB). I am well-noticed that a complicated calculation might retard the system and cause the overflow but all of the works I wrote are necessary to my experiment which means cannot be replaced (or there is more effective code can let me get the same result). I have heard that the OpenMP might be a solution to boost up the speed, but I am just a beginner in C, how could I implement to my calculation code?
My computer has 64GB RAM and Intel Core i7 processor. I always turn off other unnecessary software when running the data stream code. The code has been optimize as possible as I can, like simplify the hilbert() and use memcpy to pick out a specific range of data points.
This is how I process the data:
1.Install the FFTW source code for the Hilbert transform.
2.For loop to de-interleave pi16Buffer data to ch2Buffer
3.memcpy to get a certain range of data that I am interested put them to another array called ch2newBuffer
4.Do the hilbert() on ch2newBuffer and calculate its absolute number.
5.Find the max value of ch1 and abs(hilbert(ch2newBuffer)).
6.Calculate max(abs(hilbert(ch2))) / max(ch1).
Here is a part of the my DAQ code which in charge to calculation:
void hilbert(const int16* in, fftw_complex* out, fftw_plan plan_forward, fftw_plan plan_backward)
{
// copy the data to the complex array
for (int i = 0; i < N; ++i) {
out[i][REAL] = in[i];
out[i][IMAG] = 0;
}
// creat a DFT plan and execute it
//fftw_plan plan = fftw_plan_dft_1d(N, out, out, FFTW_FORWARD, FFTW_ESTIMATE);
fftw_execute(plan_forward);
// destroy a plan to prevent memory leak
//fftw_destroy_plan(plan_forward);
int hN = N>>1; // half of the length (N/2)
int numRem = hN; // the number of remaining elements
// multiply the appropriate value by 2
//(those should multiplied by 1 are left intact because they wouldn't change)
for (int i = 1; i < hN; ++i) {
out[i][REAL] *= 2;
out[i][IMAG] *= 2;
}
// if the length is even, the number of the remaining elements decrease by 1
if (N % 2 == 0)
numRem--;
else if (N > 1) {
out[hN][REAL] *= 2;
out[hN][IMAG] *= 2;
}
// set the remaining value to 0
// (multiplying by 0 gives 0, so we don't care about the multiplicands)
memset(&out[hN + 1][REAL], 0, numRem * sizeof(fftw_complex));
// creat a IDFT plan and execute it
//plan = fftw_plan_dft_1d(N, out, out, FFTW_BACKWARD, FFTW_ESTIMATE);
fftw_execute(plan_backward);
// do some cleaning
//fftw_destroy_plan(plan_backward);
//fftw_cleanup();
// scale the IDFT output
//for (int i = 0; i < N; ++i) {
//out[i][REAL] /= N;
//out[i][IMAG] /= N;
//}
}
float SumBufferData(void* pBuffer, uInt32 u32Size, uInt32 u32SampleBits)
{
// In this routine we sum up all the samples in the buffer. This function
// should be replaced with the user's analysys function
if ( 8 == u32SampleBits )
{
pu8Buffer = (uInt8 *)pBuffer;
for (i = 0; i < u32Size; i++)
{
i64Sum += pu8Buffer[i];
}
}
else
{
pi16Buffer = (int16 *)pBuffer;
fftw_complex(hilbertedch2[N]);
fftw_plan plan_forward = fftw_plan_dft_1d(N, hilbertedch2, hilbertedch2, FFTW_FORWARD, FFTW_ESTIMATE);
fftw_plan plan_backward = fftw_plan_dft_1d(N, hilbertedch2, hilbertedch2, FFTW_BACKWARD, FFTW_ESTIMATE);
ch2Buffer = (int16*)calloc(u32Size / 2, sizeof * ch2Buffer);
ch2newBuffer= (int16*)calloc(u32Size/2, sizeof* ch2newBuffer);
// De-interleave the data from pi16Buffer
for (i = 0; i < u32Size/2 ; i++)
{
ch2Buffer[i] = pi16Buffer[i*2+1];
}
// Pick out the data points range that we are interested
memcpy(ch2newBuffer, &ch2Buffer[6944], 1024 * sizeof(ch2Buffer[0]));
// Do the hilbert transform to these data points
hilbert(ch2newBuffer, hilbertedch2, plan_forward, plan_backward);
fftw_destroy_plan(plan_forward);
fftw_destroy_plan(plan_backward);
//Find max value in each segs of ch1 and ch2
for (i = 128; i < 200 ; i++)
{
if (pi16Buffer[i*2] > max1)
max1 = pi16Buffer[i*2];
}
for (i = 0; i < 1024; i++)
{
if (fabs(hilbertedch2[i][IMAG]) > max2)
max2 = fabs(hilbertedch2[i][IMAG]);
}
Corrected = max2 / max1 / N; // Calculate the signal correction
}
free(ch2Buffer);
free(ch2newBuffer);
return Corrected;
}

Loop are typically a good start for parallelism, for instance:
#pragma omp parallel for
for (int i = 0; i < N; ++i) {
out[i][REAL] = in[i];
out[i][IMAG] = 0;
}
or
#pragma omp parallel for reduction(max:max2)
for (i = 0; i < 1024; i++)
{
float tmp = fabs(hilbertedch2[i][IMAG]);
max2 = (max2 > tmp) ? max2 : tmp.
}
That being said, you need to profile your code find out where the execution takes the most time and try to parallelized if possible. However, looking at what you have posted, I do not see a lot of parallelism opportunity there.

Slow OMP vs serial

I'm trying to optimize a C subroutine called from R that takes up ~60% of the computation time for a problem I'm trying to solve. This is down from 86% when coded purely in R. The vast majority of the execution time in my C code is taking place in a nested for loop and so this seems an obvious candidate to try and parallelize using OpenMP. I've tried doing so with variable results – at best the elapsed time is fractionally worse than not using OMP, at worst the performance scaled inversely to the number of threads. The code for the fastest version is below:
#include <R.h>
#include <Rmath.h>
#include <omp.h>
void gradNegLogLik_c(double *param, double *delta, double *X, double *M, int *nBeta, int *nEpsilon, int *nObs, double *gradient){
// ========================================================================================
// param: double[nBeta + nEpsilon] values of parameters at which to evaluate gradient
// delta: double[nObs] satellite - buoy differences
// X: double[nObs * (nBeta + nEpsilon)] design matrix for mean components (i.e. beta terms)
// M: double[nObs * (nBeta + nEpsilon)] design matrix for variance components (i.e. epsilon terms)
// nBeta: int number of mean terms
// nEpsilon: int number of variance terms
// nObs: int number of observations
// gradient: double[nBeta + nEpsilon] output array of gradients
// ========================================================================================
// ========================================================================================
// local variables
size_t i, j, ind;
size_t nterms = *nBeta + *nEpsilon;
size_t nbeta = *nBeta;
size_t nepsilon = *nEpsilon;
size_t nobs = *nObs;
// allocate local memory and set to zero
double *sigma2 = calloc( nobs , sizeof(double) );
double *fittedValues = calloc( nobs , sizeof(double) );
double *residuals = calloc( nobs , sizeof(double) );
double *beta = calloc( nbeta , sizeof(double) );
double *epsilon2 = calloc( nepsilon , sizeof(double) );
double *residuals2 = calloc( nobs , sizeof(double) );
double gradBeta, gradEpsilon;
// extract beta and epsilon terms from param
// =========================================
for(i = 0 ; i < nbeta ; i++){
beta[i] = param[ i ];
epsilon2[i] = param[ nbeta + i ];
}
// Initialise gradient to zero for return value
// =========================================
for( i = 0 ; i < nterms ; i++){
gradient[i] = 0;
}
// calculate sigma, fitted values and residuals
// ============================================
for( i = 0 ; i < nbeta ; i++){
for( j = 0 ; j < nobs ; j++){
ind = i * nobs + j;
sigma2[j] += M[ind] * epsilon2[i];
fittedValues[j] += X[ind] * beta[i];
}
}
for( j = 0 ; j < nobs ; j++){
// calculate reciprocal as this is what we actually use and
// we only want to do it once.
sigma2[j] = 1 / sigma2[j];
residuals[j] = delta[j] - fittedValues[j];
residuals2[j] = residuals[j]*residuals[j];
}
// Loop over all observations and calculate value of (negative) derivative
// =======================================================================
#pragma omp parallel for private(i, j, ind, gradBeta, gradEpsilon)\
shared(gradient, nbeta, nobs, X, M, sigma2, fittedValues, delta, residuals2) \
default(none)
for( i = 0 ; i < nbeta ; i++){
gradBeta = 0.0;
gradEpsilon = 0.0;
for(j = 0 ; j < nobs ; j++){
ind = i * nobs + j;
gradBeta -= -1.0*X[ind] * sigma2[j]*(fittedValues[j] - delta[j]);
gradEpsilon -= 0.5*M[ind] * sigma2[j]*(residuals2[j] * sigma2[j] - 1);
}
gradient[i] = gradBeta;
gradient[nbeta + i] = gradEpsilon;
}
// End of function
// free local memory
free(sigma2);
free(fittedValues);
free(residuals);
free(beta);
free(epsilon2);
free(residuals2);
}
nObs is order 10000.
nBeta is in the range 20 – several hundred.
nEpsilon = nBeta and is not currently used.
After searching through this site and an afternoon googling and trying different things I don't seem to be able to make any further improvement. My first thoughts were false sharing – I've tried various things such as unrolling the outer loop to set 8 elements of gradient[] at a time to creating a temporary padded array to store the results in. I've also tried different combinations of shared, private and firstprivate. None of this appears to improve things and my fastest execution time is marginally worse in parallel than in serial. This leads to two questions before I spend any more time on this:
Is my problem (repeating ~9000 of the same set of calculations 20 - 900 times) too small to make it worthwhile using OMP?
Is there something I'm missing or doing wrong?
I suspect it's the latter as I'm relatively inexperienced when using C and OMP. Any help / thoughts would be appreciated.
(For info, I'm running on SLED11 server with 16 cores and 192GB of memory and using GCC 4.7.2 to compile my C code). Other users are using the server but the relative performance of OMP vs serial code seems independent of the other users.
Thanks in advance,
Dave.
EDIT: For info the compile command I've used is
gcc -I/RHOME/R/3.0.1/lib64/R/include -DNDEBUG -I/usr/local/include -fpic \
-std=c99 -Wall -pedantic –O3 -fopenmp -c src/gradNegLogLik_call.c \
-o src/gradNegLogLik_call.o
Most of the flags are set by the R CMD SHLIB command - I've added the -O3 -fopenmp manually.

It may be useful to give some context to my question above before giving my answer to what I've done to speed up my code (although this has been achieved without using OMP).
My original C function was written to calculate the gradient of a log likelihood function to be used with the R optim() command and the L-BFGS-B method. For each call of optim my log likelihood and gradient functions are each called ~100 times as optim finds the best solution. As a result, these two functions take up the bulk of my execution time, as expected and reported by Rprof, and so were the two targets for converting to C to improve the efficiency of my code.
Converting my two functions to C and optimizing that code has resulted in my calls to optim reducing from an average elapsed time of 1.88s per call to 0.25s per call. This has reduced my processing time from ~1 month to a few days. The change that had the biggest impact (beside calling C) was changing the ordering of the nested loops. The original order was chosen due to the way R stores matrices and chosen to avoid having to transpose my matrices for each call of my C functions. Recognizing that the transpose only needs to be done once for each call to optim(), and not each C call as I had originally coded, this is a small overhead to pay compared to the impact / benefit of changing the order in the C functions.
Given this increase in speed it's had to justify spending any more time on this. The final version of my gradient function (as per my original post) is given below.
Note that whilst I've changed from using .C to .Call in R (hence the change to the function arguments etc) this in itself doesn’t account for the speed increase.
#include <R.h>
#include <Rmath.h>
#include <Rinternals.h>
#include <omp.h>
SEXP gradNegLogLik_call(SEXP param ,SEXP delta, SEXP X, SEXP M, SEXP nBeta, SEXP nEpsilon){
// local variables
double *par, *d;
double *sigma2, *fittedValues, *residuals, *grad, *Xuse, *Muse;
double val, sig2, gradBeta, gradEpsilon;
int n, m, ind, nterms, i, j;
SEXP gradient;
// get / associate parameters with local pointer
par = REAL(param);
Xuse = REAL(X);
Muse = REAL(M);
d = REAL(delta);
n = LENGTH(delta);
m = INTEGER(nBeta)[0];
nterms = m + m;
// allocate memory
PROTECT( gradient = allocVector(REALSXP, nterms ));
// set pointer to real portion of gradient
grad = REAL(gradient);
// set all gradient terms to zero
for(i = 0 ; i < nterms ; i++){
grad[i] = 0.0;
}
sigma2 = Calloc(n, double );
fittedValues = Calloc(n, double );
residuals = Calloc(n, double );
// calculate sigma, fitted values and residuals
for(i = 0 ; i < n ; i++){
val = 0.0;
sig2 = 0.0;
for(j = 0 ; j < m ; j++){
ind = i*m + j;
val += Xuse[ind]*par[j];
sig2 += Muse[ind]*par[j+m];
}
// calculate reciprocal of sigma as this is what we actually use
// and we only want to do it once
sigma2[i] = 1.0 / sig2;
fittedValues[i] = val;
residuals[i] = d[i] - val;
}
// now loop over each paramter and calculate derivative
for(i = 0 ; i < n ; i++){
gradBeta = -1.0*sigma2[i]*(fittedValues[i] - d[i]);
gradEpsilon = 0.5*sigma2[i]*(residuals[i]*residuals[i]*sigma2[i] - 1);
for(j = 0 ; j < m ; j++){
ind = i*m + j;
grad[j] -= Xuse[ind]*gradBeta;
grad[j+m] -= Muse[ind]*gradEpsilon;
}
}
UNPROTECT(1);
Free(sigma2);
Free(residuals);
Free(fittedValues);
// return array of gradients
return gradient;
}

LU Decomposition from Numerical Recipes not working; what am I doing wrong?

I've literally copied and pasted from the supplied source code for Numerical Recipes for C for in-place LU Matrix Decomposition, problem is its not working.
I'm sure I'm doing something stupid but would appreciate anyone being able to point me in the right direction on this; I've been working on its all day and can't see what I'm doing wrong.
POST-ANSWER UPDATE: The project is finished and working. Thanks to everyone for their guidance.
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#define MAT1 3
#define TINY 1e-20
int h_NR_LU_decomp(float *a, int *indx){
//Taken from Numerical Recipies for C
int i,imax,j,k;
float big,dum,sum,temp;
int n=MAT1;
float vv[MAT1];
int d=1.0;
//Loop over rows to get implicit scaling info
for (i=0;i<n;i++) {
big=0.0;
for (j=0;j<n;j++)
if ((temp=fabs(a[i*MAT1+j])) > big)
big=temp;
if (big == 0.0) return -1; //Singular Matrix
vv[i]=1.0/big;
}
//Outer kij loop
for (j=0;j<n;j++) {
for (i=0;i<j;i++) {
sum=a[i*MAT1+j];
for (k=0;k<i;k++)
sum -= a[i*MAT1+k]*a[k*MAT1+j];
a[i*MAT1+j]=sum;
}
big=0.0;
//search for largest pivot
for (i=j;i<n;i++) {
sum=a[i*MAT1+j];
for (k=0;k<j;k++) sum -= a[i*MAT1+k]*a[k*MAT1+j];
a[i*MAT1+j]=sum;
if ((dum=vv[i]*fabs(sum)) >= big) {
big=dum;
imax=i;
}
}
//Do we need to swap any rows?
if (j != imax) {
for (k=0;k<n;k++) {
dum=a[imax*MAT1+k];
a[imax*MAT1+k]=a[j*MAT1+k];
a[j*MAT1+k]=dum;
}
d = -d;
vv[imax]=vv[j];
}
indx[j]=imax;
if (a[j*MAT1+j] == 0.0) a[j*MAT1+j]=TINY;
for (k=j+1;k<n;k++) {
dum=1.0/(a[j*MAT1+j]);
for (i=j+1;i<n;i++) a[i*MAT1+j] *= dum;
}
}
return 0;
}
void main(){
//3x3 Matrix
float exampleA[]={1,3,-2,3,5,6,2,4,3};
//pivot array (not used currently)
int* h_pivot = (int *)malloc(sizeof(int)*MAT1);
int retval = h_NR_LU_decomp(&exampleA[0],h_pivot);
for (unsigned int i=0; i<3; i++){
printf("\n%d:",h_pivot[i]);
for (unsigned int j=0;j<3; j++){
printf("%.1lf,",exampleA[i*3+j]);
}
}
}
WolframAlpha says the answer should be
1,3,-2
2,-2,7
3,2,-2
I'm getting:
2,4,3
0.2,2,-2.8
0.8,1,6.5
And so far I have found at least 3 different versions of the 'same' algorithm, so I'm completely confused.
PS yes I know there are at least a dozen different libraries to do this, but I'm more interested in understanding what I'm doing wrong than the right answer.
PPS since in LU Decomposition the lower resultant matrix is unity, and using Crouts algorithm as (i think) implemented, array index access is still safe, both L and U can be superimposed on each other in-place; hence the single resultant matrix for this.

I think there's something inherently wrong with your indices. They sometimes have unusual start and end values, and the outer loop over j instead of i makes me suspicious.
Before you ask anyone to examine your code, here are a few suggestions:
double-check your indices
get rid of those obfuscation attempts using sum
use a macro a(i,j) instead of a[i*MAT1+j]
write sub-functions instead of comments
remove unnecessary parts, isolating the erroneous code
Here's a version that follows these suggestions:
#define MAT1 3
#define a(i,j) a[(i)*MAT1+(j)]
int h_NR_LU_decomp(float *a, int *indx)
{
int i, j, k;
int n = MAT1;
for (i = 0; i < n; i++) {
// compute R
for (j = i; j < n; j++)
for (k = 0; k < i-2; k++)
a(i,j) -= a(i,k) * a(k,j);
// compute L
for (j = i+1; j < n; j++)
for (k = 0; k < i-2; k++)
a(j,i) -= a(j,k) * a(k,i);
}
return 0;
}
Its main advantages are:
it's readable
it works
It lacks pivoting, though. Add sub-functions as needed.
My advice: don't copy someone else's code without understanding it.
Most programmers are bad programmers.

For the love of all that is holy, don't use Numerical Recipies code for anything except as a toy implementation for teaching purposes of the algorithms described in the text -- and, really, the text isn't that great. And, as you're learning, neither is the code.
Certainly don't put any Numerical Recipies routine in your own code -- the license is insanely restrictive, particularly given the code quality. You won't be able to distribute your own code if you have NR stuff in there.
See if your system already has a LAPACK library installed. It's the standard interface to linear algebra routines in computational science and engineering, and while it's not perfect, you'll be able to find lapack libraries for any machine you ever move your code to, and you can just compile, link, and run. If it's not already installed on your system, your package manager (rpm, apt-get, fink, port, whatever) probably knows about lapack and can install it for you. If not, as long as you have a Fortran compiler on your system, you can download and compile it from here, and the standard C bindings can be found just below on the same page.
The reason it's so handy to have a standard API to linear algebra routines is that they are so common, but their performance is so system-dependant. So for instance, Goto BLAS
is an insanely fast implementation for x86 systems of the low-level operations which are needed for linear algebra; once you have LAPACK working, you can install that library to make everything as fast as possible.
Once you have any sort of LAPACK installed, the routine for doing an LU factorization of a general matrix is SGETRF for floats, or DGETRF for doubles. There are other, faster routines if you know something about the structure of the matrix - that it's symmetric positive definite, say (SBPTRF), or that it's tridiagonal (STDTRF). It's a big library, but once you learn your way around it you'll have a very powerful piece of gear in your numerical toolbox.

The thing that looks most suspicious to me is the part marked "search for largest pivot". This does not only search but it also changes the matrix A. I find it hard to believe that is correct.
The different version of the LU algorithm differ in pivoting, so make sure you understand that. You cannot compare the results of different algorithms. A better check is to see whether L times U equals your original matrix, or a permutation thereof if your algorithm does pivoting. That being said, your result is wrong because the determinant is wrong (pivoting does not change the determinant, except for the sign).
Apart from that #Philip has good advice. If you want to understand the code, start by understanding LU decomposition without pivoting.

To badly paraphrase Albert Einstein:
... a man with a watch always knows the
exact time, but a man with two is
never sure ....
Your code is definitely not producing the correct result, but even if it were, the result with pivoting will not directly correspond to the result without pivoting. In the context of a pivoting solution, what Alpha has really given you is probably the equivalent of this:
1 0 0 1 0 0 1 3 -2
P= 0 1 0 L= 2 1 0 U = 0 -2 7
0 0 1 3 2 1 0 0 -2
which will then satisfy the condition A = P.L.U (where . denotes the matrix product). If I compute the (notionally) same decomposition operation another way (using the LAPACK routine dgetrf via numpy in this case):
In [27]: A
Out[27]:
array([[ 1, 3, -2],
[ 3, 5, 6],
[ 2, 4, 3]])
In [28]: import scipy.linalg as la
In [29]: LU,ipivot = la.lu_factor(A)
In [30]: print LU
[[ 3. 5. 6. ]
[ 0.33333333 1.33333333 -4. ]
[ 0.66666667 0.5 1. ]]
In [31]: print ipivot
[1 1 2]
After a little bit of black magic with ipivot we get
0 1 0 1 0 0 3 5 6
P = 0 0 1 L = 0.33333 1 0 U = 0 1.3333 -4
1 0 0 0.66667 0.5 1 0 0 1
which also satisfies A = P.L.U . Both of these factorizations are correct, but they are different and they won't correspond to a correctly functioning version of the NR code.
So before you can go deciding whether you have the "right" answer, you really should spend a bit of time understanding the actual algorithm that the code you copied implements.

This thread has been viewed 6k times in the past 10 years. I had used NR Fortran and C for many years, and do not share the low opinions expressed here.
I explored the issue you encountered, and I believe the problem in your code is here:
for (k=j+1;k<n;k++) {
dum=1.0/(a[j*MAT1+j]);
for (i=j+1;i<n;i++) a[i*MAT1+j] *= dum;
}
while in the original if (j != n-1) { ... } is used. I think the two are not equivalent.
NR's lubksb() does have a small issue in the way they set up finding the first non-zero element, but this can be skipped at very low cost, even for a large matrix. With that, both ludcmp() and lubksb(), entered as published, work just fine, and as far as I can tell perform well.
Here's a complete test code, mostly preserving the notation of NR, wth minor simplifications (tested under Ubuntu Linux/gcc):
/* A sample program to demonstrate matrix inversion using the
* Crout's algorithm from Teukolsky and Press (Numerical Recipes):
* LU decomposition + back-substitution, with partial pivoting
* 2022.06 edward.sternin at brocku.ca
*/
#define N 7
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#define a(i,j) a[(i)*n+(j)]
/* implied 1D layout is a(0,0), a(0,1), ... a(0,n-1), a(1,0), a(1,1), ... */
void matrixPrint (double *M, int nrow, int ncol) {
int i,j;
for (i=0;i<nrow;i++) {
for (j=0;j<ncol;j++) { fprintf(stderr," %+.3f\t",M[i*ncol+j]); }
fprintf(stderr,"\n");
}
}
void die(char msg[]) {
fprintf(stderr,"ERROR in %s, aborting\n",msg);
exit(1);
}
void ludcmp(double *a, int n, int *indx) {
int i, imax, j, k;
double big, dum, sum, temp;
double *vv;
/* i=row index, i=0..(n-1); j=col index, j=0..(n-1) */
vv=(double *)malloc((size_t)(n * sizeof(double)));
if (!vv) die("ludcmp: allocation failure");
for (i = 0; i < n; i++) { /* loop over rows */
big = 0.0;
for (j = 0; j < n; j++) {
if ((temp=fabs(a(i,j))) > big) big=temp;
}
if (big == 0.0) die("ludcmp: a singular matrix provided");
vv[i] = 1.0 / big; /* vv stores the scaling factor for each row */
}
for (j = 0; j < n; j++) { /* Crout's method: loop over columns */
for (i = 0; i < j; i++) { /* except for i=j */
sum = a(i,j);
for (k = 0; k < i; k++) { sum -= a(i,k) * a(k,j); }
a(i,j) = sum; /* Eq. 2.3.12, in situ */
}
big = 0.0; /* searching for the largest pivot element */
for (i = j; i < n; i++) {
sum = a(i,j);
for (k = 0; k < j; k++) { sum -= a(i,k) * a(k,j); }
a(i,j) = sum;
if ((dum = vv[i] * fabs(sum)) >= big) {
big = dum;
imax = i;
}
}
if (j != imax) { /* if needed, interchange rows */
for (k = 0; k < n; k++){
dum = a(imax,k);
a(imax,k) = a(j,k);
a(j,k) = dum;
}
vv[imax] = vv[j]; /* keep the scale factor with the new row location */
}
indx[j] = imax;
if (j != n-1) { /* divide by the pivot element */
dum = 1.0 / a(j,j);
for (i = j + 1; i < n; i++) a(i,j) *= dum;
}
}
free(vv);
}
void lubksb(double *a, int n, int *indx, double *b) {
int i, ip, j;
double sum;
for (i = 0; i < n; i++) {
/* Forward substitution, Eq.2.3.6, unscrambling permutations from indx[] */
ip = indx[i];
sum = b[ip];
b[ip] = b[i];
for (j = 0; j < i; j++) sum -= a(i,j) * b[j];
b[i] = sum;
}
for (i = n-1; i >= 0; i--) { /* backsubstitution, Eq. 2.3.7 */
sum = b[i];
for (j = i + 1; j < n; j++) sum -= a(i,j) * b[j];
b[i] = sum / a(i,i);
}
}
int main() {
double *a,*y,*col,*aa,*res,sum;
int i,j,k,*indx;
a=(double *)malloc((size_t)(N*N * sizeof(double)));
y=(double *)malloc((size_t)(N*N * sizeof(double)));
col=(double *)malloc((size_t)(N * sizeof(double)));
indx=(int *)malloc((size_t)(N * sizeof(int)));
aa=(double *)malloc((size_t)(N*N * sizeof(double)));
res=(double *)malloc((size_t)(N*N * sizeof(double)));
if (!a || !y || !col || !indx || !aa || !res) die("main: memory allocation failure");
srand48((long int) N);
for (i=0;i<N;i++) {
for (j=0;j<N;j++) { aa[i*N+j] = a[i*N+j] = drand48(); }
}
fprintf(stderr,"\nRandomly generated matrix A = \n");
matrixPrint(a,N,N);
ludcmp(a,N,indx);
for(j=0;j<N;j++) {
for(i=0;i<N;i++) { col[i]=0.0; }
col[j]=1.0;
lubksb(a,N,indx,col);
for(i=0;i<N;i++) { y[i*N+j]=col[i]; }
}
fprintf(stderr,"\nResult of LU/BackSub is inv(A) :\n");
matrixPrint(y,N,N);
for (i=0; i<N; i++) {
for (j=0;j<N;j++) {
sum = 0;
for (k=0; k<N; k++) { sum += y[i*N+k] * aa[k*N+j]; }
res[i*N+j] = sum;
}
}
fprintf(stderr,"\nResult of inv(A).A = (should be 1):\n");
matrixPrint(res,N,N);
return(0);
}

Fast 2D convolution for DSP

I want to implement some image-processing algorithms which are intended to run on a beagleboard. These algorithms use convolutions extensively. I'm trying to find a good C implementation for 2D convolution (probably using the Fast Fourier Transform). I also want the algorithm to be able to run on the beagleboard's DSP, because I've heard that the DSP is optimized for these kinds of operations (with its multiply-accumulate instruction).
I have no background in the field so I think it won't be a good idea to implement the convolution myself (I probably won't do it as good as someone who understands all the math behind it). I believe a good C convolution implementation for DSP exists somewhere but I wasn't able find it?
Could someone help?
EDIT: Turns out the kernel is pretty small. Its dimensions are either 2X2 or 3X3. So I guess I'm not looking for an FFT-based implementation. I was searching for convolution on the web to see its definition so I can implement it in a straight forward way (I don't really know what convolution is). All I've found is something with multiplied integrals and I have no idea how to do it with matrices. Could somebody give me a piece of code (or pseudo code) for the 2X2 kernel case?

What are the dimensions of the image and the kernel ? If the kernel is large then you can use FFT-based convolution, otherwise for small kernels just use direct convolution.
The DSP might not be the best way to do this though - just because it has a MAC instruction doesn't mean that it will be more efficient. Does the ARM CPU on the Beagle Board have NEON SIMD ? If so then that might be the way to go (and more fun too).
For a small kernel, you can do direct convolution like this:
// in, out are m x n images (integer data)
// K is the kernel size (KxK) - currently needs to be an odd number, e.g. 3
// coeffs[K][K] is a 2D array of integer coefficients
// scale is a scaling factor to normalise the filter gain
for (i = K / 2; i < m - K / 2; ++i) // iterate through image
{
for (j = K / 2; j < n - K / 2; ++j)
{
int sum = 0; // sum will be the sum of input data * coeff terms
for (ii = - K / 2; ii <= K / 2; ++ii) // iterate over kernel
{
for (jj = - K / 2; jj <= K / 2; ++jj)
{
int data = in[i + ii][j +jj];
int coeff = coeffs[ii + K / 2][jj + K / 2];
sum += data * coeff;
}
}
out[i][j] = sum / scale; // scale sum of convolution products and store in output
}
}
You can modify this to support even values of K - it just takes a little care with the upper/lower limits on the two inner loops.

I know it might be off topic but due to the similarity between C and JavaScript I believe it could still be helpful. PS.: Inspired by #Paul R answer.
Two dimensions 2D convolution algorithm in JavaScript using arrays
function newArray(size){
var result = new Array(size);
for (var i = 0; i < size; i++) {
result[i] = new Array(size);
}
return result;
}
function convolveArrays(filter, image){
var result = newArray(image.length - filter.length + 1);
for (var i = 0; i < image.length; i++) {
var imageRow = image[i];
for (var j = 0; j <= imageRow.length; j++) {
var sum = 0;
for (var w = 0; w < filter.length; w++) {
if(image.length - i < filter.length) break;
var filterRow = filter[w];
for (var z = 0; z < filter.length; z++) {
if(imageRow.length - j < filterRow.length) break;
sum += image[w + i][z + j] * filter[w][z];
}
}
if(i < result.length && j < result.length)
result[i][j] = sum;
}
}
return result;
}
You can check the full blog post at http://ec2-54-232-84-48.sa-east-1.compute.amazonaws.com/two-dimensional-convolution-algorithm-with-arrays-in-javascript/

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight