Doing FFT in realtime - c

I want to do the FFT of an audio signal in real time, meaning while the person is speaking in the microphone. I will fetch the data (I do this with portaudio, if it would be easier with wavein I would be happy to use that - if you can tell me how). Next I am using the FFTW library - I know how to perform 1D, 2D (real&complex) FFT, but I am not so sure how to do this, since I would have to do a 3D FFT to get frequency, amplitude (this would determine the color gradient) and time. Or is it just a 2D FFT, and I get amplitude and frequency?

I use a Sliding DFT, which is many times faster than an FFT in the case where you need to do a fourier transform each time a sample arrives in the input buffer.
It's based on the fact that once you have performed a fourier transform for the last N samples, and a new sample arrives, you can "undo" the effect of the oldest sample, and apply the effect of the latest sample, in a single pass through the fourier data! This means that the sliding DFT performance is O(n) compared with O(Log2(n) times n) for the FFT. Also, there's no restriction to powers of two for the buffer size to maintain performance.
The complete test program below compares the sliding DFT with fftw. In my production code I've optimized the below code to unreadibility, to make it three times faster.
#include <complex>
#include <iostream>
#include <time.h>
#include <math_defines.h>
#include <float.h>
#define DO_FFTW // libfftw
#define DO_SDFT
#if defined(DO_FFTW)
#pragma comment( lib, "d:\\projects\\common\\fftw\\libfftw3-3.lib" )
namespace fftw {
#include <fftw/fftw3.h>
}
fftw::fftw_plan plan_fwd;
fftw::fftw_plan plan_inv;
#endif
typedef std::complex<double> complex;
// Buffer size, make it a power of two if you want to improve fftw
const int N = 750;
// input signal
complex in[N];
// frequencies of input signal after ft
// Size increased by one because the optimized sdft code writes data to freqs[N]
complex freqs[N+1];
// output signal after inverse ft of freqs
complex out1[N];
complex out2[N];
// forward coeffs -2 PI e^iw -- normalized (divided by N)
complex coeffs[N];
// inverse coeffs 2 PI e^iw
complex icoeffs[N];
// global index for input and output signals
int idx;
// these are just there to optimize (get rid of index lookups in sdft)
complex oldest_data, newest_data;
//initilaize e-to-the-i-thetas for theta = 0..2PI in increments of 1/N
void init_coeffs()
{
for (int i = 0; i < N; ++i) {
double a = -2.0 * PI * i / double(N);
coeffs[i] = complex(cos(a)/* / N */, sin(a) /* / N */);
}
for (int i = 0; i < N; ++i) {
double a = 2.0 * PI * i / double(N);
icoeffs[i] = complex(cos(a),sin(a));
}
}
// initialize all data buffers
void init()
{
// clear data
for (int i = 0; i < N; ++i)
in[i] = 0;
// seed rand()
srand(857);
init_coeffs();
oldest_data = newest_data = 0.0;
idx = 0;
}
// simulating adding data to circular buffer
void add_data()
{
oldest_data = in[idx];
newest_data = in[idx] = complex(rand() / double(N));
}
// sliding dft
void sdft()
{
complex delta = newest_data - oldest_data;
int ci = 0;
for (int i = 0; i < N; ++i) {
freqs[i] += delta * coeffs[ci];
if ((ci += idx) >= N)
ci -= N;
}
}
// sliding inverse dft
void isdft()
{
complex delta = newest_data - oldest_data;
int ci = 0;
for (int i = 0; i < N; ++i) {
freqs[i] += delta * icoeffs[ci];
if ((ci += idx) >= N)
ci -= N;
}
}
// "textbook" slow dft, nested loops, O(N*N)
void ft()
{
for (int i = 0; i < N; ++i) {
freqs[i] = 0.0;
for (int j = 0; j < N; ++j) {
double a = -2.0 * PI * i * j / double(N);
freqs[i] += in[j] * complex(cos(a),sin(a));
}
}
}
double mag(complex& c)
{
return sqrt(c.real() * c.real() + c.imag() * c.imag());
}
void powr_spectrum(double *powr)
{
for (int i = 0; i < N/2; ++i) {
powr[i] = mag(freqs[i]);
}
}
int main(int argc, char *argv[])
{
const int NSAMPS = N*10;
clock_t start, finish;
#if defined(DO_SDFT)
// ------------------------------ SDFT ---------------------------------------------
init();
start = clock();
for (int i = 0; i < NSAMPS; ++i) {
add_data();
sdft();
// Mess about with freqs[] here
//isdft();
if (++idx == N) idx = 0; // bump global index
if ((i % 1000) == 0)
std::cerr << i << " iters..." << '\r';
}
finish = clock();
std::cout << "SDFT: " << NSAMPS / ((finish-start) / (double)CLOCKS_PER_SEC) << " fts per second." << std::endl;
double powr1[N/2];
powr_spectrum(powr1);
#endif
#if defined(DO_FFTW)
// ------------------------------ FFTW ---------------------------------------------
plan_fwd = fftw::fftw_plan_dft_1d(N, (fftw::fftw_complex *)in, (fftw::fftw_complex *)freqs, FFTW_FORWARD, FFTW_MEASURE);
plan_inv = fftw::fftw_plan_dft_1d(N, (fftw::fftw_complex *)freqs, (fftw::fftw_complex *)out2, FFTW_BACKWARD, FFTW_MEASURE);
init();
start = clock();
for (int i = 0; i < NSAMPS; ++i) {
add_data();
fftw::fftw_execute(plan_fwd);
// mess about with freqs here
//fftw::fftw_execute(plan_inv);
if (++idx == N) idx = 0; // bump global index
if ((i % 1000) == 0)
std::cerr << i << " iters..." << '\r';
}
// normalize fftw's output
for (int j = 0; j < N; ++j)
out2[j] /= N;
finish = clock();
std::cout << "FFTW: " << NSAMPS / ((finish-start) / (double)CLOCKS_PER_SEC) << " fts per second." << std::endl;
fftw::fftw_destroy_plan(plan_fwd);
fftw::fftw_destroy_plan(plan_inv);
double powr2[N/2];
powr_spectrum(powr2);
#endif
#if defined(DO_SDFT) && defined(DO_FFTW)
// ------------------------------ ---------------------------------------------
const double MAX_PERMISSIBLE_DIFF = 1e-11; // DBL_EPSILON;
double diff;
// check my ft gives same power spectrum as FFTW
for (int i = 0; i < N/2; ++i)
if ( (diff = abs(powr1[i] - powr2[i])) > MAX_PERMISSIBLE_DIFF)
printf("Values differ by more than %g at index %d. Diff = %g\n", MAX_PERMISSIBLE_DIFF, i, diff);
#endif
return 0;
}

If you need amplitude, frequency and time in one graph, then the transform is known as a Time-Frequency decomposition. The most popular one is called the Short Time Fourier Transform. It works as follows:
1. Take a small portion of the signal (say 1 second)
2. Window it with a small window (say 5 ms)
3. Compute the 1D fourier transform of the windowed signal.
4. Move the window by a small amount (2.5 ms)
5. Repeat above steps until end of signal.
6. All of this data is entered into a matrix that is then used to create the kind of 3D representation of the signal that shows its decomposition along frequency, amplitude and time.
The length of the window will decide the resolution you are able to obtain in frequency and time domains. Check here for more details on STFT and search for "Robi Polikar"'s tutorials on wavelet transforms for a layman's introduction to the above.
Edit 1:
You take a windowing function (there are innumerable functions out there - here is a list. Most intuitive is a rectangular window but the most commonly used are the Hamming/Hanning window functions. You can follow the steps below if you have a paper-pencil in hand and draw it along.
Assume that the signal that you have obtained is 1 sec long and is named x[n]. The windowing function is 5 msec long and is named w[n]. Place the window at the start of the signal (so the end of the window coincides with the 5ms point of the signal) and multiply the x[n] and w[n] like so:
y[n] = x[n] * w[n] - point by point multiplication of the signals.
Take an FFT of y[n].
Then you shift the window by a small amount (say 2.5 msec). So now the window stretches from 2.5ms to 7.5 ms of the signal x[n]. Repeat the multiplication and FFT generation steps. In other words, you have an overlap of 2.5 msec. You will see that changing the length of the window and the overlap gives you different resolutions on the time and Frequency axis.
Once you do this, you need to feed all the data into a matrix and then have it displayed. The overlap is for minimising the errors that might arise at boundaries and also to get more consistent measurements over such short time frames.
P.S: If you had understood STFT and other time-frequency decompositions of a signal, then you should have had no problems with steps 2 and 4. That you have not understood the above mentioned steps makes me feel like you should revisit time-frequency decompositions also.

You can create a realtime FFT by choosing a short time-span and analysing (FFT'ing) just that time-span. You can probably get away with just selecting non-overlapping timespans of say 100-500 milliseconds; the analytically purer way to do this would be using a sliding-window (again of e.g. 100-500 ms), but that is often unnecessary and you can show nice graphics with the non-overlapping timespans without much processing power.

Real-time FFT means completely different from what you just described. It means that for given N and X[N] your algorithm gives Fx[i] while incrementing value i. Meaning, proceeding value does not compute until current value computation completed. This is completely different from what you just described.
Hardware usually uses FFT with around 1k-16k points. Fixed N, not real-time computation. Moving window FFT as described with previous answers.

Related

How to integrate my calculation C code with OpenMP

I am using a dual channels DAQ card with data stream mode. I wrote some code for analysis/calculation and put them to the main code for operation. However, the FIFO overflow warning sign always occur once its total data reach around 6000 MSamples (the DAQ on board memory is 8GB). I am well-noticed that a complicated calculation might retard the system and cause the overflow but all of the works I wrote are necessary to my experiment which means cannot be replaced (or there is more effective code can let me get the same result). I have heard that the OpenMP might be a solution to boost up the speed, but I am just a beginner in C, how could I implement to my calculation code?
My computer has 64GB RAM and Intel Core i7 processor. I always turn off other unnecessary software when running the data stream code. The code has been optimize as possible as I can, like simplify the hilbert() and use memcpy to pick out a specific range of data points.
This is how I process the data:
1.Install the FFTW source code for the Hilbert transform.
2.For loop to de-interleave pi16Buffer data to ch2Buffer
3.memcpy to get a certain range of data that I am interested put them to another array called ch2newBuffer
4.Do the hilbert() on ch2newBuffer and calculate its absolute number.
5.Find the max value of ch1 and abs(hilbert(ch2newBuffer)).
6.Calculate max(abs(hilbert(ch2))) / max(ch1).
Here is a part of the my DAQ code which in charge to calculation:
void hilbert(const int16* in, fftw_complex* out, fftw_plan plan_forward, fftw_plan plan_backward)
{
// copy the data to the complex array
for (int i = 0; i < N; ++i) {
out[i][REAL] = in[i];
out[i][IMAG] = 0;
}
// creat a DFT plan and execute it
//fftw_plan plan = fftw_plan_dft_1d(N, out, out, FFTW_FORWARD, FFTW_ESTIMATE);
fftw_execute(plan_forward);
// destroy a plan to prevent memory leak
//fftw_destroy_plan(plan_forward);
int hN = N>>1; // half of the length (N/2)
int numRem = hN; // the number of remaining elements
// multiply the appropriate value by 2
//(those should multiplied by 1 are left intact because they wouldn't change)
for (int i = 1; i < hN; ++i) {
out[i][REAL] *= 2;
out[i][IMAG] *= 2;
}
// if the length is even, the number of the remaining elements decrease by 1
if (N % 2 == 0)
numRem--;
else if (N > 1) {
out[hN][REAL] *= 2;
out[hN][IMAG] *= 2;
}
// set the remaining value to 0
// (multiplying by 0 gives 0, so we don't care about the multiplicands)
memset(&out[hN + 1][REAL], 0, numRem * sizeof(fftw_complex));
// creat a IDFT plan and execute it
//plan = fftw_plan_dft_1d(N, out, out, FFTW_BACKWARD, FFTW_ESTIMATE);
fftw_execute(plan_backward);
// do some cleaning
//fftw_destroy_plan(plan_backward);
//fftw_cleanup();
// scale the IDFT output
//for (int i = 0; i < N; ++i) {
//out[i][REAL] /= N;
//out[i][IMAG] /= N;
//}
}
float SumBufferData(void* pBuffer, uInt32 u32Size, uInt32 u32SampleBits)
{
// In this routine we sum up all the samples in the buffer. This function
// should be replaced with the user's analysys function
if ( 8 == u32SampleBits )
{
pu8Buffer = (uInt8 *)pBuffer;
for (i = 0; i < u32Size; i++)
{
i64Sum += pu8Buffer[i];
}
}
else
{
pi16Buffer = (int16 *)pBuffer;
fftw_complex(hilbertedch2[N]);
fftw_plan plan_forward = fftw_plan_dft_1d(N, hilbertedch2, hilbertedch2, FFTW_FORWARD, FFTW_ESTIMATE);
fftw_plan plan_backward = fftw_plan_dft_1d(N, hilbertedch2, hilbertedch2, FFTW_BACKWARD, FFTW_ESTIMATE);
ch2Buffer = (int16*)calloc(u32Size / 2, sizeof * ch2Buffer);
ch2newBuffer= (int16*)calloc(u32Size/2, sizeof* ch2newBuffer);
// De-interleave the data from pi16Buffer
for (i = 0; i < u32Size/2 ; i++)
{
ch2Buffer[i] = pi16Buffer[i*2+1];
}
// Pick out the data points range that we are interested
memcpy(ch2newBuffer, &ch2Buffer[6944], 1024 * sizeof(ch2Buffer[0]));
// Do the hilbert transform to these data points
hilbert(ch2newBuffer, hilbertedch2, plan_forward, plan_backward);
fftw_destroy_plan(plan_forward);
fftw_destroy_plan(plan_backward);
//Find max value in each segs of ch1 and ch2
for (i = 128; i < 200 ; i++)
{
if (pi16Buffer[i*2] > max1)
max1 = pi16Buffer[i*2];
}
for (i = 0; i < 1024; i++)
{
if (fabs(hilbertedch2[i][IMAG]) > max2)
max2 = fabs(hilbertedch2[i][IMAG]);
}
Corrected = max2 / max1 / N; // Calculate the signal correction
}
free(ch2Buffer);
free(ch2newBuffer);
return Corrected;
}
Loop are typically a good start for parallelism, for instance:
#pragma omp parallel for
for (int i = 0; i < N; ++i) {
out[i][REAL] = in[i];
out[i][IMAG] = 0;
}
or
#pragma omp parallel for reduction(max:max2)
for (i = 0; i < 1024; i++)
{
float tmp = fabs(hilbertedch2[i][IMAG]);
max2 = (max2 > tmp) ? max2 : tmp.
}
That being said, you need to profile your code find out where the execution takes the most time and try to parallelized if possible. However, looking at what you have posted, I do not see a lot of parallelism opportunity there.

How to implement summation using parallel reduction in OpenCL?

I'm trying to implement a kernel which does parallel reduction. The code below works on occasion, I have not been able to pin down why it goes wrong on the occasions it does.
__kernel void summation(__global float* input, __global float* partialSum, __local float *localSum){
int local_id = get_local_id(0);
int workgroup_size = get_local_size(0);
localSum[local_id] = input[get_global_id(0)];
for(int step = workgroup_size/2; step>0; step/=2){
barrier(CLK_LOCAL_MEM_FENCE);
if(local_id < step){
localSum[local_id] += localSum[local_id + step];
}
}
if(local_id == 0){
partialSum[get_group_id(0)] = localSum[0];
}}
Essentially I'm summing the values per work group and storing each work group's total into partialSum, the final summation is done on the host. Below is the code which sets up the values for the summation.
size_t global[1];
size_t local[1];
const int DATA_SIZE = 15000;
float *input = NULL;
float *partialSum = NULL;
int count = DATA_SIZE;
local[0] = 2;
global[0] = count;
input = (float *)malloc(count * sizeof(float));
partialSum = (float *)malloc(global[0]/local[0] * sizeof(float));
int i;
for (i = 0; i < count; i++){
input[i] = (float)i+1;
}
I'm thinking it has something to do when the size of the input is not a power of two? I noticed it begins to go off for numbers around 8000 and beyond. Any assistance is welcome. Thanks.
I'm thinking it has something to do when the size of the input is not a power of two?
Yes. Consider what happens when you try to reduce, say, 9 elements. Suppose you launch 1 work-group of 9 work-items:
for (int step = workgroup_size / 2; step > 0; step /= 2){
// At iteration 0: step = 9 / 2 = 4
barrier(CLK_LOCAL_MEM_FENCE);
if (local_id < step) {
// Branch taken by threads 0 to 3
// Only 8 numbers added up together!
localSum[local_id] += localSum[local_id + step];
}
}
You're never summing the 9th element, hence the reduction is incorrect. An easy solution is to pad the input data with enough zeroes to make the work-group size the immediate next power-of-two.

Ising 1-Dimensional C - program

i am trying to simulate the Ising Model 1-D. This model consists in a chain of spin (100 spins) and using the Mont Carlo - Metropolis to accept the flip of a spin if the energy of the system (unitary) goes down or if it will be less than a random number.
In the correct program, both the energy the magnetization go to zero, and we have the results as a Gaussian (graphics of Energyor the magnetization by the number of Monte Carlo steps).
I have done some work but i think my random generator isn't correctt for this, and i don't know how/where to implement the boundary conditions: the last spin of the chain is the first one.
I need help to finish it. Any help will be welcome. Thank you.
I am pasting my C program down:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h> //necessary for function time()
#define LENGTH 100 //size of the chain of spins
#define TEMP 2 // Temperature in units of J
#define WARM 200 // Termalização
#define MCS 20000 //Monte Carlo Steps
void start( int spin[])
{
/* starts with all the spins 1 */
int i;
for (i = 0 ; i < 100; i++)
{
spin[i] = 1;
}
}
double energy( int spin[]) //of the change function J=1
{
int i;
double energyX=0;// because the begining Energy = -J*sum (until 100) =-100,
for (i = 0;i<100;i++)
energyX=energyX-spin[i]*spin[i+1];
return(energyX);
}
int randnum(){
int num;
srand(time(NULL));
/* srand(time(NULL)) objectives to initiate the random number generator
with the value of the function time(NULL). This is calculated as being the
total of seconds passed since january first of 1970 until the present date.
So, this way, for each execution the value of the "seed" will be different.
*/
srand(time(NULL));
//picking one spin randomly zero to 100
num=rand() % 100;
printf("num = %d ", num);
return num;
}
void montcarlo( int spin[])
{
int i,j,num;
double prob;
double energyA, energyB; // A -> old energy and B -> the new energy
int rnum1,rnum2;
prob=exp(-(energyB-energyA)/TEMP);
energyA = 0;
energyB = 0;
for (i = 0;i<100;i++)
{
for (j = 0;j<100;j++)
{
energyA=energy(spin);
rnum1=randnum();
rnum2=randnum(); // i think they will give me different numbers
spin[rnum1] = -spin[rnum1]; //flip of the randomly selected spin
energyB = energyB-spin[j]*spin[j+1];
if ((energyB-energyA<0)||((energyB-energyA>0)&&(rnum2>prob))){ // using rnum2 not to be correlated if i used rnum1
spin[rnum1]=spin[rnum1];} // keep the flip
else if((energyB-energyA>0)&&(rnum2<prob))
spin[rnum1]=-spin[rnum1]; // unflip
}
}
}
int Mag_Moment( int spin[] ) // isso é momento magnetico
{
int i;
int mag;
for (i = 0 ; i < 100; i++)
{
mag = mag + spin[i];
}
return(mag);
}
int main()
{
// starting the spin's chain
int spin[100];//the vector goes til LENGHT=100
int i,num,j;
int itime;
double mag_moment;
start(spin);
double energy_chain=0;
energy_chain=energy(spin); // that will give me -100 in the begining
printf("energy_chain starts with %f", energy_chain);// initially it gives -100
/*Warming it makes the spins not so ordered*/
for (i = 1 ; i <= WARM; i++)
{
itime = i;
montcarlo(spin);
}
printf("Configurtion after warming %d \n", itime);
for (j = 0 ; j < LENGTH; j++)
{
printf("%d",spin[j]);
}
printf("\n");
energy_chain=energy(spin); // new energy after the warming
/*openning a file to save the values of energy and magnet moment of the chain*/
FILE *fp; // declaring the file for the energy
FILE *fp2;// declaring the file for the mag moment
fp=fopen("energy_chain.txt","w");
fp2=fopen("mag_moment.txt","w");
int pures;// net value of i
int a;
/* using Monte Carlo metropolis for the whole chain */
for (i = (WARM + 1) ; i <= MCS; i++)
{
itime=i;//saving the i step for the final printf.
pures = i-(WARM+1);
montcarlo(spin);
energy_chain = energy_chain + energy(spin);// the spin chain is moodified by void montcarlo
mag_moment = mag_moment + Mag_Moment(spin);
a=pures%10000;// here i select a value to save in a txt file for 10000 steps to produce graphs
if (a==0){
fprintf(fp,"%.12f\n",energy_chain); // %.12f just to give a great precision
fprintf(fp2,"%.12f\n",mag_moment);
}
}
fclose(fp); // closing the files
fclose(fp2);
/* Finishing -- Printing */
printf("energy_chain = %.12f\n", energy_chain);
printf("mag_moment = %.12f \n", mag_moment);
printf("Temperature = %d,\n Size of the system = 100 \n", TEMP);
printf("Warm steps = %d, Montcarlo steps = %d \n", WARM , MCS);
printf("Configuration in time %d \n", itime);
for (j = 0 ; j < 100; j++)
{
printf("%d",spin[j]);
}
printf("\n");
return 0;
}
you should call srand(time(NULL)); only once in your program. Every time you call this in the same second you will get the same sequence of random numbers. So it is very likely that both calls to randnum will give you the same number.
Just add srand(time(NULL)); at the begin of main and remove it elsewhere.
I see a number of bugs in this code, I think. The first one is the re-seeding of the srand() each loop which has already been addressed. Many of the loops go beyond the array bounds, such as:
for (ii = 0;ii<100;ii++)
{
energyX = energyX - spin[ii]*spin[ii+1];
}
This will give you spin[99]*spin[100] for the last loop, for which is out of bounds. That is kind of peppered throughout the code. Also, I noticed the probability rnum2 is an int but compared as if it's supposed to be a double. I think dividing the rnum2 by 100 will give a reasonable probability.
rnum2 = (randnum()/100.0); // i think they will give me different numbers
The initial probability used to calculate the spin is, prob=exp(-(energyB-energyA)/TEMP); but both energy values are not initialized, maybe this is intentional, but I think it would be better to just use rand(). The Mag_Moment() function never initializes the return value, so you wind up with a return value that is garbage. Can you point me to the algorithm you are trying to reproduce? I'm just curious.

Slow OMP vs serial

I'm trying to optimize a C subroutine called from R that takes up ~60% of the computation time for a problem I'm trying to solve. This is down from 86% when coded purely in R. The vast majority of the execution time in my C code is taking place in a nested for loop and so this seems an obvious candidate to try and parallelize using OpenMP. I've tried doing so with variable results – at best the elapsed time is fractionally worse than not using OMP, at worst the performance scaled inversely to the number of threads. The code for the fastest version is below:
#include <R.h>
#include <Rmath.h>
#include <omp.h>
void gradNegLogLik_c(double *param, double *delta, double *X, double *M, int *nBeta, int *nEpsilon, int *nObs, double *gradient){
// ========================================================================================
// param: double[nBeta + nEpsilon] values of parameters at which to evaluate gradient
// delta: double[nObs] satellite - buoy differences
// X: double[nObs * (nBeta + nEpsilon)] design matrix for mean components (i.e. beta terms)
// M: double[nObs * (nBeta + nEpsilon)] design matrix for variance components (i.e. epsilon terms)
// nBeta: int number of mean terms
// nEpsilon: int number of variance terms
// nObs: int number of observations
// gradient: double[nBeta + nEpsilon] output array of gradients
// ========================================================================================
// ========================================================================================
// local variables
size_t i, j, ind;
size_t nterms = *nBeta + *nEpsilon;
size_t nbeta = *nBeta;
size_t nepsilon = *nEpsilon;
size_t nobs = *nObs;
// allocate local memory and set to zero
double *sigma2 = calloc( nobs , sizeof(double) );
double *fittedValues = calloc( nobs , sizeof(double) );
double *residuals = calloc( nobs , sizeof(double) );
double *beta = calloc( nbeta , sizeof(double) );
double *epsilon2 = calloc( nepsilon , sizeof(double) );
double *residuals2 = calloc( nobs , sizeof(double) );
double gradBeta, gradEpsilon;
// extract beta and epsilon terms from param
// =========================================
for(i = 0 ; i < nbeta ; i++){
beta[i] = param[ i ];
epsilon2[i] = param[ nbeta + i ];
}
// Initialise gradient to zero for return value
// =========================================
for( i = 0 ; i < nterms ; i++){
gradient[i] = 0;
}
// calculate sigma, fitted values and residuals
// ============================================
for( i = 0 ; i < nbeta ; i++){
for( j = 0 ; j < nobs ; j++){
ind = i * nobs + j;
sigma2[j] += M[ind] * epsilon2[i];
fittedValues[j] += X[ind] * beta[i];
}
}
for( j = 0 ; j < nobs ; j++){
// calculate reciprocal as this is what we actually use and
// we only want to do it once.
sigma2[j] = 1 / sigma2[j];
residuals[j] = delta[j] - fittedValues[j];
residuals2[j] = residuals[j]*residuals[j];
}
// Loop over all observations and calculate value of (negative) derivative
// =======================================================================
#pragma omp parallel for private(i, j, ind, gradBeta, gradEpsilon)\
shared(gradient, nbeta, nobs, X, M, sigma2, fittedValues, delta, residuals2) \
default(none)
for( i = 0 ; i < nbeta ; i++){
gradBeta = 0.0;
gradEpsilon = 0.0;
for(j = 0 ; j < nobs ; j++){
ind = i * nobs + j;
gradBeta -= -1.0*X[ind] * sigma2[j]*(fittedValues[j] - delta[j]);
gradEpsilon -= 0.5*M[ind] * sigma2[j]*(residuals2[j] * sigma2[j] - 1);
}
gradient[i] = gradBeta;
gradient[nbeta + i] = gradEpsilon;
}
// End of function
// free local memory
free(sigma2);
free(fittedValues);
free(residuals);
free(beta);
free(epsilon2);
free(residuals2);
}
nObs is order 10000.
nBeta is in the range 20 – several hundred.
nEpsilon = nBeta and is not currently used.
After searching through this site and an afternoon googling and trying different things I don't seem to be able to make any further improvement. My first thoughts were false sharing – I've tried various things such as unrolling the outer loop to set 8 elements of gradient[] at a time to creating a temporary padded array to store the results in. I've also tried different combinations of shared, private and firstprivate. None of this appears to improve things and my fastest execution time is marginally worse in parallel than in serial. This leads to two questions before I spend any more time on this:
Is my problem (repeating ~9000 of the same set of calculations 20 - 900 times) too small to make it worthwhile using OMP?
Is there something I'm missing or doing wrong?
I suspect it's the latter as I'm relatively inexperienced when using C and OMP. Any help / thoughts would be appreciated.
(For info, I'm running on SLED11 server with 16 cores and 192GB of memory and using GCC 4.7.2 to compile my C code). Other users are using the server but the relative performance of OMP vs serial code seems independent of the other users.
Thanks in advance,
Dave.
EDIT: For info the compile command I've used is
gcc -I/RHOME/R/3.0.1/lib64/R/include -DNDEBUG -I/usr/local/include -fpic \
-std=c99 -Wall -pedantic –O3 -fopenmp -c src/gradNegLogLik_call.c \
-o src/gradNegLogLik_call.o
Most of the flags are set by the R CMD SHLIB command - I've added the -O3 -fopenmp manually.
It may be useful to give some context to my question above before giving my answer to what I've done to speed up my code (although this has been achieved without using OMP).
My original C function was written to calculate the gradient of a log likelihood function to be used with the R optim() command and the L-BFGS-B method. For each call of optim my log likelihood and gradient functions are each called ~100 times as optim finds the best solution. As a result, these two functions take up the bulk of my execution time, as expected and reported by Rprof, and so were the two targets for converting to C to improve the efficiency of my code.
Converting my two functions to C and optimizing that code has resulted in my calls to optim reducing from an average elapsed time of 1.88s per call to 0.25s per call. This has reduced my processing time from ~1 month to a few days. The change that had the biggest impact (beside calling C) was changing the ordering of the nested loops. The original order was chosen due to the way R stores matrices and chosen to avoid having to transpose my matrices for each call of my C functions. Recognizing that the transpose only needs to be done once for each call to optim(), and not each C call as I had originally coded, this is a small overhead to pay compared to the impact / benefit of changing the order in the C functions.
Given this increase in speed it's had to justify spending any more time on this. The final version of my gradient function (as per my original post) is given below.
Note that whilst I've changed from using .C to .Call in R (hence the change to the function arguments etc) this in itself doesn’t account for the speed increase.
#include <R.h>
#include <Rmath.h>
#include <Rinternals.h>
#include <omp.h>
SEXP gradNegLogLik_call(SEXP param ,SEXP delta, SEXP X, SEXP M, SEXP nBeta, SEXP nEpsilon){
// local variables
double *par, *d;
double *sigma2, *fittedValues, *residuals, *grad, *Xuse, *Muse;
double val, sig2, gradBeta, gradEpsilon;
int n, m, ind, nterms, i, j;
SEXP gradient;
// get / associate parameters with local pointer
par = REAL(param);
Xuse = REAL(X);
Muse = REAL(M);
d = REAL(delta);
n = LENGTH(delta);
m = INTEGER(nBeta)[0];
nterms = m + m;
// allocate memory
PROTECT( gradient = allocVector(REALSXP, nterms ));
// set pointer to real portion of gradient
grad = REAL(gradient);
// set all gradient terms to zero
for(i = 0 ; i < nterms ; i++){
grad[i] = 0.0;
}
sigma2 = Calloc(n, double );
fittedValues = Calloc(n, double );
residuals = Calloc(n, double );
// calculate sigma, fitted values and residuals
for(i = 0 ; i < n ; i++){
val = 0.0;
sig2 = 0.0;
for(j = 0 ; j < m ; j++){
ind = i*m + j;
val += Xuse[ind]*par[j];
sig2 += Muse[ind]*par[j+m];
}
// calculate reciprocal of sigma as this is what we actually use
// and we only want to do it once
sigma2[i] = 1.0 / sig2;
fittedValues[i] = val;
residuals[i] = d[i] - val;
}
// now loop over each paramter and calculate derivative
for(i = 0 ; i < n ; i++){
gradBeta = -1.0*sigma2[i]*(fittedValues[i] - d[i]);
gradEpsilon = 0.5*sigma2[i]*(residuals[i]*residuals[i]*sigma2[i] - 1);
for(j = 0 ; j < m ; j++){
ind = i*m + j;
grad[j] -= Xuse[ind]*gradBeta;
grad[j+m] -= Muse[ind]*gradEpsilon;
}
}
UNPROTECT(1);
Free(sigma2);
Free(residuals);
Free(fittedValues);
// return array of gradients
return gradient;
}

different binomial coefficient algorithm's efficiency compare

I've compared two algorithms to calculate binomial coefficient C(n, k) as below: #1 is derived from the formulaic definition of the binomial coefficient to calculate, #2 uses dynamic programming.
#include <stdio.h>
#include <sys/time.h>
#define min(x, y) (x<y?x:y)
#define NMAX 150
double binomial_formula(int n, int k) {
double denominator=1, numerator=1, i;
for (i = 0; i< k; i++)
numerator *= (n-i), denominator *= (i+1);
return numerator/denominator;
}
double binomial_dynamic_pro(int n, int k) {
double c[NMAX][NMAX];
int i,j;
for (i = 0; i <= n; i++) {
for (j = 0; j <= min(i, k); j++) {
if (i == j || j == 0)
c[i][j] = 1;
else
c[i][j] = c[i-1][j-1]+c[i-1][j];
}
}
return c[n][k];
}
int main(void) {
struct timeval s, e;
int n = 50, k = 30;
double re = 0;
printf("now formula calc C(%d, %d)..\n", n, k);
gettimeofday(&s, NULL);
re = binomial_formula(n, k);
gettimeofday(&e, NULL);
printf("%.0f, use time: %ld'us\n", re,
1000000*(e.tv_sec-s.tv_sec)+ (e.tv_usec-s.tv_usec));
printf("now dynamic calc C(%d, %d)..\n", n, k);
gettimeofday(&s, NULL);
re = binomial_dynamic_pro(n, k);
gettimeofday(&e, NULL);
printf("%.0f, use time: %ld'us\n", re,
1000000*(e.tv_sec-s.tv_sec)+ (e.tv_usec-s.tv_usec));
return 0;
}
and I use gcc to compile, and it runs out like this:
now formula calc C(50, 30)..
47129212243960, use time: 2'us
now dynamic calc C(50, 30)..
47129212243960, use time: 102'us
These results are unexpected for me. I thought that dynamic programming should be faster, for it's O(nk), but the formula's method should be O(k^2) and it uses multiplication, which should be also be slower.
So why is the dynamic programming version so much slower?
binomial_formula as written is definitely not O(k^2). It only has a single loop which is of size k making it O(k). You should also keep in mind that on modern architectures that the cost of memory accesses dwarf the cost of any single instruction by an order of magnitude, and your dynamic programming solution reads and writes many more addresses in memory. The first version can be computed entirely in a few registers.
Note that you can actually improve on your linear version by recognizing that C(n,k) == C(n, n-k):
double binomial_formula(int n, int k) {
double delominator=1, numerator=1, i;
if (k > n/2)
k = n - k;
for (i = 0; i< k; i++)
numerator *= (n-i), delominator *= (i+1);
return numerator / delominator;
}
You should keep in mind that dynamic programming is just a technique and not a silver bullet. It doesn't magically make all algorithms faster.
First algorithm
Takes linear time
Uses a constant amount of space
Second algorithm
Takes quadratic time
Uses quadratic amount of space
In terms of time/space, first algorithm is better but the second algorithm has the advantage of computing answer for smaller values as well; it can be used as a pre-processing step.
Imagine that you are given a number of queries of the form n k and you are asked to write n choose k for each of them. Further, imagine that the number of queries is big (say around n*n). Using the first algorithm takes O(nq) = O(n*n*n) while using the second algorithm takes O(n*n).
So it all depends on what you are trying to do.

Resources