DFT function implementation in C

I am implementing BFSK on a DSP processor and am currently simulating it on a Linux machine using C. I am working on the demodulation function, which involves taking an FFT of the incoming data. For simulation purposes, I have a pre-defined DFT function:
void dft(complex_float* in, complex_float* out, int N, int inv)
{
    int i, j;
    float a, f;
    complex_float s, w;
    /* scale by 1/N only for the inverse transform */
    f = inv ? 1.0/N : 1.0;
    for (i = 0; i < N; i++) {
        s.re = 0;
        s.im = 0;
        for (j = 0; j < N; j++) {
            a = -2*PI*i*j/N;
            if (inv) a = -a;
            w.re = cos(a);
            w.im = sin(a);
            /* complex multiply-accumulate: s += in[j] * w */
            s.re += in[j].re * w.re - in[j].im * w.im;
            s.im += in[j].im * w.re + in[j].re * w.im;
        }
        out[i].re = s.re*f;
        out[i].im = s.im*f;
    }
}
Here the complex_float is a struct defined as follows:
typedef struct {
float re;
float im;
} complex_float;
In the dft() function, the parameter N denotes the number of DFT points.
My question: the algorithm also involves a frequency hopping sequence, so while demodulating the signal I need to check the amplitude of the DFT at different frequency components.
In MATLAB this was quite simple, since I had the sampling frequency at hand and could find the FFT index for any frequency as
powerat_at_freq = floor((freq * fftLength) / Sampling_freq)
But the C function does not involve any frequencies, so how can I determine the magnitude of the DFT at any particular frequency?

The index in the FFT table for a particular frequency is calculated as follows:
int i = round(f / fT*N)
where f is the wanted frequency, fT is the sampling frequency and N is the number of FFT points.
The FFT should be fine-grained enough (i.e. N should be large) to cover all the frequencies.
If the precise frequency isn't present in the FFT, the nearest one will be used. More info about FFT indexes versus frequencies: "How do I obtain the frequencies of each value in an FFT?"

The frequency each output bin represents depends on the sample rate of the data fed to the transform, divided by the length of the FFT. Thus any DFT or FFT can represent any frequency you want just by feeding it the right amount of data at the right sample rate.
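To make that relationship concrete, here is a small sketch (the function name bin_to_freq is an assumption for illustration): bin k of an N-point transform of data sampled at fs corresponds to k * fs / N Hz, and bins above N/2 mirror the negative frequencies.

```c
/* Hypothetical helper: centre frequency in Hz of DFT bin k.
 * Bins above n/2 are reported as negative frequencies. */
double bin_to_freq(int k, double sample_rate, int n)
{
    if (k > n / 2)
        k -= n;
    return k * sample_rate / n;
}
```

With 8000 Hz and 64 points the bins are 125 Hz apart, so bin 8 sits at 1000 Hz.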

You can refer to the FFTW library, which is well known and widely used for computing FFTs.
The official website is: http://www.fftw.org/
By the way, MATLAB's FFT function is also implemented on top of the FFTW library.

Related

Best approach to FIFO implementation in a kernel OpenCL

Goal: Implement the diagram shown below in OpenCL. The main thing needed from the OpenCL kernel is to multiply the coefficient array by the temp array and then accumulate all those values into one at the end. (That is probably the most time-intensive operation, so parallelism would be really helpful here.)
I am using a helper function for the kernel that does the multiplication and addition (I am hoping this function will be parallel as well).
Description of the picture:
One at a time, the values are passed into an array (the temp array) that is the same size as the coefficient array. Every time a single value is passed into this array, the temp array is multiplied with the coefficient array in parallel and the products at each index are then summed into one single element. This continues until the input array reaches its final element.
What happens with my code?
For 60 elements from the input, it takes over 8000 ms, and I have a total of 1.2 million inputs that still have to be passed in. I know for a fact that there is a much better way to do what I am attempting. My code is below.
Here are some things that I know for sure are wrong with the code. When I try to multiply the coefficient values with the temp array, it crashes. This is because of the global_id. All I want this line to do is simply multiply the two arrays in parallel.
I tried to figure out why the FIFO function was taking so long, so I started commenting lines out. I first commented out everything except the first for loop of the FIFO function; this took 50 ms. When I uncommented the next loop, it jumped to 8000 ms. So the delay must have to do with the transfer of data.
Is there a register shift that I could use in OpenCL? Perhaps some logical shifting method for integer arrays? (I know there is a '>>' operator.)
float constant temp[58];
float constant tempArrayForShift[58];
float constant multipliedResult[58];
float fifo(float inputValue, float *coefficients, int sizeOfCoeff) {
//take array of 58 elements (or same size as number of coefficients)
//shift all elements to the right one
//bring next element into index 0 from input
//multiply the coefficient array with the array thats the same size of coefficients and accumilate
//store into one output value of the output array
//repeat till input array has reached the end
int globalId = get_global_id(0);
float output = 0.0f;
//Shift everything down from 1 to 57
//takes about 50ms here
for(int i=1; i<58; i++){
tempArrayForShift[i] = temp[i];
}
//Input the new value passed from main kernel. Rest of values were shifted over so element is written at index 0.
tempArrayForShift[0] = inputValue;
//Takes about 8000ms with this loop included
//Write values back into temp array
for(int i=0; i<58; i++){
temp[i] = tempArrayForShift[i];
}
//all 58 elements of the coefficient array and temp array are multiplied at the same time and stored in a new array
//I am 100% sure this line is crashing the program.
//multipliedResult[globalId] = coefficients[globalId] * temp[globalId];
//Sum the temp array with each other. Temp array consists of coefficients*fifo buffer
for (int i = 0; i < 58; i ++) {
// output = multipliedResult[i] + output;
}
//Returned summed value of temp array
return output;
}
__kernel void lowpass(__global float *Array, __global float *coefficients, __global float *Output) {
//Initialize the temporary array values to 0
for (int i = 0; i < 58; i ++) {
temp[i] = 0;
tempArrayForShift[i] = 0;
multipliedResult[i] = 0;
}
//fifo adds one element in and calls the fifo function. ALL I NEED TO DO IS SEND ONE VALUE AT A TIME HERE.
for (int i = 0; i < 60; i ++) {
Output[i] = fifo(Array[i], coefficients, 58);
}
}
I have had this problem with OpenCL for a long time. I am not sure how to implement parallel and sequential instructions together.
Another alternative I was thinking about
In the main cpp file, I was thinking of implementing the FIFO buffer there and having the kernel do the multiplication and addition. But this would mean I would have to call the kernel 1000+ times in a loop. Would this be the better solution, or would it just be completely inefficient?
To get good performance out of a GPU, you need to parallelize your work across many threads. Your code uses just a single thread, and a GPU is very slow per thread; it only becomes fast when many threads run at the same time. In this case you can use a single thread for each output value. You do not actually need to shift values through an array: for every output value a window of 58 values is considered, so you can just grab these values from memory, multiply them with the coefficients and write back the result.
A simple implementation would be (launch with as many threads as output values):
__kernel void lowpass(__global float *Array, __global float *coefficients, __global float *Output)
{
    // one work item per output value
    int globalId = get_global_id(0);
    float sum=0.0f;
    for (int i=0; i< 58; i++)
    {
        float tmp=0;
        // skip taps that would read before the start of the input
        if (globalId+i > 56)
        {
            tmp=Array[i+globalId-57]*coefficients[57-i];
        }
        sum += tmp;
    }
    Output[globalId]=sum;
}
This is not perfect, as the memory access patterns it generates are not optimal for GPUs. The cache will likely help a bit, but there is clearly a lot of room for optimization, as the values are reused several times. The operation you are trying to perform is called (1D) convolution. NVIDIA has a 2D example called oclConvolutionSeparable in their GPU Computing SDK that shows an optimized version. You can adapt their convolutionRows kernel for a 1D convolution.
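When experimenting with kernels like this, a plain C reference on the host can be handy for checking the GPU output. The sketch below computes the same sum as the simple kernel, treating samples before the start of the input as zero. fir_reference and its parameters are names chosen here, not part of any SDK.

```c
#include <stddef.h>

/* Hypothetical host-side reference:
 * out[n] = sum over k of coeff[k] * in[n - k],
 * with in[] taken as 0 before index 0. */
void fir_reference(const float *in, const float *coeff, float *out,
                   size_t n_samples, size_t n_taps)
{
    for (size_t n = 0; n < n_samples; n++) {
        float sum = 0.0f;
        /* k <= n keeps the read index n - k from going negative */
        for (size_t k = 0; k < n_taps && k <= n; k++)
            sum += coeff[k] * in[n - k];
        out[n] = sum;
    }
}
```

Comparing a few hundred outputs of this function against the kernel's Output buffer (within a small floating-point tolerance) is usually enough to catch indexing mistakes.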
Here's another kernel you can try out. There are a lot of synchronization points (barriers), but this should perform fairly well. The 65-item work group is not very optimal.
the steps:
1. init local values to 0
2. copy coefficients to local variable
looping over the output elements to compute:
3. shift existing elements (work items > 0 only)
4. copy new element (work item 0 only)
5. compute dot product
5a. multiplication - one per work item
5b. reduction loop to compute sum
6. copy dot product to output (WI 0 only)
7. final barrier
the code:
__kernel void lowpass(__global float *Array, __constant float *coefficients, __global float *Output, __local float *localArray, __local float *localSums, int outputSize){
    //outputSize (number of output elements) was not declared in the original;
    //it is assumed here to be passed in as a kernel argument
    int globalId = get_global_id(0);
    int localId = get_local_id(0);
    int localSize = get_local_size(0);
    //1 init local values to 0
    localArray[localId] = 0.0f;
    //2 copy coefficients to local
    //don't bother with this if __constant is working for you
    //requires another local to be passed in: localCoeff
    //localCoeff[localId] = coefficients[localId];
    //barrier for both steps 1 and 2
    barrier(CLK_LOCAL_MEM_FENCE);
    float tmp = 0.0f;
    for(int i = 0; i < outputSize; i++)
    {
        //3 shift elements (+barrier)
        if(localId > 0){
            tmp = localArray[localId - 1];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
        localArray[localId] = tmp;
        //4 copy new element (work item 0 only, + barrier)
        if(localId == 0){
            localArray[0] = Array[i];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
        //5 compute dot product
        //5a multiply + barrier
        localSums[localId] = localArray[localId] * coefficients[localId];
        barrier(CLK_LOCAL_MEM_FENCE);
        //5b reduction loop + barrier
        for(int j = 1; j < localSize; j <<= 1) {
            int mask = (j << 1) - 1;
            //the bounds check keeps sizes that are not a power of 2 (such as 65) in range
            if ((localId & mask) == 0 && localId + j < localSize) {
                localSums[localId] += localSums[localId + j];
            }
            barrier(CLK_LOCAL_MEM_FENCE);
        }
        //6 copy dot product (WI 0 only)
        if(localId == 0){
            Output[i] = localSums[0];
        }
        //7 barrier
        //only needed if there is more code after the loop.
        //the barrier in #3 covers this in the case where the loop continues
        //barrier(CLK_LOCAL_MEM_FENCE);
    }
}
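The reduction loop in step 5b can be hard to follow on first reading, so here is a sequential C emulation of it, offered as a sketch only. Each value of id plays the role of one work item, and each pass of the outer loop corresponds to one barrier-separated round. The function name tree_reduce is made up, and the `id + j < size` check is an addition assumed here so that sizes which are not a power of two stay in bounds.

```c
/* Hypothetical emulation of the pairwise (tree) reduction.
 * After the last round sums[0] holds the total of all elements. */
float tree_reduce(float *sums, int size)
{
    for (int j = 1; j < size; j <<= 1) {
        int mask = (j << 1) - 1;
        /* on the GPU these iterations run in parallel, one per work item */
        for (int id = 0; id < size; id++) {
            if ((id & mask) == 0 && id + j < size)
                sums[id] += sums[id + j];
        }
    }
    return sums[0];
}
```

With five elements the rounds use strides 1, 2 and 4, so only about log2(size) barrier-separated rounds are needed instead of size - 1 sequential additions.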
What about more work groups?
This is slightly simplified to allow a single 1x65 work group to compute the entire 1.2M Output. To allow multiple work groups, you could divide the total number of outputs by get_num_groups(0) to calculate the amount of work each group should do (workAmount), and adjust the i for-loop:
for (i = workAmount * get_group_id(0); i< (workAmount * (get_group_id(0)+1) -1); i++)
Step #1 must be changed as well to initialize to the correct starting state for localArray, rather than all 0s.
//1 init local values
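int groupId = get_group_id(0);
int start = workAmount * groupId; //first output index handled by this group
if(groupId == 0 || start - 1 - localId < 0){
localArray[localId] = 0.0f;
}else{
//preload the samples that precede this group's first output
localArray[localId] = Array[start - 1 - localId];
}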
These two changes should allow you to use a more optimal number of work groups; I suggest some multiple of the number of compute units on the device. Try to keep the amount of work for each group in the thousands though. Play around with this, sometimes what seems optimal on a high-level will be detrimental to the kernel when it's running.
Advantages
At almost every point in this kernel, the work items have something to do. The only time fewer than 100% of the items are working is during the reduction loop in step 5b. Read more here about why that is a good thing.
Disadvantages
The barriers will slow down the kernel just by the nature of what barriers do: they pause a work item until the others reach that point. Maybe there is a way you could implement this with fewer barriers, but I still feel this is optimal because of the problem you are trying to solve.
There isn't room for more work items per group, and 65 is not a very optimal size. Ideally, you should try to use a power of 2, or a multiple of 64. This won't be a huge issue though, because there are a lot of barriers in the kernel which makes them all wait fairly regularly.

Weight Initialisation

I plan to use the Nguyen-Widrow Algorithm for an NN with multiple hidden layers. While researching, I found a lot of ambiguities and I wish to clarify them.
The following is pseudo code for the Nguyen-Widrow Algorithm
Initialize all weight of hidden layers with random values
For each hidden layer{
beta = 0.7 * Math.pow(hiddenNeurons, 1.0 / number of inputs);
For each synapse{
For each weight{
Adjust weight by dividing by norm of weight for neuron and multiplying by beta value
}
}
}
Just wanted to clarify whether the value of hiddenNeurons is the size of the particular hidden layer, or the size of all the hidden layers within the network. I got mixed up by viewing various sources.
In other words, if I have a network (3-2-2-2-3) (index 0 is input layer, index 4 is output layer), would the value hiddenNeurons be:
NumberOfNeuronsInLayer(1) + NumberOfNeuronsInLayer(2) + NumberOfNeuronsInLayer(3)
Or just
NumberOfNeuronsInLayer(i) , where i is the current Layer I am at
EDIT:
So, the hiddenNeurons value would be the size of the current hidden layer, and the input value would be the size of the previous hidden layer?
The Nguyen-Widrow initialization algorithm is the following:
1. Initialize all weights of the hidden layers with (ranged) random values
2. For each hidden layer
  2.1 calculate the beta value: 0.7 * Nth root of (#neurons of current layer), where N = #neurons of input layer
  2.2 for each synapse
    2.2.1 for each weight
    2.2.2 adjust the weight by dividing by the norm of the weights for the neuron and multiplying by the beta value
Encog Java Framework
Sounds to me like you want more precise code. Here are some actual code lines from a project I'm participating in. Hope you read C. It's a bit abstracted and simplified. There is a struct nn that holds the neural net data. You probably have your own abstract data type.
Code lines from my project (somewhat simplified):
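float *w = nn->the_weight_array;
float factor = 0.7f * powf( (float) nn->n_hidden, 1.0f / nn->n_input);
unsigned int i;
/* fill all weights with ranged random values
   (the count n_input * n_hidden is inferred from the scaling loop below) */
for( i = 0; i < nn->n_input * nn->n_hidden; i++ )
    w[i] = random_range( -factor, factor );
/* Nguyen/Widrow */
w = nn->the_weight_array;
for( i = nn->n_input; i; i-- ){
    _scale_nguyen_widrow( factor, w, nn->n_hidden );
    w += nn->n_hidden;
}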
Functions called:
static void _scale_nguyen_widrow( float factor, float *vec, unsigned int size )
{
unsigned int i;
float magnitude = 0.0f;
for ( i = 0; i < size; i++ )
magnitude += vec[i] * vec[i];
magnitude = sqrtf( magnitude );
for ( i = 0; i < size; i++ )
vec[i] *= factor / magnitude;
}
static inline float random_range( float min, float max)
{
float range = fabs(max - min);
return ((float)rand()/(float)RAND_MAX) * range + min;
}
Tip:
After you've implemented the Nguyen/Widrow weight initialization, you can add a line of code in the forward calculation that dumps each activation to a file. Then you can check how well the set of neurons hits the activation function. Find the mean and standard deviation. You can even plot it with a plotting tool, e.g. gnuplot. (You need a plotting tool like gnuplot anyway for plotting error rates etc.) I did that for my implementation. The plots came out nice, and the initial learning became much faster using Nguyen/Widrow for my project.
PS: I'm not sure my implementation is correct according to Nguyen and Widrow's intentions. I don't even think I care, as long as it does improve the initial learning.
Good luck,
-Øystein

Implementing matrix multiplication with openCL / C

I understand the theory of matrix multiplication, I just have two questions about this particular kernel implementation:
For reference, num_rows = 32. The matrix B (b_mat) has been transposed before by another kernel, so as I understand it we're dot-ting row vectors together.
1) why do we need to use the param "vectors_per_row" and thus the inner loop? I thought we could just do sum += dot(row of A, row of B), and it seems like this param is splitting up the row into smaller portions (why?).
2) I don't understand the address offset for a_mat and c_mat, i.e. a_mat += start; c_mat += start*4;
__kernel void matrix_mult(__global float4 *a_mat,
__global float4 *b_mat, __global float *c_mat) {
float sum;
int num_rows = get_global_size(0);
int vectors_per_row = num_rows/4;
int start = get_global_id(0) * vectors_per_row;
a_mat += start;
c_mat += start*4;
for(int i=0; i<num_rows; i++) {
sum = 0.0f;
for(int j=0; j<vectors_per_row; j++) {
sum += dot(a_mat[j],
b_mat[i*vectors_per_row + j]);
}
c_mat[i] = sum;
}
}
Your matrix is composed of an array of float4s. float4s are vectors of 4 floats. This is where the 4 comes from. dot only works with the built-in types, so you have to do it on the float4.
c_mat is of type float, which is why it has start*4 and a_mat has start. The offset is because the code is split up across several (potentially hundreds) of threads. Each thread is only calculating a small part of the multiply operation. start is simply where the thread starts computing. This is what the get_global_id(0) is for. It essentially gets your thread id. Technically it's the thread index of the first dimension, but it appears you only have one thread dimension, so here you can just think of it as thread id.
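A plain C sketch may make the indexing easier to see. It assumes square row-major matrices whose dimension n is a multiple of 4 and, as in the question, a B that has already been transposed. The inner loop walks a row in chunks of 4 floats, which is what each float4 dot() does in the kernel. The name matmul_bt_chunked is made up for this illustration.

```c
/* Hypothetical CPU sketch of the kernel's indexing.
 * a and bt are n x n row-major, bt is B transposed, n is a multiple of 4. */
void matmul_bt_chunked(const float *a, const float *bt, float *c, int n)
{
    int vectors_per_row = n / 4;
    for (int row = 0; row < n; row++) {      /* one work item per row */
        const float *a_row = a + row * n;    /* a_mat += start (in float4 units) */
        float *c_row = c + row * n;          /* c_mat += start * 4 (in floats) */
        for (int i = 0; i < n; i++) {
            float sum = 0.0f;
            for (int j = 0; j < vectors_per_row; j++) {
                const float *x = a_row + 4 * j;
                const float *y = bt + (i * vectors_per_row + j) * 4;
                /* equivalent of dot() on two float4 values */
                sum += x[0] * y[0] + x[1] * y[1] + x[2] * y[2] + x[3] * y[3];
            }
            c_row[i] = sum;
        }
    }
}
```

Because a float4 pointer advances 16 bytes per element while a float pointer advances 4, the same row offset is written as start for a_mat and start*4 for c_mat.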

Online entropy evaluation algorithm

Is there a way to evaluate entropy on a stream of discrete values, similar to SummaryStatistics for mean and deviation?
I need this algorithm for a real-time Solr component, and it will probably iterate over large document collections (100,000).
Related question: what is the best way to compute entropy in a MapReduce-like environment?
There may be a way - it depends somewhat on the characteristics of the stream, and what you want to do with the results.
The sample entropy is a function of the sample probability distribution. You can store a running count of each value together with the running total count, which means that the distribution can be calculated on demand. Excuse my sloppy Java, it's been about a year since I wrote it.
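// requires: import java.util.HashMap; import java.util.Map;
Map<K,Integer> runningCount = new HashMap<K,Integer>();
int totalCount = 0;
public void addValue(K k) {
    // start at 1 for a new value, otherwise increment the existing count
    runningCount.merge(k, 1, Integer::sum);
    totalCount += 1;
}
public Map<K,Double> getDistribution() {
    Map<K,Double> dist = new HashMap<K,Double>();
    for (K k : runningCount.keySet()) {
        // cast to double to avoid integer division
        dist.put(k, (double) runningCount.get(k) / totalCount);
    }
    return dist;
}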
This means that you can also compute the entropy on demand:
public double getEntropy() {
    Map<K,Double> dist = getDistribution();
    double entropy = 0;
    for (K k : dist.keySet()) {
        double p = dist.get(k);
        // natural log, so the result is in nats
        entropy -= p * Math.log(p);
    }
    return entropy;
}
This algorithm is O(n) to compute both the distribution and the entropy, where n is the number of values your stream might take on. It is independent of the number of values in the stream, as you can see from the fact that the addValue method doesn't store the stream values.

What is wrong with my low pass filter?

I have an array of int samples ranging from 32766 to -32767. As part of trying to create an envelope detector I've written a low pass filter, but it doesn't seem to be doing the job. Please keep in mind I'm trying to filter an entire array in one shot (no buffers).
This is not streamed, but applied to recorded audio for later playback. It is written in C. An example cutoff argument would be 0.5.
void lopass(int *input, float cutoff, int *output)
{
float sample = 0;
for (int i=1 ; i < (1430529-10); i++) // we will go through all except the last 10 samples
{
for (int j = i; j < (i+10); j++) { // only do this for a WINDOW of a hundred samples
float _in = (float)input[j];
float _out = (float)output[j-1];
sample = (cutoff * _in) + (32766 - (32766*cutoff)) * _out;
}
output[i] = (int)sample;
}
}
I thought that I would run my filtering statement on a window of 10 samples. Not only is it super slow, but it doesn't really do much but seemingly lower the overall amplitude.
If you have any advice, or suggestions (or code!) on how to do this properly, that would be great!
A low-pass filter is basically some variant of averaging a number of values together. That means at least in the normal case your inner loop will accumulate a value. It's hard to guess the exact intent from your code, but you end up with something on the extremely general order of:
sample = 0;
for (int j=i; j<i+10; j++)
sample += input[j];
output[i] = sample / 10;
As it stands right now, this just does averaging, with no cutoff specified -- that means it has a fixed (and fairly slow) cutoff curve. The cutoff is governed only by the number of samples in the window.
To control the cutoff, you do not (at least normally) multiply all the input values by the same amount -- that would basically just modify the scale factor. Instead, you take a set of samples (10 of them, in your case) of the cutoff curve you want to apply, run them through an inverse FFT, and get a set of 10 coefficients. You then apply those coefficients in your loop:
sample = 0;
for (j=0; j<10; j++)
sample += input[i+j] * coefficients[j];
output[i] = sample;
The number of samples in your window isn't normally an input to the design process -- rather, it's an output. You start by specifying the cutoff frequency (as a fraction of the sampling frequency) and the cutoff width, and based on those you compute the necessary window size.
There are quite a few different techniques for computing your coefficients. Regardless of how you compute them, however, you normally end up with something on this general order -- accumulate the sum of the samples in the window, each multiplied by its respective coefficient.
The EE times had a pretty good article on filter design a few years ago.
Don't know if it's relevant, but in your code the inner loop is doing nothing
for (j=???; j<???; j++) {
sample = ???;
}
is the same as
// for (j=???; j<???; j++) {
sample = ???; // for last j
// }
The arithmetic in the filter looks wrong, and as @pmg already pointed out, you are not storing output values correctly. It should probably be:
void lopass(int *input, float cutoff, int *output)
{
float sample = 0.0f;
output[0] = 0.0f;
for (int i = 1 ; i < (1430529 - 10); i++)
{
for (int j = i; j < (i + 10); j++)
{
float _in = (float)input[j];
float _out = (float)output[j-1];
sample = (cutoff * _in) + (1.0f - cutoff) * _out;
output[i] = (int)sample;
}
}
}
There are still a few minor issues to be fixed but this should at least work as a fairly crude single pole recursive (IIR) filter.
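For comparison, here is a minimal sketch of the same single-pole filter without the inner loop, keeping the running output in a float so it isn't truncated between samples. The function name, the explicit length parameter and the name alpha (playing the role of cutoff) are choices made for this sketch.

```c
#include <stddef.h>

/* Hypothetical rewrite: one-pole low-pass,
 * y[i] = y[i-1] + alpha * (x[i] - y[i-1]),
 * which equals alpha * x[i] + (1 - alpha) * y[i-1].
 * alpha is in (0, 1]; smaller values smooth more. The state starts at 0. */
void lopass_onepole(const int *input, int *output, size_t n, float alpha)
{
    float y = 0.0f;                 /* previous output, kept in float */
    for (size_t i = 0; i < n; i++) {
        y += alpha * ((float)input[i] - y);
        output[i] = (int)y;
    }
}
```

Each sample is visited once, so the run time is linear in the number of samples, and with alpha = 1 the output simply follows the input.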
It's a broken moving-window filter of 10 samples in the inner loop (where you actually use only the last sample of the 10), when in your comments you say you want 100 samples in your rectangular filter window.
The first error will give you a filter transition frequency 10X too high.
