Low Pass Filter in OpenCL - C

I am trying to implement a low pass filter in OpenCL and the theory behind all this has me confused a bit. I have attached my code at the bottom after my explanation of the scenario.
First off, let me try to explain the whole scenario in point form.
For the input, we have a cosine signal with a sample size, a frequency, and a step size (the sampling frequency is obtained by multiplying the sample size by the signal frequency).
The value at each step is stored in an array, with the frequency and the step size multiplied into the cosine's argument.
This array is then passed into the kernel, which then will execute the low pass filter function.
Kernel returns an output array with the new filtered values.
The cos function always returns a value in [-1, 1]; the only thing that modifies this value is the frequency. So it may repeat faster or slower depending on the frequency, BUT it always stays within [-1, 1].
This is where I am confused. I am not sure how to apply a low pass filter to these values. Let's say the cutoff for the filter was 100Hz. I can't just say:
if(array[i] > 100 ) { //delete or ignore this value. Else store in an array }
The reason this won't work is that the value of array[i] ranges over [-1, 1]. So how would I apply this filter? What values am I going to compare?
From a physical perspective, I can see how it works: a capacitor and a resistor set the cut-off frequency, and the input is sent through the circuit. But programmatically, I do not see how to implement this. I have seen many implementations of this online, but the code wasn't documented well enough to get a good understanding of what was going on.
Here is the code on my host side:
//Array to hold the information of signal
float *Array;
//Number of sampling points
int sampleSize = 100;
float h = 0;
//Signal Frequency in Hz
float signalFreq = 10;
//Number of points between 0 and max val (T_Sample)
float freqSample = sampleSize*signalFreq;
//Step = max value or T_Sample
float stepSize = 1.0 / freqSample;
//Allocate enough memory for the array
Array = (float*)malloc(sampleSize*sizeof(float));
//Populate the array with modified cosine
for (int i = 0; i < sampleSize; i++) {
    Array[i] = cos(2*CL_M_PI*signalFreq*h);
    h = h + stepSize;
    printf("Value of current sample for cos is: %f \n", Array[i]);
}
My kernel is simply as follows (obviously this is not the code for the filter; this is where I am confused):
__kernel void lowpass(__global float *Array, __local float *cutOffValue, __global float *Output) {
    int idx = get_global_id(0);
    Output[idx] = Array[idx];
}
I found this PDF that implements a lot of filters. Near the end of the document you can find a float implementation of the Low Pass Filter.
http://scholar.uwindsor.ca/cgi/viewcontent.cgi?article=6242&context=etd
In the filter implementation in that PDF, they compare data[j] to a value. Also, I have no idea what numItems or workItems is.
If someone can provide some insight on this that would be great. I have searched a lot of other examples on low pass filters but I just can't wrap my head around the implementation. I hope I made this question clear. Again, I know how/what the low pass filter does. I just have no idea as to what values I need to compare in order for the filtering to take place.
Found this question as well:
Low Pass filter in C
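(For reference while reading that link: the usual C translation of the RC picture is a one-pole IIR smoother. A minimal sketch, with names of my own choosing; note that no sample is ever compared against the cutoff, the cutoff only sets the smoothing factor.)
/* One-pole RC low-pass: y[n] = y[n-1] + alpha * (x[n] - y[n-1]),
   with alpha derived from the analogue RC = 1/(2*pi*fc). */
float rc_lowpass_step(float x, float *state, float cutoff_hz, float sample_rate_hz)
{
    const float pi = 3.14159265f;
    float rc    = 1.0f / (2.0f * pi * cutoff_hz);  /* analogue time constant */
    float dt    = 1.0f / sample_rate_hz;           /* sample period */
    float alpha = dt / (rc + dt);                  /* smoothing factor in (0,1) */
    *state += alpha * (x - *state);                /* move output toward input */
    return *state;
}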

I have a possible solution. What I am attempting is a moving-average FIR filter (which I am told is the easiest form of a low-pass filter one can implement).
What is required:
FIFO buffer
Coefficient values (I generated and obtained mine from matlab for a specific cut-off frequency)
Input and Output arrays for the program
I have not implemented this code wise, but I do understand how to use it on a theoretical level. I have created a diagram below to try and explain the process.
Essentially, from another input array values will be passed into the FIFO buffer one at a time. Every time a value is passed in, the kernel will do a multiplication across the FIFO buffer that has 'n' taps. Each tap has a coefficient value associated with it. So the input at a particular element gets multiplied with the coefficient value and all the values are then accumulated and stored in one element of the output buffer.
Note that the coefficients were generated in Matlab. I didn't know how else to grab these values. At first I was going to just use the coefficient of 1/n, but I am pretty sure that is just going to distort the values of the signal.
And that should do the trick, I am going to implement this in the code now, but if there is anything wrong with this theory feel free to correct it.
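For reference, here is a minimal C sketch of that process (plain C rather than an OpenCL kernel; the tap count and coefficient values are illustrative placeholders, not the actual Matlab output):
#include <string.h>

#define NUM_TAPS 5

/* Placeholder coefficients; a real set would come from a filter-design
   tool (e.g. Matlab) for the chosen cutoff, and should sum to ~1. */
static const float coeff[NUM_TAPS] = { 0.1f, 0.2f, 0.4f, 0.2f, 0.1f };

/* FIFO of the last NUM_TAPS input samples, newest at fifo[0]. */
static float fifo[NUM_TAPS];

/* Push one sample through the filter and return one output sample:
   shift the FIFO, insert the new sample, then accumulate the
   tap-by-tap products. */
float fir_step(float sample)
{
    float acc = 0.0f;
    memmove(&fifo[1], &fifo[0], (NUM_TAPS - 1) * sizeof(float));
    fifo[0] = sample;
    for (int i = 0; i < NUM_TAPS; i++)
        acc += coeff[i] * fifo[i];
    return acc;
}
So each output element would be Output[i] = fir_step(Array[i]).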

Related

Compute efficiently Max and Min on data stream

I'm working in C with a stream of data. Basically I receive a column array of 6 elements every n milliseconds. I would like to compute the max and min value for each row of data.
To make this clear, this is what my data looks like (this is a toy example; in reality I'll have thousands of columns acquired):
[ 6] [-10] [ 5]
[ 1] [  5] [ 3]
[ 5] [ 30] [10]
[ 2] [-10] [ 0]
[-2] [  5] [10]
[-5] [  0] [ 1]
So basically (as I said) I receive a column of data every n milliseconds, and I want to compute the max and min value row-wise. So in my previous example my result would be:
max_values=[6,5,30,2,10,1]
min_values=[-10,1,5,-10,-2,-5]
I want to point out that I have no access to the full matrix, I can only work over single columns of 6 elements that I receive every n milliseconds.
This is my simple code algorithm so far (I'm omitting the whole code since it's part of a bigger project):
for(int i=0;i<6;i++){
    if(input[i]>temp_max[i]){
        temp_max[i]=input[i];
    }
    if(input[i]<temp_min[i]){
        temp_min[i]=input[i];
    }
}
Where input, temp_max and temp_min are all float arrays of dimension 6.
Basically my code executes this piece of code every time a new input array is available and updates the maximum and minimum accordingly.
Since I'm interested in performance (this is going to run on an embedded system), is there any way to improve this part of the code? Calling a comparison for each single element of the two arrays doesn't seem like the smartest idea.
With random input data (i.e. unordered data), it'll be pretty hard (aka impossible) to find min/max without a comparison per element.
You may get some minor improvement from something like:
for(int i=0;i<6;i++){
    if(input[i]>temp_max[i]){
        temp_max[i]=input[i];
    }
    else // If the current element was a new max, it cannot also be a new min
    {
        if(input[i]<temp_min[i]){
            temp_min[i]=input[i];
        }
    }
}
but I doubt this will be a significant improvement.
Branching is slow, especially on embedded systems. Scalar computation too.
Luckily, your target processor appears to be an ARM-based processor supporting the NEON SIMD instruction set (apparently one based on a 64-bit ARMv8 A53 architecture). NEON can compute 4 32-bit floating-point operations at once. This should be much faster than the current code (which compilers apparently fail to vectorize).
Here is an example code (untested):
#include <arm_neon.h>

void minmax_optim(float temp_min[6], float temp_max[6], float input[6]) {
    /* Compute the first 4 floats */
    float32x4_t vInput = vld1q_f32(input);
    float32x4_t vMin = vld1q_f32(temp_min);
    float32x4_t vMax = vld1q_f32(temp_max);
    vMin = vminq_f32(vInput, vMin);
    vMax = vmaxq_f32(vInput, vMax);
    vst1q_f32(temp_min, vMin);
    vst1q_f32(temp_max, vMax);
    /* Remaining 2 floats */
    float32x2_t vLastInput = vld1_f32(input+4);
    float32x2_t vLastMin = vld1_f32(temp_min+4);
    float32x2_t vLastMax = vld1_f32(temp_max+4);
    vLastMin = vmin_f32(vLastInput, vLastMin);
    vLastMax = vmax_f32(vLastInput, vLastMax);
    vst1_f32(temp_min+4, vLastMin);
    vst1_f32(temp_max+4, vLastMax);
}
The resulting code should be much faster. One can see on Godbolt that the number of instructions in this vectorized implementation is drastically smaller than in the reference implementation, with no conditional jump instructions.
You nailed it -- you have to keep temporary max and min arrays. Unfortunately, strictly within C, that seems to be the only possible, and thus the most performant, algorithm.
Since you've mentioned it's going to run on an embedded system (but omitted which), please make sure you have hardware floating-point support. If you don't, that's going to carry a high performance penalty. If you have high-end hardware, you can look into the availability of vector instructions, but that is platform-specific, possibly requiring assembly.
To my impression the approach as such cannot be substantially improved, as the input is not available as a whole. That being said, the inner comparisons can be compacted. The assignments
if(input[i]>temp_max[i]){
    temp_max[i]=input[i];
}
if(input[i]<temp_min[i]){
    temp_min[i]=input[i];
}
can be improved to
if(input[i]>temp_max[i]){
    temp_max[i]=input[i];
}
else if(input[i]<temp_min[i]){
    temp_min[i]=input[i];
}
because if the current value replaces the temporary maximum, it cannot also replace the temporary minimum (assuming some sensible initialization).
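For example, a sensible initialization is to seed both arrays from the very first received column, so the running extremes start from real data (a sketch; the function name is mine):
#include <string.h>

/* Call once, with the first column, before the update loop runs. */
void minmax_init(float temp_min[6], float temp_max[6], const float first[6])
{
    memcpy(temp_min, first, 6 * sizeof(float));
    memcpy(temp_max, first, 6 * sizeof(float));
}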
This handles only the max, but it is easy to extend:
#define MAX(a,b,c) ((a) > (b) ? ((a) > (c) ? (a) : (c)) : ((b) > (c) ? (b) : (c)))

void rowmax(int *a, int *b, int *c, int *result, size_t size)
{
    for(size_t index = 0; index < size; index++)
    {
        result[index] = MAX(a[index], b[index], c[index]);
    }
}

Dynamically indexing an array in C

Is it possible to create arrays based on their indices, as in
int x = 4;
int y = 5;
int someNr = 123;
int foo[x][y] = someNr;
dynamically/on the fly, without creating foo[0...3][0...4]?
If not, is there a data structure that allow me to do something similar to this in C?
No.
As written, your code makes no sense at all. You need foo to be declared somewhere, and then you can index into it with foo[x][y] = someNr;. But you can't just make foo spring into existence, which is what it looks like you are trying to do.
Either create foo with the correct sizes (only you can say what they are), int foo[16][16]; for example, or use a different data structure.
In C++ you could do a map<pair<int, int>, int>
Variable Length Arrays
Even if x and y were replaced by constants, you could not initialize the array using the notation shown. You'd need to use:
int fixed[3][4] = { someNr };
or similar (extra braces, perhaps; more values perhaps). You can, however, declare/define variable length arrays (VLA), but you cannot initialize them at all. So, you could write:
int x = 4;
int y = 5;
int someNr = 123;
int foo[x][y];

for (int i = 0; i < x; i++)
{
    for (int j = 0; j < y; j++)
        foo[i][j] = someNr + i * (x + 1) + j;
}
Obviously, you can't use x and y as indexes without writing (or reading) outside the bounds of the array. The onus is on you to ensure that there is enough space on the stack for the values chosen as the limits on the arrays (it won't be a problem at 3x4; it might be at 300x400 though, and will be at 3000x4000). You can also use dynamic allocation of VLAs to handle bigger matrices.
VLA support is mandatory in C99, optional in C11 and C18, and non-existent in strict C90.
Sparse arrays
If what you want is 'sparse array support', there is no built-in facility in C that will assist you. You have to devise (or find) code that will handle it for you. It can certainly be done; Fortran programmers had to do it quite often in the bad old days, when megabytes of memory were a luxury, MIPS meant millions of instructions per second, and people were happy when their computer could do double-digit MIPS (and the Fortran 90 standard was still years in the future).
You'll need to devise a structure and a set of functions to handle the sparse array. You will probably need to decide whether you have values in every row, or whether you only record the data in some rows. You'll need a function to assign a value to a cell, and another to retrieve the value from a cell. You'll need to think what the value is when there is no explicit entry. (The thinking probably isn't hard. The default value is usually zero, but an infinity or a NaN (not a number) might be appropriate, depending on context.) You'd also need a function to allocate the base structure (would you specify the maximum sizes?) and another to release it.
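As an illustration only (the linked-list layout and all names here are my own choice, not a standard facility), a minimal sparse array with a default value of zero might look like this:
#include <stdio.h>
#include <stdlib.h>

/* One explicitly stored cell; any cell not in the list reads as 0. */
typedef struct cell {
    int row, col;
    int value;
    struct cell *next;
} cell;

typedef struct {
    cell *head;
} sparse_array;

/* Set (or overwrite) the value at (row, col). */
void sparse_set(sparse_array *a, int row, int col, int value)
{
    for (cell *c = a->head; c; c = c->next)
        if (c->row == row && c->col == col) {
            c->value = value;
            return;
        }
    cell *c = malloc(sizeof *c);
    c->row = row; c->col = col; c->value = value;
    c->next = a->head;
    a->head = c;
}

/* Get the value at (row, col); unset cells yield the default 0. */
int sparse_get(const sparse_array *a, int row, int col)
{
    for (const cell *c = a->head; c; c = c->next)
        if (c->row == row && c->col == col)
            return c->value;
    return 0;
}

/* Release the base structure. */
void sparse_free(sparse_array *a)
{
    while (a->head) {
        cell *next = a->head->next;
        free(a->head);
        a->head = next;
    }
}

int main(void)
{
    sparse_array a = { NULL };
    sparse_set(&a, 4, 5, 123);   /* the effect of foo[4][5] = 123 */
    printf("%d %d\n", sparse_get(&a, 4, 5), sparse_get(&a, 0, 0));
    sparse_free(&a);
    return 0;
}
A linked list keeps the sketch short; for large arrays, a hash table or per-row index would make lookups faster.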
The most efficient way to create a dynamic index of an array is to create an empty array of the same data type as the array to be indexed.
Let's imagine we are using integers for the sake of simplicity. You can then stretch the concept to any other data type.
The ideal index depth will depend on the length of the data to index, and will be somewhere close to the length of the data.
Let's say you have 1 million 64 bit integers in the array to index.
First of all you should order the data and eliminate duplicates. That's easy to achieve by using qsort() (the built-in C quicksort function) and a remove-duplicates function such as
uint64_t remove_dupes(uint64_t *unord_arr, uint64_t *ord_arr, uint64_t arr_size)
{
    uint64_t i, j = 0;
    for (i = 1; i < arr_size; i++)
    {
        if (unord_arr[i] != unord_arr[i-1]) {
            ord_arr[j] = unord_arr[i-1];
            j++;
        }
        if (i == arr_size-1) {
            ord_arr[j] = unord_arr[i];
            j++;
        }
    }
    return j;
}
Adapt the code above to your needs; you should free() the unordered array once the function has copied the unique values into the ordered array. The function above is very fast, though it returns zero entries when the array to order contains a single element, but that's probably something you can live with.
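For completeness, a usage sketch of the sort-then-compact step (the comparator is mine; note that it must not simply subtract, since a uint64_t difference does not fit in the int that qsort() expects):
#include <stdint.h>
#include <stdlib.h>

/* Three-way comparison for qsort over uint64_t values. */
static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a;
    uint64_t y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* ... */
qsort(unord_arr, arr_size, sizeof(uint64_t), cmp_u64);
uint64_t *ord_arr = malloc(arr_size * sizeof(uint64_t));
uint64_t unique_count = remove_dupes(unord_arr, ord_arr, arr_size);
free(unord_arr);  /* as noted above, release the unordered array */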
Once the data is ordered and unique, create an index with a length close to that of the data. It does not need to be an exact length, although sticking to powers of 10 makes everything easier in the case of integers.
uint64_t *idx = calloc((size_t)pow(10, indexdepth), sizeof(uint64_t));
This will create an empty index array.
Then populate the index. Traverse the array to be indexed just once, and every time you detect a change in the leading significant figures (as many of them as the index depth), record the position where that new prefix was first detected.
If you choose an indexdepth of 2 you will have 10² = 100 possible values in your index, typically going from 0 to 99.
When you detect that some number starts with 10 (e.g. 103456), you add an entry to the index. Let's say 103456 was detected at position 733; your index entry would be:
index[10] = 733;
The next entry, beginning with 11, should be added in the next index slot. Let's say the first number beginning with 11 is found at position 2023:
index[11] = 2023;
And so on.
When you later need to find some number in your original array storing 1 million entries, you don't have to iterate the whole array. You just need to check where in your index the segment for the first two significant digits starts: entry index[10] tells you where the first number starting with 10 is stored. You can then iterate forward until you find your match.
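To make that concrete, here is an untested sketch (all names are mine) using an index depth of 2 and the zero-padding idea discussed a few lines below, with values assumed to be below 10^6 so the two leading zero-padded digits select the segment:
#include <stdint.h>

#define IDX_SIZE 100          /* index depth 2 => 10^2 segments */
#define DIVISOR  10000ULL     /* values < 10^6: the two leading digits of six */

/* Build the index over sorted, de-duplicated data: seg_start[p] holds
   the position of the first element whose two leading (zero-padded)
   digits are p, or n when that segment is empty. */
void build_index(const uint64_t *data, uint64_t n, uint64_t seg_start[IDX_SIZE])
{
    for (int p = 0; p < IDX_SIZE; p++)
        seg_start[p] = n;                  /* n marks "no such prefix" */
    for (uint64_t i = n; i-- > 0; )        /* walk backwards so the    */
        seg_start[data[i] / DIVISOR] = i;  /* first occurrence wins    */
}

/* Find key: jump straight to its segment and scan only that stretch. */
int64_t indexed_find(const uint64_t *data, uint64_t n,
                     const uint64_t seg_start[IDX_SIZE], uint64_t key)
{
    for (uint64_t i = seg_start[key / DIVISOR]; i < n && data[i] <= key; i++)
        if (data[i] == key)
            return (int64_t)i;
    return -1;                             /* not present */
}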
In my example I employed a small index, thus the average number of iterations that you will need to perform will be 1000000/100 = 10000.
If you enlarge your index to somewhere close to the length of the data, the number of iterations will tend to 1, making any search blazing fast.
What I like to do is to create some simple algorithm that tells me what's the ideal depth of the index after knowing the type and length of the data to index.
Please note that in the example I have posted, 64-bit numbers are indexed by their first indexdepth significant figures, thus 10 and 100001 will be stored in the same index segment. That's not a problem on its own; nonetheless, each master has his small book of secrets: treating the numbers as fixed-length hexadecimal strings can help keep a strict numerical order.
You don't have to change the base, though; you could consider 10 to be 0000010 to keep it in the 00 index segment and keep base-10 numbers ordered. Using different numerical bases is in any case trivial in C, which is of great help for this task.
As you make your index depth larger, the number of entries per index segment shrinks.
Please note that programming, especially at a lower level like C, consists in great part of understanding the trade-off between CPU cycles and memory use.
Creating the proposed index is a way to reduce the number of CPU cycles required to locate a value, at the cost of using more memory as the index grows. This is nonetheless the way to go nowadays, as massive amounts of memory are cheap.
As SSD speeds get closer to that of RAM, using files to store indexes is worth taking into account. Nevertheless, modern OSs tend to load into RAM as much as they can, so using files would end up performing similarly.

Led bar indicator flickering within adjacent values and how to avoid this (embedded-C)

I am designing a measurement instrument that has a visible user output on a 30-LED bar. The program logic acts in this fashion (pseudo-code):
while(1)
{
    (1) Sensor_Read();
    (2) Transform_counts_into_leds();
    (3) Send_to_bar();
}
The relevant function, (2), is a simple algorithm that transforms the counts from the I2C sensor into a value serially sent to the shift registers that control the individual LEDs.
The variable sent to function (3) is simply the number of LEDs that have to stay on (0 for all LEDs off, 30 for all LEDs on):
uint8_t Transform_counts_into_leds(uint16_t counts)
{
    float on_leds;
    on_leds = (uint8_t)(counts * 0.134); /* 0.134 is a dummy value */
    return on_leds;
}
Using this program logic, when the counts value is on the threshold between two LEDs, the next LED flickers.
I think this is a bad user experience for my device, and I want the LEDs, once lit, to stay stable over a small range of values.
QUESTION: How a solution to this problem could be implemented in my project?
Hysteresis is useful for a number of applications, but I would suggest it is not appropriate in this instance. The problem is that if the level genuinely falls from, say, 8 to 7, you would not see any change until at least one sample hit 6; the display would jump to 6, and there would have to be a sample of 8 before it went back to 7.
A more appropriate solution in this case is a moving average, although it is simpler and more useful to use a moving sum and exploit the higher resolution that gives. For example, a moving sum of 16 effectively adds (almost) 4 bits of resolution, making an 8-bit sensor effectively 12-bit, at the cost of bandwidth of course; you don't get something for nothing. In this case, lower bandwidth (i.e. less responsiveness to higher frequencies) is exactly what you need.
Moving sum:
#define BUFFER_LEN 16
#define SUM_MAX (255 * BUFFER_LEN)
#define LED_MAX 30

uint8_t buffer[BUFFER_LEN] = {0};
int index = 0;
uint16_t sum = 0;

for(;;)
{
    uint8_t sample = Sensor_Read();

    // Maintain the sum of buffered values by
    // subtracting the oldest buffered value and
    // adding the new sample
    sum -= buffer[index];
    sum += sample;

    // Replace the oldest sample with the new sample
    // and increment the index to the next oldest sample
    buffer[index] = sample;
    index = (index + 1) % BUFFER_LEN;

    // Transform to LED bar level (widened in case int is only 16-bit)
    int led_level = (int)(((uint32_t)LED_MAX * sum) / SUM_MAX);

    // Show level
    setLedBar(led_level);
}
The underlying problem -- displaying sensor data in a human-friendly way -- is very interesting. Here's my approach in pseudocode:
Loop:
    Read sensor
    If sensor outside valid range:
        Enable warning LED
        Sleep in a low-power state for a while
        Restart loop
    Else:
        Disable warning LED
    Filter sensor value
    Compute display value from sensor value with extra precision:
        If new display value differs sufficiently from current value:
            Update current displayed value
    Update display with scaled-down display value
Filtering deals with noise in the measurements. Filtering smoothes out any sudden changes in the measurement, removing sudden spikes. It is like erosion, turning sharp and jagged mountains into rolling fells and hills.
Hysteresis hides small changes, but does not otherwise filter the results. Hysteresis won't affect noisy or jagged data, it only hides small changes.
Thus, the two are separate, but complementary methods, that affect the readout in different ways.
Below, I shall describe two different filters, and two variants of simple hysteresis implementation suitable for numeric and bar graph displays.
If possible, I'd recommend you write some scripts or test programs that output the input data and the variously filtered output data, and plot it in your favourite plotting program (mine is Gnuplot). Or, better yet, experiment! Nothing beats practical experiments for human interface stuff (at least if you use existing suggestions and known theory as your basis, and leap forward from there).
Moving average:
You create an array of N sensor readings, updating them in a round-robin fashion, and using their average as the current reading. This produces very nice (as in human-friendly, intuitive) results, as only the N latest sensor readings affect the average.
When the application is first started, you should copy the very first reading into all N entries in the averaging array. For example:
#define SENSOR_READINGS 32

int sensor_reading[SENSOR_READINGS];
int sensor_reading_index;

void sensor_init(const int reading)
{
    int i;
    for (i = 0; i < SENSOR_READINGS; i++)
        sensor_reading[i] = reading;
    sensor_reading_index = 0;
}

int sensor_update(const int reading)
{
    int i, sum;
    sensor_reading_index = (sensor_reading_index + 1) % SENSOR_READINGS;
    sensor_reading[sensor_reading_index] = reading;
    sum = sensor_reading[0];
    for (i = 1; i < SENSOR_READINGS; i++)
        sum += sensor_reading[i];
    return sum / SENSOR_READINGS;
}
At start-up, you call sensor_init() with the very first valid sensor reading, and sensor_update() with the following sensor readings. The sensor_update() will return the filtered result.
The above works best when the sensor is regularly polled, and SENSOR_READINGS can be chosen large enough to properly filter out any unwanted noise in the sensor readings. Of course, the array requires RAM, which may be on short supply in some microcontrollers.
Exponential smoothing:
When there is not enough RAM to use a moving average to filter data, an exponential smoothing filter is often applied.
The idea is that we keep an average value, and recalculate the average using each new sensor reading using (A * average + B * reading) / (A + B). The effect of each sensor reading on the average decays exponentially: the weight of the most current sensor reading is always B/(A+B), the weight of the previous one is A*B/(A+B)^2, the weight of the one before that is A^2*B/(A+B)^3, and so on (^ indicating exponentiation); the weight of the n'th sensor reading in the past (with current one being n=0) is A^n*B/(A+B)^(n+1).
The code corresponding to the previous filter is now
#define SENSOR_AVERAGE_WEIGHT 31
#define SENSOR_CURRENT_WEIGHT 1

int sensor_reading;

void sensor_init(const int reading)
{
    sensor_reading = reading;
}

int sensor_update(const int reading)
{
    return sensor_reading = (sensor_reading * SENSOR_AVERAGE_WEIGHT +
                             reading * SENSOR_CURRENT_WEIGHT) /
                            (SENSOR_AVERAGE_WEIGHT + SENSOR_CURRENT_WEIGHT);
}
Note that if you choose the weights so that their sum is a power of two, most compilers optimize the division into a simple bit shift.
Applying hysteresis:
(This section, including example code, edited on 2016-12-22 for clarity.)
Proper hysteresis support involves keeping the displayed value in higher precision than is used for output. Otherwise, your output value with hysteresis applied will never change by a single unit, which I would consider a bad design in a user interface. (I'd much prefer a value to flicker between two consecutive values every few seconds, to be honest -- and that's what I see in e.g. the weather stations I like best with good temperature sensors.)
There are two typical variants in how hysteresis is applied to readouts: fixed, and dynamic. Fixed hysteresis means that the displayed value is updated whenever the value differs by a fixed limit; dynamic means the limits are set dynamically. (The dynamic hysteresis is much rarer, but it may be very useful when coupled with the moving average; one can use the standard deviation (or error bars) to set the hysteresis limits, or set asymmetric limits depending on whether the new value is smaller or greater than the previous one.)
The fixed hysteresis is very simple to implement. First, because we need to apply the hysteresis to a higher-precision value than the output, we choose a suitable multiplier. That is, display_value = value / DISPLAY_MULTIPLIER, where value is the possibly filtered sensor value, and display_value is the integer value displayed (number of bars lit, for example).
Note that below, display_value and the value returned by the functions refer to the integer value displayed, for example the number of lit LED bars; value is the (possibly filtered) sensor reading, and saved_value contains the sensor reading that is currently displayed.
#define DISPLAY_HYSTERESIS 10
#define DISPLAY_MULTIPLIER 32

int saved_value;

void display_init(const int value)
{
    saved_value = value;
}

int display_update(const int value)
{
    const int delta = value - saved_value;
    if (delta < -DISPLAY_HYSTERESIS ||
        delta > DISPLAY_HYSTERESIS)
        saved_value = value;
    return saved_value / DISPLAY_MULTIPLIER;
}
The delta is just the difference between the new sensor value, and the sensor value corresponding to the currently displayed value.
The effective hysteresis, in units of displayed value, is DISPLAY_HYSTERESIS/DISPLAY_MULTIPLIER = 10/32 = 0.3125 here. It means that the internal saved value can be updated about three times before a visible change is seen (if e.g. slowly decreasing or increasing; more if the value is just fluctuating, of course). This eliminates rapid flickering between two visible values (when the value is in the middle of two displayed values), but ensures the error of the reading is less than half a display unit on average (half plus the effective hysteresis in the worst case).
In a real-life application, you usually use a more complete form, return (saved_value * DISPLAY_SCALE + DISPLAY_OFFSET) / DISPLAY_MULTIPLIER, which scales the filtered sensor value by DISPLAY_SCALE/DISPLAY_MULTIPLIER and moves the zero point by DISPLAY_OFFSET/DISPLAY_MULTIPLIER, both evaluated at 1.0/DISPLAY_MULTIPLIER precision, using only integer operations. For simplicity, however, I'll just assume that to derive the display value, say the number of lit LED bars, you divide the sensor value by DISPLAY_MULTIPLIER. In either case, the hysteresis is DISPLAY_HYSTERESIS/DISPLAY_MULTIPLIER of the output unit. Ratios of about 0.1 to 0.5 work fine; the test values used here, 10 and 32, yield 0.3125, which is about midway through the range of ratios that I believe work best.
Dynamic hysteresis is very similar to above:
#define DISPLAY_MULTIPLIER 32

int saved_value_below;
int saved_value;
int saved_value_above;

void display_init(const int value, const int below, const int above)
{
    saved_value_below = below;
    saved_value = value;
    saved_value_above = above;
}

int display_update(const int value, const int below, const int above)
{
    if (value < saved_value - saved_value_below ||
        value > saved_value + saved_value_above) {
        saved_value_below = below;
        saved_value = value;
        saved_value_above = above;
    }
    return saved_value / DISPLAY_MULTIPLIER;
}
Note that if DISPLAY_HYSTERESIS*2 <= DISPLAY_MULTIPLIER, the displayed value is always within a display unit of the actual (filtered) sensor value. In other words, hysteresis can easily deal with flickering, but it does not need to add much error to the displayed value.
In many practical cases the best amount of hysteresis to apply depends on the amount of short-term variation in the sensor samples. This includes not only noise, but also the types of signals that are to be measured. A hysteresis of just 0.3 (relative to the output unit) is sufficient to completely eliminate the flicker when sensor readings flip the filtered sensor value between two consecutive integers that map to different integer outputs, as it ensures that the filtered sensor value must change by at least 0.3 (in output display units) before it effects a change in the display.
The maximum error with hysteresis is half a display unit plus the current hysteresis. The half unit is the minimum error possible (since consecutive units are one unit apart, so when the true value is in the middle, either value shown is correct to within half a unit). With dynamic hysteresis, you always start with some fixed hysteresis value when a reading changes enough, but when the reading stays within the hysteresis, you instead just decrease the hysteresis (if greater than zero). This approach leads to a changing sensor value being tracked correctly (maximum error being half a unit plus the initial hysteresis), but a relatively static value being displayed as accurately as possible (at half a unit maximum error). I don't show an example of this, because it adds another tunable (how the hysteresis decays towards zero), and requires that you verify (calibrate) the sensor (including any filtering) first; otherwise it's like polishing a turd: possible, but not useful.
Also note that if you have 30 bars in the display, you actually have 31 states (zero bars, one bar, .., 30 bars), and thus the proper range for the value is 0 to 31*DISPLAY_MULTIPLIER - 1, inclusive.

Remove 1000Hz tone from FFT array in C

I have an array of doubles which is the result of an FFT applied to an array containing the audio data of a WAV file to which I have added a 1000Hz tone.
I obtained this array through the drealft routine defined in "Numerical Recipes". (I must use it.)
(The original array has a length that is a power of two.)
My array has this structure:
array[0] = first real valued component of the complex transform
array[1] = last real valued component of the complex transform
array[2] = real part of the second element
array[3] = imaginary part of the second element
etc......
Now, I know that this array represents the frequency domain.
I want to determine and kill the 1000Hz frequency.
I have tried this formula for finding the index of the array which should contain the 1000Hz frequency:
index = 1000. * NElements /44100;
Also, since I assumed that this index refers to an array with real values only, I determined the correct(?) position in my array, which contains imaginary values too:
int correctIndex = 2;
for (k = 0; k < index; k++) {
    correctIndex += 2;
}
(I know there is surely an easier way, but it is the first that came to mind.)
Then I found this value: 16275892957.123705, which I supposed to be the real part of the 1000Hz frequency. (Sorry if this is an imprecise statement, but at the moment I do not care to know more about it.)
So I tried to suppress it:
array[index]=-copy[index]*0.1f;
I don't know exactly why I used this formula, but it is the only one that gives some results; in fact the 1000Hz tone appears to decrease slightly.
This is the part of the code in question:
double *copy = malloc( nCampioni * sizeof(double));
int nSamples;

/*...Fill copy with audio data...*/
/*...Apply ZERO PADDING and reach the length of 8388608 samples,
     or rather 8388608 double values...*/

/*Apply the FFT (sure this works)*/
drealft(copy - 1, nSamples, 1);

/*I determine the REAL(?) array index*/
i = 1000. * nSamples /44100;

/*I determine MINE(?) array index*/
int j = 2;
for (k = 0; k < i; k++) {
    j += 2;
}

/*I reduce the array value, AND some other values around it as an attempt*/
for (i = -12; i < 12; i += 2) {
    copy[j-i] = -copy[i-j]*0.1f;
    printf("%d\n", j-i);
}

/*Apply the inverse FFT*/
drealft(copy - 1, nSamples, -1);

/*...Write the audio data to the file...*/
NOTE: for simplicity I omitted the part where I get an array of double from an array of int16_t
How can i determine and totally kill the 1000Hz frequency?
Thank you!
As Oli Charlesworth writes, because your target frequency is not exactly one of the FFT bins (your index, TargetFrequency * NumberOfElements / SamplingRate, is not exactly an integer), the energy of the target frequency will be spread across all bins. For a start, you can eliminate some of the frequency by zeroing the bin closest to the target frequency. This will of course affect other frequencies somewhat too, since it is slightly off target. To better suppress the target frequency, you will need to consider a more sophisticated filter.
However, for educational purposes: To suppress the frequency corresponding to a bin, simply set that bin to zero. You must set both the real and the imaginary components of the bin to zero, which you can do with:
copy[index*2 + 0] = 0;
copy[index*2 + 1] = 0;
Some notes about this:
You had this code to calculate the position in the array:
int correctIndex = 2;
for (k = 0; k < index; k++) {
    correctIndex += 2;
}
That is equivalent to:
correctIndex = 2*(index+1);
I believe you want 2*index, not 2*(index+1). So you were likely reducing the wrong bin.
At one point in your question, you wrote array[index] = -copy[index]*0.1f;. I do not know what array is. You appeared to be working in place in copy. I also do not know why you multiplied by 1/10. If you want to eliminate a frequency, just set it to zero. Multiplying it by 1/10 only reduces it to 10% of its original magnitude.
I understand that you must pass copy-1 to drealft because the Numerical Recipes code uses one-based indexing. However, the C standard does not support the way you are doing it. The behavior of the expression copy-1 is not defined by the standard. It will work in most C implementations. However, to write supported portable code, you should do this instead:
// Allocate one extra element.
double *memory = malloc((nCampioni+1) * sizeof *memory);
// Make a pointer that is convenient for your work.
double *copy = memory+1;
…
// Pass the necessary base address to drealft.
drealft(memory, nSamples, 1);
// Suppress a frequency.
copy[index*2 + 0] = 0;
copy[index*2 + 1] = 0;
…
// Free the memory.
free(memory);
One experiment I suggest you consider is to initialize an array with just a sine wave at the desired frequency:
for (i = 0; i < nSamples; ++i)
    copy[i] = sin(TwoPi * Frequency / SampleRate * i);
(TwoPi is of course 2*3.1415926535897932384626433.) Then apply drealft and look at the results. You will see that much of the energy is at a peak in the closest bin to the target frequency, but much of it has also spread to other bins. Clearly, zeroing a single bin and performing the inverse FFT cannot eliminate all of the frequency. Also, you should see that the peak is in the same bin you calculated for index. If it is not, something is wrong.

Optimizing C loops

I'm new to C after many years of Matlab for numerical programming. I've developed a program to solve a large system of differential equations, but I'm pretty sure I've done something stupid: after profiling the code, I was surprised to see three loops taking ~90% of the computation time, despite the fact that they perform the most trivial steps of the program.
My question is in three parts based on these expensive loops:
Initialization of an array to zero. When J is declared to be a double array, are the values of the array initialized to zero? If not, is there a fast way to set all the elements to zero?
void spam(){
    double J[151][151];
    /* Other relevant variables declared */
    calcJac(data, J, y);
    /* Use J */
}

static void calcJac(UserData data, double J[151][151], N_Vector y)
{
    /* The first expensive loop */
    int iter, jter;
    for (iter = 0; iter < 151; iter++) {
        for (jter = 0; jter < 151; jter++) {
            J[iter][jter] = 0;
        }
    }
    /* More code to populate J from data and y that runs very quickly */
}
During the course of solving, I need to solve matrix equations defined by P = I - gamma*J. The construction of P is taking longer than solving the system of equations it defines, so something I'm doing is likely in error. In the relatively slow loop below, is accessing a matrix contained in the structure 'data' the slow component, or is it something else about the loop?
for (iter = 1; iter < 151; iter++) {
    for (jter = 1; jter < 151; jter++) {
        P[iter-1][jter-1] = - gamma*(data->J[iter][jter]);
    }
}
Is there a best practice for matrix multiplication? In the loop below, Ith(v,iter) is a macro for getting the iter-th component of a vector held in the N_Vector structure 'v' (a data type used by the Sundials solvers). Particularly, is there a best way to get the dot product between v and the rows of J?
Jv_scratch = 0;
int iter, jter;
for (iter = 1; iter < 151; iter++) {
    for (jter = 1; jter < 151; jter++) {
        Jv_scratch += J[iter][jter]*Ith(v,jter);
    }
    Ith(Jv,iter) = Jv_scratch;
    Jv_scratch = 0;
}
1) No, they're not. You can memset the array as follows:
memset( J, 0, sizeof( double ) * 151 * 151 );
or you can use an array initialiser:
double J[151][151] = { 0.0 };
2) Well, you are using a fairly complex calculation to compute the positions in P and in J.
You may well get better performance by stepping through with pointers:
for (iter = 1; iter < 151; iter++)
{
    double *pP = &P[iter-1][0];
    double *pJ = &data->J[iter][1];

    for (jter = 1; jter < 151; jter++, pP++, pJ++)
    {
        *pP = -gamma * *pJ;
    }
}
This way you move the array index calculations out of the inner loop.
3) The best practice is to try to move as many calculations out of the loop as possible, much like I did in the loop above.
First, I'd advise you to split your question into three separate questions; it's hard to answer all three. I, for example, have not worked much with numerical analysis, so I'll only answer the first one.
Variables on the stack are not initialized for you, but there are faster ways to initialize them. In your case I'd advise using memset:
static void calcJac(UserData data, double J[151][151], N_Vector y)
{
    memset((void*)J, 0, sizeof(double) * 151 * 151);
    /* More code to populate J from data and y that runs very quickly */
}
memset is a fast library routine to fill a region of memory with a specific pattern of bytes. It just so happens that setting all bytes of a double to zero sets the double to zero, so take advantage of your library's fast routines (which will likely be written in assembler to take advantage of things like SSE).
Others have already answered some of your questions. On the subject of matrix multiplication: it is difficult to write a fast algorithm for this unless you know a lot about cache architecture and so on (the slowness is caused by the order in which you access array elements, which causes thousands of cache misses).
You can try Googling for terms like "matrix-multiplication", "cache", "blocking" if you want to learn about the techniques used in fast libraries. But my advice is to just use a pre-existing maths library if performance is key.
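For a flavour of what "blocking" means, here is an illustrative sketch only, not a tuned library kernel (row-major square matrices assumed, with C zeroed by the caller before the call):
#include <stddef.h>

#define N     151
#define BLOCK 32   /* tile edge; tune so the working set fits in L1 cache */

/* Blocked (tiled) matrix multiply: C += A * B.
   Processing BLOCK x BLOCK tiles keeps the operands hot in cache while
   they are reused, which is where the naive triple loop loses its time. */
void matmul_blocked(const double A[N][N], const double B[N][N], double C[N][N])
{
    for (size_t ii = 0; ii < N; ii += BLOCK)
        for (size_t kk = 0; kk < N; kk += BLOCK)
            for (size_t i = ii; i < ii + BLOCK && i < N; i++)
                for (size_t k = kk; k < kk + BLOCK && k < N; k++) {
                    double a = A[i][k];             /* reused across the j loop */
                    for (size_t j = 0; j < N; j++)  /* contiguous in B and C    */
                        C[i][j] += a * B[k][j];
                }
}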
Initialization of an array to zero. When J is declared to be a double array are the values of the array initialized to zero? If not, is there a fast way to set all the elements to zero?
It depends on where the array is allocated. If it is declared at file scope, or as static, then the C standard guarantees that all elements are set to zero. The same is guaranteed if you set the first element to a value upon initialization, i.e.:
double J[151][151] = {0}; /* set first element to zero */
By setting the first element to something, the C standard guarantees that all other elements in the array are set to zero, as if the array were statically allocated.
Practically, for this specific case, I very much doubt it would be wise to allocate 151*151*sizeof(double) bytes (roughly 180 KB) on the stack, no matter which system you are using. You will likely have to allocate the array dynamically, and then none of the above matters; you must then use memset() to set all bytes to zero.
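A sketch of that dynamic route (a pointer-to-array keeps the J[i][j] syntax working; calloc() could replace the malloc()+memset() pair):
#include <stdlib.h>
#include <string.h>

double (*J)[151] = malloc(151 * sizeof *J);  /* 151 rows of 151 doubles */
if (J != NULL) {
    memset(J, 0, 151 * sizeof *J);           /* zero all 151*151 elements */
    /* ... populate and use J[i][j] as before ... */
    free(J);
}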
In the relatively slow loop below, is accessing a matrix that is contained in a structure 'data' the slow component, or is it something else about the loop?
You should ensure that any function called from it is inlined. Otherwise there isn't much else you can do to optimize the loop: what is optimal is highly system-dependent (i.e. it depends on how the physical cache memories are built). It is best to leave such optimization to the compiler.
You could of course obfuscate the code with manual optimizations, such as counting down towards zero rather than up, or using ++i rather than i++, etc. But the compiler really should be able to handle such things for you.
As for matrix addition, I don't know of the mathematically most efficient way, but I suspect it is of minor relevance to the efficiency of the code. The big time thief here is the double type: unless you really need the accuracy, I'd consider using float or int to speed up the algorithm.
