arm_cfft_sR_q31_len4096 undeclared


I am doing some FFT calculations on an STM32F407 and I want to compare the different FFT functions that are available in the CMSIS DSP library. When I use the f32 CFFT functions, everything works as expected, but when I try to use the q31/q15 functions I get an error saying that arm_cfft_sR_q31_len4096 or arm_cfft_sR_q15_len4096 is undeclared when I call their respective CFFT functions. I have included arm_const_structs.h, where they should be declared, but apparently they are not? It works with arm_cfft_sR_f32_len4096 for the f32 version of the functions, so what might be the problem?
Here is how the f32 version of my FFT calculation looks:
#include "arm_const_structs.h"
float32_t fft_data[FFT_SIZE * 2];
uint16_t util_calculate_fft_value(uint16_t *buffer, uint32_t len, uint32_t fft_freq, uint32_t fft_freq2)
{
uint16_t i;
float32_t maxValue; // Max FFT value is stored here
uint32_t maxIndex; // Index in Output array where max value is
tmStartMeasurement(&time); // Record clock cycles
// Ensure in buffer is not longer than FFT buffer
if (len > FFT_SIZE)
len = FFT_SIZE;
// Convert buffer uint16 to fft input float32
for (i = 0; i < len ; i++)
{
fft_data[i*2] = (float32_t)buffer[i] / 2048.0 - 1.0; // Real part
fft_data[i*2 + 1] = 0; // Imaginary part
}
// Process the data through the CFFT module: ifftFlag = 0, doBitReverse = 1
arm_cfft_f32(&arm_cfft_sR_f32_len4096, fft_data, 0, 1);
// Process the data through the Complex Magnitude Module for calculating the magnitude at each bin
arm_cmplx_mag_f32(fft_data, fft_data, FFT_SIZE / 2);
// Find maxValue as max in fft_data
arm_max_f32(fft_data, FFT_SIZE, &maxValue, &maxIndex);
if (fft_freq == 0)
{ // Find maxValue as max in fft data
arm_max_f32(fft_data, FFT_SIZE, &maxValue, &maxIndex);
}
else
{ // Grab maxValue from fft data at freq position
arm_max_f32(&fft_data[fft_freq * FFT_SIZE / ADC_SAMP_SPEED - 1], 3, &maxValue, &maxIndex);
if (fft_freq2 != 0)
{
// Grab maxValue from fft data at freq2 position
float32_t maxValue2; // Max FFT value is stored here
uint32_t maxIndex2; // Index in Output array where max value is
arm_max_f32(&fft_data[fft_freq2 * FFT_SIZE / ADC_SAMP_SPEED - 1], 3, &maxValue2, &maxIndex2);
maxValue = (maxValue + maxValue2) / 2.0;
}
}
tmStopMeasurement(&time); // Get number of clock cycles
// Convert output back to uint16 for plotting
for (i = 0; i < len / 2; i++)
{
buffer[i] = (uint16_t)(fft_data[i] * 10.0);
}
// Zero the rest of the buffer
for (i = len / 2; i < len; i++)
{
buffer[i] = 0;
}
LOG_INFO("FFT number of cycles: %i\n", time.worst);
return ((uint16_t)(maxValue * 10.0));
}

I found a copy of arm_const_structs.h online. It includes the following line:
extern const arm_cfft_instance_q31 arm_cfft_sR_q31_len4096;
The extern keyword means that this line is a declaration of arm_cfft_sR_q31_len4096, not a definition. The variable must also be defined somewhere else in your code.
I found the definition in arm_const_structs.c.
const arm_cfft_instance_q31 arm_cfft_sR_q31_len4096 = {
4096, twiddleCoef_4096_q31, armBitRevIndexTable_fixed_4096, ARMBITREVINDEXTABLE_FIXED_4096_TABLE_LENGTH
};
Make sure you have included arm_const_structs.c in your project so that it compiles and links with your program.
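To illustrate the difference between the two, here is a minimal single-file sketch. The type demo_instance and the variable demo_len4096 are hypothetical stand-ins and not CMSIS names; in a real project the extern line would live in the header and the definition in a separate .c file that has to be part of the build.

```c
#include <stdint.h>

/* Hypothetical stand-in for a CMSIS instance type (not the real struct). */
typedef struct {
    uint16_t fft_len;
} demo_instance;

/* Declaration: this is what a header would contain. It only promises
   that the object exists somewhere; it allocates no storage. */
extern const demo_instance demo_len4096;

/* Any code that can see the declaration compiles... */
uint16_t demo_get_len(void)
{
    return demo_len4096.fft_len;
}

/* ...but linking only succeeds if one source file in the build
   also provides the definition. */
const demo_instance demo_len4096 = { 4096 };
```

Note that an "undeclared" diagnostic comes from the compiler rather than the linker, so it may also be worth checking that the copy of arm_const_structs.h on your include path really contains the q31/q15 declarations; a missing definition would normally show up as an "undefined reference" at link time instead.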

Related

Does fmodf() cause a hardfault in stm32?

I am trying to create a modulated waveform out of two sine waves.
To do this I need the modulo (fmodf) to find out what amplitude a sine with a specific frequency (lo_frequency) has at a given time (t). But I get a hardfault when the following line is executed:
j = fmodf(2 * PI * lo_frequency * t, 2 * PI);
Do you have any idea why this gives me a hardfault?
Edit 1:
I exchanged fmodf with my_fmodf:
float my_fmodf(float x, float y){
if(y == 0){
return 0;
}
float n = x / y;
return x - n * y;
}
But the hardfault still occurs, and when I debug it, execution doesn't even enter this function (my_fmodf).
Here's the whole function in which the error occurs:
int* create_wave(int* message){
/* Mixes the message signal at 10kHz and the carrier at 40kHz.
* When a bit of the message is 0 the amplitude is lowered to 10%.
* When a bit of the message is 1 the amplitude is 100%.
* The output of the STM32 can't be negative, thats why the wave swings between
* 0 and 256 (8bit precision for faster DAC)
*/
static int rf_frequency = 10000;
static int lo_frequency = 40000;
static int sample_rate = 100000;
int output[sample_rate];
int index, mix;
float j, t;
for(int i = 0; i <= sample_rate; i++){
t = i * 0.00000001f; // i * 10^-8
j = my_fmodf(2 * PI * lo_frequency * t, 2 * PI);
if (j < 0){
j += (float) 2 * PI;
}
index = floor((16.0f / (lo_frequency/rf_frequency * 0.0001f)) * t);
if (index < 16) {
if (!message[index]) {
mix = 115 + sin1(j) * 0.1f;
} else {
mix = sin1(j);
}
} else {
break;
}
output[i] = mix;
}
return output;
}
Edit 2:
I fixed the warning: function returns address of local variable [-Wreturn-local-addr] the way "chux - Reinstate Monica" suggested.
int* create_wave(int* message){
static uint16_t rf_frequency = 10000;
static uint32_t lo_frequency = 40000;
static uint32_t sample_rate = 100000;
int *output = malloc(sizeof *output * sample_rate);
uint8_t index, mix;
float j, n, t;
for(int i = 0; i < sample_rate; i++){
t = i * 0.00000001f; // i * 10^-8
j = fmodf(2 * PI * lo_frequency * t, 2 * PI);
if (j < 0){
j += 2 * PI;
}
index = floor((16.0f / (lo_frequency/rf_frequency * 0.0001f)) * t);
if (index < 16) {
if (!message[index]) {
mix = (uint8_t) floor(115 + sin1(j) * 0.1f);
} else {
mix = sin1(j);
}
} else {
break;
}
output[i] = mix;
}
return output;
}
But now I get the hardfault on this line:
output[i] = mix;
EDIT 3:
Because the previous code contained a very large buffer array that did not fit into the 16 KB SRAM of the STM32F303K8, I needed to change it.
Now I use a "ping-pong" buffer, where I use the DMA callbacks for "first half transmitted" and "completely transmitted":
void HAL_DAC_ConvHalfCpltCallbackCh1(DAC_HandleTypeDef * hdac){
HAL_GPIO_WritePin(GPIOB, GPIO_PIN_3, GPIO_PIN_SET);
for(uint16_t i = 0; i < 128; i++){
new_value = sin_table[(i * 8) % 256];
if (message[message_index] == 0x0){
dac_buf[i] = new_value * 0.1f + 115;
} else {
dac_buf[i] = new_value;
}
}
}
void HAL_DAC_ConvCpltCallbackCh1 (DAC_HandleTypeDef * hdac){
HAL_GPIO_WritePin(GPIOB, GPIO_PIN_3, GPIO_PIN_RESET);
for(uint16_t i = 128; i < 256; i++){
new_value = sin_table[(i * 8) % 256];
if (message[message_index] == 0x0){
dac_buf[i] = new_value * 0.1f + 115;
} else {
dac_buf[i] = new_value;
}
}
message_index++;
if (message_index >= 16) {
message_index = 0;
// HAL_DAC_Stop_DMA (&hdac1, DAC_CHANNEL_1);
}
}
And it works the way I wanted.
But the frequency of the generated sine is too low.
It tops out at around 20 kHz, but I need 40 kHz.
I already increased the clock by a factor of 8, so that one is maxed out.
I can still decrease the counter period (it is 50 at the moment), but when I do so, the interrupt callback seems to take longer than the period until the next one.
At least it seems so, as the output becomes very distorted when I do that.
I also tried to decrease the precision by taking only every 8th sine value, but
I can't go any further because then the output no longer looks like a sine wave.
Any ideas how I could optimize the callback so that it takes less time?
Any other ideas?

Does fmodf() cause a hardfault in stm32?
Other problems in the code are causing the hard fault here.
Failing to compile with ample warnings
Best code tip: enable all warnings. #KamilCuk
This gives faster feedback than Stack Overflow.
I'd expect something like the following from a well-enabled compiler.
return output;
warning: function returns address of local variable [-Wreturn-local-addr]
Returning a local Object
Cannot return a local array. Allocate instead.
// int output[sample_rate];
int *output = malloc(sizeof *output * sample_rate);
return output;
Calling code will need to free() the pointer.
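As a minimal sketch of that pattern, the helper below allocates the buffer on the heap, checks the result, and hands ownership to the caller. The name make_samples and its fill_value parameter are made up for illustration and are not part of the original code.

```c
#include <stdlib.h>

/* Hypothetical helper: allocates `count` samples on the heap and fills them
   with a placeholder value. Returns NULL if the allocation fails. */
int *make_samples(size_t count, int fill_value)
{
    int *samples = malloc(sizeof *samples * count);
    if (samples == NULL) {
        return NULL;                        /* caller must handle the failure */
    }
    for (size_t i = 0; i < count; i++) {    /* i < count, not i <= count */
        samples[i] = fill_value;
    }
    return samples;                         /* ownership passes to the caller */
}
```

The NULL check matters on a small part: a request for 100000 int values needs roughly 400 KB with 4-byte ints, which cannot be satisfied from a few kilobytes of SRAM, and writing through an unchecked pointer would then be another likely source of a fault.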
Out of range array access
static int sample_rate = 100000;
int output[sample_rate];
// for(int i = 0; i <= sample_rate; i++){
for(int i = 0; i < sample_rate; i++){
...
output[i] = mix;
}
Stack overflow?
static int sample_rate = 100000; int output[sample_rate]; is a large local variable. Maybe allocate or try something smaller?
Advanced: loss of precision
A good fmodf() does not lose precision. For a more precise answer consider double math for the intermediate results. An even better approach is more involved.
float my_fmodf(float x, float y){
if(y == 0){
return 0;
}
double n = trunc((double)x / y); // truncate the quotient toward zero; without this the result is always about 0
return (float) (x - n * y);
}
Can I not use any function within another ?
Yes. Code has other issues.

One value every 10 µs makes only 100 kSPS, which is not too much for this micro. In my designs I generate > 5 MSPS signals without any problems. Usually I have one buffer and the DMA in circular mode. First I fill the buffer and start generation. When the half-transfer DMA interrupt is triggered, I fill the first half of the buffer with fresh data. When the transfer-complete interrupt is triggered, I fill the second half, and this process repeats all over again.
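The half/complete refill scheme described here can be sketched in plain C as follows. This is only an illustration under assumptions: BUF_LEN, next_sample(), refill_half() and the two on_... handlers are hypothetical names, and in a real project the handlers would be called from the vendor's DMA or DAC callbacks.

```c
#include <stddef.h>
#include <stdint.h>

#define BUF_LEN 256u                 /* hypothetical total buffer length */

static uint16_t dac_buf[BUF_LEN];    /* buffer the DMA reads in circular mode */

/* Hypothetical sample source; a real design would read a lookup table. */
static uint16_t next_sample(void)
{
    static uint16_t counter;
    return counter++;
}

/* Refill one half of the buffer: half 0 is [0, BUF_LEN/2),
   half 1 is [BUF_LEN/2, BUF_LEN). */
void refill_half(unsigned half)
{
    size_t start = (half == 0u) ? 0u : BUF_LEN / 2u;
    for (size_t i = 0; i < BUF_LEN / 2u; i++) {
        dac_buf[start + i] = next_sample();
    }
}

/* Call from the half-transfer interrupt: the DMA is now reading the
   second half, so the first half is safe to overwrite. */
void on_half_transfer(void)     { refill_half(0u); }

/* Call from the transfer-complete interrupt: the DMA has wrapped to the
   first half, so the second half is safe to overwrite. */
void on_transfer_complete(void) { refill_half(1u); }
```

Keeping the per-sample work in the refill loop to integer table lookups, without floating-point math, is one way to keep each handler shorter than the time the DMA needs to play the other half.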

Struggle to program an LFO

Based on some tutorial code I found, I coded a little synthesizer with three oscillators and four different waveforms. It works well and I want to add an LFO to modulate the sounds. Since I didn't code everything on my own, I'm a bit confused about how to fit the LFO formula into my code. This is more or less what I tried in order to apply the LFO formula to a sine wave. (The formula is something like this: sinewaveFormula + 0.5 * Sinefreq * sin(2pi*1) * time)
double normalize(double phase)
{
double cycles = phase/(2.0*pi);
phase -= trunc(cycles) * 2.0 * pi;
if (phase < 0) phase += 2.0*pi;
return phase;
}
double sine(double phase)
{ phase = normalize(phase); return (sin(phase));}
static void build_sine_table(int16_t *data, int wave_length) {
double phase_increment = (2.0f * pi) / (double)wave_length;
double current_phase = 0;
for(int i = 0; i < wave_length; i++) {
int sample = synthOsc(current_phase, oscNum, selectedWave, selectedWave2, selectedWave3, intensity, intensity2, intensity3) + 0.5 * ((current_phase* wave_length) / (2*pi)) * sin(2*pi*(1.0)) * wave_length;
data[i] = (int16_t)sample;
current_phase += phase_increment;
}
}
static void write_samples(int16_t *s_byteStream, long begin, long end, long length) {
if(note > 0) {
double d_sample_rate = sample_rate;
double d_table_length = table_length;
double d_note = note;
// get correct phase increment for note depending on sample rate and table length.
double phase_increment = (get_pitch(d_note) / d_sample_rate) * d_table_length;
// loop through the buffer and write samples.
for (int i = 0; i < length; i+=2) {
phase_double += phase_increment;
phase_int = (int)phase_double;
if(phase_double >= table_length) {
double diff = phase_double - table_length;
phase_double = diff;
phase_int = (int)diff;
}
if(phase_int < table_length && phase_int > -1) {
if(s_byteStream != NULL) {
int16_t sample = sine_waveform_wave[phase_int];
target_amp = update_envelope();
if(smoothing_enabled) {
// move current amp towards target amp for a smoother transition.
if(current_amp < target_amp) {
current_amp += smoothing_amp_speed;
if(current_amp > target_amp) {
current_amp = target_amp;
}
} else if(current_amp > target_amp) {
current_amp -= smoothing_amp_speed;
if(current_amp < target_amp) {
current_amp = target_amp;
}
}
} else {
current_amp = target_amp;
}
sample *= current_amp; // scale volume.
s_byteStream[i+begin] = sample; // left channel
s_byteStream[i+begin+1] = sample; // right channel
}
}
}
}
}
The code compiles, but there's no LFO on the sine. I don't understand how to make this formula work with this code.

It may help to get a basic understanding of how an LFO actually works. It is not that difficult, as an LFO is just another oscillator that is mixed into the waveform you want to modulate.
I would suggest removing your LFO formula from your call to synthOsc(); then you get a clean oscillator signal again. As a next step, create another oscillator signal, for which you can use a very low frequency. Mix both signals together and you are done.
Expressed in simple math, it is like this:
int the_sample_you_want_to_modulate = synthOsc1(...);
int a_sample_with_very_low_frequency = synthOsc2(...);
Mixing two waveforms is done through addition:
int mixed_sample = the_sample_you_want_to_modulate + a_sample_with_very_low_frequency;
The resulting sample will now sweep based on the frequency you have used for synthOsc2().
As you can see, to implement an LFO you do not actually need a separate formula. You already have the formula once you know how to create an oscillator.
Note that if you add two sine oscillators that have the exact same frequency, the resulting signal will just get louder. But when each has a different frequency, you will get a new waveform. For LFOs (which are in fact just ordinary oscillators, like in your build_sine_table() function) you typically set a very low frequency: 1 - 10 Hz is low enough to get an audible sweep. For higher frequencies you get chords as a result.
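As a small sketch of the addition step, the helper below mixes two 16-bit samples. The name mix_samples is hypothetical, and the clamping to the int16_t range is an assumption added here so that a loud oscillator plus the LFO cannot wrap around when written to a 16-bit audio buffer.

```c
#include <stdint.h>

/* Adds two 16-bit samples and clamps the sum to the int16_t range so that
   the mix cannot wrap around. The clamping is an assumption of this sketch. */
int16_t mix_samples(int16_t carrier, int16_t lfo)
{
    int32_t sum = (int32_t)carrier + (int32_t)lfo;   /* widen before adding */
    if (sum > INT16_MAX) sum = INT16_MAX;
    if (sum < INT16_MIN) sum = INT16_MIN;
    return (int16_t)sum;
}
```

In the write loop this could be used as mix_samples(the_sample_you_want_to_modulate, a_sample_with_very_low_frequency), provided both oscillators produce values that fit in 16 bits.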

Writing a wave generator with SDL

I've coded a simple sequencer in C with SDL 1.2 and SDL_mixer (to play .wav files). It works well and I want to add some audio synthesis to this program. I looked around and found this sine wave code that uses SDL2 (https://github.com/lundstroem/synth-samples-sdl2/blob/master/src/synth_samples_sdl2_2.c)
Here's how the sinewave is coded in the program:
static void build_sine_table(int16_t *data, int wave_length)
{
/*
Build sine table to use as oscillator:
Generate a 16bit signed integer sinewave table with 1024 samples.
This table will be used to produce the notes.
Different notes will be created by stepping through
the table at different intervals (phase).
*/
double phase_increment = (2.0f * pi) / (double)wave_length;
double current_phase = 0;
for(int i = 0; i < wave_length; i++) {
int sample = (int)(sin(current_phase) * INT16_MAX);
data[i] = (int16_t)sample;
current_phase += phase_increment;
}
}
static double get_pitch(double note) {
/*
Calculate pitch from note value.
offset note by 57 halfnotes to get correct pitch from the range we have chosen for the notes.
*/
double p = pow(chromatic_ratio, note - 57);
p *= 440;
return p;
}
static void audio_callback(void *unused, Uint8 *byte_stream, int byte_stream_length) {
/*
This function is called whenever the audio buffer needs to be filled to allow
for a continuous stream of audio.
Write samples to byteStream according to byteStreamLength.
The audio buffer is interleaved, meaning that both left and right channels exist in the same
buffer.
*/
// zero the buffer
memset(byte_stream, 0, byte_stream_length);
if(quit) {
return;
}
// cast buffer as 16bit signed int.
Sint16 *s_byte_stream = (Sint16*)byte_stream;
// buffer is interleaved, so get the length of 1 channel.
int remain = byte_stream_length / 2;
// split the rendering up in chunks to make it buffersize agnostic.
long chunk_size = 64;
int iterations = remain/chunk_size;
for(long i = 0; i < iterations; i++) {
long begin = i*chunk_size;
long end = (i*chunk_size) + chunk_size;
write_samples(s_byte_stream, begin, end, chunk_size);
}
}
static void write_samples(int16_t *s_byteStream, long begin, long end, long length) {
if(note > 0) {
double d_sample_rate = sample_rate;
double d_table_length = table_length;
double d_note = note;
/*
get correct phase increment for note depending on sample rate and table length.
*/
double phase_increment = (get_pitch(d_note) / d_sample_rate) * d_table_length;
/*
loop through the buffer and write samples.
*/
for (int i = 0; i < length; i+=2) {
phase_double += phase_increment;
phase_int = (int)phase_double;
if(phase_double >= table_length) {
double diff = phase_double - table_length;
phase_double = diff;
phase_int = (int)diff;
}
if(phase_int < table_length && phase_int > -1) {
if(s_byteStream != NULL) {
int16_t sample = sine_wave_table[phase_int];
sample *= 0.6; // scale volume.
s_byteStream[i+begin] = sample; // left channel
s_byteStream[i+begin+1] = sample; // right channel
}
}
}
}
}
I don't understand how I could change the sine wave formula to generate other waveforms like square, triangle, saw, etc.
EDIT:
Because I forgot to explain it, here's what I tried.
I followed the example I saw in this video series (https://www.youtube.com/watch?v=tgamhuQnOkM). The source code for the method shown in the video is on GitHub, and the wave generation code looks like this:
double w(double dHertz)
{
return dHertz * 2.0 * PI;
}
// General purpose oscillator
double osc(double dHertz, double dTime, int nType = OSC_SINE)
{
switch (nType)
{
case OSC_SINE: // Sine wave between -1 and +1
return sin(w(dHertz) * dTime);
case OSC_SQUARE: // Square wave between -1 and +1
return sin(w(dHertz) * dTime) > 0 ? 1.0 : -1.0;
case OSC_TRIANGLE: // Triangle wave between -1 and +1
return asin(sin(w(dHertz) * dTime)) * (2.0 / PI);
default: // Unknown type: output silence
return 0.0;
}
}
Because the C++ code here uses the Windows sound API, I could not copy and paste this method and make it work with the code I found that uses SDL2.
So I tried this in order to obtain a square wave:
static void build_sine_table(int16_t *data, int wave_length)
{
double phase_increment = ((2.0f * pi) / (double)wave_length) > 0 ? 1.0 : -1.0;
double current_phase = 0;
for(int i = 0; i < wave_length; i++) {
int sample = (int)(sin(current_phase) * INT16_MAX);
data[i] = (int16_t)sample;
current_phase += phase_increment;
}
}
This didn't give me a square wave, but something more like a saw wave.
Here's what I tried to get a triangle wave:
static void build_sine_table(int16_t *data, int wave_length)
{
double phase_increment = (2.0f * pi) / (double)wave_length;
double current_phase = 0;
for(int i = 0; i < wave_length; i++) {
int sample = (int)(asin(sin(current_phase) * INT16_MAX)) * (2 / pi);
data[i] = (int16_t)sample;
current_phase += phase_increment;
}
}
This also gave me another type of waveform, not triangle.

You’d replace the sin function call with a call to one of the following:
// this is a helper function only
double normalize(double phase)
{
double cycles = phase/(2.0*M_PI);
phase -= trunc(cycles) * 2.0 * M_PI;
if (phase < 0) phase += 2.0*M_PI;
return phase;
}
double square(double phase)
{ return (normalize(phase) < M_PI) ? 1.0 : -1.0; }
double sawtooth(double phase)
{ return -1.0 + normalize(phase) / M_PI; }
double triangle(double phase)
{
phase = normalize(phase);
if (phase >= M_PI)
phase = 2*M_PI - phase;
return -1.0 + 2.0 * phase / M_PI;
}
You’d be building tables just like you did for the sine, except they’d be the square, sawtooth and triangle tables, respectively.
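One way to build those tables without duplicating the loop is to pass the waveform function in as a pointer. The sketch below is an assumption-laden illustration: build_wave_table, wave_fn, TWO_PI and square_0_to_2pi are hypothetical names, and the small square function is included only so the example stands on its own (it expects phases already in [0, 2*pi), so it needs no normalize step).

```c
#include <stdint.h>

#define TWO_PI 6.283185307179586

/* Signature shared by the waveform functions: phase in radians in,
   amplitude in [-1, 1] out. */
typedef double (*wave_fn)(double phase);

/* Minimal square wave for phases already in [0, 2*pi); included only so
   the sketch is self-contained. */
double square_0_to_2pi(double phase)
{
    return (phase < TWO_PI / 2.0) ? 1.0 : -1.0;
}

/* Hypothetical generalisation of build_sine_table(): fills `data` with one
   period of whatever waveform function is passed in. */
void build_wave_table(int16_t *data, int wave_length, wave_fn wave)
{
    for (int i = 0; i < wave_length; i++) {
        /* Compute the phase from the index to avoid accumulated rounding. */
        double phase = TWO_PI * (double)i / (double)wave_length;
        data[i] = (int16_t)(wave(phase) * INT16_MAX);
    }
}
```

With the functions from this answer, the calls would look like build_wave_table(square_table, 1024, square) or build_wave_table(triangle_table, 1024, triangle), where the table names are again placeholders.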

Image processing: further optimization

I'm new to optimization and was given a task to optimize a function that processes an image as much as possible. It takes an image, blurs it, and saves the blurred image; it then sharpens the image and saves the sharpened image as well.
Here is my code:
typedef struct {
unsigned char red;
unsigned char green;
unsigned char blue;
} pixel;
// I delete the other struct because we can do the same operations with use of only addresses
//use macro instead of function is more efficient
#define calculateIndex(i, j, n) ((i)*(n)+(j))
// I combine all the functions in one because it is time consuming
void myfunction(Image *image, char* srcImgpName, char* blurRsltImgName, char* sharpRsltImgName) {
// use variable from type 'register int' is much more efficient from 'int'
register int i,j, ii, jj, sum_red, sum_green, sum_blue;
//using local variable is much more efficient than using pointer to pixels from the original image,and updat its value in each iteration
pixel current_pixel , p;
//dst will point on the first pixel in the image
pixel* dst = (pixel*)image->data;
int squareN = n*n;
//instead of multiply by 3 - I used shift
register int sizeToAllocate = ((squareN)<<1)+(squareN); // use variable from type 'register int' is much more efficient from 'int'
pixel* src = malloc(sizeToAllocate);
register int index;
//memcpy replace the old functions that converts chars to pixels or pixels to chars. it is very efficient and build-in in c libraries
memcpy(src, dst, sizeToAllocate);
///////////////////////////////////////// first step : smooth //////////////////////////////////////////////////////////////////////
/**the smooth blur is a step that applies the blur kernel (matrix of ints) over each pixel within the bounds - and makes the image smoother.
*this function was originally used this matrix :
* [1, 1, 1]
* [1, 1, 1]
* [1, 1, 1]
*because the matrix is full of 1 , we don't really need it - the access to the matrix is very expensive . instead of the matrix I used
*primitive variable.
*/
//the loops are starting with 1 and not with 0 because we need to check only the pixels with 8 neighbors around them
index = calculateIndex(1, 1, n);
for (i = 1 ; i < n - 1; ++i) {
for (j = 1 ; j < n - 1 ; ++j) {
// I used this variables as counters to the colors' values around a specific pixel
sum_red = 0;
sum_green = 0;
sum_blue = 0;
for(ii = i-1; ii <= i+1; ++ii) {
for(jj =j-1; jj <= j+1; ++jj) {
//take care of the [ii,jj] pixel in the matrix
//calculate the address of the current pixel
pixel p = src[calculateIndex(ii, jj, n)];
//sum the colors' values of the neighbors of the current pixel
sum_red += p.red;
sum_green += p.green;
sum_blue += p.blue;
}
}
//calculate the average of the colors' values around the current pixel - as written in the instructions
sum_red = (((sum_red) * 0xE38F) >> 19);//instead of dividing by 9 - I used shift because it is more efficient
sum_green = (((sum_green) * 0xE38F) >> 19);//instead of dividing by 9 - I used shift because it is more efficient
sum_blue = (((sum_blue) * 0xE38F) >> 19);//instead of dividing by 9 - I used shift because it is more efficient
current_pixel.red = (unsigned char)sum_red;
current_pixel.green = (unsigned char)sum_green;
current_pixel.blue = (unsigned char)sum_blue;
dst[index++] = current_pixel;
}
}
// write result image to file
writeBMP(image, srcImgpName, blurRsltImgName);
//memcpy replace the old functions that converts chars to pixels or pixels to chars. it is very efficient and build-in in c libraries
memcpy(src, dst, sizeToAllocate);
///////////////////////////////////////// second step : sharp //////////////////////////////////////////////////////////////////////
/** I want to sharpen the smoothed image. In this step I apply the sharpen kernel (matrix of ints) over each pixel within the bounds - and make the image sharper.
*this function was originally used this matrix :
* [-1, -1, -1]
* [-1, 9, -1]
* [-1, -1, -1]
*because the matrix is full of (-1) , we don't really need it - the access to the matrix is very expensive . instead of the matrix I used
*primitive variables. I operate like this: instead of multiplying by (-1) at the end of the step, I define a counter initialized to zero, and
*subtract all the colors' values from it. The result is actually the same as multiplying by (-1), in a more efficient way.
*/
//the loops are starting with 1 and not with 0 because we need to check only the pixels with 8 neighbors around them
for (i = 1 ; i < n-1; ++i) {
for (j = 1 ; j < n-1 ; ++j) {
// I used this variables as counters to the colors' values around a specific pixel
sum_red = 0;
sum_green = 0;
sum_blue = 0;
// Do central pixel first
p=src[calculateIndex(i,j,n)];
sum_red = 10*p.red;
sum_green = 10*p.green;
sum_blue = 10*p.blue;
for(ii =i-1; ii <= i + 1; ++ii) {
for(jj = j-1; jj <= j + 1; ++jj) {
p = src[calculateIndex(ii, jj, n)];
//operate according to the instructions
sum_red -= p.red;
sum_green -= p.green;
sum_blue -= p.blue;
}
}
//each pixel's colors' values must match the range [0,255] - I used the idea from the original code
//the red value must be in the range [0,255]
if (sum_red < 0) {
sum_red = 0;
} else if (sum_red > 255 ) {
sum_red = 255;
}
current_pixel.red = (unsigned char)sum_red;
//the green value must be in the range [0,255]
if (sum_green < 0) {
sum_green = 0;
} else if (sum_green > 255 ) {
sum_green = 255;
}
current_pixel.green = (unsigned char)sum_green;
//the blue value must be in the range [0,255]
if (sum_blue < 0) {
sum_blue = 0;
} else if (sum_blue > 255 ) {
sum_blue = 255;
}
current_pixel.blue = (unsigned char)sum_blue;
// put the updated pixel in [i,j] in the image
dst[calculateIndex(i, j, n)] = current_pixel;
}
}
//free the allocated space to prevent memory leaks
free(src);
// write result image to file
writeBMP(image, srcImgpName, sharpRsltImgName);
}
I wanted to ask about the if statements: is there anything better that can replace them? And more generally, can anyone spot any optimization mistakes here or offer any input?
Thanks a lot!
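For reference, the multiply-and-shift replacement for the division by 9 in the blur step can be checked exhaustively, because the sum of nine 8-bit channel values never exceeds 9 * 255. The helper names div9_shift and div9_shift_is_exact below are made up for this check and do not appear in the code above.

```c
#include <stdint.h>

/* The multiply-and-shift form used in the blur step. */
static int div9_shift(int sum)
{
    return (sum * 0xE38F) >> 19;
}

/* Returns 1 if the shortcut matches sum / 9 for every possible sum of nine
   8-bit channel values (0 .. 9 * 255), otherwise 0. */
int div9_shift_is_exact(void)
{
    for (int sum = 0; sum <= 9 * 255; sum++) {
        if (div9_shift(sum) != sum / 9) {
            return 0;
        }
    }
    return 1;
}
```

The constant works because 0xE38F * 9 exceeds 2^19 by only 7, so the rounding error stays below one ninth for inputs in this range. Many optimizing compilers typically perform a similar transformation on their own for a constant divisor, so writing sum / 9 may well be just as fast and is easier to read.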
updated code:
typedef struct {
unsigned char red;
unsigned char green;
unsigned char blue;
} pixel;
// I delete the other struct because we can do the same operations with use of only addresses
//use macro instead of function is more efficient
#define calculateIndex(i, j, n) ((i)*(n)+(j))
// I combine all the functions in one because it is time consuming
void myfunction(Image *image, char* srcImgpName, char* blurRsltImgName, char* sharpRsltImgName) {
// use variable from type 'register int' is much more efficient from 'int'
register int i,j, ii, jj, sum_red, sum_green, sum_blue;
//using local variable is much more efficient than using pointer to pixels from the original image,and updat its value in each iteration
pixel current_pixel , p;
//dst will point on the first pixel in the image
pixel* dst = (pixel*)image->data;
int squareN = n*n;
//instead of multiply by 3 - I used shift
register int sizeToAllocate = ((squareN)<<1)+(squareN); // use variable from type 'register int' is much more efficient from 'int'
pixel* src = malloc(sizeToAllocate);
register int index;
//memcpy replace the old functions that converts chars to pixels or pixels to chars. it is very efficient and build-in in c libraries
memcpy(src, dst, sizeToAllocate);
///////////////////////////////////////// first step : smooth //////////////////////////////////////////////////////////////////////
/**the smooth blur is a step that applies the blur kernel (matrix of ints) over each pixel within the bounds - and makes the image smoother.
*this function was originally used this matrix :
* [1, 1, 1]
* [1, 1, 1]
* [1, 1, 1]
*because the matrix is full of 1 , we don't really need it - the access to the matrix is very expensive . instead of the matrix I used
*primitive variable.
*/
//the loops are starting with 1 and not with 0 because we need to check only the pixels with 8 neighbors around them
index = calculateIndex(1, 1, n);
for (i = 1 ; i < n - 1; ++i) {
for (j = 1 ; j < n - 1 ; ++j) {
// I used this variables as counters to the colors' values around a specific pixel
sum_red = 0;
sum_green = 0;
sum_blue = 0;
for(ii = i-1; ii <= i+1; ++ii) {
for(jj =j-1; jj <= j+1; ++jj) {
//take care of the [ii,jj] pixel in the matrix
//calculate the address of the current pixel
pixel p = src[calculateIndex(ii, jj, n)];
//sum the colors' values of the neighbors of the current pixel
sum_red += p.red;
sum_green += p.green;
sum_blue += p.blue;
}
}
//calculate the average of the colors' values around the current pixel - as written in the instructions
sum_red = (((sum_red) * 0xE38F) >> 19);//instead of dividing by 9 - I used shift because it is more efficient
sum_green = (((sum_green) * 0xE38F) >> 19);//instead of dividing by 9 - I used shift because it is more efficient
sum_blue = (((sum_blue) * 0xE38F) >> 19);//instead of dividing by 9 - I used shift because it is more efficient
current_pixel.red = (unsigned char)sum_red;
current_pixel.green = (unsigned char)sum_green;
current_pixel.blue = (unsigned char)sum_blue;
dst[index++] = current_pixel;
}
index += 2;
}
// write result image to file
writeBMP(image, srcImgpName, blurRsltImgName);
//memcpy replace the old functions that converts chars to pixels or pixels to chars. it is very efficient and build-in in c libraries
memcpy(src, dst, sizeToAllocate);
///////////////////////////////////////// second step : sharp //////////////////////////////////////////////////////////////////////
/** I want to sharpen the smoothed image. In this step I apply the sharpen kernel (matrix of ints) over each pixel within the bounds - and make the image sharper.
*this function was originally used this matrix :
* [-1, -1, -1]
* [-1, 9, -1]
* [-1, -1, -1]
*because the matrix is full of (-1) , we don't really need it - the access to the matrix is very expensive . instead of the matrix I used
*primitive variables. I operate like this: instead of multiplying by (-1) at the end of the step, I define a counter initialized to zero, and
*subtract all the colors' values from it. The result is actually the same as multiplying by (-1), in a more efficient way.
*/
index = calculateIndex(1,1,n);
//the loops are starting with 1 and not with 0 because we need to check only the pixels with 8 neighbors around them
for (i = 1 ; i < n-1; ++i) {
for (j = 1 ; j < n-1 ; ++j) {
// I used this variables as counters to the colors' values around a specific pixel
sum_red = 0;
sum_green = 0;
sum_blue = 0;
// Do central pixel first
p=src[index];
sum_red = 10*p.red;
sum_green = 10*p.green;
sum_blue = 10*p.blue;
for(ii =i-1; ii <= i + 1; ++ii) {
for(jj = j-1; jj <= j + 1; ++jj) {
p = src[calculateIndex(ii, jj, n)];
//operate according to the instructions
sum_red -= p.red;
sum_green -= p.green;
sum_blue -= p.blue;
}
index += 2;
}
//each pixel's colors' values must match the range [0,255] - I used the idea from the original code
//the red value must be in the range [0,255]
if (sum_red < 0) {
sum_red = 0;
} else if (sum_red > 255 ) {
sum_red = 255;
}
current_pixel.red = (unsigned char)sum_red;
//the green value must be in the range [0,255]
if (sum_green < 0) {
sum_green = 0;
} else if (sum_green > 255 ) {
sum_green = 255;
}
current_pixel.green = (unsigned char)sum_green;
//the blue value must be in the range [0,255]
if (sum_blue < 0) {
sum_blue = 0;
} else if (sum_blue > 255 ) {
sum_blue = 255;
}
current_pixel.blue = (unsigned char)sum_blue;
// put the updated pixel in [i,j] in the image
dst[calculateIndex(i, j, n)] = current_pixel;
}
}
//free the allocated space to prevent memory leaks
free(src);
// write result image to file
writeBMP(image, srcImgpName, sharpRsltImgName);
}
------------------------------------------------------------------------------updated code:
typedef struct {
unsigned char red;
unsigned char green;
unsigned char blue;
} pixel;
// I delete the other struct because we can do the same operations with use of only addresses
//use macro instead of function is more efficient
#define calculateIndex(i, j, n) ((i)*(n)+(j))
// I combine all the functions in one because it is time consuming
void myfunction(Image *image, char* srcImgpName, char* blurRsltImgName, char* sharpRsltImgName) {
// use variable from type 'register int' is much more efficient from 'int'
register int i,j, ii, jj, sum_red, sum_green, sum_blue;
//using local variable is much more efficient than using pointer to pixels from the original image,and updat its value in each iteration
pixel current_pixel , p;
//dst will point on the first pixel in the image
pixel* dst = (pixel*)image->data;
int squareN = n*n;
//instead of multiply by 3 - I used shift
register int sizeToAllocate = ((squareN)<<1)+(squareN); // use variable from type 'register int' is much more efficient from 'int'
pixel* src = malloc(sizeToAllocate);
register int index;
//memcpy replace the old functions that converts chars to pixels or pixels to chars. it is very efficient and build-in in c libraries
memcpy(src, dst, sizeToAllocate);
///////////////////////////////////////// first step : smooth //////////////////////////////////////////////////////////////////////
/**the smooth blur is a step that applies the blur kernel (matrix of ints) over each pixel within the bounds - and makes the image smoother.
*this function was originally used this matrix :
* [1, 1, 1]
* [1, 1, 1]
* [1, 1, 1]
*because the matrix is full of 1 , we don't really need it - the access to the matrix is very expensive . instead of the matrix I used
*primitive variable.
*/
//the loops are starting with 1 and not with 0 because we need to check only the pixels with 8 neighbors around them
index = n + 1;
for (i = 1 ; i < n - 1; ++i) {
for (j = 1 ; j < n - 1 ; ++j) {
// I used this variables as counters to the colors' values around a specific pixel
sum_red = 0;
sum_green = 0;
sum_blue = 0;
for(ii = i-1; ii <= i+1; ++ii) {
for(jj =j-1; jj <= j+1; ++jj) {
//handle the [ii,jj] pixel of the 3x3 neighborhood
//fetch the pixel at that index (reuses the outer 'p' instead of shadowing it)
p = src[calculateIndex(ii, jj, n)];
//sum the colors' values of the neighbors of the current pixel
sum_red += p.red;
sum_green += p.green;
sum_blue += p.blue;
}
}
//calculate the average of the colors' values around the current pixel - as written in the instructions
sum_red = (((sum_red) * 0xE38F) >> 19);//instead of dividing by 9 - multiply by 0xE38F and shift right by 19, which is more efficient
sum_green = (((sum_green) * 0xE38F) >> 19);//instead of dividing by 9 - multiply by 0xE38F and shift right by 19
sum_blue = (((sum_blue) * 0xE38F) >> 19);//instead of dividing by 9 - multiply by 0xE38F and shift right by 19
current_pixel.red = (unsigned char)sum_red;
current_pixel.green = (unsigned char)sum_green;
current_pixel.blue = (unsigned char)sum_blue;
dst[index++] = current_pixel;
}
index += 2;
}
// write result image to file
writeBMP(image, srcImgpName, blurRsltImgName);
//memcpy replaces the old functions that converted chars to pixels and pixels to chars; here it refreshes src with the blurred image
memcpy(src, dst, sizeToAllocate);
///////////////////////////////////////// second step : sharp //////////////////////////////////////////////////////////////////////
/** I want to sharpen the smoothed image. This step applies the sharpen kernel (a matrix of ints) to each pixel within the bounds, making the image sharper.
*The function originally used this matrix:
* [-1, -1, -1]
* [-1, 9, -1]
* [-1, -1, -1]
*Because the matrix is almost entirely (-1), we don't really need it. Access to the matrix is expensive, so I used
*primitive variables instead. It works like this: instead of multiplying by (-1) at the end of the step, I define a counter initialized to zero and
*subtract all the colors' values from it. The result is the same as multiplying by (-1), in a more efficient way.
*/
index = calculateIndex(1,1,n);
//the loops are starting with 1 and not with 0 because we need to check only the pixels with 8 neighbors around them
for (i = 1 ; i < n-1; ++i) {
for (j = 1 ; j < n-1 ; ++j) {
// I used this variables as counters to the colors' values around a specific pixel
sum_red = 0;
sum_green = 0;
sum_blue = 0;
// Do central pixel first
p=src[index];
sum_red = 10*p.red;
sum_green = 10*p.green;
sum_blue = 10*p.blue;
for(ii =i-1; ii <= i + 1; ++ii) {
for(jj = j-1; jj <= j + 1; ++jj) {
p = src[calculateIndex(ii, jj, n)];
//operate according to the instructions
sum_red -= p.red;
sum_green -= p.green;
sum_blue -= p.blue;
}
}
//each pixel's colors' values must match the range [0,255] - I used the idea from the original code
//the red value must be in the range [0,255]
if (sum_red < 0) {
sum_red = 0;
} else if (sum_red > 255 ) {
sum_red = 255;
}
current_pixel.red = (unsigned char)sum_red;
//the green value must be in the range [0,255]
if (sum_green < 0) {
sum_green = 0;
} else if (sum_green > 255 ) {
sum_green = 255;
}
current_pixel.green = (unsigned char)sum_green;
//the blue value must be in the range [0,255]
if (sum_blue < 0) {
sum_blue = 0;
} else if (sum_blue > 255 ) {
sum_blue = 255;
}
current_pixel.blue = (unsigned char)sum_blue;
// put the updated pixel in [i,j] in the image
dst[index++] = current_pixel; // advance index so that src[index] above stays on the central pixel
}
index += 2;
}
//free the allocated space to prevent memory leaks
free(src);
// write result image to file
writeBMP(image, srcImgpName, sharpRsltImgName);
}
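
The multiply-and-shift that the updated code uses in place of dividing by 9 is only an approximation in general, so it is worth checking over the range it is actually used for. The sketch below is a minimal check, assuming the sum is a 3x3 total of 8-bit channel values (0 to 9 * 255); the function names are hypothetical and not part of the original code.

```c
#include <assert.h>

/* Approximate x / 9 with a multiply and a shift, as in the updated code.
 * 0xE38F is 58255, which is 2^19 / 9 rounded up. */
int div9_approx(int x)
{
    assert(x >= 0 && x <= 9 * 255);   /* range of a 3x3 sum of 8-bit values */
    return (x * 0xE38F) >> 19;
}

/* Returns 1 if the approximation equals x / 9 for every possible sum,
 * 0 otherwise. */
int div9_matches_everywhere(void)
{
    for (int x = 0; x <= 9 * 255; ++x) {
        if (div9_approx(x) != x / 9)
            return 0;
    }
    return 1;
}
```

Outside that range the constant is no longer guaranteed to give the exact quotient, so the trick should not be reused for larger sums without repeating the check. An optimizing compiler will usually turn a plain `/ 9` into a similar multiply on its own.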

Some general optimization guidelines:
If you're running on x86, compile as a 64-bit binary. x86 is really a register-starved CPU. In 32-bit mode you pretty much have only 5 or 6 32-bit general-purpose registers available, and you only get "all" 6 if you compile with optimizations like -fomit-frame-pointer on GCC. In 64-bit mode you'll have 13 or 14 64-bit general-purpose registers.
Get a good compiler and use the highest possible general optimization level.
Profile! Profile! Profile! Actually profile your code so you know where the performance bottlenecks really are. Any guesses about the location of performance bottlenecks are likely wrong.
Once you find your bottlenecks, examine the actual instructions the compiler produces and look at the bottleneck areas, just to see what's happening. Perhaps the bottleneck is where the compiler had to do a lot of register spilling and filling because of register pressure. This can be really helpful if you can profile down to the instruction level.
Use the insights from the profiling and examination of the generated instructions to improve your code and compile arguments. For example, if you're seeing a lot of register spilling and filling, you need to reduce register pressure, perhaps by manually coalescing loops or disabling prefetching with a compiler option.
Experiment with different page size options. If a single row of pixels is a significant fraction of a page size, reaching into other rows is more likely to reach into another page and result in a TLB miss. Using larger memory pages may significantly reduce this.
Some specific ideas for your code:
Use only one outer loop. You'll have to experiment to find the fastest way to handle your "extra" edge pixels. The fastest way might be to not do anything special, roll right over them like "normal" pixels, and just ignore the values in them later.
Manually unroll the two inner loops - you're only doing 9 pixels.
Don't use calculateIndex() - use the address of the current pixel and find the other pixels simply by subtracting or adding the proper value from the current pixel address. For example, the address of the upper-left pixel in your inner loops would be something like currentPixelAddress - n - 1.
Those would convert your four-deep nested loops into a single loop with very little index calculations needed.
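
The three code-specific ideas above can be sketched together for the blur step. This is an untested-in-context illustration, not the poster's code: `blur_single_loop` and `SUM9` are hypothetical names, `src` and `dst` are assumed to be separate contiguous n x n buffers, and restoring the border pixels afterwards is just one of the possible ways to handle the "extra" edge pixels.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    unsigned char red;
    unsigned char green;
    unsigned char blue;
} pixel;

/* Sum one colour channel over the 3x3 neighbourhood centred on c,
 * using fixed pointer offsets instead of an index calculation. */
#define SUM9(c, n, ch)                                      \
    ((c)[-(n) - 1].ch + (c)[-(n)].ch + (c)[-(n) + 1].ch +   \
     (c)[-1].ch       + (c)[0].ch    + (c)[1].ch +          \
     (c)[(n) - 1].ch  + (c)[(n)].ch  + (c)[(n) + 1].ch)

/* 3x3 box blur with a single loop from pixel (1,1) to pixel (n-2,n-2). */
void blur_single_loop(const pixel *src, pixel *dst, int n)
{
    assert(n >= 3);
    const ptrdiff_t stride = n;
    const ptrdiff_t first = stride + 1;                   /* pixel (1, 1)     */
    const ptrdiff_t last = stride * stride - stride - 2;  /* pixel (n-2, n-2) */

    for (ptrdiff_t k = first; k <= last; ++k) {
        const pixel *c = src + k;
        dst[k].red   = (unsigned char)(SUM9(c, stride, red) / 9);
        dst[k].green = (unsigned char)(SUM9(c, stride, green) / 9);
        dst[k].blue  = (unsigned char)(SUM9(c, stride, blue) / 9);
    }

    /* The loop also rolled over left and right border pixels of the
     * interior rows; copy the original values back from src. */
    for (ptrdiff_t i = 1; i < stride - 1; ++i) {
        dst[i * stride] = src[i * stride];
        dst[i * stride + stride - 1] = src[i * stride + stride - 1];
    }
}
```

Every index the loop touches keeps its nine neighbours inside the buffer, which is why the border pixels can be processed like normal pixels and fixed up afterwards. Whether this beats the nested loops on a given machine is something only profiling can confirm.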

A few ideas - untested.
You have if(ii==i && jj==j) to test for the central pixel in your sharpening loop, which you do 9x for every pixel. I think it would be faster to remove that if, treat all nine pixels the same way, and make a correction outside the loop by adding 10x the central pixel.
// Do central pixel first
p=src[calculateIndex(i,j,n)];
sum_red = 10*p.red;
sum_green = 10*p.green;
sum_blue = 10*p.blue;
for(ii =i-1; ii <= i + 1; ++ii) {
for(jj = j-1; jj <= j + 1; ++jj) {
p = src[calculateIndex(ii, jj, n)];
//operate according to the instructions
sum_red -= p.red;
sum_green -= p.green;
sum_blue -= p.blue;
}
}
Where you do dst[calculateIndex(i, j, n)] = current_pixel;, you can probably calculate the index once before the loop at the start and then just increment the pointer with each write inside the loop - assuming your arrays are contiguous and unpadded.
index=calculateIndex(1,1,n);
for (i = 1 ; i < n - 1; ++i) {
for (j = 1 ; j < n - 1 ; ++j) {
...
dst[index++] = current_pixel;
}
index+=2; // skip over last pixel of this line and first pixel of next line
}
As you move your 3x3 window of 9 pixels across the image, you could "remember" the left-most column of 3 pixels from the previous position, then instead of 9 additions for each pixel, you would do a single subtraction for the left-most column leaving the window and 3 additions for the new column entering the window on the right side, i.e. 4 calculations instead of 9.
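
The sliding-window idea in the last paragraph can be sketched as follows. To keep it short the sketch works on a single 8-bit channel rather than the three-channel pixel struct, and `box_sum_row` is a hypothetical helper name; the same pattern would be repeated per colour channel.

```c
#include <assert.h>
#include <stddef.h>

/* Compute the 3x3 window sums for one interior row i of an n x n
 * single-channel image, reusing column sums as the window slides right.
 * out[j] receives the sum for columns j = 1 .. n-2. */
void box_sum_row(const unsigned char *img, int n, int i, int *out)
{
    assert(n >= 3 && i >= 1 && i < n - 1);
    const unsigned char *up = img + (size_t)(i - 1) * (size_t)n;
    const unsigned char *mid = up + n;
    const unsigned char *down = mid + n;

    int left = up[0] + mid[0] + down[0];     /* column j-1 of the window */
    int centre = up[1] + mid[1] + down[1];   /* column j of the window   */
    int sum = left + centre;                 /* two of three columns     */

    for (int j = 1; j < n - 1; ++j) {
        /* New column entering on the right: two additions to build it,
         * one more to add it to the running sum. */
        int right = up[j + 1] + mid[j + 1] + down[j + 1];
        sum += right;
        out[j] = sum;
        /* One subtraction for the column leaving on the left. */
        sum -= left;
        left = centre;
        centre = right;
    }
}
```

Per output pixel this does three additions and one subtraction instead of summing all nine values again. For the sharpening kernel the same running sum can be used, with the 10x central pixel correction applied on top.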

CMSIS FIR bandpass filter

I am trying to implement a 60 kHz bandpass filter on the STM32F407 microcontroller and I'm having some issues. I generated the filter with MATLAB's fdatool and then simulated it in MATLAB as well. The following MATLAB script simulates it.
% FIR Window Bandpass filter designed using the FIR1 function.
% All frequency values are in Hz.
Fs = 5250000; % Sampling Frequency
N = 1800; % Order
Fc1 = 59950; % First Cutoff Frequency
Fc2 = 60050; % Second Cutoff Frequency
flag = 'scale'; % Sampling Flag
% Create the window vector for the design algorithm.
win = hamming(N+1);
% Calculate the coefficients using the FIR1 function.
b = fir1(N, [Fc1 Fc2]/(Fs/2), 'bandpass', win, flag);
Hd = dfilt.dffir(b);
%----------------------------------------------------------
%----------------------------------------------------------
T = 1 / Fs; % sample time
L = 4500; % Length of signal
t = (0:L-1)*T; % Time vector
% Animate the passband frequency span
for f=55500:50:63500
signal = sin(2*pi*f*t);
plot(filter(Hd, signal));
axis([0 L -1 1]);
str=sprintf('Signal frequency (Hz) %d', f);
title(str);
drawnow;
end
pause;
close all;
signal = sin(2*pi*50000*t) + sin(2*pi*60000*t) + sin(2*pi*78000*t);
signal = signal / 3;
signal = signal(1:1:4500);
filterInput = signal;
filterOutput = filter(Hd,signal);
subplot(2,1,1);
plot(filterInput);
axis([0 4500 -1 1]);
subplot(2,1,2);
plot(filterOutput)
axis([0 4500 -1 1]);
pause;
close all;
From the fdatool I extract the filter coefficients as 16-bit unsigned integers in q15 format, because of the 12-bit ADC that I'm using. The filter coefficients header that is generated by MATLAB is here, and the resulting plot of the coefficients can be seen in the following picture.
Below is the code for the filter implementation, which obviously isn't working, and I don't really know what I can do differently. I've looked at some examples online (Example 1 and Example 2).
#include "fdacoefs.h"
#define FILTER_SAMPLES 4500
#define BLOCK_SIZE 900
static uint16_t firInput[FILTER_SAMPLES];
static uint16_t firOutput[FILTER_SAMPLES];
static uint16_t firState[NUM_TAPS + BLOCK_SIZE - 1];
uint16_t util_calculate_filter(uint16_t *buffer, uint32_t len)
{
uint16_t i;
uint16_t max;
uint16_t min;
uint32_t index;
// Create filter instance
arm_fir_instance_q15 instance;
// Ensure that the buffer length isn't longer than the sample size
if (len > FILTER_SAMPLES)
len = FILTER_SAMPLES;
for (i = 0; i < len ; i++)
{
firInput[i] = buffer[i];
}
// Call Initialization function for the filter
arm_fir_init_q15(&instance, NUM_TAPS, &firCoeffs, &firState, BLOCK_SIZE);
// Call the FIR process function, num of blocks to process = (FILTER_SAMPLES / BLOCK_SIZE)
for (i = 0; i < (FILTER_SAMPLES / BLOCK_SIZE); i++) //
{
// BLOCK_SIZE = samples to process per call
arm_fir_q15(&instance, &firInput[i * BLOCK_SIZE], &firOutput[i * BLOCK_SIZE], BLOCK_SIZE);
}
arm_max_q15(&firOutput, len, &max, &index);
arm_min_q15(&firOutput, len, &min, &index);
// Convert output back to uint16 for plotting
for (i = 0; i < (len); i++)
{
buffer[i] = (uint16_t)(firOutput[i] - 30967);
}
return (uint16_t)((max+min));
}
The ADC is sampling at 5.25 MSPS and takes 4500 samples of a 60 kHz signal. Here you can see the input to the filter and then the output of the filter, which is pretty weird.
Is there anything obvious that I've missed? I'm completely lost, so any pointers and tips are helpful!

As Lundin pointed out, I changed it to work with 32-bit integers instead, and that actually solved my problem. Of course, I generated new filter coefficients with MATLAB's fdatool as signed 32-bit integers instead.
static signed int firInput[FILTER_SAMPLES];
static signed int firOutput[FILTER_SAMPLES];
static signed int firState[NUM_TAPS + BLOCK_SIZE -1];
uint16_t util_calculate_filter(uint16_t *buffer, uint32_t len)
{
uint16_t i;
q63_t power; // arm_power_q31 writes its result through a q63_t (64-bit) pointer
uint32_t index;
// Create filter instance
arm_fir_instance_q31 instance;
// Ensure that the buffer length isn't longer than the sample size
if (len > FILTER_SAMPLES)
len = FILTER_SAMPLES;
for (i = 0; i < len ; i++)
{
firInput[i] = (int)buffer[i];
}
// Call Initialization function for the filter
arm_fir_init_q31(&instance, NUM_TAPS, &firCoeffs, &firState, BLOCK_SIZE);
// Call the FIR process function, num of blocks to process = (FILTER_SAMPLES / BLOCK_SIZE)
for (i = 0; i < (FILTER_SAMPLES / BLOCK_SIZE); i++) //
{
// BLOCK_SIZE = samples to process per call
//arm_fir_q31(&instance, &firInput[i * BLOCK_SIZE], &firOutput[i * BLOCK_SIZE], BLOCK_SIZE);
arm_fir_q31(&instance, &firInput[i * BLOCK_SIZE], &firOutput[i * BLOCK_SIZE], BLOCK_SIZE);
}
arm_power_q31(&firOutput, len, &power);
// Convert output back to uint16 for plotting
for (i = 0; i < (len); i++)
{
buffer[i] = (uint16_t)(firOutput[i] - 63500);
}
return (uint16_t)((power/10));
}
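
The code above copies the raw unsigned ADC counts into the signed filter buffers and removes a fixed offset afterwards. An alternative that may be worth trying, shown here only as a sketch under assumptions, is to remove the mid-scale offset before filtering so that the samples are signed and centred on zero. It assumes a 12-bit ADC with mid-scale at 2048 and uses plain `int16_t`, which is what CMSIS-DSP defines `q15_t` as; the helper names are hypothetical and not part of CMSIS.

```c
#include <assert.h>
#include <stdint.h>

/* Map an unsigned 12-bit ADC count (0..4095) to a signed 16-bit
 * fixed-point value centred on zero: 0 -> -32768, 2048 -> 0. */
int16_t adc12_to_q15(uint16_t sample)
{
    assert(sample <= 4095);
    return (int16_t)(((int32_t)sample - 2048) * 16);
}

/* Map a signed 16-bit fixed-point value back to a 12-bit count,
 * for example to plot the filter output on the same scale as the input. */
uint16_t q15_to_adc12(int16_t value)
{
    return (uint16_t)((value / 16) + 2048);
}
```

With the input centred like this, the output no longer needs an empirically chosen constant such as 63500 subtracted from it. The same idea carries over to the 32-bit version with a larger scale factor.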
