Does fmodf() cause a hardfault in stm32? - c

I am trying to create a modulated waveform out of two sine waves.
To do this I need the modulo (fmodf) to find the amplitude that a sine of a given frequency (lo_frequency) has at time t. But I get a hardfault when the following line is executed:
j = fmodf(2 * PI * lo_frequency * t, 2 * PI);
Do you have an idea why this gives me a hardfault ?
Edit 1:
I exchanged fmodf with my_fmodf:
float my_fmodf(float x, float y){
if(y == 0){
return 0;
}
float n = x / y;
return x - n * y;
}
But the hardfault still occurs, and when I debug it, execution doesn't even enter this function (my_fmodf).
Here's the whole function in which the error occurs:
int* create_wave(int* message){
/* Mixes the message signal at 10kHz and the carrier at 40kHz.
* When a bit of the message is 0 the amplitude is lowered to 10%.
* When a bit of the message is 1 the amplitude is 100%.
* The output of the STM32 can't be negative, thats why the wave swings between
* 0 and 256 (8bit precision for faster DAC)
*/
static int rf_frequency = 10000;
static int lo_frequency = 40000;
static int sample_rate = 100000;
int output[sample_rate];
int index, mix;
float j, t;
for(int i = 0; i <= sample_rate; i++){
t = i * 0.00000001f; // i * 10^-8
j = my_fmodf(2 * PI * lo_frequency * t, 2 * PI);
if (j < 0){
j += (float) 2 * PI;
}
index = floor((16.0f / (lo_frequency/rf_frequency * 0.0001f)) * t);
if (index < 16) {
if (!message[index]) {
mix = 115 + sin1(j) * 0.1f;
} else {
mix = sin1(j);
}
} else {
break;
}
output[i] = mix;
}
return output;
}
Edit 2:
I fixed the warning: function returns address of local variable [-Wreturn-local-addr] the way "chux - Reinstate Monica" suggested.
int* create_wave(int* message){
static uint16_t rf_frequency = 10000;
static uint32_t lo_frequency = 40000;
static uint32_t sample_rate = 100000;
int *output = malloc(sizeof *output * sample_rate);
uint8_t index, mix;
float j, n, t;
for(int i = 0; i < sample_rate; i++){
t = i * 0.00000001f; // i * 10^-8
j = fmodf(2 * PI * lo_frequency * t, 2 * PI);
if (j < 0){
j += 2 * PI;
}
index = floor((16.0f / (lo_frequency/rf_frequency * 0.0001f)) * t);
if (index < 16) {
if (!message[index]) {
mix = (uint8_t) floor(115 + sin1(j) * 0.1f);
} else {
mix = sin1(j);
}
} else {
break;
}
output[i] = mix;
}
return output;
}
But now I get the hardfault on this line:
output[i] = mix;
EDIT 3:
Because the previous code contained a very large buffer array that did not fit into the 16 KB SRAM of the STM32F303K8, I needed to change it.
Now I use a "ping-pong" buffer, using the DMA callbacks for "first half transmitted" and "completely transmitted":
void HAL_DAC_ConvHalfCpltCallbackCh1(DAC_HandleTypeDef * hdac){
HAL_GPIO_WritePin(GPIOB, GPIO_PIN_3, GPIO_PIN_SET);
for(uint16_t i = 0; i < 128; i++){
new_value = sin_table[(i * 8) % 256];
if (message[message_index] == 0x0){
dac_buf[i] = new_value * 0.1f + 115;
} else {
dac_buf[i] = new_value;
}
}
}
void HAL_DAC_ConvCpltCallbackCh1 (DAC_HandleTypeDef * hdac){
HAL_GPIO_WritePin(GPIOB, GPIO_PIN_3, GPIO_PIN_RESET);
for(uint16_t i = 128; i < 256; i++){
new_value = sin_table[(i * 8) % 256];
if (message[message_index] == 0x0){
dac_buf[i] = new_value * 0.1f + 115;
} else {
dac_buf[i] = new_value;
}
}
message_index++;
if (message_index >= 16) {
message_index = 0;
// HAL_DAC_Stop_DMA (&hdac1, DAC_CHANNEL_1);
}
}
And it works the way I wanted.
But the frequency of the generated sine is too low.
It tops out at around 20 kHz, but I need 40 kHz.
I already increased the clock by a factor of 8, so that one is maxed out.
I can still decrease the counter period (it is 50 at the moment), but when I do so the interrupt callback seems to take longer than the time until the next one.
At least it seems so, as the output becomes very distorted when I do that.
I also tried to reduce the precision by taking only every 8th sine value, but
I can't go any further because then the output no longer looks like a sine wave.
Any ideas how I could optimize the callback so that it takes less time?
Any other ideas?

Does fmodf() cause a hardfault in stm32?
Other problems in the code are causing the hard fault here.
Failing to compile with ample warnings
Best code tip: enable all warnings. @KamilCuk
Faster feedback than Stack Overflow.
I'd expect something like the below from a well-enabled compiler.
return output;
warning: function returns address of local variable [-Wreturn-local-addr]
Returning a local Object
A function cannot return a pointer to its own local array; the array ceases to exist when the function returns. Allocate instead.
// int output[sample_rate];
int *output = malloc(sizeof *output * sample_rate);
return output;
Calling code will need to free() the pointer.
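Since the caller now owns the buffer, it is also worth checking the allocation before using it. On a part with only a few KB of RAM a request of this size can fail, and writing through the resulting NULL pointer would fault as well. A minimal sketch, assuming a hypothetical helper name (`alloc_wave`) that is not part of the original code:

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: allocate a zeroed sample buffer, or return NULL
 * if the request is empty, would overflow, or the heap cannot satisfy it. */
int *alloc_wave(size_t sample_count)
{
    /* Reject requests whose byte size would overflow size_t. */
    if (sample_count == 0 || sample_count > SIZE_MAX / sizeof(int)) {
        return NULL;
    }
    return calloc(sample_count, sizeof(int));
}
```

The caller would test the result (`if (wave == NULL) { /* report the error, do not write */ }`) and call `free(wave)` once the samples are no longer needed.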
Out of range array access
static int sample_rate = 100000;
int output[sample_rate];
// for(int i = 0; i <= sample_rate; i++){
for(int i = 0; i < sample_rate; i++){
...
output[i] = mix;
}
Stack overflow?
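static int sample_rate = 100000; int output[sample_rate]; is a large local variable. With a 4-byte int that is a 400,000-byte array, far more than a typical microcontroller stack can hold. Maybe allocate it, or try something much smaller?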
Advanced: loss of precision
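A good fmodf() does not lose precision. For a more precise hand-written version, consider double math for the intermediate results. Note that the quotient must be truncated to a whole number; otherwise x - n * y is always approximately 0. An even better approach is more involved.
float my_fmodf(float x, float y){
if(y == 0){
return 0;
}
double n = trunc(1.0 * x / y); // whole number of times y fits into x; needs <math.h>
return (float) (x - n * y);
}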
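Another way to sidestep the precision question is to avoid the large product altogether and keep a running phase that is wrapped after every sample. This is a minimal sketch and not part of the original answer; the names `phase_step` and `phase_advance` are made up for illustration, and it assumes the tone frequency is non-negative and below the sample rate, so that a single subtraction is enough to wrap.

```c
#include <math.h>

#define TWO_PI_F 6.28318530717958647692f

/* Phase increment per sample for a tone of freq_hz at sample_rate_hz. */
float phase_step(float freq_hz, float sample_rate_hz)
{
    return TWO_PI_F * freq_hz / sample_rate_hz;
}

/* Advance the phase by one step and wrap it back into [0, 2*pi).
 * Assumes 0 <= step < 2*pi, so one subtraction is sufficient. */
float phase_advance(float phase, float step)
{
    phase += step;
    if (phase >= TWO_PI_F) {
        phase -= TWO_PI_F;
    }
    return phase;
}
```

The loop then calls `phase = phase_advance(phase, step);` once per sample and passes `phase` straight to the sine lookup, so no fmodf() call is needed.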
Can I not use any function within another ?
Yes. Code has other issues.

One value every 10 µs is only 100 kSPS, which is not too much for this micro. In my designs I generate > 5 MSPS signals without any problems. Usually I have one buffer and DMA in circular mode. First I fill the buffer and start generation. When the half-transfer DMA interrupt is triggered, I fill the first half of the buffer with fresh data. When the transfer-complete interrupt is triggered, I fill the second half, and the process repeats all over again.
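The half/full refill scheme can be sketched without any HAL specifics. The names below (`fill_half`, `BUF_LEN`, the 8-bit buffer type) are placeholders chosen for illustration, and the 256-entry table layout is an assumption. Each interrupt rewrites only the half that the DMA has just finished with, and the 10% scaling is done with integer arithmetic to keep the interrupt path short:

```c
#include <stddef.h>
#include <stdint.h>

#define BUF_LEN   256u            /* whole circular DMA buffer */
#define HALF_LEN  (BUF_LEN / 2u)

/* Fill one half of the buffer (half = 0 or 1) from a 256-entry sine table.
 * bit != 0: full amplitude; bit == 0: 10% amplitude around an offset of 115,
 * mirroring the scaling used in the question. */
void fill_half(uint8_t *buf, const uint8_t *sin_table,
               unsigned half, unsigned step, int bit)
{
    size_t start = half ? HALF_LEN : 0u;
    for (size_t i = start; i < start + HALF_LEN; i++) {
        uint8_t s = sin_table[(i * step) & 0xFFu];   /* index wraps at 256 */
        buf[i] = bit ? s : (uint8_t)(115u + s / 10u);
    }
}
```

The half-transfer callback would call this with `half = 0` and the transfer-complete callback with `half = 1`, so neither ever touches the half the DMA is currently reading.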

Related

Struggle to program an LFO

Based on some tutorial code I found, I coded a little synthesizer with three oscillators and four different waveforms. It works well and I want to add an LFO to modulate the sounds. Since I didn't code everything on my own, I'm a bit confused about how to fit the LFO formula into my code. This is more or less what I tried in order to apply the LFO formula to a sine wave. (The formula is something like this: sinewaveFormula + 0.5 * Sinefreq * sin(2pi*1) * time)
double normalize(double phase)
{
double cycles = phase/(2.0*pi);
phase -= trunc(cycles) * 2.0 * pi;
if (phase < 0) phase += 2.0*pi;
return phase;
}
double sine(double phase)
{ phase = normalize(phase); return (sin(phase));}
static void build_sine_table(int16_t *data, int wave_length) {
double phase_increment = (2.0f * pi) / (double)wave_length;
double current_phase = 0;
for(int i = 0; i < wave_length; i++) {
int sample = synthOsc(current_phase, oscNum, selectedWave, selectedWave2, selectedWave3, intensity, intensity2, intensity3) + 0.5 * ((current_phase* wave_length) / (2*pi)) * sin(2*pi*(1.0)) * wave_length;
data[i] = (int16_t)sample;
current_phase += phase_increment;
}
}
static void write_samples(int16_t *s_byteStream, long begin, long end, long length) {
if(note > 0) {
double d_sample_rate = sample_rate;
double d_table_length = table_length;
double d_note = note;
// get correct phase increment for note depending on sample rate and table length.
double phase_increment = (get_pitch(d_note) / d_sample_rate) * d_table_length;
// loop through the buffer and write samples.
for (int i = 0; i < length; i+=2) {
phase_double += phase_increment;
phase_int = (int)phase_double;
if(phase_double >= table_length) {
double diff = phase_double - table_length;
phase_double = diff;
phase_int = (int)diff;
}
if(phase_int < table_length && phase_int > -1) {
if(s_byteStream != NULL) {
int16_t sample = sine_waveform_wave[phase_int];
target_amp = update_envelope();
if(smoothing_enabled) {
// move current amp towards target amp for a smoother transition.
if(current_amp < target_amp) {
current_amp += smoothing_amp_speed;
if(current_amp > target_amp) {
current_amp = target_amp;
}
} else if(current_amp > target_amp) {
current_amp -= smoothing_amp_speed;
if(current_amp < target_amp) {
current_amp = target_amp;
}
}
} else {
current_amp = target_amp;
}
sample *= current_amp; // scale volume.
s_byteStream[i+begin] = sample; // left channel
s_byteStream[i+begin+1] = sample; // right channel
}
}
}
}
}
The code compiles but there's no LFO on the sine. I don't understand how I could make this formula work with this code.
It may help to get a basic understanding of how an LFO actually works. It is not that difficult: an LFO is just another oscillator that is mixed into the waveform you want to modulate.
I would suggest removing your LFO formula from the expression that calls synthOsc(), so that you get a clean oscillator signal again. As a next step, create another oscillator signal with a very low frequency. Mix both signals together and you are done.
Expressed in simple math, it is like this:
int the_sample_you_want_to_modulate = synthOsc1(...);
int a_sample_with_very_low_frequency = synthOsc2(...);
Mixing two waveforms is done through addition:
int mixed_sample = the_sample_you_want_to_modulate + a_sample_with_very_low_frequency;
The resulting sample will now sweep at the frequency you used for synthOsc2().
As you can see, to implement an LFO you do not actually need a separate formula. You already have the formula once you know how to create an oscillator.
Note that if you add two sine oscillators that have exactly the same frequency (and phase), the resulting signal will just get louder. But when each has a different frequency, you get a new waveform. For LFOs (which are in fact just ordinary oscillators, like in your build_sine_table() function) you typically set a very low frequency: 1 - 10 Hz is low enough to get an audible sweep. For higher frequencies you get chords as a result.

Writing a wave generator with SDL

I've coded a simple sequencer in C with SDL 1.2 and SDL_mixer (to play .wav files). It works well and I want to add some audio synthesis to this program. I looked around and found this sine wave code using SDL2 (https://github.com/lundstroem/synth-samples-sdl2/blob/master/src/synth_samples_sdl2_2.c)
Here's how the sinewave is coded in the program:
static void build_sine_table(int16_t *data, int wave_length)
{
/*
Build sine table to use as oscillator:
Generate a 16bit signed integer sinewave table with 1024 samples.
This table will be used to produce the notes.
Different notes will be created by stepping through
the table at different intervals (phase).
*/
double phase_increment = (2.0f * pi) / (double)wave_length;
double current_phase = 0;
for(int i = 0; i < wave_length; i++) {
int sample = (int)(sin(current_phase) * INT16_MAX);
data[i] = (int16_t)sample;
current_phase += phase_increment;
}
}
static double get_pitch(double note) {
/*
Calculate pitch from note value.
offset note by 57 halfnotes to get correct pitch from the range we have chosen for the notes.
*/
double p = pow(chromatic_ratio, note - 57);
p *= 440;
return p;
}
static void audio_callback(void *unused, Uint8 *byte_stream, int byte_stream_length) {
/*
This function is called whenever the audio buffer needs to be filled to allow
for a continuous stream of audio.
Write samples to byteStream according to byteStreamLength.
The audio buffer is interleaved, meaning that both left and right channels exist in the same
buffer.
*/
// zero the buffer
memset(byte_stream, 0, byte_stream_length);
if(quit) {
return;
}
// cast buffer as 16bit signed int.
Sint16 *s_byte_stream = (Sint16*)byte_stream;
// buffer is interleaved, so get the length of 1 channel.
int remain = byte_stream_length / 2;
// split the rendering up in chunks to make it buffersize agnostic.
long chunk_size = 64;
int iterations = remain/chunk_size;
for(long i = 0; i < iterations; i++) {
long begin = i*chunk_size;
long end = (i*chunk_size) + chunk_size;
write_samples(s_byte_stream, begin, end, chunk_size);
}
}
static void write_samples(int16_t *s_byteStream, long begin, long end, long length) {
if(note > 0) {
double d_sample_rate = sample_rate;
double d_table_length = table_length;
double d_note = note;
/*
get correct phase increment for note depending on sample rate and table length.
*/
double phase_increment = (get_pitch(d_note) / d_sample_rate) * d_table_length;
/*
loop through the buffer and write samples.
*/
for (int i = 0; i < length; i+=2) {
phase_double += phase_increment;
phase_int = (int)phase_double;
if(phase_double >= table_length) {
double diff = phase_double - table_length;
phase_double = diff;
phase_int = (int)diff;
}
if(phase_int < table_length && phase_int > -1) {
if(s_byteStream != NULL) {
int16_t sample = sine_wave_table[phase_int];
sample *= 0.6; // scale volume.
s_byteStream[i+begin] = sample; // left channel
s_byteStream[i+begin+1] = sample; // right channel
}
}
}
}
}
I don't understand how I could change the sine wave formula to generate other waveforms like square/triangle/saw, etc.
EDIT:
Because I forgot to explain it, here's what I tried.
I followed the example I saw in this video series (https://www.youtube.com/watch?v=tgamhuQnOkM). The source code of the method shown in the video is on GitHub, and the wave generation code looks like this:
double w(double dHertz)
{
return dHertz * 2.0 * PI;
}
// General purpose oscillator
double osc(double dHertz, double dTime, int nType = OSC_SINE)
{
switch (nType)
{
case OSC_SINE: // Sine wave bewteen -1 and +1
return sin(w(dHertz) * dTime);
case OSC_SQUARE: // Square wave between -1 and +1
return sin(w(dHertz) * dTime) > 0 ? 1.0 : -1.0;
case OSC_TRIANGLE: // Triangle wave between -1 and +1
return asin(sin(w(dHertz) * dTime)) * (2.0 / PI);
}
Because the C++ code here uses the Windows sound API, I could not copy/paste this method to make it work with the piece of code I found using SDL2.
So I tried this in order to obtain a square wave:
static void build_sine_table(int16_t *data, int wave_length)
{
double phase_increment = ((2.0f * pi) / (double)wave_length) > 0 ? 1.0 : -1.0;
double current_phase = 0;
for(int i = 0; i < wave_length; i++) {
int sample = (int)(sin(current_phase) * INT16_MAX);
data[i] = (int16_t)sample;
current_phase += phase_increment;
}
}
This didn't give me a square wave but more of a saw wave.
Here's what I tried to get a triangle wave:
static void build_sine_table(int16_t *data, int wave_length)
{
double phase_increment = (2.0f * pi) / (double)wave_length;
double current_phase = 0;
for(int i = 0; i < wave_length; i++) {
int sample = (int)(asin(sin(current_phase) * INT16_MAX)) * (2 / pi);
data[i] = (int16_t)sample;
current_phase += phase_increment;
}
}
This also gave me another type of waveform, not triangle.
You’d replace the sin function call with a call to one of the following:
// this is a helper function only
double normalize(double phase)
{
double cycles = phase/(2.0*M_PI);
phase -= trunc(cycles) * 2.0 * M_PI;
if (phase < 0) phase += 2.0*M_PI;
return phase;
}
double square(double phase)
{ return (normalize(phase) < M_PI) ? 1.0 : -1.0; }
double sawtooth(double phase)
{ return -1.0 + normalize(phase) / M_PI; }
double triangle(double phase)
{
phase = normalize(phase);
if (phase >= M_PI)
phase = 2*M_PI - phase;
return -1.0 + 2.0 * phase / M_PI;
}
You’d be building tables just like you did for the sine, except they’d be the square, sawtooth and triangle tables, respectively.

I'm designing a guitar tuner through ATmega16p and CodeVisionAVR and i just can't get my code to run

I'm designing a guitar tuner with an Atmel mega16 processor and CodeVisionAVR for my university's second project. I have connected a mono jack to the processor's PINA.7 (ADC input) and GND. I have 7 LEDs (PORTB.0..6) that should turn on through a series of if/else if checks based on the frequency of the fundamental of the signal.
I'm finding the fundamental of the signal through a DFT (I know there are faster FTs, but our university told us to use a DFT; they know why) of 800 samples. From the 800 samples, it calculates the frequency spectrum. Then the next for loop calculates the absolute value of each frequency bin and picks the largest, so it can be a good reference point for a guitar tuner.
For the moment, I have included in the main function just one broad frequency condition to see if the LED lights up, but it doesn't.
I have tried switching on LEDs 0 to 6 at various points in the code and it seems to stop at F = computeDft();, so I removed the variable and just let computeDft(); run, but the next LEDs did not light up. Is the function never getting called? I have tried the function in Visual Studio with a generated cosine signal and it works perfectly. It always detects the fundamental. Why doesn't it work in CVAVR?
#define M_PI 3.1415926f
#define N 800
unsigned char read_adc(void)
{
ADCSRA |= 0x40; //start conversion;
while (ADCSRA&(0x40)); //wait conversion end
return (float)ADCH;
}
typedef struct
{
float re;
float im;
} Complex;
float computeDft()
{
unsigned char x[N] = {0};
float max = 0;
float maxi = 0;
float magnitude = 0;
Complex X1[N] = {0};
int n = N;
int k;
for (n = 0; n < N; ++n)
{
for (k = 0; k < n; k++)
{
x[k] = read_adc();
X1[n].re += x[k] * cos(n * k * M_PI / N);
X1[n].im -= x[k] * sin(n * k * M_PI / N);
}
}
for (k = 0; k < n; k++)
{
magnitude = sqrt(X1[k].re * X1[k].re + X1[k].im * X1[k].im);
if (magnitude > maxi)
{
maxi = magnitude;
max = k;
}
}
return max;
}
/*
* main function of program
*/
void main (void)
{
float F = 0;
Init_initController(); // this must be the first "init" action/call!
#asm("sei") // enable interrupts
LED1 = 1; // initial state, will be changed by timer 1
L0 = 0;
L1 = 0;
L2 = 0;
L3 = 0;
L4 = 0;
L5 = 0;
L6 = 0;
ADMUX = 0b10100111; // set ADC0
ADCSRA = 0b10000111; //set ADEN, precale by 128
while(TRUE)
{
wdogtrig(); // call often else processor will reset ;
F = computeDft();
if (F > 50 && F < 200)
{
L3 = 1;
}
}
}// end main loop
The result I'm trying to achieve: a signal from a phone or a computer (probably a YouTube video of a guy tuning his guitar) is sent through the jack to the processor's AD converter (PINA.7). The main function calls computeDft(), which asks read_adc() to store in x[k] the voltage being sent through the cable, then computes its DFT. The same function then selects the frequency of the fundamental (the one with the highest absolute value) and returns it. Inside the main function, a variable is assigned the value of the fundamental, and through a series of ifs its value is compared to the standard guitar string frequencies of 82.6, 110, etc...
1. First of all: just picking the largest component in the DFT does not make a good tuner, since, depending on the instrument played, overtones may have a larger amplitude than the fundamental. A decent tuner can be built using, e.g., an auto-correlation algorithm.
2. I see this line in your project:
wdogtrig(); // call often else processor will reset ;
Why do you need the watchdog in the first place? Where is it configured? What timeout is it set for? How long do you think it takes to run both nested loops in computeDft(), with a lot of floating-point operations inside, including a sine and a cosine at each step, on a 16 MHz 8-bit MCU? I think that will take several seconds at least, so either do not use the watchdog at all, or reset it more often.
3. Look at
cos(n * k * M_PI / N);
(by the way, are you sure it is cos(n * k * M_PI / N); not cos(n * k * 2 * M_PI / N);?)
since cos(x) = cos(x + 2 * M_PI), you can see this formula can be expressed as cos((n * k * 2) % (2 * N) * M_PI / N). I.e. you can precalculate all 2*N possible values and put them as a constant table into the flash memory.
4. Look at nested loops in computeDft()
Inside the inner loop, you are calling read_adc() each time!
You want to pick the signal into the buffer once, and then perform DFT over the saved signal. I.e. first you read ADC values into x[k] array:
for (k = 0; k < N; k++)
{
x[k] = read_adc();
}
and only then you perform DFT calculations over it:
for (n = 0; n < N; ++n)
{
for (k = 0; k < n; k++)
{
X1[n].re += x[k] * cos(n * k * M_PI / N);
X1[n].im -= x[k] * sin(n * k * M_PI / N);
}
}
5. Look carefully at two cycles:
for (n = 0; n < N; ++n)
..
X1[n].re += x[k] * cos(n * k * M_PI / N);
X1[n].im -= x[k] * sin(n * k * M_PI / N);
}
Here at each step, you are calculating the value of X1[n], none of the previous X1 values are used.
And another loop below:
for (k = 0; k < n; k++)
{
magnitude = sqrt(X1[k].re * X1[k].re + X1[k].im * X1[k].im);
...
}
Here you are calculating the magnitude of X1[k], and no previous or next values of X1 are used. So, you can simply combine them together:
for (n = 0; n < N; ++n)
{
for (k = 0; k < n; k++)
{
X1[n].re += x[k] * cos(n * k * M_PI / N);
X1[n].im -= x[k] * sin(n * k * M_PI / N);
}
magnitude = sqrt(X1[n].re * X1[n].re + X1[n].im * X1[n].im);
if (magnitude > maxi)
{
maxi = magnitude;
max = n; // n is the bin index here; k is only the inner loop counter
}
}
Here you can clearly see there is no reason to store X1[n].re and X1[n].im in an array. Just get rid of it!
for (n = 0; n < N; ++n)
{
float re = 0;
float im = 0;
for (k = 0; k < N; k++) // a DFT bin sums over all N samples, not just the first n
{
re += x[k] * cos(n * k * M_PI / N);
im -= x[k] * sin(n * k * M_PI / N);
}
magnitude = sqrt(re * re + im * im);
if (magnitude > maxi)
{
maxi = magnitude;
max = n; // n is the bin index here; k is only the inner loop counter
}
}
That's all! You have saved 6 KB by removing the pointless Complex X1[N] array.
6. There is an error in your initialization code:
ADMUX = 0b10100111; // set ADC0
I don't know what an "ATmega16P" is; I assume it works the same as the "ATmega16". The two most significant bits of this register, called REFS1 and REFS0, select the reference voltage. Possible values are:
00 - external voltage from AREF pin;
01 - AVCC voltage taken as reference
11 - internal regulator (2.56V for ATmega16, 1.1V for ATmega168PA)
10 is an incorrect value.
7. The guitar output is a small signal, maybe several dozen millivolts. Also, it is an AC signal, which swings both positive and negative. So, before feeding the signal to the MCU's input you have to shift it (otherwise you'll see only the positive half-wave) and amplify it.
I.e. it is not enough to just connect the jack plug to GND and the ADC input; you need some circuitry that brings the signal to the appropriate level.
You can google for it. For example, this:
[schematic image, from a linked project]

CMSIS FIR bandpass filter

I am trying to implement a 60 kHz bandpass filter on the STM32F407 microcontroller and I'm having some issues. I generated the filter with the help of MATLAB's fdatool and then simulated it in MATLAB as well. The following MATLAB script simulates it.
% FIR Window Bandpass filter designed using the FIR1 function.
% All frequency values are in Hz.
Fs = 5250000; % Sampling Frequency
N = 1800; % Order
Fc1 = 59950; % First Cutoff Frequency
Fc2 = 60050; % Second Cutoff Frequency
flag = 'scale'; % Sampling Flag
% Create the window vector for the design algorithm.
win = hamming(N+1);
% Calculate the coefficients using the FIR1 function.
b = fir1(N, [Fc1 Fc2]/(Fs/2), 'bandpass', win, flag);
Hd = dfilt.dffir(b);
%----------------------------------------------------------
%----------------------------------------------------------
T = 1 / Fs; % sample time
L = 4500; % Length of signal
t = (0:L-1)*T; % Time vector
% Animate the passband frequency span
for f=55500:50:63500
signal = sin(2*pi*f*t);
plot(filter(Hd, signal));
axis([0 L -1 1]);
str=sprintf('Signal frequency (Hz) %d', f);
title(str);
drawnow;
end
pause;
close all;
signal = sin(2*pi*50000*t) + sin(2*pi*60000*t) + sin(2*pi*78000*t);
signal = signal / 3;
signal = signal(1:1:4500);
filterInput = signal;
filterOutput = filter(Hd,signal);
subplot(2,1,1);
plot(filterInput);
axis([0 4500 -1 1]);
subplot(2,1,2);
plot(filterOutput)
axis([0 4500 -1 1]);
pause;
close all;
From fdatool I export the filter coefficients as 16-bit unsigned integers in q15 format, because of the 12-bit ADC that I'm using. The filter coefficients header generated by MATLAB is here [link], and the resulting plot of the coefficients can be seen in the following picture: [plot of the filter coefficients]
Below is the code for the filter implementation, which obviously isn't working, and I don't really know what I can do differently. I've looked at some examples online: Example 1 and Example 2.
#include "fdacoefs.h"
#define FILTER_SAMPLES 4500
#define BLOCK_SIZE 900
static uint16_t firInput[FILTER_SAMPLES];
static uint16_t firOutput[FILTER_SAMPLES];
static uint16_t firState[NUM_TAPS + BLOCK_SIZE - 1];
uint16_t util_calculate_filter(uint16_t *buffer, uint32_t len)
{
uint16_t i;
uint16_t max;
uint16_t min;
uint32_t index;
// Create filter instance
arm_fir_instance_q15 instance;
// Ensure that the buffer length isn't longer than the sample size
if (len > FILTER_SAMPLES)
len = FILTER_SAMPLES;
for (i = 0; i < len ; i++)
{
firInput[i] = buffer[i];
}
// Call Initialization function for the filter
arm_fir_init_q15(&instance, NUM_TAPS, &firCoeffs, &firState, BLOCK_SIZE);
// Call the FIR process function, num of blocks to process = (FILTER_SAMPLES / BLOCK_SIZE)
for (i = 0; i < (FILTER_SAMPLES / BLOCK_SIZE); i++) //
{
// BLOCK_SIZE = samples to process per call
arm_fir_q15(&instance, &firInput[i * BLOCK_SIZE], &firOutput[i * BLOCK_SIZE], BLOCK_SIZE);
}
arm_max_q15(&firOutput, len, &max, &index);
arm_min_q15(&firOutput, len, &min, &index);
// Convert output back to uint16 for plotting
for (i = 0; i < (len); i++)
{
buffer[i] = (uint16_t)(firOutput[i] - 30967);
}
return (uint16_t)((max+min));
}
The ADC is sampling at 5.25 MSPS and it samples a 60 kHz signal 4500 times. Here you can see the input to the filter and then the output of the filter, which is pretty weird: [plots of the filter input and the filter output]
Is there anything obvious that I've missed? Because I'm completely lost and any pointers and tips are helpful!
As Lundin pointed out, I changed it to work with 32-bit integers instead, and that actually solved my problem. Of course, I generated new filter coefficients with MATLAB's fdatool as signed 32-bit integers.
static signed int firInput[FILTER_SAMPLES];
static signed int firOutput[FILTER_SAMPLES];
static signed int firState[NUM_TAPS + BLOCK_SIZE -1];
uint16_t util_calculate_filter(uint16_t *buffer, uint32_t len)
{
uint16_t i;
q63_t power; // arm_power_q31() writes its result through a q63_t (64-bit) pointer
uint32_t index;
// Create filter instance
arm_fir_instance_q31 instance;
// Ensure that the buffer length isn't longer than the sample size
if (len > FILTER_SAMPLES)
len = FILTER_SAMPLES;
for (i = 0; i < len ; i++)
{
firInput[i] = (int)buffer[i];
}
// Call Initialization function for the filter
arm_fir_init_q31(&instance, NUM_TAPS, &firCoeffs, &firState, BLOCK_SIZE);
// Call the FIR process function, num of blocks to process = (FILTER_SAMPLES / BLOCK_SIZE)
for (i = 0; i < (FILTER_SAMPLES / BLOCK_SIZE); i++) //
{
// BLOCK_SIZE = samples to process per call
//arm_fir_q31(&instance, &firInput[i * BLOCK_SIZE], &firOutput[i * BLOCK_SIZE], BLOCK_SIZE);
arm_fir_q31(&instance, &firInput[i * BLOCK_SIZE], &firOutput[i * BLOCK_SIZE], BLOCK_SIZE);
}
arm_power_q31(&firOutput, len, &power);
// Convert output back to uint16 for plotting
for (i = 0; i < (len); i++)
{
buffer[i] = (uint16_t)(firOutput[i] - 63500);
}
return (uint16_t)((power/10));
}
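A side note on the earlier q15 attempt: CMSIS-DSP defines q15_t as a signed 16-bit type, whereas that version kept the samples in uint16_t buffers. If the q15 path were revisited, the raw ADC codes would first need to be made signed. The helper below is a sketch only; `adc12_to_q15` is a hypothetical name, and it assumes right-aligned 12-bit codes centred on mid-scale (2048), which the original post does not state.

```c
#include <stddef.h>
#include <stdint.h>

/* Convert unsigned 12-bit ADC codes (0..4095) to signed 16-bit values:
 * remove the mid-scale offset, then scale up to use the full 16-bit range. */
void adc12_to_q15(const uint16_t *in, int16_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int32_t centered = (int32_t)(in[i] & 0x0FFFu) - 2048;  /* -2048..2047 */
        out[i] = (int16_t)(centered * 16);                     /* -32768..32752 */
    }
}
```

The output buffer can then be handed to a q15 filter as signed data, and the result shifted back the same way for plotting.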

C: Accessing lookup tables faster?

I have a piece of code that traces 4 sines at a time.
My original code was making roughly 12000 sin() function calls per frame and was running at 30 fps.
I tried optimizing it by generating lookup tables. I ended up with 16 different lookup tables. I declare and load them in a separate header file at the top of my program. Each table is declared like so:
static const float d4_lookup[800] = {...};
Now, with this new method I actually lost fps?! I'm running at 20 fps now instead of 30. Each frame now only has to do 8 sin / cos calls and 19200 lookup calls vs 12000 sin() calls.
I compile using gcc with -O3 flag on. At the moment, the lookup tables are included at the top and are part of the global scope of the program.
I assume I'm not loading them in the right memory or something to that effect. How can I speed up the lookup time?
** EDIT 1 **
As requested, here's the function that uses the lookup calls, it is called once per frame:
void
update_sines(void)
{
static float c1_sin, c1_cos;
static float c2_sin, c2_cos;
static float c3_sin, c3_cos;
static float c4_sin, c4_cos;
clock_gettime(CLOCK_MONOTONIC, &spec);
s = spec.tv_sec;
ms = spec.tv_nsec * 0.0000001;
etime = concatenate((long)s, ms);
c1_sin = sinf(etime * 0.00525);
c1_cos = cosf(etime * 0.00525);
c2_sin = sinf(etime * 0.007326);
c2_cos = cosf(etime * 0.007326);
c3_sin = sinf(etime * 0.0046);
c3_cos = cosf(etime * 0.0046);
c4_sin = sinf(etime * 0.007992);
c4_cos = cosf(etime * 0.007992);
int k;
for (k = 0; k < 800; ++k)
{
sine1[k] = a1_lookup[k] * ((bx1_sin_lookup[k] * c1_cos) + (c1_sin * bx1_cos_lookup[k])) + d1_lookup[k];
sine2[k] = a2_lookup[k] * ((bx2_sin_lookup[k] * c2_cos) + (c2_sin * bx2_cos_lookup[k])) + d2_lookup[k] + 50;
sine3[k] = a3_lookup[k] * ((bx3_sin_lookup[k] * c3_cos) + (c3_sin * bx3_cos_lookup[k])) + d3_lookup[k];
sine4[k] = a4_lookup[k] * ((bx4_sin_lookup[k] * c4_cos) + (c4_sin * bx4_cos_lookup[k])) + d4_lookup[k] + 50;
}
}
** UPDATE **
For anyone reading this thread, I gave up on this problem. I tried using OpenCL kernels, structs, SIMD instructions as well as all the solutions shown here. In the end the original code that computed sinf() 12800 times per frame worked faster than the lookup tables, since the lookup tables didn't fit into the cache. Yet it was still only doing 30 fps. It just had too much going on to keep up with my 60 fps expectations. I've decided to take a different direction. Thanks to everyone who contributed to this thread. Most of these solutions would probably give some half-decent speed improvements, but nothing like the 200% speed-up I needed here to have the lookup tables work the way I wanted.
Sometimes it's hard to know what's slowing you down, but you are potentially ruining your cache hit rate. You could try a lookup table of structs:
typedef struct
{
float bx1_sin;
float bx2_sin;
float bx3_sin;
float bx4_sin;
float bx1_cos;
/* ... and so on for the remaining tables,
   including sine1, 2, 3, 4 as well */
} lookup_table;
then
lookup_table lookup[800];
Now everything for the kth lookup will be in the same small chunk of memory.
Also, if you use a macro that takes k as a parameter to do the contents of the loop, say SINE_CALC(k), or an inline function,
you can do
for (k = 0; k < 800; ) // k is advanced inside the body, five entries per pass
{
SINE_CALC(k); k++;
SINE_CALC(k); k++;
SINE_CALC(k); k++;
SINE_CALC(k); k++;
SINE_CALC(k); k++;
}
If you use a macro, make sure the k++ is outside the macro call as shown.
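The struct-per-index layout and the inline-function alternative to the macro can be sketched together for a single curve. The names (`sine_entry`, `sine_calc`, `update_curve`) are hypothetical, and the reading of a, bx and d as amplitude, angle and offset is inferred from the table names in the question.

```c
#include <stddef.h>

/* All per-index constants for one curve, stored side by side so that
 * one loop iteration touches one small chunk of memory. */
typedef struct {
    float a;        /* amplitude */
    float bx_sin;   /* precomputed sin(b*x) */
    float bx_cos;   /* precomputed cos(b*x) */
    float d;        /* offset */
} sine_entry;

/* a * sin(b*x + c) + d, expanded with the angle-sum identity so that only
 * sin(c) and cos(c) have to be recomputed each frame. */
static inline float sine_calc(const sine_entry *e, float c_sin, float c_cos)
{
    return e->a * (e->bx_sin * c_cos + c_sin * e->bx_cos) + e->d;
}

/* Update n output values of one curve from its table. */
void update_curve(float *out, const sine_entry *table, size_t n,
                  float c_sin, float c_cos)
{
    for (size_t k = 0; k < n; k++) {
        out[k] = sine_calc(&table[k], c_sin, c_cos);
    }
}
```

With an inline function the loop can stay a plain `for`, leaving any unrolling to the compiler.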
Try splitting your loop like this:
for (k = 0; k < 800; ++k)
{
sine1[k] = a1_lookup[k];
sine2[k] = a2_lookup[k];
sine3[k] = a3_lookup[k];
sine4[k] = a4_lookup[k];
}
for (k = 0; k < 800; ++k)
{
sine1[k] *= ((bx1_sin_lookup[k] * c1_cos) + (c1_sin * bx1_cos_lookup[k]));
sine2[k] *= ((bx2_sin_lookup[k] * c2_cos) + (c2_sin * bx2_cos_lookup[k]));
sine3[k] *= ((bx3_sin_lookup[k] * c3_cos) + (c3_sin * bx3_cos_lookup[k]));
sine4[k] *= ((bx4_sin_lookup[k] * c4_cos) + (c4_sin * bx4_cos_lookup[k]));
}
for (k = 0; k < 800; ++k)
{
sine1[k] += d1_lookup[k];
sine2[k] += d2_lookup[k] + 50;
sine3[k] += d3_lookup[k];
sine4[k] += d4_lookup[k] + 50;
}
By accessing fewer lookup tables in each loop, you should be able to stay in the cache. The middle loop could be split up as well, but you'll need to create an intermediate table for one of the sub-expressions.
Intel processors can predict serial access (and perform prefetch) for up to 4 arrays, both for forward and backward traversal. At least this was true in the Core 2 Duo days. Split your for loop into:
for (k = 0; k < 800; ++k)
sine1[k] = a1_lookup[k] * ((bx1_sin_lookup[k] * c1_cos) + (c1_sin * bx1_cos_lookup[k])) + d1_lookup[k];
for (k = 0; k < 800; ++k)
sine2[k] = a2_lookup[k] * ((bx2_sin_lookup[k] * c2_cos) + (c2_sin * bx2_cos_lookup[k])) + d2_lookup[k] + 50;
for (k = 0; k < 800; ++k)
sine3[k] = a3_lookup[k] * ((bx3_sin_lookup[k] * c3_cos) + (c3_sin * bx3_cos_lookup[k])) + d3_lookup[k];
for (k = 0; k < 800; ++k)
sine4[k] = a4_lookup[k] * ((bx4_sin_lookup[k] * c4_cos) + (c4_sin * bx4_cos_lookup[k])) + d4_lookup[k] + 50;
I guess you have more cache load than the benchmarks in other answers, so this does matter. I recommend not unrolling loops by hand; compilers do it well.
Using a simple sin lookup table yields a >20% speed increase on my Linux machine (VM, gcc, 64-bit). Interestingly, the size of the lookup table (within reasonable values, below the L1 cache size) does not influence the speed of execution.
Using a simple fastsin implementation from here I got a >45% improvement.
Code:
#include <math.h>
#include <stdio.h>
#include <stdint.h>
#include <sys/time.h>
#include <time.h>
#define LOOKUP_SIZE 628
uint64_t currentTimestampUs( void )
{
struct timeval tv;
time_t localTimeRet;
uint64_t timestamp = 0;
//time_t tzDiff = 0;
struct tm when;
int64_t localeOffset = 0;
{
localTimeRet = time(NULL);
localtime_r ( &localTimeRet, &when );
localeOffset = when.tm_gmtoff * 1000000ll;
}
gettimeofday ( &tv, NULL );
timestamp = ((uint64_t)((tv.tv_sec) * 1000000ll) ) + ( (uint64_t)(tv.tv_usec) );
timestamp+=localeOffset;
return timestamp;
}
const double PI = 3.141592653589793238462;
const double PI2 = 3.141592653589793238462 * 2;
static float sinarr[LOOKUP_SIZE + 1]; // one extra entry so that index LOOKUP_SIZE (pi/2) is valid
void initSinArr() {
int a =0;
for (a=0; a<=LOOKUP_SIZE; a++) {
double arg = (1.0*a/LOOKUP_SIZE)*((double)PI * 0.5);
float sinval_f = sin(arg); // double computation earlier to avoid losing precision on value
sinarr[a] = sinval_f;
}
}
float sinlookup(float val) {
float normval = val;
while (normval < 0) {
normval += PI2;
}
while (normval > PI2) {
normval -= PI2;
}
int index = LOOKUP_SIZE*(2*normval/PI);
if (index > 3*LOOKUP_SIZE) {
index = -index + 4*LOOKUP_SIZE;//LOOKUP_SIZE - (index-3*LOOKUP_SIZE);
return -sinarr[index];
} else if (index > 2*LOOKUP_SIZE) {
index = index - 2*LOOKUP_SIZE;
return -sinarr[index];
} else if (index > LOOKUP_SIZE) {
index = 2*LOOKUP_SIZE - index;
return sinarr[index];
} else {
return sinarr[index];
}
}
float sin_fast(float x) {
while (x < -PI)
x += PI2;
while (x > PI)
x -= PI2;
//compute sine
if (x < 0)
return 1.27323954 * x + .405284735 * x * x;
else
return 1.27323954 * x - 0.405284735 * x * x;
}
int main(void) {
initSinArr();
int a = 0;
float val = 0;
const int num_tries = 100000;
uint64_t startLookup = currentTimestampUs();
for (a=0; a<num_tries; a++) {
for (val=0; val<PI2; val+=0.01) {
float compval = sinlookup(val);
(void)compval;
}
}
uint64_t startSin = currentTimestampUs();
for (a=0; a<num_tries; a++) {
for (val=0; val<PI2; val+=0.01) {
float compval = sin(val);
(void)compval;
}
}
uint64_t startFastSin = currentTimestampUs();
for (a=0; a<num_tries; a++) {
for (val=0; val<PI2; val+=0.01) {
float compval = sin_fast(val);
(void)compval;
}
}
uint64_t end = currentTimestampUs();
int64_t lookupMs = (startSin - startLookup)/1000;
int64_t sinMs = (startFastSin - startSin)/1000;
int64_t fastSinMs = (end - startFastSin)/1000;
printf(" lookup: %lld ms\n", (long long)lookupMs );
printf(" sin: %lld ms\n", (long long)sinMs );
printf(" diff: %lld ms\n", (long long)(sinMs-lookupMs));
printf(" diff%%: %lld %%\n", (long long)(100*(sinMs-lookupMs)/sinMs)); // %% prints a literal percent sign
printf("fastsin: %lld ms\n", (long long)fastSinMs );
printf(" sin: %lld ms\n", (long long)sinMs );
printf(" diff: %lld ms\n", (long long)(sinMs-fastSinMs));
printf(" diff%%: %lld %%\n", (long long)(100*(sinMs-fastSinMs)/sinMs));
}
Sample result:
lookup: 2276 ms
sin: 3004 ms
diff: 728 ms
diff%: 24 %
fastsin: 1500 ms
sin: 3004 ms
diff: 1504 ms
diff%: 50 %
