Optimize a weighted moving average


Environment : STM32H7 and GCC
Working with a stream of data: one sample is received over SPI every 250 us
I compute a "triangle" weighted moving average over 256 samples: the middle sample is weighted 1 and the weights fall off linearly on either side, forming a triangle
My samples are stored in a circular buffer, uint32_t val[256], indexed by a uint8_t write_idx
The samples are 24 bits wide, so the maximum value of a sample is 0x00FFFFFF
uint8_t write_idx =0;
uint32_t val[256];
float coef[256];
void init(void)
{
uint8_t counter=0;
// I calculate my triangle coefs
for(uint16_t c=0;c<256;c++)
{
coef[c]=(c>127)?--counter:++counter;
coef[c]/=128;
}
}
void ACQ_Complete(void)
{
uint32_t moy=0;
// write_idx is meant to wrap
val[write_idx++]= new_sample;
// calc moving average (uint8_t)(c-write_idx) is meant to wrap
for(uint16_t c=0;c<256;c++)
moy += (uint32_t)(val[c]*coef[(uint8_t)(c-write_idx)]);
moy/=128;
}
I have to finish the calculation within the 250 us sample period, but I measured with a debug GPIO pin that the "moy" part alone takes 252 us
Code is simulated here
Interesting fact: if I remove the (uint32_t) cast near the end, it takes 274 us instead of 252 us
How can I get it done faster?
I was thinking of using uint32_t instead of float for coef (by multiplying by 1000, for example), but my uint32_t would overflow

This should unquestionably be done in integer. It will be both faster and more accurate.
This processor can do 32x32+64=64 multiply accumulate in a single cycle!
Multiply all your coefficients by a power of 2 (not 1000 mentioned in the comments), and then shift down at the end rather than divide.
uint32_t coef[256];
uint64_t moy = 0;
for(unsigned int c = 0; c < 256; c++)
{
moy += (val[c] * (uint64_t)coef[(c - write_idx) & 0xFFu]);
}
moy >>= N;
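As a rough sketch of the integer approach, assuming the triangle weights from the question are stored as the integers 1..128..0 (the float coefficients scaled by 2^7): those weights sum to 2^14, and the question's extra division by 128 folds into the same shift, so N works out to 14 under this assumption. The names init_coef, push_sample and COEF_SHIFT are placeholders, not taken from the original code.

```c
#include <assert.h>
#include <stdint.h>

#define TAPS       256u
#define COEF_SHIFT 14u          /* integer weights sum to 2^14 */

static uint32_t val[TAPS];      /* circular sample buffer */
static uint32_t coef[TAPS];     /* integer triangle weights 1..128..0 */
static uint8_t  write_idx;      /* wraps at 256 by design */

void init_coef(void)
{
    uint32_t counter = 0;
    for (unsigned int c = 0; c < TAPS; c++)
        coef[c] = (c > 127u) ? --counter : ++counter;
}

uint32_t push_sample(uint32_t new_sample)
{
    uint64_t acc = 0;

    assert(new_sample <= 0x00FFFFFFu);   /* 24-bit samples assumed */
    val[write_idx++] = new_sample;

    /* Worst case is (2^24 - 1) * 2^14, which needs a 64-bit accumulator. */
    for (unsigned int c = 0; c < TAPS; c++)
        acc += (uint64_t)val[c] * coef[(c - write_idx) & 0xFFu];

    return (uint32_t)(acc >> COEF_SHIFT);
}
```

Because the weights sum to exactly 2^14, a constant input comes back unchanged once the buffer is full, which is a convenient sanity check on the scaling.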

Related

PWM for LED dimming unusual behavior

I wrote a function that generates a PWM signal on an output port (PORTD) without using the PWM control registers of the PIC microcontroller (PIC18F452). To dim an LED connected to the output slowly, I tried to increase the time the duty cycle takes to advance from 0% to 100% of one period while keeping the square-wave frequency constant. Everything should go as planned, except that the second parameter passed to the pwm function somehow resets when j goes from 655 to 656 (that is, when the duty cycle is at 65%). After this event, the value passed to pwm starts again from 0, whereas it should reset not at the transition from 655 to 656 but at the transition from 1000 to 1001.
void main(void) {
TRISD = 0x00; //port D set as output
LATD = 0x00; //port D output set LOW
unsigned int width = 1000; // length of T_on + T_off
unsigned int j;
unsigned int res;
while(1){
for (j = 1; j <= width; j++){
res = (unsigned int)((j*100)/width);
pwm(&LATD, res);
}
}
return;
}
void pwm(volatile unsigned char *lat, unsigned int cycle){
if(cycle > 100){ // reset the "cycle"
cycle = 100;
}
unsigned int i = 1;
while(i<=(cycle)){ // T_on
*lat = 0x01;
i++;
}
unsigned int j = 100-cycle;
while(j){ // T_off
*lat = 0;
j--;
}
return;
}
As for the program itself, it should work like so:
second parameter passed into pwm function is the duty cycle (in %) which changes from 0 to 100
with variable "width" the time needed for duty cycle to advance from 0% to 100% is controlled (width = 100 represents fastest time and everything above that is considered gradually slower time from 0% to 100%)
expression ((j*100)/width) serves as step variable inside "while" loop inside pwm function:
if width = 100, step is increased every increment of "j"
if width = 1000, step is increased every 10 increments of "j",
etc.
PORTD is passed into function as its address, whereas in function pwm, this address is operated via pointer variable lat
As for the problem itself, I could only assume two possibilities: either data type of second parameter of function pwm is incorrect or there is some unknown limitation within PIC microprocessor.
Also, here are the definitions of the configuration bits (device-specific registers) of the PIC, located in the header file included in this program: https://imgur.com/a/UDYifgN
This is, how the program should operate: https://vimeo.com/488207207
This is, how the program currently operates: https://vimeo.com/488207746

The problem is a 16-bit overflow:
res = (unsigned int)((j*100)/width);
If j is greater than 655, the result of j*100 is greater than 65535 and no longer fits in a 16-bit unsigned int, so it wraps around. Do this calculation in 32 bits instead. Or, more simply, loop res directly from 0 to 100.
For example:
for (res = 0; res <= 100; res++){
pwm(&LATD, res);
}
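To make the wrap-around concrete, here is a small sketch of the 32-bit alternative mentioned above. The function names are illustrative only; duty_16bit truncates the product explicitly so that it behaves like a 16-bit unsigned int even when compiled on a host where int is wider.

```c
#include <assert.h>
#include <stdint.h>

/* Mimics a target where unsigned int is 16 bits: the product wraps. */
uint16_t duty_16bit(uint16_t j, uint16_t width)
{
    assert(width != 0u);
    uint16_t product = (uint16_t)(j * 100u);   /* kept modulo 65536 */
    return (uint16_t)(product / width);
}

/* Widening one operand first keeps the full product. */
uint16_t duty_32bit(uint16_t j, uint16_t width)
{
    assert(width != 0u);
    return (uint16_t)(((uint32_t)j * 100u) / width);
}
```

With width = 1000, j = 655 gives 65500 (still in range), while j = 656 gives 65600, which wraps to 64 in 16 bits and divides down to 0.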

Calculation with preprocessor-defines yields incorrect value

I am pretty sure it is a common mistake of a beginner but I don't know the correct words for finding an existing answer.
I have defined some consecutive preprocessor macros:
#define DURATION 10000
#define NUMPIXELS 60
#define NUMTRANSITION 15
#define UNITS_PER_PIXEL 128
#define R ((NUMPIXELS-1)/2 + NUMTRANSITION) * UNITS_PER_PIXEL
Later I use these values to calculate a value and assign it to a variable. Here two examples:
// example 1
uint16_t temp; // ranges from 500 to 10000
temp = 255 * (temp - 500) / (10000 - 500);
Here, temp is always 0. Since my guess was/is an issue with the datatype, I also tried uint32_t temp. However, temp was always 255 in this case.
// example 2
uint32_t h = millis() + offset - t0 * R / DURATION;
millis() returns an increasing unsigned long value (milliseconds since the start). Here, h increases a factor of 4 too fast. The same for unsigned long h. When I tried a workaround by dividing by 4*DURATION, h was always 0.
Is it a datatype issue? Whether it is or not, how can I solve it?

This code works on Arduino Uno and an ESP32 as expected
#define DURATION 10000
#define NUMPIXELS 60
#define NUMTRANSITION 15
#define UNITS_PER_PIXEL 128
#define R ((NUMPIXELS-1)/2 + NUMTRANSITION) * UNITS_PER_PIXEL
uint32_t t0 = millis();
void setup() {
// put your setup code here, to run once:
Serial.begin (115200);
//Later I use these values to calculate a value and assign it to a variable. Here two examples:
// example 1
randomSeed(721);
}
void loop() {
// For UNO /ESP use uint16_t temp = random(500,10000); // ranges from 500 to 10000
// for ATtiny85 (DigiSpark)
uint32_t temp = random(500,10000); // ranges from 500 to 10000
temp = 255 * (temp - 500) / (10000 - 500);
// Here, temp is always 0. Since my guess was / is an issue with the datatype, I also tried uint32_t temp. However, temp was always 255 in this case.
// example 2
Serial.println(temp);
uint16_t offset = random(2000, 5000);
uint32_t h = millis() + offset - t0 * R / DURATION;
Serial.println(h);
delay (5000); // Only to prevent too much Serial data, never use in production
}
Environment: Arduino IDE 1.8.12 / avr core 1.8.2 and ESP32 1.04. What hardware are you compiling for? Try the test program; it does what it should, at least on the tested hardware.
EDIT: For reference, the OP uses an ATtiny85 (Digispark), where variable size definitions are more critical than on the Uno/ESP; instead of Serial you would use SerialUSB. Tip for future questions: always state your environment (hardware and software) when working with microcontrollers, because issues can be hardware specific.

Find two worst values and delete in sum

A microcontroller has the job of sampling ADC (analog-to-digital conversion) values. Since these parts are affected by tolerance and noise, the accuracy can be significantly increased by deleting the 4 worst values. Finding and deleting them takes time, which is not ideal, since it increases the cycle time.
Imagine a clock frequency of 100 MHz, so each instruction takes 10 ns to process; the more instructions, the longer the controller is blocked from taking the next set of samples.
So my goal is to make the sorting process as fast as possible. For this I currently use the code below, but it only deletes the two worst values!
uint16_t getValue(void){
uint16_t adcval[8] = {0}; //filled with the 8 ADC samples in the real code
uint16_t min = 16383; //14bit full
uint16_t max = 1; //zero is physically almost impossible!
uint32_t sum = 0; //variable for the summing
for(uint8_t i=0; i<8;i++){
if(adcval[i] > max) max = adcval[i];
if(adcval[i] < min) min = adcval[i];
sum=sum+adcval[i];
}
uint16_t result = (sum-max-min)/6; //remove two worst and divide by 6
return result;
}
Now I would like to extend this function to delete the 4 worst values out of the 8 samples to get more precision. Any advice on how to do this?
Additionally, it would be wonderful to build an efficient function that finds the most deviating values, instead of the highest and lowest. For example, imagine these two arrays:
uint16_t adc1[8] = {5,6,10,11,11,12,20,22};
uint16_t adc2[8] = {5,6,7,7,10,11,15,16};
The first case would gain precision from the described mechanism (delete the 4 worst). In the second case, however, it would delete the values 5 and 6 as well as 15 and 16. That would theoretically make the calculation worse, since deleting 10, 11, 15 and 16 would be better. Is there a fast way to delete the 4 most deviating values?

If your 14-bit ADC is returning values from 5 to 16 and the voltage reference is 3.3 V, the input voltage varies from about 1 mV to 3 mV. It is very likely that these are correct readings. It is very difficult to design a good input circuit for a 14-bit ADC.
It is better to use a running average. What is a running average? It is a software low-pass filter.
Blue: readings from the ADC; red: running average.
The second signal is a very low amplitude sine wave (9-27 mV, assuming 14 bits and a 3.3 V reference).
The algorithm:
static int average;
int running_average(int val, int level)
{
average -= average / level;
average += val; // add the raw sample: average already holds the mean scaled by level
return average / level;
}
void init_average(int val, int level)
{
average = val * level;
}
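If the level is a power of 2, the divisions become shifts. In this version, level is the shift count (the base-2 logarithm of the window length), and it needs only 6 instructions (no branches) to calculate the average.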
static int average;
int running_average(int val, int level)
{
average -= average >> level;
average += val; // add the raw sample: average already holds the mean shifted left by level
return average >> level;
}
void init_average(int val, int level)
{
average = val << level;
}
I assume that average will not overflow. If it can, you need to choose a larger type.
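On the "larger type" point, here is a minimal sketch with a 32-bit accumulator, assuming non-negative samples of at most 14 bits and a shift below 16 so that the scaled accumulator fits. The accumulator holds the average shifted left by shift bits, so each update removes one average-sized share and adds the raw sample. The name acc and the assert limits are choices made for this sketch, not part of the answer above.

```c
#include <assert.h>
#include <stdint.h>

static int32_t acc;   /* holds average * 2^shift */

void init_average(int32_t val, unsigned int shift)
{
    assert(shift < 16u);
    acc = val * ((int32_t)1 << shift);
}

int32_t running_average(int32_t val, unsigned int shift)
{
    assert(shift < 16u);
    /* Right-shifting a negative value is implementation-defined in C,
       so this sketch assumes non-negative ADC samples. */
    acc -= acc >> shift;   /* drop one average-sized share */
    acc += val;            /* add the new raw sample */
    return acc >> shift;
}
```

Called as init_average(first_sample, 4) followed by running_average(sample, 4) for each new sample, this behaves like a first-order low-pass filter with a smoothing factor of 1/16.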

This answer is somewhat off topic, as it recommends a hardware solution, but if performance is required and the MCU can't implement P__J__'s solution, then this is your next best option.
It seems you want to remove noise from your input signal. This can be done in software using DSP (digital signal processing), but it can also be done by configuring your hardware differently.
By adding the proper filter at the proper place before your ADC, it is possible to remove much of the (outside) noise from your ADC output. (Of course, you can't go below the noise floor that is inherent in the ADC.)
There are several Q&As on electronics.stackexchange.com.
One solution is adding a capacitor to filter some high-frequency noise, as noted by DerStorm8.
The Photon has another great solution here, suggesting RC, Sallen-Key, and cascaded Sallen-Key filters for filtering a continuous signal.
Here (ADN007) is an Analog Design Note from Microchip on "Techniques that Reduce System Noise in ADC Circuits"
It may seem that designing a low noise, 12-bit Analog-to-Digital
Converter (ADC) board or even a 10-bit board is easy. This is
true, unless one ignores the basics of low noise design. For
instance, one would think that most amplifiers and resistors work
effectively in 12-bit or 10-bit environments. However, poor device
selection becomes a major factor in the success or failure of the
circuit. Another, often ignored, area that contributes a great deal
of noise, is conducted noise. Conducted noise is already in the
circuit board by the time the signal arrives at the input of the
ADC. The most effective way to remove this noise is by using a
low-pass (anti-aliasing) filter prior to the ADC. Including by-pass
capacitors and using a ground plane will also eliminate this type
of noise. A third source of noise is radiated noise. The major
sources of this type of noise are Electromagnetic Interference
(EMI) or capacitive coupling of signals from trace-to-trace.
If all three of these issues are addressed, then it is true that
designing a low noise 12-bit ADC board is easy.
And their recommended solution path:
It is easy to design a true 12-bit ADC system by using a few
key low noise guidelines. First, examine your devices (resistors
and amplifiers) to make sure they are low noise. Second, use a
ground plane whenever possible. Third, include a low-pass filter
in the signal path if you are changing the signal from analog to
digital. Finally, and always, include by-pass capacitors. These
capacitors not only remove noise but also foster circuit stability.
Here is a good paper by Analog Devices on input noise. They note in here that "there are some instances where input noise can actually be helpful in achieving higher resolution."
All analog-to-digital converters (ADCs) have a certain amount of input-referred noise—modeled as a noise source connected in series with the input of a noise-free ADC. Input-referred noise is not to be confused with quantization noise, which is only of interest when an ADC is processing time-varying signals. In most cases, less input noise is better; however, there are some instances where input noise can actually be helpful in achieving higher resolution. If this doesn’t seem to make sense right now, read on to find out how some noise can be good noise.

Given that you have a fixed size array, a hard-coded sorting network should be able to correctly sort the entire array with only 19 comparisons. Currently you have 8+2*8=24 comparisons already, although it is possible that the compiler unrolls the loop, leaving you with 16 comparisons. It is conceivable that, depending on the microcontroller hardware, a sorting network can be implemented with some degree of parallelism -- perhaps you also have to query the adc values sequentially which would give you opportunity to pre-sort them, while waiting for the comparison.
An optimal sorting network should be searchable online. Wikipedia has some pointers.
So, you would end up with some code like this:
sort_data(adcval);
return (adcval[2]+adcval[3]+adcval[4]+adcval[5])/4;
Update:
As you can take from this picture (source) of optimal sorting networks, a complete sort takes 19 comparisons. However 3 of those are not strictly needed if you only want to extract the middle 4 values. So you get down to 16 comparisons.
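As an illustration of what such a hard-coded network could look like, here is a sketch using one common 19-comparator arrangement for 8 inputs (sort each half of four, then merge the halves). The helper names cswap, sort8 and middle4_average are made up for this example, and the comparator order is an assumption of this sketch rather than the network from the linked picture.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Compare-exchange: afterwards s[a] <= s[b]. */
static void cswap(uint16_t *s, size_t a, size_t b)
{
    if (s[a] > s[b]) {
        uint16_t t = s[a];
        s[a] = s[b];
        s[b] = t;
    }
}

/* 19-comparator sorting network for exactly 8 elements. */
void sort8(uint16_t s[8])
{
    assert(s != NULL);
    /* sort the lower half */
    cswap(s, 0, 1); cswap(s, 2, 3); cswap(s, 0, 2); cswap(s, 1, 3); cswap(s, 1, 2);
    /* sort the upper half */
    cswap(s, 4, 5); cswap(s, 6, 7); cswap(s, 4, 6); cswap(s, 5, 7); cswap(s, 5, 6);
    /* merge the two sorted halves */
    cswap(s, 0, 4); cswap(s, 1, 5); cswap(s, 1, 4); cswap(s, 2, 6); cswap(s, 3, 7);
    cswap(s, 3, 6); cswap(s, 2, 4); cswap(s, 3, 5); cswap(s, 3, 4);
}

/* Sorts in place, then averages the four middle values. */
uint16_t middle4_average(uint16_t s[8])
{
    sort8(s);
    return (uint16_t)(((uint32_t)s[2] + s[3] + s[4] + s[5]) / 4u);
}
```

The comparison sequence is fixed and never depends on the data, so the run time is nearly constant, which is usually convenient in a sampling loop.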

to delete the 4 worst values out of the 8 samples
The methods are described on GeeksforGeeks under "k largest (or smallest) elements in an array"; you can implement whichever method suits you best.
I decided to use this good site to generate the best sorting network, expressed as the SWAP() macros needed to sort an array of 8 elements. Then I wrote a small C program that checks my sorting function against reordered 8-element arrays. Because we only care about the two groups of 4 elements, I then did something brute force: for each SWAP() macro, I tried commenting it out and checked whether the program still succeeded. I could comment out 5 SWAP macros, leaving 14 comparisons to identify the smallest 4 elements in the array of 8 samples.
/**
* Sorts the array, but only so that groups of 4 matter.
* So group of 4 smallest elements and 4 biggest elements
* will be sorted ok.
* s[0]...s[3] will have lowest 4 elements
* so they have to be "deleted"
* s[4]...s[7] will have the highest 4 values
*/
void sort_but_4_matter(int s[8]) {
#define SWAP(x, y) do { \
if (s[x] > s[y]) { \
const int t = s[x]; \
s[x] = s[y]; \
s[y] = t; \
} \
} while(0)
SWAP(0, 1);
//SWAP(2, 3);
SWAP(0, 2);
//SWAP(1, 3);
//SWAP(1, 2);
SWAP(4, 5);
SWAP(6, 7);
SWAP(4, 6);
SWAP(5, 7);
//SWAP(5, 6);
SWAP(0, 4);
SWAP(1, 5);
SWAP(1, 4);
SWAP(2, 6);
SWAP(3, 7);
//SWAP(3, 6);
SWAP(2, 4);
SWAP(3, 5);
SWAP(3, 4);
#undef SWAP
}
/* -------- testing code */
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
int cmp_int(const void *a, const void *b) {
return *(const int*)a - *(const int*)b;
}
void printit_arr(const int *arr, size_t n) {
printf("{");
for (size_t i = 0; i < n; ++i) {
printf("%d", arr[i]);
if (i != n - 1) {
printf(" ");
}
}
printf("}");
}
void printit(const char *pre, const int arr[8],
const int in[8], const int res[4]) {
printf("%s: ", pre);
printit_arr(arr, 8);
printf(" ");
printit_arr(in, 8);
printf(" ");
printit_arr(res, 4);
printf("\n");
}
int err = 0;
void test(const int arr[8], const int res[4]) {
int in[8];
memcpy(in, arr, sizeof(int) * 8);
sort_but_4_matter(in);
// sort for memcmp below
qsort(in, 4, sizeof(int), cmp_int);
if (memcmp(in, res, sizeof(int) * 4) != 0) {
printit("T", arr, in, res);
err = 1;
}
}
void test_all_combinations() {
const int result[4] = { 0, 1, 2, 3 }; // sorted
const size_t n = 8;
int num[8] = { 0, 1, 2, 3, 4, 5, 6, 7 };
for (size_t j = 0; j < n; j++) {
for (size_t i = 0; i < n-1; i++) {
int temp = num[i];
num[i] = num[i+1];
num[i+1] = temp;
test(num, result);
}
}
}
int main() {
test_all_combinations();
return err;
}
Tested on godbolt. With gcc -O2 on x86_64, sort_but_4_matter compiles to fewer than 100 instructions.

Use Arduino to generate sin wave modulated by gold code

I am trying to use an Arduino to generate a sine wave, with a gold code determining when the wave gets a phase shift. However, the output is not what I expected. Sometimes no phase shift occurs for ten consecutive cycles, which should not happen given our gold code array. Which part of the code should I fix?
int gold_code[]={1,-1,-1,-1,-1,-1,-1,1,-1,-1,-1,1,-1,-1,1,1,-1,-1, 1,1,-1,1,1,1,-1,1,-1,1,1,-1,1,1,-1,-1,-1,-1,-1,1,1,-1,-1,1,1,-1,1,-1,1,-1,-1,1,1,1,-1,-1,1,1,1,1,-1,1,1,-1, -1, 1, 1,1,-1,-1,1,-1,-1,1,1,1};
void loop()
{
int n = sizeof(gold_code)/sizeof(gold_code[0]);
byte bsin[128];
int it;
unsigned long tm0;
unsigned int tm;
for(int i=0;i<128;i++)
{
bsin[i] = 8 + (int)(0.5 + 7.*sin( (double)i*3.14159265/64.));
}
int count=0;
int count1=0;
Serial.println(n);
tm0 = micros();
while(true)
{
tm = micros() - tm0;
if(tm > 511)
{
tm0 = tm0+512;
tm -= 512;
count++;
//Serial.println(gold_code[count%n]);
}
tm = (tm >> 2) ;
if(gold_code[count%n]==0){
PORTB = bsin[tm];
}
else{
PORTB = 16-bsin[tm];
}
}
}

The variable count eventually overflows and becomes negative. This, in conjunction with the modulo operation, is a sign (pun intended) of a disaster waiting to happen: for a negative count, count % n is zero or negative, so it can index outside the array.
Use a different method for limiting the value of count to the bounds of your gold_code array.
You should expect a significant increase in frequency after removing the modulo operation, so you may need to add some pacing to your loop.
The pacing in your loop is wrong. Variable count increments 4 times as fast as your phase counter.
Also, @Edward Karak raises a valid point. To do a proper phase shift, you should add to (or subtract from) tm, not the sine value.
[EDIT] I was not quite happy with the way the phase shift is handled. It just doesn't feel right to advance the gold counter at the same pace as the phase counter, so I added a separate timer for that. It advances in the gold_code array every 8 microseconds for now, but you can change it to whatever you're supposed to have.
As in:
unsigned char tm0 = 0;
unsigned char tm0_gold = 0;
unsigned char count = 0; // index into gold_code, used by the loop below
const unsigned char N = sizeof(gold_code) / sizeof(gold_code[0]);
unsigned char phase = 0;
for(;;)
{
// pacing for a stable frequency
unsigned char mic = micros() & 0xFF;
if ((unsigned char)(mic - tm0_gold) >= 8) // cast keeps the difference modulo 256 after integer promotion
{
tm0_gold = mic;
// compute and do the phase shift
if (++count >= N)
count -= N;
if (gold_code[count] > 0) // you have == 0 in your code, but that doesn't make sense.
phase += 16; // I can't make any sense of what you are trying to do,
// so I'll just add 45° of phase for each positive value
// you'll probably want to make your own test here
}
if ((unsigned char)(mic - tm0) >= 4) // cast keeps the difference modulo 256 after integer promotion
{
tm0 = mic;
// advance the phase. keep within the LUT bounds
if (++phase >= 128)
phase -= 128;
// output
PORTB = bsin[phase];
}
}
For frequency stability, you will want to move the sine generator to a timer interrupt, after debugging. This will free up your loop() to do some extra control.
I don't quite understand why count increments as fast as the phase counter.
You may want to increment count at a slower pace to reach your goal.
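To illustrate the point about count overflowing and the modulo operation, here is a small sketch. signed_mod_index shows what a signed counter does once it has wrapped negative, and next_index is one possible way to step through the table without % and without ever overflowing. Both function names are invented for this example.

```c
#include <assert.h>
#include <stddef.h>

/* What the original indexing does: in C the remainder takes the sign
   of the dividend, so a negative count gives a zero or negative index. */
int signed_mod_index(int count, int n)
{
    assert(n > 0);
    return count % n;
}

/* Wrap by comparison instead: the index always stays within 0..n-1. */
size_t next_index(size_t i, size_t n)
{
    assert(n > 0 && i < n);
    return (i + 1 < n) ? i + 1 : 0;
}
```

On an AVR-based Arduino, int is 16 bits, so a counter incremented every 512 microseconds passes 32767 after roughly 17 seconds.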

What is the significance of the output of this FFT function?

I have the following periodic data which has a period of ~2000:
I am trying to discover the period of the data and the offset of the first peak. I have the following FFT function to perform a Fourier Transform:
typedef double _Complex cplx;
void _fft(cplx buf[], cplx out[], int n, int step){
if (step < n) {
_fft(out, buf, n, step*2);
_fft(out+step, buf+step, n, step*2);
for(int i=0; i<n; i+=step*2) {
cplx t = cexp(-I * M_PI * i / n) * out[i + step];
buf[i / 2] = (out[i] + t);
buf[(i + n)/2] = (out[i] - t);
}
}
}
void fft(cplx* buf, int n){
cplx* out = (cplx*)malloc(sizeof(cplx) * n);
memcpy(out, buf, sizeof(cplx)*n);
_fft(buf, out, n, 1);
for(int i=0; i<n; i++){ buf[i] /= n; }
free(out);
}
Which has been adapted from here: Fast Fourier Transformation (C) (link contains a full running example with main function and example data)
I understand that a Fourier Transform converts time series data into frequency data. Each frequency has an amplitude and a phase. However, I am having a hard time understanding the output given by this function. Graphing the output gives me this:
I have tried graphing the real component, the imaginary component, and the magnitude of both components. Each attempt gives a very similar-looking graph.
Am I wrong to assume there should be a spike at ~2000?
Am I misinterpreting the output of this function?

Am I wrong to assume there should be a spike at ~2000?
Yes, because 2000 is the period you're interested in, not the frequency. It looks like you're running a 32,768-point FFT, so you should expect to find a peak in bin #16 (16 cycles per 32k = periods of approximately 2048 samples), not bin #2000.
If you want something that reports directly in terms of number of samples, instead of frequency, you may want an autocorrelation, instead of an FFT. Your signal would have autocorrelation peaks at lags of 2000, 4000, etc.
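Both ideas can be sketched in a few lines. expected_bin converts a period in samples to the FFT bin where its peak should appear, and estimate_period is a naive O(n^2) autocorrelation search that skips the zero-lag lobe and returns the lag of the strongest remaining peak. The function names, the integer sample type, and the peak-picking rule are assumptions of this sketch; it also assumes the input has had its mean removed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* FFT bin where a component with the given period (in samples) peaks. */
size_t expected_bin(size_t n_fft, size_t period)
{
    assert(period > 0);
    return (n_fft + period / 2) / period;   /* n_fft / period, rounded */
}

/* Unnormalized autocorrelation of x at a single lag. */
static int64_t autocorr(const int32_t *x, size_t n, size_t lag)
{
    int64_t sum = 0;
    for (size_t i = 0; i + lag < n; i++)
        sum += (int64_t)x[i] * x[i + lag];
    return sum;
}

/* Lag of the strongest autocorrelation peak after the zero-lag lobe,
   or 0 if no positive peak is found. Expects zero-mean input. */
size_t estimate_period(const int32_t *x, size_t n)
{
    size_t lag = 1;

    /* Skip the main lobe around lag 0, where the signal matches itself. */
    while (lag < n / 2 && autocorr(x, n, lag) > 0)
        lag++;

    size_t best_lag = 0;
    int64_t best = 0;
    for (; lag < n / 2; lag++) {
        int64_t r = autocorr(x, n, lag);
        if (r > best) {
            best = r;
            best_lag = lag;
        }
    }
    return best_lag;
}
```

For a 32,768-point FFT and a period of about 2000 samples, expected_bin gives 16, matching the bin named above. For long records, computing the autocorrelation through the FFT is the usual way to avoid the quadratic cost.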
