I want to compute a moving average, or something similar, because I am getting noisy values from the ADC. This is my first attempt at computing a moving average, but the value goes to 0 every time. Can you help me?
This is the part of the code that does it:
unsigned char buffer[5];
int samples = 0;
USART_Init0(MYUBRR);
uint16_t adc_result0, adc_result1;
float ADCaverage = 0;
while(1)
{
adc_result0 = adc_read(0); // read adc value at PA0
samples++;
//adc_result1 = adc_read(1); // read adc value at PA1
ADCaverage = (ADCaverage + adc_result0)/samples;
sprintf(buffer, "%d\n", (int)ADCaverage);
char * p = buffer;
while (*p) { USART_Transmit0(*p++); }
_delay_ms(1000);
}
return(0);
}
I send this result over USART to display the value.
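Your equation is not correct: (ADCaverage + adc_result0)/samples also divides the previous average by samples, so the result shrinks toward 0 as samples grows.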
Let s_n = (sum_{i=1}^{n} x[i])/n then:
s_(n-1) = (sum_{i=1}^{n-1} x[i])/(n-1)
sum_{i=1}^{n-1} x[i] = (n-1)*s_(n-1)
sum_{i=1}^{n} x[i] = n*s_n
sum_{i=1}^{n} x[i] = sum_{i=1}^{n-1} x[i] + x[n]
n*s_n = (n-1)*s_(n-1) + x[n] = n*s_(n-1) + (x[n]-s_(n-1))
s_n = s_(n-1) + (x[n]-s_(n-1))/n
You must use
ADCaverage += (adc_result0-ADCaverage)/samples;
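As a minimal sketch of that update rule, the state can be kept in a small struct. The names running_mean_t and running_mean_update are made up for this example and are not part of the original code. Note that this is the mean of all samples seen so far, not a windowed moving average.

```c
#include <stdint.h>

/* Running (cumulative) mean: avg_n = avg_(n-1) + (x_n - avg_(n-1)) / n */
typedef struct {
    float    average;  /* mean of all samples seen so far */
    uint32_t samples;  /* number of samples seen so far */
} running_mean_t;

/* Hypothetical helper: feed one ADC sample, get the updated mean back. */
static float running_mean_update(running_mean_t *rm, uint16_t sample)
{
    rm->samples++;
    rm->average += ((float)sample - rm->average) / (float)rm->samples;
    return rm->average;
}
```

Start with both fields set to 0 and call running_mean_update() once per adc_read() result.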
You can use an exponential moving average which only needs 1 memory unit.
y[0] = (x[0] + y[-1] * (a-1) )/a
Where a is the filter factor.
If a is a power of 2, say a = 2^k, you can replace the multiplication and the division with shifts by k, which is significantly faster:
y[0] = ( x[0] + ( ( y[-1] << k ) - y[-1] ) ) >> k
This works especially well with left-aligned ADCs. Just keep an eye on the word size of the shift result.
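Here is a small sketch of the shift-based version, assuming 16-bit samples and a filter factor of 2^3 = 8. EMA_SHIFT and ema_update are names invented for this example. The intermediate value is widened to 32 bits so the shifted term cannot overflow.

```c
#include <stdint.h>

#define EMA_SHIFT 3u   /* assumed filter factor: 2^3 = 8 */

/* y = (x + y_prev * (factor - 1)) / factor, with factor = 2^EMA_SHIFT */
static uint16_t ema_update(uint16_t y_prev, uint16_t x)
{
    /* (y_prev << EMA_SHIFT) - y_prev equals y_prev * (factor - 1) */
    uint32_t acc = (uint32_t)x + (((uint32_t)y_prev << EMA_SHIFT) - y_prev);
    return (uint16_t)(acc >> EMA_SHIFT);
}
```

Because the final shift truncates, the output can settle slightly below a constant input (for example, y_prev = 993 and x = 1000 gives 993 again). Keeping the accumulator in its scaled form between calls is one way around that.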
I have an array of length n filled with 16-bit (int16) raw PCM data at a 44100 Hz sample rate,
in stereo, so the array holds 2 bytes for the left channel, then 2 bytes for the right channel, and so on. I tried to implement a simple low-pass filter by converting the samples to floating point in the range -1 to 1. The low-pass works, but rounding errors cause little pops in the sound.
Right now I simply do this:
INT32 left_id = 0;
INT32 right_id = 1;
DOUBLE filtered_l_db = 0.0;
DOUBLE filtered_r_db = 0.0;
DOUBLE last_filtered_left = 0;
DOUBLE last_filtered_right = 0;
DOUBLE l_db = 0.0;
DOUBLE r_db = 0.0;
DOUBLE low_filter = filter_freq(core->audio->low_pass_cut);
for(UINT32 a = 0; a < (buffer_size/2);++a)
{
l_db = ((DOUBLE)input_buffer[left_id]) / (DOUBLE)32768;
r_db = ((DOUBLE)input_buffer[right_id]) / (DOUBLE)32768;
///////////////LOW PASS
filtered_l_db = last_filtered_left +
(low_filter * (l_db -last_filtered_left ));
filtered_r_db = last_filtered_right +
(low_filter * (r_db - last_filtered_right));
last_filtered_left = filtered_l_db;
last_filtered_right = filtered_r_db;
INT16 l = (INT16)(filtered_l_db * (DOUBLE)32768);
INT16 r = (INT16)(filtered_r_db * (DOUBLE)32768);
output_buffer[left_id] = (output_buffer[left_id] + l);
output_buffer[right_id] = (output_buffer[right_id] + r);
left_id +=2;
right_id +=2;
}
PS: the input buffer is an int16 array with PCM data from -32767 to 32767.
I found this function in the question
Low Pass filter in C
and it was the only one that I could understand:
DOUBLE filter_freq(DOUBLE cut_freq)
{
DOUBLE a = 1.0/(cut_freq * 2 * PI);
DOUBLE b = 1.0/SAMPLE_RATE;
return b/(a+b);
}
My aim instead is to keep full precision on the waveform and to low-pass directly using only integers,
at the cost of losing resolution in the filter (I'm OK with that). I have seen a lot of examples, but I did not really understand any of them. Would someone be kind enough to explain how this is done, as simply as possible (in code or pseudocode)? Thank you.
Assuming the result of function filter_freq can be written as a fraction m/n your filter calculation basically is
y_new = y_old + (m/n) * (x - y_old);
which can be transformed to
y_new = ((n * y_old) + m * (x - y_old)) / n;
The integer division / n truncates the result towards 0. If you want rounding instead of truncation you can implement it as
y_tmp = ((n * y_old) + m * (x - y_old));
if(y_tmp < 0) y_tmp -= (n / 2);
else y_tmp += (n / 2);
y_new = y_tmp / n;
In order to avoid losing precision from dividing the result by n in one step and multiplying it by n in the next step you can save the value y_tmp before the division and use it in the next cycle.
y_tmp = (y_tmp + m * (x - y_old));
if(y_tmp < 0) y_new = y_tmp - (n / 2);
else y_new = y_tmp + (n / 2);
y_new /= n;
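If your input data is int16_t I suggest implementing the calculation with at least int32_t to avoid overflows. Note that with a large n (for example 2^16) the product m * (x - y_old) can itself need more than 32 bits, so a smaller n or int64_t intermediates are safer.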
I tried to convert the filter in your code without checking other parts for possible problems.
INT32 left_id = 0;
INT32 right_id = 1;
int32_t filtered_l_out = 0; // output value after division
int32_t filtered_r_out = 0;
int32_t filtered_l_tmp = 0; // used to keep the output value before division
int32_t filtered_r_tmp = 0;
int32_t l_in = 0; // input value
int32_t r_in = 0;
DOUBLE low_filter = filter_freq(core->audio->low_pass_cut);
// define denominator and calculate numerator
// use power of 2 to allow bit-shift instead of division
const uint32_t filter_shift = 16U;
const int32_t filter_n = 1U << filter_shift;
int32_t filter_m = (int32_t)(low_filter * filter_n);
for(UINT32 a = 0; a < (buffer_size/2);++a)
{
l_in = input_buffer[left_id];
r_in = input_buffer[right_id];
///////////////LOW PASS
filtered_l_tmp = filtered_l_tmp + filter_m * (l_in - filtered_l_out);
// round to nearest: offset by half the denominator, away from zero
if(filtered_l_tmp < 0) {
filtered_l_out = filtered_l_tmp - filter_n/2;
} else {
filtered_l_out = filtered_l_tmp + filter_n/2;
}
filtered_l_out /= filter_n;
//filtered_l_out >>= filter_shift; // shift alternative; rounds negative values differently than /
/* same calculation for right */
INT16 l = (INT16)(filtered_l_out);
INT16 r = (INT16)(filtered_r_out);
output_buffer[left_id] = (output_buffer[left_id] + l);
output_buffer[right_id] = (output_buffer[right_id] + r);
left_id +=2;
right_id +=2;
}
As your filter is initialized with 0 it may need several samples to follow a possible step to the first input value. Depending on your data it might be better to initialize the filter based on the first input value.
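A self-contained sketch of the same idea for one channel is shown below. It assumes a denominator of 2^16, keeps the scaled state in an int64_t so the intermediate product cannot overflow, and seeds the filter with the first input sample. The names lpf_t, lpf_init and lpf_step are invented for this example.

```c
#include <stdint.h>

#define LPF_SHIFT 16                          /* denominator n = 2^16 */
#define LPF_N     ((int64_t)1 << LPF_SHIFT)

typedef struct {
    int64_t acc;     /* filter state scaled by n (y_tmp in the text above) */
    int32_t out;     /* last rounded output (y_old) */
    int32_t m;       /* numerator m, so the coefficient is m / n */
    int     primed;  /* 0 until the first sample has been seen */
} lpf_t;

/* coeff is the floating-point coefficient in the range 0..1 */
static void lpf_init(lpf_t *f, double coeff)
{
    f->acc = 0;
    f->out = 0;
    f->m = (int32_t)(coeff * (double)LPF_N);
    f->primed = 0;
}

static int16_t lpf_step(lpf_t *f, int16_t x)
{
    if (!f->primed) {                 /* start at the first input value */
        f->acc = (int64_t)x * LPF_N;
        f->out = x;
        f->primed = 1;
    }
    f->acc += (int64_t)f->m * ((int32_t)x - f->out);
    /* round to nearest before the truncating division */
    int64_t rounded = f->acc + (f->acc < 0 ? -(LPF_N / 2) : (LPF_N / 2));
    f->out = (int32_t)(rounded / LPF_N);
    return (int16_t)f->out;
}
```

For stereo data you would keep one lpf_t per channel and call lpf_init() with the value returned by filter_freq().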
I am trying to create a modulated waveform out of 2 sine waves.
To do this I need the modulo(fmodf) to know what amplitude a sine with a specific frequency(lo_frequency) has at that time(t). But I get a hardfault when the following line is executed:
j = fmodf(2 * PI * lo_frequency * t, 2 * PI);
Do you have an idea why this gives me a hardfault ?
Edit 1:
I exchanged fmodf with my_fmodf:
float my_fmodf(float x, float y){
if(y == 0){
return 0;
}
float n = x / y;
return x - n * y;
}
But still the hardfault occurs, and when I debug it it doesn't even jump into this function(my_fmodf).
Here's the whole function in which this error occurs:
int* create_wave(int* message){
/* Mixes the message signal at 10kHz and the carrier at 40kHz.
* When a bit of the message is 0 the amplitude is lowered to 10%.
* When a bit of the message is 1 the amplitude is 100%.
* The output of the STM32 can't be negative, thats why the wave swings between
* 0 and 256 (8bit precision for faster DAC)
*/
static int rf_frequency = 10000;
static int lo_frequency = 40000;
static int sample_rate = 100000;
int output[sample_rate];
int index, mix;
float j, t;
for(int i = 0; i <= sample_rate; i++){
t = i * 0.00000001f; // i * 10^-8
j = my_fmodf(2 * PI * lo_frequency * t, 2 * PI);
if (j < 0){
j += (float) 2 * PI;
}
index = floor((16.0f / (lo_frequency/rf_frequency * 0.0001f)) * t);
if (index < 16) {
if (!message[index]) {
mix = 115 + sin1(j) * 0.1f;
} else {
mix = sin1(j);
}
} else {
break;
}
output[i] = mix;
}
return output;
}
Edit 2:
I fixed the warning: function returns address of local variable [-Wreturn-local-addr] the way "chux - Reinstate Monica" suggested.
int* create_wave(int* message){
static uint16_t rf_frequency = 10000;
static uint32_t lo_frequency = 40000;
static uint32_t sample_rate = 100000;
int *output = malloc(sizeof *output * sample_rate);
uint8_t index, mix;
float j, n, t;
for(int i = 0; i < sample_rate; i++){
t = i * 0.00000001f; // i * 10^-8
j = fmodf(2 * PI * lo_frequency * t, 2 * PI);
if (j < 0){
j += 2 * PI;
}
index = floor((16.0f / (lo_frequency/rf_frequency * 0.0001f)) * t);
if (index < 16) {
if (!message[index]) {
mix = (uint8_t) floor(115 + sin1(j) * 0.1f);
} else {
mix = sin1(j);
}
} else {
break;
}
output[i] = mix;
}
return output;
}
But now I get the hardfault on this line:
output[i] = mix;
EDIT 3:
Because the previous code contained a very large buffer array that did not fit into the 16KB SRAM of the STM32F303K8 I needed to change it.
Now I use a "ping-pong" buffer, using the DMA callbacks for "first half transmitted" and "completely transmitted":
void HAL_DAC_ConvHalfCpltCallbackCh1(DAC_HandleTypeDef * hdac){
HAL_GPIO_WritePin(GPIOB, GPIO_PIN_3, GPIO_PIN_SET);
for(uint16_t i = 0; i < 128; i++){
new_value = sin_table[(i * 8) % 256];
if (message[message_index] == 0x0){
dac_buf[i] = new_value * 0.1f + 115;
} else {
dac_buf[i] = new_value;
}
}
}
void HAL_DAC_ConvCpltCallbackCh1 (DAC_HandleTypeDef * hdac){
HAL_GPIO_WritePin(GPIOB, GPIO_PIN_3, GPIO_PIN_RESET);
for(uint16_t i = 128; i < 256; i++){
new_value = sin_table[(i * 8) % 256];
if (message[message_index] == 0x0){
dac_buf[i] = new_value * 0.1f + 115;
} else {
dac_buf[i] = new_value;
}
}
message_index++;
if (message_index >= 16) {
message_index = 0;
// HAL_DAC_Stop_DMA (&hdac1, DAC_CHANNEL_1);
}
}
And it works the way I wanted.
But the frequency of the created sine is too low.
I cap at around 20kHz but I'd need 40kHz.
I already increased the clock by a factor of 8, so that one is maxed out.
I can still decrease the counter period (it is 50 at the moment), but when I do so the interrupt callback seems to take longer than the period to the next one.
At least it seems so as the output becomes very distorted when I do that.
I also tried to decrease the precision by taking only every 8th sine value, but
I can't go any further because then the output does not look like a sine wave anymore.
Any ideas how I could optimize the callback so that it takes less time ?
Any other ideas ?
Does fmodf() cause a hardfault in stm32?
Other problems in the code are causing the hard fault here.
Failing to compile with ample warnings
Best code tip: enable all warnings. #KamilCuk
Faster feedback than Stackoverflow.
I'd expect something like below on a well enabled compiler.
return output;
warning: function returns address of local variable [-Wreturn-local-addr]
Returning a local Object
Cannot return a local array. Allocate instead.
// int output[sample_rate];
int *output = malloc(sizeof *output * sample_rate);
return output;
Calling code will need to free() the pointer.
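A sketch of the allocate, check, use, free pattern is below. create_samples is a hypothetical helper, not the asker's function, and the fill value is a placeholder. malloc() can return NULL when the request does not fit in the heap, which is to be expected for 100000 int values on a part with 16 KB of SRAM, so the result has to be checked before writing to it.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: allocate and fill a sample buffer, or return NULL. */
static uint8_t *create_samples(size_t count)
{
    uint8_t *output = malloc(count * sizeof *output);
    if (output == NULL) {
        return NULL;                       /* not enough heap: caller must handle this */
    }
    for (size_t i = 0; i < count; i++) {   /* i < count, never i <= count */
        output[i] = (uint8_t)(i & 0xFFu);  /* placeholder sample value */
    }
    return output;
}
```

The caller checks the returned pointer against NULL before using it and calls free() on it when done.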
Out of range array access
static int sample_rate = 100000;
int output[sample_rate];
// for(int i = 0; i <= sample_rate; i++){
for(int i = 0; i < sample_rate; i++){
...
output[i] = mix;
}
Stack overflow?
static int sample_rate = 100000; int output[sample_rate]; is a large local variable. Maybe allocate or try something smaller?
Advanced: loss of precision
A good fmodf() does not lose precision. For a more precise answer consider double math for the intermediate results. An even better approach is more involved.
float my_fmodf(float x, float y){
if(y == 0){
return 0;
}
// trunc() from <math.h> drops the fractional part of the quotient
double n = trunc(1.0 * x / y);
return (float) (x - n * y);
}
Can I not use any function within another?
Yes, you can. The code has other issues.
One value every 10 µs is only 100 kSPS, which is not much for this micro. In my designs I generate > 5 MSPS signals without any problems. Usually I have one buffer and DMA in circular mode. First I fill the buffer and start generation. When the half-transfer DMA interrupt is triggered, I fill the first half of the buffer with fresh data. When the transfer-complete interrupt is triggered, I fill the second half, and this process repeats all over again.
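The refill scheme described above can be sketched without any vendor library. Everything here is an assumption made for illustration: the buffer size, the names on_half_transfer and on_transfer_complete, and the ramp used as a stand-in waveform. On an STM32 with HAL, these two functions would be called from the half-complete and complete DMA callbacks.

```c
#include <stddef.h>
#include <stdint.h>

#define BUF_LEN  256u
#define HALF_LEN (BUF_LEN / 2u)

static uint16_t dac_buf[BUF_LEN];  /* buffer the DMA reads in circular mode */
static uint32_t phase;             /* running sample counter */

/* Hypothetical generator: a simple ramp stands in for the real waveform. */
static uint16_t next_sample(void)
{
    return (uint16_t)(phase++ & 0xFFu);
}

static void fill_half(size_t first)
{
    for (size_t i = first; i < first + HALF_LEN; i++) {
        dac_buf[i] = next_sample();
    }
}

/* Half-transfer interrupt: the DMA is now reading the second half,
 * so the first half is free to be rewritten. */
void on_half_transfer(void)
{
    fill_half(0);
}

/* Transfer-complete interrupt: the DMA wraps around to the first half,
 * so the second half is free to be rewritten. */
void on_transfer_complete(void)
{
    fill_half(HALF_LEN);
}
```

The work done per interrupt is one half buffer, so a larger buffer means fewer interrupts for the same output rate.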
I am trying to perform the following calculation using an ATmega328P MCU.
position = \frac{1000 \cdot p_0 + 2000 \cdot p_1 + \cdots + 8000 \cdot p_7}{p_0 + p_1 + \cdots + p_7}
(p_i is the value for channel i.)
In the main routine (as shown here):
int main(void)
{
//variables
uint16_t raw_values[8];
uint16_t position = 0;
uint16_t positions[8];
char raw[] = " raw";
char space[] = ", ";
char channelString[] = "Channel#: ";
char positionString[] = "Position: ";
//initialize ADC (Analog)
initADC();
//initialize UART
initUART(BAUD, DOUBLE_SPEED);
//give time for ADC to perform & finish 1st conversion
//8us x 25 = 200us
delay_us(200);
while(1)
{
//get the raw values from the ADC for each channel
for(uint8_t channel = 0; channel < 8; channel++)
{
raw_values[channel] = analog(channel);
//invert the raw value
raw_values[channel] = DIVISOR - raw_values[channel];
}
for(uint8_t channel = 0; channel < 8; channel++)
{
//print the channel#
transmitString(channelString);
printDec16bit(channel);
transmitString(space);
//print the raw value from the ADC conversion
printDec16bit(raw_values[channel]);
transmitString(raw);
transmitString(space);
//calculate the position value at each sensor
transmitString(positionString);
positions[channel] = (uint16_t)((POSITION_REF/DIVISOR) * raw_values[channel]);
printDec16bit(positions[channel]);
printCR();
}
printCR();
//calculate and display 'position'
position = calculatePosition(positions);
printDec16bit(position);
printCR();
printCR();
//add a delay
delay_ms(2000);
}
}
I am calling the following function, but the return value I am getting is way off.
uint16_t calculatePosition(uint16_t* channel_positions)
{
uint32_t intermediates[8];
uint32_t temp_sum = 0;
uint16_t divisor = 0;
uint16_t value = 0;
for(uint8_t i = 0; i < 8; i++)
{
intermediates[i] = channel_positions[i] * ((i + 1) * 1000);
}
for(uint8_t j = 0; j < 8; j++)
{
temp_sum = temp_sum + intermediates[j];
}
for(uint8_t k = 0; k < 8; k++)
{
divisor = divisor + channel_positions[k];
}
value = temp_sum/divisor;
return value;
}
Alternatively, I have even tried this code, and get a result that is not what I expect.
uint16_t calculatePosition(uint16_t* channel_positions)
{
uint16_t position;
position = ((1000 * channel_positions[0]) +
(2000 * channel_positions[1]) +
(3000 * channel_positions[2]) +
(4000 * channel_positions[3]) +
(5000 * channel_positions[4]) +
(6000 * channel_positions[5]) +
(7000 * channel_positions[6]) +
(8000 * channel_positions[7])) /
(channel_positions[0] +
channel_positions[1] +
channel_positions[2] +
channel_positions[3] +
channel_positions[4] +
channel_positions[5] +
channel_positions[6] +
channel_positions[7]);
return position;
}
What could I be doing wrong? For an array of values such as {15, 12, 5, 16, 11, 35, 964, 76} I expect a result of 6504, but instead I get a value in the 200's (or some other weird value).
Look at your input array: {15, 12, 5, 16, 11, 35, 964, 76}
Specifically, look at the element that is 964. That element times 7000 is 6748000 which is greater than a uint16_t can handle.
There are a number of solutions. One of them is changing to uint32_t. If this is not an option, you could extract a factor of 1000, like this:
position = 1000 *(
((1 * channel_positions[0]) +
(2 * channel_positions[1]) +
(3 * channel_positions[2]) +
(4 * channel_positions[3]) +
(5 * channel_positions[4]) +
(6 * channel_positions[5]) +
(7 * channel_positions[6]) +
(8 * channel_positions[7])) /
(channel_positions[0] +
channel_positions[1] +
channel_positions[2] +
channel_positions[3] +
channel_positions[4] +
channel_positions[5] +
channel_positions[6] +
channel_positions[7]));
Note that this will not eliminate the problem, but it could possibly reduce it so that the problem never occurs for reasonable input.
Taking the same idea to the loop version, we get:
uint16_t calculatePosition(uint16_t* channel_positions)
{
uint16_t temp_sum = 0;
uint16_t divisor = 0;
for(uint8_t i = 0; i < 8; i++) {
temp_sum += (channel_positions[i] * (i+1));
divisor += channel_positions[i];
}
return 1000*(temp_sum/divisor);
}
Note that you will lose some accuracy in the process due to rounding with integer division. Since you have been very careful with specifying the width, I assume you're not willing to change the type of the input array. This code should give you maximum accuracy with minimal extra memory usage. But if you're running this function often on a 16-bit machine it can impact performance quite a bit.
uint16_t calculatePosition(uint16_t* channel_positions)
{
// Use 32 bit for these
uint32_t temp_sum = 0;
uint32_t divisor = 0;
for(uint8_t i = 0; i < 8; i++) {
// Copy the value to a 32 bit number
uint32_t temp_pos = channel_positions[i];
temp_sum += temp_pos * (i+1);
divisor += temp_pos;
}
// Moved parenthesis for better accuracy
return (1000*temp_sum) / divisor;
}
Provided that the result fits in a uint16_t and the divisor is not zero, this version cannot overflow: the biggest possible value of 1000*temp_sum is 2,359,260,000 (1000 * 36 * 65535), and the biggest value a uint32_t can hold is 4,294,967,295.
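For completeness, here is a variant of the 32-bit version that also guards against a zero divisor. Returning 0 in that case is an assumption made for this sketch, as are the names calculate_position and NUM_CHANNELS.

```c
#include <stdint.h>

#define NUM_CHANNELS 8u

/* Weighted position; returns 0 when all readings are zero (assumed fallback). */
static uint16_t calculate_position(const uint16_t *channel_positions)
{
    uint32_t weighted_sum = 0;
    uint32_t divisor = 0;

    for (uint8_t i = 0; i < NUM_CHANNELS; i++) {
        uint32_t pos = channel_positions[i];  /* widen before multiplying */
        weighted_sum += pos * (i + 1u);
        divisor += pos;
    }
    if (divisor == 0) {
        return 0;                             /* avoid division by zero */
    }
    return (uint16_t)((1000u * weighted_sum) / divisor);
}
```

For the sample array {15, 12, 5, 16, 11, 35, 964, 76} this evaluates 7739000 / 1134 and returns 6824 after truncation.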
Sidenote about MRE (minimal, reproducible example)
MREs are described here: https://stackoverflow.com/help/minimal-reproducible-example
In this example, a good main function to post in the question would be:
#include <stdio.h>
#include <stdint.h>
int main()
{
uint16_t positions[] = {15, 12, 5, 16, 11, 35, 964, 76};
uint16_t pos = calculatePosition(positions);
printf("%d\n", pos);
}
It's enough to demonstrate the problem you had and no more.
As already said, the problem is integer overflow.
Be careful when moving the multiplier outside when using integer math: (A * 1000) / B is not equal to (A / B) * 1000.
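A tiny sketch makes the difference concrete. The function names are made up for the example, and 32-bit operands are used so neither version overflows for the values shown.

```c
#include <stdint.h>

/* Multiply first: the division truncates only once, at the end. */
static uint32_t scale_then_divide(uint32_t a, uint32_t b)
{
    return (a * 1000u) / b;
}

/* Divide first: the fractional part of a / b is lost before scaling. */
static uint32_t divide_then_scale(uint32_t a, uint32_t b)
{
    return (a / b) * 1000u;
}
```

With a = 7739 and b = 1134 (the weighted sum and the plain sum of the sample array from the question), the first version gives 6824 and the second gives 6000.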
The simplest solution is to convert the first operand of each operation to a wider type; the others are then converted implicitly. E.g.:
...
position = ((1000UL * channel_positions[0]) +
(2000UL * channel_positions[1]) +
(3000UL * channel_positions[2]) +
(4000UL * channel_positions[3]) +
(5000UL * channel_positions[4]) +
(6000UL * channel_positions[5]) +
(7000UL * channel_positions[6]) +
(8000UL * channel_positions[7])) /
((uint32_t)channel_positions[0] +
channel_positions[1] + // no need to convert, it will be converted implicitly
channel_positions[2] + // since previous operand is wider
channel_positions[3] +
channel_positions[4] +
channel_positions[5] +
channel_positions[6] +
channel_positions[7]);
I came across a problem with converting a double to ASCII. After searching, I found Florian's paper "Printing Floating-Point Numbers Quickly and Accurately with Integers"; the Grisu2 algorithm is really awesome and much faster. I understand the idea behind Grisu2, but I don't know how to implement it, so I got Florian's C implementation. It's a little complicated for me, and I still don't really understand two functions: cached_power and digit_gen. Could anyone who knows Grisu2 help me?
The comments show my questions.
// cached_power function:
static const uint64_t powers_ten[] = {0xbf29dcaba82fdeae , 0xeef453d6923bd65a,...};
//how do these numbers precomputed
static const int powers_ten_e[] = {-1203 , -1200 , -1196 , -1193 , -1190 , ...};//and what do they mean?
static diy_fp_t cached_power(int k)
{//does this function mean give k and return the normalized 10^k diy_fp_t?
diy_fp_t res;
int index = 343 + k;//why add 343?
res.f = powers_ten[index];
res.e = powers_ten_e[index];
return res;
}
This one is more complicated:
void digit_gen(diy_fp_t Mp, diy_fp_t delta,//Is Mp normalized?
char* buffer, int* len, int* K)
{
uint32_t div; int d, kappa; diy_fp_t one;
one.f = ((uint64_t)1) << -Mp.e; one.e = Mp.e;//what if Mp.e is positive? what's the purpose of one?
uint32_t p1 = Mp.f >> -one.e; /// Mp_cut// what does p1 mean?
uint64_t p2 = Mp.f & (one.f - 1);//what does p2 mean?
*len = 0; kappa = 3; div = TEN2;//why kappa=3 and div=100? is kappa related to div?
while (kappa > 0)
{ /// Mp_inv1 //what does this loop mean?
d = p1 / div;
if (d || *len) buffer[(*len)++] = '0' + d;
p1 %= div; kappa--; div /= 10;
if ((((uint64_t)p1) << -one.e) + p2 <= delta.f)
{ /// Mp_delta
*K += kappa; return;
}
}
do
{ //what does this loop mean?
p2 *= 10;
d = p2 >> -one.e;
if (d || *len) buffer[(*len)++] = '0' + d; /// Mp_inv2
p2 &= one.f - 1; kappa--; delta.f *= 10;// p2&=one.f-1 means what?
} while (p2 > delta.f);
*K += kappa;
}
The first part:
diy_fp_t is a floating-point struct with mantissa and exponent as separate members (not very interesting, but it's here: https://github.com/miloyip/dtoa-benchmark/blob/master/src/grisu/diy_fp.h).
The purpose of cached_power(k) is to return the value of 10^k as a diy_fp_t. Because computing that is neither trivial nor fast, the author uses arrays (one for the mantissa, one for the exponent) of precalculated values (as accurate as possible) of the necessary powers (Grisu won't use any powers other than these; an explanation is in the paper, chapters 4 and 5).
The array in the example code begins with the value for 10^(-343), which is 0xbf29dcaba82fdeae * 2^(-1203) = 13774783565108600494 * 2^(-1203). 10^(-342) belongs to the next array position, and so on. And because 10^(-343) has the array index [0], 343 is added to k to get the index.
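To make the index mapping concrete, here is a sketch that uses only the two table entries visible in the question (the real table continues). FIRST_POWER and diy_fp_ratio are names invented for this example. The ratio helper is just a sanity check that neighbouring entries differ by a factor of 10; it works on the ratio because 2^(-1203) itself is too small for a double.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint64_t f; int e; } diy_fp_t;  /* value = f * 2^e */

#define FIRST_POWER (-343)  /* decimal exponent stored at index 0 */

/* Only the two entries shown in the question; the full table is longer. */
static const uint64_t powers_ten[]   = { 0xbf29dcaba82fdeaeULL, 0xeef453d6923bd65aULL };
static const int      powers_ten_e[] = { -1203, -1200 };

static diy_fp_t cached_power(int k)
{
    int index = k - FIRST_POWER;       /* same as 343 + k */
    assert(index >= 0 && index < 2);   /* limited to this truncated table */
    diy_fp_t res = { powers_ten[index], powers_ten_e[index] };
    return res;
}

/* a / b as a double, computed from mantissas and the exponent difference. */
static double diy_fp_ratio(diy_fp_t a, diy_fp_t b)
{
    double r = (double)a.f / (double)b.f;
    int shift = a.e - b.e;
    while (shift > 0) { r *= 2.0; shift--; }
    while (shift < 0) { r /= 2.0; shift++; }
    return r;
}
```

diy_fp_ratio(cached_power(-342), cached_power(-343)) comes out very close to 10, as expected for consecutive powers of ten.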
Can anyone spot any way to improve the speed of the following bilinear resizing algorithm?
Speed is critical, and I need to improve it while keeping good image quality. The code is expected to run on mobile devices with slow CPUs.
The algorithm is used mainly for upscaling. Any other, faster bilinear algorithm would also be appreciated. Thanks.
void resize(int* input, int* output, int sourceWidth, int sourceHeight, int targetWidth, int targetHeight)
{
int a, b, c, d, x, y, index;
float x_ratio = ((float)(sourceWidth - 1)) / targetWidth;
float y_ratio = ((float)(sourceHeight - 1)) / targetHeight;
float x_diff, y_diff, blue, red, green ;
int offset = 0 ;
for (int i = 0; i < targetHeight; i++)
{
for (int j = 0; j < targetWidth; j++)
{
x = (int)(x_ratio * j) ;
y = (int)(y_ratio * i) ;
x_diff = (x_ratio * j) - x ;
y_diff = (y_ratio * i) - y ;
index = (y * sourceWidth + x) ;
a = input[index] ;
b = input[index + 1] ;
c = input[index + sourceWidth] ;
d = input[index + sourceWidth + 1] ;
// blue element
blue = (a&0xff)*(1-x_diff)*(1-y_diff) + (b&0xff)*(x_diff)*(1-y_diff) +
(c&0xff)*(y_diff)*(1-x_diff) + (d&0xff)*(x_diff*y_diff);
// green element
green = ((a>>8)&0xff)*(1-x_diff)*(1-y_diff) + ((b>>8)&0xff)*(x_diff)*(1-y_diff) +
((c>>8)&0xff)*(y_diff)*(1-x_diff) + ((d>>8)&0xff)*(x_diff*y_diff);
// red element
red = ((a>>16)&0xff)*(1-x_diff)*(1-y_diff) + ((b>>16)&0xff)*(x_diff)*(1-y_diff) +
((c>>16)&0xff)*(y_diff)*(1-x_diff) + ((d>>16)&0xff)*(x_diff*y_diff);
output [offset++] =
0x000000ff | // alpha
((((int)red) << 24)&0xff0000) |
((((int)green) << 16)&0xff00) |
((((int)blue) << 8)&0xff00);
}
}
}
Off the top of my head:
Stop using floating-point, unless you're certain your target CPU has it in hardware with good performance.
Make sure memory accesses are cache-optimized, i.e. clumped together.
Use the fastest data types possible. Sometimes this means smallest, sometimes it means "most native, requiring least overhead".
Investigate if signed/unsigned for integer operations have performance costs on your platform.
Investigate if look-up tables rather than computations gain you anything (but these can blow the caches, so be careful).
And, of course, do lots of profiling and measurements.
In-Line Cache and Lookup Tables
Cache your computations in your algorithm.
Avoid duplicate computations (like (1-y_diff) or (x_ratio * j))
Go through all the lines of your algorithm, and try to identify patterns of repetitions. Extract these to local variables. And possibly extract to functions, if they are short enough to be inlined, to make things more readable.
Use a lookup-table
It's quite likely that, if you can spare some memory, you can implement a "store" for your RGB values and simply "fetch" them based on the inputs that produced them. Maybe you don't need to store all of them, but you could experiment and see if some come back often. Alternatively, you could "fudge" your colors and thus end up with fewer values to store for more lookup inputs.
If you know the boundaries for your inputs, you can calculate the complete domain space and figure out what makes sense to cache. For instance, if you can't cache the whole R, G, B values, maybe you can at least pre-compute the shifts ((b>>16) and so forth) that are most likely deterministic in your case.
Use the Right Data Types for Performance
If you can avoid double and float variables, use int. On most architectures, int would be the faster type for computations because of the memory model. You can still achieve decent precision by simply shifting your units (i.e. use 1026 as int instead of 1.026 as double or float). It's quite likely that this trick would be enough for you.
x = (int)(x_ratio * j) ;
y = (int)(y_ratio * i) ;
x_diff = (x_ratio * j) - x ;
y_diff = (y_ratio * i) - y ;
index = (y * sourceWidth + x) ;
This could surely use some optimization: you computed x_ratio * (j-1) just one iteration earlier, so all you really need here is to add x_ratio to a running value (x += x_ratio) instead of multiplying again.
My random guess (use a profiler instead of letting people guess!):
The compiler has to generate code that works when input and output overlap, which means it has to generate lots of redundant stores and loads. Add restrict to the input and output parameters to remove that safety feature.
You could also try using a=b; and c=d; instead of loading them again.
Here is my version; steal some ideas. My C-fu is quite weak, so some lines are pseudocode, but you can fix them.
void resize(int* input, int* output,
int sourceWidth, int sourceHeight,
int targetWidth, int targetHeight
) {
// Let's create some lookup tables!
// you can move them into 2-dimensional arrays to
// group together values used at the same time to help processor cache
int sx[0..targetWidth ]; // target->source X lookup
int sy[0..targetHeight]; // target->source Y lookup
int mx[0..targetWidth ]; // left pixel's multiplier
int my[0..targetHeight]; // bottom pixel's multiplier
// we don't have to calc indexes every time, find out when
bool reloadPixels[0..targetWidth ];
bool shiftPixels[0..targetWidth ];
int shiftReloadPixels[0..targetWidth ]; // can be combined if necessary
int v; // temporary value
for (int j = 0; j < targetWidth; j++){
// (8bit + targetBits + sourceBits) should be < max int
v = 256 * j * (sourceWidth-1) / (targetWidth-1);
sx[j] = v / 256;
mx[j] = v % 256;
reloadPixels[j] = j ? ( sx[j-1] != sx[j] ? 1 : 0)
: 1; // always load first pixel
// if no reload -> then no shift too
shiftPixels[j] = j ? ( sx[j-1]+1 == sx[j] ? 2 : 0)
: 0; // nothing to shift at first pixel
shiftReloadPixels[j] = reloadPixels[j] | shiftPixels[j];
}
for (int i = 0; i < targetHeight; i++){
v = 256 * i * (sourceHeight-1) / (targetHeight-1);
sy[i] = v / 256;
my[i] = v % 256;
}
int shiftReload;
int srcIndex;
int srcRowIndex;
int offset = 0;
int lm, rm, tm, bm; // left / right / top / bottom multipliers
int a, b, c, d;
for (int i = 0; i < targetHeight; i++){
srcRowIndex = sy[ i ] * sourceWidth;
tm = my[i];
bm = 255 - tm;
for (int j = 0; j < targetWidth; j++){
// too much ifs can be too slow, measure.
// always true for first pixel in a row
if( shiftReload = shiftReloadPixels[ j ] ){
srcIndex = srcRowIndex + sx[j];
if( shiftReload & 2 ){
a = b;
c = d;
}else{
a = input[ srcIndex ];
c = input[ srcIndex + sourceWidth ];
}
b = input[ srcIndex + 1 ];
d = input[ srcIndex + 1 + sourceWidth ];
}
lm = mx[j];
rm = 255 - lm;
// WTF?
// Input AA RR GG BB
// Output RR GG BB AA
if( j ){
leftOutput = rightOutput ^ 0xFFFFFF00;
}else{
leftOutput =
// blue element
((( ( (a&0xFF)*tm
+ (c&0xFF)*bm )*lm
) & 0xFF0000 ) >> 8)
// green element
| ((( ( ((a>>8)&0xFF)*tm
+ ((c>>8)&0xFF)*bm )*lm
) & 0xFF0000 )) // no need to shift
// red element
| ((( ( ((a>>16)&0xFF)*tm
+ ((c>>16)&0xFF)*bm )*lm
) & 0xFF0000 ) << 8 )
;
}
rightOutput =
// blue element
((( ( (b&0xFF)*tm
+ (d&0xFF)*bm )*lm
) & 0xFF0000 ) >> 8)
// green element
| ((( ( ((b>>8)&0xFF)*tm
+ ((d>>8)&0xFF)*bm )*lm
) & 0xFF0000 )) // no need to shift
// red element
| ((( ( ((b>>16)&0xFF)*tm
+ ((d>>16)&0xFF)*bm )*lm
) & 0xFF0000 ) << 8 )
;
output[offset++] =
// alpha
0x000000ff
| leftOutput
| rightOutput
;
}
}
}
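Putting several of the suggestions above together (no floating point, running offsets instead of a multiplication per pixel, restrict on the pointers), a compilable fixed-point sketch could look like the following. It is not the code from any of the answers. It assumes 0xAARRGGBB pixels stored as uint32_t, source dimensions between 2 and 65535, and it forces the output alpha to opaque; resize_fixed and blend_channel are names invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Interpolate one 8-bit channel (selected by shift) between four neighbours.
 * fx and fy are fractional positions in 1/256 steps (0..255). */
static uint32_t blend_channel(uint32_t a, uint32_t b, uint32_t c, uint32_t d,
                              uint32_t fx, uint32_t fy, unsigned shift)
{
    uint32_t top    = ((a >> shift) & 0xFFu) * (256u - fx) + ((b >> shift) & 0xFFu) * fx;
    uint32_t bottom = ((c >> shift) & 0xFFu) * (256u - fx) + ((d >> shift) & 0xFFu) * fx;
    return (top * (256u - fy) + bottom * fy) >> 16;  /* back to 0..255 */
}

void resize_fixed(const uint32_t *restrict input, uint32_t *restrict output,
                  int sourceWidth, int sourceHeight,
                  int targetWidth, int targetHeight)
{
    assert(sourceWidth >= 2 && sourceWidth < 65536);
    assert(sourceHeight >= 2 && sourceHeight < 65536);
    assert(targetWidth > 0 && targetHeight > 0);

    /* 16.16 fixed-point steps, same ratios as the floating-point version */
    uint32_t x_step = ((uint32_t)(sourceWidth - 1) << 16) / (uint32_t)targetWidth;
    uint32_t y_step = ((uint32_t)(sourceHeight - 1) << 16) / (uint32_t)targetHeight;

    uint32_t sy = 0;
    for (int i = 0; i < targetHeight; i++, sy += y_step) {
        const uint32_t *row = input + (size_t)(sy >> 16) * (size_t)sourceWidth;
        uint32_t fy = (sy >> 8) & 0xFFu;   /* top 8 bits of the fraction */
        uint32_t sx = 0;
        for (int j = 0; j < targetWidth; j++, sx += x_step) {
            const uint32_t *p = row + (sx >> 16);
            uint32_t fx = (sx >> 8) & 0xFFu;
            uint32_t a = p[0], b = p[1];
            uint32_t c = p[sourceWidth], d = p[sourceWidth + 1];
            *output++ = 0xFF000000u
                      | (blend_channel(a, b, c, d, fx, fy, 16) << 16)
                      | (blend_channel(a, b, c, d, fx, fy, 8) << 8)
                      |  blend_channel(a, b, c, d, fx, fy, 0);
        }
    }
}
```

Whether this is actually faster than the float version depends on the target CPU, so measure it on the device before adopting it.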