I have a question about reducing the number of memory accesses in a loop. Consider the following code (this is not my actual code, which is too long to post here):
for(k=0;k<n;k++)
{
y[k] = x[0]*2 + z[1];
}
As you can see, the same memory locations (x[0], z[1]) are read in every iteration. I was wondering if there is any way to reduce memory access when the same location is read several times.
Thanks in advance.
Simply, get the values before the loop:
i = x[0];
j = z[1];
for(k=0;k<n;k++)
{
y[k] = i*2 + j;
}
Of course, the compiler will optimize this (if it can) even if you don't change anything, but it helps to write more readable and intuitive code. You don't need to get the values on every iteration, and the code you write should be indicative of that.
Forget micro-optimizations; write more intuitive and readable code!
As rightly pointed out in comments the right hand expression is completely independent of the loop, so:
i = x[0]*2 + z[1];
for(k=0;k<n;k++)
{
y[k] = i;
}
Here is what you can do.
float value = x[0]*2 + z[1];
for(k=0;k<n;k++)
{
y[k] = value;
}
hope this helps.
v = x[0]*2 + z[1];
for(k=0;k<n;++k) y[k] = v;
Assuming that x[0] and z[1] are NOT mapped to y[0..n-1]
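The aliasing caveat above matters because hoisting the loads changes the result when y overlaps x or z. Here is a minimal sketch of the two forms, assuming int arrays (the question does not state the element types); the function names fill_reload and fill_hoisted are hypothetical.

```c
#include <stddef.h>

/* Original form: x[0] and z[1] are re-read on every iteration. */
void fill_reload(int *y, const int *x, const int *z, size_t n)
{
    for (size_t k = 0; k < n; k++)
        y[k] = x[0] * 2 + z[1];
}

/* Hoisted form: the value is computed once, before the loop. */
void fill_hoisted(int *y, const int *x, const int *z, size_t n)
{
    const int v = x[0] * 2 + z[1];
    for (size_t k = 0; k < n; k++)
        y[k] = v;
}
```

If y and x point to the same array, the first version sees its own write to x[0] on the next iteration, while the second does not. That is also why the compiler cannot always hoist the loads on its own.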
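If y has a type that is shorter than int (e.g. char), you can try the following trick. It assumes that unsigned int is 4 bytes and that y is suitably aligned for it; the pointer cast also sidesteps strict aliasing, so measure it against a plain memset first:
unsigned char value = x[0]*2 + z[1];
unsigned int v = value;
unsigned int value32 = v | (v << 8) | (v << 16) | (v << 24);
unsigned int k;
// Going by blocks of 4
for(k = 0; k < n - n%4; k+=4) {
*(unsigned int *)&y[k] = value32;
}
// Finishing loop
for(; k < n; k++) {
y[k] = value;
}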
The compiler will optimize this.
But in case you use a broken compiler without optimizations, you can put them both in register integers and then work with them, like this:
register int x0 = x[0]*2;
register int z1 = z[1];
register int y0 = x0 + z1;
register int k;
for(k=0;k<n;k++)
{
y[k] = y0;
}
This does not guarantee that the values will be kept in registers, but at least it hints to the compiler that they should be.
I am trying to create a modulated waveform out of 2 sine waves.
To do this I need the modulo (fmodf) to know what amplitude a sine with a specific frequency (lo_frequency) has at a given time (t). But I get a hardfault when the following line is executed:
j = fmodf(2 * PI * lo_frequency * t, 2 * PI);
Do you have any idea why this gives me a hardfault?
Edit 1:
I exchanged fmodf with my_fmodf:
float my_fmodf(float x, float y){
if(y == 0){
return 0;
}
float n = x / y;
return x - n * y;
}
But the hardfault still occurs, and when I debug it, execution doesn't even reach this function (my_fmodf).
Here's the whole function in which this error occurs:
int* create_wave(int* message){
/* Mixes the message signal at 10kHz and the carrier at 40kHz.
* When a bit of the message is 0 the amplitude is lowered to 10%.
* When a bit of the message is 1 the amplitude is 100%.
* The output of the STM32 can't be negative, thats why the wave swings between
* 0 and 256 (8bit precision for faster DAC)
*/
static int rf_frequency = 10000;
static int lo_frequency = 40000;
static int sample_rate = 100000;
int output[sample_rate];
int index, mix;
float j, t;
for(int i = 0; i <= sample_rate; i++){
t = i * 0.00000001f; // i * 10^-8
j = my_fmodf(2 * PI * lo_frequency * t, 2 * PI);
if (j < 0){
j += (float) 2 * PI;
}
index = floor((16.0f / (lo_frequency/rf_frequency * 0.0001f)) * t);
if (index < 16) {
if (!message[index]) {
mix = 115 + sin1(j) * 0.1f;
} else {
mix = sin1(j);
}
} else {
break;
}
output[i] = mix;
}
return output;
}
Edit 2:
I fixed the warning: function returns address of local variable [-Wreturn-local-addr] the way "chux - Reinstate Monica" suggested.
int* create_wave(int* message){
static uint16_t rf_frequency = 10000;
static uint32_t lo_frequency = 40000;
static uint32_t sample_rate = 100000;
int *output = malloc(sizeof *output * sample_rate);
uint8_t index, mix;
float j, n, t;
for(int i = 0; i < sample_rate; i++){
t = i * 0.00000001f; // i * 10^-8
j = fmodf(2 * PI * lo_frequency * t, 2 * PI);
if (j < 0){
j += 2 * PI;
}
index = floor((16.0f / (lo_frequency/rf_frequency * 0.0001f)) * t);
if (index < 16) {
if (!message[index]) {
mix = (uint8_t) floor(115 + sin1(j) * 0.1f);
} else {
mix = sin1(j);
}
} else {
break;
}
output[i] = mix;
}
return output;
}
But now I get the hardfault on this line:
output[i] = mix;
EDIT 3:
Because the previous code contained a very large buffer array that did not fit into the 16KB SRAM of the STM32F303K8, I needed to change it.
Now I use a "ping-pong" buffer where I use the DMA callbacks for "first-half-transmitted" and "completely-transmitted":
void HAL_DAC_ConvHalfCpltCallbackCh1(DAC_HandleTypeDef * hdac){
HAL_GPIO_WritePin(GPIOB, GPIO_PIN_3, GPIO_PIN_SET);
for(uint16_t i = 0; i < 128; i++){
new_value = sin_table[(i * 8) % 256];
if (message[message_index] == 0x0){
dac_buf[i] = new_value * 0.1f + 115;
} else {
dac_buf[i] = new_value;
}
}
}
void HAL_DAC_ConvCpltCallbackCh1 (DAC_HandleTypeDef * hdac){
HAL_GPIO_WritePin(GPIOB, GPIO_PIN_3, GPIO_PIN_RESET);
for(uint16_t i = 128; i < 256; i++){
new_value = sin_table[(i * 8) % 256];
if (message[message_index] == 0x0){
dac_buf[i] = new_value * 0.1f + 115;
} else {
dac_buf[i] = new_value;
}
}
message_index++;
if (message_index >= 16) {
message_index = 0;
// HAL_DAC_Stop_DMA (&hdac1, DAC_CHANNEL_1);
}
}
And it works the way I wanted.
But the frequency of the generated sine is too low.
It caps at around 20kHz, but I need 40kHz.
I already increased the clock by a factor of 8, so that one is maxed out.
I can still decrease the counter period (it is 50 at the moment), but when I do so the interrupt callback seems to take longer than the period until the next one.
At least it seems so, as the output becomes very distorted when I do that.
I also tried to decrease the precision by taking only every 8th sine value, but
I can't go any further because then the output no longer looks like a sine wave.
Any ideas how I could optimize the callback so that it takes less time?
Any other ideas?
Does fmodf() cause a hardfault in stm32?
Other problems in the code cause the hard fault here.
Failing to compile with ample warnings
Best code tip: enable all warnings. @KamilCuk
Faster feedback than Stack Overflow.
I'd expect something like the following from a compiler with warnings well enabled.
return output;
warning: function returns address of local variable [-Wreturn-local-addr]
Returning a local Object
Cannot return a local array. Allocate instead.
// int output[sample_rate];
int *output = malloc(sizeof *output * sample_rate);
return output;
Calling code will need to free() the pointer.
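On a small microcontroller the allocation itself can fail, so it seems worth checking the result of malloc() before writing through it. Here is a minimal sketch; the helper name alloc_samples is hypothetical and not part of the question's code.

```c
#include <stdlib.h>

/* Hypothetical helper: allocate and zero-fill a sample buffer.
 * Returns NULL when the heap cannot satisfy the request. */
int *alloc_samples(size_t count)
{
    int *output = malloc(sizeof *output * count);
    if (output == NULL)
        return NULL;            /* caller must handle out-of-memory */
    for (size_t i = 0; i < count; i++)
        output[i] = 0;
    return output;              /* caller owns the buffer and must free() it */
}
```

A request for 100000 int values needs far more memory than a part with 16KB of SRAM has, so malloc() may well return NULL there. Writing to output[i] through a null pointer is one plausible way to end up in a hard fault.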
Out of range array access
static int sample_rate = 100000;
int output[sample_rate];
// for(int i = 0; i <= sample_rate; i++){
for(int i = 0; i < sample_rate; i++){
...
output[i] = mix;
}
Stack overflow?
static int sample_rate = 100000; int output[sample_rate]; is a large local variable. Maybe allocate or try something smaller?
Advanced: loss of precision
A good fmodf() does not lose precision. For a more precise answer consider double math for the intermediate results. An even better approach is more involved.
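float my_fmodf(float x, float y){
if(y == 0){
return 0;
}
// Truncate the quotient toward zero (trunc() from <math.h>);
// without this step the result is always about 0.
double n = trunc(1.0 * x / y);
return (float) (x - n * y);
}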
Can I not use any function within another?
Yes, you can. The code has other issues.
1 value every 10 µs makes only 100 kSPS, which is not too much for this micro. In my designs I generate > 5 MSPS signals without any problems. Usually I have one buffer and DMA in circular mode. First I fill the buffer and start generation. When the half-transmission DMA interrupt is triggered, I fill the first half of the buffer with fresh data. When the transmission-complete interrupt is triggered, I fill the second half, and this process repeats all over again.
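One way to shorten the two callbacks in this half/full scheme is to do all table lookups and scaling once at start-up, so that each callback only copies bytes. The sketch below is host-side C with no HAL calls. It assumes an 8-bit DAC buffer of 256 entries and a 256-entry sine table as in the question; the names precompute_waves, fill_half, wave_full and wave_low are hypothetical.

```c
#include <stdint.h>
#include <string.h>

#define BUF_LEN  256u              /* DMA buffer length, as in the question */
#define HALF_LEN (BUF_LEN / 2u)

static uint8_t wave_full[BUF_LEN]; /* carrier at 100% amplitude */
static uint8_t wave_low[BUF_LEN];  /* carrier at 10% amplitude, offset 115 */

/* Run once at start-up: all lookups and scaling happen here. */
void precompute_waves(const uint8_t *sin_table /* 256 entries */)
{
    for (unsigned i = 0; i < BUF_LEN; i++) {
        uint8_t v = sin_table[(i * 8u) % 256u];
        wave_full[i] = v;
        /* integer approximation of v * 0.1f + 115 */
        wave_low[i] = (uint8_t)(v / 10u + 115u);
    }
}

/* Call with half = 0 from the half-complete callback and
 * half = 1 from the complete callback; bit is the message bit. */
void fill_half(uint8_t *dac_buf, unsigned half, int bit)
{
    const uint8_t *src = bit ? wave_full : wave_low;
    memcpy(dac_buf + half * HALF_LEN, src + half * HALF_LEN, HALF_LEN);
}
```

This removes the float multiply and the modulo from the interrupt path. Whether that is enough to reach 40kHz would have to be measured on the target.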
I decided to play around a bit with complex.h, and ran into what I consider a very curious problem.
int mandelbrot(long double complex c, int lim)
{
long double complex z = c;
for(int i = 0; i < lim; ++i, z = cpowl(z,2)+c)
{
if(creall(z)*creall(z)+cimagl(z)*cimagl(z) > 4.0)
return 0;
}
return 1;
}
int mandelbrot2(long double cr, long double ci, int lim)
{
long double zr = cr;
long double zi = ci;
for(int i = 0; i < lim; ++i, zr = zr*zr-zi*zi+cr, zi = 2*zr*zi+ci)
{
if(zr*zr+zi*zi > 4.0)
return 0;
}
return 1;
}
These functions do not behave the same. If we input -2.0+0.0i and a limit higher than 17, the latter will return 1, which is correct for any limit, while the former will return 0, at least on my system. GCC 9.1.0, Ryzen 2700x.
I cannot for the life of me figure out how this can happen. I mean while I may not entirely understand how complex.h works behind the scenes, for this particular example it makes no sense that the results should deviate like this.
While writing this, I noticed the cpowl(z,2)+c and tried changing it to z*z+c, which helped; however, after a quick test, I found that the behavior still differs, e.g. for -1.3+0.1*I, lim=18.
I'm curious to know if this is specific to my system and what the cause might be, though I'm perfectly aware that the most likely scenario is that I have made a mistake, but alas, I can't find it.
--- edit---
Finally, the complete code, including alterations and fixes. The two functions now seem to yield the same result.
#include <stdio.h>
#include <complex.h>
int mandelbrot(long double complex c, int lim)
{
long double complex z = c;
for(int i = 0; i < lim; ++i, z = z*z+c)
{
if(creall(z)*creall(z)+cimagl(z)*cimagl(z) > 4.0)
return 0;
}
return 1;
}
int mandelbrot2(long double cr, long double ci, int lim)
{
long double zr = cr;
long double zi = ci;
long double tmp;
for(int i = 0; i < lim; ++i)
{
if(zr*zr+zi*zi > 4.0) return 0;
tmp = zi;
zi = 2*zr*zi+ci;
zr = zr*zr-tmp*tmp+cr;
}
return 1;
}
int main()
{
long double complex c = -2.0+0.0*I;
printf("%i\n",mandelbrot(c,100));
printf("%i\n",mandelbrot2(-2.0,0.0,100));
return 0;
}
cpowl() still messes things up, but I suppose if I wanted to, I could just create my own implementation.
The second function is the one that's incorrect, not the first.
In the expression in the third clause of the for:
zr = zr*zr-zi*zi+cr, zi = 2*zr*zi+ci
The calculation of zi is using the new value of zr, not the current one. You'll need to save the results of these two calculations in temp variables, then assign these back to zr and zi:
int mandelbrot2(long double cr, long double ci, int lim)
{
long double zr = cr;
long double zi = ci;
for(int i = 0; i < lim; ++i)
{
printf("i=%d, z=%Lf%+Lfi\n", i, zr, zi);
if(zr*zr+zi*zi > 4.0)
return 0;
long double new_zr = zr*zr-zi*zi+cr;
long double new_zi = 2*zr*zi+ci;
zr = new_zr;
zi = new_zi;
}
return 1;
}
Also, using cpowl for simple squaring will result in inaccuracies that can be avoided by simply using z*z in this case.
Difference for Input −2 + 0 i
cpowl is inaccurate. Exponentiation is a complicated function to implement, and a variety of errors likely arise in its computation. On macOS 10.14.6, z in the mandelbrot routine takes on these values in successive iterations:
z = -2 + 0 i.
z = 2 + 4.33681e-19 i.
z = 2 + 1.73472e-18 i.
z = 2 + 6.93889e-18 i.
z = 2 + 2.77556e-17 i.
z = 2 + 1.11022e-16 i.
z = 2 + 4.44089e-16 i.
z = 2 + 1.77636e-15 i.
z = 2 + 7.10543e-15 i.
z = 2 + 2.84217e-14 i.
z = 2 + 1.13687e-13 i.
z = 2 + 4.54747e-13 i.
z = 2 + 1.81899e-12 i.
z = 2 + 7.27596e-12 i.
z = 2 + 2.91038e-11 i.
z = 2 + 1.16415e-10 i.
z = 2 + 4.65661e-10 i.
Thus, once the initial error is made, producing 2 + 4.33681e-19 i, z continues to grow (correctly, as a result of mathematics, not just floating-point errors) until it is large enough to pass the test comparing the square of its absolute value to 4. (The test does not immediately capture the excess because the square of the imaginary part is so small it is lost in rounding when added to the square of the real part.)
In contrast, if we replace z = cpowl(z,2)+c with z = z*z + c, z remains 2 (that is, 2 + 0i). In general, the operations in z*z experience some rounding errors too, but not as badly as with cpowl.
Difference for Input −1.3 + 0.1 i
For this input, the difference is caused by the incorrect calculation in the update step of the for loop:
++i, zr = zr*zr-zi*zi+cr, zi = 2*zr*zi+ci
That uses the new value of zr when calculating zi. It can be fixed by inserting long double t; and changing the update step to
++i, t = zr*zr - zi*zi + cr, zi = 2*zr*zi + ci, zr = t
I want to compute a moving average or something similar, because I am getting noisy values from the ADC. This is my first attempt at computing a moving average, but the values go to 0 every time. Can you help me?
This is the part of the code which does this magic:
unsigned char buffer[5];
int samples = 0;
USART_Init0(MYUBRR);
uint16_t adc_result0, adc_result1;
float ADCaverage = 0;
while(1)
{
adc_result0 = adc_read(0); // read adc value at PA0
samples++;
//adc_result1 = adc_read(1); // read adc value at PA1
ADCaverage = (ADCaverage + adc_result0)/samples;
sprintf(buffer, "%d\n", (int)ADCaverage);
char * p = buffer;
while (*p) { USART_Transmit0(*p++); }
_delay_ms(1000);
}
return(0);
}
This result I am sending via usart to display value.
Your equation is not correct.
Let s_n = (sum_{i=1}^{n} x[i])/n. Then:
s_(n-1) = (sum_{i=1}^{n-1} x[i])/(n-1)
sum_{i=1}^{n-1} x[i] = (n-1)*s_(n-1)
sum_{i=1}^{n} x[i] = n*s_n
sum_{i=1}^{n} x[i] = sum_{i=1}^{n-1} x[i] + x[n]
n*s_n = (n-1)*s_(n-1) + x[n] = n*s_(n-1) + (x[n]-s_(n-1))
s_n = s_(n-1) + (x[n]-s_(n-1))/n
You must use
ADCaverage += (adc_result0-ADCaverage)/samples;
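The update rule can be wrapped in a small helper. This is only a sketch; the type running_mean_t and the function running_mean_update are hypothetical names, and the sample type uint16_t is taken from the question's adc_result0.

```c
#include <stdint.h>

/* State of a running (cumulative) mean. */
typedef struct {
    float    mean;
    uint32_t count;
} running_mean_t;

/* Fold one new sample into the mean:
 * s_n = s_(n-1) + (x[n] - s_(n-1)) / n */
float running_mean_update(running_mean_t *rm, uint16_t sample)
{
    rm->count++;
    rm->mean += ((float)sample - rm->mean) / (float)rm->count;
    return rm->mean;
}
```

Starting from a zeroed state, feeding 10, 20 and 30 gives means of 10, 15 and 20. Note that this averages every sample seen so far; it is not a windowed moving average, so old samples never drop out.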
You can use an exponential moving average, which only needs one stored value.
y[0] = (x[0] + y[-1] * (a-1) )/a
Where a is the filter factor.
If a is a power of 2, say a = 2^k, you can use shifts and optimize for speed significantly:
y[0] = ( x[0] + ( ( y[-1] << k ) - y[-1] ) ) >> k
This works especially well with left-aligned ADCs. Just keep an eye on the word size of the shift result.
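A minimal sketch of one shift-based filter step, assuming 16-bit samples and a filter factor of 2^shift with shift at most 16. The function name ema_step is hypothetical.

```c
#include <stdint.h>

/* One exponential-moving-average step with filter factor 2^shift:
 * y = (x + (2^shift - 1) * y_prev) / 2^shift, using shifts only. */
uint16_t ema_step(uint16_t y_prev, uint16_t x, unsigned shift)
{
    /* Widen first: (y_prev << shift) needs up to 16 + shift bits. */
    uint32_t acc = (uint32_t)x + (((uint32_t)y_prev << shift) - y_prev);
    return (uint16_t)(acc >> shift);
}
```

For example, ema_step(100, 200, 2) computes (200 + 400 - 100) >> 2 = 125. The final shift truncates, so with small inputs the output can settle slightly below the true mean; keeping the accumulator in scaled form between calls is one common way around that.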
I am hoping someone can help me with this. I am a complete and utter C newbie.
This is for a school assignment in a class on C (just plain old C, not C# or C++), and the professor is insistent that the only compiler we're allowed to use is Borland 5.5.
The general assignment is to run an algorithm that can check the validity of a credit card number. I've successfully gotten the program to pick up the user-input CC number, then portion that number out into an array. It prints out mostly what I want.
However, when I entered the last function (the one I commented as such) and then compiled, the program just started to hang. I have no idea what could be causing that.
#include <stdio.h>
#include <math.h>
#include <stdlib.h>
//global variables declared.
//in an earlier version, I was going to use multiple functions, but I couldn't make them work
float array[16];
double num, ten;
int i, a, b, x, y, check;
int main()
{
ten = 10;
//pick up user-input number
printf("Enter your credit card number\n>");
scanf("%lf", &num);
//generate the array
for (i = 15; i >= 0; i--)
{
array[i] = fmod(num, ten);
num /= 10;
printf("Array is %1.1lf\n", array[i]);
}
//double every other number. If the number is greater than ten, test for that, then parse and re-add.
//this is where the program starts to hang (I think).
{for (i = 2; i <= 16; i + 2)
{
array[i] = array[i] * 2;
if (array[i] >= 10)
{
a = (int)array[i] % 10;
b = (int)array[i] / 10;
array[i] = a + b;
}
}
printf("%f", array[i]);
}
//add the numbers together
x = array[2] + array[4] + array[6] + array[8] + array[10] + array[12] + array[14] + array[16];
y = array[1] + array[3] + array[5] + array[7] + array[9] + array[11] + array[13] + array[15];
check = x + y;
//print out a test number to make sure the program is doing everything correctly.
//Right now, this isn't happening
printf("%d", check);
return 0;
}
for (i = 2; i <= 16; i + 2)
should be
for (i = 2; i <= 16; i = i + 2)
or
for (i = 2; i <= 16; i += 2)
As you have it, the value of i is never modified, so the loop never terminates.
You declare your array as
array[16], so the valid elements are array[0] .. array[15].
In the second for loop, when i = 16 you access
array[16], which is out of bounds!
valter
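The doubling and digit-adding steps in the question look like the Luhn check, though the assignment text does not name it. Under that assumption, here is a sketch that reads the number as a string of digits instead of a double, which avoids both the floating-point digit extraction and the array[16] overrun. It sticks to C89 so it should build with older compilers; the function name luhn_valid is hypothetical.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if the digit string passes the Luhn check, 0 otherwise. */
int luhn_valid(const char *digits)
{
    size_t len = strlen(digits);
    size_t pos;
    int sum = 0;
    int d;
    char ch;

    if (len == 0)
        return 0;

    for (pos = 0; pos < len; pos++) {
        ch = digits[len - 1 - pos];      /* walk from the rightmost digit */
        if (!isdigit((unsigned char)ch))
            return 0;
        d = ch - '0';
        if (pos % 2 == 1) {              /* every second digit from the right */
            d *= 2;
            if (d > 9)
                d -= 9;                  /* same as adding the two digits */
        }
        sum += d;
    }
    return sum % 10 == 0;
}
```

With the widely used test number 4111111111111111 the digit sum is 30, so the function returns 1; changing the last digit to 2 gives 31 and a result of 0.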
Can anyone spot any way to improve the speed of the following bilinear resizing algorithm?
Speed is critical, but I need to keep good image quality. It is expected to be used on mobile devices with slow CPUs.
The algorithm is used mainly for upscaling. Any other, faster bilinear algorithm would also be appreciated. Thanks
void resize(int* input, int* output, int sourceWidth, int sourceHeight, int targetWidth, int targetHeight)
{
int a, b, c, d, x, y, index;
float x_ratio = ((float)(sourceWidth - 1)) / targetWidth;
float y_ratio = ((float)(sourceHeight - 1)) / targetHeight;
float x_diff, y_diff, blue, red, green ;
int offset = 0 ;
for (int i = 0; i < targetHeight; i++)
{
for (int j = 0; j < targetWidth; j++)
{
x = (int)(x_ratio * j) ;
y = (int)(y_ratio * i) ;
x_diff = (x_ratio * j) - x ;
y_diff = (y_ratio * i) - y ;
index = (y * sourceWidth + x) ;
a = input[index] ;
b = input[index + 1] ;
c = input[index + sourceWidth] ;
d = input[index + sourceWidth + 1] ;
// blue element
blue = (a&0xff)*(1-x_diff)*(1-y_diff) + (b&0xff)*(x_diff)*(1-y_diff) +
(c&0xff)*(y_diff)*(1-x_diff) + (d&0xff)*(x_diff*y_diff);
// green element
green = ((a>>8)&0xff)*(1-x_diff)*(1-y_diff) + ((b>>8)&0xff)*(x_diff)*(1-y_diff) +
((c>>8)&0xff)*(y_diff)*(1-x_diff) + ((d>>8)&0xff)*(x_diff*y_diff);
// red element
red = ((a>>16)&0xff)*(1-x_diff)*(1-y_diff) + ((b>>16)&0xff)*(x_diff)*(1-y_diff) +
((c>>16)&0xff)*(y_diff)*(1-x_diff) + ((d>>16)&0xff)*(x_diff*y_diff);
output [offset++] =
0x000000ff | // alpha
((((int)red) << 24)&0xff000000) |
((((int)green) << 16)&0xff0000) |
((((int)blue) << 8)&0xff00);
}
}
}
Off the top of my head:
Stop using floating-point, unless you're certain your target CPU has it in hardware with good performance.
Make sure memory accesses are cache-optimized, i.e. clumped together.
Use the fastest data types possible. Sometimes this means smallest, sometimes it means "most native, requiring least overhead".
Investigate if signed/unsigned for integer operations have performance costs on your platform.
Investigate if look-up tables rather than computations gain you anything (but these can blow the caches, so be careful).
And, of course, do lots of profiling and measurements.
In-Line Cache and Lookup Tables
Cache your computations in your algorithm.
Avoid duplicate computations (like (1-y_diff) or (x_ratio * j))
Go through all the lines of your algorithm, and try to identify patterns of repetitions. Extract these to local variables. And possibly extract to functions, if they are short enough to be inlined, to make things more readable.
Use a lookup-table
It's quite likely that, if you can spare some memory, you can implement a "store" for your RGB values and simply "fetch" them based on the inputs that produced them. Maybe you don't need to store all of them, but you could experiment and see if some come back often. Alternatively, you could "fudge" your colors and thus end up with fewer values to store for more lookup inputs.
If you know the boundaries for your inputs, you can calculate the complete domain space and figure out what makes sense to cache. For instance, if you can't cache the whole R, G, B values, maybe you can at least pre-compute the shifts ((b>>16) and so forth), which are most likely deterministic in your case.
Use the Right Data Types for Performance
If you can avoid double and float variables, use int. On most architectures, int would be the fastest type for computations because of the memory model. You can still achieve decent precision by simply shifting your units (i.e. use 1026 as int instead of 1.026 as double or float). It's quite likely that this trick would be enough for you.
x = (int)(x_ratio * j) ;
y = (int)(y_ratio * i) ;
x_diff = (x_ratio * j) - x ;
y_diff = (y_ratio * i) - y ;
index = (y * sourceWidth + x) ;
Could surely use some optimization: you computed x_ratio * (j-1) just one iteration earlier, so all you really need here is to keep a running float value and add x_ratio to it on each iteration.
My random guess (use a profiler instead of letting people guess!):
The compiler has to generate code that works when input and output overlap, which means it has to generate loads of redundant stores and loads. Add restrict to the input and output parameters to remove that safety feature.
You could also try using a=b; and c=d; instead of loading them again.
Here is my version; steal some ideas. My C-fu is quite weak, so some lines are pseudocode, but you can fix them.
void resize(int* input, int* output,
int sourceWidth, int sourceHeight,
int targetWidth, int targetHeight
) {
// Let's create some lookup tables!
// you can move them into 2-dimensional arrays to
// group together values used at the same time to help processor cache
int sx[targetWidth]; // target->source X lookup (C99 variable-length array)
int sy[targetHeight]; // target->source Y lookup
int mx[targetWidth]; // left pixel's multiplier
int my[targetHeight]; // bottom pixel's multiplier
// we don't have to calc indexes every time, find out when
int reloadPixels[targetWidth]; // 0 or 1
int shiftPixels[targetWidth]; // 0 or 2 (a flag bit, so not a bool)
int shiftReloadPixels[targetWidth]; // can be combined if necessary
int v; // temporary value
for (int j = 0; j < targetWidth; j++){
// (8bit + targetBits + sourceBits) should be < max int
v = 256 * j * (sourceWidth-1) / (targetWidth-1);
sx[j] = v / 256;
mx[j] = v % 256;
reloadPixels[j] = j ? ( sx[j-1] != sx[j] ? 1 : 0)
: 1; // always load first pixel
// if no reload -> then no shift too
shiftPixels[j] = j ? ( sx[j-1]+1 == sx[j] ? 2 : 0)
: 0; // nothing to shift at first pixel
shiftReloadPixels[j] = reloadPixels[j] | shiftPixels[j];
}
for (int i = 0; i < targetHeight; i++){
v = 256 * i * (sourceHeight-1) / (targetHeight-1);
sy[i] = v / 256;
my[i] = v % 256;
}
int shiftReload;
int srcIndex;
int srcRowIndex;
int offset = 0;
int lm, rm, tm, bm; // left / right / top / bottom multipliers
int a, b, c, d;
int leftOutput, rightOutput; // per-pixel partial results used below
for (int i = 0; i < targetHeight; i++){
srcRowIndex = sy[ i ] * sourceWidth;
tm = my[i];
bm = 255 - tm;
for (int j = 0; j < targetWidth; j++){
// too much ifs can be too slow, measure.
// always true for first pixel in a row
if( shiftReload = shiftReloadPixels[ j ] ){
srcIndex = srcRowIndex + sx[j];
if( shiftReload & 2 ){
a = b;
c = d;
}else{
a = input[ srcIndex ];
c = input[ srcIndex + sourceWidth ];
}
b = input[ srcIndex + 1 ];
d = input[ srcIndex + 1 + sourceWidth ];
}
lm = mx[j];
rm = 255 - lm;
// WTF?
// Input AA RR GG BB
// Output RR GG BB AA
if( j ){
leftOutput = rightOutput ^ 0xFFFFFF00;
}else{
leftOutput =
// blue element
((( ( (a&0xFF)*tm
+ (c&0xFF)*bm )*lm
) & 0xFF0000 ) >> 8)
// green element
| ((( ( ((a>>8)&0xFF)*tm
+ ((c>>8)&0xFF)*bm )*lm
) & 0xFF0000 )) // no need to shift
// red element
| ((( ( ((a>>16)&0xFF)*tm
+ ((c>>16)&0xFF)*bm )*lm
) & 0xFF0000 ) << 8 )
;
}
rightOutput =
// blue element
((( ( (b&0xFF)*tm
+ (d&0xFF)*bm )*lm
) & 0xFF0000 ) >> 8)
// green element
| ((( ( ((b>>8)&0xFF)*tm
+ ((d>>8)&0xFF)*bm )*lm
) & 0xFF0000 )) // no need to shift
// red element
| ((( ( ((b>>16)&0xFF)*tm
+ ((d>>16)&0xFF)*bm )*lm
) & 0xFF0000 ) << 8 )
;
output[offset++] =
// alpha
0x000000ff
| leftOutput
| rightOutput
;
}
}
}
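For comparison, here is a compact integer-only sketch that combines several of the suggestions in this thread: fixed-point coordinates instead of float, running sums instead of a multiply per pixel, and restrict on the buffers. It is not the code from any answer above. It assumes input pixels laid out as 0xAARRGGBB and output pixels as 0xRRGGBBAA with alpha forced to 0xFF (as the question's code appears to intend), a source of at least 2x2 pixels, and dimensions below 65536. The names blend and resize_fixed are hypothetical.

```c
#include <stdint.h>

/* Blend one 8-bit channel of four neighbours.
 * wx and wy are the fractional weights in 0..255 (8-bit fixed point). */
static uint32_t blend(uint32_t a, uint32_t b, uint32_t c, uint32_t d,
                      uint32_t wx, uint32_t wy, unsigned shift)
{
    uint32_t top = ((a >> shift) & 0xFF) * (256 - wx) + ((b >> shift) & 0xFF) * wx;
    uint32_t bot = ((c >> shift) & 0xFF) * (256 - wx) + ((d >> shift) & 0xFF) * wx;
    return (top * (256 - wy) + bot * wy) >> 16;   /* result in 0..255 */
}

void resize_fixed(const uint32_t *restrict input, uint32_t *restrict output,
                  int sourceWidth, int sourceHeight,
                  int targetWidth, int targetHeight)
{
    /* 16.16 fixed-point step per target pixel, same ratios as the question. */
    uint32_t x_step = ((uint32_t)(sourceWidth  - 1) << 16) / (uint32_t)targetWidth;
    uint32_t y_step = ((uint32_t)(sourceHeight - 1) << 16) / (uint32_t)targetHeight;
    uint32_t fy = 0;

    for (int i = 0; i < targetHeight; i++, fy += y_step) {
        const uint32_t *row = input + (fy >> 16) * (uint32_t)sourceWidth;
        uint32_t wy = (fy >> 8) & 0xFF;           /* top 8 fraction bits */
        uint32_t fx = 0;

        for (int j = 0; j < targetWidth; j++, fx += x_step) {
            const uint32_t *p = row + (fx >> 16);
            uint32_t wx = (fx >> 8) & 0xFF;
            uint32_t a = p[0], b = p[1];
            uint32_t c = p[sourceWidth], d = p[sourceWidth + 1];

            uint32_t blue  = blend(a, b, c, d, wx, wy, 0);
            uint32_t green = blend(a, b, c, d, wx, wy, 8);
            uint32_t red   = blend(a, b, c, d, wx, wy, 16);

            *output++ = (red << 24) | (green << 16) | (blue << 8) | 0xFFu;
        }
    }
}
```

The 8-bit weights trade a little precision for speed, so the output can differ from the float version by a rounding step per channel. As the answers above say, profile on the target device before settling on any variant.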