Optimize Bilinear Resize Algorithm in C

Can anyone spot a way to improve the speed of the following bilinear resizing algorithm?
Speed is critical, but I need to keep good image quality. The code is expected to run on mobile devices with slow CPUs.
The algorithm is used mainly for upscaling. Any other, faster bilinear algorithm would also be appreciated. Thanks.
void resize(int* input, int* output, int sourceWidth, int sourceHeight, int targetWidth, int targetHeight)
{
int a, b, c, d, x, y, index;
float x_ratio = ((float)(sourceWidth - 1)) / targetWidth;
float y_ratio = ((float)(sourceHeight - 1)) / targetHeight;
float x_diff, y_diff, blue, red, green ;
int offset = 0 ;
for (int i = 0; i < targetHeight; i++)
{
for (int j = 0; j < targetWidth; j++)
{
x = (int)(x_ratio * j) ;
y = (int)(y_ratio * i) ;
x_diff = (x_ratio * j) - x ;
y_diff = (y_ratio * i) - y ;
index = (y * sourceWidth + x) ;
a = input[index] ;
b = input[index + 1] ;
c = input[index + sourceWidth] ;
d = input[index + sourceWidth + 1] ;
// blue element
blue = (a&0xff)*(1-x_diff)*(1-y_diff) + (b&0xff)*(x_diff)*(1-y_diff) +
(c&0xff)*(y_diff)*(1-x_diff) + (d&0xff)*(x_diff*y_diff);
// green element
green = ((a>>8)&0xff)*(1-x_diff)*(1-y_diff) + ((b>>8)&0xff)*(x_diff)*(1-y_diff) +
((c>>8)&0xff)*(y_diff)*(1-x_diff) + ((d>>8)&0xff)*(x_diff*y_diff);
// red element
red = ((a>>16)&0xff)*(1-x_diff)*(1-y_diff) + ((b>>16)&0xff)*(x_diff)*(1-y_diff) +
((c>>16)&0xff)*(y_diff)*(1-x_diff) + ((d>>16)&0xff)*(x_diff*y_diff);
output [offset++] =
0x000000ff | // alpha
((((unsigned)red) << 24)&0xff000000) |
((((unsigned)green) << 16)&0xff0000) |
((((unsigned)blue) << 8)&0xff00);
}
}
}

Off the top of my head:
Stop using floating-point, unless you're certain your target CPU has it in hardware with good performance.
Make sure memory accesses are cache-optimized, i.e. clumped together.
Use the fastest data types possible. Sometimes this means smallest, sometimes it means "most native, requiring least overhead".
Investigate if signed/unsigned for integer operations have performance costs on your platform.
Investigate if look-up tables rather than computations gain you anything (but these can blow the caches, so be careful).
And, of course, do lots of profiling and measurements.

In-Line Cache and Lookup Tables
Cache your computations in your algorithm.
Avoid duplicate computations (like (1-y_diff) or (x_ratio * j))
Go through all the lines of your algorithm, and try to identify patterns of repetitions. Extract these to local variables. And possibly extract to functions, if they are short enough to be inlined, to make things more readable.
Use a lookup-table
It's quite likely that, if you can spare some memory, you can implement a "store" for your RGB values and simply "fetch" them based on the inputs that produced them. Maybe you don't need to store all of them, but you could experiment and see if some come back often. Alternatively, you could "fudge" your colors and thus end up with fewer values to store for more lookup inputs.
If you know the boundaries of your inputs, you can calculate the complete domain space and figure out what makes sense to cache. For instance, if you can't cache the whole R, G, B values, maybe you can at least pre-compute the shifts ((b>>16) and so forth), which are most likely deterministic in your case.
Use the Right Data Types for Performance
If you can avoid double and float variables, use int. On most architectures, int would be the fastest type for computations because of the memory model. You can still achieve decent precision by simply scaling your units (i.e. use 1026 as an int instead of 1.026 as a double or float). It's quite likely that this trick would be enough for you.

x = (int)(x_ratio * j) ;
y = (int)(y_ratio * i) ;
x_diff = (x_ratio * j) - x ;
y_diff = (y_ratio * i) - y ;
index = (y * sourceWidth + x) ;
This could surely use some optimization: you computed x_ratio * (j-1) just one iteration earlier, so all you really need here is to add x_ratio to a running position (x_pos += x_ratio) instead of multiplying again.
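A minimal sketch of that incremental stepping, assuming 16.16 fixed point so that it also avoids floats. The names FP_SHIFT and source_columns are illustrative and not from the original code, and the sketch assumes sourceWidth - 1 is below 65536 so the shifted value fits in 32 bits:

```c
#include <stdint.h>

#define FP_SHIFT 16  /* assumption: 16.16 fixed-point positions */

/* For each target column j, store the source column in x_int[j] and the
   8-bit blend weight toward column x_int[j] + 1 in x_frac[j]. */
void source_columns(int sourceWidth, int targetWidth,
                    int *x_int, uint8_t *x_frac)
{
    /* same ratio as in the question, computed once in fixed point */
    uint32_t step = ((uint32_t)(sourceWidth - 1) << FP_SHIFT)
                  / (uint32_t)targetWidth;
    uint32_t pos = 0;

    for (int j = 0; j < targetWidth; j++) {
        x_int[j]  = (int)(pos >> FP_SHIFT);
        x_frac[j] = (uint8_t)((pos >> (FP_SHIFT - 8)) & 0xFF);
        pos += step;  /* one addition replaces x_ratio * j */
    }
}
```

The same stepping works for the rows. Since the column values do not depend on the row, they can be computed once and reused for every output row.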

My random guess (use a profiler instead of letting people guess!):
The compiler has to generate code that works when input and output overlap, which means it has to generate loads of redundant stores and loads. Add restrict to the input and output parameters to remove that safety feature.
You could also try using a=b; and c=d; instead of loading them again.
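Both suggestions can be illustrated on a deliberately simplified case. The function blend_row below is hypothetical and only averages horizontal neighbours in one row of bytes; it is not the resize itself. It shows restrict on the two pointers and carrying the right-hand value over as the next left-hand value, which is only valid when the source position advances by exactly one element per step:

```c
#include <stddef.h>

/* Average each pair of neighbours in one row; out receives n - 1 values.
   restrict promises the compiler that in and out never overlap. */
void blend_row(const unsigned char *restrict in,
               unsigned char *restrict out, size_t n)
{
    if (n < 2)
        return;

    unsigned left = in[0];
    for (size_t x = 0; x + 1 < n; x++) {
        unsigned right = in[x + 1];   /* the only load in this iteration */
        out[x] = (unsigned char)((left + right + 1) / 2);
        left = right;                 /* reuse instead of reloading, like a = b */
    }
}
```

Whether this helps depends on the compiler and target, so it is worth measuring both variants.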

Here is my version; steal some ideas. My C-fu is quite weak, so some lines are pseudocode, but you can fix them.
void resize(int* input, int* output,
int sourceWidth, int sourceHeight,
int targetWidth, int targetHeight
) {
// Let's create some lookup tables!
// you can move them into 2-dimensional arrays to
// group together values used at the same time to help processor cache
// C99 variable-length arrays; for large images allocate these with malloc instead
int sx[targetWidth]; // target->source X lookup
int sy[targetHeight]; // target->source Y lookup
int mx[targetWidth]; // left pixel's multiplier
int my[targetHeight]; // bottom pixel's multiplier
// we don't have to calc indexes every time, find out when
// (int rather than bool, because the flags below hold the bit values 1 and 2)
int reloadPixels[targetWidth];
int shiftPixels[targetWidth];
int shiftReloadPixels[targetWidth]; // can be combined if necessary
int v; // temporary value
for (int j = 0; j < targetWidth; j++){
// (8bit + targetBits + sourceBits) should be < max int
v = 256 * j * (sourceWidth-1) / (targetWidth-1);
sx[j] = v / 256;
mx[j] = v % 256;
reloadPixels[j] = j ? ( sx[j-1] != sx[j] ? 1 : 0)
: 1; // always load first pixel
// if no reload -> then no shift too
shiftPixels[j] = j ? ( sx[j-1]+1 == sx[j] ? 2 : 0)
: 0; // nothing to shift at first pixel
shiftReloadPixels[j] = reloadPixels[j] | shiftPixels[j];
}
for (int i = 0; i < targetHeight; i++){
v = 256 * i * (sourceHeight-1) / (targetHeight-1);
sy[i] = v / 256;
my[i] = v % 256;
}
int shiftReload;
int srcIndex;
int srcRowIndex;
int offset = 0;
int lm, rm, tm, bm; // left / right / top / bottom multipliers
int a, b, c, d;
int leftOutput = 0, rightOutput = 0; // packed partial results used below
for (int i = 0; i < targetHeight; i++){
srcRowIndex = sy[ i ] * sourceWidth;
tm = my[i];
bm = 255 - tm;
for (int j = 0; j < targetWidth; j++){
// too much ifs can be too slow, measure.
// always true for first pixel in a row
if( shiftReload = shiftReloadPixels[ j ] ){
srcIndex = srcRowIndex + sx[j];
if( shiftReload & 2 ){
a = b;
c = d;
}else{
a = input[ srcIndex ];
c = input[ srcIndex + sourceWidth ];
}
b = input[ srcIndex + 1 ];
d = input[ srcIndex + 1 + sourceWidth ];
}
lm = mx[j];
rm = 255 - lm;
// WTF?
// Input AA RR GG BB
// Output RR GG BB AA
if( j ){
leftOutput = rightOutput ^ 0xFFFFFF00;
}else{
leftOutput =
// blue element
((( ( (a&0xFF)*tm
+ (c&0xFF)*bm )*lm
) & 0xFF0000 ) >> 8)
// green element
| ((( ( ((a>>8)&0xFF)*tm
+ ((c>>8)&0xFF)*bm )*lm
) & 0xFF0000 )) // no need to shift
// red element
| ((( ( ((a>>16)&0xFF)*tm
+ ((c>>16)&0xFF)*bm )*lm
) & 0xFF0000 ) << 8 )
;
}
rightOutput =
// blue element
((( ( (b&0xFF)*tm
+ (d&0xFF)*bm )*lm
) & 0xFF0000 ) >> 8)
// green element
| ((( ( ((b>>8)&0xFF)*tm
+ ((d>>8)&0xFF)*bm )*lm
) & 0xFF0000 )) // no need to shift
// red element
| ((( ( ((b>>16)&0xFF)*tm
+ ((d>>16)&0xFF)*bm )*lm
) & 0xFF0000 ) << 8 )
;
output[offset++] =
// alpha
0x000000ff
| leftOutput
| rightOutput
;
}
}
}
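For comparison, here is a compact integer-only sketch of the same lookup-table idea. It is not the code above made runnable but a separate variant under stated assumptions: pixels are packed as 0xAARRGGBB on both input and output (unlike the RR GG BB AA output layout noted above), alpha is forced to 0xFF, weights are 8-bit, the target is at least 2x2, and reads past the right and bottom edges are clamped. The names blend and resize_fixed are illustrative.

```c
#include <stdint.h>
#include <stdlib.h>

/* Blend one 8-bit channel (selected by shift) of four neighbouring pixels.
   wx and wy are weights in [0,255] toward the right / bottom neighbour. */
static uint32_t blend(uint32_t a, uint32_t b, uint32_t c, uint32_t d,
                      int shift, uint32_t wx, uint32_t wy)
{
    uint32_t top = ((a >> shift) & 0xFF) * (256 - wx) + ((b >> shift) & 0xFF) * wx;
    uint32_t bot = ((c >> shift) & 0xFF) * (256 - wx) + ((d >> shift) & 0xFF) * wx;
    return ((top * (256 - wy) + bot * wy) >> 16) << shift;
}

/* Returns 0 on success, -1 on bad arguments or allocation failure. */
int resize_fixed(const uint32_t *restrict input, uint32_t *restrict output,
                 int sw, int sh, int tw, int th)
{
    if (sw < 1 || sh < 1 || tw < 2 || th < 2)
        return -1;

    int *sx = malloc((size_t)tw * sizeof *sx);       /* source column per target column */
    uint32_t *wx = malloc((size_t)tw * sizeof *wx);  /* horizontal weight per target column */
    if (!sx || !wx) { free(sx); free(wx); return -1; }

    for (int j = 0; j < tw; j++) {
        /* source x in 24.8 fixed point; corners map onto corners */
        uint32_t v = (uint32_t)(((uint64_t)j * (uint64_t)(sw - 1) * 256) / (uint64_t)(tw - 1));
        sx[j] = (int)(v >> 8);
        wx[j] = v & 0xFF;
    }

    for (int i = 0; i < th; i++) {
        uint32_t v = (uint32_t)(((uint64_t)i * (uint64_t)(sh - 1) * 256) / (uint64_t)(th - 1));
        int y = (int)(v >> 8);
        uint32_t wy = v & 0xFF;
        const uint32_t *row0 = input + (size_t)y * (size_t)sw;
        const uint32_t *row1 = (y + 1 < sh) ? row0 + sw : row0;  /* clamp at bottom edge */

        for (int j = 0; j < tw; j++) {
            int x = sx[j];
            int x1 = (x + 1 < sw) ? x + 1 : x;                   /* clamp at right edge */
            uint32_t a = row0[x], b = row0[x1], c = row1[x], d = row1[x1];
            *output++ = 0xFF000000u
                      | blend(a, b, c, d, 16, wx[j], wy)
                      | blend(a, b, c, d, 8, wx[j], wy)
                      | blend(a, b, c, d, 0, wx[j], wy);
        }
    }

    free(sx);
    free(wx);
    return 0;
}
```

The vertical position and weight are computed once per row, the horizontal ones once per image, so the inner loop contains only integer multiplies, shifts and table reads. The result truncates instead of rounding, so it can differ from the floating-point version by one level per channel.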

Related

What is the matrix/vector operation that corresponds to this code?

Here is the code:
long long mul(long long x)
{
uint64_t M[64] = INIT;
uint64_t result = 0;
for ( int i = 0; i < 64; i++ )
{
uint64_t a = x & M[i];
uint64_t b = 0;
while ( a ){
b ^= a & 1;
a >>= 1;
}
result |= b << (63 - i);
}
return result;
}
This code implements multiplication of a matrix and a vector over GF(2): it returns result as the product of the 64x64 matrix M and the 64-element vector x.
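A compact way to see this is that each result bit is the parity of x & M[i], i.e. the dot product over GF(2) of row i with x. The sketch below passes the matrix in as a parameter instead of using the INIT placeholder; the names parity64 and gf2_matvec are illustrative:

```c
#include <stdint.h>

/* Parity (XOR of all bits) of a 64-bit word, computed by folding. */
static uint64_t parity64(uint64_t a)
{
    a ^= a >> 32;
    a ^= a >> 16;
    a ^= a >> 8;
    a ^= a >> 4;
    a ^= a >> 2;
    a ^= a >> 1;
    return a & 1;
}

/* Bit (63 - i) of the result is the GF(2) dot product of row M[i] with x. */
uint64_t gf2_matvec(const uint64_t M[64], uint64_t x)
{
    uint64_t result = 0;
    for (int i = 0; i < 64; i++)
        result |= parity64(x & M[i]) << (63 - i);
    return result;
}
```

With M[i] = 1ULL << (63 - i), the identity matrix in this bit ordering, the function returns x unchanged.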
I want to know what linear algebraic operation( on GF(2) ) this code is:
long long unknown(long long x)
{
uint64_t A[] = INIT;
uint64_t a = 0, b = 0;
for( int i = 1; i <= 64; i++ ){
for( int j = i; j <= 64; j++ ){
if( ((x >> (64-i)) & 1) && ((x >> (64-j)) & 1) )
a ^= A[b];
b++;
}
}
return a;
}

I want to know what linear algebraic operation( on GF(2) ) this code is:
Of course you mean GF(2)^64, the vector space of 64-dimensional vectors over GF(2).
Consider first the loop structure:
for( i = 1; i <= 64; i++ ){
for( j = i; j <= 64; j++ ){
That's looking at every distinct pair of indices (the indices themselves not necessarily distinct from each other). That should provide a first clue. We then see
if( ((x >> (64-i)) & 1) && ((x >> (64-j)) & 1) )
, which is testing whether vector x has both bit i and bit j set. If it does, then we add a row of matrix A into accumulation variable a, by vector sum (== element-wise exclusive or). By incrementing b on every inner-loop iteration, we ensure that each iteration services a different row of A. And that also tells us that A must have 64 * 65 / 2 = 2080 rows (that matter).
In general, this is not a linear operation at all. The criterion for an operation o on a vector space over GF(2) to be linear boils down to this expression holding for all pairs of vectors x and y:
o(x + y) = o(x) + o(y)
Now, for notational convenience, let's consider the space GF(2)^2 instead of GF(2)^64; the result can be extended from the former to the latter simply by adding zeroes. Let x be the bit vector (1, 0) (represented, for example, by the integer 2). Let y be the bit vector (0, 1) (represented by the integer 1). And let A be this matrix:
1 0
0 1
1 0
Your operation has the following among its results:
operand   result   as integer   comment
x         (1, 0)   2            Only the first row is accumulated
y         (1, 0)   2            Only the third row is accumulated
x + y     (0, 1)   1            All rows are accumulated
Clearly, it is not the case that o(x) + o(y) = o(x + y) for this x, y, and characteristic A (here o(x) + o(y) = (0, 0), whereas o(x + y) = (0, 1)), so the operation is not linear for this A.
There are matrices A for which the corresponding operation is linear, but what linear operation they represent will depend on A. For example, it is possible to represent a wide variety of matrix-vector multiplications this way. It's not clear to me whether linear operations other than matrix-vector multiplications can be represented in this form, but I'm inclined to think not.
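The two-dimensional counterexample can be checked mechanically. The sketch below generalises the question's loop to n-bit vectors; the name quad_op and the parameter n are assumptions made for this illustration, and the rows of A are given as integers (2, 1, 2 for the matrix above):

```c
#include <stdint.h>

/* The question's double loop for n-bit vectors (n <= 64).
   A must provide n * (n + 1) / 2 rows, one per index pair i <= j. */
uint64_t quad_op(const uint64_t *A, int n, uint64_t x)
{
    uint64_t acc = 0;
    int row = 0;

    for (int i = 1; i <= n; i++) {
        for (int j = i; j <= n; j++, row++) {
            /* bit i is counted from the most significant of the n bits */
            if (((x >> (n - i)) & 1) && ((x >> (n - j)) & 1))
                acc ^= A[row];
        }
    }
    return acc;
}
```

With A = {2, 1, 2} and n = 2, quad_op gives 2 for x = 2, 2 for y = 1 and 1 for x + y = 3, so the XOR of the first two results (0) differs from the third.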

Image processing further optimization

I'm new to optimization and was given the task of optimizing a function that processes an image as much as possible. It takes an image, blurs it and saves the blurred image, and then sharpens the image and saves the sharpened image as well.
Here is my code:
typedef struct {
unsigned char red;
unsigned char green;
unsigned char blue;
} pixel;
// I deleted the other struct because we can do the same operations using only addresses
// using a macro instead of a function is more efficient
#define calculateIndex(i, j, n) ((i)*(n)+(j))
// I combined all the functions into one because separate function calls are time consuming
void myfunction(Image *image, char* srcImgpName, char* blurRsltImgName, char* sharpRsltImgName) {
// variables of type 'register int' are meant to be more efficient than plain 'int'
register int i,j, ii, jj, sum_red, sum_green, sum_blue;
// using a local variable is more efficient than using a pointer to pixels of the original image and updating its value in each iteration
pixel current_pixel , p;
//dst will point on the first pixel in the image
pixel* dst = (pixel*)image->data;
int squareN = n*n;
//instead of multiply by 3 - I used shift
register int sizeToAllocate = ((squareN)<<1)+(squareN); // use variable from type 'register int' is much more efficient from 'int'
pixel* src = malloc(sizeToAllocate);
register int index;
//memcpy replace the old functions that converts chars to pixels or pixels to chars. it is very efficient and build-in in c libraries
memcpy(src, dst, sizeToAllocate);
///////////////////////////////////////// first step : smooth //////////////////////////////////////////////////////////////////////
/** The smooth step applies the blur kernel (a matrix of ints) to each pixel within the bounds, making the image smoother.
* This function originally used this matrix:
* [1, 1, 1]
* [1, 1, 1]
* [1, 1, 1]
* Because the matrix is all ones, we don't really need it, and accessing the matrix is expensive. Instead of the matrix I used
* primitive variables.
*/
//the loops are starting with 1 and not with 0 because we need to check only the pixels with 8 neighbors around them
index = calculateIndex(1, 1, n);
for (i = 1 ; i < n - 1; ++i) {
for (j = 1 ; j < n - 1 ; ++j) {
// I used this variables as counters to the colors' values around a specific pixel
sum_red = 0;
sum_green = 0;
sum_blue = 0;
for(ii = i-1; ii <= i+1; ++ii) {
for(jj =j-1; jj <= j+1; ++jj) {
//take care of the [ii,jj] pixel in the matrix
//calculate the address of the current pixel
pixel p = src[calculateIndex(ii, jj, n)];
//sum the colors' values of the neighbors of the current pixel
sum_red += p.red;
sum_green += p.green;
sum_blue += p.blue;
}
}
//calculate the average of the colors' values around the current pixel - as written in the instructions
sum_red = (((sum_red) * 0xE38F) >> 19);//instead of dividing by 9 - I used shift because it is more efficient
sum_green = (((sum_green) * 0xE38F) >> 19);//instead of dividing by 9 - I used shift because it is more efficient
sum_blue = (((sum_blue) * 0xE38F) >> 19);//instead of dividing by 9 - I used shift because it is more efficient
current_pixel.red = (unsigned char)sum_red;
current_pixel.green = (unsigned char)sum_green;
current_pixel.blue = (unsigned char)sum_blue;
dst[index++] = current_pixel;
}
index += 2; // skip the last pixel of this row and the first pixel of the next
}
// write result image to file
writeBMP(image, srcImgpName, blurRsltImgName);
//memcpy replace the old functions that converts chars to pixels or pixels to chars. it is very efficient and build-in in c libraries
memcpy(src, dst, sizeToAllocate);
///////////////////////////////////////// second step : sharp //////////////////////////////////////////////////////////////////////
/** Now I want to sharpen the smoothed image. In this step I apply the sharpen kernel (a matrix of ints) to each pixel within the bounds, making the image sharper.
* This function originally used this matrix:
* [-1, -1, -1]
* [-1, 9, -1]
* [-1, -1, -1]
* Because the matrix is almost entirely (-1), we don't really need it, and accessing the matrix is expensive. Instead of the matrix I used
* primitive variables: instead of multiplying by (-1) at the end of the step, I keep a running sum and
* subtract all the colors' values from it. The result is the same as multiplying by (-1), but more efficient.
*/
//the loops are starting with 1 and not with 0 because we need to check only the pixels with 8 neighbors around them
for (i = 1 ; i < n-1; ++i) {
for (j = 1 ; j < n-1 ; ++j) {
// I used this variables as counters to the colors' values around a specific pixel
sum_red = 0;
sum_green = 0;
sum_blue = 0;
// Do central pixel first
p=src[calculateIndex(i,j,n)];
sum_red = 10*p.red;
sum_green = 10*p.green;
sum_blue = 10*p.blue;
for(ii =i-1; ii <= i + 1; ++ii) {
for(jj = j-1; jj <= j + 1; ++jj) {
p = src[calculateIndex(ii, jj, n)];
//operate according to the instructions
sum_red -= p.red;
sum_green -= p.green;
sum_blue -= p.blue;
}
}
//each pixel's colors' values must match the range [0,255] - I used the idea from the original code
//the red value must be in the range [0,255]
if (sum_red < 0) {
sum_red = 0;
} else if (sum_red > 255 ) {
sum_red = 255;
}
current_pixel.red = (unsigned char)sum_red;
//the green value must be in the range [0,255]
if (sum_green < 0) {
sum_green = 0;
} else if (sum_green > 255 ) {
sum_green = 255;
}
current_pixel.green = (unsigned char)sum_green;
//the blue value must be in the range [0,255]
if (sum_blue < 0) {
sum_blue = 0;
} else if (sum_blue > 255 ) {
sum_blue = 255;
}
current_pixel.blue = (unsigned char)sum_blue;
// put the updated pixel in [i,j] in the image
dst[calculateIndex(i, j, n)] = current_pixel;
}
}
//free the allocated space to prevent memory leaks
free(src);
// write result image to file
writeBMP(image, srcImgpName, sharpRsltImgName);
}
I wanted to ask about the if statements: is there anything better that can replace them? More generally, can anyone spot any optimization mistakes here, or offer suggestions?
Thanks a lot!
updated code:
typedef struct {
unsigned char red;
unsigned char green;
unsigned char blue;
} pixel;
// I deleted the other struct because we can do the same operations using only addresses
// using a macro instead of a function is more efficient
#define calculateIndex(i, j, n) ((i)*(n)+(j))
// I combined all the functions into one because separate function calls are time consuming
void myfunction(Image *image, char* srcImgpName, char* blurRsltImgName, char* sharpRsltImgName) {
// variables of type 'register int' are meant to be more efficient than plain 'int'
register int i,j, ii, jj, sum_red, sum_green, sum_blue;
// using a local variable is more efficient than using a pointer to pixels of the original image and updating its value in each iteration
pixel current_pixel , p;
//dst will point on the first pixel in the image
pixel* dst = (pixel*)image->data;
int squareN = n*n;
//instead of multiply by 3 - I used shift
register int sizeToAllocate = ((squareN)<<1)+(squareN); // use variable from type 'register int' is much more efficient from 'int'
pixel* src = malloc(sizeToAllocate);
register int index;
//memcpy replace the old functions that converts chars to pixels or pixels to chars. it is very efficient and build-in in c libraries
memcpy(src, dst, sizeToAllocate);
///////////////////////////////////////// first step : smooth //////////////////////////////////////////////////////////////////////
/** The smooth step applies the blur kernel (a matrix of ints) to each pixel within the bounds, making the image smoother.
* This function originally used this matrix:
* [1, 1, 1]
* [1, 1, 1]
* [1, 1, 1]
* Because the matrix is all ones, we don't really need it, and accessing the matrix is expensive. Instead of the matrix I used
* primitive variables.
*/
//the loops are starting with 1 and not with 0 because we need to check only the pixels with 8 neighbors around them
index = calculateIndex(1, 1, n);
for (i = 1 ; i < n - 1; ++i) {
for (j = 1 ; j < n - 1 ; ++j) {
// I used this variables as counters to the colors' values around a specific pixel
sum_red = 0;
sum_green = 0;
sum_blue = 0;
for(ii = i-1; ii <= i+1; ++ii) {
for(jj =j-1; jj <= j+1; ++jj) {
//take care of the [ii,jj] pixel in the matrix
//calculate the address of the current pixel
pixel p = src[calculateIndex(ii, jj, n)];
//sum the colors' values of the neighbors of the current pixel
sum_red += p.red;
sum_green += p.green;
sum_blue += p.blue;
}
}
//calculate the average of the colors' values around the current pixel - as written in the instructions
sum_red = (((sum_red) * 0xE38F) >> 19);//instead of dividing by 9 - I used shift because it is more efficient
sum_green = (((sum_green) * 0xE38F) >> 19);//instead of dividing by 9 - I used shift because it is more efficient
sum_blue = (((sum_blue) * 0xE38F) >> 19);//instead of dividing by 9 - I used shift because it is more efficient
current_pixel.red = (unsigned char)sum_red;
current_pixel.green = (unsigned char)sum_green;
current_pixel.blue = (unsigned char)sum_blue;
dst[index++] = current_pixel;
}
index += 2;
}
// write result image to file
writeBMP(image, srcImgpName, blurRsltImgName);
//memcpy replace the old functions that converts chars to pixels or pixels to chars. it is very efficient and build-in in c libraries
memcpy(src, dst, sizeToAllocate);
///////////////////////////////////////// second step : sharp //////////////////////////////////////////////////////////////////////
/** Now I want to sharpen the smoothed image. In this step I apply the sharpen kernel (a matrix of ints) to each pixel within the bounds, making the image sharper.
* This function originally used this matrix:
* [-1, -1, -1]
* [-1, 9, -1]
* [-1, -1, -1]
* Because the matrix is almost entirely (-1), we don't really need it, and accessing the matrix is expensive. Instead of the matrix I used
* primitive variables: instead of multiplying by (-1) at the end of the step, I keep a running sum and
* subtract all the colors' values from it. The result is the same as multiplying by (-1), but more efficient.
*/
index = calculateIndex(1,1,n);
//the loops are starting with 1 and not with 0 because we need to check only the pixels with 8 neighbors around them
for (i = 1 ; i < n-1; ++i) {
for (j = 1 ; j < n-1 ; ++j) {
// I used this variables as counters to the colors' values around a specific pixel
sum_red = 0;
sum_green = 0;
sum_blue = 0;
// Do central pixel first
p=src[calculateIndex(i,j,n)];
sum_red = 10*p.red;
sum_green = 10*p.green;
sum_blue = 10*p.blue;
for(ii =i-1; ii <= i + 1; ++ii) {
for(jj = j-1; jj <= j + 1; ++jj) {
p = src[calculateIndex(ii, jj, n)];
//operate according to the instructions
sum_red -= p.red;
sum_green -= p.green;
sum_blue -= p.blue;
}
}
//each pixel's colors' values must match the range [0,255] - I used the idea from the original code
//the red value must be in the range [0,255]
if (sum_red < 0) {
sum_red = 0;
} else if (sum_red > 255 ) {
sum_red = 255;
}
current_pixel.red = (unsigned char)sum_red;
//the green value must be in the range [0,255]
if (sum_green < 0) {
sum_green = 0;
} else if (sum_green > 255 ) {
sum_green = 255;
}
current_pixel.green = (unsigned char)sum_green;
//the blue value must be in the range [0,255]
if (sum_blue < 0) {
sum_blue = 0;
} else if (sum_blue > 255 ) {
sum_blue = 255;
}
current_pixel.blue = (unsigned char)sum_blue;
// put the updated pixel in [i,j] in the image
dst[calculateIndex(i, j, n)] = current_pixel;
}
}
//free the allocated space to prevent memory leaks
free(src);
// write result image to file
writeBMP(image, srcImgpName, sharpRsltImgName);
}
updated code:
typedef struct {
unsigned char red;
unsigned char green;
unsigned char blue;
} pixel;
// I deleted the other struct because we can do the same operations using only addresses
// using a macro instead of a function is more efficient
#define calculateIndex(i, j, n) ((i)*(n)+(j))
// I combined all the functions into one because separate function calls are time consuming
void myfunction(Image *image, char* srcImgpName, char* blurRsltImgName, char* sharpRsltImgName) {
// variables of type 'register int' are meant to be more efficient than plain 'int'
register int i,j, ii, jj, sum_red, sum_green, sum_blue;
// using a local variable is more efficient than using a pointer to pixels of the original image and updating its value in each iteration
pixel current_pixel , p;
//dst will point on the first pixel in the image
pixel* dst = (pixel*)image->data;
int squareN = n*n;
//instead of multiply by 3 - I used shift
register int sizeToAllocate = ((squareN)<<1)+(squareN); // use variable from type 'register int' is much more efficient from 'int'
pixel* src = malloc(sizeToAllocate);
register int index;
//memcpy replace the old functions that converts chars to pixels or pixels to chars. it is very efficient and build-in in c libraries
memcpy(src, dst, sizeToAllocate);
///////////////////////////////////////// first step : smooth //////////////////////////////////////////////////////////////////////
/** The smooth step applies the blur kernel (a matrix of ints) to each pixel within the bounds, making the image smoother.
* This function originally used this matrix:
* [1, 1, 1]
* [1, 1, 1]
* [1, 1, 1]
* Because the matrix is all ones, we don't really need it, and accessing the matrix is expensive. Instead of the matrix I used
* primitive variables.
*/
//the loops are starting with 1 and not with 0 because we need to check only the pixels with 8 neighbors around them
index = n + 1;
for (i = 1 ; i < n - 1; ++i) {
for (j = 1 ; j < n - 1 ; ++j) {
// I used this variables as counters to the colors' values around a specific pixel
sum_red = 0;
sum_green = 0;
sum_blue = 0;
for(ii = i-1; ii <= i+1; ++ii) {
for(jj =j-1; jj <= j+1; ++jj) {
//take care of the [ii,jj] pixel in the matrix
//calculate the address of the current pixel
pixel p = src[calculateIndex(ii, jj, n)];
//sum the colors' values of the neighbors of the current pixel
sum_red += p.red;
sum_green += p.green;
sum_blue += p.blue;
}
}
//calculate the average of the colors' values around the current pixel - as written in the instructions
sum_red = (((sum_red) * 0xE38F) >> 19);//instead of dividing by 9 - I used shift because it is more efficient
sum_green = (((sum_green) * 0xE38F) >> 19);//instead of dividing by 9 - I used shift because it is more efficient
sum_blue = (((sum_blue) * 0xE38F) >> 19);//instead of dividing by 9 - I used shift because it is more efficient
current_pixel.red = (unsigned char)sum_red;
current_pixel.green = (unsigned char)sum_green;
current_pixel.blue = (unsigned char)sum_blue;
dst[index++] = current_pixel;
}
index += 2;
}
// write result image to file
writeBMP(image, srcImgpName, blurRsltImgName);
//memcpy replace the old functions that converts chars to pixels or pixels to chars. it is very efficient and build-in in c libraries
memcpy(src, dst, sizeToAllocate);
///////////////////////////////////////// second step : sharp //////////////////////////////////////////////////////////////////////
/** Now I want to sharpen the smoothed image. In this step I apply the sharpen kernel (a matrix of ints) to each pixel within the bounds, making the image sharper.
* This function originally used this matrix:
* [-1, -1, -1]
* [-1, 9, -1]
* [-1, -1, -1]
* Because the matrix is almost entirely (-1), we don't really need it, and accessing the matrix is expensive. Instead of the matrix I used
* primitive variables: instead of multiplying by (-1) at the end of the step, I keep a running sum and
* subtract all the colors' values from it. The result is the same as multiplying by (-1), but more efficient.
*/
index = calculateIndex(1,1,n);
//the loops are starting with 1 and not with 0 because we need to check only the pixels with 8 neighbors around them
for (i = 1 ; i < n-1; ++i) {
for (j = 1 ; j < n-1 ; ++j) {
// I used this variables as counters to the colors' values around a specific pixel
sum_red = 0;
sum_green = 0;
sum_blue = 0;
// Do central pixel first
p=src[index];
sum_red = 10*p.red;
sum_green = 10*p.green;
sum_blue = 10*p.blue;
for(ii =i-1; ii <= i + 1; ++ii) {
for(jj = j-1; jj <= j + 1; ++jj) {
p = src[calculateIndex(ii, jj, n)];
//operate according to the instructions
sum_red -= p.red;
sum_green -= p.green;
sum_blue -= p.blue;
}
}
//each pixel's colors' values must match the range [0,255] - I used the idea from the original code
//the red value must be in the range [0,255]
if (sum_red < 0) {
sum_red = 0;
} else if (sum_red > 255 ) {
sum_red = 255;
}
current_pixel.red = (unsigned char)sum_red;
//the green value must be in the range [0,255]
if (sum_green < 0) {
sum_green = 0;
} else if (sum_green > 255 ) {
sum_green = 255;
}
current_pixel.green = (unsigned char)sum_green;
//the blue value must be in the range [0,255]
if (sum_blue < 0) {
sum_blue = 0;
} else if (sum_blue > 255 ) {
sum_blue = 255;
}
current_pixel.blue = (unsigned char)sum_blue;
// put the updated pixel in [i,j] in the image; index advances with j so that src[index] above stays on the central pixel
dst[index++] = current_pixel;
}
index += 2;
}
//free the allocated space to prevent memory leaks
free(src);
// write result image to file
writeBMP(image, srcImgpName, sharpRsltImgName);
}

Some general optimization guidelines:
If you're running on x86, compile as a 64-bit binary. x86 is really a register-starved CPU. In 32-bit mode you pretty much have only 5 or 6 32-bit general-purpose registers available, and you only get "all" 6 if you compile with optimizations like -fomit-frame-pointer on GCC. In 64-bit mode you'll have 13 or 14 64-bit general-purpose registers.
Get a good compiler and use the highest possible general optimization level.
Profile! Profile! Profile! Actually profile your code so you actually know where the performance bottlenecks are. Any guesses about the location of any performance bottlenecks are likely wrong.
Once you find your bottlenecks, examine the actual instructions the compiler produces and look at the bottleneck areas, just to see what's happening. Perhaps the bottleneck is where the compiler had to do a lot of register spilling and filling because of register pressure. This can be really helpful if you can profile down to the instruction level.
Use the insights from the profiling and examination of the generated instructions to improve your code and compile arguments. For example, if you're seeing a lot of register spilling and filling, you need to reduce register pressure, perhaps by manually coalescing loops or disabling prefetching with a compiler option.
Experiment with different page size options. If a single row of pixels is a significant fraction of a page size, reaching into other rows is more likely to reach into another page and result in a TLB miss. Using larger memory pages may significantly reduce this.
Some specific ideas for your code:
Use only one outer loop. You'll have to experiment to find the fastest way to handle your "extra" edge pixels. The fastest way might be to not do anything special, roll right over them like "normal" pixels, and just ignore the values in them later.
Manually unroll the two inner loops - you're only doing 9 pixels.
Don't use calculateIndex() - use the address of the current pixel and find the other pixels simply by subtracting or adding the proper value from the current pixel address. For example, the address of the upper-left pixel in your inner loops would be something like currentPixelAddress - n - 1.
Those would convert your four-deep nested loops into a single loop with very little index calculations needed.
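A rough sketch of the unrolling and pointer-offset ideas for the blur step follows. It keeps a row loop so the border pixels stay untouched; the single-loop variant described above would additionally overwrite the left and right border columns. The name blur_interior and the SUM9 macro are illustrative, and the image is assumed to be a contiguous n x n array of the question's pixel struct:

```c
#include <stddef.h>

typedef struct {
    unsigned char red;
    unsigned char green;
    unsigned char blue;
} pixel;

/* Sum one channel over the 3x3 neighbourhood centred at p (row stride n),
   fully unrolled and addressed relative to the current pixel. */
#define SUM9(p, n, ch) \
    ((p)[-(n) - 1].ch + (p)[-(n)].ch + (p)[-(n) + 1].ch + \
     (p)[-1].ch       + (p)[0].ch    + (p)[1].ch        + \
     (p)[(n) - 1].ch  + (p)[(n)].ch  + (p)[(n) + 1].ch)

void blur_interior(const pixel *src, pixel *dst, int n)
{
    for (int i = 1; i < n - 1; ++i) {
        const pixel *p = src + (size_t)i * (size_t)n + 1;  /* pixel (i, 1) */
        pixel *q = dst + (size_t)i * (size_t)n + 1;

        for (int j = 1; j < n - 1; ++j, ++p, ++q) {
            q->red   = (unsigned char)(SUM9(p, n, red) / 9);
            q->green = (unsigned char)(SUM9(p, n, green) / 9);
            q->blue  = (unsigned char)(SUM9(p, n, blue) / 9);
        }
    }
}
```

Optimizing compilers typically turn the division by the constant 9 into a multiply and shift on their own, so the hand-written 0xE38F trick may not be needed; the generated code is worth checking.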

A few ideas - untested.
You have if(ii==i && jj==j) to test for the central pixel in your sharpening loop, which you do 9x for every pixel. I think it would be faster to remove that if and do exactly the same for every pixel, but then make a correction outside the loop by adding 10x the central pixel.
// Do central pixel first
p=src[calculateIndex(i,j,n)];
sum_red = 10*p.red;
sum_green = 10*p.green;
sum_blue = 10*p.blue;
for(ii =i-1; ii <= i + 1; ++ii) {
for(jj = j-1; jj <= j + 1; ++jj) {
p = src[calculateIndex(ii, jj, n)];
//operate according to the instructions
sum_red -= p.red;
sum_green -= p.green;
sum_blue -= p.blue;
}
}
Where you do dst[calculateIndex(i, j, n)] = current_pixel;, you can probably calculate the index once before the loop at the start and then just increment the pointer with each write inside the loop - assuming your arrays are contiguous and unpadded.
index=calculateIndex(1,1,n);
for (i = 1 ; i < n - 1; ++i) {
for (j = 1 ; j < n - 1 ; ++j) {
...
dst[index++] = current_pixel;
}
index+=2; // skip over last pixel of this line and first pixel of next line
}
As you move your 3x3 window of 9 pixels across the image, you could "remember" the left-most column of 3 pixels from the previous position, then instead of 9 additions for each pixel, you would do a single subtraction for the left-most column leaving the window and 3 additions for the new column entering the window on the right side, i.e. 4 calculations instead of 9.
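A sketch of that sliding-window idea, shown for a single-channel n x n image to keep it short (for RGB the same bookkeeping is repeated per channel). The function name blur_sliding is an assumption for this illustration:

```c
/* Box-blur the interior of an n x n single-channel image, reusing the
   column sums of the previous window position. */
void blur_sliding(const unsigned char *src, unsigned char *dst, int n)
{
    for (int i = 1; i < n - 1; ++i) {
        const unsigned char *up   = src + (i - 1) * n;
        const unsigned char *mid  = up + n;
        const unsigned char *down = mid + n;

        /* column sums for the first window position (columns 0, 1, 2) */
        int c0 = up[0] + mid[0] + down[0];
        int c1 = up[1] + mid[1] + down[1];
        int c2 = up[2] + mid[2] + down[2];
        int sum = c0 + c1 + c2;

        for (int j = 1; ; ++j) {
            dst[i * n + j] = (unsigned char)(sum / 9);
            if (j + 1 >= n - 1)
                break;  /* that was the last interior pixel of this row */

            /* slide right: column j - 1 leaves, column j + 2 enters */
            int cnew = up[j + 2] + mid[j + 2] + down[j + 2];
            sum += cnew - c0;
            c0 = c1;
            c1 = c2;
            c2 = cnew;
        }
    }
}
```

Each step after the first loads three new values and does one subtraction, instead of re-reading and re-adding all nine.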

Floating average with reading of ADC values

I want to compute a moving average or something similar, because I am getting noisy values from the ADC. This is my first try at computing a moving average, but the value goes to 0 every time. Can you help me?
This is the part of the code that does the magic:
char buffer[8]; // large enough for "1023\n" plus the terminating NUL
int samples = 0;
USART_Init0(MYUBRR);
uint16_t adc_result0, adc_result1;
float ADCaverage = 0;
while(1)
{
adc_result0 = adc_read(0); // read adc value at PA0
samples++;
//adc_result1 = adc_read(1); // read adc value at PA1
ADCaverage = (ADCaverage + adc_result0)/samples;
sprintf(buffer, "%d\n", (int)ADCaverage);
char * p = buffer;
while (*p) { USART_Transmit0(*p++); }
_delay_ms(1000);
}
return(0);
}
I send the result via USART to display the value.

Your equation is not correct.
Let s_n = (sum_{i=1}^{n} x[i])/n. Then:
s_(n-1) = (sum_{i=1}^{n-1} x[i])/(n-1)
sum_{i=1}^{n-1} x[i] = (n-1)*s_(n-1)
sum_{i=1}^{n} x[i] = n*s_n
sum_{i=1}^{n} x[i] = sum_{i=1}^{n-1} x[i] + x[n]
n*s_n = (n-1)*s_(n-1) + x[n] = n*s_(n-1) + (x[n]-s_(n-1))
s_n = s_(n-1) + (x[n]-s_(n-1))/n
You must use
ADCaverage += (adc_result0-ADCaverage)/samples;
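As a rough, self-contained sketch of that update rule (the running_mean struct and function name are hypothetical, not from the question's code; it keeps a cumulative mean over all samples seen so far):

```c
#include <stdint.h>

/* State for the incremental mean s_n = s_(n-1) + (x[n] - s_(n-1)) / n. */
typedef struct {
    float    mean;   /* current average, s_(n-1) before the update */
    uint32_t count;  /* number of samples seen so far, n */
} running_mean;

/* Folds one new ADC sample into the average and returns the new mean. */
float running_mean_update(running_mean *rm, uint16_t sample)
{
    rm->count += 1;
    rm->mean += ((float)sample - rm->mean) / (float)rm->count;
    return rm->mean;
}
```

Starting from a zeroed struct, feeding it 10, 20, 30 gives 10, 15, 20. Note that this averages over the whole history, so it reacts more and more slowly as count grows.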

You can use an exponential moving average which only needs 1 memory unit.
y[0] = (x[0] + y[-1] * (a-1) )/a
Where a is the filter factor.
If a is a power of 2, say a = 2^k, you can use shifts instead of the multiply and divide and optimize for speed significantly:
y[0] = ( x[0] + ( ( y[-1] << k ) - y[-1] ) ) >> k
This works especially well with left-aligned ADCs. Just keep an eye on the word size of the shift result.
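A small sketch of the shift-based filter step, writing the filter factor as 2^k. The function name ema_step and the choice of a 32-bit accumulator for 16-bit samples are my own assumptions, picked so the intermediate shift cannot overflow for k up to 15:

```c
#include <stdint.h>

/* One exponential-moving-average step with filter factor 2^k:
 *   y = (x + (2^k - 1) * y_prev) / 2^k
 * y_prev is expected to stay within 16 bits, so for k <= 15 the
 * intermediate (y_prev << k) fits in 32 bits. The right shift truncates. */
uint32_t ema_step(uint32_t y_prev, uint16_t x, unsigned k)
{
    return (x + ((y_prev << k) - y_prev)) >> k;
}
```

With k = 2 and a constant input of 1000, starting from 0, the output goes 250, 437, ... and climbs towards 1000. Because the shift truncates, it can settle a few counts below the input; keeping the accumulator scaled by 2^k is one way around that if it matters.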

I am writing a max filter. This replaces each pixel RGB channel with the maximum channel intensity of the surrounding 9 pixels

Here is the code I am using. When I run it, it doesn't seem to change anything in the image except the last 1/4 of it. That part turns to a solid color.
void maxFilter(pixel * data, int w, int h)
{
GLubyte tempRed;
GLubyte tempGreen;
GLubyte tempBlue;
int i;
int j;
int k;
int pnum = 0;
int pnumWrite = 0;
for(i = 0 ; i < (h - 2); i+=3) {
for(j = 0 ; j < (w - 2); j+=3) {
tempRed = 0;
tempGreen = 0;
tempBlue = 0;
for (k = 0 ; k < 3 ; k++){
if ((data[pnum].r) > tempRed){tempRed = (data[pnum + k].r);}
if ((data[pnum].g) > tempGreen){tempGreen = (data[pnum + k].g);}
if ((data[pnum].b) > tempBlue){tempBlue = (data[pnum + k].b);}
if ((data[(pnum + w)].r) > tempRed){tempRed = (data[(pnum + w)].r);}
if ((data[(pnum + w)].g) > tempGreen){tempGreen = (data[(pnum + w)].g);}
if ((data[(pnum + w)].b) > tempBlue){tempBlue = (data[(pnum + w)].b);}
if ((data[(pnum + 2 * w)].r) > tempRed){tempRed = (data[(pnum + 2 * w)].r);}
if ((data[(pnum + 2 * w)].g) > tempGreen){tempGreen = (data[(pnum + 2 * w)].g);}
if ((data[(pnum + 2 * w)].b) > tempBlue){tempBlue = (data[(pnum + 2 * w)].b);}
pnum++;
}
pnumWrite = pnum - 3;
for (k = 0 ; k < 3 ; k++){
((data[pnumWrite].r) = tempRed);
((data[pnumWrite].g) = tempGreen);
((data[pnumWrite].b) = tempBlue);
((data[(pnumWrite + w)].r) = tempRed);
((data[(pnumWrite + w)].g) = tempGreen);
((data[(pnumWrite + w)].b) = tempBlue);
((data[(pnumWrite + 2 * w)].r) = tempRed);
((data[(pnumWrite + 2 * w)].g) = tempGreen);
((data[(pnumWrite + 2 * w)].b) = tempBlue);
pnumWrite++;
}
}
}
}

I can see several problems with that code - being difficult to follow not being the least!
I think your main problem is that the loop body runs (as you probably intended) h/3 * w/3 times, once for each 3x3 block in the image. But the pnum index only increases by 3 for each block, so it reaches a maximum of about h*w/3 rather than the intended h*w. That means that only the first third of your image will be affected by your filter. (And I suspect your painting is done 'bottom-up', so that's why you see the lowest part change. I remember .bmp files being structured that way, but perhaps there are others as well.)
The 'cheap' fix would be to add 2*w at the right point, but nobody will ever understand that code again. I suggest you rewrite your indexing instead, and explicitly compute pnum from i and j in each turn through the loop. That can be improved on for readability, but is reasonably clear.
There's another minor thing: you have code like
if ((data[pnum].r) > tempRed){tempRed = (data[pnum + k].r);}
where the indexing on the right and on the left differ: this is probably also giving you results different from what you intended.
As Jongware points out, writing to the input array is always dangerous - your code is intended, I believe, to avoid that problem by only looking once into each 3x3 block, but his suggestion of a separate output array is very sensible - you probably don't want the blockiness your code gives anyway (you make each 3x3 block all one colour, don't you?), and his suggestion would let you avoid that.
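To make the suggestions above concrete, here is one possible sketch that computes the index from i and j and writes to a separate output array. It assumes pixel is a plain struct of three unsigned bytes (GLubyte is normally an unsigned char), and it does a per-pixel 3x3 neighbourhood maximum rather than the block-wise version in the question, so treat it as an illustration, not a drop-in replacement:

```c
/* Assumed layout; the question does not show the real definition. */
typedef struct { unsigned char r, g, b; } pixel;

/* For every pixel, take the per-channel maximum over its 3x3 neighbourhood.
 * Reads only from src and writes only to dst, so results never feed back
 * into later windows. Neighbours outside the image are skipped. */
void max_filter(const pixel *src, pixel *dst, int w, int h)
{
    for (int i = 0; i < h; i++) {
        for (int j = 0; j < w; j++) {
            pixel best = src[i * w + j];
            for (int di = -1; di <= 1; di++) {
                for (int dj = -1; dj <= 1; dj++) {
                    int y = i + di;
                    int x = j + dj;
                    if (y < 0 || y >= h || x < 0 || x >= w)
                        continue;               /* outside the image */
                    const pixel *p = &src[y * w + x];
                    if (p->r > best.r) best.r = p->r;
                    if (p->g > best.g) best.g = p->g;
                    if (p->b > best.b) best.b = p->b;
                }
            }
            dst[i * w + j] = best;
        }
    }
}
```

The index is always y * w + x, so there is no running counter to get out of step with the loop variables.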

Fastest and most efficient way to find the maximum no. that can be obtained by performing bitwise and on 2 DISTINCT elements of array

Given an array of non-negative integers, what is the fastest and most efficient way to find the maximum no. that can be obtained by performing bitwise and (i.e, & operator) on 2 DISTINCT elements of the array?
This is my code so far:
max = 0;
for(i=0; i<n; i++)
{
for(j=i+1; j<n; j++)
{
temp = a[i] & a[j];
if(temp > max)
max = temp;
}
}
This, of course, is the naive method. I am looking for a more efficient solution.
Maybe something like using a trie(actually a binary tree) to find max XOR of elements of array. The description for the max XOR solution can be found at http://threads-iiith.quora.com/Tutorial-on-Trie-and-example-problems?share=1

I hope I have got the question right. Here's my solution to it:
You have an array of integers, say that they are unsigned integers since we are dealing with bitwise operations. Let's think of them as a string of zeroes and ones in their binary representation and then put them on top of each other.
We now have their corresponding bits aligned vertically. Let's draw vertical lines, starting from the leftmost column. Whenever a column contains two or more 1s, rule out every row that does not have a 1 in that column. Ruled-out rows are disregarded when drawing the remaining vertical lines.
You see where this is going?
This goes on until exactly 2 rows are left that haven't been ruled out. If we end up with anything other than 2, it means something went wrong:
Fewer than 2 means we had fewer than 2 rows initially
More than 2 means that...
If there are fewer than what we had initially, then the ones left should all be identical
If there are exactly as many as we had initially, then it can be that all are the same, or every possible pair is bitwise distinct, meaning that every single pair produces 0
Here's the code I've written that follows the logic I've described above:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <string.h>
#define bit(_x_) (1U << (_x_))
void randomfillarray( unsigned int * arr, size_t size ) {
srand( time( NULL ) );
for ( int i = 0; i < size; i++ )
arr[i] = rand( );
}
int main( ) {
unsigned int arr[10];
size_t size = sizeof arr / sizeof * arr;
randomfillarray( arr, size );
unsigned int * resultantcouple = malloc( sizeof arr );
memcpy( resultantcouple, arr, sizeof arr );
for ( int i = 0; i < size; i++ )
printf( i ? " %u" : "%u", arr[i] );
putchar( '\n' );
int success = 0;
for ( unsigned int thebit = bit( sizeof( int ) * 8 - 1 ); thebit; thebit >>= 1 ) {
int count = 0;
int * indices = NULL;
for ( int i = 0; i < size; i++ ) {
if ( resultantcouple[i] & thebit ) {
indices = realloc( indices, ++count * sizeof * indices );
indices[count - 1] = i;
}
}
if ( count >= 2 ) {
size = count;
for ( int i = 0; i < size; i++ )
resultantcouple[i] = resultantcouple[indices[i]];
resultantcouple = realloc( resultantcouple, size * sizeof * resultantcouple );
}
if ( size == 2 ) {
success = 1;
free( indices ); // release this pass's index list before leaving the loop
break;
}
free( indices );
}
if ( success )
printf( "Success! %u and %u are the ones.", resultantcouple[0], resultantcouple[1] );
else
printf( "Failure! Either all pairs are bitwise distinct, or there are less than 2 elements, or something else..." );
putchar( '\n' );
return 0;
}
Here's the same in action: http://ideone.com/hRA8tn
I'm not sure if this is the best, but it should be better than testing every pair.

First look at and understand the heapsort algorithm.
Turn the array into a heap which lets you access the two largest elements. This is done in linear time, O (n).
Take the two largest elements, x = largest, y = second largest. If y = 0, the solution is 0. If the highest bit in x and the highest bit in y are the same, the solution is x & y. Otherwise, clear the highest bit in x, fix the heap, and try again. The last step takes O (log n) steps, and if you are using k bit integers, like 32 or 64, it is repeated at most k times.
No extra space needed, and linear time.
Pseudo-code:
If n ≤ 1 there is no solution.
Turn a [0] to a [n-1] into a heap with a [0] as the largest element.
Repeat
Let x = a [0].
Let y = a [1].
If n ≥ 3 and a [2] > a [1] then let y = a [2].
If y = 0 then the solution is 0.
Determine b = the highest bit of x.
If (y & b) != 0 then the solution is x & y.
Replace a [0] with x & (~ b)
Turn a [0] to a [n-1] into a heap again by moving a [0] down.
This assumes that a [i] and a [j] are considered "distinct array elements" if i ≠ j. If you require instead that a [i] ≠ a [j] then things are slightly different. You'd have to remove duplicate entries in your array, but in case the largest elements are for example 31 and 15, you don't want to clear the highest bit in 31 and then remove it as a duplicate! So the code is more difficult.
Let mask = ~0. In the following, when creating a heap compare a [i] & mask, not a [i].
Turn a [0] to a [n-1] into a heap with a [0] as the largest element.
Repeat
If n ≤ 1 then there is no solution.
Let x = a [0].
Let y = a [1].
If n ≥ 3 and a [2] & mask > y & mask then let y = a [2].
If x = y then let n = n - 1, let a [0] = a [n], restore the heap, and continue.
If (y & mask) = 0 then the solution is 0.
Determine b = the highest bit of x & mask.
If (y & b) != 0 then the solution is x & y.
Replace mask with mask & ~b.
Restore the heap and continue.
Worst case is O (n log n), for example if all elements are 1 except one that is 0.

The following worked for me for our_n uints in uint our_a[our_n], without changing the array or copying it or anything else. The essence is that in one pass down the array it identifies the next bit that can be added to the result so far. Each pass only considers values which contain all the bits of the result so far:
uint result ;
uint passes ;
uint msb ;
uint pn ;
at->start_clock = times(&at->start_tms) ;
result = 0 ;
passes = 0 ;
msb = (UINT_MAX >> 1) + 1 ;
pn = our_n ;
do
{
uint seen_once ;
uint seen_again ;
passes += 1 ;
seen_once = 0 ;
seen_again = 0 ;
for (uint i = 0 ; i < pn ; ++i)
{
uint a ;
a = our_a[i] ;
if ((a & result) == result)
{
seen_again |= (a & seen_once) ;
seen_once |= a ;
} ;
} ;
assert((seen_again & result) == result) ;
seen_again ^= result ;
while (msb > seen_again)
msb >>= 1 ;
result |= msb ;
}
while (msb > 1) ;
So, this is O(p * n), where p is the number of passes: 1..32.
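For reference, the same loop wrapped up as a standalone function might look roughly like this. The name max_pairwise_and and the use of unsigned / size_t instead of uint are my own choices, the timing call and pass counter are left out, and the result bits are masked off with &= ~result instead of the assert plus ^=:

```c
#include <limits.h>
#include <stddef.h>

/* Largest value of a[i] & a[j] over all i != j; returns 0 if n < 2.
 * The array is only read, never modified. */
unsigned max_pairwise_and(const unsigned *a, size_t n)
{
    unsigned result = 0;
    unsigned msb = (UINT_MAX >> 1) + 1;   /* highest bit of unsigned */

    do {
        unsigned seen_once = 0;    /* bits set in at least one candidate  */
        unsigned seen_again = 0;   /* bits set in at least two candidates */

        for (size_t i = 0; i < n; ++i) {
            /* Only values containing every bit found so far still count. */
            if ((a[i] & result) == result) {
                seen_again |= a[i] & seen_once;
                seen_once |= a[i];
            }
        }
        seen_again &= ~result;     /* drop the bits we already have */

        /* Lower msb to the highest remaining shared bit (0 if none). */
        while (msb > seen_again)
            msb >>= 1;
        result |= msb;
    } while (msb > 1);

    return result;
}
```

For example, with {12, 11, 10} the passes pick up bit 8 and then bit 2, giving 10 (from 11 & 10).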
If it is OK to destroy the contents of the array, then the inner loop can be changed to:
uint k = 0 ;
for (uint i = 0 ; i < pn ; ++i)
{
uint a ;
a = our_a[i] ;
if ((a & result) == result)
{
our_a[k++] = a ;
seen_again |= (a & seen_once) ;
seen_once |= a ;
} ;
} ;
pn = k ;
Of course, the first pass is now doing rather more work than it needs to, so doing that separately saves a bit more.
