Nested Loops in OpenACC

I'm trying to parallelize a nested for loop with OpenACC, but I don't understand why my code isn't working correctly. The following is the relevant part of my code:
int edgedetect_laplace(int height, // I: image height
int width, // I: image width
gray_t image[height][width], // I: input image
gray_t new_image[height][width]) // O: output image
{
// just for reproducible checksums...
for (int i = 0; i < height; i++)
{
for (int j = 0; j < width; j++)
{
new_image[i][j] = 0;
}
}
#pragma acc data copyin(image[:height][:width]) copyout(new_image[:height][:width])
{
#pragma acc parallel
{
#pragma acc loop
for (int i = 1; i < height - 1; i++)
{
#pragma acc loop
for (int j = 1; j < width - 1; ++j)
{
// apply laplace operator
unsigned int val = 4 * image[i][j] - image[i - 1][j] - image[i + 1][j] - image[i][j - 1] - image[i][j + 1];
/* store calculated value (map to correct range) */
new_image[i][j] = min(val, GRAY_MAX);
}
}
}
}
printf("time Laplace edge detection: %.6f s\n", t1 - t0);
unsigned long cs = checksum(height, width, new_image);
if (cs != REFERENCE_CHECKSUM_LAPLACE)
printf("\t error checksum Laplace: expected %lu, seen %lu\n", REFERENCE_CHECKSUM_LAPLACE, cs);
else
printf("checksum Laplace OK : %lu\n", cs);
return 0;
}
I have run the program sequentially and calculated the checksum to test if my parallelized version runs correctly. However, it does not (I'm getting a different checksum) and I don't see why.

It might be because you're assigning values to the interior of "new_image" but copying out the whole array. Since it's not initialized on the device, this means the halos will contain garbage values when copied back. Try using "copy" instead of "copyout" so "new_image" is initialized, or just copyout the interior elements.
If that's not the issue, please provide a minimally reproducing example.
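A minimal sketch of the first suggestion, assuming gray_t is an unsigned char and GRAY_MAX is 255 (neither definition is shown in the question) and using a hypothetical function name laplace_copy. The only change to the data region is copy in place of copyout, so the zeroed border of new_image is transferred to the device and comes back intact. When compiled without OpenACC support, the pragmas are ignored and the loops run sequentially.

```c
typedef unsigned char gray_t;   /* assumption: not shown in the question */
#define GRAY_MAX 255u           /* assumption: not shown in the question */

/* Hypothetical name; mirrors the loop nest from the question. */
void laplace_copy(int height, int width,
                  gray_t image[height][width],
                  gray_t new_image[height][width])
{
    /* Zero the output on the host so the border has defined values. */
    for (int i = 0; i < height; i++)
        for (int j = 0; j < width; j++)
            new_image[i][j] = 0;

    /* copy (not copyout): the host contents are sent to the device first,
       so the untouched border is still zero when the array comes back. */
    #pragma acc data copyin(image[:height][:width]) copy(new_image[:height][:width])
    {
        #pragma acc parallel loop collapse(2)
        for (int i = 1; i < height - 1; i++) {
            for (int j = 1; j < width - 1; j++) {
                /* Same arithmetic as the question: a negative result wraps
                   to a large unsigned value and is clamped to GRAY_MAX. */
                unsigned int val = 4 * image[i][j] - image[i - 1][j] - image[i + 1][j]
                                 - image[i][j - 1] - image[i][j + 1];
                new_image[i][j] = (gray_t)(val < GRAY_MAX ? val : GRAY_MAX);
            }
        }
    }
}
```

The alternative from the answer, copying out only the interior elements, would keep copyout but needs a sub-array range that skips the border rows and columns.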

Related

Blur filter in C results in only a slightly changed image

I am trying to write a blur filter in C that takes the pixels neighboring the main pixel, averages their RGB values, stores the result in a temp array, and then updates the image from the temp array. It seems correct, but it is not working as intended: the output is only very slightly blurred. I really don't see my mistake and would be very thankful for help. Sorry if I did something horrible; I started learning C last week.
I checked this post:
Blurring an Image in c pixel by pixel - special cases
but I did not see where I went wrong.
I'm working with this data struct:
BYTE rgbtBlue;
BYTE rgbtGreen;
BYTE rgbtRed;
void blur(int height, int width, RGBTRIPLE image[height][width])
{
// ints to use later
int j;
int p;
RGBTRIPLE temp[height][width];
for(int n = 0; n < height; n++) // loop to check every pixel
{
for(int k = 0; k < width; k++)
{
int widx = 3;
int hghtx = 3;
// conditionals for border cases
int y = 0;
if(n == 0)
{
p = 0;
hghtx = 2;
}
if(n == height - 1)
{
p = -1;
hghtx = 2;
}
if(k == 0)
{
j = 0;
widx = 2;
}
if(k == width - 1)
{
j = -1;
widx = 2;
}
for(int u = 0; u < hghtx; u++) // matrix of pixels around the main pixel using the conditionals gathered before
for(int i = 0; i < widx; i++)
if(y == 1) // takes the average of color and stores it in the RGB temp
{
temp[n][k].rgbtGreen = temp[n][k].rgbtGreen + image[n + p + u][k + j + i].rgbtGreen / (hghtx * widx);
temp[n][k].rgbtRed = temp[n][k].rgbtRed + image[n + p + u][k + j + i].rgbtRed / (hghtx * widx);
temp[n][k].rgbtBlue = temp[n][k].rgbtBlue + image[n + p + u][k + j + i].rgbtBlue / (hghtx * widx);
}
else // get first value of temp
{
temp[n][k].rgbtGreen = (image[n + p + u][k + j + i].rgbtGreen) / (hghtx * widx);
temp[n][k].rgbtRed = (image[n + p + u][k + j + i].rgbtRed) / (hghtx * widx);
temp[n][k].rgbtBlue = (image[n + p + u][k + j + i].rgbtBlue) / (hghtx * widx);
y++;
}
}
}
// changes the original image to the blured one
for(int n = 0; n < height; n++)
for(int k = 0; k < width; k++)
image[n][k] = temp[n][k];
}

I think it's a combination of things.
If the code worked the way you expect, you would still be doing a blur of just 3x3 pixels, and that can be hardly noticeable, especially on large images (I'm pretty sure it will be unnoticeable on an image of 4000x3000 pixels).
There are some problems with the code:
As @Fe2O3 says, at the end of the first line, widx will change to 2 and stay 2 for the rest of the image.
You are reading from temp[][] without initializing it. I think that if you compile that in release mode (not debug), temp[][] will contain random data and not all zeros as you probably expect (as @WeatherWane pointed out).
The way you calculate the average of the pixels is weird. If you use a matrix of 3x3 pixels, each pixel value should be divided by 9 in the final sum. But you divide the first pixel nine times by 2 (in effect doing /256), the second one eight times by 2 (so it's pixel/128), etc., until the last one is divided by 2. So basically, it's mostly the value of the bottom right pixel.
Also, since your RGB values are just bytes, you may want to divide them first and only then add them (or accumulate the sum in a wider type such as int), because otherwise you'll get overflows with wild results.
Try using a debugger to see the values you are actually calculating. It can be quite an eye opener :)
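A minimal sketch of a 3x3 box blur that avoids the problems listed above, assuming the BYTE and RGBTRIPLE definitions below (the question shows only the three fields) and a hypothetical function name box_blur. Each output pixel starts from fresh int sums, the sums are divided once by the number of neighbors that actually lie inside the image, and results go to a separate buffer so already-blurred pixels are never read back.

```c
#include <stdint.h>

typedef uint8_t BYTE;                 /* assumption: 8-bit channel type */
typedef struct {
    BYTE rgbtBlue;
    BYTE rgbtGreen;
    BYTE rgbtRed;
} RGBTRIPLE;                          /* assumption: wraps the fields from the question */

/* Hypothetical name. Blurs image in place via a temporary copy. */
void box_blur(int height, int width, RGBTRIPLE image[height][width])
{
    /* Variable-length array on the stack: fine for a sketch, but a very
       large image would need heap allocation instead. */
    RGBTRIPLE temp[height][width];

    for (int n = 0; n < height; n++) {
        for (int k = 0; k < width; k++) {
            int red = 0, green = 0, blue = 0, count = 0;  /* fresh sums per pixel */

            for (int dn = -1; dn <= 1; dn++) {
                for (int dk = -1; dk <= 1; dk++) {
                    int r = n + dn, c = k + dk;
                    if (r < 0 || r >= height || c < 0 || c >= width)
                        continue;             /* neighbor is outside the image */
                    red   += image[r][c].rgbtRed;
                    green += image[r][c].rgbtGreen;
                    blue  += image[r][c].rgbtBlue;
                    count++;
                }
            }

            /* Divide once, rounding to nearest; count is at least 1. */
            temp[n][k].rgbtRed   = (BYTE)((red   + count / 2) / count);
            temp[n][k].rgbtGreen = (BYTE)((green + count / 2) / count);
            temp[n][k].rgbtBlue  = (BYTE)((blue  + count / 2) / count);
        }
    }

    /* Copy the blurred result back only after every pixel was computed. */
    for (int n = 0; n < height; n++)
        for (int k = 0; k < width; k++)
            image[n][k] = temp[n][k];
}
```

Because the sums are ints, nine bytes (at most 9 * 255 = 2295) cannot overflow, so there is no need to divide each term before adding.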

Getting a segmentation fault at certain moment

I tried to apply a box blur to an image (without a matrix, just iterating over the 9 neighboring pixels), but I always get a segmentation fault when I reach the 408th pixel of the image (on the 1st row). I don't know what could cause it, because debugging with printf() didn't show any meaningful results.
void blur(int height, int width, RGBTRIPLE image[height][width])
{
BYTE totalRed, totalGreen, totalBlue;
totalRed = totalGreen = totalBlue = 0;
for (int i = 1; i < height - 1; i++)
{
for (int j = 1; j < width - 1; j++)
{
for (int h = -1; h <= 1; h++)
{
for (int w = -1; w <= 1; w++)
{
totalRed += image[i + h][j + w].rgbtRed;
totalGreen += image[i + h][j + w].rgbtGreen;
totalBlue += image[i + h][j + w].rgbtBlue;
}
}
image[j][i].rgbtRed = round((totalRed / 9));
image[j][i].rgbtGreen = round((totalGreen / 9));
image[j][i].rgbtBlue = round((totalBlue / 9));
}
}
return;
}
EDIT
I fixed the issue, thanks to everyone who answered me.

The problem is you transposed the index values for storing the updated value: image[j][i].rgbtRed = round((totalRed / 9)) should be
image[i][j].rgbtRed = round((totalRed / 9));
image[i][j].rgbtGreen = round((totalGreen / 9));
image[i][j].rgbtBlue = round((totalBlue / 9));
Note however that you overwrite the pixels in row i that will be used for blurring the next row, which is incorrect. Also note that you should make special cases for the boundary rows and columns. More work is needed on the algorithm.

I would suggest that you post a minimal "working" example that we could compile and reproduce results with on something like Compiler Explorer.
As @Fe2O3 commented on the original post, you have i and j flipped in these assignments:
image[j][i].rgbtRed = round((totalRed / 9));
image[j][i].rgbtGreen = round((totalGreen / 9));
image[j][i].rgbtBlue = round((totalBlue / 9));
This could cause problems whenever the image is not square.
Additionally, you're using a byte-sized variable to store the sum of 9 byte values. The sum can reach 9*255=2295, which does not fit in a byte (maximum 255), so I'd highly recommend widening totalRed/Green/Blue to at least 16 bits.
Finally, as @Some Programmer Dude suggested, there's nothing to round: in C, dividing two integers does not convert the result to float/double. The value is truncated toward zero, so your result will look like
if (x > 0) {
floor(x)
} else if (x < 0) {
ceil(x)
} else {
0
}
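A small sketch of the two points above, assuming BYTE is an 8-bit unsigned type and using hypothetical helper names sum9 and average9_rounded. The sum is kept in an int, and rounding is done with integer arithmetic instead of calling round() on an already truncated quotient.

```c
#include <stdint.h>

typedef uint8_t BYTE;   /* assumption: 8-bit channel type, as in the question */

/* Sum nine channel values in an int. A BYTE accumulator would wrap
   modulo 256 long before reaching the maximum of 9 * 255 = 2295. */
int sum9(const BYTE values[9])
{
    int total = 0;
    for (int i = 0; i < 9; i++)
        total += values[i];
    return total;
}

/* total / 9 truncates; adding half the divisor first rounds to nearest.
   Intended for non-negative totals such as the result of sum9(). */
BYTE average9_rounded(int total)
{
    return (BYTE)((total + 4) / 9);
}
```

The accumulator also has to be reset to zero for every pixel; in the question it is initialized only once, before the loops.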

How can I properly implement zero cross triggering for digital oscilloscope in C?

So I'm doing a simple oscilloscope in C. It reads audio data from the output buffer (and drops buffer write counter when called so the buffer is refreshed). I tried making simple zero-cross triggering since most of the time users will see simple (sine, pulse, saw, triangle) waves but the best result I got with the code below is a wave that jumps back and forth for half of its cycle. What is wrong?
Signal that is fed in goes from -32768 to 32767 so zero is where it should be.
If you didn't understand what I meant you can see the video: click
Upd: Removed the code unrelated to triggering so all function may be understood easier.
extern Mused mused;
void update_oscillscope_view(GfxDomain *dest, const SDL_Rect* area)
{
if (mused.output_buffer_counter >= OSC_SIZE * 12) {
mused.output_buffer_counter = 0;
}
for (int x = 0; x < area->h * 0.5; x++) {
//drawing a black rect so bevel is hidden when it is under oscilloscope
gfx_line(domain,
area->x, area->y + 2 * x,
area->x + area->w - 1, area->y + 2 * x,
colors[COLOR_WAVETABLE_BACKGROUND]);
}
Sint32 sample, last_sample, scaled_sample;
for (int i = 0; i < 2048; i++) {
if (mused.output_buffer[i] < 0 && mused.output_buffer[i - 1] > 0) {
//here comes the part with triggering
if (i < OSC_SIZE * 2) {
for (int x = i; x < area->w + i; ++x) {
last_sample = scaled_sample;
sample = (mused.output_buffer[2 * x] + mused.output_buffer[2 * x + 1]) / 2;
if (sample > OSC_MAX_CLAMP) { sample = OSC_MAX_CLAMP; }
if (sample < -OSC_MAX_CLAMP) { sample = -OSC_MAX_CLAMP; }
if (last_sample > OSC_MAX_CLAMP) { last_sample = OSC_MAX_CLAMP; }
if (last_sample < -OSC_MAX_CLAMP) { last_sample = -OSC_MAX_CLAMP; }
scaled_sample = (sample * OSC_SIZE) / 32768;
if(x != i) {
gfx_line(domain,
area->x + x - i - 1, area->h / 2 + area->y + last_sample,
area->x + x - i, area->h / 2 + area->y + scaled_sample,
colors[COLOR_WAVETABLE_SAMPLE]);
}
}
}
return;
}
}
}

During debugging, I simplified the code until it started working. Thanks, Clifford.
I found a trigger index i (let's say it is array index 300). I had written the code so that the oscilloscope drew lines from [(2 * i) + offset] to [(2 * i + 1) + offset], so an incorrect picture was formed.
I used (2 * i) because I wanted long waves to fit into the oscilloscope. I replaced it with drawing from [i + offset] to [i + 1 + offset], and that solved the problem.
Afterwards, I implemented the "horizontal scale 0.5x" properly.
The output waveform still jumps a little, but overall it holds in place.
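A minimal sketch of the trigger search described above, assuming 16-bit signed samples (the question states the signal spans -32768 to 32767) and a hypothetical helper name find_falling_zero_cross. It uses the same positive-to-negative test as the question, but starts at index 1 so that buf[i - 1] never reads before the start of the buffer.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper. Returns the index of the first sample that is
   negative while its predecessor is positive, or -1 if there is none. */
ptrdiff_t find_falling_zero_cross(const int16_t *buf, size_t len)
{
    for (size_t i = 1; i < len; i++) {
        if (buf[i] < 0 && buf[i - 1] > 0)
            return (ptrdiff_t)i;
    }
    return -1;
}
```

With the trigger index in hand, the drawing loop would connect consecutive samples buf[trigger + offset] and buf[trigger + offset + 1], which is the change that fixed the picture in the answer above.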

Image processing: further optimization

I'm new to optimization and was given a task to optimize a function that processes an image as much as possible. It takes an image, blurs it and saves the blurred image, then sharpens the image and saves the sharpened image as well.
Here is my code:
typedef struct {
unsigned char red;
unsigned char green;
unsigned char blue;
} pixel;
// I delete the other struct because we can do the same operations with use of only addresses
//use macro instead of function is more efficient
#define calculateIndex(i, j, n) ((i)*(n)+(j))
// I combine all the functions in one because it is time consuming
void myfunction(Image *image, char* srcImgpName, char* blurRsltImgName, char* sharpRsltImgName) {
// use variable from type 'register int' is much more efficient from 'int'
register int i,j, ii, jj, sum_red, sum_green, sum_blue;
//using local variable is much more efficient than using pointer to pixels from the original image,and updat its value in each iteration
pixel current_pixel , p;
//dst will point on the first pixel in the image
pixel* dst = (pixel*)image->data;
int squareN = n*n;
//instead of multiply by 3 - I used shift
register int sizeToAllocate = ((squareN)<<1)+(squareN); // use variable from type 'register int' is much more efficient from 'int'
pixel* src = malloc(sizeToAllocate);
register int index;
//memcpy replace the old functions that converts chars to pixels or pixels to chars. it is very efficient and build-in in c libraries
memcpy(src, dst, sizeToAllocate);
///////////////////////////////////////// first step : smooth //////////////////////////////////////////////////////////////////////
/**the smooth blur is step that apply the blur-kernel (matrix of ints) over each pixel in the bouns - and make the image more smooth.
*this function was originally used this matrix :
* [1, 1, 1]
* [1, 1, 1]
* [1, 1, 1]
*because the matrix is full of 1 , we don't really need it - the access to the matrix is very expensive . instead of the matrix I used
*primitive variable.
*/
//the loops are starting with 1 and not with 0 because we need to check only the pixels with 8 neighbors around them
index = calculateIndex(1, 1, n);
for (i = 1 ; i < n - 1; ++i) {
for (j = 1 ; j < n - 1 ; ++j) {
// I used this variables as counters to the colors' values around a specific pixel
sum_red = 0;
sum_green = 0;
sum_blue = 0;
for(ii = i-1; ii <= i+1; ++ii) {
for(jj =j-1; jj <= j+1; ++jj) {
//take care of the [ii,jj] pixel in the matrix
//calculate the adrees of the current pixel
pixel p = src[calculateIndex(ii, jj, n)];
//sum the colors' values of the neighbors of the current pixel
sum_red += p.red;
sum_green += p.green;
sum_blue += p.blue;
}
}
//calculate the avarage of the colors' values around the current pixel - as written in the instructions
sum_red = (((sum_red) * 0xE38F) >> 19);//instead of dividing by 9 - I used shift because it is more efficient
sum_green = (((sum_green) * 0xE38F) >> 19);//instead of dividing by 9 - I used shift because it is more efficient
sum_blue = (((sum_blue) * 0xE38F) >> 19);//instead of dividing by 9 - I used shift because it is more efficient
current_pixel.red = (unsigned char)sum_red;
current_pixel.green = (unsigned char)sum_green;
current_pixel.blue = (unsigned char)sum_blue;
dst[index++] = current_pixel;
}
}
// write result image to file
writeBMP(image, srcImgpName, blurRsltImgName);
//memcpy replace the old functions that converts chars to pixels or pixels to chars. it is very efficient and build-in in c libraries
memcpy(src, dst, sizeToAllocate);
///////////////////////////////////////// second step : sharp //////////////////////////////////////////////////////////////////////
/** I want to sharp the smooth image . In this step I apply the sharpen kernel (matrix of ints) over each pixel in the bouns - and make the image more sharp.
*this function was originally used this matrix :
* [-1, -1, -1]
* [-1, 9, -1]
* [-1, -1, -1]
*because the matrix is full of (-1) , we don't really need it - the access to the matrix is very expensive . instead of the matrix I used
*primitive variable. I operato like that : insted of multiply in (-1) in the end of the step , I define counter initializes with zero , and
*substruct all te colors' values from it. the result is actually the same as multiply by (-1), in more efficient way.
*/
//the loops are starting with 1 and not with 0 because we need to check only the pixels with 8 neighbors around them
for (i = 1 ; i < n-1; ++i) {
for (j = 1 ; j < n-1 ; ++j) {
// I used this variables as counters to the colors' values around a specific pixel
sum_red = 0;
sum_green = 0;
sum_blue = 0;
// Do central pixel first
p=src[calculateIndex(i,j,n)];
sum_red = 10*p.red;
sum_green = 10*p.green;
sum_blue = 10*p.blue;
for(ii =i-1; ii <= i + 1; ++ii) {
for(jj = j-1; jj <= j + 1; ++jj) {
p = src[calculateIndex(ii, jj, n)];
//operate according to the instructions
sum_red -= p.red;
sum_green -= p.green;
sum_blue -= p.blue;
}
}
//each pixel's colors' values must match the range [0,255] - I used the idea from the original code
//the red value must be in the range [0,255]
if (sum_red < 0) {
sum_red = 0;
} else if (sum_red > 255 ) {
sum_red = 255;
}
current_pixel.red = (unsigned char)sum_red;
//the green value must be in the range [0,255]
if (sum_green < 0) {
sum_green = 0;
} else if (sum_green > 255 ) {
sum_green = 255;
}
current_pixel.green = (unsigned char)sum_green;
//the blue value must be in the range [0,255]
if (sum_blue < 0) {
sum_blue = 0;
} else if (sum_blue > 255 ) {
sum_blue = 255;
}
current_pixel.blue = (unsigned char)sum_blue;
// put the updated pixel in [i,j] in the image
dst[calculateIndex(i, j, n)] = current_pixel;
}
}
//free the allocated space to prevent memory leaks
free(src);
// write result image to file
writeBMP(image, srcImgpName, sharpRsltImgName);
}
I wanted to ask about the if statements: is there anything better that can replace them? More generally, can anyone spot any optimization mistakes here, or offer other suggestions?
Thanks a lot!
updated code:
typedef struct {
unsigned char red;
unsigned char green;
unsigned char blue;
} pixel;
// I delete the other struct because we can do the same operations with use of only addresses
//use macro instead of function is more efficient
#define calculateIndex(i, j, n) ((i)*(n)+(j))
// I combine all the functions in one because it is time consuming
void myfunction(Image *image, char* srcImgpName, char* blurRsltImgName, char* sharpRsltImgName) {
// use variable from type 'register int' is much more efficient from 'int'
register int i,j, ii, jj, sum_red, sum_green, sum_blue;
//using local variable is much more efficient than using pointer to pixels from the original image,and updat its value in each iteration
pixel current_pixel , p;
//dst will point on the first pixel in the image
pixel* dst = (pixel*)image->data;
int squareN = n*n;
//instead of multiply by 3 - I used shift
register int sizeToAllocate = ((squareN)<<1)+(squareN); // use variable from type 'register int' is much more efficient from 'int'
pixel* src = malloc(sizeToAllocate);
register int index;
//memcpy replace the old functions that converts chars to pixels or pixels to chars. it is very efficient and build-in in c libraries
memcpy(src, dst, sizeToAllocate);
///////////////////////////////////////// first step : smooth //////////////////////////////////////////////////////////////////////
/**the smooth blur is step that apply the blur-kernel (matrix of ints) over each pixel in the bouns - and make the image more smooth.
*this function was originally used this matrix :
* [1, 1, 1]
* [1, 1, 1]
* [1, 1, 1]
*because the matrix is full of 1 , we don't really need it - the access to the matrix is very expensive . instead of the matrix I used
*primitive variable.
*/
//the loops are starting with 1 and not with 0 because we need to check only the pixels with 8 neighbors around them
index = calculateIndex(1, 1, n);
for (i = 1 ; i < n - 1; ++i) {
for (j = 1 ; j < n - 1 ; ++j) {
// I used this variables as counters to the colors' values around a specific pixel
sum_red = 0;
sum_green = 0;
sum_blue = 0;
for(ii = i-1; ii <= i+1; ++ii) {
for(jj =j-1; jj <= j+1; ++jj) {
//take care of the [ii,jj] pixel in the matrix
//calculate the adrees of the current pixel
pixel p = src[calculateIndex(ii, jj, n)];
//sum the colors' values of the neighbors of the current pixel
sum_red += p.red;
sum_green += p.green;
sum_blue += p.blue;
}
}
//calculate the avarage of the colors' values around the current pixel - as written in the instructions
sum_red = (((sum_red) * 0xE38F) >> 19);//instead of dividing by 9 - I used shift because it is more efficient
sum_green = (((sum_green) * 0xE38F) >> 19);//instead of dividing by 9 - I used shift because it is more efficient
sum_blue = (((sum_blue) * 0xE38F) >> 19);//instead of dividing by 9 - I used shift because it is more efficient
current_pixel.red = (unsigned char)sum_red;
current_pixel.green = (unsigned char)sum_green;
current_pixel.blue = (unsigned char)sum_blue;
dst[index++] = current_pixel;
}
index += 2;
}
// write result image to file
writeBMP(image, srcImgpName, blurRsltImgName);
//memcpy replace the old functions that converts chars to pixels or pixels to chars. it is very efficient and build-in in c libraries
memcpy(src, dst, sizeToAllocate);
///////////////////////////////////////// second step : sharp //////////////////////////////////////////////////////////////////////
/** I want to sharp the smooth image . In this step I apply the sharpen kernel (matrix of ints) over each pixel in the bouns - and make the image more sharp.
*this function was originally used this matrix :
* [-1, -1, -1]
* [-1, 9, -1]
* [-1, -1, -1]
*because the matrix is full of (-1) , we don't really need it - the access to the matrix is very expensive . instead of the matrix I used
*primitive variable. I operato like that : insted of multiply in (-1) in the end of the step , I define counter initializes with zero , and
*substruct all te colors' values from it. the result is actually the same as multiply by (-1), in more efficient way.
*/
index = calculateIndex(1,1,n);
//the loops are starting with 1 and not with 0 because we need to check only the pixels with 8 neighbors around them
for (i = 1 ; i < n-1; ++i) {
for (j = 1 ; j < n-1 ; ++j) {
// I used this variables as counters to the colors' values around a specific pixel
sum_red = 0;
sum_green = 0;
sum_blue = 0;
// Do central pixel first
p=src[index];
sum_red = 10*p.red;
sum_green = 10*p.green;
sum_blue = 10*p.blue;
for(ii =i-1; ii <= i + 1; ++ii) {
for(jj = j-1; jj <= j + 1; ++jj) {
p = src[calculateIndex(ii, jj, n)];
//operate according to the instructions
sum_red -= p.red;
sum_green -= p.green;
sum_blue -= p.blue;
}
index += 2;
}
//each pixel's colors' values must match the range [0,255] - I used the idea from the original code
//the red value must be in the range [0,255]
if (sum_red < 0) {
sum_red = 0;
} else if (sum_red > 255 ) {
sum_red = 255;
}
current_pixel.red = (unsigned char)sum_red;
//the green value must be in the range [0,255]
if (sum_green < 0) {
sum_green = 0;
} else if (sum_green > 255 ) {
sum_green = 255;
}
current_pixel.green = (unsigned char)sum_green;
//the blue value must be in the range [0,255]
if (sum_blue < 0) {
sum_blue = 0;
} else if (sum_blue > 255 ) {
sum_blue = 255;
}
current_pixel.blue = (unsigned char)sum_blue;
// put the updated pixel in [i,j] in the image
dst[calculateIndex(i, j, n)] = current_pixel;
}
}
//free the allocated space to prevent memory leaks
free(src);
// write result image to file
writeBMP(image, srcImgpName, sharpRsltImgName);
}
updated code:
typedef struct {
unsigned char red;
unsigned char green;
unsigned char blue;
} pixel;
// I delete the other struct because we can do the same operations with use of only addresses
//use macro instead of function is more efficient
#define calculateIndex(i, j, n) ((i)*(n)+(j))
// I combine all the functions in one because it is time consuming
void myfunction(Image *image, char* srcImgpName, char* blurRsltImgName, char* sharpRsltImgName) {
// use variable from type 'register int' is much more efficient from 'int'
register int i,j, ii, jj, sum_red, sum_green, sum_blue;
//using local variable is much more efficient than using pointer to pixels from the original image,and updat its value in each iteration
pixel current_pixel , p;
//dst will point on the first pixel in the image
pixel* dst = (pixel*)image->data;
int squareN = n*n;
//instead of multiply by 3 - I used shift
register int sizeToAllocate = ((squareN)<<1)+(squareN); // use variable from type 'register int' is much more efficient from 'int'
pixel* src = malloc(sizeToAllocate);
register int index;
//memcpy replace the old functions that converts chars to pixels or pixels to chars. it is very efficient and build-in in c libraries
memcpy(src, dst, sizeToAllocate);
///////////////////////////////////////// first step : smooth //////////////////////////////////////////////////////////////////////
/**the smooth blur is step that apply the blur-kernel (matrix of ints) over each pixel in the bouns - and make the image more smooth.
*this function was originally used this matrix :
* [1, 1, 1]
* [1, 1, 1]
* [1, 1, 1]
*because the matrix is full of 1 , we don't really need it - the access to the matrix is very expensive . instead of the matrix I used
*primitive variable.
*/
//the loops are starting with 1 and not with 0 because we need to check only the pixels with 8 neighbors around them
index = n + 1;
for (i = 1 ; i < n - 1; ++i) {
for (j = 1 ; j < n - 1 ; ++j) {
// I used this variables as counters to the colors' values around a specific pixel
sum_red = 0;
sum_green = 0;
sum_blue = 0;
for(ii = i-1; ii <= i+1; ++ii) {
for(jj =j-1; jj <= j+1; ++jj) {
//take care of the [ii,jj] pixel in the matrix
//calculate the adrees of the current pixel
pixel p = src[calculateIndex(ii, jj, n)];
//sum the colors' values of the neighbors of the current pixel
sum_red += p.red;
sum_green += p.green;
sum_blue += p.blue;
}
}
//calculate the avarage of the colors' values around the current pixel - as written in the instructions
sum_red = (((sum_red) * 0xE38F) >> 19);//instead of dividing by 9 - I used shift because it is more efficient
sum_green = (((sum_green) * 0xE38F) >> 19);//instead of dividing by 9 - I used shift because it is more efficient
sum_blue = (((sum_blue) * 0xE38F) >> 19);//instead of dividing by 9 - I used shift because it is more efficient
current_pixel.red = (unsigned char)sum_red;
current_pixel.green = (unsigned char)sum_green;
current_pixel.blue = (unsigned char)sum_blue;
dst[index++] = current_pixel;
}
index += 2;
}
// write result image to file
writeBMP(image, srcImgpName, blurRsltImgName);
//memcpy replace the old functions that converts chars to pixels or pixels to chars. it is very efficient and build-in in c libraries
memcpy(src, dst, sizeToAllocate);
///////////////////////////////////////// second step : sharp //////////////////////////////////////////////////////////////////////
/** I want to sharp the smooth image . In this step I apply the sharpen kernel (matrix of ints) over each pixel in the bouns - and make the image more sharp.
*this function was originally used this matrix :
* [-1, -1, -1]
* [-1, 9, -1]
* [-1, -1, -1]
*because the matrix is full of (-1) , we don't really need it - the access to the matrix is very expensive . instead of the matrix I used
*primitive variable. I operate like that : instead of multiply in (-1) in the end of the step , I define counter initializes with zero , and
*substruct all te colors' values from it. the result is actually the same as multiply by (-1), in more efficient way.
*/
index = calculateIndex(1,1,n);
//the loops are starting with 1 and not with 0 because we need to check only the pixels with 8 neighbors around them
for (i = 1 ; i < n-1; ++i) {
for (j = 1 ; j < n-1 ; ++j) {
// I used this variables as counters to the colors' values around a specific pixel
sum_red = 0;
sum_green = 0;
sum_blue = 0;
// Do central pixel first
p=src[index];
sum_red = 10*p.red;
sum_green = 10*p.green;
sum_blue = 10*p.blue;
for(ii =i-1; ii <= i + 1; ++ii) {
for(jj = j-1; jj <= j + 1; ++jj) {
p = src[calculateIndex(ii, jj, n)];
//operate according to the instructions
sum_red -= p.red;
sum_green -= p.green;
sum_blue -= p.blue;
}
}
//each pixel's colors' values must match the range [0,255] - I used the idea from the original code
//the red value must be in the range [0,255]
if (sum_red < 0) {
sum_red = 0;
} else if (sum_red > 255 ) {
sum_red = 255;
}
current_pixel.red = (unsigned char)sum_red;
//the green value must be in the range [0,255]
if (sum_green < 0) {
sum_green = 0;
} else if (sum_green > 255 ) {
sum_green = 255;
}
current_pixel.green = (unsigned char)sum_green;
//the blue value must be in the range [0,255]
if (sum_blue < 0) {
sum_blue = 0;
} else if (sum_blue > 255 ) {
sum_blue = 255;
}
current_pixel.blue = (unsigned char)sum_blue;
// put the updated pixel in [i,j] in the image
dst[calculateIndex(i, j, n)] = current_pixel;
}
index += 2;
}
//free the allocated space to prevent memory leaks
free(src);
// write result image to file
writeBMP(image, srcImgpName, sharpRsltImgName);
}

Some general optimization guidelines:
If you're running on x86, compile as a 64-bit binary. x86 is really a register-starved CPU. In 32-bit mode you pretty much have only 5 or 6 32-bit general-purpose registers available, and you only get "all" 6 if you compile with optimizations like -fomit-frame-pointer on GCC. In 64-bit mode you'll have 13 or 14 64-bit general-purpose registers.
Get a good compiler and use the highest possible general optimization level.
Profile! Profile! Profile! Actually profile your code so you actually know where the performance bottlenecks are. Any guesses about the location of performance bottlenecks are likely wrong.
Once you find your bottlenecks, examine the actual instructions the compiler produces and look at the bottleneck areas, just to see what's happening. Perhaps the bottleneck is where the compiler had to do a lot of register spilling and filling because of register pressure. This can be really helpful if you can profile down to the instruction level.
Use the insights from the profiling and examination of the generated instructions to improve your code and compile arguments. For example, if you're seeing a lot of register spilling and filling, you need to reduce register pressure, perhaps by manually coalescing loops or disabling prefetching with a compiler option.
Experiment with different page size options. If a single row of pixels is a significant fraction of a page size, reaching into other rows is more likely to reach into another page and result in a TLB miss. Using larger memory pages may significantly reduce this.
Some specific ideas for your code:
Use only one outer loop. You'll have to experiment to find the fastest way to handle your "extra" edge pixels. The fastest way might be to not do anything special, roll right over them like "normal" pixels, and just ignore the values in them later.
Manually unroll the two inner loops - you're only doing 9 pixels.
Don't use calculateIndex() - use the address of the current pixel and find the other pixels simply by subtracting or adding the proper value from the current pixel address. For example, the address of the upper-left pixel in your inner loops would be something like currentPixelAddress - n - 1.
Those would convert your four-deep nested loops into a single loop with very little index calculations needed.
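A rough sketch of the specific ideas above applied to the blur step, using a hypothetical function name blur_single_loop and the pixel struct from the question. It assumes src and dst are distinct n*n buffers and that dst already holds a copy of the image. One loop runs over the flat index, neighbors are reached by adding or subtracting offsets from the current pixel address, and the left and right edge pixels that the loop rolls over are restored afterwards.

```c
typedef struct {
    unsigned char red;
    unsigned char green;
    unsigned char blue;
} pixel;

/* Hypothetical name. dst is expected to start as a copy of src, so the
   top and bottom rows keep their original values. */
void blur_single_loop(const pixel *src, pixel *dst, int n)
{
    if (n < 3)
        return;                               /* no interior pixels */

    const int first = n + 1;                  /* pixel (1, 1) */
    const int last  = n * (n - 1) - 2;        /* pixel (n-2, n-2) */

    for (int idx = first; idx <= last; ++idx) {
        const pixel *c    = src + idx;        /* current pixel */
        const pixel *up   = c - n;            /* same column, row above */
        const pixel *down = c + n;            /* same column, row below */

        /* The two inner loops, unrolled by hand. */
        int red   = up[-1].red   + up[0].red   + up[1].red
                  + c[-1].red    + c[0].red    + c[1].red
                  + down[-1].red + down[0].red + down[1].red;
        int green = up[-1].green   + up[0].green   + up[1].green
                  + c[-1].green    + c[0].green    + c[1].green
                  + down[-1].green + down[0].green + down[1].green;
        int blue  = up[-1].blue   + up[0].blue   + up[1].blue
                  + c[-1].blue    + c[0].blue    + c[1].blue
                  + down[-1].blue + down[0].blue + down[1].blue;

        dst[idx].red   = (unsigned char)(red / 9);
        dst[idx].green = (unsigned char)(green / 9);
        dst[idx].blue  = (unsigned char)(blue / 9);
    }

    /* The flat loop also wrote the left and right edge pixels of the
       interior rows; put the original values back. */
    for (int i = 1; i < n - 1; ++i) {
        dst[i * n]         = src[i * n];
        dst[i * n + n - 1] = src[i * n + n - 1];
    }
}
```

Whether this is actually faster than the nested version depends on the compiler and target, so it should be measured with a profiler as recommended above.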

A few ideas - untested.
You have if(ii==i && jj==j) to test for the central pixel in your sharpening loop, which you do 9x for every pixel. I think it would be faster to remove that if, do exactly the same for every pixel, and then make a correction outside the loop by adding 10x the central pixel.
// Do central pixel first
p=src[calculateIndex(i,j,n)];
sum_red = 10*p.red;
sum_green = 10*p.green;
sum_blue = 10*p.blue;
for(ii =i-1; ii <= i + 1; ++ii) {
for(jj = j-1; jj <= j + 1; ++jj) {
p = src[calculateIndex(ii, jj, n)];
//operate according to the instructions
sum_red -= p.red;
sum_green -= p.green;
sum_blue -= p.blue;
}
}
Where you do dst[calculateIndex(i, j, n)] = current_pixel;, you can probably calculate the index once before the loop at the start and then just increment the pointer with each write inside the loop - assuming your arrays are contiguous and unpadded.
index=calculateIndex(1,1,n);
for (i = 1 ; i < n - 1; ++i) {
for (j = 1 ; j < n - 1 ; ++j) {
...
dst[index++] = current_pixel;
}
index+=2; // skip over last pixel of this line and first pixel of next line
}
As you move your 3x3 window of 9 pixels across the image, you could "remember" the left-most column of 3 pixels from the previous position, then instead of 9 additions for each pixel, you would do a single subtraction for the left-most column leaving the window and 3 additions for the new column entering the window on the right side, i.e. 4 calculations instead of 9.
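The sliding-window idea in the last paragraph can be sketched as follows. This is an untested-in-production illustration for a single channel only, with a hypothetical function name box_sum3x3_sliding; the real code would keep three such sets of column sums, one per color. Each step computes the sum of the incoming column, adds it to the running window sum, and subtracts the remembered left-most column.

```c
/* Hypothetical name. Writes the 3x3 neighborhood sum of every interior
   pixel of an n*n single-channel image into sums (also n*n entries).
   Border entries of sums are left untouched. */
void box_sum3x3_sliding(const unsigned char *src, int *sums, int n)
{
    if (n < 3)
        return;                                   /* no interior pixels */

    for (int i = 1; i < n - 1; ++i) {
        const unsigned char *row = src + i * n;   /* start of row i */

        /* Column sums of the first window in this row (columns 0, 1, 2). */
        int left   = row[0 - n] + row[0] + row[0 + n];
        int mid    = row[1 - n] + row[1] + row[1 + n];
        int right  = row[2 - n] + row[2] + row[2 + n];
        int window = left + mid + right;
        sums[i * n + 1] = window;

        for (int j = 2; j < n - 1; ++j) {
            /* New column entering on the right side of the window. */
            int incoming = row[j + 1 - n] + row[j + 1] + row[j + 1 + n];

            window += incoming - left;            /* add new, drop oldest */
            left  = mid;                          /* shift remembered columns */
            mid   = right;
            right = incoming;
            sums[i * n + j] = window;
        }
    }
}
```

Dividing each stored sum by 9 then gives the blurred value for that pixel.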

2D convolution with a kernel which is not center originated

I want to do a 2D convolution of an image with a Gaussian kernel which is not centre originated, given by the equation:
h(x-x', y-y') = exp(-((x-x')^2 + (y-y')^2) / (2*sigma))
Let's say the centre of the kernel is (1,1) instead of (0,0). How should I change my following code for generating the kernel and for the convolution?
int krowhalf=krow/2, kcolhalf=kcol/2;
int sigma=1;
// sum is for normalization
float sum = 0.0;
// generate kernel
for (int x = -krowhalf; x <= krowhalf; x++)
{
for(int y = -kcolhalf; y <= kcolhalf; y++)
{
r = sqrtl((x-1)*(x-1) + (y-1)*(y-1));
gKernel[x + krowhalf][y + kcolhalf] = exp(-(r*r)/(2*sigma));
sum += gKernel[x + krowhalf][y + kcolhalf];
}
}
//normalize the Kernel
for(int i = 0; i < krow; ++i)
for(int j = 0; j < kcol; ++j)
gKernel[i][j] /= sum;
float **convolve2D(float** in, float** out, int h, int v, float **kernel, int kCols, int kRows)
{
int kCenterX = kCols / 2;
int kCenterY = kRows / 2;
int i,j,m,mm,n,nn,ii,jj;
for(i=0; i < h; ++i) // rows
{
for(j=0; j < v; ++j) // columns
{
for(m=0; m < kRows; ++m) // kernel rows
{
mm = kRows - 1 - m; // row index of flipped kernel
for(n=0; n < kCols; ++n) // kernel columns
{
nn = kCols - 1 - n; // column index of flipped kernel
//index of input signal, used for checking boundary
ii = i + (m - kCenterY);
jj = j + (n - kCenterX);
// ignore input samples which are out of bound
if( ii >= 0 && ii < h && jj >= 0 && jj < v )
//out[i][j] += in[ii][jj] * (kernel[mm+nn*29]);
out[i][j] += in[ii][jj] * (kernel[mm][nn]);
}
}
}
}
return out;
}

Since you're using the convolution operator, you have 2 choices:
Use its spatial-invariance (shift-invariance) property.
To do so, just compute the output with a regular, centred convolution filter (best done using either conv2 or imfilter) and then shift the result.
Mind which boundary condition you want to employ (see the imfilter properties).
Calculate the shifted result directly.
You can do this with loops as you suggested or, more easily, create a non-symmetric kernel and still use imfilter or conv2.
Sample Code (MATLAB)
clear();
mInputImage = imread('3.png');
mInputImage = double(mInputImage) / 255;
mConvolutionKernel = zeros(3, 3);
mConvolutionKernel(2, 2) = 1;
mOutputImage01 = conv2(mConvolutionKernel, mInputImage);
mConvolutionKernelShifted = [mConvolutionKernel, zeros(3, 150)];
mOutputImage02 = conv2(mConvolutionKernelShifted, mInputImage);
figure();
imshow(mOutputImage01);
figure();
imshow(mOutputImage02);
The tricky part is knowing how to "crop" the second image to the same axes as the first.
Then you'll have a shifted image.
You can use any Kernel and any function which applies convolution.
Enjoy.
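For the C code in the question, the second choice could be sketched like this. It is only an illustration under stated assumptions: the images and the kernel are flat row-major float arrays (the question uses float**), the function name convolve2d_origin is hypothetical, and samples outside the image are ignored, as in the question. The kernel origin is passed explicitly instead of being fixed at kRows / 2 and kCols / 2.

```c
/* Hypothetical name. True convolution with an explicit kernel origin:
   out[i][j] = sum over (m, n) of
               kernel[m][n] * in[i - (m - origin_row)][j - (n - origin_col)] */
void convolve2d_origin(const float *in, float *out, int rows, int cols,
                       const float *kernel, int k_rows, int k_cols,
                       int origin_row, int origin_col)
{
    for (int i = 0; i < rows; ++i) {
        for (int j = 0; j < cols; ++j) {
            float acc = 0.0f;
            for (int m = 0; m < k_rows; ++m) {
                for (int n = 0; n < k_cols; ++n) {
                    /* Input sample that lines up with kernel[m][n]. */
                    int ii = i - (m - origin_row);
                    int jj = j - (n - origin_col);
                    /* Ignore input samples which are out of bounds. */
                    if (ii >= 0 && ii < rows && jj >= 0 && jj < cols)
                        acc += kernel[m * k_cols + n] * in[ii * cols + jj];
                }
            }
            out[i * cols + j] = acc;
        }
    }
}
```

With a 3x3 kernel, passing the origin (1, 1) gives the usual centred result, while passing (0, 0) gives the same result shifted by one pixel down and to the right. That is the shift-invariance property from the first choice seen from the other side.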
