How to optimize Matrix initialization and transposition to run faster using C

How to optimize Matrix initialization and transposition to run faster using C - c

The dimension of this matrix is 40000*40000. I was supposed to consider spatial and temporal locality for program but I have no idea to optimize this code. It cost about 50+ seconds in my computer which is not acceptable for our group.The size of block is 500 now. Could someone help me to improve this code?
void InitializeMatrixRowwise(){
int i,j,ii,jj;
double x;
x = 0.0;
for (i = 0; i < DIMENSION; i += BLOCKSIZE)
{
for (j = 0; j < DIMENSION; j += BLOCKSIZE)
{
for (ii = i; ii < i+BLOCKSIZE && ii < DIMENSION; ii++)
{
for (jj = j; jj < j+BLOCKSIZE && jj < DIMENSION; jj++)
{
if (ii >= jj)
{
Matrix[ii][jj] = x++;
}
else
Matrix[ii][jj] = 1.0;
}
}
}
}
}
void TransposeMatrixRowwise(){
int column,row,i,j;
double temp;
for (row = 0; row < DIMENSION; row += BLOCKSIZE)
{
for (column = 0; column < DIMENSION; column += BLOCKSIZE)
{
for (i = row; i < row + BLOCKSIZE && i < DIMENSION; i++)
{
for (j = column; j < column + BLOCKSIZE && j < DIMENSION; j++)
{
if (i > j)
{
temp = Matrix[i][j];
Matrix[i][j] = Matrix[j][i];
Matrix[j][i] = temp;
}
}
}
}
}
}

Your transpose function seems like it might be more complex than necessary and therefore perhaps slower than necessary. However, I created two versions of the code with timing inserted on the 'full size' (40k x 40k array, with 500 x 500 blocks), one using your transpose function and one using this much simpler algorithm:
static void TransposeMatrixRowwise(void)
{
for (int row = 0; row < DIMENSION; row++)
{
for (int col = row + 1; col < DIMENSION; col++)
{
double temp = Matrix[row][col];
Matrix[row][col] = Matrix[col][row];
Matrix[col][row] = temp;
}
}
}
This looks much simpler; it has only two nested loops instead of four, but the timing turns out to be dramatically worse — 31.5s vs 14.7s.
# Simple transpose
# Count = 7
# Sum(x1) = 220.87
# Sum(x2) = 6979.00
# Mean = 31.55
# Std Dev = 1.27 (sample)
# Variance = 1.61 (sample)
# Min = 30.41
# Max = 33.54
# Complex transpose
# Count = 7
# Sum(x1) = 102.81
# Sum(x2) = 1514.00
# Mean = 14.69
# Std Dev = 0.82 (sample)
# Variance = 0.68 (sample)
# Min = 13.59
# Max = 16.21
The reason for the performance difference is almost certainly due to locality of reference. The more complex algorithm is working with two separate blocks of memory at a time, whereas the simpler algorithm is ranging over far more memory, leading to many more page misses, and the slower performance.
Thus, while you might be able to tune the transpose algorithm using different block sizes (it needn't be the same block size as was used to generate the matrices), there is little doubt based on these measurements
that the more complex algorithm is more efficient.
I also did a check at 1/10th scale — 4k x 4k matrix, 50 x 50 block size — to ensure that the output from the transposition was the same (about 152 MiB of text). I didn't save the data at full scale with more than 100 times as much data. The times at 1/10th scale were dramatically better — less than 1/100th time — for both versions at the 1/10th scale:
< Initialization: 0.068667
< Transposition: 0.063927
---
> Initialization: 0.081022
> Transposition: 0.039169
4005c4005
< Print transposition: 3.901960
---
> Print transposition: 4.040136
JFTR: Testing on a 2016 MacBook Pro running macOS High Sierra 10.13.1 with 2.7 GHz Intel Core i7 CPU and 16 GB 2133 MHz LPDDR3 RAM. The compiler was GCC 7.2.0 (home-built). There was a browser running (but mostly inactive) and music playing in the background, so the machine wasn't idle, but I don't think those will dramatically affect the numbers.

Related

Break or merge loops over arrays in C?

Let's say I have 3 arrays image, blur and out, all of dimensions M×N×3.
I want to compute the bilateral gradient of each pixel in the array image (current_pixel - (previous_previous + next_pixel) / 2) over x and y dimensions, divide it by some floats, then add the value of the corresponding pixel from the array blur and finally put the result into the array out.
My question is, in C, what is the most efficient way to do it (regarding the memory access speed and computing efficiency) :
One loop indexing the 3 arrays at once :
for (i = 0, j = 0, k = 0 ; i < M-1, j < N-1, k < 3 ; i++, j++, k++):
out[i][j][k] = (2 * image[i][j][k] - image[i+1][j][k] - image[i][j+1][k]) / 2. + lambda * blur[i][j][k]
Two loops indexing only two arrays :
for (i = 0, j = 0, k = 0 ; i < M-1, j < N-1, k < 3 ; i++, j++, k++):
out[i][j][k] = (2 * image[i][j][k] - image[i+1][j][k] - image[i][j+1][k]) / 2.
for (i = 0, j = 0, k = 0 ; i < M-1, j < N-1, k < 3 ; i++, j++, k++):
out[i][j][k] += lambda * blur[i][j][k]
(for readability, I only wrote a simple forward gradient, but the complete formula is given above).
Or is there another faster way ? I'm programming for x86_64 CPUs.

One loop indexing the 3 arrays at once will be slightly easier for compiler to optimize. But you can quite likely check it and tested it.

image proccessing further optimization

I'm new to optimization and was given a task to optimize a function that processes an image as much as possible. it takes an image, blurs it and then saves the blurred image, and then continues and sharpens the image, and saves also the sharpened image.
Here is my code:
typedef struct {
unsigned char red;
unsigned char green;
unsigned char blue;
} pixel;
// I delete the other struct because we can do the same operations with use of only addresses
//use macro instead of function is more efficient
#define calculateIndex(i, j, n) ((i)*(n)+(j))
// I combine all the functions in one because it is time consuming
void myfunction(Image *image, char* srcImgpName, char* blurRsltImgName, char* sharpRsltImgName) {
// use variable from type 'register int' is much more efficient from 'int'
register int i,j, ii, jj, sum_red, sum_green, sum_blue;
//using local variable is much more efficient than using pointer to pixels from the original image,and updat its value in each iteration
pixel current_pixel , p;
//dst will point on the first pixel in the image
pixel* dst = (pixel*)image->data;
int squareN = n*n;
//instead of multiply by 3 - I used shift
register int sizeToAllocate = ((squareN)<<1)+(squareN); // use variable from type 'register int' is much more efficient from 'int'
pixel* src = malloc(sizeToAllocate);
register int index;
//memcpy replace the old functions that converts chars to pixels or pixels to chars. it is very efficient and build-in in c libraries
memcpy(src, dst, sizeToAllocate);
///////////////////////////////////////// first step : smooth //////////////////////////////////////////////////////////////////////
/**the smooth blur is step that apply the blur-kernel (matrix of ints) over each pixel in the bouns - and make the image more smooth.
*this function was originally used this matrix :
* [1, 1, 1]
* [1, 1, 1]
* [1, 1, 1]
*because the matrix is full of 1 , we don't really need it - the access to the matrix is very expensive . instead of the matrix I used
*primitive variable.
*/
//the loops are starting with 1 and not with 0 because we need to check only the pixels with 8 neighbors around them
index = calculateIndex(1, 1, n);
for (i = 1 ; i < n - 1; ++i) {
for (j = 1 ; j < n - 1 ; ++j) {
// I used this variables as counters to the colors' values around a specific pixel
sum_red = 0;
sum_green = 0;
sum_blue = 0;
for(ii = i-1; ii <= i+1; ++ii) {
for(jj =j-1; jj <= j+1; ++jj) {
//take care of the [ii,jj] pixel in the matrix
//calculate the adrees of the current pixel
pixel p = src[calculateIndex(ii, jj, n)];
//sum the colors' values of the neighbors of the current pixel
sum_red += p.red;
sum_green += p.green;
sum_blue += p.blue;
}
}
//calculate the avarage of the colors' values around the current pixel - as written in the instructions
sum_red = (((sum_red) * 0xE38F) >> 19);//instead of dividing by 9 - I used shift because it is more efficient
sum_green = (((sum_green) * 0xE38F) >> 19);//instead of dividing by 9 - I used shift because it is more efficient
sum_blue = (((sum_blue) * 0xE38F) >> 19);//instead of dividing by 9 - I used shift because it is more efficient
current_pixel.red = (unsigned char)sum_red;
current_pixel.green = (unsigned char)sum_green;
current_pixel.blue = (unsigned char)sum_blue;
dst[index++] = current_pixel;
}
}
// write result image to file
writeBMP(image, srcImgpName, blurRsltImgName);
//memcpy replace the old functions that converts chars to pixels or pixels to chars. it is very efficient and build-in in c libraries
memcpy(src, dst, sizeToAllocate);
///////////////////////////////////////// second step : sharp //////////////////////////////////////////////////////////////////////
/** I want to sharp the smooth image . In this step I apply the sharpen kernel (matrix of ints) over each pixel in the bouns - and make the image more sharp.
*this function was originally used this matrix :
* [-1, -1, -1]
* [-1, 9, -1]
* [-1, -1, -1]
*because the matrix is full of (-1) , we don't really need it - the access to the matrix is very expensive . instead of the matrix I used
*primitive variable. I operato like that : insted of multiply in (-1) in the end of the step , I define counter initializes with zero , and
*substruct all te colors' values from it. the result is actually the same as multiply by (-1), in more efficient way.
*/
//the loops are starting with 1 and not with 0 because we need to check only the pixels with 8 neighbors around them
for (i = 1 ; i < n-1; ++i) {
for (j = 1 ; j < n-1 ; ++j) {
// I used this variables as counters to the colors' values around a specific pixel
sum_red = 0;
sum_green = 0;
sum_blue = 0;
// Do central pixel first
p=src[calculateIndex(i,j,n)];
sum_red = 10*p.red;
sum_green = 10*p.green;
sum_blue = 10*p.blue;
for(ii =i-1; ii <= i + 1; ++ii) {
for(jj = j-1; jj <= j + 1; ++jj) {
p = src[calculateIndex(ii, jj, n)];
//operate according to the instructions
sum_red -= p.red;
sum_green -= p.green;
sum_blue -= p.blue;
}
}
//each pixel's colors' values must match the range [0,255] - I used the idea from the original code
//the red value must be in the range [0,255]
if (sum_red < 0) {
sum_red = 0;
} else if (sum_red > 255 ) {
sum_red = 255;
}
current_pixel.red = (unsigned char)sum_red;
//the green value must be in the range [0,255]
if (sum_green < 0) {
sum_green = 0;
} else if (sum_green > 255 ) {
sum_green = 255;
}
current_pixel.green = (unsigned char)sum_green;
//the blue value must be in the range [0,255]
if (sum_blue < 0) {
sum_blue = 0;
} else if (sum_blue > 255 ) {
sum_blue = 255;
}
current_pixel.blue = (unsigned char)sum_blue;
// put the updated pixel in [i,j] in the image
dst[calculateIndex(i, j, n)] = current_pixel;
}
}
//free the allocated space to prevent memory leaks
free(src);
// write result image to file
writeBMP(image, srcImgpName, sharpRsltImgName);
}
I wanted to ask about the if statements, is there anything better that can replace those? And also more generally speaking can anyone spot an optimization mistakes here, or can offer his inputs?
Thanks a lot!
updated code:
typedef struct {
unsigned char red;
unsigned char green;
unsigned char blue;
} pixel;
// I delete the other struct because we can do the same operations with use of only addresses
//use macro instead of function is more efficient
#define calculateIndex(i, j, n) ((i)*(n)+(j))
// I combine all the functions in one because it is time consuming
void myfunction(Image *image, char* srcImgpName, char* blurRsltImgName, char* sharpRsltImgName) {
// use variable from type 'register int' is much more efficient from 'int'
register int i,j, ii, jj, sum_red, sum_green, sum_blue;
//using local variable is much more efficient than using pointer to pixels from the original image,and updat its value in each iteration
pixel current_pixel , p;
//dst will point on the first pixel in the image
pixel* dst = (pixel*)image->data;
int squareN = n*n;
//instead of multiply by 3 - I used shift
register int sizeToAllocate = ((squareN)<<1)+(squareN); // use variable from type 'register int' is much more efficient from 'int'
pixel* src = malloc(sizeToAllocate);
register int index;
//memcpy replace the old functions that converts chars to pixels or pixels to chars. it is very efficient and build-in in c libraries
memcpy(src, dst, sizeToAllocate);
///////////////////////////////////////// first step : smooth //////////////////////////////////////////////////////////////////////
/**the smooth blur is step that apply the blur-kernel (matrix of ints) over each pixel in the bouns - and make the image more smooth.
*this function was originally used this matrix :
* [1, 1, 1]
* [1, 1, 1]
* [1, 1, 1]
*because the matrix is full of 1 , we don't really need it - the access to the matrix is very expensive . instead of the matrix I used
*primitive variable.
*/
//the loops are starting with 1 and not with 0 because we need to check only the pixels with 8 neighbors around them
index = calculateIndex(1, 1, n);
for (i = 1 ; i < n - 1; ++i) {
for (j = 1 ; j < n - 1 ; ++j) {
// I used this variables as counters to the colors' values around a specific pixel
sum_red = 0;
sum_green = 0;
sum_blue = 0;
for(ii = i-1; ii <= i+1; ++ii) {
for(jj =j-1; jj <= j+1; ++jj) {
//take care of the [ii,jj] pixel in the matrix
//calculate the adrees of the current pixel
pixel p = src[calculateIndex(ii, jj, n)];
//sum the colors' values of the neighbors of the current pixel
sum_red += p.red;
sum_green += p.green;
sum_blue += p.blue;
}
}
//calculate the avarage of the colors' values around the current pixel - as written in the instructions
sum_red = (((sum_red) * 0xE38F) >> 19);//instead of dividing by 9 - I used shift because it is more efficient
sum_green = (((sum_green) * 0xE38F) >> 19);//instead of dividing by 9 - I used shift because it is more efficient
sum_blue = (((sum_blue) * 0xE38F) >> 19);//instead of dividing by 9 - I used shift because it is more efficient
current_pixel.red = (unsigned char)sum_red;
current_pixel.green = (unsigned char)sum_green;
current_pixel.blue = (unsigned char)sum_blue;
dst[index++] = current_pixel;
}
index += 2;
}
// write result image to file
writeBMP(image, srcImgpName, blurRsltImgName);
//memcpy replace the old functions that converts chars to pixels or pixels to chars. it is very efficient and build-in in c libraries
memcpy(src, dst, sizeToAllocate);
///////////////////////////////////////// second step : sharp //////////////////////////////////////////////////////////////////////
/** I want to sharp the smooth image . In this step I apply the sharpen kernel (matrix of ints) over each pixel in the bouns - and make the image more sharp.
*this function was originally used this matrix :
* [-1, -1, -1]
* [-1, 9, -1]
* [-1, -1, -1]
*because the matrix is full of (-1) , we don't really need it - the access to the matrix is very expensive . instead of the matrix I used
*primitive variable. I operato like that : insted of multiply in (-1) in the end of the step , I define counter initializes with zero , and
*substruct all te colors' values from it. the result is actually the same as multiply by (-1), in more efficient way.
*/
index = calculateIndex(1,1,n);
//the loops are starting with 1 and not with 0 because we need to check only the pixels with 8 neighbors around them
for (i = 1 ; i < n-1; ++i) {
for (j = 1 ; j < n-1 ; ++j) {
// I used this variables as counters to the colors' values around a specific pixel
sum_red = 0;
sum_green = 0;
sum_blue = 0;
// Do central pixel first
p=src[index];
sum_red = 10*p.red;
sum_green = 10*p.green;
sum_blue = 10*p.blue;
for(ii =i-1; ii <= i + 1; ++ii) {
for(jj = j-1; jj <= j + 1; ++jj) {
p = src[calculateIndex(ii, jj, n)];
//operate according to the instructions
sum_red -= p.red;
sum_green -= p.green;
sum_blue -= p.blue;
}
index += 2;
}
//each pixel's colors' values must match the range [0,255] - I used the idea from the original code
//the red value must be in the range [0,255]
if (sum_red < 0) {
sum_red = 0;
} else if (sum_red > 255 ) {
sum_red = 255;
}
current_pixel.red = (unsigned char)sum_red;
//the green value must be in the range [0,255]
if (sum_green < 0) {
sum_green = 0;
} else if (sum_green > 255 ) {
sum_green = 255;
}
current_pixel.green = (unsigned char)sum_green;
//the blue value must be in the range [0,255]
if (sum_blue < 0) {
sum_blue = 0;
} else if (sum_blue > 255 ) {
sum_blue = 255;
}
current_pixel.blue = (unsigned char)sum_blue;
// put the updated pixel in [i,j] in the image
dst[calculateIndex(i, j, n)] = current_pixel;
}
}
//free the allocated space to prevent memory leaks
free(src);
// write result image to file
writeBMP(image, srcImgpName, sharpRsltImgName);
}
------------------------------------------------------------------------------updated code:
typedef struct {
unsigned char red;
unsigned char green;
unsigned char blue;
} pixel;
// I delete the other struct because we can do the same operations with use of only addresses
//use macro instead of function is more efficient
#define calculateIndex(i, j, n) ((i)*(n)+(j))
// I combine all the functions in one because it is time consuming
void myfunction(Image *image, char* srcImgpName, char* blurRsltImgName, char* sharpRsltImgName) {
// use variable from type 'register int' is much more efficient from 'int'
register int i,j, ii, jj, sum_red, sum_green, sum_blue;
//using local variable is much more efficient than using pointer to pixels from the original image,and updat its value in each iteration
pixel current_pixel , p;
//dst will point on the first pixel in the image
pixel* dst = (pixel*)image->data;
int squareN = n*n;
//instead of multiply by 3 - I used shift
register int sizeToAllocate = ((squareN)<<1)+(squareN); // use variable from type 'register int' is much more efficient from 'int'
pixel* src = malloc(sizeToAllocate);
register int index;
//memcpy replace the old functions that converts chars to pixels or pixels to chars. it is very efficient and build-in in c libraries
memcpy(src, dst, sizeToAllocate);
///////////////////////////////////////// first step : smooth //////////////////////////////////////////////////////////////////////
/**the smooth blur is step that apply the blur-kernel (matrix of ints) over each pixel in the bouns - and make the image more smooth.
*this function was originally used this matrix :
* [1, 1, 1]
* [1, 1, 1]
* [1, 1, 1]
*because the matrix is full of 1 , we don't really need it - the access to the matrix is very expensive . instead of the matrix I used
*primitive variable.
*/
//the loops are starting with 1 and not with 0 because we need to check only the pixels with 8 neighbors around them
index = n + 1;
for (i = 1 ; i < n - 1; ++i) {
for (j = 1 ; j < n - 1 ; ++j) {
// I used this variables as counters to the colors' values around a specific pixel
sum_red = 0;
sum_green = 0;
sum_blue = 0;
for(ii = i-1; ii <= i+1; ++ii) {
for(jj =j-1; jj <= j+1; ++jj) {
//take care of the [ii,jj] pixel in the matrix
//calculate the adrees of the current pixel
pixel p = src[calculateIndex(ii, jj, n)];
//sum the colors' values of the neighbors of the current pixel
sum_red += p.red;
sum_green += p.green;
sum_blue += p.blue;
}
}
//calculate the avarage of the colors' values around the current pixel - as written in the instructions
sum_red = (((sum_red) * 0xE38F) >> 19);//instead of dividing by 9 - I used shift because it is more efficient
sum_green = (((sum_green) * 0xE38F) >> 19);//instead of dividing by 9 - I used shift because it is more efficient
sum_blue = (((sum_blue) * 0xE38F) >> 19);//instead of dividing by 9 - I used shift because it is more efficient
current_pixel.red = (unsigned char)sum_red;
current_pixel.green = (unsigned char)sum_green;
current_pixel.blue = (unsigned char)sum_blue;
dst[index++] = current_pixel;
}
index += 2;
}
// write result image to file
writeBMP(image, srcImgpName, blurRsltImgName);
//memcpy replace the old functions that converts chars to pixels or pixels to chars. it is very efficient and build-in in c libraries
memcpy(src, dst, sizeToAllocate);
///////////////////////////////////////// second step : sharp //////////////////////////////////////////////////////////////////////
/** I want to sharp the smooth image . In this step I apply the sharpen kernel (matrix of ints) over each pixel in the bouns - and make the image more sharp.
*this function was originally used this matrix :
* [-1, -1, -1]
* [-1, 9, -1]
* [-1, -1, -1]
*because the matrix is full of (-1) , we don't really need it - the access to the matrix is very expensive . instead of the matrix I used
*primitive variable. I operate like that : instead of multiply in (-1) in the end of the step , I define counter initializes with zero , and
*substruct all te colors' values from it. the result is actually the same as multiply by (-1), in more efficient way.
*/
index = calculateIndex(1,1,n);
//the loops are starting with 1 and not with 0 because we need to check only the pixels with 8 neighbors around them
for (i = 1 ; i < n-1; ++i) {
for (j = 1 ; j < n-1 ; ++j) {
// I used this variables as counters to the colors' values around a specific pixel
sum_red = 0;
sum_green = 0;
sum_blue = 0;
// Do central pixel first
p=src[index];
sum_red = 10*p.red;
sum_green = 10*p.green;
sum_blue = 10*p.blue;
for(ii =i-1; ii <= i + 1; ++ii) {
for(jj = j-1; jj <= j + 1; ++jj) {
p = src[calculateIndex(ii, jj, n)];
//operate according to the instructions
sum_red -= p.red;
sum_green -= p.green;
sum_blue -= p.blue;
}
}
//each pixel's colors' values must match the range [0,255] - I used the idea from the original code
//the red value must be in the range [0,255]
if (sum_red < 0) {
sum_red = 0;
} else if (sum_red > 255 ) {
sum_red = 255;
}
current_pixel.red = (unsigned char)sum_red;
//the green value must be in the range [0,255]
if (sum_green < 0) {
sum_green = 0;
} else if (sum_green > 255 ) {
sum_green = 255;
}
current_pixel.green = (unsigned char)sum_green;
//the blue value must be in the range [0,255]
if (sum_blue < 0) {
sum_blue = 0;
} else if (sum_blue > 255 ) {
sum_blue = 255;
}
current_pixel.blue = (unsigned char)sum_blue;
// put the updated pixel in [i,j] in the image
dst[calculateIndex(i, j, n)] = current_pixel;
}
index += 2;
}
//free the allocated space to prevent memory leaks
free(src);
// write result image to file
writeBMP(image, srcImgpName, sharpRsltImgName);
}

Some general optimization guidelines:
If you're running on x86, compile as a 64-bit binary. x86 is really a register-starved CPU. In 32-bit mode you pretty much have only 5 or 6 32-bit general-purpose registers available, and you only get "all" 6 if you compile with optimizations like -fomit-frame-pointer on GCC. In 64-bit mode you'll have 13 or 14 64-bit general-purpose registers.
Get a good compiler and use the highest possible general optimization level.
Profile! Profile! Profile! Actually profile your code so actually know where the performance bottlenecks are. Any guesses about the location of any performance bottlenecks are likely wrong.
Once you find your bottlenecks, examine the actual instructions the compiler produces and look at the bottleneck areas, just to see what's happening. Perhaps the bottleneck is where the compiler had to do a lot of register spilling and filling because of register pressure. This can be really helpful if you can profile down to the instruction level.
Use the insights from the profiling and examination of the generated instructions to improve your code and compile arguments. For example, if you're seeing a lot of register spilling and filling, you need to reduce register pressure, perhaps by manually coalescing loops or disabling prefetching with a compiler option.
Experiment with different page size options. If a single row of pixels is a significant fraction of a page size, reaching into other rows is more likely to reach into another page and result in a TLB miss. Using larger memory pages may significantly reduce this.
Some specific ideas for your code:
Use only one outer loop. You'll have to experiment to find the fastest way to handle your "extra" edge pixels. The fastest way might be to not do anything special, roll right over them like "normal" pixels, and just ignore the values in them later.
Manually unroll the two inner loops - you're only doing 9 pixels.
Don't use calculateIndex() - use the address of the current pixel and find the other pixels simply by subtracting or adding the proper value from the current pixel address. For example, the address of the upper-left pixel in your inner loops would be something like currentPixelAddress - n - 1.
Those would convert your four-deep nested loops into a single loop with very little index calculations needed.

A few ideas - untested.
You have if(ii==i && jj=j) to test for the central pixel in your sharpening loop which you do 9x for every pixel. I think it would be faster to remove that if and do exactly the same for every pixel but then make a correction, outside the loop by adding 10x the central pixel.
// Do central pixel first
p=src[calculateIndex(i,j,n)];
sum_red = 10*p.red;
sum_green = 10*p.green;
sum_blue = 10*p.blue;
for(ii =i-1; ii <= i + 1; ++ii) {
for(jj = j-1; jj <= j + 1; ++jj) {
p = src[calculateIndex(ii, jj, n)];
//operate according to the instructions
sum_red -= p.red;
sum_green -= p.green;
sum_blue -= p.blue;
}
}
Where you do dst[calculateIndex(i, j, n)] = current_pixel;, you can probably calculate the index once before the loop at the start and then just increment the pointer with each write inside the loop - assuming your arrays are contiguous and unpadded.
index=calculateIndex(1,1,n)
for (i = 1 ; i < n - 1; ++i) {
for (j = 1 ; j < n - 1 ; ++j) {
...
dst[index++] = current_pixel;
}
index+=2; // skip over last pixel of this line and first pixel of next line
}
As you move your 3x3 window of 9 pixels across the image, you could "remember" the left-most column of 3 pixels from the previous position, then instead of 9 additions for each pixel, you would do a single subtraction for the left-most column leaving the window and 3 additions for the new column entering the window on the right side, i.e. 4 calculations instead of 9.

Poor maths performance in C vs Python/numpy

Near-duplicate / related:
How does BLAS get such extreme performance? (If you want fast matmul in C, seriously just use a good BLAS library unless you want to hand-tune your own asm version.) But that doesn't mean it's not interesting to see what happens when you compile less-optimized matrix code.
how to optimize matrix multiplication (matmul) code to run fast on a single processor core
Matrix Multiplication with blocks
Out of interest, I decided to compare the performance of (inexpertly) handwritten C vs. Python/numpy performing a simple matrix multiplication of two, large, square matrices filled with random numbers from 0 to 1.
I found that python/numpy outperformed my C code by over 10,000x This is clearly not right, so what is wrong with my C code that is causing it to perform so poorly? (even compiled with -O3 or -Ofast)
The python:
import time
import numpy as np
t0 = time.time()
m1 = np.random.rand(2000, 2000)
m2 = np.random.rand(2000, 2000)
t1 = time.time()
m3 = m1 # m2
t2 = time.time()
print('creation time: ', t1 - t0, ' \n multiplication time: ', t2 - t1)
The C:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
int main(void) {
clock_t t0=clock(), t1, t2;
// create matrices and allocate memory
int m_size = 2000;
int i, j, k;
double running_sum;
double *m1[m_size], *m2[m_size], *m3[m_size];
double f_rand_max = (double)RAND_MAX;
for(i = 0; i < m_size; i++) {
m1[i] = (double *)malloc(sizeof(double)*m_size);
m2[i] = (double *)malloc(sizeof(double)*m_size);
m3[i] = (double *)malloc(sizeof(double)*m_size);
}
// populate with random numbers 0 - 1
for (i=0; i < m_size; i++)
for (j=0; j < m_size; j++) {
m1[i][j] = (double)rand() / f_rand_max;
m2[i][j] = (double)rand() / f_rand_max;
}
t1 = clock();
// multiply together
for (i=0; i < m_size; i++)
for (j=0; j < m_size; j++) {
running_sum = 0;
for (k = 0; k < m_size; k++)
running_sum += m1[i][k] * m2[k][j];
m3[i][j] = running_sum;
}
t2 = clock();
float t01 = ((float)(t1 - t0) / CLOCKS_PER_SEC );
float t12 = ((float)(t2 - t1) / CLOCKS_PER_SEC );
printf("creation time: %f", t01 );
printf("\nmultiplication time: %f", t12 );
return 0;
}
EDIT: Have corrected the python to do a proper dot product which closes the gap a little and the C to time with a resolution of microseconds and use the comparable double data type, rather than float, as originally posted.
Outputs:
$ gcc -O3 -march=native bench.c
$ ./a.out
creation time: 0.092651
multiplication time: 139.945068
$ python3 bench.py
creation time: 0.1473407745361328
multiplication time: 0.329038143157959
It has been pointed out that the naive algorithm implemented here in C could be improved in ways that lend themselves to make better use of compiler optimisations and the cache.
EDIT: Having modified the C code to transpose the second matrix in order to achieve a more efficient access pattern, the gap closes more
The modified multiplication code:
// transpose m2 in order to capitalise on cache efficiencies
// store transposed matrix in m3 for now
for (i=0; i < m_size; i++)
for (j=0; j < m_size; j++)
m3[j][i] = m2[i][j];
// swap the pointers
void *mtemp = *m3;
*m3 = *m2;
*m2 = mtemp;
// multiply together
for (i=0; i < m_size; i++)
for (j=0; j < m_size; j++) {
running_sum = 0;
for (k = 0; k < m_size; k++)
running_sum += m1[i][k] * m2[j][k];
m3[i][j] = running_sum;
}
The results:
$ gcc -O3 -march=native bench2.c
$ ./a.out
creation time: 0.107767
multiplication time: 10.843431
$ python3 bench.py
creation time: 0.1488208770751953
multiplication time: 0.3335080146789551
EDIT: compiling with -0fast, which I am reassured is a fair comparison, brings down the difference to just over an order of magnitude (in numpy's favour).
$ gcc -Ofast -march=native bench2.c
$ ./a.out
creation time: 0.098201
multiplication time: 4.766985
$ python3 bench.py
creation time: 0.13812589645385742
multiplication time: 0.3441300392150879
EDIT: It was suggested to change indexing from arr[i][j] to arr[i*m_size + j] this yielded a small performance increase:
for m_size = 10000
$ gcc -Ofast -march=native bench3.c # indexed by arr[ i * m_size + j ]
$ ./a.out
creation time: 1.280863
multiplication time: 626.327820
$ gcc -Ofast -march=native bench2.c # indexed by art[I][j]
$ ./a.out
creation time: 2.410230
multiplication time: 708.979980
$ python3 bench.py
creation time: 3.8284950256347656
multiplication time: 39.06089973449707
The up to date code bench3.c:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
int main(void) {
clock_t t0, t1, t2;
t0 = clock();
// create matrices and allocate memory
int m_size = 10000;
int i, j, k, x, y;
double running_sum;
double *m1 = (double *)malloc(sizeof(double)*m_size*m_size),
*m2 = (double *)malloc(sizeof(double)*m_size*m_size),
*m3 = (double *)malloc(sizeof(double)*m_size*m_size);
double f_rand_max = (double)RAND_MAX;
// populate with random numbers 0 - 1
for (i=0; i < m_size; i++) {
x = i * m_size;
for (j=0; j < m_size; j++)
m1[x + j] = ((double)rand()) / f_rand_max;
m2[x + j] = ((double)rand()) / f_rand_max;
m3[x + j] = ((double)rand()) / f_rand_max;
}
t1 = clock();
// transpose m2 in order to capitalise on cache efficiencies
// store transposed matrix in m3 for now
for (i=0; i < m_size; i++)
for (j=0; j < m_size; j++)
m3[j*m_size + i] = m2[i * m_size + j];
// swap the pointers
double *mtemp = m3;
m3 = m2;
m2 = mtemp;
// multiply together
for (i=0; i < m_size; i++) {
x = i * m_size;
for (j=0; j < m_size; j++) {
running_sum = 0;
y = j * m_size;
for (k = 0; k < m_size; k++)
running_sum += m1[x + k] * m2[y + k];
m3[x + j] = running_sum;
}
}
t2 = clock();
float t01 = ((float)(t1 - t0) / CLOCKS_PER_SEC );
float t12 = ((float)(t2 - t1) / CLOCKS_PER_SEC );
printf("creation time: %f", t01 );
printf("\nmultiplication time: %f", t12 );
return 0;
}

CONCLUSION: So the original absurd factor of x10,000 difference was largely due to mistakenly comparing element-wise multiplication in Python/numpy to C code and not compiled with all of the available optimisations and written with a highly inefficient memory access pattern that likely didn't utilise the cache.
A 'fair' comparison (ie. correct, but highly inefficient single-threaded algorithm, compiled with -Ofast) yields a performance factor difference of x350
A number of simple edits to improve the memory access pattern brought the comparison down to a factor of x16 (in numpy's favour) for large matrix (10000 x 10000) multiplication. Furthermore, numpy automatically utilises all four virtual cores on my machine whereas this C does not, so the performance difference could be a factor of x4 - x8 (depending on how well this program ran on hyperthreading). I consider a factor of x4 - x8 to be fairly sensible, given that I don't really know what I'm doing and just knocked a bit of code together whereas numpy is based on BLAS which I understand has been extensively optimised over the years by experts from all over the place so I consider the question answered/solved.

How to use Armadillo Columns/Rows to perform optimised calculations on accesses within the same column

What is the best way to manipulate indexing in Armadillo? I was under the impression that it heavily used template expressions to avoid temporaries, but I'm not seeing these speedups.
Is direct array indexing still the best way to approach calculations that rely on consecutive elements within the same array?
Keep in mind, that I hope to parallelise these calculations in the future with TBB::parallel_for (In this case, from a maintainability perspective, it may be simpler to use direct accessing?) These calculations happen in a tight loop, and I hope to make them as optimal as possible.
ElapsedTimer timer;
int n = 768000;
int numberOfLoops = 5000;
arma::Col<double> directAccess1(n);
arma::Col<double> directAccess2(n);
arma::Col<double> directAccessResult1(n);
arma::Col<double> directAccessResult2(n);
arma::Col<double> armaAccess1(n);
arma::Col<double> armaAccess2(n);
arma::Col<double> armaAccessResult1(n);
arma::Col<double> armaAccessResult2(n);
std::valarray<double> valArrayAccess1(n);
std::valarray<double> valArrayAccess2(n);
std::valarray<double> valArrayAccessResult1(n);
std::valarray<double> valArrayAccessResult2(n);
// Prefil
for (int i = 0; i < n; i++) {
directAccess1[i] = i;
directAccess2[i] = n - i;
armaAccess1[i] = i;
armaAccess2[i] = n - i;
valArrayAccess1[i] = i;
valArrayAccess2[i] = n - i;
}
timer.Start();
for (int j = 0; j < numberOfLoops; j++) {
for (int i = 1; i < n; i++) {
directAccessResult1[i] = -directAccess1[i] / (directAccess1[i] + directAccess1[i - 1]) * directAccess2[i - 1];
directAccessResult2[i] = -directAccess1[i] / (directAccess1[i] + directAccess1[i]) * directAccess2[i];
}
}
timer.StopAndPrint("Direct Array Indexing Took");
std::cout << std::endl;
timer.Start();
for (int j = 0; j < numberOfLoops; j++) {
armaAccessResult1.rows(1, n - 1) = -armaAccess1.rows(1, n - 1) / (armaAccess1.rows(1, n - 1) + armaAccess1.rows(0, n - 2)) % armaAccess2.rows(0, n - 2);
armaAccessResult2.rows(1, n - 1) = -armaAccess1.rows(1, n - 1) / (armaAccess1.rows(1, n - 1) + armaAccess1.rows(1, n - 1)) % armaAccess2.rows(1, n - 1);
}
timer.StopAndPrint("Arma Array Indexing Took");
std::cout << std::endl;
timer.Start();
for (int j = 0; j < numberOfLoops; j++) {
for (int i = 1; i < n; i++) {
valArrayAccessResult1[i] = -valArrayAccess1[i] / (valArrayAccess1[i] + valArrayAccess1[i - 1]) * valArrayAccess2[i - 1];
valArrayAccessResult2[i] = -valArrayAccess1[i] / (valArrayAccess1[i] + valArrayAccess1[i]) * valArrayAccess2[i];
}
}
timer.StopAndPrint("Valarray Array Indexing Took:");
std::cout << std::endl;
In vs release mode (/02 - to avoid armadillo array indexing checks), they produce the following timings:
Started Performance Analysis!
Direct Array Indexing Took: 37.294 seconds elapsed
Arma Array Indexing Took: 39.4292 seconds elapsed
Valarray Array Indexing Took:: 37.2354 seconds elapsed

Your direct code is already quite optimal, so expression templates are not going to help here.
However, you may want to make sure the optimization level in your compiler actually enables auto-vectorization (-O3 in gcc). Secondly, you can get a bit of extra speed by #define ARMA_NO_DEBUG before including the Armadillo header. This will turn off all run-time checks (such as bound checks for element access), but this is not recommended until you have completely debugged your program.

2D convolution with a with a kernel which is not center originated

I want to do 2D convolution of an image with a Gaussian kernel which is not centre originated given by equation:
h(x-x', y-y') = exp(-((x-x')^2+(y-y'))/2*sigma)
Lets say the centre of kernel is (1,1) instead of (0,0). How should I change my following code for generation of kernel and for the convolution?
int krowhalf=krow/2, kcolhalf=kcol/2;
int sigma=1
// sum is for normalization
float sum = 0.0;
// generate kernel
for (int x = -krowhalf; x <= krowhalf; x++)
{
for(int y = -kcolhalf; y <= kcolhalf; y++)
{
r = sqrtl((x-1)*(x-1) + (y-1)*(y-1));
gKernel[x + krowhalf][y + kcolhalf] = exp(-(r*r)/(2*sigma));
sum += gKernel[x + krowhalf][y + kcolhalf];
}
}
//normalize the Kernel
for(int i = 0; i < krow; ++i)
for(int j = 0; j < kcol; ++j)
gKernel[i][j] /= sum;
float **convolve2D(float** in, float** out, int h, int v, float **kernel, int kCols, int kRows)
{
int kCenterX = kCols / 2;
int kCenterY = kRows / 2;
int i,j,m,mm,n,nn,ii,jj;
for(i=0; i < h; ++i) // rows
{
for(j=0; j < v; ++j) // columns
{
for(m=0; m < kRows; ++m) // kernel rows
{
mm = kRows - 1 - m; // row index of flipped kernel
for(n=0; n < kCols; ++n) // kernel columns
{
nn = kCols - 1 - n; // column index of flipped kernel
//index of input signal, used for checking boundary
ii = i + (m - kCenterY);
jj = j + (n - kCenterX);
// ignore input samples which are out of bound
if( ii >= 0 && ii < h && jj >= 0 && jj < v )
//out[i][j] += in[ii][jj] * (kernel[mm+nn*29]);
out[i][j] += in[ii][jj] * (kernel[mm][nn]);
}
}
}
}
}

Since you're using the convolution operator you have 2 choices:
Using it Spatial Invariant property.
To so so, just calculate the image using regular convolution filter (Better done using either conv2 or imfilter) and then shift the result.
You should mind the boundary condition you'd to employ (See imfilter properties).
Calculate the shifted result specifically.
You can do this by loops as you suggested or more easily create non symmetric kernel and still use imfilter or conv2.
Sample Code (MATLAB)
clear();
mInputImage = imread('3.png');
mInputImage = double(mInputImage) / 255;
mConvolutionKernel = zeros(3, 3);
mConvolutionKernel(2, 2) = 1;
mOutputImage01 = conv2(mConvolutionKernel, mInputImage);
mConvolutionKernelShifted = [mConvolutionKernel, zeros(3, 150)];
mOutputImage02 = conv2(mConvolutionKernelShifted, mInputImage);
figure();
imshow(mOutputImage01);
figure();
imshow(mOutputImage02);
The tricky part is to know to "Crop" the second image in the same axis as the first.
Then you'll have a shifted image.
You can use any Kernel and any function which applies convolution.
Enjoy.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight