I've got an assignment to optimize a piece of C code (a language in which I'm rather n00bish) designed to simulate rotating pixels in an image:
void naive_rotate(int dim, pixel *src, pixel *dst) {
int i, j;
for (i = 0; i < dim; i++)
for (j = 0; j < dim; j++)
dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)];
}
Defs for pixel and RIDX are as follows:
typedef struct {
unsigned short red;
unsigned short green;
unsigned short blue;
} pixel;
#define RIDX(i,j,n) ((i)*(n)+(j))
The instructions for the assignment contain the note, "Your task is to rewrite this code to make it run as fast as possible using techniques like code motion, loop unrolling and blocking."
I thought I had some ideas on how to approach this. However, my attempts at loop unrolling:
void rotate_unroll(int dim, pixel *src, pixel *dst) {
int i, j;
for (i = 0; i < dim; i++) {
for (j = 0; j < dim; j+=4) {
dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)];
dst[RIDX(dim-1-(j+1), i, dim)] = src[RIDX(i, j+1, dim)];
dst[RIDX(dim-1-(j+2), i, dim)] = src[RIDX(i, j+2, dim)];
dst[RIDX(dim-1-(j+3), i, dim)] = src[RIDX(i, j+3, dim)];
}
}
}
and code motion (or at least reorganizing RIDX and moving around some of the calculations out of the inner loop):
void rotate_motion(int dim, pixel *src, pixel *dst) {
int i, j;
int dimsquared = dim * dim;
for (i = 0; i < dim; i++) {
int dst_temp = dimsquared - dim + i;
int src_temp = i * dim;
for (j = 0; j < dim; j++) {
dst[dst_temp - (dim * j)] = src[src_temp + j];
}
}
}
// dst[RIDX(dim-1-j, i, dim)]
// = dst[(dim-1-j)dim + i]
// = dst[(dim * dim) - dim - (dim)j + i]
// src[RIDX(i, j, dim)]
// = src[(dim)i + j]
do not seem to be working; the timer packaged with the assignment claims that my solutions are not having any impact on the CPE of the program. I suspect I am probably approaching both methods incorrectly and would greatly appreciate any guidance in the right direction. (It's a homework assignment so I'm not asking for a complete solution, just some pointers.)
My other thought was to try to add an accumulator -- something along the lines of the following:
void rotate_acc(int dim, pixel *src, pixel *dst) {
int i, j;
pixel temp = dst;
for (i = 0; i < dim; i++) {
for (j = 0; j < dim; j++) {
temp[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)];
}
}
dst = temp;
}
But my syntax is very wrong there and I'm not sure how one would go about constructing a temporary version of the struct in question.
Any help is much appreciated. Thanks!
You need a thorough understanding of pointers in C. Put simply: a pointer holds the address where your data is stored in memory (a pixel struct in your case).
In your code, the function rotate_acc takes a pixel pointer as an argument: pixel *dst. At first you may be tempted to write pixel *tmp = dst, but keep in mind that this is a shallow copy: only the address is copied, not the data it points at. Hence if you modify tmp by writing tmp->red = 0, the original data is modified too.
If you need a deep copy of a pixel, write pixel tmp = *dst. Note that this copies only the one pixel that dst points at, not the whole image.
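To make the difference concrete, here is a minimal sketch. The pixel typedef is repeated from the question; the helper name copy_demo is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    unsigned short red;
    unsigned short green;
    unsigned short blue;
} pixel;

/* Hypothetical helper: modifies a shallow copy and a deep copy of *dst,
   then returns the red value that the caller's pixel ends up with. */
unsigned short copy_demo(pixel *dst) {
    assert(dst != NULL);

    pixel *alias = dst;   /* shallow copy: alias and dst hold the same address */
    pixel  value = *dst;  /* deep copy of one pixel: value is separate storage */

    value.red = 111;      /* changes only the local copy */
    alias->red = 222;     /* changes the pixel that dst points at */
    (void)value;          /* the local copy is discarded on return */

    return dst->red;      /* only the write through alias is visible here */
}
```

Copying a whole image rather than a single pixel would need a separate buffer of dim * dim pixels plus memcpy, which for this assignment only adds work.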
Try this:
void naive_rotate(int dim, pixel *src, pixel *dst) {
    int dimSq = dim * dim;
    int dstIdxStart = dimSq - dim;   /* dst index for src[0]: last row, column 0 */
    int dstIdx = dstIdxStart;
    for (int i = 0; i < dimSq; ++i)
    {
        dst[dstIdx] = src[i];
        dstIdx -= dim;               /* next src pixel lands one dst row higher */
        if (dstIdx < 0)
        {
            /* finished one src row: restart at the bottom of the next dst column */
            ++dstIdxStart;
            dstIdx = dstIdxStart;
        }
    }
}
You will have to double check the maths, but I hope you get the idea.
It removes all the multiplications from the loop. Also, because src is accessed sequentially, it is good for the cache.
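The cache argument can be taken one step further with blocking, the third technique the assignment names. The following is only a sketch under assumptions: the name rotate_blocked and the tile size BLOCK are made up here, and the best tile size has to be found by measuring on the target machine.

```c
#include <assert.h>

typedef struct {
    unsigned short red;
    unsigned short green;
    unsigned short blue;
} pixel;

#define RIDX(i,j,n) ((i)*(n)+(j))
#define BLOCK 8  /* assumed tile size; tune it by measurement */

/* Rotates in BLOCK x BLOCK tiles so that the dst rows written by one tile
   are reused while they are still in cache. */
void rotate_blocked(int dim, const pixel *src, pixel *dst) {
    assert(dim > 0);
    for (int ib = 0; ib < dim; ib += BLOCK) {
        int iend = (ib + BLOCK < dim) ? ib + BLOCK : dim;  /* clip last tile */
        for (int jb = 0; jb < dim; jb += BLOCK) {
            int jend = (jb + BLOCK < dim) ? jb + BLOCK : dim;
            for (int i = ib; i < iend; i++)
                for (int j = jb; j < jend; j++)
                    dst[RIDX(dim - 1 - j, i, dim)] = src[RIDX(i, j, dim)];
        }
    }
}
```

The inner two loops do the same assignment as naive_rotate; only the visiting order changes, so the result is identical while the strided writes to dst stay within a small window.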
Most questions and answers about dynamic allocation inside a function are based on double pointers.
But I was advised to avoid double pointers unless I have to use them, so I want to allocate an 'array pointer' (not an 'array of pointers') and hide the allocation inside a function.
int (*arr1d) = calloc(dim1, sizeof(*arr1d));
int (*arr2d)[dim2] = calloc(dim1, sizeof(*arr2d));
Since the above lines are the typical dynamic allocation of a pointer to an array, I tried the following.
#include <stdio.h>
#include <stdlib.h>
int allocateArray1D(int n, int **arr) {
*arr = calloc(n, sizeof(*arr));
for (int i = 0; i < n; i++) {
(*arr)[i] = i;
}
return 0;
}
int allocateArray2D(int nx, int ny, int *(*arr)[ny]) {
*arr[ny] = calloc(nx, sizeof(*arr));
for (int i = 0; i < nx; i++) {
for (int j = 0; j < ny; j++) {
(*arr)[i][j] = 10 * i + j;
}
}
return 0;
}
int main() {
int nx = 3;
int ny = 2;
int *arr1d = NULL; // (1)
allocateArray1D(nx, &arr1d);
int(*arr2d)[ny] = NULL; // (2)
allocateArray2D(nx, ny, &arr2d);
for (int i = 0; i < nx; i++) {
printf("arr1d[%d] = %d \n", i, arr1d[i]);
}
printf("\n");
printf("arr2d \n");
for (int i = 0; i < nx; i++) {
for (int j = 0; j < ny; j++) {
printf(" %d ", arr2d[i][j]);
}
printf("\n");
}
return 0;
}
The error message already appears during compilation:
03.c(32): warning #167: argument of type "int (**)[ny]" is incompatible with parameter of type "int *(*)[*]"
allocateArray2D(nx, ny, &arr2d);
^
It is evident from the error message that I messed up the argument type (which I wrote as int *(*arr)[ny]), but what should I put there? I tried some variants like int *((*arr)[ny]), but they didn't work.
If I remove the 2D parts, the code compiles and runs as expected. But I wonder whether this is the right practice, at least for the 1D case, since there are many examples where code behaves as expected but in fact contains wrong or non-standard lines.
Also, the above code is not satisfactory in the first place. I even want to remove the lines in main() that I marked as (1) and (2).
So in the end I want code something like this, but all with 'array pointers':
int **arr2d;
allocateArray2D(nx, ny, arr2d);
How could this be done?
You need to pass the array pointer by reference (not pass an array pointer to an array of int*):
int *(*arr)[ny] -> int (**arr)[ny]
The function becomes:
int allocateArray2D(int nx, int ny, int (**arr)[ny]) {
*arr = calloc(nx, sizeof(int[ny])); // or sizeof(**arr)
for (int i = 0; i < nx; i++) {
for (int j = 0; j < ny; j++) {
(*arr)[i][j] = 10 * i + j;
}
}
return 0;
}
For details, check out Correctly allocating multi-dimensional arrays
Best practice with the malloc family is to always check whether the allocation succeeded and to always free() at the end of the program.
As a micro-optimization, I'd recommend using *arr = malloc( sizeof(int[nx][ny]) ); instead, since calloc just adds pointless overhead in the form of zero initialization. There's no use for it here since every item is assigned explicitly anyway.
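Putting those two remarks together, a minimal sketch could look like the following. The return-code convention (0 or -1) and the caller sum_and_free are assumptions made for illustration, not part of the original answer.

```c
#include <assert.h>
#include <stdlib.h>

/* Allocates an nx-by-ny block and hands it back through arr.
   Returns 0 on success, -1 if malloc fails (then *arr is NULL). */
int allocateArray2D(size_t nx, size_t ny, int (**arr)[ny]) {
    assert(nx > 0 && ny > 0);
    *arr = malloc(sizeof(int[nx][ny]));
    if (*arr == NULL)
        return -1;
    for (size_t i = 0; i < nx; i++)
        for (size_t j = 0; j < ny; j++)
            (*arr)[i][j] = (int)(10 * i + j);
    return 0;
}

/* Hypothetical caller: sums all elements, then releases the block. */
long sum_and_free(size_t nx, size_t ny) {
    int (*arr2d)[ny] = NULL;
    long total = 0;
    if (allocateArray2D(nx, ny, &arr2d) != 0)
        return -1;               /* allocation failed, nothing to free */
    for (size_t i = 0; i < nx; i++)
        for (size_t j = 0; j < ny; j++)
            total += arr2d[i][j];
    free(arr2d);                 /* one malloc, one matching free */
    return total;
}
```

Because the whole array lives in a single allocation, one free() releases it; no per-row cleanup is needed.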
Wrong parameter type
Strange allocation
Wrong size type
I would return the array as void * too (at least to check if allocation did not fail).
void *allocateArray2D(size_t nx, size_t ny, int (**arr)[ny]) {
    //*arr = calloc(nx, sizeof(**arr)); calloc is not needed here as you assign values to the array
    *arr = malloc(nx * sizeof(**arr));
    if (*arr == NULL) {
        return NULL; // allocation failed: do not write through a null pointer
    }
    for (size_t i = 0; i < nx; i++) {
        for (size_t j = 0; j < ny; j++) {
            (*arr)[i][j] = 10 * i + j;
        }
    }
    return *arr;
}
I'm working on speeding up Conway's Game of Life. Right now, the code looks at a cell and then adds up the 3x3 area immediately surrounding the point, then subtracts the value at the point we're looking at. Here's the function that is doing that:
static int neighbors2 (board b, int i, int j)
{
int n = 0;
int i_left = max(0,i-1);
int i_right = min(HEIGHT, i+2);
int j_left = max(0,j-1);
int j_right = min(WIDTH, j+2);
int ii, jj;
for (jj = j_left; jj < j_right; ++jj) {
for (ii = i_left; ii < i_right; ii++) {
n += b[ii][jj];
}
}
return n - b[i][j];
}
And here is the code I've been trying to use to iterate through pieces at a time:
//Iterates through the first row of the 3x3 area
static int first_row(board b, int i, int j) {
int f = 0;
int i_left = max(0,i-1);
int j_left = max(0,j-1);
int j_right = min(WIDTH, j+2);
int jj;
for (jj = j_left; jj < j_right; ++jj) {
f += b[i_left][jj];
}
return f;
}
//Iterates and adds up the second row of the 3x3 area
static int second_row(board b, int i, int j) {
int g = 0;
int i_right = min(HEIGHT, i+2);
int j_left = max(0,j-1);
int j_right = min(WIDTH, j+2);
int jj;
if (i_right != i) {
for (jj = j_left; jj < j_right; ++jj) {
g += b[i][jj];
}
}
return g;
}
//iterates and adds up the third row of the 3x3 area.
static int third_row(board b, int i, int j) {
int h = 0;
int i_right = min(HEIGHT, i+2);
int j_left = max(0,j-1);
int j_right = min(WIDTH, j+2);
int jj;
for (jj = j_left; jj < j_right; ++jj) {
h += b[i_right][jj];
}
return h;
}
//adds up the surrounding spots
//subtracts the spot we're looking at.
static int addUp(board b, int i, int j) {
int n = first_row(b, i, j) + second_row(b, i, j) + third_row(b, i, j);
return n - b[i][j];
}
But, for some reason it isn't working. I have no idea why.
Things to note:
Sometimes i == i_right, so we do not want to add up a row twice.
The three functions are supposed to do exactly the same thing as neighbors2, in separate pieces.
min and max are functions that were premade for me.
Sometimes j == j_right, so we do not want to add up something twice. I'm pretty confident the loop takes care of this, however.
Tips and things to consider are appreciated.
Thanks all. I've been working on this for a couple hours now and have no idea what is going wrong. It seems like it should work but I keep getting incorrect solutions at random spots among the board.
In neighbors2, you set i_left and i_right so that they're limited to the rows of the grid. If the current cell is in the top or bottom row, you only loop through two rows instead of three.
In first_row() and third_row() you also limit the row index to the grid. But the result is that on the top row first_row() adds up the cells on the same row as the current cell, which is what second_row() already does, so that row is counted twice. third_row() has a related problem: it indexes b[i_right], but i_right is the exclusive bound taken from neighbors2 (i + 2 for interior rows, HEIGHT on the last row), so it reads the wrong row or runs off the board. It should read row i + 1.
You shouldn't call first_row() when i == 0, and you shouldn't call third_row() when i == HEIGHT - 1 (the last valid row, since neighbors2 treats HEIGHT as an exclusive bound).
static int addUp(board b, int i, int j) {
    int n = (i == 0 ? 0 : first_row(b, i, j)) +
            second_row(b, i, j) +
            (i == HEIGHT - 1 ? 0 : third_row(b, i, j));
    return n - b[i][j];
}
Another option would be to do the check in the functions themselves:
static int first_row(board b, int i, int j) {
    if (i == 0) {
        return 0;   /* there is no row above the top row */
    }
    int f = 0;
    int j_left = max(0,j-1);
    int j_right = min(WIDTH, j+2);
    int jj;
    for (jj = j_left; jj < j_right; ++jj) {
        f += b[i-1][jj];   /* the row above the current cell */
    }
    return f;
}
and similarly for third_row(). But doing it in the caller saves the overhead of the function calls.
BTW, your variable names are very confusing. All the i variables are for rows, which go from top to bottom, not left to right.
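As an illustration of both points (row/column naming and skipping rows that fall off the board), here is a minimal sketch. The board typedef and the HEIGHT and WIDTH values are assumptions, since the question does not show them, and the names row_sum and count_neighbours are made up.

```c
#include <assert.h>

#define HEIGHT 4   /* assumed board size, for illustration only */
#define WIDTH  5
typedef int board[HEIGHT][WIDTH];  /* assumed definition of board */

/* Sums the cells of one row in columns col-1..col+1, clamped to the board. */
static int row_sum(board b, int row, int col) {
    int first = (col > 0) ? col - 1 : 0;
    int last  = (col < WIDTH - 1) ? col + 1 : WIDTH - 1;  /* inclusive */
    int sum = 0;
    for (int c = first; c <= last; c++)
        sum += b[row][c];
    return sum;
}

/* Counts the live neighbours of (row, col); rows outside the board are skipped. */
int count_neighbours(board b, int row, int col) {
    assert(row >= 0 && row < HEIGHT && col >= 0 && col < WIDTH);
    int n = row_sum(b, row, col);
    if (row > 0)
        n += row_sum(b, row - 1, col);
    if (row < HEIGHT - 1)
        n += row_sum(b, row + 1, col);
    return n - b[row][col];   /* the cell itself is not a neighbour */
}
```

Using one helper for all three rows avoids the situation where the three copies drift apart and clamp differently.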
#include <stdio.h>
#include <stdlib.h>
#define ROWSDISP 50
#define COLSDISP 100
int rows=ROWSDISP+2, cols=COLSDISP+2;
This is to avoid illegal indexes when stepping over the neighbours.
struct onecell {char alive;
char neibs;} **cells;
This is the foundation of a (dynamic) 2D-array, of a small struct.
To create space for each row plus the space to hold an array of row pointers:
void init_cells()
{
int i;
cells = calloc(rows, sizeof(*cells));
for(i=0; i<=rows-1; i++)
cells[i] = calloc(cols, sizeof(**cells));
}
I skip the rand_fill() and glider() funcs. A cell can be set by
cells[y][x].alive=1.
int main(void) {
struct onecell *c, *n1, *rlow;
int i, j, loops=0;
char nbs;
init_cells();
rand_fill();
glider();
while (loops++ < 1000) {
printf("\n%d\n", loops);
for (i = 1; i <= rows-2; i++) {
for (j = 1; j <= cols-2; j++) {
c = &cells[ i ][ j ];
n1 = &cells[ i ][j+1];
rlow = cells[i+1];
nbs = c->neibs + n1->alive + rlow[ j ].alive
+ rlow[j+1].alive
+ rlow[j-1].alive;
if(c->alive) {
printf("#");
n1->neibs++;
rlow[ j ].neibs++;
rlow[j+1].neibs++;
rlow[j-1].neibs++;
if(nbs < 2 || nbs > 3)
c->alive = 0;
} else {
printf(" ");
if(nbs == 3)
c->alive = 1;
}
c->neibs = 0; // reset for next cycle
}
printf("\n");
}
}
return(0);
}
There is no iterating a 3x3 square here. Of the 8 neighbours,
only the 4 downstream ones are checked; but at the same time
their counters are raised.
A benchmark with 100x100 grid:
# time ./a.out >/dev/null
real 0m0.084s
user 0m0.084s
sys 0m0.000s
# bc <<<100*100*1000/.084
119047619
And each of these 100M cells needs to check 8 neighbours, so this is close to the CPU frequency (1 neighbour check per cycle).
It seems twice as fast as the rosetta code solution.
There is also no need to switch boards, thanks to the investment in the second field of each cell.
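The two extra rows and columns (ROWSDISP+2, COLSDISP+2) are what make the unchecked j-1, j+1 and i+1 accesses safe. A minimal sketch of that padding idea on its own, without the forward-counting trick, might look like this. The small sizes, the static grid and the name count_padded are assumptions for illustration.

```c
#include <assert.h>

#define ROWSDISP 4            /* assumed small sizes for the sketch */
#define COLSDISP 6

/* Grid with a one-cell border of dead cells on every side. */
static char grid[ROWSDISP + 2][COLSDISP + 2];

/* Counts the 8 neighbours of a visible cell without any bounds clamping;
   valid for 1 <= y <= ROWSDISP and 1 <= x <= COLSDISP. */
int count_padded(int y, int x) {
    assert(y >= 1 && y <= ROWSDISP && x >= 1 && x <= COLSDISP);
    return grid[y-1][x-1] + grid[y-1][x] + grid[y-1][x+1]
         + grid[y][x-1]                  + grid[y][x+1]
         + grid[y+1][x-1] + grid[y+1][x] + grid[y+1][x+1];
}
```

The border cells are never set alive, so they contribute zero to every count and no min()/max() calls are needed in the hot loop.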
Given the rise of VLAs since C99, it has become easier to pass a multidimensional array of unknown size to a function. But there is a decent amount of controversy around using VLAs. Some readily endorse them ("It is often nicer to use dynamic memory, alloca() or VLAs." [1]); others scorn them. The question I was asking myself is what the standard way was, in the C90 days, to pass a multidimensional array to a function. Here is a little code:
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char *argv[])
{
int arr[2][4];
int i;
int j;
for(i = 0; i < 2; i++) {
for(j = 0; j < 4; j++) {
arr[i][j] = j;
}
}
exit(EXIT_SUCCESS);
}
I could think of one way: passing a pointer to a pointer:
void foo_a(int m, int n, int **ptr_arr)
{
int i, j;
for (i = 0; i < m; i++) {
for (j = 0; j < n; j++) {
ptr_arr[i][j] += 1;
}
printf("\n");
}
}
But that would involve building an array of row pointers first, by inserting something like the following into main (which is not pretty):
int *unidim_arr[ROW];
for (i = 0; i < ROW; i++) {
unidim_arr[i] = &(arr[i][0]);
}
Another one would probably be using a single pointer and calculating the offset by hand, which is error-prone:
void foo_b(int m, int n, int *ptr_arr)
{
int i, j;
for (i = 0; i < m; i++) {
for (j = 0; j < n; j++) {
*((ptr_arr + i * n) + j) += 1;
}
}
}
The solution that strikes me as nicest is using something like
void foo_c(int m, int n, int (*ptr_arr)[])
{
int i, j;
for (i = 0; i < m; i++) {
for (j = 0; j < n; j++) {
ptr_arr[i][j] += 1;
}
}
}
but to my understanding this would only work with VLAs, where I can simply specify (*ptr_arr)[n] in the function's parameter list? Is there another way to do it in C90, with special attention to foo_c()?
1. Please, no systemd-bashing.
One method is to pass a pointer to the first element of the array along with the array dimensions, then treat that pointer as a 1-d array in your function.
Example:
void foo( int *arr, size_t r, size_t c ) // process a 2D array defined as int arr[r][c]
{
for ( size_t i = 0; i < r; i++ )
for ( size_t j = 0; j < c; j++ )
arr[i * c + j] = some_value(); // calculate index manually: row i starts at i * c
}
int main( void )
{
int arr[4][5];
foo( &arr[0][0], 4, 5 );
}
This scales up pretty easily to higher dimensioned arrays. Naturally this only works for true multi-dimensional arrays where the rows are all adjacent in memory. This won't work for arrays dynamically allocated a row at a time, such as
int **arr = malloc( sizeof *arr * rows );
for ( size_t i = 0; i < rows; i++ )
arr[i] = malloc( sizeof *arr[i] * cols );
since the rows aren't guaranteed to be adjacent, but in that case you'd just use the arr pointer as-is:
void bar( int **arr, size_t r, size_t c ) // process a 2D array defined as int **arr
{
for ( size_t i = 0; i < r; i++ )
for ( size_t j = 0; j < c; j++ )
arr[i][j] = some_value();
}
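For the C90 angle of the question specifically, two variants can be sketched that avoid VLAs and declarations inside for statements. The names COLS, AT, add_one_fixed and add_one_flat are made up for illustration, and the first variant assumes the column count is a compile-time constant.

```c
#include <assert.h>
#include <stddef.h>

#define COLS 4  /* assumed fixed column count */

/* C90-compatible: the column count is part of the parameter type,
   so arr[i][j] works without a VLA. */
void add_one_fixed(int (*arr)[COLS], size_t rows) {
    size_t i, j;
    assert(arr != NULL);
    for (i = 0; i < rows; i++)
        for (j = 0; j < COLS; j++)
            arr[i][j] += 1;
}

/* C90-compatible: flat pointer plus a macro that hides the index math. */
#define AT(p, ncols, i, j) ((p)[(i) * (ncols) + (j)])

void add_one_flat(int *p, size_t rows, size_t cols) {
    size_t i, j;
    assert(p != NULL);
    for (i = 0; i < rows; i++)
        for (j = 0; j < cols; j++)
            AT(p, cols, i, j) += 1;
}
```

With int arr[2][COLS] in the caller, the calls would be add_one_fixed(arr, 2) and add_one_flat(&arr[0][0], 2, COLS). The macro keeps the row-stride multiplication in one place, which removes most of the error-proneness of hand-written offsets.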
#include <stdio.h>
#include <stdlib.h>
#define MAX_ROWS 5
#define MAX_COLS 5
int globalvariable = 100;
void CreateMatrix(int ***Matrix)
{
int **ptr;
char *cp;
int i = 0;
*Matrix = (int**)malloc((sizeof(int*) * MAX_ROWS) + ((MAX_ROWS * MAX_COLS)*sizeof(int)));
ptr = *Matrix;
cp = (char*)((char*)*Matrix + (sizeof(int*) * MAX_ROWS));
for(i =0; i < MAX_ROWS; i++)
{
cp = (char*)(cp + ((sizeof(int) * MAX_COLS) * i));
*ptr = (int*)cp;
ptr++;
}
}
void FillMatrix(int **Matrix)
{
int i = 0, j = 0;
for(i = 0; i < MAX_ROWS; i++)
{
for(j = 0; j < MAX_COLS; j++)
{
globalvariable++;
Matrix[i][j] = globalvariable;
}
}
}
void DisplayMatrix(int **Matrix)
{
int i = 0, j = 0;
for(i = 0; i < MAX_ROWS; i++)
{
printf("\n");
for(j = 0; j < MAX_COLS; j++)
{
printf("%d\t", Matrix[i][j]);
}
}
}
void FreeMatrix(int **Matrix)
{
free(Matrix);
}
int main()
{
int **Matrix1, **Matrix2;
CreateMatrix(&Matrix1);
FillMatrix(Matrix1);
DisplayMatrix(Matrix1);
FreeMatrix(Matrix1);
getchar();
return 0;
}
If the code is executed, I get the following error messages in a dialogbox.
Windows has triggered a breakpoint in sam.exe.
This may be due to a corruption of the heap, which indicates a bug in sam.exe or any of the DLLs it has loaded.
This may also be due to the user pressing F12 while sam.exe has focus.
The output window may have more diagnostic information.
I tried to debug in Visual Studio: when the printf("\n"); statement of DisplayMatrix() is executed, the same error message appears.
If I press Continue, it prints 101 to 125 as expected. In Release mode there is no issue!
Please share your ideas.
In C it is often simpler and more efficient to allocate a numerical matrix with calloc and use explicit index calculation ... so
int width = somewidth /* put some useful width computation */;
int height = someheight /* put some useful height computation */;
int *mat = calloc(width*height, sizeof(int));
if (!mat) { perror ("calloc"); exit (EXIT_FAILURE); };
Then initialize and fill the matrix by computing the offset appropriately, e.g. something like
for (int i=0; i<width; i++)
for (int j=0; j<height; j++)
mat[i*height+j] = i+j;
If the matrix has (as you show) dimensions known at compile time, you could either stack-allocate it with
{ int matrix [NUM_COLS][NUM_ROWS];
/* do something with matrix */
}
or heap-allocate it. I find it more readable to make it a struct like
struct matrix_st { int matfield [NUM_COLS][NUM_ROWS]; };
struct matrix_st *p = malloc(sizeof(struct matrix_st));
if (!p) { perror("malloc"); exit(EXIT_FAILURE); };
then fill it appropriately:
for (int i=0; i<NUM_COLS; i++)
for (int j=0; j<NUM_ROWS; j++)
p->matfield[i][j] = i+j;
Remember that malloc returns an uninitialized memory zone so you need to initialize all of it.
A two-dimensional array is not the same as a pointer-to-pointer. Maybe you meant
int (*mat)[MAX_COLS] = malloc(MAX_ROWS * sizeof(*mat));
instead?
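A minimal sketch of that pointer-to-array approach, reusing MAX_ROWS and MAX_COLS from the question. The typedef row_t and the function names are made up for illustration.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_ROWS 5
#define MAX_COLS 5

typedef int row_t[MAX_COLS];   /* hypothetical name for one matrix row */

/* Allocates MAX_ROWS x MAX_COLS ints in one block; returns NULL on failure.
   The caller releases the block with a single free(). */
row_t *create_matrix(void) {
    row_t *mat = malloc(MAX_ROWS * sizeof(*mat));
    return mat;
}

/* Fills the matrix with start+1, start+2, ... in row-major order. */
void fill_matrix(row_t *mat, int start) {
    assert(mat != NULL);
    for (int i = 0; i < MAX_ROWS; i++)
        for (int j = 0; j < MAX_COLS; j++)
            mat[i][j] = ++start;
}
```

Here mat[i][j] is ordinary 2D indexing into one contiguous block, so there is no table of row pointers to set up and nothing to get wrong in the pointer arithmetic.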
Read this tutorial.
It is a very good and complete tutorial on pointers; you can go directly to Chapter 9 if you already have solid basic knowledge.
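Applying that pointer arithmetic to the CreateMatrix in the question: the loop adds (sizeof(int) * MAX_COLS) * i to cp on every pass without resetting cp, so the offsets accumulate (0, 1, 3, 6, 10 rows) and the last two row pointers land past the end of the allocation, which would explain the heap corruption report. A minimal sketch of the same single-allocation layout, with each row computed from a fixed base, follows. The function name create_row_table is made up for illustration.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_ROWS 5
#define MAX_COLS 5

/* One malloc holds the row-pointer table followed by the int data.
   Returns NULL if the allocation fails; release with a single free(). */
int **create_row_table(void) {
    int **rows = malloc(MAX_ROWS * sizeof(int *)
                        + MAX_ROWS * MAX_COLS * sizeof(int));
    if (rows == NULL)
        return NULL;

    int *data = (int *)(rows + MAX_ROWS);   /* data starts right after the table */
    for (int i = 0; i < MAX_ROWS; i++)
        rows[i] = data + i * MAX_COLS;      /* offset from the fixed base, not accumulated */

    /* the last row must end exactly where the data area ends */
    assert(rows[MAX_ROWS - 1] + MAX_COLS == data + MAX_ROWS * MAX_COLS);
    return rows;
}
```

FillMatrix, DisplayMatrix and FreeMatrix from the question would work unchanged on the returned int **.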
What I'm trying to do is take this code:
char naive_smooth_descr[] = "naive_smooth: Naive baseline implementation";
void naive_smooth(int dim, pixel *src, pixel *dst)
{
int i, j;
for (i = 0; i < dim; i++)
for (j = 0; j < dim; j++)
dst[RIDX(i, j, dim)] = avg(dim, i, j, src);
}
and replace the function call avg(dim, i, j, src); with the actual code at the very bottom of the page. Then, take that code and replace all the function calls in that code with the actual code, etc.
If you're asking why do all this, the reason is simple: when you get rid of function calls the program runs faster, and I'm trying to attain the fastest cycles per element when the above code runs by getting rid of all the function calls and replacing it with the actual code.
Now I'm really just having a lot of trouble doing this. Do I take the code with the brackets and then just copy and paste? Do I leave out the brackets? Do I include the beginning of the code, for example, static pixel avg(int dim, int i, int j, pixel *src) and then the brackets and then the code to replace the function call?
I am going to paste all the code here:
/* A struct used to compute averaged pixel value */
typedef struct {
int red;
int green;
int blue;
int num;
} pixel_sum;
/* Compute min and max of two integers, respectively */
static int min(int a, int b) { return (a < b ? a : b); }
static int max(int a, int b) { return (a > b ? a : b); }
/*
* initialize_pixel_sum - Initializes all fields of sum to 0
*/
static void initialize_pixel_sum (pixel_sum *sum)
{
sum->red = sum->green = sum->blue = 0;
sum->num = 0;
return;
}
/*
* accumulate_sum - Accumulates field values of p in corresponding
* fields of sum
*/
static void accumulate_sum (pixel_sum *sum, pixel p)
{
sum->red += (int) p.red;
sum->green += (int) p.green;
sum->blue += (int) p.blue;
sum->num++;
return;
}
/*
* assign_sum_to_pixel - Computes averaged pixel value in current_pixel
*/
static void assign_sum_to_pixel (pixel *current_pixel, pixel_sum sum)
{
current_pixel->red = (unsigned short) (sum.red/sum.num);
current_pixel->green = (unsigned short) (sum.green/sum.num);
current_pixel->blue = (unsigned short) (sum.blue/sum.num);
return;
}
/*
* avg - Returns averaged pixel value at (i,j)
*/
This is the code that I want to replace the function call avg(dim, i, j, src); with:
static pixel avg (int dim, int i, int j, pixel *src)
{
int ii, jj;
pixel_sum sum;
pixel current_pixel;
initialize_pixel_sum(&sum);
for(ii = max(i-1, 0); ii <= min(i+1, dim-1); ii++)
for(jj = max(j-1, 0); jj <= min(j+1, dim-1); jj++)
accumulate_sum(&sum, src[RIDX(ii, jj, dim)]);
assign_sum_to_pixel(&current_pixel, sum);
return current_pixel;
}
/*
* mysmooth - my smooth
*/
char mysmooth_descr[] = "my smooth: My smooth";
void mysmooth (int dim, pixel *src, pixel *dst)
{
int i, j;
int ii, jj;
pixel_sum sum;
pixel current_pixel;
for (i = 0; i < dim; i++)
for (j = 0; j < dim; j++)
{
initialize_pixel_sum(&sum);
for(ii = max(i-1, 0); ii <= min(i+1, dim-1); ii++)
for(jj = max(j-1, 0); jj <= min(j+1, dim-1); jj++)
accumulate_sum(&sum, src[RIDX(ii, jj, dim)]);
assign_sum_to_pixel(&current_pixel, sum);
dst[RIDX(i, j, dim)] = current_pixel;
}
}
So is this what my code should look like after I finish taking the code from avg() and putting it in place of the function call?
If your code base is small (say 10-12 functions), you might want to try putting the keyword inline in front of each of the functions.
Second option: use a compiler option that inlines the function calls rather than doing it manually (that is what compilers are for). Which compiler are you using? You can look online for its option that inlines function calls (if it has one).
Third, if you are using GCC for compiling your code, you can specify the always_inline attribute for the function. Here is how to use it:
static pixel avg (int dim, int i, int j, pixel *src) __attribute__((always_inline));
If you are using a C99 compiler or a C++ compiler, you can use the inline keyword. However, that won't guarantee that the call is replaced with the actual code; the compiler does so only if it deems it more efficient.
Otherwise, if you are using pure C89, then avg() has to be a macro. Then you are guaranteed to have the function "call" replaced with the actual code.
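The three options can be sketched side by side. The macro name FORCE_INLINE and the helpers min_i, max_i, MIN_M and clamp are made up for illustration; only the inline keyword and GCC's always_inline attribute come from the answers above.

```c
#include <assert.h>

/* GCC and Clang accept always_inline; other compilers fall back to plain inline. */
#if defined(__GNUC__)
#define FORCE_INLINE static inline __attribute__((always_inline))
#else
#define FORCE_INLINE static inline
#endif

FORCE_INLINE int min_i(int a, int b) { return a < b ? a : b; }
FORCE_INLINE int max_i(int a, int b) { return a > b ? a : b; }

/* C89-style alternative: a macro is always expanded in place, but it
   evaluates its arguments more than once, so avoid side effects in them. */
#define MIN_M(a, b) ((a) < (b) ? (a) : (b))

/* Hypothetical helper: clamps v into [lo, hi] using the inlined functions. */
int clamp(int v, int lo, int hi) {
    assert(lo <= hi);
    return min_i(max_i(v, lo), hi);
}
```

With optimization enabled, small static functions like these are usually inlined even without the attribute, so it is worth measuring before rewriting anything by hand.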
I have to say I agree with the approach of making sure you're using compiler optimizations and inline... but if you still want an answer to your specific question, I think what you're getting at is something like:
for (j = 0; j < dim; j++)
{
/* ...avg() code body except for the return... */
dst[RIDX(i, j, dim)] = current_pixel;
}
Use inline and macros: http://gcc.gnu.org/onlinedocs/cpp/Macros.html
I peeled the first and last iterations off the loops to eliminate min() and max() from the code:
void smooth_B(int dim, struct pixel src[dim][dim], struct pixel dst[dim][dim]){
dst[0][0].red =(src[0][0].red +src[1][0].red +src[0][1].red +src[1][1].red )/4;
dst[0][0].green=(src[0][0].green+src[1][0].green+src[0][1].green+src[1][1].green)/4;
dst[0][0].blue =(src[0][0].blue +src[1][0].blue +src[0][1].blue +src[1][1].blue )/4;
for( int j=1; j<dim-1; j++){
dst[0][j].red =(src[0][j-1].red +src[1][j-1].red +src[0][j].red +src[1][j].red +src[0][j+1].red +src[1][j+1].red )/6;
dst[0][j].green=(src[0][j-1].green+src[1][j-1].green+src[0][j].green+src[1][j].green+src[0][j+1].green+src[1][j+1].green)/6;
dst[0][j].blue =(src[0][j-1].blue +src[1][j-1].blue +src[0][j].blue +src[1][j].blue +src[0][j+1].blue +src[1][j+1].blue )/6;
}
dst[0][dim-1].red =(src[0][dim-2].red +src[1][dim-2].red +src[0][dim-1].red +src[1][dim-1].red )/4;
dst[0][dim-1].green=(src[0][dim-2].green+src[1][dim-2].green+src[0][dim-1].green+src[1][dim-1].green)/4;
dst[0][dim-1].blue =(src[0][dim-2].blue +src[1][dim-2].blue +src[0][dim-1].blue +src[1][dim-1].blue )/4;
for( int i=1; i<dim-1; i++){
dst[i][0].red =(src[i-1][0].red +src[i-1][1].red +src[i][0].red +src[i][1].red +src[i+1][0].red +src[i+1][1].red )/6;
dst[i][0].green=(src[i-1][0].green+src[i-1][1].green+src[i][0].green+src[i][1].green+src[i+1][0].green+src[i+1][1].green)/6;
dst[i][0].blue =(src[i-1][0].blue +src[i-1][1].blue +src[i][0].blue +src[i][1].blue +src[i+1][0].blue +src[i+1][1].blue )/6;
for( int j=1; j<dim-1; j++){
dst[i][j].red =(src[i-1][j-1].red +src[i][j-1].red +src[i+1][j-1].red +src[i-1][j].red +src[i][j].red +src[i+1][j].red +src[i-1][j+1].red +src[i][j+1].red +src[i+1][j+1].red )/9;
dst[i][j].green=(src[i-1][j-1].green+src[i][j-1].green+src[i+1][j-1].green+src[i-1][j].green+src[i][j].green+src[i+1][j].green+src[i-1][j+1].green+src[i][j+1].green+src[i+1][j+1].green)/9;
dst[i][j].blue =(src[i-1][j-1].blue +src[i][j-1].blue +src[i+1][j-1].blue +src[i-1][j].blue +src[i][j].blue +src[i+1][j].blue +src[i-1][j+1].blue +src[i][j+1].blue +src[i+1][j+1].blue )/9;
}
dst[i][dim-1].red =(src[i-1][dim-2].red +src[i][dim-2].red +src[i+1][dim-2].red +src[i-1][dim-1].red +src[i][dim-1].red +src[i+1][dim-1].red )/6;
dst[i][dim-1].green=(src[i-1][dim-2].green+src[i][dim-2].green+src[i+1][dim-2].green+src[i-1][dim-1].green+src[i][dim-1].green+src[i+1][dim-1].green)/6;
dst[i][dim-1].blue =(src[i-1][dim-2].blue +src[i][dim-2].blue +src[i+1][dim-2].blue +src[i-1][dim-1].blue +src[i][dim-1].blue +src[i+1][dim-1].blue )/6;
}
dst[dim-1][0].red =(src[dim-2][0].red +src[dim-2][1].red +src[dim-1][0].red +src[dim-1][1].red )/4;
dst[dim-1][0].green=(src[dim-2][0].green+src[dim-2][1].green+src[dim-1][0].green+src[dim-1][1].green)/4;
dst[dim-1][0].blue =(src[dim-2][0].blue +src[dim-2][1].blue +src[dim-1][0].blue +src[dim-1][1].blue )/4;
for( int j=1; j<dim-1; j++){
dst[dim-1][j].red =(src[dim-2][j-1].red +src[dim-1][j-1].red +src[dim-2][j].red +src[dim-1][j].red +src[dim-2][j+1].red +src[dim-1][j+1].red )/6;
dst[dim-1][j].green=(src[dim-2][j-1].green+src[dim-1][j-1].green+src[dim-2][j].green+src[dim-1][j].green+src[dim-2][j+1].green+src[dim-1][j+1].green)/6;
dst[dim-1][j].blue =(src[dim-2][j-1].blue +src[dim-1][j-1].blue +src[dim-2][j].blue +src[dim-1][j].blue +src[dim-2][j+1].blue +src[dim-1][j+1].blue )/6;
}
dst[dim-1][dim-1].red =(src[dim-2][dim-2].red +src[dim-1][dim-2].red +src[dim-2][dim-1].red +src[dim-1][dim-1].red )/4;
dst[dim-1][dim-1].green=(src[dim-2][dim-2].green+src[dim-1][dim-2].green+src[dim-2][dim-1].green+src[dim-1][dim-1].green)/4;
dst[dim-1][dim-1].blue =(src[dim-2][dim-2].blue +src[dim-1][dim-2].blue +src[dim-2][dim-1].blue +src[dim-1][dim-1].blue )/4;
}
By my measurement it is ~50% faster than the original code. The next step is the elimination of repeated calculations.
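One way that next step could look, sketched for the interior pixels only: each 3x3 sum is the sum of three vertical column sums, and two of those three are shared with the neighbouring output pixel. The struct colsum and the function names below are assumptions for illustration, and the border pixels would still need the special cases shown above.

```c
#include <assert.h>

struct pixel { unsigned short red, green, blue; };

/* Per-channel sum of the three vertically adjacent pixels in one column. */
struct colsum { int red, green, blue; };

static struct colsum column_sum(int dim, struct pixel src[dim][dim], int i, int j) {
    struct colsum s;
    s.red   = src[i-1][j].red   + src[i][j].red   + src[i+1][j].red;
    s.green = src[i-1][j].green + src[i][j].green + src[i+1][j].green;
    s.blue  = src[i-1][j].blue  + src[i][j].blue  + src[i+1][j].blue;
    return s;
}

/* Smooths only the interior pixels (1 <= i, j <= dim-2). Each column sum
   is computed once and reused for up to three neighbouring outputs. */
void smooth_interior(int dim, struct pixel src[dim][dim], struct pixel dst[dim][dim]) {
    assert(dim >= 3);
    for (int i = 1; i < dim - 1; i++) {
        struct colsum left = column_sum(dim, src, i, 0);
        struct colsum mid  = column_sum(dim, src, i, 1);
        for (int j = 1; j < dim - 1; j++) {
            struct colsum right = column_sum(dim, src, i, j + 1);
            dst[i][j].red   = (unsigned short)((left.red   + mid.red   + right.red)   / 9);
            dst[i][j].green = (unsigned short)((left.green + mid.green + right.green) / 9);
            dst[i][j].blue  = (unsigned short)((left.blue  + mid.blue  + right.blue)  / 9);
            left = mid;    /* slide the 3-column window one step right */
            mid = right;
        }
    }
}
```

Per interior pixel this reads three new source pixels instead of nine; whether that pays off in practice depends on the compiler and cache behaviour, so it should be timed against smooth_B.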