What I'm trying to do is take this code:
char naive_smooth_descr[] = "naive_smooth: Naive baseline implementation";
void naive_smooth(int dim, pixel *src, pixel *dst)
{
int i, j;
for (i = 0; i < dim; i++)
for (j = 0; j < dim; j++)
dst[RIDX(i, j, dim)] = avg(dim, i, j, src);
}
and replace the function call avg(dim, i, j, src); with the actual code at the very bottom of the page. Then, take that code and replace all the function calls in it with their actual code, and so on.
If you're asking why I'd do all this, the reason is simple: getting rid of function calls makes the program run faster, and I'm trying to get the lowest cycles per element (CPE) for the code above by replacing every function call with the actual code.
Now I'm really just having a lot of trouble doing this. Do I take the code with the brackets and then just copy and paste? Do I leave out the brackets? Do I include the beginning of the code, for example, static pixel avg(int dim, int i, int j, pixel *src) and then the brackets and then the code to replace the function call?
I am going to paste all the code here:
/* A struct used to compute averaged pixel value */
typedef struct {
int red;
int green;
int blue;
int num;
} pixel_sum;
/* Compute min and max of two integers, respectively */
static int min(int a, int b) { return (a < b ? a : b); }
static int max(int a, int b) { return (a > b ? a : b); }
/*
* initialize_pixel_sum - Initializes all fields of sum to 0
*/
static void initialize_pixel_sum(pixel_sum *sum)
{
sum->red = sum->green = sum->blue = 0;
sum->num = 0;
return;
}
/*
* accumulate_sum - Accumulates field values of p in corresponding
* fields of sum
*/
static void accumulate_sum(pixel_sum *sum, pixel p)
{
sum->red += (int) p.red;
sum->green += (int) p.green;
sum->blue += (int) p.blue;
sum->num++;
return;
}
/*
* assign_sum_to_pixel - Computes averaged pixel value in current_pixel
*/
static void assign_sum_to_pixel(pixel *current_pixel, pixel_sum sum)
{
current_pixel->red = (unsigned short) (sum.red/sum.num);
current_pixel->green = (unsigned short) (sum.green/sum.num);
current_pixel->blue = (unsigned short) (sum.blue/sum.num);
return;
}
/*
* avg - Returns averaged pixel value at (i,j)
*/
This is the code that I want to replace the function call avg(dim, i, j, src); with:
static pixel avg (int dim, int i, int j, pixel *src)
{
int ii, jj;
pixel_sum sum;
pixel current_pixel;
initialize_pixel_sum(&sum);
for(ii = max(i-1, 0); ii <= min(i+1, dim-1); ii++)
for(jj = max(j-1, 0); jj <= min(j+1, dim-1); jj++)
accumulate_sum(&sum, src[RIDX(ii, jj, dim)]);
assign_sum_to_pixel(&current_pixel, sum);
return current_pixel;
}
/*
* mysmooth - my smooth
*/
char mysmooth_descr[] = "my smooth: My smooth";
void mysmooth (int dim, pixel *src, pixel *dst)
{
int i, j;
int ii, jj;
pixel_sum sum;
pixel current_pixel;
for (i = 0; i < dim; i++)
for (j = 0; j < dim; j++)
{
initialize_pixel_sum(&sum);
for(ii = max(i-1, 0); ii <= min(i+1, dim-1); ii++)
for(jj = max(j-1, 0); jj <= min(j+1, dim-1); jj++)
accumulate_sum(&sum, src[RIDX(ii, jj, dim)]);
assign_sum_to_pixel(&current_pixel, sum);
dst[RIDX(i, j, dim)] = current_pixel;
}
}
So is this what my code should look like after I take the code from avg() and put it in place of the function call?
If your code base is small (say 10-12 functions), you might want to try putting the keyword inline in front of each function.
Second option: use a compiler option that inlines function calls instead of doing it manually (that is what compilers are for). Which compiler are you using? You can look up whether it has an option that inlines all function calls.
Third, if you are using GCC for compiling your code, you can specify the always_inline attribute for the function. Here is how to use it:
static pixel avg (int dim, int i, int j, pixel *src) __attribute__((always_inline));
If you are using a C99 compiler or a C++ compiler, you can use the inline keyword. However, that doesn't guarantee that the call will be replaced with the actual code; the compiler does so only if it deems it more efficient.
Otherwise, if you are using pure C89, then avg() has to be a macro. Then you are guaranteed to have the function "call" replaced with the actual code.
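As a rough sketch of the GCC-specific route, assuming GCC or Clang (the attribute is a compiler extension, not standard C). The names max_int and first_neighbour_row are hypothetical stand-ins for the helpers in the question:

```c
#include <assert.h>

/* Hypothetical helper standing in for min()/max()/avg() from the question.
 * `static inline` alone is only a hint; the GCC/Clang always_inline
 * attribute asks the compiler to inline the call even when it would
 * otherwise decline (for example with optimization turned off). */
static inline int max_int(int a, int b) __attribute__((always_inline));

static inline int max_int(int a, int b)
{
    return a > b ? a : b;
}

/* Hypothetical caller: first row of the 3x3 window around row i,
 * clamped to the top of the image. */
int first_neighbour_row(int i)
{
    assert(i >= 0);
    return max_int(i - 1, 0);
}
```

Whether this actually changes the cycle count is something to measure; with optimization enabled the compiler may already inline such small static functions on its own.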
I have to say I agree with the approach of making sure you're using compiler optimizations and inline... but if you still want an answer to your specific question, I think what you're getting at is something like:
for (j = 0; j < dim; j++)
{
/* ...avg() code body except for the return... */
dst[RIDX(i, j, dim)] = current_pixel;
}
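To make that concrete, here is one way the fully hand-inlined version could look. It is only a sketch: the pixel struct and RIDX macro are assumed framework definitions that the question does not show, and inlined_smooth is a made-up name. The function header, outer braces and return statement of each helper are dropped, and only their bodies remain:

```c
#include <assert.h>

/* Assumed framework definitions (not shown in the question). */
typedef struct { unsigned short red, green, blue; } pixel;
#define RIDX(i, j, n) ((i) * (n) + (j))

/* smooth with avg() and its helpers expanded by hand. */
void inlined_smooth(int dim, pixel *src, pixel *dst)
{
    int i, j, ii, jj;

    assert(dim > 0);
    for (i = 0; i < dim; i++) {
        /* was max(i-1, 0) and min(i+1, dim-1); hoisted out of the j loop */
        int i_lo = i > 0 ? i - 1 : 0;
        int i_hi = i < dim - 1 ? i + 1 : dim - 1;

        for (j = 0; j < dim; j++) {
            int j_lo = j > 0 ? j - 1 : 0;
            int j_hi = j < dim - 1 ? j + 1 : dim - 1;
            int red = 0, green = 0, blue = 0, num = 0; /* initialize_pixel_sum */

            for (ii = i_lo; ii <= i_hi; ii++) {
                for (jj = j_lo; jj <= j_hi; jj++) {
                    pixel p = src[RIDX(ii, jj, dim)];  /* accumulate_sum */
                    red += p.red;
                    green += p.green;
                    blue += p.blue;
                    num++;
                }
            }

            pixel *out = &dst[RIDX(i, j, dim)];        /* assign_sum_to_pixel */
            out->red = (unsigned short)(red / num);
            out->green = (unsigned short)(green / num);
            out->blue = (unsigned short)(blue / num);
        }
    }
}
```

The local variables of each helper move into the caller, and the value that avg() returned is written straight into dst.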
Use inline and macros: http://gcc.gnu.org/onlinedocs/cpp/Macros.html
I unrolled the beginning and the end of the cycles to eliminate min() and max() from the code:
void smooth_B(int dim, struct pixel src[dim][dim], struct pixel dst[dim][dim]){
dst[0][0].red =(src[0][0].red +src[1][0].red +src[0][1].red +src[1][1].red )/4;
dst[0][0].green=(src[0][0].green+src[1][0].green+src[0][1].green+src[1][1].green)/4;
dst[0][0].blue =(src[0][0].blue +src[1][0].blue +src[0][1].blue +src[1][1].blue )/4;
for( int j=1; j<dim-1; j++){
dst[0][j].red =(src[0][j-1].red +src[1][j-1].red +src[0][j].red +src[1][j].red +src[0][j+1].red +src[1][j+1].red )/6;
dst[0][j].green=(src[0][j-1].green+src[1][j-1].green+src[0][j].green+src[1][j].green+src[0][j+1].green+src[1][j+1].green)/6;
dst[0][j].blue =(src[0][j-1].blue +src[1][j-1].blue +src[0][j].blue +src[1][j].blue +src[0][j+1].blue +src[1][j+1].blue )/6;
}
dst[0][dim-1].red =(src[0][dim-2].red +src[1][dim-2].red +src[0][dim-1].red +src[1][dim-1].red )/4;
dst[0][dim-1].green=(src[0][dim-2].green+src[1][dim-2].green+src[0][dim-1].green+src[1][dim-1].green)/4;
dst[0][dim-1].blue =(src[0][dim-2].blue +src[1][dim-2].blue +src[0][dim-1].blue +src[1][dim-1].blue )/4;
for( int i=1; i<dim-1; i++){
dst[i][0].red =(src[i-1][0].red +src[i-1][1].red +src[i][0].red +src[i][1].red +src[i+1][0].red +src[i+1][1].red )/6;
dst[i][0].green=(src[i-1][0].green+src[i-1][1].green+src[i][0].green+src[i][1].green+src[i+1][0].green+src[i+1][1].green)/6;
dst[i][0].blue =(src[i-1][0].blue +src[i-1][1].blue +src[i][0].blue +src[i][1].blue +src[i+1][0].blue +src[i+1][1].blue )/6;
for( int j=1; j<dim-1; j++){
dst[i][j].red =(src[i-1][j-1].red +src[i][j-1].red +src[i+1][j-1].red +src[i-1][j].red +src[i][j].red +src[i+1][j].red +src[i-1][j+1].red +src[i][j+1].red +src[i+1][j+1].red )/9;
dst[i][j].green=(src[i-1][j-1].green+src[i][j-1].green+src[i+1][j-1].green+src[i-1][j].green+src[i][j].green+src[i+1][j].green+src[i-1][j+1].green+src[i][j+1].green+src[i+1][j+1].green)/9;
dst[i][j].blue =(src[i-1][j-1].blue +src[i][j-1].blue +src[i+1][j-1].blue +src[i-1][j].blue +src[i][j].blue +src[i+1][j].blue +src[i-1][j+1].blue +src[i][j+1].blue +src[i+1][j+1].blue )/9;
}
dst[i][dim-1].red =(src[i-1][dim-2].red +src[i][dim-2].red +src[i+1][dim-2].red +src[i-1][dim-1].red +src[i][dim-1].red +src[i+1][dim-1].red )/6;
dst[i][dim-1].green=(src[i-1][dim-2].green+src[i][dim-2].green+src[i+1][dim-2].green+src[i-1][dim-1].green+src[i][dim-1].green+src[i+1][dim-1].green)/6;
dst[i][dim-1].blue =(src[i-1][dim-2].blue +src[i][dim-2].blue +src[i+1][dim-2].blue +src[i-1][dim-1].blue +src[i][dim-1].blue +src[i+1][dim-1].blue )/6;
}
dst[dim-1][0].red =(src[dim-2][0].red +src[dim-2][1].red +src[dim-1][0].red +src[dim-1][1].red )/4;
dst[dim-1][0].green=(src[dim-2][0].green+src[dim-2][1].green+src[dim-1][0].green+src[dim-1][1].green)/4;
dst[dim-1][0].blue =(src[dim-2][0].blue +src[dim-2][1].blue +src[dim-1][0].blue +src[dim-1][1].blue )/4;
for( int j=1; j<dim-1; j++){
dst[dim-1][j].red =(src[dim-2][j-1].red +src[dim-1][j-1].red +src[dim-2][j].red +src[dim-1][j].red +src[dim-2][j+1].red +src[dim-1][j+1].red )/6;
dst[dim-1][j].green=(src[dim-2][j-1].green+src[dim-1][j-1].green+src[dim-2][j].green+src[dim-1][j].green+src[dim-2][j+1].green+src[dim-1][j+1].green)/6;
dst[dim-1][j].blue =(src[dim-2][j-1].blue +src[dim-1][j-1].blue +src[dim-2][j].blue +src[dim-1][j].blue +src[dim-2][j+1].blue +src[dim-1][j+1].blue )/6;
}
dst[dim-1][dim-1].red =(src[dim-2][dim-2].red +src[dim-1][dim-2].red +src[dim-2][dim-1].red +src[dim-1][dim-1].red )/4;
dst[dim-1][dim-1].green=(src[dim-2][dim-2].green+src[dim-1][dim-2].green+src[dim-2][dim-1].green+src[dim-1][dim-1].green)/4;
dst[dim-1][dim-1].blue =(src[dim-2][dim-2].blue +src[dim-1][dim-2].blue +src[dim-2][dim-1].blue +src[dim-1][dim-1].blue )/4;
}
By my measurement it is ~50% faster than the original code. The next step is the elimination of repeated calculations.
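One possible shape for that next step, sketched on a single-channel int image to keep it short (the function name interior_box_sums and the flat row-major layout are assumptions, not part of the code above). Each 3x3 sum is built from three column sums, and two of them are reused as the window slides one pixel to the right:

```c
#include <assert.h>

/* Writes the 3x3 neighbourhood sum of every interior cell of a
 * dim x dim row-major image into out; border cells are left untouched. */
void interior_box_sums(int dim, const int *src, int *out)
{
    assert(src != out);   /* the output must not overwrite the input */

    for (int i = 1; i < dim - 1; i++) {
        const int *up   = src + (i - 1) * dim;
        const int *mid  = src + i * dim;
        const int *down = src + (i + 1) * dim;

        /* column sums for columns 0 and 1 of this three-row band */
        int left   = up[0] + mid[0] + down[0];
        int centre = up[1] + mid[1] + down[1];

        for (int j = 1; j < dim - 1; j++) {
            int right = up[j + 1] + mid[j + 1] + down[j + 1];
            out[i * dim + j] = left + centre + right;
            left = centre;      /* slide the window one column right */
            centre = right;
        }
    }
}
```

Per interior pixel this reads three new values instead of nine. For the pixel version the same idea would apply to each of the red, green and blue channels.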
I am new to programming and am trying to use two functions for matrix (2D array) operations, where the output of one function is the input for the next.
However, I cannot find a way to pass the values correctly from one function to the other. When I print the output of the first function in main(), it is correct, but when I pass it into the second function and print it there, the values make no sense. I have tried it a lot of ways, but it probably fails because of my lack of understanding of double pointers.
I am thankful for any hint or advice!
#include <stdio.h>
#include <stdlib.h>
int** td (int r_in, int c_in, int r_p, int c_p, int input[][c_in],int params[][c_p]){
int i, j, k;
int**y_td;
// memory allocation
y_td = (int*)malloc(sizeof(int*)*r_in);
for (i=0; i < r_in; i++){
y_td[i] = (int*)malloc(sizeof(int)*c_p);
}
//
for (i=0; i < r_in; i++){
for (j=0; j < c_p; j++){
y_td[i][j]=0; // Initialization
for (k=0; k < c_in; k++){
y_td[i][j]+= input[i][k]*params[k][j];
}
}
}
return y_td;
}
int** cnv (int r_in, int c_in, int filter, int f_size, int input[][c_in], int params[][f_size][c_in]){
int x,i,j,k,l,m,n;
int min_len = ((r_in < f_size)? r_in:f_size);
int max_len = ((r_in > f_size)? r_in:f_size);
int r_out = max_len - min_len + 1;//rows_out
int kernel;
int** y_cnv;
// Print input to check if it was correctly transmitted to the function
printf("Input CV (should be equal to TD result):\n");
for (i=0;i<r_in;i++){
for (j=0;j<c_in;j++){
printf("%d ", input[i][j]);
}
printf("\n");
}
printf("\n\n");
//memory allocation
y_cnv = (int*)malloc(sizeof(int*)*r_out);
for (i=0; i < r_out; i++){
y_cnv[i] = (int*)malloc(sizeof(int)*filter);
}
//
for (i=0; i < filter; i++){
for (k=0; k < r_out; k++){
y_cnv [k][i]=0; //initialize
}
for (j = 0; j < c_in; j++){
for (n = min_len-1; n < max_len; n++){
x = n-min_len+1;
for (m= 0; m < r_in; m++){
kernel = (((n-m) < min_len && (n-m) >= 0)? params[i][n-m][j]:0);
y_cnv[x][i] += input[m][j]*kernel;
}
}
}
}
return y_cnv;
}
int main() {
// create test arrays
int A [4][2]= {{1,1},{2,2},{3,3},{4,4}};
int B [2][3]= {{1,2,3},{2,3,4}};
int C [2][2][3]= {{{1,1,1},{2,2,2}},{{3,3,3},{4,4,4}}};
int** matrix;
int i, j;
matrix = td(4,2,2,3,A,B);
// print the result of first function, which is input in 2nd function
printf("The TD result is:\n");
for (i=0;i<4;i++){
for (j=0;j<3;j++){
printf("%d ",matrix[i][j]);
}
printf("\n");
}
printf("\n\n");
matrix = cnv(4,3,2,2,matrix,C);
return 0;
}
I expect the matrix printed in main () after the first function td () to be the same as when I read it in the second function cnv () and print it there, but it is not.
Take a look at this question. You were hit by the same underlying problem.
Turning
int** cnv (int r_in, int c_in, int filter, int f_size, int input[][c_in], int params[][f_size][c_in])
into
int** cnv (int r_in, int c_in, int filter, int f_size, int** input, int params[][f_size][c_in])
fixes the problem you asked about.
The reason is that your first function allocates an array of pointers, y_td. Each of these pointers holds the address of a separately allocated block where the actual integers are stored. By declaring the parameter as int input[][c_in] you tell the compiler that input is one contiguous block of integers, so input[x][y] is translated to *((int *)input+x*c_in+y). That reads the row pointers themselves (and whatever follows them in memory) as integers, so when you print them you get addresses instead of the expected values.
Please allow me one more comment: you should follow the comments below the question and take care of all compiler warnings. If there is a warning, treat it as a compiler error unless you know exactly what you are doing, especially in C. Your code contains some other possible sources of problems like the one above.
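The difference between the two parameter types can be sketched like this. All function names here (alloc_rows, sum_row_pointers, sum_contiguous) are made up for illustration and are not part of the question's code:

```c
#include <assert.h>
#include <stdlib.h>

/* Allocates a rows x cols matrix as an array of row pointers, the same
 * layout td() returns. Returns NULL on allocation failure. */
int **alloc_rows(int rows, int cols)
{
    assert(rows > 0 && cols > 0);
    int **m = malloc(sizeof *m * rows);
    if (m == NULL)
        return NULL;
    for (int i = 0; i < rows; i++) {
        m[i] = calloc(cols, sizeof **m);
        if (m[i] == NULL) {          /* undo the partial allocation */
            while (i-- > 0)
                free(m[i]);
            free(m);
            return NULL;
        }
    }
    return m;
}

/* Takes row pointers: m[i][j] loads the pointer m[i], then indexes it. */
int sum_row_pointers(int rows, int cols, int **m)
{
    int total = 0;
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < cols; j++)
            total += m[i][j];
    return total;
}

/* Takes one contiguous block: m[i][j] is computed as base + i*cols + j. */
int sum_contiguous(int rows, int cols, int m[][cols])
{
    int total = 0;
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < cols; j++)
            total += m[i][j];
    return total;
}
```

A real 2D array such as int A[4][2] goes to sum_contiguous(), while the result of alloc_rows() (or td()) goes to sum_row_pointers(). Passing either one to the other function compiles only with a warning and reads the wrong memory.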
Recently I have been working on a C OpenMP code that carries out affinity scheduling. Basically, after a thread has finished its assigned iterations, it looks for the thread with the largest remaining workload and steals some work from it.
The file compiles fine with icc. However, when I try to run it, I get a segmentation fault (core dumped). The funny thing is that the error does not always happen: even if I get an error on the first run, it sometimes works when I run it again. This is so weird to me. I wonder what I did wrong in my code and how to fix the problem. Thank you. I only modified the functions runloop and affinity; the others were given at the beginning and work fine.
#include <stdio.h>
#include <math.h>
#define N 729
#define reps 1000
#include <omp.h>
double a[N][N], b[N][N], c[N];
int jmax[N];
void init1(void);
void init2(void);
void runloop(int);
void loop1chunk(int, int);
void loop2chunk(int, int);
void valid1(void);
void valid2(void);
int affinity(int*, int*, int, int, float, int*, int*);
int main(int argc, char *argv[]) {
double start1,start2,end1,end2;
int r;
init1();
start1 = omp_get_wtime();
for (r=0; r<reps; r++){
runloop(1);
}
end1 = omp_get_wtime();
valid1();
printf("Total time for %d reps of loop 1 = %f\n",reps, (float)(end1-start1));
init2();
start2 = omp_get_wtime();
for (r=0; r<reps; r++){
runloop(2);
}
end2 = omp_get_wtime();
valid2();
printf("Total time for %d reps of loop 2 = %f\n",reps, (float)(end2-start2));
}
void init1(void){
int i,j;
for (i=0; i<N; i++){
for (j=0; j<N; j++){
a[i][j] = 0.0;
b[i][j] = 3.142*(i+j);
}
}
}
void init2(void){
int i,j, expr;
for (i=0; i<N; i++){
expr = i%( 3*(i/30) + 1);
if ( expr == 0) {
jmax[i] = N;
}
else {
jmax[i] = 1;
}
c[i] = 0.0;
}
for (i=0; i<N; i++){
for (j=0; j<N; j++){
b[i][j] = (double) (i*j+1) / (double) (N*N);
}
}
}
void runloop(int loopid)
{
int nthreads = omp_get_max_threads(); // set before the parallel region; omp_get_num_threads() would always return 1 out here
int ipt = (int) ceil((double)N/(double)nthreads);
float chunks_fraction = 1.0 / nthreads;
int threads_lo_bound[nthreads];
int threads_hi_bound[nthreads];
#pragma omp parallel default(none) shared(threads_lo_bound, threads_hi_bound, nthreads, loopid, ipt, chunks_fraction)
{
int myid = omp_get_thread_num();
int lo = myid * ipt;
int hi = (myid+1)*ipt;
if (hi > N) hi = N;
threads_lo_bound[myid] = lo;
threads_hi_bound[myid] = hi;
int current_lower_bound = 0;
int current_higher_bound = 0;
int affinity_steal = 0;
while(affinity_steal != -1)
{
switch(loopid)
{
case 1: loop1chunk(current_lower_bound, current_higher_bound); break;
case 2: loop2chunk(current_lower_bound, current_higher_bound); break;
}
#pragma omp critical
{
affinity_steal = affinity(threads_lo_bound, threads_hi_bound, nthreads, myid, chunks_fraction, &current_lower_bound, &current_higher_bound);
}
}
}
}
int affinity(int* threads_lo_bound, int* threads_hi_bound, int num_of_thread, int thread_num, float chunks_fraction, int *current_lower_bound, int *current_higher_bound)
{
int current_pos;
if (threads_hi_bound[thread_num] - threads_lo_bound[thread_num] > 0)
{
current_pos = thread_num;
}
else
{
int new_pos = -1;
int jobs_remain = 0;
int i;
for (i = 0; i < num_of_thread; i++)
{
int diff = threads_hi_bound[i] - threads_lo_bound[i];
if (diff > jobs_remain)
{
new_pos = i;
jobs_remain = diff;
}
}
current_pos = new_pos;
}
if (current_pos == -1) return -1;
int remaining_iterations = threads_hi_bound[current_pos] - threads_lo_bound[current_pos];
int iter_size_fractions = (int)ceil(chunks_fraction * remaining_iterations);
*current_lower_bound = threads_lo_bound[current_pos];
*current_higher_bound = threads_lo_bound[current_pos] + iter_size_fractions;
threads_lo_bound[current_pos] = threads_lo_bound[current_pos] + iter_size_fractions;
return current_pos;
}
void loop1chunk(int lo, int hi) {
int i,j;
for (i=lo; i<hi; i++){
for (j=N-1; j>i; j--){
a[i][j] += cos(b[i][j]);
}
}
}
void loop2chunk(int lo, int hi) {
int i,j,k;
double rN2;
rN2 = 1.0 / (double) (N*N);
for (i=lo; i<hi; i++){
for (j=0; j < jmax[i]; j++){
for (k=0; k<j; k++){
c[i] += (k+1) * log (b[i][j]) * rN2;
}
}
}
}
void valid1(void) {
int i,j;
double suma;
suma= 0.0;
for (i=0; i<N; i++){
for (j=0; j<N; j++){
suma += a[i][j];
}
}
printf("Loop 1 check: Sum of a is %lf\n", suma);
}
void valid2(void) {
int i;
double sumc;
sumc= 0.0;
for (i=0; i<N; i++){
sumc += c[i];
}
printf("Loop 2 check: Sum of c is %f\n", sumc);
}
You don't initialise the arrays threads_lo_bound and threads_hi_bound, so they initially contain some completely random values (this is source of randomness number 1).
You then enter the parallel region, where it is imperative to realise that not all threads move through the code in sync. The actual speed of each thread is quite random, because it shares the CPU with many other programs; even if they only use 1%, that will still show. (This is source of randomness number 2, and I'd argue it is the more relevant one for why you see the program working every now and then.)
So what happens when the code crashes?
One of the threads (most likely the master) reaches the critical region before at least one of the other threads has reached the line where you set threads_lo_bound[myid] and threads_hi_bound[myid].
After that, depending on what those random values were (you can generally assume they were out of bounds: your arrays are fairly small, so the odds of those values being valid indices are pretty slim), the thread will try to steal some of the jobs (that don't exist) by setting current_lower_bound and/or current_higher_bound to some value that is out of range for your arrays a, b and c.
It will then enter the second iteration of your while(affinity_steal != -1) loop and access memory that is out of bounds inevitably leading to a segmentation fault (eventually, in principle it's undefined behaviour and the crash can occur at any point after an invalid memory access, or in some cases never, leading you to believe everything is in order, when it is most definitely not).
The fix of course is simple, add
#pragma omp barrier
just before the while(affinity_steal != -1) loop to ensure all threads have reached that point (i.e. synchronise the threads at that point) and the bounds are properly set before you proceed into the loop. The overhead of this is minimal, but if for some reason you wish to avoid using barriers, you can simply set the values of the array before entering the parallel region.
That said, bugs like this can usually be located using a good debugger, I strongly suggest learning how to use one, they make life much easier.
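For the barrier-free alternative, the bounds could be filled in by a small helper that runs before the parallel region starts. This is only a sketch; the name init_bounds is made up, and it mirrors the ceil(N / nthreads) split that runloop() already uses:

```c
#include <assert.h>

/* Fills lo[t] and hi[t] with the half-open iteration range [lo, hi)
 * owned by thread t. */
void init_bounds(int n, int nthreads, int *lo, int *hi)
{
    assert(n >= 0 && nthreads > 0);
    int ipt = (n + nthreads - 1) / nthreads;   /* integer ceil(n / nthreads) */

    for (int t = 0; t < nthreads; t++) {
        int l = t * ipt;
        int h = (t + 1) * ipt;
        if (l > n) l = n;    /* trailing threads may get an empty range */
        if (h > n) h = n;
        lo[t] = l;
        hi[t] = h;
    }
}
```

Calling init_bounds(N, nthreads, threads_lo_bound, threads_hi_bound) in runloop() just before the #pragma omp parallel line means every entry is valid before any thread can reach affinity(). The per-thread assignments inside the region then become unnecessary.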
I'm working on speeding up Conway's Game of Life. Right now, the code looks at a cell and then adds up the 3x3 area immediately surrounding the point, then subtracts the value at the point we're looking at. Here's the function that is doing that:
static int neighbors2 (board b, int i, int j)
{
int n = 0;
int i_left = max(0,i-1);
int i_right = min(HEIGHT, i+2);
int j_left = max(0,j-1);
int j_right = min(WIDTH, j+2);
int ii, jj;
for (jj = j_left; jj < j_right; ++jj) {
for (ii = i_left; ii < i_right; ii++) {
n += b[ii][jj];
}
}
return n - b[i][j];
}
And here is the code I've been trying to use to iterate through pieces at a time:
//Iterates through the first row of the 3x3 area
static int first_row(board b, int i, int j) {
int f = 0;
int i_left = max(0,i-1);
int j_left = max(0,j-1);
int j_right = min(WIDTH, j+2);
int jj;
for (jj = j_left; jj < j_right; ++jj) {
f += b[i_left][jj];
}
return f;
}
//Iterates and adds up the second row of the 3x3 area
static int second_row(board b, int i, int j) {
int g = 0;
int i_right = min(HEIGHT, i+2);
int j_left = max(0,j-1);
int j_right = min(WIDTH, j+2);
int jj;
if (i_right != i) {
for (jj = j_left; jj < j_right; ++jj) {
g += b[i][jj];
}
}
return g;
}
//iterates and adds up the third row of the 3x3 area.
static int third_row(board b, int i, int j) {
int h = 0;
int i_right = min(HEIGHT, i+2);
int j_left = max(0,j-1);
int j_right = min(WIDTH, j+2);
int jj;
for (jj = j_left; jj < j_right; ++jj) {
h += b[i_right][jj];
}
return h;
}
//adds up the surrounding spots
//subtracts the spot we're looking at.
static int addUp(board b, int i, int j) {
int n = first_row(b, i, j) + second_row(b, i, j) + third_row(b, i, j);
return n - b[i][j];
}
But, for some reason it isn't working. I have no idea why.
Things to note:
sometimes i == i_right, so we do not want to add up a row twice.
The three functions are supposed to do the exact same thing as neighbors2 in separate pieces.
min and max are functions that were premade for me.
sometimes j == j_right, so we do not want to add up something twice. I'm pretty confident the loop takes care of this, however.
Tips and things to consider are appreciated.
Thanks all. I've been working on this for a couple hours now and have no idea what is going wrong. It seems like it should work but I keep getting incorrect solutions at random spots among the board.
In neighbors2, you set i_left and i_right so that they're limited to the rows of the grid. If the current cell is in the top or bottom row, you only loop through two rows instead of three.
In first_row() you also clamp to the rows of the grid, so for a cell in the top row it adds the cells on the current row, which is what second_row() already does; that row ends up counted twice. third_row() has a related problem: i_right is an exclusive bound (i + 2, capped at HEIGHT), so b[i_right] is never the row directly below the current cell and can be one past the last row.
You shouldn't call first_row() when i == 0, and you shouldn't call third_row() when i == HEIGHT - 1 (the last row); third_row() itself should then read b[i + 1].
static int addUp(board b, int i, int j) {
int n = (i == 0 ? 0 : first_row(b, i, j)) +
second_row(b, i, j) +
(i == HEIGHT - 1 ? 0 : third_row(b, i, j));
return n - b[i][j];
}
Another option would be to do the check in the functions themselves:
static int first_row(board b, int i, int j) {
if (i == 0) {
return 0; /* no row above the top row */
}
int f = 0;
int j_left = max(0,j-1);
int j_right = min(WIDTH, j+2);
int jj;
for (jj = j_left; jj < j_right; ++jj) {
f += b[i-1][jj]; /* the row above the current cell */
}
return f;
}
and similarly for third_row(). But doing it in the caller saves the overhead of the function calls.
BTW, your variable names are very confusing. All the i variables are for rows, which go from top to bottom, not left to right.
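Putting the row-by-row idea together, here is a sketch with a single helper used for all three rows. The board typedef and the HEIGHT and WIDTH values are assumptions (the question does not show them), and row_sum and count_neighbors are made-up names:

```c
#include <assert.h>

/* Assumed definitions; the question does not show them. */
#define HEIGHT 4
#define WIDTH  5
typedef int board[HEIGHT][WIDTH];

/* Sums columns j-1..j+1 of one row, clipped to the board. */
static int row_sum(board b, int row, int j)
{
    int j_lo = j > 0 ? j - 1 : 0;
    int j_hi = j < WIDTH - 1 ? j + 1 : WIDTH - 1;
    int s = 0;

    for (int jj = j_lo; jj <= j_hi; jj++)
        s += b[row][jj];
    return s;
}

/* Live neighbours of (i, j): the rows above and below are skipped when
 * they fall outside the board, so no row is counted twice. */
int count_neighbors(board b, int i, int j)
{
    assert(i >= 0 && i < HEIGHT && j >= 0 && j < WIDTH);
    int n = row_sum(b, i, j) - b[i][j];

    if (i > 0)
        n += row_sum(b, i - 1, j);
    if (i < HEIGHT - 1)
        n += row_sum(b, i + 1, j);
    return n;
}
```

On a board filled with ones this gives 3 for a corner, 5 for an edge cell and 8 for an interior cell, which is a quick way to compare it against neighbors2.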
#include <stdio.h>
#include <stdlib.h>
#define ROWSDISP 50
#define COLSDISP 100
int rows=ROWSDISP+2, cols=COLSDISP+2;
This is to avoid illegal indexes when stepping over the neighbours.
struct onecell {char alive;
char neibs;} **cells;
This is the foundation of a (dynamic) 2D-array, of a small struct.
To create space for each row plus the space to hold an array of row pointers:
void init_cells()
{
int i;
cells = calloc(rows, sizeof(*cells));
for(i=0; i<=rows-1; i++)
cells[i] = calloc(cols, sizeof(**cells));
}
I skip the rand_fill() and glider() funcs. A cell can be set by
cells[y][x].alive=1.
int main(void) {
struct onecell *c, *n1, *rlow;
int i, j, loops=0;
char nbs;
init_cells();
rand_fill();
glider();
while (loops++ < 1000) {
printf("\n%d\n", loops);
for (i = 1; i <= rows-2; i++) {
for (j = 1; j <= cols-2; j++) {
c = &cells[ i ][ j ];
n1 = &cells[ i ][j+1];
rlow = cells[i+1];
nbs = c->neibs + n1->alive + rlow[ j ].alive
+ rlow[j+1].alive
+ rlow[j-1].alive;
if(c->alive) {
printf("#");
n1->neibs++;
rlow[ j ].neibs++;
rlow[j+1].neibs++;
rlow[j-1].neibs++;
if(nbs < 2 || nbs > 3)
c->alive = 0;
} else {
printf(" ");
if(nbs == 3)
c->alive = 1;
}
c->neibs = 0; // reset for next cycle
}
printf("\n");
}
}
return(0);
}
There is no iterating a 3x3 square here. Of the 8 neighbours,
only the 4 downstream ones are checked; but at the same time
their counters are raised.
A benchmark with 100x100 grid:
# time ./a.out >/dev/null
real 0m0.084s
user 0m0.084s
sys 0m0.000s
# bc <<<100*100*1000/.084
119047619
And each of these 100M cells needs to check 8 neighbours, so this is close to the CPU frequency (1 neighbour check per cycle).
It seems twice as fast as the rosetta code solution.
There is also no need to switch between two boards, thanks to the investment in the second field of each cell.
I've got an assignment to optimize a piece of C code (a language in which I'm rather n00bish) designed to simulate rotating pixels in an image:
void naive_rotate(int dim, pixel *src, pixel *dst) {
int i, j;
for (i = 0; i < dim; i++)
for (j = 0; j < dim; j++)
dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)];
}
Defs for pixel and RIDX are as follows:
typedef struct {
unsigned short red;
unsigned short green;
unsigned short blue;
} pixel;
#define RIDX(i,j,n) ((i)*(n)+(j))
The instructions for the assignment contain the note, "Your task is to rewrite this code to make it run as fast as possible using techniques like code motion, loop unrolling and blocking."
I thought I had some ideas on how to approach this. However, my attempts at loop unrolling:
void rotate_unroll(int dim, pixel *src, pixel *dst) {
int i, j;
for (i = 0; i < dim; i++) {
for (j = 0; j < dim; j+=4) {
dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)];
dst[RIDX(dim-1-(j+1), i, dim)] = src[RIDX(i, j+1, dim)];
dst[RIDX(dim-1-(j+2), i, dim)] = src[RIDX(i, j+2, dim)];
dst[RIDX(dim-1-(j+3), i, dim)] = src[RIDX(i, j+3, dim)];
}
}
}
and code motion (or at least reorganizing RIDX and moving around some of the calculations out of the inner loop):
void rotate_motion(int dim, pixel *src, pixel *dst) {
int i, j;
int dimsquared = dim * dim;
for (i = 0; i < dim; i++) {
int dst_temp = dimsquared - dim + i;
int src_temp = i * dim;
for (j = 0; j < dim; j++) {
dst[dst_temp - (dim * j)] = src[src_temp + j];
}
}
}
// dst[RIDX(dim-1-j, i, dim)]
// = dst[(dim-1-j)dim + i]
// = dst[(dim * dim) - dim - (dim)j + i]
// src[RIDX(i, j, dim)]
// = src[(dim)i + j]
do not seem to be working; the timer packaged with the assignment claims that my solutions are not having any impact on the CPE of the program. I suspect I am probably approaching both methods incorrectly and would greatly appreciate any guidance in the right direction. (It's a homework assignment so I'm not asking for a complete solution, just some pointers.)
My other thought was to try to add an accumulator -- something along the lines of the following:
void rotate_acc(int dim, pixel *src, pixel *dst) {
int i, j;
pixel temp = dst;
for (i = 0; i < dim; i++) {
for (j = 0; j < dim; j++) {
temp[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)];
}
}
dst = temp;
}
But my syntax is very wrong there and I'm not sure how one would go about constructing a temporary version of the struct in question.
Any help is much appreciated. Thanks!
You need a thorough understanding of pointers in C. Put simply: a pointer holds the address where your data (a pixel struct in your case) is stored in memory.
In your code, the function rotate_acc takes a pixel pointer as an argument: pixel *dst. At first you may be tempted to write pixel *tmp = dst, but keep in mind that this is a shallow copy: only the address is copied, not the data it points at. Hence, if you modify tmp by writing tmp->red = 0, the original data is modified too.
If you need a deep copy of the pixel that dst points at, write pixel tmp = *dst. Note that this copies a single pixel, not the whole image.
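A small sketch of the difference, using the pixel struct from the question. The three function names are made up for illustration:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    unsigned short red;
    unsigned short green;
    unsigned short blue;
} pixel;

/* Shallow: tmp and dst name the same pixel, so this changes the caller's data. */
void clear_red_via_alias(pixel *dst)
{
    pixel *tmp = dst;
    tmp->red = 0;
}

/* Copy of one pixel: only the local copy changes, the caller's data does not. */
void clear_red_in_copy(const pixel *dst)
{
    pixel tmp = *dst;
    tmp.red = 0;
    (void)tmp;   /* the copy is discarded on return */
}

/* Deep copy of a whole n-pixel buffer. Returns NULL on failure;
 * the caller owns the result and must free() it. */
pixel *copy_pixels(const pixel *src, size_t n)
{
    assert(src != NULL);
    pixel *copy = malloc(n * sizeof *copy);
    if (copy != NULL)
        memcpy(copy, src, n * sizeof *copy);
    return copy;
}
```

For the rotation itself no temporary image is needed, since src and dst are already separate buffers.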
Try this:
void naive_rotate(int dim, pixel *src, pixel *dst) {
int dimSq = dim * dim;
int dstIdxStart = dimSq - dim; /* dst index of row dim-1, column 0 */
int dstIdx = dstIdxStart;
for (int i = 0; i < dimSq; ++i)
{
dst[dstIdx] = src[i];
dstIdx -= dim; /* next source column lands one dst row up */
if (dstIdx < 0)
{
/* finished a source row: move to the next dst column */
dstIdxStart += 1;
dstIdx = dstIdxStart;
}
}
}
You will have to double check the maths, but I hope you get the idea.
It removes all the multiplications. Also as src is being accessed sequentially it is good for the cache.
I am trying to pass a matrix to a function by reference. The function will replace every element A[i][j] of the matrix by -A[i][j]. I first create the matrix:
float a[3][4] =
{
{1.0f, 0.0f, 0.0f, 0.0f},
{0.0f, 1.0f, 0.0f, 0.0f},
{1.0f, 1.0f, 0.0f, 0.0f},
};
Then, I obtain the pointer to this matrix:
float*** pa = &a;
Then, I introduce the following function:
void process(float ***matrix, int nRows, int nCols){
short i;
short j;
for (i=0 ; i<nRows; i++){
for (j=0 ; j<nCols ; j++){
(*matrix)[i][j] *= -1;
}
}
}
which I call as follows:
process(pa,3,4);
My program fails to execute and returns:
Segmentation fault: 11
Any ideas?
Summary of the answers: Some notes based on the answers this question received:
I. The aforementioned function can be used, provided that a is initialized a bit differently so as to be a float**. In particular:
int numberOfRows = 3;
int numberOfColumns = 4;
int i;
float **a = (float **) malloc(sizeof (float *) * numberOfRows);
for (i = 0; i < numberOfRows; ++i) {
a[i] = (float *) malloc(sizeof (float) * numberOfColumns);
}
and then, it is passed to the function process as process(&a, 3,4);.
II. Alternatively, one may use the function:
void multi_by_minus(float *matrix, int nRows, int nCols) {
short i,j;
for (i = 0; i < nRows; i++) {
for (j = 0; j < nCols; j++) {
matrix[i * nCols + j] *= -1;
}
}
}
which treats the matrix as a one-dimensional array. In that case we invoke it as multi_by_minus((float *)a, 3, 4); (or, equivalently, multi_by_minus(&a[0][0], 3, 4);).
III. Finally, we may use the method:
void process2(int nRows, int nCols, float (*matrix)[nCols]) {
short i, j;
for (i = 0; i < nRows; i++) {
for (j = 0; j < nCols; j++) {
matrix[i][j] *= -1;
}
}
}
to which we pass a itself, which decays to a pointer to its first row, i.e., we invoke it like process2(3, 4, a);. In this way, we access the elements of the matrix in 2D.
There is no need for the triple pointer, since you are already supplying the memory. You would use that if you were to allocate the memory inside the function.
You can't index a 2 dimension matrix without supplying at least the size of 1 dimension. The reason is that the compiler needs to generate code to access the correct offset taking into account both dimensions. In this particular case, I suggest passing a simple pointer and indexing as a 1D array, like this:
void process(float *matrix, int nRows, int nCols){
short i;
short j;
for (i=0 ; i<nRows; i++){
for (j=0 ; j<nCols ; j++){
matrix[i * nCols + j] *= -1;
}
}
}
You can then call it like this:
process((float*)a,3,4);
This way you manually index your buffer.
You have to change the signature of the function to this:
void process(float (*matrix)[3][4], int nRows, int nCols){
And when calling the function, use this:
process(&a, 3, 4);
If you put the nCols parameter before the matrix parameter, you can pass the two-dimensional matrix and use it in the natural way, without extra * operators or index arithmetic:
void process(int nRows, int nCols, float (*matrix)[nCols])
{
for (short i = 0 ; i < nRows; i++)
{
for (short j = 0; j < nCols; j++)
{
matrix[i][j] *= -1;
}
}
}
Then you call process like this:
process(3, 4, a);
Incidentally:
Unless there is special reason for making i and j short, you should declare them int. int is defined to be the “natural” size for integers on the target platform.
It's easy. If you have a matrix:
int m[2][2]={{1,0},{0,1}};
and you want a pointer to m, declare:
int (*mptr)[2][2];
mptr=&m; // mptr=m; is not the same: m decays to int (*)[2], a pointer to its first row
and you can then use it to access the elements of the matrix m:
(*mptr)[i][j]....
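The two pointer types involved can be put side by side in a short sketch. The function names trace_whole and trace_rows are made up for illustration:

```c
#include <assert.h>
#include <stddef.h>

/* Pointer to the whole 2x2 array: dereference first, then index. */
int trace_whole(int (*mptr)[2][2])
{
    assert(mptr != NULL);
    return (*mptr)[0][0] + (*mptr)[1][1];
}

/* Pointer to the first row (what the name m decays to): index directly. */
int trace_rows(int (*rows)[2])
{
    assert(rows != NULL);
    return rows[0][0] + rows[1][1];
}
```

With int m[2][2], the first function is called as trace_whole(&m) and the second as trace_rows(m). Both pointers hold the same address but have different types, so swapping the arguments draws a compiler diagnostic.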