This question already has answers here:
How to optimize these loops (with compiler optimization disabled)?
I have an assignment to optimize a for loop so the compiler produces code that runs faster. The objective is to get the code to run in 5 seconds or less, the original run time being around 23 seconds. The original code looks like this:
#include <stdio.h>
#include <stdlib.h>

#define N_TIMES 600000
#define ARRAY_SIZE 10000

int main(void)
{
    double *array = calloc(ARRAY_SIZE, sizeof(double));
    double sum = 0;
    int i;

    printf("CS201 - Asgmt 4 - I. Forgot\n");

    for (i = 0; i < N_TIMES; i++) {
        int j;
        for (j = 0; j < ARRAY_SIZE; j++) {
            sum += array[j];
        }
    }

    return 0;
}
My first thought was to unroll the inner for loop, which got it down to 5.7 seconds; that loop looked like this:
for (j = 0; j < ARRAY_SIZE - 11; j += 12) {
    sum = sum + (array[j] + array[j+1] + array[j+2] + array[j+3]
               + array[j+4] + array[j+5] + array[j+6] + array[j+7]
               + array[j+8] + array[j+9] + array[j+10] + array[j+11]);
}
After going out to 12 elements of the array per iteration the performance stopped improving, so my next thought was to try to introduce some parallelism, and I did this:
sum = sum + (array[j] + array[j+1] + array[j+2] + array[j+3] + array[j+4] + array[j+5]);
sum1 = sum1 + (array[j+6] + array[j+7] + array[j+8] + array[j+9] + array[j+10] + array[j+11]);
That actually ended up slowing the code down, and each additional accumulator variable slowed it down further. I'm not sure whether this kind of parallelism doesn't work here or whether I'm implementing it wrong, but it didn't work, so now I'm not really sure how to optimize it any further to get it below 5 seconds.
EDIT: I forgot to mention that I can't make any changes to the outer loop, only the inner loop.
EDIT2: This is the part of the code I'm trying to optimize for my assignment:
for (j = 0; j < ARRAY_SIZE; j++) {
    sum += array[j];
}
I'm using the gcc compiler with the flags
gcc -m32 -std=gnu11 -Wall -g a04.c -o a04
All compiler optimizations are turned off.
Since j and i don't depend on one another, I think you can just do:
for (j = 0; j < ARRAY_SIZE; j++) {
    sum += array[j];
}
sum *= N_TIMES;
You can move the declaration of variable 'j' out of the loop like so:
int j;
for (i = 0; i < N_TIMES; i++) {
    //int j; <-- Move this line out of the loop
    for (j = 0; j < ARRAY_SIZE - 11; j += 12) {
        sum = sum + (array[j] + array[j+1] + array[j+2] + array[j+3]
                   + array[j+4] + array[j+5] + array[j+6] + array[j+7]
                   + array[j+8] + array[j+9] + array[j+10] + array[j+11]);
    }
}
You don't need to declare a new variable 'j' each time the loop runs.
Related
So I was writing this code for the Game of Life using C on Linux, but I got this warning! What does this warning mean and how can I fix it? The code I wrote is:
#include <stdio.h>
#include <string.h>
#include <omp.h>
#include <stdlib.h>
#include <assert.h>

#define MAX_N 2000

int plate[2][(MAX_N + 2) * (MAX_N + 2)];
int which = 0;
int n;

int live(int index){
    return (plate[which][index - n - 3]
          + plate[which][index - n - 2]
          + plate[which][index - n - 1]
          + plate[which][index - 1]
          + plate[which][index + 1]
          + plate[which][index + n + 1]
          + plate[which][index + n + 2]
          + plate[which][index + n + 3]);
}

void iteration(){
    #pragma omp parallel for schedule(static)
    for(int i = 1; i <= n; i++){
        for(int j = 1; j <= n; j++){
            int index = i * (n + 2) + j;
            int num = live(index);
            if(plate[which][index]){
                plate[!which][index] = (num == 2 || num == 3) ? 1 : 0;
            }else{
                plate[!which][index] = (num == 3);
            }
        }
    }
    which = !which;
}

void print_plate(){
    for(int i = 1; i <= n; i++){
        for(int j = 1; j <= n; j++){
            printf("%d", plate[which][i * (n + 2) + j]);
        }
        printf("\n");
    }
    printf("\0");
}

int main(){
    int M;
    char line[MAX_N];
    memset(plate[0], 0, sizeof(int) * (n + 2) * (n + 2));
    memset(plate[1], 0, sizeof(int) * (n + 2) * (n + 2));
    if(scanf("%d %d", &n, &M) == 2){
        for(int i = 1; i <= n; i++){
            scanf("%s", &line);
            for(int j = 0; j < n; j++){
                plate[0][i * (n + 2) + j + 1] = line[j] - '0';
            }
        }
        for(int i = 0; i < M; i++){
            iteration();
        }
        print_plate();
    }
    return 0;
}
It would be much appreciated if you could help me fix it, because I think this should have worked fine.
You have this:
scanf("%s", &line);
line is of type char[2000] (MAX_N). By applying the address-of operator to it, you get a value of type char (*)[2000]. Get rid of the & and instead you will pass line itself, which decays to the char * that %s expects.
There are a few errors in the code:
You're trying to scan into the variable line by taking its address in the scanf() call. This is solved by removing the ampersand (&). A detailed explanation is already provided in the first answer to this question.
The statement:
    printf("\0"); // the format string starts with a null byte
is meaningless: the format begins with the null terminator, so nothing at all is printed.
The pragma:
    #pragma omp parallel for schedule(static)
will be ignored as an unknown pragma (which the -Wunknown-pragmas flag warns about) unless you compile with OpenMP enabled, for example by passing -fopenmp to gcc.
I am trying to compare the timings of two algorithms; one runs in serial and the other in parallel, and I am using omp_get_wtime() to get the execution time of each function.
When the number of threads is increased to 2, the execution time of the serial part doubles, even though the timing of the parallel part is correctly reduced.
As the number of threads in the parallel portion increases, the timing of the serial portion increases proportionally.
Sample code of the portion being parallelized:
void constraints(int pop[], double gain[], double gammma[], double snr,
                 int numberOfLinks, int chromosomes, int links, int (*cv)[])
{
    int i, j, k;
    double interference[chromosomes * links];

    #pragma omp parallel for private(i,j,k) num_threads(4)
    for(i = 0; i < chromosomes; i++)
    {
        (*cv)[i] = 0;
        for(j = 0; j < links; j++)
        {
            interference[i * links + j] = 0;
            for(k = 0; k < links; k++)
            {
                if(k != j)
                {
                    interference[i * links + j] = interference[i * links + j]
                                                + pop[i * links + k] * gain[k * numberOfLinks + j];
                }
            }
            if (((gain[j * numberOfLinks + j] / (interference[i * links + j] + 1/snr)) < gammma[j])
                && (pop[i * links + j] == 1))
            {
                (*cv)[i] = (*cv)[i] + 1;
            }
        }
    }
}
Could anyone please let me know why this is happening?
Recently I got this question in one of my interviews, which I unfortunately skipped, but I'm very curious to get the answer. Can you help me?
int sum = 0;
int num = 100000000;

for (int i = 0; i < num; i++){
    for (int j = 0; j < num; j++){
        sum += m_DataX[i] * m_DataX[j];
    }
}
EDITED: Also I would like to see if it is possible to optimize if we have the following expression for sum:
sum += m_DataX[i] * m_DataY[j];
Simply, it is the square of the sum of the numbers.
Why?
Let, an array is, |1|2|3|
Then, the code produces
1*1 + 1*2 + 1*3
2*1 + 2*2 + 2*3
3*1 + 3*2 + 3*3
That is,
(1*1 + 1*2 + 1*3) + (2*1 + 2*2 + 2*3) + (3*1 + 3*2 + 3*3)
=>1(1+2+3) + 2(1+2+3) + 3(1+2+3)
=>(1+2+3)*(1+2+3)
Therefore, the code will be
int tempSum = 0;
for (int i = 0; i < num; i++){
    tempSum += m_DataX[i];
}
sum = tempSum * tempSum;
Update:
What if sum += m_DataX[i] * m_DataY[j]?
Let, two arrays are, |1|2|3| and |4|5|6|
Therefore,
1*4 + 1*5 + 1*6
2*4 + 2*5 + 2*6
3*4 + 3*5 + 3*6
=> 1*4 + 2*4 + 3*4 + 1*5 + 2*5 + 3*5 + 1*6 + 2*6 + 3*6
=> (1+2+3)*(4+5+6)
For the 2nd case, sum the elements of each array separately, then multiply the two sums; that will be your result.
int tempSumX = 0;
int tempSumY = 0;
for (int i = 0; i < num; i++) {
    tempSumX += m_DataX[i];
    tempSumY += m_DataY[i];
}
sum = tempSumX * tempSumY;
I have a typical algorithm for matrix multiplication. I am trying to apply and understand loop unrolling, but I am having a problem implementing the algorithm when I unroll k times and k isn't a multiple of the matrix size (I get very large numbers as a result instead). That means I am not getting how to handle the remaining elements after unrolling. Here is what I have:
void Mult_Matx(unsigned long* a, unsigned long* b, unsigned long* c, long n)
{
    long i = 0, j = 0, k = 0;
    unsigned long sum, sum1, sum2, sum3, sum4, sum5, sum6, sum7;

    for (i = 0; i < n; i++)
    {
        long in = i * n;
        for (j = 0; j < n; j++)
        {
            sum = sum1 = sum2 = sum3 = sum4 = sum5 = sum6 = sum7 = 0;
            for (k = 0; k < n; k += 8)
            {
                sum  = sum  + a[in + k]       * b[k * n + j];
                sum1 = sum1 + a[in + (k + 1)] * b[(k + 1) * n + j];
                sum2 = sum2 + a[in + (k + 2)] * b[(k + 2) * n + j];
                sum3 = sum3 + a[in + (k + 3)] * b[(k + 3) * n + j];
                sum4 = sum4 + a[in + (k + 4)] * b[(k + 4) * n + j];
                sum5 = sum5 + a[in + (k + 5)] * b[(k + 5) * n + j];
                sum6 = sum6 + a[in + (k + 6)] * b[(k + 6) * n + j];
                sum7 = sum7 + a[in + (k + 7)] * b[(k + 7) * n + j];
            }
            if (n % 8 != 0)
            {
                for (k = 8 * (n / 8); k < n; k++)
                {
                    sum = sum + a[in + k] * b[k * n + j];
                }
            }
            c[in + j] = sum + sum1 + sum2 + sum3 + sum4 + sum5 + sum6 + sum7;
        }
    }
}
Let's say the size n is 12. When I unroll it 4 times, this code works, meaning it never enters the remainder loop. But I lose track of what's going on when it does! If anyone can point out where I am going wrong, I'd really appreciate it. I am new to this and having a hard time figuring it out.
A generic way of unrolling a loop of this shape:
for(int i = 0; i < N; i++)
    ...

is

int i;
for(i = 0; i < N-L; i += L)
    ...
for(; i < N; i++)
    ...

or if you want to keep the index variable in the scope of the loops:

for(int i = 0; i < N-L; i += L)
    ...
for(int i = L*(N/L); i < N; i++)
    ...
Here, I'm using the fact that integer division rounds down. L is the number of steps you do per iteration of the first loop.
Example:
const int N = 22;
const int L = 6;
int i;
for(i = 0; i < N-L; i += L)
{
    printf("%d\n", i);
    printf("%d\n", i+1);
    printf("%d\n", i+2);
    printf("%d\n", i+3);
    printf("%d\n", i+4);
    printf("%d\n", i+5);
}
for(; i < N; i++)
    printf("%d\n", i);
But I recommend taking a look at Duff's device. However, I suspect it's not always a good thing to use, because modulo is a pretty expensive operation.
The condition if (n % 8 != 0) should not be needed; the for header should take care of that if written properly.
Apparently, transposing a matrix and then multiplying is faster than multiplying the two matrices directly. However, my code right now does not show that, and I have no clue why... (The normal multiplication is just the triple-nested for loop, and it takes roughly 1.12 seconds to multiply a 1000x1000 matrix, whilst this code takes 8 times as long, so slower instead of faster.) I am lost now; any help would be appreciated! :D
A = malloc (size*size * sizeof (double));
B = malloc (size*size * sizeof (double));
C = malloc (size*size * sizeof (double));

/* initialise array elements */
for (row = 0; row < size; row++){
    for (col = 0; col < size; col++){
        A[size * row + col] = rand();
        B[size * row + col] = rand();
    }
}

t1 = getTime();
/* code to be measured goes here */
T = malloc (size*size * sizeof(double));
for(i = 0; i < size; ++i) {
    for(j = 0; j <= i; ++j) {
        T[size * i + j] = B[size * j + i];
    }
}
for (j = 0; j < size; ++j) {
    for (k = 0; k < size; ++k) {
        for (m = 0; m < size; ++m) {
            C[size * j + k] = A[size * j + k] * T[size * m + k];
        }
    }
}
t2 = getTime();
I see a couple of problems.
You are just setting the value of C[size * j + k] instead of incrementing it. Even though this is an error in the computation, it shouldn't impact performance. Also, you need to initialize C[size * j + k] to 0.0 before the innermost loop starts; otherwise you will be incrementing an uninitialized value, which is a serious problem that can produce garbage results.
The multiplication term is wrong.
Remember that your multiplication term needs to represent:
C[j, k] += A[j, m] * B[m, k], which is
C[j, k] += A[j, m] * T[k, m]
Instead of
    C[size * j + k] = A[size * j + k] * T[size * m + k];
you need
    C[size * j + k] += A[size * j + m] * T[size * k + m];
with three fixes: increment (+=) rather than set, read A[j, m] rather than A[j, k], and read T[k, m] rather than T[m, k].
I think the main culprit that hurts performance, in addition to it being wrong, is your use of T[size * m + k]. When you do that, there is a lot of jumping of memory (m is the fastest changing variable in the loop) to get to the data. When you use the correct term, T[size * k + m], there will be less of that and you should see a performance improvement.
In summary, use:
for (j = 0; j < size; ++j) {
    for (k = 0; k < size; ++k) {
        C[size * j + k] = 0.0;
        for (m = 0; m < size; ++m) {
            C[size * j + k] += A[size * j + m] * T[size * k + m];
        }
    }
}
You might be able to get a little bit more performance by using:
double* a = NULL;
double* c = NULL;
double* t = NULL;

for (j = 0; j < size; ++j) {
    a = A + (size*j);
    c = C + (size*j);
    for (k = 0; k < size; ++k) {
        t = T + size*k;
        c[k] = 0.0;
        for (m = 0; m < size; ++m) {
            c[k] += a[m] * t[m];
        }
    }
}
PS: I haven't tested the code; I'm just giving you some ideas.
It is likely that your transpose runs slower than the multiplication in this test because the transpose is where the data is loaded from memory into cache, while the matrix multiplication runs out of cache, at least for 1000x1000 with many modern processors (24 MB fits into cache on many Intel Xeon processors).
In any case, both your transpose and multiplication are horribly inefficient. Your transpose is going to thrash the TLB, so you should use a blocking factor of 32 or so (see https://github.com/ParRes/Kernels/blob/master/SERIAL/Transpose/transpose.c for example code).
Furthermore, on x86 it is better to write contiguously (due to how cache-line locking and blocking stores work; if you use nontemporal stores carefully, this might change), whereas on some variants of PowerPC, in particular the Blue Gene variants, you want to read contiguously (because of in-order execution, nonblocking stores and write-through cache). See https://github.com/jeffhammond/HPCInfo/blob/master/tuning/transpose/transpose.c for example code.
Finally, I don't care what you say ("I specifically have to do it this way though"), you need to use BLAS for the matrix multiplication. End of story. If your supervisor or some other coworker is telling you otherwise, they are incompetent and should not be allowed to talk about code until they have been thoroughly reeducated. Please refer them to this post if you don't feel like telling them yourself.