How to optimise a for loop for faster execution time

I am looking to optimize a loop using loop unswitching and SIMD so that I can speed up execution time.
for (b_idx = 0; b_idx < e_idx; b_idx++) {
if (fxp < 0) {
fxp += LUT[b_idx];
x += ytmp;
y -= xtmp;
} else {
fxp -= LUT[b_idx];
x -= ytmp;
y += xtmp;
}
xtmp = x >> (b_idx + 1);
ytmp = y >> (b_idx + 1);
}

The loop body contains an if condition that relies on branch prediction. Branch mispredictions can cause significant slowdowns, and because this branch depends on a value updated in every iteration, it is very likely the bottleneck of the algorithm.
In this case, you can easily remove the condition by noting that the two branches differ only in sign.
for (b_idx = 0; b_idx < e_idx; b_idx++) {
int sgn = (fxp < 0) * 2 - 1; // +1 or -1
fxp += sgn * LUT[b_idx];
x += sgn * ytmp;
y -= sgn * xtmp;
xtmp = x >> (b_idx + 1);
ytmp = y >> (b_idx + 1);
}
But because the algorithm is incremental (iteration b_idx + 1 depends on the result of iteration b_idx), the loop cannot be parallelized.

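You could consider hoisting the if condition out of the loop so that it is not re-evaluated on every iteration (loop unswitching). Note that this only works when the condition is loop-invariant; here fxp is updated on every iteration, so the test cannot simply be moved out.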

Related

How to use Armadillo Columns/Rows to perform optimised calculations on accesses within the same column

What is the best way to manipulate indexing in Armadillo? I was under the impression that it heavily used template expressions to avoid temporaries, but I'm not seeing these speedups.
Is direct array indexing still the best way to approach calculations that rely on consecutive elements within the same array?
Keep in mind, that I hope to parallelise these calculations in the future with TBB::parallel_for (In this case, from a maintainability perspective, it may be simpler to use direct accessing?) These calculations happen in a tight loop, and I hope to make them as optimal as possible.
ElapsedTimer timer;
int n = 768000;
int numberOfLoops = 5000;
arma::Col<double> directAccess1(n);
arma::Col<double> directAccess2(n);
arma::Col<double> directAccessResult1(n);
arma::Col<double> directAccessResult2(n);
arma::Col<double> armaAccess1(n);
arma::Col<double> armaAccess2(n);
arma::Col<double> armaAccessResult1(n);
arma::Col<double> armaAccessResult2(n);
std::valarray<double> valArrayAccess1(n);
std::valarray<double> valArrayAccess2(n);
std::valarray<double> valArrayAccessResult1(n);
std::valarray<double> valArrayAccessResult2(n);
// Prefill
for (int i = 0; i < n; i++) {
directAccess1[i] = i;
directAccess2[i] = n - i;
armaAccess1[i] = i;
armaAccess2[i] = n - i;
valArrayAccess1[i] = i;
valArrayAccess2[i] = n - i;
}
timer.Start();
for (int j = 0; j < numberOfLoops; j++) {
for (int i = 1; i < n; i++) {
directAccessResult1[i] = -directAccess1[i] / (directAccess1[i] + directAccess1[i - 1]) * directAccess2[i - 1];
directAccessResult2[i] = -directAccess1[i] / (directAccess1[i] + directAccess1[i]) * directAccess2[i];
}
}
timer.StopAndPrint("Direct Array Indexing Took");
std::cout << std::endl;
timer.Start();
for (int j = 0; j < numberOfLoops; j++) {
armaAccessResult1.rows(1, n - 1) = -armaAccess1.rows(1, n - 1) / (armaAccess1.rows(1, n - 1) + armaAccess1.rows(0, n - 2)) % armaAccess2.rows(0, n - 2);
armaAccessResult2.rows(1, n - 1) = -armaAccess1.rows(1, n - 1) / (armaAccess1.rows(1, n - 1) + armaAccess1.rows(1, n - 1)) % armaAccess2.rows(1, n - 1);
}
timer.StopAndPrint("Arma Array Indexing Took");
std::cout << std::endl;
timer.Start();
for (int j = 0; j < numberOfLoops; j++) {
for (int i = 1; i < n; i++) {
valArrayAccessResult1[i] = -valArrayAccess1[i] / (valArrayAccess1[i] + valArrayAccess1[i - 1]) * valArrayAccess2[i - 1];
valArrayAccessResult2[i] = -valArrayAccess1[i] / (valArrayAccess1[i] + valArrayAccess1[i]) * valArrayAccess2[i];
}
}
timer.StopAndPrint("Valarray Array Indexing Took:");
std::cout << std::endl;
In Visual Studio release mode (/O2, to avoid Armadillo array indexing checks), they produce the following timings:
Started Performance Analysis!
Direct Array Indexing Took: 37.294 seconds elapsed
Arma Array Indexing Took: 39.4292 seconds elapsed
Valarray Array Indexing Took:: 37.2354 seconds elapsed

Your direct code is already quite optimal, so expression templates are not going to help here.
However, you may want to make sure the optimization level in your compiler actually enables auto-vectorization (-O3 in gcc). Secondly, you can get a bit of extra speed by defining ARMA_NO_DEBUG (#define ARMA_NO_DEBUG) before including the Armadillo header. This turns off all run-time checks (such as bounds checks for element access), but it is not recommended until you have completely debugged your program.

Need a formula to calculate (i->0 to n-1 )π(n!/i!)

I need a formula to calculate the product of (n!/i!) where i varies from 0 to n-1, so that I can implement it in code. The trivial approach exceeds the time limit. Any quick suggestions are welcome.

x = (n!/i!) = (i+1) * (i+2) * ... * (n)
So, in pseudo-code (doing it in reverse order for increased efficiency):
x = 1;
result = 1;
for (j = n; j > 0; j--) {
x = x * j;
result = result * x; // Or + if you want the sum instead of the product
}
return result;
Note that you may need a bigint representation for x and result as they will overflow quickly.

complexity for a nested loop with varying internal loop

Very similar complexity examples. I am trying to understand how these questions differ. Exam coming up tomorrow :( Are there any shortcuts for finding the complexities here?
CASE 1:
void doit(int N) {
while (N) {
for (int j = 0; j < N; j += 1) {}
N = N / 2;
}
}
CASE 2:
void doit(int N) {
while (N) {
for (int j = 0; j < N; j *= 4) {}
N = N / 2;
}
}
CASE 3:
void doit(int N) {
while (N) {
for (int j = 0; j < N; j *= 2) {}
N = N / 2;
}
}
Thank you so much!

void doit(int N) {
while (N) {
for (int j = 0; j < N; j += 1) {}
N = N / 2;
}
}
To find the O() of this, notice that we are dividing N by 2 on each iteration of the outer loop. So (not to insult your intelligence, but for completeness) on the final non-zero iteration we will have N = 1. On the iteration before that we will have roughly N = 2a, before that N = 4a, and so on, where a is a factor with 1 ≤ a < 2 (ignoring the rounding from integer division). So the outer loop executes about log(N) times (log base 2), and on the first iteration N = a·2^(floor(log(N))).
Why do we care about that? Well, it's a geometric series which has a nice closed form:
Sum = \sum_{k=0}^{\log N} a 2^k = a \cdot \frac{1 - 2^{\log N + 1}}{1 - 2} = a(2N - 1) = 2aN - a = O(N).
If someone can figure out how to get that latexy notation to display correctly for me I would really appreciate it.

You already have the answer to number 1, O(N), as given by @NickO; here is an alternative explanation.
Denote the total number of inner-loop iterations by T(N), and let the number of outer-loop iterations be h. Note that h = log_2(N).
T(N) = N + N/2 + ... + N / (2^i) + ... + 2 + 1
< 2N (sum of geometric series)
in O(N)
Number 3 is O((log N)^2).
Denote the total number of inner-loop iterations by T(N), and let the number of outer-loop iterations be h. Note that h = log_2(N).
T(N) = log(N) + log(N/2) + log(N/4) + ... + log(1)
= log(N * (N/2) * (N/4) * ... * 1) (because log(a) + log(b) = log(a*b))
= log(N^h * (1 * 1/2 * 1/4 * .... * 1/N))
= log(N^h) + log(1 * 1/2 * 1/4 * .... * 1/N) (because log(a*b) = log(a) + log(b))
< log(N^h) + log(1)
= log(N^h) (log(1) = 0)
= h * log(N) (log(a^b) = b*log(a))
= (log(N))^2 (because h=log_2(N))
Number 2 is almost identical to number 3.
(For cases 2 and 3 this assumes j starts from 1, not from 0. If it really starts from 0, then j *= 2 and j *= 4 leave j at 0 forever and the inner loop never terminates, as @WhozCraig points out.)

How to make the following code faster

int u1, u2;
unsigned long elm1[20], _mulpre[16][20], res1[40], res2[40]; // each element is 64 bits long
// res1, res2 are initialized to zero.
l = 60;
while (l)
{
for (i = 0; i < 20; i += 2)
{
u1 = (elm1[i] >> l) & 15;
u2 = (elm1[i + 1] >> l) & 15;
for (k = 0; k < 20; k += 2)
{
simda = _mm_load_si128 ((__m128i *) &_mulpre[u1][k]);
simdb = _mm_load_si128 ((__m128i *) &res1[i + k]);
simdb = _mm_xor_si128 (simda, simdb);
_mm_store_si128 ((__m128i *)&res1[i + k], simdb);
simda = _mm_load_si128 ((__m128i *)&_mulpre[u2][k]);
simdb = _mm_load_si128 ((__m128i *)&res2[i + k]);
simdb = _mm_xor_si128 (simda, simdb);
_mm_store_si128 ((__m128i *)&res2[i + k], simdb);
}
}
l -= 4;
// Here all res1, res2 values are left shifted by 4 bits.
}
The above mentioned code is called many times in my program (profiler shows 98%).
EDIT: In the inner loop, the res1[i + k] values are loaded many times for the same (i + k). Inside the while loop I tried loading all the res1 values into an array of SIMD registers and updating those array elements in the innermost for loop. Once both for loops were done, I stored the array values back to res1 and res2. But the computation time increased with this change. Any idea where I went wrong? The idea seemed correct.
Any suggestion to make it faster is welcome.

Unfortunately the most obvious optimisations are probably already being done by the compiler:
You can pull &_mulpre[u1] and &_mulpre[u2] out of the inner loop.
You can pull &res1[i] out of the inner loop.
Using different variables for the two inner operations, and reordering them, might allow for better pipelining.
Possibly swapping the outer loops would improve cache locality on elm1.

Well, you could always call it fewer times :-)
The total input and output data looks relatively small. Depending on your design and expected input, it might be feasible to cache computations or evaluate them lazily instead of up front.

There is very little you can do with a routine such as this, since loads and stores will be the dominant factor (you're doing 2 loads + 1 store = 4 bus cycles for a single computational instruction).

l = 60;
while (l)
{
for (i = 0; i < 20; i += 2)
{
u1 = (elm1[i] >> l) & 15;
u2 = (elm1[i + 1] >> l) & 15;
for (k = 0; k < 20; k += 2)
{
_mm_stream_si128 ((__m128i *)&res1[i + k],
_mm_xor_si128 (
_mm_load_si128 ((__m128i *) &_mulpre[u1][k]),
_mm_load_si128 ((__m128i *) &res1[i + k])
));
_mm_stream_si128 ((__m128i *)&res2[i + k],
_mm_xor_si128 (
_mm_load_si128 ((__m128i *)&_mulpre[u2][k]),
_mm_load_si128 ((__m128i *)&res2[i + k])
));
}
}
l -= 4;
// Here all res1, res2 values are left shifted by 4 bits.
}
Remember that you are using intrinsics; using fewer __m128i temporaries may speed up your program.
Try _mm_stream_si128(); it might speed up the stores.
Try prefetching.

simple C problem

I have had to start learning C as part of a project that I am doing. I have started doing the 'euler' problems in it and am having trouble with the first one. I have to find the sum of all multiples of 3 or 5 below 1000. Could someone please help me? Thanks.
#include<stdio.h>
int start;
int sum;
int main() {
while (start < 1001) {
if (start % 3 == 0) {
sum = sum + start;
start += 1;
} else {
start += 1;
}
if (start % 5 == 0) {
sum = sum + start;
start += 1;
} else {
start += 1;
}
printf("%d\n", sum);
}
return(0);
}

You've gotten some great answers so far, mainly suggesting something like:
#include <stdio.h>
int main(int argc, char * argv[])
{
int i;
int soln = 0;
for (i = 1; i < 1000; i++)
{
if ((i % 3 == 0) || (i % 5 == 0))
{
soln += i;
}
}
printf("%d\n", soln);
return 0;
}
So I'm going to take a different tack. I know you're doing this to learn C, so this may be a bit of a tangent.
Really, you're making the computer work too hard for this :). If we figured some things out ahead of time, it could make the task easier.
Well, how many multiples of 3 are less than 1000? There's one for each time that 3 goes into 1000 - 1.
mult3 = ⌊ (1000 - 1) / 3 ⌋ = 333
(the ⌊ and ⌋ mean that this is floor division, or, in programming terms, integer division, where the remainder is dropped).
And how many multiples of 5 are less than 1000?
mult5 = ⌊ (1000 - 1) / 5 ⌋ = 199
Now what is the sum of all the multiples of 3 less than 1000?
sum3 = 3 + 6 + 9 + ... + 996 + 999 = 3×(1 + 2 + 3 + ... + 332 + 333) = 3×∑i=1 to mult3 i
And the sum of all the multiples of 5 less than 1000?
sum5 = 5 + 10 + 15 + ... + 990 + 995 = 5×(1 + 2 + 3 + ... + 198 + 199) = 5×∑i = 1 to mult5 i
Some multiples of 3 are also multiples of 5. Those are the multiples of 15.
Since those count towards mult3 and mult5 (and therefore sum3 and sum5) we need to know mult15 and sum15 to avoid counting them twice.
mult15 = ⌊ (1000 - 1) /15 ⌋ = 66
sum15 = 15 + 30 + 45 + ... + 975 + 990 = 15×(1 + 2 + 3 + ... + 65 + 66) = 15×∑i = 1 to mult15 i
So the solution to the problem "find the sum of all the multiples of 3 or 5 below 1000" is then
soln = sum3 + sum5 - sum15
So, if we wanted to, we could implement this directly:
#include <stdio.h>
int main(int argc, char * argv[])
{
int i;
int const mult3 = (1000 - 1) / 3;
int const mult5 = (1000 - 1) / 5;
int const mult15 = (1000 - 1) / 15;
int sum3 = 0;
int sum5 = 0;
int sum15 = 0;
int soln;
for (i = 1; i <= mult3; i++) { sum3 += 3*i; }
for (i = 1; i <= mult5; i++) { sum5 += 5*i; }
for (i = 1; i <= mult15; i++) { sum15 += 15*i; }
soln = sum3 + sum5 - sum15;
printf("%d\n", soln);
return 0;
}
But we can do better. For calculating individual sums, we have Gauss's identity which says the sum from 1 to n (aka ∑i = 1 to n i) is n×(n+1)/2, so:
sum3 = 3×mult3×(mult3+1) / 2
sum5 = 5×mult5×(mult5+1) / 2
sum15 = 15×mult15×(mult15+1) / 2
(Note that we can use normal division or integer division here - it doesn't matter since one of n or n+1 must be divisible by 2)
Now this is kind of neat, since it means we can find the solution without using a loop:
#include <stdio.h>
int main(int argc, char *argv[])
{
int const mult3 = (1000 - 1) / 3;
int const mult5 = (1000 - 1) / 5;
int const mult15 = (1000 - 1) / 15;
int const sum3 = (3 * mult3 * (mult3 + 1)) / 2;
int const sum5 = (5 * mult5 * (mult5 + 1)) / 2;
int const sum15 = (15 * mult15 * (mult15 + 1)) / 2;
int const soln = sum3 + sum5 - sum15;
printf("%d\n", soln);
return 0;
}
Of course, since we've gone this far we could crank out the entire thing by hand:
sum3 = 3×333×(333+1) / 2 = 999×334 / 2 = 999×167 = 167000 - 167 = 166833
sum5 = 5×199×(199+1) / 2 = 995×200 / 2 = 995×100 = 99500
sum15 = 15×66×(66+1) / 2 = 990×67 / 2 = 495 × 67 = 33165
soln = 166833 + 99500 - 33165 = 233168
And write a much simpler program:
#include <stdio.h>
int main(int argc, char *argv[])
{
printf("233168\n");
return 0;
}

You could change your ifs:
if ((start % 3 == 0) || (start % 5 == 0))
sum += start;
start ++;
and don't forget to initialize sum to zero and start to one.
Also, change the while condition to < 1000.

You would be much better served by a for loop, and combining your conditionals.
Not tested:
#include <stdio.h>
int main()
{
int x;
int sum = 0;
for (x = 1; x < 1000; x++) // multiples below 1000, so 1000 itself is excluded
if (x % 3 == 0 || x % 5 == 0)
sum += x;
printf("%d\n", sum);
return 0;
}

The answers are all good, but won't help you learn C.
What you really need to understand is how to find your own errors. A debugger could help you, and the most powerful debugger in C is called "printf". You want to know what your program is doing, and your program is not a "black box".
Your program already prints the sum, it's probably wrong, and you want to know why. For example:
printf("sum:%d start:%d\n", sum, start);
instead of
printf("%d\n", sum);
and save it into a text file, then try to understand what's going wrong.
Does the count start with 1 and end with 999?
Does it really go from 1 to 999 without skipping numbers?
Does it work on a smaller range?

Eh right, well, I can see roughly where you are going; I think the only thing wrong with it has already been mentioned. I did this problem before on there; you need to step through every multiple of 3 and 5 and sum them, without counting the multiples of both twice. I did it this way and it does work:
int accumulator = 0;
int i;
for (i = 0; i < 1000; i += 3)
accumulator += i;
for (i = 0; i < 1000; i +=5) {
if (!(i%3==0)) {
accumulator += i;
}
}
printf("%d", accumulator);
EDIT: Also note it's not 0 to 1000 inclusive: < 1000 stops at 999, since that is the last number below 1000. You have countered that with < 1001, which means you go all the way to 1000, a multiple of 5, so your answer will be 1000 higher than it should be.

You haven't said what the program is supposed to do, or what your problem is. That makes it hard to offer help.
At a guess, you really ought to initialize start and sum to zero, and perhaps the printf should be outside the loop.

Really you need a debugger, and to single-step through the code so that you can see what it's actually doing. Your basic problem is that the flow of control isn't going where you think it is, and rather than provide correct code as others have done, I'll try to explain what your code does. Here's what happens, step-by-step (I've numbered the lines):
1: while (start < 1001) {
2: if (start % 3 == 0) {
3: sum = sum + start;
4: start += 1;
5: }
6: else {
7: start += 1;
8: }
9:
10: if (start % 5 == 0) {
11: sum = sum + start;
12: start += 1;
13: }
14: else {
15: start += 1;
16: }
17: printf("%d\n", sum);
18: }
line 1. sum is 0, start is 0. Loop condition true.
line 2. sum is 0, start is 0. If condition true.
line 3. sum is 0, start is 0. sum <- 0.
line 4. sum is 0, start is 0. start <- 1.
line 5. sum is 0, start is 1. jump over "else" clause
line 10. sum is 0, start is 1. If condition false, jump into "else" clause.
line 15. sum is 0, start is 1. start <- 2.
line 16 (skipped)
line 17. sum is 0, start is 2. Print "0\n".
line 18. sum is 0, start is 2. Jump to the top of the loop.
line 1. sum is 0, start is 2. Loop condition true.
line 2. sum is 0, start is 2. If condition false, jump into "else" clause.
line 7. sum is 0, start is 2. start <- 3.
line 10. sum is 0, start is 3. If condition false, jump into "else" clause.
line 15. sum is 0, start is 3. start <- 4.
line 17. sum is 0, start is 4. Print "0\n".
You see how this is going? You seem to think that at line 4, after doing start += 1, control goes back to the top of the loop. It doesn't; it goes to the next thing after the "if/else" construct.

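You have forgotten to initialize your variables explicitly. (As file-scope ints they start at zero in C, but it is clearer to write the initial values out.)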

The problem with your code is that you're incrementing the 'start' variable twice per pass through the loop. This is due to having two separate if..else statements. What you need is an if..else if..else statement, like so:
if (start % 3 == 0) {
sum = sum + start;
start += 1;
}
else if (start % 5 == 0) {
sum = sum + start;
start += 1;
}
else {
start += 1;
}
Or you could be more concise and write it as follows:
if(start % 3 == 0)
sum += start;
else if(start % 5 == 0)
sum += start;
start++;
Either of those two ways should work for you.
Good luck!

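Here's a closed-form solution that takes a list of factors. As written it subtracts only the pairwise overlaps (using the product of the two factors), so it is exact for two coprime factors such as 3 and 5; three or more factors, or factors that share a divisor, would need full inclusion-exclusion over least common multiples: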
#include <stdio.h>
#define sum_multiples(BOUND, ...) \
_sum_multiples(BOUND, (unsigned []){ __VA_ARGS__, 0 })
static inline unsigned sum_single(unsigned bound, unsigned base)
{
unsigned n = bound / base;
return base * (n * (n + 1)) / 2;
}
unsigned _sum_multiples(unsigned bound, unsigned bases[])
{
unsigned sum = 0;
for(unsigned i = 0; bases[i]; ++i)
{
sum += sum_single(bound, bases[i]);
for(unsigned j = i + 1; bases[j]; ++j)
sum -= sum_single(bound, bases[i] * bases[j]);
}
return sum;
}
int main(void)
{
printf("%u\n", sum_multiples(999, 3, 5));
return 0;
}
