Correct way of unrolling loop using gcc - c

#include <stdio.h>

int main() {
    int i;
    for (i = 0; i < 10000; i++) {
        printf("%d", i);
    }
}
I want to do loop unrolling on this code using gcc, but even with the flags
gcc -O2 -funroll-all-loops --save-temps unroll.c
the assembly I get still contains a loop of 10000 iterations:
_main:
Leh_func_begin1:
pushq %rbp
Ltmp0:
movq %rsp, %rbp
Ltmp1:
pushq %r14
pushq %rbx
Ltmp2:
xorl %ebx, %ebx
leaq L_.str(%rip), %r14
.align 4, 0x90
LBB1_1:
xorb %al, %al
movq %r14, %rdi
movl %ebx, %esi
callq _printf
incl %ebx
cmpl $10000, %ebx
jne LBB1_1
popq %rbx
popq %r14
popq %rbp
ret
Leh_func_end1:
Can someone please tell me how to get gcc to unroll this loop correctly?

Loop unrolling won't give you any benefit for this code, because the overhead of the call to printf() dominates the work done in each iteration. The compiler may well recognize this: asked to optimize, it can decide that unrolling would grow the code for no appreciable run-time gain, and that the added risk of instruction cache misses makes unrolling not worth performing.
The type of unrolling required to speed up this loop would require reducing the number of calls to printf() itself. I am unaware of any optimizing compiler that is capable of doing that.
As an example of unrolling the loop to reduce the number of printf() calls, consider this code:
void print_loop_unrolled (int n) {
    int i = -8;
    if (n % 8) {
        printf("%.*s", n % 8, "01234567");
        i += n % 8;
    }
    while ((i += 8) < n) {
        printf("%d%d%d%d%d%d%d%d", i, i+1, i+2, i+3, i+4, i+5, i+6, i+7);
    }
}

gcc has parameters that cap how far it will unroll loops.
You have to use -O3 -funroll-loops and play with the parameters max-unroll-times, max-unrolled-insns and max-average-unrolled-insns.
Example:
-O3 -funroll-loops --param max-unroll-times=200

Replace
printf("%d",i);
with
volatile int j = i;
and see if the loop gets unrolled.

Related

GCC option to avoid function calls to simple get/set functions? [duplicate]

This question already has answers here:
Can gcc or clang inline functions that are not in the same compilation unit?
(1 answer)
How do I force gcc to inline a function?
(8 answers)
C, inline function and GCC [duplicate]
(4 answers)
In C, should inline functions in headers be externed in the .c file?
(2 answers)
Closed 7 days ago.
There may be a very simple solution to this problem, but it has been bothering me for a while, so I have to ask.
In our embedded projects, it seems common to have simple get/set functions for many variables in separate C-files. Those functions are then called from many other C-files, and when I look at the assembly listing, the calls are never replaced with move instructions. A faster way would be to declare the monitored variables as global variables and avoid the unnecessary function calls.
Let's say you have a file.c with variables that need to be monitored in another C-file, main.c. For example, debugging variables, hardware registers, adc-values, etc. Is there a compiler optimization that replaces simple get/set functions with move instructions, avoiding the overhead of the function calls?
file.h
#ifndef FILE_H
#define FILE_H
#include <stdint.h>
int32_t get_signal(void);
void set_signal(int32_t x);
#endif
file.c
#include "file.h"
#include <stdint.h>

static volatile int32_t *signal = SOME_HARDWARE_ADDRESS;

int32_t get_signal(void)
{
    return *signal;
}

void set_signal(int32_t x)
{
    *signal = x;
}
main.c
#include "file.h"
#include <stdio.h>

int main(int argc, char *args[])
{
    // Do something with the variable
    for (int i = 0; i < 10; i++)
    {
        printf("signal = %d\n", get_signal());
    }
    return 0;
}
If I compile the above code with gcc -Wall -save-temps main.c file.c -o main.exe, it gives the following assembly listing for main.c. You can always see the call get_signal, even when compiling with -O3, which seems silly since all we are doing is reading a memory address. Why bother calling such a simple function?
The same applies to the simple set function: it is always called, even though all it does is write to one memory location.
main.s
main:
pushq %rbp
.seh_pushreg %rbp
movq %rsp, %rbp
.seh_setframe %rbp, 0
subq $48, %rsp
.seh_stackalloc 48
.seh_endprologue
movl %ecx, 16(%rbp)
movq %rdx, 24(%rbp)
call __main
movl $0, -4(%rbp)
jmp .L4
.L5:
call get_signal
movl %eax, %edx
leaq .LC0(%rip), %rcx
call printf
addl $1, -4(%rbp)
.L4:
cmpl $9, -4(%rbp)
jle .L5
movl $0, %eax
addq $48, %rsp
popq %rbp
ret
UPDATED 2023-02-13
The question was closed with several links to inline and Link-time Optimization-related answers. I don't think the same question has been answered before, or at least the solution is not obvious for my get_signal function. What is there to inline if a function just returns a value and does nothing else?
Anyway, as suggested, one solution is to add the compiler flags -O2 -flto, which correctly replaces the call get_signal instruction with a move, as the following partial output shows:
main:
subq $40, %rsp
.seh_stackalloc 40
.seh_endprologue
call __main
movl tmp.0(%rip), %edx
movl $10, %eax
.p2align 4,,10
.p2align 3
.L4:
movl signal(%rip), %ecx
addl %ecx, %edx
subl $1, %eax
jne .L4
leaq .LC0(%rip), %rcx
movl %edx, tmp.0(%rip)
call printf.constprop.0
xorl %eax, %eax
addq $40, %rsp
ret
.seh_endproc
Thank you.

Why is -O1 10000 times faster than -O2?

Below is a C function to evaluate a polynomial:
/* Calculate a0 + a1*x + a2*x^2 + ... + an*x^n */
/* from CSAPP Ex.5.5, modified to integer version */
int poly(int a[], int x, int degree) {
    long int i;
    int result = a[0];
    int xpwr = x;
    for (i = 1; i <= degree; ++i) {
        result += a[i] * xpwr;
        xpwr *= x;
    }
    return result;
}
And a main function:
#define TIMES 100000ll
int main(void) {
    long long int i;
    unsigned long long int result = 0;
    for (i = 0; i < TIMES; ++i) {
        /* g_a is an int[10000] global variable with all elements equal to 1 */
        /* x = 2, i.e. evaluate 1 + 2 + 2^2 + ... + 2^9999 */
        result += poly(g_a, 2, 9999);
    }
    printf("%lld\n", result);
    return 0;
}
When I compile the program with GCC and the options -O1 and -O2 separately, I found that -O1 is a lot FASTER than -O2.
Platform details:
i5-4600
Arch Linux x86_64 with kernel 3.18
GCC 4.9.2
gcc -O1 -o /tmp/a.out test.c
gcc -O2 -o /tmp/a.out test.c
Result:
When TIMES = 100000ll, -O1 prints the result instantly, while -O2 needs 0.36s
When TIMES = 1000000000ll, -O1 prints the result in 0.28s, -O2 takes so long that I didn't finish the test
It seems that -O1 is approximately 10000 times faster than -O2.
When I test it on Mac (clang-600.0.56), the result is even weirder: -O1 takes no more than 0.02s even when TIMES = 1000000000000000000ll
I have tested the following changes:
makes g_a random (elements are from 1 to 10)
x = 19234 (or some other number)
use int instead of long long int
And the results are the same.
I tried looking at the assembly code: it seems that -O1 calls the poly function while -O2 inlines it. But inlining should make the performance better, shouldn't it?
What makes these huge differences? Why can -O1 on clang make the program so fast? Is -O1 doing something wrong? (I cannot check the result, as it is too slow without optimization.)
Here is the assembly code of main for -O1: (you may get it by adding -S option to gcc)
main:
.LFB12:
.cfi_startproc
subq $8, %rsp
.cfi_def_cfa_offset 16
movl $9999, %edx
movl $2, %esi
movl $g_a, %edi
call poly
movslq %eax, %rdx
movl $100000, %eax
.L6:
subq $1, %rax
jne .L6
imulq $100000, %rdx, %rsi
movl $.LC0, %edi
movl $0, %eax
call printf
movl $0, %eax
addq $8, %rsp
.cfi_def_cfa_offset 8
ret
.cfi_endproc
And for -O2:
main:
.LFB12:
.cfi_startproc
movl g_a(%rip), %r9d
movl $100000, %r8d
xorl %esi, %esi
.p2align 4,,10
.p2align 3
.L8:
movl $g_a+4, %eax
movl %r9d, %ecx
movl $2, %edx
.p2align 4,,10
.p2align 3
.L7:
movl (%rax), %edi
addq $4, %rax
imull %edx, %edi
addl %edx, %edx
addl %edi, %ecx
cmpq $g_a+40000, %rax
jne .L7
movslq %ecx, %rcx
addq %rcx, %rsi
subq $1, %r8
jne .L8
subq $8, %rsp
.cfi_def_cfa_offset 16
movl $.LC1, %edi
xorl %eax, %eax
call printf
xorl %eax, %eax
addq $8, %rsp
.cfi_def_cfa_offset 8
ret
.cfi_endproc
Although I don't know much about assembly, it is obvious that -O1 just calls poly once and multiplies the result by 100000 (imulq $100000, %rdx, %rsi). This is why it is so fast.
It seems that gcc can detect that poly is a pure function with no side effect. (It will be interesting if we have another thread modifying g_a while poly is running...)
On the other hand, -O2 has inlined the poly function, so it has no chance to check poly as a pure function.
I have further done some research:
I cannot find the actual flag used by -O1 that does the pure-function check.
I have tried all the flags listed by gcc -Q -O1 --help=optimizers individually, but none of them reproduces the effect.
Maybe it needs a combination of flags to get the effect, but it is very hard to try all the combinations.
But I have found the flag used by -O2 which makes the effect disappear, which is the -finline-small-functions flag. The name of the flag explains itself.
One thing that jumps out at me is that you're overflowing signed integers. The behaviour of this is undefined in C. Specifically, int result won't be able to hold pow(2,9999). I don't see the point of benchmarking code with undefined behaviour.

Speed with and without 'static' [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 8 years ago.
Improve this question
I know that 'static' is about scope, but I've got a question: what function/variable will be faster to access: a 'static' one or not?
Which code will be faster:
#include <stdio.h>
int main() {
    int count;
    for (count = 0; count < 1000; ++count)
        printf("%d\n", count);
    return 0;
}
or
#include <stdio.h>
int main() {
    static int count;
    for (count = 0; count < 1000; ++count)
        printf("%d\n", count);
    return 0;
}
In my code I'm working with VERY big numbers (with unsigned long long) and I'm accessing and increasing them about 4,000,000 times a second. This code is not the one I'm working on; it's just an example.
As a sign of good will, I have made up a program that we can actually reason about.
#include <stdint.h>
#include <stdio.h>

int
main()
{
    static const uint64_t a = 1664525UL;
    static const uint64_t c = 1013904223UL;
    static const uint64_t m = (1UL << 31);
    static uint32_t x = 1;
    register unsigned i;
    for (i = 0; i < 1000000000U; ++i)
        x = (a * x + c) % m;
    printf("%d\n", x);
    return 0;
}
It will simply compute the one billionth element of a pseudo random sequence returned by a simple linear congruential generator. We have to do something more difficult than simply increment a counter or the compiler will optimize the entire loop out of existence.
Here is how I have compiled (GCC 4.9.1 on x86_64 GNU/Linux):
$ gcc -o non-static -Dstatic= -Wall -O3 main.c
$ gcc -o static -Wall -O3 main.c
To get the version without static, we simply #define it away on the compiler command line.
Running either program took 2.36 seconds, meaning there is no measurable performance difference.
To find out why, I like to look at the assembly code.
$ gcc -S -o non-static.s -Dstatic= -Wall -O3 main.c
$ gcc -S -o static.s -Wall -O3 main.c
We find that GCC generated identical machine code for the inner loop and moved the special treatment for the static variables out of the loop, which is what we should have expected from a good compiler.
Relevant code with static:
main:
.LFB11:
.cfi_startproc
movl x.2266(%rip), %esi
movl $1000000000, %eax
.p2align 4,,10
.p2align 3
.L2: # BEGIN LOOP
imull $1664525, %esi, %esi
addl $1013904223, %esi
andl $2147483647, %esi
subl $1, %eax
jne .L2 # END LOOP
subq $8, %rsp
.cfi_def_cfa_offset 16
movl $.LC0, %edi
xorl %eax, %eax
movl %esi, x.2266(%rip)
call printf
xorl %eax, %eax
addq $8, %rsp
.cfi_def_cfa_offset 8
ret
and without:
main:
.LFB11:
.cfi_startproc
movl $1000000000, %eax
movl $1, %esi
.p2align 4,,10
.p2align 3
.L2: # BEGIN LOOP
imull $1664525, %esi, %esi
addl $1013904223, %esi
andl $2147483647, %esi
subl $1, %eax
jne .L2 # END LOOP
subq $8, %rsp
.cfi_def_cfa_offset 16
movl $.LC0, %edi
xorl %eax, %eax
call printf
xorl %eax, %eax
addq $8, %rsp
.cfi_def_cfa_offset 8
ret
This just re-emphasizes what many have tried to express in their comments: We need actual code to reason about performance and we should really benchmark it while doing so.
Also, you shouldn't worry too much about such things and trust your compiler most of the time. Focus on writing readable and maintainable code and only fiddle with the dirty details if you have evidence that it is necessary to achieve the required performance. In your particular example, I cannot see any valid reason to declare the local variables static. It disturbs me as a reader and should not be done.

Loop unrolling optimization, how does this work

Consider this C-code:
int sum = 0;
for (int i = 0; i < 5; i++)
    sum += i;
This could be translated in (pseudo-) assembly this way (without loop unrolling):
% pseudo-code assembly
ADDI $R10, #0 % sum
ADDI $R11, #0 % i
LOOP:
ADD $R10, $R11
ADDI $R11, #1
BNE $R11, #5 LOOP
So my first question is how is this code translated using loop unrolling, between these two ways:
1)
ADDI $R10, #0
ADDI $R10, #0
ADDI $R10, #1
ADDI $R10, #2
ADDI $R10, #3
ADDI $R10, #4
2)
ADD $R10, #10
Is the compiler able to optimize the code and directly know that it has to add 10, without performing all the sums?
Also, could the branch instruction stall the pipeline? Do I have to write it this way:
% pseudo-code assembly
ADDI $R10, #0 % sum
ADDI $R11, #0 % i
LOOP:
ADD $R10, $R11
ADDI $R11, #1
NOP % is this necessary to avoid the pipeline blocking?
NOP
NOP
NOP
BNE $R11, #5 LOOP
to avoid the fetch-decode-execute-memory-writeback cycle being interrupted by the branch?
This is more for demonstration of what a compiler is capable of, rather than what every compiler would do. The source:
#include <stdio.h>
#include <stdio.h>

int main(void)
{
    int i, sum = 0;
    for (i = 0; i < 5; i++) {
        sum += i;
    }
    printf("%d\n", sum);
    return 0;
}
Note the printf I have added. If the variable is not used, the compiler will optimize out the entire loop.
Compiling with -O0 (No optimization)
gcc -Wall -O0 -S -c lala.c:
.L3:
movl -8(%rbp), %eax
addl %eax, -4(%rbp)
addl $1, -8(%rbp)
.L2:
cmpl $4, -8(%rbp)
jle .L3
The loop happens in a 'dumb' way, with -8(%rbp) being the variable i.
Compiling with -O1 (Optimization level 1)
gcc -Wall -O1 -S -c lala.c:
movl $10, %edx
The loop has been completely removed and replaced with the equivalent value.
In unrolling, the compiler looks at how many iterations would happen and tries to do the same work in fewer iterations. For example, the loop body might be doubled, halving the number of branches taken. Such a case in C:
int i = 0, sum = 0;
sum += i;
i++;
for (; i < 5; i++) {
    sum += i;
    i++;
    sum += i;
}
Notice that one iteration had to be extracted out of the loop. This is because 5 is an odd number and so the work can not simply be halved by duplicating the contents. In this case the loop will only be entered twice. The assembly code produced by -O0:
movl -8(%rbp), %eax
addl %eax, -4(%rbp)
addl $1, -8(%rbp)
jmp .L2
.L3:
movl -8(%rbp), %eax
addl %eax, -4(%rbp)
addl $1, -8(%rbp)
movl -8(%rbp), %eax
addl %eax, -4(%rbp)
addl $1, -8(%rbp)
.L2:
cmpl $4, -8(%rbp)
Completely unrolling in C:
for (i = 0; i < 5; i++) {
    sum += i;
    i++;
    sum += i;
    i++;
    sum += i;
    i++;
    sum += i;
    i++;
    sum += i;
}
This time the loop is actually entered only once. The assembly produced with -O0:
.L3:
movl -8(%rbp), %eax
addl %eax, -4(%rbp)
addl $1, -8(%rbp)
movl -8(%rbp), %eax
addl %eax, -4(%rbp)
addl $1, -8(%rbp)
movl -8(%rbp), %eax
addl %eax, -4(%rbp)
addl $1, -8(%rbp)
movl -8(%rbp), %eax
addl %eax, -4(%rbp)
addl $1, -8(%rbp)
movl -8(%rbp), %eax
addl %eax, -4(%rbp)
addl $1, -8(%rbp)
.L2:
cmpl $4, -8(%rbp)
jle .L3
So my first question is how is this code translated using loop unrolling, between these two ways
This kind of optimization is usually implemented at the AST level rather than at the output-code (e.g. assembly) level. Loop unrolling can be done when the number of iterations is fixed and known at compile time. Say, for instance, we have this AST:
Program
|
+--For
|
+--Var
| |
| +--Variable i
|
+--Start
| |
| +--Constant 1
|
+--End
| |
| +--Constant 3
|
+--Statements
|
+ Print i
The compiler knows that For's Start and End are constants, and can therefore copy Statements once per iteration, replacing all occurrences of Var with its value each time. The AST above would be transformed into:
Program
|
+--Print 1
|
+--Print 2
|
+--Print 3
Is the compiler able to optimize the code and directly know that it has to add 10 without performing all sums?
Yes, if it's implemented to have such a feature. It's actually an improvement over the above case. In your example, after unrolling, the compiler can see that the l-value stays the same while the r-values are all constants. It can therefore perform peephole optimization combined with constant folding to yield a single addition. If the peephole optimization also considers the declaration, it could even be optimized further into a single move instruction.
At the basic level, the concept of loop unrolling is simply copying the body of the loop multiple times, as appropriate. The compiler may perform other optimizations as well (such as substituting computed constant values), but those wouldn't be considered unrolling the loop; they may replace it altogether. Ultimately it depends on the compiler and the flags used.
The C code (unrolled only) would look more like this:
int sum = 0;
int i = 0;
for ( ; i < (5 & ~(4 - 1)); i += 4) /* unrolling 4 iterations */
{
    sum += (i + 0);
    sum += (i + 1);
    sum += (i + 2);
    sum += (i + 3);
}
for ( ; i < 5; i++)
{
    sum += i;
}
Though there's plenty of opportunities for the compiler to make even more optimizations here, this is just one step.
There is no general answer possible for this: different compilers, different versions of them, and different compiler flags will all vary. Use the appropriate option of your compiler to look at the assembler output; with gcc and relatives, this is the -S option.

Compiler optimization causing program to run slower

I have the following piece of code that I wrote in C. It's fairly simple, as it just right-shifts x in every iteration of the for loop.
int main() {
    int x = 1;
    for (int i = 0; i > -2; i++) {
        x >> 2;
    }
}
Now the strange thing is that when I compile it without any optimizations or with first-level optimization (-O), it runs just fine (I am timing the executable: it's about 1.4s with -O and 5.4s without any optimizations).
Now when I add the -O2 or -O3 switch for compilation and time the resulting executable, it doesn't stop (I have tested for up to 60s).
Any ideas on what might be causing this?
The optimized build produces an infinite loop because you are depending on signed integer overflow. Signed integer overflow is undefined behavior in C and should not be relied on: not only can it confuse developers, it may also be optimized away by the compiler.
Assembly (no optimizations): gcc -std=c99 -S -O0 main.c
_main:
LFB2:
pushq %rbp
LCFI0:
movq %rsp, %rbp
LCFI1:
movl $1, -4(%rbp)
movl $0, -8(%rbp)
jmp L2
L3:
incl -8(%rbp)
L2:
cmpl $-2, -8(%rbp)
jg L3
movl $0, %eax
leave
ret
Assembly (optimized level 3): gcc -std=c99 -S -O3 main.c
_main:
LFB2:
pushq %rbp
LCFI0:
movq %rsp, %rbp
LCFI1:
L2:
jmp L2 #<- infinite loop
You will get the definitive answer by looking at the binary that's produced (using objdump or something).
But as others have noted, this is probably because you're relying on undefined behaviour. One possible explanation is that the compiler is free to assume i can never reach -2 (getting there would require signed overflow, which is undefined), so it eliminates the conditional entirely and converts this into an infinite loop.
Also, your code has no observable side effects, so the compiler is also free to optimise the entire program away to nothing, if it likes.
Additional information about why integer overflows are undefined can be found here:
http://blog.llvm.org/2011/05/what-every-c-programmer-should-know.html
Search for the paragraph "Signed integer overflow".
