Loop unrolling optimization: how does it work? (C)

Consider this C-code:
int sum = 0;
for (int i = 0; i < 5; i++)
    sum += i;
This could be translated into (pseudo-)assembly like this (without loop unrolling):
% pseudo-code assembly
ADDI $R10, #0 % sum
ADDI $R11, #0 % i
LOOP:
ADD $R10, $R11
ADDI $R11, #1
BNE $R11, #5 LOOP
So my first question is: how is this code translated using loop unrolling? Which of these two ways:
1)
ADDI $R10, #0
ADDI $R10, #0
ADDI $R10, #1
ADDI $R10, #2
ADDI $R10, #3
ADDI $R10, #4
2)
ADDI $R10, #10
Is the compiler able to optimize the code and directly know that it has to add 10 without performing all sums?
Also, can a branch instruction stall the pipeline? Do I have to write it this way:
% pseudo-code assembly
ADDI $R10, #0 % sum
ADDI $R11, #0 % i
LOOP:
ADD $R10, $R11
ADDI $R11, #1
NOP % are these NOPs necessary to avoid a pipeline stall?
NOP
NOP
NOP
BNE $R11, #5 LOOP
so that the fetch-decode-execute-memory-writeback cycle is not interrupted by the branch?

This is more a demonstration of what a compiler is capable of than of what every compiler would do. The source:
#include <stdio.h>

int main(void)
{
    int i, sum = 0;
    for (i = 0; i < 5; i++) {
        sum += i;
    }
    printf("%d\n", sum);
    return 0;
}
Note the printf I have added. If the variable is not used, the compiler will optimize out the entire loop.
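For example, if you remove the printf, gcc at -O1 will typically reduce the whole program to the equivalent of this (a source-level sketch of the effect; exact behaviour varies by version):
int main(void)
{
    return 0; /* sum and the loop are dead code and disappear entirely */
}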
Compiling with -O0 (No optimization)
gcc -Wall -O0 -S -c lala.c:
.L3:
movl -8(%rbp), %eax
addl %eax, -4(%rbp)
addl $1, -8(%rbp)
.L2:
cmpl $4, -8(%rbp)
jle .L3
The loop happens in a 'dumb' way, with -8(%rbp) being the variable i.
Compiling with -O1 (Optimization level 1)
gcc -Wall -O1 -S -c lala.c:
movl $10, %edx
The loop has been completely removed and replaced with the equivalent value.
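In other words, the whole computation was done at compile time; the program was effectively reduced to this (a source-level sketch of the result, not actual compiler output):
#include <stdio.h>

int main(void)
{
    printf("%d\n", 10); /* the loop's result, folded in at compile time */
    return 0;
}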
In unrolling, the compiler looks at how many iterations would happen and tries to unroll so that fewer iterations are performed. For example, the loop body might be duplicated twice, halving the number of branches. Such a case in C:
int i = 0, sum = 0;
sum += i;
i++;
for ( ; i < 5; i++) {
    sum += i;
    i++;
    sum += i;
}
Notice that one iteration had to be extracted out of the loop. This is because 5 is an odd number, so the work cannot simply be halved by duplicating the contents. In this case the loop body executes only twice. The assembly code produced by -O0:
movl -8(%rbp), %eax
addl %eax, -4(%rbp)
addl $1, -8(%rbp)
jmp .L2
.L3:
movl -8(%rbp), %eax
addl %eax, -4(%rbp)
addl $1, -8(%rbp)
movl -8(%rbp), %eax
addl %eax, -4(%rbp)
addl $1, -8(%rbp)
.L2:
cmpl $4, -8(%rbp)
jle .L3
Completely unrolling in C:
for (i = 0; i < 5; i++) {
    sum += i;
    i++;
    sum += i;
    i++;
    sum += i;
    i++;
    sum += i;
    i++;
    sum += i;
}
This time the loop is actually entered only once. The assembly produced with -O0:
.L3:
movl -8(%rbp), %eax
addl %eax, -4(%rbp)
addl $1, -8(%rbp)
movl -8(%rbp), %eax
addl %eax, -4(%rbp)
addl $1, -8(%rbp)
movl -8(%rbp), %eax
addl %eax, -4(%rbp)
addl $1, -8(%rbp)
movl -8(%rbp), %eax
addl %eax, -4(%rbp)
addl $1, -8(%rbp)
movl -8(%rbp), %eax
addl %eax, -4(%rbp)
addl $1, -8(%rbp)
.L2:
cmpl $4, -8(%rbp)
jle .L3

So my first question is: how is this code translated using loop unrolling? Which of these two ways?
This kind of optimization is usually implemented at the AST level rather than at the output-code (e.g. assembly) level. Loop unrolling can be done when the number of iterations is fixed and known at compile time. So, for instance, take this AST:
Program
|
+--For
|
+--Var
| |
| +--Variable i
|
+--Start
| |
| +--Constant 1
|
+--End
| |
| +--Constant 3
|
+--Statements
|
+ Print i
The compiler knows that the For's Start and End are constants, and can therefore simply copy the Statements, replacing all occurrences of Var with its value for each iteration. The above AST would be translated to:
Program
|
+--Print 1
|
+--Print 2
|
+--Print 3
Is the compiler able to optimize the code and directly know that it has to add 10 without performing all sums?
Yes, if it's implemented to have such a feature. It's actually an improvement over the above case. In your example, after doing the unrolling, the compiler can see that the l-value stays the same while the r-values are constants. Therefore it can perform peephole optimization combined with constant folding to yield a single addition. If the peephole optimization also considers the declaration, it can be optimized even further, into a single move instruction.
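To illustrate that chain of rewrites at the source level (a hand-written sketch, not actual compiler output; the step functions are just labels for each stage):
/* after complete unrolling */
int step0(void) { int sum = 0; sum += 0; sum += 1; sum += 2; sum += 3; sum += 4; return sum; }

/* after constant folding collapses the chain of additions */
int step1(void) { return 0 + 0 + 1 + 2 + 3 + 4; }

/* after folding through the declaration as well: a single constant, i.e. one move */
int step2(void) { return 10; }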

At the basic level, loop unrolling is simply copying the body of the loop multiple times as appropriate. The compiler may do other optimizations as well (such as substituting a precomputed value), but that wouldn't be considered unrolling the loop; it potentially replaces the loop altogether. Ultimately it depends on the compiler and the flags used.
The C code (unrolled only) would look more like this:
int sum = 0;
int i = 0;
for ( ; i < (5 & ~(4-1)); i += 4) /* unrolling 4 iterations */
{
    sum += (i+0);
    sum += (i+1);
    sum += (i+2);
    sum += (i+3);
}
for ( ; i < 5; i++)
{
    sum += i;
}
Though there's plenty of opportunities for the compiler to make even more optimizations here, this is just one step.
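The same pattern generalizes to a trip count that is not known at compile time: a main loop that does four iterations' worth of work per branch, plus a remainder loop. A sketch (the function sum_array and the factor 4 are illustrative assumptions, not from the question):
int sum_array(const int *a, int n)
{
    int sum = 0;
    int i = 0;
    for ( ; i + 4 <= n; i += 4) { /* main unrolled loop */
        sum += a[i + 0];
        sum += a[i + 1];
        sum += a[i + 2];
        sum += a[i + 3];
    }
    for ( ; i < n; i++) /* remainder: at most 3 iterations */
        sum += a[i];
    return sum;
}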

There is no general answer possible here: different compilers, different versions of them, and different compiler flags will all produce different results. Use the appropriate option of your compiler to look at the assembly output; with gcc and relatives this is the -S option.
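For example, assuming the lala.c file from above:
gcc -O2 -S lala.c        # writes the assembly to lala.s
gcc -O2 -S -o - lala.c   # prints the assembly to standard output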

Related

Is a memory barrier AND volatile ENOUGH to avoid a data race?

I want to see if I am forced to use atomic integers.
I have a loop that looks similar to this:
struct loop {
    volatile int loop_variable;
    volatile int limit;
};

for (int i = loop.loop_variable; i < loop.limit; loop.loop_variable++) {
}
Then another thread does this:
loops.loop_variable = loops.limit;
And issues a memory barrier.
Is this multithreaded safe?
The assembly where there is a data race is between these lines:
// loop.loop_variable = loop.limit;
movl 4+loop.0(%rip), %eax
movl %eax, loop.0(%rip)
And
// for (int i = loop.loop_variable ; i < loop.limit ; loop.loop_variable++)
movl loop.0(%rip), %eax
movl %eax, -4(%rbp)
jmp .L2
.L3:
movl loop.0(%rip), %eax
addl $1, %eax
movl %eax, loop.0(%rip)
.L2:
movl loop.0(%rip), %eax
cmpl $99999, %eax
jle .L3
movl $0, %eax
There might be a data race between
movl loop.0(%rip), %eax
addl $1, %eax
movl %eax, loop.0(%rip)
since it takes three instructions to increment loop_variable, but only one to overwrite it with the limit.
Is this multithreaded safe?
No.
Given
loops[0].loop_variable = loops[0].limit;
< memory barrier >
in one thread, that memory barrier won't prevent int i = loop.loop_variable from reading an indeterminate value, or loop.loop_variable++ from producing nonsense results, in another thread. Other threads can still potentially "see" the change to loops[0].loop_variable, or only parts of it.
A memory barrier just imposes consistency afterwards - it doesn't do a thing beforehand.
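If the underlying question is whether atomics are required: yes, this is what they are for. A minimal C11 sketch, assuming the same two-thread setup as above (the function names worker and stopper are hypothetical):
#include <stdatomic.h>

struct loop {
    atomic_int loop_variable;
    atomic_int limit;
};

static struct loop loops;

void worker(void) /* the looping thread */
{
    while (atomic_load(&loops.loop_variable) < atomic_load(&loops.limit)) {
        /* ... loop body ... */
        atomic_fetch_add(&loops.loop_variable, 1); /* indivisible read-modify-write */
    }
}

void stopper(void) /* the other thread */
{
    /* one atomic store replaces the plain store plus barrier */
    atomic_store(&loops.loop_variable, atomic_load(&loops.limit));
}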

Why asm generated by gcc mov twice?

Suppose I have the following C code:
#include <stdio.h>

int main()
{
    int x = 11;
    int y = x + 3;
    printf("%d\n", x);
    return 0;
}
Then I compile it into asm using gcc, and I get this (with some flags removed):
main:
pushq %rbp
movq %rsp, %rbp
subq $16, %rsp
movl $11, -4(%rbp)
movl -4(%rbp), %eax
addl $3, %eax
movl %eax, -8(%rbp)
movl -4(%rbp), %eax
movl %eax, %esi
movl $.LC0, %edi
movl $0, %eax
call printf
movl $0, %eax
leave
ret
My problem is: why is it movl -4(%rbp), %eax followed by movl %eax, %esi, rather than a simple movl -4(%rbp), %esi (which works just as well according to my experiment)?
You probably did not enable optimizations.
Without optimization the compiler will produce code like this. For one thing, it does not allocate variables to registers, but on the stack. This means that when you operate on variables, they will first be transferred to a register and then operated on.
So x is allocated at -4(%rbp), and this is what the code looks like if you translate it directly, statement by statement, without optimization. First you move 11 into the storage of x:
movl $11, -4(%rbp)
That takes care of the first statement. The next statement evaluates x + 3 and places the result in the storage of y (which is -8(%rbp)); this is done without regard to the previously generated code:
movl -4(%rbp), %eax
addl $3, %eax
movl %eax, -8(%rbp)
That completes the second statement; note that it divides into two parts, the evaluation of x + 3 and the storage of the result. The compiler then generates code for the printf statement, again without taking earlier statements into account.
If you, on the other hand, enable optimization, the compiler does a number of smart and, to humans, obvious things. One thing is that it allows variables to be allocated to registers, or at least keeps track of where the value of each variable can be found. In this case the compiler would, for example, know in the second statement that x is not only stored at -4(%rbp); it also knows that x is 11 (yes, it knows the actual value). It can then use this to add 3, which means it knows the result to be 14. But it is smarter than that: it has also seen that you never use y, so it skips that statement entirely. The next statement is the printf, where it can use the fact that it knows x to be 11 and pass that value directly. Incidentally, it also realizes that it never needs the storage of x at -4(%rbp). Finally, it may know what printf does (since you included stdio.h), so it can analyze the format string and do the conversion at compile time, replacing the printf call with one that directly writes 11 to standard output.
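For comparison, with optimization enabled the output typically shrinks to something like this (a sketch of the general shape; exact code differs between gcc versions and targets):
main:
    subq  $8, %rsp
    movl  $11, %esi     # x's value, known at compile time
    movl  $.LC0, %edi   # the "%d\n" format string
    xorl  %eax, %eax    # zero %eax: no vector registers used in the varargs call
    call  printf
    xorl  %eax, %eax    # return 0
    addq  $8, %rsp
    ret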

For loop execution time with different conditions

Why is it that the code:
for( i = 0, j = 0; i < 4 , j < 3; i++, j++)
is slower than
for( i = 0, j = 0; i < 4 && j < 3; i++, j++)
Elaborating: some users proposed that two if statements take more time than a single if statement with an && operator. I tested this without for loops, and it is not true: two if statements were faster than a single one with an && operator.
The first code is not slower, at least with gcc without optimization. In fact, it should be faster.
When you compile both codes and disassemble them, you will find this for the first code:
cmpl $0x2,-0x8(%rbp)
jle 26 <main+0x26>
And this for the second one:
cmpl $0x3,-0x4(%rbp)
jg 44 <main+0x44>
cmpl $0x2,-0x8(%rbp)
jle 26 <main+0x26>
In the first example, gcc evaluates just the second part, because the first one has no effect and is not used in the comparison. In the second one, it has to check for the first one, and then, if true, check the second one.
So, in the general case, the first example should be faster than the second one. If you found the first one slower, your measurement was probably not 100% correct.
There may be no change in execution time, but the number of iterations may vary, since:
If we put a comma-separated condition in a for loop, it evaluates to the value of the last operand. So whichever condition you write first will be disregarded, and the second one will be checked. Thus i < 4, j < 3 always checks only j < 3, whereas i < 4 && j < 3 examines both and is true if and only if both conditions are true.
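A small runnable check makes the comma behaviour visible (a sketch):
#include <stdio.h>

int main(void)
{
    int i = 10, j = 0;
    /* (i < 4, j < 3) evaluates i < 4, discards it, and yields j < 3 */
    if (i < 4, j < 3)
        printf("taken: only j < 3 was actually tested\n");
    return 0;
}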
If we check the assembly of the code, we can see the difference. Program:
#include <stdio.h>

int main()
{
    int x, y;
    for (x = 0, y = 0; x < 4, y < 5; x++, y++);
    printf("New one");
    for (x = 0, y = 0; x < 4 && y < 5; x++, y++);
}
Command to get the assembly: gcc -S <program name>
Assembly
.file "for1.c"
.section .rodata
.LC0:
.string "New one"
.text
.globl main
.type main, @function
main:
.LFB0:
.cfi_startproc
pushl %ebp
.cfi_def_cfa_offset 8
.cfi_offset 5, -8
movl %esp, %ebp
.cfi_def_cfa_register 5
andl $-16, %esp
subl $32, %esp
movl $0, 24(%esp)
movl $0, 28(%esp)
jmp .L2
.L3:
addl $1, 24(%esp)
addl $1, 28(%esp)
.L2:
cmpl $4, 28(%esp) # here only one condition is checked
jle .L3
movl $.LC0, (%esp)
call printf
movl $0, 24(%esp)
movl $0, 28(%esp)
jmp .L4
.L6:
addl $1, 24(%esp)
addl $1, 28(%esp)
.L4:
cmpl $3, 24(%esp) # first condition
jg .L7
cmpl $4, 28(%esp) # second condition
jle .L6
.L7:
leave
.cfi_restore 5
.cfi_def_cfa 4, 4
ret
.cfi_endproc
.LFE0:
.size main, .-main
.ident "GCC: (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3"
.section .note.GNU-stack,"",@progbits
So it is clear that with two conditions, more comparisons are executed per iteration, which takes more time.
The first option is two ifs; the second option is a mathematical expression and one if, which is usually faster. Here you save one if by doing a calculation, which costs less processor time.
First option: if() && if(); second option: if(() && ())

xorl %eax - Instruction set architecture in IA-32

I am having some difficulty interpreting this exercise.
What exactly does xorl do in this assembly snippet?
C Code:
int i = 0;
if (i >= 55)
    i++;
else
    i--;
Assembly
xorl ____ , %ebx
cmpl ____ , %ebx
Jel .L2
____ %ebx
.L2:
____ %ebx
.L3:
What's happening in the assembly part?
It's probably:
xorl %ebx, %ebx
This is a common idiom for zeroing a register on x86. This would correspond with i = 0 in the C code.
If you are curious "but why?", the short answer is that the xor instruction encodes in fewer bytes than movl $0, %ebx. The long answer includes other, subtler reasons.
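For instance, the two encodings side by side (standard IA-32 opcode bytes):
31 db            xorl %ebx, %ebx   # 2 bytes
bb 00 00 00 00   movl $0, %ebx     # 5 bytes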
I am leaving out the rest of the exercise since there's nothing idiosyncratic left.
This is the completed and commented assembly equivalent to your C code:
xorl %ebx , %ebx ; i = 0
cmpl $54, %ebx
jle .L2 ; if (i <= 54) jump to .L2, otherwise continue with the next instruction (so if i>54... which equals >=55 like in your C code)
addl $2, %ebx ; >54 (or: >=55)
.L2:
decl %ebx ; <=54 (or <55, the else-branch of your if) Note: This code also gets executed if i >= 55, hence why we need +2 above so we only get +1 total
.L3:
So, these are the (arithmetic) instructions that get executed for all numbers >=55:
addl $2, %ebx
decl %ebx
So for numbers >=55, this is equal to incrementing. The following (arithmetic) instructions get executed for numbers <55:
decl %ebx
We jump over the addl $2, %ebx instruction, so for numbers <55 this is equal to decrementing.
In case you're not allowed to type addl $2 into a single blank (since it's not just the instruction but also an operand), there's probably an error in the asm code you've been given (a missing jump to .L3 between lines 4 and 5).
Also note that jel is clearly a typo for jle in the question.
xorl is used to initialize a register to zero, most often for a counter. The code from ccKep is correct, except that it increments by the wrong value, i.e. 2 instead of 1. The correct version is therefore:
xorl %ebx , %ebx # i = 0
cmpl $54, %ebx # compare the two
jle .L2 #if (i <= 54) jump to .L2, otherwise continue with the next instruction (so if i>54... which equals >=55 like in your C code)
incl %ebx #i++
jmp .DONE # jump to exit position
.L2:
decl %ebx # <=54 (or <55)
.DONE:
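A quick way to convince yourself that this matches the C code is to emulate both branches and compare them over a range of inputs (a sketch; the helper functions are hypothetical):
#include <assert.h>

static int asm_version(int i)
{
    if (i <= 54)      /* jle .L2, then decl */
        return i - 1;
    return i + 1;     /* incl, then jmp .DONE */
}

static int c_version(int i)
{
    if (i >= 55) i++; else i--;
    return i;
}

int main(void)
{
    for (int i = 0; i < 200; i++)
        assert(asm_version(i) == c_version(i));
    return 0;
}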

GCC optimization missed opportunity

I'm compiling this C code:
int mode; // use aa if true, else bb
int aa[2];
int bb[2];
inline int auto0() { return mode ? aa[0] : bb[0]; }
inline int auto1() { return mode ? aa[1] : bb[1]; }
int slow() { return auto1() - auto0(); }
int fast() { return mode ? aa[1] - aa[0] : bb[1] - bb[0]; }
Both slow() and fast() functions are meant to do the same thing, though fast() does it with one branch statement instead of two. I wanted to check if GCC would collapse the two branches into one. I've tried this with GCC 4.4 and 4.7, with various levels of optimization such as -O2, -O3, -Os, and -Ofast. It always gives the same strange results:
slow():
movl mode(%rip), %ecx
testl %ecx, %ecx
je .L10
movl aa+4(%rip), %eax
movl aa(%rip), %edx
subl %edx, %eax
ret
.L10:
movl bb+4(%rip), %eax
movl bb(%rip), %edx
subl %edx, %eax
ret
fast():
movl mode(%rip), %esi
testl %esi, %esi
jne .L18
movl bb+4(%rip), %eax
subl bb(%rip), %eax
ret
.L18:
movl aa+4(%rip), %eax
subl aa(%rip), %eax
ret
Indeed, only one branch is generated in each function. However, slow() seems to be inferior in a surprising way: it uses one extra load in each branch, for aa[0] and bb[0]. The fast() code uses them straight from memory in the subls without loading them into a register first. So slow() uses one extra register and one extra instruction per call.
A simple micro-benchmark shows that calling fast() one billion times takes 0.7 seconds, vs. 1.1 seconds for slow(). I'm using a Xeon E5-2690 at 2.9 GHz.
Why should this be? Can you tweak my source code somehow so that GCC does a better job?
Edit: here are the results with clang 4.2 on Mac OS:
slow():
movq _aa@GOTPCREL(%rip), %rax ; rax = &aa
movq _bb@GOTPCREL(%rip), %rcx ; rcx = &bb
movq _mode@GOTPCREL(%rip), %rdx ; rdx = &mode
cmpl $0, (%rdx) ; mode == 0 ?
leaq 4(%rcx), %rdx ; rdx = &bb[1]
cmovneq %rax, %rcx ; if (mode != 0) rcx = &aa, i.e. &xx[0]
leaq 4(%rax), %rax ; rax = &aa[1]
cmoveq %rdx, %rax ; if (mode == 0) rax = &bb[1], i.e. &xx[1]
movl (%rax), %eax ; eax = xx[1]
subl (%rcx), %eax ; eax -= xx[0]
fast():
movq _mode@GOTPCREL(%rip), %rax ; rax = &mode
cmpl $0, (%rax) ; mode == 0 ?
je LBB1_2 ; if (mode != 0) {
movq _aa@GOTPCREL(%rip), %rcx ; rcx = &aa
jmp LBB1_3 ; } else {
LBB1_2: ; // (mode == 0)
movq _bb@GOTPCREL(%rip), %rcx ; rcx = &bb
LBB1_3: ; }
movl 4(%rcx), %eax ; eax = xx[1]
subl (%rcx), %eax ; eax -= xx[0]
Interesting: clang generates branchless conditionals for slow() but one branch for fast()! On the other hand, slow() does three loads (two of which are speculative, one will be unnecessary) vs. two for fast(). The fast() implementation is more "obvious," and as with GCC it's shorter and uses one less register.
GCC 4.7 on Mac OS generally suffers the same issue as on Linux. Yet it uses the same "load 8 bytes then twice extract 4 bytes" pattern as Clang on Mac OS. That's sort of interesting, but not very relevant, as the original issue of emitting subl with two registers rather than one memory and one register is the same on either platform for GCC.
The reason is that in the initial intermediate code, emitted for slow(), the memory load and the subtraction are in different basic blocks:
slow ()
{
int D.1405;
int mode.3;
int D.1402;
int D.1379;
# BLOCK 2 freq:10000
mode.3_5 = mode;
if (mode.3_5 != 0)
goto <bb 3>;
else
goto <bb 4>;
# BLOCK 3 freq:5000
D.1402_6 = aa[1];
D.1405_10 = aa[0];
goto <bb 5>;
# BLOCK 4 freq:5000
D.1402_7 = bb[1];
D.1405_11 = bb[0];
# BLOCK 5 freq:10000
D.1379_3 = D.1402_17 - D.1405_12;
return D.1379_3;
}
whereas in fast() they are in the same basic block:
fast ()
{
int D.1377;
int D.1376;
int D.1374;
int D.1373;
int mode.1;
int D.1368;
# BLOCK 2 freq:10000
mode.1_2 = mode;
if (mode.1_2 != 0)
goto <bb 3>;
else
goto <bb 4>;
# BLOCK 3 freq:3900
D.1373_3 = aa[1];
D.1374_4 = aa[0];
D.1368_5 = D.1373_3 - D.1374_4;
goto <bb 5>;
# BLOCK 4 freq:6100
D.1376_6 = bb[1];
D.1377_7 = bb[0];
D.1368_8 = D.1376_6 - D.1377_7;
# BLOCK 5 freq:10000
return D.1368_1;
}
GCC relies on the instruction-combining pass to handle cases like this (apparently not on the peephole-optimization pass), and combining works within the scope of a single basic block. That's why the subtraction and the load are combined into a single insn in fast(), while in slow() they aren't even considered for combining.
Later, in the basic-block reordering pass, the subtraction in slow() is duplicated and moved into the basic blocks that contain the loads. Now there would be an opportunity for the combiner to, well, combine the load and the subtraction, but unfortunately the combiner pass is not run again (and perhaps it cannot be run that late in the compilation process, with hard registers already allocated).
I don't have an answer as to why GCC is unable to optimize the code the way you want it to, but I have a way to reorganize your code to achieve similar performance. Instead of organizing it as in slow() or fast(), I recommend defining an inline function that returns either aa or bb based on mode, without needing a branch:
inline int * xx () { static int *xx[] = { bb, aa }; return xx[!!mode]; }
inline int kwiky(int *xx) { return xx[1] - xx[0]; }
int kwik() { return kwiky(xx()); }
When compiled by GCC 4.7 with -O3:
movl mode, %edx
xorl %eax, %eax
testl %edx, %edx
setne %al
movl xx.1369(,%eax,4), %edx
movl 4(%edx), %eax
subl (%edx), %eax
ret
With the definition of xx(), you can redefine auto0() and auto1() like so:
inline int auto0() { return xx()[0]; }
inline int auto1() { return xx()[1]; }
And, from this, you should see that slow() now compiles into code similar or identical to kwik().
Have you tried modifying the compiler's internal parameters (--param name=value, see the man page)? These are not changed by any optimization level (with three minor exceptions).
Some of them control code reduction/deduplication.
For some parameters in this section you can read things like "larger values can exponentially increase compilation time".
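For example, with gcc (the parameter names and their defaults vary between versions, so check your compiler's man page before relying on them):
gcc -O2 -funroll-loops --param max-unroll-times=4 --param max-unrolled-insns=200 lala.c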
