I've been digging through some parts of the Linux kernel, and found calls like this:
if (unlikely(fd < 0))
{
/* Do something */
}
or
if (likely(!err))
{
/* Do something */
}
I've found the definition of them:
#define likely(x) __builtin_expect((x),1)
#define unlikely(x) __builtin_expect((x),0)
I know that they are for optimization, but how do they work? And how much performance/size decrease can be expected from using them? And is it worth the hassle (and losing the portability probably) at least in bottleneck code (in userspace, of course).
They are hints to the compiler to emit instructions that will cause branch prediction to favour the "likely" side of a jump instruction. This can be a big win: if the prediction is correct, the jump instruction is basically free and takes zero cycles. On the other hand, if the prediction is wrong, the processor pipeline needs to be flushed, which can cost several cycles. As long as the prediction is correct most of the time, this tends to be good for performance.
Like all such micro-optimisations, you should only apply it after extensive profiling has shown the code really is in a bottleneck, and probably, given the micro nature, that it is being run in a tight loop. Generally the Linux developers are pretty experienced, so I would imagine they have done that. They don't really care much about portability, as they only target gcc, and they have a very precise idea of the assembly they want it to generate.
Let's decompile to see what GCC 4.8 does with it
Without __builtin_expect
#include <stdio.h>
#include <time.h>
int main() {
/* Use time to prevent it from being optimized away. */
int i = !time(NULL);
if (i)
printf("%d\n", i);
puts("a");
return 0;
}
Compile and decompile with GCC 4.8.2 x86_64 Linux:
gcc -c -O3 -std=gnu11 main.c
objdump -dr main.o
Output:
0000000000000000 <main>:
0: 48 83 ec 08 sub $0x8,%rsp
4: 31 ff xor %edi,%edi
6: e8 00 00 00 00 callq b <main+0xb>
7: R_X86_64_PC32 time-0x4
b: 48 85 c0 test %rax,%rax
e: 75 14 jne 24 <main+0x24>
10: ba 01 00 00 00 mov $0x1,%edx
15: be 00 00 00 00 mov $0x0,%esi
16: R_X86_64_32 .rodata.str1.1
1a: bf 01 00 00 00 mov $0x1,%edi
1f: e8 00 00 00 00 callq 24 <main+0x24>
20: R_X86_64_PC32 __printf_chk-0x4
24: bf 00 00 00 00 mov $0x0,%edi
25: R_X86_64_32 .rodata.str1.1+0x4
29: e8 00 00 00 00 callq 2e <main+0x2e>
2a: R_X86_64_PC32 puts-0x4
2e: 31 c0 xor %eax,%eax
30: 48 83 c4 08 add $0x8,%rsp
34: c3 retq
The instruction order in memory was unchanged: first the printf call, then puts, then the retq return.
With __builtin_expect
Now replace if (i) with:
if (__builtin_expect(i, 0))
and we get:
0000000000000000 <main>:
0: 48 83 ec 08 sub $0x8,%rsp
4: 31 ff xor %edi,%edi
6: e8 00 00 00 00 callq b <main+0xb>
7: R_X86_64_PC32 time-0x4
b: 48 85 c0 test %rax,%rax
e: 74 11 je 21 <main+0x21>
10: bf 00 00 00 00 mov $0x0,%edi
11: R_X86_64_32 .rodata.str1.1+0x4
15: e8 00 00 00 00 callq 1a <main+0x1a>
16: R_X86_64_PC32 puts-0x4
1a: 31 c0 xor %eax,%eax
1c: 48 83 c4 08 add $0x8,%rsp
20: c3 retq
21: ba 01 00 00 00 mov $0x1,%edx
26: be 00 00 00 00 mov $0x0,%esi
27: R_X86_64_32 .rodata.str1.1
2b: bf 01 00 00 00 mov $0x1,%edi
30: e8 00 00 00 00 callq 35 <main+0x35>
31: R_X86_64_PC32 __printf_chk-0x4
35: eb d9 jmp 10 <main+0x10>
The printf (compiled to __printf_chk) was moved to the very end of the function, after puts and the retq, to improve branch prediction, as mentioned in other answers.
So it is basically the same as:
int main() {
int i = !time(NULL);
if (i)
goto printf;
puts:
puts("a");
return 0;
printf:
printf("%d\n", i);
goto puts;
}
This optimization was not done with -O0.
But good luck writing an example that runs faster with __builtin_expect than without; CPUs are really smart these days. My naive attempts are here.
C++20 [[likely]] and [[unlikely]]
C++20 has standardized these hints as attributes: see How to use C++20's likely/unlikely attribute in if-else statement. They will likely (pun intended!) do the same thing.
These are macros that give the compiler hints about which way a branch is likely to go. The macros expand to GCC-specific extensions, if they're available.
GCC uses these to optimize for branch prediction. For example, if you have something like the following:
if (unlikely(x)) {
dosomething();
}
return x;
Then it can restructure this code to be something more like:
if (!x) {
return x;
}
dosomething();
return x;
The benefit of this is that when the processor takes a branch for the first time, there is significant overhead, because it may have been speculatively loading and executing code further ahead. When it determines it will take the branch, it has to invalidate that work and start over at the branch target.
Most modern processors now have some sort of branch prediction, but that only assists when you've been through the branch before, and the branch is still in the branch prediction cache.
There are a number of other strategies that the compiler and processor can use in these scenarios. You can find more details on how branch predictors work at Wikipedia: http://en.wikipedia.org/wiki/Branch_predictor
They cause the compiler to emit the appropriate branch hints where the hardware supports them. This usually just means twiddling a few bits in the instruction opcode, so code size will not change. The CPU will start fetching instructions from the predicted location, and flush the pipeline and start over if that turns out to be wrong when the branch is reached. In the case where the hint is correct, this will make the branch much faster; precisely how much faster will depend on the hardware, and how much this affects the performance of the code will depend on what proportion of the time the hint is correct.
For instance, on a PowerPC CPU an unhinted branch might take 16 cycles, a correctly hinted one 8 and an incorrectly hinted one 24. In innermost loops good hinting can make an enormous difference.
Portability isn't really an issue; presumably the definition is in a per-platform header, and you can simply define likely(x) and unlikely(x) as plain (x) for platforms that do not support static branch hints.
long __builtin_expect(long EXP, long C);
This construct tells the compiler that the expression EXP
most likely will have the value C. The return value is EXP.
__builtin_expect is meant to be used in a conditional
expression. In almost all cases it will be used in the
context of boolean expressions, in which case it is much
more convenient to define two helper macros:
#define unlikely(expr) __builtin_expect(!!(expr), 0)
#define likely(expr) __builtin_expect(!!(expr), 1)
These macros can then be used as in
if (likely(a > 1))
Reference: https://www.akkadia.org/drepper/cpumemory.pdf
(general comment - other answers cover the details)
There's no reason that you should lose portability by using them.
You always have the option of creating a simple nil-effect "inline" or macro that will allow you to compile on other platforms with other compilers.
You just won't get the benefit of the optimization if you're on other platforms.
As per the comment by Cody, this has nothing to do with Linux, but is a hint to the compiler. What happens will depend on the architecture and compiler version.
This particular feature is somewhat mis-used in Linux drivers. As osgx points out in semantics of hot attribute, any hot or cold function called within a block can automatically hint that the condition is likely or not. For instance, dump_stack() is marked cold, so the unlikely() here is redundant:
if(unlikely(err)) {
printk("Driver error found. %d\n", err);
dump_stack();
}
Future versions of gcc may selectively inline a function based on these hints. There have also been suggestions that the hint should not be boolean, but rather a score, as in most likely, etc. Generally, some alternate mechanism like cold should be preferred. There is no reason to use it anywhere but hot paths. What a compiler will do on one architecture can be completely different on another.
In many Linux releases you can find compiler.h in /usr/linux/; you can simply include it to use these macros. Another opinion: unlikely() is more useful than likely(), because
if ( likely( ... ) ) {
doSomething();
}
is something many compilers can already optimize well on their own.
And by the way, if you want to observe the detailed behavior of the code, you can simply do the following:
gcc -c test.c
objdump -d test.o > obj.s
Then open obj.s and you can find the answer.
They're hints to the compiler to generate the hint prefixes on branches. On x86/x64, those prefixes take up one byte, so you'll get at most a one-byte size increase for each branch. As for performance, it entirely depends on the application; in most cases these days, the branch predictor on the processor will simply ignore them.
Edit: I forgot about one place where they can actually really help. They can allow the compiler to reorder the control-flow graph to reduce the number of branches taken on the 'likely' path. This can give a marked improvement in loops where you're checking multiple exit cases.
These are GCC functions by which the programmer gives the compiler a hint about the most likely branch condition in a given expression. This allows the compiler to build the branch instructions so that the most common case takes the fewest instructions to execute.
How the branch instructions are built depends on the processor architecture.
I am wondering why it's not defined like this:
#define likely(x) __builtin_expect(((x) != 0),1)
#define unlikely(x) __builtin_expect((x),0)
I mean, the __builtin_expect docs say the compiler will expect the first parameter to have the value of the second one (and the first parameter is returned), but the way the macro is defined above makes it hard to use for expressions that yield non-zero values as "true" (rather than the value 1).
This might be buggy, but from my code you get the idea of the direction I mean.
Related
I have this code in C:
int main(void)
{
int a = 1 + 2;
return 0;
}
When I run objdump -x86-asm-syntax=intel -d a.out on the binary compiled with the -O0 flag using GCC 9.3.0_1, I get:
0000000100000f9e _main:
100000f9e: 55 push rbp
100000f9f: 48 89 e5 mov rbp, rsp
100000fa2: c7 45 fc 03 00 00 00 mov dword ptr [rbp - 4], 3
100000fa9: b8 00 00 00 00 mov eax, 0
100000fae: 5d pop rbp
100000faf: c3 ret
and with -O1 flag:
0000000100000fc2 _main:
100000fc2: b8 00 00 00 00 mov eax, 0
100000fc7: c3 ret
which removes the unused variable a and stack managing altogether.
However, when I use Apple clang version 11.0.3 with -O0 and -O1, I get
0000000100000fa0 _main:
100000fa0: 55 push rbp
100000fa1: 48 89 e5 mov rbp, rsp
100000fa4: 31 c0 xor eax, eax
100000fa6: c7 45 fc 00 00 00 00 mov dword ptr [rbp - 4], 0
100000fad: c7 45 f8 03 00 00 00 mov dword ptr [rbp - 8], 3
100000fb4: 5d pop rbp
100000fb5: c3 ret
and
0000000100000fb0 _main:
100000fb0: 55 push rbp
100000fb1: 48 89 e5 mov rbp, rsp
100000fb4: 31 c0 xor eax, eax
100000fb6: 5d pop rbp
100000fb7: c3 ret
respectively.
I never get the stack-management part stripped off as with GCC.
Why does (Apple) Clang keep the unnecessary push and pop?
This may or may not be a separate question, but with the following code:
int main(void)
{
// return 0;
}
GCC creates the same assembly with or without the return 0;.
However, Clang -O0 leaves this extra
100000fa6: c7 45 fc 00 00 00 00 mov dword ptr [rbp - 4], 0
when there is return 0;.
Why does Clang keep this (probably) redundant assembly?
I suspect you were trying to see the addition happen.
int main(void)
{
int a = 1 + 2;
return 0;
}
but with optimization, say -O2, your dead code went away:
00000000 <main>:
0: 2000 movs r0, #0
2: 4770 bx lr
The variable a is local; it never leaves the function and does not rely on anything outside the function (globals, input variables, return values from called functions, etc.). So it has no functional purpose: it is dead code, it doesn't do anything, and an optimizer is free to remove it, and it did.
So I assume you then used no or less optimization and saw the output was too verbose.
00000000 <main>:
0: cf 93 push r28
2: df 93 push r29
4: 00 d0 rcall .+0 ; 0x6 <main+0x6>
6: cd b7 in r28, 0x3d ; 61
8: de b7 in r29, 0x3e ; 62
a: 83 e0 ldi r24, 0x03 ; 3
c: 90 e0 ldi r25, 0x00 ; 0
e: 9a 83 std Y+2, r25 ; 0x02
10: 89 83 std Y+1, r24 ; 0x01
12: 80 e0 ldi r24, 0x00 ; 0
14: 90 e0 ldi r25, 0x00 ; 0
16: 0f 90 pop r0
18: 0f 90 pop r0
1a: df 91 pop r29
1c: cf 91 pop r28
1e: 08 95 ret
If you want to see the addition happen, then first off don't use main(); it has baggage, and the baggage varies among toolchains. So try something else:
unsigned int fun ( unsigned int a, unsigned int b )
{
return(a+b);
}
Now the addition relies on external items, so the compiler cannot optimize any of this away:
00000000 <_fun>:
0: 1d80 0002 mov 2(sp), r0
4: 6d80 0004 add 4(sp), r0
8: 0087 rts pc
If we want to figure out which one is a and which one is b, then:
unsigned int fun ( unsigned int a, unsigned int b )
{
return(a+(b<<1));
}
00000000 <_fun>:
0: 1d80 0004 mov 4(sp), r0
4: 0cc0 asl r0
6: 6d80 0002 add 2(sp), r0
a: 0087 rts pc
Want to see an immediate value
unsigned int fun ( unsigned int a )
{
return(a+0x321);
}
00000000 <fun>:
0: 8b 44 24 04 mov eax,DWORD PTR [esp+0x4]
4: 05 21 03 00 00 add eax,0x321
9: c3 ret
You can figure out what the compiler's return address convention is, etc.
But you will hit limits trying to get the compiler to do things for you while learning asm. Likewise, you can easily take the code generated by these compilations (using -save-temps or -S, or disassemble and type it in; I prefer the latter), but you can only get so far on your operating system with high-level/C-callable functions. Eventually you will want to do something bare-metal (on a simulator at first) to get maximum freedom, and to try instructions you can't normally try, or to try them in ways that are hard, or that you don't yet quite understand how to use, within the confines of an operating system function call. (Please do not use inline assembly until much later, or never; use real assembly, and ideally the assembler, not the compiler, to assemble it. Down the road you can then try those things.)
One of the compilers was built for, or defaults to, using a stack frame, so you need to tell it to omit the frame pointer: -fomit-frame-pointer. Note that either compiler can be built to default to not having a frame pointer.
../gcc-$GCCVER/configure --target=$TARGET --prefix=$PREFIX --without-headers --with-newlib --with-gnu-as --with-gnu-ld --enable-languages='c' --enable-frame-pointer=no
(Don't assume gcc nor clang/llvm have a "standard" build as they are both customizable and the binary you downloaded has someone's opinion of the standard build)
You are using main(), which has the return-0-or-not issue and can/will carry other baggage, depending on the compiler and settings. Using something other than main() gives you the freedom to pick your inputs and outputs without warnings about not conforming to the short list of allowed signatures for main().
For gcc, -O0 is ideally no optimization, although sometimes you still see some. -O3 is max, give me all you've got. -O2 is historically where folks live, if for no other reason than "I did it because everyone else is doing it". -O1 is no man's land for gnu: it has some items not in -O0 but not a lot of the good ones from -O2, so it depends heavily on your code whether you land in one or some of the optimizations associated with -O1. These numbered optimization levels, if your compiler even has a -O option, are just pre-defined lists: 0 means this list, 1 means that list, and so on.
There is no reason to expect any two compilers, or the same compiler with different options, to produce the same code from the same sources. If two competing compilers were able to do that most if not all of the time, something very fishy would be going on. Likewise there is no reason to expect the list of optimizations each compiler supports, or what each optimization does, to match, much less the -O1 lists to match between them, and so on.
There is no reason to assume that any two compilers or versions conform to the same calling convention for the same target. It is much more common now for the processor vendor to publish a recommended calling convention and for the competing compilers to conform to it, because why not: everyone else is doing it, and even better, you don't have to design one yourself, and if it fails you can blame them.
There are a lot of implementation-defined areas in C in particular, less so in C++, but still. So your expectations of what comes out, and comparisons between compilers, may differ for this reason as well. Just because one compiler implements some code in some way doesn't mean that is how the language works; sometimes it is how the compiler authors interpreted the language spec, or had wiggle room.
Even with full optimizations enabled, everything the compiler has to offer, there is no reason to assume a compiler will always outperform a human. It's an algorithm with limits, programmed by humans. With experience it is not hard to examine the output of a compiler, for simple functions but often for larger ones too, and find missed optimizations or other things that could have been done "better" for some opinion of "better". And sometimes you find the compiler just left something in that you think it should have removed, and sometimes you are right.
There is education, as shown above, in using a compiler to start learning assembly language. Even with decades of experience dabbling with dozens of assembly languages/instruction sets, if there is a debugged compiler available I will very often start by disassembling simple functions to begin learning a new instruction set, then look those instructions up, and from what I find there start to get a feel for how to use it.
Very often starting with this one first:
unsigned int fun ( unsigned int a )
{
return(a+5);
}
or
unsigned int fun ( unsigned int a, unsigned int b )
{
return(a+b);
}
And going from there. Likewise, when writing a disassembler or a simulator for fun to learn an instruction set, I often rely on an existing assembler, since the documentation for a processor is often lacking; the first assembler and compiler for a processor are very often written with direct access to the silicon folks, and those that follow can use the existing tools as well as the documentation to figure things out.
So you are on a good path to start learning assembly language. I have strong opinions on which ones to start with (or not) to improve the experience and the chances of success, but I have been in too many battles on Stack Overflow this week, so I'll let that go. You can see that I chose an array of instruction sets in this answer, and even if you don't know them, you can probably figure out what the code is doing. "Standard" installs of llvm provide the ability to output assembly language for several instruction sets from the same source code. The gnu approach is that you pick the target (family) when you build the toolchain, and that compiled toolchain is limited to that target/family, but you can easily install several gnu toolchains on your computer at the same time, be they variations on defaults/settings for the same target, or different targets. A number of these are apt-gettable without having to learn to build the tools: arm, avr, msp430, x86, and perhaps some others.
I cannot speak to why it does not return zero from main when you didn't actually have a return statement; see the comments by others and read up on the spec for that language (or ask that as a separate question, or see if it was already answered).
Now, you said Apple clang; I'm not sure what that reference was to. I know that Apple has put a lot of work into llvm in general, and maybe you are on a Mac or in an Apple-supplied/suggested development environment, but check Wikipedia and others: clang had a lot of corporate help, not just Apple. If you are on an Apple computer then the apt-gettable toolchains aren't going to make sense, but there are still lots of pre-built gnu (and llvm) based toolchains you can download and install, rather than attempting to build the toolchain from sources (which isn't difficult, BTW).
Is there a limit on the nesting of function calls according to C99?
Example:
result = fn1( fn2( fn3( ... fnN(parN1, parN2) ... ), par2), par1);
NOTE: this code is definitely not good practice because it is hard to manage; however, this code is generated automatically from a model, so manageability issues do not apply.
There is no direct limitation, but a compiler is only required to support certain minimum limits for various categories:
From the C11 standard:
5.2.4.1 Translation limits
1 The implementation shall be able to translate and execute at least one program that contains at least one instance of every one of the following limits: 18)
...
63 nesting levels of parenthesized expressions within a full expression
...
4095 characters in a logical source line
18) Implementations should avoid imposing fixed translation limits whenever possible
No. There is no limit.
As an example, here is a C snippet:
int func1(int a){return a;}
int func2(int a){return a;}
int func3(int a){return a;}
void main()
{
func1(func2(func3(16)));
}
The corresponding assembly code is:
0000000000000024 <main>:
24: 55 push %rbp
25: 48 89 e5 mov %rsp,%rbp
28: bf 10 00 00 00 mov $0x10,%edi
2d: e8 00 00 00 00 callq 32 <main+0xe>
32: 89 c7 mov %eax,%edi
34: e8 00 00 00 00 callq 39 <main+0x15>
39: 89 c7 mov %eax,%edi
3b: e8 00 00 00 00 callq 40 <main+0x1c>
40: 90 nop
41: 5d pop %rbp
42: c3 retq
The %eax register holds each function's return value, and the %edi register holds the argument; on each iteration, mov %eax,%edi forwards the previous result as the next argument. As you can see, there are three callq instructions, which correspond to the three function calls. In other words, these nested functions are called one by one. There is no need to worry about the stack.
As mentioned in the comments, the compiler may crash when the calls nest too deeply.
I wrote a simple Python script to test this:
nest = 64000
funcs = ""
call = ""
for i in range(1, nest + 1):
    funcs += "int func%d(int a){return a;}\n" % i
    call += "func%d(" % i
call += str(1)            # innermost argument
call += ")" * nest + ";"  # closing parentheses
content = '''
%s
void main()
{
%s
}
''' %(funcs, call)
with open("test.c", "w") as fd:
fd.write(content)
nest = 64000 is OK, but 640000 causes gcc-5.real: internal compiler error: Segmentation fault (program cc1).
No. Since these calls are executed one after another, there is no issue; the code is equivalent to:
int res;
res = fnN(parN1, parN2);
....
res = fn2(res, par2);
res = fn1(res, par1);
The execution is linear, with each previous result being used for the next function call.
Edit: As explained in the comments, there might be a problem with the parser and/or compiler when dealing with such deeply nested code.
If this is not a purely theoretical question, the answer is probably "Try to rewrite your code so you don't need to do that, because the limit is more than enough for most sane use cases". If this is purely theoretical, or you really do need to worry about this limit and can't just rewrite, read on.
Section 5.2.4 of the C11 standard (latest draft, which is freely available and almost identical) specifies various limits on what implementations are required to support. If I'm reading that right, you can go up to 63 levels of nesting.
However, implementations are allowed to support more, and in practice they probably do. I had trouble finding the appropriate documentation for GCC (the closest I found was for expressions in the preprocessor), but I expect it doesn't have a hard limit except for system resources when compiling.
I'm learning the basics of assembly and C programming.
I compiled the following simple program in C,
#include <stdio.h>
int main()
{
int a;
int b;
a = 10;
b = 88;
return 0;
}
Compiled with following command,
gcc -ggdb -fno-stack-protector test.c -o test
The disassembled code for above program with gcc version 4.4.7 is:
55 push %ebp
89 e5 mov %esp,%ebp
83 ec 10 sub $0x10,%esp
c7 45 f8 0a 00 00 00 movl $0xa,-0x8(%ebp)
c7 45 fc 58 00 00 00 movl $0x58,-0x4(%ebp)
b8 00 00 00 00 mov $0x0,%eax
c9 leave
c3 ret
90 nop
However disassembled code for same program with gcc version 4.3.3 is:
8d 4c 23 04 lea 0x4(%esp), %ecx
83 e4 f0 and $0xfffffff0, %esp
55 push -0x4(%ecx)
89 e5 mov %esp,%ebp
51 push %ecx
83 ec 10 sub $0x10,%esp
c7 45 f4 0a 00 00 00 movl $0xa, -0xc(%ebp)
c7 45 f8 58 00 00 00 movl $0x58, -0x8(%ebp)
b8 00 00 00 00 mov $0x0, %eax
83 c4 10 add $0x10,%esp
59 pop %ecx
5d pop %ebp
8d 61 fc lea -0x4(%ecx),%esp
c3 ret
Why is there a difference in the assembly code?
As you can see in the second disassembly, why is %ecx pushed on the stack?
What is the significance of and $0xfffffff0, %esp?
note: OS is same
Compilers are not required to produce identical assembly code for the same source code. The C standard allows the compiler to optimize the code as they see fit as long as the observable behaviour is the same. So, different compilers may generate different assembly code.
For your code, GCC 6.2 with -O3 generates just:
xor eax, eax
ret
because your code essentially does nothing, so it is reduced to a simple return statement.
To give you some idea of how many ways exist to create valid code for a particular task, this example may help.
From time to time there are size-coding competitions, obviously targeting assembly programmers, as at this level you can't compete against hand-written assembly with a compiler at all.
The competition tasks are fairly trivial, to keep the entry barrier and total effort reasonable, with precise input and output specifications (down to single-byte or pixel perfection).
So you have an almost trivial, exact task, human-produced code (at the moment still outperforming compilers on trivial tasks), and the single simple rule "minimal size" as the goal.
By your logic it would be absolutely clear that every competitor should produce the same result.
The real world answer to this is for example:
Hugi Size Coding Competition Series - Compo29 - Random Maze Builder
12 entries, size of code (in bytes): 122, 122, 128, 135, 136, 137, 147, ... 278 (!).
And I bet the first two entries, both 122 bytes, are still different enough (too lazy to actually check them).
Now producing valid machine code from a high-level programming language, by machine (a compiler), is a much more complex task. Compilers can't compete with humans in reasoning; much of "how good is the code produced by a C++ compiler" stems from the C++ language itself being defined quite close to machine code (easy to compile) and from brute CPU power allowing compilers to work through thousands of variants for a particular code path, searching for a near-optimal solution mostly by brute force.
Still, the numerical "reasoning" behind the optimizers is state of the art in its own way, reaching a level humans can't match within reasonable effort, just as humans can't achieve the efficiency of compilers for a full-sized app compilation.
At this point, reasoning about some debug code differing in a few helper prologue/epilogue instructions misses the bigger picture. Even if you found a difference in optimized code, and the difference were "obvious" to a human, it's still quite a feat that the compiler produces what it does, as it has to apply universal rules to specific code without truly understanding the context of the task.
I'm currently reading through Robert Love's Linux Kernel Development, Third Edition and I encountered an odd statement after a section explaining the set_bit(), clear_bit() atomic functions and their non-atomic siblings, __set_bit() and __clear_bit():
Unlike the atomic integer operations, code typically has no choice whether to use the bitwise operations—they are the only portable way to set a specific bit.
-p. 183 (emphasis my own)
I understand that these operations can be implemented in a single platform-specific assembly instruction, which is why these inline functions exist. But I'm curious as to why the author said that these are the only portable ways to do these things. For instance, I believe I could non-atomically set bit nr in unsigned long x by doing this in plain C:
x |= 1UL << nr;
Similarly I can non-atomically clear bit nr in unsigned long x by doing this:
x &= ~(1UL << nr);
Of course, depending on the sophistication of the compiler, these may compile to several instructions, and so they may not be as nice as the __set_bit() and __clear_bit() functions.
Am I missing something here? Was this phrase just a slightly lazy simplification, or is there something unportable about the ways I've presented above for setting and clearing bits?
Edit: It appears that, although GCC is pretty sophisticated, it still performs the bit shifts instead of using a single instruction like the __set_bit() function does, even on -O3 (version 6.2.1). As an example:
stephen at greed in ~/code
$ gcc -g -c set.c -O3
stephen at greed in ~/code
$ objdump -d -M intel -S set.o
set.o: file format elf64-x86-64
Disassembly of section .text.startup:
0000000000000000 <main>:
#include<stdio.h>
int main(int argc, char *argv)
{
unsigned long x = 0;
x |= (1UL << argc);
0: 89 f9 mov ecx,edi
2: be 01 00 00 00 mov esi,0x1
7: 48 83 ec 08 sub rsp,0x8
b: 48 d3 e6 shl rsi,cl
e: bf 00 00 00 00 mov edi,0x0
13: 31 c0 xor eax,eax
15: e8 00 00 00 00 call 1a <main+0x1a>
{
1a: 31 c0 xor eax,eax
x |= (1UL << argc);
1c: 48 83 c4 08 add rsp,0x8
printf("x=%x\n", x);
20: c3 ret
In the context of atomic integer operations, "bitwise operations" also means the atomic ones.
There is nothing special about the non-atomic bitwise operations (except that they support bit numbers larger than BITS_PER_LONG), so the generic implementation is always correct; architecture-specific implementations are needed only to optimize performance.
This question already exists:
Why are registers needed (why not only use memory)? [duplicate]
Closed 1 year ago.
Hi, I have been reading this kind of thing in various docs:
register
Tells the compiler to store the variable being declared in a CPU register.
In standard C dialects, keyword register uses the following syntax:
register data-definition;
The register type modifier tells the compiler to store the variable being declared in a CPU register (if possible), to optimize access. For example,
register int i;
Note that TIGCC will automatically store often used variables in CPU registers when the optimization is turned on, but the keyword register will force storing in registers even if the optimization is turned off. However, the request for storing data in registers may be denied, if the compiler concludes that there is not enough free registers for use at this place.
http://tigcc.ticalc.org/doc/keywords.html#register
My point is not only about register. My point is: why would a compiler decide where variables are stored at all? The compiler's business is just to compile and generate an object file; the actual memory allocation happens at run time. Why does the compiler get into this business? I mean, without running the object file, does memory allocation happen in C just by compiling the file?
The compiler generates machine code, and that machine code is what runs your program. The compiler decides what machine code to generate, and in doing so makes decisions about what sort of allocation will happen at runtime. It's not executing your code when you type gcc foo.c; later, when you run the executable, it's the code GCC generated that is running.
The compiler wants to generate the fastest code possible, so it makes as many decisions as it can at compile time, and this includes how to allocate things.
The compiler doesn't run the code (unless it does a few rounds for profiling and better code generation), but it has to prepare it. This includes deciding how to keep the variables your program defines: whether to use fast and efficient storage such as registers, or the slower (and more prone to side effects) memory.
Initially, your local variables are simply assigned locations in the stack frame (except, of course, for memory you explicitly allocate dynamically). If your function declares an int, the compiler will likely grow the stack by a few additional bytes, use that memory address for storing the variable, and pass it as an operand to any operation your code performs on that variable.
However, since memory is slower (even when cached) and manipulating it places more restrictions on the CPU, at a later stage the compiler may decide to move some variables into registers. This allocation is done through a complicated algorithm that tries to select the most reused and latency-critical variables that can fit within the architecture's limited number of logical registers (while complying with various restrictions, such as some instructions requiring an operand to be in a particular register).
There's another complication: some memory addresses may alias external pointers in ways unknown at compilation time, in which case the compiler cannot move them into registers. Compilers are usually a very cautious bunch, and most of them avoid dangerous optimizations (otherwise they'd need to add special checks to avoid nasty surprises).
After all that, the compiler is still polite enough to let you advise it which variable is important and critical to you, in case it missed it. By marking a variable with the register keyword, you're asking it to attempt to optimize that variable by keeping it in a register, given that enough registers are available and no aliasing is possible.
Here's a little example: Take the following code, doing the same thing twice but with slightly different circumstances:
#include "stdio.h"
int j;
int main() {
int i;
for (i = 0; i < 100; ++i) {
printf ("i'm here to prevent the loop from being optimized\n");
}
for (j = 0; j < 100; ++j) {
printf ("me too\n");
}
}
Note that i is local and j is global (and therefore the compiler doesn't know whether anyone else might access it during the run).
Compiling in gcc with -O3 produces the following code for main:
0000000000400540 <main>:
400540: 53 push %rbx
400541: bf 88 06 40 00 mov $0x400688,%edi
400546: bb 01 00 00 00 mov $0x1,%ebx
40054b: e8 18 ff ff ff callq 400468 <puts@plt>
400550: bf 88 06 40 00 mov $0x400688,%edi
400555: 83 c3 01 add $0x1,%ebx # <-- i++
400558: e8 0b ff ff ff callq 400468 <puts@plt>
40055d: 83 fb 64 cmp $0x64,%ebx
400560: 75 ee jne 400550 <main+0x10>
400562: c7 05 80 04 10 00 00 movl $0x0,1049728(%rip) # 5009ec <j>
400569: 00 00 00
40056c: bf c0 06 40 00 mov $0x4006c0,%edi
400571: e8 f2 fe ff ff callq 400468 <puts@plt>
400576: 8b 05 70 04 10 00 mov 1049712(%rip),%eax # 5009ec <j> (loads j)
40057c: 83 c0 01 add $0x1,%eax # <-- j++
40057f: 83 f8 63 cmp $0x63,%eax
400582: 89 05 64 04 10 00 mov %eax,1049700(%rip) # 5009ec <j> (stores j back)
400588: 7e e2 jle 40056c <main+0x2c>
40058a: 5b pop %rbx
40058b: c3 retq
As you can see, the first loop counter sits in ebx, and is incremented on each iteration and compared against the limit.
The second loop, however, was the dangerous one, and gcc decided to pass the index counter through memory (loading it into eax on every iteration). This example serves to show how much better off you can be using registers, as well as how sometimes you can't.
The compiler needs to translate the code into machine instructions and tell the computer how to run it. That includes how to perform operations (like multiplying two numbers) and where to store the data (stack, heap, or register).