Why is the assembly code different for a simple C program with different gcc versions? - c

I'm learning the basics of assembly and C programming.
I compiled the following simple program in C:
#include <stdio.h>
int main()
{
    int a;
    int b;
    a = 10;
    b = 88;
    return 0;
}
I compiled it with the following command:
gcc -ggdb -fno-stack-protector test.c -o test
The disassembled code for the above program with gcc version 4.4.7 is:
55 push %ebp
89 e5 mov %esp,%ebp
83 ec 10 sub $0x10,%esp
c7 45 f8 0a 00 00 00 movl $0xa,-0x8(%ebp)
c7 45 fc 58 00 00 00 movl $0x58,-0x4(%ebp)
b8 00 00 00 00 mov $0x0,%eax
c9 leave
c3 ret
90 nop
However, the disassembled code for the same program with gcc version 4.3.3 is:
8d 4c 24 04 lea 0x4(%esp),%ecx
83 e4 f0 and $0xfffffff0,%esp
ff 71 fc pushl -0x4(%ecx)
55 push %ebp
89 e5 mov %esp,%ebp
51 push %ecx
83 ec 10 sub $0x10,%esp
c7 45 f4 0a 00 00 00 movl $0xa,-0xc(%ebp)
c7 45 f8 58 00 00 00 movl $0x58,-0x8(%ebp)
b8 00 00 00 00 mov $0x0, %eax
83 c4 10 add $0x10,%esp
59 pop %ecx
5d pop %ebp
8d 61 fc lea -0x4(%ecx),%esp
c3 ret
Why is there a difference in the assembly code?
As you can see in the second disassembly, why is %ecx pushed onto the stack?
What is the significance of and $0xfffffff0,%esp?
Note: the OS is the same.

Compilers are not required to produce identical assembly code for the same source code. The C standard allows the compiler to generate any code it likes, as long as the observable behaviour is the same, so different compilers - and different versions of the same compiler - may generate different assembly.
As for your specific questions: and $0xfffffff0,%esp rounds %esp down to a 16-byte boundary, aligning the stack (SSE instructions, for example, require 16-byte alignment). Since %esp is then no longer where the caller left it, %ecx holds a copy of the original stack address and is pushed so the epilogue can restore %esp exactly (pop %ecx; lea -0x4(%ecx),%esp) and return cleanly. The newer GCC simply decided not to emit this realignment code for main.
For your code, GCC 6.2 with -O3 generates just:
xor eax, eax
ret
because your code essentially does nothing. So, it's reduced to a simple return statement.
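One easy way to check this yourself is to ask GCC to emit assembly directly instead of disassembling the binary (-S writes the assembly, -o - prints it to stdout):
gcc -O3 -S -o - test.c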

To give you some idea of how many ways there are to write valid code for a particular task, this example may help.
From time to time there are size-coding competitions, obviously targeting assembly programmers, as a compiler can't compete with hand-written assembly at this level at all.
The competition tasks are fairly trivial, to keep the entry level and total effort reasonable, with precise input and output specifications (down to single-byte or pixel perfection).
So you have an almost trivial, exactly specified task, human-produced code (which at the moment still outperforms compilers on trivial tasks), and a single simple goal: minimal size.
By your logic, it's absolutely clear every competitor should produce the same result.
The real-world answer is, for example:
Hugi Size Coding Competition Series - Compo29 - Random Maze Builder
12 entries, size of code (in bytes): 122, 122, 128, 135, 136, 137, 147, ... 278 (!).
And I bet the first two entries, both 122 bytes, are quite different from each other (too lazy to actually check them).
Producing valid machine code from a high-level language by machine (a compiler) is a far more complex task. Compilers can't compete with humans in reasoning; most of the good code a C++ compiler produces stems from the C++ language itself being defined quite close to machine code (easy to compile) and from brute CPU power that lets the compiler work through thousands of variants for a particular code path, searching for a near-optimal solution largely by brute force.
Still, the numerical "reasoning" behind the optimizers is state of the art in its own way. Compilers have reached a point humans can't match within reasonable effort for a full-sized application, just as humans still beat compilers on small, precisely specified tasks.
Against that background, a few differing helper instructions in an unoptimized prologue/epilogue are nothing to puzzle over. Even when a difference in optimized code is "obvious" to a human, it's quite a feat that the compiler does as well as it does, since it must apply universal rules to specific code without truly understanding the context of the task.

Related

Jump to a label from inline assembly to C

I have written a piece of code in assembly, and at certain points in it I want to jump to a label in C. So I have the following code (a shortened version, but it exhibits the same problem):
#include <stdio.h>
#define JE asm volatile("jmp end");
int main(){
    printf("hi\n");
    JE
    printf("Invisible\n");
end:
    printf("Visible\n");
    return 0;
}
This code compiles, but there is no end label in the disassembled version of the code.
If I change the label name from end to anything else (say l1, both in the asm code (jmp l1) and in the C code), the compiler says:
main.c:(.text+0x6b): undefined reference to `l1'
collect2: error: ld returned 1 exit status
Makefile:2: recipe for target 'main' failed
make: *** [main] Error 1
I have tried different things (different lengths, different cases, upper, lower, etc.) and it only compiles with the end label. And with the end label I get a segmentation fault, because there is no end label in the disassembled version.
Compiled with: gcc -O0 main.c -o main
Disassembled code:
000000000000063a <main>:
63a: 55 push %rbp
63b: 48 89 e5 mov %rsp,%rbp
63e: 48 8d 3d af 00 00 00 lea 0xaf(%rip),%rdi # 6f4 <_IO_stdin_used+0x4>
645: e8 c6 fe ff ff callq 510 <puts@plt>
64a: e9 c9 09 20 00 jmpq 201018 <_end> # there is no _end label!
64f: 48 8d 3d a1 00 00 00 lea 0xa1(%rip),%rdi # 6f7 <_IO_stdin_used+0x7>
656: e8 b5 fe ff ff callq 510 <puts@plt>
65b: 48 8d 3d 9f 00 00 00 lea 0x9f(%rip),%rdi # 701 <_IO_stdin_used+0x11>
662: e8 a9 fe ff ff callq 510 <puts@plt>
667: b8 00 00 00 00 mov $0x0,%eax
66c: 5d pop %rbp
66d: c3 retq
66e: 66 90 xchg %ax,%ax
So, the questions are:
Am I doing something wrong? I have seen this kind of jump (from assembly to C) in code; I can provide example links.
Why can the compiler/linker find end but not l1?
This is what asm goto is for. GCC Inline Assembly: Jump to label outside block
Note that defining a label inside another asm statement will sometimes work (e.g. with optimization disabled) but IS NOT SAFE.
asm("end:"); // BROKEN; NEVER USE
// except for toy experiments to look at compiler output
GNU C does not define the behaviour of jumping from one asm statement to another without asm goto. The compiler is allowed to assume that execution comes out the end of an asm statement and e.g. put a store after it.
The C end: label within a given function won't simply get the asm symbol name end or _end - that wouldn't make sense, because separate C functions are each allowed to have their own end: label. It could have been something like main.end, but it turns out GCC and clang just use their usual auto-numbered labels like .L123.
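For reference, here is a minimal sketch of the asm goto form (available since GCC 4.5); the %l[end] operand names the C label as a possible jump target, so the compiler substitutes whatever internal symbol it actually generated for it:
#include <stdio.h>
int main(void) {
    printf("hi\n");
    /* asm goto tells the compiler this asm may jump to the listed C label,
       so the label is kept and its real (auto-numbered) name is used. */
    asm goto("jmp %l[end]" : /* no outputs allowed */ : /* inputs */ : /* clobbers */ : end);
    printf("Invisible\n");
end:
    printf("Visible\n");
    return 0;
}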
Then how does this code work: https://github.com/IAIK/transientfail/blob/master/pocs/spectre/PHT/sa_oop/main.c
It doesn't; the end label that asm volatile("je end"); references is in the .data section and happens to be defined by the compiler or linker to mark the end of that section.
asm volatile("je end") has no connection to the C label in that function.
I commented out some of the code in other functions so it would compile without the "cacheutils.h" header, but that didn't affect that part of the oop() function; see https://godbolt.org/z/jabYu3 for disassembly of the linked executable (with JE_4k changed to JE_16 so it isn't huge). Because it's a linked executable, you can see the numeric address of je 6010f0 <_end>, while the oop function itself starts at 4006e0 and ends at 400750, so it doesn't contain the branch target.
If this happens to work for Spectre exploits, that's because apparently the branch is never actually taken.

Limit for nested function calls in C (C99)

Is there a limit on nested function calls according to C99?
Example:
result = fn1( fn2( fn3( ... fnN(parN1, parN2) ... ), par2), par1);
NOTE: this code is definitely not good practice because it is hard to manage; however, it is generated automatically from a model, so manageability concerns do not apply.
There is no direct limitation, but a compiler is only required to support certain minimum limits for various categories:
From the C11 standard:
5.2.4.1 Translation limits
1 The implementation shall be able to translate and execute at least one program that contains at least one instance of every one of the following limits: 18)
...
63 nesting levels of parenthesized expressions within a full expression
...
4095 characters in a logical source line
18) Implementations should avoid imposing fixed translation limits whenever possible
No. There is no limit.
As an example, here is a C snippet:
int func1(int a){return a;}
int func2(int a){return a;}
int func3(int a){return a;}
void main()
{
    func1(func2(func3(16)));
}
The corresponding assembly code is:
0000000000000024 <main>:
24: 55 push %rbp
25: 48 89 e5 mov %rsp,%rbp
28: bf 10 00 00 00 mov $0x10,%edi
2d: e8 00 00 00 00 callq 32 <main+0xe>
32: 89 c7 mov %eax,%edi
34: e8 00 00 00 00 callq 39 <main+0x15>
39: 89 c7 mov %eax,%edi
3b: e8 00 00 00 00 callq 40 <main+0x1c>
40: 90 nop
41: 5d pop %rbp
42: c3 retq
The %edi register holds the argument to each call and %eax holds its return value; mov %eax,%edi feeds each result into the next call. As you can see, there are three callq instructions corresponding to the three function calls. In other words, these nested calls are simply executed one by one. There is no need to worry about the stack.
As mentioned in the comments, the compiler may crash when the code nests too deeply.
I wrote a simple Python script to test this:
nest = 64000
funcs = ""
call = ""
for i in range(1, nest + 1):
    funcs += "int func%d(int a){return a;}\n" % i
    call += "func%d(" % i
call += str(1)            # innermost parameter
call += ")" * nest + ";"  # closing parentheses
content = '''
%s
void main()
{
%s
}
''' % (funcs, call)
with open("test.c", "w") as fd:
    fd.write(content)
nest = 64000 is OK, but 640000 will cause gcc-5.real: internal compiler error: Segmentation fault (program cc1).
No. Since these functions are executed one by one, there is no issue.
int res;
res = fnN(parN1, parN2);
....
res = fn2(res, par2);
res = fn1(res, par1);
The execution is linear, with the previous result being used for the next function call.
Edit: As explained in the comments, the parser and/or compiler may still have trouble dealing with such ugly code.
If this is not a purely theoretical question, the answer is probably "Try to rewrite your code so you don't need to do that, because the limit is more than enough for most sane use cases". If this is purely theoretical, or you really do need to worry about this limit and can't just rewrite, read on.
Section 5.2.4 of the C11 standard (the latest draft, which is freely available, is almost identical) specifies various limits on what implementations are required to support. If I'm reading it right, you are only guaranteed 63 levels of nesting.
However, implementations are allowed to support more, and in practice they probably do. I had trouble finding the appropriate documentation for GCC (the closest I found was for expressions in the preprocessor), but I expect it doesn't have a hard limit except for system resources when compiling.

Reading x86 assembly code

I am working through a lab where I have to defuse a "bomb" by providing the correct input for each phase. I do not have access to the source code, so I have to step through the assembly code with GDB. Right now, I'm stuck on phase 2 and would really appreciate some help. Here is the x86 assembly code - I've added some comments that describe what I think is happening, but these could be horribly wrong because we only started learning assembly code a few days ago and I'm still quite shaky. As far as I can tell right now, this phase reads in 6 numbers from the user (that's what read_six_numbers does) and seems to go through some type of loop.
0000000000400f03 <phase_2>:
400f03: 41 55 push %r13 // save values
400f05: 41 54 push %r12
400f07: 55 push %rbp
400f08: 53 push %rbx
400f09: 48 83 ec 28 sub $0x28,%rsp // decrease stack pointer
400f0d: 48 89 e6 mov %rsp,%rsi // move rsp to rsi
400f10: e8 5a 07 00 00 callq 40166f <read_six_numbers> // read in six numbers from the user
400f15: 48 89 e3 mov %rsp,%rbx // move rsp to rbx
400f18: 4c 8d 64 24 0c lea 0xc(%rsp),%r12 // ?
400f1d: bd 00 00 00 00 mov $0x0,%ebp // set ebp to 0?
400f22: 49 89 dd mov %rbx,%r13 // move rbx to r13
400f25: 8b 43 0c mov 0xc(%rbx),%eax // ?
400f28: 39 03 cmp %eax,(%rbx) // compare eax and rbx
400f2a: 74 05 je 400f31 <phase_2+0x2e> // if equal, skip explode
400f2c: e8 1c 07 00 00 callq 40164d <explode_bomb> // bomb detonates (fail)
400f31: 41 03 6d 00 add 0x0(%r13),%ebp // add r13 and ebp (?)
400f35: 48 83 c3 04 add $0x4,%rbx // add 4 to rbx
400f39: 4c 39 e3 cmp %r12,%rbx // compare r12 and rbx
400f3c: 75 e4 jne 400f22 <phase_2+0x1f> // loop? if not equal, jump to 400f22
400f3e: 85 ed test %ebp,%ebp // compare ebp with itself?
400f40: 75 05 jne 400f47 <phase_2+0x44> // skip explosion if not equal
400f42: e8 06 07 00 00 callq 40164d <explode_bomb> // bomb detonates (fail)
400f47: 48 83 c4 28 add $0x28,%rsp
400f4b: 5b pop %rbx
400f4c: 5d pop %rbp
400f4d: 41 5c pop %r12
400f4f: 41 5d pop %r13
400f51: c3 retq
Any help is greatly appreciated - especially advice on how I would go about translating something like this into C code. Thanks in advance!
especially advice on how I would go about translating something like this into C code
Don't literally translate it into C.
Learn to think in terms of how algorithms are implemented in terms of changes to registers and memory. C and asm are just different ways of expressing what you actually want the machine to do.
Every instruction makes a well-defined modification to the architectural state of the machine, so just follow that chain of steps and see the result. Any good debugger (e.g. gdb in layout reg mode) can show you which register was modified as you single-step. The insn ref manual (links in the x86 tag wiki) has full documentation on exactly what every instruction does.
If you're ever surprised by something, look it up. There are many SO questions from people that didn't do that, and then posted silly questions about div results when they didn't set rdx first.
You need to find connections between insns that modify or overwrite a register or memory location, and later instructions that read from that register or memory location.
You can often get clues from how a register is being used, e.g. add $0x4,%rbx is probably a pointer increment to an int *. It's rare to increment a 64bit integer by 4 if it isn't a pointer, or involved in memory addressing somehow.
If you look at surrounding code and find mov 0xc(%rbx),%eax (loading 4B from an offset from %rbx), that confirms the theory that it's a pointer.
The cmp %r12,%rbx / jcc tells you that it's also part of the loop condition, and that %r12 is the end pointer. You check it's just a simple do{}while(p < end) loop by verifying that %r12 isn't modified in the loop, and that it's initialized to something sensible before the loop.
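Putting those observations together, a plausible C shape for this phase would be something like the sketch below (just a sketch - the function and variable names are my guesses, not recovered from the binary):
void phase_2(int nums[6]) {           /* filled in by read_six_numbers */
    int sum = 0;                      /* %ebp */
    int *p = nums;                    /* %rbx */
    int *end = nums + 3;              /* %r12 = 0xc(%rsp) */
    do {
        if (p[0] != p[3])             /* cmp %eax,(%rbx) / je */
            explode_bomb();
        sum += p[0];                  /* add 0x0(%r13),%ebp */
        p++;                          /* add $0x4,%rbx */
    } while (p != end);               /* cmp %r12,%rbx / jne */
    if (sum == 0)                     /* test %ebp,%ebp / jne */
        explode_bomb();
}
That is: the last three numbers must repeat the first three, and the sum of the first three must be non-zero.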
mov $0x0,%ebp tells you that this is compiler output from -O0 or -O1, because every x86 compiler knows the "peephole" optimization that xor %ebp,%ebp is the best way to zero a register. Fortunately this looks like -O1 output, so it doesn't store everything to memory after every C statement and reload it afterwards. -O0 code like that is harder to follow, because a value doesn't stay live in the same register for long.
If you have any specific questions about that binary bomb code, you should ask them. I just answered the "how to read asm" part.

Generating simple shell binary code to be copied to the stack for stack overflow

I am trying to implement a buffer overflow attack, and I need to generate binary instruction code that I can put on the stack to be executed. My problem is that the instructions I'm getting now contain jumps to other parts of the program, which makes the code hard to put on the stack. So I have this simple piece of code (not the code to be exploited) that I want to put on the stack to spawn a new shell:
#include <stdio.h>
#include <unistd.h>  /* declares execve */

int main( ) {
    char *buf[2];
    buf[0] = "/bin/bash";
    buf[1] = NULL;
    execve(buf[0], buf, NULL);
}
The code is being compiled with gcc with the following flags:
CFLAGS = -Wall -Wextra -g -fno-stack-protector -m32 -z execstack
LDFLAGS = -fno-stack-protector -m32 -z execstack
Finally, using objdump -d -S, I get the following code (parts of it) in hex:
....
....
08048320 <execve@plt>:
8048320: ff 25 08 a0 04 08 jmp *0x804a008
8048326: 68 10 00 00 00 push $0x10
804832b: e9 c0 ff ff ff jmp 80482f0 <_init+0x3c>
....
....
int main( ) {
80483e4: 55 push %ebp
80483e5: 89 e5 mov %esp,%ebp
80483e7: 83 e4 f0 and $0xfffffff0,%esp
80483ea: 83 ec 20 sub $0x20,%esp
char *buf[2];
buf[0] = "/bin/bash";
80483ed: c7 44 24 18 f0 84 04 movl $0x80484f0,0x18(%esp)
80483f4: 08
buf[1] = NULL;
80483f5: c7 44 24 1c 00 00 00 movl $0x0,0x1c(%esp)
80483fc: 00
execve(buf[0], buf, NULL);
80483fd: 8b 44 24 18 mov 0x18(%esp),%eax
8048401: c7 44 24 08 00 00 00 movl $0x0,0x8(%esp)
8048408: 00
8048409: 8d 54 24 18 lea 0x18(%esp),%edx
804840d: 89 54 24 04 mov %edx,0x4(%esp)
8048411: 89 04 24 mov %eax,(%esp)
8048414: e8 07 ff ff ff call 8048320 <execve@plt>
}
8048419: c9 leave
804841a: c3 ret
804841b: 90 nop
As you can see, this code is hard to copy onto the stack: execve jumps to a different part of the assembly to be executed. Is there a way to get a program that fits compactly on the stack, without too much space and branching being used?
If you want clean assembly code without writing it yourself, do the following:
Move your code to a separate function.
Pass the -O0 compilation flag to gcc to prevent optimizations.
Compile just an object file: gcc -c [input file] -o [output file].
Follow these steps, then use the assembly code produced by objdump.
Please keep in mind, you have an external dependency for execve:
8048414: e8 07 ff ff ff call 8048320 <execve@plt>
so you must either include its implementation explicitly and remove the call, or know in advance that the process you want to attack has this function in its address space, and patch the call address to match that process's execve address.
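A third option, common in classic shellcode write-ups though not covered above: skip libc entirely and make the execve system call directly, so the code has no external references at all. A minimal 32-bit Linux sketch (syscall number 11 is execve on i386; compile with -m32):
#include <stddef.h>

int main(void) {
    char *argv[2] = { "/bin/bash", NULL };
    /* Invoke the i386 execve syscall via int $0x80:
       %eax = syscall number, %ebx = path, %ecx = argv, %edx = envp */
    asm volatile("int $0x80"
                 : /* no outputs */
                 : "a"(11), "b"(argv[0]), "c"(argv), "d"(NULL));
    return 0;
}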
Welcome to Stack Overflow (pun intended).
This is an unusual request. Are you trying to get hired by the NSA?
Compilers don't structure assembly code in a very human-friendly form, obviously, and their idea of optimization may be performance rather than compactness. You might therefore consider hand-coding it in assembler, using the compiler output as a guideline to achieve the effect you're going for.
What you have there may not be the best representation of what the compiler can do, in any case. Put the code into a function other than main, so you get just the minimal stack setup and argument handling, and try compiling it at all the different optimization levels to see what each produces.
You're probably getting extra setup overhead from putting it in main(), because main is the program's entry point, has to interface with libc and the OS (just guessing), and may be setting up the program's operational context in general (which you could presume has already been done for whichever executable you insert the code into, so it would be redundant).

What's the point of "unlikely()"? [duplicate]

I've been digging through some parts of the Linux kernel, and found calls like this:
if (unlikely(fd < 0))
{
/* Do something */
}
or
if (likely(!err))
{
/* Do something */
}
I've found the definition of them:
#define likely(x) __builtin_expect((x),1)
#define unlikely(x) __builtin_expect((x),0)
I know that they are for optimization, but how do they work? And how much of a performance/size difference can be expected from using them? And is it worth the hassle (and probably losing portability), at least in bottleneck code (in userspace, of course)?
They are hints to the compiler to emit instructions that will cause branch prediction to favour the "likely" side of a jump instruction. This can be a big win: if the prediction is correct, the jump instruction is basically free and takes zero cycles. On the other hand, if the prediction is wrong, the processor pipeline needs to be flushed, which can cost several cycles. As long as the prediction is correct most of the time, this tends to be good for performance.
Like all such performance optimisations, you should only apply it after extensive profiling shows the code really is a bottleneck, and probably, given the micro nature, that it runs in a tight loop. Generally the Linux developers are pretty experienced, so I would imagine they have done that. They don't really care about portability, as they only target gcc, and they have a very precise idea of the assembly they want it to generate.
Let's disassemble to see what GCC 4.8 does with it.
Without __builtin_expect
#include "stdio.h"
#include "time.h"
int main() {
/* Use time to prevent it from being optimized away. */
int i = !time(NULL);
if (i)
printf("%d\n", i);
puts("a");
return 0;
}
Compile and disassemble with GCC 4.8.2 on x86_64 Linux:
gcc -c -O3 -std=gnu11 main.c
objdump -dr main.o
Output:
0000000000000000 <main>:
0: 48 83 ec 08 sub $0x8,%rsp
4: 31 ff xor %edi,%edi
6: e8 00 00 00 00 callq b <main+0xb>
7: R_X86_64_PC32 time-0x4
b: 48 85 c0 test %rax,%rax
e: 75 14 jne 24 <main+0x24>
10: ba 01 00 00 00 mov $0x1,%edx
15: be 00 00 00 00 mov $0x0,%esi
16: R_X86_64_32 .rodata.str1.1
1a: bf 01 00 00 00 mov $0x1,%edi
1f: e8 00 00 00 00 callq 24 <main+0x24>
20: R_X86_64_PC32 __printf_chk-0x4
24: bf 00 00 00 00 mov $0x0,%edi
25: R_X86_64_32 .rodata.str1.1+0x4
29: e8 00 00 00 00 callq 2e <main+0x2e>
2a: R_X86_64_PC32 puts-0x4
2e: 31 c0 xor %eax,%eax
30: 48 83 c4 08 add $0x8,%rsp
34: c3 retq
The instruction order in memory was unchanged: first the printf, then puts, then the retq return.
With __builtin_expect
Now replace if (i) with:
if (__builtin_expect(i, 0))
and we get:
0000000000000000 <main>:
0: 48 83 ec 08 sub $0x8,%rsp
4: 31 ff xor %edi,%edi
6: e8 00 00 00 00 callq b <main+0xb>
7: R_X86_64_PC32 time-0x4
b: 48 85 c0 test %rax,%rax
e: 74 11 je 21 <main+0x21>
10: bf 00 00 00 00 mov $0x0,%edi
11: R_X86_64_32 .rodata.str1.1+0x4
15: e8 00 00 00 00 callq 1a <main+0x1a>
16: R_X86_64_PC32 puts-0x4
1a: 31 c0 xor %eax,%eax
1c: 48 83 c4 08 add $0x8,%rsp
20: c3 retq
21: ba 01 00 00 00 mov $0x1,%edx
26: be 00 00 00 00 mov $0x0,%esi
27: R_X86_64_32 .rodata.str1.1
2b: bf 01 00 00 00 mov $0x1,%edi
30: e8 00 00 00 00 callq 35 <main+0x35>
31: R_X86_64_PC32 __printf_chk-0x4
35: eb d9 jmp 10 <main+0x10>
The printf (compiled to __printf_chk) was moved to the very end of the function, after puts and the return, to improve branch prediction, as mentioned in other answers.
So it is basically the same as:
int main() {
    int i = !time(NULL);
    if (i)
        goto printf;
puts:
    puts("a");
    return 0;
printf:
    printf("%d\n", i);
    goto puts;
}
This optimization was not done with -O0.
But good luck writing an example that actually runs faster with __builtin_expect than without; CPUs are really smart these days. My naive attempts are here.
C++20 [[likely]] and [[unlikely]]
C++20 has standardized those built-ins: How to use C++20's likely/unlikely attribute in if-else statement. They will likely (a pun!) do the same thing.
These are macros that give hints to the compiler about which way a branch is expected to go. The macros expand to GCC-specific extensions, if they're available.
GCC uses these to optimize for branch prediction. For example, if you have something like the following
if (unlikely(x)) {
    dosomething();
}
return x;
Then it can restructure this code to be something more like:
if (!x) {
    return x;
}
dosomething();
return x;
The benefit of this is that when the processor takes a branch the first time, there is significant overhead, because it may have been speculatively loading and executing code further ahead. When it determines it will take the branch, then it has to invalidate that, and start at the branch target.
Most modern processors have some sort of branch prediction, but that only helps when you've been through the branch before and it is still in the branch-prediction cache.
There are a number of other strategies that the compiler and processor can use in these scenarios. You can find more details on how branch predictors work at Wikipedia: http://en.wikipedia.org/wiki/Branch_predictor
They cause the compiler to emit the appropriate branch hints where the hardware supports them. This usually just means twiddling a few bits in the instruction opcode, so code size will not change. The CPU will start fetching instructions from the predicted location, and flush the pipeline and start over if that turns out to be wrong when the branch is reached. In the case where the hint is correct, this will make the branch much faster - precisely how much faster depends on the hardware, and how much it affects the performance of the code depends on what proportion of the time the hint is correct.
For instance, on a PowerPC CPU an unhinted branch might take 16 cycles, a correctly hinted one 8 and an incorrectly hinted one 24. In innermost loops good hinting can make an enormous difference.
Portability isn't really an issue - presumably the definition is in a per-platform header; on platforms that do not support static branch hints you can simply define "likely" and "unlikely" as plain pass-throughs.
long __builtin_expect(long EXP, long C);
This construct tells the compiler that the expression EXP most likely will have the value C. The return value is EXP. __builtin_expect is meant to be used in a conditional expression. In almost all cases it will be used in the context of boolean expressions, in which case it is much more convenient to define two helper macros:
#define unlikely(expr) __builtin_expect(!!(expr), 0)
#define likely(expr) __builtin_expect(!!(expr), 1)
These macros can then be used as in
if (likely(a > 1))
Reference: https://www.akkadia.org/drepper/cpumemory.pdf
(A general comment - other answers cover the details.)
There's no reason you should lose portability by using them.
You always have the option of creating a simple no-effect "inline" function or macro that will allow you to compile on other platforms with other compilers.
You just won't get the benefit of the optimization when you're on other platforms.
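For instance, a minimal sketch of such a fallback (the macro names match the kernel's; the non-GCC branch is a plain pass-through):
#if defined(__GNUC__)
#  define likely(x)   __builtin_expect(!!(x), 1)
#  define unlikely(x) __builtin_expect(!!(x), 0)
#else
/* No-op fallback for compilers without __builtin_expect. */
#  define likely(x)   (x)
#  define unlikely(x) (x)
#endif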
As per the comment by Cody, this has nothing to do with Linux; it is a hint to the compiler. What happens depends on the architecture and compiler version.
This particular feature is somewhat misused in Linux drivers. As osgx points out in semantics of hot attribute, any hot or cold function called within a block can automatically hint that the condition is likely or not. For instance, dump_stack() is marked cold, so this is redundant:
if (unlikely(err)) {
    printk("Driver error found. %d\n", err);
    dump_stack();
}
Future versions of gcc may selectively inline functions based on these hints. There have also been suggestions that it should not be a boolean, but rather a score, as in most likely, etc. Generally, some alternative mechanism like cold should be preferred. There is no reason to use it anywhere but hot paths. What a compiler does on one architecture can be completely different on another.
In many Linux releases you can find compiler.h in /usr/linux/; you can include it and use the macros directly. And another opinion: unlikely() is more useful than likely(), because
if ( likely( ... ) ) {
    doSomething();
}
is a pattern many compilers can already optimize well on their own.
And by the way, if you want to observe the detailed behaviour of the code, you can simply do the following:
gcc -c test.c
objdump -d test.o > obj.s
Then open obj.s and you can find the answer.
They're hints for the compiler to generate the hint prefixes on branches. On x86/x64, they take up one byte, so you'll get at most a one-byte increase for each branch. As for performance, it entirely depends on the application - in most cases, the branch predictor on the processor will ignore them these days.
Edit: I forgot about one place they can actually really help. They can allow the compiler to reorder the control-flow graph to reduce the number of branches taken on the "likely" path. This can be a marked improvement in loops where you're checking multiple exit cases, as sketched below.
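A sketch of that last point (buf, n, err, SENTINEL, and process are hypothetical names; assumes likely/unlikely macros as defined earlier):
/* Both rare exits are hinted cold, so the compiler can lay out the loop
   with the common path falling straight through. */
while (likely(n > 0)) {
    if (unlikely(buf[n - 1] == SENTINEL))
        break;
    if (unlikely(err != 0))
        return -1;
    process(buf[--n]);
}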
These are GCC facilities for the programmer to hint to the compiler which way a branch condition will most likely go in a given expression. This allows the compiler to lay out the branch instructions so that the most common case takes the fewest instructions to execute.
How the branch instructions are laid out depends on the processor architecture.
I am wondering why it's not defined like this:
#define likely(x) __builtin_expect(((x) != 0), 1)
#define unlikely(x) __builtin_expect((x), 0)
I mean, the __builtin_expect docs say the compiler will expect the first parameter to have the value of the second (and the first parameter is returned), but the way the kernel macros above are defined makes them hard to use with things that return non-zero values as "true" (instead of exactly 1).
This might be buggy, but from my code you get the idea of the direction I mean. (Note that the !!(expr) in the macros quoted earlier performs exactly this normalization, mapping any non-zero value to 1.)
