Most code I have ever read uses a int for standard error handling (return values from functions and such). But I am wondering if there is any benefit to be had from using a uint_8 will a compiler -- read: most C compilers on most architectures -- produce instructions using the immediate address mode -- i.e., embed the 1-byte integer into the instruction ? The key instruction I'm thinking about is the compare after a function, using uint_8 as its return type, returns.
I could be thinking about things incorrectly, as introducing a 1 byte type just causes alignment issues -- there is probably a perfectly sane reason why compiles like to pack things in 4-bytes and this is possibly the reason everyone just uses ints -- and since this is stack related issue rather than the heap there is no real overhead.
Doing the right thing is what I'm thinking about. But lets say say for the sake of argument this is a popular cheap microprocessor for a intelligent watch and that it is configured with 1k of memory but does have different addressing modes in its instruction set :D
Another question to slightly specialize the discussion (x86) would be: is the literal in:
uint_32 x=func(); x==1;
and
uint_8 x=func(); x==1;
the same type ? or will the compiler generate a 8-byte literal in the second case. If so it may use it to generate a compare instruction which has the literal as an immediate value and the returned int as a register reference. See CMP instruction types..
Another Refference for the x86 Instruction Set.
Here's what one particular compiler will do for the following code:
extern int foo(void) ;
void bar(void)
{
if(foo() == 31) { //error code 31
do_something();
} else {
do_somehing_else();
}
}
0: 55 push %ebp
1: 89 e5 mov %esp,%ebp
3: 83 ec 08 sub $0x8,%esp
6: e8 fc ff ff ff call 7 <bar+0x7>
b: 83 f8 1f cmp $0x1f,%eax
e: 74 08 je 18 <bar+0x18>
10: c9 leave
11: e9 fc ff ff ff jmp 12 <bar+0x12>
16: 89 f6 mov %esi,%esi
18: c9 leave
19: e9 fc ff ff ff jmp 1a <bar+0x1a>
a 3 byte instruction for the cmp. if foo() returns a char , we get
b: 3c 1f cmp $0x1f,%al
If you're looking for efficiency though. Don't assume comparing stuff in %a1 is faster than comparing with %eax
There may be very small speed differences between the different integral types on a particular architecture. But you can't rely on it, it may change if you move to different hardware, and it may even run slower if you upgrade to newer hardware.
And if you talk about x86 in the example you are giving, you make a false assumption: An immediate needs to be of type uint8_t.
Actually 8-bit immediates embedded into the instruction are of type int8_t and can be used with bytes, words, dwords and qwords, in C notation: char, short, int and long long.
So on this architecture there would be no benefit at all, neither code size nor execution speed.
You should use int or unsigned int types for your calculations. Using smaller types only for compounds (structs/arrays). The reason for that is that int is normally defined to be the "most natural" integral type for the processor, all other derived type may necessitate processing to work correctly. We had in our project compiled with gcc on Solaris for SPARC the case that accesses to 8 and 16 bit variable added an instruction to the code. When loading a smaller type from memory it had to make sure the upper part of the register was properly set (sign extension for signed type or 0 for unsigned). This made the code longer and increased pressure on the registers, which deteriorated the other optimisations.
I've got a concrete example:
I declared two variable of a struct as uint8_t and got that code in Sparc Asm:
if(p->BQ > p->AQ)
was translated in
ldub [%l1+165], %o5 ! <variable>.BQ,
ldub [%l1+166], %g5 ! <variable>.AQ,
and %o5, 0xff, %g4 ! <variable>.BQ, <variable>.BQ
and %g5, 0xff, %l0 ! <variable>.AQ, <variable>.AQ
cmp %g4, %l0 ! <variable>.BQ, <variable>.AQ
bleu,a,pt %icc, .LL586 !
And here what I got when I declared the two variables as uint_t
lduw [%l1+168], %g1 ! <variable>.BQ,
lduw [%l1+172], %g4 ! <variable>.AQ,
cmp %g1, %g4 ! <variable>.BQ, <variable>.AQ
bleu,a,pt %icc, .LL587 !
Two arithmetic operations less and 2 registers more for other stuff
Processors typically likes to work with their natural register sizes, which in C is 'int'.
Although there are exceptions, you're thinking too much on a problem that does not exist.
Related
I have this code in C:
int main(void)
{
int a = 1 + 2;
return 0;
}
When I objdump -x86-asm-syntax=intel -d a.out which is compiled with -O0 flag with GCC 9.3.0_1, I get:
0000000100000f9e _main:
100000f9e: 55 push rbp
100000f9f: 48 89 e5 mov rbp, rsp
100000fa2: c7 45 fc 03 00 00 00 mov dword ptr [rbp - 4], 3
100000fa9: b8 00 00 00 00 mov eax, 0
100000fae: 5d pop rbp
100000faf: c3 ret
and with -O1 flag:
0000000100000fc2 _main:
100000fc2: b8 00 00 00 00 mov eax, 0
100000fc7: c3 ret
which removes the unused variable a and stack managing altogether.
However, when I use Apple clang version 11.0.3 with -O0 and -O1, I get
0000000100000fa0 _main:
100000fa0: 55 push rbp
100000fa1: 48 89 e5 mov rbp, rsp
100000fa4: 31 c0 xor eax, eax
100000fa6: c7 45 fc 00 00 00 00 mov dword ptr [rbp - 4], 0
100000fad: c7 45 f8 03 00 00 00 mov dword ptr [rbp - 8], 3
100000fb4: 5d pop rbp
100000fb5: c3 ret
and
0000000100000fb0 _main:
100000fb0: 55 push rbp
100000fb1: 48 89 e5 mov rbp, rsp
100000fb4: 31 c0 xor eax, eax
100000fb6: 5d pop rbp
100000fb7: c3 ret
respectively.
I never get the stack managing part stripped off as in GCC.
Why does (Apple) Clang keep unnecessary push and pop?
This may or may not be a separate question, but with the following code:
int main(void)
{
// return 0;
}
GCC creates a same ASM with or without the return 0;.
However, Clang -O0 leaves this extra
100000fa6: c7 45 fc 00 00 00 00 mov dword ptr [rbp - 4], 0
when there is return 0;.
Why does Clang keep these (probably) redundant ASM codes?
I suspect you were trying to see the addition happen.
int main(void)
{
int a = 1 + 2;
return 0;
}
but with optimization say -O2, your dead code went away
00000000 <main>:
0: 2000 movs r0, #0
2: 4770 bx lr
The variable a is local, it never leaves the function it does not rely on anything outside of the function (globals, input variables, return values from called functions, etc). So it has no functional purpose it is dead code it doesn't do anything so an optimizer is free to remove it and did.
So I assume you went to use no or less optimization and then saw it was too verbose.
00000000 <main>:
0: cf 93 push r28
2: df 93 push r29
4: 00 d0 rcall .+0 ; 0x6 <main+0x6>
6: cd b7 in r28, 0x3d ; 61
8: de b7 in r29, 0x3e ; 62
a: 83 e0 ldi r24, 0x03 ; 3
c: 90 e0 ldi r25, 0x00 ; 0
e: 9a 83 std Y+2, r25 ; 0x02
10: 89 83 std Y+1, r24 ; 0x01
12: 80 e0 ldi r24, 0x00 ; 0
14: 90 e0 ldi r25, 0x00 ; 0
16: 0f 90 pop r0
18: 0f 90 pop r0
1a: df 91 pop r29
1c: cf 91 pop r28
1e: 08 95 ret
If you want to see addition happen instead first off don't use main() it has baggage, and the baggage varies among toolchains. So try something else
unsigned int fun ( unsigned int a, unsigned int b )
{
return(a+b);
}
now the addition relies on external items so the compiler cannot optimize any of this away.
00000000 <_fun>:
0: 1d80 0002 mov 2(sp), r0
4: 6d80 0004 add 4(sp), r0
8: 0087 rts pc
If we want to figure out which one is a and which one is b then.
unsigned int fun ( unsigned int a, unsigned int b )
{
return(a+(b<<1));
}
00000000 <_fun>:
0: 1d80 0004 mov 4(sp), r0
4: 0cc0 asl r0
6: 6d80 0002 add 2(sp), r0
a: 0087 rts pc
Want to see an immediate value
unsigned int fun ( unsigned int a )
{
return(a+0x321);
}
00000000 <fun>:
0: 8b 44 24 04 mov eax,DWORD PTR [esp+0x4]
4: 05 21 03 00 00 add eax,0x321
9: c3 ret
you can figure out what the compilers return address convention is, etc.
But you will hit some limits trying to get the compiler to do things for you to learn asm likewise you can easily take the code generated by these compilations
(using -save-temps or -S or disassemble and type it in (I prefer the latter)) but you can only get so far on your operating system in high level/C callable functions. Eventually you will want to do something bare-metal (on a simulator at first) to get maximum freedom and to try instructions you cant normally try or try them in a way that is hard or you don't quite understand yet how to use in the confines of an operating system in a function call. (please do not use inline assembly until down the road or never, use real assembly and ideally the assembler not the compiler to assemble it, down the road then try those things).
The one compiler was built for or defaults to using a stack frame so you need to tell the compiler to omit it. -fomit-frame-pointer. Note that one or both of these can be built to default not to have a frame pointer.
../gcc-$GCCVER/configure --target=$TARGET --prefix=$PREFIX --without-headers --with-newlib --with-gnu-as --with-gnu-ld --enable-languages='c' --enable-frame-pointer=no
(Don't assume gcc nor clang/llvm have a "standard" build as they are both customizable and the binary you downloaded has someone's opinion of the standard build)
You are using main(), this has the return 0 or not thing and it can/will carry other baggage. Depends on the compiler and settings. Using something not main gives you the freedom to pick your inputs and outputs without it warning that you didn't conform to the short list of choices for main().
For gcc -O0 is ideally no optimization although sometimes you see some. -O3 is max give me all you got. -O2 is historically where folks live if for no other reason than "I did it because everyone else is doing it". -O1 is no mans land for gnu it has some items not in -O0 but not a lot of good ones in -O2, so depends heavily on your code as to whether or not you landed in one/some of the optimizations associated with -O1. These numbered optimization things if your compiler even has a -O option is just a pre-defined list 0 means this list 1 means that list and so on.
There is no reason to expect any two compilers or the same compiler with different options to produce the same code from the same sources. If two competing compilers were able to do that most if not all of the time something very fishy is going on...Likewise no reason to expect the list of optimizations each compiler supports, what each optimization does, etc, to match much less the -O1 list to match between them and so on.
There is no reason to assume that any two compilers or versions conform to the same calling convention for the same target, it is much more common now and further for the processor vendor to create a recommended calling convention and then the competing compilers to often conform to that because why not, everyone else is doing it, or even better, whew I don't have to figure one out myself, if this one fails I can blame them.
There are a lot of implementation defined areas in C in particular, less so in C++ but still...So your expectations of what come out and comparing compilers to each other may differ for this reason as well. Just because one compiler implements some code in some way doesn't mean that is how that language works sometimes it is how that compiler author(s) interpreted the language spec or had wiggle room.
Even with full optimizations enabled, everything that compiler has to offer there is no reason to assume that a compiler can outperform a human. Its an algorithm with limits programmed by a human, it cannot outperform us. With experience it is not hard to examine the output of a compiler for sometimes simple functions but often for larger functions and find missed optimizations, or other things that could have been done "better" for some opinion of "better". And sometimes you find the compiler just left something in that you think it should have removed, and sometimes you are right.
There is education as shown above in using a compiler to start to learn assembly language, and even with decades of experience and dabbling with dozens of assembly languages/instruction sets, if there is a debugged compiler available I will very often start with disassembling simple functions to start learning that new instruction set, then look those up then start to get a feel from what I find there for how to use it.
Very often starting with this one first:
unsigned int fun ( unsigned int a )
{
return(a+5);
}
or
unsigned int fun ( unsigned int a, unsigned int b )
{
return(a+b);
}
And going from there. Likewise when writing a disassembler or a simulator for fun to learn the instruction set I often rely on an existing assembler since it is often the documentation for a processor is lacking, the first assembler and compiler for that processor are very often done with direct access to the silicon folks and then those that follow can also use existing tools as well as documentation to figure things out.
So you are on a good path to start learning assembly language I have strong opinions on which ones to or not to start with to improve the experience and chances of success, but I have been in too many battles on Stack Overflow this week, I'll let that go. You can see that I chose an array of instruction sets in this answer. And even if you don't know them you can probably figure out what the code is doing. "standard" installs of llvm provide the ability to output assembly language for several instruction sets from the same source code. The gnu approach is you pick the target (family) when you compile the toolchain and that compiled toolchain is limited to that target/family but you can easily install several gnu toolchains on your computer at the same time be they variations on defaults/settings for the same target or different targets. A number of these are apt gettable without having to learn to build the tools, arm, avr, msp430, x86 and perhaps some others.
I cannot speak to the why does it not return zero from main when you didn't actually have any return code. See comments by others and read up on the specs for that language. (or ask that as a separate question, or see if it was already answered).
Now you said Apple clang not sure what that reference was to I know that Apple has put a lot of work into llvm in general. Or maybe you are on a mac or in an Apple supplied/suggested development environment, but check Wikipedia and others, clang had a lot of corporate help not just Apple, so not sure what the reference was there. If you are on an Apple computer then the apt gettable isn't going to make sense, but there are still lots of pre-built gnu (and llvm) based toolchains you can download and install rather than attempt to build the toolchain from sources (which isn't difficult BTW).
Can someone explain why this code:
#include <stdio.h>
int main()
{
return 0;
}
when compiled with tcc using tcc code.c produces this asm:
00401000 |. 55 PUSH EBP
00401001 |. 89E5 MOV EBP,ESP
00401003 |. 81EC 00000000 SUB ESP,0
00401009 |. 90 NOP
0040100A |. B8 00000000 MOV EAX,0
0040100F |. E9 00000000 JMP fmt_vuln1.00401014
00401014 |. C9 LEAVE
00401015 |. C3 RETN
I guess that
00401009 |. 90 NOP
is maybe there for some memory alignment, but what about
0040100F |. E9 00000000 JMP fmt_vuln1.00401014
00401014 |. C9 LEAVE
I mean why would compiler insert this near jump that jumps to the next instruction, LEAVE would execute anyway?
I'm on 64-bit Windows generating 32-bit executable using TCC 0.9.26.
Superfluous JMP before the Function Epilogue
The JMP at the bottom that goes to the next statement, this was fixed in a commit. Version 0.9.27 of TCC resolves this issue:
When 'return' is the last statement of the top-level block
(very common and often recommended case) jump is not needed.
As for the reason it existed in the first place? The idea is that each function has a possible common exit point. If there is a block of code with a return in it at the bottom, the JMP goes to a common exit point where stack cleanup is done and the ret is executed. Originally the code generator also emitted the JMP instruction erroneously at the end of the function too if it appeared just before the final } (closing brace). The fix checks to see if there is a return statement followed by a closing brace at the top level of the function. If there is, the JMP is omitted
An example of code that has a return at a lower scope before a closing brace:
int main(int argc, char *argv[])
{
if (argc == 3) {
argc++;
return argc;
}
argc += 3;
return argc;
}
The generated code looks like:
401000: 55 push ebp
401001: 89 e5 mov ebp,esp
401003: 81 ec 00 00 00 00 sub esp,0x0
401009: 90 nop
40100a: 8b 45 08 mov eax,DWORD PTR [ebp+0x8]
40100d: 83 f8 03 cmp eax,0x3
401010: 0f 85 11 00 00 00 jne 0x401027
401016: 8b 45 08 mov eax,DWORD PTR [ebp+0x8]
401019: 89 c1 mov ecx,eax
40101b: 40 inc eax
40101c: 89 45 08 mov DWORD PTR [ebp+0x8],eax
40101f: 8b 45 08 mov eax,DWORD PTR [ebp+0x8]
; Jump to common function exit point. This is the `return argc` inside the if statement
401022: e9 11 00 00 00 jmp 0x401038
401027: 8b 45 08 mov eax,DWORD PTR [ebp+0x8]
40102a: 83 c0 03 add eax,0x3
40102d: 89 45 08 mov DWORD PTR [ebp+0x8],eax
401030: 8b 45 08 mov eax,DWORD PTR [ebp+0x8]
; Jump to common function exit point. This is the `return argc` at end of the function
401033: e9 00 00 00 00 jmp 0x401038
; Common function exit point
401038: c9 leave
401039: c3 ret
In versions prior to 0.9.27 the return argc inside the if statement would jump to a common exit point (function epilogue). As well the return argc at the bottom of the function also jumps to the same common exit point of the function. The problem is that the common exit point for the function happens to be right after the top level return argcso the side effect is an extra JMP that happens to be to the next instruction.
NOP after Function Prologue
The NOP isn't for alignment. Because of the way Windows implements guard pages for the stack (Programs that are in Portable Executable format) TCC has two types of prologues. If the local stack space required < 4096 (smaller than a single page) then you see this kind of code generated:
401000: 55 push ebp
401001: 89 e5 mov ebp,esp
401003: 81 ec 00 00 00 00 sub esp,0x0
The sub esp,0 isn't optimized out. It is the amount of stack space needed for local variables (in this case 0). If you add some local variables you will see the 0x0 in the SUB instruction changes to coincide with the amount of stack space needed for local variables. This prologue requires 9 bytes. There is another prologue to handle the case where the stack space needed is >= 4096 bytes. If you add an array of 4096 bytes with something like:
char somearray[4096]
and look at the resulting instruction you will see the function prologue change to a 10 byte prologue:
401000: b8 00 10 00 00 mov eax,0x1000
401005: e8 d6 00 00 00 call 0x4010e0
TCC's code generator assumes that the function prologue is always 10 bytes when targeting WinPE. This is primarily because TCC is a single pass compiler. The compiler doesn't know how much stack space a function will use until after the function is processed. To get around not knowing this ahead of time, TCC pre-allocates 10 bytes for the prologue to fit the largest method. Anything shorter is padded to 10 bytes.
In the case where stack space needed < 4096 bytes the instructions used total 9 bytes. The NOP is used to pad the prologue to 10 bytes. For the case where >= 4096 bytes are needed, the number of bytes is passed in EAX and the function __chkstk is called to allocate the required stack space instead.
TCC is not an optimizing compiler, at least not really. Every single instruction it emitted for main is sub-optimal or not needed at all, except the ret. IDK why you thought the JMP was the only instruction that might not make sense for performance.
This is by design: TCC stands for Tiny C Compiler. The compiler itself is designed to be simple, so it intentionally doesn't include code to look for many kinds of optimizations. Notice the sub esp, 0: this useless instruction clearly come from filling in a function-prologue template, and TCC doesn't even look for the special case where the offset is 0 bytes. Other function need stack space for locals, or to align the stack before any child function calls, but this main() doesn't. TCC doesn't care, and blindly emits sub esp,0 to reserve 0 bytes.
(In fact, TCC is truly one pass, laying out machine code as it does through the C statement by statement. It uses the imm32 encoding for sub so it will have room to fill in the right number (upon reaching the end of the function) even if it turns out the function uses more than 255 bytes of stack space. So instead of constructing a list of instructions in memory to finish assembling later, it just remembers one spot to fill in a uint32_t. That's why it can't omit the sub when it turns out not to be needed.)
Most of the work in creating a good optimizing compiler that anyone will use in practice is the optimizer. Even parsing modern C++ is peanuts compared to reliably emitting efficient asm (which not even gcc / clang / icc can do all the time, even without considering autovectorization). Just generating working but inefficient asm is easy compared to optimizing; most of gcc's codebase is optimization, not parsing. See Basile's answer on Why are there so few C compilers?
The JMP (as you can see from #MichaelPetch's answer) has a similar explanation: TCC (until recently) didn't optimize the case where a function only has one return path, and doesn't need to JMP to a common epilogue.
There's even a NOP in the middle of the function. It's obviously a waste of code bytes and decode / issue front-end bandwidth and out-of-order window size. (Sometimes executing a NOP outside a loop or something is worth it to align the top of a loop which is branched to repeatedly, but a NOP in the middle of a basic block is basically never worth it, so that's not why TCC put it there. And if a NOP did help, you could probably do even better by reordering instructions or choosing larger instructions to do the same thing without a NOP. Even proper optimizing compilers like gcc/clang/icc don't try to predict this kind of subtle front-end effect.)
#MichaelPetch points out that TCC always wants its function prologue to be 10 bytes, because it's a single-pass compiler (and it doesn't know how much space it needs for locals until the end of the function, when it comes back and fills in the imm32). But Windows targets need stack probes when modifying ESP / RSP by more than a whole page (4096 bytes), and the alternate prologue for that case is 10 bytes, instead of 9 for the normal one without the NOP. So this is another tradeoff favouring compilation speed over good asm.
An optimizing compiler would xor-zero EAX (because that's smaller and at least as fast as mov eax,0), and leave out all the other instruction. Xor-zeroing is one of the most well-known / common / basic x86 peephole optimizations, and has several advantages other than code-size on some modern x86 microarchitectures.
main:
xor eax,eax
ret
Some optimizing compilers might still make a stack frame with EBP, but tearing it down with pop ebp would be strictly better than leave on all CPUs, for this special case where ESP = EBP so the mov esp,ebp part of leave isn't needed. pop ebp is still 1 byte, but it's also a single-uop instruction on modern CPUs, unlike leave which is 2 or 3 on modern CPUs. (http://agner.org/optimize/, and see also other performance optimization links in the x86 tag wiki.) This is what gcc does. It's a fairly common situation; if you push some other registers after making a stack frame, you have to point ESP at the right place before pop ebx or whatever. (Or use mov to restore them.)
The benchmarks TCC cares about are compilation speed, not quality (speed or size) of the resulting code. For example, the TCC web site has a benchmark in lines/sec and MB/sec (of C source) vs. gcc3.2 -O0, where it's ~9x faster on a P4.
However, TCC is not totally braindead: it will apparently do some inlining, and as Michael's answer points out, a recent patch does leave out the JMP (but still not the useless sub esp, 0).
This question already exists:
Why are registers needed (why not only use memory)? [duplicate]
Closed 1 year ago.
Hi I have been reading this kind of stuff in various docs
register
Tells the compiler to store the variable being declared in a CPU register.
In standard C dialects, keyword register uses the following syntax:
register data-definition;
The register type modifier tells the compiler to store the variable being declared in a CPU register (if possible), to optimize access. For example,
register int i;
Note that TIGCC will automatically store often used variables in CPU registers when the optimization is turned on, but the keyword register will force storing in registers even if the optimization is turned off. However, the request for storing data in registers may be denied, if the compiler concludes that there is not enough free registers for use at this place.
http://tigcc.ticalc.org/doc/keywords.html#register
My point is not only about register. My point is why would a compiler stores the variables in memory. The compiler business is to just compile and to generate an object file. At run time the actual memory allocation happens. why would compiler does this business. I mean without running the object file just by compiling the file itself does the memory allocation happens in case of C?
The compiler is generating machine code, and the machine code is used to run your program. The compiler decides what machine code it generates, therefore making decisions about what sort of allocation will happen at runtime. It's not executing them when you type gcc foo.c but later, when you run the executable, it's the code GCC generated that's running.
This means that the compiler wants to generate the fastest code possible and makes as many decisions as it can at compile time, this includes how to allocate things.
The compiler doesn't run the code (unless it does a few rounds for profiling and better code execution), but it has to prepare it - this includes how to keep the variables your program defines, whether to use fast and efficient storage as registers, or using the slower (and more prone to side effects) memory.
Initially, your local variables would simply be assigned location on the stack frame (except of course for memory you explicitly use dynamic allocation for). If your function assigned an int, your compiler would likely tell the stack to grow by a few additional bytes and use that memory address for storing that variable and passing it as operand to any operation your code is doing on that variable.
However, since memory is slower (even when cached), and manipulating it causes more restrictions on the CPU, at a later stage the compiler may decide to try moving some variables into registers. This allocation is done through a complicated algorithm that tries to select the most reused and latency critical variables that can fit within the existing number of logical registers your architecture has (While confirming with various restrictions such as some instructions requiring the operand to be in this or that register).
There's another complication - some memory addresses may alias with external pointers in manners unknown at compilation time, in which case you can not move them into registers. Compilers are usually a very cautious bunch and most of them would avoid dangerous optimizations (otherwise they're need to put up some special checks to avoid nasty things).
After all that, the compiler is still polite enough to let you advise which variable it important and critical to you, in case he missed it, and by marking these with the register keyword you're basically asking him to make an attempt to optimize for this variable by using a register for it, given enough registers are available and that no aliasing is possible.
Here's a little example: Take the following code, doing the same thing twice but with slightly different circumstances:
#include "stdio.h"
int j;
int main() {
int i;
for (i = 0; i < 100; ++i) {
printf ("i'm here to prevent the loop from being optimized\n");
}
for (j = 0; j < 100; ++j) {
printf ("me too\n");
}
}
Note that i is local, j is global (and therefore the compiler doesn't know if anyone else might access him during the run).
Compiling in gcc with -O3 produces the following code for main:
0000000000400540 <main>:
400540: 53 push %rbx
400541: bf 88 06 40 00 mov $0x400688,%edi
400546: bb 01 00 00 00 mov $0x1,%ebx
40054b: e8 18 ff ff ff callq 400468 <puts#plt>
400550: bf 88 06 40 00 mov $0x400688,%edi
400555: 83 c3 01 add $0x1,%ebx # <-- i++
400558: e8 0b ff ff ff callq 400468 <puts#plt>
40055d: 83 fb 64 cmp $0x64,%ebx
400560: 75 ee jne 400550 <main+0x10>
400562: c7 05 80 04 10 00 00 movl $0x0,1049728(%rip) # 5009ec <j>
400569: 00 00 00
40056c: bf c0 06 40 00 mov $0x4006c0,%edi
400571: e8 f2 fe ff ff callq 400468 <puts#plt>
400576: 8b 05 70 04 10 00 mov 1049712(%rip),%eax # 5009ec <j> (loads j)
40057c: 83 c0 01 add $0x1,%eax # <-- j++
40057f: 83 f8 63 cmp $0x63,%eax
400582: 89 05 64 04 10 00 mov %eax,1049700(%rip) # 5009ec <j> (stores j back)
400588: 7e e2 jle 40056c <main+0x2c>
40058a: 5b pop %rbx
40058b: c3 retq
As you can see, the first loop counter sits in ebx, and is incremented on each iteration and compared against the limit.
The second loop however was the dangerous one, and gcc decided to pass the index counter through memory (loading it into rax every iteration). This example serves to show how better off you'd be when using registers, as well as how sometimes you can't.
The compiler needs to translate the code into machine instruction, and tell the computer how to run the code. That include how to make operations (like multiply two numbers) and how to store the data (stack, heap or register).
How do you get the size of a C instruction? For example I want to know, if I am working on Windows XP with Intel Core2Duo processor, how much space does the instruction:
while(1==9)
{
printf("test");
}
takes?
C does not have a notion of "instruction", let alone a notion of "size of an instruction", so the question doesn't make sense in the context of "programming in C".
The fundamental element of a C program is a statement, which can be arbitrarily complex.
If you care about the details of your implementation, simply inspect the machine code that your compiler generates. That one is broken down in machine instructions, and you can usually see how many bytes are required to encode each instruction.
For example, I compile the program
#include <cstdio>
int main() { for (;;) std::printf("test"); }
with
g++ -c -o /tmp/out.o /tmp/in.cpp
and disassemble with
objdump -d /tmp/out.o -Mintel
and I get:
00000000 <main>:
0: 55 push ebp
1: 89 e5 mov ebp,esp
3: 83 e4 f0 and esp,0xfffffff0
6: 83 ec 10 sub esp,0x10
9: c7 04 24 00 00 00 00 mov DWORD PTR [esp],0x0
10: e8 fc ff ff ff call 11 <main+0x11>
15: eb f2 jmp 9 <main+0x9>
As you can see, the call instruction requires 5 bytes in my case, and the jump for the loop requires 2 bytes. (Also note the conspicuous absence of a ret statement.)
You can look up the instruction in your favourite x86 documentation and see that E8 is the single-byte call opcode; the subsequent four bytes are the operand.
gcc has an extension that returns the address of a label as a value of type void*.
Maybe you can use that for your intents.
/* UNTESTED --- USE AT YOUR OWN RISK */
labelbegin:
while (1 == 9) {
printf("test\n");
}
labelend:
printf("length: %d\n",
(int)((unsigned char *)&&labelend - (unsigned char *)&&labelbegin));
You can find size of a binary code block like following (x86 target, gcc compiler):
#define CODE_BEGIN(name) __asm__ __volatile__ ("jmp ."name"_op_end\n\t""."name"_op_begin:\n\t")
#define CODE_END(name) __asm__ __volatile__ ("."name"_op_end:\n\t""movl $."name"_op_begin, %0\n\t""movl $."name"_op_end, %1\n\t":"=g"(begin), "=g"(end));
void func() {
unsigned int begin, end;
CODE_BEGIN("func");
/* ... your code goes here ... */
CODE_END("func");
printf("Code length is %d\n", end - begin);
}
This technique is used by "lazy JIT" compilers that copy binary code written in high level language (C), instead of emitting assembly binary codes, like normal JIT compilers do.
It's usually a job for a tracer or a debugger that includes a tracer, there are also external tools that can inspect and profile the memory usage of your program like valgrind.
Refer to the documentation of your IDE of choice or your compiler of choice.
To add on previous answers, there are occasions on which you can determine the size of the code pieces - look into the linker-generated map file, you should be able to see there global symbols and their size, including global functions. See here for more info: What's the use of .map files the linker produces? (for VC++, but the concept is similar for evey compiler).
In an attempt to understand what occurs underneath I am making small C programs and then reversing it, and trying to understand its objdump output.
The C program is:
#include <stdio.h>
int function(int a, int b, int c) {
printf("%d, %d, %d\n", a,b,c);
}
int main() {
int a;
int *ptr;
asm("nop");
function(1,2,3);
}
The objdump output for function gives me the following.
080483a4 <function>:
80483a4: 55 push ebp
80483a5: 89 e5 mov ebp,esp
80483a7: 83 ec 08 sub esp,0x8
80483aa: ff 75 10 push DWORD PTR [ebp+16]
80483ad: ff 75 0c push DWORD PTR [ebp+12]
80483b0: ff 75 08 push DWORD PTR [ebp+8]
80483b3: 68 04 85 04 08 push 0x8048504
80483b8: e8 fb fe ff ff call 80482b8 <printf#plt>
80483bd: 83 c4 10 add esp,0x10
80483c0: c9 leave
Notice that before the call to printf, three DWORD's with offsets 8,16,12(they must be the arguments to function in the reverse order) are being pushed onto the stack. Later a hex address which must be the address of the format string is being pushed.
My doubt is
Rather than pushing 3 DWORDS and the format specifier onto the stack directly, I expected to see the esp being manually decremented and the values being pushed onto the stack after that. How can one explain this behaviour?
Well, some machines have a stack pointer that is kind of like any other register, so the way you push something is, yes, with a decrement followed by a store.
But some machines, like x8632/64 have a push instruction that does a macro-op: decrementing the pointer and doing the store.
Macro-ops, btw, have a funny history. At times, certain examples on certain machines have been slower than performing the elementary operations with simple instructions.
I doubt if that's frequently the case today. Modern x86 is amazingly sophisticated. The CPU will be disassembling your opcodes themselves into micro-ops which it then stores in a cache. The micro-ops have specific pipeline and time slot requirements and the end result is that there is a RISC cpu inside the x86 these days, and the whole thing goes really fast and has good architectural-layer code density.
The stack pointer is adjusted with the push instruction. So it's copied to ebp and the parameters are pushed onto the stack so they exist in 2 places each: function's stack and printf's stack. The pushes affect esp, thus ebp is copied.
There is no mov [esp+x], [ebp+y] instruction, too many operands. It would take two instructions and use a register. Push does it in one instruction.
This is a standard cdecl calling convention for x86 machine. There are several different types of calling conventions. You can read the following article in the Wikipedia about it:
http://en.wikipedia.org/wiki/X86_calling_conventions
It explains the basic principle.
You raise an interesting point which I think has not been directly addressed so far. I suppose that you have seen assembly code which looks something like this:
sub esp, X
...
mov [ebp+Y], eax
call Z
This sort of disassembly is generated by certain compilers. What it is doing is extending the stack, then assigning the value of the new space to be eax (which has hopefully been populated with something meaningful by that point). This is actually equivalent to what the push mnemonic does. I can't answer why certain compilers generate this code instead but my guess is that at some point doing it this way was judged to be more efficient.
In your effort to learn assembly language and disassemble binaries, you might find ODA useful. It's a web-based disassembler, which is handy for disassembling lots of different architectures without having to build binutil's objdump for each one of them.
http://onlinedisassembler.com/