learning disassembly - c

In an attempt to understand what happens underneath, I am writing small C programs, disassembling them, and trying to understand the objdump output.
The C program is:
#include <stdio.h>
int function(int a, int b, int c) {
    printf("%d, %d, %d\n", a, b, c);
}

int main() {
    int a;
    int *ptr;
    asm("nop");
    function(1, 2, 3);
}
The objdump output for function gives me the following.
080483a4 <function>:
80483a4: 55 push ebp
80483a5: 89 e5 mov ebp,esp
80483a7: 83 ec 08 sub esp,0x8
80483aa: ff 75 10 push DWORD PTR [ebp+16]
80483ad: ff 75 0c push DWORD PTR [ebp+12]
80483b0: ff 75 08 push DWORD PTR [ebp+8]
80483b3: 68 04 85 04 08 push 0x8048504
80483b8: e8 fb fe ff ff call 80482b8 <printf@plt>
80483bd: 83 c4 10 add esp,0x10
80483c0: c9 leave
Notice that before the call to printf, three DWORDs at offsets 16, 12 and 8 from ebp (these must be the arguments to function, pushed in reverse order) are pushed onto the stack. After that a hex address, which must be the address of the format string, is pushed.
My question is:
Rather than pushing the three DWORDs and the format string address onto the stack with push, I expected to see esp being decremented manually and the values then stored into the freed-up space. How can this behaviour be explained?

Well, some machines have a stack pointer that is more or less like any other register, so the way you push something is, yes, with a decrement followed by a store.
But some machines, like x86-32/64, have a push instruction that performs a macro-op: decrementing the stack pointer and doing the store in one instruction.
Macro-ops, by the way, have a funny history. At times, certain examples on certain machines have been slower than performing the elementary operations with simple instructions.
I doubt that's frequently the case today. Modern x86 is amazingly sophisticated. The CPU decodes your instructions into micro-ops, which it then stores in a cache. The micro-ops have specific pipeline and time-slot requirements, and the end result is that there is effectively a RISC CPU inside the x86 these days, and the whole thing goes really fast and has good architectural-layer code density.

The push instruction itself adjusts the stack pointer. That is why esp is first copied to ebp: the pushes keep changing esp, while ebp stays put as a stable reference into the frame. Note that the parameters end up existing in two places each: once in function's own frame (where main pushed them) and once in the argument area just pushed for printf.

There is no mov [esp+x], [ebp+y] instruction; that would be too many memory operands. Doing it with mov would take two instructions and use a register. push does it in one instruction.

This is the standard cdecl calling convention for x86. There are several different calling conventions; you can read about them in the following Wikipedia article:
http://en.wikipedia.org/wiki/X86_calling_conventions
It explains the basic principle.
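One practical consequence of cdecl's caller clean-up and right-to-left argument order is that variadic functions like printf can work: only the caller knows how many arguments it actually pushed. A small C sketch of such a function (my own example, not from the question):

#include <stdarg.h>
#include <stdio.h>

/* Sums 'count' ints passed after the first argument. */
int sum(int count, ...) {
    va_list ap;
    va_start(ap, count);
    int total = 0;
    for (int i = 0; i < count; i++)
        total += va_arg(ap, int);
    va_end(ap);
    return total;
}

int main(void) {
    printf("%d\n", sum(3, 1, 2, 3));   /* prints 6 */
    return 0;
}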

You raise an interesting point which I think has not been directly addressed so far. I suppose that you have seen assembly code which looks something like this:
sub esp, X
...
mov [ebp+Y], eax
call Z
This sort of disassembly is generated by certain compilers. What it is doing is extending the stack, then storing eax (which has hopefully been populated with something meaningful by that point) into the newly reserved space. This is equivalent to what the push mnemonic does. I can't say for certain why some compilers generate this code instead, but my guess is that at some point doing it this way was judged to be more efficient.
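If you want to reproduce this pattern with gcc yourself, the x86 options -mno-push-args and -maccumulate-outgoing-args select the store-into-preallocated-space style instead of push. A small sketch (hypothetical file name and flags, assuming a 32-bit x86 target):

/* callsite.c -- try e.g.: gcc -m32 -O0 -maccumulate-outgoing-args -S callsite.c */
int callee(int a, int b, int c);

int caller(void) {
    return callee(1, 2, 3);
}

The generated caller should then reserve the outgoing-argument space once and use mov stores rather than push for the three arguments.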

In your effort to learn assembly language and disassemble binaries, you might find ODA useful. It's a web-based disassembler, which is handy for disassembling lots of different architectures without having to build binutils' objdump for each one of them.
http://onlinedisassembler.com/

Related

How do I run a function from its hex version in C?

Say I want to convert a certain function into hex
#include <stdio.h>

void func(char* string) {
    puts(string);
}
1139: 55 push %rbp
113a: 48 89 e5 mov %rsp,%rbp
113d: 48 83 ec 10 sub $0x10,%rsp
1141: 48 89 7d f8 mov %rdi,-0x8(%rbp)
1145: 48 8b 45 f8 mov -0x8(%rbp),%rax
1149: 48 89 c7 mov %rax,%rdi
114c: e8 df fe ff ff callq 1030 <puts@plt>
1151: 90 nop
1152: c9 leaveq
1153: c3 retq
This is what I got on x86_64: \x55\x48\x89\xe5\x48\x83\xec\x10\x48\x89\x7d\xf8\x48\x8b\x45\xf8\x48\x89\xc7\xe8\xdf\xfe\xff\xff\x90\xc9\xc3
I want to encrypt those bytes and use them in this program, with a decryptor at the start that decrypts the instructions at run time, so the code can't be analyzed statically.
Converting the above function into a hex byte array and calling it through a function pointer doesn't run; it ends with SIGSEGV at push %rbp.
My aim is to make this code print Hi.
int main() {
    char* decrypted = decrypt(hexcode);
    void (*func)(char*) = (void (*)(char*)) decrypted;
    func("HI");
}
My questions are:
How do I convert a function into hex properly?
How do I then run this hex code from main as shown above?
If you want to execute a binary blob, you need to do something like this:
void *p = mmap(NULL, blob_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0);
read(blob_file, p, blob_size);
mprotect(p, blob_size, PROT_READ | PROT_EXEC);
void (*UndefinedBehaviour)(char *x) = p;
UndefinedBehaviour("HI");
This allocates some memory, copies the blob into it, changes the memory protection to allow execution, then invokes the blob at its beginning. You need to add some error checking, and depending on what sort of system you are on, it may be running malware monitors that prevent you from doing this.
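For reference, here is a more complete, self-contained sketch of the same idea on Linux. The blob here is just a single 0xc3 byte (an x86 ret) as a placeholder; it is not taken from the question:

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/* Placeholder machine code: a single x86 'ret' instruction. */
static const unsigned char blob[] = { 0xc3 };

int main(void) {
    void *p = mmap(NULL, sizeof blob, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANON, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }
    memcpy(p, blob, sizeof blob);
    if (mprotect(p, sizeof blob, PROT_READ | PROT_EXEC) != 0) {
        perror("mprotect");
        return 1;
    }
    /* Calling data as code is undefined behaviour as far as ISO C is concerned,
       but works on typical Linux systems once the page is executable. */
    void (*fn)(void) = (void (*)(void))p;
    fn();
    return 0;
}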
Answer for 1: it is nearly impossible to do automatically, because there is no simple way to determine the length of a function's machine code; it depends on the CPU, compiler optimizations, etc. The only way is "manual" analysis of the disassembled binary.
You can't for those instructions because they're not fully position-independent and self-contained.
e8 df fe ff ff is a call rel32 (with a little-endian relative displacement as the call target). It only works if that displacement reaches the puts#plt stub, and that only happens in the executable you're disassembling, where this code appears at a fixed distance from the PLT. (So the executable itself is position-independent when relocated as a whole, but taking the machine code for one function and trying to run it from some other address will break.)
In theory you could fixup the call target using a function pointer to puts in some code that included this machine code in an array, but if you're trying to make shellcode you can't depend on the "target" process helping you that way.
Instead you should use system calls directly via the syscall instruction; for example, the Linux system call with RAX=1=__NR_write is write. (Not via the libc wrapper functions like write(), which would have exactly the same problem as puts.)
Then you can refer to How to get c code to execute hex bytecode? for how to put machine code in a C array, make sure that's in an executable page (e.g. gcc -z execstack or mprotect or mmap), and cast that to a function pointer + call it like you're doing here.
ends with SIGSEGV at push %rbp
Yup, code-fetch from a page without EXEC permission will do that. gcc -z execstack is an easy way to fix that, or mmap like other answers suggest, at which point execution will get as far as the call -289 and fault or run bad instructions.

Tiny C Compiler's generated code emits extra (unnecessary?) NOPs and JMPs

Can someone explain why this code:
#include <stdio.h>

int main()
{
    return 0;
}
when compiled with tcc using tcc code.c produces this asm:
00401000 |. 55 PUSH EBP
00401001 |. 89E5 MOV EBP,ESP
00401003 |. 81EC 00000000 SUB ESP,0
00401009 |. 90 NOP
0040100A |. B8 00000000 MOV EAX,0
0040100F |. E9 00000000 JMP fmt_vuln1.00401014
00401014 |. C9 LEAVE
00401015 |. C3 RETN
I guess that
00401009 |. 90 NOP
is maybe there for some memory alignment, but what about
0040100F |. E9 00000000 JMP fmt_vuln1.00401014
00401014 |. C9 LEAVE
I mean, why would the compiler insert this near jump that jumps to the very next instruction? LEAVE would execute anyway.
I'm on 64-bit Windows generating 32-bit executable using TCC 0.9.26.
Superfluous JMP before the Function Epilogue
The JMP at the bottom that jumps to the next instruction was fixed in a commit; version 0.9.27 of TCC resolves this issue:
When 'return' is the last statement of the top-level block
(very common and often recommended case) jump is not needed.
As for the reason it existed in the first place: the idea is that each function has a single common exit point. If there is a block of code with a return in it, the JMP goes to that common exit point, where stack cleanup is done and the ret is executed. Originally the code generator also emitted the JMP erroneously at the end of the function when the return appeared just before the final } (closing brace). The fix checks whether there is a return statement followed by a closing brace at the top level of the function; if there is, the JMP is omitted.
An example of code that has a return at a lower scope before a closing brace:
int main(int argc, char *argv[])
{
    if (argc == 3) {
        argc++;
        return argc;
    }
    argc += 3;
    return argc;
}
The generated code looks like:
401000: 55 push ebp
401001: 89 e5 mov ebp,esp
401003: 81 ec 00 00 00 00 sub esp,0x0
401009: 90 nop
40100a: 8b 45 08 mov eax,DWORD PTR [ebp+0x8]
40100d: 83 f8 03 cmp eax,0x3
401010: 0f 85 11 00 00 00 jne 0x401027
401016: 8b 45 08 mov eax,DWORD PTR [ebp+0x8]
401019: 89 c1 mov ecx,eax
40101b: 40 inc eax
40101c: 89 45 08 mov DWORD PTR [ebp+0x8],eax
40101f: 8b 45 08 mov eax,DWORD PTR [ebp+0x8]
; Jump to common function exit point. This is the `return argc` inside the if statement
401022: e9 11 00 00 00 jmp 0x401038
401027: 8b 45 08 mov eax,DWORD PTR [ebp+0x8]
40102a: 83 c0 03 add eax,0x3
40102d: 89 45 08 mov DWORD PTR [ebp+0x8],eax
401030: 8b 45 08 mov eax,DWORD PTR [ebp+0x8]
; Jump to common function exit point. This is the `return argc` at end of the function
401033: e9 00 00 00 00 jmp 0x401038
; Common function exit point
401038: c9 leave
401039: c3 ret
In versions prior to 0.9.27 the return argc inside the if statement would jump to a common exit point (the function epilogue). The return argc at the bottom of the function also jumps to the same common exit point. The problem is that the common exit point happens to sit right after the top-level return argc, so the side effect is an extra JMP that happens to target the very next instruction.
NOP after Function Prologue
The NOP isn't for alignment. Because of the way Windows implements guard pages for the stack (for programs in Portable Executable format), TCC has two types of prologue. If the local stack space required is < 4096 bytes (smaller than a single page) then you see this kind of code generated:
401000: 55 push ebp
401001: 89 e5 mov ebp,esp
401003: 81 ec 00 00 00 00 sub esp,0x0
The sub esp,0 isn't optimized out. It is the amount of stack space needed for local variables (in this case 0). If you add some local variables you will see the 0x0 in the SUB instruction change to match the amount of stack space needed for them. This prologue requires 9 bytes. There is another prologue to handle the case where the stack space needed is >= 4096 bytes. If you add an array of 4096 bytes with something like:
char somearray[4096];
and look at the resulting instruction you will see the function prologue change to a 10 byte prologue:
401000: b8 00 10 00 00 mov eax,0x1000
401005: e8 d6 00 00 00 call 0x4010e0
TCC's code generator assumes that the function prologue is always 10 bytes when targeting WinPE. This is primarily because TCC is a single-pass compiler: it doesn't know how much stack space a function will use until after the function has been processed. To get around not knowing this ahead of time, TCC reserves 10 bytes for the prologue, enough for the larger variant; anything shorter is padded out to 10 bytes.
In the case where the stack space needed is < 4096 bytes, the instructions used total 9 bytes, and the NOP pads the prologue to 10 bytes. In the case where >= 4096 bytes are needed, the number of bytes is passed in EAX and the function __chkstk is called to allocate the required stack space instead.
TCC is not an optimizing compiler, at least not really. Every single instruction it emitted for main is sub-optimal or not needed at all, except the ret. IDK why you thought the JMP was the only instruction that might not make sense for performance.
This is by design: TCC stands for Tiny C Compiler. The compiler itself is designed to be simple, so it intentionally doesn't include code to look for many kinds of optimizations. Notice the sub esp, 0: this useless instruction clearly comes from filling in a function-prologue template, and TCC doesn't even look for the special case where the offset is 0 bytes. Other functions need stack space for locals, or to align the stack before any child function calls, but this main() doesn't. TCC doesn't care and blindly emits sub esp,0 to reserve 0 bytes.
(In fact, TCC is truly one-pass, laying out machine code as it goes through the C statement by statement. It uses the imm32 encoding for sub so it will have room to fill in the right number (upon reaching the end of the function) even if it turns out the function uses more than 255 bytes of stack space. So instead of constructing a list of instructions in memory to finish assembling later, it just remembers one spot to fill in a uint32_t. That's why it can't omit the sub when it turns out not to be needed.)
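As a quick, hypothetical experiment, giving main a local variable should make that template slot non-zero:

int main(void)
{
    volatile int x = 1;   /* forces some stack space to be reserved for a local */
    return x - 1;
}

With TCC, the sub esp,0x0 should then show a non-zero immediate instead.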
Most of the work in creating a good optimizing compiler that anyone will use in practice is the optimizer. Even parsing modern C++ is peanuts compared to reliably emitting efficient asm (which not even gcc / clang / icc can do all the time, even without considering autovectorization). Just generating working but inefficient asm is easy compared to optimizing; most of gcc's codebase is optimization, not parsing. See Basile's answer on Why are there so few C compilers?
The JMP (as you can see from #MichaelPetch's answer) has a similar explanation: TCC (until recently) didn't optimize the case where a function only has one return path, and doesn't need to JMP to a common epilogue.
There's even a NOP in the middle of the function. It's obviously a waste of code bytes and decode / issue front-end bandwidth and out-of-order window size. (Sometimes executing a NOP outside a loop or something is worth it to align the top of a loop which is branched to repeatedly, but a NOP in the middle of a basic block is basically never worth it, so that's not why TCC put it there. And if a NOP did help, you could probably do even better by reordering instructions or choosing larger instructions to do the same thing without a NOP. Even proper optimizing compilers like gcc/clang/icc don't try to predict this kind of subtle front-end effect.)
#MichaelPetch points out that TCC always wants its function prologue to be 10 bytes, because it's a single-pass compiler (and it doesn't know how much space it needs for locals until the end of the function, when it comes back and fills in the imm32). But Windows targets need stack probes when modifying ESP / RSP by more than a whole page (4096 bytes), and the alternate prologue for that case is 10 bytes, instead of 9 for the normal one without the NOP. So this is another tradeoff favouring compilation speed over good asm.
An optimizing compiler would xor-zero EAX (because that's smaller and at least as fast as mov eax,0) and leave out all the other instructions. Xor-zeroing is one of the most well-known / common / basic x86 peephole optimizations, and has several advantages other than code size on some modern x86 microarchitectures.
main:
xor eax,eax
ret
Some optimizing compilers might still make a stack frame with EBP, but tearing it down with pop ebp would be strictly better than leave on all CPUs, for this special case where ESP = EBP so the mov esp,ebp part of leave isn't needed. pop ebp is still 1 byte, but it's also a single-uop instruction on modern CPUs, unlike leave which is 2 or 3 on modern CPUs. (http://agner.org/optimize/, and see also other performance optimization links in the x86 tag wiki.) This is what gcc does. It's a fairly common situation; if you push some other registers after making a stack frame, you have to point ESP at the right place before pop ebx or whatever. (Or use mov to restore them.)
The benchmarks TCC cares about are compilation speed, not quality (speed or size) of the resulting code. For example, the TCC web site has a benchmark in lines/sec and MB/sec (of C source) vs. gcc3.2 -O0, where it's ~9x faster on a P4.
However, TCC is not totally braindead: it will apparently do some inlining, and as Michael's answer points out, a recent patch does leave out the JMP (but still not the useless sub esp, 0).

Does main have a return address, dynamic link or return value in C?

According to our book, each function has an activation record in the run-time stack in C. Each of these activation records has a return address, dynamic link, and return value. Does main have these also?
All of these terms are purely implementation details - C has no notion of "return addresses" or "dynamic links." It doesn't even have a notion of a "stack" at all. Most implementations of C have these objects in them, and in those implementations it is possible that they exist for main. However, there is no requirement that this happen.
Hope this helps!
If you disassemble functions you will realize that most of the time the stack doesn't even contain the return value; oftentimes the EAX register does (on Intel x86).
You can also look up "calling conventions" - it all pretty much depends on the compiler.
C is a language; how it is translated into machine code is not 'its' business.
While this depends on the implementation, it is worth looking at a C program compiled with gcc. If you run objdump -d executable, you will see the disassembly and can see how main() behaves. Here's an example:
08048680 <_start>:
...
8048689: 54 push %esp
804868a: 52 push %edx
804868b: 68 a0 8b 04 08 push $0x8048ba0
8048690: 68 30 8b 04 08 push $0x8048b30
8048695: 51 push %ecx
8048696: 56 push %esi
8048697: 68 f1 88 04 08 push $0x80488f1
804869c: e8 9f ff ff ff call 8048640 <__libc_start_main@plt>
80486a1: f4 hlt
...
080488f1 <main>:
80488f1: 55 push %ebp
80488f2: 89 e5 mov %esp,%ebp
80488f4: 57 push %edi
80488f5: 56 push %esi
80488f6: 53 push %ebx
...
8048b2b: 5b pop %ebx
8048b2c: 5e pop %esi
8048b2d: 5f pop %edi
8048b2e: 5d pop %ebp
8048b2f: c3 ret
You can see that main behaves similarly to a regular function in that it returns normally. In fact, if you look at the Linux Standard Base documentation, you'll see that the call to __libc_start_main that we see from _start actually requires main to behave like a regular function.
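One way to convince yourself that C treats main like a regular function (my own illustration; note that C++ explicitly forbids this):

#include <stdio.h>

int main(int argc, char *argv[]) {
    printf("argc = %d\n", argc);
    if (argc > 1)
        return main(argc - 1, argv);   /* calling main is legal in C, not in C++ */
    return 0;
}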
In C/C++, main() is written just like a function, but isn't quite an ordinary one. For example, in C++ you aren't allowed to call main(), and it has several possible prototypes (an ordinary function can't have that in C!). Whatever is returned from it gets passed to the operating system (and the program ends).
Individual C implementations might handle main() like a function called from "outside" for uniformity, but nobody forces them to do so (or prevents them from switching to some other way of doing it without telling anybody). There are traditional ways of implementing C, but nobody is forced to do it that way; it is just the simplest way on our typical architectures.

Help deciphering simple Assembly Code

I am learning assembly using GDB & Eclipse
Here is a simple C code.
#include <stdlib.h>

int absdiff(int x, int y)
{
    if (x < y)
        return y - x;
    else
        return x - y;
}

int main(void) {
    int x = 10;
    int y = 15;
    absdiff(x, y);
    return EXIT_SUCCESS;
}
Here is corresponding assembly instructions for main()
main:
080483bb: push %ebp #push old frame pointer onto the stack
080483bc: mov %esp,%ebp #move the frame pointer down, to the position of stack pointer
080483be: sub $0x18,%esp # ???
25 int x = 10;
080483c1: movl $0xa,-0x4(%ebp) #move the "x(10)" to 4 address below frame pointer (why not push?)
26 int y = 15;
080483c8: movl $0xf,-0x8(%ebp) #move the "y(15)" to 8 address below frame pointer (why not push?)
28 absdiff(x,y);
080483cf: mov -0x8(%ebp),%eax # -0x8(%ebp) == 15 = y, and move it into %eax
080483d2: mov %eax,0x4(%esp) # from this point on, I am confused
080483d6: mov -0x4(%ebp),%eax
080483d9: mov %eax,(%esp)
080483dc: call 0x8048394 <absdiff>
31 return EXIT_SUCCESS;
080483e1: mov $0x0,%eax
32 }
Basically, I am asking for help making sense of this assembly code, and of why it does things in this particular order. The point where I am stuck is shown in the assembly comments. Thanks!
Lines 0x080483cf to 0x080483d9 are copying x and y from the current frame on the stack, and pushing them back onto the stack as arguments for absdiff() (this is typical; see e.g. http://en.wikipedia.org/wiki/X86_calling_conventions#cdecl). If you look at the disassembler for absdiff() (starting at 0x8048394), I bet you'll see it pick these values up from the stack and use them.
This might seem like a waste of cycles in this instance, but that's probably because you've compiled without optimisation, so the compiler does literally what you asked for. If you use e.g. -O2, you'll probably see most of this code disappear.
First it bears saying that this assembly is in the AT&T syntax version of x86_32, and that the order of arguments to operations is reversed from the Intel syntax (used with MASM, YASM, and many other assemblers and debuggers).
080483bb: push %ebp #push old frame pointer onto the stack
080483bc: mov %esp,%ebp #move the frame pointer down, to the position of stack pointer
080483be: sub $0x18,%esp # ???
This enters a stack frame. A frame is an area of memory between the stack pointer (esp) and the base pointer (ebp). This area is intended to be used for local variables that have to live on the stack. NOTE: Stack frames don't have to be implemented in this way, and GCC has the optimization switch -fomit-frame-pointer that does away with it except when alloca or variable sized arrays are used, because they are implemented by changing the stack pointer by arbitrary values. Not using ebp as the frame pointer allows it to be used as an extra general purpose register (more general purpose registers is usually good).
Using the base pointer makes several things simpler to calculate for compilers and debuggers, since where variables are located relative to the base does not change while in the function, but you can also index them relative to the stack pointer and get the same results, though the stack pointer does tend to change around so the same location may require a different index at different times.
In this code 0x18 (or 24) bytes are being reserved on the stack for local use.
This code so far is often called the function prologue (not to be confused with the programming language "prolog").
25 int x = 10;
080483c1: movl $0xa,-0x4(%ebp) #move the "x(10)" to 4 address below frame pointer (why not push?)
This line moves the constant 10 (0xA) to a location within the current stack frame, relative to the base pointer. Because locals sit below the base pointer (the stack grows downward in RAM), the index is negative rather than positive. If this were indexed relative to the stack pointer, a different index would be used, and it would be positive.
You are correct that this value could have been pushed rather than copied like this. I suspect that this is done this way because you have not compiled with optimizations turned on. By default gcc (which I assume you are using based on your use of gdb) does not optimize much, and so this code is probably the default "copy a constant to a location in the stack frame" code. This may not be the case, but it is one possible explanation.
26 int y = 15;
080483c8: movl $0xf,-0x8(%ebp) #move the "y(15)" to 8 address below frame pointer (why not push?)
Similar to the previous line of code. These two lines of code put the 10 and 15 into local variables. They are on the stack (rather than in registers) because this is unoptimized code.
28 absdiff(x,y);
gdb printing this meant that this is the source code line being executed, not that this function is being executed (yet).
080483cf: mov -0x8(%ebp),%eax # -0x8(%ebp) == 15 = y, and move it into %eax
In preparation for calling the function the values that are being passed as arguments need to be retrieved from their storage locations (even though they were just placed at those locations and their values are known because of the no optimization thing)
080483d2: mov %eax,0x4(%esp) # from this point on, I am confused
This is the second part of moving one of the local variables' values onto the stack so that it can be used as an argument to the function. You can't (usually) move from one memory address to another on x86, so you have to go through a register (eax in this case).
080483d6: mov -0x4(%ebp),%eax
080483d9: mov %eax,(%esp)
These two lines do the same thing except for the other variable. Note that since this variable is being moved to the top of the stack that no offset is being used in the second instruction.
080483dc: call 0x8048394 <absdiff>
This pushes the return address onto the top of the stack and jumps to the address of absdiff.
You didn't include code for absdiff, so you probably did not step through that.
31 return EXIT_SUCCESS;
080483e1: mov $0x0,%eax
C programs return 0 upon success, so EXIT_SUCCESS was defined as 0 by someone. Integer return values are put in eax, and some code that called the main function will use that value as the argument when calling the exit function.
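Conceptually, the startup code that called main does something like the following (a simplified sketch, not the actual C runtime source):

exit(main(argc, argv));   /* main's return value, left in eax, becomes the process exit status */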
32 }
This is the end. The reason that gdb stopped here is that things actually happen during cleanup. In C++ it is common to see destructors for local class instances being called here, but in C you will probably just see the function epilogue. This is the complement to the function prologue, and consists of returning the stack pointer and base pointer to their original values. Sometimes this is done with similar math on them, and sometimes it is done with the leave instruction. There is also an enter instruction which can be used for the prologue, but gcc doesn't use it (I don't know why). If you had continued to view the disassembly here you would have seen the epilogue code and a ret instruction.
Something you may be interested in is the ability to tell gcc to produce assembly files. If you do:
gcc -S source_file.c
a file named source_file.s will be produced with assembly code in it.
If you do:
gcc -S -O source_file.c
Then the same thing will happen, but some basic optimizations will be done. This will probably make reading the assembly code easier since the code will not likely have as many odd instructions that seem like they could have been done a better way (like moving constant values to the stack, then to a register, then to another location on the stack and never using the push instruction).
Your regular optimization flags for gcc are:
-O0 default -- none
-O1 a few optimizations
-O the same as -O1
-O2 a lot of optimizations
-O3 a bunch more, some of which may take a long time and/or make the code a lot bigger
-Os optimize for size -- similar to -O2, but not quite
If you are actually trying to debug C programs then you will probably want the least optimizations possible since things will happen in the order that they are written in your code and variables won't disappear.
You should have a look at the gcc man page:
man gcc
Remember, if you're running in a debugger or debug mode, the compiler reserves the right to insert whatever debugging code it likes and make other nonsensical code changes.
For example, this is Visual Studio's debug main():
int main(void) {
001F13D0 push ebp
001F13D1 mov ebp,esp
001F13D3 sub esp,0D8h
001F13D9 push ebx
001F13DA push esi
001F13DB push edi
001F13DC lea edi,[ebp-0D8h]
001F13E2 mov ecx,36h
001F13E7 mov eax,0CCCCCCCCh
001F13EC rep stos dword ptr es:[edi]
int x = 10;
001F13EE mov dword ptr [x],0Ah
int y = 15;
001F13F5 mov dword ptr [y],0Fh
absdiff(x,y);
001F13FC mov eax,dword ptr [y]
001F13FF push eax
001F1400 mov ecx,dword ptr [x]
001F1403 push ecx
001F1404 call absdiff (1F10A0h)
001F1409 add esp,8
*(int*)nullptr = 5;
001F140C mov dword ptr ds:[0],5
return 0;
001F1416 xor eax,eax
}
001F1418 pop edi
001F1419 pop esi
001F141A pop ebx
001F141B add esp,0D8h
001F1421 cmp ebp,esp
001F1423 call #ILT+300(__RTC_CheckEsp) (1F1131h)
001F1428 mov esp,ebp
001F142A pop ebp
001F142B ret
It helpfully shows the C++ source next to the corresponding assembly. In this case, you can fairly clearly see that x and y are stored on the stack explicitly, and explicit copies are pushed, then absdiff is called. I explicitly dereferenced nullptr to cause the debugger to break in. You may wish to change compilers.
Compile with -fverbose-asm -g -save-temps for additional information with GCC.

C programming and error_code variable efficiency

Most code I have ever read uses an int for standard error handling (return values from functions and such). But I am wondering if there is any benefit to be had from using a uint8_t: will a compiler (read: most C compilers on most architectures) produce instructions using an immediate addressing mode, i.e., embed the 1-byte integer into the instruction? The key instruction I'm thinking about is the compare performed after a function with a uint8_t return type returns.
I could be thinking about this incorrectly, as introducing a 1-byte type just causes alignment issues; there is probably a perfectly sane reason why compilers like to pack things into 4 bytes, and that is possibly the reason everyone just uses ints. And since this is a stack-related issue rather than a heap one, there is no real overhead.
Doing the right thing is what I'm thinking about. But let's say, for the sake of argument, that this is a popular cheap microprocessor for an intelligent watch, configured with 1 KB of memory, but it does have different addressing modes in its instruction set :D
Another question to slightly specialize the discussion (x86) would be: is the literal in:
uint32_t x = func(); x == 1;
and
uint8_t x = func(); x == 1;
the same type? Or will the compiler generate an 8-bit literal in the second case? If so, it may use it to produce a compare instruction that has the literal as an immediate value and the returned value in a register. See the CMP instruction types.
Another reference for the x86 instruction set.
Here's what one particular compiler will do for the following code:
extern int foo(void);
extern void do_something(void);
extern void do_something_else(void);

void bar(void)
{
    if (foo() == 31) { //error code 31
        do_something();
    } else {
        do_something_else();
    }
}
0: 55 push %ebp
1: 89 e5 mov %esp,%ebp
3: 83 ec 08 sub $0x8,%esp
6: e8 fc ff ff ff call 7 <bar+0x7>
b: 83 f8 1f cmp $0x1f,%eax
e: 74 08 je 18 <bar+0x18>
10: c9 leave
11: e9 fc ff ff ff jmp 12 <bar+0x12>
16: 89 f6 mov %esi,%esi
18: c9 leave
19: e9 fc ff ff ff jmp 1a <bar+0x1a>
A 3-byte instruction for the cmp. If foo() returns a char, we get:
b: 3c 1f cmp $0x1f,%al
If you're looking for efficiency, though, don't assume that comparing in %al is faster than comparing with %eax.
There may be very small speed differences between the different integral types on a particular architecture. But you can't rely on it, it may change if you move to different hardware, and it may even run slower if you upgrade to newer hardware.
And if you talk about x86 in the example you are giving, you make a false assumption: that an immediate needs to be of type uint8_t.
Actually 8-bit immediates embedded into the instruction are of type int8_t and can be used with bytes, words, dwords and qwords, in C notation: char, short, int and long long.
So on this architecture there would be no benefit at all, neither code size nor execution speed.
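To make the question concrete, here is a small C sketch (hypothetical helper functions; the comments describe typical x86 code generation, not a guarantee):

#include <stdint.h>

extern uint32_t ret32(void);
extern uint8_t  ret8(void);

int wide(void)   { return ret32() == 1; }  /* typically cmp eax,1: a sign-extended 8-bit immediate */
int narrow(void) { return ret8()  == 1; }  /* typically cmp al,1: one byte shorter, not necessarily faster */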
You should use int or unsigned int for your calculations, and use smaller types only for compounds (structs/arrays). The reason is that int is normally defined to be the "most natural" integral type for the processor; all other types may require extra processing to work correctly. In our project, compiled with gcc on Solaris for SPARC, accesses to 8- and 16-bit variables added an instruction to the code: when loading a smaller type from memory, the compiler had to make sure the upper part of the register was properly set (sign extension for signed types, zero for unsigned). This made the code longer and increased pressure on the registers, which hurt the other optimisations.
I've got a concrete example:
I declared two fields of a struct as uint8_t and got this code in SPARC asm:
if(p->BQ > p->AQ)
was translated into:
ldub [%l1+165], %o5 ! <variable>.BQ,
ldub [%l1+166], %g5 ! <variable>.AQ,
and %o5, 0xff, %g4 ! <variable>.BQ, <variable>.BQ
and %g5, 0xff, %l0 ! <variable>.AQ, <variable>.AQ
cmp %g4, %l0 ! <variable>.BQ, <variable>.AQ
bleu,a,pt %icc, .LL586 !
And here is what I got when I declared the two fields as uint_t:
lduw [%l1+168], %g1 ! <variable>.BQ,
lduw [%l1+172], %g4 ! <variable>.AQ,
cmp %g1, %g4 ! <variable>.BQ, <variable>.AQ
bleu,a,pt %icc, .LL587 !
Two arithmetic operations fewer, and two registers freed up for other stuff.
Processors typically like to work with their natural register size, which in C is int.
Although there are exceptions, you're thinking too much about a problem that does not exist.
