gcc generates unnecessary (?) instructions

gcc generates unnecessary (?) instructions - c

I decided to compile a very basic C program and take a look at the generated code with objdump -d.
int main(int argc, char *argv[]) {
exit(0);
}
After compiling it with gcc test.c -s -o test.o and then disassembling with objdump -d my text segment looked like this:
Disassembly of section .text:
0000000000001050 <.text>:
1050: 31 ed xor %ebp,%ebp
1052: 49 89 d1 mov %rdx,%r9
1055: 5e pop %rsi
1056: 48 89 e2 mov %rsp,%rdx
1059: 48 83 e4 f0 and $0xfffffffffffffff0,%rsp
105d: 50 push %rax
105e: 54 push %rsp
105f: 4c 8d 05 4a 01 00 00 lea 0x14a(%rip),%r8 # 11b0 <__cxa_finalize#plt+0x170>
1066: 48 8d 0d e3 00 00 00 lea 0xe3(%rip),%rcx # 1150 <__cxa_finalize#plt+0x110>
106d: 48 8d 3d c1 00 00 00 lea 0xc1(%rip),%rdi # 1135 <__cxa_finalize#plt+0xf5>
1074: ff 15 66 2f 00 00 callq *0x2f66(%rip) # 3fe0 <__cxa_finalize#plt+0x2fa0>
107a: f4 hlt
107b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
1080: 48 8d 3d a9 2f 00 00 lea 0x2fa9(%rip),%rdi # 4030 <__cxa_finalize#plt+0x2ff0>
1087: 48 8d 05 a2 2f 00 00 lea 0x2fa2(%rip),%rax # 4030 <__cxa_finalize#plt+0x2ff0>
108e: 48 39 f8 cmp %rdi,%rax
1091: 74 15 je 10a8 <__cxa_finalize#plt+0x68>
1093: 48 8b 05 3e 2f 00 00 mov 0x2f3e(%rip),%rax # 3fd8 <__cxa_finalize#plt+0x2f98>
109a: 48 85 c0 test %rax,%rax
109d: 74 09 je 10a8 <__cxa_finalize#plt+0x68>
109f: ff e0 jmpq *%rax
10a1: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
10a8: c3 retq
10a9: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
10b0: 48 8d 3d 79 2f 00 00 lea 0x2f79(%rip),%rdi # 4030 <__cxa_finalize#plt+0x2ff0>
10b7: 48 8d 35 72 2f 00 00 lea 0x2f72(%rip),%rsi # 4030 <__cxa_finalize#plt+0x2ff0>
10be: 48 29 fe sub %rdi,%rsi
10c1: 48 c1 fe 03 sar $0x3,%rsi
10c5: 48 89 f0 mov %rsi,%rax
10c8: 48 c1 e8 3f shr $0x3f,%rax
10cc: 48 01 c6 add %rax,%rsi
10cf: 48 d1 fe sar %rsi
10d2: 74 14 je 10e8 <__cxa_finalize#plt+0xa8>
10d4: 48 8b 05 15 2f 00 00 mov 0x2f15(%rip),%rax # 3ff0 <__cxa_finalize#plt+0x2fb0>
10db: 48 85 c0 test %rax,%rax
10de: 74 08 je 10e8 <__cxa_finalize#plt+0xa8>
10e0: ff e0 jmpq *%rax
10e2: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
10e8: c3 retq
10e9: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
10f0: 80 3d 39 2f 00 00 00 cmpb $0x0,0x2f39(%rip) # 4030 <__cxa_finalize#plt+0x2ff0>
10f7: 75 2f jne 1128 <__cxa_finalize#plt+0xe8>
10f9: 55 push %rbp
10fa: 48 83 3d f6 2e 00 00 cmpq $0x0,0x2ef6(%rip) # 3ff8 <__cxa_finalize#plt+0x2fb8>
1101: 00
1102: 48 89 e5 mov %rsp,%rbp
1105: 74 0c je 1113 <__cxa_finalize#plt+0xd3>
1107: 48 8b 3d 1a 2f 00 00 mov 0x2f1a(%rip),%rdi # 4028 <__cxa_finalize#plt+0x2fe8>
110e: e8 2d ff ff ff callq 1040 <__cxa_finalize#plt>
1113: e8 68 ff ff ff callq 1080 <__cxa_finalize#plt+0x40>
1118: c6 05 11 2f 00 00 01 movb $0x1,0x2f11(%rip) # 4030 <__cxa_finalize#plt+0x2ff0>
111f: 5d pop %rbp
1120: c3 retq
1121: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
1128: c3 retq
1129: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
1130: e9 7b ff ff ff jmpq 10b0 <__cxa_finalize#plt+0x70>
1135: 55 push %rbp
1136: 48 89 e5 mov %rsp,%rbp
1139: 48 83 ec 10 sub $0x10,%rsp
113d: 89 7d fc mov %edi,-0x4(%rbp)
1140: 48 89 75 f0 mov %rsi,-0x10(%rbp)
1144: bf 00 00 00 00 mov $0x0,%edi
1149: e8 e2 fe ff ff callq 1030 <exit#plt>
114e: 66 90 xchg %ax,%ax
1150: 41 57 push %r15
1152: 4c 8d 3d 8f 2c 00 00 lea 0x2c8f(%rip),%r15 # 3de8 <__cxa_finalize#plt+0x2da8>
1159: 41 56 push %r14
115b: 49 89 d6 mov %rdx,%r14
115e: 41 55 push %r13
1160: 49 89 f5 mov %rsi,%r13
1163: 41 54 push %r12
1165: 41 89 fc mov %edi,%r12d
1168: 55 push %rbp
1169: 48 8d 2d 80 2c 00 00 lea 0x2c80(%rip),%rbp # 3df0 <__cxa_finalize#plt+0x2db0>
1170: 53 push %rbx
1171: 4c 29 fd sub %r15,%rbp
1174: 48 83 ec 08 sub $0x8,%rsp
1178: e8 83 fe ff ff callq 1000 <exit#plt-0x30>
117d: 48 c1 fd 03 sar $0x3,%rbp
1181: 74 1b je 119e <__cxa_finalize#plt+0x15e>
1183: 31 db xor %ebx,%ebx
1185: 0f 1f 00 nopl (%rax)
1188: 4c 89 f2 mov %r14,%rdx
118b: 4c 89 ee mov %r13,%rsi
118e: 44 89 e7 mov %r12d,%edi
1191: 41 ff 14 df callq *(%r15,%rbx,8)
1195: 48 83 c3 01 add $0x1,%rbx
1199: 48 39 dd cmp %rbx,%rbp
119c: 75 ea jne 1188 <__cxa_finalize#plt+0x148>
119e: 48 83 c4 08 add $0x8,%rsp
11a2: 5b pop %rbx
11a3: 5d pop %rbp
11a4: 41 5c pop %r12
11a6: 41 5d pop %r13
11a8: 41 5e pop %r14
11aa: 41 5f pop %r15
11ac: c3 retq
11ad: 0f 1f 00 nopl (%rax)
11b0: c3 retq
As you can see, the part that was actually written by me occupies very little space.
The same program (if we ignore the fact that the main function is also treated as a function in C) in Assembly:
.global _start
.text
_start: mov $60, %rax
xor %rdi, %rdi
syscall
Assembled, linked and disassembled with gcc -c demo.s && ld demo.o -o demo && objdump -d demo:
Disassembly of section .text:
0000000000401000 <_start>:
401000: 48 c7 c0 3c 00 00 00 mov $0x3c,%rax
401007: 48 31 ff xor %rdi,%rdi
40100a: 0f 05 syscall
The question is: what purpose do all these instructions serve and is there a way to generate code without them?
While I was writing the question I noticed that the C program calls exit() from the linked library whereas in Assembly I do it directly with a syscall. I don't think it is important in this case though.

gcc generates unnecessary (?) instructions
Yes, because you invoked GCC without asking for any compiler optimizations.
My recommendation: compile with
gcc -fverbose-asm -O2 -S test.c
then look inside the generated test.s assembler code.
BTW, most of the code is from crt0, which is given by, not emitted by, gcc. Build your executable with gcc -O2 -v test.c -o testprog to understand what GCC really does. Read documentation of GCC internals.
Since GCC is free software, you are allowed to look inside its source code and improve it. But the crt0 stuff is tricky, and operating system specific.
Consider also reading about linkers and loaders, about ELF executables, and How to write shared libraries, and the Linux Assembler HowTo.

gcc -s strips symbol names out of the final executable so you can't tell where different parts of the machine code came from.
Most of it is not from your main. To just see that, look at gcc -S output (asm source), e.g. on https://godbolt.org/. How to remove "noise" from GCC/clang assembly output?
Most of that is the CRT (C RunTime) startup code that eventually calls your main after initializing the standard library. (e.g. allocating memory for stdio buffers and so on.) It gets linked in regardless of how efficient your main is. e.g. compiling an empty int main(void){} with gcc -Os (optimize for size) will barely make it any smaller.
You could in theory compile with gcc -nostdlib and write your own _start that uses inline asm to make an exit system call.
See also
A Whirlwind Tutorial on Creating Really Teensy ELF Executables for Linux
How Get arguments value using inline assembly in C without Glibc? (getting command line args complicates the exercise of writing your own _start, but the answers there show how).

C program does a lots of stuff before calling the main function. It has to initialize .data and .bss segments, set the stack, go through the constructors and destructors (yes gcc in C has a special attributes for such a functions) and initializes the library.
gcc destructor and constructor functions:
void __attribute__ ((constructor)) funcname(void);
void __attribute__ ((destructor)) funcname(void);
you may have as many constructors and destructors as you wish.
constructors are called before call to the main function, destructors on exit from the program (after the main termination)
https://gcc.gnu.org/onlinedocs/gcc-4.7.0/gcc/Function-Attributes.html#Function-Attributes

Related

How to prevent gcc from reordering x86 frame pointer saving/setup instructions?

During my profiling with flamegraph, I found that callstacks are sometimes broken even when all codebases are compiled with the -fno-omit-frame-pointer flag. By checking the binary generated by gcc, I noticed gcc may reorder x86 frame pointer saving/setup instructions (i.e., push %rbp; move %rsp, %rbp), sometimes even after ret instructions of some branches. As shown in the example below, push %rbp; move %rsp, %rbp are put at the bottom of the function. It leads to incomplete and misleading callstacks when perf happens to sample instructions in the function before frame pointers are properly set.
C code:
int flextcp_fd_slookup(int fd, struct socket **ps)
{
struct socket *s;
if (fd >= MAXSOCK || fhs[fd].type != FH_SOCKET) {
errno = EBADF;
return -1;
}
uint32_t lock_val = 1;
s = fhs[fd].data.s;
asm volatile (
"1:\n"
"xchg %[locked], %[lv]\n"
"test %[lv], %[lv]\n"
"jz 3f\n"
"2:\n"
"pause\n"
"cmpl $0, %[locked]\n"
"jnz 2b\n"
"jmp 1b\n"
"3:\n"
: [locked] "=m" (s->sp_lock), [lv] "=q" (lock_val)
: "[lv]" (lock_val)
: "memory");
*ps = s;
return 0;
}
CMake Debug Profile:
0000000000007c73 <flextcp_fd_slookup>:
7c73: f3 0f 1e fa endbr64
7c77: 55 push %rbp
7c78: 48 89 e5 mov %rsp,%rbp
7c7b: 48 83 ec 20 sub $0x20,%rsp
7c7f: 89 7d ec mov %edi,-0x14(%rbp)
7c82: 48 89 75 e0 mov %rsi,-0x20(%rbp)
7c86: 81 7d ec ff ff 0f 00 cmpl $0xfffff,-0x14(%rbp)
7c8d: 7f 1b jg 7caa <flextcp_fd_slookup+0x37>
7c8f: 8b 45 ec mov -0x14(%rbp),%eax
7c92: 48 98 cltq
7c94: 48 c1 e0 04 shl $0x4,%rax
7c98: 48 89 c2 mov %rax,%rdx
7c9b: 48 8d 05 86 86 00 00 lea 0x8686(%rip),%rax # 10328 <fhs+0x8>
7ca2: 0f b6 04 02 movzbl (%rdx,%rax,1),%eax
7ca6: 3c 01 cmp $0x1,%al
7ca8: 74 12 je 7cbc <flextcp_fd_slookup+0x49>
7caa: e8 31 b9 ff ff callq 35e0 <__errno_location#plt>
7caf: c7 00 09 00 00 00 movl $0x9,(%rax)
7cb5: b8 ff ff ff ff mov $0xffffffff,%eax
7cba: eb 53 jmp 7d0f <flextcp_fd_slookup+0x9c>
7cbc: c7 45 f4 01 00 00 00 movl $0x1,-0xc(%rbp)
7cc3: 8b 45 ec mov -0x14(%rbp),%eax
7cc6: 48 98 cltq
7cc8: 48 c1 e0 04 shl $0x4,%rax
7ccc: 48 89 c2 mov %rax,%rdx
7ccf: 48 8d 05 4a 86 00 00 lea 0x864a(%rip),%rax # 10320 <fhs>
7cd6: 48 8b 04 02 mov (%rdx,%rax,1),%rax
7cda: 48 89 45 f8 mov %rax,-0x8(%rbp)
7cde: 48 8b 55 f8 mov -0x8(%rbp),%rdx
7ce2: 8b 45 f4 mov -0xc(%rbp),%eax
7ce5: 87 82 c0 00 00 00 xchg %eax,0xc0(%rdx)
7ceb: 85 c0 test %eax,%eax
7ced: 74 0d je 7cfc <flextcp_fd_slookup+0x89>
7cef: f3 90 pause
7cf1: 83 ba c0 00 00 00 00 cmpl $0x0,0xc0(%rdx)
7cf8: 75 f5 jne 7cef <flextcp_fd_slookup+0x7c>
7cfa: eb e9 jmp 7ce5 <flextcp_fd_slookup+0x72>
7cfc: 89 45 f4 mov %eax,-0xc(%rbp)
7cff: 48 8b 45 e0 mov -0x20(%rbp),%rax
7d03: 48 8b 55 f8 mov -0x8(%rbp),%rdx
7d07: 48 89 10 mov %rdx,(%rax)
7d0a: b8 00 00 00 00 mov $0x0,%eax
7d0f: c9 leaveq
7d10: c3 retq
CMake Release Profile:
0000000000007d80 <flextcp_fd_slookup>:
7d80: f3 0f 1e fa endbr64
7d84: 81 ff ff ff 0f 00 cmp $0xfffff,%edi
7d8a: 7f 44 jg 7dd0 <flextcp_fd_slookup+0x50>
7d8c: 48 63 ff movslq %edi,%rdi
7d8f: 48 8d 05 6a 85 00 00 lea 0x856a(%rip),%rax # 10300 <fhs>
7d96: 48 c1 e7 04 shl $0x4,%rdi
7d9a: 48 01 c7 add %rax,%rdi
7d9d: 80 7f 08 01 cmpb $0x1,0x8(%rdi)
7da1: 75 2d jne 7dd0 <flextcp_fd_slookup+0x50>
7da3: 48 8b 17 mov (%rdi),%rdx
7da6: b8 01 00 00 00 mov $0x1,%eax
7dab: 87 82 c0 00 00 00 xchg %eax,0xc0(%rdx)
7db1: 85 c0 test %eax,%eax
7db3: 74 0d je 7dc2 <flextcp_fd_slookup+0x42>
7db5: f3 90 pause
7db7: 83 ba c0 00 00 00 00 cmpl $0x0,0xc0(%rdx)
7dbe: 75 f5 jne 7db5 <flextcp_fd_slookup+0x35>
7dc0: eb e9 jmp 7dab <flextcp_fd_slookup+0x2b>
7dc2: 31 c0 xor %eax,%eax
7dc4: 48 89 16 mov %rdx,(%rsi)
7dc7: c3 retq
7dc8: 0f 1f 84 00 00 00 00 nopl 0x0(%rax,%rax,1)
7dcf: 00
7dd0: 55 push %rbp
7dd1: 48 89 e5 mov %rsp,%rbp
7dd4: e8 b7 b7 ff ff callq 3590 <__errno_location#plt>
7dd9: c7 00 09 00 00 00 movl $0x9,(%rax)
7ddf: b8 ff ff ff ff mov $0xffffffff,%eax
7de4: 5d pop %rbp
7de5: c3 retq
7de6: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
7ded: 00 00 00
Is there any way to prevent gcc from reordering these two instructions?
Edit: I use the default toolchain (gcc-11.2.0 + glibc 2.35) on Ubuntu 22.04. Sorry that a reproducible example is not available.
Edit: Add source code of the example function.

Try -fno-shrink-wrap
This looks like "shrink-wrap" optimization: only doing the function prologue in a code path where it's needed. The usual benefit is to run an early-out check before the prologue, not saving/restoring a bunch of registers on that path through the function.
But here, GCC decided to only do the prologue (setting up a frame pointer) if it had to call another function. That function is __errno_location in the error-return path. Oops. :P (And GCC correctly realized that's the uncommon case, and put it out-of-line after the ret through the fast path. So the fast path can be a straight line with no taken branches, other than inside your asm(). It's not a separate function, it's just tail-duplication of the one you showed source for.)
The main path through the function is very tiny, just a few C assignment statements and an asm() statement. GCC doesn't have a clear idea of how big an asm block is (although I think has some heuristics, but is still rather willing to inline one). And it has no idea if there might be loops or any significant time spent in an asm block.
This is a known issue, GCC bug #98018 suggested that GCC should have an option to force frame-pointer setup at the actual top of a function. Because there currently isn't an option that's 100% reliable, other than disabling optimization which is not usable. (Thanks to #Margaret Bloom for finding & linking this.)
As comment 6 on that GCC bug mentions, disabling shrink-wrapping is part of what's necessary to make sure GCC sets up the frame pointer at the top of the function itself, not just inside some if that needs the prologue.
That GCC issue seems to be considering a feature that would stop function inlining, so backtraces would fully reflect the C abstract machine's nesting of function calls. That goes beyond what you're looking for, which I think is just to have frame pointers set up on entry to functions that exist in the asm after optimization.
Disabling shrink-wrapping will force the whole prologue to happen there, including push of other regs, if there were any. Not just the frame pointer.
But here there aren't any others. Still, with optimization enabled in general, losing shrink-wrapping is probably pretty minor.

How does the GNU toolchain decide to use near vs. short jump instructions?

I've got some code that gcc (4.8.5, if it matters) compiles into nearly identical binary on two different machines except for one spot, where something in the toolchain on one machine decides to use a "near" JE instruction while the toolchain on the other machine decides to use a "short" JE instruction:
41e274: 85 ed test %ebp,%ebp 41e274: 85 ed test %ebp,%ebp
41e276: 0f 84 14 00 00 00 je 41e290 | 41e276: 74 18 je 41e290
41e27c: 31 ed xor %ebp,%ebp | 41e278: 31 ed xor %ebp,%ebp
41e27e: 48 81 c4 30 01 00 00 add $0x130,%rsp | 41e27a: 48 81 c4 30 01 00 00 add $0x130,%rsp
41e285: 89 e8 mov %ebp,%eax | 41e281: 89 e8 mov %ebp,%eax
41e287: 5b pop %rbx | 41e283: 5b pop %rbx
41e288: 5d pop %rbp | 41e284: 5d pop %rbp
41e289: 41 5c pop %r12 | 41e285: 41 5c pop %r12
41e28b: c3 retq | 41e287: c3 retq
41e28c: 0f 1f 40 00 nopl 0x0(%rax) | 41e288: 0f 1f 84 00 00 00 00 nopl 0x0(%rax,%rax,
> 41e28f: 00
41e290: 48 8d 7c 24 60 lea 0x60(%rsp),%rdi 41e290: 48 8d 7c 24 60 lea 0x60(%rsp),%rdi
I'm assuming this is some sort of optimization (though I'm unclear as to whether it's the compiler or assembler that is being clever).
The odd thing, though, is that (in the same object file) there are other "nearby je branches" (to less than 40 or so bytes away) that are not similarly optimized to the single byte opcode:
c320: 45 85 ed test %r13d,%r13d c320: 45 85 ed test %r13d,%r13d
c323: 0f 84 17 00 00 00 je c340 c323: 0f 84 17 00 00 00 je c340
c329: 45 31 ed xor %r13d,%r13d c329: 45 31 ed xor %r13d,%r13d
c32c: 48 81 c4 30 01 00 00 add $0x130,%rsp c32c: 48 81 c4 30 01 00 00 add $0x130,%rsp
c333: 44 89 e8 mov %r13d,%eax c333: 44 89 e8 mov %r13d,%eax
c336: 5b pop %rbx c336: 5b pop %rbx
c337: 5d pop %rbp c337: 5d pop %rbp
c338: 41 5c pop %r12 c338: 41 5c pop %r12
c33a: 41 5d pop %r13 c33a: 41 5d pop %r13
c33c: 41 5e pop %r14 c33c: 41 5e pop %r14
c33e: c3 retq c33e: c3 retq
c33f: 90 nop c33f: 90 nop
c340: 4c 8d 74 24 60 lea 0x60(%rsp),%r14 c340: 4c 8d 74 24 60 lea 0x60(%rsp),%r14
Is there some parameter/configuration for gcc (or ld) that I can use to control this? I'd like to ensure that my code results in the same binary when compiled elsewhere.

Assembly Interpretation - Register confusion

I'm working on an assignment for class where I have to interpret assembly. I know the input to defuse the bomb is 442, but I'm not exactly sure why.
8048c80: 83 ec 2c sub $0x2c,%esp
8048c83: c7 44 24 1c 00 00 00 movl $0x0,0x1c(%esp)
8048c8a: 00
8048c8b: 8d 44 24 1c lea 0x1c(%esp),%eax
8048c8f: 89 44 24 08 mov %eax,0x8(%esp)
8048c93: c7 44 24 04 64 a7 04 movl $0x804a764,0x4(%esp)
8048c9a: 08
8048c9b: 8b 44 24 30 mov 0x30(%esp),%eax
8048c9f: 89 04 24 mov %eax,(%esp)
8048ca2: e8 59 fc ff ff call 8048900 <__isoc99_sscanf#plt>
8048ca7: 83 f8 01 cmp $0x1,%eax
8048caa: 74 05 je 8048cb1 <phase_1+0x31>
8048cac: e8 e4 07 00 00 call 8049495 <explode_bomb>
8048cb1: 81 7c 24 1c ba 01 00 cmpl $0x1ba,0x1c(%esp)
8048cb8: 00
8048cb9: 74 05 je 8048cc0 <phase_1+0x40>
8048cbb: e8 d5 07 00 00 call 8049495 <explode_bomb>
8048cc0: 83 c4 2c add $0x2c,%esp
8048cc3: c3 ret
Sscanf takes two values, "%d" and my inputted value, but I'm not sure where it stores the value or why %eax is 1 or why 0x1c(%esp) has the value. We store 0x0 there at the beginning, and then move 0x30(%esp), %eax, so shouldn't it be 0? Any help understanding this would be very much appreciated.
To be clear, this is x86 in at&t syntax.

Deciphering x86 assembly function

I am currently working on phase 2 of the binary bomb assignment. I'm having trouble deciphering exactly what a certain function does when called. I've been stuck on it for days.
The function is:
0000000000400f2a <func2a>:
400f2a: 85 ff test %edi,%edi
400f2c: 74 1d je 400f4b <func2a+0x21>
400f2e: b9 cd cc cc cc mov $0xcccccccd,%ecx
400f33: 89 f8 mov %edi,%eax
400f35: f7 e1 mul %ecx
400f37: c1 ea 03 shr $0x3,%edx
400f3a: 8d 04 92 lea (%rdx,%rdx,4),%eax
400f3d: 01 c0 add %eax,%eax
400f3f: 29 c7 sub %eax,%edi
400f41: 83 04 be 01 addl $0x1,(%rsi,%rdi,4)
400f45: 89 d7 mov %edx,%edi
400f47: 85 d2 test %edx,%edx
400f49: 75 e8 jne 400f33 <func2a+0x9>
400f4b: f3 c3 repz retq
It gets called in the larger function "phase_2":
0000000000400f4d <phase_2>:
400f4d: 53 push %rbx
400f4e: 48 83 ec 60 sub $0x60,%rsp
400f52: 48 c7 44 24 30 00 00 movq $0x0,0x30(%rsp)
400f59: 00 00
400f5b: 48 c7 44 24 38 00 00 movq $0x0,0x38(%rsp)
400f62: 00 00
400f64: 48 c7 44 24 40 00 00 movq $0x0,0x40(%rsp)
400f6b: 00 00
400f6d: 48 c7 44 24 48 00 00 movq $0x0,0x48(%rsp)
400f74: 00 00
400f76: 48 c7 44 24 50 00 00 movq $0x0,0x50(%rsp)
400f7d: 00 00
400f7f: 48 c7 04 24 00 00 00 movq $0x0,(%rsp)
400f86: 00
400f87: 48 c7 44 24 08 00 00 movq $0x0,0x8(%rsp)
400f8e: 00 00
400f90: 48 c7 44 24 10 00 00 movq $0x0,0x10(%rsp)
400f97: 00 00
400f99: 48 c7 44 24 18 00 00 movq $0x0,0x18(%rsp)
400fa0: 00 00
400fa2: 48 c7 44 24 20 00 00 movq $0x0,0x20(%rsp)
400fa9: 00 00
400fab: 48 8d 4c 24 58 lea 0x58(%rsp),%rcx
400fb0: 48 8d 54 24 5c lea 0x5c(%rsp),%rdx
400fb5: be 9e 26 40 00 mov $0x40269e,%esi
400fba: b8 00 00 00 00 mov $0x0,%eax
400fbf: e8 6c fc ff ff callq 400c30 <__isoc99_sscanf#plt>
400fc4: 83 f8 02 cmp $0x2,%eax
400fc7: 74 05 je 400fce <phase_2+0x81>
400fc9: e8 c1 06 00 00 callq 40168f <explode_bomb>
400fce: 83 7c 24 5c 64 cmpl $0x64,0x5c(%rsp)
400fd3: 76 07 jbe 400fdc <phase_2+0x8f>
400fd5: 83 7c 24 58 64 cmpl $0x64,0x58(%rsp)
400fda: 77 05 ja 400fe1 <phase_2+0x94>
400fdc: e8 ae 06 00 00 callq 40168f <explode_bomb>
400fe1: 48 8d 74 24 30 lea 0x30(%rsp),%rsi
400fe6: 8b 7c 24 5c mov 0x5c(%rsp),%edi
400fea: e8 3b ff ff ff callq 400f2a <func2a>
400fef: 48 89 e6 mov %rsp,%rsi
400ff2: 8b 7c 24 58 mov 0x58(%rsp),%edi
400ff6: e8 2f ff ff ff callq 400f2a <func2a>
400ffb: bb 00 00 00 00 mov $0x0,%ebx
401000: 8b 04 1c mov (%rsp,%rbx,1),%eax
401003: 39 44 1c 30 cmp %eax,0x30(%rsp,%rbx,1)
401007: 74 05 je 40100e <phase_2+0xc1>
401009: e8 81 06 00 00 callq 40168f <explode_bomb>
40100e: 48 83 c3 04 add $0x4,%rbx
401012: 48 83 fb 28 cmp $0x28,%rbx
401016: 75 e8 jne 401000 <phase_2+0xb3>
401018: 48 83 c4 60 add $0x60,%rsp
40101c: 5b pop %rbx
40101d: c3 retq
I completely understand what phase_2 is doing, I just don't understand what func2a is doing and how it affects the values at 0x30(%rsp) and so on. Because of this I always get to the comparison statement at 0x401003, and the bomb eventually explodes there.
My problem is I don't understand how the input (phase solution) is affecting the values at 0x30(%rsp) via func2a.

400f2a: 85 ff test %edi,%edi
400f2c: 74 1d je 400f4b <func2a+0x21>
This is just an early exit for when edi is zero (je is the same as jz).
400f2e: b9 cd cc cc cc mov $0xcccccccd,%ecx
400f33: 89 f8 mov %edi,%eax
400f35: f7 e1 mul %ecx
400f37: c1 ea 03 shr $0x3,%edx
This is a classic optimization trick; it is the integer arithmetic equivalent of dividing by multiplying by the inverse (see here for details); in practice, here it's the same as saying edx = edi / 10;
400f3a: 8d 04 92 lea (%rdx,%rdx,4),%eax
400f3d: 01 c0 add %eax,%eax
Here it is exploiting lea to perform arithmetic (and it's way clearer in Intel syntax, where it is lea eax,[rdx+rdx*4] => eax = edx*5), then sums the result with itself. It all boils down to eax = edx*10.
400f3f: 29 c7 sub %eax,%edi
Then, subtract it back to edi.
So, all in all this is a complicated (but fast) way to compute the last decimal digit of edi; what we have until now is something like:
void func2a(unsigned edi) {
if(edi==0) return;
label1:
edx=edi/10;
edi%=10;
// ...
}
(label1: is there because 400f33 is a jump target later)
Going on:
400f41: 83 04 be 01 addl $0x1,(%rsi,%rdi,4)
Again, this is way clearer to me in Intel syntax - add dword [rsi+rdi*4],byte +0x1. It is a regular increment into an array of 32-bit int (rdi is multiplied by 4); so, we can imagine that rsi points to an array of integers, indexed with the just-calculated last digit of edi.
void func2a(unsigned edi, int rsi[]) {
if(edi==0) return;
label1:
edx=edi/10;
edi%=10;
rsi[edi]++;
}
Then:
400f45: 89 d7 mov %edx,%edi
400f47: 85 d2 test %edx,%edx
400f49: 75 e8 jne 400f33 <func2a+0x9>
Move the result of the division we calculated above to edi, and loop if it's different from zero.
400f4b: f3 c3 repz retq
Return (using an unusual encoding of the instruction that is optimal for certain AMD processors).
So, by rewriting the jumps with a while loop and giving some meaningful names...
// number is edi, digits_count is rsi, as per regular
// x64 SystemV calling convention
void count_digits(unsigned number, int digits_count[]) {
while(number) {
digits_count[number%10]++;
number/=10;
}
}
I.e., this is a function that, given an integer, counts the occurrences of the single decimal digits, by incrementing the corresponding buckets in the digits_count array.
Fun fact: if we give the C code above to gcc (almost any recent version at -O1) we obtain back exactly the assembly you provided.

SIZE command in UNIX

The following is my C file:
int main()
{
return 36;
}
It contains only return statement. But if I use the size command, it shows the
output like this:
mohanraj#ltsp63:~/Development/chap8$ size a.out
text data bss dec hex filename
1056 252 8 1316 524 a.out
mohanraj#ltsp63:~/Development/chap8$
Even though my program does not contain any global variable, or undeclared data. But, the output shows data segment have 252 and the bss have 8 bytes. So, why the output is like this? what is 252 and 8 refers.

Size Command
First see the definition of each column:
text - Actual machine instructions that your CPU going to execute. Linux allows to share this data.
data - All initialized variables (declarations) declared in a program (e.g., float salary=123.45;).
bss - The BSS consists of uninitialized data such as arrays that you have not set any values to or null pointers.
As Blue Moon said. On Linux, the execution starts by calling _start() function. Which does environment setup. Every C program has hidden "libraries" that depends on compilator you using. There are settings for global parameters, exit calls and after complete configuration it finally calls your main() function.
ASFAIK there's no way to see how your code looks encapsulated with configuration and _start() function. But I can show you that even your code contains more information than you thought the closer to hardware we are.
Hint:
Type readelf -a a.out to see how much information your exec really carrying.
What is inside?
Do not compare code in your source file to the size of executable file, it depends on the OS, compilator, and used libraries.
In my example, with exactly the same code, SIZE returns:
eryk#eryk-pc:~$ gcc a.c
eryk#eryk-pc:~$ size a.out
text data bss dec hex filename
1033 276 4 1313 521 a.out
Let's see what is inside...
eryk#eryk-pc:~$ gcc -S a.c
This will run the preprocessor over a.c, perform the initial compilation and then stop before the assembler is run.
eryk#eryk-pc:~$ cat a.s
.file "a.c"
.text
.globl main
.type main, #function
main:
.LFB0:
.cfi_startproc
pushl %ebp
.cfi_def_cfa_offset 8
.cfi_offset 5, -8
movl %esp, %ebp
.cfi_def_cfa_register 5
movl $36, %eax
popl %ebp
.cfi_restore 5
.cfi_def_cfa 4, 4
ret
.cfi_endproc
.LFE0:
.size main, .-main
.ident "GCC: (Ubuntu 4.8.2-19ubuntu1) 4.8.2"
.section .note.GNU-stack,"",#progbits
Then look on the assembly code
eryk#eryk-pc:~$ objdump -d -M intel -S a.out
a.out: file format elf32-i386
Disassembly of section .init:
08048294 <_init>:
8048294: 53 push ebx
8048295: 83 ec 08 sub esp,0x8
8048298: e8 83 00 00 00 call 8048320 <__x86.get_pc_thunk.bx>
804829d: 81 c3 63 1d 00 00 add ebx,0x1d63
80482a3: 8b 83 fc ff ff ff mov eax,DWORD PTR [ebx-0x4]
80482a9: 85 c0 test eax,eax
80482ab: 74 05 je 80482b2 <_init+0x1e>
80482ad: e8 1e 00 00 00 call 80482d0 <__gmon_start__#plt>
80482b2: 83 c4 08 add esp,0x8
80482b5: 5b pop ebx
80482b6: c3 ret
Disassembly of section .plt:
080482c0 <__gmon_start__#plt-0x10>:
80482c0: ff 35 04 a0 04 08 push DWORD PTR ds:0x804a004
80482c6: ff 25 08 a0 04 08 jmp DWORD PTR ds:0x804a008
80482cc: 00 00 add BYTE PTR [eax],al
...
080482d0 <__gmon_start__#plt>:
80482d0: ff 25 0c a0 04 08 jmp DWORD PTR ds:0x804a00c
80482d6: 68 00 00 00 00 push 0x0
80482db: e9 e0 ff ff ff jmp 80482c0 <_init+0x2c>
080482e0 <__libc_start_main#plt>:
80482e0: ff 25 10 a0 04 08 jmp DWORD PTR ds:0x804a010
80482e6: 68 08 00 00 00 push 0x8
80482eb: e9 d0 ff ff ff jmp 80482c0 <_init+0x2c>
Disassembly of section .text:
080482f0 <_start>:
80482f0: 31 ed xor ebp,ebp
80482f2: 5e pop esi
80482f3: 89 e1 mov ecx,esp
80482f5: 83 e4 f0 and esp,0xfffffff0
80482f8: 50 push eax
80482f9: 54 push esp
80482fa: 52 push edx
80482fb: 68 70 84 04 08 push 0x8048470
8048300: 68 00 84 04 08 push 0x8048400
8048305: 51 push ecx
8048306: 56 push esi
8048307: 68 ed 83 04 08 push 0x80483ed
804830c: e8 cf ff ff ff call 80482e0 <__libc_start_main#plt>
8048311: f4 hlt
8048312: 66 90 xchg ax,ax
8048314: 66 90 xchg ax,ax
8048316: 66 90 xchg ax,ax
8048318: 66 90 xchg ax,ax
804831a: 66 90 xchg ax,ax
804831c: 66 90 xchg ax,ax
804831e: 66 90 xchg ax,ax
08048320 <__x86.get_pc_thunk.bx>:
8048320: 8b 1c 24 mov ebx,DWORD PTR [esp]
8048323: c3 ret
8048324: 66 90 xchg ax,ax
8048326: 66 90 xchg ax,ax
8048328: 66 90 xchg ax,ax
804832a: 66 90 xchg ax,ax
804832c: 66 90 xchg ax,ax
804832e: 66 90 xchg ax,ax
08048330 <deregister_tm_clones>:
8048330: b8 1f a0 04 08 mov eax,0x804a01f
8048335: 2d 1c a0 04 08 sub eax,0x804a01c
804833a: 83 f8 06 cmp eax,0x6
804833d: 77 01 ja 8048340 <deregister_tm_clones+0x10>
804833f: c3 ret
8048340: b8 00 00 00 00 mov eax,0x0
8048345: 85 c0 test eax,eax
8048347: 74 f6 je 804833f <deregister_tm_clones+0xf>
8048349: 55 push ebp
804834a: 89 e5 mov ebp,esp
804834c: 83 ec 18 sub esp,0x18
804834f: c7 04 24 1c a0 04 08 mov DWORD PTR [esp],0x804a01c
8048356: ff d0 call eax
8048358: c9 leave
8048359: c3 ret
804835a: 8d b6 00 00 00 00 lea esi,[esi+0x0]
08048360 <register_tm_clones>:
8048360: b8 1c a0 04 08 mov eax,0x804a01c
8048365: 2d 1c a0 04 08 sub eax,0x804a01c
804836a: c1 f8 02 sar eax,0x2
804836d: 89 c2 mov edx,eax
804836f: c1 ea 1f shr edx,0x1f
8048372: 01 d0 add eax,edx
8048374: d1 f8 sar eax,1
8048376: 75 01 jne 8048379 <register_tm_clones+0x19>
8048378: c3 ret
8048379: ba 00 00 00 00 mov edx,0x0
804837e: 85 d2 test edx,edx
8048380: 74 f6 je 8048378 <register_tm_clones+0x18>
8048382: 55 push ebp
8048383: 89 e5 mov ebp,esp
8048385: 83 ec 18 sub esp,0x18
8048388: 89 44 24 04 mov DWORD PTR [esp+0x4],eax
804838c: c7 04 24 1c a0 04 08 mov DWORD PTR [esp],0x804a01c
8048393: ff d2 call edx
8048395: c9 leave
8048396: c3 ret
8048397: 89 f6 mov esi,esi
8048399: 8d bc 27 00 00 00 00 lea edi,[edi+eiz*1+0x0]
080483a0 <__do_global_dtors_aux>:
80483a0: 80 3d 1c a0 04 08 00 cmp BYTE PTR ds:0x804a01c,0x0
80483a7: 75 13 jne 80483bc <__do_global_dtors_aux+0x1c>
80483a9: 55 push ebp
80483aa: 89 e5 mov ebp,esp
80483ac: 83 ec 08 sub esp,0x8
80483af: e8 7c ff ff ff call 8048330 <deregister_tm_clones>
80483b4: c6 05 1c a0 04 08 01 mov BYTE PTR ds:0x804a01c,0x1
80483bb: c9 leave
80483bc: f3 c3 repz ret
80483be: 66 90 xchg ax,ax
080483c0 <frame_dummy>:
80483c0: a1 10 9f 04 08 mov eax,ds:0x8049f10
80483c5: 85 c0 test eax,eax
80483c7: 74 1f je 80483e8 <frame_dummy+0x28>
80483c9: b8 00 00 00 00 mov eax,0x0
80483ce: 85 c0 test eax,eax
80483d0: 74 16 je 80483e8 <frame_dummy+0x28>
80483d2: 55 push ebp
80483d3: 89 e5 mov ebp,esp
80483d5: 83 ec 18 sub esp,0x18
80483d8: c7 04 24 10 9f 04 08 mov DWORD PTR [esp],0x8049f10
80483df: ff d0 call eax
80483e1: c9 leave
80483e2: e9 79 ff ff ff jmp 8048360 <register_tm_clones>
80483e7: 90 nop
80483e8: e9 73 ff ff ff jmp 8048360 <register_tm_clones>
080483ed <main>:
80483ed: 55 push ebp
80483ee: 89 e5 mov ebp,esp
80483f0: b8 24 00 00 00 mov eax,0x24
80483f5: 5d pop ebp
80483f6: c3 ret
80483f7: 66 90 xchg ax,ax
80483f9: 66 90 xchg ax,ax
80483fb: 66 90 xchg ax,ax
80483fd: 66 90 xchg ax,ax
80483ff: 90 nop
08048400 <__libc_csu_init>:
8048400: 55 push ebp
8048401: 57 push edi
8048402: 31 ff xor edi,edi
8048404: 56 push esi
8048405: 53 push ebx
8048406: e8 15 ff ff ff call 8048320 <__x86.get_pc_thunk.bx>
804840b: 81 c3 f5 1b 00 00 add ebx,0x1bf5
8048411: 83 ec 1c sub esp,0x1c
8048414: 8b 6c 24 30 mov ebp,DWORD PTR [esp+0x30]
8048418: 8d b3 0c ff ff ff lea esi,[ebx-0xf4]
804841e: e8 71 fe ff ff call 8048294 <_init>
8048423: 8d 83 08 ff ff ff lea eax,[ebx-0xf8]
8048429: 29 c6 sub esi,eax
804842b: c1 fe 02 sar esi,0x2
804842e: 85 f6 test esi,esi
8048430: 74 27 je 8048459 <__libc_csu_init+0x59>
8048432: 8d b6 00 00 00 00 lea esi,[esi+0x0]
8048438: 8b 44 24 38 mov eax,DWORD PTR [esp+0x38]
804843c: 89 2c 24 mov DWORD PTR [esp],ebp
804843f: 89 44 24 08 mov DWORD PTR [esp+0x8],eax
8048443: 8b 44 24 34 mov eax,DWORD PTR [esp+0x34]
8048447: 89 44 24 04 mov DWORD PTR [esp+0x4],eax
804844b: ff 94 bb 08 ff ff ff call DWORD PTR [ebx+edi*4-0xf8]
8048452: 83 c7 01 add edi,0x1
8048455: 39 f7 cmp edi,esi
8048457: 75 df jne 8048438 <__libc_csu_init+0x38>
8048459: 83 c4 1c add esp,0x1c
804845c: 5b pop ebx
804845d: 5e pop esi
804845e: 5f pop edi
804845f: 5d pop ebp
8048460: c3 ret
8048461: eb 0d jmp 8048470 <__libc_csu_fini>
8048463: 90 nop
8048464: 90 nop
8048465: 90 nop
8048466: 90 nop
8048467: 90 nop
8048468: 90 nop
8048469: 90 nop
804846a: 90 nop
804846b: 90 nop
804846c: 90 nop
804846d: 90 nop
804846e: 90 nop
804846f: 90 nop
08048470 <__libc_csu_fini>:
8048470: f3 c3 repz ret
Disassembly of section .fini:
08048474 <_fini>:
8048474: 53 push ebx
8048475: 83 ec 08 sub esp,0x8
8048478: e8 a3 fe ff ff call 8048320 <__x86.get_pc_thunk.bx>
804847d: 81 c3 83 1b 00 00 add ebx,0x1b83
8048483: 83 c4 08 add esp,0x8
8048486: 5b pop ebx
8048487: c3 ret
Next step would converting above code to 01 notation.
As you can see. Even simple c program contains complicated operation the closer to hardware your code is. I hope I have explained to you why the executable file is bigger than you thought. If you have any doubts, feel free to comment my post. I will edit my answer immediately.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight

gcc generates unnecessary (?) instructions - c

Related

How to prevent gcc from reordering x86 frame pointer saving/setup instructions?

How does the GNU toolchain decide to use near vs. short jump instructions?

Assembly Interpretation - Register confusion

Deciphering x86 assembly function

SIZE command in UNIX

Categories

Resources