Why does the compiler generate such code when initializing a volatile array? - c

I have the following program that enables the alignment check (AC) bit in the x86 processor flags register in order to trap unaligned memory accesses. Then the program declares two volatile variables:
#include <assert.h>
int main(void)
{
#ifndef NOASM
__asm__(
"pushf\n"
"orl $(1<<18),(%esp)\n"
"popf\n"
);
#endif
volatile unsigned char foo[] = { 1, 2, 3, 4, 5, 6 };
volatile unsigned int bar = 0xaa;
return 0;
}
If I compile this, the code generated initially does the
obvious things like setting up the stack and creating the array of chars by moving the values 1, 2, 3, 4, 5, 6 onto the stack:
/tmp ➤ gcc test3.c -m32
/tmp ➤ gdb ./a.out
(gdb) disassemble main
0x0804843d <+0>: push %ebp
0x0804843e <+1>: mov %esp,%ebp
0x08048440 <+3>: and $0xfffffff0,%esp
0x08048443 <+6>: sub $0x20,%esp
0x08048446 <+9>: mov %gs:0x14,%eax
0x0804844c <+15>: mov %eax,0x1c(%esp)
0x08048450 <+19>: xor %eax,%eax
0x08048452 <+21>: pushf
0x08048453 <+22>: orl $0x40000,(%esp)
0x0804845a <+29>: popf
0x0804845b <+30>: movb $0x1,0x16(%esp)
0x08048460 <+35>: movb $0x2,0x17(%esp)
0x08048465 <+40>: movb $0x3,0x18(%esp)
0x0804846a <+45>: movb $0x4,0x19(%esp)
0x0804846f <+50>: movb $0x5,0x1a(%esp)
0x08048474 <+55>: movb $0x6,0x1b(%esp)
0x08048479 <+60>: mov 0x16(%esp),%eax
0x0804847d <+64>: mov %eax,0x10(%esp)
0x08048481 <+68>: movzwl 0x1a(%esp),%eax
0x08048486 <+73>: mov %ax,0x14(%esp)
0x0804848b <+78>: movl $0xaa,0xc(%esp)
0x08048493 <+86>: mov $0x0,%eax
0x08048498 <+91>: mov 0x1c(%esp),%edx
0x0804849c <+95>: xor %gs:0x14,%edx
0x080484a3 <+102>: je 0x80484aa <main+109>
0x080484a5 <+104>: call 0x8048310 <__stack_chk_fail#plt>
0x080484aa <+109>: leave
0x080484ab <+110>: ret
However at main+60 it does something strange: it moves the array of 6 bytes to another part of the stack: the data is moved one 4-byte word at a time in registers. But the bytes start at offset 0x16, which is not aligned, so the program will crash when attempting to perform the mov.
So I've two questions:
Why is the compiler emitting code to copy the array to another part of the stack? I assumed volatile would skip every optimization and always perform memory accesses. Maybe volatile vars are required to always be accessed as whole words, and so the compiler would always use temporary registers to read/write whole words?
Why does the compiler not put the char array at an aligned address if it later intends to do these mov calls? I understand that x86 is normally safe with unaligned accesses, and on modern processors it does not even carry a performance penalty; however in all other instances I see the compiler trying to avoid generating unaligned accesses, since they are considered, AFAIK, an unspecified behavior in C. My guess is that, since later it provides a properly aligned pointer for the copied array on the stack, it just does not care about alignment of the data used only for initialization in a way which is invisible to the C program?
If my hypotheses above are right, it means that I cannot expect an x86 compiler to always generate aligned accesses, even if the compiled code never attempts to perform unaligned accesses itself, and so setting the AC flag is not a practical way to detect parts of the code where unaligned accesses are performed.
EDIT: After further research I can answer most of this myself. In an attempt to make progress, I added an option in Redis to set the AC flag and otherwise run normally. I found that this approach is not viable: the process immediately crashes inside libc: __mempcpy_sse2 () at ../sysdeps/x86_64/memcpy.S:83. I assume that the whole x86 software stack simply does not really care about misalignment since it is handled very well by this architecture. Thus it is not practical to run with the AC flag set.
So the answer to question 2 above is that, like the rest of the software stack, the compiler is free to do as it pleases, and relocate things on the stack without caring about alignment, so long as the behavior is correct from the perspective of the C program.
The only question left to answer, is why with volatile, is a copy made in a different part of the stack? My best guess is that the compiler is attempting to access whole words in variables declared volatile even during initialization (imagine if this address was mapped to an I/O port), but I'm not sure.

You're compiling without optimization, so the compiler is generating straight-forward code without worrying about how inefficient it is. So it first creates the initializer { 1, 2, 3, 4, 5, 6 } in temp space on the stack, and it then copies that into the space it allocated for foo.

The compiler populates the array in a working storage area, one byte at a time, which is not atomic. It then moves the entire array to its final resting place using an atomic MOVZ instruction (the atomicity is implicit when the target address is naturally aligned).
The write has to be atomic because the compiler must assume (due to the volatile keyword) that the array can be accessed at any time by anyone else.

Related

Difference in x86-32 and x64 Assembly stack allocation for a fixed-size buffer with unoptimized C (GCC)

Doing some basic disassembly and have noticed that the buffer is being given additional buffer space for some reason although what i am looking at in a tutorial uses the same code but is only given the correct (500) chars in length. Why is this?
My code:
#include <stdio.h>
#include <string.h>
int main (int argc, char** argv){
char buffer[500];
strcpy(buffer, argv[1]);
return 0;
}
compiled with GCC, the dissembled code is:
0x0000000000001139 <+0>: push %rbp
0x000000000000113a <+1>: mov %rsp,%rbp
0x000000000000113d <+4>: sub $0x210,%rsp
0x0000000000001144 <+11>: mov %edi,-0x204(%rbp)
0x000000000000114a <+17>: mov %rsi,-0x210(%rbp)
0x0000000000001151 <+24>: mov -0x210(%rbp),%rax
0x0000000000001158 <+31>: add $0x8,%rax
0x000000000000115c <+35>: mov (%rax),%rdx
0x000000000000115f <+38>: lea -0x200(%rbp),%rax
0x0000000000001166 <+45>: mov %rdx,%rsi
0x0000000000001169 <+48>: mov %rax,%rdi
0x000000000000116c <+51>: call 0x1030 <strcpy#plt>
0x0000000000001171 <+56>: mov $0x0,%eax
0x0000000000001176 <+61>: leave
0x0000000000001177 <+62>: ret
However, this video https://www.youtube.com/watch?v=1S0aBV-Waeo clearly only has 500 bytes assigned
Why is this this the case as the only difference I can see here is one is 32-bit and another (mine) is on x86-64.
500 is not a multiple of 16.
The x86-64 ABI (application binary interface) requires the stack pointer to be a multiple of 16 whenever a call instruction is about to happen. (Since call pushes an 8-byte return address, this means the stack pointer is always congruent to 8, mod 16, when control reaches the first instruction of a called function.) For the code shown, it is convenient for the compiler to achieve this requirement by increasing the value it uses in the sub instruction, making it be a multiple of 16.
The x86-32 ABI did not make this requirement, so there was no reason for the compiler used in the video to increase the size of the stack frame.
Note that you appear to have compiled your code without optimization. I get this at -O2:
0x0000000000000000 <+0>: sub $0x208,%rsp
0x0000000000000007 <+7>: mov 0x8(%rsi),%rsi
0x000000000000000b <+11>: mov %rsp,%rdi
0x000000000000000e <+14>: call <strcpy#PLT>
0x0000000000000013 <+19>: xor %eax,%eax
0x0000000000000015 <+21>: add $0x208,%rsp
0x000000000000001c <+28>: ret
The stack adjustment is still somewhat larger than the size of the array, but not as big as what you had, and no longer a multiple of 16; the difference is that with optimization on, the frame pointer is eliminated, so %rbp does not need to be saved and restored, and so the stack pointer is not a multiple of 16 at the point of the sub instruction.
(Incidentally, there is no requirement anywhere for a stack frame to be as small as possible. "Quality of implementation" dictates that it should be as small as possible, but for various reasons it's quite common for the compiler to miss that target. In my optimized code dump, I don't see any reason why the immediate operand to sub and add couldn't have been 0x1f8 (504).

Why does compiler decide to put memory variables to '%fs' registor? [duplicate]

I've written a piece of C code and I've disassembled it as well as read the registers to understand how the program works in assembly.
int test(char *this){
char sum_buf[6];
strncpy(sum_buf,this,32);
return 0;
}
The piece of my code that I've been examining is the test function. When I disassemble the output my test function I get ...
0x00000000004005c0 <+12>: mov %fs:0x28,%rax
=> 0x00000000004005c9 <+21>: mov %rax,-0x8(%rbp)
... stuff ..
0x00000000004005f0 <+60>: xor %fs:0x28,%rdx
0x00000000004005f9 <+69>: je 0x400600 <test+76>
0x00000000004005fb <+71>: callq 0x4004a0 <__stack_chk_fail#plt>
0x0000000000400600 <+76>: leaveq
0x0000000000400601 <+77>: retq
What I would like to know is what mov %fs:0x28,%rax is really doing?
Both the FS and GS registers can be used as base-pointer addresses in order to access special operating system data-structures. So what you're seeing is a value loaded at an offset from the value held in the FS register, and not bit manipulation of the contents of the FS register.
Specifically what's taking place, is that FS:0x28 on Linux is storing a special sentinel stack-guard value, and the code is performing a stack-guard check. For instance, if you look further in your code, you'll see that the value at FS:0x28 is stored on the stack, and then the contents of the stack are recalled and an XOR is performed with the original value at FS:0x28. If the two values are equal, which means that the zero-bit has been set because XOR'ing two of the same values results in a zero-value, then we jump to the test routine, otherwise we jump to a special function that indicates that the stack was somehow corrupted, and the sentinel value stored on the stack was changed.
If using GCC, this can be disabled with:
-fno-stack-protector
glibc:
uintptr_t stack_chk_guard = _dl_setup_stack_chk_guard (_dl_random);
# ifdef THREAD_SET_STACK_GUARD
THREAD_SET_STACK_GUARD (stack_chk_guard);
the _dl_random from kernel.
Looking at http://www.imada.sdu.dk/Courses/DM18/Litteratur/IntelnATT.htm, I think %fs:28 is actually an offset of 28 bytes from the address in %fs. So I think it's loading a full register size from location %fs + 28 into %rax.

Assembly: Purpose of loading the effective address before a call to a function?

Source C Code:
int main()
{
int i;
for(i=0, i < 10; i++)
{
printf("Hello World!\n");
}
}
Dump of Intel syntax x86 assembler code for function main:
1. 0x000055555555463a <+0>: push rbp
2. 0x000055555555463b <+1>: mov rbp,rsp
3. 0x000055555555463e <+4>: sub rsp,0x10
4. 0x0000555555554642 <+8>: mov DWORD PTR [rbp-0x4],0x0
5. 0x0000555555554649 <+15>: jmp 0x55555555465b <main+33>
6. 0x000055555555464b <+17>: lea rdi,[rip+0xa2] # 0x5555555546f4
7. 0x0000555555554652 <+24>: call 0x555555554510 <puts#plt>
8. 0x0000555555554657 <+29>: add DWORD PTR [rbp-0x4],0x1
9. 0x000055555555465b <+33>: cmp DWORD PTR [rbp-0x4],0x9
10. 0x000055555555465f <+37>: jle 0x55555555464b <main+17>
11. 0x0000555555554661 <+39>: mov eax,0x0
12. 0x0000555555554666 <+44>: leave
13. 0x0000555555554667 <+45>: ret
I'm currently working through "Hacking, The Art of Exploitation 2nd Edition by Jon Erickson", and I'm just starting to tackle assembly.
I have a few questions about the translation of the provided C code to Assembly, but I am mainly wondering about my first question.
1st Question: What is the purpose of line 6? (lea rdi,[rip+0xa2]).
My current working theory, is that this is used to save where the next instructions will jump to in order to track what is going on. I believe this line correlates with the printf function in the source C code.
So essentially, its loading the effective address of rip+0xa2 (0x5555555546f4) into the register rdi, to simply track where it will jump to for the printf function?
2nd Question: What is the purpose of line 11? (mov eax,0x0?)
I do not see a prior use of the register, EAX and am not sure why it needs to be set to 0.
The LEA puts a pointer to the string literal into a register, as the first arg for puts. The search term you're looking for is "calling convention" and/or ABI. (And also RIP-relative addressing). Why is the address of static variables relative to the Instruction Pointer?
The small offset between code and data (only +0xa2) is because the .rodata section gets linked into the same ELF segment as .text, and your program is tiny. (Newer gcc + ld versions will put it in a separate page so it can be non-executable.)
The compiler can't use a shorter more efficient mov edi, address in position-independent code in your Linux PIE executable. It would do that with gcc -fno-pie -no-pie
mov eax,0 implements the implicit return 0 at the end of main that C99 and C++ guarantee. EAX is the return-value register in all calling conventions.
If you don't use gcc -O2 or higher, you won't get peephole optimizations like xor-zeroing (xor eax,eax).
This:
lea rdi,[rip+0xa2]
Is a typical position independent LEA, putting the string address into a register (instead of loading from that memory address).
Your executable is position independent, meaning that it can be loaded at runtime at any address. Therefore, the real address of the argument to be passed to puts() needs to be calculated at runtime every single time, since the base address of the program could be different each time. Also, puts() is used instead of printf() because the compiler optimized the call since there is no need to format anything.
In this case, the binary was most probably loaded with the base address 0x555555554000. The string to use is stored in your binary at offset 0x6f4. Since the next instruction is at offset 0x652, you know that, no matter where the binary is loaded in memory, the address you want will be rip + (0x6f4 - 0x652) = rip + 0xa2, which is what you see above. See this answer of mine for another example.
The purpose of:
mov eax,0x0
Is to set the return value of main(). In Intel x86, the calling convention is to return values in the rax register (eax if the value is 32 bits, which is true in this case since main returns an int). See the table entry for x86-64 at the end of this page.
Even if you don't add an explicit return statement, main() is a special function, and the compiler will add a default return 0 for you.
If you add some debug data and symbols to the assembly everything will be easier. It is also easier to read the code if you add some optimizations.
There is a very useful tool godbolt and your example https://godbolt.org/z/9sRFmU
On the asm listing there you can clearly see that that lines loads the address of the string literal which will be then printed by the function.
EAX is considered volatile and main by default returns zero and thats the reason why it is zeroed.
The calling convention is explained here: https://en.wikipedia.org/wiki/X86_calling_conventions
Here you have more interesting cases https://godbolt.org/z/M4MeGk

Interpreting this line in Assembly language?

Below are the first 5 lines of a disassembled C program that I am trying to reverse engineer back into it's C code for purposes of better learning assembly language. At the beginning of this code I see it makes room on the stack and immediately calls
0x000000000040054e <+8>: mov %fs:0x28,%rax
I am confused what this line does, and what might be calling this from the corresponding C program. The only time I have seen this line so far is when a different method within a C program is called, but this time it is not followed by any Callq instructions so I am not so sure... Any ideas what else could be in this C program to be making this call?
0x0000000000400546 <+0>: push %rbp
0x0000000000400547 <+1>: mov %rsp,%rbp
0x000000000040054a <+4>: sub $0x40,%rsp
0x000000000040054e <+8>: mov %fs:0x28,%rax
0x0000000000400557 <+17>: mov %rax,-0x8(%rbp)
0x000000000040055b <+21>: xor %eax,%eax
0x000000000040055d <+23>: movl $0x17,-0x30(%rbp)
...
I know this is to provide some form of stack protection for buffer overflow attacks, I just need to know what C code would prompt this protection if not for a seperate method.
As you say, this is code used to defend against buffer overflows. The compiler generates this "stack canary check" for functions that have local variables that might be buffers that could be overflowed. Note the instructions immediately above and below the line you are asking about:
sub $0x40, %rsp
mov %fs:0x28, %rax
mov %rax, -0x8(%ebp)
xor %eax, %eax
The sub allocates 64 bytes of space on the stack, which is enough room for at least one small array. Then a secret value is copied from %fs:0x28 to the top of that space, just below the previous frame pointer and the return address, and then it is erased from the register file.
The body of the function does something with arrays; if it writes sufficiently far past the end of an array, it will overwrite the secret value. At the end of the function, there will be code along the lines of
mov -0x8(%rbp), %rax
xor %fs:28, %rax
jne 1
mov %rbp, %rsp
pop %rbp
ret
1:
call __stack_chk_fail # does not return
This verifies that the secret value is unchanged, and crashes the program if it has changed. The idea is that someone trying to exploit a simple buffer overflow vulnerability, like you have when you use gets, won't be able to change the return address without also modifying the secret value.
The compiler has several different heuristics, selectable with command line options, for deciding when it is necessary to generate stack-canary protection code.
You can't write C code corresponding to this assembly language yourself, because it uses the unusual %fs:nnnn addressing mode; the stack-canary code intentionally uses an addressing mode that no other code generation relies on, to make it as difficult as possible for the adversary to learn the secret value.

C allocated space size on stack for an array

I have a simple program called demo.c which allocates space for a char array with the length of 8 on the stack
#include<stdio.h>
main()
{
char buffer[8];
return 0;
}
I thought that 8 bytes will be allocated from stack for the eight chars but if I check this in gdb there are 10 bytes subtracted from the stack.
I compile the the program with this command on my Ubuntu 32 bit machine:
$ gcc -ggdb -o demo demo.c
Then I analyze the program with:
$ gdb demo
$ disassemble main
(gdb) disassemble main
Dump of assembler code for function main:
0x08048404 <+0>: push %ebp
0x08048405 <+1>: mov %esp,%ebp
0x08048407 <+3>: and $0xfffffff0,%esp
0x0804840a <+6>: sub $0x10,%esp
0x0804840d <+9>: mov %gs:0x14,%eax
0x08048413 <+15>: mov %eax,0xc(%esp)
0x08048417 <+19>: xor %eax,%eax
0x08048419 <+21>: mov $0x0,%eax
0x0804841e <+26>: mov 0xc(%esp),%edx
0x08048422 <+30>: xor %gs:0x14,%edx
0x08048429 <+37>: je 0x8048430 <main+44>
0x0804842b <+39>: call 0x8048340 <__stack_chk_fail#plt>
0x08048430 <+44>: leave
0x08048431 <+45>: ret
End of assembler dump.
0x0804840a <+6>: sub $0x10,%esp says, that there are 10 bytes allocated from the stack right?
Why are there 10 bytes allocated and not 8?
No, 0x10 means it's hexadecimal, i.e. 1016, which is 1610 bytes in decimal.
Probably due to alignment requirements for the stack.
Please note that the constant $0x10 is in hexadecimal this is equal to 16 byte.
Take a look at the machine code:
0x08048404 <+0>: push %ebp
0x08048405 <+1>: mov %esp,%ebp
0x08048407 <+3>: and $0xfffffff0,%esp
0x0804840a <+6>: sub $0x10,%esp
...
0x08048430 <+44>: leave
0x08048431 <+45>: ret
As you can see before we subtract 16 from the esp we ensure to make esp pointing to a 16 byte aligned address first (take a look at the and $0xfffffff0,%esp instruction).
I guess the compiler try to respect the alignment so he simply reserves 16 byte as well. It does not matter anyway because 8 byte fit into 16 byte very well.
sub $0x10, %esp is saying that there are 16 bytes on the stack, not 10 since 0x is hexadecimal notation.
The amount of space for the stack is completely dependent on the compiler. In this case it's most like an alignment issue where the alignment is 16 bytes and you've requested 8, so it gets increased to 16.
If you requested 17 bytes, it would most likely have been sub $0x20, %esp or 32 bytes instead of 17.
(I skipped over some things the other answers explain in more detail).
You compiled with -O0, so gcc is operating in a super-simple way that tells you something about compiler internals, but little about how to make good code from C.
gcc is keeping the stack 16B-aligned at all times. The 32bit SysV ABI only guarantees 4B stack alignment, but GNU/Linux systems actually assume and maintain gcc's default -mpreferred-stack-boundary=4 (16B-aligned).
Your version of gcc also defaults to using -fstack-protector, so it checks for stack-smashing in functions with local char arrays with 4 or more elements:
-fstack-protector
Emit extra code to check for buffer overflows, such as stack smashing attacks. This is done by adding a guard variable to
functions with
vulnerable objects. This includes functions that call "alloca", and functions with buffers larger than 8 bytes. The guards
are
initialized when a function is entered and then checked when the function exits. If a guard check fails, an error message is
printed and
the program exits.
For some reason, this is actually kicking in with char arrays >= 4B, but not with integer arrays. (At least, not when they're unused!). char pointers can alias anything, which may have something to do with it.
See the code on godbolt, with asm output. Note how main is special: it uses andl $-16, %esp to align the stack on entry to main, but other functions assume the stack was 16B-aligned before the call instruction that called them. So they'll typically sub $24, %esp, after pushing %ebp. (%ebp and the return address are 8B total, so the stack is 8B away from being 16B-aligned). This leaves room for the stack-protector canary.
The 32bit SysV ABI only requires arrays to be aligned to the natural alignment of their elements, so this 16B alignment for the char array is just what the compiler decided to do in this case, not something you can count on.
The 64bit ABI is different:
An array uses the same alignment as its elements, except that a local
or global array variable of length at least 16 bytes or a C99
variable-length array variable always has alignment of at least 16
bytes
(links from the x86 tag wiki)
So you can count on char buf[1024] being 16B-aligned on SysV, allowing you to use SSE aligned loads/stores on it.

Resources