Why do byte spills occur and what do they achieve? - c

What is a byte spill?
When I dump the x86 ASM from an LLVM intermediate representation generated from a C program, there are numerous spills, usually of a 4 byte size. I cannot figure out why they occur and what they achieve.
They seem to "cut" pieces of the stack off, but in an unusual way:
## this fragment comes from a C program right before a malloc() call to a struct.
## there are other spills in different circumstances in this same program, so it
## is not related exclusively to malloc()
...
sub ESP, 84
mov EAX, 60
mov DWORD PTR [ESP + 80], 0
mov DWORD PTR [ESP], 60
mov DWORD PTR [ESP + 60], EAX # 4-byte Spill
call malloc
mov ECX, 60
...

A register spill is simply what happens when you have more local variables than registers (it's an analogy - really the meaning is that they must be saved to memory). The instruction is saving the value of EAX, likely because EAX is clobbered by malloc and you don't have another spare register to save it in (and for whatever reason the compiler has decided it needs the constant 60 in the register later).
By the looks of it, the compiler could certainly have omitted mov DWORD PTR [ESP + 60], EAX and instead repeated the mov EAX, 60 where it would otherwise mov EAX, DWORD PTR [ESP + 60] or whatever offset it used, because the saved value of EAX cannot be other than 60 at that point. However, compilation is not guaranteed to be perfectly optimal.
Bear also in mind that after sub ESP, 84, the stack size is not adjusted (except by the call instruction which of course pushes the return address). The following instructions are using ESP as a memory offset, not a destination.

Related

Can't write to memory requested with malloc/calloc in x64 Assembly

This is my first question on this platform. I'm trying to modify the pixels of an image file and to copy them to memory requested with calloc. When the code tries to dereference the pointer to the memory requested with calloc at offset 16360 to write, an "access violation writing location" exception is thrown. Sometimes the offset is slightly higher or lower. The amount of memory requested is correct. When I write equivalent code in C++ with calloc, it works, but not in assembly. I've also tried to request an higher amount of memory in assembly and to raise the heap and stack size in the visual studio settings but nothing works for the assembly code. I also had to set the option /LARGEADDRESSAWARE:NO before I could even build and run the program.
I know that the AVX instruction sets would be better suited for this, but the code would contain slightly more lines so I made it simpler for this question and I'm also not a pro, I did this to practice the AVX instruction set.
Many thanks in advance :)
const uint8_t* getImagePtr(sf::Image** image, const char* imageFilename, uint64_t* imgSize) {
sf::Image* img = new sf::Image;
img->loadFromFile(imageFilename);
sf::Vector2u sz = img->getSize();
*imgSize = uint64_t((sz.x * sz.y) * 4u);
*image = img;
return img->getPixelsPtr();
}
EXTRN getImagePtr:PROC
EXTRN calloc:PROC
.data
imagePixelPtr QWORD 0 ; contains address to source array of 8 bit pixels
imageSize QWORD 0 ; contains size in bytes of the image file
image QWORD 0 ; contains pointer to image object
newImageMemory QWORD 0 ; contains address to destination array
imageFilename BYTE "IMAGE.png", 0 ; name of the file
.code
mainasm PROC
sub rsp, 40
mov rcx, OFFSET image
mov rdx, OFFSET imageFilename
mov r8, OFFSET imageSize
call getImagePtr
mov imagePixelPtr, rax
mov rcx, 1
mov rdx, imageSize
call calloc
add rsp, 40
cmp rax, 0
je done
mov newImageMemory, rax
mov rcx, imageSize
xor eax, eax
mov bl, 20
SomeLoop:
mov dl, BYTE PTR [imagePixelPtr + rax]
add dl, bl
mov BYTE PTR [newImageMemory + rax], dl ; exception when dereferencing and writing to offset 16360
inc rax
loop SomeLoop
done:
ret
mainasm ENDP
END
Let's translate this line back into C:
mov BYTE PTR [newImageMemory + rax], dl ;
In C, this is more or less equivalent to:
*((unsigned char *)&newImageMemory + rax) = dl;
Which is clearly not what you want. It's writing to an offset from the location of newImageMemory, and not to an offset from where newImageMemory points to.
You will need to keep newImageMemory in a register if you want to use it as the base address for an offset.
While we're at it, this line is also wrong, for the same reason:
mov dl, BYTE PTR [imagePixelPtr + rax]
It just happens not to crash.

Does function parameter names has a place in memory in C?

I dont think function parameter names are treated like variables and they doesnt get stored in memory. But I dont get how we can use these parameters in functions as variables if they dont have any place in memory. Can anyone explain me whats going on with function parameters and if they have place or not in memory
All variables are either allocated somewhere or optimized away in case the compiler found them unnecessary. Function parameters are variables and are almost certainly stored either in a CPU register or on the stack, if they are used by the program.
The only time when they might not get allocated is when the function is inlined - when the whole function call is optimized away and the function code is instead injected in the caller-side machine code. In such cases the original variables used by the caller might be used instead.
Function parameter names however are not stored anywhere in the final executable, just like any other identifier isn't stored there either. Names of variables, functions etc only exist in the source code, for the benefit of the programmer alone.
Although your title asks about “function parameter names,” it appears your question is about function parameters, which are different.
Commonly, arguments are passed to functions by putting them in processor registers or on the hardware stack. Each computing platform has some specification of which arguments should be passed where. For example, the first few small arguments (such as int values) may be passed in certain processor registers, while more or larger arguments may be put on the stack.
To the called function, these are parameters. The called function uses them from the processor registers or the stack.
I'm writing this answer assuming you know what the stack and CPU registers are. If you don't, I'd suggest you look them up before seeing this answer.
I dont think function parameter names are treated like variables and they doesnt get stored in memory.
At the assembly level, function parameter names don't really exist. But for function parameters, it depends on the assembly generated based on the compiler's level of optimization. Consider this simple function:
int foo(int a, int b)
{
return a + b;
}
Using Compiler Explorer, I checked the generated disassembly of x64 GCC 10.2. On -O0, it looks like this:
foo:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], edi
mov DWORD PTR [rbp-8], esi
mov edx, DWORD PTR [rbp-4]
mov eax, DWORD PTR [rbp-8]
add eax, edx
pop rbp
ret
These two lines:
mov DWORD PTR [rbp-4], edi
mov DWORD PTR [rbp-8], esi
interestingly show that the arguments are passed to edi and esi for a and b respectively, and then moved into the stack, presumably in case the registers need to be used elsewhere in the function. The rest of the function uses the space in the stack as opposed to the registers:
mov edx, DWORD PTR [rbp-4]
mov eax, DWORD PTR [rbp-8]
add eax, edx
(In case you didn't know, eax/rax generally holds the value for functions and edx in this case just serves as a general-purpose register, so these three lines are besaically eax = a; edx = b; eax += edx).
Ok, so that makes sense. The arguments are passed to registers and copied to the stack, where they are used for the rest of the function. What about -O1?
foo:
lea eax, [rdi+rsi]
ret
Now that is a lot shorter. Here, eax gets the value of rdi + rsi and the function ends. All the copying to the stack is completely skipped and the registers are used directly. So yes, in this case, the memory is never used.
EDIT
After writing this answer, I went and checked the generated assembly with the -m32 option and noticed that arguments were always pushed to the stack before the function was called. Assembly generated from -O0 looks like this:
foo:
push ebp
mov ebp, esp
mov edx, DWORD PTR [ebp+8]
mov eax, DWORD PTR [ebp+12]
add eax, edx
pop ebp
ret
Here, since the arguments are passed to the stack before the function is called, they don't have be copied from the registers to the stack (because they're already there). So the function is shorter, and amount of registers used is reduced. However, on higher levels of optimization, the function ends up becoming longer because of this:
foo:
mov eax, DWORD PTR [esp+8]
add eax, DWORD PTR [esp+4]
ret
So with -m32 set, parameters are always placed in memory.

Causes and benefits of this improvement on gcc version >= 4.9.0 vs gcc version < 4.9?

I have recently exploited a dangerous program and found something interesting about the difference between versions of gcc on x86-64 architecture.
Note:
Wrongful usage of gets is not the issue here.
If we replace gets with any other functions, the problem doesn't change.
This is the source code I use:
#include <stdio.h>
int main()
{
char buf[16];
gets(buf);
return 0;
}
I use gcc.godbolt.org to disassemble the program with flag -m32 -fno-stack-protector -z execstack -g.
At the disassembled code, when gcc with version >= 4.9.0:
lea ecx, [esp+4] # begin of main
and esp, -16
push DWORD PTR [ecx-4] # push esp
push ebp
mov ebp, esp
/* between these comment is not related to the question
push ecx
sub esp, 20
sub esp, 12
lea eax, [ebp-24]
push eax
call gets
add esp, 16
mov eax, 0
*/
mov ebp, esp
mov ecx, DWORD PTR [ebp-4] # ecx = saved esp
leave
lea esp, [ecx-4]
ret # end of main
But gcc with version < 4.9.0 just:
push ebp # begin of main
mov ebp, esp
/* between these comment is not related to the question
and esp, -16
sub esp, 32
lea eax, [esp+16]
mov DWORD PTR [esp], eax
call gets
mov eax, 0
*/
leave
ret # end of main
My question is: What is the causes of this difference on the disassembled code and its benefits? Does it have a name for this technique?
I can't say for sure without the actual values in:
and esp, 0xXX # XX is a number
but this looks a lot like extra code to align the stack to a larger value than the ABI requires.
Edit: The value is -16, which is 32-bit 0xFFFFFFF0 or 64-bit 0xFFFFFFFFFFFFFFF0 so this is indeed stack alignment to 16 bytes, likely meant for use of SSE instructions. As mentioned in comments, there is more code in the >= 4.9.0 version because it also aligns the frame pointer instead of only the stack pointer.
The i386 ABI, used for 32-bit programs, imposes that a process, immediately after loaded, has to have the stack aligned on 32-bit values:
%esp Performing its usual job, the stack pointer holds the address of the
bottom of the stack, which is guaranteed to be word aligned.
confront this with the x86_64 ABI1 used for 64-bit programs:
%rsp The stack pointer holds the address of the byte with lowest address which
is part of the stack. It is guaranteed to be 16-byte aligned at process entry
The opportunity gave by the new AMD's 64-bit technology to rewrite the old i386 ABI allow a number of optimizations that were lacking due to backward compatibility, among these a bigger (stricter?) stack alignment.
I won't dwell on the benefits of stack alignment but it suffices to say that if a 4-byte alignment was good, so is a 16-byte one.
So much that it is worth spending some instructions aligning the stack.
That's what GCC 4.9.0+ does, it aligns the stack at 16-bytes.
That explains the and esp, -16 but not the other instructions.
Aligning the stack with and esp, -16 is the fastest way to do it when the compiler only knows that the stack is 4-byte aligned (since esp MOD 16 can be 0, 4, 8 or 12).
However it is a destructive method, the compiler loses the original esp value.
But now it comes the chicken or the egg problem: if we save the original esp on the stack before aligning the stack, we lose it because we don't know how far the stack pointer is lowered by the alignment. If we save it after the alignment, well, we can't. We lost it in the alignment.
So the only possible solution is to save it in a register, align the stack and then save said register on the stack.
;Save the stack pointer in ECX, actually is ESP+4 but still does
lea ecx, [esp+4] #ECX = ESP+4
;Align the stack
and esp, -16 #This lowers ESP by 0, 4, 8 or 12
;IGNORE THIS FOR NOW
push DWORD PTR [ecx-4]
;Usual prolog
push ebp
mov ebp, esp
;Save the original ESP (before alignment), actually is ESP+4 but OK
push ecx
GCC saves esp+4 in ecx, I don't know why2 but this values still does the trick.
The only mystery left is the push DWORD PTR [ecx-4].
But it turns out to be a simple mystery: for debugging purposes GCC pushes the return addresses just before the old frame pointer (before push ebp), this is where 32-bit tools expect it to be.
Since ecx=esp_o+4, where esp_o is the original stack pointer pre-alignment, [ecx-4] = [esp_o] = return address.
Note that now the stack is at 12 bytes modulo 16, thus the local variable area must be of size 16*k+4 to have the stack aligned at 16-byte again.
In your example k is 1 and the area is of 20 bytes in size.
The subsequent sub esp, 12 is to align the stack for the gets function (the requirement is to have the stack aligned at the function call).
Finally, the code
mov ebp, esp
mov ecx, DWORD PTR [ebp-4] # ecx = saved esp
leave
lea esp, [ecx-4]
ret
The first instruction is copy-paste error.
One could check it out or simply reason that
if it were there the [ebp-4] would be below the stack pointer (and there is no red zone for the i386 ABI).
The rest is just undoing what's is done in the prolog:
;Get the original stack pointer
mov ecx, DWORD PTR [ebp-4] ;ecx = esp_o+4
;Standard epilog
leave ;mov esp, ebp / pop ebp
;The stack pointer points to the copied return address
;Restore the original stack pointer
lea esp, [ecx-4] ;esp = esp_o
ret
GCC has to first get the original stack pointer (+4) saved on the stack, then restore the old frame pointer (ebp) and finally, restore the original stack pointer.
The return address is on the top of the stack when lea esp, [ecx-4] is executed, so in theory GCC could just return but it has to restore the original esp because main is not the first function to be executed in a C program, so it cannot leave the stack unbalanced.
1 This is not the latest version but the text quoted went unchanged in the successive editions.
2 This has been discussed here on SO but I can't remember if in some comment or in an answer.

Why is the compiler generating a push/pop instruction pair?

I compiled the code below with the VC++ 2010 compiler:
__declspec(dllexport)
unsigned int __cdecl __mm_getcsr(void) { return _mm_getcsr(); }
and the generated code was:
push ECX
stmxcsr [ESP]
mov EAX, [ESP]
pop ECX
retn
Why is there a push ECX/pop ECX instruction pair?
The compiler is making room on the stack to store the MXCSR. It could have equally well done this:
sub esp,4
stmxcsr [ESP]
mov EAX, [ESP]
add esp,4
retn
But "push ecx" is probably shorter or faster.
The push here is used to allocate 4 bytes of temporary space. [ESP] would normally point to the pushed return address, which we cannot overwrite.
ECX will be overwritten here, however, ECX is a probably a volatile register in the ABI you're targeting, so functions don't have to preserve ECX.
The reason a push/pop is used here is a space (and possibly speed) optimization.
It creates an top-of-stack entry that ESP now refers to as the target for the stmxcsr instruction. Then the result is stored in EAX for the return.

Dive into the assembly

Function in c:
PHPAPI char *php_pcre_replace(char *regex, int regex_len,
char *subject, int subject_len,
zval *replace_val, int is_callable_replace,
int *result_len, int limit, int *replace_count TSRMLS_DC)
{
pcre_cache_entry *pce; /* Compiled regular expression */
/* Compile regex or get it from cache. */
if ((pce = pcre_get_compiled_regex_cache(regex, regex_len TSRMLS_CC)) == NULL) {
return NULL;
}
....
}
Its assembly:
php5ts!php_pcre_replace:
1015db70 8b442408 mov eax,dword ptr [esp+8]
1015db74 8b4c2404 mov ecx,dword ptr [esp+4]
1015db78 56 push esi
1015db79 8b74242c mov esi,dword ptr [esp+2Ch]
1015db7d 56 push esi
1015db7e 50 push eax
1015db7f 51 push ecx
1015db80 e8cbeaffff call php5ts!pcre_get_compiled_regex_cache (1015c650)
1015db85 83c40c add esp,0Ch
1015db88 85c0 test eax,eax
1015db8a 7502 jne php5ts!php_pcre_replace+0x1e (1015db8e)
php5ts!php_pcre_replace+0x1c:
1015db8c 5e pop esi
1015db8d c3 ret
The c function call pcre_get_compiled_regex_cache(regex, regex_len TSRMLS_CC) corresponds to 1015db7d~1015db80 which pushes the 3 parameters to the stack and call it.
But my doubt is,among so many registers,how does the compiler decide to use eax,ecx and esi(this is special,as it's restored before using,why?) as the intermediate to carry to the stack?
There must be some hidden indication in c that tells the compiler to do it this way,right?
No, there is no hidden indication.
This is a typical strategy for generating 80x86 instructions used by many compiler implementations, C and otherwise. For example, the 1980s Intel Fortran-77 compiler, when optimization was turned on, did the same thing.
That is uses eax and ecx preferentially is probably an artifact of avoiding use of esi and edi since those registers cannot directly be used to load byte operands.
Why not ebx and edx? Well, those are preferred by many code generators for holding intermediate pointers in evaluating complex structure evaluation, which is to say, there isn't much reason at all. The compiler just looked for two available registers to use and overwrote them to buffer the values.
Why not reuse eax like this?:
push esi
mov eax,dword ptr [esp+2Ch]
push eax
mov eax,dword ptr [esp+8]
push eax
mov eax,dword ptr [esp+4]
push eax
Because that causes pipeline stalls waiting for eax to complete previous memory cycles, in 80x86s since the 80586 (maybe 80486—it's too long ago to be sure off the top of my head).
The x86 architecture is a strange beast. Each register, though promoted as being "general purpose" by Intel, has its quirks (cx/ecx is tied to the loop instruction for example, and eax:edx is tied to the multiply instruction). That combined with the peculiar ways to optimize execution to avoid cache misses and pipeline stalls often leads to inscrutable generated code by a code generator which factors all that in.

Resources