Although the code works, I'm baffled by the compiler's decision to seemingly mix 32 and 64 bit parameters of the same type. Specifically, I have a function which receives three char pointers. Looking at the assembly code, two of the three are passed as 64-bit pointers (as expected), while the third, a local constant, but character string nonetheless, is being passed as a 32-bit pointer. I don't see how my function could ever know when the 3rd parameter isn't a fully loaded 64-bit pointer. Obviously it doesn't matter as long as the higher side is 0, but I don't see it making an effort to ensure that. Anything could be in the high side of RDX in this example. What am I missing? BTW, the receiving function assumes it's a full 64-bit pointer and includes this code on entry:
movq %rdx, -24(%rbp)
This is the code in question:
.LC4:
.string "My Silly String"
.text
.globl funky_funk
.type funky_funk, @function
funky_funk:
pushq %rbp
movq %rsp, %rbp
pushq %rbx
subq $16, %rsp
movq %rdi, -16(%rbp) ;char *dst 64-bit
movl %esi, -20(%rbp) ;int len, 32 bits OK
movl $.LC4, %edx ;<<<<---- why is it not RDX?
movl -20(%rbp), %ecx ;int len 32-bits OK
movq -16(%rbp), %rbx ;char *dst 64-bit
movq -16(%rbp), %rax ;char *dst 64-bit
movq %rbx, %rsi ;char *dst 64-bit
movq %rax, %rdi ;char *dst 64-bit
call edc_function
void funky_funk(char *dst, int len)
{ //how will function know when
edc_function(dst, dst, STRING_LC4, len); //a str passed in 3rd parm
} //is 32-bit ptr vs 64-bit ptr?
void edc_function(char *dst, char *src, char *key, int len)
{
//so, is key a 32-bit ptr? or is key a 64-bit ptr?
}
When a 32-bit value is loaded into a 32-bit register on x86-64, the upper 32 bits of the full 64-bit register are cleared, so the value is zero-extended. You are probably compiling in a mode where the compiler knows that the code and its static data live in the low, 32-bit-addressable part of memory.
GCC has several memory models for x86-64, two of which have that property. From the GCC documentation:
`-mcmodel=small'
Generate code for the small code model: the program and its
symbols must be linked in the lower 2 GB of the address space.
Pointers are 64 bits. Programs can be statically or dynamically
linked. This is the default code model.
`-mcmodel=medium'
Generate code for the medium model: The program is linked in the
lower 2 GB of the address space. Small symbols are also placed
there. Symbols with sizes larger than `-mlarge-data-threshold'
are put into large data or bss sections and can be located above
2GB. Programs can be statically or dynamically linked.
(the other ones are kernel, which is similar to small but in the upper/negative 2GB of
address space and large with no restriction).
Adding this as an answer, as it contains "part of the puzzle" for the original question:
As long as the compiler can determine (for example, because you specified a memory model that guarantees it) that .LC4 is within the first 4 GB, it can do this. %edx is loaded with the 32 bits of the address of .LC4 and the upper bits of %rdx are set to zero, so when edc_function() is called it can use the full 64 bits of %rdx. As long as the address is within the lower 4 GB, it works out fine.
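The zero-extension argument can be modelled in C. This is only a sketch: the helper names zero_extend32 and fits_in_32_bits are made up for illustration, and any addresses you feed it are arbitrary example values, not ones taken from the question's binary.

```c
#include <stdint.h>

/* Models what `movl $imm32, %edx` does to RDX: the low 32 bits are
   written and the upper 32 bits are cleared. (Hypothetical helper.) */
uint64_t zero_extend32(uint64_t value)
{
    return (uint64_t)(uint32_t)value;
}

/* Returns 1 if an address survives the 32-bit load unchanged, i.e. it
   lies in the low 4 GB. Symbols linked into the low 2 GB always pass. */
int fits_in_32_bits(uint64_t address)
{
    return zero_extend32(address) == address;
}
```

An address such as 0x4005d0 passes the check, while a typical stack address such as 0x00007fff12345678 does not, which is why the compiler only uses the 32-bit form for symbols whose location it can bound at link time.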
I have following C function:
void function(int a) {
char buffer[1];
}
It produces following assembly code(gcc with 0 optimization, 64 bit machine):
function:
pushq %rbp
movq %rsp, %rbp
movl %edi, -20(%rbp)
nop
popq %rbp
ret
Questions:
Why does buffer occupy 20 bytes?
If I declare char buffer instead of char buffer[1], the offset is 4 bytes, but I expected to see 8, since the machine is 64-bit and I thought it would use a qword (64 bits).
Thanks in advance, and sorry if the question is a duplicate; I was not able to find the answer.
movl %edi, -20(%rbp) is spilling the function arg from a register into the red-zone below the stack pointer. It's 4 bytes long, leaving 16 bytes of space above it below RSP.
gcc's -O0 (naive anti-optimized) code-gen for your function doesn't actually touch the memory it reserved for buffer[], so you don't know where it is.
You can't infer that buffer[] is using up all 16 bytes above a in the red zone, just that gcc did a bad job of packing locals efficiently (because you compiled with -O0 so it didn't even try). But it's definitely not 20 because there isn't that much space left. Unless it put buffer[] below a, somewhere else in the rest of the 128-byte red-zone. (Hint: it didn't.)
If we add an initializer for the array, we can see where it actually stores the byte.
void function(int a) {
volatile char buffer[1] = {'x'};
}
compiled by gcc8.2 -xc -O0 -fverbose-asm -Wall on the Godbolt compiler explorer:
function:
pushq %rbp
movq %rsp, %rbp # function prologue, creating a traditional stack frame
movl %edi, -20(%rbp) # a, a
movb $120, -1(%rbp) #, buffer
nop # totally useless, IDK what this is for
popq %rbp # tear down the stack frame
ret
So buffer[] is in fact one byte long, right below the saved RBP value.
The x86-64 System V ABI requires 16-byte alignment for automatic storage arrays that are at least 16 bytes long, but that's not the case here so that rule doesn't apply.
I don't know why gcc leaves extra padding before the spilled register arg; gcc often has that kind of missed optimization. It's not giving a any special alignment.
If you add extra local arrays, they will fill up that 16 bytes above the spilled arg, still spilling it to -20(%rbp). (See function2 in the Godbolt link)
I also included clang -O0, and icc -O3 and MSVC optimized output, in the Godbolt link. Fun fact: ICC chooses to optimize away volatile char buffer[1] = {'x'}; without actually storing to memory, but MSVC allocates it in the shadow space. (Windows x64 uses a different calling convention, and has 32B shadow space above the return address instead of a 128B red zone below the stack pointer.)
clang/LLVM -O0 chooses to spill a right below RSP, and put the array 1 byte below that.
With just char buffer instead of char buffer[1]
We get movl %edi, -4(%rbp) # a, a from gcc -O0. It apparently optimizes away the unused and uninitialized local variable entirely, and spills a right below the saved RBP. (I didn't run it under GDB or look at the debug info to see what &buffer would give us.)
So again, you're mixing up a with buffer.
If we initialize it with char buffer = 'x', we're back to the old stack layout, with buffer at -1(%rbp).
Or even if we just make it volatile char buffer; without an initializer, then space for it exists on the stack and a is spilled to -20(%rbp) even with no store done to buffer.
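Rather than inferring the layout from the spill offset, you can ask the compiler directly. The following is a minimal sketch with made-up helper names; the distance it reports depends on the compiler, its version and the flags, so no particular value is promised. Only the array size is fixed.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: byte distance between a spilled copy of the
   argument and a one-byte local array in this particular build. */
ptrdiff_t spill_to_buffer_distance(int a)
{
    volatile char buffer[1] = {'x'};
    volatile int spilled = a;   /* volatile forces both objects into memory */
    return (ptrdiff_t)((intptr_t)&buffer[0] - (intptr_t)&spilled);
}

/* The array itself is one byte long, wherever the compiler puts it. */
size_t buffer_size(void)
{
    volatile char buffer[1] = {'x'};
    return sizeof buffer;
}
```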
4 bytes for the aligned char, 8 bytes for the pushed rbp, 8 bytes for a = 20. The start address of a is the current stack pointer minus 20.
I'm currently solving problem 3.3 from the 3rd edition of Computer Systems: A Programmer's Perspective and I'm having a hard time understanding what these errors mean...
movb $0xF, (%ebx) gives an error because ebx can't be used as address register
movl %rax, (%rsp) and
movb %si, 8(%rbp) give an error saying that there's a mismatch between instruction suffix and register I.D.
movl %eax, %rdx gives an error saying that destination operand incorrect size
Why can't we use ebx as an address register? Is it because it's a 32-bit register? Would the line work if it were movb $0xF, (%rbx) instead, since rbx is a 64-bit register?
For the error regarding the mismatch between instruction suffix and register I.D., does this error appear because it should have been movq %rax, (%rsp) and movew %si, 8(%rbp) instead of movl %rax, (%rsp) and movb %si, 8(%rbp)?
And lastly, for the error regarding "destination operand incorrect size", is this because the destination register was 64-bit instead of 32-bit? So if the line of code was movl %eax, %edx instead, the error wouldn't have occurred?
Any enlightenment would be appreciated.
this is for x86-64
movb $0xF, (%ebx) gives an error because ebx can't be used as address register
It's true that ebx can't be used as an address register (for x86-64), but rbx can. ebx is the lower 32bits of rbx. The whole point of 64bit code is that addresses can be 64bits, so trying to reference memory by using a 32bit register makes little sense.
movl %rax, (%rsp) and movb %si, 8(%rbp) gives error saying that
theres a mismatch between instruction suffix and register I.D.
Yes, because you are using movl, the 'l' means long, which (in this context) means 32bits. However, rax is a 64bit register. If you want to write 64bits out of rax, you should use movq. If you want to write 32bits, you should use eax.
movl %eax, %rdx gives an error saying that destination operand incorrect size
You are trying to move a 32-bit value into a 64-bit register. There are instructions that do this conversion for you (movslq sign-extends, for example), but movl isn't one of them.
movb $0xF, (%ebx) assembles just fine (with a 0x67 address-size prefix), and executes correctly if the address in ebx is valid.
It might be a bug (and e.g. lead to a segfault from truncating a pointer), or sub-optimal, but if your book makes any stronger claim than that (like that it won't assemble) then your book contains an error.
The only reason you'd ever use that instead of movb $0xF, (%rbx) is if the upper bytes of %rbx potentially held garbage, e.g. in the x32 ABI (ILP32 in long mode), or if you're a dumb compiler that always uses address-size prefixes when targeting 32-bit-pointer mode even when addresses are known to be safely zero-extended.
32-bit address size is actually useful for the x32 ABI for the more common case where an index register holds high garbage, e.g. movl $0x12345, (%edi, %esi,4).
gcc -mx32 could easily emit a movb $0xF, (%ebx) instruction in real life. (Note that -mx32 (32-bit pointers in long mode) is different from -m32 (i386 ABI))
int ext(); // can't inline
void foo(char *p) {
ext(); // clobbers arg-passing registers
*p = 0xf; // so gcc needs to save the arg for after the call
}
Compiles with gcc7.3 -mx32 -O3 on the Godbolt compiler explorer into
foo(char*):
pushq %rbx # rbx is gcc's first choice of call-preserved reg.
movq %rdi, %rbx # stupid gcc copies the whole 64 bits when only the low 32 are useful
call ext()
movb $15, (%ebx) # $15 = $0xF
popq %rbx
ret
mov %edi, %ebx would have been better; IDK why gcc wants to copy the whole 64-bit register when it's treating pointers as 32-bit values. The x32 ABI unfortunately never really caught on on x86 so I guess nobody's put in the time to get gcc to generate great code for it.
AArch64 also has an ILP32 ABI to save memory / cache-footprint on pointer data, so maybe gcc will get better at 32-bit pointers in 64-bit mode in general (benefiting x86-64 as well) if any work for AArch64 ILP32 improves the common cross-architecture parts of this.
so if the line of code was movl %eax, %edx instead, the error wouldn't have occurred?
Right, that would zero-extend EAX into RDX. If you wanted to sign-extend EAX into RDX, use movslq %eax, %rdx (aka Intel-syntax movsxd)
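The difference between the two widenings can be sketched in C. The function names are invented for this illustration, and the conversion to int32_t relies on the two's-complement wraparound that gcc and clang document for out-of-range values.

```c
#include <stdint.h>

/* Models movl %eax, %edx: the upper half of RDX becomes zero. */
uint64_t widen_zero(uint32_t eax)
{
    return (uint64_t)eax;
}

/* Models movslq %eax, %rdx: the upper half of RDX is filled with
   copies of bit 31 of EAX. */
int64_t widen_sign(uint32_t eax)
{
    return (int64_t)(int32_t)eax;
}
```

With EAX holding 0xFFFFFFFF, the first form leaves 0x00000000FFFFFFFF in RDX and the second leaves -1.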
(Almost) all x86 instructions require all their operands to be the same size. (In terms of operand-size; many instructions have a form with an 8-bit or 32-bit immediate that's sign extended to 64-bit or whatever the instruction's operand-size is. e.g. add $1, %eax will use the 3-byte add imm8, r/m32 form.)
Exceptions include shl %cl, %eax, and movzx/movsx.
In AT&T syntax, the sizes of registers have to match the operand-size suffix, if you use one. If you don't, the registers imply an operand-size. e.g. mov %eax, %edx is the same as movl.
Memory + immediate instructions with no register source or destination need an explicit size: add $1, (%rdx) won't assemble because the operand-size is ambiguous, but add %eax, (%rdx) is an addl (32-bit operand-size).
movew %si, 8(%rbp)
No, movw %si, 8(%rbp) would work though :P But note that if you've made a traditional stack frame with push %rbp / mov %rsp, %rbp on function entry, that store to 8(%rbp) will overwrite the low 16 bits of your return address on the stack.
But there's no requirement in x86-64 code for Windows or Linux that you have %rbp pointing there, or holding a valid pointer at all. It's just a call-preserved register like %rbx that you can use for whatever you want as long as you restore the caller's value before returning.
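To make the effect of that 16-bit store on a saved return address concrete, here is a small C model. It is a sketch only: after_16bit_store is a hypothetical name, and the masking expresses what a little-endian 2-byte store does to the quadword at the same address.

```c
#include <stdint.h>

/* Models `movw %si, 8(%rbp)` landing on a 64-bit return address: only
   the low 16 bits of the stored quadword change, because on x86 the two
   bytes at the lowest address are the least significant ones. */
uint64_t after_16bit_store(uint64_t return_address, uint16_t si)
{
    return (return_address & ~(uint64_t)0xFFFF) | si;
}
```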
I'm trying to understand the underlying assembly for a simple C function.
program1.c
void function() {
char buffer[1];
}
=>
push %ebp
mov %esp, %ebp
sub $0x10, %esp
leave
ret
Not sure how it's arriving at 0x10 here? Isn't a character 1 byte, which is 8 bits, so it should be 0x08?
program2.c
void function() {
char buffer[4];
}
=>
push %ebp
mov %esp, %ebp
sub $0x18, %esp
mov ...
mov ...
[a bunch of random instructions]
Not sure how it's arriving at 0x18 here either? Also, why are there so many additional instructions after the SUB instruction? All I did was change the length of the array from 1 to 4.
gcc uses -mpreferred-stack-boundary=4 by default for x86 32 and 64bit ABIs, so it keeps %esp 16B-aligned.
I was able to reproduce your output with gcc 4.8.2 -O0 -m32 on the Godbolt Compiler Explorer
void f1() { char buffer[1]; }
pushl %ebp
movl %esp, %ebp # make a stack frame (`enter` is super slow, so gcc doesn't use it)
subl $16, %esp
leave # `leave` is not terrible compared to mov/pop
ret
You must be using a version of gcc with -fstack-protector enabled by default. Newer gcc isn't usually configured to do that, so you don't get the same sentinel value and check written to the stack. (Try a newer gcc in that godbolt link)
void f4() { char buffer[4]; }
pushl %ebp #
movl %esp, %ebp # make a stack frame
subl $24, %esp # IDK why it reserves 24, rather than 16 or 32B, but prob. has something to do with aligning the stack for the possible call to __stack_chk_fail
movl %gs:20, %eax # load a value from thread-local storage
movl %eax, -12(%ebp) # store it on the stack
xorl %eax, %eax # tmp59
movl -12(%ebp), %eax # D.1377, tmp60
xorl %gs:20, %eax # check that the sentinel value matches what we stored
je .L3 #,
call __stack_chk_fail #
.L3:
leave
ret
Apparently gcc considers char buffer[4] a "vulnerable object", but not char buffer[1]. Without -fstack-protector, there'd be little to no difference in the asm even at -O0.
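What the extra instructions do can be modelled by hand in C. This is not how gcc implements -fstack-protector (the real guard lives in thread-local storage and the check calls __stack_chk_fail); it is only a sketch with a hypothetical struct and function name showing why a guard value placed above the buffer detects an overrun.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical model of a protected frame: the canary sits above buffer. */
struct frame_model {
    char buffer[4];
    unsigned long canary;
};

/* Copies n bytes into the modelled frame starting at buffer and reports
   whether the canary still matches the guard afterwards (1 = intact). */
int canary_intact_after_copy(const char *src, size_t n, unsigned long guard)
{
    struct frame_model f;
    memset(&f, 0, sizeof f);
    f.canary = guard;               /* prologue: load guard, store it in the frame */
    if (n > sizeof f)
        n = sizeof f;               /* keep the model itself in bounds */
    memcpy(&f, src, n);             /* bytes beyond buffer[] reach the canary */
    return f.canary == guard;       /* epilogue: compare before returning */
}
```

Copying 4 bytes leaves the canary intact; copying enough bytes to cover the whole struct overwrites it, which is the situation where the real code would call __stack_chk_fail.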
Isn't a character 1 byte, which is 8 bits, so it should be 0x08?
These values are not bits, they are bytes.
Not sure how it's arriving at 0x10 here?
These lines:
push %ebp
mov %esp, %ebp
sub $0x10, %esp
are allocating space on the stack; 16 bytes of memory are being reserved for the execution of this function.
All those bytes are needed to store information like:
A 4-byte memory address for the instruction that will be jumped to by the ret instruction
The local variables of the function
Data structure alignment
Other stuff I can't remember right now :)
In your example, 16 bytes were allocated. 4 of them are for the address of the next instruction that will be called, so we have 12 bytes left. 1 byte is for the char array of size 1, which is probably optimized by the compiler to a single char. The last 11 bytes are probably used to store some of the stuff I can't remember and the padding added by the compiler.
Not sure how it's arriving at 0x18 here either?
Each of the additional bytes in your second example increased the stack size by 2 bytes: 1 byte for the char, and 1 likely for memory alignment purposes.
Also, why are there so many additional instructions after the SUB instruction?
Please update the question with the instructions.
This code is just setting up the stack frame. This is used as scratch space for local variables, and will have some kind of alignment requirement.
You haven't mentioned your platform, so I can't tell you exactly what the requirements are for your system, but obviously both values are at least 8-byte aligned (so the size of your local variables is rounded up so %esp is still a multiple of 8).
Search for "c function prolog epilog" or "c function call stack" to find more resources in this area.
Edit - Peter Cordes' answer explains the discrepancy and the mysterious extra instructions.
And for completeness, although Fábio already answered this part:
Not sure how it's arriving at 0x10 here? Isn't a character 1 byte, which is 8 bits, so it should be 0x08?
On x86, %esp is the stack pointer, and pointers store addresses, and these are addresses of bytes. Sub-byte addressing is rarely used (cf. Peter's comment). If you want to examine individual bits inside a byte, you'd usually use bitwise (&,|,~,^) operations on the value, but not change the address.
(You could equally argue that sub-cache-line addressing is a convenient fiction, but we're rapidly getting off-topic).
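A minimal sketch of examining individual bits with bitwise operations while the address stays byte-granular (bit_is_set is a name chosen for this example):

```c
/* Returns 1 if bit n (0 = least significant) of byte is set, else 0.
   The byte is addressed as a whole; the bit is selected arithmetically. */
int bit_is_set(unsigned char byte, unsigned n)
{
    return (int)((byte >> n) & 1u);
}
```

For the constant in the question, 0x10 has only bit 4 set, which is the value 16: the stack pointer moves by 16 bytes, not 16 bits.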
Whenever you allocate memory, your operating system almost never actually gives you exactly that amount, unless you use a function like pvalloc, which gives you a page-aligned amount of bytes (usually 4K). Instead, your operating system assumes that you might need more in the future, so goes ahead and gives you a bit more.
To disable this behavior, use a lower-level system call that doesn't do buffering, like sbrk(). These lecture notes are an excellent resource:
http://web.eecs.utk.edu/~plank/plank/classes/cs360/360/notes/Malloc1/lecture.html
I have the following program. I wonder why it outputs -4 on the following 64-bit machine. Which of my assumptions is wrong?
[Linux ubuntu 3.2.0-23-generic #36-Ubuntu SMP Tue Apr 10 20:39:51 UTC
2012 x86_64 x86_64 x86_64 GNU/Linux]
In the above machine and gcc compiler, by default b should be pushed first and a second.
The stack grows downwards, so b should have the higher address and a the lower address, and the result should be positive. But I got -4. Can anybody explain this?
The arguments are two chars occupying 2 bytes in the stack frame. But I saw a difference of 4 whereas I was expecting 1. Even if somebody says it is because of alignment, I am then left wondering why a structure with 2 chars is not aligned at 4 bytes.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h> /* intptr_t */
#include <unistd.h>
void CompareAddress(char a, char b)
{
printf("Differs=%ld\n", (long)((intptr_t)&b - (intptr_t)&a));
}
int main()
{
CompareAddress('a','b');
return 0;
}
/* Differs= -4 */
Here's my guess:
On Linux in x64, the calling convention states that the first few parameters are passed by register.
So in your case, both a and b are passed in registers rather than on the stack. However, since you take their addresses, the compiler has to store them somewhere on the stack once the function has been entered (not necessarily in downward order).
It's also possible that the function is just outright inlined.
In either case, the compiler makes temporary stack space to store the variables. Those can be in any order and subject to optimizations. So they may not be in any particular order that you might expect.
The best way to answer these sort of questions (about behaviour of a specific compiler on a specific platform) is to look at the assembler. You can get gcc to dump its assembler by passing the -S flag (and the -fverbose-asm flag is nice too). Running
gcc -S -fverbose-asm file.c
gives a file.s that looks a little like (I've removed all the irrelevant bits, and the bits in parenthesis are my notes):
CompareAddress:
# ("allocate" memory on the stack for local variables)
subq $16, %rsp
# (put a and b onto the stack)
movl %edi, %edx # a, tmp62
movl %esi, %eax # b, tmp63
movb %dl, -4(%rbp) # tmp62, a
movb %al, -8(%rbp) # tmp63, b
# (get their addresses)
leaq -8(%rbp), %rdx #, b.0
leaq -4(%rbp), %rax #, a.1
subq %rax, %rdx # a.1, D.4597 (&b - &a)
# (set up the parameters for the printf call)
movl $.LC0, %eax #, D.4598
movq %rdx, %rsi # D.4597,
movq %rax, %rdi # D.4598,
movl $0, %eax #,
call printf #
main:
# (put 'a' and 'b' into the registers for the function call)
movl $98, %esi #,
movl $97, %edi #,
call CompareAddress
(This question explains nicely what [re]bp and [re]sp are.)
The reason the difference is negative is the stack grows downward: i.e. if you push two things onto the stack, the one you push first will have a larger address, and a is pushed before b.
The reason it is -4 rather than -1 is the compiler has decided that aligning the arguments to 4 byte boundaries is "better", probably because a 32 bit/64 bit CPU deals with 4 bytes at time better than it handles single bytes.
(Also, looking at the assembler shows the effect that -mpreferred-stack-boundary has: it essentially means that memory on the stack is allocated in different sized chunks.)
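On the struct comparison raised in the question: struct members are laid out by fixed ABI rules, whereas separate locals or spilled arguments can be placed wherever the compiler likes. A small sketch (the struct and function names are invented here, and the values in the note below are what mainstream ABIs give, not something the C standard strictly promises):

```c
#include <stddef.h>

struct two_chars {
    char a;
    char b;
};

/* Distance between the two members inside the struct. */
size_t gap_between_members(void)
{
    return offsetof(struct two_chars, b) - offsetof(struct two_chars, a);
}

/* Total size of the struct, including any trailing padding. */
size_t struct_size(void)
{
    return sizeof(struct two_chars);
}
```

On typical ABIs the gap is 1 and the size is 2, so the 4-byte spacing seen in CompareAddress comes from how gcc -O0 chose its spill slots, not from any alignment requirement of char.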
I think the answer the program gave you is correct. GCC's default preferred-stack-boundary is 4; you can pass -mpreferred-stack-boundary=num to GCC to change the stack boundary, and the program will then give you a different answer depending on the value you set.
I compiled the following C code:
typedef struct {
long x, y, z;
} Foo;
long Bar(Foo *f, long i)
{
return f[i].x + f[i].y + f[i].z;
}
with the command gcc -S -O3 test.c. Here is the Bar function in the output:
.section __TEXT,__text,regular,pure_instructions
.globl _Bar
.align 4, 0x90
_Bar:
Leh_func_begin1:
pushq %rbp
Ltmp0:
movq %rsp, %rbp
Ltmp1:
leaq (%rsi,%rsi,2), %rcx
movq 8(%rdi,%rcx,8), %rax
addq (%rdi,%rcx,8), %rax
addq 16(%rdi,%rcx,8), %rax
popq %rbp
ret
Leh_func_end1:
I have a few questions about this assembly code:
What is the purpose of "pushq %rbp", "movq %rsp, %rbp", and "popq %rbp", if neither rbp nor rsp is used in the body of the function?
Why do rsi and rdi automatically contain the arguments to the C function (i and f, respectively) without reading them from the stack?
I tried increasing the size of Foo to 88 bytes (11 longs) and the leaq instruction became an imulq. Would it make sense to design my structs to have "rounder" sizes to avoid the multiply instructions (in order to optimize array access)? The leaq instruction was replaced with:
imulq $88, %rsi, %rcx
The function is simply building its own stack frame with these instructions. There's nothing really unusual about them. You should note, though, that due to this function's small size, it will probably be inlined when used in the code. The compiler is always required to produce a "normal" version of the function, though. Also, what @ouah said in his answer.
This is because that's how the AMD64 ABI specifies the arguments should be passed to functions.
If the class is INTEGER, the next available register of the sequence
%rdi, %rsi, %rdx, %rcx, %r8 and %r9 is used.
Page 20, AMD64 ABI Draft 0.99.5 – September 3, 2010
This is not directly related to the structure size, but rather to the absolute address that the function has to access. If the size of the structure is 24 bytes, f is the address of the array containing the structures, and i is the index at which the array has to be accessed, then the byte offset to each structure is i*24. Multiplying by 24 in this case is achieved by a combination of lea and SIB addressing. The lea instruction simply calculates i*3, then every subsequent instruction uses that i*3 and multiplies it further by 8, thereby accessing the array at the needed absolute byte offset, and uses immediate displacements to access the individual structure members ((%rdi,%rcx,8), 8(%rdi,%rcx,8), and 16(%rdi,%rcx,8)). If you make the size of the structure 88 bytes, there is simply no way of doing such a thing swiftly with a combination of lea and any kind of addressing. The compiler simply assumes that a simple imulq will be more efficient in calculating i*88 than a series of shifts, adds, leas or anything else.
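The address arithmetic described above can be written out in C as a sketch. Foo64 is a fixed-width stand-in for the question's Foo (so the 24-byte size does not depend on how wide long is), and lea_byte_offset is a hypothetical helper name.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    int64_t x, y, z;        /* 24 bytes, like Foo with 64-bit longs */
} Foo64;

/* Byte offset, relative to f, that the generated code computes for a
   member at displacement disp inside element i. */
size_t lea_byte_offset(size_t i, size_t disp)
{
    size_t rcx = i + i * 2;     /* leaq (%rsi,%rsi,2), %rcx  gives i*3 */
    return rcx * 8 + disp;      /* disp(%rdi,%rcx,8)         gives i*24 + disp */
}
```

For example, lea_byte_offset(2, 8) is 56, which matches 2 * sizeof(Foo64) + offsetof(Foo64, y).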
What is the purpose of pushq %rbp, movq %rsp, %rbp, and popq %rbp, if neither rbp nor rsp is used in the body of the function?
To keep track of the frames when you use a debugger. Add -fomit-frame-pointer to optimize (note that it should be enabled at -O3 but in a lot of gcc versions I used it is not).
3. I tried increasing the size of Foo to 88 bytes (11 longs) and the leaq instruction became an imulq. Would it make sense to design my structs to have "rounder" sizes to avoid the multiply instructions (in order to optimize array access)?
The leaq instruction is (essentially, and in this case) calculating k*a+b, where "k" is 1, 2, 4, or 8 and "a" and "b" are registers. If "b" is either omitted or the same register as "a", it can be used for structures of 1, 2, 3, 4, 5, 8, and 9 longs.
Larger structures such as 16 longs may be optimizable by calculating the offset for "k" and doubling, but I do not know whether that is what the compiler will actually do; you would have to test.
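The list of sizes reachable with one lea can be checked mechanically. This sketch uses an invented function name and only covers the two forms discussed here: scale alone (n = k) and scale plus the same register as base (n = k + 1).

```c
/* Returns 1 if n can be produced as a multiplier by a single lea with
   scale k in {1, 2, 4, 8}, either without a base register (n == k) or
   with the index register reused as the base (n == k + 1). */
int single_lea_multiplier(unsigned n)
{
    for (unsigned k = 1; k <= 8; k *= 2) {
        if (n == k || n == k + 1)
            return 1;
    }
    return 0;
}
```

It accepts 1, 2, 3, 4, 5, 8 and 9 and rejects 11, which is consistent with the 88-byte (11 longs) struct falling back to imulq.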