Is it certain in which register arguments and variables are stored?

I'm still uncertain how registers are being used by the assembler.
Say I have a program:
int main(int rdi, int rsi, int rdx) {
rdx = rdi;
return 0;
}
Would this in assembly be translated into:
movq %rdx, %rdi
ret rax;
I'm new to AT&T syntax and have a hard time predicting when a certain register will be used.
Looking at this chart from Computer Systems - A programmer's perspective, third edition, R.E. Bryant and D. R. O'Hallaron:
[Chart: x86-64 integer registers and their conventional uses (image not reproduced)]

Is it certain in which register arguments and variables are stored?
Only at entry and exit of a function.
There is no guarantee as to what registers will be used within a function, even for variables which are parameters to the function. Compilers can (and often will) move variables around between registers to optimize register/stack usage, especially on register-starved architectures like x86.
In this case, a simple assignment operation like rdx = rdi may not compile to any assembly code at all, because the compiler will simply recognize that both values can now be found in the register %rdi. Even for a more complex operation like rdx = rdi + 1, the compiler has the freedom to store the value in any register, not specifically in %rdx. (It may even store the value back to %rdi, e.g. inc %rdi, if it recognizes that the original value is never used afterwards.)
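To make this concrete, here is a minimal sketch reusing the question's parameter names (the function name `f` is hypothetical). On x86-64 System V with `gcc -O2`, a function like this typically compiles to `leal 1(%rdi), %eax` followed by `ret`, so nothing is ever placed in `%rdx`. Codegen is not guaranteed, so check `gcc -O2 -S` output for your own compiler.

```c
/* Hypothetical example: the parameter names mirror the question, but
   they are only C names. The compiler may compute rdi + 1 directly
   into the return register (EAX) without ever touching RDX. */
int f(int rdi, int rsi, int rdx) {
    (void)rsi;      /* unused; silences -Wunused-parameter */
    rdx = rdi + 1;  /* assigning to a parameter is ordinary C */
    return rdx;
}
```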

No, it would be translated into something like:
mov %rdi, %rdx # move %rdi into %rdx
xor %eax, %eax # zero return value
ret # return
Of course, it's more than likely that rdx = rdi (and therefore mov %rdi, %rdx) will be removed by the compiler, because rdx is not used again.
Credit to @Jester for finding this out before me.

Related

Why can't I get the value of asm registers in C?

I'm trying to get the values of the assembly registers rdi, rsi, rdx, rcx, r8, but I'm getting the wrong value, so I don't know if what I'm doing is taking those values or telling the compiler to write on these registers, and if that's the case how could I achieve what I'm trying to do (Put the value of assembly registers in C variables)?
When this code compiles (with gcc -S test.c)
#include <stdio.h>
void beautiful_function(int a, int b, int c, int d, int e) {
register long rdi asm("rdi");
register long rsi asm("rsi");
register long rdx asm("rdx");
register long rcx asm("rcx");
register long r8 asm("r8");
const long save_rdi = rdi;
const long save_rsi = rsi;
const long save_rdx = rdx;
const long save_rcx = rcx;
const long save_r8 = r8;
printf("%ld\n%ld\n%ld\n%ld\n%ld\n", save_rdi, save_rsi, save_rdx, save_rcx, save_r8);
}
int main(void) {
beautiful_function(1, 2, 3, 4, 5);
}
it outputs the following assembly code (before the function call):
movl $1, %edi
movl $2, %esi
movl $3, %edx
movl $4, %ecx
movl $5, %r8d
callq _beautiful_function
When I compile and execute it outputs this:
0
0
4294967296
140732705630496
140732705630520
(some undefined values)
What did I do wrong, and how could I do this?
Your code didn't work because Specifying Registers for Local Variables explicitly tells you not to do what you did:
The only supported use for this feature is to specify registers for input and output operands when calling Extended asm (see Extended Asm).
Other than when invoking the Extended asm, the contents of the specified register are not guaranteed. For this reason, the following uses are explicitly not supported. If they appear to work, it is only happenstance, and may stop working as intended due to (seemingly) unrelated changes in surrounding code, or even minor changes in the optimization of a future version of gcc:
Passing parameters to or from Basic asm
Passing parameters to or from Extended asm without using input or output operands.
Passing parameters to or from routines written in assembler (or other languages) using non-standard calling conventions.
To put the value of registers in variables, you can use Extended asm, like this:
long rdi, rsi, rdx, rcx;
register long r8 asm("r8");
asm("" : "=D"(rdi), "=S"(rsi), "=d"(rdx), "=c"(rcx), "=r"(r8));
But note that even this might not do what you want: the compiler is within its rights to copy the function's parameters elsewhere and reuse the registers for something different before your Extended asm runs, or even to not pass the parameters at all if you never read them through the normal C variables. (And indeed, even what I posted doesn't work when optimizations are enabled.) You should strongly consider just writing your whole function in assembly instead of inline assembly inside of a C function if you want to do what you're doing.
Even if you had a valid way of doing this (which this isn't), it probably only makes sense at the top of a function which isn't inlined. So you'd probably need __attribute__((noinline, noclone)). (noclone is a GCC attribute that clang will warn about not recognizing; it means not to make an alternate version of the function with fewer actual args, to be called in the case where some of them are known constants that can get propagated into the clone.)
register ... asm local vars aren't guaranteed to do anything except when used as operands to Extended Asm statements. GCC does sometimes still read the named register if you leave it uninitialized, but clang doesn't. (And it looks like you're on a Mac, where the gcc command is actually clang, because so many build scripts use gcc instead of cc.)
So even without optimization, the stand-alone non-inlined version of your beautiful_function is just reading uninitialized stack space when it reads your rdi C variable in const long save_rdi = rdi;. (GCC does happen to do what you wanted here, even at -Os - optimizes but chooses not to inline your function. See clang and GCC (targeting Linux) on Godbolt, with asm + program output.).
Using an asm statement to make register asm do something
(This does what you say you want (reading registers), but because of other optimizations, still doesn't produce 1 2 3 4 5 with clang when the caller can see the definition. Only with actual GCC. There might be a clang option to disable some relevant IPA / IPO optimization, but I didn't find one.)
You can use an asm volatile() statement with an empty template string to tell the compiler that the values in those registers are now the values of those C variables. (The register ... asm declarations force it to pick the right register for the right variable)
#include <stdlib.h>
#include <stdio.h>
__attribute__((noinline,noclone))
void beautiful_function(int a, int b, int c, int d, int e) {
register long rdi asm("rdi");
register long rsi asm("rsi");
register long rdx asm("rdx");
register long rcx asm("rcx");
register long r8 asm("r8");
// "activate" the register-asm locals:
// associate register values with C vars here, at this point
asm volatile("nop # asm statement here" // can be empty, nop is just because Godbolt filters asm comments
: "=r"(rdi), "=r"(rsi), "=r"(rdx), "=r"(rcx), "=r"(r8) );
const long save_rdi = rdi;
const long save_rsi = rsi;
const long save_rdx = rdx;
const long save_rcx = rcx;
const long save_r8 = r8;
printf("%ld\n%ld\n%ld\n%ld\n%ld\n", save_rdi, save_rsi, save_rdx, save_rcx, save_r8);
}
int main(void) {
beautiful_function(1, 2, 3, 4, 5);
}
This makes asm in your beautiful_function that does capture the incoming values of your registers. (It doesn't inline, and the compiler happens not to have used any instructions before the asm statement that steps on any of those registers. The latter is not guaranteed in general.)
On Godbolt with clang -O3 and gcc -O3
gcc -O3 does actually work, printing what you expect. clang still prints garbage, because the caller sees that the args are unused, and decides not to set those registers. (If you'd hidden the definition from the caller, e.g. in another file without LTO, that wouldn't happen.)
(With GCC, the noinline,noclone attributes are enough to disable this inter-procedural optimization, but not with clang. Not even compiling with -fPIC prevents it. I guess the idea is that symbol interposition to provide an alternate definition of beautiful_function that does use its args would violate the one definition rule in C. So if clang can see a definition for a function, it assumes that's how the function works, even if it isn't allowed to actually inline it.)
With clang:
main:
pushq %rax # align the stack
# arg-passing optimized away
callq beautiful_function@PLT
# indirect through the PLT because I compiled for Linux with -fPIC,
# and the function isn't "static"
xorl %eax, %eax
popq %rcx
retq
But the actual definition for beautiful_function does exactly what you want:
# clang -O3
beautiful_function:
pushq %r14
pushq %rbx
nop # asm statement here
movq %rdi, %r9 # copying all 5 register outputs to different regs
movq %rsi, %r10
movq %rdx, %r11
movq %rcx, %rbx
movq %r8, %r14
leaq .L.str(%rip), %rdi
xorl %eax, %eax
movq %r9, %rsi # then copying them to printf args
movq %r10, %rdx
movq %r11, %rcx
movq %rbx, %r8
movq %r14, %r9
popq %rbx
popq %r14
jmp printf@PLT # TAILCALL
GCC wastes fewer instructions. For example, it starts with movq %r8, %r9 to move your r8 C var into the 6th arg register for printf, then uses movq %rcx, %r8 to set up the 5th arg, overwriting one of the output registers before it has read all of them (safe, because r8 was already copied). That's something clang was over-cautious about. However, GCC does still push/pop %r12 around the asm statement; I don't understand why. It ends by tailcalling printf, so it wasn't for alignment.
Related:
How to specify a specific register to assign the result of a C expression in inline assembly in GCC? - the opposite problem: materialize a C variable value in a specific register at a certain point.
Reading a register value into a C variable - the previous canonical Q&A, which uses the now-unsupported register ... asm("regname") method like you were trying to, or a register-asm global variable, which hurts the efficiency of all your code by reserving that register so the compiler leaves it otherwise untouched.
I forgot I'd answered that Q&A, making basically the same points as this. And some other points, e.g. that this doesn't work on registers like the stack pointer.

why callees don't use caller saved registers first?

We know that by x86-64 convention, registers %rbx, %rbp, and %r12–%r15 are classified as callee-saved registers, while %r10 and %r11 are caller-saved registers.
But when I compile C code, in most cases, e.g. when function P calls Q, I see the following assembly code for function Q:
Q:
push %rbx
movq %rdx, %rbx
...
popq %rbx
ret
We know that since %rbx is a callee-saved register, we must store it on stack and restore it for the caller P later.
But wouldn't it be more concise, and save stack operations, to use a caller-saved register like %r10 instead:
Q:
movq %rdx, %r10
...
ret
so the callee doesn't need to worry about saving and restoring the register for the caller, because the caller has already pushed it to the stack before calling the callee?
You seem to be mixed up about what "caller-saved" means. I think this bad choice of terminology has fooled you into thinking that compilers actually will save them in the caller around function calls. That would be slower usually (Why do compilers insist on using a callee-saved register here?), especially in a function that makes more than one call, or calls in a loop.
Better terminology is call-clobbered vs. call-preserved, which reflects how compilers actually use them, and how humans should think about them: registers that die on a function call, or that don't. Compilers don't push/pop a call-clobbered (aka caller-saved) register around each call.
But if you were going to push/pop a value around a single function call, you'd just do that with %rdx. Copying it to R10 would just be a waste of instructions. So the mov to %r10 is useless: with a later push it's merely inefficient, and without one it's incorrect.
The reason for copying to a call-preserved register is so the function arg will survive a function call that the function makes later. Obviously you have to use a call-preserved register for that; call-clobbered registers don't survive function calls.
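A minimal sketch of the case where a call-preserved register is genuinely needed: `a` must survive the call to `ext`. With `gcc -O2` on x86-64 System V you would typically see `push %rbx`, a copy of EDI into EBX before the call, and `pop %rbx` before `ret`. Both function names are hypothetical, and `noinline` only keeps a real call visible in the asm.

```c
/* Hypothetical callee; noinline keeps an actual call in bar's asm. */
__attribute__((noinline)) int ext(int x) { return 2 * x; }

/* `a` is still needed after ext() returns. EDI (where `a` arrives) is
   call-clobbered, so the compiler typically copies `a` into a
   call-preserved register such as EBX before the call, which is why
   bar has to push/pop RBX. */
int bar(int a) {
    return ext(a) + a;
}
```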
When a call-preserved register isn't needed, yes compilers do pick call-clobbered registers.
If you expand your example to an actual MCVE instead of just showing the asm without source, this should be clearer. If you write a leaf function that needs a mov to evaluate an expression, or a non-leaf that doesn't need any of its args after the first function-call, you won't see it wasting instructions saving and using a call-preserved reg. e.g.
int foo(int a) {
return (a>>2) + (a>>3) + (a>>4);
}
https://godbolt.org/z/ceM4dP with GCC and clang -O3:
# gcc10.2
foo(int):
mov eax, edi
mov edx, edi # using EDX, a call-clobbered register
sar edi, 4
sar eax, 2
sar edx, 3
add eax, edx
add eax, edi
ret
Right shift can't be done with LEA to copy-and-operate, and shifting the same input 3 different ways convinces GCC to use mov to copy the input. (Instead of doing a chain of right-shifts: compilers love to minimize latency at the expense of more instructions because that's often best for wide OoO exec.)

How do I translate an optimized x86-64 asm loop back to a C for loop?

I have the following:
foo:
movl $0, %eax //result = 0
cmpq %rsi, %rdi // rdi = x, rsi = y?
jle .L2
.L3:
addq %rdi, %rax //result = result + i?
subq $1, %rdi //decrement?
cmp %rdi, rsi
jl .L3
.L2
rep
ret
And I'm trying to translate it to:
long foo(long x, long y)
{
long i, result = 0;
for (i= ; ; ){
//??
}
return result;
}
I don't know what cmpq %rsi, %rdi mean.
Why isn't there another &eax for long i?
I would love some help in figuring this out. I don't know what I'm missing - I been going through my notes, textbook, and rest of the internet and I am stuck. It's a review question, and I've been at it for hours.
Assume this is a function taking 2 parameters and using the gcc amd64 (x86-64 System V) calling convention: it will pass the two parameters in rdi and rsi. In your C function you call these x and y.
long foo(long x /*rdi*/, long y /*rsi*/)
{
//movl $0, %eax
long result = 0; /* rax */
//cmpq %rsi, %rdi
//jle .L2
if (x > y) {
do {
//addq %rdi, %rax
result += x;
//subq $1, %rdi
--x;
//cmp %rdi, rsi
//jl .L3
} while (x > y);
}
return result;
}
I don't know what cmpq %rsi, %rdi mean
That's AT&T syntax for cmp rdi, rsi. https://www.felixcloutier.com/x86/CMP.html
You can look up the details of what a single instruction does in an ISA manual.
More importantly, cmp/jcc like cmp %rsi,%rdi/jl is like jump if rdi<rsi.
(See Assembly - JG/JNLE/JL/JNGE after CMP.) If you go through all the details of how cmp sets flags, and which flags each jcc condition checks, you can verify that it's correct, but it's much easier to just use the semantic meaning of JL = Jump on Less-than (assuming flags were set by a cmp) to remember what they do.
(It's reversed because of AT&T syntax; jcc predicates have the right semantic meaning for Intel syntax. This is one of the major reasons I usually prefer Intel syntax, but you can get used to AT&T syntax.)
From the use of rdi and rsi as inputs (reading them without / before writing them), they're the arg-passing registers. So this is the x86-64 System V calling convention, where integer args are passed in RDI, RSI, RDX, RCX, R8, R9, then on the stack. (What are the calling conventions for UNIX & Linux system calls on i386 and x86-64 covers function calls as well as system calls). The other major x86-64 calling convention is Windows x64, which passes the first 2 args in RCX and RDX (if they're both integer types).
So yes, x=RDI and y=RSI. And yes, result=RAX. (writing to EAX zero-extends into RAX).
From the code structure (not storing/reloading every C variable to memory between statements), it's compiled with some level of optimization enabled, so the for() loop turned into a normal asm loop with the conditional branch at the bottom (see Why are loops always compiled into "do...while" style (tail jump)?). @BrianWalker's answer shows the asm loop transliterated back to C, with no attempt to form it back into an idiomatic for loop.
From the cmp/jcc ahead of the loop, we can tell that the compiler can't prove the loop runs a non-zero number of iterations. So whatever the for() loop condition is, it might be false the first time. (That's unsurprising given signed integers.)
Since we don't see a separate register being used for i, we can conclude that optimization reused another var's register for i. Like probably for(i=x;, and then with the original value of x being unused for the rest of the function, it's "dead" and the compiler can just use RDI as i, destroying the original value of x.
I guessed i=x instead of y because RDI is the arg register that's modified inside the loop. We expect that the C source modifies i and result inside the loop, and presumably doesn't modify its input variables x and y. It would make no sense to do i=y and then do stuff like x--, although that would be another valid way of decompiling.
cmp %rdi, %rsi / jl .L3 means the loop condition to (re)enter the loop is rsi-rdi < 0 (signed), i.e. y < i, or i > y.
The cmp/jcc before the loop is checking the opposite condition; notice that the operands are reversed and it's checking jle, i.e. jng. So that makes sense, it really is same loop condition peeled out of the loop and implemented differently. Thus it's compatible with the C source being a plain for() loop with one condition.
sub $1, %rdi is obviously i-- or --i. We can do that inside the for(), or at the bottom of the loop body. The simplest and most idiomatic place to put it is in the 3rd section of the for(;;) statement.
addq %rdi, %rax is obviously adding i to result. We already know what RDI and RAX are in this function.
Putting the pieces together, we arrive at:
long foo(long x, long y)
{
long i, result = 0;
for (i= x ; i>y ; i-- ){
result += i;
}
return result;
}
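As a quick sanity check, this sketch compares the idiomatic for() loop with the literal guarded do...while transliteration from the first answer. The names `foo_for` and `foo_dowhile` are hypothetical. The `if (x > y)` guard is the loop condition peeled out ahead of the loop, matching the cmp/jle before .L3.

```c
/* Idiomatic reconstruction: i takes over x's register (RDI). */
long foo_for(long x, long y) {
    long i, result = 0;
    for (i = x; i > y; i--)
        result += i;
    return result;
}

/* Literal transliteration of the asm: the condition is checked once
   before the loop (cmpq %rsi,%rdi / jle .L2) and again at the bottom
   (cmpq %rdi,%rsi / jl .L3), as in a rotated loop. */
long foo_dowhile(long x, long y) {
    long result = 0;
    if (x > y) {
        do {
            result += x;
            --x;
        } while (x > y);
    }
    return result;
}
```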
Which compiler made this code?
From the .L3: label names, this looks like output from gcc. (Which somehow got corrupted, removing the : from .L2, and more importantly removing the % from %rsi in one cmp. Make sure you copy/paste code into SO questions to avoid this.)
So it may be possible with the right gcc version/options to get exactly this asm back out for some C input. It's probably gcc -O1, because movl $0, %eax rules out -O2 and higher (where GCC would look for the xor %eax,%eax peephole optimization for zeroing a register efficiently). But it's not -O0 because that would be storing/reloading the loop counter to memory. And -Og (optimize a bit, for debugging) likes to use a jmp to the loop condition instead of a separate cmp/jcc to skip the loop. This level of detail is basically irrelevant for simply decompiling to C that does the same thing.
The rep ret is another sign of gcc; gcc7 and earlier used this in their default tune=generic output for ret that's reached as a branch target or a fall-through from a jcc, because of AMD K8/K10 branch prediction. What does `rep ret` mean?
gcc8 and later will still use it with -mtune=k8 or -mtune=barcelona. But we can rule that out because that tuning option would use dec %rdi instead of subq $1, %rdi. (Only a few modern CPUs have any problems with inc/dec leaving CF unmodified, for register operands. INC instruction vs ADD 1: Does it matter?)
gcc4.8 and later put rep ret on the same line. gcc4.7 and earlier print it as you've shown, with the rep prefix on the line before.
gcc4.7 and later like to put the initial branch before the mov $0, %eax, which looks like a missed optimization. It means they need a separate return 0 path out of the function, which contains another mov $0, %eax.
gcc4.6.4 -O1 reproduces your output exactly, for the source shown above, on the Godbolt compiler explorer
# compiled with gcc4.6.4 -O1 -fverbose-asm
foo:
movl $0, %eax #, result
cmpq %rsi, %rdi # y, x
jle .L2 #,
.L3:
addq %rdi, %rax # i, result
subq $1, %rdi #, i
cmpq %rdi, %rsi # i, y
jl .L3 #,
.L2:
rep
ret
So does this other version which uses i=y. Of course there are many things we could add that would optimize away, like maybe i=y+1 and then having a loop condition like x>--i. (Signed overflow is undefined behaviour in C, so the compiler can assume it doesn't happen.)
// also the same asm output, using i=y but modifying x in the loop.
long foo2(long x, long y) {
long i, result = 0;
for (i= y ; x>i ; x-- ){
result += x;
}
return result;
}
In practice the way I actually reversed this:
I copy/pasted the C template into Godbolt (https://godbolt.org/). I could see right away (from the mov $0 instead of xor-zero, and from the label names) that it looked like gcc -O1 output, so I put in that command line option and picked an old-ish version of gcc like gcc6. (Turns out this asm was actually from a much older gcc).
I tried an initial guess like x<y based on the cmp/jcc, and i++ (before I'd actually read the rest of the asm carefully at all), because for loops often use i++. The trivial-looking infinite-loop asm output showed me that was obviously wrong :P
I guessed that i=x, but after taking a wrong turn with a version that did result += x but i--, I realized that i was a distraction and at first simplified by not using i at all. I just used x-- while first reversing it because obviously RDI=x. (I know the x86-64 System V calling convention well enough to see that instantly.)
After looking at the loop body, the result += x and x-- were totally obvious from the add and sub instructions.
cmp/jl was obviously a something < something loop condition involving the 2 input vars.
I wasn't sure if it was x<y or y<x, and newer gcc versions were using jne as the loop condition. I think at that point I cheated and looked at Brian's answer to check it really was x > y, instead of taking a minute to work through the actual logic. But once I had figured out it was x--, only x>y made sense. The other one would be true until wraparound if it entered the loop at all, but signed overflow is undefined behaviour in C.
Then I looked at some older gcc versions to see if any made asm more like in the question.
Then I went back and replaced x with i inside the loop.
If this seems kind of haphazard and slapdash, that's because this loop is so tiny that I didn't expect to have any trouble figuring it out, and I was more interested in finding source + gcc version that exactly reproduced it, rather than the original problem of just reversing it at all.
(I'm not saying beginners should find it that easy, I'm just documenting my thought process in case anyone's curious.)

data movement error clarification

I'm currently solving problem 3.3 from 3rd edition of Computer System: a programmer's perspective and I'm having a hard time understanding what these errors mean...
movb $0xF, (%ebx) gives an error because ebx can't be used as address register
movl %rax, (%rsp) and
movb %si, 8(%rbp) give an error saying that there's a mismatch between instruction suffix and register I.D.
movl %eax, %rdx gives an error saying that destination operand incorrect size
why can't we use ebx as address register? Is it because its 32-bit register? Would the following line work if it was movb $0xF, (%rbx) instead? since rbx is of 64bit register?
for the error regarding mismatch between instruction suffix and register I.D, does this error appear because it should've been movq %rax, (%rsp) and movew %si, 8(%rbp) instead of movl %rax, (%rsp) and movb %si, 8(%rbp)?
and lastly, for the error regarding "destination operand incorrect size", is this because the destination register was 64 bit instead of 32? so if the line of code was movl %eax, %edx instead, the error wouldn't have occurred?
any enlightenment would be appreciated.
this is for x86-64
movb $0xF, (%ebx) gives an error because ebx can't be used as address register
It's true that ebx can't be used as an address register (for x86-64), but rbx can. ebx is the lower 32bits of rbx. The whole point of 64bit code is that addresses can be 64bits, so trying to reference memory by using a 32bit register makes little sense.
movl %rax, (%rsp) and movb %si, 8(%rbp) gives error saying that there's a mismatch between instruction suffix and register I.D.
Yes, because you are using movl, the 'l' means long, which (in this context) means 32bits. However, rax is a 64bit register. If you want to write 64bits out of rax, you should use movq. If you want to write 32bits, you should use eax.
movl %eax, %rdx gives an error saying that destination operand incorrect size
You are trying to move a 32bit value into a 64bit register. There are instructions to do this conversion for you (e.g. movslq, aka movsxd, to sign-extend into any register, or cltq/cdqe for EAX into RAX), but movl isn't one of them.
movb $0xF, (%ebx) assembles just fine (with a 0x67 address-size prefix), and executes correctly if the address in ebx is valid.
It might be a bug (and e.g. lead to a segfault from truncating a pointer), or sub-optimal, but if your book makes any stronger claim than that (like that it won't assemble) then your book contains an error.
The only reason you'd ever use that instead of movb $0xF, (%rbx) is if the upper bytes of %rbx potentially held garbage, e.g. in the x32 ABI (ILP32 in long mode), or if you're a dumb compiler that always uses address-size prefixes when targeting 32-bit-pointer mode even when addresses are known to be safely zero-extended.
32-bit address size is actually useful for the x32 ABI for the more common case where an index register holds high garbage, e.g. movl $0x12345, (%edi, %esi,4).
gcc -mx32 could easily emit a movb $0xF, (%ebx) instruction in real life. (Note that -mx32 (32-bit pointers in long mode) is different from -m32 (i386 ABI))
int ext(); // can't inline
void foo(char *p) {
ext(); // clobbers arg-passing registers
*p = 0xf; // so gcc needs to save the arg for after the call
}
Compiles with gcc7.3 -mx32 -O3 on the Godbolt compiler explorer into
foo(char*):
pushq %rbx # rbx is gcc's first choice of call-preserved reg.
movq %rdi, %rbx # stupid gcc copies the whole 64 bits when only the low 32 are useful
call ext()
movb $15, (%ebx) # $15 = $0xF
popq %rbx
ret
mov %edi, %ebx would have been better; IDK why gcc wants to copy the whole 64-bit register when it's treating pointers as 32-bit values. The x32 ABI unfortunately never really caught on on x86 so I guess nobody's put in the time to get gcc to generate great code for it.
AArch64 also has an ILP32 ABI to save memory / cache-footprint on pointer data, so maybe gcc will get better at 32-bit pointers in 64-bit mode in general (benefiting x86-64 as well) if any work for AArch64 ILP32 improves the common cross-architecture parts of this.
so if the line of code was movl %eax, %edx instead, the error wouldn't have occurred?
Right, that would zero-extend EAX into RDX. If you wanted to sign-extend EAX into RDX, use movslq %eax, %rdx (aka Intel-syntax movsxd)
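In C terms, zero- versus sign-extension is decided by the source type. A minimal sketch (function names hypothetical): with `gcc -O2` on x86-64 System V, `zext32` typically compiles to `movl %edi, %eax` and `sext32` to `movslq %edi, %rax`.

```c
#include <stdint.h>

/* uint32_t -> uint64_t zero-extends: a plain 32-bit mov is enough,
   because writing EAX implicitly zeroes the upper half of RAX. */
uint64_t zext32(uint32_t x) { return x; }

/* int32_t -> int64_t sign-extends: movslq (Intel syntax: movsxd). */
int64_t sext32(int32_t x) { return x; }
```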
(Almost) all x86 instructions require all their operands to be the same size. (In terms of operand-size; many instructions have a form with an 8-bit or 32-bit immediate that's sign extended to 64-bit or whatever the instruction's operand-size is. e.g. add $1, %eax will use the 3-byte add imm8, r/m32 form.)
Exceptions include shl %cl, %eax, and movzx/movsx.
In AT&T syntax, the sizes of registers have to match the operand-size suffix, if you use one. If you don't, the registers imply an operand-size. e.g. mov %eax, %edx is the same as movl.
Memory + immediate instructions with no register source or destination need an explicit size: add $1, (%rdx) won't assemble because the operand-size is ambiguous, but add %eax, (%rdx) is an addl (32-bit operand-size).
movew %si, 8(%rbp)
No, movw %si, 8(%rbp) would work though :P But note that if you've made a traditional stack frame with push %rbp / mov %rsp, %rbp on function entry, that store to 8(%rbp) will overwrite the low 16 bits of your return address on the stack.
But there's no requirement in x86-64 code for Windows or Linux that you have %rbp pointing there, or holding a valid pointer at all. It's just a call-preserved register like %rbx that you can use for whatever you want as long as you restore the caller's value before returning.

Why does C not push a pointer on the stack when calling a assembly function?

I am currently trying to get some experience with calling assembly functions from C. Therefore, I created a little program which calculates the sum of all array elements.
The C Code looks like this:
#include <stdio.h>
#include <stdint.h>
extern int32_t arrsum(int32_t* arr,int32_t length);
int main()
{
int32_t test[] = {1,2,3};
int32_t length = 3;
int32_t sum = arrsum(test,length);
printf("Sum of arr: %d\n",sum);
return 0;
}
And the assembly function looks like this:
.text
.global arrsum
arrsum:
pushq %rbp
movq %rsp, %rbp
pushq %rdi
pushq %rcx
movq 24(%rbp),%rcx
#movq 16(%rbp),%rdi
xorq %rax,%rax
start_loop:
addl (%rdi),%eax
addq $4,%rdi
loop start_loop
popq %rcx
popq %rdi
movq %rbp , %rsp
popq %rbp
ret
I assumed that C obeys the calling convention and pushes all arguments on the stack. And indeed, at position 24(%rbp) I am able to find the length of the array. I expected to find the pointer to the array at 16(%rbp), but instead I just found 0x0. After some debugging I found that C didn't push the pointer at all but instead moved the whole pointer into the %rdi register.
Why does this happen? I couldn't find any information about this behavior.
The calling convention the C compiler will use depends on your system, metadata you pass to the compiler and flags. It sounds like your compiler is using the System V AMD64 calling convention detailed here: https://en.m.wikipedia.org/wiki/X86_calling_conventions (implying that you're using a Unix-like OS on a 64 bit x86 chip). Basically, in this convention most arguments go into registers because it's faster and the 64 bit x86 systems have enough registers to make this work (usually).
I assumed that C obeys the calling convention and pushes all arguments on the stack.
There is no "the" calling convention. Passing arguments via the stack is only one possible calling convention (of many). This strategy is commonly used on 32-bit systems, but even there, it is not the only way that parameters are passed.
Most 64-bit calling conventions pass the first 4–6 arguments in registers, which is generally more efficient than passing them on the stack.
Exactly which calling convention is at play here is system-dependent; your question doesn't give much of a clue whether you're using Windows or *nix, but I'm guessing that you're using *nix since the parameter is being passed in the rdi register. In that case, the compiler would be following the System V AMD64 ABI.
In the System V AMD64 calling convention, the first six integer-sized arguments (which can also be pointers) are passed in the registers RDI, RSI, RDX, RCX, R8, and R9, in that order. Each register is dedicated to a parameter, thus parameter 1 always goes into RDI, parameter 2 always goes into RSI, and so on. Floating-point parameters are instead passed via the vector registers, XMM0-XMM7. Additional parameters are passed on the stack in reverse order.
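Under that convention, `arr` arrives in RDI and `length` in ESI (the low 32 bits of RSI), so the function doesn't need the stack at all. Below is one possible corrected version, sketched as GNU C top-level `__asm__` so it fits in one file. It assumes x86-64 Linux (ELF symbols without the leading underscore macOS uses) and replaces `loop` with `dec`/`jnz`, which also makes a zero or negative length return 0 instead of looping almost forever.

```c
#include <stdint.h>

/* Assumes x86-64 System V on Linux (ELF): arr in RDI, length in ESI,
   32-bit result returned in EAX. No stack frame is needed. */
int32_t arrsum(int32_t *arr, int32_t length);

__asm__(
    ".pushsection .text\n"
    ".globl arrsum\n"
    ".type arrsum, @function\n"
    "arrsum:\n"
    "    xorl  %eax, %eax\n"      /* sum = 0 */
    "    testl %esi, %esi\n"      /* length <= 0: nothing to add */
    "    jle   2f\n"
    "1:\n"
    "    addl  (%rdi), %eax\n"    /* sum += *arr */
    "    addq  $4, %rdi\n"        /* arr++ (4-byte elements) */
    "    decl  %esi\n"            /* --length */
    "    jnz   1b\n"
    "2:\n"
    "    ret\n"
    ".size arrsum, .-arrsum\n"
    ".popsection\n"
);
```

With the question's main(), this prints "Sum of arr: 6".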
More information about this and other common calling conventions is available in the x86 tag wiki.
