Do C compilers guarantee two eightbyte field structs will be passed as INTEGER on SysV x64?

Specifically in the context of the SysV x86-64 ABI
If I have a struct with only two fields, such as:
typedef struct {
    void *foo;
    void *bar;
} foobar_t;
And I pass it to a function with a definition like so:
foobar_t example_function(foobar_t example_param);
The ABI seems to say that each eightbyte field should be passed as INTEGER to the function, therefore rdi == foo and rsi == bar. Similarly, when returning we should be able to use rax and rdx, since we don't need a memory pointer in rdi. If example_function is trivially defined as:
foobar_t example_function(foobar_t example_param) {
    return example_param;
}
A valid assembly implementation, ignoring prologue and epilogue, would be:
example_function:
    mov rax, rdi
    mov rdx, rsi
    ret
Conceivably, a mentally-deficient compiler could fill the struct with NO_CLASS padding and make that assembly invalid somehow. I'm wondering if it's written down anywhere that a struct with only two eightbyte fields must be handled this way.
The larger context to my question is that I'm writing a simple C11 task switcher for my own edification. I'm basing it largely on boost.context and this is exactly how boost passes two-field structs around. I want to know if it's kosher under all circumstances or if boost is cheating a little.

The ABI seems to say that each eightbyte field should be passed as
INTEGER to the function, therefore rdi == foo and rsi == bar.
Agreed, for "global" functions accessible from multiple compilation units, the argument structure is broken up into to eightbyte pieces, the first completely filled by foo, and the second completely filled by bar. These are classified as INTEGER, and therefore passed in %rdi and %rsi, respectively.
Similarly, when returning we should be able to use rax and rdx, since we don't need a memory pointer in rdi.
I don't follow your point about %rdi, but I agree that the members of the return value are returned in %rax and %rdx.
A valid assembly implementation, ignoring prologue and epilogue, would be: [...]
Agreed.
Conceivably, a mentally-deficient compiler could fill the struct with NO_CLASS padding and make that assembly invalid somehow. I'm wondering if it's written down anywhere that a struct with only two eightbyte fields must be handled this way.
A compiler that produces code conforming to the SysV x86-64 ABI will use the registers already discussed for passing the argument and returning the return value. Such a compiler is of course not obligated to implement the function body exactly as you describe, but I'm not seeing your concern. Yes, these details are written down. Although the specific case you present is not explicitly described in the ABI specification you linked, all of the behavior discussed above follows from that specification. That's the point of it.
A compiler that produces code (for a global function) that behaves differently is not mentally-deficient, it is non-conforming.
The larger context to my question is that I'm writing a simple C11
task switcher for my own edification. I'm basing it largely on
boost.context and this is exactly how boost passes two-field structs
around. I want to know if it's kosher under all circumstances or if
boost is cheating a little.
It would take me more analysis than I'm prepared to expend to determine exactly what Boost is doing in the code you point to. Note that it is not what you present in your example_function. But it is reasonable to suppose that Boost is at least attempting to implement its function calls according to the ABI.

Compilers agreeing on struct layout and how they're passed by value as function args are key parts of an ABI. Otherwise they couldn't call each other's functions.
Hand-written asm is no different from compiler-generated asm; it doesn't have to come from the same version of the same compiler to interoperate properly. This is why stable and correct ABIs are such a big deal.
Compatibility with hand-written asm is fairly similar to compatibility with machine code that was compiled a long time ago and has been sitting in a binary shared library for years. If it was correct then, it's correct now. Unless the structs have changed in the source, newly compiled code can call and be called by the existing instructions.
If a compiler doesn't match the standard as-written, it's broken.
Or maybe more accurately, if it doesn't match gcc, it's broken. And if the standard wording doesn't describe what gcc/clang/ICC do, then the standard document is broken.
If you had a compiler for x86-64 System V that passes a 2x void* struct any way other than in 2 registers, that compiler is broken, not your hand-written asm.
(Assuming there aren't a lot of earlier args that use up the arg-passing registers before we get to the struct arg.)
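To see the contract from the caller's side, here is a minimal sketch, assuming the hand-written example_function above is assembled and linked in (the test values are invented for illustration):

#include <assert.h>
#include <stddef.h>

typedef struct {
    void *foo;
    void *bar;
} foobar_t;

foobar_t example_function(foobar_t example_param);  /* the hand-written asm above */

int main(void)
{
    foobar_t p = { &p, NULL };
    /* A conforming x86-64 SysV compiler loads p.foo into rdi and p.bar into
     * rsi for this call, then reads the result back out of rax (foo) and
     * rdx (bar). */
    foobar_t q = example_function(p);
    assert(q.foo == &p && q.bar == NULL);
    return 0;
}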

Related

Why can't stdcall handle varying amounts of arguments?

My understanding is that for the cdecl calling convention, the caller is responsible for cleaning the stack and therefore can pass any number of arguments.
On the other hand, stdcall callees clean the stack and therefore cannot receive varying amounts of arguments.
My question is twofold:
Couldn't stdcall functions also get a parameter about how many variables there are and do the same?
How do cdecl functions know how many arguments they've received?
Couldn't stdcall functions also get a parameter of how many variables are there and do the same?
Yes, sure. You could invent any calling convention. But then that wouldn't be stdcall anymore.
How do cdecl functions know how many arguments they've received?
They don't. They assume the required number of arguments can be found in the locations specified by the calling convention. If some are missing, that's a bug which the code cannot observe. The following code compiles:
printf("%s");
even though it is missing an argument. The result is undefined. For printf-style functions compilers generally issue warnings (if they can) due to knowledge of the functions' internals, but that's not a solution that can be generically applied.
If a caller provides the wrong number or types of arguments, then the behavior is undefined.
Couldn't stdcall functions also get a parameter of how many variables are there and do the same?
If the caller has to pass a separate arg with the number of bytes to be popped, that's more work than just doing add esp, 16 or whatever after the call (cdecl style caller-pops). It would totally defeat the purpose of stdcall, which is to save a few bytes of space at each call site, especially for naive code-gen that wouldn't defer popping args across a couple calls, or reuse the space allocated by a push with mov stores. (There are often multiple call-sites for each function, so the extra 2 bytes for ret imm16 vs. ret is amortized over that.)
Even worse, the callee can't use a variable number efficiently on x86 / x86-64. ret imm16 only works with an immediate (a constant embedded in the machine code), so to pop a variable number of bytes above the return address, a function would have to copy the return address high up in the stack and do a plain ret from there. (Or defeat return-address branch prediction by popping the return address into a register.)
See also:
Stack cleanup in stdcall (callee-pops) for variable arguments (x86 asm)
What calling convention does printf() in C use? (why stdcall is unusable)
How do cdecl functions know how many arguments they've received?
They don't.
C is designed around the assumption that variadic functions don't know how many args they received, so functions need something like a format string or sentinel to know how many to iterate. For example, the POSIX execl(3) (wrapper for the execve(2) system call) takes a NULL-terminated list of char* args.
Thus calling conventions in general don't waste code-size and cycles on providing a count as a side-channel; whatever info the function needs will be part of the real C-level args.
Fun fact: printf("%d", 1, 2, 3) is well-defined behaviour in C, and is required to safely ignore args beyond the ones referenced by the format string.
So using stdcall and calculating based on the format-string can't work. You're right, if you wanted to make a callee-pops convention that worked for variadic functions, you would need to pass a size somewhere, e.g. in a register. But like I said earlier, the caller knows the right number, so it would be vastly easier to let the caller manage the stack, instead of making the callee dig up this extra arg later. That's why no real-world calling conventions work this way, AFAIK.
Passing the number of arguments in a callee-cleans-the-stack convention would be possible, but the additional overhead of the extra parameter outweighs its usefulness. It wastes stack space and complicates the callee's stack handling.
The reason stdcall was invented is that it makes the code smaller: one adjustment in the callee vs. an adjustment at every place it is called (on x86, or on another architecture when there are more parameters than you can pass in registers). x86 even has a retn # instruction where # is the number of bytes to adjust. Windows NT switched from cdecl to stdcall early in its development, and it supposedly reduced the size and improved speed (I believe Larry Osterman blogged about this (mini answer here)).
cdecl functions do not know how many parameters there are. You are allowed (at the ABI level) to pass more arguments than the function will actually use. A printf-style function uses the format parameter as a "guide" to access the parameters one by one. The callee also has to know the type of each parameter, so it knows the size, which in turn (in an implementation-defined manner) lets it walk the list of parameters. (On Windows x86 the parameters are on the stack; all you need is each parameter's size to calculate its offset as you walk the stack.) The va_list type and its macros in stdarg.h provide the glue that lets C functions access these parameters.
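As a concrete illustration of that "guide" idea, here is a minimal stdarg.h sketch that uses an explicit count instead of a format string (the function name is invented):

#include <stdarg.h>

/* Sums `count` int arguments; the callee only knows how many there are
 * because the caller promises the count in the first (named) parameter. */
static int sum_ints(int count, ...)
{
    va_list ap;
    int total = 0;
    va_start(ap, count);
    for (int i = 0; i < count; i++)
        total += va_arg(ap, int);   /* the type must match what was passed */
    va_end(ap);
    return total;
}

For example, sum_ints(3, 10, 20, 12) returns 42; lying about the count is undefined behavior, just like a bad printf format string.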
My summary, based on @IInspectable's answer.
stdcall functions could also get a parameter saying how many arguments there are, but then it wouldn't be stdcall anymore.
cdecl functions don't know how many arguments to read. It is assumed that the function can derive the number of arguments from something pre-determined, like the format string for printf.
If a caller provides fewer arguments than the callee expects, or arguments of an unexpected type, the behavior is undefined. (Thanks for the correction, @Peter Cordes)

What calling convention should I use to make things portable?

I am writing a C interface for the CPU's cpuid instruction. I'm just doing this as kind of an exercise: I don't want to use compiler-dependent headers such as cpuid.h for GCC or intrin.h for MSVC. Also, I'm aware that using C inline assembly would be a better choice, since it avoids thinking about calling conventions (see this implementation): I'd just have to think about different compilers' syntaxes. However, I'd like to start practicing a bit with integrating assembly and C.
Given that I now have to write a different assembly implementation for each major assembler (I was thinking of GAS, MASM and NASM) and for each of them both for x86-64 and x86, how should I handle the fact that different machines and C compilers may use different calling conventions?
If you really want to write, as just an exercise, an assembly function that "conforms" to all the common calling conventions for x86_64 (I know only the Windows one and the System V one), without relying on attributes or compiler flags to force the calling convention, let's take a look at what's common.
The Windows GPR passing order is rcx, rdx, r8, r9. The System V passing order is rdi, rsi, rdx, rcx, r8, r9. In both cases, rax holds the return value if it fits and is a piece of POD. Technically speaking, you can get away with a "polyglot" called function if it (0) saves the union of what each ABI considers non-volatile, and (1) returns something that can fit in a single register, and (2) takes no more than 2 GPR arguments, because overlap would happen past that. To be absolutely generic, you could make it take a single pointer to some structure that would hold whatever arbitrary return data you want.
So now our arguments will come through either rcx and rdx or rdi and rsi. How do you tell which will contain the arguments? I'm actually not sure of a good way. Maybe what you could do instead is have a wrapper that puts the arguments in the right spot, and have your actual function take "padding" arguments, so that your arguments always land in rcx and rdx. You could technically expand to r8 and r9 this way.
#ifdef _WIN32
#define CPUID(information) cpuid(information, NULL, NULL, NULL)
#else
#define CPUID(information) cpuid(NULL, NULL, NULL, information)
#endif
// d duplicates a
// c duplicates b
no_more_than_64_bits_t cpuid(void * a, void * b, void * c, void * d);
Then, in your assembly, save the union of what each ABI considers non-volatile, do your thing, put whatever information you want in the structure to which rcx points, and restore.
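For concreteness, here is a hypothetical shape for that structure and its use, building on the CPUID macro and prototype above (every name here is invented for illustration; the asm routine itself is not shown):

#include <stddef.h>
#include <stdint.h>

/* Input/output block that the asm routine reads and fills in through the
 * pointer it finds in rcx. */
typedef struct {
    uint32_t leaf;                /* input: which CPUID leaf to query        */
    uint32_t eax, ebx, ecx, edx;  /* outputs: the four registers CPUID sets  */
} cpuid_info_t;

void query_vendor(void)
{
    cpuid_info_t info = { .leaf = 0 };   /* leaf 0: max leaf + vendor string */
    CPUID(&info);                        /* pointer lands in rcx on both ABIs */
}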

gcc 8.2+ doesn't always align the stack before a call on x86?

The current (Linux) version of the SysV i386 ABI requires 16-byte stack alignment before a call:
The end of the input argument area shall be aligned on a 16 (32, if __m256 is passed on stack) byte boundary. In other words, the value (%esp + 4) is always a multiple of 16 (32) when control is transferred to the function entry point.
On GCC 8.1 this code aligns the stack to 16-byte boundary prior to the call to callee: (Godbolt)
source           # bytes
call             4
push ebp         4
sub esp, 24      24
sub esp, 4       4
push eax         4
push eax         4
push eax         4
Total            48
On all versions of GCC 8.2 and later, it aligns to a 4-byte boundary: (Godbolt)
source           # bytes
call             4
push ebp         4
sub esp, 16      16
push eax         4
push eax         4
push eax         4
Total            36
This is easily verified if we reduce or increase the number of parameters callee requires.
Changing -mpreferred-stack-boundary bizarrely changes the operand to the sub instruction, but does nothing to change the actual stack alignment: (Godbolt)
So, uh, what gives?
Since you provided a definition of the function in the same translation unit, apparently GCC sees that the function doesn't care about stack alignment and doesn't bother much with it. And apparently this basic inter-procedural analysis / optimization (IPA) is on by default even at -O0.
Turns out this option even has an obvious name when I searched for "ipa" options in the manual: -fipa-stack-alignment is on by default even at -O0. Manually turning it off with -fno-ipa-stack-alignment results in what you expected: a second sub whose value depends on the number of pushes (Godbolt), making sure ESP is aligned by 16 before a call, as modern Linux versions of the i386 SysV ABI require.
Or if you change the definition to just a declaration, then the resulting asm is as expected, fully respecting -mpreferred-stack-boundary.
void callee(void* a, void* b) {
}
to
void callee(void* a, void* b);
Using -fPIC also forces GCC to not assume anything about the callee, because it has to respect the possibility of function interposition (e.g. via LD_PRELOAD).
Without compiling for a shared library, GCC is allowed to assume that any definition it sees for a global function is the definition, thanks to ISO C's one-definition-rule.
If you use __attribute__((noipa)) on the function definition, then call sites won't assume anything based on the definition. Just like if you'd renamed the definition (so you could still look at it) and provided only a declaration of the name the caller uses.
If you just want to stop inlining, you can use __attribute__((noinline,noclone)) instead, to still allow the callsite to be like it would if the optimizer simply chose not to inline, but could still see this definition. That may or may not be what you want.
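A quick sketch of both attributes, assuming you keep the definition in the same file (the function names are just placeholders):

/* Call sites can't use anything they learn from this body (GCC 8+). */
__attribute__((noipa))
void callee(void *a, void *b) { (void)a; (void)b; }

/* Weaker: only prevents inlining and cloning; other IPA, such as the stack
 * alignment analysis discussed above, is still allowed. */
__attribute__((noinline, noclone))
void callee2(void *a, void *b) { (void)a; (void)b; }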
See also How to remove "noise" from GCC/clang assembly output? re: writing functions whose asm is interesting to look at, and compiler options.
And BTW, I found it easiest to change the declaration / definition to variadic, so I could add or remove args with only a change to the caller. I was still able to reproduce your result (the sub amount not changing even when the push amount changes with an extra arg) when there's a definition, but not with just a declaration.
void callee(void* a, ...) // {} // comment out a body or not
;

What could happen when you call a function returning int through a void (*)() pointer?

I would like to know what could happen in a situation like this:
int foo()
{
    return 1;
}

void bar()
{
    void (*fPtr)();
    fPtr = (void(*)())foo;
    fPtr();
}
The address of a function returning int is assigned to a pointer of type void(*)(), and the pointed-to function is called.
What does the standard say about it?
Regardless of the answer to the first question: are we safe to call the function like this? In practice, shouldn't the outcome just be that the callee (foo) puts something in EAX / RAX and the caller (bar) ignores the RAX content and goes on with the program? I'm interested in the Windows calling conventions for x86 and x64.
Thanks a lot for your time
1)
From the C11 standard - 6.5.2.2 - 9
If the function is defined with a type that is not compatible with the type (of the expression) pointed to by the expression that denotes the called function, the behavior is undefined
It is clearly stated that if a function is called using a pointer of a type that does not match the type it is defined with, it leads to undefined behavior.
But the cast is okay.
2)
Regarding your second question, in the case of a well-defined calling convention XXX and implementation YYYY:
You might have disassembled a sample program (even this one) and figured out that it "works". But there are slight complications. You see, compilers these days are very smart. Some are capable of performing precise inter-procedural analysis. A compiler might figure out that you have behavior that is not defined, and it might make an assumption that breaks the program.
A simple example:
Since the compiler sees that this function is being called with type void(*)(), it will assume that it is not supposed to return anything, and it might remove the instructions required to return the correct value.
In that case, other functions calling this function (in the right way) would get a bad value, with visible bad effects.
PS: As pointed out by @PeterCordes, no modern, sane, and useful compiler will perform such an optimization, and it is probably always safe to use such calls in practice. But the intent of the answer and the (probably too simplistic) example is to remind you that you must tread very carefully when dealing with UB.
What happens in practice depends a lot on how the compiler implements this. You're assuming C is just a thin ("obvious") layer over asm, but it isn't.
In this case, a compiler can see that you're calling a function through a pointer with the wrong type (which has undefined behavior1), so it could theoretically compile bar() to:
bar:
    ret
A compiler can assume undefined behavior never happens during the execution of a program. Calling bar() always results in undefined behavior. Therefore the compiler can assume bar is never called and optimize the rest of the program based on that.
1 C99, 6.3.2.3/8: "If a converted pointer is used to call a function whose type is not compatible with the pointed-to type, the behavior is undefined."
About sub-question 2:
Nearly all x86 calling conventions I know (cdecl, stdcall, syscall, fastcall, pascal, 64-bit Windows and 64-bit Linux) allow void functions to modify the ax/eax/rax register, and the only difference between an int function and a void function is that the return value is passed in the eax register.
The same is true for the "default" calling convention on most other CPUs I have already worked with (MIPS, Sparc, ARM, V850/RH850, PowerPC, TriCore). The register name is not eax but different, of course.
So when using these calling convention you can safely call the int function using a void pointer.
There are however calling conventions where this is not the case: I've read about a calling convention that implicitly uses an additional argument for non-void functions...
At the asm level only, this is safe in all normal x86 calling conventions for integer types: eax/rax is call-clobbered, and the caller doesn't have to do anything differently to call a void function vs. an int function and ignoring the return value.
For non-integer return types, this is a problem even in asm. Struct returns are done via a hidden pointer arg that displaces the other args, and the caller is going to store through it so it better not hold garbage. (Assuming the case is more complex than the one shown here, so the function doesn't just inline when optimization is enabled.) See the Godbolt link below for an example of calling through a casted function pointer that results in a store through a garbage "pointer" in rdi.
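A hedged sketch of that hazard (the names are invented here; this is the kind of call to avoid, not a pattern to copy):

/* 24 bytes, so it's returned in memory: the caller is supposed to pass a
 * hidden pointer to the return slot in rdi (x86-64 SysV). */
typedef struct { long a, b, c; } big_t;

big_t make_big(void);              /* defined elsewhere, not shown */

void broken_caller(void)
{
    /* The cast drops the return type, so the caller sets up no hidden
     * pointer; make_big then stores its 24-byte result through whatever
     * garbage happens to be in rdi.  UB in C, and unsafe even in asm. */
    void (*fp)(void) = (void (*)(void))make_big;
    fp();
}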
For legacy 32-bit code, FP return values are in st(0) on the x87 stack, and it's the caller's responsibility to not leave the x87 stack unbalanced. float / double / __m128 return values are safe to ignore in 64-bit ABIs, or in 32-bit code using a calling convention that returns FP values in xmm0 (SSE/SSE2).
In C, this is UB (see other answers for quotes from the standard). When possible / convenient, prefer a workaround (see below).
It's possible that future aggressive optimizations based on a no-UB assumption could break code like this. For example, a compiler might assume any path that leads to UB is never taken, so an if() condition that leads to this code running must always be false.
Note that merely compiling bar() can't break foo() or other functions that don't call bar(). There's only UB if bar() ever runs, so emitting a broken externally-visible definition for foo() (like @Ajay suggests) is not a possible consequence. (Except maybe if you use whole-program optimization and the compiler proves that bar() is always called at least once.) The compiler can break functions that call bar(), though, at least the parts of them that lead to the UB.
However, it is allowed (by accident or on purpose) by many current compilers for x86. Some users expect this to work, and this kind of thing is present in some real codebases, so compiler devs may support this usage even if they implement aggressive optimizations that would otherwise assume this function (and thus all paths that lead to it in any callers) never run. Or maybe not!
An implementation is free to define the behaviour in cases where the ISO C standard leaves the behaviour undefined. However, I don't think gcc/clang or any other compiler explicitly guarantees that this is safe. Compiler devs might or might not consider it a compiler bug if this code stopped working.
I definitely can't recommend doing this, because it may well not continue to be safe. Hopefully if compiler devs decide to break it with aggressive no-UB-assuming optimizations, there will be options to control which kinds of UB are assumed not to happen. And/or there will be warnings. As discussed in comments, whether to take a risk of possible future breakage for short-term performance / convenience benefits depends on external factors (like will lives be at risk, and how carefully you plan to maintain in the future, e.g. checking compiler warnings with future compiler versions.)
Anyway, if it works, it's because of the generosity of your compiler, not because of any kind of standards guarantee. This compiler generosity may be intentional and semi-maintained, though.
See also discussion on another answer: the compilers people actually use aim to be useful, not just standards compliant. The C standard allows enough freedom to make a compliant but not very useful implementation. (Many would argue that compilers that assume no signed overflow even on machines where it has well-defined semantics have already gone past this point, though. See also What Every C Programmer Should Know About Undefined Behavior (an LLVM blog post).)
If the compiler can't prove that it would be UB (e.g. if it can't statically determine which function a function-pointer is pointing to), there's pretty much no way it can break (if the functions are ABI-compatible). Clang's runtime UB-sanitizer would still find it, but a compiler doesn't have much choice in code-gen for calling through an unknown function pointer. It just has to call the way the ABI / calling convention says it should. It can't tell the difference between casting a function pointer to the "wrong" type and casting it back to the correct type (unless you dereference the same function pointer with two different types, which means one or the other must be UB. But the compiler would have a hard time proving it, because the first call might not return. noreturn functions don't have to be marked noreturn.)
But remember that link-time optimization / inlining / constant-propagation could let the compiler see which function is pointed to even in a function that gets a function pointer as an arg or from a global variable.
Workarounds (for a function before you take its address):
If the function won't be part of Link-Time-Optimization, you could lie to the compiler and give it a prototype that matches how you want to call it (as long as you're sure the asm-level calling convention is compatible).
You could write a wrapper function. It's potentially less efficient (an extra jmp if it just tail-calls the original), but if it inlines then you're cloning the function to make a version that doesn't do any of the work of creating a return value. This might still be a loss if that was cheap compared to the extra I-cache / uop cache pressure of a 2nd definition, if the version that does return a value is used too.
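A minimal sketch of that wrapper idea (names invented here):

int foo(void);                         /* the real function, returns int */

/* Wrapper with the signature the function pointer will be called with;
 * it simply discards the return value.  If it inlines, the compiler
 * effectively clones foo without the return-value work; if not, it costs
 * at most an extra call/jmp. */
void foo_as_void(void) { (void)foo(); }

Take the address of foo_as_void instead of casting the address of foo.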
You could also define an alternate name for a function, using linker stuff so both symbols have the same address. That way you can have two prototypes for the same block of compiler-generated machine code.
Using the GNU toolchain, you can use an attribute on a prototype to make it a weak alias (at the asm / linker level). This doesn't work for all targets; it works for ELF object files, but IDK about Windows.
// in GNU C:
int foo(void) { return 4; }

// include this line in a header if you want; weakref is per translation unit
// a definition (or prototype) for foo doesn't have to be visible.
static void foo_void(void) __attribute__((weakref("foo"))); // in C++, use the mangled name

int bar_safe(void) {
    void (*goo)(void) = (void(*)())foo_void;
    goo();
    return 1;
}
example on Godbolt for gcc7.2 and clang5.0.
gcc7.2 inlines foo through the weak alias call to foo_void! clang doesn't, though. I think that means that this is safe, and so is function-pointer casting, in gcc. Alternatively it means that this is potentially dangerous, too. >.<
clang's undefined-behaviour sanitizer does runtime function typeinfo checking (in C++ mode only) for calls through function pointers. int () is different from void (), so it will detect and report this UB on x86. (See the asm on Godbolt). It probably doesn't mean it's actually unsafe at the moment, though, because it doesn't yet detect / warn about it at compile time.
Use the above workarounds in the code that takes the address of the function, not in the code that receives a function pointer.
You want to let the compiler see a real function with the signature that it will eventually be called with, regardless of the function pointer type you pass it through. Make an alias / wrapper with a signature that matches what the function pointer will eventually be cast to. If that means you have to cast the function pointer to pass it in the first place, so be it.
(I think it's safe to create a pointer to the wrong type as long as it's not dereferenced. It's UB to even create an unaligned pointer, even if you don't dereference, but that's different.)
If you have code that needs to deref the same function pointer as int foo(args) in one place and void foo(args) in another place, you're screwed as far as avoiding UB.
C11 §6.3.2.3 paragraph 8:
A pointer to a function of one type may be converted to a pointer to a function of another type and back again; the result shall compare equal to the original pointer. If a converted pointer is used to call a function whose type is not compatible with the referenced type, the behavior is undefined.

How are function arguments stored in memory?

While trying to make my own alternative to the stdarg.h macros for variable-argument functions, a.k.a. functions with an unknown number of arguments, I tried to understand the way the arguments are stored in memory.
Here is an MWE:
#include <stdio.h>

void foo(int num, int bar1, int bar2)
{
    printf("%p %p %p %p\n", &foo, &num, &bar1, &bar2);
}

int main()
{
    int i, j;
    i = 3;
    j = -5;
    foo(2, i, j);
    return 0;
}
I understand without any problem that the function's address is not in the same place as the arguments' addresses.
But the latter aren't always organized in the same way.
On an x86_32 architecture (mingw32), I get this kind of result:
004013B0 0028FEF0 0028FEF4 0028FEF8
which means that the addresses are in the same order as the arguments.
BUT when I run it on x86_64, this time the output is:
0x400536 0x7fff53b5f03c 0x7fff53b5f038 0x7fff53b5f034
Where the addresses are obviously in reverse order w.r.t. the arguments.
Therefore my question is (tl;dr) :
Are the arguments' addresses architecture dependent, or also compiler dependent?
It is compiler dependent. Compiler vendors naturally have to obey the rules of the CPU architecture. A compiler normally obeys the platform ABI as well, at least for code that could potentially interoperate with code produced by another compiler. The platform ABI is a specification of the calling convention, linking semantics, and much more for a given platform.
E.g. compilers on Linux and other Unix-like operating systems adhere to the System V Application Binary Interface, and you'll find in chapter 3.2.3 how parameters are passed to functions (arguments passed in registers are assigned left to right, and arguments passed in memory (on the stack) are passed from right to left). On Windows, the rules are documented here.
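For reference, a small sketch of the x86-64 SysV register assignment that chapter describes (the prototype is made up; Windows x64 would instead use rcx, rdx, r8, r9 for the first four integer arguments and spill the rest to the stack):

/* x86-64 System V: the first six integer/pointer arguments go in registers,
 * in this order; anything further goes on the stack. */
void f(long a, long b, long c, long d, long e, long f6, long g);
/*      rdi     rsi     rdx     rcx     r8      r9     stack   */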
They're ABI dependent. In cases where it doesn't matter (functions that will only be called in a known way), it's entirely compiler dependent, and that usually means using registers, which don't have an address (those arguments will have an address if you ask for it, giving the appearance that everything has an address). Functions that get inlined don't even really have arguments anymore, so the question of what their addresses are is moot, though again they will appear to exist and have an address when you force that to happen.
Arguments may not be stored in memory at all, but passed via registers; however, the language requires an address for any object & is applied to, so your observation may be an artefact of the observation itself: the compiler has simply copied the values to those addresses so that they are addressable.
It might be interesting to see what happens if you request the addresses in a different order than they were passed, for example:
printf("%p %p %p %p\n", &num, &bar1, &bar2, &foo) ;
You may or may not get the same result; the point is that the addresses you observed may be an artefact of the observation rather than of the passing. Certainly in the ARM ABI, the first four arguments to a function are passed in registers R0, R1, R2, and R3, and the rest are passed via the stack.
On x86_64 you get the arguments in a "weird" order because they are not actually passed to the function in memory at all. They are passed in CPU registers. By taking their address you force the compiler to generate code that stores the arguments in memory (on the stack in your case) so that you can take their addresses.
You can't implement stdarg macros without interacting with the compiler. In gcc the stdarg macros just wrap a builtin construct because there is no way for you to know where the arguments might be by the time you need them (the compiler might have reused the registers for something). The builtin stdarg support in gcc can significantly change code generation for functions that use them so that the arguments are available at all. I presume the same goes for other compilers.
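For instance, with GCC/Clang the stdarg.h macros boil down to compiler builtins, roughly like this (a sketch, not the literal header; the _sketch names are invented):

/* Only the compiler knows whether an argument still lives in a register,
 * was spilled to the stack, or was optimized away, so the macros defer
 * to builtins. */
typedef __builtin_va_list va_list_sketch;
#define va_start_sketch(ap, last) __builtin_va_start(ap, last)
#define va_arg_sketch(ap, type)   __builtin_va_arg(ap, type)
#define va_end_sketch(ap)         __builtin_va_end(ap)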
