Measuring size of a function generated with Clang/LLVM? - c

Recently, when working on a project, I had a need to measure the size of a C function in order to be able to copy it somewhere else, but was not able to find any "clean" solutions (ultimately, I just wanted to have a label inserted at the end of the function that I could reference).
Having written the LLVM backend for this architecture (while it may look like ARM, it isn't) and knowing that it emitted assembly code for that architecture, I opted for the following hack (I think the comment explains it quite well):
/***************************************************************************
* if ENABLE_SDRAM_CALLGATE is enabled, this function should NEVER be called
* from C code as it will corrupt the stack pointer, since it returns before
* its epilog. this is done because clang does not provide a way to get the
* size of the function so we insert a label with inline asm to measure the
* function. in addition to that, it should not call any non-forceinlined
* functions to avoid generating a PC relative branch (which would fail if
* the function has been copied)
**************************************************************************/
void sdram_init_late(sdram_param_t* P) {
    /* ... */
#ifdef ENABLE_SDRAM_CALLGATE
    asm(
        "b lr\n"
        ".globl sdram_init_late_END\n"
        "sdram_init_late_END:"
    );
#endif
}
It worked as desired but required some assembler glue code in order to call it and is a pretty dirty hack that only worked because I could assume several things about the code generation process.
I've also considered other ways of doing this which would work better if LLVM was emitting machine code (since this approach would break once I added an MC emitter to my LLVM backend). The approach I considered involved taking the function and searching for the terminator instruction (which would either be a b lr instruction or a variation of pop ..., lr) but that could also introduce additional complications (though it seemed better than my original solution).
Can anyone suggest a cleaner way of getting the size of a C function without having to resort to incredibly ugly and unreliable hacks such as the ones outlined above?

I think you're right that there aren't any truly portable ways to do this. Compilers are allowed to re-order functions, so taking the address of the next function in source order isn't safe (but does work in some cases).
If you can parse the object file (maybe with libbfd), you might be able to get function sizes from that.
clang's asm output has this metadata (the .size assembler directive after every function), but I'm not sure whether it ends up in the object file.
int foo(int a) { return a * a * 2; }
## clang-3.8 -O3 for amd64:
## some debug-info lines manually removed
.globl foo
foo:
.Lfunc_begin0:
.cfi_startproc
imul edi, edi
lea eax, [rdi + rdi]
ret
.Lfunc_end0:
.size foo, .Lfunc_end0-foo ####### This line
Compiling this to a .o with clang-3.8 -O3 -Wall -Wextra func-size.c -c, I can then do:
$ readelf --symbols func-size.o
Symbol table '.symtab' contains 4 entries:
Num: Value Size Type Bind Vis Ndx Name
0: 0000000000000000 0 NOTYPE LOCAL DEFAULT UND
1: 0000000000000000 0 FILE LOCAL DEFAULT ABS func-size.c
2: 0000000000000000 0 SECTION LOCAL DEFAULT 2
3: 0000000000000000 7 FUNC GLOBAL DEFAULT 2 foo ### This line
The three instructions total 7 bytes, which matches up with the size output here. It doesn't include the padding to align the entry point, or the next function: the .align directives are outside the two labels that are subtracted to calculate the .size.
This probably doesn't work well for stripped executables; even their global functions won't still be present in the executable's symbol table. So you might need a multi-step build process:
compile your "normal" code
get the sizes of the functions you care about into a table, using readelf | some text processing > sizes.c (see the sketch below)
compile sizes.c
link everything together
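For the readelf step, something along these lines could work. This is only a sketch: it assumes readelf's usual column layout and reuses the foo example from above, so adjust the field numbers and symbol name for your toolchain.
$ readelf --symbols func-size.o |
    awk '$4 == "FUNC" && $8 == "foo" {print "const unsigned long foo_size = " $3 ";"}' > sizes.c
$ cat sizes.c
const unsigned long foo_size = 7;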
Caveat
A really clever compiler could compile multiple similar functions to share a common implementation. So one of the functions jumps into the middle of the other function body. If you're lucky, all the functions are grouped together, with the "size" of each measuring from its entry point all the way to the end of the blocks of code it uses. (But that overlap would make the total sizes add up to more than the size of the file.)
Current compilers don't do this, but you can prevent it by putting the function in a separate compilation unit, and not using whole-program link-time optimization.
A compiler could decide to put a conditionally-executed block of code before the function entry point, so the branch can use a shorter encoding for a small displacement. This makes that block look like a static "helper" function which probably wouldn't be included in the "size" calculation for the function. Current compilers never do this either, though.
Another idea, which I'm not confident is safe:
Put an asm volatile with just a label definition at the end of your function, and then assume the function size is at most that label's offset plus 32 bytes or so. So when you copy the function, you allocate a buffer 32B larger than your "calculated" size. Hopefully there's only a ret insn beyond the label, but in practice the label probably lands before the function epilogue, which pops all the call-preserved registers the function used.
I don't think the optimizer can duplicate an asm volatile statement, so it would force the compiler to jump to a common epilogue instead of duplicating the epilogue like it might sometimes for early-out conditions.
But I'm not sure there's an upper bound on how much could end up after the asm volatile.
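A rough sketch of that idea is below. The function and label names are made up, the 32-byte slack is a guess, and casting a function pointer to a data pointer isn't strictly portable C (though it's common in the kind of embedded code that copies functions around):
#include <string.h>

#define FUNC_SIZE_SLACK 32          /* guess at how much epilogue follows the label */

void my_func(int x)
{
    (void)x;
    /* ... the code you actually want to copy ... */
    asm volatile(".globl my_func_end\n"
                 "my_func_end:");
}

extern const char my_func_end[];    /* the label emitted by the asm above */

void copy_my_func(void *dst)
{
    /* upper-bound estimate: offset of the label, plus slack for the epilogue */
    size_t n = (size_t)(my_func_end - (const char *)my_func) + FUNC_SIZE_SLACK;
    memcpy(dst, (const void *)my_func, n);
}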

Related

gcc 8.2+ doesn't always align the stack before a call on x86?

The current (Linux) version of the SysV i386 ABI requires 16-byte stack alignment before a call:
The end of the input argument area shall be aligned on a 16 (32, if __m256 is passed on stack) byte boundary. In other words, the value (%esp + 4) is always a multiple of 16 (32) when control is transferred to the function entry point.
On GCC 8.1 this code aligns the stack to 16-byte boundary prior to the call to callee: (Godbolt)
source          # bytes
call                  4
push ebp              4
sub esp, 24          24
sub esp, 4            4
push eax              4
push eax              4
push eax              4
Total                48
On all versions of GCC 8.2 and later, it aligns to a 4-byte boundary: (Godbolt)
source          # bytes
call                  4
push ebp              4
sub esp, 16          16
push eax              4
push eax              4
push eax              4
Total                36
This is easily verifiable if we lower or raise the number of parameters callee requires.
Changing -mpreferred-stack-boundary bizarrely changes the operand of the sub instruction, but does nothing to change the actual stack alignment: (Godbolt)
So, uh, what gives?
Since you provided a definition of the function in the same translation unit, apparently GCC sees that the function doesn't care about stack alignment and doesn't bother much with it. And apparently this basic inter-procedural analysis / optimization (IPA) is on by default even at -O0.
Turns out this option even has an obvious name when I searched for "ipa" options in the manual: -fipa-stack-alignment is on by default even at -O0. Manually turning it off with -fno-ipa-stack-alignment results in what you expected, a second sub whose value depends on the number of pushes (Godbolt), making sure ESP is aligned by 16 before a call like modern Linux versions of the i386 SysV ABI use.
Or if you change the definition to just a declaration, then the resulting asm is as expected, fully respecting -mpreferred-stack-boundary.
void callee(void* a, void* b) {
}
to
void callee(void* a, void* b);
Using -fPIC also forces GCC not to assume anything about the callee, so with that option it does respect the possibility of function interposition (e.g. via LD_PRELOAD).
Without compiling for a shared library, GCC is allowed to assume that any definition it sees for a global function is the definition, thanks to ISO C's one-definition-rule.
If you use __attribute__((noipa)) on the function definition, then call sites won't assume anything based on the definition. Just like if you'd renamed the definition (so you could still look at it) and provided only a declaration of the name the caller uses.
If you just want to stop inlining, you can use __attribute__((noinline,noclone)) instead, to still allow the callsite to be like it would if the optimizer simply chose not to inline, but could still see this definition. That may or may not be what you want.
See also How to remove "noise" from GCC/clang assembly output? re: writing functions whose asm is interesting to look at, and compiler options.
And BTW, I found it easiest to change the declaration / definition to variadic, so I could add or remove args with only a change to the caller. I was still able to reproduce your result: with a definition visible, the sub amount doesn't change even when the push amount changes with an extra arg, but with just a declaration it does.
void callee(void* a, ...)   // {}   // uncomment a body here, or leave it as just a declaration
;
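Putting it together, a minimal reproduction might look like this. I haven't quoted the exact source behind the Godbolt links, so treat it as an assumed reconstruction; build with gcc -m32 and compare the sub amounts:
/* Definition visible in this TU: GCC's IPA sees that callee doesn't care about
 * stack alignment, so the caller skips the extra alignment (GCC 8.2+ behaviour above). */
void callee(void* a, ...) {}

/* Declaration only -- comment out the definition above and use this instead;
 * then the caller keeps ESP 16-byte aligned before the call, per the ABI:
 * void callee(void* a, ...);
 */

void caller(void) {
    /* three pointer args pushed, matching the instruction tables above */
    callee((void*)0, (void*)0, (void*)0);
}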

Why can't local variable be used in GNU C basic inline asm statements?

Why can't I use local variables from main in basic inline asm? Why is that only allowed in extended asm?
(I know local variables live on the stack, after the return address, and therefore can't be used once the function returns, but that shouldn't be a reason not to use them.)
An example of basic asm:
#include <stdio.h>

int a = 10;      //global a
int b = 20;      //global b
int result;

int main() {
    asm ( "pusha\n\t"
          "movl a, %eax\n\t"
          "movl b, %ebx\n\t"
          "imull %ebx, %eax\n\t"
          "movl %eax, result\n\t"
          "popa");
    printf("the answer is %d\n", result);
    return 0;
}
example of extended:
#include <stdio.h>

int main (void) {
    int data1 = 10;   //local var - could be used in extended
    int data2 = 20;
    int result;
    asm ( "imull %%edx, %%ecx\n\t"
          "movl %%ecx, %%eax"
          : "=a"(result)
          : "d"(data1), "c"(data2));
    printf("The result is %d\n", result);
    return 0;
}
Compiled with:
gcc -m32 somefile.c
platform:
uname -a:
Linux 5.0.0-32-generic #34-Ubuntu SMP Wed Oct 2 02:06:48 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
You can use local variables in extended assembly, but you need to tell the extended assembly construct about them. Consider:
#include <stdio.h>
int main (void)
{
    int data1 = 10;
    int data2 = 20;
    int result;
    __asm__(
        " movl %[mydata1], %[myresult]\n"
        " imull %[mydata2], %[myresult]\n"
        : [myresult] "=&r" (result)
        : [mydata1] "r" (data1), [mydata2] "r" (data2));
    printf("The result is %d\n", result);
    return 0;
}
In this [myresult] "=&r" (result) says to select a register (r) that will be used as an output (=) value for the lvalue result, and that register will be referred to in the assembly as %[myresult] and must be different from the input registers (&). (You can use the same text in both places, result instead of myresult; I just made it different for illustration.)
Similarly [mydata1] "r" (data1) says to put the value of expression data1 into a register, and it will be referred to in the assembly as %[mydata1].
I modified the code in the assembly so that it only modifies the output register. Your original code modifies %ecx but does not tell the compiler it is doing that. You could have told the compiler that by putting "ecx" after a third :, which is where the list of "clobbered" registers goes. However, since my code lets the compiler assign a register, I would not have a specific register name to list among the clobbers. There may be a way to tell the compiler that one of the input registers will be modified but is not needed for output, but I do not know of one. (Documentation is here.) For this task, a better solution is to tell the compiler to use the same register for one of the inputs as the output:
__asm__(
    " imull %[mydata1], %[myresult]\n"
    : [myresult] "=r" (result)
    : [mydata1] "r" (data1), [mydata2] "0" (data2));
In this, the 0 with data2 says to make it the same as operand 0. The operands are numbered in the order they appear, starting with 0 for the first output operand and continuing into the input operands. So, when the assembly code starts, %[myresult] will refer to some register that the value of data2 has been placed in, and the compiler will expect the new value of result to be in that register when the assembly is done.
When doing this, you have to match the constraint with how a thing will be used in assembly. For the r constraint, the compiler supplies some text that can be used in assembly language where a general processor register is accepted. Others include m for a memory reference, and i for an immediate operand.
There is little distinction between "Basic asm" and "Extended asm"; "basic asm" is just a special case where the __asm__ statement has no lists of outputs, inputs, or clobbers. The compiler does not do % substitution in the assembly string for Basic asm. If you want inputs or outputs you have to specify them, and then it's what people call "extended asm".
In practice, it may be possible to access external (or even file-scope static) objects from "basic asm". This is because these objects will (respectively may) have symbol names at the assembly level. However, to perform such access you need to be careful of whether it is position-independent (if your code will be linked into libraries or PIE executables) and meets other ABI constraints that might be imposed at linking time, and there are various considerations for compatibility with link-time optimization and other transformations the compiler may perform. In short, it's a bad idea because you can't tell the compiler that a basic asm statement modified memory. There's no way to make it safe.
A "memory" clobber (Extended asm) can make it safe to access static-storage variables by name from the asm template.
The use-case for basic asm is things that modify the machine state only, like asm("cli") in a kernel to disable interrupts, without reading or writing any C variables. (Even then, you'd often use a "memory" clobber to make sure the compiler had finished earlier memory operations before changing machine state.)
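To make that concrete, here is a sketch (x86 assumed; the macro names are mine):
/* Basic asm: no outputs, inputs or clobbers; the string is emitted verbatim. */
#define DISABLE_IRQS_BASIC()  __asm__("cli")

/* Extended asm: same instruction, but volatile plus a "memory" clobber orders
 * it against surrounding loads/stores to C objects, so earlier stores can't be
 * delayed past the point where interrupts get disabled. */
#define DISABLE_IRQS()        __asm__ volatile("cli" ::: "memory")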
Local (automatic storage, not static ones) variables fundamentally never have symbol names, because they don't exist in a single instance; there's one object per live instance of the block they're declared in, at runtime. As such, the only possible way to access them is via input/output constraints.
Users coming from MSVC-land may find this surprising, since MSVC's inline assembly scheme papers over the issue by transforming local variable references in its version of inline asm into stack-pointer-relative accesses, among other things. The version of inline asm it offers, however, is not compatible with an optimizing compiler, and little to no optimization can happen in functions using that type of inline asm. GCC, and the larger compiler world that grew up alongside C out of Unix, doesn't do anything similar.
You can't safely use globals in Basic Asm statements either; it happens to work with optimization disabled but it's not safe and you're abusing the syntax.
There's very little reason to ever use Basic Asm. Even for machine-state control like asm("cli") to disable interrupts, you'd often want a "memory" clobber to order it wrt. loads / stores to globals. In fact, GCC's https://gcc.gnu.org/wiki/ConvertBasicAsmToExtended page recommends never using Basic Asm because it differs between compilers, and GCC might change to treating it as clobbering everything instead of nothing (because of existing buggy code that makes wrong assumptions). This would make a Basic Asm statement that uses push/pop even more inefficient if the compiler is also generating stores and reloads around it.
Basically the only use-case for Basic Asm is writing the body of an __attribute__((naked)) function, where data inputs/outputs / interaction with other code follows the ABI's calling convention, instead of whatever custom convention the constraints / clobbers describe for a truly inline block of code.
The design of GNU C inline asm is that it's text that you inject into the compiler's normal asm output (which is then fed to the assembler, as). Extended asm makes the string a template that it can substitute operands into. And the constraints describe how the asm fits into the data-flow of the program logic, as well as registers it clobbers.
Instead of parsing the string, there is syntax that you need to use to describe exactly what it does. Parsing the template for var names would only solve part of the language-design problem that operands need to solve, and would make the compiler's code more complicated. (It would have to know more about every instruction to know whether memory, register, or immediate was allowed, and stuff like that. Normally its machine-description files only need to know how to go from logical operation to asm, not the other direction.)
Your Basic asm block is broken because you modify C variables without telling the compiler about it. This could break with optimization enabled (maybe only with more complex surrounding code, but happening to work is not the same thing as actually safe. This is why merely testing GNU C inline asm code is not even close to sufficient for it to be future proof against new compilers and changes in surrounding code). There is no implicit "memory" clobber. (Basic asm is the same as Extended asm except for not doing % substitution on the string literal. So you don't need %% to get a literal % in the asm output. It's implicitly volatile like Extended asm with no outputs.)
Also note that if you were targeting i386 MacOS, you'd need _result in your asm. result only happens to work because the asm symbol name exactly matches the C variable name. Using Extended asm constraints would make it portable between GNU/Linux (no leading underscore) vs. other platforms that do use a leading _.
Your Extended asm is broken because you modify an input ("c") (without telling the compiler that register is also an output, e.g. an output operand using the same register).
It's also inefficient: if a mov is the first or last instruction of your template, you're almost always doing it wrong and should have used better constraints.
Instead, you can do:
asm ("imull %%edx, %%ecx\n\t"
: "=c"(result)
: "d"(data1), "c"(data2));
Or better, use "+r"(data2) and "r"(data1) operands to give the compiler free choice when doing register allocation, instead of potentially forcing it to emit unnecessary mov instructions. (See Eric's answer above using named operands and "=r" with a matching "0" constraint; that's equivalent to "+r" but lets you use different C names for the input and output.)
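For instance, a "+r" version of the multiply from the question might look like this (a sketch; the surrounding variables are as in the earlier examples):
int data1 = 10, data2 = 20, result;

asm("imull %1, %0"      /* data2 *= data1, in whatever registers GCC picked */
    : "+r"(data2)       /* read-write operand: input and output share a register */
    : "r"(data1));
result = data2;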
Look at the asm output of the compiler to see how code-gen happened around your asm statement, if you want to make sure it was efficient.
Since local vars don't have a symbol / label in the asm text (instead they live in registers or at some offset from the stack or frame pointer, i.e. automatic storage), it can't work to use symbol names for them in asm.
Even for global vars, you want the compiler to be able to optimize around your inline asm as much as possible, so you want to give the compiler the option of using a copy of a global var that's already in a register, instead of getting the value in memory in sync with a store just so your asm can reload that.
Having the compiler try to parse your asm and figure out which C local var names are inputs and outputs would have been possible. (But would be a complication.)
But if you want it to be efficient, you need to figure out when x in the asm can be a register like EAX, instead of doing something braindead like always storing x into memory before the asm statement and then replacing x with 8(%rsp) or whatever. If you want to give the asm statement control over where inputs can be, you need constraints in some form. Doing it on a per-operand basis makes total sense, and means the inline-asm handling doesn't have to know that bts can take an immediate or register source but not memory, or other machine-specific details like that. (Remember: GCC is a portable compiler; baking a huge amount of per-machine info into the inline-asm parser would be bad.)
(MSVC forces all C vars in _asm{} blocks to be memory. It's impossible to use it to efficiently wrap a single instruction because the input has to bounce through memory, even if you wrap it in a function so you can use the officially-supported hack of leaving a value in EAX and falling off the end of a non-void function. What is the difference between 'asm', '__asm' and '__asm__'? And in practice MSVC's implementation was apparently pretty brittle and hard to maintain, so much so that they removed it for x86-64, and it was documented as not supported in functions with register args even in 32-bit mode! That's not the fault of the syntax design, though, just the actual implementation.)
Clang does support -fasm-blocks for _asm { ... } MSVC-style syntax where it parses the asm and you use C var names. It probably forces inputs and outputs into memory but I haven't checked.
Also note that GCC's inline asm syntax with constraints is designed around the same system of constraints that GCC-internals machine-description files use to describe the ISA to the compiler. (The .md files in the GCC source that tell the compiler about an instruction to add numbers that takes inputs in "r" registers, and has the text string for the mnemonic. Notice the "r" and "m" in some examples in https://gcc.gnu.org/onlinedocs/gccint/RTL-Template.html).
The design model of asm in GNU C is that it's a black-box for optimizer; you must fully describe the effects of the code (to the optimizer) using constraints. If you clobber a register, you have to tell the compiler. If you have an input operand that you want to destroy, you need to use a dummy output operand with a matching constraint, or a "+r" operand to update the corresponding C variable's value.
If you read or write memory pointed-to by a register input, you have to tell the compiler. How can I indicate that the memory *pointed* to by an inline ASM argument may be used?
If you use the stack, you have to tell the compiler (but you can't, so instead you have to avoid stepping on the red-zone :/ Using base pointer register in C++ inline asm) See also the inline-assembly tag wiki
GCC's design makes it possible for the compiler to give you an input in a register, and use the same register for a different output. (Use an early-clobber constraint if that's not ok; GCC's syntax is designed to efficiently wrap a single instruction that reads all its inputs before writing any of its outputs.)
If GCC could only infer all of these things from C var names appearing in asm source, I don't think that level of control would be possible. (At least not plausible.) And there'd probably be surprising effects all over the place, not to mention missed optimizations. You only ever use inline asm when you want maximum control over things, so the last thing you want is the compiler using a lot of complex opaque logic to figure out what to do.
(Inline asm is complex enough in its current design, and not used much compared to plain C, so a design that requires very complex compiler support would probably end up with a lot of compiler bugs.)
GNU C inline asm isn't designed for low-performance low-effort. If you want easy, just write in pure C or use intrinsics and let the compiler do its job. (And file missed-optimization bug reports if it makes sub-optimal code.)
This is because asm is a defined language which is common for all compilers on the same processor family. After using the __asm__ keyword, you can reliably use any good manual for the processor to then start writing useful code.
But it does not have a defined interface for C, and let's be honest, if you don't interface your assembler with your C code, then why is it there?
Examples of useful very simple asm: generate a debug interrupt; set the floating-point register mode (exceptions/accuracy).
Each compiler writer has invented their own mechanism to interface to C. For example in one old compiler you had to declare the variables you want to share as named registers in the C code. In GCC and clang they allow you to use their quite messy 2-step system to reference an input or output index, then associate that index with a local variable.
This mechanism is the "extension" to the asm standard.
Of course, asm is not really a standard: change processor and your asm code is trash. When we talk in general about sticking to the C/C++ standards and not using extensions, we aren't talking about asm, because you are already breaking every portability rule there is.
Then, on top of that, if you are going to call C functions, or your asm declares functions that are callable by C then you will have to match to the calling conventions of your compiler. These rules are implicit. They constrain the way you write your asm, but it will still be legal asm, by some criteria.
But if you are just writing your own asm functions and calling them from asm, you may not be constrained so much by the C/C++ conventions: make up your own register argument rules; return values in any register you want; make stack frames, or don't; preserve the stack frame through exceptions - who cares?
Note that you might still be constrained by the platform's relocatable code conventions (these are not "C" conventions, but are often described using C syntax), but this is still one way that you can write a chunk of "portable" asm functions, then call them using "extended" embedded asm.

Inserting "marker" instructions into assembly without GCC reordering them

For purposes of doing performance analysis it is useful to be able to
tell which line of C code goes with which line of generated assembly
code. This can be very difficult once a sufficient number of
optimization passes get involved, and I devised the following scheme
to make it easier (though it has a lot of caveats). I figured I would
use in-line assembly to insert an instruction that is effectively a
nop, but that the compiler would rarely or never generate itself. Then
when I looked at the generated code I could infer that assembly code
that appears between the inserted marker instructions probably comes
from C code that lies between the in-line assembly statements.
I came up with these candidates:
// Force insertion of an instruction that will only clobber
// flags and that the compiler hardly ever uses itself. Lie and say
// that it alters memory to try to prevent the compiler from moving
// it around. Mark it volatile so the compiler can't remove it entirely.
#define ASSEMBLY_MARKER_0() \
__asm__ volatile ("cld" : /* no outputs */ : /* no inputs */ : "memory", "cc")
#define ASSEMBLY_MARKER_1() \
__asm__ volatile ("xorl %%eax,0" : /* no outputs */ : /* no inputs */ : "memory", "cc")
Then I decided to test whether the compiler would move instructions
across these boundaries. clang appears to do exactly what I want, but
GCC appears to not be deterred either by the memory clobbering or the
fact that this snippet is volatile. It reorders instructions anyway!
Is there any way to prevent this?
I know there are a lot of caveats to this method even if I get it to
work -- I may heavily influence generated code around the markers. But
I maintain that it would still be useful for finding things like
accidental implicit conversions between integer widths, and other
"wait that should never be necessary..." type problems.
You can see the difference between GCC and clang here: https://godbolt.org/z/ZtUPc9
C code:
int f(int x)
{
    __asm__ volatile ("xorl %%eax,0" : /* no outputs */ : /* no inputs */ : "memory", "cc");
    int j = x << 3;
    __asm__ volatile ("xorl %%eax,0" : /* no outputs */ : /* no inputs */ : "memory", "cc");
    return j;
}
GCC:
xorl %eax,0
xorl %eax,0
lea eax, [0+rdi*8]
ret
Clang:
xor dword ptr [0], eax
lea eax, [8*rdi]
xor dword ptr [0], eax
ret
Edits to answer questions in comments:
Why not nops? Because gcc inserts those itself often. The point is to stick out.
Why not move code into its own function? If you're doing this analysis on C++ template code, for example, there may be many layers of inlining that occur before producing the function that actually goes in the executable, and the code may be very different if you turn off the inlining (e.g. the code may have been written with the assumption that constant folding, dead code elimination, etc. would get rid of trivial things).
Then I decided to test whether the compiler would move instructions across these boundaries. clang appears to do exactly what I want, but GCC appears to not be deterred either by the memory clobbering or the fact that this snippet is volatile. It reorders instructions anyway! Is there any way to prevent this?
Not really. The point of such memory barriers is that they prevent reordering, across them, of things that are themselves volatile (volatile accesses, other asm volatile statements) and/or memory accesses. And with the "cc" clobber on x86, parts of def-use chains of the condition codes cannot be moved across either. Such barriers do nothing whatsoever to prevent unrelated instructions from moving across them.
Sometimes it can be helpful to add the options -save-temps -fverbose-asm to better understand assembly code and its relation to the C source. New versions of GCC dump C/C++ code alongside the assembly code (dumped as *.s). When you inspect assembly (as opposed to disassembly), it's sufficient to inject asm comments to show where the inline asm is injected; there is no need to add actual instructions. Legibility of the assembly might be improved by disabling debug-info (-g0).
To better understand the code, you can also disable passes that usually cause a great amount of instruction reordering, like instruction scheduling (-fno-schedule-insns, -fno-schedule-insns2), but that has a big performance impact, of course.
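Along those lines, a comment-only marker could be a sketch like this (the macro name and marker text are arbitrary; the # comment syntax assumes x86 gas output):
/* Emits only an assembler comment, so it cannot change the encoded instructions;
 * volatile and the "memory" clobber still pin it relative to memory accesses,
 * but (as discussed above) not relative to unrelated register-only instructions. */
#define SRC_MARKER(tag)  __asm__ volatile ("# MARK: " tag : : : "memory")

/* usage: SRC_MARKER("before shift"); int j = x << 3; SRC_MARKER("after shift"); */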

GCC Inline-Assembly Error: "Operand size mismatch for 'int'"

First, if somebody knows a function of the standard C library that prints a string without looking for a terminating zero, but instead takes the number of characters to write, please tell me!
Otherwise, I have this problem:
void printStringWithLength(char *str_ptr, int n_chars){
    asm("mov 4, %rax");   //Function number (write)
    asm("mov 1, %rbx");   //File descriptor (stdout)
    asm("mov $str_ptr, %rcx");
    asm("mov $n_chars, %rdx");
    asm("int 0x80");
    return;
}
GCC reports the following error for the int instruction:
"Error: operand size mismatch for 'int'"
Can somebody tell me the issue?
There are a number of issues with your code. Let me go over them step by step.
First of all, the int $0x80 system call interface is for 32 bit code only. You should not use it in 64 bit code as it only accepts 32 bit arguments. In 64 bit code, use the syscall interface. The system calls are similar but some numbers are different.
Second, in AT&T assembly syntax, immediates must be prefixed with a dollar sign. So it's mov $4, %rax, not mov 4, %rax. The latter would attempt to move the content of address 4 to rax which is clearly not what you want.
Third, you can't just refer to the names of automatic variables in inline assembly. You have to tell the compiler what variables you want to use using extended assembly if you need any. For example, in your code, you could do:
asm volatile("mov $4, %%eax; mov $1, %%edi; mov %0, %%esi; mov %2, %%edx; syscall"
:: "r"(str_ptr), "r"(n_chars) : "rdi", "rsi", "rdx", "rax", "memory");
Fourth, gcc is an optimizing compiler. By default it assumes that inline assembly statements are like pure functions, that the outputs are a pure function of the explicit inputs. If the output(s) are unused, the asm statement can be optimized away, or hoisted out of loops if run with the same inputs.
But a system call like write has a side-effect you need the compiler to keep, so it's not pure. You need the asm statement to run the same number of times and in the same order as the C abstract machine would. asm volatile will make this happen. (An asm statement with no outputs is implicitly volatile, but it's good practice to make it explicit when the side effect is the main purpose of the asm statement. Plus, we do want to use an output operand to tell the compiler that RAX is modified, as well as being an input, which we couldn't do with a clobber.)
You do always need to accurately describe your asm's inputs, outputs, and clobbers to the compiler using Extended inline assembly syntax. Otherwise you'll step on the compiler's toes (it assumes registers are unchanged unless they're outputs or clobbers). (Related: How can I indicate that the memory *pointed* to by an inline ASM argument may be used? shows that a pointer input operand alone does not imply that the pointed-to memory is also an input. Use a dummy "m" input or a "memory" clobber to force all reachable memory to be in sync.)
You should simplify your code by not writing your own mov instructions to put data into registers but rather letting the compiler do this. For example, your assembly becomes:
ssize_t retval;
asm volatile ("syscall" // note only 1 instruction in the template
: "=a"(retval) // RAX gets the return value
: "a"(SYS_write), "D"(STDOUT_FILENO), "S"(str_ptr), "d"(n_chars)
: "memory", "rcx", "r11" // syscall destroys RCX and R11
);
where SYS_write is defined in <sys/syscall.h> and STDOUT_FILENO in <unistd.h>. I am not going to explain all the details of extended inline assembly to you. Using inline assembly in general is usually a bad idea. Read the documentation if you are interested. (https://stackoverflow.com/tags/inline-assembly/info)
Fifth, you should avoid using inline assembly when you can. For example, to do system calls, use the syscall function from unistd.h:
syscall(SYS_write, STDOUT_FILENO, str_ptr, (size_t)n_chars);
This does the right thing. But it doesn't inline into your code, so use wrapper macros from MUSL for example if you want to really inline a syscall instead of calling a libc function.
Sixth, always check whether the system call you want is already available as a wrapper in your C library. In this case it is, so you should just write
write(STDOUT_FILENO, str_ptr, n_chars);
and avoid all of this altogether.
Seventh, if you prefer to use stdio, use fwrite instead:
fwrite(str_ptr, 1, n_chars, stdout);
There are so many things wrong with your code (and so little reason to use inline asm for it) that it's not worth trying to actually correct all of them. Instead, use the write(2) system call the normal way, via the POSIX function / libc wrapper as documented in the man page, or use ISO C <stdio.h> fwrite(3).
#include <unistd.h>
static inline
void printStringWithLength(const char *str_ptr, int n_chars){
    write(1, str_ptr, n_chars);
    // TODO: check error return value
}
Why your code doesn't assemble:
In AT&T syntax, immediates always need a $ decorator. Your code will assemble if you use asm("int $0x80").
The assembler is complaining about 0x80, which is a memory reference to the absolute address 0x80. There is no form of int that takes the interrupt vector as anything other than an immediate. I'm not sure exactly why it complains about the size, since memory references don't have an implied size in AT&T syntax.
That will get it to assemble, at which point you'll get linker errors:
In function `printStringWithLength':
5 : <source>:5: undefined reference to `str_ptr'
6 : <source>:6: undefined reference to `n_chars'
collect2: error: ld returned 1 exit status
(from the Godbolt compiler explorer)
mov $str_ptr, %rcx
means to mov-immediate the address of the symbol str_ptr into %rcx. In AT&T syntax, you don't have to declare external symbols before using them, so unknown names are assumed to be global / static labels. If you had a global variable called str_ptr, that instruction would reference its address (which is a link-time constant, so can be used as an immediate).
As other have said, this is completely the wrong way to go about things with GNU C inline asm. See the inline-assembly tag wiki for more links to guides.
Also, you're using the wrong ABI. int $0x80 is the x86 32-bit system call ABI, so it doesn't work with 64-bit pointers. What are the calling conventions for UNIX & Linux system calls on x86-64
See also the x86 tag wiki.

How are variable names stored in memory in C?

In C, let's say you have a variable called variable_name. Let's say it's located at 0xaaaaaaaa, and at that memory address, you have the integer 123. So in other words, variable_name contains 123.
I'm looking for clarification around the phrasing "variable_name is located at 0xaaaaaaaa". How does the compiler recognize that the string "variable_name" is associated with that particular memory address? Is the string "variable_name" stored somewhere in memory? Does the compiler just substitute variable_name for 0xaaaaaaaa whenever it sees it, and if so, wouldn't it have to use memory in order to make that substitution?
Variable names don't exist anymore after the compiler runs (barring special cases like exported globals in shared libraries or debug symbols). The entire act of compilation is intended to take those symbolic names and algorithms represented by your source code and turn them into native machine instructions. So yes, if you have a global variable_name, and compiler and linker decide to put it at 0xaaaaaaaa, then wherever it is used in the code, it will just be accessed via that address.
So to answer your literal questions:
How does the compiler recognize that the string "variable_name" is associated with that particular memory address?
The toolchain (compiler & linker) work together to assign a memory location for the variable. It's the compiler's job to keep track of all the references, and linker puts in the right addresses later.
Is the string "variable_name" stored somewhere in memory?
Only while the compiler is running.
Does the compiler just substitute variable_name for 0xaaaaaaaa whenever it sees it, and if so, wouldn't it have to use memory in order to make that substitution?
Yes, that's pretty much what happens, except it's a two-stage job with the linker. And yes, it uses memory, but it's the compiler's memory, not anything at runtime for your program.
An example might help you understand. Let's try out this program:
int x = 12;

int main(void)
{
    return x;
}
Pretty straightforward, right? OK. Let's take this program, and compile it and look at the disassembly:
$ cc -Wall -Werror -Wextra -O3 example.c -o example
$ otool -tV example
example:
(__TEXT,__text) section
_main:
0000000100000f60 pushq %rbp
0000000100000f61 movq %rsp,%rbp
0000000100000f64 movl 0x00000096(%rip),%eax
0000000100000f6a popq %rbp
0000000100000f6b ret
See that movl line? It's grabbing the global variable (in an instruction-pointer relative way, in this case). No more mention of x.
Now let's make it a bit more complicated and add a local variable:
int x = 12;

int main(void)
{
    volatile int y = 4;
    return x + y;
}
The disassembly for this program is:
(__TEXT,__text) section
_main:
0000000100000f60 pushq %rbp
0000000100000f61 movq %rsp,%rbp
0000000100000f64 movl $0x00000004,0xfc(%rbp)
0000000100000f6b movl 0x0000008f(%rip),%eax
0000000100000f71 addl 0xfc(%rbp),%eax
0000000100000f74 popq %rbp
0000000100000f75 ret
Now there are two movl instructions and an addl instruction. You can see that the first movl is initializing y, which it's decided will be on the stack (base pointer - 4). Then the next movl gets the global x into a register eax, and the addl adds y to that value. But as you can see, the literal x and y strings don't exist anymore. They were conveniences for you, the programmer, but the computer certainly doesn't care about them at execution time.
A C compiler first creates a symbol table, which stores the relationship between the variable name and where it's located in memory. When compiling, it uses this table to replace all instances of the variable with a specific memory location, as others have stated. You can find a lot more on it on the Wikipedia page.
All variable names are substituted by the compiler: first they are replaced with references, and later the linker places addresses where those references were.
In other words, the variable names are no longer available once the compiler has run.
This is what's called an implementation detail. While what you describe is the case in all compilers I've ever used, it's not required to be the case. A C compiler could put every variable in a hashtable and look them up at runtime (or something like that), and in fact early JavaScript interpreters did exactly that (now they do Just-In-Time compilation that results in something much more raw).
Specifically for common compilers like VC++, GCC, and LLVM: the compiler will generally assign a variable to a location in memory. Variables of global or static scope get a fixed address that doesn't change while the program is running, while variables within a function get a stack address, that is, an address relative to the current stack pointer, which changes every time a function is called. (This is an oversimplification.) Stack addresses become invalid as soon as the function returns, but have the benefit of effectively zero overhead to use.
Once a variable has an address assigned to it, there is no further need for the name of the variable, so it is discarded. Depending on the kind of name, the name may be discarded at preprocess time (for macro names), compile time (for static and local variables/functions), and link time (for global variables/functions.) If a symbol is exported (made visible to other programs so they can access it), the name will usually remain somewhere in a "symbol table" which does take up a trivial amount of memory and disk space.
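You can see this with the earlier example.c: dump the object file's symbol table, and the global x (and the exported main) show up, while the local y appears nowhere. This is only a sketch; the addresses are illustrative and the underscore prefixes assume the same Mach-O toolchain as the otool listings above.
$ cc -c example.c
$ nm example.o
0000000000000000 T _main
0000000000000068 D _x        ### the global keeps a symbol; the local y has none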
Does the compiler just substitute variable_name for 0xaaaaaaaa whenever it sees it
Yes.
and if so, wouldn't it have to use memory in order to make that substitution?
Yes, but that memory is used by the compiler while it compiles your code; once compilation is done, why would you care about it?
