Force clang to inline or use static call to `memmove()` - c

The following question sure sounds like an XY problem, but trust me, it's not. It is related to the lengthy investigation performed in this question.
I have very good reason to believe that a solution to that question's issue (not for the minimal example there, but for my actual code) involves ensuring memmove() is not called through a shared library, but rather as a static call, straight from my code to memmove() without going through a PLT, or even better, if memmove()'s code is inlined inside my code.
However, I have failed to find a command-line switch or anything else to prevent memmove() being called dynamically (note the dynamic call is confirmed by disassembling the output, after a build using -O3.) Does such a switch exist? While I could grab some optimized memmove() code for my platform (such as this), I would rather avoid introducing some code which might be obsoleted for a future architecture, and I do not consider this good programming practice anyway.
I'm using clang with the version string Ubuntu clang version 12.0.0-3ubuntu1~21.04.1 on a Raspberry Pi 4. Output of uname -a is Linux rpi4 5.11.0-1017-raspi #18-Ubuntu SMP PREEMPT Mon Aug 23 07:34:31 UTC 2021 aarch64 aarch64 aarch64 GNU/Linux.

While I could grab some optimized memmove() code for my platform (such as this), I would rather avoid introducing some code which might be obsoleted for a future architecture
Your best bet is to write a straight-forward inline C or C++ implementation of memmove, and depend on the compiler to inline and optimize it.
Surprisingly doing this often beats hand-coded assembly implementations, because the compiler can often tell that the regions are non-overlapping, or that a whole multiple of 64-bit words is being used, and thus various parts of the generic memmove can be optimized out.
And doing this in C or C++ guarantees that it will continue to work on future architectures (so long as you can compile for them).

Related

Is it possible to generate ansi C functions with type information for a moving GC implementation?

I am wondering what methods there are to add typing information to generated C methods. I'm transpiling a higher-level programming language to C and I'd like to add a moving garbage collector. However to do that I need the method variables to have typing information, otherwise I could modify a primitive value that looks like a pointer.
An obvious approach would be to encapsulate all (primitive and non-primitive) variables in a struct that has an extra (enum) variable for typing information, however this would cause memory and performance overhead, the transpiled code is namely meant for embedded platforms. If I were to accept the memory overhead the obvious option would be to use a heap handle for all objects and then I'd be able to freely move heap blocks. However I'm wondering if there's a more efficient better approach.
I've come up with a potential solution, namely to predeclare and group variables based whether they're primitives or not (I can do that in the transpiler), and add an offset variable to each method at the end (I need to be able to find it accurately when scanning the stack area), that tells me where the non-primitive variables begin and where they end, so I can only scan those. This means that each method will use an additional 16/32-bit (depending on arch) of memory, however this should still be more memory efficient than the heap handle approach.
Example:
void my_func() {
int i = 5;
int z = 3;
bool b = false;
void* person;
void* person_info = ...;
.... // logic
volatile int offset = 0x034;
}
My aim is for something that works universally across GCC compilers, thus my concerns are:
Can the compiler reorder the variables from how they're declared in
the source code?
Can I force the compiler to put some data in the
method's stack frame (using volatile)?
Can I find the offset accurately when scanning the stack?
I'd like to avoid assembly so this approach can work (by default) across multiple platforms, however I'm open for methods even if they involve assembly (if they're reliable).
Typing information could be somehow encoded in the C function name; this is done by C++ and other implementations and called name mangling.
Actually, you could decide, since all your C code is generated, to adopt a different convention: generate long C identifiers which are practically unique and sort-of random program-wide, such as tiziw_7oa7eIzzcxv03TmmZ and keep their typing information elsewhere (e.g. some database). On Linux, such an approach is friendly to both libbacktrace and dlsym(3) + dladdr(3) (and of course nm(1) or readelf(1) or gdb(1)), so used in both bismon and RefPerSys projects.
Typing information is practically tied to calling conventions and ABIs. For example, the x86-64 ABI for Linux mandates different processor registers for passing floating points or pointers.
Read the Garbage Collection handbook or at least P.Wilson Uniprocessor Garbage Collection Techniques survey. You could decide to use tagged integers instead of boxing them, and you could decide to have a conservative GC (e.g. Boehm's GC) instead of a precise one. In my old GCC MELT project I generated C or C++ code for a generational copying GC. Similar techniques are used both in Bismon and in RefPerSys.
Since you are transpiling to C, consider also alternatives, such as libgccjit or LLVM. Look into libjit and asmjit.
Study also the implementation of other transpilers (compilers to C), including Chicken/Scheme and Bigloo.
Can the GCC compiler reorder the variables from how they're declared in the source code?
Of course yes, depending upon the optimizations you are asking. Some variables won't even exist in the binary (e.g. those staying in registers).
Can I force the compiler to put some data in the method's stack frame (using volatile)?
Better generate a single struct variable containing all your language variables, and leave optimizations to the compiler. You will be surprised (see this draft report).
Can I find the offset accurately when scanning the stack?
This is the most difficult, and depends a lot of compiler optimizations (e.g. if you run gcc with -O1 or -O3 on the generated C code; in some cases a recent GCC -e.g GCC 9 or GCC 10 on x86-64 for Linux- is capable of tail-call optimizations; check by compiling using gcc -O3 -S -fverbose-asm then looking into the produced assembler code). If you accept some small target processor and compiler specific tricks, this is doable. Study the implementation of the Ocaml compiler.
Send me (to basile#starynkevitch.net) an email for discussion. Please mention the URL of your question in it.
If you want to have an efficient generational copying GC with multi-threading, things become extremely tricky. The question is then how many years of development can you afford spending.
If you have exceptions in your language, take also a great care. You could with great caution generate calls to longjmp.
See of course this answer of mine.
With transpiling techniques, the evil is in the details
On Linux (specifically!) see also my manydl.c program. It demonstrates that on a Linux x86-64 laptop you could generate, in practice, hundred of thousands of dlopen(3)-ed plugins. Read then How to write shared libraries
Study also the implementation of SBCL and of GNU Prolog, at least for inspiration.
PS. The dream of a totally architecture-neutral and operating-system independent transpiler is an illusion.

Why does glibc library use assembly

I am looking at this page: https://sys.readthedocs.io/en/latest/doc/01_introduction.html
that goes into explanation about how glibc does system calls. In one of the examples the code is examined and it is shown, that the last instruction glibc does to actually do a system call (meaning the interrupt to the cpu) is written in assembly.... So why is part of glibc in assembly? Is there some sort of advantage by writing that small part in assembly?
Also, the shared libraries during runtime are already compiled to machine code correct?
So why would there be any advantage using two different languages before compilation? Thank you.
The answer is super simple - since C doesn't cover system calls (because it doesn't cover any physical hardware in general, and prefers to express itself in terms of abstract machine), there is no C construct glibc can use to perform system call.
One could argue that compiler could provide a sort of intrinsic to do that, but since in Linux glibc is actually part of the compiler suit of tools (in contains CRT as well) there is really no need for it, glibc can do the job.
Also, last, but not the least, in modern CPUs syscall is usually not an interrupt. Instead, it's a specific instruction (syscall in x86_64).
I want to address this piece of your question:
Also, the shared libraries during runtime are already compiled to machine code correct?
So why would there be any advantage using two different languages before compilation?
SergeyA correctly points out that there isn't any C construct (even with all of GCC's extensions) that will cause the compiler to emit a syscall instruction. That's not the only thing that the C library is supposed to do that simply can't be written purely in C: the implementations of setjmp and longjmp, makecontext and setcontext, the "entry point" code that calls main, the "trampoline" that you return to when you return from a signal handler, and several other low-level bits all require a little bit of hand-written assembly. (Exercise: what do they all have in common?)
But there's another reason to mix assembly language into a program mostly written in C. This is one of the several implementations of memcpy for x86-64 in glibc. It is 3100 lines of hand-written assembly language and preprocessor macros. What it does could be expressed in four lines of C. Why would anyone go to that much trouble? Speed. Compilers are always getting closer, but they haven't yet quite managed to beat the human brain when it comes to squeezing every last possible cycle out of a critical innermost loop. (It is worth mentioning that in early 2018 the glibc devs spent a bunch of time replacing hand-written assembly implementations of math.h functions with C, because the compilers have caught up on those, and C is ever so much more maintainable.)
And yet a third answer, which isn't particularly relevant to glibc but comes up a bunch elsewhere, is that maybe you have two different languages in your program because each of them is better at part of your problem. The statistical language R is mostly implemented in C, but a bunch of its mathematical primitives are (or were, I haven't checked in a while) written in FORTRAN, because FORTRAN is still the language that numerical computation wizards think in. Both C and FORTRAN get compiled to machine code, and in principle you could rewrite all the FORTRAN in C, but nobody wants to.

How can I know where function ends in memory(get the address)- c/c++

I'm looking for a simple way to find function ending in memory. I'm working on a project that will find problems on run time in other code, such as: code injection, viruses and so fourth. My program will run with the code that is going to be checked on run time, so that I will have access to memory. I don't have access to the source code itself. I would like to examine only specific functions from it. I need to know where functions start and end in stack. I'm working with windows 8.1 64 bit.
In general, you cannot find where the function is ending in memory, because the compiler could have optimized, inlined, cloned or removed that function, split it in different parts, etc. That function could be some system call mostly implemented in the kernel, or some function in an external shared library ("outside" of your program's executable)... For the C11 standard (see n1570) point of view, your question has no sense. That standard defines the semantics of the language, i.e. properties on the behavior of the produced program. See also explanations in this answer.
On some computers (Harvard architecture) the code would stay in a different memory, so there is no point in asking where that function starts or ends.
If you restrict your question to a particular C implementation (that is a specific compiler with particular optimization settings, for a specific operating system and instruction set architecture and ABI) you might (in some cases, not in all of them) be able to find the "end of a function" (but that won't be simple, and won't be failproof). For example, you could post-process the assembler code and/or the object file produced by the compiler, inspect the ELF executable and its symbol table, examine DWARF debug information, etc...
Your question smells a lot like some XY problem, so you should motivate it, whith a lot more explanation and context.
I need to know where functions start and end in stack.
Functions don't sit on the stack, but mostly in the code segment of your executable (or library). What is on the call stack is a sequence of call frames. The organization of the call frames is specific to your ABI. Some compiler options (e.g. -fomit-frame-pointer) would make difficult to explore the call stack (without access to the source code and help from the compiler).
I don't have access to the source code itself. I would like to examine only specific functions from it.
Your problem is still ill-defined, probably undecidable, much more complex than what you believe (since related to the halting problem), and there is considerable literature related to it (read about decompiler, static code analysis, anti-virus & malware analysis). I recommend spending several months or years learning more about compilers (start with the Dragon Book), linkers, instruction set architecture, ABIs. Then look into several proceedings of conferences related to ACM SIGPLAN etc. On a practical side, study the assembler code generated by compilers (e.g. use GCC with gcc -O2 -S -fverbose-asm....); the CppCon 2017 talk: Matt Godbolt “What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid” is a nice introduction.
I'm working on a project that will find problems on run time in other code, such as: code injection, viruses and so fourth.
I hope you can dedicate several years of full time work to your ambitious project. It probably is much more difficult than what you thought, because optimizing compilers are much more complex than what you believe (and malware software uses various complex tricks to hide itself from inspection). Malware research is really difficult, but interesting.

Can I run GCC as a daemon (or use it as a library)?

I would like to use GCC kind of as a JIT compiler, where I just compile short snippets of code every now and then. While I could of course fork a GCC process for each function I want to compile, I find that GCC's startup overhead is too large for that (it seems to be about 50 ms on my computer, which would make it take 50 seconds to compile 1000 functions). Therefore, I'm wondering if it's possible to run GCC as a daemon or use it as a library or something similar, so that I can just submit a function for compilation without the startup overhead.
In case you're wondering, the reason I'm not considering using an actual JIT library is because I haven't found one that supports all the features I want, which include at least good knowledge of the ABI so that it can handle struct arguments (lacking in GNU Lightning), nested functions with closure (lacking in libjit) and having a C-only interface (lacking in LLVM; I also think LLVM lacks nested functions).
And no, I don't think I can batch functions together for compilation; half the point is that I'd like to compile them only once they're actually called for the first time.
I've noticed libgccjit, but from what I can tell, it seems very experimental.
My answer is "No (you can't run GCC as a daemon process, or use it as a library)", assuming you are trying to use the standard GCC compiler code. I see at least two problems:
The C compiler deals in complete translation units, and once it has finished reading the source, compiles it and exits. You'd have to rejig the code (the compiler driver program) to stick around after reading each file. Since it runs multiple sub-processes, I'm not sure that you'll save all that much time with it, anyway.
You won't be able to call the functions you create as if they were normal statically compiled and linked functions. At the least you will have to load them (using dlopen() and its kin, or writing code to do the mapping yourself) and then call them via the function pointer.
The first objection deals with the direct question; the second addresses a question raised in the comments.
I'm late to the party, but others may find this useful.
There exists a REPL (read–eval–print loop) for c++ called Cling, which is based on the Clang compiler. A big part of what it does is JIT for c & c++. As such you may be able to use Cling to get what you want done.
The even better news is that Cling is undergoing an attempt to upstream a lot of the Cling infrastructure into Clang and LLVM.
#acorn pointed out that you'd ruled out LLVM and co. for lack of a c API, but Clang itself does have one which is the only one they guarantee stability for: https://clang.llvm.org/doxygen/group__CINDEX.html

Do I really need libgcc?

I've been using GCC 4.6.2 on Mac OS X 10.6. I use the -static-libgcc option when I compile, otherwise my binaries look for libgcc on the system and I'm not sure anything over GCC 4.2 is supported on OS X. This works fine, but why do I even need libgcc? I read up on it and the GNU docs say it contains "arithmetic operations that the target processor cannot perform directly." How do I know what these operations are? And why are they so complex that I need to include this library? Why can't GCC just optimize the code directly instead of having to resort to these library functions? I'm a little confused. Any insight into this would be appreciated!
Yes, you do need it .... probably. If you don't need it then statically linking it is harmless. You can tell if you need it by using the -t link trace option (I think).
There are various things that you cant do in one instruction (typically things like 64-bit operations on 32-bit architectures). These things can be done, but if they use a non-trivial number of instructions then it's more space-efficient to have them all in one place.
When you disable optimization using -O0 (that's actually the default anyway) then GCC pretty much always uses the libgcc routines.
When you enable speed optimization then GCC may choose to insert the instruction sequence directly into the code (if it knows how). You may find that it ends up using none of the libgcc versions - it will certainly use fewer libgcc calls.
When you enable size optimizations then GCC may prefer the function call, or may not - it depends on what the GCC developers think is the best speed/size trade-off in each case. Note that even when you tell it to optimize for speed, the compiler may judge that some functions are unlikely to be used, and optimize those for size - even more so if you use PGO.
Basically, you can think of it in the same way as memcpy or the math-library functions: the compiler will inline functions it judges to be beneficial, and call library functions otherwise. The compiler can "inline" standard functions and libgcc function without looking at the library definition, of course - it just "knows" what they do.
Whether to use static or dynamic libgcc is an interesting trade-off. On the one hand, a dynamic (shared) library will use less memory across your whole system, and is more likely to be cached, etc. On the other hand, a static libgcc has a lower call overhead.
The most important thing though is compatibility. Obviously the libgcc library has to be present for your program to run, but it also has to be a compatible version. You're ok on a Linux distro with a stable GCC version, but otherwise static linking is safer.
I hope that answers your questions.

Resources