How can I optimize GCC compilation for memory usage?

I am developing a library which should use as little memory as possible (I am not concerned about anything else, like the binary size, or speed optimizations).
Are there any GCC flags (or any other GCC-related options) I can use? Should I avoid some level of -O* optimization?

Your library, like any code in idiomatic C, has several kinds of memory usage:
binary code size, and indeed -Os should optimize that
heap memory, using C dynamic allocation, that is malloc; you obviously should know how, and how much, heap memory is allocated (and later free-d). The actual memory consumption would depend upon your particular malloc implementation (e.g. many implementations, when calling malloc(25) could in fact consume 32 bytes), not on the compiler. BTW, you might design your library to use some memory pools or even implement your own allocator (above OS syscalls like mmap, or above malloc etc...)
local variables, that is the call frames on the call stack. This mostly depends upon your code (but an optimizing compiler, e.g. -Os or -O2 for gcc, would probably use more registers and perhaps slightly less stack when optimizing). You could pass -fstack-usage to gcc to ask it to give the size of every call frame and you might give -Wstack-usage=len to be warned when a call frame exceeds len bytes.
global or static variables. You should know how much memory they need (and you might use nm or some other binutils program to query them). BTW, declaring carefully some variables inside a function as static would lower the stack consumption (but you cannot do that for every variable or every function).
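The memory-pool idea mentioned above can be sketched as a tiny bump allocator over one static buffer. This is only an illustration under stated assumptions: the names arena_t, arena_alloc, arena_reset and the 4096-byte capacity are hypothetical, and a real library would choose its own sizes and failure policy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fixed-size arena; all names and sizes are illustrative. */
#define ARENA_CAPACITY 4096u

typedef struct {
    unsigned char buf[ARENA_CAPACITY]; /* lands in .bss when the arena is static */
    size_t used;                       /* bytes handed out so far */
} arena_t;

/* Return `size` bytes aligned to `align` (a power of two), or NULL when full. */
void *arena_alloc(arena_t *a, size_t size, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);

    /* Round the next free address up to the requested alignment. */
    uintptr_t base = (uintptr_t)a->buf;
    uintptr_t next = (base + a->used + (align - 1)) & ~(uintptr_t)(align - 1);
    size_t start = (size_t)(next - base);

    if (start > ARENA_CAPACITY || size > ARENA_CAPACITY - start)
        return NULL;

    a->used = start + size;
    return a->buf + start;
}

/* Release everything at once; there is no per-object free. */
void arena_reset(arena_t *a)
{
    a->used = 0;
}
```

The only bookkeeping is a single size_t for the whole arena, so there is no per-block header as in a general-purpose malloc. The trade-off is that individual objects cannot be freed.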
Notice also that in some limited cases, GCC is doing tail calls, and then the stack usage is lowered (since the stack frame of the caller is reused in the callee). (See also this old question).
You might also ask the compiler to pack some particular struct-s (beware, this could slow down the performance significantly). You'll want to use some type attributes like __attribute__((packed)), etc... and perhaps also some variable attributes etc...
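A small sketch of what packing buys, assuming GCC or Clang (the struct names are made up for illustration). Reordering the members, which is not mentioned above, is shown as a gentler alternative that often recovers most of the padding without unaligned accesses.

```c
#include <stdint.h>
#include <stdio.h>

/* Natural layout: padding is inserted so that `value` is 4-byte aligned. */
struct record_natural {
    uint8_t  tag;
    uint32_t value;
    uint8_t  flag;
};

/* Same members, largest first: usually less padding, still aligned. */
struct record_reordered {
    uint32_t value;
    uint8_t  tag;
    uint8_t  flag;
};

/* Packed: no padding at all, but `value` may be misaligned (slower access). */
struct record_packed {
    uint8_t  tag;
    uint32_t value;
    uint8_t  flag;
} __attribute__((packed));

/* Print the three sizes so they can be compared on the target ABI. */
void print_record_sizes(void)
{
    printf("natural:   %zu bytes\n", sizeof(struct record_natural));
    printf("reordered: %zu bytes\n", sizeof(struct record_reordered));
    printf("packed:    %zu bytes\n", sizeof(struct record_packed));
}
```

The packed variant is exactly 6 bytes. The other two depend on the ABI (typically 12 and 8 bytes on common 32-bit and 64-bit targets).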
Perhaps you should read more about Garbage Collection, since GC techniques, concepts, and terminology might be relevant. See this answer.
If on Linux, the valgrind tool should be useful too... (and during the debugging phase the -fsanitize=address option of recent GCC).
You might perhaps also use some code generation options like -fstack-reuse= or -fshort-enums or -fpack-struct or -fstack-limit-symbol= or -fsplit-stack ; be very careful: some such options make your binary code incompatible with your existing C (and others!) libraries (then you might need to recompile all used libraries, including your libc, with the same code generation flags).
You probably should enable link-time optimizations by compiling and linking with -flto (in addition to other optimization flags like -Os).
You certainly should use a recent version of GCC. Notice that GCC 5.1 has been released a few days ago (in April 2015).
If your library is large enough to be worth the effort, you might even consider customizing your GCC compiler with MELT (to help you find out how to spend less memory). This might take weeks or months of work.

There are advantages to using stack frames, but they use more stack space to save the frame pointer.
You can tell the compiler not to use stack frames (with GCC, -fomit-frame-pointer). This will (generally) slightly increase the code size but will reduce the amount of stack used.
You can use only char and short for values, rather than int.
It is poor programming practice, but you can re-use variables and arrays for multiple purposes.
If some set of variables is mutually exclusive in usage, you can place them in a union.
If the function parameter lists are all very short, you can force the compiler to pass all the parameters in registers. (An architecture with lots of general-purpose registers really helps here.)
Use only one malloc that covers ALL the memory needed for malloc-style operations, so as to minimize the allocation overhead.
There are many techniques. Most make the code much more difficult to debug and maintain, and often much harder for humans to read.
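As an illustration of the union suggestion, here is a minimal sketch with hypothetical names (parse_state, emit_state, job_overlaid). It assumes the two states are never needed at the same time, so the code has to track which member is currently valid.

```c
#include <stdint.h>

/* Two pieces of state that are never live at the same time (assumption). */
struct parse_state {
    uint16_t line;
    uint16_t column;
    char     token[32];
};

struct emit_state {
    uint32_t bytes_written;
    uint8_t  out[32];
    uint8_t  checksum;
};

/* Straightforward layout: both states always occupy memory. */
struct job_separate {
    struct parse_state parse;
    struct emit_state  emit;
};

/* Overlaid layout: the union is only as large as its biggest member. */
struct job_overlaid {
    uint8_t phase;              /* 0 = parsing, 1 = emitting; says which member is valid */
    union {
        struct parse_state parse;
        struct emit_state  emit;
    } u;
};
```

The small phase tag costs a few bytes, but the overlaid struct still comes out smaller than the one holding both states side by side. Reading the member that is not currently valid is a bug the compiler will not catch, which is the maintenance cost mentioned above.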

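When possible, you can use the -m32 option to compile your application for 32-bit. On 64-bit systems, pointers and long then take 4 bytes instead of 8, so pointer-heavy data structures can shrink by up to about half; other data (int, char arrays, floating point values) keeps the same size.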
apt-get install libc6-dev-i386
gcc -m32 application.c -o application
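To see where the saving from -m32 comes from, one can look at a pointer-heavy type. A small sketch (the list_node type and list_bytes helper are hypothetical):

```c
#include <stddef.h>
#include <stdint.h>

/* A doubly linked list node: two pointers and a small payload. */
struct list_node {
    struct list_node *next;
    struct list_node *prev;
    int32_t           value;
};

/* Memory needed for `count` nodes, ignoring allocator overhead. */
size_t list_bytes(size_t count)
{
    return count * sizeof(struct list_node);
}
```

On x86-64 Linux this node typically takes 24 bytes, and 12 bytes when built with -m32, because only the two pointers (and the padding they force) change size. A struct made only of int32_t or char fields would not shrink at all.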

Related

Is using canaries for bss or data-sections to detect overflows/smashing useful?

In our GCC-based C embedded system we are using the -ffunction-sections and -fdata-sections options to allow the linker, when linking the final executable, to remove unused (unreferenced) sections. This has worked well for years.
In the same system most of the data-structures and buffers are allocated statically (often as static-variables at file-scope).
Of course we have bugs, sometimes nasty ones, where we would like to quickly exclude the possibility of buffer-overflows.
One idea we have is to place canaries in between each bss-section and data-section - each one presenting exactly one symbol (because of -fdata-sections). Like the compiler is doing for functions-stacks when Stack-Smashing and StackProtection is activated. Checking these canaries could be done from the host by reading the canary-addresses "from time to time".
Modifying the linker script (manually placing the sections and adding a canary word in between) seems feasible, but does it make sense?
Is there a project or an article in the wild? Using my keywords I couldn't find anything.

Canaries are mostly useful for the stack, since it expands and collapses beyond the programmer's direct control. The things you have on data/bss do not behave like that. Either they are static variables, or in case they are buffers, they should keep within their fixed size, which should be checked with defensive programming in-place with the algorithm, rather than unorthodox tricks.
Also, stack canaries are used specifically in RAM-based, PC-like systems that don't know any better way. In embedded systems, they aren't very meaningful. Some useful things you can do instead:
Memory map the stack so that it grows into a memory area where writes will yield a hardware exception. Like for example, if your MCU has the ability to separate executable memory from data memory and yield exceptions if you try to execute code in the data area, or write to the executable area.
Ensure that everything in your program dealing with buffers performs its error checks and does not write out-of-bounds. Static analysis tools are usually decent at spotting out-of-bounds bugs. Even some compilers can do this.
Add lots of defensive programming with static asserts. Check sizes of structs, buffers etc at compile-time, it's free.
Run-time defensive programming. For example if(x==good) {...} else if(x == bad) {... } is missing an else. And switch(x) case A: { ... } is missing a default. "But it can't go there in theory!" No, but in practice it can, when you get runaway code caused by bugs (very likely), limited data retention of flash (100% likely) or EMI influence on RAM (quite unlikely).
And so on.
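The compile-time and run-time checks above could look roughly like this in C11. The packet_header type, RX_BUFFER_SIZE and the state/action enums are invented for the example; the point is only the shape of the checks.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdint.h>

#define RX_BUFFER_SIZE 64u

struct packet_header {
    uint8_t  type;
    uint8_t  length;
    uint16_t crc;
};

/* Compile-time checks: they cost nothing at run time. */
static_assert(sizeof(struct packet_header) == 4,
              "packet_header layout changed");
static_assert(RX_BUFFER_SIZE >= sizeof(struct packet_header),
              "RX buffer cannot hold a header");

enum state  { STATE_IDLE, STATE_BUSY };
enum action { ACTION_WAIT, ACTION_WORK, ACTION_RESET };

/* Run-time check: the default branch handles values that "cannot happen". */
enum action next_action(int state)
{
    switch (state) {
    case STATE_IDLE:
        return ACTION_WAIT;
    case STATE_BUSY:
        return ACTION_WORK;
    default:
        /* Corrupted or runaway state: recover instead of falling through. */
        return ACTION_RESET;
    }
}
```

If someone later adds a field to packet_header or shrinks the buffer, the build fails instead of the device.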

C function code in malloc'd memory

Is there a way to malloc memory space and then copy function code inside the space in C?
This question might not make sense in practice. I ask this question out of curiosity so that I can get a better understanding about how c and its underlying implementation work.
Here's the follow-up questions if it is possible to copy the code into heap:
How to determine the size for the function binary code when copy?
Can we use function pointer to execute the code? (the code is placed inside malloc'd memory, and that part of memory might be marked as non-executable for safety reason, but I'm not sure about this)

This (or something like it) is possible on most machines, but the techniques you'd use are system-specific -- there's no standard C or C++ way to do it.
Even figuring out the length of a function so you can copy it is difficult. I don't think you can do it reliably if the function is in the same translation unit, because the compiler may have done optimization magic that you can't see. However, if the function is in a different file, then the interface to it will probably be more reliable (although there could be linker magic going on that you would have to understand and emulate to accomplish your goal.)
Other problems (on some systems) are that malloc'd memory may not be executable. (This is often the case to improve security by preventing execution of code placed in an overrun buffer area.) However, systems with executable protection often have an alternate memory allocation function that can give you a chunk of memory where executable code can be placed, and to which execution can transfer. Some variation of this feature is necessary to implement shared libraries.
Finally, although self-modifying code is probably the first thing people think of when considering your question, a reasonable, legitimate use of the relevant techniques might be in a native-code, just-in-time compilation system.
You may get better answers by specifying a particular OS and CPU where you want to do this.

The C standard (e.g. C11, read n1570) or the C++ one (e.g. C++11, C++14 and notice that they have lambda expressions and std::function; read more about closures ...) does not define what is a function address or pointer (it only defines what calling such an address does, then function pointers should point to existing functions and there is no standard way to build new ones dynamically at runtime). In some systems (pure Harvard architectures) a function sits in a different address space than the C heap (and on these systems executing anything in malloc-ed heap makes no sense and is undefined behavior). This is why the C11 standard does not define conversions between function pointers and data pointers (many compilers accept them as an extension).
So, to your question
Is there a way to malloc memory space and then put function code inside the space in C?
the answer is NO in general (but on some systems you could generate code at runtime, see below).
However, on desktop or laptop PCs or server PCs or tablets (running common OSes like Linux, Windows, MacOSX, Android), you usually have a Von Neumann architecture and there is (for a given process) a single virtual address space sharing both code and data (notably heap data obtained with malloc). That virtual address space is organised in pages, and each page has its own memory protection. Read more about computer architecture, instruction sets, MMUs. Quite often heap allocated data is non-executable thru the NX bit.
The operating system plays an essential role. You need to read an entire book about OS, such as Operating Systems : Three Easy Pieces.
(I am guessing that you want to "create" some new functions in your program at runtime and call them thru C function pointers; you should explain why; I suppose you are coding some application for a PC or a tablet with a Unix-like OS, practically a Linux-x86_64 distribution, but you could adapt my answer to Windows)
You could use some libraries for JIT compilation such as asmjit, libgccjit, LLVM (or libjit or GNU lightning) and they generate code which is executable.
You could also use dynamic loading techniques on some plugin; on POSIX systems look into dlopen & dlsym (which can be used to "create" function addresses from a loaded plugin, beyond what the C11 standard allows). A possible way would be to generate some C code in a temporary file, compile it into a plugin, and dlopen that generated plugin. See this answer for more details.
On Linux, you can use the mmap(2) and related system calls (used to implement malloc in your C standard library, and also by dlopen(3)) to change your virtual address space, and the mprotect(2) system call to change protection (on a page by page basis). So if you want to explicitly copy or generate some function code it has to go into an executable page (PROT_EXEC).
Notice that because of relocation issues (and offsets or absolute addresses in machine code), it is not easy to copy machine code. Copying with memcpy the bytes of a given function code into some executable page usually won't work without pain: often CALL or JUMP machine instructions are using PC-relative addressing, so copying them without changing their offset won't work.
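For illustration only, here is a minimal sketch of the mmap plus mprotect route on Linux/x86-64. It sidesteps the relocation problem by writing six hand-assembled, position-independent bytes instead of copying an existing function. The function name run_generated_code is made up, the cast from void* to a function pointer relies on a compiler extension, and on other platforms the sketch simply returns -1.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE          /* MAP_ANONYMOUS on glibc with strict -std= */
#endif
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

typedef int (*int_fn)(void);

/* Returns 42 when the generated code ran, -1 when unsupported or refused. */
int run_generated_code(void)
{
#if defined(__linux__) && defined(__x86_64__)
    /* x86-64 machine code for: mov eax, 42 ; ret */
    static const unsigned char code[] = {
        0xB8, 0x2A, 0x00, 0x00, 0x00,
        0xC3
    };
    const size_t len = 4096;     /* one page is plenty for six bytes */

    /* 1. Get a writable (not yet executable) anonymous mapping. */
    void *page = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED)
        return -1;

    /* 2. Put the machine code into it. */
    memcpy(page, code, sizeof code);

    /* 3. Flip the page to read + execute; a hardened system may refuse. */
    if (mprotect(page, len, PROT_READ | PROT_EXEC) != 0) {
        munmap(page, len);
        return -1;
    }

    /* 4. Call it through a function pointer (non-ISO conversion). */
    int_fn fn = (int_fn)page;
    int result = fn();

    munmap(page, len);
    return result;
#else
    return -1;                   /* other OS/CPU: not covered by this sketch */
#endif
}
```

Mapping the page writable first and executable afterwards avoids having a page that is writable and executable at the same time. For anything beyond a toy, the JIT libraries mentioned above take care of instruction encoding and cache flushing on other architectures.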
if it is possible to copy the code into heap
No, it is not possible in general; and in practice it is much more difficult than what you believe (even on Linux-x86_64, where other approaches that I mentioned are preferable); if you want to go that route you need to care about low level implementation details (instruction set, processor, compiler, calling conventions, ABIs, relocation) and your code would be non-portable and brittle.
How to determine the size for the function binary code when copy?
That question (and the notion of function size) makes no sense in general. Some optimizing compilers are able to emit some machine code which is shared between several C functions, or to emit several non-contiguous machine code chunks for a given function (and gcc -O2 is likely to do these optimizations, read about function cloning). On Linux you could use dladdr(3) (or the nm or readelf programs) to get a "symbol size" in the ELF sense, but that size might not mean much. And as I explained, you can't just byte-copy binary machine code, you need to relocate (some parts of) it.

How to instrument/profile memory(heap, pointers) reads and writes in C?

I know this might be a bit vague and far-fetched (sorry, stackoverflow police!).
Is there a way, without external forces, to instrument (track basically) each pointer access and track reads and writes - either general reads/writes or quantity of reads/writes per access. Bonus if it can be done for all variables and differentiate between stack and heap ones.
Is there a way to wrap pointers in general or should this be done via custom heap? Even with custom heap I can't think of a way.
Ultimately I'd like to see a visual representation of said logs that would show me variables represented as blocks (of bytes or multiples of) and heatmap over them for reads and writes.
Ultra simple example:
int i = 5;
int *j = &i;
printf("%d", *j); /* Log would write: *j was accessed for read, and sizeof(int) bytes were read */
Attempt of rephrasing in more concise manner:
(How) can I intercept (and log) access to a pointer in C without external instrumentation of binary? - bonus if I can distinguish between read and write and get name of the pointer and size of read/write in bytes.

I guess (or hope for you) that you are developing on Linux/x86-64 with a recent GCC (5.2 in October 2015) or perhaps Clang/LLVM compiler (3.7).
I also guess that you are tracking a naughty bug, and not asking this (too broad) question from a purely theoretical point of view.
(Notice that practically there is no simple answer to your question, because in practice C compilers produce machine code close to the hardware, and most hardware does not have sophisticated instrumentation like the one you dream of)
Of course, compile with all warnings and debug info (gcc -Wall -Wextra -g). Use the debugger (gdb), notably its watchpoint facilities which are related to your issue. Use also valgrind.
Notice also that GDB (recent versions like 7.10) is scriptable in Python (or Guile), and you could code some scripts for GDB to assist you.
Notice also that recent GCC & Clang/LLVM have several sanitizers. Use some of the -fsanitize= debugging options, notably the address sanitizer with -fsanitize=address; they instrument the code to help in detecting bad pointer accesses, so they are sort-of doing what you want. Of course, the instrumented code runs slower (depending on the sanitizer, the slowdown can be 10 or 20% or a factor of 50x).
Finally, you might even consider adding your own instrumentation by customizing your compiler, e.g. with MELT, a high-level domain specific language designed for such customization tasks for GCC. This would take months of work, unless you are already familiar with GCC internals (then, only several weeks). You could add an "optimization" pass inside GCC which would instrument (by changing the Gimple code) whatever accesses or stores you want.
Read more about aspect-oriented programming.
Notice also that if your C code is generated, that is if you are meta-programming, then changing the C code generator might be very relevant. Read more about reflection and homoiconicity. Dynamic software updating is also related to your issues.
Look also into profiling tools like oprofile and into sound static source analyzers like Frama-C.
You could also run your program inside some (instrumenting) emulator (like Qemu, Unisim, etc...).
You might also compile for a fictitious architecture like MMIX and instrument its emulator.

C embedded systems stack and heap size

How could I determine the current stack and heap size of a running C program on an embedded system? Also, how could I discover the maximum stack and heap sizes that my embedded system will allow? I thought about linearly calling malloc() with an increasing size until it fails to find the heap size, however I am more interested in the size of the stack.
I am using an mbed NXP LPC1768, and I am using an offline compiler developed on GitHub called gcc4mbed.
Any better ideas? All help is greatly appreciated!

For this, look at your linker script: it defines how much space you have allocated to each.
For stack size usage do this:
At startup (before C main()) during initialization of memory, init all your stack bytes to known values such as 0xAA, or 0xCD. Run your program, at any point you can stop and see how many magic values you have left. If you don't see any magic values then you have overflowed your stack and weirdness may start to happen.
At runtime you can also check the last 4 bytes or so (maybe last two words, this is really up to you). If they don't match your magic value then force a reset. This only works if your system is well behaved on reset and it is best if it starts up quickly and isn't doing something "real time" or mission critical.
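The painting technique described above can be sketched as two small helpers. They work on any byte region, so they are testable on a host. On the real target the region would be the stack area defined by your linker script, painted from the startup code; the symbol names for that are target-specific and deliberately not shown. The sketch assumes a descending stack and the 0xAA fill value; the helper names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STACK_FILL 0xAAu   /* magic value, as suggested above */

/* Paint the whole region before it is used as a stack. */
void stack_paint(uint8_t *lowest, size_t size)
{
    memset(lowest, STACK_FILL, size);
}

/*
 * Count bytes that still hold the magic value, scanning up from the lowest
 * address. With a descending stack these are the bytes never touched so far.
 * A result of 0 means the stack reached (or overran) the end of its region.
 */
size_t stack_unused_bytes(const uint8_t *lowest, size_t size)
{
    size_t n = 0;
    while (n < size && lowest[n] == STACK_FILL)
        n++;
    return n;
}
```

The high-water mark is then the region size minus stack_unused_bytes(). A periodic check that the result is still above some margin gives the "force a reset" guard described above.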
Here's a really helpful whitepaper from IAR on the subject.

A crude way of measuring at runtime the current stack size is to declare
static void* mainsp;
then start your main with e.g:
int main(int argc, char**argv) {
int here;
mainsp = (void*) &here;
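then inside some leaf routine, when the call stack is deep enough, do something similar to the following (this assumes that <stdint.h> is included for intptr_t and that the stack grows toward lower addresses, as on most architectures)
int local;
printf ("stack size = %ld\n",
(long) ((intptr_t) mainsp - (intptr_t) &local));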
Statically estimating from full source code of an application the required stack size is in general undecidable (think of recursion, function pointers), and in practice very difficult (even on a severely restricted class of applications). Look into Couverture. You might also consider customizing a recent GCC compiler with your plugin (perhaps Bismon in mid 2021; email me to basile.starynkevitch#cea.fr about it) for such purposes, but that won't be easy and will give you over-approximations.
If compiling with GCC, you might use the return address builtins (e.g. __builtin_frame_address) to query the stack frame pointer at run time. On some architectures it is not available with some optimization flags. You could also use the -Wstack-usage=byte-size and/or -Wframe-larger-than=byte-size warning options of recent GCC.
As to how heap and stack spaces are distributed, this is system dependent. You might parse /proc/self/maps file on Linux. See proc(5). You could limit stack space on Linux in user-space using setrlimit(2).
Be however aware of Rice's theorem.
With multi-threaded applications things could be more difficult. Read some Pthread tutorial.
Notice that in simple cases, GCC may be capable of tail-call optimizations. You could compile your foo.c with gcc -Os -fverbose-asm -S foo.c and look inside the generated foo.s assembler code.
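As a small illustration of the tail-call remark (function names are invented), compare a function whose recursive call is the very last thing it does with one that still has work to do after the call returns:

```c
/* Tail-recursive: the recursive call is the last action, so the caller's
 * frame can be reused. GCC may turn this into a loop at -O2 or -Os. */
unsigned long sum_to(unsigned long n, unsigned long acc)
{
    if (n == 0)
        return acc;
    return sum_to(n - 1, acc + n);
}

/* Not a tail call as written: the addition happens after the call returns,
 * so without further optimization each level needs its own stack frame. */
unsigned long sum_to_nontail(unsigned long n)
{
    if (n == 0)
        return 0;
    return n + sum_to_nontail(n - 1);
}
```

Whether the optimization is actually applied depends on the GCC version and flags, so inspecting the generated foo.s (a jmp or a loop instead of a call) is the only reliable check. An optimizer may also manage to transform the second form.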
If you don't care about portability, consider also using the extended asm features of GCC.

memcpy vs assignment in C

Under what circumstances should I expect memcpys to outperform assignments on modern INTEL/AMD hardware? I am using GCC 4.2.x on a 32 bit Intel platform (but am interested in 64 bit as well).

You should never expect them to outperform assignments. The reason is that the compiler will use memcpy anyway when it thinks it would be faster (if you use optimization flags). If not, and if the structure is reasonably small so that it fits into registers, direct register manipulation could be used, which wouldn't require any memory access at all.
GCC has special block-move patterns internally that figure out when to directly change registers / memory cells, or when to use the memcpy function. Note that when assigning the struct, the compiler knows at compile time how big the move is going to be, so it can unroll small copies (do a move n times in a row instead of looping) for instance. Note -mno-memcpy:
-mmemcpy
-mno-memcpy
Force (do not force) the use of "memcpy()" for non-trivial block moves.
The default is -mno-memcpy, which allows GCC to inline most constant-sized copies.
Who knows better than the compiler itself when to use memcpy?
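To make the comparison concrete, here are the two ways of copying a struct side by side (the struct and function names are made up). With optimization enabled, GCC is free to compile both to the same instructions, which can be checked with gcc -O2 -S.

```c
#include <stdint.h>
#include <string.h>

struct sample {
    uint32_t id;
    double   values[4];
};

/* Plain assignment: the compiler picks the copy strategy. */
void copy_by_assignment(struct sample *dst, const struct sample *src)
{
    *dst = *src;
}

/* Explicit memcpy with a compile-time constant size: usually inlined too. */
void copy_by_memcpy(struct sample *dst, const struct sample *src)
{
    memcpy(dst, src, sizeof *dst);
}
```

The assignment form is also type-checked, whereas memcpy will happily copy between unrelated types if the size is wrong.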
