Can small functions defined in dynamic libraries be inlined? - c

This is a follow-up to this question: Can the linker inline functions?
This time I am wondering about the same optimization, not at link time, but at run time when linking to a dynamic library. Is it possible at all? Do modern OSes do it? Why?

In theory it's possible, but there are many reasons not to do it. In practice, "dynamic linking" is not really full linking; position-independent code is used for everything but the main program (and possibly for the main program too), so the full range of relocations a full (static) linker might have to perform is not needed. Instead, only a small number of relocation types are required, which basically amount to filling in the addresses of functions and objects from libraries in one big contiguous table. Of course, such references in objects of static storage duration in the .data segment must also be filled in, so it's a little more work than just filling in a contiguous table, but the key point is that only data, not code, is modified.
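As a minimal illustration (lib_hello and libhello.so are made-up names), a call into a shared library compiles to an indirect call through that table; the caller's own instruction bytes are never touched:

/* main.c: lib_hello is assumed to live in a shared library built with
 * something like: gcc -shared -fPIC hello.c -o libhello.so
 * The call below goes through the PLT; the dynamic linker fills in the
 * real address in the GOT (the "big contiguous table") at load time. */
extern void lib_hello(void);

int main(void)
{
    lib_hello();   /* indirect call via PLT/GOT; never inlined */
    return 0;
}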
If you start modifying code at all, you throw away most of the advantages of dynamic linking: code pages could not be shared among multiple instances of the application/library, and a lot more time would be spent at startup duplicating (via page faults and copy-on-write semantics) the mapped code pages. And this is just the minimal cost for patching a few bytes here and there in code.
For actually inlining code from dynamic libraries, what you would have to do amounts to full link-time optimization. Measure how much time it takes to LTO-link a large program, and then ask yourself whether users would find it acceptable to wait that long every time they start the program. The answer is almost surely no.

Related

__attribute__((section("name"))) usage?

I have run across code that uses __attribute__((section("name"))). I understand that for the gcc compiler this allows you to tell the linker to place the object it creates in a specific section "name" (with the absolute address of "name" declared in a linker file).
What is the point of doing this instead of just using the .data section?
There are many possible uses. [Edit to add note: this is just a sample of uses I've seen myself or considered, not a complete list.]
The Linux kernel, for instance, marks some code and data sections as used only during kernel bootstrap. These can be jettisoned after the kernel is running, reclaiming the space for other uses.
You can use this to mark code or data values that need patching on a particular processor variant, e.g., with or without a coprocessor.
You can use it to make things live in "special" address spaces that will be burned to PROM or saved on an EEPROM, rather than in ordinary memory.
You can use it to collect together code or data areas for purposes like initialization and cleanup, as with C++ constructors and destructors that run before the program starts and when it ends, or for using shorter addressing modes (I don't know how much that would apply on ARM as I have not written any ARM code myself).
The actual use depends on the linker script(s).
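A minimal sketch of the mechanics, with made-up section, region, and address values (the linker script fragment is only illustrative):

/* C source: ask the compiler to emit this object into a custom section. */
int calibration_table[64] __attribute__((section(".myconfig")));

/* An illustrative linker script fragment then pins that section at an
 * absolute address in a FLASH region:
 *
 *   .myconfig 0x08080000 : { *(.myconfig) } > FLASH
 */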
From a use-case point of view, there are lots of different types of .data, like:
data local to a specific CPU and/or NUMA node
data shared between contexts (like user/kernelspace, as are the .vdso or vsyscall pages. Or, another example, bootloader and kernel)
readonly data or other data with specific access mode/type restrictions (say, cacheability or cache residency - the latter can be specified on some ARM SoCs)
data that persists across "state transitions" (such as hibernation image loads, or crash kernel / fast reboot reinitializations)
data with specific lifetimes/lifecycles (only used in specific stages during boot or during operation, write-once data)
data specific to a particular kernel subsystem or particular kernel module
"code colocated" data (addressing offsets in x64 are plus/minus 2GB so if you want RIP-relative addressing, the data must be within that range of your currently executing code)
data mapped to some specific hardware register space VA range
So in the end it's often about attributes (the word used here in a more generic sense than what __attribute__(...) allows you to state from within gcc source code). Whether another section is needed and/or useful is ... in the eye of the beholder - the system designer, that is.
The availability of the section attribute therefore allows for flexibility, and that is, IMHO, a good thing.
Years later, I'm going to add a specific detail because it's worth writing down.
If you create your own section, you can manage it yourself. In particular, you can use preprocessor macros to insert certain data items into your special section. If the only thing that uses that special section is your preprocessor macros, then you have the ability to create a data structure in a distributed fashion.
What does this mean? It means you can write a preprocessor macro like ADD_VAR_TO_SPECIAL_SECTION(...) and concatenate a bunch of different values in random order into what amounts to an array (or just a big old pile, if they aren't all the same type) in your section.
This gives you the ability to create a (randomly-ordered) array of data at compile time. There is no initialization, no registration, no overhead. You just compile and link your code, and all the macros that were in all the different source files have added all their values into one big array.
How can you use this? Create a bunch of "modules." Register the init functions and destroy functions in an ad-hoc array. Process the array at startup time. (You can add some kind of topological sort if you need to.) You don't need to have a master list of modules anywhere, it gets built automatically. Or, create a macro to register unit test functions into a test suite. Again, it creates an ad-hoc list with no "registration" required.
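Here's a minimal sketch of the pattern, assuming GCC and GNU ld (which synthesizes __start_/__stop_ symbols for any section whose name is a valid C identifier); the macro and section names are made up:

#include <stdio.h>

typedef void (*init_fn)(void);

/* Drop a function pointer into the "initcalls" section; "used" stops
 * the compiler from discarding the apparently unreferenced object. */
#define ADD_INIT_FUNCTION(fn) \
    static const init_fn reg_##fn \
    __attribute__((section("initcalls"), used)) = fn

/* GNU ld provides these bounds automatically. */
extern const init_fn __start_initcalls[];
extern const init_fn __stop_initcalls[];

static void init_foo(void) { puts("foo up"); }
static void init_bar(void) { puts("bar up"); }

ADD_INIT_FUNCTION(init_foo);   /* these could live in any source file */
ADD_INIT_FUNCTION(init_bar);

int main(void)
{
    /* Walk the ad-hoc array the linker assembled for us. */
    for (const init_fn *f = __start_initcalls; f < __stop_initcalls; ++f)
        (*f)();
    return 0;
}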

Determine total memory usage of embedded C program

I would like to be able to debug how much total memory is being used by a C program in a limited-resource environment of 256 KB of memory (currently I am testing in an emulator program).
I have the ability to print debug statements to a screen, but what method should I use to calculate how much my C program is using (including globals, local variables [from perspective of my main function loop], the program code itself etc..)?
A secondary aspect would be to display the location/ranges of specific variables as opposed to just their size.
-Edit- The CPU is a Hitachi SH2; I don't have an IDE that lets me put breakpoints into the program.
In the IDE's options, take the appropriate action (mark a checkbox, probably) so that the build process (namely, the linker) will generate a map file.
A map file of an embedded system will normally give you the information you need in a detailed fashion: the memory segments, their sizes, how much memory is utilized in each one, program memory, data memory, etc. There is usually a lot of data supplied by the map file, and you might need to write a script to calculate exactly what you need, or copy it into Excel. The map file might also contain summary information for you.
The stack is a bit trickier. If the map file gives that, then there you have it. If not, you need to find it yourself. Embedded compilers usually let you define the stack location and size. Put a breakpoint at the start of your program. When the application stops there, zero the entire stack. Resume the application and let it run for a while. Finally, stop it and inspect the stack memory: you will see non-zero values where the stack has been used, and the used portion extends until the zeros begin again.
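Since you can't set breakpoints, the same trick can be done in code; a minimal sketch, assuming a descending stack whose bounds come from your linker script (_stack_bottom and _stack_top are made-up symbol names):

#include <stdint.h>
#include <stddef.h>

extern uint32_t _stack_bottom[];   /* lowest address of the stack */
extern uint32_t _stack_top[];      /* one past the highest address */

#define STACK_FILL 0xDEADBEEFu

/* Call as early as possible at boot, while the stack is still shallow;
 * the margin keeps us away from our own live frames near the top. */
void stack_paint(void)
{
    for (uint32_t *p = _stack_bottom; p < _stack_top - 64; ++p)
        *p = STACK_FILL;
}

/* Later: the used stack runs from the top down to wherever the fill
 * pattern is still intact. */
size_t stack_high_water_bytes(void)
{
    const uint32_t *p = _stack_bottom;
    while (p < _stack_top && *p == STACK_FILL)
        ++p;
    return (size_t)((const uint8_t *)_stack_top - (const uint8_t *)p);
}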
Generally you will have different sections in the generated map file, showing where the data goes, like:
.intvect
.intvect_end
.rozdata
.robase
.rosdata
.rodata
.text ... and so on
with other attributes like Base, Size (hex), Size (dec), etc. for each section.
While at any time local variables may take up more or less space (as they go in and out of scope), they are instantiated on the stack. In a single-threaded environment, the stack will be a fixed allocation known at link time. The same is true of all statically allocated data. The only run-time variable part is dynamically allocated data, but even then such data is allocated from the heap, which in most bare-metal, single-threaded environments is a fixed link-time allocation.
Consequently all the information you need about memory allocation is probably already provided by your linker. Often (depending on your tool-chain and linker parameters used) basic information is output when the linker runs. You can usually request that a full linker map file is generated and this will give you detailed information. Some linkers can perform stack usage analysis that will give you worst case stack usage for any particular function. In a single threaded environment, the stack usage from main() will give worst case overall usage (although interrupt handlers need consideration, the linker is not thread or interrupt aware, and some architectures have separate interrupt stacks, some are shared).
Although the heap itself is typically a fixed allocation (often all the available memory after the linker has performed static allocation of stack and static data), if you are using dynamic memory allocation, it may be useful at run-time to know how much memory has been allocated from the heap, as well as information about the number of allocations, the average allocation size, and the number of free blocks and their sizes. Because dynamic memory allocation is implemented by your system's standard library, any such analysis facility will be specific to your library, and may not be provided at all. If you have the library source you could implement such facilities yourself.
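If you do roll your own, a minimal sketch of the idea (my_malloc and my_free are hypothetical wrapper names; a real version would also cover calloc/realloc and any thread-safety needs):

#include <stdlib.h>
#include <stddef.h>

/* Hidden header prepended to each allocation; the union keeps the
 * pointer handed back to the caller suitably aligned. */
typedef union {
    size_t size;
    long double align;
} header_t;

static size_t heap_in_use;    /* bytes currently allocated */
static size_t alloc_count;    /* number of live allocations */

void *my_malloc(size_t n)
{
    header_t *h = malloc(sizeof *h + n);
    if (h == NULL)
        return NULL;
    h->size = n;
    heap_in_use += n;
    alloc_count++;
    return h + 1;
}

void my_free(void *p)
{
    if (p == NULL)
        return;
    header_t *h = (header_t *)p - 1;
    heap_in_use -= h->size;
    alloc_count--;
    free(h);
}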
In a multi-threaded environment, thread stacks may be allocated statically or from the heap, but either way the same analysis methods described above apply. For stack usage analysis, the worst-case for each thread is measured from the entry point of each thread rather than from main().

How can I manually (programmatically) place objects in my multicore project?

I am developing a multicore project for our embedded architecture using the gnu toolchain. In this architecture, all independent cores share the same global flat memory space. Each core has its own internal memory, which is addressable from any other core through its global 32-bit address.
There is no OS implemented and we do low-level programming, but in C instead of assembly. Each core has its own executable, generated with a separate compilation. The current method we use for inter-core communication is through calculation of absolute addresses of objects in the destination core's data space. If we build the same code for all cores, then the linker locates the objects in the same place, so accessing an object in a remote core is merely a matter of changing the high-order bits of the address of the object in the current core and making the transaction. A similar concept allows us to share objects that are located in the external DRAM.
Things start getting complicated when:
The code is not the same in the two cores, so objects may not be allocated in similar addresses,
We sometimes use a "host", which is another processor running some control code that requires access to objects in the cores, as well as shared objects in the external memory.
In order to overcome this problem, I am looking for an elegant way of placing variables at build time. I would like to avoid changing the linker script file as much as possible. However, it seems that at the C level I can only control placement by using a combination of the section attribute (which is too coarse) and the align attribute (which doesn't guarantee an exact location).
A possible hack is to use inline assembly to define the objects and explicitly place them (using the .org and .global directives), but it seems somewhat ugly (and we did not yet actually test this idea...)
So, here are the questions:
Is there a semistandard way, or an elegant solution for manually placing objects in a C program?
Can I declare "uber"-external objects in my code and make the linker resolve their addresses using another project's executable?
This question describes a similar situation, but there the user references a pre-allocated resource (like a peripheral) whose address is known prior to build time.
Maybe you should try using the 'placement' form of the new operator. More exactly, if you already have allocated/shared memory, you may create new objects in it. Please see: create objects in pre-allocated memory
You don't say exactly what sort of data you'll be sharing, but assuming it's mostly fixed-size statically allocated variables, I would place all the data in a single struct and share only that.
The key point here is that this struct must be shared code, even if the rest of the programs are not. It would be possible to append extra fields (perhaps with a version field so that the reader can interpret it correctly), but existing fields must not be removed or modified. structs are already used as the interface between libraries everywhere, so their layout can be relied upon (although a little more care will be needed in a heterogeneous environment; as long as the type sizes and alignments are the same you should be OK).
You can then share structs by either:
a) putting them in a special section and using the linker script to put that in a known location;
b) allocating the struct in static data, and placing a pointer to that at a known location, say in your assembly start-up files; or
c) as (b), but allocate the struct on the heap, and copy the pointer to the known pointer location at run-time. This has the advantage that the pointer can be pre-adjusted for external consumers, thus avoiding a certain amount of messing about.
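A minimal sketch of option (a), with made-up section and field names (a linker script would pin .shared_comm at the agreed absolute address):

#include <stdint.h>

/* Every core's program includes this exact definition. Only append
 * fields (bump version when you do); never remove or reorder them. */
struct core_comm {
    uint32_t version;             /* lets readers interpret the layout */
    volatile uint32_t mailbox;
    volatile uint32_t status;
};

/* Placed by the linker script at a known absolute address, e.g.:
 *   .shared_comm 0x20000000 : { *(.shared_comm) }
 */
struct core_comm g_comm __attribute__((section(".shared_comm")));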
Hope that helps
Response to question 1: no, there isn't.
As for the rest, it depends very much on the operating system you use. On the system we had when I was in embedded work, we had only one processor's memory to handle (80186- and 68030-based), but we did have multi-tasking, all from the same binary. Our tool chain was extended to handle the memory in a certain way.
The toolchain looked like that (on 80186):
Microsoft C 16bit or Borland-C
Linker linking to our specific crt.o which defined some special symbols and segments.
Microsoft linker, generating an exe and a map file with a MS-DOS address schema
A locator that adjusted the addresses in the executable and generated a flat binary
Address patcher.
An EPROM burner (later a Flash loader).
In our assembly we defined a symbol that was always at the beginning of the data segment, and we patched the binary with a hard-coded value coming from the located map file. This allowed the library to use all the remaining memory as a heap.
In fact, if you don't have control over the locator (the ELF loader on Linux or the exe/dll loader on Windows), you're screwed.
You're well off the beaten path here - don't expect anything 'standard' for any of this :)
This answer suggests a method of passing a list of raw addresses to the linker. When linking the external executable, generate a linker map file, then process it to produce this raw symbol table.
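With the GNU toolchain, one way to consume such a table is the linker's --defsym option; a sketch with a made-up symbol name and address:

#include <stdint.h>

/* Declared here, but no storage is defined anywhere; the address is
 * injected at link time (scripted from the other core's map file):
 *
 *   gcc ... -Wl,--defsym,remote_counter=0x40001000
 */
extern volatile uint32_t remote_counter;

void poke_remote(void)
{
    remote_counter = 42;
}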
You could also try linking the entire program (all cores' programs) into a single executable. Use section definitions and a linker script to put each core's program into its internal memory address space; you can build each core's program separately, incrementally link it to a single .o file, then use objcopy to rename its sections to contain the core ID for the linker script, and rename (hide) private symbols if you're duplicating the same code across multiple cores. Finally, manually supply the start address for each core to your bootstrap code instead of using the normal start symbol.

In C, does using static variables in a function make it faster?

My function will be called thousands of times. If I want to make it faster, will changing the local function variables to static be of any use? My logic behind this is that, because static variables are persistent between function calls, they are allocated only the first time, and thus every subsequent call will not allocate memory for them and will become faster, because the memory allocation step is not done.
Also, if the above is true, then would using global variables instead of parameters be faster for passing information to the function every time it is called? I think space for parameters is also allocated on every function call, to allow for recursion (that's why recursion uses up more memory), but since my function is not recursive, and if my reasoning is correct, then taking off parameters will in theory make it faster.
I know these things I want to do are horrible programming habits, but please, tell me if it is wise. I am going to try it anyway but please give me your opinion.
The overhead of local variables is zero. Each time you call a function, you are already setting up the stack for the parameters, return values, etc. Adding local variables means that you're adding a slightly bigger number to the stack pointer (a number which is computed at compile time).
Also, local variables are probably faster due to cache locality.
If you are only calling your function "thousands" of times (not millions or billions), then you should be looking at your algorithm for optimization opportunities after you have run a profiler.
Re: cache locality (read more here):
Frequently accessed global variables probably have temporal locality. They also may be copied to a register during function execution, but will be written back into memory (cache) after a function returns (otherwise they wouldn't be accessible to anything else; registers don't have addresses).
Local variables will generally have both temporal and spatial locality (they get that by virtue of being created on the stack). Additionally, they may be "allocated" directly to registers and never be written to memory.
The best way to find out is to actually run a profiler. This can be as simple as executing several timed tests using both methods and then averaging out the results and comparing, or you may consider a full-blown profiling tool which attaches itself to a process and graphs out memory use over time and execution speed.
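A crude version of the timed-test approach might look like this (work() is a stand-in for whichever variant you are measuring):

#include <stdio.h>
#include <time.h>

static int counter;

/* Stand-in for the function under test; swap in the static-variable
 * and automatic-variable versions and compare the timings. */
static int work(void)
{
    int local = counter + 1;
    counter = local;
    return local;
}

int main(void)
{
    volatile long sink = 0;   /* stops the loop being optimised away */
    clock_t t0 = clock();
    for (long i = 0; i < 10000000L; ++i)
        sink += work();
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
    printf("%.3f s (sink=%ld)\n", secs, sink);
    return 0;
}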
Do not perform random micro code-tuning because you have a gut feeling it will be faster. Compilers all have slightly different implementations of things and what is true on one compiler on one environment may be false on another configuration.
To tackle that comment about fewer parameters: the process of "inlining" functions essentially removes the overhead related to calling a function. Chances are a small function will be automatically in-lined by the compiler, but you can suggest a function be inlined as well.
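In C that suggestion looks like this (the compiler remains free to ignore it):

/* A hint, not a command; with static inline the definition can sit in
 * a header and be expanded at each call site. */
static inline int square(int x)
{
    return x * x;
}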
In a different language, C++, the new standard coming out supports perfect forwarding, and perfect move semantics with rvalue references which removes the need for temporaries in certain cases which can reduce the cost of calling a function.
I suspect you're prematurely optimizing, however; you should not be this concerned with performance until you've discovered your real bottlenecks.
Absolutely not! The only "performance" difference is when variables are initialised:
int anint = 42;
vs
static int anint = 42;
In the first case the integer will be set to 42 every time the function is called; in the second case it will be set to 42 when the program is loaded.
However, the difference is so trivial as to be barely noticeable. It's a common misconception that storage has to be allocated for "automatic" variables on every call. This is not so; C uses space already allocated in the stack for these variables.
Static variables may actually slow you down, as some aggressive optimisations are not possible on static variables. Also, as locals are in a contiguous area of the stack, they are easier to cache efficiently.
There is no one answer to this. It will vary with the CPU, the compiler, the compiler flags, the number of local variables you have, what the CPU's been doing before you call the function, and quite possibly the phase of the moon.
Consider two extremes: if you have only one or a few local variables, they might easily be stored in registers rather than being allocated memory locations at all. If register "pressure" is sufficiently low, this may happen without executing any instructions at all.
At the opposite extreme there are a few machines (e.g., IBM mainframes) that don't have stacks at all. In this case, what we'd normally think of as stack frames are actually allocated as a linked list on the heap. As you'd probably guess, this can be quite slow.
When it comes to accessing the variables, the situation's somewhat similar -- access to a machine register is pretty well guaranteed to be faster than anything allocated in memory can possibly hope for. OTOH, it's possible for access to variables on the stack to be pretty slow -- it normally requires something like an indexed indirect access, which (especially with older CPUs) tends to be fairly slow. OTOH, access to a global (which a static is, even though its name isn't globally visible) typically requires forming an absolute address, which some CPUs penalize to some degree as well.
Bottom line: even the advice to profile your code may be misplaced -- the difference may easily be so tiny that even a profiler won't detect it dependably, and the only way to be sure is to examine the assembly language that's produced (and spend a few years learning assembly language well enough to say anything meaningful when you do look at it). The other side of this is that when you're dealing with a difference you can't even measure dependably, the chances that it'll have a material effect on the speed of real code are so remote that it's probably not worth the trouble.
It looks like the static vs non-static question has been completely covered, but on the topic of global variables: often these will slow down a program's execution rather than speed it up.
The reason is that tightly scoped variables make it easy for the compiler to optimise heavily; if the compiler has to look all over your application for instances of where a global might be used, then its optimising won't be as good.
This is compounded when you introduce pointers, say you have the following code:
int myFunction(void)
{
    SomeStruct a, b;    /* locals: the compiler knows &a and &b differ */
    FillOutSomeStruct(&b);
    memcpy(&a, &b, sizeof(a));
    return a.result;
}
the compiler knows that a and b can never overlap, so it can optimise the copy. If a and b were global pointers instead, they could point to overlapping or identical memory; this means the compiler must 'play it safe', which is slower. The problem is generally called 'pointer aliasing' and can occur in lots of situations, not just memory copies.
http://en.wikipedia.org/wiki/Pointer_alias
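For completeness, C99's restrict qualifier lets you hand that no-overlap guarantee back to the compiler even for pointer parameters; a minimal sketch:

#include <string.h>

typedef struct { int result; } SomeStruct;

/* restrict promises the compiler that dst and src never overlap,
 * recovering the optimisation it gets for free with distinct locals. */
void copyStruct(SomeStruct *restrict dst, const SomeStruct *restrict src)
{
    memcpy(dst, src, sizeof *dst);
}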
Using static variables may make a function a tiny bit faster. However, this will cause problems if you ever want to make your program multi-threaded. Since static variables are shared between function invocations, invoking the function simultaneously in different threads will result in undefined behaviour. Multi-threading is the type of thing you may want to do in the future to really speed up your code.
Most of the things you mentioned are referred to as micro-optimizations. Generally, worrying about these kinds of things is a bad idea. It makes your code harder to read and harder to maintain. It's also highly likely to introduce bugs. You'll likely get more bang for your buck doing optimizations at a higher level.
As M2tM suggests, running a profiler is also a good idea. Check out gprof for one which is quite easy to use.
You can always time your application to truly determine what is fastest. Here is what I understand: (all of this depends on the architecture of your processor, btw)
C functions create a stack frame, which is where passed parameters and local variables are put, as well as the return pointer back to where the caller called the function. There is no memory-management allocation here; it's usually a simple pointer movement and that's it. Accessing data off the stack is also pretty quick. Penalties usually come into play when you're dealing with pointers.
As for global or static variables, they're the same from the standpoint that they're going to be allocated in the same region of memory. Accessing these may use a different method of access than local variables; it depends on the compiler.
The major difference between your scenarios is memory footprint, not so much speed.
Using static variables can actually make your code significantly slower. Static variables must exist in a 'data' region of memory. In order to use that variable, the function must execute a load instruction to read from main memory, or a store instruction to write to it. If that region is not in the cache, you lose many cycles. A local variable that lives on the stack will most surely have an address that is in the cache, and might even be in a cpu register, never appearing in memory at all.
I agree with the other comments about profiling to find out stuff like that, but generally speaking, function static variables should be slower. If you want them, what you are really after is a global. Function statics insert code/data to check whether the thing has already been initialized, and that check runs every time your function is called.
Profiling may not see the difference; disassembling and knowing what to look for might.
I suspect you are only going to see a variation of a few clock cycles per loop (on average, depending on the compiler, etc.). Sometimes the change will be a dramatic improvement or dramatically slower, and that won't necessarily be because the variable's home has moved to/from the stack. Let's say you save four clock cycles per function call for 10,000 calls on a 2GHz processor. Very rough calculation: 20 microseconds saved. Is 20 microseconds a lot or a little compared to your current execution time?
You will likely get more of a performance improvement by making all of your char and short variables into ints, among other things. Micro-optimization is a good thing to know, but it takes lots of time experimenting, disassembling, and timing the execution of your code, and understanding that fewer instructions does not necessarily mean faster, for example.
Take your specific program, disassemble both the function in question and the code that calls it. With and without the static. If you gain only one or two instructions and this is the only optimization you are going to do, it is probably not worth it. You may not be able to see the difference while profiling. Changes in where the cache lines hit could show up in profiling before changes in the code for example.

C memory space and #defines

I am working on an embedded system, so memory is precious for me.
One issue that has been recurring is that I've been running out of memory space when attempting to compile a program for it. This is usually fixed by limiting the number of typedefs, etc that can take up a lot of space.
There is a macro generator that I use to create a file with a lot of #define's in it.
Some of these are simple values, others are boundary checks, e.g.:
#define SIGNAL1 (float)0.03f
#define SIGNAL1_ISVALID(value) ((value >= 0.0f) && (value <= 10.0f))
Now, I don't use all of these defines. I use some, but not actually the majority.
I have been told that they don't actually take up any memory if they are not used, but I was unsure on this point. I'm hoping that by cutting out the unused ones I can free up some extra memory (but again, I was told this is pointless).
Do unused #define's take up any memory space?
No, #defines take up no space unless they are used - #defines work like find/replace; whenever the preprocessor sees the left half, it replaces it with the right half before the code is actually compiled.
So, if you have:
float f = SIGNAL1;
The compiler will literally interpret the statement:
float f = (float)0.03f;
It will never see the SIGNAL1, it won't show up in a debugger, etc.
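You can watch this happen by running just the preprocessor (for example gcc -E file.c):

#define SIGNAL1 (float)0.03f
#define UNUSED_MACRO 12345   /* never referenced anywhere */

float f = SIGNAL1;

/* The output of gcc -E contains simply:
 *
 *   float f = (float)0.03f;
 *
 * and UNUSED_MACRO has vanished without a trace. */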
This is usually fixed by limiting the number of typedefs, etc that can take up a lot of space.
You seem somewhat confused, because typedefs do not take up space at runtime. They are merely aliases for data types. Now you may have instances of large structures (typedef'd or otherwise), but it is the instance that takes space, not the type definition. I wonder what 'etc' might cover in this statement.
Macro instances are replaced in the source code with their definition, and code is generated accordingly; an unused macro does not result in any generated code.
Things that take up space are:
Executable code (functions/member functions)
Data instantiation (including C++ object instances)
The amount of space allocated to the stack (or stacks in a multi-threaded system).
What is left is typically available for dynamic memory allocation (RAM), or is unused, or is non-volatile storage (Flash/EPROM).
Reducing memory usage is primarily a case of selecting/designing efficient data structures, using appropriate data types, and efficient code and algorithm design. It is best to target the area that will get the greatest benefit. To see the size of objects and code in your application, get the linker to generate a map file. That will tell you which are the largest functions, as well as the sizes of global and static objects.
Source file text length is not a good guide to code size. Large amounts of C code are declarative (typically header files are entirely declarative) and do not generate any memory-occupying code or data.
An embedded system does not necessarily imply small memory, so you should specify. I have worked on systems with 64MB RAM and 2MB Flash, and even that is modest compared with many systems. A typical micro-controller with on-chip resources, however, will generally have much less (especially SRAM, which takes up a lot of chip area). Also, whether your system is a Harvard or von Neumann architecture is relevant here, since in a Harvard architecture data and code spaces are separate, so we need to know which it is you are short of. In a von Neumann architecture the code/data usage is still relevant if the code is running from ROM, or if it is copied from ROM to RAM at run-time (i.e. different types of memory, even if they are in the same address space).
Clifford
Well, yes and no.
No, unused #defines won't increase the size of the resulting binary.
Yes, all #defines (whether used or unused) must be known by the compiler when building the binary.
From your question it's a bit ambiguous how you use the compiler, but it almost seems that you are trying to build directly on an embedded device; have you tried a cross-compiler? :)
Unused #defines don't take up space in the resulting executable. They do take up memory in the compiler itself whilst compiling.
