So the question is:
How to optimize function entry & exit code in a portable way for speed using GCC, plain C?
I am interested in the relevant options and techniques. My goal is writing a CPU emulator where the instruction set is decoded using call tables. I have already eliminated every function call I reasonably could, but due to the structure of the instruction set, 2-3 such calls per emulated instruction are necessary (so I cannot eliminate any more of the associated branch mispredictions either).
Based on analysing the assembly (x86, 32-bit) output, the option -fomit-frame-pointer seems worthwhile (I don't mind the lost debuggability here). Otherwise, looking over the complete emulator, it seems it could do better with smarter overall register and stack management (not saving every single register on every entry); my impression of the generated assembly is that it spends more effort shuffling the stack than doing useful work.
So the situation is basically that there are a whole lot of little functions which are called many, many times and which cannot be eliminated from the code.
I don't want to switch away from interpreting emulation, since this should be the most portable approach to the problem (more portable, anyway, than any solution that recompiles code).
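For concreteness, here is a minimal sketch of the kind of dispatch I mean (cpu_t, op_add and the other names are made up; the real table covers the full opcode space):

#include <stdint.h>

typedef struct {
    uint32_t regs[8];
    uint32_t pc;
    uint8_t  *mem;
} cpu_t;

typedef void (*op_fn)(cpu_t *);

/* Hypothetical handlers; the real instruction set needs many more. */
static void op_nop(cpu_t *c) { (void)c; }
static void op_add(cpu_t *c) { c->regs[0] += c->regs[1]; }

static op_fn op_table[256] = { [0x00] = op_nop, [0x01] = op_add };

static void run(cpu_t *c)
{
    for (;;) {
        uint8_t opcode = c->mem[c->pc++];
        op_table[opcode](c);   /* the indirect call I cannot eliminate */
    }
}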
On x86-32, the ABIs for common operating systems have standard calling conventions that use the stack to pass arguments to functions, because there are few general-purpose registers. One way to improve calls to functions that take only a few (and relatively simple) arguments is to use a different calling convention (like fastcall) that uses registers to pass the arguments. If moving to x86-64 is an option (and it should be; it's been around for ages...), the ABIs there are much better for fast function calls, because the number of general-purpose registers is doubled. A sketch of how to request this in GCC follows.
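As a sketch, GCC lets you request register passing per function on x86-32 through attributes (do_op is a made-up handler name; regparm and fastcall are real GCC x86-32 attributes, and -mregparm sets the same thing globally):

/* x86-32 GCC only: pass the first three integer arguments in
 * registers instead of on the stack. Expands to nothing elsewhere. */
#if defined(__GNUC__) && defined(__i386__)
#define FASTPARAMS __attribute__((regparm(3)))
#else
#define FASTPARAMS
#endif

FASTPARAMS int do_op(int a, int b, int c)   /* hypothetical handler */
{
    return a + b * c;
}

Note that changing the convention of an externally visible function changes its ABI, so every caller must see the attribute in the prototype.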
I apologize if this sounds trivial, but I couldn't figure out an intuitive way to google it. Why are some kernel actions, like saving the current state of the registers and the stack (just to mention a few), written in assembly? Why can't they be written in C, since after all, presumably, when compilation is done all we get is object code? Besides, when you use OllyDbg, you notice that before a function call (in C) the current state of the registers is pushed onto the stack.
When writing an OS, the main goal is to maintain the highest level of abstraction so the code is reusable across different architectures, but in the end the architecture is inevitably there.
Each machine performs its very low-level functions in such a specialized way that no general-purpose programming language can express them.
Task switching, bus control, and device interrupt handling, to name just a few, cannot be coded efficiently in a high-level language (consider the exact instruction sequences, the registers involved, and possibly critical CPU timings and priority levels).
On the other hand, it is not even convenient to use mixed programming, i.e. inline assembler, because the resulting module would no longer be abstract, containing architecture-specific code that cannot be reused.
The common solution is to write all code at the highest abstraction level and confine the specialized code to a few modules. These routines, written entirely in assembly, are normally well defined in terms of the input they take and the output they produce, so the programmer can produce the same results on different architectures.
Compiling for a different CPU is then done by simply switching the set of assembly routines.
C gives you no assurance that it modifies the registers you need to modify.
C just implements the logic you write in your code; the language's interpretation will be what you expect, while completely hiding the details behind that interpretation.
If you need logic like "set register X to a given value" or "move data from register X to register Y", as is sometimes necessary in a kernel, that kind of logic is simply not defined by the C language.
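For instance, here is a minimal sketch using GCC extended asm on x86 (write_cr3 is a made-up helper name): loading the page-table base register has no C expression, so the kernel wraps the one instruction in asm:

#include <stdint.h>

/* There is no way in portable C to say "put this value in CR3",
 * so a single inline-asm statement does it. Runs only in ring 0. */
static inline void write_cr3(uintptr_t value)
{
    __asm__ volatile ("mov %0, %%cr3" : : "r"(value) : "memory");
}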
C is a generic high-level language, not specific to one target. But at the kernel level there are target-specific things you need to do that the C language simply cannot express: enabling an interrupt, configuring an MMU, or configuring something to do with the protection mechanism. On some targets these items and others are configured using registers in the address space, but on others specific assembly-language instructions are required, so C cannot be used; it has to be assembly. There is usually at least one thing per target you have to use assembly for, if not many.
Sometimes it is simply a case of wanting the correct instruction to be used: for example, when some operation requires a 32-bit store, you use asm to ensure exactly that rather than hope the compiler gets it right.
There is no C equivalent for "return from exception". The C compiler cannot translate everything that assembly can express. For instance, if you write an operating system, the interrupt service routine needs a special return that goes back to where the interrupt occurred; the C compiler cannot express that functionality, so it can only be written in assembly.
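As a sketch (x86-32, GCC file-scope asm, AT&T syntax; isr_stub and isr_handler are hypothetical names), an interrupt entry saves the registers, calls into C, and must end with iret, none of which C itself can express:

void isr_handler(void);   /* the actual work, written in C */

__asm__(
    ".globl isr_stub\n"
    "isr_stub:\n\t"
    "pusha\n\t"             /* save general-purpose registers */
    "call isr_handler\n\t"
    "popa\n\t"              /* restore them */
    "iret\n"                /* return from interrupt: no C equivalent */
);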
See also Is assembly strictly required to make the "lowest" part of an operating system?
Context switching is critical and needs to be really, really fast, which is why it should not be written in a high-level language.
Let's say I have a function with two parameters that is called repeatedly. Does it increase memory usage when you have functions with arguments?
Would it be faster to generate a function for each repetitive case, and call that function with no parameters?
I believe this is sometimes referred to as 'internal state', but my question is: which of the two options will perform faster?
EDIT>>>>>>>>
Your answers are all enlightening, allow me to clarify all at once.
It seems logical that
x = x + 10
would be faster than:
x = x + y
And I'm not talking about the time it takes to define and initialize y; I am talking only about the operation itself. Logically, in the second case there must be some extra step in which the CPU has to find y before performing the operation. When you amplify this with functions and then multiply it over and over, I would assume this would make a significant difference.
And yes, in my case it applies to physics, so the speed difference will likely be felt.
PS: I am very interested in compiler internals and am debating learning assembler.
Parameters are typically passed on the stack, so they don't take up any extra memory beyond transient stack space.
Passing parameters may be imperceptibly slower because the values may have to be copied onto the stack (it depends on how good the compiler is at optimizing).
The compiler is way smarter than you are, so don't try to outsmart the compiler. Write clear code and let the compiler worry about performance.
re: your edit
"it depends"
Does your processor have a different instruction to add 10 to a variable?
What sort of addressing modes does it support?
Regardless of the answers to the above, does the compiler make use of all the processor's features that might squeeze out every last drop of performance?
e.g. - The good old 68000 chips had an "INC" opcode to increment a register by 1, which was much faster than other methods. If you were hand-rolling assembly, the fastest way to do x = x + 10 might have been to execute INC 10 times...
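For what it's worth, on today's x86 both statements from the question typically compile to a single add instruction, one with an immediate operand and one with a register operand, at essentially the same cost; a sketch:

int add_const(int x)      { return x + 10; }  /* one add, immediate operand */
int add_var(int x, int y) { return x + y;  }  /* one add, register operand  */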
I've worked on time-constrained real-time embedded apps and never had to worry about this level of optimization. I'd write the code and worry about performance if/when it becomes an issue.
If the repetitive call is made with compile-time parameters, then you can indeed improve performance by "instantiating" a special version of the function for each given set of compile-time parameters. In such cases the function will not even have a "state": the parameter values are essentially embedded into the function code. Some compilers can do this implicitly.
The amount of improvement will depend on the nature of the function. In each given version of the function, entire blocks of code might be easily recognized as unreachable and eliminated entirely. One can also say that function inlining by nature involves the same kind of optimization.
Obviously, using such optimizations thoughtlessly might easily lead to a combinatorial explosion of the number of different versions of the same function (i.e. to code bloat), so it should be used with care.
(BTW, this is very similar to what C++ function templates with non-type template parameters do.)
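A sketch of such manual instantiation in plain C (all names are made up); each macro expansion bakes the parameter into the generated function, much like a non-type template parameter would:

/* Generic version: 'step' is a run-time parameter. */
static int scale_generic(int x, int step) { return x * step; }

/* One specialized version per compile-time value of STEP. */
#define DEFINE_SCALE(STEP) \
    static int scale_##STEP(int x) { return x * (STEP); }

DEFINE_SCALE(2)    /* defines scale_2(); x * 2 can become a shift */
DEFINE_SCALE(10)   /* defines scale_10() */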
If the repetitive call is made with run-time parameters, then pre-saving them in a run-time state might not achieve any significant improvement. Retrieving parameter values from some "state" is not necessarily more efficient than retrieving them from the "regular" function parameters.
Of course, there are classic techniques such as packing multiple function parameters into a struct object and passing that struct object to the function (instead of passing a large number of independent parameters). If the parameters remain unchanged between multiple calls, this does improve overall performance by saving time on parameter preparation. But whether to call such a struct object a "state" is a different question; it is definitely a manual technique, not something done by the compiler through any "internal state".
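A sketch of that technique (params_t and all other names here are hypothetical): the struct is filled once, and a single pointer is passed on every call:

typedef struct {
    double dt, gravity, drag;   /* fixed for the duration of a frame */
} params_t;

static void step_particle(double *pos, double *vel, const params_t *p)
{
    *vel += (p->gravity - p->drag * *vel) * p->dt;
    *pos += *vel * p->dt;
}

void run_frame(double *pos, double *vel, int n)
{
    params_t p = { 0.016, -9.81, 0.05 };   /* prepared once */
    for (int i = 0; i < n; i++)
        step_particle(&pos[i], &vel[i], &p);
}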
Does it increase the memory usage when you have functions with arguments?
No; function arguments are passed on the stack (or in registers, as in the x64 calling convention).
Would it be faster to generate a function for each repetitive case, and call that function with no parameters?
No, your compiler should optimize it for you; there's no need to make your code less readable.
In many debates about the inline keyword in function declarations, someone will point out that it can actually make your program slower in some cases, mostly due to code explosion, if I am correct. I have never met such an example in practice myself. What is an actual piece of code where the use of inline can be expected to be detrimental to performance?
Exactly 10 years and one day ago I did this commit in OpenBSD:
http://www.openbsd.org/cgi-bin/cvsweb/src/sys/arch/amd64/include/intr.h.diff?r1=1.3;r2=1.4
The commit message was:
deinline splraise, spllower and setsoftint.
Makes the kernel smaller and faster.
deraadt# ok
As far as I remember, the kernel binary shrank by more than 100kB, not a single test case could be produced that became slower, and several macro benchmarks (like compiling the kernel) were measurably faster (5-10% if I recall correctly, but don't quote me on that).
Around the same time I went on a quest to actually measure inline functions in the OpenBSD kernel. I found a few that had minimal performance gains, but the majority had no measurable impact, and several were making things much slower and were killed. At least one more uninlining had a huge impact: the internal malloc macros (where the idea was to inline malloc when it had a size known at compile time) and the packet buffer allocators. Uninlining those shrank the kernel by 150kB and brought a significant performance improvement.
One could speculate, although I have no proof, that this is because the kernel is large and we're struggling to stay inside the cache when executing system calls and every little bit helps. So what actually helped in those cases was just the shrinking of the binary, not the number of instructions executed.
Imagine a function that has no parameters but does intensive computation, with a considerable number of intermediate values or heavy register usage. Then inline that function into code that itself has a considerable number of intermediate values or heavy register usage.
Having no parameters makes the call itself lightweight, because no time-consuming stack operations are required.
When the function is inlined, the compiler has to save many registers and spill others for the inlined body to use, reproducing the register- and data-saving work of a function call, possibly in a worse way.
If those save operations are more expensive, in time and machine cycles, than the function-call mechanism itself, especially if the function is called extensively, then you get a detrimental effect.
This seems to be the case for some specific functions heavily used in an OS.
As far as I know, the -msse and -msse2 options of GCC improve performance by performing arithmetic operations faster. I have also read somewhere that they use more resources, such as registers and cache.
What about performance if we use an executable generated with these options on RTOS devices (like a VxWorks board)?
The OS must support SSE(2) instructions for your application to work correctly. It would seem, from googling, that VxWorks supports this (and it's not really that hard: all it takes is for the OS to keep a 512-byte save area per task that uses SSE/SSE2; given the right circumstances it can be allocated on demand, but it's often easier to just allocate it for all tasks). Saving/restoring the SSE registers is done "on demand"; that is, only when a task different from the previous SSE user executes SSE instructions is it necessary to save the registers. The OS uses a special interrupt (trap) to indicate that "a new task is trying to use SSE instructions".
So, as long as the processor supports it, you should be fine.
I may not be able to directly answer your question, but here are a couple things I do know that may be of use:
SSE, SSE2, etc. must be supported/implemented by the processor for them to have any effect in the first place.
There are specific functions you can call that use these extended instructions for mathematical operations. These functions operate on wider data types or perform an operation on a whole set of values efficiently.
Enabling the options in GCC may cause it to use these APIs/builtins automatically. This is the part I am unsure about.
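On the middle point: the usual way to invoke these instructions explicitly is through the compiler intrinsics. A minimal sketch using the SSE intrinsics from <xmmintrin.h> (compile with -msse; add4 is a made-up name), which adds four floats in one instruction:

#include <xmmintrin.h>

/* dst[i] = a[i] + b[i] for i = 0..3, using one SSE addps. */
void add4(float *dst, const float *a, const float *b)
{
    __m128 va = _mm_loadu_ps(a);   /* load 4 floats, unaligned */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(dst, _mm_add_ps(va, vb));
}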
In computer literature it is generally recommended to write functions as short as possible. I understand this may increase readability (although not always), and such an approach also provides more flexibility. But does it have anything to do with optimization as well? I mean, does it matter to a compiler whether it compiles a bunch of small routines rather than a few large routines?
Thanks.
That depends on the compiler. Many older compilers only optimized a single function at a time, so writing larger functions (up to some limit) could improve optimization -- but (with most of them) exceeding that limit turned optimization off completely.
Most reasonably current compilers can generate inline code for functions (C99 added the inline keyword to facilitate that) and do global (cross-function) optimization, in which case it normally makes no difference at all.
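For example, the common C99 pattern is a small function defined in a header so the compiler can expand it at every call site (min_int is a made-up name):

/* In a header: visible, and inlinable, in every translation unit. */
static inline int min_int(int a, int b)
{
    return a < b ? a : b;
}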
@twain249 and @Jerry are both correct; breaking a program into multiple functions can have a negative effect on performance, but it depends on whether or not the compiler can optimize the functions into inline code.
The only way to know for sure is to examine the assembler output of your program and do some profiling. For example, if you know a particular code path is causing a performance problem, you can look at the assembler, and see how many functions are getting called, how many times parameters are being pushed onto the stack, etc. In that case, you may want to consolidate small functions into one larger one.
This has been a concern for me in the past: doing very tight optimization for embedded projects, I have consciously tried to reduce the number of function calls, especially in tight loops. However, this does produce ungainly functions, sometimes several pages long. To mitigate the maintenance cost, you can use macros, which I have leveraged heavily and successfully to make sure there are no function calls while preserving readability; a sketch follows.
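A sketch of that macro approach (all names are hypothetical): the loop body stays readable, and by construction there is no call overhead:

/* Statement-like macro: expands in place, so no call is made.
 * The do { } while (0) wrapper makes it behave as one statement. */
#define CLAMP_ADD(sum, x, limit)                  \
    do {                                          \
        (sum) += (x);                             \
        if ((sum) > (limit)) (sum) = (limit);     \
    } while (0)

int accumulate(const int *in, int n, int limit)
{
    int sum = 0;
    for (int i = 0; i < n; i++)
        CLAMP_ADD(sum, in[i], limit);   /* no function-call overhead */
    return sum;
}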