in computer literature it is generally recommended to write short functions as much as possible. I understand it may increase readability (although not always), and such approach also provides more flexibility. But does it have something to do with optimization as well? I mean -- does it matter to a compiler to compile a bunch of small routines rather than a few large routines?
Thanks.
That depends on the compiler. Many older compilers only optimized a single function at a time, so writing larger functions (up to to some limit) could improve optimization -- but (with most of them) exceeding that limit turned optimization off completely.
Most reasonably current compilers can generate inline code for functions (and C99 added the ineline keyword to facilitate that) and do global (cross-function) optimization, in which case it normally makes no difference at all.
#twain249 and #Jerry are both correct; breaking a program into multiple functions can have a negative effect on performance, but it depends on whether or not the compiler can optimize the functions into inline code.
The only way to know for sure is to examine the assembler output of your program and do some profiling. For example, if you know a particular code path is causing a performance problem, you can look at the assembler, and see how many functions are getting called, how many times parameters are being pushed onto the stack, etc. In that case, you may want to consolidate small functions into one larger one.
This has been a concern for me in the past: doing very tight optimization for embedded projects, I have consciously tried to reduce the number of function calls, especially in tight loops. However, this does produce ungainly functions, sometimes several pages long. To mitigate the maintenance cost of this, you can use macros, which I have leveraged heavily and successfully to make sure there are no function calls while at the same time preserving readability.
Related
Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 4 years ago.
Improve this question
I am working with embedded systems(Mostly ARM Cortex M3-M4) using C language.And was wondering what are the advantages/ disadvantages of dividing the task into many functions such as;
void Handle_Something(void)
{
// do Task-1
// do Task-2
// do Task-3
//etc.
}
to
void Hanlde_Something(void)
{
// Handle_Task1();
// Handle_Tasl2();
//etc.
}
How are these two approaches can be examined with respect to stack usage and overall processing speed, and which is safer/better for what reason ? (You can assume this is outside of ISR)
From what I know, Memory in the stack is allocated/deallocated for local variables in each call/return cycle, thus dividing the task seems reasonable in terms of memory usage but when doing this, I sometimes get Hardfaults from different sources(mostly Bus or Undefined Instruction errors.) that I couldn't figured out why.
Also, working speed is very crucial for many applications in my field, so ı do need to know which methode provides faster responses.
I would appreciate an enlightment. Thanks everybody in advance
This is what's known as "pre-mature optimization".
In the old days when compilers where horrible, they couldn't inline functions by themselves. So in the old days a keyword inline was added to C - similar non-standard versions also existed before the year 1999. This was used to tell a bad compiler how it should generate code better.
Nowadays this is mostly history. Compilers are better than programmers at determining what and when to inline. They may however struggle when a called function is located in a different "translation unit" (basically, in a different .c file). But in your case I take it this is not the case, but Handle_Task1() etc can be regarded as functions in the same file.
With the above in mind:
How are these two approaches can be examined with respect to stack usage and overall processing speed
They are to be regarded as identical. They use the same stack space and take the same time to execute.
Unless you have a bad, older compiler - in which case function calls always take extra space and execution time. Since you are working with modern MCU:s, this should not be the case, or you desperately need to get a better compiler.
As a rule of thumb, it is always better practice to split up larger functions in several smaller, for the sake of readability and maintenance. Even in hard real-time systems, there exist very few cases where function call overhead is an actual bottleneck even when bad compilers are used.
Memory on the stack isn't allocated/reallocated from some complex memory pool. The stack pointer is simply increased/decreased. An operation that is basically free in all but the tightest/smallest loops imaginable (and those will probably be optimized by the compiler).
Don't group together functions because they could reuse variables e.g. don't create a bunch of int tempInt; long tempLong; variables you use throughout your entire program. A variable should serve only a single purpose and its scope should be kept as tight as possible. Also see: is it good or bad to reuse the variables?
Expanding on that, keeping the scope of all variables as local as possible might even cause your compiler to keep the variables in a cpu register only. A shortly used variable might actually never be allocated!
Try to limit functions to only a singly purpose and try avoiding side effects: if you avoid global variables a function becomes easier to test, optimize and understand as each time you call it with the exact same set of arguments it will preform the exact same action. Have a look at: Why are global variables bad, in a single threaded, non-os, embedded application
Each solution has advantages and disadvantages.
The first approach allows to execute the code (a priori) faster, because the asm code won't have instructions related to jumps. However, you have to take the readability into account, in terms of mixing different kind of functionalities in the same function (or creating large functions, which is not a good idea from the guidelines point of view).
The second solution could be easier to understand, because each function contains a simple task, furthermore it is easier for documenting (that is, you dont have to explain different "purposes" in the same function). As I said, this
solution is slower, because your "scheduler" contains jumps, nevertheless you could declare the simple tasks as inline, given that you can split the code in several simple tasks with a proper documentation and the compiler will generate an assembler as the first approach, that is, avoiding the jumps.
Another point is the use of memory. If your simple tasks are being called from different parts of the code, the first solution and the second solution with inline are worse (in terms of memory) than the second solution without inline, because the function is added as many times as it is called from different parts of your code.
Working with modules is always more efficient in terms of error handling, debugging and re-reading. Considering some heavy working libraries (SLAM, PCL etc.) as functions, they are used as external functions and they don't cause a significant loss of performance(tbh sometimes it's almost impossible to embed such large functions into your code). You may face slightly higher stack use as #Colin commented.
Ignoring modularity and readability, what are the effects of having large functions in terms of performance against many sub divided functions? (C Language in general).
Large function probably has a small performance gain over many small functions due to less function calls. But my general rule is: let the compiler deal with optimizations and concentrate on functionality and security.
Functions are an important part of code organization in any programming language. Although, performance wise, having a single large function would decrease the use of the function calls and hence fewer stack jumps and leading to a better performing code. But, not every project has the luxury of not being modular and having code that's unreadable or worse confusing or misleading. Over time, the cost of the project with large functions will be far greater than a project with small functions, in terms of maintenance, refactoring, feature enhancements,etc.Again, a function is big only when analyzed in a certain context and there can be some situations where a big function could not be broken down into smaller pieces and its totally acceptable, as long as it is well-designed and simple.
Remember the first rule of writing a function: do one thing and one thing well.
In C programming For Each Function Call a Stack Frame is Created So in case of single function there will be only one stack and there is no need of stack jump but in case of many sub divided functions, each functions will be having a separate stack and for each function call there will be stack jump so the performance may be reduced depending upon the compiler optimization.
In many debates about the inline keyword in function declarations, someone will point that it can actually make your program slower in some cases – mostly due to code explosion, if I am correct. I have never met such an example in practice myself. What is an actual code where the use of inline can be expected to be detrimental to the performance?
Exactly 10 years and one day ago I did this commit in OpenBSD:
http://www.openbsd.org/cgi-bin/cvsweb/src/sys/arch/amd64/include/intr.h.diff?r1=1.3;r2=1.4
The commit message was:
deinline splraise, spllower and setsoftint.
Makes the kernel smaller and faster.
deraadt# ok
As far as I remember the kernel binary shrunk by more than 100kB and not a single test case could be produced that became slower and several macro benchmarks (like compiling the kernel) were measurably faster (5-10% if I recall correctly, but don't quote me on that).
Around the same time I went on a quest to actually measure inline functions in the OpenBSD kernel. I found a few that had minimal performance gains, but the majority had 0 measurable impact and several were making things much slower and were killed. At least one more uninlining had a huge impact and that one was the internal malloc macros (where the idea was to inline malloc if it had a size known at compile time) and packet buffer allocators that shrunk the kernel by 150kB and had a significant performance improvement.
One could speculate, although I have no proof, that this is because the kernel is large and we're struggling to stay inside the cache when executing system calls and every little bit helps. So what actually helped in those cases was just the shrinking of the binary, not the number of instructions executed.
Imagine a function that have no parameters, but intensive computation with a consistent number of intermediate values or register usage. Then Inline that function in code having a consistent number of intermediate values or register usage too.
Having no parameters make the call procedure more lightweight because no stack operations, that are time consuming, are required.
When inlined the compiler have to save many registers, and spill other to be used with the new function, reproducing the process of registers and data backup required for a function call possibly in worst way.
If the backup operations are more expansive, in terms of time and machine cycles, compared with the mechanism of function call, especially if the function is extensively called, then you have a detrimental effect.
This seems to be the case of some specific functions largely used in an OS.
I have programmed an embedded software (using C of course) and now I'm considering ways to improve the running time of the system. The most important single module in my system is one very large nested for loop module.
That module consists of two nested for loops that loops max 122500 times. That's not very much yet, but the problem is that inside that nested for loop I have a function call to a function that is in another source file. That specific function consists mostly of two another nested for loops which loops always 22500 times. So now I have to make a function call 122500 times.
I have made that function that is to be called a lot lighter and shorter (yet still works as it should) and now I started to think that would it be faster to rip off that function call and write that process directly inside those first two for loops?
The processor in that system is ARM7TDMI and its frequency is 55MHz. The system itself isn't very time critical so it doesn't have to be real time capable. However the faster it can process its duties the better.
Also would it be also faster to use while loops instead of fors? And any piece of advice about how to improve the running time is appreciated.
-zaplec
TRY IT AND SEE!!
It'll almost certainly make a difference. Function call overhead isn't usually that much of an issue, but at over 100K repetitions it starts to add up.
...But whether or not it makes any real-world difference is something only you can answer, after trying it and timing the results.
As for for vs while... it shouldn't matter unless you actually change the behavior when changing the loop. If in doubt, make your compiler spit out assembler code for both and compare... or just change it and time it.
You need to be careful in the optimizations you make because you aren't always clear on which optimizations the compiler is making for you. Pre-optimization is a common mistake people make. Is it important that your code is readable and easily maintained or slightly faster? Like others have suggested, the best approach is to benchmark the different ways and see if there is a noticeable difference.
If you don't believe your compiler does much in the way of optimization I would look at some older concepts in optimizing C (searches on SO or google should provide some good links).
The ARM processor has an instruction pipeline (cache). When the processor encounters a branch (call) instruction, it must clear the pipeline and reload, thus wasting some time. One objective when optimizing for speed is to reduce the number of reloads to the instruction pipeline. This means reducing branch instructions.
As others have stated in SO, compile your code with optimization set for speed, and profile. I prefer to look at the assembly language listing as well (either printed from the compiler or displayed interwoven in the debugger). Use this as a baseline. If you can't profile, you can use assembly instruction counting as a rough estimate.
The next step is to reduce the number of branches; or the number times a branch is taken. Unrolling loops helps to reduce the number of times a branch is taken. Inlining helps reduce the number of branches. Before applying this fine-tuning techniques, review the design and code implementation to see if branches can be reduced. For example, reduce the number of "if" statements by using Boolean arithmetic or using Karnaugh Maps. My favorite is reducing requirements and eliminating code that doesn't need to be executed.
In the code implementation, move code that doesn't change outside of the for or while loops. Some loops may be reduce to equations (example, replacing a loop of additions with a multiplication). Also, reduce the quantity of iterations, by asking "does this loop really need to be executed this many times").
Another technique is to optimize for Data Oriented Design. Also check this reference.
Just remember to set a limit for optimizing. This is where you decide any more optimization is not generating any ROI or customer satisfaction. Also, apply optimizations in stages; which will allow you to have a deliverable when your manager asks for one.
Run a profiler on your code. If you are just guessing at where you are spending your time, you are probably wrong. A profiler will show what function is taking the most time and you can focus on that. You could be doing something in the function that takes longer than the function call itself. Did you look to see if you can change floating operations to integer, or integer math to shifts? You can spend a lot of time fiddling with things that don't make much difference. Run a profiler on your code and know for sure that the things you are changing will make a difference.
For function vs. inline, unfortunately there is no easy answer. I.e. it depends. See this FAQ. For "for" vs. "while", I wouldn't think there is any significant difference in performance.
In general, a function call should have more overhead than inlining. You really should profile however, as this can be affected quite a bit by your compiler (especially the compile/optimization settings). Some compilers will automatically inline code for example.
I'm looking to see what can a programmer do in C, that can determine the performance and/or the size of the generated object file.
For e.g,
1. Declaring simple get/set functions as inline may increase performance (at the cost of a larger footprint)
2. For loops that do not use the value of the loop variable itself, count down to zero instead of counting up to a certain value
etc.
It looks like compilers now have advanced to a level where "simple" tricks (like the two points above) are not required at all. Appropriate options during compilation do the job anyway. Heck, I also saw posts here on how compilers handle recursion - that was very interesting! So what are we left to do at a C level then? :)
My specific environment is: GCC 4.3.3 re-targeted for ARM architecture (v4). But responses on other compilers/processors are also welcome and will be munched upon.
PS: This approach of mine goes against the usual "code first!, then benchmark, and finally optimize" approach.
Edit: Just like it so happens, I found a similar post after posting the question: Should we still be optimizing "in the small"?
One thing I can think of that a compiler probably won't optimize is "cache-friendliness": If you're iterating over a two-dimensional array in row-major order, say, make sure your inner loop runs across the column index to avoid cache thrashing. Having the inner loop run over the wrong index can cause a huge performance hit.
This applies to all programming languages, but if you're programming in C, performance is probably critical to you, so it's especially relevant.
"Always" know the time and space complexity of your algorithms. The compiler will never be able to do that job as well as you can. :)
Compilers these days still aren't very good at vectorizing your code so you'll still want to do the SIMD implementation of most algorithms yourself.
Choosing the right datastructures for your exact problem can dramatically increase performance (I've seen cases where moving from a Kd-tree to a BVH would do that, in that specific case).
Compilers might pad some structs/ variables to fit into the cache but other cache optimizations such as the locality of your data are still up to you.
Compilers still don't automatically make your code multithreaded and using openmp, in my experience, doesn't really help much. (You really have to understand openmp anyway to dramatically increase performance). So currently, you're on your own doing multithreading.
To add to what Martin says above about cache-friendliness:
reordering your structures such that fields which are commonly accessed together are in the same cache line can help (for instance by loading just one cache line rather than two.) You are essentially increasing the density of useful data in your data cache by doing this. There is a linux tool which can help you in doing this: dwarves 1. http://www.linuxinsight.com/files/ols2007/melo-reprint.pdf
you can use a similar strategy for increasing density of your code. In gcc you can mark hot and cold branches using likely/unlikely tags. That enables gcc to keep the cold branches separately which helps in increasing the icache density.
And now for something completely different:
for fields that might be accessed (read and written) across CPUs, the opposite strategy makes sense. The trouble is that for coherence purposes only one CPU can be allowed to write to the same address (in reality the same cacheline.) This can lead to a condition called cache-line ping pong. This is pretty bad and could be worse if that cache-line contains other unrelated data. Here, padding this contended data to a cache-line length makes sense.
Note: these clearly are micro-optimizations, to be done only at later stages when you are trying to wring the last bits of performance from your code.
PreComputation where possible... (sorry but its not always possible... I did extensive precomputation on my chess engine.) Store those results in memory, keeping cache in mind.. the bigger the size of precomputation data in memory the lesser is the chance of doing a cache hit. Since most of recent hardware is multicore you can design your application to target it.
if you are using several big arrays make sure you group them close to each other on where they would be used, boosting cache hits
Many people are not aware of this: Define an inline label (varies by compiler) which means inline, in its intent - many compilers place the keyword in an entirely different context from the original meaning. There are also ways to increase the inline size limits, before the compiler begins popping trivial things out of line. Human directed inlining can produce much faster code (compilers are often conservative, or do not account for enough of the program), but you need to learn to use it correctly, because it can (easily) be counterproductive. And yes, this absolutely applies to code size as well as speed.