How can inlining in C actually be slowing down this program?

How can inlining in C actually be slowing down this program? - c

In order to force a function to not be inlined that was consuming 46% of the runtime, I used __attribute__((noinline)) on the it and compiled the code with gcc -Wall -Winline -O2(these plus -g are what is used by the Makefile - I also see roughly the same effect when using -g as well) using gcc 4.5.2. I found that the program with the non-inlined function is more than 20% faster than the original. Does anyone know why this might be?
Let me provide some more details. The program that this occurred in is the latest version of the compression utility bzip2 for Linux. The key function ( generateMTFValues found in compress.c) in the program is the one that does the Move To Front transform. This function is only called by one function in the program.
Does anyone have any idea why the program runs faster in this case by forcing the compiler not to inline this function? The function only takes one parameter - a pointer to a struct that contains all of the block and compression info. Also, it only calls one other function which doesn't really consume any substantial processing time.

It can slow down the program, because the resulting code is larger and can lead to more misses of the CPU's instruction cache.

This is a complete WAG (Wild Ass Guess) based on near-perfect ignorance.
Could it be that for the inline version the optimizer is really busy juggling which values are in which registers and when? If that's the case, the procedure call version may give it room to devote more registers to what is happening in the loop.
As I said, just a WAG.

Related

Tell C to inline function but still have it available to call debugger

Is there anyway to mark a function as inline, but still have at available to call at the debugger? All of my functions I would like to call are marked as static inline, because we are only allowed to expose certain functions in our file. I'm using gcc.

-ginline-points could help:
Generate extended debug information for inlined functions. Location view tracking markers are inserted at inlined entry points, so that address and view numbers can be computed and output in debug information. This can be enabled independently of location views, in which case the view numbers won’t be output, but it can only be enabled along with statement frontiers, and it is only enabled by default if location views are enabled.

In-lined functions have no return instruction, so even if you had the address of the start of the in-lined function, calling it from the debugger would execute the code that follows the in-lining, of which it would almost certainly not have a suitable stack frame.
It is not usual, and certainly not easy do debug optimised code in any case. Normally one would simply switch optimisation off for debugging - in GCC at least, the inline keyword is ignored at -O0.

This is one of the issues when optimizing the code. You need to lower the optimizations a little bit (Usual recommendations in CMake for instance is to use -O2 instead of -O3) and add -fno-omit-frame-pointer to the command line (it will make code slower, as it allocates a register to keep track on the stack frame pointer during function calls).
On compilers like ICC, you can have even more debug info by using -debug all.

Is it possible to disable gcc/g++ optimisations for specific sections of code?

I am compiling some code that works without optimisation but breaks with optimisations enabled. I suspect certain critical sections of the code of being optimised out, resulting in the logic breaking.
I want to do something like:
code...
#disable opt
more code...
#enable opt
Even better if I can set the level of optimisation for that section (like O0, O1...)
For those suggesting it is the code:
The section of the code being deleted is (checked by disassembling the object file):
void wait(uint32_t time)
{
while (time > 0) {
time--;
}
}
I seriously doubt there is something wrong with that code

If optimization causes your program to crash, then you have a bug and should fix it. Hiding the problem by not optimizing this portion of code is poor practice that will leave your code fragile, and its like leaving a landmine for the next developer who supports your code. Worse, by ignoring it, you will not learn how to debug these problems.
Some Possible Root Causes:
Hardware Accesses being optimized out: Use Volatile
It is unlikely that critical code is being optimized out, although if you are touching hardware registers then you should add the volatile attribute to force the compiler to access those registers regardless of the optimization settings.
Race Condition: Use a Mutex or Semaphore to control access to shared data
It is more likely that you have a race condition that is timing specific, and the optimization causes this timing condition to show itself. That is a good thing, because it means you can fix it. Do you have multiple threads or processes that access the same hardware or shared data? You might need to add a mutex or semaphore to control access to avoid timing problems.
Heisenbug: This is when the behavior of code changes based on whether or not debug statements are added, or whether the code is optimized or not. There is a nice example here where the optimized code does floating point comparisons in registers in high precision, but when printf's are added, then the values are stored as doubles and compared with less precision. This resulted in the code failing one way, but not the other. Perhaps that will give you some ideas.
Timing Loop gets Optimized Out: Creating a wait function that works by creating a timing loop that increments a local variable in order to add a delay is not good programming style. Such loops can be completely optimized out based on the compiler and optimization settings. In addition, the amount of delay will change if you move to a different processor. Delay functions should work based on CPU ticks or real-time, which will not get optimized out. Have the delay function use the CPU clock, or a real time clock, or call a standard function such as nanosleep() or use a select with a timeout. Note that if you are using a CPU tick, be sure to comment the function well and highlight that the implementation needs to be target specific.
Bottom line: As others have suggested, place the suspect code in a separate file and compile that single source file without optimization. Test it to ensure it works, then migrate half the code back into the original, and retest with half the code optimized and half not, to determine where your bug is. Once you know which half has the Heisenbug, use divide and conquer to repeat the process until you identify the smallest portion of code that fails when optimized.
If you can find the bug at that point, great. Otherwise post that fragment here so we can help debug it. Provide the compiler optmization flags used to cause it to fail when optimized.

You can poke around the GCC documentation. For example:
Function specific option pragmas
#pragma GCC optimize ("string"...)
This pragma allows you to set global optimization options for functions defined later in the source file. One or more strings can be specified. Each function that is defined after this point is as if attribute((optimize("STRING"))) was specified for that function. The parenthesis around the options is optional. See Function Attributes, for more information about the optimize attribute and the attribute syntax.
See also the pragmas for push_options, pop_options and reset_options.
Common function attributes
optimize The optimize attribute is used to specify that a function is to be compiled with different optimization options than specified on the command line. Arguments can either be numbers or strings. Numbers are assumed to be an optimization level. Strings that begin with O are assumed to be an optimization option, while other options are assumed to be used with a -f prefix. You can also use the '#pragma GCC optimize' pragma to set the optimization options that affect more than one function. See Function Specific Option Pragmas, for details about the '#pragma GCC optimize' pragma.
This attribute should be used for debugging purposes only. It is not suitable in production code.
Optimize options
This page lists a lot of optimization options that can be used with the -f option on the command line, and hence with the optimize attribute and/or pragma.

The best thing you can do is to move the code you do not want optimized into a separate source file. Compile that without optimization, and link it against the rest of your code.
With GCC you can also declare a function with __attribute__((optimize("O0")) to inhibit optimization.

Can I run GCC as a daemon (or use it as a library)?

I would like to use GCC kind of as a JIT compiler, where I just compile short snippets of code every now and then. While I could of course fork a GCC process for each function I want to compile, I find that GCC's startup overhead is too large for that (it seems to be about 50 ms on my computer, which would make it take 50 seconds to compile 1000 functions). Therefore, I'm wondering if it's possible to run GCC as a daemon or use it as a library or something similar, so that I can just submit a function for compilation without the startup overhead.
In case you're wondering, the reason I'm not considering using an actual JIT library is because I haven't found one that supports all the features I want, which include at least good knowledge of the ABI so that it can handle struct arguments (lacking in GNU Lightning), nested functions with closure (lacking in libjit) and having a C-only interface (lacking in LLVM; I also think LLVM lacks nested functions).
And no, I don't think I can batch functions together for compilation; half the point is that I'd like to compile them only once they're actually called for the first time.
I've noticed libgccjit, but from what I can tell, it seems very experimental.

My answer is "No (you can't run GCC as a daemon process, or use it as a library)", assuming you are trying to use the standard GCC compiler code. I see at least two problems:
The C compiler deals in complete translation units, and once it has finished reading the source, compiles it and exits. You'd have to rejig the code (the compiler driver program) to stick around after reading each file. Since it runs multiple sub-processes, I'm not sure that you'll save all that much time with it, anyway.
You won't be able to call the functions you create as if they were normal statically compiled and linked functions. At the least you will have to load them (using dlopen() and its kin, or writing code to do the mapping yourself) and then call them via the function pointer.
The first objection deals with the direct question; the second addresses a question raised in the comments.

I'm late to the party, but others may find this useful.
There exists a REPL (read–eval–print loop) for c++ called Cling, which is based on the Clang compiler. A big part of what it does is JIT for c & c++. As such you may be able to use Cling to get what you want done.
The even better news is that Cling is undergoing an attempt to upstream a lot of the Cling infrastructure into Clang and LLVM.
#acorn pointed out that you'd ruled out LLVM and co. for lack of a c API, but Clang itself does have one which is the only one they guarantee stability for: https://clang.llvm.org/doxygen/group__CINDEX.html

Find unused functions in a C project by static analysis

I am trying to run static analysis on a C project to identify dead code i.e functions or code lines that are never ever called. I can build this project with Visual Studio .Net for Windows or using gcc for Linux. I have been trying to find some reasonable tool that can do this for me but so far I have not succeeded. I have read related questions on Stack Overflow i.e this and this and I have tried to use -Wunreachable-code with gcc but the output in gcc is not very helpful. It is of the following format
/home/adnan/my_socket.c: In function ‘my_sockNtoH32’:
/home/adnan/my_socket.c:666: warning: will never be executed
but when I look at line 666 in my_socket.c, it's actually inside another function that is being called from function my_sockNtoH32() and will not be executed for this specific instance but will be executed when called from some other functions.
What I need is to find the code which will never be executed. Can someone plz help with this?
PS: I can't convince management to buy a tool for this task, so please stick to free/open source tools.

If GCC isn't cutting it for you, try clang (or more accurately, its static analyzer). It (generally, your mileage may vary of course) has a much better static analysis than GCC (and produces much better output). It's used in Apple's Xcode but it's open-source and can be used seperately.

When GCC says "will never be executed", it means it. You may have a bug that, in fact, does make that dead code. For example, something like:
if (a = 42) {
// some code
} else {
// warning: unreachable code
}
Without seeing the code it's not possible to be specific, of course.
Note that if there is a macro at line 666, it's possible GCC refers to a part of that macro as well.

GCC will help you find dead code within a compilation. I'd be surprised if it can find dead code across multiple compilation units. A file-level declaration of a function or variable in a compilation unit means that some other compilation unit might reference it. So anything declared at the top level of a file, GCC can't eliminate, as it arguably only sees one compilation unit at a time.
The problem gets get harder. Imagine that compilation unit A declares function a, and compilation unit B has a function b that calls a. Is a dead? On the face of it, no. But in fact, it depends; if b is dead, and the only reference to a is in b, then a is dead, too. We get the same problem if b merely takes &a and puts it into an array X. Now to decide if a is dead, we need a points-to analysis across the entire system, to see if that pointer to a is used anywhere.
To get this kind of accurate "dead" information, you need a global view of the entire set of compilation units, and need to compute a points-to analysis, followed by the construction of a call-graph based on that points-to analysis. Function a is dead only if the call graph (as a tree,
with main as the root) doesn't reference it somewhere.
(Some caveats are necessary: whatever the analysis is, as a practical matter it must be conservative, so even a full-points to analysis may not identify a function correctly as dead. You also have to worry about uses of a C artifact from outside the set of C functions, e.g., a call to a from some bit of assembler code).
Threading makes this worse; each thread has some root function which is probably at the top of the call DAG. Since how a thread gets started isn't defined by C compilers, it should be clear that to determine if a multithreaded C application has dead code, somehow the analysis has to be told the thread root functions, or be told how to discover them by looking for thread-initialization primitives.
You aren't getting a lot responses on how to get a correct answer. While it isn't open source, our DMS Software Reengineering Toolkit with its C Front End has all the machinery to do this, including C parsers, control- and dataflow- analysis, local and global points-to analysis, and global call graph construction. DMS is easily customized to include extra information such as external calls from assembler, and/or a list of thread roots or specific source-patterns that are thread initialization calls, and we've actually done that (easily) for some large embedded engine controllers with millions of lines of code. DMS has been applied to systems as large as 26 million lines of code (some 18,000 compilation units) for the purpose of building such calls graphs.
[An interesting aside: in processing individual comilation units, DMS for scaling reasons in effect deletes symbols and related code that aren't used in that compilation unit. Remarkably, this gets rid of about 95% of code by volume when you take into account the iceberg usually hiding in the include file nest. It says C software typically has poorly factored include files. I suspect you all know that already.]
Tools like GCC will remove dead code while compiling. That's helpful, but the dead code is still lying around in your compilation unit source code using up developer's attention (they have to figure out if it is dead, too!). DMS in its program transformation mode can be configured, modulo some preprocessor issues, to actually remove that dead code from the source. On very large software systems, you don't really want to do this by hand.

GCC: Force a function call after every instruction (for multithreaded testing)?

I'm trying to test a rather threading-sensitive area in my program and was wondering if there's a way to force gcc to insert a call after every instruction it emits so that I can manually yield to a different thread?
Thanks,
Robert

No, GCC does not has such an option.
However, you may be able hack together a script that does that job. You can compile your code to assembler using the -S option. Compiler generated assembler is relative easy to parse.
Don't forget to save the flags and all registers inside your debugging code though.