Well, the standard gives no guarantee that inline functions are actually inlined; if you need a 100% guarantee of textual substitution, you must use macros. The compiler decides which functions to inline based on its own heuristics, largely irrespective of the inline keyword.
Then when will the inline keyword actually have some effect on what the compiler does, when using modern compilers such as recent versions of GCC?
It has a semantic effect. To simplify, a function marked inline may be defined multiple times in one program (though all definitions must be equivalent to each other), so the presence of inline is required for correctness when the function definition is included in headers (which, in turn, makes the definition visible so the compiler can inline it without LTO).
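As a minimal sketch of that usage under the C99 inline model (file and function names are illustrative), the header carries the inline definition, and exactly one .c file forces the external definition:

/* clamp.h -- the inline definition may appear in every translation
   unit that includes this header; all copies must be identical. */
#ifndef CLAMP_H
#define CLAMP_H

inline int clamp01(int x)
{
    return x < 0 ? 0 : (x > 1 ? 1 : x);
}

#endif /* CLAMP_H */

/* clamp.c -- exactly one translation unit must provide the external
   definition; this extern declaration makes it this one (C99 6.7.4). */
#include "clamp.h"
extern int clamp01(int x);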
Other than that, for inlining-the-optimization, "never" is a perfectly safe approximation. It probably has some effect in some compilers, but nothing worth losing sleep over, especially not without actual hard data. For example, in the following code, using Clang 3.0 or GCC 4.7, the code generated for maxArray is the same whether work is marked inline or not. The only difference is whether work remains as a stand-alone function for other translation units to link to, or is removed.
void work(double *a, double *b) {
    if (*b > *a) *a = *b;
}

void maxArray(double *x, double *y) {
    for (int i = 0; i < 65536; i++) {
        //if (y[i] > x[i]) x[i] = y[i];
        work(x + i, y + i);
    }
}
If you want to control inlining, stick to whatever pragmas or attributes your compiler provides for controlling that behaviour, for example __attribute__((always_inline)) on GCC and similar compilers. As you've mentioned, the inline keyword itself is often ignored, depending on optimization settings and the like.
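As a minimal sketch of that attribute syntax (the function names are made up; GCC and Clang accept both forms):

/* Force inlining at every call site; GCC expects the inline keyword
   to accompany always_inline. */
__attribute__((always_inline)) static inline int fast_add(int a, int b)
{
    return a + b;
}

/* Forbid inlining entirely. */
__attribute__((noinline)) static int slow_add(int a, int b)
{
    return a + b;
}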
Related
Take this function
__attribute_const__ static inline int mul(int a, int b)
{
    return a * b;
}
versus this one
__attribute_const__ static int mul(int a, int b)
{
    return a * b;
}
Is there a reason to use inline when using a const attribute? Does it help the compiler at all to use inline here?
Neither of those necessarily helps here: a static function may be inlined anyway, regardless of inline, if the compiler so decides. And because the function is static, its source is present in the translation unit where it is used, so the compiler can also see that it calculates the product of its two arguments; compilers are smart enough to conclude that the product of two arguments depends only on the values of those arguments.
The inline case gets more interesting with inline/extern inline. Likewise, the attribute case gets more interesting when the compiler cannot see the code (because the function is defined only in another translation unit), or cannot deduce its behaviour properly; for example, a const function might read some common lookup tables initialized at the beginning of the program, and the compiler would not be able to prove on its own that they remain constant.
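A hedged sketch of that cross-translation-unit case (the function name is made up): here the const attribute is what licenses the compiler to merge the repeated calls, since it cannot see the body.

/* Defined in another translation unit; the const attribute promises
   the result depends only on 'key' and there are no side effects. */
extern int table_lookup(int key) __attribute__((const));

int sum_twice(int key)
{
    /* With the attribute, the compiler may emit a single call to
       table_lookup and reuse the result; without it, it must call twice. */
    return table_lookup(key) + table_lookup(key);
}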
Inline functions are substituted where they are called (at compile time), and the const attribute tells the compiler that further calls to the function with the same parameters can be avoided, because the result will be the same. If you mark a const function as inline, then you are losing the const behavior, since an inline expansion is not per se a "call", and the const optimization relies on repeated calls.
In my understanding, inline can speed up code execution. Is that right?
How much speed can we gain from it?
Ripped from here:
Yes and no. Sometimes. Maybe.
There are no simple answers. inline functions might make the code faster, they might make it slower. They might make the executable larger, they might make it smaller. They might cause thrashing, they might prevent thrashing. And they might be, and often are, totally irrelevant to speed.
inline functions might make it faster: As shown above, procedural integration might remove a bunch of unnecessary instructions, which might make things run faster.
inline functions might make it slower: Too much inlining might cause code bloat, which might cause "thrashing" on demand-paged virtual-memory systems. In other words, if the executable size is too big, the system might spend most of its time going out to disk to fetch the next chunk of code.
inline functions might make it larger: This is the notion of code bloat, as described above. For example, if a system has 100 inline functions each of which expands to 100 bytes of executable code and is called in 100 places, that's an increase of 1MB. Is that 1MB going to cause problems? Who knows, but it is possible that that last 1MB could cause the system to "thrash," and that could slow things down.
inline functions might make it smaller: The compiler often generates more code to push/pop registers/parameters than it would by inline-expanding the function's body. This happens with very small functions, and it also happens with large functions when the optimizer is able to remove a lot of redundant code through procedural integration — that is, when the optimizer is able to make the large function small.
inline functions might cause thrashing: Inlining might increase the size of the binary executable, and that might cause thrashing.
inline functions might prevent thrashing: The working set size (number of pages that need to be in memory at once) might go down even if the executable size goes up. When f() calls g(), the code is often on two distinct pages; when the compiler procedurally integrates the code of g() into f(), the code is often on the same page.
inline functions might increase the number of cache misses: Inlining might cause an inner loop to span across multiple lines of the memory cache, and that might cause thrashing of the memory-cache.
inline functions might decrease the number of cache misses: Inlining usually improves locality of reference within the binary code, which might decrease the number of cache lines needed to store the code of an inner loop. This ultimately could cause a CPU-bound application to run faster.
inline functions might be irrelevant to speed: Most systems are not CPU-bound. Most systems are I/O-bound, database-bound or network-bound, meaning the bottleneck in the system's overall performance is the file system, the database or the network. Unless your "CPU meter" is pegged at 100%, inline functions probably won't make your system faster. (Even in CPU-bound systems, inline will help only when used within the bottleneck itself, and the bottleneck is typically in only a small percentage of the code.)
There are no simple answers: You have to play with it to see what is best. Do not settle for simplistic answers like, "Never use inline functions" or "Always use inline functions" or "Use inline functions if and only if the function is less than N lines of code." These one-size-fits-all rules may be easy to write down, but they will produce sub-optimal results.
Copyright (C) Marshall Cline
Using inline asks the compiler to use the substitution model of evaluation, but this substitution is not guaranteed to happen. When it does happen, the generated code is longer, and it may be faster; but with some optimizations active, the substitution model is not always the faster one.
The reason I use the inline function specifier (specifically, static inline) is not because of "speed", but because
static part tells the compiler the function is only visible in the current translation unit (the current file being compiled and included header files)
inline part tells the compiler it can include the implementation of the function at the call site, if it wants to
static inline tells the compiler that it can skip the function completely if it is not used at all in the current translation unit
(Specifically, the compiler that I use most with the options I use most, gcc -Wall, does issue a warning if a function marked static is unused; but will not issue a warning if a function marked static inline is unused.)
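A quick way to observe that warning behaviour (a minimal sketch; compile with gcc -Wall -c):

/* gcc -Wall -c warns that 'helper_plain' is defined but not used,
   but stays silent about 'helper_inline'. */
static int helper_plain(void) { return 1; }
static inline int helper_inline(void) { return 2; }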
static inline tells us humans that the function is a macro-like helper function, while also adding the type checking that macros lack.
Thus, in my opinion, the assumption that inline has anything to do with speed per se is incorrect. Answering the stated question with a straight answer would be misleading.
In my code, you see them associated with some data structures, or occasionally global variables.
A typical example is when I want to implement a Xorshift pseudorandom number generator in my own C code:
#include <inttypes.h>

static uint64_t prng_state = 1; /* Any nonzero uint64_t seed is okay */

static inline uint64_t prng_u64(void)
{
    uint64_t state;

    state = prng_state;
    state ^= state >> 12;
    state ^= state << 25;
    state ^= state >> 27;
    prng_state = state;
    return state * UINT64_C(2685821657736338717);
}
The static uint64_t prng_state = 1; means that prng_state is a variable of type uint64_t, visible only in the current compilation unit, and initialized to 1. The prng_u64() function returns an unsigned 64-bit pseudorandom integer. However, if you do not use prng_u64(), the compiler will not generate code for it either.
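A minimal usage sketch (the loop count of 3 is arbitrary):

#include <inttypes.h>
#include <stdio.h>

/* ... prng_state and prng_u64() as defined above ... */

int main(void)
{
    for (int i = 0; i < 3; i++)
        printf("%" PRIu64 "\n", prng_u64());
    return 0;
}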
Another typical use case is when I have data structures, and they need accessor functions. For example,
#ifndef GRID_H
#define GRID_H

#include <stdio.h>  /* for FILE */
#include <stdlib.h>

typedef struct {
    int rows;
    int cols;
    unsigned char *cell;
} grid;

#define GRID_INIT { 0, 0, NULL }
#define GRID_OUTSIDE -1

static inline int grid_get(grid *const g, const int row, const int col)
{
    if (!g || row < 0 || col < 0 || row >= g->rows || col >= g->cols)
        return GRID_OUTSIDE;
    return g->cell[row * (size_t)(g->cols) + col];
}

static inline int grid_set(grid *const g, const int row, const int col,
                           const unsigned char value)
{
    if (!g || row < 0 || col < 0 || row >= g->rows || col >= g->cols)
        return GRID_OUTSIDE;
    return g->cell[row * (size_t)(g->cols) + col] = value;
}

static inline void grid_init(grid *g)
{
    g->rows = 0;
    g->cols = 0;
    g->cell = NULL;
}

static inline void grid_free(grid *g)
{
    free(g->cell);
    g->rows = 0;
    g->cols = 0;
    g->cell = NULL;
}

int grid_create(grid *g, const int rows, const int cols,
                const unsigned char initial_value);
int grid_load(grid *g, FILE *handle);
int grid_save(grid *g, FILE *handle);

#endif /* GRID_H */
That header file defines some useful helper functions, and declares the functions grid_create(), grid_load(), and grid_save(), that would be implemented in a separate .c file.
(Yes, those three functions could be implemented in the header file just as well, but it would make the header file quite large. In a large project, spread over many translation units (.c source files), each translation unit that includes the header gets its own local copy of the functions. The accessor functions defined as static inline above are short and trivial, so it is perfectly okay for them to be copied here and there. The three functions I omitted are much larger.)
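For illustration, a minimal sketch of what one of the omitted functions might look like in the .c file (the convention of returning -1 on failure is my assumption, not part of the header above):

/* grid.c -- illustrative implementation of one omitted function */
#include <string.h>
#include "grid.h"

int grid_create(grid *g, const int rows, const int cols,
                const unsigned char initial_value)
{
    if (!g || rows < 1 || cols < 1)
        return -1;                           /* assumed error convention */
    g->cell = malloc((size_t)rows * (size_t)cols);
    if (!g->cell)
        return -1;
    memset(g->cell, initial_value, (size_t)rows * (size_t)cols);
    g->rows = rows;
    g->cols = cols;
    return 0;
}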
For example, if we have function A, is it possible to tell the compiler: hey, you need to inline this function at this point of the code, but not do it (make an actual call to it) at that other point?
You cannot selectively tell a compiler to inline some calls, at least not portably.
Note that inline is just a suggestion to the compiler; the compiler may or may not obey the suggestion to substitute the body of the function at the point of call, but some rules, like the one-definition rule, are relaxed by the compiler for such a function.
gcc has attributes noinline and always_inline.
http://gcc.gnu.org/onlinedocs/gcc/Function-Attributes.html
I think the requirement is not useful, and the answer to your question is: No.
However, you can achieve something to that effect by using a macro:
#define f_inline do { int i = 1 + 2; } while( 0 )
void f() {
    f_inline;
}
You can now use f_inline; if you want to force the code of f to be applied in-line.
It doesn't particularly matter whether you inline function calls at all. See What does __inline__ mean ?. I would just write the function non-inlined, and let the compiler decide how to inline it optimally.
If the compiler unconditionally honors your use or non-use of the inline keyword, or if you use the gcc extensions __attribute__((__always_inline__)) and __attribute__((__noinline__)), then you can achieve what you want with a simple wrapper function:
static inline int foo_inline(int a, int b)
{
    /* ... */
}

static int foo_noninline(int a, int b)
{
    return foo_inline(a, b);
}
I've written it with the inline keyword, but since compilers will normally treat it just as a hint or even ignore it, you probably want the gcc attribute version.
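A short usage sketch of the idea (the call-site names are made up):

/* Each call site picks the variant it wants. */
int hot_path(void)
{
    return foo_inline(1, 2);    /* expanded in place */
}

int cold_path(void)
{
    return foo_noninline(1, 2); /* stays a real call */
}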
Please take a look at my code below:
#include <stdio.h>

void printOut()
{
    static int i = 0;
    if (i < 10)
    {
        printOut(i);
    }
}

int main(int argc, char *argv[])
{
    return 0;
}
I guess there should be an error due to my invoking a function prototype that does not exist. Actually, the code compiles fine with the mingw5 compiler, which is weird to me. When I switch to the Borland compiler, I get a warning message saying there is no printOut function prototype. Is this only a warning? What is more, the code executes well without any pop-up error windows.
In C, a function declared with empty parentheses, like void printOut(), can still be called with arguments; the empty parameter list says nothing about what the function takes.
That's why it compiles. The way to specify that it doesn't take any parameters is:
void printOut(void)
This is the proper way to do it, but it is less common, especially among those coming from a C++ background.
Your program's behavior is undefined, because you define printOut() with no parameters, but you call it with one argument. You need to fix it. But you've written it in such a way that the compiler isn't required to diagnose the problem. (gcc, for example, doesn't warn about the parameter mismatch, even with -std=c99 -pedantic -Wall -Wextra -O3.)
The reasons for this are historical.
Pre-ANSI C (prior to 1989) didn't have prototypes; function declarations could not specify the expected type or number of arguments. Function definitions, on the other hand, specified the function's parameters, but not in a way that the compiler could use to diagnose mismatched calls. For example, a function with one int parameter might be declared (say, in a header file) like this:
int plus_one();
and defined (say, in the corresponding .c file) like this:
int plus_one(n)
int n;
{
    return n + 1;
}
The parameter information was buried inside the definition.
ANSI C added prototypes, so the above could be written like this:
int plus_one(int n);
int plus_one(int n)
{
    return n + 1;
}
But the language continued to support the old-style declarations and definitions, so as not to break existing code. Even the upcoming C201X standard still permits pre-ANSI function declarations and definitions, though they've been obsolescent for 22 years now.
In your definition:
void printOut()
{
    ...
}
you're using an old-style function definition. It says that printOut has no parameters -- but it doesn't let the compiler warn you if you call it incorrectly. Inside your function you call it with one argument. The behavior of this call is undefined. It could quietly ignore the extraneous argument -- or it could conceivably corrupt the stack and cause your program to die horribly. (The latter is unlikely; for historical reasons, most C calling conventions are tolerant of such errors.)
If you want your printOut() function to have no parameters and you want the compiler to complain if you call it incorrectly, define it as:
void printOut(void)
{
    ...
}
This is the one and only correct way to write it in C.
Of course, if you simply make this change in your program and then add a call to printOut() in main(), you'll have an infinite recursion on your hands. You probably want printOut() to take an int argument:
void printOut(int n)
{
    ...
}
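For instance, a corrected sketch of the whole program, assuming the intent was to print the numbers 0 through 9 recursively:

#include <stdio.h>

void printOut(int n)
{
    if (n < 10)
    {
        printf("%d\n", n);
        printOut(n + 1);
    }
}

int main(void)
{
    printOut(0);
    return 0;
}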
As it happens, C++ has different rules. C++ was derived from C, but with less concern for backward compatibility. When Stroustrup added prototypes to C++, he dropped old-style declarations altogether. Since there was no need for a special-case void marker for parameterless functions, void printOut() in C++ says explicitly that printOut has no parameters, and a call with arguments is an error. C++ also permits void printOut(void) for compatibility with C, but that's probably not used very often (it's rarely useful to write code that's both valid C and valid C++.) C and C++ are two different languages; you should follow the rules for whichever language you're using.
I'm trying to understand when and when not to use the restrict keyword in C and in what situations it provides a tangible benefit.
After reading "Demystifying The Restrict Keyword" (which provides some rules of thumb on usage), I get the impression that when a function is passed pointers, it has to account for the possibility that the data pointed to might overlap (alias) with any other arguments being passed into the function. Given a function:
void foo(int *a, int *b, int *c, int n) {
    for (int i = 0; i < n; ++i) {
        b[i] = b[i] + c[i];
        a[i] = a[i] + b[i] * c[i];
    }
}
the compiler has to reload c in the second expression, because maybe b and c point to the same location. It also has to wait for b to be stored before it can load a for the same reason. It then has to wait for a to be stored and must reload b and c at the beginning of the next loop. If you call the function like this:
int a[N];
foo(a, a, a, N);
then you can see why the compiler has to do this. Using restrict effectively tells the compiler that you will never do this, so that it can drop the redundant load of c and load a before b is stored.
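Concretely, the restrict-qualified version would look like this (a sketch; the body is unchanged):

/* The caller now promises that a, b and c do not overlap, so the
   compiler may keep b[i] and c[i] in registers across both statements. */
void foo(int *restrict a, int *restrict b, int *restrict c, int n) {
    for (int i = 0; i < n; ++i) {
        b[i] = b[i] + c[i];
        a[i] = a[i] + b[i] * c[i];
    }
}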
In a different SO post, Nils Pipenbrinck provides a working example of this scenario, demonstrating the performance benefit.
So far I've gathered that it's a good idea to use restrict on pointers you pass into functions which won't be inlined. Apparently if the code is inlined the compiler can figure out that the pointers don't overlap.
Now here's where things start getting fuzzy for me.
In Ulrich Drepper's paper "What every programmer should know about memory", he states that "unless restrict is used, all pointer accesses are potential sources of aliasing," and he gives a specific code example of a submatrix matrix multiply where he uses restrict.
However, when I compile his example code either with or without restrict, I get identical binaries in both cases. I'm using gcc version 4.2.4 (Ubuntu 4.2.4-1ubuntu4).
The thing I can't figure out in the following code is whether it needs to be rewritten to make more extensive use of restrict, or if the alias analysis in GCC is just so good that it's able to figure out that none of the arguments alias each other. For purely educational purposes, how can I make using or not using restrict matter in this code - and why?
The restrict version is compiled with:
gcc -DCLS=$(getconf LEVEL1_DCACHE_LINESIZE) -DUSE_RESTRICT -Wextra -std=c99 -O3 matrixMul.c -o matrixMul
Just remove -DUSE_RESTRICT to not use restrict.
#include <stdlib.h>
#include <stdio.h>
#include <emmintrin.h>

#ifdef USE_RESTRICT
#else
#define restrict
#endif

#define N 1000

double _res[N][N] __attribute__ ((aligned (64)));
double _mul1[N][N] __attribute__ ((aligned (64)))
    = { [0 ... (N-1)] = { [0 ... (N-1)] = 1.1f }};
double _mul2[N][N] __attribute__ ((aligned (64)))
    = { [0 ... (N-1)] = { [0 ... (N-1)] = 2.2f }};

#define SM (CLS / sizeof (double))

void mm(double (* restrict res)[N], double (* restrict mul1)[N],
        double (* restrict mul2)[N]) __attribute__ ((noinline));

void mm(double (* restrict res)[N], double (* restrict mul1)[N],
        double (* restrict mul2)[N])
{
    int i, i2, j, j2, k, k2;
    double *restrict rres;
    double *restrict rmul1;
    double *restrict rmul2;

    for (i = 0; i < N; i += SM)
        for (j = 0; j < N; j += SM)
            for (k = 0; k < N; k += SM)
                for (i2 = 0, rres = &res[i][j], rmul1 = &mul1[i][k];
                     i2 < SM;
                     ++i2, rres += N, rmul1 += N)
                    for (k2 = 0, rmul2 = &mul2[k][j];
                         k2 < SM;
                         ++k2, rmul2 += N)
                        for (j2 = 0; j2 < SM; ++j2)
                            rres[j2] += rmul1[k2] * rmul2[j2];
}

int main (void)
{
    mm(_res, _mul1, _mul2);
    return 0;
}
It is a hint to the code optimizer. Using restrict assures the optimizer that it can keep the pointed-to value in a CPU register and does not have to flush updates of that value to memory just so that a potential alias is kept up to date.
Whether or not it takes advantage of it depends heavily on implementation details of the optimizer and the CPU. Code optimizers already are heavily invested in detecting non-aliasing since it is such an important optimization. It should have no trouble detecting that in your code.
Also, GCC 4.0.0-4.4 has a regression bug that causes the restrict keyword to be ignored. This bug was reported as fixed in 4.5 (I lost the bug number though).
(I don't know if using this keyword gives you a significant advantage, actually. It is very easy for a programmer to err with this qualifier, as there is no enforcement, so an optimizer cannot be certain that the programmer doesn't "lie".)
When you know that a pointer A is the only pointer to some region of memory, that is, it has no aliases (any other pointer B will necessarily be unequal to A, B != A), you can tell this fact to the optimizer by qualifying the type of A with the restrict keyword.
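The standard library itself relies on this: C99 declares memcpy in <string.h> with restrict-qualified parameters, which encodes the requirement that source and destination must not overlap.

#include <stddef.h>

/* The C99 declaration of memcpy: the restrict qualifiers promise
   that dest and src point to non-overlapping regions. */
void *memcpy(void *restrict dest, const void *restrict src, size_t n);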
I have written about this here: http://mathdev.org/node/23 and tried to show that some restricted pointers are in fact "linear" (as mentioned in that post).
It's worth noting that recent versions of clang are capable of generating code with a run-time check for aliasing, and two code paths: one for cases where there is potential aliasing, and the other for cases where it is obvious there is no chance of it.
This clearly depends on the extents of data pointed to being conspicuous to the compiler - as they would be in the example above.
I believe the prime justification is for programs making heavy use of the STL, and particularly <algorithm>, where it is either difficult or impossible to introduce the __restrict qualifier.
Of course, this all comes at the expense of code-size, but removes a great deal of potential for obscure bugs that could result for pointers declared as __restrict not being quite as non-overlapping as the developer thought.
I would be surprised if GCC hadn't also got this optimisation.
Maybe the optimisations done here don't rely on the pointers not being aliased? Unless you preload multiple mul2 elements before writing the result to res, I don't see any aliasing problem.
In the first piece of code you show, it is quite clear what kind of aliasing problem can occur.
Here it is not so clear.
Rereading Drepper's article, he does not specifically say that restrict solves anything. There is even this passage:
"In theory the restrict keyword introduced into the C language in the 1999 revision should solve the problem. Compilers have not caught up yet, though. The reason is mainly that too much incorrect code exists which would mislead the compiler and cause it to generate incorrect object code."
In this code, optimisation of memory accesses has already been done within the algorithm itself. The remaining optimisation seems to be done in the vectorised code presented in the appendix. So for the code presented here, I guess there is no difference, because no optimisation relying on restrict is done. Every pointer access is a potential source of aliasing, but not every optimisation relies on aliasing.
Premature optimization being the root of all evil, the use of the restrict keyword should be limited to the cases you are actively studying and optimizing, not used wherever it could be used.
If there is a difference at all, moving mm to a separate DSO (such that gcc can no longer know everything about the calling code) will be the way to demonstrate it.
Are you running on 32 or 64-bit Ubuntu? If 32-bit, then you need to add -march=core2 -mfpmath=sse (or whatever your processor architecture is), otherwise it doesn't use SSE. Secondly, in order to enable vectorization with GCC 4.2, you need to add the -ftree-vectorize option (as of 4.3 or 4.4 this is included as default in -O3). It might also be necessary to add -ffast-math (or another option providing relaxed floating point semantics) in order to allow the compiler to reorder floating point operations.
Also, add the -ftree-vectorizer-verbose=1 option to see whether it manages to vectorize the loop or not; that's an easy way to check the effect of adding the restrict keyword.
The problem with your example code is that the compiler will just inline the call and see that there is no aliasing ever possible in your example. I suggest you remove the main() function and compile it using -c.
The following C99 code shows that the output of the program can depend on restrict:
__attribute__((noinline))
int process(const int * restrict const a, int * const b) {
    *b /= (*a + 1);
    return *a + *b;
}

int main(void) {
    int data[2] = {1, 2};
    return process(&data[0], &data[0]);
}
The program terminates with exit code 1 using restrict and 0 without the restrict qualifier.
The compilation is done with gcc -std=c99 -Wall -pedantic -O3 main.c.
The flag -O1 does the job too.
It is useful to use restrict when, for example, you can tell the compiler that a loop condition remains unchanged even though another pointer is updated inside the loop (with restrict, the stores cannot change the loop condition).
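A minimal sketch of that situation (the function and parameter names are made up):

/* Because dst and len are restrict-qualified, stores through dst cannot
   change *len, so the compiler may load *len once and hoist it out of
   the loop instead of re-reading it on every iteration. */
void scale(int *restrict dst, const int *restrict len)
{
    for (int i = 0; i < *len; ++i)
        dst[i] *= 2;
}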
And so on for similar cases.