I have a function which operates on a piece of data (let's say, an int), and I want to change it in place by passing a reference to the value. As such, I have the function: void myFunction(int *thing) { ... }. When I use it I call it thus: myFunction(&anInt).
As my function is called frequently (but from many different places) I am concerned about its performance. The reason I have refactored it into a function is testability and code reuse.
Will the compiler be able to optimize the function, inlining it to operate directly on my anInt variable?
I hope you'll take this question in the spirit in which it's asked (i.e. I'm not prematurely worrying about optimisation, I'm curious about the answer). Similarly, I don't want to make it into a macro.
One way to find out whether the function is inlined is to use the -Winline gcc option:
-Winline
Warn if a function can not be inlined and it was declared as inline.
Even with this option, the compiler will not warn about failures to inline
functions declared in system headers.
The compiler uses a variety of heuristics to determine whether or not to
inline a function. For example, the compiler takes into account the size
of the function being inlined and the amount of inlining that has already
been done in the current function. Therefore, seemingly insignificant
changes in the source program can cause the warnings produced by -Winline
to appear or disappear.
GCC is quite smart. Consider this code fragment:
#include <stdio.h>
void __inline__ inc(int *val)
{
++ *val;
}
int main()
{
int val;
scanf("%d", &val);
inc(&val);
printf("%d\n", val);
return 0;
}
After a gcc -S -O3 test.c you'll get the following relevant asm:
...
call __isoc99_scanf
movl 12(%rsp), %esi
movl $.LC1, %edi
xorl %eax, %eax
addl $1, %esi
movl %esi, 12(%rsp)
call printf
...
As you can see, there's no need to be an asm expert to see the inc() call has been converted to an increment instruction.
There are two issues here - can the code be optimised, and will it. It certainly can be, but the "will" depends on the mood the optimiser is in. If this is really important to you, compile the code and take a look at the assembly language output.
Also, you seem to be conflating two issues. An inlined function effectively has its body pasted at the call site. Whether or not you are using pointers is neither here nor there. But you seem to be asking if the compiler can transform:
int x = 42;
int * p = & x;
* p = * p + 1;
into
x = x + 1;
This is much more difficult for the optimiser to see.
It should not matter whether the argument is a pointer or not.
But first: for the compiler to inline the function automatically, it should be static, and it must be defined in the same compilation unit. NOTE: we are talking about C, not C++; the C++ inline rules differ.
If it is not possible to have the function in the same compilation unit, then try global optimizations (check documentation of your compiler for details).
C99 gives you an inline keyword, just as in C++, which lifts the restriction to a single compilation unit.
Here is some further information.
It will (or at least can). There are some cases where the function cannot be inlined - e.g. when you take the address of the function (calling the function through a pointer; accessing the parameter through a pointer, as you do, is fine). There may be other situations (static local variables? I'm unsure).
Try to declare the function with "extern inline" - this prevents the compiler from emitting the standalone body. If it cannot inline the function, it will emit an error.
If you're concerned about the compiler generating suboptimal code and want to change a simple value type, declare your function as int myFunction(int) and return the new value.
What compiler version are you using? With what options? On what platform?
All these questions affect the answer. You really need to compile the code and look at the assembly to be sure.
This looks to me like a classic case of premature optimization. Do you really know there is a performance issue here? One worth wasting your valuable time worrying about. I mean, really know? Like, have you measured it?
By itself this isn't too bad, but if you take this attitude over a large amount of code, you can do some serious damage and waste a large amount of development and maintenance time for no good reason.
Related
Would there be any use for a function that does nothing when run, i.e:
void Nothing() {}
Note, I am not talking about a function that waits for a certain amount of time, like sleep(), just something that takes as much time as the compiler / interpreter gives it.
Such a function could be necessary as a callback function.
Supposed you had a function that looked like this:
void do_something(int param1, char *param2, void (*callback)(void))
{
// do something with param1 and param2
callback();
}
This function receives a pointer to a function which it subsequently calls. If you don't particularly need to use this callback for anything, you would pass a function that does nothing:
do_something(3, "test", Nothing);
When I've created tables that contain function pointers, I do use empty functions.
For example:
typedef int(*EventHandler_Proc_t)(int a, int b); // A function-pointer to be called to handle an event
struct
{
Event_t event_id;
EventHandler_Proc_t proc;
} EventTable[] = { // An array of Events, and Functions to be called when the event occurs
{ EventInitialize, InitializeFunction },
{ EventIncrement, IncrementFunction },
{ EventNOP, NothingFunction }, // Empty function is used here.
};
In this example table, I could put NULL in place of the NothingFunction, and check if the .proc is NULL before calling it. But I think it keeps the code simpler to put a do-nothing function in the table.
Yes. Quite a lot of things want to be given a function to notify about a certain thing happening (callbacks). A function that does nothing is a good way to say "I don't care about this."
I am not aware of any examples in the standard library, but many libraries built on top have function pointers for events.
For an example, glib defines a callback "GLib.LogFunc(log_domain, log_level, message, *user_data)" for providing the logger. An empty function would be the callback you provide when logging is disabled.
One use case would be as a possibly temporary stub function midway through a program's development.
If I'm doing some amount of top-down development, it's common for me to design some function prototypes, write the main function, and at that point, want to run the compiler to see if I have any syntax errors so far. To make that compile happen I need to implement the functions in question, which I'll do by initially just creating empty "stubs" which do nothing. Once I pass that compile test, I can go on and flesh out the functions one at a time.
The Gaddis textbook Starting out with C++: From Control Structures Through Objects, which I teach out of, describes them this way (Sec. 6.16):
A stub is a dummy function that is called instead of the actual
function it represents. It usually displays a test message
acknowledging that it was called, and nothing more.
A function that takes arguments and does nothing with them can be used as a pair with a function that does something useful, such that the arguments are still evaluated even when the no-op function is used. This can be useful in logging scenarios, where the arguments must still be evaluated to verify the expressions are legal and to ensure any important side-effects occur, but the logging itself isn't necessary. The no-op function might be selected by the preprocessor when the compile-time logging level was set at a level that doesn't want output for that particular log statement.
As I recall, there were two empty functions in Lions' Commentary on UNIX 6th Edition, with Source Code, and the introduction to the re-issue early this century called Ritchie, Kernighan and Thompson out on it.
The function that gobbles its argument and returns nothing is actually ubiquitous in C, but not written out explicitly because it is implicitly called on nearly every line. The most common use of this empty function, in traditional C, was the invisible discard of the value of any statement. But, since C89, this can be explicitly spelled as (void). The lint tool used to complain whenever a function return value was ignored without explicitly passing it to this built-in function that returns nothing. The motivation behind this was to try to prevent programmers from silently ignoring error conditions, and you will still run into some old programs that use the coding style, (void)printf("hello, world!\n");.
Such a function might be used for:
Callbacks (which the other answers have mentioned)
An argument to higher-order functions
Benchmarking a framework, with no overhead for the no-op being performed
Having a unique value of the correct type to compare other function pointers to. (Particularly in a language like C, where all function pointers are convertible and comparable with each other, but conversion between function pointers and other kinds of pointers is not portable.)
The sole element of a singleton value type, in a functional language
If passed an argument that it strictly evaluates, this could be a way to discard a return value but execute side-effects and test for exceptions
A dummy placeholder
Proving certain theorems in the typed Lambda Calculus
Another temporary use for a do-nothing function could be to have a line exist to put a breakpoint on, for example when you need to check the run-time values being passed into a newly created function so that you can make better decisions about what the code you're going to put in there will need to access. Personally, I like to use self-assignments, i.e. i = i when I need this kind of breakpoint, but a no-op function would presumably work just as well.
void MyBrandNewSpiffyFunction(TypeImNotFamiliarWith whoKnowsWhatThisVariableHas)
{
DoNothing(); // Yay! Now I can put in a breakpoint so I can see what data I'm receiving!
int i = 0;
i = i; // Another way to do nothing so I can set a breakpoint
}
From a language lawyer perspective, an opaque function call inserts a barrier for optimizations.
For example:
int a = 0;
extern void e(void);
int b(void)
{
++a;
++a;
return a;
}
int c(void)
{
++a;
e();
++a;
return a;
}
int d(void)
{
++a;
asm(" ");
++a;
return a;
}
The ++a expressions in the b function can be merged to a += 2, while in the c function, a needs to be updated before the function call and reloaded from memory after, as the compiler cannot prove that e does not access a, similar to the (non-standard) asm(" ") in the d function.
In the embedded firmware world, it could be used to add a tiny delay, required for some hardware reason. Of course, this could be called as many times in a row, too, making this delay expandable by the programmer.
Empty functions are not uncommon in platform-specific abstraction layers. There are often functions that are only needed on certain platforms. For example, a function void native_to_big_endian(struct data* d) would contain byte-swapping code on a little-endian CPU but could be completely empty on a big-endian CPU. This helps keep the business logic platform-agnostic and readable. I've also seen this sort of thing done for tasks like converting native file paths to Unix/Windows style, hardware initialization functions (when some platforms can run with defaults and others must be actively reconfigured), etc.
At the risk of being considered off-topic, I'm going to argue from a Thomistic perspective that a function that does nothing, and the concept of NULL, really have no place anywhere in computing.
Software is constituted in substance by state, behavior, and control flow which belongs to behavior. To have the absence of state is impossible; and to have the absence of behavior is impossible.
Absence of state is impossible because a value is always present in memory, regardless of initialization state for the memory that is available. Absence of behavior is impossible because non-behavior cannot be executed (even "nop" instructions do something).
Instead, we might better state that there is negative and positive existence defined subjectively by the context with an objective definition being that negative existence of state or behavior means no explicit value or implementation respectively, while the positive refers to explicit value or implementation respectively.
This changes the perspective concerning the design of an API.
Instead of:
void foo(void (*bar)()) {
if (bar) { bar(); }
}
we instead have:
void foo();
void foo_with_bar(void (*bar)()) {
if (!bar) { fatal(__func__, "bar is NULL; callback required\n"); }
bar();
}
or:
void foo(bool use_bar, void (*bar)());
or if you want even more information about the existence of bar:
void foo(bool use_bar, bool bar_exists, void (*bar)());
Each of these is a better design that makes your code and intent well expressed. The simple fact of the matter is that whether a thing exists or not concerns the operation of an algorithm, or the manner in which state is interpreted. Not only do you lose a whole value by reserving NULL as 0 (or any other arbitrary value), but you make your model of the algorithm less perfect and even error-prone in rare cases. What's more, on a system in which this reserved value is not actually reserved, the implementation might not work as expected.
If you need to detect for the existence of an input, let that be explicit in your API: have a parameter or two for that if it's that important. It will be more maintainable and portable as well since you're decoupling logic metadata from inputs.
In my opinion, therefore, a function that does nothing is not practical to use, but a design flaw if part of the API, and an implementation defect if part of the implementation. NULL obviously won't disappear that easily, and we just use it because that's what currently is used by necessity, but in the future, it doesn't have to be that way.
Besides all the reasons already given here, note that an "empty" function is never truly empty, so you can learn a lot about how function calls work on your architecture of choice by looking at the assembly output. Let's look at a few examples. Let's say I have the following C file, nothing.c:
void Nothing(void) {}
Compile this on an x86_64 machine with clang -c -S nothing.c -o nothing.s and you'll get something that looks like this (stripped of metadata and other stuff irrelevant to this discussion):
nothing.s:
_Nothing: ## #Nothing
pushq %rbp
movq %rsp, %rbp
popq %rbp
retq
Hmm, that doesn't really look like nothing. Note the pushing and popping of %rbp (the frame pointer) onto the stack. Now let's change the compiler flags and add -fomit-frame-pointer, or more explicitly: clang -c -S nothing.c -o nothing.s -fomit-frame-pointer
nothing.s:
_Nothing: ## #Nothing
retq
That looks a lot more like "nothing", but you still have at least one x86_64 instruction being executed, namely retq.
Let's try one more. Clang supports the gcc gprof profiler option -pg so what if we try that: clang -c -S nothing.c -o nothing.s -pg
nothing.s:
_Nothing: ## #Nothing
pushq %rbp
movq %rsp, %rbp
callq mcount
popq %rbp
retq
Here we've added a mysterious additional call to a function mcount() that the compiler has inserted for us. This one looks like the least amount of nothing-ness.
And so you get the idea. Compiler options and architecture can have a profound impact on the meaning of "nothing" in a function. Armed with this knowledge you can make much more informed decisions about both how you write code, and how you compile it. Moreover, a function like this called millions of times and measured can give you a very accurate measure of what you might call "function call overhead", or the bare minimum amount of time required to make a call given your architecture and compiler options. In practice given modern superscalar instruction scheduling, this measurement isn't going to mean a whole lot or be particularly useful, but on certain older or "simpler" architectures, it might.
These functions have a great place in test driven development.
class Doer {
public:
int PerformComplexTask(int input) { return 0; } // just to make it compile
};
Everything compiles, and the test cases say Fail until the function is properly implemented.
About inline functions in C, GCC documentation says that,
When a function is both inline and static, if all calls to the function are integrated into the caller, and the function’s address is never used, then the function’s own assembler code is never referenced. In this case, GCC does not actually output assembler code for the function, unless you specify the option -fkeep-inline-functions. If there is a nonintegrated call, then the function is compiled to assembler code as usual. The function must also be compiled as usual if the program refers to its address, because that cannot be inlined.
The first case is clear: if all the calls are integrated into the caller, then there is no need to produce the assembler code.
But what could be a non-integrated call?
A function can be unsuitable for inlining, but how can a call to it end up not integrated?
A function can be unsuitable for inlining, but how can a call to it end up not integrated?
There are several possibilities:
1. Taking the Address of the Function
In general, if you take the address of a function, for example to build a dispatch table, then the compiler can't optimize the "take address + indirect call" combination back into a direct call. Simple example:
static inline int inc (int i)
{
return i + 1;
}
int (*pfunc)(int) = inc;
2. Inlining the Function makes the Code more expensive
The cost model of the compiler is not perfect, and code inlining usually happens in an early stage of compilation, which consists of several hundred passes in the case of GCC. This means that the cost computation might be way off the actual costs of the final code due to inexact estimates of code size, register pressure, call overhead, costs of stack frame, benefit of value propagation, etc.
This may lead to the conclusion that inlining the function will decrease code performance compared to the non-inlined version of the code.
This typically happens when optimizing for code size (-Os). You can use -Winline to diagnose such situations. Sometimes the code is actually worse, but you still want the inlined version because you prefer speed over size in that place, even when using -Os. And sometimes the cost model of the compiler is not 100% correct.
In this case of "not suitable for inlining" you can add __attribute__((__always_inline__)) and the compiler will inline the function — if not prohibited otherwise — even when optimization is turned off.
A lot has been said and written about volatile variables and their use. In those articles, two slightly different ideas can be found:
1 - Volatile should be used when a variable is changed outside of the compiled program.
2 - Volatile should be used when a variable is changed outside the normal flow of the function.
The first statement limits volatile usage to memory-mapped registers etc. and multi-threaded stuff, but the second one actually adds interrupts into the scope.
This article (http://www.barrgroup.com/Embedded-Systems/How-To/C-Volatile-Keyword) for example explicitly states that volatile modifier should be used for globals changed during the interrupt, and provides this example:
int etx_rcvd = FALSE;
void main()
{
...
while (!etx_rcvd)
{
// Wait
}
...
}
interrupt void rx_isr(void)
{
...
if (ETX == rx_char)
{
etx_rcvd = TRUE;
}
...
}
Note how setting up rx_isr() as a callback is conveniently omitted here.
Hence I wrote my own example:
#include <stdio.h>
#include <time.h>
#include <signal.h>
void f();
int n = 0;
void main()
{
signal(2,f);
time_t tLastCalled = 0;
printf("Entering the loop\n");
while (n == 0)
{
if (time(NULL) - tLastCalled > 1)
{
printf("Still here...\n");
tLastCalled = time(NULL);
}
}
printf ("Done\n");
}
void f()
{
n = 1;
}
Compiled with gcc on Linux at various optimization levels, the loop exited and I saw "Done" every time I pressed Ctrl+C, which means that gcc really is smart enough not to optimize away the variable n here.
That said, my question is:
If compiler can really optimize global variables modified by an interrupt service routine, then:
1. Why does it have the right to optimize a global variable in the first place, when the variable can possibly be modified from another file?
2. Why do the example article and many others on the internet state that the compiler will not "notice" the interrupt callback function?
3. How do I modify my code to accomplish this?
Because you have a function call to an external function, the while loop does check n every time. However, if you remove those function calls the optimizer may registerize or do away with any checks of n.
Ex (gcc x86_64 -O3):
volatile int n;
int main() {
while(n==0) {}
return 0;
}
becomes:
.L3:
movl n(%rip), %eax
testl %eax, %eax
je .L3
xorl %eax, %eax
ret
But
int n;
int main() {
while(n==0) {}
return 0;
}
becomes:
movl n(%rip), %eax
testl %eax, %eax
jne .L2
.L3:
jmp .L3
In this case, n is never looked at in the infinite loop.
If there is a signal handler that modifies a global, you really should label that global volatile. You might not get in trouble by skipping this, but you are either getting lucky or you are counting on the optimizer not being able to verify whether or not a global is being touched.
There is some movement in cross module optimization at link time (llvm), so someday an optimizer may be able to tell that calls to time or printf aren't touching globals in your file. When that happens, missing the volatile keyword may cause problems even if you have external function calls.
If the compiler can really optimize global variables modified by an interrupt service routine, then:
Why does it have the right to optimize a global variable in the first place, when it can possibly be modified from another file?
The key here is that in a "normal", single-threaded program with no interrupts, the global variable cannot be modified at any time. All accesses to the variable are sequenced in a predictable manner, no matter which file makes the access.
And the optimizations may be subtle. It is not as simple as "ah ok this global doesn't seem to be used, let's remove it entirely". Rather, for some code like
while(global)
{
do_stuff(global);
}
the optimizer might create something behaving like:
register tmp = global;
loop:
do_stuff(tmp);
goto loop;
Which changes the meaning of the program completely. How such bugs caused by the lack of volatile manifest themselves differs from case to case. They are very hard to find.
Why the example article and many others on the internet state that the compiler will not "notice" the interrupt callback function?
Because embedded compilers are traditionally stupid when it comes to this aspect. Traditionally, when a compiler spots your non-standard interrupt keyword, it will just do 2 things:
Generate the specific return code from that function, since interrupts usually have different calling conventions compared to regular function calls.
Ensure that the function gets linked even though it is never called from the program. Possibly allocated in a separate memory segment. This is actually done by the linker and not the compiler.
There might nowadays be smarter compilers. PC/desktop compilers face the very same issue when dealing with callback functions/threads, but they are usually smart enough to realize that they shouldn't assume things about global variables shared with a callback.
Embedded compilers are traditionally far dumber than PC/desktop compilers when it comes to optimizations. They are generally of lower quality and worse at standard compliance. If you are one of just a few compiler vendors supporting a specific target, or perhaps the only vendor, then the lack of competition means that you don't have to worry much about quality. You can sell crap and charge a lot for it.
But even good compilers can struggle with such scenarios, especially multi-platform ones that don't know anything about how interrupts etc work specifically in "target x".
So you have the case where the good, multi-platform compiler is too generic to handle this bug. While at the same time, the bad, narrow compiler for "target x" is too poorly written to handle it, even though it supposedly knows all about how interrupts work on "target x".
How do I modify my code to accomplish this?
Make such globals volatile.
As the title suggests, what happens if I have:
void a(uint8_t i) {
b(i, 0);
}
Will a compiler be able to replace a call to a(i) with b(i, 0)?
Also, in either case, would the following be considered good practice to replace the above:
#define a(i) b(i, 0)
This is pretty easy to test. If the call to a is in the same compilation unit most compilers will optimize it. Let's see what happens:
$ cat > foo.c
void b(int, int);
void
a(int a)
{
b(a, 0);
}
void
foo(void)
{
a(17);
}
Then compile it to just assembler with some basic optimizations (I added omit-frame-pointer to create cleaner output, you can verify that exactly the same thing will happen without that flag):
$ cc -fomit-frame-pointer -S -O2 foo.c
And then look at the output (I cleaned it up and just kept the code; there are lots of annotations in the generated assembler that aren't relevant here):
$ cat foo.s
a:
xorl %esi, %esi
jmp b
foo:
xorl %esi, %esi
movl $17, %edi
jmp b
So we can see here that the compiler first generated a normal function a that calls b (except it's tail call optimized, so it's jmp instead of a call). Then when compiling foo instead of calling a it just inlined it.
The compiler I used in this case was a relatively old version of gcc, I also checked that clang does the exact same thing. This is pretty standard optimization and as long as the compiler does any inlining, a simple function like this will always be inlined.
It depends on a few things, not least of which is your choice of toolchain (compiler, linker, etc) and optimisation settings.
If the compiler has visibility of the definition of a() - not just a declaration - it might elect to inline a(). A compiler is not required to do that but, depending on optimisation settings and quality of implementation of the compiler itself, it might. Your case is, however, a fairly common and straightforward optimisation for modern compilers.
If the function is not declared static (which, very over-simplistically, makes it local to a particular compilation unit) then most compilers will still keep a definition of the function a() in the object file, so it can be linked in with other object files (for other compilation units), even if they choose to inline calls to the function within the compilation unit that defines it.
If the function is declared inline (and the compiler has visibility of the definition) the same actually applies. inline is a hint which the standard permits a compiler to ignore, no matter how adamant the programmer is. In practice, modern compilers can often do a better job of deciding which functions to inline than a programmer can.
If you have code that stores the address of a() (e.g. in a pointer to function) the compiler might elect to not inline it.
Even if the compiler does not inline the function, a smart linker might choose to (in effect) inline it. Most C implementations, however, use a traditional dumb linker as part of the toolchain - so this type of link-time optimisation is unlikely in practice.
Even if the linker doesn't, some virtual machine host environments might elect to inline at run time. This would be highly unusual for a C program but not beyond realms of possibility.
Personally, I wouldn't worry about it. There will be few observable differences (e.g. in program performance, size, etc) whether the compiler does this style of optimisation or not, unless you have a truly large number of such functions.
I would not use a macro. If you really don't want to type , 0 whenever you use b(), then simply write your function a(), and let the compiler worry about it. Only try to optimise further by hand if performance measures and profiling show your function a() is a performance hotspot. Which it probably won't be.
Or, use C++, and declare the function b() with a default value of 0 for the second argument. ;)
The compiler will most likely optimize this code, and make it an inline function:
inline void a(uint8_t i) {
b(i, 0);
}
So calls like a(i) will indeed be replaced with b(i, 0).
I am confused about inline in C99.
Here is what I want:
I want my function to get inlined everywhere, not just within one translation unit (or one compilation unit, a .c file).
I want the address of the function consistent. If I save the address of the function in a function pointer, I want the function callable from the pointer, and I don't want duplication of the same function in different translation units (basically, I mean no static inline).
C++ inline does exactly this.
But (and please correct me if I am wrong) in C99 there is no way to get this behavior.
I could use static inline, but it leads to duplication (the address of the same function is not the same in different translation units). I don't want this duplication.
So, here are my questions:
What is the idea behind inline in C99?
What benefits does this design give over C++'s approach?
References:
Here's a link that speaks highly of C99 inline, but I don't understand why. Is this “only in exactly one compilation unit” restriction really that nice? http://gustedt.wordpress.com/2010/11/29/myth-and-reality-about-inline-in-c99/
Here's the Rationale for C99 inline. I've read it, but I don't understand it. Is "inline" without "static" or "extern" ever useful in C99?
A nice post that provides strategies for using inline functions. http://www.greenend.org.uk/rjk/tech/inline.html
Answers Summary
How to get C++ inline behavior in C99 (Yes we can)
head.h
#ifndef __HEAD_H__
#define __HEAD_H__
inline int my_max(int x, int y) {
return (x>y) ? (x) : (y);
}
void call_and_print_addr();
#endif
src.c
#include "head.h"
#include <stdio.h>
// This is necessary! And it should occur in one and only one .c file
extern inline int my_max(int x, int y);
void call_and_print_addr() {
printf("%d %u\n", my_max(10, 100), (unsigned int)my_max);
}
main.c
#include <stdio.h>
#include "head.h"
int main() {
printf("%d %u\n", my_max(10, 100), (unsigned int)my_max);
call_and_print_addr();
return 0;
}
Compile it with: gcc -O3 main.c src.c -std=c99
Check the assembly with gcc -O3 -S main.c src.c -std=c99; you'll find that my_max is inlined in both call_and_print_addr() and main().
Actually, these are exactly the same instructions given by ref 1 and ref 3. So what went wrong for me?
I used too old a version of GCC (3.4.5) for my experiments; it gave me a “multiple definition of my_max” error message, and this is the real reason why I was so confused. Shame.
Difference between C99 and C++ inline
Actually, you can compile the example above with g++: g++ main.c src.c
extern inline int my_max(int x, int y);
is redundant in C++, but necessary in C99.
So what does it do in C99?
Again, use gcc -O3 -S main.c src.c -std=c99, you'll find something like this in src.s:
_my_max:
movl 4(%esp), %eax
movl 8(%esp), %edx
cmpl %eax, %edx
cmovge %edx, %eax
ret
.section .rdata,"dr"
If you cut extern inline int my_max(int x, int y); and paste it into main.c, you'll find this assembly code in main.s instead.
So, by extern inline, you tell the compiler where the true function my_max(), the one you can call through its address, will be defined and compiled.
Now look back at C++: we can't specify this. We will never know where my_max() will be, and this is the “vague linkage” described by #Potatoswatter.
As is said by #Adriano, most of the time, we don't care about this detail, but C99 really removes the ambiguity.
To get C++-like behavior, you need to give each TU with potentially-inlined calls an inline definition, and give one TU an externally-visible definition. This is exactly what is illustrated by Example 1 in the relevant section (Function specifiers) of the C standard. (In that example, external visibility is retroactively applied to an inline definition by declaring the function extern afterward: this declaration could be done in the .c file after the definition in the .h file, which turns usual usage on its head.)
If inlining could be accomplished literally everywhere, you wouldn't need the extern function. Non-inlined calls are used, however, in contexts such as recursion and referencing the function address. You may get "always inline" semantics, in a sense, by omitting the extern parts, however this can arbitrarily fail for any simple function call because the standard does not demand that a call be inlined just because there is no alternative. (This is the subject of the linked question.)
C++ handles this with the implementation concept of "vague linkage"; this isn't specified in the standard but it is very real, and tricky, inside the compiler. C compilers are supposed to be easier to write than C++; I believe this accounts for the difference between the languages.
I want my function to get inlined everywhere, not just limited to one translation unit (or one compilation unit, a .c file).
With inline you politely ask your compiler to inline your function (if it is in the mood). This is unrelated to any one compilation unit; at best the function may get inlined at every single call site and have no standalone body anywhere (its code duplicated everywhere instead). That is the purpose of inlining: speed in favor of size.
I want the address of the function consistent. If I save the address of the function in a function pointer, I want the function callable from the pointer, and I don't want duplication of the same function in different translation units. (Basically, I mean no static inline.)
Again, you can't have both. If a function call is inlined, there is no function at that site to point to. Of course the compiler will still need one compilation unit where the function body stays (because, well yes, you may need a function pointer, or the compiler may decide not to inline the function at a specific call site).
From your description it seems that static inline would be good. IMO it's not: a function body (when one is emitted, see the paragraph above) in each compilation unit leads to code duplication (and problems comparing function pointers, because each compilation unit will have its own version of your function). It's here that C99 did something pretty good: you declare exactly one place to put the function body (when and if one is required). The compiler won't pick it for you (in case you ever care about it), and nothing is left up to the implementor.
What is idea behind inline in C99?
Keep the good thing (inline functions) but remove the ambiguity (each C++ compiler did its own thing about where the function body has to stay).
What benefits does this design give over C++'s approach?
Honestly, I can't see such a big problem (even the article you linked is pretty vague about this benefit). With a modern compiler you won't see any issue and you will never care about it. Why is what C did good? IMO because it removed an ambiguity, even if - frankly speaking - I'd prefer that my compiler handle this for me when I don't care about it (99.999% of the time, I suppose).
That said (but I may be wrong), C and C++ have different targets. If you're using C (and not just C++ without classes and a few C++ features), then you probably want to address this kind of detail, because it matters in your context; so C and C++ had to diverge on it. There is not a better design: just a different decision for a different audience.