__attribute__((OS_main)) results in strange behaviour in AVR - c

I don't know how to precisely described the error I am seeing. If I set up my port register in main() everything works as intended. However if I try to do it in a function, the program halts.
main.c:
__attribute__((OS_main)) int main(void);
int main(void) {
DDRD = 0xF0;
PORTD = 0xF0;
led( LED_GREEN, true );
while( true );
}
This turns on the green LED. However if I move the port setup to a seperate function nothing happens, like so:
__attribute__((OS_main)) int main(void);
int main(void) {
hwInit();
led( LED_GREEN, true );
while( true );
}
The culprit seems to be the attribute line because if I comment it out the second example works as expected. My problem is understanding why, since as I understand it the OS_main attribute should only tell the compiler that it should not store any registers upon entry or exit to the function. Is this not correct?

The following was compiled with avr-gcc 4.8.0 under ArchLinux. The distribution should be irrelevant to the situation, compiler and compiler version however, may produce different outputs. The code:
#include <avr/io.h>
#define LED_GREEN PD7
#define led(p, s) { if(s) PORTD |= _BV(p); \
else PORTD &= _BV(p); }
__attribute__((OS_main)) int main(void);
__attribute__((noinline)) void hwInit(void);
void hwInit(void){
DDRD = 0xF0;
}
int main(void){
hwInit();
led(LED_GREEN, 1);
while(1);
}
Generates:
000000a4 <hwInit>:
a4: 80 ef ldi r24, 0xF0 ; 240
a6: 8a b9 out 0x0a, r24 ; 10
a8: 08 95 ret
000000aa <main>:
aa: 0e 94 52 00 call 0xa4 ; 0xa4 <hwInit>
ae: 5f 9a sbi 0x0b, 7 ; 11
b0: ff cf rjmp .-2 ; 0xb0 <main+0x6>
when compiled with
avr-gcc -Wall -Os -fpack-struct -fshort-enums -std=gnu99 -funsigned-char -funsigned-bitfields -mmcu=atmega168 -DF_CPU=1000000UL -MMD -MP -MF"core.d" -MT"core.d" -c -o "core.o" "../core.c" and linked accordingly.
Commenting __attribute__((OS_main)) int main(void); from the above sources has no effect on the generated assembly. Curiously, however, removing the noinline directive from hwInit() has the effect of the compiler inlining the function into main, as expected, but the function itself still exists as part of the final binary, even when compiled with -Os.
This leads me to believe that your compiler version/arguments to the compiler are generating some assembly which is not correct. If possible, can you post the disassembly for the relevant areas for further examination?
Edited late to add two contributions, second of which resolves the problem at hand:
Hanno Binder states: "In order to remove those 'unused' functions from the binary you would also need -ffunction-sections -Wl,--gc-sections."
Asker adds [paraphrased]: "I followed a tutorial which neglected to mention the avr-objcopy step to create the hex file. I assumed that compiling and linking the project for the correct target was sufficient (which it for some reason was for basic functionality). After having added an avr-objcopy step to generate the file everything works."

Related

Why Interrupts not generates by C code but easy generates by assembly instructions?

I am programming a little kernel, and implement idt and interrupts.
This C code in my little kernel not generate any interrupt:
int x = 5/0;
int f[4];
f[5] = 8;
But this Assembly code can generate any interrupt:
asm("int $0");
(and handlers work right).
Help me to understand why this situation can happens.
I also tried this:
int a = 3;
int b = 3;
int c = a-b;
int x = a/c;
Nothing I try in c code can generate exception for me.
Even this not worked:
int div_by_0(int a, int b){return a/b;}
int x = div_by_0(5, 0);
void fun ( void )
{
int a = 3;
int b = 3;
int c = a-b;
int x = a/c;
}
Disassembly of section .text:
0000000000000000 <fun>:
0: f3 c3 repz retq
there is no divide to trigger a divide by zero. It is all dead code.
And none of this has anything to do with the int instruction, these are completely separate topics.
As mentioned in the comments test it without using dead code.
int fun0 ( int x )
{
return(5/x);
}
int fun1 ( void )
{
return(fun0(0));
}
but understand that it still may not have the desired effect:
Disassembly of section .text:
0000000000000000 <fun0>:
0: b8 05 00 00 00 mov $0x5,%eax
5: 99 cltd
6: f7 ff idiv %edi
8: c3 retq
9: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
0000000000000010 <fun1>:
10: 0f 0b ud2
because the optimizer for fun1 could see the fun0 function. You want to have the code under test in a separate optimization domain. In this case above then the idiv would generate the divide by zero. And then it is becomes an operating system issue as to how that is handled and if it is visible to you.
The problem you are seeing is because division by 0 is undefined behaviour in C/C++. The compiler has managed to do enough optimization at compile time to realize you are dividing by zero. The compiler is free to do anything from things like halting and catching fire to making the result 0. Some compilers will emit a ud2 instruction to raise a CPU exception. The result is undefined.
You have a couple of options. Write your division in assembly and call that function from C/C++. Since you are using GCC (works for CLANG as well) You can also use inline assembly to generate a division by zero with something like:
#include <stdint.h> /* or replace uint16_t with unsigned short int */
void div_by_0 (void)
{
asm ("div %b0" :: "a"((uint16_t)0));
return;
}
This sets AX to 0 then divides AX by AL with the DIV instruction. 0/0 is undefined and will raise a Division Exception (#DE). This inline assembly should work with 16, 32, and 64-bit code.
In protected mode or long mode using int $# (Where # is the vector number) to trigger an exception is not always the same as getting a CPU generated exception. Some exceptions generated by the CPU push an error code on the stack after the return address that needs to be cleaned up by an interrupt handler. If you were to use int $0x0d from ring 0 to cause a #GP exception the interrupt handler would likely fault as it returns from the interrupt because using int to generate an exception never places an error code on the stack. This isn't a problem with int $0 because #DE doesn't have an error code placed on the stack by the CPU.
It turned out to be due to optimization flags. Due to a bit of confusion at Makefiles, the -O2 flag worked. If you enable the -O0 flag, exceptions work directly from C. And even this simple code throws an exceptions:
int x = 5/0;

Why getting wrong results when performing pointer arithmetic in C on dynamically linked symbols?

I encountered a weird situation where performing pointer arithmetic involving
dynamically linked symbols leads to incorrect results. I'm unsure if there
are simply missing some linker parameters or if it's a linker bug. Can someone
explain what's wrong in the following example?
Consider the following code (lib.c) of a simple shared library:
#include <inttypes.h>
#include <stdio.h>
uintptr_t getmask()
{
return 0xffffffff;
}
int fn1()
{
return 42;
}
void fn2()
{
uintptr_t mask;
uintptr_t p;
mask = getmask();
p = (uintptr_t)fn1 & mask;
printf("mask: %08x\n", mask);
printf("fn1: %p\n", fn1);
printf("p: %08x\n", p);
}
The operation in question is the bitwise AND between the address of fn1 and
the variable mask. The application (app.c) just calls fn2 like that:
extern int fn2();
int main()
{
fn2();
return 0;
}
It leads to the following output ...
mask: ffffffff
fn1: 0x2aab43c0
p: 000003c0
... which is obviously incorrect, because the same result is expected for fn1
and p. The code runs on an AVR32 architecture and is compiled as follows:
$ avr32-linux-uclibc-gcc -Os -Wextra -Wall -c -o lib.o lib.c
$ avr32-linux-uclibc-gcc -Os -Wextra -Wall -shared -o libfoo.so lib.o
$ avr32-linux-uclibc-gcc -Os -Wextra -Wall -o app app.c -L. -lfoo
The compiler thinks, it is the optimal solution to load the variable
mask into 32 bit register 7 and splitting the &-operation into two assembler
operations with immediate operands.
$ avr32-linux-uclibc-objdump -d libfoo.so
000003ce <fn1>:
3ce: 32 ac mov r12,42
3d0: 5e fc retal r12
000003d2 <fn2>:
...
3f0: e4 17 00 00 andh r7,0x0
3f4: e0 17 03 ce andl r7,0x3ce
I assume the immediate operands of the and instructions are not relocated
to the loading address of fn1 when the shared library is loaded into the
applications address space:
Is this behaviour intentional?
How can I investigate whether problem occurs when linking the shared library or when loading the executable?
Background: This is not an academic questions. OpenSSL and LibreSSL
use similar code, so changing the C source is not an option. The code runs
well on other architectures and certainly there is an unapparent reason for
doing bitwise operations on function pointers.
after correcting all the 'slopiness' in the code, the result is:
#include <inttypes.h>
#include <stdio.h>
int fn1( void );
void fn2( void );
uintptr_t getmask( void );
int main( void )
{
fn2();
return 0;
}
uintptr_t getmask()
{
return 0xffffffff;
}
int fn1()
{
return 42;
}
void fn2()
{
uintptr_t mask;
uintptr_t p;
mask = getmask();
p = (uintptr_t)fn1 & mask;
printf("mask: %08x\n", (unsigned int)mask);
printf("fn1: %p\n", fn1);
printf("p: %08x\n", (unsigned int)p);
}
and the output (on my linux 64bit computer) is:
mask: ffffffff
fn1: 0x4007c1
p: 004007c1

C linker and duplicate symbols

I did an experiment to see what kind of assembly language would be generate if I try to get the same function to compile in there twice. I did the following:
I created two simple test files and their corresponding headers. Let's call them a.c/a.h, and b.c/b.h. Here are the contents of those files:
a.h:
#ifndef __A_H__
#define __A_H__
int a( void );
#endif
b.h:
#ifndef __B_H__
#define __B_H__
int b( void );
#endif
a.c:
#include "a.h"
int a( void )
{
return 1;
}
b.c:
#include "b.h"
#include "a.h"
int b( void )
{
return 1 + a();
}
I then created a static archive for a:
gcc -c a.c -o a.o
ar -rsc a.a a.o
and the same for b, including the static archive for a this time:
gcc -c b.c -o b.o
ar -rsc b.a a.a b.o
At this point, I disassemble the static archive for b to verify that it has assembly code for both functions a() and b(). It does.
Now, I define one last file:
main.c:
#include <stdio.h>
#include "a.h"
#include "b.h"
int main( void )
{
printf( "%d %d\n", a(), b() );
return 0;
}
and I compile it thusly:
gcc main.c a.a b.a -o main
This works fine. When I disassemble it, I see the following definitions for a and b in the code:
140 0000000000400561 <a>:
141 400561: 55 push %rbp
142 400562: 48 89 e5 mov %rsp,%rbp
143 400565: b8 01 00 00 00 mov $0x1,%eax
144 40056a: 5d pop %rbp
145 40056b: c3 retq
146
147 000000000040056c <b>:
148 40056c: 55 push %rbp
149 40056d: 48 89 e5 mov %rsp,%rbp
150 400570: e8 ec ff ff ff callq 400561 <a>
151 400575: 83 c0 01 add $0x1,%eax
152 400578: 5d pop %rbp
153 400579: c3 retq
154 40057a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
As you can see, the code has clearly defined b as calling a rather than inlining it, however, there is only one definition of a in the code, no duplicates.
It seems that gcc has either:
Detected the duplicate object code and removed the duplicates
--or--
the b archive was used first, and it included the reference to int a(), so the a archive was ignored.
My question is: is this behavior circumstantial to my test or is it standard, and can I expect the same behavior from other compilers? Obviously duplicate code is one problem, however there could be duplicate global references as well. Is it safe/good practice to build a large application that has multiple dependency paths to the same static archive? Are there less obvious situations than just duplicate symbol names where issues can arise when doing this?
Asking this because I've been playing with this idea for a project I'm on, and want to make the right choices.
My question is: is this behavior circumstantial to my test or is it standard, and can I expect the same behavior from other compilers?
As far as the compiler itself is concerned, there is no issue: you have one definition for each function among your sources.
As far as ar is concerned, you also have no issue: neither of the archives you built contains any duplicate symbols.
Different linkers may exhibit different behaviors, however. It is conceivable that some would reject linking archives that contain duplicate external symbols. Typical UNIX linkers will handle the situation you present, but they may vary in some details, such as whether a duplicate copy of function a() is included in the binary.
Obviously duplicate code is one problem, however there could be duplicate global references as well. Is it safe/good practice to build a large application that has multiple dependency paths to the same static archive?
"Multiple paths to the same static archive" does not seem to be a good characterization of the situation you present. In neither case do you provide the same archive more than once. Rather, in the b case you provide different archives with duplicate members. Linkers generally do not have problems with specifying the same archive multiple times in the same link command. Under some circumstances it may even be necessary to do so; it should not present a problem.
Providing distinct archives with duplicate members probably will not present a problem, except possibly for bloating your code with duplicate function implementations. This is a bit less certain, but I doubt it would present a problem in practice.
Whether that's good practice is a matter of opinion, but I'm inclined to think not. It's also not clear to me what gain you seen in such an approach. On the other hand, I won't be sharpening any stakes or preparing any kindling if you decide to go ahead anyway.

How to change interpreter path and pass command line arguments to an "executable" shared library on Linux?

Here is a minimal example for an "executable" shared library (assumed file name: mini.c):
// Interpreter path is different on some systems
//+definitely different for 32-Bit machines
const char my_interp[] __attribute__((section(".interp")))
= "/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2";
#include <stdio.h>
#include <stdlib.h>
int entry() {
printf("WooFoo!\n");
exit (0);
}
If one compiles it with e.g.: gcc -fPIC -o mini.so -shared -Wl,-e,entry mini.c. "Running" the resulting .so will look like this:
confus#confusion:~$ ./mini.so
WooFoo!
My question is now:
How do I have to change the above program to pass command line arguments to a call of the .so-file? An example shell session after the change might e.g. look like this:
confus#confusion:~$ ./mini.so 2 bar
1: WooFoo! bar!
2: WooFoo! bar!
confus#confusion:~$ ./mini.so 3 bla
1: WooFoo! bla!
2: WooFoo! bla!
3: WooFoo! bla!
5: WooFoo! Bar!
It would also be nice to detect on compile time, wheter the target is a 32-Bit or 64-Bit binary to change the interpreter string accordingly. Otherwise one gets a "Accessing a corrupted shared library" warning. Something like:
#ifdef SIXTY_FOUR_BIT
const char my_interp[] __attribute__((section(".interp"))) = "/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2";
#else
const char my_interp[] __attribute__((section(".interp"))) = "/lib/ld-linux.so.2";
#endif
Or even better, to detect the appropriate path fully automatically to ensure it is right for the system the library is compiled on.
How do I have to change the above program to pass command line arguments to a call of the .so-file?
When you run your shared library, argc and argv will be passed to your entry function on the stack.
The problem is that the calling convention used when you compile your shared library on x86_64 linux is going to be that of the System V AMD64 ABI, which doesn't take arguments on the stack but in registers.
You'll need some ASM glue code that fetches argument from the stack and puts them into the right registers.
Here's a simple .asm file you can save as entry.asm and just link with:
global _entry
extern entry, _GLOBAL_OFFSET_TABLE_
section .text
BITS 64
_entry:
mov rdi, [rsp]
mov rsi, rsp
add rsi, 8
call .getGOT
.getGOT:
pop rbx
add rbx,_GLOBAL_OFFSET_TABLE_+$$-.getGOT wrt ..gotpc
jmp entry wrt ..plt
That code copies the arguments from the stack into the appropriate registers, and then calls your entry function in a position-independent way.
You can then just write your entry as if it was a regular main function:
// Interpreter path is different on some systems
//+definitely different for 32-Bit machines
const char my_interp[] __attribute__((section(".interp")))
= "/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2";
#include <stdio.h>
#include <stdlib.h>
int entry(int argc, char* argv[]) {
printf("WooFoo! Got %d args!\n", argc);
exit (0);
}
And this is how you would then compile your library:
nasm entry.asm -f elf64
gcc -fPIC -o mini.so -shared -Wl,-e,_entry mini.c entry.o
The advantage is that you won't have inline asm statements mixed with your C code, instead your real entry point is cleanly abstracted away in a start file.
It would also be nice to detect on compile time, wheter the target is a 32-Bit or 64-Bit binary to change the interpreter string accordingly.
Unfortunately, there's no completely clean, reliable way to do that. The best you can do is rely on your preferred compiler having the right defines.
Since you use GCC you can write your C code like this:
#if defined(__x86_64__)
const char my_interp[] __attribute__((section(".interp")))
= "/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2";
#elif defined(__i386__)
const char my_interp[] __attribute__((section(".interp")))
= "/lib/ld-linux.so.2";
#else
#error Architecture or compiler not supported
#endif
#include <stdio.h>
#include <stdlib.h>
int entry(int argc, char* argv[]) {
printf("%d: WooFoo!\n", argc);
exit (0);
}
And have two different start files.
One for 64bit:
global _entry
extern entry, _GLOBAL_OFFSET_TABLE_
section .text
BITS 64
_entry:
mov rdi, [rsp]
mov rsi, rsp
add rsi, 8
call .getGOT
.getGOT:
pop rbx
add rbx,_GLOBAL_OFFSET_TABLE_+$$-.getGOT wrt ..gotpc
jmp entry wrt ..plt
And one for 32bit:
global _entry
extern entry, _GLOBAL_OFFSET_TABLE_
section .text
BITS 32
_entry:
mov edi, [esp]
mov esi, esp
add esi, 4
call .getGOT
.getGOT:
pop ebx
add ebx,_GLOBAL_OFFSET_TABLE_+$$-.getGOT wrt ..gotpc
push edi
push esi
jmp entry wrt ..plt
Which means you now have two slightly different ways to compile your library for each target.
For 64bit:
nasm entry.asm -f elf64
gcc -fPIC -o mini.so -shared -Wl,-e,_entry mini.c entry.o -m64
And for 32bit:
nasm entry32.asm -f elf32
gcc -fPIC -o mini.so -shared -Wl,-e,_entry mini.c entry32.o -m32
So to sum it up you now have two start files entry.asm and entry32.asm, a set of defines in your mini.c that picks the right interpreter automatically, and two slightly different ways of compiling your library depending on the target.
So if we really want to go all the way, all that's left is to create a Makefile that detects the right target and builds your library accordingly.Let's do just that:
ARCH := $(shell getconf LONG_BIT)
all: build_$(ARCH)
build_32:
nasm entry32.asm -f elf32
gcc -fPIC -o mini.so -shared -Wl,-e,_entry mini.c entry32.o -m32
build_64:
nasm entry.asm -f elf64
gcc -fPIC -o mini.so -shared -Wl,-e,_entry mini.c entry.o -m64
And we're done here. Just run make to build your library and let the magic happen.
Add
int argc;
char **argv;
asm("mov 8(%%rbp), %0" : "=&r" (argc));
asm("mov %%rbp, %0\n"
"add $16, %0" : "=&r" (argv));
to the top of your entry function. On x86_64 platforms, this will give you access to the arguments.
The LNW article that John Bollinger linked to in the comments explains why this code works. It might interest you why this is not required when you write a normal C program, or rather, why it does not suffice do just give your entry function the two usual int argc, char **argv arguments: The entry point for a C program normally is not the main function, but instead an assembler function by glibc that does some preparations for you - among others fetch the arguments from the stack - and that eventually (via some intermediate functions) calls your main function. Note that this also means that you might experience other problems, since you skip this initialization! For some history, the cdecl wikipedia page, especially on the difference between x86 and x86_64, might be of further interest.

(How) Can I inline a particular function call?

Let's say that I have a function that gets called in multiple parts of a program. Let's also say that I have a particular call to that function that is in an extremely performance-sensitive section of code (e.g., a loop that iterates tens of millions of times and where each microsecond counts). Is there a way that I can force the complier (gcc in my case) to inline that single, particular function call, without inlining the others?
EDIT: Let me make this completely clear: this question is NOT about forcing gcc (or any other compiler) to inline all calls to a function; rather, it it about requesting that the compiler inline a particular call to a function.
In C (as opposed to C++) there's no standard way to suggest that a function should be inlined. It's only vender-specific extensions.
However you specify it, as far as I know the compiler will always try to inline every instance, so use that function only once:
original:
int MyFunc() { /* do stuff */ }
change to:
inline int MyFunc_inlined() { /* do stuff */ }
int MyFunc() { return MyFunc_inlined(); }
Now, in theplaces where you want it inlined, use MyFunc_inlined()
Note: "inline" keyword in the above is just a placeholder for whatever syntax gcc uses to force an inlining. If H2CO3's deleted answer is to be trusted, that would be:
static inline __attribute__((always_inline)) int MyFunc_inlined() { /* do stuff */ }
It is possible to enable inlining per translation unit (but not per call). Though this is not an answer for the question and is an ugly trick, it conforms to C standard and may be interesting as related stuff.
The trick is to use extern definition where you do not want to inline, and extern inline where you need inlining.
Example:
$ cat func.h
int func();
$ cat func.c
int func() { return 10; }
$ cat func_inline.h
extern inline int func() { return 5; }
$ cat main.c
#include <stdio.h>
#ifdef USE_INLINE
# include "func_inline.h"
#else
# include "func.h"
#endif
int main() { printf("%d\n", func()); return 0; }
$ gcc main.c func.c && ./a.out
10 // non-inlined version
$ gcc main.c func.c -DUSE_INLINE && ./a.out
10 // non-inlined version
$ gcc main.c func.c -DUSE_INLINE -O2 && ./a.out
5 // inlined!
You can also use non-standard attribute (e.g. __attribute__(always_inline)) in GCC) for extern inline definition, instead of relying on -O2.
BTW, the trick is used in glibc.
the traditional way to force inline a function in C was to not use a function at all, but to use a function like macro. This method will always inline the function, but there are some problems with function like macros. For example:
#define ADD(x, y) ((x) + (y))
printf("%d\n", ADD(2, 2));
There is also the inline keyword, which was added to C in the C99 standard. Notably, Microsoft's Visual C compiler doesn't support C99, and thus you can't use inline with that (miserable) compiler. Inline only hints to the compiler that you want the function inlined - it does not guarantee it.
GCC has an extension which requires the compiler to inline the function.
inline __attribute__((always_inline)) int add(int x, int y) {
return x + y;
}
To make this cleaner, you may want want to use a macro:
#define ALWAYS_INLINE inline __attribute__((always_inline))
ALWAYS_INLINE int add(int x, int y) {
return x + y;
}
I don't know of a direct way of having a function that can be force inlined on certain calls. But you can combine the techniques like this:
#define ALWAYS_INLINE inline __attribute__((always_inline))
#define ADD(x, y) ((x) + (y))
ALWAYS_INLINE int always_inline_add(int x, int y) {
return ADD(x, y);
}
int normal_add(int x, int y) {
return ADD(x, y);
}
Or, you could just have this:
#define ADD(x, y) ((x) + (y))
int add(int x, int y) {
return ADD(x, y);
}
int main() {
printf("%d\n", ADD(2,2)); // always inline
printf("%d\n", add(2,2)); // normal function call
return 0;
}
Also, note that forcing the inline of a function might not make your code faster. Inline functions cause larger code to be generated, which might cause more cache misses to occur.
I hope that helps.
The answer is it depends on your function, what you request and the nature of your function. Your best bet is to:
tell the compiler you want it inlined
make the function static (be careful with extern as it's semantics change a little in gcc in some modes)
set the compiler options to inform the optimizer you want inlining, and set inline limits appropriately
turn on any couldn't inline warnings on the compiler
verify the output (you could check the assembler generated) that the function is in-lined.
Compiler hints
The answers here cover just one side of inlining, the language hints to the compiler. When the standard says:
Making a function an inline function suggests that calls to the function be as
fast as possible. The extent to which such suggestions are effective is
implementation-defined
This can be the case for other stronger hints such as:
GNU's __attribute__((always_inline)): Generally, functions are not inlined unless optimization is specified. For functions declared inline, this attribute inlines the function even if no optimization level was specified.
Microsoft's __forceinline: The __forceinline keyword overrides the cost/benefit analysis and relies on the judgment of the programmer instead. Exercise caution when using __forceinline. Indiscriminate use of __forceinline can result in larger code with only marginal performance gains or, in some cases, even performance losses (due to increased paging of a larger executable, for example).
Even both of these would rely on the inlining being possible, and crucially on compiler flags. To work with inlined functions you also need to understand the optimisation settings of your compiler.
It may be worth saying inlining can also be used to provide replacements for existing functions just for the compilation unit you are in. This can be used when an approximate answers are good enough for your algorithm, or a result can be achieved in a faster way with local data-structures.
An inline definition
provides an alternative to an external definition, which a translator may use to implement
any call to the function in the same translation unit. It is unspecified whether a call to the
function uses the inline definition or the external definition.
Some functions cannot be inlined
For example, for the GNU compiler functions that cannot be inlined are:
Note that certain usages in a function definition can make it unsuitable for inline substitution. Among these usages are: variadic functions, use of alloca, use of variable-length data types (see Variable Length), use of computed goto (see Labels as Values), use of nonlocal goto, and nested functions (see Nested Functions). Using -Winline warns when a function marked inline could not be substituted, and gives the reason for the failure.
So even always_inline may not do what you expect.
Compiler Options
Using C99's inline hints will rely on you instructing the compiler the inline behavour you are looking for.
GCC for instance has:
-fno-inline, -finline-small-functions, -findirect-inlining, -finline-functions, -finline-functions-called-once, -fearly-inlining, -finline-limit=n
Microsoft compiler also has options that dictate the effectiveness of inline. Some compilers will also allow optimization to take into account running profile.
I do think it's worth seeing inlining in the broader context of program optimization.
Preventing Inlining
You mention that you don't want certain functions inlined. This might be done by setting something like __attribute__((always_inline)) without turning on the optimizer. However you would probably would want the optimizer. One option here would be to hint you don't want it: __attribute__ ((noinline)). But why would this be the case?
Other forms of optimization
You may also consider how you might restructure your loop and avoiding branches. Branch prediction can have a dramatic effect. For an interesting discussion on this see: Why is it faster to process a sorted array than an unsorted array?
Then you also might smaller inner loops to be unrolled and to look at invariants.
There's a kernel source that uses #defines in a very interesting way to define several different named functions with the same body. This solves the problem of having two different functions to maintain. (I forgot which one it was...). My idea is based on this same principle.
The way to use the defines is that you'll define the inline function on the compilation unit you need it. To demonstrate the method I'll use a simple function:
int add(int a, int b);
It works like this: you make a function generator #define in a header file and declare the function prototype of the normal version of the function (the one not inlined).
Then you declare two separate function generators, one for the normal function and one for the inline function. The inline function you declare as static __inline__. When you need to call the inline function in one of your files, you use the generator define to get the source for it. In all other files you need to use the normal function, you just include the header with the prototype.
The code was tested on:
Intel(R) Core(TM) i5-3330 CPU # 3.00GHz
Kernel Version: 3.16.0-49-generic
GCC 4.8.4
Code is worth more than a thousand words, so:
File Hierarchy
+
| Makefile
| add.h
| add.c
| loop.c
| loop2.c
| loop3.c
| loops.h
| main.c
add.h
#define GENERATE_ADD(type, prefix) \
type int prefix##add(int a, int b) { return a + b; }
#define DEFINE_ADD() GENERATE_ADD(,)
#define DEFINE_INLINE_ADD() GENERATE_ADD(static __inline__, inline_)
int add(int, int);
This doesn't look nice, but cuts the work of maintaining two different functions. The function is fully defined within the GENERATE_ADD(type,prefix) macro, so if you ever need to change the function, you change this macro and everything else changes.
Next, DEFINE_ADD() will be called from add.c to generate the normal version of add. DEFINE_INLINE_ADD() will give you access to a function called inline_add, which has the same signature as your normal addfunction, but it has a different name (the inline_ prefix).
Note: I didn't use the __attribute((always_inline))__ when using the -O3 flag - the __inline__ did the job. However, if you don't wanna use -O3, use:
#define DEFINE_INLINE_ADD() GENERATE_ADD(static __inline__ __attribute__((always_inline)), inline_)
add.c
#include "add.h"
DEFINE_ADD()
Simple call to the DEFINE_ADD() macro generator. This will declare the normal version of the function (the one that won't get inlined).
loop.c
#include <stdio.h>
#include "add.h"
DEFINE_INLINE_ADD()
int loop(void)
{
register int i;
for (i = 0; i < 100000; i++)
printf("%d\n", inline_add(i + 1, i + 2));
return 0;
}
Here in loop.c you can see the call to DEFINE_INLINE_ADD(). This gives this function access to the inline_add function. When you compile, all inline_add function will be inlined.
loop2.c
#include <stdio.h>
#include "add.h"
int loop2(void)
{
register int i;
for (i = 0; i < 100000; i++)
printf("%d\n", add(i + 1, i + 2));
return 0;
}
This is to show you can use the normal version of add normally from other files.
loop3.c
#include <stdio.h>
#include "add.h"
DEFINE_INLINE_ADD()
int loop3(void)
{
register int i;
printf ("add: %d\n", add(2,3));
printf ("add: %d\n", add(4,5));
for (i = 0; i < 100000; i++)
printf("%d\n", inline_add(i + 1, i + 2));
return 0;
}
This is to show that you can use both the functions in the same compilation unit, yet one of the functions will be inlined, and the other wont (see GDB disass bellow for details).
loops.h
/* prototypes for main */
int loop (void);
int loop2 (void);
int loop3 (void);
main.c
#include <stdio.h>
#include <stdlib.h>
#include "add.h"
#include "loops.h"
int main(void)
{
printf("%d\n", add(1,2));
printf("%d\n", add(2,3));
loop();
loop2();
loop3();
return 0;
}
Makefile
CC=gcc
CFLAGS=-Wall -pedantic --std=c11
main: add.o loop.o loop2.o loop3.o main.o
${CC} -o $# $^ ${CFLAGS}
add.o: add.c
${CC} -c $^ ${CFLAGS}
loop.o: loop.c
${CC} -c $^ -O3 ${CFLAGS}
loop2.o: loop2.c
${CC} -c $^ ${CFLAGS}
loop3.o: loop3.c
${CC} -c $^ -O3 ${CFLAGS}
If you use the __attribute__((always_inline)) you can change the Makefile to:
CC=gcc
CFLAGS=-Wall -pedantic --std=c11
main: add.o loop.o loop2.o loop3.o main.o
${CC} -o $# $^ ${CFLAGS}
%.o: %.c
${CC} -c $^ ${CFLAGS}
Compilation
$ make
gcc -c add.c -Wall -pedantic --std=c11
gcc -c loop.c -O3 -Wall -pedantic --std=c11
gcc -c loop2.c -Wall -pedantic --std=c11
gcc -c loop3.c -O3 -Wall -pedantic --std=c11
gcc -Wall -pedantic --std=c11 -c -o main.o main.c
gcc -o main add.o loop.o loop2.o loop3.o main.o -Wall -pedantic --std=c11
Disassembly
$ gdb main
(gdb) disass add
0x000000000040059d <+0>: push %rbp
0x000000000040059e <+1>: mov %rsp,%rbp
0x00000000004005a1 <+4>: mov %edi,-0x4(%rbp)
0x00000000004005a4 <+7>: mov %esi,-0x8(%rbp)
0x00000000004005a7 <+10>:mov -0x8(%rbp),%eax
0x00000000004005aa <+13>:mov -0x4(%rbp),%edx
0x00000000004005ad <+16>:add %edx,%eax
0x00000000004005af <+18>:pop %rbp
0x00000000004005b0 <+19>:retq
(gdb) disass loop
0x00000000004005c0 <+0>: push %rbx
0x00000000004005c1 <+1>: mov $0x3,%ebx
0x00000000004005c6 <+6>: nopw %cs:0x0(%rax,%rax,1)
0x00000000004005d0 <+16>:mov %ebx,%edx
0x00000000004005d2 <+18>:xor %eax,%eax
0x00000000004005d4 <+20>:mov $0x40079d,%esi
0x00000000004005d9 <+25>:mov $0x1,%edi
0x00000000004005de <+30>:add $0x2,%ebx
0x00000000004005e1 <+33>:callq 0x4004a0 <__printf_chk#plt>
0x00000000004005e6 <+38>:cmp $0x30d43,%ebx
0x00000000004005ec <+44>:jne 0x4005d0 <loop+16>
0x00000000004005ee <+46>:xor %eax,%eax
0x00000000004005f0 <+48>:pop %rbx
0x00000000004005f1 <+49>:retq
(gdb) disass loop2
0x00000000004005f2 <+0>: push %rbp
0x00000000004005f3 <+1>: mov %rsp,%rbp
0x00000000004005f6 <+4>: push %rbx
0x00000000004005f7 <+5>: sub $0x8,%rsp
0x00000000004005fb <+9>: mov $0x0,%ebx
0x0000000000400600 <+14>:jmp 0x400625 <loop2+51>
0x0000000000400602 <+16>:lea 0x2(%rbx),%edx
0x0000000000400605 <+19>:lea 0x1(%rbx),%eax
0x0000000000400608 <+22>:mov %edx,%esi
0x000000000040060a <+24>:mov %eax,%edi
0x000000000040060c <+26>:callq 0x40059d <add>
0x0000000000400611 <+31>:mov %eax,%esi
0x0000000000400613 <+33>:mov $0x400794,%edi
0x0000000000400618 <+38>:mov $0x0,%eax
0x000000000040061d <+43>:callq 0x400470 <printf#plt>
0x0000000000400622 <+48>:add $0x1,%ebx
0x0000000000400625 <+51>:cmp $0x1869f,%ebx
0x000000000040062b <+57>:jle 0x400602 <loop2+16>
0x000000000040062d <+59>:mov $0x0,%eax
0x0000000000400632 <+64>:add $0x8,%rsp
0x0000000000400636 <+68>:pop %rbx
0x0000000000400637 <+69>:pop %rbp
0x0000000000400638 <+70>:retq
(gdb) disass loop3
0x0000000000400640 <+0>: push %rbx
0x0000000000400641 <+1>: mov $0x3,%esi
0x0000000000400646 <+6>: mov $0x2,%edi
0x000000000040064b <+11>:mov $0x3,%ebx
0x0000000000400650 <+16>:callq 0x40059d <add>
0x0000000000400655 <+21>:mov $0x400798,%esi
0x000000000040065a <+26>:mov %eax,%edx
0x000000000040065c <+28>:mov $0x1,%edi
0x0000000000400661 <+33>:xor %eax,%eax
0x0000000000400663 <+35>:callq 0x4004a0 <__printf_chk#plt>
0x0000000000400668 <+40>:mov $0x5,%esi
0x000000000040066d <+45>:mov $0x4,%edi
0x0000000000400672 <+50>:callq 0x40059d <add>
0x0000000000400677 <+55>:mov $0x400798,%esi
0x000000000040067c <+60>:mov %eax,%edx
0x000000000040067e <+62>:mov $0x1,%edi
0x0000000000400683 <+67>:xor %eax,%eax
0x0000000000400685 <+69>:callq 0x4004a0 <__printf_chk#plt>
0x000000000040068a <+74>:nopw 0x0(%rax,%rax,1)
0x0000000000400690 <+80>:mov %ebx,%edx
0x0000000000400692 <+82>:xor %eax,%eax
0x0000000000400694 <+84>:mov $0x40079d,%esi
0x0000000000400699 <+89>:mov $0x1,%edi
0x000000000040069e <+94>:add $0x2,%ebx
0x00000000004006a1 <+97>:callq 0x4004a0 <__printf_chk#plt>
0x00000000004006a6 <+102>:cmp $0x30d43,%ebx
0x00000000004006ac <+108>:jne 0x400690 <loop3+80>
0x00000000004006ae <+110>:xor %eax,%eax
0x00000000004006b0 <+112>:pop %rbx
0x00000000004006b1 <+113>:retq
Symbol table
$ objdump -t main | grep add
0000000000000000 l df *ABS* 0000000000000000 add.c
000000000040059d g F .text 0000000000000014 add
$ objdump -t main | grep loop
0000000000000000 l df *ABS* 0000000000000000 loop.c
0000000000000000 l df *ABS* 0000000000000000 loop2.c
0000000000000000 l df *ABS* 0000000000000000 loop3.c
00000000004005c0 g F .text 0000000000000032 loop
00000000004005f2 g F .text 0000000000000047 loop2
0000000000400640 g F .text 0000000000000072 loop3
$ objdump -t main | grep main
main: file format elf64-x86-64
0000000000000000 l df *ABS* 0000000000000000 main.c
0000000000000000 F *UND* 0000000000000000 __libc_start_main##GLIBC_2.2.5
00000000004006b2 g F .text 000000000000005a main
$ objdump -t main | grep inline
$
Well, that's it. After 3 hours of banging my head in the keyboard trying to figure it out, this was the best I could come up with. Feel free to point any errors, I'll really appreciate it. I got really interested in this particular inline one function call.
If you do not mind having two names for the same function, you could create a small wrapper around your function to "block" the always_inline attribute from affecting every call. In my example, loop_inlined would be the name you would use in performance-critical sections, while the plain loop would be used everywhere else.
inline.h
#include <stdlib.h>
static inline int loop_inlined() __attribute__((always_inline));
int loop();
static inline int loop_inlined() {
int n = 0, i;
for(i = 0; i < 10000; i++)
n += rand();
return n;
}
inline.c
#include "inline.h"
int loop() {
return loop_inlined();
}
main.c
#include "inline.h"
#include <stdio.h>
int main(int argc, char *argv[]) {
printf("%d\n", loop_inlined());
printf("%d\n", loop());
return 0;
}
This works regardless of the optimization level. Compiling with gcc inline.c main.c on Intel gives:
4011e6: c7 44 24 18 00 00 00 movl $0x0,0x18(%esp)
4011ed: 00
4011ee: eb 0e jmp 4011fe <_main+0x2e>
4011f0: e8 5b 00 00 00 call 401250 <_rand>
4011f5: 01 44 24 1c add %eax,0x1c(%esp)
4011f9: 83 44 24 18 01 addl $0x1,0x18(%esp)
4011fe: 81 7c 24 18 0f 27 00 cmpl $0x270f,0x18(%esp)
401205: 00
401206: 7e e8 jle 4011f0 <_main+0x20>
401208: 8b 44 24 1c mov 0x1c(%esp),%eax
40120c: 89 44 24 04 mov %eax,0x4(%esp)
401210: c7 04 24 60 30 40 00 movl $0x403060,(%esp)
401217: e8 2c 00 00 00 call 401248 <_printf>
40121c: e8 7f ff ff ff call 4011a0 <_loop>
401221: 89 44 24 04 mov %eax,0x4(%esp)
401225: c7 04 24 60 30 40 00 movl $0x403060,(%esp)
40122c: e8 17 00 00 00 call 401248 <_printf>
The first 7 instructions are the inlined call, and the regular call happens 5 instructions later.
Here's a suggestion, write the body of the code in a separate header file.
Include the header file in place where it has to be inline and into a body in a C file for other calls.
void demo(void)
{
#include myBody.h
}
importantloop
{
// code
#include myBody.h
// code
}
I assume that your function is a little one since you want to inline it, if so why don't you write it in asm?
As for inlining only a specific call to a function I don't think there exists something to do this task for you. Once a function is declared as inline and if the compiler will inline it for you it will do it everywhere it sees a call to that function.

Resources