Prevent compiler optimization on a static struct variable (C)

In our project, we are using the ticlang compiler, i.e. TI's flavor of clang.
Optimization is set to -Os.
In the code we have variables of a struct type that are only used within one C file and are therefore defined as static struct_type_xy variable;
The compiler performs an optimization where the members of such a struct are not kept in sequence in one block of memory but are re-ordered and even split apart.
This means that such variables cannot be displayed properly while debugging.
Of course, I could define them as volatile, but that would also prevent optimizing multiple accesses to the same members, which I don't want.
Therefore I want to prevent this kind of optimization.
What is the name of such an optimization and how can I disable it in clang?
I don't have a MCVE yet but I can provide a few details:
typedef struct
{
    Command_t Command; // this is an enum type
    int Par_1;         // System uses 32-bit integers.
    int Par_2;
    int Par_3;
    int Par_4;
    size_t Num_Tok;
} Cmd_t;

static Cmd_t Cmd;
The map file then contains:
20000540 00000004 Cmd.o (.bss.Cmd.1)
20000544 00000004 Cmd.o (.bss.Cmd.2)
20000548 00000004 Cmd.o (.bss.Cmd.5)
2000054c 00000004 HAL_*
...
2000057b 00000001 XY_*
2000057c 00000001 Cmd.o (.bss.Cmd.0)
The parts of Cmd are split across memory and some are even removed. (I used a build configuration in which the two missing members are not used, but the struct definition is identical for all configurations.)
If I remove static this changes to
200004c4 00000018 (.common:Cmd)

Clang is apparently scalarizing the static struct, breaking it up into separate members, since its address is never taken or used and the variable doesn't escape the compilation unit. This lets it optimize away unused members.
LLVM has a "Scalar Replacement of Aggregates" (sroa) optimization pass.
https://llvm.org/docs/Passes.html#sroa-scalar-replacement-of-aggregates
(The alloca mentioned in that doc is an LLVM IR instruction, not the C alloca() function. Also, google found a random copy of the LLVM source that implements this while I was trying to find the right search terms.)
clang -O3 -Rpass=sroa might print a "remark" for each struct it optimizes, if that pass supports optimization reports.
According to Clang optimization levels, -sroa is enabled at -O1 and higher. But -sroa isn't a clang option, nor is it an LLVM option for clang -mllvm -sroa. In 2011, someone asked about adding a command-line option to disable an arbitrary optimization pass; I don't know whether any such feature was ever added.
clang -cc1 -mllvm -help-list-hidden does show some interesting option names, like --stop-before=<pass-name> and --start-after=<pass-name>, and there's a --sroa-strict-inbounds.
clang -mllvm --sroa-strict-inbounds -O1 does actually compile, but I don't know what it does.
clang -mllvm --stop-before=sroa -O3 hello.c doesn't work on my system with clang 13. Or with --stop-before=-sroa. I get error in backend: "sroa" pass is not registered.
So I don't know how to actually disable this optimization pass, but that's almost certainly the one responsible. This is as far as I've gotten.
It's enabled at -O1, so it's not viable to use a lower optimization level while enabling the other optimization flags that level normally implies. -O0 is special, and marks everything as optnone, to make sure code-gen is suitably literal, storing/reloading everything between C statements.
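One possible workaround (just a sketch, not tested with ticlang, and Cmd_keep_whole is a made-up name): let the struct's address escape the translation unit, so the scalarization can't prove the aggregate is private. Unlike volatile, this should still allow normal optimization of accesses to individual members:

static Cmd_t Cmd;

/* A non-static, const-qualified pointer with external linkage: another
 * translation unit could read it, so the compiler has to assume the whole
 * object may be accessed through it and keep Cmd in one piece.
 * Note that LTO / whole-program optimization could still see through this. */
Cmd_t *const Cmd_keep_whole = &Cmd;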

Related

Why would gcc change the order of functions in a binary?

Many questions about forcing the order of functions in a binary to match the order of the source file
For example, this post, that post and others
I can't understand why gcc would want to change their order in the first place.
What could be gained from that?
Moreover, why is the default value of toplevel-reorder true?
GCC can change the order of functions, because the C standard (e.g. n1570 or newer) allows it to do so.
There is no obligation for GCC to compile a C function into a single function in the sense of the ELF format. See elf(5) on Linux.
In practice (with optimizations enabled: try compiling foo.c with gcc -Wall -fverbose-asm -O3 foo.c, then look into the emitted foo.s assembler file), the GCC compiler builds intermediate representations like GIMPLE. A large number of optimizations transform GIMPLE into better GIMPLE.
Once the GIMPLE representation is "good enough", the compiler transforms it to RTL.
On Linux systems, you could use dladdr(3) to find the nearest ELF function to a given address. You can also use backtrace(3) to inspect your call stack at runtime.
GCC can even remove functions entirely, in particular static functions whose calls would be inline expanded (even without any inline keyword).
I tend to believe that if you compile and link your entire program with gcc -O3 -flto -fwhole-program, some non-static but unused functions can be removed too.
And you can always write your own GCC plugin to change the order of functions.
If you want to guess how GCC works: download and study its source code (since it is free software) and compile it on your machine, invoke it with GCC developer options, ask questions on GCC mailing lists...
See also the bismon static source code analyzer (some work in progress which could interest you), and the DECODER project. You can contact me by email about both. You could also contribute to RefPerSys and use it to generate GCC plugins (in C++ form).
What could be gained from that?
Optimization. If the compiler thinks some code is likely to be used a lot, it may put that code in a different region than code which is not expected to execute often (or is an error path, where performance is not as important). And code which is likely to execute after or temporally near some other code should be placed nearby, so it is more likely to be in cache when needed.
__attribute__((hot)) and __attribute__((cold)) exist for some of the same reasons.
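A minimal sketch of those attributes (the function names are made up for illustration):

/* Hint that handle_request runs often and log_fatal_error rarely.
 * GCC may then place them in separate text sections (e.g. .text.hot and
 * .text.unlikely), which already changes their order relative to the source. */
__attribute__((hot))  void handle_request(void)  { /* ... */ }
__attribute__((cold)) void log_fatal_error(void) { /* ... */ }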
why is toplevel-reorder default value is true?
Because 99% of developers are not bothered by this default, and it makes programs faster. The 1% of developers who need to care about ordering use the attributes, profile-guided optimization or other features which are likely to conflict with no-toplevel-reorder anyway.

How can I get the GCC compiler to not optimize a standard library function call like 'printf'?

Is there a way that GCC does not optimize any function calls?
In the generated assembly code, the printf function is replaced by putchar. This happens even at -O0, the default (minimal) optimization level.
#include <stdio.h>

int main(void) {
    printf("a");
    return 0;
}
(Godbolt is showing GCC 9 doing it, and Clang 8 keeping it unchanged.)
Use -fno-builtin to disable all replacement and inlining of standard C functions with equivalents. (This is very bad for performance in code that assumes memcpy(x,y, 4) will compile to just an unaligned/aliasing-safe load, not a function call. And disables constant-propagation such as strlen of string literals. So normally you'd want to avoid that for practical use.)
Or use -fno-builtin-FUNCNAME for a specific function, like -fno-builtin-printf.
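The memcpy case mentioned above is the usual strict-aliasing-safe way to load from an arbitrary buffer; a small sketch (load32 is a made-up name) of what -fno-builtin would turn back into a real call:

#include <string.h>
#include <stdint.h>

/* With the builtin enabled this typically compiles to a single (possibly
 * unaligned) load; with -fno-builtin it becomes an actual call to memcpy. */
uint32_t load32(const unsigned char *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}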
By default, some commonly-used standard C functions are handled as builtin functions, similar to __builtin_popcount. The handler for printf replaces it with putchar or puts if possible.
6.59 Other Built-in Functions Provided by GCC
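For illustration, these are the kinds of substitutions the builtin handler can make (a sketch; the exact behavior depends on the GCC version and the format string):

#include <stdio.h>

void demo(const char *s, int c)
{
    printf("a");        /* may become putchar('a')  */
    printf("hello\n");  /* may become puts("hello") */
    printf("%s\n", s);  /* may become puts(s)       */
    printf("%c", c);    /* may become putchar(c)    */
}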
The implementation details of a C statement like printf("a") are not considered a visible side effect by default, so they aren't something that get preserved. You can still set a breakpoint at the call site and step into the function (at least in assembly, or in source mode if you have debug symbols installed).
To disable other kinds of optimizations for a single function, see __attribute__((optimize(0))) on a function or #pragma GCC optimize. But beware:
The optimize attribute should be used for debugging purposes only. It is not suitable in production code.
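A minimal sketch of both forms (the function names are made up for illustration):

/* Per-function: compile only this function as if with -O0. */
__attribute__((optimize(0)))
void debug_this(int *p)
{
    *p += 1;   /* loads/stores stay visible and steppable in a debugger */
}

/* Per-region: functions between push_options/pop_options get -O0. */
#pragma GCC push_options
#pragma GCC optimize ("O0")
void also_debug_this(void) { }
#pragma GCC pop_options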
You can't disable all optimizations. Some optimization is inherent in the way GCC transforms through an internal representation on the way to assembly. See Disable all optimization options in GCC.
E.g., even at -O0, GCC will optimize x / 10 to a multiplicative inverse.
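Concretely (a small sketch; the exact instruction sequence depends on the target):

/* Even at -O0, GCC typically emits a multiply-by-magic-constant plus shift
 * sequence for the constant divisor instead of a hardware divide. */
unsigned div10(unsigned x)
{
    return x / 10;
}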
GCC still stores everything to memory between C statements (for consistent debugging; that's what -O0 really means), but it doesn't have a "fully dumb" mode that tries to transliterate C to assembly as naively as possible. Use tcc for that. Clang and ICC with -O0 are somewhat more literal than GCC, and so is MSVC debug mode.
Note that -g never has any effect on code generation, only on the metadata emitted. GCC uses other options (mostly -O, -f*, and -m*) to control code generation, so you can always safely enable -g without hurting performance, other than a larger binary. It's not debug mode (that's -O0); it's just debug symbols.

Is memory in same section always allocated contiguously?

char a_1[512];
int some_variable;
char a_2[512];

int main(void)
{
    ...
}
Here in the above program, I have declared some variables, all in the .bss section. Considering that I have kept in mind the alignment issues, can I be sure that the memory allocated for those 3 variables will always be contiguous?
Considering that I have kept in mind the alignment issues, can I be sure that the memory allocated for those 3 variables will always be contiguous?
Certainly not. Read the C11 standard n1570, and you won't find any guarantee about that.
Different compilers are likely to order variables differently, in particular when they are optimizing. Some variables might even stay in a register and not have any memory location at all. In practice some compilers follow the order of the source, others use a different order.
And you could in practice customize (perhaps with some pain) your GCC or your Clang compiler to change that order. And this does happen in practice: for example, recent versions of the Linux kernel might be built with a GCC plugin which could reorder variables. With GCC or Clang you might also add some variable attribute to alter that order.
BTW, if you need some specific order, you could pack the variables as fields of a struct, e.g.:
struct {
    char a_1[512];
    int some_variable;
    char a_2[512];
} my_struct;

#define a_1 my_struct.a_1
#define some_variable my_struct.some_variable
#define a_2 my_struct.a_2
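With those macros in place, existing code keeps compiling unchanged while the three objects now live inside one struct (a sketch assuming the struct and #defines above are in scope):

#include <string.h>

void reset(void)
{
    /* a_1 expands to my_struct.a_1, so this is really
     * memset(my_struct.a_1, 0, sizeof my_struct.a_1); the three members are
     * contiguous apart from any padding the compiler inserts between them. */
    memset(a_1, 0, sizeof a_1);
    some_variable = 0;
}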
BTW, some old versions of GCC had an optional optimization pass which reordered (in some cases) the fields of structs (recent GCC has removed that optimization pass).
In a comment (which should go into your question) you mention hunting some bug. Consider using the gdb debugger and its watchpoints (and/or valgrind). Don't forget to enable all warnings and debug info when compiling (so gcc -Wall -Wextra -g with GCC). You may also want instrumentation options like -fsanitize=address.
Beware of undefined behavior.

Volatile and compiler optimization

Is it OK to say that the 'volatile' keyword makes no difference if compiler optimization is turned off, i.e. gcc -O0?
I made a sample C program and see a difference between volatile and non-volatile in the generated assembly code only when compiler optimization is turned on, i.e. gcc -O1.
No, there is no basis for making such a statement.
volatile has specific semantics that are spelled out in the standard. You are asserting that gcc -O0 always generates code such that every variable -- volatile or not -- conforms to those semantics. This is not guaranteed; even if it happens to be the case for a particular program and a particular version of gcc, it could well change when, for example, you upgrade your compiler.
Probably volatile does not make much difference with gcc -O0, for GCC 4.7 or earlier. However, this is probably changing in the next version of GCC (i.e. the future 4.8, the current trunk). And the next version will also provide -Og to get debug-friendly optimization.
In GCC 4.7 and earlier, no optimization means that values are not always kept in registers from one C instruction (or even one GIMPLE instruction, GCC's internal representation) to the next.
Also, volatile has a specific meaning, both for standard-conforming compilers and for humans. For instance, I would be upset reading some code with a sig_atomic_t variable which is not volatile!
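For example, the classic signal-handler flag is volatile precisely so the compiler must re-read it on every loop iteration (a minimal sketch):

#include <signal.h>

/* Written from the signal handler, read in the main loop. */
static volatile sig_atomic_t got_signal = 0;

static void on_sigint(int sig)
{
    (void)sig;
    got_signal = 1;
}

int main(void)
{
    signal(SIGINT, on_sigint);
    while (!got_signal)
        ;   /* without volatile, an optimizer may turn this into an infinite loop */
    return 0;
}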
BTW, you could use the -fdump-tree-all option to GCC to get a lot of dump files, or use the MELT domain-specific language and plugin, notably its probe, to query the GCC internal representations through a graphical interface.

How many GCC optimization levels are there?

How many GCC optimization levels are there?
I tried gcc -O1, gcc -O2, gcc -O3, and gcc -O4
If I use a really large number, it won't work.
However, I have tried
gcc -O100
and it compiled.
How many optimization levels are there?
To be pedantic, there are 8 different valid -O options you can give to gcc, though there are some that mean the same thing.
The original version of this answer stated there were 7 options. GCC has since added -Og to bring the total to 8.
From the man page:
-O (Same as -O1)
-O0 (do no optimization, the default if no optimization level is specified)
-O1 (optimize minimally)
-O2 (optimize more)
-O3 (optimize even more)
-Ofast (optimize very aggressively to the point of breaking standard compliance)
-Og (Optimize debugging experience. -Og enables optimizations that do not interfere with debugging. It should be the optimization level of choice for the standard edit-compile-debug cycle, offering a reasonable level of optimization while maintaining fast compilation and a good debugging experience.)
-Os (Optimize for size. -Os enables all -O2 optimizations that do not typically increase code size. It also performs further optimizations designed to reduce code size. -Os disables the following optimization flags: -falign-functions -falign-jumps -falign-loops -falign-labels -freorder-blocks -freorder-blocks-and-partition -fprefetch-loop-arrays -ftree-vect-loop-version)
There may also be platform-specific optimizations; as @pauldoo notes, OS X has -Oz.
Let's interpret the source code of GCC 5.1
We'll try to understand what happens on -O100, since it is not clear on the man page.
We shall conclude that:
anything above -O3 up to INT_MAX is the same as -O3, but that could easily change in the future, so don't rely on it.
GCC 5.1 runs undefined behavior if you enter integers larger than INT_MAX.
the argument can only have digits, or it fails gracefully. In particular, this excludes negative integers like -O-1
Focus on subprograms
First remember that GCC is just a front-end for cpp, as, cc1, collect2. A quick ./XXX --help says that only collect2 and cc1 take -O, so let's focus on them.
And:
gcc -v -O100 main.c |& grep 100
gives:
COLLECT_GCC_OPTIONS='-O100' '-v' '-mtune=generic' '-march=x86-64'
/usr/local/libexec/gcc/x86_64-unknown-linux-gnu/5.1.0/cc1 [[noise]] hello_world.c -O100 -o /tmp/ccetECB5.
so -O was forwarded to both cc1 and collect2.
O in common.opt
common.opt is a GCC specific CLI option description format described in the internals documentation and translated to C by opth-gen.awk and optc-gen.awk.
It contains the following interesting lines:
O
Common JoinedOrMissing Optimization
-O<number> Set optimization level to <number>
Os
Common Optimization
Optimize for space rather than speed
Ofast
Common Optimization
Optimize for speed disregarding exact standards compliance
Og
Common Optimization
Optimize for debugging experience rather than speed or size
which specify all the O options. Note how -O<n> is in a separate family from the other Os, Ofast and Og.
When we build, this generates a options.h file that contains:
OPT_O = 139, /* -O */
OPT_Ofast = 140, /* -Ofast */
OPT_Og = 141, /* -Og */
OPT_Os = 142, /* -Os */
As a bonus, while we are grepping for \bO\n inside common.opt we notice the lines:
-optimize
Common Alias(O)
which teaches us that --optimize (double dash because it starts with a dash -optimize on the .opt file) is an undocumented alias for -O which can be used as --optimize=3!
Where OPT_O is used
Now we grep:
git grep -E '\bOPT_O\b'
which points us to two files:
opts.c
lto-wrapper.c
Let's first track down opts.c
opts.c:default_options_optimization
All opts.c usages happen inside: default_options_optimization.
We grep to find who calls this function, and we see that the only code path is:
main.c:main
toplev.c:toplev::main
opts-global.c:decode_opts
opts.c:default_options_optimization
and main.c is the entry point of cc1. Good!
The first part of this function:
does integral_argument which calls atoi on the string corresponding to OPT_O to parse the input argument
stores the value inside opts->x_optimize where opts is a struct gcc_opts.
struct gcc_opts
After grepping in vain, we notice that this struct is also generated at options.h:
struct gcc_options {
int x_optimize;
[...]
}
where x_optimize comes from the lines:
Variable
int optimize
present in common.opt, and options.c contains:
struct gcc_options global_options;
so we guess that this is what contains the entire configuration global state, and int x_optimize is the optimization value.
255 is an internal maximum
in opts.c:integral_argument, atoi is applied to the input argument, so INT_MAX is an upper bound. And if you put anything larger, it seems that GCC runs into C undefined behaviour. Ouch.
integral_argument also thinly wraps atoi and rejects the argument if any character is not a digit. So negative values fail gracefully.
Back to opts.c:default_options_optimization, we see the line:
if ((unsigned int) opts->x_optimize > 255)
opts->x_optimize = 255;
so that the optimization level is truncated to 255. While reading opth-gen.awk I had come across:
# All of the optimization switches gathered together so they can be saved and restored.
# This will allow attribute((cold)) to turn on space optimization.
and on the generated options.h:
struct GTY(()) cl_optimization
{
unsigned char x_optimize;
which explains the truncation: the options must also be forwarded to cl_optimization, which uses a char to save space. So 255 is actually the internal maximum.
opts.c:maybe_default_options
Back to opts.c:default_options_optimization, we come across maybe_default_options which sounds interesting. We enter it, and then maybe_default_option where we reach a big switch:
switch (default_opt->levels)
{
[...]
case OPT_LEVELS_1_PLUS:
enabled = (level >= 1);
break;
[...]
case OPT_LEVELS_3_PLUS:
enabled = (level >= 3);
break;
There are no >= 4 checks, which indicates that 3 is the largest possible.
Then we search for the definition of OPT_LEVELS_3_PLUS in common-target.h:
enum opt_levels
{
OPT_LEVELS_NONE, /* No levels (mark end of array). */
OPT_LEVELS_ALL, /* All levels (used by targets to disable options
enabled in target-independent code). */
OPT_LEVELS_0_ONLY, /* -O0 only. */
OPT_LEVELS_1_PLUS, /* -O1 and above, including -Os and -Og. */
OPT_LEVELS_1_PLUS_SPEED_ONLY, /* -O1 and above, but not -Os or -Og. */
OPT_LEVELS_1_PLUS_NOT_DEBUG, /* -O1 and above, but not -Og. */
OPT_LEVELS_2_PLUS, /* -O2 and above, including -Os. */
OPT_LEVELS_2_PLUS_SPEED_ONLY, /* -O2 and above, but not -Os or -Og. */
OPT_LEVELS_3_PLUS, /* -O3 and above. */
OPT_LEVELS_3_PLUS_AND_SIZE, /* -O3 and above and -Os. */
OPT_LEVELS_SIZE, /* -Os only. */
OPT_LEVELS_FAST /* -Ofast only. */
};
Ha! This is a strong indicator that there are only 3 levels.
opts.c:default_options_table
opt_levels is so interesting, that we grep OPT_LEVELS_3_PLUS, and come across opts.c:default_options_table:
static const struct default_options default_options_table[] = {
/* -O1 optimizations. */
{ OPT_LEVELS_1_PLUS, OPT_fdefer_pop, NULL, 1 },
[...]
/* -O3 optimizations. */
{ OPT_LEVELS_3_PLUS, OPT_ftree_loop_distribute_patterns, NULL, 1 },
[...]
}
so this is where the -On to specific optimization mapping mentioned in the docs is encoded. Nice!
Ensure that there are no more uses of x_optimize
The main usage of x_optimize was to set other specific optimization options like -fdefer_pop as documented on the man page. Are there any more?
We grep, and find a few more. The number is small, and upon manual inspection we see that every usage only does at most an x_optimize >= 3 check, so our conclusion holds.
lto-wrapper.c
Now we go for the second occurrence of OPT_O, which was in lto-wrapper.c.
LTO means Link Time Optimization, which as the name suggests is going to need an -O option, and will be linked to collect2 (which is basically a linker).
In fact, the first line of lto-wrapper.c says:
/* Wrapper to call lto. Used by collect2 and the linker plugin.
In this file, the OPT_O occurrences seem only to normalize the value of O to pass it forward, so we should be fine.
Seven distinct levels:
-O0 (default): No optimization.
-O or -O1 (same thing): Optimize, but do not spend too much time.
-O2: Optimize more aggressively
-O3: Optimize most aggressively
-Ofast: Equivalent to -O3 -ffast-math. -ffast-math triggers non-standards-compliant floating point optimizations. This allows the compiler to pretend that floating point numbers are infinitely precise, and that algebra on them follows the standard rules of real number algebra. It also tells the compiler to tell the hardware to flush denormals to zero and treat denormals as zero, at least on some processors, including x86 and x86-64. Denormals trigger a slow path on many FPUs, and so treating them as zero (which does not trigger the slow path) can be a big performance win.
-Os: Optimize for code size. This can actually improve speed in some cases, due to better I-cache behavior.
-Og: Optimize, but do not interfere with debugging. This enables non-embarrassing performance for debug builds and is intended to replace -O0 for debug builds.
There are also other options that are not enabled by any of these levels and must be enabled separately. It is also possible to use an optimization level but disable specific flags that it enables.
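For example (a sketch; check your GCC version's manual for which flags each level actually enables): -funroll-loops is not turned on by any plain -O level, while -fstrict-aliasing is enabled at -O2, so you could combine
gcc -O2 -funroll-loops -fno-strict-aliasing foo.c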
For more information, see GCC website.
Four (0-3): See the GCC 4.4.2 manual. Anything higher is just -O3, but at some point you will overflow the variable size limit.

Resources