I was debugging some code and noticed that the compiler (MIPS gcc-4.7.2) is not pre-calculating certain values that I would expect to be just a static value in memory. Here's the crux of the code that is causing what I am seeing:
#define SAMPLES_DATA 500
#define VECTORX 12
#define VECTORY 18
(int *)calloc(SAMPLES_DATA * VECTORX * VECTORY, sizeof(int));
In the assembly, I see that these values are multiplied as (500*12*18) instead of a static value of 108000. This is only an issue because I have some code that runs in realtime where these defines are used to calculate the offsets into an array and the same behavior is seen. I only noticed because the time to write to memory was taking much longer than expected on the hardware. I currently have a "hot fix" that is a function that uses assembly, but I'd rather not push that into production.
Is this standard gcc compiler behavior? If so, is there some way to force or create a construction that precomputes these static multiplication values?
edit:
I'm compiling with -O2; however, the build chain is huge. I don't see anything unusual in the commands generated by the Makefile.
edit:
The issue seems to not be present when gcc 5 is used. Whatever is causing my issue, seems to not carry on to later versions.
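One construction that leaves the compiler no choice is an enumeration constant: it must be an integer constant expression, so the product has to be evaluated during translation. The following is only a sketch; TOTAL_CELLS, alloc_samples and sample_offset are hypothetical names that are not part of the original code, and _Static_assert assumes a C11-capable compiler.

```c
#include <stdlib.h>

#define SAMPLES_DATA 500
#define VECTORX 12
#define VECTORY 18

/* An enumeration constant must be an integer constant expression, so the
   compiler has to evaluate the product at translation time.
   TOTAL_CELLS is a hypothetical name. */
enum { TOTAL_CELLS = SAMPLES_DATA * VECTORX * VECTORY };

/* Compilation fails if the folded value is not the expected one. */
_Static_assert(TOTAL_CELLS == 108000, "unexpected buffer size");

/* Hypothetical helper: allocates the zero-initialised sample buffer. */
int *alloc_samples(void)
{
    return calloc(TOTAL_CELLS, sizeof(int));
}

/* Hypothetical helper: flat offset of element (sample, x, y).
   The layout (sample-major, then x, then y) is an assumption. */
size_t sample_offset(size_t sample, size_t x, size_t y)
{
    return (sample * VECTORX + x) * VECTORY + y;
}
```

The offset of a variable element still needs run-time arithmetic, but every factor that comes from the defines is a compile-time constant here.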
I'm working in an embedded system and have "mapped" some defines to an array for inputs.
volatile int INPUT_ARRAY[40];
#define INPUT01 INPUT_ARRAY[0]
#define INPUT02 INPUT_ARRAY[1]
// section 2
if ( INPUT01 && INPUT02 ) {
writepin(outputpin, value);
}
If I want to read from Input 1, I can simply say newvariable = INPUT01, or I can compare data with Input 1, as in section 2 of my code. I'm not sure whether this is a normal way of mapping the name INPUT01 to an array position, or of handling an input pin in the first place. Each array value represents a binary pin, and the values are read into the array by decoding a 16-bit port value. Question: Is using the defines and array like this reasonably efficient?
Yes, your solution is efficient.
Before the C compiler even sees your code, the C preprocessor substitutes INPUT_ARRAY[0] for INPUT01 and, similarly, INPUT_ARRAY[1] for INPUT02; so this substitution uses zero time and zero power at run time.
Moreover, when the C compiler sees INPUT_ARRAY[1] in the preprocessed code, it adds 1 at compile time to the base address of INPUT_ARRAY. Therefore, you get maximal efficiency at run time.
Admittedly, were you manually to turn your C compiler's optimizer off, as with the -O0 option of GCC, then it is conceivable that the compiler would emit assembly code to add the 1 at run time. So don't do that.
The only likely exception to the foregoing would be the case in which the base address of INPUT_ARRAY were unknown to the compiler at compile time: not because INPUT_ARRAY were dynamically allocated on the heap (which would make little sense for hardware device addressing), but because the base address of INPUT_ARRAY were configurable during boot via device configuration registers. Some hardware does this, but if yours does, that is exactly the reason your MCU (or MPU) possesses an index-offset indirect addressing mode in the first place. Though this mode engages the MCU's integer arithmetic unit, [a] the mode does not multiply (multiplication being a power-hungry operation); and, [b] anyway, the mode is such a normal, often-used mode that MCUs are invariably designed to support it efficiently: not perhaps as efficiently as precomputed direct addressing, but as efficiently as one can reasonably expect for such a use. The MCU's manufacturer knows that device pins are things you need to address. The engineer who designed your MCU will have given priority to making the index-offset indirect mode as efficient as possible for this and other reasons.
(You could maybe still cheat the matter to save a few millijoules via self-modifying code, if your MCU even allowed that; but, as an engineer, you'd regret the cheat, I suspect, unless security and maintainability were non-issues to you. The problem probably is not much of a real problem. Index-offset indirect addressing is the normal technique when the base address remains unknown until run time. If you really need to save that last millijoule, then you might not be using a C compiler for your code's inner loop, anyway, but might be handcrafting assembly code.)
I suspect that you would find it instructive to tell your compiler to emit assembly code for your inspection. I do not know which compiler you are using but, if you were using GCC, then gcc -S myfile.c would do it.
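To make this concrete, here is a small sketch of the pattern from the question. The names decode_port and both_inputs_high are hypothetical, and the assumption that bit n of the port lands in INPUT_ARRAY[n] is mine, not the asker's. Because the indices behind INPUT01 and INPUT02 are literal constants, an optimizing compiler should be able to address both elements at fixed offsets from the array's base.

```c
#include <stdint.h>

volatile int INPUT_ARRAY[40];
#define INPUT01 INPUT_ARRAY[0]
#define INPUT02 INPUT_ARRAY[1]

/* Hypothetical decoder: spreads the 16 bits of a port value over the
   first 16 array entries (bit n goes to INPUT_ARRAY[n]; assumed mapping). */
void decode_port(uint16_t port)
{
    for (int bit = 0; bit < 16; ++bit)
        INPUT_ARRAY[bit] = (port >> bit) & 1u;
}

/* The macros expand to constant-index accesses before compilation proper. */
int both_inputs_high(void)
{
    return INPUT01 && INPUT02;
}
```

Compiling this with gcc -S at -O2 and reading the code for both_inputs_high is a quick way to check what your own toolchain does.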
I've read a lot of articles talking about undefined behavior (UB), but all do talk about theory. I am wondering what could happen in practice, because the programs containing UB may actually run.
My questions relate to unix-like systems, not embedded systems.
I know that one should not write code that relies on undefined behavior. Please do not send answers like this:
Everything could happen
Daemons can fly out of your nose
Computer could jump and catch fire
Especially for the first one, it is not true. You obviously cannot get root by doing a signed integer overflow. I'm asking this for educational purpose only.
Question A)
Source
implementation-defined behavior: unspecified behavior where each implementation documents how the choice is made
Is the implementation the compiler?
Question B)
*"abc" = '\0';
For something other than a segfault to happen, do I need my system to be broken? What could actually happen, even if it is not predictable? Could the first byte be set to zero? What else, and how?
Question C)
int i = 0;
foo(i++, i++, i++);
This is UB because the order in which parameters are evaluated is undefined. Right. But, when the program runs, who decides in what order the parameters are evaluated: is it the compiler, the OS, or something else?
Question D)
Source
$ cat test.c
#include <limits.h>
#include <stdio.h>
int main (void)
{
printf ("%d\n", (INT_MAX+1) < 0);
return 0;
}
$ cc test.c -o test
$ ./test
Formatting root partition, chomp chomp
According to other SO users, this is possible. How could this happen? Do I need a broken compiler?
Question E)
Use the same code as above. What could actually happen, apart from the expression (INT_MAX+1) yielding a random value?
Question F)
Does the GCC -fwrapv option define the behavior of a signed integer overflow, or does it only make GCC assume that it will wrap around, while it could in fact not wrap around at run time?
Question G)
This one concerns embedded systems. Of course, if the PC jumps to an unexpected place, two outputs could be wired together and create a short-circuit (for example).
But, when executing code similar to this:
*"abc" = '\0';
Wouldn't the PC be vectored to the general exception handler? Or what am I missing?
In practice, most compilers use undefined behavior in one of the following ways:
Print a warning at compile time, to inform the user that he probably made a mistake
Infer properties on the values of variables and use those to simplify code
Perform unsafe optimizations as long as they only break the expected semantic of undefined behavior
Compilers are usually not designed to be malicious. The main reason to exploit undefined behavior is usually to get some performance benefit from it. But sometimes that can involve total dead code elimination.
A) Yes. The compiler should document which behavior it chose. For UB, however, the consequences are usually hard to predict or explain.
B) If the string is actually instantiated in memory and is in a writable page (by default it will be in a read-only page), then its first character might become a null character. Most probably, the entire expression will be thrown out as dead-code because it is a temporary value that disappears out of the expression.
C) Usually, the order of evaluation is decided by the compiler. Here it might decide to transform it into an i += 3 (or an i = undef if it is being silly). The CPU could reorder instructions at run time, but it preserves the order chosen by the compiler whenever reordering would break the semantics of its instruction set (the compiler usually cannot forward the C semantics further down). An increment of a register cannot commute with, or be executed in parallel with, another increment of that same register.
D) You need a silly compiler that prints "Formatting root partition, chomp chomp" when it detects undefined behavior. Most probably, it will print a warning at compile time, replace the expression with a constant of its choice, and produce a binary that simply prints that constant.
E) It is a syntactically correct program, so the compiler will certainly produce a "working" binary. That binary could in theory have the same behavior as any binary you could download from the internet and run. Most probably, you get a binary that exits straight away, or that prints the aforementioned message and exits straight away.
F) It tells GCC to assume that signed integers wrap around in the C semantics, using two's complement. It must therefore produce a binary that wraps around at run time. That is rather easy because most architectures have those semantics anyway. The reason C makes this UB is so that compilers can assume a + 1 > a, which is critical for proving that loops terminate and/or for predicting branches. That's why using a signed integer as a loop induction variable can lead to faster code, even though it is mapped to the exact same instructions in hardware.
G) Undefined behavior is undefined behavior. The produced binary could indeed run any instructions, including a jump to an unspecified place... or cleanly trigger an interrupt. Most probably, your compiler will get rid of that unnecessary operation.
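As a footnote to F): since the compiler may assume a + 1 > a, a check written after the addition can be optimized away. A portable way to detect the overflow is to test before adding, so the overflowing operation is never executed. A minimal sketch, where checked_add is a hypothetical helper name:

```c
#include <limits.h>

/* Hypothetical helper: stores a + b in *out and returns 1 when the sum
   fits in an int; returns 0 and leaves *out untouched otherwise.
   The range test runs first, so a + b is only evaluated when it is
   known not to overflow. */
int checked_add(int a, int b, int *out)
{
    if ((b > 0 && a > INT_MAX - b) || (b < 0 && a < INT_MIN - b))
        return 0;
    *out = a + b;
    return 1;
}
```

This version behaves the same with or without -fwrapv, because it never relies on what an overflowing addition would produce.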
You obviously cannot get root by doing a signed integer overflow.
Why not?
If you assume that signed integer overflow can only yield some particular value, then you're unlikely to get root that way. But the thing about undefined behavior is that an optimizing compiler can assume that it doesn't happen, and generate code based on that assumption.
Operating systems have bugs. Exploiting those bugs can, among other things, invoke privilege escalation.
Suppose you use signed integer arithmetic to compute an index into an array. If the computation overflows, you could accidentally clobber some arbitrary chunk of memory outside the intended array. That could cause your program to do arbitrarily bad things.
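One way to keep such an index computation well defined is to validate the inputs first and only then do the arithmetic in an unsigned type. This is a sketch under stated assumptions: safe_index and its parameters are hypothetical names, and the array is assumed to really hold height * width elements, so the product fits in a size_t.

```c
#include <stddef.h>

/* Hypothetical helper: computes row * width + col for a height-by-width
   array. Returns 0 and stores the index in *out on success; returns -1
   when any input is out of range. Assumes the array actually holds
   height * width elements, so the product fits in a size_t. */
int safe_index(int row, int col, int width, int height, size_t *out)
{
    if (width <= 0 || height <= 0)
        return -1;
    if (row < 0 || row >= height || col < 0 || col >= width)
        return -1;
    /* All operands are non-negative here; the arithmetic is unsigned. */
    *out = (size_t)row * (size_t)width + (size_t)col;
    return 0;
}
```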
If a bug can be exploited deliberately (and the existence of malware clearly indicates that that's possible), it's at least possible that it could be exploited accidentally.
Also, consider this simple contrived program:
#include <stdio.h>
#include <limits.h>
int main(void) {
int x = INT_MAX;
if (x < x + 1) {
puts("Code that gets root");
}
else {
puts("Code that doesn't get root");
}
}
On my system, it prints
Code that doesn't get root
when compiled with gcc -O0 or gcc -O1, and
Code that gets root
with gcc -O2 or gcc -O3.
I don't have concrete examples of signed integer overflow triggering a security flaw (and I wouldn't post such an example if I had one), but it's clearly possible.
Undefined behavior can in principle make your program do accidentally anything that a program starting with the same privileges could do deliberately. Unless you're using a bug-free operating system, that could include privilege escalation, erasing your hard drive, or sending a nasty e-mail message to your boss.
To my mind, the worst thing that can happen in the face of undefined behavior is something different tomorrow.
I enjoy programming, but I also enjoy finishing a program, and going on to work on something else. I do not delight in continuously tinkering with my already-written programs, to keep them working in the face of bugs they spontaneously develop as hardware, compilers, or other circumstances keep changing.
So when I write a program, it is not enough for it to work. It has to work for the right reasons. I have to know that it works, and that it will keep working next week and next month and next year. It can't just seem to work, to have given apparently correct answers on the -- necessarily finite -- set of test cases I've run it on so far.
And that's why undefined behavior is so pernicious: it might do something perfectly fine today, and then do something completely different tomorrow, when I'm not around to defend it. The behavior might change because someone ran it on a slightly different machine, or with more or less memory, or on a very different set of inputs, or after recompiling it with a different compiler.
See also the third part of this other answer (the part starting with "And now, one more thing, if you're still with me").
It used to be that you could count on the compiler to do something "reasonable". More and more often, though, compilers are truly taking advantage of their license to do weird things when you write undefined code. In the name of efficiency, these compilers are introducing very strange optimizations, which don't do anything close to what you probably want.
Read these posts:
Linus Torvalds describes a kernel bug that was much worse than it could have been given that gcc took advantage of undefined behavior
LLVM blog post on undefined behavior (first of three parts, also two, three)
another great blog post by John Regehr (also first of three parts: two, three)
The central function in my code looks like this (everything else is vanilla input and output):
const int n = 40000;
double * foo (double const * const x)
{
double * y = malloc (n*sizeof(double));
y[0] = x[0] + (0.2*x[1]*x[0] - x[2]*x[2]);
y[1] = x[1] + (0.2*x[1]*x[0] - x[2]*x[2]);
// …
// 39997 lines of similar code
// that cannot be simplified to fewer lines
// …
y[39999] = 0.5*x[39999] - x[12345] + 5*x[0];
return y;
}
Assume for the purpose of this question that hard-coding these 40000 lines like this (or very similar) is really necessary. All these lines only contain basic arithmetic operations with fixed numbers and entries of x (forty per line on average); no functions are called. The total size of the source is 14 MB.
When trying to compile this code I face extensive memory usage by the compiler. I could get Clang to compile it with -O0 (which takes only 20 s), but I failed with GCC (even with -O0) and with Clang at -O1.
While there is little that can be optimised on the code side or on a global scale (i.e., by computing the individual lines in another order), I am confident that a compiler will find some things to optimise on a local scale (e.g., calculating the bracketed term needed to calculate y[0] and y[1]).
My questions are thus:
Are there some compiler flags that activate only optimisations that do not require much additional memory?
Are there some other ways to make the compiler handle this source better (without losing more speed than gained through optimisation)?
The following comment by Lee Daniel Crocker solved the problem:
I suspect the limit you're running into is the size of the structures needed for a single stack frame/block/function. Try breaking it up into, say, 100 functions of 400 lines each and see if that does better.
When using functions of 100 lines each (and calling all of them in a row), I obtained a program that I could compile with -O2 without any problem.
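For illustration, the restructuring looks roughly like this. It is a scaled-down sketch: N stands in for 40000, each part holds two lines instead of 100, and the names foo_part0 and foo_part1 as well as the formulas in foo_part1 are made up for the example.

```c
#include <stdlib.h>

enum { N = 4 };  /* stands in for 40000 in the real code */

/* Each generated chunk becomes its own small function, so the compiler
   only has to analyse one modest function body at a time. */
static void foo_part0(double *y, const double *x)
{
    y[0] = x[0] + (0.2*x[1]*x[0] - x[2]*x[2]);
    y[1] = x[1] + (0.2*x[1]*x[0] - x[2]*x[2]);
}

static void foo_part1(double *y, const double *x)
{
    y[2] = x[2] - x[3];
    y[3] = 0.5*x[3] - x[1] + 5*x[0];
}

/* The public entry point allocates the result and calls the parts in a row. */
double *foo(const double *x)
{
    double *y = malloc(N * sizeof(double));
    if (y == NULL)
        return NULL;
    foo_part0(y, x);
    foo_part1(y, x);
    return y;
}
```

Optimizations that would have spanned two chunks are lost this way, but each function stays small enough for the optimizer to handle.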
You can add as much swap space as required to get the compiler to compile the code with optimizations enabled. With this technique, compilation will become much slower, but the amount of memory available to the compiler is limited only by the size of the virtual address space. Another, less convenient option is to install more RAM. Also make sure that the process in which the compiler is running doesn't have limits on the amount of memory it can allocate. Regarding compiler flags, I don't think there are flags that you can use to directly control the memory usage of the compiler and let the compiler adjust itself to the specified limit.
Write it in assembly.
I assume you have a tool that is generating this C file. Why not have it spit out assembly code instead?
We're writing code inside the Linux kernel so, try as I might, I wasn't able to get PC-Lint/Flexelint working on Linux kernel code. Just too many built-in symbols etc. But that's a side issue.
We have any number of compilers, starting with gcc, but others also. Their warnings options have been getting stronger over time, to where they are pretty strong static analysis tools too.
Here is what I want to catch. Yes, I know it violates some things that are easy to catch in code review, such as "no magic numbers", and "beware of bit shifting", but that's only if you happen to look at that section of code. Anyway, here it is:
unsigned long long foo;
unsigned long bar;
[... lots of other code ...]
foo = ~(foo + (1<<bar));
Further UPDATED problem description -- even with bar limited to 16, still a problem. Clarifying, the problem is implicit int type of constant that, unplanned, makes the complex expression violate the rule that all calculations be carried out in the same size and signedness.
Problem: '1' is not long long, but, as a small-value constant, defaults to an int. Therefore even if bar's actual value never exceeds, say, 16, still the (1<<bar) expression will overflow and ruin the entire calculation.
Possibly correct solution: write 1ULL instead.
Is there a well-known compiler and compiler warning flag that will point out this (revised) problem?
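For reference, this is the corrected form I have in mind, wrapped in a function so it can be exercised. The function name mask_wide is hypothetical; foo and bar keep the types from the snippet above.

```c
/* Hypothetical wrapper around the corrected expression. With 1ULL the
   shift is carried out in unsigned long long, the same type as foo, so
   any bar below the width of that type (at least 64 bits) is valid. */
unsigned long long mask_wide(unsigned long long foo, unsigned long bar)
{
    return ~(foo + (1ULL << bar));
}
```

With a plain 1, the shift would instead be done in int, and its result would only afterwards be converted to the type of foo.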
I am not sure what criteria you are thinking of to flag this construction as suspicious. There is clearly something wrong if the value of bar is at least as large as the size (in bits) of an int, but usually the compiler wouldn't know that.
From the point of view of a heuristic, bug-finding tool, having good patterns to separate likely bugs from normal constructions is key to avoiding too many false positives (which make users hate the tool and refuse to use it).
The Open Source tool in my URL flags logical shifts by a number larger than the size of the type, but it is primarily a verification tool for critical embedded software. Expect a lot of work to adapt it if you intend to use it on the Linux kernel, with its linked structures and other difficulties.
I witnessed the following weird behavior. I have two functions, which do almost the same - they measure the number of cycles it takes to do a certain operation. In one function, inside the loop I increment a variable; in the other nothing happens. The variables are volatile so they won't be optimized away. These are the functions:
unsigned int _osm_iterations=5000;
double osm_operation_time(){
// volatile is used so that j will not be optimized, and ++ operation
// will be done in each loop
volatile unsigned int j=0;
volatile unsigned int i;
tsc_counter_t start_t, end_t;
start_t = tsc_readCycles_C();
for (i=0; i<_osm_iterations; i++){
++j;
}
end_t = tsc_readCycles_C();
if (tsc_C2CI(start_t) ==0 || tsc_C2CI(end_t) ==0 || tsc_C2CI(start_t) >= tsc_C2CI(end_t))
return -1;
return (tsc_C2CI(end_t)-tsc_C2CI(start_t))/_osm_iterations;
}
double osm_empty_time(){
volatile unsigned int i;
volatile unsigned int j=0;
tsc_counter_t start_t, end_t;
start_t = tsc_readCycles_C();
for (i=0; i<_osm_iterations; i++){
;
}
end_t = tsc_readCycles_C();
if (tsc_C2CI(start_t) ==0 || tsc_C2CI(end_t) ==0 || tsc_C2CI(start_t) >= tsc_C2CI(end_t))
return -1;
return (tsc_C2CI(end_t)-tsc_C2CI(start_t))/_osm_iterations;
}
There are some non-standard functions there but I'm sure you'll manage.
The thing is, the first function returns 4, while the second function, which obviously does less than the first one, returns 6.
Does that make any sense to anyone?
Actually I made the first function so I could reduce the loop overhead for my measurement of the second. Do you have any idea how to do that (as this method doesn't really cut it)?
I'm on Ubuntu (64 bit I think).
Thanks a lot.
I can see a couple of things here. One is that the code for the two loops looks identical. Secondly, the compiler will probably realise that the variable i and the variable j will always have the same value and optimise one of them away. You should look at the generated assembly and see what is really going on.
Another theory is that the change to the inner body of the loop has affected the cachability of the code - this could have moved it across cache lines or some other thing.
Since the code is so trivial, you may find it difficult to get an accurate timing value. Even with 5000 iterations, the time may be inside the margin of error of the timing code you are using. A modern computer can probably run that in far less than a millisecond; perhaps you should increase the number of iterations?
To see the generated assembly in gcc, specify the -S compiler option:
Q: How can I peek at the assembly code generated by GCC?
Q: How can I create a file where I can see the C code and its assembly translation together?
A: Use the -S (note: capital S) switch to GCC, and it will emit the assembly code to a file with a .s extension. For example, the following command:
gcc -O2 -S -c foo.c
will leave the generated assembly code on the file foo.s.
If you want to see the C code together with the assembly it was converted to, use a command line like this:
gcc -c -g -Wa,-a,-ad [other GCC options] foo.c > foo.lst
which will output the combined C/assembly listing to the file foo.lst.
It's sometimes difficult to guess at this sort of thing, especially due to the small number of iterations. One thing that might be happening, though, is that the increment could be executing on a free integer execution unit, gaining some slight degree of parallelism, since it has no dependency on the value of i.
Since you mentioned this was a 64-bit OS, it's almost certain all these values are in registers, since there are more registers in the x86_64 architecture. Other than that, I'd say perform many more iterations, and see how stable the results are.
If you are truly trying to test the operation of a piece of code ("j++;" in this case), you're actually better off doing the following:
1/ Do it in two separate executables since there is a possibility that position within the executable may affect the code.
2/ Make sure you use CPU time rather than elapsed time (I'm not sure what "tsc_readCycles_C()" gives you). This is to avoid errant results from a CPU loaded up with other tasks.
3/ Turn off compiler optimization (e.g., "gcc -O0") to ensure gcc doesn't put in any fancy stuff that's likely to skew the results.
4/ You don't need to worry about volatile if you use the actual result, such as placing:
printf ("%d\n",j);
after the loop, or:
FILE *fx = fopen ("/dev/null","w");
fprintf (fx, "%d\n", j);
fclose (fx);
if you don't want any output at all. I can't remember whether volatile was a suggestion to the compiler or enforced.
5/ Iterations of 5,000 seem a little on the low side, where "noise" could affect the readings. Maybe a higher value would be better. This may not be an issue if you're timing a larger piece of code and you've just included "j++;" as a place-holder.
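A rough sketch of points 2/, 4/ and 5/ together, using the standard clock() function for CPU time instead of the asker's TSC helpers. The function name time_increments is hypothetical, and the resolution of clock() varies between platforms, so use a large iteration count.

```c
#include <time.h>

/* Hypothetical helper: returns the CPU time, in seconds, spent on
   `iterations` increments of a volatile counter, and hands the final
   counter value back through *result so the work is actually used. */
double time_increments(unsigned long iterations, unsigned long *result)
{
    volatile unsigned long j = 0;
    clock_t start = clock();          /* processor time, not wall time */
    for (unsigned long i = 0; i < iterations; i++)
        ++j;
    clock_t end = clock();
    *result = j;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling it with something like 1000000000 iterations, and printing both the returned time and the result, should give readings that sit well above the timer's noise.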
When I'm running tests similar to this, I normally:
Ensure that the times are measured in at least seconds, preferably (small) tens of seconds.
Have a single run of the program call the first function, then the second, then the first again, then the second again, and so on, just to see if there are weird cache warmup issues.
Run the program multiple times to see how stable the timing is across runs.
I'm still at a loss to explain your observed results, but if you're sure you've got your functions identified properly (not self-evidently the case given that there were copy'n'paste errors earlier, for example), then looking at the assembler output is the main option left.