Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 5 years ago.
Improve this question
I've read a lot of articles talking about undefined behavior (UB), but all do talk about theory. I am wondering what could happen in practice, because the programs containing UB may actually run.
My questions relates to unix-like systems, not embedded systems.
I know that one should not write code that relies on undefined behavior. Please do not send answers like this:
Everything could happen
Daemons can fly out of your nose
Computer could jump and catch fire
Especially for the first one, it is not true. You obviously cannot get root by doing a signed integer overflow. I'm asking this for educational purpose only.
Question A)
Source
implementation-defined behavior: unspecified behavior where each implementation documents how the choice is made
Is the implementation the compiler?
Question B)
*"abc" = '\0';
For something else than a segfault to happen, do I need my system to be broken? What could actually happen even if it is not predictable? Could the first byte be set to zero ? What else, and how?
Question C)
int i = 0;
foo(i++, i++, i++);
This is UB because the order in which parameters are evaluated is undefined. Right. But, when the program runs, who decides in what order the parameters are evaluated: is is the compiler, the OS, or something else?
Question D)
Source
$ cat test.c
int main (void)
{
printf ("%d\n", (INT_MAX+1) < 0);
return 0;
}
$ cc test.c -o test
$ ./test
Formatting root partition, chomp chomp
According to other SO users, this is possible. How could this happen? Do I need a broken compiler?
Question E)
Use the same code as above. What could actually happen, except of the expression (INT_MAX+1) yielding a random value ?
Question F)
Does the GCC -fwrapv option defines the behavior of a signed integer overflow, or does it only make GCC assume that it will wrap around but it could in fact not wrap around at runtime?
Question G)
This one concerns embedded systems. Of course, if the PC jumps to an unexpected place, two outputs could be wired together and create a short-circuit (for example).
But, when executing code similar to this:
*"abc" = '\0';
Wouldn't the PC be vectored to the general exception handler? Or what am I missing?
In practice, most compilers use undefined behavior in either of the following ways:
Print a warning at compile time, to inform the user that he probably made a mistake
Infer properties on the values of variables and use those to simplify code
Perform unsafe optimizations as long as they only break the expected semantic of undefined behavior
Compilers are usually not designed to be malicious. The main reason to exploit undefined behavior is usually to get some performance benefit from it. But sometimes that can involve total dead code elimination.
A) Yes. The compiler should document what behavior he chose. But usually that is hard to predict or explain the consequences of UB.
B) If the string is actually instantiated in memory and is in a writable page (by default it will be in a read-only page), then its first character might become a null character. Most probably, the entire expression will be thrown out as dead-code because it is a temporary value that disappears out of the expression.
C) Usually, the order of evaluation is decided by the compiler. Here it might decide to transform it into a i += 3 (or a i = undef if it is being silly). The CPU could reorder instructions at run-time but preserve the order chosen by the compiler if it breaks the semantic of its instruction set (the compiler usually cannot forward the C semantic further down). An incrementation of a register cannot commute or be executed in parallel to an other incrementation of that same register.
D) You need a silly compiler that print "Formatting root partition, chomp chomp" when it detects undefined behavior. Most probably, it will print a warning at compile time, replace the expression by a constant of his choice and produce a binary that simply perform the print with that constant.
E) It is a syntactically correct program, so the compiler will certainly produce a "working" binary. That binary could in theory have the same behavior as any binary you could download on the internet and that you run. Most probably, you get a binary that exit straight away, or that print the aforementioned message and exit straight away.
F) It tells GCC to assume the signed integers wrap around in the C semantic using 2's complement semantic. It must therefore produce a binary that wrap around at run-time. That is rather easy because most architecture have that semantic anyway. The reason for C to have that an UB is so that compilers can assume a + 1 > a which is critical to prove that loops terminate and/or predict branches. That's why using signed integer as loop induction variable can lead to faster code, even though it is mapped to the exact same instructions in hardware.
G) Undefined behavior is undefined behavior. The produced binary could indeed run any instructions, including a jump to an unspecified place... or cleanly trigger an interruption. Most probably, your compiler will get rid of that unnecessary operation.
You obviously cannot get root by doing a signed integer overflow.
Why not?
If you assume that signed integer overflow can only yield some particular value, then you're unlikely to get root that way. But the thing about undefined behavior is that an optimizing compiler can assume that it doesn't happen, and generate code based on that assumption.
Operating systems have bugs. Exploiting those bugs can, among other things, invoke privilege escalation.
Suppose you use signed integer arithmetic to compute an index into an array. If the computation overflows, you could accidentally clobber some arbitrary chunk of memory outside the intended array. That could cause your program to do arbitrarily bad things.
If a bug can be exploited deliberately (and the existence of malware clearly indicates that that's possible), it's at least possible that it could be exploited accidentally.
Also, consider this simple contrived program:
#include <stdio.h>
#include <limits.h>
int main(void) {
int x = INT_MAX;
if (x < x + 1) {
puts("Code that gets root");
}
else {
puts("Code that doesn't get root");
}
}
On my system, it prints
Code that doesn't get root
when compiled with gcc -O0 or gcc -O1, and
Code that gets root
with gcc -O2 or gcc -O3.
I don't have concrete examples of signed integer overflow triggering a security flaw (and I wouldn't post such an example if I had one), but it's clearly possible.
Undefined behavior can in principle make your program do accidentally anything that a program starting with the same privileges could do deliberately. Unless you're using a bug-free operating system, that could include privilege escalation, erasing your hard drive, or sending a nasty e-mail message to your boss.
To my mind, the worst thing that can happen in the face of undefined behavior is something different tomorrow.
I enjoy programming, but I also enjoy finishing a program, and going on to work on something else. I do not delight in continuously tinkering with my already-written programs, to keep them working in the face of bugs they spontaneously develop as hardware, compilers, or other circumstances keep changing.
So when I write a program, it is not enough for it to work. It has to work for the right reasons. I have to know that it works, and that it will keep working next week and next month and next year. It can't just seem to work, to have given apparently correct answers on the -- necessarily finite -- set of test cases I've run it on so far.
And that's why undefined behavior is so pernicious: it might do something perfectly fine today, and then do something completely different tomorrow, when I'm not around to defend it. The behavior might change because someone ran it on a slightly different machine, or with more or less memory, or on a very different set of inputs, or after recompiling it with a different compiler.
See also the third part of this other answer (the part starting with "And now, one more thing, if you're still with me").
It used to be that you could count on the compiler to do something "reasonable". More and more often, though, compilers are truly taking advantage of their license to do weird things when you write undefined code. In the name of efficiency, these compilers are introducing very strange optimizations, which don't do anything close to what you probably want.
Read these posts:
Linus Torvalds describes a kernel bug that was much worse than it could have been given that gcc took advantage of undefined behavior
LLVM blog post on undefined behavior (first of three parts, also two, three)
another great blog post by John Regehr (also first of three parts: two, three)
Related
The following code produces strange things on my system:
#include <stdio.h>
void f (int x) {
int y = x + x;
int v = !y;
if (x == (1 << 31))
printf ("y: %d, !y: %d\n", y, !y);
}
int main () {
f (1 << 31);
return 0;
}
Compiled with -O1, this prints y: 0, !y: 0.
Now beyond the puzzling fact that removing the int v or the if lines produces the expected result, I'm not comfortable with undefined behavior of overflows translating to logical inconsistency.
Should this be considered a bug, or is the GCC team philosophy that one unexpected behavior can cascade into logical contradiction?
When invoking undefined behavior, anything can happen. There's a reason why it's called undefined behavior, after all.
Should this be considered a bug, or is the GCC team philosophy that one unexpected behavior can cascade into logical contradiction?
It's not a bug. I don't know much about the philosophy of the GCC team, but in general undefined behavior is "useful" to compiler developers to implement certain optimizations: assuming something will never happen makes it easier to optimize code. The reason why anything can happen after UB is exactly because of this. The compiler makes a lot of assumptions and if any of them is broken then the emitted code cannot be trusted.
As I said in another answer of mine:
Undefined behavior means that anything can happen. There is no explanation as to why anything strange happens after invoking undefined behavior, nor there needs to be. The compiler could very well emit 16-bit Real Mode x86 assembly, produce a binary that deletes your entire home folder, emit the Apollo 11 Guidance Computer assembly code, or whatever else. It is not a bug. It's perfectly conforming to the standard.
The 2018 C standard defines, in clause 3.4.3, paragraph 1, “undefined behavior” to be:
behavior, upon use of a nonportable or erroneous program construct or of erroneous data, for which this document imposes no requirements
That is quite simple. There are no requirements from the standard. So, no, the standard does not require the behavior to be “consistent.” There is no requirement.
Furthermore, compilers, operating systems, and other things involved in building and running a program generally do not impose any requirement of “consistency” in the sense asked about in this question.
Addendum
Note that answers that say “anything can happen” are incorrect. The C standard only says that it imposes no requirements when there is behavior that it deems “undefined.” It does not nullify other requirements and has no authority to nullify them. Any specifications of compilers, operating systems, machine architectures, or consumer product laws; or laws of physics; laws of logic; or other constraints still apply. One situation where this matters is simply linking to software libraries not written in C: The C standard does not define what happens, but what does happen is still constrained by the other programming language(s) used and the specifications of the libraries, as well as the linker, operating system, and so on.
Marco Bonelli has given the reasons why such behaviour is allowed; I'd like to attempt an explanation of why it might be practical.
Optimising compilers, by definition, are expected to do various stuff in order to make programs run faster. They are allowed to delete unused code, unwrap loops, rearrange operations and so on.
Taking your code, can the compiler be really expected to perform the !y operation strictly before the call to printf()? I'd say if you impose such rules, there'll be no place left for any optimisations. So, a compiler should be free to rewrite the code as
void f (int x) {
int y = x + x;
int notY = !(x + x);
if (x == (1 << 31))
printf ("y: %d, !y: %d\n", y, notY);
}
Now, it should be obvious that for any inputs which don't cause overflow the behaviour would be identical. However, in the case of overflow y and notY experience the effects of UB independently, and may both become 0 because why not.
For some reason, a myth has emerged that the authors of the Standard used the phrase "Undefined Behavior" to describe actions which earlier descriptions of the language by its inventor characterized as "machine dependent" was to allow compilers to infer that various things wouldn't happen. While it is true that the Standard doesn't require that implementations process such actions meaningfully even on platforms where there would be a natural "machine-dependent" behavior, the Standard also doesn't require that any implementation be capable of processing any useful programs meaningfully; an implementation could be conforming without being able to meaningfully process anything other than a single contrived and useless program. That's not a twisting of the Standard's intention: "While a deficient implementation could probably contrive
a program that meets this requirement, yet still succeed in being useless, the C89 Committee felt that such ingenuity would probably require more work than making something useful."
In discussing the decision to make short unsigned values promote to signed int, the authors of the Standard observed that most current implementations used quiet wraparound integer-overflow semantics, and having values promote to signed int would not adversely affect behavior if the value was used in overflow scenarios where the upper bits wouldn't matter.
From a practical perspective, guaranteeing clean wraparound semantics costs a little more than would allowing integer computations to behave as though performed on larger types at unspecified times. Even in the absence of "optimization", even straightforward code generation for an expression like long1 = int1*int2+long2; would on many platforms benefit from being able to use the result of a 16x16->32 or 32x32->64 multiply instruction directly, rather than having to sign-extend the lower half of the result. Further, allowing a compiler to evaluate x+1 as a type larger than x at its convenience would allow it to replace x+1 > y with x >= y--generally a useful and safe optimization.
Compilers like gcc go further, however. Even though the authors of the Standard observed that in the evaluation of something like:
unsigned mul(unsigned short x, unsigned short y) { return x*y; }
the Standard's decision to promote x and y to signed int wouldn't adversely affect behavior compared with using unsigned ("Both schemes give the same answer in the vast majority of cases, and both give the same effective result in even more cases in implementations with two’s-complement arithmetic and quiet wraparound on signed overflow—that is, in most current implementations."), gcc will sometimes use the above function to infer within calling code that x cannot possibly exceed INT_MAX/y. I've seen no evidence that the authors of the Standard anticipated such behavior, much less intended to encourage it. While the authors of gcc claim any code that would invoke overflow in such cases is "broken", I don't think the authors of the Standard would agree, since in discussing conformance, they note: "The goal is to give the programmer a fighting chance to make powerful C programs that are also highly portable, without seeming to demean perfectly useful C programs that happen not to be portable, thus the adverb strictly."
Because the authors of the Standard failed to forbid the authors of gcc from processing code nonsensically in case of integer overflow, even on quiet-wraparound platforms, they insist that they should jump the rails in such cases. No compiler writer who was trying to win paying customers would take such an attitude, but the authors of the Standard failed to realize that compiler writers might value cleverness over customer satisfaction.
I am working on a tutorial for binary numbers. Something I have wondered for a while is why all the integer maximum and minimum values touch. For example for an unsigned byte 255 + 1 = 0 and 0 - 1 = 255. I understand all the binary math that goes into it, but why was the decision made to have them work this way instead of a straight number line that gives an error when the extremes are breached?
Since your example is unsigned, I assume it's OK to limited the scope to unsigned types.
Allowing wrapping is useful. For example, it's what allows you (and the compiler) to always reorder (and constant-fold) a sequence of additions and subtractions. Even something such as x + 3 - 1 could not be optimized to x + 2 if the language requires trapping, because it changes the conditions under which the expression would trap. Wrapping also mixes better with bit manipulation, with the interpretation of an unsigned number as a vector of bits it makes very little sense if there's trapping. That applies especially to shifts, but addition, subtraction and even multiplication also make sense on bitvectors and combine usefully with the usual bitwise operations.
The algebraic structure you get when allowing wrapping, Z/2kZ, is fairly nice (perhaps not as nice as modulo a prime, but that would interact badly with the bitvector interpretation and it doesn't match typical hardware) and well known, so it's not like anything particularly unexpected or weird will happen, it's not like a wrapped result is a "uselessly arbitrary" result.
And of course testing the carry flag (or whatever may be required) after just about every operation has a big direct overhead as well.
Trapping on "unsigned overflow" is both expensive and undesirable, at least if it is the default behaviour.
Why not "give an error when the extremes are breached"?
Error handling is one of the hardest things in software development. When an error happens, there are many possible ways software could be required to do:
Show an annoying message to the user? Like Hey user, you have just tried to add 1 to this variable, which is already too big. Stop that! - there often is no context to show the user, that would be of any help.
Throw an exception? (BTW C has support for that) - that would show a stack trace, if you happened to execute your code in a debugger. Otherwise, it would just crash - not bad (it won't corrupt your files) but not good either (can be exploited as a denial of service attack).
Write it to a log file? - sometimes it's the best thing to do - record the error and move on, so it can be debugged later.
The right thing to do depends on your code. So a generic programming language like C doesn't want to restrict you by providing any mandatory behavior.
Instead, C provides two guidelines:
For unsigned types like unsigned int or uint8_t or (usually) char - it provides silent wraparound, for best performance.
For signed types like int - it provides "undefined behavior", which makes it possible to "choose", in a very limited way, what will happen on overflow
Throw an exception if using -ftrapv in gcc
Silent wraparound if using -fwrapv in gcc
By default (no fancy command-line options) - the compiler may assume it will never happen, which may help it produce optimized code
The idea here is that you (the programmer) should think where checking for overflow is worth doing, and how to recover from overflow (if the language provided a standard error handling mechanism, it would deny you the latter part). This approach has maximum flexibility, (potentially) maximum performance, and (usually) hardest to do - which fits the philosophy of C.
I have been messing around trying to learn C lately. Coming from Java, it surprised me that you can perform certain operations declared as "undefined".
This just seems extremely unsafe to me. I understand it is the programmer's responsibility not to perform undefined operations, but why is it even allowed to start with? Why does the compiler not catch, for instance, array indices out of bounds, or even dangling pointers? You just end up accessing blocks of memory you never should access, with no (apparent) good reason.
As a comparison, Java makes extra sure you don't do anything stupid, throwing Exceptions around like hot cakes.
Surely there must be a reason why this is allowed? What is it?
ANSWER: To my understanding, the main reason is performance. Also, Java does have undefined behaviours, although not labeled as such.
EDIT: restricted question to C
Undefined behavior is not allowed, it's just not caught by the compiler.
The tradeoff here is between the speed and the safety. Many kinds of undefined behavior could be prevented at the expense of a few additional CPU cycles.
For example, you could prevent UB that happens when you read from memory that has been allocated but not initialized by having the compiled code write zeros into it. This, however, costs you a whole additional write into a memory, which is entirely unnecessary.
Similarly, one could prevent reading/writing past the end of an array by checking its bounds inside [] operator. However, this would cost you a few additional CPU cycles on each array access.
C++ designers decided that it is better to have speed and allow potential UB than to force everyone pay for what they do not need. This approach, however, is incompatible with Java's "write once, run anywhere" requirement, so designers of Java language insisted on fully defined behavior in nearly all situations.
Originally, most forms of Undefined Behavior represented things which some implementations might trap, but other implementations might not. Because there was no way for the authors of the Standard to predict all the things a platform might do in case of a trap (including, literally, the possibility that a system would sound an alarm and lock up until an operator manually cleared the fault), the consequences of traps fell outside the jurisdiction of the C Standard, and thus almost every action for which some platform might conceivably cause a trap is--from the point of view of the Standard--considered "Undefined Behavior".
That should not be taken to imply that the authors of the Standard didn't believe implementations should try to behave sensibly for such things when practical. The authors of the C89 Standard noted, for example, that the majority of current systems of that era would define behavior for:
/* Assume USmall is half the size of "int" */
unsigned mult(USmall x, USmall y) { return x*y; }
which would in all cases, including those where the mathematical product of x and y was between INT_MAX+1 and UINT_MAX, be equivalent to (unsigned)x*y;. I see no reason to believe they wouldn't have expected that trend to continue.
Unfortunately, a new philosophy has become fashionable, based on the revisionist viewpoint that compiler writers only supported useful behaviors in cases not mandated by the Standard because they were too unsophisticated to do anything else. In gcc, for example, using optimization level 2 but no other non-default options, the above "mult" routine will sometimes generate bogus code in cases where the product would be between 0x80000000u and 0xFFFFFFFFu, even when running on platforms where such computations would historically have worked. This is supposedly being done in the name of "optimization"; it would be interesting to know how many of the "optimizations" such techniques end up performing are actually useful and could not have been achieved via safer means.
Historically, Undefined Behavior was a license for a C compiler to expose the behavior of the underlying platform; in cases where the underlying platform's behavior fit the programmer's needs, this allowed the programmer's requirements to be expressed in machine code more efficiently than if everything had to be done in ways defined by the Standard. Lately, however, it has been interpreted as license for compilers to implement behaviors which not only bear no relation to anything in the underlying platform nor to any plausible programmer expectations, but aren't even bound by laws of time and causality.
Java has a run time environment to take care of you. That's why an exception is thrown when going out of bounds - it's something that can't be figured out at compile time.
There is run time bounds checking in C++ when using the at() method for a vector. It's what distinguishes at() from the []operator
The line
a = a++;
is undefined behaviour in C. The question I am asking is: why?
I mean, I get that it might be hard to provide a consistent order in which things should be done. But, certain compilers will always do it in one order or the other (at a given optimization level). So why exactly is this left up to the compiler to decide?
To be clear, I want to know if this was a design decision and if so, what prompted it? Or maybe there is a hardware limitation of some kind?
UPDATE: This question was the subject of my blog on June 18th, 2012. Thanks for the great question!
Why? I want to know if this was a design decision and if so, what prompted it?
You are essentially asking for the minutes of the meeting of the ANSI C design committee, and I don't have those handy. If your question can only be answered definitively by someone who was in the room that day, then you're going to have to find someone who was in that room.
However, I can answer a broader question:
What are some of the factors that lead a language design committee to leave the behaviour of a legal program (*) "undefined" or "implementation defined" (**)?
The first major factor is: are there two existing implementations of the language in the marketplace that disagree on the behaviour of a particular program? If FooCorp's compiler compiles M(A(), B()) as "call A, call B, call M", and BarCorp's compiler compiles it as "call B, call A, call M", and neither is the "obviously correct" behaviour then there is strong incentive to the language design committee to say "you're both right", and make it implementation defined behaviour. Particularly this is the case if FooCorp and BarCorp both have representatives on the committee.
The next major factor is: does the feature naturally present many different possibilities for implementation? For example, in C# the compiler's analysis of a "query comprehension" expression is specified as "do a syntactic transformation into an equivalent program that does not have query comprehensions, and then analyze that program normally". There is very little freedom for an implementation to do otherwise.
By contrast, the C# specification says that the foreach loop should be treated as the equivalent while loop inside a try block, but allows the implementation some flexibility. A C# compiler is permitted to say, for example "I know how to implement foreach loop semantics more efficiently over an array" and use the array's indexing feature rather than converting the array to a sequence as the specification suggests it should.
A third factor is: is the feature so complex that a detailed breakdown of its exact behaviour would be difficult or expensive to specify? The C# specification says very little indeed about how anonymous methods, lambda expressions, expression trees, dynamic calls, iterator blocks and async blocks are to be implemented; it merely describes the desired semantics and some restrictions on behaviour, and leaves the rest up to the implementation.
A fourth factor is: does the feature impose a high burden on the compiler to analyze? For example, in C# if you have:
Func<int, int> f1 = (int x)=>x + 1;
Func<int, int> f2 = (int x)=>x + 1;
bool b = object.ReferenceEquals(f1, f2);
Suppose we require b to be true. How are you going to determine when two functions are "the same"? Doing an "intensionality" analysis -- do the function bodies have the same content? -- is hard, and doing an "extensionality" analysis -- do the functions have the same results when given the same inputs? -- is even harder. A language specification committee should seek to minimize the number of open research problems that an implementation team has to solve!
In C# this is therefore left to be implementation-defined; a compiler can choose to make them reference equal or not at its discretion.
A fifth factor is: does the feature impose a high burden on the runtime environment?
For example, in C# dereferencing past the end of an array is well-defined; it produces an array-index-was-out-of-bounds exception. This feature can be implemented with a small -- not zero, but small -- cost at runtime. Calling an instance or virtual method with a null receiver is defined as producing a null-was-dereferenced exception; again, this can be implemented with a small, but non-zero cost. The benefit of eliminating the undefined behaviour pays for the small runtime cost.
A sixth factor is: does making the behaviour defined preclude some major optimization? For example, C# defines the ordering of side effects when observed from the thread that causes the side effects. But the behaviour of a program that observes side effects of one thread from another thread is implementation-defined except for a few "special" side effects. (Like a volatile write, or entering a lock.) If the C# language required that all threads observe the same side effects in the same order then we would have to restrict modern processors from doing their jobs efficiently; modern processors depend on out-of-order execution and sophisticated caching strategies to obtain their high level of performance.
Those are just a few factors that come to mind; there are of course many, many other factors that language design committees debate before making a feature "implementation defined" or "undefined".
Now let's return to your specific example.
The C# language does make that behaviour strictly defined(†); the side effect of the increment is observed to happen before the side effect of the assignment. So there cannot be any "well, it's just impossible" argument there, because it is possible to choose a behaviour and stick to it. Nor does this preclude major opportunities for optimizations. And there are not a multiplicity of possible complex implementation strategies.
My guess, therefore, and I emphasize that this is a guess, is that the C language committee made ordering of side effects into implementation defined behaviour because there were multiple compilers in the marketplace that did it differently, none was clearly "more correct", and the committee was unwilling to tell half of them that they were wrong.
(*) Or, sometimes, its compiler! But let's ignore that factor.
(**) "Undefined" behaviour means that the code can do anything, including erasing your hard disk. The compiler is not required to generate code that has any particular behaviour, and not required to tell you that it is generating code with undefined behaviour. "Implementation defined" behaviour means that the compiler author is given considerable freedom in choice of implementation strategy, but is required to pick a strategy, use it consistently, and document that choice.
(†) When observed from a single thread, of course.
It's undefined because there is no good reason for writing code like that, and by not requiring any specific behaviour for bogus code, compilers can more aggressively optimize well-written code. For example, *p = i++ may be optimized in a way that causes a crash if p happens to point to i, possibly because two cores write to the same memory location at the same time. The fact that this also happens to be undefined in the specific case that *p is explicitly written out as i, to get i = i++, logically follows.
It's ambiguous but not syntactically wrong. What should a be? Both = and ++ have the same "timing." So instead of defining an arbitrary order it was left undefined since either order would be in conflict with one of the two operators definitions.
With a few exceptions, the order in which expressions are evaluated is unspecified; this was a deliberate design decision, and it allows implementations to rearrange the evaluation order from what's written if that will result in more efficient machine code. Similarly, the order in which the side effects of ++ and -- are applied is unspecified beyond the requirement that it happen before the next sequence point, again to give implementations the freedom to arrange operations in an optimal manner.
Unfortunately, this means that the result of an expression like a = a++ will vary based on the compiler, compiler settings, surrounding code, etc. The behavior is specifically called out as undefined in the language standard so that compiler implementors don't have to worry about detecting such cases and issuing a diagnostic against them. Cases like a = a++ are obvious, but what about something like
void foo(int *a, int *b)
{
*a = (*b)++;
}
If that's the only function in the file (or if its caller is in a different file), there's no way to know at compile time whether a and b point to the same object; what do you do?
Note that it's entirely possible to mandate that all expressions be evaluated in a specific order, and that all side effects be applied at a specific point in evaluation; that's what Java and C# do, and in those languages expressions like a = a++ are always well-defined.
The postfix ++ operator returns the value prior to the incrementation. So, at the first step, a gets assigned to its old value (that's what ++ returns). At the next point it is undefined whether the increment or the assignment will take place first, because both operations are applied over the same object (a), and the language says nothing about the order of evaluation of these operators.
Somebody may provide another reason, but from an optimization (better say assembler presentation) point of view, a needs be loaded into a CPU register, and the postfix operator's value should be placed into another register or the same.
So the last assignment can depend on either the optimizer using one register or two.
Updating the same object twice without an intervening sequence point is undefined behaviour ...
because that makes compiler writers happier
because it allows implementations to define it anyway
because it doesn't force a specific constraint when it isn't needed
Suppose a is a pointer with value 0x0001FFFF. And suppose the architecture is segmented so that the compiler needs to apply the increment to the high and low parts separately, with a carry between them. The optimiser could conceivably reorder the writes so that the final value stored is 0x0002FFFF; that is, the low part before the increment and the high part after the increment.
This value is twice either value that you might have expected. It may point to memory not owned by the application, or it may (in general) be a trapping representation. In other words, the CPU may raise a hardware fault as soon as this value is loaded into a register, crashing the application. Even if it doesn't cause an immediate crash, it is a profoundly wrong value for the application to be using.
The same kind of thing can happen with other basic types, and the C language allows even ints to have trapping representations. C tries to allow efficient implementation on a wide range of hardware. Getting efficient code on a segmented machine such as the 8086 is hard. By making this undefined behaviour, a language implementer has a bit more freedom to optimise aggressively. I don't know if it has ever made a performance difference in practice, but evidently the language committee wanted to give every benefit to the optimiser.
I am working with a large C library where some array indices are computed using int.
I need to find a way to trap integer overflows at runtime in such way as to narrow to problematic line of code. Libc manual states:
FPE_INTOVF_TRAP
Integer overflow (impossible in a C program unless you enable overflow trapping in a hardware-specific fashion).
however gcc option -ffpe-trap suggests that those only apply to FP numbers?
So how I do enable integer overflow trap? My system is Xeon/Core2, gcc-4.x, Linux 2.6
I have looked through similar questions but they all boil to modifying the code. I need to know however which code is problematic in the first place.
If Xeons can't trap overflows, which processors can? I have access to non-emt64 machines as well.
I have found a tool designed for llvm meanwhile: http://embed.cs.utah.edu/ioc/
There doesn't seem to be however an equivalent for gcc/icc?
Ok, I may have to answer my own question.
I found gcc has -ftrapv option, a quick test does confirm that at least on my system overflow is trapped. I will post more detailed info as I learn more since it seems very useful tool.
Unsigned integer arithmetic does not overflow, of course.
With signed integer arithmetic, overflow leads to undefined behaviour; anything could happen. And optimizers are getting aggressive about optimizing stuff that overflows. So, your best bet is to avoid the overflow, rather than trapping it when it happens. Consider using the CERT 'Secure Integer Library' (the URL referenced there seems to have gone AWOL/404; I'm not sure what's happened yet) or Google's 'Safe Integer Operation' library.
If you must trap overflow, you are going to need to specify which platform you are interested in (O/S including version, compiler including version), because the answer will be very platform specific.
Do you know exactly which line the overflow is occuring on? If so, you might be able to look at the assembler's Carry flag if the operation in question caused an overflow. This is the flag that the CPU uses to do large number calculation and, while not available at the C level, might help you to debug the problem - or at least give you a chance to do something.
BTW, found this link for gcc (-ftrapv) that talks about an integer trap. Might be what you are looking for.
You can use inline assembler in gcc to use an instruction that might generate an overflow and then test the overflow flag to see if it actually does:
int addo(int a, int b)
{
asm goto("add %0,%1; jo %l[overflow]" : : "r"(a), "r"(b) : "cc" : overflow);
return a+b;
overflow:
return 0;
}
In this case, it tries to add a and b, and if it does, it goes to the overflow label. If there's no overflow, it continues, doing the add again and returning it.
This runs into the GCC limitation that an inline asm block cannot both output a value and maybe branch -- if it weren't for that, you wouldn't need a second add to actually get the result.