Why is the C preprocessor a subject of undefined behavior? - c

I can understand that:
One of the origins of the UB is a performance increase (e.g. by removing never executed code, such as if (i+1 < i) { /* never_executed_code */ }; UPD: if i is a signed integer).
UB can be triggered at compile time because C does not clearly distinguish between compile time and run time. The "whole language is based on the (rather unhelpful) concept of an "abstract machine" (link).
However, I cannot understand yet why C preprocessor is a subject of undefined behavior? It is known that preprocessing directives are executed at compile time.
Consider C11, 6.10.3.3 The ## operator, 3:
If the result is not a valid preprocessing token, the behavior is undefined.
Why not make it a constraint? For example:
The result shall be a valid preprocessing token.
The same question goes for all the other "the behavior is undefined" in 6.10 Preprocessing directives.

Why is the C preprocessor a subject of undefined behavior?
When the C standard was created, there were some existing C preprocessors and there was some imaginary ideal C preprocessor in the minds of standardization committee members.
So there were these gray areas, where committee members weren't completely sure what would they want to do and/or existing C preprocessor implementations differed which each other in behavior.
So, these cases are not defined behavior. Because the C committee members are not completely sure what the behavior actually should be. So there is no requirement on what it should be.
One of the origins of the UB
Yes, one of.
UB may exist to ease up implementing the language. Like for example, in case of the preprocessor, the preprocessor writers don't have to care about what happens when an invalid preprocessor token is a result of ##.
Or UB may exist to reconcile existing implementations with different behaviors or as a point for extensions. So a preprocessor that segfaults in case of UB, a preprocessor that accepts and works in case of UB, and a preprocessor that formats your hard drive in case of UB, all can be standard conformant (but I wouldn't want to work on that one that formats your drive).

Suppose a file which is read in via include directive ends with the partial line:
#define foo bar
Depending upon the design of the preprocessor, it's possible that the partial token bar might be concatenated to whatever appears at the start of the line following the #include directive, or that whatever appears on that line will behave as though it were placed on the line with the #define directive, but with a whitespace separating it from the token bar, and it would hardly be inconceivable that a build script might rely upon such behaviors. It's also possible that implementations might behave as though a newline were inserted at the end of the included file, or might ignore the last partial line of such a file.
Any code which relied upon one of the former behaviors would clearly have been non-portable, but if code exploited such behavior to do something that would otherwise not be practical, such code would hardly be "erroneous", and the authors of the Standard would not have wanted to forbid an implementation that would process it usefully from continuing to do so.
When the Standard uses the phrase "non-portable or erroneous", that does not mean "non-portable, therefore erroneous". Prior to the publication of C89, C implementations defined many useful constructs, but none of them were defined by "the C Standard" since there wasn't one. If an implementation defined the behavior of some construct, some didn't, and the Standard left the construct as "Undefined", that would simply preserve the status quo where implementations that chose to define a useful behavior would do so, those that chose not to wouldn't, and programs that relied upon such behaviors would be "non-portable", working correctly on implementations that supported the behaviors, but not on those that didn't.

Without getting into specifics, my guess is, there exist several preprocessor implementations which have bugs, but the Standard doesn't want to declare them non-conforming, for compatibility reasons.
In human language: if you write a program which has X in it, preprocessor does weird stuff.
In standardese: the behavior of program with X is undefined.
If the standard says something like "The result shall be a valid preprocessing token", it might be unclear what "shall" means in this context.
The programmer shall write the program so this condition holds? If so, the wording with "undefined behavior" is clearer and more uniform (it appears in other places too)
The preprocessor shall make sure this condition holds? If so, this requires dedicated logic which checks the condition; may be impractical to implement.

Related

Why does the preprocessor forbid macro pasting of weird tokens?

I am writing my own C-preprocessor based on GCC. So far it is nearly identical, but what I consider redundant is to perform any form of checking on the tokens being concatenated by virtue of ##.
So in my preprocessor manual, I've written this:
3.5 Concatenation
...
GCC forbids concatenation with two mutually incompatible preprocessing
tokens such as "x" and "+" (in any order). That would result in the
following error: "pasting "x" and "+" does not give a valid
preprocessing token" However this isn't true for this preprocessor - concatenation
may occur between any token.
My reasoning is simple: if it expands to invalid code, then the compiler will produce an error and so I don't have to explicitly handle such cases, making the preprocessor slower and increasing in code complexity. If it results in valid code, then this restriction removal just makes it more flexible (although probably in rare cases).
So I would like to ask, why does this error actually happen, why is this restriction actually applied and is it a true crime if I dismiss it in my preprocessor?
As far as ISO C goes, if ## creates an invalid token, the behavior is undefined. But there is a somewhat strange situation there, in the following way. The output of the preprocessing translation phases of C is a stream of preprocessing tokens (pp-tokens). These are converted to tokens, and then syntactically and semantically analyzed. Now here is an important rule: it is a constraint violation if a pp-token doesn't have a form which lets it be converted to a token. Thus, a preprocessor token which is garbage that you write yourself without help from the ## operator must be diagnosed for bad lexical syntax. But if you use ## to create a bad preprocessing token, the behavior is undefined.
Note the subtlety there: the behavior of ## is undefined if it is used to create a bad preprocessing token. It's not the case that the pasting is well-defined, and then caught at the stage where pp-tokens are converted to tokens: it's undefined right from that point where ## is evaluated.
Basically, this is historical. C preprocessors were historically (and probably some are) separate programs, with lexical analysis that was different from and looser from the downstream compiler. The C standard tried to capture that somehow in terms of a single language with translation phases, and the result has some quirks and areas of perhaps surprising under-specification. (For instance in the preprocessing translation phases, a number token ("pp-number") is a strange lexical grammar which allows gibberish, such as tokens with multiple floating-point E exponents.)
Now, back to your situation. Your textual C preprocessor does not actually output pp-token objects; it outputs another text stream. You may have pp-token objects internally, but they get flattened on output. Thus, you might think, why not allow your ## operator to just blindly glue together any two tokens? The net effect is as if those tokens were dumped into the output stream without any intervening whitespace. (And this is probably all it was, in early preprocessors which supported ##, and ran as separate programs).
Unfortunately what that means is that your ## operator is not purely a bona fide token pasting operator; it's just a blind juxtaposing operator which sometimes produces one token, when it happens to juxtapose two tokens that will be lexically analyzed as one by the downstream compiler. If you go that way, it may be best to be honest and document it as such, rather than portraying it as a flexibility feature.
A good reason, on the other hand, to reject bad preprocessing tokens in the ## operator is to catch situations in which it cannot achieve its documented job description: the requirement of making one token out of two. The idea is that the programmer knows the language spec (the contract between the programmer and the implementation) and knows that ## is supposed to make one token, and relies on that. For such a programmer, any situation involving a bad token paste is a mistake, and that programmer is best supported by diagnosis.
The maintainers of GCC and the GNU CPP preprocessor probably took this view: that the preprocessor isn't a flexible text munging tool, but part of a toolchain supporting disciplined C programming.
Moreover, the undefined behavior of a bad token paste job is easily diagnosed, so why not diagnose it? The lack of a diagnosis requirement in this area in the standard looks like just a historic concession. It is a kind of "low-hanging fruit" of diagnosis. Let those undefined behaviors go undiagnosed for which diagnosis is difficult or intractable, or requires run-time penalties.

Does i=i++ cause undefined behavior even if not executed? [duplicate]

The code that invokes undefined behavior (in this example, division by zero) will never get executed, is the program still undefined behavior?
int main(void)
{
int i;
if(0)
{
i = 1/0;
}
return 0;
}
I think it still is undefined behavior, but I can't find any evidence in the standard to support or deny me.
So, any ideas?
Let's look at how the C standard defines the terms "behavior" and "undefined behavior".
References are to the N1570 draft of the ISO C 2011 standard; I'm not aware of any relevant differences in any of the three published ISO C standards (1990, 1999, and 2011).
Section 3.4:
behavior
external appearance or action
Ok, that's a bit vague, but I'd argue that a given statement has no "appearance", and certainly no "action", unless it's actually executed.
Section 3.4.3:
undefined behavior
behavior, upon use of a nonportable or erroneous program construct or of erroneous data,
for which this International Standard imposes no requirements
It says "upon use" of such a construct. The word "use" is not defined by the standard, so we fall back to the common English meaning. A construct is not "used" if it's never executed.
There's a note under that definition:
NOTE Possible undefined behavior ranges from ignoring the situation
completely with unpredictable results, to behaving during translation
or program execution in a documented manner characteristic of the
environment (with or without the issuance of a diagnostic message), to
terminating a translation or execution (with the issuance of a
diagnostic message).
So a compiler is permitted to reject your program at compile time if its behavior is undefined. But my interpretation of that is that it can do so only if it can prove that every execution of the program will encounter undefined behavior. Which implies, I think, that this:
if (rand() % 2 == 0) {
i = i / 0;
}
which certainly can have undefined behavior, cannot be rejected at compile time.
As a practical matter, programs have to be able to perform runtime tests to guard against invoking undefined behavior, and the standard has to permit them to do so.
Your example was:
if (0) {
i = 1/0;
}
which never executes the division by 0. A very common idiom is:
int x, y;
/* set values for x and y */
if (y != 0) {
x = x / y;
}
The division certainly has undefined behavior if y == 0, but it's never executed if y == 0. The behavior is well defined, and for the same reason that your example is well defined: because the potential undefined behavior can never actually happen.
(Unless INT_MIN < -INT_MAX && x == INT_MIN && y == -1 (yes, integer division can overflow), but that's a separate issue.)
In a comment (since deleted), somebody pointed out that the compiler may evaluate constant expressions at compile time. Which is true, but not relevant in this case, because in the context of
i = 1/0;
1/0 is not a constant expression.
A constant-expression is a syntactic category that reduces to conditional-expression (which excludes assignments and comma expressions). The production constant-expression appears in the grammar only in contexts that actually require a constant expression, such as case labels. So if you write:
switch (...) {
case 1/0:
...
}
then 1/0 is a constant expression -- and one that violates the constraint in 6.6p4: "Each constant expression shall evaluate to a constant that is in the range of representable
values for its type.", so a diagnostic is required. But the right hand side of an assignment does not require a constant-expression, merely a conditional-expression, so the constraints on constant expressions don't apply. A compiler can evaluate any expression that it's able to at compile time, but only if the behavior is the same as if it were evaluated during execution (or, in the context of if (0), not evaluated during execution().
(Something that looks exactly like a constant-expression is not necessarily a constant-expression, just as, in x + y * z, the sequence x + y is not an additive-expression because of the context in which it appears.)
Which means the footnote in N1570 section 6.6 that I was going to cite:
Thus, in the following initialization,
static int i = 2 || 1 / 0;
the expression is a valid integer constant expression with value one.
isn't actually relevant to this question.
Finally, there are a few things that are defined to cause undefined behavior that aren't about what happens during execution. Annex J, section 2 of the C standard (again, see the N1570 draft) lists things that cause undefined behavior, gathered from the rest of the standard. Some examples (I don't claim this is an exhaustive list) are:
A nonempty source file does not end in a new-line character which is not immediately preceded by a backslash character or ends in a partial
preprocessing token or comment
Token concatenation produces a character sequence matching the syntax of a universal character name
A character not in the basic source character set is encountered in a source file, except in an identifier, a character constant, a string
literal, a header name, a comment, or a preprocessing token that is
never converted to a token
An identifier, comment, string literal, character constant, or header name contains an invalid multibyte character or does not begin
and end in the initial shift state
The same identifier has both internal and external linkage in the same translation unit
These particular cases are things that a compiler could detect. I think their behavior is undefined because the committee didn't want to, or couldn't, impose the same behavior on all implementations, and defining a range of permitted behaviors just wasn't worth the effort. They don't really fall into the category of "code that will never be executed", but I mention them here for completeness.
This article discusses this question in section 2.6:
int main(void){
guard();
5 / 0;
}
The authors consider that the program is defined when guard() does not terminate. They also find themselves distinguishing notions of “statically undefined” and “dynamically undefined”, e.g.:
The intention behind the standard11 appears to be that, in general, situations are made statically undefined if it is not easy to generate code for them. Only when code can be generated, then the situation can be undefined dynamically.
11) Private correspondence with committee member.
I would recommend looking at the entire article. Taken together, it paints a consistent picture.
The fact that the authors of the article had to discuss the question with a committee member confirms that the standard is currently fuzzy on the answer to your question.
In this case the undefined behavior is the result of executing the code. So if the code is not executed, there is no undefined behavior.
Non executed code could invoke undefined behavior if the undefined behavior was the result of solely the declaration of the code (e.g. if some case of variable shadowing was undefined).
I'd go with the last paragraph of this answer: https://stackoverflow.com/a/18384176/694576
... UB is a runtime issue, not a compiletime issue ...
So, no, there is no UB invoked.
Only when the standard makes breaking changes and your code suddenly is no longer "never gets executed". But I don't see any logical way in which this can cause 'undefined behaviour'. Its not causing anything.
On the subject of undefined behaviour it is often hard to separate the formal aspects from the practical ones. This is the definition of undefined behaviour in the 1989 standard (I don't have a more recent version at hand, but I don't expect this to have changed substantially):
1 undefined behavior
behavior, upon use of a nonportable or erroneous program construct or of
erroneous data, for which this International Standard imposes no requirements
2 NOTE Possible undefined behavior ranges from ignoring the situation completely
with unpredictable results, to behaving during translation or program execution
in a documented manner characteristic of the environment (with or without the
issuance of a diagnostic message), to terminating a translation or
execution (with the issuance of a diagnostic message).
From a formal point of view I'd say your program does invoke undefined behaviour, which means that the standard places no requirement whatsoever on what it will do when run, just because it contains division by zero.
On the other hand, from a practical point of view I'd be surprised to find a compiler that didn't behave as you intuitively expect.
The standard says, as I remember right, it's allowed to do anything from the moment, a rule got broken. Maybe there are some special events with kind of global flavour (but I never heard or read about something like that)... So I would say: No this can't be UB, because as long the behavior is well defined 0 is allways false, so the rule can't get broken on runtime.
I think it still is undefined behavior, but I can't find any evidence in the standard to support or deny me.
I think the program does not invoke undefined behavior.
Defect Report #109 addresses a similar question and says:
Furthermore, if every possible execution of a given program would result in undefined behavior, the given program is not strictly conforming.
A conforming implementation must not fail to translate a strictly conforming program simply because some possible execution of that program would result in undefined behavior. Because foo might never be called, the example given must be successfully translated by a conforming implementation.
It depends on how the expression "undefined behavior" is defined, and whether "undefined behavior" of a statement is the same as "undefined behavior" for a program.
This program looks like C, so a deeper analysis of what the C standard used by the compiler (as some answers did) is appropriate.
In absence of a specified standard, the correct answer is "it depends". In some languages, compilers after the first error try to guess what the programmer might mean and still generate some code, according to the compilers guess. In other, more pure languages, once somerthing is undefined, the undefinedness propagate to the whole program.
Other languages have a concept of "bounded errors". For some limited kinds of errors, these languages define how much damage an error can produce. In particular languages with implied garbage collection frequently make a difference whether an error invalidates the typing system or does not.

what is the difference Global constant variable vs local constant variable in C [duplicate]

This question already has answers here:
Undefined, unspecified and implementation-defined behavior
(9 answers)
Closed 7 years ago.
The classic apocryphal example of "undefined behavior" is, of course, "nasal demons" — a physical impossibility, regardless of what the C and C++ standards permit.
Because the C and C++ communities tend to put such an emphasis on the unpredictability of undefined behavior and the idea that the compiler is allowed to cause the program to do literally anything when undefined behavior is encountered, I had assumed that the standard puts no restrictions whatsoever on the behavior of, well, undefined behavior.
But the relevant quote in the C++ standard seems to be:
[C++14: defns.undefined]: [..] Permissible undefined behavior ranges from ignoring the situation completely with unpredictable results, to behaving during translation or program execution in a documented manner characteristic of the environment (with or without the issuance of a diagnostic message), to terminating a translation or execution (with the issuance of a diagnostic message). [..]
This actually specifies a small set of possible options:
Ignoring the situation -- Yes, the standard goes on to say that this will have "unpredictable results", but that's not the same as the compiler inserting code (which I assume would be a prerequisite for, you know, nasal demons).
Behaving in a documented manner characteristic of the environment -- this actually sounds relatively benign. (I certainly haven't heard of any documented cases of nasal demons.)
Terminating translation or execution -- with a diagnostic, no less. Would that all UB would behave so nicely.
I assume that in most cases, compilers choose to ignore the undefined behavior; for example, when reading uninitialized memory, it would presumably be an anti-optimization to insert any code to ensure consistent behavior. I suppose that the stranger types of undefined behavior (such as "time travel") would fall under the second category--but this requires that such behaviors be documented and "characteristic of the environment" (so I guess nasal demons are only produced by infernal computers?).
Am I misunderstanding the definition? Are these intended as mere examples of what could constitute undefined behavior, rather than a comprehensive list of options? Is the claim that "anything can happen" meant merely as an unexpected side-effect of ignoring the situation?
Two minor points of clarification:
I thought it was clear from the original question, and I think to most people it was, but I'll spell it out anyway: I do realize that "nasal demons" is tongue-in-cheek.
Please do not write an(other) answer explaining that UB allows for platform-specific compiler optimizations, unless you also explain how it allows for optimizations that implementation-defined behavior wouldn't allow.
This question was not intended as a forum for discussion about the (de)merits of undefined behavior, but that's sort of what it became. In any case, this thread about a hypothetical C-compiler with no undefined behavior may be of additional interest to those who think this is an important topic.
Yes, it permits anything to happen. The note is just giving examples. The definition is pretty clear:
Undefined behavior: behavior for which this International Standard imposes no requirements.
Frequent point of confusion:
You should understand that "no requirement" also means means the implementation is NOT required to leave the behavior undefined or do something bizarre/nondeterministic!
The implementation is perfectly allowed by the C++ standard to document some sane behavior and behave accordingly.1 So, if your compiler claims to wrap around on signed overflow, logic (sanity?) would dictate that you're welcome to rely on that behavior on that compiler. Just don't expect another compiler to behave the same way if it doesn't claim to.
1Heck, it's even allowed to document one thing and do another. That'd be stupid, and it'd probably make you toss it into the trash—why would you trust a compiler whose documentation lies to you?—but it's not against the C++ standard.
One of the historical purposes of Undefined Behavior was to allow for the possibility that certain actions may have different potentially-useful effects on different platforms. For example, in the early days of C, given
int i=INT_MAX;
i++;
printf("%d",i);
some compilers could guarantee that the code would print some particular value (for a two's-complement machine it would typically be INT_MIN), while others would guarantee that the program would terminate without reaching the printf. Depending upon the application requirements, either behavior could be useful. Leaving the behavior undefined meant that an application where abnormal program termination was an acceptable consequence of overflow but producing seemingly-valid-but-wrong output would not be, could forgo overflow checking if run on a platform which would reliably trap it, and an application where abnormal termination in case of overflow would not be acceptable, but producing arithmetically-incorrect output would be, could forgo overflow checking if run on a platform where overflows weren't trapped.
Recently, however, some compiler authors seem to have gotten into a contest to see who can most efficiently eliminate any code whose existence would not be mandated by the standard. Given, for example...
#include <stdio.h>
int main(void)
{
int ch = getchar();
if (ch < 74)
printf("Hey there!");
else
printf("%d",ch*ch*ch*ch*ch);
}
a hyper-modern compiler may conclude that if ch is 74 or greater, the computation of ch*ch*ch*ch*ch would yield Undefined Behavior, and as a
consequence the program should print "Hey there!" unconditionally regardless
of what character was typed.
Nitpicking: You have not quoted a standard.
These are the sources used to generate drafts of the C++ standard. These sources should not be considered an ISO publication, nor should documents generated from them unless officially adopted by the C++ working group (ISO/IEC JTC1/SC22/WG21).
Interpretation: Notes are not normative according to the ISO/IEC Directives Part 2.
Notes and examples integrated in the text of a document shall only be used for giving additional information intended to assist the understanding or use of the document. They shall not contain requirements ("shall"; see 3.3.1 and Table H.1) or any information considered indispensable for the use of the document e.g. instructions (imperative; see Table H.1), recommendations ("should"; see 3.3.2 and Table H.2) or permission ("may"; see Table H.3). Notes may be written as a statement of fact.
Emphasis mine. This alone rules out "comprehensive list of options". Giving examples however does count as "additional information intended to assist the understanding .. of the document".
Do keep in mind that the "nasal demon" meme is not meant to be taken literally, just as using a balloon to explain how universe expansion works holds no truth in physical reality. It's to illustrate that it's foolhardy to discuss what "undefined behavior" should do when it's permissible to do anything. Yes, this means that there isn't an actual rubber band in outer space.
The definition of undefined behaviour, in every C and C++ standard, is essentially that the standard imposes no requirements on what happens.
Yes, that means any outcome is permitted. But there are no particular outcomes that are required to happen, nor any outcomes that are required to NOT happen. It does not matter if you have a compiler and library that consistently yields a particular behaviour in response to a particular instance of undefined behaviour - such a behaviour is not required, and may change even in a future bugfix release of your compiler - and the compiler will still be perfectly correct according to each version of the C and C++ standards.
If your host system has hardware support in the form of connection to probes that are inserted in your nostrils, it is within the realms of possibility that an occurrence of undefined behaviour will cause undesired nasal effects.
I thought I'd answer just one of your points, since the other answers answer the general question quite well, but have left this unaddressed.
"Ignoring the situation -- Yes, the standard goes on to say that this will have "unpredictable results", but that's not the same as the compiler inserting code (which I assume would be a prerequisite for, you know, nasal demons)."
A situation in which nasal demons could very reasonably be expected to occur with a sensible compiler, without the compiler inserting ANY code, would be the following:
if(!spawn_of_satan)
printf("Random debug value: %i\n", *x); // oops, null pointer deference
nasal_angels();
else
nasal_demons();
A compiler, if it can prove that that *x is a null pointer dereference, is perfectly entitled, as part of some optimisation, to say "OK, so I see that they've dereferenced a null pointer in this branch of the if. Therefore, as part of that branch I'm allowed to do anything. So I can therefore optimise to this:"
if(!spawn_of_satan)
nasal_demons();
else
nasal_demons();
"And from there, I can optimise to this:"
nasal_demons();
You can see how this sort of thing can in the right circumstances prove very useful for an optimising compiler, and yet cause disaster. I did see some examples a while back of cases where actually it IS important for optimisation to be able to optimise this sort of case. I might try to dig them out later when I have more time.
EDIT: One example that just came from the depths of my memory of such a case where it's useful for optimisation is where you very frequently check a pointer for being NULL (perhaps in inlined helper functions), even after having already dereferenced it and without having changed it. The optimising compiler can see that you've dereferenced it and so optimise out all the "is NULL" checks, since if you've dereferenced it and it IS null, anything is allowed to happen, including just not running the "is NULL" checks. I believe that similar arguments apply to other undefined behaviour.
First, it is important to note that it is not only the behaviour of the user program that is undefined, it is the behaviour of the compiler that is undefined. Similarly, UB is not encountered at runtime, it is a property of the source code.
To a compiler writer, "the behaviour is undefined" means, "you do not have to take this situation into account", or even "you can assume no source code will ever produce this situation".
A compiler can do anything, intentionally or unintentionally, when presented with UB, and still be standard compliant, so yes, if you granted access to your nose...
Then, it is not always possible to know if a program has UB or not.
Example:
int * ptr = calculateAddress();
int i = *ptr;
Knowing if this can ever be UB or not would require knowing all possible values returned by calculateAddress(), which is impossible in the general case (See "Halting Problem"). A compiler has two choices:
assume ptr will always have a valid address
insert runtime checks to guarantee a certain behaviour
The first option produces fast programs, and puts the burden of avoiding undesired effects on the programmer, while the second option produces safer but slower code.
The C and C++ standards leave this choice open, and most compilers choose the first, while Java for example mandates the second.
Why is the behaviour not implementation-defined, but undefined?
Implementation-defined means (N4296, 1.9§2):
Certain aspects and operations of the abstract machine are described in this International Standard as
implementation-defined (for example,
sizeof(int)
). These constitute the parameters of the abstract machine. Each implementation shall include documentation describing its characteristics and behavior in these
respects.
Such documentation shall define the instance of the abstract machine that corresponds to that
implementation (referred to as the “corresponding instance” below).
Emphasis mine. In other words: A compiler-writer has to document exactly how the machine-code behaves, when the source code uses implementation-defined features.
Writing to a random non-null invalid pointer is one of the most unpredictable things you can do in a program, so this would require performance-reducing runtime-checks too.
Before we had MMUs, you could destroy hardware by writing to the wrong address, which comes very close to nasal demons ;-)
Undefined behavior is simply the result of a situation coming up that the writers of the specification did not foresee.
Take the idea of a traffic light. Red means stop, yellow means prepare for red, and green means go. In this example people driving cars are the implementation of the spec.
What happens if both green and red are on? Do you stop, then go? Do you wait until red turns off and it's just green? This is a case that the spec did not describe, and as a result, anything the drivers do is undefined behavior. Some people will do one thing, some another. Since there is no guarantee about what will happen you want to avoid this situation. The same applies to code.
One of the reasons for leaving behavior undefined is to allow the compiler to make whatever assumptions it wants when optimizing.
If there exists some condition that must hold if an optimization is to be applied, and that condition is dependent on undefined behavior in the code, then the compiler may assume that it's met, since a conforming program can't depend on undefined behavior in any way. Importantly, the compiler does not need to be consistent in these assumptions. (which is not the case for implementation-defined behavior)
So suppose your code contains an admittedly contrived example like the one below:
int bar = 0;
int foo = (undefined behavior of some kind);
if (foo) {
f();
bar = 1;
}
if (!foo) {
g();
bar = 1;
}
assert(1 == bar);
The compiler is free to assume that !foo is true in the first block and foo is true in the second, and thus optimize the entire chunk of code away. Now, logically either foo or !foo must be true, and so looking at the code, you would reasonably be able to assume that bar must equal 1 once you've run the code. But because the compiler optimized in that manner, bar never gets set to 1. And now that assertion becomes false and the program terminates, which is behavior that would not have happened if foo hadn't relied on undefined behavior.
Now, is it possible for the compiler to actually insert completely new code if it sees undefined behavior? If doing so will allow it to optimize more, absolutely. Is it likely to happen often? Probably not, but you can never guarantee it, so operating on the assumption that nasal demons are possible is the only safe approach.
Undefined behaviors allow compilers to generate faster code in some cases. Consider two different processor architectures that ADD differently:
Processor A inherently discards the carry bit upon overflow, while processor B generates an error. (Of course, Processor C inherently generates Nasal Demons - its just the easiest way to discharge that extra bit of energy in a snot-powered nanobot...)
If the standard required that an error be generated, then all code compiled for processor A would basically be forced to include additional instructions, to perform some sort of check for overflow, and if so, generate an error. This would result in slower code, even if the developer know that they were only going to end up adding small numbers.
Undefined behavior sacrifices portability for speed. By allowing 'anything' to happen, the compiler can avoid writing safety-checks for situations that will never occur. (Or, you know... they might.)
Additionally, when a programmer knows exactly what an undefined behavior will actually cause in their given environment, they are free to exploit that knowledge to gain additional performance.
If you want to ensure that your code behaves exactly the same on all platforms, you need to ensure that no 'undefined behavior' ever occurs - however, this may not be your goal.
Edit: (In respons to OPs edit)
Implementation Defined behavior would require the consistent generation of nasal demons. Undefined behavior allows the sporadic generation of nasal demons.
That's where the advantage that undefined behavior has over implementation specific behavior appears. Consider that extra code may be needed to avoid inconsistent behavior on a particular system. In these cases, undefined behavior allows greater speed.

Can code that will never be executed invoke undefined behavior?

The code that invokes undefined behavior (in this example, division by zero) will never get executed, is the program still undefined behavior?
int main(void)
{
int i;
if(0)
{
i = 1/0;
}
return 0;
}
I think it still is undefined behavior, but I can't find any evidence in the standard to support or deny me.
So, any ideas?
Let's look at how the C standard defines the terms "behavior" and "undefined behavior".
References are to the N1570 draft of the ISO C 2011 standard; I'm not aware of any relevant differences in any of the three published ISO C standards (1990, 1999, and 2011).
Section 3.4:
behavior
external appearance or action
Ok, that's a bit vague, but I'd argue that a given statement has no "appearance", and certainly no "action", unless it's actually executed.
Section 3.4.3:
undefined behavior
behavior, upon use of a nonportable or erroneous program construct or of erroneous data,
for which this International Standard imposes no requirements
It says "upon use" of such a construct. The word "use" is not defined by the standard, so we fall back to the common English meaning. A construct is not "used" if it's never executed.
There's a note under that definition:
NOTE Possible undefined behavior ranges from ignoring the situation
completely with unpredictable results, to behaving during translation
or program execution in a documented manner characteristic of the
environment (with or without the issuance of a diagnostic message), to
terminating a translation or execution (with the issuance of a
diagnostic message).
So a compiler is permitted to reject your program at compile time if its behavior is undefined. But my interpretation of that is that it can do so only if it can prove that every execution of the program will encounter undefined behavior. Which implies, I think, that this:
if (rand() % 2 == 0) {
i = i / 0;
}
which certainly can have undefined behavior, cannot be rejected at compile time.
As a practical matter, programs have to be able to perform runtime tests to guard against invoking undefined behavior, and the standard has to permit them to do so.
Your example was:
if (0) {
i = 1/0;
}
which never executes the division by 0. A very common idiom is:
int x, y;
/* set values for x and y */
if (y != 0) {
x = x / y;
}
The division certainly has undefined behavior if y == 0, but it's never executed if y == 0. The behavior is well defined, and for the same reason that your example is well defined: because the potential undefined behavior can never actually happen.
(Unless INT_MIN < -INT_MAX && x == INT_MIN && y == -1 (yes, integer division can overflow), but that's a separate issue.)
In a comment (since deleted), somebody pointed out that the compiler may evaluate constant expressions at compile time. Which is true, but not relevant in this case, because in the context of
i = 1/0;
1/0 is not a constant expression.
A constant-expression is a syntactic category that reduces to conditional-expression (which excludes assignments and comma expressions). The production constant-expression appears in the grammar only in contexts that actually require a constant expression, such as case labels. So if you write:
switch (...) {
case 1/0:
...
}
then 1/0 is a constant expression -- and one that violates the constraint in 6.6p4: "Each constant expression shall evaluate to a constant that is in the range of representable
values for its type.", so a diagnostic is required. But the right hand side of an assignment does not require a constant-expression, merely a conditional-expression, so the constraints on constant expressions don't apply. A compiler can evaluate any expression that it's able to at compile time, but only if the behavior is the same as if it were evaluated during execution (or, in the context of if (0), not evaluated during execution().
(Something that looks exactly like a constant-expression is not necessarily a constant-expression, just as, in x + y * z, the sequence x + y is not an additive-expression because of the context in which it appears.)
Which means the footnote in N1570 section 6.6 that I was going to cite:
Thus, in the following initialization,
static int i = 2 || 1 / 0;
the expression is a valid integer constant expression with value one.
isn't actually relevant to this question.
Finally, there are a few things that are defined to cause undefined behavior that aren't about what happens during execution. Annex J, section 2 of the C standard (again, see the N1570 draft) lists things that cause undefined behavior, gathered from the rest of the standard. Some examples (I don't claim this is an exhaustive list) are:
A nonempty source file does not end in a new-line character which is not immediately preceded by a backslash character or ends in a partial
preprocessing token or comment
Token concatenation produces a character sequence matching the syntax of a universal character name
A character not in the basic source character set is encountered in a source file, except in an identifier, a character constant, a string
literal, a header name, a comment, or a preprocessing token that is
never converted to a token
An identifier, comment, string literal, character constant, or header name contains an invalid multibyte character or does not begin
and end in the initial shift state
The same identifier has both internal and external linkage in the same translation unit
These particular cases are things that a compiler could detect. I think their behavior is undefined because the committee didn't want to, or couldn't, impose the same behavior on all implementations, and defining a range of permitted behaviors just wasn't worth the effort. They don't really fall into the category of "code that will never be executed", but I mention them here for completeness.
This article discusses this question in section 2.6:
int main(void){
guard();
5 / 0;
}
The authors consider that the program is defined when guard() does not terminate. They also find themselves distinguishing notions of “statically undefined” and “dynamically undefined”, e.g.:
The intention behind the standard11 appears to be that, in general, situations are made statically undefined if it is not easy to generate code for them. Only when code can be generated, then the situation can be undefined dynamically.
11) Private correspondence with committee member.
I would recommend looking at the entire article. Taken together, it paints a consistent picture.
The fact that the authors of the article had to discuss the question with a committee member confirms that the standard is currently fuzzy on the answer to your question.
In this case the undefined behavior is the result of executing the code. So if the code is not executed, there is no undefined behavior.
Non executed code could invoke undefined behavior if the undefined behavior was the result of solely the declaration of the code (e.g. if some case of variable shadowing was undefined).
I'd go with the last paragraph of this answer: https://stackoverflow.com/a/18384176/694576
... UB is a runtime issue, not a compiletime issue ...
So, no, there is no UB invoked.
Only when the standard makes breaking changes and your code suddenly is no longer "never gets executed". But I don't see any logical way in which this can cause 'undefined behaviour'. Its not causing anything.
On the subject of undefined behaviour it is often hard to separate the formal aspects from the practical ones. This is the definition of undefined behaviour in the 1989 standard (I don't have a more recent version at hand, but I don't expect this to have changed substantially):
1 undefined behavior
behavior, upon use of a nonportable or erroneous program construct or of
erroneous data, for which this International Standard imposes no requirements
2 NOTE Possible undefined behavior ranges from ignoring the situation completely
with unpredictable results, to behaving during translation or program execution
in a documented manner characteristic of the environment (with or without the
issuance of a diagnostic message), to terminating a translation or
execution (with the issuance of a diagnostic message).
From a formal point of view I'd say your program does invoke undefined behaviour, which means that the standard places no requirement whatsoever on what it will do when run, just because it contains division by zero.
On the other hand, from a practical point of view I'd be surprised to find a compiler that didn't behave as you intuitively expect.
The standard says, as I remember right, it's allowed to do anything from the moment, a rule got broken. Maybe there are some special events with kind of global flavour (but I never heard or read about something like that)... So I would say: No this can't be UB, because as long the behavior is well defined 0 is allways false, so the rule can't get broken on runtime.
I think it still is undefined behavior, but I can't find any evidence in the standard to support or deny me.
I think the program does not invoke undefined behavior.
Defect Report #109 addresses a similar question and says:
Furthermore, if every possible execution of a given program would result in undefined behavior, the given program is not strictly conforming.
A conforming implementation must not fail to translate a strictly conforming program simply because some possible execution of that program would result in undefined behavior. Because foo might never be called, the example given must be successfully translated by a conforming implementation.
It depends on how the expression "undefined behavior" is defined, and whether "undefined behavior" of a statement is the same as "undefined behavior" for a program.
This program looks like C, so a deeper analysis of what the C standard used by the compiler (as some answers did) is appropriate.
In absence of a specified standard, the correct answer is "it depends". In some languages, compilers after the first error try to guess what the programmer might mean and still generate some code, according to the compilers guess. In other, more pure languages, once somerthing is undefined, the undefinedness propagate to the whole program.
Other languages have a concept of "bounded errors". For some limited kinds of errors, these languages define how much damage an error can produce. In particular languages with implied garbage collection frequently make a difference whether an error invalidates the typing system or does not.

Portability of using stddef.h's offsetof rather than rolling your own

This is a nitpicky-details question with three parts. The context is that I wish to persuade some folks that it is safe to use <stddef.h>'s definition of offsetof unconditionally rather than (under some circumstances) rolling their own. The program in question is written entirely in plain old C, so please ignore C++ entirely when answering.
Part 1: When used in the same manner as the standard offsetof, does the expansion of this macro provoke undefined behavior per C89, why or why not, and is it different in C99?
#define offset_of(tp, member) (((char*) &((tp*)0)->member) - (char*)0)
Note: All implementations of interest to the people whose program this is supersede the standard's rule that pointers may only be subtracted from each other when they point into the same array, by defining all pointers, regardless of type or value, to point into a single global address space. Therefore, please do not rely on that rule when arguing that this macro's expansion provokes undefined behavior.
Part 2: To the best of your knowledge, has there ever been a released, production C implementation that, when fed the expansion of the above macro, would (under some circumstances) behave differently than it would have if its offsetof macro had been used instead?
Part 3: To the best of your knowledge, what is the most recently released production C implementation that either did not provide stddef.h or did not provide a working definition of offsetof in that header? Did that implementation claim conformance with any version of the C standard?
For parts 2 and 3, please answer only if you can name a specific implementation and give the date it was released. Answers that state general characteristics of implementations that may qualify are not useful to me.
There is no way to write a portable offsetof macro. You must use the one provided by stddef.h.
Regarding your specific questions:
The macro invokes undefined behavior. You cannot subtract pointers except when they point into the same array.
The big difference in practical behavior is that the macro is not an integer constant expression, so it can't safely be used for static initializers, bitfield widths, etc. Also strict bounds-checking-type C implementations might completely break it.
There has never been any C standard that lacked stddef.h and offsetof. Pre-ANSI compilers might lack it, but they have much more fundamental problems that make them unusable for modern code (e.g. lack of void * and const).
Moreover, even if some theoretical compiler did lack stddef.h, you could just provide a drop-in replacement, just like the way people drop in stdint.h for use with MSVC...
To answer #2: yes, gcc-4* (I'm currently looking at v4.3.4, released 4 Aug 2009, but it should hold true for all gcc-4 releases to date). The following definition is used in their stddef.h:
#define offsetof(TYPE, MEMBER) __builtin_offsetof (TYPE, MEMBER)
where __builtin_offsetof is a compiler builtin like sizeof (that is, it's not implemented as a macro or run-time function). Compiling the code:
#include <stddef.h>
struct testcase {
char array[256];
};
int main (void) {
char buffer[offsetof(struct testcase, array[0])];
return 0;
}
would result in an error using the expansion of the macro that you provided ("size of array ‘buffer’ is not an integral constant-expression") but would work when using the macro provided in stddef.h. Builds using gcc-3 used a macro similar to yours. I suppose that the gcc developers had many of the same concerns regarding undefined behavior, etc that have been expressed here, and created the compiler builtin as a safer alternative to attempting to generate the equivalent operation in C code.
Additional information:
A mailing list thread from the Linux kernel developer's list
GCC's documentation on offsetof
A sort-of-related question on this site
Regarding your other questions: I think R's answer and his subsequent comments do a good job of outlining the relevant sections of the standard as far as question #1 is concerned. As for your third question, I have not heard of a modern C compiler that does not have stddef.h. I certainly wouldn't consider any compiler lacking such a basic standard header as "production". Likewise, if their offsetof implementation didn't work, then the compiler still has work to do before it could be considered "production", just like if other things in stddef.h (like NULL) didn't work. A C compiler released prior to C's standardization might not have these things, but the ANSI C standard is over 20 years old so it's extremely unlikely that you'll encounter one of these.
The whole premise to this problems begs a question: If these people are convinced that they can't trust the version of offsetof that the compiler provides, then what can they trust? Do they trust that NULL is defined correctly? Do they trust that long int is no smaller than a regular int? Do they trust that memcpy works like it's supposed to? Do they roll their own versions of the rest of the C standard library functionality? One of the big reasons for having language standards is so that you can trust the compiler to do these things correctly. It seems silly to trust the compiler for everything else except offsetof.
Update: (in response to your comments)
I think my co-workers behave like yours do :-) Some of our older code still has custom macros defining NULL, VOID, and other things like that since "different compilers may implement them differently" (sigh). Some of this code was written back before C was standardized, and many older developers are still in that mindset even though the C standard clearly says otherwise.
Here's one thing you can do to both prove them wrong and make everyone happy at the same time:
#include <stddef.h>
#ifndef offsetof
#define offsetof(tp, member) (((char*) &((tp*)0)->member) - (char*)0)
#endif
In reality, they'll be using the version provided in stddef.h. The custom version will always be there, however, in case you run into a hypothetical compiler that doesn't define it.
Based on similar conversations that I've had over the years, I think the belief that offsetof isn't part of standard C comes from two places. First, it's a rarely used feature. Developers don't see it very often, so they forget that it even exists. Second, offsetof is not mentioned at all in Kernighan and Ritchie's seminal book "The C Programming Language" (even the most recent edition). The first edition of the book was the unofficial standard before C was standardized, and I often hear people mistakenly referring to that book as THE standard for the language. It's much easier to read than the official standard, so I don't know if I blame them for making it their first point of reference. Regardless of what they believe, however, the standard is clear that offsetof is part of ANSI C (see R's answer for a link).
Here's another way of looking at question #1. The ANSI C standard gives the following definition in section 4.1.5:
offsetof( type, member-designator)
which expands to an integral constant expression that has type size_t,
the value of which is the offset in bytes, to the structure member
(designated by member-designator ), from the beginning of its
structure (designated by type ).
Using the offsetof macro does not invoke undefined behavior. In fact, the behavior is all that the standard actually defines. It's up to the compiler writer to define the offsetof macro such that its behavior follows the standard. Whether it's implemented using a macro, a compiler builtin, or something else, ensuring that it behaves as expected requires the implementor to deeply understand the inner workings of the compiler and how it will interpret the code. The compiler may implement it using a macro like the idiomatic version you provided, but only because they know how the compiler will handle the non-standard code.
On the other hand, the macro expansion you provided indeed invokes undefined behavior. Since you don't know enough about the compiler to predict how it will process the code, you can't guarantee that particular implementation of offsetof will always work. Many people define their own version like that and don't run into problems, but that doesn't mean that the code is correct. Even if that's the way that a particular compiler happens to define offsetof, writing that code yourself invokes UB while using the provided offsetof macro does not.
Rolling your own macro for offsetof can't be done without invoking undefined behavior (ANSI C section A.6.2 "Undefined behavior", 27th bullet point). Using stddef.h's version of offsetof will always produce the behavior defined in the standard (assuming a standards-compliant compiler). I would advise against defining a custom version since it can cause portability problems, but if others can't be persuaded then the #ifndef offsetof snippet provided above may be an acceptable compromise.
(1) The undefined behavior is already there before you do the substraction.
First of all, (tp*)0 is not what you think it is. It is a null
pointer, such a beast is not necessarily represented with all-zero
bit pattern.
Then the member operator -> is not simply an offset addition. On a CPU with segmented memory this might be a more complicated operation.
Taking the address with a & operation is UB if the expression is
not a valid object.
(2) For the point 2., there are certainly still archictures out in the wild (embedded stuff) that use segmented memory. For 3., the point that R makes about integer constant expressions has another drawback: if the code is badly optimized the & operation might be done at runtime and signal an error.
(3) Never heard of such a thing, but this is probably not enough to convice your colleagues.
I believe that nearly every optimizing compiler has broken that macro at multiple points in time. Your coworkers have apparently been lucky enough not to have been hit by it.
What happens is that some junior compiler engineer decides that because the zero page is never mapped on their platform of choice, any time anyone does anything with a pointer to that page, that's undefined behavior and they can safely optimize away the whole expression. At that point, everyone's homebrew offsetof macros break until enough people scream about it, and those of us who were smart enough not to roll our own go happily about our business.
I don't know of any compiler where this is the behavior in the current released version, but I think I've seen it happen at some point with every compiler I've ever worked with.

Resources