confusing C code, someone explain it for me? - c

The evaluation order does matter a lot, so, is this something called non-referential-transparency?
#include <stdio.h>
int i = 1;
int counter(){
i = i + 1;
return i;
}
int foo(int i, int j){
return i*2 + 3*j;
}
int main(){
printf("%d", foo(counter(), counter()));
}

I guess what you might have in mind is that the evaluation order of function parameters is not standardized in C. Since counter() will return a different result on each call, and the result of foo(2, 3) is different from that of foo(3, 2), compiling and executing this code may give you different results on different platforms.
On the same platform, however, it is deterministic, as others have explained well. Update: to be precise, once compiled into an executable on a specific platform with specific compiler options, all executions will produce the same output. However, as commenters pointed out, it might even produce different output on the same platform when built with different compilation options.

Strictly speaking, the code in question might give different results even when compiled on the same platform with the same compiler and settings. The order in which function arguments are evaluated is unspecified. The C standard defines "unspecified behavior" as
use of an unspecified value, or other behavior where this International Standard provides two or more possibilities and imposes no further requirements on which is chosen in any instance (C99 §3.4.4/1).
The important part is that "in any instance" the implementation might do something different, so, for example, your compiler could emit code that randomly selects the order in which to evaluate the arguments.
Obviously, it is highly unlikely that any implementation would evaluate the arguments to a function differently during different runs of the same program.
The point is that you should never rely on the order in which function arguments are evaluated; in a correct program, it should not matter.
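If the order does matter, one way to take the choice away from the compiler is to store each result in a named variable first, because the end of each initializer is a sequence point. A minimal sketch reusing counter() and foo() from the question; ordered_call() is a name made up here for illustration:

```c
int i = 1;

/* Same as in the question: bumps the global and returns the new value. */
int counter(void) {
    i = i + 1;
    return i;
}

int foo(int a, int b) {
    return a * 2 + 3 * b;
}

/* Hypothetical helper: the two calls are separate full expressions,
   so the first counter() call is guaranteed to run first. */
int ordered_call(void) {
    int first = counter();
    int second = counter();
    return foo(first, second);
}
```

Starting from i == 1, ordered_call() always computes foo(2, 3), whatever the compiler or optimization level.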

It is deterministic, it will return the same values every time.

counter() will return a different number each time you call it because i is global. However, a global variable only keeps its value during an execution. If you restart the program, it gets the value 1 and starts again!

Several answers have indicated that while different platforms might give different results, the result is deterministic on a given platform.
This is not correct.
The C99 Standard says (6.5/3 Expressions):
Except as specified later (for the function-call (), &&, ||, ?:, and comma operators), the order of evaluation of subexpressions and the order in which side effects take place are both unspecified.
So, the order of evaluation of the parameters in the call to foo() is not specified by the standard. The order of the two calls to counter() cannot be counted on. A particular compiler could order the calls differently depending on:
the optimizations the compiler is asked to perform
the exact set of code (include files, slightly or significantly different source code in the translation unit, whatever)
the day of the week the program is built
a random number
While it's unlikely that things other than the optimizations used, differences in other compiler options, or differences in the translation unit will result in a different ordering of the argument evaluation (since there probably wouldn't be much reason for the compiler to generate different output), the fact is you simply can't depend on the ordering.
In fact, it's even OK (as far as the standard is concerned) for the order of evaluation of the call to be made differently each time foo() is invoked. For example, say your example program looked like (to make what's happening when more obvious):
#include <stdio.h>
int i = 1;
int counter1(){
i = i * 3;
printf( "counter1()\n");
return i;
}
int counter2(){
i = i * 5;
printf( "counter2()\n");
return i;
}
int foo(int i, int j){
return i + j;
}
int main(){
int x;
for (x=0; x<2; ++x) {
printf("%d\n", foo(counter1(), counter2()));
}
return 0;
}
It would be perfectly valid for the output to look like any of the following (note there's at least one additional possibility):
Possibility 1:
counter1()
counter2()
18
counter1()
counter2()
270
Possibility 2:
counter1()
counter2()
18
counter2()
counter1()
300
Possibility 3:
counter2()
counter1()
20
counter2()
counter1()
300
It would be very weird for the compiler to evaluate the arguments differently each time that line of code is executed, but it's permitted because the order is unspecified by the standard.
While it's highly unlikely that the evaluation would be 'randomized', I do think that such difficult-to-control things as the optimization level (or other compiler settings), the precise version/patch level of the compiler, or even the exact code that surrounds the expressions could cause the compiler to choose a different evaluation path.
Relying on the order of evaluation of function arguments, even on a particular platform, is flirting with danger.
As a side note, this is one of the reasons that having hidden side-effects in a function is something to avoid if possible.

The code is deterministic but what it prints may depend on the compiler because foo may receive 2,3 or 3,2.

As Code Clown has mentioned, the code is deterministic. It would give you the same output with the same "compiler".
The C standard doesn't specify the order of evaluation of function call arguments. So which of the two calls to counter() is evaluated first is up to the compiler to decide.

The function counter() is not referentially transparent. Referential transparency means the function should return the same value when called on the same input. For this to happen the function has to be pure, that is, it should not have any side effects.
The C language doesn't guarantee that a function is pure; one has to manage it oneself by:
not storing the formal arguments of a method inside a local static field
not depending on value of global variable
(there are many ways to make a function referentially opaque, these are more common)
Here subsequent calls to counter() result in different values so it is referentially opaque.
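For contrast, here is a sketch of a referentially transparent replacement for counter(): the state is passed in and handed back instead of living in a global. The names next_count() and pure_demo() are invented for this illustration and do not appear in the question.

```c
/* Pure: the result depends only on the argument, nothing is modified. */
int next_count(int current) {
    return current + 1;
}

int foo(int a, int b) {
    return a * 2 + 3 * b;
}

/* The caller threads the state through explicitly, so every call
   with the same input gives the same output. */
int pure_demo(void) {
    int start = 1;
    int first = next_count(start);
    int second = next_count(first);
    return foo(first, second);
}
```

Because next_count(1) is always 2, it no longer matters in which order such calls are evaluated.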

Related

Do the C compiler know when a statement operates on a file and thus has "observable behaviour"?

The C99 standard 5.1.2.3$2 says
Accessing a volatile object, modifying an object, modifying a file, or calling a function
that does any of those operations are all side effects, which are changes in the state of
the execution environment. Evaluation of an expression in general includes both value
computations and initiation of side effects. Value computation for an lvalue expression
includes determining the identity of the designated object.
I guess that in a lot of cases the compiler can't inline and possibly eliminate the functions doing I/O since they live in a different translation unit. And the parameters to functions doing I/O are often pointers, further hindering the optimizer.
But link-time-optimization gives the compiler "more to chew on".
And even though the paragraph I quoted says that "modifying an object" (that's standard-speak for memory) is a side effect, stores to memory are not automatically treated as side effects when the optimizer kicks in. Here's an example from John Regehr's Nine Ways to Break your Systems Software using Volatile where the message store is reordered relative to the volatile ready variable.
volatile int ready;
int message[100];
void foo (int i) {
message[i/10] = 42;
ready = 1;
}
How does a C compiler determine if a statement operates on a file? In a free-standing embedded environment I declare registers as volatile, thus hindering the compiler from optimizing calls away and swapping order of I/O calls.
Is that the only way to tell the compiler that we're doing I/O? Or does the C standard dictate that these N calls in the standard library do I/O and thus must receive special treatment? But then, what if someone created their own system call wrapper for say read?
As C has no statement dedicated to IO, only function calls can modify files. So if the compiler sees no function call in a sequence of statements, it knows that this sequence has not modified any file.
If only functions from the standard library are called, and if the environment is hosted, the compiler could know what they do and use that to guess what will happen.
But what is really important, is that the compiler only needs to respect side effects. It is perfectly allowed when it does not know, to assume that a function call could involve side effects and act accordingly. It will not be a violation of the standard if no side effects are actually involved, it will just possibly lose a higher optimization.
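As an aside on the message/ready example from the question: if the goal there is only to keep the store to message ahead of the store to ready, one possible approach (not taken from the question or the cited article, and assuming a C11 toolchain with `<stdatomic.h>`) is a release store paired with an acquire load. publish() and try_read() are hypothetical names.

```c
#include <stdatomic.h>

atomic_int ready;   /* static storage, so it starts out as 0 */
int message[100];

void publish(int i) {
    message[i / 10] = 42;
    /* Release store: the write to message above may not be
       reordered after this store. */
    atomic_store_explicit(&ready, 1, memory_order_release);
}

/* Returns 1 and fills *out once the message has been published. */
int try_read(int i, int *out) {
    if (atomic_load_explicit(&ready, memory_order_acquire)) {
        *out = message[i / 10];
        return 1;
    }
    return 0;
}
```

This only addresses ordering between ordinary memory accesses; memory-mapped device registers in a free-standing environment would still need volatile.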

What's the consequence of a sequence-point "immediately before a library function returns"?

In this recent question, some code was shown to have undefined behavior:
a[++i] = foo(a[i-1], a[i]);
because even though the actual call of foo() is a sequence point, the assignment is unsequenced, so you don't know whether the function is called after the side-effect of ++i took place or before that.
Thinking further about this, the sequence point at a function call only guarantees that side effects from evaluating the function arguments are carried out once the function is entered, e.g.
int y = 1;
int func1(int x) { return x + y; }
int main(void)
{
int result = func1( y++ ); // guaranteed to be 3
}
But looking at the standard, there's also §7.1.4 p3 (in the chapter about the standard library):
There is a sequence point immediately before a library function returns.
My question here is: What's the consequence of this paragraph? Why does it only concern library functions and what kind of code would actually rely on that?
Simple ideas like (nonsensical code to follow)
errno = 0;
long result = ftell(file) * errno;
would still be undefined as this time, the multiplication is unsequenced. I'm looking for an example that makes use of this special guarantee §7.1.4 p3 makes for library functions.
Regarding the suggested duplicate, Sequence point after a return statement?, this is indeed closely related and I found it before asking this question. It's not a duplicate, because
it asks about normative text stating there is a sequence point immediately after a return, without asking about the consequences when there is one.
it only mentions the special rule for library functions this question is about, without further elaborating on it.
Consequently, my questions here are not answered over there. The accepted answer uses a return value in an unsequenced expression (in this case an addition) and explains how the result depends on the sequencing of this addition, only finding that if you knew the sequencing of the addition, the whole result would be defined with a sequence point immediately after return. It doesn't show an example of code that is actually defined because of this rule, and it doesn't say anything about how/why library functions are special.
Library functions don't have the code that implements them covered by the standard (they might not even be implemented in C). The standard only specifies their behaviour. So the provision about return statements does not apply to implementation of library functions.
The purpose of this clause (in combination with there being a sequence point on entry of a library function) is to say that any side-effects of the library functions are sequenced either before or after any other evaluations that might be in the code which calls the library function.
So the example in your question is not undefined behaviour (unless the multiplication overflows!): the read of errno is either sequenced before or after the modification by ftell, it's unspecified which.
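In everyday code the question of which order applies is usually sidestepped by keeping the library call and the read of errno in separate full expressions. A small sketch using strtol (parse_long() is a made-up helper, not something from the question):

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper: returns 0 on success, -1 if the text has no
   digits or the value does not fit in a long. */
int parse_long(const char *text, long *out) {
    char *end;
    errno = 0;
    long value = strtol(text, &end, 10);  /* the call is complete here */
    if (errno == ERANGE || end == text) { /* errno is read afterwards */
        return -1;
    }
    *out = value;
    return 0;
}
```

Here the read of errno is sequenced after the call, so there is no choice left for the implementation to make.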

Why does the following code give different results when compiling with gcc and g++?

#include<stdio.h>
int main()
{
const int a=1;
int *p=(int *)&a;
(*p)++;
printf("%d %d\n",*p,a);
if(a==1)
printf("No\n");//"No" in g++.
else
printf("Yes\n");//"Yes" in gcc.
return 0;
}
The above code gives No as output in g++ compilation and Yes in gcc compilation. Can anybody please explain the reason behind this?
Your code triggers undefined behaviour because you are modifying a const object (a). It doesn't have to produce any particular result, not even on the same platform, with the same compiler.
Although the exact mechanism for this behaviour isn't specified, you may be able to figure out what is happening in your particular case by examining the assembly produced by the code (you can see that by using the -S flag.) Note that compilers are allowed to make aggressive optimizations by assuming that the code has well-defined behaviour. For instance, a could simply be replaced by 1 wherever it is used.
From the C++ Standard (1.9 Program execution)
4 Certain other operations are described in this International
Standard as undefined (for example, the effect of attempting to
modify a const object). [ Note: This International Standard imposes
no requirements on the behavior of programs that contain undefined
behavior. —end note ]
Thus your program has undefined behaviour.
In your code, notice following two lines
const int a=1; // a is of type constant int
int *p=(int *)&a; // p is of type int *
you are putting the address of a const int variable to an int * and then trying to modify the value, which should have been treated as const. This is not allowed and invokes undefined behaviour.
For your reference, as mentioned in chapter 6.7.3, C11 standard, paragraph 6
If an attempt is made to modify an object defined with a const-qualified type through use
of an lvalue with non-const-qualified type, the behavior is undefined. If an attempt is
made to refer to an object defined with a volatile-qualified type through use of an lvalue
with non-volatile-qualified type, the behavior is undefined
So, to cut the long story short, you cannot rely on the outputs for comparison. They are the result of undefined behaviour.
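Note that the rule quoted above is about objects defined with a const-qualified type. For comparison, here is a sketch of the case that is well defined: the object itself is a plain int and only the pointer used to look at it is const-qualified. The function name is invented for this example.

```c
int bump_through_const_view(void) {
    int x = 1;             /* the object itself is NOT defined const */
    const int *view = &x;  /* a read-only view of a modifiable object */
    int *p = (int *)view;  /* cast the qualifier away again */
    (*p)++;                /* allowed: x was never defined const */
    return x;
}
```

The cast is the same as in the question; what differs is how the underlying object was defined.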
Okay we have here 'identical' code passed to "the same" compiler but once
with a C flag and the other time with a C++ flag. As far as any reasonable
user is concerned nothing has changed. The code should be interpreted
identically by the compiler because nothing significant has happened.
Actually, that's not true. While I would be hard pressed to point to it in
a standard, the precise interpretation of 'const' has slight differences
between C and C++. In C it's very much an add-on, the 'const' flag
says that this normal variable 'a' should not be written to by the code
round here. But there is a possibility that it will be written to
elsewhere. With C++ the emphasis is much more to the immutable constant
concept and the compiler knows that this constant is more akin to an
'enum' than a normal variable.
So I expect this slight difference means that slightly different parse
trees are generated which eventually leads to different assembler.
This sort of thing is actually fairly common, code that's in the C/C++
subset does not always compile to exactly the same assembler even with
'the same' compiler. It tends to be caused by other language features
meaning that there are some things you can't prove about the code right
now in one of the languages but it's okay in the other.
Usually C is the performance winner (as was re-discovered by the Linux
kernel devs) because it's a simpler language but in this example, C++
would probably turn out faster (unless the C dev switches to a macro
or enum and catches the unreasonable act of taking the address of an
immutable constant).

Order of evaluation of arguments in function calling?

I am studying about undefined behavior in C and I came to a statement that states that
there is no particular order of evaluation of function arguments
but then what about the standard calling conventions like _cdecl and _stdcall, whose definition said (in a book) that arguments are evaluated from right to left.
Now I am confused by these two statements: the one about UB says something different from the one that comes from the definition of the calling convention. Please reconcile the two.
As Graznarak's answer correctly points out, the order in which arguments are evaluated is distinct from the order in which arguments are passed.
An ABI typically applies only to the order in which arguments are passed, for example which registers are used and/or the order in which argument values are pushed onto the stack.
What the C standard says is that the order of evaluation is unspecified. For example (remembering that printf returns an int result):
some_func(printf("first\n"), printf("second\n"));
the C standard says that the two messages will be printed in some order (evaluation is not interleaved), but explicitly does not say which order is chosen. It can even vary from one call to the next, without violating the C standard. It could even evaluate the first argument, then evaluate the second argument, then push the second argument's result onto the stack, then push the first argument's result onto the stack.
An ABI might specify which registers are used to pass the two arguments, or exactly where on the stack the values are pushed, which is entirely consistent with the requirements of the C standard.
But even if an ABI actually requires the evaluation to occur in a specified order (so that, for example, printing "second\n" followed by "first\n" would violate the ABI) that would still be consistent with the C standard.
What the C standard says is that the C standard itself does not define the order of evaluation. Some secondary standard is still free to do so.
Incidentally, this does not by itself involve undefined behavior. There are cases where the unspecified order of evaluation can lead to undefined behavior, for example:
printf("%d %d\n", i++, i++); /* undefined behavior! */
Argument evaluation and argument passing are related but different problems.
Arguments tend to be passed left to right, often with some arguments passed in registers rather than on the stack. This is what is specified by the ABI and _cdecl and _stdcall.
The order of evaluation of arguments before placing them in the locations that the function call requires is unspecified. It can evaluate them left to right, right to left, or some other order. This is compiler dependent and may even vary depending on optimization level.
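A small experiment along these lines: record the order in which the argument expressions run, and check separately that the values still reach the right parameters. trace(), sub() and traced_call() are names invented for this sketch.

```c
#include <string.h>

char order_log[8];  /* collects the tags in the order they were evaluated */

int trace(char tag, int value) {
    size_t n = strlen(order_log);
    order_log[n] = tag;
    order_log[n + 1] = '\0';
    return value;
}

int sub(int a, int b) {
    return a - b;
}

int traced_call(void) {
    order_log[0] = '\0';
    /* Passing: a is always 10 and b is always 3.
       Evaluation: which trace() call runs first is unspecified. */
    return sub(trace('a', 10), trace('b', 3));
}
```

The result is 7 either way; only order_log can differ between compilers, ending up as "ab" or "ba".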
_cdecl and _stdcall merely specify that the arguments are pushed onto the stack in right-to-left order, not that they are evaluated in that order. Think about what would happen if calling conventions like _cdecl, _stdcall, and pascal changed the order that the arguments were evaluated.
If evaluation order were modified by calling convention, you would have to know the calling convention of the function you're calling in order to understand how your own code would behave. That's a leaky abstraction if I've ever seen one. Somewhere, buried in a header file someone else wrote, would be a cryptic key to understanding just that one line of code; but you've got a few hundred thousand lines, and the behavior changes for each one? That would be insanity.
I feel like much of the undefined behavior in C89 arose from the fact that the standard was written after multiple conflicting implementations existed. They were maybe more concerned with agreeing on a sane baseline that most implementers could accept than they were with defining all behavior. I like to think that all undefined behavior in C is just a place where a group of smart and passionate people agreed to disagree, but I wasn't there.
I'm tempted now to fork a C compiler and make it evaluate function arguments as if they're a binary tree that I'm running a breadth-first traversal of. You can never have too much fun with undefined behavior!
Check the book you mentioned for any references to "Sequence points", because I think that's what you're trying to get at.
Basically, a sequence point is a point where, once you've arrived there, you are certain that all preceding expressions have been fully evaluated and all of their side effects are complete.
For example, the end of an initializer is a sequence point. This means that after:
bool foo = !(i++ > j);
You are sure that i will be equal to i's initial value +1, and that foo has been assigned true or false. Another example:
int bar = i++ > j ? i : j;
This is perfectly predictable. It reads as follows: compare the current value of i with j, then add one to i (the question mark of the ternary operator is a sequence point, so the increment is complete before either branch is evaluated). If the comparison was true, assign the new value of i to bar; otherwise assign j.
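Wrapped in a function so that it can be tried with concrete numbers (pick() is just an illustrative name):

```c
int pick(int i, int j) {
    /* The comparison uses the old value of i; the increment is
       complete before either branch of ?: is evaluated. */
    int bar = i++ > j ? i : j;
    return bar;
}
```

With i = 5 and j = 3 the comparison 5 > 3 is true, i becomes 6, and bar is 6. With i = 1 and j = 3 the comparison is false and bar is 3.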
All sequence points listed in the C99 standard (Annex C) are:
The following are the sequence points described in 5.1.2.3:
— The call to a function, after the arguments have been evaluated (6.5.2.2).
— The end of the first operand of the following operators: logical AND && (6.5.13);
logical OR || (6.5.14); conditional ? (6.5.15); comma , (6.5.17).
— The end of a full declarator: declarators (6.7.5);
— The end of a full expression: an initializer (6.7.8); the expression in an expression
statement (6.8.3); the controlling expression of a selection statement (if or switch)
(6.8.4); the controlling expression of a while or do statement (6.8.5); each of the
expressions of a for statement (6.8.5.3); the expression in a return statement
(6.8.6.4).
— Immediately before a library function returns (7.1.4).
— After the actions associated with each formatted input/output function conversion
specifier (7.19.6, 7.24.2).
— Immediately before and immediately after each call to a comparison function, and
also between any call to a comparison function and any movement of the objects
passed as arguments to that call (7.20.5).
What this means, in essence, is that any expression that is not followed by a sequence point can invoke undefined behaviour, like, for example:
printf("%d, %d and %d\n", i++, i++, i--);
In this statement, the sequence point that applies is "The call to a function, after the arguments have been evaluated". After the arguments are evaluated. If we then look at the semantics, in the same standard under 6.5.2.2, point ten, we see:
10 The order of evaluation of the function designator, the actual arguments, and
subexpressions within the actual arguments is unspecified, but there is a sequence point
before the actual call.
That means for i = 1, the values that are passed to printf could be:
1, 2, 3//left to right
But equally valid would be:
1, 0, 1//evaluated i-- first
//or
1, 1, 2//evaluated i-- second
What you can be sure of is that the new value of i after this call will be 2.
But all of the values listed above are, theoretically, equally valid, and 100% standard compliant.
But the appendix on undefined behaviour explicitly lists this as being code that invokes undefined behaviour, too:
Between two sequence points, an object is modified more than once, or is modified
and the prior value is read other than to determine the value to be stored (6.5).
In theory, your program could crash; instead of printing 1, 2, and 3, the output "666, 666 and 666" would be possible, too.
So finally I found it.
It is because the arguments are passed after they are evaluated, so passing arguments is a completely different story from evaluating them. C compilers are traditionally built to maximize speed and optimization, so they can evaluate the argument expressions in any order.
So argument passing and evaluation are different stories altogether.
Since the C standard does not specify any order for evaluating parameters, every compiler implementation is free to adopt one. That's one reason why coding something like foo(i++) is complete insanity: you may get different results when switching compilers.
One other important thing which has not been highlighted here - if your favorite ARM compiler evaluates parameters left to right, it will do so for all cases and for all subsequent versions. Reading order of parameters for a compiler is merely a convention...

In C99, is f()+g() undefined or merely unspecified?

I used to think that in C99, even if the side-effects of functions f and g interfered, and although the expression f() + g() does not contain a sequence point, f and g would contain some, so the behavior would be unspecified: either f() would be called before g(), or g() before f().
I am no longer so sure. What if the compiler inlines the functions (which the compiler may decide to do even if the functions are not declared inline) and then reorders instructions? May one get a result different from the above two? In other words, is this undefined behavior?
This is not because I intend to write this kind of thing, this is to choose the best label for such a statement in a static analyzer.
The expression f() + g() contains a minimum of 4 sequence points; one before the call to f() (after all zero of its arguments are evaluated); one before the call to g() (after all zero of its arguments are evaluated); one as the call to f() returns; and one as the call to g() returns. Further, the two sequence points associated with f() occur either both before or both after the two sequence points associated with g(). What you cannot tell is which order the sequence points will occur in - whether the f-points occur before the g-points or vice versa.
Even if the compiler inlined the code, it has to obey the 'as if' rule - the code must behave the same as if the functions were not interleaved. That limits the scope for damage (assuming a non-buggy compiler).
So the sequence in which f() and g() are evaluated is unspecified. But everything else is pretty clean.
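To make "unspecified but otherwise clean" concrete, here is a sketch where f() and g() interfere through a shared variable. The bodies are invented for illustration; the point is that exactly two results are possible and nothing else.

```c
int s;

int f(void) { s += 1; return s; }
int g(void) { s *= 2; return s; }

int sum_once(void) {
    s = 1;
    /* f first: f returns 2, then g returns 4, total 6.
       g first: g returns 2, then f returns 3, total 5.
       The two calls never interleave, so no other total can occur. */
    return f() + g();
}
```

A static analyzer could therefore report the set {5, 6} for sum_once() rather than treating the expression as undefined.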
In a comment, supercat asks:
I would expect function calls in the source code remain as sequence points even if a compiler decides on its own to inline them. Does that remain true of functions declared "inline", or does the compiler get extra latitude?
I believe the 'as if' rule applies and the compiler doesn't get extra latitude to omit sequence points because it uses an explicitly inline function. The main reason for thinking that (being too lazy to look for the exact wording in the standard) is that the compiler is allowed to inline or not inline a function according to its rules, but the behaviour of the program should not change (except for performance).
Also, what can be said about the sequencing of (a(),b()) + (c(),d())? Is it possible for c() and/or d() to execute between a() and b(), or for a() or b() to execute between c() and d()?
Clearly, a executes before b, and c executes before d. I believe it is possible for c and d to be executed between a and b, though it is fairly unlikely that the compiler would generate code like that; similarly, a and b could be executed between c and d. And although I used 'and' in 'c and d', that could be an 'or' - that is, any of these sequences of operation meet the constraints:
Definitely allowed
abcd
cdab
Possibly allowed (preserves a ≺ b, c ≺ d ordering)
acbd
acdb
cadb
cabd
 
I believe that covers all possible sequences. See also the chat between Jonathan Leffler and AnArrayOfFunctions — the gist is that AnArrayOfFunctions does not think the 'possibly allowed' sequences are allowed at all.
If such a thing would be possible, that would imply a significant difference between inline functions and macros.
There are significant differences between inline functions and macros, but I don't think the ordering in the expression is one of them. That is, any of the functions a, b, c or d could be replaced with a macro, and the same sequencing of the macro bodies could occur. The primary difference, it seems to me, is that with the inline functions, there are guaranteed sequence points at the function calls - as outlined in the main answer - as well as at the comma operators. With macros, you lose the function-related sequence points. (So, maybe that is a significant difference...) However, in so many ways the issue is rather like questions about how many angels can dance on the head of a pin - it isn't very important in practice. If someone presented me with the expression (a(),b()) + (c(),d()) in a code review, I would tell them to rewrite the code to make it clear:
a();
c();
x = b() + d();
And that assumes there is no critical sequencing requirement on b() vs d().
See Annex C for a list of sequence points. Function calls (the point between all arguments being evaluated and execution passing to the function) are sequence points. As you've said, it's unspecified which function gets called first, but each of the two functions will either see all the side effects of the other, or none at all.
@dmckee
Well, that won't fit inside a comment, but here is the thing:
First, you write a correct static analyzer. "Correct", in this context, means that it won't remain silent if there is anything dubious about the analyzed code, so at this stage you merrily conflate undefined and unspecified behaviors. They are both bad and unacceptable in critical code, and you warn, rightly, for both of them.
But you only want to warn once for one possible bug, and also you know that your analyzer will be judged in benchmarks in terms of "precision" and "recall" when compared to other, possibly not correct, analyzers, so you mustn't warn twice about the same problem, be it a true or false alarm (you don't know which; you never know which, otherwise it would be too easy).
So you want to emit a single warning for
*p = x;
y = *p;
Because as soon as p is a valid pointer at the first statement, it can be assumed to be a valid pointer at the second statement. And not inferring this will lower your score on the precision metric.
So you teach your analyzer to assume that p is a valid pointer as soon as you have warned about it the first time in the above code, so that you don't warn about it the second time. More generally, you learn to ignore values (and execution paths) that correspond to something you have already warned about.
Then, you realize that not many people are writing critical code, so you make other, lightweight analyses for the rest of them, based on the results of the initial, correct analysis. Say, a C program slicer.
And you tell "them": You don't have to check about all the (possibly, often false) alarms emitted by the first analysis. The sliced program behaves the same as the original program as long as none of them is triggered. The slicer produces programs that are equivalent for the slicing criterion for "defined" execution paths.
And users merrily ignore the alarms and use the slicer.
And then you realize that perhaps there is a misunderstanding. For instance, most implementations of memmove (you know, the one that handles overlapping blocks) actually invoke unspecified behavior when called with pointers that do not point to the same block (comparing addresses that do not point to the same block). And your analyzer ignores both execution paths, because both are unspecified, but in reality both execution paths are equivalent and all is well.
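For readers who have not looked inside one, this is roughly the shape of the memmove being described. It is a simplified sketch written for this note, not the code of any particular C library; sketch_memmove is a made-up name.

```c
#include <stddef.h>

void *sketch_memmove(void *dst, const void *src, size_t n) {
    unsigned char *d = dst;
    const unsigned char *s = src;
    if (d < s) {
        /* This address comparison is the problematic step when the two
           pointers do not point into the same block. */
        for (size_t k = 0; k < n; k++) {
            d[k] = s[k];            /* copy forwards */
        }
    } else {
        for (size_t k = n; k > 0; k--) {
            d[k - 1] = s[k - 1];    /* copy backwards */
        }
    }
    return dst;
}
```

When the blocks do not overlap, both branches produce the same bytes, which is why either outcome of the comparison is harmless in that case.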
So there shouldn't be any misunderstanding on the meaning of alarms, and if one intends to ignore them, only unmistakable undefined behaviors should be excluded.
And this is how you end up with a strong interest in distinguishing between unspecified behavior and undefined behavior. No-one can blame you for ignoring the latter. But programmers will write the former without even thinking about it, and when you say that your slicer excludes "wrong behaviors" of the program, they will not feel that this concerns them.
And this is the end of a story that definitely did not fit in a comment. Apologies to anyone who read that far.

Resources