Why is a declaration not a statement in C? - c

The following example is an illegal C program, which is confusing and shows that a declaration is not a statement in the C language.
int main() {
if (1) int x;
}
I've read the C specification (N2176) and I know the C language distinguishes declarations from statements in its syntax. I told my teacher, who teaches compilers, and he did not seem to believe it; I could not convince him without showing him the specification.
So I am also really confused. Why is C designed like this? Why is a declaration not a statement in C? How can I convince someone of the reason for this design?

Because there is no apparent grammatical or technical semantic reason that a declaration cannot appear wherever a statement may appear, the restriction seems to be largely a matter of history and lack of utility.
Considering the Grammar
Statements enter the C grammar in the function-definition rule, in which a compound-statement appears. A compound-statement allows a statement. Then inspecting the replacements for a statement reveals the places where a statement may appear but a declaration may not:
In a labeled-statement, after a label followed by :.
In a selection-statement, after the ) of an if or a switch or after an else.
In an iteration-statement, after the ) of a while or for or after a do.
The following is not a formal analysis of the grammar, but it appears the places where a statement may appear but a declaration may not are quite limited: After the : that ends a label, after a keyword (else or do) or after the closing ) for a ( that immediately follows a keyword (if, switch, while, or for). These seem to me like unambiguous points in the grammar, where it should be as easy to distinguish declarations and statements as it is to do so after a ; in a compound-statement.
Therefore, I do not think there is a grammatical reason not to allow a declaration to appear anywhere a statement may appear (or, equivalently, to define a declaration as a kind of statement).
Considering the Semantics
Now consider the semantic effects of allowing declarations in the places where currently a statement may appear but a declaration may not.
In the case of a labeled-statement, where we desired to have label: declaration, we can use label: ; declaration, where we have inserted a null statement after the :. The result is a defined code sequence with semantic effect equivalent to what we would desire to have by allowing a declaration immediately after the label.
In the other cases, where we desire to have declaration, we can use { declaration }. Again, the result is a defined code sequence with semantic effect equivalent to what we would desire to have by allowing a bare declaration. That effect is minimal; any expressions in the declaration (in array declarators or initializers) will be evaluated, but anything that is declared goes out of scope immediately. Note that even if the scope were not ended by the closing }, it is ended by the fact that the C standard defines each of these places to be a block. (C 2018 6.8.4 3 says the substatement of a selection-statement is a block, and 6.8.5 5 says the loop body of an iteration-statement is a block.)
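For concreteness, here is a minimal sketch of both workarounds (the function and variable names are illustrative; valid in C99 and later):
void demo(int flag)
{
    if (flag)
        goto done;
done: ;                 /* null statement gives the label a statement to label */
    int x = 42;         /* the declaration then follows as usual */
    (void)x;

    if (flag) { int y = 1; (void)y; }  /* braces wrap the bare declaration in a compound statement */
}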
Nonetheless, this shows there is no technical semantic impediment to allowing a declaration wherever a statement may appear.
Conclusion
Since the grammar and semantics of C apparently do not preclude allowing a declaration to be a type of statement, we are left with reasons of history and utility. In C as described in the first edition of Kernighan and Ritchie’s The C Programming Language, the locations of declarations were limited. Inside functions, they could only appear at the start of a compound statement. A declaration could not follow a statement. As we see from modern C, there was no grammatical or semantic reason for this limitation; we can allow declarations anywhere within a compound statement. So it seems simply that, around 1978, work on the language had not progressed that far.
Similarly, it seems that current work on C has not gone to the point of allowing a declaration to appear anywhere a statement may appear, as if it were a type of statement, even though there may be no technical impediment. However, in this case, there is less motivation for loosening the rules. Of the above cases, the only one that is of much use is allowing a declaration in a labeled statement. And, as its desired effect is easily accomplished by inserting a null statement, there is likely insufficient motivation to change compilers and to advocate for the changes in the C committee.

That's because a declaration doesn't instruct the compiler to do anything; it is purely informative for the compiler, at least by the standard. Compilers may do something when they see a declaration, and the standard does not forbid that either, but it doesn't require them to do anything if they only see a declaration and whatever it declares is never used within any statement.
Consider this code:
#include <stdio.h>

int main(void)
{
    int x;
    printf("Hello World!\n");
    return 0;
}
What do you think int x; will do? You are declaring that x is of type int, but you never use x anywhere in the rest of the code. The compiler doesn't even have to reserve any memory on the stack for it. It may do so, but it isn't required to.
The standard allows the compiler to create exactly the same code as if you had written:
#include <stdio.h>

int main(void)
{
    printf("Hello World!\n");
    return 0;
}
There is simply nothing a compiler must do if you let it know the type of a variable. This variable doesn't have to exist anywhere at all unless it is ever used by a statement.
C is not an interpreted language where every piece of code instructs the interpreter to directly do something. C is a compiled language which means you tell the compiler to generate CPU code for you that performs the actions you described in a predefined language. So there is no one-to-one relationship between the code you write and the CPU code the compiler generates.
You may write
int x = a / 8;
but the CPU code that the compiler generates may be equivalent to
int x = a >> 3;
For unsigned values that is exactly the same thing, and if shifting is faster than division (and you can bet it is), the compiler does not have to generate a division just because you told it to do so. (For signed values the two operations differ on negative numbers, so the compiler adds a small correction, but it can still avoid a real division.) What you told the compiler is "I want x to be one eighth of a" and the compiler will be like "okay, I'll generate code that makes this happen", but how the compiler makes it happen is up to the compiler.
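A small sketch of why the signedness matters here (the values are illustrative):
#include <stdio.h>

int main(void)
{
    unsigned u = 100;
    printf("%u %u\n", u / 8, u >> 3);   /* 12 12: identical for unsigned */

    int s = -100;
    /* Signed division truncates toward zero, while an arithmetic right
       shift rounds toward negative infinity, so the results differ. */
    printf("%d %d\n", s / 8, s >> 3);   /* -12 -13 on typical two's-complement machines */
    return 0;
}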
Thus the compiler only needs to translate statements to CPU code. Strictly, only statements that have an effect, but finding that out may require expensive analysis, so it's no standard violation to translate all statements to code, even those that do nothing. A declaration on its own never has an effect; it just lets the compiler know the type of a variable or function, which may become important in statements later on, but only if the variable/function is ever actually used.

If it were valid, what would you like this program to do?
#include <stdio.h>
int main(int argc, char **argv)
{
    if (argc > 1) int x = 42;
    printf("%d\n", x);
    return 0;
}

Related

Why are nested functions in C against the C standards?

Nested functions (function definitions in block scope) are not allowed by the C standards (ANSI [C89], C99, C11).
But I couldn't find where this is stated in the C standards.
Edit:
Why can't a function definition appear inside another function definition (compound statement)?
There is a difference between a function declaration and a function definition. A declaration merely declares the existence of a function and a definition defines the function.
int f(void) { /* ... */ } // function definition
int f(void); // function declaration
In 6.9.1 the syntax for a function is defined as
function-definition:
declaration-specifiers declarator declaration-list_opt compound-statement
In 6.8.2, the things you can put in a compound statement are defined as a declaration or a statement. A function definition isn't considered to be either of these syntactically.
So yes, a function declaration is legal in a function but a function definition is not e.g.
int main(int argc, char *argv[])
{
    int f(void);               // legal
    int g(void) { return 1; }  // ILLEGAL
    // blah blah
}
It may not be stated directly but if you work through the grammar for function-definition you'll find they're not accepted within the grammar.
Why? According to Dennis Ritchie (who is a bit of an authority on the matter), it appears they were excluded from the start:
"Procedures can be nested in BCPL, but may not refer to non-static objects defined in containing procedures. B and C avoid this restriction by imposing a more severe one: no nested procedures at all."
https://www.bell-labs.com/usr/dmr/www/chist.html
There is humor in avoiding a restriction by imposing a more severe one. I read this as a simplifying maneuver: nested procedures add complexity to the compiler (complexity Ritchie was keen to limit on the machines of the time) and add little value.
The standardization process was (wisely) never seen as an opportunity to extend C willy-nilly and (from the same document):
"From the beginning, the X3J11 committee took a cautious, conservative view of language extensions."
It's difficult to put a case that nested functions offer significant benefits so it's not surprising that even if some implementations were supporting them they weren't adopted as standard.
In general the standards efforts ever since have been at least equally conservative and again it's difficult to see a lot of support amongst implementers to add such a feature.
At the end of the day, if you're worried that some function might be used outside its intended purpose and it is (logically) a sub-function of exactly one given function, then give it static linkage, and if necessary introduce another source file or even a whole translation unit.
Nested or private functions are something that many C compilers used to allow, but are not part of the C standard, and now it's quite rare to find compilers that support them, certainly by default.
The standard is determined by a committee, and nested functions will be something that they have discussed, and there will be a rationale, but I don't know what it is offhand, nor do most C programmers. Nested functions aren't inherently a bad idea, but you can achieve virtually all of the benefits by writing a static file scope function, which is the method of creating private functions which was standardised.
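For example, a minimal sketch of that idiom (the file and function names are illustrative):
/* helpers.c: 'helper' has internal linkage, so it is private to this
   translation unit and cannot be called from any other file. */
static int helper(int v)
{
    return v * 2;
}

int api_entry(int v)    /* the only name exported to other files */
{
    return helper(v) + 1;
}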

When does macro substitution happen in C

I was reading the book "Compilers: Principles, Techniques, and Tools (2nd Edition)" by Alfred V. Aho. There is an example in this book (example 1.7) which asks to analyze the scope of x in the following macro definition in C:
#define a (x+1)
From this example,
We cannot resolve x statically, that is, in terms of the program text.
In fact, in order to interpret x, we must use the usual dynamic-scope
rule. We examine all the function calls that are currently active, and
we take the most recently called function that has a declaration of x.
It is to this declaration that the use of x refers.
I've become confused reading this - as far as I know, macro substitution happens in the preprocessing stage, before compilation starts. But if I get it right, the book says it happens when the program is getting executed. Can anyone please clarify this?
The macro itself has no notion of scope, at least not in the same sense as the C language has. Wherever the symbol a appears in the source after the #define (and before a possible #undef) it is replaced by (x + 1).
But the text talks about the scope of x, the symbol in the macro substitution. That is interpreted by the usual C rules. If there is no symbol x in the scope where a was substituted, this is a compilation error.
The macro is not self-contained. It uses a symbol external to the macro, some kind of global variable if you will, but one whose meaning changes according to the place in the source text where the macro is invoked. I think what the quoted text wants to say is that we cannot know what macro a does unless we know where it is invoked.
I've become confused reading this - as far as I know, macro substitution happens in the preprocessing stage, before compilation starts.
Yes, this is how a compiler works.
But if I get it right, the book says it happens when the program is getting executed. Can anyone please clarify this?
Speaking without referring to the book, there are other forms of program analysis besides translating source code to object code (a.k.a. compilation). A C compiler replaces macros before compiling, thus losing information about what was originally a macro, because that information is not significant to the rest of the translation process. The question of the scope of x within the macro never comes up, so the compiler may ignore the issue.
Debuggers often implement tighter integration with source code, though. One could conceive of a debugger that points at subexpressions while stepping through the program (I have seen this feature in an embedded toolchain), and furthermore points inside macros which generate expressions (this I have never seen, but it's conceivable). Or, some debuggers allow you to point at any identifier and see its value. Pointing at the macro definition would then require resolving the identifiers used in the macro, as Aho et al discuss there.
It's difficult to be sure without seeing more context from the book, but I think that passage is at least unclear, and probably incorrect. It's basically correct about how macro definitions work, but not about how the name x is resolved.
#define a (x+1)
C macros are expanded early in the compilation process, in translation phase 4 of 8, as specified in N1570 5.1.1.2. Variable names aren't resolved until phase 7.
So the name x will be meaningfully visible to the compiler, not at the point where the macro is defined, but at the point in the source code where the macro a is used. Two different uses of the a macro could refer to two different declarations of variables named x.
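A small sketch of that (the functions here are illustrative):
#define a (x+1)

int f(void)
{
    int x = 10;
    return a;      /* expands to (x+1): refers to f's x, so 11 */
}

int g(void)
{
    int x = 100;
    return a;      /* same macro text, but refers to g's x, so 101 */
}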
We cannot resolve x statically, that is, in terms of the program text.
We cannot resolve it at the point of the macro definition.
In fact, in order to interpret x, we must use the usual dynamic-scope
rule. We examine all the function calls that are currently active, and
we take the most recently called function that has a declaration of x.
It is to this declaration that the use of x refers.
This is not correct for C. When the compiler sees a reference to x, it must determine what declaration it refers to (or issue a diagnostic if there is no such declaration). That determination does not depend on currently active function calls, something that can only be determined at run time. C is statically scoped, meaning that the appropriate declaration of x can be determined entirely by examining the program text.
At compile time, the compiler will examine symbol table entries for the current block, then for the enclosing block, then for the current function (x might be the name of a parameter), then for file scope.
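A sketch of that lookup order (names are illustrative):
int x = 1;                 /* file scope */

int f(int x)               /* the parameter shadows the file-scope x */
{
    int r;
    {
        int x = 3;         /* block scope shadows the parameter */
        r = x;             /* 3: the innermost visible declaration wins */
    }
    r += x;                /* here x is the parameter again */
    return r;
}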
There are languages that use dynamic scoping, where the declaration a name refers to depends on the current run-time call stack. C is not one of them.
Here's an example of dynamic scoping in Perl (note that this is considered poor style):
#!/usr/bin/perl
use strict;
use warnings;
no strict "vars";

sub inner {
    print " name=\"$name\"\n";
}

sub outer1 {
    local($name) = "outer1";
    print "outer1 calling inner\n";
    inner();
}

sub outer2 {
    local($name) = "outer2";
    print "outer2 calling inner\n";
    inner();
}

outer1();
outer2();
The output is:
outer1 calling inner
name="outer1"
outer2 calling inner
name="outer2"
A similar program in C would be invalid, since the declaration of name would not be statically visible in the function inner.
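To make the contrast concrete, here is a compilable C sketch: a dynamically scoped language would print outer1, but C binds name statically to the file-scope declaration:
#include <stdio.h>

char *name = "file scope";            /* the only declaration inner can see */

void inner(void)
{
    printf("name=\"%s\"\n", name);    /* statically bound to file scope */
}

void outer1(void)
{
    char *name = "outer1";            /* shadows locally; inner never sees it */
    (void)name;
    inner();
}

int main(void)
{
    outer1();                          /* prints name="file scope" */
    return 0;
}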

Declare variable in C - Many ways?

I'm discussing with a friend what's the correct way to declare some variables in C, exactly in the for loop.
He has a compiler I can't remember and I have Dev-C++.
He does:
for (int i = 0; i<10; i++)
// ... and it works
I do:
int i;
for (i = 0; i<10; i++)
// ... and it works
If I do it like he does, Dev-C++ gives me an error. What's the technically correct way to do this? I was taught to do it the way I do but now I'm confused because he does it in the other way and it works for him D:
Declaring the variable in the loop, like your friend does, is supported in C99 and in C++. It is likely that your friend is coming from a C++ background, where such style of declaration is the norm. Declaring the loop variable outside the loop, like you do, is correct in older C, such as C89, which is what your compiler apparently supports.
If you have access to a C99 compiler, which style to choose is mostly a matter of preference. Seasoned C programmers don't mind declaring variables outside loop bodies, but it is considered slightly cleaner to declare them inside because it restricts the scope of the variable to the least possible lexical region. Declaring the variable outside the loop body is, of course, necessary if you plan to use it after the loop is done — for example, to inspect how far the loop has progressed.
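A sketch of both styles, assuming a C99 (or later) compiler:
#include <stdio.h>

int main(void)
{
    int i;                             /* declared outside: usable after the loop */
    for (i = 0; i < 10; i++) {
        if (i == 7)
            break;
    }
    printf("stopped at i = %d\n", i);  /* prints 7 */

    for (int j = 0; j < 10; j++) {     /* C99 style: j is scoped to the loop */
        /* ... */
    }
    /* j is not visible here */
    return 0;
}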
It depends on which version of C you're using. K&R and ANSI C (C89) only support declarations at the beginning of a block, while modern C (C99 and later, like any flavour of C++) allows mixing statements and declarations.
{
    int a;
    printf("Stuff");
    int b; /* not allowed before C99 */
}
Declaring a variable inside the for-loop header causes an error in any compiler that predates C99. If you compile this with the C99 standard or newer, it will work just fine.
Formally, the physical difference between the two is performance: putting the definition inside the parentheses after for gives the variable a better chance of living only in a register. But on the other hand, many other factors determine the details of the optimization result, depending on the analysis machinery of the compiler, so the final result may be no different, or even the opposite.
There is one definite difference: if you define a variable inside the parentheses after for, that variable cannot be used outside of that for loop.

In C99, is f()+g() undefined or merely unspecified?

I used to think that in C99, even if the side-effects of functions f and g interfered, and although the expression f() + g() does not contain a sequence point, f and g would contain some, so the behavior would be unspecified: either f() would be called before g(), or g() before f().
I am no longer so sure. What if the compiler inlines the functions (which the compiler may decide to do even if the functions are not declared inline) and then reorders instructions? May one get a result different of the above two? In other words, is this undefined behavior?
This is not because I intend to write this kind of thing, this is to choose the best label for such a statement in a static analyzer.
The expression f() + g() contains a minimum of 4 sequence points; one before the call to f() (after all zero of its arguments are evaluated); one before the call to g() (after all zero of its arguments are evaluated); one as the call to f() returns; and one as the call to g() returns. Further, the two sequence points associated with f() occur either both before or both after the two sequence points associated with g(). What you cannot tell is which order the sequence points will occur in - whether the f-points occur before the g-points or vice versa.
Even if the compiler inlined the code, it has to obey the 'as if' rule - the code must behave the same as if the functions were not interleaved. That limits the scope for damage (assuming a non-buggy compiler).
So the sequence in which f() and g() are evaluated is unspecified. But everything else is pretty clean.
In a comment, supercat asks:
I would expect function calls in the source code remain as sequence points even if a compiler decides on its own to inline them. Does that remain true of functions declared "inline", or does the compiler get extra latitude?
I believe the 'as if' rule applies and the compiler doesn't get extra latitude to omit sequence points because it uses an explicitly inline function. The main reason for thinking that (being too lazy to look for the exact wording in the standard) is that the compiler is allowed to inline or not inline a function according to its rules, but the behaviour of the program should not change (except for performance).
Also, what can be said about the sequencing of (a(),b()) + (c(),d())? Is it possible for c() and/or d() to execute between a() and b(), or for a() or b() to execute between c() and d()?
Clearly, a executes before b, and c executes before d. I believe it is possible for c and d to be executed between a and b, though it is fairly unlikely that the compiler would generate the code like that; similarly, a and b could be executed between c and d. And although I used 'and' in 'c and d', that could be an 'or' - that is, any of these sequences of operation meet the constraints:
Definitely allowed
abcd
cdab
Possibly allowed (preserves a ≺ b, c ≺ d ordering)
acbd
acdb
cadb
cabd
I believe that covers all possible sequences. See also the chat between Jonathan Leffler and AnArrayOfFunctions — the gist is that AnArrayOfFunctions does not think the 'possibly allowed' sequences are allowed at all.
If such a thing were possible, that would imply a significant difference between inline functions and macros.
There are significant differences between inline functions and macros, but I don't think the ordering in the expression is one of them. That is, any of the functions a, b, c or d could be replaced with a macro, and the same sequencing of the macro bodies could occur. The primary difference, it seems to me, is that with the inline functions, there are guaranteed sequence points at the function calls - as outlined in the main answer - as well as at the comma operators. With macros, you lose the function-related sequence points. (So, maybe that is a significant difference...) However, in so many ways the issue is rather like questions about how many angels can dance on the head of a pin - it isn't very important in practice. If someone presented me with the expression (a(),b()) + (c(),d()) in a code review, I would tell them to rewrite the code to make it clear:
a();
c();
x = b() + d();
And that assumes there is no critical sequencing requirement on b() vs d().
See Annex C for a list of sequence points. Function calls (the point between all arguments being evaluated and execution passing to the function) are sequence points. As you've said, it's unspecified which function gets called first, but each of the two functions will either see all the side effects of the other, or none at all.
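A sketch that makes the unspecified order observable (the functions are illustrative):
#include <stdio.h>

static int n = 0;

static int f(void) { n = n * 10 + 1; return 1; }
static int g(void) { n = n * 10 + 2; return 2; }

int main(void)
{
    int r = f() + g();
    /* r is always 3; n ends up as 12 or 21 depending on which call ran
       first, but the two updates are never interleaved. */
    printf("r=%d n=%d\n", r, n);
    return 0;
}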
@dmckee
Well, that won't fit inside a comment, but here is the thing:
First, you write a correct static analyzer. "Correct", in this context, means that it won't remain silent if there is anything dubious about the analyzed code, so at this stage you merrily conflate undefined and unspecified behaviors. They are both bad and unacceptable in critical code, and you warn, rightly, for both of them.
But you only want to warn once for one possible bug, and you also know that your analyzer will be judged in benchmarks in terms of "precision" and "recall" when compared to other, possibly incorrect, analyzers, so you mustn't warn twice about one same problem... be it a true or a false alarm (you don't know which; you never know which, otherwise it would be too easy).
So you want to emit a single warning for
*p = x;
y = *p;
Because as soon as p is a valid pointer at the first statement, it can be assumed to be a valid pointer at the second statement. And not inferring this will lower your score on the precision metric.
So you teach your analyzer to assume that p is a valid pointer as soon as you have warned about it the first time in the above code, so that you don't warn about it the second time. More generally, you learn to ignore values (and execution paths) that correspond to something you have already warned about.
Then, you realize that not many people are writing critical code, so you make other, lightweight analyses for the rest of them, based on the results of the initial, correct analysis. Say, a C program slicer.
And you tell "them": You don't have to check about all the (possibly, often false) alarms emitted by the first analysis. The sliced program behaves the same as the original program as long as none of them is triggered. The slicer produces programs that are equivalent for the slicing criterion for "defined" execution paths.
And users merrily ignore the alarms and use the slicer.
And then you realize that perhaps there is a misunderstanding. For instance, most implementations of memmove (you know, the one that handles overlapping blocks) actually invoke unspecified behavior when called with pointers that do not point into the same block (comparing addresses that do not point into the same block). And your analyzer ignores both execution paths, because both are unspecified, but in reality both execution paths are equivalent and all is well.
So there shouldn't be any misunderstanding on the meaning of alarms, and if one intends to ignore them, only unmistakable undefined behaviors should be excluded.
And this is how you end up with a strong interest in distinguishing between unspecified behavior and undefined behavior. No one can blame you for ignoring the latter. But programmers will write the former without even thinking about it, and when you say that your slicer excludes "wrong behaviors" of the program, they will not feel that it concerns them.
And this is the end of a story that definitely did not fit in a comment. Apologies to anyone who read that far.

C++ assignment - stylish or performance?

Having been writing Java code for many years, I was amazed when I saw this C++ statement:
int a,b;
int c = (a=1, b=a+2, b*3);
My question is: Is this a choice of coding style, or does it have a real benefit? (I am looking for a practical use case)
I think the compiler will see it the same as the following:
int a=1, b=a+2;
int c = b*3;
(What's the official name for this? I assume it's standard C/C++ syntax.)
It's the comma operator, used twice. You are correct about the result, and I don't see much point in using it that way.
Looks like an obscure use of a , (comma) operator.
It's not a representative way of doing things in C++.
The only "good-style" use for the comma operator might be in a for statement that has multiple loop variables, used something like this:
// Copy from source buffer to destination buffer until we see a zero
for (char *src = source, *dst = dest; *src != 0; ++src, ++dst) {
    *dst = *src;
}
I put "good-style" in scare quotes because there is almost always a better way than to use the comma operator.
Another context where I've seen this used is with the ternary operator, when you want to have multiple side effects, e.g.,
bool didStuff = DoWeNeedToDoStuff() ? (Foo(), Bar(), Baz(), true) : false;
Again, there are better ways to express this kind of thing. These idioms are holdovers from the days when we could only see 24 lines of text on our monitors, and squeezing a lot of stuff into each line had some practical importance.
Dunno its name, but it seems to be missing from the Job Security Coding Guidelines!
Seriously: C++ allows you to do a lot of things in many contexts, even when they are not necessarily sound. With great power comes great responsibility...
This is called 'obfuscated C'. It is legal, but intended to confuse the reader. And it seems to have worked. Unless you're trying to be obscure it's best avoided.
@Hotei
Your sample code uses two features of C expressions that are not very well known to beginners (but not really hidden either):
the comma operator: a normal binary operator whose role is to return the second of its two operands. The operands are evaluated from left to right.
assignment as an operator that returns a value. C assignment is not a statement as in some other languages; it returns the value that has been assigned.
Most use cases of both these features involve some form of obfuscation, but there are some legitimate ones. The point is that you can use them anywhere you can provide an expression: inside an if or a while conditional, in a for loop iteration block, in function call parameters (if using the comma there, you must add parentheses to avoid confusion with the commas separating actual function parameters), in macro parameters, etc.
The most usual use of comma is probably in loop control, when you want to change two variables at once, or store some value before performing loop test, or loop iteration.
For example, a reverse function can be written as below, thanks to the comma operator (SWAP here is the usual three-assignment swap macro):
#define SWAP(a, b) do { int t = (a); (a) = (b); (b) = t; } while (0)

void reverse(int *d, int len)
{
    int i, j;
    for (i = 0, j = len - 1; i < j; i++, j--) {
        SWAP(d[i], d[j]);
    }
}
Another legitimate (not obfuscated, really) use of the comma operator I have in mind is a DEBUG macro I found in some project, defined as:
#if defined(DEBUGMODE)
#define DEBUG(x) printf x
#else
#define DEBUG(x) x
#endif
You use it like:
DEBUG(("my debug message with some value=%d\n", d));
If DEBUGMODE is on then you'll get a printf; if not, printf is not called, but the expression between the parentheses is still valid C. The point is that any side effects in the printing code will apply both in release code and debug code, like the one introduced by:
DEBUG(("my debug message with some value=%d\n", d++));
With the above macro d will always be incremented regardless of debug or release mode.
There are probably some other rare cases where comma and assignment values are useful and code is easier to write when you use them.
I agree that the assignment operator is a great source of errors, because it can easily be confused with == in a conditional.
I agree that, as the comma is also used with a different meaning in other contexts (function calls, initialisation lists, declaration lists), it was not a very good choice for an operator. But basically it's no worse than using < and > for template parameters in C++, and it has been in C since much older days.
It's strictly coding style and won't make any difference in your program, especially since any decent C++ compiler will optimize it to
int a=1;
int b=3;
int c=9;
The math won't even be performed during assignment at runtime. (and some of the variables may even be eliminated entirely).
As to choice of coding style, I prefer the second example. Most of the time, less nesting is better, and you won't need the extra parenthesis. Since the use of commas exhibited will be known to virtually all C++ programmers, you have some choice of style. Otherwise, I would say put each assignment on its own line.
Is this a choice of coding style, or does it have a real benefit? (I am looking for a practical use case)
It's both a choice of coding style and it has a real benefit.
It's clearly a different coding style as compared to your equivalent example.
The benefit is that I already know I would never want to employ the person who wrote it, not as a programmer anyway.
A use case: Bob comes to me with a piece of code containing that line. I have him transferred to marketing.
You have found a hideous abuse of the comma operator written by a programmer who probably wishes that C++ had multiple assignment. It doesn't. I'm reminded of the old saw that you can write FORTRAN in any language. Evidently you can try to write Dijkstra's language of guarded commands in C++.
To answer your question, it is purely a matter of (bad) style, and the compiler doesn't care—the compiler will generate exactly the same code as from something a C++ programmer would consider sane and sensible.
You can see this for yourself if you make two little example functions and compile both with the -S option.
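For instance, a sketch of such a comparison (compile with cc -S -O2 and diff the two bodies):
int with_comma(void)
{
    int a, b;
    int c = (a = 1, b = a + 2, b * 3);
    return c;
}

int without_comma(void)
{
    int a = 1, b = a + 2;
    int c = b * 3;
    return c;
}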
