The ANSI C grammar specifies:
declarator:
pointer_opt direct-declarator
direct-declarator:
identifier
( declarator )
direct-declarator [ constant-expression_opt ]
direct-declarator ( parameter-type-list )
direct-declarator ( identifier-list_opt )
According to this grammar, it would be possible to derive
func()()
as a declarator, and
int func()()
as a declaration, which is semantically illegal. Why does the C grammar allow such syntactically legal, but semantically illegal declarations?
These kinds of questions typically can't be answered for certain, because you're asking for information about the collective thoughts and deliberations of the C committee, in 1989. They've never conducted the work of language development wholly in public, the way, say, the people responsible for Python do, and thirty years ago they did that even less. And if you polled them personally, they probably wouldn't remember.
We can look at the C Rationale document (I'm linking to the edition corresponding to C99, but as far as I know it didn't change very much since 1989) for clues, but on a quick skim, I don't see anything relevant to your question.
That leaves me making guesses based on general principles of programming language design. There is a general principle relevant to your question: Particularly for older languages, designers try to make the formal syntax be context-free as much as possible. This makes it much easier to write an efficient parser. Rules like "you can't have a function that returns a function" require context, and so they are left out of the syntax. It's straightforward to handle them as post-hoc constraints applied to the parse tree instead, so that's what designers do.
The C grammar has a whole bunch of places where this principle appears to have been used, not just the one you're asking about. For instance, the "maximal munch" rule for tokenization exists because it means the tokenizer does not need to be aware of the full parser context, even though it leads to inconvenient results: a-----b is tokenized as a -- -- - b rather than a -- - -- b, even though the parser rejects the former but would accept the latter.
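A minimal sketch (not from the original discussion) that makes the maximal-munch behavior concrete:

int main(void)
{
    int a = 5, b = 2, c;
    /* c = a-----b;   does not compile: tokenized as a -- -- - b,
       and (a--)-- requires an lvalue */
    c = a-- - --b;    /* explicit spacing: evaluates 5 - 1, so c == 4 */
    return c;
}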
This design principle for programming languages is often surprising to beginners, because it's so different from how humans understand natural languages; we will go out of our way to "repair" some kind of contextually appropriate meaning from even the most nonsensical sentences, and we actually rely on this in conversation. It might help to contemplate the meta-principle that worse is better (to oversimplify, because you can get the first 90% of the work done quickly and put it out there and then iterate on the remaining 90%).
Why does the C grammar allow syntactically legal, but semantically illegal declarations like int func()()?
Your question basically answers itself:
Quite simply, it's because it's a grammar's whole job to accept syntactically legal constructs. If something is syntactically legal, but semantically meaningless or illegal, it's not the grammar's job to reject it -- it gets rejected later, during semantic analysis.
And if the question is, "Why wasn't the grammar written differently, so that semantically illegal constructs were also syntactically illegal (such that the grammar could reject them)?", the answer is that it's often a tradeoff whether to reject things during parsing or during semantic analysis. C's declaration syntax is pretty complicated, and there's an obvious desire to make the grammar which accepts it about as complicated as, but not significantly more complicated than, it has to be. Often, you can keep a grammar nicely simple by deferring certain checks to the semantic analysis phase.
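For completeness, the check deferred to semantic analysis here is "a function cannot return a function type". The legal way to express the intent is a function returning a pointer to a function; a hedged sketch (the names are made up):

static int helper(void) { return 42; }

/* func: function taking no arguments, returning a pointer to a
   function taking no arguments and returning int */
int (*func(void))(void)
{
    return helper;
}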
Why does the C grammar allow such syntactically legal, but semantically illegal declarations?
What makes you think it sensible to expect the language syntax to be unable to express any semantically incorrect statements?
Not all semantic problems can even be detected at compile time (example: y = 1 / x;, which is well-defined except when x is zero). Even formulating the syntax rules so that they do not accept any statements, declarations, or expressions that can be proven semantically wrong at compile time would be of little benefit. It would complicate the syntax rules tremendously for very little gain, as compilers have to do the semantic analysis either way.
Note well that the primary audience for the language standard is people, not machines. That's why it describes the language semantics with prose.
My CCS 6.1 ARM compiler (for LM3Sxxxx Stellaris) throws a warning:
"MISRA Rule 12.2. The value of an expression shall be the same under any order of evaluation that the standard permits"
for the following code:
typedef struct {
...
uint32_t bufferCnt;
uint8_t buffer[100];
...
} DIAG_INTERFACE_T;
static DIAG_INTERFACE_T diagInterfaces[1];
...
DIAG_INTERFACE_T * diag = &diagInterfaces[0];
uint8_t data = 0;
diag->bufferCnt = 0;
diag->buffer[diag->bufferCnt++] = data; // line where warning is issued
...
I don't see a problem in my code. Is it a false positive, or is it my bug?
Put diag->bufferCnt++ in a separate statement (as Hans also advised in the comments under the question) and the warning should not appear.
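A minimal sketch of that rewrite, using the variables from the question:

diag->buffer[diag->bufferCnt] = data;   /* one side effect: the assignment */
diag->bufferCnt++;                      /* one side effect: the increment */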
But regarding MISRA rule 12.2, I see no violation in your program (there is a single sequence point in your statement and no unspecified behavior), so I think it's a bug in your MISRA software.
For information there is also an advisory 12.13 rule in MISRA that says:
(MISRA-C:2004, 12.13) "The increment (++) and decrement (--) operators should not be mixed with other operators in an expression"
The problem with MISRA is that its use of terminology is far from perfect: for 12.13, while -> and = are C operators, the explanation then seems to talk only about arithmetic operators...
Although you don't indicate it, this is MISRA-C:2004 Rule 12.2, and is now MISRA-C:2012 Rule 13.2. As oauh says, this has nothing to do with "order of evaluation".
I highly recommend referring to MISRA-C:2012 even if you are required to be MISRA-C:2004 compliant. Having MISRA-C:2012 around helps because it has clarified many of the guidelines, adding rationale, explanations, and examples.
You should not rely solely on a compiler to check for MISRA-C compliance. It's convenient, but a compiler's #1 goal is translating and optimizing the language, not warning you about all of its traps and pitfalls, and compilers are not very precise about such checks either, as in this case. Also, there are many undefined behaviors spanning translation units that a compiler cannot warn about. It's best to also use a dedicated MISRA static-analysis tool, one that is not compiler-specific, but that warns about all unpredictable constructs from the ISO C standard's point of view, not a particular implementation's.
As oauh also said, this is a violation of MISRA-C:2004 Rule 12.13, which is now MISRA-C:2012 Rule 13.3. The latter has been relaxed to permit ++ and -- to be mixed with other operators, provided that the ++ or -- is the only source of side effects (in your case the assignment is also a side effect, in C terminology).
The rule is not critical, i.e. the behavior is well defined, but the different values resulting from the prefix version and the postfix version can cause confusion, so it is "advisory", meaning no formal deviation is required (again, a decent MISRA-C tool would let you suppress this particular violation).
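To illustrate the confusion the advisory rule is guarding against, a small sketch (hypothetical buffer, not the questioner's code):

#include <stdint.h>

void demo(void)
{
    uint32_t i = 0;
    uint8_t buf[4] = {0};

    buf[i++] = 1;   /* postfix: writes buf[0], then i becomes 1 */
    buf[++i] = 2;   /* prefix: i becomes 2 first, then writes buf[2] */
}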
I think the question is self-sufficient. Is the syntax of the C language completely defined through context-free grammars, or are there language constructs that require non-context-free definitions in the course of parsing?
One example of a non-context-free construct I thought of was the requirement that variables be declared before their use. But in Compilers (Aho, Sethi, Ullman), it is stated that the C language does not distinguish between identifiers on the basis of their names. All identifiers are tokenized as 'id' by the lexical analyzer.
If C is not completely defined by CFGs, can anyone please give an example of a non-context-free construct in C?
The problem is that you haven't defined "the syntax of C".
If you define it as the language C in the CS sense, meaning the set of all valid C programs, then C – as well as virtually every other language aside from Turing tarpits and Lisp – is not context-free. The reasons are not related to the problem of interpreting a C program (e.g. deciding whether a * b; is a multiplication or a declaration). Instead, it's simply because context-free grammars can't help you decide whether a given string is a valid C program. Even something as simple as int main() { return 0; } needs a more powerful mechanism than context-free grammars, as you have to (1) remember the return type and (2) check that whatever occurs after the return matches the return type. a * b; faces a similar problem: you don't need to know whether it's a multiplication, but if it is a multiplication, that must be a valid operation for the types of a and b. I'm not actually sure whether a context-sensitive grammar is enough for all of C, as some restrictions on valid C programs are quite subtle, even if you exclude undefined behaviour (some of which may even be undecidable).
Of course, the above notion is hardly useful. Generally, when talking about grammars, we're only interested in a pretty good approximation of a valid program: we want a grammar that rules out as many strings which aren't C as possible without undue complexity in the grammar (for example, 1 a or (-)). Everything else is left to later phases of the compiler and called a semantic error or something similar to distinguish it from the first class of errors. These "approximate" grammars are almost always context-free grammars (including in C's case), so if you want to call this approximation of the set of valid programs "syntax", C is indeed defined by a context-free grammar. Many people do, so you'd be in good company.
The C language, as defined by the language standard, includes the preprocessor. The following is a syntactically correct C program:
#define START int main(
#define MIDDLE ){
START int argc, char** argv MIDDLE return 0; }
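After preprocessing, the macros expand deterministically, and the parser sees an ordinary translation unit:

int main( int argc, char** argv ){ return 0; }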
It seems to be really tempting to answer this question (which arises a lot) with "sure, there is a CFG for C", based on extracting a subset of the grammar in the standard, a grammar which is itself ambiguous and recognizes a superset of the language. That CFG is interesting and even useful, but it is not C.
In fact, the productions in the standard do not even attempt to describe what a syntactically correct source file is. They describe:
The lexical structure of the source file (along with the lexical structure of valid tokens after pre-processing).
The grammar of individual preprocessor directives.
A superset of the grammar of the post-processed language, which relies on some other mechanism to distinguish between typedef-name and other uses of identifier, as well as a mechanism to distinguish between constant-expression and other uses of conditional-expression.
There are many who argue that the issues in point 3 are "semantic", rather than "syntactic". However, the nature of C (and even more so its cousin C++) is that it is impossible to disentangle "semantics" from the parsing of a program. For example, the following is a syntactically correct C program:
#define base 7
#if base * 2 < 10
&one ?= two*}}
#endif
int main(void){ return 0; }
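It is syntactically correct because the preprocessor removes the nonsense before the parser ever runs: base expands to 7, so the controlling expression is 7 * 2 < 10, i.e. 14 < 10, which is false, and the group between #if and #endif is discarded. The post-preprocessing translation unit is simply:

int main(void){ return 0; }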
So if you really mean "is the syntax of the C language defined by a CFG", the answer must be no. If you meant, "Is there a CFG which defines the syntax of some language which represents strings which are an intermediate product of the translation of a program in the C language," it's possible that the answer is yes, although some would argue that the necessity to make precise what is a constant-expression and a typedef-name make the syntax necessarily context-sensitive, in a way that other languages are not.
Is the syntax of C Language completely defined through Context Free Grammars?
Yes it is. This is the grammar of C in BNF:
http://www.cs.man.ac.uk/~pjj/bnf/c_syntax.bnf
If every rule has exactly one nonterminal symbol on its left-hand side, then the grammar is context-free. That is the very definition of a context-free grammar (Wikipedia):
In formal language theory, a context-free grammar (CFG) is a formal grammar in which every production rule is of the form
V → w
where V is a single nonterminal symbol, and w is a string of terminals and/or nonterminals (w can be empty).
Since ambiguity is mentioned by others, I would like to clarify a bit. Imagine the following grammar:
A -> B x | C x
B -> y
C -> y
This is an ambiguous grammar. However, it is still a context free grammar. These two are completely separate concepts.
Obviously, the semantic analyzer of C is context-sensitive. This answer from the duplicate question has further explanations.
There are two things here:
The structure of the language (syntax): this is context-free, as you do not need to know the surroundings to figure out what is an identifier and what is a function.
The meaning of the program (semantics): this is not context-free, as you need to know whether an identifier has been declared, and with what type, when you refer to it.
If by "the syntax of C" you mean all valid C strings that some C compiler accepts, after running the preprocessor, but ignoring typing errors, then this is the answer: yes, but not unambiguously.
First, you could assume the input program is tokenized according to the C standard. The grammar will describe relations among these tokens and not the bare characters. Such context-free grammars are found in books about C and in implementations that use parser generators. This tokenization is a big assumption, because quite some work goes into "lexing" C. So I would argue that we have not described C with a context-free grammar yet if we have not used context-free grammars to describe the lexical level. The staging between the tokenizer and the parser, combined with the ordering imposed by a scanner generator (prefer keywords, longest match, etc.), is a major increase in computational power which is not easily simulated in a context-free grammar.
So, if you do not assume a tokenizer which can, for example, distinguish type names from variable names using a symbol table, then a context-free grammar is going to be harder. However, the trick here is to accept ambiguity. We can describe the syntax of C, including its tokens, fully in a context-free grammar. Only the grammar will be ambiguous and produce different interpretations for the same string. For example, for A *a; it will have derivations for both a multiplication and a pointer declaration. No problem: the grammar still describes the C syntax as you requested, just not unambiguously.
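A sketch of the two readings in context (illustrative names only):

typedef int A;

void as_declaration(void)
{
    A * a;             /* here: declares a as a pointer to int */
    (void)a;
}

void as_multiplication(void)
{
    int A = 6, a = 7;  /* local variables shadow the typedef */
    A * a;             /* here: a multiplication whose result is discarded */
}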
Notice that we have assumed the pre-processor has been run first as well; I believe your question was not about the code as it looks before pre-processing. Describing that using a context-free grammar would be sheer madness, since syntactic correctness depends on the semantics of expanding user-defined macros. Basically, the programmer extends the syntax of the C language every time a macro is defined. At CWI we did write context-free grammars for C, given a set of known macro definitions to extend the C language, and that worked out fine, but that is not a general solution.
A few years ago, before the standardization of C, it was allowed to use struct selectors on addresses. For example, the following code was allowed and frequently used.
#define PTR 0xAA000
struct { int integ; };
func() {
int i;
i = PTR->integ; /* here, c is set to the first int at PTR */
return c;
}
Maybe it wasn't very neat, but I like it. In my opinion, the power and versatility of this language rely also on its lack of constraints. Nowadays, compilers just emit an error. I'd like to know if it is possible to remove this restriction in the GNU C compiler.
PS: similar code was used in the UNIX kernel by the inventors of C (in V6, some dummy structures are declared in param.h).
'A few years ago' is actually a very, very long time ago. AFAICR, the C in 7th Edition UNIX™ (1979, a decade before the C89 standard was defined) didn't support that notation any more (but see below).
The code shown in the question only worked when all structure members of all structures shared the same name space. That meant that structure.integ or pointer->integ always referred to an int at the start of a structure because there was only one possible structure member integ across the entire program.
Note that in 'modern' C (1978 onwards), you cannot reference the structure type; there's neither a structure tag nor a typedef for it — the type is useless. The original code also references an undefined variable c.
To make it work, you'd need something like:
#define PTR 0xAA000
struct integ { int integ; };
int func(void)
{
struct integ *ptr = (struct integ *)PTR;
return ptr->integ;
}
C for 7th Edition UNIX
I suggested that the C with 7th Edition UNIX supported separate namespaces for separate structure types. However, the C Reference Manual published with the UNIX Programmer's Manual Vol 2 mentions in §8.5 Structures:
The names of structure members and structure tags may be the same as ordinary variables, since a distinction can be made by context. However, names of tags and members must be distinct. The same member name can appear in different structures only if the two members are of the same type and if their origin with respect to their structure is the same; thus separate structures can share a common initial segment.
However, that same manual also mentions the notations (see also What does =+ mean in C):
§7.14.2 lvalue =+ expression
§7.14.3 lvalue =- expression
§7.14.4 lvalue =* expression
§7.14.5 lvalue =/ expression
§7.14.6 lvalue =% expression
§7.14.7 lvalue =>> expression
§7.14.8 lvalue =<< expression
§7.14.9 lvalue =& expression
§7.14.10 lvalue =^ expression
§7.14.11 lvalue =| expression
The behavior of an expression of the form ‘‘E1 =op E2’’ may be inferred by taking it as equivalent to ‘‘E1 = E1 op E2’’; however, E1 is evaluated only once. Moreover, expressions like ‘‘i =+ p’’ in which a pointer is added to an integer, are forbidden.
AFAICR, that was not supported in the first C compilers I used (1983 — I'm ancient, but not quite that ancient); only the modern += notations were allowed. In other words, I don't think the C described by that reference manual was fully current when the product was released. (I've not checked my 1st Edition of K&R — does anyone have one on hand to check?) You can find the UNIX 7th Edition manuals online at http://cm.bell-labs.com/7thEdMan/.
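One lingering consequence of the old =op spelling, which any modern compiler will demonstrate: since =+ is no longer a single token, old code using it changes meaning rather than failing to compile. A hedged sketch:

void sketch(void)
{
    int i = 1;
    i =+ 2;   /* modern C parses this as i = (+2), so i becomes 2;
                 the old reading i =+ 2 (i.e. i += 2) would have given 3 */
}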
By giving the structure a type name and adjusting your macro slightly you can achieve the same effect in your code:
typedef struct { int integ; } PTR_t;
#define PTR ((PTR_t*)0xAA000)
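Usage then mirrors the original idea (assuming, as the question does, that 0xAA000 is a valid, mapped address on the target):

int func(void)
{
    PTR->integ = 42;     /* write the int at 0xAA000 */
    return PTR->integ;   /* read it back */
}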
I'd like to know if it is possible to remove this restriction in the GNU C compiler.
I'm reasonably sure the answer is no -- that is, unless you rewrite gcc to support the older version of the language.
The gcc manual documents the -traditional command-line option:
'-traditional' '-traditional-cpp'
Formerly, these options caused GCC to attempt to emulate a
pre-standard C compiler. They are now only supported with the
`-E' switch. The preprocessor continues to support a pre-standard
mode. See the GNU CPP manual for details.
This implies that modern gcc (the quote is from the 4.8.0 manual) no longer supports pre-ANSI C.
The particular feature you're referring to isn't just pre-ANSI, it's very pre-ANSI. The ANSI standard was published in 1989. The first edition of K&R was published in 1978, and as I recall the language it described didn't support the feature you're looking for. The initial release of gcc was in 1987, so it's very likely that no version of gcc has ever supported that feature.
Furthermore, enabling such a feature would break existing code which may depend on the ability to use the same member name in different structures. (Traces of the old rules survive in the standard C library, where for example the members of type struct tm all have names starting with tm_; in modern C that would not be necessary.)
You might be able to find sources for an ancient C compiler that works the way you want. The late Dennis Ritchie's home page would be a good starting point for that. It's not at all obvious that you'd be able to get such a compiler working on any modern system without a great deal of work. And the result would be a compiler that doesn't support a number of newer features of C that you might find useful, such as the long, signed, and unsigned keywords, the ability to pass structures by value, function prototypes, and diagnostics for attempts to mix pointers and integers.
C is better now than it was then. There are a few dangerous things that are slightly more difficult than they were, but I'm not aware that any actual expressive power has been lost.
The Wikipedia article on ANSI C says:
One of the aims of the ANSI C standardization process was to produce a superset of K&R C (the first published standard), incorporating many of the unofficial features subsequently introduced. However, the standards committee also included several new features, such as function prototypes (borrowed from the C++ programming language), and a more capable preprocessor. The syntax for parameter declarations was also changed to reflect the C++ style.
That makes me think that there are differences. However, I didn't see a comparison between K&R C and ANSI C. Is there such a document? If not, what are the major differences?
EDIT: I believe the K&R book says "ANSI C" on the cover. At least I believe the version that I have at home does. So perhaps there isn't a difference anymore?
There may be some confusion here about what "K&R C" is. The term refers to the language as documented in the first edition of "The C Programming Language." Roughly speaking: the input language of the Bell Labs C compiler circa 1978.
Kernighan and Ritchie were involved in the ANSI standardization process. The "ANSI C" dialect superseded "K&R C" and subsequent editions of "The C Programming Language" adopt the ANSI conventions. "K&R C" is a "dead language," except to the extent that some compilers still accept legacy code.
Function prototypes were the most obvious change between K&R C and C89, but there were plenty of others. A lot of important work went into standardizing the C library, too. Even though the standard C library was a codification of existing practice, it codified multiple existing practices, which made it more difficult. P.J. Plauger's book, The Standard C Library, is a great reference, and also tells some of the behind-the-scenes details of why the library ended up the way it did.
The ANSI/ISO standard C is very similar to K&R C in most ways. It was intended that most existing C code should build on ANSI compilers without many changes. Crucially, though, in the pre-standard era, the semantics of the language were open to interpretation by each compiler vendor. ANSI C brought in a common description of language semantics which put all the compilers on an equal footing. It's easy to take this for granted now, some 20 years later, but this was a significant achievement.
For the most part, if you don't have a pre-standard C codebase to maintain, you should be glad you don't have to worry about it. If you do--or worse yet, if you're trying to bring an old program up to more modern standards--then you have my sympathies.
There are some minor differences, but I think later editions of K&R are for ANSI C, so there's no real difference anymore.
"C Classic" for lack of a better terms had a slightly different way of defining functions, i.e.
int f( p, q, r )
int p, float q, double r;
{
// Code goes here
}
I believe the other difference was function prototypes. Pre-ANSI declarations didn't have to (in fact, couldn't) include a list of parameter names or types. In ANSI C they do.
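A sketch of the contrast (the function names are illustrative):

/* K&R-style declaration: tells the compiler nothing about the parameters */
int g();

/* ANSI prototype: parameter types are part of the declaration, so the
   compiler can diagnose calls with the wrong number or types of arguments */
int f(int p, float q, double r);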
function prototypes.
const and volatile qualifiers.
wide character support and internationalization.
permitting function pointers to be used without dereferencing.
Another difference is that function return types and parameter types did not need to be declared. They would default to int.
f(x)
{
return x + 1;
}
and
int f(x)
int x;
{
return x + 1;
}
are identical.
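For comparison, the ANSI prototype-style definition of the same function:

int f(int x)
{
    return x + 1;
}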
The major differences between ANSI C and K&R C are as follows:
function prototyping
support of the const and volatile data type qualifiers
support wide characters and internationalization
permit function pointers to be used without dereferencing
ANSI C adopts the C++ function prototype technique, in which function definitions and declarations include the function name, the arguments' data types, and the return value's data type. Function prototypes enable ANSI C compilers to check for function calls in user programs that pass an invalid number of arguments or incompatible argument data types. These fix a major weakness of K&R C compilers: invalid calls in user programs often passed compilation but caused programs to crash when executed.
Example: defining a function foo that takes two arguments:
unsigned long foo (char* fmt, double data)
{
    /* body of foo */
}
A major difference nobody has yet mentioned is that before ANSI, C was defined largely by precedent rather than specification; in cases where certain operations would have predictable consequences on some platforms but not others (e.g. using relational operators on two unrelated pointers), precedent strongly favored making platform guarantees available to the programmer. For example:
On platforms which define a natural ranking among all pointers to all objects, application of the relational operators to arbitrary pointers could be relied upon to yield that ranking.
On platforms where the natural means of testing whether one pointer is "greater than" another never has any side-effect other than yielding a true or false value, application of the relational operators to arbitrary pointers could likewise be relied upon never to have any side-effects other than yielding a true or false value.
On platforms where two or more integer types shared the same size and representation, a pointer to any such integer type could be relied upon to read or write information of any other type with the same representation (see the sketch after this list).
On two's-complement platforms where integer overflow naturally wraps silently, an operation involving unsigned values smaller than int could be relied upon to behave as though the values were unsigned, in cases where the result would be between INT_MAX+1u and UINT_MAX and was not promoted to a larger type, nor used as the left operand of >>, nor as either operand of /, %, or any comparison operator. Incidentally, the rationale for the Standard gives this as one of the reasons small unsigned types promote to signed.
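A sketch of the third assumption above, on a hypothetical platform where int and long share size and representation; ISO C does not guarantee this, but pre-ANSI code relied on it:

void sketch(void)
{
    long total = 1000L;
    int *alias = (int *)&total;   /* same size and representation assumed */
    *alias += 1;                  /* pre-ANSI expectation: total == 1001 */
}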
Prior to C89, it was unclear to what lengths compilers for platforms where the above assumptions wouldn't naturally hold might be expected to go to uphold those assumptions anyway, but there was little doubt that compilers for platforms which could easily and cheaply uphold such assumptions should do so. The authors of the C89 Standard didn't bother to expressly say that because:
Compilers whose writers weren't being deliberately obtuse would continue doing such things when practical without having to be told (the rationale given for promoting small unsigned values to signed strongly reinforces this view).
The Standard only required implementations to be capable of running one possibly-contrived program without a stack overflow. The authors recognized that an obtuse implementation could treat any other program as invoking Undefined Behavior, but didn't think it was worth worrying about obtuse compiler writers producing implementations that were "conforming" but useless.
Although "C89" was interpreted contemporaneously as meaning "the language defined by C89, plus whatever additional features and guarantees the platform provides", the authors of gcc have been pushing an interpretation which excludes any features and guarantees beyond those mandated by C89.
The biggest single difference, I think, is function prototyping and the syntax for describing the types of function arguments.
Despite all the claims to the contrary, K&R C was and is quite capable of providing anything from low-level work close to the hardware on up.
The problem now is finding a compiler (preferably free) that can give a clean compile on a couple of million lines of K&R C without having to mess with it, and that runs on something like an AMD multi-core processor.
As far as I can see, having looked at the source of the GCC 4.x.x series, there is no simple hack to reactivate the -traditional and -traditional-cpp flag functionality to their previous working state without more effort than I am prepared to put in. It would be simpler to build a K&R pre-ANSI compiler from scratch.