I think the question is self-sufficient. Is the syntax of the C language completely defined by context-free grammars, or are there language constructs that require non-context-free definitions in the course of parsing?
An example of a non-CFL construct, I thought, was the declaration of variables before their use. But in Compilers (Aho, Ullman, Sethi), it is stated that the C language does not distinguish between identifiers on the basis of their names. All identifiers are tokenized as 'id' by the lexical analyzer.
If C is not completely defined by CFGs, can anyone please give an example of a non-CFL construct in C?
The problem is that you haven't defined "the syntax of C".
If you define it as the language C in the CS sense, meaning the set of all valid C programs, then C – as well as virtually every other language aside from Turing tarpits and Lisp – is not context free. The reasons are not related to the problem of interpreting a C program (e.g. deciding whether a * b; is a multiplication or a declaration). Instead, it's simply because context free grammars can't help you decide whether a given string is a valid C program. Even something as simple as int main() { return 0; } needs a more powerful mechanism than context free grammars, as you have to (1) remember the return type and (2) check that whatever occurs after the return matches the return type. a * b; faces a similar problem – you don't need to know whether it's a multiplication, but if it is a multiplication, that must be a valid operation for the types of a and b. I'm not actually sure whether a context sensitive grammar is enough for all of C, as some restrictions on valid C programs are quite subtle, even if you exclude undefined behaviour (some of which may even be undecidable).
Of course, the above notion is hardly useful. Generally, when talking about grammars, we're only interested in a pretty good approximation of a valid program: we want a grammar that rules out as many strings which aren't C as possible (for example, 1 a or (-)) without undue complexity in the grammar. Everything else is left to later phases of the compiler and called a semantic error or something similar to distinguish it from the first class of errors. These "approximate" grammars are almost always context free grammars (including in C's case), so if you want to call this approximation of the set of valid programs "syntax", C is indeed defined by a context free grammar. Many people do, so you'd be in good company.
The C language, as defined by the language standard, includes the preprocessor. The following is a syntactically correct C program:
#define START int main(
#define MIDDLE ){
START int argc, char** argv MIDDLE return 0; }
It seems to be really tempting to answer this question (which arises a lot) "sure, there is a CFG for C", based on extracting a subset of the grammar in the standard, which grammar in itself is ambiguous and recognizes a superset of the language. That CFG is interesting and even useful, but it is not C.
In fact, the productions in the standard do not even attempt to describe what a syntactically correct source file is. They describe:
1. The lexical structure of the source file (along with the lexical structure of valid tokens after pre-processing).
2. The grammar of individual preprocessor directives.
3. A superset of the grammar of the post-processed language, which relies on some other mechanism to distinguish between typedef-name and other uses of identifier, as well as a mechanism to distinguish between constant-expression and other uses of conditional-expression.
There are many who argue that the issues in point 3 are "semantic", rather than "syntactic". However, the nature of C (and even more so its cousin C++) is that it is impossible to disentangle "semantics" from the parsing of a program. For example, the following is a syntactically correct C program:
#define base 7
#if base * 2 < 10
&one ?= two*}}
#endif
int main(void){ return 0; }
So if you really mean "is the syntax of the C language defined by a CFG", the answer must be no. If you meant, "Is there a CFG which defines the syntax of some language which represents strings which are an intermediate product of the translation of a program in the C language," it's possible that the answer is yes, although some would argue that the necessity to make precise what is a constant-expression and a typedef-name makes the syntax necessarily context-sensitive, in a way that other languages are not.
Is the syntax of C Language completely defined through Context Free Grammars?
Yes it is. This is the grammar of C in BNF:
http://www.cs.man.ac.uk/~pjj/bnf/c_syntax.bnf
If every rule has exactly one symbol on its left-hand side, then the grammar is context free. That is the very definition of context free grammars (Wikipedia):
In formal language theory, a context-free grammar (CFG) is a formal grammar in which every production rule is of the form
V → w
where V is a single nonterminal symbol, and w is a string of terminals and/or nonterminals (w can be empty).
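The left-hand-side test described above can be sketched as a small check. This is only an illustration under assumptions that are not in the original answer: the struct production type and the function is_context_free_form are hypothetical names, and nonterminals are assumed to be single uppercase letters.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical representation of one production "lhs -> rhs".
 * Assumption: a nonterminal is a single uppercase letter. */
struct production {
    const char *lhs;
    const char *rhs;
};

/* Returns 1 if every left-hand side is exactly one nonterminal, else 0. */
int is_context_free_form(const struct production *rules, size_t count)
{
    assert(rules != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        const char *lhs = rules[i].lhs;
        if (strlen(lhs) != 1 || !isupper((unsigned char)lhs[0]))
            return 0;
    }
    return 1;
}
```

A rule set containing a production such as aB -> ab would make the function return 0, because its left-hand side has two symbols.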
Since ambiguity is mentioned by others, I would like to clarify a bit. Imagine the following grammar:
A -> B x | C x
B -> y
C -> y
This is an ambiguous grammar. However, it is still a context free grammar. These two are completely separate concepts.
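To make the ambiguity concrete, here is a minimal sketch that counts the parse trees this three-rule grammar assigns to a string. It is hand-written for exactly this grammar, not a general parser, and the function names are made up for illustration.

```c
#include <assert.h>
#include <string.h>

/* B -> y and C -> y: each derives exactly the one-character string "y". */
static int count_B(const char *s, size_t len) { return len == 1 && s[0] == 'y'; }
static int count_C(const char *s, size_t len) { return len == 1 && s[0] == 'y'; }

/* A -> B x | C x: number of distinct parse trees for s. */
int count_parses_A(const char *s)
{
    assert(s != NULL);
    size_t len = strlen(s);
    if (len == 0 || s[len - 1] != 'x')
        return 0;
    /* One tree per alternative whose prefix matches. */
    return count_B(s, len - 1) + count_C(s, len - 1);
}
```

For the input yx the count is 2 (one tree through B, one through C), which is what makes the grammar ambiguous even though every rule has a single nonterminal on the left.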
Obviously, the semantics analyzer of C is context sensitive. This answer from the duplicate question has further explanations.
There are two things here:
The structure of the language (syntax): this is context free as you do not need to know the surroundings to figure out what is an identifier and what is a function.
The meaning of the program (semantics): this is not context free as you need to know whether an identifier has been declared and with what type when you are referring to it.
If you mean by the "syntax of C" all valid C strings that some C compiler accepts, and after running the pre-processor, but ignoring typing errors, then this is the answer: yes but not unambiguously.
First, you could assume the input program is tokenized according to the C standard. The grammar will describe relations among these tokens and not the bare characters. Such context-free grammars are found in books about C and in implementations that use parser generators. This tokenization is a big assumption because quite a lot of work goes into "lexing" C. So, I would argue that we have not described C with a context-free grammar yet, if we have not used context-free grammars to describe the lexical level. The staging between the tokenizer and the parser, combined with the ordering imposed by a scanner generator (prefer keywords, longest match, etc.), is a major increase in computational power which is not easily simulated in a context-free grammar.
So, if you do not assume a tokenizer which, for example, can distinguish type names from variable names using a symbol table, then a context-free grammar is going to be harder. However, the trick here is to accept ambiguity. We can fully describe the syntax of C, including its tokens, in a context-free grammar. The grammar will simply be ambiguous and produce different interpretations for the same string. For example, for A *a; it will have derivations for both a multiplication and a pointer declaration. No problem: the grammar is still describing the C syntax as you requested, just not unambiguously.
Notice that we have assumed having run the pre-processor first as well, I believe your question was not about the code as it looks before pre-processing. Describing that using a context-free grammar would be just madness since syntactic correctness depends on the semantics of expanding user-defined macros. Basically, the programmer is extending the syntax of the C language every time a macro is defined. At CWI we did write context-free grammars for C given a set of known macro definitions to extend the C language and that worked out fine, but that is not a general solution.
I'm trying to study C grammar with flex/bison.
I found that bison cannot parse this grammar: https://www.lysator.liu.se/c/ANSI-C-grammar-y.html, because the LALR algorithm cannot process recursively multiple expressions.
Is the GLR algorithm a must for the C grammar?
There is nothing wrong with that grammar except:
it represents a very old version of C
it requires a lexical analyser which can somehow distinguish between IDENTIFIER and TYPE_NAME
it does not even attempt to handle the preprocessor phases
Also, it has one shift/reduce conflict as a result of the "dangling else" ambiguity. However, that conflict can be ignored because bison's conflict resolution algorithm produces the correct result in this case. (You can suppress the warning either with an %expect directive or by including a precedence declaration which favours shifting else over reducing if. Or you can eliminate the ambiguity in the grammar using the technique described in the Wikipedia page linked above. Note: I'm not talking about copy-and-pasting code from the Wikipedia page. In the case of C, you need to consider all cases of compound statements which terminate with an if statement.)
Moreover, an LR parser is not recursive, and it has no problems which could be described as a failure to "process recursively multiple expressions". (You might have that problem with a recursive descent parser, although it's pretty easy to work around the issue.)
So any problems you might have experienced (if your question refers to a concrete issue) have nothing to do with what's described in your question.
Of the problems I listed above, the most troubling is the syntactic ambiguity of the cast operator. The cast operator is not actually ambiguous; clearly, C compilers manage to correctly compile such expressions. But distinguishing between the two possible parses of, for example, (x)-y*z requires knowing whether x names a type or a variable.
In C, all names are lexically scoped, so it is certainly possible to resolve x at compile time. But the resolution is not context-free. Since GLR is also a technique for parsing context-free grammars, using a GLR parser won't directly help you. It might be useful in the sense that GLR parsers can theoretically produce "parse forests" rather than parse trees; that is, the output of a GLR parser might effectively contain all possible correct parses, leaving the possibility to resolve the ambiguity by building symbol tables for each scope and then choosing between alternative parses by examining the name binding in effect at each site. (This works because type alias declarations -- "typedefs" -- are not ambiguous, so all the potential parses will have the same alias declarations.)
The usual solution, though, is to parse the program text using a deterministic parser, maintaining a symbol table during the parse, and giving the lexical analyser access to this symbol table so that it can distinguish between IDENTIFIER and TYPE_NAME, as expected by the grammar you link. This technique is politely called "lexical feedback", although it's also often called "the lexer hack".
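As a rough sketch of that lexical feedback, the lexer can ask a scoped symbol table how to classify each identifier-shaped lexeme. Everything here (the struct layout, MAX_SYMBOLS, and the function names) is a hypothetical illustration, not the code of any particular compiler or of the linked grammar.

```c
#include <assert.h>
#include <string.h>

enum token_kind { IDENTIFIER, TYPE_NAME };

#define MAX_SYMBOLS 64  /* arbitrary limit for this sketch */

struct symbol {
    const char *name;
    int is_typedef;  /* 1 if introduced by typedef */
    int depth;       /* scope depth at which it was declared */
};

struct symbol_table {
    struct symbol entries[MAX_SYMBOLS];
    int count;
    int depth;
};

/* The parser calls these when it sees '{' and '}'. */
void enter_scope(struct symbol_table *t) { t->depth++; }

void leave_scope(struct symbol_table *t)
{
    assert(t->depth > 0);
    /* Drop every name declared in the scope being closed. */
    while (t->count > 0 && t->entries[t->count - 1].depth == t->depth)
        t->count--;
    t->depth--;
}

/* The parser calls this after reducing a declaration. */
void declare(struct symbol_table *t, const char *name, int is_typedef)
{
    assert(t->count < MAX_SYMBOLS);
    t->entries[t->count].name = name;
    t->entries[t->count].is_typedef = is_typedef;
    t->entries[t->count].depth = t->depth;
    t->count++;
}

/* The lexer calls this for each identifier; the innermost binding wins. */
enum token_kind classify(const struct symbol_table *t, const char *name)
{
    for (int i = t->count - 1; i >= 0; i--)
        if (strcmp(t->entries[i].name, name) == 0)
            return t->entries[i].is_typedef ? TYPE_NAME : IDENTIFIER;
    return IDENTIFIER;
}
```

With x declared as a typedef at file scope and redeclared as a variable in an inner block, classify returns IDENTIFIER inside the block and TYPE_NAME again after leave_scope, which is the information needed to pick a parse for (x)-y*z.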
So from my limited understanding, C has syntax ambiguity as seen in the expression:
T(*b)[4];
Here it is said about this sort of thing:
The well-known "typedef problem" with parsing C is that the standard C grammar is ambiguous unless the lexer distinguishes identifiers bound by typedef and other identifiers as two separate lexical classes. This means that the parser needs to feed scope information to the lexer during parsing. One upshot is that lexing must be done concurrently with parsing.
The problem is it can be interpreted as either multiplication or as a pointer depending on context (I don't 100% understand the details of this since I'm not expert in C, but I get the gist of it and why it's a problem).
typedef int a;
b * a; // multiplication
a * b; // b is pointer to type a
What I'm wondering is if you were to parse C with a Parsing Expression Grammar (PEG) such as this C grammar, how does it handle this ambiguity? I assume this grammar is not 100% correct because of this problem, and so am wondering how you would go about fixing it. What does it need to keep track of or do differently to account for this?
The usual way this is handled in a PEG grammar is to use a semantic predicate on a rule such that the rule only matches when the predicate is true, and have the predicate check whether the name in question is a type in the current context or not. In the link you give, there's a rule
typedefName : Identifier
which is the (only) one that needs the semantic predicate to resolve this ambiguity. The predicate simply checks the Identifier in question against the definitions in the current scope. If it is not defined as a type, then it rejects this rule, so the next lower priority one will (try to) match.
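A minimal sketch of that ordered choice, hand-written in C rather than in any real PEG tool's notation: the declaration alternative is tried first and only succeeds when the predicate accepts the leading name, otherwise the expression alternative matches. The token array shape, the typedefs list and all names are assumptions made for illustration.

```c
#include <assert.h>
#include <string.h>

enum parse_result { PARSE_DECLARATION, PARSE_EXPRESSION, PARSE_FAIL };

/* Semantic predicate: a hypothetical stand-in for a real scope lookup. */
static int is_type_name(const char *name,
                        const char *const *typedefs, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(typedefs[i], name) == 0)
            return 1;
    return 0;
}

/* Recognizes only the four-token shape  X * Y ;  */
enum parse_result parse_star_statement(const char *const tokens[4],
                                       const char *const *typedefs, size_t n)
{
    assert(tokens != NULL);
    if (strcmp(tokens[1], "*") != 0 || strcmp(tokens[3], ";") != 0)
        return PARSE_FAIL;
    /* First alternative: typedefName '*' Identifier ';'
     * It matches only when the predicate holds. */
    if (is_type_name(tokens[0], typedefs, n))
        return PARSE_DECLARATION;
    /* Lower-priority alternative: an expression statement. */
    return PARSE_EXPRESSION;
}
```

With a as the only known typedef, the tokens a * b ; come back as a declaration and b * a ; as an expression, matching the two cases in the question.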
The ANSI C grammar specifies:
declarator:
pointer_opt direct-declarator
direct-declarator:
identifier
( declarator )
direct-declarator [ constant-expression_opt ]
direct-declarator ( parameter-type-list )
direct-declarator ( identifier-list_opt )
According to this grammar, it would be possible to derive
func()()
as a declarator, and
int func()()
as a declaration, which is semantically illegal. Why does the C grammar allow such syntactically legal, but semantically illegal declarations?
These kinds of questions typically can't be answered for certain, because you're asking for information about the collective thoughts and deliberations of the C committee, in 1989. They've never conducted the work of language development wholly in public, the way, say, the people responsible for Python do, and thirty years ago they did that even less. And if you polled them personally, they probably wouldn't remember.
We can look at the C Rationale document (I'm linking to the edition corresponding to C1999, but as far as I know it didn't change very much since 1989) for clues, but on a quick skim, I don't see anything relevant to your question.
That leaves me making guesses based on general principles of programming language design. There is a general principle relevant to your question: Particularly for older languages, designers try to make the formal syntax be context-free as much as possible. This makes it much easier to write an efficient parser. Rules like "you can't have a function that returns a function" require context, and so they are left out of the syntax. It's straightforward to handle them as post-hoc constraints applied to the parse tree instead, so that's what designers do.
The C grammar has a whole bunch of places where this principle appears to have been used, not just the one you're asking about. For instance, the "maximal munch" rule for tokenization exists because it means the tokenizer does not need to be aware of the full parser context, even though it leads to inconvenient results, such as a-----b being interpreted as a -- -- - b instead of a -- - -- b, even though the parser will reject the former but accept the latter.
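The a-----b case can be reproduced with a toy tokenizer that always takes the longest match. This sketch only knows identifiers, "--" and "-", which is an assumption made to keep it short; the function name munch is made up.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Writes the tokens of src into out, separated by single spaces.
 * Returns 0 on success, -1 on an unknown character or if out is too small. */
int munch(const char *src, char *out, size_t out_size)
{
    assert(src != NULL && out != NULL && out_size > 0);
    size_t used = 0;
    out[0] = '\0';
    while (*src != '\0') {
        size_t len;
        if (isspace((unsigned char)*src)) {
            src++;
            continue;
        }
        if (isalpha((unsigned char)*src) || *src == '_') {
            len = 1;
            while (isalnum((unsigned char)src[len]) || src[len] == '_')
                len++;
        } else if (src[0] == '-' && src[1] == '-') {
            len = 2;  /* longest match: prefer "--" over "-" */
        } else if (src[0] == '-') {
            len = 1;
        } else {
            return -1;
        }
        /* Room for an optional separator, the token, and the terminator. */
        size_t sep = used > 0 ? 1 : 0;
        if (used + sep + len + 1 > out_size)
            return -1;
        if (sep)
            out[used++] = ' ';
        memcpy(out + used, src, len);
        used += len;
        out[used] = '\0';
        src += len;
    }
    return 0;
}
```

The tokenizer never consults the parser, so it commits to "--" twice before it reaches the final "-", even though a different split would have parsed.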
This design principle for programming languages is often surprising to beginners, because it's so different from how humans understand natural languages; we will go out of our way to "repair" some kind of contextually appropriate meaning from even the most nonsensical sentences, and we actually rely on this in conversation. It might help to contemplate the meta-principle that worse is better (to oversimplify, because you can get the first 90% of the work done quickly and put it out there and then iterate on the remaining 90%).
Why does the C grammar allow syntactically legal, but semantically illegal declarations like int func()()?
Your question basically answers itself:
Quite simply, it's because it's a grammar's whole job to accept syntactically legal constructs. If something is syntactically legal, but semantically meaningless or illegal, it's not the grammar's job to reject it -- it gets rejected later, during semantic analysis.
And if the question is, "Why wasn't the grammar written differently, so that semantically illegal constructs were also syntactically illegal (such that the grammar could reject them)?", the answer is that it's often a tradeoff whether to reject things during parsing or during semantic analysis. C's declaration syntax is pretty complicated, and there's an obvious desire to make the grammar which accepts it about as complicated as, but not significantly more complicated than, it has to be. Often, you can keep a grammar nicely simple by deferring certain checks to the semantic analysis phase.
Why does the C grammar allow such syntactically legal, but semantically illegal declarations?
What makes you think it sensible to expect the language syntax to be unable to express any semantically incorrect statements?
Not all semantic problems can even be detected at compile time (example: y = 1 / x;, which is well-defined except when x is zero). Even formulating the syntax rules so that they do not accept any statements, declarations, or expressions that can be proven semantically wrong at compile time would be of little benefit. It would complicate the syntax rules tremendously for very little gain, as compilers have to do the semantic analysis either way.
Note well that the primary audience for the language standard is people, not machines. That's why it describes the language semantics with prose.
I know that C is not a context-free language. A famous example is:
int foo;
typedef int foo;
foo x;
In this case the lexer doesn't know whether foo in the 3rd line is an identifier or a typedef name.
My question is, is this the only reason that makes C a context-sensitive language?
I mean, if we get rid of typedef, would it become a context-free language? Or are there other reasons (examples) that prevent it from being so?
Yes. C can be parsed with a classical lex + yacc combo. The lexer definition and the yacc grammar are freely available at
http://www.quut.com/c/ANSI-C-grammar-l-2011.html
and
http://www.quut.com/c/ANSI-C-grammar-y-2011.html
As you can see from the lex file, it's straightforward except for the context-sensitive check_type() (and comment(), but comment processing technically belongs to the preprocessor), which makes typedef the only source of context-sensitivity there. Since the yacc file doesn't contain any context-sensitivity introducing tricks either, a typedef-less C would be a perfectly context-free language.
No. C cannot be a strictly context-free language. For that, the syntax would have to forbid the use of an undeclared variable (which is context), in much the same way as in the case you describe in your question. Language authors always describe syntax using some kind of context-free grammar, but only to describe the main syntactic constructs of the language. The case you describe (putting a type identifier in a different token class so that it can appear in places where an ordinary identifier cannot) is only one example. For another, the freedom in the ordering of things like static unsigned long long int variable makes the syntax easier for programmers to remember, but complicates things for compiler authors.
As far as I know, there are two basic reasons that make C a context-sensitive language:
Variable is declared before it is used.
Matching the formal and actual parameters of functions or sub-routines.
These two checks can't be done by a pushdown automaton (PDA), but a linear bounded automaton (LBA) can do them.
What are examples of non-context-free languages in the C language? How do the following non-CFLs arise in C?
a) L1 = {wcw | w in {a,b}*}
b) L2 = {a^n b^m c^n d^m| n,m >=1}
The question is clumsily worded, so I'm reading between the lines, here. Still, it's a common homework/study question.
The various ambiguities [1] in the C grammar as normally presented do not render the language non-context-free. (Indeed, they don't even render the grammars non-context-free.) The general rule "if it looks like a declaration, it's a declaration regardless of other possible parses" can probably be codified in a very complex context-free grammar (although it's not 100% obvious that that is true, since CFGs are not closed under intersection or difference), but it's easier to parse with a simpler CFG and then disambiguate according to the declaration rule.
Now, the important point about C (and most programming languages) is that the syntax of the language is quite a bit more complex than the BNF used for explanatory purposes. For example, a C program is not well-formed if a variable is used without being defined. That's a syntax error, but it's not detected by the CFG parser. The grammatical productions needed to define these cases are quite complicated, due to the complicated syntax of the language, but they're going to boil down to requiring that ids appear twice in a valid program. Hence L1 = {wcw | w in {a,b}+} (here w is the identifier, and c is way too complicated to spell out). In practice, checking this requirement is normally done with a symbol table, and the formal language rules, while precise, are not written in a logical formalism. Since L1 is not a context-free language, the formalism could not be context-free, but a context-sensitive grammar can recognize L1, so it's not totally impossible. (See, for example, Algol 68.)
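For intuition, membership in L1 is easy to test with ordinary code even though no context-free grammar generates it. The sketch below is an illustration only: is_wcw is a made-up name, and it checks the w-non-empty variant used in this answer.

```c
#include <assert.h>
#include <string.h>

/* Returns 1 if s has the form w c w, with w a non-empty string over {a,b}. */
int is_wcw(const char *s)
{
    assert(s != NULL);
    size_t len = strlen(s);
    if (len < 3 || len % 2 == 0)
        return 0;
    size_t half = len / 2;
    if (s[half] != 'c')
        return 0;
    for (size_t i = 0; i < half; i++) {
        if (s[i] != 'a' && s[i] != 'b')
            return 0;
        /* The "use" after c must repeat the "declaration" before it. */
        if (s[i] != s[half + 1 + i])
            return 0;
    }
    return 1;
}
```

The comparison of the two halves plays the role that the symbol table lookup plays in a compiler: the second occurrence of the identifier has to match the first exactly.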
The symbol table is also used to decide whether a particular identifier is to be reduced to typedef-name [2]. This is required to resolve a number of ambiguities in the grammar. (It also further restricts the set of strings in the language, because there are some cases where an identifier must be resolved as a typedef-name in order for the program to be valid.)
For another type of context-sensitivity, function calls need to match function declarations in the number of arguments; this sort of requirement is modelled by L2 = {a^n b^m c^n d^m| n,m >=1} where a and c represent the definition and use of some function, and b and d represent the definition and use of a different function. (Again, in a highly-simplified form.)
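The same kind of direct check works for L2. In this sketch, which uses a made-up function name, the run of a's stands for one function's parameter list and the run of c's for the argument list of its call, and likewise b and d for the other function.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if s is a^n b^m c^n d^m with n, m >= 1, else 0. */
int is_matching_counts(const char *s)
{
    assert(s != NULL);
    size_t counts[4] = {0, 0, 0, 0};
    const char letters[4] = {'a', 'b', 'c', 'd'};
    size_t i = 0;
    /* Consume the four runs in order, counting each one. */
    for (int k = 0; k < 4; k++) {
        while (s[i] == letters[k]) {
            counts[k]++;
            i++;
        }
    }
    if (s[i] != '\0')
        return 0;  /* leftover or out-of-order characters */
    return counts[0] >= 1 && counts[1] >= 1 &&
           counts[0] == counts[2] && counts[1] == counts[3];
}
```

The two count comparisons cross over each other (a with c, b with d), and that crossing is what a single pushdown stack cannot track.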
This second requirement is possibly less clearly a syntactic requirement. Other languages (Python, for example) allow function calls with any number of arguments, and treat an argument/parameter count mismatch as a semantic error only detected at runtime. In the case of C, however, a mismatch is clearly a syntax error.
In short, the set of grammatically valid strings which constitute the C language is a proper subset of the set of strings recognized by the CFG presented in the C language definition; the set of valid parses is a proper subset of the set of derivations generated by the CFG, and the language itself is (a) unambiguous, and (b) not context-free.
Note 1: Most of these are not really ambiguities, because they depend upon how a given identifier is resolved (typedef name, function identifier, declared variable,...).
Note 2: It is not the case that identifier must be resolved as a typedef-name if it happens to be one; that only happens in places where the reduction is possible. It is not a syntax error to use the same identifier for both a type and a variable, even in the same scope. (It's not a good idea, but it's valid.) The following example, adapted from an example in section 6.7.8 of the standard, shows the use of t as both a field name and a typedef:
typedef signed int t;
struct tag {
unsigned t:4; // field named 't' of type unsigned int
const t:5; // unnamed field of type 't' (signed int)
};
These things aren't context-free in C:
foo * bar; // foo multiplied by bar or declaration of bar pointing to foo?
foo(*bar); // foo called with *bar as param or declaration of bar pointing to foo?
foo bar[2]; // is bar an array of foo or a pointer to foo?
foo (bar baz); // is foo a function or a pointer to a function?