Does there exist an LR(k) grammar with no LL(1) equivalent?

I haven't been able to find an answer to this yet. Are there grammars that are context-free and unambiguous but can't be converted to LL(1)?
I found one production that I couldn't figure out how to convert into LL(1): the parameter-type-list production in C99:
parameter-type-list:
      parameter-list
      parameter-list , ...
Is this an example of an LR(k) grammar that doesn't have an LL(1) equivalent or am I doing something wrong?
edit: I copied the wrong name, I meant to copy parameter-declaration:
parameter-declaration:
      declaration-specifiers declarator
      declaration-specifiers abstract-declarator(opt)
The problem is that declarator and abstract-declarator both have ( in their FIRST set, but are also both left-recursive.

In general, LR(k) grammars are more powerful than LL(k) grammars: there are languages that have an LR(k) parser but no LL(k) parser.
One example is the language defined by this grammar:
S -> a S
S -> P
P -> a P b
P -> \epsilon
In other words, a string of a's followed by the same number of b's or fewer. The difficulty comes from the fact that an LL(k) parser must decide, for every a it encounters, whether that a is paired with some b, while looking ahead no more than k symbols of input; those symbols may themselves all be a's, giving no useful information. For example, aaabb is derived as S => aS => aP => aaPb => aaaPbb => aaabb: the very first a must be produced by S -> aS rather than P -> aPb, and with only k symbols of lookahead the parser cannot tell which choice is correct once the string is long enough.
For a strict proof, see the second part of the accepted answer here: https://cs.stackexchange.com/questions/3350/is-this-language-ll1-parseable
Your example, however, can simply be left-factored into an LL(1) grammar:
parameter-type-list -> parameter-list optional-ellipsis
optional-ellipsis -> \epsilon
optional-ellipsis -> , ...
One note: the FOLLOW set for parameter-list will contain the , character, and this can cause a FIRST/FOLLOW conflict. If that is the case, then we would need to see the definition of parameter-list to fix that conflict as well.
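For illustration, here is a minimal recursive-descent sketch in C of the left-factored rule. It ignores the FIRST/FOLLOW caveat just mentioned, and the token kinds and helper functions are hypothetical, assumed to be supplied by the surrounding hand-written parser:

/* A minimal sketch, not real parser code: the token kinds and the helpers
   peek(), match(), and parse_parameter_list() are hypothetical and assumed
   to come from the surrounding hand-written parser. */
enum token { TOK_COMMA, TOK_ELLIPSIS /* , ... */ };

extern enum token peek(void);            /* look at the next token          */
extern void match(enum token expected);  /* consume it, or report an error  */
extern void parse_parameter_list(void);

static void parse_optional_ellipsis(void)
{
    if (peek() == TOK_COMMA) {           /* FIRST(", ...") = { "," } */
        match(TOK_COMMA);
        match(TOK_ELLIPSIS);
    }
    /* otherwise: the empty alternative, consume nothing */
}

void parse_parameter_type_list(void)
{
    parse_parameter_list();
    parse_optional_ellipsis();
}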
Edit: the parameter-declaration rule is too involved to answer right away. You can try to perform the left factoring by hand for all conflicting alternatives, or with an assisting tool such as ANTLR.

Related

What is a production in the context of the C Programming language?

In Appendix A of The C Programming Language, Second Edition, by Kernighan and Ritchie, the word production is mentioned:
This manual describes the C language specified by the draft submitted
to ANSI on 31 October, 1988, for approval as ``American Standard for
Information Systems - programming Language C, X3.159-1989.'' The
manual is an interpretation of the proposed standard, not the standard
itself, although care has been taken to make it a reliable guide to
the language. For the most part, this document follows the broad
outline of the standard, which in turn follows that of the first
edition of this book, although the organization differs in detail.
Except for renaming a few productions, and not formalizing the
definitions of the lexical tokens or the preprocessor, the grammar
given here for the language proper is equivalent to that of the
standard.
I can't however seem to make out what it actually means. My guess is that it refers to the value of a key in the grammar section:
translation-unit:
external-declaration translation-unit
external-declaration
The following piece from section A13 should hint at what the writer means, but I can't seem to figure it out by myself as I'm missing some essential terminology:
The grammar has undefined terminal symbols integer-constant,
character-constant, floating-constant, identifier, string, and
enumeration-constant; the typewriter style words and symbols are
terminals given literally. This grammar can be transformed
mechanically into input acceptable for an automatic parser-generator.
Besides adding whatever syntactic marking is used to indicate
alternatives in productions, it is necessary to expand the ``one of''
constructions, and (depending on the rules of the parser-generator) to
duplicate each production with an opt symbol, once with the symbol and
once without. With one further change, namely deleting the production
typedef-name: identifier and making typedef-name a terminal symbol,
this grammar is acceptable to the YACC parser-generator. It has only
one conflict, generated by the if-else ambiguity.
I don't know what the bolded words/sentences mean either. Can someone give me some insight?
In computer science, formal grammars are described using production rules. For example, a rule of the form:
P → xyz
says the symbol P may be replaced by the symbols xyz, and a rule of the form
Q → aP | bP
says the symbol Q may be replaced by the symbol a and the symbol P or by the symbol b and the symbol P.
The formal grammar designates a start symbol, such as Q, and says which symbols are nonterminal symbols (placeholders that do not appear in the final string) and which are terminal symbols (that may appear in the final string). If Q is the start symbol for the grammar with the two rules above, then these rules describe a grammar that contains two strings, axyz and bxyz, because those are the only two strings that can be made with those rules. The set of strings a grammar can produce is called its language.
λ is commonly used to mean an empty string, one with no symbols.
This grammar with nonterminal symbol S, terminal symbols a and b, and start symbol S:
S → aSbb | λ
can produce the empty string by replacing S with the empty string. It can produce abb by replacing S with aSbb and then the new S with the empty string. It can produce aabbbb by replacing S with aSbb, then replacing the new S with aSbb, yielding aaSbbbb, and then replacing S with the empty string, yielding aabbbb. We can see the language of this grammar is the set of all strings containing some number of a characters (possibly zero) followed by twice as many b characters.
The C standard defines the C language using a formal grammar (with some qualifications and modifications in English). For example, it contains this production (where it uses “:” instead of the “→” I used above and uses separate lines instead of “|” to indicate a choice):
statement:
        labeled-statement
        compound-statement
        expression-statement
        selection-statement
        iteration-statement
        jump-statement
Regarding the other terminology you marked in bold:
undefined terminal symbols integer-constant,…
The C grammar given in The C Programming Language appears to be a partial grammar, leaving integer-constant as an undefined symbol, which therefore acts as a terminal symbol. In the official ANSI C 1990 standard, it is a defined non-terminal symbol.
… the typewriter style words and symbols are terminals given literally.
This is clearer in the original, where the fonts make it evident that the “typewriter” style is a fixed-width font.
This, with its preceding text, means that, in the grammar, words in italic, like declaration, are non-terminal symbols of the grammar that “become something else” by productions of the grammar, while words in the “typewriter” font, like typedef, are terminal symbols and they are symbols for the literal characters in their names; typedef in the grammar means the characters t, y, p, e, d, e, and f in the source code.
…to duplicate each production with an opt symbol, once with the symbol and once without…
The standard uses a subscript “opt”, like this:
P: Q Ropt
to mean that R is optional. Formally, this is actually two rules:
P → Q R
P → Q
meaning there is a rule that replaces P with Q R and a rule that replaces P with Q, and either may be used. Hence, the R is optional. Using the “opt” subscript is just a different way of writing the rules.
… this grammar is acceptable to the YACC parser-generator. It has only one conflict, generated by the if-else ambiguity.
In computer science theory, a parser is software that checks whether a string is in a language. (In practice, it often does other useful things at the same time, like assigning “meaning” to the string by interpreting parts of it as numbers, parts as variables, and so on.) YACC is software that reads a specification of a grammar and produces source code for a parser for the language the grammar describes.
The type of parser that YACC generates requires deciding which production will be applied as each token is read. A conflict occurs when the grammar is inadequate to determine which production to apply. In the C formal grammar, if (x) if (y) S1 else S2 can be produced in two ways, one that associates else S2 with the first if and one that associates else S2 with the second if. This has to be resolved by adding information to the grammar; YACC is told to associate the else S2 with the most recent if that does not currently have an else.
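To make that resolution concrete, the two possible readings can be spelled out with braces (S1 and S2 stand for statements, as above); the choice described in the previous paragraph corresponds to the first form:

if (x) if (y) S1 else S2      /* the ambiguous form                        */

if (x) { if (y) S1 else S2 }  /* reading chosen: else binds to the 2nd if  */

if (x) { if (y) S1 } else S2  /* the other reading, which is not chosen    */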
A "symbol" is an element of a grammar.
E.g. in the section you quoted:
translation-unit:
      external-declaration translation-unit external-declaration
All four are symbols.
A "production" explains what a symbol consists of. Your quote is a single production that states that a translation-unit consists of three things in order: external-declaration, then translation-unit, then external-declaration.
A "terminal" symbol is a symbol that doesn't have any other symbols in its definition.
E.g. your translation-unit is not terminal, but this one is:
digit: one of
      0 1 2 3 4 5 6 7 8 9
It's moot if the individual digits are terminals, or if the whole "digit" is terminal; probably the former.
Those are also "given literally", meaning the digits are actually spelled in the standard.
And here's a terminal not given literally:
basic-c-char:
      any member of the translation character set except the U+0027 apostrophe, U+005c reverse solidus, or new-line character
Notice how they didn't spell every possible character, since there are too many of them, and instead provided a plain text description.
An "undefined" terminal is just a terminal that wasn't explained/defined, making the grammar technically incomplete. If they call "integer-constant" an undefined terminal, it means their grammar fails to include integer-constant: blah blah.
to duplicate each production with an opt symbol, once with the symbol and once without
Consider:
foo:
      xopt   y
Here foo is either x or x y.
They're suggesting an alternative way to spell the same thing, without using "opt":
foo:
      y
      x y
It has only one conflict, generated by the if-else ambiguity.
Parsing a program is done by starting with a single symbol and figuring out which productions transform it into the text you're trying to parse.
If there is more than one way to do this (more than one parse tree for the same text), the grammar is "ambiguous".

How to handle ambiguity in syntax (like in C) in a Parsing Expression Grammar (like PEG.js)

So from my limited understanding, C has syntax ambiguity as seen in the expression:
T(*b)[4];
Here it is said about this sort of thing:
The well-known "typedef problem" with parsing C is that the standard C grammar is ambiguous unless the lexer distinguishes identifiers bound by typedef and other identifiers as two separate lexical classes. This means that the parser needs to feed scope information to the lexer during parsing. One upshot is that lexing must be done concurrently with parsing.
The problem is that it can be interpreted either as a multiplication or as a pointer declaration depending on context (I don't 100% understand the details of this since I'm not an expert in C, but I get the gist of it and why it's a problem).
typedef a;
b * a; // multiplication
a * b; // b is pointer to type a
What I'm wondering is if you were to parse C with a Parsing Expression Grammar (PEG) such as this C grammar, how does it handle this ambiguity? I assume this grammar is not 100% correct because of this problem, and so am wondering how you would go about fixing it. What does it need to keep track of or do differently to account for this?
The usual way this is handled in a PEG grammar is to use a semantic predicate on a rule such that the rule only matches when the predicate is true, and have the predicate check whether the name in question is a type in the current context or not. In the link you give, there's a rule
typedefName : Identifier
which is the (only) one that needs the semantic predicate to resolve this ambiguity. The predicate simply checks the Identifier in question against the definitions in the current scope. If it is not defined as a type, then it rejects this rule, so the next lower priority one will (try to) match.
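As a concrete illustration of what the predicate has to know, the same shape of token sequence means different things in these two functions, depending only on what is in scope (a plain C sketch):

typedef int T;

void f(void)
{
    T (*b)[4];   /* here T names a type, so this declares b as a
                    pointer to an array of 4 int                    */
}

void g(void)
{
    int T = 1, b = 2;
    T * b;       /* here T is an ordinary variable, so this is an
                    expression statement: multiply T by b and
                    discard the result                              */
}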

why is `int test {}` a function definition in C language BNF

I'm interested in the famous The syntax of C in Backus-Naur Form and have studied it for a while. What confuses me is that some syntax looks wrong to me but is considered right according to the BNF.
For example, int test {}: what is this? I think this is invalid syntax in C, but the truth is that the BNF considers it a function definition:
int -> type_const -> type_spec -> decl_specs
test-> id -> direct_declarator -> declarator
'{' '}' -> compound_stat
decl_specs declarator compound_stat -> function_definition
I tried this with bison, and it considered the input int test {} well-formed, but when I tried it with a C compiler, it would not compile.
So I have some questions:
Is int test {} valid syntax or not?
If it is valid syntax, what does it mean, and why does the compiler not recognize it?
If it is invalid syntax, can I say the BNF is not rigorous? And does that mean modern C compilers do not stick to this BNF?
The grammar is necessary but not sufficient to describe a valid C program. For that you need constraints from the standard too. A simpler example of this would be 0++, which follows the syntax of a C expression, but certainly isn't a valid program fragment...
C11 6.9.1p2:
The identifier declared in a function definition (which is the name of the function) shall have a function type, as specified by the declarator portion of the function definition. [162]
The footnote 162 explains that the intent of the constraint is that a typedef cannot be used, i.e. that
typedef int F(void);
F f { /* ... */ }
will not be valid, even though such a typedef could be used for a function declaration, i.e.
F f;
would declare the function
int f(void);
But the mere existence of this constraint also proves that the BNF grammar by itself is not sufficient in this case. Hence you are correct that the grammar alone would consider such a fragment a function definition.
The BNF form is a precise way to describe the syntax of a language, i.e. what to do precisely to get the parse tree starting from raw input.
For each language you can define infinite many grammars that describe that language. The properties of these grammars that describe the same language can differ a lot.
If you study the grammar of the C language, take care: it is not context-free but context-sensitive, which means that the decision to choose one rule or another depends on what surrounds that point in the input.
Read about the lexer hack to see how to correctly interpret the Backus-Naur form of the C grammar.
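As a rough sketch of what the lexer hack amounts to (every name here is hypothetical, not taken from any particular compiler): the lexer asks a symbol table, which the parser updates whenever it processes a typedef declaration, whether an identifier currently names a type, and returns a different token kind accordingly.

/* Hypothetical sketch of the "lexer hack". The query currently_names_a_type()
   is assumed to be backed by a symbol table that the parser maintains as it
   sees typedef declarations and scopes. */
enum token_kind { TOK_IDENTIFIER, TOK_TYPEDEF_NAME /* , ... other tokens */ };

extern int currently_names_a_type(const char *name);   /* parser-maintained */

enum token_kind classify_identifier(const char *name)
{
    return currently_names_a_type(name) ? TOK_TYPEDEF_NAME
                                        : TOK_IDENTIFIER;
}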

Grammar excluding repetition of qualifiers

I'm currently reading through Compiler Design in C. I'm not too familiar with the concept of grammars, but the first exercise asks that I write a grammar that recognizes a C variable declaration.
My question is, how would I prevent (in the grammar) the repetition of signed and unsigned? My understanding of productions (as the book teaches them) is that they have a single nonterminal on the left, pointing to up to two terminals/nonterminals. I'm just not sure how the grammar can be used to "see" whether another symbol has already been used.
My grammar thus far is:
Declaration -> Attributes Identifier
Attributes -> Prefix Type
Prefix ->
Is there not a more succinct way than "qualifier -> unsigned long | signed long | unsigned" etc.? It would get very long when you include all possible combinations, and even then it doesn't appear transferable as a prefix, since one can put qualifiers anywhere.
You can prevent repetition by coding combinatorially enumerative sets of grammar rules that disallow it. You can do this for very small sets of optional or unordered items like "signed" and "unsigned", sort of practically.
It isn't worth the trouble.
You could also consider writing your grammar to prevent a given variable declaration from occurring twice in the same scope. This is essentially the same problem, but it should be pretty clear that it is hopeless to do this by just giant sets of grammar rules.
The issue is that most languages are context-sensitive, e.g., what you can write in one place depends on what you wrote somewhere else, perhaps far away in the source text. (In your problem with signed and unsigned, "far away" is as small as one token.) But our grammar formalisms are mostly for context-free languages (and in fact, for many parser generators, even less than that, e.g. LL or LALR). So they simply are not expressive enough to describe all the constraints that a language places on the text you write.
There's a standard cure for this. You write a grammar that accepts "too much", and implement a procedural post-parsing pass that checks the additional constraints. Since the "procedural" part is typically implemented in a conventional programming language, it is Turing-powerful, and you can thus always code checks for the context-sensitivity, if it is possible to check.
To have a post-parsing pass, your parser has to remember what it saw. Thus you get the need for parse-capture, typically done with abstract syntax trees. Checking the validity of names in scopes requires you have symbol tables. Verifying that the parts of your Java program are always reachable requires at least simple flow analysis.
Bottom line: write your grammar simply. Don't worry about such repetition; it is easily checked post-parsing. Get past the parsing bit; it is the easiest part of the problem. Be prepared to add lots more machinery to make processing the grammar practical. Most people don't seem to understand this. (Check my bio for a discussion of "Life After Parsing").
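To make the post-parsing idea concrete, a check for repeated sign qualifiers could be as small as this sketch in C. It assumes the parser has already captured the declaration specifiers of one declaration as an array of strings; that representation and the function name are assumptions, not something a particular parser generator gives you:

#include <string.h>

/* Sketch of a post-parse constraint check: reject a captured specifier list
   that mentions "signed" or "unsigned" more than once. */
static int sign_qualifiers_ok(const char *specifiers[], int count)
{
    int signs = 0;
    for (int i = 0; i < count; i++)
        if (strcmp(specifiers[i], "signed") == 0 ||
            strcmp(specifiers[i], "unsigned") == 0)
            signs++;
    return signs <= 1;   /* zero or one sign keyword is fine, two is an error */
}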
You can encode the "state" of your expression by using a series of productions. Something like:
Expr -> "unsigned" AfterSignExpr
Expr -> "signed" AfterSignExpr
Expr -> AfterSignExpr
AfterSignExpr -> "char" AfterTypeExpr
AfterSignExpr -> "short" AfterTypeExpr
AfterSignExpr -> "int" AfterTypeExpr
AfterSignExpr -> "long" AfterTypeExpr
AfterTypeExpr -> Identifier "," AfterTypeExpr
AfterTypeExpr -> Identifier ";"
In this grammar, unsigned int x, y; is derived as Expr => "unsigned" AfterSignExpr => "unsigned" "int" AfterTypeExpr => "unsigned" "int" Identifier "," AfterTypeExpr => "unsigned" "int" Identifier "," Identifier ";", and a second signed or unsigned can never appear, because no production ever leads back to a point where one is allowed. A real C declaration is considerably more complicated, though, because it allows for modifiers all over the place, in different orders, plus function declarations, struct declarations, etc.
Spoiler alert: here is an actual C11 YACC grammar (and associated lexer). Have fun trying to wrap your head around that!

Is the Syntax of C Language completely defined by CFGs?

I think the question is self-sufficient. Is the syntax of the C language completely defined through context-free grammars, or do we have language constructs which may require non-context-free definitions in the course of parsing?
An example of a non-CFL construct I thought of was the requirement that variables be declared before their use. But in Compilers (Aho, Sethi, and Ullman), it is stated that the C language does not distinguish between identifiers on the basis of their names: all identifiers are tokenized as 'id' by the lexical analyzer.
If C is not completely defined by CFGs, can anyone please give an example of a non-CFL construct in C?
The problem is that you haven't defined "the syntax of C".
If you define it as the language C in the CS sense, meaning the set of all valid C programs, then C – as well as virtually every other language aside from Turing tarpits and Lisp – is not context-free. The reasons are not related to the problem of interpreting a C program (e.g. deciding whether a * b; is a multiplication or a declaration). Instead, it's simply because context-free grammars can't help you decide whether a given string is a valid C program. Even something as simple as int main() { return 0; } needs a more powerful mechanism than context-free grammars, as you have to (1) remember the return type and (2) check that whatever occurs after the return matches the return type. a * b; faces a similar problem – you don't need to know whether it's a multiplication, but if it is a multiplication, that must be a valid operation for the types of a and b. I'm not actually sure whether a context-sensitive grammar is enough for all of C, as some restrictions on valid C programs are quite subtle, even if you exclude undefined behaviour (some of which may even be undecidable).
Of course, the above notion is hardly useful. Generally, when talking grammars, we're only interested in a pretty good approximation of a valid program: We want a grammar that rules out as many strings which aren't C as possible without undue complexity in the grammar (for example, 1 a or (-)). Everything else is left to later phases of the compiler and called a semantic error or something similar to distinguish it from the first class of errors. These "approximate" grammars are almost always context free grammars (including in C's case), so if you want to call this approximation of the set of valid programs "syntax", C is indeed defined by a context free grammar. Many people do, so you'd be in good company.
The C language, as defined by the language standard, includes the preprocessor. The following is a syntactically correct C program:
#define START int main(
#define MIDDLE ){
START int argc, char** argv MIDDLE return 0; }
It seems to be really tempting to answer this question (which arises a lot) "sure, there is a CFG for C", based on extracting a subset of the grammar in the standard, which grammar in itself is ambiguous and recognizes a superset of the language. That CFG is interesting and even useful, but it is not C.
In fact, the productions in the standard do not even attempt to describe what a syntactically correct source file is. They describe:
The lexical structure of the source file (along with the lexical structure of valid tokens after pre-processing).
The grammar of individual preprocessor directives
A superset of the grammar of the language after preprocessing, which relies on some other mechanism to distinguish between typedef-name and other uses of identifier, as well as a mechanism to distinguish between constant-expression and other uses of conditional-expression.
There are many who argue that the issues in point 3 are "semantic", rather than "syntactic". However, the nature of C (and even more so its cousin C++) is that it is impossible to disentangle "semantics" from the parsing of a program. For example, the following is a syntactically correct C program:
#define base 7
#if base * 2 < 10
&one ?= two*}}
#endif
int main(void){ return 0; }
So if you really mean "is the syntax of the C language defined by a CFG", the answer must be no. If you meant, "Is there a CFG which defines the syntax of some language which represents strings which are an intermediate product of the translation of a program in the C language," it's possible that the answer is yes, although some would argue that the necessity to make precise what is a constant-expression and a typedef-name make the syntax necessarily context-sensitive, in a way that other languages are not.
Is the syntax of C Language completely defined through Context Free Grammars?
Yes it is. This is the grammar of C in BNF:
http://www.cs.man.ac.uk/~pjj/bnf/c_syntax.bnf
If you never see anything other than exactly one nonterminal symbol on the left-hand side of any rule, then the grammar is context-free. That is the very definition of a context-free grammar (Wikipedia):
In formal language theory, a context-free grammar (CFG) is a formal grammar in which every production rule is of the form
V → w
where V is a single nonterminal symbol, and w is a string of terminals and/or nonterminals (w can be empty).
Since ambiguity is mentioned by others, I would like to clarify a bit. Imagine the following grammar:
A -> B x | C x
B -> y
C -> y
This is an ambiguous grammar: for the input y x there are two distinct derivations, A => B x => y x and A => C x => y x. However, it is still a context-free grammar. Ambiguity and context-freeness are completely separate concepts.
Obviously, the semantic analysis of C is context-sensitive. This answer from the duplicate question has further explanations.
There are two things here:
The structure of the language (syntax): this is context free as you do not need to know the surroundings to figure out what is an identifier and what is a function.
The meaning of the program (semantics): this is not context free as you need to know whether an identifier has been declared and with what type when you are referring to it.
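A minimal illustration of that split: the fragment below satisfies the context-free syntax of C, but a compiler rejects it during semantic analysis because x was never declared.

int main(void)
{
    return x;   /* syntactically a well-formed return statement,
                   but semantically invalid: x is undeclared      */
}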
If by "the syntax of C" you mean all valid C strings that some C compiler accepts, after running the pre-processor, and ignoring typing errors, then this is the answer: yes, but not unambiguously.
First, you could assume the input program is tokenized according to the C standard. The grammar will describe relations among these tokens and not the bare characters. Such context-free grammars are found in books about C and in implementations that use parser generators. This tokenization is a big assumption, because quite some work goes into "lexing" C. So I would argue that we have not described C with a context-free grammar yet if we have not used context-free grammars to describe the lexical level as well. The staging between the tokenizer and the parser, combined with the ordering imposed by a scanner generator (prefer keywords, longest match, etc.), is a major increase in computational power which is not easily simulated in a context-free grammar.
So, if you do not assume a tokenizer which can, for example, distinguish type names from variable names using a symbol table, then a context-free grammar is going to be harder to write. However, the trick here is to accept ambiguity. We can describe the syntax of C, including its tokens, fully in a context-free grammar; the grammar will just be ambiguous and produce different interpretations for the same string. For example, for A *a; it will have derivations both for a multiplication and for a pointer declaration. No problem: the grammar is still describing the C syntax as you requested, just not unambiguously.
Notice that we have assumed the pre-processor has been run first as well; I believe your question was not about the code as it looks before pre-processing. Describing that using a context-free grammar would be sheer madness, since syntactic correctness depends on the semantics of expanding user-defined macros. Basically, the programmer extends the syntax of the C language every time a macro is defined. At CWI we did write context-free grammars for C given a set of known macro definitions to extend the C language, and that worked out fine, but that is not a general solution.

Resources