How do I construct a formal grammar like that? - theory

I have these two deterministic context-free grammars:
G1 and G2
with G1=(N1,T,P1,S1) and G2=(N2,T,P2,S2)
(note that both grammars share the same set of terminal symbols)
I need to construct a grammar G3 with L(G3)=L(G1){L(G2)}*
I think the crucial point here is the common set of terminals. But I don't know how to proceed...
Any help?

The common set of terminals is not really crucial but makes things a bit easier. Let us suppose that N1 and N2 have no symbols in common and that S and X are not in either of the sets. Then the grammar:
[N1+N2+{S,X}, T, P1+P2+{S->S1X, X->S2X, X->lambda}, S]
generates the language you want.
The new rules generate some string from S1(S2)*. From there on you can generate a word from L(G1) followed by any number of words from L(G2). The disjointness of the nonterminal alphabets guarantees that the two grammars do not interfere with each other's parts of the derivation.
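To make the construction concrete, here is a minimal Python sketch. The tuple representation of a grammar, the function name concat_star and the toy grammars G1 and G2 are all assumptions made for illustration; only the three added rules come from the answer above.

```python
def concat_star(g1, g2, start="S", loop="X"):
    """Build G3 with L(G3) = L(G1) L(G2)* from two grammars.

    A grammar is a tuple (nonterminals, terminals, productions, start);
    productions are (lhs, rhs) pairs where rhs is a tuple of symbols.
    """
    n1, t1, p1, s1 = g1
    n2, t2, p2, s2 = g2
    # The construction relies on disjoint nonterminal sets and fresh S, X.
    if n1 & n2:
        raise ValueError("rename nonterminals so that N1 and N2 are disjoint")
    if {start, loop} & (n1 | n2 | t1 | t2):
        raise ValueError("the start and loop symbols must be fresh")
    new_rules = [
        (start, (s1, loop)),  # S -> S1 X
        (loop, (s2, loop)),   # X -> S2 X
        (loop, ()),           # X -> lambda
    ]
    productions = list(p1) + list(p2) + new_rules
    return (n1 | n2 | {start, loop}, t1 | t2, productions, start)


# Hypothetical toy grammars: L(G1) = {a^n b^n : n >= 1}, L(G2) = {c}.
G1 = ({"A"}, {"a", "b", "c"}, [("A", ("a", "A", "b")), ("A", ("a", "b"))], "A")
G2 = ({"C"}, {"a", "b", "c"}, [("C", ("c",))], "C")
G3 = concat_star(G1, G2)
```

Taking the union of the terminal sets is why a shared T is convenient but not essential.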

Related

Is GLR algorithm a must when bison parsing C grammar?

I'm trying to study C grammar with flex/bison.
I found bison cannot parse this bison grammar: https://www.lysator.liu.se/c/ANSI-C-grammar-y.html, because LALR algorithm cannot process recursively multiple expressions.
Is GLR algorithm a must for C grammar?
There is nothing wrong with that grammar except:
it represents a very old version of C
it requires a lexical analyser which can somehow distinguish between IDENTIFIER and TYPE_NAME
it does not even attempt to handle the preprocessor phases
Also, it has one shift/reduce conflict as a result of the "dangling else" ambiguity. However, that conflict can be ignored because bison's conflict resolution algorithm produces the correct result in this case. (You can suppress the warning either with an %expect directive or by including a precedence declaration which favours shifting else over reducing if. Or you can eliminate the ambiguity in the grammar using the technique described in the Wikipedia page linked above. Note: I'm not talking about copy-and-pasting code from the Wikipedia page. In the case of C, you need to consider all cases of compound statements which terminate with an if statement.)
Moreover, an LR parser is not recursive, and it has no problems which could be described as a failure to "process recursively multiple expressions". (You might have that problem with a recursive descent parser, although it's pretty easy to work around the issue.)
So any problems you might have experienced (if your question refers to a concrete issue) have nothing to do with what's described in your question.
Of the problems I listed above, the most troubling is the syntactic ambiguity of the cast operator. The cast operator is not actually ambiguous; clearly, C compilers manage to compile such expressions correctly. But distinguishing between the two possible parses of, for example, (x)-y*z requires knowing whether x names a type or a variable.
In C, all names are lexically scoped, so it is certainly possible to resolve x at compile time. But the resolution is not context-free. Since GLR is also a technique for parsing context-free grammars, using a GLR parser won't directly help you. It might be useful in the sense that GLR parsers can theoretically produce "parse forests" rather than parse trees; that is, the output of a GLR parser might effectively contain all possible correct parses, leaving the possibility to resolve the ambiguity by building symbol tables for each scope and then choosing between alternative parses by examining the name binding in effect at each site. (This works because type alias declarations -- "typedefs" -- are not ambiguous, so all the potential parses will have the same alias declarations.)
The usual solution, though, is to parse the program text using a deterministic parser, maintaining a symbol table during the parse, and giving the lexical analyser access to this symbol table so that it can distinguish between IDENTIFIER and TYPE_NAME, as expected by the grammar you link. This technique is politely called "lexical feedback", although it's also often called "the lexer hack".
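A rough Python sketch of lexical feedback follows. It is not how bison or any real C front end is written: the token kinds, the tiny keyword set and the rule "the last identifier before ; in a typedef is the new type name" are simplifying assumptions, and scopes are ignored entirely. The point is only that the lexer consults a table the parser keeps updating.

```python
import re

TOKEN_RE = re.compile(r"(?P<word>[A-Za-z_]\w*)|(?P<punct>\S)")
KEYWORDS = {"typedef", "int", "char"}  # deliberately tiny, for illustration


def lex(source, type_names):
    """Lazily yield (kind, text); type_names is read when each token is produced."""
    for match in TOKEN_RE.finditer(source):
        text = match.group()
        if match.lastgroup == "punct":
            yield ("PUNCT", text)
        elif text in KEYWORDS:
            yield ("KEYWORD", text)
        elif text in type_names:
            yield ("TYPE_NAME", text)
        else:
            yield ("IDENTIFIER", text)


def scan_with_feedback(source):
    """Stand-in for a parser: records typedef names while pulling tokens."""
    type_names = set()
    tokens = []
    in_typedef, last_ident = False, None
    for kind, text in lex(source, type_names):
        tokens.append((kind, text))
        if (kind, text) == ("KEYWORD", "typedef"):
            in_typedef = True
        elif in_typedef and kind == "IDENTIFIER":
            last_ident = text
        elif in_typedef and text == ";":
            if last_ident is not None:
                type_names.add(last_ident)  # feedback: later tokens see this
            in_typedef, last_ident = False, None
    return tokens


tokens = scan_with_feedback("typedef int T; T * x; y * x;")
```

Because the lexer is a generator, T is classified as IDENTIFIER inside the typedef and as TYPE_NAME afterwards, which is exactly the distinction the grammar expects.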

How to handle ambiguity in syntax (like in C) in a Parsing Expression Grammar (like PEG.js)

So from my limited understanding, C has syntax ambiguity as seen in the expression:
T(*b)[4];
Here it is said about this sort of thing:
The well-known "typedef problem" with parsing C is that the standard C grammar is ambiguous unless the lexer distinguishes identifiers bound by typedef and other identifiers as two separate lexical classes. This means that the parser needs to feed scope information to the lexer during parsing. One upshot is that lexing must be done concurrently with parsing.
The problem is it can be interpreted as either multiplication or as a pointer depending on context (I don't 100% understand the details of this since I'm not expert in C, but I get the gist of it and why it's a problem).
typedef a;
b * a; // multiplication
a * b; // b is pointer to type a
What I'm wondering is if you were to parse C with a Parsing Expression Grammar (PEG) such as this C grammar, how does it handle this ambiguity? I assume this grammar is not 100% correct because of this problem, and so am wondering how you would go about fixing it. What does it need to keep track of or do differently to account for this?
The usual way this is handled in a PEG grammar is to use a semantic predicate on a rule such that the rule only matches when the predicate is true, and have the predicate check whether the name in question is a type in the current context or not. In the link you give, there's a rule
typedefName : Identifier
which is the (only) one that needs the semantic predicate to resolve this ambiguity. The predicate simply checks the Identifier in question against the definitions in the current scope. If it is not defined as a type, then it rejects this rule, so the next lower priority one will (try to) match.
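To show the idea without tying it to PEG.js syntax, here is a hand-written Python sketch of an ordered choice with a semantic predicate. The function names, the pre-split token list and the scope_types set are assumptions for illustration; a real grammar would attach the predicate to the typedefName rule itself.

```python
def typedef_name(tokens, pos, scope_types):
    """typedefName : Identifier, guarded by a semantic predicate."""
    if pos < len(tokens) and tokens[pos].isidentifier() and tokens[pos] in scope_types:
        return pos + 1
    return None  # predicate failed, so this alternative is rejected


def identifier(tokens, pos):
    if pos is not None and pos < len(tokens) and tokens[pos].isidentifier():
        return pos + 1
    return None


def expect(tokens, pos, text):
    if pos is not None and pos < len(tokens) and tokens[pos] == text:
        return pos + 1
    return None


def classify_statement(tokens, scope_types):
    """Ordered choice: pointer declaration first, then multiplication."""
    pos = typedef_name(tokens, 0, scope_types)
    pos = expect(tokens, pos, "*")
    pos = identifier(tokens, pos)
    if expect(tokens, pos, ";") == len(tokens):
        return "declaration"
    # The first alternative failed: backtrack and try the next one.
    pos = identifier(tokens, 0)
    pos = expect(tokens, pos, "*")
    pos = identifier(tokens, pos)
    if expect(tokens, pos, ";") == len(tokens):
        return "multiplication"
    return None
```

With a recorded as a type, classify_statement(["a", "*", "b", ";"], {"a"}) takes the declaration branch; with an empty scope the same tokens fall through to the multiplication branch.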

Does there exist an LR(k) grammar with no LL(1) equivalent

I haven't been able to find an answer to this yet. Are there grammars that are context free and non-ambiguous that can't be converted to LL(1)?
I found one production that I couldn't figure out how to convert into LL(1): the parameter-type-list production in C99:
parameter-type-list:
parameter-list
parameter-list , ...
Is this an example of an LR(k) grammar that doesn't have an LL(1) equivalent or am I doing something wrong?
edit: I copied the wrong name, I meant to copy parameter-declaration:
parameter-declaration:
declaration-specifiers declarator
declaration-specifiers abstract-declarator(opt)
the problem is with declarator and abstract declarator both having ( in their first set, but also being left recursive.
In general, LR(k) grammars are more powerful than LL(k) grammars. That means there are languages that have an LR(k) parser but no LL(k) parser.
One example is the language defined by the grammar:
S -> a S
S -> P
P -> a P b
P -> \epsilon
Or, in other words, a string of a's followed by the same number of b's or fewer. This follows from the fact that an LL(k) parser must decide, for every a it encounters, whether that a is paired with some b, while looking ahead no more than k symbols of input; those symbols can themselves all be a's, giving no useful information.
For a strict proof, look at the second part of the accepted answer here: https://cs.stackexchange.com/questions/3350/is-this-language-ll1-parseable
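A small Python sketch may make the argument more tangible. It is not an LL or LR parser, just a whole-string check with a hypothetical helper name, but it shows that the rule each a must use (S -> a S for unpaired ones, P -> a P b for paired ones) depends on the number of b's, which can be arbitrarily far ahead.

```python
def split_as(word):
    """Return (unpaired, paired) counts of a's, or None if word is not in the language."""
    n = len(word) - len(word.lstrip("a"))  # number of leading a's
    m = len(word) - n                      # everything after them must be b's
    if word[n:] != "b" * m or m > n:
        return None
    # The first n - m a's come from S -> a S, the last m from P -> a P b.
    return (n - m, m)
```

For example, split_as("aaab") gives (2, 1): the first two a's are unpaired, and that is only known once the single b has been counted.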
Your example, however, can simply be left-factored into an LL(1) grammar:
parameter-type-list -> parameter-list optional-ellipsis
optional-ellipsis -> \epsilon
optional-ellipsis -> , ...
Note that the FOLLOW set for parameter-list will contain the , character, and this can cause a FIRST/FOLLOW conflict. If that is the case, we need to see the parameter-list definition to fix this conflict too.
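As a sketch of how that conflict can be resolved, assume parameter-list is a comma-separated list of parameter declarations (the question does not show its definition, so this is an assumption). Folding the comma into the list tail lets one token of lookahead decide everything: after a comma, the next token is either ... or the start of another parameter. In the Python below each parameter declaration is stood in for by the placeholder token "param".

```python
def parse_parameter_type_list(tokens):
    """Return (parameter_count, has_ellipsis); raise SyntaxError on bad input.

    Refactored grammar (an assumption, not the C99 text):
        parameter-type-list -> param tail
        tail                -> epsilon | ',' after-comma
        after-comma         -> '...' | param tail
    """
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def param():
        nonlocal pos
        if peek() != "param":
            raise SyntaxError(f"expected a parameter at token {pos}")
        pos += 1

    param()
    count, has_ellipsis = 1, False
    while peek() == ",":
        pos += 1                 # consume ',' and decide on the next token only
        if peek() == "...":
            pos += 1
            has_ellipsis = True
            break                # nothing may follow the ellipsis
        param()
        count += 1
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token at position {pos}")
    return count, has_ellipsis
```

For instance, parse_parameter_type_list(["param", ",", "param", ",", "..."]) returns (2, True).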
Edit: the parameter-declaration rule seems too complicated to answer right away. You can try to left-factor all conflicting alternatives by hand, or with the help of a tool such as ANTLR.

Grammar excluding repetition of qualifiers

I'm currently reading through Compiler Design in C. I'm not too familiar with the concept of grammars, but the first exercise asks that I write a grammar that recognizes a C variable declaration.
My question is, how would I prevent (in the grammar) the repetition of signed and unsigned? As I understand productions (as the book teaches them), each has a single nonterminal on the left, pointing to up to two terminals/nonterminals. I'm just not sure how the language can be used to "see" if another symbol has already been used.
My grammar thus far is:
Declaration -> Attributes Identifier
Attributes -> Prefix Type
Prefix ->
Is there not a more succinct way than "qualifier > unsigned long | signed long | unsigned " etc.? It would get very long when you include all possible combinations, and even then doesn't appear transferable as a prefix, since one can put qualifiers anywhere.
You can prevent repetition by coding combinatorially enumerative sets of grammar rules that disallow it. You can do this for very small sets of optional or unordered items like "signed" and "unsigned", sort of practically.
It isn't worth the trouble.
You could also consider writing your grammar to prevent a given variable declaration from occurring twice in the same scope. This is essentially the same problem, but it should be pretty clear that it is hopeless to do this by just giant sets of grammar rules.
The issue is that most languages are context-sensitive; that is, what you can write in one place depends on what you wrote somewhere else, perhaps far away in the source text. (Your problem with signed and unsigned has "far away" being as small as 1 token.) But our grammar formalisms are mostly for context-free languages (and in fact, for many parser generators, even less than that, e.g. LL or LALR). So they simply are not expressive enough to describe all the constraints that a language places on the text you write.
There's a standard cure for this. You write a grammar that accepts "too much", and implement a procedural post-parsing pass that checks the additional constraints. Since the "procedural" part is typically implemented in a conventional programming language, it is Turing-powerful and you can thus always code checks for the context-sensitivity, if it is possible to check.
To have a post-parsing pass, your parser has to remember what it saw. Thus you get the need for parse-capture, typically done with abstract syntax trees. Checking the validity of names in scopes requires you have symbol tables. Verifying that the parts of your Java program are always reachable requires at least simple flow analysis.
Bottom line: write your grammar simply. Don't worry about such repetition; it is easily checked post-parsing. Get past the parsing bit; it is the easiest part of the problem. Be prepared to add lots more machinery to make processing the grammar practical. Most people don't seem to understand this. (Check my bio for a discussion of "Life After Parsing").
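As a rough illustration of such a post-parsing check (a sketch, not taken from the book): suppose the grammar simply accepts any sequence of specifier keywords and the parser hands them over as a list. The function name and the error messages below are made up for the example.

```python
from collections import Counter


def check_specifiers(specifiers):
    """Return error messages for a parsed list of declaration specifiers."""
    counts = Counter(specifiers)
    errors = []
    # Repetition is trivial to detect once the parser has collected the words.
    for word in ("signed", "unsigned"):
        if counts[word] > 1:
            errors.append(f"duplicate '{word}'")
    # A constraint between two different words is just as easy.
    if counts["signed"] and counts["unsigned"]:
        errors.append("both 'signed' and 'unsigned'")
    return errors
```

check_specifiers(["unsigned", "long", "int"]) returns an empty list, while a repeated or conflicting sign keyword produces a message, in whatever order the keywords appeared.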
You can encode the "state" of your expression by using a series of productions. Something like:
Expr -> "unsigned" AfterSignExpr
Expr -> "signed" AfterSignExpr
Expr -> AfterSignExpr
AfterSignExpr -> "char" AfterTypeExpr
AfterSignExpr -> "short" AfterTypeExpr
AfterSignExpr -> "int" AfterTypeExpr
AfterSignExpr -> "long" AfterTypeExpr
AfterTypeExpr -> Identifier "," AfterTypeExpr
AfterTypeExpr -> Identifier ";"
Although a C declaration is considerably more complicated because it allows for modifiers all over the place, in different orders, function declarations, struct declarations, etc.
Spoiler alert: here is an actual C11 YACC grammar (and associated lexer). Have fun trying to wrap your head around that!

Is the Syntax of C Language completely defined by CFGs?

I think the Question is self sufficient. Is the syntax of C Language completely defined through Context Free Grammars or do we have Language Constructs which may require non-Context Free definitions in the course of parsing?
An example of a non-CFL construct, I thought, was the declaration of variables before their use. But in Compilers (Aho, Ullman, Sethi), it is stated that the C language does not distinguish between identifiers on the basis of their names. All the identifiers are tokenized as 'id' by the lexical analyzer.
If C is not completely defined by CFGs, please can anyone give an example of Non CFL construct in C?
The problem is that you haven't defined "the syntax of C".
If you define it as the language C in the CS sense, meaning the set of all valid C programs, then C – as well as virtually every other language aside from Turing tarpits and Lisp – is not context free. The reasons are not related to the problem of interpreting a C program (e.g. deciding whether a * b; is a multiplication or a declaration). Instead, it's simply because context free grammars can't help you decide whether a given string is a valid C program. Even something as simple as int main() { return 0; } needs a more powerful mechanism than context free grammars, as you have to (1) remember the return type and (2) check that whatever occurs after the return matches the return type. a * b; faces a similar problem – you don't need to know whether it's a multiplication, but if it is a multiplication, that must be a valid operation for the types of a and b. I'm not actually sure whether a context sensitive grammar is enough for all of C, as some restrictions on valid C programs are quite subtle, even if you exclude undefined behaviour (some of which may even be undecidable).
Of course, the above notion is hardly useful. Generally, when talking grammars, we're only interested in a pretty good approximation of a valid program: We want a grammar that rules out as many strings which aren't C as possible without undue complexity in the grammar (for example, 1 a or (-)). Everything else is left to later phases of the compiler and called a semantic error or something similar to distinguish it from the first class of errors. These "approximate" grammars are almost always context free grammars (including in C's case), so if you want to call this approximation of the set of valid programs "syntax", C is indeed defined by a context free grammar. Many people do, so you'd be in good company.
The C language, as defined by the language standard, includes the preprocessor. The following is a syntactically correct C program:
#define START int main(
#define MIDDLE ){
START int argc, char** argv MIDDLE return 0; }
It seems to be really tempting to answer this question (which arises a lot) "sure, there is a CFG for C", based on extracting a subset of the grammar in the standard, which grammar in itself is ambiguous and recognizes a superset of the language. That CFG is interesting and even useful, but it is not C.
In fact, the productions in the standard do not even attempt to describe what a syntactically correct source file is. They describe:
1. The lexical structure of the source file (along with the lexical structure of valid tokens after pre-processing).
2. The grammar of individual preprocessor directives.
3. A superset of the grammar of the post-processed language, which relies on some other mechanism to distinguish between typedef-name and other uses of identifier, as well as a mechanism to distinguish between constant-expression and other uses of conditional-expression.
There are many who argue that the issues in point 3 are "semantic", rather than "syntactic". However, the nature of C (and even more so its cousin C++) is that it is impossible to disentangle "semantics" from the parsing of a program. For example, the following is a syntactically correct C program:
#define base 7
#if base * 2 < 10
&one ?= two*}}
#endif
int main(void){ return 0; }
So if you really mean "is the syntax of the C language defined by a CFG", the answer must be no. If you meant, "Is there a CFG which defines the syntax of some language which represents strings which are an intermediate product of the translation of a program in the C language," it's possible that the answer is yes, although some would argue that the necessity to make precise what is a constant-expression and a typedef-name make the syntax necessarily context-sensitive, in a way that other languages are not.
Is the syntax of C Language completely defined through Context Free Grammars?
Yes it is. This is the grammar of C in BNF:
http://www.cs.man.ac.uk/~pjj/bnf/c_syntax.bnf
If every rule has exactly one symbol on its left-hand side, then the grammar is context free. That is the very definition of context free grammars (Wikipedia):
In formal language theory, a context-free grammar (CFG) is a formal grammar in which every production rule is of the form
V → w
where V is a single nonterminal symbol, and w is a string of terminals and/or nonterminals (w can be empty).
Since ambiguity is mentioned by others, I would like to clarify a bit. Imagine the following grammar:
A -> B x | C x
B -> y
C -> y
This is an ambiguous grammar. However, it is still a context free grammar. These two are completely separate concepts.
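To see the ambiguity concretely, the following Python sketch counts the parse trees a grammar assigns to a token sequence. The RULES table encodes the three rules above; the counting function is a naive illustration that assumes the grammar has no left recursion and no empty or cyclic rules.

```python
from functools import lru_cache

# A -> B x | C x ; B -> y ; C -> y
RULES = {
    "A": [("B", "x"), ("C", "x")],
    "B": [("y",)],
    "C": [("y",)],
}


def count_parses(symbol, tokens):
    """Count the distinct parse trees deriving tokens from symbol."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def derive(sym, i, j):
        if sym not in RULES:  # terminal: must match exactly one token
            return 1 if j == i + 1 and tokens[i] == sym else 0
        return sum(sequence(rhs, i, j) for rhs in RULES[sym])

    @lru_cache(maxsize=None)
    def sequence(rhs, i, j):
        if not rhs:
            return 1 if i == j else 0
        head, rest = rhs[0], rhs[1:]
        # Try every split point between the first symbol and the rest.
        return sum(derive(head, i, k) * sequence(rest, k, j)
                   for k in range(i, j + 1))

    return derive(symbol, 0, len(tokens))
```

count_parses("A", ["y", "x"]) returns 2, one tree through B and one through C, even though every rule has a single nonterminal on the left.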
Obviously, the semantics analyzer of C is context sensitive. This answer from the duplicate question has further explanations.
There are two things here:
The structure of the language (syntax): this is context free as you do not need to know the surroundings to figure out what is an identifier and what is a function.
The meaning of the program (semantics): this is not context free as you need to know whether an identifier has been declared and with what type when you are referring to it.
If you mean by the "syntax of C" all valid C strings that some C compiler accepts, and after running the pre-processor, but ignoring typing errors, then this is the answer: yes but not unambiguously.
First, you could assume the input program is tokenized according to the C standard. The grammar will describe relations among these tokens and not the bare characters. Such context-free grammars are found in books about C and in implementations that use parser generators. This tokenization is a big assumption because quite some work goes into "lexing" C. So, I would argue that we have not described C with a context-free grammar yet, if we have not used context-free grammars to describe the lexical level. The staging between the tokenizer and the parser, combined with the ordering imposed by a scanner generator (prefer keywords, longest match, etc.), is a major increase in computational power which is not easily simulated in a context-free grammar.
So, if you do not assume a tokenizer which, for example, can distinguish type names from variable names using a symbol table, then a context-free grammar is going to be harder. However, the trick here is to accept ambiguity. We can describe the syntax of C, including its tokens, fully in a context-free grammar. The grammar will just be ambiguous and produce different interpretations for the same string. For example, for A *a; it will have derivations for both a multiplication and a pointer declaration. No problem, the grammar is still describing the C syntax as you requested, just not unambiguously.
Notice that we have assumed having run the pre-processor first as well, I believe your question was not about the code as it looks before pre-processing. Describing that using a context-free grammar would be just madness since syntactic correctness depends on the semantics of expanding user-defined macros. Basically, the programmer is extending the syntax of the C language every time a macro is defined. At CWI we did write context-free grammars for C given a set of known macro definitions to extend the C language and that worked out fine, but that is not a general solution.
