I have the following fragment in my Bison file that describes a simple "while" loop as a condition followed by a sequence of statements. The list of statements is large and includes BREAK and CONTINUE. The latter two can be used only within a loop.
%start statements
%%
statements: | statement statements
statement: loop | BREAK | CONTINUE | WRITE | ...
loop: WHILE condition statements ENDWHILE
condition: ...
%%
I can add a C variable, set it upon entering the loop, reset it upon exiting, and check at BREAK or CONTINUE, but this solution does not look elegant:
loop: WHILE {loop++;} condition statements {loop--;} ENDWHILE
statement: loop | BREAK {if (!loop) yyerror();} ...
Is there a way to prevent the two statements from outside a loop using only Bison rules?
P.S. What I mean is "Is there an EASY way...", without fully duplicating the grammar.
Sure. You just need three different statement non-terminals, one which matches all statements; one which matches everything but continue (for switch blocks), and one which matches everything but break and continue. Of course, this distinction needs to trickle down through your rules. You'll also need three versions of each type of compound statement: loops, conditionals, switch, braced blocks, and so on. Oh, and don't forget that statements can be labelled, so there are some more non-terminals to duplicate.
But yeah, it can certainly be done. The question is, is it worth going to all that trouble. Or, to put it another way, what do you get out of it?
To start with, the end user finds that where they used to have a pretty informative error message about continue statements outside a loop, they now just get a generic Syntax Error. Now, you can fix that with some more grammar modifications, by actually providing productions which match the invalid statements and then presenting a meaningful error message. But that's almost exactly the same code already rejected as inelegant.
Other than that, does it in any way reduce parser complexity? It lets you assume that a break statement is legally placed, but you still have to figure out where the break statement's destination is. And other than that, there aren't really a lot of evident advantages, IMHO.
But if you want to do it, go for it.
Once you've done that, you could try modifying your grammar so that break, continue, goto and return cannot be followed by an unlabelled statement. That sounds like a good idea, and some languages do it. It can certainly be done in the grammar. (But before you get too enthusiastic, remember that some programmers do deliberately create dead code during debugging sessions, and they won't thank you for making it impossible.)
There is a BNF extension, used in the ECMAScript standard, amongst others, which parameterizes non-terminals with a list of features, each of which can be present or not. These parameters can then be used in productions, either as conditions or to be passed through to non-terminals on the right-hand side. This could be used to generate three versions of statement, using the features [continue] and [break], which would be used as gates on those respective statement syntaxes, and also passed through to the compound statement non-terminals.
I don't know of a parser generator capable of handling such parameterised rules, so I can't offer it as a concrete suggestion, but this question is one of the use cases which motivated parameterised non-terminals. (In fact, I believe it's one of the uses, but I might be remembering that wrong.)
With an ECMAScript-style formalism, this grammatical restriction could be written without duplicating rules. The duplication would still be there, under the surface, since the parser generator would have to macro expand the templated rules into their various boolean possibilities. But the grammar is a lot more readable and the size of the state machine is not so important these days.
I have no doubt that it would be a useful feature, but I also suspect that it would be overused, with the result that the quality of error messages would be reduced.
As a general rule, compilers should be optimised for correct inputs, with the additional goal of producing helpful error messages for invalid input. Complicating the grammar even a little to make easily described errors into syntax errors does not help with either of these goals. If it's possible to write a few lines of code to produce the correct error message for a detected problem, instead of emitting a generic syntax error, I would definitely do that.
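The "few lines of code" approach can be sketched in C as follows. This is only an illustration, assuming the checks live in helper functions called from the semantic actions; the names enter_loop, leave_loop, check_jump and error_count are hypothetical and not part of Bison.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helpers for Bison semantic actions; none of these names
   come from Bison itself. */
static int loop_depth = 0;   /* number of loops enclosing the current point */
static int error_count = 0;  /* number of misplaced jump statements seen */

void enter_loop(void)
{
    loop_depth++;
}

void leave_loop(void)
{
    assert(loop_depth > 0);  /* every leave must match an earlier enter */
    loop_depth--;
}

/* Returns 1 if a break/continue is legal here; otherwise reports a
   specific error message and returns 0. */
int check_jump(const char *keyword, int line)
{
    if (loop_depth > 0)
        return 1;
    fprintf(stderr, "line %d: '%s' statement outside of a loop\n", line, keyword);
    error_count++;
    return 0;
}
```

In the grammar from the question, the loop rule would call enter_loop() in a mid-rule action after WHILE and leave_loop() before ENDWHILE, and the BREAK and CONTINUE alternatives would call check_jump() with whatever line information the scanner provides. The user then gets a message naming the offending statement instead of a generic syntax error.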
There are (many) other use cases for the ECMAScript BNF extensions. For example they make it much easier to describe a syntax whose naive grammar requires two or three lookahead tokens.
I understand that, in principle, modern programming languages are intended to be used in a manner where the code written is self-documenting.
However, I was taught that on occasion it is necessary to explicitly write a brief precondition, postcondition statement for a function to assert generality. If I had a need to mention a variable by name in the comment is there a standard for denoting that it's a variable?
Please use doxygen for C; it is the de facto standard and worth the effort.
https://www.doxygen.nl/manual/commands.html
\pre { description of the precondition }
Starts a paragraph where the
precondition of an entity can be described. The paragraph will be
indented. The text of the paragraph has no special internal structure.
All visual enhancement commands may be used inside the paragraph.
Multiple adjacent \pre commands will be joined into a single
paragraph. Each precondition will start on a new line. Alternatively,
one \pre command may mention several preconditions. The \pre command
ends when a blank line or some other sectioning command is
encountered.
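As a small illustration of how this looks in practice, here is a hedged sketch of a documented C function. The function mean and its parameters are made up for the example. The \p command is doxygen's markup for referring to a parameter by name, which addresses the "how do I denote a variable" part of the question.

```c
#include <assert.h>
#include <stddef.h>

/**
 * \brief Computes the arithmetic mean of an array.
 *
 * \param values  Pointer to the first element.
 * \param count   Number of elements in \p values.
 * \return The mean of the first \p count elements.
 *
 * \pre \p values is not NULL and \p count is greater than zero.
 * \post The array pointed to by \p values is unchanged.
 */
double mean(const double *values, size_t count)
{
    assert(values != NULL && count > 0);  /* enforce the documented precondition */
    double sum = 0.0;
    for (size_t i = 0; i < count; i++)
        sum += values[i];
    return sum / (double)count;
}
```

The assert is optional; it simply turns the documented precondition into a runtime check in debug builds.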
I am reading The C Programming Language (K&R) and noticed C allows the use of while loops preceding a single statement to function without any braces; why did the creators of C decide to support this? I presume this introduces some extra complexity for the compiler, is the desire for single statement while loops so common (for readability, perhaps?) it was worth whatever trade-off was required to allow them?
It doesn't add any special complexity to the compiler, and it's not just while loops. All of the control structures (if, for, while, etc.) govern a "statement", where a block is just a special case of a statement (called a "compound-statement") containing 0 or more declarations or statements. There isn't any specific use case or rationale for applying this rule to while, but none is really needed, other than maybe simplicity or consistency.
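To make that concrete, here is a minimal sketch (the function names are invented for the example) showing the same loop with three different kinds of statement as its body. In each case the grammar sees "while ( expression ) statement".

```c
#include <assert.h>

/* Body is a single expression statement. */
unsigned sum_to(unsigned n)
{
    unsigned total = 0;
    while (n > 0)
        total += n--;
    return total;
}

/* Body is a compound statement (a block), which is just another statement. */
unsigned sum_to_block(unsigned n)
{
    unsigned total = 0;
    while (n > 0) {
        total += n;
        n--;
    }
    return total;
}

/* Body is an if statement, again with no braces required. */
unsigned count_even(unsigned n)
{
    unsigned count = 0;
    while (n > 0)
        if (n-- % 2 == 0)
            count++;
    return count;
}

/* Sanity check: the braced and unbraced spellings agree. */
void check_equivalence(unsigned n)
{
    assert(sum_to(n) == sum_to_block(n));
}
```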
I'm trying to study C grammar with flex/bison.
I found that bison cannot parse this grammar: https://www.lysator.liu.se/c/ANSI-C-grammar-y.html, because the LALR algorithm cannot recursively process multiple expressions.
Is the GLR algorithm a must for a C grammar?
There is nothing wrong with that grammar except:
it represents a very old version of C
it requires a lexical analyser which can somehow distinguish between IDENTIFIER and TYPE_NAME
it does not even attempt to handle the preprocessor phases
Also, it has one shift/reduce conflict as a result of the "dangling else" ambiguity. However, that conflict can be ignored because bison's conflict resolution algorithm produces the correct result in this case. (You can suppress the warning either with an %expect directive or by including a precedence declaration which favours shifting else over reducing if. Or you can eliminate the ambiguity in the grammar using the technique described in the Wikipedia page on the dangling else. Note: I'm not talking about copy-and-pasting code from the Wikipedia page. In the case of C, you need to consider all cases of compound statements which terminate with an if statement.)
Moreover, an LR parser is not recursive, and it has no problems which could be described as a failure to "process recursively multiple expressions". (You might have that problem with a recursive descent parser, although it's pretty easy to work around the issue.)
So any problems you might have experienced (if your question refers to a concrete issue) have nothing to do with what's described in your question.
Of the problems I listed above, the most troubling is the syntactic ambiguity of the cast operator. The cast operator is not actually ambiguous; clearly, C compilers manage to compile such expressions correctly. But distinguishing between the two possible parses of, for example, (x)-y*z requires knowing whether x names a type or a variable.
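The two parses of (x)-y*z can be seen side by side in the following sketch. The function names are invented for the illustration; the only difference between the two bodies is what x is declared to be.

```c
#include <assert.h>

/* Here x names a type, so (x)-y*z is a cast applied to -y,
   and the result is multiplied by z: ((x)(-y)) * z. */
int x_is_type(int y, int z)
{
    typedef int x;
    return (x)-y*z;
}

/* Here x names a variable, so the same tokens form a
   subtraction: x - (y * z). */
int x_is_variable(int x, int y, int z)
{
    return (x)-y*z;
}

/* The same token sequence yields different values under the two parses. */
void check_parses(void)
{
    assert(x_is_type(2, 3) == -6);          /* (int)(-2) * 3 */
    assert(x_is_variable(10, 2, 3) == 4);   /* 10 - (2 * 3)  */
}
```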
In C, all names are lexically scoped, so it is certainly possible to resolve x at compile time. But the resolution is not context-free. Since GLR is also a technique for parsing context-free grammars, using a GLR parser won't directly help you. It might be useful in the sense that GLR parsers can theoretically produce "parse forests" rather than parse trees; that is, the output of a GLR parser might effectively contain all possible correct parses, leaving the possibility to resolve the ambiguity by building symbol tables for each scope and then choosing between alternative parses by examining the name binding in effect at each site. (This works because type alias declarations -- "typedefs" -- are not ambiguous, so all the potential parses will have the same alias declarations.)
The usual solution, though, is to parse the program text using a deterministic parser, maintaining a symbol table during the parse, and giving the lexical analyser access to this symbol table so that it can distinguish between IDENTIFIER and TYPE_NAME, as expected by the grammar you link. This technique is politely called "lexical feedback", although it's also often called "the lexer hack".
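A very reduced sketch of lexical feedback follows. It assumes a single flat scope and a fixed-size table, which a real C front end could not get away with (it needs a scope stack so that inner declarations can hide typedef names). The enum stands in for the token codes a generated parser header would supply, and declare_typedef and classify_name are hypothetical names.

```c
#include <assert.h>
#include <string.h>

/* Stand-ins for the token codes used by the grammar. */
enum token { IDENTIFIER, TYPE_NAME };

#define MAX_TYPEDEFS 64

/* Flat table of typedef names; stores the caller's pointers, so the
   strings must outlive the table (an assumption of this sketch). */
static const char *typedef_names[MAX_TYPEDEFS];
static int typedef_count = 0;

/* Called by the parser when it finishes a typedef declaration. */
void declare_typedef(const char *name)
{
    assert(typedef_count < MAX_TYPEDEFS);
    typedef_names[typedef_count++] = name;
}

/* Called by the lexer for every identifier-shaped lexeme. */
enum token classify_name(const char *name)
{
    for (int i = 0; i < typedef_count; i++)
        if (strcmp(typedef_names[i], name) == 0)
            return TYPE_NAME;
    return IDENTIFIER;
}
```

The important point is the direction of the data flow: the parser writes to the table while it parses, and the lexer reads from it before handing over the next token.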
I want to run simple analysis on C files (such as: if you call the foo macro with INT_TYPE as argument, then cast the response to int*). I do not want to preprocess the file, I just want to parse it (so that, for instance, I'll have correct line numbers).
I.e., I want to get from
#include <a.h>
#define FOO(f)
int f() {FOO(1);}
a list of tokens like
<include_directive value="a.h"/>
<macro name="FOO"><param name="f"/><result/></macro>
<function name="f">
<return>int</return>
<body>
<macro_call name="FOO"><param>1</param></macro_call>
</body>
</function>
with no need to set include path, etc.
Is there any preexisting parser that does it? All parsers I know assume C is preprocessed. I want to have access to the macros and actual include instructions.
Our C Front End can parse code containing preprocessor elements to a fair extent and still build a usable AST. (Yes, the parse tree has precise file/line/column number information.)
There are a number of restrictions, but within them it handles most code. In the few cases it cannot handle, a small, easy change to the source file that gives equivalent code often solves the problem.
Here's a rough set of rules and restrictions:
#includes and #defines can occur wherever a declaration or statement can occur, but not in the middle of a statement. These rarely cause a problem.
macro calls can occur where function calls occur in expressions, or can appear without semicolon in place of statements. Macro calls that span non-well-formed chunks are not handled well (anybody surprised?). The latter occur occasionally but not rarely and need manual revision. OP's example of "j(v,oid)*" is problematic, but this is really rare in code.
#if ... #endif must be wrapped around major language concepts (nonterminals) (constant, expression, statement, declaration, function) or sequences of such entities, or around certain non-well-formed but commonly occurring idioms, such as if (exp) {. Each arm of the conditional must contain the same kind of syntactic construct as the other arms. #if wrapped around random text used as a bad kind of comment is problematic, but easily fixed in the source by making a real comment. Where these conditions are not met, you need to modify the original source code, often by moving the #if, #elif, #else or #endif a few tokens.
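As a rough illustration of the last rule, here is a sketch of a conditional that stays within those restrictions: both arms are sequences of complete statements. The macro DEMO_VERBOSE and the function log_level are made up for the example, and this is my reading of the rules above rather than output from the tool.

```c
#include <assert.h>
#include <stdio.h>

int log_level(void)
{
    int level = 0;
#ifdef DEMO_VERBOSE          /* each arm is a sequence of whole statements */
    level = 2;
    puts("verbose logging enabled");
#else
    level = 1;
#endif
    assert(level > 0);       /* one of the two arms always runs */
    return level;
}
```

A conditional that splits a statement in the middle, for example one arm ending partway through an expression and the other arm supplying a different ending, is the kind of code that would need to be rearranged by hand.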
In our experience, one can revise a code base of 50,000 lines in a few hours to get around these issues. While that seems annoying (and it is), the alternative is to not be able to parse the source code at all, which is far worse than annoying.
You also want more than just a parser. See Life After Parsing, to know what happens after you succeed in getting a parse tree. We've done some additional work in building symbol tables in which the declarations are recorded with the preprocessor context in which they are embedded, enabling type checking to include the preprocessor conditions.
You can have a look at this ANTLR grammar. You will have to add rules for preprocessor tokens, though.
Your specific example can be handled by writing your own parser and ignoring macro expansion, because FOO(1) can itself be interpreted as a function call.
When more cases are considered, however, the parser becomes much more difficult. You can refer to PDF Link for more information.
A customer recently performed static analysis of my employer's C codebase and gave us the results. Among useful patches was the request to change the famous do { ... } while(0) macro to do { ... } while(0,0). I understand what their patch is doing (using the comma operator so that the condition evaluates to the value of the second "0", so the effect is the same) but it's not clear why they'd favor the second form over the first form.
Is there a legitimate reason why one should prefer the second form of the macro, or is our customer's static analysis being overly pedantic?
Just a guess as to why they might suggest using
do { ... } while(0,0)
over
do { ... } while(0)
even though there's no behavioral difference and there should be no runtime cost difference between the two.
My guess is that the static analysis tool complains about the while loop being controlled by a constant in the simpler case and doesn't when 0,0 is used. The customer's suggestion is probably just so they don't get a bunch of false positives from the tool.
For example I occasionally come across situations where I want to have a conditional statement controlled by a constant, but the compiler will complain with a warning about a conditional expression evaluating to a constant. Then I have to jump through some hoops to get the compiler to stop complaining (since I don't like to have spurious warnings).
Your customer's suggestion is one of the hoops I've used to quiet that warning, though in my case it wasn't controlling a while loop, it was to deal with an "always fails" assertion. Occasionally, I'll have an area of code that should never execute (maybe the default case of a switch). In that situation I might have an assertion that always fails with some message:
assert( !"We should have never gotten here, dammit...");
But, at least one compiler I use issues a warning about the expression always evaluating to false. However, if I change it to:
assert( ("We should have never gotten here, dammit...", 0));
The warning goes away, and everybody's happy. I'm guessing that even your customer's static analysis tool would be, too. Note that I generally hide that bit of hoop jumping behind a macro like:
#define ASSERT_FAIL( x) assert( ((x), 0))
It might be nice to be able to tell the tool vendor to fix the problem, but there might be legitimate cases where they actually do want to diagnose a loop being controlled by a constant boolean expression. Not to mention the fact that even if you convince a tool vendor to make such a change, that doesn't help you for the next year or so that it might take to actually get a fix.
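For readers who have not met the idiom, here is a small sketch of the two macro forms being discussed. The macro and function names are invented for the example. Both versions behave identically; the second merely spells the constant condition with a comma operator. Note that some compilers may warn about the unused left operand of the comma instead, so this trades one diagnostic for another depending on the toolchain.

```c
#include <assert.h>

/* Multi-statement macro wrapped so that it behaves like one statement. */
#define SWAP_INT(a, b) \
    do { int tmp_ = (a); (a) = (b); (b) = tmp_; } while (0)

/* Same macro with the comma-operator condition the customer suggested. */
#define SWAP_INT_QUIET(a, b) \
    do { int tmp_ = (a); (a) = (b); (b) = tmp_; } while (0, 0)

void sort_pair(int *lo, int *hi)
{
    if (*lo > *hi)
        SWAP_INT(*lo, *hi);        /* trailing semicolon completes the do-while */
    else
        assert(*lo <= *hi);        /* the else still pairs with the if above */
}

void sort_pair_quiet(int *lo, int *hi)
{
    if (*lo > *hi)
        SWAP_INT_QUIET(*lo, *hi);
    else
        assert(*lo <= *hi);
}
```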
Using while(0,0) prevents the Microsoft compiler from generating a warning about a condition which is a constant (Warning C4127).
When this warning is enabled (e.g. with /W4 or /Wall), it can be silenced on a case-by-case basis with this nice little trick (see this other thread).
EDIT: Since Visual Studio 2017 15.3, while(0) does not emit warnings anymore (cf. Constant Conditionals). You can get rid of your (0,0)!
Well, I'll go for an answer:
Is there a legitimate reason why one should prefer the second form of the macro... ?
No. There is no legitimate reason. Both always evaluate to false, and any decent compiler will probably turn the second one into the first in the assembly anyway. If there were any reason for it to be invalid in some cases, C's been around long enough for that reason to have been discovered by greater gurus than I.
If you like your code making owl-y eyes at you, use while(0,0). Otherwise, use what the rest of the C programming world uses and tell your customer's static analysis tool to shove it.