Is there a standard for mentioning variables within comments? - c

I understand that, in principle, modern programming languages are intended to be used in a manner where the code written is self-documenting.
However, I was taught that on occasion it is necessary to explicitly write a brief precondition or postcondition statement for a function to assert generality. If I need to mention a variable by name in such a comment, is there a standard way to denote that it's a variable?

Please use Doxygen for C; it is the de facto standard and worth the effort.
https://www.doxygen.nl/manual/commands.html
\pre { description of the precondition }
Starts a paragraph where the
precondition of an entity can be described. The paragraph will be
indented. The text of the paragraph has no special internal structure.
All visual enhancement commands may be used inside the paragraph.
Multiple adjacent \pre commands will be joined into a single
paragraph. Each precondition will start on a new line. Alternatively,
one \pre command may mention several preconditions. The \pre command
ends when a blank line or some other sectioning command is
encountered.
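As a minimal sketch of how this can look in practice: Doxygen's \p command marks a word as a parameter name in running text (\a does the same in italics), so it can be combined with \pre and \post. The function sum_values and its parameters are hypothetical names chosen only for illustration.

```c
#include <stddef.h>

/**
 * \brief Sum the first \p len elements of \p values.
 *
 * \param[in] values  Array to read; it is not modified.
 * \param[in] len     Number of elements to sum.
 * \return The sum of the elements.
 *
 * \pre  \p values points to at least \p len readable elements.
 * \pre  \p len is greater than zero.
 * \post The contents of \p values are unchanged.
 */
long sum_values(const int *values, size_t len)
{
    long total = 0;
    for (size_t i = 0; i < len; i++) {
        total += values[i];
    }
    return total;
}
```

In the generated documentation the names marked with \p are set in a typewriter font, which makes it clear they refer to variables rather than ordinary words.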

Related

Bison: How to check if continue/break are in a loop?

I have the following fragment in my Bison file that describes a simple "while" loop as a condition followed by a sequence of statements. The list of statements is large and includes BREAK and CONTINUE. The latter two can be used only within a loop.
%start statements
%%
statements: | statement statements
statement: loop | BREAK | CONTINUE | WRITE | ...
loop: WHILE condition statements ENDWHILE
condition: ...
%%
I can add a C variable, set it upon entering the loop, reset it upon exiting, and check at BREAK or CONTINUE, but this solution does not look elegant:
loop: WHILE {loop++;} condition statements {loop--;} ENDWHILE
statement: loop | BREAK {if (!loop) yyerror();} ...
Is there a way to prevent the two statements from outside a loop using only Bison rules?
P.S. What I mean is "Is there an EASY way..," without fully duplicating the grammar.
Sure. You just need three different statement non-terminals, one which matches all statements; one which matches everything but continue (for switch blocks), and one which matches everything but break and continue. Of course, this distinction needs to trickle down through your rules. You'll also need three versions of each type of compound statement: loops, conditionals, switch, braced blocks, and so on. Oh, and don't forget that statements can be labelled, so there are some more non-terminals to duplicate.
But yeah, it can certainly be done. The question is, is it worth going to all that trouble. Or, to put it another way, what do you get out of it?
To start with, the end user finds that where they used to have a pretty informative error message about continue statements outside a loop, they now just get a generic Syntax Error. Now, you can fix that with some more grammar modifications, by actually providing productions which match the invalid statements, and then present a meaningful error message. But that's almost exactly the same code already rejected as inelegant.
Other than that, does it in any way reduce parser complexity? It lets you assume that a break statement is legally placed, but you still have to figure out the break statement's destination. And other than that, there are not really a lot of evident advantages, IMHO.
But if you want to do it, go for it.
Once you've done that, you could try modifying your grammar so that break, continue, goto and return cannot be followed by an unlabelled statement. That sounds like a good idea, and some languages do it. It can certainly be done in the grammar. (But before you get too enthusiastic, remember that some programmers do deliberately create dead code during debugging sessions, and they won't thank you for making it impossible.)
There is a BNF extension, used in the ECMAScript standard, amongst others, which parameterizes non-terminals with a list of features, each of which can be present or not. These parameters can then be used in productions, either as conditions or to be passed through to non-terminals on the right-hand side. This could be used to generate three versions of statement, using the features [continue] and [break], which would be used as gates on those respective statement syntaxes, and also passed through to the compound statement non-terminals.
I don't know of a parser generator capable of handling such parameterised rules, so I can't offer it as a concrete suggestion, but this question is one of the use cases which motivated parameterised non-terminals. (In fact, I believe it's one of the uses, but I might be remembering that wrong.)
With an ECMAScript-style formalism, this grammatical restriction could be written without duplicating rules. The duplication would still be there, under the surface, since the parser generator would have to macro expand the templated rules into their various boolean possibilities. But the grammar is a lot more readable and the size of the state machine is not so important these days.
I have no doubt that it would be a useful feature, but I also suspect that it would be overused, with the result that the quality of error messages would be reduced.
As a general rule, compilers should be optimised for correct inputs, with the additional goal of producing helpful error messages for invalid input. Complicating the grammar even a little to make easily described errors into syntax errors does not help with either of these goals. If it's possible to write a few lines of code to produce the correct error message for a detected problem, instead of emitting a generic syntax error, I would definitely do that.
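The "few lines of code" approach might look roughly like the following sketch. The helper names (enter_loop, leave_loop, check_in_loop) and the counters are hypothetical; they are meant to be called from the grammar's actions in place of the bare loop++ / loop-- from the question, and the line number is assumed to come from whatever location tracking the scanner provides.

```c
#include <stdio.h>

/* Hypothetical helpers meant to be called from grammar actions. */
static int loop_depth = 0;   /* number of loops currently enclosing us */
static int error_count = 0;  /* number of semantic errors reported */

void enter_loop(void) { loop_depth++; }
void leave_loop(void) { if (loop_depth > 0) loop_depth--; }

/* Returns 1 if `keyword` (e.g. "break") is legal at this point.
 * Otherwise reports a specific error, counts it, and returns 0. */
int check_in_loop(const char *keyword, int line)
{
    if (loop_depth > 0) {
        return 1;
    }
    fprintf(stderr, "line %d: '%s' statement outside of a loop\n",
            line, keyword);
    error_count++;
    return 0;
}
```

Because the check is an ordinary function call, the parser keeps going after the error and can report further problems in the same run, which a generic syntax error would not allow as easily.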
There are (many) other use cases for the ECMAScript BNF extensions. For example they make it much easier to describe a syntax whose naive grammar requires two or three lookahead tokens.

Variable and executable in a shell interpreter

Do you know how to tell the difference between a variable and an executable in a shell interpreter? I don't know how to do that in my lexer.
If anyone has an idea ^^
Thanks,
Have a nice day
Mathieu
In a normal POSIX-style shell, the first "word" in a statement which is not a variable assignment is the command to execute. Variable assignments have the form name=value, where there cannot be any whitespace around the = and the name is a valid variable name.
Apart from assignments and arithmetic evaluation contexts (which are not required for basic shells), any use of a variable must be preceded by a $.
Identifying assignments is contextual, but it is easy to do since the = is mandatory. In a flex-style lexer you could enable and disable assignment recognition with appropriate start conditions, for example.
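A minimal sketch of that rule in C, assuming the lexer has already split the statement into unquoted words. The function names are hypothetical, and a "valid variable name" is taken here to be a letter or underscore followed by letters, digits, or underscores.

```c
#include <ctype.h>
#include <stddef.h>

/* Returns the length of the variable name if `word` has the form
 * name=value, or 0 if the word is not an assignment. */
size_t assignment_name_length(const char *word)
{
    size_t i = 0;
    /* A name must start with a letter or an underscore... */
    if (!(isalpha((unsigned char)word[0]) || word[0] == '_')) {
        return 0;
    }
    /* ...followed by any number of letters, digits, or underscores. */
    while (isalnum((unsigned char)word[i]) || word[i] == '_') {
        i++;
    }
    /* The name must be followed immediately by '='. */
    return word[i] == '=' ? i : 0;
}

/* Returns the index of the command word in words[0..n), or n if the
 * statement consists only of assignments. */
size_t command_word_index(const char *const *words, size_t n)
{
    size_t i = 0;
    while (i < n && assignment_name_length(words[i]) > 0) {
        i++;
    }
    return i;
}
```

For the words A=1, B=2, env, C=3, the command word is env: once the command has been seen, C=3 is just an ordinary argument.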
Without knowing anything more about your strategy for lexical analysis, it's hard to provide a more detailed answer.
If you care about compatibility with Posix shell syntax, the description can be found here.

Detect start & end of a C declaration without a full C parser

I would like to partially parse a list of C declarations and/or function definitions.
That is, I want to split it into substrings, each containing one declaration, or function definition.
Each declaration (separately) will then be passed to another module (that does contain a full C parser, but that I cannot call directly.)
Obviously I could do this by including another full C parser in my program, but I hope to avoid this.
The tricky cases I've come up against so far involve the question of whether '}' terminates a declaration/definition or not. For example in
int main(int ac, char **av) {return 0;}
... the '}' is a terminator, whereas in
typedef struct foo {int bar;} *pfoo;
it is not. There may also be pathological pieces of code like this:
struct {int bar;} *getFooPtr(...) { /* code... */ }
Notes
Please assume the C code has already been fully preprocessed before my function sees it. (Actually it hasn't, but we have a workaround for that.)
My parser will probably be implemented in Lua with LPeg
To extend the state machine in your answer to deal with function definitions, add the following steps:
1) Set the fun/var state to 'unknown'.
2) Examine the character at the current position.
3) If it's ;, we have found the end of the declaration, and it's not a function definition (it might be a function declaration, though).
4) If it's " or ', jump to the matching quote, skipping over escape sequences if necessary.
5) If it's (, [ or {, jump to the matching ), ] or } (skipping over nested brackets and strings recursively if necessary).
6) If the fun/var state is 'function' and we just skipped { .. }, we've found the end of the declaration, and it's a function definition.
7) If the fun/var state is 'unknown' and we just skipped ( .. ), set the fun/var state to 'function'.
8) If the current char is = or ,, set the fun/var state to 'not-function'.
9) Advance to the next input character, and go back to 2.
Of course, this only works on post-pre-processed code -- if you have macros that do various odd things that haven't yet been expanded, all bets are off.
As far as I can tell, the following solution works for declarations only (that is, function definitions must be kept out of this section, although adding semicolons after them may be a workaround):
1) Examine the character at the current position.
2) If it's ;, we have found the end of the declaration.
3) If it's " or ', jump to the matching quote, skipping over escape sequences if necessary.
4) If it's (, [ or {, jump to the matching ), ] or } (skipping over nested brackets and strings recursively if necessary).
5) Otherwise, advance to the next input character and go to step 1.
If this proves to be unsatisfactory, I will switch to the clang parser.
Your best bet would be to extract the part of the C grammar which is related to declarations, and build a parser for that or an abbreviated version of that. Similarly, you want the grammar for function bodies, abbreviated in a similar way, so you can skip them.
This might produce a relatively trustworthy parser for declarations.
It is unfortunate that you will not likely be able to get your hands on a trustworthy C grammar; the one in the ANSI Standard(s) is not the one the compilers actually use. Every vendor has added goodies and complications to their compiler (e.g., MS C's declspecs, etc.).
The assumption that the preprocessor has run is interesting. Where are you going to get the preprocessor configuration (e.g., compiler command line defines, include paths, pragma settings, etc.)? This is harder than it looks, as each development environment defines different ways to set the preprocessor conditionals.
If you are willing to accept occasional errors, then any heuristic is a valid candidate, modulo how often it makes a mistake on an important client's code. This also means you can handle unprocessed code, avoiding the preprocessor issue entirely.

Parsing C files without preprocessing it

I want to run a simple analysis on C files (such as: if you call the foo macro with INT_TYPE as an argument, then cast the response to int*). I do not want to preprocess the file; I just want to parse it (so that, for instance, I'll have correct line numbers).
That is, I want to get from
#include <a.h>
#define FOO(f)
int f() {FOO(1);}
a list of tokens like
<include_directive value="a.h"/>
<macro name="FOO"><param name="f"/><result/></macro>
<function name="f">
<return>int</return>
<body>
<macro_call name="FOO"><param>1</param></macro_call>
</body>
</function>
with no need to set include path, etc.
Is there any preexisting parser that does it? All parsers I know assume C is preprocessed. I want to have access to the macros and actual include instructions.
Our C Front End can parse code containing preprocessor elements to a fair extent and still build a usable AST. (Yes, the parse tree has precise file/line/column number information.)
It imposes a number of restrictions, within which it handles most code. In the few cases it cannot handle, a small, easy change to the source file giving equivalent code often solves the problem.
Here's a rough set of rules and restrictions:
#includes and #defines can occur wherever a declaration or statement can occur, but not in the middle of a statement. These rarely cause a problem.
Macro calls can occur where function calls occur in expressions, or can appear without a semicolon in place of statements. Macro calls that span non-well-formed chunks are not handled well (anybody surprised?). The latter occur occasionally and need manual revision. OP's example of "j(v,oid)*" is problematic, but this is really rare in code.
#if ... #endif must be wrapped around major language concepts (nonterminals) (constant, expression, statement, declaration, function) or sequences of such entities, or around certain non-well-formed but commonly occurring idioms, such as if (exp) {. Each arm of the conditional must contain the same kind of syntactic construct as the other arms. #if wrapped around random text used as a bad kind of comment is problematic, but easily fixed in the source by making it a real comment. Where these conditions are not met, you need to modify the original source code, often by moving the #if, #elif, #else or #endif a few tokens.
In our experience, one can revise a code base of 50,000 lines in a few hours to get around these issues. While that seems annoying (and it is), the alternative is to not be able to parse the source code at all, which is far worse than annoying.
You also want more than just a parser. See Life After Parsing, to know what happens after you succeed in getting a parse tree. We've done some additional work in building symbol tables in which the declarations are recorded with the preprocessor context in which they are embedded, enabling type checking to include the preprocessor conditions.
You can have a look at this ANTLR grammar. You will have to add rules for preprocessor tokens, though.
Your specific example can be handled by writing your own parser and ignoring macro expansion, because FOO(1) itself can be interpreted as a function call.
When more cases are considered, however, the parser becomes much more difficult to write. See the linked PDF for more information.

How to know line numbers of local declarations/loops in a C function

I want to know on which line numbers the declarations in a given C function appear.
I also want to know which lines contain if/while/for constructs, and which statements span multiple lines (i.e. they do not end on the same line).
I think we need to know why you want the line number in order to help you.
Variously:
1) You can use __LINE__ in the code to get the current line number.
2) Most editors can show the line numbers next to the code.
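As a small illustration of the first option, __LINE__ expands to the number of the source line on which it appears, and __FILE__ to the name of the file. The macro TRACE and the function line_of_this_return below are hypothetical names used only for this sketch.

```c
#include <stdio.h>

/* Hypothetical helper: print the file and line where TRACE is used. */
#define TRACE(msg) \
    fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__, (msg))

/* Returns the number of the source line containing the return statement. */
int line_of_this_return(void)
{
    return __LINE__;
}
```

Note that this only reports the line on which the macro is written; it does not discover where declarations or loops are located, which is why a tool that reads the source is needed for that.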
If you want to script breakpoints, I'm not sure if that's possible - I'd suggest setting break-points on filename and function, and then splitting up the code till that's sufficient. Alternatively investigate other ways of getting the testing done - e.g. splitting up the code so unit tests can check it.
Maybe I did not understand your question, but you can use ctags (or one of its variants) to get a list of declarations and their line numbers.
For example, Exuberant Ctags is capable of generating tags (with line numbers) for all types of C/C++ language tags, including all of the following:
class names
macro definitions
enumeration names
enumerators
function definitions
function prototypes/declarations
class, interface, struct, and union data members
structure names
typedefs
union names
variables (definitions and external declarations)
If you can, use the diff tool. It provides line numbers as part of the output. Your tool could then parse that output, looking for declarations or primary code.

Resources