How to process macros in LEX?

How do I implement #define in yacc/bison?
For Example:
#define f(x) x*x
If f(x) appears anywhere in any function, it is replaced by the right-hand side of the
macro, substituting for the argument ‘x’.
For example, f(3) would be replaced with 3*3. The macro can call another macro too.

It's not usually possible to do macro expansion inside a parser, at least not C-style macros, because C-style macro expansion doesn't respect syntax. For example
#define IF if(
#define THEN )
is legal (although very bad style IMHO). But for that to be handled inside the grammar, it would be necessary to allow a macro identifier to appear anywhere in the input, not just where an identifier might be expected. The necessary modifications to the grammar are going to make it much less readable and are very likely to introduce parser action conflicts. [Note 1]
Alternatively, you could do the macro expansion in the lexical analyzer. The lexical analyzer is not a parser, but parsing a C-style macro invocation doesn't require much sophistication, and if macro parameters were not allowed, it would be even simpler. This is how Flex handles macro replacement in its regular expressions ({identifier}, for example). [Note 2] Since Flex macros are just raw character sequences, not token lists as with C-style macros, they can be handled by pushing the replacement text back into the input stream. (F)lex provides the unput special action for this purpose. unput pushes one character back into the input stream, so if you want to push back an entire macro replacement, you have to unput it one character at a time, back to front, so that the last character unput is the first one to be read afterwards.
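As a rough sketch of what that could look like in a scanner rule: lookup_macro() is a hypothetical helper that returns the stored replacement text (or NULL if the name is not a macro), and it assumes {identifier} and IDENTIFIER are defined elsewhere in the scanner and that string.h is included in the %{ %} block.
{identifier}   {
                   const char *body = lookup_macro(yytext);   /* hypothetical lookup table */
                   if (body) {
                       /* unput() pushes a single character, so the replacement
                          is pushed back to front */
                       for (size_t i = strlen(body); i > 0; --i)
                           unput(body[i - 1]);
                   } else {
                       return IDENTIFIER;   /* ordinary identifier token */
                   }
               }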
That's workable but ugly. And it's not really scalable to even the small feature list provided by the C preprocessor. And it violates the fundamental principle of software design, which is that each component does just one thing (so that it can do it well).
So that leaves the most common approach, which is to add a separate macro processor component, so that instead of dividing the parse into lexical scan/syntax analysis, the parse becomes lexical scan/macro expansion/syntax analysis. [Note 3]
A C-style macro processor which works between the lexical analyser and the syntactic analyser could itself be written in Bison. As I mentioned above, the parsing requirements are generally minimal, but there is still parsing to be done and Bison is presumably already part of the project. Although I don't know of any macro processor (other than proof-of-concept programs I've written myself) which do this, I think it's a very flexible solution. In particular, the Bison syntactic analysis phase could be implemented with a push-parser, which avoids the need to produce the entire macro-expanded token stream in order to make it available to a traditional pull-parser.
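A minimal sketch of the push-parser side, assuming the grammar is built with %define api.push-pull push and %define api.pure full, and that the generated header is included; expand_next_token() is a hypothetical function wrapping the lexer plus the macro expander and returning 0 at end of input:
/* Driver loop feeding macro-expanded tokens to a Bison push parser. */
yypstate *ps = yypstate_new();
int status;
do {
    YYSTYPE lval;
    int tok = expand_next_token(&lval);   /* hypothetical: lexer + macro expansion */
    status = yypush_parse(ps, tok, &lval);
} while (status == YYPUSH_MORE);
yypstate_delete(ps);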
That's not the only way to design macros, though. Indeed, it has a lot of shortcomings, because the macro expansions are not hygienic, respecting neither syntax nor scope. Probably anyone who has used C macros has at one time or other been bitten by these problems; the simplest manifestation is defining a macro like:
#define NEXT(a) a + 1
and then writing
int x = NEXT(a) * 3;
which is not going to produce the expected result: the expansion is int x = a + 1 * 3;, which groups as a + 3 rather than (a + 1) * 3. Also, any macro expansion which needs to use a local variable will sooner or later produce an incorrect expansion because of an unexpected name collision. Hygienic macro expansion seeks to solve these issues by viewing macro expansion as an operation on syntax trees, not token streams, making the parsing paradigm lexical scan/syntax analysis/macro expansion (of the parse tree). For that operation, the appropriate tool might well be some kind of tree parser.
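The name-collision issue mentioned above can be shown with a classic example (SWAP here is my own illustration, not something from the question):
#define SWAP(a, b) do { int tmp = (a); (a) = (b); (b) = tmp; } while (0)

int tmp = 1, x = 2;
SWAP(tmp, x);   /* the macro's own tmp shadows the caller's tmp, so the
                   caller's variable is never updated and the "swap" fails */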
Notes
Also, you'd want to remove the token from the parse tree. Yacc/Bison does have a poorly-documented feature, YYBACKUP, which might possibly make this achievable. I don't know if that's one of its intended use cases; indeed, it is not clear to me what its intended use cases are.
The (f)lex documentation calls these definitions, but they really are macros, and they suffer from all the usual problems macros bring with them, such as mysterious interactions with surrounding syntax.
Another possibility is macro expansion/lexical scan/syntax analysis, which could be implemented using a macro processor like M4. But that completely divorces the macros from the rest of the language.

yacc and lex generate C source at the end, so you can use macros inside the parser and lexer actions.
The actual #define preprocessor directives can go in the first section of the lexer and parser files:
%{
// Somewhere here
#define f(x) x*x
%}
These sections will be copied verbatim to the generated C source.
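For instance, a parser action could then use the macro; the rule and token names below are placeholders, not taken from the question:
expr : NUM            { $$ = f($1); }   /* f(...) is expanded by the C preprocessor
                                            when the generated parser source is compiled */
     ;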

Related

How do function renaming macros work, and should one use them?

Everyone knows about the classic #define DEFAULT_VALUE 100 macro, where the preprocessor will just find the "token" and replace it with whatever the value is.
The problem I am having is understanding the function version of this #define my_puts(x) puts(x). I have K&R in front of me but I simply cannot find a suitable explanation. For instance:
why do I need to supply the number of arguments?
why can their name be whatever?
why don't I have to supply the type?
But mainly I would like to know how this replacement functions under the hood.
In the back of my mind I think I have a memory of someone saying somewhere that this is bad because there are no types.
In short, I would like to know if it is safe and secure to use macros to rename functions (as opposed to the alternative of manually wrapping the function in another function).
Thank you!
The problem I am having is understanding the function version of this #define my_puts(x) puts(x).
Part of your confusion might arise from thinking of this variety as a "function renaming" macro. A more conventional term is "function-like", referring to the form of the macro definition and usage. Providing aliases for function names or converting from one function name to another is a relatively minor use for this kind of macro.
Such macros are better regarded more generally, simply as macros that accept parameters. From that standpoint, your specific questions have relatively clear answers:
why do I need to supply the number of arguments?
You are primarily associating parameter names with the various positions in the macro's parameter list. This is necessary so that the preprocessor can properly expand the macro. That the number of parameters is thereby conveyed (except for variadic macros) is of secondary importance.
why can their name be whatever?
"Whatever" is a little too strong, but the answer is that the names of macro parameters are significant only within the scope of the macro definition. The preprocessor substitutes the actual arguments into each expansion in place of the parameter names whenever it expands the macro. This is analogous to bona fide functions, actually, so I'm not really sure why this particular uncertainty arises for you.
why don't I have to supply the type?
Of the macro? Because to the extent that macros have a type, they all have the same one. They all expand to sequences of zero or more tokens. You can view this as a source-to-source translation. The resulting token sequence will be interpreted by the compiler at a subsequent stage in the process.
But mainly I would like to know how this replacement functions under the hood.
Roughly speaking, wherever the name of an in-scope function-like macro appears in the source code followed by a parenthesized list of arguments, the macro name and argument list are replaced by the expansion of the macro, with the macro arguments substituted appropriately.
For example, consider this function-like macro, which you might see in real source code:
#define MIN(x, y) (((x) <= (y)) ? (x) : (y))
Within the scope of that definition, this code ...
n = MIN(10, z);
... expands to
n = (((10) <= (z)) ? (10) : (z));
Note well that
the function-like macro is not providing a function alias in this case.
the macro arguments are substituted into the macro expansion wherever they appear as complete tokens in the macro's defined replacement text.
In the back of my mind I think I have a memory of someone saying somewhere that this is bad because there are no types.
Well, there are no types declared in the macro definition. That doesn't prevent all the normal rules around data types from applying to the source code resulting from the preprocessing stage. Both of these factors need to be taken into account. In some ways, the MIN() macro in the above example is more flexible than any one function can be. Is that bad? I don't mean to deny that there are arguments against it, but it's a multifaceted question that is not well captured by a single consideration or a plain "good" vs. "bad" evaluation.
In short, I would like to know if it is safe and secure to use macros to rename functions (as opposed to the alternative of manually wrapping the function in another function).
That's largely a different question from any of the above. The semantics of function-like macros are well-defined. There is no inherent safety or security issue. But function-like macros do obscure what is going on, and thereby make it more difficult to analyze code. This is therefore mostly a stylistic issue.
Function-like macros do have detractors these days, especially in the C++ community. In most cases, they have little to offer to distinguish themselves as superior to functions.
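For comparison, the "manually wrapping" alternative from the question might look like this; the inline wrapper (and its name my_puts_fn) is just my own sketch, not something the question proposes:
#include <stdio.h>

/* Macro version: a purely textual alias, with no checking of its own. */
#define my_puts(x) puts(x)

/* Wrapper version: an ordinary function with a declared parameter type,
   so the expected argument type is visible in its own signature. */
static inline int my_puts_fn(const char *s) { return puts(s); }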

Why does the preprocessor forbid macro pasting of weird tokens?

I am writing my own C-preprocessor based on GCC. So far it is nearly identical, but what I consider redundant is to perform any form of checking on the tokens being concatenated by virtue of ##.
So in my preprocessor manual, I've written this:
3.5 Concatenation
...
GCC forbids concatenation with two mutually incompatible preprocessing
tokens such as "x" and "+" (in any order). That would result in the
following error: "pasting "x" and "+" does not give a valid
preprocessing token" However this isn't true for this preprocessor - concatenation
may occur between any token.
My reasoning is simple: if it expands to invalid code, then the compiler will produce an error, so I don't have to explicitly handle such cases, which would make the preprocessor slower and increase its code complexity. If it results in valid code, then removing this restriction just makes the preprocessor more flexible (although probably only in rare cases).
So I would like to ask, why does this error actually happen, why is this restriction actually applied and is it a true crime if I dismiss it in my preprocessor?
As far as ISO C goes, if ## creates an invalid token, the behavior is undefined. But there is a somewhat strange situation there, in the following way. The output of the preprocessing translation phases of C is a stream of preprocessing tokens (pp-tokens). These are converted to tokens, and then syntactically and semantically analyzed. Now here is an important rule: it is a constraint violation if a pp-token doesn't have a form which lets it be converted to a token. Thus, a preprocessor token which is garbage that you write yourself without help from the ## operator must be diagnosed for bad lexical syntax. But if you use ## to create a bad preprocessing token, the behavior is undefined.
Note the subtlety there: the behavior of ## is undefined if it is used to create a bad preprocessing token. It's not the case that the pasting is well-defined, and then caught at the stage where pp-tokens are converted to tokens: it's undefined right from that point where ## is evaluated.
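A minimal illustration (my own example, not from the question):
#define PASTE(a, b) a ## b

PASTE(x, 1)   /* OK: "x" ## "1" forms the single identifier x1 */
PASTE(x, +)   /* undefined: "x" ## "+" does not form a valid preprocessing
                 token; GCC diagnoses it with the error quoted above */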
Basically, this is historical. C preprocessors were historically (and probably some still are) separate programs, with lexical analysis that was different from, and looser than, that of the downstream compiler. The C standard tried to capture that somehow in terms of a single language with translation phases, and the result has some quirks and areas of perhaps surprising under-specification. (For instance, in the preprocessing translation phases, the number token ("pp-number") has a strange lexical grammar which allows gibberish, such as tokens with multiple floating-point E exponents; 1e+2e+3, for example, lexes as a single pp-number even though it can never be converted to a valid token.)
Now, back to your situation. Your textual C preprocessor does not actually output pp-token objects; it outputs another text stream. You may have pp-token objects internally, but they get flattened on output. Thus, you might think, why not allow your ## operator to just blindly glue together any two tokens? The net effect is as if those tokens were dumped into the output stream without any intervening whitespace. (And this is probably all it was, in early preprocessors which supported ##, and ran as separate programs).
Unfortunately what that means is that your ## operator is not purely a bona fide token pasting operator; it's just a blind juxtaposing operator which sometimes produces one token, when it happens to juxtapose two tokens that will be lexically analyzed as one by the downstream compiler. If you go that way, it may be best to be honest and document it as such, rather than portraying it as a flexibility feature.
A good reason, on the other hand, to reject bad preprocessing tokens in the ## operator is to catch situations in which it cannot achieve its documented job description: the requirement of making one token out of two. The idea is that the programmer knows the language spec (the contract between the programmer and the implementation) and knows that ## is supposed to make one token, and relies on that. For such a programmer, any situation involving a bad token paste is a mistake, and that programmer is best supported by diagnosis.
The maintainers of GCC and the GNU CPP preprocessor probably took this view: that the preprocessor isn't a flexible text munging tool, but part of a toolchain supporting disciplined C programming.
Moreover, the undefined behavior of a bad token paste job is easily diagnosed, so why not diagnose it? The lack of a diagnosis requirement in this area in the standard looks like just a historic concession. It is a kind of "low-hanging fruit" of diagnosis. Let those undefined behaviors go undiagnosed for which diagnosis is difficult or intractable, or requires run-time penalties.

Is GLR algorithm a must when bison parsing C grammar?

I'm trying to study C grammar with flex/bison.
I found bison cannot parse this bison grammar: https://www.lysator.liu.se/c/ANSI-C-grammar-y.html, because LALR algorithm cannot process recursively multiple expressions.
Is GLR algorithm a must for C grammar?
There is nothing wrong with that grammar except:
it represents a very old version of C
it requires a lexical analyser which can somehow distinguish between IDENTIFIER and TYPE_NAME
it does not even attempt to handle the preprocessor phases
Also, it has one shift/reduce conflict as a result of the "dangling else" ambiguity. However, that conflict can be ignored because bison's conflict resolution algorithm produces the correct result in this case. You can suppress the warning either with an %expect directive or by including a precedence declaration which favours shifting else over reducing if; or you can eliminate the ambiguity in the grammar using the technique described in the Wikipedia article on the dangling else. (Note: I'm not talking about copy-and-pasting code from the Wikipedia article. In the case of C, you need to consider all cases of compound statements which terminate with an if statement.)
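For reference, the two suppression options might look roughly like this in Bison; the token and rule names follow the linked grammar, LOWER_THAN_ELSE is a placeholder token introduced only for its precedence, and this is a sketch rather than a drop-in patch:
%expect 1        /* option 1: acknowledge the single expected shift/reduce conflict */

/* option 2: resolve the ambiguity explicitly with precedence declarations */
%nonassoc LOWER_THAN_ELSE
%nonassoc ELSE
%%
selection_statement
        : IF '(' expression ')' statement %prec LOWER_THAN_ELSE
        | IF '(' expression ')' statement ELSE statement
        ;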
Moreover, an LR parser is not recursive, and it has no problems which could be described as a failure to "process recursively multiple expressions". (You might have that problem with a recursive descent parser, although it's pretty easy to work around the issue.)
So any problems you might have experienced (if your question refers to a concrete issue) have nothing to do with what's described in your question.
Of the problems I listed above, the most troubling is the syntactic ambiguity of the cast operator. The cast operator is not actually ambiguous; clearly, C compilers manage to compile such expressions correctly. But distinguishing between the two possible parses of, for example, (x)-y*z requires knowing whether x names a type or a variable.
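Concretely, the two readings are (following the usual C precedence rules):
(x)-y*z     /* if x is a variable:  x - (y * z)                */
            /* if x names a type:   ((x)(-y)) * z, i.e. a cast */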
In C, all names are lexically scoped, so it is certainly possible to resolve x at compile time. But the resolution is not context-free. Since GLR is also a technique for parsing context-free grammars, using a GLR parser won't directly help you. It might be useful in the sense that GLR parsers can theoretically produce "parse forests" rather than parse trees; that is, the output of a GLR parser might effectively contain all possible correct parses, leaving the possibility to resolve the ambiguity by building symbol tables for each scope and then choosing between alternative parses by examining the name binding in effect at each site. (This works because type alias declarations -- "typedefs" -- are not ambiguous, so all the potential parses will have the same alias declarations.)
The usual solution, though, is to parse the program text using a deterministic parser, maintaining a symbol table during the parse, and giving the lexical analyser access to this symbol table so that it can distinguish between IDENTIFIER and TYPE_NAME, as expected by the grammar you link. This technique is politely called "lexical feedback", although it's also often called "the lexer hack".
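A sketch of what that feedback can look like in the scanner; is_type_name() stands for a hypothetical lookup into the symbol table the parser maintains, and {identifier} for whatever pattern the lexer uses for names:
{identifier}   {
                   /* consult the parser-maintained symbol table */
                   return is_type_name(yytext) ? TYPE_NAME : IDENTIFIER;
               }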

macro expansion order with included files

Let's say I have a macro in an inclusion file:
// a.h
#define VALUE SUBSTITUTE
And another file that includes it:
// b.h
#define SUBSTITUTE 3
#include "a.h"
Is it the case that VALUE is now defined to SUBSTITUTE and will be macro expanded in two passes to 3, or is it the case that VALUE has been set to the macro expanded value of SUBSTITUTE (i.e. 3)?
I ask this question in the interest of trying to understand the Boost preprocessor library and how its BOOST_PP_SLOT defines work (edit: and I mean the underlying workings). Therefore, while I am asking the above question, I'd also be interested if anyone could explain that.
(and I guess I'd also like to know where the heck the 'painted blue' rules are written...)
VALUE is defined as SUBSTITUTE. The definition of VALUE is not aware at any point that SUBSTITUTE has also been defined. After VALUE is replaced, whatever it was replaced by will be scanned again, and potentially more replacements applied then. All defines exist in their own conceptual space, completely unaware of each other; they only interact with one another at the site of expansion in the main program text (defines are directives, and thus not part of the program proper).
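So, with the two headers from the question, a use site works out like this (the file name c.c is just for illustration):
// c.c
#include "b.h"
int v = VALUE;   /* VALUE is replaced by SUBSTITUTE; the result is rescanned,
                    SUBSTITUTE is then replaced by 3, so this compiles as int v = 3; */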
The rules for the preprocessor are specified alongside the rules for C proper in the language standard. The standard documents themselves cost money, but you can usually download the "final draft" for free; the latest (C11) can be found here: http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf
For at-home use the draft is pretty much equivalent to the real thing. Most people who quote the standard are actually looking at copies of the draft. (Certainly it's closer to the actual standard than any real-world C compiler is...)
There's a more accessible description of the macro rules in the GCC manual: http://gcc.gnu.org/onlinedocs/cpp/Self_002dReferential-Macros.html
Additionally... I couldn't tell you much about the Boost preprocessor library, not having used it, but there's a beautiful pair of libraries by the same authors called Order and Chaos that are very "clean" (as macro code goes) and easy to understand. They're more academic in tone and intended to be pure rather than portable; which might make them easier reading.
(Since I don't know Boost PP, I don't know how relevant this is to your question, but) there's also a good introductory example of the kinds of techniques these libraries use for advanced metaprogramming constructs in this answer: Is the C99 preprocessor Turing complete?

Parsing C files without preprocessing it

I want to run simple analyses on C files (such as: if the foo macro is called with INT_TYPE as an argument, then the result is cast to int*). I do not want to preprocess the file, I just want to parse it (so that, for instance, I'll have correct line numbers).
I.e., I want to get from
#include <a.h>
#define FOO(f)
int f() {FOO(1);}
a list of tokens like
<include_directive value="a.h"/>
<macro name="FOO"><param name="f"/><result/></macro>
<function name="f">
<return>int</return>
<body>
<macro_call name="FOO"><param>1</param></macro_call>
</body>
</function>
with no need to set include path, etc.
Is there any preexisting parser that does it? All parsers I know assume C is preprocessed. I want to have access to the macros and actual include instructions.
Our C Front End can parse code containing preprocessor elements to a fair extent and still build a usable AST. (Yes, the parse tree has precise file/line/column number information.)
There are a number of restrictions, within which it can handle most code. In the few cases it cannot handle, a small, easy change to the source file giving equivalent code often solves the problem.
Here's a rough set of rules and restrictions:
#includes and #defines can occur wherever a declaration or statement can occur, but not in the middle of a statement. These rarely cause a problem.
macro calls can occur where function calls occur in expressions, or can appear without a semicolon in place of statements. Macro calls that span non-well-formed chunks are not handled well (anybody surprised?). The latter occur occasionally but not rarely and need manual revision. A case like "j(v,oid)*" is problematic, but this is really rare in code.
#if ... #endif must be wrapped around major language concepts (nonterminals) (constant, expression, statement, declaration, function) or sequences of such entities, or around certain non-well-formed but commonly occurring idioms, such as if (exp) {. Each arm of the conditional must contain the same kind of syntactic construct as the other arms. #if wrapped around random text used as a bad kind of comment is problematic, but easily fixed in the source by making it a real comment. Where these conditions are not met, you need to modify the original source code, often by moving the #if/#elif/#else/#endif directives a few tokens.
In our experience, one can revise a code base of 50,000 lines in a few hours to get around these issues. While that seems annoying (and it is), the alternative is to not be able to parse the source code at all, which is far worse than annoying.
You also want more than just a parser. See Life After Parsing, to know what happens after you succeed in getting a parse tree. We've done some additional work in building symbol tables in which the declarations are recorded with the preprocessor context in which they are embedded, enabling type checking to include the preprocessor conditions.
You can have a look at this ANTLR grammar. You will have to add rules for preprocessor tokens, though.
Your specific example can be handled by writing your own parser and ignoring macro expansion, because FOO(1) by itself can be interpreted as a function call.
When more cases are considered, however, the parser becomes much more difficult. You can refer to PDF Link to find more information.
