Lexical analysis of a list

I have to make a lexer for a language that has (among other things) lists of the form [1,2,3] for example or ['c','s','q','t'].
I don't really understand whether I need to match the list at the lexing stage. So, for example would
2:[1,2,3];
be
NUM(2) COLON LSQBRACKET NUM(1) COMMA NUM(2) COMMA NUM(3) RSQBRACKET SEMI
or
NUM(2) COLON LIST([1,2,3]) SEMI
Thanks for any help.

Technically, it's up to you. If you only ever have to match very simple list literals, then maybe you can get away with treating them kind of like string literals. (But that's not likely to be a good approach.)
You generally want the lexer to output a series of simple tokens. The lexer should be relatively simple -- one rule of thumb is that it should never require recursion.
So, for example, requiring it to output a "LIST" token would be counterproductive -- the lexer would have to recurse on nested lists, meaning that it would implement a mini-parser. Leave that job to the parser.
The first case makes for a simpler lexer, which is still useful to a later-stage parser.
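For concreteness, a minimal Flex sketch of the first option might look like this (the token names follow the question; the NUM pattern and the semantic-value handling are assumptions, since the question doesn't show them):

%%
[0-9]+      { yylval.num = atoi(yytext); return NUM; }
":"         { return COLON; }
"["         { return LSQBRACKET; }
"]"         { return RSQBRACKET; }
","         { return COMMA; }
";"         { return SEMI; }
[ \t\n]+    { /* skip whitespace */ }
%%

The parser is then the component that knows a list is LSQBRACKET, elements, RSQBRACKET; any recursion over nested lists lives in its rules, not in the lexer.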

Related

Bison: How to check if continue/break are in a loop?

I have the following fragment in my Bison file that describes a simple "while" loop as a condition followed by a sequence of statements. The list of statements is large and includes BREAK and CONTINUE. The latter two can be used only within a loop.
%start statements
%%
statements: | statement statements
statement: loop | BREAK | CONTINUE | WRITE | ...
loop: WHILE condition statements ENDWHILE
condition: ...
%%
I can add a C variable, set it upon entering the loop, reset it upon exiting, and check at BREAK or CONTINUE, but this solution does not look elegant:
loop: WHILE {loop++;} condition statements {loop--;} ENDWHILE
statement: loop | BREAK {if (!loop) yyerror();} ...
Is there a way to prevent the two statements from outside a loop using only Bison rules?
P.S. What I mean is "Is there an EASY way...", without fully duplicating the grammar.
Sure. You just need three different statement non-terminals: one which matches all statements; one which matches everything but continue (for switch blocks); and one which matches everything but break and continue. Of course, this distinction needs to trickle down through your rules. You'll also need three versions of each type of compound statement: loops, conditionals, switch, braced blocks, and so on. Oh, and don't forget that statements can be labelled, so there are some more non-terminals to duplicate.
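For the toy grammar in the question (ignoring switch blocks and labels), a minimal sketch of that duplication might look like this:

%start statements_outside
%%
statements_outside:  | statement_outside statements_outside ;
statements_in_loop:  | statement_in_loop statements_in_loop ;

statement_outside:   loop | WRITE /* | ... */ ;
statement_in_loop:   loop | BREAK | CONTINUE | WRITE /* | ... */ ;

loop:                WHILE condition statements_in_loop ENDWHILE ;

Here only the statement non-terminals need duplicating, because a loop body always allows break and continue; in a richer language, every compound statement whose body inherits the surrounding context (conditionals, blocks, switch) needs its own copies, which is where the duplication starts to spread.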
But yeah, it can certainly be done. The question is, is it worth going to all that trouble. Or, to put it another way, what do you get out of it?
To start with, the end user finds that where they used to have a pretty informative error message about continue statements outside a loop, they now just get a generic Syntax Error. Now, you can fix that with some more grammar modifications, by actually providing productions which match the invalid statements and then presenting a meaningful error message. But that's almost exactly the same code you already rejected as inelegant.
Other than that, does it in any way reduce parser complexity? It lets you assume that a break statement is legally placed, but you still have to figure out where the break statement's destination is. And other than that, there aren't really a lot of evident advantages, IMHO.
But if you want to do it, go for it.
Once you've done that, you could try modifying your grammar so that break, continue, goto and return cannot be followed by an unlabelled statement. That sounds like a good idea, and some languages do it. It can certainly be done in the grammar. (But before you get too enthusiastic, remember that some programmers do deliberately create dead code during debugging sessions, and they won't thank you for making it impossible.)
There is a BNF extension, used in the ECMAscript standard, amongst others, which parameterizes non-terminals with a list of features, each of which can be present or not. These parameters can then be used in productions, either as conditions or to be passed through to non-terminals on the right-hand side. This could be used to generate three versions of statement, using the features [continue] and [break], which would be used as gates on those respective statement syntaxes, and also passed through to the compound statement non-terminals.
I don't know of a parser generator capable of handling such parameterised rules, so I can't offer it as a concrete suggestion, but this question is one of the use cases which motivated parameterised non-terminals. (In fact, I believe it's one of the uses, but I might be remembering that wrong.)
With an ECMAScript-style formalism, this grammatical restriction could be written without duplicating rules. The duplication would still be there, under the surface, since the parser generator would have to macro expand the templated rules into their various boolean possibilities. But the grammar is a lot more readable and the size of the state machine is not so important these days.
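As a rough illustration of the notation (invented names, not copied from any real specification), the toy grammar above might be written along these lines:

Statement[Break, Continue] :
    [+Break] break ;
    [+Continue] continue ;
    LoopStatement
    WriteStatement

LoopStatement :
    while ( Condition ) StatementList[+Break, +Continue] endwhile

StatementList[Break, Continue] :
    [empty]
    Statement[?Break, ?Continue] StatementList[?Break, ?Continue]

The [+Break] and [+Continue] guards admit a production only when the corresponding parameter is set, and [?Break] passes the caller's setting through unchanged.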
I have no doubt that it would be a useful feature, but I also suspect that it would be overused, with the result that the quality of error messages would be reduced.
As a general rule, compilers should be optimised for correct inputs, with the additional goal of producing helpful error messages for invalid input. Complicating the grammar even a little to make easily described errors into syntax errors does not help with either of these goals. If it's possible to write a few lines of code to produce the correct error message for a detected problem, instead of emitting a generic syntax error, I would definitely do that.
There are (many) other use cases for the ECMAScript BNF extensions. For example they make it much easier to describe a syntax whose naive grammar requires two or three lookahead tokens.

How to process macros in LEX?

How do I implement #define in yacc/bison?
For Example:
#define f(x) x*x
If anywhere f(x) appears in any function then it is replaced by the right side of the
macro substituting for the argument ‘x’.
For example, f(3) would be replaced with 3*3. The macro can call another macro too.
It's not usually possible to do macro expansion inside a parser, at least not C-style macros, because C-style macro expansion doesn't respect syntax. For example
#define IF if(
#define THEN )
is legal (although very bad style IMHO). But for that to be handled inside the grammar, it would be necessary to allow a macro identifier to appear anywhere in the input, not just where an identifier might be expected. The necessary modifications to the grammar are going to make it much less readable and are very likely to introduce parser action conflicts. [Note 1]
Alternatively, you could do the macro expansion in the lexical analyzer. The lexical analyzer is not a parser, but parsing a C-style macro invocation doesn't require much sophistication, and if macro parameters were not allowed, it would be even simpler. This is how Flex handles macro replacement in its regular expressions ({identifier}, for example). [Note 2] Since Flex macros are just raw character sequences, not token lists as with C-style macros, they can be handled by pushing the replacement text back into the input stream. (F)lex provides the unput special action for this purpose. unput pushes one character back into the input stream, so if you want to push an entire macro replacement, you have to unput it one character at a time, back to front, so that the last character unput is the first one to be read afterwards.
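A minimal sketch of that back-to-front loop, as it might sit in the user-code section of a Flex scanner (the helper name is just for illustration; <string.h> is needed for strlen):

/* Push a replacement string back into the input, last character first,
 * so that it is re-scanned in its original order. */
static void push_back_text(const char *s)
{
    size_t n = strlen(s);
    while (n > 0)
        unput(s[--n]);
}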
That's workable but ugly. And it's not really scalable to even the small feature list provided by the C preprocessor. And it violates the fundamental principle of software design, which is that each component does just one thing (so that it can do it well).
So that leaves the most common approach, which is to add a separate macro processor component, so that instead of dividing the parse into lexical scan/syntax analysis, the parse becomes lexical scan/macro expansion/syntax analysis. [Note 3]
A C-style macro processor which works between the lexical analyser and the syntactic analyser could itself be written in Bison. As I mentioned above, the parsing requirements are generally minimal, but there is still parsing to be done and Bison is presumably already part of the project. Although I don't know of any macro processor (other than proof-of-concept programs I've written myself) which does this, I think it's a very flexible solution. In particular, the Bison syntactic analysis phase could be implemented with a push-parser, which avoids the need to produce the entire macro-expanded token stream in order to make it available to a traditional pull-parser.
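A rough sketch of such a driver loop, assuming %define api.push-pull push in the grammar and a hypothetical next_expanded_token() that lexes and performs the macro expansion:

/* Feed the push parser one macro-expanded token at a time. */
yypstate *ps = yypstate_new();
YYSTYPE val;
int tok, status;

do {
    tok = next_expanded_token(&val);      /* returns 0 at end of input */
    status = yypush_parse(ps, tok, &val);
} while (status == YYPUSH_MORE);

yypstate_delete(ps);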
That's not the only way to design macros, though. Indeed, it has a lot of shortcomings, because the macro expansions are not hygienic, respecting neither syntax nor scope. Probably anyone who has used C macros has at one time or other been bitten by these problems; the simplest manifestation is defining a macro like:
#define NEXT(a) a + 1
and then writing
int x = NEXT(a) * 3;
which is not going to produce the expected result: it expands to int x = a + 1 * 3;, so the multiplication binds to the 1 rather than to the whole replacement. Also, any macro expansion which needs to use a local variable will sooner or later produce an incorrect expansion because of an unexpected name collision. Hygienic macro expansion seeks to solve these issues by viewing macro expansion as an operation on syntax trees, not token streams, making the parsing paradigm lexical scan/syntax analysis/macro expansion (of the parse tree). For that operation, the appropriate tool might well be some kind of tree parser.
Notes
Also, you'd want to remove the token from the parse tree. Yacc/bison does have a poorly-documented feature, YYBACKUP, which might possibly help accomplish this. I don't know if that's one of its intended use cases; indeed, it is not clear to me what its intended use cases are.
The (f)lex documentation calls these definitions, but they really are macros, and they suffer from all the usual problems macros bring with them, such as mysterious interactions with surrounding syntax.
Another possibility is macro expansion/lexical scan/syntax analysis, which could be implemented using a macro processor like M4. But that completely divorces the macros from the rest of the language.
Yacc and lex generate C source at the end, so you can use macros inside the parser and lexer actions.
The actual #define preprocessor directives can go in the first section of the lexer and parser file:
%{
// Somewhere here
#define f(x) x*x
%}
These sections will be copied verbatim to the generated C source.

Is GLR algorithm a must when bison parsing C grammar?

I'm trying to study C grammar with flex/bison.
I found bison cannot parse this bison grammar: https://www.lysator.liu.se/c/ANSI-C-grammar-y.html, because the LALR algorithm cannot process recursively multiple expressions.
Is GLR algorithm a must for C grammar?
There is nothing wrong with that grammar except:
it represents a very old version of C
it requires a lexical analyser which can somehow distinguish between IDENTIFIER and TYPE_NAME
it does not even attempt to handle the preprocessor phases
Also, it has one shift/reduce conflict as a result of the "dangling else" ambiguity. However, that conflict can be ignored because bison's conflict resolution algorithm produces the correct result in this case. You can suppress the warning either with an %expect directive or by including a precedence declaration which favours shifting else over reducing if, or you can eliminate the ambiguity in the grammar itself using the technique described in the Wikipedia article on the dangling else. (Note: I'm not talking about copy-and-pasting code from that article; in the case of C, you need to consider all cases of compound statements which terminate with an if statement.)
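For reference, the precedence-declaration version is usually written roughly like this (using the token names from the linked grammar; LOWER_THAN_ELSE is a dummy token that exists only to carry a precedence level):

%nonassoc LOWER_THAN_ELSE
%nonassoc ELSE
%%
selection_statement
    : IF '(' expression ')' statement %prec LOWER_THAN_ELSE
    | IF '(' expression ')' statement ELSE statement
    ;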
Moreover, an LR parser is not recursive, and it has no problems which could be described as a failure to "process recursively multiple expressions". (You might have that problem with a recursive descent parser, although it's pretty easy to work around the issue.)
So any problems you might have experienced (if your question refers to a concrete issue) have nothing to do with what's described in your question.
Of the problems I listed above, the most troubling is the second one: the lexer must somehow resolve the syntactic ambiguity of the cast operator. The cast operator is not actually ambiguous; clearly, C compilers manage to compile such expressions correctly. But distinguishing between the two possible parses of, for example, (x)-y*z requires knowing whether x names a type or a variable.
In C, all names are lexically scoped, so it is certainly possible to resolve x at compile time. But the resolution is not context-free. Since GLR is also a technique for parsing context-free grammars, using a GLR parser won't directly help you. It might be useful in the sense that GLR parsers can theoretically produce "parse forests" rather than parse trees; that is, the output of a GLR parser might effectively contain all possible correct parses, leaving the possibility to resolve the ambiguity by building symbol tables for each scope and then choosing between alternative parses by examining the name binding in effect at each site. (This works because type alias declarations -- "typedefs" -- are not ambiguous, so all the potential parses will have the same alias declarations.)
The usual solution, though, is to parse the program text using a deterministic parser, maintaining a symbol table during the parse, and giving the lexical analyser access to this symbol table so that it can distinguish between IDENTIFIER and TYPE_NAME, as expected by the grammar you link. This technique is politely called "lexical feedback", although it's also often called "the lexer hack".
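In Flex terms, the feedback usually comes down to a rule along these lines (is_type_name is a hypothetical lookup into the symbol table the parser maintains):

[a-zA-Z_][a-zA-Z_0-9]*   {
        yylval.str = strdup(yytext);
        return is_type_name(yytext) ? TYPE_NAME : IDENTIFIER;
    }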

When writing a compiler, how are tokens checked?

Does a compiler use if statements when deciding what to do when a certain keyword is encountered, and should someone writing a compiler use them for most operations when checking code? Or is there a more efficient way? For example, when I test a symbol against a symbol table and it comes back as a valid "token", do I have to use an if statement to determine what to do for every single keyword? That seems rather inefficient. For example, the pseudocode:
/*Each keyword/token in my compiler has a numerical representation which is what the symbol table returns back for example #define IF 0 and so on*/
if(Token == IF){
//This will be done to generate the AST representation for IF statements
}else if(Token == ELSE){
//This will be done to generate the AST representation of an else statement
}else if(Token == INT){
//This will be done to generate the AST representation of an integer
}
What kind of compilers do you mean?
If performance matters, you may want something like a callback table: use the keyword as the key and the callback function as the value, so the pseudo code would look like this:
func *fp = funcTbl.get(Token);
if (fp) { fp(); }
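In C, that table-driven dispatch might be sketched like this (handler names are made up for illustration; the token codes are the small integers the symbol table already returns):

typedef void (*handler_fn)(void);

static void handle_if(void)   { /* build the AST node for an if statement */ }
static void handle_else(void) { /* build the AST node for an else branch  */ }
static void handle_int(void)  { /* build the AST node for an int literal  */ }

/* Indexed by token code, e.g. #define IF 0, #define ELSE 1, ... */
static const handler_fn handlers[] = {
    [IF]   = handle_if,
    [ELSE] = handle_else,
    [INT]  = handle_int,
};

/* in the main loop: */
handler_fn fp = handlers[Token];
if (fp) fp();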
You may try recursive descent too; the function related to each keyword gets called just where it is expected to be.
Last but not least, what you wrote is OK as well.
Assuming you have already split your source input from its string representation into a series of lexical tokens, your next step is to use a parser to build an AST from your tokens.
The parsing stage of compilation achieves two main goals:
It checks your language for syntactic correctness, throwing an error if your input cannot be parsed according to the structure of your grammar.
It generates an AST representation of your source code
Does a compiler use if statements when deciding what to do if a certain keyword is encountered?
No, your parser should analyse the series of lexical tokens and check them against the structure of your language's grammar.
Parsing is a well understood topic in computer science which can be approached in different ways. It cannot be implemented trivially in the style of the example code fragment you have provided above. In a realistic programming language you need to consider that grammars can be ambiguous, that a simple predictive parser is not appropriate for all grammars, and that some kind of backtracking (or a more powerful parsing technique) may be needed. If you do not understand these concepts, I recommend you use a parser generator such as Bison.
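For instance, in Bison the grammar rule itself decides what happens when an if statement is recognised, so there is no chain of token-code comparisons at all (mk_if is a hypothetical AST constructor):

statement
    : IF '(' expression ')' statement ELSE statement
        { $$ = mk_if($3, $5, $7); }
    ;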
An overview of the most important stages of compilation (lexical analysis, parsing, semantic analysis, and code generation) may help you to understand its pipeline structure. This process has been refined over decades of work on how best to 'divide and conquer' such a mammoth task, and I strongly encourage you to follow it.
For further reading, check out Modern Compiler Implementation in Java by Andrew Appel.

Match functions and function calls in C using regex

I am fairly new to regexes, so I wrote the following simple regex using positive lookahead that detects functions and function calls in a C source file:
\w+(?=\s*\()
It works fine, but the problem is that it detects non-function syntaxes like if(), while() etc. too.
I can easily avoid this by saying:
(if(?!\()) | (while(?!\())
But the problem is how to combine the second regex with the first one? I can't OR them, because the first one still matches if(), while() etc., and in an OR expression it's enough if one of the terms matches.
How do I combine these regexes, or write a better, simpler one which will not match non-function syntaxes like if() and while()?
PS: I use the following tools to test my regexes
GSkinner
RegexPal
There are quite a lot of assumptions involved when you search for function calls in C with a regex. That aside, if you are happy with what is matched (there are valid function calls that will not be matched), and you want to exclude if and while from the result list, you can use the following regex:
(?!\b(if|while|for)\b)\b\w+(?=\s*\()
The regex uses the word boundary \b to make sure that the whole name is matched (preventing a partial match of hile in while), and that the whole name is not a keyword (preventing rejection of whilenothinghappens).
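If you want to try it from the command line, GNU grep's PCRE mode handles the lookarounds, for example:

grep -Po '(?!\b(if|while|for)\b)\b\w+(?=\s*\()' file.c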
