How to properly scan for identifiers using Ragel

I'm trying to write a scanner for the C/C++/C#/Java/D-like programming language that I'm designing for personal reasons. For this task I'm using Ragel to generate my scanner. I'm having trouble understanding exactly when a lot of the operators trigger actions, probably because my academics were focused on practical knowledge rather than theory, and a great deal of this non-deterministic/deterministic finite automata business goes right over my head. I find the documentation either to be lacking, or my understanding of it to be; I'm assuming the latter.
In any case, I'm working my way up from the basics. I've identified several keywords and special characters in my first iteration. Now I've run into the issue where all keywords are being scanned as identifiers. I'm using the scanner operator for all of my keywords, as that resolved my issue of the string "returns" being scanned as both the return and the returns keyword.
How can I properly scan for identifiers? I understand that to make this deterministic, I need to effectively specify that a lexeme can only be an identifier if it matches no other token's pattern. Forgive my lack of knowledge.
Ragel Script:
%%{
    Identifier = (alpha | '_') . (alnum | '_')*;

    action IdentifierAction
    {
        std::cout << "identifier(\"";
        std::cout.write(ts, te - ts);
        std::cout << "\")";
    }
}%%
%%{
    main :=
    |*
        Interface => InterfaceAction;
        Class => ClassAction;
        Property => PropertyAction;
        Function => FunctionAction;
        TypeQualifier => TypeQualifierAction;
        OpenParenthesis => OpenParenthesisAction;
        CloseParenthesis => CloseParenthesisAction;
        OpenBracket => OpenBracketAction;
        CloseBracket => CloseBracketAction;
        OpenBrace => OpenBraceAction;
        CloseBrace => CloseBraceAction;
        Semicolon => SemicolonAction;
        Returns => ReturnsAction;
        Return => ReturnAction;
        Identifier => IdentifierAction;
        space+;
    *|;
}%%

Not familiar with Ragel, but I have done some custom parsers and scanners.
Your question seems to relate more to detecting keywords than to detecting generic identifiers.
You have rules telling Ragel to detect when a section of the code is a number, the "return" keyword, a semicolon, the "returns" keyword, an identifier, and so on. Although it's possible to make a rule for each keyword, I wouldn't recommend it.
What I have learned by experience is that it is better to read all keywords explicitly as identifiers (assign a general "identifier" token), and detect in some part of your C/C++ code which identifiers are keywords.
In other words: Ragel will detect only identifiers. "myvar", "return" and "returns" will all be marked as identifiers. Later, in the code of your semantic action (C/C++, not Ragel), you will check each identifier and detect whether it is a keyword. This is usually done by having a list of keywords.
I think it will be something like this:
%%{
    Identifier = (alpha | '_') . (alnum | '_')*;

    action IdentifierAction
    {
        // Keyword table; every scanned identifier is checked against it.
        static const std::string Keywords[] =
        {
            "return",
            "if",
            "else"
        };
        // Build a string from the token boundaries Ragel provides.
        std::string MyIdentifier(ts, te - ts);
        if (SearchKeywordCode(Keywords, MyIdentifier)) {
            std::cout << "keyword(\"";
            std::cout.write(ts, te - ts);
            std::cout << "\")";
        }
        else {
            std::cout << "identifier(\"";
            std::cout.write(ts, te - ts);
            std::cout << "\")";
        }
    }
}%%
So there would not be a "Return" or a "Returns" rule, just "Identifier".
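SearchKeywordCode is left to the host program; here is a minimal sketch of one way it could be written (the name and signature are only the ones assumed in the code above, not anything Ragel provides):

#include <algorithm>
#include <cstddef>
#include <string>

// Linear scan over the keyword table; fine for a handful of keywords.
// Switch to a sorted table or a hash set if the list grows.
template <std::size_t N>
bool SearchKeywordCode(const std::string (&keywords)[N],
                       const std::string &identifier)
{
    return std::find(keywords, keywords + N, identifier) != keywords + N;
}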

Related

Yacc actions return 0 for any variable ($)

I'm new to Lex/Yacc. I found these lex & yacc files, which parse ANSI C.
To experiment I added an action to print part of the parsing:
constant
: I_CONSTANT { printf("I_CONSTANT %d\n", $1); }
| F_CONSTANT
| ENUMERATION_CONSTANT /* after it has been defined as such */
;
The problem is, no matter where I put the action, and whatever $X I use, I always get value 0.
Here I got printed:
I_CONSTANT 0
Even though my input is:
int foo(int x)
{
return 5;
}
Any idea?
Nothing in the lex file you point to actually sets semantic values for any token. As the author says, the files are just a grammar and "the bulk of the work" still needs to be done. (There are other caveats having to do with the need for a preprocessor.)
Since nothing in the lex file ever sets yylval, it will always be 0, and that is what yacc/bison will find when it sets up the semantic value for the token ($1 in this case).
Turns out yylval = atoi(yytext) is not done in the lex file, so I had to add it myself. I also learned that I can add extern char *yytext to the yacc file header, and then use yytext directly.
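For reference, a sketch of those two changes (the integer-constant pattern below is illustrative, not the exact one from those files):

/* in the .l file: set the semantic value before returning the token
   (stdlib.h must be included for atoi) */
[0-9]+    { yylval = atoi(yytext); return I_CONSTANT; }

/* in the .y file header: make the matched text visible to actions */
%{
extern char *yytext;
%}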

How to disable parsing for a piece of text in a file?

The structure of my file is:
`pragma TOKEN1_NAME TOKEN1_VALUE
`pragma TOKEN2_NAME TOKEN2_VALUE
`pragma TOKEN3_NAME TOKEN3_VALUE
`pragma TOKEN4_NAME TOKEN4_VALUE
VHDL_TEXT{
// A valid VHDL text goes here.
}
`pragma TOKEN2_NAME TOKEN2_VALUE
VHDL_TEXT{
// VHDL text
}
I need to pass the VHDL text as it is to the output file. I can do that by making a default rule at the end of the lex file:
Rule: . { append_to_buffer(*yytext); }
I also have list of other rules in my Lex file to deal with the tokens.
The problem I am having is how to deal with the situation in which the VHDL text also contains some of the tokens that can be recognized by the Lex rules.
In other words, I want to disable detection of further valid tokens once I have found the text I am interested in, and start detection again once it is over.
As rici points out indirectly, you need to be able to distinguish between occurrences of the trailing delimiter '}' for your rule and occurrences of the right curly bracket in a valid VHDL design specification or portion.
See IEEE Std 1076-2008, 15.3 Lexical elements, separators, and delimiters, where we find that '{' and '}' are not used as delimiters in VHDL.
They fall among the other special characters (15.2 Character set, using ISO/IEC 8859-1:1998) that require handling wherever graphic characters may appear:
graphic_character ::=
basic_graphic_character | lower_case_letter | other_special_character
These include extended identifiers (15.4.3), character literals (15.6), string literals (15.7), bit string literals (15.8), comments (15.9) and tool directives (15.11).
There's a need to identify these lexical elements within the VHDL text; otherwise a '}' occurring inside one of them would be mistaken for the delimiter of your rule.
Only one tool directive is currently defined (24.1 Protect tool directives), wherein the use of the two curly bracket characters would be contained in VHDL lexical elements. All other uses in lexical elements are directly delimited. (And you could disclaim tool directive support; in VHDL, tool directives basically invoke separate lexical, syntactic and semantic analysis.)
Essentially, you need to operate a VHDL lexical analyzer while traversing the 'VHDL text', in which your rule delimiter, the right curly bracket, will stand out like a sore thumb (as an exception, serving as the closing delimiter for the VHDL text).
And about now you'd get the idea that life would be easier if you could deal with the VHDL by reference instead, if possible. Your mechanism is as complex as including tool directives in VHDL (which can be done with a preprocessor, as could your VHDL text).
This is in response to the vhdl tag added by FUZxxl.
When you have essentially different languages in a source file that you need to deal with that have clear demarcation tokens (like your VHDL_TEXT markers) that can be easily recognized by the lexer, the easiest thing to do is to use flex exclusive start states (%x). In your case, you would do something like:
%{
/* some global vars for holding aux state */
static int brace_depth;
static Buffer vhdl_text;
%}
%x VHDL
%%
.. normal lexer rules for your non-vhdl stuff
VHDL_TEXT[ \t]*"{" { brace_depth = 1;
BufferClear(vhdl_text);
BEGIN(VHDL); }
<VHDL>"{" { BufferAppend(vhdl_text, *yytext);
brace_depth++; }
<VHDL>"}" { if (--brace_depth == 0) {
BEGIN(INITIAL);
yylval.buf = BufferExtract(vhdl_text);
return VHDL_TEXT; }
BufferAppend(vhdl_text, *yytext); }
<VHDL>--.*\n { BufferAppendString(vhdl_text, yytext); }
<VHDL>\"[^"\n]*\" { BufferAppendString(vhdl_text, yytext); }
<VHDL>\\[^\\\n]*\\ { BufferAppendString(vhdl_text, yytext); }
<VHDL>.|\n { BufferAppend(vhdl_text, *yytext); }
This will gather up everything between the curly braces in VHDL_TEXT {...} and return it to your parser as a single token (matching nested braces properly, if there are any in the VHDL text.) You can do macro substitution-like stuff in the VHDL code by adding a rule like:
<VHDL>{IDENT} { if (Macro *mac = lookup_macro_by_name(yytext)) {
BufferAppendString(vhdl_text, mac->replacement);
} else {
BufferAppendString(vhdl_text, yytext); } }
You also probably want a <VHDL><<EOF>> rule to detect a missing closing } on the vhdl text and give an appropriate error message.
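A sketch of such a rule (the error reporting here assumes the usual yyerror of a bison-based parser):

<VHDL><<EOF>> { yyerror("unterminated VHDL_TEXT block: missing '}'");
                BEGIN(INITIAL);
                return 0; }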

Recognising multiple new_lines in PEGKit

I am learning how to use PEGKit, but am running into a problem with creating a grammar for a script that parses lines, even when they are separated by multiple line break characters. I have reduced the problem to this grammar:
expr
#before {
PKTokenizer *t = self.tokenizer;
self.silentlyConsumesWhitespace = NO;
t.whitespaceState.reportsWhitespaceTokens = YES;
self.assembly.preservesWhitespaceTokens = YES;
}
= Word nl*;
nl = nl_char nl_char*;
nl_char = '\n'! | '\r'!;
It seems to me that this simple grammar should allow one word per line, with as many line breaks as necessary. But it only allows one word with an optional line break. Does anybody know what's wrong here? Thank you.
Creator of PEGKit here.
Try the following grammar instead (make sure you are using HEAD of master):
#before {
PKTokenizer *t = self.tokenizer;
[t.whitespaceState setWhitespaceChars:NO from:'\\n' to:'\\n'];
[t.whitespaceState setWhitespaceChars:NO from:'\\r' to:'\\r'];
[t setTokenizerState:t.symbolState from:'\\n' to:'\\n'];
[t setTokenizerState:t.symbolState from:'\\r' to:'\\r'];
}
lines = line+;
line = ~eol* eol+; // note the `~` Not unary operator. this means "zero or more NON eol tokens, followed by one or more eol token"
eol = '\n'! | '\r'!;
Note that here, I am tweaking the tokenizer to recognize newlines and carriage returns as Symbols rather than whitespace. That makes them easier to match and discard (they are discarded by the ! operator).
For another approach to the same problem using the builtin S whitespace rule, see here.

parsing with bison

I bought Flex & Bison from O'Reilly but I'm having some trouble implementing a parser (breaking things down into tokens was no big deal).
Suppose I have a huge binary string and what I need to do is add the bits together - every bit is a token:
[0-1] { return NUMBER;}
1101010111111
Or for that matter a collection of tokens with no "operation".
Would such a grammar be correct?
calclist :
  | calclist expr EOL { eval($2); }
  ;
expr : NUMBER
  | expr NUMBER { $$ = $1 + $2; }
  ;
or is there a better way to do it?
Your example lex rule "[0-1] { return NUMBER; }" doesn't set yylval, so if you use that value in your grammar (as you do in the rule "expr NUMBER { $$=$1+$2; }") you'll get garbage.
In general what you're doing is correct, though the task you've chosen is so trivial that lex/bison is serious overkill.
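A minimal fix for the missing semantic value, assuming yylval has the default int type:

[0-1]   { yylval = *yytext - '0'; return NUMBER; }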

Building a lexer in C

I want to build a lexer in C and I am following the dragon book. I can understand the state transitions, but how do I implement them?
Is there a better book?
The issue is that I have to run a string through a number of states so that I can tell whether the string is acceptable or not.
You can implement simple state transitions with a single state variable: for example, if you want to cycle through the states start -> part1 -> part2 -> end, you can use an enum to keep track of the current state and a switch statement for the code you want to run in each state.
enum state { start = 1, part1, part2, end } mystate;
// ...
mystate = start;
do {
    switch (mystate) {
    case start:
        // ... work for the start state, then advance
        mystate = part1;
        break;
    case part1:
        // ...
        mystate = part2;
        break;
    case part2:
        // ...
        if (part2_end_condition) mystate = end; // mystate++ will also work
        // Note you could also set the state back to part1 on some condition here,
        // which creates a loop
        break;
    }
} while (mystate != end);
For more complex state transitions that depend on several variables, you should use tables/arrays like this:

var1  var2  var_end   next_state
 0     0      0       state1
 0     1      0       state2
 1     0      0       state3
 1     1      0       state4
-1    -1      1       state_end   // -1 represents "doesn't matter" here
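A sketch of how such a table could be encoded in C (the states and variables are the hypothetical ones from the table above):

#include <stddef.h>

enum state { state1, state2, state3, state4, state_end };

struct transition {
    int var1, var2, var_end;   /* -1 means "doesn't matter" */
    enum state next;
};

static const struct transition table[] = {
    {  0,  0, 0, state1 },
    {  0,  1, 0, state2 },
    {  1,  0, 0, state3 },
    {  1,  1, 0, state4 },
    { -1, -1, 1, state_end },
};

static int matches(int want, int have) { return want == -1 || want == have; }

/* Look up the next state for the current variable values. */
static enum state next_state(int var1, int var2, int var_end)
{
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++)
        if (matches(table[i].var1, var1) &&
            matches(table[i].var2, var2) &&
            matches(table[i].var_end, var_end))
            return table[i].next;
    return state_end;  /* no row matched: treat as an error/end state */
}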
G'day,
Assuming you mean The Dragon book on compiler design, I'd recommend having a look around this page on compiler tools.
The page itself is quite small but has links through to various excellent resources on lexical analysers.
HTH
cheers,
There's more than one way to do it. Every regular expression corresponds directly to a simple structured program. For example, an expression for numbers could be this:
// regular expression
digit* [.digit*]
and the corresponding C code would be:
// corresponding code; DIGIT(c) is assumed to be a macro or
// function that tests whether c is a digit
while (DIGIT(*pc)) pc++;
if (*pc == '.') {
    pc++;
    while (DIGIT(*pc)) pc++;
}
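As a self-contained sketch of the same idea, with the standard isdigit standing in for the DIGIT macro:

#include <ctype.h>

/* Scan a number (digits with an optional fractional part)
   and return a pointer just past its last character. */
static const char *scan_number(const char *pc)
{
    while (isdigit((unsigned char)*pc)) pc++;
    if (*pc == '.') {
        pc++;
        while (isdigit((unsigned char)*pc)) pc++;
    }
    return pc;
}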
The transition-table way of building lexers is, in my opinion, needlessly complicated, and obviously runs slower.
If you're looking for a more modern treatment than the dragon book(s): Andrew W. Appel and Maia Ginsburg, Modern Compiler Implementation in C, Cambridge University Press, 2008.
Chapter 2 is focused on lexical analysis: lexical tokens, regular expressions, finite automata, nondeterministic finite automata, and lexical analyzer generators.
Look at the Table of Contents
The program flex (a clone of lex) will create a lexer for you.
Given an input file with the lexer rules, it will produce a C file with an implementation of a lexer for those rules.
You can thus check the output of flex for how to write a lexer in C. That is, if you don't just want to use flex's lexer...
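For example, a minimal flex input along these lines (the token set here is made up) can be compiled and the generated lex.yy.c inspected:

%option noyywrap
%%
[0-9]+                  { printf("NUMBER(%s)\n", yytext); }
[A-Za-z_][A-Za-z0-9_]*  { printf("IDENT(%s)\n", yytext); }
[ \t\n]+                ; /* skip whitespace */
.                       { printf("CHAR(%c)\n", *yytext); }
%%
int main(void) { yylex(); return 0; }

Build it with something like: flex lexer.l && cc lex.yy.c -o lexer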
