REPL for interpreter using Flex/Bison - c

I've written an interpreter for a C-like language, using Flex and Bison for the scanner/parser. It's working fine when executing full program files.
Now I'm trying to implement a REPL in the interpreter for interactive use. I want it to work like the command-line interpreters in Ruby or ML:
Show a prompt
Accept one or more statements on the line
If the expression is incomplete
    display a continuation prompt
    allow the user to continue entering lines
When the line ends with a complete expression
    echo the result of evaluating the last expression
    show the main prompt
My grammar starts with a top_level production, which represents a single statement in the language. The lexer is configured for interactive mode on stdin. I am using the same scanner and grammar in both full-file and REPL modes, because there's no semantic difference in the two interfaces.
My main evaluation loop is structured like this.
while (!interpreter.done) {
    if (interpreter.repl)
        fputs(prompt, stdout);  // not printf(prompt): a '%' in the prompt would be read as a format
    int status = yyparse(interpreter);
    if (status) {
        if (interpreter.error)
            report_error(interpreter);
    }
    else {
        if (interpreter.repl)
            puts(interpreter.result);
    }
}
This works fine except for the prompt and echo logic. If the user enters multiple statements on a line, this loop prints out superfluous prompts and expressions. And if the expression continues on multiple lines, this code doesn't print out continuation prompts. These problems occur because the granularity of the prompt/echo logic is a top_level statement in the grammar, but the line-reading logic is deep in the lexer.
What's the best way to restructure the evaluation loop to handle the REPL prompting and echoing? That is:
how can I display one prompt per line
how can I display the continuation prompt at the right time
how can I tell when a complete expression is the last one on a line
(I'd rather not change the scanner language to pass newline tokens, since that will severely alter the grammar. Modifying YY_INPUT and adding a few actions to the Bison grammar would be fine. Also, I'm using the stock Flex 2.5.35 and Bison 2.3 that ship with Xcode.)

After looking at how languages like Python and SML/NJ handle their REPLs, I got a nice one working in my interpreter. Instead of having the prompt/echo logic in the outermost parser driver loop, I put it in the innermost lexer input routine. Actions in the parser and lexer set flags that control the prompting done by the input routine.
I'm using a reentrant scanner, so yyextra contains the state passed between the layers of the interpreter. It looks roughly like this:
typedef struct Interpreter {
    char* ps1;          // prompt to start statement
    char* ps2;          // prompt to continue statement
    char* echo;         // result of last statement to display
    BOOL eof;           // set by the EOF action in the parser
    char* error;        // set by the error action in the parser
    BOOL completeLine;  // managed by yyread
    BOOL atStart;       // true before scanner sees printable chars on line
    // ... and various other fields needed by the interpreter
} Interpreter;
The lexer input routine:
size_t yyread(FILE* file, char* buf, size_t max, Interpreter* interpreter)
{
    // Interactive input is signaled by yyin==NULL.
    if (file == NULL) {
        if (interpreter->completeLine) {
            if (interpreter->atStart && interpreter->echo != NULL) {
                fputs(interpreter->echo, stdout);
                fputs("\n", stdout);
                free(interpreter->echo);
                interpreter->echo = NULL;
            }
            fputs(interpreter->atStart ? interpreter->ps1 : interpreter->ps2, stdout);
            fflush(stdout);
        }
        char ibuf[max+1]; // fgets needs an extra byte for \0
        size_t len = 0;
        if (fgets(ibuf, max+1, stdin)) {
            len = strlen(ibuf);
            memcpy(buf, ibuf, len);
            // Show the prompt next time if we've read a full line.
            // (The len > 0 check avoids reading ibuf[-1] on an empty read.)
            interpreter->completeLine = (len > 0 && ibuf[len-1] == '\n');
        }
        else if (ferror(stdin)) {
            // TODO: propagate error value
        }
        return len;
    }
    else { // not interactive
        size_t len = fread(buf, 1, max, file);
        if (len == 0 && ferror(file)) {
            // TODO: propagate error value
        }
        return len;
    }
}
The top level interpreter loop becomes:
while (!interpreter->eof) {
    interpreter->atStart = YES;
    int status = yyparse(interpreter);
    if (status) {
        if (interpreter->error)
            report_error(interpreter);
    }
    else {
        exec_statement(interpreter);
        if (interactive)
            interpreter->echo = result_string(interpreter);
    }
}
The Flex file gets these new definitions:
%option extra-type="Interpreter*"
#define YY_INPUT(buf, result, max_size) result = yyread(yyin, buf, max_size, yyextra)
#define YY_USER_ACTION if (!isspace((unsigned char)*yytext)) { yyextra->atStart = NO; }
The YY_USER_ACTION handles the tricky interplay between tokens in the language grammar and lines of input. My language is like C and ML in that a special character (';') is required to end a statement. In the input stream, that character can either be followed by a newline character to signal end-of-line, or it can be followed by characters that are part of a new statement. The input routine needs to show the main prompt if the only characters scanned since the last end-of-statement are newlines or other whitespace; otherwise it should show the continuation prompt.

I'm working on a similar interpreter but haven't gotten to the point of building a REPL yet, so my answer may be somewhat vague.
Is it acceptable if given a sequence of statements on a single line, only the result of the last expression is printed? Because you can re-factor your top level grammar rule like so:
top_level : top_level statement | statement ;
The output of your top_level then could be a linked list of statements, and interpreter.result would be the evaluation of the tail of this list.

Related

lexical analysis stops after yy_scan_string() is finished

I'm using flex to build a lexical analyzer. I want to analyse preprocessor-style definitions of the form: #define identifier identifier_string. I keep a list of (identifier, identifier_string) pairs. When I reach an identifier in the file that is in the #define list, I need to switch the lexical analysis from the main file to the corresponding identifier_string.
(I'm not posting the complete flex code because it is too big.)
Here's the relevant part:
{IDENTIFIER} { // search if the identifier is in list
    if( p = get_identifier_string(yytext) )
    {
        puts("DEFINE MATCHED");
        yypush_buffer_state(yy_scan_string(p));
    }
    else //if not in list just print the identifier
    {
        printf("IDENTIFIER %s\n",yytext);
    }
}
<<EOF>> {
    puts("EOF reached");
    yypop_buffer_state();
    if ( !YY_CURRENT_BUFFER )
    {
        yyterminate();
    }
    loop_detection = 0;
}
The analysis of the identifier_string executes just fine. When EOF is reached, I want to switch back to the initial buffer and resume the analysis, but the scanner finishes after just printing "EOF reached".
Although that approach seems logical, it won't work because yy_scan_string replaces the current buffer, and that happens before the call to yypush_buffer_state. Consequently, the original buffer is lost, and when yypop_buffer_state is called, the restored buffer state is the (now terminated) string buffer.
So you need a little hack: first, duplicate the current buffer state onto the stack, and then switch to the new string buffer:
/* Was: yypush_buffer_state(yy_scan_string(p)); */
yypush_buffer_state(YY_CURRENT_BUFFER);
yy_scan_string(p);

Does there exist an elegant way to implement a "while-then-do" loop?

A standard do-while implements the following logic:
do_something();
while(loop_condition) {
    do_something();
}
Is there a common (i.e. existing in C or Java or some other frequently used language) construct to implement the following?
while(loop_condition) {
    do_something();
}
do_something();
Suppose for example that I need to read a file line by line and do something whenever I hit every line beginning with \t or on EOF. Is there a less error-prone way of doing this without offloading the loop contents to a function?
How about the following:
loop_again = true;
while (loop_again) {
    loop_again = loop_condition;
    do_something();
}
or, alternatively:
do {
    loop_again = loop_condition;
    do_something();
} while (loop_again);
I suppose you could write something like this:
bool should_keep_going = true;
while (should_keep_going) {
    should_keep_going = loop_condition;
    do_something();
}
I'm not sure how clear that is. It might work better with a realistic example, where we could potentially name should_keep_going in a more meaningful or intuitive way.
Suppose for example that I need to read a file line by line and do something whenever I hit every line beginning with \t or on EOF. Is there a less error-prone way of doing this without offloading the loop contents to a function?
Well, a line might begin with a non-tab character and then hit EOF before the end of the line. I'm assuming that you meant "EOF after the last line".
This code executes what you seem to be describing:
while ( 1 )
{
    char *ptr = fgets(buf, sizeof buf, fp);
    if ( !ptr || ptr[0] == '\t' )
        do_something();
    if ( !ptr )
        break;
}
You could convert the !ptr check into a loop condition if you want.

Bus Error on void function return

I'm learning to use libcurl in C. To start, I'm using a randomized list of accession names to search for protein sequence files that may be found hosted here. These follow a set format: the first line has variable length (and contains no information I'm trying to query), followed by a series of capitalized letters with a newline every sixty (60) characters (this is what I want to pull down, but reformatted to eighty (80) characters per line).
I have the call itself in a single function:
//finds and saves the fastas for each protein (assuming one exists)
void pullFasta (proteinEntry *entry, char matchType, FILE *outFile) {
    //Local variables
    URL_FILE *handle;
    char buffer[2] = "", url[32] = "http://www.uniprot.org/uniprot/", sequence[2] = "";
    //Build full URL
    /*printf ("u:%s\nt:%s\n", url, entry->title); /*This line was used for debugging.*/
    strcat (url, entry->title);
    strcat (url, ".fasta");
    //Open URL
    /*printf ("u:%s\n", url); /*This line was used for debugging.*/
    handle = url_fopen (url, "r");
    //If there is data there
    if (handle != NULL) {
        //Skip the first line as it's got useless info
        do {
            url_fread(buffer, 1, 1, handle);
        } while (buffer[0] != '\n');
        //Grab the fasta data, skipping newline characters
        while (!url_feof (handle)) {
            url_fread(buffer, 1, 1, handle);
            if (buffer[0] != '\n') {
                strcat (sequence, buffer);
            }
        }
        //Print it
        printFastaEntry (entry->title, sequence, matchType, outFile);
    }
    url_fclose (handle);
    return;
}
With proteinEntry being defined as:
//Entry for fasta formatable data
typedef struct proteinEntry {
    char title[7];
    struct proteinEntry *next;
} proteinEntry;
The url_fopen, url_fclose, url_feof, url_fread, and URL_FILE code is found here; these mimic the file functions for which they are named.
As you can see, I've been debugging the URL generator (uniprot URLs follow the same format for different proteins). I got it working properly and can pull down the data from the site and save it to a file in the format I want. I set the read buffer to 1 because I wanted a program that was very simple but functional (if inelegant) before I start playing with things, so I would have a base to return to as I learned.
I've tested the url_<function> calls and they are giving no errors. So I added incremental printf calls after each line to identify exactly where the bus error is occurring and it is happening at return;.
My understanding of bus errors is that it's a memory access issue wherein I'm trying to get at memory that my program doesn't have control over. My confusion comes from the fact that this is happening at the return of a void function. There's nothing being read, written, or passed to trigger the memory error (as far as I understand it, at least).
Can anyone point me in the right direction to fix my mistake please?
EDIT: As @BLUEPIXY pointed out, I had a potential url_fclose (NULL). As @deltheil pointed out, I had sequence as a fixed-size array. This also made me notice I was repeating the same bad memory allocation for url, so I updated it and it now works. Thanks for your help!
If we look at e.g. http://www.uniprot.org/uniprot/Q6GZX1.fasta and skip the first line (as you do), we have:
MNAKYDTDQGVGRMLFLGTIGLAVVVGGLMAYGYYYDGKTPSSGTSFHTASPSFSSRYRY
which is a 60-character string.
When you try to read this sequence with:
//Grab the fasta data, skipping newline characters
while (!url_feof (handle)) {
    url_fread(buffer, 1, 1, handle);
    if (buffer[0] != '\n') {
        strcat (sequence, buffer);
    }
}
The problem is sequence is not expandable and not large enough (it is a fixed length array of size 2).
So make sure to choose a large enough size to hold any sequence, or implement the ability to expand it on-the-fly.

Checking for a blank line in C - Regex

Goal:
Find if a string contains a blank line, whether it be '\n\n', '\r\n\r\n', '\r\n\n', or '\n\r\n'.
Issues:
I don't think my current regex for finding '\n\n' is right. This is my first time really using regex outside of simple use of * when removing files on the command line.
Is it possible to check for all of these cases (listed above) in one regex? Or do I have to do 4 separate calls to compile_regex?
Code:
int checkForBlankLine(char *reader) {
    regex_t r;
    compile_regex(&r, "*\n\n");
    match_regex(&r, reader);
    return 0;
}
void compile_regex(regex_t *r, char *matchText) {
    int status;
    regcomp(r, matchText, 0);
}
int match_regex(regex_t *r, char *reader) {
    regmatch_t match[1];
    int nomatch = regexec(r, reader, 1, match, 0);
    if (nomatch) {
        printf("No matches.\n");
    } else {
        printf("MATCH!\n");
    }
    return 0;
}
Notes:
I only need to worry about finding one blank line, that's why my regmatch_t match[1] is only one item long
reader is the char array containing the text I am checking for a blank line.
I have seen other examples and tried to base the code off of those examples, but I still seem to be missing something.
Thank you kindly for the help/advice.
If anything needs to be clarified please let me know.
It seems that you have to compile the regex as extended:
regcomp(&re, "\r?\n\r?\n", REG_EXTENDED);
The first atom, \r? is probably unnecessary, because it doesn't add to the blank-line condition if you don't capture the result.
In the above, blank line really means empty line. If you want blank line to mean a line that has no characters except for white space, you can use:
regcomp(&re, "\r?\n[ \t]*\r?\n", REG_EXTENDED);
(I don't think you can use the space character pattern, \s here instead of [ \t], because that would include carriage return and new-line.)
As others have already hinted at, the "simple use of * in the command line" is not a regular expression. This wildcard matching is called file globbing and has different semantics.
Check what the * in a regex means. It's not like the wildcard "anything" in the command line. The * means that the previous component can appear any number of times. The wildcard in a regex is the dot (.). So if you want to say "match anything" you can use .*, which means any character, any number of times.
So in your case you can do .*\n\n.* which would match anything that has \n\n.
Finally, you can use | for "or" in a regex and ( ) to group stuff (with POSIX regcomp this syntax requires the REG_EXTENDED flag). So you can do something like .*(\n\n|\r\n\r\n).* and that would match anything that has a \n\n or a \r\n\r\n.
Hope that helps.
Rather than looking for \r or \n, why not look for anything that is not \r or \n?
Your regex would simply be
'[^\r\n]'
Tested against one line at a time, a failed match means the line contains nothing but line terminators, i.e. it is blank by your specification.

Parsing some particular statements with antlr3 in C target

I have some questions about antlr3 tree grammars with the C target.
I have almost finished my interpreter (functions, variables, boolean and math expressions all work) and I have kept the most difficult statements for the end (like if, switch, etc.).
1) I would like to interpret a simple loop statement:
repeat: ^(REPEAT DIGIT stmt);
I've seen many examples but nothing about the tree walker (only a topic here using the macros MARK() / REWIND(m) with @init / @after, which did not work: I get antlr errors such as "unexpected node at offset 0"). How can I interpret this statement in C?
2) Same question for a simple if statement:
if: ^(IF condition stmt elseifstmt* elsestmt?);
The problem is how to skip the statement if the condition is false and then test the other elseif/else statements.
3) I have some statements which can stop the script (like "break" or "exit"). How can I interrupt the tree walker and skip the following tokens?
4) When a lexer or parser error is detected, antlr returns an error, but I would like to produce my own error messages. How can I get the line number where the parser failed?
Ask me if you want more details.
Thank you very much (and I apologize for my poor English).
About the repeat statement, I think I've found a way to do it. On antlr.org, I found a complete interpreter for the C-- language, but written in Java.
Here is its while statement (a bit different, but the approach is the same):
whileStmt
scope{
    Boolean breaked;
}
@after{
    CommonTree stmtNode=(CommonTree)$whileStmt.start.getChild(1);
    CommonTree exprNode=(CommonTree)$whileStmt.start.getChild(0);
    int test;
    $whileStmt::breaked=false;
    while($whileStmt::breaked==false){
        stream.push(stream.getNodeIndex(exprNode));
        test=expr().value;
        stream.pop();
        if (test==0) break;
        stream.push(stream.getNodeIndex(stmtNode));
        stmt();
        stream.pop();
    }
}
    : ^(WHILE . .)
    ;
I've tried to translate this code into C:
repeat
scope {
    int breaked;
    int tours;
}
@after
{
    int test;
    pANTLR3_BASE_TREE repeatstmt = (pANTLR3_BASE_TREE)$repeat.start->getChild($repeat.start,1);
    pANTLR3_BASE_TREE exprstmt = (pANTLR3_BASE_TREE)$repeat.start->getChild($repeat.start,0);
    $repeat::breaked = 0;
    test = 1;
    while($repeat::breaked == 0)
    {
        TW_FOLLOWPUSH(exprstmt);
        TW_FOLLOWPOP();
        test++;
        if(test == $repeat::tours)
            break;
        TW_FOLLOWPUSH(repeatstmt);
        CTX->repeat(CTX);
        TW_FOLLOWPOP();
    }
}
    : ^(REPEAT DIGIT stmt)
    {
        $repeat::tours = $DIGIT.text->toInt32($DIGIT.text);
    }
But nothing happens (stmt is parsed just once).
Do you have any idea about this, please?
About the homemade error messages, I've found the macro GETLINE() in the lexer. It works when the tree walker fails, but antlr continues to display its own error messages for lexer or parser errors.
Thanks.

Resources