Runtime formula evaluation - c

I would like to evaluate formulas which a user can input for many data points, so efficiency is a concern. This is for a Fortran project, but my solutions so far have been centered on using a yacc/bison grammar, so I will probably use Fortran's iso_c_binding feature to interface to yyparse().
The preferred (so far) solution would be a small extension of the classic mfcalc calculator example from the Bison manual, with the bison grammar made to recognize a (single) variable name as well (which is not hard).
The question is what to do in the executable statements. I see two options there.
First, I could simply evaluate the expression as it is parsed, as in the mfcalc example.
Second, I could invoke the bison parser once for parsing and for creating a stack-based (reverse polish) representation of the formula being parsed, so
2 + 3*x would be translated into 2 3 x * + (as the relevant data structure, of course).
The relevant part of the grammar would look like this:
%union {
double val;
char *c;
int fcn;
}
%type <val> NUMBER
%type <c> VAR
%type <fcn> Function
/* Tokens and %left PLUS MINUS etc. left out for brevity */
%%
...
Function:
SIN { $$=SIN; }
| COS { $$=COS; }
| TAN { $$=TAN; }
| SQRT { $$=SQRT; }
Expression:
NUMBER { push_number($1); }
| VAR { push_var($1); }
| Expression PLUS Expression { push_operand(PLUS); }
| Expression MINUS Expression { push_operand(MINUS); }
| Expression DIVIDE Expression { push_operand(DIVIDE); }
| MINUS Expression %prec NEG { push_operand(NEG); }
| LEFT_PARENTHESIS Expression RIGHT_PARENTHESIS
| Function LEFT_PARENTHESIS Expression RIGHT_PARENTHESIS { push_function($1); }
| Expression POWER Expression { push_operand(POWER); }
The push_... functions would put the formula into an array of structs, each of which holds the token and the yacc union.
The RPN would then be interpreted using a very simple (and hopefully fast) interpreter.
So, the questions.
Is the second approach valid? I think it is, from what I understand about bison's (or yacc's) way of handling shift and reduce (basically, this will shift a number and reduce an expression, so the order should be guaranteed to be correct for RPN), but I am not quite sure.
Also, is it worth the additional effort over simply evaluating the function using the $$ construct (the first approach)?
Finally, are there other, better solutions? I had considered using syntax trees, but I don't think the additional effort is actually worth it. Also, I tend to think that using trees is overkill where an array would do just nicely :-)

It's only slightly more difficult to generate three-address virtual ops than RPN. In effect, the RPN is a virtual stack machine. The three-address ops -- which can also easily go into an array -- are probably faster to interpret, and will probably be more flexible in the long term.
The main advantage of parsing the expression into some internal form is that it is likely to be faster to evaluate the internal form than to reparse the original string. That may not be the case, but it usually is because converting floating-point literals into floating-point numbers is (relatively speaking) quite slow.
There is also the intermediate case of tokenizing the expression (into an array), and then directly evaluating while parsing the token stream. (In effect, that makes bison your virtual machine.)
Which of these strategies is the best depends a lot on details of your use case, but none of them are difficult so you could try all three and compare.
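To make the RPN option concrete, here is a minimal sketch of the "very simple interpreter" the question describes. The names Op, Instr and rpn_eval are hypothetical (they are not part of bison or of the mfcalc example), and I am assuming the push_... actions fill a plain array in the order the reductions happen. Only the arithmetic operators are shown; each function such as SIN or SQRT would add one more case calling the corresponding math.h routine.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical instruction set for the array the parser actions fill. */
typedef enum { OP_NUM, OP_VAR, OP_ADD, OP_SUB, OP_MUL, OP_DIV, OP_NEG } Op;

typedef struct {
    Op op;
    double val;              /* only meaningful for OP_NUM */
} Instr;

#define RPN_STACK_MAX 64

/* Evaluate an RPN program of n instructions for one value of the variable. */
double rpn_eval(const Instr *prog, size_t n, double x)
{
    double st[RPN_STACK_MAX];
    size_t sp = 0;           /* number of values currently on the stack */

    for (size_t i = 0; i < n; i++) {
        switch (prog[i].op) {
        case OP_NUM:
        case OP_VAR:
            assert(sp < RPN_STACK_MAX);
            st[sp++] = (prog[i].op == OP_NUM) ? prog[i].val : x;
            break;
        case OP_NEG:
            assert(sp >= 1);
            st[sp - 1] = -st[sp - 1];
            break;
        default:             /* binary operators pop two values, push one */
            assert(sp >= 2);
            sp--;
            switch (prog[i].op) {
            case OP_ADD: st[sp - 1] += st[sp]; break;
            case OP_SUB: st[sp - 1] -= st[sp]; break;
            case OP_MUL: st[sp - 1] *= st[sp]; break;
            case OP_DIV: st[sp - 1] /= st[sp]; break;
            default: assert(0);
            }
        }
    }
    assert(sp == 1);         /* a well-formed program leaves one result */
    return st[0];
}
```

The formula is parsed once into the array, and rpn_eval is then called once per data point, so the cost of converting literals is paid only at parse time. For 2 + 3*x the array would hold 2 3 x * +.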

Related

yacc parser reduces before left recursion suffix

I wrote a pretty simple left-recursive grammar in yacc, based on data types for a simple C-like language. The generated yacc parser reduces before the left recursion suffix, and I have no clue why.
Here is the source code:
%%
start: type {
printf("Reduced to start.\n");
};
type: pointer {
printf("Reduced pointer to type.\n");
}
| char {
printf("Reduced char to type.\n");
};
char: KW_CHAR {
printf("Reduced to char.\n");
};
pointer: type ASTERISK {
printf("Reduced to pointer.\n");
};
%%
Given the input char * (KW_CHAR ASTERISK):
Reduced to char.
Reduced char to type.
syntax error
You don't seem to be asking about the parse error you're receiving (which, as noted in the comments, is likely an issue with the token types being returned by yylex()), but rather why the parser reductions are performed in the order shown by your trace. That's the question I will try to address here, even though it's quite possible that it's an XY problem, because understanding reduction order is important.
In this case, the reduction order is pretty simple. If you have a production:
pointer: type ASTERISK
then type must be reduced before ASTERISK is shifted. In an LR parser, reductions must be done when the right-hand side to be reduced ends exactly at the last consumed input token. (The lookahead token has been identified by the lexical scanner but not yet consumed by the parser. So it's not part of the reduction and can be used also to identify other reductions ending at the same point.)
I hope it's more evident why with the production
type: char
char needs to be reduced before type. Until char is reduced, it's not available to be used in the reduction of type.
Really, these two examples show the same behaviour. Reductions are performed left-to-right, and from the bottom up (that is, children first). Hence the name for this kind of parsing.
So the reduction order shown by your parser (first char, then type, and only after the * is shifted, pointer) is precisely what would be expected.
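If it helps to see that order mechanically, here is a hand-written sketch that simulates the shift and reduce steps for this one grammar and records them. It is not how bison's table-driven automaton is implemented, and unlike a real LR parser it only notices a syntax error at the end of the input; the names Sym, emit and trace_parse are made up for the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Terminals first, then the nonterminals char, type and pointer. */
typedef enum { KW_CHAR, ASTERISK, N_CHAR, N_TYPE, N_POINTER } Sym;

/* Append one line to the trace if it still fits in the buffer. */
static void emit(char *out, size_t outsz, const char *line)
{
    if (strlen(out) + strlen(line) + 2 <= outsz) {
        strcat(out, line);
        strcat(out, "\n");
    }
}

/* Records one line per action in out.
   Returns 1 if the tokens reduce to start, 0 otherwise. */
int trace_parse(const Sym *tokens, size_t n, char *out, size_t outsz)
{
    Sym stack[64];
    size_t sp = 0;

    assert(outsz > 0);
    out[0] = '\0';

    for (size_t i = 0; ; i++) {
        /* Reduce every right-hand side that ends at the top of the
           stack before the next token is shifted. */
        for (;;) {
            if (sp >= 1 && stack[sp - 1] == KW_CHAR) {
                stack[sp - 1] = N_CHAR;
                emit(out, outsz, "reduce char: KW_CHAR");
            } else if (sp >= 1 && stack[sp - 1] == N_CHAR) {
                stack[sp - 1] = N_TYPE;
                emit(out, outsz, "reduce type: char");
            } else if (sp >= 2 && stack[sp - 2] == N_TYPE
                               && stack[sp - 1] == ASTERISK) {
                sp--;
                stack[sp - 1] = N_POINTER;
                emit(out, outsz, "reduce pointer: type ASTERISK");
            } else if (sp >= 1 && stack[sp - 1] == N_POINTER) {
                stack[sp - 1] = N_TYPE;
                emit(out, outsz, "reduce type: pointer");
            } else {
                break;
            }
        }
        if (i == n) break;           /* end of input */
        if (sp == 64) return 0;      /* stack full */
        stack[sp++] = tokens[i];
        emit(out, outsz, tokens[i] == KW_CHAR ? "shift KW_CHAR"
                                              : "shift ASTERISK");
    }
    if (sp == 1 && stack[0] == N_TYPE) {
        emit(out, outsz, "reduce start: type");
        return 1;
    }
    emit(out, outsz, "syntax error");
    return 0;
}
```

For the tokens KW_CHAR ASTERISK the recorded actions are: shift KW_CHAR, reduce char, reduce type, shift ASTERISK, reduce pointer, reduce type, reduce start. Both char and type are reduced before the ASTERISK is shifted, which matches the first two lines of your trace.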

Using symbols read by scanf() as C operators

I want to read a symbol in from scanf() and then get C to use it for what it is. My current (relevant) piece of code looks like:
float a; /* First Number */
char sym; /* Symbol (i.e. +, -, /, *) */
float b; /* Second number.... */
puts("Please type your equation");
printf("$: ");
scanf("%f %c %f", &a, &sym, &b);
So if the user were to type (at the $: prompt) 5 + 10 then the program should proceed to evaluate 5 + 10, but I know I can't expect C to do this (not without working some magic first :) because '+' is just an ASCII character code. So what I'm asking is:
How do I get C to literally take the variable sym for what we (as people) take it as (a plus +) and then use that to solve the equation as if the variables had hard-coded values?
EDIT
I now understand that it may be impossible (see comment by SLaks), so any workarounds would be great!
Just as a side-note: I know I can use
....
float add(float a, float b)
{
    return (a + b);
}
....
if (sym == '+') {
    printf("%f\n", add(a, b));
}
}
and so on, but when I get to including more than just a and b (e.g. a, sym, b, sym2, c) and the user has more than a single type of operator (e.g. 2 + 4 - 6) this becomes tedious and time consuming.
You can't really do that. C is a compiled language (so is C++) and you cannot just execute a string as if it is C code at run time. The instructions are generated when the code is compiled. Other languages like Python which are interpreted support this (such as the eval function in Python). Using the if statements is probably the most efficient approach.
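The branching does not have to be tedious, though. As a rough sketch (apply and eval_chain are names I made up, and this version deliberately ignores operator precedence), one switch can map the character to the operation, and a loop can apply it to as many operands as the user types:

```c
#include <assert.h>
#include <stdio.h>

/* Apply the operator character sym to a and b.
   Returns 1 and stores the result in *out, or returns 0 for an
   unknown operator or a division by zero. */
int apply(float a, char sym, float b, float *out)
{
    assert(out != NULL);
    switch (sym) {
    case '+': *out = a + b; return 1;
    case '-': *out = a - b; return 1;
    case '*': *out = a * b; return 1;
    case '/':
        if (b == 0.0f) return 0;
        *out = a / b;
        return 1;
    default:
        return 0;
    }
}

/* Evaluate "a op b op c ..." strictly left to right (no precedence).
   Parsing stops at the first thing that is not "operator number". */
int eval_chain(const char *s, float *out)
{
    float acc, next;
    char sym;
    int used;                /* characters consumed, reported by %n */

    assert(s != NULL && out != NULL);
    if (sscanf(s, "%f%n", &acc, &used) != 1) return 0;
    s += used;
    while (sscanf(s, " %c %f%n", &sym, &next, &used) == 2) {
        if (!apply(acc, sym, next, &acc)) return 0;
        s += used;
    }
    *out = acc;
    return 1;
}
```

Because it works left to right, 2 + 3 * 4 comes out as 20 rather than 14. Getting precedence and parentheses right is where the stack-based approach comes in.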
Also like Jiang Jie said I would look into reverse polish notation. This involves using a stack to evaluate the mathematical expressions and can handle complex expressions.
You will also probably need to look into converting infix expressions (e.g. 1 + 2) into postfix expressions (1 2 +).
If you really want to learn C by interpreting it, then you'll need a tokenizer, a parser and an expression tree evaluator. I know there are the old classics, LEX and YACC, but I'm pretty sure there are newer tokenizers and parser generators. You can google for "C parser generator". There's even a Wikipedia article comparing a bunch of them.
But I will say that writing a C interpreter is not the best way to learn C. Learning to write simpler programs is highly recommended. I suggest finding a tutorial site or getting a book. There are lots of both.
What you need is Reverse Polish Notation. Let the user input a regular math expression, transform it into an RPN expression, then calculate.

Project on flex and bison

I have used flex and bison to make a lexical analyzer and a parser for an EBNF grammar. This work is done! I mean, when I feed it a file containing a program I wrote, I can see whether the program has mistakes. If it doesn't, I can see the whole program on my screen based on the grammar I have used. I have no problem with this.
Now, I want to use loop handling and loop unrolling. Which part should I change? The lexical analyzer? The parser? Or the main after the parser? And how?
Introduction
As we haven't seen any of your code (neither how you handle a loop in the parser and output code, nor an example of a specific loop you might want unrolled), it is difficult to give more detailed advice than has already been given. There are unlikely to be any more experienced compiler writers or teachers anywhere on the globe than those already reading your question! So we will need to explore other ways to explain how to solve a problem like this.
It often happens that people can't post examples of their code because they started with a significant code base provided as part of a class exercise or from an open source repository, and they do not fully understand how it works to be able to find appropriate code fragments to post. Let's imagine that you had the complete source of a working compiler for a real language and wanted to add some loop optimisations to that existing, working compiler, you might then say, as you did, "what source, how can I show some source?" (because in actuality it is many tens of thousands of lines of code).
An Example Compiler
In the absence of some code to reference the alternative is to create one, as an exemplar, to explain the problem and solution. This is often how it is done in compiler text books or compiler classes. I will use a similar simple example to demonstrate how such optimisations can be achieved using the tools flex and bison.
First, we need to define the language of the example. To keep within the reasonable size constraints of a SO answer the language must be very simple. I will use simple assignments of expressions as the only statement form in my language. The variables in this language will be single letters and the constants will be positive integers. The only expression operator is plus (+). An example program in my language might be:
i = j + k; j = 1 + 2
The output code generated by the compiler will be simple assembler for a single accumulator machine with four instructions, LDA, STO, ADD and STP. The code generated for the above statements would be:
LDA j
ADD k
STO i
LDA #1
ADD #2
STO j
STP
Where LDA loads a value or variable into the accumulator, ADD adds a variable or value to the accumulator, STO stores the accumulator back to a variable. STP is "stop" for the end-of-program.
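To pin down what those four instructions mean, here is a minimal sketch of a simulator for this imaginary machine. The Insn layout and the run function are my own invention for illustration; the compiler in this answer only prints the assembler text and never executes it.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { LDA, ADD, STO, STP } Opcode;

typedef struct {
    Opcode op;
    int immediate;   /* 1 for "#n" constants, 0 for variable names */
    int arg;         /* the constant, or the variable letter 'a'..'z' */
} Insn;

/* Run code until STP is reached; vars[0] holds a, vars[25] holds z. */
void run(const Insn *code, int vars[26])
{
    int acc = 0;     /* the single accumulator */

    for (; code->op != STP; code++) {
        int *var = NULL;
        if (!code->immediate) {
            assert(code->arg >= 'a' && code->arg <= 'z');
            var = &vars[code->arg - 'a'];
        }
        switch (code->op) {
        case LDA: acc = var ? *var : code->arg; break;
        case ADD: acc += var ? *var : code->arg; break;
        case STO: assert(var); *var = acc; break;
        case STP: break;   /* not reached: the loop stops first */
        }
    }
}
```

Running the seven instructions listed above with j = 5 and k = 7 leaves i = 12 and j = 3.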
The flex program
The language shown above will need the tokens for ID and NUMBER and should also skip whitespace. The following will suffice:
%{
#define yyterminate() return (END);
%}
digit [0-9]
id [a-z]
ws [\t\n\r ]
%%
{ws}+ /* Skip whitespace */
{digit}+ {yylval = (int)(0l - atol(yytext)); return(NUMBER); }
{id} {yylval = yytext[0]; return(ID); }
"+" {return('+'); }
"=" {return('='); }
";" {return(';'); } /* statement separator used by the grammar */
Gory details
Just some notes on how this works. I've used atol to convert the integer in order to deal with the potential integer overflow that can occur when reading MAXINT. I'm negating the constants so they can be easily distinguished from the identifiers, which will be positive and fit in one byte. I'm storing single-character identifiers to avoid the burden of illustrating symbol table code and thus permit a very small lexer, parser and code generator.
The bison program
The following bison program parses the language and generates code from its actions:
%{
#include <stdio.h>
%}
%token NUMBER ID END
%%
program : statements END { printf("STP\n"); return(0) ; }
;
statements : statement
| statements ';' statement
;
statement : ID '=' expression { printf("STO %c\n",$1); }
|
;
expression : operand {
/* Load operand into accumulator */
if ($1 <= 0)
printf("LDA #%d\n",(int)0l-$1);
else printf("LDA %c\n",$1);
}
| expression '+' operand {
/* Add operand to accumulator */
if ($3 <= 0)
printf("ADD #%d\n",(int)0l-$3);
else printf("ADD %c\n",$3);
}
;
operand : NUMBER
| ID
;
%%
#include "lex.yy.c"
Explanation of methodology
This paragraph is intended for those who know how to do this and might query the approach used in my examples. I've deliberately avoided building a tree and doing a tree walk, although this would be the orthodox technique for code generation and optimisation. I wanted to avoid adding all the necessary code overhead in the example to manage the tree and walk it. This way my example compiler can be really tiny. However, being restricted to using only bison actions to perform the code generation limits me to the ordering of the bison rule matching. This meant that only pseudo-machine code could really be generated. A source-to-source example would be less tractable with this methodology. I've chosen an idealised machine that is a cross between MU0 and a register-less PDP-11, again with the bare minimum of features to demonstrate some optimisations of code.
Optimisation
Now we have a working compiler for a language in a few lines of code we can start to demonstrate how the process of adding code optimisation might work.
As has already been said by the esteemed @Chris Dodd:
If you want to do program transformations after parsing, you should do them after parsing. You can do them incrementally (calling transform routines from your bison code after parsing part of your input), or after parsing is complete, but either way, they happen after parsing the part of the program you are transforming.
This compiler works by emitting code incrementally after parsing part of the input. As each statement is recognised the bison action (within the {...} clause) is invoked to generate code. If this is to be transformed into more optimal code it is this code that has to be changed to generate the desired optimisation. To be able to achieve effective optimisation we need a clear understanding of what language features are to be optimised and what the optimal transformation should be.
Constant Folding
A common optimisation (or code transformation) that can be done in a compiler is constant folding. In constant folding the compiler replaces expressions made entirely of numbers by the result. For example consider the following:
i = 1 + 2
An optimisation would be to treat this as:
i = 3
Thus the addition of 1 + 2 was made by the compiler and not put into the generated code to occur at run time. We would expect the following output to result:
LDA #3
STO i
Improved Code Generator
We can implement the improved code by looking for the explicit case where we have a NUMBER on both sides of expression '+' operand. To do this we have to delay taking any action on expression : operand to permit the value to be propagated onwards. As the value for an expression might not have been evaluated we have to potentially do that on assignment and addition, which makes for a slight explosion of if statements. We only need to change the actions for the rules statement and expression however, which are as shown below:
statement : ID '=' expression {
/* Check for constant expression */
if ($3 <= 0) printf("LDA #%d\n",(int)0l-$3);
else
/* Check if expression in accumulator */
if ($3 != 'A') printf("LDA %c\n",$3);
/* Now store accumulator */
printf("STO %c\n",$1);
}
| /* empty statement */
;
expression : operand { $$ = $1 ; }
| expression '+' operand {
/* First check for constant expression */
if ( ($1 <= 0) && ($3 <= 0)) $$ = $1 + $3 ;
else { /* No constant folding */
/* See if $1 already in accumulator */
if ($1 != 'A')
/* Load operand $1 into accumulator */
if ($1 <= 0)
printf("LDA #%d\n",(int)0l-$1);
else printf("LDA %c\n",$1);
/* Add operand $3 to accumulator */
if ($3 <= 0)
printf("ADD #%d\n",(int)0l-$3);
else printf("ADD %c\n",$3);
$$ = 'A'; /* Note accumulator result */
}
}
;
If you build the resultant compiler, you will see that it does indeed generate better code and perform the constant folding transformation.
Loop Unrolling
The transformation that you specifically asked about in your question was that of loop unrolling. In loop unrolling the compiler will look for some specific integer expression values in the loop start and end conditions to determine if the unrolled code transformation should be performed. The compiler can then generate one of two alternative code sequences for loops: the unrolled code or the standard looping code. We can demonstrate this concept in this example mini-compiler by using integer increments.
If we imagine that the machine code has an INC instruction which increments the accumulator by one and is faster than performing an ADD #1 instruction, we can further improve the compiler by looking for that specific case. This involves evaluating integer constant expressions and comparing to a specific value to decide if an alternative code sequence should be used, just as in loop unrolling. For example:
i = j + 1
should result in:
LDA j
INC
STO i
Final Code Generator
To change the code generated for n + 1 we only need to recode part of the expression semantics and test, when not folding constants, whether the constant to be used is 1 (which is stored negated in this example). The resultant code becomes:
expression : operand { $$ = $1 ; }
| expression '+' operand {
/* First check for constant expression */
if ( ($1 <= 0) && ($3 <= 0)) $$ = $1 + $3 ;
else { /* No constant folding */
/* Check for special case of constant 1 on LHS */
if ($1 == -1) {
/* Swap LHS/RHS to permit INC usage */
$1 = $3;
$3 = -1;
}
/* See if $1 already in accumulator */
if ($1 != 'A')
/* Load operand $1 into accumulator */
if ($1 <= 0)
printf("LDA #%d\n",(int)0l-$1);
else printf("LDA %c\n",$1);
/* Add operand $3 to accumulator */
if ($3 <= 0)
/* test if ADD or INC */
if ($3 == -1) printf("INC\n");
else printf("ADD #%d\n",(int)0l-$3);
else printf("ADD %c\n",$3);
$$ = 'A'; /* Note accumulator result */
}
}
;
Summary
In this mini-tutorial we have defined a whole language, a complete machine code, written a lexer, a compiler, a code generator and an optimiser. It has briefly demonstrated the process of code generation and indicated (albeit generally) how code transformation and optimisation could be performed. It should enable similar improvements to be made in other (as yet unseen) compilers, and has addressed the issue of identifying loop unrolling conditions and generating specific improvements for that case.
It should also have made it clear how difficult it is to answer questions without specific examples of some program code to refer to.

How to recursively parse an expression?

I'm writing a small language, and I'm really stuck on expression parsing. I've written an LR Recursive Descent Parser and it works, but now that I need to parse expressions I'm finding it really difficult. I do not have a grammar defined, but if it helps, I kind of have an idea of how it works even without a grammar. Currently, my expression struct looks like this:
typedef struct s_ExpressionNode {
Token *value;
char expressionType;
struct s_ExpressionNode *lhand;
char operand;
struct s_ExpressionNode *rhand;
} ExpressionNode;
I'm trying to get it to parse something like:
5 + 5 + 2 * (-3 / 2) * age
I was reading this article on how to parse expressions. I tried to implement the first grammar, but it didn't work out too well. Then I noticed the second grammar, which appears to remove left recursion. However, I'm stuck trying to implement it since I don't understand what P and B mean, and U is a - but the - is also a B? I'm also not sure what expect(end) is supposed to mean.
In the "Recursive-descent recognition" section of the article you linked, the E, P, B, and U are the non-terminal symbols in the expression grammar presented. From their definitions in the text, I infer that "E" is chosen as a mnemonic for "expression", "P" as mnemonic for "primary", "B" for "binary (operator)", and "U" for "unary (operator)". Given those characterizations, it should be clear that the terminal symbol "-" can be reduced either to a U or to a B, depending on context:
unary: -1
binary: x-1
The expect() function described in the article is used to consume the next token if it happens to be of the specified type, or otherwise to throw an error. The end token is defined to be a synthetic token representing the end of the input. Thus
expect(end)
expresses the expectation that there are no more tokens to process in the expression, and its given implementation throws an error if that expectation is not met.
All of this is in the text, except the reason for choosing the particular symbols E, P, B, and U. If you're having trouble following the text then you probably need to search out something simpler.
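In case code is easier to follow than the prose, here is a small recognizer sketch with one function per non-terminal. I am assuming the grammar has roughly the shape E --> P {B P} and P --> v | "(" E ")" | U P, with U being unary minus; check that against the article, since it is not reproduced here. The function names and the single-character tokens are my own simplification, and the '\0' terminator of the string plays the role of the article's end token.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

static const char *cur;      /* current position in the input */

static void skip_ws(void)
{
    while (isspace((unsigned char)*cur)) cur++;
}

/* expect(): consume c if it is the next token, otherwise fail.
   expect('\0') checks that nothing is left, like expect(end). */
static int expect(char c)
{
    skip_ws();
    if (*cur != c) return 0;
    if (c != '\0') cur++;
    return 1;
}

/* B: the binary operators. */
static int is_binary(char c)
{
    return c != '\0' && strchr("+-*/^", c) != NULL;
}

static int parse_E(void);

/* P --> v | "(" E ")" | U P */
static int parse_P(void)
{
    skip_ws();
    if (isalnum((unsigned char)*cur)) {      /* v: number or name */
        while (isalnum((unsigned char)*cur)) cur++;
        return 1;
    }
    if (*cur == '(') {
        cur++;
        return parse_E() && expect(')');
    }
    if (*cur == '-') {                       /* U: unary minus */
        cur++;
        return parse_P();
    }
    return 0;
}

/* E --> P {B P} */
static int parse_E(void)
{
    if (!parse_P()) return 0;
    for (;;) {
        skip_ws();
        if (!is_binary(*cur)) return 1;      /* no more B P pairs */
        cur++;
        if (!parse_P()) return 0;
    }
}

/* Returns 1 if s is a well-formed expression, 0 otherwise. */
int recognize(const char *s)
{
    assert(s != NULL);
    cur = s;
    return parse_E() && expect('\0');
}
```

Note how the same character is treated differently by position: a - seen where a P is expected is the unary U, while a - seen after a complete P is a binary B. This only recognizes input; building your ExpressionNode tree and handling precedence would be the next step.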

Checking Valid Arithmetic Expression in Lex (in C)

I have to write code in lex for checking whether an arithmetic expression is valid. I am aware that I could do this very easily using yacc, but doing it only in lex is not so easy.
I have written the code below, which for some reason doesn't work.
Besides this, I also don't get how to handle binary operators.
My wrong code:
%{
#include <stdio.h>
/* Will be using stack to check the validity of arithetic expressions */
char stack[100];
int top = 0;
int validity =0;S
%}
operand [a-zA-Z0-9_]+
%%
/* Will consider unary operators (++,--), binary operators(+,-,*,/,^), braces((,)) and assignment operators (=,+=,-=,*=,^=) */
"(" { stack[top++]='(';}
")" { if(stack[top]!=')') yerror(); else top--;}
[+|"-"|*|/|^|%] { if(stack[top]!='$') yerror(); else stack[top]=='&';}
"++" { if(stack[top]!='$') yerror(); else top--;}
[+"-"*^%]?= { if(top) yerror();}
operand { if(stack[top]=='&') top--; else stack[top++]='$';}
%%
int yerror()
{
printf("Invalid Arithmetic Expression\n");
}
First, learn how to write regular expressions in Flex. (Patterns, Flex manual).
Inside a character class ([…]), neither quotes nor stars nor vertical bars are special. To include a - or a ], you can escape them with a \ or put them at the beginning of the list, or in the case of - at the end.
So in:
[+|"-"|*|/|^|%]
The | is just another character in the list, and including it five times doesn't change anything. "-" is a character range consisting only of the character ", although I suppose the intention was to include a -. Probably you wanted [-+*/^%] or [+\-*/^%].
There is no way that the flex scanner can guess that a + (for example) is a unary operator instead of a binary operator, and putting it twice in the list of rules won't do anything; the first rule will always take effect.
Finally, if you use definitions (like operand) in your patterns, you have to enclose them in braces: {operand}; otherwise, flex will interpret it as a simple keyword.
And a hint for the assignment itself: A valid unparenthesized arithmetic expression can be simplified into the regular expression:
term {prefix-operator}*{operand}{postfix-operator}*
expr {term}({infix-operator}{term})*
But you can't use that directly because (a) it doesn't deal with parentheses, (b) you probably need to allow whitespace, and (c) it doesn't correctly reject a+++++b because C insists on the "maximal munch" rule for lexical scans, so that is not the same as the correct expression a++ + ++b.
You can, however, translate the above regular expression into a very simple two-state state machine.
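As a rough illustration of that state machine, here it is in plain C rather than as flex rules. The state names and valid_expr are invented for the sketch, and I am assuming that only +, -, ++ and -- can act as prefix operators and that ++ and -- are the only postfix ones. It skips whitespace and scans ++ and -- greedily, but it still does not handle parentheses.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

typedef enum { WANT_OPERAND, HAVE_OPERAND } State;

/* Returns 1 if s matches term (infix term)*, where
   term is prefix* operand postfix*. */
int valid_expr(const char *s)
{
    State st = WANT_OPERAND;

    assert(s != NULL);
    while (*s) {
        if (isspace((unsigned char)*s)) {
            s++;
        } else if (isalnum((unsigned char)*s) || *s == '_') {
            if (st != WANT_OPERAND) return 0;    /* two operands in a row */
            while (isalnum((unsigned char)*s) || *s == '_') s++;
            st = HAVE_OPERAND;
        } else if (strncmp(s, "++", 2) == 0 || strncmp(s, "--", 2) == 0) {
            /* Prefix before an operand, postfix after one:
               valid in both states and the state does not change. */
            s += 2;
        } else if (strchr("+-*/^%", *s)) {
            if (st == HAVE_OPERAND)
                st = WANT_OPERAND;               /* infix operator */
            else if (*s != '+' && *s != '-')
                return 0;                        /* not a prefix operator */
            s++;
        } else {
            return 0;                            /* unknown character */
        }
    }
    return st == HAVE_OPERAND;   /* must end right after a term */
}
```

The point is that the same character + is a prefix operator in one state and an infix operator in the other, which is exactly the distinction a single flex pattern cannot make. In flex itself the two states would be start conditions.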

Resources