yacc parser reduces before left recursion suffix

I wrote a pretty simple left-recursive grammar in yacc, based on data types for a simple C-like language. The generated yacc parser reduces before the left recursion suffix, and I have no clue why.
Here is the source code:
%%
start: type {
        printf("Reduced to start.\n");
    };
type: pointer {
        printf("Reduced pointer to type.\n");
    }
    | char {
        printf("Reduced char to type.\n");
    };
char: KW_CHAR {
        printf("Reduced to char.\n");
    };
pointer: type ASTERISK {
        printf("Reduced to pointer.\n");
    };
%%
Given the input char * (KW_CHAR ASTERISK):
Reduced to char.
Reduced char to type.
syntax error

You don't seem to be asking about the parse error you're receiving (which, as noted in the comments, is likely an issue with the token types being returned by yylex()), but rather why the parser reductions are performed in the order shown by your trace. That's the question I will try to address here, even though it's quite possible that it's an XY problem, because understanding reduction order is important.
In this case, the reduction order is pretty simple. If you have a production:
pointer: type ASTERISK
then type must be reduced before ASTERISK is shifted. In an LR parser, reductions must be done when the right-hand side to be reduced ends exactly at the last consumed input token. (The lookahead token has been identified by the lexical scanner but not yet consumed by the parser. So it's not part of the reduction and can be used also to identify other reductions ending at the same point.)
I hope it's more evident why with the production
type: char
char needs to be reduced before type. Until char is reduced, it's not available to be used in the reduction of type.
Really, these two examples show the same behaviour. Reductions are performed left-to-right, and from the bottom up (that is, children first). Hence the name "bottom-up parsing".
So the reduction order shown by your parser (first char, then type, and only after the * is shifted, pointer) is precisely what would be expected.
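To make that order concrete, here is a minimal hand-written shift-reduce loop for the grammar in the question. It is only a sketch: a real yacc parser is table-driven, and the names simulate, enum sym and the N_ prefixes are made up for this illustration. It does show the point above: type is reduced while ASTERISK is still only the lookahead, and pointer is reduced only after ASTERISK has been shifted.

```c
#include <stdio.h>

/* Grammar symbols: three terminals followed by four nonterminals. */
enum sym { KW_CHAR, ASTERISK, END, N_CHAR, N_TYPE, N_POINTER, N_START };

static const char *const names[] = {
    "KW_CHAR", "ASTERISK", "END", "char", "type", "pointer", "start"
};

/*
 * input must end with END. The left-hand side of every reduction is
 * stored, in order, in reduced[]. Returns the number of reductions,
 * or -1 on a syntax error (or if max is too small).
 */
int simulate(const enum sym *input, enum sym *reduced, int max)
{
    enum sym stack[32];
    int sp = 0, pos = 0, n = 0;

    for (;;) {
        enum sym look = input[pos];   /* lookahead: seen, not yet consumed */
        enum sym lhs;

        if (sp >= 1 && stack[sp - 1] == KW_CHAR) {
            lhs = N_CHAR;                         /* char: KW_CHAR */
        } else if (sp >= 1 && stack[sp - 1] == N_CHAR) {
            lhs = N_TYPE;                         /* type: char */
        } else if (sp >= 1 && stack[sp - 1] == N_POINTER) {
            lhs = N_TYPE;                         /* type: pointer */
        } else if (sp >= 2 && stack[sp - 1] == ASTERISK
                           && stack[sp - 2] == N_TYPE) {
            sp--;                                 /* pop two, push one below */
            lhs = N_POINTER;                      /* pointer: type ASTERISK */
        } else if (sp == 1 && stack[0] == N_TYPE && look == END) {
            lhs = N_START;                        /* start: type */
        } else if (look != END && sp < 32) {
            printf("shift %s\n", names[look]);
            stack[sp++] = look;
            pos++;
            continue;
        } else {
            return -1;                            /* syntax error */
        }

        if (n == max)
            return -1;
        stack[sp - 1] = lhs;                      /* replace the handle */
        reduced[n++] = lhs;
        printf("reduce to %s\n", names[lhs]);
        if (lhs == N_START)
            return n;
    }
}
```

Called with the token sequence KW_CHAR ASTERISK END, it reduces to char, then type, shifts the ASTERISK, and only then reduces to pointer, type and start.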

Related

Where can I find weird, specific C syntax rules?

I am about to take an exam, and my teacher asks about weird C syntax rules, like:
int q=5;
for(q=-2;q=-5;q+=3) { //assignment in condition part??
printf("%d",q); //prints -5
break;
}
Or
int d[][3][2]={4,5,6,7,8,9,10,11,12,13,14,15,16};
int i=-1;
int j;
j=d[i++][++i][++i];
printf("%d",j); //prints 4?? why j=d[0][0][0] ?
Or
extern int a;
int main() {
do {
do {
printf("%o",a); //prints 12
} while(!1);
} while(0);
return 0;
}
int a=10;
I could not find these rules on any site or in any book. They are really absurd and uncommon. Where can I find them?

To me it seems that your teacher is asking questions which involve undefined behavior.
If you tell him that this is incorrect, you're directly confronting him.
However, you could do the following:
Compile the code on different platforms
Compile the code with different compilers
Compile the code with different versions of the same compiler
Build a matrix with the results. You'll find out that they differ
Show the results to your teacher and ask him to explain why that happens
That way you do not say that he's wrong, you're just showing some facts and you're showing that you're willing to learn and work.
Do that long before the exam so that the teacher can look into it, think about his questions, and change the exam in time.
I could not find these rules on any site or in any book. Where can I find them?
See Where do I find the current C or C++ standard documents?. If you have a good library at university, they should own a copy.

Concerning for(q=-2;q=-5;q+=3) {, all you need to do is to break this down into its components. q=-2 is run first, then q=-5 is evaluated and tested, and since that is not 0 (it's an expression with value -5), the loop body runs. Then break forces a premature exit from an otherwise infinite loop. The expression q+=3 is therefore never reached.
The behaviour of d[i++][++i][++i] is undefined. Tell your teacher that, tactfully.
The "%o" format denotes octal output. a is set to 10 in decimal which is 12 in octal. Your code would be clearer if you had written:
int a=012; // octal constant.
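If you want to check both of these points yourself, a small sketch like the following may help. The function names value_in_body and format_octal are made up for this illustration. The first function returns the value of q seen inside the loop body; the second formats a value the way printf("%o", ...) prints it.

```c
#include <stdio.h>

/* Returns the value of q seen inside the loop body (0 if the body never runs). */
int value_in_body(void)
{
    int q = 5;
    /* The condition is an assignment: it stores -5 in q and yields -5, which is non-zero. */
    for (q = -2; (q = -5) != 0; q += 3) {
        return q;   /* leaves the loop like break does, so q += 3 never runs */
    }
    return 0;
}

/* Writes value in octal into buf, using the same %o conversion as the question. */
int format_octal(char *buf, size_t size, unsigned int value)
{
    return snprintf(buf, size, "%o", value);
}
```

value_in_body() returns -5, and format_octal() with the value 10 produces the string "12".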

The online version of the C language standard has what you need (and is what I will be referring to in this answer); just bear in mind it is a language definition and not a tutorial, and as such may not be easy to read for someone who doesn't have a lot of experience yet.
Having said that, your teacher is throwing you a few foul balls. For example:
j=d[i++][++i][++i];
This statement results in undefined behavior for several reasons. The first several paragraphs of section 6.5 of the document linked above explain the problem, but in a nutshell:
Except in a few situations, C does not guarantee left-to-right evaluation of expressions; neither does it guarantee that side effects are applied immediately after evaluation;
Attempting to modify the value of an object more than once between sequence points [1], or modifying and then trying to use the value of an object without an intervening sequence point, results in undefined behavior.
Basically, don't write anything of the form:
x = x++;
x++ * x++;
a[i] = i++;
a[i++] = i;
C does not guarantee that each ++i and i++ is evaluated from left to right, and it does not guarantee that the side effect of each evaluation is applied immediately. So the result of d[i++][++i][++i] is not well-defined, and the result will not be consistent over different programs, or even different builds of the same program [2].
AND, on top of that, i++ evaluates to the current value of i; so clearly, your teacher's intent was for d[i++][++i][++i] to evaluate to d[-1][1][2], which would also result in undefined behavior since you're attempting to index outside of the array bounds.
This is why I hate, hate, hate it when teachers throw this kind of code at their students - not only is it needlessly confusing, not only does it encourage bad practice, but more often than not it's just plain wrong.
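For contrast, here is a sketch of how such an expression can be made well-defined: give each increment its own statement, so every modification of i is completed before the next one starts. The indices chosen here (0, 1, 1) and the name pick_element are my own choices for illustration, not what the teacher intended. The array holds the same values as in the question, with the braces written out so the 3 x 3 x 2 layout is visible.

```c
/* Returns one element of d, computing each index in its own statement. */
int pick_element(void)
{
    int d[][3][2] = {
        { {4, 5},   {6, 7},   {8, 9}   },
        { {10, 11}, {12, 13}, {14, 15} },
        { {16} }                          /* remaining elements are zero */
    };
    int i = -1;
    int first  = ++i;   /* i becomes 0, first is 0 */
    int second = ++i;   /* i becomes 1, second is 1 */
    int third  = i;     /* i is still 1 */

    return d[first][second][third];   /* d[0][1][1], which is 7 */
}
```

Every compiler has to produce the same result for this version, because each statement ends with a sequence point.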
As for the other questions:
for(q=-2;q=-5;q+=3) { //assignment in condition part??
See sections 6.5.16 and 6.8.5.3. In short, an assignment expression has a value (the value of the left operand after any type conversions), and it can appear as part of a controlling expression in a for loop. As long as the result of the assignment is non-zero (as in the case above), the loop will execute.
printf("%o",a); //prints 12
See section 7.21.6.1. The o conversion specifier tells printf to format the integer value as octal: 10 in base 10 is 12 in base 8.
[1] A sequence point is a point in a program's execution where an expression has been fully evaluated and any side effects have been applied. Sequence points occur at the ends of statements, between the evaluation of a function's parameters and the function call, after evaluating the left operand of the &&, ||, and ?: operators, and a few other places. See Annex C for the complete list.
[2] Or even different runs of the same build, although in practice you won't see values change from run to run unless you're doing something really hinky.

How to recursively parse an expression?

I'm writing a small language, and I'm really stuck on expression parsing. I've written an LR Recursive Descent Parser, and it works, but now that I need to parse expressions I'm finding it really difficult. I do not have a grammar defined, but if it helps, I kind of have an idea on how it works even without a grammar. Currently, my expression struct looks like this:
typedef struct s_ExpressionNode {
    Token *value;
    char expressionType;
    struct s_ExpressionNode *lhand;
    char operand;
    struct s_ExpressionNode *rhand;
} ExpressionNode;
I'm trying to get it to parse something like:
5 + 5 + 2 * (-3 / 2) * age
I was reading this article on how to parse expressions. The first grammar I tried to implement but it didn't work out too well, then I noticed the second grammar, which appears to remove left recursion. However, I'm stuck trying to implement it since I don't understand what P, B means, and also U is a - but the - is also for a B? Also I'm not sure what expect(end) is supposed to mean either.

In the "Recursive-descent recognition" section of the article you linked, the E, P, B, and U are the non-terminal symbols in the expression grammar presented. From their definitions in the text, I infer that "E" is chosen as a mnemonic for "expression", "P" as mnemonic for "primary", "B" for "binary (operator)", and "U" for "unary (operator)". Given those characterizations, it should be clear that the terminal symbol "-" can be reduced either to a U or to a B, depending on context:
unary: -1
binary: x-1
The expect() function described in the article is used to consume the next token if it happens to be of the specified type, or otherwise to throw an error. The end token is defined to be a synthetic token representing the end of the input. Thus
expect(end)
expresses the expectation that there are no more tokens to process in the expression, and its given implementation throws an error if that expectation is not met.
All of this is in the text, except the reason for choosing the particular symbols E, P, B, and U. If you're having trouble following the text then you probably need to search out something simpler.
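As a rough illustration of how expect() and the end token fit together, here is a small recognizer in C. It assumes a grammar of the shape E --> P {B P}, P --> v | "(" E ")" | U P, which is how I read the article's description, so check it against the article itself. Further simplifications of my own: every operand v is a single letter or digit, the '\0' terminator of the string plays the role of the end token, and an error sets a flag instead of throwing.

```c
#include <ctype.h>
#include <string.h>

static const char *cur;   /* next unread character of the input */
static int failed;        /* set once any expectation is not met */

/* Peek at the next token; '\0' plays the role of the synthetic end token. */
static char next(void)
{
    while (*cur == ' ')
        cur++;
    return *cur;
}

/* Move past the current token (never past the end token). */
static void consume(void)
{
    if (next() != '\0')
        cur++;
}

/* Consume the next token if it is the expected one, otherwise record an error. */
static void expect(char tok)
{
    if (next() == tok)
        consume();
    else
        failed = 1;
}

static int is_binary(char c)
{
    return c != '\0' && strchr("+-*/^", c) != NULL;
}

static void E(void);

/* P --> v | "(" E ")" | U P   (here U is just "-") */
static void P(void)
{
    char t = next();
    if (isalnum((unsigned char)t)) {
        consume();
    } else if (t == '(') {
        consume();
        E();
        expect(')');
    } else if (t == '-') {
        consume();      /* "-" at the start of a primary is the unary operator */
        P();
    } else {
        failed = 1;
    }
}

/* E --> P {B P}   (a "-" after a primary is the binary operator) */
static void E(void)
{
    P();
    while (!failed && is_binary(next())) {
        consume();
        P();
    }
}

/* Returns 1 if text is a complete expression, 0 otherwise. */
int recognize(const char *text)
{
    cur = text;
    failed = 0;
    E();
    expect('\0');   /* expect(end): nothing may follow the expression */
    return !failed;
}
```

Note how the same "-" character is handled in two places: inside P it is treated as U, inside the loop of E it is treated as B. The final expect('\0') is what rejects input such as "1 2".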

Runtime formula evaluation

I would like to evaluate formulas which a user can input for many data points, so efficiency is a concern. This is for a Fortran project, but my solutions so far have been centered on using a yacc/bison grammar, so I will probably use Fortran's iso_c_binding feature to interface to yyparse().
The preferred (so far) solution would be a small extension of the classic mfcalc calculator example from the Bison manual, with the bison grammar made to recognize a (single) variable name as well (which is not hard).
The question is what to do in the executable statements. I see two options there.
First, I could simply evaluate the expression as it is parsed, as in the mfcalc example.
Second, I could invoke the bison parser once for parsing and for creating a stack-based (reverse Polish) representation of the formula being parsed, so
2 + 3*x would be translated into 2 3 x * + (of course, as the relevant data structure).
The relevant part of the grammar would look like this:
%union {
double val;
char *c;
int fcn;
}
%type <val> NUMBER
%type <c> VAR
%type <fcn> Function
/* Tokens and %left PLUS MINUS etc. left out for brevity */
%%
...
Function:
SIN { $$=SIN; }
| COS { $$=COS; }
| TAN { $$=TAN; }
| SQRT { $$=SQRT; }
Expression:
NUMBER { push_number($1); }
| VAR { push_var($1); }
| Expression PLUS Expression { push_operand(PLUS); }
| Expression MINUS Expression { push_operand(MINUS); }
| Expression DIVIDE Expression { push_operand(DIVIDE); }
| MINUS Expression %prec NEG { push_operand(NEG); }
| LEFT_PARENTHESIS Expression RIGHT_PARENTHESIS
| Function LEFT_PARENTHESIS Expression RIGHT_PARENTHESIS { push_function($1); }
| Expression POWER Expression { push_operand(POWER); }
The functions push_... would put the formula into an array of structs, each of which holds the token and the yacc union.
The RPN would then be interpreted using a very simple (and hopefully fast) interpreter.
So, the questions.
Is the second approach valid? I think it is from what I understand about bison (or yacc's) way of handling shift and reduce (basically, this will shift a number and reduce an expression, so the order should be guaranteed to be correct for RPN), but I am not quite sure.
Also, is it worth the additional effort over simply evaluating the function using the $$ construct (the first approach)?
Finally, are there other, better solutions? I had considered using syntax trees, but I don't think the additional effort is actually worth it. Also, I tend to think that using trees is overkill where an array would do just nicely :-)

It's only slightly more difficult to generate three-address virtual ops than RPN. In effect, the RPN is a virtual stack machine. The three-address ops -- which can also easily go into an array -- are probably faster to interpret, and will probably be more flexible in the long term.
The main advantage of parsing the expression into some internal form is that it is likely to be faster to evaluate the internal form than to reparse the original string. That may not be the case, but it usually is because converting floating-point literals into floating-point numbers is (relatively speaking) quite slow.
There is also the intermediate case of tokenizing the expression (into an array), and then directly evaluating while parsing the token stream. (In effect, that makes bison your virtual machine.)
Which of these strategies is the best depends a lot on details of your use case, but none of them are difficult so you could try all three and compare.
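For the RPN variant, the array and its interpreter could look roughly like the sketch below. The push_ function names follow the question; everything else (struct rpn_item, the RPN_ constants, rpn_reset, rpn_eval, the fixed size of 64) is made up for this illustration. push_var takes no argument here because there is only a single variable, and functions such as SIN are left out to keep it short.

```c
#include <stddef.h>

enum rpn_kind {
    RPN_NUMBER, RPN_VAR, RPN_PLUS, RPN_MINUS, RPN_TIMES, RPN_DIVIDE, RPN_NEG
};

struct rpn_item {
    enum rpn_kind kind;
    double val;               /* used only by RPN_NUMBER */
};

#define RPN_MAX 64
static struct rpn_item program[RPN_MAX];
static size_t program_len;

/* Appends one item; silently ignores items beyond RPN_MAX in this sketch. */
static void emit(enum rpn_kind kind, double val)
{
    if (program_len < RPN_MAX) {
        program[program_len].kind = kind;
        program[program_len].val = val;
        program_len++;
    }
}

/* These would be called from the grammar actions while parsing. */
void rpn_reset(void)                   { program_len = 0; }
void push_number(double v)             { emit(RPN_NUMBER, v); }
void push_var(void)                    { emit(RPN_VAR, 0.0); }
void push_operand(enum rpn_kind kind)  { emit(kind, 0.0); }

/* Evaluates the stored formula for one value of the variable.
   Assumes the program is well-formed, as a successful parse would produce. */
double rpn_eval(double x)
{
    double stack[RPN_MAX];
    size_t sp = 0;

    for (size_t i = 0; i < program_len; i++) {
        double rhs;
        switch (program[i].kind) {
        case RPN_NUMBER: stack[sp++] = program[i].val; break;
        case RPN_VAR:    stack[sp++] = x; break;
        case RPN_NEG:    stack[sp - 1] = -stack[sp - 1]; break;
        case RPN_PLUS:   rhs = stack[--sp]; stack[sp - 1] += rhs; break;
        case RPN_MINUS:  rhs = stack[--sp]; stack[sp - 1] -= rhs; break;
        case RPN_TIMES:  rhs = stack[--sp]; stack[sp - 1] *= rhs; break;
        case RPN_DIVIDE: rhs = stack[--sp]; stack[sp - 1] /= rhs; break;
        }
    }
    return sp == 1 ? stack[0] : 0.0;
}
```

For 2 + 3*x the parser actions would call push_number(2), push_number(3), push_var(), push_operand(RPN_TIMES), push_operand(RPN_PLUS), in that order, and rpn_eval can then be called once per data point without reparsing.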

Regex: Returned type of C functions

I'm trying to write a regular expression that will give me only the returned type of any (see edit) C function in a C file, ignoring spaces and newlines, but I'm not having any luck with it.
Edit: The returned types I have to consider are only basic C data types
Example:
signed
long long
int function1 ( int j, int n)
should give me:
signed long long int
How can I write (or think of a solution for) this regular expression?

The hardest part of the problem is probably answering the question: "how can I tell that I have reached the start of a function definition". Given the various rules of C, it's not clear that there is a "sure fire" answer - so the best you can probably do is come up with a rule that catches "most" situations.
Function definitions will have
A return type with possible qualifier (one or more of void, signed, unsigned, short, long, char, int, float, double, *)
Followed by a word
Followed by an open parenthesis.
This means something like this should work: (demo: http://regex101.com/r/oJ3xS5 )
((?:(?:void|unsigned|signed|long|short|float|double|int|char|\*)(?:\s*))+)(\w+)\s*\(
Note - this does not "clean up the formatting" - so a return value definition that spans multiple lines will still do so. It does have the advantage (compared to other solutions) that it looks specifically for the basic types that are defined in the link in your question.
Also note - you need the g flag to capture all the instances; and I capture the function name itself in its own capturing group (\w+). If you don't want / need that, you can leave out the parentheses. But I thought that having both the return type and the function name might be useful.
Afterthought: if you first strip out multiple white spaces and returns, the above will still work but now there will be no extraneous white space in the return value. For instance you could run your code through
cat source.c | tr '\n' ' ' | sed -E 's/[[:space:]]+/ /g' > strippedSource.c
then process with the regex above.
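If you want to do the matching from C itself, something along these lines might work with the POSIX regex.h functions. This is only a sketch: POSIX extended regular expressions have no (?:...), \s or \w, so the pattern is rewritten with bracket classes, and it requires whitespace after each type keyword instead of a word boundary. The function name return_type is made up, and like the regex above it can be fooled (for example by an identifier that ends in "int").

```c
#include <ctype.h>
#include <regex.h>
#include <stddef.h>

/*
 * Finds the first "type ... name(" in src and writes the type into out,
 * with every run of whitespace collapsed to a single space.
 * Returns the length written, or -1 if nothing matches or out is too small.
 */
int return_type(const char *src, char *out, size_t cap)
{
    static const char pattern[] =
        "(((void|unsigned|signed|long|short|float|double|int|char)[[:space:]]+"
        "|\\*[[:space:]]*)+)"
        "([[:alnum:]_]+)[[:space:]]*\\(";
    regex_t re;
    regmatch_t m[5];   /* m[1] is the whole return type, m[4] the function name */
    size_t n = 0;
    int status;

    if (cap == 0 || regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    status = regexec(&re, src, 5, m, 0);
    regfree(&re);
    if (status != 0)
        return -1;

    for (regoff_t i = m[1].rm_so; i < m[1].rm_eo; i++) {
        char c = isspace((unsigned char)src[i]) ? ' ' : src[i];
        if (c == ' ' && (n == 0 || out[n - 1] == ' '))
            continue;                 /* drop leading and repeated spaces */
        if (n + 1 >= cap)
            return -1;
        out[n++] = c;
    }
    if (n > 0 && out[n - 1] == ' ')
        n--;                          /* drop the trailing space */
    out[n] = '\0';
    return (int)n;
}
```

For the example in the question this yields "signed long long int", without needing the tr and sed step first.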

Concatenate all words using the OR operator:
\b((void|unsigned|signed|char|short|int|long|float|double)\s*)+\b
The \b at start and end are to prevent partial function names popping up (void longjmp comes to mind).
This will not catch typedefs such as uchar_8, or complicated pointer-to-pointer constructions such as void (* int)(*) (I just made this up, it may not mean anything).

How to get and evaluate the Expressions from a string in C

How to get and evaluate the Expressions from a string in C
char *str = "2*8-5+6";
This should give the result as 17 after evaluation.

Try it yourself. You can use a stack data structure to evaluate this string. Here is a reference implementation (it's in C++):
stack data structure for string calculation

You have to do it yourself, C does not provide any way to do this. C is a very low level language. Simplest way to do it would be to find a library that does it, or if that does not exist use lex + yacc to create your own interpreter.
A quick google suggests the following:
http://www.gnu.org/software/libmatheval/
http://expreval.sourceforge.net/

You should try TinyExpr. It's a single C source code file (with no dependencies) that you can add to your project.
Using it to solve your problem is just:
#include <stdio.h>
#include "tinyexpr.h"
int main()
{
double result = te_interp("2*8-5+6", 0);
printf("Result: %f\n", result);
return 0;
}
That will print out: Result: 17.000000

C does not have a standard eval() function.
There are lots of libraries and other tools out there that can do this.
But if you'd like to learn how to write an expression evaluator yourself, it can be surprisingly easy. It is not trivial: it is actually a pretty deeply theoretical problem, because you're basically writing a miniature parser, perhaps built on a miniature lexical analyzer, just like a real compiler.
One straightforward way of writing a parser involves a technique called recursive descent. Writing a recursive descent parser has a lot in common with another great technique for solving big or hard problems, namely by breaking the big, hard problem up into smaller and hopefully easier subproblems.
So let's see what we can come up with. We're going to write a function int eval(const char * expr) that takes a string containing an expression, and returns the int result of evaluating it. But first let's write a tiny main program to test it with. We'll read a line of text typed by the user using fgets, pass it to our eval() function, and print the result.
#include <stdio.h>
int eval(const char *expr);
int main()
{
char line[100];
while(1) {
printf("Expression? ");
if(fgets(line, sizeof line, stdin) == NULL) break;
printf(" -> %d\n", eval(line));
}
}
So now we start writing eval(). The first question is, how will we keep track of how far we've read through the string as we parse it? A simple (although mildly cryptic) way of doing this will be to pass around a pointer to a pointer to the next character. That way any function can move forwards (or occasionally backwards) through the string. So our eval() function is going to do almost nothing, except take the address of the pointer to the string to be parsed, resulting in the char ** we just decided we need, and call a function evalexpr() to do the work. (But don't worry, I'm not procrastinating; in just a second we'll start doing something interesting.)
int evalexpr(const char **);
int eval(const char *expr)
{
return evalexpr(&expr);
}
So now it's time to write evalexpr(), which is going to start doing some actual work. Its job is to do the first, top-level parse of the expression. It's going to look for a series of "terms" being added or subtracted. So it wants to get one or more subexpressions, with + or - operators between them. That is, it's going to handle expressions like
1 + 2
or
1 + 2 - 3
or
1 + 2 - 3 + 4
Or it can read a single expression like
1
Or any of the terms being added or subtracted can be a more-complicated subexpression, so it will also be able to (indirectly) handle things like
2*3 + 4*5 - 9/3
But the bottom line is that it wants to take an expression, then maybe a + or - followed by another subexpression, then maybe a + or - followed by another subexpression, and so on, as long as it keeps seeing a + or -. Here is the code. Since it's adding up the additive "terms" of the expression, it gets subexpressions by calling a function evalterm(). It also needs to look for the + and - operators, and it does this by calling a function gettok(). Sometimes it will see an operator other than + or -, but those are not its job to handle, so if it sees one of those it "ungets" it, and returns, because it's done. All of these functions pass the pointer-to-pointer p around, because as I said earlier, that's how all of these functions keep track of how they're moving through the string as they parse it.
int evalterm(const char **);
int gettok(const char **, int *);
void ungettok(int, const char **);
int evalexpr(const char **p)
{
int r = evalterm(p);
while(1) {
int op = gettok(p, NULL);
switch(op) {
case '+': r += evalterm(p); break;
case '-': r -= evalterm(p); break;
default: ungettok(op, p); return r;
}
}
}
This is some pretty dense code. Stare at it carefully and convince yourself that it's doing what I described. It calls evalterm() once, to get the first subexpression, and assigns the result to the local variable r. Then it enters a potentially infinite loop, because it can handle an arbitrary number of added or subtracted terms. Inside the loop, it gets the next operator in the expression, and uses it to decide what to do. (Don't worry about the second, NULL argument to gettok; we'll get to that in a minute.)
If it sees a +, it gets another subexpression (another term) and adds it to the running sum. Similarly, if it sees a -, it gets another term and subtracts it from the running sum. If it gets anything else, this means it's done, so it "ungets" the operator it doesn't want to deal with, and returns the running sum -- which is literally the value it has evaluated. (The return statement also breaks the "infinite" loop.)
At this point you're probably feeling a mixture of "Okay, this is starting to make sense" but also "Wait a minute, you're playing pretty fast and loose here, this is never going to work, is it?" But it is going to work, as we'll see.
The next function we need is the one that collects the "terms" or subexpressions to be added (and subtracted) together by evalexpr(). That function is evalterm(), and it ends up being very similar -- very similar -- to evalexpr. Its job is to collect a series of one or more subexpressions joined by * and/or /, and multiply and divide them. At this point, we're going to call those subexpressions "primaries". Here is the code:
int evalpri(const char **);
int evalterm(const char **p)
{
int r = evalpri(p);
while(1) {
int op = gettok(p, NULL);
switch(op) {
case '*': r *= evalpri(p); break;
case '/': r /= evalpri(p); break;
default: ungettok(op, p); return r;
}
}
}
There's actually nothing more to say here, because the structure of evalterm ends up being exactly like evalexpr, except that it does things with * and /, and it calls evalpri to get/evaluate its subexpressions.
So now let's look at evalpri. Its job is to evaluate the three lowest-level, but highest-precedence elements of an expression: actual numbers, and parenthesized subexpressions, and the unary - operator.
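int evalpri(const char **p)
{
    int v;
    int op = gettok(p, &v);
    switch(op) {
    case '1': return v;
    case '-': return -evalpri(p);
    case '(':
        v = evalexpr(p);
        op = gettok(p, NULL);
        if(op != ')') {
            fprintf(stderr, "missing ')'\n");
            ungettok(op, p);
        }
        return v;
    default:
        /* not a number, '-' or '(': report it rather than fall off the end without a return value */
        fprintf(stderr, "unexpected token\n");
        ungettok(op, p);
        return 0;
    }
}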
The first thing to do is call the same gettok function we used in evalexpr and evalterm. But now it's time to say a little more about it. It is actually the lexical analyzer used by our little parser. A lexical analyzer returns primitive "tokens". Tokens are the basic syntactic elements of a programming language. Tokens can be single characters, like + or -, or they can also be multi-character entities. An integer constant like 123 is considered a single token. In C, other tokens are keywords like while, and identifiers like printf, and multi-character operators like <= and ++. (Our little expression evaluator doesn't have any of those, though.)
So gettok has to return two things. First it has to return a code for what kind of token it found. For single-character tokens like + and - we're going to say that the code is just the character. For numeric constants (that is, any numeric constant), we're going to say that gettok is going to return the character 1. But we're going to need some way of knowing what the value of the numeric constant was, and that, as you may have guessed, is what the second, pointer argument to the gettok function is for. When gettok returns 1 to indicate a numeric constant, and if the caller passes a pointer to an int value, gettok will fill in the integer value there. (We'll see the definition of the gettok function in a moment.)
At any rate, with that explanation of gettok out of the way, we can understand evalpri. It gets one token, passing the address of a local variable v in which the value of the token can be returned, if necessary. If the token is a 1 indicating an integer constant, we simply return the value of that integer constant. If the token is a -, this is a unary minus sign, so we get another subexpression, negate it, and return it. Finally, if the token is a (, we get another whole expression, and return its value, checking to make sure that there's another ) token after it. And, as you may notice, inside the parentheses we make a recursive call back to the top-level evalexpr function to get the subexpression, because obviously we want to allow any subexpression, even one containing lower-precedence operators like + and -, inside the parentheses.
And we're almost done. Next we can look at gettok. As I mentioned, gettok is the lexical analyzer, inspecting individual characters and constructing full tokens from them. We're now, finally, starting to see how the passed-around pointer-to-pointer p is used.
#include <stdlib.h>
#include <ctype.h>
void skipwhite(const char **);
int gettok(const char **p, int *vp)
{
skipwhite(p);
char c = **p;
if(isdigit(c)) {
char *p2;
int v = strtoul(*p, &p2, 0);
*p = p2;
if(vp) *vp = v;
return '1';
}
(*p)++;
return c;
}
Expressions can contain arbitrary whitespace, which is ignored, so we skip over that with an auxiliary function skipwhite. And now we look at the next character. p is a pointer to pointer to that character, so the character itself is **p. If it's a digit, we call strtoul to convert it. strtoul helpfully returns a pointer to the character following the number it scans, so we use that to update p. We fill in the passed pointer vp with the actual value strtoul computed for us, and we return the code 1 indicating an integer constant.
Otherwise -- if the next character isn't a digit -- it's an ordinary character, presumably an operator like + or - or punctuation like ( ), so we simply return the character, after incrementing *p to record the fact that we've consumed it. Properly "incrementing" p is mildly tricky: it's a pointer to a pointer, and we want to increment the pointed-to pointer. If we wrote p++ or *p++ it would increment the pointer p, so we need (*p)++ to say that it's the pointed-to pointer that we want to increment. (See also C FAQ 4.3.)
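If the difference between (*p)++ and *p++ still seems slippery, this tiny pair of functions may help. The names advance and advance_wrong are made up for the demonstration.

```c
/* Increments the pointed-to pointer: the caller's pointer moves forward by one. */
void advance(const char **p)
{
    (*p)++;
}

/* Looks similar, but is parsed as *(p++): only the local copy of p moves,
   so the caller's pointer is left unchanged. */
void advance_wrong(const char **p)
{
    (void)*p++;   /* the cast only silences the "value unused" warning */
}
```

If s points at "abc" and you call advance(&s), s then points at "bc". Calling advance_wrong(&s) afterwards leaves s exactly where it was.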
Two more utility functions, and then we're done. Here's skipwhite:
void skipwhite(const char **p)
{
while(isspace(**p))
(*p)++;
}
This simply skips over zero or more whitespace characters, as determined by the isspace function from <ctype.h>. (Again, we're taking care to remember that p is a pointer-to-pointer.)
Finally, we come to ungettok. It's a hallmark of a recursive descent parser (or, indeed, almost any parser) that it has to "look ahead" in the input, making a decision based on the next token. Sometimes, however, it decides that it's not ready to deal with the next token after all, so it wants to leave it there on the input for some other part of the parser to deal with later.
Stuffing input "back on the input stream", so to speak, can be tricky. This implementation of ungettok is simple, but it's decidedly imperfect:
void ungettok(int op, const char **p)
{
(*p)--;
}
It doesn't even look at the token it's been asked to put back; it just backs the pointer up by 1. This will work if (but only if) the token it's being asked to unget is in fact the most recent token that was gotten, and if it's not an integer constant token. In fact, for the program as written, and as long as the expression it's parsing is well-formed, this will always be the case. It would be possible to write a more complicated version of gettok that explicitly checked for these assumptions, and that would be able to back up over multi-character tokens (such as integer constants) if necessary, but this post has gotten much longer than I had intended, so I'm not going to worry about that for now.
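For the curious, here is one possible shape for a version that can back up over multi-character tokens. It is only a sketch with a different interface than the code above: instead of a bare pointer-to-pointer it uses a small struct (the names struct lexer, lexer_init, lexer_get and lexer_unget are all made up here) that remembers where the most recent token started, so "ungetting" just means going back to that position.

```c
#include <ctype.h>
#include <stdlib.h>

struct lexer {
    const char *pos;     /* next unread character */
    const char *start;   /* where the most recent token began */
};

void lexer_init(struct lexer *lx, const char *text)
{
    lx->pos = text;
    lx->start = text;
}

/* Same token codes as gettok: '1' for an integer constant, otherwise the
   character itself. Returns 0 at the end of the string without moving past it. */
int lexer_get(struct lexer *lx, int *vp)
{
    while (isspace((unsigned char)*lx->pos))
        lx->pos++;
    lx->start = lx->pos;              /* remember where this token begins */
    if (isdigit((unsigned char)*lx->pos)) {
        char *end;
        int v = (int)strtoul(lx->pos, &end, 0);
        lx->pos = end;
        if (vp)
            *vp = v;
        return '1';
    }
    if (*lx->pos == '\0')
        return 0;
    return *lx->pos++;
}

/* Puts back the most recent token, however long it was. */
void lexer_unget(struct lexer *lx)
{
    lx->pos = lx->start;
}
```

It still only supports ungetting the single most recent token, but that token may now be an integer constant such as 123.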
But if you're still with me, we're done! And if you haven't already, I encourage you to copy all the code I've presented into your friendly neighborhood C compiler, and try it out. You'll find, for example, that 1 + 2 * 3 gives 7 (not 9), because the parser "knows" that * and / have higher precedence than + and -. Just like in a real compiler, you can override the default precedence using parentheses: (1 + 2) * 3 gives 9. Left-to-right associativity works, too: 1 - 2 - 3 is -4, not +2. It handles plenty of complicated, and perhaps surprising (but legal) cases, too: (((((5))))) evaluates to just 5, and ----4 evaluates to just 4 (it's parsed as "negative negative negative negative four", since our simplified parser doesn't have C's -- operator).
This parser does have a pretty big limitation, however: its error handling is terrible. It will handle legal expressions, but for illegal expressions, it will either do something bizarre, or just ignore the problem. For example, it simply ignores any trailing garbage it doesn't recognize or wasn't expecting -- the expressions 4 + 5 x, 4 + 5 %, and 4 + 5 ) all evaluate to 9.
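If you want to see what slightly stricter error handling could look like, here is a compact, self-contained variant. It is a sketch, not a drop-in replacement for the code above: it peeks at the next character instead of using gettok and ungettok, it computes in long using strtol with base 10, and all of the names (eval_checked, parse_expr and so on) are made up. The idea is simply to carry an error flag through the recursion and to check, at the very end, that nothing but whitespace is left.

```c
#include <ctype.h>
#include <stdlib.h>

static long parse_expr(const char **p, int *err);

static void skip_space(const char **p)
{
    while (isspace((unsigned char)**p))
        (*p)++;
}

static long parse_primary(const char **p, int *err)
{
    skip_space(p);
    if (isdigit((unsigned char)**p)) {
        char *end;
        long v = strtol(*p, &end, 10);
        *p = end;
        return v;
    }
    if (**p == '-') {
        (*p)++;
        return -parse_primary(p, err);
    }
    if (**p == '(') {
        long v;
        (*p)++;
        v = parse_expr(p, err);
        skip_space(p);
        if (**p == ')')
            (*p)++;
        else
            *err = 1;               /* missing ')' */
        return v;
    }
    *err = 1;                       /* not a number, '-' or '(' */
    return 0;
}

static long parse_term(const char **p, int *err)
{
    long r = parse_primary(p, err);
    for (;;) {
        skip_space(p);
        if (**p == '*') {
            (*p)++;
            r *= parse_primary(p, err);
        } else if (**p == '/') {
            long d;
            (*p)++;
            d = parse_primary(p, err);
            if (d == 0) {           /* refuse to divide by zero */
                *err = 1;
                return 0;
            }
            r /= d;
        } else {
            return r;
        }
    }
}

static long parse_expr(const char **p, int *err)
{
    long r = parse_term(p, err);
    for (;;) {
        skip_space(p);
        if (**p == '+') {
            (*p)++;
            r += parse_term(p, err);
        } else if (**p == '-') {
            (*p)++;
            r -= parse_term(p, err);
        } else {
            return r;
        }
    }
}

/* Returns 0 and stores the value in *result, or returns -1 on any error,
   including trailing garbage after the expression. */
int eval_checked(const char *text, long *result)
{
    int err = 0;
    long v = parse_expr(&text, &err);
    skip_space(&text);
    if (err || *text != '\0')
        return -1;
    *result = v;
    return 0;
}
```

With this version 4 + 5 still evaluates to 9, but 4 + 5 x and 4 + 5 ) are rejected instead of being silently accepted.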
Despite being somewhat of a "toy", this is also a very real parser, and as we've seen it can parse a lot of real expressions. You can learn a lot about how expressions are parsed (and about how compilers can be written) by studying this code. (One footnote: recursive descent is not the only way to write a parser, and in fact real compilers will usually use considerably more sophisticated techniques.)
You might even want to try extending this code, to handle other operators or other "primaries" (such as settable variables). Once upon a time, in fact, I started with something like this and extended it all the way into an actual C interpreter.
