How to find a substring - c

Given a text file containing a string, I would like to find some specific substrings/sequences inside this string.
Bison file .y (Declaration+Rules)
%token <cval> AMINO
%token STARTCODON STOPCODON
%type <cval> seq
%%
series: STARTCODON seq STOPCODON {printf("%s", $2);}
seq: AMINO
| seq AMINO
;
%%
Here I want to print every sequence between STARTCODON and STOPCODON
Flex file .l (Rules)
%%
("ATG")+ {return STARTCODON;}
("TAA"|"TAG"|"TGA")+ {return STOPCODON;}
("GCT"|"GCC"|"GCA"|"GCG")+ {yylval.cval = 'A';
return AMINO;}
("CGT"|"CGC"|"CGA"|"CGG"|"AGA"|"AGG")+ {yylval.cval = 'R';
return AMINO;}
.
.
.
[ \t]+ /*ignore whitespace*/
\n /*ignore end of line*/
. {printf("-");}
%%
When I run the code I get only the output of the rule . {printf("-");}.
I am new to Flex/Bison, and I suspect that:
The bison rule series: STARTCODON seq STOPCODON {printf("%s", $2);} is not correct.
Flex doesn't subdivide correctly the entire string into tokens of 3 characters.
EDIT:
(Example) Input file: DnaSequence.txt:
Input string:cccATGAATTATTAGzzz, where lower characters (ccc, zzz) produce the (right) output -, ATG is the STARTCODON, AATTAT is the sequence of two AMINO (AAT TAT), and TAG is the STOPCODON.
This input string produces the (wrong) output ---.
EDIT:
Following the suggestions of @JohnBollinger I have added <<EOF>> {return ENDTXT;} in the Flex file, and the rule finalseries: series ENDTXT; in the Bison file.
Now it's returning the yyerror's error message, indicating a parsing error.
I suppose that we need a STARTTXT token, but I don't know how to implement it.

I am new to Flex/Bison, and I suspect that:
The bison rule series: STARTCODON seq STOPCODON {printf("%s", $2);} is not correct.
The rule is syntactically acceptable. It would be semantically correct if the value of token 2 were a C string, in which case it would cause that value to be printed to the standard output, but your Flex file appears to assume that type <cval> is char, which is not a C string, nor directly convertible to one.
Flex doesn't subdivide correctly the entire string into tokens of 3 characters.
Your Flex input looks OK to me, actually. And the example input / output you present indicates that Flex is indeed recognizing all your triplets from ATG to TAG, else the rule for . would be triggered more than three times.
The datatype problem is a detail that you'll need to sort out, but the main problem is that your production for seq does not set its semantic value. How that results in (seemingly) nothing being printed when the series production is used for a reduction depends on details that you have not disclosed, and probably involves undefined behavior.
If <cval> were declared as a string (char *), and if your lexer set its values as strings rather than as characters, then setting the semantic value might look something like this:
seq: AMINO { $$ = calloc(MAX_AMINO_ACIDS + 1, 1); /* check for allocation failure ... */
strcpy($$, $1); }
| seq AMINO { $$ = $1; strcat($$, $2); }
;
You might consider sticking with char as the type for the semantic value of AMINO, and defining seq to have a different type (i.e. char *). That way, your changes could be restricted to the grammar file. That would, however, call for a different implementation of the semantic actions in the production for seq.
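As a rough sketch of that second option, assuming AMINO keeps a char value and seq is given a string type, a small helper could grow the sequence one letter at a time. The name seq_append is hypothetical and not part of Bison; the actions would then be something like $$ = seq_append(NULL, $1); for the first alternative and $$ = seq_append($1, $2); for the second.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: append one amino-acid letter to a heap-allocated
 * sequence string. Pass NULL as seq to start a new sequence.
 * Returns the (possibly moved) string, or NULL on allocation failure,
 * in which case the original string has been freed. */
char *seq_append(char *seq, char amino)
{
    size_t len = seq ? strlen(seq) : 0;
    char *grown = realloc(seq, len + 2);   /* one new letter + terminator */
    if (grown == NULL) {
        free(seq);
        return NULL;
    }
    grown[len] = amino;
    grown[len + 1] = '\0';
    return grown;
}
```

Whoever consumes the finished string (for example the action for series, after printing it) would be responsible for calling free() on it.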
Finally, note that although you say
Here I want to print every sequence between STARTCODON and STOPCODON
your grammar, as presented, has series as its start symbol. Thus, once it reduces the token sequence to a series, it expects to be done. If additional tokens follow (say those of another series) then that would be erroneous. If that's something you need to support then you'll need a higher-level start symbol representing a sequence of multiple series.

Why doesn't YACC generate shift-reduce conflict?

Trying to understand shift-reduce conflicts and fix them.
I have the following YACC code, for which I was expecting a shift-reduce conflict, but Bison doesn't generate any such warning
%%
lang_cons: /* empty */
| declaraion // SEMI_COLON
| func
;
declaraion : keyword ID
;
func : keyword ID SEMI_COLON
;
keyword : INT
| FLOAT
;
%%
But if I uncomment SEMI_COLON in 2nd rule (i.e, | declaraion SEMI_COLON ), I get shift-reduce conflict. I was expecting reduce-reduce conflict in this case. Please help me understand this mess!
PS: Consider input,
1) int varName
2) int func;
If you give bison the -v command line flag, it will produce an .output file containing the generated state machine, which will probably help you see what is going on.
Note that bison actually parses the augmented grammar, which consists of your grammar with the additional rule
start': start END
where END is a special token whose code is 0, indicating the end of input, and start is whatever your grammar uses as a start symbol. (That ensures that the bison parser will not silently ignore garbage at the end of an otherwise valid input.)
That makes your original grammar unambiguous; after varName is seen, the lookahead will be either END, in which case declaration is reduced, or ';', which will be shifted (followed by a reduction of func when the following END is seen).
In your second grammar, the conflict involves the choice between reducing declaration or shifting the semicolon. If the semicolon were part of declaration, then you would see a reduce/reduce conflict.
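To make the role of the lookahead token concrete, here is a hand-written recognizer for the original grammar. It is only an illustration of the decision the generated parser makes after seeing keyword ID; the token codes and the function name classify are assumptions, and bison's real parser is table-driven rather than written like this.

```c
/* Hypothetical token codes; END plays the role of bison's end-of-input token. */
enum token { END = 0, INT, FLOAT, ID, SEMI_COLON };
enum result { R_EMPTY, R_DECLARATION, R_FUNC, R_ERROR };

/* Recognize one lang_cons from an END-terminated token array,
 * using a single token of lookahead at each step. */
enum result classify(const enum token *t)
{
    if (t[0] == END)
        return R_EMPTY;                       /* lang_cons: empty */
    if (t[0] != INT && t[0] != FLOAT)
        return R_ERROR;
    if (t[1] != ID)
        return R_ERROR;
    /* After "keyword ID" the lookahead token alone decides the action. */
    if (t[2] == END)
        return R_DECLARATION;                 /* reduce declaration */
    if (t[2] == SEMI_COLON && t[3] == END)
        return R_FUNC;                        /* shift ';', then reduce func */
    return R_ERROR;
}
```

Because END and ';' are different tokens, the two branches never compete, which is why the original grammar produces no conflict.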

Project on flex and bison

I have used flex and bison in order to make a lexical analyzer and a parser for an EBNF grammar. This work is done! I mean, when I feed it a file containing a program I wrote, I can see whether the program has mistakes. If it doesn't, I can see the whole program on my screen, based on the grammar I have used. I have no problem with this.
Now, I want to use loop handling and loop unrolling. Which part should I change? The lexical analyzer? The parser? Or the main after the parser? And how?
Introduction
Since we cannot see the code that handles loops in your parser and emits output, nor an example of a specific loop that you might want unrolled, it is difficult to give more detailed advice than has already been given. There are unlikely to be any more experienced compiler writers or teachers anywhere on the globe than those already reading your question! So we will need to explore other ways to explain how to solve a problem like this.
It often happens that people can't post examples of their code because they started with a significant code base provided as part of a class exercise or from an open source repository, and they do not understand it well enough to find appropriate code fragments to post. If you had the complete source of a working compiler for a real language and wanted to add some loop optimisations to it, you might well say, as you did, "what source, how can I show some source?" (because in actuality it is many tens of thousands of lines of code).
An Example Compiler
In the absence of some code to reference the alternative is to create one, as an exemplar, to explain the problem and solution. This is often how it is done in compiler text books or compiler classes. I will use a similar simple example to demonstrate how such optimisations can be achieved using the tools flex and bison.
First, we need to define the language of the example. To keep within the reasonable size constraints of a SO answer the language must be very simple. I will use simple assignments of expressions as the only statement form in my language. The variables in this language will be single letters and the constants will be positive integers. The only expression operator is plus (+). An example program in my language might be:
i = j + k; j = 1 + 2
The output code generated by the compiler will be simple assembler for a single accumulator machine with four instructions, LDA, STO, ADD and STP. The code generated for the above statements would be:
LDA j
ADD k
STO i
LDA #1
ADD #2
STO j
STP
Where LDA loads a value or variable into the accumulator, ADD adds a variable or value to the accumulator, STO stores the accumulator back to a variable. STP is "stop" for the end-of-program.
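To pin down what these four instructions mean, here is a minimal interpreter sketch for the machine. The struct layout, the enum names and the function run are my own assumptions for illustration; the answer itself only ever prints the assembler text.

```c
/* Hypothetical in-memory form of the four instructions described above. */
enum opcode { LDA, ADD, STO, STP };

struct instr {
    enum opcode op;
    int is_const;   /* non-zero: arg is an immediate value (the "#n" form) */
    int arg;        /* immediate value, or a variable letter 'a'..'z' */
};

/* Run a program against 26 single-letter variables; stops at STP.
 * Assumes every non-constant arg is a lowercase letter. */
void run(const struct instr *prog, int vars[26])
{
    int acc = 0;
    for (; prog->op != STP; prog++) {
        int value = prog->is_const ? prog->arg : vars[prog->arg - 'a'];
        switch (prog->op) {
        case LDA: acc = value; break;                 /* load accumulator */
        case ADD: acc += value; break;                /* add to accumulator */
        case STO: vars[prog->arg - 'a'] = acc; break; /* store accumulator */
        case STP: break;                              /* not reached */
        }
    }
}
```

Feeding it the seven instructions listed above, with j and k preset, leaves j + k in i and 3 in j.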
The flex program
The language shown above will need the tokens for ID and NUMBER and should also skip whitespace. The following will suffice:
%{
#include <stdlib.h> /* atol */
#define yyterminate() return (END);
%}
digit [0-9]
id [a-z]
ws [\t\n\r ]
%%
{ws}+ /* Skip whitespace */
{digit}+ {yylval = (int)(0l - atol(yytext)); return(NUMBER); }
{id} {yylval = yytext[0]; return(ID); }
"+" {return('+'); }
"=" {return('='); }
";" {return(';'); } /* statement separator used by the grammar */
Gory details
Just some notes on how this works. I've used atol to convert the integer in order to deal with the potential integer overflow that can occur in reading MAXINT. I'm negating the constants so they can be easily distinguished from the identifiers, which will be positive and fit in one byte. I'm storing single-character identifiers to avoid the burden of illustrating symbol table code, and thus permit a very small lexer, parser and code generator.
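The encoding trick can be spelled out as a few one-line helpers. These function names are hypothetical; the lexer and parser in this answer do the same arithmetic inline.

```c
#include <stdlib.h>

/* Encode a decimal literal as a non-positive int, as the lexer rule does. */
int encode_number(const char *text) { return (int)(0L - atol(text)); }

/* Encode a single-letter identifier as its (positive) character code. */
int encode_id(char letter) { return letter; }

/* A semantic value is a constant when it is zero or negative. */
int is_constant(int value) { return value <= 0; }

/* Recover the original magnitude of an encoded constant. */
int constant_value(int value) { return -value; }
```

So the literal 42 travels through the parser as -42, while the identifier j travels as the character code of 'j', and a single comparison against zero tells them apart.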
The bison program
The following bison program parses the language and generates code from the bison actions:
%{
#include <stdio.h>
%}
%token NUMBER ID END
%%
program : statements END { printf("STP\n"); return(0) ; }
;
statements : statement
| statements ';' statement
;
statement : ID '=' expression { printf("STO %c\n",$1); }
|
;
expression : operand {
/* Load operand into accumulator */
if ($1 <= 0)
printf("LDA #%d\n",(int)0l-$1);
else printf("LDA %c\n",$1);
}
| expression '+' operand {
/* Add operand to accumulator */
if ($3 <= 0)
printf("ADD #%d\n",(int)0l-$3);
else printf("ADD %c\n",$3);
}
;
operand : NUMBER
| ID
;
%%
#include "lex.yy.c"
Explanation of methodology
This paragraph is intended for those who know how to do this and might query the approach used in my examples. I've deliberately avoided building a tree and doing a tree walk, although this would be the orthodox technique for code generation and optimisation. I wanted to avoid adding all the necessary code overhead in the example to manage the tree and walk it. This way my example compiler can be really tiny. However, being restricted to only using bison action to perform the code generation limits me to the ordering of the bison rule matching. This meant that only pseudo-machine code could really be generated. A source-to-source example would be less tractable with this methodology. I've chosen an idealised machine that is a cross between MU0 and a register-less PDP/11, again with the bare minimum of features to demonstrate some optimisations of code.
Optimisation
Now we have a working compiler for a language in a few lines of code we can start to demonstrate how the process of adding code optimisation might work.
As has already been said by the esteemed @Chris Dodd:
If you want to do program transformations after parsing, you should do them after parsing. You can do them incrementally (calling transform routines from your bison code after parsing part of your input), or after parsing is complete, but either way, they happen after parsing the part of the program you are transforming.
This compiler works by emitting code incrementally after parsing part of the input. As each statement is recognised the bison action (within the {...} clause) is invoked to generate code. If this is to be transformed into more optimal code it is this code that has to be changed to generate the desired optimisation. To be able to achieve effective optimisation we need a clear understanding of what language features are to be optimised and what the optimal transformation should be.
Constant Folding
A common optimisation (or code transformation) that can be done in a compiler is constant folding. In constant folding the compiler replaces expressions made entirely of numbers by the result. For example consider the following:
i = 1 + 2
An optimisation would be to treat this as:
i = 3
Thus the addition of 1 + 2 was made by the compiler and not put into the generated code to occur at run time. We would expect the following output to result:
LDA #3
STO i
Improved Code Generator
We can implement the improved code by looking for the explicit case where we have a NUMBER on both sides of expression '+' operand. To do this we have to delay taking any action on expression : operand to permit the value to be propagated onwards. As the value for an expression might not have been evaluated we have to potentially do that on assignment and addition, which makes for a slight explosion of if statements. We only need to change the actions for the rules statement and expression however, which are as shown below:
statement : ID '=' expression {
/* Check for constant expression */
if ($3 <= 0) printf("LDA #%d\n",(int)0l-$3);
else
/* Check if expression in accumulator */
if ($3 != 'A') printf("LDA %c\n",$3);
/* Now store accumulator */
printf("STO %c\n",$1);
}
| /* empty statement */
;
expression : operand { $$ = $1 ; }
| expression '+' operand {
/* First check for constant expression */
if ( ($1 <= 0) && ($3 <= 0)) $$ = $1 + $3 ;
else { /* No constant folding */
/* See if $1 already in accumulator */
if ($1 != 'A')
/* Load operand $1 into accumulator */
if ($1 <= 0)
printf("LDA #%d\n",(int)0l-$1);
else printf("LDA %c\n",$1);
/* Add operand $3 to accumulator */
if ($3 <= 0)
printf("ADD #%d\n",(int)0l-$3);
else printf("ADD %c\n",$3);
$$ = 'A'; /* Note accumulator result */
}
}
;
If you build the resultant compiler, you will see that it does indeed generate better code and perform the constant folding transformation.
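If you want to experiment with the folding logic without running bison, the action for expression '+' operand can be mirrored as a plain C function that appends to a string buffer instead of calling printf. This is a sketch only; the names emit and add_operands and the buffer-based interface are assumptions made for the example.

```c
#include <stdio.h>
#include <string.h>

#define IN_ACC 'A'   /* marker: the value is already in the accumulator */

/* Append one instruction for operand v, e.g. "LDA #3" or "ADD j".
 * out must already hold a NUL-terminated string shorter than size. */
static void emit(char *out, size_t size, const char *mnemonic, int v)
{
    size_t used = strlen(out);
    if (v <= 0)
        snprintf(out + used, size - used, "%s #%d\n", mnemonic, -v);
    else
        snprintf(out + used, size - used, "%s %c\n", mnemonic, v);
}

/* Mirror of the action for expression '+' operand: fold or emit code.
 * Returns the semantic value of the resulting expression. */
int add_operands(int lhs, int rhs, char *out, size_t size)
{
    if (lhs <= 0 && rhs <= 0)
        return lhs + rhs;            /* both constants: fold, emit nothing */
    if (lhs != IN_ACC)
        emit(out, size, "LDA", lhs); /* bring the left operand in first */
    emit(out, size, "ADD", rhs);
    return IN_ACC;                   /* result now lives in the accumulator */
}
```

Calling add_operands(-1, -2, ...) returns -3 and emits nothing, which is exactly the folding of 1 + 2 into 3 described above.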
Loop Unrolling
The transformation that you specifically asked about in your question was loop unrolling. In loop unrolling the compiler looks for specific integer expression values in the loop start and end conditions to determine whether the unrolled code transformation should be performed. The compiler can then generate one of two alternative code sequences for a loop: the unrolled code or the standard looping code. We can demonstrate this concept in this example mini-compiler by using integer increments.
If we imagine that the machine code has an INC instruction which increments the accumulator by one and is faster than performing an ADD #1 instruction, we can further improve the compiler by looking for that specific case. This involves evaluating integer constant expressions and comparing to a specific value to decide if an alternative code sequence should be used - just as in loop unrolling. For example:
i = j + 1
should result in:
LDA j
INC
STO i
Final Code Generator
To change the code generated for n + 1 we only need to recode part of the expression semantics: when not folding constants, test whether the constant to be used is 1 (which is stored negated, as -1, in this example). The resultant code becomes:
expression : operand { $$ = $1 ; }
| expression '+' operand {
/* First check for constant expression */
if ( ($1 <= 0) && ($3 <= 0)) $$ = $1 + $3 ;
else { /* No constant folding */
/* Check for special case of constant 1 on LHS */
if ($1 == -1) {
/* Swap LHS/RHS to permit INC usage */
$1 = $3;
$3 = -1;
}
/* See if $1 already in accumulator */
if ($1 != 'A')
/* Load operand $1 into accumulator */
if ($1 <= 0)
printf("LDA #%d\n",(int)0l-$1);
else printf("LDA %c\n",$1);
/* Add operand $3 to accumulator */
if ($3 <= 0)
/* test if ADD or INC */
if ($3 == -1) printf("INC\n");
else printf("ADD #%d\n",(int)0l-$3);
else printf("ADD %c\n",$3);
$$ = 'A'; /* Note accumulator result */
}
}
;
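The instruction-selection step on its own is small enough to isolate. The following sketch (the function name format_add is hypothetical) shows just the three-way choice made for the right-hand operand, using the same negated-constant encoding:

```c
#include <stdio.h>

/* Write the instruction that adds one encoded operand to the accumulator.
 * Returns what snprintf returns: the length of the formatted text. */
int format_add(char *buf, size_t size, int operand)
{
    if (operand == -1)
        return snprintf(buf, size, "INC\n");               /* constant 1 */
    if (operand <= 0)
        return snprintf(buf, size, "ADD #%d\n", -operand); /* other constants */
    return snprintf(buf, size, "ADD %c\n", operand);       /* variables */
}
```

The point is the shape of the decision: a compile-time constant is compared against a specific value, and a cheaper code sequence is chosen when it matches.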
Summary
In this mini-tutorial we have defined a whole language, a complete machine code, written a lexer, a compiler, a code generator and an optimiser. It has briefly demonstrated the process of code generation and indicated (albeit generally) how code transformation and optimisation could be performed. It should enable similar improvements to be made in other (as yet unseen) compilers, and has addressed the issue of identifying loop unrolling conditions and generating specific improvements for that case.
It should also have made clear how difficult it is to answer questions without specific examples of some program code to refer to.

Bison/Flex print value of terminal from alternative

I have written a simple grammar:
operations :
/* empty */
| operations operation ';'
| operations operation_id ';'
;
operation :
NUM operator NUM
{
printf("%d\n%d\n",$1, $3);
}
;
operation_id :
WORD operator WORD
{
printf("%s\n%s\n%s\n",$1, $3, $<string>2);
}
;
operator :
'+' | '-' | '*' | '/'
{
$<string>$ = strdup(yytext);
}
;
As you can see, I have defined an operator that recognizes one of 4 symbols. Now, I want to print this symbol in operation_id. Problem is, that logic in operator works only for last symbol in alternative.
So if I write a/b; it prints ab/ and that's cool. But for other operations, eg. a+b; it prints aba. What am I doing wrong?
*I omitted the newline characters in the example output.
This non-terminal from your grammar is just plain wrong.
operator :
'+' | '-' | '*' | '/' { $<string>$ = strdup(yytext); }
;
First, in yacc/bison, each production has an action. That rule has four productions, of which only the last has an associated action. It would be clearer to write it like this:
operator : '+'
| '-'
| '*'
| '/' { $<string>$ = strdup(yytext); }
;
which makes it a bit more obvious that the action only applies to the reduction from the token '/'.
The action itself is incorrect as well. yytext should never be used outside of a lexer action, because its value isn't reliable; it will be the value at the time the most recent lexer action was taken, but since the parser usually (but not always) reads one token ahead, it will usually (but not always) be the string associated with the next token. That's why the usual advice is to make a copy of yytext, but the idea is to copy it in the lexer rule, assigning the copy to the appropriate member of yylval so that the parser can use the semantic value of the token.
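The staleness problem can be reproduced without flex at all. In this sketch a single shared buffer stands in for yytext, and the names text_buf, scan_word and copy_token_text are all hypothetical; a real lexer action would typically do the copy with strdup(yytext) and store it in yylval.

```c
#include <stdlib.h>
#include <string.h>

/* Stand-in for flex's yytext: one buffer reused for every token. */
char text_buf[32];

/* Hypothetical scanner step: copies the next space-separated word of
 * *input into text_buf and advances *input. Returns 0 at end of input. */
int scan_word(const char **input)
{
    size_t n = 0;
    while (**input == ' ')
        (*input)++;
    while (**input != '\0' && **input != ' ' && n < sizeof text_buf - 1)
        text_buf[n++] = *(*input)++;
    text_buf[n] = '\0';
    return n > 0;
}

/* Hypothetical helper: the private copy a lexer action should hand to
 * the parser. The caller owns the result and must free() it. */
char *copy_token_text(void)
{
    size_t len = strlen(text_buf) + 1;
    char *copy = malloc(len);
    if (copy != NULL)
        memcpy(copy, text_buf, len);
    return copy;
}
```

If one token is scanned, a pointer to text_buf is kept, and then the next token is scanned (as the parser's lookahead would do), the kept pointer now shows the second token's text, while a copy made before the second scan still holds the first.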
You should avoid the use of $<type>$ =. A non-terminal can only have one type, and it should be declared in the prologue to the bison file:
%type <string> operator
Finally, you will find that it is very rarely useful to have a non-terminal which recognizes different operators, because the different operators are syntactically different. In a more complete expression grammar, you'd need to distinguish between a + b * c, which is the sum of a and the product of b and c, and a * b + c, which is the sum of c and the product of a and b. That can be done by using different non-terminals for the sum and product syntaxes, or by using different productions for an expression non-terminal and disambiguating with precedence rules, but in both cases you will not be able to use an operator non-terminal which produces + and * indiscriminately.
For what it's worth, here is the explanation of why a+b results in the output of aba:
The production operator : '+' has no explicit action, so it ends up using the default action, which is $$ = $1.
However, the lexer rule which returns '+' (presumably -- I'm guessing here) never sets yylval. So yylval still has the value it was last assigned.
Presumably (another guess), the lexer rule which produces WORD correctly sets yylval.string = strdup(yytext);. So the semantic value of the '+' token is the semantic value of the previous WORD token, which is to say a pointer to the string "a".
So when the rule
operation_id :
WORD operator WORD
{
printf("%s\n%s\n%s\n",$1, $3, $<string>2);
}
;
executes, $1 and $2 both have the value "a" (two pointers to the same string), and $3 has the value "b".
Clearly, it is semantically incorrect for $2 to have the value "a", but there is another error waiting to occur. As written, your parser leaks memory because you never free() any of the strings created by strdup. That's not very satisfactory, and at some point you will want to fix the actions so that semantic values are freed when they are no longer required. At that point, you will discover that having two semantic values pointing at the same block of allocated memory makes it highly likely that free() will be called twice on the same memory block, which is Undefined Behaviour (and likely to produce very difficult-to-diagnose bugs).

Checking Valid Arithmetic Expression in Lex (in C)

I have to write code in lex for checking whether an arithmetic expression is valid or not. I am aware that I could do this very easily using yacc, but doing it only in lex is not so easy.
I have written the code below, which for some reason doesn't work.
Besides this, I also don't understand how to handle binary operators.
My wrong code:
%{
#include <stdio.h>
/* Will be using stack to check the validity of arithetic expressions */
char stack[100];
int top = 0;
int validity =0;S
%}
operand [a-zA-Z0-9_]+
%%
/* Will consider unary operators (++,--), binary operators(+,-,*,/,^), braces((,)) and assignment operators (=,+=,-=,*=,^=) */
"(" { stack[top++]='(';}
")" { if(stack[top]!=')') yerror(); else top--;}
[+|"-"|*|/|^|%] { if(stack[top]!='$') yerror(); else stack[top]=='&';}
"++" { if(stack[top]!='$') yerror(); else top--;}
[+"-"*^%]?= { if(top) yerror();}
operand { if(stack[top]=='&') top--; else stack[top++]='$';}
%%
int yerror()
{
printf("Invalid Arithmetic Expression\n");
}
First, learn how to write regular expressions in Flex. (Patterns, Flex manual).
Inside a character class ([…]), neither quotes nor stars nor vertical bars are special. To include a - or a ], you can escape them with a \ or put them at the beginning of the list, or in the case of - at the end.
So in:
[+|"-"|*|/|^|%]
The | is just another character in the list, and including it five times doesn't change anything. "-" is a character range consisting only of the character ", although I suppose the intention was to include a -. Probably you wanted [-+*/^%] or [+\-*/^%].
There is no way that the flex scanner can guess that a + (for example) is a unary operator instead of a binary operator, and putting it twice in the list of rules won't do anything; the first rule will always take effect.
Finally, if you use definitions (like operand) in your patterns, you have to enclose them in braces: {operand}; otherwise, flex will interpret it as a simple keyword.
And a hint for the assignment itself: A valid unparenthesized arithmetic expression can be simplified into the regular expression:
term {prefix-operator}*{operand}{postfix-operator}*
expr {term}({infix-operator}{term})*
But you can't use that directly because (a) it doesn't deal with parentheses, (b) you probably need to allow whitespace, and (c) it doesn't correctly reject a+++++b because C insists on the "maximal munch" rule for lexical scans, so that is not the same as the correct expression a++ + ++b.
You can, however, translate the above regular expression into a very simple two-state state machine.
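As a rough idea of what such a state machine could look like, here is a sketch in plain C rather than as flex start conditions. It is deliberately simplified and all names are my own: it accepts only the single-character binary operators + - * / ^ %, treats a leading + or - as a prefix sign, tracks parentheses with a counter, skips whitespace, and knows nothing about ++, -- or assignment operators.

```c
#include <ctype.h>
#include <string.h>

enum state { WANT_OPERAND, WANT_OPERATOR };

/* Returns 1 if s is a well-formed expression under the simplifying
 * assumptions stated above, 0 otherwise. */
int is_valid_expression(const char *s)
{
    enum state st = WANT_OPERAND;
    int depth = 0;                       /* number of unclosed '(' */

    for (; *s != '\0'; s++) {
        unsigned char c = (unsigned char)*s;
        if (isspace(c))
            continue;
        if (st == WANT_OPERAND) {
            if (c == '(') {
                depth++;
            } else if (c == '+' || c == '-') {
                /* prefix sign: still waiting for an operand */
            } else if (isalnum(c) || c == '_') {
                /* consume the rest of the operand */
                while (isalnum((unsigned char)s[1]) || s[1] == '_')
                    s++;
                st = WANT_OPERATOR;
            } else {
                return 0;
            }
        } else {                         /* WANT_OPERATOR */
            if (c == ')') {
                if (depth == 0)
                    return 0;            /* unmatched ')' */
                depth--;
            } else if (strchr("+-*/^%", c) != NULL) {
                st = WANT_OPERAND;
            } else {
                return 0;
            }
        }
    }
    /* Valid only if we ended after an operand with all parentheses closed. */
    return st == WANT_OPERATOR && depth == 0;
}
```

In a flex version the two states would map naturally onto two start conditions, with each rule switching state via BEGIN.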

bison: why is the result wrong when printing a constant in an action?

I have the grammar:
%token T_SHARE
%token T_COMMENT T_PUBLIC T_WRITEABLE T_PATH T_GUESTOK T_VALID_USERS
T_WRITE_LIST T_CREATE_MODE T_DIRECTORY_MODE
%union
{
int number;
char *string;
}
%token <string> T_STRING
%token <number> T_NUMBER T_STATE
%%
parameters:
|parameters parameter
;
parameter:
section_share
|comment
....
section_share:
'[' T_SHARE ']' {section_print(T_SHARE);}
;
comment:
T_COMMENT '=' T_STRING {print(2);parameter_print(T_COMMENT);}
;
the function print is:
void print(int arg)
{
printf("%d\n", arg);
}
but instead of the argument '2' passed to print, it prints other values, like "8508438", with no apparent pattern. Why?
It's very hard to understand what you are trying to ask, but I think you are confusing tokens' numeric codes with their semantic values. In particular, there is nothing special about the print(2) call in the action associated with your 'comment' rule. It is copied literally to the generated parser, so, given the definition of the print() function, a literal '2' should be printed each time that rule fires. I think that's what you say you observe.
If instead you want to print the semantic value associated with a symbol in the rule, then the syntax has the form $n, where the number after the dollar sign is the number of the wanted symbol in the rule, counting from 1. Thus, in the 'comment' rule, the semantic value associated with the T_STRING symbol can be referenced as $3. For example:
comment:
T_COMMENT '=' T_STRING { printf("The string is %s\n", $3); }
;
Semantic values of primitive tokens must be set by your lexical analyzer to be available; semantic values of non-terminals must be set by actions in your grammar. Note also that mid-rule actions get included in the count.
Although token symbols such as your T_COMMENT can be used directly in actions, it is not typically useful to do so. These symbols will be resolved by the C preprocessor to numbers characteristic of the specific symbol. The resulting token codes have nothing to do with the specific values parsed.
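The distinction can be illustrated outside of bison with a small sketch. The numeric codes below are made up for the example (bison chooses the real ones), and the function name describe is hypothetical; the union has the same shape as the %union in the question.

```c
#include <stdio.h>

/* Hypothetical token codes; bison assigns the real numbers itself. */
enum token_code { T_COMMENT = 258, T_STRING = 259 };

/* Same shape as the %union in the question. */
union semantic_value {
    int number;
    char *string;
};

/* Format what an action could see for a string-valued token: the fixed
 * code of the token kind versus the text the lexer stored for this
 * particular occurrence. */
int describe(char *buf, size_t size, enum token_code code,
             union semantic_value value)
{
    return snprintf(buf, size, "code=%d value=%s", (int)code, value.string);
}
```

The code is the same for every T_STRING the parser ever sees, whereas the value differs from one occurrence to the next; passing T_COMMENT to a function from an action therefore only ever passes that fixed number.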
