How to disable parsing for a piece of text in a file?

Structure of my file is :
`pragma TOKEN1_NAME TOKEN1_VALUE
`pragma TOKEN2_NAME TOKEN2_VALUE
`pragma TOKEN3_NAME TOKEN3_VALUE
`pragma TOKEN4_NAME TOKEN4_VALUE
VHDL_TEXT{
// A valid VHDL text goes here.
}
`pragma TOKEN2_NAME TOKEN2_VALUE
VHDL_TEXT{
// VHDL text
}
I need to pass the VHDL text to the output file as is. I can do that with a default rule at the end of the lex file:
Rule: . { append_to_buffer(*yytext); }
I also have a list of other rules in my lex file to deal with the tokens.
The problem I am having is how to handle VHDL text that itself contains some of the tokens recognized by those lex rules.
In other words, I want to disable token detection once I find the text I am interested in, and resume detection once it is over.

As rici points out indirectly you need to be able to distinguish between occurrences of the trailing delimiter '}' for your rule and occurrences of the right curly bracket in a valid VHDL design specification or portion.
See IEEE Std 1076-2008, 15.3 Lexical elements, separators, and delimiters where we find that '{' and '}' are not used as delimiters in VHDL.
They are other special characters (15.2 Character set, using ISO/IEC 8859-1:1998) requiring handling where graphic characters may appear.
graphic_character ::=
basic_graphic_character | lower_case_letter | other_special_character
These include extended identifiers (15.4.3), character literals (15.6), string literals (15.7), bit string literals (15.8), comments (15.9) and tool directives (15.11).
These lexical elements need to be identified within the production; any '}' outside them is then the closing delimiter for the rule.
Only one tool directive is currently defined (24.1 Protect tool directives), and any use of the two curly bracket characters in it would be contained in VHDL lexical elements. All other uses in lexical elements are directly delimited. (You could also disclaim tool directive support; in VHDL, tool directives basically invoke separate lexical, syntactic, and semantic analysis.)
Essentially you need to run a VHDL lexical analyzer over the 'VHDL text', where your rule's delimiting right curly bracket will stand out like a sore thumb (as an exception, serving as the closing delimiter for the VHDL text).
By now you may get the idea that life would be easier if you could deal with VHDL by reference instead, if possible. Your mechanism is as complex as including tool directives in VHDL (which can be done with a preprocessor, as could your VHDL text).
This is in response to the vhdl tag added by FUZxxl.

When you have essentially different languages in a source file that you need to deal with that have clear demarcation tokens (like your VHDL_TEXT markers) that can be easily recognized by the lexer, the easiest thing to do is to use flex exclusive start states (%x). In your case, you would do something like:
%{
/* some global vars for holding aux state */
static int brace_depth;
static Buffer vhdl_text;
%}
%x VHDL
%%
.. normal lexer rules for your non-vhdl stuff
VHDL_TEXT[ \t]*"{" { brace_depth = 1;
BufferClear(vhdl_text);
BEGIN(VHDL); }
<VHDL>"{" { BufferAppend(vhdl_text, *yytext);
brace_depth++; }
<VHDL>"}" { if (--brace_depth == 0) {
BEGIN(INITIAL);
yylval.buf = BufferExtract(vhdl_text);
return VHDL_TEXT; }
BufferAppend(vhdl_text, *yytext); }
<VHDL>--.*\n { BufferAppendString(vhdl_text, yytext); }
<VHDL>\"[^"\n]*\" { BufferAppendString(vhdl_text, yytext); }
<VHDL>\\[^\\\n]*\\ { BufferAppendString(vhdl_text, yytext); }
<VHDL>.|\n { BufferAppend(vhdl_text, *yytext); }
This will gather up everything between the curly braces in VHDL_TEXT {...} and return it to your parser as a single token (matching nested braces properly, if there are any in the VHDL text.) You can do macro substitution-like stuff in the VHDL code by adding a rule like:
<VHDL>{IDENT} { Macro *mac = lookup_macro_by_name(yytext);
BufferAppendString(vhdl_text, mac ? mac->replacement : yytext); }
You also probably want a <VHDL><<EOF>> rule to detect a missing closing } on the vhdl text and give an appropriate error message.
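For readers who want to see the same brace-matching logic outside of flex, here is a minimal hand-written sketch in plain C. The function name extract_vhdl and its buffer-based interface are made up for illustration and are not part of the answer above. It assumes the caller has already consumed the opening '{', and, like the rules above, it skips over comments, string literals, and extended identifiers but does not treat character literals such as '}' specially.

```c
#include <stddef.h>

/* Hypothetical helper: copy the text between a VHDL_TEXT '{' and its
 * matching '}' into out. src must point just past the opening brace.
 * Returns a pointer just past the closing brace, or NULL if the closing
 * brace is missing or out is too small. */
const char *extract_vhdl(const char *src, char *out, size_t outsz)
{
    int depth = 1;              /* the caller already consumed one '{' */
    size_t n = 0;
    const char *p = src;

    if (outsz == 0)
        return NULL;

    while (*p != '\0') {
        const char *end = p + 1;    /* one past the last char to copy */

        if (p[0] == '-' && p[1] == '-') {
            /* VHDL comment: runs to the end of the line */
            while (*end != '\0' && *end != '\n')
                end++;
        } else if (*p == '"' || *p == '\\') {
            /* string literal or extended identifier: copy through the
             * matching delimiter on the same line */
            while (*end != '\0' && *end != '\n' && *end != *p)
                end++;
            if (*end == *p)
                end++;
        } else if (*p == '{') {
            depth++;
        } else if (*p == '}' && --depth == 0) {
            out[n] = '\0';
            return p + 1;
        }

        if (n + (size_t)(end - p) >= outsz)
            return NULL;
        while (p < end)
            out[n++] = *p++;
    }
    return NULL;    /* ran off the end: missing closing '}' */
}
```

A NULL return for unterminated text plays the role of the <VHDL><<EOF>> rule suggested above.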

Related

Using strtok to tokenize html

I'm looking to extract text found exactly within an < and >, while also extracting things found between > and <.
For instance:
<html> would just return <html>
<title>This is a title</title> would return <title>, This is a title, </title>
This is a title would return This is a title
And finally <title>This is a weird use of < bracket</title> should return <title>, This is a weird use of < bracket, </title>. My current version recognises it as <title>, This is a weird use of, < bracket, </title>
I'd appreciate any snippets of code, or directions to head in to get to a solution.
tl;dr: grab substrings within <...> and >...< separately without being stumped by a floating ...>... or ...<....
Edit: not using strtok anymore; I would appreciate any other help or similar problems you may know about. Anything to read would also be greatly beneficial. Note: we aren't trying to parse, simply lex the input string.
Can only use standard libraries for c.

Just trying to build a basic validator for a subset of valid HTML.
You can't, not even a basic one. You will have too many false positives and negatives. Here's a simple example.
<tag attribute=">" />
HTML has many features which do not allow simple parsing. It is...
Balanced, like <tag></tag> and also "quotes".
Nested, like <tag><tag></tag></tag>.
Escaped, like "escaped\"quote".
Has other languages embedded in it, like Javascript and CSS.
If this is an exercise in tokenization, you could define a very specific subset, but I'd suggest something simpler like JSON which has a well defined grammar. Those are typically parsed using a lexer and parser, but JSON is small enough to be written by hand.
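To make the quoted-attribute problem concrete, here is a small sketch in standard C. The function tag_end is a made-up name for illustration. It only tracks single and double quotes inside one tag, which is enough to show why searching for the next '>' is not sufficient; it is not an HTML tokenizer and ignores comments, scripts, and everything else listed above.

```c
/* Hypothetical helper: return the index of the '>' that closes the tag
 * starting at s[0] == '<', skipping any '>' inside a quoted attribute
 * value. Returns -1 if s does not start with '<' or the tag never closes. */
long tag_end(const char *s)
{
    char quote = 0;     /* 0 outside quotes, else the open quote character */

    for (long i = 1; s[0] == '<' && s[i] != '\0'; i++) {
        if (quote != 0) {
            if (s[i] == quote)
                quote = 0;          /* closing quote */
        } else if (s[i] == '"' || s[i] == '\'') {
            quote = s[i];           /* opening quote */
        } else if (s[i] == '>') {
            return i;
        }
    }
    return -1;
}
```

For the example tag above, a plain strchr(s, '>') stops at the '>' inside the attribute value, while tag_end continues to the real end of the tag.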

My own solution so far,
as suggested by @chqrlie:
void tokenize(char* stringPtr)
{
char flag[2] = " "; /* " " outside a tag, "<" inside one */
/*We build this up as we iterate the string.
Strtok was not suitable, build up tokens char by char */
char tempToken[tokenLength];
strcpy(tempToken, ""); // Init current token
// Traverse string catching stuff between <...> and >...< separately.
for(int i =0; i<strlen(stringPtr);i++)
{
if (stringPtr[i]=='<' )
{
if (strcmp(flag, " ")==0)
{
putToken(tempToken);
strcpy(tempToken,""); // Tag starting, everything before it is a token.
strcpy(flag,"<");
strcat(tempToken, flag);
}
else // Catches <...<
{
presentError(stringPtr);
}
}
else if (stringPtr[i]=='>')
{
if (strcmp(flag,"<")==0)
{
strcat(tempToken, ">");
strcpy(flag," ");
putToken(tempToken);
strcpy(tempToken,"");
}
else // Cant have a > unless we saw < already
{
presentError(stringPtr);
}
}
else // Manage non angle brackets
{
strncat(tempToken, &stringPtr[i],1 );
}
}
putToken(tempToken); // Catches a line ending in a value, not a tag
/* Notes
Floating <'s and >'s will be errored up
- Special case ....<...>..., which is incorrect
will cause floating tokens, can be identified
Unclosed tags i.e. </p will be tokenized verbatim,
thus can identify this mistake
Unopened tags i.e. p> will be errored
*/
}
Assume that presentError() terminates lexing.
Some improvements can be made, and I'm open to suggestions, but this is a first working draft.

Yacc actions return 0 for any variable ($)

I'm new to Lex/Yacc. I found these lex & yacc files, which parse ANSI C.
To experiment I added an action to print part of the parsing:
constant
: I_CONSTANT { printf("I_CONSTANT %d\n", $1); }
| F_CONSTANT
| ENUMERATION_CONSTANT /* after it has been defined as such */
;
The problem is, no matter where I put the action, and whatever $X I use, I always get value 0.
Here I got printed:
I_CONSTANT 0
Even though my input is:
int foo(int x)
{
return 5;
}
Any idea?

Nothing in the lex file you point to actually sets semantic values for any token. As the author says, the files are just a grammar and "the bulk of the work" still needs to be done. (There are other caveats having to do with the need for a preprocessor.)
Since nothing in the lex file ever sets yylval, it will always be 0, and that is what yacc/bison will find when it sets up the semantic value for the token ($1 in this case).

Turns out yylval = atoi(yytext) is not done in the lex file, so I had to add it myself. Also learned I can add extern char *yytext to the yacc file header, and then use yytext directly.
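The contract described here can be illustrated without lex or yacc at all. The following is a toy sketch in plain C, not generated scanner code: toy_yylex, toy_yylval, toy_input, and the token codes are all invented stand-ins for yylex, yylval, the scanner input, and the real token numbers. The point is only that the value printed through $1 is whatever the lexer stored before returning the token code.

```c
#include <ctype.h>
#include <stdlib.h>

/* Invented token codes; a real parser gets these from the yacc header. */
enum { TOY_EOF = 0, TOY_I_CONSTANT = 258, TOY_OTHER = 259 };

int toy_yylval;             /* stands in for yylval */
const char *toy_input;      /* stands in for the scanner's input */

/* Return the next token code. For integer constants, store the value in
 * toy_yylval first; without that store the parser would only ever see
 * the initial 0. */
int toy_yylex(void)
{
    while (isspace((unsigned char)*toy_input))
        toy_input++;
    if (*toy_input == '\0')
        return TOY_EOF;
    if (isdigit((unsigned char)*toy_input)) {
        char *end;
        toy_yylval = (int)strtol(toy_input, &end, 10);
        toy_input = end;
        return TOY_I_CONSTANT;
    }
    toy_input++;            /* any other character is a one-char token */
    return TOY_OTHER;
}
```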

Lex how to emulate modes or a stack of contexts

I'm trying to figure out how to emulate a context/mode or "stack of contexts" in lex (flex).
In particular, I'd like to write a parser that has a notion of string literals that can drop you back into an expression-y context.
I have a simple grammar that supports raw string literals using the syntax '...' and prints a string when it finds one.
However, a string token has potentially unbounded length (up to lex's maximum buffer size which I think is defined in some macro in the generated C source).
I want to define a begin_string token ' and an end_string token ' as well as a distinct token for reading a character while inside a string.
And I want to achieve this by having some notion of a context that says "now I'm in a string" and affects which tokenization rules are "active".
Here's the naive grammar below for context.
%{
#include <stdio.h>
%}
%option noyywrap
%%
'[^']*' { printf("found string literal (( %s ))\n", yytext); }
\n { /* do nothing */ }
. { /* do nothing */ }
%%
int main()
{
yylex();
return 0;
}

If I understand your needs correctly, that feature is provided with start conditions. As the manual explains, a start condition is a kind of state, which can be used to enable and disable a set of productions.
For example, you might have:
%option nodefault
%x IN_STRING
%%
/* Other patterns for regular tokens */
"'" { BEGIN(IN_STRING); return BEGIN_STRING; }
<IN_STRING>"'" { BEGIN(INITIAL); return END_STRING; }
<IN_STRING>.|\n { return STRING_CHAR; }
Flex will optionally enable a feature which allows you to push and pop the current start condition on a stack, but in this simple case that isn't necessary. If you do need to do that, remember to add %option stack to your prolog, and read the description of the API at the end of the Start Condition chapter linked above.
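As a rough picture of what start conditions and the optional state stack do, here is a hand-written sketch in plain C. All names in it (struct scanner, next_token, the T_ token codes) are invented for illustration; in a real flex scanner you would use BEGIN, or yy_push_state() and yy_pop_state() with %option stack, instead of managing the stack yourself.

```c
enum mode { INITIAL_MODE, STRING_MODE };
enum token { T_EOF, T_BEGIN_STRING, T_END_STRING, T_STRING_CHAR, T_OTHER };

#define MAX_DEPTH 16

struct scanner {
    const char *p;                  /* next unread character */
    enum mode stack[MAX_DEPTH];     /* stack of contexts */
    int top;                        /* index of the current context */
};

void scanner_init(struct scanner *sc, const char *input)
{
    sc->p = input;
    sc->top = 0;
    sc->stack[0] = INITIAL_MODE;
}

/* Which "rules" are active depends on the mode on top of the stack,
 * just as start conditions enable and disable sets of patterns. */
enum token next_token(struct scanner *sc)
{
    char c = *sc->p;

    if (c == '\0')
        return T_EOF;
    sc->p++;

    if (sc->stack[sc->top] == STRING_MODE) {
        if (c == '\'') {
            sc->top--;                          /* pop back to the outer context */
            return T_END_STRING;
        }
        return T_STRING_CHAR;
    }
    if (c == '\'' && sc->top + 1 < MAX_DEPTH) {
        sc->stack[++sc->top] = STRING_MODE;     /* push the string context */
        return T_BEGIN_STRING;
    }
    return T_OTHER;
}
```

With only two modes the stack never grows beyond one entry, which is why a plain BEGIN is enough in the simple case; the stack starts to matter once a string can re-enter an expression context that itself contains strings.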

Trying to make match on a rule that uses "recursive" identifier in flex

I have this line:
0, 6 -> W(1) L(#);
or
\# -> #shift_right R W(1) L
I have to parse this line with flex, and take every element from every part of the arrow and put it in a list. I know how to match simple things, but I don't know how to match multiple things with the same rule. I'm not allowed to increase the limit for rules. I have a hint: parse the pieces, pieces will then combine, and I can use states, but I don't know how to do that, and I can't find examples on the net. Can someone help me?
So, here is an example:
{
a -> W(b) #invert_loop;
b -> W(a) #invert_loop;
-> L(#)
}
When this section begins, I have to create a structure for each line: what is on the left of -> goes into a vector (these are parameters), and the right side goes into a list, where each term is another structure. For the right side I wrote rules such as:
writex W([a-zA-Z0-9.#]) for W(anything).
So I need to parse these lines so that I can put the parameters and the structures into the big structure. Something like this (for the first line):
new bigStruc with param = a and list of struct = W(anything), #invert (a notation for a reference to another structure)
What I need to know is how to parse these lines so that I can create and fill these bigStruct objects, reusing the rules for the simple structures (I have everything I need for those structures, but I don't know how to parse so that I can use those methods).
Sorry for my English; I hope this time I was clearer about what I want.
Last-minute edit: I have matched the whole line with one rule and then work on it with strtok. Is there a way to use the previous rules to see what type of structure I have to create? I mean, rather than writing lots of ifs, can I use writex W([a-zA-Z0-9.#]) to know that I have to create that kind of structure?

OK, let's see how this snippet works for you:
/* these are exclusive start conditions, so their rules do not overlap; for inclusive ones, use %s */
%x dataStructure
%x addRules
%%
<dataStructure>"->" { BEGIN addRules; }
\{ { BEGIN dataStructure; }
<addRules>; { BEGIN dataStructure; }
<addRules>\} { BEGIN INITIAL; } //last rule of a block may lack the ';'
<dataStructure>\} { BEGIN INITIAL; }
<dataStructure>[^, \t\n}>-]+ { ECHO; } //this will output each comma separated token
<dataStructure>. { } //ignore anything else
<dataStructure>\n { } //ignore anything else
<addRules>[^ \t\n;}]+ { ECHO; } //this will output each space separated rule
<addRules>. { } //ignore anything else
<addRules>\n { } //ignore anything else
%%
I'm not entirely sure what it is you want. Edit your original post to include the contents of your comments, with examples, and please structure your English better. If you can't explain what you want without contradicting yourself, I can't help you.
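Since the question's last edit mentions matching the whole line and then using strtok, here is a minimal sketch of that route in plain C. It is only a guess at the intended data layout: struct rule, split_items, and parse_rule are invented names, the item limit is arbitrary, and the terms are kept as raw strings rather than turned into the asker's structures.

```c
#include <string.h>

#define MAX_ITEMS 8     /* arbitrary limit for this sketch */

struct rule {
    char *params[MAX_ITEMS];    /* items to the left of "->" */
    int nparams;
    char *terms[MAX_ITEMS];     /* items to the right of "->" */
    int nterms;
};

/* Split text in place on any of the delimiter characters and store up
 * to MAX_ITEMS pointers to the pieces. Returns the number stored. */
static int split_items(char *text, const char *delims, char **items)
{
    int n = 0;
    for (char *tok = strtok(text, delims);
         tok != NULL && n < MAX_ITEMS;
         tok = strtok(NULL, delims))
        items[n++] = tok;
    return n;
}

/* Parse one line such as "0, 6 -> W(1) L(#);". The line is modified in
 * place and the rule keeps pointers into it. Returns 0 on success, or
 * -1 if the line contains no "->". */
int parse_rule(char *line, struct rule *r)
{
    char *arrow = strstr(line, "->");
    if (arrow == NULL)
        return -1;
    *arrow = '\0';                                       /* cut the line at the arrow */
    r->nparams = split_items(line, ", \t", r->params);
    r->nterms = split_items(arrow + 2, " \t;\n", r->terms);
    return 0;
}
```

Each entry in terms could then be inspected (for example, by its first character) to decide which kind of structure to build.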

parsing with bison

I bought Flex & Bison from O'Reilly but I'm having some trouble implementing a parser (breaking things down into tokens was no big deal).
Suppose I have a huge binary string and what I need to do is add the bits together - every bit is a token:
[0-1] { return NUMBER;}
1101010111111
Or for that matter a collection of tokens with no "operation".
Would such a grammar be correct?
calclist :
| calclist expr EOL { eval($2); }
;
expr: NUMBER
| expr NUMBER { $$ = $1 + $2; }
;
or is there a better way to do it?

Your example lex rule "[0-1] { return NUMBER; }" doesn't set yylval, so if you use that value in your grammar (as you do in the rule "expr NUMBER { $$=$1+$2; }") you'll get garbage.
In general what you're doing is correct, though the task you've chosen is so trivial that lex/bison is serious overkill.
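To show how small the task is without lex/bison, here is a sketch in plain C that adds up the bits of a binary string directly. The function name sum_bits and the choice to return -1 on invalid input are assumptions made for this example.

```c
/* Add up the digits of a string of binary digits, i.e. count the '1's.
 * Returns -1 if any character is not '0' or '1'. */
int sum_bits(const char *s)
{
    int sum = 0;

    for (; *s != '\0'; s++) {
        if (*s != '0' && *s != '1')
            return -1;
        sum += *s - '0';
    }
    return sum;
}
```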