Regular expression in flex has an error - c

I am new to flex and I want to design a scanner using flex.
At this step, I want to write a regular expression to match an id, with these conditions:
an underscore can appear in an id
you can use _ wherever you want, but there can be at most 2 consecutive underscores, for example:
a_b_c »»»» true
a___b »»»» false
123abv »»»» false
digits can't be at the beginning of an id
an underscore can't be at the end of an id
The regular expression I have written for that is:
(\b(_{0,2}[A-Za-z][0-9A-Za-z]*(_{0,2}[0-9A-Za-z]+)*)\b)
but now I have 2 questions:
Is the regular expression correct? I have tested it on rubular.com and I think it is, but I'm not sure.
The other important problem is that when I write this in my flex file, no id is identified, and I can't see why.
Can anyone please help me?

The problem here is your ID regular expression. You are using \b to match a word boundary, but Flex's regular expressions have no built-in support for matching word boundaries. Other than that, your regular expression is sound. I was able to get your code working using this modified version of yours: _{0,2}[A-Za-z][0-9A-Za-z]*(_{0,2}[0-9A-Za-z]+)*. (I just got rid of the \b's, and some of the parentheses that bothered me).
Unfortunately, this causes a slight problem. Say that you're lexing and run across something like 12_345. Flex will read 12, assume that it found an IC, and then read _. Finding no match, it will print that to stdout, then read 345 as another IC.
In order to avoid this issue (caused by Flex's lack of word boundaries), you could do one of two things:
Create a rule at the end that matches any character (other than whitespace) and make it give an error. This would stop Flex when it got to _ in the example above.
Create a rule at the end that matches any combination of letters, numbers, and underscores ([_0-9A-Za-z]+). If it is matched, give an error. This will cause Flex to return the entire token 12_345 as an error in the above example.
One other problem: The ID regular expression still won't match anything with underscores at the end of it. This means your current regular expression isn't perfect, and you'll need to do some tweaking with it, but now you know not to use the \b symbol. Here is a reference on Flex's regular expression syntax so you can find other things to use/avoid.

I think your requirement is:
Identifiers can use only alphanumeric characters and _
Identifiers cannot start with a number
Identifiers cannot end with an _
Identifiers cannot include more than two consecutive _
(When I first read your question, I thought the last requirement was that identifiers cannot include more than two _, but looking at the proposed regex, I think the version above is more accurate.)
Based on the above, you should be able to use the following two Flex patterns:
([[:alpha:]]|__?[[:alnum:]])(_?_?[[:alnum:]])* { /* Handle an identifier */ }
[[:alpha:]_][[:alnum:]_]* { /* Error */ }
Breaking that down:
([[:alpha:]]|__?[[:alnum:]]) matches an alphabetic character or one or two _ followed by an alphanumeric character.
(_?_?[[:alnum:]])* matches a string of _ and alphanumeric characters, with a maximum of two _ before each alphanumeric character.
The second pattern will match anything which starts with an alphabetic character or _, followed by any number of alphanumerics or _. This will match all valid identifiers as well as the sequences which contain too many consecutive _ or which end with _. If both patterns match (that is, a valid identifier), the first one will win, so it will be correctly recognized. The second pattern will consume the entire erroneous identifier, allowing for easier error recovery.
The pattern in the OP doesn't work because flex treats \b as a backspace character (as in C). Flex does not implement word boundary assertions, but in a lexer you almost never need these; the pattern above can be used if necessary.
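The four identifier rules listed above can also be checked by hand, which is a convenient way to test a pattern against sample inputs. The following is a minimal sketch in plain C, not flex code; the function name is_valid_identifier is made up for illustration, and it assumes the C locale so that isalnum() only accepts ASCII letters and digits.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if s follows the four identifier rules above. */
bool is_valid_identifier(const char *s)
{
    assert(s != NULL);
    size_t run = 0; /* length of the current run of underscores */

    /* Rule: not empty, and cannot start with a digit. */
    if (s[0] == '\0' || isdigit((unsigned char)s[0]))
        return false;

    for (size_t i = 0; s[i] != '\0'; i++) {
        unsigned char c = (unsigned char)s[i];
        if (c == '_') {
            if (++run > 2)      /* Rule: at most two consecutive underscores. */
                return false;
        } else if (isalnum(c)) {
            run = 0;
        } else {
            return false;       /* Rule: only alphanumerics and underscores. */
        }
    }
    return run == 0;            /* Rule: cannot end with an underscore. */
}
```

Called on the examples from the question, it accepts a_b_c and rejects a___b and 123abv.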

Related

Unable to form the required regex in C

I am trying to write a regex which can search a string and return true if it matches with the regex and false otherwise.
Check should ensure string is wildcard domain name of a website.
Example:
*.cool.dude is valid
*.cool is not valid
abc.cool.dude is not valid
So I had written something like this
\\*\\.[.*]\\.[.*]
However, this is also allowing a *.. string as valid string because * means 0 or infinite occurrences.
I am looking for something which ensures that at-least 1 occurrence of the string happens.
Example:
*.a.b -> valid but *.. -> invalid
how to change the regex to support this?
I have already tried doing something like this:
\\*\\.([.*]{1,})\\.([.*]{1,}) -> doesnt work
\\*\\.([.+])\\.(.+) -> doesnt work
^\\*\\.[a-zA-Z]+\\.[a-zA-Z]+ -> doesnt work
I have tried a bunch of other options as well and have failed to find a solution. Would be great if someone can provide some input.
PS. Looking for a solution which works in C.
[.*] does not mean "0 or more occurrences" of anything. It means "a single character, either a (literal) . or a (literal) *". […] defines a character class, which matches exactly one character from the specified set. Brackets are not even remotely the same as parentheses.
So if you wanted to express "zero or more of any character except newline", you could just write .*. That's what .* means. And if you wanted "one or more" instead of "zero or more", you could change the * to a plus, as long as you remember that regex.h regexes should always be compiled with the REG_EXTENDED flag. Without that flag, + is just an ordinary character. (And there are a lot of other inconveniences.)
But that's probably not really what you want. My guess is that you want something like:
^[*]([.][A-Za-z0-9_]+){2,}$
although you'll have to correct the character class to specify the precise set of characters you think are legitimate.
Again, don't forget the crucial REG_EXTENDED flag when you call regcomp.
Some notes:
The {2,} requires at least two components after the *, so that *.cool doesn't match.
The ^ and $ at the beginning and end of the regex "anchor" the match to the entire input. That stops the pattern matching just a part of the input, but it might not be exactly what you want, either.
Finally, I deliberately used a single-character character class to force [*] and [.] to be ordinary characters. I find that a lot more readable than falling timber (\\) and it avoids having to think about the combination of string escaping and regex-escaping.
For more information, I highly recommend reading man regcomp and man 7 regex. A good introduction to regexes might be useful, as well.
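As a rough illustration of the advice above, here is a small sketch that compiles the suggested pattern with REG_EXTENDED and tests a string against it. The function name is_wildcard_domain is invented for this example, and the character class is the same placeholder set as in the pattern above, so adjust it to whatever characters you consider legitimate.

```c
#include <assert.h>
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if s looks like "*.label.label[.label...]". */
bool is_wildcard_domain(const char *s)
{
    assert(s != NULL);
    regex_t re;

    /* REG_EXTENDED makes + and {2,} operators; REG_NOSUB because we only
       need a yes/no answer, not the positions of the groups. */
    if (regcomp(&re, "^[*]([.][A-Za-z0-9_]+){2,}$",
                REG_EXTENDED | REG_NOSUB) != 0)
        return false;

    bool matched = regexec(&re, s, 0, NULL, 0) == 0;
    regfree(&re);
    return matched;
}
```

With this, *.cool.dude and *.a.b are accepted, while *.cool, abc.cool.dude and *.. are rejected. Compiling the regex on every call keeps the sketch short; real code would probably compile it once.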

Flex match string literal, escaping line feed

I am using flex to try and match C-like, simplified string literals.
A regular expression as such:
\"([^"\\]|\\["?\\btnr]|\\x{HEXDIG}{HEXDIG})*\"
will match all one-line string literals I am interested in.
A string literal cannot contain a non-escaped backslash. A string literal also cannot contain a literal line feed (0x0a) unless it is escaped by a backslash, in which case the line feed and any following spaces and tabulations are ignored.
For example, assuming {LF} is an actual line feed and {TAB} an actual tabulation (I could not format it better than that).
In: "This is an example \{LF}{TAB}{TAB}{TAB}of a confusing valid string"
Token: "This is an example of a confusing valid string"
My first idea was to use a starting state, a trailing context and yymore() to match what I want and check for errors giving something like the following:
...
%%
\" { BEGIN STRING; yymore(); }
<STRING>{
    \n { /* ERROR HERE! */ }
    <<EOF>> { /* ERROR HERE AS WELL */ }
    ([^"\\]|\\["?\\btnr]|\\x{HEXDIG}{HEXDIG})* {
        /* String ok up to here */
        yymore();
    }
    \\\n[ \t]* {
        /* Valid inside a string but needs to be ignored! */
        yymore();
    }
    \" { /* Full string matched */ BEGIN INITIAL; }
    .|\n { /* Anything else is considered an error */ }
}
%%
...
Is there a way to do what I want in the way I am trying to do it? Or is there some other 'standard' method provided by flex that I have stupidly not thought of? This does not look to me like an uncommon use case. Should I just match the strings separately (beginning to before the backslash, after whitespace to end) and concatenate them? This is a bit complicated to do since a string can be decomposed into an arbitrary number of lines using backslashes.
If all you want to do is to recognise a string literal, there's no need for start conditions. You can use some variant of the simple pattern which you'll find in many answers:
["]({normal}|{escape})*["]
(I used macros to make the structure clear, although in practice I would hardly ever use them.)
"Normal" here means any character without special significance in a string. In other words, any character other than " (which ends the literal), \ (which starts an escape sequence), or newline (which is usually an error although some languages allow newlines in strings). In other words, [^"\n\\] (or something similar).
"escape" would be any valid escape sequence. If you didn't want to validate the escape sequence, you could just match a backslash followed by any single character (including newline): \\(.|\n). But since you do seem to want to validate, you'd need to be explicit about the escape sequences you're prepared for:
\\([\n\\btnr"]|x[[:xdigit:]]{2})
But all that only recognises valid string literals. Invalid string literals won't match the pattern, and will therefore fall back to whatever you're using as a fallback rule (matching only the initial "). Since that's practically never what you want, you need to add a second rule which detects error. The easiest way to write the second rule is ["]({normal}|{escape})*, i.e. the valid rule without the final double quote. That will only match erroneous string literals because of (f)lex's maximal munch rule: a valid string literal has a longer match with the valid rule than with the error rule (because the valid rule's match includes the final double quote).
In real-life lexical scanners (as opposed to school exercises), it's more common to expect that the lexical scanner will actually resolve the string literal into the actual bytes it represents, by replacing escape sequences with the corresponding character. That is generally done with a start condition, but the individual patterns are more focussed (and there are more of them). For an example of such a parser, you could look at these two answers (and many others):
Flex / Lex Encoding Strings with Escaped Characters
Optimizing flex string literal parsing
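To make the idea of resolving a literal into its bytes concrete, here is a hand-written sketch in plain C, not the start-condition scanner those answers describe. It handles the escapes listed in the question (\" \? \\ \b \t \n \r \xHH) plus a backslash followed by a newline, after which spaces and tabs are skipped. The names decode_string_literal and hex_value are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: value of one hex digit, or -1 if c is not one. */
static int hex_value(int c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Decode a quoted literal from src into dst (capacity cap, including NUL).
   Returns the decoded length, or -1 on malformed input or overflow. */
long decode_string_literal(const char *src, char *dst, size_t cap)
{
    assert(src != NULL && dst != NULL);
    size_t n = 0;

    if (cap == 0 || *src++ != '"')
        return -1;
    for (;;) {
        int c = (unsigned char)*src++;
        if (c == '\0' || c == '\n')
            return -1;                     /* unterminated, or raw newline */
        if (c == '"')
            break;                         /* closing quote */
        if (c == '\\') {
            int e = (unsigned char)*src++;
            switch (e) {
            case '"': case '?': case '\\': c = e; break;
            case 'b': c = '\b'; break;
            case 't': c = '\t'; break;
            case 'n': c = '\n'; break;
            case 'r': c = '\r'; break;
            case 'x': {
                int hi = hex_value((unsigned char)src[0]);
                int lo = (hi < 0) ? -1 : hex_value((unsigned char)src[1]);
                if (lo < 0)
                    return -1;             /* needs exactly two hex digits */
                c = hi * 16 + lo;
                src += 2;
                break;
            }
            case '\n':                     /* line continuation */
                while (*src == ' ' || *src == '\t')
                    src++;
                continue;                  /* produces no output byte */
            default:
                return -1;                 /* unknown escape, or end of input */
            }
        }
        if (n + 1 >= cap)
            return -1;                     /* keep room for the final NUL */
        dst[n++] = (char)c;
    }
    if (*src != '\0')
        return -1;                         /* text after the closing quote */
    dst[n] = '\0';
    return (long)n;
}
```

In a flex scanner the same decoding would normally be spread across several small rules inside a start condition, each appending to a buffer, rather than done in one function after the fact.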

How to detect string during lexical analysis?

I am using some syntax to detect string during lexical analysis
"".*"" return TOK_STRING;
but this is not working.
I think you want
\".*\"
but be aware that . in flex does not match newlines. And, as @chqrlie mentions in a comment, it does match ", so it will match to the end of the last string, and not the current one.
So a better pattern might be:
\"[^"]*\"
([^"] matches any character including newlines, except ").
But then you have no way to include a " in a string. So you will have to decide what syntax that should be. If you wanted to implement SQL style, with doubled quotes representing a single quote inside a string, you could use
\"([^"]|\"\")*\"
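For the possibly more common backslash escape (note that the backslash itself must be excluded from the character class, otherwise a backslash could be consumed as an ordinary character and the following " would end the string):
\"([^"\\]|\\(.|\n))*\"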

Difficulty with my lexical analyzer

I'm trying to program a lexical analyzer to a standard C translation unit, so I've divided the possible tokens into 6 groups; for each group there's a regular expression, which will be converted to a DFA:
Keyword - (will have a symbol table containing "goto", "int"....)
Identifiers - [a-zA-Z][a-zA-Z0-9]*
Numeric Constants - [0-9]+\.?[0-9]*
String Constants - ""[EVERY_ASCII_CHARACTER]*""
Special Symbols - (will have a symbol table containing ";", "(", "{"....)
Operators - (will have a symbol table containing "+", "-"....)
My Analyzer's input is a stream of bytes/ASCII characters. My algorithm is the following:
assuming there's a stream of characters, x1...xN
foreach i=1, i<=n, i++
if x1...xI accepts one or more of the 6 group's DFA
{
take the longest-token
add x1...xI to token-linked-list
delete x1...xI from input
}
However, this algorithm will assume that every byte it is given, which is a letter, is an identifier, since after an input of 1 character, it accepts the DFA of the identifiers tokens ([a-zA-Z][a-zA-Z0-9]*).
Another possible problem is for the input "intx;", my algorithm will tokenize this stream into "int", "x", ";" which of course is an error.
I'm trying to think about a new algorithm, but I keep failing. Any suggestions?
Code your scanner so that it treats identifiers and keywords the same until the reading is finished.
When you have the complete token, look it up in the keyword table, and designate it a keyword if you find it and an identifier if you don't. This deals with the intx problem immediately; the scanner reads intx and that's not a keyword so it must be an identifier.
I note that your identifiers don't allow underscores. That's not necessarily a problem, but many languages do allow underscores in identifiers.
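Here is a minimal sketch of that approach in C: read the whole word first, then consult a keyword table. The names scan_word, token_kind and keywords are made up for this example, the table is only a small subset of the C keywords, and identifiers follow the question's [a-zA-Z][a-zA-Z0-9]* rule (no underscores).

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

typedef enum { TOK_NONE, TOK_KEYWORD, TOK_IDENTIFIER } token_kind;

/* Illustrative subset only; a real table would list every keyword. */
static const char *const keywords[] = { "goto", "int", "return", "while" };

/* Scan one word at the start of input. Stores its kind in *kind and
   returns its length, or 0 if input does not start with a letter. */
size_t scan_word(const char *input, token_kind *kind)
{
    assert(input != NULL && kind != NULL);
    size_t len = 0;

    *kind = TOK_NONE;
    if (!isalpha((unsigned char)input[0]))
        return 0;

    /* Consume the longest run of letters and digits before classifying. */
    while (isalnum((unsigned char)input[len]))
        len++;

    *kind = TOK_IDENTIFIER;
    for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++) {
        if (strlen(keywords[i]) == len &&
            strncmp(input, keywords[i], len) == 0) {
            *kind = TOK_KEYWORD;
            break;
        }
    }
    return len;
}
```

For the input intx; this returns length 4 with TOK_IDENTIFIER, and for int x it returns length 3 with TOK_KEYWORD.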
Tokenizers generally FIRST split the input stream into tokens, based on rules which dictate what constitute an END of token, and only later decide what kind of token it is (or an error otherwise). Typical end of token are things like white space (when not part of literal string), operators, special delimiters, etc.
It seems you are missing the greediness aspect of competing DFAs. Greedy matching is usually the most useful (left-most longest match) because it solves the problem of how to choose between competing DFAs. Once you've matched int you have another node in the IDENTIFIER DFA that advances to intx. Your finite automaton doesn't exit until it reaches something it can't consume, and if it isn't in a valid accept state at the end of input, or at the point where another DFA is accepting, it is pruned and the other DFA is matched.
Flex, for example, defaults to greedy matching.
In other words, your proposed problem of intx isn't a problem...
If you have 2 rules that compete for int
rule 1 is the token "int"
rule 2 is IDENTIFIER
When we reach
i n t
we don't immediately ACCEPT int because we see another rule (rule 2) where further input x progresses the automata to a NEXT state:
i n t x
If rule 2 is in an ACCEPT state at that point, then rule 1 is discarded by definition. But if rule 2 is still not in an ACCEPT state, we must keep rule 1 around while we examine more input to see if we could eventually reach an ACCEPT state in rule 2 that is longer than rule 1. If we receive some other character that matches neither rule, we check whether the rule 2 automaton is in an ACCEPT state for intx; if so, it is the match. If not, it is discarded, and the longest previous match (rule 1) is accepted. In this case, however, rule 2 is in an ACCEPT state and matches intx.
In the case that 2 rules reach an ACCEPT or EXIT state simultaneously, then precedence is used (order of the rule in the grammar). Generally you put your keywords first so IDENTIFIER doesn't match first.
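The longest-match-then-rule-order policy can be sketched without building real DFAs. In the following illustration each rule is a hypothetical C function that returns how many characters it matches at the start of the input; the driver keeps the longest match and, on a tie, the rule listed first. All names here are invented for the example.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* A rule returns the length it matches at the start of s (0 = no match). */
typedef size_t (*matcher)(const char *s);

/* Rule 1: the keyword "int". */
static size_t match_int(const char *s)
{
    return strncmp(s, "int", 3) == 0 ? 3 : 0;
}

/* Rule 2: IDENTIFIER, i.e. [a-zA-Z][a-zA-Z0-9]* */
static size_t match_identifier(const char *s)
{
    size_t n = 0;
    if (!isalpha((unsigned char)s[0]))
        return 0;
    while (isalnum((unsigned char)s[n]))
        n++;
    return n;
}

/* Order matters: earlier rules win ties, so keywords come first. */
static const matcher rules[] = { match_int, match_identifier };

/* Returns the index of the winning rule (or -1) and stores the length. */
int longest_match(const char *s, size_t *len)
{
    assert(s != NULL && len != NULL);
    int best = -1;

    *len = 0;
    for (int i = 0; i < (int)(sizeof rules / sizeof rules[0]); i++) {
        size_t n = rules[i](s);
        if (n > *len) {         /* strict >, so an earlier rule keeps a tie */
            *len = n;
            best = i;
        }
    }
    return best;
}
```

On intx the identifier rule wins with length 4; on int; both rules match three characters and the keyword rule wins because it is listed first.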

Parsing a stream of data for control strings

I feel like this is a pretty common problem but I wasn't really sure what to search for.
I have a large file (so I don't want to load it all into memory) that I need to parse control strings out of and then stream that data to another computer. I'm currently reading in the file in 1000 byte chunks.
So for example if I have a string that contains ASCII codes escaped with ('$' some number of digits ';') and the data looked like this... "quick $33;brown $126;fox $a $12a". The string going to the other computer would be "quick !brown ~fox $a $12a".
In my current approach I have the following problems:
What happens when the control strings falls on a buffer boundary?
If the string is '$' followed by anything but digits and a ';' I want to ignore it. So I need to read ahead until the full control string is found.
I'm writing this in straight C so I don't have streams to help me.
Would an alternating double buffer approach work and if so how does one manage the current locations etc.
If I've followed what you are asking about it is called lexical analysis or tokenization or regular expressions. For regular languages you can construct a finite state machine which will recognize your input. In practice you can use a tool that understands regular expressions to recognize and perform different actions for the input.
Depending on different requirements you might go about this differently. For more complicated languages you might want to use a tool like lex to help you generate an input processor, but for this, as I understand it, you can use a much more simple approach, after we fix your buffer problem.
You should use a circular buffer for your input, so that indexing off the end wraps around to the front again. Whenever half of the data that the buffer can hold has been processed you should do another read to refill that. Your buffer size should be at least twice as large as the largest "word" you need to recognize. The indexing into this buffer will use the modulus (remainder) operator % to perform the wrapping (if you choose a buffer size that is a power of 2, such as 4096, then you can use bitwise & instead).
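The index wrapping described above might look like the following sketch. It shows only the indexing with a power-of-two size and bitwise &; the refill logic (calling read or fread when half the buffer has been processed) is left out. The type and function names (ring, ring_put, ring_peek, ring_consume) and the 4096 size are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define RING_SIZE 4096u             /* assumed size; must be a power of two */
#define RING_MASK (RING_SIZE - 1u)

typedef struct {
    unsigned char data[RING_SIZE];
    size_t head;                    /* total bytes written so far */
    size_t tail;                    /* total bytes consumed so far */
} ring;

static size_t ring_used(const ring *r) { return r->head - r->tail; }

/* Store one byte; returns 0 if the buffer is full, 1 otherwise. */
int ring_put(ring *r, unsigned char c)
{
    if (ring_used(r) == RING_SIZE)
        return 0;
    r->data[r->head++ & RING_MASK] = c;   /* & wraps the index */
    return 1;
}

/* Look at the byte `offset` positions past the read position without
   consuming it; returns -1 if that byte has not been written yet. */
int ring_peek(const ring *r, size_t offset)
{
    if (offset >= ring_used(r))
        return -1;
    return r->data[(r->tail + offset) & RING_MASK];
}

/* Drop n bytes that have been fully processed. */
void ring_consume(ring *r, size_t n)
{
    assert(n <= ring_used(r));
    r->tail += n;
}
```

Because ring_peek can look ahead across the wrap point, a control string that straddles the physical end of the array is read the same way as any other.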
Now you just look at the characters until you read a $, output what you've looked at up until that point, and then knowing that you are in a different state because you saw a $ you look at more characters until you see another character that ends the current state (the ;) and perform some other action on the data that you had read in. How to handle the case where the $ is seen without a well formatted number followed by a ; wasn't entirely clear in your question -- what to do if there are a million numbers before you see ;, for instance.
The regular expressions would be:
[^$]
Any non-dollar sign character. This could be augmented with a closure ([^$]* or [^$]+) to recognize a string of non$ characters at a time, but that could get very long.
$[0-9]{1,3};
This would recognize a dollar sign followed by 1 to 3 digits followed by a semicolon.
[$]
This would recognize just a dollar sign. It is in the brackets because $ is special in many regular expression representations when it is at the end of a symbol (which it is in this case) and means "match only if at the end of line".
Anyway, in this case it would recognize a dollar sign in the case where it is not recognized by the other, longer, pattern that recognizes dollar signs.
In lex you might have
[^$]{1,1024} { write_string(yytext); }
$[0-9]{1,3}; { write_char(atoi(yytext + 1)); /* skip the leading $ */ }
[$] { write_char(*yytext); }
and it would generate a .c file that will function as a filter similar to what you are asking for. You will need to read up a little more on how to use lex though.
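If you would rather not use lex, the same three patterns can be hand-coded as a small state machine. The sketch below takes a different route from the circular buffer: it keeps the few bytes of a possible control string in a struct that survives between chunks, so a control string split across a buffer boundary is handled naturally. The names ctl_filter, ctl_feed and ctl_finish are hypothetical, the 3-digit limit mirrors the {1,3} above, and the caller is assumed to provide an output buffer at least as large as the total input.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DIGITS 3   /* assumed limit, matching {1,3} in the patterns */

/* Filter state; zero-initialise it before the first chunk. */
typedef struct {
    char pending[1 + MAX_DIGITS];  /* '$' plus the digits seen so far */
    size_t npending;               /* 0 means not inside a candidate */
} ctl_filter;

static void emit(char *out, size_t *outlen, char c) { out[(*outlen)++] = c; }

/* The candidate was not a control string: output its bytes unchanged. */
static void flush_pending(ctl_filter *f, char *out, size_t *outlen)
{
    for (size_t i = 0; i < f->npending; i++)
        emit(out, outlen, f->pending[i]);
    f->npending = 0;
}

/* Process one chunk, appending the translated bytes to out. */
void ctl_feed(ctl_filter *f, const char *chunk, size_t len,
              char *out, size_t *outlen)
{
    assert(f && chunk && out && outlen);
    for (size_t i = 0; i < len; i++) {
        char c = chunk[i];
        if (f->npending > 0 && c >= '0' && c <= '9' &&
            f->npending <= MAX_DIGITS) {
            f->pending[f->npending++] = c;      /* another digit */
        } else if (f->npending > 1 && c == ';') {
            int code = 0;
            for (size_t k = 1; k < f->npending; k++)
                code = code * 10 + (f->pending[k] - '0');
            if (code <= 255) {
                emit(out, outlen, (char)code);  /* complete control string */
                f->npending = 0;
            } else {
                flush_pending(f, out, outlen);  /* not a byte value */
                emit(out, outlen, c);
            }
        } else {
            flush_pending(f, out, outlen);      /* no-op when nothing pending */
            if (c == '$')
                f->pending[f->npending++] = c;  /* start a new candidate */
            else
                emit(out, outlen, c);
        }
    }
}

/* Call once at end of input so an unfinished candidate is not lost. */
void ctl_finish(ctl_filter *f, char *out, size_t *outlen)
{
    flush_pending(f, out, outlen);
}
```

Feeding "quick $3" and then "3;brown $126;fox $a $12a" as two separate chunks yields "quick !brown ~fox $a $12a", the same as feeding the whole string at once.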
The "f" family of functions in <stdio.h> can take care of the streaming for you. Specifically, you're looking for fopen(), fgets(), fread(), etc.
Nategoose's answer about using lex (and I'll add yacc, depending on the complexity of your input) is also worth considering. They generate lexers and parsers that work, and after you've used them you'll never write one by hand again.

Resources