I am trying to write a minipython compiler, and as we know, Python works with spaces to define a block. In my situation, I defined the block as exactly 4 spaces, but when I want to create a block with multiple lines it tells me there is a shift/reduce conflict. I guess I know where the problem is; it doesn't know whether to treat the second line as a space or a tab, but I am not sure, so here is my code:
lexical:
"    " {column = column + strlen(yytext); return mc_tab;}
" "    {column = column + strlen(yytext);}
syntactic:
S : INSTRUCTION S
| {printf("\nWINNER WINNER CHICKEN DINNER\n");YYACCEPT;}
;
INSTRUCTION : DECL mc_jmp | LOOP | COND | mc_jmp
;
COND : IF ELSE
;
IF : mc_if mc_opnArc COMPARISION mc_clsArc mc_dblpnt mc_jmp BEGIN_BLOCK
;
ELSE : mc_else mc_dblpnt BEGIN_BLOCK |
BEGIN_BLOCK : mc_tab INSTRUCTION BEGIN_BLOCK |
;
Just so you know: when I deleted the recursion in BEGIN_BLOCK, the conflict went away.
EDIT :
There is another problem, and I guess if we solve it, then the first will be solved too.
When I write a TAB in any line of code, it is treated exactly as if that tab didn't exist, so the code treats the tab exactly as it treats the 4 spaces.
As I wrote in comments, shift / reduce conflicts are parser issues. They are being reported by Bison in your case, and they are a function of your grammar (only). If you ask it to do so via either or both of -v or -r all, Bison will produce an output file that shows you (among other things) exactly where such conflicts occur.
The grammar presented in the question is incomplete, but I made it into an input file that Bison would accept by adding section delimiters and adding a %token declaration for each symbol that is not otherwise defined. I also added the semicolon that appears to have been intended after the definition of the ELSE symbol, before the definition of BEGIN_BLOCK. Bison reported four shift / reduce conflicts for the result:
when the token on top of the stack is an IF and the next token is an mc_else, it is ambiguous whether to reduce zero tokens to an ELSE or to shift the mc_else in anticipation of performing a later reduction to an ELSE. This arises in part because the grammar accommodates nested conditionals, so it is a manifestation of the common issue of matching elses with the appropriate ifs.
when the tokens on top of the stack are mc_else mc_dblpnt and the next token is an mc_tab, it is ambiguous whether to reduce zero tokens to a BEGIN_BLOCK or to shift the mc_tab in anticipation of an INSTRUCTION to follow.
when a token sequence that can be reduced to a BEGIN_BLOCK is required and the next token is an mc_tab, it is ambiguous whether to reduce zero tokens to a BEGIN_BLOCK or to shift the mc_tab in anticipation of reducing to a BEGIN_BLOCK via the other production for that.
when the tokens on top of the stack are mc_if mc_opnArc COMPARISION mc_clsArc mc_dblpnt mc_jmp and the next token is an mc_tab, it is ambiguous whether to reduce zero tokens to a BEGIN_BLOCK or to shift the mc_tab in anticipation of an INSTRUCTION to follow.
A common theme emerges: empty rules are biting you hard. Such rules are by no means the only way that a shift / reduce conflict can emerge, but I presume that you will recognize that allowing the parser to create a nonterminal out of nothing is something to be handled with considerable care.
The name and usage of BEGIN_BLOCK in particular suggest a design problem, especially in conjunction with the fact that there is no corresponding END_BLOCK. Python's own parsing approach relies on the lexical analyzer to track indentation levels, and to emit synthetic indent and corresponding dedent tokens, as appropriate, when the indentation level changes. Sometimes a sequence of multiple dedents must be emitted to maintain correct indent / dedent pairing. And again, indents and dedents correspond to indentation level changes, not individual characters.
Making the lexer track indents and corresponding dedents allows for grammar rules along these lines:
if_stmt: IF conditional_expr COLON block optional_else ;
optional_else: /* empty */
| ELSE COLON block
;
block: INDENT stmts DEDENT ;
stmts: stmt
| stmts stmt /* note _left_ recursion */
;
stmt: ...
conditional_expr: ...
Note that the block structure is completely unambiguous -- a block begins with an indent and ends with a matching dedent. Although it may not be immediately obvious, that takes care not only of ambiguities such as arise from your empty BEGIN_BLOCK productions, but also ambiguities such as arise from your rules providing for optional else clauses. The latter are addressed because now the grammar allows at most one if with which any given else can pair.
You could do something similar.
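For concreteness, here is a minimal plain-C sketch (mine, not from any real compiler) of the indentation-stack bookkeeping such a lexer performs. The token names, the emit helper, and the hard-coded widths are inventions for the example; a real flex scanner would compute the width from the matched whitespace and return tokens to the parser instead of printing them.
#include <stdio.h>

/* Print a synthetic token; a real lexer would return it to the parser. */
static void emit(const char *token) { printf("%s\n", token); }

static int levels[64];   /* stack of indentation widths; levels[0] == 0 */
static int top = 0;

/* Call once per logical line with its indentation width in columns. */
static void handle_indentation(int width)
{
    if (width > levels[top]) {            /* deeper: exactly one INDENT */
        levels[++top] = width;
        emit("INDENT");
    } else {
        while (top > 0 && width < levels[top]) {
            top--;                        /* shallower: one DEDENT per level */
            emit("DEDENT");
        }
        if (width != levels[top])         /* dedent to a level never seen */
            emit("INDENT_ERROR");
    }
}

int main(void)
{
    /* e.g. an if with a nested if, then a return to the top level;
       emits INDENT, INDENT, DEDENT, DEDENT */
    int widths[] = { 0, 4, 8, 0 };
    for (int i = 0; i < 4; i++)
        handle_indentation(widths[i]);
    return 0;
}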
I am trying to write a regex which can search a string and return true if it matches the regex and false otherwise.
The check should ensure the string is a wildcard domain name of a website.
Example:
*.cool.dude is valid
*.cool is not valid
abc.cool.dude is not valid
So I had written something like this:
\\*\\.[.*]\\.[.*]
However, this also allows a *.. string as valid, because * means 0 or more occurrences.
I am looking for something which ensures that at least 1 occurrence of the string happens.
Example:
*.a.b -> valid but *.. -> invalid
How do I change the regex to support this?
I have already tried doing something like this:
\\*\\.([.*]{1,})\\.([.*]{1,}) -> doesn't work
\\*\\.([.+])\\.(.+) -> doesn't work
^\\*\\.[a-zA-Z]+\\.[a-zA-Z]+ -> doesn't work
I have tried a bunch of other options as well and have failed to find a solution. Would be great if someone can provide some input.
PS. Looking for a solution which works in C.
[.*] does not mean "0 or more occurrences" of anything. It means "a single character, either a (literal) . or a (literal) *". […] defines a character class, which matches exactly one character from the specified set. Brackets are not even remotely the same as parentheses.
So if you wanted to express "zero or more of any character except newline", you could just write .*. That's what .* means. And if you wanted "one or more" instead of "zero or more", you could change the * to a plus, as long as you remember that regex.h regexes should always be compiled with the REG_EXTENDED flag. Without that flag, + is just an ordinary character. (And there are a lot of other inconveniences.)
But that's probably not really what you want. My guess is that you want something like:
^[*]([.][A-Za-z0-9_]+){2,}$
although you'll have to correct the character class to specify the precise set of characters you think are legitimate.
Again, don't forget the crucial REG_EXTENDED flag when you call regcomp; there is a compilable sketch after the notes below.
Some notes:
The {2,} requires at least two components after the *, so that *.cool doesn't match.
The ^ and $ at the beginning and end of the regex "anchor" the match to the entire input. That stops the pattern matching just a part of the input, but it might not be exactly what you want, either.
Finally, I deliberately used a single-character character class to force [*] and [.] to be ordinary characters. I find that a lot more readable than falling timber (\\) and it avoids having to think about the combination of string escaping and regex-escaping.
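Putting it together, here is a compilable sketch (my code, using only standard regex.h calls; the test strings are the examples from the question). It should print valid for *.cool.dude and invalid for the other three.
#include <regex.h>
#include <stdio.h>

int main(void)
{
    regex_t re;
    const char *pattern = "^[*]([.][A-Za-z0-9_]+){2,}$";
    const char *tests[] = { "*.cool.dude", "*.cool", "abc.cool.dude", "*.." };

    /* REG_NOSUB: we only want a yes/no answer, not submatch offsets */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0) {
        fprintf(stderr, "regcomp failed\n");
        return 1;
    }
    for (int i = 0; i < 4; i++)
        printf("%-15s -> %s\n", tests[i],
               regexec(&re, tests[i], 0, NULL, 0) == 0 ? "valid" : "invalid");
    regfree(&re);
    return 0;
}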
For more information, I highly recommend reading man regcomp and man 7 regex. A good introduction to regexes might be useful, as well.
I am new to flex and I want to design a scanner using flex.
At this step, I want to write a regular expression that matches an id, with the following conditions:
underscores can exist in an id
you can use _ wherever you want, but if you use them consecutively there can be at most 2 underscores; for example:
a_b_c »»»» true
a___b »»»» false
123abv »»»» false
digits can't be at the beginning of an id
an underscore can't be at the end of an id
The regular expression I have written for that is:
(\b(_{0,2}[A-Za-z][0-9A-Za-z]*(_{0,2}[0-9A-Za-z]+)*)\b)
but now I have 2 questions:
Is the regular expression correct? I have tested it on rubular.com and I think it is, but I'm not sure.
The other important problem is that when I write this in my flex file, unfortunately no id is recognized, and I can't see why.
Can anyone please help me?
The problem here is your ID regular expression. You are using \b to match a word boundary, but Flex's regular expressions have no built-in support for matching word boundaries. Other than that, your regular expression is sound. I was able to get your code working using this modified version of yours: _{0,2}[A-Za-z][0-9A-Za-z]*(_{0,2}[0-9A-Za-z]+)*. (I just got rid of the \b's, and some of the parentheses that bothered me).
Unfortunately, this causes a slight problem. Say that you're lexing and run across something like 12_345. Flex will read 12, assume that it found an IC, and then read _. Finding no match, it will print that to stdout, then read 345 as another IC.
In order to avoid this issue (caused by Flex's lack of word boundaries), you could do one of two things:
Create a rule at the end that matches any character (other than whitespace) and make it give an error. This would stop Flex when it got to _ in the example above.
Create a rule at the end that matches any combination of letters, numbers, and underscores ([_0-9A-Za-z]+). If it is matched, give an error. This will cause Flex to return the entire token 12_345 as an error in the above example.
One other problem: The ID regular expression still won't match anything with underscores at the end of it. This means your current regular expression isn't perfect, and you'll need to do some tweaking with it, but now you know not to use the \b symbol. Here is a reference on Flex's regular expression syntax so you can find other things to use/avoid.
I think your requirement is:
Identifiers can use only alphanumeric characters and _
Identifiers cannot start with a number
Identifiers cannot end with an _
Identifiers cannot include more than two consecutive _
(When I first read your question, I thought the last requirement was that identifiers cannot include more than two _, but looking at the proposed regex, I think the version above is more accurate.)
Based on the above, you should be able to use the following two Flex patterns:
([[:alpha:]]|__?[[:alnum:]])(_?_?[[:alnum:]])* { /* Handle an identifier */ }
[[:alpha:]_][[:alnum:]_]* { /* Error */ }
Breaking that down:
([[:alpha:]]|__?[[:alnum:]]) matches an alphabetic character or one or two _ followed by an alphanumeric character.
(_?_?[[:alnum:]])* matches a string of _ and alphanumeric characters, with a maximum of two _ before each alphanumeric character.
The second pattern will match anything which starts with an alphabetic character or _ followed by any number of alphanumerics or _. This will match all valid identifiers as well as the sequences which contain too many consecutive _ or which end with _. If both patterns match (that is, a valid identifier), the first one will win, so it will be correctly recognized. The second pattern will consume the entire erroneous identifier, allowing for easier error recovery.
The pattern in the OP doesn't work because flex treats \b as a backspace character (as in C). Flex does not implement word boundary assertions, but in a lexer you almost never need these; the pattern above can be used if necessary.
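If it helps to sanity-check the requirements outside of flex, here is a small plain-C predicate (my sketch, not part of the flex solution; is_valid_id is an invented name) that implements the four rules directly:
#include <ctype.h>
#include <stdbool.h>
#include <stdio.h>

static bool is_valid_id(const char *s)
{
    int run = 0;   /* length of the current run of consecutive '_' */

    if (!isalpha((unsigned char)s[0]) && s[0] != '_')
        return false;                      /* must not start with a digit */
    for (int i = 0; s[i]; i++) {
        if (s[i] == '_') {
            if (++run > 2) return false;   /* at most 2 consecutive _ */
        } else if (isalnum((unsigned char)s[i])) {
            run = 0;
        } else {
            return false;                  /* only alphanumerics and _ */
        }
    }
    return run == 0;                       /* must not end with _ */
}

int main(void)
{
    const char *tests[] = { "a_b_c", "a___b", "123abv", "ab_" };
    for (int i = 0; i < 4; i++)
        printf("%-7s -> %s\n", tests[i], is_valid_id(tests[i]) ? "true" : "false");
    return 0;
}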
I'm trying to program a lexical analyzer for a standard C translation unit, so I've divided the possible tokens into 6 groups; for each group there's a regular expression, which will be converted to a DFA:
Keyword - (will have a symbol table containing "goto", "int"....)
Identifiers - [a-zA-Z][a-zA-Z0-9]*
Numeric Constants - [0-9]+\.?[0-9]*
String Constants - ""[EVERY_ASCII_CHARACTER]*""
Special Symbols - (will have a symbol table containing ";", "(", "{"....)
Operators - (will have a symbol table containing "+", "-"....)
My Analyzer's input is a stream of bytes/ASCII characters. My algorithm is the following:
assuming there's a stream of characters, x1...xn
for i = 1; i <= n; i++
    if x1...xi is accepted by one or more of the 6 groups' DFAs
    {
        take the longest token
        add x1...xi to the token linked list
        delete x1...xi from the input
    }
However, this algorithm will assume that every single letter it is given is an identifier, since after an input of 1 character it already accepts in the identifier DFA ([a-zA-Z][a-zA-Z0-9]*).
Another possible problem is the input "intx;": my algorithm will tokenize this stream into "int", "x", ";", which of course is an error.
I'm trying to think about a new algorithm, but I keep failing. Any suggestions?
Code your scanner so that it treats identifiers and keywords the same until the reading is finished.
When you have the complete token, look it up in the keyword table, and designate it a keyword if you find it and an identifier if you don't. This deals with the intx problem immediately; the scanner reads intx, and that's not a keyword, so it must be an identifier.
I note that your identifiers don't allow underscores. That's not necessarily a problem, but many languages do allow underscores in identifiers.
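A minimal sketch of that approach (my code, not the answerer's; the keyword list and buffer size are illustrative assumptions). For the input intx int x it prints IDENT intx, KEYWORD int, IDENT x, which is exactly the resolution of the intx problem:
#include <ctype.h>
#include <stdio.h>
#include <string.h>

static const char *keywords[] = { "goto", "int", "if", "else", "while" };

static int is_keyword(const char *tok)
{
    for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++)
        if (strcmp(tok, keywords[i]) == 0)
            return 1;
    return 0;
}

int main(void)
{
    const char *p = "intx int x";
    char tok[64];

    while (*p) {
        if (!isalpha((unsigned char)*p)) { p++; continue; }
        size_t n = 0;                     /* read the complete word first */
        while (isalnum((unsigned char)p[n]) && n < sizeof tok - 1) {
            tok[n] = p[n];
            n++;
        }
        tok[n] = '\0';
        p += n;
        printf("%-8s %s\n", is_keyword(tok) ? "KEYWORD" : "IDENT", tok);
    }
    return 0;
}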
Tokenizers generally FIRST split the input stream into tokens, based on rules which dictate what constitutes the END of a token, and only later decide what kind of token it is (or report an error otherwise). Typical token endings are things like white space (when not part of a literal string), operators, special delimiters, etc.
It seems you are missing the greediness aspect of competing DFAs. Greedy matching is usually the most useful (leftmost-longest match) because it solves the problem of how to choose between competing DFAs. Once you've matched int you have another node in the IDENTIFIER DFA that advances to intx. Your finite automaton doesn't exit until it reaches something it can't consume, and if it isn't in a valid accept state at the end of input, or at the point where another DFA is accepting, it is pruned and the other DFA is matched.
Flex, for example, defaults to greedy matching.
In other words, your proposed problem of intx isn't a problem...
If you have 2 rules that compete for int
rule 1 is the token "int"
rule 2 is IDENTIFIER
When we reach
i n t
we don't immediately ACCEPT int because we see another rule (rule 2) where further input x advances the automaton to a NEXT state:
i n t x
If rule 2 is in an ACCEPT state at that point, then rule 1 is discarded by definition. But if rule 2 is still not in an ACCEPT state, we must keep rule 1 around while we examine more input, to see whether we could eventually reach an ACCEPT state in rule 2 that is longer than rule 1. If we receive some other character that matches neither rule, we check whether the rule 2 automaton is in an ACCEPT state for intx; if so, it is the match. If not, it is discarded and the longest previous match (rule 1) is accepted. In this case, however, rule 2 is in an ACCEPT state and matches intx.
If two rules reach an ACCEPT or EXIT state simultaneously, precedence is used (the order of the rules in the grammar). Generally you put your keywords first so that IDENTIFIER doesn't match them first.
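Here is a small plain-C illustration of longest-match-wins with rule-order tie-breaking (my sketch; the two matcher functions stand in for the DFAs of rule 1 and rule 2, and the token names are invented). On intx int x it prints IDENT intx, KEYWORD int, IDENT x:
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* How many leading characters of s each rule can consume (0 = no match). */
static size_t match_kw_int(const char *s)            /* rule 1: "int" */
{
    return strncmp(s, "int", 3) == 0 ? 3 : 0;
}
static size_t match_ident(const char *s)             /* rule 2: IDENTIFIER */
{
    size_t n = 0;
    if (isalpha((unsigned char)s[0]))
        for (n = 1; isalnum((unsigned char)s[n]); n++)
            ;
    return n;
}

int main(void)
{
    const char *p = "intx int x";

    while (*p) {
        if (*p == ' ') { p++; continue; }            /* skip whitespace */
        size_t len1 = match_kw_int(p), len2 = match_ident(p);
        if (len1 == 0 && len2 == 0) {                /* neither rule matches */
            printf("error at '%c'\n", *p);
            p++;
        } else if (len1 >= len2) {                   /* tie goes to rule 1 */
            printf("KEYWORD %.*s\n", (int)len1, p);
            p += len1;
        } else {                                     /* longest match wins */
            printf("IDENT   %.*s\n", (int)len2, p);
            p += len2;
        }
    }
    return 0;
}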
Edit: WHOOPS! Big admission, I screwed up the definition of the ? in fnmatch pattern syntax and seem to have proposed (and possibly solved) a much harder problem where it behaves like .? in regular expressions. Of course it actually is supposed to behave like . in regular expressions (matching exactly one character, not zero or one). Which in turn means my initial problem-reduction work was sufficient to solve the (now rather boring) original problem. Solving the harder problem is rather interesting still though; I might write it up sometime.
On the plus side, this means there's a much greater chance that something like 2way/SMOA needle factorization might be applicable to these patterns, which in turn could yield the better-than-originally-desired O(n) or even O(n/m) performance.
In the question title, let m be the length of the pattern/needle and n be the length of the string being matched against it.
This question is of interest to me because all the algorithms I've seen/used have either pathologically bad performance and possible stack-overflow exploits due to backtracking, or require dynamic memory allocation (e.g. for a DFA approach or just to avoid backtracking on the call stack) and thus have failure cases that could also be dangerous if a program is using fnmatch to grant/deny access rights of some sort.
I'm willing to believe that no such algorithm exists for regular expression matching, but the filename pattern language is much simpler than regular expressions. I've already simplified the problem to the point where one can assume the pattern does not use the * character, and in this modified problem you're not matching the whole string but searching for an occurrence of the pattern in the string (like the substring match problem). If you further simplify the language and remove the ? character, the language is just composed of concatenations of fixed strings and bracket expressions, and this can easily be matched in O(mn) time and O(1) space, which perhaps can be improved to O(n) if the needle factorization techniques used in 2way and SMOA substring search can be extended to such bracket patterns. However, naively each ? requires trials with or without the ? consuming a character, bringing in a time factor of 2^q where q is the number of ? characters in the pattern.
Anyone know if this problem has already been solved, or have ideas for solving it?
Note: In defining O(1) space, I'm using the transdichotomous model.
Note 2: This site has details on the 2way and SMOA algorithms I referenced: http://www-igm.univ-mlv.fr/~lecroq/string/index.html
Have you looked into the re2 regular expression engine by Russ Cox (of Google)?
It's a regular expression matching engine based on deterministic finite automata, which is different than the usual implementations (Perl, PCRE) using backtracking to simulate a non-deterministic finite automaton. One of the specific design goals was to eliminate the catastrophic backtracking behaviour you mention.
It disallows some of the Perl extensions like backreferences in the search pattern, but you don't need that for glob matching.
I'm not sure if it guarantees O(mn) time and O(1) memory constraints specifically, but it was good enough to run the Google Code Search service while it existed.
At the very least it should be cool to look inside and see how it works. Russ Cox has written three articles about re2 - one, two, three - and the re2 code is open source.
Edit: WHOOPS! Big admission, I screwed up the definition of the ? in fnmatch pattern syntax and seem to have solved a much harder problem where it behaves like .? in regular expressions. Of course it actually is supposed to behave like . in regular expressions (matching exactly one character, not zero or one). Which in turn means my initial problem-reduction work was sufficient to solve the (now rather boring) original problem. Solving the harder problem is rather interesting still though; I might write it up sometime.
Possible solution to the harder problem follows below.
I have worked out what seems to be a solution in O(log q) space (where q is the number of question marks in the pattern, and thus q < m) and uncertain but seemingly better-than-exponential time.
First of all, a quick explanation of the problem reduction. First break the pattern at each *; it decomposes as a (possibly zero length) initial and final component, and a number of internal components flanked on both sides by a *. This means once we've determined if the initial/final components match up, we can apply the following algorithm for internal matches: Starting with the last component, search for the match in the string that starts at the latest offset. This leaves the most possible "haystack" characters free to match earlier components; if they're not all needed, it's no problem, because the fact that a * intervenes allows us to later throw away as many as needed, so it's not beneficial to try "using more ? marks" of the last component or finding an earlier occurrence of it. This procedure can then be repeated for every component. Note that here I'm strongly taking advantage of the fact that the only "repetition operator" in the fnmatch expression is the * that matches zero or more occurrences of any character. The same reduction would not work with regular expressions.
With that out of the way, I began looking for how to match a single component efficiently. I'm allowing a time factor of n, so that means it's okay to start trying at every possible position in the string, and give up and move to the next position if we fail. This is the general procedure we'll take (no Boyer-Moore-like tricks yet; perhaps they can be brought in later).
For a given component (which contains no *, only literal characters, brackets that match exactly one character from a given set, and ?), it has a minimum and maximum length string it could match. The minimum is the length if you omit all ? characters and count bracket expressions as one character, and the maximum is the length if you include ? characters. At each position, we will try each possible length the pattern component could match. This means we perform q+1 trials. For the following explanation, assume the length remains fixed (it's the outermost loop, outside the recursion that's about to be introduced). This also fixes a length (in characters) from the string that we will be comparing to the pattern at this point.
Now here's the fun part. I don't want to iterate over all possible combinations of which ? characters do/don't get used. The iterator is too big to store. So I cheat. I break the pattern component into two "halves", L and R, where each contains half of the ? characters. Then I simply iterate over all the possibilities of how many ? characters are used in L (from 0 to the total number that will be used based on the length that was fixed above) and then the number of ? characters used in R is determined as well. This also partitions the string we're trying to match into part that will be matched against pattern L and pattern R.
Now we've reduced the problem of checking if a pattern component with q ? characters matches a particular fixed-length string to two instances of checking if a pattern component with q/2 ? characters matches a particular smaller fixed-length string. Apply recursion. And since each step halves the number of ? characters involved, the number of levels of recursion is bounded by log q.
You can compute a hash of both strings and then compare these. The hash computation will be done in O(m), while the search takes O(m + n).
You can use something like this for calculating the hash of the string, where s[i] is a character:
s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
As you said, this is for file-name matching, so you can't use this where you have wildcards in the strings. Good luck!
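For what it's worth, that formula is a base-31 polynomial hash, which can be computed with Horner's rule; a sketch (my code, relying on unsigned wraparound in the same way Java's String.hashCode does):
#include <stdio.h>

/* h = s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1], modulo 2^(word size) */
static unsigned long poly_hash(const char *s)
{
    unsigned long h = 0;
    while (*s)
        h = h * 31 + (unsigned char)*s++;
    return h;
}

int main(void)
{
    printf("%lu\n", poly_hash("cool.dude"));
    return 0;
}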
My feeling is that this is not possible.
Though I can't provide a bullet-proof argument, my intuition is that you will always be able to construct patterns containing q=Theta(m) ? characters where it will be necessary for the algorithm to, in some sense, account for all 2^q possibilities. This will then require O(q)=O(m) space to keep track of which of the possibilities you're currently looking at. For example, the NFA algorithm uses this space to keep track of the set of states it's currently in; the brute-force backtracking approach uses the space as stack (and to add insult to injury, it uses O(2^q) time in addition to the O(q) of space).
OK, here's how I solved the problem.
Attempt to match the initial part of the pattern up to the first * against the string. If this fails, bail out. If it succeeds, throw away this initial part of both the pattern and the string; we're done with them. (And if we hit the end of pattern before hitting a *, we have a match iff we also reached the end of the string.)
Skip all the way to the end of the pattern (everything after the last *, which might be a zero-length pattern if the pattern ends with a *). Count the number of characters needed to match it, and examine that many characters from the end of the string. If they fail to match, we're done. If they match, throw away this component of the pattern and string.
Now, we're left with a (possibly empty) sequence of subpatterns, all of which are flanked on both sides by *'s. We try searching for them sequentially in what remains of the string, taking the first match for each and discarding the beginning of the string up through the match. If we find a match for each component in this manner, we have a match for the whole pattern. If any component search fails, the whole pattern fails to match.
This algorithm has no recursion and only stores a finite number of offsets in the string/pattern, so in the transdichotomous model it's O(1) space. Step 1 was O(m) in time, step 2 was O(n+m) in time (or O(m) if we assume the input string length is already known, but I'm assuming a C string), and step 3 is (using a naive search algorithm) O(nm). Thus the algorithm overall is O(nm) in time. It may be possible to improve step 3 to be O(n) but I haven't yet tried.
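For illustration, here is a compilable C sketch of such a matcher (my code; it uses the well-known restart-after-star formulation, which is equivalent in effect to the three steps above and is likewise O(nm) time and O(1) space; bracket expressions are omitted for brevity, and ? matches exactly one character):
#include <stdbool.h>
#include <stdio.h>

static bool glob_match(const char *pat, const char *str)
{
    const char *star_pat = NULL;   /* position just after the last '*' seen */
    const char *star_str = NULL;   /* where the current retry began */

    while (*str) {
        if (*pat == '?' || *pat == *str) {     /* consume one character */
            pat++;
            str++;
        } else if (*pat == '*') {              /* remember the restart point */
            star_pat = ++pat;
            star_str = str;
        } else if (star_pat) {                 /* retry one character later */
            pat = star_pat;
            str = ++star_str;
        } else {
            return false;                      /* mismatch before any '*' */
        }
    }
    while (*pat == '*')                        /* trailing stars match "" */
        pat++;
    return *pat == '\0';
}

int main(void)
{
    printf("%d\n", glob_match("*.cool.dude", "x.cool.dude"));  /* 1 */
    printf("%d\n", glob_match("a*b?c", "aXXbYc"));             /* 1 */
    printf("%d\n", glob_match("a*b?c", "abc"));                /* 0 */
    return 0;
}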
Finally, note that the original harder problem is perhaps still useful to solve. That's because I didn't account for multi-character collating elements, which most people implementing regex and such tend to ignore because they're ugly to get right and there's no standard API to interface with the system locale and obtain the necessary info to get them. But with that said, here's an example: Suppose ch is a multi-character collating element. Then [c[.ch.]] could consume either 1 or 2 characters. And we're back to needing the more advanced algorithm I described in my original answer, which I think needs O(log m) space and perhaps somewhat more than O(nm) time (I'm guessing O(n²m) at best). At the moment I have no interest in implementing multi-character collating element support, but it does leave a nice open problem...