How do I customize the default action in flex? I found something like <*>, but when I run it I get "flex scanner jammed". Adding a . rule only adds one more rule, so that does not work either. What I want is
comment "/*"[^"*/"]*"*/"
%%
{comment} return 1;
{default} return 0;
<<EOF>> return -1;
Is it possible to change the behavior of matching longest to match first? If so I would do something like this
default (.|\n)*
but because this almost always gives a longer match it will hide the comment rule.
EDIT
I found the {-} operator in the manual; however, this example straight from the manual gives me "unrecognized rule":
[a-c]{-}[b-z]
The flex default rule matches a single character and prints it on standard output. If you don't want that action, write an explicit rule which matches a single character and does something else.
The pattern (.|\n)* matches the entire input file as a single token, so that is a very bad idea. You're thinking that the default should be a long match, but in fact you want that to be as short as possible (but not empty).
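To see why the long catch-all hides the comment rule, here is a rough Python sketch (not flex) that compares match lengths at the same input position. The comment regex is a simplified stand-in I made up for illustration, and `match_lengths` is a hypothetical helper.

```python
import re

# Simplified stand-in for a C comment rule (an assumption, not the OP's pattern).
COMMENT = re.compile(r"/\*.*?\*/", re.S)
# The proposed "default" pattern from the question.
CATCH_ALL = re.compile(r"(.|\n)*")

def match_lengths(text, pos):
    """Return (comment match length, catch-all match length) at pos."""
    comment = COMMENT.match(text, pos)
    catch_all = CATCH_ALL.match(text, pos)
    comment_len = comment.end() - pos if comment else 0
    return comment_len, catch_all.end() - pos

text = "int x; /* note */ int y;"
# At offset 7 the comment begins, yet the catch-all match is longer,
# so a longest-match scanner would always pick the catch-all.
print(match_lengths(text, 7))
```

With a longest-match rule, the second number always wins, which is why the catch-all has to be kept to a single character.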
The purpose of the default rule is to do something when there is no match for any of the tokens in the input language. When lex is used for tokenizing a language, such a situation is almost always erroneous because it means that the input begins with a character which is not the start of any valid token of the language.
Thus, a "catch any character" rule is coded as a form of error recovery. The idea is to discard the bad character (just one) and try tokenizing from the character after that one. This is only a guess, but it's a good guess because it's based on what is known: namely that there is one bad character in the input.
The recovery rule can be wrong. For instance suppose that no token of the language begins with #, and the programmer wanted to write the string literal "#abc". Only, she forgot the opening " and wrote #abc". The right fix is to insert the missing ", not to discard the #. But that would require a much more clever set of rules in the lexer.
Anyway, when discarding a bad character you usually want to issue an error message such as "skipping invalid character '~' in line 42, column 3".
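As an illustration of that recovery strategy, here is a minimal Python sketch (not flex) of a longest-match tokenizer with a one-character catch-all that reports line and column. The token set and the names `TOKEN_RULES` and `tokenize` are made up for the example.

```python
import re

# Hypothetical token set for illustration only.
TOKEN_RULES = [
    ("IDENT", re.compile(r"[A-Za-z_]\w*")),
    ("INT", re.compile(r"[0-9]+")),
    ("SPACE", re.compile(r"\s+")),
]

def tokenize(text):
    tokens, errors = [], []
    pos, line, col = 0, 1, 1
    while pos < len(text):
        # Longest match wins; on a tie the earlier rule is kept.
        best_name, best_len = None, 0
        for name, rx in TOKEN_RULES:
            m = rx.match(text, pos)
            if m and m.end() - pos > best_len:
                best_name, best_len = name, m.end() - pos
        if best_name is None:
            # Catch-all: discard exactly one character and report it.
            errors.append(
                f"skipping invalid character {text[pos]!r} "
                f"in line {line}, column {col}"
            )
            best_len = 1
        elif best_name != "SPACE":
            tokens.append((best_name, text[pos:pos + best_len]))
        # Keep line/column in step with the consumed text.
        for ch in text[pos:pos + best_len]:
            if ch == "\n":
                line, col = line + 1, 1
            else:
                col += 1
        pos += best_len
    return tokens, errors

tokens, errors = tokenize("ab 12\n  ~x")
print(tokens)
print(errors)
```

Scanning resumes right after the bad character, so the `x` that follows `~` is still recognized as an identifier.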
The default rule/action of copying the unmatched character to standard output is useful when lex is used for text filtering. The default rule then brings about the semantics of a regex search (as opposed to a regex match): the idea is to search the input for matches of the lexer's token-recognizing state machine, while printing all material that is skipped by that search.
So for instance, a lex specification containing just the rule:
"foo" { printf("bar"); }
will implement the equivalent of
sed -e 's/foo/bar/g'
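The same filtering semantics can be sketched in Python (this is not how lex is implemented, just the observable behavior): try every rule at the current position, and if none matches, copy one character to the output. `lex_filter` and its rule format are hypothetical names for this illustration.

```python
def lex_filter(text, rules):
    """rules: list of (literal, replacement) pairs; unmatched text is echoed."""
    out = []
    pos = 0
    while pos < len(text):
        best = None
        for literal, replacement in rules:
            # Prefer the longest literal that matches at this position.
            if text.startswith(literal, pos) and (
                best is None or len(literal) > len(best[0])
            ):
                best = (literal, replacement)
        if best is not None:
            out.append(best[1])
            pos += len(best[0])
        else:
            # Default rule: copy a single unmatched character.
            out.append(text[pos])
            pos += 1
    return "".join(out)

print(lex_filter("foo food xfoo fo", [("foo", "bar")]))
```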
I solved the problem manually instead of trying to match the complement of a rule. This works fine because the matching pattern involved in this case is quite simple.
Why does adding "." not do the trick? You can't perform an action without matched text. flex won't do anything if there is no match, so to add a "default" rule, just make it match something.
<*>.|\n /* default action here */
Putting this at the end of the rules section catches otherwise unmatched characters in all start conditions. It's useful for finding out where there may be holes.
What I don't know (and would like to know) is how to get flex to report where the default rule match has been found.
I need to write a regex which will match only lines with the C function call, not its declaration.
So I need it to match only lines where funcName() is not preceded by int, double, float, char etc. and an arbitrary number of spaces.
The problem is, I can run into following expressions:
printf("Hello"); int f() {return 1;};
So I must also consider the situation where there are other characters before the data-type name.
myStruct f();
In this situation I want regex to match it, ONLY basic data-types should be excluded.
So far I've got to this expression:
^(?!(void|int|double|char))\s*f\(\).*$
But I have no idea, how to take care of the situation with characters before the type name.
The following regex meets your specs:
(^|((^|\s)(?!(void|int|double|char))[^\s]+)\s+)([a-zA-Z_]+\(\)?)
The function name is defined by a character class containing letters and the underscore.
The line starts with the function call, or
the line contains at least one non-whitespace character before the function name. In that case ...
this non-WS sequence does not match the excluded keywords
there is at least 1 WS character before the function name
See the live demo at regex101.
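For a quick local check, the pattern can be tried with Python's `re` module, which accepts this syntax unchanged. The helper name `find_call` and the sample lines are mine; group 5 is the function-name group.

```python
import re

CALL_RE = re.compile(
    r"(^|((^|\s)(?!(void|int|double|char))[^\s]+)\s+)([a-zA-Z_]+\(\)?)"
)

def find_call(line):
    """Return the text captured as the function call, or None."""
    m = CALL_RE.search(line)
    return m.group(5) if m else None

for line in ["f();", "int f();", "myStruct f();", "x = f();", "interval f();"]:
    print(repr(line), "->", find_call(line))
```

Note that the last sample is rejected even though `interval` is not a basic type: the negative lookahead only checks the prefix, so any word that starts with `int`, `char` and so on is excluded as well.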
Caveat
As several commenters have noted, this is not a robust solution. It will work only for a tightly constrained set of function call and declaration patterns.
A general regex-based solution (if possible at all, which would depend heavily on the regex engine features available) would be of theoretical interest only, as it would have to mimic the C preprocessor completely.
I am trying to write regexes to detect IP addresses and floating point numbers in re2c (http://re2c.org/). Here are the rules I am using
<SYMBOL> [-+]?[0-9]+[.][0-9]+ { RETURN(FLOAT); }
<SYMBOL> [0-9]{1,3}'.'[0-9]{1,3}'.'[0-9]{1,3}'.'[0-9]{1,3} {RETURN (IPADDR); }
Whenever I compile, it throws error about some YYMARKER being undeclared. But if I use only one of the rules the compilation goes fine. I guess re2c is having trouble with backtracking based regex since both the rules have a large data set with common prefix (for example 192.132 could be starting of both a floating point number as well as ip address).
Here is the command line I am using to first generate the tokenizer file. re2c itself does not throw any error.
re2c -c -o tokenizer.c tokenizer.re
But when I compile the C file I get the following error.
tokenizer.c: In function 'getnext_querytoken':
tokenizer.c:74: error: 'YYMARKER' undeclared (first use in this function)
tokenizer.c:74: error: (Each undeclared identifier is reported only once
tokenizer.c:74: error: for each function it appears in.)
Is there any way I can solve this problem?
@sushil, you are right: YYMARKER is a part of re2c API.
However, re2c is not "having trouble with backtracking based regex since both the rules have a large data set". re2c-generated lexers only iterate the input once (complexity is linear). YYMARKER is needed because of the overlapping rules, as explained in this example: http://re2c.org/examples/example_01.html :
YYMARKER (line 5) is needed because rules overlap: it backups input position of the longest successful match. Say, we have overlapping rules "a" and "abc" and input string "abd": by the time "a" matches there's still a chance to match "abc", but when lexer sees 'd' it must rollback. (You might wonder why YYMARKER is exposed at all: why not make it a local variable like yych? The reason is, all input pointers must be updated by YYFILL as explained in Arbitrary large input and YYFILL example.)
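The "a" / "abc" / "abd" case from the quote can be traced with a small Python sketch. This is only an illustration of the marker idea for literal rules, not re2c's generated code; `longest_match` is a hypothetical helper and `marker` plays the role of YYMARKER.

```python
def longest_match(rules, text, start=0):
    """Return (marker, cursor): end of the longest accepted match (or None)
    and how far the scan had advanced when it gave up."""
    cursor = start
    marker = None  # last position at which some rule had fully matched
    candidates = list(rules)
    while candidates and cursor < len(text):
        offset = cursor - start
        # Keep only rules that are still consistent with the input.
        candidates = [
            r for r in candidates
            if len(r) > offset and r[offset] == text[cursor]
        ]
        if not candidates:
            break
        cursor += 1
        if any(len(r) == cursor - start for r in candidates):
            marker = cursor  # remember the accepting position
    return marker, cursor

print(longest_match(["a", "abc"], "abd"))
print(longest_match(["a", "abc"], "abcd"))
```

On "abd" the scan advances past "ab" before failing, so the input position has to be restored from the marker; with only one of the rules there would be nothing to roll back to, which matches the observation that a single rule compiles without YYMARKER.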
Looks like I did not read the manpage properly. According to the manpage, I needed to define the variable YYMARKER myself to support backtracking in re2c. Here is the extract from http://re2c.org/manual.html
YYMARKER l-value of type YYCTYPE *. The generated code saves
backtracking information in YYMARKER. Some easy scanners might not use
this.
I would like to partially parse a list of C declarations and/or function definitions.
That is, I want to split it into substrings, each containing one declaration, or function definition.
Each declaration (separately) will then be passed to another module (that does contain a full C parser, but that I cannot call directly.)
Obviously I could do this by including another full C parser in my program, but I hope to avoid this.
The tricky cases I've come up against so far involve the question of whether '}' terminates a declaration/definition or not. For example in
int main(int ac, char **av) {return 0;}
... the '}' is a terminator, whereas in
typedef struct foo {int bar;} *pfoo;
it is not. There may also be pathological pieces of code like this:
struct {int bar;} *getFooPtr(...) { /* code... */ }
Notes
Please assume the C code has already been fully preprocessed before my function sees it. (Actually it hasn't, but we have a workaround for that.)
My parser will probably be implemented in Lua with LPeg
To extend the state machine in your answer to deal with function definitions, add the following steps:
1. Set fun/var state to 'unknown'.
2. Examine the character at the current position.
3. If it's ;, we have found the end of the declaration, and it's not a function definition (it might be a function declaration, though).
4. If it's " or ', jump to the matching quote, skipping over escape sequences if necessary.
5. If it's (, [ or {, jump to the matching ), ] or } (skipping over nested brackets and strings recursively if necessary).
6. If fun/var state is 'function' and we just skipped { .. }, we've found the end of the declaration, and it's a function definition.
7. If fun/var state is 'unknown' and we just skipped ( .. ), set fun/var state to 'function'.
8. If the current character is = or ,, set fun/var state to 'not-function'.
9. Advance to the next input character, and go back to 2.
Of course, this only works on post-pre-processed code -- if you have macros that do various odd things that haven't yet been expanded, all bets are off.
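A rough Python sketch of this state machine follows (the question mentions Lua with LPeg; Python is used here purely for illustration). All function names are hypothetical, and the sketch assumes preprocessed input with no comments that contain quotes or unbalanced brackets.

```python
CLOSERS = {"(": ")", "[": "]", "{": "}"}

def skip_quoted(src, i):
    """Return the index just past the quote matching the one at src[i]."""
    quote = src[i]
    i += 1
    while i < len(src):
        if src[i] == "\\":
            i += 2  # skip the escaped character
        elif src[i] == quote:
            return i + 1
        else:
            i += 1
    return i

def skip_bracketed(src, i):
    """Return the index just past the bracket matching the one at src[i]."""
    close = CLOSERS[src[i]]
    i += 1
    while i < len(src):
        ch = src[i]
        if ch == close:
            return i + 1
        if ch in "\"'":
            i = skip_quoted(src, i)
        elif ch in CLOSERS:
            i = skip_bracketed(src, i)
        else:
            i += 1
    return i

def split_declarations(src):
    pieces, start, i, state = [], 0, 0, "unknown"
    while i < len(src):
        ch, end = src[i], None
        if ch == ";":
            i += 1
            end = i  # end of a declaration
        elif ch in "\"'":
            i = skip_quoted(src, i)
        elif ch in CLOSERS:
            i = skip_bracketed(src, i)
            if ch == "{" and state == "function":
                end = i  # end of a function definition
            elif ch == "(" and state == "unknown":
                state = "function"
        else:
            if ch in "=,":
                state = "not-function"
            i += 1
        if end is not None:
            piece = src[start:end].strip()
            if piece:
                pieces.append(piece)
            start, state = end, "unknown"
    return pieces

SAMPLE = """int main(int ac, char **av) {return 0;}
typedef struct foo {int bar;} *pfoo;
struct {int bar;} *getFooPtr(...) { /* code... */ }
int x = f(1), y;"""
for piece in split_declarations(SAMPLE):
    print(piece)
```

On the three tricky examples from the question this yields one piece per declaration or definition, but it is a heuristic and has not been checked against vendor extensions.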
As far as I can tell, the following solution works for declarations only (that is, function definitions must be kept out of this section, although adding a semicolon after each of them may be a workaround):
1. Examine the character at the current position.
2. If it's ;, we have found the end of the declaration.
3. If it's " or ', jump to the matching quote, skipping over escape sequences if necessary.
4. If it's (, [ or {, jump to the matching ), ] or } (skipping over nested brackets and strings recursively if necessary).
5. Otherwise, advance to the next input character and go to step 1.
If this proves to be unsatisfactory, I will switch to the clang parser.
Your best bet would be to extract the part of the C grammar which is related to declarations, and build a parser for that or an abbreviated version of that. Similarly, you want the grammar for function bodies, abbreviated in a similar way, so you can skip them.
This might produce a relatively trustworthy parser for declarations.
It is unfortunate that you will not likely be able to get your hands on a trustworthy C grammar; the one in the ANSI Standard(s) is not the one the compilers actually use. Every vendor has added goodies and complications to their compiler (e.g., MS C's declspecs, etc.).
The assumption that the preprocessor has run is interesting. Where are you going to get the preprocessor configuration (e.g., compiler command line defines, include paths, pragma settings, etc.)? This is harder than it looks, as each development environment defines different ways to set the preprocessor conditionals.
If you are willing to accept occasional errors, then any heuristic is a valid candidate, modulo how often it makes a mistake on an important client's code. This also means you can handle un-processed code, avoiding the preprocessor issue entirely.
I am fairly new to regexes, so I wrote the following simple regex using positive lookahead that detects functions and function calls in a C source file:
\w+(?=\s*\()
It works fine, but the problem is that it also detects non-function syntax like if(), while() etc.
I can easily avoid this by saying-
(if(?!\()) | (while(?!\())
But how do I combine the second regex with the first one? I can't OR them, because the first one still matches if(), while() etc., and in an OR expression it's enough if one of the terms matches.
How can I combine these regexes, or is there a better, simpler one that will not match non-function syntax like if() and while()?
PS: I use the following tools to test my regexes
GSkinner
RegexPal
There are quite a lot of assumptions when you are searching for function call in C with regex. That aside, if you are happy with what is matched (there are valid function calls that will not be matched), and you want to exclude if and while from the result list, you can use the following regex:
(?!\b(if|while|for)\b)\b\w+(?=\s*\()
The regex uses word boundary \b to make sure that the whole name is matched (prevent partial matching of hile in while), and the whole name is not keyword (prevent rejection of whilenothinghappens).
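A quick way to check this behavior is Python's `re` module, which accepts the pattern unchanged. The sample source line and the helper name `function_like_names` are made up for the illustration.

```python
import re

CALL_NAME = re.compile(r"(?!\b(if|while|for)\b)\b\w+(?=\s*\()")

def function_like_names(source):
    """Return every identifier that is followed by '(' and is not a keyword."""
    return [m.group(0) for m in CALL_NAME.finditer(source)]

sample = "if (x) foo(1); while(y) whilenothinghappens (2); bar ();"
print(function_like_names(sample))
```

`finditer` with `group(0)` is used instead of `findall`, because `findall` would return the contents of the capture group inside the lookahead rather than the matched names.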
On my OS X 10.5.8 machine, using the regcomp and regexec C functions to match the extended regex "(()|abc)xyz", I find a match for the string "abcxyz" but only from offset 3 to offset 6. My expectation was that the entire string would be matched and that I would see a submatch for the initial "abc" part of the string.
When I try the same pattern and text with awk on the same machine, it shows a match for the entire string as I would expect.
I expect that my limited experience with regular expressions may be the problem. Can somebody explain what is going on? Is my regular expression valid? If so, why doesn't it match the entire string?
I understand that "((abc){0,1})xyz" could be used as an alternative, but the pattern of interest is being automatically generated from another pattern format and eliminating instances of "()" is extra work I'd like to avoid if possible.
For reference, the flags I'm passing to regcomp consist only of REG_EXTENDED. I pass an empty set of flags (0) to regexec.
The POSIX standard says:
9.4.3 ERE Special Characters
An ERE special character has special properties in certain contexts. Outside those contexts, or when preceded by a <backslash>, such a character shall be an ERE that matches the special character itself. The extended regular expression special characters and the contexts in which they shall have their special meaning are as follows:
.[\(
The <period>, <left-square-bracket>, <backslash>, and <left-parenthesis> shall be special except when used in a bracket expression (see RE Bracket Expression ). Outside a bracket expression, a <left-parenthesis> immediately followed by a <right-parenthesis> produces undefined results.
What you are seeing is the result of invoking undefined behaviour - anything goes.
If you want reliable, portable results, you will have to eliminate the empty '()' notations.
If you iterate over all matches and don't get both [3,6) and [0,6), then there's a bug. I'm not sure what POSIX mandates as far as the order in which matches are returned.
Try (abc|())xyz - I bet it'll produce the same result in both places. I can only presume that the C version is trying to match xyz wherever it can, and if that fails, it tries to match abcxyz wherever it can (but, as you see, it doesn't fail, so we never bother with the "abc" part), whereas awk must be using its own regex engine that performs the way you expect.
Your regex is valid. I think the problem is either a) POSIX isn't very clear about how the regex should work, or b) awk isn't using 100% POSIX-compliant regexes (probably because it appears OS X ships with a more original version of awk). Whichever problem it is, it's probably caused because this is somewhat of an edge case and most people wouldn't write the regex that way.