Expected behavior of POSIX extended regex: (()|abc)xyz

On my OS X 10.5.8 machine, using the regcomp and regexec C functions to match the extended regex "(()|abc)xyz", I find a match for the string "abcxyz" but only from offset 3 to offset 6. My expectation was that the entire string would be matched and that I would see a submatch for the initial "abc" part of the string.
When I try the same pattern and text with awk on the same machine, it shows a match for the entire string as I would expect.
I expect that my limited experience with regular expressions may be the problem. Can somebody explain what is going on? Is my regular expression valid? If so, why doesn't it match the entire string?
I understand that "((abc){0,1})xyz" could be used as an alternative, but the pattern of interest is being automatically generated from another pattern format and eliminating instances of "()" is extra work I'd like to avoid if possible.
For reference, the flags I'm passing to regcomp consist only of REG_EXTENDED. I pass an empty set of flags (0) to regexec.

The POSIX standard says:
9.4.3 ERE Special Characters
An ERE special character has special properties in certain contexts. Outside those contexts, or when preceded by a <backslash>, such a character shall be an ERE that matches the special character itself. The extended regular expression special characters and the contexts in which they shall have their special meaning are as follows:
.[\(
The <period>, <left-square-bracket>, <backslash>, and <left-parenthesis> shall be special except when used in a bracket expression (see RE Bracket Expression ). Outside a bracket expression, a <left-parenthesis> immediately followed by a <right-parenthesis> produces undefined results.
What you are seeing is the result of invoking undefined behaviour - anything goes.
If you want reliable, portable results, you will have to eliminate the empty '()' notations.
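As a rough illustration of the portable route, here is a minimal sketch that compiles the alternative pattern from the question, "((abc){0,1})xyz", and reports the match offsets. The helper name ere_match is made up for this example; only regcomp, regexec and regfree come from the standard API.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: compile `pattern` as a POSIX ERE and run it once
 * against `text`. Returns 0 on a match (filling `groups`), REG_NOMATCH if
 * there is no match, or the regcomp error code if compilation fails. */
int ere_match(const char *pattern, const char *text,
              regmatch_t *groups, size_t ngroups)
{
    regex_t re;
    int rc = regcomp(&re, pattern, REG_EXTENDED);
    if (rc != 0)
        return rc;

    /* regexec takes no REG_EXTENDED flag; 0 means "no special handling". */
    rc = regexec(&re, text, ngroups, groups, 0);
    regfree(&re);
    return rc;
}
```

Called as ere_match("((abc){0,1})xyz", "abcxyz", g, 3), a conforming implementation should report the whole string in g[0] and the leading "abc" in g[1], because the pattern no longer relies on the undefined "()" construct.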

If you iterate over all matches, and don't get both [3,6) and [0,6), then there's a bug. I'm not sure what posix mandates as far as order in which matches are returned.

Try (abc|())xyz - I bet it'll produce the same result in both places. I can only presume that the C version is trying to match xyz wherever it can, and if that fails, it tries to match abcxyz wherever it can (but, as you see, it doesn't fail, so we never bother with the "abc" part) whereas awk must be using its own regex engine that performs the way you expect.
Your regex is valid. I think the problem is either a) POSIX isn't very clear about how the regex should work, or b) awk isn't using 100% POSIX-compliant regexes (probably because it appears OS X ships with a more original version of awk). Whichever problem it is, it's probably caused because this is somewhat of an edge case and most people wouldn't write the regex that way.

Related

C - regexec returns NOMATCH - even though it should?

Regex pattern needs to match the following:
abc_xyz_0
abc_1025_01.29.00_xyz_0
abc_0302_42.01.00_xyz_0
(numbers between abc and xyz don't matter)
So I parse for:
(abc_(\w+\.\d+\.\w+)?xyz_0)
My code:
regex_t r;
unsigned int maxGroups = 3;
regmatch_t groupArray[maxGroups];
char * to_match = "abc_0302_02.01.00_xyz_18 abc_0302_02.01.00_xyz_16 abc_0302_02.01.00_xyz_14 abc_0302_02.01.00_xyz_0 abc_0302_02.01.00_xyz_10 abc_0302_02.01.00_xyz_2"
if (0 != regcomp(&r, "(abc_(\\w+\\.\\d+\\.\\w+)?xyz_0)", REG_EXTENDED))
{
//this does NOT get hit
printf("regcomp failed")
}
else if(regexec(r, to_match, maxGroups, groupArray, REG_EXTENDED) == 0)
{ *never gets here* }
else
{ printf("regexec returned non-zero(No Matches)\n"); }
regfree(&r);
So my guess is either I have the wrong regex (which works fine for my cases defined above - and I used regexpal.com to confirm), or there is something I am missing?
Either way I know I am close and would greatly appreciate some help.
There are several typos in the code you copied into the question (see below), and you should only pass REG_EXTENDED to regcomp; the only flags regexec recognizes are REG_NOTBOL and REG_NOTEOL. (See the regexec manpage for details.)
However, the problem is that Posix regex, including the Gnu implementation, does not implement the non-standard escape sequence \d. As indicated in the regex(7) manpage, a pattern can include:
a '\' followed by one of the characters "^.[$()|*+?{\" (matching that character taken as an ordinary character),
or
a '\' followed by any other character (matching that character taken as an ordinary character, as if the '\' had not been present)
Note that the only effect of \, in either case, is to cause the following character to be matched as an ordinary character. While the Gnu implementation of regcomp does recognize \w as a character class, that behaviour is not required by Posix and other implementations might not do so. (It is also not documented, so it may not always work.) And it does not recognize \d.
If you are using Posix regexes, you should use Posix standard character classes, so the regex string should be:
"(abc_([[:alnum:]_]+\\.[[:digit:]]+\\.[[:alnum:]_]+)?xyz_0)"
You'll find a list of Posix named character classes in the regex(7) manpage (type man 7 regex, assuming you have installed the standard library documentation, which is highly recommended).
I verified this with your code, after adding the missing semicolon at the end of char * to_match =... and changing r to &r in the call to regexec.
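For reference, the corrected call sequence might look like the sketch below. The wrapper name find_xyz0 and the MAX_GROUPS macro are my own choices for the example, not something from the question; the pattern and the flag usage are the ones described above.

```c
#include <regex.h>
#include <stdio.h>

#define MAX_GROUPS 3

/* Hypothetical wrapper around the fixed code: returns 0 and fills
 * groups[] on a match, REG_NOMATCH if nothing matches, -1 if the
 * pattern fails to compile. */
int find_xyz0(const char *to_match, regmatch_t groups[MAX_GROUPS])
{
    regex_t r;
    int rc;

    /* Posix character classes replace the unsupported \w and \d. */
    if (regcomp(&r,
                "(abc_([[:alnum:]_]+\\.[[:digit:]]+\\.[[:alnum:]_]+)?xyz_0)",
                REG_EXTENDED) != 0) {
        fprintf(stderr, "regcomp failed\n");
        return -1;
    }

    /* Pass &r (not r) and no REG_EXTENDED here: regexec only
     * understands REG_NOTBOL and REG_NOTEOL. */
    rc = regexec(&r, to_match, MAX_GROUPS, groups, 0);
    regfree(&r);
    return rc;
}
```

On a match, groups[0] holds the offsets of the whole match and groups[2] those of the optional numeric part (both offsets are -1 when that part is absent).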
Note that surprisingly few online regex resources implement the Posix regex specification; http://regexpal.com, for example, only provides the options of PCRE- and Javascript-style regexes.
Each time you call regexec, you get the first match in the string you pass to it, according to a fixed algorithm described in man 7 regex:
In the event that an RE could match more than one substring of a
given string, the RE matches the one starting earliest in the string.
If the RE could match more than one substring starting at that point,
it matches the longest. Subexpressions also match the longest
possible substrings, subject to the constraint that the whole match
be as long as possible, with subexpressions starting earlier in the
RE taking priority over ones starting later. Note that higher-level
subexpressions thus take priority over their lower-level component
subexpressions.
If you want to find multiple instances of a pattern in the same string, you need to call regexec in a loop. Each time through the loop, you give it the address of the first unmatched byte from the previous match (i.e. string + matches[0].rm_eo) until it reports no more matches. If you rely on ^ anchors in your match, you will need to pass the correct value of the REG_NOTBOL flag to each call to regexec.
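The loop described above could be sketched roughly as follows. The function name count_matches is invented for the example, and the handling of empty matches (stepping one byte forward) is my own assumption about what you would want, not something the manpage prescribes.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: count non-overlapping matches of the POSIX ERE
 * `pattern` in `text`. Returns -1 if the pattern does not compile. */
int count_matches(const char *pattern, const char *text)
{
    regex_t re;
    regmatch_t m[1];
    const char *p = text;
    int count = 0;
    int eflags = 0;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    while (regexec(&re, p, 1, m, eflags) == 0) {
        count++;
        if (m[0].rm_eo == m[0].rm_so) {
            /* Empty match: step one byte forward so the loop terminates. */
            if (p[m[0].rm_eo] == '\0')
                break;
            p += m[0].rm_eo + 1;
        } else {
            /* Resume at the first unmatched byte after this match. */
            p += m[0].rm_eo;
        }
        /* Later calls no longer start at the beginning of the string,
         * so ^ must not match there (no REG_NEWLINE is in use). */
        eflags = REG_NOTBOL;
    }

    regfree(&re);
    return count;
}
```

With this, count_matches("^abc", "abcabc") counts only the first "abc", because REG_NOTBOL stops the anchor from matching at the resumed position.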

Posix regular expression non capturing group

I'm writing a simple shell in C under Linux. I'm trying to parse user input with POSIX regex with group capturing. My problem is I don't want to capture all the groups, but the ?: symbol doesn't seem to work for me.
"^(?:[A-Za-z0-9]+)( [A-Za-z0-9]*(?:\"[^\"]*\")*(?:\'[^\']*\')*[A-Za-z0-9]*)*&?$"
The use of (?:..), or any other grouping prefix, is not allowed in POSIX Regular Expressions.
There are tools to make languages, lex & yacc for example, and a simplified yacc grammar for POSIX shells is provided by the standard.
The character sequence (? is undefined as per section 9.4.3 ERE Special Characters:
*+?{
The <asterisk>, <plus-sign>, <question-mark>, and <left-brace> shall be special except when used in a bracket expression (see RE Bracket Expression). Any of the following uses produce undefined results:
If these characters appear first in an ERE, or immediately following an unescaped <vertical-line>, <circumflex>, <dollar-sign>, or <left-parenthesis>
If a <left-brace> is not part of a valid interval expression (see EREs Matching Multiple Characters)
A POSIX RE implementation has a few choices for how to handle undefined syntax. Those choices include enabling an extended syntax as per section 9.1 Regular Expression Definitions. So it's free to implement the non-capturing group syntax:
[...] violations of the specified syntax or semantics for REs produce
undefined results: this may entail an error, enabling an extended
syntax for that RE, or using the construct in error as literal
characters to be matched.
If you'd like to see the feature as part of a future POSIX standard, you could open an issue on the standard's issue tracker.
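In the meantime, one portable workaround is to keep ordinary capturing groups and simply ignore the indices you do not need. The sketch below assumes a made-up pattern and a made-up function name (capture_arg); it is only meant to show the indexing, not to parse real shell input.

```c
#include <regex.h>

/* Hypothetical example: group 1 exists only to scope the alternation,
 * group 2 is the one the caller actually wants. Returns 0 on a match,
 * REG_NOMATCH otherwise, -1 if the pattern fails to compile. */
int capture_arg(const char *text, regmatch_t *arg)
{
    regex_t re;
    regmatch_t g[3];   /* g[0] whole match, g[1] command, g[2] argument */
    int rc;

    if (regcomp(&re, "^(ls|cat) ([A-Za-z0-9]+)$", REG_EXTENDED) != 0)
        return -1;

    rc = regexec(&re, text, 3, g, 0);
    regfree(&re);

    if (rc == 0)
        *arg = g[2];   /* skip g[1]; it was only needed for grouping */
    return rc;
}
```

The cost is that every parenthesis consumes a slot in the regmatch_t array, so the array has to be sized for all groups, not just the interesting ones.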

Regex to find all calls of the certain function and NOT its declaration

I need to write a regex which will match only lines with the C function call, not its declaration.
So, I need it to match only lines where funcName() is not preceded by int, double, float, char etc. and an arbitrary number of spaces.
The problem is, I can run into following expressions:
printf("Hello"); int f() {return 1;};
So I must consider even the situation where there are some other characters before the data-type name.
myStruct f();
In this situation I want regex to match it, ONLY basic data-types should be excluded.
So far I've got to this expression:
^(?!(void|int|double|char))\s*f\(\).*$
But I have no idea, how to take care of the situation with characters before the type name.
The following regex meets your specs:
(^|((^|\s)(?!(void|int|double|char))[^\s]+)\s+)([a-zA-Z_]+\(\)?)
The function name is defined by a character class containing letters and the underscore.
The line starts with the function call, or
the line contains at least one non-whitespace character before the function name. In that case ...
this non-WS sequence does not match the excluded keywords
there is at least 1 WS character before the function name
See the live demo at regex101.
Caveat
As several commenters have noted, this is not a robust solution. It will work for a tightly constrained set of function call and declaration patterns only.
A general regex-based solution (if possible at all, which would depend heavily on the regex engine features available) would be of theoretical interest only, as it would have to mimic the C preprocessor completely.

Match functions and function calls in C using regex

I am fairly new to regexes, so I wrote the following simple regex using positive lookahead that detects functions and function calls in a C source file:
\w+(?=\s*\()
It works fine, but the problem is that it also detects non-function syntax like if(), while() etc.
I can easily avoid this by saying:
(if(?!\()) | (while(?!\())
But the problem is how to combine the second regex with the first one. I can't OR them, because the first one still matches if(), while() etc., and in an OR expression it's enough if one of the terms matches.
How can I combine these regexes, or is there a simpler one that will not match non-function syntax like if() and while()?
PS: I use the following tools to test my regexes
GSkinner
RegexPal
There are quite a lot of assumptions when you are searching for function call in C with regex. That aside, if you are happy with what is matched (there are valid function calls that will not be matched), and you want to exclude if and while from the result list, you can use the following regex:
(?!\b(if|while|for)\b)\b\w+(?=\s*\()
The regex uses word boundary \b to make sure that the whole name is matched (prevent partial matching of hile in while), and the whole name is not keyword (prevent rejection of whilenothinghappens).
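If you need the same filtering from C with POSIX regexes, which have no lookahead, one option is to match every identifier followed by "(" and reject the keywords in code afterwards. The following is a sketch under that assumption; the names find_calls, is_excluded and CALL_NAME_LEN are invented for the example, and it shares all the limitations mentioned above (it knows nothing about comments, strings or macros).

```c
#include <regex.h>
#include <stddef.h>
#include <string.h>

#define CALL_NAME_LEN 63

/* Returns 1 if the identifier is one of the keywords to skip. Comparing
 * the full length keeps names like "whilenothinghappens". */
static int is_excluded(const char *name, size_t len)
{
    static const char *const keywords[] = { "if", "while", "for" };
    for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++) {
        if (strlen(keywords[i]) == len && strncmp(name, keywords[i], len) == 0)
            return 1;
    }
    return 0;
}

/* Collects up to max_names identifiers that are followed by "(" and are
 * not excluded keywords. Returns the number of names stored. */
size_t find_calls(const char *src, char names[][CALL_NAME_LEN + 1],
                  size_t max_names)
{
    regex_t re;
    regmatch_t g[2];
    size_t found = 0;
    const char *p = src;

    /* Group 1 is the identifier; the optional whitespace and the "("
     * play the role of the lookahead in the PCRE version. */
    if (regcomp(&re, "([[:alpha:]_][[:alnum:]_]*)[[:space:]]*\\(",
                REG_EXTENDED) != 0)
        return 0;

    while (found < max_names && regexec(&re, p, 2, g, 0) == 0) {
        size_t len = (size_t)(g[1].rm_eo - g[1].rm_so);
        const char *name = p + g[1].rm_so;

        if (!is_excluded(name, len) && len <= CALL_NAME_LEN) {
            memcpy(names[found], name, len);
            names[found][len] = '\0';
            found++;
        }
        p += g[0].rm_eo;   /* continue after the "(" just consumed */
    }

    regfree(&re);
    return found;
}
```

Because the match consumes the "(" instead of merely looking ahead at it, the scan resumes after the parenthesis; that makes no difference for collecting names but would matter if you needed the exact match spans.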

Flex default rule

How do I customize the default action for flex. I found something like <*> but when I run it it says "flex scanner jammed"? Also the . rule only adds a rule so it does not work either. What I want is
comment "/*"[^"*/"]*"*/"
%%
{comment} return 1;
{default} return 0;
<<EOF>> return -1;
Is it possible to change the behavior of matching longest to match first? If so I would do something like this
default (.|\n)*
but because this almost always gives a longer match it will hide the comment rule.
EDIT
I found the {-} operator in the manual, however this example straight from the manual gives me "unrecognized rule":
[a-c]{-}[b-z]
The flex default rule matches a single character and prints it on standard output. If you don't want that action, write an explicit rule which matches a single character and does something else.
The pattern (.|\n)* matches the entire input file as a single token, so that is a very bad idea. You're thinking that the default should be a long match, but in fact you want that to be as short as possible (but not empty).
The purpose of the default rule is to do something when there is no match for any of the tokens in the input language. When lex is used for tokenizing a language, such a situation is almost always erroneous because it means that the input begins with a character which is not the start of any valid token of the language.
Thus, a "catch any character" rule is coded as a form of error recovery. The idea is to discard the bad character (just one) and try tokenizing from the character after that one. This is only a guess, but it's a good guess because it's based on what is known: namely that there is one bad character in the input.
The recovery rule can be wrong. For instance suppose that no token of the language begins with #, and the programmer wanted to write the string literal "#abc". Only, she forgot the opening " and wrote #abc". The right fix is to insert the missing ", not to discard the #. But that would require a much more clever set of rules in the lexer.
Anyway, usually when discarding a bad character, you want to issue an error message for this case like "skipping invalid character '~' in line 42, column 3".
The default rule/action of copying the unmatched character to standard output is useful when lex is used for text filtering. The default rule then brings about the semantics of a regex search (as opposed to a regex match): the idea is to search the input for matches of the lexer's token-recognizing state machine, while printing all material that is skipped by that search.
So for instance, a lex specification containing just the rule:
"foo" { printf("bar"); }
will implement the equivalent of
sed -e 's/foo/bar/g'
I solved the problem manually instead of trying to match the complement of a rule. This works fine because the matching pattern involved in this case is quite simple.
Why does adding "." not do the trick? You can't perform an action in the absence of a matched amount. flex won't do anything if there is no match, so to add a "default" rule, just make it match something.
<*>.|\n /* default action here */
Placing this at the end of the rules section catches otherwise unmatched input across all start conditions. It's useful to find out where there may be holes.
What I don't know (and would like to know) is how to get flex to report where the default rule match has been found.

Resources