I am fairly new to regexes, so I wrote the following simple regex, using positive lookahead, that detects functions and function calls in a C source file:
\w+(?=\s*\()
It works fine, but the problem is that it also detects non-function syntax such as if(), while() etc.
I can easily avoid this by saying:
(if(?!\()) | (while(?!\())
But the problem is how to combine the second regex with the first one. I can't OR them, because the first one still matches if(), while() etc., and in an OR expression it is enough if one of the terms matches.
How can I combine these regexes, or write a simpler one that will not match non-function syntax like if() and while()?
PS: I use the following tools to test my regexes
GSkinner
RegexPal
There are quite a lot of assumptions involved when you search for function calls in C with a regex. That aside, if you are happy with what is matched (there are valid function calls that will not be matched), and you want to exclude if and while from the result list, you can use the following regex:
(?!\b(if|while|for)\b)\b\w+(?=\s*\()
The regex uses the word boundary \b to make sure that the whole name is matched (preventing a partial match of hile in while) and that a name is rejected only when the whole name is a keyword (so whilenothinghappens is not rejected).
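As a rough illustration, the same pattern can be tried in Python, whose re module supports lookahead and \b. This is only a sketch: the function name find_function_names and the sample string are made up for the demo, and the keyword group is written as non-capturing so that findall returns the whole names.

```python
import re

# Same pattern as above; (?:...) is used instead of (...) only so that
# re.findall returns the full match rather than the keyword group.
CALL_RE = re.compile(r"(?!\b(?:if|while|for)\b)\b\w+(?=\s*\()")


def find_function_names(source):
    """Return every identifier that is directly followed by '('."""
    return CALL_RE.findall(source)


# Hypothetical sample input, not taken from the question.
sample = "while (x) { foo(1); if(y) whilenothinghappens (2); }"
print(find_function_names(sample))  # ['foo', 'whilenothinghappens']
```

Other constructs that are followed by a parenthesis, such as switch or sizeof, are not in the keyword list and would still be reported.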
Related
I have the following fragment in my Bison file that describes a simple "while" loop as a condition followed by a sequence of statements. The list of statements is large and includes BREAK and CONTINUE. The latter two can be used only within a loop.
%start statements
%%
statements: | statement statements
statement: loop | BREAK | CONTINUE | WRITE | ...
loop: WHILE condition statements ENDWHILE
condition: ...
%%
I can add a C variable, set it upon entering the loop, reset it upon exiting, and check at BREAK or CONTINUE, but this solution does not look elegant:
loop: WHILE {loop++;} condition statements {loop--;} ENDWHILE
statement: loop | BREAK {if (!loop) yyerror();} ...
Is there a way to prevent the two statements from outside a loop using only Bison rules?
P.S. What I mean is "Is there an EASY way...", without fully duplicating the grammar.
Sure. You just need three different statement non-terminals, one which matches all statements; one which matches everything but continue (for switch blocks), and one which matches everything but break and continue. Of course, this distinction needs to trickle down through your rules. You'll also need three versions of each type of compound statement: loops, conditionals, switch, braced blocks, and so on. Oh, and don't forget that statements can be labelled, so there are some more non-terminals to duplicate.
But yeah, it can certainly be done. The question is, is it worth going to all that trouble. Or, to put it another way, what do you get out of it?
To start with, the end user finds that where they used to have a pretty informative error message about continue statements outside a loop, they now just get a generic Syntax Error. Now, you can fix that with some more grammar modifications, by actually providing productions which match the invalid statements, and then present a meaningful error message. But that's almost exactly the same code already rejected as inelegant.
Other than that, does it in any way reduce parser complexity? It lets you assume that a break statement is legally placed, but you still have to figure out where the break statement's destination is. And other than that, there aren't really a lot of evident advantages, IMHO.
But if you want to do it, go for it.
Once you've done that, you could try modifying your grammar so that break, continue, goto and return cannot be followed by an unlabelled statement. That sounds like a good idea, and some languages do it. It can certainly be done in the grammar. (But before you get too enthusiastic, remember that some programmers do deliberately create dead code during debugging sessions, and they won't thank you for making it impossible.)
There is a BNF extension, used in the ECMAScript standard, amongst others, which parameterizes non-terminals with a list of features, each of which can be present or not. These parameters can then be used in productions, either as conditions or to be passed through to non-terminals on the right-hand side. This could be used to generate three versions of statement, using the features [continue] and [break], which would be used as gates on those respective statement syntaxes, and also passed through to the compound statement non-terminals.
I don't know of a parser generator capable of handling such parameterised rules, so I can't offer it as a concrete suggestion, but this question is one of the use cases which motivated parameterised non-terminals. (In fact, I believe it's one of the uses, but I might be remembering that wrong.)
With an ECMAScript-style formalism, this grammatical restriction could be written without duplicating rules. The duplication would still be there, under the surface, since the parser generator would have to macro expand the templated rules into their various boolean possibilities. But the grammar is a lot more readable and the size of the state machine is not so important these days.
I have no doubt that it would be a useful feature, but I also suspect that it would be overused, with the result that the quality of error messages would be reduced.
As a general rule, compilers should be optimised for correct inputs, with the additional goal of producing helpful error messages for invalid input. Complicating the grammar even a little to make easily described errors into syntax errors does not help with either of these goals. If it's possible to write a few lines of code to produce the correct error message for a detected problem, instead of emitting a generic syntax error, I would definitely do that.
There are (many) other use cases for the ECMAScript BNF extensions. For example they make it much easier to describe a syntax whose naive grammar requires two or three lookahead tokens.
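To make the "few lines of code" point concrete, here is a minimal sketch of the counter-based check written outside of Bison, in Python. It assumes the input has already been reduced to a flat list of token names like the ones in the question (WHILE, ENDWHILE, BREAK, CONTINUE); the function name check_loop_control and the message wording are invented for the example.

```python
def check_loop_control(tokens):
    """Report BREAK/CONTINUE tokens that appear outside any WHILE loop.

    `tokens` is a hypothetical flat list of token names; a real parser
    would do the same bookkeeping inside its semantic actions.
    """
    errors = []
    depth = 0  # number of WHILE ... ENDWHILE blocks currently open
    for index, token in enumerate(tokens):
        if token == "WHILE":
            depth += 1
        elif token == "ENDWHILE":
            depth = max(depth - 1, 0)  # tolerate unbalanced input
        elif token in ("BREAK", "CONTINUE") and depth == 0:
            errors.append(f"token {index}: {token} outside of a loop")
    return errors


print(check_loop_control(["WHILE", "BREAK", "ENDWHILE", "CONTINUE"]))
# ['token 3: CONTINUE outside of a loop']
```

The message names the offending statement, which is the kind of diagnostic a generic syntax error would not give.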
I need to write a regex which will match only lines with the C function call, not its declaration.
So I need it to match only lines where funcName() is not preceded by int, double, float, char etc. and an arbitrary number of spaces.
The problem is that I can run into expressions like the following:
printf("Hello"); int f() {return 1;};
So I must also consider the situation where there are other characters before the data-type name.
myStruct f();
In this situation I want the regex to match; ONLY basic data types should be excluded.
So far I've got to this expression:
^(?!(void|int|double|char))\s*f\(\).*$
But I have no idea how to handle the situation with characters before the type name.
The following regex meets your specs:
(^|((^|\s)(?!(void|int|double|char))[^\s]+)\s+)([a-zA-Z_]+\(\)?)
The function name is defined by a character class containing letters and the underscore.
Either the line starts with the function call, or the line contains at least one non-whitespace character before the function name.
In the latter case, this non-whitespace sequence must not begin with one of the excluded keywords, and there must be at least one whitespace character before the function name.
See the live demo at regex101.
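For a quick local check, the pattern can also be exercised in Python. This is only a sketch under a few assumptions: the helper name called_name and the test lines are invented, the regex is copied unchanged (so float, for instance, is not in the excluded list), and group 5 is taken to be the function name.

```python
import re

# The regex from the answer, unchanged. Group 5 captures the function name
# together with its opening (and optional closing) parenthesis.
CALL_LINE_RE = re.compile(
    r"(^|((^|\s)(?!(void|int|double|char))[^\s]+)\s+)([a-zA-Z_]+\(\)?)"
)


def called_name(line):
    """Return the matched 'name()' text, or None if the line does not match."""
    match = CALL_LINE_RE.search(line)
    return match.group(5) if match else None


print(called_name("f();"))           # f()
print(called_name("myStruct f();"))  # f()
print(called_name("int f();"))       # None
```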
Caveat
As several commenters have noted, this is not a robust solution. It will work for a tightly constrained set of function call and declaration patterns only.
A general regex-based solution (if possible at all, which would depend heavily on the regex engine features available) would be of theoretical interest only, as it would have to mimic the C preprocessor completely.
I have to make a lexer for a language that has (among other things) lists of the form [1,2,3] for example or ['c','s','q','t'].
I don't really understand whether I need to match the list at the lexing stage. So, for example would
2:[1,2,3];
be
NUM(2) COLON LSQBRACKET NUM(1) COMMA NUM(2) COMMA NUM(3) RSQBRACKET SEMI
or
NUM(2) COLON LIST([1,2,3]) SEMI
Thanks for any help.
Technically, it's up to you. If you only ever have to match very simple list literals, then maybe you can get away with treating them kind of like string literals. (But, that's not likely to be a good approach).
You generally want the lexer to output a series of simple tokens. The lexer should be relatively simple -- one rule of thumb is that it should never require recursion.
So, for example, requiring it to output a "LIST" token would be counterproductive -- the lexer would have to recurse on nested lists, meaning that it would implement a mini-parser. Leave that job to the parser.
The first case makes for a simpler lexer, which is still useful to a later-stage parser.
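A minimal sketch of such a flat lexer in Python, assuming only the token kinds that appear in the question plus a single-quoted character literal. The names TOKEN_SPEC, MASTER_RE and tokenize, and the CHAR token, are illustrative choices rather than part of any particular tool.

```python
import re

# Hypothetical token table: each entry is (token name, regex).
TOKEN_SPEC = [
    ("NUM", r"\d+"),
    ("CHAR", r"'[^'\\]'"),
    ("COLON", r":"),
    ("LSQBRACKET", r"\["),
    ("RSQBRACKET", r"\]"),
    ("COMMA", r","),
    ("SEMI", r";"),
    ("SKIP", r"\s+"),
]
MASTER_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC)
)


def tokenize(text):
    """Turn text into a flat list of (kind, lexeme) pairs, with no recursion."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER_RE.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        if match.lastgroup != "SKIP":  # whitespace is dropped
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


print([kind for kind, _ in tokenize("2:[1,2,3];")])
# ['NUM', 'COLON', 'LSQBRACKET', 'NUM', 'COMMA', 'NUM', 'COMMA', 'NUM', 'RSQBRACKET', 'SEMI']
```

Nested input such as [[1],[2]] goes through the same loop unchanged; deciding whether the brackets balance is left to the parser.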
I'm trying to regexp match a C function, e.g.
func(blah blah);
The match can include newlines.
I've tried:
func([.+]);
which didn't do newlines, and:
func([...]);
func([^...]);
neither of which seemed to do anything. I guess I'm looking for the part of a regexp that will match any number/type of characters between my opening func( and );.
You could try func[[:space:]]*([^)]*). Nested parens in calls will confuse it though.
I think that the general case is not feasible with regular expressions, because nested function calls do not form a regular language.
While Maxim's answer is specific, I'm going to guess you are looking to do something with the matched function you found. To do serious code processing, you can't beat the semantic parser that is part of CEDET's suite of tools. CEDET (http://cedet.sf.net) is also part of Emacs.
If you use the semantic parser in Emacs, you can:
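The limitation is easy to see outside of Emacs as well. The sketch below, in Python rather than Emacs Lisp, compares a Python translation of the simple pattern with a small depth counter that finds the matching parenthesis. The names SIMPLE_CALL_RE and find_call_end are made up for the example, and the counter ignores string literals and comments, so it is an illustration and not a C parser.

```python
import re

# Python spelling of the simple pattern: stops at the first ')'.
SIMPLE_CALL_RE = re.compile(r"func\s*\([^)]*\)")


def find_call_end(text, open_index):
    """Return the index of the ')' matching the '(' at open_index, or -1."""
    depth = 0
    for index in range(open_index, len(text)):
        if text[index] == "(":
            depth += 1
        elif text[index] == ")":
            depth -= 1
            if depth == 0:
                return index
    return -1  # unbalanced input


text = "func(a, g(b,\n c));"
print(repr(SIMPLE_CALL_RE.search(text).group()))  # 'func(a, g(b,\n c)'
end = find_call_end(text, text.index("("))
print(repr(text[: end + 1]))                      # 'func(a, g(b,\n c))'
```

The regex version cuts the call short at the inner closing parenthesis, while the counter keeps going until the nesting depth returns to zero.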
M-x semantic-mode RET
and then in code:
(semantic-fetch-tags)
or
(semantic-current-tag)
to get the current tag. Once you have the tag, you can call:
(semantic-tag-function-arguments mytag)
to get the arguments, which are tags. For one of those, use semantic-tag-name to get the name, or semantic-tag-type to get the data type.
Once you've got your tag data, you can always write out new code with SRecode, which is a code generator which will take in tags, and spit out code, such as function declarations.
On my OS X 10.5.8 machine, using the regcomp and regexec C functions to match the extended regex "(()|abc)xyz", I find a match for the string "abcxyz" but only from offset 3 to offset 6. My expectation was that the entire string would be matched and that I would see a submatch for the initial "abc" part of the string.
When I try the same pattern and text with awk on the same machine, it shows a match for the entire string as I would expect.
I expect that my limited experience with regular expressions may be the problem. Can somebody explain what is going on? Is my regular expression valid? If so, why doesn't it match the entire string?
I understand that "((abc){0,1})xyz" could be used as an alternative, but the pattern of interest is being automatically generated from another pattern format and eliminating instances of "()" is extra work I'd like to avoid if possible.
For reference, the flags I'm passing to regcomp consist only of REG_EXTENDED. I pass an empty set of flags (0) to regexec.
The POSIX standard says:
9.4.3 ERE Special Characters
An ERE special character has special properties in certain contexts. Outside those contexts, or when preceded by a <backslash>, such a character shall be an ERE that matches the special character itself. The extended regular expression special characters and the contexts in which they shall have their special meaning are as follows:
.[\(
The <period>, <left-square-bracket>, <backslash>, and <left-parenthesis> shall be special except when used in a bracket expression (see RE Bracket Expression ). Outside a bracket expression, a <left-parenthesis> immediately followed by a <right-parenthesis> produces undefined results.
What you are seeing is the result of invoking undefined behaviour - anything goes.
If you want reliable, portable results, you will have to eliminate the empty '()' notations.
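Since the pattern in the question is generated automatically, one low-effort step is to detect the problematic construct before calling regcomp. The following is a rough sketch in Python of such a check. The function names are invented, and the scanner makes simplifying assumptions about ERE syntax: a backslash escapes the next character, and bracket expressions, including [: :], [. .] and [= =] elements, are skipped.

```python
def _skip_bracket_expression(ere, start):
    """Return the index just past the bracket expression opening at `start`."""
    i, n = start + 1, len(ere)
    if i < n and ere[i] == "^":
        i += 1
    if i < n and ere[i] == "]":  # a leading ']' is an ordinary character
        i += 1
    while i < n and ere[i] != "]":
        if ere[i] == "[" and i + 1 < n and ere[i + 1] in ":.=":
            close = ere.find(ere[i + 1] + "]", i + 2)
            if close == -1:
                return n  # malformed; give up scanning
            i = close + 2
        else:
            i += 1
    return i + 1


def has_empty_group(ere):
    """True if the ERE contains '()' outside escapes and bracket expressions."""
    i, n = 0, len(ere)
    while i < n:
        if ere[i] == "\\":
            i += 2  # skip the escaped character
        elif ere[i] == "[":
            i = _skip_bracket_expression(ere, i)
        elif ere.startswith("()", i):
            return True
        else:
            i += 1
    return False


print(has_empty_group("(()|abc)xyz"))      # True
print(has_empty_group("((abc){0,1})xyz"))  # False
```

A generator could use a check like this to decide when to fall back to the ((abc){0,1}) form mentioned in the question.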
If you iterate over all matches, and don't get both [3,6) and [0,6), then there's a bug. I'm not sure what POSIX mandates as far as the order in which matches are returned.
Try (abc|())xyz - I bet it'll produce the same result in both places. I can only presume that the C version is trying to match xyz wherever it can, and if that fails, it tries to match abcxyz wherever it can (but, as you see, it doesn't fail, so we never bother with the "abc" part), whereas awk must be using its own regex engine that performs the way you expect.
Your regex is valid. I think the problem is either a) POSIX isn't very clear about how the regex should work, or b) awk isn't using 100% POSIX-compliant regexes (probably because it appears OS X ships with a more original version of awk). Whichever it is, the problem probably arises because this is somewhat of an edge case and most people wouldn't write the regex that way.