I'm writing a simple shell in C under Linux. I'm trying to parse user input with POSIX regex with group capturing. My problem is that I don't want to capture all the groups, but the ?: syntax doesn't seem to work for me.
"^(?:[A-Za-z0-9]+)( [A-Za-z0-9]*(?:\"[^\"]*\")*(?:\'[^\']*\')*[A-Za-z0-9]*)*&?$"
The use of (?:..), or any other grouping prefix, is not allowed in POSIX Regular Expressions.
There are tools to make languages, lex & yacc for example, and a simplified yacc grammar for POSIX shells is provided by the standard.
The character sequence (? is undefined as per section 9.4.3 ERE Special Characters:
*+?{
The <asterisk>, <plus-sign>, <question-mark>, and <left-brace> shall be special except when used in a bracket expression (see RE Bracket Expression). Any of the following uses produce undefined results:
If these characters appear first in an ERE, or immediately following an unescaped <vertical-line>, <circumflex>, <dollar-sign>, or <left-parenthesis>
If a <left-brace> is not part of a valid interval expression (see EREs Matching Multiple Characters)
A POSIX RE implementation has a few choices for how to handle undefined syntax. Those choices include enabling an extended syntax as per section 9.1 Regular Expression Definitions. So it's free to implement the non-capturing group syntax:
[...] violations of the specified syntax or semantics for REs produce
undefined results: this may entail an error, enabling an extended
syntax for that RE, or using the construct in error as literal
characters to be matched.
If you'd like to see the feature as part of a future POSIX standard, you could open an issue on the standard's issue tracker.
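Until then, one portable workaround is to leave every group capturing and simply not read the pmatch entries you don't care about. A minimal sketch, assuming a much simplified pattern (one command word followed by one argument word) rather than the full pattern from the question; the helper name argument_span is hypothetical:

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: matches "<command> <argument>" with a POSIX ERE.
 * Group 1 (the command) is captured only because POSIX ERE has no
 * non-capturing groups; this function just never reads pm[1].
 * On a match, stores the argument's [start, end) offsets and returns 0.
 * Returns -1 if the pattern fails to compile or the line does not match. */
int argument_span(const char *line, int *start, int *end)
{
    regex_t re;
    regmatch_t pm[3];   /* pm[0] = whole match, pm[1] = command, pm[2] = argument */
    int rc;

    assert(line != NULL && start != NULL && end != NULL);

    if (regcomp(&re, "^([A-Za-z0-9]+) ([A-Za-z0-9]+)$", REG_EXTENDED) != 0)
        return -1;

    rc = regexec(&re, line, 3, pm, 0);
    regfree(&re);
    if (rc != 0)
        return -1;

    *start = (int)pm[2].rm_so;
    *end = (int)pm[2].rm_eo;
    return 0;
}
```

Calling argument_span("ls docs", &s, &e) stores the offsets of "docs"; the unused command group costs one regmatch_t slot and nothing else.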
I'm trying to study C grammar with flex/bison.
I found that bison cannot parse this grammar: https://www.lysator.liu.se/c/ANSI-C-grammar-y.html, because the LALR algorithm cannot recursively process multiple expressions.
Is GLR algorithm a must for C grammar?
There is nothing wrong with that grammar except:
it represents a very old version of C
it requires a lexical analyser which can somehow distinguish between IDENTIFIER and TYPE_NAME
it does not even attempt to handle the preprocessor phases
Also, it has one shift/reduce conflict as a result of the "dangling else" ambiguity. However, that conflict can be ignored because bison's conflict resolution algorithm produces the correct result in this case. (You can suppress the warning either with an %expect directive or by including a precedence declaration which favours shifting else over reducing if. Or you can eliminate the ambiguity in the grammar using the technique described in the Wikipedia article on the dangling else. Note: I'm not talking about copy-and-pasting code from the Wikipedia page. In the case of C, you need to consider all cases of compound statements which terminate with an if statement.)
Moreover, an LR parser is not recursive, and it has no problems which could be described as a failure to "process recursively multiple expressions". (You might have that problem with a recursive descent parser, although it's pretty easy to work around the issue.)
So any problems you might have experienced (if your question refers to a concrete issue) have nothing to do with what's described in your question.
Of the problems I listed above, the most troubling is the syntactic ambiguity of the cast operator. The cast operator is not actually ambiguous; clearly, C compilers manage to compile such expressions correctly. But distinguishing between the two possible parses of, for example, (x)-y*z requires knowing whether x names a type or a variable.
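To make the two readings concrete, here is a small sketch in which the same expression text appears twice, once in a scope where x names a type and once where x names a variable. The function names are purely illustrative:

```c
/* Here x is a typedef name, so (x) is a cast applied to -y,
 * and the result is then multiplied by z: ((int)(-2)) * 3. */
int parsed_as_cast(void)
{
    typedef int x;
    int y = 2, z = 3;
    return (x)-y*z;
}

/* Here x is a variable, so (x) is a parenthesised operand and the
 * expression is a subtraction: 10 - (2 * 3). */
int parsed_as_subtraction(void)
{
    int x = 10, y = 2, z = 3;
    return (x)-y*z;
}
```

The first function returns -6 and the second returns 4, even though the token sequence of the return expression is identical in both.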
In C, all names are lexically scoped, so it is certainly possible to resolve x at compile time. But the resolution is not context-free. Since GLR is also a technique for parsing context-free grammars, using a GLR parser won't directly help you. It might be useful in the sense that GLR parsers can theoretically produce "parse forests" rather than parse trees; that is, the output of a GLR parser might effectively contain all possible correct parses, leaving the possibility to resolve the ambiguity by building symbol tables for each scope and then choosing between alternative parses by examining the name binding in effect at each site. (This works because type alias declarations -- "typedefs" -- are not ambiguous, so all the potential parses will have the same alias declarations.)
The usual solution, though, is to parse the program text using a deterministic parser, maintaining a symbol table during the parse, and giving the lexical analyser access to this symbol table so that it can distinguish between IDENTIFIER and TYPE_NAME, as expected by the grammar you link. This technique is politely called "lexical feedback", although it's also often called "the lexer hack".
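The core of that technique can be sketched as a lookup the lexer performs on every name it scans. This is only an illustration: the token codes, the fixed-size table and both function names are assumptions, and it ignores block scopes entirely, which a real C front end must track so that an inner declaration can hide an outer typedef.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token codes; a bison-generated header would normally define them. */
enum token { IDENTIFIER, TYPE_NAME };

#define MAX_TYPEDEFS 64

/* Names declared with typedef so far (single flat scope for simplicity). */
static const char *typedef_names[MAX_TYPEDEFS];
static int typedef_count = 0;

/* Called from the parser action that handles a typedef declaration.
 * The caller must keep the string alive. Returns -1 if the table is full. */
int declare_typedef(const char *name)
{
    assert(name != NULL);
    if (typedef_count == MAX_TYPEDEFS)
        return -1;
    typedef_names[typedef_count++] = name;
    return 0;
}

/* Called from the lexer: decides which token to return for a scanned name. */
enum token classify_name(const char *name)
{
    assert(name != NULL);
    for (int i = 0; i < typedef_count; i++)
        if (strcmp(typedef_names[i], name) == 0)
            return TYPE_NAME;
    return IDENTIFIER;
}
```

The feedback loop is that declare_typedef runs inside the parser while classify_name runs inside the lexer, so the lexer's output depends on what the parser has already seen.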
I was looking at the runtime.c file in the go runtime at
/usr/local/go/src/pkg/runtime
and saw the following function definitions:
void
runtime∕pprof·runtime_cyclesPerSecond(int64 res)
{...}
and
int64
runtime·tickspersecond(void)
{...}
and there are a lot of declarations like
void runtime·hashinit(void);
in the runtime.h.
I haven't seen this C syntax before (especially the one with the slash seems odd).
Is this part of std C or some plan9 dialect?
It's Go's special internal syntax for Go package paths. For example,
runtime∕pprof·runtime_cyclesPerSecond
is function runtime_cyclesPerSecond in package path runtime∕pprof.
The '∕' character is the Unicode division slash character, which separates path elements. The '·' character is the Unicode middle dot character, which separates the package path and the function.
∕ and · and friends are merely random Unicode characters that someone decided to put in function names. Obscure Unicode characters (edit: that are listed in Annex D of the C99 standard (pages 452-453 of this PDF); see also here) are just as legal in C identifiers as A or 7 (in your average Unicode-capable compiler, anyway).
Char | Hex    | Octal  | Decimal | Windows Alt-code
-----+--------+--------+---------+-----------------
∕    | 0x2215 | 021025 |    8725 | (none)
·    |   0xB7 |   0267 |     183 | Alt+0183
Putting characters that look like operators but aren't (U+2215 ∕, in particular, resembles U+2F / (division) far too closely) in function names can be a confusing practice, so I would personally advise against it. Obviously someone on the Go team decided that whatever reasons they had for including them in function names outweighed the potential for confusion.
(Edit: It should be noted that U+2215 ∕ isn't expressly permitted by Annex D. As discussed here, this may be an extension.)
I know there are several language extensions added in the GNU C compiler (aka gcc).
I can read something about that here.
What I'm looking for is deeper and wider documentation about those topics.
For example I'd like to read more about _Static_assert(), typeof and the likes.
Maybe it's just my fault, but I cannot find such an official documentation. Any hint? TIA!
The answer is http://gcc.gnu.org/onlinedocs/gcc/C-Extensions.html. You're not finding anything about static assertions there because _Static_assert is not an extension of the C language: as of C11 it is a core, standardized part of the language, described in the international standard. In this case, refer to the C specification:
http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf
See section 6.7.10 Static assertions, in particular paragraph 3:
"The constant expression shall be an integer constant expression. If the value of the
constant expression compares unequal to 0, the declaration has no effect. Otherwise, the
constraint is violated and the implementation shall produce a diagnostic message that
includes the text of the string literal, except that characters not in the basic source
character set are not required to appear in the message."
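As a small illustration of that paragraph, the sketch below uses _Static_assert at file scope and inside a function. The struct and function names are hypothetical, and the asserted conditions (8-bit bytes, no padding between two unsigned char members) are assumptions that hold on common platforms but are not guaranteed by the standard:

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical two-byte record header. */
struct record_header {
    unsigned char tag;
    unsigned char length;
};

/* File-scope static assertions: evaluated at compile time.
 * If a condition is false, compilation stops with the given message. */
_Static_assert(CHAR_BIT == 8, "this code assumes 8-bit bytes");
_Static_assert(sizeof(struct record_header) == 2,
               "record_header must not contain padding");

size_t record_header_size(void)
{
    /* A static assertion is a declaration, so it may also appear in a block. */
    _Static_assert(sizeof(size_t) >= sizeof(unsigned char),
                   "size_t must be able to hold a byte count");
    return sizeof(struct record_header);
}
```

Changing one of the conditions to something false (for example, == 3 for the struct size) makes the compiler emit a diagnostic that includes the string literal.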
Here: http://gcc.gnu.org/onlinedocs/gcc/C-Extensions.html.
Use Google to search inside gnu.org. Found it by typing this search in Google: c extensions site:gnu.org.
I'm looking for a way to write identifier names containing characters like [ ' " or #.
Every time I try to do that, I get the error:
error: macro names must be identifiers
But learning about gcc, I found this option:
-fextended-identifiers
But it doesn't seem to work the way I wanted. Does anybody know how to accomplish that?
Identifiers can't include such characters. The language syntax defines identifiers as sequences of letters, digits and underscores (and they mustn't begin with a digit, to avoid ambiguity with numeric literals).
If it were possible, it would conflict with the C compiler syntax (which uses [ for arrays) and the C preprocessor syntax (which uses #). The extended identifiers extension only allows characters that the language syntax does not forbid inside identifiers (basically Unicode letters from other scripts, etc.).
But if you really, really want to do this, nothing forbids you from preprocessing your source files with your own "extended macro preprocessor", practically creating a new "C-like" language. That looks like a terrible idea, but it's not really hard to do. Then you'll see soon enough for yourself why it's not a good idea...
According to this link, -fextended-identifiers only enables UTF-8 support for identifiers, so it won't help in your case.
So the answer is: you can't use such characters in macro identifiers.
Even if the extended identifier characters support was fully enabled, it wouldn't help you get characters such as:
[ ' " #
enabled for identifiers. The standard allows 'universal character names' or 'other implementation-defined characters' to be part of an identifier, but they cannot be part of the basic character set. Out of the basic character set, only _, letters and digits can be part of an identifier name (6.4.2.1 Identifiers/General).
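The basic-character-set rule above is simple enough to check mechanically. The following is a rough sketch only: the function name is hypothetical, it assumes an ASCII-compatible execution character set (so that 'a'..'z' and 'A'..'Z' are contiguous), and it deliberately ignores universal character names, implementation-defined characters and reserved keywords.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumes an ASCII-compatible character set. */
static bool is_letter_or_underscore(char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_';
}

static bool is_digit(char c)
{
    return c >= '0' && c <= '9';
}

/* Returns true if s consists only of letters, digits and underscores
 * and does not start with a digit. Empty strings are rejected. */
bool is_basic_identifier(const char *s)
{
    if (s == NULL || !is_letter_or_underscore(s[0]))
        return false;
    for (size_t i = 1; s[i] != '\0'; i++) {
        if (!is_letter_or_underscore(s[i]) && !is_digit(s[i]))
            return false;
    }
    return true;
}
```

With this check, names such as foo_1 pass, while 1foo, a[b and x#y are rejected, which matches the compiler's "macro names must be identifiers" complaint.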
On my OS X 10.5.8 machine, using the regcomp and regexec C functions to match the extended regex "(()|abc)xyz", I find a match for the string "abcxyz" but only from offset 3 to offset 6. My expectation was that the entire string would be matched and that I would see a submatch for the initial "abc" part of the string.
When I try the same pattern and text with awk on the same machine, it shows a match for the entire string as I would expect.
I expect that my limited experience with regular expressions may be the problem. Can somebody explain what is going on? Is my regular expression valid? If so, why doesn't it match the entire string?
I understand that "((abc){0,1})xyz" could be used as an alternative, but the pattern of interest is being automatically generated from another pattern format and eliminating instances of "()" is extra work I'd like to avoid if possible.
For reference, the flags I'm passing to regcomp consist only of REG_EXTENDED. I pass an empty set of flags (0) to regexec.
The POSIX standard says:
9.4.3 ERE Special Characters
An ERE special character has special properties in certain contexts. Outside those contexts, or when preceded by a <backslash>, such a character shall be an ERE that matches the special character itself. The extended regular expression special characters and the contexts in which they shall have their special meaning are as follows:
.[\(
The <period>, <left-square-bracket>, <backslash>, and <left-parenthesis> shall be special except when used in a bracket expression (see RE Bracket Expression ). Outside a bracket expression, a <left-parenthesis> immediately followed by a <right-parenthesis> produces undefined results.
What you are seeing is the result of invoking undefined behaviour - anything goes.
If you want reliable, portable results, you will have to eliminate the empty '()' notations.
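For instance, the empty alternative can be expressed with an optional group instead. A minimal sketch, assuming the pattern "(abc)?xyz" as a stand-in for the generated one; the helper name is hypothetical:

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Matches text against the ERE "(abc)?xyz", which avoids the undefined "()".
 * On success returns 0 and stores the whole-match offsets in *whole and the
 * optional group's offsets in *group (rm_so is -1 if the group did not
 * participate). Returns -1 on compile failure or no match. */
int match_optional_prefix(const char *text, regmatch_t *whole, regmatch_t *group)
{
    regex_t re;
    regmatch_t pm[2];
    int rc;

    assert(text != NULL && whole != NULL && group != NULL);

    if (regcomp(&re, "(abc)?xyz", REG_EXTENDED) != 0)
        return -1;

    rc = regexec(&re, text, 2, pm, 0);
    regfree(&re);
    if (rc != 0)
        return -1;

    *whole = pm[0];
    *group = pm[1];
    return 0;
}
```

For the input "abcxyz" this reports the whole match as [0,6) with the group at [0,3), which is the result the question expected from the original pattern.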
If you iterate over all matches, and don't get both [3,6) and [0,6), then there's a bug. I'm not sure what POSIX mandates as far as the order in which matches are returned.
Try (abc|())xyz - I bet it'll produce the same result in both places. I can only presume that the C version is trying to match xyz wherever it can, and if that fails, it tries to match abcxyz wherever it can (but, as you see, it doesn't fail, so we never bother with the "abc" part), whereas awk must be using its own regex engine that performs the way you expect.
Your regex is valid. I think the problem is either a) POSIX isn't very clear about how the regex should work, or b) awk isn't using 100% POSIX-compliant regexes (probably because it appears OS X ships with a more original version of awk). Whichever problem it is, it's probably caused because this is somewhat of an edge case and most people wouldn't write the regex that way.