What does "[^<]" in a sscanf evaluation expression mean? - c

Googling "sscanf reference" or various other sscanf search terms brings up plenty of references on the C sscanf function. Many of these references explain the tokens that can be used in a format string.
I'm looking through another developer's code (he no longer works with our company) and I see that his format strings contain multiple tokens that look like %15[^<]. I know that the %15 portion of the token takes 15 characters from the input string and stores them through a string pointer. What I cannot find is documentation that explains the function of the [^<] part.
I've looked through multiple reference pages and I can't find a reference to a token like this. Maybe I'm just clicking on the wrong links, but what does this mean? Furthermore, is sscanf (and other cstdio functions with format strings) more capable than what traditional documentation outlines? If so, does anyone have a link to more thorough documentation?
Thanks guys.

The reference you should be checking is the ISO standard, particularly section 7.19.6.2.
For your particular case, it's matching up to fifteen characters that aren't <, stopping early at the first <.

The first reference I found (here) sure includes it. Search the page for [.
Brackets are used, much like in regular expressions, to express groups of characters that match. If the first character is a ^, the group is inverted. So [^<] will match any character except for the less-than symbol.

That token matches everything but "<". Is it some kind of XML or HTML parser perhaps?
If you're on a platform other than Windows, it's probably in the man page for scanf. Otherwise, I advise you to install cygwin. :)

Related

Is Form Feed character (FF) valid in MISRA C2 standard

Opening some legacy code in Notepad++ and notice a few occurrences of FF character below function comment headers. They are ASCII code 12 which is the Form Feed character. Are FF characters valid in MISRA C2 standard please? Apologies I don't have access to PC-Lint/QAC checker.
You appear to be talking about a commercial product, whose announcements give no useful information, e.g., this press release.
Form feed is stated explicitly to be part of the character set in ISO/IEC 9899:199 (E) 5.2.1 Character sets. If the tool advised you not to use a documented, standard feature, that would be a defect in the tool itself. A comparable issue would be whether to allow tab characters in leading whitespace on a line.
Given that context, the use of form feed characters is a stylistic issue unrelated to static analysis, and I would not expect the two to be confused in a commercial product.
MISRA-C:2004 3.2 merely states that the character set and the corresponding encoding should be documented (for example by a reference to the relevant ISO standard). You are only allowed to use character constants and string literals that exist in that standard.
But there is no such requirement on source code comments.

The syntax and semantic of the Go compiler runtime

I was looking at the runtime.c file in the go runtime at
/usr/local/go/src/pkg/runtime
and saw the following function definitions:
void
runtime∕pprof·runtime_cyclesPerSecond(int64 res)
{...}
and
int64
runtime·tickspersecond(void)
{...}
and there are a lot of declarations like
void runtime·hashinit(void);
in the runtime.h.
I haven't seen this C syntax before (especially the one with the slash seems odd).
Is this part of std C or some plan9 dialect?
It's Go's special internal syntax for Go package paths. For example,
runtime∕pprof·runtime_cyclesPerSecond
is function runtime_cyclesPerSecond in package path runtime∕pprof.
The '∕' character is the Unicode division slash character, which separates path elements. The '·' character is the Unicode middle dot character, which separates the package path and the function.
∕ and · and friends are merely random Unicode characters that someone decided to put in function names. Obscure Unicode characters (edit: that are listed in Annex D of the C99 standard (pages 452-453 of this PDF); see also here) are just as legal in C identifiers as A or 7 (in your average Unicode-capable compiler, anyway).
Char | Hex    | Octal  | Decimal | Windows Alt-code
-----+--------+--------+---------+-----------------
∕    | 0x2215 | 021025 | 8725    | (null)
·    | 0xB7   | 0267   | 183     | Alt+0183
Putting characters that look like operators but aren't (U+2215 ∕, in particular, resembles U+2F / (division) far too closely) in function names can be a confusing practice, so I would personally advise against it. Obviously someone on the Go team decided that whatever reasons they had for including them in function names outweighed the potential for confusion.
(Edit: It should be noted that U+2215 ∕ isn't expressly permitted by Annex D. As discussed here, this may be an extension.)

What is the meaning of an interpunct (·) in C?

I've seen this in many popular C projects, e.g. the Go language, and nowhere can I find any information about it. I think it is a kind of namespacing, but I thought C doesn't support that.
e.g.
void runtime·memhash(uintptr*, uintptr, void*);
Thanks.
· is not a part of the "basic execution character set", and thus is not a standard C operator.
However, it does appear that the C standard allows it as an implementation-defined identifier character. It has no special meaning; it's just another character.

Flex default rule

How do I customize the default action for flex? I found something like <*> but when I run it, it says "flex scanner jammed". Also, the . rule only adds a rule, so it does not work either. What I want is
comment "/*"[^"*/"]*"*/"
%%
{comment} return 1;
{default} return 0;
<<EOF>> return -1;
Is it possible to change the behavior of matching longest to match first? If so I would do something like this
default (.|\n)*
but because this almost always gives a longer match it will hide the comment rule.
EDIT
I found the {-} operator in the manual, however this example straight from the manual gives me "unrecognized rule":
[a-c]{-}[b-z]
The flex default rule matches a single character and prints it on standard output. If you don't want that action, write an explicit rule which matches a single character and does something else.
The pattern (.|\n)* matches the entire input file as a single token, so that is a very bad idea. You're thinking that the default should be a long match, but in fact you want that to be as short as possible (but not empty).
The purpose of the default rule is to do something when there is no match for any of the tokens in the input language. When lex is used for tokenizing a language, such a situation is almost always erroneous because it means that the input begins with a character which is not the start of any valid token of the language.
Thus, a "catch any character" rule is coded as a form of error recovery. The idea is to discard the bad character (just one) and try tokenizing from the character after that one. This is only a guess, but it's a good guess because it's based on what is known: namely that there is one bad character in the input.
The recovery rule can be wrong. For instance suppose that no token of the language begins with #, and the programmer wanted to write the string literal "#abc". Only, she forgot the opening " and wrote #abc". The right fix is to insert the missing ", not to discard the #. But that would require a much more clever set of rules in the lexer.
Anyway, usually when discarding a bad character, you want to issue an error message for this case like "skipping invalid character '~' in line 42, column 3".
The default rule/action of copying the unmatched character to standard output is useful when lex is used for text filtering. The default rule then brings about the semantics of a regex search (as opposed to a regex match): the idea is to search the input for matches of the lexer's token-recognizing state machine, while printing all material that is skipped by that search.
So for instance, a lex specification containing just the rule:
"foo" { printf("bar"); }
will implement the equivalent of
sed -e 's/foo/bar/g'
I solved the problem manually instead of trying to match the complement of a rule. This works fine because the matching pattern involved in this case is quite simple.
Why does adding "." not do the trick? You can't perform an action in the absence of a match. flex won't do anything if there is no match, so to add a "default" rule, just make it match something.
<*>.|\n /* default action here */
Using this at the end of the rules section catches the default case across all start conditions. It's useful to find out where there may be holes.
What I don't know (and would like to know) is how to get flex to report where the default rule match has been found.

expected behavior of posix extended regex: (()|abc)xyz

On my OS X 10.5.8 machine, using the regcomp and regexec C functions to match the extended regex "(()|abc)xyz", I find a match for the string "abcxyz" but only from offset 3 to offset 6. My expectation was that the entire string would be matched and that I would see a submatch for the initial "abc" part of the string.
When I try the same pattern and text with awk on the same machine, it shows a match for the entire string as I would expect.
I expect that my limited experience with regular expressions may be the problem. Can somebody explain what is going on? Is my regular expression valid? If so, why doesn't it match the entire string?
I understand that "((abc){0,1})xyz" could be used as an alternative, but the pattern of interest is being automatically generated from another pattern format and eliminating instances of "()" is extra work I'd like to avoid if possible.
For reference, the flags I'm passing to regcomp consist only of REG_EXTENDED. I pass an empty set of flags (0) to regexec.
The POSIX standard says:
9.4.3 ERE Special Characters
An ERE special character has special properties in certain contexts. Outside those contexts, or when preceded by a <backslash>, such a character shall be an ERE that matches the special character itself. The extended regular expression special characters and the contexts in which they shall have their special meaning are as follows:
.[\(
The <period>, <left-square-bracket>, <backslash>, and <left-parenthesis> shall be special except when used in a bracket expression (see RE Bracket Expression ). Outside a bracket expression, a <left-parenthesis> immediately followed by a <right-parenthesis> produces undefined results.
What you are seeing is the result of invoking undefined behaviour - anything goes.
If you want reliable, portable results, you will have to eliminate the empty '()' notations.
If you iterate over all matches, and don't get both [3,6) and [0,6), then there's a bug. I'm not sure what posix mandates as far as order in which matches are returned.
Try (abc|())xyz - I bet it'll produce the same result in both places. I can only presume that the C version is trying to match xyz wherever it can, and if that fails, it tries to match abcxyz wherever it can (but, as you see, it doesn't fail, so we never bother with the "abc" part), whereas awk must be using its own regex engine that performs the way you expect.
Your regex is valid. I think the problem is either a) POSIX isn't very clear about how the regex should work, or b) awk isn't using 100% POSIX-compliant regexes (probably because it appears OS X ships with a more original version of awk). Whichever problem it is, it's probably caused because this is somewhat of an edge case and most people wouldn't write the regex that way.
