What does %[^<] (and friends) mean in the formatted string family? - c

A comment (which should probably be submitted as an answer) has the code
sscanf(string, "<title>%[^<]</title>", extracted_string);
Running the code seems to copy the text between the <title> tags to extracted_string, but I cannot find any references to a caret in the printf family, either in the man pages or elsewhere online.
Can someone point me to a resource that explains the use of %[^<], and other similar syntax, in the sscanf() family?

From the C11 standard document, chapter §7.21.6.2, Paragraph 12, conversion specifiers, (emphasis mine)
[
Matches a nonempty sequence of characters from a set of expected characters
(the scanset).
....
The conversion specifier includes all subsequent characters in the format
string, up to and including the matching right bracket (]). The characters
between the brackets (the scanlist) compose the scanset, unless the character
after the left bracket is a circumflex (^), in which case the scanset contains all characters that do not appear in the scanlist between the circumflex and the
right bracket.
A draft version of the standard, found online.

It means "match anything that is not a <". It is not a good idea to do that without specifying the maximum destination buffer length. If your destination buffer can hold, say, 100 characters, then
char extracted_string[100];
sscanf(string, "<title>%99[^<]</title>", extracted_string);
would be a better solution.
Using strstr() for this purpose allows you to actually make extracted_string dynamic.

This link explains the [ and ^ usage in the scanf family of functions
(emphasis mine)
http://www.cdf.toronto.edu/~ajr/209/notes/printf.html
[
Matches a nonempty sequence of characters from the specified set of accepted characters; the next pointer must be a pointer to char, and there must be enough room for all the characters in the string, plus a terminating null byte. The usual skip of leading white space is suppressed. The string is to be made up of characters in (or not in) a particular set; the set is defined by the characters between the open bracket [ character and a close bracket ] character. The set excludes those characters if the first character after the open bracket is a circumflex (^). To include a close bracket in the set, make it the first character after the open bracket or the circumflex; any other position will end the set. The hyphen character - is also special; when placed between two other characters, it adds all intervening characters to the set. To include a hyphen, make it the last character before the final close bracket. For instance, [^]0-9-] means the set "everything except close bracket, zero through nine, and hyphen". The string ends with the appearance of a character not in the (or, with a circumflex, in) set or when the field width runs out.

Related

C preprocessor: line continuation: why exactly comment is not allowed after backslash character ('\')?

Valid code:
#define M xxx\
yyy
Not valid code:
#define M xxx\/*comment*/
yyy
#define M xxx\//comment
yyy
Questions:
Why is a comment not allowed after the backslash character (\)?
What does the standard say?
UPD.
Extra question:
What is the motivation / reason / argumentation behind the requirement that (in order to achieve splicing of physical source lines) the backslash character (\) must be immediately followed by a new-line character? What is the obstacle to allowing comments (or spaces) after the backslash character (\)?
Lines are spliced together only if a backslash character is the last character on a line. C 2018 5.1.1.2 specifies phases of translating a C program. In phase 2:
Each instance of a backslash character (\) immediately followed by a new-line character is deleted, splicing physical source lines to form logical source lines…
If a comment follows a backslash character, the backslash character is not followed by a new-line character, so no splicing is performed. Comments are processed in phase 3:
The source file is decomposed into preprocessing tokens and sequences of white-space characters (including comments)… Each comment is replaced by one space character…
Regarding the added question:
What is the motivation / reason / argumentation behind the requirement that (in order to achieve splicing of physical source lines) the backslash character (\) must be immediately followed by a new-line character? What is the obstacle to allowing comments (or spaces) after the backslash character (\)?
The earliest processing in compiling a C program is the simplest. Early C compilers may have been implemented as layers of simple filters: First local-environment characters or methods of file storage would be translated to a simple stream of characters, then lines would be spliced together (perhaps dealing with a problem of wanting a long source line while having to type your source code on 80-column punched cards), then comments would be removed, and so on.
Splicing together lines marked by a backslash at the end of a line is easy; it only requires looking at two characters. If instead we allow comments to follow the backslash that marks a splice, it becomes complicated:
A backslash followed by a comment followed by a new-line would be spliced, but a backslash followed by a comment followed by other source code would not. That requires looking possibly many characters ahead and parsing the comment delimiters, possibly for multiple comments.
One purpose of splicing lines was to allow continuing long strings across multiple lines. (This was before adjacent strings were concatenated in C.) So "abc\ on one line and def" on another would be spliced together, making "abcdef". While we might allow comments after backslashes intended to join lines, we do not want to splice after a line containing "abc\ /*" /*comment*/. That means the code doing the splicing has to be context-sensitive; if the backslash appears in a quoted string, it has to treat it differently.
There is actually a reason why backslash-newlines are processed before comments are removed. It's the same reason why backslash-newlines are entirely removed, instead of being replaced with (virtual) horizontal whitespace, as comments are. It's a ridiculous reason, but it's the official reason. It's so you can mechanically force-fit C code with long lines onto punched cards, by inserting backslash-newline at column 79 no matter what that happens to divide:
static int cp_old_stat(struct kstat *stat, struct __old_kernel_stat __user * st\
atbuf)
{
static int warncount = 5;
struct __old_kernel_stat tmp;
if (warncount > 0) {
warncount--;
printk(KERN_WARNING "VFS: Warning: %s using old stat() call. Re\
compile your binary.\n",
(this is the first chunk of C I found on my hard drive that actually had lines that wouldn't fit on punched cards)
For this to work as intended, backslash-newline has to be able to split a /* or a */, like
/* this comment just so happens to be exactly 80 characters wide at the close *\
/
And you can't have it both ways: if comments were to be removed before processing backslash-newline, then backslash-newline could not affect comment boundaries; conversely, if backslash-newline is to be processed first, then comments can't appear between the backslash and the newline.
(I Am Not Making This Up™: C99 Rationale section 5.1.1.2 paragraph 30 reads
A backslash immediately before a newline has long been used to continue string literals, as well as preprocessing command lines. In the interest of easing machine generation of C, and of transporting code to machines with restrictive physical line lengths, the C89 Committee generalized this mechanism to permit any token to be continued by interposing a backslash/newline sequence.
Emphasis in original. Sorry, I don't know of any non-PDF version of this document.)
Per 5.1.1.2 Translation phases of the C11 standard (note the bolded text added)
5.1.1.2 Translation phases
1 The precedence among the syntax rules of translation is specified by the following phases.
1 Physical source file multibyte characters are mapped, in an implementation-defined manner, to the source character set (introducing new-line characters for end-of-line indicators) if necessary. Trigraph sequences are replaced by corresponding single-character internal representations.
2 Each instance of a backslash character (\) immediately followed by a new-line character is deleted, splicing physical source lines to form logical source lines. Only the last backslash on any physical source line shall be eligible for being part of such a splice. A source file that is not empty shall end in a new-line character, which shall not be immediately preceded by a backslash character before any such splicing takes place.
...
Only backslash characters immediately followed by a new-line will cause lines to be spliced. A comment is not a new-line character.

What is the difference between %s and %s%*c [duplicate]

Hi I am reading some code and this line has been used:
scanf("%s%*c",dati[i].part);
What does %s%*c do and why not just use %s?
What does %s%*c do
The %s has the same meaning as anywhere else -- skip leading whitespace and scan the next sequence of non-whitespace characters into the specified character array.
The %*c means the same thing as %c -- read the next input character, whatever it is (i.e. without skipping leading whitespace) -- except that the * within means that the result should not be assigned anywhere, and therefore that no corresponding pointer argument should be expected. Also, assignment suppression means that scanf's return value is not affected by whether that field is successfully scanned.
and why not just use %s?
We cannot say for sure why the author of the code in which you saw it used %s%*c, except for the unsatisfying "because that's what the author thought was appropriate." We have no context at all for making any other judgement.
Certainly the actual effect is to consume the next input character after the string, if any. If there is such a character then it will necessarily be a whitespace character, else it would have been scanned by the preceding %s directive. We might therefore speculate that the author's idea was to consume a trailing newline.
There are at least two problems with that:
The next character might not be a newline. For example, there might be trailing space characters before a newline, in which case the first of those space characters would be consumed, but the newline would remain in the stream. If that's a genuine problem then %*c does not reliably solve it.
In practice, it's not very useful. Most scanf directives are like %s in that they automatically skip leading whitespace, including newlines. The %*c serves only to confuse if the next directive that will be processed is any of those. Moreover, it is possible for a scanf format to explicitly express that a run of whitespace at a given position should be skipped, and it is clearer to make use of that in conjunction with the next directive to be processed if that next directive is one of those that don't automatically skip whitespace (and whitespace skipping is in fact desired).
That doesn't mean that assignment suppression generally or %*c specifically is useless, mind. It's just trying to use that technique to attempt to consume trailing newlines that is poorly conceived.
The %* format specifier in a scanf call instructs the function to read data in the following format (c in your case) from the input buffer but not to store it anywhere (i.e. discard it).
In your specific case, the %*c is being used to read and discard the trailing newline character (added when the user hits the Enter key), which will otherwise remain in the input buffer, and likely upset any subsequent calls to scanf.

What is the purpose of : in scanf?

scanf("%d:%d:%d%s", &hh, &mm, &ss, t12)
When reading a time as several inputs, the format is written as above, with : between the conversions. The line works fine, but can someone explain the need for, and the use of, the colon in the format string?
From the standard, C11 7.21.6.2 The fscanf function /3 and /6:
The format is composed of zero or more directives: one or more white-space
characters, an ordinary multibyte character (neither % nor a white-space character), or a conversion specification.
A directive that is an ordinary multibyte character is executed by reading the next characters of the stream. If any of those characters differ from the ones composing the directive, the directive fails and the differing and subsequent characters remain unread.
Hence the : simply means "make sure that the next character in the stream is a colon". Nothing more, nothing less.
Your format string simply means you'll be able to scan things like 12:34:56am - without the literal colons in the format string, the scan would fail.

Please explain this line from the book 'The C Programming Language' Pg 192

"If the input stream has been separated into tokens up to a given character, the next token is the longest string of characters that could constitute a token."
Here is what I interpret from this:
Suppose I enter a string "abc xyz". Then there would be two tokens in this input, "abc" and "xyz", so "abc" is separated from "xyz" by white space and "xyz" is the longest string of characters that could constitute a token.
I wish to know if I am understanding this correctly or not?
Yes, you're basically right, but the context is different. It is not about the "input", specifically.
The chapter you're referring to describes the "Lexical Conventions" and the tokenizing of the source file(s) during the preprocessing stage.
Just to clarify, here is the related part, quoted from the section "Tokens" in "Lexical Conventions":
Blanks, horizontal and vertical tabs, newlines, formfeeds and comments as
described below (collectively, ``white space'') are ignored except as they separate tokens. Some white space is required to separate otherwise adjacent identifiers, keywords, and constants.
If the input stream has been separated into tokens up to a given character, the next token is the longest string of characters that could constitute a token.
So, it's not only the "space" character; the tokens can be separated by any white-space element, as described above. In this case, yes, it is the "space" (' ') character.

When exactly in preprocessing are newlines removed?

In early stages of preprocessing C, newlines (unlike other kinds of whitespace outside quotes) are retained; by the time actual parsing begins, they're gone. When exactly are they removed?
5.1.1.2 Translation phases says "7. White-space characters separating tokens are no longer significant" but that's after "6. Adjacent string literal tokens are concatenated" which doesn't seem right, because string literals on separate lines are still concatenated. What am I missing?
6.10.3.2 The # operator says "Each occurrence of white space between the argument’s preprocessing tokens becomes a single space character in the character string literal." Is that an earlier removal of newlines, separate from their removal from the entire file?
You are right that there is a bit of ambiguity in that text. It is clear that newlines are significant up to phase 4, otherwise the preprocessing directives couldn't be executed correctly. What makes string literal tokens "adjacent" is never explained, in particular since white space loses its significance only in phase 7.
My understanding would be that "adjacent tokens" are tokens that are only separated by white space (if any), white space by itself is not considered to form tokens. With that reading it becomes clear that newlines between string literal tokens are removed by phase 6.
