In early stages of preprocessing C, newlines (unlike other kinds of whitespace outside quotes) are retained; by the time actual parsing begins, they're gone. When exactly are they removed?
5.1.1.2 Translation phases says "7. White-space characters separating tokens are no longer significant" but that's after "6. Adjacent string literal tokens are concatenated" which doesn't seem right, because string literals on separate lines are still concatenated. What am I missing?
6.10.3.2 The # operator says "Each occurrence of white space between the argument’s preprocessing tokens becomes a single space character in the character string literal." Is that an earlier removal of newlines, separate from their removal from the entire file?
You are right that the text is somewhat ambiguous. Newlines are clearly significant up to phase 4; otherwise the preprocessing directives couldn't be executed correctly. What makes string literal tokens "adjacent" is never explained, particularly since white space loses its significance only in phase 7.
My understanding is that "adjacent tokens" are tokens separated only by white space (if any); white space by itself is not considered to form tokens. With that reading it becomes clear that newlines between string literal tokens are removed by phase 6.
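A minimal sketch of both behaviors from the question, assuming a C11 compiler. The names joined, stringized, STR, and literals_match are made up for illustration. The first literal pair is separated only by a new-line and indentation, and the macro argument contains a new-line and several spaces between its tokens.

```c
#include <assert.h>
#include <string.h>

/* Two string literal tokens separated only by a new-line and indentation. */
static const char joined[] = "abc"
                             "def";

/* STR stringizes its argument; each run of white space between the
   argument's preprocessing tokens becomes a single space. */
#define STR(x) #x
static const char stringized[] = STR(a   +
                                     b);

/* Six characters plus the terminating null: the two literals became one. */
static_assert(sizeof joined == 7, "adjacent literals were concatenated");
/* "a + b" plus the terminating null. */
static_assert(sizeof stringized == 6, "white space collapsed to one space");

/* Returns nonzero if both strings have the contents described above. */
int literals_match(void)
{
    return strcmp(joined, "abcdef") == 0 && strcmp(stringized, "a + b") == 0;
}
```

If this compiles, the static assertions already confirm that the new-line between the two literals did not prevent concatenation, and that the new-line inside the macro argument was treated like any other white space.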
Related
Valid code:
#define M xxx\
yyy
Not valid code:
#define M xxx\/*comment*/
yyy
#define M xxx\//comment
yyy
Questions:
Why is a comment not allowed after a backslash character (\)?
What does the standard say?
UPD.
Extra question:
What is the motivation / reason / argumentation behind the requirement that (in order to achieve splicing of physical source lines) a backslash character (\) must be immediately followed by a new-line character? What is the obstacle to allowing comments (or spaces) after the backslash character (\)?
Lines are spliced together only if a backslash character is the last character on a line. C 2018 5.1.1.2 specifies phases of translating a C program. In phase 2:
Each instance of a backslash character (\) immediately followed by a new-line character is deleted, splicing physical source lines to form logical source lines…
If a comment follows a backslash character, the backslash character is not followed by a new-line character, so no splicing is performed. Comments are processed in phase 3:
The source file is decomposed into preprocessing tokens and sequences of white-space characters (including comments)… Each comment is replaced by one space character…
Regarding the added question:
What is the motivation / reason / argumentation behind the requirement that (in order to achieve splicing of physical source lines) a backslash character (\) must be immediately followed by a new-line character? What is the obstacle to allowing comments (or spaces) after the backslash character (\)?
The earliest processing in compiling a C program is the simplest. Early C compilers may have been implemented as layers of simple filters: First local-environment characters or methods of file storage would be translated to a simple stream of characters, then lines would be spliced together (perhaps dealing with a problem of wanting a long source line while having to type your source code on 80-column punched cards), then comments would be removed, and so on.
Splicing together lines marked by a backslash at the end of a line is easy; it only requires looking at two characters. If instead we allow comments to follow the backslash that marks a splice, it becomes complicated:
A backslash followed by a comment followed by a new-line would be spliced, but a backslash followed by a comment followed by other source code would not. That requires looking possibly many characters ahead and parsing the comment delimiters, possibly for multiple comments.
One purpose of splicing lines was to allow continuing long strings across multiple lines. (This was before adjacent strings were concatenated in C.) So "abc\ on one line and def" on another would be spliced together, making "abcdef". While we might allow comments after backslashes intended to join lines, we do not want to splice after a line containing "abc\ /*" /*comment*/. That means the code doing the splicing has to be context-sensitive; if the backslash appears in a quoted string, it has to treat it differently.
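To illustrate how little work the two-character rule needs, here is a rough sketch of a phase 2 style filter. The function name splice_lines and its interface are invented for this example; it is not taken from any real compiler.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: copies src to dst, deleting every backslash that is
   immediately followed by a new-line (together with that new-line).
   dst must have room for strlen(src) + 1 bytes. Returns the output length. */
size_t splice_lines(const char *src, char *dst)
{
    size_t n = 0;

    assert(src != NULL && dst != NULL);
    while (*src != '\0') {
        if (src[0] == '\\' && src[1] == '\n') {
            src += 2;            /* drop both characters: the lines are joined */
        } else {
            dst[n++] = *src++;   /* everything else passes through untouched */
        }
    }
    dst[n] = '\0';
    return n;
}
```

The filter never needs to know whether it is inside a string or a comment. A backslash followed by /*comment*/ is simply not followed by a new-line, so it is copied through unchanged and no splice happens.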
There is actually a reason why backslash-newlines are processed before comments are removed. It's the same reason why backslash-newlines are entirely removed, instead of being replaced with (virtual) horizontal whitespace, as comments are. It's a ridiculous reason, but it's the official reason. It's so you can mechanically force-fit C code with long lines onto punched cards, by inserting backslash-newline at column 79 no matter what that happens to divide:
static int cp_old_stat(struct kstat *stat, struct __old_kernel_stat __user * st\
atbuf)
{
static int warncount = 5;
struct __old_kernel_stat tmp;
if (warncount > 0) {
warncount--;
printk(KERN_WARNING "VFS: Warning: %s using old stat() call. Re\
compile your binary.\n",
(this is the first chunk of C I found on my hard drive that actually had lines that wouldn't fit on punched cards)
For this to work as intended, backslash-newline has to be able to split a /* or a */, like
/* this comment just so happens to be exactly 80 characters wide at the close *\
/
And you can't have it both ways: if comments were to be removed before processing backslash-newline, then backslash-newline could not affect comment boundaries; conversely, if backslash-newline is to be processed first, then comments can't appear between the backslash and the newline.
(I Am Not Making This Up™: C99 Rationale section 5.1.1.2 paragraph 30 reads
A backslash immediately before a newline has long been used to continue string literals, as well as preprocessing command lines. In the interest of easing machine generation of C, and of transporting code to machines with restrictive physical line lengths, the C89 Committee generalized this mechanism to permit any token to be continued by interposing a backslash/newline sequence.
Emphasis in original. Sorry, I don't know of any non-PDF version of this document.)
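A small sketch of what "any token" means in practice, assuming a C11 compiler. The function name spliced_value is made up. Each backslash below is the last character on its line, so phase 2 joins the pieces before comments or tokens are recognized.

```c
#include <assert.h>

/* The closing delimiter of this comment is split by a backslash-newline *\
/

/* A string literal continued across a splice is still one 4-character string. */
static_assert(sizeof "ab\
cd" == 5, "splice inside a string literal");

/* A keyword, an identifier, and another keyword, each split mid-token. */
in\
t spliced_val\
ue(void)
{
    retu\
rn 42;
}
```

After splicing, the compiler sees an ordinary comment, the literal "abcd", and a function int spliced_value(void) that returns 42.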
Per 5.1.1.2 Translation phases of the C11 standard (note the bolded text added)
5.1.1.2 Translation phases
1 The precedence among the syntax rules of translation is specified by the following phases.
1 Physical source file multibyte characters are mapped, in an implementation-defined manner, to the source character set (introducing new-line characters for end-of-line indicators) if necessary. Trigraph sequences are replaced by corresponding single-character internal representations.
2 Each instance of a backslash character (\) immediately followed by a new-line character is deleted, splicing physical source lines to form logical source lines. Only the last backslash on any physical source line shall be eligible for being part of such a splice. A source file that is not empty shall end in a new-line character, which shall not be immediately preceded by a backslash character before any such splicing takes place.
...
Only backslash characters immediately followed by a new-line will cause lines to be spliced. A comment is not a new-line character.
scanf("%d:%d:%d%s", &hh, &mm, &ss, t12)
When reading the several fields of a time to be displayed, the input statement is written as above, with : in the format string. The line works fine, but can someone explain the need for and use of the colon in the input statement?
From the standard, C11 7.21.6.2 The fscanf function /3 and /6:
The format is composed of zero or more directives: one or more white-space
characters, an ordinary multibyte character (neither % nor a white-space character), or a conversion specification.
A directive that is an ordinary multibyte character is executed by reading the next characters of the stream. If any of those characters differ from the ones composing the directive, the directive fails and the differing and subsequent characters remain unread.
Hence the : simply means "make sure that the next character in the stream is a colon". Nothing more, nothing less.
Your format string simply means you'll be able to scan things like 12:34:56am - without the literal colons in the format string, the scan would fail.
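As a rough illustration, here is the same format wrapped in a function that reads from a string with sscanf instead of from stdin. The helper name parse_time is hypothetical, and a %2s width is added as an assumption so the suffix cannot overflow its buffer.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: parses "hh:mm:ss" followed by a suffix such as "am".
   Returns the number of fields sscanf converted (4 on full success). */
int parse_time(const char *text, int *hh, int *mm, int *ss, char suffix[3])
{
    assert(text != NULL);
    /* Each ':' directive must match a literal colon in the input;
       %2s limits the suffix to two characters plus the null byte. */
    return sscanf(text, "%d:%d:%d%2s", hh, mm, ss, suffix);
}
```

With the input 12:34:56am all four fields are converted. With 12-34-56am only the hour is converted: the first : directive fails on the - and scanning stops there, so the return value is 1.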
"If the input stream has been separated into tokens up to a given character, the next token is the longest string of characters that could constitute a token."
Here is what I interpret from this:
Suppose I enter a string "abc xyz" ,then there would be two tokens in this input,"abc" and "xyz",so "abc" is separated from "xyz" by white-space and "xyz" is the longest string of characters that could constitute a token.
I wish to know if I am understanding this correctly or not?
Yes, you're basically right, but the context is different. It is not about the "input", specifically.
The chapter you're referring to, describes the "Lexical Conventions" and tokenizing of the source file(s) during the preprocessing stage.
Just to clarify, To quote the related part, from the Chapter "Tokens" in "Lexical Conventions"
Blanks, horizontal and vertical tabs, newlines, formfeeds and comments as
described below (collectively, ``white space'') are ignored except as they separate tokens. Some white space is required to separate otherwise adjacent identifiers, keywords, and constants.
If the input stream has been separated into tokens up to a given character, the next token is the longest string of characters that could constitute a token.
So, it's not only the "space" character; the tokens can be separated by any white-space element, as described above. In this case, yes, it is the "space" (' ') character.
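The "longest string of characters" rule is easiest to see where no white space is involved at all. A minimal sketch, with made-up function names:

```c
#include <assert.h>

/* "a+++b" is tokenized greedily as a ++ + b, that is (a++) + b,
   never as a + (++b), because "++" is the longest token available
   when the tokenizer reaches the first '+'. */
int add_post_increment(int a, int b)
{
    return a+++b;
}

/* With a = 1 and b = 2 the greedy reading gives 1 + 2;
   the other reading would have given 1 + 3. */
void check_munch(void)
{
    assert(add_post_increment(1, 2) == 3);
}
```

Writing a+ ++b would force the other reading, which is the case where white space is needed to separate otherwise adjacent tokens.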
A comment (which should probably be submitted as an answer) has the code
sscanf(string, "<title>%[^<]</title>", extracted_string);
Running the code seems to copy the text between the <title> tags to extracted_string, but I cannot find any references to a caret in the printf family, either in the man pages or elsewhere online.
Can someone point me to a resource that explains the use of %[^<], and other similar syntax, in the sscanf() family?
From the C11 standard document, chapter §7.21.6.2, Paragraph 12, conversion specifiers, (emphasis mine)
[
Matches a nonempty sequence of characters from a set of expected characters
(the scanset).
....
The conversion specifier includes all subsequent characters in the format
string, up to and including the matching right bracket (]). The characters
between the brackets (the scanlist) compose the scanset, unless the character
after the left bracket is a circumflex (^), in which case the scanset contains all characters that do not appear in the scanlist between the circumflex and the
right bracket.
A draft version of the standard, found online.
It means "match anything that is not a <". It's not a good idea to do that without specifying the maximum destination buffer length. If your destination buffer can hold, say, 100 characters, then
char extracted_string[100];
sscanf(string, "<title>%99[^<]</title>", extracted_string);
would be a better solution.
Using strstr() for this purpose allows you to actually make extracted_string dynamic.
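One possible way to do the strstr() variant, as a sketch only. The function name extract_title_dyn is hypothetical, and it assumes the first <title> ... </title> pair is the one wanted.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a malloc'd copy of the text between the first
   "<title>" and the following "</title>", or NULL if either tag is missing
   or allocation fails. The caller frees the result. */
char *extract_title_dyn(const char *html)
{
    static const char open_tag[] = "<title>";
    static const char close_tag[] = "</title>";
    const char *start, *end;
    char *copy;
    size_t len;

    assert(html != NULL);
    start = strstr(html, open_tag);
    if (start == NULL)
        return NULL;
    start += sizeof open_tag - 1;        /* step past the opening tag */
    end = strstr(start, close_tag);
    if (end == NULL)
        return NULL;
    len = (size_t)(end - start);
    copy = malloc(len + 1);              /* sized to fit, no fixed limit */
    if (copy == NULL)
        return NULL;
    memcpy(copy, start, len);
    copy[len] = '\0';
    return copy;
}
```

Unlike the sscanf version, this finds the tag anywhere in the string and also accepts an empty title, since no scanset has to match at least one character.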
This link explains the [ and ^ usage in the scanf family of functions
(emphasis mine)
http://www.cdf.toronto.edu/~ajr/209/notes/printf.html
[
Matches a nonempty sequence of characters from the specified set of accepted characters; the next pointer must be a pointer to char, and there must be enough room for all the characters in the string, plus a terminating null byte. The usual skip of leading white space is suppressed. The string is to be made up of characters in (or not in) a particular set; the set is defined by the characters between the open bracket [ character and a close bracket ] character. The set excludes those characters if the first character after the open bracket is a circumflex (^). To include a close bracket in the set, make it the first character after the open bracket or the circumflex; any other position will end the set. The hyphen character - is also special; when placed between two other characters, it adds all intervening characters to the set. To include a hyphen, make it the last character before the final close bracket. For instance, [^]0-9-] means the set "everything except close bracket, zero through nine, and hyphen". The string ends with the appearance of a character not in the (or, with a circumflex, in) set or when the field width runs out.
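A short sketch of the set rules just quoted, with hypothetical helper names. Note that the C standard leaves the meaning of a hyphen between two characters implementation-defined, so the 0-9 range is an assumption about the library in use (common implementations such as glibc treat it as a range).

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: copies the leading run of decimal digits from text.
   Returns 1 if at least one digit was matched, 0 otherwise. */
int leading_digits(const char *text, char out[8])
{
    assert(text != NULL && out != NULL);
    return sscanf(text, "%7[0-9]", out) == 1;
}

/* Hypothetical helper: copies everything up to the first ']', digit, or '-'.
   The ']' comes right after the circumflex so it is part of the set, and
   the '-' comes last so it is taken literally. */
int until_bracket_digit_or_hyphen(const char *text, char out[16])
{
    assert(text != NULL && out != NULL);
    return sscanf(text, "%15[^]0-9-]", out) == 1;
}
```

The field widths 7 and 15 are one less than the buffer sizes, leaving room for the terminating null byte.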
I'm trying to break up a shell command that contains both pipes (|) and the OR symbols (||) represented as characters in an array with strtok, except, well the OR command could also be two pipes next to each other. Specifically, I need to know when |, ;, &&, or || show up in the command.
Is there a way to specify where one delimiter ends and another begins in strtok, since I know usually the delimiters are one character long and you just list them all out with no spaces or anything in between.
Oh and, is a newline a valid delimiter? Or does strtok only do spaces?
Starting from your last question: yes, strtok can use new-line as a delimiter without any problems.
Unfortunately, the answer to your first question isn't nearly so positive. strtok treats all delimiter characters as equal, and does nothing to differentiate between a single delimiter and an arbitrary number of consecutive delimiters. In other words, if you give |&; as the delimiter, it'll treat ||||||||| or &&& or &|&|; all exactly the same way.
I'll go a little further: I'll go out on a limb and state as a fact that strtok simply isn't suitable for breaking a shell command into constituent pieces -- I'm pretty sure there's just no way to use it for this job that will produce usable results.
In particular, you don't have anything that just acts as a delimiter. For your purposes, the &, |, and || are tokens of their own. In a string being supplied to the shell, you don't necessarily have anything that qualifies as a delimiter the way strtok "thinks" of them.
strtok is oriented toward tokens that are separated by delimiters that are nothing except delimiters. As strtok reads the tokens, the delimiters between them are completely ignored (and, destroyed, for that matter). For the shell, a string like a|b is really three tokens -- you need the a, the | and the b -- there's nothing between them that strtok can safely overwrite and/or ignore -- but that's a requirement for how strtok works. For it to deliver you the first a, it overwrites the next character (the | in this case) with a '\0'. Then it has no way of recovering that pipe to tell you what the next token should be.
I think you probably need a greedy tokenizer instead -- i.e., one that builds the longest string of characters that can be a token, and stops when it encounters a character that can't be part of the current token. When you ask for the next token, it starts from the first character after the end of the previous token, without (necessarily) skipping/ignoring anything (though, of course, if it encounters something like white-space that hasn't been quoted somehow, it'll probably skip over it).
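A rough sketch of such a greedy tokenizer, limited to the operators from the question (|, ||, &, &&, ;) and ignoring quoting entirely. The function name next_token and its interface are invented for illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: finds the next token at or after *cursor.
   Stores its start in *tok, returns its length (0 at end of input),
   and advances *cursor past it. The input string is never modified. */
size_t next_token(const char **cursor, const char **tok)
{
    const char *p;
    size_t len;

    assert(cursor != NULL && *cursor != NULL && tok != NULL);
    p = *cursor;
    while (isspace((unsigned char)*p))    /* unquoted white space is skipped */
        p++;
    *tok = p;
    if (*p == '\0') {
        len = 0;
    } else if (*p == '|' || *p == '&') {
        len = (p[1] == p[0]) ? 2 : 1;     /* greedy: prefer "||" and "&&" */
    } else if (*p == ';') {
        len = 1;
    } else {
        /* A word runs until white space, an operator character, or the end. */
        len = 0;
        while (p[len] != '\0' && !isspace((unsigned char)p[len])
               && strchr("|&;", p[len]) == NULL)
            len++;
    }
    *cursor = p + len;
    return len;
}
```

Called repeatedly on ls|wc||echo x; this yields ls, |, wc, ||, echo, x, ; and then a length of 0. Each token is reported as a pointer plus a length, so the operators survive and the caller can tell | from ||.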
For your purpose, strtok() is not the correct tool to use; it destroys the delimiter, so you can't tell what was at the end of a token if someone types ls|wc. It could have been a pipe, a semi-colon, an ampersand, or a space. Also, it treats multiple adjacent delimiters as part of a single delimiter.
Look at strspn() and strcspn(); both are in standard C and are non-destructive relatives of strtok().
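For instance, a sketch along these lines (the helper name first_word and the choice of delimiter sets are assumptions) finds a word without writing to the string, so the character that ended it can still be inspected:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: reports where the first word in s starts and how long
   it is, plus the character that ended it ('\0' if the string ended). */
size_t first_word(const char *s, const char **word, char *ended_by)
{
    static const char blanks[] = " \t\n";
    static const char stops[] = " \t\n|&;";
    size_t len;

    assert(s != NULL && word != NULL && ended_by != NULL);
    s += strspn(s, blanks);     /* skip leading white space */
    len = strcspn(s, stops);    /* length of the run containing no delimiter */
    *word = s;
    *ended_by = s[len];         /* still intact: nothing was overwritten */
    return len;
}
```

For the input ls|wc this reports a word of length 2 ended by |, which is exactly the information strtok() would have destroyed.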
strtok() is quite happy to use newline as a delimiter; in fact, any character except '\0' can be used as one of the delimiters.
There are other reasons for being extremely cautious about using strtok(), such as thread safety and the fact that it is highly unwise to use it in library code.
strtok() is a basic, all-purpose parsing function. For more advanced parsing, I don't recommend its use.
For example, in the case of '|', you really need to inspect the next character to determine if you've found '|' or '||'.
I've done a huge amount of parsing of this nature, including writing a small language interpreter. It's not that hard if you break it up into smaller tasks. But my advice is to write your own parsing routine in this case.
And, yes, a newline character is a valid delimiter.