I am reading the book "The C Programming Language" by Brian Kernighan and Dennis Ritchie(2nd edition, published by PHI). In the first article 1.1 Getting started of the first chapter A Tutorial Introduction, page number 7, they say that one must use \n in the printf() argument, otherwise the C compile will produce an error message. But when I compiled the program without \n in printf(), it went fine. I did not see any error message. I am using Dev-C portable with "MinGW GCC 4.6.2 32-bit" compiler.
Why I do not get the error message?
Here is the passage in question, from page 7 of the second edition of K&R:
You must use \n to include a newline character in the printf argument; if you try something like
printf("hello, world
");
the C compiler will produce an error message.
This means that you can't embed a literal newline in a quoted string.
Either one of the lines below, however, are fine:
printf("hello, world"); /* does not print a newline */
printf("hello, world\n"); /* prints a newline */
All the text above is saying is that you can't have a quoted string that spans multiple lines in the source code.
You can also escape a newline with a backslash. The C preprocessor will remove the backslash and newline, so the following two statements are equivalent:
printf("hello, world\
");
printf("hello, world");
And if you have a lot of text, you can put multiple quoted strings next to each other, or separated by whitespace, and the compiler will join them for you:
printf("hello, world\n"
"this is a second line of text\n"
"but you still need to include backslash-n to break each line\n");
You don't get a compile-time error message because there is no error.
In the first article they say that one must use \n in the printf() argument, otherwise the C compiler will produce an error message.
Can you cite (by section and/or page number) where that statement appears? I seriously do not believe that K&R (you're using the second edition, right?) says that. If it did say that, it would be an error in the book.
Update: What the book says, quite correctly, is that a newline in a string literal is represented by the two-character sequence \n, not by an actual newline character. A string literal must be on a single logical source line; something like
printf("hello
world");
is a syntax error. This applies to all string literals, whether they're printf format strings or not.
An actual newline in a string literal is an error. A \n sequence that represents a newline is optional; its lack is not an error, but a printf format string should usually end with a \n.
There is no requirement for a printf call to include the \n character, and I've never seen a compiler complain about a printf that lacks a \n.
There is an issue here, but it's not a compile-time error.
Some examples:
printf("No newline");
This is a perfectly legal call. It prints the specified string on standard output without a newline character.
printf("hello%c", '\n');
There's no \n in the format string, but it prints hello followed by a newline. Again, this is perfectly legal.
The actual issue is that you should (almost) always print a newline at the very end of your output. This complete program:
#include <stdio.h>
int main(void) {
printf("hello");
return 0;
}
is legal, but its behavior may be undefined in some implementations. The relevant rule is in the standard, section 7.21.2 paragraph 2 (the quote is from the N1570 draft):
A text stream is an ordered sequence of characters composed into
lines, each line consisting of zero or more characters plus a
terminating new-line character. Whether the last line requires a
terminating new-line character is implementation-defined.
Whether that terminating newline character is required or not, it's (almost always) a very good idea to end your output with a newline. If I run it on my system, I get the string hello immediately followed by my shell prompt on the same line. It's not illegal, but it's inconvenient and ugly.
But that applies only at the very end of the program's output. This program is perfectly valid and has well defined behavior:
#include <stdio.h>
int main(void) {
printf("hello");
putchar('\n');
return 0;
}
Still, the easiest and most reliable way to produce clean output is for each printf call to print exactly one line, which ends with exactly one '\n' character. This isn't a universal rule; sometimes it's convenient to print a line a piece at a time, or to print two or more lines in a single printf.
Very often, if you don't end your printf format string with a \n, some of the output stays in the stdout buffer, and you need to call fflush to get all the output shown.
This means that if you don't get all the expected output you should add fflush at appropriate places (e.g. before calls to fork).
But you won't get a compiler message in such case, because it is not an error (it may be a mistake many beginners are doing). If you really wanted, you could customize your compiler (e.g. with MELT if using a recent GCC compiler) to get the warning. I believe it is not worth the effort (because there are legitimate calls to printf without any \n....)
An example of legitimate printf calls without newlines would be if you coded a (recursive) function to output an expression from its AST; you certainly should not emit a newline after each token.
See documentation of printf(3), fflush(3), stdio(3), setvbuf(3) etc...
Related
Perhaps I'm misinterpreting my results, but:
#include <stdio.h>
int
main(void)
{
char buf[32] = "";
int x;
x = scanf("%31[^\0]", buf);
printf("x = %d, buf=%s", x, buf);
}
$ printf 'foo\n\0bar' | ./a.out
x = 1, buf=foo
Since the string literal "%31[^\0]" contains an embedded null, it seems that it should be treated the same as "%31[^", and the compiler should complain that the [ is unmatched. Indeed, if you replace the string literal, clang gives:
warning: no closing ']' for '%[' in scanf format string [-Wformat]
Why does it work to embed a null character in the string literal passed to scanf?
-- EDIT --
The above is undefined behavior and merely happens to "work".
First of all, Clang totally fails to output any meaningful diagnostics here, whereas GCC knows exactly what is happening - so yet again GCC 1 - 0 Clang.
And as for the format string - well, it doesn't work. The format argument to scanf is a string. The string ends at terminating null, i.e. the format string you're giving to scanf is
scanf("%31[^", buf);
On my computer, compiling the program gives
% gcc scanf.c
scanf.c: In function ‘main’:
scanf.c:8:20: warning: no closing ‘]’ for ‘%[’ format [-Wformat=]
8 | x = scanf("%31[^\0]", buf);
| ^
scanf.c:8:21: warning: embedded ‘\0’ in format [-Wformat-contains-nul]
8 | x = scanf("%31[^\0]", buf);
| ^~
The scanset must have the closing right bracket ], otherwise the conversion specifier is invalid. If conversion specifier is invalid, the behaviour is undefined.
And, on my computer running it,
% printf 'foo\n\0bar' | ./a.out
x = 0, buf=
Q.E.D.
This is a rather strange situation. I think there are a couple of things going on.
First of all, a string in C ends by definition at the first \0. You can always scoff at this rule, for example by writing a string literal with an explicit \0 in the middle of it. When you do, though, the characters after the \0 are mostly invisible. Very few standard library functions are able to see them, because of course just about everything that interprets a C string will stop at the first \0 it finds.
However: the string you pass as the first argument to scanf is typically parsed twice -- and by "parsed" I mean actually interpreted as a scanf format string possibly containing special % sequences. It's always going to be parsed at run time, by the actual copy of scanf in your C run-time library. But it's typically also parsed by the compiler, at compile time, so that the compiler can warn you if the % sequences don't match the actual arguments you call it with. (The run-time library code for scanf, of course, is unable to perform this checking.)
Now, of course, there's a pretty significant potential problem here: what if the parsing performed by the compiler is in some way different than the parsing performed by the actual scanf code in the run-time library? That might lead to confusing results.
And, to my considerable surprise, it looks like the scanf format parsing code in compilers can (and in some cases does) do something special and unexpected. clang doesn't (it doesn't complain about the malformed string at all), but gcc says both "no closing ‘]’ for ‘%[’ format" and "embedded ‘\0’ in format". So it's noticing.
This is possible (though still surprising) because the compiler, at least, can see the whole string literal, and is in a position to notice that the null character is an explicit one inserted by the programmer, not the more usual implicit one appended by the compiler. And indeed the warning "embedded ‘\0’ in format" emitted by gcc proves that gcc, at least, is quite definitely written to accommodate this possibility. (See the footnote below for a bit more on the compiler's ability to "see" the whole string literal.)
But the second question is, why does it (seem to) work at runtime? What is the actual scanf code in the C library doing?
That code, at least, has no way of knowing that the \0 was explicit and that there are "real" characters following it. That code simply must stop at the first \0 that it finds. So it's operating as if the format string was
"%31[^"
That's a malformed format string, of course. The run-time library code isn't required to do anything reasonable. But my copy, like yours, is able to read the full string "foo". What's up with that?
My guess is that after seeing the % and the [ and the ^, and deciding that it's going to scan characters not matching some set, it's perfectly willing to, in effect, infer the missing ], and sail on matching characters from the scanset, which ends up having no excluded characters.
I tested this by trying the variant
x = scanf("%31[^\0o]", buf);
This also matched and printed "foo", not "f".
Obviously things are nothing like guaranteed to work like this, of course. #AnttiHaapala has already posted an answer showing that his C RTL declines to scan "foo" with the malformed scan string at all.
Footnote:
Most of the time, an embedded in \0 in a string truly, prematurely ends it. Most of the time, everything following the \0 is effectively invisible, because at run time, every piece of string interpreting code will stop at the first \0 it finds, with no way to know whether it was one explicitly inserted by the programmer or implicitly appended by the compiler. But as we've seen, the compiler can tell the difference, because the compiler (obviously) can see the entire string literal, exactly as entered by the programmer. Here's proof:
char str1[] = "Hello, world!";
char str2[] = "Hello\0world!";
printf("sizeof(str1) = %zu, strlen(str1) = %zu\n", sizeof(str1), strlen(str1));
printf("sizeof(str2) = %zu, strlen(str2) = %zu\n", sizeof(str2), strlen(str2));
Normally, sizeof on a string literal gives you a number one bigger than strlen. But this code prints:
sizeof(str1) = 14, strlen(str1) = 13
sizeof(str2) = 13, strlen(str2) = 5
Just for fun I also tried:
char str3[5] = "Hello";
This time, though, strlen gave a larger number:
sizeof(str3) = 5, strlen(str3) = 10
I was mildly lucky. str3 has no trailing \0, neither one inserted by me nor appended by the compiler, so strlen sails off the end, and could easily have counted hundreds or thousands of characters before finding a random \0 somewhere in memory, or crashing.
Why can a null character be embedded in a conversion specifier for scanf?
A null character cannot directly be specified as part of a scanset as in "%31[^\0]" as the parsing of the string ends with the first null character.
"%31[^\0]" is parsed by scanf() as if it was "%31[^". As it is an invalid scanf()
specifier, UB will likely follow. A compiler may provide diagnostics on more than what scanf() sees.
A null character can be part of a scanset as in "%31[^\n]". This will read in all characters, including the null character, other than '\n'.
In the unusual case of reading null chracters, to determine the number of characters read scanned, use "%n".
int n = 0;
scanf("%31[^\n]%n", buf, &n);
scanf("%*1[\n]"); // Consume any 1 trailing \n
if (n) {
printf("First part of buf=%s, %d characters read ", buf, n);
}
I have run into some code and was wondering what the original developer was up to. Below is a simplified program using this pattern:
#include <stdio.h>
int main() {
char title[80] = "mytitle";
char title2[80] = "mayataiatale";
char mystring[80];
/* hugh ? */
sscanf(title,"%[^a]",mystring);
printf("%s\n",mystring); /* Output is "mytitle" */
/* hugh ? */
sscanf(title2,"%[^a]",mystring); /* Output is "m" */
printf("%s\n",mystring);
return 0;
}
The man page for scanf has relevant information, but I'm having trouble reading it. What is the purpose of using this sort of notation? What is it trying to accomplish?
The main reason for the character classes is so that the %s notation stops at the first white space character, even if you specify field lengths, and you quite often don't want it to. In that case, the character class notation can be extremely helpful.
Consider this code to read a line of up to 10 characters, discarding any excess, but keeping spaces:
#include <ctype.h>
#include <stdio.h>
int main(void)
{
char buffer[10+1] = "";
int rc;
while ((rc = scanf("%10[^\n]%*[^\n]", buffer)) >= 0)
{
int c = getchar();
printf("rc = %d\n", rc);
if (rc >= 0)
printf("buffer = <<%s>>\n", buffer);
buffer[0] = '\0';
}
printf("rc = %d\n", rc);
return(0);
}
This was actually example code for a discussion on comp.lang.c.moderated (circa June 2004) related to getline() variants.
At least some confusion reigns. The first format specifier, %10[^\n], reads up to 10 non-newline characters and they are assigned to buffer, along with a trailing null. The second format specifier, %*[^\n] contains the assignment suppression character (*) and reads zero or more remaining non-newline characters from the input. When the scanf() function completes, the input is pointing at the next newline character. The body of the loop reads and prints that character, so that when the loop restarts, the input is looking at the start of the next line. The process then repeats. If the line is shorter than 10 characters, then those characters are copied to buffer, and the 'zero or more non-newlines' format processes zero non-newlines.
The constructs like %[a] and %[^a] exist so that scanf() can be used as a kind of lexical analyzer. These are sort of like %s, but instead of collecting a span of as many "stringy" characters as possible, they collect just a span of characters as described by the character class. There might be cases where writing %[a-zA-Z0-9] might make sense, but I'm not sure I see a compelling use case for complementary classes with scanf().
IMHO, scanf() is simply not the right tool for this job. Every time I've set out to use one of its more powerful features, I've ended up eventually ripping it out and implementing the capability in a different way. In some cases that meant using lex to write a real lexical analyzer, but usually doing line at a time I/O and breaking it coarsely into tokens with strtok() before doing value conversion was sufficient.
Edit: I ended ripping out scanf() typically because when faced with users insisting on providing incorrect input, it just isn't good at helping the program give good feedback about the problem, and having an assembler print "Error, terminated." as its sole helpful error message was not going over well with my user. (Me, in that case.)
It's like character sets from regular expressions; [0-9] matches a string of digits, [^aeiou] matches anything that isn't a lowercase vowel, etc.
There are all sorts of uses, such as pulling out numbers, identifiers, chunks of whitespace, etc.
You can read about it in the ISO/IEC9899 standard available online.
Here is a paragraph I quote from the document about [ (Page 286):
Matches a nonempty sequence of characters from a set of expected
characters.
The conversion specifier includes all subsequent characters in the
format string, up to and including the matching right bracket (]). The
characters between the brackets (the scanlist) compose the scanset,
unless the character after the left bracket is a circumflex (^), in
which case the scanset contains all characters that do not appear in
the scanlist between the circumflex and the right bracket. If the
conversion specifier begins with [] or [^], the right bracket
character is in the scanlist and the next following right bracket
character is the matching right bracket that ends the specification;
otherwise the first following right bracket character is the one that
ends the specification. If a - character is in the scanlist and is not
the first, nor the second where the first character is a ^, nor the
last character, the behavior is implementation-defined.
Lately, I noticed a strange case I would like to verify:
By SUS, for %n in a format string, the respective int will be set to the-amount-of-bytes-written-to-the-output.
Additionally, for snprintf(dest, 3, "abcd"), dest will point to "ab\0". Why? Because no more than n (n = 3) bytes are to be written to the output (the dest buffer).
I deduced that for the code:
int written;
char dest[3];
snprintf(dest, 3, "abcde%n", &written);
written will be set to 2 (null termination excluded from count).
But from a test I made using GCC 4.8.1, written was set to 5.
Was I misinterpreting the standard? Is it a bug? Is it undefined behaviour?
Edit:
#wildplasser said:
... the behavior of %n in the format string could be undefined or implementation defined ...
And
... the implementation has to simulate processing the complete format string (including the %n) ...
#par said:
written is 5 because that's how many characters would be written at the point the %n is encountered. This is correct behavior. snprintf only copies up to size characters minus the trailing null ...
And:
Another way to look at this is that the %n wouldn't have even been encountered if it only processed up to 2 characters, so it's conceivable to expect written to have an invalid value...
And:
... the whole string is processed via printf() rules, then the max-length is applied ...
Can it be verified be the standard, a standard-draft or some official source?
written is 5 because that's how many characters would be written at the point the %n is encountered. This is correct behavior. snprintf only copies up to size characters minus the trailing null (so in your case 3-1 == 2. You have to separate the string formatting behavior from the only-write-so-many characters.
Another way to look at this is that the %n wouldn't have even been encountered if it only processed up to 2 characters, so it's conceivable to expect written to have an invalid value. That's where there would be a bug, if you were expecting something valid in written at the point %n was encountered (and there wasn't).
So remember, the whole string is processed via printf() rules, then the max-length is applied.
It is not a bug: ISOC99 says
The snprintf function is equivalent to fprintf
[...]
output characters beyond the n-1st are discarded rather than being written to the array
[...]
So it just discards trailing output but otherwise behaves the same.
With the following code, what's the output of this code, and why?
#include <stdio.h>
int main() {
printf("Hello world\n"); // \\
printf("What's the meaning of this?");
return 0;
}
The backslash at the end of the 4th line is escaping the following new line so that they become one continuous line. And because we can see the // beginning a comment, the 5th line is commented out.
That is, your code is the equivalent of:
#include <stdio.h>
int main() {
printf("Hello world\n"); // \printf("What's the meaning of this?");
return 0;
}
The output is simply "Hello world" with a new line.
Edit: As Erik and pmg both said, this is true in C99 but not C89. Credit where credit is due.
It is defined in the 2nd phase of translation (ISO/IEC 9899:1999 §5.1.1.2):
Each instance of a backslash character (\) immediately followed by a new-line
character is deleted, splicing physical source lines to form logical source lines.
It's "Hello world\n". Didn't you try? Line continuation (and e.g trigraphs) are well documented, look it up. A syntax highlighting editor (e.g. Visual Studio with VA X) will make this obvious.
Note that this works in C99 and C++ - not C89
The trailing backslash causes the next line to be 'spliced' to the line that ends inthe backslash - even if it's part of a comment. This is nearly always unintentional (unless it's a deliberate obfuscation trick), and will cause a bug unless the next line is entirely whitespace or a comment itself.
This happens because 'line-splicing' occurs in translation phase 2, while removing comments happens in phase 3.
Newer compilers will warn about the single-line comment being continued to the next line (I'm not sure exactly what warning level might be required though):
GCC 4.5.1 (MinGW)
C:\temp\test.c:4:34: warning: multi-line comment
MSVC 9 (VS 2008) or 10 (VS 2010):
C:\temp\test.c(5) : warning C4010: single-line comment contains line-continuation character
C:\temp\test.c:4:34: warning: multi-line comment
C:\temp\test.c(5) : warning C4010: single-line comment contains line-continuation character
I have run into some code and was wondering what the original developer was up to. Below is a simplified program using this pattern:
#include <stdio.h>
int main() {
char title[80] = "mytitle";
char title2[80] = "mayataiatale";
char mystring[80];
/* hugh ? */
sscanf(title,"%[^a]",mystring);
printf("%s\n",mystring); /* Output is "mytitle" */
/* hugh ? */
sscanf(title2,"%[^a]",mystring); /* Output is "m" */
printf("%s\n",mystring);
return 0;
}
The man page for scanf has relevant information, but I'm having trouble reading it. What is the purpose of using this sort of notation? What is it trying to accomplish?
The main reason for the character classes is so that the %s notation stops at the first white space character, even if you specify field lengths, and you quite often don't want it to. In that case, the character class notation can be extremely helpful.
Consider this code to read a line of up to 10 characters, discarding any excess, but keeping spaces:
#include <ctype.h>
#include <stdio.h>
int main(void)
{
char buffer[10+1] = "";
int rc;
while ((rc = scanf("%10[^\n]%*[^\n]", buffer)) >= 0)
{
int c = getchar();
printf("rc = %d\n", rc);
if (rc >= 0)
printf("buffer = <<%s>>\n", buffer);
buffer[0] = '\0';
}
printf("rc = %d\n", rc);
return(0);
}
This was actually example code for a discussion on comp.lang.c.moderated (circa June 2004) related to getline() variants.
At least some confusion reigns. The first format specifier, %10[^\n], reads up to 10 non-newline characters and they are assigned to buffer, along with a trailing null. The second format specifier, %*[^\n] contains the assignment suppression character (*) and reads zero or more remaining non-newline characters from the input. When the scanf() function completes, the input is pointing at the next newline character. The body of the loop reads and prints that character, so that when the loop restarts, the input is looking at the start of the next line. The process then repeats. If the line is shorter than 10 characters, then those characters are copied to buffer, and the 'zero or more non-newlines' format processes zero non-newlines.
The constructs like %[a] and %[^a] exist so that scanf() can be used as a kind of lexical analyzer. These are sort of like %s, but instead of collecting a span of as many "stringy" characters as possible, they collect just a span of characters as described by the character class. There might be cases where writing %[a-zA-Z0-9] might make sense, but I'm not sure I see a compelling use case for complementary classes with scanf().
IMHO, scanf() is simply not the right tool for this job. Every time I've set out to use one of its more powerful features, I've ended up eventually ripping it out and implementing the capability in a different way. In some cases that meant using lex to write a real lexical analyzer, but usually doing line at a time I/O and breaking it coarsely into tokens with strtok() before doing value conversion was sufficient.
Edit: I ended ripping out scanf() typically because when faced with users insisting on providing incorrect input, it just isn't good at helping the program give good feedback about the problem, and having an assembler print "Error, terminated." as its sole helpful error message was not going over well with my user. (Me, in that case.)
It's like character sets from regular expressions; [0-9] matches a string of digits, [^aeiou] matches anything that isn't a lowercase vowel, etc.
There are all sorts of uses, such as pulling out numbers, identifiers, chunks of whitespace, etc.
You can read about it in the ISO/IEC9899 standard available online.
Here is a paragraph I quote from the document about [ (Page 286):
Matches a nonempty sequence of characters from a set of expected
characters.
The conversion specifier includes all subsequent characters in the
format string, up to and including the matching right bracket (]). The
characters between the brackets (the scanlist) compose the scanset,
unless the character after the left bracket is a circumflex (^), in
which case the scanset contains all characters that do not appear in
the scanlist between the circumflex and the right bracket. If the
conversion specifier begins with [] or [^], the right bracket
character is in the scanlist and the next following right bracket
character is the matching right bracket that ends the specification;
otherwise the first following right bracket character is the one that
ends the specification. If a - character is in the scanlist and is not
the first, nor the second where the first character is a ^, nor the
last character, the behavior is implementation-defined.