Why the different behavior of snprintf vs swprintf? - c

The C standard states the following about the standard library function snprintf:
"The snprintf function is equivalent to fprintf, except that the
output is written into an array (specified by arguments) rather than
to a stream. If n is zero, nothing is written, and s may be a null
pointer. Otherwise, output characters beyond the n-1st are discarded
rather than being written to the array, and a null character is
written at the end of the characters actually written into the array.
If copying takes place between objects that overlap, the behavior is
undefined."
"The snprintf function returns the number of characters that would
have been written had n been sufficiently large, not counting the
terminating null character, or a negative value if an encoding error
occurred. Thus, the null-terminated output has been completely written
if and only if the returned value is nonnegative and less than n."
Compare it to the statement about swprintf:
"The swprintf function is equivalent to fwprintf, except that the
argument s specifies an array of wide characters into which the
generated output is to be written, rather than written to a stream. No
more than n wide characters are written, including a terminating null
wide character, which is always added (unless n is zero)."
"The swprintf function returns the number of wide characters written
in the array, not counting the terminating null wide character, or a
negative value if an encoding error occurred or if n or more wide
characters were requested to be written."
At first glance it may seem like snprintf and swprintf are completely equivalent, the latter merely handling wide strings and the former narrow strings. However, that's not the case. While snprintf returns the number of characters that would have been written if n had been large enough, swprintf returns a negative value in this case, which means you can't know how many characters would have been written had there been enough space. The two functions are therefore not fully interchangeable: swprintf can't be used for some things that snprintf can, such as determining how long the output buffer needs to be before actually creating it.
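To make the sizing use case concrete, here is a minimal sketch of the two-pass snprintf idiom the paragraph above refers to. The helper name format_int_alloc and the "%s=%d" format are made up for illustration; only snprintf's documented return value is relied upon.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: format "label=value" into a freshly allocated
   string. Returns NULL on encoding error or allocation failure.
   The caller frees the result. */
char *format_int_alloc(const char *label, int value)
{
    /* First pass: n is zero, so nothing is written and s may be NULL.
       The return value is the length the complete output would have. */
    int needed = snprintf(NULL, 0, "%s=%d", label, value);
    if (needed < 0)
        return NULL;

    char *buf = malloc((size_t)needed + 1);
    if (buf == NULL)
        return NULL;

    /* Second pass: the buffer is now known to be large enough. */
    snprintf(buf, (size_t)needed + 1, "%s=%d", label, value);
    return buf;
}
```

The same two-pass pattern is not available with swprintf, since a too-small n only yields a negative return value.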
Why would they make this difference? I suppose the behavior of swprintf makes the implementation more efficient when n is too small, but still, why the difference? I don't think it's even a question of snprintf being older and thus "legacy" and "dragging the weight of its history, which can't be changed later" and swprintf being newer and thus free to be improved, because both were introduced in C99.
There is, however, another, significantly subtler difference between the two specifications. They are not carbon copies of each other that differ only in the return value: what happens when n is too small for the string to be printed is left somewhat ambiguous for swprintf.
The specification for snprintf quite clearly states that the output will be written up to n-1 characters even when n is too small, and the null character will be written at the end of it always. The specification of swprintf almost states this... except it leaves it ambiguous in its specification of the return value.
More specifically a negative return value is used to signal that an error occurred while trying to write the string to the destination. A negative value is also returned when n was too small. It's left ambiguous whether this is actually considered an error situation or not. This is significant because if it's considered an error, then the implementation is free to not write all, or anything, into the destination, because it can be signaling "an error occurred, the output is invalid". The first paragraph of the specification makes it sound like at most n-1 characters are always written, and an ending null character is always written ("which is always added"), but the second paragraph about the return value leaves it ambiguous whether this is actually an error situation and whether the implementation can choose not to write those things in this case.
This is significant because the glibc implementation of swprintf does not write the final null character when n is too small, making the result an invalid string. While I can't find definitive information on this, I have got the impression that the developers of glibc have interpreted the standard in such a manner that they don't have to write the final null character (or anything) to the output because this is an error situation.
The thing is that the standard seems to be very ambiguous and vague in this regard. Is it a correct interpretation? Or are they misinterpreting? Why would the standard leave it this ambiguous?
My interpretation differs from that of the glibc developers. I understand the second paragraph to mean:
A negative value is returned if:
an encoding error occurred, or
n or more wide characters were requested to be written.
I don't see how this could be interpreted as n being too small being considered an error.
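Given this ambiguity, one defensive approach is to not rely on the buffer contents at all after a negative return. The following is only a sketch under that assumption. The helper name format_wide_checked is hypothetical, and it treats "too small" and "encoding error" identically because the return value cannot distinguish them.

```c
#include <wchar.h>

/* Hypothetical helper: returns 1 if the formatted text fit, 0 otherwise.
   On failure the buffer is forced to an empty string, because whatever
   swprintf left behind is not relied upon here. */
int format_wide_checked(wchar_t *dst, size_t n, const wchar_t *text)
{
    if (n == 0)
        return 0;

    int rc = swprintf(dst, n, L"%ls", text);
    if (rc < 0) {
        /* Either an encoding error occurred or n was too small.
           The return value alone cannot tell these apart. */
        dst[0] = L'\0';
        return 0;
    }
    return 1;
}
```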

Related

Portability of using precision when printf-ing non-0-terminated strings

As multiple questions on here also point out, you can printf a nonterminated string by formatting with a precision as maximum length to print. Something like
printf("%.*s\n", length, str);
will print length chars starting at str (or until the first 0 byte).
As pointed out here by jonathan-leffler, this is specified by posix here. But when reading the doc I discovered it never actually states that this should work (or I couldn't find it), as "The ‘%s’ conversion prints a string." and "A string is a null-terminated array of bytes [...] ". The passage about the precision states "A precision can be specified to indicate the maximum number of characters to write;".
My interpretation would be that the line above is actually undefined behavior, but because printf's implementation is efficient it doesn't read more than it writes.
So my question is: is this interpretation correct?
TLDR:
Should I stop using this printf trick when trying to be posix compliant as there exists an implementation where this might cause a buffer-overrun?
What you're reading isn't the actual POSIX spec, but the GNU libc manual, which tends to be a little less precise for the sake of readability. The actual spec can be found at https://pubs.opengroup.org/onlinepubs/9699919799/functions/printf.html (it's even linked from Jonathan Leffler's answer which you link to), and it makes it clear that your code is fine:
s
The argument shall be a pointer to an array of char. Bytes from the array shall be written up to (but not including) any terminating null byte. If the precision is specified, no more than that many bytes shall be written. If the precision is not specified or is greater than the size of the array, the application shall ensure that the array contains a null byte.
Note that they are careful not to use the word "string" for exactly the reason you point out.
The ISO C17 standard uses almost identical language, so your code is even portable to non-POSIX standard C implementations. (POSIX generally incorporates ISO C and many parts of the POSIX spec are copy/pasted from the C standard.)
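As a small illustration of the quoted rule, the sketch below formats a char array that has no terminating null byte, using a precision equal to the array length. The helper name format_bytes is made up for this example, and it assumes the length fits in an int, since the argument consumed by `*` must be an int.

```c
#include <stdio.h>

/* Hypothetical helper: copy at most len bytes of a possibly
   non-terminated array into a terminated buffer via "%.*s".
   Assumes len fits in an int. */
int format_bytes(char *dst, size_t dst_size, const char *src, size_t len)
{
    /* The precision bounds how many bytes of src are written, so src
       needs no null byte as long as the array holds at least len bytes. */
    return snprintf(dst, dst_size, "%.*s", (int)len, src);
}
```

Called with a five-byte array holding h, e, l, l, o and len set to 5, this writes "hello" to dst.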

Does sscanf require a null terminated string as input?

A recently discovered explanation for GTA's lengthy load times (1) showed that many implementations of sscanf() call strlen() on their input string to set up a context object for an internal routine shared with other scanning functions (scanf(), fscanf()...). This can become a performance bottleneck when the input string is very long. Parsing a 10MB JSON file loaded as a string with repeated calls to sscanf() with an offset and a %n conversion proved to be a dominant cause for the load time.
My question is should sscanf() even read the input string beyond the bytes necessary for the conversions to complete? For example does the following code invoke undefined behavior:
#include <stdio.h>

int test(void) {
    char buf[1] = { '1' };  /* deliberately not null-terminated */
    int v;
    sscanf(buf, "%1d", &v);
    return v;
}
The function should return 1 and does not need to read more than one byte from buf, but is sscanf() allowed to read from buf beyond the first byte?
(1) references provided by JdeBP:
https://nee.lv/2021/02/28/How-I-cut-GTA-Online-loading-times-by-70/
https://news.ycombinator.com/item?id=26297612
https://github.com/biojppm/rapidyaml/issues/40
Here are the relevant parts from the C Standard:
7.21.6.7 The sscanf function
Synopsis
#include <stdio.h>
int sscanf(const char * restrict s, const char * restrict format, ...);
Description
The sscanf function is equivalent to fscanf, except that input is obtained from a string (specified by the argument s) rather than from a stream. Reaching the end of the string is equivalent to encountering end-of-file for the fscanf function. If copying takes place between objects that overlap, the behavior is undefined.
Returns
The sscanf function returns the value of the macro EOF if an input failure occurs before the first conversion (if any) has completed. Otherwise, the sscanf function returns the number of input items assigned, which can be fewer than provided for, or even zero, in the event of an early matching failure.
The input is specifically referred to as a string, so it should be null-terminated.
Although none of the characters in the string beyond the initial prefix that matches the conversion specifier (and potentially the next byte, which helps determine the end of the matching sequence) are used for the conversion, these characters must be followed by a null terminator so that the input is a well-formed string, and it is conforming for an implementation to call strlen() on it to determine the input length.
To avoid linear time complexity on long input strings, sscanf() should limit the scan for the end of string to a small size with strnlen() or equivalent and pass an appropriate refill function. Passing a huge length and letting the internal routine special case the null byte is an even better approach.
In the meantime, programmers should avoid passing long input strings to sscanf() and use more specialized functions for their parsing tasks, such as strtol(), which also requires a well-formed C string, but is implemented in a more conservative way. This would also avoid potential undefined behavior on number conversions for out-of-range string representations.
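A minimal sketch of the strtol() approach suggested above: it walks a string of whitespace-separated integers, resuming each time where the previous conversion stopped, so no call rescans the whole input. The helper name parse_ints and its return convention are assumptions made for this example.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper: parse up to max whitespace-separated integers
   from s into out. Returns the number parsed, or -1 if a value is
   out of range for long. */
long parse_ints(const char *s, long *out, size_t max)
{
    size_t count = 0;

    while (count < max) {
        char *end;
        errno = 0;
        long v = strtol(s, &end, 10);
        if (end == s)           /* no conversion: end of input or non-number */
            break;
        if (errno == ERANGE)    /* out of range is reported, not undefined */
            return -1;
        out[count++] = v;
        s = end;                /* resume where strtol stopped */
    }
    return (long)count;
}
```

Unlike a %d conversion in sscanf(), an out-of-range value here is detectable through errno.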
When the Standard was written, many library functions were handled identically by almost all existing implementations, but some implementations may have had good reasons for handling a few cases differently. If the number of implementations that would have reason to differ from the commonplace behavior was substantial, then the Committee would either require that all implementations behave in the common fashion (as happens when e.g. computing UINT_MAX+1u), or explicitly state that they were not required to do so (as when e.g. computing INT_MAX+1). In cases where there was a clear common behavior, but it might not be practical on all implementations, however, the Committee generally simply refrained from saying anything, on the presumption that most compilers would have no reason to deviate from the common behavior, and the authors of those that would have reason to deviate would be better placed than the Committee to judge the pros and cons of following the common behavior versus deviating from it.
The sscanf behavior at issue fits the latter pattern. The Committee didn't want to mandate that implementations which would have trouble if the data source didn't have a trailing zero byte must be changed to deal with such data sources, but nor did they want to require that programmers copy data from sources that don't have trailing zero bytes to places that do before using sscanf upon it even when their implementation wouldn't care about anything beyond the portion of the source that would be meaningfully examined. Since makers of implementations that require a trailing zero will likely block any change to the Standard that would require them to tolerate its absence, and programmers whose implementations impose no such needless requirement will block any change to the Standard that would require that they add extra data-copying steps to their code, the situation will remain deadlocked unless people can agree to categorize implementations that impose the trailing-byte requirement as "conforming but deficient" and require that they indicate such deficiency via predefined macro or other such means.

Is subtracting a char by '0' to convert to int bad practice?

I'm expecting a single-digit integer input, and have error handling in place already if this is not the case. Are there any potential unforeseen consequences of simply subtracting '0' from the input character to "convert" it into an integer?
I'm not looking for opinions on readability or what's more commonly used (although they wouldn't hurt as an extension to the answer), but simply whether or not it's a reliable form of conversion. If I ask the user to input an integer between 0 and 9, is there any scenario in which there can be input that input = input-'0' should handle, but doesn't?
This is safe and guaranteed by the C language. In the current version, C11, the relevant text is 5.2.1 Character sets, ¶3:
In both the source and execution basic character sets, the value of each character after 0 in the above list of decimal digits shall be one greater than the value of the previous.
As for whether it's "bad practice", that's a matter of opinion, but I would say no. It's both idiomatic (commonly used and understood by C programmers) and lacks any alternative that's not confusing and inefficient. For example nobody reading C would want to see this written as a switch statement with 10 cases or by setting up a dummy one-character string to pass to atoi.
The order of characters is encoding/system-dependent, so one must not rely on a particular order in general. For the sequence of digits 0..9 in any system, however, it is guaranteed that it starts with 0 and continues to 9 without any intermediate characters. So input = input - '0' is perfect as long as you guarantee that input contains a digit (e.g. by using isdigit).
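Putting the two points together (validate first, then subtract), a minimal sketch could look like this. The helper name digit_value and the -1 error convention are assumptions for illustration.

```c
#include <ctype.h>

/* Hypothetical helper: return the numeric value of a decimal digit
   character, or -1 if c is not a digit. */
int digit_value(char c)
{
    /* isdigit expects a value representable as unsigned char (or EOF),
       so a plain char is cast before the call. */
    if (!isdigit((unsigned char)c))
        return -1;

    /* '0'..'9' are guaranteed to be contiguous and ascending. */
    return c - '0';
}
```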

What is `scanf` supposed to do with incomplete exponent-part?

Take for example rc = scanf("%f", &flt); with the input 42ex. An implementation of scanf would read 42e expecting a digit or sign after it, and realize only when reading x that none follows. Should it at this point push back both x and e? Or should it only push back the x?
The reason I ask is that GNU's libc will on a subsequent call to gets return ex indicating they've pushed back both x and e, but the standard says:
An input item is read from the stream, unless the specification includes an n specifier. An input item is defined as the longest sequence of input characters which does not exceed any specified field width and which is, or is a prefix of, a matching input sequence. The first character, if any, after the input item remains unread. If the length of the input item is zero, the execution of the directive fails; this condition is a matching failure unless end-of-file, an encoding error, or a read error prevented input from the stream, in which case it is an input failure.
My interpretation is that 42e is a prefix of a matching input sequence (since, for example, 42e1 would be a matching input sequence), so 42e should be treated as an input item and read, leaving only x unread. That would also be more convenient to implement if the stream only supports single-character push back.
Your interpretation of the standard is correct. There's even an example further down in the C standard which says that 100ergs of energy shouldn't match %f%20s of %20s because 100e fails to match %f.
But most C libraries seem to implement this differently, probably due to historical reasons. I just checked the C library on macOS and it behaves like glibc. The corresponding glibc bug was closed as WONTFIX with the following explanation from Ulrich Drepper:
This is stupidity on the ISO C committee side which goes against existing
practice. Any change can break existing code.
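For contrast, strtof() is specified differently from scanf(): it converts the longest initial subsequence that has the expected form and reports where it stopped, so for "42ex" it converts 42 and leaves "ex" unconsumed. When the input is already in a string, this sidesteps the push-back question entirely. The sketch below is only an illustration of that alternative. The helper name parse_float_prefix is hypothetical.

```c
#include <stdlib.h>

/* Hypothetical helper: parse a float at the start of s.
   Returns 1 and sets *value and *rest on success, 0 if no
   conversion could be performed. */
int parse_float_prefix(const char *s, float *value, const char **rest)
{
    char *end;
    float v = strtof(s, &end);

    if (end == s)       /* nothing of the expected form was found */
        return 0;

    *value = v;
    *rest = end;        /* first character not consumed by the conversion */
    return 1;
}
```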

Why does C not have an snwprintf function?

Does anyone know why there is no snwprintf function in the C standard library?
I am aware of swprintf, but that doesn't have the same semantics of a true, wchar_t version of snprintf. As far as I can tell, there is no easy way to implement an snwprintf function using [v]swprintf:
Unlike snprintf, swprintf does not return the necessary buffer size; if the supplied buffer is insufficient, it simply returns -1. This is indistinguishable from failure due to encoding errors, so I can't keep retrying with progressively larger buffers hoping that it eventually will succeed.
I suppose I could set the last element of the buffer to be non-NUL, call swprintf, and assume that truncation occurred if that element is NUL afterward. However, is that guaranteed to work? The standard does not specify what state the buffer should be in if swprintf fails. (In contrast, snprintf describes which characters are written and which are discarded.)
See the answer given by Larry Jones here.
Essentially, swprintf was added in C95 while snprintf was added in C99. Since many implementations of snprintf already returned the number of required characters, and that seemed a useful thing to do, that was the behavior that was standardized. The committee didn't think the behavior was important enough to break backwards compatibility by adding it to swprintf, which had been standardized without it several years earlier.
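In practice, the usual workaround is the retry loop the question is wary of, bounded by a cap so that a genuine encoding error cannot make it loop forever. The following is only a sketch of that idea, not a standard function: the name wformat_alloc, the initial size of 16 and the max_len parameter are all assumptions, and it relies solely on vswprintf returning a negative value when the buffer is too small.

```c
#include <stdarg.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper, not a standard function: format into a heap
   buffer, doubling its size until vswprintf succeeds or the capacity
   has reached max_len. Returns NULL on allocation failure or if no
   attempted size works (which also covers encoding errors).
   The caller frees the result. */
wchar_t *wformat_alloc(size_t max_len, const wchar_t *fmt, ...)
{
    size_t cap = 16;
    wchar_t *buf = NULL;

    for (;;) {
        wchar_t *grown = realloc(buf, cap * sizeof *grown);
        if (grown == NULL)
            break;
        buf = grown;

        /* The argument list must be restarted for every attempt. */
        va_list ap;
        va_start(ap, fmt);
        int rc = vswprintf(buf, cap, fmt, ap);
        va_end(ap);

        if (rc >= 0)
            return buf;     /* everything fit, including the terminator */
        if (cap >= max_len)
            break;          /* give up: too long or an encoding error */
        cap *= 2;
    }

    free(buf);
    return NULL;
}
```

The cap is the price paid for swprintf's return value: without it, an encoding error and a too-small buffer look identical to the caller.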

Resources