Portability of using precision when printf-ing non-null-terminated strings - C

As multiple questions on here point out, you can printf a non-terminated string by specifying a precision as the maximum length to print. Something like
printf("%.*s\n", length, str);
will print length chars starting at str (or until the first 0 byte).
As pointed out by Jonathan Leffler, this is specified by POSIX. But when reading the documentation I discovered it never actually states this should work (or I couldn't find it), as "The ‘%s’ conversion prints a string." and "A string is a null-terminated array of bytes [...]". The remark about the precision states "A precision can be specified to indicate the maximum number of characters to write;".
My interpretation would be that the line above is actually undefined behavior, but because printf implementations are efficient they don't read more than they write.
So my question is: is this interpretation correct?
TLDR:
Should I stop using this printf trick when trying to be POSIX-compliant, in case an implementation exists where this might cause a buffer overrun?

What you're reading isn't the actual POSIX spec, but the GNU libc manual, which tends to be a little less precise for the sake of readability. The actual spec can be found at https://pubs.opengroup.org/onlinepubs/9699919799/functions/printf.html (it's even linked from Jonathan Leffler's answer which you link to), and it makes it clear that your code is fine:
s
The argument shall be a pointer to an array of char. Bytes from the array shall be written up to (but not including) any terminating null byte. If the precision is specified, no more than that many bytes shall be written. If the precision is not specified or is greater than the size of the array, the application shall ensure that the array contains a null byte.
Note that they are careful not to use the word "string" for exactly the reason you point out.
The ISO C17 standard uses almost identical language, so your code is even portable to non-POSIX standard C implementations. (POSIX generally incorporates ISO C and many parts of the POSIX spec are copy/pasted from the C standard.)
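To make this concrete, here is a minimal sketch of the pattern under discussion (the buffer name and contents are made up for illustration):

#include <stdio.h>

int main(void) {
    /* Deliberately not null-terminated: exactly 5 bytes, no trailing '\0'. */
    char buf[5] = { 'h', 'e', 'l', 'l', 'o' };
    /* The precision argument for %.*s must be an int. */
    printf("%.*s\n", (int)sizeof buf, buf);   /* prints "hello" */
    return 0;
}

Per the quoted wording, a null byte is only required when the precision is unspecified or greater than the size of the array, so this stays within the guarantee.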

Related

Why the different behavior of snprintf vs swprintf?

The C standard states the following about the standard library function snprintf:
"The snprintf function is equivalent to fprintf, except that the
output is written into an array (specified by arguments) rather than
to a stream. If n is zero, nothing is written, and s may be a null
pointer. Otherwise, output characters beyond the n-1st are discarded
rather than being written to the array, and a null character is
written at the end of the characters actually written into the array.
If copying takes place between objects that overlap, the behavior is
undefined."
"The snprintf function returns the number of characters that would
have been written had n been sufficiently large, not counting the
terminating null character, or a negative value if an encoding error
occurred. Thus, the null-terminated output has been completely written
if and only if the returned value is nonnegative and less than n."
Compare it to the statement about swprintf:
"The swprintf function is equivalent to fwprintf, except that the
argument s specifies an array of wide characters into which the
generated output is to be written, rather than written to a stream. No
more than n wide characters are written, including a terminating null
wide character, which is always added (unless n is zero)."
"The swprintf function returns the number of wide characters written
in the array, not counting the terminating null wide character, or a
negative value if an encoding error occurred or if n or more wide
characters were requested to be written."
At first glance it may seem like snprintf and swprintf are completely equivalent to each other, the latter merely handling wide strings and the former narrow strings. However, that's not the case. While snprintf returns the number of characters that would have been written if n had been large enough, swprintf returns a negative value in this case (which means that you can't know how many characters would have been written if there had been enough space). This makes the two functions not fully interchangeable, because their behavior differs in this regard (and thus the latter can't be used for some things that the former can, such as determining how large the output buffer would need to be before actually creating it).
Why would they make this difference? I suppose the behavior of swprintf makes the implementation more efficient when n is too small, but still, why the difference? I don't think it's even a question of snprintf being older and thus "legacy" and "dragging the weight of its history, which can't be changed later" and swprintf being newer and thus free to be improved, because both were introduced in C99.
There is, however, another, significantly subtler difference between the two specifications. If you look closely, they are not merely carbon copies of each other with only the return value differing. The other, much subtler difference is the somewhat ambiguous behavior of what happens when n is too small for the string to be printed.
The specification for snprintf quite clearly states that up to n-1 characters of output will be written even when n is too small, and that a null character will always be written at the end. The specification of swprintf almost states this... except it leaves it ambiguous in its specification of the return value.
More specifically a negative return value is used to signal that an error occurred while trying to write the string to the destination. A negative value is also returned when n was too small. It's left ambiguous whether this is actually considered an error situation or not. This is significant because if it's considered an error, then the implementation is free to not write all, or anything, into the destination, because it can be signaling "an error occurred, the output is invalid". The first paragraph of the specification makes it sound like at most n-1 characters are always written, and an ending null character is always written ("which is always added"), but the second paragraph about the return value leaves it ambiguous whether this is actually an error situation and whether the implementation can choose not to write those things in this case.
This is significant because the glibc implementation of swprintf does not write the final null character when n is too small, making the result an invalid string. While I can't find definitive information on this, I have got the impression that the developers of glibc have interpreted the standard in such a manner that they don't have to write the final null character (or anything) to the output because this is an error situation.
The thing is that the standard seems to be very ambiguous and vague in this regard. Is it a correct interpretation? Or are they misinterpreting? Why would the standard leave it this ambiguous?
My interpretation differs from that of the glibc developers. I understand the second paragraph to mean:
A negative value is returned if:
an encoding error occurred, or
n or more wide characters were requested to be written.
I don't see how this could be interpreted as n being too small being considered an error.
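A small sketch of how that difference shows up in practice (buffer sizes are arbitrary; only the snprintf half has a portably useful return value when the buffer is too small):

#include <stdio.h>
#include <wchar.h>

int main(void) {
    char small[4];
    /* snprintf: returns the length that would have been written (11 here),
       so the caller can allocate a big-enough buffer and retry. */
    int need = snprintf(small, sizeof small, "%s", "hello world");
    printf("snprintf returned %d\n", need);            /* 11 */

    wchar_t wsmall[4];
    /* swprintf: per the quoted wording, the return value is negative when
       n or more wide characters would be needed, so the required length
       cannot be recovered from it. */
    int rc = swprintf(wsmall, sizeof wsmall / sizeof wsmall[0],
                      L"%ls", L"hello world");
    printf("swprintf returned %d\n", rc);              /* negative */
    return 0;
}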

Does sscanf require a null terminated string as input?

A recently discovered explanation for GTA Online's lengthy load times(1) showed that many implementations of sscanf() call strlen() on their input string to set up a context object for an internal routine shared with other scanning functions (scanf(), fscanf()...). This can become a performance bottleneck when the input string is very long. Parsing a 10MB JSON file loaded as a string with repeated calls to sscanf() with an offset and a %n conversion proved to be a dominant cause of the load time.
My question is should sscanf() even read the input string beyond the bytes necessary for the conversions to complete? For example does the following code invoke undefined behavior:
int test(void) {
char buf[1] = { '1' };  /* deliberately not null-terminated */
int v;
sscanf(buf, "%1d", &v);
return v;
}
The function should return 1 and does not need to read more than one byte from buf, but is sscanf() allowed to read from buf beyond the first byte?
(1) references provided by JdeBP:
https://nee.lv/2021/02/28/How-I-cut-GTA-Online-loading-times-by-70/
https://news.ycombinator.com/item?id=26297612
https://github.com/biojppm/rapidyaml/issues/40
Here are the relevant parts from the C Standard:
7.21.6.7 The sscanf function
Synopsis
#include <stdio.h>
int sscanf(const char * restrict s, const char * restrict format, ...);
Description
The sscanf function is equivalent to fscanf, except that input is obtained from a string (specified by the argument s) rather than from a stream. Reaching the end of the string is equivalent to encountering end-of-file for the fscanf function. If copying takes place between objects that overlap, the behavior is undefined.
Returns
The sscanf function returns the value of the macro EOF if an input failure occurs before the first conversion (if any) has completed. Otherwise, the sscanf function returns the number of input items assigned, which can be fewer than provided for, or even zero, in the event of an early matching failure.
The input is specifically referred to as a string, so it should be null terminated.
Although none of the characters in the string beyond the initial prefix that matches the conversion specifier (and potentially the next byte that helped determine the end of the matching sequence) are used for the conversion, these characters must be followed by a null terminator so that the input is a well-formed string, and it is conforming for the implementation to call strlen() on it to determine the input length.
To avoid linear time complexity on long input strings, sscanf() should limit the scan for the end of string to a small size with strnlen() or equivalent and pass an appropriate refill function. Passing a huge length and letting the internal routine special case the null byte is an even better approach.
In the meantime, programmers should avoid passing long input strings to sscanf() and use more specialized functions for their parsing tasks, such as strtol(), which also requires a well-formed C string but is implemented in a more conservative way. This would also avoid potential undefined behavior on number conversions for out-of-range string representations.
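As a rough sketch of the kind of replacement meant here (the input string is invented; the point is that strtol() advances a pointer itself instead of forcing a rescan of the whole buffer on every call):

#include <stdlib.h>
#include <stdio.h>

int main(void) {
    const char *input = "12 345 6789";   /* imagine a much longer buffer */
    const char *p = input;
    char *end;
    long sum = 0;

    /* strtol() stops at the first non-numeric character and reports the
       stopping point via 'end', so each call only touches the bytes it
       actually parses. */
    for (;;) {
        long v = strtol(p, &end, 10);
        if (end == p)
            break;                        /* no more numbers */
        sum += v;
        p = end;
    }
    printf("%ld\n", sum);                 /* 7146 */
    return 0;
}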
When the Standard was written, many library functions were handled identically by almost all existing implementations, but some implementations may have had good reasons for handling a few cases differently. If the number of implementations that would have reason to differ from the commonplace behavior was substantial, then the Committee would either require that all implementations behave in the common fashion (as happens when e.g. computing UINT_MAX+1u), or explicitly state that they were not required to do so (as when e.g. computing INT_MAX+1). In cases where there was a clear common behavior, but it might not be practical on all implementations, however, the Committee generally simply refrained from saying anything, on the presumption that most compilers would have no reason to deviate from the common behavior, and the authors of those that would have reason to deviate would be better placed than the Committee to judge the pros and cons of following the common behavior versus deviating from it.
The sscanf behavior at issue fits the latter pattern. The Committee didn't want to mandate that implementations which would have trouble if the data source didn't have a trailing zero byte must be changed to deal with such data sources, but nor did they want to require that programmers copy data from sources that don't have trailing zero bytes to places that do before using sscanf upon it, even when their implementation wouldn't care about anything beyond the portion of the source that would be meaningfully examined. Since makers of implementations that require a trailing zero will likely block any change to the Standard that would require them to tolerate its absence, and programmers whose implementations impose no such needless requirements will block any change to the Standard that would require that they add extra data-copying steps to their code, the situation will remain deadlocked unless people can agree to categorize implementations that impose the trailing-byte requirement as "conforming but deficient" and require that they indicate such deficiency via a predefined macro or other such means.

Is there a limit on the number of values that can be printed by a single call of printf?

Does the number of values printed by printf depend on the memory allocated for a specific program, or can it keep on printing values indefinitely?
The C Standard documents the minimum number of arguments that a compiler should accept for a function call:
C11 5.2.4.1 Translation limits
The implementation shall be able to translate and execute at least one program that contains at least one instance of every one of the following limits:
...
127 arguments in one function call
...
Therefore, you should be able to pass at least 126 values to printf after the initial format string, assuming the format string is properly constructed and consistent with the actual arguments that follow.
If the format string is a string literal, the standard guarantees that the compiler can handle string literals at least 4095 bytes long, and source lines at least 4095 characters long. You can use string concatenation to split the literal on multiple source lines. If you use a char array for the format string, no such limitation exists.
The only environmental limit documented for the printf family of functions is this:
The number of characters that can be produced by any single conversion shall be at least 4095
This makes the behavior of a format such as %10000d at best implementation-defined; the standard does not mandate anything beyond the 4095-character minimum.
A compliant compiler/library combination should therefore accept at least 126 values for printf. Whether your environment allows even more arguments may be defined by the implementation and documented as such, but it is not guaranteed by the standard.
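For illustration, a minimal sketch of the string-literal concatenation mentioned above (the values and format are made up):

#include <stdio.h>

int main(void) {
    int id = 42;
    double x = 1.5, y = -2.25;
    /* Adjacent string literals are concatenated at translation time,
       so a long format string can be split across source lines. */
    printf("id=%d "
           "x=%.2f "
           "y=%.2f\n",
           id, x, y);
    return 0;
}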

Unsigned character gotchas in C

Most C compilers use signed characters. Most C libraries define EOF as -1.
Despite being a long-time C programmer I had never before put these two facts together and so in the interest of robust and international software I would ask for a bit of help in spelling out the implications.
Here is what I have discovered thus far:
fgetc() and friends cast to unsigned characters before returning as int to avoid clashing with EOF.
Therefore care needs to be taken with the results, e.g. getchar() == (unsigned char) 'µ'.
Theoretically I believe that not even the basic character set is guaranteed to be positive.
The <ctype.h> functions are designed to handle EOF and expect unsigned character values. Any other negative input may cause out-of-bounds addressing.
Most functions taking character parameters as integers ignore EOF and will accept signed or unsigned characters interchangeably.
String comparison (strcmp/strncmp/memcmp) compares unsigned character strings.
It may not be possible to discriminate EOF from a proper character on systems where sizeof(int) == 1.
The wide character functions are not used for binary I/O, and so WEOF is defined within the range of wchar_t.
Is this assessment correct and if so what other gotchas did I miss?
Full disclosure: I ran into an out-of-bounds indexing bug today when feeding non-ASCII characters to isspace() and the realization of the amount of lurking bugs in my old code both scared and annoyed me. Hence this frustrated question.
The basic execution character set is guaranteed to be nonnegative - the precise wording in C99 is:
If a member of the basic execution character set is stored in a char
object, its value is guaranteed to be nonnegative.
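A minimal sketch of the pattern that avoids the out-of-bounds indexing described in the question; the cast through unsigned char is the essential part, and the sample string is invented:

#include <ctype.h>
#include <stdio.h>

int main(void) {
    /* The bytes 0xC3 0xA9 (UTF-8 for 'é') are negative as plain char on
       implementations where char is signed, which is what triggers the bug. */
    const char *s = "caf\xC3\xA9 noir";
    for (const char *p = s; *p != '\0'; p++) {
        /* Convert through unsigned char so the value passed to isspace() is
           representable as an unsigned char, as <ctype.h> requires. */
        if (isspace((unsigned char)*p))
            printf("whitespace at offset %td\n", p - s);
    }
    return 0;
}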

C - isgraph() function

Does anyone know how the isgraph() function works in C? I understand its use and results, but the code behind it is what I'm interested in.
For example, does it look at only the char value of it and compare it to the ASCII table? Or does it actually check to see if it can be displayed? If so, how?
The code behind the isgraph() function varies by platform (or, more precisely, by implementation). One common technique is to use an initialized array of bit-fields, one per character in the (single-byte) codeset plus EOF (which has to be accepted by the functions), and then selecting the relevant bit. This allows for a simple implementation as a macro which is safe (only evaluates its argument once) and as a simple (possibly inline) function.
#define isgraph(x) (__charmap[(x)+1]&__PRINT)
where __charmap and __PRINT are names reserved for the implementation. The +1 part deals with the common situation where EOF is -1.
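As a rough, purely illustrative sketch of such a table-based implementation (the names and the single MY_PRINT bit are invented; a real library would build the table per locale and define more classification bits):

#include <stdio.h>

#define MY_PRINT 0x01                     /* the "graphic" bit in this sketch */

/* One entry per byte value plus one for EOF (-1), hence the +1 offset. */
static unsigned char my_charmap[257];

static void init_charmap(void) {
    for (int c = 0; c < 256; c++)
        if (c >= 0x21 && c <= 0x7E)       /* ASCII graphic characters */
            my_charmap[c + 1] |= MY_PRINT;
}

#define my_isgraph(x) (my_charmap[(x) + 1] & MY_PRINT)

int main(void) {
    init_charmap();
    /* 'A' is graphic, ' ' is not, EOF (-1) maps safely to index 0. */
    printf("%d %d %d\n", !!my_isgraph('A'), !!my_isgraph(' '), !!my_isgraph(-1));
    return 0;
}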
According to the C standard (ISO/IEC 9899:1999):
§7.4.1.6 The isgraph function
Synopsis
#include <ctype.h>
int isgraph(int c);
Description
The isgraph function tests for any printing character except space (' ').
And:
§7.4 Character handling <ctype.h>
¶1 The header <ctype.h> declares several functions useful for classifying and mapping
characters.166) In all cases the argument is an int, the value of which shall be
representable as an unsigned char or shall equal the value of the macro EOF. If the
argument has any other value, the behavior is undefined.
¶2 The behavior of these functions is affected by the current locale. Those functions that
have locale-specific aspects only when not in the "C" locale are noted below.
¶3 The term printing character refers to a member of a locale-specific set of characters, each
of which occupies one printing position on a display device; the term control character
refers to a member of a locale-specific set of characters that are not printing
characters.167) All letters and digits are printing characters.
166) See ‘‘future library directions’’ (7.26.2).
167) In an implementation that uses the seven-bit US ASCII character set, the printing characters are those
whose values lie from 0x20 (space) through 0x7E (tilde); the control characters are those whose
values lie from 0 (NUL) through 0x1F (US), and the character 0x7F (DEL).
It's called isgraph, not isGraph (and char, not Char), and the POSIX Programmer's Manual says
The isgraph() function shall test
whether c is a character of class
graph in the program's current locale;
see the Base Definitions volume of
IEEE Std 1003.1-2001,
Chapter 7, Locale.
So yes, it looks it up in a table (or equivalent code). It can't check whether it can actually be displayed, since that would vary depending upon the output device, many of which can display chars in addition to those for which isgraph returns true.
isgraph checks for "printable" characters, but the definition of "printable" can vary depending on your locale. Your locale may use characters that aren't in the ASCII table. Internally, it's most likely either a table lookup, a range-based test ((x >= 'a') && (x <= 'z'), etc), or a combination of both. Different implementations may do it slightly differently.
The isgraph() macro only looks at the ASCII table, or your location/country/province/planet/galaxy's version of the ASCII table.
Here's some test code, Counting Words, which shows that you can increase performance by writing your own version that initializes a bool array[256] using isgraph(). There are benchmark results with the code.
Since bool variables/arrays are actually bytes, not bits, you can do even better in terms of memory efficiency if you use a bit array and test that. It happily takes up only 32 bytes. That's almost certainly going to stay cached on any general-purpose modern processor.
Importantly, if you want a slightly different test than the standard ones provided here (see graphic depiction of character tests), you are free to change the initialization provided by the standard test to include your own exceptions.
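A minimal sketch of the bit-array idea (the names are invented; the table is built once from isgraph(), so it reflects whatever the current locale defines):

#include <ctype.h>
#include <stdio.h>
#include <string.h>

static unsigned char graph_bits[32];      /* 256 bits = 32 bytes */

static void init_graph_bits(void) {
    memset(graph_bits, 0, sizeof graph_bits);
    for (int c = 0; c < 256; c++)
        if (isgraph(c))
            graph_bits[c / 8] |= (unsigned char)(1u << (c % 8));
}

static int fast_isgraph(unsigned char c) {
    /* One load plus a mask instead of a call into the C library. */
    return graph_bits[c / 8] & (1u << (c % 8));
}

int main(void) {
    init_graph_bits();
    printf("%d %d\n", !!fast_isgraph('A'), !!fast_isgraph(' '));   /* 1 0 */
    return 0;
}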
