Unsigned character gotchas in C - c

Most C compilers use signed characters. Most C libraries define EOF as -1.
Despite being a long-time C programmer I had never before put these two facts together and so in the interest of robust and international software I would ask for a bit of help in spelling out the implications.
Here is what I have discovered thus far:
fgetc() and friends cast to unsigned characters before returning as int to avoid clashing with EOF.
Therefore care needs to be taken with the results, e.g. getchar() == (unsigned char) 'µ'.
Theoretically I believe that not even the basic character set is guaranteed to be positive.
The <ctype.h> functions are designed to handle EOF and expected unsigned characters. Any other negative input may cause out-of-bounds addressing.
Most functions taking character parameters as integers ignore EOF and will accept signed or unsigned characters interchangeably.
String comparison (strcmp/strncmp/memcmp) compares unsigned character strings.
It may not be impossible to discriminate EOF from a proper characters on systems where sizeof(int) = 1.
The wide characters functions are not used for binary I/O and so WEOF is defined within the range of wchar_t.
Is this assessment correct and if so what other gotchas did I miss?
Full disclosure: I ran into an out-of-bounds indexing bug today when feeding non-ASCII characters to isspace() and the realization of the amount of lurking bugs in my old code both scared and annoyed me. Hence this frustrated question.

The basic execution character set is guaranteed to be nonnegative - the precise wording in C99 is:
If a member of the basic execution character set is stored in a char
object, its value is guaranteed to be nonnegative.

Related

Why the different behavior of snprintf vs swprintf?

The C standard states the following from the standard library function snprintf:
"The snprintf function is equivalent to fprintf, except that the
output is written into an array (specified by arguments) rather than
to a stream. If n is zero, nothing is written, and s may be a null
pointer. Otherwise, output characters beyond the n-1st are discarded
rather than being written to the array, and a null character is
written at the end of the characters actually written into the array.
If copying takes place between objects that overlap, the behavior is
undefined."
"The snprintf function returns the number of characters that would
have been written had n been sufficiently large, not counting the
terminating null character, or a negative value if an encoding error
occurred. Thus, the null-terminated output has been completely written
if and only if the returned value is nonnegative and less than n."
Compare it to the statement about swprintf:
"The swprintf function is equivalent to fwprintf, except that the
argument s specifies an array of wide characters into which the
generated output is to be written, rather than written to a stream. No
more than n wide characters are written, including a terminating null
wide character, which is always added (unless n is zero)."
"The swprintf function returns the number of wide characters written
in the array, not counting the terminating null wide character, or a
negative value if an encoding error occurred or if n or more wide
characters were requested to be written."
At first glance it may seem like snprintf and swprintf are complete equivalent to each other, the latter merely handling wide strings and the former narrow strings. However, that's not the case. While snprintf returns the number of characters that would have been written if n had been large enough, swprintf returns a negative value in this case (which means that you can't know how many characters would have been written if there had been enough space). This makes the two functions not fully interchangeable, because their behavior is different in this regard (and thus the latter can't be used for some thing that the former can, such as evaluating how long the output buffer would need to be, before actually creating it.)
Why would they make this difference? I suppose the behavior of swprintf makes the implementation more efficient when n is too small, but still, why the difference? I don't think it's even a question of snprintf being older and thus "legacy" and "dragging the weight of its history, which can't be changed later" and swprintf being newer and thus free to be improved, because both were introduced in C99.
There is, however, another significantly subtler difference between the two specifications. If you notice, the specifications are not merely carbon-copies of each other, with the only difference being the return value. That's another, much subtler difference, and that's the somewhat ambiguous behavior of what happens if n is too small for the string-to-be-printed.
The specification for snprintf quite clearly states that the output will be written up to n-1 characters even when n is too small, and the null character will be written at the end of it always. The specification of swprintf almost states this... except it leaves it ambiguous in its specification of the return value.
More specifically a negative return value is used to signal that an error occurred while trying to write the string to the destination. A negative value is also returned when n was too small. It's left ambiguous whether this is actually considered an error situation or not. This is significant because if it's considered an error, then the implementation is free to not write all, or anything, into the destination, because it can be signaling "an error occurred, the output is invalid". The first paragraph of the specification makes it sound like at most n-1 characters are always written, and an ending null character is always written ("which is always added"), but the second paragraph about the return value leaves it ambiguous whether this is actually an error situation and whether the implementation can choose not to write those things in this case.
This is significant because the glibc implementation of swprintf does not write the final null character when n is too small, making the result an invalid string. While I can't find definitive information on this, I have got the impression that the developers of glibc have interpreted the standard in such a manner that they don't have to write the final null character (or anything) to the output because this is an error situation.
The thing is that the standard seems to be very ambiguous and vague in this regard. Is it a correct interpretation? Or are they misinterpreting? Why would the standard leave it this ambiguous?
My interpretation differs from that of the glibc developers. I understand the second paragraph to mean:
A negative value is returned if:
an encoding error occurred, or
n or more wide characters were requested to be written.
I don't see how this could be interpreted as n being too small being considered an error.

Portablilty of using percison when printf-ing non 0 terminated strings

As multiple questions on here also point out, you can printf a nonterminated string by formatting with a precision as maximum length to print. Something like
printf("%.*s\n", length, str);
will print length chars starting at str (or until the first 0 byte).
As pointed out here by jonathan-leffler, this is specified by posix here. And when reading the doc I discovered it actually never states this should work (or I couldn't find it) , as "The ‘%s’ conversion prints a string." and "A string is a null-terminated array of bytes [...] ". The regard about the precision states "A precision can be specified to indicate the maximum number of characters to write;".
My interpretation would be that the line above is actually undefined behavior, but because printf's implementation is efficient it doesn't read more than it writes.
So my question is: Is this interpretation correct and
TLDR:
Should I stop using this printf trick when trying to be posix compliant as there exists an implementation where this might cause a buffer-overrun?
What you're reading isn't the actual POSIX spec, but the GNU libc manual, which tends to be a little less precise for the sake of readability. The actual spec can be found at https://pubs.opengroup.org/onlinepubs/9699919799/functions/printf.html (it's even linked from Jonathan Leffler's answer which you link to), and it makes it clear that your code is fine:
s
The argument shall be a pointer to an array of char. Bytes from the array shall be written up to (but not including) any terminating null byte. If the precision is specified, no more than that many bytes shall be written. If the precision is not specified or is greater than the size of the array, the application shall ensure that the array contains a null byte.
Note that they are careful not to use the word "string" for exactly the reason you point out.
The ISO C17 standard uses almost identical language, so your code is even portable to non-POSIX standard C implementations. (POSIX generally incorporates ISO C and many parts of the POSIX spec are copy/pasted from the C standard.)

Is subtracting a char by '0' to convert to int bad practice?

I'm expecting a single digit integer input, and have error handling in place already if this is not the case. Are the any potential unforeseen consequences by simply subtracting the input character by '0' to "convert" it into an integer?
I'm not looking for opinions on readability or what's more commonly used (although they wouldn't hurt as an extension to the answer), but simply whether or not it's a reliable form of conversion. If I ask the user to input an integer between 0 and 9, is there any scenario in which there can be input that input = input-'0' should handle, but doesn't?
This is safe and guaranteed by the C language. In the current version, C11, the relevant text is 5.2.1 Character sets, ¶3:
In both the source and execution basic character sets, the value of each character after 0 in the above list of decimal digits shall be one greater than the value of the previous.
As for whether it's "bad practice", that's a matter of opinion, but I would say no. It's both idiomatic (commonly used and understood by C programmers) and lacks any alternative that's not confusing and inefficient. For example nobody reading C would want to see this written as a switch statement with 10 cases or by setting up a dummy one-character string to pass to atoi.
The order of characters are encoding/system-dependent, so one must not rely on a particular order in general. For the sequence of digits 0..9 in any system, however, it is guaranteed that it starts with 0 and continues to 9 without any intermediate characters. So input = input - '0' is perfect as long as you guarantee that input contains a digit (e.g. by using isdigit).

C - isgraph() function

Does anyone know how the isgraph() function works in C? I understand its use and results, but the code behind it is what I'm interested in.
For example, does it look at only the char value of it and compare it to the ASCII table? Or does it actually check to see if it can be displayed? If so, how?
The code behind the isgraph() function varies by platform (or, more precisely, by implementation). One common technique is to use an initialized array of bit-fields, one per character in the (single-byte) codeset plus EOF (which has to be accepted by the functions), and then selecting the relevant bit. This allows for a simple implementation as a macro which is safe (only evaluates its argument once) and as a simple (possibly inline) function.
#define isgraph(x) (__charmap[(x)+1]&__PRINT)
where __charmap and __PRINT are names reserved for the implementation. The +1 part deals with the common situation where EOF is -1.
According to the C standard (ISO/IEC 9899:1999):
§7.4.1.6 The isgraph function
Synopsis
#include <ctype.h>
int isgraph(int c);
Description
The isgraph function tests for any printing character except space (' ').
And:
§7.4 Character handling <ctype.h>
¶1 The header declares several functions useful for classifying and mapping
characters.166) In all cases the argument is an int, the value of which shall be
representable as an unsigned char or shall equal the value of the macro EOF. If the
argument has any other value, the behavior is undefined.
¶2 The behavior of these functions is affected by the current locale. Those functions that
have locale-specific aspects only when not in the "C" locale are noted below.
¶3 The term printing character refers to a member of a locale-specific set of characters, each
of which occupies one printing position on a display device; the term control character
refers to a member of a locale-specific set of characters that are not printing
characters.167) All letters and digits are printing characters.
166) See ‘‘future library directions’’ (7.26.2).
167) In an implementation that uses the seven-bit US ASCII character set, the printing characters are those
whose values lie from 0x20 (space) through 0x7E (tilde); the control characters are those whose
values lie from 0 (NUL) through 0x1F (US), and the character 0x7F (DEL).
It's called isgraph, not isGraph (and char, not Char), and the POSIX Programmer's Manual says
The isgraph() function shall test
whether c is a character of class
graph in the program's current locale;
see the Base Definitions volume of
IEEE Std 1003.1-2001,
Chapter 7, Locale.
So yes, it looks it up in a table (or equivalent code). It can't check whether it can actually be displayed, since that would vary depending upon the output device, many of which can display chars in addition to those for which isgraph returns true.
isgraph checks for "printable" characters, but the definition of "printable" can vary depending on your locale. Your locale may use characters that aren't in the ASCII table. Internally, it's most likely either a table lookup, a range-based test ((x >= 'a') && (x <= 'z'), etc), or a combination of both. Different implementations may do it slightly differently.
The isgraph() macro only looks at the ASCII table, or your location/country/providence/planet/galaxy's version of the ASCII table.
Here's a test code Counting Words, which found you can increase performance by writing your own version, which initializes a bool array[256] using isgraph(). There are benchmark results with the code.
Since bool variables/arrays are actually BYTEs, not bits, you can do even better, in terms of memory efficiency, if you use a bit array, and test that. It happily takes up only 32 bytes. That's almost certainly going to get cashed on any general-purpose modern processor.
Importantly, if you want a slightly different test than the standard ones provided here (see graphic depiction of character tests), you are free to change the initialization provided by the standard test to include your own exceptions.

ASCII char to int conversions in C [duplicate]

This question already has answers here:
Closed 12 years ago.
Possible Duplicate:
Char to int conversion in C.
I remember learning in a course a long time ago that converting from an ASCII char to an int by subtracting '0' is bad.
For example:
int converted;
char ascii = '8';
converted = ascii - '0';
Why is this considered a bad practice? Is it because some systems don't use ASCII? The question has been bugging me for a long time.
While you probably shouldn't use this as part of a hand rolled strtol (that's what the standard library is for) there is nothing wrong with this technique for converting a single digit to its value. It's simple and clear, even idiomatic. You should, though, add range checking if you are not absolutely certain that the given char is in range.
It's a C language guarantee that this works.
5.2.1/3 says:
In both the source and execution basic character sets, the value of each character after 0 in the above list [includes the sequence: 0,1,2,3,4,5,6,7,8,9] shall be one greater that the value of the previous.
Character sets may exist where this isn't true but they can't be used as either source or execution character sets in any C implementation.
Edit: Apparently the C standard guarantees consecutive 0-9 digits.
ASCII is not guaranteed by the C standard, in effect making it non-portable. You should use a standard library function intended for conversion, such as atoi.
However, if you wish to make assumptions about where you are running (for example, an embedded system where space is at a premium), then by all means use the subtraction method. Even on systems not in the US-ASCII code page (UTF-8, other code pages) this conversion will work. It will work on ebcdic (amazingly).
This is a common trick taught in C classes primarily to illustrate the notion that a char is a number and that its value is different from the corresponding int.
Unfortunately, this educational toy somehow became part of the typical arsenal of most C developers, partially because C doesn't provide a convenient call for this (it is often platform specific, I'm not even sure what it is).
Generally, this code is not portable for non-ASCII platforms, and for future transitions to other encodings. It's also not really readable. At a minimum wrap this trick in a function.

Resources