Is subtracting a char by '0' to convert to int bad practice? - c

I'm expecting a single-digit integer input, and I already have error handling in place if this is not the case. Are there any potential unforeseen consequences of simply subtracting '0' from the input character to "convert" it into an integer?
I'm not looking for opinions on readability or what's more commonly used (although those wouldn't hurt as an extension to the answer), but simply whether or not it's a reliable form of conversion. If I ask the user to input an integer between 0 and 9, is there any scenario in which input = input - '0' should handle the input correctly, but doesn't?

This is safe and guaranteed by the C language. In the current version, C11, the relevant text is 5.2.1 Character sets, ¶3:
In both the source and execution basic character sets, the value of each character after 0 in the above list of decimal digits shall be one greater than the value of the previous.
As for whether it's "bad practice", that's a matter of opinion, but I would say no. It's idiomatic (commonly used and understood by C programmers), and every alternative is more confusing or less efficient. For example, nobody reading C would want to see this written as a switch statement with 10 cases, or by setting up a dummy one-character string to pass to atoi.

The order of characters is encoding/system-dependent, so one must not rely on a particular order in general. For the sequence of digits 0..9, however, every system guarantees that it starts at '0' and continues to '9' with no intermediate characters. So input = input - '0' is perfect as long as you guarantee that input contains a digit (e.g. by checking with isdigit).
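For completeness, a minimal sketch of that check-then-convert pattern (the variable names are illustrative only):

#include <ctype.h>
#include <stdio.h>

int main (void) {
    int input = getchar ();                  /* read a single character as an int */
    if (input != EOF && isdigit (input)) {   /* make sure it really is a digit first */
        int value = input - '0';             /* '0'..'9' are consecutive, so this yields 0..9 */
        printf ("%d\n", value);
    } else {
        fprintf (stderr, "not a digit\n");
    }
    return 0;
}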

Related

What does scanf("%f%c", ...) do with the input 100e?

Consider the following C code (available online as io.c):
#include <stdio.h>

int main () {
    float f;
    char c;
    scanf ("%f%c", &f, &c);
    printf ("%f \t %c", f, c);
    return 0;
}
When the input is 100f, it outputs 100.000000 f.
However, when the input is 100e, it outputs only 100.000000, with no e following it. What is going on here? Isn't 100e an invalid floating-point number?
This is (arguably) a glibc bug.
This behaviour clearly goes against the standard. However it is exhibited by other implementations. Some people consider it a bug in the standard instead.
Per the standard, "An input item is defined as the longest sequence of input characters which does not exceed any specified field width and which is, or is a prefix of, a matching input sequence." So 100e is an input item because it is a prefix of a matching input sequence, say, 100e1, but any longer sequence of characters from the input isn't. Further, "If the input item is not a matching sequence, the execution of the directive fails: this condition is a matching failure." 100e is not a matching sequence, so the standard requires the directive to fail.
The standard cannot tell scanf to accept 100 and continue scanning from e, as some people would expect, because stdio has a limited push-back of just one character. So having read 100e, the implementation would have to read at least one more character, say a newline to be specific, and then push back both newline and e, which it cannot always do.
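As a minimal sketch of that limitation (my own illustration, not from the standard): stdio guarantees only one character of push-back, so the second of two consecutive ungetc calls is allowed to fail.

#include <stdio.h>

int main (void) {
    int c1 = getchar ();
    int c2 = getchar ();
    if (c1 != EOF && c2 != EOF) {
        /* One character of push-back is guaranteed; a second consecutive
           ungetc without an intervening read is allowed to fail. */
        if (ungetc (c2, stdin) == EOF || ungetc (c1, stdin) == EOF)
            fprintf (stderr, "could not push back two characters\n");
    }
    return 0;
}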
I'd say this is pretty clearly a pretty unclear, gray area.
If you're an implementor of a C library (or a member of the X3J11 committee), you have to worry about this sort of thing — sometimes a lot. You have to worry about the edge cases, and sometimes the edge cases can be particularly edgy.
However, you did not tag your question with the "language lawyer" tag, so perhaps you're not worried about a scrupulously correct, official interpretation.
If you're not an implementor of a C library or a member of the X3J11 committee, I'd say: don't worry what the "right" answer here is! You don't have to worry, because you don't care, because you'd be crazy to write code which is sensitive to this question — precisely because it's such an obvious gray area. (Even if you do figure out what the right behavior here is, do you trust every implementor of every C library in the world to always implement that behavior?)
I'd say there are three things you can do in the category of "not worrying", and not writing code which is sensitive to this question.
Don't use scanf at all (for anything). It's an odious, imprecise, imperfect function that's not good for anything except, perhaps, getting numbers into the first few programs you ever write while you're first learning C. After that, scanf has no use in any serious program.
Don't arrange your code and data such that it has to confront ambiguous input like "100e" in the first place. Where is it coming from, anyway? Is it input the user might type? Data being read in from a data file? Is it expected or unexpected, correct or incorrect input? If you're reading a data file, do you have control over the code that writes the data file? Can you guarantee that floating-point fields will always be delimited appropriately, will not occasionally have random alphabetic characters appended?
If you do have to parse input that might contain a valid floating-point number, might have random alphabetic characters appended, and might therefore be ambiguous like this, I'd encourage you to use strtod instead, which is likely to be both better-defined and better-implemented.
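A minimal sketch of that strtod approach (the names and buffer size are illustrative only):

#include <stdio.h>
#include <stdlib.h>

int main (void) {
    char line[64];
    if (fgets (line, sizeof line, stdin) != NULL) {
        char *end;
        double f = strtod (line, &end);   /* parses the longest valid numeric prefix */
        if (end == line) {
            fprintf (stderr, "no number found\n");
        } else {
            /* for the input "100e", f should be 100.0 and end should point at the 'e' */
            printf ("%f \t %c\n", f, *end);
        }
    }
    return 0;
}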
Put a space between the conversions, as in "%f %c", and also make sure to put a space between the two values when you enter the input.
I am assuming you just want to print a character.
From the C Standard (6.4.4.2 Floating constants)
decimal-floating-constant:
    fractional-constant exponent-part_opt floating-suffix_opt
    digit-sequence exponent-part floating-suffix_opt
and
exponent-part:
    e sign_opt digit-sequence
    E sign_opt digit-sequence
If you change the call of printf in the following way
printf ("%e \t %d\n", f, c);
you will get the output
1.000000e+02 10
that is, the variable c received the newline character '\n'.
It seems that the implementation of scanf is written in such a way that the symbol e is interpreted as part of a floating-point number even though there is no digit after it.
According to the C Standard (7.21.6.2 The fscanf function)
9 An input item is read from the stream, unless the specification
includes an n specifier. An input item is defined as the longest
sequence of input characters which does not exceed any specified
field width and which is, or is a prefix of, a matching input
sequence.278) The first character, if any, after the input item
remains unread.
So 100e qualifies as an input item because it is a prefix of a matching input sequence for a floating-point number.

Why the different behavior of snprintf vs swprintf?

The C standard states the following from the standard library function snprintf:
"The snprintf function is equivalent to fprintf, except that the
output is written into an array (specified by arguments) rather than
to a stream. If n is zero, nothing is written, and s may be a null
pointer. Otherwise, output characters beyond the n-1st are discarded
rather than being written to the array, and a null character is
written at the end of the characters actually written into the array.
If copying takes place between objects that overlap, the behavior is
undefined."
"The snprintf function returns the number of characters that would
have been written had n been sufficiently large, not counting the
terminating null character, or a negative value if an encoding error
occurred. Thus, the null-terminated output has been completely written
if and only if the returned value is nonnegative and less than n."
Compare it to the statement about swprintf:
"The swprintf function is equivalent to fwprintf, except that the
argument s specifies an array of wide characters into which the
generated output is to be written, rather than written to a stream. No
more than n wide characters are written, including a terminating null
wide character, which is always added (unless n is zero)."
"The swprintf function returns the number of wide characters written
in the array, not counting the terminating null wide character, or a
negative value if an encoding error occurred or if n or more wide
characters were requested to be written."
At first glance it may seem like snprintf and swprintf are completely equivalent to each other, the latter merely handling wide strings and the former narrow strings. However, that's not the case. While snprintf returns the number of characters that would have been written if n had been large enough, swprintf returns a negative value in this case (which means that you can't know how many characters would have been written if there had been enough space). This makes the two functions not fully interchangeable, because their behavior differs in this regard (and thus the latter can't be used for something the former can, such as determining how large the output buffer would need to be before actually allocating it).
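For instance, the usual "measure, then allocate" idiom relies on this behavior of snprintf and has no direct swprintf counterpart (a minimal sketch, assuming C99 semantics):

#include <stdio.h>
#include <stdlib.h>

int main (void) {
    int x = 42;
    /* With n == 0, s may be a null pointer; the return value is the length needed. */
    int needed = snprintf (NULL, 0, "value = %d", x);
    if (needed < 0)
        return 1;                              /* encoding error */
    char *buf = malloc ((size_t) needed + 1);
    if (buf == NULL)
        return 1;
    snprintf (buf, (size_t) needed + 1, "value = %d", x);
    printf ("%s\n", buf);
    free (buf);
    return 0;
}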
Why would they make this difference? I suppose the behavior of swprintf makes the implementation more efficient when n is too small, but still, why the difference? I don't think it's even a question of snprintf being older and thus "legacy" and "dragging the weight of its history, which can't be changed later" and swprintf being newer and thus free to be improved, because both were introduced in C99.
There is, however, another, significantly subtler difference between the two specifications. If you look closely, they are not merely carbon copies of each other with the return value as the only difference: there is also the somewhat ambiguous question of what happens when n is too small for the string to be printed.
The specification for snprintf quite clearly states that the output will be written up to n-1 characters even when n is too small, and the null character will be written at the end of it always. The specification of swprintf almost states this... except it leaves it ambiguous in its specification of the return value.
More specifically, a negative return value is used to signal that an error occurred while trying to write the string to the destination. A negative value is also returned when n was too small. It's left ambiguous whether this is actually considered an error situation or not. This is significant because if it's considered an error, then the implementation is free not to write everything, or anything at all, into the destination, because it can be signaling "an error occurred, the output is invalid". The first paragraph of the specification makes it sound like at most n-1 characters are always written, and an ending null character is always written ("which is always added"), but the second paragraph about the return value leaves it ambiguous whether this is actually an error situation and whether the implementation can choose not to write those things in this case.
This is significant because the glibc implementation of swprintf does not write the final null character when n is too small, making the result an invalid string. While I can't find definitive information on this, I have got the impression that the developers of glibc have interpreted the standard in such a manner that they don't have to write the final null character (or anything) to the output because this is an error situation.
The thing is that the standard seems to be very ambiguous and vague in this regard. Is it a correct interpretation? Or are they misinterpreting? Why would the standard leave it this ambiguous?
My interpretation differs from that of the glibc developers. I understand the second paragraph to mean:
A negative value is returned if:
an encoding error occurred, or
n or more wide characters were requested to be written.
I don't see how this could be interpreted as n being too small being considered an error.
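If you have to cope with this difference in practice, one hedged strategy (the growth factor and cap below are my own choices, not from the standard) is to retry swprintf with a progressively larger buffer:

#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

int main (void) {
    size_t n = 8;                               /* initial guess, grown on failure */
    wchar_t *buf = NULL;
    int len;
    for (;;) {
        wchar_t *tmp = realloc (buf, n * sizeof *buf);
        if (tmp == NULL) { free (buf); return 1; }
        buf = tmp;
        len = swprintf (buf, n, L"value = %d", 123456);
        if (len >= 0)
            break;                              /* success: buf holds a terminated string */
        if (n > 4096) { free (buf); return 1; } /* give up: probably an encoding error */
        n *= 2;                                 /* assume the buffer was just too small */
    }
    wprintf (L"%ls (%d wide characters)\n", buf, len);
    free (buf);
    return 0;
}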

Converting a Letter to a Number in C [duplicate]

This question already has answers here:
Converting Letters to Numbers in C
(10 answers)
Closed 6 years ago.
Alright so pretty simple, I want to convert a letter to a number so that a = 0, b = 1, etc. Now I know I can do
number = letter + '0';
so when I input the letter 'a' it gives me the number 145. My question is, if I were to run this on a different computer or OS, would it still give me the same number 145 when I input the letter 'a'?
It depends on what character encoding you are using. If you're using the same encoding and compiler on both computers, yes, it will be the same. But if you're using another encoding like EBCDIC on one computer and ASCII on the other, you cannot guarantee that the results will be the same.
Also, you can use atoi.
If you do not want to use atoi, see: Converting Letters to Numbers in C
It depends on what character encoding you are using.
It is also important to note that if you use ASCII the value will fit in a byte.
If you are using UTF-8, for example, the value won't necessarily fit in a byte; you may require at least two bytes (int16).
Now, let's assume you make sure to use one specific character encoding; then the value will be the same no matter the system.
Yes, the number used to represent a is defined in the American Standard Code for Information Interchange. This is the encoding that most C compilers use by default, so on most other OSs you will get the same result.
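A minimal sketch of a conversion that does not depend on the encoding at all (the helper name is illustrative): look the letter up in a string literal, so the result is 0 for 'a', 1 for 'b', and so on, on any system.

#include <stdio.h>
#include <string.h>

/* Returns 0 for 'a', 1 for 'b', ..., 25 for 'z', or -1 for anything else.
   The lookup in a string literal works in any character encoding. */
static int letter_index (char c) {
    static const char letters[] = "abcdefghijklmnopqrstuvwxyz";
    const char *p = strchr (letters, c);
    return (c != '\0' && p != NULL) ? (int) (p - letters) : -1;
}

int main (void) {
    printf ("%d %d %d\n", letter_index ('a'), letter_index ('z'), letter_index ('!'));   /* 0 25 -1 */
    return 0;
}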

Unsigned character gotchas in C

Most C compilers use signed characters. Most C libraries define EOF as -1.
Despite being a long-time C programmer, I had never before put these two facts together, so in the interest of robust and international software I would ask for a bit of help in spelling out the implications.
Here is what I have discovered thus far:
fgetc() and friends convert characters to unsigned char before returning them as int, to avoid clashing with EOF.
Therefore care needs to be taken when comparing the results, e.g. getchar() == (unsigned char) 'µ'.
Theoretically I believe that not even the basic character set is guaranteed to be positive.
The <ctype.h> functions are designed to handle EOF and expect unsigned character values. Any other negative input may cause out-of-bounds addressing.
Most functions taking character parameters as integers ignore EOF and will accept signed or unsigned characters interchangeably.
String comparison (strcmp/strncmp/memcmp) compares unsigned character strings.
It may be impossible to discriminate EOF from a proper character on systems where sizeof(int) == 1.
The wide character functions are not used for binary I/O, and so WEOF may be defined within the range of wchar_t.
Is this assessment correct, and if so, what other gotchas did I miss?
Full disclosure: I ran into an out-of-bounds indexing bug today when feeding non-ASCII characters to isspace() and the realization of the amount of lurking bugs in my old code both scared and annoyed me. Hence this frustrated question.
The basic execution character set is guaranteed to be nonnegative - the precise wording in C99 is:
If a member of the basic execution character set is stored in a char
object, its value is guaranteed to be nonnegative.
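For the isspace() case mentioned in the question, the usual defensive idiom is to cast to unsigned char before calling any <ctype.h> function; a minimal sketch:

#include <ctype.h>
#include <stdio.h>

/* Counts whitespace characters in a string.  Casting each char to unsigned char
   before calling isspace() avoids passing negative values (other than EOF),
   which is undefined behavior and can index out of bounds in some implementations. */
static size_t count_spaces (const char *s) {
    size_t n = 0;
    for (; *s != '\0'; s++)
        if (isspace ((unsigned char) *s))
            n++;
    return n;
}

int main (void) {
    printf ("%zu\n", count_spaces ("caf\xC3\xA9 au lait"));   /* prints 2 */
    return 0;
}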

ASCII char to int conversions in C [duplicate]

This question already has answers here:
Closed 12 years ago.
Possible Duplicate:
Char to int conversion in C.
I remember learning in a course a long time ago that converting from an ASCII char to an int by subtracting '0' is bad.
For example:
int converted;
char ascii = '8';
converted = ascii - '0';
Why is this considered a bad practice? Is it because some systems don't use ASCII? The question has been bugging me for a long time.
While you probably shouldn't use this as part of a hand-rolled strtol (that's what the standard library is for), there is nothing wrong with this technique for converting a single digit to its value. It's simple and clear, even idiomatic. You should, though, add range checking if you are not absolutely certain that the given char is in range.
It's a C language guarantee that this works.
5.2.1/3 says:
In both the source and execution basic character sets, the value of each character after 0 in the above list [which includes the sequence: 0,1,2,3,4,5,6,7,8,9] shall be one greater than the value of the previous.
Character sets may exist where this isn't true but they can't be used as either source or execution character sets in any C implementation.
Edit: Apparently the C standard guarantees consecutive 0-9 digits.
ASCII is not guaranteed by the C standard, in effect making it non-portable. You should use a standard library function intended for conversion, such as atoi.
However, if you wish to make assumptions about where you are running (for example, an embedded system where space is at a premium), then by all means use the subtraction method. Even on systems not using the US-ASCII code page (UTF-8, other code pages) this conversion will work. It will work on EBCDIC (amazingly).
This is a common trick taught in C classes primarily to illustrate the notion that a char is a number and that its value is different from the corresponding int.
Unfortunately, this educational toy somehow became part of the typical arsenal of most C developers, partially because C doesn't provide a convenient call for this (it is often platform-specific; I'm not even sure what it is).
Generally, this code is not portable to non-ASCII platforms or to future transitions to other encodings. It's also not really readable. At a minimum, wrap this trick in a function.
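Following that last suggestion, a minimal sketch of such a wrapper (the function name is illustrative):

#include <ctype.h>
#include <stdio.h>

/* Converts a decimal digit character to its numeric value, or returns -1 otherwise.
   The subtraction itself is portable: the C standard guarantees that '0'..'9'
   are consecutive in both the source and execution character sets. */
static int digit_value (char c) {
    return isdigit ((unsigned char) c) ? c - '0' : -1;
}

int main (void) {
    printf ("%d %d\n", digit_value ('8'), digit_value ('x'));   /* prints 8 -1 */
    return 0;
}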

Resources