ASCII char to int conversions in C [duplicate]

Possible Duplicate: Char to int conversion in C.
I remember learning in a course a long time ago that converting from an ASCII char to an int by subtracting '0' is bad.
For example:
int converted;
char ascii = '8';
converted = ascii - '0';
Why is this considered a bad practice? Is it because some systems don't use ASCII? The question has been bugging me for a long time.

While you probably shouldn't use this as part of a hand-rolled strtol (that's what the standard library is for), there is nothing wrong with this technique for converting a single digit to its value. It's simple and clear, even idiomatic. You should, though, add range checking if you are not absolutely certain that the given char is in range.
It's a C language guarantee that this works.
5.2.1/3 says:
In both the source and execution basic character sets, the value of each character after 0 in the above list [includes the sequence 0,1,2,3,4,5,6,7,8,9] shall be one greater than the value of the previous.
Character sets may exist where this isn't true but they can't be used as either source or execution character sets in any C implementation.
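As a small illustration, here is the idiom from the question with the range check this answer recommends added; this is only a sketch, and nothing in it is platform-specific:
#include <stdio.h>

int main(void)
{
    char ascii = '8';

    /* The range check guards against non-digit input; the
       subtraction itself is what 5.2.1/3 guarantees. */
    if (ascii >= '0' && ascii <= '9') {
        int converted = ascii - '0';
        printf("%d\n", converted);   /* prints 8 */
    } else {
        fprintf(stderr, "not a decimal digit\n");
    }
    return 0;
}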

Edit: Apparently the C standard guarantees consecutive 0-9 digits.
ASCII is not guaranteed by the C standard, in effect making it non-portable. You should use a standard library function intended for conversion, such as atoi.
However, if you wish to make assumptions about where you are running (for example, an embedded system where space is at a premium), then by all means use the subtraction method. Even on systems not using the US-ASCII code page (UTF-8, other code pages) this conversion will work. It will work on EBCDIC (amazingly).

This is a common trick taught in C classes primarily to illustrate the notion that a char is a number and that its value is different from the corresponding int.
Unfortunately, this educational toy somehow became part of the typical arsenal of most C developers, partly because C doesn't provide a convenient call for this (it is often platform-specific; I'm not even sure what it is).
Generally, this code is not portable to non-ASCII platforms, nor robust against future transitions to other encodings. It's also not really readable. At a minimum, wrap this trick in a function.
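A minimal sketch of that last suggestion; the function name digit_value and its -1 error value are my own choices, not anything standard:
/* Returns the numeric value of a decimal digit character,
   or -1 if c is not one of '0'..'9'.  Relies only on the
   standard's guarantee that '0'..'9' are contiguous. */
int digit_value(char c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    return -1;
}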

Related

Effect of Wide Characters/Strings on a C Program

Below is an excerpt from an old edition of the book Programming Windows by Charles Petzold
There are, of course, certain disadvantages to using Unicode. First and foremost is that every string in your program will occupy twice as much space. In addition, you'll observe that the functions in the wide-character run-time library
are larger than the usual functions.
Why would every string in my program occupy twice the bytes, should not only the character arrays we've declared as storing wchar_t type do so?
Is there perhaps some condition whereby, if a program is to be able to work with long values, the entire mode it operates in is altered?
Usually if we declare a long int, we never fuss over or mention the fact that all ints will be occupying double the memory now. Are strings somehow a special case?
Why would every string in my program occupy twice the bytes, should not only the character arrays we've declared as storing wchar_t type do so?
As I understand it, what is meant is that if you have a program that uses char *, and you rewrite that program to use wchar_t *, then it will use (more than) twice the bytes.
If a string could potentially contain a character outside of the ascii range, you'll have to declare it as a wide string. So most strings in the program will be bigger. Personally, I wouldn't worry about it; if you need Unicode, you need Unicode, and a few more bytes aren't going to kill you.
That seems to be what you're saying, and I agree. But the question is skating the fine line between opinionated and objective.
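To see the doubling concretely, here is a small sketch comparing a narrow and a wide string literal; the exact factor depends on sizeof(wchar_t), which is 2 on Windows and typically 4 on Linux, so it can be more than double:
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    char    narrow[] =  "Hello";   /* 6 bytes: 5 characters + '\0' */
    wchar_t wide[]   = L"Hello";   /* 6 * sizeof(wchar_t) bytes */

    printf("narrow: %zu bytes\n", sizeof narrow);
    printf("wide:   %zu bytes\n", sizeof wide);
    return 0;
}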
Unicode has several encoding forms: UTF-8, UTF-16, and UTF-32 (https://en.wikipedia.org/wiki/Unicode).
You can weigh their advantages and disadvantages to decide which one fits your situation.
Reference: UTF-8, UTF-16, and UTF-32

How does C uppercase letters?

I see this code in glibc-2.33/ctype/ctype.c:
// [...]
#define __ctype_toupper \
((int32_t *) _NL_CURRENT (LC_CTYPE, _NL_CTYPE_TOUPPER) + 128)
// [...]
int
toupper (int c)
{
return c >= -128 && c < 256 ? __ctype_toupper[c] : c;
}
libc_hidden_def (toupper)
I understand that it's checking if c is between -128 and 255 (inclusive) and returns the character as-is if it's outside that range, but what does _NL_CURRENT (LC_CTYPE, _NL_CTYPE_TOUPPER) + 128 mean, and where do I actually find the source code of how letters are uppercased? This seems to be looking up the current locale; I am only interested in en_US.UTF-8. Also, how can a character be negative?
I don't care about glibc specifically, I just want to know how all the ASCII characters (all as in from NUL to DEL) are uppercased in C.
"C" doesn't convert characters to upper case. The C standard only mandates that there be a function in the standard library which does so correctly according to the current locale, and that it does so in a particular way in the "C" locale (which is the only locale which is guaranteed to exist).
Library implementations are free to accomplish that task as the implementers see fit, and they all do it in different ways. Even radically different ways. Some C libraries don't support locales other than the "C" locale with an ASCII character set. An example of such a C library is musl and it is hard to beat the simplicity of its implementation:
int toupper(int c)
{
if (islower(c)) return c & 0x5f;
return c;
}
As you can see, the above code depends on islower. Here it is:
int islower(int c)
{
return (unsigned)c-'a' < 26;
}
Because of the call to islower, toupper returns unchanged any argument outside of the range of lower case characters, even arguments not in the valid range for toupper. Since the standard doesn't define the behaviour of toupper for arguments outside of the valid range (essentially values which might be returned by fgetc), just returning invalid arguments unchanged is certainly as acceptable as any other behaviour. Glibc's toupper function will often segfault on invalid arguments, since it uses the argument as an index into an array (as you can see in the code you cite). That behaviour is also acceptable according to the standard.
The Glibc implementation is a lot more complicated. And behind the scenes it depends on the locale data which is compiled from locale definition files, a process which is completely outside of the C standard and somewhat defined by the Posix standard (although the GNU implementation diverges in some way from Posix).
But here's the scoop: If you're using single byte characters in a UTF-8 locale, none of glibc's complicated code makes the slightest difference. The musl implementation works precisely as required in a UTF-8 locale, because the only alphabetic characters representable in a single byte UTF-8 representation are the 52 characters in the "Roman" alphabet. All the other Unicode characters are only representable in wide characters and multibyte sequences.
Furthermore, environments which use a single-byte encoding other than UTF-8 are increasingly rare. There are certainly a lot of us who had to learn this stuff because our programs ran on a variety of platforms which used different ISO-8859-x code pages. Or different single-byte Windows codepages. But in the end, Unicode won out. (And many of us breathed huge sighs of relief.) So most of this apparatus is no longer really necessary except in legacy environments.
But that's not to say that Unicode magically solves all the complications involved in managing the huge variety of alphabets in use in the world. Far from it. What Unicode does do is two-fold: it clarifies what the complications are (most of which is not captured by C/Posix locales), and it provides some basic standards for implementations.
And, as a side effect, UTF-8 standardises single-byte codes to basically conform with the original ASCII 7-bit standard. So if you're only dealing with 7-bit characters (which, these days, is probably less than ideal), you don't need anything beyond musl-style implementations. And if you are dealing with "all the world's character sets", you'll be looking for a library which actually conforms to Unicode, and which uses something other than char to represent characters.
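As a sketch of that last point, the standard library's wide-character path already uses something other than char: towupper from <wctype.h> honours the locale, so it can uppercase characters the single-byte toupper cannot. This assumes a UTF-8 locale such as en_US.UTF-8 is installed:
#include <locale.h>
#include <stdio.h>
#include <wctype.h>

int main(void)
{
    /* Assumes the en_US.UTF-8 locale is available on the system. */
    setlocale(LC_CTYPE, "en_US.UTF-8");

    wint_t lower = L'\u00E0';            /* à */
    wint_t upper = towupper(lower);      /* à becomes À, i.e. U+00C0 */

    printf("U+%04X -> U+%04X\n", (unsigned)lower, (unsigned)upper);
    return 0;
}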
But one complication is going to remain forever, sadly: the fact that C does not standardise the signedness of char. On platforms on which char is signed (x86 Unix and Windows, for two major examples),
(char)0xA0 is (a) implementation-defined and (b) probably -96, which is what a single-byte 0xA0 represents in two's complement. So if you write code which uses the various functions in ctype.h and don't take care of negative char values, and then you try to use that code with a UTF-8 encoded string which includes characters outside of the single-byte domain, then you will end up passing negative numbers to functions which might not be expecting them.
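A common defensive pattern for exactly that situation is to cast each byte to unsigned char before handing it to a <ctype.h> function. A sketch; the byte-wise loop deliberately upcases only the ASCII letters and passes multibyte sequences through untouched:
#include <ctype.h>
#include <stdio.h>

/* Upper-case the ASCII letters of a (possibly UTF-8) string in place.
   The cast to unsigned char avoids passing negative values other
   than EOF to toupper on platforms where char is signed. */
static void ascii_upcase(char *s)
{
    for (; *s; ++s)
        *s = (char)toupper((unsigned char)*s);
}

int main(void)
{
    char text[] = "h\xc3\xa9llo, world";   /* "héllo, world" in UTF-8 */
    ascii_upcase(text);
    puts(text);              /* prints "HéLLO, WORLD" on a UTF-8 terminal */
    return 0;
}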
If you go back through the history and look for _NL_CTYPE_TOUPPER, you will find a commit where it is written:
[..] (ctype_output): Support for alternate locale format: Computation of
nelems changes. _NL_CTYPE_TOUPPER32 [...]
So basically _NL_CTYPE_TOUPPER is the macro for the 8-bit table (as opposed to _NL_CTYPE_TOUPPER32); for example, in French you have À as the uppercase version of à.
Following this link you will find the header file langinfo.h that has this enum starting at line 43 and with _NL_CTYPE_TOUPPER defined at line 259.
LC_CTYPE category: character classification.
This information is accessed by the functions in <ctype.h>.
LC_CTYPE is defined for each language; see for example the French locale definition, fr_FR.
Note that for plain ASCII it doesn't make a lot of sense to call this function, since accented characters are not contained in the ASCII table; but since this one function handles both UTF-8 and ASCII, that's how it works.

Is subtracting a char by '0' to convert to int bad practice?

I'm expecting a single digit integer input, and have error handling in place already if this is not the case. Are there any potential unforeseen consequences of simply subtracting '0' from the input character to "convert" it into an integer?
I'm not looking for opinions on readability or what's more commonly used (although they wouldn't hurt as an extension to the answer), but simply whether or not it's a reliable form of conversion. If I ask the user to input an integer between 0 and 9, is there any scenario in which there can be input that input = input-'0' should handle, but doesn't?
This is safe and guaranteed by the C language. In the current version, C11, the relevant text is 5.2.1 Character sets, ¶3:
In both the source and execution basic character sets, the value of each character after 0 in the above list of decimal digits shall be one greater than the value of the previous.
As for whether it's "bad practice", that's a matter of opinion, but I would say no. It's both idiomatic (commonly used and understood by C programmers) and lacks any alternative that's not confusing and inefficient. For example nobody reading C would want to see this written as a switch statement with 10 cases or by setting up a dummy one-character string to pass to atoi.
The order of characters is encoding/system-dependent, so one must not rely on a particular order in general. For the sequence of digits 0..9 in any system, however, it is guaranteed that it starts with 0 and continues to 9 without any intermediate characters. So input = input - '0' is perfect as long as you guarantee that input contains a digit (e.g. by using isdigit).
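A sketch of that guard in practice, reading one character with getchar; getchar already returns either EOF or an unsigned char value, so the result can be passed straight to isdigit:
#include <ctype.h>
#include <stdio.h>

int main(void)
{
    int input = getchar();        /* int, not char, so it can hold EOF */

    /* isdigit accepts EOF (and returns 0 for it), so no separate
       EOF check is needed before the test. */
    if (isdigit(input)) {
        printf("digit value: %d\n", input - '0');
    } else {
        fprintf(stderr, "expected a single decimal digit\n");
    }
    return 0;
}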

Converting a Letter to a Number in C [duplicate]

Possible Duplicate: Converting Letters to Numbers in C.
Alright so pretty simple, I want to convert a letter to a number so that a = 0, b = 1, etc. Now I know I can do
number = letter + '0';
so when I input the letter 'a' it gives me the number 145. My question is, if I am to run this on a different computer or OS, would it still give me the same number 145 for when I input the letter 'a'?
It depends on what character encoding you are using. If you're using the same encoding and compiler on both the computers, yes, it will be the same. But if you're using another encoding like EBCDIC on one computer and ASCII on another, you cannot guarantee them to be the same.
Also, you can use atoi.
If you do not want to use atoi, see: Converting Letters to Numbers in C
It depends on what character encoding you are using.
It is also important to note that if you use ASCII the value will fit in a byte.
If you are using UTF-8, for example, characters outside the ASCII range won't fit in a byte; you will need at least two bytes for them.
Now, let's assume you make sure to use one specific character encoding; then the value will be the same no matter the system.
Yes, the number used to represent a is defined in the American Standard Code for Information Interchange. This is the standard that C compilers use by default, so on all other OSs you will get the same result.
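If the goal from the question really is a = 0, b = 1, and so on, the usual idiom is to subtract 'a' rather than add '0'. A sketch, with the caveat the answers above imply: unlike the digits, the C standard does not guarantee that the letters are contiguous (EBCDIC, for instance, has gaps), so this relies on an ASCII-compatible encoding:
#include <stdio.h>

int main(void)
{
    char letter = 'a';

    /* On ASCII-compatible encodings 'a'..'z' are contiguous, so this
       yields 0 for 'a', 1 for 'b', and so on.  Not guaranteed by the
       C standard itself (EBCDIC has gaps in the alphabet). */
    if (letter >= 'a' && letter <= 'z') {
        int number = letter - 'a';
        printf("%d\n", number);   /* prints 0 */
    }
    return 0;
}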

Unsigned character gotchas in C

Most C compilers use signed characters. Most C libraries define EOF as -1.
Despite being a long-time C programmer I had never before put these two facts together and so in the interest of robust and international software I would ask for a bit of help in spelling out the implications.
Here is what I have discovered thus far:
fgetc() and friends cast to unsigned characters before returning as int to avoid clashing with EOF.
Therefore care needs to be taken with the results, e.g. getchar() == (unsigned char) 'µ'.
Theoretically I believe that not even the basic character set is guaranteed to be positive.
The <ctype.h> functions are designed to handle EOF and expect unsigned characters. Any other negative input may cause out-of-bounds addressing.
Most functions taking character parameters as integers ignore EOF and will accept signed or unsigned characters interchangeably.
String comparison (strcmp/strncmp/memcmp) compares unsigned character strings.
It may not be possible to discriminate EOF from a proper character on systems where sizeof(int) == 1.
The wide character functions are not used for binary I/O, and so WEOF is defined within the range of wchar_t.
Is this assessment correct and if so what other gotchas did I miss?
Full disclosure: I ran into an out-of-bounds indexing bug today when feeding non-ASCII characters to isspace() and the realization of the amount of lurking bugs in my old code both scared and annoyed me. Hence this frustrated question.
The basic execution character set is guaranteed to be nonnegative - the precise wording in C99 is:
If a member of the basic execution character set is stored in a char
object, its value is guaranteed to be nonnegative.
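Tying the answer back to the gotchas listed in the question, here is a minimal sketch of a read loop that stays correct when char is signed: keep the result of fgetc in an int, test against EOF before doing anything else, and compare against byte values through (unsigned char), as the question's getchar() == (unsigned char) 'µ' example suggests:
#include <stdio.h>

int main(void)
{
    int c;   /* int, not char, so EOF remains distinguishable */

    while ((c = fgetc(stdin)) != EOF) {
        /* fgetc returns the byte as an unsigned char converted to int,
           so a negative char constant must be converted the same way
           before comparing. */
        if (c == (unsigned char)'\xb5')   /* 0xB5, 'µ' in Latin-1 */
            puts("found a micro sign byte");
    }
    return 0;
}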
