Converting Decimal Literals to ASCII Equivalent for putchar in C

I am trying to understand why the following statement works:
putchar( 1 + '0' );
It seems that the + '0' expression converts the literal to the respective ASCII version (49 in this particular case) that putchar likes to be given.
My question was why does it do this? Any help is appreciated. I also apologize if I have made any incorrect assumptions.

This has nothing to do with ASCII. Nobody even mentioned ASCII.
What this code does assume is that in the system's character encoding all the numerals appear as a contiguous range from '0' to '9', and so if you add an offset to the character '0', you get the character for the corresponding numeral.
All character encodings that may be used by a C or C++ compiler must have this property (e.g. §2.3/3 of the C++ standard), so this code is portable.

Characters '0' to '9' are consecutive. The C standard guarantees this.
In ASCII:
'0' = 48
'1' = 49
'2' = 50
etc.
The '0' is simply seen as an offset.
'0' + 0 = 48, which is '0'.
'0' + 1 = 49, which is '1'.
etc.

Related

Can I always assume the characters '0' to '9' appear sequentially in any C character encoding

I'm writing a program in C that converts some strings to integers. The way I've implemented this before is like so
int number = (character - '0');
This always works perfectly for me, but I started thinking, are there any systems using some obscure character encoding in which the characters '0' to '9' don't appear one after another in that order? This code assumes '1' follows '0', '2' follows '1' and so on, but is there ever a case when this is not true?
Yes, this is guaranteed by the C standard.
N1570 5.2.1 paragraph 3 says:
In both the source and execution basic character sets, the value of
each character after 0 in the above list of decimal digits shall be
one greater than the value of the previous.
This guarantee was possible because both ASCII and EBCDIC happen to have this property.
Note that there's no corresponding guarantee for letters; in EBCDIC, the letters do not have contiguous codes.

Is a character literal ('A') exactly equivalent to a hex literal (0x41)

Is there any situation in which changing 'A' to 0x41 could change the behaviour of my program? How about changing 0x41 to 'A'? Are there any uncommon architectures or obscure compiler settings or weird macros that might make those to be not exactly equivalent? If they are exactly equivalent in a standards compliant compiler, has anyone come across a buggy or non-standard compiler where they are not the same?
Is there any situation in which changing 'A' to 0x41 could change the behaviour of my program?
Yes. In the EBCDIC character set, the value of 'A' is not 0x41 but 0xC1.
C does not require ASCII character set.
(C99, 5.2.1p1) "The values of the members of the execution character set
are implementation-defined."
Both the character literal 'A' and the integer literal 0x41 have type int. Therefore, the only situation where they are not exactly the same is when the basic execution character set is not ASCII-based, in which case 'A' may have some other value. The only non-ASCII basic execution character set you are ever likely to encounter is EBCDIC, in which 'A' == 0xC1.
The C standard does guarantee that, whatever their actual values might be, the character literals '0' through '9' will be consecutive and in increasing numerical order, i.e. if i is an integer between 0 and 9 inclusive, '0' + i will be the character for the decimal representation of that integer. 'A' through 'Z' and 'a' through 'z' are required to be in increasing alphabetical order, but not to be consecutive, and indeed they are not consecutive in EBCDIC. (The standardese was tailored precisely to permit both ASCII and EBCDIC as-is.) You can get away with coding hexadecimal digits A through F with 'A' + i (or 'a' + i), because those are consecutive in both ASCII and EBCDIC, but it is technically something you are getting away with rather than something guaranteed.

Character to integer conversion off for alphabet characters

I'm converting a character string character by character into integers. So 'A' - '0' should be 10. However even though the numbers come out fine, alphabetical characters (i.e. A-F) come out as being off by 7. For instance, here's my line of code for conversion:
result = result + (((int) (*new - '0')) * pow(16, bases));
If I print that line piece by piece for a hex string like "A2C9" then for some reason my A is converted to 17 and my C is converted to 19. However the numbers 2 and 9 come out correctly. I'm trying to figure out if I'm missing something somewhere.
You are subtracting ASCII values. That is fine for A-Z and for 0-9, but not if you start mixing them. Read about the ASCII table to better understand the issue.
Here is the table:
http://www.asciitable.com/index/asciifull.gif
The ASCII code for 'A' is 65; for 'Z', it is 90.
The ASCII code for '0' is 48; for '9', it is 57. These codes are also used in Unicode (UTF-8), 8859-x, and many other codesets.
When you calculate 'A' - '0', you get 65 - 48 = 17, which is the 'off-by-seven' you are seeing.
To convert the alphabetic characters 'A' to 'F' to their hex equivalents, you need some variation on:
c - 'A' + 10;
Remembering that 'a' to 'f' are also allowed and for them you'd need:
c - 'a' + 10;
Or you'd need to convert to upper-case first. Or you can use:
const char hexdigits[] = "0123456789ABCDEF";
int digit = strchr(hexdigits, toupper(c)) - hexdigits;
or any of a myriad other techniques. This last fragment assumes that c is known to contain a valid hex digit. It fails horribly if that is not the case.
Note that C does guarantee that the codes for the digits 0-9 are consecutive, but does not guarantee that the codes for the letters A-Z are consecutive. In particular, if the codeset is EBCDIC (mainly but not solely used on IBM mainframes), the codes for the letters are not contiguous.

C - Convert char to int

I know that to convert any given char to int, this code is possible [apart from atoi()]:
int i = '2' - '0';
but I never understood how it worked, what is significance of '0' and I don't seem to find any explanation on the net about that.
Thanks in advance!!
In C, a character literal has type int. [Character Literals/IBM]
In your example, the numeric value of '0' is 48, the numeric value of '2' is 50. When you do '2' - '0' you get 50 - 48 = 2. This works for ASCII numbers from 0 to 9.
See ASCII table to get a better picture.
All the chars in C are represented with an integer value, the ASCII code of the character.
For instance '0' corresponds to 48 and '2' corresponds to 50, so '2'-'0' gets you 50-48 = 2
Link to an ASCII table: http://www.robelle.com/smugbook/ascii.html
When you use the commas ' ' you are treating the number as a char, and if this is given to an int, the int will take the value of the ASCII code of this character.
Any character literal enclosed in single quotes corresponds to a number that represents the ASCII code of that character. In fact, such literals evaluate not to char, but to int, so they are perfectly interchangeable with other number literals.
Within your expression, '2' is interchangeable with 50, and '0' is interchangeable with 48.
Have a look at the ASCII table.
'0' is represented as 0x30, '2' is represented as 0x32.
This results in
0x32 - 0x30 = 2
It's all about the ASCII codes of the corresponding characters.
In C, all the digits (0 to 9) are encoded in ASCII by values 48 to 57, sequentially. So '0' actually gets value 48, and '2' has the value 50. So when you write int i = '2' - '0';, you're actually subtracting 48 from 50, and get 2.
'0' to '9' are guaranteed to be sequential values in C in all character sets. This not limited to ASCII and C is not limited to the ASCII character set.
So sequential here means that '2' value is '0' + 2.
Regarding int and char note that '0' and '9' values are of type int in C and not of type char. A character literal is of type int.
Both terms are internally represented by the ASCII code of the number, and as numeric digits have consecutive ASCII codes subtracting them gives you the difference between the two numbers.
You can do similar tricks with characters as well, e.g. shift a lowercase letter to uppercase by subtracting 32 from it:
'a' - 32 = 'A'
This works only because ASCII assigns codes to characters in order i.e. '2' has a character code that is with 2 bigger than the character code of '0'.
In an another encoding it wouldn't work.
When a char is used in arithmetic, it is converted to its numeric code in the character set (its position in the ASCII table, on most systems).
This means that '2' - '0' is translated to 50 - 48.
So you could also find out the numeric distance of two letters in the same way, e.g.
'z' - 'a' equals 122 - 97 equals 25
You can look up the numeric representations of each ASCII character in this table:
http://www.asciitable.com/
Actually a char is just a small (one-byte) integer; C treats it differently depending on how you use it. For example printf("%d", 97) prints 97, but printf("%c", 97) prints a (on an ASCII system).

Numeric value of digit characters in C

I have just started reading through The C Programming Language and I am having trouble understanding one part. Here is an excerpt from page 24:
#include <stdio.h>

/* count digits, white space, others */
main()
{
    int c, i, nwhite, nother;
    int ndigit[10];

    nwhite = nother = 0;
    for (i = 0; i < 10; ++i)
        ndigit[i] = 0;

    while ((c = getchar()) != EOF)
        if (c >= '0' && c <= '9')
            ++ndigit[c-'0'];    /* THIS IS THE LINE I AM WONDERING ABOUT */
        else if (c == ' ' || c == '\n' || c == '\t')
            ++nwhite;
        else
            ++nother;

    printf("digits =");
    for (i = 0; i < 10; ++i)
        printf(" %d", ndigit[i]);
    printf(", white space = %d, other = %d\n",
           nwhite, nother);
}
The output of this program run on itself is
digits = 9 3 0 0 0 0 0 0 0 1, white space = 123, other = 345
The declaration
int ndigit[10];
declares ndigit to be an array of 10 integers. Array subscripts always start at zero in C, so the elements are
ndigit[0], ndigit[1], ..., ndigit[9]
This is reflected in the for loops that initialize and print the array. A subscript can be any integer expression, which includes integer variables like i, and integer constants. This particular program relies on the properties of the character representation of the digits. For example, the test
if (c >= '0' && c <= '9')
determines whether the character in c is a digit. If it is, the numeric value of that digit is
c - '0'
This works only if '0', '1', ..., '9' have consecutive increasing values. Fortunately, this is true for all character sets. By definition, chars are just small integers, so char variables and constants are identical to ints in arithmetic expressions. This is natural and convenient; for example
c - '0'
is an integer expression with a value between 0 and 9 corresponding to the character '0' to '9' stored in c, and thus a valid subscript for the array ndigit.
The part I am having trouble understanding is why the -'0' part is necessary in the expression c-'0'. If a character is a small integer as the author says, and the digit characters correspond to their numeric values, then what is -'0' doing?
Digit characters don't correspond to their numeric values. They correspond to their encoding values (in this case, ASCII).
IIRC, ASCII '0' is the value 48. And, luckily for this example and most character sets, the values of '0' through '9' appear in order in the character set.
So, subtracting the ASCII value for '0' from any ASCII digit returns its "true" value of 0-9.
The numeric value of a character is (on most systems) its ASCII value. The ASCII value of '0' is 48, '1' is 49, etc.
By subtracting 48 from the value of the character '0' becomes 0, '1' becomes 1, etc. By writing it as c - '0' you don't actually need to know what the ASCII value of '0' is (or that the system is using ASCII - it could be using EBCDIC). The only thing that matters is that the values are consecutive increasing integers.
It converts from the ASCII code of the '0' key on your keyboard to the value zero.
if you did int x = '0' + '0' the result would not be zero.
In most character encodings, all of the digits are placed consecutively in the character set. In ASCII for example, they start with '0' at 0x30 ('1' is 0x31, '2' is 0x32, etc.). If you want the numeric value of a given digit, you can just subtract '0' from it and get the right value. The advantage of using '0' instead of the specific value is that your code can be portable to other character sets with much less effort.
If you access a character string by their characters you'll get the ASCII values back, even if the characters happen to be numbers.
Fortunately the guys who designed that character table made sure that the characters for 0 to 9 are sequential, so you can simply convert from ASCII to a number by subtracting the ASCII-value of '0'.
That's what the code does. I have to admit that it is confusing when you see it the first time, but it's not rocket science.
The ASCII-character value of '0' is 48, '1' is 49, '2' is 50 and so on.
For reference here is a nice ASCII-chart:
http://www.sciencelobby.com/ascii-table/images/ascii-table1.gif