Is there any situation in which changing 'A' to 0x41 could change the behaviour of my program? How about changing 0x41 to 'A'? Are there any uncommon architectures or obscure compiler settings or weird macros that might make those to be not exactly equivalent? If they are exactly equivalent in a standards compliant compiler, has anyone come across a buggy or non-standard compiler where they are not the same?
Is there any situation in which changing 'A' to 0x41 could change the behaviour of my program?
Yes. In the EBCDIC character set, the value of 'A' is not 0x41 but 0xC1.
C does not require the ASCII character set.
(C99, 5.2.1p1) "The values of the members of the execution character set
are implementation-defined."
Both the character literal 'A' and the integer literal 0x41 have type int. Therefore, the only situation where they are not exactly the same is when the basic execution character set is not ASCII-based, in which case 'A' may have some other value. The only non-ASCII basic execution character set you are ever likely to encounter is EBCDIC, in which 'A' == 0xC1.
The C standard does guarantee that, whatever their actual values might be, the character literals '0' through '9' will be consecutive and in increasing numerical order, i.e. if i is an integer between 0 and 9 inclusive, '0' + i will be the character for the decimal representation of that integer. 'A' through 'Z' and 'a' through 'z' are required to be in increasing alphabetical order, but not to be consecutive, and indeed they are not consecutive in EBCDIC. (The standardese was tailored precisely to permit both ASCII and EBCDIC as-is.) You can get away with coding hexadecimal digits A through F with 'A' + i (or 'a' + i), because those are consecutive in both ASCII and EBCDIC, but it is technically something you are getting away with rather than something guaranteed.
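A short self-contained program makes both points concrete; the second loop is the "getting away with it" case, since the standard does not guarantee that 'A' through 'F' are consecutive:

#include <stdio.h>

int main(void)
{
    int i;

    /* Guaranteed by the standard: '0' + i is the character for digit i. */
    for (i = 0; i <= 9; i++)
        putchar('0' + i);               /* prints 0123456789 */
    putchar('\n');

    /* NOT guaranteed, but true in both ASCII and EBCDIC: 'A'..'F'
       happen to be consecutive, so this prints the hex digits A-F. */
    for (i = 10; i <= 15; i++)
        putchar('A' + (i - 10));
    putchar('\n');
    return 0;
}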
Related
So as part of my C classes, for our first homework we are supposed to implement our own atof.c function, and then use it for some tasks. So, being the smart stay-at-home student I am, I decided to look at the atof.c source code and adapt it to meet my needs. I think I'm on board with most of the operations that this function does, like counting the digits before and after the decimal point; however, there is one line of code that I do not understand. I'm assuming this is the line that actually converts the ASCII digit into a digit of type int. Posting it here:
frac1 = 10*frac1 + (c - '0');
In the source code, c is the digit being processed, and frac1 is an int that accumulates digits from the incoming ASCII string. But why does c - '0' work? And as a follow-up, is there another way of achieving the same result?
There is no such thing as "text" in C. Just APIs that happen to treat integer values as text information. char is an integer type, and you can do math with it. Character literals are actually ints in C (in C++ they're char, but they're still usable as numeric values even there).
'0' is a nice way for humans to write "the ordinal value of the character for zero"; in ASCII, that's the number 48. Since the digits appear in order from 0 to 9 in all encodings I'm aware of, you can convert from the ordinal value in the encoding (e.g. ASCII) to actual numeric values by subtracting away '0' to get actual int values from 0 to 9.
You could just as easily subtract 48 directly (when compiled, it would be impossible to tell which option you used; 48 and ASCII '0' are indistinguishable); it would just be less obvious to other people reading your source code what you were doing.
In ASCII (and in code page 437, the IBM PC's default character set), '0' is the 48th character; similarly, '1' is the 49th, and so on. Subtracting '0' instead of a magic number such as 48 is much clearer as far as self-documentation goes.
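For comparison, here is a minimal sketch of the same digit-accumulation idea as a stripped-down my_atof (a hypothetical name, not the real library source); it deliberately ignores signs, exponents, leading whitespace, and overflow:

#include <stdio.h>

/* Hypothetical, stripped-down sketch of atof-style parsing:
   no sign handling, no exponent, no overflow checks. */
double my_atof(const char *s)
{
    double val = 0.0, power = 1.0;
    int c;

    while ((c = *s++) >= '0' && c <= '9')
        val = 10.0 * val + (c - '0');   /* same idea as 10*frac1 + (c - '0') */

    if (c == '.')
        while ((c = *s++) >= '0' && c <= '9') {
            val = 10.0 * val + (c - '0');
            power *= 10.0;              /* one decimal place per fractional digit */
        }

    return val / power;
}

int main(void)
{
    printf("%f\n", my_atof("3.14"));    /* prints 3.140000 */
    return 0;
}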
void push(float[],float);
Here, st[] is a stack of float values and exp[] is a char array storing a postfix expression.
push(st,(float)(exp[i]-'0'));
I couldn't figure out the purpose of the (exp[i] - '0') part, though. Why are we subtracting '0'?
A character is basically nothing more than an integer, whose value is the encoding of the character.
In the most common encoding scheme, ASCII, the value for e.g. the character '0' is 48, and the value for e.g. '3' is 51. Now, if we have a variable someChar containing the character '3' and you do someChar - '0' it's the same as doing 51 - 48 which will result in the value 3.
So if you have a digit read as a character from somewhere, then you subtract '0' to get the integer value of that digit.
This also works on other encodings, not only ASCII, because the C specification says that all encodings must have the digits in consecutive order.
Note that this "trick" is not guaranteed to work for any non-digit character.
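A small example of both the safe and the unsafe case, using isdigit() from <ctype.h> as the guard:

#include <ctype.h>
#include <stdio.h>

int main(void)
{
    const char *s = "7x";

    /* Safe: s[0] is a digit, so s[0] - '0' is guaranteed to be 7. */
    if (isdigit((unsigned char)s[0]))
        printf("%d\n", s[0] - '0');

    /* Unsafe: 'x' - '0' is just some implementation-defined distance,
       not a meaningful digit value, so guard with isdigit() first. */
    if (!isdigit((unsigned char)s[1]))
        printf("'%c' is not a digit\n", s[1]);
    return 0;
}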
I'm writing a program in C that converts some strings to integers. The way I've implemented this before is like so:
int number = (character - '0');
This always works perfectly for me, but I started thinking, are there any systems using some obscure character encoding in which the characters '0' to '9' don't appear one after another in that order? This code assumes '1' follows '0', '2' follows '1' and so on, but is there ever a case when this is not true?
Yes, this is guaranteed by the C standard.
N1570 5.2.1 paragraph 3 says:
In both the source and execution basic character sets, the value of
each character after 0 in the above list of decimal digits shall be
one greater than the value of the previous.
This guarantee was possible because both ASCII and EBCDIC happen to have this property.
Note that there's no corresponding guarantee for letters; in EBCDIC, the letters do not have contiguous codes.
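For instance, the common c - 'a' + 'A' uppercasing trick leans on contiguous letters, which the standard does not promise; toupper() from <ctype.h> is the portable route:

#include <ctype.h>
#include <stdio.h>

int main(void)
{
    char c = 'j';

    /* Works on ASCII systems, but relies on contiguous letters,
       which the standard does not guarantee (EBCDIC breaks it). */
    printf("%c\n", c - 'a' + 'A');

    /* Portable on any conforming implementation. */
    printf("%c\n", toupper((unsigned char)c));
    return 0;
}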
A char in the C programming language is a fixed-size byte entity designed specifically to be large enough to store a character value from an encoding such as ASCII.
But to what extent are the integer values relating to ASCII encoding interchangeable with the char characters? Is there any way to refer to 'A' as 65 (decimal)?
getchar() returns an integer - presumably this relates directly to such values? Also, if I am not mistaken, it is possible in certain contexts to increment chars ... such that (roughly speaking) '?'+1 == '#'.
Or is such encoding not guaranteed to be ASCII? Does it depend entirely upon the particular environment? Is such manipulation of chars impractical or impossible in C?
Edit: Relevant: C comparison char and int
I am answering just the question about incrementing characters, since the other issues are addressed in other answers.
The C standard guarantees that '0' to '9' are consecutive, so you can increment a digit character (except '9') and get the next digit character, or do other arithmetic with them (C 1999 5.2.1 3).
The relationships between other characters are not guaranteed by the C standard, so you would need documentation from your specific C implementation (primarily the compiler) regarding this.
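A minimal example of the guaranteed digit arithmetic:

#include <stdio.h>

int main(void)
{
    char d = '5';

    ++d;                  /* guaranteed to yield '6': digits are consecutive */
    printf("%c\n", d);

    /* Note the guarantee stops at '9': incrementing '9' need not
       produce anything meaningful. */
    return 0;
}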
But to what extent are the integer values relating to ASCII encoding interchangeable with the char characters? Is there any way to refer to 'A' as 65 (decimal)?
In fact, you can't do anything else. char is just an integral type, and if you write
char ch = 'A';
then (assuming ASCII), ch will merely hold the integer value 65 - presenting it to the user is a different problem.
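You can see both views of the same value by printing it with different conversion specifiers:

#include <stdio.h>

int main(void)
{
    char ch = 'A';

    printf("%c\n", ch);   /* presented as a character: A */
    printf("%d\n", ch);   /* presented as an integer: 65 on an ASCII system */
    return 0;
}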
Or is such encoding not guaranteed to be ASCII?
No, it isn't. C doesn't rely on any specific character encoding.
Does it depend entirely upon the particular environment?
Yes, pretty much.
Is such manipulation of chars impractical or impossible in C?
No, you just have to be careful and know the standard quite well - then you'll be safe.
Character literals like 'A' have type int; they are completely interchangeable with their integer value. However, that integer value is not mandated by the C standard; it might be ASCII (and is for the vast majority of common implementations) but need not be; it is implementation-defined. The mapping of integer values for characters does have one guarantee given by the standard: the values of the decimal digits are contiguous (i.e., '1' - '0' == 1, ..., '9' - '0' == 9).
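One observable consequence (in C, not C++) is that a character literal has the size of an int; a small demonstration, with the caveat that the printed values assume ASCII and a typical 4-byte int:

#include <stdio.h>

int main(void)
{
    /* In C, 'A' has type int, so sizeof('A') == sizeof(int) (typically 4).
       In C++, the same literal has type char and sizeof('A') is 1. */
    printf("%zu %zu\n", sizeof('A'), sizeof(int));

    int n = 'A' + 1;      /* plain integer arithmetic; 66 on an ASCII system */
    printf("%d\n", n);
    return 0;
}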
Where the source code has 'A', the compiled object will just have the byte value instead. That's why it is allowed to do arithmetic with character values (note that in C the literal 'A' actually has type int, although its value fits in a byte).
Of course, a character encoding (more accurately, a code page) must be applied to get that byte value, and that codepage would serve as the "native" encoding of the compiler for hard-coded strings and char values.
Loosely, you could think of char and string literals in C source as essentially being macros. On an ASCII system the "macro" 'A' would resolve to the integer constant 65, and on an EBCDIC system to 193. Similarly, C strings compile down to zero-terminated arrays of chars (bytes). This logic affects the symbol table also, since the symbols are taken from the source in its native encoding.
So no, ASCII is not the only possibility for the encoding of literals in source code. But because a single-quoted character literal must fit in a single char, multi-byte encodings such as UTF-16 are effectively excluded as the basic execution character set.
I have just started reading through The C Programming Language and I am having trouble understanding one part. Here is an excerpt from page 24:
#include <stdio.h>

/* count digits, white space, others */
main()
{
    int c, i, nwhite, nother;
    int ndigit[10];

    nwhite = nother = 0;
    for (i = 0; i < 10; ++i)
        ndigit[i] = 0;

    while ((c = getchar()) != EOF)
        if (c >= '0' && c <= '9')
            ++ndigit[c - '0'];   /* THIS IS THE LINE I AM WONDERING ABOUT */
        else if (c == ' ' || c == '\n' || c == '\t')
            ++nwhite;
        else
            ++nother;

    printf("digits =");
    for (i = 0; i < 10; ++i)
        printf(" %d", ndigit[i]);
    printf(", white space = %d, other = %d\n",
        nwhite, nother);
}

The output of this program run on itself is

digits = 9 3 0 0 0 0 0 0 0 1, white space = 123, other = 345
The declaration

int ndigit[10];

declares ndigit to be an array of 10 integers. Array subscripts always start at zero in C, so the elements are

ndigit[0], ndigit[1], ..., ndigit[9]
This is reflected in the for loops that initialize and print the array. A subscript can be any integer expression, which includes integer variables like i, and integer constants. This particular program relies on the properties of the character representation of the digits. For example, the test

if (c >= '0' && c <= '9')

determines whether the character in c is a digit. If it is, the numeric value of that digit is

c - '0'

This works only if '0', '1', ..., '9' have consecutive increasing values. Fortunately, this is true for all character sets. By definition, chars are just small integers, so char variables and constants are identical to ints in arithmetic expressions. This is natural and convenient; for example

c - '0'

is an integer expression with a value between 0 and 9 corresponding to the character '0' to '9' stored in c, and thus a valid subscript for the array ndigit.
The part I am having trouble understanding is why the - '0' part is necessary in the expression c - '0'. If a character is a small integer as the author says, and the digit characters correspond to their numeric values, then what is - '0' doing?
Digit characters don't correspond to their numeric values. They correspond to their encoding values (in this case, ASCII).
IIRC, ASCII '0' is the value 48. And, luckily for this example and most character sets, the values of '0' through '9' appear in order in the character set.
So, subtracting the ASCII value for '0' from any ASCII digit returns its "true" value of 0-9.
The numeric value of a character is (on most systems) its ASCII value. The ASCII value of '0' is 48, '1' is 49, etc.
By subtracting 48 from the value of the character '0' becomes 0, '1' becomes 1, etc. By writing it as c - '0' you don't actually need to know what the ASCII value of '0' is (or that the system is using ASCII - it could be using EBCDIC). The only thing that matters is that the values are consecutive increasing integers.
It converts from the ASCII code of the '0' character to the value zero.
If you did int x = '0' + '0', the result would not be zero (it would be 96 in ASCII).
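Spelled out in code (the concrete values assume ASCII):

#include <stdio.h>

int main(void)
{
    int wrong = '0' + '0';   /* 48 + 48 == 96 in ASCII, not 0 */
    int right = '7' - '0';   /* the offsets cancel: always 7 */

    printf("%d %d\n", wrong, right);
    return 0;
}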
In most character encodings, all of the digits are placed consecutively in the character set. In ASCII for example, they start with '0' at 0x30 ('1' is 0x31, '2' is 0x32, etc.). If you want the numeric value of a given digit, you can just subtract '0' from it and get the right value. The advantage of using '0' instead of the specific value is that your code can be portable to other character sets with much less effort.
If you access a character string character by character, you'll get the ASCII values back, even if the characters happen to be digits.
Fortunately, the people who designed that character table made sure the characters for 0 to 9 are sequential, so you can simply convert from ASCII to a number by subtracting the ASCII value of '0'.
That's what the code does. I have to admit that it is confusing when you see it the first time, but it's not rocket science.
The ASCII value of '0' is 48, '1' is 49, '2' is 50, and so on.
For reference here is a nice ASCII-chart:
http://www.sciencelobby.com/ascii-table/images/ascii-table1.gif