Char - ASCII relation - c

A char in the C programming language is a fixed-size byte entity designed specifically to be large enough to store a character value from an encoding such as ASCII.
But to what extent are the integer values relating to ASCII encoding interchangeable with the char characters? Is there any way to refer to 'A' as 65 (decimal)?
getchar() returns an integer - presumably this relates directly to such values? Also, if I am not mistaken, it is possible in certain contexts to increment chars ... such that (roughly speaking) '?'+1 == '@'.
Or is such encoding not guaranteed to be ASCII? Does it depend entirely upon the particular environment? Is such manipulation of chars impractical or impossible in C?
Edit: Relevant: C comparison char and int

I am answering just the question about incrementing characters, since the other issues are addressed in other answers.
The C standard guarantees that '0' to '9' are consecutive, so you can increment a digit character (except '9') and get the next digit character, or do other arithmetic with them (C 1999 5.2.1 3).
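For example, a minimal sketch that relies only on that guarantee:
#include <stdio.h>

int main(void) {
    char d = '3';
    printf("%c\n", d + 1);      /* prints the next digit character, '4' */
    printf("%d\n", '7' - '0');  /* prints 7: digit character to numeric value */
    return 0;
}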
The relationships between other characters are not guaranteed by the C standard, so you would need documentation from your specific C implementation (primarily the compiler) regarding this.

But to what extent are the integer values relating to ASCII encoding interchangeable with the char characters? Is there any way to refer to 'A' as 65 (decimal)?
In fact, you can't do anything else. char is just an integral type, and if you write
char ch = 'A';
then (assuming ASCII), ch will merely hold the integer value 65 - presenting it to the user is a different problem.
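For instance, a quick check (assuming an ASCII execution character set, as this answer does):
#include <stdio.h>

int main(void) {
    char ch = 'A';
    if (ch == 65)                             /* true only on ASCII-based systems */
        printf("%c has value %d\n", ch, ch);  /* prints: A has value 65 */
    return 0;
}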
Or is such encoding not guaranteed to be ASCII?
No, it isn't. C doesn't rely on any specific character encoding.
Does it depend entirely upon the particular environment?
Yes, pretty much.
Is such manipulation of chars impractical or impossible in C?
No, you just have to be careful and know the standard quite well - then you'll be safe.

Character literals like 'A' have type int: they are completely interchangeable with their integer value. However, that integer value is not mandated by the C standard; it might be ASCII (and is for the vast majority of common implementations) but need not be; it is implementation-defined. The mapping of integer values for characters does have one guarantee given by the Standard: the values of the decimal digits are contiguous (i.e., '1' - '0' == 1, ..., '9' - '0' == 9).
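A small sketch that makes the "type int" point visible (the printed values depend on the implementation):
#include <stdio.h>

int main(void) {
    printf("%zu %zu\n", sizeof 'A', sizeof(int));  /* same number: 'A' has type int in C */
    printf("%d\n", 'A');                           /* 65 on ASCII implementations, 193 on EBCDIC */
    return 0;
}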

Where the source code has 'A', the compiled object will just contain the corresponding byte value. That's why arithmetic on character values is allowed (note that in C the literal 'A' actually has type int, though its value fits in a byte).
Of course, a character encoding (more accurately, a code page) must be applied to get that byte value, and that codepage would serve as the "native" encoding of the compiler for hard-coded strings and char values.
Loosely, you could think of char and string literals in C source as essentially being macros. On an ASCII system the "macro" 'A' would resolve to (char) 65, and on an EBCDIC system to (char) 193. Similarly, C strings compile down to zero-terminated arrays of chars (bytes). This logic affects the symbol table also, since the symbols are taken from the source in its native encoding.
So no, ASCII is not the only possibility for the encoding of literals in source code. But because an ordinary single-quoted character must map to a single byte of the execution character set, encodings such as UTF-16 that need more than one byte per character are effectively excluded.

Related

How do I compare single multibyte character constants cross-platform in C?

In my previous post I found a solution to do this using C++ strings, but I wonder if there would be a solution using chars in C as well.
My current solution uses str.compare() and size() of a character string as seen in my previous post.
Now, since I only use one (multibyte) character in the std::string, would it be possible to achieve the same using a char?
For example, if (str[i] == '¶')? How do I achieve that using chars?
(edit: there was a typo on SO in the comparison operator, as pointed out in the comments)
How do I compare single multibyte character constants cross-platform in C?
You seem to mean an integer character constant expressed using a single multibyte character. The first thing to recognize, then, is that in C, integer character constants (examples: 'c', '¶') have type int, not char. The primary relevant section of C17 is paragraph 6.4.4.4/10:
An integer character constant has type int. The value of an integer character constant containing a single character that maps to a single-byte execution character is the numerical value of the representation of the mapped character interpreted as an integer. The value of an integer character constant containing more than one character (e.g., 'ab'), or containing a character or escape sequence that does not map to a single-byte execution character, is implementation-defined. If an integer character constant contains a single character or escape sequence, its value is the one that results when an object with type char whose value is that of the single character or escape sequence is converted to type int.
(Emphasis added.)
Note well that "implementation defined" implies limited portability from the get-go. Even if we rule out implementations defining perverse behavior, we still have alternatives such as
the implementation rejects integer character constants containing multibyte source characters; or
the implementation rejects integer character constants that do not map to a single-byte execution character; or
the implementation maps source multibyte characters via a bytewise identity mapping, regardless of the byte sequence's significance in the execution character set.
That is not an exhaustive list.
You can certainly compare integer character constants with each other, but if they map to multibyte execution characters then you cannot usefully compare them to individual chars.
Inasmuch as your intended application appears to be locating individual multibyte characters in a C string, the most natural approach is to implement a C analog of your C++ approach, using the standard strstr() function. Example:
#include <stdio.h>
#include <string.h>

int main(void) {
    char str[] = "Some string ¶ some text ¶ to see";
    char char_to_compare[] = "¶";
    int char_size = sizeof(char_to_compare) - 1; // don't count the string terminator

    for (char *location = strstr(str, char_to_compare);
         location;
         location = strstr(location + char_size, char_to_compare)) {
        puts("Found!");
    }
    return 0;
}
That will do the right thing in many cases, but it still might be wrong for some characters in some execution character encodings, such as those encodings featuring multiple shift states.
If you want robust handling for characters outside the basic execution character set, then you would be well advised to take control of the in-memory encoding, and to perform appropriate conversions to, operations on, and conversions from that encoding. This is largely what ICU does, for example.
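One way to take that control with only the standard library (not ICU) is to convert the string to wide characters and search there. A minimal sketch, assuming the execution character set and the run-time locale agree on the multibyte encoding:
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

int main(void) {
    setlocale(LC_ALL, "");             /* adopt the environment's multibyte encoding */

    const char *str = "Some string \u00B6 some text \u00B6 to see";
    wchar_t wbuf[128];

    if (mbstowcs(wbuf, str, 128) == (size_t)-1)
        return 1;                      /* invalid multibyte sequence for this locale */

    for (const wchar_t *p = wbuf; (p = wcschr(p, L'\u00B6')) != NULL; p++)
        puts("Found!");
    return 0;
}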
I believe you meant something like this:
char a = '¶';
char b = '¶';
if (a == b) /*do something*/;
The above may or may not work: if the value of '¶' is bigger than the char range, it will not fit, so a and b may store a value different from that of '¶'. Even so, they may well both end up holding the same value, so the comparison can still succeed.
Remember, the char type is simply a single-byte-wide (usually 8-bit) integer, so to work with multibyte characters without losing bits you just have to use a wider integer type (short, int, long...).
short a = '¶';
short b = '¶';
if (a == b) /*do something*/;
From personal experience, I've also noticed that sometimes your environment may use a different character encoding than the one you need. For example, trying to print the 'á' character will actually produce something else.
unsigned char x = 'á';
putchar(x); //actually prints character 'ß' in console.
putchar(160); //will print 'á'.
This happens because the console uses an Extended ASCII encoding, while my coding environment actually uses Unicode, parsing a value of 225 for 'á' instead of the value of 160 that I want.

C Language: Why int variable can store char?

I am recently reading The C Programming Language by Kernighan.
There is an example that defines a variable as int but uses getchar() to store into it.
int x;
x = getchar();
Why can we store char data in an int variable?
The only thing that I can think about is ASCII and UNICODE.
Am I right?
The getchar function (and similar character input functions) returns an int because of EOF. There are cases when (char) EOF != EOF (like when char is an unsigned type).
Also, in many places where one uses a char variable, it will silently be promoted to int anyway. And that includes constant character literals like 'A'.
getchar() attempts to read a byte from the standard input stream. The return value can be any possible value of the type unsigned char (from 0 to UCHAR_MAX), or the special value EOF which is specified to be negative.
On most current systems, UCHAR_MAX is 255 as bytes have 8 bits, and EOF is defined as -1, but the C Standard does not guarantee this: some systems have larger unsigned char types (9 bits, 16 bits...) and it is possible, although I have never seen it, that EOF be defined as another negative value.
Storing the return value of getchar() (or getc(fp)) to a char would prevent proper detection of end of file. Consider these cases (on common systems):
if char is an 8-bit signed type, a byte value of 255, which is the character ÿ in the ISO8859-1 character set, has the value -1 when converted to a char. Comparing this char to EOF will yield a false positive.
if char is unsigned, converting EOF to char will produce the value 255, which is different from EOF, preventing the detection of end of file.
These are the reasons for storing the return value of getchar() into an int variable. This value can later be converted to a char, once the test for end of file has failed.
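That reasoning leads to the standard reading idiom, sketched here:
#include <stdio.h>

int main(void) {
    int c;                           /* int, not char, so EOF remains distinguishable */
    while ((c = getchar()) != EOF)
        putchar(c);
    return 0;
}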
Storing an int to a char has implementation defined behavior if the char type is signed and the value of the int is outside the range of the char type. This is a technical problem, which should have mandated the char type to be unsigned, but the C Standard allowed for many existing implementations where the char type was signed. It would take a vicious implementation to have unexpected behavior for this simple conversion.
The value of the char does indeed depend on the execution character set. Most current systems use ASCII or some extension of ASCII such as ISO8859-x, UTF-8, etc. But the C Standard supports other character sets such as EBCDIC, where the lowercase letters do not form a contiguous range.
getchar is an old C standard function, and the philosophy back then was closer to how the language translates to assembly than to type correctness and readability. Keep in mind that compilers were not optimizing code as much as they do today. In C, int is the default return type (i.e. if you don't have a declaration of a function in C, compilers will assume that it returns int), and returning a value is done using a register; therefore returning a char instead of an int actually generates additional implicit code to mask out the extra bytes of your value. Thus, many old C functions prefer to return int.
C requires int be at least as many bits as char. Therefore, int can store the same values as char (allowing for signed/unsigned differences). In most cases, int is a lot larger than char.
char is an integer type that is intended to store a character code from the implementation-defined character set, which is required to be compatible with C's abstract basic character set. (ASCII qualifies, so do the source-charset and execution-charset allowed by your compiler, including the one you are actually using.)
For the sizes and ranges of the integer types (char included), see your <limits.h>.
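A quick way to print those limits on your own implementation:
#include <limits.h>
#include <stdio.h>

int main(void) {
    printf("CHAR_BIT = %d\n", CHAR_BIT);
    printf("char range: %d .. %d\n", CHAR_MIN, CHAR_MAX);
    printf("int  range: %d .. %d\n", INT_MIN, INT_MAX);
    return 0;
}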
C was designed as a very low-level language, so it is close to the hardware. Usually, after a bit of experience, you can predict how the compiler will allocate memory, and even pretty accurately what the machine code will look like.
Your intuition is right: it goes back to ASCII. ASCII is really a simple 1:1 mapping from letters (which make sense in human language) to integer values (that can be worked with by hardware); for every letter there is a unique integer. For example, the 'letter' CTRL-A is represented by the decimal number 1. (For historical reasons, lots of control characters came first, so CTRL-G, which rang the bell on an old teletype terminal, is ASCII code 7. Upper-case 'A' and the remaining 25 upper-case letters start at 65, and so on. See http://www.asciitable.com/ for a full list.)
C lets you 'coerce' variables into other types. In other words, the compiler cares about (1) the size, in memory, of the var (see 'pointer arithmetic' in K&R), and (2) what operations you can do on it.
Strictly speaking, you don't do arithmetic on a char itself; its value is promoted to int first. So, to convert all lower-case letters to upper-case, you can do something like:
char letter;
/* ... */
if (letter >= 'a' && letter <= 'z') {    /* lower-case letter (assuming ASCII) */
    letter = (int) letter - 32;          /* 'a' - 'A' == 32 in ASCII */
}
Some C compilers will warn about the implicit narrowing when the int result of the arithmetic is assigned back into the char.
But, in the end, the char type is really just a small integer type, since ASCII assigns a unique integer to each letter.
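For what it's worth, the standard library already provides a portable version of that case conversion; a short sketch:
#include <ctype.h>
#include <stdio.h>

int main(void) {
    char letter = 'q';
    letter = (char) toupper((unsigned char) letter);  /* not tied to the ASCII layout */
    printf("%c\n", letter);                           /* prints Q */
    return 0;
}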

Can I always assume the characters '0' to '9' appear sequentially in any C character encoding

I'm writing a program in C that converts some strings to integers. The way I've implemented this before is like so
int number = (character - '0');
This always works perfectly for me, but I started thinking, are there any systems using some obscure character encoding in which the characters '0' to '9' don't appear one after another in that order? This code assumes '1' follows '0', '2' follows '1' and so on, but is there ever a case when this is not true?
Yes, this is guaranteed by the C standard.
N1570 5.2.1 paragraph 3 says:
In both the source and execution basic character sets, the value of each character after 0 in the above list of decimal digits shall be one greater than the value of the previous.
This guarantee was possible because both ASCII and EBCDIC happen to have this property.
Note that there's no corresponding guarantee for letters; in EBCDIC, the letters do not have contiguous codes.
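So a string-to-integer conversion can lean on the digit guarantee alone; a minimal sketch (the function name is just illustrative, and there is no sign or overflow handling):
#include <ctype.h>
#include <stdio.h>

/* Convert the leading decimal digits of s to an int. */
int digits_to_int(const char *s) {
    int value = 0;
    while (isdigit((unsigned char)*s)) {
        value = 10 * value + (*s - '0');  /* relies on '0'..'9' being consecutive */
        s++;
    }
    return value;
}

int main(void) {
    printf("%d\n", digits_to_int("9042x"));  /* prints 9042 */
    return 0;
}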

Special char Literals

I want to assign a char a char literal, but it's a special character, say 255 or 13. I know that I can assign my char a literal int that will be cast to a char: char a = 13; I also know that Microsoft will let me use the hex code as a char literal: char a = '\xd'
I want to know if there's a way to do this that gcc supports also.
Writing something like
char ch = 13;
is mostly portable: it works on any platform where the value 13 means the same thing as on yours, which is every system that uses the ASCII character set (most systems today).
There may be platforms on which 13 can mean something else. However, using '\r' instead should always be portable, no matter the character encoding system.
Using other values that do not have character-literal equivalents is not portable. And using values above 127 is even less portable, since then you're outside the ASCII table and into the extended ASCII range, where the characters can depend on the locale settings of the system. For example, western European and eastern European language settings will most likely have different characters in the 128 to 255 range.
If you want to use a byte which can contain just some binary data and not letters, instead of using char you might be wanting to use e.g. uint8_t, to tell other readers of your code that you're not using the variable for letters but for binary data.
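A sketch contrasting those options (uint8_t comes from <stdint.h>):
#include <stdint.h>
#include <stdio.h>

int main(void) {
    char    cr  = '\r';   /* portable: carriage return in any execution character set */
    char    ch  = 13;     /* assumes an ASCII-compatible execution character set */
    uint8_t raw = 0xFF;   /* plain binary data: an unsigned byte type says so explicitly */
    printf("%d %d %u\n", cr, ch, (unsigned) raw);
    return 0;
}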
The hexadecimal escape sequence is not specific to Microsoft. It's part of C/C++: http://en.cppreference.com/w/cpp/language/escape
Meaning that to assign a hexadecimal value to a char, this is cross-platform code:
char a = '\xD';
The question already demonstrates assigning a decimal number to a char:
char a = 13;
And octal values can be assigned as well, using just the backslash escape:
char a = '\023';
Incidentally, '\0' is commonly used in C/C++ to represent the null character (independent of platform). '\0' is not a special escaped character of its own; it is actually the octal escape sequence with value 0.
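Putting the three forms side by side (the values shown are 13, 19, and 0):
char a = '\xD';   /* hexadecimal escape: value 13 */
char b = '\023';  /* octal escape: value 19 (023 octal) */
char c = '\0';    /* octal escape with value 0: the null character */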

Endianness -- why do chars put in an Int16 print backwards?

The following C code, compiled and run in XCode:
UInt16 chars = 'ab';
printf("\nchars: %2.2s", (char*)&chars);
prints 'ba', rather than 'ab'.
Why?
That particular implementation seems to store multi-character constants in little-endian format. In the constant 'ab' the character 'b' is the least significant byte (the little end) and the character 'a' is the most significant byte. If you viewed chars as an array, it'd be chars[0] = 'b' and chars[1] = 'a', and thus would be treated by printf as "ba".
Also, I'm not sure how accurate you consider Wikipedia, but regarding C syntax it has this section:
Multi-character constants (e.g. 'xy') are valid, although rarely useful — they let one store several characters in an integer (e.g. 4 ASCII characters can fit in a 32-bit integer, 8 in a 64-bit one). Since the order in which the characters are packed into one int is not specified, portable use of multi-character constants is difficult.
So it appears the 'ab' multi-character constant format should be avoided in general.
It depends on the system you're compiling/running your program on.
Obviously on your system, the short value (commonly 0x6162 for 'ab') is stored in memory with its least significant byte first, so the bytes read 0x62 0x61 ('b', 'a'): the little-endian way.
When you ask it to print a string, printf reads the bytes you have stored in memory one by one, which are 'b' and then 'a'. Hence your result.
Multicharacter character literals are implementation-defined:
C99 6.4.4.4p10: "The value of an integer character constant containing more than one character (e.g., 'ab'), or containing a character or escape sequence that does not map to a single-byte execution character, is implementation-defined."
gcc and icl print ba on Windows 7. tcc prints a and drops the second letter altogether...
The answer to your question can be found in your tags: Endianness. On a little endian machine the least significant byte is stored first. This is a convention and does not affect efficiency at all.
Of course, this means that you cannot simply reinterpret it as a character string and expect the original order, because a character string has no notion of significant bytes; it is just a sequence of bytes.
If you want to view the bytes within your variable, I suggest using a debugger that can read the actual bytes.
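Short of a debugger, a small sketch can dump the bytes directly (the value of the multi-character constant itself remains implementation-defined):
#include <stdio.h>

int main(void) {
    unsigned short chars = 'ab';                 /* commonly 0x6162, but implementation-defined */
    const unsigned char *p = (const unsigned char *)&chars;
    printf("value: 0x%04x\n", (unsigned) chars);
    printf("bytes in memory: 0x%02x 0x%02x\n", p[0], p[1]);  /* low byte first if little-endian */
    return 0;
}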

Resources