cppreference.com states that char is
Equivalent to either signed char or unsigned char [...], but char is a distinct type, different from both signed char and unsigned char
I assume this means that a char can hold exactly the same values as either unsigned char or signed char, but is not compatible with either. Why was it decided to work this way? Why does unqualified char not denote a char of the platform-appropriate signedness, like with the other integer types, where int denotes exactly the same type as signed int?
The three C character types char, signed char, and unsigned char exist as codification of legacy C implementations and usage.
The X3J11 committee that codified C into the first C standard (now known as C89) stated its purpose in the Rationale (italics original):
1.1 Purpose
The Committee's overall goal was to develop a clear, consistent, and unambiguous Standard for the C programming language which codifies the common, existing definition of C and which promotes the portability of user programs across C language environments.
The X3J11 charter clearly mandates the Committee to codify common existing practice. ...
N.B.: the X3J11 committee went out of their way to emphasize they were codifying existing implementations of C and common usage/practices in order to promote portability.
In other words, "standard" C was never created - existing C code, usages, and practices were codified.
Per 3.1.2.5 Types of that same Rationale (bolding mine):
Three types of char are specified: signed, plain, and unsigned. A plain char may be represented as either signed or unsigned, depending upon the implementation, as in prior practice. The type signed char was introduced to make available a one-byte signed integer type on those systems which implement plain char as unsigned. ...
The words of the committee are clear: three types of char exist because plain char had to be either signed or unsigned in order to match "prior practice". Plain char therefore had to be separate - portable code could not rely on plain char being signed or unsigned, but both signed char and unsigned char had to be available.
The three character types cannot be compatible in any way because of portability concerns - and portability of standard-conforming C code was one of the X3J11 committee's main goals.
If extern char buffer[10] were compatible with unsigned char buffer[10] on a system where plain char is unsigned, the code would behave differently if it were compiled* on a system where plain char is signed and therefore incompatible with unsigned char buffer[10]. For example, bit-shifting elements of buffer would change behavior depending on whether buffer were accessed through the extern char buffer[10]; declaration or the unsigned char buffer[10]; definition, breaking portability.
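As a hedged sketch of that scenario (not from the Rationale itself; it assumes CHAR_BIT is 8 and a two's complement machine), right-shifting an element of such a buffer gives different answers depending on the signedness of plain char:

#include <stdio.h>

int main(void)
{
    /* Hypothetical stand-in for the buffer above: one element holding
       the bit pattern 0xFF. */
    char buffer[10] = { (char)0xFF };

    /* If plain char is unsigned, buffer[0] is 255 and 255 >> 1 is 127.
       If plain char is signed, buffer[0] is -1, and -1 >> 1 is
       implementation-defined (commonly -1 via an arithmetic shift). */
    printf("%d\n", buffer[0] >> 1);
    return 0;
}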
Plain char being either signed or unsigned, with different behavior in such a situation, already existed, and the committee could not change that without violating their goal to "codif[y] the common, existing definition of C".
But with a goal of promoting portability, there was no reason whatsoever to create a crazed, portability-nightmare-inducing situation where "sometimes char is compatible with this and not that, and sometimes char is compatible with that and not this".
* - If the code compiled at all - but this is a hypothetical meant to demonstrate why the three char types must be incompatible.
TL;DR
Backwards compatibility. Probably. Or possibly that they had to choose and didn't care. But I have no certain answer.
Long version
Intro
Just like OP, I'd prefer a certain answer from a reliable source. In the absence of that, qualified guesses and speculations are better than nothing.
Very many things in C come from backwards compatibility. When it was decided that whether char is the same as signed char or unsigned char is implementation-defined, there was already a lot of C code out there, some of it using signed chars and some unsigned. Forcing it to be one or the other would certainly have broken some code.
Why it (probably) does not matter
Why does unqualified char not denote a char of the platform-appropriate signedness
It does not matter much. An implementation that uses signed chars guarantees that CHAR_MIN is equal to SCHAR_MIN and that CHAR_MAX is equal to SCHAR_MAX. The same goes for unsigned. So an unqualified char will always have exactly the same range as its qualified counterpart.
From the standard 5.2.4.2.1p2:
If the value of an object of type char is treated as a signed integer when used in an expression, the value of CHAR_MIN shall be the same as that of SCHAR_MIN and the value of CHAR_MAX shall be the same as that of SCHAR_MAX. Otherwise, the value of CHAR_MIN shall be 0 and the value of CHAR_MAX shall be the same as that of UCHAR_MAX.
This points us in the direction that they just didn't really care, or that it "feels safer".
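A minimal sketch of that guarantee, using only <limits.h> (nothing here is specific to any one implementation):

#include <limits.h>
#include <stdio.h>

int main(void)
{
    if (CHAR_MIN == SCHAR_MIN && CHAR_MAX == SCHAR_MAX)
        puts("plain char has the range of signed char");
    else if (CHAR_MIN == 0 && CHAR_MAX == UCHAR_MAX)
        puts("plain char has the range of unsigned char");
    /* 5.2.4.2.1p2 guarantees that exactly one of the branches above is taken. */
    return 0;
}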
Another interesting mention in the C standard is this:
All enumerations have an underlying type. The underlying type can be explicitly specified using an enum type specifier and is its fixed underlying type. If it is not explicitly specified, the underlying type is the enumeration’s compatible type, which is either a signed or unsigned integer type (excluding the bit-precise integer types), or char.
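That quote is the C23 wording. A small sketch of the enum type specifier it mentions (the name byte_flag is made up, and this needs a C23 compiler):

/* C23 only: an explicit fixed underlying type via the enum type specifier. */
enum byte_flag : unsigned char { FLAG_A = 1, FLAG_B = 2 };

/* Without the ": unsigned char" part, the compatible type would be some
   implementation-chosen signed or unsigned integer type, or char. */
int main(void) { return (int)FLAG_B; }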
Possible problems with breaking this (speculation)
I'm trying to come up with a scenario where this would actually matter. One that could possibly cause issues is compiling a source file into a shared library with one compiler that uses signed char, and then using that library from a source file compiled with another compiler that uses unsigned char.
And even if that would not cause problems, imagine that the shared library is compiled with a pre-ANSI compiler. I cannot say for certain that this would cause problems either, but I can imagine that it could.
And another speculation, from Steve Summit in the comment section:
I'm speculating, but: if the Standard had required, in Eric's phrasing, "char is the same type as an implementation-defined choice of signed char or unsigned char", then if I'm on a platform on which char is the same as signed char, I can intermix the two with no warnings, and create code that's not portable to a machine where char is unsigned by default. So the definition "char is a distinct type from signed char and unsigned char" helps force people to write portable code.
Backwards compatibility is a sacred feature
But remember that the people behind the C standard were, and are, VERY concerned about not breaking backwards compatibility. Even to the point that they don't want to change the signature of some library functions to return const-qualified pointers, because it would yield warnings. Not errors. Warnings! Warnings that you can easily disable. Instead, they just wrote in the standard that it's undefined behavior to modify the values. You can read more about that here: https://thephd.dev/your-c-compiler-and-standard-library-will-not-help-you
So whenever you encounter very strange design choices in the C standard, it's a very good bet that backwards compatibility is the reason. That's the reason why you can initialize a pointer to null with just 0, even on a machine where null is not the zero address. And why bool was, before C23, just a macro for the keyword _Bool.
It's also the reason why bitwise | and & have lower precedence than ==: there was a lot (several hundred kilobytes, installed on three (3) machines :) ) of source code containing things like if (a==b & c==d). Dennis Ritchie admitted that he should have changed it. https://www.lysator.liu.se/c/dmr-on-or.html
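A quick sketch of what that precedence choice preserves (any C compiler; recent ones will warn and suggest parentheses here):

#include <stdio.h>

int main(void)
{
    int a = 1, b = 1, c = 2, d = 2;

    /* Because == binds tighter than &, this groups as (a==b) & (c==d),
       which is what the pre-&& code relied on. */
    printf("%d\n", a == b & c == d);        /* prints 1 */

    /* Had & been moved above ==, the same source would have meant
       (a == (b & c)) == d instead: */
    printf("%d\n", (a == (b & c)) == d);    /* prints 0 */
    return 0;
}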
So we can at least say for certain that there are design choices made with backwards compatibility in mind that were later admitted, by those who made them, to be mistakes - and we have reliable sources for that.
C++
And also remember that your source points to C++ material. In that language there are additional reasons that don't apply to C, like overloading.
One part of the reasoning for not mandating either signed or unsigned for plain char is the EBCDIC code set used on IBM mainframes in particular.
In §6.2.5 Types ¶3, the C standard says:
An object declared as type char is large enough to store any member of the basic execution character set. If a member of the basic execution character set is stored in a char object, its value is guaranteed to be nonnegative.
Emphasis added.
Now, in EBCDIC, the lower-case letters have the code points 0x81-0x89, 0x91-0x99, 0xA2-0xA9; the upper-case letters have the code points 0xC1-0xC9, 0xD1-0xD9, 0xE2-0xE9; and the digits have the code points 0xF0-0xF9. So:
The alphabets are not contiguous.
Lower-case letters sort before upper-case letters.
Digits sort higher than letters.
And because of §6.2.5¶3, plain char has to be an unsigned type on such EBCDIC-based systems.
Each of the first three points is in contradistinction to ASCII (and ISO 8859, and ISO 10646 aka Unicode).
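As a sketch of why that forces the choice (hypothetical, and only meaningful on an EBCDIC system; on an ASCII machine the value is small enough either way):

#include <stdio.h>

int main(void)
{
    /* On an EBCDIC system 'a' is code point 0x81.  If plain char were
       signed and 8 bits wide, storing 'a' in a char would give a
       negative value, violating the "nonnegative" guarantee quoted
       above - so such implementations make plain char unsigned. */
    char c = 'a';
    printf("'a' stored in a char is %d\n", c);   /* must be nonnegative */
    return 0;
}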
The reason is backwards compatibility. Here is some research regarding the history behind it, using only authoritative primary sources such as the publications of Dennis M. Ritchie (the creator of C) and ISO.
In the beginning, there was only int and char. The early draft of C called "NB" for "new B" included these new types not present in the predecessors B and BCPL [Ritchie, 93]:
...it seemed that a typing scheme was necessary to cope with characters and byte addressing, and to prepare for the coming floating-point hardware
Embryonic C
NB existed so briefly that no full description of it was written. It supplied the types int and char, arrays of them, and pointers to them, declared in a style typified by
int i, j;
char c, d;
unsigned was added later [Ritchie, 93]:
During 1973-1980, the language grew a bit: the type structure gained unsigned, long...
Note that this refers to the stand-alone "type qualifier" unsigned at this point, equivalent to unsigned int.
Around this time in 1978, The C Programming Language 1st edition was published [Kernighan, 78] and in chapter 2.7 mentions type conversion problems related to char:
There is one subtle point about the conversion of characters to integers. The language does not specify whether variables of type char are signed or unsigned quantities. When a char is converted to an int, can it ever produce a negative integer? Unfortunately, this varies from machine to machine, reflecting differences in architecture. On some machines (PDP-11, for instance), a char whose leftmost bit is 1 will be converted to a negative integer (“sign extension”). On others, a char is promoted to an int by adding zeros at the left end, and thus is always positive.
At this point, the type promotion to int was what was described as problematic, not the signedness of char, which wasn't even specified. The above text remains mostly unchanged in the 2nd edition [Kernighan, 88].
However, the types themselves are described differently between editions. In the 1st edition [Kernighan, 78, 2.2], unsigned could only be applied to int and was regarded as a qualifier:
In addition, there are a number of qualifiers which can be applied to int’s: short, long, and unsigned.
Whereas the 2nd edition is in line with standard C [Kernighan, 88, 2.2]:
The qualifier signed or unsigned may be applied to char or any integer. [...] Whether plain chars are signed or unsigned is machine-dependent, but printable characters are always positive.
So in between the 1st and 2nd edition, they had discovered a backwards compatibility problem with applying the new unsigned/signed (now called type specifiers and not qualifiers [ANSI/ISO, 90]) to the char type, with the same concerns as were already identified regarding type conversions back in the 1st edition.
This compatibility concern remained during standardization in the late 80s. We can read this from the various rationales such as [ISO, 98, 6.1.2.5 §30]
Three types of char are specified: signed, plain, and unsigned. A plain char may be represented as either signed or unsigned, depending upon the implementation, as in prior practice. The type signed char was introduced to make available a one-byte signed integer type on those systems which implement plain char as unsigned. For reasons of symmetry, the keyword signed is allowed as part of the type name of other integral types. Two varieties of the integral types are specified: signed and unsigned. If neither specifier is used, signed is assumed. In the Base Document the only unsigned type is unsigned int.
This actually suggests that signed int was allowed to make int more symmetric with char, rather than the other way around.
Sources:
[ANSI/ISO, 90] ANSI/ISO 9899:1990 - Programming Languages - C
[ISO, 98] Rationale for International Standard - Programming Language - C, WG14/N802 J11/98-001
[Kernighan, 78] Kernighan, Brian W., Ritchie, Dennis M. - The C Programming Language, 1st edition (1978)
[Kernighan, 88] Kernighan, Brian W., Ritchie, Dennis M. - The C Programming Language, 2nd edition (1988)
[Ritchie, 93] Ritchie, Dennis M. - The Development of the C Language (1993)
The line you quote actually does not come from the C standard at all; it comes from the C++ standard. The website you link to (cppreference.com) is primarily about C++, and the C material there is something of an afterthought.
The reason this is important for C++ (and not really for C) is that C++ allows overloading based on types, but you can only overload distinct types. The fact that char must be distinct from both signed char and unsigned char means you can safely overload all three:
// 3 overloads for fn
void fn(char);
void fn(signed char);
void fn(unsigned char);
and you will not get an error about ambiguous overloading or such.
Related
In some code seen online, I saw that in the read function in C, someone uses a uint8_t array for the buffer instead of a char array. What are the differences? Thanks.
The C standard allows char to be signed or unsigned. It also allows it to be more than eight bits.
uint8_t, if it is defined, is unsigned and eight bits. This allows programmers to be completely sure of the type that will be used. In particular, signed char types sometimes cause problems with bitwise and shift operations, due to how these operations are defined (or are not defined) when negative values are involved.
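A small sketch of the kind of trouble meant here (it assumes 8-bit bytes, that uint8_t exists, and a plain char that happens to be signed):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint8_t u = 0x80;        /* always unsigned and exactly 8 bits */
    char    c = (char)0x80;  /* may hold -128 if plain char is signed */

    printf("%d\n", u >> 4);  /* always prints 8 */
    printf("%d\n", c >> 4);  /* 8 if plain char is unsigned; implementation-
                                defined (often -8) if plain char is signed */
    return 0;
}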
So every char corresponds to a number (see the ASCII table here). I think people use this to avoid some problems (sorry, I don't use C; I come from C++).
In C++, sizeof('a') == sizeof(char) == 1. This makes intuitive sense, since 'a' is a character literal, and sizeof(char) == 1 as defined by the standard.
In C however, sizeof('a') == sizeof(int). That is, it appears that C character literals are actually integers. Does anyone know why? I can find plenty of mentions of this C quirk but no explanation for why it exists.
discussion on same subject
"More specifically the integral promotions. In K&R C it was virtually (?)
impossible to use a character value without it being promoted to int first,
so making character constant int in the first place eliminated that step.
There were and still are multi character constants such as 'abcd' or however
many will fit in an int."
The original question is "why?"
The reason is that the definition of a literal character has evolved and changed, while trying to remain backwards compatible with existing code.
In the dark days of early C there were no types at all. By the time I first learnt to program in C, types had been introduced, but functions didn't have prototypes to tell the caller what the argument types were. Instead it was standardised that everything passed as a parameter would either be the size of an int (this included all pointers) or it would be a double.
This meant that when you were writing the function, all the parameters that weren't double were stored on the stack as ints, no matter how you declared them, and the compiler put code in the function to handle this for you.
This made things somewhat inconsistent, so when K&R wrote their famous book, they put in the rule that a character literal would always be promoted to an int in any expression, not just a function parameter.
When the ANSI committee first standardised C, they changed this rule so that a character literal would simply be an int, since this seemed a simpler way of achieving the same thing.
When C++ was being designed, all functions were required to have full prototypes (this is still not required in C, although it is universally accepted as good practice). Because of this, it was decided that a character literal could be stored in a char. The advantage of this in C++ is that a function with a char parameter and a function with an int parameter have different signatures. This advantage is not the case in C.
This is why they are different. Evolution...
I don't know the specific reasons why a character literal in C is of type int. But in C++, there is a good reason not to go that way. Consider this:
void print(int);
void print(char);
print('a');
You would expect that the call to print selects the second version taking a char. Having a character literal being an int would make that impossible. Note that in C++ literals having more than one character still have type int, although their value is implementation defined. So, 'ab' has type int, while 'a' has type char.
using gcc on my MacBook, I try:
#include <stdio.h>

/* Print the size of the argument, using the argument's spelling as a label. */
#define test(A) do { printf(#A ":\t%zu\n", sizeof(A)); } while (0)

int main(void)
{
    test('a');
    test("a");
    test("");
    test(char);
    test(short);
    test(int);
    test(long);
    test((char)0x0);
    test((short)0x0);
    test((int)0x0);
    test((long)0x0);
    return 0;
}
which when run gives:
'a': 4
"a": 2
"": 1
char: 1
short: 2
int: 4
long: 4
(char)0x0: 1
(short)0x0: 2
(int)0x0: 4
(long)0x0: 4
which suggests that a character is 8 bits, like you suspect, but a character literal is an int.
Back when C was being written, the PDP-11's MACRO-11 assembly language had:
MOV #'A, R0 // 8-bit character encoding for 'A' into 16 bit register
This kind of thing's quite common in assembly language - the low 8 bits will hold the character code, other bits cleared to 0. PDP-11 even had:
MOV #"AB, R0 // 16-bit character encoding for 'A' (low byte) and 'B'
This provided a convenient way to load two characters into the low and high bytes of the 16 bit register. You might then write those elsewhere, updating some textual data or screen memory.
So, the idea of characters being promoted to register size is quite normal and desirable. But, let's say you need to get 'A' into a register not as part of the hard-coded opcode, but from somewhere in main memory containing:
address: value
20: 'X'
21: 'A'
22: 'A'
23: 'X'
24: 0
25: 'A'
26: 'A'
27: 0
28: 'A'
If you want to read just an 'A' from this main memory into a register, which one would you read?
Some CPUs may only directly support reading a 16 bit value into a 16 bit register, which would mean a read at 20 or 22 would then require the bits from 'X' be cleared out, and depending on the endianness of the CPU one or other would need shifting into the low order byte.
Some CPUs may require a memory-aligned read, which means that the lowest address involved must be a multiple of the data size: you might be able to read from addresses 24 and 25, but not 27 and 28.
So, a compiler generating code to get an 'A' into the register may prefer to waste a little extra memory and encode the value as 0 'A' or 'A' 0 - depending on endianness, and also ensuring it is aligned properly (i.e. not at an odd memory address).
My guess is that C's simply carried this level of CPU-centric behaviour over, thinking of character constants occupying register sizes of memory, bearing out the common assessment of C as a "high level assembler".
(See 6.3.3 on page 6-25 of http://www.dmv.net/dec/pdf/macro.pdf)
I remember reading K&R and seeing a code snippet that would read a character at a time until it hit EOF. Since all characters are valid characters to be in a file/input stream, this means that EOF cannot be any char value. What the code did was to put the read character into an int, then test for EOF, then convert to a char if it wasn't.
I realize this doesn't exactly answer your question, but it would make some sense for the rest of the character literals to be sizeof(int) if the EOF literal was.
int r;
char buffer[1024], *p;   /* don't use in production - buffer overflow likely */

p = buffer;
while ((r = getc(file)) != EOF)   /* file is an already-opened FILE * */
{
    *(p++) = (char) r;
}
I haven't seen a rationale for it (C char literals being int types), but here's something Stroustrup had to say about it (from Design and Evolution 11.2.1 - Fine-Grain Resolution):
In C, the type of a character literal such as 'a' is int. Surprisingly, giving 'a' type char in C++ doesn't cause any compatibility problems. Except for the pathological example sizeof('a'), every construct that can be expressed in both C and C++ gives the same result.
So for the most part, it should cause no problems.
The historical reason for this is that C, and its predecessor B, were originally developed on various models of DEC PDP minicomputers with various word sizes, which supported 8-bit ASCII but could only perform arithmetic on registers. (Not the PDP-11, however; that came later.) Early versions of C defined int to be the native word size of the machine, and any value smaller than an int needed to be widened to int in order to be passed to or from a function, or used in a bitwise, logical or arithmetic expression, because that was how the underlying hardware worked.
That is also why the integer promotion rules still say that any data type smaller than an int is promoted to int. C implementations are also allowed to use one's-complement math instead of two's-complement for similar historical reasons. The reason that octal character escapes and octal constants are first-class citizens compared to hex is likewise that those early DEC minicomputers had word sizes divisible into three-bit chunks but not four-bit nibbles.
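A short sketch of the promotion rule described above (assuming CHAR_BIT is 8, so unsigned char holds 0..255):

#include <stdio.h>

int main(void)
{
    unsigned char a = 200, b = 100;

    /* Both operands are promoted to int before the addition, so the
       result is 300 rather than (200 + 100) % 256 == 44. */
    printf("%d\n", a + b);        /* prints 300 */

    unsigned char sum = a + b;    /* wraps only when stored back */
    printf("%d\n", sum);          /* prints 44 */
    return 0;
}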
I don't know, but I'm going to guess it was easier to implement it that way and it didn't really matter. It wasn't until C++ when the type could determine which function would get called that it needed to be fixed.
This is only tangential to the language spec, but in hardware the CPU usually only has one register size -- 32 bits, let's say -- and so whenever it actually works on a char (by adding, subtracting, or comparing it) there is an implicit conversion to int when it is loaded into the register. The compiler takes care of properly masking and shifting the number after each operation so that if you add, say, 2 to (unsigned char) 254, it'll wrap around to 0 instead of 256, but inside the silicon it is really an int until you save it back to memory.
It's sort of an academic point because the language could have specified an 8-bit literal type anyway, but in this case the language spec happens to reflect more closely what the CPU is really doing.
(x86 wonks may note that there is eg a native addh op that adds the short-wide registers in one step, but inside the RISC core this translates to two steps: add the numbers, then extend sign, like an add/extsh pair on the PowerPC)
This is the correct behavior, called "integral promotion". It can happen in other cases too (mainly binary operators, if I remember correctly).
EDIT: Just to be sure, I checked my copy of Expert C Programming: Deep Secrets, and I confirmed that a char literal does not start with a type int. It is initially of type char but when it is used in an expression, it is promoted to an int. The following is quoted from the book:
Character literals have type int and they get there by following the rules for promotion from type char. This is too briefly covered in K&R 1, on page 39 where it says:
Every char in an expression is converted into an int. ... Notice that all float's in an expression are converted to double. ... Since a function argument is an expression, type conversions also take place when arguments are passed to functions: in particular, char and short become int, float becomes double.
My friend says he read on some page on SO that they are different, but how could the two possibly be different?
Case 1
int i=999;
char c=i;
Case 2
char c=999;
In the first case, we initialize the integer i to 999, then initialize c with i, which is in fact 999. In the second case, we initialize c directly with 999. Truncation and loss of information aside, how on earth are these two cases different?
EDIT
Here's the link that I was talking of
why no overflow warning when converting int to char
One member commenting there says: "It's not the same thing. The first is an assignment, the second is an initialization."
So isn't it a lot more than only a question of optimization by the compiler?
They have the same semantics.
The constant 999 is of type int.
int i=999;
char c=i;
i is created as an object of type int and initialized with the int value 999, with the obvious semantics.
c is created as an object of type char, and initialized with the value of i, which happens to be 999. That value is implicitly converted from int to char.
The signedness of plain char is implementation-defined.
If plain char is an unsigned type, the result of the conversion is well defined. The value is reduced modulo CHAR_MAX+1. For a typical implementation with 8-bit bytes (CHAR_BIT==8), CHAR_MAX+1 will be 256, and the value stored will be 999 % 256, or 231.
If plain char is a signed type, and 999 exceeds CHAR_MAX, the conversion yields an implementation-defined result (or, starting with C99, raises an implementation-defined signal, but I know of no implementations that do that). Typically, for a 2's-complement system with CHAR_BIT==8, the result will be -25.
char c=999;
c is created as an object of type char. Its initial value is the int value 999 converted to char -- by exactly the same rules I described above.
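A tiny sketch reproducing those numbers (it assumes CHAR_BIT is 8 and, for the signed case, the usual two's complement truncation):

#include <stdio.h>

int main(void)
{
    int  i  = 999;
    char c1 = i;    /* converted at run time                         */
    char c2 = 999;  /* converted from a constant, by the same rules  */

    /* With 8-bit char: prints "231 231" if plain char is unsigned,
       typically "-25 -25" if plain char is signed. */
    printf("%d %d\n", c1, c2);
    return 0;
}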
If CHAR_MAX >= 999 (which can happen only if CHAR_BIT, the number of bits in a byte, is at least 10), then the conversion is trivial. There are C implementations for DSPs (digital signal processors) with CHAR_BIT set to, for example, 32. It's not something you're likely to run across on most systems.
You may be more likely to get a warning in the second case, since it's converting a constant expression; in the first case, the compiler might not keep track of the expected value of i. But a sufficiently clever compiler could warn about both, and a sufficiently naive (but still fully conforming) compiler could warn about neither.
As I said above, the result of converting a value to a signed type, when the source value doesn't fit in the target type, is implementation-defined. I suppose it's conceivable that an implementation could define different rules for constant and non-constant expressions. That would be a perverse choice, though; I'm not sure even the DS9K does that.
As for the referenced comment "The first is an assignment, the second is an initialization", that's incorrect. Both are initializations; there is no assignment in either code snippet. There is a difference in that one is an initialization with a constant value, and the other is not. Which implies, incidentally, that the second snippet could appear at file scope, outside any function, while the first could not.
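A sketch of that last point (a hypothetical source file; the commented-out lines show what gets rejected):

char c2 = 999;    /* fine at file scope: 999 is a constant expression,
                     converted to char (probably with a warning)       */

/* int  i  = 999;
   char c1 = i;      not valid at file scope: the initializer of an
                     object with static storage duration must be a
                     constant expression, and i is not one             */

int main(void) { return c2 != 0; }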
Any optimizing compiler will just make the int i = 999 local variable disappear and assign the truncated value directly to c in both cases. (Assuming that you are not using i anywhere else)
It depends on your compiler and optimization settings. Take a look at the actual assembly listing to see how different they are. For GCC and reasonable optimizations, the two blocks of code are probably equivalent.
Aside from the fact that the first also defines an object i of type int, the semantics are identical.
i, which is in fact 999
No, i is a variable. Semantically, it doesn't have a value at the point of the initialization of c; the value won't be known until runtime (even though we can clearly see what it will be, and so can an optimizing compiler). But in case 2 you're initializing a char with 999, which doesn't fit, so the compiler is likely to issue a warning.
I am currently dealing with code purchased from a third party contractor. One struct has an unsigned char field while the function that they are passing that field to requires a signed char. The compiler does not like this, as it considers them to be mismatched types. However, it apparently compiles for that contractor. Some Googling has told me that "[i]t is implementation-defined whether a char object can hold negative values". Could the contractor's compiler basically ignore the signed/unsigned type and treat them the same? Or is there a compiler flag that will treat them the same?
C is not my strongest language--just look at my tags on my user page--so any help would be much appreciated.
Actually char, signed char and unsigned char are three different types. From the standard (ISO/IEC 9899:1990):
6.1.2.5 Types
...
The three types char, signed char and unsigned char are collectively called the character types.
(and in C++, for instance, you have to (or at least should) write overloads for all three of them if you have a char argument)
Plain char might be treated signed or unsigned by the compiler, but the standard says (also in 6.1.2.5):
An object declared as type char is large enough to store any member of the basic execution character set. If a member of the required source character set in 5.2.1 is stored in a char object, its value is guaranteed to be positive. If other quantities are stored in a char object, the behavior is implementation-defined: the values are treated as either signed or nonnegative integers.
and
An object declared as type signed char occupies the same amount of storage as a ''plain'' char object.
The characters referred to in 5.2.1 are A-Z, a-z, 0-9, space, tab, newline and the following 29 graphic characters:
! " # % & ' ( ) * + , - . / :
; < = > ? [ \ ] ^ _ { | } ~
Answer
All of that I interpret to basically mean that ASCII characters with a value less than 128 are guaranteed to be positive. So if the values stored are always less than 128, it should be safe (from a value-preserving perspective), although it is not good practice.
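A sketch of that value-preserving point (assuming 8-bit char; the variable names are made up):

#include <stdio.h>

int main(void)
{
    unsigned char u = 'A';              /* 65, which fits in signed char too */
    signed char   s = (signed char)u;
    char          p = (char)u;

    /* Any value in 0..127 is representable in all three character
       types, so these conversions lose nothing. */
    printf("%d %d %d\n", u, s, p);      /* prints 65 65 65 */
    return 0;
}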
This is compiler-dependent. For example, in VC++ there's a compiler option to make plain char unsigned by default, and a corresponding _CHAR_UNSIGNED macro that is defined when that option is used.
I take it that you're talking about fields of type signed char and unsigned char, so they're explicitly wrong. If one of them was simply char, it might match in whatever compiler the contractor is using (IIRC, it's implementation-defined whether char is signed or unsigned), but not in yours. In that case, you might be able to get by with a command-line option or something to change yours.
Alternatively, the contractor might be using a compiler, or compiler options, that allow him to compile while ignoring errors or warnings. Do you know what sort of compilation environment he has?
In any case, this is not good C. If one of the types is just char, it relies on implementation-defined behavior, and therefore isn't portable. If not, it's flat wrong. I'd take this up with the contractor.