From The C Programming Language (Brian W. Kernighan and Dennis M. Ritchie), 2.7 Type Conversions, page 43:
"There is one subtle point about the
conversion of characters to integers.
... On some machines a char whose
leftmost bit is 1 will be converted to
a negative integer. On others, ... is
always positive. For portability,
specify signed or unsigned if
non-character data is to be stored in
char variables."
My questions are:
1) Why would anyone want to store non-character data in char? (An example where this is necessary would be really nice.)
2) Why does the integer value of a char change when it is converted to int?
3) Can you elaborate more on this portability issue?
In regards to 1)
People often use char arrays when they really want a byte buffer for a data stream. It's not great practice, but plenty of projects do it, and if you're careful, no real harm is done. There are probably other times as well.
In regards to 2)
Signed integers are sign-extended when they are widened to a larger type. Thus
11111111b (-1 in decimal) becomes 11111111 11111111 11111111 11111111 when expanded to 32 bits. However, if the char was intended to hold the unsigned value +255, the resulting signed integer ends up being -1.
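A minimal sketch of that difference, assuming 8-bit chars and two's complement:
#include <stdio.h>

int main(void)
{
    signed char sc = -1;      // bit pattern 11111111
    unsigned char uc = 255;   // same bit pattern
    int from_signed = sc;     // sign-extended: -1
    int from_unsigned = uc;   // zero-extended: 255
    printf("%d %d\n", from_signed, from_unsigned);   // prints: -1 255
    return 0;
}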
About portability 3)
Some machines regard chars as signed integers, while others interpret them as unsigned. It could also vary based on compiler implementation. Most of the time you don't have to worry about it. Kernighan is just trying to help you understand the details.
Edit
I know this is a dead issue, but you can use the following code to check whether chars on your system are signed or unsigned:
#include <limits.h> // implementation-specific limits such as CHAR_MAX and SCHAR_MAX
#if CHAR_MAX == SCHAR_MAX
// Plain "char" is signed
#else
// Plain "char" is unsigned
#endif
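If you prefer a runtime check, here is a minimal sketch; the expression (char)-1 is negative exactly when plain char is signed:
#include <stdio.h>

int main(void)
{
    // (char)-1 is negative if plain char is signed, UCHAR_MAX if it is unsigned.
    if ((char)-1 < 0)
        printf("plain char is signed\n");
    else
        printf("plain char is unsigned\n");
    return 0;
}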
1) char is the size of a single byte in C, and is therefore used for storing any sort of data. For example, when loading an image into memory, the data is represented as an array of char. In modern code, typedefs such as uint8_t are used to indicate the purpose of a buffer more usefully than just char.
2 & 3) Whether char is signed or unsigned is platform-dependent, so if a program depends on this behavior it's best to specify one or the other explicitly.
The char type is defined to hold one byte, i.e. sizeof(char) is defined to be 1. This is useful for serializing data, for instance.
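For instance, a sketch of the kind of byte-level serialization meant here; put_u32_le is a made-up helper name and little-endian order is an arbitrary choice:
#include <stdint.h>
#include <stdio.h>

// Hypothetical helper: store a 32-bit value into a byte buffer,
// least-significant byte first, one unsigned char at a time.
static void put_u32_le(unsigned char *buf, uint32_t v)
{
    buf[0] = (unsigned char)(v & 0xFF);
    buf[1] = (unsigned char)((v >> 8) & 0xFF);
    buf[2] = (unsigned char)((v >> 16) & 0xFF);
    buf[3] = (unsigned char)((v >> 24) & 0xFF);
}

int main(void)
{
    unsigned char buf[4];
    put_u32_le(buf, 0xCAFEBABE);
    printf("%02X %02X %02X %02X\n",
           (unsigned)buf[0], (unsigned)buf[1],
           (unsigned)buf[2], (unsigned)buf[3]);   // prints: BE BA FE CA
    return 0;
}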
char is implementation-defined as either unsigned char or signed char. Now imagine that char means smallint. You are simply converting a small integer to a larger integer when you go from smallint to int. The problem is, you don't know whether that smallint is signed or unsigned.
I would say it's not really a portability issue as long as you follow The Bible (K&R).
unsigned char is often used to process binary data one byte at a time. A common example is UTF-8 strings, which are not strictly made up of "chars."
If a signed char is 8 bits and the top bit is set, that indicates that it's negative. When this is converted to a larger type, the sign is kept by copying the high bit into all the new high bits of the larger type. This is called sign extension.
1) char is one byte on every system, so it is consistent.
2) The bit mentioned in your question is the one used as the sign bit in single-byte integers. Whether it is carried over (sign-extended) when you convert a char to a larger int depends on whether the char is signed. (There are also explicitly signed and unsigned chars.)
3) Because the char implementation is so consistent, lots of libraries use it for raw data, for example the Intel IPP (Integrated Performance Primitives) libraries and their cousin OpenCV.
Usually, in C, char-to-int conversion and vice versa is an issue because the standard APIs for reading character input and writing character output use int for the character arguments and return values. See getchar(), getc() and putchar() for example.
Also, since the size of a char is 1 byte, it is a convenient way to deal with arbitrary data as a byte stream.
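That is why the classic copy-input-to-output loop stores getchar()'s result in an int: EOF is an out-of-band int value that no char could reliably represent. A minimal sketch:
#include <stdio.h>

int main(void)
{
    int c;   // int, not char, so EOF can be distinguished from every valid character
    while ((c = getchar()) != EOF)
        putchar(c);
    return 0;
}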
Related
#include <stdbool.h>   // needed for bool in C; C++ has bool built in

int main()
{
    char c = 0xff;
    bool b = 0xff == c;
    // Under most C/C++ compilers' default options, b is FALSE!!!
    return 0;
}
Neither the C nor the C++ standard specifies char as signed or unsigned; it is implementation-defined.
Why doesn't the C/C++ standard explicitly define char as signed or unsigned, to avoid dangerous misuses like the code above?
Historical reasons, mostly.
Expressions of type char are promoted to int in most contexts (because a lot of CPUs don't have 8-bit arithmetic operations). On some systems, sign extension is the most efficient way to do this, which argues for making plain char signed.
On the other hand, the EBCDIC character set has basic characters with the high-order bit set (i.e., characters with values of 128 or greater); on EBCDIC platforms, char pretty much has to be unsigned.
The ANSI C Rationale (for the 1989 standard) doesn't have a lot to say on the subject; section 3.1.2.5 says:
Three types of char are specified: signed, plain, and unsigned. A
plain char may be represented as either signed or unsigned, depending
upon the implementation, as in prior practice. The type signed char
was introduced to make available a one-byte signed integer type on
those systems which implement plain char as unsigned. For reasons of
symmetry, the keyword signed is allowed as part of the type name of
other integral types.
Going back even further, an early version of the C Reference Manual from 1975 says:
A char object may be used anywhere an int may be. In all cases the
char is converted to an int by propagating its sign through the upper
8 bits of the resultant integer. This is consistent with the two’s
complement representation used for both characters and integers.
(However, the sign-propagation feature disappears in other
implementations.)
This description is more implementation-specific than what we see in later documents, but it does acknowledge that char may be either signed or unsigned. On the "other implementations" on which "the sign-propagation disappears", the promotion of a char object to int would have zero-extended the 8-bit representation, essentially treating it as an 8-bit unsigned quantity. (The language didn't yet have the signed or unsigned keyword.)
C's immediate predecessor was a language called B. B was a typeless language, so the question of char being signed or unsigned did not apply. For more information about the early history of C, see the late Dennis Ritchie's home page, now moved here.
As for what's happening in your code (applying modern C rules):
char c = 0xff;
bool b = 0xff == c;
If plain char is unsigned, then the initialization of c sets it to (char)0xff, which compares equal to 0xff in the second line. But if plain char is signed, then 0xff (an expression of type int) is converted to char -- but since 0xff exceeds CHAR_MAX (assuming CHAR_BIT==8), the result is implementation-defined. In most implementations, the result is -1. In the comparison 0xff == c, both operands are converted to int, making it equivalent to 0xff == -1, or 255 == -1, which is of course false.
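If the comparison needs to behave the same everywhere, a common workaround (a sketch, assuming CHAR_BIT == 8) is to compare through unsigned char:
#include <stdio.h>

int main(void)
{
    char c = 0xff;                              // -1 on typical signed-char platforms
    printf("%d\n", 0xff == c);                  // 0 where plain char is signed, 1 where unsigned
    printf("%d\n", 0xff == (unsigned char)c);   // 1 either way
    return 0;
}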
Another important thing to note is that unsigned char, signed char, and (plain) char are three distinct types. char has the same representation as either unsigned char or signed char; it's implementation-defined which one it is. (On the other hand, signed int and int are two names for the same type; unsigned int is a distinct type. (Except that, just to add to the frivolity, it's implementation-defined whether a bit field declared as plain int is signed or unsigned.))
Yes, it's all a bit of a mess, and I'm sure it would have been defined differently if C were being designed from scratch today. But each revision of the C language has had to avoid breaking (too much) existing code, and to a lesser extent existing implementations.
char was originally meant to store characters, so whether it's signed or unsigned is not important. What really matters is how to perform math on char efficiently, so depending on the system, the compiler will choose whatever is most appropriate.
Prior to ARMv4, ARM had no native support for loading halfwords and signed bytes. To load a signed byte you had to LDRB then sign extend the value (LSL it up then ASR it back down). This is painful so char is unsigned by default.
Why are unsigned types more efficient on ARM CPUs?
In fact, a lot of ARM compilers still use unsigned char by default, because even though you can load a byte with sign extension on modern ARM ISAs, that instruction is still less flexible than the zero-extension version.
Is char signed or unsigned by default on iOS?
char is unsigned by default on the Android NDK.
And most modern compilers also let you change char's signedness instead of using the default setting (for example, GCC and Clang provide the -fsigned-char and -funsigned-char options).
What is the need for signed and unsigned characters in C?
Is there some special reason for having both signed and unsigned char in C? Or was it simply added for completeness, so that the compiler does not have to check the data type before applying the signed/unsigned modifier?
I am not asking about signed and unsigned variables in general. My question is about the special cases where an unsigned character variable will not be sufficient, so that you have to depend on a signed character variable.
A char can be either signed or unsigned depending on what is most efficient for the underlying hardware. The keywords signed and unsigned allow you to explicitly specify that you want something else.
A quote from the C99 rationale:
Three types of char are specified: signed, plain, and unsigned. A plain char may be represented as either signed or unsigned depending upon the implementation, as in prior practice. The type signed char was introduced in C89 to make available a one-byte signed integer type on those systems which implemented plain char as unsigned char. For reasons of symmetry, the keyword signed is allowed as part of the type name of other integer types.
Information #1: char in C is just a small integer, typically 8 bits wide.
Information #2: The difference between signed and unsigned is that one bit of the representation is used as the sign bit in a signed variable.
Information #3: As a result of #2, signed variables hold a different range (-128 to 127 in the char case) than unsigned ones (0 to 255 in the char case).
Q-A #1: why do we need unsigned?
In most cases (for instance representing a pointer) we do not need signed variables. By convention all locations in the memory are exposed to the program as a contiguous array of unsigned addresses.
Q-A #2: why do we need signed?
Generally, to do signed arithmetic.
I assume you are using a char to hold numbers, not characters.
So:
signed char gives you at least the -128 to 127 range.
unsigned char gives you at least the 0 to 255 range.
A char is required by the standard to be AT LEAST 8 bits, which is why I say "at least". It is possible for these ranges to be larger.
Anyway, to answer your question, making a char unsigned frees the first bit from being the sign bit, thus allowing you to hold nearly double the maximum value of a signed char.
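A quick way to see the actual ranges on a given implementation is to print the constants from <limits.h>; a minimal sketch:
#include <limits.h>
#include <stdio.h>

int main(void)
{
    printf("char:          %d .. %d\n", CHAR_MIN, CHAR_MAX);
    printf("signed char:   %d .. %d\n", SCHAR_MIN, SCHAR_MAX);
    printf("unsigned char: 0 .. %u\n", (unsigned)UCHAR_MAX);
    return 0;
}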
The thing you have to understand is that the datatype "char" is actually just an integer, typically 8 bits wide. You can use it like any other integer datatype, as long as you respect the reduced value limits. There is no reason to limit "char" to characters.
On a 32/64-bit processor there is typically no need to use such small integer fields, but on an 8-bit processor such as the 8051, 8-bit integers are not only much faster to process, they also use less of the (limited) memory.
Say I have a Unicode character in wchar_t x;
Of course, the obvious way to convert x to ASCII is to use the wctob function.
But I'm wondering, since the first 255 characters of Unicode correspond with ASCII, will a cast to char consistently work across platforms?
char c = (char) x ; // cast to char, this works on Windows
The question is, will a cast to char guarantee to keep the LOW ORDER bits, or will it possibly keep the HIGH ORDER bits? (I'm concerned about a little-endian/big endian situation here, although I realize if it worked on my little endian system, it definitely should work on big endian systems).
For the sake of brevity, I use some terms loosely. To avoid much confusion, one is strongly advised to carefully study definitions of at least the following terms: ASCII, Unicode, UCS, UCS-2, UCS-4, UTF, UTF-8, UTF-16, UTF-32, character, character set, coded character set, repertoire, code unit.
The code of the character 'Q' is 81 in both ASCII and Unicode.
81 is just an integer, like any other integer. A char variable may store the number 81. A wchar_t variable may store the same number 81. We interpret 81 as 'Q' in both cases.
It does not make much sense to ask how the number 81 preserves when cast from e.g. long to short. If it fits then you are all set. There's no endianness or higher bits or lower bits or any of this stuff involved.
When you convert files that store characters, or streams of bytes over a network, endianness and bits and stuff begin to matter, just like with files that store (binary representations of) any old numbers.
If x does not fit in a char, then the behavior is officially "implementation-defined" and is allowed to raise a signal. If x does fit in a char, then the value is preserved (regardless of endianness).
6.3.1.3 Signed and unsigned integers
(1) When a value with integer type is converted to another integer type other than _Bool, if the value can be represented by the new type, it is unchanged.
(2) [does not apply here]
(3) Otherwise, the new type is signed and the value cannot be represented in it; either the result is implementation-defined or an implementation-defined signal is raised.
For maximum portability, perform a range check first and cast only if the value is in the range SCHAR_MIN to SCHAR_MAX.
(Others have noted and I wish to repeat that ASCII extends only to character 127.)
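A sketch of such a range check, restricting to the ASCII range 0..127 (which also stays within SCHAR_MAX); narrow_or_fail is a made-up helper name:
#include <wchar.h>

// Hypothetical helper: narrow a wide character to char only when the value
// is known to fit; returns -1 otherwise (never a valid result here, since
// only 0..127 is accepted).
static int narrow_or_fail(wchar_t x)
{
    if (x >= 0 && x <= 127)   // safe whether plain char is signed or unsigned
        return (char)x;
    return -1;
}

int main(void)
{
    wchar_t q = L'Q';
    int c = narrow_or_fail(q);   // 81, the code for 'Q'
    return c == 81 ? 0 : 1;
}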
I was under the impression that the endianness of the system does not matter in this situation.
I found a really good explanation here.
I think this should help put your fears about casting to rest.
When you typecast from an int to a char, you are cutting down the number of bytes used from 4 to 1. How does it pick which byte it is going to use make the char?
Does it take the most significant byte?
Or does it take the least significant?
Or is there some sort of rule I should know about?
C will take the least-significant byte when doing a narrowing conversion, so if you have the integer value 0xCAFEBABE and you convert it to a char, you'll get the value 0xBE.
Of course, there's no actual guarantee that an int is four bytes (a char, on the other hand, is always exactly one byte by definition, though a byte need not be 8 bits), but I'm pretty sure the logic for the truncation will always be the same: it just drops the higher-order bits that don't fit into the char.
If char is signed, it's implementation-defined unless the original value already fits in the range of values for char. An implementation is completely free to generate nonsense (or raise a signal) if it doesn't fit. If char is unsigned (which the standard allows), then the value is reduced modulo 1<<CHAR_BIT (usually 256).
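A minimal sketch of the well-defined unsigned case, assuming CHAR_BIT == 8 and the 0xCAFEBABE value from the answer above:
#include <stdio.h>

int main(void)
{
    unsigned int v = 0xCAFEBABE;
    unsigned char c = (unsigned char)v;   // value reduced modulo 256
    printf("0x%X\n", (unsigned)c);        // prints 0xBE
    return 0;
}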
Given that signed and unsigned ints use the same registers, etc., and just interpret bit patterns differently, and C chars are basically just 8-bit ints, what's the difference between signed and unsigned chars in C? I understand that the signedness of char is implementation defined, and I simply can't understand how it could ever make a difference, at least when char is used to hold strings instead of to do math.
It won't make a difference for strings. But in C you can use a char to do math, and then it will make a difference.
In fact, when working in constrained memory environments, like embedded 8-bit applications, a char will often be used to do math, and then it makes a big difference. This is because there is no byte type in C by default.
In terms of the values they represent:
unsigned char:
spans the value range 0..255 (00000000..11111111)
values overflow around low edge as:
0 - 1 = 255 (00000000 - 00000001 = 11111111)
values overflow around high edge as:
255 + 1 = 0 (11111111 + 00000001 = 00000000)
bitwise right shift operator (>>) does a logical shift:
10000000 >> 1 = 01000000 (128 / 2 = 64)
signed char:
spans the value range -128..127 (10000000..01111111)
values overflow around low edge as:
-128 - 1 = 127 (10000000 - 00000001 = 01111111)
values overflow around high edge as:
127 + 1 = -128 (01111111 + 00000001 = 10000000)
bitwise right shift operator (>>) does an arithmetic shift:
10000000 >> 1 = 11000000 (-128 / 2 = -64)
I included the binary representations to show that the value-wrapping behaviour is pure, consistent binary arithmetic and has nothing to do with a char being signed/unsigned (except for right shifts).
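A minimal sketch of the wrapping described above; the unsigned result is guaranteed by the standard, while the signed result is only what typical two's-complement implementations produce:
#include <stdio.h>

int main(void)
{
    unsigned char u = 0;
    signed char s = 127;

    u = u - 1;   // wraps to 255: conversion to unsigned char is modular
    s = s + 1;   // 128 doesn't fit; conversion back to signed char is implementation-defined, typically -128

    printf("%d %d\n", u, s);   // typically: 255 -128
    return 0;
}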
Update
Some implementation-specific behaviour mentioned in the comments:
char != signed char. The type "char" without "signed" or "unsigned" is implementation-defined, which means that it can act like either a signed or an unsigned type.
Signed integer overflow leads to undefined behavior where a program can do anything, including dumping core or overrunning a buffer.
#include <stdio.h>

int main(int argc, char** argv)
{
    char a = 'A';
    char b = 0xFF;
    signed char sa = 'A';
    signed char sb = 0xFF;
    unsigned char ua = 'A';
    unsigned char ub = 0xFF;

    printf("a > b: %s\n", a > b ? "true" : "false");
    printf("sa > sb: %s\n", sa > sb ? "true" : "false");
    printf("ua > ub: %s\n", ua > ub ? "true" : "false");
    return 0;
}
[root]# ./a.out
a > b: true
sa > sb: true
ua > ub: false
It's important when sorting strings.
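For example, a hand-rolled comparison that subtracts plain chars can order high-bit strings differently from strcmp, which the standard requires to compare bytes as unsigned char. A sketch (naive_cmp is a made-up name; the comments assume a signed 8-bit plain char):
#include <stdio.h>
#include <string.h>

// Naive comparison using plain char: the result depends on char's signedness.
static int naive_cmp(const char *a, const char *b)
{
    while (*a && *a == *b) {
        a++;
        b++;
    }
    return *a - *b;
}

int main(void)
{
    const char s1[] = "a";
    const char s2[] = "\xE9";   // 0xE9 is negative if plain char is signed

    printf("naive:  %d\n", naive_cmp(s1, s2) < 0);   // 0 on signed-char platforms, 1 where char is unsigned
    printf("strcmp: %d\n", strcmp(s1, s2) < 0);      // 1 everywhere: strcmp compares as unsigned char
    return 0;
}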
There are a couple of differences. Most importantly, if you overflow the valid range of a char by assigning it a value that is too big or too small, and char is signed, the resulting value is implementation-defined, or (in C) a signal could even be raised, as for all signed types. Contrast that with the case when you assign something too big or too small to an unsigned char: the value wraps around, and you get precisely defined semantics. For example, assigning -1 to an unsigned char gives you UCHAR_MAX. So whenever you have a byte, in the sense of a number from 0 to 2^CHAR_BIT - 1, you should really use unsigned char to store it.
The sign also makes a difference when passing to vararg functions:
char c = getSomeCharacter(); // returns 0..255
printf("%d\n", c);
Assume the value assigned to c is too big for char to represent, and the machine uses two's complement. Many implementations behave, in the case where you assign a too-large value to a char, such that the bit pattern is simply preserved. If an int can represent all values of char (which it can on most implementations), then the char is promoted to int before being passed to printf. So the value passed would be negative, and promoting to int retains that sign, so you will get a negative result. However, if char is unsigned, then the value is unsigned, and promoting it to an int yields a positive int. If you use unsigned char instead, you get precisely defined behavior both for the assignment to the variable and for passing it to printf, which will then print something positive.
Note that char, unsigned char and signed char are all at least 8 bits wide. There is no requirement that char is exactly 8 bits wide; that is true for most systems, but on some you will find 32-bit chars. A byte in C and C++ is defined to have the size of char, so a byte in C is also not always exactly 8 bits.
Another difference is that in C, an unsigned char must have no padding bits. That is, if you find CHAR_BIT is 8, then an unsigned char's values must range from 0 to 2^CHAR_BIT - 1. The same is true for char if it's unsigned. For signed char, you can't assume anything about the range of values: even if you know how your compiler implements the sign (two's complement or one of the other options), there may be unused padding bits in it. In C++, there are no padding bits for any of the three character types.
"What does it mean for a char to be signed?"
Traditionally, the ASCII character set consists of 7-bit character encodings (as opposed to the 8-bit EBCDIC).
When the C language was designed and implemented, this was a significant issue (for various reasons, like data transmission over serial modem devices). The extra bit has uses like parity.
A "signed character" happens to be perfect for this representation.
Binary data, OTOH, is simply taking the value of each 8-bit "chunk" of data, thus no sign is needed.
Arithmetic on bytes is important for computer graphics (where 8-bit values are often used to store colors). Aside from that, I can think of two main cases where char sign matters:
converting to a larger int
comparison functions
The nasty thing is that these won't bite you if all your string data is 7-bit. However, they promise to be an unending source of obscure bugs if you're trying to make your C/C++ program 8-bit clean.
Signedness works pretty much the same way in chars as it does in other integral types. As you've noted, chars are really just one-byte integers. (Not necessarily 8-bit, though! There's a difference; a byte might be bigger than 8 bits on some platforms, and chars are rather tied to bytes due to the definitions of char and sizeof(char). The CHAR_BIT macro, defined in <limits.h> or C++'s <climits>, will tell you how many bits are in a char.).
As for why you'd want a character with a sign: in C and C++, there is no standard type called byte. To the compiler, chars are bytes and vice versa, and it doesn't distinguish between them. Sometimes, though, you want to -- sometimes you want that char to be a one-byte number, and in those cases (particularly how small a range a byte can have), you also typically care whether the number is signed or not. I've personally used signedness (or unsignedness) to say that a certain char is a (numeric) "byte" rather than a character, and that it's going to be used numerically. Without a specified signedness, that char really is a character, and is intended to be used as text.
I used to do that, rather. Now the newer versions of C and C++ have (u?)int_least8_t (currently typedef'd in <stdint.h> or <cstdint>), which are more explicitly numeric (though they'll typically just be typedefs for signed and unsigned char types anyway).
The only situation I can imagine this being an issue is if you choose to do math on chars. It's perfectly legal to write the following code.
char a = (char)42;
char b = (char)120;
char c = a + b;
Depending on the signedness of char, c could be one of two values. If chars are unsigned, then c will be (char)162. If they are signed, the value 162 does not fit, since the maximum value for a signed char is 127, and the conversion back to char is implementation-defined; most implementations will just give you (char)-94, the two's-complement interpretation of 162.
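A runnable sketch of that addition; the printed value in the signed-char case is typical, not guaranteed:
#include <stdio.h>

int main(void)
{
    char a = (char)42;
    char b = (char)120;
    char c = a + b;   // the sum (162) is computed in int, then converted back to char

    printf("%d\n", c);   // 162 if char is unsigned, typically -94 if signed
    return 0;
}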
One thing about signed chars is that you can test c >= ' ' (space) and be fairly sure it's a normal printable ASCII char. Of course, it's not portable, so not very useful.