Given that signed and unsigned ints use the same registers, etc., and just interpret bit patterns differently, and C chars are basically just 8-bit ints, what's the difference between signed and unsigned chars in C? I understand that the signedness of char is implementation defined, and I simply can't understand how it could ever make a difference, at least when char is used to hold strings instead of to do math.
It won't make a difference for strings. But in C you can also use a char to do math, and then it will make a difference.
In fact, in constrained memory environments such as embedded 8-bit applications, a char is often used to do math, and then it makes a big difference. This is because C has no byte type by default.
In terms of the values they represent:
unsigned char:
spans the value range 0..255 (00000000..11111111)
values overflow around low edge as:
0 - 1 = 255 (00000000 - 00000001 = 11111111)
values overflow around high edge as:
255 + 1 = 0 (11111111 + 00000001 = 00000000)
bitwise right shift operator (>>) does a logical shift:
10000000 >> 1 = 01000000 (128 / 2 = 64)
signed char:
spans the value range -128..127 (10000000..01111111)
values overflow around low edge as:
-128 - 1 = 127 (10000000 - 00000001 = 01111111)
values overflow around high edge as:
127 + 1 = -128 (01111111 + 00000001 = 10000000)
bitwise right shift operator (>>) typically does an arithmetic shift (the result for negative values is implementation-defined):
10000000 >> 1 = 11000000 (-128 / 2 = -64)
I included the binary representations to show that the value wrapping behaviour is pure, consistent binary arithmetic and has nothing to do with a char being signed/unsigned (except for right shifts).
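The unsigned half of the list above can be checked directly. This is a minimal sketch; the helper names are made up for illustration. Only the unsigned cases are shown because the signed results are implementation-defined and cannot be asserted portably.

```c
#include <limits.h>  /* UCHAR_MAX, the largest unsigned char value */

/* Hypothetical helpers, named only for this sketch. */

/* Subtract 1: the arithmetic is done in int, and converting the result
   back to unsigned char reduces it modulo UCHAR_MAX + 1. */
unsigned char dec_wrap(unsigned char v) { return (unsigned char)(v - 1); }

/* Add 1 with the same wrapping conversion. */
unsigned char inc_wrap(unsigned char v) { return (unsigned char)(v + 1); }

/* Right shift of a non-negative value always shifts in zero bits. */
unsigned char shr1(unsigned char v) { return (unsigned char)(v >> 1); }
```

With an 8-bit char, dec_wrap(0) gives 255, inc_wrap(255) gives 0, and shr1(0x80) gives 0x40.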
Update
Some implementation-specific behaviour mentioned in the comments:
char != signed char. The type "char" without "signed" or "unsigned" is a distinct type whose signedness is implementation-defined, which means that it can act like a signed or an unsigned type.
Signed integer overflow leads to undefined behavior where a program can do anything, including dumping core or overrunning a buffer.
#include <stdio.h>
int main(int argc, char** argv)
{
char a = 'A';
char b = 0xFF;
signed char sa = 'A';
signed char sb = 0xFF;
unsigned char ua = 'A';
unsigned char ub = 0xFF;
printf("a > b: %s\n", a > b ? "true" : "false");
printf("sa > sb: %s\n", sa > sb ? "true" : "false");
printf("ua > ub: %s\n", ua > ub ? "true" : "false");
return 0;
}
[root]# ./a.out
a > b: true
sa > sb: true
ua > ub: false
It's important when sorting strings.
There are a couple of differences. Most importantly, if you overflow the valid range of a char by assigning it an integer that is too big or too small, and char is signed, the resulting value is implementation-defined, or (in C) a signal could even be raised, as for all signed types. Contrast that with assigning something too big or too small to an unsigned char: the value wraps around, and you get precisely defined semantics. For example, assigning -1 to an unsigned char gives you UCHAR_MAX. So whenever you have a byte, in the sense of a number from 0 to 2^CHAR_BIT - 1, you should really use unsigned char to store it.
The sign also makes a difference when passing to vararg functions:
char c = getSomeCharacter(); // returns 0..255
printf("%d\n", c);
Assume the value assigned to c is too big for char to represent, and the machine uses two's complement. Many implementations handle assigning a too-big value to a char by leaving the bit pattern unchanged. If an int can represent all values of char (which it can on most implementations), then the char is promoted to int before being passed to printf. If char is signed, the value is negative, promoting to int retains that sign, and you get a negative result. However, if char is unsigned, then the value is non-negative, and promoting it to int yields a positive int. If you use unsigned char, you get precisely defined behavior both for the assignment to the variable and for passing it to printf, which will then print something positive.
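To make the promotion visible, here is a small sketch with two hypothetical helpers that return the int a variadic function such as printf would receive. It assumes int can represent every unsigned char value, which holds on common platforms.

```c
/* Hypothetical helpers, named only for this sketch. */

/* Ordinary promotion of plain char: negative for high byte values
   on implementations where char is signed. */
int promoted_plain(char c) { return c; }

/* Going through unsigned char first keeps the result in 0..UCHAR_MAX,
   whatever the signedness of plain char. */
int promoted_as_byte(char c) { return (unsigned char)c; }
```

So printf("%d\n", (unsigned char)c) prints a value in 0..255 on an 8-bit-char system, while printf("%d\n", c) may print a negative number.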
Note that char, unsigned char and signed char are all at least 8 bits wide. There is no requirement that char be exactly 8 bits wide. That is true for most systems, but some use 32-bit chars. A byte in C and C++ is defined to have the size of char, so a byte in C is also not always exactly 8 bits.
Another difference is that, in C, an unsigned char must have no padding bits. That is, if you find CHAR_BIT is 8, then an unsigned char's values must range from 0 to 2^CHAR_BIT - 1. The same is true for char if it's unsigned. For signed char, you can't assume anything about the range of values: even if you know how your compiler implements the sign (two's complement or the other options), there may be unused padding bits in it. In C++, none of the three character types has padding bits.
"What does it mean for a char to be signed?"
Traditionally, the ASCII character set consists of 7-bit character encodings. (As opposed to the 8-bit EBCDIC.)
When the C language was designed and implemented this was a significant issue. (For various reasons like data transmission over serial modem devices.) The extra bit has uses like parity.
A "signed character" happens to be perfect for this representation.
Binary data, OTOH, is simply taking the value of each 8-bit "chunk" of data, thus no sign is needed.
Arithmetic on bytes is important for computer graphics (where 8-bit values are often used to store colors). Aside from that, I can think of two main cases where char sign matters:
converting to a larger int
comparison functions
The nasty thing is, these won't bite you if all your string data is 7-bit. However, it promises to be an unending source of obscure bugs if you're trying to make your C/C++ program 8-bit clean.
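A typical place where the "converting to a larger int" case bites is using a character as an array index. The following is only a sketch with a made-up function name: it counts byte values in a string and casts each char to unsigned char first, so a byte above 127 can never become a negative index.

```c
#include <limits.h>  /* UCHAR_MAX */
#include <stddef.h>  /* size_t */

/* Hypothetical helper: count how often each byte value occurs in a
   NUL-terminated string. counts must have UCHAR_MAX + 1 elements. */
void count_bytes(const char *s, size_t counts[UCHAR_MAX + 1])
{
    for (size_t i = 0; i <= UCHAR_MAX; i++)
        counts[i] = 0;

    for (; *s != '\0'; s++)
        counts[(unsigned char)*s]++;  /* index is always 0..UCHAR_MAX */
}
```

Writing counts[*s]++ instead would work for 7-bit text and index out of bounds for high bytes wherever plain char is signed.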
Signedness works pretty much the same way in chars as it does in other integral types. As you've noted, chars are really just one-byte integers. (Not necessarily 8-bit, though! There's a difference; a byte might be bigger than 8 bits on some platforms, and chars are rather tied to bytes due to the definitions of char and sizeof(char). The CHAR_BIT macro, defined in <limits.h> or C++'s <climits>, will tell you how many bits are in a char.)
As for why you'd want a character with a sign: in C and C++, there is no standard type called byte. To the compiler, chars are bytes and vice versa, and it doesn't distinguish between them. Sometimes, though, you want to -- sometimes you want that char to be a one-byte number, and in those cases (particularly how small a range a byte can have), you also typically care whether the number is signed or not. I've personally used signedness (or unsignedness) to say that a certain char is a (numeric) "byte" rather than a character, and that it's going to be used numerically. Without a specified signedness, that char really is a character, and is intended to be used as text.
I used to do that, rather. Now the newer versions of C and C++ have (u?)int_least8_t (currently typedef'd in <stdint.h> or <cstdint>), which are more explicitly numeric (though they'll typically just be typedefs for signed and unsigned char types anyway).
The only situation I can imagine this being an issue is if you choose to do math on chars. It's perfectly legal to write the following code.
char a = (char)42;
char b = (char)120;
char c = a + b;
Depending on the signedness of char, c could be one of two values. If chars are unsigned, then c will be (char)162. If they are signed, then 162 is out of range, as the max value for a signed char is 127, and the result of the conversion is implementation-defined. I'm guessing most implementations would just return (char)-94.
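As a rough check of the arithmetic, the sketch below (hypothetical function names) separates the addition, which always happens in int, from the conversion back to a character type. Only the unsigned result is portable; with a signed 8-bit char the stored value is implementation-defined and commonly -94.

```c
/* Hypothetical helpers, named only for this sketch. */

/* Both operands are promoted to int, so the sum itself is always 162. */
int sum_as_int(void)
{
    char a = 42, b = 120;
    return a + b;
}

/* 162 fits in unsigned char, so it is stored unchanged. */
unsigned char sum_as_uchar(void)
{
    unsigned char a = 42, b = 120;
    return (unsigned char)(a + b);
}
```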
One thing about signed chars is that you can test c >= ' ' (space) and be sure it's a normal printable ascii char. Of course, it's not portable, so not very useful.
int a = 0x11223344;
char b = (char)a;
I am new to programming and learning C. Why do I get value of b here as D?
If I want to store an integer into a char type variable, which byte of the integer will be stored?
This is not fully defined by the C standard.
In the particular situation you tried it, what likely happened is that the low eight bits of 0x11223344 were stored in b, producing 0x44 (68 in decimal) in b, and printing that prints “D” because your system uses ASCII character codes, and 68 is the ASCII code for “D”.
However, you should be wary of something like this working, because it is contingent on several things, and variations are possible.
First, the C standard allows char to be signed or unsigned. It also allows char to be any width that is eight bits or greater. In most C implementations today, it is eight bits.
Second, the conversion from int to char depends on whether char is signed or unsigned and may not be fully defined by the C standard.
If char is unsigned, then the conversion is defined to wrap modulo M+1, where M is the largest value representable in char. Effectively, this is the same as taking the low byte of the value. If the unsigned char has eight bits, its M is 255, so M+1 is 256.
If char is signed and the value is out of range of the char type, the conversion is implementation-defined: It may either trap or produce an implementation-defined value. Your C implementation may wrap conversions to signed integer types similarly to how it wraps conversions to unsigned types, but another reasonable behavior is to “clamp” out-of-range values to the limits of the type, CHAR_MIN and CHAR_MAX. For example, converting −8000 to char could yield the minimum, −128, while converting 0x11223344 to char could yield the maximum, +127.
Third, the C standard does not require implementations to use ASCII. It is very common to use ASCII. (Usually, the character encoding is not just ASCII, because ASCII covers only values from 0 to 127. C implementations often use some extension beyond ASCII for values from 128 to 255.)
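If the goal is simply "give me the low byte", the well-defined route is to go through an unsigned type. A minimal sketch, with made-up helper names, assuming CHAR_BIT == 8 and an int of at least 32 bits:

```c
/* Hypothetical helpers, named only for this sketch. */

/* Conversion to unsigned char is defined to wrap modulo UCHAR_MAX + 1. */
unsigned char low_byte_cast(int v) { return (unsigned char)v; }

/* The same result with an explicit mask, which makes the intent visible. */
unsigned char low_byte_mask(int v)
{
    return (unsigned char)((unsigned int)v & 0xFFu);
}
```

For 0x11223344 both return 0x44, which is 'D' only on systems that use ASCII-compatible character codes.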
I saw this implementation of strcmp a while back, and I have a question for purely educational purposes. Why is it necessary to convert the inputs to 16-bit integers, do the math and then convert back to 8-bit? What is wrong with doing the subtraction in 8-bit?
int8_t strcmp (const uint8_t* s1, const uint8_t* s2)
{
while ( *s1 && (*s1 == *s2) )
{
s1++;
s2++;
}
return (int8_t)( (int16_t)*s1 - (int16_t)*s2 );
}
Note: the code assumes 16 bit int type.
EDIT:
It was mentioned that C does conversion to int (suppose 32bit) by default. Is that the case even when the code explicitly states to cast to 16bit int ?
The strcmp(a,b) function is expected to return
<0 if string a < string b
>0 if string a > string b
0 if string a == string b
The test is actually made on the first char being different in the two strings at the same location (0, the string terminator, works as well).
Here, since the function takes two uint8_t (unsigned char) pointers, the developer was probably worried that subtracting two unsigned chars would give a number between 0 and 255, so a negative value would never be returned. For instance, 118 - 236 should return -118, but in 8 bits it would return 138.
Thus the programmer decided to cast to int16_t, a signed 16-bit integer.
That could have worked, and given the correct negative/positive values (provided that the function returned int16_t instead of int8_t).
(*edit: per the comment from @zwol below, the integer promotion is unavoidable, thus this int16_t casting is not necessary)
However, the final int8_t cast breaks the logic. Since returned values may range from -255 to 255, some of these values will see their sign reversed after the cast to int8_t.
For instance, 255 - 0 gives the positive 255 (in 16 bits, all lower 8 bits set to 1, MSB 0), but in the int8_t world (signed 8-bit int) this is negative, -1, since only the low 8 bits remain, binary 11111111, or decimal -1.
Definitely not a good programming example.
That working function from Apple is better
for ( ; *s1 == *s2; s1++, s2++)
if (*s1 == '\0')
return 0;
return ((*(unsigned char *)s1 < *(unsigned char *)s2) ? -1 : +1);
(Linux does it in assembly code...)
Actually, the difference must be done in at least 16 bits¹ for the obvious reason that the range of the result is -255 to 255 and that does not fit in 8 bits. However, sfstewman is correct in noting that it would happen due to implicit integer promotion anyway.
The eventual cast to 8 bits is incorrect, because it can overflow as the range still does not fit in 8 bits. And anyway, strcmp is indeed supposed to return plain int.
¹ 9 would suffice, but bits normally come in batches of 8.
Input data is unsigned 8-bit, so to avoid truncation and effects of overflow/underflow it should be converted to at least 9-bit signed, therefore int16 is used.
return (int8_t)( (int16_t)*s1 - (int16_t)*s2 );
This could mean one of these two options:
Either the programmer was confused about how implicit type promotions work in C. Both operands will be implicitly converted to int no matter the casts to int16_t. So if int is for example 32 bits, the code is nonsense. Or otherwise, if int is equivalent to int16_t for the specific system, then no conversion at all takes place.
Or the programmer is well aware of how type promotions work and is writing code that needs to conform to a standard that bans implicit type promotions, such as MISRA-C. In that case, and in case int is 16 bits on the given system, the code makes perfect sense: it forces an explicit type promotion to dodge warnings from the compiler/static analyser.
I would guess that the second option is the most likely, and that this code is intended for a small microcontroller system.
There are certain values that would cause the difference between the two numbers to be different if the int16_t weren't there due to overflow. In an int8_t your range is -128 to 127, in a uint8_t your range is 0 to 255, and in a int16_t your range would be -32,768 to 32,767.
Casting to an int8_t from a uint8_t will cause values over 127 to change sign due to overflow, so the int16_t casts keep that from happening. However, the return type should be int16_t as well, because a result such as 255 - 0 would otherwise be truncated on return.
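Putting the points from the answers together, a version that avoids the narrowing problem could look like the sketch below. The name my_strcmp is made up to avoid clashing with the library function, and this is an illustration, not any particular library's implementation. It compares bytes as unsigned char and returns plain int.

```c
/* Hypothetical name, chosen to avoid clashing with the library strcmp. */
int my_strcmp(const char *s1, const char *s2)
{
    const unsigned char *p1 = (const unsigned char *)s1;
    const unsigned char *p2 = (const unsigned char *)s2;

    /* Advance while the bytes match and the terminator is not reached. */
    while (*p1 != 0 && *p1 == *p2) {
        p1++;
        p2++;
    }

    /* Compare instead of subtracting: yields -1, 0 or +1,
       so nothing ever has to be narrowed. */
    return (*p1 > *p2) - (*p1 < *p2);
}
```

Returning *p1 - *p2 as an int would also be fine on systems where int is wider than char; the comparison form just sidesteps the question.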
I have a code like this:
#include <stdio.h>
int main()
{
char a=20,b=30;
char c=a*b;
printf("%c\n",c);
return 0;
}
The output of this program is X .
How is this output possible if a*b=600 which overflows as char values lies between -128 and 127 ?
Whether char is signed or unsigned is implementation defined. Either way, it is an integer type.
Anyway, the multiplication is done as int due to integer promotions and the result is converted to char.
If the value does not fit into the "smaller" type, it is implementation-defined for a signed char how this is done. By far most (if not all) implementations simply cut off the upper bits.
For an unsigned char, the standard actually requires (briefly) cutting off the upper bits.
So:
(int)20 * (int)30 -> (int)600 -> (char)(600 % 256) -> 88 == 'X'
(Assuming 8 bit char).
See the link and its surrounding paragraphs for more details.
Note: If you enable compiler warnings (as always recommended), you should get a truncation warning for the assignment. This can be avoided by an explicit cast (only if you are really sure about all implications). The gcc option is -Wconversion.
First off, the behavior is implementation-defined here. A char may be either unsigned char or signed char, so it may be able to hold 0 to 255 or -128 to 127, assuming CHAR_BIT == 8.
600 in decimal is 0x258. What happens is the least significant eight bits are stored, the value is 0x58 a.k.a. X in ASCII.
This code will cause undefined behavior if char is signed.
I thought overflow of signed integer is undefined behavior, but conversion to smaller type is implementation-defined.
quote from N1256 6.3.1.3 Signed and unsigned integers:
3 Otherwise, the new type is signed and the value cannot be represented in it; either the
result is implementation-defined or an implementation-defined signal is raised.
If the value is simply truncated to 8 bits, (20 * 30) & 0xff == 0x58 and 0x58 is the ASCII code for X. So, if your system does this and uses ASCII, the output will be X.
First, looks like you have unsigned char with a range from 0 to 255.
You're right about the overflow.
600 - 256 - 256 = 88
This is just an ASCII code of 'X'.
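The arithmetic in these answers can be reproduced with an unsigned char, where the truncation is guaranteed rather than implementation-defined. A small sketch with a made-up helper name, assuming CHAR_BIT == 8:

```c
/* Hypothetical helper: multiply two bytes and keep only the low byte.
   The cast to unsigned int keeps the multiplication itself well defined
   even on systems with a 16-bit int. */
unsigned char mul_to_byte(unsigned char a, unsigned char b)
{
    return (unsigned char)((unsigned int)a * b);
}
```

mul_to_byte(20, 30) computes 600 and stores 600 % 256 = 88, which is 'X' on ASCII systems.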
I need to do modulo 256 arithmetic in C. So can I simply do
unsigned char i;
i++;
instead of
int i;
i=(i+1)%256;
No. There is nothing that guarantees that unsigned char has eight bits. Use uint8_t from <stdint.h>, and you'll be perfectly fine. This requires an implementation which supports stdint.h: any C99 compliant compiler does, but older compilers may not provide it.
Note: unsigned arithmetic never overflows, and behaves as "modulo 2^n". Signed arithmetic overflows with undefined behavior.
Yes, the behavior of both of your examples is the same. See C99 6.2.5 §9 :
A computation involving unsigned operands can never overflow,
because a result that cannot be represented by the resulting unsigned integer type is
reduced modulo the number that is one greater than the largest value that can be
represented by the resulting type.
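A quick way to convince yourself that the two forms from the question agree is to compare them over the whole range. This sketch uses uint8_t from <stdint.h> as suggested above, which exists only on implementations that have an exact 8-bit type; the function names are made up.

```c
#include <stdint.h>  /* uint8_t */

/* Hypothetical helpers, named only for this sketch. */

/* Increment in an 8-bit unsigned type: wraps from 255 to 0. */
uint8_t next_u8(uint8_t i)
{
    i++;
    return i;
}

/* The explicit modulo form from the question. */
int next_mod256(int i) { return (i + 1) % 256; }
```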
unsigned char c = UCHAR_MAX;
c++;
Basically yes, there is no overflow, but not because c is of an unsigned type. There is a hidden promotion of c to int here and an integer conversion from int to unsigned char and it is perfectly defined.
For example,
signed char c = SCHAR_MAX;
c++;
is also not undefined behavior, because it is actually equivalent to:
c = (int) c + 1;
and the conversion from int to signed char is implementation-defined here (see C99, 6.3.1.3p3 on integer conversions). To simplify, CHAR_BIT == 8 is assumed.
For more information on the example above, I suggest reading this post:
"The Little C Function From Hell"
http://blog.regehr.org/archives/482
Very probably yes, but the reasons for it in this case are actually fairly complicated.
unsigned char i = 255;
i++;
The i++ is equivalent to i = i + 1.
(Well, almost. i++ yields the value of i before it was incremented, so it's really equivalent to (tmp=i; i = i + 1; tmp). But since the result is discarded in this case, that doesn't raise any additional issues.)
Since unsigned char is a narrow type, an unsigned char operand to the + operator is promoted to int (assuming int can hold all possible values in the range of unsigned char). So if i == 255, and UCHAR_MAX == 255, then the result of the addition is 256, and is of type (signed) int.
The assignment implicitly converts the value 256 from int back to unsigned char. Conversion to an unsigned type is well defined; the result is reduced modulo MAX+1, where MAX is the maximum value of the target unsigned type.
If i were declared as an unsigned int:
unsigned int i = UINT_MAX;
i++;
there would be no type conversion, but the semantics of the + operator for unsigned types also specify reduction modulo MAX+1.
Keep in mind that the value assigned to i is mathematically equivalent to (i+1) % (UCHAR_MAX+1). UCHAR_MAX is usually 255, and is guaranteed to be at least 255, but it can legally be bigger.
There could be an exotic system on which UCHAR_MAX is too big to be stored in a signed int object. This would require UCHAR_MAX > INT_MAX, which means the system would have to have at least 16-bit bytes. On such a system, the promotion would be from unsigned char to unsigned int. The final result would be the same. You're not likely to encounter such a system. I think there are C implementations for some DSPs that have bytes bigger than 8 bits. The number of bits in a byte is specified by CHAR_BIT, defined in <limits.h>.
CHAR_BIT > 8 does not necessarily imply UCHAR_MAX > INT_MAX. For example, you could have CHAR_BIT == 16 and sizeof (int) == 2 (i.e., 16-bit bytes and 32-bit ints).
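The separate steps described in this answer (promotion, addition, conversion back) can be pulled apart in code. This is a sketch with hypothetical function names, assuming the usual case UCHAR_MAX < INT_MAX so that unsigned char promotes to int.

```c
#include <limits.h>  /* UCHAR_MAX, UINT_MAX */

/* Hypothetical helpers, named only for this sketch. */

/* i + 1 is computed in int, so no wrap happens here:
   for i == UCHAR_MAX the result is UCHAR_MAX + 1. */
int add_one_promoted(unsigned char i) { return i + 1; }

/* Converting back to unsigned char reduces modulo UCHAR_MAX + 1. */
unsigned char add_one_stored(unsigned char i)
{
    return (unsigned char)(i + 1);
}

/* No promotion for unsigned int: the addition itself wraps. */
unsigned int add_one_uint(unsigned int i) { return i + 1u; }
```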
There's another alternative that hasn't been mentioned, if you don't want to use another data type.
unsigned int i;
// ...
i = (i+1) & 0xFF; // 0xFF == 255
This works because the modulus is a power of two, 2^n, meaning the range will be [0, 2^n-1], and thus a bitmask will easily keep the value within your desired range. It's possible this method would not be much or any less efficient than the unsigned char/uint8_t version, either, depending on what magic your compiler does behind the scenes and how the targeted system handles non-word loads (for example, some RISC architectures require additional operations to load non-word-size values). This also assumes that your compiler won't detect the usage of power-of-two modulo arithmetic on unsigned values and substitute a bitmask for you, of course, as in cases like that the modulo usage would have greater semantic value (though using that as the basis for your decision is not exactly portable, of course).
An advantage of this method is that you can use it for powers of two that are not also the size of a data type, e.g.
i = (i+1) & 0x1FF; // i %= 512
i = (i+1) & 0x3FF; // i %= 1024
// etc.
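The equivalence between the mask and the modulo can be checked for any power of two. In this sketch the helper name is made up, and the caller is assumed to pass mask == n - 1 for some power of two n.

```c
/* Hypothetical helper: advance a counter that wraps at a power of two.
   mask must be n - 1 where n is a power of two (0xFF, 0x1FF, 0x3FF, ...). */
unsigned int next_masked(unsigned int i, unsigned int mask)
{
    return (i + 1u) & mask;
}
```

For every i in [0, n-1] this gives the same value as (i + 1) % n, and because the operands are unsigned there is no undefined behavior even when i is UINT_MAX.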
This should work fine because it should just overflow back to 0. As was pointed out in a comment on a different answer, you should only do this when the value is unsigned, as you may get undefined behavior with a signed value.
It is probably best to leave this using modulo, however, because the code will be better understood by other people maintaining the code, and a smart compiler may be doing this optimization anyway, which may make it pointless in the first place. Besides, the performance difference will probably be so small that it wouldn't matter in the first place.
It will work if the number of bits used to represent the number equals the number of bits in the binary (unsigned) representation of the divisor (100000000 for 256), minus 1,
which in this case is 9 - 1 = 8 (the width of a char).
So I know that the difference between a signed int and unsigned int is that a bit is used to signify if the number is positive or negative, but how does this apply to a char? How can a character be positive or negative?
There's no dedicated "character type" in C language. char is an integer type, same (in that regard) as int, short and other integer types. char just happens to be the smallest integer type. So, just like any other integer type, it can be signed or unsigned.
It is true that (as the name suggests) char is mostly intended to be used to represent characters. But characters in C are represented by their integer "codes", so there's nothing unusual in the fact that an integer type char is used to serve that purpose.
The only general difference between char and other integer types is that plain char is not synonymous with signed char, while with other integer types the signed modifier is optional/implied.
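That plain char really is a third type, and not an alias for one of the other two, can be shown with C11's _Generic. This is only a sketch: the macro and function names are invented here, and it needs a C11 compiler. The macro would not compile if any two of the three types were the same type.

```c
#include <limits.h>  /* CHAR_MIN */

/* Hypothetical macro: 0 for char, 1 for signed char, 2 for unsigned char.
   _Generic requires the listed types to be mutually distinct. */
#define CHAR_KIND(x) _Generic((x), \
    char: 0,                       \
    signed char: 1,                \
    unsigned char: 2)

/* Plain char behaves as signed exactly when CHAR_MIN is negative. */
int plain_char_is_signed(void) { return CHAR_MIN < 0; }
```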
I slightly disagree with the above. The unsigned char simply means: Use the most significant bit instead of treating it as a bit flag for +/- sign when performing arithmetic operations.
It matters if you use char as a number, for instance:
typedef char BYTE1;
typedef unsigned char BYTE2;
BYTE1 a;
BYTE2 b;
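For variable a (assuming plain char is signed), only 7 bits are available for the magnitude, and its range is typically -128 to 127 (the standard guarantees at least -127 to 127, i.e. ±(2^7 - 1)).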
For variable b all 8 bits are available and the range is 0 to 255 (2^8 -1).
If you use char as character, "unsigned" is completely ignored by the compiler just as comments are removed from your program.
There are three char types: (plain) char, signed char and unsigned char. Any char is usually an 8-bit integer* and in that sense, a signed and unsigned char have a useful meaning (generally equivalent to uint8_t and int8_t). When used as a character in the sense of text, use a char (also referred to as a plain char). This is typically a signed char but can be implemented either way by the compiler.
* Technically, a char can be any size as long as sizeof(char) is 1, but it is usually an 8-bit integer.
The representation is the same; the meaning is different. E.g., 0xFF is represented as "FF" in both. When it is treated as a (signed) char, it is the negative number -1; but it is 255 as unsigned. When it comes to bit shifting, there is a big difference, since a right shift typically preserves the sign bit. E.g., if you shift 255 right by 1 bit, you get 127; shifting -1 right typically has no effect.
A signed char is a signed value which is typically smaller than, and is guaranteed not to be bigger than, a short. An unsigned char is an unsigned value which is typically smaller than, and is guaranteed not to be bigger than, a short. A type char without a signed or unsigned qualifier may behave as either a signed or unsigned char; this is usually implementation-defined, but there are a couple of cases where it is not:
If, in the target platform's character set, any of the characters required by standard C would map to a code higher than the maximum `signed char`, then `char` must be unsigned.
If `char` and `short` are the same size, then `char` must be signed.
Part of the reason there are two dialects of "C" (those where char is signed, and those where it is unsigned) is that there are some implementations where char must be unsigned, and others where it must be signed.
The same way -- e.g. if you have an 8-bit char, 7 bits can be used for magnitude and 1 for sign. So an unsigned char might range from 0 to 255, whilst a signed char might range from -128 to 127 (for example).
This is because a char is stored, to all intents and purposes, as an 8-bit number. Speaking about a negative or positive char doesn't make sense if you consider it an ASCII code (which can be just signed*) but makes sense if you use that char to store a number, which could be in the range 0..255 or -128..127 according to the two's complement representation.
*: it can be also unsigned, it actually depends on the implementation I think, in that case you will have access to extended ASCII charset provided by the encoding used
The same way an int can be positive or negative. There is no difference. Actually, on many platforms unqualified char is signed.