I have found that the C99 standard has a statement which denies compatibility between the type char and the types signed char and unsigned char.
Note 35 of C99 standard:
CHAR_MIN, defined in limits.h, will have one of the values 0 or SCHAR_MIN, and this can be used to distinguish the two options. Irrespective of the choice made, char is a separate type from the other two and is not compatible with either.
My question is: why does the committee deny this compatibility? What is the rationale? If char were compatible with signed char or unsigned char, would something terrible happen?
The roots are in compiler history. There were (are) essentially two C dialects in the Eighties:
One where plain char is signed
One where plain char is unsigned
Which of these should C89 have standardized? C89 chose to standardize neither, because it would have invalidated a large number of assumptions made in C code already written--what standards folks call the installed base. So C89 did what K&R did: leave the signedness of plain char implementation-defined. If you require a specific signedness, qualify your char.
Modern compilers usually let you choose the dialect with an option (e.g. gcc's -funsigned-char).
The "terrible" thing that can happen if you ignore the distinction between (un)signed char and plain char is that if you do arithmetic and shifts without taking these details into account, you might get sign extensions when you don't expect them or vice versa (or even undefined behavior when shifting).
There's also some dumb advice out there that recommends always declaring your chars with an explicit signed or unsigned qualifier. This works as long as you only work with pointers to such qualified types, but it requires ugly casts as soon as you deal with strings and string functions, all of which operate on pointer-to-plain-char, which is assignment-incompatible without a cast. Such code suddenly gets plastered with tons of ugly-to-the-bone casts.
The basic rules for chars are (a short sketch follows the list):
Use plain char for strings and if you need to pass pointers to functions taking plain char
Use unsigned char if you need to do bit twiddling and shifting on bytes
Use signed char if you need small signed values, but think about using int if space is not a concern
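A minimal sketch of these three rules (the variable names are illustrative only):

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Rule 1: plain char for strings and the string functions. */
    char name[] = "hello";
    printf("%zu\n", strlen(name));

    /* Rule 2: unsigned char for bit twiddling and shifting on bytes. */
    unsigned char byte = 0xA5;
    unsigned char hi = (unsigned char)(byte >> 4);  /* 0x0A, no sign surprises */

    /* Rule 3: signed char for small signed values (int is often a better fit). */
    signed char delta = -3;

    printf("%02X %d\n", (unsigned)hi, delta);
    return 0;
}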
Think of signed char and unsigned char as the smallest arithmetic, integral types, just like signed short/unsigned short, and so forth with int, long int, long long int. Those types are all well-specified.
On the other hand, char serves a very different purpose: It's the basic type of I/O and communication with the system. It's not meant for computations, but rather as the unit of data. That's why you find char used in the command line arguments, in the definition of "strings", in the FILE* functions and in other read/write type IO functions, as well as in the exception to the strict aliasing rule. This char type is deliberately less strictly defined so as to allow every implementation to use the most "natural" representation.
It's simply a matter of separating responsibilities.
(It is true, though, that char is layout-compatible with both signed char and unsigned char, so you may explicitly convert one to the other and back.)
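A small illustration of that layout compatibility (a sketch, assuming an 8-bit char): the same bytes can be viewed through a pointer to unsigned char without changing their representation.

#include <stdio.h>

/* Dump the bytes of a plain-char string through an unsigned char pointer;
   the representation is identical, only the type differs. */
static void dump(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;  /* explicit conversion */
    for (; *p != '\0'; ++p)
        printf("%02X ", (unsigned)*p);
    printf("\n");
}

int main(void)
{
    dump("abc\xFF");  /* prints 61 62 63 FF regardless of char's signedness */
    return 0;
}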
In plain C, according to the standard, there are three distinct "character" types:
plain char, whose signedness is implementation-defined.
signed char.
unsigned char.
Let's assume at least C99, where stdint.h is already present (so you have the int8_t and uint8_t types as recommendable explicit-width alternatives to signed and unsigned char).
For now it seems to me that using the plain char type is only really useful (or necessary) if you need to interface with standard library functions such as printf, and that in all other scenarios it is better avoided. Using char could lead to undefined behavior when it is signed on a given implementation and you need, for whatever reason, to do arithmetic on such data.
The problem of choosing an appropriate type is probably most apparent when dealing, for example, with Unicode text (or any code page using values above 127 to represent characters), which otherwise could be handled as a plain C string. However, the relevant string.h functions all accept char, and if such data is typed as char, that poses problems when trying to interpret it, for example in a display routine capable of handling its encoding.
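For concreteness, a sketch of the kind of byte-level interpretation meant here (the utf8_length helper is hypothetical): decoding has to look at byte values above 127, which is awkward if the data is typed as possibly-signed char, so such code typically views the bytes as unsigned char.

#include <stdio.h>

/* Count UTF-8 code points in a plain-char string by skipping
   continuation bytes (those of the form 10xxxxxx). */
static size_t utf8_length(const char *s)
{
    size_t n = 0;
    for (const unsigned char *p = (const unsigned char *)s; *p; ++p)
        if ((*p & 0xC0) != 0x80)  /* not a continuation byte */
            ++n;
    return n;
}

int main(void)
{
    printf("%zu\n", utf8_length("a\xC3\xBF"));  /* "a" + U+00FF -> 2 code points */
    return 0;
}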
What is the most recommendable method in such a case? Are there any particular reasons beyond this where it could be recommendable to use char over stdint.h's appropriate fixed-width types?
The char type is for characters and strings. It is the type expected and returned by all the string handling functions. (*) You really should never have to do arithmetic on char, especially not the kind where signed-ness would make a difference.
unsigned char is the type to be used for raw data. For example memcpy() or fread() interpret their void * arguments as arrays of unsigned char. The standard guarantees that any type can be also represented as an array of unsigned char. Any other conversion might be "signalling", i.e. triggering exceptions. (ISO/IEC 9899:2011, section 6.2.6 "Representation of Types"). (**)
signed char is when you need a signed integer of char size (for arithmetics).
(*): The character handling functions in <ctype.h> are a bit oddball about this, as they cater for EOF (negative), and hence "force" the character values into the unsigned char range (ISO/IEC 9899:2011, section 7.4 Character handling). But since it is guaranteed that a char can be cast to unsigned char and back without loss of information as per section 6.2.6... you get the idea.
Where the signedness of char would make a difference -- in the comparison functions such as strcmp() -- the standard dictates that char is interpreted as unsigned char (ISO/IEC 9899:2011, section 7.24.4 Comparison functions).
(**): Practically, it is hard to see how a conversion of raw data to char and back could be signalling where the same done with unsigned char would not be signalling. But unsigned char is what the section of the standard says. ;-)
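A small sketch of both points, assuming nothing beyond the standard library: unsigned char for looking at raw bytes, and strcmp() comparing as if characters were unsigned.

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Raw data: any object can be inspected as an array of unsigned char. */
    double d = 1.0;
    unsigned char bytes[sizeof d];
    memcpy(bytes, &d, sizeof d);
    for (size_t i = 0; i < sizeof d; ++i)
        printf("%02X ", (unsigned)bytes[i]);
    printf("\n");

    /* Strings: strcmp compares as if the chars were unsigned char, so
       "\xFF" sorts after "a" even where plain char is signed. */
    printf("%d\n", strcmp("\xFF", "a") > 0);  /* prints 1 */
    return 0;
}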
Use char to store characters (standard defines the behaviour for basic execution character set elements only, roughly ASCII 7-bit characters).
Use signed char or unsigned char to get the corresponding arithmetic (signed and unsigned arithmetic have different properties for integers - char is an integer type).
This doesn't mean that you can't do arithmetic with plain chars, as stated:
6.2.5 Types, paragraph 3: An object declared as type char is large enough to store any member of the basic execution character set. If a member of the basic execution character set is stored in a char object, its value is guaranteed to be nonnegative.
So if you only use basic execution character set elements, arithmetic on them is well defined.
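A minimal sketch of such well-defined arithmetic on basic character set members:

#include <stdio.h>

int main(void)
{
    /* Members of the basic execution character set are guaranteed to be
       nonnegative in a char, and the digits '0'..'9' are contiguous, so
       this is well defined. */
    char digit = '7';
    int value = digit - '0';  /* 7 */
    printf("%d\n", value);
    return 0;
}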
I was looking at the data types described at this link: data type.
It says that the char type is 1 byte, with a range of either -128 to 127 or 0 to 255.
How is this possible? By default char means signed, right?
Edit: There is another question, Whats wrong with this C code?, but it is not the same question. Its title asks what is wrong with that code, so a search will not surface this answer easily. One has to analyse that question fully to understand the issue.
Edit: After looking at several answers and comments, I have another doubt. Strings within double quotes are treated as arrays of char. I get warnings if I pass double-quoted strings to a function with a parameter of type signed char. Also, itoa and many other library functions take parameters of type char, not signed char. Of course, typecasting will avoid this problem. So what is the best parameter type for functions manipulating null-terminated strings (for example, LCD display related functions)? Should I use signed char or unsigned char (since char is implementation-defined, it may not be portable, I guess)?
char has implementation-defined signedness, meaning that one compiler can choose to implement it as signed and another as unsigned.
This is the reason why you should never use the char type for storing numbers. A better type to use for such purposes is uint8_t.
char "has the same representation and alignment as either signed char or unsigned char, but is always a distinct type".
No, it doesn't mean signed char by default. According to the C standard, char is a distinct type from both signed char and unsigned char; it merely behaves like one of the other two.
n1570/6.2.5p15
The three types char, signed char, and unsigned char are collectively called the character types. The implementation shall define char to have the same range, representation, and behavior as either signed char or unsigned char.
And in a note to the above paragraph:
CHAR_MIN, defined in <limits.h>, will have one of the values 0 or SCHAR_MIN, and this can be used to distinguish the two options. Irrespective of the choice made, char is a separate type from the other two and is not compatible with either.
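A C11-era sketch (n1570 is a C11 draft) that makes the distinctness visible: _Generic selects on the static type, and it can tell the three character types apart even though char behaves like one of the other two.

#include <stdio.h>

#define KIND(x) _Generic((x),          \
        char:          "plain char",   \
        signed char:   "signed char",  \
        unsigned char: "unsigned char")

int main(void)
{
    char c = 0;
    signed char sc = 0;
    unsigned char uc = 0;
    printf("%s / %s / %s\n", KIND(c), KIND(sc), KIND(uc));
    /* prints: plain char / signed char / unsigned char */
    return 0;
}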
Say I have some utf8 encoded string. Inside it words are delimited using ";".
But each character (except ";") inside this string has utf8 value >128.
Say I store this string inside unsigned char array:
unsigned char buff[]="someutf8string;separated;with;";
Is it safe to pass this buff to strtok function? (If I just want to extracts words using ";" symbol).
My concern is that strtok (and also strcpy) expect char pointers, but inside my string some values will be greater than 128.
So is this behaviour defined?
No, it is not safe -- but if it compiles it will almost certainly work as expected.
unsigned char buff[]="someutf8string;separated;with;";
This is fine; the standard specifically permits arrays of character type (including unsigned char) to be initialized with a string literal. Successive bytes of the string literal initialize the elements of the array.
strtok(buff, ";")
This is a constraint violation, requiring a compile-time diagnostic. (That's about as close as the C standard gets to saying that something is illegal.)
The first parameter of strtok is of type char*, but you're passing an argument of type unsigned char*. These two pointer types are not compatible, and there is no implicit conversion between them. A conforming compiler may reject your program if it contains a call like this (and, for example, gcc -std=c99 -pedantic-errors does reject it).
Many C compilers are somewhat lax about strict enforcement of the standard's requirements. In many cases, compilers issue warnings for code that contains constraint violations -- which is perfectly valid. But once a compiler has diagnosed a constraint violation and proceeded to generate an executable, the behavior of that executable is not defined by the C standard.
As far as I know, any actual compiler that doesn't reject this call will generate code that behaves just as you expect it to. The pointer types char* and unsigned char* almost certainly have the same representation and are passed the same way as arguments, and the types char and unsigned char are explicitly required to have the same representation for non-negative values. Even for values exceeding CHAR_MAX, like the ones you're using, a compiler would have to go out of its way to generate misbehaving code. You could have problems on a system that doesn't use 2's-complement for signed integers, but you're not likely to encounter such a system.
If you add an explicit cast:
strtok((char*)buff, ";")
this removes the constraint violation and will probably silence any warning -- but the behavior is still, strictly speaking, undefined.
In practice, though, most compilers try to treat char, signed char, and unsigned char almost interchangeably, partly to cater to code like yours, and partly because they'd have to go out of their way to do anything else.
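Putting the pieces together, a sketch of the cast-based approach discussed above (behaving as expected on common implementations):

#include <stdio.h>
#include <string.h>

int main(void)
{
    unsigned char buff[] = "someutf8string;separated;with;";

    /* The explicit cast avoids the constraint violation; strtok then
       modifies the (writable) buffer in place. */
    for (char *tok = strtok((char *)buff, ";"); tok != NULL;
         tok = strtok(NULL, ";"))
        printf("[%s]\n", tok);
    return 0;
}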
According to the C11 Standard (ISO/IEC 9899:2011 §7.24.1 String Handling Conventions, ¶3, emphasis added):
For all functions in this subclause, each character shall be interpreted as if it had the type unsigned char (and therefore every possible object representation is valid and has a different value).
Note: this paragraph was not present in the C99 standard.
So I do not see a problem.
(Edited to change C/C++ to C)
Please help me find a clear explanation of char versus unsigned char in C, especially when we transfer data between embedded devices and general PCs (the difference between a buffer of unsigned char and one of plain char).
You're asking about two different languages but, in this respect, the answer is (more or less) the same for both. You really should decide which language you're using though.
Differences:
they are distinct types
it's implementation-defined whether char is signed or unsigned
Similarities:
they are both integer types
they are the same size (one byte, at least 8 bits)
If you're simply using them to transfer raw byte values, with no arithmetic, then there's no practical difference.
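A sketch of that raw-byte case in the embedded-transfer spirit of the question (send_bytes is a hypothetical stand-in for, say, a UART write routine): when only moving bytes around, a plain char buffer and an unsigned char buffer carry the same data; signedness only matters once you print or do arithmetic on the values.

#include <stdio.h>

static void send_bytes(const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; ++i)
        printf("%02X ", (unsigned)buf[i]);  /* stand-in for the actual transfer */
    printf("\n");
}

int main(void)
{
    char payload[] = { 0x12, (char)0xAB, 0x00, 0x7F };
    send_bytes((const unsigned char *)payload, sizeof payload);  /* 12 AB 00 7F */
    return 0;
}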
The type char is special. It is not an unsigned char or a signed char. These are three distinct types (while int and signed int are the same type). A char might have a signed or unsigned representation.
From 3.9.1 Fundamental types (the C++ standard; the C standard gives the same guarantee in 6.2.5):
Plain char, signed char, and unsigned char are three distinct types. A char, a signed char, and an unsigned char occupy the same amount of storage and have the same alignment requirements (3.11); that is, they have the same object representation.
That is, why does unsigned short var = L'ÿ'; work, but unsigned short var[] = L"ÿ"; does not?
L'ÿ' is of type wchar_t, which can be implicitly converted into an unsigned short. L"ÿ" is of type wchar_t[2], which cannot be implicitly converted into unsigned short[2].
L is the prefix for wide character literals and wide-character string literals. This is part of the language and not a header. It's also not GCC-specific. They would be used like so:
wchar_t some_wchar = L'ÿ';
wchar_t *some_wstring = L"ÿ"; // or wchar_t some_wstring[] = L"ÿ";
You can do unsigned short something = L'ÿ'; because a conversion is defined from wchar_t to unsigned short. There is no such conversion defined between wchar_t * and unsigned short *.
wchar_t is just a typedef for one of the standard integer types. The compiler implementor chooses a type that is large enough to hold all wide characters. If you don't include the header, this is still true and L'ß' is well defined; it's just that you as a programmer don't know which type it has.
Your initialization of an integer type works because there are rules for converting one integer type into another. Assigning a wide character string (i.e. the address of the first element of a wide character array) to an integer pointer is only possible if you happen to guess correctly the integer type to which wchar_t corresponds. There is no automatic conversion between pointers of different types, unless one of them is void *.
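A sketch of both halves of that explanation, using \xFF escapes so the example does not depend on source encoding:

#include <stdio.h>
#include <wchar.h>

int main(void)
{
    /* A single wide character converts to any integer type by the usual
       arithmetic conversions. */
    unsigned short one = L'\xFF';      /* fine: integer conversion */

    /* A wide string literal has type array-of-wchar_t; there is no implicit
       conversion between wchar_t * and unsigned short *. */
    wchar_t wide[] = L"\xFF";          /* fine */
    /* unsigned short bad[] = L"\xFF";    error unless wchar_t happens to be
                                          unsigned short on your platform */

    printf("%u %u\n", (unsigned)one, (unsigned)wide[0]);
    return 0;
}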
Chris has already given the correct answer, but I'd like to offer some thoughts on why you may have made the mistake to begin with. On Windows, wchar_t was defined as 16-bit way back in the early days of Unicode where it was intended to be a 16-bit character set. Unfortunately this turned out to be a bad decision (it makes it impossible for the C compiler to support non-BMP Unicode characters in a way that conforms to the C standard), but they were stuck with it.
Unix systems from the beginning have used 32-bit wchar_t, which of course means short * and wchar_t * are incompatible pointer types.
From what I remember of C:
'Y' (or whatever) is a character constant, and you can cast it to an int, which is why the L'...' form converts to an integer value.
"y" is a string constant, and you can't translate it into an integer value.