(Edited to change C/C++ to C)
Please help me find a clear clarification of char versus unsigned char in C, especially when transferring data between embedded devices and general-purpose PCs (i.e., the difference between a buffer of unsigned char and one of plain char).
You're asking about two different languages but, in this respect, the answer is (more or less) the same for both. You really should decide which language you're using though.
Differences:
they are distinct types
it's implementation-defined whether char is signed or unsigned
Similarities:
they are both integer types
they are the same size (one byte, at least 8 bits)
If you're simply using them to transfer raw byte values, with no arithmetic, then there's no practical difference.
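For example, a minimal sketch (the buffer names and contents here are made up for illustration): when bytes are only copied or compared as raw data, a char buffer and an unsigned char buffer hold exactly the same bits.

#include <stdio.h>
#include <string.h>

int main(void)
{
    unsigned char rx_buf[4] = { 0x01, 0xFF, 0x7F, 0x80 }; /* e.g. bytes received from a device */
    char plain_buf[4];

    /* Copying raw bytes works the same regardless of signedness. */
    memcpy(plain_buf, rx_buf, sizeof plain_buf);

    /* The object representation (the bits) is identical. */
    printf("%d\n", memcmp(plain_buf, rx_buf, sizeof plain_buf)); /* prints 0 */
    return 0;
}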
The type char is special. It is not an unsigned char or a signed char. These are three distinct types (whereas int and signed int are the same type). A char might have a signed or an unsigned representation.
From 3.9.1 Fundamental types (quoted from the C++ standard; C draws the same distinction in 6.2.5, quoted further below):
Plain char, signed char, and unsigned char are three distinct types. A char, a signed char, and an unsigned char occupy the same amount of storage and have the same alignment requirements (3.11); that is, they have the same object representation.
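To illustrate the "three distinct types" point, here is a small C11 sketch (not from the original answer) using _Generic, which selects on types rather than values:

#include <stdio.h>

#define TYPE_NAME(x) _Generic((x),          \
    char:          "char",                  \
    signed char:   "signed char",           \
    unsigned char: "unsigned char",         \
    default:       "other")

int main(void)
{
    char c = 0; signed char sc = 0; unsigned char uc = 0;
    /* All three associations may appear in one _Generic because the
       types are distinct, even though char behaves like one of the
       other two. */
    printf("%s %s %s\n", TYPE_NAME(c), TYPE_NAME(sc), TYPE_NAME(uc));
    return 0;
}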
Related: this question was closed as a duplicate of "What is an unsigned char?"
I have seen in my legacy embedded code that people use signed char as a return type. What's the need to put signed there? Isn't it implicit that char is nothing but signed char?
char, signed char, and unsigned char are all distinct types.
Your implementation can set char to be either signed or unsigned.
unsigned char has a distinct range set by the standard. The problem with using plain char is that your code can behave differently on different platforms; char could even be a ones' complement or sign-magnitude type with a range of -127 to +127.
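A small sketch of how to observe the implementation-defined choice at compile time, using the standard CHAR_MIN/CHAR_MAX macros from limits.h:

#include <limits.h>
#include <stdio.h>

int main(void)
{
    /* CHAR_MIN is 0 if plain char is unsigned, SCHAR_MIN if it is signed. */
#if CHAR_MIN == 0
    puts("plain char is unsigned on this implementation");
#else
    puts("plain char is signed on this implementation");
#endif
    printf("char range: %d to %d\n", CHAR_MIN, CHAR_MAX);
    return 0;
}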
Because no one in the candidate duplicate's answers cited the right paragraph from the spec:
6.2.5 [Types], paragraph 15
The three types char, signed char, and unsigned char are collectively called the character types. The implementation shall define char to have the same range, representation, and behavior as either signed char or unsigned char.
So char could be either. (FWIW, I believe I ran into char = unsigned char in JNI.)
From CppReference:
Character types
signed char - type for signed character representation.
unsigned char - type for unsigned character representation. Also used to inspect object representations (raw memory).
char - type for character representation. Equivalent to either signed char or unsigned char (which one is implementation-defined and may be controlled by a compiler commandline switch), but char is a distinct type, different from both signed char and unsigned char.
So if you want to ensure that you're using either an unsigned char or a signed char, specify it explicitly. Otherwise there'd be no guarantee whether it'll be signed or not.
I know that a char is allowed to be signed or unsigned depending on the implementation. This doesn't really bother me if all I want to do is manipulate bytes. (In fact, I don't think of the char datatype as a character, but as a byte.)
But, if I understand, string literals are signed chars (actually they're not, but see the update below), and the function fgetc() returns unsigned chars cast into int. So if I want to manipulate characters, is it preferred style to use signed, unsigned, or ambiguous characters? Why does reading characters from a file have a different convention than literals?
I ask because I have some code in C that does string comparison between string literals and the contents of files, but having a signed char * vs unsigned char * might really make my code error prone.
Update 1
Ok, as a few people pointed out (in answers and comments), string literals are in fact char arrays, not signed char arrays. That means I really should use char * for string literals, and not think about whether they are signed or unsigned. This makes me perfectly happy (until I have to start making conversions/comparisons with unsigned chars).
However the important question remains, how do I read characters from a file, and compare them to a string literal. The crux of which is the conversion from the int read using fgetc(), which explicitly reads an unsigned char from the file, to the char type, which is allowed to be either signed or unsigned.
Allow me to provide a more detailed example.
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    FILE *someFile = fopen("ThePathToSomeRealFile.html", "r");
    assert(someFile);
    char substringFromFile[25];
    memset((void*)substringFromFile, 0, sizeof(substringFromFile));
    //Alright, the real example is to read the first few characters from the file
    //And then compare them to the string I expect
    const char *expectedString = "<!DOCTYPE";
    for( size_t counter = 0; counter < strlen(expectedString); ++counter )
    {
        //Read it as an integer, because the function returns an `int`
        const int oneCharacter = fgetc(someFile);
        if( ferror(someFile) )
            return EXIT_FAILURE;
        if( oneCharacter == EOF || feof(someFile) )
            break;
        assert(counter < sizeof(substringFromFile)/sizeof(*substringFromFile));
        //HERE IS THE PROBLEM:
        //I know the data contained in oneCharacter must be an unsigned char
        //Therefore, this is valid
        const unsigned char uChar = (unsigned char)oneCharacter;
        (void)uChar; //not used further; it only illustrates the point
        //But then how do I assign it to the char?
        substringFromFile[counter] = (char)oneCharacter;
    }
    fclose(someFile);
    //and ultimately here's my goal
    const int headerIsCorrect = (strncmp(substringFromFile, expectedString, 9) == 0);
    if( headerIsCorrect )
        return EXIT_SUCCESS;
    //else
    return EXIT_FAILURE;
}
Essentially, I know my fgetc() call is returning something that (after some error checking) is representable as an unsigned char. I know that char may or may not be an unsigned char. That means, depending on the implementation of the C standard, the cast to char may involve no reinterpretation at all. However, in the case that the system is implemented with a signed char, I have to worry about values that can be encoded by an unsigned char but aren't encodable in a char (i.e. the values in (INT8_MAX, UINT8_MAX]).
tl;dr
The question is this: should I (1) copy the underlying bytes read by fgetc() (by casting pointers - don't worry, I know how to do that), or (2) convert down from unsigned char to char (which is only safe if I know that the values can't exceed INT8_MAX, or those values can be ignored for whatever reason)?
The historical reasons are (as I've been told, I don't have a reference) that the char type was poorly specified from the beginning.
Some implementations used "consistent integer types" where char, short, int and so on were all signed by default. This makes sense because it makes the types consistent with each other.
Other implementations made char unsigned, since there never existed any symbol tables with negative indices (that would be stupid) and since they saw a need for more than 128 characters (a very valid concern).
By the time C got standardized properly, it was too late to change this, too many different compilers and programs written for them were already out on the market. So the signedness of char was made implementation-defined, for backwards compatibility reasons.
The signedness of char does not matter if you only use it to store characters/strings. It only matters when you decide to involve the char type in arithmetic expressions or use it to store integer values - this is a very bad idea.
For characters/string, always use char (or wchar_t).
For any other form of 1 byte large data, always use uint8_t or int8_t.
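A minimal sketch of that split (the names and data are made up): plain char for text handed to the string functions, a fixed-width type for raw bytes.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Text: plain char, which is what the string functions expect. */
    char greeting[16];
    strcpy(greeting, "hello");

    /* Raw one-byte data: a fixed-width unsigned type, so arithmetic
       and comparisons behave the same on every platform. */
    uint8_t packet[4] = { 0xDE, 0xAD, 0xBE, 0xEF };
    uint8_t checksum = 0;
    for (size_t i = 0; i < sizeof packet; ++i)
        checksum ^= packet[i];

    printf("%s, checksum=0x%02X\n", greeting, (unsigned)checksum);
    return 0;
}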
But, if I understand, string literals are signed char
No, string literals are char arrays.
the function fgetc() returns unsigned chars cast into int
More precisely, it returns the character read as an unsigned char converted to an int. The return type is int because it must also be able to represent EOF, which is an integer constant and not a character constant.
having a signed char * vs unsigned char * might really make my code error prone.
No, not really. Formally, this rule from the standard applies:
A pointer to an object type may be converted to a pointer to a different object type. If the resulting pointer is not correctly aligned for the referenced type, the behavior is undefined. Otherwise, when converted back again, the result shall compare equal to the original pointer.
There is no case where converting a pointer to signed char into a pointer to unsigned char, or vice versa, would cause any alignment issues or other problems.
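For instance, a small sketch of such a round trip (the buffer is hypothetical):

#include <stdio.h>

int main(void)
{
    char text[] = "abc";

    /* Viewing the same bytes through an unsigned char pointer is fine:
       all character types have identical size and alignment. */
    unsigned char *bytes = (unsigned char *)text;
    for (size_t i = 0; bytes[i] != 0; ++i)
        printf("%02X ", (unsigned)bytes[i]);
    putchar('\n');

    /* Converting back compares equal to the original pointer. */
    char *back = (char *)bytes;
    printf("%d\n", back == text);   /* prints 1 */
    return 0;
}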
I know that a char is allowed to be signed or unsigned depending on the implementation. This doesn't really bother me if all I want to do is manipulate bytes.
If you're going to do comparison or assign char to other integer types, it should bother you.
But, if I understand, string literals are signed chars
They are of type char[], so on an implementation where char is equivalent to unsigned char, all string literals effectively hold unsigned char values.
the function fgetc() returns unsigned chars cast into int.
That's correct, and it is required to avoid undesired sign extension.
So if I want to manipulate characters, is it preferred style to use signed, unsigned, or ambiguous characters?
For portability I'd advise following the practice adopted by various libc implementations: use char, but cast to unsigned char (char * to unsigned char *) before processing. This way implicit integer promotions won't turn characters in the range 0x80 to 0xFF into negative numbers of wider types.
In short: (signed char)a < (signed char)b is NOT always equivalent to (unsigned char)a < (unsigned char)b.
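For example (a small sketch with arbitrarily chosen byte values, assuming the usual 8-bit two's complement signed char):

#include <stdio.h>

int main(void)
{
    unsigned char a = 0x80;  /* e.g. a non-ASCII byte from UTF-8 text */
    unsigned char b = 0x41;  /* 'A' */

    /* As unsigned char: 0x80 (128) is greater than 0x41 (65). */
    printf("unsigned: %d\n", (unsigned char)a < (unsigned char)b); /* 0 */

    /* As signed char (8-bit two's complement): 0x80 is -128, so it
       compares *less* than 0x41. The two orderings disagree. */
    printf("signed:   %d\n", (signed char)a < (signed char)b);     /* 1 */
    return 0;
}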
Why does reading characters from a file have a different convention than literals?
getc() needs a way to return EOF such that it couldn't be confused with any real char.
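The canonical pattern built around that convention looks roughly like this (a minimal sketch reading from stdin):

#include <stdio.h>

int main(void)
{
    int c;  /* int, NOT char, so EOF (a negative int) cannot be
               confused with a real byte value in 0..UCHAR_MAX */
    while ((c = fgetc(stdin)) != EOF) {
        unsigned char byte = (unsigned char)c;  /* safe: c is 0..UCHAR_MAX here */
        putchar(byte);
    }
    return 0;
}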
So I have an application where I use a lot of arrays of chars, shorts, ints, and long longs, all unsigned. Rather than allocating space for each and deallocating, my thought is to have a static array of unsigned long longs. I would then cast this as needed as an array of the appropriate type. Is there a way to prove this is compliant with the standard?
I am statically asserting that char, short, int, and long long are of sizes 1, 2, 4, and 8, respectively, and that their alignment requirements do not exceed their sizes. I would like to know if I can prove the validity of my approach with no further static assertions.
EDIT: I thought I'd add that the standard defines the object representation as a copy of an object as an array of unsigned char. That seems to justify using an unsigned long long array as either that or an unsigned char array, although I cannot absolutely rule out problems associated with using the object representation in the context of the object itself rather than in a copy (which is how 6.2.6.1.4 discusses the object representation). This is, however, all I can find, and it does not help at all with the two intermediate integer sizes.
First off, you're not talking about casting arrays. You're talking about casting pointers.
The standard does not guarantee that what you're doing is safe. You can treat an array of unsigned long long as, for example, an array of unsigned char, but there's no guarantee that you can treat it as an array of unsigned int.
Consider a hypothetical implementation with CHAR_BIT==8, sizeof (unsigned int) == 4, and sizeof (unsigned long long) == 8. Assume unsigned int requires strict 4-byte alignment. But the underlying machine has no direct support for 64-bit quantities, so all operations on unsigned long long are done in software. Because of this, the required alignment for unsigned long long is, let's say, 2 bytes.
So an array of unsigned long long might start at an address that's not a multiple of 4 bytes, and therefore you can't safely treat it as an array of unsigned int.
I don't suggest that this is a plausible implementation. Even if 64-bit integers are implemented in software, it would probably make sense for them to have at least 32-bit alignment. But nothing in what I've described violates the standard; the hypothetical implementation could be conforming.
If you're using a C11 compiler (as indicated by the tag on your question), you could statically assert that
_Alignof (unsigned long long) >= _Alignof (unsigned int)
and so forth.
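For example, a sketch of such assertions using the C11 keyword forms (the exact set of checks is up to you):

/* C11: the keyword forms need no headers. */
_Static_assert(_Alignof(unsigned long long) >= _Alignof(unsigned int),
               "unsigned long long is not aligned strictly enough for unsigned int");
_Static_assert(_Alignof(unsigned long long) >= _Alignof(unsigned short),
               "unsigned long long is not aligned strictly enough for unsigned short");
_Static_assert(_Alignof(unsigned long long) >= _Alignof(unsigned char),
               "unsigned long long is not aligned strictly enough for unsigned char");

int main(void) { return 0; }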
Or you could use malloc during startup to allocate your array, guaranteeing that it's properly aligned for any type.
Stealing an idea from the comments, you could define a union of array types, something like:
#define BYTE_COUNT some_big_number
union arrays {
    unsigned char      ca[BYTE_COUNT];
    unsigned short     sa[BYTE_COUNT / sizeof (unsigned short)];
    unsigned int       ia[BYTE_COUNT / sizeof (unsigned int)];
    unsigned long      la[BYTE_COUNT / sizeof (unsigned long)];
    unsigned long long lla[BYTE_COUNT / sizeof (unsigned long long)];
};
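A self-contained sketch of how such a union might be used; BYTE_COUNT is set to an arbitrary 1024 here purely for illustration.

#include <stdio.h>

#define BYTE_COUNT 1024

static union {
    unsigned char      ca[BYTE_COUNT];
    unsigned short     sa[BYTE_COUNT / sizeof (unsigned short)];
    unsigned int       ia[BYTE_COUNT / sizeof (unsigned int)];
    unsigned long      la[BYTE_COUNT / sizeof (unsigned long)];
    unsigned long long lla[BYTE_COUNT / sizeof (unsigned long long)];
} scratch;

int main(void)
{
    /* The union is aligned for its most strictly aligned member, so
       using the storage through any one member at a time is well-defined. */
    for (size_t i = 0; i < BYTE_COUNT / sizeof (unsigned int); ++i)
        scratch.ia[i] = (unsigned int)i;

    printf("%u\n", scratch.ia[10]);   /* prints 10 */
    return 0;
}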
Or you could define your arrays for the type of data you want to store in them.
I am so confused about size_t. I have searched on the internet, and everywhere it is mentioned that size_t is an unsigned type, so it can represent only non-negative values.
My first question is: if it is used to represent only non-negative values, why don't we use unsigned int instead of size_t?
My second question is: are size_t and unsigned int interchangeable or not? If not, then why?
And can anyone give me a good example of size_t and briefly its workings?
if it is used to represent only non-negative values, why don't we use unsigned int instead of size_t?
Because unsigned int is not the only unsigned integer type. size_t could be any of unsigned char, unsigned short, unsigned int, unsigned long or unsigned long long, depending on the implementation.
are size_t and unsigned int interchangeable or not, and if not, then why?
They aren't interchangeable, for the reason explained above ^^.
And can anyone give me a good example of size_t and its brief working ?
I don't quite get what you mean by "its brief working". It works like any other unsigned type (in particular, like the type it's typedeffed to). You are encouraged to use size_t when you are describing the size of an object. In particular, the sizeof operator and various standard library functions, such as strlen(), return size_t.
Bonus: here's a good article about size_t (and the closely related ptrdiff_t type). It explains very well why you should use it.
There are 5 standard unsigned integer types in C:
unsigned char
unsigned short
unsigned int
unsigned long
unsigned long long
with various requirements for their sizes and ranges (briefly, each type's range is a subset of the next type's range, but some of them may have the same range).
size_t is a typedef (i.e., an alias) for some unsigned type (probably one of the above, but possibly an extended unsigned integer type, though that's unlikely). It's the type yielded by the sizeof operator.
On one system, it might make sense to use unsigned int to represent sizes; on another, it might make more sense to use unsigned long or unsigned long long. (size_t is unlikely to be either unsigned char or unsigned short, but that's permitted).
The purpose of size_t is to relieve the programmer from having to worry about which of the predefined types is used to represent sizes.
Code that assumes sizeof yields an unsigned int would not be portable. Code that assumes it yields a size_t is more likely to be portable.
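A small example of where size_t naturally appears (nothing here is specific to any particular implementation):

#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *s = "hello";
    size_t len = strlen(s);            /* strlen() returns size_t */
    size_t sz  = sizeof (double[10]);  /* sizeof yields size_t    */

    /* %zu is the printf conversion specifier for size_t. */
    printf("len = %zu, sz = %zu\n", len, sz);

    /* size_t as an index type can cover any valid object size. */
    for (size_t i = 0; i < len; ++i)
        putchar(s[i]);
    putchar('\n');
    return 0;
}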
size_t has a specific restriction.
Quoting from http://www.cplusplus.com/reference/cstring/size_t/ :
Alias of one of the fundamental unsigned integer types.
It is a type able to represent the size of any object in bytes: size_t is the type returned by the sizeof operator and is widely used in the standard library to represent sizes and counts.
It is not interchangeable with unsigned int because the size of int is specified by the data model. For example LLP64 uses a 32-bit int and ILP64 uses a 64-bit int.
Apart from what the other answers say, using size_t also documents the code: it tells people that you are talking about the size of objects in memory.
size_t is used to store sizes of data objects, and is guaranteed to be able to hold the size of any data object that the particular C implementation can create. This data type may be smaller (in number of bits), bigger or exactly the same as unsigned int.
size_t is a base unsigned integer type of the C and C++ languages. It is the type of the result returned by the sizeof operator. The type's size is chosen so that it can store the maximum size of a theoretically possible array of any type. On a typical 32-bit system size_t will take 32 bits, and on a 64-bit one, 64 bits. In other words, a variable of size_t type can usually also hold a pointer value (pointers to C++ member functions are an exception, but that is a special case). Even so, it is better to use the unsigned integer type uintptr_t for storing pointers (its name reflects that capability); size_t and uintptr_t usually have the same width, but the standard does not make them synonyms. The size_t type is commonly used for loop counters, array indexing and address arithmetic.
The maximum possible value of the size_t type is the constant SIZE_MAX.
In simple words, size_t is both platform- and implementation-dependent, whereas unsigned int is only platform-dependent.
I have found that the C99 standard has a statement which denies compatibility between the type char and the types signed char/unsigned char.
Note 35 of C99 standard:
CHAR_MIN, defined in limits.h, will have one of the values 0 or SCHAR_MIN, and this can be used to distinguish the two options. Irrespective of the choice made, char is a separate type from the other two and is not compatible with either.
My question is: why does the committee deny this compatibility? What is the rationale? If char were compatible with signed char or unsigned char, would something terrible happen?
The roots are in compiler history. There were (are) essentially two C dialects in the Eighties:
Where plain char is signed
Where plain char is unsigned
Which of these should C89 have standardized? C89 chose to standardize neither, because it would have invalidated a large number of assumptions made in C code already written--what standard folks call the installed base. So C89 did what K&R did: leave the signedness of plain char implementation-defined. If you required a specific signedness, qualify your char.
Modern compilers usually let you choose the dialect with an option (e.g. gcc's -funsigned-char).
The "terrible" thing that can happen if you ignore the distinction between (un)signed char and plain char is that if you do arithmetic and shifts without taking these details into account, you might get sign extensions when you don't expect them or vice versa (or even undefined behavior when shifting).
There's also some dumb advice out there that recommends to always declare your chars with an explicit signed or unsigned qualifier. This works as long as you only work with pointers to such qualified types, but it requires ugly casts as soon as you deal with strings and string functions, all of which operate on pointer-to-plain-char, which is assignment-incompatible without a cast. Such code suddenly gets plastered with tons of ugly-to-the-bone casts.
The basic rules for chars are:
Use plain char for strings and if you need to pass pointers to functions taking plain char
Use unsigned char if you need to do bit twiddling and shifting on bytes
Use signed char if you need small signed values, but think about using int if space is not a concern
Think of signed char and unsigned char as the smallest arithmetic, integral types, just like signed short/unsigned short, and so forth with int, long int, long long int. Those types are all well-specified.
On the other hand, char serves a very different purpose: It's the basic type of I/O and communication with the system. It's not meant for computations, but rather as the unit of data. That's why you find char used in the command line arguments, in the definition of "strings", in the FILE* functions and in other read/write type IO functions, as well as in the exception to the strict aliasing rule. This char type is deliberately less strictly defined so as to allow every implementation to use the most "natural" representation.
It's simply a matter of separating responsibilities.
(It is true, though, that char is layout-compatible with both signed char and unsigned char, so you may explicitly convert one to the other and back.)