C: char vs. unsigned char for non-ASCII text data

This question:
What is an unsigned char?
does a great job of discussing char vs. unsigned char vs. signed char in C.
However, it doesn't directly address what should be used for non-ASCII text. Thus if I have an array of bytes that represents text in some arbitrary character set like UTF-8 or Big5 (or sometimes ASCII), should I use an array of char or unsigned char?
I'm leaning towards using char because otherwise gcc gives me warnings about signedness of pointers when the array is ASCII and I use strlen. But I would like to know what is correct.

Use plain char to represent characters. Use signed char when you want a signed integer type covering at least the range -127 to +127. Use unsigned char when you want an unsigned integer type covering at least the range 0 to 255.
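For instance, a minimal sketch (using only the standard <limits.h> macros) that prints the ranges your implementation actually gives the three types:
#include <limits.h>
#include <stdio.h>

int main(void)
{
    /* The exact ranges are implementation-defined; <limits.h> reports them. */
    printf("char:          %d .. %d\n", CHAR_MIN, CHAR_MAX);
    printf("signed char:   %d .. %d\n", SCHAR_MIN, SCHAR_MAX);
    printf("unsigned char: 0 .. %d\n", UCHAR_MAX);
    return 0;
}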

The question you are asking is probably much broader than you expect.
To answer it directly: most implementations use a "byte" as the underlying buffer element, and in those terms the standard uint8_t typedef is your best bet. That is primarily because most character sets use a variable number of bytes to store a character, so byte-by-byte processing is essential when encoding and decoding. It also simplifies conversion between different endiannesses.
In general it is incorrect to use strlen on anything other than ASCII or another single-byte code page (values 0-255), because it counts bytes, not characters. It is certainly incorrect on any multi-byte encoding such as Big5, UTF-8/16 or Shift-JIS.
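As a sketch of byte-wise processing with uint8_t (assuming a valid, NUL-terminated UTF-8 buffer; utf8_count_codepoints is just an illustrative name), counting code points means skipping the 10xxxxxx continuation bytes:
#include <stddef.h>
#include <stdint.h>

/* Count code points in a valid, NUL-terminated UTF-8 buffer by counting only
   the bytes that are NOT continuation bytes (pattern 10xxxxxx). The byte
   length would simply be the total number of loop iterations. */
size_t utf8_count_codepoints(const uint8_t *s)
{
    size_t n = 0;
    for (; *s != 0; ++s) {
        if ((*s & 0xC0) != 0x80)   /* lead byte or plain ASCII byte */
            ++n;
    }
    return n;
}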

For UTF-8, or any encoding in which the ASCII characters keep their codepoints, char is the best type for multi-byte character strings:
Assume typedef char utf8;
This is the only way to allow char * to be used as utf8 * without an explicit cast. That is extremely common and a good enough reason to prefer it over unsigned char.
A utf8 * could be accidentally passed to a function expecting a pointer to a sequence of ASCII characters, but that can also be exactly what you need, e.g. when you want to printf your utf8 string (which is a valid thing to do).
The main drawback is that, since the signedness of char is unknown, using arithmetic operators such as > is unsafe; the only safe way to check whether a character is in the ASCII range is to test the top bit directly, e.g. IS_ASCII(c) defined as ((c & (1 << 7)) == 0).
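A small sketch of that convention (utf8 and IS_ASCII are illustrative names taken from this answer, not standard identifiers):
#include <stdio.h>

typedef char utf8;   /* illustrative alias, not a standard type */

/* A byte is plain ASCII iff its top bit is clear; going through unsigned char
   avoids depending on the implementation-defined signedness of char. */
#define IS_ASCII(c) ((((unsigned char)(c)) & 0x80) == 0)

/* Because utf8 is just char, a utf8 * can be handed straight to functions
   expecting char *, such as fputs: */
void print_utf8(const utf8 *s)
{
    fputs(s, stdout);
}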

Related

Proper handling of 128..255 chars in C

I need to process some Win-1251-encoded text (8-bit encoding, uses some of 128..255 for Cyrillic). As far as I can tell, C was created with 7-bit ASCII in mind, no explicit support for single-byte chars above 127. So I have several questions:
Which is the more proper type for this text: char[] or unsigned char[]?
If I use unsigned char[] with built-in functions (strlen, strcmp), the compiler warns about implicit casts to char*. Can such a cast break something? Should I re-implement some of the functions to support unsigned char strings explicitly?
C has three distinct character types: signed char, unsigned char, and char, which may be either signed or unsigned. For strings you should just use char, since that plays nicely with all the built-in functions. They all also work fine on characters with numeric values greater than 127. You should have no problems using char in your case.
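For example, a minimal sketch (the byte values are illustrative Windows-1251 Cyrillic codes) showing that the string functions work on such data, and that casting to unsigned char is the way to read a byte's numeric value when it is above 127:
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Hypothetical Windows-1251 data: bytes above 127 encode Cyrillic letters. */
    char text[] = "\xC0\xC1\xC2";

    printf("byte length: %zu\n", strlen(text));   /* strlen counts bytes up to the NUL */

    /* To read a byte's numeric value as 0..255, go through unsigned char;
       on a signed-char platform text[i] itself could be negative. */
    for (size_t i = 0; text[i] != '\0'; ++i)
        printf("byte %zu = %d\n", i, (unsigned char)text[i]);

    return 0;
}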

What is the use of signed char and unsigned char?

Since the C language uses char as an integer internally (the corresponding ASCII code is stored), we can use signed and unsigned char for internal calculation.
Other than that, is there any other use?
signed and unsigned char are first and foremost just small integers. Do you need to store a large quantity of small numbers (in the range [-127, +127]¹ or [0, 255])? You can use an array of signed or unsigned chars and save memory compared to pretty much any other type. That's what is done for e.g. images: a grayscale image is generally stored as an array of unsigned char (and an RGB image is generally stored as an array of 3 unsigned char components).
The second usage of char is for character strings, which you have probably already seen; notice that char is a distinct type from both signed char and unsigned char, and its signedness is implementation-defined. This is stupid and inconvenient in many situations, and leads to sad stuff such as the mandatory cast to unsigned char when calling functions of the toupper/isupper family (see the sketch after this answer).
Finally, char & co. are defined as the "underlying storage" of the C abstract machine. sizeof(char) == 1 by definition, and any type can be aliased through a (signed|unsigned)? char pointer to access its underlying bit representation.
¹ Yes, -127: [-127, +127] is the minimum range allowed for signed char by the standard, since it still allows a sign-and-magnitude representation; more realistically, on any real-world machine of this century it will be at least [-128, 127].
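A small sketch of the toupper point above (upcase_in_place is just an illustrative helper name):
#include <ctype.h>

/* toupper() expects an int whose value is representable as unsigned char
   (or is EOF); passing a plain char that happens to be negative is undefined
   behaviour, hence the cast. */
void upcase_in_place(char *s)
{
    for (; *s != '\0'; ++s)
        *s = (char)toupper((unsigned char)*s);
}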

Difference between char and int when declaring character

I just started learning C and am rather confused over declaring characters using int and char.
I am well aware that characters are represented by integers, in the sense that the "integer" of a character is its ASCII code.
That said, I learned that it's perfectly possible to declare a character using int without using the ASCII code. E.g. declaring the variable test as the character 'X' can be written as:
char test = 'X';
and
int test = 'X';
And for both declarations of the character, the conversion character is %c (even though test is defined as int).
Therefore, my question is/are the difference(s) between declaring character variables using char and int and when to use int to declare a character variable?
The difference is the size in bytes of the variable, and hence the range of values the variable can hold.
A char is required to accept all values between 0 and 127 (inclusive), so in common environments it occupies exactly one byte (8 bits). It is implementation-defined whether it is signed (-128 to 127) or unsigned (0 to 255).
An int is required to be at least a 16 bits signed word, and to accept all values between -32767 and 32767. That means that an int can accept all values from a char, be the latter signed or unsigned.
If you want to store only characters in a variable, you should declare it as char. Using an int would just waste memory, and could mislead a future reader. One common exception to that rule is when you want to process a wider value for special conditions. For example the function fgetc from the standard library is declared as returning int:
int fgetc(FILE *fd);
because the special value EOF (End Of File) is defined as the int value -1 (all bits set to one on a two's-complement system), and an int is wider than a char. That way no valid character (only 8 bits on a common system) can compare equal to the EOF constant. If the function were declared to return a plain char, nothing could distinguish the EOF value from the (valid) character 0xFF.
That's the reason why the following code is bad and should never be used:
char c; // a terrible memory saving...
...
while ((c = fgetc(stdin)) != EOF) { // NEVER WRITE THAT!!!
...
}
Inside the loop a char would be enough, but so that the EOF test does not wrongly succeed when the character 0xFF is read, the variable needs to be an int.
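A correct version of the loop, keeping the result in an int until after the EOF test, might look like this:
#include <stdio.h>

int main(void)
{
    int c;   /* int, not char, so EOF stays distinct from every byte value */

    while ((c = fgetc(stdin)) != EOF) {
        /* here c holds a value in the range 0..UCHAR_MAX */
        putchar(c);
    }
    return 0;
}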
The char type has multiple roles.
The first is that it is simply part of the chain of integer types, char, short, int, long, etc., so it's just another container for numbers.
The second is that it is the smallest unit of underlying storage: all other objects have a size that is a multiple of the size of char (sizeof returns a number in units of char, so sizeof(char) == 1), as sketched after this answer.
The third is that it plays the role of a character in a string, certainly historically. When seen like this, the value of a char maps to a specified character, for instance via the ASCII encoding, but it can also be used with multi-byte encodings (one or more chars together map to one character).
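A minimal sketch of that "underlying storage" role (assuming nothing beyond standard C): any object can be examined byte by byte through an unsigned char pointer, which incidentally also reveals the machine's endianness.
#include <stdio.h>

int main(void)
{
    unsigned int x = 0x01020304u;

    /* Aliasing through unsigned char * is always allowed; the order in which
       the bytes come out shows the machine's byte order. */
    const unsigned char *p = (const unsigned char *)&x;
    for (size_t i = 0; i < sizeof x; ++i)
        printf("byte %zu: %02X\n", i, (unsigned)p[i]);

    return 0;
}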
Size of an int is 4 bytes on most architectures, while the size of a char is 1 byte.
Usually you should declare characters as char and use int for integers that need to hold bigger values. On most systems a char occupies one byte, which is 8 bits. Depending on your system, char might be signed or unsigned by default, so it will be able to hold either 0 to 255 or -128 to 127.
An int might be 32 bits long, but if you really want exactly 32 bits for your integer you should declare it as int32_t or uint32_t instead.
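For instance, a quick sketch contrasting the declarations (the values are arbitrary):
#include <stdint.h>

char     ch  = 'X';            /* a single character */
int      n   = 1000;           /* a "natural" integer, at least 16 bits wide */
int32_t  i32 = -2000000000;    /* exactly 32 bits, signed */
uint32_t u32 = 4000000000u;    /* exactly 32 bits, unsigned */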
I think there's no real difference, but you're allocating extra memory you're not going to use. You can also do const long a = 1;, but it would be more suitable to use const char a = 1; instead.

Using fputc() to write one byte

Given unsigned char *str, a UTF-8 encoded string, is it legal to write the first byte (not character) with fputc((char)(*str), file);
Remove the cast to char. fputc takes the character to write as an int argument whose value is expected to be in the range of unsigned char, not char. Assuming that (unsigned char)(char) acts as the identity on unsigned char values, there's no error in your code, but that is not guaranteed, especially on oddball systems without two's complement.
It's legal. fputc converts its int input to unsigned char, and that conversion can't do anything too unpleasant. It just takes the value modulo UCHAR_MAX+1.
If char is unsigned on your implementation, then converting from unsigned char to char doesn't affect the value.
If char is signed on your implementation, then converting a value greater than CHAR_MAX to char either has an implementation-defined result, or raises a signal (6.3.1.3/3). So while your code is legal, the possible behavior includes raising a signal that terminates the program, which might not be what you want.
In practice, you expect implementations to use 2's complement, and to convert to signed types in the "obvious" way, preserving the bit pattern.
Even if nothing else goes wrong, your terminal might not do anything sensible, if you write a strange byte to STDOUT.
No, you have to pass a FILE pointer as second parameter. This is the file handle you would like to write the character to, for example stdout.
fputc(*str, stdout);
Yes, it is legal. fputc will just write one byte. The cast to signed/unsigned in this case just stops the compiler complaining at you.
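Putting the advice from these answers together, a hedged sketch (write_utf8 is an illustrative name) that writes a whole UTF-8 buffer byte by byte, with no cast at all:
#include <stdio.h>

/* Write a NUL-terminated UTF-8 buffer one byte at a time. No cast is needed:
   fputc itself converts its int argument to unsigned char before writing. */
int write_utf8(const unsigned char *str, FILE *file)
{
    for (; *str != '\0'; ++str) {
        if (fputc(*str, file) == EOF)
            return -1;   /* write error */
    }
    return 0;
}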

C: char to int conversion

From The C Programming Language (Brian W. Kernighan), 2.7 Type Conversions, p. 43:
"There is one subtle point about the conversion of characters to integers. ... On some machines a char whose leftmost bit is 1 will be converted to a negative integer. On others, ... is always positive. For portability, specify signed or unsigned if non-character data is to be stored in char variables."
My questions are:
1) Why would anyone want to store non-char data in char? (An example where this is necessary would be really nice.)
2) Why does the integer value of a char change when it is converted to int?
3) Can you elaborate more on this portability issue?
In regards to 1)
People often use char arrays when they really want a byte buffer for a data stream. It's not great practice, but plenty of projects do it, and if you're careful, no real harm is done. There are probably other cases as well.
In regards to 2)
Signed integers are often sign-extended when they are converted to a larger data type. Thus
11111111b (-1 in base 10) becomes 11111111 11111111 11111111 11111111 when widened to 32 bits. However, if the char was meant to be the unsigned value +255, the resulting signed integer ends up as -1.
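A short demonstration of that difference (the output of the first line depends on whether plain char is signed on your platform):
#include <stdio.h>

int main(void)
{
    char          c  = (char)0xFF;   /* bit pattern 11111111; -1 if char is signed */
    unsigned char uc = 0xFF;

    int from_char  = c;    /* sign-extended to -1 on signed-char platforms, 255 otherwise */
    int from_uchar = uc;   /* always 255 */

    printf("char  -> int: %d\n", from_char);
    printf("uchar -> int: %d\n", from_uchar);
    return 0;
}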
About portability 3)
Some machines regard chars as signed integers, while others interpret them as unsigned. It could also vary based on compiler implementation. Most of the time you don't have to worry about it. Kernighan is just trying to help you understand the details.
Edit
I know this is a dead issue, but you can use the following code to check whether chars on your system are signed or unsigned:
#include <limits.h> // Implementation-specific constants (CHAR_MAX, SCHAR_MAX, etc.)
#if CHAR_MAX == SCHAR_MAX
// Plain "char" is signed
#else
// Plain "char" is unsigned
#endif
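If you prefer a run-time check rather than a preprocessor one, the same question can be answered like this (a minimal sketch, relying only on standard conversions):
#include <stdio.h>

int main(void)
{
    /* Converting -1 to char yields a negative value only when plain char is signed. */
    printf("plain char is %s\n", ((char)-1 < 0) ? "signed" : "unsigned");
    return 0;
}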
1) char is the size of a single byte in C, and is therefore used for storing any sort of data. For example, when loading an image into memory, the data is represented as an array of char. In modern code, typedefs such as uint8_t are used to indicate the purpose of a buffer more usefully than just char.
2 & 3) Whether or not char is signed or unsigned is platform dependent, so if a program depends on this behavior then it's best to specify one or the other explicitly.
The char type is defined to hold one byte, i.e. sizeof(char) is defined to be 1. This is useful for serializing data, for instance.
char is implementation-defined as either unsigned char or signed char. Now imagine that char means smallint. You are simply converting a small integer to a larger integer when you go from smallint to int. The problem is, you don't know whether that smallint is signed or unsigned.
I would say it's not really a portability issue as long as you follow The Bible (K&R).
unsigned char is often used to process binary data one byte at a time. A common example is UTF-8 strings, which are not strictly made up of "chars."
If a signed char is 8 bits and the top bit is set, that indicates that it's negative. When this is converted to a larger type, the sign is kept by extending the high bit to the high bit of the new type. This is called a "sign-extended" assignment.
1) char is implemented as one byte across all systems, so it is consistent.
2) The bit mentioned in your question is the one that single-byte integers use for their signedness. When an int on a system is larger than one byte, the sign is not affected when you convert char to int; otherwise it is. (There are also signed and unsigned chars.)
3) Because of the consistency of the char implementation, lots of libraries use it, like the Intel IPP (Integrated Performance Primitives) libraries and their cousin OpenCV.
Usually, in C, char-to-int conversion and vice versa is an issue because the standard APIs for reading character input and writing character output use ints for the character arguments and return values. See getchar(), getc() and putchar() for example.
Also, since the size of a char is 1 byte, it is a convenient way to deal with arbitrary data as a byte stream.
