Given unsigned char *str, a UTF-8 encoded string, is it legal to write the first byte (not character) with fputc((char)(*str), file)?
Remove the cast to char. fputc takes the character to write as an int argument whose value is expected to be in the range of unsigned char, not char. If (unsigned char)(char) acts as the identity on unsigned char values, there's no error in your code, but that is not guaranteed, especially on oddball systems that don't use two's complement.
It's legal. fputc converts its int input to unsigned char, and that conversion can't do anything too unpleasant. It just takes the value modulo UCHAR_MAX+1.
If char is unsigned on your implementation, then converting from unsigned char to char doesn't affect the value.
If char is signed on your implementation, then converting a value greater than CHAR_MAX to char either has an implementation-defined result, or raises a signal (6.3.1.3/3). So while your code is legal, the possible behavior includes raising a signal that terminates the program, which might not be what you want.
In practice, you expect implementations to use 2's complement, and to convert to signed types in the "obvious" way, preserving the bit pattern.
Even if nothing else goes wrong, your terminal might not do anything sensible if you write a strange byte to stdout.
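As a minimal sketch of the point made above, the byte can be passed to fputc without going through char at all. The helper names write_first_byte and round_trip_first_byte are hypothetical, and the sketch assumes UCHAR_MAX <= INT_MAX so that the promoted value is non-negative.

```c
#include <assert.h>
#include <stdio.h>

/* Writes the first byte of a UTF-8 string to `file`.
 * Returns the byte written (0..UCHAR_MAX) or EOF on error. */
int write_first_byte(const unsigned char *str, FILE *file)
{
    assert(str != NULL && file != NULL);
    /* *str is promoted to a non-negative int; no cast through char. */
    return fputc(*str, file);
}

/* Hypothetical helper: writes the first byte to a temporary file and
 * reads it back. Returns the byte read, or EOF on any failure. */
int round_trip_first_byte(const unsigned char *str)
{
    FILE *tmp = tmpfile();
    int c = EOF;

    if (tmp == NULL)
        return EOF;
    if (write_first_byte(str, tmp) != EOF) {
        rewind(tmp);
        c = fgetc(tmp);
    }
    fclose(tmp);
    return c;
}
```

Because the value never passes through a possibly signed char, the implementation-defined conversion discussed above does not come into play.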
No, you have to pass a FILE pointer as second parameter. This is the file handle you would like to write the character to, for example stdout.
fputc(*str, stdout);
Yes it is legal. fputc will just write a byte. The cast to signed/unsigned in this case will just stop the compiler moaning at you.
The question "What is an unsigned char?" does a great job of discussing char vs. unsigned char vs. signed char in C.
However, it doesn't directly address what should be used for non-ASCII text. Thus if I have an array of bytes that represents text in some arbitrary character set like UTF-8 or Big5 (or sometimes ASCII), should I use an array of char or unsigned char?
I'm leaning towards using char because otherwise gcc gives me warnings about signedness of pointers when the array is ASCII and I use strlen. But I would like to know what is correct.
Use normal char to represent characters. Use signed char when you want a signed integer type that covers at least the values from -127 to +127. Use unsigned char when you want an unsigned integer type that covers at least the values from 0 to 255.
The question you are asking is probably much broader than you expect.
To answer it directly, most implementations use a "byte" as the underlying buffer unit. In those terms the standard uint8_t typedef is your best bet. That is primarily because most character sets use a variable number of bytes to store characters, so byte-by-byte processing is essential when encoding and decoding. It also simplifies conversion between different "endianness".
In general it's incorrect to use strlen to count characters in anything other than ASCII or other single-byte code pages (0-255 range): it counts bytes up to the first zero byte, not characters. That is not a character count for multi-byte encodings such as Big5, UTF-8 or Shift-JIS, and for UTF-16 it can stop early at a zero byte inside a code unit.
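To illustrate the difference between bytes and characters, here is a small sketch that counts UTF-8 code points by skipping continuation bytes (those of the form 10xxxxxx). The function name utf8_count is hypothetical, and the input is assumed to be valid, NUL-terminated UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Counts code points in a NUL-terminated, valid UTF-8 string.
 * Every byte that is not a continuation byte starts a new code point. */
size_t utf8_count(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        /* Converting to unsigned char makes the mask test independent
         * of whether plain char is signed. */
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For the string "h\xC3\xA9llo" (the word "héllo"), strlen reports 6 bytes while utf8_count reports 5 code points.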
For UTF-8, or any encoding in which ASCII characters keep the same code points, char is the best type for multi-byte character strings:
assume typedef char utf8:
This is the only way to allow char * to be used as utf8 * without an explicit cast. This is extremely common and a good enough reason to be better than unsigned char.
A utf8 * could accidentally be passed to a function expecting a pointer to a sequence of ASCII characters, but that can also be exactly what you need, for example to printf your utf8 string (which is a valid thing to do).
The main drawback is that, because the signedness of char is unknown, relational operators such as > are unsafe. The only safe way to check whether a character is in the ASCII range is to test the high bit directly, with a macro such as ISASCII(c) defined as (((c) & (1 << 7)) == 0). Note the inner parentheses: == binds more tightly than &.
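A minimal sketch of the bit test described above, using the typedef the answer assumes. The cast to unsigned char inside the macro and the helper count_ascii are additions for illustration, not part of the original answer.

```c
#include <assert.h>
#include <stddef.h>

typedef char utf8; /* as assumed in the answer */

/* Fully parenthesized: == binds more tightly than &.
 * The unsigned char cast is an added precaution so the test does not
 * depend on how negative values are represented. */
#define ISASCII(c) ((((unsigned char)(c)) & (1u << 7)) == 0)

/* Hypothetical helper: counts the bytes of `s` that are plain ASCII. */
size_t count_ascii(const utf8 *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (ISASCII(*s))
            n++;
    }
    return n;
}
```

With "h\xC3\xA9llo" as input, the two bytes of the encoded "é" fail the test and the four remaining letters pass.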
Perhaps I'm overthinking this, as it seems like it should be a lot easier. I want to take a value of type int, such as is returned by fgetc(), and record it in a char buffer if it is not an end-of-file code. E.g.:
char buf;
int c = fgetc(stdin);
if (c < 0) {
/* handle end-of-file */
} else {
buf = (char) c; /* not quite right */
}
However, if the platform has signed default chars then the value returned by fgetc() may be outside the range of char, in which case casting or assigning it to (signed) char produces implementation-defined behavior (right?). Surely, though, there is tons of code out there that does exactly the equivalent of the example. Is it all relying on implementation-defined behavior and/or assuming 7-bit data?
It looks to me like if I want to be certain that the behavior of my code is defined by C to be what I want, then I need to do something like this:
buf = (char) ((c > CHAR_MAX) ? (c - (UCHAR_MAX + 1)) : c);
I think that produces defined, correct behavior whether default chars are signed or unsigned, and regardless even of the size of char. Is that right? And is it really needful to do that to ensure portability?
fgetc() returns either an unsigned char value converted to int, or EOF. EOF is always < 0. Whether the system's char is signed or unsigned makes no difference.
C11dr 7.21.7.1 2
If the end-of-file indicator for the input stream pointed to by stream is not set and a next character is present, the fgetc function obtains that character as an unsigned char converted to an int and advances the associated file position indicator for the stream (if defined).
The concern I have is that the following looks to be 2's complement dependent and implies that the ranges of unsigned char and char are equally wide. Both of these assumptions are nearly always true today.
buf = (char) ((c > CHAR_MAX) ? (c - (UCHAR_MAX + 1)) : c);
[Edit per OP comment]
Let's assume fgetc() returns no more distinct characters than fit in the range CHAR_MIN to CHAR_MAX. Then (c - (UCHAR_MAX + 1)) would be more portable if replaced with (c - (CHAR_MAX + 1) + CHAR_MIN), which maps CHAR_MAX + 1 to CHAR_MIN. We do not know that (c - (UCHAR_MAX + 1)) is in range when c is CHAR_MAX + 1.
A system could exist that has a signed char range of -127 to +127 and an unsigned char range of 0 to 255 (5.2.4.2.1). But as fgetc() gets a character, it seems the data must either all be unsigned char already or have already been limited to the smaller signed char range before being converted to unsigned char and returned to the user. OTOH, if fgetc() returned 256 different characters, conversion to a narrower-ranged signed char would not be portable regardless of formula.
Practically, it's simple - the obvious cast to char always works.
But you're asking about portability...
I can't see how a real portable solution could work.
This is because the guaranteed range of char is -127 to 127, which is only 255 different values. So how could you translate the 256 possible return values of fgetc (excluding EOF), to a char, without losing information?
The best I can think of is to use unsigned char and avoid char.
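A sketch of that suggestion, assuming the caller is happy to keep the data as unsigned char. The function name read_bytes is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Reads up to `cap` bytes from `stream` into `buf` and returns how many
 * were stored. Stops at end-of-file or on a read error. */
size_t read_bytes(FILE *stream, unsigned char *buf, size_t cap)
{
    size_t n = 0;
    int c;

    assert(stream != NULL && buf != NULL);
    while (n < cap && (c = fgetc(stream)) != EOF) {
        /* c is in 0..UCHAR_MAX here, so this conversion preserves it. */
        buf[n++] = (unsigned char)c;
    }
    return n;
}
```

No value is ever converted to a possibly signed char, so none of the implementation-defined behavior discussed in this thread is involved.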
With thanks to those who responded, and having now read relevant portions of the C99 standard, I have come to agree with the somewhat surprising conclusion that storing an arbitrary non-EOF value returned by fgetc() as type char without loss of fidelity is not guaranteed to be possible. In large part, that arises from the possibility that char cannot represent as many distinct values as unsigned char.
For their part, the stdio functions guarantee that if data are written to a (binary) stream and subsequently read back, then the read back data will compare equal to the original data. That turns out to have much narrower implications than I at first thought, but it does mean that fputs() must output a distinct value for each distinct char it successfully outputs, and that whatever conversion fgets() applies to store input bytes as type char must accurately reverse the conversion, if any, by which fputs() would produce the input byte as its output. As far as I can tell, however, fputs() and fgets() are permitted to fail on any input they don't like, so it is not certain that fputs() maps every possible char value to an unsigned char.
Moreover, although fputs() and fgets() operate as if by performing sequences of fputc() and fgetc() calls, respectively, it is not specified what conversions they might perform between char values in memory and the underlying unsigned char values on the stream. If a platform's fputs() uses standard integer conversion for that purpose, however, then the correct back-conversion is as I proposed:
int c = fgetc(stream);
char buf;
if (c >= 0) buf = (char) ((c > CHAR_MAX) ? (c - (UCHAR_MAX + 1)) : c);
That arises directly from the integer conversion rules, which specify that integer values are converted to unsigned types by adding or subtracting the integer multiple of <target type>_MAX + 1 needed to bring the result into the range of the target type, supported by the constraints on representation of integer types. Its correctness for that purpose does not depend on the specific representation of char values or on whether char is treated as signed or unsigned.
However, if char cannot represent as many distinct values as unsigned char, or if there are char values that fputs() refuses to output (e.g. negative ones), then there are possible values of c that could not have resulted from a char conversion in the first place. No back-conversion argument is applicable to such bytes, and there may not even be a meaningful sense of char values corresponding to them. In any case, whether the given conversion is the correct reverse conversion for data written by fputs() seems to be implementation-defined. It is certainly implementation-defined whether buf = (char) c will have the same effect, though it does on very many systems.
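The back-conversion argument can be checked on a given platform with a small sketch like the following. The names int_to_char and round_trip_ok are hypothetical, and the check only covers values that started out as char, which is exactly the case the argument applies to.

```c
#include <assert.h>
#include <limits.h>

/* The back-conversion proposed above, for c in 0..UCHAR_MAX. */
char int_to_char(int c)
{
    assert(c >= 0 && c <= UCHAR_MAX);
    return (char)((c > CHAR_MAX) ? (c - (UCHAR_MAX + 1)) : c);
}

/* Returns 1 if every char value survives char -> unsigned char -> char
 * using the formula above, 0 otherwise. */
int round_trip_ok(void)
{
    int v;

    for (v = CHAR_MIN; v <= CHAR_MAX; v++) {
        char original = (char)v;            /* v is in range: no surprises */
        int as_byte = (unsigned char)original; /* well-defined modulo step */

        if (int_to_char(as_byte) != original)
            return 0;
    }
    return 1;
}
```

The loop never feeds int_to_char a byte that could not have come from a char, so it says nothing about the problematic bytes described in the paragraph above.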
Overall, I am struck by just how many details of C I/O behavior are implementation defined. That was an eye-opener for me.
Best way to portably assign the result of fgetc() to a char in C
C2X is on the way
A sub-problem is saving an unsigned char value into a char, which may be signed. With 2's complement, that is not a problem.*1
On non-2's complement machines with signed char that do not support -0 *2, that is a problem. (I know of no such machines.)
In any case, with C2X, support for non-2's complement encoding is planned to be dropped, so as time goes on, we can eventually ignore non-2's complement issues and confidently use
int c = fgetc(stdin);
...
char buf = (c > CHAR_MAX) ? (char)(c - (UCHAR_MAX + 1)) : (char)c;
UCHAR_MAX > INT_MAX??
A 2nd portability issue not yet discussed arises when UCHAR_MAX > INT_MAX, for example when all integer types are 64-bit. Some graphics processors have used a common size for all integer types.
On such unicorn machines, if (c < 0) is insufficient. Could use:
int c = fgetc(stdin);
#if UCHAR_MAX <= INT_MAX
if (c < 0) {
#else
if (c == EOF && (feof(stdin) || ferror(stdin))) {
#endif
...
Pedantically, ferror(stdin) could be true due to a prior input function and not this one which returned UCHAR_MAX, but let us not go into that rabbit-hole.
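The same check can be wrapped in a function so that callers never compare against EOF themselves. This is only a sketch: read_byte is a hypothetical name, and it always uses the feof/ferror test rather than selecting it with #if.

```c
#include <assert.h>
#include <stdio.h>

/* Returns 1 and stores the byte in *out on success.
 * Returns 0 at end-of-file or on a read error. */
int read_byte(FILE *stream, unsigned char *out)
{
    int c;

    assert(stream != NULL && out != NULL);
    c = fgetc(stream);
    /* The extra feof/ferror test distinguishes a real EOF from a byte
     * whose int value happens to equal EOF when UCHAR_MAX > INT_MAX. */
    if (c == EOF && (feof(stream) || ferror(stream)))
        return 0;
    *out = (unsigned char)c;
    return 1;
}
```

On ordinary platforms the extra test costs little, and it has the same caveat about a stale error indicator as noted above.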
*1 In the case of int to signed char with c > CHAR_MAX, "Otherwise, the new type is signed and the value cannot be represented in it; either the result is implementation-defined or an implementation-defined signal is raised." applies. With 2's complement, this overwhelmingly maps [128 255] to [-128 -1].
*2 With non-2's complement and -0 support, the common mapping is that the least significant 8 bits remain the same. This does make for 2 zeros, yet proper handling of strings in <string.h> uses "For all functions in this subclause, each character shall be interpreted as if it had the type unsigned char (and therefore every possible object representation is valid and has a different value)." So -0 is not a null character, as that char is accessed as a non-zero unsigned char.
K&R provide this getchar() example:
int getchar(void)
{
char c;
return (read(0, &c, 1) == 1) ? (unsigned char) c : EOF;
}
c is cast to unsigned char here to avoid sign extension issues, but in the fputs() example...
int fputs(char *s, FILE *iop)
{
int c;
while (c = *s++)
putc(c, iop);
return ferror(iop) ? EOF : 0;
}
*s is assigned to an int without first casting to an unsigned char. Why is the cast unnecessary this time?
It is not about "sign extension issues". This implementation of getchar makes sure that all successfully read characters are returned as non-negative int values. This behavior is required by the specification of getchar, which literally says that the character read is returned as unsigned char values converted to int, even if char is signed on the given platform. What you see there is basically a direct implementation of getchar spec.
Meanwhile fputs does not return any specific character values. fputs does not return c to the user. That c is a purely internal variable. It should preserve the original value of char type on the given platform, since the value of c is then passed to putc. putc does not expect character values converted to non-negative range, it expects original character values, which could easily be negative if char is signed.
BTW, why did you look at fputs, and not fputc? If you look at fputc, which just like getchar returns a character value, you will probably see that it is implemented similarly to getchar in that regard.
I completely misunderstood the question the first time around. The problem here is that getchar() needs to return either a char in the entire range 0-255 or EOF. On most platforms EOF = -1. In order to return both a negative value and a char, int must be used.
This is not the case in fputs. In this example, a char is being assigned to an int in the while loop. "Lower" types are promoted to "higher" types. From page 44 KR, The C Programming Language:
If either operand is a long double, convert the other to a long double.
Otherwise, if either operand is a double, convert the other to a double.
Otherwise, if either operand is a float, convert the other to a float.
Otherwise, convert char and short to int
Then, if either operand is a long, convert the other to long.
According to the man page,
The fputc() function writes the character c (converted to an unsigned char) to the output stream pointed to by stream.
The cast is specifically performed for you inside the function.
Aside from that, converting a negative char to int and back to char is guaranteed to produce the original value, and converting a char to a negative int and then to unsigned char is guaranteed to give the same result as a direct cast from char to unsigned char. In other cases, converting an out-of-range value to a signed type gives an implementation-defined result or raises an implementation-defined signal. Most platforms handle that by quiet binary truncation, so many programmers never worry about it at all.
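The conversion performed inside fputc can be observed through its return value, which is the character written as an unsigned char converted to int. A minimal sketch, with the hypothetical helper name put_char_value:

```c
#include <assert.h>
#include <stdio.h>

/* Passes a plain char value to fputc the way K&R's fputs does and
 * returns what fputc reports: the byte written, as an unsigned char
 * converted to int, or EOF on error. */
int put_char_value(char ch, FILE *stream)
{
    int c = ch; /* may be negative if plain char is signed */

    assert(stream != NULL);
    return fputc(c, stream);
}
```

Whether c is negative or not, the value written and returned equals (unsigned char)ch, which is why no cast is needed at the call site.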
When you typecast from an int to a char, you are cutting down the number of bytes used from 4 to 1. How does it pick which byte it is going to use to make the char?
Does it take the most significant byte?
Or does it take the least significant?
Or is there some sort of rule I should know about?
C will take the least-significant byte when doing a narrowing conversion, so if you have the integer value 0xCAFEBABE and you convert it to a char, you'll get the value 0xBE.
Of course, there's no actual guarantee that an int is four bytes or that a char is one, but I'm pretty sure that the logic for doing the truncation will always be the same and will just drop the higher-order bits that don't fit into the char.
If char is signed, it's implementation-defined unless the original value already fits in the range of values for char. An implementation is completely free to generate nonsense (or raise a signal) if it doesn't fit. If char is unsigned (which the standard allows), then the value is reduced modulo 1<<CHAR_BIT (usually 256).
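For the unsigned case, the modulo rule can be shown with a short sketch. The function names low_byte and low_byte_demo are hypothetical; the value 0xCAFEBABE is the one used in the answer above.

```c
#include <assert.h>
#include <limits.h>

/* Well-defined narrowing: conversion to an unsigned type reduces the
 * value modulo UCHAR_MAX + 1, i.e. it keeps the low CHAR_BIT bits. */
unsigned char low_byte(unsigned long value)
{
    return (unsigned char)value;
}

/* Usage example: the result matches masking with UCHAR_MAX. */
void low_byte_demo(void)
{
    assert(low_byte(0xCAFEBABEul) == (0xCAFEBABEul & UCHAR_MAX));
}
```

When CHAR_BIT is 8, low_byte(0xCAFEBABEul) is 0xBE. Converting the same value to a signed char instead would fall under the implementation-defined case described above.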
From The C Programming Language (Brian W. Kernighan and Dennis M. Ritchie), section 2.7, Type Conversions, p. 43:
"There is one subtle point about the conversion of characters to integers. ... On some machines a char whose leftmost bit is 1 will be converted to a negative integer. On others, ... is always positive. For portability, specify signed or unsigned if non-character data is to be stored in char variables."
My questions are:
1) Why would anyone want to store non-char data in char? (an example where this is necessary will be real nice)
2) Why does integer value of char change when it is converted to int?
3) Can you elaborate more on this portability issue?
In regards to 1)
People often use char arrays when they really want a byte buffer for a data stream. It's not great practice, but plenty of projects do it, and if you're careful, no real harm is done. There are probably other times as well.
In regards to 2)
Signed integers are often sign extended when they are moved from a smaller data type. Thus
11111111b (-1 in base 10) becomes 11111111 11111111 11111111 11111111 when expanded to 32 bits. However, if the char was intended to be unsigned +255, then the signed integer may end up being -1.
About portability 3)
Some machines regard chars as signed integers, while others interpret them as unsigned. It could also vary based on compiler implementation. Most of the time you don't have to worry about it. Kernighan is just trying to help you understand the details.
Edit
I know this is a dead issue, but you can use the following code to check if char's on your system are signed or unsigned:
#include <limits.h> // Include implementation-specific constants (INT_MAX, etc.)
#if CHAR_MAX == SCHAR_MAX
// Plain "char" is signed
#else
// Plain "char" is unsigned
#endif
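The sign-extension effect from point 2) and the signedness check can also be expressed as ordinary functions. This is a sketch with hypothetical names; the comment about "all bits set" assumes two's complement.

```c
#include <assert.h>
#include <limits.h>

/* Returns 1 if plain char is signed on this implementation. */
int plain_char_is_signed(void)
{
    return CHAR_MIN < 0;
}

/* On two's complement, both objects below have all bits set, yet they
 * widen to different int values. */
int widen_signed(void)
{
    signed char sc = -1;
    return sc; /* sign-extended: stays -1 */
}

int widen_unsigned(void)
{
    unsigned char uc = UCHAR_MAX;
    return uc; /* zero-extended: stays UCHAR_MAX (assuming it fits in int) */
}

/* Usage example. */
void widening_demo(void)
{
    assert(widen_signed() == -1);
    assert(widen_unsigned() == UCHAR_MAX);
}
```

A plain char behaves like one or the other depending on what plain_char_is_signed reports, which is the portability issue K&R are pointing at.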
1) char is the size of a single byte in C, and is therefore used for storing any sort of data. For example, when loading an image into memory, the data is represented as an array of char. In modern code, typedefs such as uint8_t are used to indicate the purpose of a buffer more usefully than just char.
2 & 3) Whether or not char is signed or unsigned is platform dependent, so if a program depends on this behavior then it's best to specify one or the other explicitly.
The char type is defined to hold one byte, i.e. sizeof(char) is defined to be 1. This is useful for serializing data, for instance.
Whether char behaves like unsigned char or like signed char is implementation-defined (it is a distinct type with the same range and representation as one of them). Now imagine that char means smallint. You are simply converting a small integer to a larger integer when you go from smallint to int. The problem is, you don't know whether that smallint is signed or unsigned.
I would say it's not really a portability issue as long as you follow The Bible (K&R).
unsigned char is often used to process binary data one byte at a time. A common example is UTF-8 strings, which are not strictly made up of "chars."
If a signed char is 8 bits and the top bit is set, that indicates that it's negative. When this is converted to a larger type, the sign is kept by extending the high bit to the high bit of the new type. This is called a "sign-extended" assignment.
1) Char is implemented as one byte across all systems so it is consistent.
2) The bit mentioned in your question is the one used in single-byte integers for their signedness. When an int on a system is larger than one byte, the sign flag is not affected when you convert char to int; otherwise it is. (There are also signed and unsigned chars.)
3) Because of the consistency of the char implementation, lots of libraries use them, such as the Intel IPP (Intel Performance Primitives) libraries and their cousin OpenCV.
Usually, in C, char to int conversion and vice versa is an issue because the standard APIs for reading character input and writing character output use ints for the character arguments and return values. See getchar(), getc() and putchar() for example.
Also, since the size of a char is 1 byte, it is a convenient way to deal with arbitrary data as a byte stream.