Will a cast keep the low order bytes consistently across systems? - c

Say I have a unicode character in wchar_t x;
Of course, the obvious way to convert x to ASCII is use the wctob function
But I'm wondering, since the first 255 characters of Unicode correspond with ASCII, will a cast to char consistently work across platforms?
char c = (char) x; // cast to char, this works on Windows
The question is, will a cast to char guarantee to keep the LOW ORDER bits, or will it possibly keep the HIGH ORDER bits? (I'm concerned about a little-endian/big endian situation here, although I realize if it worked on my little endian system, it definitely should work on big endian systems).

For the sake of brevity, I use some terms loosely. To avoid much confusion, one is strongly advised to carefully study definitions of at least the following terms: ASCII, Unicode, UCS, UCS-2, UCS-4, UTF, UTF-8, UTF-16, UTF-32, character, character set, coded character set, repertoire, code unit.
The code of the character 'Q' is 81 in both ASCII and Unicode.
81 is just an integer, like any other integer. A char variable may store the number 81. A wchar_t variable may store the same number 81. We interpret 81 as 'Q' in both cases.
It does not make much sense to ask how the number 81 is preserved when cast from e.g. long to short. If it fits, you are all set. There's no endianness or higher bits or lower bits or any of this stuff involved.
When you convert files that store characters, or streams of bytes over a network, endianness and bits and stuff begin to matter, just like with files that store (binary representations of) any old numbers.
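A minimal sketch of the point being made, assuming the execution character set is ASCII-compatible (which is the common case):

#include <stdio.h>
#include <wchar.h>

int main(void)
{
    wchar_t x = L'Q';    /* code 81 in ASCII and in Unicode */
    char c = (char) x;   /* 81 fits in a char, so the value is preserved */

    printf("%d %c\n", (int) c, c);   /* prints: 81 Q */
    return 0;
}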

If x does not fit in a char, then the behavior is officially "implementation-defined" and is allowed to raise a signal. If x does fit in a char, then the value is preserved (regardless of endianness).
6.3.1.3 Signed and unsigned integers
(1) When a value with integer type is converted to another integer type other than _Bool, if the value can be represented by the new type, it is unchanged.
(2) [does not apply here]
(3) Otherwise, the new type is signed and the value cannot be represented in it; either the result is implementation-defined or an implementation-defined signal is raised.
For maximum portability, perform a range check first and cast only if the value is in the range SCHAR_MIN to SCHAR_MAX.
(Others have noted and I wish to repeat that ASCII extends only to character 127.)
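A hedged sketch of that range check, slightly more conservative than SCHAR_MIN..SCHAR_MAX since character codes are non-negative (the helper name narrow_wchar is just for illustration):

#include <limits.h>
#include <wchar.h>

/* Convert a wide character to char only when the value is known to fit.
   Returns 0 on success, -1 if the value is out of range.
   Checking against 0..127 instead would restrict the result to ASCII. */
int narrow_wchar(wchar_t x, char *out)
{
    if (x >= 0 && x <= SCHAR_MAX) {   /* fits in a char whether it is signed or not */
        *out = (char) x;              /* value preserved per 6.3.1.3 (1) */
        return 0;
    }
    return -1;                        /* does not fit; caller decides what to do */
}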

I was under the impression that the endianness of the system does not matter in this situation.
I found a really good explanation here.
I think this should help put your fears about casting to rest.

Related

The size of a char and an int in C

In the C programming language, if an int is 4 bytes and letters are represented in ASCII as a number (also an int), then why is a char 1 byte?
A char is one byte because the standard says so. But that's not really what you are asking. In terms of values, a signed char can hold from -128 to 127. Have a look at a table of ASCII character codes and you'll notice that their decimal values are between 0 and 127; hence, they fit in the non-negative values of a char. There are extended character sets that use unsigned char, with values from 0 to 255.
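A quick way to see what your own implementation does is to print the limits from <limits.h> (a sketch; CHAR_MIN will be 0 if plain char is unsigned, negative if it is signed):

#include <stdio.h>
#include <limits.h>

int main(void)
{
    printf("CHAR_MIN  = %d\n", CHAR_MIN);
    printf("CHAR_MAX  = %d\n", CHAR_MAX);
    printf("SCHAR_MIN = %d\n", SCHAR_MIN);
    printf("SCHAR_MAX = %d\n", SCHAR_MAX);
    printf("UCHAR_MAX = %d\n", UCHAR_MAX);
    return 0;
}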
6.2.5 Types
...
3 An object declared as type char is large enough to store any member of the basic execution character set. If a member of the basic execution character set is stored in a char object, its value is guaranteed to be nonnegative. If any other character is stored in a char object, the resulting value is implementation-defined but shall be within the range of values that can be represented in that type.
...
5 An object declared as type signed char occupies the same amount of storage as a ‘‘plain’’ char object. A ‘‘plain’’ int object has the natural size suggested by the architecture of the execution environment (large enough to contain any value in the range INT_MIN to INT_MAX as defined in the header <limits.h>).
C 2012 Online Draft
Type sizes are not defined in terms of bits, but in terms of the range of values that must be represented.
The basic execution character set consists of 96 or so characters (26 uppercase Latin characters, 26 lowercase Latin characters, 10 decimal digits, 29 graphical characters, space, vertical tab, horizontal tab, line feed, form feed); 8 bits is more than sufficient to represent those.
int, OTOH, must be able to represent a much wider range of values; the minimum range as specified in the standard is [-32767..32767]¹, although on most modern implementations it’s much wider.
¹ The standard doesn’t assume two’s complement representation of signed integers, which is why INT_MIN is -32767 and not -32768.
In the C language, a char usually has a size of 8 bits.
In all the compilers that I have seen (which are, admittedly, not very many), char is taken to be large enough to hold the ASCII character set (or the so-called “extended ASCII”) and the size of the char data type is 8 bits (this includes compilers on major desktop platforms and some embedded systems).
1 byte was sufficient to represent the whole character set.
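A small sketch that prints the sizes discussed above on whatever machine it runs on (the exact numbers beyond the guarantees are implementation-specific):

#include <stdio.h>
#include <limits.h>

int main(void)
{
    printf("CHAR_BIT     = %d\n", CHAR_BIT);        /* bits per byte, at least 8 */
    printf("sizeof(char) = %zu\n", sizeof(char));   /* always 1 by definition */
    printf("sizeof(int)  = %zu\n", sizeof(int));    /* commonly 4, but not guaranteed */
    printf("INT_MIN      = %d\n", INT_MIN);
    printf("INT_MAX      = %d\n", INT_MAX);
    return 0;
}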

assigning 128 to char variable in c

The output comes out to be the 32-bit 2's complement of 128, which is 4294967168. How?
#include <stdio.h>

int main()
{
    char a;
    a=128;
    if(a==-128)
    {
        printf("%u\n",a);
    }
    return 0;
}
Compiling your code with warnings turned on gives:
warning: overflow in conversion from 'int' to 'char' changes value from '128' to '-128' [-Woverflow]
which tells you that the assignment a=128; isn't well defined on your platform.
The standard says:
6.3.1.3 Signed and unsigned integers
1 When a value with integer type is converted to another integer type other than _Bool, if the value can be represented by the new type, it is unchanged.
2 Otherwise, if the new type is unsigned, the value is converted by repeatedly adding or subtracting one more than the maximum value that can be represented in the new type until the value is in the range of the new type.
3 Otherwise, the new type is signed and the value cannot be represented in it; either the result is implementation-defined or an implementation-defined signal is raised.
So we can't know what is going on as it depends on your system.
However, if we do some guessing (and note this is just a guess):
128 as 8 bits would be 0b1000.0000
so when you call printf, where the argument is converted to int, there will be a sign extension like:
0b1000.0000 ==> 0b1111.1111.1111.1111.1111.1111.1000.0000
which, printed as unsigned, represents the number 4294967168
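A sketch reproducing that guess, assuming (as on the asker's machine) a signed 8-bit char and a 32-bit two's-complement int:

#include <stdio.h>

int main(void)
{
    signed char a = -128;     /* bit pattern 0x80 */
    int promoted = a;         /* sign-extended to 0xFFFFFF80 */

    printf("as int:      %d\n", promoted);             /* -128 */
    printf("as unsigned: %u\n", (unsigned) promoted);  /* 4294967168 */
    return 0;
}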
The sequence of steps that got you there is something like this:
You assign 128 to a char.
On your implementation, char is signed char and has a maximum value of 127, so 128 overflows.
Your implementation interprets 128 as 0x80. It uses two’s-complement math, so (int8_t)0x80 represents (int8_t)-128.
For historical reasons (relating to the instruction sets of the DEC PDP minicomputers on which C was originally developed), C promotes signed types narrower than int to int in many contexts. That includes variadic arguments to functions such as printf(), which aren’t described by a prototype and so still get the default argument promotions inherited from K&R C.
On your implementation, int is 32 bits wide and also two’s-complement, so (int)-128 sign-extends to 0xFFFFFF80.
When you make a call like printf("%u", x), the runtime interprets the int argument as an unsigned int.
As an unsigned 32-bit integer, 0xFFFFFF80 represents 4,294,967,168.
The "%u\n" format specifier prints this out without commas (or other separators) followed by a newline.
This is all legal, but so are many other possible results. The code is buggy and not portable.
Make sure you don’t overflow the range of your type! (Or if that’s unavoidable, overflow for unsigned scalars is defined as modular arithmetic, so it’s better-behaved.) The workaround here is to use unsigned char, which has a range from 0 to (at least) 255, instead of char.
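A minimal sketch of that workaround; with unsigned char the value 128 fits and the behavior is fully defined:

#include <stdio.h>

int main(void)
{
    unsigned char a = 128;           /* within 0..UCHAR_MAX, no overflow */

    printf("%u\n", (unsigned) a);    /* prints 128 */
    return 0;
}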
First of all, as I hope you understand, the code you've posted is full of errors, and you would not want to depend on its output. If you were trying to perform any of these manipulations in a real program, you would want to do so in some other, more well-defined, more portable way.
So I assume you're asking only out of curiosity, and I answer in the same spirit.
Type char on your machine is probably a signed 8-bit quantity. So its range is from -128 to +127. So +128 won't fit.
When you try to jam the value +128 into a signed 8-bit quantity, you probably end up with the value -128 instead. And that seems to be what's happening for you, based on the fact that your if statement is evidently succeeding.
So next we try to take the value -128 and print it as if it were an unsigned int, which on your machine is evidently a 32-bit type. It can hold numbers in the range 0 to 4294967295, which obviously does not include -128. But unsigned integers typically behave pretty nicely modulo their range, so if we add 4294967296 to -128 we get 4294967168, which is precisely the number you saw.
Now that we've worked through this, let's resolve in future not to jam numbers that won't fit into char variables, or to print signed quantities with the %u format specifier.

How is a char value stored in memory, on little endian as well as big endian machines?

I'm trying to print the output for the following code in the C language, but I don't understand how it outputs -80 on a little endian machine.
char d = 1200;
printf ("%d ", d);
Endianness has nothing to do with it. I assume you have an 8-bit char. So you try to assign an integer constant to a char that doesn't fit: 1200 is 0x4b0. C converts it by cutting off the most significant bits; the result is 0xb0. *)
Interpreted as a signed number, it's negative (most significant bit set) with a magnitude of 80 (inverting the bits gives 0x4f, adding one for two's complement gives 0x50, which is 80 in decimal), so the value is -80.
*) Note that this out-of-range signed conversion is actually implementation-defined, so you could get a different result (or a signal). I just describe what probably happens here, assuming your machine uses two's complement for negative numbers. The unsigned conversion would be fully defined, by cutting off the most significant bits, and again, this isn't a matter of endianness.
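The arithmetic described above, written out explicitly (this only illustrates the common implementation-defined outcome, assuming an 8-bit char and two's complement):

#include <stdio.h>

int main(void)
{
    int original = 1200;                        /* 0x4B0 */
    unsigned char low = original & 0xFF;        /* keep the low 8 bits: 0xB0 = 176 */
    signed char as_signed = (signed char) low;  /* 176 does not fit; typically wraps to -80 */

    printf("low byte  = 0x%X (%u)\n", (unsigned) low, (unsigned) low);
    printf("as signed = %d\n", as_signed);      /* -80 on the asker's machine */
    return 0;
}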
Why it specifically becomes -80 I have no idea.
However, the error occurs because 1200 is too large for the char datatype. It is one of the few C primitives that is guaranteed to be a single size, 1 byte. This also means you need not worry about endianness.
As such, the valid range (for a signed 8-bit char) is -128 to 127. Attempts to assign values outside that range are 'implementation-defined', according to the standard. Which is to say, the compiler can do what it feels like (as long as it documents it).
Section 6.3.1.3 of the ISO/IEC 9899:2011 standard:
6.3.1.3 Signed and unsigned integers
When a value with integer type is converted to another integer type [and] the new type is signed and the value cannot be represented in it, [then] the result is implementation-defined.

When typecasting to a char in C, which bytes are used to make the character?

When you typecast from an int to a char, you are cutting down the number of bytes used from 4 to 1. How does it pick which byte it is going to use to make the char?
Does it take the most significant byte?
Or does it take the least significant?
Or is there some sort of rule I should know about?
C will take the least-significant byte when doing a narrowing conversion, so if you have the integer value 0xCAFEBABE and you convert it to a char, you'll get the value 0xBE.
Of course, there's no actual guarantee that an int is four bytes or that a char is one, but I'm pretty sure that the logic for doing the truncation will always be the same and will just drop the higher-order bits that don't fit into the char.
If char is signed, it's implementation-defined unless the original value already fits in the range of values for char. An implementation is completely free to generate nonsense (or raise a signal) if it doesn't fit. If char is unsigned (which the standard allows), then the value is reduced modulo 1<<CHAR_BIT (usually 256).
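A sketch of the truncation described in these answers (0xCAFEBABE assumes an int of at least 32 bits; the signed result is only the typical outcome):

#include <stdio.h>

int main(void)
{
    unsigned int value = 0xCAFEBABEu;

    unsigned char uc = (unsigned char) value;   /* reduced modulo 256: 0xBE, always defined */
    signed char   sc = (signed char) value;     /* implementation-defined; typically -66 (0xBE) */

    printf("unsigned char: 0x%X\n", (unsigned) uc);   /* 0xBE */
    printf("signed char:   %d\n", sc);                /* commonly -66 */
    return 0;
}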

C: char to int conversion

From The C Programming Language (Brian W. Kernighan), 2.7 TYPE CONVERSIONS, pg 43:
"There is one subtle point about the conversion of characters to integers. ... On some machines a char whose leftmost bit is 1 will be converted to a negative integer. On others, ... is always positive. For portability, specify signed or unsigned if non-character data is to be stored in char variables."
My questions are:
1. Why would anyone want to store non-char data in char? (an example where this is necessary would be really nice)
2. Why does the integer value of a char change when it is converted to int?
3. Can you elaborate more on this portability issue?
In regards to 1)
People often use char arrays when they really want a byte buffer for a data stream. It's not great practice, but plenty of projects do it, and if you're careful, no real harm is done. There are probably other times as well.
In regards to 2)
Signed integers are often sign extended when they are moved from a smaller data type. Thus
11111111b (-1 in base 10) becomes 11111111 11111111 11111111 11111111 when expanded to 32 bits. However, if the char was intended to be unsigned +255, then the signed integer may end up being -1.
About portability 3)
Some machines regard chars as signed integers, while others interpret them as unsigned. It could also vary based on compiler implementation. Most of the time you don't have to worry about it. Kernighan is just trying to help you understand the details.
Edit
I know this is a dead issue, but you can use the following code to check if chars on your system are signed or unsigned:
#include <limits.h> // Implementation-specific limits (CHAR_MAX, SCHAR_MAX, etc.)
#if CHAR_MAX == SCHAR_MAX
// Plain "char" is signed
#else
// Plain "char" is unsigned
#endif
1) char is the size of a single byte in C, and is therefore used for storing any sort of data. For example, when loading an image into memory, the data is represented as an array of char. In modern code, typedefs such as uint8_t are used to indicate the purpose of a buffer more usefully than just char.
2 & 3) Whether or not char is signed or unsigned is platform dependent, so if a program depends on this behavior then it's best to specify one or the other explicitly.
The char type is defined to hold one byte, i.e. sizeof(char) is defined to be 1. This is useful for serializing data, for instance.
Whether plain char behaves like unsigned char or signed char is implementation-defined. Now imagine that char means smallint. You are simply converting a small integer to a larger integer when you go from smallint to int. The problem is, you don't know whether that smallint is signed or unsigned.
I would say it's not really a portability issue as long as you follow The Bible (K&R).
unsigned char is often used to process binary data one byte at a time. A common example is UTF-8 strings, which are not strictly made up of "chars."
If a signed char is 8 bits and the top bit is set, that indicates that it's negative. When this is converted to a larger type, the sign is kept by extending the high bit to the high bit of the new type. This is called a "sign-extended" assignment.
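A short sketch contrasting the two cases when widening to int (the signed-char result assumes the usual two's-complement sign extension):

#include <stdio.h>

int main(void)
{
    signed char   sc = (signed char) 0xFF;   /* all bits set: value -1 on typical machines */
    unsigned char uc = 0xFF;                 /* value 255 */

    int from_signed   = sc;   /* sign-extended: -1 */
    int from_unsigned = uc;   /* zero-extended: 255 */

    printf("%d %d\n", from_signed, from_unsigned);   /* prints: -1 255 */
    return 0;
}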
1) char is one byte across all systems, so it is consistent.
2) The bit mentioned in your question is the one that single-byte integers use for their signedness. When an int on a system is larger than one byte, the sign flag is not affected when you convert char to int; otherwise it is. (There are also signed and unsigned chars.)
3) Because of the consistency of the char implementation, lots of libraries use them, such as the Intel IPP (Intel Performance Primitives) libraries and their cousin OpenCV.
Usually, in C, char to int conversion and vice versa is an issue because the standard APIs for reading character input/writing character output use int for the character arguments and return values. See getchar(), getc() and putchar() for example.
Also, since the size of a char is 1 byte, it is a convenient way to deal with arbitrary data as a byte stream.
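A sketch of why those APIs use int: getchar() has to return a value that is distinct from every possible byte, namely EOF, so its result must be stored in an int before the comparison:

#include <stdio.h>

int main(void)
{
    int c;   /* int, not char, so EOF can be told apart from every byte value */

    while ((c = getchar()) != EOF) {
        putchar(c);   /* putchar likewise takes and returns int */
    }
    return 0;
}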

Resources