Why are 4 characters allowed in a char variable? [duplicate] - c

This question already has answers here:
How to determine the result of assigning multi-character char constant to a char variable?
(5 answers)
Closed 9 years ago.
I have the following code in my program:
char ch='abcd';
printf("%c",ch);
The output is d.
I fail to understand why a char variable is allowed to take four characters in its declaration without giving a compile-time error.
Note: more than 4 characters gives an error.

'abcd' is called a multicharacter constant, and it has an implementation-defined value; here your compiler gives you 'd'.
If you use gcc and compile your code with -Wmultichar or -Wall, gcc will warn you about this.

I fail to understand why is a char variable allowed to take in 4
characters in its declaration without giving a compile time error.
It's not packing 4 characters into one char. The multi-character constant 'abcd' has type int, and the compiler then converts that int value to char (which cannot hold the whole value, so it is narrowed).
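A quick way to see this (a minimal sketch; the exact values printed are implementation-defined, and gcc will warn with -Wmultichar) is to compare the sizes:
#include <stdio.h>

int main(void)
{
    printf("sizeof('a')    = %zu\n", sizeof('a'));    /* character constants have type int in C */
    printf("sizeof('abcd') = %zu\n", sizeof('abcd')); /* typically 4: still just an int */
    char ch = 'abcd';                                 /* the int is narrowed to char here */
    printf("ch = %c\n", ch);                          /* implementation-defined; 'd' on OP's compiler */
    return 0;
}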

Assuming you know that you are using a multi-char constant, and what it is.
I don't use VS these days, but my take on it is that the 4-char multi-char constant is packed into an int and then converted down to a char. That is why it is allowed. Since the packing order of a multi-char constant into an integer type is compiler-defined, it can behave the way you observe.
Because multi-character constants are meant to be used to fill integer types, you could try an 8-byte multi-char constant. I am not sure whether the VS compiler supports it, but there is a good chance it does, because that would fit into a 64-bit long type.
It probably should give a warning about trying to fit a literal value too big for the type. It's kind of like unsigned char leet = 1337;. I am not sure, however, how this works in VS (whether it fires a warning or an error).
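As a rough illustration of that analogy (a sketch only; the exact diagnostic varies by compiler), the initializer below is an int that is too wide for the target type and gets silently narrowed:
#include <stdio.h>

int main(void)
{
    unsigned char leet = 1337;     /* 1337 is 0x539; only the low byte 0x39 survives */
    printf("%d %c\n", leet, leet); /* typically prints "57 9"; gcc warns with -Woverflow */
    return 0;
}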

4 characters are not being put into a char variable, but into an int character constant which is then assigned to a char.
3 parts of the C standard (C11dr §6.4.4.4) may help:
"An integer character constant is a sequence of one or more multibyte characters enclosed in single-quotes, as in 'x'."
"An integer character constant has type int."
"The value of an integer character constant containing more than one character (e.g., 'ab'), or containing a character or escape sequence that does not map to a single-byte execution character, is implementation-defined."
OP's code of char ch='abcd'; is the assignment of an int to a char, as 'abcd' is an int. Just like char ch='Z';, ch is assigned the int value of 'Z'. In that case, there is no surprise, as the value of 'Z' fits nicely in a char. In the 'abcd' case, the value does not fit in a char and so some information is lost. Various outcomes are possible. Typically on one endian platform, ch will have the value of 'a' and on another, the value of 'd'.
The 'abcd' is an int value, much like 12345 in int x = 12345;.
When sizeof(int) == 4, an int may be assigned a character constant such as 'abcd'.
When sizeof(int) != 4, the limit changes. So with an 8-char int, int x = 'abcdefgh'; is possible, etc.
Given that an int is only guaranteed to have a minimum range of -32767 to 32767, anything beyond 2 characters is non-portable.
The endianness of even int x = 'ab'; presents concerns.
Character constants like 'abcd' are typically used incorrectly, and thus many compilers have a warning, good to enable, that flags this uncommon C construct.
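A minimal sketch of the points above (assuming ASCII; the hex value and the character kept in ch are implementation-defined):
#include <stdio.h>

int main(void)
{
    int  v  = 'abcd';  /* a plain int value, e.g. 0x61626364 on many compilers */
    char ch = 'abcd';  /* narrowed to char; often 'a' or 'd', depending on the compiler */
    printf("v  = 0x%x\n", (unsigned)v);
    printf("ch = %c\n", ch);
    return 0;
}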

Related

Using int to print character constants [duplicate]

This question already has answers here:
Multi-character constant warnings
(6 answers)
Print decimal value of a char
(5 answers)
Closed 5 years ago.
I wrote the following program,
#include<stdio.h>
int main(void)
{
int i='A';
printf("i=%c",i);
return 0;
}
and I got the result as,
i=A
So I tried another program,
#include<stdio.h>
int main(void)
{
int i='ABC';
printf("i=%c",i);
return 0;
}
As I understand it, since 32 bits are used to store an int value and each of 'A', 'B' and 'C' has an 8-bit ASCII code, totalling 24 bits, those 24 bits should be stored in a 32-bit unit. So I expected the output to be,
i=ABC
but the output instead was
i=C
and I can't understand why?
'ABC' in this case is an integer character constant, as per section 6.4.4.4 ¶10 of the standard.
An integer character constant has type int. The value of an integer
character constant containing a single character that maps to a
single-byte execution character is the numerical value of the
representation of the mapped character interpreted as an integer. The
value of an integer character constant containing more than one
character (e.g.,'ab'), or containing a character or escape sequence
that does not map to a single-byte execution character, is
implementation-defined. If an integer character constant contains a
single character or escape sequence, its value is the one that results
when an object with type char whose value is that of the single
character or escape sequence is converted to type int.
In this case, 'A'==0x41, 'B'==0x42, 'C'==0x43, and your compiler then interprets i to be 0x414243. As said in the other answer, this value is implementation dependent.
When you try to access it using %c, the high-order part is cut off and you are left with only 0x43, which is 'C'.
For more insight into it, read the answers to this question as well.
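A small sketch for inspecting what actually ended up in i (assuming ASCII and a 32-bit int; the exact value is implementation-defined):
#include <stdio.h>

int main(void)
{
    int i = 'ABC';
    printf("i = 0x%x\n", (unsigned)i); /* e.g. 0x414243 */
    printf("i = %c\n", i);             /* %c keeps only the low-order byte, e.g. 'C' */
    return 0;
}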
The conversion specifier c used in this call
printf("i=%c",i);
in fact extracts one character from the integer argument. So with this specifier you cannot, in any case, get three characters as the output.
From the C Standard (7.21.6.1 The fprintf function)
c If no l length modifier is present, the int argument is converted to
an unsigned char, and the resulting character is written
Take into account that the internal representation of a multi-byte character constant is implementation defined. From the C Standard (6.4.4.4 Character constants)
...The value of an integer character constant containing more than one character (e.g., 'ab'), or containing a character or escape
sequence that does not map to a single-byte execution character, is
implementation-defined.
'ABC' is an integer character constant. Depending on the code set (overwhelmingly it is ASCII), endianness, and int width (apparently 32 bits in OP's case), it may have a value like one of those below. It is implementation-defined behavior.
'ABC'
0x41424300
0x434241
or others.
The "%c" directs printf() to take the int value, cast it to unsigned char and print the associated character. This is the main reason for apparent loss of information.
In OP's case, it appears that i took on the value of 0x434241.
int i='A';
printf("i=%c",i); --> 'A'
// same as
printf("i=%c",0x434241); --> 'A'
If you want i to contain 3 characters, you need to initialize an array that contains 3 characters:
char i[3];
i[0]= 'A';
i[1]= 'B';
i[2]='C';
The ' ' can contain only one char. Your code converts the integer i into a character, or rather, you store a converted 8-bit character in your 32-bit integer. But I think you want to separate the 32 bits into 8-bit containers, so make a char array like char i[3]. And then you will see that
int j=i;
will produce a compiler diagnostic, because you cannot convert a char array into an integer like that.
In C, 'A' is an int constant that's guaranteed to fit into a char.
'ABC' is a multicharacter constant. It has type int, but an implementation-defined value. Printing it with %c in printf shows only that value converted to unsigned char, i.e. at most one byte of it.

C Language: Why can an int variable store a char?

I am recently reading The C Programming Language by Kernighan.
There is an example which defined a variable as int type but using getchar() to store in it.
int x;
x = getchar();
Why can we store char data in an int variable?
The only thing that I can think about is ASCII and UNICODE.
Am I right?
The getchar function (and similar character input functions) returns an int because of EOF. There are cases when (char) EOF != EOF (like when char is an unsigned type).
Also, in many places where one uses a char variable, it will silently be promoted to int anyway. And that includes character constants like 'A'.
getchar() attempts to read a byte from the standard input stream. The return value can be any possible value of the type unsigned char (from 0 to UCHAR_MAX), or the special value EOF which is specified to be negative.
On most current systems, UCHAR_MAX is 255 as bytes have 8 bits, and EOF is defined as -1, but the C Standard does not guarantee this: some systems have larger unsigned char types (9 bits, 16 bits...) and it is possible, although I have never seen it, that EOF be defined as another negative value.
Storing the return value of getchar() (or getc(fp)) to a char would prevent proper detection of end of file. Consider these cases (on common systems):
if char is an 8-bit signed type, a byte value of 255, which is the character ÿ in the ISO8859-1 character set, has the value -1 when converted to a char. Comparing this char to EOF will yield a false positive.
if char is unsigned, converting EOF to char will produce the value 255, which is different from EOF, preventing the detection of end of file.
These are the reasons for storing the return value of getchar() into an int variable. This value can later be converted to a char, once the test for end of file has failed.
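This is the usual idiom; a minimal sketch:
#include <stdio.h>

int main(void)
{
    int c;                           /* int, not char, so EOF stays distinguishable */
    while ((c = getchar()) != EOF) { /* test against EOF before narrowing */
        putchar(c);                  /* at this point c holds a valid character value */
    }
    return 0;
}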
Storing an int to a char has implementation defined behavior if the char type is signed and the value of the int is outside the range of the char type. This is a technical problem, which should have mandated the char type to be unsigned, but the C Standard allowed for many existing implementations where the char type was signed. It would take a vicious implementation to have unexpected behavior for this simple conversion.
The value of the char does indeed depend on the execution character set. Most current systems use ASCII or some extension of ASCII such as ISO8859-x, UTF-8, etc. But the C Standard supports other character sets such as EBCDIC, where the lowercase letters do not form a contiguous range.
getchar is an old C standard function and the philosophy back then was closer to how the language gets translated to assembly than type correctness and readability. Keep in mind that compilers were not optimizing code as much as they are today. In C, int is the default return type (i.e. if you don't have a declaration of a function in C, compilers will assume that it returns int), and returning a value is done using a register - therefore returning a char instead of an int actually generates additional implicit code to mask out the extra bytes of your value. Thus, many old C functions prefer to return int.
C requires int be at least as many bits as char. Therefore, int can store the same values as char (allowing for signed/unsigned differences). In most cases, int is a lot larger than char.
char is an integer type that is intended to store a character code from the implementation-defined character set, which is required to be compatible with C's abstract basic character set. (ASCII qualifies, so do the source-charset and execution-charset allowed by your compiler, including the one you are actually using.)
For the sizes and ranges of the integer types (char included), see your <limits.h>; a small sketch that prints some of them follows below.
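A sketch that prints a few of those limits (the values vary by platform):
#include <stdio.h>
#include <limits.h>

int main(void)
{
    printf("CHAR_BIT = %d\n", CHAR_BIT);
    printf("CHAR_MIN = %d, CHAR_MAX = %d\n", CHAR_MIN, CHAR_MAX);
    printf("INT_MIN  = %d, INT_MAX  = %d\n", INT_MIN, INT_MAX);
    return 0;
}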
C was designed as a very low-level language, so it is close to the hardware. Usually, after a bit of experience, you can predict how the compiler will allocate memory, and even pretty accurately what the machine code will look like.
Your intuition is right: it goes back to ASCII. ASCII is really a simple 1:1 mapping from letters (which make sense in human language) to integer values (that can be worked with by hardware); for every letter there is a unique integer. For example, the 'letter' CTRL-A is represented by the decimal number 1. (For historical reasons, lots of control characters came first, so CTRL-G, which rang the bell on an old teletype terminal, is ASCII code 7. Upper-case 'A' and the 25 remaining UC letters start at 65, and so on. See http://www.asciitable.com/ for a full list.)
C lets you 'coerce' variables into other types. In other words, the compiler cares about (1) the size, in memory, of the var (see 'pointer arithmetic' in K&R), and (2) what operations you can do on it.
In fact you can do arithmetic on a char, because it is promoted to int in expressions anyway. So, to convert all LC letters to UC, you can do something like:
char letter;
/* ... */
if (letter >= 'a' && letter <= 'z') {  /* letter is lower-case */
    letter = (char)(letter - 32);      /* 'a' - 'A' == 32 in ASCII */
}
The char is promoted to int automatically before the subtraction, so no explicit cast is needed; the result is narrowed back to char when it is stored.
But, in the end, the type char is really just another small integer type, since ASCII assigns a unique integer to each letter.
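For what it's worth, the portable way to do the same conversion is <ctype.h>, which also works with character sets other than ASCII; a minimal sketch:
#include <stdio.h>
#include <ctype.h>

int main(void)
{
    char letter = 'q';
    letter = (char)toupper((unsigned char)letter); /* toupper takes and returns int */
    printf("%c\n", letter);                        /* prints Q */
    return 0;
}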

unable to get char of 2 character long in C [duplicate]

This question already has answers here:
How to determine the result of assigning multi-character char constant to a char variable?
(5 answers)
Closed 8 years ago.
The following statement in C gives no error:
char p='-1';
but the following gives an error:
char p='-12';
ERROR: character can be one or two characters long.
I never knew that a char in C could ever be two characters long. However, printf("%c",p) gives - as output. Where can I use char in C?
In C, a character constant like 'A' does not have type char, but rather type int. This creates the possibility that, even on a system where char is only 8 bits wide (and so int is wider than char), character constant notations can exist which provide integer values wider than char.
The C standard requires implementations to support multi-character constants, but their values are implementation-defined.
Why your compiler allows only two characters is likely because the type int is only 16 bits wide. Perhaps a constant like 'AB' is encoded similarly to, say, the expression ('A' << 8 | 'B'). According to the obvious extension of this scheme, 'ABC' would then have to be ('A' << 16 | 'B' << 8 | 'C') which doesn't fit into 16 bits and calls for out-of-range shifts. Hence, the two character limit.
In the GNU C compiler, four characters can be used:
#include <stdio.h>
int main(void)
{
printf("%x\n", (unsigned) 'ABCD');
return 0;
}
int is 32 bits wide, and this program prints 41424344 which, by golly, is hexadecimal for the ASCII characters ABCD. So this feature is useful for int-wide magic constants which are readable. Instead of:
#define MAGIC 0x41424344 /* This spells ABCD; easy to spot in memory dumps */
You can do this, which is nice, but less portable:
#define MAGIC 'ABCD'
What if we use five or more characters, like 'ABCDE'? Then GCC responds similarly to how Turbo C++ responds for three or more:
test.c:5:35: warning: character constant too long for its type [enabled by default]
It so happens that the program still compiles, and its output is unchanged: the E was truncated.
There is an important difference. The old Borland compiler is rejecting the excessively-long constant as an error. Though that is probably a good idea, it is not standard-conforming; when some value is implementation-defined, the implementation's response cannot be failure, such as stopping the translation or execution of the program. Issuing a diagnostic is fine, of course.
char p='-517';
printf("%c\n", p);
Running the above code gave me output 7 and a warning: overflow in implicit constant conversion [-Woverflow]
A char cannot contain more than 1 byte of information.
You want an array of characters, also known as a C-string
// Note, if you initialize a character array with a literal string
// there is no need for a size specifier
char c[] = "-12";
// Note this is a method of copying one character array into another.
#include <string.h>
char c[4];
strcpy(c, "-12");
You'll notice that char c[4] has an indicated size of 4. Meaning, the array can only hold 4 characters. In C, character arrays have a special property: A null terminator (char '\0') is a sentinel value that C-string functions use to recognize the end of your string. So, in reality, a character string "-12" is of size 4. '-', '1', '2', and '\0'.
You can also access individual elements of an array by indexing it with [].
printf("%s\n", c);
printf("%c\n", c[0]);
Notice the c[0] expression; this accesses the character '-' of the string "-12".
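Putting the snippets above together into one complete program (a sketch using the same names as above):
#include <stdio.h>
#include <string.h>

int main(void)
{
    char c[4];            /* room for '-', '1', '2' and the terminating '\0' */
    strcpy(c, "-12");
    printf("%s\n", c);    /* prints -12 */
    printf("%c\n", c[0]); /* prints - */
    return 0;
}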
Hope I helped.

character type int

A character constant has type int in C.
Now suppose my machine's local character set is Windows Latin-1 ( http://www.ascii-code.com/) which is a 256 character set so every char between single quotes, like 'x', is mapped to an int value between 0 and 255 right ?
Suppose plain char is signed on my machine and consider the following code:
char ch = 'â';
if(ch == 'â')
{
printf("ok");
}
Because of the integer promotion, ch will be promoted to a negative quantity of type int (since its sign bit is set), while 'â' is mapped to a positive quantity, so "ok" will not be printed.
But I'm sure I'm missing something; can you help?
Your C implementation has a notion of an execution character set. Moreover, if your program source code is read from a file (as it always is), the compiler has (or should have) a notion of a source character set. For example, in GCC you can tune those parameters on the command line. The combination of those two settings determines the integral value that is assigned to your literal â.
Actually, the initial assignment will not work as expected:
char ch = 'â';
There's an overflow here, and gcc will warn about it. Technically, the result is implementation-defined, although for the very common single-byte char type the behavior is predictable enough -- it's a simple integer narrowing. Depending on your default character set, that's a multibyte character; I get decimal 50082 if I print it as an integer on my machine.
Furthermore, the comparison is invalid, again because char is too small to hold the value being compared, and again, a good compiler will warn about it.
ISO C defines wchar_t, a type wide enough to hold extended (i.e., non-ASCII) characters, along with wide character versions of many library functions. Code that must deal with non-ASCII text should use this wide character type as a matter of course.
In a case where char is signed:
When processing char ch = 'â', the compiler will convert â to 0xFFFFFFE2, and store 0xE2 in ch. There is no overflow, as the value is signed.
When processing if(ch == 'â'), the compiler will extend ch (0xE2) to integer (0xFFFFFFE2) and compare it to 'â' (0xFFFFFFE2 also), so the condition will be true.
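A sketch of that case (assuming a Latin-1 execution character set where 'â' has code 0xE2 and plain char is signed; the narrowing conversion is implementation-defined but typically wraps as shown):
#include <stdio.h>

int main(void)
{
    signed char ch = (signed char)0xE2;            /* typically stored as -30 */
    int promoted = ch;                             /* sign-extended: 0xFFFFFFE2 */
    printf("%d 0x%x\n", promoted, (unsigned)promoted);
    printf("%d\n", promoted == (signed char)0xE2); /* both sides promote the same way: prints 1 */
    return 0;
}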

Question regarding C argument promotions [closed]

It's difficult to tell what is being asked here. This question is ambiguous, vague, incomplete, overly broad, or rhetorical and cannot be reasonably answered in its current form. For help clarifying this question so that it can be reopened, visit the help center.
Closed 11 years ago.
Alright, actually I've been studying how to use loops so that I can run a particular block of code repeatedly without typing it over and over again, and after attempting to use what I've learned so far to program something, I feel it's time for me to proceed to the next chapter and learn how to use control statements to instruct the program to make decisions.
But before I advance to that, I still have a few questions about the earlier material that need an expert's help. They are about data types.
A. Character Type
I extract the following from the book C primer Plus 5th ed:
Somewhat oddly, C treats character constants as type int rather than char. For example, on an ASCII system with a 32-bit int and an 8-bit char, the code:
char grade = 'B';
represents 'B' as the numerical value 66 stored in a 32-bit unit; grade winds up with 66 stored in an 8-bit unit. This characteristic of character constants makes it possible to define a character constant such as 'FATE', with four separate 8-bit ASCII codes stored in a 32-bit unit. However, attempting to assign such a character constant to a char variable results in only the last 8 bits being used, so the variable gets the value 'E'.
So the next thing I did after reading this was, of course, to follow what it mentions: I tried storing the word FATE in a char grade variable, compiled it, and printed what was stored using printf(). But instead of getting the character 'E' printed out, what I got was 'F'.
Does this mean there's some mistake in the book? Or is there something I misunderstood?
From the above sentences, there's a line that says C treats character constants as type int. So to try it out, I assigned a number bigger than 255 (e.g. 356) to a char variable.
Since 356 is within the range of a 32-bit int (I'm running Windows 7), I expected it to print 356 when I used the %d specifier.
But instead of printing 356, it gives me 100, which is the value of the last 8 bits.
Why does this happen? I thought char == int == 32 bits? (Although the book does mention earlier that char is only a byte.)
B. Int and Floating Type
I understand that when a value stored in a variable of short type is passed to a variadic function or to any function without a prototype, it is automatically promoted to int.
The same happens to floating-point types: when a floating-point number of float type is passed, it is converted to double, which is why there is no specifier for the float type, only %f for double and %Lf for long double.
But why is there a specifier for the short type, although it is also promoted, and not for the float type? Why don't they just give float a specifier with a modifier like %hf or something? Is there anything logical or technical behind this?
A lot of questions in one question... Here are answers to a couple:
This characteristic of character constants makes it possible to define a character constant such as 'FATE', with four separate 8-bit ASCII codes stored in a 32-bit unit. However, attempting to assign such a character constant to a char variable results in only the last 8 bits being used, so the variable gets the value 'E'.
This is actually implementation defined behavior. So yes, there's a mistake in the book. Many books on C are written with the assumption that the only C compiler in the world is the one the author used when testing the examples.
The compiler the author used treated the characters in 'FATE' as the bytes of an integer, with 'F' being the most significant byte and 'E' being the least significant. Your compiler treats the characters in the literal as bytes of an integer with 'F' being the least significant byte and 'E' the most significant. For example, the first method is how MSVC treats the value, while MinGW (a GCC compiler targeting Windows) treats the literal in the second way.
As far as there being no format specifier for printf() that expects float, only specifiers that expect double: this is because the values passed to printf() for formatting are part of the variable argument list (the ... in printf()'s prototype). There is no type information about these arguments, so as you mentioned, the compiler must always promote them (from C99 6.5.2.2/6 "Function calls"):
If the expression that denotes the called function has a type that does not include a prototype, the integer promotions are performed on each argument, and arguments that have type float are promoted to double. These are called the default argument promotions.
And C99 6.5.2.2/7 "Function calls"
The ellipsis notation in a function prototype declarator causes argument type conversion to stop after the last declared parameter. The default argument promotions are performed on trailing arguments.
So in effect, it's impossible to pass a float to printf() - it will always be promoted to a double. That's why the format specifiers for floating point values expect a double.
Also due to the automatic promotion that would be applied to short, I'm honestly not sure if the h specifier for formatting a short is strictly necessary (though it is necessary for use with the n specifier if you want to get the count of characters written to the stream placed in a short). It might be in C because it needs to be there to support the n specifier, historical reasons, or something that I'm just not thinking of.
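A minimal sketch of the default argument promotions in a variadic call (the variable names are only illustrative):
#include <stdio.h>

int main(void)
{
    float f = 3.5f;     /* promoted to double when passed to printf */
    short s = 1234;     /* promoted to int when passed to printf */
    printf("%f\n", f);  /* %f expects double; fine because of the promotion */
    printf("%hd\n", s); /* %hd: the int argument is converted back to short before printing */
    return 0;
}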
First, a char is by definition exactly 1 byte wide. Then the standard more or less says that the sizes should be:
sizeof(char) <= sizeof(short) <= sizeof(int) <= sizeof(long)
The exact sizes vary (except for char) by system and compiler but on a 32 bit Windows the sizes with GCC and VC are (AFAIK):
sizeof(short) == 2 (byte)
sizeof(int) == sizeof(long) == 4 (byte)
Your observation of 'F' versus 'E' in this case is a typical endianness issue (little vs. big endian, how a "word" is stored in memory).
Now what happens to your value? You have a variable that is 8 bits wide. You assign a bigger value ('FATE' or 356), but the compiler knows it is only allowed to store 8 bits, so it cuts off all the other bits.
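A short sketch of that truncation (what exactly ends up in a signed char is implementation-defined, but typically it is just the low 8 bits):
#include <stdio.h>

int main(void)
{
    char c = 356;      /* 356 is 0x164; only the low 8 bits (0x64 == 100) fit */
    printf("%d\n", c); /* typically prints 100; compilers usually warn about the narrowing */
    return 0;
}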
To A:
3.) This is due to the different byte orderings of big- and little-endian CPU architectures. You get the first byte on a little-endian CPU (e.g. x86) and the last byte on a big-endian CPU (e.g. PPC). Actually you always get the lowest 8 bits when the conversion from int to char is done, but the characters in the int are stored in reversed order.
7.) A char can only hold 8 bits, so everything else gets truncated the moment you assign the int to a char variable, and it can never be restored from the char variable later.
To B:
3.) You might sometimes want to print only the lowest 16 bits of an int variable, regardless of what is in the upper half. It is not uncommon to pack multiple integer values into a single variable for certain optimizations. This works well for integer types but does not make much sense for floating-point types, which don't support bitwise operations directly; that might be the reason why there is no separate specifier for float in printf.
char is 1 byte long. A byte can be 8, 16, or 32 bits long; on general-purpose computers it is generally 8 bits. So the maximum number which a char can represent depends on the bit length of the char. To check the bit length of a char, see the <limits.h> header file; it is defined there as CHAR_BIT.
The result of char x = 'FATE'; will probably depend on the byte ordering in which the machine/compiler interprets 'FATE', so this depends on the system/compiler. Someone please confirm/correct this.
If your system has 8-bit bytes, then when you do c = 356 only the lower 8 bits of the binary representation of 356 are stored in the variable, because char data is always allocated 1 byte of storage. So %d prints 100, because the upper bits were lost when you assigned the value to the variable and only the lower 8 bits remain.

Resources