I just started learning C and am rather confused over declaring characters using int and char.
I am well aware that any characters are made up of integers in the sense that the "integers" of characters are the characters' respective ASCII decimals.
That said, I learned that it's perfectly possible to declare a character using int without using the ASCII decimals. Eg. declaring variable test as a character 'X' can be written as:
char test = 'X';
and
int test = 'X';
And for both declaration of character, the conversion characters are %c (even though test is defined as int).
Therefore, my question is/are the difference(s) between declaring character variables using char and int and when to use int to declare a character variable?
The difference is the size in byte of the variable, and from there the different values the variable can hold.
A char is required to accept all values between 0 and 127 (included). So in common environments it occupies exactly
one byte (8 bits). It is unspecified by the standard whether it is signed (-128 - 127) or unsigned (0 - 255).
An int is required to be at least a 16 bits signed word, and to accept all values between -32767 and 32767. That means that an int can accept all values from a char, be the latter signed or unsigned.
If you want to store only characters in a variable, you should declare it as char. Using an int would just waste memory, and could mislead a future reader. One common exception to that rule is when you want to process a wider value for special conditions. For example the function fgetc from the standard library is declared as returning int:
int fgetc(FILE *fd);
because the special value EOF (for End Of File) is defined as the int value -1 (all bits to one in a 2-complement system) that means more than the size of a char. That way no char (only 8 bits on a common system) can be equal to the EOF constant. If the function was declared to return a simple char, nothing could distinguish the EOF value from the (valid) char 0xFF.
That's the reason why the following code is bad and should never be used:
char c; // a terrible memory saving...
...
while ((c = fgetc(stdin)) != EOF) { // NEVER WRITE THAT!!!
...
}
Inside the loop, a char would be enough, but for the test not to succeed when reading character 0xFF, the variable needs to be an int.
The char type has multiple roles.
The first is that it is simply part of the chain of integer types, char, short, int, long, etc., so it's just another container for numbers.
The second is that its underlying storage is the smallest unit, and all other objects have a size that is a multiple of the size of char (sizeof returns a number that is in units of char, so sizeof char == 1).
The third is that it plays the role of a character in a string, certainly historically. When seen like this, the value of a char maps to a specified character, for instance via the ASCII encoding, but it can also be used with multi-byte encodings (one or more chars together map to one character).
Size of an int is 4 bytes on most architectures, while the size of a char is 1 byte.
Usually you should declare characters as char and use int for integers being capable of holding bigger values. On most systems a char occupies a byte which is 8 bits. Depending on your system this char might be signed or unsigned by default, as such it will be able to hold values between 0-255 or -128-127.
An int might be 32 bits long, but if you really want exactly 32 bits for your integer you should declare it as int32_t or uint32_t instead.
I think there's no difference, but you're allocating extra memory you're not going to use. You can also do const long a = 1;, but it will be more suitable to use const char a = 1; instead.
Related
This question:
What is an unsigned char?
does a great job of discussing char vs. unsigned char vs. signed char in C.
However, it doesn't directly address what should be used for non-ASCII text. Thus if I have an array of bytes that represents text in some arbitrary character set like UTF-8 or Big5 (or sometimes ASCII), should I use an array of char or unsigned char?
I'm leaning towards using char because otherwise gcc gives me warnings about signedness of pointers when the array is ASCII and I use strlen. But I would like to know what is correct.
Use normal char to represent characters. Use signed char when you want a signed integer type that covers values from -127 to +127 . Use unsigned char for having an unsigned integer type that has range of values from 0 to 255 .
The question you are asking is probably much broader that you expect.
To answer it directly, most implementations use "byte" as underlying buffer. In that terms standard uint8_t typedef is your best bet. That is primarily because most character sets use variable number of bytes to store characters, so separate byte processing is essential in encoding and decoding process. It also simplifies conversion between different "endianess".
In general it's incorrect to use strlen on anything other than ASCII encoding or other single-byte code pages (0-255 range). It's certainly incorrect on any multi-byte encoding like Big5, UTF-8/16 or Shift-JIS.
As far as UTF8 or any encoding where ASCII characters have the same codepoints, char is the best type for multi-byte characters string:
assume typedef char utf8:
This is the only way to allow char * to be used as utf8 * without an explicit cast. This is extremely common and a good enough reason to be better than unsigned char.
utf8 * could be accidentally passed to function expecting a pointer to a sequence of ASCII characters, but this could also be needed if you need to printf your utf8 string (which is a valid thing to do)
The main drawback is that as char sign is unknown, usage of arithmetic operators like > is unsafe, and the only safe way to check if a character is in the ASCII range is by checking the bit directly with ISASCII(c) ((c & (1 << 7) == 0)
So, where can unsigned char be useful?
If I understood right, unsigned char can represent numbers from -128 to 127. But every encoding table uses positive numbers. So, unsigned char can't be used for representing characters. Am I right?
No, unsigned char is 0 to 255.
It can be useful in representing binary data (a single byte), although, like any primitive data type, the possibilities are endless.
First of all, what you are representing is signed char, unsigned char ranges from 0 - 255.
To answer your questions about negative valued character, you are right that character encoding is done using positive values.
On a different view, just think of signed and unsigned char as integer representation.
Unsigned char is used to represent bytes. If you need just one byte of memory in a variable, you use unsigned char and assign an integer to it.
fo example, there is used uint8_t to represent bytes, but is not more than that.
A signed char can represent number from -128 to +127
and unsigned char is from 0 to 255.
Altough unsigned is more convenient in many use cases,
everthing binary-related can be done with signed too:
0=0, 1=1 ... 127=127, -128=128, -127=129, -126=130 ... -1=255
Such conversions happens automatically (or, better to say,
it´s just different interpretation).
("binary-related" means that a mathematical -2 * 2 would be possible too with unsigned,
but make even less sense)
Regarding So, where can unsigned char be useful?
Here perhaps?: (a very simple example to test for ASCII digit)
BOOL isDigit(unsigned char c)
{
if((c >= '0') &&(c <= '9')) return TRUE;
return FALSE;
}
By virtue of argument type unsigned char guarantees input will be a single ASCII character (there are 128 encoded ASCII possibilities, with Extended ASCII, there are 255 possibilities). So, in this function, all that remains is to test input value for specific criteria (in this case is it a digit) There is no requirement for function to test for negative numbers. A regular char (i.e. signed) cannot contain the entire range of ASCII characters. The sizeof unsigned char is also significant in that it is only 1 byte as opposed to 4 bytes (typically, but not always) for say, an int
Hi how to modify character size as 2 bytes in C ? because the size of character in C is 1 byte only
You can't change any data type size. You need to change the data type depending on the values you are going to store..
If you need to store more than one character,use the array of character as below..
char a[2];
In above declaration variable 'a' will hold string of two characters..
You can use unsigned short. You cannot modify the data types.
You can do as following,
typedef unsigned short newChar;
int main()
{
newChar c = 'a';
}
No it is not possible. You cannot modify the character size to 2 bytes as the size of character is set to 1 byte by default. You cannot modify the data types. You can probably use a array of characters to store more than one character like:
char s[10];
On a side note:-
From here
Size qualifiers alter the size of the basic data types. There are two
size qualifiers that can be applied to integer: short and long. The
minimum size of short int is 16 bit. The size of int must be greater
than or equal to that of a short int. The size of long int must be
greater than or equal to a short int. The minimum size of a long int
is 32 bits.
C uses Ascii format of storing characters, so the range of these ASCII characters 0-255. 0-127 are general ascii character set. 127 onwards is extended set. So C supports 1 byte characters only. Whereas Java uses unicodes, which has a larger range, so the storage in Java for characters could be greater than 1 byte.
I was playing around with unicode characters (without using wchar_t support) just for fun. I'm only using the regular char data type. I noticed that while printing them in hex they were showing up full 4 bytes instead of just one byte.
For ex. consider this c file:
#include <stdio.h>
#include <stdlib.h>
int main(void)
{
char *s = (char *) malloc(100);
fgets(s, 100, stdin);
while (s && *s != '\0') {
printf("%x\n", *s);
s++;
}
return 0;
}
After compiling with gcc and giving input as 'cent' symbol (hex: c2 a2) I get the following output
$ ./a.out
¢
ffffffc2: ?
ffffffa2: ?
a:
So instead of just printing c2 and a2 I got the whole 4 bytes as if it's an int type.
Does this mean char is not really 1-byte in length, ascii made it look like 1-byte?
Maybe the reason why the upper three bytes become 0xFFFFFF needs a bit more explanation?
The upper three bytes of the value printed for *s have a value of 0xFF due to sign extension.
The char value passed to printf is extended to an int before the call to printf.
This is due to C's default behaviour.
In the absence of signed or unsigned, the compiler can default to interpret char as signed char or unsigned char. It is consistently one or the other unless explicitly changed with a command line option or pragma's. In this case we can see that it is signed char.
In the absence of more information (prototypes or casts), C passes:
int, so char, short, unsigned char unsigned short are converted to int. It never passes a char, unsigned char, signed char, as a single byte, it always passes an int.
unsigned int is the same size as int so the value is passed without change
The compiler needs to decide how to convert the smaller value to an int.
signed values: the upper bytes of the int are sign extended from the smaller value, which effectively copies the top, sign bit, upwards to fill the int. If the top bit of the smaller signed value is 0, the upper bytes are filled with 0. If the top bit of the smaller signed value is 1, the upper bytes are filled with 1. Hence printf("%x ",*s) prints ffffffc2
unsigned values are not sign extended, the upper bytes of the int are 'zero padded'
Hence the reason C can call a function without a prototype (though the compiler will usually warn about that)
So you can write, and expect this to run (though I would hope your compiler issues warnings):
/* Notice the include is 'removed' so the C compiler does default behaviour */
/* #include <stdio.h> */
int main (int argc, const char * argv[]) {
signed char schar[] = "\x70\x80";
unsigned char uchar[] = "\x70\x80";
printf("schar[0]=%x schar[1]=%x uchar[0]=%x uchar[1]=%x\n",
schar[0], schar[1], uchar[0], uchar[1]);
return 0;
}
That prints:
schar[0]=70 schar[1]=ffffff80 uchar[0]=70 uchar[1]=80
The char value is interpreted by my (Mac's gcc) compiler as signed char, so the compiler generates code to sign extended the char to the int before the printf call.
Where the signed char value has its top (sign) bit set (\x80), the conversion to int sign extends the char value. The sign extension fills in the upper bytes (in this case 3 more bytes to make a 4 byte int) with 1's, which get printed by printf as ffffff80
Where the signed char value has its top (sign) bit clear (\x70), the conversion to int still sign extends the char value. In this case the sign is 0, so the sign extension fills in the upper bytes with 0's, which get printed by printf as 70
My example shows the case where the value is unsigned char. In these two cases the value is not sign extended because the value is unsigned. Instead they are extended to int with 0 padding. It might look like printf is only printing one byte because the adjacent three bytes of the value would be 0. But it is printing the entire int, it happens that the value is 0x00000070 and 0x00000080 because the unsigned char values were converted to
int without sign extension.
You can force printf to only print the low byte of the int, by using suitable formatting (%hhx), so this correctly prints only the value in the original char:
/* Notice the include is 'removed' so the C compiler does default behaviour */
/* #include <stdio.h> */
int main (int argc, const char * argv[]) {
char schar[] = "\x70\x80";
unsigned char uchar[] = "\x70\x80";
printf("schar[0]=%hhx schar[1]=%hhx uchar[0]=%hhx uchar[1]=%hhx\n",
schar[0], schar[1], uchar[0], uchar[1]);
return 0;
}
This prints:
schar[0]=70 schar[1]=80 uchar[0]=70 uchar[1]=80
because printf interprets the %hhx to treat the int as an unsigned char. This does not change the fact that the char was sign extended to an int before printf was called. It is only a way to tell printf how to interpret the contents of the int.
In a way, for signed char *schar, the meaning of %hhx looks slightly misleading, but the '%x' format interprets int as unsigned anyway, and (with my printf) there is no format to print hex for signed values (IMHO it would be a confusing).
Sadly, ISO/ANSI/... don't freely publish our programming language standards, so I can't point to the specification, but searching the web might turn up working drafts. I haven't tried to find them. I would recommend "C: A Reference Manual" by Samuel P. Harbison and Guy L. Steele as a cheaper alternative to the ISO document.
HTH
No. printf is a variable argument function, arguments to a variable argument function will be promoted to an int. And in this case the char was negative, so it gets sign extended.
%x tells printf that the value to print is an unsigned int. So, it promotes the char to an unsigned int, sign extending as necessary and then prints out the resulting value.
Given that signed and unsigned ints use the same registers, etc., and just interpret bit patterns differently, and C chars are basically just 8-bit ints, what's the difference between signed and unsigned chars in C? I understand that the signedness of char is implementation defined, and I simply can't understand how it could ever make a difference, at least when char is used to hold strings instead of to do math.
It won't make a difference for strings. But in C you can use a char to do math, when it will make a difference.
In fact, when working in constrained memory environments, like embedded 8 bit applications a char will often be used to do math, and then it makes a big difference. This is because there is no byte type by default in C.
In terms of the values they represent:
unsigned char:
spans the value range 0..255 (00000000..11111111)
values overflow around low edge as:
0 - 1 = 255 (00000000 - 00000001 = 11111111)
values overflow around high edge as:
255 + 1 = 0 (11111111 + 00000001 = 00000000)
bitwise right shift operator (>>) does a logical shift:
10000000 >> 1 = 01000000 (128 / 2 = 64)
signed char:
spans the value range -128..127 (10000000..01111111)
values overflow around low edge as:
-128 - 1 = 127 (10000000 - 00000001 = 01111111)
values overflow around high edge as:
127 + 1 = -128 (01111111 + 00000001 = 10000000)
bitwise right shift operator (>>) does an arithmetic shift:
10000000 >> 1 = 11000000 (-128 / 2 = -64)
I included the binary representations to show that the value wrapping behaviour is pure, consistent binary arithmetic and has nothing to do with a char being signed/unsigned (expect for right shifts).
Update
Some implementation-specific behaviour mentioned in the comments:
char != signed char. The type "char" without "signed" or "unsinged" is implementation-defined which means that it can act like a signed or unsigned type.
Signed integer overflow leads to undefined behavior where a program can do anything, including dumping core or overrunning a buffer.
#include <stdio.h>
int main(int argc, char** argv)
{
char a = 'A';
char b = 0xFF;
signed char sa = 'A';
signed char sb = 0xFF;
unsigned char ua = 'A';
unsigned char ub = 0xFF;
printf("a > b: %s\n", a > b ? "true" : "false");
printf("sa > sb: %s\n", sa > sb ? "true" : "false");
printf("ua > ub: %s\n", ua > ub ? "true" : "false");
return 0;
}
[root]# ./a.out
a > b: true
sa > sb: true
ua > ub: false
It's important when sorting strings.
There are a couple of difference. Most importantly, if you overflow the valid range of a char by assigning it a too big or small integer, and char is signed, the resulting value is implementation defined or even some signal (in C) could be risen, as for all signed types. Contrast that to the case when you assign something too big or small to an unsigned char: the value wraps around, you will get precisely defined semantics. For example, assigning a -1 to an unsigned char, you will get an UCHAR_MAX. So whenever you have a byte as in a number from 0 to 2^CHAR_BIT, you should really use unsigned char to store it.
The sign also makes a difference when passing to vararg functions:
char c = getSomeCharacter(); // returns 0..255
printf("%d\n", c);
Assume the value assigned to c would be too big for char to represent, and the machine uses two's complement. Many implementation behave for the case that you assign a too big value to the char, in that the bit-pattern won't change. If an int will be able to represent all values of char (which it is for most implementations), then the char is being promoted to int before passing to printf. So, the value of what is passed would be negative. Promoting to int would retain that sign. So you will get a negative result. However, if char is unsigned, then the value is unsigned, and promoting to an int will yield a positive int. You can use unsigned char, then you will get precisely defined behavior for both the assignment to the variable, and passing to printf which will then print something positive.
Note that a char, unsigned and signed char all are at least 8 bits wide. There is no requirement that char is exactly 8 bits wide. However, for most systems that's true, but for some, you will find they use 32bit chars. A byte in C and C++ is defined to have the size of char, so a byte in C also is not always exactly 8 bits.
Another difference is, that in C, a unsigned char must have no padding bits. That is, if you find CHAR_BIT is 8, then an unsigned char's values must range from 0 .. 2^CHAR_BIT-1. THe same is true for char if it's unsigned. For signed char, you can't assume anything about the range of values, even if you know how your compiler implements the sign stuff (two's complement or the other options), there may be unused padding bits in it. In C++, there are no padding bits for all three character types.
"What does it mean for a char to be signed?"
Traditionally, the ASCII character set consists of 7-bit character encodings. (As opposed to the 8 bit EBCIDIC.)
When the C language was designed and implemented this was a significant issue. (For various reasons like data transmission over serial modem devices.) The extra bit has uses like parity.
A "signed character" happens to be perfect for this representation.
Binary data, OTOH, is simply taking the value of each 8-bit "chunk" of data, thus no sign is needed.
Arithmetic on bytes is important for computer graphics (where 8-bit values are often used to store colors). Aside from that, I can think of two main cases where char sign matters:
converting to a larger int
comparison functions
The nasty thing is, these won't bite you if all your string data is 7-bit. However, it promises to be an unending source of obscure bugs if you're trying to make your C/C++ program 8-bit clean.
Signedness works pretty much the same way in chars as it does in other integral types. As you've noted, chars are really just one-byte integers. (Not necessarily 8-bit, though! There's a difference; a byte might be bigger than 8 bits on some platforms, and chars are rather tied to bytes due to the definitions of char and sizeof(char). The CHAR_BIT macro, defined in <limits.h> or C++'s <climits>, will tell you how many bits are in a char.).
As for why you'd want a character with a sign: in C and C++, there is no standard type called byte. To the compiler, chars are bytes and vice versa, and it doesn't distinguish between them. Sometimes, though, you want to -- sometimes you want that char to be a one-byte number, and in those cases (particularly how small a range a byte can have), you also typically care whether the number is signed or not. I've personally used signedness (or unsignedness) to say that a certain char is a (numeric) "byte" rather than a character, and that it's going to be used numerically. Without a specified signedness, that char really is a character, and is intended to be used as text.
I used to do that, rather. Now the newer versions of C and C++ have (u?)int_least8_t (currently typedef'd in <stdint.h> or <cstdint>), which are more explicitly numeric (though they'll typically just be typedefs for signed and unsigned char types anyway).
The only situation I can imagine this being an issue is if you choose to do math on chars. It's perfectly legal to write the following code.
char a = (char)42;
char b = (char)120;
char c = a + b;
Depending on the signedness of the char, c could be one of two values. If char's are unsigned then c will be (char)162. If they are signed then it will an overflow case as the max value for a signed char is 128. I'm guessing most implementations would just return (char)-32.
One thing about signed chars is that you can test c >= ' ' (space) and be sure it's a normal printable ascii char. Of course, it's not portable, so not very useful.