I have a few questions about types in ANSI C:
1. What's the difference between "\x" at the beginning of a char and 0x at the beginning of a char (or in any other case, for that matter)? AFAIK, they both mean that this is hexadecimal, so what's the difference?
2. When casting a char to (unsigned), not (unsigned char), what does it mean? Why is (unsigned)'\xFF' != 0xFF?
Thanks!
what's the difference between "\x" at the beginning of a char and 0x at the beginning of a char
The difference is that 0x12 is the notation for writing an integer constant in hexadecimal, while "\x" is a hexadecimal escape sequence used inside character and string literals. An example:
#include <stdio.h>

int main(void)
{
    int ten = 0xA;
    char *tenString = "1\x30";
    printf("ten as integer: %d\n", ten);
    printf("ten as string: %s\n", tenString);
    return 0;
}
Both printf calls should output "10" (try to understand why).
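For a single character constant the same escape works. A minimal sketch (assuming an ASCII execution character set, where 'A' is code 0x41):

#include <stdio.h>

int main(void)
{
    /* '\x41' is a character constant written with a hex escape;
       0x41 is an ordinary integer constant written in hexadecimal.
       On an ASCII system both have the value 65, i.e. the letter 'A'. */
    printf("%c %c %d\n", '\x41', 0x41, '\x41' == 0x41);
    return 0;
}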
when casting a char to (unsigned), not (unsigned char), what does it mean? why is (unsigned)'\xFF' != 0xFF?
"unsigned" is just an abbreviation for "unsigned int". So you're casting from char to int. This will give you the numeric representation of the character in the character set your platform uses. Note that the value you get for a character is platform-dependent (typically depending on the default character encoding). For ASCII characters you will (usually) get the ASCII code, but anything beyond that will depend on platform and runtime configuration.
Understanding what a cast from one type to another does is very complicated (and often, though not always, platform-dependent), so avoid it if you can. Sometimes it is necessary, though. See e.g. need-some-clarification-regarding-casting-in-c
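Since plain char is signed on many platforms, '\xFF' there has the value -1, and converting -1 to unsigned int yields UINT_MAX rather than 0xFF, which is why (unsigned)'\xFF' != 0xFF. A small demo of this, assuming a platform where char is a signed 8-bit type (as on typical x86 compilers):

#include <stdio.h>

int main(void)
{
    /* With a signed 8-bit char, '\xFF' has the value -1. Converting -1 to
       unsigned int wraps around to UINT_MAX, so the first line prints
       ffffffff (for a 32-bit unsigned int), not ff. Going through
       unsigned char first keeps the value at 0xFF. */
    printf("%x\n", (unsigned) '\xFF');
    printf("%x\n", (unsigned) (unsigned char) '\xFF');
    return 0;
}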
Related
I am currently reading The C Programming Language by Kernighan.
There is an example that defines a variable as int but uses getchar() to store a character in it.
int x;
x = getchar();
Why can we store char data in an int variable?
The only thing I can think of is ASCII and Unicode.
Am I right?
The getchar function (and similar character input functions) returns an int because of EOF. There are cases when (char) EOF != EOF (like when char is an unsigned type).
Also, in many places where one uses a char variable, it will silently be promoted to int anyway. And that includes character constants like 'A'.
getchar() attempts to read a byte from the standard input stream. The return value can be any possible value of the type unsigned char (from 0 to UCHAR_MAX), or the special value EOF which is specified to be negative.
On most current systems, UCHAR_MAX is 255 as bytes have 8 bits, and EOF is defined as -1, but the C Standard does not guarantee this: some systems have larger unsigned char types (9 bits, 16 bits...) and it is possible, although I have never seen it, that EOF be defined as another negative value.
Storing the return value of getchar() (or getc(fp)) to a char would prevent proper detection of end of file. Consider these cases (on common systems):
if char is an 8-bit signed type, a byte value of 255, which is the character ÿ in the ISO8859-1 character set, has the value -1 when converted to a char. Comparing this char to EOF will yield a false positive.
if char is unsigned, converting EOF to char will produce the value 255, which is different from EOF, preventing the detection of end of file.
These are the reasons for storing the return value of getchar() into an int variable. This value can later be converted to a char, once the test for end of file has failed.
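A minimal sketch of the usual idiom (copy standard input to standard output), keeping the return value in an int until the EOF test is done:

#include <stdio.h>

int main(void)
{
    int c;   /* int, NOT char, so EOF can be distinguished from every byte value */

    while ((c = getchar()) != EOF) {
        putchar(c);   /* safe to treat as a character once EOF has been ruled out */
    }
    return 0;
}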
Storing an int to a char has implementation defined behavior if the char type is signed and the value of the int is outside the range of the char type. This is a technical problem, which should have mandated the char type to be unsigned, but the C Standard allowed for many existing implementations where the char type was signed. It would take a vicious implementation to have unexpected behavior for this simple conversion.
The value of the char does indeed depend on the execution character set. Most current systems use ASCII or some extension of ASCII such as ISO8859-x, UTF-8, etc. But the C Standard supports other character sets such as EBCDIC, where the lowercase letters do not form a contiguous range.
getchar is an old C standard function and the philosophy back then was closer to how the language gets translated to assembly than type correctness and readability. Keep in mind that compilers were not optimizing code as much as they are today. In C, int is the default return type (i.e. if you don't have a declaration of a function in C, compilers will assume that it returns int), and returning a value is done using a register - therefore returning a char instead of an int actually generates additional implicit code to mask out the extra bytes of your value. Thus, many old C functions prefer to return int.
C requires int be at least as many bits as char. Therefore, int can store the same values as char (allowing for signed/unsigned differences). In most cases, int is a lot larger than char.
char is an integer type that is intended to store a character code from the implementation-defined character set, which is required to be compatible with C's abstract basic character set. (ASCII qualifies, so do the source-charset and execution-charset allowed by your compiler, including the one you are actually using.)
For the sizes and ranges of the integer types (char included), see your <limits.h>. Here is somebody else's limits.h.
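A quick way to see the actual sizes and ranges on your own machine is to print the macros from <limits.h>; a minimal sketch (the printed values are implementation-defined):

#include <stdio.h>
#include <limits.h>

int main(void)
{
    printf("char: %d .. %d (CHAR_BIT = %d)\n", CHAR_MIN, CHAR_MAX, CHAR_BIT);
    printf("int:  %d .. %d\n", INT_MIN, INT_MAX);
    return 0;
}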
C was designed as a very low-level language, so it is close to the hardware. Usually, after a bit of experience, you can predict how the compiler will allocate memory, and even pretty accurately what the machine code will look like.
Your intuition is right: it goes back to ASCII. ASCII is really a simple 1:1 mapping from letters (which make sense in human language) to integer values (which hardware can work with); for every letter there is a unique integer. For example, the 'letter' CTRL-A is represented by the decimal number 1. (For historical reasons, lots of control characters came first, so CTRL-G, which rang the bell on an old teletype terminal, is ASCII code 7. Upper-case 'A' and the 25 remaining upper-case letters start at 65, and so on. See http://www.asciitable.com/ for a full list.)
C lets you 'coerce' variables into other types. In other words, the compiler cares about (1) the size, in memory, of the var (see 'pointer arithmetic' in K&R), and (2) what operations you can do on it.
You can do arithmetic on a char, since it is promoted to int in arithmetic expressions anyway. So, to convert all lower-case letters to upper-case, you can do something like:
char letter;
/* ... letter gets a value somehow ... */
if (letter >= 'a' && letter <= 'z') {   /* letter is lower-case */
    letter = letter - 32;               /* 'a' to 'A' is a difference of 32 in ASCII */
}
Some C compilers, with warnings turned up, will complain about the implicit narrowing conversion when the int result is assigned back to the char; an explicit (char) cast silences that.
But, in the end, the type 'char' is really just a small integer type, since ASCII assigns a unique integer to each letter.
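Note, though, that subtracting 32 only works for ASCII; the portable way is toupper() from <ctype.h>. A small sketch:

#include <ctype.h>
#include <stdio.h>

int main(void)
{
    char letter = 'q';

    /* toupper() takes an int whose value must fit in unsigned char (or be EOF),
       hence the cast; it works in any character set, not just ASCII. */
    letter = (char) toupper((unsigned char) letter);
    printf("%c\n", letter);   /* prints Q */
    return 0;
}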
This question already has answers here:
How to determine the result of assigning multi-character char constant to a char variable?
The following statement in C gives no error:
char p='-1';
but the following gives an error:
char p='-12';
ERROR: character can be one or two characters long.
I never knew that a char in C could ever be two characters long. However, printf("%c", p) gives - as output. Where can I use such a char in C?
In C, a character constant like 'A' does not have type char, but rather type int. This creates the possibility that, even on a system where char is only 8 bits wide (and so int is wider than char), character constant notations can exist which provide integer values wider than char.
The C standard requires implementations to support multi-character constants, but their values are implementation-defined.
Your compiler likely allows only two characters because its int type is only 16 bits wide. Perhaps a constant like 'AB' is encoded similarly to, say, the expression ('A' << 8 | 'B'). By the obvious extension of this scheme, 'ABC' would then have to be ('A' << 16 | 'B' << 8 | 'C'), which doesn't fit into 16 bits and calls for out-of-range shifts. Hence the two-character limit.
In the GNU C compiler, four characters can be used:
#include <stdio.h>

int main(void)
{
    printf("%x\n", (unsigned) 'ABCD');
    return 0;
}
int is 32 bits wide, and this program prints 41424344 which, by golly, is hexadecimal for the ASCII characters ABCD. So this feature is useful for int-wide magic constants which are readable. Instead of:
#define MAGIC 0x41424344 /* This spells ABCD; easy to spot in memory dumps */
You can do this, which is nice, but less portable:
#define MAGIC 'ABCD'
What if we use five or more characters, like 'ABCDE'? Then GCC responds similarly to how Turbo C++ responds for three or more:
test.c:5:35: warning: character constant too long for its type [enabled by default]
It so happens that the program still compiles, and its output is unchanged: the E was truncated.
There is an important difference. The old Borland compiler is rejecting the excessively-long constant as an error. Though that is probably a good idea, it is not standard-conforming; when some value is implementation-defined, the implementation's response cannot be failure, such as stopping the translation or execution of the program. Issuing a diagnostic is fine, of course.
char p='-517';
printf("%c\n", p);
Running the above code gave me output 7 and a warning: overflow in implicit constant conversion [-Woverflow]
A char cannot contain more than 1 byte of information.
You want an array of characters, also known as a C string:
// Note, if you initialize a character array with a literal string
// there is no need for a size specifier
char c[] = "-12";
// Note this is a method of copying one character array into another.
#include <string.h>
char c[4];
strcpy(c, "-12");
You'll notice that char c[4] has an indicated size of 4, meaning the array can hold only 4 characters. In C, character strings have a special property: a null terminator (the char '\0') is a sentinel value that the C string functions use to recognize the end of your string. So, in reality, the character string "-12" is of size 4: '-', '1', '2', and '\0'.
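A small sketch that makes the sizes explicit (sizeof measures the whole array, strlen() counts the characters before the terminator):

#include <stdio.h>
#include <string.h>

int main(void)
{
    char c[] = "-12";

    printf("sizeof(c) = %zu\n", sizeof(c));   /* 4: '-', '1', '2' and '\0' */
    printf("strlen(c) = %zu\n", strlen(c));   /* 3: the terminator is not counted */
    return 0;
}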
You can also access individual elements of an array by passing an index to the subscript operator [].
printf("%s\n", c);
printf("%c\n", c[0]);
Notice the c[0] expression; this accesses the character '-' of the string "-12".
Hope I helped.
Sorry about the beginner's question, but I've written this code:
#include <stdio.h>

int main(void)
{
    int y = 's';
    printf("%c\n", y);
    return 0;
}
The compiler (Visual Studio 2012) does not warn me about possible data loss (as it does, for example, when converting from int to float).
I didn't find an answer (or didn't search correctly) on Google.
I wonder if this is because an int takes 4 bytes of storage, so it can easily hold the 1 byte of a char.
I am not sure about this.
Thanks in advance.
Yes, that's fine. Characters are simply small integers, so of course the smaller value fits in the larger int variable, there's nothing to warn about.
Many standard C functions use int to transport single characters, since they then also get the possibility to express EOF (which is not a character).
A char is just an 8-bit integer.
An int is a larger integer (on MSVC 32-bit builds it should be 4 bytes).
's' corresponds to the ASCII code of the lower-case letter 's', i.e. it's the integer number 115.
So, your code is similar to:
int y = 115; // 's'
In C, characters are stored in and dealt with as integers, according to the platform's character encoding (usually ASCII). This allows for functions such as strcmp(), etc.
Despite appearances there is no char anywhere in your example.
For historical reasons character constants are actually ints so the line
int y = 's';
is actually assigning one int to another.
Furthermore the %c format specifier in printf actually expects to receive an int argument, not a char. This is because the default argument promotions are applied to variadic arguments and therefore any char in a call to printf is promoted to an int before the function is called.
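To make the promotion visible, here is a hypothetical variadic function (the name print_chars is made up for illustration); because of the default argument promotions it has to read each char argument back with va_arg(ap, int):

#include <stdarg.h>
#include <stdio.h>

/* Hypothetical helper: prints 'count' characters passed as variadic arguments. */
static void print_chars(int count, ...)
{
    va_list ap;
    va_start(ap, count);
    for (int i = 0; i < count; i++) {
        int ch = va_arg(ap, int);   /* each char was promoted to int by the caller */
        putchar(ch);
    }
    va_end(ap);
    putchar('\n');
}

int main(void)
{
    char a = 'h', b = 'i';
    print_chars(2, a, b);   /* prints "hi"; a and b are promoted to int here */
    return 0;
}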
I have C code in which I am using the standard library function isalpha() from ctype.h. This is on Visual Studio 2010 on Windows.
In the code below, if the char c is '£', the isalpha call fails with an assertion:
char c = '£';
if (isalpha(c))
{
    printf("character %c is alphabetic\n", c);
}
else
{
    printf("character %c is NOT alphabetic\n", c);
}
I can see that this might be because 8-bit ASCII does not have this character.
So how do I handle such non-ASCII characters that fall outside the ASCII table?
What I want is that if any non-alphabetic character is found (even a character not in the 8-bit ASCII table), I am able to neglect it.
You may want to cast the value sent to isalpha (and the other functions declared in <ctype.h>) to unsigned char
isalpha((unsigned char)value)
It's one of the (not so) few occasions where a cast is appropriate in C.
Edited to add an explanation.
According to the standard (emphasis is mine):
7.4 paragraph 1: The header <ctype.h> declares several functions useful for classifying and mapping characters. In all cases the argument is an int, the value of which shall be representable as an unsigned char or shall equal the value of the macro EOF. If the argument has any other value, the behavior is undefined.
The cast to unsigned char ensures calling isalpha() does not invoke Undefined Behaviour.
You must pass an int to isalpha(), not a char. Note the standard prototype for this function:
int isalpha(int c);
Passing an 8-bit signed character will cause the value to be converted into a negative integer, resulting in an illegal negative offset into the internal arrays typically used by isxxxx().
However you must ensure that your char is treated as unsigned when casting - you can't simply cast it directly to an int, because if it's an 8-bit character the resulting int would still be negative.
The typical way to ensure this works is to cast it to an unsigned char, and then rely on implicit type conversion to convert that into an int.
e.g.
char c = '£';
int a = isalpha((unsigned char) c);
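Putting the pieces together, a hypothetical helper that drops every non-alphabetic byte from a string might look like this (a sketch only; the name keep_alpha is made up, and multi-byte characters such as a UTF-8 '£' are simply skipped byte by byte):

#include <ctype.h>
#include <stdio.h>

/* Hypothetical helper: copy only the alphabetic characters of src into dst.
   The cast to unsigned char keeps isalpha() away from negative values
   and therefore away from undefined behaviour. */
static void keep_alpha(char *dst, const char *src)
{
    while (*src) {
        if (isalpha((unsigned char) *src))
            *dst++ = *src;
        src++;
    }
    *dst = '\0';
}

int main(void)
{
    char buf[64];
    keep_alpha(buf, "ab1c-2d!");
    printf("%s\n", buf);   /* prints abcd */
    return 0;
}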
You may be compiling with wchar_t (UNICODE) as the character type; in that case, the isalpha function to use is iswalpha:
http://msdn.microsoft.com/en-us/library/xt82b8z8.aspx
I read that C does not define whether a char is signed or unsigned, and the GCC page says that it can be signed on x86 and unsigned on PowerPC and ARM.
OK, I'm writing a program with GLib, which defines char as gchar (nothing more than that, just a standardized typedef).
My question is, what about UTF-8? Does it use more than one byte of memory?
Say that I have a variable
unsigned char *string = "My string with UTF8 enconding ~> çã";
See, if I declare my variable as unsigned, will I have only 127 values (so my program would have to store more blocks of memory), or do the UTF-8 values become negative too?
Sorry if I can't explain it correctly, but I think it is a bit complex.
NOTE:
Thanks for all the answers.
I don't understand how it is interpreted normally.
I think that, like ASCII, if I have signed and unsigned chars in my program, the strings have different values, and that leads to confusion; imagine that with UTF-8.
I've had a couple requests to explain a comment I made.
The fact that a char type can default to either a signed or unsigned type can be significant when you're comparing characters and expect a certain ordering. In particular, UTF8 uses the high bit (assuming that char is an 8-bit type, which is true in the vast majority of platforms) to indicate that a character code point requires more than one byte to be represented.
A quick and dirty example of the problem:
#include <stdio.h>

int main(void)
{
    signed char flag = 0xf0;
    unsigned char uflag = 0xf0;

    if (flag < (signed char) 'z') {
        printf("flag is smaller than 'z'\n");
    }
    else {
        printf("flag is larger than 'z'\n");
    }
    if (uflag < (unsigned char) 'z') {
        printf("uflag is smaller than 'z'\n");
    }
    else {
        printf("uflag is larger than 'z'\n");
    }
    return 0;
}
On most projects that I work on, the unadorned char type is typically avoided in favor of a typedef that explicitly specifies an unsigned char. Something like uint8_t from stdint.h or
typedef unsigned char u8;
Generally, dealing with an unsigned char type seems to work well and cause few problems; the one area where I have seen occasional problems is when using something of that type to control a loop:
while (uchar_var-- >= 0) {
    // infinite loop...
}
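Since an unsigned char can never be negative, the >= 0 test is always true. A sketch of one way to write the countdown so it terminates:

#include <stdio.h>

int main(void)
{
    unsigned char uchar_var = 5;

    /* Test for non-zero before decrementing, so the loop can end. */
    while (uchar_var > 0) {
        uchar_var--;
        printf("%u\n", (unsigned) uchar_var);
    }
    return 0;
}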
Two things:
Whether a char type is signed or unsigned won't affect your ability to translate UTF8-encoded-strings to and from whatever display string type you're using (WCHAR or whatnot). Don't worry about it, in other words: the UTF8 bytes are just bytes, and whatever you're using as an encoder/decoder will do the right thing.
Some of your confusion may be that you're trying to do this:
unsigned char *string = "This is a UTF8 string";
Don't do this-- you're mixing different concepts. A UTF-8 encoded string is just a sequence of bytes. C string literals (as above) were not really designed to represent this; they're designed to represent "ASCII-encoded" strings. Although for some cases (like mine here) they end up being the same thing, in your example in the question, they may not. And certainly in other cases they won't be. Load your Unicode strings from an external resource. In general I'd be wary of embedding non-ASCII characters in a .c source file; even if the compiler knows what to do with them, other software in your toolchain may not.
Using unsigned char has its pros and cons. The biggest benefits are that you don't get sign extension or other funny features such as signed overflow that would produce unexpected results from calculations. unsigned char is also compatible with the <ctype.h> macros/functions such as isalpha(ch) (all of these require values in unsigned char range). On the other hand, all I/O functions require char*, requiring you to cast whenever you do I/O.
As for UTF-8, storing it in signed or unsigned arrays is fine but you have to be careful with those string literals as there is little guarantee about them being valid UTF-8. C++0x adds UTF-8 string literals to avoid possible issues and I would expect the next C standard to adopt those as well.
In general you should be fine, though, as long as you make sure that your source code files are always UTF-8 encoded.
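For what it's worth, C11 did adopt UTF-8 string literals (the u8 prefix). A minimal sketch, assuming a compiler in C11/C17 mode:

#include <stdio.h>

int main(void)
{
    /* u8"" guarantees the bytes are UTF-8, whatever encoding the compiler
       would otherwise use for plain string literals. */
    const char *s = u8"caf\u00e9";   /* "café" */

    /* Print each byte so the two-byte encoding of 'é' is visible: 63 61 66 C3 A9 */
    for (const unsigned char *p = (const unsigned char *) s; *p; p++)
        printf("%02X ", *p);
    putchar('\n');
    return 0;
}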
Signed / unsigned affects only arithmetic operations. If char is unsigned, then the higher values will be positive; if it is signed, they will be negative. But the range still has the same size.
Not really, unsigned / signed does not specify how many values a variable can hold. It specifies how they are interpreted.
So, an unsigned char has the same number of values as a signed char, except that one has negative numbers and the other doesn't. It is still 8 bits (if we assume that a char holds 8 bits; I'm not sure it does everywhere).
It makes no differences when using a char* as a string. The only time signed/unsigned would make a difference is if you would be interpreting it as a number, like for arithmetic or if you were to print it as an integer.
UTF-8 characters cannot be assumed to fit in one byte. UTF-8 characters can be 1-4 bytes wide, so a char, a wchar_t, signed or unsigned, is not sufficient for assuming that one unit can always store one UTF-8 character.
Most platforms (such as PHP, .NET, etc.) have you build strings normally (such as char[] in C) and you use a library to convert between encodings and parse characters out of the string.
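As an illustration of the "one character is not one byte" point, here is a small sketch (not a validator) that counts UTF-8 code points by counting the bytes that are not continuation bytes; every continuation byte in UTF-8 has the bit pattern 10xxxxxx, and unsigned char keeps the bit tests free of sign-extension surprises:

#include <stdio.h>
#include <stddef.h>

static size_t utf8_length(const char *s)
{
    size_t count = 0;
    for (const unsigned char *p = (const unsigned char *) s; *p; p++) {
        if ((*p & 0xC0) != 0x80)   /* not a 10xxxxxx continuation byte */
            count++;
    }
    return count;
}

int main(void)
{
    /* "\xC3\xA7" is the UTF-8 encoding of 'ç': 3 bytes in total, but only 2 code points. */
    printf("%zu\n", utf8_length("\xC3\xA7" "a"));
    return 0;
}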
As to your question:
think if I have a signed or unsigned ARRAY of chars, can it make my program run wrong? – drigoSkalWalker
Yes. Mine did. Here's a simple runnable excerpt from my app that comes out totally wrong if ordinary signed chars are used.
Try running it after changing the char parameters to unsigned, like this:
int is_valid(unsigned char c);
It should then work properly.
#include <stdio.h>

int is_valid(char c);

int main() {
    char ch = 0xFE;
    int ans = is_valid(ch);
    printf("%d", ans);
    return 0;
}

int is_valid(char c) {
    if ((c == 0xFF) || (c == 0xFE)) {
        printf("NOT valid\n");
        return 0;
    }
    else {
        printf("valid\n");
        return 1;
    }
}
What it does is check whether the char is a valid byte within UTF-8.
0xFF and 0xFE are NOT valid bytes in UTF-8.
Imagine the problem if the function accepts them as valid bytes.
What happens is this:
0xFE = 11111110 (binary) = 254 (decimal)
If you store this in an ordinary char (which is signed), the leftmost, most significant bit makes it negative. But which negative number is it?
In two's complement you get the magnitude by flipping the bits and adding one:
11111110  (flip the bits)
00000001  (add one)
00000010 = 2
and remember that the sign bit made it negative, so the value is -2.
So (-2 == 0xFE) in the function of course isn't true.
The same goes for (-2 == 0xFF).
So a function that checks for invalid bytes ends up accepting invalid bytes as if they were OK :-o
Two other reasons I can think of to stick to unsigned when dealing with UTF-8 are:
If you need to shift bits to the right, you can get into trouble with signed chars, because sign extension may fill in 1s from the left (see the sketch after this list).
UTF-8 and Unicode only use positive numbers, so... why don't you as well? Keep it simple :)
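A small sketch of the right-shift issue from the first point above (the result of shifting a negative value right is implementation-defined, but typical compilers sign-extend):

#include <stdio.h>

int main(void)
{
    signed char   sc = (signed char) 0xFE;   /* -2 on a two's complement machine */
    unsigned char uc = 0xFE;                 /* 254 */

    /* The signed char is promoted to a negative int; shifting that right
       usually fills with 1s from the left. The unsigned value fills with 0s. */
    printf("signed   0xFE >> 4 = %d\n", sc >> 4);   /* typically -1 */
    printf("unsigned 0xFE >> 4 = %d\n", uc >> 4);   /* 15 */
    return 0;
}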