Non-ASCII character declaration in C

I would like to store a character (in order to compare it with other characters).
If I declare the variable like this:
char c = 'é';
everything works well, but I get these warnings:
warning: multi-character character constant [-Wmultichar]
char c = 'é';
^
ii.c:12:3: warning: overflow in implicit constant conversion [-Woverflow]
char c = 'é';
I think I understand why there are these warnings, but I wonder why it still works.
And should I define it as int d = 'é'; instead, even though it takes more space in memory?
Moreover, I also get the warning below with that declaration:
warning: multi-character character constant [-Wmultichar]
int d = 'é';
Am I missing something? Thanks ;)

Try using wchar_t rather than char. char is a single byte, which is appropriate for ASCII but not for multi-byte character sets such as UTF-8. Also, flag your character literal as being a wide character rather than a narrow character:
#include <wchar.h>
...
wchar_t c = L'é';
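For example, a minimal complete program (a sketch, assuming your locale is UTF-8; setlocale is needed so the wide character can be encoded on output):
#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main(void) {
    setlocale(LC_ALL, "");   // adopt the environment's locale (assumed UTF-8)
    wchar_t c = L'é';
    if (c == L'é')           // wide characters compare as single code points
        wprintf(L"%lc\n", c);
    return 0;
}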

é has the Unicode code point U+00E9; its UTF-8 encoding is the two-byte sequence "\xc3\xa9".
I assume your source file is encoded in UTF-8, so
char c = 'é';
is (roughly) equivalent to
char c = '\xc3\xa9';
How such character constants are treated is implementation-defined. For GCC:
The compiler evaluates a multi-character character constant a character at a time, shifting the previous value left by the number of bits per target character, and then or-ing in the bit-pattern of the new character truncated to the width of a target character. The final bit-pattern is given type int, and is therefore signed, regardless of whether single characters are signed or not (a slight change from versions 3.1 and earlier of GCC). If there are more characters in the constant than would fit in the target int the compiler issues a warning, and the excess leading characters are ignored.
For example, 'ab' for a target with an 8-bit char would be interpreted as (int) ((unsigned char) 'a' * 256 + (unsigned char) 'b'), and '\234a' as (int) ((unsigned char) '\234' * 256 + (unsigned char) 'a').
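A quick way to check that rule (a sketch; assumes GCC's behaviour and an 8-bit char, and compiles with a -Wmultichar warning by design):
#include <assert.h>

int main(void) {
    // GCC builds 'ab' by shifting in 'a' and or-ing in 'b', per the quote above
    assert('ab' == (('a' << 8) | 'b'));   // 0x6162 == 24930
    return 0;
}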
Hence, 'é' has the value 0xC3A9, which fits into an int (at least for 32-bit int), but not into an (8-bit) char, so the conversion to char is again implementation-defined:
For conversion to a type of width N, the value is reduced modulo 2N to be within range of the type; no signal is raised.
This gives (with signed char)
#include <stdio.h>

int main(void) {
    printf("%d %d\n", 'é', (char)'é');
    if ((char)'é' == (char)'©') puts("(char)'é' == (char)'©'");
}
Output:
50089 -87
(char)'é' == (char)'©'
50089 is 0xC3A9, and -87 is the byte 0xA9 interpreted as a signed char.
So you lose information when storing é into a char (there are characters, like ©, which compare equal to é). You can:
- Use wchar_t, an implementation-dependent wide character type, which is 4 bytes on Linux and holds UTF-32: wchar_t c = L'é';. You can convert wide characters to the locale-specific multibyte encoding (probably UTF-8, but you need to set the locale first, see setlocale; note that changing the locale may change the behaviour of functions like isalpha or printf) with wcrtomb, or use them directly, along with wide strings (use the L prefix to get wide string literals).
- Use a string and store UTF-8 in it, as in const char *c = "é"; or const char *c = "\u00e9"; or const char *c = "\xc3\xa9";, with possibly different semantics; for C11, perhaps also look at UTF-8 string literals and the u8 prefix. See the sketch below.
Note that file streams have an orientation (cf. fwide).
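For the string option, comparison is then done with strcmp rather than ==; a minimal sketch assuming UTF-8 source and execution encoding:
#include <stdio.h>
#include <string.h>

int main(void) {
    const char *c = "é";                 // two bytes in UTF-8: 0xC3 0xA9
    if (strcmp(c, "\xc3\xa9") == 0)      // compare byte sequences, not single chars
        puts("matched");
    return 0;
}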
HTH

Related

Is a 64-bit character literal possible in C?

The following code compiles fine:
#include <stdint.h>

uint32_t myfunc32(void) {
    uint32_t var = 'asdf';
    return var;
}
The following code gives the warning, "character constant too long for its type":
uint64_t myfunc64(void) {
    uint64_t var = 'asdfasdf';
    return var;
}
Indeed, the 64-bit character literal gets truncated to a 32-bit constant by GCC. Are 64-bit character literals not a feature of C? I can't find any good info on this.
Edit: I am doing some more testing. It turns out that another compiler, MetroWerks CodeWarrior, can compile the 64-bit character literals as expected. If this is not already a feature of GCC, it really ought to be.
Are 64-bit character literals not a feature of C?
Indeed they are not. As per C99 §6.4.4.4 paragraph 10 (page 73 of the standard):
An integer character constant has type int. The value of an integer character constant
containing a single character that maps to a single-byte execution character is the
numerical value of the representation of the mapped character interpreted as an integer.
The value of an integer character constant containing more than one character (e.g.,
'ab'), or containing a character or escape sequence that does not map to a single-byte
execution character, is implementation-defined.
So, character constants have type int, which on most modern platforms means int32_t. On the other hand, the actual value of the int resulting from a multi-byte character constant is implementation defined, so you can't really expect much from int x = 'abc';, unless you are targeting a specific compiler and compiler version. You should avoid using such statements in sane C code.
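If what you actually want is a portable 32-bit "tag" value (a common reason for writing 'asdf'), you can build it explicitly instead; a sketch with a hypothetical TAG4 macro that produces the same value GCC gives for 'asdf', but with fully defined behaviour:
#include <stdint.h>

// Hypothetical helper: packs four bytes big-endian-style into a uint32_t,
// matching GCC's evaluation of 'asdf' (0x61736466) on any conforming compiler.
#define TAG4(a, b, c, d) \
    (((uint32_t)(uint8_t)(a) << 24) | ((uint32_t)(uint8_t)(b) << 16) | \
     ((uint32_t)(uint8_t)(c) << 8)  |  (uint32_t)(uint8_t)(d))

uint32_t myfunc32(void) {
    return TAG4('a', 's', 'd', 'f');
}
A 64-bit TAG8 can be written the same way with uint64_t and eight shifts, which sidesteps the "character constant too long for its type" warning entirely.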
As per GCC-specific behavior, from the GCC documentation we have:
The numeric value of character constants in preprocessor expressions.
The preprocessor and compiler interpret character constants in the same way; i.e. escape sequences such as ‘\a’ are given the values they would have on the target machine.
The compiler evaluates a multi-character character constant a character at a time, shifting the previous value left by the number of bits per target character, and then or-ing in the bit-pattern of the new character truncated to the width of a target character. The final bit-pattern is given type int, and is therefore signed, regardless of whether single characters are signed or not. If there are more characters in the constant than would fit in the target int the compiler issues a warning, and the excess leading characters are ignored.
For example, 'ab' for a target with an 8-bit char would be interpreted as ‘(int) ((unsigned char) 'a' * 256 + (unsigned char) 'b')’, and '\234a' as ‘(int) ((unsigned char) '\234' * 256 + (unsigned char) 'a')’.

Ambiguous result with char and toupper

How was 'ab' converted to 24930 when stored in a char?
#include <stdio.h>
#include <ctype.h> // needed for toupper()

int main() {
    char c = 'ab';
    c = toupper(c);
    printf("%c", c);
    return 0;
}
GCC compiler warning : Overflow in conversion from 'int' to 'char' changes value from '24930' to '98'
Output : B
If possible, please explain how char handled the multiple characters here.
From the C Standard (6.4.4.4 Character constants)
10 An integer character constant has type int. The value of an
integer character constant containing a single character that maps to
a single-byte execution character is the numerical value of the
representation of the mapped character interpreted as an integer.
The value of an integer character constant containing more than one character (e.g., 'ab'), or containing a character or escape sequence
that does not map to a single-byte execution character, is
implementation-defined.
So the character constant 'ab' is stored as an object of type int. When it is assigned to an object of type char, as in this declaration
char c = 'ab';
the least significant byte of the int value is used to initialize the object c. In your case the character 'b' occupies that least significant byte, so that is the value c receives.
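A quick way to see which byte survives (a sketch; the value of 'ab' itself is implementation-defined, and the numbers below assume GCC):
#include <stdio.h>

int main(void) {
    int  full = 'ab';         // 24930 == 0x6162 under GCC's rule
    char kept = (char)full;   // keeps the least significant byte: 0x62 == 'b'
    printf("%d %c\n", full, kept);   // prints: 24930 b
    return 0;
}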
Your two characters are packed into the 4 bytes of an int: 24930 = 0x6162, that is, 0x61 'a' followed by 0x62 'b' = 'ab'.
98 is your 'b' (0x62 in hex, 98 in decimal; check with man ascii). Because of the overflow, and because your system is little-endian, the two chars are laid out in the int's 4 bytes like this:
0x62 0x61 0x00 0x00
Because only one char can be assigned to c (sizeof(char) is 1 byte, holding at most 256 distinct values), the value is truncated and only the first byte, your 'b', is kept.
You can test this easily with char c = 'abcd';: it will print 'D'.
A conforming char c holds a single character; multiple characters require a char *string, and you need to iterate through the string, e.g. something like:
#include <stdio.h>
#include <ctype.h> // needed for toupper()

int main()
{
    char *mystring = "ab"; // string literals are already null-terminated
    char *mychar = mystring;
    while (*mychar != '\0')
    {
        char c = toupper((unsigned char)*mychar); // ctype functions expect an unsigned char value
        printf("%c", c);
        mychar++;
    }
    return 0;
}
As an aside, toupper() returns an int, so there is an implicit type conversion there.
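Since the ctype functions require an argument representable as unsigned char (or EOF), the usual idiom casts first; a small sketch with a hypothetical wrapper:
#include <ctype.h>

// Hypothetical helper: safe even when plain char is signed and holds a negative value
char upper_ch(char c) {
    return (char)toupper((unsigned char)c);
}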

Dealing with char values over 127 in C

I'm quite new to C programming, and I have some problems trying to assign a value over 127 (0x7F) to a char array element. In my program, I work with generic binary data, and I don't face any problem printing a previously acquired byte stream (e.g. with fopen or fgets, then processed with some bitwise operations) as %c or %d. But if I try to print a character from its numerical value like this:
printf("%c\n", 128);
it just prints FFFD (the replacement character). Here is another example:
char abc[] = {126, 128, '\0'}; // Manually assigning values
printf("%c", abc[0]); // Prints "~", as expected
printf("%c", 121); // Prints "y"
printf("%c", abc[1]); // Should print "€", I think, but I get "�"
I'm a bit confused, since I can print every character below 128 in these ways. The reason I'm asking is that I need to generate a (pseudo)random byte sequence using the rand() function. Here is an example:
char abc[10];
srand(time(NULL));
abc[0] = rand() % 256; // Gives something between 0x00 and 0xFF ...
printf("%c", abc[0]); // ... but I get "�"
If this is of any help, the source code is encoded in UTF-8, but changing encoding doesn't have any effect.
In C, char is a different type from unsigned char and signed char. It has the range CHAR_MIN to CHAR_MAX, yet it has the same range as one of unsigned char/signed char. Typically these are 8-bit types, but they could be wider; see CHAR_BIT. So the typical range is [0, 255] or [-128, 127].
If char is unsigned, abc[1] = 128 is fine. If char is signed, abc[1] = 128 is implementation-defined (see below). The typical implementation-defined result is that abc[1] gets the value -128.
printf("%c\n", 128); will send the int value 128 to printf(). The "%c" will cast that value to an unsigned char. So far no problems. What appears on the output depends on how the output device handles code 128. Perhaps Ç, perhaps something else.
printf("%c", abc[1]; will send 128 or is I-D. If I-D and -128 was sent, then casting -128 to unsigned char is 128 and again the code for 128 is printed.
If the output device is expecting UTF8 sequences, a UTF8 sequence beginning with code 128 is invalid (it is an unexpected continuation byte) and many such systems will print the replacement character which is unicode FFFD.
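To see this, a byte >= 128 only displays on a UTF-8 terminal as part of a valid multi-byte sequence; a sketch printing é via its two UTF-8 bytes (assuming a UTF-8 output device):
#include <stdio.h>

int main(void) {
    putchar(0xC3);   // lead byte of U+00E9 (é) in UTF-8
    putchar(0xA9);   // continuation byte; alone it would show as U+FFFD
    putchar('\n');
    return 0;
}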
Converting a value outside the range of a signed char to char invokes implementation-defined behaviour:
the new type is signed and the value cannot be represented in it; either the result is implementation-defined or an implementation-defined signal is raised. (C11 §6.3.1.3 ¶3)
First of all, the signedness of char is implementation-defined.
If you have to deal with char values over 127, you can use unsigned char. It can handle 0-255.
Also, you should be using %hhu format specifier to print the value of an unsigned char.
If you're dealing with bytes, use unsigned char instead of char for your datatypes.
With regard to printing, you can print the bytes in hex instead of decimal or as characters:
printf("%02X", abc[0]);
You probably don't want to print these bytes as characters, as you'll most likely be dealing with UTF-8 character encoding which doesn't seem to be what you're looking for.
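Putting it together, a sketch that fills a buffer with (pseudo)random bytes and prints them as hex, going through unsigned char so no sign-extension occurs:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    unsigned char abc[10];                  // unsigned char always holds 0..255
    srand((unsigned)time(NULL));
    for (int i = 0; i < 10; i++) {
        abc[i] = (unsigned char)(rand() % 256);
        printf("%02X ", abc[i]);            // hex output: bytes are data, not text
    }
    putchar('\n');
    return 0;
}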

What's the difference between \xFF and 0xFF

1st - What's the difference between
#define s 0xFF
and
#define s '\xFF'
2nd - Why does the second one equal -1?
3rd - Why, when I try this (in the case of '\xFF'):
unsigned char t = s;
putchar(t);
unsigned int p = s;
printf("\n%d\n", p);
the output is
(blank)
-1
?
thanks :)
This
#define s 0xFF
is the definition of a hexadecimal integer constant. It has type int, and its value is 255 in decimal notation.
This
#define s '\xFF'
is the definition of an integer character constant represented by a hexadecimal escape sequence. It also has type int, but it represents a character, and its value is calculated differently.
According to the C Standard (¶10 of section 6.4.4.4 Character constants)
...If an integer character constant contains a single character or
escape sequence, its value is the one that results when an object with
type char whose value is that of the single character or escape
sequence is converted to type int.
It seems that by default your compiler treats char as signed char. So, according to the quote, the integer character constant '\xFF' has a negative value, because the sign bit (MSB) is set, and that value is -1.
If you set the compiler option that controls whether char is signed or unsigned to unsigned (e.g. -funsigned-char for GCC), then '\xFF' and 0xFF will have the same value, 255.
Take into account that hexadecimal escape sequences may be used in string literals along with any other escape sequences.
You can use \xFF in a string literal as the last character, and also as a middle character via string concatenation, but the same is not true for 0xFF.
The difference between '\xFF' and 0xFF is analogous to the difference between 'a' and the character code of 'a' (let's assume it is 0x61 for some implementation), with the one caveat that \xFF will consume further hex digits if they follow it in a string.
When you print the character '\xFF' using putchar, the output is implementation-dependent. But when you print it as an integer, due to the default argument promotions for variadic functions, it prints -1 or 255 on systems where char behaves as signed char or unsigned char respectively.
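A small demonstration of all three cases (a sketch; the -1 assumes char is signed, as it typically is on x86 Linux):
#include <stdio.h>

int main(void) {
    printf("%d\n", 0xFF);     // 255: a plain int constant
    printf("%d\n", '\xFF');   // -1 where char is signed: char 0xFF converted to int
    unsigned char t = '\xFF';
    printf("%d\n", t);        // 255: unsigned char keeps the bit pattern
    return 0;
}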

Specify a number literal as 8 bit?

unsigned char ascii;
int a = 0;
char string[4] = "foo";
ascii = (string[a] - 'A' + 10) * 16;
warning: conversion to ‘unsigned char’ from ‘int’ may alter its value
It seems that gcc treats char values and number literals as int by default. I know I could just cast the expression to (unsigned char), but how can I specify char literals and number literals as 8-bit without casts?
A similar issue:
Literal fractions are considered double by default, but they can be made float with a suffix:
3.1f
With the suffix, 3.1f is considered a float rather than a double.
In C, you cannot do calculations in anything shorter than int:
char a = '8' - '0'; /* '8' is int */
char a = (char)'8' - '0'; /* (char)'8' is converted to `int` before the subtraction */
char a = (char)'8' - (char)'0'; /* both (char)'8' and (char)'0' are converted */
The C language doesn't provide any way of specifying a literal with type char or unsigned char. Use the cast.
By the way, the result of your calculation is outside the range of unsigned char, so the warning is quite correct - conversion will alter its value. C doesn't provide arithmetic in any type smaller than an int. In this case I suppose that what you want is modulo-256 arithmetic, and I think that gcc will recognise that, and will not emit the warning with the casts in place. But as far as the C language is concerned, that calculation is done in the larger type and then converted down to unsigned char for storage in ascii.
You can specify character literals, which have type int in C (char in C++), with specific numbers using octal or hexadecimal notation. For example, '\012' is octal 12, or decimal 10. Alternatively, you could write '\x0a' to mean the same thing.
However, even if you did this (and the calculation didn't overflow), it might not get rid of the warning, as the C language specifies that all operands are promoted to at least int (or unsigned int, depending on the operand types) before the calculation is done.
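Since C has no suffix for an 8-bit constant, the idiomatic fix is an explicit cast at the point of storage; a sketch (the "A1" input is a hypothetical stand-in for a hex-digit string):
#include <stdio.h>

int main(void) {
    char string[] = "A1";   // hypothetical hex-digit input
    int a = 0;
    // the arithmetic still happens in int; the cast documents the intended narrowing
    unsigned char ascii = (unsigned char)((string[a] - 'A' + 10) * 16);
    printf("%u\n", ascii);  // prints 160 (0xA0)
    return 0;
}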
