printing the char value of each wide character's bytes - c

when running the following:
char acute_accent[7] = "éclair";
int i;
for (i=0; i<7; ++i)
{
printf("acute_accent[%d]: %c\n", i, acute_accent[i]);
}
I get:
acute_accent[0]:
acute_accent[1]: �
acute_accent[2]: c
acute_accent[3]: l
acute_accent[4]: a
acute_accent[5]: i
acute_accent[6]: r
which makes me think that the multibyte character é is 2 bytes wide.
However, when running this (after ignoring the compiler's warning about a multi-character character constant):
printf("size: %lu",sizeof('é'));
I get size: 4.
What's the reason for the different sizes?
EDIT: This question differs from this one because it is more about multibyte characters encoding, the different UTFs and their sizes, than the mere understanding of a size of a char.

The discrepancy arises because, in your first example, the compiler encoded the character é (code point U+00E9) as the two-byte UTF-8 sequence 0xC3 0xA9.
See here:
http://www.fileformat.info/info/unicode/char/e9/index.htm
And as described by dbush, the character constant 'é' has type int, so sizeof reports the size of an int: four bytes on your platform. Its value is implementation-defined; with UTF-8 source the compiler sees two bytes between the quotes, hence the multi-character constant warning.
Part of your confusion stems from relying on implementation-defined behavior: the standard does not fix the encoding of an unprefixed literal.
To avoid depending on the implementation, you should state the encoding of string literals explicitly.
For example:
char acute_accent[7] = u8"éclair";
This is very bad form, because unless you count the bytes yourself you cannot know the exact length of the string. Indeed, my compiler (g++) rejects it: the text is 7 bytes, but 8 bytes with the null character at the end, so the initializer does not fit. (A C compiler accepts it but drops the terminator, leaving an array that is not a valid string.)
It's much safer to use this instead:
const char* acute_accent = u8"éclair";
Notice how your string actually occupies 8 bytes:
#include <stdio.h>
#include <string.h> // strlen
int main() {
const char* a = u8"éclair";
printf("String length : %zu\n", strlen(a)); // %zu is the specifier for size_t
// Add +1 for the null byte
printf("String size : %zu\n", strlen(a) + 1);
return 0;
}
The output is:
String length : 7
String size : 8
Also note that the size of a character constant differs between C and C++!
#include <stdio.h>
int main() {
printf("%zu\n", sizeof('a'));
printf("%zu\n", sizeof('é'));
return 0;
}
In C the output is:
4
4
While in C++ the output is:
1
4

From the C99 standard, section 6.4.4.4:
2 An integer character constant is a sequence of one or more multibyte
characters enclosed in single-quotes, as in 'x'.
...
10 An integer character constant has type int.
sizeof(int) on your machine is probably 4, which is why you're getting that result.
So 'é', 'c', 'l' are all integer character constants, so all are of type int whose size is 4. The fact that some are multibyte and some are not doesn't matter in this regard.

Related

What's the length of a string in C when I use the "\x00" to interrupt a string?

char buf1[1024] = "771675175\x00AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA";
char buf2[1024] = "771675175\x00";
char buf3[1024] = "771675175\0AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA";
char buf4[1024] = "771675175\0";
char buf5[1024] = "771675175";
buf5[9] = 0;
char buf6[1024] = "771675175";
buf6[9] = 0;
buf6[10] = "A";
printf("%d\n", strlen(buf1));
printf("%d\n", strlen(buf2));
printf("%d\n", strlen(buf3));
printf("%d\n", strlen(buf4));
printf("%d\n", strlen(buf5));
printf("%d\n", strlen(buf6));
if("\0" == "\x00"){
printf("YES!");
}
Output:
10
9
9
9
9
9
YES!
As shown above, I use "\x00" to interrupt a string.
As far as I know, when strlen() meets the "\x00", it returns the number of characters before the terminator, not including the "\x00".
But here, why is the length of the buf1 equal to 10?
As pointed out in the comments section, hexadecimal escape sequences have no length limit and end at the first character that is not a valid hexadecimal digit. All of the subsequent A characters are valid hexadecimal digits, so they become part of the escape sequence. The resulting value does not fit in a char, which violates a constraint of the language: compilers typically warn that the hex escape sequence is out of range, and the byte actually stored is not something you can rely on.
You should change
char buf1[1024] = "771675175\x00AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA";
to:
char buf1[1024] = "771675175\x00" "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA";
Also, strlen returns a value of type size_t. The correct printf format specifier for size_t is %zu, not %d. Even if %d works on your platform, it may fail on other platforms.
The following program will print the desired result of 9:
#include <stdio.h>
#include <string.h>
int main( void )
{
char buf1[1024] = "771675175\x00" "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA";
printf( "%zu\n", strlen(buf1) );
}
Also, it is worth noting that the following line does not make sense:
if("\0" == "\x00")
In that if condition, you are comparing the addresses of two pointers, which point to string literals. It depends on the compiler whether it is storing both string literals in the same memory location. Some compilers may merge identical string literals into the same memory location, some may not. Normally, this is irrelevant to the programmer. Therefore, it does not make much sense to compare these memory addresses.
You probably wanted to write the following instead, which will compare the actual character values:
if( '\0' == '\x00' )
There is a big difference between a string literal and a character constant.

In C, what would happen if I put 'successive wchar_t characters' into a wchar_t variable?

#include <stdio.h>
wchar_t wc = L' 459';
printf("%d", wc); //result : 32
I know the 'space' is 'decimal 32' in ASCII code table.
What I don't understand is this: as far as I know, if a variable does not have enough space to store a value, it keeps the 'last digits' of the original value.
For example, if I put the binary value '1100 1001 0011 0110' into a single-byte variable, it would hold '0011 0110', which is 'the last byte' of the original binary value.
But the code above shows 'the first byte' of the original value.
I'd like to know what happens at the memory level when I execute the code above.
unsigned long long x = 0x0041004200430044ULL;
printf("%016llx\n", x); //prints 0041004200430044
wchar_t wc;
wc = x;
printf("%04X\n", wc); //prints 0044 as you expect
wc = L'\x0041\x0042\x0043\x0044'; //uses the first character
printf("%04X\n", wc); //prints 0041
If you assign an integer value that is too large, the conversion keeps only the low-order bits that fit in 2 bytes, here 0x0044.
If you try to put several characters into one wide character constant, the value is implementation-defined; Visual Studio takes the first character, 0x0041, which fits. L'x' is meant to be a single wide character.
VS2019 will issue a warning for wchar_t wc = L' 459' unless the warning level is set below 3, which is not recommended. Use warning level 3 or higher.
In C++, wchar_t is a built-in type, not a typedef for unsigned short (in C it is a typedef); either way it is 2 bytes on Windows and 4 bytes on Linux.
Note that 'abcd' packs four characters into a 4-byte int. With the L prefix each character takes 2 bytes on Windows, so the four characters of L'abcd' would need 8 bytes, which cannot fit in a single wchar_t.
To see what is inside wc, let's look at the Unicode character L'X', whose UTF-16 encoding is 0x0058 (code points below 128 have the same values as ASCII).
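#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>
int main(void)
{
wchar_t wc = L'X';
printf("%lc\n", (wint_t)wc); // narrow printf throughout: wide and narrow output should not be mixed on one stream
unsigned char buf[256];
memcpy(buf, &wc, 2); // copy the 2 bytes of a Windows wchar_t
for (int i = 0; i < 2; i++)
printf("%02X ", buf[i]);
printf("\n");
return 0;
}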
The output will be 58 00. It is not 00 58 because Windows runs on little-endian systems, which store the low-order byte first.
Another oddity is that UTF-16 uses 4 bytes (two 16-bit units, a surrogate pair) for some code points, so you will get a warning for this line:
wchar_t wc = L'😀';
Instead, use a string:
const wchar_t *wstr = L"😀";
::MessageBoxW(0, wstr, 0, 0); //console may not display this correctly
This string occupies 6 bytes (2 elements plus the null terminator, at 2 bytes each)

Displaying wide character in hexadecimal hex shows unexpected result

I am trying to display a wide character in hexadecimal, but it gives unexpected results: the output is always a 2-digit hex value. My code:
#include "stdlib.h"
#include "stdio.h"
#include"wchar.h"
#include "locale.h"
int main(){
setlocale(LC_ALL,"");
wchar_t ch;
wscanf (L"%lc",&ch);
wprintf(L"%x \n",ch);
return 0;
}
input : Ω
result: 0xea
expected result : 0xcea9
I changed the setlocale argument several times, but the result is always the same.
Note:
When the input value fits in 1 byte, it works as expected.
Note that you should use <..> for including standard headers. The call wprintf(L"%x \n", ch) is most probably undefined behavior: %x expects an unsigned int, and ch (a wchar_t) is possibly not one, so cast it first.
You are expecting wide characters to be stored as UTF-8. They are not, and that would not make much sense. Your program reads a sequence of bytes in the multibyte encoding, and that sequence is then converted (depending on the locale) to the wide character encoding. The wide character encoding (usually) stays the same and should be UTF-32 on Linux. The locale affects how multibyte characters are converted to wide characters and back, not the representation of wide characters.
The following program checks the result of wscanf and casts the character before printing it:
#include <stdlib.h>
#include <stdio.h>
#include <wchar.h>
#include <locale.h>
int main(){
setlocale(LC_ALL,"");
wchar_t ch;
int cnt = wscanf(L"%lc",&ch);
if (cnt != 1) { /* handle error */ abort(); }
wprintf(L"%x\n", (unsigned int)ch);
return 0;
}
On Linux, when given the Greek capital letter omega Ω (U+03A9) as input, the program outputs 3a9. What actually happens is that the program reads the UTF-8 encoded character from the terminal, that is, the two bytes 0xCE 0xA9, converts them to UTF-32, and stores the result in the wide character. You can convert the wide character back from the wide character encoding (UTF-32) to the multibyte encoding (UTF-8 should be the default, but it depends on the locale) and print the bytes that represent the character:
char tmp[MB_CUR_MAX];
int len = wctomb(tmp, ch); // prefer wcrtomb
if (len < 0) { /* handle error */ abort(); }
for (int i = 0; i < len; ++i) {
wprintf(L"%hhx", (unsigned char)tmp[i]);
}
wprintf(L"\n");
That will output cea9 on my platform.

Printf output of pointer string explanation from an interview

I had an interview and I was given this code and asked what is the output for each one of these printf statements.
I have my answers as comments, but I am not sure about the rest.
Can anyone explain the different outputs for statements 1, 3 and 7 and why?
Thank you!
#include <stdio.h>
int main(int argc, const char * argv[]) {
char *s = "12345";
printf("%d\n", s); // 1.Outputs "3999" is this the address of the first pointer?
printf("%d\n", *s); // 2.The decimal value of the first character
printf("%c\n", s); // 3.Outputs "\237" What is this value?
printf("%c\n", *s); // 4.Outputs "1"
printf("%c\n", *(s+1)); // 5.Outputs "2"
printf("%s\n", s); // 6.Outputs "12345"
printf("%s\n", *s); // 7.I get an error, why?
return 0;
}
This call
printf("%d\n", s);
has undefined behavior because an invalid format specifier is used with a pointer.
This call
printf("%d\n", *s);
outputs the internal code (for example ASCII code) of the character '1'.
This call
printf("%c\n", s);
has undefined behavior due to using an invalid format specifier with a pointer.
These calls
printf("%c\n", *s);
printf("%c\n", *(s+1));
are valid. The first one outputs the character '1' and the second one outputs the character '2'.
This call
printf("%s\n", s);
is correct and outputs the string "12345".
This call
printf("%s\n", *s);
is invalid because an invalid format specifier is used with an object of the type char.
1. This code is undefined behaviour (UB). You are passing a pointer where the function requires an int value. For example, on a 64-bit architecture a pointer is 64 bits and an int is 32 bits, so you may be printing a truncated value.
2. You are passing the first char value (automatically converted to an int by the compiler) and printing it in decimal. You probably got 49 (the ASCII code for '1'). This is legal, but beware of surprises: you can get negative values if char is signed on your platform.
3. You are printing the passed pointer reinterpreted as a char value. This is undefined behaviour, as you cannot pass a pointer where a char value is expected.
4. You are printing the value s points to as a char, so you get the first character of the string "12345" ('1').
5. You are printing the char after the first one pointed to by s, so you get the second character of the string ('2').
6. You are printing the string pointed to by s, so you get the whole string. This is legal and is indeed the common way to print a string.
7. You are passing the first character of the string to be interpreted as a pointer to a null-terminated string (which it isn't). This is undefined behaviour again. A SIGSEGV is common in this case (but not guaranteed :) ). The signal is sent when the program tries to access unallocated memory before reaching the supposed null character that terminates the string (but it could find a '\0' on the way and just print rubbish).
The 7th line fails because %s expects a C-style string as its argument, and you are passing a character instead.
Take a look at:
What does %s and %d mean in printf in the C language
C style strings guide
I used the following online C compiler in order to run your code,
and here are the results:
1. 4195988 - undefined behaviour (UB), manifesting here as the address
of the char array as you stated (for a 64 bit address you might or
might not get truncation)
2. 49 - ASCII value of '1'
3. � - undefined behaviour, manifesting here as unsupported ASCII value
for a truncation of the address of the array of chars
(placing 32-bit address into a char - assuming a 32-bit system)
4. 1 - obvious
5. 2 - obvious
6. 12345 - obvious
7. Segmentation fault - undefined behaviour, trying to place the first char
of a char array into a string reserved position
(placing char into a string)
Note on point number 3: we can deduce what took place during run-time.
In the specific example provided in the question -
printf("%c\n", s); // 3.Outputs "\237". What is this value?
This is a hardware/compiler/OS related behavior when handling the UB.
Why? Due to the output "\237" -> this implies truncation under the specific hardware system executing this code!
Please see the explanation below (assumption - 32-bit system):
char *s = "12345"; // Declaring a char pointer pointing to a char array
char c = s; // Placement of the pointer into a char - our UB
printf("Pointer to character array: %08x\n", s); // Get the raw bytes
printf("Pointer to character: %08x\n", c); // Get the raw bytes
printf("%c\n", s); // place the pointer as a character
// display is dependent on the ASCII value and the OS
// definitions for 128-255 ASCII values
The outputs:
Pointer to character array: 004006e4 // Classic 32-bit pointer
Pointer to character: ffffffe4 // Truncation to a signed char
// (Note signed MSB padding to 32 bit display)
� // ASCII value E4 = 228 is not displayed properly
The final printf command is equivalent to char c = s; printf("%c\n", c);.
Why? Thanks to truncation.
An additional example with a legitimate ASCII character output:
char *fixedPointer = 0xABCD61; // Declaring a char pointer pointing to a dummy address
char c = fixedPointer; // Placement of the pointer into a char - our UB
printf("Pointer to 32-bit address: %08x\n", fixedPointer); // Get the raw bytes
printf("Pointer to character: %08x\n", c); // Get the raw bytes
printf("%c\n", fixedPointer);
And the actual outputs:
Pointer to 32-bit address: 00abcd61
Pointer to character: 00000061
a

multi-character character constant and overflow in implicit constant conversion

I wrote a program for blackjack.
It does not compile cleanly; I get these warnings:
multi-character character constant and overflow in implicit constant conversion
Can anyone tell me what's going on?
I have been thinking about it for a long time. Please help me.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
int flower;
int k;
int add [13]={1,2,3,4,5,6,7,8,9,10,10,10,10};
char flower_all [4]={'\3','\4','\5','\6'};
char number_all [13]={'A','2','3','4','5','6','7','8','9','10','J','Q','K'};
char player_f[13],player_n[13];
char com_f[13],com_n[13];
int poker [52]={0};
int i,j,y,num,ans;
int player_p=0,com_p=0;
void wash (){
int k;
k=rand()%52;
while(poker[k]==1)
{
k=rand()%52;
}
poker[k]=1;
}
void give_card_p (){
char player_f[13],player_n[13];
int i,k;
int ans;
printf("Would you like another card? 1: yes 2: no");
scanf("%d",&ans);
fflush(stdin);
while (ans==1){
wash();
player_f[i]=flower_all[k/13];
player_n[i]=number_all[k%13];
player_p+=add[k%13];
continue;
if (player_p>21)
break;
}
}
int main (){
srand(time(0));
char player_f[13],player_n[13];
int k;
for(i=0;i<2;i++){
wash ();
player_f[i]=flower_all[k/13];
player_n[i]=number_all[k%13];
player_p+=add[k%13];
}
for (i=0;i<2;i++){
wash ();
com_f[i]=flower_all[k/13];
com_n[i]=number_all[k%13];
com_p+=add[k%13];
}
printf("%c%c",player_f[i],player_n[i]);
fflush(stdin);
return 0;
}
Single quotes ' denote 'character constants'. In the following line
char number_all [13]={'A','2','3','4','5','6','7','8','9','10','J','Q','K'};
the '10' is a 'multi-character constant'. This is 'implementation defined', that is, different compilers are free to interpret it in different ways. Given the warnings you quoted, it is likely the source of your problem: '10' is two characters and cannot be stored in a single char. I would suggest using an enumerated type to represent your cards.
I assume you are getting these two warnings:
h.c:9:59: warning: multi-character character constant
h.c:9: warning: overflow in implicit constant conversion
This happens because in the program,
char number_all [13]={'A','2','3','4','5','6','7','8','9','10','J','Q','K'};
you have '10', which is a multi-character constant; the compiler cannot convert it into a single char without overflow.
From Wikipedia:
Individual character constants are single-quoted, e.g. 'A', and have
type int (in C++, char). The difference is that "A" represents a
null-terminated array of two characters, 'A' and '\0', whereas 'A'
directly represents the character value (65 if ASCII is used). The
same backslash-escapes are supported as for strings, except that (of
course) " can validly be used as a character without being escaped,
whereas ' must now be escaped.
A character constant cannot be empty (i.e. '' is invalid syntax),
although a string may be (it still has the null terminating
character). Multi-character constants (e.g. 'xy') are valid, although
rarely useful — they let one store several characters in an integer
(e.g. 4 ASCII characters can fit in a 32-bit integer, 8 in a 64-bit
one). Since the order in which the characters are packed into an int
is not specified, portable use of multi-character constants is
difficult.
