Assign Unicode character to a char - c

I want to do the following assignment:
char complete = '█', blank='░';
But I get the following warnings (I'm using the latest version of gcc):
trabalho3.c: In function ‘entrar’:
trabalho3.c:243:9: warning: multi-character character constant [-Wmultichar]
char complete = '█', blank='░';
^
trabalho3.c:243:3: warning: overflow in implicit constant conversion [-Woverflow]
char complete = '█', blank='░';
^
trabalho3.c:244:23: warning: multi-character character constant [-Wmultichar]
char complete = '█', blank='░';
^
trabalho3.c:244:17: warning: overflow in implicit constant conversion [-Woverflow]
char complete = '█', blank='░';
^
How can I do this assignment?

When I copy those lines from the posting and echo the result through a hex dump program, the output is:
0x0000: 63 68 61 72 20 63 6F 6D 70 6C 65 74 65 20 3D 20 char complete =
0x0010: 27 E2 96 88 27 2C 20 62 6C 61 6E 6B 3D 27 E2 96 '...', blank='..
0x0020: 91 27 3B 0A .';.
0x0024:
And when I run it through a UTF-8 decoder, the two block characters are identified as:
0xE2 0x96 0x88 = U+2588 (FULL BLOCK)
0xE2 0x96 0x91 = U+2591 (LIGHT SHADE)
And if the characters are indeed 3 bytes long, trying to store all three bytes into a single character is going to cause problems.
You need to validate these observations; there is a lot of potential for the data being filtered between your system and mine. However, the chances are that if you take a look at the source code using similar tools, you will find that the characters are either UTF-8 or UTF-16 encoded, and neither of these will fit into a single byte. If you think they are characters in a single-byte code set (CP-1252 or something similar, perhaps), you should show the hex dump for the line of code containing the initializations, and identify the platform and code set you're working with.
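If you want to check this on your own system, a small sketch along these lines (assuming your source file is saved in the same encoding your editor reports) prints the bytes the compiler actually sees in the literal:

#include <stdio.h>

int main(void)
{
    const char *s = "█░";                        /* the two block characters from the question */
    for (const unsigned char *p = (const unsigned char *)s; *p != '\0'; p++)
        printf("%02X ", *p);                     /* E2 96 88 E2 96 91 if the file is UTF-8 */
    putchar('\n');
    return 0;
}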

You can store those characters as:
a UTF-8 string, const unsigned char complete[] = u8"█";
a wide character defined in <wchar.h>, const wchar_t complete = L'█';
a UTF-32 character defined in <uchar.h>, const char32_t complete = U'█';
a UTF-16 character, although this is generally a bad idea.
Use UTF-8 when you can, something else when you have to. The 32-bit type is the only one that guarantees fixed width. There are functions in the standard library to read and write wide-character strings, and in many locales, you can read and write UTF-8 strings just like ASCII once you call setlocale() or convert them to wide characters with mbstowcs().
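For instance, a minimal sketch of the wide-character option, assuming the environment provides a UTF-8 (or otherwise capable) locale:

#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    setlocale(LC_ALL, "");                       /* adopt the user's locale, e.g. en_US.UTF-8 */

    const wchar_t complete = L'█', blank = L'░';
    wprintf(L"%lc%lc%lc\n", complete, blank, complete);

    /* Alternatively, keep them as UTF-8 strings and print them like ordinary text:
       const char complete_s[] = u8"█"; fputs(complete_s, stdout); */
    return 0;
}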

Related

Merge ascii char using C [duplicate]

I tried the following code but couldn't get the desired output. The result should be AB, and it should come from the single variable C:
#include <stdio.h>

int main()
{
    int a = 'A';
    int b = 'B';
    unsigned int C = a << 8 | b;
    printf(" %c\n", C);
    return 0;
}
%c will print a single character. If you want to print a string, you have to use %s and provide a pointer to this string. Strings in C have to be null-terminated, meaning they require one additional character after the text and this character carries the value \0 (zero).
You could do this in an int, but you'd have to understand some concepts first.
If you are using a computer with Intel architecture, integer variables larger than one byte will store data in reverse order in memory. This is called little-endianness.
So a number like 0x11223344 (hexadecimal) will be stored in memory as the sequence of bytes 44 33 22 11.
'A' is equivalent to the number 65, or 0x00000041, and if put in a 32-bit integer will be stored as 41 00 00 00.
When you do 'A' << 8 | 'B' you create the number 0x00004142, but in memory it is actually stored as 42 41 00 00 (equivalent to the string "BA\0\0"). It's in the opposite order of what you're trying to do, but since it is technically null-terminated, it's a valid string.
You can print it using printf("%s", (char *)&C);
If you're on a big-endian architecture (such as SPARC, or a PowerPC running big-endian), you will have to work out where the null terminator ends up, but I think I already gave you enough information to figure out what is going on for yourself.
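As an illustration of the idea, here is a sketch that assumes a little-endian machine, so it is not portable:

#include <stdio.h>

int main(void)
{
    int a = 'A';
    int b = 'B';

    /* Put 'B' in the high byte and 'A' in the low byte, giving 0x4241.
       On a little-endian machine the bytes sit in memory as 41 42 00 00,
       i.e. the null-terminated string "AB". */
    unsigned int C = (unsigned int)b << 8 | (unsigned int)a;

    printf("%s\n", (char *)&C);                  /* prints "AB" on little-endian systems */
    return 0;
}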
With %c you're printing only a single character: printf converts the value of C to unsigned char, i.e. its low-order byte 0x42, so you get "B" as output regardless of endianness. To get both characters out of the single variable you have to print its bytes in memory as a string, as shown above.

How is a string represented in IA32 assembly?

A string is represented as an array of char. For example, if I have a string "abcdef" at address 0x80000000, is the following correct?
0x80000008
0x80000004: 00 00 46 45
0x80000000: 44 43 42 41
(The stack grows downward, so I have listed the addresses in decreasing order.)
The lower addresses are always first - even in the stack. So your example should be:
80000000: 41 42 43 44
80000004: 45 46 00 00
Your example is actually the string: "ABCDEF". The string "abcdef" should be:
80000000: 61 62 63 64
80000004: 65 66 00 00
Also, in memory dumps, the default radix is 16 (hexadecimal), so "0x" is redundant. Notice that the character codes are also in hexadecimal. For example the string "JKLMNOP" will be:
80000000: 4A 4B 4C 4D
80000004: 4E 4F 50 00
No, strings are not usually placed on the stack, only in data memory. What is sometimes placed on the stack is a pointer to a string, i.e. the start address of the string.
Your (and my) examples concern the so-called ASCII encoding, but many other character encoding schemes are possible. For example, EBCDIC also uses 8-bit codes, but different ones than ASCII.
8-bit codes are not mandatory, either: UTF-32, for example, uses 32-bit codes. Nor is a fixed code size mandatory: UTF-8 uses a variable code size of 1 to 4 bytes, depending on the character encoded.
That isn’t actually assembly. You can get an example of real assembly by running gcc -S. Traditionally in x86 assembly, you would declare a label followed by a string, declared as db (data bytes). If it were a C-style string, it would be followed by db 0; modern assemblers also have a directive such as .asciz that adds the zero byte automatically. If it were a Pascal-style string, it would be preceded by an integer containing its size. These would be laid out contiguously in memory, and you would get the address of the string by using the label, similarly to how you would get the address of a branch target from its label.
Which option you would use depends on what you’re going to do with it. If you’re passing to a C standard library function, you probably want a C-style string. If you’re going to be writing it with write() or send() and copying it to buffers with bounds-checking, you might want to store its length explicitly, even though no system or library call uses that format any more. Good, secure code shouldn’t use strcpy() either. However, you can both store the length and null-terminate the string.
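As a rough C illustration of the two layouts described above (the array names are made up for the example):

#include <stdio.h>
#include <string.h>

/* C-style: a terminating zero byte marks the end; the length is found by scanning. */
static const char c_style[] = "abcdef";                      /* 61 62 63 64 65 66 00 */

/* Pascal-style: the length is stored up front; no terminator is needed. */
static const unsigned char pascal_style[] = { 6, 'a', 'b', 'c', 'd', 'e', 'f' };

int main(void)
{
    printf("C-style length:      %zu\n", strlen(c_style));
    printf("Pascal-style length: %u\n", (unsigned)pascal_style[0]);
    return 0;
}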
Some old code for MS-DOS used strings terminated with $, a convention copied from CP/M for compatibility with 8-bit code on the Z80. There were a bunch of these legacies in OSes up to Windows ME.

char * versus unsigned char * and casting

I need to use the SQLite function sqlite3_prepare_v2() (https://www.sqlite.org/c3ref/prepare.html).
This function takes a const char * as its second parameter.
On the other hand, I have prepared an unsigned char * variable v which contains something like this:
INSERT INTO t (c) VALUES ('amitié')
In hexadecimal representation (I cut the line):
49 4E 53 45 52 54 20 49 4E 54 4F 20 74 20 28 63 29
20 56 41 4C 55 45 53 20 28 27 61 6D 69 74 69 E9 27 29
Note the 0xE9 representing the character é.
In order for this piece of code to be built properly, I cast the variable v with (const char *) when I pass it, as an argument, to the sqlite3_prepare_v2() function...
What comments can you make about this cast? Is it really very very bad?
Note that I have been using an unsigned char * pointer to be able to store characters between 0x00 and 0xFF with one byte only.
The source data is coming from an ANSI encoded file.
In the documentation for the sqlite3_prepare_v2() function, I'm also reading the following comment for the second argument of this function:
/* SQL statement, UTF-8 encoded */
What troubles me is the type const char * for the function second argument... I would have been expecting a const unsigned char * instead...
To me - but then again I might be totally wrong - there are only 7 useful bits in a char (one byte), the most significant bit (leftmost) being used to denote the sign of the byte...
I guess I'm missing some kind of point here...
Thank you for helping.
You are correct.
For a UTF-8 input, the sqlite3_prepare_v2 method really should be asking for a const unsigned char *, as all 8 bits are being used for data. Its implementation certainly shouldn't be using a signed comparison to check the top bit, because a simple compiler flag can switch the default for char between signed and unsigned, and code that assumes one or the other would break.
As for your concerns over the cast, this is one of the more benign ones. Casting away signedness on int or float is usually a very bad thing (TM) - or at least a clear indicator that you have a problem.
When dealing with pure ASCII, you are correct that there are 7 bits of data, but the remaining 8th bit was historically used as a parity bit, not as a sign bit.
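For reference, the cast in context looks something like this; sqlite3_prepare_v2 is the real SQLite entry point, but the surrounding function and variable names are illustrative only:

#include <sqlite3.h>

/* v points to the SQL text held in an unsigned char buffer. */
int prepare_it(sqlite3 *db, const unsigned char *v, sqlite3_stmt **stmt)
{
    /* The cast only changes the pointer's type; the bytes are passed through untouched. */
    return sqlite3_prepare_v2(db, (const char *)v, -1, stmt, NULL);
}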

Can we reverse this string in C? [closed]

It is very simple, I hope. These are 20 hex-looking values separated by backslashes \, and the C compiler does indeed make them a string of 33 characters, because \NUMBER is a single value, while \NUMBER+ALPHA and \ALPHA+NUMBER are 2 bytes each.
char str[] =
"\b3\bc\77\7\de\ed\44\93\75\ce\c0\9\19\59\c8\f\be\c6\30\6";
// when saved, this is 33 bytes
My question is: after it has been saved as 33 bytes on disk, can we (after reading those 33 bytes back) reconstruct the same representation we had in the C source, so that the program prints "\b3\bc\77\7\de\ed\44\93\75\ce\c0\9\19\59\c8\f\be\c6\30\6"? Any problem solvers here?
"\b3\bc\77\7\de\ed\44\93\75\ce\c0\9\19\59\c8\f\be\c6\30\6";
//when read back program should output this ^
The string literal you have:
"\b3\bc\77\7\de\ed\44\93\75\ce\c0\9\19\59\c8\f\be\c6\30\6"
will produce undefined behavior according to C89 (I'm not sure my source for C89 can be trusted, but the point below still holds) and implementation-defined behavior according to the C11 standard. In particular, \d, \e, \9 and \c are escape sequences not defined in the standard. gcc will not complain about \e, since it is a GNU extension which represents ESC.
Since the behavior is implementation-defined, it is necessary for us to know what compiler you are using, as the result may vary.
Another thing: you didn't show clearly that you are aware of the content of the string after compilation. (A clearer way to show it would be to include a hex dump of what the string looks like in memory, demonstrating that you understand the escape sequences.)
This is how the looks-like-hex string is recognized by the compiler:
String: \b 3 \b c \77 \7 \d e \e d \44 \9 3 \75 \c e \c 0 \9 \1 9 \5 9 \c 8 \f \b e \c 6 \30 \6
Char: \b 3 \b c \77 \7 d e \e d \44 9 3 \75 c e c 0 9 \1 9 \5 9 c 8 \f \b e c 6 \30 \6
Hex: 08 33 08 63 3f 07 64 65 1b 64 24 39 33 3d 63 65 63 30 39 01 39 05 39 63 38 0c 08 65 63 36 18 06 00
Enough beating around the bush. I assume you are compiling with gcc (ignoring the warnings), that the whole char[] is written to the file with fwrite when the program runs, and that only lower-case characters are used in the source code.
You should map all possible escape sequences \xy that look like a 2-digit hex number to sequences of 1 or 2 bytes. There are not that many of them, and you can write a program to simulate the behavior of the compiler:
If x is any of a, b, f (other escape letters such as n or t are not hex digits, so they cannot appear here) or e (due to the GNU extension), it is mapped to the corresponding special character.
(If you use uppercase characters in the source code, note that \E also maps to ESC.)
If xy forms a valid octal sequence, it is mapped to the character with the corresponding value.
Otherwise, if x alone forms a valid octal sequence, it is mapped to the character with the corresponding value.
Otherwise, x stays the same.
If y is not consumed, y stays the same.
Note that it is possible for the same actual char to come about in 2 different ways. For example, \f and \14 will map to the same char. In such cases, it might not be possible to get back the string in the source. The most you can do is guess what the string in the source might have been.
Using your string as an example: at the beginning, 08 33 can come from \b3, but it can also come from \10\63.
Using the map produced, there are cases where the mapping is clear: a hex value larger than 3f cannot come from an octal escape sequence, and must come from direct interpretation of a character in the original string. From this, you know that if an e (65) is encountered, it must be the 2nd character in a looks-like-hex token.
You can use the map as a guide, and the simulation as a method to check whether the map will produce back the ASCII code. Without knowing anything about the string declared in the source code, the most you can derive is a list of candidates for the original (broken) string in the source code. You can reduce the size of the list of candidates if you at least know the length of the string in the source code.
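If you want to experiment, here is a rough sketch (my own code, covering only the escapes that occur in this question) of how a single escape after the backslash is interpreted:

#include <stdio.h>

/* p points just past a backslash; *end is set past the characters consumed. */
static unsigned char interpret_escape(const char *p, const char **end)
{
    if (*p == 'a') { *end = p + 1; return '\a'; }
    if (*p == 'b') { *end = p + 1; return '\b'; }
    if (*p == 'f') { *end = p + 1; return '\f'; }
    if (*p == 'e') { *end = p + 1; return 0x1B; }      /* GNU extension for ESC */
    if (*p >= '0' && *p <= '7') {                      /* up to 3 octal digits */
        unsigned v = 0;
        int n = 0;
        while (n < 3 && p[n] >= '0' && p[n] <= '7') {
            v = v * 8 + (unsigned)(p[n] - '0');
            n++;
        }
        *end = p + n;
        return (unsigned char)v;
    }
    *end = p + 1;                                      /* \9, \c, \d: gcc keeps the character */
    return (unsigned char)*p;
}

int main(void)
{
    const char *src = "b3", *end;                      /* the "\b3" at the start of the literal */
    unsigned char byte = interpret_escape(src, &end);
    printf("%02X", byte);                              /* 08 */
    while (*end)
        printf(" %02X", (unsigned char)*end++);        /* 33: the '3' is an ordinary character */
    putchar('\n');
    return 0;
}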

how do I determine if this is latin1 or utf8?

I have the string "Artîsté" in a latin1 table. I use a C MySQL connector to get the string out of the table. I have character_set_connection set to utf8.
In the debugger it looks like :
"Art\xeest\xe9"
If I print the hex values with printf ("%02X", (unsigned char) a[i]); for each char I get
41 72 74 EE 73 74 E9
How do I know if it is utf8 or latin1?
\x74\xee\x73 isn't a valid UTF-8 sequence, since UTF-8 never has a run of only 1 byte with the top bit set. So of the two, it must be Latin-1.
However, if you see bytes that are valid UTF-8 data, then it's not always possible to rule out that it might be Latin-1 that just so happens to also be valid UTF-8.
Latin-1 does have some byte values that should never appear in ordinary text (the ASCII control characters 0x00-0x1F and the 0x7F-0x9F range), so there are some UTF-8 strings that you can be sure are not meant as Latin-1. But in my experience it's common enough to see Windows CP1252 mislabelled as Latin-1 that rejecting all those code points is fairly futile, except in the case where you're converting from another charset to Latin-1 and want to be strict about what you output. CP1252 has a few unused bytes too, but not as many.
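A minimal sketch of that check (it validates only the lead/continuation bit patterns, not overlong forms or code-point ranges):

#include <stdbool.h>
#include <stdio.h>

static bool looks_like_utf8(const unsigned char *s, size_t n)
{
    size_t i = 0;
    while (i < n) {
        unsigned char c = s[i];
        size_t extra;
        if (c < 0x80)                extra = 0;        /* plain ASCII */
        else if ((c & 0xE0) == 0xC0) extra = 1;        /* 110xxxxx: 2-byte sequence */
        else if ((c & 0xF0) == 0xE0) extra = 2;        /* 1110xxxx: 3-byte sequence */
        else if ((c & 0xF8) == 0xF0) extra = 3;        /* 11110xxx: 4-byte sequence */
        else return false;                             /* stray continuation or invalid lead byte */
        if (i + extra >= n) return false;              /* sequence runs past the end */
        for (size_t k = 1; k <= extra; k++)
            if ((s[i + k] & 0xC0) != 0x80) return false;  /* not a continuation byte */
        i += extra + 1;
    }
    return true;
}

int main(void)
{
    const unsigned char artiste[] = { 0x41, 0x72, 0x74, 0xEE, 0x73, 0x74, 0xE9 };
    puts(looks_like_utf8(artiste, sizeof artiste) ? "could be UTF-8" : "not UTF-8, so Latin-1 here");
    return 0;
}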
As you can see from the layout of a UTF-8 sequence, there are 2 broad possibilities:
1st bit = 0 (same as ASCII): a 1-byte character with a value <= 0x7F
1st bit = 1: the start of a UTF-8 sequence of length >= 2 bytes, where every byte has a value >= 0x80
Your dump is ISO-8859 encoding:
41 72 74 *EE* 73 74 *E9*
There are only 2 stand-alone bytes with values >= 0x80, which cannot happen in UTF-8.
ADDENDUM: BEWARE
Be careful! Even if you find a well-formed UTF-8 sequence, you cannot always differentiate it from a bunch of ISO-8859 chars!

Resources