how do I determine if this is latin1 or utf8? - c

I have a string "Artîsté" in a latin1 table. I use a C MySQL connector to get the string out of the table, and I have character_set_connection set to utf8.
In the debugger it looks like:
"Art\xeest\xe9"
If I print the hex values with printf ("%02X", (unsigned char) a[i]); for each char I get
41 72 74 EE 73 74 E9
How do I know if it is utf8 or latin1?

\x74\xee\x73 isn't a valid UTF-8 sequence, since UTF-8 never has a lone byte with the top bit set: top-bit-set bytes always occur in runs of two or more. So of the two, it must be Latin-1.
However, if you see bytes that are valid UTF-8 data, then it's not always possible to rule out that it might be Latin-1 that just so happens to also be valid UTF-8.
Latin-1 does have some invalid bytes (the ASCII control range 0x00-0x1F and the 0x7F-0x9F range, which covers DEL and the C1 controls), so there are some UTF-8 strings that you can be sure are not Latin-1 text. But in my experience it's common enough to see Windows CP1252 mislabelled as Latin-1 that rejecting all those code points is fairly futile, except in the case where you're converting from another charset to Latin-1 and want to be strict about what you output. CP1252 has a few unused bytes too, but not as many.
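As a rough illustration (my own sketch, not code from the question): a minimal UTF-8 plausibility check that rejects exactly the pattern seen in your dump, a lone top-bit-set byte. Lead-byte ranges follow RFC 3629; overlong forms are not rejected, so this is a plausibility check rather than a strict validator.

#include <stdio.h>
#include <stddef.h>

/* Returns 1 if the buffer could be UTF-8, 0 if it cannot. */
static int could_be_utf8(const unsigned char *s, size_t n)
{
    size_t i = 0;
    while (i < n) {
        size_t len;
        if (s[i] < 0x80)                len = 1;  /* plain ASCII */
        else if ((s[i] & 0xE0) == 0xC0) len = 2;  /* 110xxxxx */
        else if ((s[i] & 0xF0) == 0xE0) len = 3;  /* 1110xxxx */
        else if ((s[i] & 0xF8) == 0xF0) len = 4;  /* 11110xxx */
        else return 0;       /* stray continuation or invalid lead byte */

        if (i + len > n) return 0;                /* truncated sequence */
        for (size_t k = 1; k < len; k++)
            if ((s[i + k] & 0xC0) != 0x80)        /* must be 10xxxxxx */
                return 0;
        i += len;
    }
    return 1;
}

int main(void)
{
    const unsigned char a[] = { 0x41, 0x72, 0x74, 0xEE, 0x73, 0x74, 0xE9 };
    printf("could be UTF-8: %d\n", could_be_utf8(a, sizeof a));  /* prints 0 */
    return 0;
}

For the bytes 41 72 74 EE 73 74 E9 it returns 0: 0xEE announces a 3-byte sequence, but the following byte 0x73 is not a continuation byte.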

As you can see from the scheme of a UTF-8 sequence, there are two possibilities for each byte:
1st bit = 0 (same as ASCII): 1 byte per character, with a value <= 0x7F
1st bit = 1: the byte belongs to a multi-byte UTF-8 sequence of length >= 2, where every byte has a value >= 0x80
Your dump is ISO-8859 encoding:
41 72 74 *EE* 73 74 *E9*
There are only two standalone bytes with values >= 0x80, which UTF-8 never produces.
Beware! Even if you find a well-formed UTF-8 sequence, you cannot distinguish it from a bunch of ISO-8859 characters that merely happen to form one!
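A small sketch (my own, for illustration) that classifies each byte of the dump above by its top bits, making the two standalone >= 0x80 bytes visible:

#include <stdio.h>

int main(void)
{
    const unsigned char a[] = { 0x41, 0x72, 0x74, 0xEE, 0x73, 0x74, 0xE9 };
    for (size_t i = 0; i < sizeof a; i++) {
        const char *kind =
            a[i] < 0x80           ? "ASCII (0xxxxxxx)" :
            (a[i] & 0xC0) == 0x80 ? "continuation (10xxxxxx)" :
                                    "multi-byte lead (11xxxxxx)";
        printf("%02X %s\n", a[i], kind);
    }
    return 0;
}

Both 0xEE and 0xE9 classify as multi-byte lead bytes, yet neither is followed by a continuation byte, so the dump cannot be UTF-8.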

Related

unsigned char in C not working as expected

Since unsigned char represents 0 - 255 and the extended ascii code for 'à' is 133, I expected the following C code to print 133
unsigned char uc;
uc='à';
printf("%hhu \n",uc);
Instead, both clang and gcc produce the following error
error: character too large for enclosing character literal type
uc='à';
^
What went wrong?
By the way, I copied à from a French-language website and pasted the result into the assignment statement. What I suspect is that the way I created à may not be valid.
Since unsigned char represents 0 - 255
This is true in most implementations, but the C standard does not require that a char be limited to 8 bits; it can be larger and support a larger range.
and the extended ascii code for 'à' is 133,
There can be a C implementation where 'à' has the value 133 (0x85), but since most implementations use Unicode, 'à' is probably the code point 224 (0xE0), most likely stored as UTF-8. Your editor is also set to UTF-8 and therefore needs more than a single byte to represent characters outside of ASCII. In UTF-8, all ASCII characters are stored exactly as in ASCII and need 1 byte; all other characters are encoded as a combination of 2-4 bytes, and bit 7 is set in every one of those bytes. I suggest you learn how UTF-8 works; it is the best way to store text most of the time, so you should only use something else when you have a good reason to do so.
I expected the following C code to print 133
In UTF-8, the code point for 'à' (U+00E0) is stored as the two bytes 0xC3 0xA0, which decode back to the value 0xE0. You can't store the bytes 0xC3 0xA0 in an 8-bit char, so clang reports an error.
You could try to store it in an int, unsigned, wchar_t or some other integer type that is large enough. GCC would then store the value 0xC3A0 and not 0xE0, because that is the byte sequence inside the ''. However, C supports wide characters: the type wchar_t, which is most likely 16 or 32 bits on your system. To write a wide character literal, you use the prefix L. With a wide character literal, the compiler stores the correct value, 0xE0.
Change the code to:
#include <stdio.h>
#include <wchar.h>
....
wchar_t wc;
wc = L'à';
printf("%u \n", (unsigned)wc);

How is a string represented in IA32 assembly?

A string is represented as an array of char. For example, if I have a string "abcdef" at address 0x80000000, is the following correct?
0x80000008
0x80000004: 00 00 46 45
0x80000000: 44 43 42 41
(The stack grows down, so I have the addresses decreasing)
The lower addresses are always first - even in the stack. So your example should be:
80000000: 41 42 43 44
80000004: 45 46 00 00
Your example is actually the string: "ABCDEF". The string "abcdef" should be:
80000000: 61 62 63 64
80000004: 65 66 00 00
Also, in memory dumps, the default radix is 16 (hexadecimal), so "0x" is redundant. Notice that the character codes are also in hexadecimal. For example the string "JKLMNOP" will be:
80000000: 4A 4B 4C 4D
80000004: 4E 4F 50 00
No, strings are usually not placed on the stack, only in data memory. Sometimes pointers to strings, i.e. the start address of the string, are placed on the stack.
Your (and my) examples concern the so-called ASCII encoding, but many other character encoding schemes are possible. For example, EBCDIC also uses 8-bit codes, but different ones than ASCII.
8-bit codes are not mandatory, though. UTF-32, for example, uses 32-bit codes. A fixed code size is not mandatory either: UTF-8 uses a variable code size of 1 to 4 bytes, depending on the character encoded.
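You can verify this layout with a short C program (a sketch of mine, since the question is really about memory contents rather than assembly syntax):

#include <stdio.h>

int main(void)
{
    const char s[] = "abcdef";                /* 7 bytes including the NUL */
    for (size_t i = 0; i < sizeof s; i++)
        printf("%02X ", (unsigned char)s[i]); /* prints: 61 62 63 64 65 66 00 */
    printf("\n");
    return 0;
}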
That isn't actually assembly. You can get an example of real assembly by running gcc -S. Traditionally in x86 assembly, you would declare a label followed by a string, which would be declared as db (data bytes). If it were a C-style string, it would be followed by db 0; some assemblers have an asciiz (or .asciz) directive that adds the zero byte automatically. If it were a Pascal-style string, it would be preceded by an integer containing its size. These would be laid out contiguously in memory, and you would get the address of the string by using the label, similarly to how you would get the address of a branch target from its label.
Which option you would use depends on what you’re going to do with it. If you’re passing to a C standard library function, you probably want a C-style string. If you’re going to be writing it with write() or send() and copying it to buffers with bounds-checking, you might want to store its length explicitly, even though no system or library call uses that format any more. Good, secure code shouldn’t use strcpy() either. However, you can both store the length and null-terminate the string.
Some old code for MS-DOS used strings terminated with $, a convention copied from CP/M for compatibility with 8-bit code on the Z80. There were a bunch of these legacies in OSes up to Windows ME.
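As a rough C sketch of the two layouts mentioned above (the struct and its field names are my own invention, not any standard format):

#include <stdio.h>
#include <string.h>

/* C-style: the bytes followed by a terminating zero. */
static const char c_style[] = "abcdef";

/* Pascal-style: an explicit length stored in front of the bytes. */
struct pstring {
    unsigned char len;
    char data[255];
};

int main(void)
{
    struct pstring p;
    p.len = (unsigned char)strlen("abcdef");
    memcpy(p.data, "abcdef", p.len);          /* no terminator needed */

    printf("C-style length:      %zu\n", strlen(c_style));
    printf("Pascal-style length: %u\n", p.len);
    return 0;
}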

Purpose of using octal for ASCII

Why would a C programmer use escape sequences (oct/hex) for ASCII values rather than decimal?
Follow up: does this have to do with either performance or portability?
Example:
char c = '\075';
You use octal or hexadecimal because there isn't a way to specify decimal codes inside a character literal or string literal. Octal was prevalent in PDP-11 code. These days, it probably makes more sense to use hexadecimal, though '\0' is more compact than '\x0' (so use '\0' when you null terminate a string, etc.).
Also, beware that "\x0ABad choice" doesn't have the meaning you might expect, whereas "\012007 wins" probably does. (The difference is that a hex escape runs on until it comes across a non-hex digit, whereas octal escapes stop after 3 digits at most. To get the expected result, you'd need "\x0A" "Bad choice" using 'adjacent string literal concatenation'.)
And this has nothing to do with performance and very little to do with portability. Writing '\x41' or '\101' instead of 'A' is a way of decreasing the portability and readability of your code. You should only consider using escape sequences when there isn't a better way to represent the character.
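A short demonstration of the run-on pitfall and the concatenation fix (my own sketch):

#include <stdio.h>

int main(void)
{
    /* "\x0ABad choice" would parse \x0ABad as a single hex escape (B, a
       and d are all hex digits), which does not fit in a char, so the
       literal must be split with adjacent string concatenation: */
    const char *ok  = "\x0A" "Bad choice";   /* newline, then the text */
    const char *oct = "\012007 wins";        /* \012 stops after 3 digits */

    printf("%s\n", ok);
    printf("%s\n", oct);
    return 0;
}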
No, it does not have anything to do with performance or portability. It is just a convenient way to define character literals and to use non-printable characters in string literals.
It has nothing to do with performance or portability. In fact, you don't need any codes at all; instead of this:
char c = 65;
You can simply write:
char c = 'A';
But some characters are not so easy to type, e.g. ASCII SOH, so you might write:
char c = 1; // SOH
Or any other form, hexadecimal, octal, depending on your preference.
It has nothing to do with performance, nor with portability. It is simply that the ASCII character set (like its descendants, up to and including Unicode) is organized around bits and bytes. For example, the first 32 characters are the control characters, and 32 = 040 = 0x20; the ASCII code of 'A' is 65 = 0101 = 0x41 and 'a' is 97 = 0141 = 0x61; the ASCII code of '0' is 48 = 060 = 0x30.
I do not know about you, but for me 0x30 and 0x41 are easier to remember and use in manual operations than 48 and 65.
By the way, a byte represents exactly the values between 0 and 255, that is, 0x00 to 0xFF ...
I didn't know this worked.
But I immediately got a pretty useful idea for it.
Imagine you have a low-memory environment and have to use a permission system like the Unix folder permissions.
Let's say there are 3 groups, and for each group 2 different options which can be allowed or denied.
0 means neither option,
1 means the first option allowed,
2 means the second option allowed and
3 means both allowed.
To store the permissions you could do it like:
char* bar = "213"; // first group has the second option allowed, second group has the first option allowed, and the third has full access.
But that takes four bytes of storage for the information (three digits plus the terminating NUL).
Of course you could just convert that to a plain number, but that's less readable.
But now that I know this....
doing:
char bar = '\213';
is pretty readable and also saves memory!
I love it :D
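A minimal sketch of that idea (my own; note that each octal digit occupies 3 bits of the byte, so the three groups sit in 3-bit fields even though their values fit in 2 bits):

#include <stdio.h>

int main(void)
{
    unsigned char bar = '\213';               /* octal 213: groups 2, 1, 3 */

    for (int g = 0; g < 3; g++) {
        int bits = (bar >> (6 - 3 * g)) & 7;  /* extract one octal digit */
        printf("group %d: option1=%s option2=%s\n", g + 1,
               (bits & 1) ? "allowed" : "denied",
               (bits & 2) ? "allowed" : "denied");
    }
    return 0;
}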

Can we reverse this string in C? [closed]

It's difficult to tell what is being asked here. This question is ambiguous, vague, incomplete, overly broad, or rhetorical and cannot be reasonably answered in its current form. For help clarifying this question so that it can be reopened, visit the help center.
Closed 10 years ago.
It is very simple, I hope. These are 20 hex-looking values separated by backslashes \, and the C compiler indeed makes them a string of 33 characters, because \NUMBER is a single value, while \NUMBER+ALPHA and \ALPHA+NUMBER are 2 bytes each.
char str[] =
"\b3\bc\77\7\de\ed\44\93\75\ce\c0\9\19\59\c8\f\be\c6\30\6";
//when saved is 33 bytes
My question is: after it has been saved as 33 bytes on disk, can we (after reading the 33 bytes back) re-create the same representation that we had in the C source? So the program prints "\b3\bc\77\7\de\ed\44\93\75\ce\c0\9\19\59\c8\f\be\c6\30\6". Any problem solvers here?
"\b3\bc\77\7\de\ed\44\93\75\ce\c0\9\19\59\c8\f\be\c6\30\6";
//when read back program should output this ^
The string literal you have:
"\b3\bc\77\7\de\ed\44\93\75\ce\c0\9\19\59\c8\f\be\c6\30\6"
will produce undefined behavior according to C89 (I'm not sure the source I have for C89 can be trusted, but my point below still holds) and implementation-defined behavior according to the C11 standard. In particular, \d, \e, \9 and \c are escape sequences not defined in the standard. gcc will not complain about \e, since it is a GNU extension representing ESC.
Since there is implementation-defined behavior involved, we need to know which compiler you are using, as the result may vary.
Another thing is that you didn't show clearly that you are aware of the content of the string after compilation. (A clearer way to show it would be to include a hex dump of what the string looks like in memory, and to show how you interpret the escape sequences.)
This is how the looks-like-hex string is recognized by the compiler:
String: \b 3 \b c \77 \7 \d e \e d \44 \9 3 \75 \c e \c 0 \9 \1 9 \5 9 \c 8 \f \b e \c 6 \30 \6
Char: \b 3 \b c \77 \7 d e \e d \44 9 3 \75 c e c 0 9 \1 9 \5 9 c 8 \f \b e c 6 \30 \6
Hex: 08 33 08 63 3f 07 64 65 1b 64 24 39 33 3d 63 65 63 30 39 01 39 05 39 63 38 0c 08 65 63 36 18 06 00
Enough beating around the bush. I'll assume that you are using gcc to compile the code (warnings ignored), that when the program runs the whole char[] is written to a file using fwrite, and that only lower-case characters are used in the source code.
You should map all possible escape sequences \xy that look like 2-digit hex numbers to the sequences of 1 or 2 bytes they produce. There are not that many of them, and you can write a program to simulate the behavior of the compiler (see the sketch after this answer):
If x is any of a, b, f (other escape sequences like \n do not look like hex digits) or e (due to the GNU extension), it is mapped to the corresponding special character.
(If you use upper-case characters in the source code, note that \E also maps to ESC.)
Otherwise, if xy forms a valid octal sequence, it is mapped to the character with the corresponding value.
Otherwise, if x alone forms a valid octal sequence, it is mapped to the character with the corresponding value.
Otherwise, x stays the same.
If y was not consumed, y stays the same.
Note that it is possible for the same actual char to be produced in two different ways. For example, \f and \14 map to the same char. In such cases it might not be possible to get back the exact string from the source; the most you can do is guess what it could have been.
Using your string as an example: at the beginning, 08 33 can come from \b3, but it can also come from \10\63.
Using the map produced, there are cases where the mapping is clear: a byte value larger than 0x3F cannot come from a 2-digit octal escape sequence, and must come from direct interpretation of a character in the original string. From this you know that if e is encountered, it must be the 2nd character in a looks-like-hex sequence.
You can use the map as a guide, and the simulation as a method to check whether the map reproduces the same bytes. Without knowing anything about the string declared in the source code, the most you can derive is a list of candidates for the original (broken) string; you can shrink that list if you at least know the length of the string in the source code.
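Here is a rough simulation of that forward mapping (my own sketch; the decode helper and its name are hypothetical, and it covers only the escapes that appear in this question). It takes the source-code spelling, with literal backslashes, and prints the bytes gcc would store:

#include <stdio.h>

/* Decode a source-code spelling such as "\\b3\\bc\\77..." into the bytes
   gcc would store.  Covers \a \b \e \f, octal digits, and unknown escapes
   like \c \d \9, which gcc maps to the bare character. */
static size_t decode(const char *src, unsigned char *out)
{
    size_t n = 0;
    while (*src) {
        if (*src != '\\') { out[n++] = (unsigned char)*src++; continue; }
        src++;                              /* skip the backslash */
        if (!*src) break;                   /* trailing backslash: stop */
        switch (*src) {
        case 'a': out[n++] = 0x07; src++; break;
        case 'b': out[n++] = 0x08; src++; break;
        case 'e': out[n++] = 0x1B; src++; break;   /* GNU extension: ESC */
        case 'f': out[n++] = 0x0C; src++; break;
        default:
            if (*src >= '0' && *src <= '7') {      /* up to 3 octal digits */
                unsigned v = 0;
                for (int i = 0; i < 3 && *src >= '0' && *src <= '7'; i++)
                    v = v * 8 + (unsigned)(*src++ - '0');
                out[n++] = (unsigned char)v;
            } else {
                out[n++] = (unsigned char)*src++;  /* unknown escape */
            }
        }
    }
    return n;
}

int main(void)
{
    unsigned char buf[64];
    size_t n = decode("\\b3\\bc\\77\\7\\de\\ed\\44\\93\\75"
                      "\\ce\\c0\\9\\19\\59\\c8\\f\\be\\c6\\30\\6", buf);
    for (size_t i = 0; i < n; i++)
        printf("%02x ", buf[i]);
    printf("\n(%zu bytes, plus the implicit terminating NUL)\n", n);
    return 0;
}

Running it reproduces the hex row shown above: 08 33 08 63 3f 07 64 65 1b 64 24 39 33 3d 63 65 63 30 39 01 39 05 39 63 38 0c 08 65 63 36 18 06.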

What does \x6d\xe3\x85 mean?

I don't know what that is; I found it in the OpenSSL source code.
Are those some sort of byte sequence? Basically, I just need to convert my char * to that kind of style as a parameter.
It's a byte sequence in hexadecimal. \x6d\xe3\x85 is hex character 6d followed by hex e3, followed by hex 85. The syntax is \xnn where nn is your hex sequence.
If what you read was
char foo[] = "\x6d\xe3\x85";
then that is the same as
char foo[] = { 0x6d, 0xE3, 0x85, 0x00 };
Further, I can tell you that 0x6D is the ASCII code point for 'm', 0xE3 is the ISO 8859-1 code point for 'ã', and 0x85 is the Windows-1252 code point for '…'.
But without knowing more about the context, I can't tell you how to "convert [your] char * to that kind of style as a parameter", except to say that you might not need to do any conversion at all! The \x notation allows you to write string constants containing arbitrary byte sequences into your source code. If you already have an arbitrary byte sequence in a buffer in your program, I can't imagine your needing to back-convert it to \x notation before feeding it to OpenSSL.
Try the following code snippet to understand more about hex byte sequences:
#include <stdio.h>

int main(void)
{
    char word[] = "\x48\x65\x6c\x6c\x6f";
    printf("%s\n", word);
    return 0;
}
/*
Output:
$
$ ./a.out
Hello
$
*/
The \x sequence is used to escape byte values in hexadecimal notation. So the sequence you've cited escapes the bytes 6D, E3 and 85, which translate into 109, 227 and 133. While 6D can also be represented as the character m in ASCII, you cannot represent the latter two in ASCII, as it only covers the range 0..127. So for values beyond 127 you need a special way to write them, and \x is such a way.
The other way is to escape as octal numbers using \<number>, for example 109 would be \155.
If you need explicit byte values, it's better to use these escape sequences, since (AFAIK) the C standard doesn't guarantee that your string will be encoded using ASCII. So when you compile, for example, on an EBCDIC system, your m would be represented as byte value 148 instead of 109.
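A tiny check (my own sketch) showing the same byte written three ways; on an ASCII-based system all three agree, while on an EBCDIC system the character 'm' would differ:

#include <stdio.h>

int main(void)
{
    /* '\x6d' and '\155' are the same byte by construction; 'm' matches
       them only on an ASCII-based system, not on EBCDIC. */
    printf("'\\x6d' == '\\155' ? %s\n", ('\x6d' == '\155') ? "yes" : "no");
    printf("'\\x6d' == 'm'    ? %s\n", ('\x6d' == 'm') ? "yes" : "no");
    return 0;
}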
