char * versus unsigned char * and casting - c

I need to use the SQLite function sqlite3_prepare_v2() (https://www.sqlite.org/c3ref/prepare.html).
This function takes a const char * as its second parameter.
On the other hand, I have prepared an unsigned char * variable v which contains something like this:
INSERT INTO t (c) VALUES ('amitié')
In hexadecimal representation (I cut the line):
49 4E 53 45 52 54 20 49 4E 54 4F 20 74 20 28 63 29
20 56 41 4C 55 45 53 20 28 27 61 6D 69 74 69 E9 27 29
Note the 0xE9 representing the character é.
In order for this piece of code to be built properly, I cast the variable v with (const char *) when I pass it, as an argument, to the sqlite3_prepare_v2() function...
What comments can you make about this cast? Is it really very very bad?
Note that I have been using an unsigned char * pointer to be able to store characters between 0x00 and 0xFF with one byte only.
The source data is coming from an ANSI encoded file.
In the documentation for the sqlite3_prepare_v2() function, I'm also reading the following comment for the second argument of this function:
/* SQL statement, UTF-8 encoded */
What troubles me is the type const char * for the function second argument... I would have been expecting a const unsigned char * instead...
To me - but then again I might be totally wrong - there are only 7 useful bits in a char (one byte), the most significant bit (leftmost) being used to denote the sign of the byte...
I guess I'm missing some kind of point here...
Thank you for helping.

You are correct.
For a UTF-8 input, the sqlite3_prepare_v2 function really should be asking for a const unsigned char *, as all 8 bits are being used for data. Its implementation certainly shouldn't rely on a signed comparison to test the top bit, because a simple compiler flag can make plain char default to either signed or unsigned, and the unsigned default would break such code.
As for your concerns over the cast, this is one of the more benign ones. Casting away signedness on int or float is usually a very bad thing (TM) - or at least a clear indicator that you have a problem.
When dealing with pure ASCII, you are correct that there are 7 bits of data, but the remaining 8th bit was historically used as a parity bit, not as a sign bit.
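If it helps, here is a minimal sketch of what the call might look like in practice; the wrapper function, its name, and the assumptions that db is an already-open handle and that v is NUL-terminated are mine, not part of the question:
#include <sqlite3.h>

// Minimal sketch: hand an unsigned char * buffer to sqlite3_prepare_v2().
// Assumes db is an open sqlite3 handle and v is a NUL-terminated statement
// whose bytes are already UTF-8, as the documentation requires.
int prepare_from_bytes(sqlite3 *db, const unsigned char *v, sqlite3_stmt **stmt)
{
    // The cast only changes the pointer's signedness; it does not touch
    // the bytes themselves, so it is one of the benign casts.
    return sqlite3_prepare_v2(db, (const char *)v, -1, stmt, NULL);
}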

Related

How has this typecasting and displaying been possible?

int *i= (int*)5;
char *p=(char*)'A';
printf("i=%d and p=%c",i,p);
I accidentally tried this and I got the output as i=5 and p=A.
Can someone explain? What has happened?
Converting from an integer to a pointer might be OK, provided it doesn't create a misaligned pointer (which it certainly does in the case of 5). On some computers, these pointer conversions by themselves might cause a trap/crash. On other computers it will work fine.
Then when you call printf with the wrong types, you get undefined behavior. One possible outcome of undefined behavior on some system might be "seems to work just fine". For example if sizeof(int) == sizeof(int*), the code might print the result you are getting, although no guarantees.
Regarding passing a char to printf: printf is a variadic function, and those functions come with implicit promotion of the passed arguments to the int type. Which is why one might get the result A too, in case sizeof(int) happened to equal sizeof(char*).
And on a system with 64 bit pointers, you might want to try this:
printf("size of char* is %zu but the lower 32 bits are %X", sizeof(char*), (unsigned int)p);
On a little endian x86 I get the output
size of char* is 8 but the lower 32 bits are 41
Since the pointer format is little endian, the raw pointer value 41 00 00 00 00 00 00 00 can be read as a 32 bit integer 41 00 00 00 to get ASCII 'A'. So it seems likely that printf just peeled off the lowest 32 bits of the pointer.
Silly bonus program for x86 to demonstrate this:
#include <stdio.h>
int main(void)
{
    int* i = (int*) (5 | 1145128260ull<<32);    // store some secret msg in upper 32 bits
    char* p = (char*)('A' | 1178944834ull<<32); // store some secret msg in upper 32 bits
    printf("i=%d and p=%c\n",i,p);
    printf("%.4s",(char*)&i + 4);
    printf("%.4s",(char*)&p + 4);
    return 0;
}
Output:
i=5 and p=A
DEADBEEF

The MD5 Hash in Arm Assembly and endianness

I am new to Arm assembly programming. I am attempting to write a function in Arm Cortex-M4 assembly that performs the MD5 hash algorithm. I am following the algorithm on the Wikipedia page here: https://en.wikipedia.org/wiki/MD5.
The wiki page declares the constants A, B, C, D and the arrays S and K. All the values are shown in little endian.
About little endian:
I have done some research, and it seems that in memory an entire string is shown in order, as if the entire string was in big endian. This is because each character is a byte. The values in the wiki are declared in little endian, so after I declare them, they show up as big endian (normal order) in the memory.
I have done the preprocessing for the MD5 hash. Let me show you what it looks like in memory for the string "The Quick Brown Fox Jumps Over The Lazy Dog":
54686520 51756963 6B204272 6F776E20 466F7820 4A756D70 73204F76 65722054
6865204C 617A7920 446F672E 80000000 00000000 00000000 00000000 00006001
So 54=T, 68=h, ... etc.
Now here's where my confusion is.
After the message, a single 1 bit is appended. This is the byte 0x80. After that, the rest of the 512 bits are filled with zeros until the last 64 bits, which is where the length of the message goes. So, as shown, the message is 0x160 bits long. But the length is in little endian in the memory, so it shows up as 60 01.
So the length is in little endian in the memory.
But the constants A,B,C,D and array K are declared initially in little endian according to the wiki.
So when I view them in the memory, they show up as normal.
So now I am confused! My length is in little endian in the memory, and the constants and K array are in big endian in the memory.
What would be the correct way to view the example in the memory?
It's not really true to describe ASCII strings as big-endian. Endianness applies only to multi-byte values, so ASCII strings have no endianness because they're just arrays of bytes. If you had an array of 16-bit numbers, for example, then endianness would apply individually to each value in the array but not to the ordering of the elements.
The real answer to your question is that there is no easy way to view 'raw' memory data when it's organised in this way. Most debuggers have variable watches which can be used to view the contents of memory locations in a type-aware way, which is usually easier; so for example you could tell the watch window that K points to a 64-byte string and that K+56 points to a little-endian 64-bit unsigned integer, and these values would then be interpreted and reported correctly.
More generally it is often difficult to interpret 'raw' memory data in a little-endian system, because knowing which bytes to swap to put values into an order that's easily human-readable relies on knowing how long each value is, and this information is not present at runtime. It's the downside of the little-endian system, the upside being that casting pointers doesn't change their absolute values because a pointer always points to the least-significant byte no matter how large the data type.
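As a quick illustration of that last point (a little sketch of my own, assuming a little-endian host), dumping the bytes of the 64-bit length value from the question shows the least significant byte at the lowest address:
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t len = 0x160;   // the message length in bits, from the question
    const unsigned char *p = (const unsigned char *)&len;

    // On a little-endian machine this prints 60 01 00 00 00 00 00 00:
    // the pointer points at the least significant byte, and a narrower
    // view of the same address would still start at that byte.
    for (size_t i = 0; i < sizeof len; i++)
        printf("%02X ", p[i]);
    printf("\n");
    return 0;
}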
Programming language and architecture have nothing to do with this. You are trying to prepare 32 bit values from a string.
"The Quick Brown Fox Jumps Over The Lazy Dog."
As an ASCII string, the bytes look like this in hex:
54 68 65 20 51 75 69 63 6B 20 42 72 6F 77 6E 20 46 6F 78 20 4A 75 6D 70 73 20 4F 76 65 72 20 54 68 65 20 4C 61 7A 79 20 44 6F 67 2E
But MD5 is about data, not strings, correct? More on this in a bit.
You have to be careful with endianness. Generally folks are talking about byte-swapping larger quantities, 16 or 32 or 64, etc., bits (does the lowest address hold the big end or the little end?). Start with the 64-bit quantity for the length:
0x1122334455667788
when looked at as a list of bytes in increasing address order, little endian (as generally understood) is
88 77 66 55 44 33 22 11
so
0x0000000000000160
would be
60 01 00 00 00 00 00 00
And the next question is your string. Should it start with 0x54686520 or should it start with 0x20656854 or 0x63697551?
I believe, from the text in Wikipedia:
The MD5 hash is calculated according to this algorithm. All values are in little-endian.
//Note: All variables are unsigned 32 bit and wrap modulo 2^32 when calculating
Then your last (only) chunk should look like
0x20656854
0x63697551
0x7242206B
0x206E776F
0x20786F46
0x706D754A
0x764F2073
0x54207265
0x4C206568
0x20797A61
0x2E676F44
0x00000080
0x00000000
0x00000000
0x00000160
0x00000000
Using an md5 source routine I found online and the one that comes with my Linux distro, I got
ec60fd67aab1c782cd3f690702b21527
as the hash in both cases, and the prepared data for the last/only chunk started with 0x20656854 in this program. This program also correctly calculated the result for a string on Wikipedia.
So, going from the Wikipedia article (which should have handled the 64-bit length a smidge better): your data (it's not a string) needs to be processed as 32-bit little-endian quantities taken from the 512 bits.
54 68 65 20 becomes 0x20656854, and 0x0000000000000160 becomes 0x00000160, 0x00000000.
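As a rough sketch of that preparation step (this is my own illustration, not the program mentioned above; the function name and the single-chunk restriction are assumptions):
#include <stdint.h>
#include <string.h>

// Sketch: build one 512-bit MD5 chunk as 16 little-endian 32-bit words
// from a short message (must fit in a single chunk, so len <= 55 bytes).
static void prepare_chunk(const unsigned char *msg, size_t len, uint32_t w[16])
{
    unsigned char block[64] = {0};

    memcpy(block, msg, len);
    block[len] = 0x80;                      // the appended '1' bit

    uint64_t bitlen = (uint64_t)len * 8;    // 44 bytes -> 0x160 bits
    for (int i = 0; i < 8; i++)             // store the length little endian
        block[56 + i] = (unsigned char)(bitlen >> (8 * i));

    for (int i = 0; i < 16; i++)            // 54 68 65 20 -> 0x20656854, etc.
        w[i] = (uint32_t)block[4 * i]
             | (uint32_t)block[4 * i + 1] << 8
             | (uint32_t)block[4 * i + 2] << 16
             | (uint32_t)block[4 * i + 3] << 24;
}
Fed the 44-byte message above, this fills w[] with exactly the word list shown earlier: 0x20656854 first, 0x00000080 right after the text, and 0x00000160, 0x00000000 in the last two slots.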
If I were to do this, I would find an MD5 library or class, write a simple example that takes the text I want to hash, and then ask the compiler to generate assembly for the ARM part that I need.
You may consider an mbed [1] or an Arduino [2] version.
[1] https://os.mbed.com/users/hlipka/code/MD5/
[2] https://github.com/tzikis/ArduinoMD5

How is a string represented in IA32 assembly?

A string is represented as an array of char. For example, if I have a string "abcdef" at address 0x80000000, is the following correct?
0x80000008
0x80000004: 00 00 46 45
0x80000000: 44 43 42 41
(The stack grows down, so I have the addresses decreasing.)
The lower addresses are always first - even in the stack. So your example should be:
80000000: 41 42 43 44
80000004: 45 46 00 00
Your example is actually the string: "ABCDEF". The string "abcdef" should be:
80000000: 61 62 63 64
80000004: 65 66 00 00
Also, in memory dumps, the default radix is 16 (hexadecimal), so "0x" is redundant. Notice that the character codes are also in hexadecimal. For example the string "JKLMNOP" will be:
80000000: 4A 4B 4C 4D
80000004: 4E 4F 50 00
No, strings are not usually placed on the stack, only in data memory. Sometimes pointers to strings, i.e. the start address of the string, are placed on the stack.
Your (and my) examples concern the so-called ASCII encoding. But there are many other possible character encoding schemes. For example, EBCDIC also uses 8-bit codes, but different ones than ASCII.
But 8-bit codes are not mandatory. UTF-32, for example, uses 32-bit codes. Also, it is not mandatory to have a fixed code size. UTF-8 uses a variable code size of 1 to 4 bytes, depending on the characters encoded.
That isn't actually assembly. You can get an example of the real thing by running gcc -S. Traditionally in x86 assembly, you would declare a label followed by a string, which would be declared as db (data bytes). If it were a C-style string, it would be followed by db 0. Modern assemblers have an asciiz-style directive that adds the zero byte automatically. If it were a Pascal-style string, it would be preceded by an integer containing its size. These would be laid out contiguously in memory, and you would get the address of the string by using the label, similarly to how you would get the address of a branch target from its label.
Which option you would use depends on what you’re going to do with it. If you’re passing to a C standard library function, you probably want a C-style string. If you’re going to be writing it with write() or send() and copying it to buffers with bounds-checking, you might want to store its length explicitly, even though no system or library call uses that format any more. Good, secure code shouldn’t use strcpy() either. However, you can both store the length and null-terminate the string.
Some old code for MS-DOS used strings terminated with $, a convention copied from CP/M for compatibility with 8-bit code on the Z80. There were a bunch of these legacies in OSes up to Windows ME.
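To make the C-style versus length-prefixed distinction concrete in C terms (the struct name and fixed capacity below are purely illustrative):
#include <stdio.h>
#include <string.h>

// C-style string: the bytes followed by a terminating zero byte,
// the equivalent of `label: db "abcdef", 0` in assembly.
static const char c_style[] = "abcdef";

// Length-prefixed (Pascal-style) string: the size stored alongside the bytes.
struct pstring {
    unsigned char len;
    char data[16];      // illustrative fixed capacity
};
static const struct pstring p_style = { 6, "abcdef" };

int main(void)
{
    printf("C-style length (found by scanning): %zu\n", strlen(c_style));
    printf("Pascal-style length (stored): %u\n", (unsigned)p_style.len);
    return 0;
}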

Why byte ordering (endianness) is not an issue for standard C strings?

My professor mentioned that byte ordering (endianness) is not an issue for standard C strings (char arrays),
for example: char s[6] = "abcde";
But he did not explain why.
Any explanation of this would be helpful.
Endianness only matters when you have multi-byte data (like integers and floating-point numbers). Standard C strings consist of 1-byte characters, so you don't need to worry about endianness.
A char takes up only one byte; that's why the ordering does not matter. For example, an int has 4 bytes, and these bytes could be arranged in little-endian or big-endian order.
E.g., 0x00010204 can be arranged in two ways in memory:
04 02 01 00 or 00 01 02 04
A char, being a single byte, will be fetched as a single byte by the CPU, and a string is just a char array.
The smallest unit of memory storage is 1 byte. If you have a 4-byte value (e.g. 0x01234567),
it will be arranged and fetched in the order given by the endianness, in contiguous locations:
Big endian: 01 23 45 67
Little endian: 67 45 23 01
Whereas a 1-byte value can be fetched from a single memory location by itself, since that is the smallest block of memory, with no need for byte ordering.
Hope this helps!
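To see the difference for yourself, a small test program (my own sketch) prints the bytes of an int and of a string in address order:
#include <stdio.h>

int main(void)
{
    unsigned int n = 0x01234567;   // multi-byte value: order depends on endianness
    const char s[] = "abcdef";     // char array: one byte per element, no ordering issue
    const unsigned char *p = (const unsigned char *)&n;

    // Prints 67 45 23 01 on a little-endian machine, 01 23 45 67 on big-endian.
    for (size_t i = 0; i < sizeof n; i++)
        printf("%02X ", p[i]);
    printf("\n");

    // Prints 61 62 63 64 65 66 00 on every machine.
    for (size_t i = 0; i < sizeof s; i++)
        printf("%02X ", (unsigned char)s[i]);
    printf("\n");
    return 0;
}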

Assign Unicode character to a char

I want to do the following assignment:
char complete = '█', blank='░';
But I got the following warning (I'm using the latest version of gcc):
trabalho3.c: In function ‘entrar’:
trabalho3.c:243:9: warning: multi-character character constant [-Wmultichar]
char complete = '█', blank='░';
^
trabalho3.c:243:3: warning: overflow in implicit constant conversion [-Woverflow]
char complete = '█', blank='░';
^
trabalho3.c:244:23: warning: multi-character character constant [-Wmultichar]
char complete = '█', blank='░';
^
trabalho3.c:244:17: warning: overflow in implicit constant conversion [-Woverflow]
char complete = '█', blank='░';
^
How can I do this assignment?
When I copy those lines from the posting and echo the result through a hex dump program, the output is:
0x0000: 63 68 61 72 20 63 6F 6D 70 6C 65 74 65 20 3D 20 char complete =
0x0010: 27 E2 96 88 27 2C 20 62 6C 61 6E 6B 3D 27 E2 96 '...', blank='..
0x0020: 91 27 3B 0A .';.
0x0024:
And when I run it through a UTF-8 decoder, the two block characters are identified as:
0xE2 0x96 0x88 = U+2588 (FULL BLOCK)
0xE2 0x96 0x91 = U+2591 (LIGHT SHADE)
And if the characters are indeed 3 bytes long, trying to store all three bytes into a single character is going to cause problems.
You need to validate these observations; there is a lot of potential for the data being filtered between your system and mine. However, the chances are that if you take a look at the source code using similar tools, you will find that the characters are either UTF-8 or UTF-16 encoded, and neither of these will fit into a single byte. If you think they are characters in a single-byte code set (CP-1252 or something similar, perhaps), you should show the hex dump for the line of code containing the initializations, and identify the platform and code set you're working with.
You can store those characters as:
a UTF-8 string, const unsigned char complete[] = u8"█";
a wide character defined in <wchar.h>, const wchar_t complete = L'█';
a UTF-32 character defined in <uchar.h>, const char32_t complete = U'█';
a UTF-16 character, although this is generally a bad idea.
Use UTF-8 when you can, something else when you have to. The 32-bit type is the only one that guarantees fixed width. There are functions in the standard library to read and write wide-character strings, and in many locales, you can read and write UTF-8 strings just like ASCII once you call setlocale() or convert them to wide characters with mbstowcs().
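As a small sketch of the first two options (assuming C11 and a terminal/locale that can display these glyphs; I use plain char for the UTF-8 array here, which is the type of a u8 literal):
#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    const char complete_utf8[] = u8"█";     // 3 bytes: E2 96 88, plus the NUL
    const wchar_t blank_wide = L'░';

    setlocale(LC_ALL, "");                  // pick up the environment's locale
    printf("%s\n", complete_utf8);          // write the UTF-8 bytes as-is
    printf("%lc\n", (wint_t)blank_wide);    // convert the wide char to multibyte
    return 0;
}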
