How is a string represented in IA32 assembly?

A string is represented as an array of char. For example, if I have a string "abcdef" at address 0x80000000, is the following correct?
0x80000008
0x80000004: 00 00 46 45
0x80000000: 44 43 42 41
(The stack grows downward, so I have listed the addresses in decreasing order.)

Lower addresses always come first, even on the stack. So your example should be:
80000000: 41 42 43 44
80000004: 45 46 00 00
Your example is actually the string: "ABCDEF". The string "abcdef" should be:
80000000: 61 62 63 64
80000004: 65 66 00 00
Also, in memory dumps, the default radix is 16 (hexadecimal), so "0x" is redundant. Notice that the character codes are also in hexadecimal. For example the string "JKLMNOP" will be:
80000000: 4A 4B 4C 4D
80000004: 4E 4F 50 00
No, strings are not usually placed on the stack; they normally live in data memory. What is sometimes placed on the stack is a pointer to a string, i.e. the start address of the string.
Your (and my) examples use the so-called ASCII encoding, but many other character encoding schemes are possible. For example, EBCDIC also uses 8-bit codes, but different ones than ASCII.
8-bit codes are not mandatory either: UTF-32, for example, uses 32-bit codes. Nor is a fixed code size mandatory: UTF-8 uses a variable code size of 1 to 4 bytes, depending on the characters encoded.
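If you want to see the layout concretely, a small C sketch like this (purely illustrative) prints the address and hex code of each byte of the string, including the terminating zero, showing that the characters sit at increasing addresses:

#include <stdio.h>

int main(void) {
    const char s[] = "abcdef";                 /* 6 characters plus the terminating 0 byte */
    for (size_t i = 0; i < sizeof s; i++)      /* sizeof s is 7 here */
        printf("%p: %02X\n", (void *)&s[i], (unsigned char)s[i]);
    return 0;
}

On any machine this prints 61 62 63 64 65 66 00 at seven consecutive addresses; endianness plays no part.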

That isn't actually assembly. You can get an example of real assembly by running gcc -S. Traditionally in x86 assembly, you would declare a label followed by a string, which would be declared as db (data bytes). If it were a C-style string, it would be followed by db 0. Modern assemblers have an asciiz type that adds the zero byte automatically. If it were a Pascal-style string, it would be preceded by an integer containing its size. These would be laid out contiguously in memory, and you would get the address of the string by using the label, similarly to how you would get the address of a branch target from its label.
Which option you would use depends on what you're going to do with it. If you're passing it to a C standard library function, you probably want a C-style string. If you're going to be writing it with write() or send() and copying it to buffers with bounds-checking, you might want to store its length explicitly, even though no system or library call uses that format any more. Good, secure code shouldn't use strcpy() either. However, you can both store the length and null-terminate the string.
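As an aside on that last point, a minimal C sketch of a string that carries both an explicit length and a terminating NUL might look like this (the struct and its field names are made up for illustration):

#include <string.h>

struct lstring {
    unsigned int len;    /* explicit length, handy for bounds-checked I/O like write() */
    char data[64];       /* the bytes themselves, still NUL-terminated for C library calls */
};

static void lstring_set(struct lstring *s, const char *src) {
    size_t n = strlen(src);
    if (n >= sizeof s->data)          /* truncate, keeping room for the 0 byte */
        n = sizeof s->data - 1;
    memcpy(s->data, src, n);
    s->data[n] = '\0';
    s->len = (unsigned int)n;
}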
Some old code for MS-DOS used strings terminated with $, a convention copied from CP/M for compatibility with 8-bit code on the Z80. There were a bunch of these legacies in OSes up to Windows ME.

Related

The MD5 Hash in Arm Assembly and endianness

I am new to ARM assembly programming. I am attempting to write a function in ARM Cortex-M4 assembly that performs the MD5 hash algorithm. I am following the algorithm on the wiki page here: https://en.wikipedia.org/wiki/MD5.
The wiki page declares the constants A, B, C, D and the arrays S and K. All the values are shown in little endian.
About little endian:
I have done some research, and it seems that in memory an entire string is shown in order, as if the entire string were in big endian. This is because each character is a byte. The values in the wiki are declared in little endian, so after I declare them, they show up as big endian (normal order) in memory.
I have done the preprocessing for the MD5 hash. Let me show you what it looks like in memory for the string "The Quick Brown Fox Jumps Over The Lazy Dog.":
54686520 51756963 6B204272 6F776E20 466F7820 4A756D70 73204F76 65722054
6865204C 617A7920 446F672E 80000000 00000000 00000000 00000000 00006001
So 54 = 'T', 68 = 'h', ... etc.
Now here's where my confusion is.
After the message, a single 1 bit is appended; this is the byte 0x80. After that, the rest of the 512-bit block is filled with zeros up to the last 64 bits, which is where the length of the message goes. So as shown, the message is 0x160 bits long. But the length is in little endian in the memory, so it shows up as 6001.
So the length is in little endian in the memory.
But the constants A,B,C,D and array K are declared initially in little endian according to the wiki.
So when I view them in the memory, they show up as normal.
So now I am confused! My length is in little endian in memory, and the constants and the K array are in big endian in memory.
What would be the correct way to view the example in the memory?
It's not really true to describe ASCII strings as big-endian. Endianness applies only to multi-byte values, so ASCII strings have no endianness because they're just arrays of bytes. If you had an array of 16-bit numbers, for example, then endianness would apply individually to each value in the array but not to the ordering of the elements.
The real answer to your question is that there is no easy way to view 'raw' memory data when it's organised in this way. Most debuggers have variable watches which can be used to view the contents of memory locations in a type-aware way, which is usually easier; so for example you could tell the watch window that K points to a 64-byte string and that K+56 points to a little-endian 64-bit unsigned integer, and these values would then be interpreted and reported correctly.
More generally it is often difficult to interpret 'raw' memory data in a little-endian system, because knowing which bytes to swap to put values into an order that's easily human-readable relies on knowing how long each value is, and this information is not present at runtime. It's the downside of the little-endian system, the upside being that casting pointers doesn't change their absolute values because a pointer always points to the least-significant byte no matter how large the data type.
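That last point can be seen with a small C sketch (illustrative only): on a little-endian machine, narrowing the pointer type still leaves it pointing at the least-significant byte.

#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint32_t value = 0x00000160;            /* the 0x160-bit length from the question */
    uint8_t *p = (uint8_t *)&value;         /* same address, reinterpreted as bytes */

    /* On a little-endian CPU this prints "60 01 00 00": the pointer already
       addresses the least-significant byte, so the narrowing cast is harmless. */
    for (int i = 0; i < 4; i++)
        printf("%02X ", p[i]);
    printf("\n");
    return 0;
}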
Programming language and architecture have nothing to do with this. You are trying to prepare 32 bit values from a string.
"The Quick Brown Fox Jumps Over The Lazy Dog."
As an ASCII string the bytes look like this in hex:
54 68 65 20 51 75 69 63 6B 20 42 72 6F 77 6E 20 46 6F 78 20 4A 75 6D 70 73 20 4F 76 65 72 20 54 68 65 20 4C 61 7A 79 20 44 6F 67 2E
But MD5 is about data, not strings, correct? More on this in a bit.
You have to be careful with endianness. Generally folks are talking about byte-swapping larger quantities of 16, 32, 64, etc. bits (whether the value's starting address holds its big end or its little end). Starting with the 64-bit quantity for the length:
0x1122334455667788
when viewed as a list of bytes in increasing address order is, in little endian (as it is generally understood),
88 77 66 55 44 33 22 11
so
0x0000000000000160
would be
60 01 00 00 00 00 00 00
And the next question is your string. Should it start with 0x54686520 or should it start with 0x20656854 or 0x63697551?
I believe from the text in wikipedia
The MD5 hash is calculated according to this algorithm. All values are in little-endian.
//Note: All variables are unsigned 32 bit and wrap modulo 2^32 when calculating
Then your last (only) chunk should look like
0x20656854
0x63697551
0x7242206B
0x206E776F
0x20786F46
0x706D754A
0x764F2073
0x54207265
0x4C206568
0x20797A61
0x2E676F44
0x00000080
0x00000000
0x00000000
0x00000160
0x00000000
Using an MD5 source routine I found online, and the MD5 tool that comes with my Linux distro, I got
ec60fd67aab1c782cd3f690702b21527
as the hash in both cases, and the prepared data for the last/only chunk started with 0x20656854 in this program. The program also correctly calculated the result for a string on Wikipedia.
So, going by the Wikipedia article (which should have explained the 64-bit length a smidge better): your data (it's not a string) needs to be processed as 32-bit little-endian quantities taken from the 512-bit block.
54 68 65 20 becomes 0x20656854, and 0x0000000000000160 becomes 0x00000160, 0x00000000.
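In C that per-word assembly can be sketched like this (not from the original program, just an illustration of reading a 32-bit word least-significant byte first):

#include <stdint.h>

/* Assemble one little-endian 32-bit word from four consecutive message bytes. */
static uint32_t load_le32(const uint8_t *p) {
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* load_le32((const uint8_t *)"The ") gives 0x20656854, matching the first word above. */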
If I were to do this, I would find an MD5 library or class, write a simple example that takes the text I want to hash, then ask the compiler to generate the assembly for the ARM part that I need.
You may consider an mbed [1] or an Arduino [2] version.
[1] https://os.mbed.com/users/hlipka/code/MD5/
[2] https://github.com/tzikis/ArduinoMD5

char * versus unsigned char * and casting

I need to use the SQLite function sqlite3_prepare_v2() (https://www.sqlite.org/c3ref/prepare.html).
This function takes a const char * as its second parameter.
On the other hand, I have prepared an unsigned char * variable v which contains something like this:
INSERT INTO t (c) VALUES ('amitié')
In hexadecimal representation (I cut the line):
49 4E 53 45 52 54 20 49 4E 54 4F 20 74 20 28 63 29
20 56 41 4C 55 45 53 20 28 27 61 6D 69 74 69 E9 27 29
Note the 0xE9 representing the character é.
In order for this piece of code to build properly, I cast the variable v to (const char *) when I pass it as an argument to the sqlite3_prepare_v2() function...
What comments can you make about this cast? Is it really very very bad?
Note that I have been using an unsigned char * pointer to be able to store characters between 0x00 and 0xFF with one byte only.
The source data is coming from an ANSI encoded file.
In the documentation for the sqlite3_prepare_v2() function, I'm also reading the following comment for the second argument of this function:
/* SQL statement, UTF-8 encoded */
What troubles me is the type const char * for the function second argument... I would have been expecting a const unsigned char * instead...
To me - but then again I might be totally wrong - there are only 7 useful bits in a char (one byte), the most significant bit (leftmost) being used to denote the sign of the byte...
I guess I'm missing some kind of point here...
Thank you for helping.
You are correct.
For a UTF-8 input, the sqlite3_prepare_v2 method really should be asking for a const unsigned char * as all 8 bits are being used for data. Their implementation certainly shouldn't be using a signed comparison to check the top bit, because a simple compiler flag can set the default for char to be either unsigned or signed and the former would break the code.
As for your concerns over the cast, this is one of the more benign ones. Casting away signedness on int or float is usually a very bad thing (TM) - or at least a clear indicator that you have a problem.
When dealing with pure ASCII, you are correct that there are 7-bits of data, but the remaining 8th bit is meant to be used for a parity bit, not as a sign bit.
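For what it's worth, the call with the cast might look roughly like this (a sketch only; db and v are assumed to already exist, and the statement text is assumed to be UTF-8 encoded as the API expects):

sqlite3_stmt *stmt = NULL;
int rc = sqlite3_prepare_v2(db,                /* open sqlite3 * handle, assumed to exist */
                            (const char *)v,   /* same bytes, just a different pointer type */
                            -1,                /* -1: read the statement up to its NUL terminator */
                            &stmt,
                            NULL);

The cast does not convert anything; it only tells the compiler to treat the same bytes through a differently-signed pointer type.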

why byte ordering (endianness) is not an issue for standard C strings?

My professor mentioned that byte ordering (endianness) is not an issue for standard C strings (char arrays):
for example: char s[6] = "abcde";
But he did not explain why.
Any explanation of this would be helpful.
Endianess only matters when you have multi-byte data (like integers and floating point numbers). Standard C strings consist of 1-byte characters, so you don't need to worry about endianness.
A char takes up only one byte; that's why the ordering does not matter. An int, for example, has 4 bytes, and these bytes can be arranged in little-endian or big-endian order.
E.g. 0x00010204 can be arranged in two ways in memory:
04 02 01 00 or 00 01 02 04
A char, being a single byte, is fetched as a single byte by the CPU, and a string is just a char array.
The smallest unit of memory storage is 1 byte. If you have a 4-byte value (e.g. 0x01234567),
it will be arranged and fetched in contiguous locations in the order given by the endianness:
Big endian: 01 23 45 67
Little endian: 67 45 23 01
A 1-byte value, on the other hand, is fetched from a single memory location, the smallest block of memory there is, so no byte ordering is involved.
Hope this helps!
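To make the contrast concrete, a tiny C sketch (illustrative only) can print both layouts:

#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint32_t n = 0x00010204;
    const char s[6] = "abcde";
    const unsigned char *pn = (const unsigned char *)&n;

    /* The int's bytes depend on the machine: 04 02 01 00 on little endian,
       00 01 02 04 on big endian. */
    printf("int  bytes: %02X %02X %02X %02X\n", pn[0], pn[1], pn[2], pn[3]);

    /* The char array is 61 62 63 64 65 00 on every machine: one byte per element,
       stored at increasing addresses, so endianness never enters the picture. */
    printf("char bytes:");
    for (int i = 0; i < 6; i++)
        printf(" %02X", (unsigned char)s[i]);
    printf("\n");
    return 0;
}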

Write raw struct contents (bytes) to a file in C. Confused about actual size written

Basic question, but I expected this struct to occupy 13 bytes of space (1 for the char, 12 for the 3 unsigned ints). Instead, sizeof(ESPR_REL_HEADER) gives me 16 bytes.
typedef struct {
    unsigned char version;
    unsigned int root_node_num;
    unsigned int node_size;
    unsigned int node_count;
} ESPR_REL_HEADER;
What I'm trying to do is initialize this struct with some values and write the data it contains (the raw bytes) to the start of a file, so that when I open this file later I can reconstruct the struct and gain some metadata about what the rest of the file contains.
I'm initializing the struct and writing it to the file like this:
int esprime_write_btree_header(FILE * fp, unsigned int node_size) {
    ESPR_REL_HEADER header = {
        .version = 1,
        .root_node_num = 0,
        .node_size = node_size,
        .node_count = 1
    };
    return fwrite(&header, sizeof(ESPR_REL_HEADER), 1, fp);
}
Where node_size is currently 4 while I experiment.
The file contains the following data after I write the struct to it:
-bash$ hexdump test.dat
0000000 01 bf f9 8b 00 00 00 00 04 00 00 00 01 00 00 00
0000010
I expect it to actually contain:
-bash$ hexdump test.dat
0000000 01 00 00 00 00 04 00 00 00 01 00 00 00
0000010
Excuse the newbiness. I am trying to learn :) How do I efficiently write just the data components of my struct to a file?
Microprocessors are not designed to fetch data from arbitrary addresses. Objects such as 4-byte ints should only be stored at addresses divisible by four. This requirement is called alignment.
C gives the compiler freedom to insert padding bytes between struct members to align them. The amount of padding is just one variable between different platforms, another major variable being endianness. This is why you should not simply "dump" structures to disk if you want the program to run on more than one machine.
The best practice is to write each member explicitly, and to use htonl to fix the endianness to big-endian before binary output. When reading back, use memcpy to move the raw bytes into the struct (s below). Do not do
char *buffer_ptr;
...
++buffer_ptr;
s.member = *(int *) buffer_ptr; /* potential alignment error */
but instead do
memcpy(&s.member, buffer_ptr, sizeof s.member);
s.member = ntohl(s.member); /* if member is 4 bytes */
That is because of structure padding, see http://en.wikipedia.org/wiki/Sizeof#Implementation
When you write structures as-is with fwrite, you get them written exactly as they are in memory, including the "dead bytes" inside the struct that are inserted due to padding. Additionally, your multi-byte data is written with the endianness of your system.
If you do not want that to happen, write a function that serializes the data from your structure. You can write only the non-padded areas, and also write multi-byte data in a predictable order (e.g. in network byte order).
The struct is subject to alignment rules, which means some items in it get padded. Looking at it, it looks like the first unsigned char field has been padded to 4 bytes.
One of the gotchas here is that the rules can be different from system to system, so if you write the struct as a whole using fwrite in a program compiled with one compiler on one platform, and then try to read it using fread on another, you could get garbage because the second program will assume the data is aligned to fit its conception of the struct layout.
Generally, you have to either:
Decide that saved data files are only valid for builds of your program that share certain characteristics (depending on the documented behaviour of the compiler you used), or
Not write a whole structure as one, but implement a more formal data format where each element is written individually with its size explicitly controlled.
(A related issue is that byte order could be different; the same choice generally applies there too, except that in option 2 you want to explicitly specify the byte order of the data format.)
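A sketch of the second option for the header from the question might look like this (write_u32_le and write_header are made-up helper names; it emits exactly 13 bytes in a fixed little-endian order, independent of padding and host endianness):

#include <stdio.h>
#include <stdint.h>

/* Write one 32-bit value in a fixed little-endian byte order, independent of
   the host CPU's endianness and of any struct padding. */
static int write_u32_le(FILE *fp, uint32_t v) {
    unsigned char b[4] = {
        (unsigned char)(v & 0xFF),
        (unsigned char)((v >> 8) & 0xFF),
        (unsigned char)((v >> 16) & 0xFF),
        (unsigned char)((v >> 24) & 0xFF)
    };
    return fwrite(b, sizeof b, 1, fp) == 1;
}

/* Emits exactly 13 bytes: 1 for the version, 4 for each unsigned int field. */
static int write_header(FILE *fp, const ESPR_REL_HEADER *h) {
    unsigned char version = h->version;
    return fwrite(&version, 1, 1, fp) == 1
        && write_u32_le(fp, h->root_node_num)
        && write_u32_le(fp, h->node_size)
        && write_u32_le(fp, h->node_count);
}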
Try hard not to do this! The size discrepancy is caused by the padding and alignment used by compilers/linkers to optimize accesses to variables for speed. The padding and alignment rules vary with language and OS. Furthermore, writing ints and reading them back on different hardware can be problematic due to endianness.
Write your metadata byte-by-byte in a structure that cannot be misunderstood. Null-terminated ASCII strings are OK.
I use an awesome open source piece of code written by Troy D. Hanson called TPL: http://tpl.sourceforge.net/.
With TPL you don't have any external dependency. It's as simple as including tpl.c and tpl.h into your own program and use TPL API.
Here is the guide: http://tpl.sourceforge.net/userguide.html
This is because of something called memory alignment. The first char is padded out to occupy 4 bytes of memory. In fact, bigger types like int can only "start" at the beginning of a 4-byte block, so the compiler inserts padding bytes to reach that point.
I had the same problem with the bitmap header, which starts with 2 chars. I used a char bm[2] inside the struct and wondered for 2 days where the #$%^ the 3rd and 4th bytes of the header were going...
If you want to prevent this you can use __attribute__((packed)), but beware: memory alignment IS needed for your program to run efficiently.
If you want to write the data in a specific format, use array(s) of unsigned char ...
unsigned char outputdata[13];
outputdata[0] = 1;
outputdata[1] = 0;
/* ... of course, use data from struct ... */
outputdata[12] = 0;
fwrite(outputdata, sizeof outputdata, 1, fp);
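For completeness, the __attribute__((packed)) route mentioned above looks like this with GCC/Clang (a sketch; the typedef name is made up, and packing trades away the alignment the compiler normally provides, while host endianness is still an issue):

typedef struct __attribute__((packed)) {
    unsigned char version;
    unsigned int  root_node_num;
    unsigned int  node_size;
    unsigned int  node_count;
} ESPR_REL_HEADER_PACKED;   /* sizeof is 13 here, on platforms where int is 4 bytes */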

how do I determine if this is latin1 or utf8?

I have a string "Artîsté" in a latin1 table. I use a C MySQL connector to get the string out of the table. I have character_set_connection set to utf8.
In the debugger it looks like:
"Art\xeest\xe9"
If I print the hex values with printf ("%02X", (unsigned char) a[i]); for each char I get
41 72 74 EE 73 74 E9
How do I know if it is utf8 or latin1?
\x74\xee\x73 isn't a valid UTF-8 sequence, since UTF-8 never has a run of only 1 byte with the top bit set. So of the two, it must be Latin-1.
However, if you see bytes that are valid UTF-8 data, then it's not always possible to rule out that it might be Latin-1 that just so happens to also be valid UTF-8.
Latin-1 does have some invalid bytes (the ASCII control characters 0x00-0x1F and the unused range 0x7f-0x9F), so there are some UTF-8 strings that you can be sure are not Latin-1. But in my experience it's common enough to see Windows CP1252 mislabelled as Latin-1, that rejecting all those code points is fairly futile except in the case where you're converting from another charset to Latin-1, and want to be strict about what you output. CP1252 has a few unused bytes too, but not as many.
As you can see from the layout of a UTF-8 sequence, there are two main possibilities:
1st bit = 0 (same as ASCII): a single byte per character, with a value <= 0x7F
1st bit = 1: the start of a multi-byte UTF-8 sequence, at least 2 bytes long, in which every byte has a value >= 0x80
This dump is ISO 8859 encoded:
41 72 74 *EE* 73 74 *E9*
There are only two stand-alone bytes with values >= 0x80, so it cannot be UTF-8.
ADDED: BEWARE
Be careful! Even if you find a well-formed UTF-8 sequence, you cannot always differentiate it from a bunch of ISO 8859 characters!
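A structural check along those lines might be sketched in C like this (the function name is made up; it only verifies the UTF-8 byte pattern, so, as noted above, a passing result is not proof that the text really is UTF-8):

#include <stddef.h>

/* Returns 1 if the buffer is structurally valid UTF-8, 0 otherwise.
   It checks only the lead/continuation byte pattern, not overlong forms
   or code point ranges. */
static int looks_like_utf8(const unsigned char *s, size_t n) {
    size_t i = 0;
    while (i < n) {
        unsigned char c = s[i++];
        size_t extra;
        if (c <= 0x7F)               extra = 0;  /* plain ASCII byte        */
        else if ((c & 0xE0) == 0xC0) extra = 1;  /* 110xxxxx lead byte      */
        else if ((c & 0xF0) == 0xE0) extra = 2;  /* 1110xxxx lead byte      */
        else if ((c & 0xF8) == 0xF0) extra = 3;  /* 11110xxx lead byte      */
        else return 0;                           /* stray continuation byte */
        while (extra--) {
            if (i >= n || (s[i] & 0xC0) != 0x80) /* must be a 10xxxxxx byte */
                return 0;
            i++;
        }
    }
    return 1;
}

/* For 41 72 74 EE 73 74 E9 this returns 0: EE announces two continuation
   bytes, but the next byte is 73. */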
