Appending the message in MD5

I am trying to understand how the MD5 hashing algorithm works and have been reading the Wikipedia article about it.
After one appends the message so that the length of the message (in bits) is congruent to 448 mod 512, one is supposed to
append length mod (2^64) to message
From what I can understand, this means appending 64 bits representing the length of the message to the message. I am a bit confused about how this is done.
My first question is: is this the length of the original, unpadded message, or the length one gets after appending the 1 followed by zeros?
My second question is: is the length the length in bytes? That is, if my message is one byte, would I append the message with 63 0's and then a 1? Or if the message is 10 bytes, would I append the message with 60 0's and then 1010?

The length of the unpadded message. From the MD5 RFC, 3.2:
A 64-bit representation of b (the length of the message before the
padding bits were added) is appended to the result of the previous
step. In the unlikely event that b is greater than 2^64, then only
the low-order 64 bits of b are used. (These bits are appended as two
32-bit words and appended low-order word first in accordance with the
previous conventions.)
The length is in bits. See MD5 RFC, 3.1:
The message is "padded" (extended) so that its length (in bits) is
congruent to 448, modulo 512. That is, the message is extended so
that it is just 64 bits shy of being a multiple of 512 bits long.
Padding is always performed, even if the length of the message is
already congruent to 448, modulo 512.
The MD5 spec is far more precise than the Wikipedia article. I always suggest reading the spec over the Wiki page if you want implementation-level detail.
if my message is one byte, would I append the message with 63 0's and then a 1. Or if the message is 10 bytes, then I would append the message with 60 0's and 1010.
Not quite. Don't forget the obligatory bit value "1" that is always appended at the start of the padding. From the spec:
Padding is performed as follows: a single "1" bit is appended to the
message, and then "0" bits are appended so that the length in bits of
the padded message becomes congruent to 448, modulo 512. In all, at
least one bit and at most 512 bits are appended.
This reference C implementation of MD5 (disclaimer: my own) may be of help; it's written so that it's hopefully easy to follow.
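Not the reference code itself, but a minimal sketch of just the padding step in C may make the two rules concrete. It assumes the whole message is in memory, its length is a whole number of bytes, and the caller provides a large-enough output buffer; the function name md5_pad is hypothetical:

#include <stdint.h>
#include <string.h>

/* Sketch only: pad `msg` of `len` bytes into `out` (caller-provided,
 * large enough) and return the padded length in bytes. */
size_t md5_pad(const uint8_t *msg, size_t len, uint8_t *out)
{
    uint64_t bit_len = (uint64_t)len * 8;  /* length of the ORIGINAL message, in bits */
    size_t padded = len + 1;

    memcpy(out, msg, len);
    out[len] = 0x80;                       /* the single "1" bit, followed by seven "0" bits */

    while (padded % 64 != 56)              /* zero-pad until length is congruent to 448 (mod 512) bits */
        out[padded++] = 0x00;

    for (int i = 0; i < 8; i++)            /* append the 64-bit bit length, low-order byte first */
        out[padded++] = (uint8_t)(bit_len >> (8 * i));

    return padded;
}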

Related

Data layouts used by C compilers (the alignment concept)

Below is an excerpt from the red dragon book.
Example 7.3. Figure 7.9 is a simplification of the data layout used by C compilers for two machines that we call Machine 1 and Machine 2.
Machine 1 : The memory of Machine 1 is organized into bytes consisting of 8 bits each. Even though every byte has an address, the instruction set favors short integers being positioned at bytes whose addresses are even, and integers being positioned at addresses that are divisible by 4. The compiler places short integers at even addresses, even if it has to skip a byte as padding in the process. Thus, four bytes, consisting of 32 bits, may be allocated for a character followed by a short integer.
Machine 2: each word consists of 64 bits, and 24 bits are allowed for the address of a word. There are 64 possibilities for the individual bits inside a word, so 6 additional bits are needed to distinguish between them. By design, a pointer to a character on Machine 2 takes 30 bits — 24 to find the word and 6 for the position of the character inside the word. The strong word orientation of the instruction set of Machine 2 has led the compiler to allocate a complete word at a time, even when fewer bits would suffice to represent all possible values of that type; e.g., only 8 bits are needed to represent a character. Hence, under alignment, Fig. 7.9 shows 64 bits for each type. Within each word, the bits for each basic type are in specified positions. Two words consisting of 128 bits would be allocated for a character followed by a short integer, with the character using only 8 of the bits in the first word and the short integer using only 24 of the bits in the second word. □
I found out about the concept of alignment here, here and here. What I could understand from them is as follows: in word-addressable CPUs (where the word size is more than a byte), certain padding is introduced into data objects so that the CPU can retrieve data from memory with the minimum number of memory cycles.
Now, Machine 1 here is actually byte-addressable, and the conditions in its specification are probably more involved than those of a simple word-addressable machine with a word size of, say, 4 bytes. In such a word-addressable machine we only need to make sure that our data items are word-aligned; there is no further difficulty. But how does one determine the alignment in systems like Machine 1 (as given in the table above), where the simple concept of word alignment does not apply, because the machine is byte-addressable and has these more involved constraints?
Moreover, I find it quite weird that in the row for double, the size of the type is more than what is given in the alignment field. Shouldn't alignment (in bits) ≥ size (in bits)? After all, doesn't alignment refer to the memory actually allocated for the data object?
"each word consists of 64 bits, and 24 bits are allowed for the address of a word. There are 64 possibilities for the individual bits inside a word, so 6 additional bits are needed to distinguish between them. By design, a pointer to a character on Machine 2 takes 30 bits — 24 to find the word and 6 for the position of the character inside the word." - Moreover how should this statement about the concept of the pointers, based on alignment is to be visualized (2^6 = 64, it is fine but how is this 6 bits correlating with the alignment concept)
First of all, Machine 1 is not special at all - it is exactly like x86-32 or 32-bit ARM.
Moreover I find it quite weird that in the row for double the size of the type is more than what is given in the alignment field. Shouldn't alignment(in bits) ≥ size (in bits) ? Because alignment refers to the memory actually allocated for the data object (?).
No, this isn't true. Alignment means that the address of the lowest addressable byte in the object must be divisible by the given number of bytes.
Additionally, in C it is also true that within arrays sizeof (ElementType) needs to be greater than or equal to the alignment of the element type, and sizeof (ElementType) must be divisible by that alignment - hence footnote a. Therefore, on the latter machine:
struct { char a, b; }
might have sizeof 16 because the characters are in distinct addressable words, whereas
struct { char a[2]; }
could be squeezed into 8 bytes.
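On a byte-addressable machine like Machine 1 (or x86-32) you can observe these rules directly. A small C sketch, assuming a C11 compiler; the exact numbers depend on the ABI, but on a Machine-1-like 32-bit target it typically prints 4, 2, 2 - one char, one padding byte, then a 2-byte-aligned short:

#include <stdio.h>
#include <stddef.h>    /* offsetof */
#include <stdalign.h>  /* alignof (C11) */

struct CharShort {
    char  c;   /* 1 byte */
    short s;   /* needs 2-byte alignment, so a padding byte is typically inserted before it */
};

int main(void)
{
    printf("sizeof(struct CharShort) = %zu\n", sizeof(struct CharShort));
    printf("offsetof(s)              = %zu\n", offsetof(struct CharShort, s));
    printf("alignof(short)           = %zu\n", alignof(short));
    return 0;
}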
how should this statement about the concept of the pointers, based on alignment is to be visualized (2^6 = 64, it is fine but how is this 6 bits correlating with the alignment concept)
As for the character pointers, the 6 bits is bogus: 3 bits are enough to choose one of the 8 bytes within an 8-byte word, so this is an error in the book. An ordinary word pointer would select just a word with 24 bits, and a character (byte) pointer would select the word with 24 bits plus one of the eight 8-bit bytes inside the word with 3 more bits.

Concept of converting from UTF-8 to UTF-16 LE - the math operation in C programming [closed]

I would like to know the concept of conversion from UTF-8 to UTF-16 LE.
For example:
input sequence: E3 81 82
output sequence: 42 30
What is the actual arithmetic operation in this conversion? (I do not want to call built-in libraries.)
Basically, Unicode is a way to represent as many symbols as possible in one continuous code space; the code of each symbol is usually called a "code point".
UTF-8 and UTF-16 are just ways to encode and represent those code points in one or more octets (UTF-8) or 16-bit words (UTF-16). The latter can be represented as a pair of octets in either little-endian ("least significant byte first", or "Intel byte order") or big-endian ("most significant byte first", or "Motorola byte order") order, which gives us two variants: UTF-16LE and UTF-16BE.
The first thing you need to do is extract the code point from the UTF-8 sequence.
UTF-8 is encoded as follows:
0x00...0x7F encode the symbol "as-is"; these correspond to standard ASCII characters
but if the most significant bit is set (i.e. 0x80...0xFF), it means that this is a sequence of several bytes which together encode the code point
bytes in the range 0xC0...0xFF come first in such a sequence; in binary representation they will be:
0b110xxxxx - 1 more byte follows and xxxxx are 5 most significant bits of the code point
0b1110xxxx - 2 more bytes follow and xxxx are 4 most significant bits of the code point
0b11110xxx - 3 more bytes...
0b111110xx - 4 more bytes...
No code point currently defined in the Unicode standard requires more than 4 UTF-8 bytes.
The following bytes are in the range 0x80...0xBF (i.e. 0b10xxxxxx), and each encodes the next six bits (from most to least significant) of the code point value.
So, looking at your example: E3 81 82
0xE3 == 0b11100011 means there will be 2 more bytes in this code point and 0011 are its most significant bits
0x81 == 0b10000001 means this is not the first byte of the sequence and it encodes the next 6 bits: 000001
0x82 == 0b10000010 means this is not the first byte of the sequence and it encodes the next 6 bits: 000010
i.e. result will be 0011 000001 000010 == 0x3042
UTF-16 works in a similar way. The most common code points are just encoded "as-is", but larger values are packed into so-called "surrogate pairs", which are combinations of two 16-bit words:
values in the range 0xD800...0xDBFF represent the first word of the pair; its 10 low bits encode the 10 most significant bits of (code point - 0x10000).
values in the range 0xDC00...0xDFFF represent the second word; its 10 low bits encode the 10 least significant bits of (code point - 0x10000).
Surrogate pairs are required for values greater than 0xFFFF. The range 0xD800...0xDFFF itself is reserved in the Unicode standard for surrogates, so there are no symbols with those code points.
So, in our example 0x3042 does not hit that range and therefore requires only one 16-bit word.
Since your example uses the UTF-16LE (little-endian) variant, the least significant half of that word comes first in the byte sequence, i.e.
0x42 0x30
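Putting the two halves of the explanation together, a rough C sketch of the arithmetic (it assumes the input is already well-formed UTF-8 and does no validation; the function names are made up for illustration):

#include <stdint.h>
#include <stdio.h>

/* Decode one UTF-8 sequence starting at p and return the code point;
 * *consumed receives the number of bytes read. No validation. */
static uint32_t utf8_decode(const uint8_t *p, int *consumed)
{
    uint32_t cp;
    int extra;

    if (p[0] < 0x80)      { cp = p[0];        extra = 0; }  /* 0xxxxxxx */
    else if (p[0] < 0xE0) { cp = p[0] & 0x1F; extra = 1; }  /* 110xxxxx */
    else if (p[0] < 0xF0) { cp = p[0] & 0x0F; extra = 2; }  /* 1110xxxx */
    else                  { cp = p[0] & 0x07; extra = 3; }  /* 11110xxx */

    for (int i = 1; i <= extra; i++)
        cp = (cp << 6) | (p[i] & 0x3F);    /* append 6 bits per continuation byte */

    *consumed = extra + 1;
    return cp;
}

/* Emit a code point as UTF-16LE bytes; returns 2 or 4 (bytes written). */
static int utf16le_encode(uint32_t cp, uint8_t *out)
{
    if (cp < 0x10000) {                    /* one 16-bit word, low byte first */
        out[0] = (uint8_t)(cp & 0xFF);
        out[1] = (uint8_t)(cp >> 8);
        return 2;
    }
    cp -= 0x10000;                         /* 20 bits, split across a surrogate pair */
    uint16_t hi = (uint16_t)(0xD800 | (cp >> 10));
    uint16_t lo = (uint16_t)(0xDC00 | (cp & 0x3FF));
    out[0] = (uint8_t)(hi & 0xFF); out[1] = (uint8_t)(hi >> 8);
    out[2] = (uint8_t)(lo & 0xFF); out[3] = (uint8_t)(lo >> 8);
    return 4;
}

int main(void)
{
    const uint8_t in[] = { 0xE3, 0x81, 0x82 };   /* UTF-8 for U+3042 */
    uint8_t out[4];
    int used;
    int n = utf16le_encode(utf8_decode(in, &used), out);
    for (int i = 0; i < n; i++)
        printf("%02X ", out[i]);                 /* prints: 42 30 */
    printf("\n");
    return 0;
}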

Convert int/string to byte array with length n

How can I convert a value like 5 or "Testing" to an array of type byte with a fixed length of n bytes?
Edit:
I want to represent the number 5 in bits. I know that it's 101, but I want it represented as an array with a length of, for example, 6 bytes, so 000000....
I'm not sure what you are trying to accomplish here, but assuming you simply want to represent characters in the binary form of their ASCII codes, you can pad the binary representation with zeros. For example, if the fixed width you want is 10 digits, then encoding the letter a (ASCII code 97) in binary gives 1100001, which padded to 10 digits becomes 0001100001 - but that is the encoding of a single character. The encoding of a string, which is made up of multiple characters, is then a sequence of these 10-digit binary codes, one per character in the ASCII table. The encoding of data is important so that the system knows how to interpret the binary data. Then there is also endianness, depending on the system architecture - but that's less of an issue these days, with many older and modern processors (such as ARM) being bi-endian.
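To make the padding idea concrete, here is a small C sketch; the 10-digit width is just the example figure used above:

#include <stdio.h>

/* Print the value of `ch` as a zero-padded, fixed-width binary string. */
static void print_padded_binary(unsigned char ch, int width)
{
    for (int bit = width - 1; bit >= 0; bit--)
        putchar(((ch >> bit) & 1) ? '1' : '0');
    putchar('\n');
}

int main(void)
{
    print_padded_binary('a', 10);   /* 0001100001 : ASCII 97 padded to 10 digits */
    print_padded_binary(5, 10);     /* 0000000101 : the number 5 padded the same way */
    return 0;
}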
So forget about representing the number 5 and the string "WTF" using the same number of bytes - it makes the brain hurt. Stop it.
A bit more reading on character encoding would be helpful.
Start here - https://en.wikipedia.org/wiki/ASCII
Then this - https://en.wikipedia.org/wiki/UTF-8
Then brain hurt - https://en.wikipedia.org/wiki/Endianness

Understanding the magic number 0x07EFEFEFF used for strlen optimization

I stumbled upon this answer regarding the utilization of the magic number 0x07EFEFEFF used for strlen's optimization, and here is what the top answer says:
Look at the magic bits. Bits number 16, 24 and 31 are 1. 8th bit is 0.
8th bit represents the first byte. If the first byte is not zero, 8th bit becomes 1 at this point. Otherwise it's 0.
16th bit represents the second byte. Same logic.
24th bit represents the third byte.
31th bit represents the fourth byte.
However, if I calculate result = ((a + magic) ^ ~a) & ~magic with a = 0x100, I find that result = 0x81010100, meaning that according to the top answerer, the second byte of a equals 0, which is obviously false.
What am I missing?
Thanks!
The bits only tell you if a byte is zero if the lower bytes are non-zero -- so it can only tell you the FIRST 0 byte, but not about bytes after the first 0.
bit8=1 means first byte is zero. Other bytes, unknown
bit8=0 means first byte is non-zero
bit8=0 & bit16=1 means second byte is zero, higher bytes unknown
bit8=0 & bit16=0 means first two bytes are non-zero.
Also, the last bit (bit31) only tells you about 7 bits of the last byte (and only if the first 3 bytes are non-zero) -- if it is the only bit set then the last byte is 0 or 128 (and the rest are non-zero).
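A small C sketch of the flag computation may help, assuming the 32-bit magic value 0x7EFEFEFF from the linked answer (the helper name is made up). It reproduces the asker's 0x81010100 and shows that the higher hole bits only mean something once the lower bytes are known to be non-zero:

#include <stdio.h>
#include <inttypes.h>

/* Flag word from the classic strlen trick: set bits at positions 8, 16,
 * 24 and 31 suggest zero bytes, but only the first zero byte is reliable. */
static uint32_t zero_byte_flags(uint32_t a)
{
    const uint32_t magic = 0x7EFEFEFF;
    return ((a + magic) ^ ~a) & ~magic;
}

int main(void)
{
    /* First byte IS zero, so no carry starts and the higher hole bits
     * get set too - they tell us nothing (the asker's case). */
    printf("%08" PRIX32 "\n", zero_byte_flags(0x00000100)); /* 81010100 */

    /* No zero byte anywhere: no hole bit is set. */
    printf("%08" PRIX32 "\n", zero_byte_flags(0x11223344)); /* 00000000 */

    /* First byte non-zero, second byte zero: only bit 16 is set. */
    printf("%08" PRIX32 "\n", zero_byte_flags(0x11220011)); /* 00010000 */

    return 0;
}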

Write 9 bits binary data in C

I am trying to write binary data that does not fit in 8 bits to a file. From what I understand, you can write binary data of any length if you can group it into a predefined length of 8, 16, 32 or 64 bits.
Is there a way to write just 9 bits to a file? Or two values of 9 bits?
I have one value in the range ±32768 and 3 values in the range ±256. What would be the way to save the most space?
Thank you
No, I don't think there's any way using C's file I/O APIs to express storing less than 1 char of data, which will typically be 8 bits.
If you're on a 9-bit system, where CHAR_BIT really is 9, then it will be trivial.
If what you're really asking is "how can I store a number that has a limited range using the precise number of bits needed", inside a possibly larger file, then that's of course very possible.
This is often called bitstreaming and is a good way to optimize the space used for some information. Encoding/decoding bitstream formats requires you to keep track of how many bits you have "consumed" of the current input/output byte in the actual file. It's a bit complicated but not very hard.
Basically, you'll need the following (a sketch follows the list):
A byte stream s, i.e. something you can put bytes into, such as a FILE *.
A bit index i, i.e. an unsigned value that keeps track of how many bits you've emitted.
A current byte x, into which bits can be put, each time incrementing i. When i reaches CHAR_BIT, write it to s and reset i to zero.
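A minimal sketch of such a bit writer in C (the type and function names are made up; no error handling). Values must be masked to their bit width before writing, e.g. value & 0x1FF for a 9-bit field:

#include <stdio.h>
#include <limits.h>   /* CHAR_BIT */

/* Tiny bit-stream writer: bits accumulate in x and are flushed to s one
 * byte at a time, most significant bit first. */
typedef struct {
    FILE    *s;   /* underlying byte stream            */
    unsigned x;   /* current byte being assembled      */
    unsigned i;   /* number of bits already put into x */
} BitWriter;

static void put_bit(BitWriter *w, unsigned bit)
{
    w->x = (w->x << 1) | (bit & 1u);
    if (++w->i == CHAR_BIT) {     /* byte full: emit it and start over */
        fputc((int)w->x, w->s);
        w->x = 0;
        w->i = 0;
    }
}

/* Write the low `nbits` of `value`, most significant bit first. */
static void put_bits(BitWriter *w, unsigned value, unsigned nbits)
{
    while (nbits--)
        put_bit(w, (value >> nbits) & 1u);
}

static void flush_bits(BitWriter *w)  /* pad the final partial byte with zeros */
{
    while (w->i != 0)
        put_bit(w, 0);
}

Used as, for example: BitWriter w = { fopen("bits.bin", "wb"), 0, 0 }; then put_bits(&w, (unsigned)a & 0xFFFF, 16); put_bits(&w, (unsigned)b & 0x1FF, 9); and so on, finishing with flush_bits(&w); fclose(w.s);. Signed values must be masked to their field width before writing and sign-extended when reading back.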
You cannot store values in the range –256 to +256 in nine bits either. That is 513 values, and nine bits can only distinguish 512 values.
If your actual ranges are –32768 to +32767 and –256 to +255, then you can use bit-fields to pack them into a single structure:
struct MyStruct
{
    int a : 16;
    int b : 9;
    int c : 9;
    int d : 9;
};
Objects such as this will still be rounded up to a whole number of bytes, so the above will occupy six bytes on typical systems, since it uses 43 bits in total and the next whole number of eight-bit bytes is 48 bits.
You can either accept this padding of 43 bits to 48 or use more complicated code to concatenate bits further before writing to a file. This requires additional code to assemble bits into sequences of bytes. It is rarely worth the effort, since storage space is currently cheap.
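For completeness, a short sketch of writing the bit-field structure above to a file; the exact in-memory layout of bit-fields is implementation-defined, so this is compact but not a portable interchange format, and the file name is just an example:

#include <stdio.h>

struct MyStruct
{
    int a : 16;
    int b : 9;
    int c : 9;
    int d : 9;
};

int main(void)
{
    struct MyStruct m = { .a = -12345, .b = -200, .c = 100, .d = 255 };
    FILE *f = fopen("packed.bin", "wb");
    if (!f)
        return 1;
    /* Typically writes 6 or 8 bytes, depending on the ABI's padding rules. */
    fwrite(&m, sizeof m, 1, f);
    fclose(f);
    printf("sizeof(struct MyStruct) = %zu\n", sizeof(struct MyStruct));
    return 0;
}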
You can apply the principle of base64 (just enlarging your base, not making it smaller).
Every value will be written to two bytes and combined with the last/next byte by shift and or operations.
I hope this very abstract description helps you.
