Maximum length for MD5 input/output - md5

What is the maximum length of the string that can be MD5-hashed? Or, if there is no limit on the input, what is the maximum length of the MD5 output value?

MD5 processes an arbitrary-length message into a fixed-length output of 128 bits, typically represented as a sequence of 32 hexadecimal digits.

The length of the message is unlimited.
Append Length
A 64-bit representation of b (the length of the message before the
padding bits were added) is appended to the result of the previous
step. In the unlikely event that b is greater than 2^64, then only
the low-order 64 bits of b are used.
The hash is always 128 bits. If you encode it as a hexadecimal string you can encode 4 bits per character, giving 32 characters.
MD5 is not encryption. You cannot in general "decrypt" an MD5 hash to get the original string.
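As a quick illustration, here is a minimal sketch in C, assuming OpenSSL's legacy MD5 API from <openssl/md5.h> is available (deprecated in OpenSSL 3.x but still usable; link with -lcrypto). Whatever the length of the input, the digest is 16 bytes, printed as 32 hex characters:

#include <stdio.h>
#include <string.h>
#include <openssl/md5.h>

int main(void)
{
    const char *msg = "hello world";          /* arbitrary-length input */
    unsigned char digest[MD5_DIGEST_LENGTH];  /* always 16 bytes (128 bits) */

    MD5((const unsigned char *)msg, strlen(msg), digest);

    for (int i = 0; i < MD5_DIGEST_LENGTH; i++)
        printf("%02x", digest[i]);            /* 16 bytes -> 32 hex chars */
    printf("\n");
    return 0;
}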

The input can have any length, but of course there can be a memory issue on the computer if the string is too long. The output is always 32 characters.

The algorithm has been designed to support arbitrary input lengths, i.e. you can compute hashes of big files like the ISO image of a DVD...
If there is a limitation on the input it would come from the environment where the hash function is used. Let's say you want to hash a file and the environment has a maximum file size limit.
But the output string will always be the same: 32 hex chars (128 bits)!
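For example, a minimal sketch of hashing a file of arbitrary size in fixed-size chunks, again assuming OpenSSL's MD5_Init/MD5_Update/MD5_Final (link with -lcrypto); the whole file never has to fit in memory, only the small buffer does:

#include <stdio.h>
#include <openssl/md5.h>

int hash_file(const char *path, unsigned char digest[MD5_DIGEST_LENGTH])
{
    unsigned char buf[4096];
    size_t n;
    MD5_CTX ctx;
    FILE *f = fopen(path, "rb");
    if (!f)
        return -1;

    MD5_Init(&ctx);
    while ((n = fread(buf, 1, sizeof buf, f)) > 0)
        MD5_Update(&ctx, buf, n);   /* feed the stream chunk by chunk */
    MD5_Final(digest, &ctx);        /* always yields 16 bytes, however big the file */

    fclose(f);
    return 0;
}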

A 128-bit MD5 hash is represented as a sequence of 32 hexadecimal digits.

You may want to use SHA-1 instead of MD5, as MD5 is considered broken.
You can read more about MD5 vulnerabilities in this Wikipedia article.

There is no limit to the input of md5 that I know of. Some implementations require the entire input to be loaded into memory before passing it into the md5 function (i.e., the implementation acts on a block of memory, not on a stream), but this is not a limitation of the algorithm itself. The output is always 128 bits. Note that md5 is not an encryption algorithm, but a cryptographic hash. This means that you can use it to verify the integrity of a chunk of data, but you cannot reverse the hashing.
Also note that md5 is considered broken, so you shouldn't use it for anything security-related (it's still fine to verify the integrity of downloaded files and such).

The MD5 algorithm appends the message length as the last 64 bits of the final block, so it would be fair to say that the message can be up to 2^64 bits long (about 1.8 * 10^19 bits).

Max length for MD5 input: the largest definable and usable stream of bits.
The constraints on such a bit stream can depend on the operating system, hardware, programming language and more...
Length of MD5 output: fixed-length, always 128 bits.
For easier display the digest is usually shown in hex; because each hex digit (0-1-2-3-4-5-6-7-8-9-A-B-C-D-E-F) represents 4 bits, the output can be displayed as 32 hex digits.
128 bits = 16 bytes = 32 hex digits

The MD5 output is always 32 characters. Therefore, when setting the character limit for the password column in the database, do not use a value below 32 characters. If you use a value below 32, the hash will be recorded incompletely in the database and users will encounter an error while logging into the system.

Related

How many bytes will be required to store number in binary and text files respectively

If I want to store a number, let's say 56789 in a file, how many bytes will be required to store it in binary and text files respectively? I want to know how bytes are allocated to data in binary and text files.
It depends on:
text encoding and number system (decimal, hexadecimal, many more...)
signed/not signed
single integer or multiple (require separators)
data type
target architecture
use of compressed encodings
In ASCII a character takes 1 byte. In UTF-8 a character takes 1 to 4 bytes, but digits always take 1 byte. In UTF-16 a character takes 2 or more bytes.
Non-ASCII formats may also require a few extra bytes at the start of the file for a BOM (2 bytes for UTF-16, 3 for UTF-8); this depends on the editor and/or settings used when the file was created.
But let's assume you store the data in a simple ASCII file, or the discussion becomes needlessly complex.
Let's also assume you use the decimal number system.
In hexadecimal you use digits 0-9 and letters a-f to represent numbers. A decimal (base-10) like 34234324423 would be 7F88655C7 in hexadecimal (base-16). In the first system we have 11 digits, in the second just 9 digits. The minimum base is 2 (digits 0 and 1) and the common maximum base is 64 (base-64). Technically, with ASCII you could go as high as base-96 maybe base-100, but that's very uncommon.
Each digit (0-9) will take one byte. If you have signed integers, an additional minus sign will lead the digits (so negative numbers cost 1 additional byte).
In some circumstances you may want to store several numerals. You will need a separator to tell the numerals apart. A comma (,), colon (:), semicolon (;), pipe (|) or newline (LF, CR or, on Windows, CRLF, which takes 2 bytes) have all been observed in the wild as legitimate separators of numerals.
What is a numeral? The concept or idea of the quantity 8 that is IN YOUR HEAD is the number. Any representation of that concept on stone, paper, magnetic tape, or as pixels on a screen is just that: a REPRESENTATION. Such symbols, which stand for what you understand in your brain, are numerals. Please don't ever confuse numbers with numerals; this distinction is foundational to mathematics and computer science.
In these cases you want to count an additional character for the separator per numeral, or maybe per numeral minus one. It depends on whether you want to terminate each numeral with a marker or separate the numerals from each other:
Example (three digits and three newlines): 6 bytes
1<LF>
2<LF>
3<LF>
Example (three digits and two commas): 5 bytes
1,2,3
Example (four digits and one comma): 5 bytes
2134,
Example (sign and one digit): 2 bytes
-3
If you store the data in a binary format (not to be confused with the binary number system, which would still be a text format) the occupied memory depends on the integer type (or, better, bit length of the integer).
An octet (0..255) will occupy 1 byte. No separators or leading signs required.
A 16-bit integer will occupy 2 bytes. For C and C++ the underlying architecture and data model must be taken into account: a long on a common 32-bit architecture takes 4 bytes, while the very same code, compiled for a 64-bit (LP64) architecture, takes 8 bytes.
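A quick way to check these sizes on the machine and compiler you actually use (the printed values depend on the architecture and data model, which is exactly the point):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    printf("uint8_t : %zu byte(s)\n", sizeof(uint8_t));  /* 1 */
    printf("int16_t : %zu byte(s)\n", sizeof(int16_t));  /* 2 */
    printf("int     : %zu byte(s)\n", sizeof(int));      /* typically 4 */
    printf("long    : %zu byte(s)\n", sizeof(long));     /* 4 on 32-bit, 8 on LP64 */
    return 0;
}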
There are exceptions to those flat rules. As an example, Google's protobuf uses a zig-zag VarInt implementation that leverages variable length encoding.
Here is a VarInt implementation in C/C++.
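For illustration only, here is a minimal sketch of an unsigned, LEB128-style varint encoder in C. It shows the variable-length idea but is not protobuf's exact zig-zag implementation:

#include <stdint.h>
#include <stddef.h>

/* Encodes value as 7 bits per output byte; the high bit of each byte
 * means "more bytes follow". Small numbers take 1 byte, a full 64-bit
 * value takes up to 10 bytes. Returns the number of bytes written. */
size_t varint_encode(uint64_t value, uint8_t *buf)
{
    size_t n = 0;
    do {
        uint8_t byte = value & 0x7F;   /* low 7 bits */
        value >>= 7;
        if (value != 0)
            byte |= 0x80;              /* continuation flag */
        buf[n++] = byte;
    } while (value != 0);
    return n;
}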
EDIT: added Thomas Weller's suggestion
Beyond the actual file CONTENT, metadata about the file has to be stored (bookkeeping such as the first sector, the filename, access permissions and more). This metadata is not shown as part of the file's size, but it does occupy space on disk.
If you store each numeral in a separate file such as the numeral 10 in the file result-10, these metadata entries will occupy more space than the numerals themselves.
If you store ten, hundred, thousands or millions/billions of numerals in one file, that overhead becomes increasingly irrelevant.
EDIT: to be clearer about file overhead
The overhead is relevant under some circumstances, as discussed above.
But it is not a differentiator between textual and binary formats. As doug65536 says, however you store the data, if the filesystem structure is the same, it does not matter.
A file is a file, independently if it contains binary data or ASCII text.
Still, the above reasoning applies independently from the format you choose.
The number of digits needed to store a number in a given number base is ceil(log(n)/log(base)).
Storing as decimal would be base 10, storing as hexadecimal text would be base 16. Storing as binary would be base 2.
You would usually need to round up to a multiple of eight or power of two when storing as binary, but it is possible to store a value with an unusual number of bits in a packed format.
Given your example number (ignoring negative numbers for a moment):
56789 in base 2 needs 15.793323887 bits (16)
56789 in base 10 needs 4.754264221 decimal digits (5)
56789 in base 16 needs 3.948330972 hex digits (4)
56789 in base 64 needs 2.632220648 characters (3)
Representing sign needs an additional character or bit.
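A small sketch applying that ceil(log(n)/log(base)) formula from above to the example number (plain C, link with -lm); the printed values match the list:

#include <stdio.h>
#include <math.h>

int main(void)
{
    double n = 56789.0;
    int bases[] = { 2, 10, 16, 64 };

    for (int i = 0; i < 4; i++) {
        double exact = log(n) / log(bases[i]);   /* digits needed, fractional */
        printf("base %2d: %.9f -> %d digits\n",
               bases[i], exact, (int)ceil(exact));
    }
    return 0;
}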
To look at how binary compares to text: assume a byte is 8 bits, so each ASCII character is one byte in a text encoding (8 bits). A byte has a range of 0 to 255; a decimal digit has a range of 0 to 9. Each text character (8 bits) can therefore encode only about 3.32 bits of a number per byte (log(10)/log(2)), whereas a binary encoding can store 8 bits of a number per byte. Encoding numbers as text takes about 2.4x more space. If you pad out your numbers so they line up in fields, text becomes a very poor storage encoding: with a typical width of 10 digits you'll be storing 80 bits, which would hold only about 33 bits of binary-encoded data.
I am not too well-versed in this subject; however, I believe it would not just be a case of the content, but also the metadata attached. But if you were just talking about the number, you could store it in ASCII or in a binary form.
In binary, 56789 could be converted to 1101110111010101; there is a 'simple' way to work this out on paper. But, http://www.binaryhexconverter.com/decimal-to-binary-converter is a website you can use to convert it.
1101110111010101 has 16 characters, therefore 16 bits which is two bytes.
If you stored each of those 16 binary digits as a separate integer, each integer would usually take about 4 bytes of storage, so the number would take roughly 16 * 4 = 64 bytes; with 64-bit rather than 32-bit integers, each digit would take 8 bytes, for a total of 128 bytes. Stored instead as a plain text string of '0' and '1' characters, the same 16 digits take 16 bytes.
Before you post any question, you should do your research.
The size of the file depends on many factors, but for the sake of simplicity: in text format, numbers will occupy 1 byte for each digit character if you are using UTF-8 encoding. On the other hand, a binary value of a 32-bit integer type will take 4 bytes (a 64-bit long takes 8).
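To make the text-versus-binary difference concrete, here is a small sketch that writes the example number both ways (error checking omitted; the file names are made up for the example):

#include <stdio.h>

int main(void)
{
    int n = 56789;

    /* Text: one byte per decimal digit -> "56789" is 5 bytes on disk. */
    FILE *t = fopen("number.txt", "w");
    fprintf(t, "%d", n);
    fclose(t);

    /* Binary: the raw value -> sizeof(int) bytes, typically 4, whatever the number. */
    FILE *b = fopen("number.bin", "wb");
    fwrite(&n, sizeof n, 1, b);
    fclose(b);

    return 0;
}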

LZW Compression with Entire unicode library

I am trying to do this problem:
Assume we have an initial alphabet of the entire Unicode character set,
instead of just all the possible byte values. Recall that unicode
characters are unsigned 2-byte values, so this means that each
2 bytes of uncompressed data will be treated as one symbol, and
we'll have an alphabet with over 60,000 symbols. (Treating symbols as
2-byte Unicodes, rather than a byte at a time, makes for better
compression in the case of internationalized text.) And, note, there's
nothing that limits the number of bits per code to at most 16. As you
generalize the LZW algorithm for this very large alphabet, don't worry
if you have some pretty long codes.
With this, give the compressed version of this four-symbol sequence,
using our project assumptions, including an EOD code, and grouping
into 4-byte ints. (These three symbols are Unicode values,
represented numerically.) Write your answer as 3 8-digit hex values,
space separated, using capital hex digits, not lowercase.
32767 32768 32767 32768
The problem I am having is that I don't know the entire range of the alphabet, so when doing LZW compression I don't know what value the new codes will have. Stemming from that problem, I also don't know what the EOD code will be.
Also, it seems to me that the compressed data will only take two integers.
The problem statement is ill-formed.
In Unicode, as we know it today, code points (those numbers that represent characters, composable parts of characters and other useful but more sneaky things) cannot all be numbered from 0 to 65535 so as to fit into 16 bits. There are more than 100 thousand Chinese, Japanese and Korean characters in Unicode. Clearly, you'd need 17+ bits just for those. So, Unicode clearly cannot be the correct option here.
OTOH, there exists a sort of "abridged" version of Unicode, the Universal Character Set, whose UCS-2 encoding uses 16-bit code points and can technically represent at most 65536 characters and the like. Characters with codes greater than 65535 are, well, unlucky: you can't have them with UCS-2.
So, if it's really UCS-2, you can download its specification (ISO/IEC 10646, I believe) and figure out exactly which codes out of those 64K are used and thus should form your initial LZW alphabet.

How to simply generate a random base64 string compatible with all base64 encodings

In C, I was asked to write a function to generate a random Base64 string of length 40 characters (30 bytes?).
But I don't know which Base64 flavor will be used, so it needs to be compatible with many versions of Base64.
What can I do ? What is the best option ?
All the Base64 encodings agree on some things, such as the use of [0-9A-Za-z], which are 62 characters. So you won't get a full 64^40 possible combinations, but you can get 62^40, which is still quite a lot! You could just generate a random number for each digit, mod 62. Or slice it up more carefully to reduce the amount of entropy needed from the system. For example, given a 32-bit random number, take 6 bits at a time (0..63); if those bits are 62 or 63, discard them, otherwise map them to one Base64 digit. This way you only need about eight 32-bit integers to make a 40-character string.
If this system has security considerations, you need to consider the consequences of generating "unusual" Base64 numbers (e.g. an attacker could detect that your Base64 numbers are special in having only 62 symbols with just a small corpus--does that matter?).
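A minimal sketch of the approach described above, restricted to the 62 symbols every Base64 flavor agrees on. It uses rand() only to keep the example short; use a proper CSPRNG for anything security-sensitive:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static const char alphabet[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";

/* Fills out[] with len characters drawn only from [0-9A-Za-z]; 6-bit
 * values of 62 or 63 are rejected and redrawn, as described above. */
void random_b64_safe(char *out, size_t len)
{
    size_t i = 0;
    while (i < len) {
        int v = rand() & 0x3F;     /* take 6 bits: 0..63 */
        if (v < 62)                /* reject 62 and 63 */
            out[i++] = alphabet[v];
    }
    out[len] = '\0';
}

int main(void)
{
    char s[41];
    srand((unsigned)time(NULL));
    random_b64_safe(s, 40);
    puts(s);
    return 0;
}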

Understanding `read, write` system calls in Unix

My Systems Programming project has us implementing a compression/decompression program to crunch down ASCII text files by removing the zero top bit and writing the output to a separate file, depending on whether the compression or decompression routine is working. To do this, the professor has required us to use the binary files and Unix system calls, which include open, close, read, write, etc.
From my understanding of read and write, it reads the binary data by defined byte chunks. However, since this data is binary, I'm not sure how to parse it.
This is a stripped down version of my code, minus the error checking:
void compress(char readFile[]){
char buffer[BUFFER]; // buffer size set to 4096, but tunable to system preference
int openReadFile;
openReadFile = open(readFile, O_RDONLY);
}
If I use read to read the data into buffer, will the data in buffer be in binary or character format? Nothing I've come across addresses that detail, and it's very relevant to how I parse the contents.
read() will read the bytes in without any interpretation (so "binary" mode).
Since the data is binary and you want to access the individual bytes, you should use a buffer of unsigned char:
unsigned char buffer[BUFFER]. You can regard char/unsigned char as bytes; they'll be 8 bits on Linux.
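For example, a minimal read() loop along those lines (most error handling omitted; BUFFER is the same tunable size as in the question):

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

#define BUFFER 4096

void dump_file(const char *path)
{
    unsigned char buffer[BUFFER];
    ssize_t n;
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return;

    /* read() fills the buffer with raw bytes; n says how many are valid */
    while ((n = read(fd, buffer, sizeof buffer)) > 0)
        for (ssize_t i = 0; i < n; i++)
            printf("%02x ", (unsigned)buffer[i]);   /* each element is one byte, 0..255 */

    close(fd);
}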
Now, since what you're dealing with is 8 bit ascii compressed down to 7 bit, you'll have to convert those 7 bits into 8 bits again so you can make sense of the data.
To explain what's been done, consider the text Hey. That's 3 bytes. The bytes have 8 bits each, and in ASCII those are the bit patterns:
01001000 01100101 01111001
Now, removing the most significant bit from this, you shift the remaining bits one bit to the left.
X1001000 X1100101 X1111001
Above, X is the bit to be removed. Removing those and shifting the others, you end up with bytes with this pattern:
10010001 10010111 11001000
The rightmost 3 bits are just filled in with 0. So far no space is saved, though; there are still 3 bytes.
With a string of 8 bytes, we'd save 1 byte, as that would compress down to 7 bytes.
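As an illustration, here is one way the packing step could be written in C; it is a sketch of the bit manipulation described above, not necessarily what your assignment expects:

#include <stddef.h>

/* Packs 7-bit ASCII characters by dropping each character's top bit and
 * concatenating the remaining 7 bits, MSB first, exactly like the "Hey"
 * walk-through above. Roughly 7 output bytes per 8 input bytes.
 * Returns the number of output bytes written. */
size_t pack7(const unsigned char *in, size_t len, unsigned char *out)
{
    unsigned int acc = 0;   /* bit accumulator */
    int nbits = 0;          /* number of pending bits in acc */
    size_t written = 0;

    for (size_t i = 0; i < len; i++) {
        acc = (acc << 7) | (in[i] & 0x7F);   /* append the 7 significant bits */
        nbits += 7;
        while (nbits >= 8) {                 /* emit full bytes as they form */
            nbits -= 8;
            out[written++] = (acc >> nbits) & 0xFF;
        }
    }
    if (nbits > 0)                           /* pad the tail with zero bits */
        out[written++] = (acc << (8 - nbits)) & 0xFF;
    return written;
}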
Now you have to do the reverse on the bytes you've read back in.
I'll quote the manual of the fopen function (that is based on the open function/primitive) from http://www.kernel.org/doc/man-pages/online/pages/man3/fopen.3.html
The mode string can also include the
letter 'b' either as a last character
or as a character between the
characters in any of the two-character
strings described above. This is
strictly for compatibility with C89
and has no effect; the 'b' is ignored
on all POSIX conforming systems,
including Linux
So even the high level function ignores the mode :-)
It will read the binary content of the file and load it into the memory that buffer points to. A byte is 8 bits, and that's why a char is 8 bits, so if the file was a regular plain text document you'll end up with a printable string (be careful with how it ends; read returns the number of bytes, i.e. characters in an ASCII-encoded plain text file, that were actually read).
Edit: in case the file you're reading isn't a text file but a collection of binary records, you can declare the buffer with the same type as the data in the file, even if it's a struct.

What is meant by Octet String? What's the difference between Octet and Char?

What is the difference between an octet string and a char? How can an octet string be used? Can anybody write a small C program using an octet string? How are octet strings stored in memory?
Standards (and such) use "octet" to explicitly state that they're talking about 8-bit groups. While most current computers work with bytes that are also 8 bits in size, that's not necessarily the case. In fact, "byte" is rather poorly defined, with considerable disagreement over what it means for sure -- so it's generally avoided when precision is needed.
Nonetheless, on a typical computer, an octet is going to be the same thing as a byte, and an octet stream will be stored in a series of bytes.
An octet is another word for an 8-bit byte.
A char is usually 8 bits, but may be another size on some architectures.
An octet is 8 bits meant to be handled together (hence the "oct" in "octet"). It's what we think of when we say "byte" these days.
A char is basically a byte -- it's defined as the smallest addressable unit of memory, which on almost all modern computers is the same as an octet. But there have been computers with 9-bit, 16-bit, even 36-bit "words" that qualify as chars by that definition. You only need to care about those computers (and thus, about the difference between a char and an octet) if you have one -- let the people who have the weird hardware worry about how to make their programs run on it.
An octet string is simply a sequence of bits grouped into chunks of 8. Those 8-sized groups often represent characters. Octet string is a basic data type used for SNMP.
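To make the distinction concrete, here is a small sketch of how an octet string is often carried around in C (the struct name is made up for the example), next to an ordinary NUL-terminated char string:

#include <stdio.h>
#include <string.h>

/* An octet string is just "some bytes plus a length": it may contain
 * zero bytes and has no terminator, unlike a C char string. */
struct octet_string {
    unsigned char *data;
    size_t len;
};

int main(void)
{
    unsigned char raw[] = { 0x00, 0xFF, 0x41, 0x00, 0x7F };  /* embedded zeros are fine */
    struct octet_string os = { raw, sizeof raw };

    const char *cstr = "A";   /* a NUL-terminated char string */

    printf("octet string length: %zu\n", os.len);        /* 5 */
    printf("char string length : %zu\n", strlen(cstr));  /* 1, stops at the NUL */
    return 0;
}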
A string used to be a sequence of octets, each of which in turn is a group of 8 bits.
A string in C is always a null-terminated, memory-contiguous set of bytes.
Back in the day, each byte, an octet, represented a character. That's why they named the type used to make strings char.
The ASCII table, which goes from 0 to 127 (with the graphics/accents version going from 0 to 255), was no longer enough for representing characters in a string, so someone thought of adding bits to a character representation. Dumb-asses from CS thought of 9-bit characters and so forth, to which the HW guys replied "are you nuts??? keep it a multiple of the memory addressing unit", which back then was the byte.
Enter wide-character strings, i.e. 16bits per character.
On a WC string, each character is represented by 2 bytes... there goes your char=1 byte rule down the drain.
To keep an exact description of a string: if it's a set of characters represented by 8 bits each (on Earth, following the ASCII table, but I've been to Mars), it's an "octet string".
If it's not an "octet string" it may or may not be WC... Joel has a nice post on this.
