Will the MD5 cryptographic hash function output be the same in all programming languages?

I am creating an API in PHP, and one of the parameters it will accept is an md5 encrypted value. I don't know much about other programming languages or about MD5. So my basic question is: if I accept md5 encrypted values, will the value be the same when generated from any programming language, such as .NET, Java, Perl, Ruby, etc.?
Or are there limitations or validations I need to be aware of?

Yes, correct implementation of md5 will produce the same result, otherwise md5 would not be useful as a checksum. The difference may come up with encoding and byte order. You must be sure that text is encoded to exactly the same sequence of bytes.

It will, but there's a but.
It will because it's spec'd to reliably produce the same result given the same series of bytes. The point is that we can then compare those results to check the bytes haven't changed, or perhaps digitally sign only the MD5 result rather than signing the entire source.
The but is that a common source of bugs is making assumptions about how strings are encoded. MD5 works on bytes, not characters, so if we're hashing a string, we're really hashing a particular encoding of that string. Some languages (and more so, some runtimes) favour particular encodings, and some programmers are used to making assumptions about that encoding. Worse yet, some specs make assumptions about encodings. This can cause bugs where two different implementations produce different MD5 hashes for the same string. This is especially so when characters fall outside the range U+0020 to U+007F (and since U+007F is a control, that one has its own issues).
All this applies to other cryptographic hashes, such as the SHA- family of hashes.
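The encoding pitfall described above can be illustrated with a minimal Python sketch. The sample string and the helper name md5_hex are illustrative assumptions, not part of the original answers; any correct MD5 implementation in another language should give the same digests for the same bytes.

```python
import hashlib

def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of a byte sequence as lowercase hex."""
    return hashlib.md5(data).hexdigest()

text = "café"  # sample string with one non-ASCII character

utf8_digest = md5_hex(text.encode("utf-8"))      # é is encoded as 2 bytes
latin1_digest = md5_hex(text.encode("latin-1"))  # é is encoded as 1 byte

# Same characters, different bytes, therefore different digests.
print(utf8_digest == latin1_digest)  # False

# Pure ASCII text encodes to the same bytes in both encodings,
# so the digests agree.
ascii_utf8 = md5_hex("hello".encode("utf-8"))
ascii_latin1 = md5_hex("hello".encode("latin-1"))
print(ascii_utf8 == ascii_latin1)  # True
```

For an API, this suggests stating the expected encoding (for example UTF-8) explicitly, so every client hashes the same bytes.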

Yes. MD5 isn't an encryption function, it's a hash function that uses a specific algorithm.

Yes, md5 hashes will always be the same regardless of their origin - as long as the underlying algorithm is correctly implemented.

A vital point of secure hash functions, such as MD5, is that they always produce the same value for the same input.
However, it does require you to encode the input data into a sequence of bytes (or bits) the same way. For instance, there are many ways to encode a string.

Related

What is the collision probability of MD5 digests if the input strings only contain alphanumeric characters?

The input strings have the following conditions:
Only contain alphanumericals ([a-zA-Z0-9])
The size of a string is always less than 256 bytes
Total number of input strings is less than 1,000,000
So what is the collision probability of MD5 digests if the input strings all satisfy the above conditions? Can I just assume that there are no collisions?
If the inputs are random the likelihood of a collision in that input set is very low. That being said MD5 is a broken algorithm and a human can easily use software to find a collision. So you probably just shouldn't use MD5, but it depends on what you're using it for. I'm not sure why you would ever want to use MD5 anymore. You should look into the blake2 family or the newer SHAs (SHA256, SHA512, not SHA-1). If these are passwords you should pretty much definitely be using a hash designed for passwords like PBKDF2 or one of the Argons. To be honest I'd recommend just using libsodium's defaults for most things.
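To put a number on "very low", here is a rough sketch of the standard birthday bound, assuming the digests behave like uniformly random 128-bit values (which only holds when nobody is deliberately constructing collisions). The function name birthday_bound is a hypothetical helper.

```python
from fractions import Fraction

def birthday_bound(n_items: int, digest_bits: int) -> Fraction:
    """Upper bound on the chance that any two of n_items random digests collide.

    Uses the union bound: (number of pairs) / (number of possible digests).
    """
    pairs = n_items * (n_items - 1) // 2
    return Fraction(pairs, 2 ** digest_bits)

# One million inputs, 128-bit MD5 digests.
p = birthday_bound(1_000_000, 128)
print(f"{float(p):.3e}")
```

The result is on the order of 1e-27, so accidental collisions among a million non-adversarial inputs can be ignored in practice. The bound says nothing about deliberately crafted inputs.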

What does a file MD5 Hash represent? [duplicate]

I know a file's MD5 Hash is like a digital fingerprint used to confirm integrity and authenticity. There are many utilities to get the MD5 Hash of a file but what does that hash base on? File size? File low level binaries? Code?
MD5 is a so-called cryptographic hash function.
This basically means that you can give in any bitstring as input for the function, and you will get out a fixed-size bitstring (128-bit in the case of MD5) as output. The output is usually called "digest".
The digest depends solely on the input and nothing else. Thus, by itself it can serve as an integrity check, but not as proof of authenticity, provided the underlying hash function has the necessary properties (in this case collision resistance). This means that for two different inputs the digests should also be different. The problem is that the digest's size is fixed, which in turn means that with a sufficient number of messages it will always be possible to find a collision (i.e., two different inputs yielding the same output).
One should also note that there is nowadays no justification for using MD5, as weaknesses have been discovered (namely chosen-prefix collision attacks). Also, using SHA-256/512 on modern hardware is usually faster than MD5.
In short: the output of a cryptographic hash function (and so MD5's) depends on the input bitstring.
Update: based on your comment for the other answer, you are looking for this: https://en.wikipedia.org/wiki/MD5#Algorithm
You can read about it here:
https://en.wikipedia.org/wiki/Md5sum
In general, the algorithm runs over the file's contents and its output is a checksum. That means that if someone changes even one bit in the file, the checksum will change. So it's a way to validate that the file you are looking at is the file you think you are looking at, lowering the probability that someone has put malicious code in it.
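As a small sketch of what "runs over the file" means: the digest is computed from the file's bytes only, so the file name and location do not matter, while a one-byte change in the contents does. The helper file_md5 and the temporary file names are illustrative assumptions.

```python
import hashlib
import os
import tempfile

def file_md5(path: str, chunk_size: int = 65536) -> str:
    """Hash a file's raw bytes in chunks so large files need little memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

with tempfile.TemporaryDirectory() as folder:
    first = os.path.join(folder, "a.bin")
    second = os.path.join(folder, "renamed-copy.bin")
    third = os.path.join(folder, "changed.bin")
    samples = ((first, b"same bytes"), (second, b"same bytes"), (third, b"same bytez"))
    for path, content in samples:
        with open(path, "wb") as handle:
            handle.write(content)

    # Identical contents under different names give the same digest.
    name_independent = file_md5(first) == file_md5(second)
    # A single changed byte gives a different digest.
    change_detected = file_md5(first) != file_md5(third)

print(name_independent, change_detected)  # True True
```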

How can I generate unique, non-sequential serial keys without 3rd party software?

I'm working on a project that involves writing low-level C software for a hardware implementation. We are wanting to implement a new feature for our devices that our users can unlock when they purchase an associated license key.
The desired implementation steps are simple. The user calls us up, requests the feature and sends us a payment. Next, we email them a product key, which they enter into their hardware to unlock the feature.
Our hardware is not connected to the internet. Therefore, an algorithm must be implemented in such a way that these keys can be generated from both the server and from within the device. Seeds for the keys can be derived from the hardware serial number, which is available in both locations.
I need a simple algorithm that can take sequential numbers and generate unique, non-sequential keys of 16-20 alphanumeric characters.
UPDATE
SHA-1 looks to be the best way to go. However, what I am seeing from sample output of SHA-1 keys is that they are pretty long (40 chars). Would I obtain sufficient results if I took the 40 char key and, say, truncated all but the last 16 characters?
You could just concatenate the serial number of the device, the feature name/code and some secret salt and hash the result with SHA1 (or another secure hashing algorithm). The device compares the given hash to the hash generated for each feature, and if it finds a match it enables the feature.
By the way, to keep the character count down I'd suggest to use base64 as encoding after the hashing pass.
SHA-1 looks to be the best way to go. However, what I am seeing from sample output of SHA-1 keys is that they are pretty long (40 chars). Would I obtain sufficient results if I took the 40 char result and, say, truncated all but the last 16 characters?
Generally it's not a good idea to truncate hashes, they are designed to exploit all the length of the output to provide good security and resistance to collisions. Still, you could cut down the character count using base64 instead of hexadecimal characters, it would go from 40 characters to 27.
Hex: a94a8fe5ccb19ba61c4c0873d391e987982fbbd3
Base64: qUqP5cyxm6YcTAhz05Hph5gvu9M
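The two encodings above can be reproduced with a short Python sketch. The hex digest shown happens to match SHA-1 of the ASCII string test, so that input is used here; the variable names are illustrative.

```python
import base64
import hashlib

raw = hashlib.sha1(b"test").digest()  # 20 raw bytes (160 bits)

# Hex spends two characters per byte.
hex_form = raw.hex()

# Base64 spends four characters per three bytes; the trailing "="
# padding carries no information and is stripped here.
b64_form = base64.b64encode(raw).decode("ascii").rstrip("=")

print(len(hex_form), hex_form)
print(len(b64_form), b64_form)
```

Note that standard base64 can emit + and /, which are not alphanumeric. If the key must be strictly alphanumeric, base32 (base64.b32encode) is one alternative, at the cost of a longer string.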
---edit---
Actually, @Nick Johnson argues convincingly that hashes can be truncated without big security implications (though each bit you drop doubles the chance of a collision).
You should also use an HMAC instead of naively prepending or appending the key to the message before hashing. Per Wikipedia:
The design of the HMAC specification was motivated by the existence of
attacks on more trivial mechanisms for combining a key with a hash
function. For example, one might assume the same security that HMAC
provides could be achieved with MAC = H(key ∥ message). However, this
method suffers from a serious flaw: with most hash functions, it is
easy to append data to the message without knowing the key and obtain
another valid MAC. The alternative, appending the key using MAC =
H(message ∥ key), suffers from the problem that an attacker who can
find a collision in the (unkeyed) hash function has a collision in the
MAC. Using MAC = H(key ∥ message ∥ key) is better, however various
security papers have suggested vulnerabilities with this approach,
even when two different keys are used.
For more details on the security implications of both this and length truncation, see sections 5 and 6 of RFC2104.
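Putting the HMAC and truncation advice together, a key generator might look roughly like the sketch below. Everything specific here is an assumption made for illustration: the SECRET value, the "|" separator between serial number and feature code, the choice of SHA-256, and the 16-character base32 truncation. The device firmware in the question is written in C and would need an equivalent HMAC implementation.

```python
import base64
import hashlib
import hmac

# Hypothetical secret shared by the key server and the device firmware.
SECRET = b"replace-with-a-real-secret"

def license_key(serial: str, feature: str, length: int = 16) -> str:
    """Derive an alphanumeric key from the serial number and feature code."""
    # Assumes neither field contains "|", so the message is unambiguous.
    message = f"{serial}|{feature}".encode("utf-8")
    mac = hmac.new(SECRET, message, hashlib.sha256).digest()
    # Base32 uses only A-Z and 2-7; truncating to 16 characters keeps 80 bits.
    return base64.b32encode(mac).decode("ascii")[:length]

def is_valid(serial: str, feature: str, candidate: str) -> bool:
    """Recompute the expected key and compare it in constant time."""
    expected = license_key(serial, feature).encode("ascii")
    supplied = candidate.strip().upper().encode("utf-8")
    return hmac.compare_digest(expected, supplied)

key = license_key("SN-0001", "turbo")
print(key, is_valid("SN-0001", "turbo", key))
```

Because the key depends on the serial number, a key issued for one device does not unlock the same feature on another.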
One option is to use a hash as Matteo describes.
Another is to use a block cipher (e.g. AES). Just pick a random nonce and invoke the cipher in counter mode using your serial numbers as the counter.
Of course, this will make the keys invertible, which may or may not be a desirable property.
You can use an Xorshift random number generator to turn each serial number into a unique 64-bit key, and then encode that key using whatever scheme you want. If you use base-64, the key is 11 characters long. If you use hex encoding, the key is 16 characters long.
The Xorshift RNG is basically just a bit mixer. There are versions with a guaranteed period of 2^64 - 1, and because each step is invertible, it is guaranteed to generate a different output for every different input.
The other option is to use a linear feedback shift register, which also will generate a unique number for each different input.
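The uniqueness property of the Xorshift approach can be sketched as follows. The shift triple (13, 7, 17) is one commonly used 64-bit choice; the number of rounds and the function names are assumptions for illustration. Each round is an invertible bit mix, so distinct serial numbers always give distinct keys, and the input 0 maps to 0.

```python
MASK64 = (1 << 64) - 1

def xorshift64(x: int) -> int:
    """One round of a 64-bit xorshift with the shift triple (13, 7, 17)."""
    x &= MASK64
    x ^= (x << 13) & MASK64
    x ^= x >> 7
    x ^= (x << 17) & MASK64
    return x

def serial_to_key(serial: int, rounds: int = 4) -> str:
    """Mix a serial number for a few rounds and return 16 hex characters."""
    value = serial
    for _ in range(rounds):
        value = xorshift64(value)
    return format(value, "016x")

keys = [serial_to_key(serial) for serial in range(1, 10001)]
print(len(set(keys)) == len(keys))  # True: no two serials share a key
```

Several rounds are applied because a single round on a small serial number leaves most of the high bits zero. The mixer has no secret, so anyone who learns the scheme can compute keys; that is a property of this option, unlike the keyed hash approach.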

md5 collision database?

I'm writing a file system deduper. The first pass generates md5 checksums, and the second pass compares the files with identical checksums.
Is there a collection of strings which differ but generate identical md5 checksums I can incorporate into my test case collection?
Update: mjv's answer points to these two files, perfect for my test case.
http://www.win.tue.nl/~bdeweger/CollidingCertificates/MD5Collision.certificate1.cer
http://www.win.tue.nl/~bdeweger/CollidingCertificates/MD5Collision.certificate2.cer
You can find a couple of different X.509 certificate files with the same MD5 hash at this url.
I do not know of MD5 duplicate files repositories, but you can probably create your own, using the executables and/or the techniques described on Vlastimil Klima's page on MD5 Collision
Indeed, MD5 is known for its weak collision resistance; however, I wouldn't disqualify it for a project such as your file system de-duper. You may just want to add a couple of additional criteria (which can be very cheap, computationally speaking) to further decrease the possibility of false duplicates.
Alternatively, for test purposes, you may simply modify your MD5 compare logic so that it deems some MD5 values identical even though they are not (say if the least significant byte of the MD5 matches, or systematically, every 20 comparisons, or at random ...). This may be less painful than having to manufacture effective MD5 "twins".
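The test-only trick above could be sketched like this: the first pass groups by a deliberately weakened checksum (only the last byte of the MD5 digest), which forces many false matches, and the second pass must still report only true duplicates. The function names and the in-memory blobs standing in for files are illustrative assumptions.

```python
import hashlib
from collections import defaultdict

def weak_key(data: bytes, keep_bytes: int = 1) -> bytes:
    """Test-only checksum: the last keep_bytes bytes of the MD5 digest."""
    return hashlib.md5(data).digest()[-keep_bytes:]

def find_duplicates(blobs: dict, key_func) -> list:
    """Two-pass dedupe: bucket by checksum, then compare actual bytes."""
    buckets = defaultdict(list)
    for name, data in blobs.items():
        buckets[key_func(data)].append(name)

    groups = []
    for names in buckets.values():
        # Second pass: only identical contents count as duplicates.
        by_content = defaultdict(list)
        for name in names:
            by_content[blobs[name]].append(name)
        groups.extend(sorted(g) for g in by_content.values() if len(g) > 1)
    return sorted(groups)

# 600 distinct contents but only 256 possible one-byte keys,
# so checksum collisions in the first pass are guaranteed.
blobs = {f"file{i}": f"content {i}".encode() for i in range(600)}
blobs["copy-of-file0"] = blobs["file0"]

duplicates = find_duplicates(blobs, weak_key)
print(duplicates)  # [['copy-of-file0', 'file0']]
```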
http://www.nsrl.nist.gov/ might be what you want.

How do I choose a good magic number for my file format?

I am designing a binary file format from scratch, and I would like to include some magic bytes at the beginning so that it can be identified easily. How do I go about choosing which bytes? I am not aware of any central registry of magic numbers, so is it just a matter of picking something fairly random that isn't already identified by, say, the file command on a nearby UNIX box?
Stay away from super-short magic numbers. Just because you're designing a binary format doesn't mean you can't use a text string for identifier. Follow that by an EOF char, and as an added bonus people who cat or type your binary file won't get a mangled terminal.
There is no universally correct way. Best practices can be suggested, but these are often situational. For example, if you're checking the integrity of volatile memory, which has an undefined initial state when power is applied, it may be beneficial to incorporate long runs of 0s or 1s in a sequence (e.g. FFF0 00FF F000), which can stand out against random noise.
If the file is mostly binary, a popular choice is using a text encoding like ASCII which stands out among the binary data in a hex editor. For example, GIF uses GIF89a, FLAC uses fLaC. On the other hand, a plain text identifier may be falsely detected in a random text file, so invalid/control characters might be incorporated.
In general, it does not matter that much what they are, even a bunch of NULL bytes can be used for file detection. But ideally you want the longest unique identifier you can afford, and at minimum 4 bytes long. Any identifier under 4 bytes will show up more often in random data. The longer it is, the less likely it will ever be detected as a false positive. Some known examples are as long as 40 bytes. In a way, it's like a password.
Also, it doesn't have to be at offset 0. The file signature has conventionally been at offset zero, since it made sense to store it first if it will be processed first.
That said, a single file signature should not be the only line of defense. The actual parsing process itself should be able to verify integrity and weed out invalid files even if the signature matches. This can be done with additional file signatures, using length-sensitive data, value/range checking, and especially, hash/checksum values.
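A layered check of this kind might look roughly like the sketch below. The signature bytes, header layout and function names are hypothetical choices for illustration: a readable ASCII tag followed by 0x1A (the DOS EOF character mentioned earlier), then a length and a CRC-32 of the payload.

```python
import struct
import zlib

# Hypothetical signature: ASCII tag plus 0x1A, well over 4 bytes long.
MAGIC = b"MYFORMAT\x1a"
# Little-endian header fields: payload length, CRC-32 of the payload.
HEADER = struct.Struct("<II")

def pack_file(payload: bytes) -> bytes:
    """Prefix the payload with the signature, its length and its checksum."""
    return MAGIC + HEADER.pack(len(payload), zlib.crc32(payload)) + payload

def unpack_file(blob: bytes) -> bytes:
    """Validate signature, length and checksum before trusting the payload."""
    if not blob.startswith(MAGIC):
        raise ValueError("bad magic number")
    start = len(MAGIC)
    if len(blob) < start + HEADER.size:
        raise ValueError("truncated header")
    length, checksum = HEADER.unpack_from(blob, start)
    payload = blob[start + HEADER.size:]
    if len(payload) != length:
        raise ValueError("length mismatch")
    if zlib.crc32(payload) != checksum:
        raise ValueError("checksum mismatch")
    return payload

blob = pack_file(b"hello")
print(unpack_file(blob))  # b'hello'
```

A file that merely starts with the right signature but has a wrong length or checksum is rejected, which is the extra line of defense described above.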
