How to include the md5 value of a file into itself

Suppose you want to create a file that contains its own MD5 digest. How would you do that?
Before you can include the MD5 value in the file, you have to calculate it, but you can only calculate it once the digest is already in the file. It's a kind of dilemma. Any ideas?

The short answer is that you can't do that unless you exploit MD5's vulnerabilities, and I believe that even then, building such a collision is impractical. A solution would be to either attach the digest at the end of the file or ship it separately.

You have to avoid that cyclic dependency, so you can't checksum the checksum. To work around that, you could reserve space for the checksum in your file but set that space to zeroes, then calculate the checksum and embed it. To check it later, read the stored checksum, set those bytes in the file to zero again, and recompute.
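The zero-the-field approach can be sketched in Python as follows. This is only an illustration: the function names `embed_md5` and `verify_md5` and the idea of passing the field position as `offset` are assumptions made for the example, not part of any existing file format.

```python
import hashlib

DIGEST_LEN = 16  # an MD5 digest is 16 raw bytes


def _zero_field(data: bytes, offset: int) -> bytes:
    """Return a copy of data with the reserved digest field set to zeroes."""
    if offset < 0 or offset + DIGEST_LEN > len(data):
        raise ValueError("reserved field does not fit inside the data")
    return data[:offset] + bytes(DIGEST_LEN) + data[offset + DIGEST_LEN:]


def embed_md5(data: bytes, offset: int) -> bytes:
    """Hash the data with the field zeroed, then store the digest in the field."""
    zeroed = _zero_field(data, offset)
    digest = hashlib.md5(zeroed).digest()
    return zeroed[:offset] + digest + zeroed[offset + DIGEST_LEN:]


def verify_md5(data: bytes, offset: int) -> bool:
    """Read the stored digest, zero the field again, and compare."""
    stored = data[offset:offset + DIGEST_LEN]
    return hashlib.md5(_zero_field(data, offset)).digest() == stored
```

Note that the stored value is the digest of the zero-filled file, not of the final file, so a plain `md5sum` of the result will not print the embedded value.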

Related

How to find out if two binary files are exactly the same

I have a repository where I store all my image files. I know that many of the images are duplicated, and I want to delete the duplicates.
I thought that if I generated a checksum for each image file and renamed the file to its checksum, I could easily find duplicates by examining the filenames. The problem is that I am not sure which checksum algorithm to use. For example, if I generate the checksums using MD5, can I trust that equal checksums mean the files are exactly the same?

Judging from the response to a similar question on the security forum (https://security.stackexchange.com/a/3145), you would expect about one accidental collision once you have hashed on the order of 2^64 messages (the birthday bound for MD5's 128-bit digest). If your files are different and your collection is nowhere near that size, MD5 can be used safely.
Also, see response to a very similar question here: How many random elements before MD5 produces collisions?
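To put a number on this, here is a rough sketch of the usual birthday-bound approximation, p ≈ 1 - exp(-n(n-1) / 2^(b+1)), for n files and a b-bit digest. It assumes digests behave like uniformly random values, so it only covers accidental collisions, not files deliberately crafted to collide. The function name is made up for the example.

```python
import math


def collision_probability(num_files: int, digest_bits: int = 128) -> float:
    """Approximate chance of at least one accidental collision among num_files digests."""
    pairs = num_files * (num_files - 1) / 2          # number of distinct file pairs
    return -math.expm1(-pairs / 2.0 ** digest_bits)  # 1 - exp(-pairs / 2^bits)
```

For a collection of a million files and a 128-bit digest the result is far below anything worth worrying about; it only approaches an even chance when the number of files gets near 2^64.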

The chances of getting the same checksum for two different files are extremely slim, but uniqueness can never be absolutely guaranteed (pigeonhole principle). An indication of how slim: Git uses SHA-1 checksums to identify content in source code repositories, including the Linux kernel, and this has never caused any known problems, so I would say that you are safe. If you are really paranoid, I would use SHA-1 instead of MD5 because it is slightly better.

To be really sure, follow a two-step procedure. First, calculate a checksum for every file. If the checksums differ, you can be sure the files are not identical. If you find files with equal checksums, there's no way around a bit-by-bit comparison to be 100% sure that they are really identical. This holds regardless of the hashing algorithm used.
What you gain is a massive time saving: a bit-by-bit comparison of every possible pair of files would take forever and a day, while comparing a handful of candidates is fairly easy.
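The two-step procedure might look like this in Python. It is a minimal sketch: `find_duplicates` and `file_digest` are names invented for the example, SHA-1 is chosen only because it was suggested above, and the byte-for-byte step uses the standard library's `filecmp.cmp` with `shallow=False`.

```python
import filecmp
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large images need not fit in memory."""
    h = hashlib.sha1()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root: Path) -> list[list[Path]]:
    """Return groups of files under root that are byte-for-byte identical."""
    # Step 1: bucket files by digest; different digests mean different files.
    by_digest = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            by_digest[file_digest(path)].append(path)

    # Step 2: within each bucket, confirm equality by comparing contents.
    groups = []
    for candidates in by_digest.values():
        remaining = candidates
        while len(remaining) > 1:
            first, rest = remaining[0], remaining[1:]
            same = [first] + [p for p in rest if filecmp.cmp(first, p, shallow=False)]
            if len(same) > 1:
                groups.append(same)
            remaining = [p for p in rest if p not in same]
    return groups
```

Only files that land in the same bucket are ever compared directly, which is where the time saving comes from.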

checksum and md5, not the same thing?

I downloaded a file and used md5sum to see if the download was successful without corruption. I got the following value:
a7099fcf9572d91b10d0073b07e112cb ./Macaca_mulatta.MMUL_1.70.dna.chromosome.1.fa.gz
But when I checked the website I downloaded the file from, it gave me the following value.
10256 63747 Macaca_mulatta.MMUL_1.70.dna.chromosome.1.fa.gz
What is this 10-digit code? Is it not MD5?
I downloaded the file from : ftp://ftp.ensembl.org/pub/release-70/fasta/macaca_mulatta/dna/CHECKSUMS

Ensembl is using the Unix 'sum' utility to calculate the CHECKSUM.gz file.
Here's more info about the program : http://en.wikipedia.org/wiki/Sum_%28Unix%29
To see if your download is correct, try:
sum Macaca_mulatta.MMUL_1.70.dna.chromosome.1.fa.gz
NOTE: Ensembl has failed to update their CHECKSUM file before, so it can always happen that the download is correct but the CHECKSUM.gz file is not.
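For reference, here is a small Python sketch of the 16-bit rotating checksum that GNU `sum` computes by default (the BSD algorithm), together with the block count it prints next to it. The 1024-byte block size and the choice of the BSD variant are assumptions about how the listing was generated; some systems default to the System V variant, which gives different numbers. The function name `bsd_sum` is made up for the example.

```python
def bsd_sum(data: bytes, block_size: int = 1024) -> tuple[int, int]:
    """Return (checksum, blocks) in the style of the BSD 'sum' algorithm."""
    checksum = 0
    for byte in data:
        # Rotate the 16-bit checksum right by one bit, then add the next byte.
        checksum = (checksum >> 1) | ((checksum & 1) << 15)
        checksum = (checksum + byte) & 0xFFFF
    blocks = (len(data) + block_size - 1) // block_size  # size rounded up to whole blocks
    return checksum, blocks
```

Read this way, the first number in the listing is the checksum and the second is the file size in blocks, which is why the pair looks like a 10-digit code.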

They are not the same thing. MD5 is a checksum but there are other checksum algorithms that are not MD5, such as SHA, CRC etc.
Generally a checksum is a function that takes an input larger than its output and (ideally) produces a very different output even if a single bit of the input changes.
The output you're looking at consists of two 5-digit decimal numbers, which matches what the Unix sum command prints: a 16-bit checksum followed by the file's size in blocks. It is not an MD5 digest or a CRC32, and the sum command can be used to calculate and verify it.

MD5 is a way to do a checksum, but there are others. CRC is one, so is SHA. All MD5 does is produce a hash code, and it is not the only algorithm to do so. I'm not sure what the 10 digit one is, but it can't be MD5.

MD5 collision for known input

Is it possible to create an MD5 collision based on a known input value?
So for example I have input string abc with MD5 900150983cd24fb0d6963f7d28e17f72.
Now I want to add bytes to string def to get the same MD5 900150983cd24fb0d6963f7d28e17f72.
(I know this is possible by brute-forcing and waiting a long time; I want to know if there is a more efficient way of doing this.)

Until now, no algorithm has been discovered that allows you to find a matching input that will generate a given MD5 hash.
What has been shown is that you can create MD5 collisions quite easily, for example with what is known as a chosen-prefix collision: you can create two files yielding the same MD5 hash by appending different data to a specified file. If you want to know more or get the program to try it, look here.

Are there algorithms for putting a digest into the file being digested?

Are there algorithms for putting a digest into the file being digested?
In other words, are there algorithms or libraries for this, or is it even possible to have a hash/digest of a file contained in the file being hashed/digested? This would be handy for obvious reasons, such as built-in digests of ISOs. I've tried googling things like "MD5 injection" and "digest in a file of a file." No luck (probably for good reason.)
Not sure if it is even mathematically possible. It seems you'd be able to roll through the file, but then you'd have to brute-force the last bit (assuming the digest was the last thing in the file or object.)
Thanks,
Chenz

It is possible in a limited sense:
Non-cryptographically-secure hashes
You can do this with insecure hashes like the CRC family of checksums.
Maclean's gzip quine
Caspian Maclean created a gzip quine, which decompresses to itself. Since the Gzip format includes a CRC-32 checksum (see the spec here) of the uncompressed data, and the uncompressed data equals the file itself, this file contains its own hash. So it's possible, but Maclean doesn't specify the algorithm he used to generate it:
It's quite simple in theory, but the helper programs I used were on a hard disk that failed, and I haven't set up a new working linux system to run them on yet. Solving the checksum by hand in particular would be very tedious.
Cox's gzip, tar.gz, and ZIP quines
Russ Cox created 3 more quines in Gzip, tar.gz, and ZIP formats, and wrote up in detail how he created them in an excellent article. The article covers how he embedded the checksum, namely by brute force:
The second obstacle is that zip archives (and gzip files) record a CRC32 checksum of the uncompressed data. Since the uncompressed data is the zip archive, the data being checksummed includes the checksum itself. So we need to find a value x such that writing x into the checksum field causes the file to checksum to x. Recursion strikes back.
The CRC32 checksum computation interprets the entire file as a big number and computes the remainder when you divide that number by a specific constant using a specific kind of division. We could go through the effort of setting up the appropriate equations and solving for x. But frankly, we've already solved one nasty recursive puzzle today, and enough is enough. There are only four billion possibilities for x: we can write a program to try each in turn, until it finds one that works.
He also provides the code that generated the files.
(See also Zip-file that contains nothing but itself?)
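The brute-force idea from the quoted passage can be sketched in Python as below. This is not Cox's code: `find_self_crc` is a hypothetical helper, the little-endian field layout is an assumption, and the `bits` parameter exists only so the search can be tried on a truncated checksum (the low 16 bits of CRC-32, which is not a standard CRC-16). A full 32-bit search in pure Python would need about four billion iterations and take a very long time.

```python
import zlib


def find_self_crc(prefix: bytes, suffix: bytes, bits: int = 32):
    """Search for x such that CRC-32(prefix + x + suffix), truncated to bits, equals x.

    Returns the first such x, or None if no candidate works.
    """
    if bits % 8 != 0 or not 8 <= bits <= 32:
        raise ValueError("bits must be 8, 16, 24 or 32")
    width = bits // 8
    mask = (1 << bits) - 1
    base = zlib.crc32(prefix)  # running CRC after the prefix, reused for every candidate
    for x in range(1 << bits):
        field = x.to_bytes(width, "little")
        crc = zlib.crc32(suffix, zlib.crc32(field, base)) & mask
        if crc == x:
            return x
    return None
```

A fixed point is not guaranteed to exist for a given prefix and suffix, so callers have to handle `None`, for example by changing some padding bytes and searching again.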
Cryptographically-secure digests
With a cryptographically-secure hash function, this shouldn't be possible without either breaking the hash function (particularly, a secure digest should make it "infeasible to generate a message that has a given hash"), or applying brute force.
But these hashes are much longer than 32 bits, precisely in order to deter that sort of attack. So you can write a brute-force algorithm to do this, but unless you're extremely lucky you shouldn't expect it to finish before the universe ends.
MD5 is broken, so it might be easier
The MD5 algorithm is seriously broken, and a chosen-prefix collision attack is already practical (as used in the Flame malware's forged certificate; see http://www.cwi.nl/news/2012/cwi-cryptanalist-discovers-new-cryptographic-attack-variant-in-flame-spy-malware, http://arstechnica.com/security/2012/06/flame-crypto-breakthrough/). I don't know of anyone having actually done what you want, but there's a good chance it's possible. It's probably an open research question.
For example, this could be done using a chosen-prefix preimage attack, choosing the prefix equal to the desired hash, so that the hash would be embedded in the file. A preimage attack is more difficult than a collision attack, but there has been some progress towards it. See Does any published research indicate that preimage attacks on MD5 are imminent?.
It might also be possible to find a fixed point for MD5; inserting a digest is essentially the same problem. For discussion, see md5sum a file that contain the sum itself?.
Related questions:
Is there any x for which SHA1(x) equals x?
Is a hash result ever the same as the source value?

The only way to do this is if you define your file format so the hash only applies to the part of the file that doesn't contain the hash.
However, including the hash inside a file (like built into an ISO) defeats the whole security benefit of the hash. You need to get the hash from a different channel and compare it with your file.

No, because that would mean that the hash would have to be a hash of itself, which is not possible.

CRC checks for files

I'm working with a small FAT16 filesystem, and I want to generate CRC values for individual XML files which store configuration information. In case the data changes or is corrupted, I want to be able to check the CRC to determine whether the file is still in its original state.
The question is, how do I put the CRC value into the file, without changing the CRC value of the file itself? I can think of a couple solutions, but I think there must be a fairly standard solution for this issue.

You could append the CRC value to the end of the file. Then, when computing the CRC value later for checking, omit the last four bytes.
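A minimal Python sketch of the append-and-strip approach, assuming the CRC is CRC-32 stored as four little-endian bytes at the very end (the byte order and the function names are choices made for this example):

```python
import struct
import zlib


def append_crc(payload: bytes) -> bytes:
    """Return payload followed by its CRC-32 as a 4-byte little-endian trailer."""
    return payload + struct.pack("<I", zlib.crc32(payload))


def check_crc(blob: bytes) -> bool:
    """Recompute the CRC over everything except the last four bytes and compare."""
    if len(blob) < 4:
        return False
    payload, trailer = blob[:-4], blob[-4:]
    (stored,) = struct.unpack("<I", trailer)
    return zlib.crc32(payload) == stored
```

Because the trailer is excluded from the computation, writing it does not change the value being checked.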

Define a header, generate the CRC of everything except the header then put the value in the header.

A common solution is to just use different files. Alongside each file simply have a file with the same file name with a different extension. For example: foobar.txt and foobar.txt.md5 (or .crc).
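The side-by-side file idea could be sketched like this in Python. The helper names are invented for the example, and the "digest, two spaces, file name" line layout is an assumption chosen to resemble what `md5sum` prints.

```python
import hashlib
from pathlib import Path


def sidecar_path(path: Path) -> Path:
    """foobar.txt -> foobar.txt.md5, next to the original file."""
    return path.with_name(path.name + ".md5")


def write_sidecar(path: Path) -> Path:
    """Store the file's MD5 digest in a separate file alongside it."""
    digest = hashlib.md5(path.read_bytes()).hexdigest()
    sidecar = sidecar_path(path)
    sidecar.write_text(f"{digest}  {path.name}\n")
    return sidecar


def verify_sidecar(path: Path) -> bool:
    """Compare the file's current digest with the one recorded in the sidecar."""
    recorded = sidecar_path(path).read_text().split()[0]
    return hashlib.md5(path.read_bytes()).hexdigest() == recorded
```

The data file itself is never modified, so there is no cyclic dependency to work around.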

The common solution which is widely used in communication protocols is to set the CRC field to 0, compute the CRC and then place it instead of the 0. The checking code should do the reverse process - read the CRC, zero the field, calculate the CRC and compare.
Also, for a file checksum I strongly recommend MD5 instead of CRC.

One solution would be to use dsofile.dll to add extended properties to your files. You could save the CRC value (converted to a string) as an extended file property. That way you don't change the structure of the file.
dsofile.dll is an ActiveX dll so it can be called from various languages, however it confines you to running on Windows. Here's more information on dsofile.dll: http://support.microsoft.com/kb/224351

I wouldn't store the CRC in the file itself. I would have a single file (I would use XML format) that your program uses, with a list of filenames and their associated CRC values. No need to make it that complicated.

There is no way to do this if the CRC has to cover itself. You could make the first x bytes of the file contain the CRC (CRC-32 produces a 32-bit integer, so 4 bytes), and then, when calculating your CRC, only consider the bytes that come after those initial 4 bytes.
Another solution would be to include the CRC into the file name. So MyFile.Config would end up being MyFile.CRC1234567.Config.