Are there algorithms for putting a digest into the file being digested? - md5

Are there algorithms for putting a digest into the file being digested?
In otherwords, are there algorithms or libraries, or is it even possible to have a hash/digest of a file contained in the file being hashed/digested. This would be handy for obvious reasons, such as built in digests of ISOs. I've tried googling things like "MD5 injection" and "digest in a file of a file." No luck (probably for good reason.)
Not sure if it is even mathematically possible. Seems you'd be able to roll through the file but then you'd have to brute the last bit (assuming the digest was the last thing in the file or object.)
Thanks,
Chenz

It is possible in a limited sense:
Non-cryptographically-secure hashes
You can do this with insecure hashes like the CRC family of checksums.
Maclean's gzip quine
Caspian Maclean created a gzip quine, which decompresses to itself. Since the Gzip format includes a CRC-32 checksum (see the spec here) of the uncompressed data, and the uncompressed data equals the file itself, this file contains its own hash. So it's possible, but Maclean doesn't specify the algorithm he used to generate it:
It's quite simple in theory, but the helper programs I used were on a hard disk that failed, and I haven't set up a new working linux system to run them on yet. Solving the checksum by hand in particular would be very tedious.
Cox's gzip, tar.gz, and ZIP quines
Russ Cox created 3 more quines in Gzip, tar.gz, and ZIP formats, and wrote up in detail how he created them in an excellent article. The article covers how he embedded the checksum: brute force—
The second obstacle is that zip archives (and gzip files) record a CRC32 checksum of the uncompressed data. Since the uncompressed data is the zip archive, the data being checksummed includes the checksum itself. So we need to find a value x such that writing x into the checksum field causes the file to checksum to x. Recursion strikes back.
The CRC32 checksum computation interprets the entire file as a big number and computes the remainder when you divide that number by a specific constant using a specific kind of division. We could go through the effort of setting up the appropriate equations and solving for x. But frankly, we've already solved one nasty recursive puzzle today, and enough is enough. There are only four billion possibilities for x: we can write a program to try each in turn, until it finds one that works.
He also provides the code that generated the files.
(See also Zip-file that contains nothing but itself?)
Cryptographically-secure digests
With a cryptographically-secure hash function, this shouldn't be possible without either breaking the hash function (particularly, a secure digest should make it "infeasible to generate a message that has a given hash"), or applying brute force.
But these hashes are much longer than 32 bits, precisely in order to deter that sort of attack. So you can write a brute-force algorithm to do this, but unless you're extremely lucky you shouldn't expect it to finish before the universe ends.
MD5 is broken, so it might be easier
The MD5 algorithm is seriously broken, and a chosen-prefix collision attack is already practical (as used in the Flame malware's forged certificate; see http://www.cwi.nl/news/2012/cwi-cryptanalist-discovers-new-cryptographic-attack-variant-in-flame-spy-malware, http://arstechnica.com/security/2012/06/flame-crypto-breakthrough/). I don't know of what you want having actually been done, but there's a good chance it's possible. It's probably an open research question.
For example, this could be done using a chosen-prefix preimage attack, choosing the prefix equal to the desired hash, so that the hash would be embedded in the file. A
preimage attack is more difficult than collision attacks, but there has been some progress towards it. See Does any published research indicate that preimage attacks on MD5 are imminent?.
It might also be possible to find a fixed point for MD5; inserting a digest is essentially the same problem. For discussion, see md5sum a file that contain the sum itself?.
Related questions:
Is there any x for which SHA1(x) equals x?
Is a hash result ever the same as the source value?

The only way to do this is if you define your file format so the hash only applies to the part of the file that doesn't contain the hash.
However, including the hash inside a file (like built into an ISO) defeats the whole security benefit of the hash. You need to get the hash from a different channel and compare it with your file.

No, because that would mean that the hash would have to be a hash of itself, which is not possible.

Related

What does a file MD5 Hash represent? [duplicate]

This question already has answers here:
when I hash a file with Md5 what is hashed?
(2 answers)
Closed 7 years ago.
I know a file's MD5 Hash is like a digital fingerprint used to confirm integrity and authenticity. There are many utilities to get the MD5 Hash of a file but what does that hash base on? File size? File low level binaries? Code?
MD5 is a so-called cryptographic hash function.
This basically means that you can give in any bitstring as input for the function, and you will get out a fixed-size bitstring (128-bit in the case of MD5) as output. The output is usually called "digest".
The digest depends solely on the input and nothing else. Thus in itself it can be used as an integrity proof, but not as authenticity, if the underlying hash function has the necessary properties (in this case collision-resistance). This means that for two different outputs the digest itself should be also different. The problem is that the digest's size is fixed, which in turn means that with sufficient number of messages it will always be possible to find a collision (i.e., two different inputs yielding the same output).
One should also note that there is nowadays no justification to use MD5, as weaknesses have been discovered (namely post-fix collision attacks). Also using SHA-256/512 on modern hardware is usually faster then MD5.
Shortly: the output of the cryptographic hash functions (and so MD5's) depends on the input bitstring.
Update: based on your comment for the other answer, you are looking for this: https://en.wikipedia.org/wiki/MD5#Algorithm
You can read about it here:
https://en.wikipedia.org/wiki/Md5sum
In general, the algorithm runs over the file and it's output is checksum, that means that if someone changes a bit in the file the checksum will be changed. so it's a way to validate that the file you looking at is the file you think you looking at and lowering the probability that someone put a melicius code in it

How to find out if two binary files are exactly the same

I have got a repository where I store all my image files. I know that there are much images which are duplicated and I want to delete each one of duplicated ones.
I thought if I generate checksum for each image file and rename the file to its checksum, I can easily find out if there are duplicated ones by examining the filename. But the problem is that, I cannot be sure about selecting the checksum algorithm to use. For example, if I generate the checksums using MD5, can I exactly trust if the checksums are the same that means the files are exactly the same?
Judging from the response to a similar question in security forum (https://security.stackexchange.com/a/3145), the collision rate is about 1 collision per 2^64 messages. If your files are differenet and your collection is not huge (i.e. close to this number), md5 can be used safely.
Also, see response to a very similar question here: How many random elements before MD5 produces collisions?
The chances of getting the same checksum for 2 different files are extremely slim, but can never be absolutely guaranteed (Pigeonhole principle). An indication of how slim may be that GIT uses the SHA-1 checksum for software development source code including Linux and has never caused any known problems so I would say that you are safe. I would use SHA-1 instead of MD5 because it is slightly better if you are really paranoid.
To make really sure you best follow a two-step-procedure: first calculate a checksum for every file. If the checksums differ you're sure the files are not identical. If you happen to find some files with equal checksums there's no way around doing a bit-by-bit-comparison to make 100% sure if they are really identical. This holds regardless of the hashing-algorithm used.
What you'll get is a massive time-saving as doing bit-by-bit comparison for every possible pair of files will take forever and a day while comparing a hand full of possible candidates is fairly easy.

What is benefit to use context trigger piecewise hashing over traditional piecewise hashing?

Original piecewise hashing uses fixed-size segment of a file to evaluate a hash value.
And I found some piecewise hashing such as context trigger piecewise hashing uses rolling hash to trigger when to evaluate a hash value of a file.
And I'm quite not sure a point to use this technique over traditional one.
I assume you mean "Context Triggered Piecewise Hashing"?
The ssdeep project links to a paper called "Identifying almost identical files using context triggered piecewise hashing", by Jesse Kornblum. That paper covers the origin and goals of CTPH in the form of the spamsum algorithm.
To summarize:
Computing a full file hash, such as by running sha1sum file, allows you to find pairs of files that are exactly the same, in time linear in the total size of the files.
Using fixed-size segments for piecewise hashing means that if bytes are re-written in the middle of a file, you can probably still identify that it's the same as a reference file. However, if bytes are inserted or deleted, then the checksums for the entire rest of the file change.
CTPH should allow recognizing the similarity between files even in the presence of more substantial differences. As long as the changes aren't too large, CTPH can handle inserting or deleting portions of the file. The paper claims that given just the first third or the final third of a file, spamsum can recognize which file it probably came from.

md5 collision database?

I'm writing a file system deduper. The first pass generates md5 checksums, and the second pass compares the files with identical checksums.
Is there a collection of strings which differ but generate identical md5 checksums I can incorporate into my test case collection?
Update: mjv's answer points to these two files, perfect for my test case.
http://www.win.tue.nl/~bdeweger/CollidingCertificates/MD5Collision.certificate1.cer
http://www.win.tue.nl/~bdeweger/CollidingCertificates/MD5Collision.certificate2.cer
You can find a couple of different X.509 certificate files with the same MD5 hash at this url.
I do not know of MD5 duplicate files repositories, but you can probably create your own, using the executables and/or the techniques described on Vlastimil Klima's page on MD5 Collision
Indeed MD5 has been know for its weakness with regards to collision resistance, however I wouldn't disqualify it for a project such as your file system de-duper; you may just want to add a couple of additional criteria (which can be very cheap, computationally speaking) to further decrease the possibility of duplicates.
Alternatively, for test purposes, you may simply modify your MD5 compare logic so that it deems some MD5 values identical even though they are not (say if the least significant byte of the MD5 matches, or systematically, every 20 comparisons, or at random ...). This may be less painful than having to manufacture effective MD5 "twins".
http://www.nsrl.nist.gov/ might be what you want.

Reuse of characters in compiled .exe file

Once long ago, out of curiosity, I've tried hex-editing the executable file of the game "Dangerous Dave".
I've looked around the file for any strings I could find, and made some random edits to see if it would actually change the text displayed within the game.
I was surprised to see the result, which I have now recreated using a hex-editor and DOSBox:
As can be seen, editing the two characters "RO" in the string "ROMERO" resulted in 4 characters being changed, with the result becoming "ZUMEZU". It seems as if the program is reusing the two characters and prints them at the start and end of that string.
What is the cause of this? My first guess would be trying to make the executable smaller but just the code that reuses the characters would probably require more space than those 2 bytes to be saved.
Is it just a trick done by the author, or just some compiler voodoo?
Tricky to say for sure without reverse-engineering, but my guess would be that a lot of the constant data in the program is compressed using an algorithm from the LZ family. These compression schemes work essentially in the way that you've observed: they encode repeated substrings as references to text that has previously been decoded.
These compression algorithms were probably used for more than just this one string, and not just for text either; it's quite possible that they were also used to compress other data, such as graphics or level layouts. In short, there were probably significant savings made by using this algorithm!
The use of these compression algorithms is common in older games as a way of saving disk space, but was not automatic - the implementation of this algorithm would likely have been something Romero added himself.

Resources