I downloaded a file and used md5sum to see if the download was successful without corruption. I got the following value:
a7099fcf9572d91b10d0073b07e112cb ./Macaca_mulatta.MMUL_1.70.dna.chromosome.1.fa.gz
But when I checked the website I downloaded the file from, it gave me the following value.
10256 63747 Macaca_mulatta.MMUL_1.70.dna.chromosome.1.fa.gz
What is this 10 digit code? is it not md5?
I downloaded the file from : ftp://ftp.ensembl.org/pub/release-70/fasta/macaca_mulatta/dna/CHECKSUMS
Ensembl is using the unix 'sum' utilty to calcualte the CHECKSUM.gz file.
Here's more info about the program : http://en.wikipedia.org/wiki/Sum_%28Unix%29
To see if your download is correct, try:
sum Macaca_mulatta.MMUL_1.70.dna.chromosome.1.fa.gz
NOTE: It happened before that Ensembl did not update their CHECKSUM file so it can always happen that the download is correct but the CHECKSUM.gz file is incorrect.
They are not the same thing. MD5 is a checksum but there are other checksum algorithms that are not MD5, such as SHA, CRC etc.
Generally a checksum is a function that takes an input that's larger in size than its output and (it better) produces greatly different outputs even if one bit in the input is changed.
The output you're looking at consists of two 5-digit decimal numbers, so it's likely your checksum algorithm is CRC32. The unix sum command may be used to calculate/verify it.
MD5 is a way to do a checksum, but there are others. CRC is one, so is SHA. All MD5 does is produce a hash code, and it is not the only algorithm to do so. I'm not sure what the 10 digit one is, but it can't be MD5.
Related
This question already has answers here:
when I hash a file with Md5 what is hashed?
(2 answers)
Closed 7 years ago.
I know a file's MD5 Hash is like a digital fingerprint used to confirm integrity and authenticity. There are many utilities to get the MD5 Hash of a file but what does that hash base on? File size? File low level binaries? Code?
MD5 is a so-called cryptographic hash function.
This basically means that you can give in any bitstring as input for the function, and you will get out a fixed-size bitstring (128-bit in the case of MD5) as output. The output is usually called "digest".
The digest depends solely on the input and nothing else. Thus in itself it can be used as an integrity proof, but not as authenticity, if the underlying hash function has the necessary properties (in this case collision-resistance). This means that for two different outputs the digest itself should be also different. The problem is that the digest's size is fixed, which in turn means that with sufficient number of messages it will always be possible to find a collision (i.e., two different inputs yielding the same output).
One should also note that there is nowadays no justification to use MD5, as weaknesses have been discovered (namely post-fix collision attacks). Also using SHA-256/512 on modern hardware is usually faster then MD5.
Shortly: the output of the cryptographic hash functions (and so MD5's) depends on the input bitstring.
Update: based on your comment for the other answer, you are looking for this: https://en.wikipedia.org/wiki/MD5#Algorithm
You can read about it here:
https://en.wikipedia.org/wiki/Md5sum
In general, the algorithm runs over the file and it's output is checksum, that means that if someone changes a bit in the file the checksum will be changed. so it's a way to validate that the file you looking at is the file you think you looking at and lowering the probability that someone put a melicius code in it
I have got a repository where I store all my image files. I know that there are much images which are duplicated and I want to delete each one of duplicated ones.
I thought if I generate checksum for each image file and rename the file to its checksum, I can easily find out if there are duplicated ones by examining the filename. But the problem is that, I cannot be sure about selecting the checksum algorithm to use. For example, if I generate the checksums using MD5, can I exactly trust if the checksums are the same that means the files are exactly the same?
Judging from the response to a similar question in security forum (https://security.stackexchange.com/a/3145), the collision rate is about 1 collision per 2^64 messages. If your files are differenet and your collection is not huge (i.e. close to this number), md5 can be used safely.
Also, see response to a very similar question here: How many random elements before MD5 produces collisions?
The chances of getting the same checksum for 2 different files are extremely slim, but can never be absolutely guaranteed (Pigeonhole principle). An indication of how slim may be that GIT uses the SHA-1 checksum for software development source code including Linux and has never caused any known problems so I would say that you are safe. I would use SHA-1 instead of MD5 because it is slightly better if you are really paranoid.
To make really sure you best follow a two-step-procedure: first calculate a checksum for every file. If the checksums differ you're sure the files are not identical. If you happen to find some files with equal checksums there's no way around doing a bit-by-bit-comparison to make 100% sure if they are really identical. This holds regardless of the hashing-algorithm used.
What you'll get is a massive time-saving as doing bit-by-bit comparison for every possible pair of files will take forever and a day while comparing a hand full of possible candidates is fairly easy.
After writing a basic LFSR-based stream cipher encryption module in C, I tried it on usual text files, and then on a .exe file in Windows. However, after decrypting it back the file is not running, giving some error about being a 16-bit. Evidently some error in decrypting. Or are files made so that if I tamper with their binary code they become corrupted?
I'm checking my program on text-files in the hope of locating any error on my part. However, the question is had anyone tried running your own encryption programs on an executable file? Is their any obvious answer to this?
There is nothing special about executables. They are obviously binary files and thus contain 00 bytes and bytes >127. As long as your algorithm is binary safe, it should work.
Compare the original file and the decrypted file using a hex-editor. To see how they differ.
The error you get means that you didn't decrypt the executable header correctly, so the decryption mistake must already affect the first few bytes of your file.
Evidently some error in decrypting. An exe is a bag 'o bytes just like any other file, there's no magic. You are merely likely to run into byte values that you won't get in a text file. Like a zero.
A decryption process should be the inverse of its encryption. In other words, Decrypt(Encrypt(X)) == X for all inputs X, of all possible lengths, of all possible byte values.
I suggest you build yourself a test harness that will run some pairwise checks with randomised data so you can prove to yourself that the two transformations do indeed cancel each other out. I mean something like:
for length from 0 to 1000000:
generate a string of that length with random contents
encrypt it to a fresh memory buffer
decrypt it to a fresh memory buffer
compare the decrypted string with the original string
Do this first of all on in-memory strings so you can isolate the algorithm from your file-handling code.
Once you've proved the algorithm is properly inverting, you can then do the same for files; as others have said you might well be running into issues with handling binary files, that's a common gotcha.
Supposed that you want to create a file which contains the md5 digest of itself. How to do that?
Before you include the md5 value into the file, you have to calculate the value, but if and only if the md5 digest has been included into the file, you can calculate the value. It's kind o a dilemma. Any Idea?
The short answer is that unless you use MD5 vulnerabilities you can't do that. I believe that even using MD5 vulnerabilities building such collision is impractical. A solution would be to either attach the digest at the end of the file or ship it separately.
You have to avoid that cyclic dependency, so you can't checksum the checksum. To work around that, you could reserver space for the checksum in your file, but set that space to zeroes. Then calculate the checksum and embed it. In order to check it later, you'd have to read it in, and set those bytes in the file to zero again.
Are there algorithms for putting a digest into the file being digested?
In otherwords, are there algorithms or libraries, or is it even possible to have a hash/digest of a file contained in the file being hashed/digested. This would be handy for obvious reasons, such as built in digests of ISOs. I've tried googling things like "MD5 injection" and "digest in a file of a file." No luck (probably for good reason.)
Not sure if it is even mathematically possible. Seems you'd be able to roll through the file but then you'd have to brute the last bit (assuming the digest was the last thing in the file or object.)
Thanks,
Chenz
It is possible in a limited sense:
Non-cryptographically-secure hashes
You can do this with insecure hashes like the CRC family of checksums.
Maclean's gzip quine
Caspian Maclean created a gzip quine, which decompresses to itself. Since the Gzip format includes a CRC-32 checksum (see the spec here) of the uncompressed data, and the uncompressed data equals the file itself, this file contains its own hash. So it's possible, but Maclean doesn't specify the algorithm he used to generate it:
It's quite simple in theory, but the helper programs I used were on a hard disk that failed, and I haven't set up a new working linux system to run them on yet. Solving the checksum by hand in particular would be very tedious.
Cox's gzip, tar.gz, and ZIP quines
Russ Cox created 3 more quines in Gzip, tar.gz, and ZIP formats, and wrote up in detail how he created them in an excellent article. The article covers how he embedded the checksum: brute force—
The second obstacle is that zip archives (and gzip files) record a CRC32 checksum of the uncompressed data. Since the uncompressed data is the zip archive, the data being checksummed includes the checksum itself. So we need to find a value x such that writing x into the checksum field causes the file to checksum to x. Recursion strikes back.
The CRC32 checksum computation interprets the entire file as a big number and computes the remainder when you divide that number by a specific constant using a specific kind of division. We could go through the effort of setting up the appropriate equations and solving for x. But frankly, we've already solved one nasty recursive puzzle today, and enough is enough. There are only four billion possibilities for x: we can write a program to try each in turn, until it finds one that works.
He also provides the code that generated the files.
(See also Zip-file that contains nothing but itself?)
Cryptographically-secure digests
With a cryptographically-secure hash function, this shouldn't be possible without either breaking the hash function (particularly, a secure digest should make it "infeasible to generate a message that has a given hash"), or applying brute force.
But these hashes are much longer than 32 bits, precisely in order to deter that sort of attack. So you can write a brute-force algorithm to do this, but unless you're extremely lucky you shouldn't expect it to finish before the universe ends.
MD5 is broken, so it might be easier
The MD5 algorithm is seriously broken, and a chosen-prefix collision attack is already practical (as used in the Flame malware's forged certificate; see http://www.cwi.nl/news/2012/cwi-cryptanalist-discovers-new-cryptographic-attack-variant-in-flame-spy-malware, http://arstechnica.com/security/2012/06/flame-crypto-breakthrough/). I don't know of what you want having actually been done, but there's a good chance it's possible. It's probably an open research question.
For example, this could be done using a chosen-prefix preimage attack, choosing the prefix equal to the desired hash, so that the hash would be embedded in the file. A
preimage attack is more difficult than collision attacks, but there has been some progress towards it. See Does any published research indicate that preimage attacks on MD5 are imminent?.
It might also be possible to find a fixed point for MD5; inserting a digest is essentially the same problem. For discussion, see md5sum a file that contain the sum itself?.
Related questions:
Is there any x for which SHA1(x) equals x?
Is a hash result ever the same as the source value?
The only way to do this is if you define your file format so the hash only applies to the part of the file that doesn't contain the hash.
However, including the hash inside a file (like built into an ISO) defeats the whole security benefit of the hash. You need to get the hash from a different channel and compare it with your file.
No, because that would mean that the hash would have to be a hash of itself, which is not possible.