Use a combination of SHA1+MD5

I'm trying to find a secure way to create checksums for files (larger than 10 GB!).
SHA-256 is secure enough for me, but that algorithm is so computationally expensive that it is not suitable.
I know that both SHA-1 and MD5 checksums are insecure because of collision attacks.
So I figure the fastest and safest way is to combine MD5 with SHA-1, i.e. SHA1+MD5, since I don't think there is a way to produce a file (a collision) that has the same MD5 and SHA-1 at the same time.
So is combining SHA1+MD5 secure enough for file checksums, or is there some attack against it, such as a collision?
I use C# (Mono) in two ways (with and without BufferedStream):
using System;
using System.IO;
using System.Security.Cryptography;

public static string GetChecksum(string file)
{
    using (FileStream stream = File.OpenRead(file))
    {
        var sha = new SHA256Managed();
        byte[] checksum = sha.ComputeHash(stream);
        return BitConverter.ToString(checksum).Replace("-", String.Empty);
    }
}

public static string GetChecksumBuffered(Stream stream)
{
    // Wrap the input in a BufferedStream to cut down on small reads.
    using (var bufferedStream = new BufferedStream(stream, 1024 * 32))
    {
        var sha = new SHA256Managed();
        byte[] checksum = sha.ComputeHash(bufferedStream);
        return BitConverter.ToString(checksum).Replace("-", String.Empty);
    }
}
Update 1:
I mean the SHA-1 hash + the MD5 hash: first calculate the SHA-1 of the file, then calculate the MD5 of the file, then concatenate the two strings.
Update 2:
As @zaph suggested, I implemented my code (C# Mono) again according to what I read here, but it doesn't make my code as fast as he said! It brings the time for a 4.6 GB file down from approximately 12 minutes to about 8 minutes, but SHA1+MD5 takes less than 100 seconds for this file. So I still don't think it is right to use SHA-256 instead.
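For reference, here is a minimal sketch of the concatenation described in Update 1, computed in a single pass so a 10 GB file is only read once (the class name, method name, and 1 MB buffer size are my choices, not from the original post):

using System;
using System.IO;
using System.Security.Cryptography;

public static class CombinedChecksum
{
    // Computes SHA1(file) followed by MD5(file) as one hex string, reading the file once.
    public static string GetSha1Md5Checksum(string file)
    {
        using (var sha1 = SHA1.Create())
        using (var md5 = MD5.Create())
        using (var stream = File.OpenRead(file))
        {
            var buffer = new byte[1024 * 1024]; // 1 MB buffer; tune as needed
            int read;
            while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
            {
                // Feed the same bytes to both hash objects.
                sha1.TransformBlock(buffer, 0, read, null, 0);
                md5.TransformBlock(buffer, 0, read, null, 0);
            }
            sha1.TransformFinalBlock(buffer, 0, 0);
            md5.TransformFinalBlock(buffer, 0, 0);
            return BitConverter.ToString(sha1.Hash).Replace("-", String.Empty)
                 + BitConverter.ToString(md5.Hash).Replace("-", String.Empty);
        }
    }
}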

There should be only a small difference between SHA-256 and a combination of MD5+SHA1.
The only way to know is to benchmark:
On my desktop:
SHA-256: 200 MB/s
MD5: 470 MB/s
SHA1: 500 MB/s (updated, previously incorrect)
MD5+SHA1: 240 MB/s
These times are only for the hashing; disk read time is not included. The tests were done with a 1 MB buffer and averaged over 10 runs. The language was C and the library used was Apple's Common Crypto. The CPU was a 2.8 GHz Quad-Core Intel Xeon (2010 Mac Pro; my laptop is faster).
In the end, the combined MD5+SHA1 is about 20% faster than SHA-256 (240 MB/s vs. 200 MB/s).
Note: Most Intel processors have instructions that can be used to make crypto operations faster. Not all implementations utilize these instructions.
You might also try a native implementation such as sha256sum.
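"The only way to know is to benchmark" can be done in a few lines of C# as well; a rough sketch that times only in-memory hashing, so disk reads don't skew the result (the buffer size and iteration count are arbitrary choices of mine):

using System;
using System.Diagnostics;
using System.Security.Cryptography;

public static class HashBenchmark
{
    public static void Main()
    {
        var data = new byte[1024 * 1024]; // 1 MB of zeroes is fine for a throughput test
        const int iterations = 1024;      // hashes 1 GB in total per algorithm

        foreach (var name in new[] { "MD5", "SHA1", "SHA256" })
        {
            using (var algo = HashAlgorithm.Create(name))
            {
                var sw = Stopwatch.StartNew();
                for (int i = 0; i < iterations; i++)
                    algo.ComputeHash(data);
                sw.Stop();
                // Each iteration hashes 1 MB, so MB/s = iterations / elapsed seconds.
                Console.WriteLine("{0}: {1:F0} MB/s", name, iterations / sw.Elapsed.TotalSeconds);
            }
        }
    }
}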

If by SHA1+MD5 you mean hashing with SHA-1 first and then using that digest as input to MD5, then you are not eliminating collisions completely, just potentially reducing the chance of one occurring.
Both SHA-1 and MD5 are fixed-length cryptographic hash functions, and according to the pigeonhole principle, collisions are bound to occur whenever the message length is greater than the digest size. There are two instances of this in your use case:
When you hash your arbitrary-length message with SHA-1
When the 160-bit SHA-1 digest is used as input to MD5
My point is that collisions will always exist. However, the probability of finding one is exceedingly small. If the sole purpose is for file integrity, SHA-1 will do the job just fine on its own.
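To make the two interpretations concrete, here is a minimal sketch (the names are mine; neither poster supplied code):

using System;
using System.Security.Cryptography;

public static class TwoInterpretations
{
    // Interpretation 1: chaining - MD5 over the SHA-1 digest (128-bit result).
    public static byte[] Chained(byte[] data)
    {
        using (var sha1 = SHA1.Create())
        using (var md5 = MD5.Create())
            return md5.ComputeHash(sha1.ComputeHash(data));
    }

    // Interpretation 2 (what the asker meant): the SHA-1 digest followed by
    // the MD5 digest, giving a 160 + 128 = 288-bit result.
    public static byte[] Concatenated(byte[] data)
    {
        using (var sha1 = SHA1.Create())
        using (var md5 = MD5.Create())
        {
            byte[] a = sha1.ComputeHash(data);
            byte[] b = md5.ComputeHash(data);
            var result = new byte[a.Length + b.Length];
            Buffer.BlockCopy(a, 0, result, 0, a.Length);
            Buffer.BlockCopy(b, 0, result, a.Length, b.Length);
            return result;
        }
    }
}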
Related:
What checksum algorithm should I use?
Is MD5 still good enough to uniquely identify files?

Related

Recommended short (128 bit) and fast checksum for deduplication?

I am using MD5 for file deduplication. This is great since the files are from a trusted source, and they will not intentionally exploit the MD5 shortcomings.
Later, I may accept untrusted files, and deduplicate them as well. I've read the summary of MD5 on Wikipedia, and there seems to be many shortcomings in the quality of this hash. It is generally recommended to use SHA-1 or better.
SHA-1 and SHA-256 are longer. I could XOR the two halves together to reach a 128-bit answer, but that seems inefficient.
Is there a recommended 128-bit hash that fits the deduplication use case? I know the issue is minor enough that I could just continue using MD5, but it seems another 128-bit alternative would have been developed by now?
You could just chop down SHA-256 or SHA-512, using the leftmost 128 bits. SHA-512 is actually 30% to 60% faster on 64-bit systems. There is no reduction in security beyond the smaller digest length (a 128-bit digest gives 128-bit strength against preimages and 64-bit strength against collisions).
Another option is SHAKE256. NIST codified four fixed-length drop-in replacements for SHA-2 as SHA3-224, SHA3-256, SHA3-384, and SHA3-512, but the underlying algorithm supports arbitrary bit lengths. The term SHAKE256 applies to that underlying algorithm when used for arbitrary-sized outputs. You could have a 128-bit or even a 179-bit output using SHAKE256.
Still, I would consider rehashing your file store using SHA-2 to get 256-bit identifiers while all the files are from trusted sources, and then migrating towards an all-SHA-256 system prior to accepting potentially untrustworthy data.
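A minimal sketch of the truncation approach, assuming nothing beyond the standard library (the helper name is mine):

using System;
using System.Security.Cryptography;

public static class TruncatedHash
{
    // Returns the leftmost 128 bits (16 bytes) of the SHA-256 digest as hex.
    public static string Sha256Left128(byte[] data)
    {
        using (var sha256 = SHA256.Create())
        {
            byte[] digest = sha256.ComputeHash(data);
            return BitConverter.ToString(digest, 0, 16).Replace("-", String.Empty);
        }
    }
}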

Weird use of md5 for hashing a file - Does it do anything?

For uploading a file to a service, I was calculating the md5 based on the whole content of the file.
I was asked to do it in a different way: the md5 of the file, and then also of several more parts: 2% from the start of the file, 2% from 1/3 of the file, 2% from 2/3, and 2% from the end of the file, and then append the file size in bytes at the end.
Apparently this solves hash collisions between files. To me it seems like a waste of time, since you're not increasing the size of the MD5. So for a very large number of files, you're still going to have, statistically, the same number of collisions.
Please help me understand the reasoning behind this.
EDIT: we are then hashing the resulting hashes.
A good cryptographically strong hashing algorithm is already designed with the goal to make it infeasible to intentionally find two different pieces of data with the same hash, let alone by accident. Therefore, just hashing the file is sufficient. Extra hashing of parts of the file is pointless.
This may seem unintuitive because obviously there must exist collisions if the length of the hash is shorter than the length of the data. However, it is not feasible to find these collisions because an MD5 hash is an unpredictable 128-bit number and the amount of possible 128-bit numbers (2^128) is mind-boggling. If you could count at a rate of a trillion trillion per second, counting through all 128-bit numbers would still take (2^128 / 1e24) seconds, or about 10 million years. This is probably a good lower limit on the amount of time it would take to find a hash collision the brute-force way without custom hardware.
That said, this is all assuming that there are no weaknesses in the hashing algorithm that allow you to do better than brute force. MD5 is broken in this regard, so you should not use it if you need to defend against attackers that would try to create collisions. It would be better to use a newer hashing algorithm like SHA-2 or SHA-3. (These also support even larger outputs such as 256 bits.)
Sounds like a dangerous practice, because you're re-hashing without factoring in a lot of data. The advantage, however, is that by running other hashes you effectively wind up with a hash signature consisting of "more bits" (i.e. you are getting several MD5 hashes as a result).
If you want to do this, and are in fact okay with having more (larger) hash data to store/compare, you would be MUCH better advised to simply run a different hash function (other than MD5) that is more secure and/or uses a larger number of bits.
MD5 is an "older" algorithm and is known to have cryptographic weaknesses. I'd recommend one of the "SHA" algorithms, like SHA-256 or SHA-512. The advantages are that it is a stronger algorithm, you'd only have to hash the data ONCE, and you'd get more bits than an MD5; and since you're running it once, it would be faster.
Note that the possibility of hash collisions always exists. Even "high end" storage products that use hashes for detection will compare the actual buffers to verify an exact match, even when the two hashes match.
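That verify-on-match pattern is easy to sketch; a hedged example for a deduplicator (all names are mine, and a real implementation would compare streams in chunks rather than load whole files into memory):

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

public static class Deduplicator
{
    private static readonly Dictionary<string, string> SeenHashes =
        new Dictionary<string, string>();

    // Returns the path of an identical, previously seen file, or null if the file is new.
    public static string FindDuplicate(string path)
    {
        string key;
        using (var md5 = MD5.Create())
        using (var stream = File.OpenRead(path))
            key = BitConverter.ToString(md5.ComputeHash(stream));

        if (SeenHashes.TryGetValue(key, out var existing))
        {
            // A hash match is only a candidate: confirm with a byte-for-byte comparison.
            if (File.ReadAllBytes(existing).SequenceEqual(File.ReadAllBytes(path)))
                return existing;
        }
        else
        {
            SeenHashes[key] = path;
        }
        return null;
    }
}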

SHA-256 or MD5 for file integrity

I know that SHA-256 is favored over MD5 for security, etc., but, if I am to use a method to only check file integrity (that is, nothing to do with password encryption, etc.), is there any advantage of using SHA-256?
Since MD5 is 128-bit and SHA-256 is 256-bit (therefore twice as big)...
Would it take up to twice as long to encrypt?
Where time is not of the essence, like in a backup program, and file integrity is all that is needed, would anyone argue against MD5 in favor of a different algorithm, or even suggest a different technique?
Does using MD5 produce a checksum?
Both SHA256 and MD5 are hashing algorithms. They take your input data, in this case your file, and output a 256/128-bit number. This number is a checksum. There is no encryption taking place because an infinite number of inputs can result in the same hash value, although in reality collisions are rare.
SHA256 takes somewhat more time to calculate than MD5, according to this answer.
Offhand, I'd say that MD5 would probably be suitable for what you need.
Every answer seems to suggest that you need to use a secure hash to do the job, but all of those are tuned to be slow, to force a brute-force attacker to have lots of computing power, and depending on your needs this may not be the best solution.
There are algorithms specifically designed to hash files as fast as possible for integrity checking and comparison (Murmur, xxHash...). Obviously these are not designed for security, as they don't meet the requirements of a secure hash algorithm (i.e. randomness), but they have low collision rates for large messages. These features make them ideal if you are not looking for security but speed.
Examples of these algorithms and a comparison can be found in this excellent answer: Which hashing algorithm is best for uniqueness and speed?.
As an example, at our Q&A site we use murmur3 to hash the images uploaded by users, so we only store them once even if users upload the same image in several answers.
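If you want to try this in C#, here is a hedged sketch using the System.IO.Hashing NuGet package (I'm assuming its XxHash64 streaming API; check the package docs):

using System;
using System.IO;
using System.IO.Hashing; // NuGet package: System.IO.Hashing

public static class FastFingerprint
{
    // Non-cryptographic 64-bit fingerprint: fast, but never use it against adversarial input.
    public static string XxHash64Of(string path)
    {
        var hasher = new XxHash64();
        using (var stream = File.OpenRead(path))
        {
            var buffer = new byte[64 * 1024];
            int read;
            while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
                hasher.Append(buffer.AsSpan(0, read));
        }
        return Convert.ToHexString(hasher.GetCurrentHash());
    }
}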
To 1):
Yes, on most CPUs, SHA-256 is only about 40% as fast as MD5.
To 2):
I would argue for a different algorithm than MD5 in such a case. I would definitely prefer an algorithm that is considered safe. However, this is more of a feeling. Cases where this matters would be contrived rather than realistic; e.g., if your backup system encounters one of the example files from an attack on an MD5-based certificate, you would likely have two files with different data but identical MD5 checksums. For the rest of the cases, it doesn't matter, because MD5 checksums collide (= same checksum for different data) virtually only when provoked intentionally.
I'm not an expert on the various hashing (checksum-generating) algorithms, so I cannot suggest another algorithm. Hence this part of the question is still open.
Suggested further reading is Cryptographic Hash Function - File or Data Identifier on Wikipedia. Also further down on that page there is a list of cryptographic hash algorithms.
To 3):
MD5 is an algorithm to calculate checksums. A checksum calculated using this algorithm is then called an MD5 checksum.
The underlying MD5 algorithm is no longer deemed secure. Thus, while md5sum is well-suited for identifying known files in situations that are not security related, it should not be relied on if there is a chance that files have been purposefully and maliciously tampered with. In the latter case, the use of a newer hashing tool such as sha256sum is highly recommended.
So, if you are simply looking to check for file corruption or file differences, when the source of the file is trusted, MD5 should be sufficient. If you are looking to verify the integrity of a file coming from an untrusted source, or from a trusted source over an unencrypted connection, MD5 is not sufficient.
Another commenter noted that Ubuntu and others use MD5 checksums. Ubuntu has moved to PGP and SHA-256 in addition to MD5, but the documentation of the stronger verification strategies is more difficult to find. See the HowToSHA256SUM page for more details.
No, it's slower, but not that slow.
For a backup program, it may even be necessary to have something faster than MD5.
All in all, I'd say that MD5 in addition to the file name is absolutely safe. SHA-256 would just be slower and harder to handle because of its size.
You could also use something less secure than MD5 without any problem. If nobody is trying to attack your file integrity, this is safe too.
It is well established that MD5 is faster than SHA-256, so for just verifying file integrity it will be sufficient and better for performance.
You can check out the following resources:
Speed Comparison of Popular Crypto Algorithms
Comparison of cryptographic hash functions
Yes, on most CPUs, SHA-256 is two to three times slower than MD5, though not primarily because of its longer hash. See the other answers here and the answers to this Stack Overflow question.
Here's a backup scenario where MD5 would not be appropriate:
Your backup program hashes each file being backed up. It then stores each file's data by its hash, so if you're backing up the same file twice you only end up with one copy of it.
An attacker can cause the system to back up files they control.
The attacker knows the MD5 hash of a file they want to remove from the backup.
The attacker can then use the known weaknesses of MD5 to craft a new file that has the same hash as the file to remove. When that file is backed up, it will replace the file to remove, and that file's backed-up data will be lost.
This backup system could be strengthened a bit (and made more efficient) by not replacing files whose hash it has previously encountered, but then an attacker could prevent a target file with a known hash from being backed up by preemptively backing up a specially constructed bogus file with the same hash.
Obviously most systems, backup and otherwise, do not satisfy the conditions necessary for this attack to be practical, but I just wanted to give an example of a situation where SHA-256 would be preferable to MD5. Whether this would be the case for the system you're creating depends on more than just the characteristics of MD5 and SHA-256.
Yes, cryptographic hashes like the ones generated by MD5 and SHA-256 are a type of checksum.
Happy hashing!

Is CRC32 really so bad for file integrity check?

Of course MD5 is better than CRC32, SHA-1 is better than MD5, and so on... But they are also much slower than CRC32.
Right now I am thinking about how to check the consistency of a file being transferred, and CRC32 is the fastest option.
I haven't found anywhere how bad CRC32 is for integrity checks (in other words, how probable it is that CRC32 will fail to detect a malformed file).
Quoting from http://www.mathpages.com/home/kmath458.htm :
So, if we assume that any corruption of our data affects our string in a completely random way, i.e., such that the corrupted string is totally uncorrelated with the original string, then the probability of a corrupted string going undetected is 1/(2^n). This is the basis on which people say a 16-bit CRC has a probability of 1/(2^16) = 1.5E-5 of failing to detect an error in the data, and a 32-bit CRC has a probability of 1/(2^32), which is about 2.3E-10 (less than one in a billion).
My opinion: CRC-32 is more than enough for error detection. It is widely used. However, it is not secure when you want to use it as a hash function.
Collisions (the same hash output for different data) can occur easily with CRC-32, because CRC-32 uses only 32 bits, compared to other algorithms: MD5 is 128 bits, SHA-1 is 160 bits, and SHA-2 (the SHA-256/512 series) ranges from 224 to 512 bits, depending on the variant. Also, no collision has been found for the SHA-2 series.
For more info about the mathematics and probability of a collision in your data, look up hash collisions and the birthday paradox problem.
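CRC-32 is also simple enough to implement directly; a minimal table-driven sketch of the common zip/gzip/PNG variant:

using System;
using System.IO;

public static class Crc32Checksum
{
    private static readonly uint[] Table = BuildTable();

    private static uint[] BuildTable()
    {
        var table = new uint[256];
        for (uint i = 0; i < 256; i++)
        {
            uint c = i;
            for (int k = 0; k < 8; k++)
                c = (c & 1) != 0 ? 0xEDB88320 ^ (c >> 1) : c >> 1; // reflected polynomial
            table[i] = c;
        }
        return table;
    }

    // Standard CRC-32 (the variant used by zip, gzip, and PNG).
    public static uint Compute(Stream stream)
    {
        uint crc = 0xFFFFFFFF;
        var buffer = new byte[64 * 1024];
        int read;
        while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
            for (int i = 0; i < read; i++)
                crc = Table[(crc ^ buffer[i]) & 0xFF] ^ (crc >> 8);
        return crc ^ 0xFFFFFFFF;
    }
}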

A suitable hash function to detect data corruption / check for data integrity?

What is the most suitable hash function for file integrity checking (checksums) to detect corruption?
I need to consider the following:
Wide range of file size (1 kb to 10GB+)
Lots of different file types
Large collection of files (+/-100 TB and growing)
Do larger files require larger digest sizes (SHA-1 vs SHA-512)?
I see that the SHA-family is referred to as cryptographic hash functions. Are they ill-suited for "general purpose" use such as detecting file corruption? Will something like MD5 or Tiger be better?
If malicious tampering is also a concern, will your answer change w.r.t the most suitable hash function?
External libraries are not an option, only what's available on Win XP SP3+.
Naturally performance is also of concern.
(Please excuse my terminology if it is incorrect, my knowledge on this subject is very limited).
Any cryptographic hash function, even a broken one, will be fine for detecting accidental corruption. A given hash function may be defined only for inputs up to some limit, but for all standard hash functions that limit is at least 2^64 bits, i.e. about 2 million terabytes. That's quite large.
File type has no bearing whatsoever. Hash functions operate over sequences of bits (or bytes) regardless of what those bits represent.
Hash function performance is unlikely to be an issue. Even the "slow" hash functions (e.g. SHA-256) will run faster on a typical PC than the hard disk: reading the file will be the bottleneck, not hashing it (a 2.4 GHz PC can hash data with SHA-512 at a speed close to 200 MB/s, using a single core). If hash function performance is an issue, then either your CPU is very feeble, or your disks are fast SSDs (and if you have 100 TB of fast SSD then I am kind of jealous). In that case, some hash functions are somewhat faster than others, MD5 being one of the "fast" functions (but MD4 is faster, and it is simple enough that its code can be included in any application without much hassle).
If malicious tampering is a concern, then this becomes a security issue, and that's more complex. First, you will want to use one of the cryptographically unbroken hash functions, hence SHA-256 or SHA-512, not MD4, MD5 or SHA-1 (the weaknesses found in MD4, MD5 and SHA-1 might not apply to a specific situation, but this is a subtle matter and it is better to play safe). Then, hashing may or may not be sufficient, depending on whether the attacker has access to the hash results. Possibly, you may need to use a MAC, which can be viewed as a kind of keyed hash. HMAC is a standard way of building a MAC out of a hash function. There are other, non-hash-based MACs. Moreover, a MAC uses a secret "symmetric" key, which is not appropriate if you want some people to be able to verify the file integrity without being able to perform silent alterations; in that case, you would have to resort to digital signatures. To be brief: in a security context, you need a thorough security analysis with a clearly defined attack model.
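For the keyed-hash case, .NET has HMAC-SHA256 built in; a minimal sketch (the class and method names are mine, and the key handling is illustrative only; a real system needs proper key management):

using System;
using System.IO;
using System.Security.Cryptography;

public static class FileMac
{
    // Computes HMAC-SHA256 over a file; only holders of the key can produce a valid tag.
    public static string Tag(string path, byte[] key)
    {
        using (var hmac = new HMACSHA256(key))
        using (var stream = File.OpenRead(path))
            return BitConverter.ToString(hmac.ComputeHash(stream)).Replace("-", String.Empty);
    }
}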

Resources