Is it possible to break an MD5 hash using genetic algorithms?

With knowledge of how MD5 works, would it be possible to use a population-based algorithm such as genetic programming to break simple passwords?
For example, given an MD5 hash of a string that is between 5 and 10 characters long, we try to recover the original string.
If yes, what could be:
A good representation for an individual of the population
Selection criteria
Recombination methods
This is to understand the application of genetic algorithms and to know if anyone has done anything of this sort.

Not really.
With just 5 characters, you could brute force it in not too unreasonable amounts of time, but presumably you're asking more about GAs than you are about breaking MD5. The problem is that there's no exploitable structure in an MD5 hash. Strings that are "close together" do not generate hashes that are "close together" under any useful distance relationship. The fitness function will basically be random.
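A quick sketch of why there is no exploitable gradient, using only Python's standard hashlib (the candidate strings are arbitrary examples): changing a single character flips roughly half of the 128 output bits, so a "distance to the target hash" fitness function is effectively noise.
import hashlib
def bit_distance(h1: bytes, h2: bytes) -> int:
    # count differing bits between two equal-length digests
    return sum(bin(x ^ y).count("1") for x, y in zip(h1, h2))
target = hashlib.md5(b"passw0rd").digest()
for candidate in (b"passw0rd", b"passw1rd", b"password", b"zzzzzzzz"):
    d = bit_distance(target, hashlib.md5(candidate).digest())
    print(candidate, d, "of 128 bits differ")
# only the exact match scores 0; every near miss lands around 64 bits away,
# which is what you would expect when comparing two random 128-bit values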

I think the answer is "no", because you cannot define a meaningful crossover function and the fitness function would be boolean. A GA with only a mutation operator and such a fitness function is just brute force.

No, it is highly unlikely.
A genetic algorithm is used, e.g., for finding a local/global maximum/minimum of some function. In the case of an MD5 hash, if you change the value you are hashing even slightly, the hash changes completely, so narrowing down the input value range is completely useless: the MD5 algorithm was designed so that the hash changes unpredictably whenever the input data changes in any way. The only operator you can really apply is mutation, but that amounts to checking random input values for whether they generate the given hash (which, as oxilumin said, is just a brute-force attack).
You can read more about finding a value that generates a specific MD5 hash here (rainbow tables).

Although the answer is probably "no", there is one caveat to consider: The published collisions are strings that only differ by a few key bytes: https://en.wikipedia.org/wiki/MD5#Collision_vulnerabilities
Guessing the plaintext with a genetic algorithm isn't guaranteed, but it may be more efficient to discover a collision that way.
Or if it's in PHP and compares the md5 hash with the == operator... https://eval.in/108854

Related

What is a proper way of calculating the size of a hash table

I am building a hash table that uses double hashing to resolve collisions. How can I know what the proper size is? I know it has to be prime to minimize the number of collisions.
The easiest way to implement hash tables is to use a power-of-2 table size.
The reason is that if N = 2^M, then calculating H % N is as simple as calculating H & (N - 1).
With fast hash functions such as MurmurHash3_32, the slowest part of using the hash table is actually calculating the modulo. H & (N - 1) does not calculate modulo but bitwise AND which is much faster (and it's the same as modulo if N is a power of 2).
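A minimal sketch (plain Python, any hash value H will do) of the mask-instead-of-modulo equivalence:
N = 1 << 16                      # N = 2**16, a power of two
for H in (0, 1, 12345, 0xDEADBEEF, 2**64 - 1):
    assert H % N == H & (N - 1)  # bitwise AND gives the same result as modulo
print("mask == modulo for every tested hash value")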
Somebody could validly claim that MurmurHash suffers from seed-independent multicollisions and therefore is susceptible to a hash collision denial of service attack. That's true, but you shouldn't use linked lists to resolve hash collisions. You should use only hash tables where the keys are sortable by some comparison function (larger than, equal, smaller than) and then you can use red-black trees (or AVL trees) to resolve hash collisions. If there's no natural comparison functions (such as for complex numbers), you can invent one.
Using a red-black tree (which almost always consists of just a single root element) together with MurmurHash is much faster than trying to be "secure" by using SipHash and then stupidly using linked lists to resolve hash collisions (which is what caused the need for the absurdly slow SipHash in the first place).
In theory, with non-power-of-2-sized hash tables where the size is rarely varying, you could use the "fast division by invariant integers using multiplication" trick but that's slower than power-of-2-sizing and bitwise AND.
The prime sizing is just for really poor hash functions. MurmurHash, although it suffers from seed-independent multicollisions, does not suffer from collisions with reasonable (non-attacker-generated) keys if the table size is a power of 2.
No, there is no point in making the size be prime and that adds a lot of extra work for you. Just make the size be a power of two and double it whenever the number of objects in the hash table reaches some threshold, like 50% or 25% of the size.
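For concreteness, here is a minimal open-addressing sketch (Python, all names invented for illustration) combining the ideas above: a power-of-two size, masking instead of modulo, double hashing with an odd second hash so the probe step stays co-prime with the table size, and doubling at a 50% load factor.
class DoubleHashSet:
    def __init__(self):
        self._size = 8                          # always a power of two
        self._slots = [None] * self._size
        self._count = 0
    def _probe(self, key):
        mask = self._size - 1
        step = (hash((key, 1)) | 1) & mask      # odd step -> co-prime with 2**m
        i = hash(key) & mask
        while self._slots[i] is not None and self._slots[i] != key:
            i = (i + step) & mask
        return i
    def add(self, key):
        if self._count * 2 >= self._size:       # 50% load factor -> double the size
            old = [k for k in self._slots if k is not None]
            self._size *= 2
            self._slots = [None] * self._size
            for k in old:
                self._slots[self._probe(k)] = k
        i = self._probe(key)
        if self._slots[i] is None:
            self._slots[i] = key
            self._count += 1
    def __contains__(self, key):
        return self._slots[self._probe(key)] == key
s = DoubleHashSet()
for word in ("foo", "bar", "baz", "qux", "quux"):
    s.add(word)
print("bar" in s, "nope" in s)                  # True False
The odd step is what makes double hashing compatible with a power-of-two size; with a prime size you could use the raw second hash directly.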
If you are asking about the current size, you may use the sizeof(table)/sizeof(element) expression, since you are using the double hashing method.
If you are asking about the new size of the hash table once it is full (i.e. it has passed a certain criterion), then the most common approach is to add 10 new slots. This should be based on what you are using your table for. The default setting of most built-in tables in other languages is: if it is 0.75 full, then add 10 slots.
If it's about something else, then please modify your question, so it's more descriptive.
Edit: I just noticed the answer above me and I think that using the 2^p method is very common too in exponentially increasing tables and is very helpful with double hashing.

how to do fuzzy search in big data

I'm new to this area and I'm mostly wondering what the state of the art is and where I can read about it.
Let's assume that I just have a key/value store and I have some distance(key1,key2) defined somehow (not sure if it must be a metric, i.e. whether the triangle inequality must always hold).
What I want is mostly a search(key) function which returns all items with keys up to a certain distance from the search key. Maybe that distance limit is configurable. Maybe this is also just a lazy iterator. Maybe there can also be a count limit, and an item (key, value) is in the returned set with some probability P, where P = 1/distance(key, search-key) or so (i.e., the perfect match would certainly be in the set and close matches at least with high probability).
One example application is fingerprint matching in MusicBrainz. They use the AcoustId fingerprint and have defined this compare function. They use the PostgreSQL GIN Index and I guess (although I haven't fully understood/read the acoustid-server code) the GIN Partial Match Algorithm, but I haven't fully understood whether that is what I asked for and how it works.
For text, what I have found so far is to use some phonetic algorithm to simplify words based on their pronunciation. An example is here. This is mostly to break the search space down to a smaller space. However, that has several limitations, e.g. it must still be a perfect match in the smaller space.
But anyway, I am also searching for a more generic solution, if that exists.
There is no (fast) generic solution; each application will need a different approach.
Neither of the two examples actually does traditional nearest neighbor search. AcoustID (I'm the author) is just looking for exact matches, but it searches in a very high number of hashes in hope that some of them will match. The phonetic search example uses metaphone to convert words to their phonetic representation and is also only looking for exact matches.
You will find that if you have a lot of data, exact search using huge hash tables is the only thing you can realistically do. The problem then becomes how to convert your fuzzy matching to exact search.
A common approach is to use locality-sensitive hashing (LSH) with a smart hashing method, but as you can see in your two examples, sometimes you can get away with even simpler approach.
Btw, if you are looking specifically at text search, the simplest thing you can do is split your input into N-grams and index those. Depending on how your distance function is defined, that might give you the right candidate matches without too much work.
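A minimal sketch of the N-gram idea (trigrams here; the class and parameter names are just illustrative): index each key's trigrams, then rank candidates by how many trigrams they share with the query.
from collections import defaultdict
def trigrams(s):
    s = "  " + s.lower() + " "                 # pad so short strings still get grams
    return {s[i:i + 3] for i in range(len(s) - 2)}
class TrigramIndex:
    def __init__(self):
        self.postings = defaultdict(set)       # trigram -> keys containing it
    def add(self, key):
        for g in trigrams(key):
            self.postings[g].add(key)
    def search(self, query, min_shared=2):
        counts = defaultdict(int)
        for g in trigrams(query):
            for key in self.postings[g]:
                counts[key] += 1
        return sorted((k for k, c in counts.items() if c >= min_shared),
                      key=lambda k: -counts[k])
idx = TrigramIndex()
for word in ("fingerprint", "fingerprints", "footprint", "finger"):
    idx.add(word)
print(idx.search("fingreprint"))               # a typo still finds close candidates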
I suggest you take a look at FLANN (Fast Library for Approximate Nearest Neighbors). Fuzzy search in big data is also known as approximate nearest neighbor search.
This library offers you different metrics, e.g. Euclidean or Hamming, and different methods of clustering: LSH or k-means, for instance.
The search is always done in two phases. First you feed the system with data to train the algorithm; this is potentially time-consuming, depending on your data.
I successfully clustered 13 million data points in less than a minute, though (using LSH).
Then comes the search phase, which is very fast. You can specify a maximum distance and/or the maximum number of neighbors.
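The same two-phase pattern (build an index up front, then run fast queries with a neighbor count and a distance cap) can be sketched with SciPy's KD-tree instead of FLANN, just to keep the example self-contained; the data here is random.
import numpy as np
from scipy.spatial import cKDTree
rng = np.random.default_rng(0)
data = rng.random((100_000, 8))        # phase 1: build the index (the slow part)
tree = cKDTree(data)
query = rng.random(8)                  # phase 2: queries are fast
dists, idxs = tree.query(query, k=5, distance_upper_bound=0.5)
print(dists, idxs)                     # neighbors beyond the cap come back as inf / n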
As Lukas said, there is no good generic solution; each domain will have its own tricks to make it faster, or a better way that exploits the inner properties of the data you are using.
Shazam uses a special technique with geometrical projections to quickly find your song. In computer vision we often use the BOW: Bag of words, which originally appeared in text retrieval.
If you can see your data as a graph, there are other methods for approximate matching using spectral graph theory for instance.
Let us know.
Depending on what your keys/values are like, the Levenshtein algorithm (also called edit distance) can help. It calculates the least number of edit operations that are necessary to modify one string to obtain another string.
http://en.wikipedia.org/wiki/Levenshtein_distance
http://www.levenshtein.net/
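For reference, a straightforward dynamic-programming sketch of Levenshtein distance (O(len(a) * len(b)) time, two rows of memory):
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]
print(levenshtein("kitten", "sitting"))  # 3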

hashing function guaranteed to be unique?

In our app we're going to be handed PNG images along with a ~200 character bytearray. I want to save the image with a filename corresponding to that bytearray, but not the bytearray itself, as I don't want 200-character filenames. So, what I thought was that I would save the bytearray into the database, and then MD5 it to get a short filename. When it comes time to display a particular image, I look up its bytearray, MD5 it, then look for that file.
So far so good. The problem is that potentially two different bytearrays could hash down to the same MD5. Then one file would effectively overwrite another. Or could they? I guess my questions are:
Could two ~200 char bytearrays MD5-hash down to the same string?
If they could, is it a once-per-10-ages-of-the-universe sort of deal or something that could conceivably happen in my app?
Is there a hashing algorithm that will produce a (say) 32 char string that's guaranteed to be unique?
It's logically impossible to get a 32 byte code from a 200 byte source which is unique among all possible 200 byte sources, since you can store more information in 200 bytes than in 32 bytes.
The only exception would be if the information stored in these 200 bytes would also fit into 32 bytes, in which case your source data format would be extremely inefficient and space-wasting.
When hashing (as opposed to encrypting), you're reducing the information space of the data being hashed, so there's always a chance of a collision.
The best you can hope for in a hash function is that all hashes are evenly distributed in the hash space and your hash output is large enough to provide your "once-per-10-ages-of-the-universe sort of deal" as you put it!
So whether a hash is "good enough" for you depends on the consequences of a collision. You could always add a unique id to a checksum/hash to get the best of both worlds.
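A tiny sketch of that last suggestion (the function name and file layout are made up): let the database row id guarantee uniqueness, while the hash keeps the filename short and content-derived.
import hashlib
def filename_for(row_id: int, blob: bytes) -> str:
    digest = hashlib.md5(blob).hexdigest()        # 32 hex characters
    return str(row_id) + "_" + digest + ".png"    # the row id keeps the name unique even if two blobs ever collided
print(filename_for(42, b"...image bytes loaded elsewhere..."))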
Why don't you use a unique ID from your database?
The probability that two hashes collide depends on the hash size. MD5 produces a 128-bit hash, so among 2^128 + 1 hashes there is guaranteed to be at least one collision.
This number is 2^160 + 1 for SHA-1 and 2^512 + 1 for SHA-512.
The usual rule applies here: the more output bits, the more uniqueness and the more computation. So there is a trade-off, and what you have to do is choose an optimal one.
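For the accidental-collision risk, the birthday bound is the number that actually matters; a rough sketch using the standard approximation p ≈ 1 - exp(-n^2 / 2^(b+1)) for n random inputs and a b-bit hash:
import math
def collision_probability(n: int, bits: int) -> float:
    # probability of at least one collision among n uniformly random b-bit hashes
    return 1.0 - math.exp(-n * n / 2.0 ** (bits + 1))
for n in (10**6, 10**9, 10**12):
    print(n, "MD5:", collision_probability(n, 128), "SHA-1:", collision_probability(n, 160))
# even with a trillion inputs an accidental MD5 collision is vanishingly unlikely;
# the realistic risk with MD5 is attacker-crafted collisions, not chance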
Could two ~200 char bytearrays MD5-hash down to the same string?
Considering that there are more ~200-byte strings than 32-character hex strings (MD5 digests), collisions are guaranteed to exist.
All hash functions have that problem, but some are more robust than MD5. Try SHA-1; Git uses it for the same purpose.
It may happen that two MD5 hashes collide (are the same). In 1996, a flaw was found in the MD5 algorithm, and cryptanalysts advised switching to the SHA-1 hashing algorithm.
So, I would advise you to switch to SHA-1 (40 hex characters). But do not worry: I doubt that your two pictures will get the same hash. I think you can accept this risk in your application.
As others said before, a hash doesn't give you what you need unless you are fine with the risk of collision.
A database is helpful here.
You get a unique index for each 200-character string, so no collisions there. You need to set your 200-character names to be indexed; that way it will use extra memory, but it will keep them sorted for you, making the search very fast. And you get a unique id which can easily be used for filenames.
I haven't worked much with hashing algorithms, but as per my understanding there is always a chance of collision in a hashing algorithm, i.e. two different objects may hash to the same hash value, but it is guaranteed that an object will hash to the same value every time. There are other techniques that may be used to handle this, like linear probing.

Is there any difference between md5 and sha1 in this situation?

It is known that
1. if ( md5(a) == md5(b) )
2. then ( md5(a.z) == md5(b.z) )
3. but ( md5(z.a) != md5(z.b) )
where the dots concatenate the strings.
EDIT ---
Here you can find a and b:
http://www.mscs.dal.ca/~selinger/md5collision/
Check these links:
hexpaste.com/qzNCBRYb/1 - this is a.md5(a)."kutykurutty"
hexpaste.com/mSXMl13A/1 - this is b.md5(b)."kutykurutty"
They share the same md5 hash, yet they are different. But you can call these strings a' and b', because they have the same md5.
--- EDIT
What happens in the second row if we change all the md5 to sha1? So:
1. if ( sha1(c) == sha1(d) )
2. then ( sha1(c.z) ?= sha1(d.z) )
I couldn't find two different strings with the same sha1; that's why I'm asking this. Are there any other interesting "rules" about sha1?
SHA1 will behave exactly like MD5 in this scenario.
The only two references I have found are the following -
http://www.iaik.tugraz.at/content/research/krypto/sha1/MeaningfulCollisions.php
http://www.schneier.com/blog/archives/2005/02/sha1_broken.html#c1654 (See comment by David Schwartz)
From the IAIK website -
Note that for colliding SHA-1 message pairs (as for all other hash functions following a similar design principle) it is always possible to append suffixes to both messages as long as they are the same.
I don't think anybody has found two colliding strings for SHA1, so this is mostly an academic discussion. But from what I understand, when a collision is discovered, it should be possible to create several other collisions by using this property.
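Given a colliding pair (for example the messages from the md5collision page linked in the question, saved locally; a.bin and b.bin are placeholder filenames), the append-suffix property is easy to check:
import hashlib
a = open("a.bin", "rb").read()   # one message of a known colliding pair
b = open("b.bin", "rb").read()   # the other message (same length, different bytes)
suffix = b"kutykurutty"
assert a != b
assert hashlib.md5(a).digest() == hashlib.md5(b).digest()                    # the collision itself
assert hashlib.md5(a + suffix).digest() == hashlib.md5(b + suffix).digest()  # still collides with a common suffix
print(hashlib.md5(suffix + a).hexdigest(), hashlib.md5(suffix + b).hexdigest())  # prepending almost certainly differs
This works because the published colliding pairs have equal length, so MD5's internal state after processing either message is identical; the same check applies to SHA-1 colliding pairs, which is exactly the IAIK statement quoted above.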
The first statement will only hold true for very specific a and b, specifically computed to produce a collision. It is true that you can generate an MD5 collision, but this is not trivial: some computational effort is required, and you certainly can't expect arbitrary strings to collide.
Currently SHA-1 is believed to be cryptographically secure, which means no one has come up with a way to generate SHA-1 collisions. That doesn't mean it is really secure and that collision generation is impossible; maybe there is a yet-undiscovered vulnerability. Even if there is a vulnerability, it's highly unlikely that the same strings will simultaneously form both an MD5 and a SHA-1 collision.
Sha1 isn't as easily cracked as md5, but they did find some vulnerabilities in it back in '05 I believe.
Your example is wrong in my opinion.
Let me show you why:
md5(a) == md5(b)
When both hashes are the same, the corresponding strings have to be the same (there could be collisions, but that's not important for my argument), so we'll have:
a = b
When you now concatenate both strings with a string z, you will have
a.z = b.z
and their md5-hashes will be the same, because they have the same string-input
md5(a.z) == md5(b.z)
and the md5-hash will be equal a third time, because both string inputs are again the same:
md5(z.a) == md5(z.b)
And this is true for md5 and every other hashing algorithm, because they have to be deterministic and free of side effects.
So your example only makes sense when a and b are special strings that result in a collision. And therefore the behaviour of md5 and sha1 will be exactly the same:
The colliding pair with the same string appended will still result in a collision, but with the string prepended the hashes will be different (there is a really, really low probability of finding a pair that still collides both when a string is prepended and when it is appended, but no such example has been found in reality yet).
You simply didn't find two different strings with the same sha1 because such collisions are much harder to find, as explained by the people before me.

C: Sorting Methods Analysis

I have a lot of different sorting algorithms, which all have the following signature:
void <METHOD>_sort_ints(int * array, const unsigned int ARRAY_LENGTH);
Are there any testing suites for sorting which I could use for the purpose of making empirical comparisons?
This detailed discussion, which also links to a large number of related web pages you are likely to find useful, describes a useful set of input data for testing sorting algorithms (see the linked page for the reasons); a small generator sketch follows the list. Summarising:
Completely randomly reshuffled array
Already sorted array
Already sorted in reverse order array
Chainsaw array
Array of identical elements
Already sorted array with N permutations (with N from 0.1 to 10% of the size)
Already sorted array in reverse order array with N permutations
Data that have normal distribution with duplicate (or close) keys (for stable sorting only)
Pseudorandom data (daily values of S&P500 or other index for a decade might be a good test set here; they are available from Yahoo.com)
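A small sketch of generators for the categories above (Python is used here just to describe the data; the resulting arrays would then be handed to the C sort functions, and the sawtooth/permutation parameters are arbitrary choices):
import random
def random_array(n):        return [random.randrange(n) for _ in range(n)]
def sorted_array(n):        return list(range(n))
def reverse_sorted(n):      return list(range(n, 0, -1))
def chainsaw(n, teeth=10):
    tooth = max(1, n // teeth)                  # repeated ascending runs
    return [i % tooth for i in range(n)]
def identical(n, value=7):  return [value] * n
def nearly_sorted(n, swaps):
    a = list(range(n))                          # swaps ~ 0.1% to 10% of n
    for _ in range(swaps):
        i, j = random.randrange(n), random.randrange(n)
        a[i], a[j] = a[j], a[i]
    return a
def normal_with_duplicates(n, spread=100):
    return [round(random.gauss(0, spread)) for _ in range(n)]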
The definitive study of sorting is Bob Sedgewick's doctoral dissertation. But there's a lot of good information in his algorithms textbooks, and those are the first two places I would look for test suite and methodology. If you've had a recent course you'll know more than I do; last time I had a course, the best method was to use quicksort down to partitions of size 12, then run insertion sort on the entire array. But the answers change as quickly as the hardware.
Jon Bentley's Programming Pearls books have some other info on sorting.
You can quickly whip up a test suite containing
Random integers
Sorted integers
Reverse sorted integers
Sorted integers, mildly perturbed
If memory serves, these are the most important cases for a sort algorithm.
If you're looking to sort arrays that won't fit in cache, you'll need to measure cache effects. valgrind is effective if slow.
sortperf.py has a well-selected suite of benchmark test cases and was used to support the essay found here and make timsort THE sort in Python lo that many years ago. Note that, at long last, Java may be moving to timsort too, thanks to Josh Bloch (see here), so I imagine they have written their own version of the benchmark test cases -- however, I can't easily find a reference to it. (timsort, a stable, adaptive, iterative natural mergesort variant, is especially suited to languages with reference-to-object semantics like Python and Java, where "data movement" is relatively cheap [[since all that's ever being moved is references aka pointers, not blobs of unbounded size;-)]], but comparisons can be relatively costly [[since there is no upper bound to the complexity of a comparison function -- but then this holds for any language where sorting may be customized via a custom comparison or key-extraction function]]).
This site shows the various sorting algorithms using four groups:
http://www.sorting-algorithms.com/
In addition to the four groups in Norman's answer, you will want to check the sorting algorithms with collections of numbers that share certain similarities among their elements:
All integers are unique
The same integer in the whole collection
Few Unique Keys
Changing the number of elements in the collection is also good practice: check each algorithm with 1K, 1M, 1G, etc. elements to see what the memory implications of that algorithm are.
