How to brute force MD5 for a partially known string?

I was looking at Hashcat (any other tool is fine) to determine a string that was MD5 hashed.
I know the main part of the string; however, there is a variable part in the middle I don't know. How do I go about brute-forcing it?
e.g. the original string is "fresh " + {type of fruit} + " is tasty!" and I want to know the type of fruit.

You'll have to list every type of fruit you know, build the string with each one, then calculate the hash and compare.
MD5 cannot be reversed, and it is not encryption, so that is the only way you'll figure it out. And only if you spell the fruit exactly the same and use the same string encoding (e.g., if the original string was UTF-8 and you're hashing UTF-16 bytes, it's not going to match).
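A minimal sketch of that loop in Python (the wordlist is illustrative, and the target is the digest quoted in the other answer for "fresh apples is tasty!"):

import hashlib

target = "53aff0275a241760e654b3fe0e2184c4"         # the hash you were given
fruits = ["apples", "oranges", "bananas", "pears"]  # your candidate wordlist

for fruit in fruits:
    candidate = "fresh " + fruit + " is tasty!"
    digest = hashlib.md5(candidate.encode("utf-8")).hexdigest()  # match the original encoding
    if digest == target:
        print("Match:", fruit)
        break
else:
    print("No candidate matched; extend the wordlist.")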

Do you know the range of values for the variable part, or can you estimate it? If so, that reduces the search space of the brute force. That's probably your best approach.
It's worth recognising that with a well-designed hash algorithm you shouldn't, ideally, be able to draw any conclusions about the output of string + second_string > hash from the output of string > hash. To use the examples you've provided, the MD5 output of "fresh apples is tasty!" is 53aff0275a241760e654b3fe0e2184c4, while the output for "fresh oranges is tasty!" is e7c124fef3816ba5a6d6917e20bd5a8b. Very different results. So if what you're looking for is a way to split the hash output and focus on just a substring, or something like that, you're not going to find it. Your best bet is reducing the brute force's search space.

Related

How to store vocabulary in an array more effectively?

I've got a vocabulary: a, abandon, ..., z.
For some reason, I will use an array rather than a trie to store it.
Thus a simple method can be: wordA\0wordB\0wordC\0...word\0
But I think there are methods that are more economical with memory.
Since like is a substring of likely, we can store just the starting position and length of like instead of the string itself. Thus we generate a "large string" which contains every word in the vocabulary and use position[i] and length[i] to get the i-th word.
For example, the vocabulary contains three words: ab, cd, and bc.
I construct abcd as the "large string".
position[0] = 0, length[0] = 2
position[1] = 2, length[1] = 2
position[2] = 1, length[2] = 2
So how to generate the "large string" is the key to this problem. Are there any cool suggestions?
I think the problem is similar to the Traveling Salesman Problem (TSP), which is NP-hard.
The search keyword you're looking for is "dictionary", i.e., data structures that can be used to store a list of words and test other strings for presence or absence in the dictionary.
Your idea is more compact than storing every word separately, but far less compact than a good data structure like a DAWG. As you note, it isn't obvious how to optimally choose how to overlap your strings. What you're doing is a bit like what a lossless compression scheme (like gzip) would do. If you don't need to check words against your compact dictionary, maybe just use gzip or LZMA to compress a sorted word list. Let their algorithms find the redundancy and represent it compactly.
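If you do want to build the overlapped string yourself, a greedy merge is the usual approximation (the exact shortest-common-superstring problem is NP-hard). A minimal sketch, simplified to only ever append at the tail:

def overlap(a, b):
    # Length of the longest suffix of a that is a prefix of b.
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def build_large_string(words):
    words = list(dict.fromkeys(words))      # drop duplicates, keep order
    # Drop words that are already substrings of another word.
    words = [w for w in words if not any(w != v and w in v for v in words)]
    big, rest = words[0], words[1:]
    while rest:
        # Greedily append the remaining word with the largest tail overlap.
        best = max(rest, key=lambda w: overlap(big, w))
        big += best[overlap(big, best):]
        rest.remove(best)
    return big

words = ["ab", "cd", "bc"]
big = build_large_string(words)             # "abcd"
position = [big.index(w) for w in words]    # [0, 2, 1]
length = [len(w) for w in words]            # [2, 2, 2]

This reproduces the abcd example from the question.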
I looked into dictionaries for a recent SO answer that caught my interest: Memory-constrained external sorting of strings, with duplicates combined&counted, on a critical server (billions of filenames)
For a dictionary that doesn't have to have new words added on the fly, a Directed Acyclic Word Graph (DAWG) is the way to go. You match a string against it by following graph edges until you either hit a point where there's no edge matching the next character, or you reach the end of your input string and find that the node in the DAWG is marked as a valid end-of-word (rather than merely a prefix of some words). There are algorithms for building these state machines from a simple array-of-words dictionary in reasonable time.
Your method can only take advantage of redundancy when a whole word is a substring of another word, or when the end of one word is the start of another. A DAWG can take advantage of common substrings everywhere, and is also quite fast to match words against, probably comparable in speed to binary-searching your data structure, especially if the giant string is too big to fit in the cache. (Once you start exceeding cache size, compactness of the data structure starts to outweigh code complexity for speed.)
Less complex but still efficient is a Trie (or Radix Trie), where common prefixes are merged, but common substrings later in words don't converge again.
If you don't need to modify your DAWG or Trie at all, you can store it efficiently in a single block of memory, rather than dynamically allocating each node. You didn't say why you didn't want to use a Trie, and also didn't acknowledge the existence of the other data structures that do this job better than a plain Trie.
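For reference, a minimal trie sketch for membership testing (plain dict nodes, i.e. the simple dynamically-allocated form, not the compact single-block layout mentioned above):

END = "$"   # marker key: this node ends a valid word, not just a prefix

def trie_insert(root, word):
    node = root
    for ch in word:
        node = node.setdefault(ch, {})   # follow or create the edge for ch
    node[END] = True

def trie_contains(root, word):
    node = root
    for ch in word:
        if ch not in node:
            return False                 # no edge matching the next character
        node = node[ch]
    return END in node                   # must be a marked end-of-word

root = {}
for w in ["like", "likely", "abandon"]:
    trie_insert(root, w)
print(trie_contains(root, "like"), trie_contains(root, "lik"))   # True False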

Fast indexing for slow CPU?

I have a large document that I want to build an index of for word searching. (I hear this type of array is really called a concordance.) Currently it takes about 10 minutes. Is there a faster way to do it? Currently I iterate through each paragraph, and if I find a word I have not encountered before, I add it to my word array, along with the paragraph number in a subsidiary array; any time I encounter that same word again, I add the paragraph number to the index:
associativeArray={chocolate:[10,30,35,200,50001],parsnips:[5,500,100403]}
This takes forever; well, 5 minutes or so. I tried converting this array to a string, but it is so large it won't work to include in a program file, even after removing stop words, and it would take a while to convert back to an array anyway.
Is there a faster way to build a text index other than linear brute force? I'm not looking for a product that will do the index for me, just the fastest known algorithm. The index should be accurate, not fuzzy, and there will be no need for partial searches.
I think the best idea is to build a trie, adding words as you scan your text, and keeping for each word node a list of the locations where that word can be found.
This would not only save you some space, since words with similar prefixes require far less storage, but the search will be faster too. Search time is O(M) where M is the maximum string length, and insert time is O(n) where n is the length of the key you are inserting.
Since the obvious alternative is a hash table, here you can find some more comparisons between the two.
I would use a HashMap<String, List<Occurrence>>. This way you can check whether a word is already in your index in about O(1).
At the end, when you have all the words collected and want to search them very often, you might try to find a hash function that has no or nearly no collisions. That way you can guarantee O(1) time for the search (or nearly O(1) if you still have some collisions).
Well, apart from going along with MrSmith42's suggestion of using the built-in HashMap, I also wonder how much time you are spending tracking the paragraph number?
Would it be faster to change things to track line numbers instead? (Especially if you are reading the input line-by-line).
There are a few things unclear in your question, like what you mean by "I tried converting this array to a string, but it is so large it won't work to include in a program file, even after removing stop words, and would take a while to convert back to an array anyway." What array? Is your input an array of paragraphs, or do you mean the concordance entries per word?
It is also unclear why your program is so slow; probably there is something inefficient in it. I suspect it is the check "if I find a word I have not encountered before": I presume you look up the word in the dictionary and then iterate through the array of occurrences to see if the paragraph number is there? That is a slow linear search; you would be better served by a set (think of a hash/dictionary where you care only about the keys), something like
concord = {
'chocolate': {10:1, 30:1, 35:1, 200:1, 50001:1},
'parsnips': {5:1, 500:1, 100403:1}
}
and your check then becomes if paraNum in concord[word]: ... instead of a loop or binary search.
PS. Actually, assuming you are keeping the list of occurrences in an array AND scanning the text from the first to the last paragraph, the arrays will be built in sorted order, so you only need to check the very last element there: if word in concord and paraNum == concord[word][-1]:. (Examples are in pseudocode/Python, but you can translate them to your language.)
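Putting that together, a minimal concordance builder (illustrative Python; the tokenizer is deliberately crude). Because paragraphs are scanned in order, each occurrence list stays sorted, and the duplicate check is a single comparison with the last element:

import re

def build_concordance(paragraphs):
    concord = {}
    for para_num, para in enumerate(paragraphs, start=1):
        for word in re.findall(r"[a-z']+", para.lower()):
            occurrences = concord.setdefault(word, [])
            if not occurrences or occurrences[-1] != para_num:
                occurrences.append(para_num)    # O(1) duplicate check
    return concord

paras = ["I like chocolate.", "Parsnips and chocolate.", "More chocolate!"]
print(build_concordance(paras))
# {'i': [1], 'like': [1], 'chocolate': [1, 2, 3], 'parsnips': [2], 'and': [2], 'more': [3]}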

Hash function for hash table with strings and integers as keys

I am in search of a good hash function which I can use in a hash table implementation. The thing is that I want to give both strings and integers as parameters (keys) to my hash function.
I have a txt file with ~500 records, and each of them consists of integers and strings (max 15 chars). So, the thing that I want to do is pick one of these ints/strings and use it as a key for my hash function in order to put my data in the "right" bucket.
Is there any good function to do this?
Thank you :)
Use the integer value if it's present and reasonably well distributed; hash the string if it's not. An integer hash code is much cheaper to compute than a string's.
The algorithm has to be repeatable, obviously.
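A minimal sketch of that idea (illustrative; the string fallback is the common multiply-by-31 polynomial hash, the same scheme Java's String.hashCode uses):

def bucket_for(key, num_buckets):
    if isinstance(key, int):
        return key % num_buckets             # integer keys are cheap
    h = 0
    for ch in key:                           # simple polynomial string hash
        h = (h * 31 + ord(ch)) & 0xFFFFFFFF  # keep it to 32 bits
    return h % num_buckets

print(bucket_for(42, 16), bucket_for("abc", 16))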
Your question is somewhat vague. It's unclear if your data set has 500 columns and you are trying to figure out which column to use for hashing, or if it has 500 items which you want to hash.
If you are looking for a decent general purpose hash that will produce well-distributed hash values, you may want to check out the Jenkins hash functions which have variants for strings and integers. But, to be frank, if your dataset has 500 fixed items you may want to look at a perfect hash function generator, like GNU gperf or even alternative data structures depending on your data.
Since you want to hash using two keys, I presume the distribution improves when the two are combined.
For string hashing, I have had good results with the PJW algorithm. Just google "PJW Hash String". One variation is here.
To augment the hash with an integer, see here.
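For reference, a sketch of the classic 32-bit PJW string hash (translated to Python with explicit masking), plus one simple, illustrative way to fold in an integer key:

def pjw_hash(s):
    h = 0
    for ch in s:
        h = ((h << 4) + ord(ch)) & 0xFFFFFFFF
        g = h & 0xF0000000            # grab the top four bits
        if g:
            h ^= g >> 24              # fold them back into the low bits
            h &= ~g & 0xFFFFFFFF      # and clear them
    return h

def combined_hash(s, n):
    # Illustrative: XOR the string hash with a Knuth-style
    # multiplicative hash of the integer key.
    return pjw_hash(s) ^ ((n * 2654435761) & 0xFFFFFFFF)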

Linking filenames or labels to numeric index

In a C99+SDL game, I have an array that contains sound effects (SDL_mixer chunk data and some extra flags and filename string) and is referenced by index such as "sounds[2].data".
I'd like to be able to call sounds by filename, but I don't want to strcmp through the whole array until a match is found. That way, as I add more sounds, change the order, or allow for player-defined sound mods, they can still be called with a common identifier (such as "SHOT01" or "EXPL04").
What would be the fastest approach for this? I heard about hashing, which would result in something similar to Lua's string indexes (such as table["field"]), but I don't know anything about the topic, and it seems fairly complicated.
Just in case it matters, I plan to have the filenames or labels be 6 to 8 characters, all caps (such as "SHOT01.wav").
So to summarize, where can I learn about hashing short strings like that, or what would be the fastest way to keep track of something like sound effects so they can be called using arbitrary labels or identifiers?
I think in your case you can probably just keep all the sounds in a sorted data structure and use a fast search algorithm to find matches. Something like a binary search is very simple to implement and gives good performance.
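A minimal sketch of that approach, in Python for brevity (the project is C99, where qsort plus bsearch do the same job; the field names are illustrative):

import bisect

sounds = sorted([("EXPL04", "expl04 chunk"), ("SHOT01", "shot01 chunk")])
names = [name for name, _ in sounds]     # parallel key array for bisect

def find_sound(label):
    i = bisect.bisect_left(names, label) # binary search, O(log n)
    if i < len(names) and names[i] == label:
        return sounds[i][1]
    return None

print(find_sound("SHOT01"))              # "shot01 chunk"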
However, if you are interested in hash tables and hashing, the basics of it all are pretty simple. There is no place like Wikipedia to get the basics down and you can then tailor your searches better on Google to find more in depth articles.
The basics are that you start out with a fixed-size array and store everything in there. To figure out where to store something, you take the key (in your case the sound name) and perform some operation on it such that it gives you an exact location where the value can be found. The simplest case for string hashing is just adding up all the letters in the string as integer values, then taking that sum modulo the array size to give you an index into your array.
position = SUM(string letters) % [array size]
Of course, multiple strings will naturally have the same sum and thus give you the same position. This is called a collision, and collisions can be handled in many ways. The simplest way is to have an array of lists rather than an array of values, and simply append to the list every time there is a collision. When searching for a value, simply iterate over the list and find the value you need.
Ideally, a good hash function will have few collisions and be quick to compute, thus providing a huge performance boost.
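A minimal sketch of exactly that scheme (sum the character codes, take the modulus, chain collisions in per-slot lists; the table size is arbitrary):

TABLE_SIZE = 64
table = [[] for _ in range(TABLE_SIZE)]

def simple_hash(key):
    return sum(ord(ch) for ch in key) % TABLE_SIZE

def put(key, value):
    bucket = table[simple_hash(key)]
    for i, (k, _) in enumerate(bucket):
        if k == key:
            bucket[i] = (key, value)     # overwrite an existing entry
            return
    bucket.append((key, value))          # chain on collision

def get(key):
    for k, v in table[simple_hash(key)]:
        if k == key:
            return v
    return None

put("SHOT01", "shot sample")
print(get("SHOT01"))                     # "shot sample"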
I hope this helps :)
You are right: when it comes to mapping objects with a set of string keys, hash tables are often the way to go.
I think this article on wikipedia is a good starting point to understand hash table mechanism: http://en.wikipedia.org/wiki/Hash_table

Hash Function Determination

How can we find the most efficient hash function (with the least possible chance of collision) for a given set of strings?
Suppose we are given some strings, and the lengths of the strings are not fixed:
Ajay
Vijay
Rakhi
....
We know the count of the number of strings available, so we can design a hash table of that size. What would be the perfect hash function to design for such a problem?
Multiplying each character's ASCII value by 31 (a prime number) in incremental fashion leads to a hash value greater than MAX_INT, and then the modulus does not work properly... So please suggest an efficient way to build a hash function.
I have a small set of strings, let's say count = 10. I need to implement a hash function such that all those 10 strings fit uniquely in the hash table. Is any perfect O(1) hash function available for this kind of problem? The hash table size will be 10 in this case.
Only C programming...
Please explain the logic at this site: http://burtleburtle.net/bob/c/perfect.c
It looks very complicated but perfect to me! What is the algorithm used here? Reading the code straight away is very difficult!
Thanks...
Check some of these out; they apparently have good distributions:
http://www.partow.net/programming/hashfunctions/#HashingMethodologies
You might want to look into perfect hashing.
You might want to have a look at gperf. You could even do this on the fly if you didn't do it too often and your data set is small. If the strings are known ahead of time, then this is the method to use.
Hash tables are meant to be able to handle dynamic input. If you can guarantee only a particular set of inputs, and you want to guarantee a particular slot for each input, why hash at all?
Just make an array with an index for each known input.
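A minimal sketch of that idea (the names are the ones from the question; in C you would typically make the slot numbers compile-time enum constants instead of building a map at runtime):

KNOWN = ["Ajay", "Vijay", "Rakhi"]       # the full, fixed input set
slot = {name: i for i, name in enumerate(KNOWN)}

table = [None] * len(KNOWN)              # one slot per input, no collisions

def put(name, value):
    table[slot[name]] = value

def get(name):
    return table[slot[name]]

put("Vijay", 42)
print(get("Vijay"))                      # 42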
