probability of collision in MD5 - md5

Worst case, I have 180 million values in a cache (15-minute window before they go stale), and MD5 has 2^128 possible values. What is my probability of a collision? Or better yet, is there a web page somewhere that answers that question, or at least gives a rough estimate? That would rock, so I know my chances.

The probability is 1-m!/(mⁿ(m-n)!) where m = 2¹²⁸ and n = 180000000.
Running it through Wolfram online exceeds the available computation time!
If you have Smalltalk installed locally, you can run this:
| m n p |
m := 2 raisedTo: 128.
n := 180000000.
p := (1 - (m factorial / ((m raisedTo: n) * (m - n) factorial))) asFloat.
Transcript show: p printString; cr.
A search for the Birthday Problem brings up a Wikipedia page with a table showing that for 128 bits and 2.6×10¹⁰ hashes, the probability of a collision is 1 in 10¹⁸. That is roughly 140× the number of hashes you're considering, so you know your collision probability is even lower than that.
A good approximation when n ≪ m is 1 − e^(−n²/(2m)); plugging in the m and n above gives 4.76×10⁻²³, i.e. about 1 in 2.10×10²², as the probability of a collision.
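For reference, here is that approximation evaluated in a quick Python sketch (the exact factorial form above is not practical to compute directly):

import math

m = 2**128        # possible MD5 values
n = 180_000_000   # hashes held in the 15-minute cache window

# Birthday approximation for n << m: P(collision) ≈ 1 - exp(-n^2 / (2m)).
# The exponent is so small that 1 - exp(-x) ≈ x, so report x directly
# (math.exp would just round to 1.0 in double precision).
x = n * n / (2 * m)
print(x)        # ≈ 4.76e-23
print(1 / x)    # ≈ 2.1e22, i.e. about "1 in 2.1×10²²"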
Even though the probability of a collision is very low, it is prudent in the FOOBAR case, say if there is an issue and the hashes accumulate for more than 15 minutes, to at least confirm what would happen in the event of a collision. This will also help if someone somehow injects duplicate hashes to try to compromise the system.

Related

What is the time complexity of this algorithmic problem?

*A search method has time complexity O(n²), where n is the number of states in the space to be searched. If it takes 1 second to search a space of a thousand states, roughly how long will it take to search a space of a million states?*
I have found that it's approximately 12 days, but I think the way I found it is quite wrong.
I did (1,000,000 / 1,000)² = 10⁶ seconds, divided by 86,400 (seconds in a day), and got about 11.6, so approximately 12 days. Is there a better and more correct solution?
There is not nearly enough information to answer this question. See Big-O description.
O(N^2) means only that the algorithm's execution time will be dominated by an N^2 term. As N grows large, the ratio between two execution times will asymptotically approach the square of the ratio between the corresponding values of N. It says nothing about the execution time for particular values.
Let's keep this simple: assume an O(N) set-up overhead such as array initialization, plus a constant for system start-up. This makes the execution time
t = a * N^2 + b * N + c
for some values of a, b, and c. Even if we know that this is the form of the equation, we do not have enough information to solve for those coefficients given only one (t, N) data point, so we cannot derive t for N = 10^6.
I suspect that whoever posed this problem is looking for the naive solution, making the unwarranted assumption that at N = 1000 all the smaller terms are already insignificant. In that case, simply scale up by the square of the size ratio:
N1 / N2 = 10^6 / 10^3 = 10^3
Scale up by N^2, or (10^3)^2 = 10^6
That gives you 10^6 seconds, or roughly 11.6 days; I'll leave the rest of the arithmetic to you.
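A quick sanity check of that naive scaling, e.g. in Python:

# Assume (as the exercise implicitly does) that the N^2 term dominates at N = 1000.
t_small = 1.0                      # seconds to search 1,000 states
ratio = 1_000_000 / 1_000          # size ratio between the two problems
t_big = t_small * ratio ** 2       # scale time by the square of the ratio
print(t_big)                       # 1,000,000 seconds
print(t_big / 86_400)              # ≈ 11.6 days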

How does the HyperLogLog algorithm work?

I've been learning about different algorithms in my spare time recently, and one that I came across which appears to be very interesting is called the HyperLogLog algorithm - which estimates how many unique items are in a list.
This was particularly interesting to me because it brought me back to my MySQL days, when I saw that "Cardinality" value (which until recently I always assumed was calculated, not estimated).
So I know how to write an algorithm in O(n) that will calculate how many unique items are in an array. I wrote this in JavaScript:
function countUniqueAlgo1(arr) {
    var Table = {};                 // values we have already seen
    var numUnique = 0;
    var numDataPoints = arr.length;
    for (var j = 0; j < numDataPoints; j++) {
        var val = arr[j];
        if (Table[val] != null) {
            continue;               // already counted this value
        }
        Table[val] = 1;
        numUnique++;
    }
    return numUnique;
}
But the problem is that my algorithm, while O(n), uses a lot of memory (storing values in Table).
I've been reading this paper about how to count the distinct items in a list in O(n) time using minimal memory.
It explains that by hashing and counting bits or something one can estimate within a certain probability (assuming the list is evenly distributed) the number of unique items in a list.
I've read the paper, but I can't seem to understand it. Can someone give a more layperson's explanation? I know what hashes are, but I don't understand how they are used in this HyperLogLog algorithm.
The main trick behind this algorithm is that if you, observing a stream of random integers, see an integer whose binary representation starts with some known prefix, there is a higher chance that the cardinality of the stream is 2^(size of the prefix).
That is, in a random stream of integers, ~50% of the numbers (in binary) start with "1", 25% start with "01", and 12.5% start with "001". This means that if you observe a random stream and see a number starting with "001", there is a higher chance that this stream has a cardinality of 8.
(The prefix "00..1" has no special meaning. It's there just because it's easy to find the most significant bit in a binary number in most processors)
Of course, if you observe just one integer, the chance this value is wrong is high. That's why the algorithm divides the stream into "m" independent substreams and keeps the maximum length of the "00...1" prefix seen in each substream. It then estimates the final value by taking the mean of the per-substream estimates.
That's the main idea of this algorithm. There are some missing details (the correction for low estimate values, for example), but it's all well written in the paper. Sorry for the terrible English.
A HyperLogLog is a probabilistic data structure. It counts the number of distinct elements in a list. But in comparison to a straightforward way of doing it (having a set and adding elements to the set) it does this in an approximate way.
Before looking at how the HyperLogLog algorithm does this, one has to understand why you need it. The problem with the straightforward way is that it consumes O(distinct elements) of space. Why is there big-O notation here instead of just the number of distinct elements? Because elements can be of different sizes: one element can be the number 1, another the string "is this big string". So if you have a huge list (or a huge stream of elements), it will take a lot of memory.
Probabilistic Counting
How can one get a reasonable estimate of the number of unique elements? Assume you have a random bit string of length m in which 0 and 1 occur with equal probability. What is the probability that it starts with one zero, with 2 zeros, with k zeros? It is 1/2, 1/4 and 1/2^k. This means that if you have encountered a string starting with k zeros, you have approximately looked through 2^k elements. So this is a good starting point: given a list of elements that are evenly distributed between 0 and 2^k − 1, you can track the maximum length of the zero prefix in their binary representations, and this gives a reasonable estimate.
The problem is that the assumption of having evenly distributed numbers from 0 to 2^k − 1 is too hard to achieve: the data we encounter is mostly not numeric, is almost never evenly distributed, and can lie between any values. But using a good hash function you can assume that the output bits are evenly distributed, and most hash functions produce outputs between 0 and 2^k − 1 (SHA-1 gives you values between 0 and 2^160 − 1). So what we have achieved so far is that we can estimate the number of unique elements (up to a maximum cardinality of about 2^k) by storing just one number of log(k) bits. The downside is that the estimate has huge variance. A cool thing is that we have almost re-created the 1984 probabilistic counting paper (its estimator is a little bit smarter, but we are close).
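As a rough illustration only (not the real algorithm), here is that single-counter idea sketched in Python, with MD5 standing in for a good hash function:

import hashlib

def rough_estimate(items, bits=32):
    # Crude single-register estimate: if the "luckiest" hashed value starts
    # with k zero bits, guess that roughly 2^k distinct items were seen.
    max_zeros = 0
    for item in items:
        # Hash each item so values behave like uniform random `bits`-bit integers.
        h = int.from_bytes(hashlib.md5(str(item).encode()).digest()[:4], "big")
        zeros = bits - h.bit_length()          # leading zero bits of h
        max_zeros = max(max_zeros, zeros)
    return 2 ** max_zeros

print(rough_estimate(range(100_000)))          # right ballpark, but high variance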
LogLog
Before moving further, we have to understand why our first estimate is not that great. The reason is that one unlucky element with a long zero prefix can spoil everything. One way to improve it is to use many hash functions, keep the max for each of them, and average the results at the end. This is an excellent idea that will improve the estimate, but the LogLog paper used a slightly different approach (probably because hashing is kind of expensive).
They used one hash but divided it into two parts. One part selects a bucket (the total number of buckets is 2^x), and the other is treated just like our hash before. It was hard for me to see what was going on, so I will give an example. Assume you have two elements, and your hash function, which gives values from 0 to 2^10 − 1, produces the 2 values 344 and 387. You decided to have 16 buckets. So you have:
344 = 0101 011000 → bucket 5 (first 4 bits) stores 1 (leading zeros of the remaining bits)
387 = 0110 000011 → bucket 6 stores 4
By having more buckets you decrease the variance (you use slightly more space, but it is still tiny). Using math skills they were able to quantify the error (which is 1.3/sqrt(number of buckets)).
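A hedged Python sketch of that bucketing step (the real LogLog/HyperLogLog estimators add bias-correction constants and smarter averaging, which I'm omitting here):

import hashlib

def loglog_estimate(items, bucket_bits=4, hash_bits=32):
    m = 1 << bucket_bits                      # number of buckets (16 here)
    rest_bits = hash_bits - bucket_bits
    registers = [0] * m
    for item in items:
        h = int.from_bytes(hashlib.md5(str(item).encode()).digest()[:4], "big")
        bucket = h >> rest_bits               # top bits pick the bucket
        rest = h & ((1 << rest_bits) - 1)     # remaining bits
        zeros = rest_bits - rest.bit_length() # leading zeros, as in the example above
        registers[bucket] = max(registers[bucket], zeros)
    # Average the registers and scale by the number of buckets
    # (a bias-correction factor is needed for accurate results).
    return m * 2 ** (sum(registers) / m)

print(round(loglog_estimate(range(100_000))))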
HyperLogLog
HyperLogLog does not introduce dramatically new ideas, but mostly uses more math to improve the previous estimate. One earlier refinement (SuperLogLog, from the LogLog paper) found that discarding the 30% largest bucket values significantly improves the estimate; HyperLogLog instead averages the buckets with a harmonic mean, which is less sensitive to outliers. The paper is math-heavy.
And I want to finish with a more recent paper, which shows an improved version of the HyperLogLog algorithm (I haven't had time to fully understand it yet, but maybe later I will improve this answer).
The intuition is that if your input is a large set of random numbers (e.g. hashed values), they should spread evenly over the range. Say the range goes up to 1024 (10 bits), and the observed minimum value is 10. Then the cardinality is estimated to be about 100, since roughly 100 evenly spaced values in that range would leave a minimum near 10 (10 × 100 ≈ 1024).
Read the paper for the real logic of course.
Another good explanation with sample code can be found here:
Damn Cool Algorithms: Cardinality Estimation - Nick's Blog

Why is setting a HashTable's length to a prime number a good practice?

I was going through Eric Lippert's blog post Guidelines and rules for GetHashCode when I hit this paragraph:
We could be even more clever here; just as a List resizes itself when it gets full, the bucket set could resize itself as well, to ensure that the average bucket length stays low. Also, for technical reasons it is often a good idea to make the bucket set length a prime number, rather than 100. There are plenty of improvements we could make to this hash table. But this quick sketch of a naive implementation of a hash table will do for now. I want to keep it simple.
So it looks like I'm missing something. Why is it good practice to make the length a prime number?
You can find people who suggest both opposite ends of the spectrum. On one side, choosing a prime number for the size of the hash table reduces the chance of collisions even if the hash function is not very effective at distributing the results. Note that if (in the simplest example to argue about) a power-of-2 size is chosen, only the lower bits affect the bucket, while for a prime number most bits in the hash result will be used.
On the other hand, you can gain more by choosing a better hash function, or even rehashing the result of the hash function by applying some bit operations, and using a power-of-2 hash size to speed up calculations.
As a real-life example, Java's hash tables were initially implemented using prime (or almost-prime) sizes, but from Java 1.4 on the design was changed to use a power-of-two number of buckets, with a second fast hash function applied to the result of the initial hash. An interesting article commenting on that change can be found here.
So basically:
a prime number helps disperse the inputs across the different buckets even with not-so-good hash functions.
a similar effect can be achieved by post-processing the result of the hash function and using a power-of-2 size, which turns the modulo operation into a fast bit mask and compensates for the cost of the post-processing.
Because this produces a better hash function and reduces the number of possible collisions. This is explained in Choosing a good hashing function:
A basic requirement is that the function should provide a uniform distribution of hash values. A non-uniform distribution increases the number of collisions, and the cost of resolving them.
The distribution needs to be uniform only for table sizes s that occur in the application. In particular, if one uses dynamic resizing with exact doubling and halving of s, the hash function needs to be uniform only when s is a power of two. On the other hand, some hashing algorithms provide uniform hashes only when s is a prime number.
Say your bucket set length is a power of 2 - that makes the mod calculation quite fast. It also means that the bucket selection is determined solely by the bottom n bits of the hash code, where the bucket set length is 2^n, so the top 32 − n bits are thrown away. It's as if you're discarding useful bits of the hash code immediately.
Or as this blog post from 2006 puts it:
Suppose your hashCode function results in the following hashCodes among others {x, 2x, 3x, 4x, 5x, 6x, ...}; then all of these are going to be clustered into just m buckets, where m = table_length / GreatestCommonFactor(table_length, x). (It is trivial to verify/derive this.) Now you can do one of the following to avoid clustering:
...
Or simply make m equal to table_length by making GreatestCommonFactor(table_length, x) equal to 1, i.e. by making table_length coprime with x. And if x can be just about any number, then make sure that table_length is a prime number.
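You can see that GreatestCommonFactor effect with a quick check, e.g. in Python, using a hypothetical hash function whose outputs are all multiples of some x:

from math import gcd

def buckets_used(table_length, x, samples=10_000):
    # How many distinct buckets the hash codes x, 2x, 3x, ... actually land in.
    return len({(k * x) % table_length for k in range(1, samples + 1)})

x = 12   # hypothetical hash that only ever produces multiples of 12

print(buckets_used(64, x), 64 // gcd(64, x))   # 16 16  <- power of 2: only 16 of 64 buckets used
print(buckets_used(67, x), 67 // gcd(67, x))   # 67 67  <- prime: every bucket gets used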

Determining which inputs to weigh in an evolutionary algorithm

I once wrote a Tetris AI that played Tetris quite well. The algorithm I used (described in this paper) is a two-step process.
In the first step, the programmer decides to track inputs that are "interesting" to the problem. In Tetris we might be interested in tracking how many gaps there are in a row because minimizing gaps could help place future pieces more easily. Another might be the average column height because it may be a bad idea to take risks if you're about to lose.
The second step is determining weights associated with each input. This is the part where I used a genetic algorithm. Any learning algorithm will do here, as long as the weights are adjusted over time based on the results. The idea is to let the computer decide how the input relates to the solution.
Using these inputs and their weights we can determine the value of taking any action. For example, if putting the straight line shape all the way in the right column will eliminate the gaps of 4 different rows, then this action could get a very high score if its weight is high. Likewise, laying it flat on top might actually cause gaps and so that action gets a low score.
I've always wondered if there's a way to apply a learning algorithm to the first step, where we find "interesting" potential inputs. It seems possible to write an algorithm where the computer first learns what inputs might be useful, then applies learning to weigh those inputs. Has anything been done like this before? Is it already being used in any AI applications?
In neural networks, you can select 'interesting' potential inputs by finding the ones that have the strongest correlation, positive or negative, with the classifications you're training for. I imagine you can do similarly in other contexts.
I think I might approach the problem you're describing by feeding more primitive data to a learning algorithm. For instance, a Tetris game state may be described by the list of occupied cells. A string of bits describing this information would be a suitable input for that stage of the learning algorithm. Actually training on that is still challenging, though: how do you know whether the results are useful? I suppose you could roll the whole algorithm into a single blob, where the algorithm is fed the successive states of play and the output is just the block placements, with higher-scoring algorithms selected for future generations.
Another choice might be to use a large corpus of plays from other sources, such as recorded plays from human players or a hand-crafted AI, and select the algorithms whose outputs bear a strong correlation to some interesting fact or another from the future play, such as the score earned over the next 10 moves.
Yes, there is a way.
If you choose M selected features there are 2^M subsets, so there is a lot to look at.
I would do the following:
For each subset S:
    run your code to optimize the weights W
    save S and the corresponding W
Then, for each pair (S, W), run G games and save the score L for each game. Now you have a table like this:
feature1  feature2  feature3  featureM  subset_code  game_number  scoreL
       1         0         1         1           S1            1   10500
       1         0         1         1           S1            2    6230
     ...
       0         1         1         0           S2        G + 1   30120
       0         1         1         0           S2        G + 2   25900
Now you can run some component-selection algorithm (PCA, for example) and decide which features are worth keeping to explain scoreL.
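A rough Python sketch of that loop, with hypothetical placeholder functions for the game and the weight optimizer, and a simple correlation ranking standing in for the component-selection step:

import itertools
import random
import numpy as np

def optimize_weights(subset):
    # Hypothetical stand-in for the genetic algorithm that tunes weights for one subset.
    return {f: random.random() for f in subset}

def play_games(weights, games):
    # Hypothetical stand-in for actually running the Tetris AI; returns one score per game.
    return [random.random() for _ in range(games)]

features = ["gaps", "avg_height", "bumpiness", "holes"]   # the M candidate inputs
G = 5                                                     # games per subset

rows, scores = [], []
for r in range(1, len(features) + 1):
    for subset in itertools.combinations(features, r):
        weights = optimize_weights(subset)
        for score in play_games(weights, G):
            rows.append([1 if f in subset else 0 for f in features])  # indicator columns
            scores.append(score)

X, y = np.array(rows, float), np.array(scores, float)
# Rank each feature by how strongly its presence correlates with the score.
for feature, column in zip(features, X.T):
    print(feature, round(abs(float(np.corrcoef(column, y)[0, 1])), 3))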
A tip: When running the code to optimize W, seed the random number generator, so that each different 'evolving brain' is tested against the same piece sequence.
I hope this helps!

Random number generator repeats some numbers too often

I'm writing a lottery draw simulation program as a project. The way the game works, you need to pick the 6 numbers that are drawn from 49 to win.
Your chance of winning is 1 in 13,983,816 because that's how many combinations of 6 out of 49 there are. The demo program on the Go Playground generates six new numbers each time around the loop, forever.
Each time a new set of numbers is generated I test to see if it already exists, and if it does I break out of the loop. With 13,983,816 combinations you would think it would be a long time before the same 6 numbers repeated, but in testing it always finds a repeat before 10,000 iterations. Does anyone know why this is happening?
In my opinion you have a couple of problems here.
You use the Go Playground, where your randomness is fixed. The line rand.Seed(time.Now().UnixNano()) always produces the same seed because time.Now() is the same.
Your simulation tests something completely different from what you think. I will write about that at the end.
If you want to do something similar to gambling, you have to use a cryptographically secure PRNG, and Go has one. If you want, you can read more details here (the answer is to a PHP question, but it explains the difference).
On the probability part:
The probability of winning your lottery is indeed 1/C(49, 6) = 1/13,983,816. But this is the probability that someone selects an already predefined set of numbers. For example, you declare that the winning set is {1, 5, 47, 3, 4, 6}, and now the probability that someone wins is approximately 1 in 14 million. So you would have to do the following: randomly select a set of 6 numbers and then, in a loop, compare each new selection to that one predefined set.
But what you actually check is the probability of a collision: that among N people, some of them select the same set (not necessarily even the winning set). This is known as the birthday paradox. And as you can see there, the probability of a collision increases dramatically as the number of people N grows.
This is exactly the same problem, except that your "number of days in the year" is 13,983,816, and you can check here that with this many days you need only 5,000 iterations to get a collision with probability 0.59. With 9,000 iterations you will find a collision with probability 0.94.
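Those numbers are easy to reproduce with the standard birthday-paradox approximation, e.g. in Python:

import math

N = 13_983_816   # C(49, 6): the "number of days in the year" here

def collision_probability(k, n=N):
    # P(at least one repeated draw among k random draws).
    return 1 - math.exp(-k * (k - 1) / (2 * n))

for k in (5_000, 9_000, 10_000):
    print(k, round(collision_probability(k), 2))   # ≈ 0.59, 0.94, 0.97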
I believe you are solving a different problem: the likelihood that any two identical draws show up is much higher than the likelihood that one specific draw shows up.
This is known as the Birthday Problem.
BTW, a rough rule of thumb for the birthday paradox is that if you have N days, you need roundabout sqrt(N) people to get a good chance (around 50%) of a collision.
So, for the original birthday paradox, you have 365 days, so the rule of thumb gives you that with 365^.5 or about 19 people you already have a decent chance of collision (true answer for >50%: 23 people).
Here, with 13,983,816 possible outcomes, the rule of thumb tells you that with 3739 draws you have a pretty good chance of collision (true answer for 50%: 4400 draws).
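And a quick simulation to sanity-check that figure (a sketch, using Python's random module rather than the Go code from the question):

import random

def draws_until_repeat():
    # Draw random 6-of-49 tickets until one repeats; return how many draws that took.
    seen = set()
    while True:
        ticket = tuple(sorted(random.sample(range(1, 50), 6)))
        if ticket in seen:
            return len(seen) + 1
        seen.add(ticket)

trials = 200
hits = sum(draws_until_repeat() <= 4_400 for _ in range(trials))
print(hits / trials)   # hovers around 0.5, matching the ~4,400-draw figure above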
