What is the probability of md5 collision if I pass in 2^32 sets of string? - md5

What is the probability of md5 collision if I pass in 2^32 sets of string?
Can I say the answer is just 2^32/2^128 = 1/1.2621774e-29 as the length of bit of md5 hash is 128?

This question is similar to the so-called "birthday paradox".
In probability theory, the birthday problem or birthday paradox concerns the probability that, in a set of n randomly chosen people, some pair of them will have the same birthday. By the pigeonhole principle, the probability reaches 100% when the number of people reaches 367 (since there are 366 possible birthdays, including February 29). However, 99% probability is reached with just 57 people, and 50% probability with 23 people. These conclusions are based on the assumption that each day of the year (except February 29) is equally probable for a birthday.
The mathematics behind this problem led to a well-known cryptographic attack called the birthday attack, which uses this probabilistic model to reduce the complexity of cracking a hash function.
According to the Wikipedia article, the chance of a collision when choosing n = 232 random numbers from a space with d = 2128 numbers is approximately:
If you work this calculation out the chance is about 2.7×10-20. This is a very small probability, but notice that it is 9 orders of magnitude higher than your proposed calculation.

Related

probability of collision in MD5

Worst case, I have 180 million values in a cache(15 minute window before they go stale) and an MD5 has 2^128 values. What is my probability of a collision? or better yet, is there a web page somewhere to answer that question or a rough estimate thereof? That would rock so I know my chances.
The probability is 1-m!/(mⁿ(m-n)!) where m = 2¹²⁸ and n = 180000000.
Running it through on-line Wolfram exceeds available computational time!
If you have SmallTalk installed locally, you can run this:
|m n p|
m := 2 raisedTo:128.
n := 180000000.
p := (1-(m factorial/((m raisedTo:n)*(m-n)factorial)))asFloat.
Transcript show:p printString;cr.
A search for the Birthday Problem brings up a Wikipedia page where they provide a table showing for 128 bits and 2.6×10¹⁰ hashes, the probability of a collision is 1 in 10¹⁸, so that’s for 140× the number of hashes than what you’re considering. So you know your odds are “worse” than this.
A good approximation if n ≪ m is 1-e-n2/2m, where if you plug in m and n above, you get 4.76×10⁻²³ or 1 in 2.10×10²² as the probability of a collision.
Even though the probability of a collision is very low, it is prudent in the FOOBAR case, say if there is an issue and the hashes accumulate for more than 15 minutes, to at least confirm what would happen in the event of a collision. This will also help if someone somehow injects duplicate hashes in order to try to compromise it.

MCTS UCT with a scoring system

I'm trying to solve a variant of 2048 by a Monte-Carlo Tree Search. I found that UCT could a good way to have some trade-off between exploration/exploitation.
My only issue is that all the versions I've seen assume that the score is a win percentage. How can I adapt it to a game where the score is the value of the board at the last state, and thus going from 1-MAX and not a win.
I could normalize the score using the constant c by dividing by MAX but then it would overweight exploration at early stage of the game (since you get bad average score) and overweight exploitation at late stage of the game.
Indeed most of the literature assumes your games are either lost or won and award a score of 0 or 1, which will turn into a win ratio when averaged over the number of games played. Then exploration parameter C is usually set to sqrt(2) which is optimal for the UCB in bandit problems.
To find out what a good C is in general you have to step back a bit and see what the UCT is really doing. If one node in your tree had an exceptionally bad score in the one rollout it had then exploitation says you should never choose it again. But you've only played that node once, so it might have just been bad luck. To acknowledge this you give that node a bonus. How much? Enough to make it a viable choice even if its average score is the lowest possible and some other node has the highest average score possible. Because with enough plays it might turn out that the one rollout your bad node had was indeed a fluke, and the node actually turns out to be pretty reliable with good scores. Of course, if you get more bad scores then it will likely not be bad luck so it won't deserve more rollouts.
So with scores ranging from 0 to 1 a C of sqrt(2) is a good value. If your game has a maximum achievable score then you can normalize your scores by dividing by the max and force your scores into to 0-1 range to suit a C of sqrt(2). Alternatively you don't normalize the scores but multiply C by your maximum score. The effect is the same: the UCT exploration bonus is large enough to give your underdog nodes some rollouts and a chance to prove themselves.
There is an alternative way of setting C dynamically that has given me good results. As you play, you keep track of the highest and lowest scores you've ever seen in each node (and subtree). This is the range of scores possible and this gives you a hint of how big C should be in order to give not well explored underdog nodes a fair chance. Every time i descend into the tree and pick a new root i adjust C to be sqrt(2) * score range for the new root. In addition, as rollouts complete and their scores turn out the be a new highest or lowest score i adjust C in the same way. By continually adjusting C this way as you play but also as you pick a new root you keep C as large as it needs to be to converge but as small as it can be to converge fast. Note that the minimum score is as important as the max: if every rollout will yield at minimum a certain score then C won't need to overcome it. Only the difference between max and min matters.

How does the HyperLogLog algorithm work?

I've been learning about different algorithms in my spare time recently, and one that I came across which appears to be very interesting is called the HyperLogLog algorithm - which estimates how many unique items are in a list.
This was particularly interesting to me because it brought me back to my MySQL days when I saw that "Cardinality" value (which I always assumed until recently that it was calculated not estimated).
So I know how to write an algorithm in O(n) that will calculate how many unique items are in an array. I wrote this in JavaScript:
function countUniqueAlgo1(arr) {
var Table = {};
var numUnique = 0;
var numDataPoints = arr.length;
for (var j = 0; j < numDataPoints; j++) {
var val = arr[j];
if (Table[val] != null) {
continue;
}
Table[val] = 1;
numUnique++;
}
return numUnique;
}
But the problem is that my algorithm, while O(n), uses a lot of memory (storing values in Table).
I've been reading this paper about how to count duplicates in a list in O(n) time and using minimal memory.
It explains that by hashing and counting bits or something one can estimate within a certain probability (assuming the list is evenly distributed) the number of unique items in a list.
I've read the paper, but I can't seem to understand it. Can someone give a more layperson's explanation? I know what hashes are, but I don't understand how they are used in this HyperLogLog algorithm.
The main trick behind this algorithm is that if you, observing a stream of random integers, see an integer which binary representation starts with some known prefix, there is a higher chance that the cardinality of the stream is 2^(size of the prefix).
That is, in a random stream of integers, ~50% of the numbers (in binary) starts with "1", 25% starts with "01", 12,5% starts with "001". This means that if you observe a random stream and see a "001", there is a higher chance that this stream has a cardinality of 8.
(The prefix "00..1" has no special meaning. It's there just because it's easy to find the most significant bit in a binary number in most processors)
Of course, if you observe just one integer, the chance this value is wrong is high. That's why the algorithm divides the stream in "m" independent substreams and keep the maximum length of a seen "00...1" prefix of each substream. Then, estimates the final value by taking the mean value of each substream.
That's the main idea of this algorithm. There are some missing details (the correction for low estimate values, for example), but it's all well written in the paper. Sorry for the terrible english.
A HyperLogLog is a probabilistic data structure. It counts the number of distinct elements in a list. But in comparison to a straightforward way of doing it (having a set and adding elements to the set) it does this in an approximate way.
Before looking how the HyperLogLog algorithm does this, one has to understand why you need it. The problem with a straightforward way is that it consumes O(distinct elements) of space. Why there is a big O notation here instead of just distinct elements? This is because elements can be of different sizes. One element can be 1 another element "is this big string". So if you have a huge list (or a huge stream of elements) it will take a lot memory.
Probabilistic Counting
How can one get a reasonable estimate of a number of unique elements? Assume that you have a string of length m which consists of {0, 1} with equal probability. What is the probability that it will start with 0, with 2 zeros, with k zeros? It is 1/2, 1/4 and 1/2^k. This means that if you have encountered a string starting with k zeros, you have approximately looked through 2^k elements. So this is a good starting point. Having a list of elements that are evenly distributed between 0 and 2^k - 1 you can count the maximum number of the biggest prefix of zeros in binary representation and this will give you a reasonable estimate.
The problem is that the assumption of having evenly distributed numbers from 0 t 2^k-1 is too hard to achieve (the data we encountered is mostly not numbers, almost never evenly distributed, and can be between any values. But using a good hashing function you can assume that the output bits would be evenly distributed and most hashing function have outputs between 0 and 2^k - 1 (SHA1 give you values between 0 and 2^160). So what we have achieved so far is that we can estimate the number of unique elements with the maximum cardinality of k bits by storing only one number of size log(k) bits. The downside is that we have a huge variance in our estimate. A cool thing that we almost created 1984's probabilistic counting paper (it is a little bit smarter with the estimate, but still we are close).
LogLog
Before moving further, we have to understand why our first estimate is not that great. The reason behind it is that one random occurrence of high frequency 0-prefix element can spoil everything. One way to improve it is to use many hash functions, count max for each of the hash functions and in the end average them out. This is an excellent idea, which will improve the estimate, but LogLog paper used a slightly different approach (probably because hashing is kind of expensive).
They used one hash but divided it into two parts. One is called a bucket (total number of buckets is 2^x) and another - is basically the same as our hash. It was hard for me to get what was going on, so I will give an example. Assume you have two elements and your hash function which gives values form 0 to 2^10 produced 2 values: 344 and 387. You decided to have 16 buckets. So you have:
0101 011000 bucket 5 will store 1
0110 000011 bucket 6 will store 4
By having more buckets you decrease the variance (you use slightly more space, but it is still tiny). Using math skills they were able to quantify the error (which is 1.3/sqrt(number of buckets)).
HyperLogLog
HyperLogLog does not introduce any new ideas, but mostly uses a lot of math to improve the previous estimate. Researchers have found that if you remove 30% of the biggest numbers from the buckets you significantly improve the estimate. They also used another algorithm for averaging numbers. The paper is math-heavy.
And I want to finish with a recent paper, which shows an improved version of hyperLogLog algorithm (up until now I didn't have time to fully understand it, but maybe later I will improve this answer).
The intuition is if your input is a large set of random number (e.g. hashed values), they should distribute evenly over a range. Let's say the range is up to 10 bit to represent value up to 1024. Then observed the minimum value. Let's say it is 10. Then the cardinality will estimated to be about 100 (10 × 100 ≈ 1024).
Read the paper for the real logic of course.
Another good explanation with sample code can be found here:
Damn Cool Algorithms: Cardinality Estimation - Nick's Blog

Knapsack variation?

So, I'm going shopping, and I have x money, my truck can take up to y weight, and each item has a bonus credit, a weight and a price. The output should give the maximum bonus credit that can be obtained such that the total weight of the chosen items does not exceed the capacity of the truck and the money I have to spend!
Do you know the name of the algorithm? How should I proceed? I have to do it in C!
What have you tried?
These types of problems usually fall into the category of optimization or constraint satisfaction.
Try writing a functional expression for your problem and see if you can solve it with simple calculus or simplex methods.
I know of two variations of the knapsack problem. The 0-1 version can't contain fractional weights (take it or leave it), for example, I can't take half of the second best choice. The other version is the opposite, fractional items are allowed. This small difference is extremely significant and works in favor of the fractional version.
The fractional version can be solved via a greedy algorithm. You can simply take as much as you can of the item with the most VALUABLE "unit price". Repeat until your truck is full.
The 0-1 version is a bit harder, as it can't be solved via a straightforward greedy algorithm. As an example, say your truck can carry 800lbs. We can pick from a
1 Table weighing 500lbs with a value of $1000 (unit price $2/lb)
1 Bench : weight - 701lb : value - $1577.25 : unit $2.25/lb
3 Bookcases : weight - 100lb : value - $200 : unit $2/lb
A greedy algorithm would take the bench for a total of $1577.25.
The optimal value is 3 bookcases and the table = $1600.
If the above were the fractional knapsack version would would simply take the bench and 99lbs of table/bookcase for a total of $1775.25.
In the 0-1 case we would need to use something like dynamic programming to examine all solutions.
The item weights and item prices are constraints. The bonus credits are the objectives. Thus, you have a multi-dimensional knapsack problem (one objective; two constraints). The well-known dynamic programming solution knapsack will generalize, but the complexity grows exponentially with the number of constraints.

Random number generator repeates some numbers too often

I'm writing a lottery draw simulation program as a project. The way the game works is you need to pick the 6 numbers that are draw from the 49 to win.
Your chance of winning is 1/13,983,816 because that's how many combinations of 6 in 49 there are. The demo program on Go playground generates six new numbers each time around the loop forever.
Each time a new set of numbers is generated I test to see if it already exists and if it does I break out of the loop. With 13,983,816 combinations you would think it would be a long time before the same 6 numbers would repeat but, in testing it fails always before 10000 iteration. Does anyone know why this is happening?
In my opinion you have a couple of problems here.
You use Go playground, where your randomness is fixed. This line rand.Seed(time.Now().UnixNano()) always produce the same seed because time.Now() is the same.
You test completely different things with your simulation. I will write about it in the end.
if you want to do something similar to gambling - you have to use cryptographically secure PRNG and Go has it. If you want you can read more details here (the answer is to php question, but it explains the difference).
On the probability part:
The probability of winning your lottery is indeed 1/C(49, 6) = 1/13,983,816. But this is the probability that someone would select an already predefined set of numbers. For example you claim that your winner is {1, 5, 47, 3, 4, 5} and now the probability that someone would win is approximately 1 in 14 mln. So you have to do the following. Randomly select a set of 6 numbers and then compare your new selection in a loop to already found.
But what you do is to check the probability of collision. That having N people some of them would select the same sets (not necessarily even the winning set). This is known as the birthday paradox. And as you see there, the probability of collision increase dramatically with the increase of number of people N.
This is absolutely the same problem, but your number of days in the year is 13,983,816 and you can check here that for this number of days you need only 5000 iterations to guarantee with 0.59 percents that you will get a collision. And with 9000 iterations you will find the collision with probability 0.94.
I believe you are solving a different problem, the likelihood that two identical draws show up is much higher than one specific one will show up.
This is known as Birthday Problem
BTW, a rough rule of thumb for the birthday paradox is that if you have N days, you need roundabout sqrt(N) people to get a good chance (around 50%) of a collision.
So, for the original birthday paradox, you have 365 days, so the rule of thumb gives you that with 365^.5 or about 19 people you already have a decent chance of collision (true answer for >50%: 23 people).
Here, with 13,983,816 possible outcomes, the rule of thumb tells you that with 3739 draws you have a pretty good chance of collision (true answer for 50%: 4400 draws).

Resources