I was wondering if there is a way for a network of N participants to agree that a number from 1 to M was chosen at random (i.e. not influenced by any of the participants). This has been solved for N=2 and M=2 by the coin-tossing protocol. Does anyone know of any solutions that work for arbitrary values of N and M?
Edit
Better algorithm (thanks wnoise):
Everyone picks a secret number from 0 to M-1
Everyone appends a load of random gunk to their number and hashes the result with a secure hash
Everyone tells everyone else this hash
Everyone tells everyone else their secret number, plus the random gunk they appended to it
Everyone verifies that each revealed number+gunk matches the corresponding hash
Everyone adds all the secret numbers together modulo M, then adds 1 to get the final result
As a participant, I should be satisfied with this because I know that I had full influence over the final result - the final number could have been anything at all, depending on my choice of secret number. So since no-one else could predict my number, they couldn't have predicted the final result either.
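For concreteness, here's a minimal single-process sketch of the commit/reveal rounds in C# (the class and method names are mine, the network transport between participants is left out, SHA-256 stands in for "a secure hash", and RandomNumberGenerator stands in for a good source of secret numbers and gunk; the static helpers used require a recent .NET):

using System;
using System.Linq;
using System.Security.Cryptography;

class Participant
{
    public int Secret;          // secret number in [0, M)
    public byte[] Gunk;         // random padding appended before hashing
    public byte[] Commitment;   // hash broadcast in the first round

    public Participant(int m)
    {
        Secret = RandomNumberGenerator.GetInt32(m);
        Gunk = RandomNumberGenerator.GetBytes(32);
        Commitment = SHA256.HashData(
            BitConverter.GetBytes(Secret).Concat(Gunk).ToArray());
    }

    // Run by every participant against every reveal they receive.
    public static bool Verify(int secret, byte[] gunk, byte[] commitment) =>
        SHA256.HashData(BitConverter.GetBytes(secret).Concat(gunk).ToArray())
              .SequenceEqual(commitment);
}

class Demo
{
    static void Main()
    {
        const int M = 6, N = 4;   // arbitrary demo values
        var participants = Enumerable.Range(0, N)
                                     .Select(_ => new Participant(M))
                                     .ToArray();

        // Each reveal is checked against the commitment published earlier...
        bool allHonest = participants.All(p =>
            Participant.Verify(p.Secret, p.Gunk, p.Commitment));

        // ...then everyone sums the secrets modulo M and adds 1.
        int result = participants.Sum(p => p.Secret) % M + 1;
        Console.WriteLine($"verified: {allHonest}, agreed number: {result}");
    }
}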
Any way to reduce the messages from the 3N^2 that I suspect a broadcast approach would require?
I reckon that only the hash publication has to be a broadcast, but it's still O(N^2). I guess the only way around that would be to pre-exchange digital signature keys, or to have a trusted communication hub.
Edit2 - How safe is the hashing thing?
Possible attacks include:
If I can generate a hash collision then I have two secret numbers with the same hash. So once I know everyone else's secret numbers I can choose which of my secret numbers to reveal, thus selecting one of two possible results.
If I generate my secret number and random gunk using a PRNG, then an attacker trying to brute-force my hash doesn't have to try every possible number+gunk, only every possible seed for the PRNG.
I use the number+gunk that everyone reveals to determine information about their PRNGs - I could try to guess or brute-force the seeds, or calculate the internal state from the output. This helps me predict what numbers they will generate next time around, which narrows the search space for a brute-force attack.
Therefore, you should
Use a trusted, unbroken hashing algorithm.
Use a cryptographically secure random number generator that has a big seed / state, and try to seed it from a good source of entropy.
I don't know if it is possible for people to agree on the randomness of a single number; randomness only shows up in the statistics. If the statistics of many random numbers matched the statistics of numbers taken from here, then I would consider your numbers random, but I don't know about the next guy, N+1, on the network.
This is probably not what you're looking for, but just to start this thread, how about this:
Select a leader, let the leader choose the number, distribute the number to everyone.
Related
Often in a database, you or the DB generate a UUID. This needs to be random but also not something already present in the list.
One approach I've seen is to first generate a random number, check whether it exists in the list; if yes, try again, otherwise use/save it.
However, once you've used 50% of the possible numbers, each attempt has a 50/50 chance of collision.
Furthermore, regardless of how much capacity is used, the worst case is unbounded. Although not very probable in the beginning, there is a possibility that your program keeps generating numbers that are already taken, forever. Obviously, the chances increase as more of the possible numbers are taken.
It feels like a great way to introduce bugs that are impossible to catch.
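A sketch of that retry loop (purely illustrative, names made up), mainly to show that nothing bounds the number of iterations except probability:

using System;
using System.Collections.Generic;

class RetryUntilUnique
{
    // Hypothetical retry loop: keep drawing until we hit an unused value.
    // The more values are already taken, the longer this is expected to spin.
    static int NextUniqueId(HashSet<int> taken, int maxExclusive, Random rng)
    {
        while (true)
        {
            int candidate = rng.Next(maxExclusive);
            if (taken.Add(candidate))   // Add returns false if already present
                return candidate;
        }
    }
}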
This is little known, but the sequence 0, 1, 2, 3, 4, 5, 6... is made of non-repeating numbers :-) If you store the last output number, you can avoid collisions forever, at a cost O(1).
If you feel that this is not "random enough", you can scramble the bits in an arbitrary but reversible way. This ensures that the no-collision property remains.
If you want truly random numbers, you can append truly random numbers to the non-colliding ones.
You can even ensure that several systems generate non colliding numbers without communicating, by assigning every system a unique ID and making this ID part of the numbers.
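A minimal sketch of that idea (my own names and constants; the multiply-by-odd-constant-then-XOR scramble is just one example of a reversible mapping). Because the scramble is a bijection on 64-bit values and each node keeps its ID in the top bits, two calls can never return the same value until the 48-bit counter wraps:

using System;

class NonCollidingIds
{
    // A per-node counter made to look "random" by a reversible scramble.
    // Multiplying by an odd constant modulo 2^64 is a bijection, and so is
    // XOR, so distinct (nodeId, counter) inputs always give distinct outputs.
    private ulong _counter;
    private readonly ulong _nodeId;   // unique per system, stored in the top 16 bits

    public NonCollidingIds(ushort nodeId) => _nodeId = (ulong)nodeId << 48;

    public ulong Next()
    {
        ulong n = _nodeId | (++_counter & 0x0000FFFFFFFFFFFFUL);
        return (n * 0x9E3779B97F4A7C15UL) ^ 0xA5A5A5A5A5A5A5A5UL; // reversible scramble
    }
}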
I have a request as part of an application I am working on to generate a UserId with the following criteria:
It will have a set 4 digit prefix
The next 10 digits will be random
A check-digit will be added on the end
Points 1 and 3 are straightforward, but what would be the best way to generate a 10-digit random number whilst ensuring that it hasn't already been used?
I don't particularly like the idea of choosing one randomly, seeing if it has been taken and then either accepting or trying again.
My other thought was to generate a list of X numbers in advance (X being a number greater than the number of accounts you expect to be created) and just take the next one off the list as the accounts are created.
Any thoughts?
EDIT:
Let's bring technology into this. If I am using a SQL Server database, is there a way I can make the database do this for me? E.g. enforce a unique constraint and get the database to generate the number?
Encryption. See this answer.
What you need is basically to encrypt a sequential ID, thus producing a seemingly random number.
What's especially good about this is that you can do this all client-side, albeit in two consecutive transactions.
The only way to be 100% sure there is no other user ID with the same random digits is to compare the generated ID against all existing users; you have to do it at some point.
The only possibility is to do this code-side.
BUT you can save the randomly generated ID to the user database and compare it with your new key (in your code),
e.g. SELECT * FROM xy WHERE userid = 'newuserid'
If the result is empty, your key was never generated before.
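To answer the "can the database do this for me?" part, one hedged sketch (table and column names are made up; shown with Microsoft.Data.SqlClient, but System.Data.SqlClient works the same way) is to put a UNIQUE constraint on the column and simply retry on the rare violation, so the check and the insert happen in one round trip:

using System;
using Microsoft.Data.SqlClient;

class UserIdGenerator
{
    // Assumes something like:
    //   ALTER TABLE Users ADD CONSTRAINT UQ_Users_UserId UNIQUE (UserId);
    // 2627 / 2601 are SQL Server's duplicate-key error numbers.
    public static long InsertNewUser(string connectionString, Func<long> generateId)
    {
        while (true)
        {
            long candidate = generateId();
            try
            {
                using var conn = new SqlConnection(connectionString);
                conn.Open();
                using var cmd = new SqlCommand(
                    "INSERT INTO Users (UserId) VALUES (@id)", conn);
                cmd.Parameters.AddWithValue("@id", candidate);
                cmd.ExecuteNonQuery();
                return candidate;
            }
            catch (SqlException ex) when (ex.Number is 2627 or 2601)
            {
                // Duplicate key: loop and try another random ID.
            }
        }
    }
}

Unlike a SELECT-then-INSERT check, the constraint also closes the race where two sessions generate the same ID at the same time.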
Thanks to Anton's answer I found this C# implementation of the skip32 encryption algorithm.
https://github.com/eleven41/Eleven41.Skip32
Using this I can pass in an incrementing database identity integer and get a nice random-looking 9/10-digit number from it. So a simple call like
int result = cipher.Encrypt(databaseIdentity);
will give me the number I want. No need for duplicate checking as each will be unique.
I was going through Eric Lippert's latest blog post, Guidelines and rules for GetHashCode, when I hit this paragraph:
We could be even more clever here; just as a List resizes itself when it gets full, the bucket set could resize itself as well, to ensure that the average bucket length stays low. Also, for technical reasons it is often a good idea to make the bucket set length a prime number, rather than 100. There are plenty of improvements we could make to this hash table. But this quick sketch of a naive implementation of a hash table will do for now. I want to keep it simple.
So it looks like I'm missing something. Why is it good practice to set it to a prime number?
You can find people suggesting the two opposite ends of the spectrum. On one side, choosing a prime number for the size of the hash table will reduce the chances of collisions, even if the hash function is not very effective at distributing the results. Note that if (in the simplest example to argue about) a power-of-2 size is chosen, only the lower bits affect the bucket, while for a prime number most bits in the result of the hash will be used.
On the other hand, you can gain more by choosing a better hash function, or even rehashing the result of the hash function by applying some bit operations, and using a power-of-2 hash size to speed up calculations.
As an example from real life, Java's HashMap was initially implemented using a prime (or almost prime) size, but from Java 1.4 on the design was changed to use a power-of-two number of buckets, with a second fast hash function applied to the result of the initial hash. An interesting article commenting on that change can be found here.
So basically:
a prime number helps dispersing the inputs across the different buckets even in the event of not-so-good hash functions.
a similar effect can be achieved by post-processing the result of the hash function, and using a power-of-2 size to speed up the modulo operation (a bit mask) and compensate for the post-processing.
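A small sketch of the two options being contrasted (illustrative only; the mixing step is similar in spirit to the supplemental hash later Java versions apply, not a copy of it):

static class Buckets
{
    // Prime-sized table: the modulo uses (nearly) all bits of the hash,
    // so even a mediocre hash function spreads keys reasonably well.
    public static int PrimeBucket(int hash, int primeSize) =>
        (int)((uint)hash % (uint)primeSize);

    // Power-of-two table: the mask keeps only the low bits, so a cheap
    // mixing step first folds the high bits in before masking.
    public static int PowerOfTwoBucket(int hash, int powerOfTwoSize)
    {
        uint h = (uint)hash;
        h ^= h >> 16;                        // fold high bits into low bits
        return (int)(h & (uint)(powerOfTwoSize - 1));
    }
}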
Because this produces a better hash function and reduces the number of possible collisions. This is explained in Choosing a good hashing function:
A basic requirement is that the function should provide a uniform distribution of hash values. A non-uniform distribution increases the number of collisions, and the cost of resolving them.
The distribution needs to be uniform only for table sizes s that occur in the application. In particular, if one uses dynamic resizing with exact doubling and halving of s, the hash function needs to be uniform only when s is a power of two. On the other hand, some hashing algorithms provide uniform hashes only when s is a prime number.
Say your bucket set length is a power of 2 - that makes the mod calculation quite fast. It also means that the bucket selection is determined solely by the bottom n bits of the hash code, where the length is 2^n, so the top m = 32 - n bits are thrown away immediately, useful bits and all.
Or as this blog post from 2006 puts it:
Suppose your hashCode function results in the following hashCodes among others {x , 2x, 3x, 4x, 5x, 6x...}, then all these are going to be clustered in just m number of buckets, where m = table_length/GreatestCommonFactor(table_length, x). (It is trivial to verify/derive this). Now you can do one of the following to avoid clustering:
...
Or simply make m equal to the table_length by making GreatestCommonFactor(table_length, x) equal to 1, i.e. by making table_length coprime with x. And if x can be just about any number, then make sure that table_length is a prime number.
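A quick way to check that claim (a throwaway sketch with one concrete x): hash codes x, 2x, 3x, ... land in exactly table_length / GreatestCommonFactor(table_length, x) buckets, so a composite length that shares a factor with x clusters badly while a prime length does not.

using System;
using System.Linq;

class GcfClustering
{
    static int Gcd(int a, int b) => b == 0 ? a : Gcd(b, a % b);

    static void Main()
    {
        int x = 12;
        foreach (int tableLength in new[] { 96, 97 })   // composite vs prime
        {
            // Count how many distinct buckets the hashes x, 2x, 3x, ... occupy.
            int buckets = Enumerable.Range(1, 10_000)
                                    .Select(k => (k * x) % tableLength)
                                    .Distinct()
                                    .Count();
            Console.WriteLine(
                $"table_length={tableLength}: {buckets} buckets used " +
                $"(predicted {tableLength / Gcd(tableLength, x)})");
        }
    }
}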
Should I use high numbers for user IDs in a database?
Are there any benefits to starting user_id from 1000 (for a project with fewer than 9000 users), or from 10000 for more...?
The advantage of starting user IDs from 1000 (even when you will have fewer than 9,000 IDs) is that they will all have the same number of digits, so that files, for example, suffixed with the UID will sort in numeric order automatically, even if the sorter only uses alphabetic ordering. And you don't have to pad the numbers with leading zeroes to get there.
The converse is that if you only have 1000 users, numbers starting at 1,000,000,000 would look a little silly: 1,000,000,001 then 1,000,000,002 and so on.
For many purposes, therefore, it doesn't matter which you do. A uniform number of digits has some advantages, and that is why a value other than zero or one is often used as the starting point.
Not really. I would just start from 1. If you have any need to put stuff in before one, there are no issues with using negative numbers, so you can just do an insert and manually specify the ID. At my company, we start all the users at one, auto-incrementing, and our global admin user is ID 0.
I know this answer comes late, but still there is something to add, imo:
1) Using 1000 as a starting ID can be an advantage, e.g. if you do not want to make it obvious how many users you have (in case you make the ID visible somewhere in a URL or similar), and therefore (or in addition)
2) it can be useful if you want to make IDs harder to guess, because usually the first IDs belong to admins or moderators, so if you pick an arbitrary ID to start with (e.g. 1421), you add another small security tweak to your DB...
While thinking about this question and conversing with the participants, the idea came up that shuffling a finite set of clearly biased random numbers makes them random because you don't know the order in which they were chosen. Is this true and if so can someone point to some resources?
EDIT: I think I might have been a little unclear. Suppose a bad random number generator. Take n values. These are biased (the RNG is bad). Is there a way, through shuffling, to make the output of the RNG over multiple trials statistically match the output of a known-good RNG?
False.
There is an easy test: Assume the bias in the original set creation algorithm is "creates sets whose arithmetic average is significantly lower than expected average". Obviously, shuffling the result of the algorithm will not change the averages and thus not remove the bias.
Also, regarding your clarification: How would you shuffle the set? Using the same bad output from the bad RNG that created the set in the first place? Or using a better RNG? Which raises the question why you don't use that directly.
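A throwaway sketch of that averaging argument (the "bad RNG" here is just one that favours low values): shuffling is a permutation of the same multiset, so the mean, and indeed the whole histogram, cannot change.

using System;
using System.Linq;

class ShuffleBiasDemo
{
    static void Main()
    {
        var rng = new Random(42);

        // A deliberately biased "generator": 1..9, but low values are more likely.
        int[] biased = Enumerable.Range(0, 100_000)
                                 .Select(_ => Math.Min(rng.Next(1, 10), rng.Next(1, 10)))
                                 .ToArray();

        // Fisher-Yates shuffle of the whole sequence.
        int[] shuffled = (int[])biased.Clone();
        for (int i = shuffled.Length - 1; i > 0; i--)
        {
            int j = rng.Next(i + 1);
            (shuffled[i], shuffled[j]) = (shuffled[j], shuffled[i]);
        }

        // Same multiset, so exactly the same average (and histogram).
        Console.WriteLine($"mean before: {biased.Average():F3}, after: {shuffled.Average():F3}");
    }
}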
It's not true. In the other question the problem is to select 30 random numbers in [1..9] with a sum of 200. After randomly choosing about 20 of them on average, you reach a point where you can't select nines anymore because this would make the total sum go over 200. Of the remaining 10 numbers, most will be ones and twos. So in the end, ones and twos are very overrepresented in the selected numbers. Shuffling doesn't change that. But it's not clear what the random distribution should really look like, so one could say this is as good a solution as any.
In general, if your "random" numbers will be biased to, say, low numbers, they will be biased that way no matter the ordering.
Just shuffling a set of already-generated random numbers won't do anything to the probability distribution, of course. That would mean false. Perhaps I misunderstand your question, though?
I would say false, with a caveat:
I think there is random, and then there is 'random-enough'. For most applications that I have needed to work on, 'random-enough' was more than enough, i.e. picking a 'random' ad to display on a page from a list of 300 or so that have paid to be placed on that site.
I am sure a mathematician could prove my very basic 'random' selection criteria is not truly random at all, but in fact is predictable - for my clients, and for the users, nobody cares.
On the other hand, if I was writing a video game to be used in Las Vegas where large amounts of money were at stake, I'd define random differently (and might have a hard time coming up with something truly random).
False
The set is finite; suppose it consists of n numbers. What happens if you choose n+1 numbers? Let's also consider a basic random function as implemented in many languages, which gives you a random number in [0,1). Suppose this number is limited to three digits after the decimal point, giving you a set of 1000 possible numbers (0.000 - 0.999). In most cases you will not need to use all 1000 of these numbers, so this amount of randomness is more than enough.
However for some uses, you will need a better random generator than this. So it all comes down to exactly how many random numbers you are going to need, and how random you need them to be.
Addition after reading the original question: in the case that you have some sort of limitation (such as in the original question, in which each set of selected numbers must sum up to a certain N), you are not really selecting random numbers per se, but rather choosing numbers in a random order from a given set (specifically, a permutation of numbers summing up to N).
Addition to edit: Suppose your bad number generator generated the sequence (1,1,1,2,2,2). Does the permutation (1,2,2,1,1,2) satisfy your definition of random?
Completely and utterly untrue: shuffling doesn't remove a bias, it just conceals it from the casual observer. It's like removing your dog's fondly-laid present from your carpet by just pushing it under the sofa - you really haven't solved the problem, you've just made it less conspicuous. Anyone with a nose knows there is still a problem that needs removing.
The randomness must be applied evenly over the whole range, so here's one way (off the top of my head, lots of assumptions, yadda yadda). The point is the approach, not the code: start with everything even, then introduce your randomness in a consistent fashion until you're done. The only bias now depends on the values chosen for 'target' and 'numberOfNumbers', which is part of the question.
// C# (top-level statements; assumes the implicit System/System.Linq usings)
var rng = new Random();
const int target = 200;
const int numberOfNumbers = 30;

// Start with everything even (all nines), then knock random slots down
// (never below 1) until the total reaches the target sum.
var numbers = Enumerable.Repeat(9, numberOfNumbers).ToArray();
while (numbers.Sum() > target)
{
    int i = rng.Next(numberOfNumbers);
    if (numbers[i] > 1) numbers[i]--;
}
False. Consider a bad random number generator producing only zeros (I said it was BAD :-) No amount of shuffling the zeros would change any property of that sequence.