Shuffling biased random numbers - shuffle

While I was thinking about this question and conversing with the participants, the idea came up that shuffling a finite set of clearly biased random numbers makes them random because you don't know the order in which they were chosen. Is this true, and if so, can someone point to some resources?
EDIT: I think I might have been a little unclear. Suppose a bad random number generator. Take n values. These are biased (the RNG is bad). Is there a way, through shuffling, to make the output of the RNG over multiple trials statistically match the output of a known good RNG?

False.
There is an easy test: assume the bias in the original set-creation algorithm is "creates sets whose arithmetic average is significantly lower than the expected average". Obviously, shuffling the result of the algorithm will not change the average and thus will not remove the bias.
Also, regarding your clarification: how would you shuffle the set? Using the same bad output from the bad RNG that created the set in the first place? Or using a better RNG? Which raises the question of why you don't use that directly.

It's not true. In the other question the problem is to select 30 random numbers in [1..9] with a sum of 200. After choosing, on average, about 20 of them randomly, you reach a point where you can't select nines anymore because that would push the total sum over 200. Of the remaining 10 numbers, most will be ones and twos. So in the end, ones and twos are heavily overrepresented in the selected numbers. Shuffling doesn't change that. But it's not clear what the random distribution really should look like, so one could say this is as good a solution as any.
In general, if your "random" numbers will be biased to, say, low numbers, they will be biased that way no matter the ordering.

Just shuffling a set of already-generated random numbers won't do anything to the probability distribution, of course. That would mean false. Perhaps I misunderstand your question though?

I would say false, with a caveat:
I think there is random, and then there is 'random-enough'. For most applications that I have needed to work on, 'random-enough' was more than enough, i.e. picking a 'random' ad to display on a page from a list of 300 or so that have paid to be placed on that site.
I am sure a mathematician could prove my very basic 'random' selection criterion is not truly random at all, but in fact predictable - yet for my clients, and for the users, nobody cares.
On the other hand, if I was writing a video game to be used in Las Vegas where large amounts of money were at stake, I'd define random differently (and might have a hard time coming up with truly random).

False
The set is finite; suppose it consists of n numbers. What happens if you choose n+1 numbers? Let's also consider a basic random function as implemented in many languages, which gives you a random number in [0,1). Suppose this number is limited to three digits after the decimal point, giving you a set of 1000 possible numbers (0.000 - 0.999). In most cases you will not need all 1000 of these numbers, so this amount of randomness is more than enough.
However, for some uses you will need a better random generator than this. So it all comes down to exactly how many random numbers you are going to need, and how random you need them to be.
Addition after reading the original question: in the case that you have some sort of limitation (such as in the original question, in which each set of selected numbers must sum up to a certain N) you are not really selecting random numbers per se, but rather choosing numbers in a random order from a given set (specifically, a permutation of numbers summing up to N).
Addition to edit: Suppose your bad number generator generated the sequence (1,1,1,2,2,2). Does the permutation (1,2,2,1,1,2) satisfy your definition of random?

Completely and utterly untrue: shuffling doesn't remove a bias, it just conceals it from the casual observer. It's like removing your dog's fondly-laid present from your carpet by just pushing it under the sofa - you really haven't solved the problem, you've just made it less conspicuous. Anyone with a nose knows that there is still a problem that needs removing.
The randomness must be applied evenly over the whole range, so here's one way (off the top of my head, lots of assumptions, yadda yadda. The point is the approach, not the code - start with everything even, then introduce your randomness in a consistent fashion until you're done. The only bias now is dependent on the values chosen for 'target' and 'numberofnumbers', which is part of the question.)
var target = 200;
var numberofnumbers = 30;
var numbers = [];
// Start every entry at the maximum value (9)...
for (var i = 0; i < numberofnumbers; i++) numbers[i] = 9;
var sum = function (a) { return a.reduce(function (s, x) { return s + x; }, 0); };
// ...then decrement randomly chosen entries until the total reaches the target,
// skipping entries already at 1 so every value stays in the original [1..9] range.
while (sum(numbers) > target) {
    var r = Math.floor(Math.random() * numberofnumbers);
    if (numbers[r] > 1) numbers[r]--;
}

False. Consider a bad random number generator producing only zeros (I said it was BAD :-) No amount of shuffling the zeros would change any property of that sequence.

Related

How do you generate random but unique number where worse case scenario isn't infinity?

Often in a database you or the DB generate a UUID. This needs to be random but also not something already present in the list.
One approach I've seen is to first generate a random number, check whether it exists in the list, and if it does, try again; otherwise use/save it.
However, once you hit 50% of the capacity of possible numbers, your algo is going to have a 50/50 chance of collision.
Furthermore, regardless of the capacity used, the worst-case scenario is infinity. Although not very probable in the beginning, there is a possibility that, in the worst case, your program keeps generating numbers that are already taken forever. Obviously, the chances increase as more of the possible numbers are taken.
It feels like a great way to introduce bugs that are impossible to catch.
This is little known, but the sequence 0, 1, 2, 3, 4, 5, 6... is made of non-repeating numbers :-) If you store the last output number, you can avoid collisions forever, at a cost of O(1).
If you feel that this is not "random enough", you can scramble the bits in an arbitrary but reversible way. This ensures that the no-collision property remains.
If you want truly random numbers, you can append truly random numbers to the non-colliding ones.
You can even ensure that several systems generate non colliding numbers without communicating, by assigning every system a unique ID and making this ID part of the numbers.
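Here is a minimal JavaScript sketch of the counter-plus-reversible-scramble idea (the scrambling constants and helper names are arbitrary choices for illustration, not anything standard):
// The counter alone is the non-repeating sequence 0, 1, 2, 3, ...
// scramble() is a bijection on 32-bit values (XOR-shift and multiplication by
// an odd constant are both reversible), so distinct counters can never collide.
var counter = 0;                        // persist this between runs in real use

function scramble(x) {
    x = x ^ (x >>> 16);
    x = Math.imul(x, 0x45d9f3b) >>> 0;  // arbitrary odd constant
    x = x ^ (x >>> 16);
    return x >>> 0;
}

function nextId() {
    counter = (counter + 1) >>> 0;      // wraps after 2^32 ids
    return scramble(counter);           // looks random, still collision-free
}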

Named range of consistent random numbers

Background
Following on from a question I asked a while ago about getting an array of different (but not necessarily unique) random numbers to which the answer was this:
=RANDBETWEEN(ROW(A1:A10)^0,10)
To get an array of 10 random numbers between 1 and 10
The Problem
If I create a named range (called "randArray") with the formula above I hoped I would be able to reference randArray a number of times and get the same set of random numbers. Granted, they would change each time I press F9 or update the worksheet -- but change together.
This is what I get instead, two completely different sets of random numbers
I'm not surprised by this behavior but how can I achieve this without using VBA and without putting the random numbers onto the worksheet?
If you're interested
This example is intended to be an MCVE. In my actual case, I am using random numbers to estimate Pi. The user stipulates how many random points to apply and gets an accordingly accurate estimation. The problem arises because I also graph the points, and when there are a small number of points it's very clear that the estimation and the graph don't represent the same dataset.
Update
I have awarded the initial bounty to @Michael for providing an interesting and different solution. I am still looking for a complete solution which allows the user to stipulate how many random points to use, and although there might not be a perfect answer I'm still interested in any other possible solutions and more than happy to put up further bounties.
Thank you to everyone who has contributed so far.
This solution generates 10 seemingly random numbers between 1 and 10 that persist for nearly 9 seconds at a time. This allows repeated calls of the same formula to return the same set of values in a single refresh.
You can modify the time frame if required. Shorter time periods allow for more frequent updates, but also slightly increase the extremely unlikely chance that some calls to the formula occur after the cutover point resulting in a 2nd set of 10 random numbers for subsequent calls.
Firstly, define an array "Primes" with 10 different prime numbers:
={157;163;167;173;179;181;191;193;197;199}
Then, define this formula that will return an array of 10 random numbers:
=MOD(ROUND(MOD(ROUND(NOW(),4)*70000,Primes),0),10)+1
Explanation:
We need to build our own random number generator that we can seed with the same value for an amount of time; long enough for the called formula to keep returning the same value.
Firstly, we create a seed: ROUND(NOW(),4) creates a new seed number every 0.0001 days = 8.64 seconds.
We can generate rough random numbers using the following formula:
Random = Seed * 7 mod Prime
https://cdsmith.wordpress.com/2011/10/10/build-your-own-simple-random-numbers/
Ideally, a sequence of random numbers is generated by taking input from the previous output, but we can't do that in a single function. So instead, this uses 10 different prime numbers, essentially starting 10 different random number generators. Now, this has less reliability at generating random numbers, but the testing results further below show it actually seems to do a pretty good job.
ROUND(NOW(),4)*70000 gets our seed up to an integer and multiplies by 7 at the same time
MOD(ROUND(NOW(),4)*70000,Prime) generates a sequence of 10 random numbers from 0 to the respective prime number
ROUND(MOD(ROUND(NOW(),4)*70000,Prime),0) is required to get us back to an integer because Excel seems to struggle with applying MOD to floating point numbers.
=MOD(ROUND(MOD(ROUND(NOW(),4)*70000,Prime),0),10)+1 takes just the value from the ones place (random number from 0 to 9) and shifts it to give us a random number from 1 to 10
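For what it's worth, here is a rough JavaScript sketch of the same arithmetic (serialDate stands in for Excel's NOW(), i.e. the serial day number; JavaScript's Math.round may differ from Excel's ROUND in edge cases, so treat this as an approximation rather than an exact port):
// One value per prime; the whole set is reseeded every 0.0001 days (~8.64 s).
var primes = [157, 163, 167, 173, 179, 181, 191, 193, 197, 199];

function tenRandoms(serialDate) {
    var seed = Math.round(serialDate * 10000) / 10000;   // ROUND(NOW(), 4)
    return primes.map(function (p) {
        var r = Math.round((seed * 70000) % p);          // ROUND(MOD(seed*70000, prime), 0)
        return (r % 10) + 1;                             // MOD(..., 10) + 1  ->  1..10
    });
}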
Testing results:
I generated 500 lots of 10 random numbers (in columns instead of rows) for seed values incrementing by 0.0001 and counted the number of times each digit occurred for each prime number. You can see that each digit occurred nearly 500 times in total and that the distribution of each digit is nearly equal between each prime number. So, this may be adequate for your purposes.
Looking at the numbers generated in immediate succession, you can see similarities between adjacent prime numbers; they're not exactly the same, but they're pretty close in places, even if they're offset by a few rows. However, if the refresh occurs at random intervals, you'll still get seemingly random numbers and this should be sufficient for your purposes. Otherwise, you can still apply this approach to a more complex random number generator or try a different mix of prime numbers that are further apart.
Update 1: Trying to find a way of being able to specify the number of random numbers generated without storing a list of primes.
Attempt 1: Using a single prime with an array of seeds:
=MOD(ROUND(MOD(ROUND(NOW()+ROW(OFFSET(INDIRECT("A1"),0,0,SampleSize))/10000,4)*70000,1013),0),10)+1
This does give you an even distribution, but it really is just repeating the exact same sequence of 10 numbers over and over. Any analysis of the sample would be identical to analysing =MOD(ROW(1:SampleSize),10)+1. I think you want more variation than that!
Attempt 2: Working on a 2-dimensional array that still uses 10 primes....
Update 2: Didn't work. It had terrible performance. A new answer has been submitted that takes a similar but different approach.
OK, here's a solution where users can specify the number of values in defined name SAMPLESIZE
=MOD(ROUND(MOD(ROUND(NOW()+ROW(OFFSET(INDIRECT("A1"),0,0,SampleSize)),4)*10000*163,1013),0)+ROUND(MOD(ROUND(NOW()+ROW(OFFSET(INDIRECT("A1"),0,0,SampleSize))*2,4)*10000*211,1013),0)+ROUND(MOD(ROUND(NOW()+ROW(OFFSET(INDIRECT("A1"),0,0,SampleSize))*3,4)*10000*17,1013),0)+ROUND(MOD(ROUND(NOW()+ROW(OFFSET(INDIRECT("A1"),0,0,SampleSize))*5,4)*10000*179,53),0)+ROUND(MOD(ROUND(NOW()+ROW(OFFSET(INDIRECT("A1"),0,0,SampleSize))*7,4)*10000*6101,1013),0),10)+1
It's a long formula, but has good efficiency and can be used in other functions. Attempts at a shorter formula resulted in unusably poor performance and arrays that for some reason couldn't be used in other functions.
This solution combines 5 different prime number generators to increase variety in the generated random numbers. Some arbitrary constants were introduced to try to reduce repeating patterns.
This has correct distribution and fairly good randomness. Repeated testing with a SampleSize of 10,000 resulted in frequencies of individual numbers varying between 960 and 1040 with no overall favoritism. However it seems to have the strange property of never generating the same number twice in a row!
You can achieve this using just standard spreadsheet formulas.
One way is to use the so-called Lehmer random number method. It generates a sequence of random numbers in your spreadsheet that stays the same until you change the "seed number", a number you choose yourself; each seed number you choose will recreate a different random sequence.
The short version:
In cell B1, enter your "seed" number, it can be any number from 1 to 2,147,483,647
In cell B2 enter the formula =MOD(48271*B1,2^31-1) , this will generate the first random number of your sequence.
Now copy this cell down as far as the random sequence you want to generate.
That's it. For your named range, go ahead and name the range from B2 down as far as your sequence goes. If you want a different set of numbers, just change the seed in B1. If you ever want to recreate the same set of numbers just use the same seed and the same random sequence will appear.
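The same recurrence is easy to sketch outside a spreadsheet as well; here is a small JavaScript version of the Lehmer step used above (48271 and 2^31 - 1 are the parameters from the formula):
// Lehmer / MINSTD step: state = (48271 * state) mod (2^31 - 1).
// The product stays below 2^53, so plain JavaScript numbers are exact here.
function lehmer(seed) {
    var state = seed;                         // the seed (e.g. the value from cell B1 above)
    return function () {
        state = (48271 * state) % 2147483647;
        return state;                         // divide by 2147483647 for a value in (0, 1)
    };
}

var next = lehmer(12345);
console.log(next(), next(), next());          // same seed, same sequence every time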
More details in this tutorial:
How to generate random numbers that don't change in Excel and Google Sheets
It's not a great answer, but considering the limitations of volatile functions, one possible answer is to use the IF formula with a volatile function and a volatile variable placed somewhere in the worksheet.
I used the below formula to achieve the desired result
=IF(rngIsVolatile,randArray,A1:A10)
I set cell B12 as rngIsVolatile. I pasted the screenshots below to show it working.
When rngIsVolatile is set to True, it picks up new values from randArray:
When rngIsVolatile is set to False, it picks up old values from A1:A10:

How does the HyperLogLog algorithm work?

I've been learning about different algorithms in my spare time recently, and one that I came across which appears to be very interesting is called the HyperLogLog algorithm - which estimates how many unique items are in a list.
This was particularly interesting to me because it brought me back to my MySQL days, when I saw that "Cardinality" value (which until recently I always assumed was calculated, not estimated).
So I know how to write an algorithm in O(n) that will calculate how many unique items are in an array. I wrote this in JavaScript:
function countUniqueAlgo1(arr) {
    var Table = {};
    var numUnique = 0;
    var numDataPoints = arr.length;
    for (var j = 0; j < numDataPoints; j++) {
        var val = arr[j];
        if (Table[val] != null) {
            continue;
        }
        Table[val] = 1;
        numUnique++;
    }
    return numUnique;
}
But the problem is that my algorithm, while O(n), uses a lot of memory (storing values in Table).
I've been reading this paper about how to count the distinct elements in a list in O(n) time and using minimal memory.
It explains that by hashing and counting bits or something one can estimate within a certain probability (assuming the list is evenly distributed) the number of unique items in a list.
I've read the paper, but I can't seem to understand it. Can someone give a more layperson's explanation? I know what hashes are, but I don't understand how they are used in this HyperLogLog algorithm.
The main trick behind this algorithm is that if you, observing a stream of random integers, see an integer whose binary representation starts with some known prefix, there is a higher chance that the cardinality of the stream is 2^(size of the prefix).
That is, in a random stream of integers, ~50% of the numbers (in binary) start with "1", 25% start with "01", and 12.5% start with "001". This means that if you observe a random stream and see a "001", there is a higher chance that this stream has a cardinality of 8.
(The prefix "00..1" has no special meaning. It's there just because it's easy to find the most significant bit in a binary number in most processors)
Of course, if you observe just one integer, the chance this value is wrong is high. That's why the algorithm divides the stream into "m" independent substreams and keeps the maximum length of a seen "00...1" prefix for each substream. Then it estimates the final value by taking the mean of the values from each substream.
That's the main idea of this algorithm. There are some missing details (the correction for low estimate values, for example), but it's all well written in the paper. Sorry for the terrible English.
A HyperLogLog is a probabilistic data structure. It counts the number of distinct elements in a list. But in comparison to a straightforward way of doing it (having a set and adding elements to the set) it does this in an approximate way.
Before looking at how the HyperLogLog algorithm does this, one has to understand why you need it. The problem with the straightforward way is that it consumes O(distinct elements) of space. Why is there big O notation here instead of just "distinct elements"? This is because elements can be of different sizes. One element can be 1, another element can be "this is a big string". So if you have a huge list (or a huge stream of elements) it will take a lot of memory.
Probabilistic Counting
How can one get a reasonable estimate of the number of unique elements? Assume that you have a string of length m which consists of {0, 1} with equal probability. What is the probability that it will start with 0, with 2 zeros, with k zeros? It is 1/2, 1/4 and 1/2^k. This means that if you have encountered a string starting with k zeros, you have approximately looked through 2^k elements. So this is a good starting point. Having a list of elements that are evenly distributed between 0 and 2^k - 1, you can track the length of the longest prefix of zeros in the binary representation, and this will give you a reasonable estimate.
The problem is that the assumption of having evenly distributed numbers from 0 to 2^k - 1 is too hard to achieve (the data we encounter is mostly not numbers, almost never evenly distributed, and can lie between any values). But using a good hashing function you can assume that the output bits will be evenly distributed, and most hashing functions have outputs between 0 and 2^k - 1 (SHA1 gives you values between 0 and 2^160). So what we have achieved so far is that we can estimate the number of unique elements (up to a maximum cardinality of about 2^k) by storing only one number of log(k) bits. The downside is that we have a huge variance in our estimate. A cool thing is that we have almost recreated the 1984 probabilistic counting paper (it is a little bit smarter with the estimate, but still we are close).
LogLog
Before moving further, we have to understand why our first estimate is not that great. The reason behind it is that one random occurrence of an element with an unusually long 0-prefix can spoil everything. One way to improve it is to use many hash functions, count the max for each of the hash functions, and in the end average them out. This is an excellent idea, which will improve the estimate, but the LogLog paper used a slightly different approach (probably because hashing is kind of expensive).
They used one hash but divided it into two parts. One part is called a bucket (the total number of buckets is 2^x) and the other is basically the same as our hash. It was hard for me to get what was going on, so I will give an example. Assume you have two elements and your hash function, which gives values from 0 to 2^10, produced 2 values: 344 and 387. You decided to have 16 buckets. So you have:
0101 011000 bucket 5 will store 1 (the remaining bits have 1 leading zero)
0110 000011 bucket 6 will store 4 (the remaining bits have 4 leading zeros)
By having more buckets you decrease the variance (you use slightly more space, but it is still tiny). Using math skills they were able to quantify the error (which is 1.3/sqrt(number of buckets)).
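To make the bucket mechanics concrete, here is a rough JavaScript sketch of a LogLog-style estimator (simplified: only an approximate correction constant, and it assumes the items have already been hashed to 32-bit unsigned integers):
// Simplified LogLog-style estimator over already-hashed 32-bit values.
// Top 4 bits pick one of 16 buckets; each bucket keeps the largest "rank",
// i.e. the position of the first 1-bit in the remaining bits (one more than
// the leading-zero count used in the example above).
function logLogEstimate(hashes) {
    var m = 16;                                   // 2^4 buckets
    var buckets = new Array(m).fill(0);
    hashes.forEach(function (h) {
        var bucket = h >>> 28;                    // top 4 bits
        var rest = (h << 4) >>> 0;                // remaining 28 bits, shifted to the top
        var rank = rest === 0 ? 29 : Math.clz32(rest) + 1;
        if (rank > buckets[bucket]) buckets[bucket] = rank;
    });
    var avg = buckets.reduce(function (s, x) { return s + x; }, 0) / m;
    return 0.397 * m * Math.pow(2, avg);          // 0.397 ~ the LogLog correction constant
}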
HyperLogLog
HyperLogLog does not introduce any new ideas, but mostly uses a lot of math to improve the previous estimate. Researchers have found that if you remove 30% of the biggest numbers from the buckets you significantly improve the estimate. They also used another algorithm for averaging numbers. The paper is math-heavy.
And I want to finish with a recent paper, which shows an improved version of the HyperLogLog algorithm (up until now I didn't have time to fully understand it, but maybe later I will improve this answer).
The intuition is that if your input is a large set of random numbers (e.g. hashed values), they should distribute evenly over a range. Let's say the range is 10 bits, representing values up to 1024. Then observe the minimum value. Let's say it is 10. Then the cardinality will be estimated to be about 100 (since 10 × 100 ≈ 1024).
Read the paper for the real logic of course.
Another good explanation with sample code can be found here:
Damn Cool Algorithms: Cardinality Estimation - Nick's Blog

What's a good shuffle percentage?

I'm basically new to coding for random results, but did some reading and tested out the javascript version of the Fisher-Yates algorithm (as seen on wikipedia), with an ordered list.
I ended up adding code to make sure the array was shuffled differently than its initial order, and also calculated the percentage of how many objects were shuffled to a different position by the algorithm.
So I'm wondering what might be considered a good result. Kind of a generic question. If I shuffled a deck of cards, what would be the least acceptable amount of shuffle? Right now I have mine coded to repeat the algorithm if it comes out less than 25 percent shuffled.
What do you think?
Zero. You can make any number of checks you like to make it feel more random, but even the check for the same order makes your algorithm flawed.
If your shuffle algorithm is correctly implemented and produces a truly random shuffle (modulo your PRNG's randomness, or lack thereof), I wouldn't reshuffle at all. In particular, the fact that you don't accept random configurations that are 25+% similar to your original configuration tells an adversary that they can expect not to see any of those configurations after your shuffling completes.
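For reference, here is what a correct Fisher-Yates implementation looks like in plain JavaScript (a sketch of the standard algorithm, shuffling in place): every permutation, including the original order, comes out with equal probability, which is exactly why rejecting "insufficiently shuffled" results adds bias rather than removing it.
// Fisher-Yates: walk backwards, swapping each element with a uniformly chosen
// element at or before it. No reshuffling or "shuffle percentage" check needed.
function fisherYatesShuffle(arr) {
    for (var i = arr.length - 1; i > 0; i--) {
        var j = Math.floor(Math.random() * (i + 1));   // 0 <= j <= i
        var tmp = arr[i];
        arr[i] = arr[j];
        arr[j] = tmp;
    }
    return arr;
}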
Thanks for the feedback. All of your answers were relevant.
I'm adding this answer since my notion of shuffling went from random to non random, and now I'm using a hybrid of the two.
With the perfect in-shuffle I can generate multiple lists in different orders (how many depends on the number of items). However, when the number of items is odd, the last number does not get shuffled. So I decided to randomize its position in the shuffled list.
In the process of figuring that out, I made a table generator which displays all the lists possible given a number of items. It's pretty interesting. For example, the number 52 generates 52 columns, while the number 51 only generates 8 columns.

finding a number appearing again among numbers stored in a file

Say I have 10 billion numbers stored in a file. How would I find a number that has already appeared previously?
Well, I can't just load billions of numbers into an array in one go and then use a simple nested loop to check whether a number has appeared previously.
How would you approach this problem?
Thanks in advance :)
I had this as an interview question once.
Here is an algorithm that is O(N)
Use a hash table. Sequentially store pointers to the numbers, where the hash key is computed from the number value. Once you have a collision, you have found your duplicate.
Author Edit:
Below, @Phimuemue makes the excellent point that 4-byte integers have a fixed bound before a collision is guaranteed; that is 2^32, or approx. 4 GB. When considered in the conversation accompanying this answer, worst-case memory consumption by this algorithm is dramatically reduced.
Furthermore, using the bit array as described below can reduce memory consumption to 1/8th, 512 MB. On many machines, this computation is now possible without considering either a persistent hash, or the less-performant sort-first strategy.
Now, longer numbers or double-precision numbers are less-effective scenarios for the bit array strategy.
Phimuemue Edit:
Of course one needs to use a somewhat "special" hash table:
Take a hashtable consisting of 2^32 bits. Since the question asks about 4-byte integers, there are at most 2^32 different values, i.e. one bit for each number. 2^32 bits = 512 MB.
So now one just has to determine the location of the corresponding bit in the hashmap and set it. If one encounters a bit which is already set, the number occurred in the sequence already.
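A rough JavaScript sketch of that bit table, using a single 512 MB typed array (whether your runtime will actually let you allocate that much is a separate question):
// One bit per possible 32-bit value: 2^32 bits = 2^29 bytes = 512 MB.
var seen = new Uint8Array(1 << 29);        // zero-initialised by the runtime

function checkAndMark(n) {                 // n: unsigned 32-bit integer
    var index = n >>> 3;                   // n / 8
    var mask = 1 << (n & 7);               // n % 8
    if (seen[index] & mask) return true;   // bit already set: seen before
    seen[index] |= mask;
    return false;
}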
The important question is whether you want to solve this problem efficiently, or whether you want to solve it accurately.
If you truly have 10 billion numbers and just one single duplicate, then you are in a "needle in the haystack" type of situation. Intuitively, short of a very grimy and unstable solution, there is no hope of solving this without storing a significant amount of the numbers.
Instead, turn to probabilistic solutions, which have been used in most any practical application of this problem (in network analysis, what you are trying to do is look for mice, i.e., elements which appear very infrequently in a large data set).
A possible solution, which can be made to find exact results: use a sufficiently high-resolution Bloom filter. Either use the filter to determine if an element has already been seen, or, if you want perfect accuracy, use the filter (or, as kbrimington suggested, a standard hash table) to, eh, filter out elements which you can't possibly have seen before and, on a second pass, determine the elements you actually see twice.
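A toy Bloom-filter sketch in JavaScript, just to make the "filter out elements you can't possibly have seen" step concrete (two hash functions with arbitrary mixing constants; a real filter would size the bit array and the number of hashes from the expected element count and the acceptable false-positive rate):
// Toy Bloom filter: k = 2 hash functions over an m-bit array.
// add() can cause false positives in mightContain(), but never false negatives.
function makeBloomFilter(mBits) {
    var bits = new Uint8Array(Math.ceil(mBits / 8));
    function positions(n) {
        var h1 = Math.imul(n, 0x9e3779b1) >>> 0;              // arbitrary odd constants
        var h2 = Math.imul(n ^ 0x85ebca6b, 0xc2b2ae35) >>> 0;
        return [h1 % mBits, h2 % mBits];
    }
    return {
        add: function (n) {
            positions(n).forEach(function (i) { bits[i >>> 3] |= 1 << (i & 7); });
        },
        mightContain: function (n) {
            return positions(n).every(function (i) { return (bits[i >>> 3] & (1 << (i & 7))) !== 0; });
        }
    };
}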
And if your problem is slightly different---for instance, you know that you have at least 0.001% elements which repeat themselves twice, and you would like to find out how many there are approximately, or you would like to get a random sample of such elements---then a whole score of probabilistic streaming algorithms, in the vein of Flajolet & Martin, Alon et al., exist and are very interesting (not to mention highly efficient).
Read the file once, creating a hashtable that stores the number of times you encounter each item. But wait! Instead of using the item itself as a key, you use a hash of the item itself, for example the least significant bits, let's say 20 bits (about 1M buckets).
After the first pass, all items that have counter > 1 may point to a duplicated item, or be a false positive. Rescan the file, consider only items that may lead to a duplicate (looking up each item in table one), build a new hashtable using real values as keys now and storing the count again.
After the second pass, items with count > 1 in the second table are your duplicates.
This is still O(n), just twice as slow as a single pass.
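A sketch of the two passes in JavaScript (readNumbers is a placeholder for whatever streams the file's numbers, calling a callback once per value; it is not a real API):
// Pass 1: count by a reduced key (low 20 bits -> ~1M buckets).
// Pass 2: count by real value, but only for numbers whose bucket was hit more than once.
function findDuplicates(readNumbers) {           // readNumbers(cb) calls cb(n) once per number
    var buckets = new Uint32Array(1 << 20);
    readNumbers(function (n) { buckets[n & 0xFFFFF]++; });

    var counts = {};                             // far smaller than counting every value
    readNumbers(function (n) {
        if (buckets[n & 0xFFFFF] > 1) counts[n] = (counts[n] || 0) + 1;
    });

    return Object.keys(counts).filter(function (k) { return counts[k] > 1; });
}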
How about:
Sort the input using some algorithm which allows only a portion of the input to be in RAM at a time (external sorting). Examples are out there.
Seek duplicates in output of 1st step -- you'll need space for just 2 elements of input in RAM at a time to detect repetitions.
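Step 2 is then just a comparison of neighbouring values in the sorted output; a JavaScript sketch (readSortedNumbers is a placeholder that calls a callback for each number of the externally sorted file, in order):
// After external sorting, equal values sit next to each other, so only the
// previous value needs to be held in memory.
function firstDuplicateInSorted(readSortedNumbers) {
    var prev = null, found = null;
    readSortedNumbers(function (n) {
        if (found === null && n === prev) found = n;
        prev = n;
    });
    return found;                               // null if there is no duplicate
}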
Finding duplicates
Noting that it's a 32-bit integer means that you're going to have a large number of duplicates, since a 32-bit int can only represent about 4.3 billion different numbers and you have "10 billions".
If you were to use a tightly packed set you could represent all the possibilities in 512 MB, which can easily fit into current RAM sizes. As a start, this pretty easily allows you to recognise whether a number is duplicated or not.
Counting Duplicates
If you need to know how many times a number is duplicated, you're getting into having a hashmap that contains only duplicates (using the first 512 MB of RAM to tell efficiently IF it should be in the map or not). In a worst-case scenario with a large spread you're not going to be able to fit that into RAM.
Another approach, if the numbers will have a fairly even spread of duplicates, is to use a tightly packed array with 2-8 bits per value, taking about 1-4 GB of RAM and allowing you to count up to 255 occurrences of each number.
It's going to be a hack, but it's doable.
You need to implement some sort of looping construct to read the numbers one at a time since you can't have them in memory all at once.
How? Oh, what language are you using?
You have to read each number and store it into a hashmap, so that if a number occurs again, it will automatically get discarded.
If the possible range of numbers in the file is not too large, then you can use a bit array to indicate whether each number in the range has appeared.
If the range of the numbers is small enough, you can use a bit field to store if it is in there - initialize that with a single scan through the file. Takes one bit per possible number.
With large range (like int) you need to read through the file every time. File layout may allow for more efficient lookups (i.e. binary search in case of sorted array).
If time is not an issue and RAM is, you could read each number and then compare it to each subsequent number by reading from the file without storing it in RAM. It will take an incredible amount of time but you will not run out of memory.
I have to agree with kbrimington and his idea of a hash table, but first of all, I would like to know the range of the numbers that you're looking for. Basically, if you're looking for 32-bit numbers, you would need a single array of 4,294,967,296 bits. You start by setting all bits to 0 and every number in the file will set a specific bit. If the bit is already set then you've found a number that has occurred before. Do you also need to know how often they occur? Still, it would need 536,870,912 bytes at least (512 MB). It's a lot and would require some crafty programming skills. Depending on your programming language and personal experience, there would be hundreds of solutions to solve it this way.
Had to do this a long time ago.
What I did... I sorted the numbers as much as I could (I had a time-constraint limit) and arranged them like this while sorting:
1 to 10, 12, 16, 20 to 50, 52 would become..
[1,10], 12, 16, [20,50], 52, ...
Since in my case I had hundreds of numbers that were very "close" ($a-$b=1), from a few million sets I had very low memory usage.
P.S. Another way to store them:
1, -9, 12, 16, 20, -30, 52,
when I had no numbers lower than zero.
After that I applied various algorithms (described by other posters here) on the reduced data set.
#include <stdio.h>
#include <stdlib.h>

/* Macro is overly general but I left it 'cos it's convenient */
#define BITOP(a,b,op) \
 ((a)[(size_t)(b)/(8*sizeof *(a))] op (size_t)1<<((size_t)(b)%(8*sizeof *(a))))

int main(void)
{
    unsigned x = 0;
    int r;
    /* One bit per possible value: 2^32 bits = 512 MB, zero-initialised. */
    size_t *seen = calloc(1, (size_t)1 << (8 * sizeof(unsigned) - 3));
    if (!seen) return 1;
    /* Stop at end of input or as soon as a value's bit is already set. */
    while ((r = scanf("%u", &x)) > 0 && !BITOP(seen, x, &)) BITOP(seen, x, |=);
    if (r > 0) printf("duplicate is %u\n", x);
    else printf("no duplicate\n");
    free(seen);
    return 0;
}
This is a simple problem that can be solved very easily (several lines of code) and very fast (several minutes of execution) with the right tools.
My personal approach would be to use MapReduce.
MapReduce: Simplified Data Processing on Large Clusters
I'm sorry for not going into more detail, but once you get familiar with the concept of MapReduce it is going to be very clear how to target the solution.
Basically we are going to implement two simple functions:
Map(key, value)
Reduce(key, values[])
so all in all:
open file and iterate through the data
for each number -> Map(number, line_index)
in the reduce we will get the number as the key and the total occurrences as the number of values (including their positions in the file)
so in Reduce(key, values[]), if the number of values > 1 then it's a duplicate number
print the duplicates : number, line_index1, line_index2,...
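A toy, single-process JavaScript sketch of those two functions, just to show the shape (a real MapReduce framework would distribute the map, shuffle and reduce phases across the cluster; lines is a placeholder for the file's contents):
// Map: emit (number, lineIndex). Shuffle/group by key. Reduce: a key with
// more than one value is a duplicate.
function mapReduceDuplicates(lines) {
    var groups = {};                                   // stands in for the framework's grouping
    lines.forEach(function (line, lineIndex) {         // Map(number, line_index)
        var number = line.trim();
        (groups[number] = groups[number] || []).push(lineIndex);
    });

    var duplicates = [];
    Object.keys(groups).forEach(function (number) {    // Reduce(key, values[])
        if (groups[number].length > 1) duplicates.push(number + ": " + groups[number].join(", "));
    });
    return duplicates;
}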
Again, this approach can result in very fast execution depending on how your MapReduce framework is set up; it is highly scalable and very reliable, and there are many different implementations of MapReduce in many languages.
There are several top companies offering already-built cloud computing environments, like Google, Microsoft Azure, Amazon AWS, ...
Or you can build your own and set up a cluster with any provider offering virtual computing environments, paying very low costs by the hour.
good luck :)
Another, simpler approach could be to use Bloom filters.
Implement a BitArray such that the ith byte of this array corresponds to the numbers 8*i to 8*i+7, i.e. the first bit of the ith byte is 1 if we have already seen 8*i, the second bit is 1 if we have already seen 8*i+1, and so on.
Initialize this bit array with size Integer.MAX/8 and whenever you see a number k, set the (k%8)th bit of the (k/8)th byte to 1. If this bit is already 1, it means you have seen this number already.
