How to sort an int array in linear time? - c

I was given homework to write a program that sorts an array in ascending order. I did this:
#include <stdio.h>

int main()
{
    int a[100], i, n, j, temp;
    printf("Enter the number of elements: ");
    scanf("%d", &n);
    for (i = 0; i < n; ++i)
    {
        printf("%d. Enter element: ", i + 1);
        scanf("%d", &a[i]);
    }
    for (j = 0; j < n; ++j)
        for (i = j + 1; i < n; ++i)
        {
            if (a[j] > a[i])
            {
                temp = a[j];
                a[j] = a[i];
                a[i] = temp;
            }
        }
    printf("Ascending order: ");
    for (i = 0; i < n; ++i)
        printf("%d ", a[i]);
    return 0;
}
The input will not be more than 10 numbers. Can this be done in less code than I did here? I want the code to be as short as possible. Any help will be appreciated. Thanks!

If you know the range of the array elements, one way is to use another array to store the frequency of each element (all elements should be int :) ) and print the sorted array from those counts. I am posting it for a large number of elements (up to 10^6); you can reduce it according to your need:
#include <stdio.h>
#include <stdlib.h> // for malloc/free
#include <string.h> // for memset

int main(void){
    int t, num, *freq = malloc(sizeof(int) * 1000001);
    memset(freq, 0, sizeof(int) * 1000001); // Set all elements of freq to 0
    scanf("%d", &t); // Ask for the number of elements to be scanned (upper limit is 1000000)
    for (int i = 0; i < t; i++){
        scanf("%d", &num);
        freq[num]++;
    }
    for (int i = 0; i < 1000001; i++){
        if (freq[i]){
            while (freq[i]--){
                printf("%d\n", i);
            }
        }
    }
    free(freq);
}
This algorithm can be modified further. The modified version is known as Counting sort and it sorts the array in Θ(n) time.
Counting sort¹:
Counting sort assumes that each of the n input elements is an integer in the range
0 to k, for some integer k. When k = O(n), the sort runs in Θ(n) time.
Counting sort determines, for each input element x, the number of elements less
than x. It uses this information to place element x directly into its position in the
output array. For example, if 17 elements are less than x, then x belongs in output
position 18. We must modify this scheme slightly to handle the situation in which
several elements have the same value, since we do not want to put them all in the
same position.
In the code for counting sort, we assume that the input is an array A[1..n] and
thus A.length = n. We require two other arrays: the array B[1..n] holds the
sorted output, and the array C[0..k] provides temporary working storage.
The pseudocode for this algorithm:
for i ← 0 to k do
    C[i] ← 0
for j ← 1 to n do
    C[A[j]] ← C[A[j]] + 1
// C[i] now contains the number of elements equal to i
for i ← 1 to k do
    C[i] ← C[i] + C[i-1]
// C[i] now contains the number of elements ≤ i
for j ← n downto 1 do
    B[C[A[j]]] ← A[j]
    C[A[j]] ← C[A[j]] - 1
1. Content has been taken from Introduction to Algorithms by
Thomas H. Cormen and others.
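For completeness, a minimal C rendering of that pseudocode, offered as a sketch: it uses 0-based arrays instead of the book's 1-based ones, and k is the assumed maximum value.

#include <stdlib.h>

/* Sorts a[0..n-1] into b[0..n-1]; every element must lie in [0, k]. */
void counting_sort(const int *a, int *b, size_t n, int k)
{
    int *c = calloc(k + 1, sizeof *c); /* counts, all zero */
    if (!c) return;
    for (size_t j = 0; j < n; j++)
        c[a[j]]++;                     /* c[i] = number of elements equal to i */
    for (int i = 1; i <= k; i++)
        c[i] += c[i - 1];              /* c[i] = number of elements <= i */
    for (size_t j = n; j-- > 0; ) {    /* walk backwards to keep the sort stable */
        b[c[a[j]] - 1] = a[j];         /* 0-based position of a[j] in the output */
        c[a[j]]--;
    }
    free(c);
}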

You have 10 lines doing the sorting. If you're allowed to use someone else's work (subsequent notes indicate that you can't do this), you can reduce that by writing a comparator function and calling the standard C library qsort() function:
static int compare_int(void const *v1, void const *v2)
{
    int i1 = *(int const *)v1;
    int i2 = *(int const *)v2;
    if (i1 < i2)
        return -1;
    else if (i1 > i2)
        return +1;
    else
        return 0;
}
And then the call is:
qsort(a, n, sizeof(a[0]), compare_int);
Now, I wrote the function the way I did for a reason. In particular, it avoids arithmetic overflow which writing this does not:
static int compare_int(void const *v1, void const *v2)
{
    return *(int const *)v1 - *(int const *)v2;
}
Also, the original pattern generalizes to comparing structures, etc. You compare the first field for inequality, returning the appropriate result; if the first fields are equal, you compare the second fields; then the third, then the Nth, only returning 0 if every comparison shows the values are equal.
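For example, a comparator over a hypothetical two-field struct (the struct and its fields are made up purely for illustration) follows the same pattern:

struct person {
    int age;
    int id;
};

static int compare_person(void const *v1, void const *v2)
{
    const struct person *p1 = v1;
    const struct person *p2 = v2;
    if (p1->age < p2->age)
        return -1;
    if (p1->age > p2->age)
        return +1;
    /* ages are equal, so move on to the next field */
    if (p1->id < p2->id)
        return -1;
    if (p1->id > p2->id)
        return +1;
    return 0;
}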
Obviously, if you're supposed to write the sort algorithm yourself, then you'll have to do a little more work than calling qsort(). Your algorithm is a Bubble Sort, one of the most inefficient sorting techniques: it is O(N²). You can look up Insertion Sort (also O(N²), but more efficient than Bubble Sort), or Selection Sort (also quadratic), or Shell Sort (very roughly O(N^(3/2))), or Heap Sort (O(N log N)), or Quick Sort (O(N log N) on average, but O(N²) in the worst case), or Intro Sort. The only ones that might be shorter than what you wrote are Insertion and Selection sorts; the others will be longer but faster for large amounts of data. For small sets like 10 or 100 numbers, efficiency is immaterial; all sorts will do. But as you get towards 1,000 or 1,000,000 entries, the choice of sorting algorithm really matters. You can find a lot of questions on Stack Overflow about different sorting algorithms, and information in Wikipedia for any and all of the algorithms mentioned.
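If you do have to write the sort yourself, an insertion sort is roughly as short as your current loops; a sketch:

void insertion_sort(int a[], int n)
{
    for (int i = 1; i < n; i++) {
        int key = a[i];
        int j = i - 1;
        while (j >= 0 && a[j] > key) { /* shift larger elements right */
            a[j + 1] = a[j];
            j--;
        }
        a[j + 1] = key;                /* drop the key into the gap */
    }
}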
Incidentally, if the input won't be more than 10 numbers, you don't need an array of size 100.

Related

HackerEarth (basic I/O question): Play With Numbers [subarray]

I have been trying to solve this problem and it works fine with small numbers, but not with the big 10^9 numbers on HackerEarth.
You are given an array of n numbers and q queries. For each query you have to print the floor of the expected value (mean) of the subarray from L to R.
INPUT:
First line contains two integers N and Q denoting number of array elements and number of queries.
Next line contains N space separated integers denoting array elements.
Next Q lines contain two integers L and R(indices of the array).
OUTPUT:
print a single integer denoting the answer.
Constraints:
1 <= N, Q, L, R <= 10^6
1 <= Array elements <= 10^9
NOTE
Use Fast I/O
#include <iostream>
#include <vector>
using namespace std;

long int solvepb(int a, int b, long int *arr, int n){
    int result, count = 0;
    vector<long int> res;
    for(int i = 0; i < n; i++){
        if(i + 1 >= a && i + 1 <= b){
            res.push_back(arr[i]);
            count += arr[i];
        }
    }
    result = count / res.size();
    return result;
}

int main(){
    int n, q; cin >> n >> q;
    long int arr[n];
    for(int i = 0; i < n; i++){
        cin >> arr[i];
    }
    while(q--){
        int a, b;
        cin >> a >> b;
        cout << solvepb(a, b, arr, n) << endl;
    }
    return 0;
}
So currently, the issue with your algorithm is that each query recomputes the mean by iterating over a range of the array. This means that if the queries are particularly bad, for each of the Q queries you might iterate through all N elements of the array.
How can one try to reduce this? Notice that because sums are additive, the sum up to an index i is the same as the sum up to an index j plus the sum of the numbers between i and j. Let me rewrite that as an equation -
sum[0:i] = sum[0:j] + sum[j+1:i]
It should be obvious now that by rearranging this equation, you can quickly get the sum between two indices by storing the sum of numbers up to an index. (i.e. sum[j+1:i] = sum[0:i] - sum[0:j]). This means that rather than having O(N*Q), you can have O(N + Q) runtime complexity. The O(N) part of the new complexity is from iterating the array once to get all the sums. The O(Q) part comes from answering the Q queries.
This kind of approach is called prefix sums. There are some optimized data structures like Fenwick trees made specifically for prefix sums that you can read about online or on Wikipedia. But for your question, a simple array should work just fine.
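Here is a minimal sketch of the prefix-sum idea in C (untested against the judge; note the long long running sum, since with values up to 10^9 and up to 10^6 elements a plain int sum can overflow):

#include <stdio.h>

#define MAXN 1000000

/* prefix[i] holds the sum of the first i elements, so prefix[0] = 0
 * and the sum of elements l..r (1-based, inclusive) is
 * prefix[r] - prefix[l-1]. */
static long long prefix[MAXN + 1];

int main(void)
{
    int n, q;
    if (scanf("%d %d", &n, &q) != 2)
        return 1;
    for (int i = 1; i <= n; i++) {
        long long x;
        scanf("%lld", &x);
        prefix[i] = prefix[i - 1] + x;
    }
    while (q--) {
        int l, r;
        scanf("%d %d", &l, &r);
        long long sum = prefix[r] - prefix[l - 1];
        /* elements are positive, so integer division is the floor */
        printf("%lld\n", sum / (r - l + 1));
    }
    return 0;
}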
A few comments about your code:
In your for loop in the solvepb function you always go from 0 to n, but you don't need to: you could go from a to b directly when a is smaller than b, and from b to a otherwise.
You also do not really use the vector. The vector in the solvepb function stores array elements, but these are never used again. You only seem to use it to find the number of elements from a to b, but you can get that directly from the two indices (i.e. b-a+1 if a < b, otherwise a-b+1).

Shuffle an array while making each index have the same probability to be in any index

I want to shuffle an array such that each index has the same probability to end up at any other index (excluding itself).
I have this solution, only I find that the last 2 indexes are always swapped with each other:
void Shuffle(int arr[], size_t n)
{
    int newIndx = 0;
    int i = 0;
    for (; i < n - 2; ++i)
    {
        newIndx = rand() % (n - 1);
        if (newIndx >= i)
        {
            ++newIndx;
        }
        swap(i, newIndx, arr);
    }
}
But in the end, it might be that some indexes end up back in their original place once again.
Any thoughts?
C lang.
A permutation (shuffle) where no element is in its original place is called a derangement.
Generating random derangements is harder than generating random permutations, but it can still be done in linear time and space. (Generating a random permutation can be done in linear time and constant space.) Here are two possible algorithms.
The simplest solution to understand is a rejection strategy: do a Fisher-Yates shuffle, but if the shuffle attempts to put an element at its original spot, restart the shuffle. [Note 1]
Since the probability that a random shuffle is a derangement is approximately 1/e, the expected number of shuffles performed is about e (that is, 2.71828…). But since unsuccessful shuffles are restarted as soon as the first fixed point is encountered, the total number of shuffle steps is less than e times the array size. For a detailed analysis, see this paper, which proves the expected number of random numbers needed by the algorithm to be around (e−1) times the number of elements.
In order to be able to do the check and restart, you need to keep an array of indices. The following little function produces a derangement of the indices from 0 to n-1; it is necessary to then apply the permutation to the original array.
/* n must be at least 2 for this to produce meaningful results */
void derange(size_t n, int ind[]) {
    for (size_t i = 0; i < n; ++i) ind[i] = (int)i;
    swap(ind, 0, randint(1, n));
    for (size_t i = 1; i < n; ++i) {
        int r = randint(i, n);
        swap(ind, i, r);
        if (ind[i] == (int)i) i = 0;
    }
}
Here are the two functions used by that code:
void swap(int arr[], size_t i, size_t j) {
    int t = arr[i]; arr[i] = arr[j]; arr[j] = t;
}

/* This is not the best possible implementation */
int randint(int low, int lim) {
    return low + rand() % (lim - low);
}
The following function is based on the 2008 paper "Generating Random Derangements" by Conrado Martínez, Alois Panholzer and Helmut Prodinger, although I use a different mechanism to track cycles. Their algorithm uses a bit vector of size N but uses a rejection strategy in order to find an element which has not been marked. My algorithm uses an explicit vector of indices not yet operated on. The vector is also of size N, which is still O(N) space [Note 2]; since in practical applications, N will not be large, the difference is not IMHO significant. The benefit is that selecting the next element to use can be done with a single call to the random number generator. Again, this is not particularly significant since the expected number of rejections in the MP&P algorithm is very small. But it seems tidier to me.
The basis of the algorithms (both MP&P and mine) is the recursive procedure to produce a derangement. It is important to note that a derangement is necessarily the composition of some number of cycles where each cycle is of size greater than 1. (A cycle of size 1 is a fixed point.) Thus, a derangement of size N can be constructed from a smaller derangement using one of two mechanisms:
Produce a derangement of the N-1 elements other than element N, and add N to some cycle at any point in that cycle. To do so, randomly select any element j among those N-1 elements and place N immediately after j in j's cycle. This alternative covers all possibilities where N is in a cycle of size ≥ 3.
Produce a derangement of N-2 of the N-1 elements other than N, and add a cycle of size 2 consisting of N and the element not selected from the smaller derangement. This alternative covers all possibilities where N is in a cycle of size 2.
If Dn is the number of derangements of size n, it is easy to see from the above recursion that:
D(n) = (n−1) × (D(n−1) + D(n−2))
The multiplier is n−1 in both cases: in the first alternative, it refers to the number of possible places N can be added, and in the second alternative to the number of possible ways to select n−2 elements of the recursive derangement.
Therefore, if we were to recursively produce a random derangement of size N, we would randomly select one of the N-1 previous elements, and then make a random boolean decision on whether to produce alternative 1 or alternative 2, weighted by the number of possible derangements in each case.
One advantage to this algorithm is that it can derange an arbitrary vector; there is no need to apply the permuted indices to the original vector as with the rejection algorithm.
As MP&P note, the recursive algorithm can just as easily be performed iteratively. This is quite clear in the case of alternative 2, since the new 2-cycle can be generated either before or after the recursion, so it might as well be done first and then the recursion is just a loop. But that is also true for alternative 1: we can make element N the successor in a cycle to a randomly-selected element j even before we know which cycle j will eventually be in. Looked at this way, the difference between the two alternatives reduces to whether or not element j is removed from future consideration.
As shown by the recursion, alternative 2 should be chosen with probability (n−1)·D(n−2)/D(n), which is how MP&P write their algorithm. I used the equivalent formula D(n−2) / (D(n−1) + D(n−2)), mostly because my prototype used Python (for its built-in bignum support).
Without bignums, the number of derangements and hence the probabilities need to be approximated as double, which will create a slight bias and limit the size of the array to be deranged to about 170 elements. (long double would allow slightly more.) If that is too much of a limitation, you could implement the algorithm using some bignum library. For ease of implementation, I used the Posix drand48 function to produce random doubles in the range [0.0, 1.0). That's not a great random number function, but it's probably adequate to the purpose and is available in most standard C libraries.
Since no attempt is made to verify the uniqueness of the elements in the vector to be deranged, a vector with repeated elements may produce a derangement where one or more of these elements appear to be in the original place. (It's actually a different element with the same value.)
The code:
#include <stdbool.h>
#include <stdlib.h> /* for drand48 */

/* Deranges the vector `arr` (of length `n`) in place, to produce
 * a permutation of the original vector where every element has
 * been moved to a new position. Returns `true` unless the derangement
 * failed because `n` was 1.
 */
bool derange(int arr[], size_t n) {
    if (n < 2) return n != 1;
    /* Compute derangement counts ("subfactorials") */
    double subfact[n];
    subfact[0] = 1;
    subfact[1] = 0;
    for (size_t i = 2; i < n; ++i)
        subfact[i] = (i - 1) * (subfact[i - 2] + subfact[i - 1]);
    /* The vector 'todo' is the stack of elements which have not yet
     * been (fully) deranged; `u` is the count of elements in the stack
     */
    size_t todo[n];
    for (size_t i = 0; i < n; ++i) todo[i] = i;
    size_t u = n;
    /* While the stack is not empty, derange the element at the
     * top of the stack with some element lower down in the stack
     */
    while (u) {
        size_t i = todo[--u];     /* Pop the stack */
        size_t j = u * drand48(); /* Get a random stack index */
        swap(arr, i, todo[j]);    /* i will follow j in its cycle */
        /* If we're generating a 2-cycle, remove the element at j */
        if (drand48() * (subfact[u - 1] + subfact[u]) < subfact[u - 1])
            todo[j] = todo[--u];
    }
    return true;
}
Notes
1. Many people get this wrong, particularly in social occasions such as "secret friend" selection (I believe this is sometimes called "the Santa game" in other parts of the world). The incorrect algorithm is to just choose a different swap if the random shuffle produces a fixed point, unless the fixed point is at the very end, in which case the shuffle is restarted. That will produce a derangement, but the selection is biased, particularly for small vectors. See this answer for an analysis of the bias.
2. Even if you don't use the RAM model where all integers are considered fixed size, the space used is still linear in the size of the input in bits, since N distinct input values must have at least N log N bits. Neither this algorithm nor MP&P makes any attempt to derange lists with repeated elements, which is a much harder problem.
Your algorithm is only almost correct (and in algorithmics, "almost correct" means unexpected results). Because of a few little errors scattered through it, it will not produce the expected results.
First, rand() % N is not guaranteed to produce a uniform distribution, unless N is a divisor of the number of possible values; in any other case you will get a slight bias. In any case, my man page for rand describes it as a bad random number generator, so you should try to use random or, if available, arc4random_uniform.
But preventing any index from coming back to its original place is both uncommon and rather hard to achieve. The only way I can imagine is to keep an array of the indices [0; n[ and apply the same swaps to it as to the real array, so that you can always tell the original index of a number.
The code could become:
void Shuffle(int arr[], size_t n)
{
    int i, newIndx;
    int *indexes = malloc(n * sizeof(int));
    for (i = 0; i < n; i++) indexes[i] = i;
    for (i = 0; i < n - 1; ++i) // beware of the inequality!
    {
        int i1;
        // search if index i is in the [i; n[ part of the current array:
        for (i1 = i; i1 < n; ++i1) {
            if (indexes[i1] == i) { // move it to position i
                if (i1 != i) { // nothing to do if already at i
                    swap(i, i1, arr);
                    swap(i, i1, indexes);
                }
                break;
            }
        }
        i1 = (i1 == n) ? i : i + 1; // we will start the search at i1
                                    // to guarantee that no element keeps its place
        newIndx = i1 + arc4random_uniform(n - i1);
        /* if arc4random is not available:
        newIndx = i1 + (random() % (n - i1));
        */
        swap(i, newIndx, arr);
        swap(i, newIndx, indexes);
    }
    /* special case: a permutation of [0; n-1[ may have left the last element
     * in place; we exchange the last element with a random one
     */
    if (indexes[n-1] == n-1) {
        newIndx = arc4random_uniform(n - 1);
        swap(n-1, newIndx, arr);
        swap(n-1, newIndx, indexes);
    }
    free(indexes); // don't forget to free what we have malloc'ed...
}
Beware: the algorithm should be correct, but the code has not been tested and can contain typos...

What is the best way to find N consecutive elements of a sorted version of an unordered array?

For instance: I have an unsorted list A of 10 elements. I need the sublist of k consecutive elements from i through i+k-1 of the sorted version of A.
Example:
Input: A { 1, 6, 13, 2, 8, 0, 100, 3, -4, 10 }
k = 3
i = 4
Output: sublist B { 2, 3, 6 }
If i and k are specified, you can use a specialized version of quicksort where you stop recursion on parts of the array that fall outside of the i .. i+k range. If the array can be modified, perform this partial sort in place, if the array cannot be modified, you will need to make a copy.
Here is an example:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

// Partial Quick Sort using Hoare's original partition scheme
void partial_quick_sort(int *a, int lo, int hi, int c, int d) {
    if (lo < d && hi > c && hi - lo > 1) {
        int x, pivot = a[lo];
        int i = lo - 1;
        int j = hi;
        for (;;) {
            while (a[++i] < pivot)
                continue;
            while (a[--j] > pivot)
                continue;
            if (i >= j)
                break;
            x = a[i];
            a[i] = a[j];
            a[j] = x;
        }
        partial_quick_sort(a, lo, j + 1, c, d);
        partial_quick_sort(a, j + 1, hi, c, d);
    }
}

void print_array(const char *msg, int a[], int count) {
    printf("%s: ", msg);
    for (int i = 0; i < count; i++) {
        printf("%d%c", a[i], " \n"[i == count - 1]);
    }
}

int int_cmp(const void *p1, const void *p2) {
    int i1 = *(const int *)p1;
    int i2 = *(const int *)p2;
    return (i1 > i2) - (i1 < i2);
}

#define MAX 1000000

int main(void) {
    int *a = malloc(MAX * sizeof(*a));
    clock_t t;
    int i, k;

    srand((unsigned int)time(NULL));
    for (i = 0; i < MAX; i++) {
        a[i] = rand();
    }
    i = 20;
    k = 10;
    printf("extracting %d elements at %d from %d total elements\n",
           k, i, MAX);
    t = clock();
    partial_quick_sort(a, 0, MAX, i, i + k);
    t = clock() - t;
    print_array("partial qsort", a + i, k);
    printf("elapsed time: %.3fms\n", t * 1000.0 / CLOCKS_PER_SEC);
    t = clock();
    qsort(a, MAX, sizeof *a, int_cmp);
    t = clock() - t;
    print_array("complete qsort", a + i, k);
    printf("elapsed time: %.3fms\n", t * 1000.0 / CLOCKS_PER_SEC);
    return 0;
}
Running this program with an array of 1 million random integers, extracting the 10 entries of the sorted array starting at offset 20 gives this output:
extracting 10 elements at 20 from 1000000 total elements
partial qsort: 33269 38347 39390 45413 49479 50180 54389 55880 55927 62158
elapsed time: 3.408ms
complete qsort: 33269 38347 39390 45413 49479 50180 54389 55880 55927 62158
elapsed time: 149.101ms
It is indeed much faster (20x to 50x) than sorting the whole array, even with a simplistic choice of pivot. Try multiple runs and see how the timings change.
An idea could be to scan your array once, keeping every number that is greater than or equal to the i-th smallest element and less than or equal to the (i+k)-th smallest, and add those to another list/container.
This takes O(n) and gives you an unordered list of the numbers you need. Then you sort that (much shorter) list, O(k log k), and you are done.
For really big arrays the advantage of this method is that you sort a much smaller list of numbers (given that k is relatively small).
You can use Quickselect, or a heap selection algorithm to get the i+k smallest items. Quickselect works in-place, but it modifies the original array. It also won't work if the list of items is larger than will fit in memory. Quickselect is O(n), but with a fairly high constant. When the number of items you are selecting is a very small fraction of the total number of items, the heap selection algorithm is faster.
The idea behind the heap selection algorithm is that you initialize a max-heap with the first i+k items. Then, iterate through the rest of the items. If an item is smaller than the largest item on the max-heap, remove the largest item from the max-heap and replace it with the new, smaller item. When you're done, you have the first i+k items on the heap, with the largest k items at the top.
The code is pretty simple:
heap = new max_heap();
add first `i+k` items from a[] to heap
for all remaining items in a[]
    if item < heap.peek()
        heap.pop()
        heap.push(item)
    end-if
end-for
// at this point the smallest i+k items are on the heap
This requires O(i+k) extra memory, and worst case running time is O(n log(i+k)). When (i+k) is less than about 2% of n, it will usually outperform Quickselect.
For much more information about this, see my blog post When theory meets practice.
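A possible C sketch of that heap selection, with a hand-rolled binary max-heap (m stands for i+k):

static void sift_down(int heap[], int size, int pos)
{
    for (;;) {
        int child = 2 * pos + 1;
        if (child >= size)
            break;
        if (child + 1 < size && heap[child + 1] > heap[child])
            child++;                  /* pick the larger child */
        if (heap[pos] >= heap[child])
            break;
        int t = heap[pos]; heap[pos] = heap[child]; heap[child] = t;
        pos = child;
    }
}

/* Leaves the m smallest items of a[0..n-1] in out[0..m-1] as a max-heap
 * (the largest of them at out[0]); assumes 0 < m <= n. */
void heap_select(const int a[], int n, int m, int out[])
{
    for (int i = 0; i < m; i++)
        out[i] = a[i];
    for (int i = m / 2 - 1; i >= 0; i--)
        sift_down(out, m, i);         /* Floyd-style heapify */
    for (int i = m; i < n; i++) {
        if (a[i] < out[0]) {          /* smaller than the current m-th smallest */
            out[0] = a[i];
            sift_down(out, m, 0);
        }
    }
}

Popping the root k times then yields exactly the elements at sorted positions i through i+k-1, largest first.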
By the way, you can optimize your memory usage somewhat based on i. That is, if there are a billion items in the array and you want items 999,999,000 through 999,999,910, the standard method above would require a huge heap. But you can re-cast that problem to one in which you need to select the smallest of the last 1,000 items. Your heap then becomes a min-heap of 1,000 items. It just takes a little math to determine which way will require the smallest heap.
That doesn't help much, of course, if you want items 600,000,000 through 600,000,010, because your heap still has 400 million items in it.
It occurs to me, though, that if time isn't a huge issue, you can just build the heap in the array in-place using Floyd's algorithm, pop the first i items like you would with heap sort, and the next k items are what you're looking for. This would require constant extra space and O(n + (i+k)*log(n)) time.
Come to think of it, you could implement the heap selection logic with a heap of (i+k) items (as described above) in-place, as well. It would be a little tricky to implement, but it wouldn't require any extra space and would have the same running time O(n*log(i+k)).
Note that both would modify the original array.
One thing you could do is modify heapsort: first build the heap, then pop the first i elements. The next k elements you pop from the heap are your result. Discarding the remaining n - i - k elements lets the algorithm terminate early.
The result is in O((i + k) log n), which is in O(n log n), but it is significantly faster for relatively low values of i and k.

Given a sorted array with a few numbers in between reversed. How to sort it?

I am actually trying to solve a problem where I have an array which is sorted but a few numbers are reversed. For example: 1 2 3 4 9 8 7 11 12 14 is the array.
Now, my first thought was applying a binary search algorithm to find a PEAK (a[i]>a[i+1] && a[i]>a[i-1]).
However, I feel it might not always give the correct result. Moreover it might not be efficient since the list is almost sorted.
Next impression: applying insertion sort, since the list is sorted and insertion sort gives its best performance in such a case, IF I am not wrong.
So can anyone suggest better solutions, or say whether my solutions are correct or not? Efficient or inefficient?
P.S - This is NOT homework!
UPDATE: Insertion sort (O(n) in this case), or a linear scan to find the subsequence and then reversing it (O(n) again). Is there any chance to optimize this further? Or perhaps do it in O(log n)?
Search linearly for the first inversion (i.e. a[i+1] < a[i]), call its index inv1. Continue until inversions stop, call the last index inv2. Reverse the array between inv1 and inv2, inclusive.
In your example, inv1 is 4, and inv2 is 6; array elements are numbered from zero.
The algorithm is linear in the number of entries in the original.
If you're sure that the list is sorted except for an embedded subsequence that is reversed, I suggest that you do a simple scan, detect the start of the reversed subsequence (by finding the first counter-directional change), scan to the end of the subsequence (where the changes resume the correct direction) and reverse the subsequence. This should also work for multiple subsequences provided they do not overlap. The complexity should be O(n).
Note: there should be an extra decision whether to cut between {4,9}, or between {9,8}. (I just add one ;-)
#include <stdio.h>

int array[] = {1,2,3,4,9,8,7,11,12,14};

unsigned findrev(int *arr, unsigned len, unsigned *ppos);
void revrev(int *arr, unsigned len);

unsigned findrev(int *arr, unsigned len, unsigned *ppos)
{
    unsigned zlen, pos;
    for (zlen = pos = 0; pos < len-1; pos++) {
        if (arr[pos+1] < arr[pos]) zlen++;
        else if (zlen) break;
    }
    if (zlen) *ppos = pos - zlen++;
    return zlen;
}

void revrev(int *arr, unsigned len)
{
    unsigned pos;
    for (pos = 0; pos < --len; pos++) {
        int tmp;
        tmp = arr[pos];
        arr[pos] = arr[len];
        arr[len] = tmp;
    }
}

int main(void)
{
    unsigned start, len;

    len = findrev(array, 10, &start);
    printf("Start=%u Len=%u\n", start, len);
    revrev(array + start, len);

    for (start = 0; start < 10; start++) {
        printf(" %d", array[start]);
    }
    printf("\n");
    return 0;
}
NOTE: the length of the reversed run could also be found by a binary search for the first value larger than (or equal to) the first element of the reversed sequence.
Timsort is quite good at sorting mostly-already-sorted arrays; on top of that, it does an in-place mergesort by using two different merge steps depending on which will work better. I'm told it's found in the Python and Java standard libraries, perhaps others. You still probably shouldn't use it inside a loop, though: inside a loop you're better off with a treap (for good average speed) or a red-black tree (for low standard deviation of speed).
I think a linear solution [O(n)] is the best possible, since in a list of n numbers where n/2 numbers are reverse sorted, as in the example below, we have to invert n/2 numbers, which already gives a complexity of O(n).
Also, even for such a sequence, I think insertion sort will be O(n^2), not O(n), in the worst case.
Example: consider an array with the distribution below, and suppose we attempt to use insertion sort:
n/4 sorted numbers | n/2 reverse sorted numbers | n/4 sorted numbers
For the n/2 reverse sorted numbers the sorting complexity will be O(n^2).

How do you efficiently generate a list of K non-repeating integers between 0 and an upper bound N [duplicate]

This question already has answers here:
Unique (non-repeating) random numbers in O(1)?
(22 answers)
Closed 5 years ago.
The question gives all necessary data: what is an efficient algorithm to generate a sequence of K non-repeating integers within a given interval [0,N-1]. The trivial algorithm (generating random numbers and, before adding them to the sequence, looking them up to see if they were already there) is very expensive if K is large and near enough to N.
The algorithm provided in Efficiently selecting a set of random elements from a linked list seems more complicated than necessary, and requires some implementation. I've just found another algorithm that seems to do the job fine, as long as you know all the relevant parameters, in a single pass.
In The Art of Computer Programming, Volume 2: Seminumerical Algorithms, Third Edition, Knuth describes the following selection sampling algorithm:
Algorithm S (Selection sampling technique). To select n records at random from a set of N, where 0 < n ≤ N.
S1. [Initialize.] Set t ← 0, m ← 0. (During this algorithm, m represents the number of records selected so far, and t is the total number of input records that we have dealt with.)
S2. [Generate U.] Generate a random number U, uniformly distributed between zero and one.
S3. [Test.] If (N – t)U ≥ n – m, go to step S5.
S4. [Select.] Select the next record for the sample, and increase m and t by 1. If m < n, go to step S2; otherwise the sample is complete and the algorithm terminates.
S5. [Skip.] Skip the next record (do not include it in the sample), increase t by 1, and go back to step S2.
An implementation may be easier to follow than the description. Here is a Common Lisp implementation that selects n random members from a list:
(defun sample-list (n list &optional (length (length list)) result)
  (cond ((= length 0) result)
        ((< (* length (random 1.0)) n)
         (sample-list (1- n) (cdr list) (1- length)
                      (cons (car list) result)))
        (t (sample-list n (cdr list) (1- length) result))))
And here is an implementation that does not use recursion, and which works with all kinds of sequences:
(defun sample (n sequence)
  (let ((length (length sequence))
        (result (subseq sequence 0 n)))
    (loop
       with m = 0
       for i from 0 and u = (random 1.0)
       do (when (< (* (- length i) u)
                   (- n m))
            (setf (elt result m) (elt sequence i))
            (incf m))
       until (= m n))
    result))
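For C readers, the same steps S1-S5 can be transcribed almost literally; a sketch, where nrand() is a stand-in for any uniform [0,1) generator:

#include <stdlib.h>

static double nrand(void)
{
    return rand() / (RAND_MAX + 1.0);   /* uniform in [0, 1) */
}

/* Selects n of the integers 0..N-1, in increasing order, into out[]. */
void select_sample(int n, int N, int *out)
{
    int t = 0, m = 0;    /* S1: records seen and records selected so far */
    while (m < n) {
        double U = nrand();             /* S2: generate U */
        if ((N - t) * U >= n - m)
            t++;                        /* S5: skip this record */
        else
            out[m++] = t++;             /* S4: select this record */
    }
}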
The random module from Python library makes it extremely easy and effective:
from random import sample
print sample(xrange(N), K)
The sample function returns a list of K unique elements chosen from the given sequence.
xrange is a "list emulator", i.e. it behaves like a list of consecutive numbers without creating it in memory, which makes it super-fast for tasks like this one.
It is actually possible to do this in space proportional to the number of elements selected, rather than the size of the set you're selecting from, regardless of what proportion of the total set you're selecting. You do this by generating a random permutation, then selecting from it like this:
1. Pick a block cipher, such as TEA or XTEA.
2. Use XOR folding to reduce the block size to the smallest power of two larger than the set you're selecting from.
3. Use the random seed as the key to the cipher.
4. To generate an element n of the permutation, encrypt n with the cipher. If the output number is not in your set, encrypt that. Repeat until the number is inside the set. On average you will have to do less than two encryptions per generated number.
This has the added benefit that if your seed is cryptographically secure, so is your entire permutation.
I wrote about this in much more detail here.
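As a concrete sketch of the idea in C: the version below swaps the XOR-folded block cipher for a small Feistel network, since a Feistel construction is a bijection on exactly the reduced width, which makes the cycle-walking loop easy to reason about. The round function and key here are arbitrary placeholders, not a real cipher:

#include <stdint.h>

/* Arbitrary mixing function standing in for a real cipher round. */
static uint32_t round_fn(uint32_t x, uint32_t key)
{
    x ^= key;
    x *= 0x9E3779B9u;
    x ^= x >> 16;
    return x;
}

/* A 4-round Feistel network on `bits` bits (bits must be even):
 * a bijection on [0, 2^bits), i.e. a keyed permutation. */
static uint64_t feistel(uint64_t x, int bits, const uint32_t key[4])
{
    int half = bits / 2;
    uint32_t mask = (1u << half) - 1;
    uint32_t l = (uint32_t)(x >> half);
    uint32_t r = (uint32_t)(x & mask);
    for (int i = 0; i < 4; i++) {
        uint32_t t = l ^ (round_fn(r, key[i]) & mask);
        l = r;
        r = t;
    }
    return ((uint64_t)l << half) | r;
}

/* n-th element of a keyed pseudorandom permutation of [0, set_size):
 * cycle-walk (re-encrypt) until the value lands inside the set.
 * Assumes n < set_size and set_size <= 2^62. */
uint64_t permute(uint64_t n, uint64_t set_size, const uint32_t key[4])
{
    int bits = 2;   /* smallest even width whose range covers the set */
    while ((1ULL << bits) < set_size)
        bits += 2;
    uint64_t x = n;
    do {
        x = feistel(x, bits, key);
    } while (x >= set_size);
    return x;
}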
The following code (in C, unknown origin) seems to solve the problem extremely well:
/* generate n sorted, non-duplicate integers in [0, max] */
int *generate(int n, int max) {
    int i, m, a;
    int *g = (int *)calloc(n, sizeof(int));
    if (!g) return 0;
    m = 0;
    for (i = 0; i < max; i++) {
        a = random_in_between(0, max - i);
        if (a < n - m) {
            g[m] = i;
            m++;
        }
    }
    return g;
}
Does anyone know where I can find more gems like this one?
Generate an array 0...N-1 filled with a[i] = i.
Then partially shuffle it; K steps of the shuffle below are enough.
Shuffling:
Start J = N-1
Pick a random number 0...J (say, R)
swap a[R] with a[J]
since R can be equal to J, the element may be swapped with itself
subtract 1 from J and repeat.
Finally, take K last elements.
This essentially picks a random element from the list, moves it out, then picks a random element from the remaining list, and so on.
Works with O(K) shuffle steps after O(N) initialization, and requires O(N) storage.
The shuffling part is called Fisher-Yates shuffle or Knuth's shuffle, described in the 2nd volume of The Art of Computer Programming.
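A small C sketch of that procedure (it assumes K <= N, and uses rand() for brevity despite the modulo bias discussed elsewhere on this page):

#include <stdlib.h>

/* Fills out[0..K-1] with K distinct integers drawn from 0..N-1. */
void sample_by_shuffle(int out[], int K, int N)
{
    int *a = malloc(N * sizeof *a);
    if (!a) return;
    for (int i = 0; i < N; i++)
        a[i] = i;                       /* a[i] = i */
    for (int j = N - 1; j > N - 1 - K; j--) {
        int r = rand() % (j + 1);       /* 0...J; R may equal J */
        int t = a[r]; a[r] = a[j]; a[j] = t;
        out[(N - 1) - j] = a[j];        /* position j is now final */
    }
    free(a);
}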
Speed up the trivial algorithm by storing the K numbers in a hashing store. Knowing K before you start takes away all the inefficiency of inserting into a hash map, and you still get the benefit of fast look-up.
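In C, that might look like a preallocated linear-probing table sized from K so lookups stay O(1) on average; a sketch, assuming 0 < K <= N:

#include <stdlib.h>

/* Picks K distinct random integers in [0, N) by rejection; the
 * duplicate check uses a linear-probing hash table kept at most
 * half full, so it never loops forever and probes stay short. */
int *sample_with_hash(int K, int N)
{
    int cap = 2;
    while (cap < 2 * K)
        cap *= 2;                      /* power of two, load factor <= 1/2 */
    int *slots = malloc(cap * sizeof *slots);
    int *out = malloc(K * sizeof *out);
    if (!slots || !out) { free(slots); free(out); return NULL; }
    for (int i = 0; i < cap; i++)
        slots[i] = -1;                 /* -1 marks an empty slot */
    for (int m = 0; m < K; ) {
        int x = rand() % N;            /* candidate */
        int h = x & (cap - 1);
        while (slots[h] != -1 && slots[h] != x)
            h = (h + 1) & (cap - 1);   /* linear probing */
        if (slots[h] == x)
            continue;                  /* already drawn: reject and retry */
        slots[h] = x;
        out[m++] = x;
    }
    free(slots);
    return out;
}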
My solution is C++ oriented, but I'm sure it could be translated to other languages since it's pretty simple.
First, generate a list with K elements, going from 0 to K-1.
Then, as long as the list isn't empty, generate a random number between 0 and the size of the list.
Take that element, push it into another vector, and remove it from the original list.
This solution only involves two loops, and no hash table lookups or anything of the sort. So in actual code:
// Assume K is the highest number in the list
std::vector<int> sorted_list;
std::vector<int> random_list;

for(int i = 0; i < K; ++i) {
    sorted_list.push_back(i);
}

// Loop to K - 1 elements, as this will cause problems when trying to erase
// the first element
while(sorted_list.size() > 1) {
    int rand_index = rand() % sorted_list.size();
    random_list.push_back(sorted_list.at(rand_index));
    sorted_list.erase(sorted_list.begin() + rand_index);
}

// Finally push back the last remaining element to the random list
// The if() statement here is just a sanity check, in case K == 0
if(!sorted_list.empty()) {
    random_list.push_back(sorted_list.at(0));
}
Step 1: Generate your list of integers.
Step 2: Perform Knuth Shuffle.
Note that you don't need to shuffle the entire list, since the Knuth Shuffle algorithm allows you to apply only n shuffles, where n is the number of elements to return. Generating the list will still take time proportional to the size of the list, but you can reuse your existing list for any future shuffling needs (assuming the size stays the same) with no need to preshuffle the partially shuffled list before restarting the shuffling algorithm.
The basic algorithm for Knuth Shuffle is that you start with a list of integers. Then, you swap the first integer with any number in the list and return the current (new) first integer. Then, you swap the second integer with any number in the list (except the first) and return the current (new) second integer. Then...etc...
This is an absurdly simple algorithm, but be careful that you include the current item in the list when performing the swap or you will break the algorithm.
The Reservoir Sampling version is pretty simple:
my $N = 20;
my $k;
my @r;

while(<>) {
    if(++$k <= $N) {
        push @r, $_;
    } elsif(rand(1) <= ($N/$k)) {
        $r[rand(@r)] = $_;
    }
}

print @r;
That's $N randomly selected rows from STDIN. Replace the <>/$_ stuff with something else if you're not using rows from a file, but it's a pretty straightforward algorithm.
If you want the selected elements in sorted order, for example, extracting K elements out of N without caring about their relative order, an efficient algorithm is proposed in the paper An Efficient Algorithm for Sequential Random Sampling (Jeffrey Scott Vitter, ACM Transactions on Mathematical Software, Vol. 13, No. 1, March 1987, Pages 56-67).
Edited to add the code in C++ using Boost. I've just typed it and there might be many errors. The random numbers come from the Boost library, with a stupid seed, so don't do anything serious with this.
/* Sampling according to [Vitter87].
 *
 * Bibliography
 * [Vitter 87]
 *   Jeffrey Scott Vitter,
 *   An Efficient Algorithm for Sequential Random Sampling
 *   ACM Transactions on Mathematical Software, 13 (1), 58 (1987).
 */
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <string>
#include <iostream>
#include <iomanip>
#include <boost/random/linear_congruential.hpp>
#include <boost/random/variate_generator.hpp>
#include <boost/random/uniform_real.hpp>

using namespace std;

// This is a typedef for a random number generator.
// Try boost::mt19937 or boost::ecuyer1988 instead of boost::minstd_rand
typedef boost::minstd_rand base_generator_type;

// Define a random number generator and initialize it with a reproducible
// seed.
// (The seed is unsigned, otherwise the wrong overload may be selected
// when using mt19937 as the base_generator_type.)
base_generator_type generator(0xBB84u);
//TODO : change the seed above !

// Defines the suitable uniform distribution.
boost::uniform_real<> uni_dist(0,1);
boost::variate_generator<base_generator_type&, boost::uniform_real<> > uni(generator, uni_dist);

void SequentialSamplesMethodA(int K, int N)
// Outputs K sorted random integers out of 0..N, taken according to
// [Vitter87], method A.
{
    int top=N-K, S, curr=0, currsample=-1;
    double Nreal=N, quot=1., V;
    while (K>=2)
    {
        V=uni();
        S=0;
        quot=top/Nreal;
        while (quot > V)
        {
            S++; top--; Nreal--;
            quot *= top/Nreal;
        }
        currsample+=1+S;
        cout << curr << " : " << currsample << "\n";
        Nreal--; K--; curr++;
    }
    // special case K=1 to avoid overflow
    S=floor(round(Nreal)*uni());
    currsample+=1+S;
    cout << curr << " : " << currsample << "\n";
}

void SequentialSamplesMethodD(int K, int N)
// Outputs K sorted random integers out of 0..N, taken according to
// [Vitter87], method D.
{
    const int negalphainv=-13; // between -20 and -7 according to [Vitter87],
                               // optimized for an implementation in 1987 !!!
    int curr=0, currsample=0;
    int threshold=-negalphainv*K;
    double Kreal=K, Kinv=1./Kreal, Nreal=N;
    double Vprime=exp(log(uni())*Kinv);
    int qu1=N+1-K; double qu1real=qu1;
    double Kmin1inv, X, U, negSreal, y1, y2, top, bottom;
    int S, limit;
    while ((K>1)&&(threshold<N))
    {
        Kmin1inv=1./(Kreal-1.);
        while(1)
        { // Step D2: generate X and U
            while(1)
            {
                X=Nreal*(1-Vprime);
                S=floor(X);
                if (S<qu1) {break;}
                Vprime=exp(log(uni())*Kinv);
            }
            U=uni();
            negSreal=-S;
            // Step D3: Accept?
            y1=exp(log(U*Nreal/qu1real)*Kmin1inv);
            Vprime=y1*(1. - X/Nreal)*(qu1real/(negSreal+qu1real));
            if (Vprime <=1.) {break;} // Accept! Test [Vitter87](2.8) is true
            // Step D4: Accept?
            y2=1.; top=Nreal-1.;      // start the running product at 1
            if (K-1 > S)
            {bottom=Nreal-Kreal; limit=N-S;}
            else {bottom=Nreal+negSreal-1.; limit=qu1;}
            for(int t=N-1; t>=limit; t--)
            {y2*=top/bottom; top--; bottom--;}
            if (Nreal/(Nreal-X)>=y1*exp(log(y2)*Kmin1inv))
            { // Accept!
                Vprime=exp(log(uni())*Kmin1inv);
                break;
            }
            Vprime=exp(log(uni())*Kmin1inv);
        }
        // Step D5: Select the (S+1)th record
        currsample+=1+S;
        cout << curr << " : " << currsample << "\n";
        curr++;
        N-=S+1; Nreal+=negSreal-1.;
        K-=1; Kreal-=1; Kinv=Kmin1inv;
        qu1-=S; qu1real+=negSreal;
        threshold+=negalphainv;
    }
    if (K>1) {SequentialSamplesMethodA(K, N);}
    else {
        S=floor(N*Vprime);
        currsample+=1+S;
        cout << curr << " : " << currsample << "\n";
    }
}

int main(void)
{
    int Ntest=10000000, Ktest=Ntest/100;
    SequentialSamplesMethodD(Ktest, Ntest);
    return 0;
}
$ time ./sampling | tail
gives the following output on my laptop:
99990 : 9998882
99991 : 9998885
99992 : 9999021
99993 : 9999058
99994 : 9999339
99995 : 9999359
99996 : 9999411
99997 : 9999427
99998 : 9999584
99999 : 9999745
real 0m0.075s
user 0m0.060s
sys 0m0.000s
This Ruby code showcases the Reservoir Sampling, Algorithm R method. In each cycle, I select n=5 unique random integers from [0,N=10) range:
t=0
m=0
N=10
n=5
s=0
distrib=Array.new(N,0)
for i in 1..500000 do
  t=0
  m=0
  s=0
  while m<n do
    u=rand()
    if (N-t)*u>=n-m then
      t=t+1
    else
      distrib[s]+=1
      m=m+1
      t=t+1
    end #if
    s=s+1
  end #while
  if (i % 100000)==0 then puts i.to_s + ". cycle..." end
end #for
puts "--------------"
puts distrib
output:
100000. cycle...
200000. cycle...
300000. cycle...
400000. cycle...
500000. cycle...
--------------
250272
249924
249628
249894
250193
250202
249647
249606
250600
250034
All integers between 0 and 9 were chosen with nearly the same probability.
It's essentially Knuth's algorithm applied to arbitrary sequences (indeed, that answer has a LISP version of this). The algorithm is O(N) in time and can be O(1) in memory if the sequence is streamed into it, as shown in @MichaelCramer's answer.
Here's a way to do it in O(N) without extra storage. I'm pretty sure this is not a purely random distribution, but it's probably close enough for many uses.
/* generate n sorted, non-duplicate integers in [0, max[ in O(n) */
int *generate(int n, int max) {
    float step, v = 0;
    int i;
    int *g = (int *)calloc(n, sizeof(int));
    if (!g) return 0;
    for (i = 0; i < n; i++) {
        step = (max - v) / (float)(n - i);
        v += floating_pt_random_in_between(0.0, step * 2.0);
        if (i > 0 && (int)v == g[i-1]) {
            v = (int)v + 1; // avoid collisions
        }
        g[i] = v;
    }
    i = n - 1;
    while (g[i] > max) {
        g[i] = max; // fix up overflow
        max = g[i--] - 1;
    }
    return g;
}
This is Perl code. grep is a filter, and as always, I didn't test this code.
@list = grep ($_ % I) == 0, (0..N);
I = interval
N = upper bound
Only get numbers that match your interval via the modulus operator.
@list = grep ($_ % 3) == 0, (0..30);
will return 0, 3, 6, ... 30
This is pseudo Perl code. You may need to tweak it to get it to compile.
