Sorting 4 numbers with minimum x<y comparisons - arrays

This is an interview question. Say you have an array of four ints named A, and also this function:
int check(int x, int y) {
    if (x <= y) return 1;
    return 0;
}
Now, you want to create a function that will sort A, and you can use only the function check for comparisons. How many calls to check do you need?
(It is OK to return a new array as the result.)
I found that I can do this in 5 calls. Is it possible to do it with fewer calls (in the worst case)?
This is how I thought of doing it (pseudo code):
int[4] B = new int[4];
/*
The idea: use check to put the smaller element of each input pair into the first two cells
and the larger element into the last two cells.
Then swap (if needed) the two minimums and also the two maximums.
And finally, swap the second element (max of the minimums)
and the third element (min of the maximums) if needed.
*/
if (check(A[0], A[1]) == 1) { // A[0] <= A[1]
    B[0] = A[0];
    B[2] = A[1];
}
else {
    B[0] = A[1];
    B[2] = A[0];
}
if (check(A[2], A[3]) == 1) { // A[2] <= A[3]
    B[1] = A[2];
    B[3] = A[3];
}
else {
    B[1] = A[3];
    B[3] = A[2];
}
if (check(B[0], B[1]) == 0) { // B[0] > B[1]
    swap(B[0], B[1]);
}
if (check(B[2], B[3]) == 0) { // B[2] > B[3]
    swap(B[2], B[3]);
}
if (check(B[1], B[2]) == 0) { // B[1] > B[2]
    swap(B[1], B[2]);
}
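For reference, here is a compilable C version of the same scheme (a sketch of mine, not part of the original question; the swaps are written out since C has no built-in swap):

#include <stdio.h>

int check(int x, int y) { return x <= y ? 1 : 0; }

/* Sorts the four ints of A into B using exactly 5 calls to check():
   two to order each input pair, two to order the pair-minima and the
   pair-maxima, and one to order the two middle candidates. */
void sort4(const int A[4], int B[4]) {
    int t;
    if (check(A[0], A[1])) { B[0] = A[0]; B[2] = A[1]; } else { B[0] = A[1]; B[2] = A[0]; }
    if (check(A[2], A[3])) { B[1] = A[2]; B[3] = A[3]; } else { B[1] = A[3]; B[3] = A[2]; }
    if (!check(B[0], B[1])) { t = B[0]; B[0] = B[1]; B[1] = t; } /* order the two minima */
    if (!check(B[2], B[3])) { t = B[2]; B[2] = B[3]; B[3] = t; } /* order the two maxima */
    if (!check(B[1], B[2])) { t = B[1]; B[1] = B[2]; B[2] = t; } /* order the middle pair */
}

int main(void) {
    int A[4] = {3, 1, 4, 2}, B[4];
    sort4(A, B);
    for (int i = 0; i < 4; i++) printf("%d ", B[i]); /* prints 1 2 3 4 */
    return 0;
}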

There are 24 possible orderings of a 4-element list (4 factorial). If you do only 4 comparisons, you can get only 4 bits of information, which is enough to distinguish between 16 different cases; that isn't enough to cover all 24 possible outputs. Therefore 5 comparisons are optimal in the worst case.

In The Art of Computer Programming, p. 183 (Section 5.3.1), Donald Knuth has a table of lower and upper bounds on the minimum numbers of comparisons:
The ceil(lg n!) column is the "information-theoretic" lower bound, whereas B(n) is the maximum number of comparisons used by binary insertion sort. Since the lower and upper bounds are equal for n = 4, 5 comparisons are needed.
The information-theoretic bound is derived by recognizing that there are n! possible orderings of n unique items. We distinguish these cases by asking S yes-no questions of the form "is X < Y?". These questions form a tree which has at most 2^S leaves. We need n! <= 2^S; solving for S gives ceil(lg(n!)).
Incidentally, you can use Stirling's approximation to show that this implies that comparison-based sorting requires Omega(n log n) time.
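For concreteness (my own arithmetic, not part of Knuth's table): with n = 4 the bound is ceil(lg 4!) = ceil(lg 24) = ceil(4.585...) = 5, and Stirling's approximation gives lg(n!) = n lg n - n lg e + O(log n), i.e. Theta(n log n).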
The rest of the section goes on to describe a number of approaches to creating these bounds and studying this question, though work is ongoing (see, for instance, Peczarski (2011)).


Does the array “sum and/or sub” to `x`?

Goal
I would like to write an algorithm (in C) which returns TRUE or FALSE (1 or 0) depending on whether the array A given as input can “sum and/or sub” to x (see below for clarification). Note that all values of A are integers bounded between [1, x-1] that were randomly (uniformly) sampled.
Clarification and examples
By “sum and/or sub”, I mean placing "+" and "-" in front of each element of the array and summing the result. Let's call this function SumSub.
int SumSub (int* A,int x)
{
...
}
SumSub({2,7,5},10)
should return TRUE as 7-2+5=10. You will note that the first element of A can also be taken as negative so that the order of elements in A does not matter.
SumSub({2,7,5,2},10)
should return FALSE as there is no way to “sum and/or sub” the elements of the array to reach the value of x. Please note that this means all elements of A must be used.
Complexity
Let n be the length of A. The complexity of the problem is of order O(2^n) if one has to explore all possible combinations of pluses and minuses. However, some combinations are more likely than others and are therefore worth exploring first (hoping the output will be TRUE). Typically, the combination which requires subtracting all the other elements from the largest number is impossible (as all elements of A are lower than x). Also, if n > x, it makes no sense to try adding all the elements of A.
Question
How should I go about writing this function?
Unfortunately the subset-sum problem, which is NP-complete, can be reduced to your problem, so in general an exponential solution can't be avoided.
The original problem is indeed exponential as you said. BUT with the given range [1, x-1] for the numbers in A[] you can make the solution polynomial in n and x. There is a very simple dynamic programming solution with:
Time complexity: O(n^2 * x)
Memory complexity: O(n^2 * x)
where n = number of elements in A[].
You need to use a dynamic programming approach for this.
You know that every reachable value lies in the range [-n*x, n*x]. Create a 2D array of size (n+1) x (2*n*x + 1). Let's call this dp[][].
dp[i][j] = whether it is possible to make the value j by taking all elements of A[] from [0..i-1] (with some choice of signs),
so
dp[10][3] = 1 means that using the first 10 elements of A[] we CAN create the value 3
dp[10][3] = 0 means that using the first 10 elements of A[] we can NOT create the value 3
Here is a sketch of this in C (taking n explicitly as a parameter, and offsetting the index by n*x so that negative sums map to valid positions):

#include <stdlib.h>

int SumSub(int *A, int n, int x)   /* n = number of elements in A */
{
    int shift = n * x;                               /* value j is stored at index j + shift */
    int width = 2 * n * x + 1;                       /* reachable sums lie in [-n*x, n*x]    */
    char *dp = calloc((size_t)(n + 1) * width, 1);   /* all entries start at 0               */
    dp[0 * width + 0 + shift] = 1;                   /* the empty prefix makes the value 0   */
    for (int i = 1; i <= n; i++) {
        int val = A[i - 1];
        for (int j = -shift; j <= shift; j++) {
            int can = 0;
            if (j + val <= shift)  can |= dp[(i - 1) * width + (j + val) + shift];
            if (j - val >= -shift) can |= dp[(i - 1) * width + (j - val) + shift];
            dp[i * width + j + shift] = can;
        }
    }
    int result = dp[n * width + x + shift];
    free(dp);
    return result;
}
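For example, a quick throwaway driver (mine, not part of the answer), compiled together with the SumSub above, checking it against the arrays from the question:

#include <stdio.h>

int main(void) {
    int a[] = {2, 7, 5};
    int b[] = {2, 7, 5, 2};
    printf("%d\n", SumSub(a, 3, 10)); /* expected 1: -2 + 7 + 5 = 10            */
    printf("%d\n", SumSub(b, 4, 10)); /* expected 0: no signing of {2,7,5,2} reaches 10 */
    return 0;
}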
Unfortunately this is NP-complete even when x is restricted to the value 0, so don't expect a polynomial-time algorithm. To show this I'll give a simple reduction from the NP-hard Partition Problem, which asks whether a given multiset of positive integers can be partitioned into two parts having equal sums:
Suppose we have an instance of the Partition Problem consisting of n positive integers B_1, ..., B_n. Create from this an instance of your problem in which A_i = B_i for each 1 <= i <= n, and set x = 0.
Clearly if there is a partition of B into two parts C and D having equal sums, then there is also a solution to the instance of your problem: Put a + in front of every number in C, and a - in front of every number in D (or the other way round). Since C and D have equal sums, this expression must equal 0.
OTOH, if the solution to the instance of your problem that we just created is YES (TRUE), then we can easily create a partition of B into two parts having equal sums: just put all the positive terms in one part (say, C), and all the negative terms (without the preceding - of course) in the other (say, D). Since we know that the total value of the expression is 0, it must be that the sum of the (positive) numbers in C is equal to the (negated) sum of the numbers in D.
Thus a YES to either problem instance implies a YES to the other, and likewise a NO to either implies a NO to the other -- that is, the two instances always have the same answer. So if it were possible to solve your problem in polynomial time, the NP-hard Partition Problem could be solved in polynomial time too: construct the above instance of your problem, solve it with your poly-time algorithm, and report the result.
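For example (my illustration, not from the original answer): the Partition instance B = {3, 1, 4, 2} splits into {3, 2} and {1, 4}, each summing to 5, and the corresponding signed expression +3 - 1 - 4 + 2 = 0 is exactly a YES certificate for the constructed instance with x = 0.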

What kind of drawbacks are there performance-wise if I sort an array by using hashing?

I wrote a simple function to sort an array int a[] using hashing.
For that I stored the frequency of every element in a new array hash1[], and then put the elements back into the original array in linear time.
#include<bits/stdc++.h>
using namespace std;

int hash1[10000];   // frequency table, indexed by value

void sah(int a[], int n)
{
    int maxo = -1;
    for (int i = 0; i < n; i++)
    {
        hash1[a[i]]++;                      // count each value
        if (maxo < a[i]) { maxo = a[i]; }   // track the maximum value seen
    }
    int i = 0, freq = 0, idx = 0;
    while (i < maxo + 1)                    // write values back in increasing order
    {
        freq = hash1[i];
        if (freq > 0)
        {
            while (freq > 0)
            {
                a[idx++] = i;
                freq--;
            }
        }
        i++;
    }
}

int main()
{
    int a[] = {6, 8, 9, 22, 33, 59, 12, 5, 99, 12, 57, 7};
    int n = sizeof(a) / sizeof(a[0]);
    sah(a, n);
    for (int i = 0; i < n; i++)
    {
        printf("%d ", a[i]);
    }
}
This algorithm runs in O(max_element). What kind of disadvantages am I facing here, considering only performance (time and space)?
The algorithm you've implemented is called counting sort. Its runtime is O(n + U), where n is the total number of elements and U is the maximum value in the array (assuming the numbers run from 0 to U), and its space usage is Θ(U). Your particular implementation assumes that U = 10,000. Although you've described your approach as "hashing," it really isn't so much a hash (computing some function of the elements and using that to put them into buckets) as a distribution (spreading elements out according to their values).
If U is a fixed constant - as it is in your case - then the runtime is O(n) and the space usage is O(1), though remember that big-O talks about long-term growth rates and that if U is large the runtime can be pretty high. This makes it attractive if you're sorting very large arrays with a restricted range of values. However, if the range of values can be large, this is not a particularly good approach. Interestingly, you can think of radix sort as an algorithm that repeatedly runs counting sort with U = 10 (if using the base-10 digits of the numbers) or U = 2 (if going in binary) and has a runtime of O(n log U), which is strongly preferable for large values of U.
You can clean up this code in a number of ways. For example, you have an if statement and a while loop with the same condition, which can be combined into a single while loop. You also might want to put in some assert checks to make sure all the values are in the range from 0 to 9,999, inclusive, since otherwise you'll have a bounds error. Additionally, you could consider making the global array either a local variable (though watch your stack usage) or a static local variable (to avoid polluting the global namespace). You could alternatively have the user pass in a parameter specifying the maximum value, or calculate it yourself.
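A sketch of what those cleanups might look like (my rewrite, not the original poster's code, with the maximum value passed in by the caller and the count array kept local):

#include <cassert>
#include <cstdio>
#include <vector>

// Counting sort: the caller promises every value is in [0, max_value].
void counting_sort(int a[], int n, int max_value) {
    std::vector<int> counts(max_value + 1, 0);   // local, sized to the actual range
    for (int i = 0; i < n; i++) {
        assert(a[i] >= 0 && a[i] <= max_value);  // catch out-of-range values early
        counts[a[i]]++;
    }
    int idx = 0;
    for (int v = 0; v <= max_value; v++)
        while (counts[v]-- > 0)                  // single loop replaces the if + while
            a[idx++] = v;
}

int main() {
    int a[] = {6, 8, 9, 22, 33, 59, 12, 5, 99, 12, 57, 7};
    int n = sizeof(a) / sizeof(a[0]);
    counting_sort(a, n, 99);
    for (int i = 0; i < n; i++) printf("%d ", a[i]);
}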
Issues you may consider:
Input validation. What if the user enters -10 or a very large value?
If the maximum element is large, you will at some point get a performance hit when the L1 cache is exhausted. The hash1 array will compete for memory bandwidth with the a array. When I implemented radix sorting in the past, I found that 8 bits per iteration was fastest.
The time complexity is actually O(max_element + number_of_elements). E.g. what if you sorted 2 million ones or zeros? It is not as fast as sorting just 2 ones or zeros, even though max_element is the same.

Find the minimum number of elements required so that their sum equals or exceeds S

I know this can be done by sorting the array and taking the larger numbers until the required condition is met. That would take at least n log(n) sorting time.
Is there any improvement over n log(n)?
We can assume all numbers are positive.
Here is an algorithm that is O(n + size(smallest subset) * log(n)). If the smallest subset is much smaller than the array, this will be O(n).
Read http://en.wikipedia.org/wiki/Heap_%28data_structure%29 if my description of the algorithm is unclear (it is light on details, but the details are all there).
Turn the array into a heap arranged such that the biggest element is available in time O(n).
Repeatedly extract the biggest element from the heap until their sum is large enough. This takes O(size(smallest subset) * log(n)).
This is almost certainly the answer they were hoping for, though not getting it shouldn't be a deal breaker.
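In C++, a sketch of the heap approach just described (my code, not the answerer's; the function name is mine):

#include <queue>
#include <vector>

// Minimum number of elements whose sum is >= S, or -1 if even the whole array falls short.
// Assumes all values are positive, as stated in the question.
int minElements(const std::vector<int>& a, long long S) {
    std::priority_queue<int> heap(a.begin(), a.end());  // max-heap, built in O(n)
    long long sum = 0;
    int count = 0;
    while (!heap.empty() && sum < S) {
        sum += heap.top();   // take the largest remaining element
        heap.pop();          // O(log n) per extraction
        ++count;
    }
    return sum >= S ? count : -1;
}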
Edit: Here is another variant that is often faster, but can be slower.
Walk through the elements until the sum of the first few exceeds S. Store this as current_sum.
Copy those elements into an array.
Heapify that array such that the minimum is easy to find, and remember the minimum.
For each remaining element in the main array:
    if min(in our heap) < element:
        insert element into heap
        increase current_sum by element
        while S + min(in our heap) < current_sum:
            current_sum -= min(in our heap)
            remove min from heap
If we get to reject most of the array without manipulating our heap, this can be up to twice as fast as the previous solution. But it is also possible to be slower, such as when the last element in the array happens to be bigger than S.
Assuming the numbers are integers, you can improve upon the usual n lg(n) complexity of sorting because in this case we have the extra information that the values are between 0 and S (for our purposes, integers larger than S are the same as S).
Because the range of values is finite, you can use a non-comparative sorting algorithm such as Pigeonhole Sort or Radix Sort to go below n lg(n).
Note that these methods are dependent on some function of S, so if S gets large enough (and n stays small enough) you may be better off reverting to a comparative sort.
Here is an O(n) expected-time solution to the problem. It's somewhat like Moron's idea, but we don't throw out the work that our selection algorithm did in each step, and we start trying from an item potentially in the middle rather than using the repeated doubling approach.
Alternatively, it's really just quickselect with a little additional bookkeeping for the remaining sum.
First, it's clear that if you had the elements in sorted order, you could just pick the largest items first until you exceed the desired sum. Our solution is going to be like that, except we'll try as hard as we can not to discover ordering information, because sorting is slow.
You want to be able to determine whether a given value is the cut-off: if we include that value and everything greater than it, we meet or exceed S, but when we remove it we are below S; then we are golden.
Here is the pseudo code; I didn't test it for edge cases, but it gets the idea across.
def Solve(arr, s):
    # We could get rid of the worst-case O(n^2) behavior (which basically never happens)
    # by selecting the median here deterministically, but in practice the constant
    # factor on the algorithm would be much worse.
    p = random_element(arr)
    left_arr, right_arr = partition(arr, p)
    # assume p is in neither left_arr nor right_arr
    right_sum = sum(right_arr)
    if right_sum + p >= s:
        if right_sum < s:
            # solved it, p forms the cut off
            return len(right_arr) + 1
        # took too much, at least we eliminated left_arr and p
        return Solve(right_arr, s)
    else:
        # didn't take enough yet; include all of right_arr and p, and recurse on left_arr
        return len(right_arr) + 1 + Solve(left_arr, s - right_sum - p)
One improvement (asymptotically) over Theta(nlogn) you can do is to get an O(n log K) time algorithm, where K is the required minimum number of elements.
Thus if K is constant, or say log n, this is better (asymptotically) than sorting. Of course if K is n^epsilon, then this is no better than Theta(n log n).
The way to do this is to use selection algorithms, which can tell you the ith largest element in O(n) time.
Now do a binary search for K, starting with i = 1 (the largest) and doubling i etc. at each turn.
You find the ith largest, compute the sum of the i largest elements, and check whether it is greater than S or not.
This way, you would run O(log K) runs of the selection algorithm (which is O(n)) for a total running time of O(n log K).
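A sketch of this in C++ (my code; the helper names topSum and minCount are mine) using std::nth_element as the selection algorithm. For simplicity each call copies the array, which only affects the constant factor:

#include <algorithm>
#include <functional>
#include <numeric>
#include <vector>

// Sum of the i largest elements via expected-O(n) selection
// (copies v so the caller's order is untouched). Assumes 1 <= i <= v.size().
long long topSum(std::vector<int> v, int i) {
    std::nth_element(v.begin(), v.begin() + i, v.end(), std::greater<int>());
    return std::accumulate(v.begin(), v.begin() + i, 0LL);
}

// Smallest K with (sum of K largest) >= S, or -1 if even the whole array falls short.
int minCount(const std::vector<int>& v, long long S) {
    if (S <= 0) return 0;
    int n = (int)v.size();
    if (n == 0 || topSum(v, n) < S) return -1;
    int lo = 1, hi = 1;
    while (hi < n && topSum(v, hi) < S) {            // doubling phase: O(log K) selections
        lo = hi + 1;
        hi = std::min(2 * hi, n);
    }
    while (lo < hi) {                                // binary search for the smallest feasible K
        int mid = (lo + hi) / 2;
        if (topSum(v, mid) >= S) hi = mid; else lo = mid + 1;
    }
    return lo;
}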
First scan the numbers: if you find some number >= S, then a single element solves it. Otherwise:
Pigeonhole-sort the numbers (all of them are < S).
Sum elements from highest to lowest in the sorted order until you exceed S.

Compare two integer arrays with same length

[Description] Given two integer arrays with the same length. Design an algorithm which can judge whether they're the same. The definition of "same" is that, if these two arrays were in sorted order, the elements in corresponding position should be the same.
[Example]
<1 2 3 4> = <3 1 2 4>
<1 2 3 4> != <3 4 1 1>
[Limitation] The algorithm should require constant extra space, and O(n) running time.
(Probably too complex for an interview question.)
(You can use O(N) time to check the min, max, sum, sumsq, etc. are equal first.)
Use no-extra-space radix sort to sort the two arrays in-place. O(N) time complexity, O(1) space.
Then compare them using the usual algorithm. O(N) time complexity, O(1) space.
(Provided (max − min) of the arrays is O(N^k) for some finite k.)
You can try a probabilistic approach - convert each array into a number in some huge base B modulo some prime P, for example sum B^a_i over all i, mod some big-ish P. If both arrays come out to the same number, try again with as many primes as you want. If the fingerprints differ at any attempt, then the arrays are not the same. If they pass enough challenges, then they are equal, with high probability.
There's a trivial proof for B > N and P > the biggest number, so there must be a challenge that unequal arrays cannot meet. This is actually the deterministic approach, though the complexity analysis might be more difficult, depending on how people view the complexity in terms of the size of the input (as opposed to just the number of elements).
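A sketch of the probabilistic check in C++ (my code, and a variation: it fixes one large prime and varies the base across trials, which plays the same role; it also assumes non-negative values for simplicity):

#include <cstdint>
#include <random>
#include <vector>

// Modular exponentiation; safe because P < 2^32, so every product fits in 64 bits.
static uint64_t modpow(uint64_t b, uint64_t e, uint64_t m) {
    uint64_t r = 1;
    for (b %= m; e > 0; e >>= 1) {
        if (e & 1) r = r * b % m;
        b = b * b % m;
    }
    return r;
}

// View each array as the multiset fingerprint sum of base^a[i] mod P and compare.
// Equal multisets always match; unequal ones are unlikely to survive many trials.
bool probablySame(const std::vector<int>& a, const std::vector<int>& b, int trials = 5) {
    if (a.size() != b.size()) return false;
    const uint64_t P = 1000000007ULL;                 // a fixed large prime
    std::mt19937_64 rng(std::random_device{}());
    std::uniform_int_distribution<uint64_t> pick(2, P - 2);
    for (int t = 0; t < trials; ++t) {
        uint64_t base = pick(rng), fa = 0, fb = 0;    // random evaluation base per trial
        for (int x : a) fa = (fa + modpow(base, (uint64_t)x, P)) % P;
        for (int x : b) fb = (fb + modpow(base, (uint64_t)x, P)) % P;
        if (fa != fb) return false;                   // definitely different multisets
    }
    return true;                                      // equal with high probability
}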
I claim that unless the range of the input is specified, it is IMPOSSIBLE to solve in constant extra space and O(n) running time.
I will be happy to be proven wrong, so that I can learn something new.
Insert all elements from the first array into a hashtable.
Try to insert all elements from the second array into the same hashtable - for each insert, the element should already be there.
OK, this is not constant extra space, but it is the best I could come up with at the moment :-). Are there any other constraints imposed on the question, like for example the biggest integer that may be included in the array?
A few answers are basically correct, even though they don't look like it. The hash table approach (for one example) has an upper limit based on the range of the type involved rather than the number of elements in the arrays. At least by most definitions, that makes the (upper limit on the) space a constant, although the constant may be quite large.
In theory, you could change that from an upper limit to a true constant amount of space. Just for example, if you were working in C or C++, and it was an array of char, you could use something like:
size_t counts[UCHAR_MAX + 1];
Since UCHAR_MAX is a constant, the amount of space used by the array is also a constant.
Edit: I'd note for the record that a bound on the ranges/sizes of items involved is implicit in nearly all descriptions of algorithmic complexity. Just for example, we all "know" that Quicksort is an O(N log N) algorithm. That's only true, however, if we assume that comparing and swapping the items being sorted takes constant time, which can only be true if we bound the range. If the range of items involved is large enough that we can no longer treat a comparison or a swap as taking constant time, then its complexity would become something like O(N log N log R), where R is the range, so log R approximates the number of bits necessary to represent an item.
Is this a trick question? If the authors assumed integers to be within a given range (2^32 etc.) then "extra constant space" might simply be an array of size 2^32 in which you count the occurrences in both lists.
If the integers are unranged, it cannot be done.
You could add each element into a hashmap<Integer, Integer>, with the following rules: Array A is the adder, array B is the remover. When inserting from Array A, if the key does not exist, insert it with a value of 1. If the key exists, increment the value (keep a count). When removing, if the key exists and is greater than 1, reduce it by 1. If the key exists and is 1, remove the element.
Run through array A followed by array B using the rules above. If at any time during the removal phase an element of array B is not found in the map, you can immediately return false. If after both the adder and the remover are finished the hashmap is empty, the arrays are equivalent.
Edit: The size of the hashtable will be equal to the number of distinct values in the array; does this fit the definition of constant space?
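In C++, the same add/remove bookkeeping might look like this (a sketch of mine using std::unordered_map in place of the Java HashMap; the function name is mine):

#include <unordered_map>
#include <vector>

bool sameMultiset(const std::vector<int>& a, const std::vector<int>& b) {
    if (a.size() != b.size()) return false;
    std::unordered_map<int, int> counts;
    for (int x : a) counts[x]++;                 // array A is the adder
    for (int x : b) {                            // array B is the remover
        auto it = counts.find(x);
        if (it == counts.end()) return false;    // B has a value A never had
        if (--it->second == 0) counts.erase(it); // last copy removed
    }
    return counts.empty();                       // everything was matched
}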
I imagine the solution will require some sort of transformation that is both associative and commutative and guarantees a unique result for a unique set of inputs. However I'm not sure if that even exists.
public static boolean match(int[] array1, int[] array2) {
    int x, y = 0;
    for (x = 0; x < array1.length; x++) {
        y = x;
        while (array1[x] != array2[y]) {
            if (y + 1 == array1.length)
                return false;
            y++;
        }
        int swap = array2[x];
        array2[x] = array2[y];
        array2[y] = swap;
    }
    return true;
}
For each array, use the counting sort technique to build the count of the number of elements less than or equal to a particular element. Then compare the two auxiliary arrays at every index; if they are equal, the arrays are equal, else they are not. Counting sort requires O(n), and the array comparison at every index is again O(n), so in total it is O(n), and the space required is equal to the size of the two arrays. Here is a link to counting sort: http://en.wikipedia.org/wiki/Counting_sort.
Given that the ints are in the range -n..+n, a simple way to check for equality may be the following (pseudo code):
// a & b are the arrays
accumulator = 0
arraysize = size(a)
for (i = 0; i < arraysize; ++i) {
    accumulator = accumulator + a[i] - b[i]
    if abs(accumulator) > ((arraysize - i) * n) { return FALSE }
}
return (accumulator == 0)
accumulator must be able to store integers in the range +- arraysize * n
How 'bout this - XOR all the numbers in both the arrays. If the result is 0, you got a match.

Does qsort demand consistent comparisons or can I use it for shuffling?

Update: Please file this under bad ideas. You don't get anything for free in life, and here is certainly proof. A simple idea gone bad. It is definitely something to learn from, however.
Lazy programming challenge. If I pass a function that 50-50 returns true or false as qsort's comparison function, I think that I can effectively unsort an array of structures by writing 3 lines of code.
int main(int argc, char **argv)
{
    srand( time(NULL) );   /* 1 */
    ...
    /* qsort(....) */      /* 2 */
}
...
int comp_nums(const int *num1, const int *num2)
{
    float frand =
        (float) (rand()) / ((float) (RAND_MAX+1.0));   /* 3 */
    if (frand >= 0.5f)
        return GREATER_THAN;
    return LESS_THAN;
}
Any pitfalls I need to look for? Is it possible in fewer lines through swapping, or is this the cleanest I get for 3 non-trivial lines?
Bad idea. I mean really bad.
Your solution gives an unpredictable result, not a random result and there is a big difference. You have no real idea of what a qsort with a random comparison will do and whether all combinations are equally likely. This is the most important criterion for a shuffle: all combinations must be equally likely. Biased results equal big trouble. There's no way to prove that in your example.
You should implement the Fisher-Yates shuffle (otherwise known as the Knuth shuffle).
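For reference, a Fisher-Yates shuffle is only a few lines; here is a sketch in C (mine), using rand() for simplicity even though it is not a high-quality source of randomness:

#include <stdlib.h>

/* O(n): walk from the end, swapping each position with a uniformly chosen
   earlier (or same) position. Every permutation is equally likely, up to the
   quality of rand() and a small modulo bias when n is not tiny vs RAND_MAX. */
void shuffle(int *a, int n) {
    for (int i = n - 1; i > 0; i--) {
        int j = rand() % (i + 1);   /* index in [0, i] */
        int tmp = a[i];
        a[i] = a[j];
        a[j] = tmp;
    }
}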
In addition to the other answers, this is worse than a simple Fisher-Yates shuffle because it is too slow. The qsort algorithm is O(n*log(n)), the Fisher-Yates is O(n).
Some more detail is available in Wikipedia on why this kind of "shuffle" does not generally work as well as the Fisher-Yates method:
Comparison with other shuffling algorithms
The Fisher-Yates shuffle is quite efficient; indeed, its asymptotic time and space complexity are optimal. Combined with a high-quality unbiased random number source, it is also guaranteed to produce unbiased results. Compared to some other solutions, it also has the advantage that, if only part of the resulting permutation is needed, it can be stopped halfway through, or even stopped and restarted repeatedly, generating the permutation incrementally as needed. In high-level programming languages with a fast built-in sorting algorithm, an alternative method, where each element of the set to be shuffled is assigned a random number and the set is then sorted according to these numbers, may be faster in practice[citation needed], despite having worse asymptotic time complexity (O(n log n) vs. O(n)). Like the Fisher-Yates shuffle, this method will also produce unbiased results if correctly implemented, and may be more tolerant of certain kinds of bias in the random numbers. However, care must be taken to ensure that the assigned random numbers are never duplicated, since sorting algorithms in general won't order elements randomly in case of a tie. A variant of the above method that has seen some use in languages that support sorting with user-specified comparison functions is to shuffle a list by sorting it with a comparison function that returns random values. However, this does not always work: with a number of commonly used sorting algorithms, the results end up biased due to internal asymmetries in the sorting implementation.[7]
This links to here:
just one more thing
While writing this article I experimented with various versions of the methods and discovered one more flaw in the original version (renamed by me to shuffle_sort). I was wrong when I said "it returns a nicely shuffled array every time it is called."
The results are not nicely shuffled at all. They are biased. Badly. That means that some permutations (i.e. orderings) of elements are more likely than others. Here's another snippet of code to prove it, again borrowed from the newsgroup discussion:
N = 100000
A = %w(a b c)
Score = Hash.new { |h, k| h[k] = 0 }
N.times do
  sorted = A.shuffle
  Score[sorted.join("")] += 1
end
Score.keys.sort.each do |key|
  puts "#{key}: #{Score[key]}"
end
This code shuffles an array of three elements (a, b, c) 100,000 times and records how many times each possible result was achieved. In this case, there are only six possible orderings and we should get each one about 16666.66 times. If we try an unbiased version of shuffle (shuffle or shuffle_sort_by), the results are as expected:
abc: 16517
acb: 16893
bac: 16584
bca: 16568
cab: 16476
cba: 16962
Of course, there are some deviations, but they shouldn't exceed a few percent of the expected value and they should be different each time we run this code. We can say that the distribution is even.
OK, what happens if we use the shuffle_sort method?
abc: 44278
acb: 7462
bac: 7538
bca: 3710
cab: 3698
cba: 33314
This is not an even distribution at all. Again?
It shows how the sort method is biased and goes into detail about why this is so. Finally, he links to Coding Horror:
Let's take a look at the correct Knuth-Fisher-Yates shuffle algorithm.
for (int i = cards.Length - 1; i > 0; i--)
{
    int n = rand.Next(i + 1);
    Swap(ref cards[i], ref cards[n]);
}
Do you see the difference? I missed it the first time. Compare the swaps for a 3 card deck:
Naïve shuffle:               rand.Next(3); rand.Next(3); rand.Next(3);
Knuth-Fisher-Yates shuffle:  rand.Next(3); rand.Next(2);
The naive shuffle results in 3^3 (27) possible deck combinations. That's odd, because the mathematics tell us that there are really only 3! or 6 possible combinations of a 3 card deck. In the KFY shuffle, we start with an initial order, swap from the third position with any of the three cards, then swap again from the second position with the remaining two cards.
No, this won't properly shuffle the array; it will barely move elements away from their original locations, with an exponential distribution.
The comparison function isn't supposed to return a boolean type; it's supposed to return a negative number, zero, or a positive number, which qsort() uses to determine which argument is greater than the other.
The Old New Thing takes on this one
I think the basic idea of randomly partitioning the set recursively on the way down and concatenating the results on the way up will work (it averages O(n log n) binary decisions, and that is darn close to log2(n!)), but qsort will not be sure to do that with a random predicate.
BTW, I think the same argument and issues apply to any O(n log n) sort strategy.
rand() isn't the most random thing out there... If you want to shuffle cards or something, this isn't the best. Also, a Knuth shuffle would be quicker, but your solution is OK if it doesn't loop forever.
