Rank and unrank Combination with constraints - arrays

I want to rank and unrank combinations with an element distance constraint. Selected elements cannot be repeated.
For example:
n := 10 elements to choose from
k := 5 elements being chosen
d := 3 max distance between 2 chosen elements
1,3,5,8,9 matches the constraint
1,5,6,7,8 does not match the constraint
How can I rank a combination under the given distance constraint, where 1,2,3,4,5 is smaller than 1,2,3,4,6? Is there a way to do the ranking without computing all combinations with a smaller rank?

You can do this by first creating and populating a two-dimensional array, which I will call NVT for "number of valid tails", to record the number of valid "tails" that start at a certain position with a given value. For example, NVT[4][6] = 3, because a combination that has 6 in position #4 can end in 3 distinct ways: (…, 6, 7), (…, 6, 8), and (…, 6, 9).
To populate NVT, start with NVT[k][…], which is just a row of all 1s. Then work your way back to earlier positions; for example, NVT[2][5] = NVT[3][6] + NVT[3][7] + NVT[3][8], because a "tail" starting at position #2 with value 5 will consist of that 5 plus a "tail" starting at position #3 with value 6, 7, or 8.
Note that we don't care whether there's actually a valid way to reach a given tail. For example, NVT[4][1] = 3 because of the valid tails (1, 2), (1, 3), and (1, 4), even though there are no complete combinations of the form (…, 1, _).
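For concreteness, here is a small sketch of how NVT could be populated (Python, with 1-based positions and values as in the text; the function name and layout are mine, not from the original answer):

def build_nvt(n, k, d):
    # NVT[p][v] = number of valid "tails" that start at position p with value v.
    # Row 0 and column 0 are left unused so positions and values stay 1-based.
    NVT = [[0] * (n + 1) for _ in range(k + 1)]
    for v in range(1, n + 1):
        NVT[k][v] = 1                  # a tail at the last position is just that single value
    for p in range(k - 1, 0, -1):
        for v in range(1, n + 1):
            # the next value must be strictly larger and at most d away
            NVT[p][v] = sum(NVT[p + 1][w] for w in range(v + 1, min(v + d, n) + 1))
    return NVT

NVT = build_nvt(10, 5, 3)
print(NVT[1][1])   # 66 for n=10, k=5, d=3, as used in the unranking example below

This runs in the O(nkd) time mentioned in the complexity analysis below.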
Once you've done that, you can compute the rank of a combination C as follows:
For position #1, count up all the valid tails starting at position #1 with a value less than C[1]. For example, if C starts with 3, then this will be NVT[1][1] + NVT[1][2], representing all combinations that start with 1 or 2.
Then do the same for all subsequent positions, counting only values that can validly follow C's value at the previous position (i.e., greater than it and at most d away). These will represent combinations that start off the same way as C up until a given position, but then have a lesser value at that position.
For example, if C is (1, 3, 5, 8, 9), this comes out to
0 +
NVT[2][2] +
NVT[3][4] +
(NVT[4][6] + NVT[4][7]) +
0,
since at position #2 the only valid value less than 3 is 2, at position #3 the only valid value less than 5 is 4, at position #4 the valid values less than 8 are 6 and 7, and at position #5 there is no valid value between 8 and 9.
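A sketch of the ranking step consistent with the description above (0-based ranks; the helper and its names are mine):

def rank(C, NVT, d):
    # C is the combination as a list of increasing values; the result counts how many
    # valid combinations are lexicographically smaller than C.
    r, prev = 0, 0
    for p, value in enumerate(C, start=1):
        lo = prev + 1 if prev else 1                 # least value allowed at this position
        for v in range(lo, value):
            if prev and v > prev + d:
                break                                # beyond the distance constraint
            r += NVT[p][v]
        prev = value
    return r

print(rank([1, 3, 5, 8, 9], NVT, 3))   # 41 with the NVT built above for n=10, k=5, d=3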
Conversely, you can find the combination C with a given rank r as follows:
Create a temporary variable rr, for "remaining rank", initially equal to r.
To find C[1] — the value in position #1 — count up valid tails starting at position #1, starting with the least possible value (namely 1), stopping once this would exceed rr. For example, since NVT[1][1] = 66 and NVT[1][2] = 50, the combination with rank 75 must start with 2 (because 75 ≥ 66 and 75 < 66 + 50). Then subtract this sum from rr (in this case leaving 75 − 66 = 9).
Then do the same for all subsequent positions, making sure to keep in mind the least possible value given what was in the previous position. Continuing our example with r = 75, C[1] = 2, and rr = 9, we know that C[2] ≥ 3; and since NVT[2][3] = 23 > rr, we immediately find that C[2] = 3.
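And a matching sketch of the unranking step (again, the names are mine):

def unrank(r, NVT, n, k, d):
    # Inverse of rank(): rebuild the combination with 0-based rank r.
    C, prev = [], 0
    for p in range(1, k + 1):
        v = prev + 1 if prev else 1                  # least possible value at this position
        hi = min(prev + d, n) if prev else n
        while v <= hi and NVT[p][v] <= r:
            r -= NVT[p][v]                           # skip all combinations taking value v here
            v += 1
        C.append(v)
        prev = v
    return C

print(unrank(75, NVT, 10, 5, 3))   # [2, 3, 5, 6, 7]; starts with 2 and then 3, as in the example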
Complexity analysis:
Space:
NVT requires O(nk) space.
Returning a combination as a length-k array inherently requires O(k) space; but if we return the combination one value at a time (by invoking a callback or printing to a console or something), then the computation itself doesn't actually depend on this array, and only requires O(1) extra space.
Aside from that, everything else can be managed in O(1) space; we don't need any recursion or temporary arrays or anything.
Time:
Populating NVT takes O(nkd) time. (Note: if d is greater than n, then we can just set d equal to n.)
Given NVT, computing the rank of a given combination takes worst-case O(nk) time.
Given NVT, computing the combination with a given rank takes worst-case O(nk) time.
Implementation note: The details of the above are a bit fiddly; it would be easy to get an off-by-one error, or mix up two variables, or whatnot, if you don't have concrete data to look at. Since there are only 168 valid combinations for your example, I recommend generating all of them, so that you can reference them during debugging.
Possible additional optimization: If you expect n to be quite large, and you expect to do a lot of queries to "rank" and "unrank" combinations, then you might find it useful to create a second array, which I will call NVTLT for "number of valid tails less than", to record the number of valid "tails" that start at a certain position with a value less than a given value. For example, NVTLT[3][5] = NVT[3][1] + NVT[3][2] + NVT[3][3] + NVT[3][4], or if you prefer, NVTLT[3][5] = NVTLT[3][4] + NVT[3][4]. (You can do this as an in-place transformation, completely overwriting NVT, so it's an O(nk) pass with no additional space.) Using NVTLT instead of NVT for your queries will let you do binary search for values, rather than linear search, giving worst-case O(k log n) time instead of O(nk) time. Note that this optimization is even trickier than the above, so even if you intend to perform this optimization, I recommend starting with the above, getting it working perfectly, and only then adding this optimization.
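A minimal sketch of that in-place transformation (the helper name is mine); after it, the per-position sum in the ranking step becomes a single subtraction, and the unranking step can binary-search within a row:

def to_nvtlt(NVT):
    # In-place prefix sums: afterwards NVT[p][v] holds the number of valid tails
    # at position p with value strictly less than v (i.e., NVTLT as described above).
    for p in range(1, len(NVT)):
        row, running = NVT[p], 0
        for v in range(1, len(row)):
            row[v], running = running, running + row[v]
    return NVT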

Related

Daily Coding Problem 260 : Reconstruct a jumbled array - Intuition?

I'm going through the question below.
The sequence [0, 1, ..., N] has been jumbled, and the only clue you have for its order is an array representing whether each number is larger or smaller than the last. Given this information, reconstruct an array that is consistent with it.
For example, given [None, +, +, -, +], you could return [1, 2, 3, 0, 4].
I went through the solution on this post but am still unable to understand why it works. I don't think I would be able to come up with the solution if I had this in front of me during an interview. Can anyone explain the intuition behind it? Thanks in advance!
This answer tries to give a general strategy for finding an algorithm to tackle this type of problem. It is not trying to prove why the given solution is correct, but to lay out a route towards such a solution.
A tried and tested way to tackle this kind of problem (actually a wide range of problems) is to start with small examples and work your way up. This works for puzzles, and just as well for problems encountered in practice.
First, note that the question is formulated deliberately to not point you in the right direction too easily. It makes you think there is some magic involved. How can you reconstruct a list of N numbers given only the list of plusses and minuses?
Well, you can't. For 10 numbers, there are 10! = 3628800 possible permutations, but only 2⁹ = 512 possible lists of signs. That's a huge difference: most original lists will be completely different after reconstruction.
Here's an overview of how to approach the problem:
Start with very simple examples
Try to work your way up, adding a bit of complexity
If you see something that seems a dead end, try increasing complexity in another way; don't spend too much time with situations where you don't see progress
While exploring alternatives, revisit old dead ends, as you might have gained new insights
Try whether recursion could work:
given a solution for N, can we easily construct a solution for N+1?
or even better: given a solution for N, can we easily construct a solution for 2N?
Given a recursive solution, can it be converted to an iterative solution?
Does the algorithm do some repetitive work that can be postponed to the end?
....
So, let's start simple (writing 0 for the None at the start):
very short lists are easy to guess:
'0++' → 0 1 2 → clearly only one solution
'0--' → 2 1 0 → only one solution
'0-+' → 1 0 2 or 2 0 1 → hey, there is no unique outcome, though the question only asks for one of the possible outcomes
lists with only plusses:
'0++++++' → 0 1 2 3 4 5 6 → only possibility
lists with only minuses:
'0-------'→ 7 6 5 4 3 2 1 0 → only possibility
lists with one minus, the rest plusses:
'0-++++' → 1 0 2 3 4 5 or 5 0 1 2 3 4 or ...
'0+-+++' → 0 2 1 3 4 5 or 0 5 1 2 3 4 or ...
→ no very obvious pattern seems to emerge
maybe some recursion could help?
given a solution for N, appending one sign more?
appending a plus is easy: just keep the existing solution and append its largest value plus 1
appending a minus, after some thought: increase all the numbers by 1 and append a zero
→ hey, we have a working solution, but maybe not the most efficient one
the algorithm just appends to an existing list, no need to really write it recursively (although the idea is expressed recursively)
appending a plus can be improved, by storing the largest number in a variable so it doesn't need to be searched at every step; no further improvements seem necessary
appending a minus is more troublesome: the list needs to be traversed with each append
what if instead of appending a zero, we append -1, and do the adding at the end?
this clearly works when there is only one minus
when two minus signs are encountered, the first time append -1, the second time -2
→ hey, this works for any number of minuses encountered: just keep a counter of them in a variable and add it to every element at the end of the algorithm
This is in bird's eye view one possible route towards coming up with a solution. Many routes lead to Rome. Introducing negative numbers might seem tricky, but it is a logical conclusion after contemplating the recursive algorithm for a while.
It works because all changes are sequential, either adding one or subtracting one, with the increasing and the decreasing sequences both starting from the same place. That guarantees that, taken together, the values form one contiguous range. For example, given the arbitrary
[None, +, -, +, +, -]
turned vertically for convenience, we can see
None 0
+ 1
- -1
+ 2
+ 3
- -2
Now just shift them up by two (to account for -2):
2 3 1 4 5 0
+ - + + -
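A small sketch of this idea (the function name is mine): walk the signs, appending max-so-far+1 for each '+' and the next negative placeholder for each '-', then shift everything up at the end:

def reconstruct(signs):                    # signs like [None, '+', '-', '+', '+', '-']
    out, hi, lo = [0], 0, 0
    for s in signs[1:]:
        if s == '+':
            hi += 1
            out.append(hi)
        else:
            lo -= 1
            out.append(lo)
    return [v - lo for v in out]           # shift by abs(lo) so the values land in [0, N]

print(reconstruct([None, '+', '-', '+', '+', '-']))   # [2, 3, 1, 4, 5, 0]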
Let's first look at a solution which (I think) is easier to understand, formalize and prove correct (though I will only explain it, not prove it formally):
We name A[0..N] our input array (where A[k] is None if k = 0 and is + or - otherwise) and B[0..N] our output array (where B[k] is in the range [0, N] and all values are unique).
First we observe that our problem (find B such that B[k] > B[k-1] if A[k] == + and B[k] < B[k-1] if A[k] == -) can be solved by solving a stricter problem:
Find B such that B[k] == max(B[0..k]) if A[k] == + and B[k] == min(B[0..k]) if A[k] == -.
This strengthens "a value must be larger or smaller than the last" to "a value must be larger or smaller than everything before it".
So a solution to this stricter problem is a solution to the original one as well.
Now how do we approach this problem?
A greedy solution will be sufficient. Indeed, it is easy to see that the value associated with the last + must be the biggest number overall (which is N), the one associated with the second-to-last + the second biggest (which is N-1), etc.
At the same time, the value associated with the last - must be the smallest number overall (which is 0), the one associated with the second-to-last - the second smallest (which is 1), etc.
So we can fill B from right to left, remembering how many + we have seen (call this X) and how many - we have seen (call this Y), and looking at the current symbol: if it is a +, we put N-X in B and increase X by 1; if it is a -, we put 0+Y in B and increase Y by 1.
In the end we fill B[0] with the only remaining value, which equals Y and also equals N-X (using the final counts).
An interesting property of this solution is that the values associated with the - signs are exactly 0 to Y-1 (where Y is now the total number of -), in reverse order; the values associated with the + signs are exactly N-X+1 to N (where X is the total number of +), in increasing order; and B[0] is always Y, which equals N-X.
So the - positions hold all the values strictly smaller than B[0], reverse sorted, and the + positions hold all the values strictly bigger than B[0], sorted.
This property is the key to understanding why the solution proposed here works:
It considers B[0] equal to 0 and then fills B following the property. That isn't yet a valid solution, because the values are not all in the range [0, N], but a simple translation shifts the range back to [0, N].
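Here is a short sketch of that right-to-left greedy fill (the function name is mine):

def reconstruct_greedy(A):                 # A like [None, '+', '+', '-', '+']
    N = len(A) - 1
    B = [0] * (N + 1)
    x = y = 0                              # '+' and '-' signs seen so far, scanning from the right
    for k in range(N, 0, -1):
        if A[k] == '+':
            B[k] = N - x
            x += 1
        else:
            B[k] = y
            y += 1
    B[0] = y                               # the one unused value; it also equals N - x
    return B

print(reconstruct_greedy([None, '+', '+', '-', '+']))   # [1, 2, 3, 0, 4]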
The idea is to produce a permutation of [0, 1, ..., N] which follows the pattern of [+, -, ...]. There are many applicable permutations, not just a single one. For instance, look at the example provided:
[None, +, +, -, +], you could return [1, 2, 3, 0, 4].
But you could also have returned other, equally valid solutions: [2,3,4,0,1] and [0,3,4,1,2] also work. The only concern is that the first number needs at least two numbers above it for positions [1] and [2], and that one number is left for the end which is lower than the ones before and after it.
So the question isn't finding the one and only pattern which is scrambled, but to produce any permutation which will work with these rules.
This algorithm answers two questions for the next member of the list: get a number that is higher/lower than the previous one, and get a number that hasn't been used yet. It takes a starting number and essentially creates two lists: an ascending list for the '+' and a descending list for the '-'. This way we guarantee that the next member is higher/lower than the previous one (because it's in fact higher/lower than all previous members, a stricter condition than the one required) and, for the same reason, we know this number wasn't used before.
So the intuition of the referenced algorithm is to start with a reference number and work your way through. Let's assume we start from 0. In the first place we put 0+1, which is 1. We keep 0 as our lowest and 1 as the highest.
l[0] h[1] list[1]
the next symbol is '+' so we take the highest number and raise it by one to 2, and update both the list with a new member and the highest number.
l[0] h[2] list [1,2]
The next symbol is '+' again, and so:
l[0] h[3] list [1,2,3]
The next symbol is '-' and so we have to put in our 0. Note that if the next symbol were another '-', we would have to stop, since we have no lower number left to produce.
l[0] h[3] list [1,2,3,0]
Luckily for us, we've chosen well and the last symbol is '+', so we can put our 4 and call it a day.
l[0] h[4] list [1,2,3,0,4]
This is not necessarily the smartest solution, as it can never know in advance whether the starting number will solve the sequence, and it always progresses by 1. That means that for some patterns [+, -, ...] it will not be able to find a solution. But for the pattern provided it works well with 0 as the initial starting point. If we chose the number 1 it would also work and produce [2,3,4,0,1], but for 2 and above it will fail. It will never produce the solution [0,3,4,1,2].
I hope this helps understanding the approach.
This is not an explanation for the question put forward by OP.
Just want to share a possible approach.
Given: N = 7

Index:   0 1 2 3 4 5 6 7
Pattern: X + - + - + - +      // X = None

Go from 0 to N.

[1] Fill all the '-' positions, starting from the right and going left:

Index:   0 1 2 3 4 5 6 7
Pattern: X + - + - + - +
Answer:      2   1   0

[2] Fill all the vacant places, i.e. [X & +], starting from the left and going right:

Index:   0 1 2 3 4 5 6 7
Pattern: X + - + - + - +
Answer:  3 4   5   6   7

Final:
Pattern: X + - + - + - +
Answer:  3 4 2 5 1 6 0 7
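A short sketch of this two-pass fill (the names are mine):

def reconstruct_two_pass(pattern):         # pattern like ['X', '+', '-', '+', '-', '+', '-', '+']
    N = len(pattern) - 1
    answer = [None] * (N + 1)
    value = 0
    for i in range(N, -1, -1):             # [1] fill the '-' positions from right to left with 0, 1, 2, ...
        if pattern[i] == '-':
            answer[i] = value
            value += 1
    for i in range(N + 1):                 # [2] fill the vacant places (X and '+') from left to right
        if answer[i] is None:
            answer[i] = value
            value += 1
    return answer

print(reconstruct_two_pass(['X', '+', '-', '+', '-', '+', '-', '+']))   # [3, 4, 2, 5, 1, 6, 0, 7]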
My answer is definitely too late for your problem, but if you want a simple proof, here it is:
min_last (or min_so_far) is a decreasing value starting from 0.
max_last (or max_so_far) is an increasing value starting from 0.
In the input, each entry (apart from the first one, which is None) is either "+" or "-"; for each "+" increase max_so_far by one, and for each "-" decrease min_so_far by one. So max_so_far − min_so_far is exactly N: max_so_far equals the number of "+"s and −min_so_far equals the number of "-"s. The values produced this way lie in the range [min_so_far, max_so_far] rather than the required [0, n], so to finish you pad (shift) by min_so_far: since min_so_far ≤ 0, subtract min_so_far from (i.e., add abs(min_so_far) to) each value of the current answer.

The best order to choose elements in the random array to maximize output?

We have an array as input.
R = [5, 2, 8, 3, 6, 9]
If the i-th element is chosen, the output is the sum of the i-th element, the max element whose index is less than i, and the min element whose index is greater than i.
For example, if I take 8, the output would be 8+5+3=16.
Selected items cannot be selected again. So, if I select 8, the array for the next selection would be R = [5, 2, 3, 6, 9].
In what order should all the inputs be chosen to maximize the total output? If possible, please suggest dynamic programming solutions.
I'll start the bidding with an O(n·2ⁿ) solution . . .
There are a number of ambiguities in your description of the problem, that you have declined to address in comments. None of these ambiguities affects the runtime complexity of this solution, but they do affect implementation details of the solution, so the solution is necessarily somewhat of a sketch.
The solution is as follows:
Create an array results of 2ⁿ integers. Each array index i will denote a certain subsequence of the input, and results[i] will be the greatest sum that we can achieve starting with that subsequence.
A convenient way to manage the index-to-subsequence mapping is to represent the first element of the input using the least significant bit (the 1's place), the second element with the 2's place, etc.; so, for example, if our input is [5, 2, 8, 3, 6, 9], then the subsequence 5 2 8 would be represented as array index 000111₂ = 7, meaning results[7]. (You can also start with the most significant bit — which is probably more intuitive — but then the implementation of that mapping is a little bit less convenient. Up to you.)
Then proceed in order, from subset #0 (the empty subset) up through subset #2ⁿ−1 (the full input), calculating each array-element by seeing how much we get if we select each possible element and add the corresponding previously-stored values. So, for example, to calculate results[7] (for the subsequence 5 2 8), we select the largest of these values:
results[6] plus how much we get if we select the 5
results[5] plus how much we get if we select the 2
results[3] plus how much we get if we select the 8
Now, it might seem like it should require O(n²) time to compute any given array-element, since there are n elements in the input that we could potentially select, and seeing how much we get if we do so requires examining all other elements (to find the maximum among prior elements and the minimum among later elements). However, we can actually do it in just O(n) time by first making a pass from right to left to record the minimal value that is later than each element of the input, and then proceeding from left to right to try each possible value. (Two O(n) passes add up to O(n).)
An important caveat: I suspect that the correct solution only ever involves, at each step, selecting either the rightmost or second-to-rightmost element. If so, then the above solution calculates many, many more values than an algorithm that took that into account. For example, the result at index 111000₂ is clearly not relevant in that case. But I can't prove this suspicion, so I present the above O(n·2ⁿ) solution as the fastest solution whose correctness I'm certain of.
(I'm assuming that the elements are nonnegative absent a suggestion to the contrary.)
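A sketch of that subset DP (the function name is mine, and it assumes that a missing left-max or right-min simply contributes 0, one of the ambiguities the answer mentions):

def max_total_bitmask(R):
    n = len(R)
    full = (1 << n) - 1
    results = [0] * (1 << n)       # results[mask] = best total when the elements in `mask` are the ones remaining
    for mask in range(1, full + 1):
        idxs = [i for i in range(n) if mask & (1 << i)]
        # right-to-left pass: min value strictly to the right of each remaining element (0 if none)
        suffix_min, running = [0] * len(idxs), None
        for t in range(len(idxs) - 1, -1, -1):
            suffix_min[t] = running if running is not None else 0
            running = R[idxs[t]] if running is None else min(running, R[idxs[t]])
        # left-to-right pass: try selecting each element, tracking the max strictly to its left
        best, prefix_max = 0, 0
        for t, i in enumerate(idxs):
            gain = R[i] + prefix_max + suffix_min[t]
            best = max(best, gain + results[mask & ~(1 << i)])
            prefix_max = max(prefix_max, R[i])
        results[mask] = best
    return results[full]

print(max_total_bitmask([5, 2, 8, 3, 6, 9]))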
Here's an O(n^2)-time algorithm based on ruakh's conjecture that there exists an optimal solution where every selection is from the rightmost two, which I prove below.
The states of the DP are (1) n, the number of elements remaining (2) k, the index of the rightmost element. We have a recurrence
OPT(n, k) = max(max(R(0), ..., R(n - 2)) + R(n - 1) + R(k) + OPT(n - 1, k),
max(R(0), ..., R(n - 1)) + R(k) + OPT(n - 1, n - 1)),
where the first line is when we take the second rightmost element, and the second line is when we take the rightmost. The empty max is zero. The base cases are
OPT(1, k) = R(k)
for all k.
Proof: the condition of choosing from the two rightmost elements is equivalent to the restriction that the element at index i (counting from zero) can be chosen only when at most i + 2 elements remain. We show by induction that there exists an optimal solution satisfying this condition for all i < j where j is the induction variable.
The base case is trivial, since every optimal solution satisfies the vacuous restriction for j = 0. In the inductive case, assume that there exists an optimal solution satisfying the restriction for all i < j. If j is chosen when there are more than j + 2 elements left, let's consider what happens if we defer that choice until there are exactly j + 2 elements left. None of the elements left of j are chosen in this interval by the inductive hypothesis, so they are irrelevant. Choosing the elements right of j can only be at least as profitable, since including j cannot decrease the max. Meanwhile, the set of elements left of j is the same at both times, and the set of the elements right of j is a subset at the later time as compared to the earlier time, so the min does not decrease. We conclude that this deferral does not affect the profitability of the solution.
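Here is one way to code up that rightmost-two DP. It uses a slightly different but equivalent state, (m, j), meaning "the prefix R[0..m-2] plus R[j] as the rightmost element remain", and, like the sketch above, it assumes a missing left-max or right-min contributes 0:

from functools import lru_cache

def max_total_rightmost(R):
    n = len(R)
    prefix_max = [0] * (n + 1)                # prefix_max[m] = max(R[0..m-1]), 0 when m == 0
    for i, v in enumerate(R):
        prefix_max[i + 1] = max(prefix_max[i], v)

    @lru_cache(maxsize=None)
    def opt(m, j):
        # m elements remain: the prefix R[0..m-2] plus R[j] as the rightmost (j >= m-1)
        if m == 1:
            return R[j]                       # only one element left: no neighbours on either side
        take_last = R[j] + prefix_max[m - 1] + opt(m - 1, m - 2)
        take_second = R[m - 2] + prefix_max[m - 2] + R[j] + opt(m - 1, j)
        return max(take_last, take_second)

    return opt(n, n - 1)

print(max_total_rightmost([5, 2, 8, 3, 6, 9]))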

Query on a array

Assume that I have an array A = {a, b, c, d, e, f, g, h, ...} and Q queries. In each query I will be asked to do one of the following operations:
1 i j -> increase the i-th element by 1 and decrease the j-th element by 1
2 x -> tell the number of elements of the array which are less than x
If there were no update operations I could do this with lower_bound on a sorted array. I can still do it by sorting the array and finding the lower bound after each update, but the complexity will be too high, since the size of the array A and Q can both be 10^5. Is there a faster algorithm or way to do this?
The simplest way is to use std::count_if.
What complexity bound do you have to meet? (10^5)² is still only 10^10.
If you have to do better than that, I suspect you have to have a "value" which has back pointers to the "index", and an "index" which is a pointer to the value. Sort the values initially, and then when you update, move the value to the right point. (Probably best to see if the value needs to move at all before searching).
Then the query is still a lower bound operation.
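A rough sketch of that idea (the class and method names are mine): keep the live array plus a sorted copy, relocate a value with two binary searches when it changes by ±1, and answer the counting query with a lower bound:

import bisect

class LessThanCounter:
    def __init__(self, A):
        self.A = list(A)
        self.S = sorted(A)                 # sorted copy of the current values

    def _move(self, old, new):
        # remove one occurrence of `old` and insert `new`, both located by binary search
        self.S.pop(bisect.bisect_left(self.S, old))
        bisect.insort(self.S, new)

    def update(self, i, j):                # query "1 i j": A[i] += 1, A[j] -= 1
        self._move(self.A[i], self.A[i] + 1); self.A[i] += 1
        self._move(self.A[j], self.A[j] - 1); self.A[j] -= 1

    def count_less(self, x):               # query "2 x": number of elements < x
        return bisect.bisect_left(self.S, x)

The list shifts inside pop/insort are still linear in the worst case, but the searches themselves are logarithmic, in the spirit of the answers here.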
Once you sort the array (O(n log n) complexity), a query "LESS(X)" will run in log n time, since you can use binary search. Once you know that element X (or the next larger element in A) is found at position k, you know that k is your answer (k elements are less than X).
The (i, j) command implies a partial reorder of the array between the element which is immediately less than min(A[i]+1, A[j]-1) and the one which is immediately after max(A[i], A[j]). You can find both in log n (worst case log n + n) time. This is close to the worst case:
k:  0  1  2  3  4  5  6  7  8  9        command: (4, 5)
v:  7 14 14 15 15 15 16 16 16 18
                ^  ^
A[4] becomes 16, A[5] becomes 14 -- does the 14 go before index 3 or before index 1?
The re-sort is then worst case n, since your array is already almost sorted except for two elements, which means you'll do well by using two runs of insertion sort.
So with m update queries and q simple queries you can expect to have
n log n + m*2*(log n + 2*n) + q * log n
complexity. Average case (no pathological arrays, reasonable sparseness, no pathological updates, (j-i) = d << n) will be
( n + 2m + q ) * log n + 2m*d
which is linearithmic. With n = m = q = 10^5, you get an overall complexity which is still below 10^7 unless you've got pathological arrays and ad hoc queries, in which case the complexity should be quadratic (or maybe even cubic; I haven't examined it closely).
In a real world scenario, you can also conceivably employ some tricks. Remember the last values of the modified indexes of i and j, and the last location query k. This costs little. Now on the next query, chances are that you will be able to use one of the three values to prime your binary search and shave some time.

Efficient histogram implementation using a hash function

Is there a more efficient approach to computing a histogram than a binary search for a non-linear bin distribution?
I'm actually only interested in the bit of the algorithm that matches the key (value) to the bin (the transfer function?), i.e. for a bunch of floating point values I just want to know the appropriate bin index for each value.
I know that for a linear bin distribution you can get O(1) by dividing the value by the bin width, and that for non-linear bins a binary search gets you O(log N). My current implementation uses a binary search on unequal bin widths.
In the spirit of improving efficiency I was curious as to whether you could use a hash function to map a value to its appropriate bin and achieve O(1) time complexity when you have bins of unequal widths?
In some simple cases you can get O(1).
Suppose, your values are 8-bit, from 0 to 255.
If you split them into 8 bins of sizes 2, 2, 4, 8, 16, 32, 64, 128, then the bin value ranges will be: 0-1, 2-3, 4-7, 8-15, 16-31, 32-63, 64-127, 128-255.
In binary these ranges look like:
0000000x (bin 0)
0000001x
000001xx
00001xxx
0001xxxx
001xxxxx
01xxxxxx
1xxxxxxx (bin 7)
So, if you can quickly (in O(1)) count how many most significant zero bits there are in the value, you can get the bin number from it.
In this particular case you may precalculate a look-up table of 256 elements, containing the bin number and finding the appropriate bin for a value is just one table look-up.
Actually, with 8-bit values you can use bins of arbitrary sizes since the look-up table is small.
If you were to go with bins of sizes of powers of 2, you could reuse this look-up table for 16-bit values as well. And you'd need two look-ups. You can extend it to even longer values.
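As a tiny sketch, the whole transfer function for those power-of-two bin sizes is just a 256-entry table (the names are mine):

def build_lut():
    # bin 0: 0-1, bin 1: 2-3, bin 2: 4-7, ..., bin 7: 128-255
    return [0 if v < 2 else v.bit_length() - 1 for v in range(256)]

LUT = build_lut()
assert LUT[1] == 0 and LUT[3] == 1 and LUT[130] == 7   # one table look-up per value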
Ordinary hash functions are intended to scatter different values quite randomly across some range. A single-bit difference in arguments may lead to dozens of bits different in results. For that reason, ordinary hash functions are not suitable for the situation described in the question.
An alternative is to build an array P with entries that index into the table B of bin limits. Given some value x, we find the bin j it belongs to (or sometimes a nearby bin) via j = P[⌊x·r⌋] where r is a ratio that depends on the size of P and the maximum value in B. The effectiveness of this approach depends on the values in B and the size of P.
The behavior of functions like P[⌊x·r⌋] can be seen via the python code shown below. (The method is about the same in any programming language. However, tips for Python-to-C are given below.) Suppose the code is stored in file histobins.py and loaded into the ipython interpreter with the command import histobins as hb. Then a command like hb.betterparts(27, 99, 9, 80,155) produces output like
At 80 parts, steps = 20 = 7+13
At 81 parts, steps = 16 = 7+9
At 86 parts, steps = 14 = 6+8
At 97 parts, steps = 13 = 12+1
At 108 parts, steps = 12 = 3+9
At 109 parts, steps = 12 = 8+4
At 118 parts, steps = 12 = 6+6
At 119 parts, steps = 10 = 7+3
At 122 parts, steps = 10 = 3+7
At 141 parts, steps = 10 = 5+5
At 142 parts, steps = 10 = 4+6
At 143 parts, steps = 9 = 7+2
These parameters to betterparts set nbins=27, topsize=99, seed=9, plo=80, phi=155 which creates a test set of 27 bins for values from 0 to 99, with random seed 9, and size of P from 80 to 155-1. The number of “steps” is the number of times the two while loops in testparts() operated during a test with 10*nbins values from 0 to topsize. Eg, “At 143 parts, steps = 9 = 7+2” means that when the size of P is 143, out of 270 trials, 261 times P[⌊x·r⌋] produced the correct index at once; 7 times the index had to be decreased, and twice it had to be increased.
The general idea of the method is to trade off space for time. Another tradeoff is preparation time versus operation time. If you are going to be doing billions of lookups, it is worthwhile to do a few thousand trials to find a good value of |P|, the size of P. If you are going to be doing only a few millions of lookups, it might be better to just pick some large value of |P| and run with it, or perhaps just run betterparts over a narrow range. Instead of doing 75 tests as above, if we start with larger |P| fewer tests may give a good enough result. For example, 10 tests via “hb.betterparts(27, 99, 9, 190,200)” produces
At 190 parts, steps = 11 = 5+6
At 191 parts, steps = 5 = 3+2
At 196 parts, steps = 5 = 4+1
As long as P fits into some level of cache (along with other relevant data) making |P| larger will speed up access. So, making |P| as large as practical is a good idea. As |P| gets larger, the difference in performance between one value of |P| and the next gets smaller and smaller. The limiting factors on speed then include time to multiply and time to set up while loops. One approach for faster multiplies may be to choose a power of 2 as a multiplier; compute |P| to match; then use shifts or adds to exponents instead of multiplies. One approach to spending less time setting up while loops is to move the statement if bins[bin] <= x < bins[bin+1]: (or its C equivalent, see below) to before the while statements and do the while's only if the if statement fails.
Python code is shown below. Note, in translating from Python to C,
• # begins a comment
• def begins a function
• a statement like ntest, right, wrong, x = 10*nbins, 0, 0, 0 assigns values to respective identifiers
• a statement like return (ntest, right, wrong, stepdown, stepup) returns a tuple of 5 values that the caller can assign to a tuple or to respective identifiers
• the scope of a def, while, or if ends with a line not indented farther than the def, while, or if
• bins = [0] initializes a list (an extendible indexable array) with value 0 as its initial entry
• bins.append(t) appends value t at the end of list bins
• for i,j in enumerate(p): runs a loop over the elements of iterable p (in this case, p is a list), making the index i and corresponding entry j == p[i] available inside the loop
• range(nparts) stands for a list of the values 0, 1, ... nparts-1
• range(plo, phi) stands for a list of the values plo, plo+1, ... phi-1
• if bins[bin] <= x < bins[bin+1] means if ((bins[bin] <= x) && (x < bins[bin+1]))
• int(round(x*float(nparts)/topsize)) actually rounds x·r, instead of computing ⌊x·r⌋ as advertised above
import random

def makebins(nbins, topsize):
    bins, t = [0], 0
    for i in range(nbins):
        t += random.random()
        bins.append(t)
    for i in range(nbins+1):
        bins[i] *= topsize/t
    bins.append(topsize+1)
    return bins
#________________________________________________________________
def showbins(bins):
    print(''.join('{:6.2f} '.format(x) for x in bins))

def showparts(nbins, bins, topsize, nparts, p):
    ratio = float(topsize)/nparts
    for i, j in enumerate(p):
        print('{:3d}. {:3d} {:6.2f} {:7.2f} '.format(i, j, bins[j], i*ratio))
    print('nbins: {} topsize: {} nparts: {} ratio: {}'.format(nbins, topsize, nparts, ratio))
    print('p = ', p)
    print('bins = ', end='')
    showbins(bins)
#________________________________________________________________
def testparts(nbins, topsize, nparts, seed):
    # Make bins and make lookup table p
    if seed > 0: random.seed(seed)
    bins = makebins(nbins, topsize)
    ratio, j, p = float(topsize)/nparts, 0, list(range(nparts))
    for i in range(nparts):
        while j < nbins and i*ratio >= bins[j+1]:
            j += 1
        p[i] = j
    p.append(j)
    #showparts(nbins, bins, topsize, nparts, p)
    # Count # of hits and steps with avg. of 10 items per bin
    ntest, right, wrong, x = 10*nbins, 0, 0, 0
    delta, stepdown, stepup = topsize/float(ntest), 0, 0
    for i in range(ntest):
        bin = p[min(nparts, max(0, int(round(x*float(nparts)/topsize))))]
        while bin < nbins and x >= bins[bin+1]:
            bin += 1; stepup += 1
        while bin > 0 and x < bins[bin]:
            bin -= 1; stepdown += 1
        if bins[bin] <= x < bins[bin+1]: # Test if bin is correct
            right += 1
        else:
            wrong += 1
            print('Wrong bin {} {:7.3f} at x={:7.3f} Too {}'.format(bin, bins[bin], x, 'high' if bins[bin] > x else 'low'))
        x += delta
    return (ntest, right, wrong, stepdown, stepup)
#________________________________________________________________
def betterparts(nbins, topsize, seed, plo, phi):
    beststep = 1e9
    for parts in range(plo, phi):
        ntest, right, wrong, stepdown, stepup = testparts(nbins, topsize, parts, seed)
        if wrong: print('Error with ', parts, ' parts')
        steps = stepdown + stepup
        if steps <= beststep:
            beststep = steps
            print('At {:3d} parts, steps = {:d} = {:d}+{:d}'.format(parts, steps, stepdown, stepup))
#________________________________________________________________
Interpolation search is your friend. It's kind of an optimistic, predictive binary search where it guesses where the bin should be based on a linear assumption about the distribution of inputs, rather than just splitting the search space in half at each step. It will be O(1) if the linear assumption is true, but still works (though more slowly) when the assumption is not. To the degree that its predictions are accurate, the search is fast.
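A rough sketch of interpolation search over sorted bin edges (the names are mine; it assumes strictly increasing edges and edges[0] <= x < edges[-1]):

def find_bin(edges, x):
    lo, hi = 0, len(edges) - 1            # invariant: edges[lo] <= x < edges[hi]
    while hi - lo > 1:
        # guess the position assuming the edges are spread roughly linearly
        frac = (x - edges[lo]) / (edges[hi] - edges[lo])
        mid = min(max(lo + int(frac * (hi - lo)), lo + 1), hi - 1)
        if edges[mid] <= x:
            lo = mid
        else:
            hi = mid
    return lo                             # bin index: edges[lo] <= x < edges[lo+1]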
It depends on the implementation of the hashing and the type of data you're working with. For smaller data sets a simpler algorithm like binary search might outperform constant-time lookup if the lookup overhead of hashing is larger on average.
The usual implementation of hashing consists of an array of linked lists and a hash function that maps a string to an index in the array of linked lists. There's a quantity called the load factor, which is the number of elements in the hash map divided by the length of the linked-list array. Thus for load factors < 1 you'll achieve constant lookup in the best case, because no linked list will contain more than one element.
There's only one way to find out which is better - implement a hash map and see for yourself. You should be able to get something near constant lookup :)

Why is the average number of steps for finding an item in an array N/2?

Could somebody explain why the average number of steps for finding an item in an unsorted array data-structure is N/2?
This really depends what you know about the numbers in the array. If they're all drawn from a distribution where all the probability mass is on a single value, then on expectation it will take you exactly 1 step to find the value you're looking for, since every value is the same, for example.
Let's now make a pretty strong assumption, that the array is filled with a random permutation of distinct values. You can think of this as picking some arbitrary sorted list of distinct elements and then randomly permuting it. In this case, suppose you're searching for some element in the array that actually exists (this proof breaks down if the element is not present). Then the number of steps you need to take is given by X, where X is the position of the element in the array. The average number of steps is then E[X], which is given by
E[X] = 1 Pr[X = 1] + 2 Pr[X = 2] + ... + n Pr[X = n]
Since we're assuming all the elements are drawn from a random permutation,
Pr[X = 1] = Pr[X = 2] = ... = Pr[X = n] = 1/n
So this expression is given by
E[X] = sum (i = 1 to n) i/n = (1/n) · sum (i = 1 to n) i = (1/n) · n(n + 1)/2 = (n + 1)/2
Which, I think, is the answer you're looking for.
The claim as stated isn't always right: depending on the assumptions, linear search may perform better than N/2 steps on average.
Perhaps a simpler example that shows why the average is N/2 is this:
Assume you have an unsorted array of 10 items: [5, 0, 9, 8, 1, 2, 7, 3, 4, 6]. This is all the digits [0..9].
Since the array is unsorted (i.e. you know nothing about the order of the items), the only way you can find a particular item in the array is by doing a linear search: start at the first item and go until you find what you're looking for, or you reach the end.
So let's count how many operations it takes to find each item. Finding the first item (5) takes only one operation. Finding the second item (0) takes two. Finding the last item (6) takes 10 operations. The total number of operations required to find all 10 items is 1+2+3+4+5+6+7+8+9+10, or 55. The average is 55/10, or 5.5.
The "linear search takes, on average, N/2 steps" conventional wisdom makes a number of assumptions. The two biggest are:
The item you're looking for is in the array. If an item isn't in the array, then it takes N steps to determine that. So if you're often looking for items that aren't there, then your average number of steps per search is going to be much higher than N/2.
On average, each item is searched for approximately as often as any other item. That is, you search for "6" as often as you search for "0", etc. If some items are looked up significantly more often than others, then the average number of steps per search is going to be skewed in favor of the items that are searched for more frequently. The number will be higher or lower than N/2, depending on the positions of the most frequently looked-up items.
While I think templatetypedef has the most instructive answer, in this case there is a much simpler one.
Consider permutations of the set {x₁, x₂, ..., xₙ} where n = 2m. Now take some element xᵢ you wish to locate. For each permutation where xᵢ occurs at index m − k, there is a corresponding mirror-image permutation where xᵢ occurs at index m + k. The mean of these possible indices is just [(m − k) + (m + k)]/2 = m = n/2. Therefore the mean over all possible permutations of the set is n/2.
Consider a simple reformulation of the question:
What would be the value of
lim (i → ∞) of (sum from 1 to i of random(n)) / i
Or in C:
int sum = 0, i;
/* random(n) is taken here to return a uniformly distributed value in 1..n */
for (i = 0; i < LARGE_NUM; i++) sum += random(n);
sum /= LARGE_NUM;
If we assume that our random function has an even distribution of values (each value from 1 to n is equally likely to be produced), then the expected result will be (1+n)/2.

Resources