Why is the average number of steps for finding an item in an array N/2? - arrays

Could somebody explain why the average number of steps for finding an item in an unsorted array data-structure is N/2?

This really depends on what you know about the numbers in the array. For example, if they're all drawn from a distribution that puts all its probability mass on a single value, then it will take exactly 1 step on expectation to find the value you're looking for, since every value is the same.
Let's now make a pretty strong assumption, that the array is filled with a random permutation of distinct values. You can think of this as picking some arbitrary sorted list of distinct elements and then randomly permuting it. In this case, suppose you're searching for some element in the array that actually exists (this proof breaks down if the element is not present). Then the number of steps you need to take is given by X, where X is the position of the element in the array. The average number of steps is then E[X], which is given by
E[X] = 1 Pr[X = 1] + 2 Pr[X = 2] + ... + n Pr[X = n]
Since we're assuming all the elements are drawn from a random permutation,
Pr[X = 1] = Pr[X = 2] = ... = Pr[X = n] = 1/n
So this expression is given by
E[X] = sum (i = 1 to n) i / n = (1 / n) sum (i = 1 to n) i = (1 / n) (n)(n + 1) / 2
= (n + 1) / 2
Which, I think, is the answer you're looking for.
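A quick simulation sketch (my own code, under the same assumptions: distinct values in random order and the target always present) that converges to (n + 1)/2:

import random

n, trials = 10, 100000
total_steps = 0
for _ in range(trials):
    a = list(range(n))
    random.shuffle(a)                     # random permutation of distinct values
    target = random.randrange(n)          # target is always present
    total_steps += a.index(target) + 1    # steps = 1-based position of the target
print(total_steps / trials)               # approximately 5.5 for n = 10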

The question as stated is not quite right: N/2 is only the average under certain assumptions, and any particular linear search may perform better.

Perhaps a simpler example that shows why the average is N/2 is this:
Assume you have an unsorted array of 10 items: [5, 0, 9, 8, 1, 2, 7, 3, 4, 6]. This is all the digits [0..9].
Since the array is unsorted (i.e. you know nothing about the order of the items), the only way you can find a particular item in the array is by doing a linear search: start at the first item and go until you find what you're looking for, or you reach the end.
So let's count how many operations it takes to find each item. Finding the first item (5) takes only one operation. Finding the second item (0) takes two. Finding the last item (6) takes 10 operations. The total number of operations required to find all 10 items is 1+2+3+4+5+6+7+8+9+10, or 55. The average is 55/10, or 5.5.
The "linear search takes, on average, N/2 steps" conventional wisdom makes a number of assumptions. The two biggest are:
The item you're looking for is in the array. If an item isn't in the array, then it takes N steps to determine that. So if you're often looking for items that aren't there, then your average number of steps per search is going to be much higher than N/2.
On average, each item is searched for approximately as often as any other item. That is, you search for "6" as often as you search for "0", etc. If some items are looked up significantly more often than others, then the average number of steps per search is going to be skewed in favor of the items that are searched for more frequently. The number will be higher or lower than N/2, depending on the positions of the most frequently looked-up items.

While I think templatetypedef has the most instructive answer, in this case there is a much simpler one.
Consider permutations of the set {x1, x2, ..., xn} where n = 2m. Now take some element xi you wish to locate. For each permutation where xi occurs at index m - k, there is a corresponding mirror-image permutation where xi occurs at index m + k. The mean of these two indices is just [(m - k) + (m + k)]/2 = m = n/2. Therefore the mean position of xi over all possible permutations of the set is n/2.

Consider a simple reformulation of the question:
What would be the limit of
lim (i->inf) of (sum(from 1 to i of random(n)) /i)
Or in C:
long sum = 0;
int i;
for (i = 0; i < LARGE_NUM; i++)
    sum += random(n);                      /* random(n): assumed to return a uniform value in 1..n */
double average = (double) sum / LARGE_NUM; /* approaches (1 + n) / 2 */
If we assume that random(n) has a uniform distribution of values (each value from 1 to n is equally likely to be produced), then the expected result is (1 + n)/2.

Related

Array operations for maximum sum

Given an array A consisting of N elements. Our task is to find the maximal subarray sum after applying the following operation exactly once:
Select any subarray and set all of its elements to zero.
E.g.: if the array is -1 4 -1 2, then the answer is 6, because we can choose the -1 at index 2 as a subarray and make it 0. The resulting array after applying the operation is -1 4 0 2, and the maximum sum subarray is 4 + 0 + 2 = 6.
My approach was to find the start and end indexes of the minimum sum subarray, set all of its elements to 0, and then find the maximum sum subarray. But this approach is wrong.
Starting simple:
First, let us start with the part of the question: Finding the maximal subarray sum.
This can be done via dynamic programming:
a = [1, 2, 3, -2, 1, -6, 3, 2, -4, 1, 2, 3]
a = [-1, -1, 1, 2, 3, 4, -6, 1, 2, 3, 4]
def compute_max_sums(a):
    res = []
    currentSum = 0
    for x in a:
        if currentSum > 0:
            res.append(x + currentSum)
            currentSum += x
        else:
            res.append(x)
            currentSum = x
    return res

res = compute_max_sums(a)
print(res)
print(max(res))
Quick explanation: we iterate through the array. As long as the running sum is positive, it is worth appending the whole block to the next number. If we dip below zero at any point, we discard the whole "tail" sequence, since it is no longer profitable to keep it, and we start anew. At the end, we have an array whose j-th element is the maximal sum of a subarray i:j where 0 <= i <= j.
Rest is just the question of finding the maximal value in the array.
Back to the original question
Now that we have solved the simplified version, it is time to look further. We can now select a subarray to be deleted to increase the maximal sum. The naive solution would be to try every possible subarray and repeat the steps above. This would unfortunately take too long¹. Fortunately, there is a way around this: we can think of the zeroes as a bridge between two maxima.
There is one more thing to address, though: currently, when we look at the j-th element, we only know that its tail is somewhere behind it. So if we were to take the maximum and the second-biggest element from the array, they could overlap, which would be a problem, since we would be counting some of the elements more than once.
Overlapping tails
How to mitigate this "overlapping tails" issue?
The solution is to compute everything once more, this time from the end to the start. This gives us two arrays: one where the j-th element has its tail i pointing towards the left end of the array (i.e. i <= j) and the other where the reverse is true. Now, if we take x from the first array and y from the second array, we know that if index(x) < index(y) then their respective subarrays are non-overlapping.
We can now proceed to try every suitable x, y pair; there are O(n²) of them. Since we don't need any further computation (we already precomputed the values), this is the final complexity of the algorithm: the preparation costs only O(n) and thus imposes no additional penalty.
Here be dragons
So far the stuff we did was rather straightforward. The following section is not that complex, but there are going to be some moving parts. Time to brush up on max heaps:
Accessing the max is in constant time
Deleting any element is O(log(n)) if we have a reference to that element. (We can't find the element in O(log(n)); however, if we know where it is, we can swap it with the last element of the heap, delete it, and bubble down the swapped element in O(log(n)).)
Adding any element into the heap is O(log(n)) as well.
Building a heap can be done in O(n)
That being said, since we need to go from start to the end, we can build two heaps, one for each of our pre-computed arrays.
We will also need a helper array that will give us quick index -> element-in-heap access to get the delete in log(n).
The first heap will start empty - we are at the start of the array, the second one will start full - we have the whole array ready.
Now we can iterate over whole array. In each step i we:
Compare the max(heap1) + max(heap2) with our current best result to get the current maximum. O(1)
Add the i-th element from the first array into the first heap - O(log(n))
Remove the i-th indexed element from the second heap(this is why we have to keep the references in a helper array) - O(log(n))
The resulting complexity is O(n * log(n)).
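For the curious, here is a minimal runnable sketch of the two-heap idea (my own interpretation, reusing compute_max_sums from above). Python's heapq is a min-heap, so values are stored negated, and the "remove the i-th element" step is done with lazy deletion via a removed[] marker array instead of direct references.

import heapq

def max_sum_after_zeroing(a):
    max_sums = compute_max_sums(a)                                      # best sum ending at index i
    rev_max_sums = list(reversed(compute_max_sums(list(reversed(a)))))  # best sum starting at index j

    heap1 = []                                              # candidates from the left part
    heap2 = [(-rev_max_sums[j], j) for j in range(len(a))]  # candidates from the right part
    heapq.heapify(heap2)
    removed = [False] * len(a)

    best = 0
    for i in range(len(a) - 1):
        heapq.heappush(heap1, -max_sums[i])         # a left subarray may now end at i
        removed[i] = True                           # index i leaves the right-hand heap
        while heap2 and removed[heap2[0][1]]:       # lazy deletion of stale right candidates
            heapq.heappop(heap2)
        best = max(best, -heap1[0] - heap2[0][0])   # max(heap1) + max(heap2)
    return best

It computes the same value as the O(n²) pair loop shown further down.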
Update:
Just a quick illustration of the O(n²) solution since OP nicely and politely asked. Man oh man, I'm not your bro.
Note 1: Getting the solution won't help you as much as figuring out the solution on your own.
Note 2: The fact that the following code gives the correct answer is not a proof of its correctness. While I'm fairly certain that my solution should work, it is definitely worth looking into why it works (if it works) rather than looking at one example of it working.
input = [100, -50, -500, 2, 8, 13, -160, 5, -7, 100]
reverse_input = [x for x in reversed(input)]
max_sums = compute_max_sums(input)
rev_max_sums = [x for x in reversed(compute_max_sums(reverse_input))]
print(max_sums)
print(rev_max_sums)
current_max = 0
for i in range(len(max_sums)):
    if i < len(max_sums) - 1:
        for j in range(i + 1, len(rev_max_sums)):
            if max_sums[i] + rev_max_sums[j] > current_max:
                current_max = max_sums[i] + rev_max_sums[j]
print(current_max)
¹ There are n possible beginnings, n possible ends, and the complexity of the code we have is O(n), resulting in a complexity of O(n³). Not the end of the world, however it's not nice either.

Maximize number of inversion count in array

We are given an unsorted array A of integers (duplicates allowed) with size N, possibly large. We can count the number of pairs with indices i < j for which A[i] < A[j]; let's call this X.
We can change maximum one element from the array with a cost equal to the difference in absolute values (for instance, if we replace element on index k with the new number K, the cost Y is | A[k] - K |).
We can only replace this element with other elements found in the array.
We want to find the minimum possible value of X + Y.
Some examples:
[1,2,2] should return 1 (change the 1 to 2 such that the array becomes [2,2,2])
[2,2,3] should return 1 (change the 3 to 2)
[2,1,1] should return 0 (because no changes are necessary)
[1,2,3,4] should return 6 (this is already the minimum possible value)
[4,4,5,5] should return 3 (this can be accomplished by changing the first 4 into a 5 or the last 5 into a 4)
The number of pairs can be found with a naive O(n²) solution, here in Python:
def calc_x(arr):
    n = len(arr)
    cnt = 0
    for i in range(n):
        for j in range(i+1, n):
            if arr[j] > arr[i]:
                cnt += 1
    return cnt
A brute-force solution is easily written, for example:
def f(arr):
    best_val = calc_x(arr)
    used = set(arr)
    for i, v in enumerate(arr):
        for replacement in used:
            if replacement == v:
                continue
            arr2 = arr[:i] + [replacement] + arr[i+1:]   # replace element i
            y = abs(replacement - v)
            x = calc_x(arr2)
            best_val = min(best_val, x + y)
    return best_val
We can count for each element the number of items right of it larger than itself in O(n*log(n)) using for instance an AVL-tree or some variation on merge sort.
However, we still have to search which element to change and what improvement it can achieve.
This was given as an interview question and I would like some hints or insights as how to solve this problem efficiently (data structures or algorithm).
Definitely go for an O(n log n) approach when counting inversions.
We can see that when you change a value at index k, you can either:
1) increase it, and then possibly reduce the number of inversions with elements bigger than it, but increase the number of inversions with elements smaller than it
2) decrease it (the opposite thing happens)
Let's try not to count x every time you change a value. What do you need to know?
In case 1):
You have to know how many elements on the left are smaller than your new value v and how many elements on the right are bigger than your new value. You can pretty easily check that in O(n). So what is your x now? You can count it with the following formula:
prev_val - your previous value
prev_x - x that you've counted at the beginning of your program
prev_l - number of elements on the left smaller than prev_val
prev_r - number of elements on the right bigger than prev_val
v - new value
l - number of elements on the left smaller than v
r - number of elements on the right bigger than v
new_x = prev_x + r + l - prev_l - prev_r
In the second case you pretty much do the opposite thing.
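A minimal sketch of that O(n) recount (my own helper; the names mirror the variables above, and it covers both the increase and the decrease case):

def new_x_after_change(arr, prev_x, k, v):
    prev_val = arr[k]
    prev_l = sum(1 for e in arr[:k] if e < prev_val)     # left elements smaller than the old value
    prev_r = sum(1 for e in arr[k+1:] if e > prev_val)   # right elements bigger than the old value
    l = sum(1 for e in arr[:k] if e < v)                 # left elements smaller than the new value
    r = sum(1 for e in arr[k+1:] if e > v)               # right elements bigger than the new value
    return prev_x + r + l - prev_l - prev_r

# new_x_after_change([1, 2, 2], 2, 0, 2) == 0, matching the [1,2,2] -> [2,2,2] example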
Right now you get something like O(n³) instead of O(n³ log n), which is probably still bad. Unfortunately that's all I've come up with for now. I'll definitely tell you if I come up with something better.
EDIT: What about the memory limit? Is there any? If not, you can, for each element in the array, make two sets with the elements before and after the current one. Then you can find the number of smaller/bigger elements in O(log n), making your time complexity O(n² log n).
EDIT 2: We can also try to check what element would be best to change to a value v, for every possible value v. We can then keep two sets and add/erase elements from them while checking every element, making the time complexity O(n² log n) without using too much space. So the algorithm would be:
1) determine every value v to which an element could be changed, and calculate x for the original array
2) for each possible value v:
    make two sets, and push all elements into the second one
    for each element e in the array:
        add the previous element (if there is one) to the first set and erase element e from the second set, then count the number of bigger/smaller elements in sets 1 and 2 and calculate the new x
EDIT 3: Instead of making two sets, you could go with a prefix sum per value. That's O(n²) already, but I think we can do even better than this.

Find the Element Occurring b times in an an array of size n*k+b

Description
Given an array of size (n*k+b), where n elements occur k times and one element occurs b times; in other words, there are n+1 distinct elements. Given that 0 < b < k, find the element occurring b times.
My Attempted solutions
The obvious solution is hashing, but it will not work if the numbers are very large. Its complexity is O(n).
Using a map to store the frequency of each element and then traversing the map to find the element occurring b times. As maps are implemented as height-balanced trees, the complexity is O(n log n).
Both of my solutions were accepted, but the interviewer wanted a linear solution without hashing. The hint he gave was to make the height of the tree in which you store the frequencies constant, but I have not been able to figure out the correct solution yet.
I want to know how to solve this problem in linear time without hashing?
EDIT:
Sample:
Input: n=2 b=2 k=3
Array: 2 2 2 3 3 3 1 1
Output: 1
I assume:
The elements of the array are comparable.
We know the values of n and k beforehand.
A solution O(n*k+b) is good enough.
Let the number occurring only b times be S. We are trying to find S in an array of size n*k+b.
Recursive step: find the median element of the current array slice, as in Quickselect, in linear time. Let the median element be M.
After the recursive step you have an array where all elements smaller than M occur to the left of the first occurrence of M. All M elements are next to each other, and all elements larger than M are to the right of all occurrences of M.
Look at the index of the leftmost M and calculate whether S < M or S >= M. Recurse on either the left slice or the right slice.
So you are doing a quicksort, but descending into only one of the partitions at any time. You will recurse O(log N) times, but each time on 1/2, 1/4, 1/8, ... of the original array, so the total time will still be O(n).
Clarification: let's say n = 20 and k = 10. Then there are 21 distinct elements in the array, 20 of which occur 10 times and the last occurs, let's say, 7 times. I find the median element, let's say it is 1111. If S < 1111 then the index of the leftmost occurrence of 1111 will be less than 11*10. If S >= 1111 then the index will be equal to 11*10.
Full example: n = 4. k = 3. Array = {1,2,3,4,5,1,2,3,4,5,1,2,3,5}
After the first recursive step I find the median element is 3, and the array is something like: {1,2,1,2,1,2,3,3,3,5,4,5,5,4}. There are 6 elements to the left of 3. 6 is a multiple of k=3, so each element occurring there must occur 3 times there, hence S >= 3. Recurse on the right side. And so on.
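Here is a small runnable sketch of this recursion (my own code; it picks the median by sorting for brevity, which makes this version O(n log n), whereas a true linear-time selection keeps the whole thing O(n)):

def find_b_occurrence(arr, k):
    m = sorted(arr)[len(arr) // 2]          # median of the current slice
    left = [x for x in arr if x < m]
    mid = [x for x in arr if x == m]
    right = [x for x in arr if x > m]
    if len(left) % k != 0:                  # S hides among the elements smaller than m
        return find_b_occurrence(left, k)
    if len(mid) % k != 0:                   # S is the median value itself
        return m
    return find_b_occurrence(right, k)      # otherwise S is larger than m

# find_b_occurrence([1,2,3,4,5,1,2,3,4,5,1,2,3,5], 3) == 4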
An idea using cyclic groups.
To guess the i-th bit of the answer, follow this procedure:
Count how many numbers in the array have the i-th bit set; store this as cnt
If cnt % k is non-zero, then the i-th bit of the answer is set. Otherwise it is clear.
To guess the whole number, repeat the above for every bit.
This solution is technically O((n*k+b)*log max N), where max N is the maximal value in the array, but because the number of bits is usually constant, this solution is linear in the array size.
No hashing, memory usage is O(log k * log max N).
Example implementation:
from random import randint, shuffle
from functools import reduce

def generate_test_data(n, k, b):
    k_rep = [randint(0, 1000) for i in range(n)]
    b_rep = [randint(0, 1000)]
    numbers = k_rep * k + b_rep * b
    shuffle(numbers)
    print("k_rep: ", k_rep)
    print("b_rep: ", b_rep)
    return numbers

def solve(data, k):
    # 10 bits suffice here because the generated values are below 1024
    cnts = [0] * 10
    for number in data:
        bits = [number >> b & 1 for b in range(10)]
        cnts = [cnts[i] + bits[i] for i in range(10)]
    # bit i of the answer is set iff cnts[i] % k != 0
    return reduce(lambda a, b: 2 * a + (b % k > 0), reversed(cnts), 0)

# k=15 repeats and b=13 copies of the answer; passing 3 still works since 15 % 3 == 0 and 13 % 3 != 0
print("Answer: ", solve(generate_test_data(10, 15, 13), 3))
In order to have a constant-height B-tree containing n distinct elements, with height h constant, you need z = n^(1/h) children per node: h = log_z(n), thus h = log(n)/log(z), thus log(z) = log(n)/h, thus z = e^(log(n)/h), thus z = n^(1/h).
Example: with n = 1000000 and h = 10, z = 3.98, that is z = 4.
The time to reach a node in that case is O(h·log(z)). Assuming h and z are "constant" (since N = n·k, then log(z) = log(n^(1/h)) = log((N/k)^(1/h)) = const by properly choosing h based on k), you can then say that O(h·log(z)) = O(1)... This is a bit far-fetched, but maybe that was the kind of thing the interviewer wanted to hear?
UPDATE: this one uses hashing, so it's not a good answer :(
In Python this would be linear time (set will remove the duplicates):
result = (sum(set(arr))*k - sum(arr)) / (k - b)
If 'k' is even and 'b' is odd, then XOR will do. :)
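A one-line sketch of that special case (every value occurring an even number of times cancels under XOR, leaving the value that occurs an odd number of times):

from functools import reduce
from operator import xor

def find_b_occurrence_xor(arr):     # valid only when k is even and b is odd
    return reduce(xor, arr)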

Median of Lists

I was asked this question:
You are given two lists of integers, each of which is sorted in ascending order and each of which has length n. All integers in the two lists are different. You wish to find the n-th smallest element of the union of the two lists. (That is, if you concatenated the lists and sorted the resulting list in ascending order, the element which would be at the n-th position.)
My Solution:
Assume that lists are 0-indexed.
O(n) solution:
A straightforward solution is to observe that the arrays are already sorted, so we can merge them and stop after n steps. The first n-1 elements do not need to be copied
into a new array, so this solution takes O(n) time and O(1) memory.
O(log² n) solution:
The O(log² n) solution alternates binary searches between the two lists. In short, it takes the middle element of the current search interval in the first list (l1[p1]) and searches for it in l2. Since the elements are unique, we will find at most 2 values closest to l1[p1]. Depending on their values relative to l1[p1-1] and l1[p1+1], and on their indices p21 and p22, we either return the n-th element or recurse: if any of the (at most) 3 indices in l1 can be combined with one of the (at most) 2 indices in l2 so that l1[p1'] and l2[p2'] would be right next to each other in the sorted union of the two lists and p1' + p2' = n or p1' + p2' = n + 1, we return one of the 5 elements. If p1 + p2 > n, we recurse into the left half of the search interval in l1, otherwise we recurse into the right half. This way, for each of the O(log n) possible midpoints in l1 we do an O(log n) binary search in l2, so the running time is O(log² n).
O(log n) solution:
Assuming the lists l1 and l2 have constant access time to any of their elements, we
can use a modified version of binary search to get an O(log n) solution. The easiest approach is to search for an index p1 in just one of the lists and calculate the corresponding index p2 in the other list so that p1 + p2 = n at all times. (This assumes the lists are indexed from 1.)
First we check for the special case when all elements of one list are smaller than any element in the other list:
If l1[n] < l2[1], return l1[n].
If l2[n] < l1[1], return l2[n].
If we do not find the n-th smallest element after this step, call findNth(1,n) with the approximate pseudocode:
findNth(start, end)
    p1 = (start + end)/2
    p2 = n - p1
    if l1[p1] < l2[p2]:
        if l1[p1 + 1] > l2[p2]:
            return l2[p2]
        else:
            return findNth(p1+1, end)
    else:
        if l2[p2 + 1] > l1[p1]:
            return l1[p1]
        else:
            return findNth(start, p1-1)
Element l2[p2] is returned when l2[p2] is greater than exactly p1 + p2 - 1 = n - 1 elements (and therefore is the n-th smallest). l1[p1] is returned under the same but symmetric conditions. If l1[p1] < l2[p2] and l1[p1+1] < l2[p2], the rank of l2[p2] is greater than n, so we need to take more elements from l1 and fewer from l2; therefore we search for p1 in the upper half of the previous search interval. On the other hand, if l2[p2] < l1[p1] and l2[p2+1] < l1[p1], the rank of l1[p1] is greater than n, so the real p1 will lie in the bottom half of our current search interval. Since we are halving the size of the problem at each call to findNth and we only need constant work per call, the recurrence for this algorithm is T(n) = T(n/2) + O(1), which has an O(log n)-time solution.
The interviewer kept asking me for different approaches to this problem. I proposed the three approaches above. Are they correct? Is there any other, better solution to this question? This question is actually asked a lot, so please provide some good material on it.
Not sure if you took a look at this: http://www.leetcode.com/2011/01/find-k-th-smallest-element-in-union-of.html
That solves a more general version of the problem you are asking about. Logarithmic complexity is definitely possible...
I think this will be the best solution:
->1 2 3 4 5 6 7 8 9
->10 11 12 13 14 15 16 17 18
Take two pointers i and j, each pointing at the start of its array. Increment i if a[i] < b[j], and increment j if a[i] > b[j].
Do this n-1 times; the n-th smallest element is then the smaller of a[i] and b[j].
This is a linear O(n) time, O(1) space solution.
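A runnable sketch of this (assuming both lists have length n, as stated, so the pointers stay in range):

def nth_smallest(a, b, n):
    i = j = 0
    for _ in range(n - 1):      # step over the n-1 smallest elements of the union
        if a[i] < b[j]:
            i += 1
        else:
            j += 1
    return min(a[i], b[j])      # the next smallest element is the n-th overall

# nth_smallest([1, 3, 5], [2, 4, 6], 3) == 3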

calculating the number of “inversions” in a permutation

Let A be an array of size N.
We call a pair of indices (i, j) an "inversion" if i < j and A[i] > A[j].
I need to find an algorithm that receives an array of size N (with unique numbers) and returns the number of inversions in O(n*log(n)) time.
You can use the merge sort algorithm.
In the merge algorithm's loop, the left and right halves are both sorted ascendingly, and we want to merge them into a single sorted array. Note that all the elements in the right side have higher indexes than those in the left side.
Assume array[leftIndex] > array[rightIndex]. This means that all elements in the left part following the element with index leftIndex are also larger than the current one in the right side (because the left side is sorted ascendingly). So the current element in the right side generates numberOfElementsInTheLeftSide - leftIndex + 1 inversions; add this to your global inversion count.
Once the algorithm finishes executing you have your answer, and merge sort is O(n log n) in the worst case.
There is an article published by SIAM in 2010, by Chan and Pătrașcu, entitled Counting Inversions, Offline Orthogonal Range Counting, and Related Problems, that gives an algorithm taking O(n sqrt(log(n))) time. This is currently the best known algorithm, and improves on the long-standing O(n log(n) / log(log(n))) algorithm. From the abstract:
We give an O(n sqrt(lg n))-time algorithm for counting the number of inversions in a permutation on n elements. This improves a long-standing previous bound of O(n lg n / lg lg n) that followed from Dietz's data structure [WADS'89], and answers a question of Andersson and Petersson [SODA'95]. As Dietz's result is known to be optimal for the related dynamic rank problem, our result demonstrates a significant improvement in the offline setting. Our new technique is quite simple: we perform a "vertical partitioning" of a trie (akin to van Emde Boas trees), and use ideas from external memory. However, the technique finds numerous applications: for example, we obtain in d dimensions an algorithm to answer n offline orthogonal range counting queries in time O(n lg^(d-2+1/d) n); an improved construction time for online data structures for orthogonal range counting; an improved update time for the partial sums problem; faster Word RAM algorithms for finding the maximum depth in an arrangement of axis-aligned rectangles, and for the slope selection problem. As a bonus, we also give a simple (1 + ε)-approximation algorithm for counting inversions that runs in linear time, improving the previous O(n lg lg n) bound by Andersson and Petersson.
I think the awesomest way to do this (and that's just because I love the data structure) is to use a binary indexed tree. Mind you, if all you need is a solution, merge sort would work just as well (I just think this concept totally rocks!). The basic idea is this: build a data structure which updates values in O(log n) and answers the query "How many numbers less than x have already occurred in the array so far?" Given this, you can easily answer how many are greater than x, which contributes to inversions with x as the second number in the pair. For example, consider the list {3, 4, 1, 2}.
When processing 3, there's no other numbers so far, so inversions with 3 on the right side = 0
When processing 4, the number of numbers less than 4 so far = 1, thus number of greater numbers (and hence inversions) = 0
Now, when processing 1, the number of numbers less than 1 = 0, thus the number of greater numbers = 2, which contributes two inversions (3,1) and (4,1). The same logic applies to 2, which finds 1 number less than it and hence 2 greater than it.
Now, the only question is to understand how these updates and queries happen in log n. The url mentioned above is one of the best tutorials I've read on the subject.
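For completeness, a short runnable sketch of that idea (my own code): values are compressed to ranks 1..m so they can serve as BIT indices, and for each element we count how many already-seen values are strictly greater than it.

def count_inversions_bit(arr):
    ranks = {v: i + 1 for i, v in enumerate(sorted(set(arr)))}   # coordinate compression
    m = len(ranks)
    tree = [0] * (m + 1)

    def update(i):                      # add 1 at position i
        while i <= m:
            tree[i] += 1
            i += i & -i

    def query(i):                       # sum of counts at positions 1..i
        s = 0
        while i > 0:
            s += tree[i]
            i -= i & -i
        return s

    inversions = 0
    for seen, v in enumerate(arr):
        r = ranks[v]
        inversions += seen - query(r)   # already-seen elements strictly greater than v
        update(r)
    return inversions

# count_inversions_bit([3, 4, 1, 2]) == 4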
These are the original MERGE and MERGE-SORT algorithms
from Cormen, Leiserson, Rivest, Stein Introduction to Algorithms:
MERGE(A,p,q,r)
1  n1 = q - p + 1
2  n2 = r - q
3  let L[1..n1 + 1] and R[1..n2 + 1] be new arrays
4  for i = 1 to n1
5      L[i] = A[p + i - 1]
6  for j = 1 to n2
7      R[j] = A[q + j]
8  L[n1 + 1] = infinity
9  R[n2 + 1] = infinity
10 i = 1
11 j = 1
12 for k = p to r
13     if L[i] <= R[j]
14         A[k] = L[i]
15         i = i + 1
16     else A[k] = R[j]
17         j = j + 1
and
MERGE-SORT(A,p,r)
1 if p < r
2     q = floor((p + r)/2)
3     MERGE-SORT(A,p,q)
4     MERGE-SORT(A,q + 1,r)
5     MERGE(A,p,q,r)
At lines 8 and 9 in MERGE, infinity is the so-called sentinel value,
which is larger than every array element.
To get the number of inversions one can introduce a global counter,
let's say ninv, initialized to zero before calling MERGE-SORT,
and then modify the MERGE algorithm by adding one line
in the else branch after line 16, something like
ninv += n1 - i + 1
since all of L[i..n1] are larger than the R[j] just placed;
then after MERGE-SORT is finished ninv will hold the number of inversions
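For comparison, a compact runnable version of the same counting trick (my own Python; it returns the sorted array together with the inversion count):

def sort_and_count(a):
    if len(a) <= 1:
        return a, 0
    mid = len(a) // 2
    left, inv_left = sort_and_count(a[:mid])
    right, inv_right = sort_and_count(a[mid:])
    merged = []
    inv = inv_left + inv_right
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
            inv += len(left) - i        # every remaining left element is bigger than right[j]
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged, inv

# sort_and_count([3, 4, 1, 2])[1] == 4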
