Sorting algorithm vs. Simple iterations - arrays

I'm just getting started in algorithms and sorting, so bear with me...
Let's say I have an array of 50000 integers.
I need to select the smallest 30000 of them.
I thought of two methods :
1. I iterate the entire array and find each smallest integer
2. I first sort the entire array , and then simply select the first 30000.
Can anyone tell me what's the difference, which method would be faster, and why?
What if the array was smaller or bigger? Would the answer change?

Option 1 sounds like the naive solution. It would involve passing through the array to find the smallest item 30000 times. Each time it finds the smallest, presumably it would swap that item to the beginning or end of the array. In basic terms, this is O(n^2) complexity.
The actual number of operations involved would be less than n^2 because n reduces every time. So you would have roughly 50000 + 49999 + 49998 + ... + 20001, which amounts to just over 1 billion (1000 million) iterations.
Option 2 would employ an algorithm like quicksort or similar, which is commonly O(n.logn).
Here it's harder to provide actual figures, because some efficient sorting algorithms can have a worst-case of O(n^2). But let's say you use a well-behaved one that is guaranteed to be O(n.logn). This would amount to 50000 * 15.61 which is about 780 thousand.
So it's clear that Option 2 wins in this case.
What if the array was smaller or bigger? Would the answer change?
Unless the array became trivially small, the answer would still be Option 2. And the larger your array becomes, the more beneficial Option 2 becomes. This is the nature of time complexity. O(n^2) grows much faster than O(n.logn).
A better question to ask is "what if I want fewer smallest values, and when does Option 1 become preferable?". Although the answer is slightly more complex because of numerous factors (such as what constitutes "one operation" in Option 1 vs Option 2, plus other issues like memory access patterns etc), you can get the simple answer directly from time complexity. Selecting k values with Option 1 costs roughly k.n operations, against n.logn for Option 2, so Option 1 becomes preferable when the number of smallest values to select drops below logn. In the case of a 50000-element array, that would mean if you want to select 15 or fewer smallest elements, then Option 1 wins.
Now, consider an Option 3, where you transform the array into a min-heap. Building a heap is O(n), and removing one item from it is O(logn). You are going to remove 30000 items. So you have the cost of building plus the cost of removal: 50000 + 30000 * 15.6 = approximately 520 thousand. And this is ignoring the fact that n gets smaller every time you remove an element. It's still O(n.logn), like Option 2 but it is probably faster: you've saved time by not bothering to sort the elements you don't care about.
I should mention that in all three cases, the result would be the smallest 30000 values in sorted order. There may be other solutions that would give you these values in no particular order.
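For comparison, the three options could be sketched in Python roughly as follows. This is only an illustration, not a benchmark: the function names are made up for this example, and heapq is the standard-library binary heap module.

```python
import heapq

def smallest_by_selection(values, k):
    """Option 1: k passes, each swapping the smallest remaining item forward."""
    a = list(values)
    for i in range(k):
        m = min(range(i, len(a)), key=a.__getitem__)  # index of smallest in a[i:]
        a[i], a[m] = a[m], a[i]
    return a[:k]

def smallest_by_sort(values, k):
    """Option 2: sort everything, keep the first k."""
    return sorted(values)[:k]

def smallest_by_heap(values, k):
    """Option 3: heapify in O(n), then pop k times at O(log n) each."""
    heap = list(values)
    heapq.heapify(heap)
    return [heapq.heappop(heap) for _ in range(k)]

data = [7, 3, 9, 1, 4, 8, 2, 6, 5, 0]
print(smallest_by_selection(data, 4))
print(smallest_by_sort(data, 4))
print(smallest_by_heap(data, 4))
```

All three return the same values in sorted order; only the amount of work differs.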

30k is close to 50k. Just sort the array and take the smallest 30k, e.g., in Python: sorted(a)[:30000]. It is an O(n * log n) operation.
If you needed to find the 100 smallest items instead (100 << 50k), then a heap might be more suitable, e.g., in Python: heapq.nsmallest(100, a). It is O(n * log k).
If the range of integers is limited, you could consider O(n) sorting methods such as counting sort and radix sort.
The simple iterative method is O(n**2) (quadratic) here. Even for a moderate n of around a million, that leads to ~10**12 operations, which is much worse than the ~10**6 of a linear algorithm.
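To illustrate the limited-range case, here is a minimal counting-sort sketch. It assumes every value lies in a known range lo..hi; the function name and parameters are hypothetical, chosen only for this example.

```python
def k_smallest_counting(values, k, lo, hi):
    """Return the k smallest values in sorted order, assuming lo <= v <= hi."""
    counts = [0] * (hi - lo + 1)
    for v in values:                      # one pass to count occurrences
        counts[v - lo] += 1
    result = []
    for offset, c in enumerate(counts):   # walk the range in increasing order
        take = min(c, k - len(result))
        result.extend([lo + offset] * take)
        if len(result) == k:
            break
    return result

print(k_smallest_counting([5, 1, 3, 1, 4, 2], 3, 0, 5))
```

The cost is O(n + range), so this only pays off when hi - lo is not much larger than n.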

For nearly all practical purposes, sorting and taking the first 30,000 is likely to be best. In most languages, this is one or two lines of code. Hard to get wrong.
If you have a truly demanding application or are just out to fiddle, you can use a selection algorithm to find the 30,000th smallest number. Then one more pass through the array will find the 29,999 that are no bigger.
There are several well known selection algorithms that require only O(n) comparisons and some that are sub-linear for data with specific properties.
The fastest in practice is QuickSelect, which - as its name implies - works roughly like a partial QuickSort. Unfortunately, if the data happens to be very badly ordered, QuickSelect can require O(n^2) time (just as QuickSort can). There are various tricks for selecting pivots that make it virtually impossible to hit the worst-case run time.
QuickSelect will finish with the array reordered so the smallest 30,000 elements are in the first part (unsorted) followed by the rest.
Because standard selection algorithms are comparison-based, they'll work on any kind of comparable data, not just integers.
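A minimal QuickSelect sketch, assuming a random pivot as the pivot-selection trick (one common choice among several). The function name is made up for this example; it works on a copy and leaves the k smallest items, unsorted, in the first k slots.

```python
import random

def partition_smallest(values, k):
    """Reorder a copy so the k smallest items occupy the first k slots (unsorted)."""
    a = list(values)
    lo, hi = 0, len(a) - 1
    while lo < hi:
        pivot = a[random.randint(lo, hi)]
        i, j = lo, hi
        while i <= j:                     # Hoare-style partition around the pivot value
            while a[i] < pivot:
                i += 1
            while a[j] > pivot:
                j -= 1
            if i <= j:
                a[i], a[j] = a[j], a[i]
                i += 1
                j -= 1
        # Now a[lo..j] <= pivot <= a[i..hi]; continue only in the side holding index k-1.
        if k - 1 <= j:
            hi = j
        elif k - 1 >= i:
            lo = i
        else:
            break
    return a

data = [7, 3, 9, 1, 4, 8, 2, 6, 5, 0]
print(sorted(partition_smallest(data, 4)[:4]))
```

Expected time is linear because each round discards one side of the partition instead of recursing into both, as QuickSort would.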

You can do this in potentially O(N) time with radix sort or counting sort, given that your input is integers.
Another method is to get the 30000th smallest integer by quickselect and then simply iterate through the original array. This has Θ(N) expected time complexity, but quickselect is O(N^2) in the worst case.

Related

What does worst-case Big-Omega(n) mean?

If Big-Omega is the lower bound, then what does it mean to have a worst-case time complexity of Big-Omega(n)?
From the book "data structures and algorithms with python" by Michael T. Goodrich:
Consider a dynamic array that doubles its size when the number of elements reaches its capacity.
This is from the book:
"we fully explored the append method. In the worst case, it requires
Ω(n) time because the underlying array is resized, but it uses O(1) time in the amortized sense"
The parameterized version, pop(k), removes the element that is at index k < n
of a list, shifting all subsequent elements leftward to fill the gap that results from
the removal. The efficiency of this operation is O(n−k), as the amount of shifting
depends upon the choice of index k. Note well that this
implies that pop(0) is the most expensive call, using Ω(n) time.
How does "Ω(n)" describe the most expensive time?
The number inside the parenthesis is the number of operations you must do to actually carry out the operation, always expressed as a function of the number of items you are dealing with. You never worry about just how hard those operations are, only the total number of them.
If the array is full and has to be resized you need to copy all the elements into the new array. One operation per item in the array, thus an O(n) runtime. However, most of the time you just do one operation for an O(1) runtime.
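The gap between the worst single append and the average append can be seen in a small simulation. This is a sketch under stated assumptions: a starting capacity of 1 and a growth factor of 2 (as in the question), with one unit of cost per element written or copied. The function name is made up for illustration.

```python
def append_costs(n):
    """Simulate n appends to a doubling array; return the cost of each append."""
    capacity, size, costs = 1, 0, []
    for _ in range(n):
        cost = 1                  # writing the new element
        if size == capacity:      # full: copy every existing element to a bigger array
            cost += size
            capacity *= 2
        size += 1
        costs.append(cost)
    return costs

costs = append_costs(1000)
print("most expensive append:", max(costs))
print("average cost per append:", sum(costs) / len(costs))
```

The most expensive append grows with n, while the average stays below a small constant, which is what "O(1) in the amortized sense" refers to.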
Common values are:
O(1): One operation only, such as adding it to the list when the list isn't full.
O(log n): This typically occurs when you have a binary search or the like to find your target. Note that the base of the log isn't specified as the difference is just a constant and you always ignore constants.
O(n): One operation per item in your dataset. For example, unsorted search.
O(n log n): Commonly seen in good sort routines where you have to process every item but can divide and conquer as you go.
O(n^2): Usually encountered when you must consider every interaction of two items in your dataset and have no way to organize it. For example a routine I wrote long ago to find near-duplicate pictures. (Exact duplicates would be handled by making a dictionary of hashes and testing whether the hash existed and thus be O(n)--the two passes is a constant and discarded, you wouldn't say O(2n).)
O(n^3): By the time you're getting this high you consider it very carefully. Now you're looking at three-way interactions of items in your dataset.
Higher orders can exist but you need to consider carefully what it's going to do. I have shipped production code that was O(n^8) but with very heavy pruning of paths and even then it took 12 hours to run. Had the nature of the data not been conducive to such pruning I wouldn't have written it at all--the code would still be running.
You will occasionally encounter even nastier stuff which needs careful consideration of whether it's going to be tolerable or not. For large datasets they're impossible:
O(2^n): Real world example: Attempting to prune paths so as to retain a minimum spanning tree--I computed all possible trees and kept the cheapest. Several experiments showed n never going above 10, I thought I was ok--until a different seed produced n = 22. I rewrote the routine for not-always-perfect answer that was O(n^2) instead.
O(n!): I don't know any examples. It blows up horribly fast.

Which of the following methods is more efficient

Problem Statement:- Given an array of integers and an integer k, print all the pairs in the array whose sum is k
Method 1:-
Sort the array and maintain two pointers low and high, start iterating...
Time Complexity - O(nlogn)
Space Complexity - O(1)
Method 2:-
Keep all the elements in the dictionary and do the process
Time Complexity - O(n)
Space Complexity - O(n)
Now, out of the above 2 approaches, which one is more efficient, and on what basis should I compare efficiency, time or space, given that the two approaches differ in both?
I've left my comment above for reference.
It was hasty. You do allow O(nlogn) time for the Method 1 sort (I now think I understand?) and that's fair (apologies;-).
What happens next? If the input array must be used again, then you need a sorted copy (the sort would not be in-place) which adds an O(n) space requirement.
The "iterating" part of Method 1 also costs ~O(n) time.
But loading up the dictionary in Method 2 is also ~O(n) time (presumably a throw-away data structure?) and dictionary access - although ~O(1) - is slower (than array indexing).
Bottom line: O-notation is helpful if it can identify an "overpowering cost" (rendering others negligible by comparison), but without a hint at use-cases (typical and boundary, details like data quantities and available system resources etc), questions like this (seeking a "generalised ideal" answer) can't benefit from it.
Often some simple proof-of-concept code and performance tests on representative data can make "the right choice obvious" (more easily and often more correctly than speculative theorising).
Finally, in the absence of a clear performance winner, there is always "code readability" to help decide;-)
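For such a proof of concept, the two methods might look like this in Python. It is a sketch that assumes the input values are distinct (duplicates would need an agreed rule for what counts as a pair), and the function names are invented for this example.

```python
def pairs_two_pointers(values, k):
    """Method 1: sort, then move low/high pointers inward. O(n log n) time."""
    a = sorted(values)  # sorted() makes a copy, so this variant uses O(n) extra space
    low, high, pairs = 0, len(a) - 1, []
    while low < high:
        s = a[low] + a[high]
        if s == k:
            pairs.append((a[low], a[high]))
            low += 1
            high -= 1
        elif s < k:
            low += 1
        else:
            high -= 1
    return pairs

def pairs_hashing(values, k):
    """Method 2: remember seen values in a set. O(n) expected time, O(n) space."""
    seen, pairs = set(), []
    for v in values:
        if k - v in seen:
            pairs.append((k - v, v))
        seen.add(v)
    return pairs

data = [1, 9, 3, 7, 5, 2]
print(pairs_two_pointers(data, 10))
print(pairs_hashing(data, 10))
```

Timing both on representative input sizes (for example with the timeit module) is then a matter of a few more lines.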

Can the min/max of a moving window be computed in O(N)?

I have input array A
A[0], A[1], ... , A[N-1]
I want a function Max(T,A) which returns B, the max value of A over the previous moving window of size T, where
B[i+T] = Max(A[i], ..., A[i+T])
Using a max heap to keep track of the max value in the current moving window A[i] to A[i+T], this algorithm is O(N log(T)) in the worst case.
I would like to know whether there is a better algorithm. Maybe an O(N) algorithm?
O(N) is possible using a Deque data structure. It holds pairs (Value; Index). The pseudocode below tracks the minimum; for the maximum, reverse the comparison in the while loop.
at every step:
  if (!Deque.Empty) and (Deque.Head.Index <= CurrentIndex - T) then
    Deque.ExtractHead;
    //Head is too old, it is leaving the window
  while (!Deque.Empty) and (Deque.Tail.Value > CurrentValue) do
    Deque.ExtractTail;
    //remove elements that have no chance to become minimum in the window
  Deque.AddTail(CurrentValue, CurrentIndex);
  CurrentMin = Deque.Head.Value
  //Head value is minimum in the current window
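The same idea as a runnable Python sketch, written for the maximum since that is what the question asks for. It assumes a window of exactly t elements ending at the current index; the function name is made up, and collections.deque is the standard-library double-ended queue.

```python
from collections import deque

def sliding_max(a, t):
    """Return the max of every window of t consecutive items, in O(len(a)) time."""
    window = deque()  # indices into a; their values are kept in decreasing order
    result = []
    for i, value in enumerate(a):
        if window and window[0] <= i - t:
            window.popleft()              # head is too old, it leaves the window
        while window and a[window[-1]] <= value:
            window.pop()                  # these can never be the maximum again
        window.append(i)
        if i >= t - 1:                    # first full window reached
            result.append(a[window[0]])   # head holds the current maximum
    return result

print(sliding_max([1, 3, -1, -3, 5, 3, 6, 7], 3))
```

Each index is pushed once and popped at most once, which is where the O(N) total comes from.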
It's called RMQ (range minimum query). Actually, I once wrote an article about that (with C++ code). See http://attiix.com/2011/08/22/4-ways-to-solve-%C2%B11-rmq/
Or you may prefer the Wikipedia article, Range Minimum Query.
After the preparation, you can get the max number of any given range in O(1).
There is a sub-field in image processing called Mathematical Morphology. The operation you are implementing is a core concept in this field, called dilation. Obviously, this operation has been studied extensively and we know how to implement it very efficiently.
The most efficient algorithm for this problem was proposed in 1992 and 1993, independently by van Herk, and Gil and Werman. This algorithm needs exactly 3 comparisons per sample, independently of the size of T.
Some years later, Gil and Kimmel further refined the algorithm to need only 2.5 comparisons per sample. Though the increased complexity of the method might offset the fewer comparisons (I find that more complex code runs more slowly). I have never implemented this variant.
The HGW algorithm, as it's called, needs two intermediate buffers of the same size as the input. For ridiculously large inputs (billions of samples), you could split up the data into chunks and process it chunk-wise.
In short, you walk through the data forward, computing the cumulative max over chunks of size T. You do the same walking backward. Each of these passes requires one comparison per sample. Finally, the result is the maximum over one value in each of these two temporary arrays. For data locality, you can do the two passes over the input at the same time.
I guess you could even do a running version, where the temporary arrays are of length 2*T, but that would be more complex to implement.
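The forward/backward scheme described above can be sketched in Python as follows. This is a simplified illustration of the idea rather than the published algorithm verbatim: it assumes a window of t consecutive samples, uses two full-size temporary lists, and the function name is invented for this example.

```python
def sliding_max_hgw(a, t):
    """Max over every window of t consecutive samples, using two chunked passes."""
    n = len(a)
    forward = list(a)    # forward[i]: max from the start of i's chunk up to i
    backward = list(a)   # backward[i]: max from i to the end of i's chunk
    for i in range(1, n):
        if i % t != 0:                    # do not carry across a chunk boundary
            forward[i] = max(forward[i - 1], a[i])
    for i in range(n - 2, -1, -1):
        if (i + 1) % t != 0:
            backward[i] = max(backward[i + 1], a[i])
    # A window [i, i+t-1] covers the tail of one chunk and the head of the next.
    return [max(backward[i], forward[i + t - 1]) for i in range(n - t + 1)]

print(sliding_max_hgw([1, 3, -1, -3, 5, 3, 6, 7], 3))
```

That is one comparison per sample in each pass plus one to combine, independent of t.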
van Herk, "A fast algorithm for local minimum and maximum filters on rectangular and octagonal kernels", Pattern Recognition Letters 13(7):517-521, 1992
Gil, Werman, "Computing 2-D min, median, and max filters", IEEE Transactions on Pattern Analysis and Machine Intelligence 15(5):504-507, 1993
Gil, Kimmel, "Efficient dilation, erosion, opening, and closing algorithms", IEEE Transactions on Pattern Analysis and Machine Intelligence 24(12):1606-1617, 2002
(Note: cross-posted from this related question on Code Review.)

Cache oblivious lookahead array

I am trying to understand the simplified cache oblivious lookahead array, which is described here and on page 35 of this presentation:
Analysis of Insertion into Simplified Fractal Tree:
1. Cost to merge 2 arrays of size X is O(X/B) block I/Os. Merge is very I/O efficient.
2. Cost per element to merge is O(1/B) since O(X) elements were merged.
3. Max # of times each element is merged is O(logN).
4. Average insert cost is O(logN/B)
I can understand #1, #2 and #3, but I can't understand #4. From the paper, a merge can be considered as a carry in binary addition. For example, 31 in binary is:
11111
When inserting a new item (plus 1), there should be 5 = log(32) merges (5 carries). But in this situation, we have to merge 32 elements! In addition, if we add 1 each time, how many carries will be performed going from 0 to 2^k? The answer should be 2^k - 1. In other words, one merge per insertion!
So how is #4 computed?
While you are right on both that the number of merged elements (and so transfers) is N in worst case and that the number of total merges is also of the same order, the average insertion cost is still logarithmic. It comes from two facts: merges vary in cost, and the number of low-cost merges is much higher than the number of high-cost ones.
It might be easier to see by example.
Let's set B=1 (i.e. 1 element per block, worst case of each merge having a cost) and N=32 (e.g. we insert 32 elements into an initially empty array).
Half of the insertions (16) put an element into the empty subarray of size 1, and so do not cause a merge. Of the remaining insertions, one (the last) needs to merge (move) 32 elements, one (16th) moves 16, two (8th and 24th) move 8 elements, four move 4 elements, and eight move 2 elements. Thus, overall number of element moves is 96, giving the average of 3 moves per insertion.
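The count above can be reproduced with a few lines of Python. This sketch follows the same accounting as the example (B = 1, and an insertion that fills a subarray of size 2^t is charged 2^t element moves); the function names are made up for illustration.

```python
def moves_for_insertion(i):
    """Element moves charged to the i-th insertion (1-based), with B = 1."""
    lowest_bit = i & -i          # 2**t, where t is the number of trailing zero bits of i
    return lowest_bit if lowest_bit > 1 else 0   # odd i: lands in the empty size-1 slot

def total_moves(n):
    return sum(moves_for_insertion(i) for i in range(1, n + 1))

n = 32
print(total_moves(n), total_moves(n) / n)
```

For n a power of two the total works out to n * (log2(n) + 1) / 2, so the average per insertion grows only logarithmically even though a single insertion can move all n elements.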
Hope that helps.
The first log B levels fit in (a single page of) memory, and so any stuff that happens in those levels does not incur an I/O. (This also fixes the problem with rrenaud's analysis that there's O(1) merges per insertion, since you only start paying for them after the first log B merges.)
Once you are merging at least B elements, then Fact 2 kicks in.
Consider the work from an element's point of view. It gets merged O(log N) times. It gets charged O(1/B) each time that happens. Its total cost of insertion is O((log N)/B) (need the extra parens to differentiate from O(log N/B), which would be quite bad insertion performance -- even worse than a B-tree).
The "average" cost is really the amortized cost -- it's the amount you charge to that element for its insertion. A little more formally, it's the total work for inserting N elements, divided by N. An amortized cost of O((log N)/B) really means that inserting N elements is O((N log N)/B) I/Os -- for the whole sequence. This compares quite favorably with B-trees, which for N insertions do a total of O((N log N)/log B) I/Os. Dividing by B is obviously a whole lot better than dividing by log B.
You may complain that the work is lumpy, that you sometimes do an insertion that causes a big cascade of merges. That's ok. You don't charge all the merges to the last insertion. Everyone is paying its own small amount for each merge they participate in. Since (log N)/B will typically be much less than 1, everyone is being charged way less than a single I/O over the course of all of the merges it participates in.
What happens if you don't like amortized analysis, and you say that even though the insertion throughput goes up by a couple of orders of magnitude, you don't like it when a single insertion can cause a huge amount of work? Aha! There are standard ways to deamortize such a data structure, where you do a bit of preemptive merging during each insertion. You get the same I/O complexity (you'll have to take my word for it), but it's pretty standard stuff for people who care about amortized analysis and deamortizing the result.
Full disclosure: I'm one of the authors of the COLA paper. Also, rrenaud was in my algorithms class. Also, I'm a founder of Tokutek.
In general, the amortized number of changed bits per increment is 2 = O(1).
Here is a proof by logic/reasoning. http://www.cs.princeton.edu/courses/archive/spr11/cos423/Lectures/Binary%20Counting.pdf
Here is a "proof" by experimentation. http://codepad.org/0gWKC3rW
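A small experiment along the same lines, as a sketch (the function name is made up): count how many bits change on each increment of a binary counter and compare the total with the number of increments.

```python
def bit_flips(n):
    """Total number of bits changed while incrementing a counter from 0 to n."""
    # i ^ (i + 1) has a 1 in exactly the positions that differ between i and i + 1.
    return sum(bin(i ^ (i + 1)).count("1") for i in range(n))

for n in (32, 1000, 100000):
    print(n, bit_flips(n), bit_flips(n) / n)
```

The ratio stays below 2 for every n, even though a single increment can flip about log2(n) bits.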

Exhaustive searches vs sorting followed by binary search

This is a direct quote from the textbook, Invitation to Computer Science by G. Michael Schneider and Judith L. Gersting.
At the end of Section 3.4.2, we talked about the tradeoff between using sequential search on an unsorted list as opposed to sorting the list and then using binary search. If the list size is n=100,000 about how many worst-case searches must be done before the second alternative is better in terms of number of comparisons?
I don't really get what the question is asking for.
Sequential search is of order (n) and binary is of order (lgn), and in any case lgn will always be less than n. And in this case n is already given, so what am I supposed to find?
This is one of my homework assignments but I don't really know what to do. Could anyone explain the question in plain English for me?
and binary is of order (lgn) which in any case lgn will always be less than n
This is where you're wrong. In the assignment, you're asked to consider the cost of sorting the array too.
Obviously, if you need only one search, the first approach is better than sorting the array and doing a binary search: n < n*logn + logn. And you're asked how many searches you need for the second approach to become more effective.
End of hint.
The question is how to decide which approach to choose - to just use linear search or to sort and then use binary search.
If you only search a couple of times linear search is better - it is O(n), while sorting is already O(n*logn). If you search very often on the same collection sorting is better - searching multiple times can become O(n*n) but sorting and then searching with binary search is again O(n*logn) + NumberOfSearches*O(logn) which can be less or more than using linear search depending on how NumberOfSearches and n relate.
The task is to determine the exact value of NumberOfSearches (not the exact number, but a function of n) which will make one of the options preferable:
NumberOfSearches * O(n) <> O(n*logn) + NumberOfSearches * O(logn)
don't forget that each O() can have a different constant value.
The order of the methods is not important here. It tells you something about how well algorithms scale when the problem becomes bigger and bigger. You can't do any exact calculations if you only know O(n), i.e. that the complexity grows linearly with the size of the problem. It won't give you any numbers.
This can well mean that an algorithm with O(n) complexity is faster than a O(logn) algorithm, for some n. Because O(log(n)) scales better when it gets larger, we know for sure, there is a n (a problem size) where the algorithm with O(logn) complexity is faster. We just don't know when (for what n).
In plain english:
If you want to know 'how many searches', you need exact equations to solve, you need exact numbers. How many comparisons does it take to search sequential? (Remember n is given, so you can give a number.) How many comparisons (in the worst case!) does it take to search with a binary search? Before you can do a binary search, you have to sort. Let's add the number of comparisons needed to sort to the cost of binary search. Now compare the two numbers, which one is less?
The binary search is fast, but the sorting is slow. The sequential search is slower than binary search, but faster than sorting. However the sorting needs to be done only once, no matter how many times you search. So, when does one heavy sort outweigh having to do a slow (sequential) search every time?
Good luck!
For sequential search, the worst case is n = 100000, so for p searches p × 100000 comparisons are required.
Using a Θ(n^2) sorting algorithm would require 100000 × 100000 comparisons.
Binary search would require 1 + log n = 1 + log 100000 = 17 comparisons for each search,
together there would be 100000×100000 + 17p comparisons.
The first expression is larger than the second, meaning
100000p > 100000^2 + 17p
For p > 100017.
The question is about estimating the number NUM_SEARCHES needed to compensate for the cost of sorting. So we'll have:
time( NUM_SEARCHES * O(n) ) > time( NUM_SEARCHES * O(log(n)) + O(n* log(n)) )
Thank you guys. I think I get the point now. Could you take a look at my answer and see whether I'm on the right track.
For worst case searches
Number of comparison for sequential search is n = 100,000.
Number of comparison for binary search is lg(n) = 17.
Number of comparison for sorting is (n-1)/2 * n = (99999)(50000).
(I'm following my textbook and used the selection sort algorithm covered in my class)
So let p be the number of worst case searches, then 100,000p > (99999)(50000) + 17p
OR p > 50008
In conclusion, I need more than 50,008 worst-case searches to make sorting and using binary search better than a sequential search for a list of n=100,000.
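The inequality can also be solved mechanically. The sketch below assumes the same worst-case counts used in this thread (n comparisons per sequential search, floor(log2 n) + 1 per binary search) and returns the smallest whole number of searches p that satisfies the strict inequality p*n > sort_comparisons + p*binary; the function name is made up for illustration.

```python
import math

def break_even_searches(n, sort_comparisons):
    """Smallest p for which sorting once plus p binary searches beats p sequential searches."""
    binary = math.floor(math.log2(n)) + 1      # worst-case comparisons per binary search
    # p*n > sort_comparisons + p*binary  <=>  p > sort_comparisons / (n - binary)
    return sort_comparisons // (n - binary) + 1

n = 100_000
print(break_even_searches(n, n * (n - 1) // 2))   # selection sort: n(n-1)/2 comparisons
print(break_even_searches(n, n * n))              # the cruder n^2 estimate
```

Plugging in a different sort cost, such as n*log2(n) for an efficient sort, shows how strongly the break-even point depends on the sorting algorithm assumed.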

Resources