Problem Statement:- Given an array of integers and an integer k, print all the pairs in the array whose sum is k
Method 1:-
Sort the array and maintain two pointers low and high, start iterating...
Time Complexity - O(nlogn)
Space Complexity - O(1)
Method 2:-
Keep all the elements in the dictionary and do the process
Time Complexity - O(n)
Space Complexity - O(n)
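The two methods can be sketched roughly as follows. This is a minimal illustration, not the asker's code: the function names are made up, the input is assumed to hold distinct integers, and Method 1 sorts a copy (which costs O(n) extra space; sorting in place would keep the O(1) figure but reorder the caller's array).

```python
def pairs_two_pointer(values, k):
    """Method 1: sort, then walk two pointers inward. O(n log n) time."""
    a = sorted(values)              # sorted copy; the caller's list is untouched
    low, high = 0, len(a) - 1
    pairs = []
    while low < high:
        total = a[low] + a[high]
        if total == k:
            pairs.append((a[low], a[high]))
            low += 1
            high -= 1
        elif total < k:
            low += 1                # sum too small: move to a larger left value
        else:
            high -= 1               # sum too large: move to a smaller right value
    return pairs


def pairs_hash(values, k):
    """Method 2: remember seen values in a set. O(n) expected time, O(n) space."""
    seen = set()
    pairs = []
    for x in values:
        if k - x in seen:           # the partner of x appeared earlier
            pairs.append((k - x, x))
        seen.add(x)
    return pairs


print(pairs_two_pointer([1, 5, 7, -1, 3], 6))   # [(-1, 7), (1, 5)]
```

Both functions find the same pairs; they differ only in the order of the output and in where the time and memory go.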
Now, of the two approaches above, which one is more efficient, and on what basis should I compare them? Should I go by time or by space, given that the two approaches differ on both?
I've left my comment above for reference.
It was hasty. You do allow O(nlogn) time for the Method 1 sort (I now think I understand?) and that's fair (apologies;-).
What happens next? If the input array must be used again, then you need a sorted copy (the sort would not be in-place) which adds an O(n) space requirement.
The "iterating" part of Method 1 also costs ~O(n) time.
But loading up the dictionary in Method 2 is also ~O(n) time (presumably a throw-away data structure?) and dictionary access - although ~O(1) - is slower (than array indexing).
Bottom line: O-notation is helpful if it can identify an "overpowering cost" (rendering others negligible by comparison), but without a hint at use-cases (typical and boundary, details like data quantities and available system resources etc), questions like this (seeking a "generalised ideal" answer) can't benefit from it.
Often some simple proof-of-concept code and performance tests on representative data can make "the right choice obvious" (more easily and often more correctly than speculative theorising).
Finally, in the absence of a clear performance winner, there is always "code readability" to help decide;-)
I'm just getting started in algorithms and sorting, so bear with me...
Let's say I have an array of 50000 integers.
I need to select the smallest 30000 of them.
I thought of two methods :
1. I iterate the entire array and find each smallest integer
2. I first sort the entire array , and then simply select the first 30000.
Can anyone tell me what's the difference, which method would be faster, and why?
What if the array was smaller or bigger? Would the answer change?
Option 1 sounds like the naive solution. It would involve passing through the array to find the smallest item 30000 times. Each time it finds the smallest, presumably it would swap that item to the beginning or end of the array. In basic terms, this is O(n^2) complexity.
The actual number of operations involved would be less than n^2 because n reduces every time. So you would have roughly 50000 + 49999 + 49998 + ... + 20001, which amounts to just over 1 billion (1000 million) iterations.
Option 2 would employ an algorithm like quicksort or similar, which is commonly O(n.logn).
Here it's harder to provide actual figures, because some efficient sorting algorithms can have a worst-case of O(n^2). But let's say you use a well-behaved one that is guaranteed to be O(n.logn). This would amount to 50000 * 15.61 which is about 780 thousand.
So it's clear that Option 2 wins in this case.
What if the array was smaller or bigger? Would the answer change?
Unless the array became trivially small, the answer would still be Option 2. And the larger your array becomes, the more beneficial Option 2 becomes. This is the nature of time complexity. O(n^2) grows much faster than O(n.logn).
A better question to ask is "what if I want fewer smallest values, and when does Option 1 become preferable?". Although the answer is slightly more complex because of numerous factors (such as what constitutes "one operation" in Option 1 vs Option 2, plus other issues like memory access patterns etc), you can get the simple answer directly from time complexity. Selecting k values with Option 1 costs roughly k.n operations, against n.logn for Option 2, so Option 1 becomes preferable when the number of smallest values to select drops below logn. In the case of a 50000-element array, that would mean if you want to select 15 or fewer smallest elements, then Option 1 wins.
Now, consider an Option 3, where you transform the array into a min-heap. Building a heap is O(n), and removing one item from it is O(logn). You are going to remove 30000 items. So you have the cost of building plus the cost of removal: 50000 + 30000 * 15.6 = approximately 520 thousand. And this is ignoring the fact that n gets smaller every time you remove an element. It's still O(n.logn), like Option 2 but it is probably faster: you've saved time by not bothering to sort the elements you don't care about.
I should mention that in all three cases, the result would be the smallest 30000 values in sorted order. There may be other solutions that would give you these values in no particular order.
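As a rough sketch of Options 2 and 3 (the function names are illustrative, and this uses Python's standard heapq module rather than a hand-written heap), both return the k smallest values in sorted order:

```python
import heapq


def smallest_by_sort(values, k):
    """Option 2: sort everything, keep the first k. O(n log n)."""
    return sorted(values)[:k]


def smallest_by_heap(values, k):
    """Option 3: build a min-heap in O(n), then pop k times at O(log n) each."""
    heap = list(values)         # copy, so the input is left unchanged
    heapq.heapify(heap)
    return [heapq.heappop(heap) for _ in range(k)]
```

Which one is faster in practice depends on the language and library; timing both on representative data is the only reliable way to tell.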
30k is close to 50k. Just sort the array and get the smallest 30k e.g., in Python: sorted(a)[:30000]. It is an O(n * log n) operation.
If you needed to find the 100 smallest items instead (100 << 50k) then a heap might be more suitable e.g., in Python: heapq.nsmallest(100, a). It is O(n * log k).
If the range of integers is limited, you could consider O(n) sorting methods such as counting sort and radix sort.
The simple iterative method is O(n**2) (quadratic) here. Even for a moderate n of around a million, it leads to ~10**12 operations, which is much worse than ~10**6 for a linear algorithm.
For nearly all practical purposes, sorting and taking the first 30,000 is likely to be best. In most languages, this is one or two lines of code. Hard to get wrong.
If you have a truly demanding application or are just out to fiddle, you can use a selection algorithm to find the 30,000th smallest number. Then one more pass through the array will find the other 29,999 that are no bigger.
There are several well known selection algorithms that require only O(n) comparisons and some that are sub-linear for data with specific properties.
The fastest in practice is QuickSelect, which - as its name implies - works roughly like a partial QuickSort. Unfortunately, if the data happens to be very badly ordered, QuickSelect can require O(n^2) time (just as QuickSort can). There are various tricks for selecting pivots that make it virtually impossible to get the worst case run time.
QuickSelect will finish with the array reordered so the smallest 30,000 elements are in the first part (unsorted) followed by the rest.
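A minimal QuickSelect sketch, assuming a random pivot and a Hoare-style partition (one common variant among several; the function name is made up for illustration):

```python
import random


def partition_smallest(a, k):
    """Reorder list a in place so that a[:k] holds its k smallest items, unsorted."""
    if not 0 < k <= len(a):
        raise ValueError("k must be between 1 and len(a)")
    lo, hi = 0, len(a) - 1
    target = k - 1                          # final index of the k-th smallest item
    while lo < hi:
        pivot = a[random.randint(lo, hi)]   # random pivot makes O(n^2) unlikely
        i, j = lo, hi
        while i <= j:
            while a[i] < pivot:
                i += 1
            while a[j] > pivot:
                j -= 1
            if i <= j:
                a[i], a[j] = a[j], a[i]
                i += 1
                j -= 1
        # Now a[lo..j] <= pivot <= a[i..hi], with j < i.
        if target <= j:
            hi = j                          # keep searching the left part
        elif target >= i:
            lo = i                          # keep searching the right part
        else:
            break                           # a[target] is already in place
```

After the call, a[k - 1] is the k-th smallest element, everything before it is no bigger, and everything after it is no smaller.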
Because standard selection algorithms are comparison-based, they'll work on any kind of comparable data, not just integers.
You can do this in potentially O(N) time with radix sort or counting sort, given that your input is integers.
Another method is to find the 30000th smallest integer with quickselect and then simply iterate through the original array. This takes Θ(N) time on average, but quickselect is O(N^2) in the worst case.
I have coded a few sorting methods in C and I would like to find the input size at which each one is optimal, i.e. to profile each algorithm. But how do I do this? I know how to time each method, but I don't know how I can find the size at which it is 'optimal'.
It depends on some factors:
Data behaviour: is your data already partially sorted? or it is very random?
Data size: for a big input (say 1 thousand or more) you can be sure that O(N^2) sorting methods will lose to O(N*log(N)) methods.
Data structure of the data: is it array or list or ?. Sorting method with non sequential access to data will be slower for something like list
So the answer is to run your program empirically on real data of the kind you will likely handle, while varying the input size.
If a slower method (like O(N^2)) only gets beaten by some faster method (like O(N*log(N))) once the input size is > X, then you can say that the slower method is 'empirically optimal' for input size <= X (the value of X depends on the characteristics of the input data).
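A small timing harness along these lines might look as follows. It is only a sketch: it is written in Python rather than C, it uses uniformly random integers as a stand-in for "real data", and the names insertion_sort, best_time and compare are made up for illustration.

```python
import random
import time


def insertion_sort(values):
    """A simple O(N^2) sort, returning a new sorted list."""
    a = list(values)
    for i in range(1, len(a)):
        item = a[i]
        j = i - 1
        while j >= 0 and a[j] > item:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = item
    return a


def best_time(sort_fn, data, repeats=5):
    """Best wall-clock time of sort_fn(data) over several runs, in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        sort_fn(data)
        best = min(best, time.perf_counter() - start)
    return best


def compare(sizes, seed=0):
    """Return rows of (n, insertion sort time, built-in sort time) per size."""
    rng = random.Random(seed)
    rows = []
    for n in sizes:
        data = [rng.randrange(1_000_000) for _ in range(n)]
        rows.append((n, best_time(insertion_sort, data), best_time(sorted, data)))
    return rows


for n, slow, fast in compare([8, 32, 128, 512]):
    print(n, slow, fast)
```

The crossover size X is wherever the second column starts to exceed the third; it will vary with the machine, the language, and the shape of the data.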
Sort algorithms do not have a single number at which they are optimal.
For pure execution time, almost every sort algorithm will be fastest on a set of 2 numbers, but that is not useful in most cases.
Some sort algorithms may work more efficiently on smaller data sets, but that does not mean they are 'optimal' at that size.
Some sorts may also work better on other characteristics of the data. There are sorts that can be extremely efficient if the data is almost sorted already, but may be very slow if it is not. Others will run the same on any set of a given size.
It is more useful to look at the Big O of the sort (such as O(n^2), O(n log n) etc) and any special properties the sort has, such as operating on nearly sorted data.
To find the input size at which the program is optimal (by which I assume you mean the fastest, or for which the sorting algorithm requires the fewest comparisons) you will have to test it against various inputs and graph the independent axis (input size) against the dependent axis (runtime) and find the minimum.
I am trying to understand the simplified cache-oblivious lookahead array which is described here, and on page 35 of this presentation:
Analysis of Insertion into Simplified Fractal Tree:
1. Cost to merge 2 arrays of size X is O(X/B) block I/Os. Merge is very I/O efficient.
2. Cost per element to merge is O(1/B), since O(X) elements were merged.
3. Max # of times each element is merged is O(log N).
4. Average insert cost is O((log N)/B).
I can understand #1, #2 and #3, but I can't understand #4. From the paper, a merge can be considered as a binary addition carry. For example, 31 in base 2 could be represented as 11111.
When inserting a new item (plus 1), there should be 5 = log(32) merges (5 carries). But in this situation, we have to merge 32 elements! In addition, if each time we add 1, how many carries will be performed going from 0 to 2^k? The answer should be 2^k - 1. In other words, one merge per insertion!
So how is #4 computed?
While you are right on both that the number of merged elements (and so transfers) is N in worst case and that the number of total merges is also of the same order, the average insertion cost is still logarithmic. It comes from two facts: merges vary in cost, and the number of low-cost merges is much higher than the number of high-cost ones.
It might be easier to see by example.
Let's set B=1 (i.e. 1 element per block, worst case of each merge having a cost) and N=32 (e.g. we insert 32 elements into an initially empty array).
Half of the insertions (16) put an element into the empty subarray of size 1, and so do not cause a merge. Of the remaining insertions, one (the last) needs to merge (move) 32 elements, one (16th) moves 16, two (8th and 24th) move 8 elements, four move 4 elements, and eight move 2 elements. Thus, overall number of element moves is 96, giving the average of 3 moves per insertion.
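The count above can be checked with a few lines of code. This sketch follows the same accounting (B=1, an insertion that lands in the empty size-1 subarray costs nothing, any other insertion moves as many elements as the array it ends up filling); the function name is made up for illustration.

```python
def total_moves(n):
    """Total element moves when inserting n elements one at a time."""
    moves = 0
    for i in range(1, n + 1):
        filled = i & -i         # size of the array filled by the i-th insertion
        if filled > 1:          # size 1 means no merge was needed
            moves += filled
    return moves


print(total_moves(32), total_moves(32) / 32)    # 96 3.0
```

The size of the filled array is the lowest set bit of the insertion count, which is exactly the binary-carry picture from the question.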
Hope that helps.
The first log B levels fit in (a single page of) memory, and so any stuff that happens in those levels does not incur an I/O. (This also fixes the problem with rrenaud's analysis that there's O(1) merges per insertion, since you only start paying for them after the first log B merges.)
Once you are merging at least B elements, then Fact 2 kicks in.
Consider the work from an element's point of view. It gets merged O(log N) times. It gets charged O(1/B) each time that happens. Its total cost of insertion is O((log N)/B) (need the extra parens to differentiate from O(log N/B), which would be quite bad insertion performance -- even worse than a B-tree).
The "average" cost is really the amortized cost -- it's the amount you charge to that element for its insertion. A little more formally, it's the total work for inserting N elements, divided by N. An amortized cost of O((log N)/B) really means that inserting N elements is O((N log N)/B) I/Os -- for the whole sequence. This compares quite favorably with B-trees, which for N insertions do a total of O((N log N)/log B) I/Os. Dividing by B is obviously a whole lot better than dividing by log B.
You may complain that the work is lumpy, that you sometimes do an insertion that causes a big cascade of merges. That's ok. You don't charge all the merges to the last insertion. Everyone is paying its own small amount for each merge they participate in. Since (log N)/B will typically be much less than 1, everyone is being charged way less than a single I/O over the course of all of the merges it participates in.
What happens if you don't like amortized analysis, and you say that even though the insertion throughput goes up by a couple of orders of magnitude, you don't like it when a single insertion can cause a huge amount of work? Aha! There are standard ways to deamortize such a data structure, where you do a bit of preemptive merging during each insertion. You get the same I/O complexity (you'll have to take my word for it), but it's pretty standard stuff for people who care about amortized analysis and deamortizing the result.
Full disclosure: I'm one of the authors of the COLA paper. Also, rrenaud was in my algorithms class. Also, I'm a founder of Tokutek.
In general, the amortized number of changed bits per increment is 2 = O(1).
Here is a proof by logic/reasoning.
Here is a "proof" by experimentation.
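One way such an experiment could look, as a sketch rather than the answerer's original code (the function name is made up): count how many bits differ between each value and its successor, and average over all increments.

```python
def average_flips(increments):
    """Average number of bits changed per +1 when counting up from 0."""
    flips = 0
    for value in range(increments):
        # XOR marks exactly the bits that differ between value and value + 1.
        flips += bin(value ^ (value + 1)).count("1")
    return flips / increments


print(average_flips(4))         # 1.75
print(average_flips(1024))      # 1.9990234375
```

The average approaches 2 from below as the number of increments grows, since bit j only changes once every 2^j increments.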
Say I have 10 billion numbers stored in a file. How would I find a number that has already appeared once previously?
Well, I can't just load billions of numbers into an array in one go and then run a simple nested loop to check whether each number has appeared previously.
How would you approach this problem?
Thanks in advance :)
I had this as an interview question once.
Here is an algorithm that is O(N)
Use a hash table. Sequentially store pointers to the numbers, where the hash key is computed from the number value. Once you have a collision, you have found your duplicate.
Author Edit:
Below, #Phimuemue makes the excellent point that 4-byte integers have a fixed bound before a collision is guaranteed; that is 2^32, or approx. 4 GB. When considered in the conversation accompanying this answer, worst-case memory consumption by this algorithm is dramatically reduced.
Furthermore, using the bit array as described below can reduce memory consumption to 1/8th, 512 MB. On many machines, this computation is now possible without considering either a persistent hash, or the less-performant sort-first strategy.
Now, longer numbers or double-precision numbers are less-effective scenarios for the bit array strategy.
Phimuemue Edit:
Of course one needs to take a bit "special" hash table:
Take a hashtable consisting of 2^32 bits. Since the question asks about 4-byte integers, there are at most 2^32 different values, i.e. one bit for each number. 2^32 bits = 512 MB.
So now one just has to determine the location of the corresponding bit in the hashmap and set it. If one encounters a bit which is already set, the number has occurred in the sequence already.
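A minimal sketch of this bit-table idea, assuming non-negative integers below 2^universe_bits (the function and parameter names are made up; with the default of 32 bits the table takes 512 MB, so the example call uses a much smaller universe):

```python
def first_duplicate(numbers, universe_bits=32):
    """Return the first number seen twice, or None if there is no repeat."""
    seen = bytearray(max(1, (1 << universe_bits) // 8))   # one bit per value
    for x in numbers:
        byte, mask = x >> 3, 1 << (x & 7)    # which byte, and which bit in it
        if seen[byte] & mask:
            return x                         # bit already set: x occurred before
        seen[byte] |= mask
    return None


print(first_duplicate([5, 70, 5, 70], universe_bits=16))   # 5
```

In practice the numbers would be streamed from the file one at a time rather than held in a list.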
The important question is whether you want to solve this problem efficiently, or whether you want to solve it accurately.
If you truly have 10 billion numbers and just one single duplicate, then you are in a "needle in the haystack" type of situation. Intuitively, short of a very grimy and unstable solution, there is no hope of solving this without storing a significant amount of the numbers.
Instead, turn to probabilistic solutions, which have been used in most any practical application of this problem (in network analysis, what you are trying to do is look for mice, i.e., elements which appear very infrequently in a large data set).
A possible solution, which can be made to find exact results: use a sufficiently high-resolution Bloom filter. Either use the filter alone to determine whether an element has already been seen, or, if you want perfect accuracy, use the filter (in the way kbrimington suggested using a standard hash table) to, eh, filter out the elements which you can't possibly have seen before and then, on a second pass, determine which elements you actually see twice.
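The two-pass variant could be sketched like this. It is an illustration under assumptions, not a tuned implementation: the class and function names are made up, the hash positions come from SHA-256 of a seed plus the item, and read_numbers is a hypothetical callable that returns a fresh iterator over the file's numbers each time it is called.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add_and_check(self, item):
        """Set the item's bits; return True if all of them were already set."""
        already = True
        for pos in self._positions(item):
            byte, mask = pos >> 3, 1 << (pos & 7)
            if not self.bits[byte] & mask:
                already = False
                self.bits[byte] |= mask
        return already


def find_duplicates(read_numbers, num_bits=1 << 20, num_hashes=4):
    # Pass 1: collect candidates (true duplicates plus some false positives).
    bloom = BloomFilter(num_bits, num_hashes)
    candidates = {x for x in read_numbers() if bloom.add_and_check(x)}
    # Pass 2: count only the candidates exactly.
    counts = dict.fromkeys(candidates, 0)
    for x in read_numbers():
        if x in counts:
            counts[x] += 1
    return {x for x, count in counts.items() if count > 1}
```

A true duplicate always becomes a candidate, because its bits were set by its first occurrence; the second pass only exists to discard false positives.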
And if your problem is slightly different---for instance, you know that you have at least 0.001% elements which repeat themselves twice, and you would like to find out how many there are approximately, or you would like to get a random sample of such elements---then a whole score of probabilistic streaming algorithms, in the vein of Flajolet & Martin, Alon et al., exist and are very interesting (not to mention highly efficient).
Read the file once and create a hashtable storing the number of times you encounter each item. But wait! Instead of using the item itself as a key, you use a hash of the item itself, for example its least significant bits, let's say 20 bits (1M entries).
After the first pass, all items that have counter > 1 may point to a duplicated item, or be a false positive. Rescan the file, consider only items that may lead to a duplicate (looking up each item in table one), build a new hashtable using real values as keys now and storing the count again.
After the second pass, items with count > 1 in the second table are your duplicates.
This is still O(n), just twice as slow as a single pass.
How about:
1. Sort the input using some algorithm which allows only a portion of the input to be in RAM. Examples are there
2. Seek duplicates in the output of the 1st step -- you'll need space for just 2 elements of the input in RAM at a time to detect repetitions.
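The two steps above can be sketched as a chunked external sort: sort pieces that fit in RAM, write each piece to a temporary file, then merge the files lazily and compare neighbours. The names and the chunk_size parameter are made up for illustration, and the temporary files stand in for whatever on-disk format is actually used.

```python
import heapq
import tempfile


def _spill(sorted_chunk):
    """Write one sorted run to a temporary file and rewind it for reading."""
    run = tempfile.TemporaryFile(mode="w+")
    run.writelines(f"{x}\n" for x in sorted_chunk)
    run.seek(0)
    return run


def duplicates_by_external_sort(numbers, chunk_size):
    # Step 1: sort chunk by chunk, keeping at most chunk_size numbers in RAM.
    runs, chunk = [], []
    for x in numbers:
        chunk.append(x)
        if len(chunk) == chunk_size:
            runs.append(_spill(sorted(chunk)))
            chunk = []
    if chunk:
        runs.append(_spill(sorted(chunk)))

    # Step 2: merge the runs lazily; equal values are now adjacent.
    merged = heapq.merge(*((int(line) for line in run) for run in runs))
    duplicates, prev = [], None
    for x in merged:
        if x == prev and (not duplicates or duplicates[-1] != x):
            duplicates.append(x)
        prev = x
    for run in runs:
        run.close()
    return duplicates


print(duplicates_by_external_sort([5, 3, 9, 3, 7, 5, 1, 3], chunk_size=3))   # [3, 5]
```

With very many runs, the merge may need to be done in several rounds to stay within the limit on open files.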
Finding duplicates
Noting that it's a 32-bit integer means that you're going to have a large number of duplicates, since a 32-bit int can only represent 4.3ish billion different numbers and you have "10 billions".
If you were to use a tightly packed set, you could record whether each of the possible values is present in 512 MB, which can easily fit into current RAM sizes. As a start, this pretty easily allows you to recognise whether a number is duplicated or not.
Counting Duplicates
If you need to know how many times a number is duplicated, you're getting into having a hashmap that contains only duplicates (using the first 500 MB of RAM to tell efficiently IF it should be in the map or not). In a worst-case scenario with a large spread, you're not going to be able to fit that into RAM.
Another approach, if the numbers will have an even amount of duplicates, is to use a tightly packed array with 2-8 bits per value, taking about 1-4 GB of RAM and allowing you to count up to 255 occurrences of each number.
It's going to be a hack, but it's doable.
You need to implement some sort of looping construct to read the numbers one at a time since you can't have them in memory all at once.
How? Oh, what language are you using?
You have to read each number and store it into a hashmap, so that if a number occurs again, it will automatically get discarded.
If possible range of numbers in file is not too large then you can use some bit array to indicate if some of the number in range appeared.
If the range of the numbers is small enough, you can use a bit field to store if it is in there - initialize that with a single scan through the file. Takes one bit per possible number.
With large range (like int) you need to read through the file every time. File layout may allow for more efficient lookups (i.e. binary search in case of sorted array).
If time is not an issue and RAM is, you could read each number and then compare it to each subsequent number by reading from the file without storing it in RAM. It will take an incredible amount of time but you will not run out of memory.
I have to agree with kbrimington and his idea of a hash table, but first of all, I would like to know the range of the numbers that you're looking for. Basically, if you're looking for 32-bit numbers, you would need a single array of 4.294.967.296 bits. You start by setting all bits to 0 and every number in the file will set a specific bit. If the bit is already set then you've found a number that has occurred before. Do you also need to know how often they occur? Still, it would need 536.870.912 bytes at least. (512 MB.) It's a lot and would require some crafty programming skills. Depending on your programming language and personal experience, there would be hundreds of solutions to solve it this way.
Had to do this a long time ago.
What i did... i sorted the numbers as much as i could (had a time-constraint limit) and arranged them like this while sorting:
1 to 10, 12, 16, 20 to 50, 52 would become..
[1,10], 12, 16, [20,50], 52, ...
Since in my case I had hundreds of numbers that were very "close" ($a-$b=1), from a few million sets I had very low memory usage
p.s. another way to store them
1, -9, 12, 16, 20, -30, 52,
when i had no numbers lower than zero
After that i applied various algorithms (described by other posters) here on the reduced data set
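#include <stdio.h>
#include <stdlib.h>
/* Macro is overly general but I left it 'cos it's convenient */
#define BITOP(a,b,op) \
((a)[(size_t)(b)/(8*sizeof *(a))] op (size_t)1<<((size_t)(b)%(8*sizeof *(a))))
int main(void)
{
    unsigned x = 0;
    int found = 0;
    /* One bit per possible unsigned value (512 MB for 32-bit), zero-initialised */
    size_t *seen = calloc((size_t)1 << (8*sizeof(unsigned) - 3), 1);
    if (!seen) { fprintf(stderr, "out of memory\n"); return 1; }
    while (scanf("%u", &x) == 1) {
        if (BITOP(seen,x,&)) { found = 1; break; }  /* bit already set */
        BITOP(seen,x,|=);
    }
    if (found) printf("duplicate is %u\n", x);
    else printf("no duplicate\n");
    free(seen);
    return 0;
}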
This is a simple problem that can be solved very easily (several lines of code) and very fast (several minutes of execution) with the right tools
my personal approach would be in using MapReduce
MapReduce: Simplified Data Processing on Large Clusters
i'm sorry for not going into more details but once getting familiar with the concept of MapReduce it is going to be very clear on how to target the solution
basically we are going to implement two simple functions
Map(key, value)
Reduce(key, values[])
so all in all:
open file and iterate through the data
for each number -> Map(number, line_index)
in the reduce we will get the number as the key and the total occurrences as the number of values (including their positions in the file)
so in Reduce(key, values[]) if the number of values > 1 then it's a duplicate number
print the duplicates : number, line_index1, line_index2,...
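On a single machine, the Map and Reduce steps described above boil down to something like the following sketch. No real MapReduce framework is used here; map_phase and reduce_phase are illustrative names, and the grouping that a framework would do between the two phases is done with a plain dictionary.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit (number, line_index) for every line of the input."""
    for line_index, line in enumerate(lines):
        yield int(line), line_index


def reduce_phase(pairs):
    """Group line indices by number and keep the numbers seen more than once."""
    groups = defaultdict(list)
    for number, line_index in pairs:
        groups[number].append(line_index)
    return {n: idxs for n, idxs in groups.items() if len(idxs) > 1}


print(reduce_phase(map_phase(["4", "7", "4", "9", "7", "4"])))
# {4: [0, 2, 5], 7: [1, 4]}
```

A real framework would distribute both phases across machines, so that no single node has to hold all the groups at once.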
again this approach can result in a very fast execution depending on how your MapReduce framework is set up, and it is highly scalable and very reliable; there are many different implementations of MapReduce in many languages
there are several top companies presenting already built up cloud computing environments like Google, Microsoft azure, Amazon AWS, ...
or you can build your own and set a cluster with any providers offering virtual computing environments paying very low costs by the hour
good luck :)
Another, simpler approach could be to use Bloom filters
Implement a BitArray such that the ith byte of this array corresponds to the numbers 8*i to 8*i + 7, i.e. the first bit of the ith byte is 1 if we have already seen 8*i, the second bit of the ith byte is 1 if we have already seen 8*i + 1, and so on.
Initialize this bit array with size Integer.Max/8 and, whenever you see a number k, check the k%8 bit at index k/8: if this bit is already 1, you have seen this number already; otherwise set it to 1.