I had an interview today. I was asked how to search for a number inside an array, and I said binary search. He then asked how I would handle a big array with thousands of objects (for example, stocks), searching by, say, stock price. I said binary search again, and he pointed out that sorting an array of thousands of elements will take a lot of time before you can even apply binary search.
Can you please bear with me and teach me how to approach this problem?
Thanks, your help is appreciated.
I was asked a similar question. The twist was to search first in a sorted and then in an unsorted array. These were my answers, all unaccepted:
For the sorted array, I suggested we could find the center and do a linear search from there. Binary search will also work here.
For the unsorted array, I suggested linear search again.
Then I suggested binary search, which is kind of wrong.
I suggested storing the array in a HashSet and utilizing hashing (not accepted due to high space complexity).
I suggested a TreeSet, which is a red-black tree, quite good for lookups (not accepted due to high space complexity).
Copying into an ArrayList etc. was also considered overhead.
In the end I got negative feedback.
Though we may think one of the above is the solution, surely there is something special about linear search that I am missing.
Note that sorting before searching is also an overhead, especially if you are using any extra data structures in between.
Any comments are welcome.
I am not sure what he had in mind.
If you just want to find the number one time, and you have no guarantees about whether the array is sorted, then I don't think you can beat linear search. On average you will need to seek halfway through the array before you find the value, i.e. expected running time O(N); when sorting you have to touch every single value at least once and probably more than that, i.e. expected running time O(N log N).
But if you need to find multiple values, then the time spent sorting pays off quickly. With a sorted array you can binary search in O(log N) time, so by the third search at the latest you are ahead for having invested the time to sort.
You can do even better if you are allowed to build different data structures to help with the problem. You could build some sort of index, such as a hash table; but the champion data structure for this sort of problem probably would be some sort of tree structure. Then you can insert new values into the tree faster than you could append new values and re-sort the array, and the lookup is still going to be O(log N) to find any value. There are different sorts of trees available: binary tree, B-tree, trie, etc.
But as @Hot Licks said, a hash table is often used for this sort of thing, and it's pretty cheap to update: you just append a value to the main array and update the hash table to point to the new value. And a hash table is very close to O(1) time, which you can't beat. (A hash table is O(1) if there are no hash collisions; assuming a good hash algorithm and a big enough hash table, there will be almost none. More precisely, with chaining a lookup is O(1 + k), where k is the average number of collisions per bucket, i.e. the load factor. If I'm wrong about that I expect to be corrected very quickly; this is StackOverflow!)
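To make the trade-off concrete, here is a minimal C sketch using the standard library's qsort and bsearch; the function names are illustrative:

    #include <stdlib.h>

    static int cmp_int(const void *a, const void *b) {
        int x = *(const int *)a, y = *(const int *)b;
        return (x > y) - (x < y);   /* avoids overflow, unlike x - y */
    }

    /* One-off lookup: O(N) linear scan, no preprocessing needed. */
    static int find_linear(const int *a, size_t n, int key) {
        for (size_t i = 0; i < n; i++)
            if (a[i] == key) return (int)i;
        return -1;
    }

    /* Repeated lookups: after qsort(a, n, sizeof *a, cmp_int) has run
       once (O(N log N)), each lookup is O(log N). */
    static int find_sorted(const int *a, size_t n, int key) {
        const int *p = bsearch(&key, a, n, sizeof *a, cmp_int);
        return p ? (int)(p - a) : -1;
    }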
I think the interviewer wanted you to analyze which algorithm you would use for each possible initial state of the array. Of course, you must know that you can build a hash table and then find the number in O(1); or, when the array is sorted (the time spent sorting may be a concern), you can use binary search; or you can use some other data structure to finish the job.
Background
Most questions around sorting talk about sorting an existing unsorted array. Is constructing a new array in a sorted order an equivalent problem or a different one? Here's an example that will clear things up:
Example
I'm generating N random numbers and want to insert them into a new array as I generate them, and I want the final array to be sorted.
Possible Solutions
Insertion Sort
My gut told me that putting each element in the correct place as it's generated would be fastest. This is accomplished by doing a binary search to find the correct point in the array to insert the new element. However, this is an insertion sort, which is known to be less efficient on large lists than other sorting algorithms.
Quicksort
Quicksort is generally thought of as the most efficient 'general' sorting algorithm, where nothing is known about the inputs to the array, and it's more efficient than insertion sort on large lists. Would it, therefore, be best to simply put the random numbers in the array in an unsorted order, and then sort them at the end with quicksort?
Other Solutions
Is there another algorithm I haven't thought of?
Most questions around sorting talk about sorting an existing unsorted array. Is constructing a new array in a sorted order an equivalent problem or a different one?
It boils down to the same problem for random data, due to efficiency considerations.
Given random data, it's actually more efficient to first generate the random values into an array (unsorted), which is O(n), and then sort it with your favorite O(n log n) sort algorithm, making the entire operation O(n + n log n) = O(n log n) time complexity and, depending on the sort algorithm used, between O(1) and O(n) space complexity.
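As a concrete illustration of that approach in C (generate everything unsorted, then sort once at the end); the size and output are placeholders:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    static int cmp_int(const void *a, const void *b) {
        int x = *(const int *)a, y = *(const int *)b;
        return (x > y) - (x < y);
    }

    int main(void) {
        enum { N = 1000000 };             /* placeholder size */
        int *a = malloc(N * sizeof *a);
        if (!a) return 1;
        srand((unsigned)time(NULL));
        for (int i = 0; i < N; i++)       /* O(n): generate unsorted */
            a[i] = rand();
        qsort(a, N, sizeof *a, cmp_int);  /* O(n log n): sort once at the end */
        printf("min=%d max=%d\n", a[0], a[N - 1]);
        free(a);
        return 0;
    }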
There is no way to beat that approach by "keeping an array sorted as it's constructed" for random data, because any approach will require exactly n generations/insertions of the values and at least O(n log n) comparisons/swaps/shifts, no matter which method, of the numerous mentioned in comments on the original question, is used. Note, as per a very useful comment on my original answer, the binary insertion sort variant suggested in the original question will likely degrade to O(n^2) time complexity (binary search finds the position in O(log n), but shifting the tail to make room is O(n) per insertion), making it an inferior solution to just generating an array of values first and then sorting it.
Using a balanced tree merely matches the time complexity of generating an array and then sorting it, but loses in space complexity, as trees have some overhead compared to an array (child pointers, etc.).
Also of note, tree nodes are heap-allocated, and accessing any child node requires a pointer dereference, so even though the Big-O time complexity is equivalent to generating an array of data and then sorting it, the real performance of the tree solution will be worse: there is no data locality, and there is the extra cost of pointer dereferences. A further consideration with balanced trees is that insertion into something like an AVL tree is quite expensive; the n insertions behind an AVL's O(n log n) total do not cost the same as the n elements of an in-place array sort, due to the rotations needed to keep the tree balanced. Just because the Big-O is the same doesn't mean the performance is the same.
Even if you have an absolute need to grab the data in sorted order at some point during construction of the array, it might still be cheaper to just sort the array when you need it, unless you need it sorted after each insertion!
Note that this answer pertains to random data. It is possible, and even likely, that a more efficient approach to "keeping an array sorted as it's constructed" exists if both the size and the characteristics of the data are known and follow some mathematical pattern other than randomness; however, such an approach would necessarily be overfit to the specific data set it relates to, rather than a general solution.
I recommend heapsort or mergesort.
Heapsort is a comparison-based algorithm that uses a binary heap data structure to sort elements. It divides its input into a sorted and an unsorted region, and it iteratively shrinks the unsorted region by extracting the largest element and moving it to the sorted region.
Mergesort is a comparison-based algorithm that focuses on how to merge two pre-sorted arrays such that the resulting array is also sorted.
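If it helps, here is a compact heapsort sketch in C following the description above (build a max-heap, then repeatedly move the largest element to the end of the shrinking unsorted region); names are illustrative:

    #include <stddef.h>

    static void swap(int *a, int *b) { int t = *a; *a = *b; *b = t; }

    /* Restore the max-heap property for the subtree rooted at i. */
    static void sift_down(int *a, size_t n, size_t i) {
        for (;;) {
            size_t largest = i, l = 2 * i + 1, r = 2 * i + 2;
            if (l < n && a[l] > a[largest]) largest = l;
            if (r < n && a[r] > a[largest]) largest = r;
            if (largest == i) break;
            swap(&a[i], &a[largest]);
            i = largest;
        }
    }

    void heapsort_ints(int *a, size_t n) {
        for (size_t i = n / 2; i-- > 0; )   /* build the max-heap bottom-up */
            sift_down(a, n, i);
        for (size_t end = n; end > 1; ) {   /* extract max, shrink unsorted region */
            swap(&a[0], &a[--end]);
            sift_down(a, end, 0);
        }
    }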
If you want true O(n log n) behavior and the data sorted "as it is constructed", I would recommend using a proper tree-based data structure instead of an array, such as a self-balancing binary search tree (e.g. an AVL tree).
I've found answers to similar problems, but none of them exactly described my problem.
So, at the risk of being down-voted to hell, I was wondering whether there is a standard method to solve my problem. Further, there's a chance that I'm asking the wrong question; maybe the problem can be solved more efficiently another way.
So here's some background:
I'm looping through a list of particles. Each particle has a list of its neighboring particles. Now I need to create a list of unique particle pairs of mutual neighbours.
Each particle can be identified by an integer number.
Should I just build a list of all the pairs, including duplicates, and use some kind of sort and comparator to eliminate duplicates, or should I try to avoid adding duplicates to my list in the first place?
Performance is really important to me. I guess most of the loops may be vectorized and threaded. On average each particle has around 15 neighbours and I expect, that there will be 1e6 particles at most.
I do have some ideas, but I'm not an experienced coder, and I don't want to waste a week testing every single method by benchmarking different situations just to find out that there's already a standard method for my problem.
Any suggestions?
BTW: I'm using C.
Some pseudo-code
for i in nparticles
    particle = particles[i];  // just an array containing the "index" of each particle
    // each particle has a neighbor list
    for k in neighlist[i]     // looping through all the neighbors
        // k is the index of the neighbor of particle "i"
        if the pair (i,k) or (k,i) is not already in the pair list, add it; otherwise don't
Sorting the elements on each iteration is not a good idea, since a comparison sort is O(n log n).
The next best thing would be to store the items in a search tree, ideally a self-balancing binary search tree; you can find implementations on GitHub.
An even better solution would give an access time of O(1). You can achieve this in two different ways. One is a simple identity array, where each slot holds, say, a pointer to the item if there is one with that ID, or a flag marking the ID as empty. This is very fast but wasteful: you'll need O(N) memory.
The best solution, in my opinion, would be to use a set or a hash map. These are basically the same, because a set can be implemented using a hash map.
Here is a GitHub project with a C hash-map implementation.
And a Stack Overflow answer to a similar question.
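To make the set idea concrete, here is a minimal sketch in C: each pair is canonicalised as (min, max), packed into a 64-bit key, and kept in an open-addressing table. The table size, hash constant, and names are illustrative assumptions, not a vetted implementation:

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    #define TABLE_SIZE (1u << 25)   /* power of two, well above the expected
                                       number of unique pairs; an assumption */
    #define EMPTY UINT64_MAX        /* safe sentinel while indices stay below 2^32-1 */

    /* Canonicalise (i,k) as (min,max) in one 64-bit key, so (i,k) and (k,i)
       map to the same key. */
    static uint64_t pair_key(uint32_t i, uint32_t k) {
        uint32_t lo = i < k ? i : k, hi = i < k ? k : i;
        return ((uint64_t)lo << 32) | hi;
    }

    /* Open-addressing insert; returns 1 if the pair is new, 0 if already seen. */
    static int pair_set_add(uint64_t *table, uint32_t i, uint32_t k) {
        uint64_t key = pair_key(i, k);
        size_t idx = (size_t)(key * 0x9E3779B97F4A7C15ull) & (TABLE_SIZE - 1);
        while (table[idx] != EMPTY) {
            if (table[idx] == key) return 0;     /* duplicate pair, skip it */
            idx = (idx + 1) & (TABLE_SIZE - 1);  /* linear probing */
        }
        table[idx] = key;
        return 1;
    }

The table would be allocated once with malloc(TABLE_SIZE * sizeof(uint64_t)) and memset to 0xFF so every slot starts as EMPTY; inside the double loop of your pseudo-code, you would append (i,k) to the pair list only when pair_set_add returns 1.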
Goal: check whether there's a duplicate number in an array of size n.
Basically, if we may use a hash table (open hashing, with linked lists), then we can iterate over the array and insert the numbers into the table with some value (it could be 1; it doesn't really matter).
While iterating, if the cell already holds that number, then we have found a duplicate.
Since we know that the expected time for read/write is O(1) then the expected time for the algorithm is O(n).
Question #1: Why is the worst case O(n log n)?
Question #2: Would you do it differently than the suggested solution?
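For concreteness, a minimal C sketch of the algorithm as described (open hashing with linked-list buckets); NBUCKETS and the hash are illustrative assumptions, and error handling is omitted:

    #include <stdlib.h>

    #define NBUCKETS 262144   /* illustrative; pick relative to n */

    struct node { int value; struct node *next; };

    /* Returns 1 as soon as a duplicate is seen, 0 if all n values are distinct. */
    static int has_duplicate(const int *a, size_t n) {
        struct node **buckets = calloc(NBUCKETS, sizeof *buckets);
        int found = 0;
        for (size_t i = 0; i < n && !found; i++) {
            size_t b = (size_t)(unsigned)a[i] % NBUCKETS;
            for (struct node *p = buckets[b]; p; p = p->next)
                if (p->value == a[i]) { found = 1; break; }  /* cell already holds it */
            if (!found) {
                struct node *nn = malloc(sizeof *nn);        /* expected O(1) insert */
                nn->value = a[i];
                nn->next = buckets[b];
                buckets[b] = nn;
            }
        }
        for (size_t b = 0; b < NBUCKETS; b++)                /* free the chains */
            for (struct node *p = buckets[b]; p; ) {
                struct node *next = p->next;
                free(p);
                p = next;
            }
        free(buckets);
        return found;
    }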
Here I assume the author referred to a variant of hash table where each "bin" holds a BST (or some other deterministic data structure), so that in the worst case, where all elements land in the same bin repeatedly, the total work is O(n log n).
However, hash tables are seldom implemented this way, because this worst case is very unlikely; usually each bucket is a plain linked list, in which case the worst case of this solution is O(n^2).
The other way to approach this problem is to sort, then iterate to find duplicates (easy in a sorted array); this is O(n log n) with significantly less memory usage.
This problem is known as the element distinctness problem, and these two options (with some variants, maybe) are the ways to solve it.
It is known to be Omega(n log n) without extra memory and hashing.
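A minimal sketch of the sort-then-scan alternative in C; note that it reorders the array in place:

    #include <stdlib.h>

    static int cmp_int(const void *a, const void *b) {
        int x = *(const int *)a, y = *(const int *)b;
        return (x > y) - (x < y);
    }

    /* O(n log n) time, O(1) extra space. */
    static int has_duplicate_sorted(int *a, size_t n) {
        qsort(a, n, sizeof *a, cmp_int);
        for (size_t i = 1; i < n; i++)
            if (a[i] == a[i - 1]) return 1;   /* equal values end up adjacent */
        return 0;
    }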
I have a list of n strings (names of people) that I want to store in a hash table or similar structure. I know the exact value of n, so I want to use that fact to get O(1) lookups, which would be rendered impossible if I had to use a linked list to store my hash nodes. My first reaction was to use the djb hash, which essentially does this:
unsigned long h = 5381;  /* djb2's published starting value */
for ( i = 0; i < len; i++ )
    h = 33 * h + p[i];
To compress the resulting h into the range [0, n), I would simply do h % n, but I suspect that this will lead to a much higher probability of collisions, in a way that would essentially render my hash useless.
My question, then, is how I can hash either the string or the resulting hash so that the n elements are distributed relatively uniformly over [0, n).
It's not enough to know n. Allocation of an item to a bucket is a function of the item itself, so if you want a perfect hash function (one item per bucket), you need to know the data.
In any case, if you're limiting the number of elements to a known n, you're already technically O(1) lookup. The upper bound will be based on the constant n. This would be true even for a non-hash solution.
Your best bet is probably to just use the hash function you have and make each bucket a linked list of the colliding items. Even if the hash is less than perfect, you're still greatly reducing the lookup time.
Only if the hash is totally imperfect (all n elements placed in one bucket) will it be as bad as a normal linked list.
If you don't know the data in advance, a perfect hash is not possible. Unless, of course, you use h itself as the hash key rather than h%n but that's going to take an awful lot of storage :-)
My advice is to go the good-enough hash with linked list route. I don't doubt that you could make a better hash function based on the relative frequencies of letters in people's names across the population but even the hash you have (which is ideal for all letters having the same frequency) should be adequate.
And, anyway, if you start relying on frequencies and you get an influx of people from those countries that don't seem to use vowels (a la Bosnia*), you'll end up with more collisions.
But keep in mind that it really depends on the n that you're using.
If n is small enough, you could even get away with a sequential search of an unsorted array. I'm assuming your n is large enough here that you've already established that a sequential search (or even a balanced binary tree) won't give you enough performance.
A case in point: we have some code which searches through problem dockets looking for the names of people who left comments (so we can establish the last member of our team who responded). There are only ever about ten or so members on our team, so we just use a sequential search for them; the performance improvement from using a faster data structure was deemed too much trouble.
* No offence intended. I just remember a humorous article from long ago about Clinton authorising the airlifting of vowels to Bosnia. I'm sure there are other countries with a similar "problem".
What you're after is called a Perfect Hash. It's a hash function where all the keys are known ahead of time, designed so that there are no collisions.
The gperf program generates C code for perfect hashes.
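For a rough idea of what that looks like (exact flags and generated names can vary by gperf version, so treat this as a sketch), given an input file names.gperf listing the known keys:

    %%
    alice
    bob
    carol
    %%

running something like `gperf names.gperf > names_hash.c` emits C code containing a lookup function (in_word_set by default) that returns non-NULL exactly for the listed names, with no collisions among them.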
It sounds like you're looking for an implementation of a perfect hash function, or perhaps even a minimal perfect hash function. According to the Wikipedia page, CMPH might fit your needs. Disclaimer: I've never used it.
The optimal algorithm for mapping n strings to the integers 1..n is to build a DFA whose terminating states are the integers 1..n. (I'm sure someone here will step up with a fancy name for this... but in the end it's all a DFA.) The size/speed tradeoff can be adjusted by varying your alphabet size (operating on bytes, half-bytes, or even bits).
I'm looking for a data structure (or structures) that would let me keep an ordered list of integers, with no duplicates, and with indexes and values in the same range.
I need four main operations to be efficient, in rough order of importance:
taking the value from a given index
finding the index of a given value
inserting a value at a given index
deleting a value at a given index
Using an array, I get 1 at O(1), but 2 is O(N), and insertions and deletions are expensive (O(N) as well, I believe).
A linked list has O(1) insertion and deletion (once you have the node), but 1 and 2 are O(N), which negates the gains.
I tried keeping two arrays, a[index]=value and b[value]=index, which makes 1 and 2 O(1) but makes 3 and 4 even more costly.
Is there a data structure better suited for this?
I would use a red-black tree to map keys to values. This gives you O(log(n)) for 1, 3, 4. It also maintains the keys in sorted order.
For 2, I would use a hash table to map values to keys, which gives you O(1) performance. It also adds O(1) overhead for keeping the hash table updated when adding and deleting keys in the red-black tree.
How about using a sorted array with binary search?
Insertion and deletion are slow, but given that the data are plain integers, they can be optimized with calls to memmove() if you are using C or C++ (memcpy() is not safe here, since the source and destination overlap when shifting within one array). If you know the maximum size of the array, you can even avoid any memory allocations during its usage, as you can preallocate it to the maximum size.
The "best" approach depends on how many items you need to store and how often you will need to insert/delete compared to finding. If you rarely insert or delete a sorted array with O(1) access to the values is certainly better, but if you insert and delete things frequently a binary tree can be better than the array. For a small enough n the array most likely beats the tree in any case.
If storage size is of concern, the array is better than the trees, too. Trees also need to allocate memory for every item they store and the overhead of the memory allocation can be significant as you only store small values (integers).
You may want to profile what is faster, the copying of the integers if you insert/delete from the sorted array or the tree with it's memory (de)allocations.
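A minimal sketch of the memmove-based insertion mentioned above, assuming the array has spare capacity (growth policy and error handling omitted):

    #include <string.h>

    /* Insert v into the sorted array a of length n (capacity must exceed n),
       keeping it sorted; returns the index where v was placed. */
    static size_t sorted_insert(int *a, size_t n, int v) {
        size_t lo = 0, hi = n;
        while (lo < hi) {                  /* binary search for the insert point */
            size_t mid = lo + (hi - lo) / 2;
            if (a[mid] < v) lo = mid + 1; else hi = mid;
        }
        memmove(&a[lo + 1], &a[lo], (n - lo) * sizeof *a);  /* shift the tail */
        a[lo] = v;
        return lo;
    }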
I don't know what language you're using, but if it's Java you can leverage LinkedHashMap or a similar Collection. It's got all of the benefits of a List and a Map, provides constant time for most operations, and has the memory footprint of an elephant. :)
If you're not using Java, the idea of a LinkedHashMap is probably still suitable for a usable data structure for your problem.
Use a vector for the array access.
Use a map as a search index into the vector's subscripts.
Given a subscript, fetch the value from the vector: O(1).
Given a key, use the map to find the subscript of the value: O(log N).
Insert a value: push back on the vector, O(1) amortized, and insert the subscript into the map, O(log N).
Delete a value: delete from the map, O(log N).
How about a TreeMap? It gives log(n) for the operations described.
I like balanced binary trees a lot. They are sometimes slower than hash tables or other structures, but they are much more predictable; they are generally O(log n) for all operations. I would suggest using a Red-black tree or an AVL tree.
How do you achieve 2 with red-black trees? We can make them count their children on every insert/delete operation. This doesn't make those operations significantly slower. Then walking down the tree to find the i-th element is possible in log n time. But I see no implementation of this method in Java or the STL.
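A minimal C sketch of that counting idea; the node layout is an assumption, and keeping size correct through inserts, deletes, and rotations is left out:

    #include <stddef.h>

    struct osnode {
        int value;
        size_t size;                    /* nodes in this subtree, self included */
        struct osnode *left, *right;
    };

    static size_t sz(const struct osnode *n) { return n ? n->size : 0; }

    /* Return the i-th smallest value (0-based); caller ensures i < sz(root). */
    static int select_ith(const struct osnode *root, size_t i) {
        for (;;) {
            size_t left = sz(root->left);
            if (i < left)
                root = root->left;       /* i-th element is in the left subtree */
            else if (i == left)
                return root->value;      /* exactly `left` smaller elements */
            else {
                i -= left + 1;           /* skip left subtree and this node */
                root = root->right;
            }
        }
    }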
If you're working in .NET, then according to the MS docs http://msdn.microsoft.com/en-us/library/f7fta44c.aspx
SortedDictionary and SortedList both have O(log n) for retrieval
SortedDictionary has O(log n) for insert and delete operations, whereas SortedList has O(n).
The two differ by memory usage and speed of insertion/removal. SortedList uses less memory than SortedDictionary. If the SortedList is populated all at once from sorted data, it's faster than SortedDictionary. So it depends on the situation as to which is really the best for you.
Also, your argument for the linked list is not really valid: it might be O(1) for the insertion itself, but you have to traverse the list to find the insertion point, so the operation is really not O(1).