Sorting n sets of data into one - arrays

I have n arrays of data, each of these arrays is sorted by the same criteria.
The number of arrays will, in almost all cases, not exceed 10, so it is a relatively small number. Each array, however, can contain a large number of objects, which should be treated as unbounded for the algorithm I am looking for.
I now want to treat these arrays as if they were one array. However, I need a way to retrieve objects in a given range as fast as possible, without touching all objects before the range and/or all objects after the range. It is therefore not an option to iterate over all objects and store them in one single array. Fetches with low start values are also more likely than fetches with a high start value. So e.g. fetching objects [20,40) is much more likely than fetching objects [1000,1020), but the latter could happen.
The range itself will be pretty small, around 20 objects, but it can be increased if that helps performance, as long as this does not hit the limits of memory. So I would guess a couple of hundred objects would be fine as well.
Example:
3 arrays, each containing a couple of thousand entries. I now want to get the overall objects in the range [60, 80) without touching either the first 60 objects in each array or all the objects that come after object 80.
I am thinking about some sort of combined, modified binary search. My current idea is something like the following (note, that this is not fully thought through yet, it is just an idea):
get object 60 of each array - the beginning of the range cannot be after that, as each single array on its own would already meet the requirement
use these objects as the maximum value for the binary search in every array
from one of the arrays, get the middle object of that prefix (e.g. object 30)
with a binary search in each of the other arrays, find the object that comes before, but as close as possible to, the picked object.
we now have 3 objects, e.g. objects 15, 10 and 20. The sum of these positions would be 45, so there are 42 objects in front of them, which is more than the beginning of the range we are looking for (30). We continue our binary search in the remaining left half of one of the arrays.
if we instead get a value where the sum is smaller than the beginning of the range we are looking for, we continue our search on the right.
at some point we will hit object 30. From there on, we can simply add the objects from each array, one by one, with an insertion sort until we hit the range length.
My questions are:
Is there any name for this kind of algorithm I described here?
Are there other algorithms or ideas for this problem, that might be better suited for this issue?
Thanks in advance for any idea or help!

People usually call this problem something like "selection in the union of multiple sorted arrays". One of the questions in the sidebar is about the special case of two sorted arrays, and this question is about the general case. Several comparison-based approaches appear in the combined answers; they more or less have to determine where the lower endpoint in each individual array is. Your binary search answer is one of the better approaches; there's an asymptotically faster algorithm due to Frederickson and Johnson, but it's complicated and not obviously an improvement for small ranks.
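For concreteness, here is a minimal C++ sketch of one such comparison-based approach. It binary-searches on the element value rather than on array positions, assumes plain integer elements stored in std::vector, and is not lifted from any particular linked answer; all names and types are illustrative.
    #include <algorithm>
    #include <climits>
    #include <vector>

    // Count how many elements across all sorted arrays are <= x.
    long long countLessEqual(const std::vector<std::vector<int>>& arrays, int x) {
        long long count = 0;
        for (const auto& a : arrays)
            count += std::upper_bound(a.begin(), a.end(), x) - a.begin();
        return count;
    }

    // Value of the k-th smallest element (0-based rank k) in the union of the
    // sorted arrays, found by binary searching over the value range.
    int kthSmallest(const std::vector<std::vector<int>>& arrays, long long k) {
        int lo = INT_MAX, hi = INT_MIN;
        for (const auto& a : arrays) {          // overall min/max; assumes at
            if (a.empty()) continue;            // least one non-empty array
            lo = std::min(lo, a.front());
            hi = std::max(hi, a.back());
        }
        while (lo < hi) {                       // invariant: answer lies in [lo, hi]
            int mid = lo + (int)(((long long)hi - lo) / 2);
            if (countLessEqual(arrays, mid) <= k)
                lo = mid + 1;                   // too few elements <= mid
            else
                hi = mid;
        }
        return lo;                              // smallest value with > k elements <= it
    }
Once the value at the lower endpoint is known, std::lower_bound in each array gives the per-array start positions, and a small k-way merge from those positions produces the requested slice of ~20 objects.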

Related

Range Minimum Query for growing array

I have an array A[0..n] and I need to find the minimum value in the interval A[k₀..n]. Based on that, the array is extended with a value A[n+1] and I need the minimum in A[k₁..n+1]. Again the array is extended with some A[n+2] and queried for the min in A[k₂..n+2]. Is there a way to do each query in O(1) time (after some preprocessing)?
Compared with this earlier question: Range minimum queries when array is dynamic, a difference is that the queried intervals start at varying positions k₀, k₁, k₂, ... The end of the queried interval is always the rightmost end of the array. In my application I start with an empty array (n=0), so the preprocessing might be trivial. If this helps, in my application the new value used in the extension is always 1+(min returned by last query). But the positions k₀, k₁, k₂, ... depend on data outside of the array.
There is no way that I know of to make both the addition of a new element and the query happen in O(1), and it's probably impossible (though I'm not exactly sure how to prove this). But you can pretty easily make it happen in O(log(n)) using a segment tree. That's probably good enough for any practical application.
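A minimal C++ sketch of such a segment tree, assuming an upper bound on the final array size is known up front (if it isn't, rebuild into a structure of doubled capacity whenever it fills); the names are illustrative.
    #include <limits>
    #include <vector>

    // Iterative segment tree over a fixed capacity, supporting "append a value
    // at the next free position" and "min over A[l .. size-1]".
    struct MinSegmentTree {
        int cap;                     // fixed maximum number of elements
        int size = 0;                // number of elements appended so far
        std::vector<long long> tree; // leaves live at [cap, 2*cap)

        explicit MinSegmentTree(int capacity)
            : cap(capacity),
              tree(2 * capacity, std::numeric_limits<long long>::max()) {}

        // Append value at index `size`, O(log cap).
        void append(long long value) {
            int i = cap + size++;
            tree[i] = value;
            for (i /= 2; i >= 1; i /= 2)
                tree[i] = std::min(tree[2 * i], tree[2 * i + 1]);
        }

        // Minimum of A[l .. size-1], O(log cap).
        long long suffixMin(int l) const {
            long long res = std::numeric_limits<long long>::max();
            int lo = cap + l, hi = cap + size;   // half-open leaf range [lo, hi)
            while (lo < hi) {
                if (lo & 1) res = std::min(res, tree[lo++]);
                if (hi & 1) res = std::min(res, tree[--hi]);
                lo /= 2; hi /= 2;
            }
            return res;
        }
    };
Append costs O(log n) rather than O(1), which, as noted above, is probably good enough in practice.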

Algorithm - What is the best algorithm for detecting duplicate numbers in small array?

What is the best algorithm for detecting duplicate numbers in an array - the best in speed, memory and avoiding overhead?
A small array like [5,9,13,3,2,5,6,7,1]. Note that 5 is a duplicate.
After searching and reading about sorting algorithms, I realized that I will use one of these algorithms: Quick Sort, Insertion Sort or Merge Sort.
But actually I am really confused about what to use in my case which is a small array.
Thanks in advance.
To be honest, with that size of array, you may as well choose the O(n^2) solution (checking every element against every other element).
You'll generally only need to worry about performance if/when the array gets larger. For small data sets like this, you could well have found the duplicate with an 'inefficient' solution before the sort phase of an efficient solution has even finished :-)
In other words, you can use something like (pseudo-code):
    for idx1 = 0 to nums.len - 2 inclusive:
        for idx2 = idx1 + 1 to nums.len - 1 inclusive:
            if nums[idx1] == nums[idx2]:
                return nums[idx1]
    return no dups found
This finds the first value in the array which has a duplicate.
If you want an exhaustive list of duplicates, then just add the duplicate value to another (initially empty) array (once only per value) and keep going.
You can sort it using any half-decent algorithm though; for a data set of the size you're discussing, even a bubble sort would probably be adequate. Then you just process the sorted items sequentially, looking for runs of equal values, but that's probably overkill in your case.
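If you do go the sorting route, the whole thing fits in a few lines; a C++ sketch (the function name is made up):
    #include <algorithm>
    #include <cstddef>
    #include <optional>
    #include <vector>

    // Sort a copy of the input, then scan adjacent elements. Returns the first
    // duplicated value in sorted order, or nothing if all values are distinct.
    std::optional<int> findDuplicateBySorting(std::vector<int> nums) {  // by value: sorts a copy
        std::sort(nums.begin(), nums.end());
        for (std::size_t i = 1; i < nums.size(); ++i)
            if (nums[i] == nums[i - 1])
                return nums[i];
        return std::nullopt;
    }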
Two good approaches depend on whether or not you know the range from which the numbers are drawn.
Case 1: the range is known.
Suppose you know that all numbers are in the range [a, b[, thus the length of the range is l=b-a.
You can create an array A of length l and fill it with 0s, then iterate over the original array and, for each element e, increment the value of A[e-a] (here we are actually mapping the range onto [0,l[).
Once finished, you can iterate over A and find the duplicate numbers. In fact, if there exists i such that A[i] is greater than 1, it implies that i+a is a repeated number.
The same idea is behind counting sort, and it works fine for your problem as well.
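A short C++ sketch of this counting idea (the function name and the choice of returning a vector of duplicates are illustrative):
    #include <vector>

    // Counting approach when all values are known to lie in [a, b).
    // counts[e - a] maps the value e into [0, l[ as described above.
    std::vector<int> duplicatesInKnownRange(const std::vector<int>& nums, int a, int b) {
        std::vector<int> counts(b - a, 0);
        for (int e : nums)
            ++counts[e - a];
        std::vector<int> dups;
        for (int i = 0; i < (int)counts.size(); ++i)
            if (counts[i] > 1)
                dups.push_back(i + a);      // i + a is a repeated number
        return dups;
    }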
Case 2: the range is not known.
Quite simple. Slightly modify the approach mentioned above: instead of an array, use a map where the keys are the numbers from your original array and the values are the number of times you find them. At the end, iterate over the set of keys and report those that have been found more than once.
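The same sketch with the counting array swapped for a map (std::unordered_map here, purely as an illustration; the ordered std::map works too):
    #include <unordered_map>
    #include <vector>

    // Counting approach when the range of values is not known in advance:
    // the map plays the role of the counting array.
    std::vector<int> duplicatesUnknownRange(const std::vector<int>& nums) {
        std::unordered_map<int, int> counts;    // value -> number of occurrences
        for (int e : nums)
            ++counts[e];
        std::vector<int> dups;
        for (const auto& [value, count] : counts)
            if (count > 1)
                dups.push_back(value);
        return dups;
    }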
Note.
In both of the cases mentioned above, the complexity is O(N), and you cannot do better, since you have to visit all the stored values at least once.
Look at the first example: we iterate over two arrays whose lengths are N and l<=N, thus the complexity is at most 2*N, which is O(N).
The second example is indeed a bit more complex and dependent on the implementation of the map, but for the sake of simplicity we can safely assume that it is O(N).
In memory, you are constructing data structures the sizes of which are proportional to the number of different values contained in the original array.
As it usually happens, memory occupancy and performance drive your choice: the greater the former, the better the latter, and vice versa. As suggested in another response, if you know that the array is small, you can safely rely on an algorithm whose complexity is O(N^2) but which requires no extra memory at all.
Which is the best choice? Well, it depends on your problem, we cannot say.

how to write order preserving minimal perfect hash for integer keys?

I have searched Stack Overflow and Google and can't find exactly what I'm looking for, which is this:
I have a set of 4-byte unsigned integer keys, up to a million or so, that I need to use as an index into a table. The easiest would be to simply use the keys as an array index, but I don't want a 4 GB array when I'm only going to use a couple of million entries! The table entries and keys are sequential, so I need a hash function that preserves order.
e.g.
keys = {56, 69, 3493, 49956, 345678, 345679,....etc}
I want to translate the keys into {0, 1, 2, 3, 4, 5,....etc}
The keys could potentially be any integer, but there won't be more than 2 million in total. The number will vary as keys (and corresponding array entries) will be deleted, but new keys will always be higher numbered than the previous highest numbered key.
In the above example, if key 69 was deleted, then the hash integer returned on hashing 3493 should be 1 (rather than 2) as it then becomes the 2nd lowest number.
I hope I'm explaining this right. Is the above possible with any fast, efficient hashing solution? I need the translation to take on the order of low hundreds of nanoseconds, though I expect deletion to take longer. I looked at CMPH but couldn't find any usage examples that didn't involve getting the data from a file. It needs to run under Linux and be compiled with gcc using pure C.
Actually, I don't know if I understand what exactly you want to do.
It seems you are trying to obtain the index number in the "array" (or "list") of sequentially ordered integers that you have stored somewhere.
If you have stored these integer values in an array, then the algorithm that returns the index integer in optimal time is Binary Search.
Binary Search Algorithm
Since your list is known to be in order, binary search works in O(log(N)) time, which is very fast.
If you delete an element from the list of "keys", the Binary Search Algorithm works anyway, without extra effort or space (however, removing one element from the list naturally forces you to move all the elements to the right of the deleted element).
You only have to provide three things to the Binary Search Algorithm: the array, the size of the array, and the desired key, of course.
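The asker wants pure C under gcc; purely to illustrate the idea (and to match the other sketches in this thread), here is a C++ version of the lookup - the same logic maps directly onto a hand-written binary search loop or C's bsearch:
    #include <algorithm>
    #include <vector>

    // The "hash" of a key is simply its position in the sorted key array,
    // found with a binary search in O(log N). Returns -1 if the key is absent.
    long long rankOfKey(const std::vector<unsigned int>& sortedKeys, unsigned int key) {
        auto it = std::lower_bound(sortedKeys.begin(), sortedKeys.end(), key);
        if (it == sortedKeys.end() || *it != key)
            return -1;
        return it - sortedKeys.begin();
    }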
There is a full Python implementation here. See also the materials available here. If you only need to decode the dictionary, the simplest way to go is to modify the Python code to make it spit out a C file defining the necessary array, and reimplement only the lookup function.
It could be solved by using two dynamic allocated arrays: One for the "keys" and one for the data for the keys.
To get the data for a specific key, you first find it in the key array, and its index in the key array is the index into the data array.
When you remove a key-data pair, or want to insert a new item, you reallocate the arrays, and copy over the keys/data to the correct places.
I don't claim this to be the best or most effective solution, but it is one solution to your problem anyway.
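A rough C++ sketch of that parallel-array layout, mainly to show how deleting a key automatically renumbers everything after it (after deleting 69, key 3493 ends up at index 1 instead of 2, matching the example in the question); all names are made up:
    #include <algorithm>
    #include <vector>

    struct KeyedTable {
        std::vector<unsigned int> keys;   // kept sorted ascending
        std::vector<void*>        data;   // data[i] belongs to keys[i]

        // Index of a key, or -1 if not present; this is also the index into data.
        long long indexOf(unsigned int key) const {
            auto it = std::lower_bound(keys.begin(), keys.end(), key);
            return (it != keys.end() && *it == key) ? it - keys.begin() : -1;
        }

        // Remove a key/data pair; everything to the right shifts down by one.
        void erase(unsigned int key) {
            long long i = indexOf(key);
            if (i < 0) return;
            keys.erase(keys.begin() + i);
            data.erase(data.begin() + i);
        }
    };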
You don't need an order-preserving minimal perfect hash, because any old hash would do. You don't want to use a 4 GB array, but with 2M items, you wouldn't mind using 3M lookup entries.
A standard implementation of a hash map will do the job. It will allow you to delete and add entries and assign any value to entries as you add them.
This leaves you with the question "What hash function might I use on integers?" The usual answer is to take the remainder when dividing by a prime. The prime is chosen to be a bit larger than your expected data. For example, if you expect 2M items, then choose a prime around 3M.
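A minimal sketch of the "just use a stock hash map" suggestion, using std::unordered_map with a placeholder payload type; the standard container already handles the hashing internally, so you only need to think about the hash function if you roll your own table:
    #include <cstdint>
    #include <string>
    #include <unordered_map>

    struct Payload { std::string name; };    // stands in for the real table row

    int main() {
        std::unordered_map<std::uint32_t, Payload> table;
        table[3493] = {"row for key 3493"};  // insert or overwrite
        table.erase(69);                     // delete (a no-op here, 69 was never added)
        auto it = table.find(3493);          // O(1) average-case lookup
        return it != table.end() ? 0 : 1;
    }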

Data Structure to do lookup on large number

I have a requirement to do a lookup based on a large number. The number could fall in the range 1 - 2^32. Based on the input, I need to return some other data structure. My question is: what data structure should I use to hold this effectively?
I would have used an array, giving me O(1) lookup, if the numbers were in the range of, say, 1 to 5000. But when my input numbers get large, it becomes unrealistic to use an array, as the memory requirements would be huge.
I am hence trying to look at a data structure that yields the result fast and is not very heavy.
Any clues anybody?
EDIT:
It would not make sense to use an array since I may have only 100 or 200 indices to store.
Abhishek
unordered_map or map, depending on what version of C++ you are using.
http://www.cplusplus.com/reference/unordered_map/unordered_map/
http://www.cplusplus.com/reference/map/map/
A simple solution in C, given you've stated at most 200 elements, is just an array of structs with an index and a data pointer (or two arrays, one of indices and one of data pointers, where index[i] corresponds to data[i]). Linearly search the array looking for the index you want. With a small number of elements (200), that will be very fast.
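The answer describes C, but here is the same idea sketched in C++ to stay consistent with the other examples in this thread; the names are illustrative:
    #include <cstdint>
    #include <vector>

    // A small table of (index, data) entries searched linearly. With only
    // 100-200 entries the scan is trivially fast.
    struct Entry {
        std::uint32_t index;   // the large 1..2^32 key
        void*         data;    // pointer to whatever should be returned
    };

    void* lookup(const std::vector<Entry>& table, std::uint32_t index) {
        for (const Entry& e : table)
            if (e.index == index)
                return e.data;
        return nullptr;        // not found
    }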
One possibility is a Judy Array, which is a sparse associative array. There is a C Implementation available. I don't have any direct experience of these, although they look interesting and could be worth experimenting with if you have the time.
Another (probably more orthodox) choice is a hash table. Hash tables are data structures which map keys to values, and provide fast lookup and insertion times (provided a good hash function is chosen). One thing they do not provide, however, is ordered traversal.
There are many C implementations. A quick Google search turned up uthash which appears to be suitable, particularly because it allows you to use any value type as the key (many implementations assume a string as the key). In your case you want to use an integer as the key.

How to know if an array is sorted?

I already read this post but the answer didn't satisfy me: Check if Array is sorted in Log(N).
Imagine I have a seriously big array of over 1,000,000 double numbers (positive and/or negative) and I want to know if the array is "sorted", trying to avoid the maximum number of comparisons, because comparing doubles and floats takes too much time. Is it possible to use statistics on it? And if so:
Is this well regarded by real programmers?
Should I take samples?
How many samples should I take?
Should they be random, or in a sequence?
What error percentage is acceptable to say "the array is sorted"?
Thanks.
That depends on your requirements. If you can say that 100 random samples out of 1,000,000 are enough to assume it's sorted - then so be it. But to be absolutely sure, you will always have to go through every single entry. Only you can answer this question, since only you know how certain you need to be about it being sorted.
This is a classic probability problem taught in high school. Consider this question:
In a batch of 8,000 clocks, 7% are defective. A random sample of 10 (without replacement) from the 8,000 is selected and tested. If at least one is defective, the entire batch will be rejected.
What is the probability that the batch will be rejected?
So you can take a number of random samples from your large array and see if it's sorted, but you must note that you need to know the probability that the sample is out of order. Since you don't have that information, a probabilistic approach wouldn't work efficiently here.
(However, you can check 50% of the array and naively conclude that there is a 50% chance that it is sorted correctly.)
If you run a divide and conquer algorithm using multiprocessing (real parallelism, so only for multi-core CPUs) you can check whether an array is sorted or not in Log(N).
If you have GPU multiprocessing you can achieve Log(N) very easily, since modern graphics cards are able to run a few thousand processes in parallel.
Your question 5 is the question that you need to answer to determine the other answers. To ensure the array is perfectly sorted you must go through every element, because any one of them could be the one out of place.
The maximum number of comparisons to decide whether the array is sorted is N-1, because there are N-1 adjacent number pairs to compare. But for simplicity, we'll say N as it does not matter if we look at N or N+1 numbers.
Furthermore, it is unimportant where you start, so let's just start at the beginning.
Comparison #1 (A[0] vs. A[1]). If it fails, the array is unsorted. If it succeeds, good.
As we only compare, we can reduce this to the neighbors and whether the left one is smaller or equal (1) or not (0). So we can treat the array as a sequence of 0's and 1's, indicating whether two adjacent numbers are in order or not.
To calculate the error rate, or the probability, we will have to look at all combinations of our 0/1 sequence.
I would look at it like this: we have 2^n combinations of such a sequence (i.e. of the order of the pairs), of which only one is sorted (all elements are 1, indicating that each A[i] is less than or equal to A[i+1]).
Now this seems to be simple:
Initially the error is 1/2^n. After the first comparison, half of the possible combinations (all of them unsorted) get eliminated, so the error rate should be 1/2^n + 1/2^(n-1).
I'm not a mathematician, but it should be quite easy to calculate how many elements are needed to reach a given error rate (find x such that ERROR >= 1/2^n + 1/2^(n-1) + ... + 1/2^(n-x)).
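To make that precise under one strong simplifying assumption - that each of the n-1 adjacent pairs is independently in order with probability 1/2, which real data rarely satisfies - the probability that the whole array is sorted is 1/2^(n-1). After verifying that the first k pairs are in order, the probability that the remaining pairs are also in order is 1/2^(n-1-k), so the residual error is 1 - 1/2^(n-1-k), which only reaches zero at k = n-1, i.e. after checking every pair.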
Sorry for the confusing English, I come from Germany.
Since every single element can be the one element that is out-of-line, you have to run through all of them, hence your algorithm has runtime O(n).
If your understanding of "sorted" is less strict, you need to specify what exactly you mean by "sorted". Usually, "sorted" means that adjacent elements satisfy a less-than or less-than-or-equal condition.
Like everyone else says, the only way to be 100% sure that it is sorted is to run through every single element, which is O(N).
However, it seems to me that if you're so worried about it being sorted, then maybe having it sorted to begin with is more important than the array elements being stored in a contiguous portion in memory?
What I'm getting at is, you could use a map whose elements by definition follow a strict weak ordering. In other words, the elements in a map are always sorted. You could also use a set to achieve the same effect.
For example: std::map<int,double> collection; would allow you to almost use it like an array: collection[0]=3.0; std::cout<<collection[0]<<std::endl;. There are differences, of course, but if the sorting is so important then an array is the wrong choice for storing the data.
The old-fashioned way: print it out and see if it's in order. Really, if your sort is wrong you would probably see it soon. It's unlikely that you would see only a few misorders if you were sorting 100+ things. Whenever I deal with this, the whole thing is either completely off or it works.
As an example that you probably should not use but demonstrates sampling size:
A statistically valid sample size can give you a reasonable estimate of sortedness. If you want to be 95% certain everything is sorted, you can do that by creating a list of truly random points to sample, perhaps ~1500.
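A throwaway C++ sketch of that kind of spot check (sample size and names are arbitrary); it can only ever say "probably sorted", which is exactly the caveat raised next:
    #include <cstddef>
    #include <random>
    #include <vector>

    // Check `samples` random adjacent pairs. A single unsampled inversion
    // will be missed, so a `true` result is only a statistical hint.
    bool probablySorted(const std::vector<double>& a, std::size_t samples = 1500) {
        if (a.size() < 2) return true;
        std::mt19937_64 rng(std::random_device{}());
        std::uniform_int_distribution<std::size_t> pick(0, a.size() - 2);
        for (std::size_t s = 0; s < samples; ++s) {
            std::size_t i = pick(rng);
            if (a[i] > a[i + 1])
                return false;               // found a definite inversion
        }
        return true;                        // no inversion among sampled pairs
    }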
Essentially this is completely pointless if the list of values being out of order in one single place will break subsequent algorithms or data requirements.
If this is a problem, preprocess the list before your code runs, or use a really fast sort package in your code. Most sort packages also have a validation mode, where it simply tells you yes, the list meets your sort criteria - or not. Other suggestions like parallelization of your check with threads are great ideas.
