Comparing two String array in most efficient way - c

This problem is about searching a string in a master array (contains the list of all UIDs). The second array contains all the strings to be searched.
For example:
First array(Master List) contains: UID1 UID2 UID3... UID99
Second array contains: UID3 UID144 UID50
If a match is found in first array then 1 is returned otherwise 0 is return. So the output for the above example should be 101.
What could be the most efficient approach (targeting C) to solve the above keeping in mind that the traditional way dealing with this would be n^2!!!

sort the master string array and do binary search.

Efficient in terms of what?
I would go with #Trying's suggestion as a good compromise between decent running speed, low memory usage, and very (very!) low complexity of implementation.
Just use qsort() to sort the first master array in place, then use bsearch() to search it.
Assuming n elements in the master array and m in the second array, this should give O(m*log n) time complexity which seems decent.

Another option is to build a hash for the strings in the Master list, it's a single O(M) (assuming the lengths are O(1)), then assuming the hash is distributed evenly, searching a single element should take on average O(M/S), with S being the size the hash (the even distribution means that on average this is the amount of elements mapping into the same hash entry). You can further control the size to fine tune the trade off between space and efficiency

There are mainly two good approaches for this problem:
Use a binary search: a binary search requires the UIDs in the first array to be sorted and allows you to find a solution in O(log n) where n is the number of elements in the master array. The total complexity would be O(m log n) with m the number of elements to be searched.
Use a hashmap: You can store the elements of the master array in a hashmap (O(n)) and then check whether your elements of the second array are in the hashmap (O(m)). The total complexity would be O(n+m).
While the complexity of the second approach looks better, you must keep in mind that if your hash is bad, it could be O(m*n) in the worst case (but you would be very very unlikely). Also you would use more memory and the operations are also slower. In your case, I would use the first approach.

Related

Fast way to count smaller/equal/larger elements in array

I need to optimize my algorithm for counting larger/smaller/equal numbers in array(unsorted), than a given number.
I have to do this a lot of times and given array also can have thousands of elements.
Array doesn't change, number is changing
Example:
array: 1,2,3,4,5
n = 3
Number of <: 2
Number of >: 2
Number of ==:1
First thought:
Iterate through the array and check if element is > or < or == than n.
O(n*k)
Possible optimization:
O((n+k) * logn)
Firstly sort the array (im using c qsort), then use binary search to find equal number, and then somehow count smaller and larger values. But how to do that?
If elements exists (bsearch returns pointer to the element) I also need to check if array contain possible duplicates of this elements (so I need to check before and after this elements while they are equal to found element), and then use some pointer operations to count larger and smaller values.
How to get number of values larger/smaller having a pointer to equal element?
But what to do if I don't find the value (bsearch returns null)?
If the array is unsorted, and the numbers in it have no other useful properties, there is no way to beat an O(n) approach of walking the array once, and counting items in the three buckets.
Sorting the array followed by a binary search would be no better than O(n), assuming that you employ a sort algorithm that is linear in time (e.g. a radix sort). For comparison-based sorts, such as quicksort, the timing would increase to O(n*log2n).
On the other hand, sorting would help if you need to run multiple queries against the same set of numbers. The timing for k queries against n numbers would go from O(n*k) for k linear searches to O(n+k*log2n) assuming a linear-time sort, or O((n+k)*log2n) with comparison-based sort. Given a sufficiently large k, the average query time would go down.
Since the array is (apparently?) not changing, presort it. This allows a binary search (Log(n))
a.) implement your own version of bsearch (it will be less code anyhow)
you can do it inline using indices vs. pointers
you won't need function pointers to a specialized function
b.) Since you say that you want to count the number of matches, you imply that the array can contain multiple entries with the same value (otherwise you would have used a boolean has_n).
This means you'll need to do a linear search for the beginning and end of the array of "n"s.
From which you can calculate the number less than n and greater than n.
It appears that you have some unwritten algorithm for choosing these (for n=3 you look for count of values greater and less than 2 and equal to 1, so there is no way to give specific code)
c.) For further optimization (at the expense of memory) you can sort the data into a binary search tree of structs that holds not just the value, but also the count and the number of values before and after each value. It may not use more memory at all if you have a lot of repeat values, but it is hard to tell without the dataset.
That's as much as I can help without code that describes your hidden algorithms and data or at least a sufficient description (aside from recommending a course or courses in data structures and algorithms).

Find the one non-repeating element in array?

I have an array of n elements in which only one element is not repeated, else all the other numbers are repeated >1 times. And there is no limit on the range of the numbers in the array.
Some solutions are:
Making use of hash, but that would result in linear time complexity but very poor space complexity
Sorting the list using MergeSort O(nlogn) and then finding the element which doesn't repeat
Is there a better solution?
One general approach is to implement a bucketing technique (of which hashing is such a technique) to distribute the elements into different "buckets" using their identity (say index) and then find the bucket with the smallest size (1 in your case). This problem, I believe, is also known as the minority element problem. There will be as many buckets as there are unique elements in your set.
Doing this by hashing is problematic because of collisions and how your algorithm might handle that. Certain associative array approaches such as tries and extendable hashing don't seem to apply as they are better suited to strings.
One application of the above is to the Union-Find data structure. Your sets will be the buckets and you'll need to call MakeSet() and Find() for each element in your array for a cost of $O(\alpha(n))$ per call, where $\alpha(n)$ is the extremely slow-growing inverse Ackermann function. You can think of it as being effectively a constant.
You'll have to do Union when an element already exist. With some changes to keep track of the set with minimum cardinality, this solution should work. The time complexity of this solution is $O(n\alpha(n))$.
Your problem also appears to be loosely related to the Element Uniqueness problem.
Try a multi-pass scanning if you have strict space limitation.
Say the input has n elements and you can only hold m elements in your memory. If you use a hash-table approach, in the worst case you need to handle n/2 unique numbers so you want m>n/2. In case you don't have that big m, you can partition n elements to k=(max(input)-min(input))/(2m) groups, and go ahead scan the n input elements k times (in the worst case):
1st run: you only hash-get/put/mark/whatever elements with value < min(input)+m*2; because in the range (min(input), min(input)+m*2) there are at most m unique elements and you can handle that. If you are lucky you already find the unique one, otherwise continue.
2nd run: only operate on elements with value in range (min(input)+m*2, min(input)+m*4), and
so on, so forth
In this way, you compromise the time complexity to a O(kn), but you get a space complexity bound of O(m)
Two ideas come to my mind:
A smoothsort may be a better alternative than the cited mergesort for your needs given it's O(1) in memory usage, O(nlogn) in the worst case as the merge sort but O(n) in the best case;
Based on the (reverse) idea of the splay tree, you could make a type of tree that would
push the leafs toward the bottom once they are used (instead of upward as in the splay tree). This would still give you a O(nlogn) implantation of the sort, but the advantage would be the O(1) step of finding the unique element, it would be the root. The sorting algorithm is the sum of O(nlogn) + O(n) and this algorithm would be O(nlogn) + O(1)
Otherwise, as you stated, using a hash based solution (like hash-implemented set) would result in a O(n) algorithm (O(n) to insert and add a counting reference to it and O(n) to traverse your set to find the unique element) but you seemed to dislike the memory usage, though I don't know why. Memory is cheap, these days...

Sort an array which is partially sorted

I am trying to sort an array which has properties like
it increases upto some extent then it starts decreasing, then increases and then decreases and so on. Is there any algorithm which can sort this in less then nlog(n) complexity by making use of it being partially ordered?
array example = 14,19,34,56,36,22,20,7,45,56,50,32,31,45......... upto n
Thanks in advance
Any sequence of numbers will go up and down and up and down again etc unless they are already fully sorted (May start with a down, of course). You could run through the sequence noting the points where it changes direction, then then merge-sort the sequences (reverse reading the backward sequences)
In general the complexity is N log N because we don't know how sorted it is at this point. If it is moderately well sorted, i.e. there are fewer changes of direction, it will take fewer comparisons.
You could find the change / partition points, and perform a merge sort between pairs of partitions. This would take advantage of the existing ordering, as normally the merge sort starts with pairs of elements.
Edit Just trying to figure out the complexity here. Merge sort is n log(n), where the log(n) relates to the number of times you have to re-partition. First every pair of elements, then every pair of pairs, etc... until you reach the size of the array. In this case you have n elements with p partitions, where p < n, so I'm guessing the complexity is p log(p), but am open to correction. e.g. merge each pair of paritions, and repeat based on half the number of partitions after the merge.
See Topological sorting
If you know for a fact that the data are "almost sorted" and the set size is reasonably small (say an array that can be indexed by a 16-bit integer), then Shell is probably your best bet. Yes, it has a basic time complexity of O(n^2) (which can be reduced by the sequence used for gap sizing to a current best-worst-case of O(n*log^2(n))), but the performance improves with the sortedness of the input set to a best-case of O(n) on an already-sorted set. Using Sedgewick's sequence for gap size will give the best performance on those occasions when the input is not as sorted as you expected it to be.
Strand Sort might be close to what you're looking for. O(n sqrt(n)) in the average case, O(n) best case (list already sorted), O(n^2) worst case (list sorted in reverse order).
Share and enjoy.

How do I find common elements from n arrays

I am thinking of sorting and then doing binary search. Is that the best way?
I advocate for hashes in such cases: you'll have time proportional to common size of both arrays.
Since most major languages offer hashtable in their standard libraries, I hardly need to show your how to implement such solution.
Iterate through each one and use a hash table to store counts. The key is the value of the integer and the value is the count of appearances.
It depends. If one set is substantially smaller than the other, or for some other reason you expect the intersection to be quite sparse, then a binary search may be justified. Otherwise, it's probably easiest to step through both at once. If the current element in one is smaller than in the other, advance to the next item in that array. When/if you get to equal elements, you send that as output, and advance to the next item in both arrays. (This assumes, that as you advocated, you've already sorted both, of course).
This is an O(N+M) operation, where N is the size of one array, and M the size of the other. Using a binary search, you get O(N lg2 M) instead, which can be lower complexity if one array is lot smaller than the other, but is likely to be a net loss if they're close to the same size.
Depending on what you need/want, the versions that attempt to just count occurrences can cause a pretty substantial problem: if there are multiple occurrences of a single item in one array, they will still count that as two occurrences of that item, indicating an intersection that doesn't really exist. You can prevent this, but doing so renders the job somewhat less trivial -- you insert items from one array into your hash table, but always set the count to 1. When that's finished, you process the second array by setting the count to 2 if and only if the item is already present in the table.
Define "best".
If you want to do it fast, you can do it O(n) by iterating through each array and keeping a count for each unique element. Details of how to count the unique elements depend on the alphabet of things that can be in the array, eg, is it sparse or dense?
Note that this is O(n) in the number of arrays, but O(nm) for arrays of length m).
The best way is probably to hash all the values and keep a count of occurrences, culling all that have not occurred i times when you examine array i where i = {1, 2, ..., n}. Unfortunately, no deterministic algorithm can get you less than an O(n*m) running time, since it's impossible to do this without examining all the values in all the arrays if they're unsorted.
A faster algorithm would need to either have an acceptable level of probability (Monte Carlo), or rely on some known condition of the lists to examine only a subset of elements (i.e. you only care about elements that have occurred in all i-1 previous lists when considering the ith list, but in an unsorted list it's non-trivial to search for elements.

How can we find the i'th greatest element of the array?

Algorithm for Finding nth smallest/largest element in an array using data strucuture self balancing binary search tree..
Read the post: Find kth smallest element in a binary search tree in Optimum way. But the correct answer is not clear, as i am not able to figure out the correct answer, for an example that i took...... Please a bit more explanation required.......
C.A.R. Hoare's select algorithm is designed for precisely this purpose. It executes in [expected] linear time, with logarithmic extra storage.
Edit: the obvious alternative of sorting, then picking the right element has O(N log N) complexity instead of O(N). Storing the i largest elements in sorted order requires O(i) auxiliary storage, and roughly O(N * i log i) complexity. This can be a win if i is known a priori to be quite small (e.g. 1 or 2). For more general use, select is usually better.
Edit2: offhand, I don't have a good reference for it, but described the idea in a previous answer.
First sort the array descending, then take the ith element.
Create a sorted data structure to hold i elements and set the initial count to 0.
Process each element in the source array, adding it to that new structure until the new structure is full.
Then process the rest of the source array. For each one that is larger than the smallest in the sorted data structure, remove the smallest from that structure and put the new one in.
Once you've processed all elements in the source array, your structure will hold the i greatest elements. Just grab the last of these and you have your i'th greatest element.
Voila!
Alternatively, sort it then just grab the i'th element directly.
That's a fitting task for the heaps which feature very low insert and low delete_min costs. E.g. pairing heaps. It would have the worst case O(n*log(n)) performance. But since non-trivial to implement, better check first suggested elsewhere selection algorithms.
There are many strategies available for your task (if you don't focus on the self-balancing tree to begin with).
It's usually a tradeoff speed / memory. Most algorithms require either to modify the array in place or O(N) additional storage.
The solution with self-balancing tree is in the latter category, but it's not the right choice here. The issue is that building the tree itself takes O(N*log N), which will dominate the later search term and give a final complexity of O(N*log N). Therefore you're not better than simply sorting the array and use a complex datastructure...
In general, the issue largely depends on the magnitude of i related to N. If you think for a minute, for i == 1 it's trivial right ? It's called finding the maximum.
Well, the same strategy obviously work for i == 2 (carrying the 2 maximum elements around) in linear time. And it's also trivially symmetric: ie if you need to find the N-1 th element, then just carry around the 2 minimum elements.
However, it loses efficiency when i is about N/2 or N/4. Carrying the i maximum elements then mean sorting an array of size i... and thus we fallback on the N*log N wall.
Jerry Coffin pointed out a simple solution, which works well for this case. Here is the reference on Wikipedia. The full article also describes the Median of Median method: it's more reliable, but involves more work and is thus generally slower.
Create an empty list L
For each element x in the original list,
add x in sorted position to L
if L has more than i elements,
pop the smallest one off L
if List2 has i elements,
return the i-th element,
else
return failure
This should take O(N (log (i))). If i is assumd to be a constant, then it is O(N).
Build a heap from the elements and call MIN i times.

Resources